LinuxLists.cc - [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <[email protected]> wrote:
> (cc'ing the people from the page allocator failure thread as this might be
> relevant to some of their problems)
>
> I know this is very last minute but I believe we should consider disabling
> the "low_latency" tunable for block devices by default for 2.6.32. There was
> evidence that low_latency was a problem last week for page allocation failure
> reports but the reproduction-case was unusual and involved high-order atomic
> allocations in low-memory conditions. It took another few days to accurately
> show the problem for more normal workloads and it's a bit more wide-spread
> than just allocation failures.
>
> Basically, low_latency looks great as long as you have plenty of memory
> but in low memory situations, it appears to cause problems that manifest
> as reduced performance, desktop stalls and in some cases, page allocation
> failures. I think most kernel developers are not seeing the problem as they
> tend to test on beefier machines and without hitting swap or low-memory
> situations for the most part. When they are hitting low-memory situations,
> it tends to be for stress tests where stalls and low performance are expected.

The low latency tunable controls various policies inside cfq.
The one that could affect memory reclaim is:
	/*
	 * Async queues must wait a bit before being allowed dispatch.
	 * We also ramp up the dispatch depth gradually for async IO,
	 * based on the last sync IO we serviced
	 */
	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
		unsigned int depth;

		depth = last_sync / cfqd->cfq_slice[1];
		if (!depth && !cfqq->dispatched)
			depth = 1;
		if (depth < max_dispatch)
			max_dispatch = depth;
	}

Here the async queue's maximum depth is limited to 1 for up to 200 ms after
a sync I/O completes.
Note: dirty page writeback goes through an async queue, so it is
penalized by this.

This can affect both low-end and high-end hardware. My non-NCQ SATA disk
can handle a depth of 2 when writing. NCQ SATA disks can handle a
depth of up to 31, so limiting the depth to 1 can cause a drop in write
performance, which in turn slows down dirty page reclaim and can cause
allocation failures.

It would be good to re-test the OOM conditions with that code commented out.
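
A minimal userspace sketch of the check quoted above, to make the arithmetic
concrete. It is not the kernel code: async_dispatch_cap and its parameters are
hypothetical names, times are in milliseconds rather than jiffies, and the
100 ms sync slice used in the note below is an assumption (it is the value
that makes the "up to 200 ms" figure work out).

```c
#include <assert.h>

/*
 * Userspace model of the async ramp-up check.
 * elapsed:      time since the last sync request completed
 * slice_sync:   length of a sync slice, in the same unit as elapsed
 * dispatched:   requests this async queue already has in flight
 * max_dispatch: dispatch limit computed before the ramp-up is applied
 * Returns the dispatch limit after the ramp-up is applied.
 */
unsigned int async_dispatch_cap(unsigned long elapsed, unsigned long slice_sync,
                                unsigned int dispatched, unsigned int max_dispatch)
{
    unsigned int depth;

    assert(slice_sync > 0);

    /* One extra request is allowed per full sync slice elapsed. */
    depth = (unsigned int)(elapsed / slice_sync);

    /* Always let an otherwise idle async queue issue one request. */
    if (!depth && !dispatched)
        depth = 1;

    /* The ramp-up can only lower the limit, never raise it. */
    if (depth < max_dispatch)
        max_dispatch = depth;

    return max_dispatch;
}
```

Assuming a 100 ms sync slice and an incoming limit of 4: 50 ms after the last
sync completion the cap is 1 if the queue has nothing in flight and 0
otherwise; at 150 ms it is 1; at 250 ms it is 2; the full limit of 4 is only
restored from 400 ms onwards.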

>
> To show the problem, I used an x86-64 machine booted with 512MB of
> memory. This is a small amount of RAM but the bug reports related to page
> allocation failures were on smallish machines and the disks in the system
> are not very high-performance.
>
> I used three tests. The first was sysbench on postgres running an IO-heavy
> test against a large database with 10,000,000 rows. The second was IOZone
> running most of the automatic tests with a record length of 4KB and the
> last was a simulated launching of gitk with a music player running in the
> background to act as a desktop-like scenario. The final test was similar
> to the test described here http://lwn.net/Articles/362184/ except that
> dm-crypt was not used as it has its own problems.

low_latency was tested on other scenarios:
http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
where it improved actual and perceived performance, so disabling it
completely may not be good.

Thanks,
Corrado

2009-11-26 13:56:14

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Thu, Nov 26, 2009 at 02:37:31PM +0100, Mike Galbraith wrote:
> On Thu, 2009-11-26 at 14:20 +0100, Bartlomiej Zolnierkiewicz wrote:
> > On Thursday 26 November 2009 02:08:57 pm Mike Galbraith wrote:
> > > On Thu, 2009-11-26 at 12:19 +0000, Mel Gorman wrote:
> > > > (cc'ing the people from the page allocator failure thread as this might be
> > > > relevant to some of their problems)
> > > >
> > > > I know this is very last minute but I believe we should consider disabling
> > > > the "low_latency" tunable for block devices by default for 2.6.32. There was
> > > > evidence that low_latency was a problem last week for page allocation failure
> > > > reports but the reproduction-case was unusual and involved high-order atomic
> > > > allocations in low-memory conditions. It took another few days to accurately
> > > > show the problem for more normal workloads and it's a bit more wide-spread
> > > > than just allocation failures.
> > > >
> > > > Basically, low_latency looks great as long as you have plenty of memory
> > > > but in low memory situations, it appears to cause problems that manifest
> > > > as reduced performance, desktop stalls and in some cases, page allocation
> > > > failures. I think most kernel developers are not seeing the problem as they
> > > > tend to test on beefier machines and without hitting swap or low-memory
> > > > situations for the most part. When they are hitting low-memory situations,
> > > > it tends to be for stress tests where stalls and low performance are expected.
> > >
> > > Ouch. It was bad desktop stalls under heavy write that kicked the whole
> > > thing off.
> >
> > The problem is that 'desktop' means different things for different people
> > (for some kernel developers 'desktop' is more like 'a workstation' and for
> > others it is more like 'an embedded device').

Will concede that - the term "desktop" is fuzzy at best. The
characteristics of note are a mid-range machine running workloads that
are not steady, have abrupt phase changes and are not very well sized to
the available memory. "Desktops" fall into this category but it's also
possible that badly-or-borderline-provisioned servers would also fall
into it.

>
> The stalls I'm talking about were reported for garden variety desktop
> PC.

The stalls I'm seeing on the laptop are tiny but there. It's perfectly
possible a whole host of stalls for people have been resolved but there
is one corner case.

> I reproduced them on my supermarket special Q6600 desktop PC. That
> problem has been with us roughly forever, but I'd hoped it had been
> cured. Guess not.
>

It's possible the corner case causing stalls is specific to low-memory rather
than writes. Conceivably, what is going wrong is that writes need to complete
for pages to be clean so pages can be reclaimed. The cleaning of pages is
getting pre-empted by sync IO until such point as pages cannot be reclaimed
and they stall allowing writes to complete. I'll prototype something to
disable low_latency if kswapd is awake. If it makes a difference, this
might be plausible.

As Jens would say though, this is "mostly hand-wavy nonsense".

> As an idle speculation, I wonder if the sync vs async slice ratios may
> not have been knocked out of kilter a bit by giving more to sync.
>

I don't know enough to speculate.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-26 14:17:37

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Thu, Nov 26, 2009 at 02:47:10PM +0100, Corrado Zoccolo wrote:
> On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <[email protected]> wrote:
> > (cc'ing the people from the page allocator failure thread as this might be
> > relevant to some of their problems)
> >
> > I know this is very last minute but I believe we should consider disabling
> > the "low_latency" tunable for block devices by default for 2.6.32. There was
> > evidence that low_latency was a problem last week for page allocation failure
> > reports but the reproduction-case was unusual and involved high-order atomic
> > allocations in low-memory conditions. It took another few days to accurately
> > show the problem for more normal workloads and it's a bit more wide-spread
> > than just allocation failures.
> >
> > Basically, low_latency looks great as long as you have plenty of memory
> > but in low memory situations, it appears to cause problems that manifest
> > as reduced performance, desktop stalls and in some cases, page allocation
> > failures. I think most kernel developers are not seeing the problem as they
> > tend to test on beefier machines and without hitting swap or low-memory
> > situations for the most part. When they are hitting low-memory situations,
> > it tends to be for stress tests where stalls and low performance are expected.
>
> The low latency tunable controls various policies inside cfq.
> The one that could affect memory reclaim is:
> /*
> * Async queues must wait a bit before being allowed dispatch.
> * We also ramp up the dispatch depth gradually for async IO,
> * based on the last sync IO we serviced
> */
> if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
> unsigned int depth;
>
> depth = last_sync / cfqd->cfq_slice[1];
> if (!depth && !cfqq->dispatched)
> depth = 1;
> if (depth < max_dispatch)
> max_dispatch = depth;
> }
>
> here the async queues max depth is limited to 1 for up to 200 ms after
> a sync I/O is completed.
> Note: dirty page writeback goes through an async queue, so it is
> penalized by this.
>
> This can affect both low and high end hardware. My non-NCQ sata disk
> can handle a depth of 2 when writing. NCQ sata disks can handle a
> depth up to 31, so limiting depth to 1 can cause write performance
> drop, and this in turn will slow down dirty page reclaim, and cause
> allocation failures.
>
> It would be good to re-test the OOM conditions with that code commented out.
>

All of it or just the cfq_latency part?

As it turns out, the disk in the test machine does report NCQ (depth 31/32)
and it's the same on the laptop, so slowing down dirty page cleaning
could be impacting reclaim.

> >
> > To show the problem, I used an x86-64 machine booted with 512MB of
> > memory. This is a small amount of RAM but the bug reports related to page
> > allocation failures were on smallish machines and the disks in the system
> > are not very high-performance.
> >
> > I used three tests. The first was sysbench on postgres running an IO-heavy
> > test against a large database with 10,000,000 rows. The second was IOZone
> > running most of the automatic tests with a record length of 4KB and the
> > last was a simulated launching of gitk with a music player running in the
> > background to act as a desktop-like scenario. The final test was similar
> > to the test described here http://lwn.net/Articles/362184/ except that
> > dm-crypt was not used as it has its own problems.
>
> low_latency was tested on other scenarios:
> http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> where it improved actual and perceived performance, so disabling it
> completely may not be good.
>

It may not indeed.

In case you mean a partial disabling of cfq_latency, I'm trying the
following patch. The intention is to disable the low_latency logic if
kswapd is at work and presumably needs clean pages. Alternative
suggestions welcome.

======
cfq: Do not limit the async queue depth while kswapd is awake

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..dcab74e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,7 +1308,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * We also ramp up the dispatch depth gradually for async IO,
 	 * based on the last sync IO we serviced
 	 */
-	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency && !kswapd_awake()) {
 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
 		unsigned int depth;

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6f75617..b593aff 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -655,6 +655,7 @@ typedef struct pglist_data {
 void get_zone_counts(unsigned long *active, unsigned long *inactive,
 			unsigned long *free);
 void build_all_zonelists(void);
+int kswapd_awake(void);
 void wakeup_kswapd(struct zone *zone, int order);
 int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
 		int classzone_idx, int alloc_flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 777af57..75cdd9a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2201,6 +2201,15 @@ static int kswapd(void *p)
 	return 0;
 }

+int kswapd_awake(void)
+{
+	pg_data_t *pgdat;
+	for_each_online_pgdat(pgdat)
+		if (!waitqueue_active(&pgdat->kswapd_wait))
+			return 1;
+	return 0;
+}
+
 /*
  * A zone is low on free memory, so wake its kswapd task to service it.
  */
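
The predicate added by this patch can be modelled in userspace as below. This
is only a sketch under stated assumptions: struct node_model and
kswapd_awake_model are hypothetical names, and the kswapd_sleeping flag stands
in for waitqueue_active(&pgdat->kswapd_wait), on the assumption that kswapd is
the only task that ever waits on that queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for pg_data_t: one flag per online memory node. */
struct node_model {
    bool kswapd_sleeping; /* models waitqueue_active(&pgdat->kswapd_wait) */
};

/* Returns 1 as soon as any node's kswapd is not sleeping, 0 otherwise. */
int kswapd_awake_model(const struct node_model *nodes, size_t nr_nodes)
{
    size_t i;

    assert(nodes != NULL || nr_nodes == 0);

    for (i = 0; i < nr_nodes; i++)
        if (!nodes[i].kswapd_sleeping)
            return 1;
    return 0;
}
```

Because the check is a logical OR over all online nodes, a single node whose
kswapd is running is enough to lift the async depth limit for every queue that
CFQ manages, regardless of which node is actually short of memory.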

2009-11-26 15:18:14

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Thu, Nov 26, 2009 at 3:17 PM, Mel Gorman <[email protected]> wrote:
> On Thu, Nov 26, 2009 at 02:47:10PM +0100, Corrado Zoccolo wrote:
>> On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <[email protected]> wrote:
>> > (cc'ing the people from the page allocator failure thread as this might be
>> > relevant to some of their problems)
>> >
>> > I know this is very last minute but I believe we should consider disabling
>> > the "low_latency" tunable for block devices by default for 2.6.32. There was
>> > evidence that low_latency was a problem last week for page allocation failure
>> > reports but the reproduction-case was unusual and involved high-order atomic
>> > allocations in low-memory conditions. It took another few days to accurately
>> > show the problem for more normal workloads and it's a bit more wide-spread
>> > than just allocation failures.
>> >
>> > Basically, low_latency looks great as long as you have plenty of memory
>> > but in low memory situations, it appears to cause problems that manifest
>> > as reduced performance, desktop stalls and in some cases, page allocation
>> > failures. I think most kernel developers are not seeing the problem as they
>> > tend to test on beefier machines and without hitting swap or low-memory
>> > situations for the most part. When they are hitting low-memory situations,
>> > it tends to be for stress tests where stalls and low performance are expected.
>>
>> The low latency tunable controls various policies inside cfq.
>> The one that could affect memory reclaim is:
>> /*
>> * Async queues must wait a bit before being allowed dispatch.
>> * We also ramp up the dispatch depth gradually for async IO,
>> * based on the last sync IO we serviced
>> */
>> if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
>> unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
>> unsigned int depth;
>>
>> depth = last_sync / cfqd->cfq_slice[1];
>> if (!depth && !cfqq->dispatched)
>> depth = 1;
>> if (depth < max_dispatch)
>> max_dispatch = depth;
>> }
>>
>> here the async queues max depth is limited to 1 for up to 200 ms after
>> a sync I/O is completed.
>> Note: dirty page writeback goes through an async queue, so it is
>> penalized by this.
>>
>> This can affect both low and high end hardware. My non-NCQ sata disk
>> can handle a depth of 2 when writing. NCQ sata disks can handle a
>> depth up to 31, so limiting depth to 1 can cause write performance
>> drop, and this in turn will slow down dirty page reclaim, and cause
>> allocation failures.
>>
>> It would be good to re-test the OOM conditions with that code commented out.
>>
>
> All of it or just the cfq_latency part?
The whole if, that is enabled only with cfq_latency.

>
> As it turns out the test machine does report for the disk NCQ (depth 31/32)
> and it's the same on the laptop so slowing down dirty page cleaning
> could be impacting reclaim.
Yes, I think so.

>
>> >
>> > To show the problem, I used an x86-64 machine booted with 512MB of
>> > memory. This is a small amount of RAM but the bug reports related to page
>> > allocation failures were on smallish machines and the disks in the system
>> > are not very high-performance.
>> >
>> > I used three tests. The first was sysbench on postgres running an IO-heavy
>> > test against a large database with 10,000,000 rows. The second was IOZone
>> > running most of the automatic tests with a record length of 4KB and the
>> > last was a simulated launching of gitk with a music player running in the
>> > background to act as a desktop-like scenario. The final test was similar
>> > to the test described here http://lwn.net/Articles/362184/ except that
>> > dm-crypt was not used as it has its own problems.
>>
>> low_latency was tested on other scenarios:
>> http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
>> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
>> where it improved actual and perceived performance, so disabling it
>> completely may not be good.
>>
>
> It may not indeed.
>
> In case you mean a partial disabling of cfq_latency, I'm trying the
> following patch. The intention is to disable the low_latency logic if
> kswapd is at work and presumably needs clean pages. Alternative
> suggestions welcome.
Yes, I meant exactly to disable that part, and doing it when kswapd is
active is probably a good choice.
I have a different idea for 2.6.33, though.
If you have a reliable reproducer of the issue, can you test it on
git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33?
It may already be unaffected, since we had various performance
improvements there, but I think a better way to boost writeback is
possible.

Thanks,
Corrado

>
> ======
> cfq: Do not limit the async queue depth while kswapd is awake
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..dcab74e 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1308,7 +1308,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> * We also ramp up the dispatch depth gradually for async IO,
> * based on the last sync IO we serviced
> */
> - if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> + if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency && !kswapd_awake()) {
> unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
> unsigned int depth;
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6f75617..b593aff 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -655,6 +655,7 @@ typedef struct pglist_data {
> void get_zone_counts(unsigned long *active, unsigned long *inactive,
> unsigned long *free);
> void build_all_zonelists(void);
> +int kswapd_awake(void);
> void wakeup_kswapd(struct zone *zone, int order);
> int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
> int classzone_idx, int alloc_flags);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 777af57..75cdd9a 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2201,6 +2201,15 @@ static int kswapd(void *p)
> return 0;
> }
>
> +int kswapd_awake(void)
> +{
> + pg_data_t *pgdat;
> + for_each_online_pgdat(pgdat)
> + if (!waitqueue_active(&pgdat->kswapd_wait))
> + return 1;
> + return 0;
> +}
> +
> /*
> * A zone is low on free memory, so wake its kswapd task to service it.
> */
>

2009-11-27 04:36:19

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

> Signed-off-by: Mel Gorman <[email protected]>
> ---
> block/cfq-iosched.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..dc33045 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -2543,7 +2543,7 @@ static void *cfq_init_queue(struct request_queue *q)
> cfqd->cfq_slice[1] = cfq_slice_sync;
> cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
> cfqd->cfq_slice_idle = cfq_slice_idle;
> - cfqd->cfq_latency = 1;
> + cfqd->cfq_latency = 0;
> cfqd->hw_tag = 1;
> cfqd->last_end_sync_rq = jiffies;
> return cfqd;

Great. We can probably re-enable this feature in 2.6.33, but there is no reason to take
any risk in 2.6.32, i.e. this simple disabling is best. I like this.

Reviewed-by: KOSAKI Motohiro <[email protected]>

2009-11-27 05:58:26

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

> On Thu, Nov 26, 2009 at 02:47:10PM +0100, Corrado Zoccolo wrote:
> > On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <[email protected]> wrote:
> > > (cc'ing the people from the page allocator failure thread as this might be
> > > relevant to some of their problems)
> > >
> > > I know this is very last minute but I believe we should consider disabling
> > > the "low_latency" tunable for block devices by default for 2.6.32. There was
> > > evidence that low_latency was a problem last week for page allocation failure
> > > reports but the reproduction-case was unusual and involved high-order atomic
> > > allocations in low-memory conditions. It took another few days to accurately
> > > show the problem for more normal workloads and it's a bit more wide-spread
> > > than just allocation failures.
> > >
> > > Basically, low_latency looks great as long as you have plenty of memory
> > > but in low memory situations, it appears to cause problems that manifest
> > > as reduced performance, desktop stalls and in some cases, page allocation
> > > failures. I think most kernel developers are not seeing the problem as they
> > > tend to test on beefier machines and without hitting swap or low-memory
> > > situations for the most part. When they are hitting low-memory situations,
> > > it tends to be for stress tests where stalls and low performance are expected.
> >
> > The low latency tunable controls various policies inside cfq.
> > The one that could affect memory reclaim is:
> > /*
> > * Async queues must wait a bit before being allowed dispatch.
> > * We also ramp up the dispatch depth gradually for async IO,
> > * based on the last sync IO we serviced
> > */
> > if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> > unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
> > unsigned int depth;
> >
> > depth = last_sync / cfqd->cfq_slice[1];
> > if (!depth && !cfqq->dispatched)
> > depth = 1;
> > if (depth < max_dispatch)
> > max_dispatch = depth;
> > }
> >
> > here the async queues max depth is limited to 1 for up to 200 ms after
> > a sync I/O is completed.
> > Note: dirty page writeback goes through an async queue, so it is
> > penalized by this.
> >
> > This can affect both low and high end hardware. My non-NCQ sata disk
> > can handle a depth of 2 when writing. NCQ sata disks can handle a
> > depth up to 31, so limiting depth to 1 can cause write performance
> > drop, and this in turn will slow down dirty page reclaim, and cause
> > allocation failures.
> >
> > It would be good to re-test the OOM conditions with that code commented out.
> >
>
> All of it or just the cfq_latency part?
>
> As it turns out the test machine does report for the disk NCQ (depth 31/32)
> and it's the same on the laptop so slowing down dirty page cleaning
> could be impacting reclaim.
>
> > >
> > > To show the problem, I used an x86-64 machine booted with 512MB of
> > > memory. This is a small amount of RAM but the bug reports related to page
> > > allocation failures were on smallish machines and the disks in the system
> > > are not very high-performance.
> > >
> > > I used three tests. The first was sysbench on postgres running an IO-heavy
> > > test against a large database with 10,000,000 rows. The second was IOZone
> > > running most of the automatic tests with a record length of 4KB and the
> > > last was a simulated launching of gitk with a music player running in the
> > > background to act as a desktop-like scenario. The final test was similar
> > > to the test described here http://lwn.net/Articles/362184/ except that
> > > dm-crypt was not used as it has its own problems.
> >
> > low_latency was tested on other scenarios:
> > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> > where it improved actual and perceived performance, so disabling it
> > completely may not be good.
> >
>
> It may not indeed.
>
> In case you mean a partial disabling of cfq_latency, I'm trying the
> following patch. The intention is to disable the low_latency logic if
> kswapd is at work and presumably needs clean pages. Alternative
> suggestions welcome.

I would like to treat vmscan writeout as special, because:
- vmscan runs in various process contexts, but it does not write out the
  calling process's own pages. IOW, it does not match cfq's I/O fairness
  logic well.
- Plus, the above means vmscan writeout does not need good I/O latency.
- vmscan maintains a page-granularity LRU list, which means vmscan generates
  awfully seeky I/O. It assumes the block layer buffers many I/O requests.
- Plus, the above means vmscan writeout needs good I/O throughput; otherwise
  the system might hang.

However, I don't think kswapd_awake is a good choice, because:
- Zone reclaim runs before kswapd is woken up. IOW, this patch doesn't help HPC machines.
  BTW, some Core i7 boxes (at least Intel's reference box) also use zone reclaim.
- On a large machine (many memory nodes), one of the many kswapd threads is always running.

Instead, would PF_MEMALLOC be a good idea?

Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim

Not-Signed-off-by: KOSAKI Motohiro <[email protected]> (I haven't test this)
---
block/cfq-iosched.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..9546f64 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * We also ramp up the dispatch depth gradually for async IO,
 	 * based on the last sync IO we serviced
 	 */
-	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
+	    !(current->flags & PF_MEMALLOC)) {
 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
 		unsigned int depth;

--
1.6.5.2

2009-11-27 06:29:09

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

> Instead, would PF_MEMALLOC be a good idea?

This patch was obviously wrong. Please forget it. I'm sorry.

>
>
> Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim
>
> Not-Signed-off-by: KOSAKI Motohiro <[email protected]> (I haven't test this)
> ---
> block/cfq-iosched.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..9546f64 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> * We also ramp up the dispatch depth gradually for async IO,
> * based on the last sync IO we serviced
> */
> - if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> + if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
> + !(current->flags & PF_MEMALLOC)) {
> unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
> unsigned int depth;
>
> --
> 1.6.5.2

2009-11-27 11:44:51

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Thu, Nov 26, 2009 at 04:18:18PM +0100, Corrado Zoccolo wrote:
> > <SNIP>
> >
> > In case you mean a partial disabling of cfq_latency, I'm trying the
> > following patch. The intention is to disable the low_latency logic if
> > kswapd is at work and presumably needs clean pages. Alternative
> > suggestions welcome.

As it turned out, that patch sucked so I aborted the test and I need to
think about it a lot more.

> Yes, I meant exactly to disable that part, and doing it when kswapd is
> active is probably a good choice.
> I have a different idea for 2.6.33, though.
> If you have a reliable reproducer of the issue, can you test it on
> git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33?
> It may already be unaffected, since we had various performance
> improvements there, but I think a better way to boost writeback is
> possible.
>

I haven't tested the high-order allocation scenario yet but the results
as things stand are below. There are four kernels being compared:

1. with-low-latency is 2.6.32-rc8 vanilla
2. with-low-latency-block-2.6.33 is with the for-2.6.33 branch from linux-block applied
3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
4. without-low-latency is with low_latency disabled

SYSBENCH
        sysbench-with        low-latency        low-latency   sysbench-without
          low-latency       block-2.6.33       async-rampup        low-latency
 1  1266.02  ( 0.00%)   824.08 (-53.63%)  1265.15  (-0.07%)  1278.55  ( 0.98%)
 2  1182.58  ( 0.00%)  1226.42  ( 3.57%)  1223.03  ( 3.31%)  1379.25  (14.26%)
 3  1218.64  ( 0.00%)  1271.38  ( 4.15%)  1246.42  ( 2.23%)  1580.08  (22.87%)
 4  1212.11  ( 0.00%)  1257.84  ( 3.64%)  1325.17  ( 8.53%)  1534.17  (20.99%)
 5  1046.77  ( 0.00%)   981.71  (-6.63%)  1008.44  (-3.80%)  1552.48  (32.57%)
 6  1187.14  ( 0.00%)  1132.89  (-4.79%)  1147.18  (-3.48%)  1661.19  (28.54%)
 7  1179.37  ( 0.00%)  1183.61  ( 0.36%)  1202.49  ( 1.92%)   790.26 (-49.24%)
 8  1164.62  ( 0.00%)  1143.54  (-1.84%)  1184.56  ( 1.68%)   854.10 (-36.36%)
 9  1095.22  ( 0.00%)  1178.72  ( 7.08%)  1002.42  (-9.26%)  1655.04  (33.83%)
10  1147.52  ( 0.00%)  1153.46  ( 0.52%)  1151.73  ( 0.37%)  1653.89  (30.62%)
11   823.38  ( 0.00%)   820.64  (-0.33%)   754.15  (-9.18%)  1627.45  (49.41%)
12   813.73  ( 0.00%)   791.44  (-2.82%)   848.32  ( 4.08%)  1494.63  (45.56%)
13   898.22  ( 0.00%)   789.63 (-13.75%)   931.47  ( 3.57%)  1521.64  (40.97%)
14   873.50  ( 0.00%)   938.90  ( 6.97%)   875.75  ( 0.26%)  1311.09  (33.38%)
15   808.32  ( 0.00%)   979.88  (17.51%)   877.87  ( 7.92%)  1009.70  (19.94%)
16   758.17  ( 0.00%)  1096.81  (30.87%)   881.23  (13.96%)   725.17  (-4.55%)

sysbench is helped by both block-2.6.33 and async-rampup to some
extent. For many of the results, plain old disabling low_latency still
helps the most.

desktop-net-gitk
               gitk-with       low-latency       low-latency      gitk-without
             low-latency      block-2.6.33      async-rampup       low-latency
min     954.46  ( 0.00%)  570.06  (40.27%)  796.22  (16.58%)  640.65  (32.88%)
mean    964.79  ( 0.00%)  573.96  (40.51%)  798.01  (17.29%)  655.57  (32.05%)
stddev   10.01  ( 0.00%)    2.65  (73.55%)    1.91  (80.95%)   13.33 (-33.18%)
max     981.23  ( 0.00%)  577.21  (41.17%)  800.91  (18.38%)  675.65  (31.14%)

The changes for block in 2.6.33 make a massive difference here, notably
beating the disabling of low_latency.
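
The mail does not say how the bracketed percentages are computed. The sketch
below is inferred from the figures themselves and is an assumption, not the
documented behaviour of the reporting scripts: the sysbench and IOZone columns
(treated here as higher-is-better) match a gain taken relative to the new
value, while the gitk columns (treated here as lower-is-better) match a
reduction taken relative to the baseline. The function names are hypothetical.

```c
#include <assert.h>

/*
 * Higher-is-better metrics: gain expressed relative to the new value.
 * Matches the sysbench and IOZone columns when rounded to two decimals.
 */
double pct_higher_is_better(double baseline, double value)
{
    assert(value != 0.0);
    return (value - baseline) / value * 100.0;
}

/*
 * Lower-is-better metrics: reduction expressed relative to the baseline.
 * Matches the gitk columns when rounded to two decimals.
 */
double pct_lower_is_better(double baseline, double value)
{
    assert(baseline != 0.0);
    return (baseline - value) / baseline * 100.0;
}
```

For example, the first sysbench row gives (824.08 - 1266.02) / 824.08, which
rounds to -53.63%, and the gitk "min" row gives (954.46 - 570.06) / 954.46,
which rounds to 40.27%. One consequence of the first formula is that
regressions are magnified: a drop from 1266.02 to 824.08 is about 35% of the
baseline but is reported as -53.63%.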

IOZone
                iozone-with      low-latency      low-latency   iozone-without
                low-latency     block-2.6.33     async-rampup      low-latency
write-64 151212 ( 0.00%) 163359 ( 7.44%) 163359 ( 7.44%) 159856 ( 5.41%)
write-128 189357 ( 0.00%) 184922 (-2.40%) 202805 ( 6.63%) 206233 ( 8.18%)
write-256 219883 ( 0.00%) 211232 (-4.10%) 189867 (-15.81%) 223174 ( 1.47%)
write-512 224932 ( 0.00%) 222601 (-1.05%) 204459 (-10.01%) 220227 (-2.14%)
write-1024 227738 ( 0.00%) 226728 (-0.45%) 216009 (-5.43%) 226155 (-0.70%)
write-2048 227564 ( 0.00%) 224167 (-1.52%) 229387 ( 0.79%) 224848 (-1.21%)
write-4096 208556 ( 0.00%) 227707 ( 8.41%) 216908 ( 3.85%) 223430 ( 6.66%)
write-8192 219484 ( 0.00%) 222365 ( 1.30%) 217737 (-0.80%) 219389 (-0.04%)
write-16384 206670 ( 0.00%) 209355 ( 1.28%) 204146 (-1.24%) 206295 (-0.18%)
write-32768 203023 ( 0.00%) 205097 ( 1.01%) 199766 (-1.63%) 201852 (-0.58%)
write-65536 162134 ( 0.00%) 196670 (17.56%) 189975 (14.66%) 189173 (14.29%)
write-131072 68534 ( 0.00%) 69145 ( 0.88%) 64519 (-6.22%) 67417 (-1.66%)
write-262144 32936 ( 0.00%) 28587 (-15.21%) 31470 (-4.66%) 27750 (-18.69%)
write-524288 24044 ( 0.00%) 23560 (-2.05%) 23116 (-4.01%) 23759 (-1.20%)
rewrite-64 755681 ( 0.00%) 800767 ( 5.63%) 469931 (-60.81%) 755681 ( 0.00%)
rewrite-128 581518 ( 0.00%) 639723 ( 9.10%) 591774 ( 1.73%) 799840 (27.30%)
rewrite-256 639427 ( 0.00%) 710511 (10.00%) 666414 ( 4.05%) 659861 ( 3.10%)
rewrite-512 669577 ( 0.00%) 743788 ( 9.98%) 692017 ( 3.24%) 684954 ( 2.24%)
rewrite-1024 680960 ( 0.00%) 755195 ( 9.83%) 701422 ( 2.92%) 686182 ( 0.76%)
rewrite-2048 685263 ( 0.00%) 743123 ( 7.79%) 703445 ( 2.58%) 692780 ( 1.09%)
rewrite-4096 631352 ( 0.00%) 686776 ( 8.07%) 640007 ( 1.35%) 643266 ( 1.85%)
rewrite-8192 442146 ( 0.00%) 474089 ( 6.74%) 457768 ( 3.41%) 442624 ( 0.11%)
rewrite-16384 428641 ( 0.00%) 454857 ( 5.76%) 442896 ( 3.22%) 432613 ( 0.92%)
rewrite-32768 425361 ( 0.00%) 444206 ( 4.24%) 434472 ( 2.10%) 430568 ( 1.21%)
rewrite-65536 405183 ( 0.00%) 433898 ( 6.62%) 419843 ( 3.49%) 389242 (-4.10%)
rewrite-131072 66110 ( 0.00%) 58370 (-13.26%) 54342 (-21.66%) 58472 (-13.06%)
rewrite-262144 29254 ( 0.00%) 24665 (-18.61%) 25710 (-13.78%) 29306 ( 0.18%)
rewrite-524288 23812 ( 0.00%) 20742 (-14.80%) 22490 (-5.88%) 24543 ( 2.98%)
read-64 934589 ( 0.00%) 1160938 (19.50%) 1004538 ( 6.96%) 840903 (-11.14%)
read-128 1601534 ( 0.00%) 1869179 (14.32%) 1681806 ( 4.77%) 1280633 (-25.06%)
read-256 1255511 ( 0.00%) 1526887 (17.77%) 1304314 ( 3.74%) 1310683 ( 4.21%)
read-512 1291158 ( 0.00%) 1377278 ( 6.25%) 1336145 ( 3.37%) 1319723 ( 2.16%)
read-1024 1319408 ( 0.00%) 1306564 (-0.98%) 1368162 ( 3.56%) 1347557 ( 2.09%)
read-2048 1316016 ( 0.00%) 1394645 ( 5.64%) 1339827 ( 1.78%) 1347393 ( 2.33%)
read-4096 1253710 ( 0.00%) 1307525 ( 4.12%) 1247519 (-0.50%) 1251882 (-0.15%)
read-8192 995149 ( 0.00%) 1033337 ( 3.70%) 1016944 ( 2.14%) 1011794 ( 1.65%)
read-16384 883156 ( 0.00%) 905213 ( 2.44%) 905213 ( 2.44%) 897458 ( 1.59%)
read-32768 844368 ( 0.00%) 855213 ( 1.27%) 849609 ( 0.62%) 856364 ( 1.40%)
read-65536 816099 ( 0.00%) 839262 ( 2.76%) 835019 ( 2.27%) 826473 ( 1.26%)
read-131072 818055 ( 0.00%) 837369 ( 2.31%) 828230 ( 1.23%) 824351 ( 0.76%)
read-262144 827225 ( 0.00%) 839635 ( 1.48%) 840538 ( 1.58%) 835693 ( 1.01%)
read-524288 24653 ( 0.00%) 21387 (-15.27%) 20602 (-19.66%) 22519 (-9.48%)
reread-64 2329708 ( 0.00%) 2251544 (-3.47%) 1985134 (-17.36%) 1985134 (-17.36%)
reread-128 1446222 ( 0.00%) 1979446 (26.94%) 2009076 (28.02%) 2137031 (32.33%)
reread-256 1828508 ( 0.00%) 2006158 ( 8.86%) 1892980 ( 3.41%) 1879725 ( 2.72%)
reread-512 1521718 ( 0.00%) 1642783 ( 7.37%) 1508887 (-0.85%) 1579934 ( 3.68%)
reread-1024 1347557 ( 0.00%) 1422540 ( 5.27%) 1384034 ( 2.64%) 1375171 ( 2.01%)
reread-2048 1340664 ( 0.00%) 1413929 ( 5.18%) 1372364 ( 2.31%) 1350783 ( 0.75%)
reread-4096 1259592 ( 0.00%) 1324868 ( 4.93%) 1273788 ( 1.11%) 1284839 ( 1.96%)
reread-8192 1007285 ( 0.00%) 1033710 ( 2.56%) 1027159 ( 1.93%) 1011317 ( 0.40%)
reread-16384 891404 ( 0.00%) 910828 ( 2.13%) 916562 ( 2.74%) 905022 ( 1.50%)
reread-32768 850492 ( 0.00%) 859341 ( 1.03%) 856385 ( 0.69%) 862772 ( 1.42%)
reread-65536 836565 ( 0.00%) 852664 ( 1.89%) 852315 ( 1.85%) 847020 ( 1.23%)
reread-131072 844516 ( 0.00%) 862590 ( 2.10%) 854067 ( 1.12%) 853155 ( 1.01%)
reread-262144 851524 ( 0.00%) 860559 ( 1.05%) 864921 ( 1.55%) 860653 ( 1.06%)
reread-524288 24927 ( 0.00%) 21300 (-17.03%) 19748 (-26.23%) 22487 (-10.85%)
randread-64 1605256 ( 0.00%) 1605256 ( 0.00%) 1605256 ( 0.00%) 1775099 ( 9.57%)
randread-128 1179358 ( 0.00%) 1582649 (25.48%) 1511363 (21.97%) 1528576 (22.85%)
randread-256 1421755 ( 0.00%) 1599680 (11.12%) 1460430 ( 2.65%) 1310683 (-8.47%)
randread-512 1306873 ( 0.00%) 1278855 (-2.19%) 1243315 (-5.11%) 1281909 (-1.95%)
randread-1024 1201314 ( 0.00%) 1254656 ( 4.25%) 1190657 (-0.90%) 1231629 ( 2.46%)
randread-2048 1179413 ( 0.00%) 1227971 ( 3.95%) 1185272 ( 0.49%) 1190529 ( 0.93%)
randread-4096 1107005 ( 0.00%) 1160862 ( 4.64%) 1110727 ( 0.34%) 1116792 ( 0.88%)
randread-8192 894337 ( 0.00%) 924264 ( 3.24%) 912676 ( 2.01%) 899487 ( 0.57%)
randread-16384 783760 ( 0.00%) 800299 ( 2.07%) 793351 ( 1.21%) 791341 ( 0.96%)
randread-32768 740498 ( 0.00%) 743720 ( 0.43%) 741233 ( 0.10%) 743511 ( 0.41%)
randread-65536 721640 ( 0.00%) 727692 ( 0.83%) 726984 ( 0.74%) 728139 ( 0.89%)
randread-131072 715284 ( 0.00%) 722094 ( 0.94%) 717746 ( 0.34%) 720825 ( 0.77%)
randread-262144 709855 ( 0.00%) 706770 (-0.44%) 709133 (-0.10%) 714943 ( 0.71%)
randread-524288 394 ( 0.00%) 421 ( 6.41%) 418 ( 5.74%) 431 ( 8.58%)
randwrite-64 730988 ( 0.00%) 764288 ( 4.36%) 723111 (-1.09%) 730988 ( 0.00%)
randwrite-128 746459 ( 0.00%) 799840 ( 6.67%) 746459 ( 0.00%) 742331 (-0.56%)
randwrite-256 695778 ( 0.00%) 752329 ( 7.52%) 720041 ( 3.37%) 727850 ( 4.41%)
randwrite-512 666253 ( 0.00%) 722760 ( 7.82%) 667081 ( 0.12%) 691126 ( 3.60%)
randwrite-1024 651223 ( 0.00%) 697776 ( 6.67%) 663292 ( 1.82%) 659625 ( 1.27%)
randwrite-2048 655558 ( 0.00%) 691887 ( 5.25%) 665720 ( 1.53%) 664073 ( 1.28%)
randwrite-4096 635556 ( 0.00%) 662721 ( 4.10%) 643170 ( 1.18%) 642400 ( 1.07%)
randwrite-8192 467357 ( 0.00%) 491364 ( 4.89%) 476720 ( 1.96%) 469734 ( 0.51%)
randwrite-16384 413188 ( 0.00%) 427521 ( 3.35%) 417353 ( 1.00%) 417282 ( 0.98%)
randwrite-32768 404161 ( 0.00%) 411721 ( 1.84%) 404942 ( 0.19%) 407580 ( 0.84%)
randwrite-65536 379372 ( 0.00%) 397312 ( 4.52%) 386853 ( 1.93%) 381273 ( 0.50%)
randwrite-131072 21780 ( 0.00%) 16924 (-28.69%) 21177 (-2.85%) 19758 (-10.23%)
randwrite-262144 6249 ( 0.00%) 5548 (-12.64%) 6370 ( 1.90%) 6316 ( 1.06%)
randwrite-524288 2915 ( 0.00%) 2582 (-12.90%) 2871 (-1.53%) 2859 (-1.96%)
bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%) 1004538 (-13.60%) 1141196 ( 0.00%)
bkwdread-128 1066865 ( 0.00%) 1386465 (23.05%) 1400936 (23.85%) 1101900 ( 3.18%)
bkwdread-256 877797 ( 0.00%) 1105556 (20.60%) 1105556 (20.60%) 1105556 (20.60%)
bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%) 1175271 ( 3.59%) 1162547 ( 2.53%)
bkwdread-1024 1163562 ( 0.00%) 1206714 ( 3.58%) 1213534 ( 4.12%) 1195962 ( 2.71%)
bkwdread-2048 1163439 ( 0.00%) 1218910 ( 4.55%) 1204552 ( 3.41%) 1204552 ( 3.41%)
bkwdread-4096 1116792 ( 0.00%) 1175477 ( 4.99%) 1159922 ( 3.72%) 1150600 ( 2.94%)
bkwdread-8192 912288 ( 0.00%) 935233 ( 2.45%) 944695 ( 3.43%) 934724 ( 2.40%)
bkwdread-16384 817707 ( 0.00%) 824140 ( 0.78%) 832527 ( 1.78%) 829152 ( 1.38%)
bkwdread-32768 775898 ( 0.00%) 773714 (-0.28%) 785494 ( 1.22%) 787691 ( 1.50%)
bkwdread-65536 759643 ( 0.00%) 769924 ( 1.34%) 778780 ( 2.46%) 772174 ( 1.62%)
bkwdread-131072 763215 ( 0.00%) 769634 ( 0.83%) 773707 ( 1.36%) 773816 ( 1.37%)
bkwdread-262144 765491 ( 0.00%) 768992 ( 0.46%) 780876 ( 1.97%) 780021 ( 1.86%)
bkwdread-524288 3688 ( 0.00%) 3595 (-2.59%) 3577 (-3.10%) 3724 ( 0.97%)

The upcoming changes for 2.6.33 also help iozone in many cases, often by more
than just disabling low_latency. It has the occasional massive gain or loss
for the larger file sizes. I don't know why this is but as the big losses
appear to be mostly in the write-tests, I would guess that it's differences
in heavy-writer-throttling.

The only downside with block-2.6.33 is that there are a lot of patches in
there and doesn't help with the 2.6.32 release as such. I could do a reverse
bisect to see what helps the most in there but under ideal conditions, it'll
take 3 days to complete and I wouldn't be able to start until Monday as I'm
out of the country for the weekend. That's a bit late.

p.s. As a consequence of being out of the country, I also won't be able to
respond to mail over the weekend.

--
Mel Gorman

2009-11-27 12:03:27

[permalink] [raw]

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:
> On Thu, Nov 26, 2009 at 04:18:18PM +0100, Corrado Zoccolo wrote:
>> > <SNIP>
>> >
>> > In case you mean a partial disabling of cfq_latency, I'm try the
>> > following patch. The intention is to disable the low_latency logic if
>> > kswapd is at work and presumably needs clean pages. Alternative
>> > suggestions welcome.
>
> As it turned out, that patch sucked so I aborted the test and I need to
> think about it a lot more.

What about using the dirty ratio, instead of checking if kswapd is running?

>> Yes, I meant exactly to disable that part, and doing it when kswapd is
>> active is probably a good choice.
>> I have a different idea for 2.6.33, though.
>> If you have a reliable reproducer of the issue, can you test it on
>> git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33?
>> It may already be unaffected, since we had various performance
>> improvements there, but I think a better way to boost writeback is
>> possible.
>>
>
> I haven't tested the high-order allocation scenario yet but the results
> as thing stands are below. There are four kernels being compared
>
> 1. with-low-latency is 2.6.32-rc8 vanilla
> 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied
> 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
> 4. without-low-latency is with low_latency disabled
>
> SYSBENCH
> sysbench-with low-latency low-latency sysbench-without
> low-latency block-2.6.33 async-rampup low-latency
> 1 1266.02 ( 0.00%) 824.08 (-53.63%) 1265.15 (-0.07%) 1278.55 ( 0.98%)
> 2 1182.58 ( 0.00%) 1226.42 ( 3.57%) 1223.03 ( 3.31%) 1379.25 (14.26%)
> 3 1218.64 ( 0.00%) 1271.38 ( 4.15%) 1246.42 ( 2.23%) 1580.08 (22.87%)
> 4 1212.11 ( 0.00%) 1257.84 ( 3.64%) 1325.17 ( 8.53%) 1534.17 (20.99%)
> 5 1046.77 ( 0.00%) 981.71 (-6.63%) 1008.44 (-3.80%) 1552.48 (32.57%)
> 6 1187.14 ( 0.00%) 1132.89 (-4.79%) 1147.18 (-3.48%) 1661.19 (28.54%)
> 7 1179.37 ( 0.00%) 1183.61 ( 0.36%) 1202.49 ( 1.92%) 790.26 (-49.24%)
> 8 1164.62 ( 0.00%) 1143.54 (-1.84%) 1184.56 ( 1.68%) 854.10 (-36.36%)
> 9 1095.22 ( 0.00%) 1178.72 ( 7.08%) 1002.42 (-9.26%) 1655.04 (33.83%)
> 10 1147.52 ( 0.00%) 1153.46 ( 0.52%) 1151.73 ( 0.37%) 1653.89 (30.62%)
> 11 823.38 ( 0.00%) 820.64 (-0.33%) 754.15 (-9.18%) 1627.45 (49.41%)
> 12 813.73 ( 0.00%) 791.44 (-2.82%) 848.32 ( 4.08%) 1494.63 (45.56%)
> 13 898.22 ( 0.00%) 789.63 (-13.75%) 931.47 ( 3.57%) 1521.64 (40.97%)
> 14 873.50 ( 0.00%) 938.90 ( 6.97%) 875.75 ( 0.26%) 1311.09 (33.38%)
> 15 808.32 ( 0.00%) 979.88 (17.51%) 877.87 ( 7.92%) 1009.70 (19.94%)
> 16 758.17 ( 0.00%) 1096.81 (30.87%) 881.23 (13.96%) 725.17 (-4.55%)
>
> sysbench is helped by both both block-2.6.33 and async-rampup to some
> extent. For many of the results, plain old disabling low_latency still
> helps the most.
>
> desktop-net-gitk
> gitk-with low-latency low-latency gitk-without
> low-latency block-2.6.33 async-rampup low-latency
> min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%)
> mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%)
> stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%)
> max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%)
>
> The changes for block in 2.6.33 make a massive difference here, notably
> beating the disabling of low_latency.

Yes. These are read of lots of small files, so the improvements for
seeky workload we introduced in 2.6.33 helps a lot here.

>
> IOZone
> iozone-with low-latency low-latency iozone-without
> low-latency block-2.6.33 async-rampup low-latency
> write-64 151212 ( 0.00%) 163359 ( 7.44%) 163359 ( 7.44%) 159856 ( 5.41%)
> write-128 189357 ( 0.00%) 184922 (-2.40%) 202805 ( 6.63%) 206233 ( 8.18%)
> write-256 219883 ( 0.00%) 211232 (-4.10%) 189867 (-15.81%) 223174 ( 1.47%)
> write-512 224932 ( 0.00%) 222601 (-1.05%) 204459 (-10.01%) 220227 (-2.14%)
> write-1024 227738 ( 0.00%) 226728 (-0.45%) 216009 (-5.43%) 226155 (-0.70%)
> write-2048 227564 ( 0.00%) 224167 (-1.52%) 229387 ( 0.79%) 224848 (-1.21%)
> write-4096 208556 ( 0.00%) 227707 ( 8.41%) 216908 ( 3.85%) 223430 ( 6.66%)
> write-8192 219484 ( 0.00%) 222365 ( 1.30%) 217737 (-0.80%) 219389 (-0.04%)
> write-16384 206670 ( 0.00%) 209355 ( 1.28%) 204146 (-1.24%) 206295 (-0.18%)
> write-32768 203023 ( 0.00%) 205097 ( 1.01%) 199766 (-1.63%) 201852 (-0.58%)
> write-65536 162134 ( 0.00%) 196670 (17.56%) 189975 (14.66%) 189173 (14.29%)
> write-131072 68534 ( 0.00%) 69145 ( 0.88%) 64519 (-6.22%) 67417 (-1.66%)
> write-262144 32936 ( 0.00%) 28587 (-15.21%) 31470 (-4.66%) 27750 (-18.69%)
> write-524288 24044 ( 0.00%) 23560 (-2.05%) 23116 (-4.01%) 23759 (-1.20%)
> rewrite-64 755681 ( 0.00%) 800767 ( 5.63%) 469931 (-60.81%) 755681 ( 0.00%)
> rewrite-128 581518 ( 0.00%) 639723 ( 9.10%) 591774 ( 1.73%) 799840 (27.30%)
> rewrite-256 639427 ( 0.00%) 710511 (10.00%) 666414 ( 4.05%) 659861 ( 3.10%)
> rewrite-512 669577 ( 0.00%) 743788 ( 9.98%) 692017 ( 3.24%) 684954 ( 2.24%)
> rewrite-1024 680960 ( 0.00%) 755195 ( 9.83%) 701422 ( 2.92%) 686182 ( 0.76%)
> rewrite-2048 685263 ( 0.00%) 743123 ( 7.79%) 703445 ( 2.58%) 692780 ( 1.09%)
> rewrite-4096 631352 ( 0.00%) 686776 ( 8.07%) 640007 ( 1.35%) 643266 ( 1.85%)
> rewrite-8192 442146 ( 0.00%) 474089 ( 6.74%) 457768 ( 3.41%) 442624 ( 0.11%)
> rewrite-16384 428641 ( 0.00%) 454857 ( 5.76%) 442896 ( 3.22%) 432613 ( 0.92%)
> rewrite-32768 425361 ( 0.00%) 444206 ( 4.24%) 434472 ( 2.10%) 430568 ( 1.21%)
> rewrite-65536 405183 ( 0.00%) 433898 ( 6.62%) 419843 ( 3.49%) 389242 (-4.10%)
> rewrite-131072 66110 ( 0.00%) 58370 (-13.26%) 54342 (-21.66%) 58472 (-13.06%)
> rewrite-262144 29254 ( 0.00%) 24665 (-18.61%) 25710 (-13.78%) 29306 ( 0.18%)
> rewrite-524288 23812 ( 0.00%) 20742 (-14.80%) 22490 (-5.88%) 24543 ( 2.98%)
> read-64 934589 ( 0.00%) 1160938 (19.50%) 1004538 ( 6.96%) 840903 (-11.14%)
> read-128 1601534 ( 0.00%) 1869179 (14.32%) 1681806 ( 4.77%) 1280633 (-25.06%)
> read-256 1255511 ( 0.00%) 1526887 (17.77%) 1304314 ( 3.74%) 1310683 ( 4.21%)
> read-512 1291158 ( 0.00%) 1377278 ( 6.25%) 1336145 ( 3.37%) 1319723 ( 2.16%)
> read-1024 1319408 ( 0.00%) 1306564 (-0.98%) 1368162 ( 3.56%) 1347557 ( 2.09%)
> read-2048 1316016 ( 0.00%) 1394645 ( 5.64%) 1339827 ( 1.78%) 1347393 ( 2.33%)
> read-4096 1253710 ( 0.00%) 1307525 ( 4.12%) 1247519 (-0.50%) 1251882 (-0.15%)
> read-8192 995149 ( 0.00%) 1033337 ( 3.70%) 1016944 ( 2.14%) 1011794 ( 1.65%)
> read-16384 883156 ( 0.00%) 905213 ( 2.44%) 905213 ( 2.44%) 897458 ( 1.59%)
> read-32768 844368 ( 0.00%) 855213 ( 1.27%) 849609 ( 0.62%) 856364 ( 1.40%)
> read-65536 816099 ( 0.00%) 839262 ( 2.76%) 835019 ( 2.27%) 826473 ( 1.26%)
> read-131072 818055 ( 0.00%) 837369 ( 2.31%) 828230 ( 1.23%) 824351 ( 0.76%)
> read-262144 827225 ( 0.00%) 839635 ( 1.48%) 840538 ( 1.58%) 835693 ( 1.01%)
> read-524288 24653 ( 0.00%) 21387 (-15.27%) 20602 (-19.66%) 22519 (-9.48%)
> reread-64 2329708 ( 0.00%) 2251544 (-3.47%) 1985134 (-17.36%) 1985134 (-17.36%)
> reread-128 1446222 ( 0.00%) 1979446 (26.94%) 2009076 (28.02%) 2137031 (32.33%)
> reread-256 1828508 ( 0.00%) 2006158 ( 8.86%) 1892980 ( 3.41%) 1879725 ( 2.72%)
> reread-512 1521718 ( 0.00%) 1642783 ( 7.37%) 1508887 (-0.85%) 1579934 ( 3.68%)
> reread-1024 1347557 ( 0.00%) 1422540 ( 5.27%) 1384034 ( 2.64%) 1375171 ( 2.01%)
> reread-2048 1340664 ( 0.00%) 1413929 ( 5.18%) 1372364 ( 2.31%) 1350783 ( 0.75%)
> reread-4096 1259592 ( 0.00%) 1324868 ( 4.93%) 1273788 ( 1.11%) 1284839 ( 1.96%)
> reread-8192 1007285 ( 0.00%) 1033710 ( 2.56%) 1027159 ( 1.93%) 1011317 ( 0.40%)
> reread-16384 891404 ( 0.00%) 910828 ( 2.13%) 916562 ( 2.74%) 905022 ( 1.50%)
> reread-32768 850492 ( 0.00%) 859341 ( 1.03%) 856385 ( 0.69%) 862772 ( 1.42%)
> reread-65536 836565 ( 0.00%) 852664 ( 1.89%) 852315 ( 1.85%) 847020 ( 1.23%)
> reread-131072 844516 ( 0.00%) 862590 ( 2.10%) 854067 ( 1.12%) 853155 ( 1.01%)
> reread-262144 851524 ( 0.00%) 860559 ( 1.05%) 864921 ( 1.55%) 860653 ( 1.06%)
> reread-524288 24927 ( 0.00%) 21300 (-17.03%) 19748 (-26.23%) 22487 (-10.85%)
> randread-64 1605256 ( 0.00%) 1605256 ( 0.00%) 1605256 ( 0.00%) 1775099 ( 9.57%)
> randread-128 1179358 ( 0.00%) 1582649 (25.48%) 1511363 (21.97%) 1528576 (22.85%)
> randread-256 1421755 ( 0.00%) 1599680 (11.12%) 1460430 ( 2.65%) 1310683 (-8.47%)
> randread-512 1306873 ( 0.00%) 1278855 (-2.19%) 1243315 (-5.11%) 1281909 (-1.95%)
> randread-1024 1201314 ( 0.00%) 1254656 ( 4.25%) 1190657 (-0.90%) 1231629 ( 2.46%)
> randread-2048 1179413 ( 0.00%) 1227971 ( 3.95%) 1185272 ( 0.49%) 1190529 ( 0.93%)
> randread-4096 1107005 ( 0.00%) 1160862 ( 4.64%) 1110727 ( 0.34%) 1116792 ( 0.88%)
> randread-8192 894337 ( 0.00%) 924264 ( 3.24%) 912676 ( 2.01%) 899487 ( 0.57%)
> randread-16384 783760 ( 0.00%) 800299 ( 2.07%) 793351 ( 1.21%) 791341 ( 0.96%)
> randread-32768 740498 ( 0.00%) 743720 ( 0.43%) 741233 ( 0.10%) 743511 ( 0.41%)
> randread-65536 721640 ( 0.00%) 727692 ( 0.83%) 726984 ( 0.74%) 728139 ( 0.89%)
> randread-131072 715284 ( 0.00%) 722094 ( 0.94%) 717746 ( 0.34%) 720825 ( 0.77%)
> randread-262144 709855 ( 0.00%) 706770 (-0.44%) 709133 (-0.10%) 714943 ( 0.71%)
> randread-524288 394 ( 0.00%) 421 ( 6.41%) 418 ( 5.74%) 431 ( 8.58%)
> randwrite-64 730988 ( 0.00%) 764288 ( 4.36%) 723111 (-1.09%) 730988 ( 0.00%)
> randwrite-128 746459 ( 0.00%) 799840 ( 6.67%) 746459 ( 0.00%) 742331 (-0.56%)
> randwrite-256 695778 ( 0.00%) 752329 ( 7.52%) 720041 ( 3.37%) 727850 ( 4.41%)
> randwrite-512 666253 ( 0.00%) 722760 ( 7.82%) 667081 ( 0.12%) 691126 ( 3.60%)
> randwrite-1024 651223 ( 0.00%) 697776 ( 6.67%) 663292 ( 1.82%) 659625 ( 1.27%)
> randwrite-2048 655558 ( 0.00%) 691887 ( 5.25%) 665720 ( 1.53%) 664073 ( 1.28%)
> randwrite-4096 635556 ( 0.00%) 662721 ( 4.10%) 643170 ( 1.18%) 642400 ( 1.07%)
> randwrite-8192 467357 ( 0.00%) 491364 ( 4.89%) 476720 ( 1.96%) 469734 ( 0.51%)
> randwrite-16384 413188 ( 0.00%) 427521 ( 3.35%) 417353 ( 1.00%) 417282 ( 0.98%)
> randwrite-32768 404161 ( 0.00%) 411721 ( 1.84%) 404942 ( 0.19%) 407580 ( 0.84%)
> randwrite-65536 379372 ( 0.00%) 397312 ( 4.52%) 386853 ( 1.93%) 381273 ( 0.50%)
> randwrite-131072 21780 ( 0.00%) 16924 (-28.69%) 21177 (-2.85%) 19758 (-10.23%)
> randwrite-262144 6249 ( 0.00%) 5548 (-12.64%) 6370 ( 1.90%) 6316 ( 1.06%)
> randwrite-524288 2915 ( 0.00%) 2582 (-12.90%) 2871 (-1.53%) 2859 (-1.96%)
> bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%) 1004538 (-13.60%) 1141196 ( 0.00%)
> bkwdread-128 1066865 ( 0.00%) 1386465 (23.05%) 1400936 (23.85%) 1101900 ( 3.18%)
> bkwdread-256 877797 ( 0.00%) 1105556 (20.60%) 1105556 (20.60%) 1105556 (20.60%)
> bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%) 1175271 ( 3.59%) 1162547 ( 2.53%)
> bkwdread-1024 1163562 ( 0.00%) 1206714 ( 3.58%) 1213534 ( 4.12%) 1195962 ( 2.71%)
> bkwdread-2048 1163439 ( 0.00%) 1218910 ( 4.55%) 1204552 ( 3.41%) 1204552 ( 3.41%)
> bkwdread-4096 1116792 ( 0.00%) 1175477 ( 4.99%) 1159922 ( 3.72%) 1150600 ( 2.94%)
> bkwdread-8192 912288 ( 0.00%) 935233 ( 2.45%) 944695 ( 3.43%) 934724 ( 2.40%)
> bkwdread-16384 817707 ( 0.00%) 824140 ( 0.78%) 832527 ( 1.78%) 829152 ( 1.38%)
> bkwdread-32768 775898 ( 0.00%) 773714 (-0.28%) 785494 ( 1.22%) 787691 ( 1.50%)
> bkwdread-65536 759643 ( 0.00%) 769924 ( 1.34%) 778780 ( 2.46%) 772174 ( 1.62%)
> bkwdread-131072 763215 ( 0.00%) 769634 ( 0.83%) 773707 ( 1.36%) 773816 ( 1.37%)
> bkwdread-262144 765491 ( 0.00%) 768992 ( 0.46%) 780876 ( 1.97%) 780021 ( 1.86%)
> bkwdread-524288 3688 ( 0.00%) 3595 (-2.59%) 3577 (-3.10%) 3724 ( 0.97%)
>
> The upcoming changes for 2.6.33 also help iozone in many cases, often by more
> than just disabling low_latency. It has the occasional massive gain or loss
> for the larger file sizes. I don't know why this is but as the big losses
> appear to be mostly in the write-tests, I would guess that it's differences
> in heavy-writer-throttling.

I wonder if 2.6.33 + my async rampup patch will improve still further,
maybe reaching the low_latency=0 performance also for writing tests.

>
> The only downside with block-2.6.33 is that there are a lot of patches in
> there and doesn't help with the 2.6.32 release as such. I could do a reverse
> bisect to see what helps the most in there but under ideal conditions, it'll
> take 3 days to complete and I wouldn't be able to start until Monday as I'm
> out of the country for the weekend. That's a bit late.

Bisect will likely not help, since we have several patch series with
heavy internal dependencies in that tree.
If one of the patch series is found to bring the improvement, you have
to backport the entire series, that is not advisable for a rc8 or for
stable.

>
> p.s. As a consequence of being out of the country, I also won't be able to
> respond to mail over the weekend.
>
> --
> Mel Gorman
>

Thanks for the detailed report
Corrado

2009-11-27 12:16:27

[permalink] [raw]

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Fri, Nov 27, 2009 at 02:58:26PM +0900, KOSAKI Motohiro wrote:
> > > <SNIP>
> > > low_latency was tested on other scenarios:
> > > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> > > where it improved actual and perceived performance, so disabling it
> > > completely may not be good.
> > >
> >
> > It may not indeed.
> >
> > In case you mean a partial disabling of cfq_latency, I'm try the
> > following patch. The intention is to disable the low_latency logic if
> > kswapd is at work and presumably needs clean pages. Alternative
> > suggestions welcome.
>
> I'd like to treat vmscan writeout as special, because:
> - vmscan runs in various process contexts, but the pages it writes out do
> not belong to the writing process. IOW, it doesn't match cfq's I/O fairness logic well.
> - plus, the above means vmscan writeout doesn't need good I/O latency.

While it might not need good latency as such, it does need pages to be
clean because direct reclaim has trouble cleaning pages on its own
behalf.

> - vmscan maintains page-granularity LRU lists. That means vmscan issues awfully
> seeky I/O, and it assumes the block layer buffers many I/O requests.
> - plus, the above means vmscan writeout needs good I/O throughput; otherwise
> the system might hang.
>
> However, I don't think kswapd_awake is a good choice, because:
> - zone reclaim runs before kswapd wakes up. IOW, this patch doesn't solve the problem on HPC machines.
> BTW, some Core i7 boxes (at least Intel's reference box) also use zone reclaim.

Good point.

> - On large machines (with many memory nodes), at least one of the many kswapd threads is always running.
>

Also true.

>
> Instead, would PF_MEMALLOC be a good idea?
>

It doesn't work out either because a process with PF_MEMALLOC is in
direct reclaim and like kswapd, it may not be able to clean the pages at
all, let alone in a small period of time.

>
> Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim
>
> Not-Signed-off-by: KOSAKI Motohiro <[email protected]> (I haven't tested this)
> ---
> block/cfq-iosched.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..9546f64 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> * We also ramp up the dispatch depth gradually for async IO,
> * based on the last sync IO we serviced
> */
> - if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> + if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
> + !(current->flags & PF_MEMALLOC)) {
> unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
> unsigned int depth;
>
> --
> 1.6.5.2
>
>
>
>
>
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-27 15:58:40

[permalink] [raw]

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote:
> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:
> > On Thu, Nov 26, 2009 at 04:18:18PM +0100, Corrado Zoccolo wrote:
> >> > <SNIP>
> >> >
> >> > In case you mean a partial disabling of cfq_latency, I'm try the
> >> > following patch. The intention is to disable the low_latency logic if
> >> > kswapd is at work and presumably needs clean pages. Alternative
> >> > suggestions welcome.
> >
> > As it turned out, that patch sucked so I aborted the test and I need to
> > think about it a lot more.
>
> What about using the dirty ratio, instead of checking if kswapd is running?
>

How would one go about selecting the proper ratio at which to disable
the low_latency logic?

> >> Yes, I meant exactly to disable that part, and doing it when kswapd is
> >> active is probably a good choice.
> >> I have a different idea for 2.6.33, though.
> >> If you have a reliable reproducer of the issue, can you test it on
> >> git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33?
> >> It may already be unaffected, since we had various performance
> >> improvements there, but I think a better way to boost writeback is
> >> possible.
> >>
> >
> > I haven't tested the high-order allocation scenario yet but the results
> > as thing stands are below. There are four kernels being compared
> >
> > 1. with-low-latency               is 2.6.32-rc8 vanilla
> > 2. with-low-latency-block-2.6.33  is with the for-2.6.33 from linux-block applied
> > 3. with-low-latency-async-rampup  is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
> > 4. without-low-latency            is with low_latency disabled
> >
> > SYSBENCH
> >                 sysbench-with       low-latency       low-latency  sysbench-without
> >                   low-latency      block-2.6.33      async-rampup       low-latency
> >           1  1266.02 ( 0.00%)   824.08 (-53.63%)  1265.15 (-0.07%)  1278.55 ( 0.98%)
> >           2  1182.58 ( 0.00%)  1226.42 ( 3.57%)  1223.03 ( 3.31%)  1379.25 (14.26%)
> >           3  1218.64 ( 0.00%)  1271.38 ( 4.15%)  1246.42 ( 2.23%)  1580.08 (22.87%)
> >           4  1212.11 ( 0.00%)  1257.84 ( 3.64%)  1325.17 ( 8.53%)  1534.17 (20.99%)
> >           5  1046.77 ( 0.00%)   981.71 (-6.63%)  1008.44 (-3.80%)  1552.48 (32.57%)
> >           6  1187.14 ( 0.00%)  1132.89 (-4.79%)  1147.18 (-3.48%)  1661.19 (28.54%)
> >           7  1179.37 ( 0.00%)  1183.61 ( 0.36%)  1202.49 ( 1.92%)   790.26 (-49.24%)
> >           8  1164.62 ( 0.00%)  1143.54 (-1.84%)  1184.56 ( 1.68%)   854.10 (-36.36%)
> >           9  1095.22 ( 0.00%)  1178.72 ( 7.08%)  1002.42 (-9.26%)  1655.04 (33.83%)
> >          10  1147.52 ( 0.00%)  1153.46 ( 0.52%)  1151.73 ( 0.37%)  1653.89 (30.62%)
> >          11   823.38 ( 0.00%)   820.64 (-0.33%)   754.15 (-9.18%)  1627.45 (49.41%)
> >          12   813.73 ( 0.00%)   791.44 (-2.82%)   848.32 ( 4.08%)  1494.63 (45.56%)
> >          13   898.22 ( 0.00%)   789.63 (-13.75%)   931.47 ( 3.57%)  1521.64 (40.97%)
> >          14   873.50 ( 0.00%)   938.90 ( 6.97%)   875.75 ( 0.26%)  1311.09 (33.38%)
> >          15   808.32 ( 0.00%)   979.88 (17.51%)   877.87 ( 7.92%)  1009.70 (19.94%)
> >          16   758.17 ( 0.00%)  1096.81 (30.87%)   881.23 (13.96%)   725.17 (-4.55%)
> >
> > sysbench is helped by both both block-2.6.33 and async-rampup to some
> > extent. For many of the results, plain old disabling low_latency still
> > helps the most.
> >
> > desktop-net-gitk
> >                     gitk-with       low-latency       low-latency      gitk-without
> >                   low-latency      block-2.6.33      async-rampup       low-latency
> > min            954.46 ( 0.00%)   570.06 (40.27%)   796.22 (16.58%)   640.65 (32.88%)
> > mean           964.79 ( 0.00%)   573.96 (40.51%)   798.01 (17.29%)   655.57 (32.05%)
> > stddev          10.01 ( 0.00%)     2.65 (73.55%)     1.91 (80.95%)    13.33 (-33.18%)
> > max            981.23 ( 0.00%)   577.21 (41.17%)   800.91 (18.38%)   675.65 (31.14%)
> >
> > The changes for block in 2.6.33 make a massive difference here, notably
> > beating the disabling of low_latency.
>
> Yes. These are read of lots of small files, so the improvements for
> seeky workload we introduced in 2.6.33 helps a lot here.

Ok, good to know

> >
> > IOZone
> >                           iozone-with           low-latency           low-latency        iozone-without
> >                           low-latency          block-2.6.33          async-rampup           low-latency
> > write-64               151212 ( 0.00%)       163359 ( 7.44%)       163359 ( 7.44%)       159856 ( 5.41%)
> > write-128              189357 ( 0.00%)       184922 (-2.40%)       202805 ( 6.63%)       206233 ( 8.18%)
> > write-256              219883 ( 0.00%)       211232 (-4.10%)       189867 (-15.81%)       223174 ( 1.47%)
> > write-512              224932 ( 0.00%)       222601 (-1.05%)       204459 (-10.01%)       220227 (-2.14%)
> > write-1024             227738 ( 0.00%)       226728 (-0.45%)       216009 (-5.43%)       226155 (-0.70%)
> > write-2048             227564 ( 0.00%)       224167 (-1.52%)       229387 ( 0.79%)       224848 (-1.21%)
> > write-4096             208556 ( 0.00%)       227707 ( 8.41%)       216908 ( 3.85%)       223430 ( 6.66%)
> > write-8192             219484 ( 0.00%)       222365 ( 1.30%)       217737 (-0.80%)       219389 (-0.04%)
> > write-16384            206670 ( 0.00%)       209355 ( 1.28%)       204146 (-1.24%)       206295 (-0.18%)
> > write-32768            203023 ( 0.00%)       205097 ( 1.01%)       199766 (-1.63%)       201852 (-0.58%)
> > write-65536            162134 ( 0.00%)       196670 (17.56%)       189975 (14.66%)       189173 (14.29%)
> > write-131072            68534 ( 0.00%)        69145 ( 0.88%)        64519 (-6.22%)        67417 (-1.66%)
> > write-262144            32936 ( 0.00%)        28587 (-15.21%)        31470 (-4.66%)        27750 (-18.69%)
> > write-524288            24044 ( 0.00%)        23560 (-2.05%)        23116 (-4.01%)        23759 (-1.20%)
> > rewrite-64             755681 ( 0.00%)       800767 ( 5.63%)       469931 (-60.81%)       755681 ( 0.00%)
> > rewrite-128            581518 ( 0.00%)       639723 ( 9.10%)       591774 ( 1.73%)       799840 (27.30%)
> > rewrite-256            639427 ( 0.00%)       710511 (10.00%)       666414 ( 4.05%)       659861 ( 3.10%)
> > rewrite-512            669577 ( 0.00%)       743788 ( 9.98%)       692017 ( 3.24%)       684954 ( 2.24%)
> > rewrite-1024           680960 ( 0.00%)       755195 ( 9.83%)       701422 ( 2.92%)       686182 ( 0.76%)
> > rewrite-2048           685263 ( 0.00%)       743123 ( 7.79%)       703445 ( 2.58%)       692780 ( 1.09%)
> > rewrite-4096           631352 ( 0.00%)       686776 ( 8.07%)       640007 ( 1.35%)       643266 ( 1.85%)
> > rewrite-8192           442146 ( 0.00%)       474089 ( 6.74%)       457768 ( 3.41%)       442624 ( 0.11%)
> > rewrite-16384          428641 ( 0.00%)       454857 ( 5.76%)       442896 ( 3.22%)       432613 ( 0.92%)
> > rewrite-32768          425361 ( 0.00%)       444206 ( 4.24%)       434472 ( 2.10%)       430568 ( 1.21%)
> > rewrite-65536          405183 ( 0.00%)       433898 ( 6.62%)       419843 ( 3.49%)       389242 (-4.10%)
> > rewrite-131072          66110 ( 0.00%)        58370 (-13.26%)        54342 (-21.66%)        58472 (-13.06%)
> > rewrite-262144          29254 ( 0.00%)        24665 (-18.61%)        25710 (-13.78%)        29306 ( 0.18%)
> > rewrite-524288          23812 ( 0.00%)        20742 (-14.80%)        22490 (-5.88%)        24543 ( 2.98%)
> > read-64                934589 ( 0.00%)      1160938 (19.50%)      1004538 ( 6.96%)       840903 (-11.14%)
> > read-128              1601534 ( 0.00%)      1869179 (14.32%)      1681806 ( 4.77%)      1280633 (-25.06%)
> > read-256              1255511 ( 0.00%)      1526887 (17.77%)      1304314 ( 3.74%)      1310683 ( 4.21%)
> > read-512              1291158 ( 0.00%)      1377278 ( 6.25%)      1336145 ( 3.37%)      1319723 ( 2.16%)
> > read-1024             1319408 ( 0.00%)      1306564 (-0.98%)      1368162 ( 3.56%)      1347557 ( 2.09%)
> > read-2048             1316016 ( 0.00%)      1394645 ( 5.64%)      1339827 ( 1.78%)      1347393 ( 2.33%)
> > read-4096             1253710 ( 0.00%)      1307525 ( 4.12%)      1247519 (-0.50%)      1251882 (-0.15%)
> > read-8192              995149 ( 0.00%)      1033337 ( 3.70%)      1016944 ( 2.14%)      1011794 ( 1.65%)
> > read-16384             883156 ( 0.00%)       905213 ( 2.44%)       905213 ( 2.44%)       897458 ( 1.59%)
> > read-32768             844368 ( 0.00%)       855213 ( 1.27%)       849609 ( 0.62%)       856364 ( 1.40%)
> > read-65536             816099 ( 0.00%)       839262 ( 2.76%)       835019 ( 2.27%)       826473 ( 1.26%)
> > read-131072            818055 ( 0.00%)       837369 ( 2.31%)       828230 ( 1.23%)       824351 ( 0.76%)
> > read-262144            827225 ( 0.00%)       839635 ( 1.48%)       840538 ( 1.58%)       835693 ( 1.01%)
> > read-524288             24653 ( 0.00%)        21387 (-15.27%)        20602 (-19.66%)        22519 (-9.48%)
> > reread-64             2329708 ( 0.00%)      2251544 (-3.47%)      1985134 (-17.36%)      1985134 (-17.36%)
> > reread-128            1446222 ( 0.00%)      1979446 (26.94%)      2009076 (28.02%)      2137031 (32.33%)
> > reread-256            1828508 ( 0.00%)      2006158 ( 8.86%)      1892980 ( 3.41%)      1879725 ( 2.72%)
> > reread-512            1521718 ( 0.00%)      1642783 ( 7.37%)      1508887 (-0.85%)      1579934 ( 3.68%)
> > reread-1024           1347557 ( 0.00%)      1422540 ( 5.27%)      1384034 ( 2.64%)      1375171 ( 2.01%)
> > reread-2048           1340664 ( 0.00%)      1413929 ( 5.18%)      1372364 ( 2.31%)      1350783 ( 0.75%)
> > reread-4096           1259592 ( 0.00%)      1324868 ( 4.93%)      1273788 ( 1.11%)      1284839 ( 1.96%)
> > reread-8192           1007285 ( 0.00%)      1033710 ( 2.56%)      1027159 ( 1.93%)      1011317 ( 0.40%)
> > reread-16384           891404 ( 0.00%)       910828 ( 2.13%)       916562 ( 2.74%)       905022 ( 1.50%)
> > reread-32768           850492 ( 0.00%)       859341 ( 1.03%)       856385 ( 0.69%)       862772 ( 1.42%)
> > reread-65536           836565 ( 0.00%)       852664 ( 1.89%)       852315 ( 1.85%)       847020 ( 1.23%)
> > reread-131072          844516 ( 0.00%)       862590 ( 2.10%)       854067 ( 1.12%)       853155 ( 1.01%)
> > reread-262144          851524 ( 0.00%)       860559 ( 1.05%)       864921 ( 1.55%)       860653 ( 1.06%)
> > reread-524288           24927 ( 0.00%)        21300 (-17.03%)        19748 (-26.23%)        22487 (-10.85%)
> > randread-64           1605256 ( 0.00%)      1605256 ( 0.00%)      1605256 ( 0.00%)      1775099 ( 9.57%)
> > randread-128          1179358 ( 0.00%)      1582649 (25.48%)      1511363 (21.97%)      1528576 (22.85%)
> > randread-256          1421755 ( 0.00%)      1599680 (11.12%)      1460430 ( 2.65%)      1310683 (-8.47%)
> > randread-512          1306873 ( 0.00%)      1278855 (-2.19%)      1243315 (-5.11%)      1281909 (-1.95%)
> > randread-1024         1201314 ( 0.00%)      1254656 ( 4.25%)      1190657 (-0.90%)      1231629 ( 2.46%)
> > randread-2048         1179413 ( 0.00%)      1227971 ( 3.95%)      1185272 ( 0.49%)      1190529 ( 0.93%)
> > randread-4096         1107005 ( 0.00%)      1160862 ( 4.64%)      1110727 ( 0.34%)      1116792 ( 0.88%)
> > randread-8192 ? ? ? ? ?894337 ( 0.00%) ? ? ? 924264 ( 3.24%) ? ? ? 912676 ( 2.01%) ? ? ? 899487 ( 0.57%)
> > randread-16384 ? ? ? ? 783760 ( 0.00%) ? ? ? 800299 ( 2.07%) ? ? ? 793351 ( 1.21%) ? ? ? 791341 ( 0.96%)
> > randread-32768 ? ? ? ? 740498 ( 0.00%) ? ? ? 743720 ( 0.43%) ? ? ? 741233 ( 0.10%) ? ? ? 743511 ( 0.41%)
> > randread-65536 ? ? ? ? 721640 ( 0.00%) ? ? ? 727692 ( 0.83%) ? ? ? 726984 ( 0.74%) ? ? ? 728139 ( 0.89%)
> > randread-131072 ? ? ? ?715284 ( 0.00%) ? ? ? 722094 ( 0.94%) ? ? ? 717746 ( 0.34%) ? ? ? 720825 ( 0.77%)
> > randread-262144 ? ? ? ?709855 ( 0.00%) ? ? ? 706770 (-0.44%) ? ? ? 709133 (-0.10%) ? ? ? 714943 ( 0.71%)
> > randread-524288 ? ? ? ? ? 394 ( 0.00%) ? ? ? ? ?421 ( 6.41%) ? ? ? ? ?418 ( 5.74%) ? ? ? ? ?431 ( 8.58%)
> > randwrite-64 ? ? ? ? ? 730988 ( 0.00%) ? ? ? 764288 ( 4.36%) ? ? ? 723111 (-1.09%) ? ? ? 730988 ( 0.00%)
> > randwrite-128 ? ? ? ? ?746459 ( 0.00%) ? ? ? 799840 ( 6.67%) ? ? ? 746459 ( 0.00%) ? ? ? 742331 (-0.56%)
> > randwrite-256 ? ? ? ? ?695778 ( 0.00%) ? ? ? 752329 ( 7.52%) ? ? ? 720041 ( 3.37%) ? ? ? 727850 ( 4.41%)
> > randwrite-512 ? ? ? ? ?666253 ( 0.00%) ? ? ? 722760 ( 7.82%) ? ? ? 667081 ( 0.12%) ? ? ? 691126 ( 3.60%)
> > randwrite-1024 ? ? ? ? 651223 ( 0.00%) ? ? ? 697776 ( 6.67%) ? ? ? 663292 ( 1.82%) ? ? ? 659625 ( 1.27%)
> > randwrite-2048 ? ? ? ? 655558 ( 0.00%) ? ? ? 691887 ( 5.25%) ? ? ? 665720 ( 1.53%) ? ? ? 664073 ( 1.28%)
> > randwrite-4096 ? ? ? ? 635556 ( 0.00%) ? ? ? 662721 ( 4.10%) ? ? ? 643170 ( 1.18%) ? ? ? 642400 ( 1.07%)
> > randwrite-8192 ? ? ? ? 467357 ( 0.00%) ? ? ? 491364 ( 4.89%) ? ? ? 476720 ( 1.96%) ? ? ? 469734 ( 0.51%)
> > randwrite-16384 ? ? ? ?413188 ( 0.00%) ? ? ? 427521 ( 3.35%) ? ? ? 417353 ( 1.00%) ? ? ? 417282 ( 0.98%)
> > randwrite-32768 ? ? ? ?404161 ( 0.00%) ? ? ? 411721 ( 1.84%) ? ? ? 404942 ( 0.19%) ? ? ? 407580 ( 0.84%)
> > randwrite-65536 ? ? ? ?379372 ( 0.00%) ? ? ? 397312 ( 4.52%) ? ? ? 386853 ( 1.93%) ? ? ? 381273 ( 0.50%)
> > randwrite-131072 ? ? ? ?21780 ( 0.00%) ? ? ? ?16924 (-28.69%) ? ? ? ?21177 (-2.85%) ? ? ? ?19758 (-10.23%)
> > randwrite-262144 ? ? ? ? 6249 ( 0.00%) ? ? ? ? 5548 (-12.64%) ? ? ? ? 6370 ( 1.90%) ? ? ? ? 6316 ( 1.06%)
> > randwrite-524288 ? ? ? ? 2915 ( 0.00%) ? ? ? ? 2582 (-12.90%) ? ? ? ? 2871 (-1.53%) ? ? ? ? 2859 (-1.96%)
> > bkwdread-64 ? ? ? ? ? 1141196 ( 0.00%) ? ? ?1141196 ( 0.00%) ? ? ?1004538 (-13.60%) ? ? ?1141196 ( 0.00%)
> > bkwdread-128 ? ? ? ? ?1066865 ( 0.00%) ? ? ?1386465 (23.05%) ? ? ?1400936 (23.85%) ? ? ?1101900 ( 3.18%)
> > bkwdread-256 ? ? ? ? ? 877797 ( 0.00%) ? ? ?1105556 (20.60%) ? ? ?1105556 (20.60%) ? ? ?1105556 (20.60%)
> > bkwdread-512 ? ? ? ? ?1133103 ( 0.00%) ? ? ?1162547 ( 2.53%) ? ? ?1175271 ( 3.59%) ? ? ?1162547 ( 2.53%)
> > bkwdread-1024 ? ? ? ? 1163562 ( 0.00%) ? ? ?1206714 ( 3.58%) ? ? ?1213534 ( 4.12%) ? ? ?1195962 ( 2.71%)
> > bkwdread-2048 ? ? ? ? 1163439 ( 0.00%) ? ? ?1218910 ( 4.55%) ? ? ?1204552 ( 3.41%) ? ? ?1204552 ( 3.41%)
> > bkwdread-4096 ? ? ? ? 1116792 ( 0.00%) ? ? ?1175477 ( 4.99%) ? ? ?1159922 ( 3.72%) ? ? ?1150600 ( 2.94%)
> > bkwdread-8192 ? ? ? ? ?912288 ( 0.00%) ? ? ? 935233 ( 2.45%) ? ? ? 944695 ( 3.43%) ? ? ? 934724 ( 2.40%)
> > bkwdread-16384 ? ? ? ? 817707 ( 0.00%) ? ? ? 824140 ( 0.78%) ? ? ? 832527 ( 1.78%) ? ? ? 829152 ( 1.38%)
> > bkwdread-32768 ? ? ? ? 775898 ( 0.00%) ? ? ? 773714 (-0.28%) ? ? ? 785494 ( 1.22%) ? ? ? 787691 ( 1.50%)
> > bkwdread-65536 ? ? ? ? 759643 ( 0.00%) ? ? ? 769924 ( 1.34%) ? ? ? 778780 ( 2.46%) ? ? ? 772174 ( 1.62%)
> > bkwdread-131072 ? ? ? ?763215 ( 0.00%) ? ? ? 769634 ( 0.83%) ? ? ? 773707 ( 1.36%) ? ? ? 773816 ( 1.37%)
> > bkwdread-262144 ? ? ? ?765491 ( 0.00%) ? ? ? 768992 ( 0.46%) ? ? ? 780876 ( 1.97%) ? ? ? 780021 ( 1.86%)
> > bkwdread-524288 ? ? ? ? ?3688 ( 0.00%) ? ? ? ? 3595 (-2.59%) ? ? ? ? 3577 (-3.10%) ? ? ? ? 3724 ( 0.97%)
> >
> > The upcoming changes for 2.6.33 also help iozone in many cases, often by more
> > than just disabling low_latency. It has the occasional massive gain or loss
> > for the larger file sizes. I don't know why this is but as the big losses
> > appear to be mostly in the write-tests, I would guess that it's differences
> > in heavy-writer-throttling.
>
> I wonder if 2.6.33 + my async rampup patch will improve still further,
> maybe reaching the low_latency=0 performance also for writing tests.

It might. I haven't tested it yet as the machine is tied up. However, even if
it does, it will not help 2.6.32 if those patches are only being considered
for 2.6.33.

> >
> > The only downside with block-2.6.33 is that there are a lot of patches in
> > there and doesn't help with the 2.6.32 release as such. I could do a reverse
> > bisect to see what helps the most in there but under ideal conditions, it'll
> > take 3 days to complete and I wouldn't be able to start until Monday as I'm
> > out of the country for the weekend. That's a bit late.
>
> Bisect will likely not help, since we have several patch series with
> heavy internal dependencies in that tree.
> If one of the patch series is found to bring the improvement, you have
> to backport the entire series, that is not advisable for a rc8 or for
> stable.

Scratch that then.

I did a quick test for when high-order-atomic-allocations-for-network
are happening but the results are not great. By quick test, I mean I
only did the gitk tests as there wasn't time to do the sysbench and
iozone tests as well before I'd go offline.

desktop-net-gitk
                     high-with          low-latency          low-latency         high-without
                   low-latency         block-2.6.33         async-rampup          low-latency
min            861.03 ( 0.00%)      467.83 (45.67%)     1185.51 (-37.69%)     303.43 (64.76%)
mean           866.60 ( 0.00%)      616.28 (28.89%)     1201.82 (-38.68%)     459.69 (46.96%)
stddev           4.39 ( 0.00%)       86.90 (-1877.46%)    23.63 (-437.75%)     92.75 (-2010.76%)
max            872.56 ( 0.00%)      679.36 (22.14%)     1242.63 (-42.41%)     537.31 (38.42%)
pgalloc-fail       25 ( 0.00%)          10 (50.00%)          39 (-95.00%)         20 ( 0.00%)

The patches for 2.6.33 help a little all right but the async-rampup
patches both make the performance worse and cause more page allocation
failures to occur. In other words, on most machines it'll appear fine
but people with wireless cards doing high-order allocations may run into
trouble.

Disabling low_latency again helps performance significantly in this
scenario. There were still page allocation failures because not all the
patches related to that problem made it to mainline.

I was somewhat aggravated by the page allocation failures until I remembered
that there are three patches in -mm that I failed to convince either Jens or
Andrew of them being suitable for mainline. When they are added to the mix,
the results are as follows:

desktop-net-gitk
                  atomics-with          low-latency          low-latency      atomics-without
                   low-latency         block-2.6.33         async-rampup          low-latency
min            641.12 ( 0.00%)      627.91 ( 2.06%)     1254.75 (-95.71%)     375.05 (41.50%)
mean           743.61 ( 0.00%)      631.20 (15.12%)     1272.70 (-71.15%)     389.71 (47.59%)
stddev          60.30 ( 0.00%)        2.53 (95.80%)       10.64 (82.35%)       22.38 (62.89%)
max            793.85 ( 0.00%)      633.76 (20.17%)     1281.65 (-61.45%)     428.41 (46.03%)
pgalloc-fail        3 ( 0.00%)           2 ( 0.00%)          23 ( 0.00%)           0 ( 0.00%)

Again, plain old disabling low_latency both performs the best and fails page
allocations the least. The three patches for page allocation failures that are
in -mm but not mainline are:

[PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
[PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
[PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

It still seems to be that the route of least damage is to disable low_latency
by default for 2.6.32. It's very unfortunate that I wasn't able to fully
justify the 3 patches for page allocation failures in time but all that
can be done there is consider them for -stable I suppose.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-27 18:14:37

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <[email protected]> wrote:
> On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote:
>> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:

> How would one go about selecting the proper ratio at which to disable
> the low_latency logic?

Can we measure the dirty ratio when the allocation failures start to happen?

>> >
>> > I haven't tested the high-order allocation scenario yet but the results
>> > as things stand are below. There are four kernels being compared:
>> >
>> > 1. with-low-latency               is 2.6.32-rc8 vanilla
>> > 2. with-low-latency-block-2.6.33  is with the for-2.6.33 from linux-block applied
>> > 3. with-low-latency-async-rampup  is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
>> > 4. without-low-latency            is with low_latency disabled
>> >
>> > desktop-net-gitk
>> >                      gitk-with          low-latency          low-latency         gitk-without
>> >                    low-latency         block-2.6.33         async-rampup          low-latency
>> > min            954.46 ( 0.00%)      570.06 (40.27%)      796.22 (16.58%)      640.65 (32.88%)
>> > mean           964.79 ( 0.00%)      573.96 (40.51%)      798.01 (17.29%)      655.57 (32.05%)
>> > stddev          10.01 ( 0.00%)        2.65 (73.55%)        1.91 (80.95%)       13.33 (-33.18%)
>> > max            981.23 ( 0.00%)      577.21 (41.17%)      800.91 (18.38%)      675.65 (31.14%)
>> >
>> > The changes for block in 2.6.33 make a massive difference here, notably
>> > beating the disabling of low_latency.
>>
> I did a quick test for when high-order-atomic-allocations-for-network
> are happening but the results are not great. By quick test, I mean I
> only did the gitk tests as there wasn't time to do the sysbench and
> iozone tests as well before I'd go offline.
>
> desktop-net-gitk
>                      high-with          low-latency          low-latency         high-without
>                    low-latency         block-2.6.33         async-rampup          low-latency
> min            861.03 ( 0.00%)      467.83 (45.67%)     1185.51 (-37.69%)     303.43 (64.76%)
> mean           866.60 ( 0.00%)      616.28 (28.89%)     1201.82 (-38.68%)     459.69 (46.96%)
> stddev           4.39 ( 0.00%)       86.90 (-1877.46%)    23.63 (-437.75%)     92.75 (-2010.76%)
> max            872.56 ( 0.00%)      679.36 (22.14%)     1242.63 (-42.41%)     537.31 (38.42%)
> pgalloc-fail       25 ( 0.00%)          10 (50.00%)          39 (-95.00%)         20 ( 0.00%)
>
> The patches for 2.6.33 help a little all right but the async-rampup
> patches both make the performance worse and cause more page allocation
> failures to occur. In other words, on most machines it'll appear fine
> but people with wireless cards doing high-order allocations may run into
> trouble.
>
> Disabling low_latency again helps performance significantly in this
> scenario. There were still page allocation failures because not all the
> patches related to that problem made it to mainline.

I'm puzzled that almost all kernels, except the async rampup one,
perform better with high-order allocations enabled than in the
previous test.

> I was somewhat aggravated by the page allocation failures until I remembered
> that there are three patches in -mm that I failed to convince either Jens or
> Andrew of them being suitable for mainline. When they are added to the mix,
> the results are as follows:
>
> desktop-net-gitk
>                   atomics-with          low-latency          low-latency      atomics-without
>                    low-latency         block-2.6.33         async-rampup          low-latency
> min            641.12 ( 0.00%)      627.91 ( 2.06%)     1254.75 (-95.71%)     375.05 (41.50%)
> mean           743.61 ( 0.00%)      631.20 (15.12%)     1272.70 (-71.15%)     389.71 (47.59%)
> stddev          60.30 ( 0.00%)        2.53 (95.80%)       10.64 (82.35%)       22.38 (62.89%)
> max            793.85 ( 0.00%)      633.76 (20.17%)     1281.65 (-61.45%)     428.41 (46.03%)
> pgalloc-fail        3 ( 0.00%)           2 ( 0.00%)          23 ( 0.00%)           0 ( 0.00%)
>

Those patches penalize block-2.6.33, which was the kernel with the lowest
number of failures in the previous test.
I think the heuristics were tailored to 2.6.32. They need to be
re-tuned for 2.6.33.

> Again, plain old disabling low_latency both performs the best and fails page
> allocations the least. The three patches for page allocation failures that are
> in -mm but not mainline are:
>
> [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
>
> It still seems to be that the route of least damage is to disable low_latency
> by default for 2.6.32. It's very unfortunate that I wasn't able to fully
> justify the 3 patches for page allocation failures in time but all that
> can be done there is consider them for -stable I suppose.

Just disabling low_latency will not solve the allocation issues (20
instead of 25).
Moreover, it will improve some workloads, but penalize others.

Your 3 patches, though, seem to improve the situation with
low_latency enabled as well, both for performance and allocation failures
(25 to 3). Having those 3 patches with low_latency enabled seems better,
since it won't penalize the workloads that benefit from
low_latency (if you add a sequential read to your test, you should see
a big difference).

Thanks,
Corrado

2009-11-27 18:52:35

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote:
> On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <[email protected]> wrote:
> > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote:
> >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:
>
> > How would one go about selecting the proper ratio at which to disable
> > the low_latency logic?
>
> Can we measure the dirty ratio when the allocation failures start to happen?
>

Would the number of dirty pages in the page allocation failure message to
kern.log be enough? You won't get them all because of printk suppress but
it's something. Alternatively, tell me exactly what stats from /proc you
want and I'll stick a monitor on there. Assuming you want nr_dirty vs total
number of pages though, the monitor tends to execute too late to be useful.
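
For reference, a monitor of the kind discussed here could be sketched roughly as below. This is only an illustration in Python, not something posted in the thread: the helper names (parse_vmstat, mem_total_pages, dirty_ratio, sample) are hypothetical, and it assumes nr_dirty is taken from /proc/vmstat and the total page count is derived from MemTotal in /proc/meminfo. As noted above, a userspace poller like this may well run too late to catch the moment of failure.

```python
import os


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def mem_total_pages(meminfo_text, page_size):
    """Convert the MemTotal line of /proc/meminfo (in kB) to a page count."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kb = int(line.split()[1])
            return kb * 1024 // page_size
    raise ValueError("MemTotal not found")


def dirty_ratio(vmstat_text, meminfo_text, page_size=4096):
    """Fraction of all pages that are currently dirty (nr_dirty / total)."""
    stats = parse_vmstat(vmstat_text)
    return stats["nr_dirty"] / mem_total_pages(meminfo_text, page_size)


def sample():
    """Take one sample from the live system (assumes a Linux /proc)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    with open("/proc/vmstat") as v, open("/proc/meminfo") as m:
        return dirty_ratio(v.read(), m.read(), page_size)
```

Calling sample() once a second and logging the result next to a timestamp would give a series that can be lined up against the allocation failure messages in kern.log.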

> >> >
> >> > I haven't tested the high-order allocation scenario yet but the results
> >> > as things stand are below. There are four kernels being compared:
> >> >
> >> > 1. with-low-latency               is 2.6.32-rc8 vanilla
> >> > 2. with-low-latency-block-2.6.33  is with the for-2.6.33 from linux-block applied
> >> > 3. with-low-latency-async-rampup  is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula"
> >> > 4. without-low-latency            is with low_latency disabled
> >> >
> >> > desktop-net-gitk
> >> >                      gitk-with          low-latency          low-latency         gitk-without
> >> >                    low-latency         block-2.6.33         async-rampup          low-latency
> >> > min            954.46 ( 0.00%)      570.06 (40.27%)      796.22 (16.58%)      640.65 (32.88%)
> >> > mean           964.79 ( 0.00%)      573.96 (40.51%)      798.01 (17.29%)      655.57 (32.05%)
> >> > stddev          10.01 ( 0.00%)        2.65 (73.55%)        1.91 (80.95%)       13.33 (-33.18%)
> >> > max            981.23 ( 0.00%)      577.21 (41.17%)      800.91 (18.38%)      675.65 (31.14%)
> >> >
> >> > The changes for block in 2.6.33 make a massive difference here, notably
> >> > beating the disabling of low_latency.
> >>
> > I did a quick test for when high-order-atomic-allocations-for-network
> > are happening but the results are not great. By quick test, I mean I
> > only did the gitk tests as there wasn't time to do the sysbench and
> > iozone tests as well before I'd go offline.
> >
> > desktop-net-gitk
> >                      high-with          low-latency          low-latency         high-without
> >                    low-latency         block-2.6.33         async-rampup          low-latency
> > min            861.03 ( 0.00%)      467.83 (45.67%)     1185.51 (-37.69%)     303.43 (64.76%)
> > mean           866.60 ( 0.00%)      616.28 (28.89%)     1201.82 (-38.68%)     459.69 (46.96%)
> > stddev           4.39 ( 0.00%)       86.90 (-1877.46%)    23.63 (-437.75%)     92.75 (-2010.76%)
> > max            872.56 ( 0.00%)      679.36 (22.14%)     1242.63 (-42.41%)     537.31 (38.42%)
> > pgalloc-fail       25 ( 0.00%)          10 (50.00%)          39 (-95.00%)         20 ( 0.00%)
> >
> > The patches for 2.6.33 help a little all right but the async-rampup
> > patches both make the performance worse and cause more page allocation
> > failures to occur. In other words, on most machines it'll appear fine
> > but people with wireless cards doing high-order allocations may run into
> > trouble.
> >
> > Disabling low_latency again helps performance significantly in this
> > scenario. There were still page allocation failures because not all the
> > patches related to that problem made it to mainline.
>
> I'm puzzled how almost all kernels, excluding the async rampup,
> perform better when high order allocations are enabled, than in
> previous test.
>

Two major differences. First, the previous non-high-order tests had also
run sysbench and iozone, so the starting conditions are different. I had
disabled those tests to get some of the high-order figures before I went
offline. Second, and probably more important than the starting conditions,
kswapd is working to free order-2 pages and staying awake until watermarks
are reached. kswapd working harder is probably making a big difference.

> > I was somewhat aggravated by the page allocation failures until I remembered
> > that there are three patches in -mm that I failed to convince either Jens or
> > Andrew of them being suitable for mainline. When they are added to the mix,
> > the results are as follows:
> >
> > desktop-net-gitk
> >                   atomics-with          low-latency          low-latency      atomics-without
> >                    low-latency         block-2.6.33         async-rampup          low-latency
> > min            641.12 ( 0.00%)      627.91 ( 2.06%)     1254.75 (-95.71%)     375.05 (41.50%)
> > mean           743.61 ( 0.00%)      631.20 (15.12%)     1272.70 (-71.15%)     389.71 (47.59%)
> > stddev          60.30 ( 0.00%)        2.53 (95.80%)       10.64 (82.35%)       22.38 (62.89%)
> > max            793.85 ( 0.00%)      633.76 (20.17%)     1281.65 (-61.45%)     428.41 (46.03%)
> > pgalloc-fail        3 ( 0.00%)           2 ( 0.00%)          23 ( 0.00%)           0 ( 0.00%)
> >
>
> Those patches penalize block-2.6.33, that was the one with lowest
> number of failures in previous test.
> I think the heuristics were tailored to 2.6.32. They need to be
> re-tuned for 2.6.33.
>

I made a mistake in the script that was generating the summary. I neglected to
take into account printk rate suppressions. When they are taken into account,
the first round of figures look like

desktop-net-gitk
                     high-with          low-latency          low-latency         high-without
                   low-latency         block-2.6.33         async-rampup          low-latency
min            861.03 ( 0.00%)      467.83 (45.67%)     1185.51 (-37.69%)     303.43 (64.76%)
mean           866.60 ( 0.00%)      616.28 (28.89%)     1201.82 (-38.68%)     459.69 (46.96%)
stddev           4.39 ( 0.00%)       86.90 (-1877.46%)    23.63 (-437.75%)     92.75 (-2010.76%)
max            872.56 ( 0.00%)      679.36 (22.14%)     1242.63 (-42.41%)     537.31 (38.42%)
pgalloc-fail       65 ( 0.00%)          10 (84.62%)         293 (-350.77%)        20 (69.23%)
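
For readers checking the figures: the percentage next to each value in this table is consistent with a relative gain over the first (baseline) column, where lower is better. The Python sketch below is inferred from the numbers, not taken from the script that generated the summary, and gain_lower_is_better is a hypothetical name.

```python
def gain_lower_is_better(baseline, value):
    """Relative gain in percent over the baseline; positive means improvement.

    Assumption: applies to metrics where lower is better (elapsed time,
    page allocation failure counts).
    """
    return (baseline - value) / baseline * 100.0


# Corrected pgalloc-fail row, baseline first.
row = [65, 10, 293, 20]
gains = [round(gain_lower_is_better(row[0], v), 2) for v in row]
```

With this formula, going from 65 failures to 10 counts as a gain, and going from 65 to 293 as a large loss.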

So the async-rampup is getting smacked very hard with allocation failures
in the high-order case. With the three additional patches applied for
allocation failures, the figures look like

desktop-net-gitk
                  atomics-with          low-latency          low-latency      atomics-without
                   low-latency         block-2.6.33         async-rampup          low-latency
min            641.12 ( 0.00%)      627.91 ( 2.06%)     1254.75 (-95.71%)     375.05 (41.50%)
mean           743.61 ( 0.00%)      631.20 (15.12%)     1272.70 (-71.15%)     389.71 (47.59%)
stddev          60.30 ( 0.00%)        2.53 (95.80%)       10.64 (82.35%)       22.38 (62.89%)
max            793.85 ( 0.00%)      633.76 (20.17%)     1281.65 (-61.45%)     428.41 (46.03%)
pgalloc-fail        3 ( 0.00%)           2 ( 0.00%)          27 ( 0.00%)           0 ( 0.00%)

So again, async-rampup is getting smacked in terms of allocation failures
although the three additional patches help a lot. This is a real pity
because it looked nice in the tests involving no high-order allocations for
the network.

> > Again, plain old disabling low_latency both performs the best and fails page
> > allocations the least. The three patches for page allocation failures that are
> > in -mm but not mainline are:
> >
> > [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> > [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> > [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
> >
> > It still seems to be that the route of least damage is to disable low_latency
> > by default for 2.6.32. It's very unfortunate that I wasn't able to fully
> > justify the 3 patches for page allocation failures in time but all that
> > can be done there is consider them for -stable I suppose.
>
> Just disabling low_latency will not solve the allocation issues (20
> instead of 25).

20 instead of 65 and I know it doesn't fully help the problem with
high-order allocations. The patches that do help that problem aren't in
mainline but they do exist.

> Moreover, it will improve some workloads, but penalize others.
>

It really does appear to hurt a lot when the machine is kinda low on
memory though. That is a fairly common situation with a desktop loaded
up with random apps. Well..... by common, I mean I hit that situation a
lot on my laptop. I don't hit it on server workloads because I make sure
the machines are not overloaded.

> Your 3 patches, though, seem to improve the situation also for
> low_latency enabled, both for performance and allocation failures (25
> to 3). Having those 3 patches with low_latency enabled seems better,
> since it won't penalize the workloads that are benefited by
> low_latency (if you add a sequential read to your test, you should see
> a big difference).
>

This is true and I would like to see them merged. However, this close to
release, with Jens' unhappiness with the explanation of why the
congestion_wait() changes made a difference and Andrew feeling there
wasn't enough cause to merge them, I'm doubtful it'll happen. Will see
Monday what the story is.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-29 15:11:33

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote:
> On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote:
> > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <[email protected]> wrote:
> > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote:
> > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:
> > >
> > > How would one go about selecting the proper ratio at which to disable
> > > the low_latency logic?
> >
> > Can we measure the dirty ratio when the allocation failures start to
> > happen?
>
> Would the number of dirty pages in the page allocation failure message to
> kern.log be enough? You won't get them all because of printk suppress but
> it's something. Alternatively, tell me exactly what stats from /proc you
> want and I'll stick a monitor on there. Assuming you want nr_dirty vs total
> number of pages though, the monitor tends to execute too late to be useful.
>
Since I wanted to understand this more deeply, but my system is healthy,
I devised a measure of fragmentation and charted it to see what was going
wrong. Here is a Perl script that produces gnuplot-compatible output:

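use strict;
use warnings;

# Flush STDOUT after every print so the output can be followed live.
select(STDOUT);
$| = 1;

do {
    open(my $bf, "<", "/proc/buddyinfo") or die "buddyinfo: $!";
    open(my $up, "<", "/proc/uptime") or die "uptime: $!";
    my $now = <$up>;
    chomp $now;
    print $now;
    while (<$bf>) {
        # \w+ (rather than [a-zA-Z]+) so that zones such as DMA32 also match.
        next unless /Node (\d+), zone\s+(\w+)\s+(.+)$/;
        # $frag counts free blocks; $tot counts free pages (2^order per block).
        my ($frag, $tot, $val) = (0, 0, 1);
        for my $n ($3 =~ /\d+/g) { $frag += $n; $tot += $val * $n; $val <<= 1; }
        # Guard against a zone with no free pages at all.
        print "\t", ($tot ? $frag / $tot : 0);
    }
    print "\n";
    sleep 1;
} while (1);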

My definition of fragmentation is simply the number of free fragments (blocks) / the number of free pages:
* it is 1 only when all free pages are of order 0
* it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used)
* to be sure that an order-k allocation succeeds, the fragmentation should be <= 2^-k
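
The metric above can be restated as a short Python sketch. This is an illustration only, not part of the original script; the names fragmentation and parse_buddyinfo are hypothetical, and the input is assumed to follow the /proc/buddyinfo layout where column k holds the number of free blocks of order k.

```python
import re

# One /proc/buddyinfo line: "Node 0, zone   Normal   c0 c1 c2 ..."
BUDDY_RE = re.compile(r"Node\s+(\d+),\s+zone\s+(\w+)\s+(.+)$")


def fragmentation(counts):
    """counts[k] is the number of free blocks of order k.

    Returns free blocks / free pages, or 0.0 if nothing is free.
    """
    fragments = sum(counts)
    pages = sum(n << k for k, n in enumerate(counts))
    return fragments / pages if pages else 0.0


def parse_buddyinfo(text):
    """Map (node, zone) to its fragmentation value."""
    result = {}
    for line in text.splitlines():
        m = BUDDY_RE.match(line.strip())
        if m:
            counts = [int(x) for x in m.group(3).split()]
            result[(int(m.group(1)), m.group(2))] = fragmentation(counts)
    return result
```

Two extremes show the behaviour: eight free order-0 pages give a value of 1, while four free order-2 blocks (16 pages) give 4/16 = 2^-2, the bound quoted for order-2 allocations.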

I observed the mainline kernel during normal usage, and found that:
* the fragmentation is very low after boot (< 1%).
* it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order-0 allocations).
* high memory fragments first; normal memory starts to fragment only once all high memory is used.
* when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8).
* the only way to bring the fragmentation back to sane values after it starts fluctuating is to do a sync & drop caches. Even then it settles around 14%, which is still quite high.
>
> Two major differences. 1, the previous non-high-order tests had also
> run sysbench and iozone so the starting conditions are different. I had
> disabled those tests to get some of the high-order figures before I went
> offline. However, the starting conditions are probably not as important as
> the fact that kswapd is working to free order-2 pages and staying awake
> until watermarks are reached. kswapd working harder is probably making a
> big difference.
>
From my observation, running a program that fills the page cache before a test has a large impact on fragmentation.
We (block layer guys) tend to do a sync & drop caches before starting any test, so this can explain why our optimizations work best when the machine has plenty of free memory.
On the other hand, machines with plenty of memory should be the norm now, even for desktops.

>
> I made a mistake in the script that was generating the summary. I neglected
> to take into account printk rate suppressions. When they are taken into
> account, the first round of figures look like
>
> desktop-net-gitk
>                      high-with          low-latency          low-latency         high-without
>                    low-latency         block-2.6.33         async-rampup          low-latency
> min            861.03 ( 0.00%)      467.83 (45.67%)     1185.51 (-37.69%)     303.43 (64.76%)
> mean           866.60 ( 0.00%)      616.28 (28.89%)     1201.82 (-38.68%)     459.69 (46.96%)
> stddev           4.39 ( 0.00%)       86.90 (-1877.46%)    23.63 (-437.75%)     92.75 (-2010.76%)
> max            872.56 ( 0.00%)      679.36 (22.14%)     1242.63 (-42.41%)     537.31 (38.42%)
> pgalloc-fail       65 ( 0.00%)          10 (84.62%)         293 (-350.77%)        20 (69.23%)
>
> So the async-rampup is getting smacked very hard with allocation failures
> in the high-order case. With the three additional patches applied for
> allocation failures, the figures look like
>
> desktop-net-gitk
>                   atomics-with          low-latency          low-latency      atomics-without
>                    low-latency         block-2.6.33         async-rampup          low-latency
> min            641.12 ( 0.00%)      627.91 ( 2.06%)     1254.75 (-95.71%)     375.05 (41.50%)
> mean           743.61 ( 0.00%)      631.20 (15.12%)     1272.70 (-71.15%)     389.71 (47.59%)
> stddev          60.30 ( 0.00%)        2.53 (95.80%)       10.64 (82.35%)       22.38 (62.89%)
> max            793.85 ( 0.00%)      633.76 (20.17%)     1281.65 (-61.45%)     428.41 (46.03%)
> pgalloc-fail        3 ( 0.00%)           2 ( 0.00%)          27 ( 0.00%)           0 ( 0.00%)
>
> So again, async-rampup is getting smacked in terms of allocation failures
> although the three additional patches help a lot. This is a real pity
> because it looked nice in the tests involving no high-order allocations for
> the network.

Ok. Forget that patch for now. Maybe we can test it with 2.6.33 to see if it fits.
On the other hand, I saw that the problems with high-order allocations started
around 2.6.31, where we didn't have any low_latency patch. So I don't think the
solution to the problem is in the block layer. A slightly slower or faster writeback
shouldn't cause a DoS-like situation such as the one encountered with your network driver.

> > Moreover, it will improve some workloads, but penalize others.
>
> It really does appear to hurt a lot when the machine is kinda low on
> memory though. That is a fairly common situation with a desktop loaded
> up with random apps. Well..... by common, I mean I hit that situation a
> lot on my laptop. I don't hit it on server workloads because I make sure
> the machines are not overloaded.

This is why we have it as a tunable. If your workload is negatively affected,
you can switch it off. But make sure to test it thoroughly, because even if
you find a 2x slowdown in one particular circumstance, it can give a 10x
speedup (see http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html)
in others.

>
> > Your 3 patches, though, seem to improve the situation also for
> > low_latency enabled, both for performance and allocation failures (25
> > to 3). Having those 3 patches with low_latency enabled seems better,
> > since it won't penalize the workloads that are benefited by
> > low_latency (if you add a sequential read to your test, you should see
> > a big difference).
>
> This is true and I would like to see them merged. However, this close to
> release, with Jens unhappiness with the explanation of why
> congestion_wait() changes made a difference and Andrew feeling there
> wasn't enough cause to merge them, I'm doubtful it'll happen. Will see
> Monday what the story is.

After a day studying the VM, I found another way to reduce fragmentation.
With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
and decreases over time if the system is lightly used, even without dropping caches.
Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high-order
allocations are usually serviced by the other zones (more likely than with the mainline allocator).

The idea is to have 2 freelists for each zone.
free_list_0 holds the pages that are less likely to cause a higher-order merge, since the buddy of their compound is not free.
free_list_1 holds the other ones.
When expanding, we put pages into free_list_1. When freeing, we put them in the proper list by checking the buddy of the compound.
When extracting, we always extract from free_list_0 first, and fall back on the other if the first is empty.
In this way, the pages that are more likely to cause a big merge stay free longer.
Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation.

It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look.
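
The classification step can be sketched in userspace as follows. This is only an illustration, not the patch itself: buddy_index() and combined_index() mirror the arithmetic of the kernel's __page_find_buddy() and __find_combined_index(), while struct free_block, block_is_free() and pick_free_list() are hypothetical names standing in for the zone's free areas and page_is_buddy(). MAX_ORDER is assumed to be 11.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_ORDER 11 /* assumed here; the kernel's value is configurable */

/* Hypothetical stand-in for the free areas: one free block of a given order. */
struct free_block {
    unsigned long idx;   /* index of the first page of the block */
    unsigned int order;  /* the block spans 2^order pages */
};

/* Buddy of the block starting at page_idx, at the given order. */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1UL << order);
}

/* Start of the block obtained by merging page_idx with its buddy. */
static unsigned long combined_index(unsigned long page_idx, unsigned int order)
{
    return page_idx & ~(1UL << order);
}

/* Is there a free block of exactly this order starting at idx? */
static bool block_is_free(const struct free_block *blocks, size_t n,
                          unsigned long idx, unsigned int order)
{
    for (size_t i = 0; i < n; i++)
        if (blocks[i].idx == idx && blocks[i].order == order)
            return true;
    return false;
}

/*
 * Returns 1 when a block freed at (page_idx, order) belongs on free_list_1:
 * the buddy of its would-be merged block is already free, so once this
 * block's own buddy is freed, two merges can happen in a row.
 * Returns 0 (free_list_0) otherwise.
 */
static int pick_free_list(const struct free_block *blocks, size_t n,
                          unsigned long page_idx, unsigned int order)
{
    unsigned long merged_idx, merged_buddy_idx;

    if (order >= MAX_ORDER - 1)
        return 0; /* nothing larger to merge into */

    merged_idx = combined_index(page_idx, order);
    merged_buddy_idx = buddy_index(merged_idx, order + 1);
    return block_is_free(blocks, n, merged_buddy_idx, order + 1) ? 1 : 0;
}
```

For example, with a single free order-2 block at index 4, an order-1 block freed at index 2 goes to free_list_1: once its buddy at index 0 is freed, the pair merges into an order-2 block at index 0, which can in turn merge with the block at index 4.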

Signed-off-by: Corrado Zoccolo <[email protected]>

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6f75617..6427361 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page)
}

struct free_area {
- struct list_head free_list[MIGRATE_TYPES];
+ struct list_head free_list_0[MIGRATE_TYPES];
+ struct list_head free_list_1[MIGRATE_TYPES];
unsigned long nr_free;
};

diff --git a/kernel/kexec.c b/kernel/kexec.c
index f336e21..aee5ef5 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_OFFSET(zone, free_area);
VMCOREINFO_OFFSET(zone, vm_stat);
VMCOREINFO_OFFSET(zone, spanned_pages);
- VMCOREINFO_OFFSET(free_area, free_list);
+ VMCOREINFO_OFFSET(free_area, free_list_0);
+ VMCOREINFO_OFFSET(free_area, free_list_1);
VMCOREINFO_OFFSET(list_head, next);
VMCOREINFO_OFFSET(list_head, prev);
VMCOREINFO_OFFSET(vm_struct, addr);
VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
log_buf_kexec_setup();
- VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
+ VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES);
+ VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES);
VMCOREINFO_NUMBER(NR_FREE_PAGES);
VMCOREINFO_NUMBER(PG_lru);
VMCOREINFO_NUMBER(PG_private);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdcedf6..5f488d8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
int migratetype)
{
unsigned long page_idx;
+ unsigned long combined_idx;
+ bool high_order_free = false;

if (unlikely(PageCompound(page)))
if (unlikely(destroy_compound_page(page, order)))
@@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
VM_BUG_ON(bad_range(zone, page));

while (order < MAX_ORDER-1) {
- unsigned long combined_idx;
struct page *buddy;

buddy = __page_find_buddy(page, page_idx, order);
@@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
order++;
}
set_page_order(page, order);
- list_add(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+
+ if (order < MAX_ORDER-1) {
+ struct page *parent_page, *ppage_buddy;
+ combined_idx = __find_combined_index(page_idx, order);
+ parent_page = page + combined_idx - page_idx;
+ ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
+ high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
+ }
+
+ if (high_order_free)
+ list_add(&page->lru,
+ &zone->free_area[order].free_list_1[migratetype]);
+ else
+ list_add(&page->lru,
+ &zone->free_area[order].free_list_0[migratetype]);
zone->free_area[order].nr_free++;
}

@@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
high--;
size >>= 1;
VM_BUG_ON(bad_range(zone, &page[size]));
- list_add(&page[size].lru, &area->free_list[migratetype]);
+ list_add(&page[size].lru, &area->free_list_1[migratetype]);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,

/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
+ bool fl0, fl1;
area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+ fl0 = list_empty(&area->free_list_0[migratetype]);
+ fl1 = list_empty(&area->free_list_1[migratetype]);
+ if (fl0 && fl1)
continue;

- page = list_entry(area->free_list[migratetype].next,
- struct page, lru);
+ if (fl0)
+ page = list_entry(area->free_list_1[migratetype].next,
+ struct page, lru);
+ else
+ page = list_entry(area->free_list_0[migratetype].next,
+ struct page, lru);
list_del(&page->lru);
rmv_page_order(page);
area->nr_free--;
@@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone,
order = page_order(page);
list_del(&page->lru);
list_add(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+ &zone->free_area[order].free_list_0[migratetype]);
page += 1 << order;
pages_moved += 1 << order;
}
@@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
for (current_order = MAX_ORDER-1; current_order >= order;
--current_order) {
for (i = 0; i < MIGRATE_TYPES - 1; i++) {
+ bool fl0, fl1;
migratetype = fallbacks[start_migratetype][i];

/* MIGRATE_RESERVE handled later if necessary */
@@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
continue;

area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+
+
+ fl0 = list_empty(&area->free_list_0[migratetype]);
+ fl1 = list_empty(&area->free_list_1[migratetype]);
+
+ if (fl0 && fl1)
continue;

- page = list_entry(area->free_list[migratetype].next,
- struct page, lru);
+ if (fl0)
+ page = list_entry(area->free_list_1[migratetype].next,
+ struct page, lru);
+ else
+ page = list_entry(area->free_list_0[migratetype].next,
+ struct page, lru);
area->nr_free--;

/*
@@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone)
}

for_each_migratetype_order(order, t) {
- list_for_each(curr, &zone->free_area[order].free_list[t]) {
+ list_for_each(curr, &zone->free_area[order].free_list_0[t]) {
+ unsigned long i;
+
+ pfn = page_to_pfn(list_entry(curr, struct page, lru));
+ for (i = 0; i < (1UL << order); i++)
+ swsusp_set_page_free(pfn_to_page(pfn + i));
+ }
+ list_for_each(curr, &zone->free_area[order].free_list_1[t]) {
unsigned long i;

pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
{
int order, t;
for_each_migratetype_order(order, t) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]);
zone->free_area[order].nr_free = 0;
}
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c81321f..613ef1e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,

area = &(zone->free_area[order]);

- list_for_each(curr, &area->free_list[mtype])
+ list_for_each(curr, &area->free_list_0[mtype])
+ freecount++;
+ list_for_each(curr, &area->free_list_1[mtype])
freecount++;
seq_printf(m, "%6lu ", freecount);
}

2009-11-30 10:18:16

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

> On Fri, Nov 27, 2009 at 02:58:26PM +0900, KOSAKI Motohiro wrote:
> > > > <SNIP>
> > > > low_latency was tested on other scenarios:
> > > > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> > > > where it improved actual and perceived performance, so disabling it
> > > > completely may not be good.
> > > >
> > >
> > > It may not indeed.
> > >
> > > In case you mean a partial disabling of cfq_latency, I'm trying the
> > > following patch. The intention is to disable the low_latency logic if
> > > kswapd is at work and presumably needs clean pages. Alternative
> > > suggestions welcome.
> >
> > I'd like to treat vmscan writeout as special, because
> > - vmscan runs in various process contexts, but it doesn't write the process's own pages.
> > IOW, it doesn't match cfq's I/O fairness logic well.
> > - plus, the above means vmscan writeout doesn't need good I/O latency.
>
> While it might not need good latency as such, it does need pages to be
> clean because direct reclaim has trouble cleaning pages in its own
> behalf.

Well, if direct reclaim needs lumpy reclaim, you are right.

In the non-lumpy case, vmscan typically starts pageout and moves the page to the tail of the list.
The cleaned page will be used by another task.

---------------------------------------------------------------------------------------
static unsigned long shrink_page_list(struct list_head *page_list,
struct list_head *freed_pages_list,
struct scan_control *sc,
enum pageout_io sync_writeback)
{
(snip)
switch (pageout(page, mapping, sync_writeback)) {
case PAGE_KEEP:
goto keep_locked;
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
if (PageWriteback(page) || PageDirty(page))
goto keep; /////// HERE
---------------------------------------------------------------------------------------

> > - vmscan maintains a page-granularity LRU list. That means vmscan generates awfully
> > seeky I/O; it assumes the block layer buffers many I/O requests.
> > - plus, the above means vmscan writeout needs good I/O throughput; otherwise
> > the system might hang.
> >
> > However, I don't think kswapd_awake is a good choice, because
> > - zone reclaim runs before kswapd wakeup. IOW, this patch doesn't solve the problem on HPC machines.
> > btw, some Core i7 boxes (at least Intel's reference box) also use zone reclaim.
>
> Good point.
>
> > - On a large machine (many memory nodes), one of the many kswapd threads is always running.
> >
>
> Also true.
>
> >
> > Instead, would PF_MEMALLOC be a good idea?
>
> It doesn't work out either because a process with PF_MEMALLOC is in
> direct reclaim and like kswapd, it may not be able to clean the pages at
> all, let alone in a small period of time.

please forget this idea ;)

2009-11-30 12:04:30

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Sun, Nov 29, 2009 at 04:11:15PM +0100, Corrado Zoccolo wrote:
> On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote:
> : > On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote:
> > > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <[email protected]> wrote:
> > > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote:
> > > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:
> > > >
> > > > How would one go about selecting the proper ratio at which to disable
> > > > the low_latency logic?
> > >
> > > Can we measure the dirty ratio when the allocation failures start to
> > > happen?
> >
> > Would the number of dirty pages in the page allocation failure message to
> > kern.log be enough? You won't get them all because of printk suppress but
> > it's something. Alternatively, tell me exactly what stats from /proc you
> > want and I'll stick a monitor on there. Assuming you want nr_dirty vs total
> > number of pages though, the monitor tends to execute too late to be useful.
> >
> Since I wanted to understand this more deeply, but my system is healthy,
> I devised a measure of fragmentation, and wanted to chart it to understand
> what was going wrong. A Perl script that produces gnuplot-compatible output is provided:
>
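> use strict;
> use warnings;
> select(STDOUT);
> $| = 1;    # unbuffered, so the output can be charted live
> do {
>     open (my $bf, "<", "/proc/buddyinfo") or die "buddyinfo: $!";
>     open (my $up, "<", "/proc/uptime") or die "uptime: $!";
>     my $now = <$up>;    # uptime and idle time, in seconds
>     chomp $now;
>     print $now;
>     while (<$bf>) {
>         # zone names may contain digits (e.g. DMA32)
>         next unless /Node (\d+), zone\s+([a-zA-Z0-9]+)\s+(.+)$/;
>         # $frag counts free blocks; $tot counts free pages (2^order per block)
>         my ($frag, $tot, $val) = (0, 0, 1);
>         for my $n ($3 =~ /\d+/g) { $frag += $n; $tot += $val * $n; $val <<= 1; }
>         print "\t", ($tot ? $frag / $tot : 0);    # one column per zone
>     }
>     print "\n";
>     sleep 1;
> } while (1);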
>
> My definition of fragmentation is just the number of fragments / the number of pages:
> * it is 1 only when all pages are of order 0
> * it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used)
> * to be sure that an order-k allocation succeeds, the fragmentation should be <= 2^-k
>

In practice, the ordering of page allocations and frees is not random,
but it's ok for the purposes here.

Also, when considering fragmentation, I'd take into account the order of the
desired allocation, as fragments at or over that size do not contribute
to fragmentation in a negative way. I'd usually express it in terms of free
pages instead of total pages as well, to avoid large fluctuations when reclaim
is working. We can work with this measure for the moment though, to avoid
getting side-tracked on what fragmentation is.
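
The measure under discussion, and one way an order-aware variant might look, can be written down in a few lines of C. This is a sketch under assumptions: frag_index() follows the arithmetic of the Perl script quoted above (free blocks over free pages, per /proc/buddyinfo line), and unusable_fraction() is a hypothetical illustration of counting only blocks below the desired order; neither is an existing kernel interface.

```c
#include <stddef.h>

/*
 * counts[k] is the number of free blocks of order k in one zone, as listed
 * on a /proc/buddyinfo line. Returns free blocks / free pages, or 0 when
 * the zone has no free pages.
 */
static double frag_index(const unsigned long *counts, size_t norders)
{
    unsigned long blocks = 0, pages = 0;

    for (size_t k = 0; k < norders; k++) {
        blocks += counts[k];
        pages += counts[k] << k; /* a block of order k holds 2^k pages */
    }
    return pages ? (double)blocks / (double)pages : 0.0;
}

/*
 * Hypothetical order-aware variant: the fraction of free pages that sit in
 * blocks too small to satisfy an allocation of the given order. Blocks at
 * or above that order do not count against the result.
 */
static double unusable_fraction(const unsigned long *counts, size_t norders,
                                unsigned int order)
{
    unsigned long too_small = 0, pages = 0;

    for (size_t k = 0; k < norders; k++) {
        unsigned long p = counts[k] << k;

        pages += p;
        if (k < order)
            too_small += p;
    }
    return pages ? (double)too_small / (double)pages : 0.0;
}
```

With four free order-0 blocks and one free order-2 block, frag_index() gives 5/8, while unusable_fraction() for order 2 gives 4/8: half of the free memory cannot serve an order-2 request.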

> I observed the mainline kernel during normal usage, and found that:
> * the fragmentation is very low after boot (< 1%).
> * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations).
> * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment.

All three of these observations are expected.

> * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8).

Again, this is expected. Page cache pages stay resident until
reclaimed. If they are clean, they are not really contributing to
fragmentation in any way that matters as they should be quickly found
and discarded in most cases. In the networking case, it's depending on
kswapd to find and reclaim the pages fast enough.

> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high.
> >
> > Two major differences. 1, the previous non-high-order tests had also
> > run sysbench and iozone so the starting conditions are different. I had
> > disabled those tests to get some of the high-order figures before I went
> > offline. However, the starting conditions are probably not as important as
> > the fact that kswapd is working to free order-2 pages and staying awake
> > until watermarks are reached. kswapd working harder is probably making a
> > big difference.
> >
>
> From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation.

While this is true, during the course of the test, the old page cache
should be discarded quickly. It's not as abrupt as dropping the page
cache but the end result should be similar in the majority of cases -
the exception being when atomic allocations are a major factor.

> We (block layer guys) tend to do a sync & drop cache before starting any test, so this can explain why our optimizations work best when machine has plenty of free memory.
> On the other hand, machines with plenty of memory should be the norm now, even for desktops.
>

Even large memory machines will eventually use the bulk of their memory
on old page cache. There is no problem with this as such.

> >
> > I made a mistake in the script that was generating the summary. I neglected
> > to take into account printk rate suppressions. When they are taken into
> > account, the first round of figures look like
> >
> > desktop-net-gitk
> >                       high-with        low-latency        low-latency       high-without
> >                     low-latency       block-2.6.33       async-rampup        low-latency
> > min             861.03 ( 0.00%)    467.83 (45.67%)  1185.51 (-37.69%)    303.43 (64.76%)
> > mean            866.60 ( 0.00%)    616.28 (28.89%)  1201.82 (-38.68%)    459.69 (46.96%)
> > stddev            4.39 ( 0.00%)  86.90 (-1877.46%)   23.63 (-437.75%)  92.75 (-2010.76%)
> > max             872.56 ( 0.00%)    679.36 (22.14%)  1242.63 (-42.41%)    537.31 (38.42%)
> > pgalloc-fail        65 ( 0.00%)        10 (84.62%)     293 (-350.77%)        20 (69.23%)
> >
> > So the async-rampup is getting smacked very hard with allocation failures
> > in the high-order case. With the three additional patches applied for allocation
> > failures, the figures look like
> >
> > desktop-net-gitk
> >                    atomics-with        low-latency        low-latency    atomics-without
> >                     low-latency       block-2.6.33       async-rampup        low-latency
> > min             641.12 ( 0.00%)    627.91 ( 2.06%)  1254.75 (-95.71%)    375.05 (41.50%)
> > mean            743.61 ( 0.00%)    631.20 (15.12%)  1272.70 (-71.15%)    389.71 (47.59%)
> > stddev           60.30 ( 0.00%)      2.53 (95.80%)     10.64 (82.35%)     22.38 (62.89%)
> > max             793.85 ( 0.00%)    633.76 (20.17%)  1281.65 (-61.45%)    428.41 (46.03%)
> > pgalloc-fail         3 ( 0.00%)         2 ( 0.00%)        27 ( 0.00%)         0 ( 0.00%)
> >
> > So again, async-rampup is getting smacked in terms of allocation failures
> > although the three additional patches help a lot. This is a real pity
> > because it looked nice in the tests involving no high-order allocations for
> > the network.
>
> Ok. Forget that patch for now. Maybe we can test it with 2.6.33 to see if it fits.

Sounds reasonable.

> On the other hand, I saw that the problems with high order allocations started
> around 2.6.31, where we didn't have any low_latency patch.

While this is true, there appear to be many sources of the high order
allocation failures. While low_latency is not the original source, it
does not appear to have helped either. Even without high-order
allocations being involved, disabling low_latency performs much better
in low-memory situations.

> So I don't think the
> solution to the problem is in the block layer. A slightly slower or faster writeback
> shouldn't cause a DoS like situation as the one encountered with your network driver.
>
> > > Moreover, it will improve some workloads, but penalize others.
> >
> > It really does appear to hurt a lot when the machine is kinda low on
> > memory though. That is a fairly common situation with a desktop loaded
> > up with random apps. Well..... by common, I mean I hit that situation a
> > lot on my laptop. I don't hit it on server workloads because I make sure
> > the machines are not overloaded.
>
> This is why we have it as a tunable. If your workload is negatively affected,
> you can switch it off.

True, although it's hard to spot.

> But make sure to test it thoroughly, because even if
> you found a 2x slowdown in a particular circumstance, it can gain 10x
> speedup (see http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html)
> in others.
>

Ok.

> >
> > > Your 3 patches, though, seem to improve the situation also for
> > > low_latency enabled, both for performance and allocation failures (25
> > > to 3). Having those 3 patches with low_latency enabled seems better,
> > > since it won't penalize the workloads that are benefited by
> > > low_latency (if you add a sequential read to your test, you should see
> > > a big difference).
> >
> > This is true and I would like to see them merged. However, this close to
> > release, with Jens' unhappiness with the explanation of why
> > congestion_wait() changes made a difference and Andrew feeling there
> > wasn't enough cause to merge them, I'm doubtful it'll happen. Will see
> > Monday what the story is.
>
> After a 1day study of the VM, I found an other way to improve the fragmentation.
> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
> and decreases overtime, if the system is lightly used, even without dropping caches.
> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order
> allocations are usually serviced by the other zones (more likely than with mainline allocator).
>
> The idea is to have 2 freelists for each zone.
> The free_list_0 has the pages that are less likely to cause an higher-order merge, since the buddy of their compound is not free.
> The free_list_1 contains the other ones.
> When expanding, we put pages into free_list_1. When freeing, we put them in the proper one by checking the buddy of the compound.
> And when extracting, we always extract from free_list_0 first,

This is subtle, but as well as increasing overhead in the page allocator, I'd
expect this to break the page ordering when a caller is allocating a large number
of order-0 pages. Some IO controllers get a boost from the pages coming back
in physically contiguous order, which happens if a high-order page is being
split towards the beginning of the stream of requests. Previous attempts at
altering how coalescing and splitting work to reduce fragmentation, with methods
similar to yours, have fallen foul of this.

> and fall back on the other if the first is empty.
> In this way, we keep free longer the pages that are more likely to cause a big merge.
> Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation.
>
> It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look.
>
> Signed-off-by: Corrado Zoccolo <[email protected]>
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6f75617..6427361 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page)
> }
>
> struct free_area {
> - struct list_head free_list[MIGRATE_TYPES];
> + struct list_head free_list_0[MIGRATE_TYPES];
> + struct list_head free_list_1[MIGRATE_TYPES];
> unsigned long nr_free;
> };
>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index f336e21..aee5ef5 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void)
> VMCOREINFO_OFFSET(zone, free_area);
> VMCOREINFO_OFFSET(zone, vm_stat);
> VMCOREINFO_OFFSET(zone, spanned_pages);
> - VMCOREINFO_OFFSET(free_area, free_list);
> + VMCOREINFO_OFFSET(free_area, free_list_0);
> + VMCOREINFO_OFFSET(free_area, free_list_1);
> VMCOREINFO_OFFSET(list_head, next);
> VMCOREINFO_OFFSET(list_head, prev);
> VMCOREINFO_OFFSET(vm_struct, addr);
> VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
> log_buf_kexec_setup();
> - VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
> + VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES);
> + VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES);
> VMCOREINFO_NUMBER(NR_FREE_PAGES);
> VMCOREINFO_NUMBER(PG_lru);
> VMCOREINFO_NUMBER(PG_private);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cdcedf6..5f488d8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
> int migratetype)
> {
> unsigned long page_idx;
> + unsigned long combined_idx;
> + bool high_order_free = false;
>
> if (unlikely(PageCompound(page)))
> if (unlikely(destroy_compound_page(page, order)))
> @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
> VM_BUG_ON(bad_range(zone, page));
>
> while (order < MAX_ORDER-1) {
> - unsigned long combined_idx;
> struct page *buddy;
>
> buddy = __page_find_buddy(page, page_idx, order);
> @@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
> order++;
> }
> set_page_order(page, order);
> - list_add(&page->lru,
> - &zone->free_area[order].free_list[migratetype]);
> +
> + if (order < MAX_ORDER-1) {
> + struct page *parent_page, *ppage_buddy;
> + combined_idx = __find_combined_index(page_idx, order);
> + parent_page = page + combined_idx - page_idx;

parent_page is a bad name here. It's not the parent of anything. What I
think you're looking for is the lowest page of the pair of buddies that
was last considered for merging.

> + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
> + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
> + }

And you are checking whether, when the other buddy of this pair is freed,
the merged block would then merge again at the next-highest order. If so,
you want to delay reusing that page for allocation.

> +
> + if (high_order_free)
> + list_add(&page->lru,
> + &zone->free_area[order].free_list_1[migratetype]);
> + else
> + list_add(&page->lru,
> + &zone->free_area[order].free_list_0[migratetype]);

You could have avoided the extra list to some extent by altering whether
it was the head or tail of the list the page was added to. It would have
had a similar effect of the page not being used for longer with slightly
less overhead.

> zone->free_area[order].nr_free++;
> }
>
> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
> high--;
> size >>= 1;
> VM_BUG_ON(bad_range(zone, &page[size]));
> - list_add(&page[size].lru, &area->free_list[migratetype]);
> + list_add(&page[size].lru, &area->free_list_1[migratetype]);

I think this here will damage the contiguous ordering of pages being
returned to callers.

> area->nr_free++;
> set_page_order(&page[size], high);
> }
> @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>
> /* Find a page of the appropriate size in the preferred list */
> for (current_order = order; current_order < MAX_ORDER; ++current_order) {
> + bool fl0, fl1;
> area = &(zone->free_area[current_order]);
> - if (list_empty(&area->free_list[migratetype]))
> + fl0 = list_empty(&area->free_list_0[migratetype]);
> + fl1 = list_empty(&area->free_list_1[migratetype]);
> + if (fl0 && fl1)
> continue;
>
> - page = list_entry(area->free_list[migratetype].next,
> - struct page, lru);
> + if (fl0)
> + page = list_entry(area->free_list_1[migratetype].next,
> + struct page, lru);
> + else
> + page = list_entry(area->free_list_0[migratetype].next,
> + struct page, lru);

By altering whether it's the head or tail free pages are added to, you
can achieve a similar effect.

> list_del(&page->lru);
> rmv_page_order(page);
> area->nr_free--;
> @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone,
> order = page_order(page);
> list_del(&page->lru);
> list_add(&page->lru,
> - &zone->free_area[order].free_list[migratetype]);
> + &zone->free_area[order].free_list_0[migratetype]);
> page += 1 << order;
> pages_moved += 1 << order;
> }
> @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> for (current_order = MAX_ORDER-1; current_order >= order;
> --current_order) {
> for (i = 0; i < MIGRATE_TYPES - 1; i++) {
> + bool fl0, fl1;
> migratetype = fallbacks[start_migratetype][i];
>
> /* MIGRATE_RESERVE handled later if necessary */
> @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> continue;
>
> area = &(zone->free_area[current_order]);
> - if (list_empty(&area->free_list[migratetype]))
> +
> +
> + fl0 = list_empty(&area->free_list_0[migratetype]);
> + fl1 = list_empty(&area->free_list_1[migratetype]);
> +
> + if (fl0 && fl1)
> continue;
>
> - page = list_entry(area->free_list[migratetype].next,
> - struct page, lru);
> + if (fl0)
> + page = list_entry(area->free_list_1[migratetype].next,
> + struct page, lru);
> + else
> + page = list_entry(area->free_list_0[migratetype].next,
> + struct page, lru);
> area->nr_free--;
>
> /*
> @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone)
> }
>
> for_each_migratetype_order(order, t) {
> - list_for_each(curr, &zone->free_area[order].free_list[t]) {
> + list_for_each(curr, &zone->free_area[order].free_list_0[t]) {
> + unsigned long i;
> +
> + pfn = page_to_pfn(list_entry(curr, struct page, lru));
> + for (i = 0; i < (1UL << order); i++)
> + swsusp_set_page_free(pfn_to_page(pfn + i));
> + }
> + list_for_each(curr, &zone->free_area[order].free_list_1[t]) {
> unsigned long i;
>
> pfn = page_to_pfn(list_entry(curr, struct page, lru));
> @@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> {
> int order, t;
> for_each_migratetype_order(order, t) {
> - INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
> + INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]);
> + INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]);
> zone->free_area[order].nr_free = 0;
> }
> }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c81321f..613ef1e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>
> area = &(zone->free_area[order]);
>
> - list_for_each(curr, &area->free_list[mtype])
> + list_for_each(curr, &area->free_list_0[mtype])
> + freecount++;
> + list_for_each(curr, &area->free_list_1[mtype])
> freecount++;
> seq_printf(m, "%6lu ", freecount);
> }

Much like the low_latency switch, I think this will help some
workloads in terms of fragmentation but hurt others that depend on the
ordering of pages being returned. There is a fair amount of overhead
introduced here as well, with branches and a lot of extra lists, although
I believe that could be mitigated.

What are the results if you just alter whether it's the head or tail of
the list that is used in __free_one_page()?
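
To make the question concrete, here is a minimal userspace sketch of the head-or-tail alternative. It is not kernel code: struct list_node and its helpers are simplified stand-ins for list_head, list_add() and list_add_tail(), and free_list_insert() and free_list_take() are hypothetical names. Blocks that are likely to take part in a larger merge go to the tail; allocation always takes from the head, so those blocks are reused last without needing a second list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
    struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
    head->next = head->prev = head;
}

static void insert_between(struct list_node *n, struct list_node *prev,
                           struct list_node *next)
{
    next->prev = n;
    n->next = next;
    n->prev = prev;
    prev->next = n;
}

/*
 * Hypothetical free path: a block whose freeing could lead to a larger
 * merge is queued at the tail, every other block at the head.
 */
static void free_list_insert(struct list_node *head, struct list_node *n,
                             bool merge_likely)
{
    if (merge_likely)
        insert_between(n, head->prev, head); /* tail: reused last */
    else
        insert_between(n, head, head->next); /* head: reused first */
}

/* Hypothetical allocation path: always take the first entry, if any. */
static struct list_node *free_list_take(struct list_node *head)
{
    struct list_node *n = head->next;

    if (n == head)
        return NULL; /* list is empty */
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = n;
    return n;
}
```

Inserting a (not merge-likely), b (merge-likely) and c (not merge-likely) in that order leaves the list as c, a, b, so b is handed out only after the other two are gone.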

--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-30 12:54:02

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Mon, Nov 30, 2009 at 1:04 PM, Mel Gorman <[email protected]> wrote:
> On Sun, Nov 29, 2009 at 04:11:15PM +0100, Corrado Zoccolo wrote:
>> On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote:
>> : > On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote:
>> > > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <[email protected]> wrote:
>> > > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote:
>> > > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:
>> > > >
>> > > > How would one go about selecting the proper ratio at which to disable
>> > > > the low_latency logic?
>> > >
>> > > Can we measure the dirty ratio when the allocation failures start to
>> > > happen?
>> >
>> > Would the number of dirty pages in the page allocation failure message to
>> > kern.log be enough? You won't get them all because of printk suppress but
>> > it's something. Alternatively, tell me exactly what stats from /proc you
>> > want and I'll stick a monitor on there. Assuming you want nr_dirty vs total
>> > number of pages though, the monitor tends to execute too late to be useful.
>> >
>> Since I wanted to understand this more deeply, but my system is healthy,
>> I devised a measure of fragmentation, and wanted to chart it to understand
>> what was going wrong. A Perl script that produces gnuplot-compatible output is provided:
>>
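>> use strict;
>> use warnings;
>> select(STDOUT);
>> $| = 1;    # unbuffered, so the output can be charted live
>> do {
>>     open (my $bf, "<", "/proc/buddyinfo") or die "buddyinfo: $!";
>>     open (my $up, "<", "/proc/uptime") or die "uptime: $!";
>>     my $now = <$up>;    # uptime and idle time, in seconds
>>     chomp $now;
>>     print $now;
>>     while (<$bf>) {
>>         # zone names may contain digits (e.g. DMA32)
>>         next unless /Node (\d+), zone\s+([a-zA-Z0-9]+)\s+(.+)$/;
>>         # $frag counts free blocks; $tot counts free pages (2^order per block)
>>         my ($frag, $tot, $val) = (0, 0, 1);
>>         for my $n ($3 =~ /\d+/g) { $frag += $n; $tot += $val * $n; $val <<= 1; }
>>         print "\t", ($tot ? $frag / $tot : 0);    # one column per zone
>>     }
>>     print "\n";
>>     sleep 1;
>> } while (1);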
>>
>> My definition of fragmentation is just the number of fragments / the number of pages:
>> * it is 1 only when all pages are of order 0
>> * it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used)
>> * to be sure that an order-k allocation succeeds, the fragmentation should be <= 2^-k
>>
>
> In practice, the ordering of page allocations and frees are not random
> but it's ok for the purposes here.
>
> Also when considering fragmentation, I'd take into account the order of the
> desired allocation as fragmentations at or over that size are not contributing
> to fragmentation in a negative way. I'd usually express it in terms of free
> pages instead of total pages as well to avoid large fluctuations when reclaim
> is working. We can work with this measure for the moment though to avoid
> getting side-tracked on what fragmentation is.
>
>> I observed the mainline kernel during normal usage, and found that:
>> * the fragmentation is very low after boot (< 1%).
>> * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations).
>> * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment.
>
> All three of these observations are expected.
>
>> * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8).
>
> Again, this is expected. Page cache pages stay resident until
> reclaimed. If they are clean, they are not really contributing to
> fragmentation in any way that matters as they should be quickly found
> and discarded in most cases. In the networking case, it's depending on
> kswapd to find and reclaim the pages fast enough.

If you need an order-5 page, how would kswapd work?
Will it randomly free some order-0 pages until a merge magically happens?
Unless the dirty ratio is really high, there should already be plenty
of contiguous non-dirty pages in the page cache that could be freed,
but if you use an LRU policy to evict, you can go through a lot of
freeing before a merge will happen.

>> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high.
>> >
>> > Two major differences. 1, the previous non-high-order tests had also
>> > run sysbench and iozone so the starting conditions are different. I had
>> > disabled those tests to get some of the high-order figures before I went
>> > offline. However, the starting conditions are probably not as important as
>> > the fact that kswapd is working to free order-2 pages and staying awake
>> > until watermarks are reached. kswapd working harder is probably making a
>> > big difference.
>> >
>>
>> From my observation, running a program that fills the page cache before a test has a large impact on the fragmentation.
>
> While this is true, during the course of the test, the old page cache
> should be discarded quickly. It's not as abrupt as dropping the page
> cache but the end result should be similar in the majority of cases -
> the exception being when atomic allocations are a major factor.

For my I/O scheduler tests I use an external disk, to be able to
monitor exactly what is happening.
If I don't do a sync & drop cache before starting a test, I usually
see writeback happening on the main disk, even if the only activity on
the machine is writing a sequential file to my external disk. If that
writeback is done in the context of my test process, this will alter
the result.
And with high-order allocations, depending on how you free the page
cache, it can be even worse than that.

>
>> On the other hand, I saw that the problems with high order allocations started
>> around 2.6.31, where we didn't have any low_latency patch.
>
> While this is true, there appear to be many sources of the high order
> allocation failures. While low_latency is not the original source, it
> does not appear to have helped either. Even without high-order
> allocations being involved, disabling low_latency performs much better
> in low-memory situations.
Can you try reproducing:
http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html
in a low memory scenario, to substantiate your claim?

>> After a one-day study of the VM, I found another way to reduce the fragmentation.
>> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
>> and decreases over time, if the system is lightly used, even without dropping caches.
>> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order
>> allocations are usually serviced by the other zones (more likely than with mainline allocator).
>>
>> The idea is to have 2 freelists for each zone.
>> The free_list_0 has the pages that are less likely to cause a higher-order merge, since the buddy of their compound is not free.
>> The free_list_1 contains the other ones.
>> When expanding, we put pages into free_list_1. When freeing, we put them in the proper one by checking the buddy of the compound.
>> And when extracting, we always extract from free_list_0 first,
>
> This is subtle, but as well as increased overhead in the page allocator, I'd
> expect this to break the page-ordering when a caller is allocation many numbers
> of order-0 pages. Some IO controllers get a boost by the pages coming back
> in physically contiguous order which happens if a high-order page is being
> split towards the beginning of the stream of requests. Previous attempts at
> altering how coalescing and splitting to reduce fragmentation with methods
> similar to yours have fallen foul of this.
I took extreme care not to disrupt the page ordering. In fact, I also
considered a single-list solution, but it could cause page
reordering (since I would have used add_tail to add to the other
list).

>
>> and fall back on the other if the first is empty.
>> In this way, we keep free longer the pages that are more likely to cause a big merge.
>> Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation.
>>
>> It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look.
>>
>> Signed-off-by: Corrado Zoccolo <[email protected]>
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 6f75617..6427361 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page)
>> }
>>
>> struct free_area {
>> - struct list_head free_list[MIGRATE_TYPES];
>> + struct list_head free_list_0[MIGRATE_TYPES];
>> + struct list_head free_list_1[MIGRATE_TYPES];
>> unsigned long nr_free;
>> };
>>
>> diff --git a/kernel/kexec.c b/kernel/kexec.c
>> index f336e21..aee5ef5 100644
>> --- a/kernel/kexec.c
>> +++ b/kernel/kexec.c
>> @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void)
>> VMCOREINFO_OFFSET(zone, free_area);
>> VMCOREINFO_OFFSET(zone, vm_stat);
>> VMCOREINFO_OFFSET(zone, spanned_pages);
>> - VMCOREINFO_OFFSET(free_area, free_list);
>> + VMCOREINFO_OFFSET(free_area, free_list_0);
>> + VMCOREINFO_OFFSET(free_area, free_list_1);
>> VMCOREINFO_OFFSET(list_head, next);
>> VMCOREINFO_OFFSET(list_head, prev);
>> VMCOREINFO_OFFSET(vm_struct, addr);
>> VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
>> log_buf_kexec_setup();
>> - VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
>> + VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES);
>> + VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES);
>> VMCOREINFO_NUMBER(NR_FREE_PAGES);
>> VMCOREINFO_NUMBER(PG_lru);
>> VMCOREINFO_NUMBER(PG_private);
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index cdcedf6..5f488d8 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
>> int migratetype)
>> {
>> unsigned long page_idx;
>> + unsigned long combined_idx;
>> + bool high_order_free = false;
>>
>> if (unlikely(PageCompound(page)))
>> if (unlikely(destroy_compound_page(page, order)))
>> @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
>> VM_BUG_ON(bad_range(zone, page));
>>
>> while (order < MAX_ORDER-1) {
>> - unsigned long combined_idx;
>> struct page *buddy;
>>
>> buddy = __page_find_buddy(page, page_idx, order);
>> @@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
>> order++;
>> }
>> set_page_order(page, order);
>> - list_add(&page->lru,
>> - &zone->free_area[order].free_list[migratetype]);
>> +
>> + if (order < MAX_ORDER-1) {
>> + struct page *parent_page, *ppage_buddy;
>> + combined_idx = __find_combined_index(page_idx, order);
>> + parent_page = page + combined_idx - page_idx;
>
> parent_page is a bad name here. It's not the parent of anything. What I
> think you're looking for is the lowest page of the pair of buddies that
> was last considered for merging.
Right, this should be the combined page, to keep naming consistent
with combined_idx.

>
>> + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
>> + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
>> + }
>
> And you are checking if when one buddy of this pair frees, will it then
> be merged with the next-highest order. If so, you want to delay reusing
> that page for allocation.
Exactly.
If you have two streams of allocations with different average
lifetimes (and with the long-lifetime allocations arriving at a slower
rate), this makes it very likely that the long-lifetime allocations
span a smaller set of compounds.
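
The check being discussed can be modelled outside the kernel with plain index arithmetic. The sketch below is only an illustration: `free_blocks` is a hypothetical set of (index, order) pairs standing in for the PageBuddy/page_order state, and `goes_to_free_list_1` is a made-up name. The two helpers mirror what __page_find_buddy() and __find_combined_index() compute.

```python
def buddy_index(page_idx, order):
    # The buddy of a 2**order block differs only in bit `order` of its index.
    return page_idx ^ (1 << order)


def combined_index(page_idx, order):
    # First page of the pair formed by a block and its buddy: clear bit `order`.
    return page_idx & ~(1 << order)


def goes_to_free_list_1(free_blocks, page_idx, order, max_order):
    """True if the buddy of the would-be combined block is already free.

    free_blocks is a hypothetical model: a set of (index, order) tuples
    describing the blocks currently sitting on the free lists.
    """
    if order >= max_order - 1:
        return False
    combined = combined_index(page_idx, order)
    higher_buddy = buddy_index(combined, order + 1)
    return (higher_buddy, order + 1) in free_blocks
```

For example, with pages 4-7 free as one order-2 block, freeing the order-1 block at index 2 returns True: once pages 0-1 are freed as well, the merge can continue up to order 3, so the block is worth keeping off the allocation path for longer.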
>
>> +
>> + if (high_order_free)
>> + list_add(&page->lru,
>> + &zone->free_area[order].free_list_1[migratetype]);
>> + else
>> + list_add(&page->lru,
>> + &zone->free_area[order].free_list_0[migratetype]);
>
> You could have avoided the extra list to some extent by altering whether
> it was the head or tail of the list the page was added to. It would have
> had a similar effect of the page not being used for longer with slightly
> less overhead.
Right, but the order of insertions at the tail would be reversed.

>> zone->free_area[order].nr_free++;
>> }
>>
>> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
>> high--;
>> size >>= 1;
>> VM_BUG_ON(bad_range(zone, &page[size]));
>> - list_add(&page[size].lru, &area->free_list[migratetype]);
>> + list_add(&page[size].lru, &area->free_list_1[migratetype]);
>
> I think this here will damage the contiguous ordering of pages being
> returned to callers.
This shouldn't damage the order. In fact, expand always inserts in the
free_list_1, in the same order as the original code inserted in the
free_list. And if we hit expand, then the free_list_0 is empty, so all
allocations will be serviced from free_list_1 in the same order as the
original code.
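
To make the ordering argument concrete, here is a small user-space model (a sketch, not the kernel code). `FreeArea` and `alloc` are hypothetical names, pages are plain integers, and migrate types are ignored. `alloc` imitates __rmqueue_smallest() followed by expand(), with the split-off halves always going to free_list_1.

```python
from collections import deque


class FreeArea:
    """Per-order free lists, loosely modelled on the patched struct free_area."""

    def __init__(self):
        self.free_list_0 = deque()  # pages unlikely to trigger a bigger merge
        self.free_list_1 = deque()  # the others, including expand() leftovers

    def add(self, page, to_list_1):
        # list_add() inserts at the head of the list.
        (self.free_list_1 if to_list_1 else self.free_list_0).appendleft(page)

    def take(self):
        # Always extract from free_list_0 first, then fall back on free_list_1.
        for lst in (self.free_list_0, self.free_list_1):
            if lst:
                return lst.popleft()
        return None


def alloc(areas, order):
    """Take the smallest suitable block and split it down to `order`."""
    for current_order in range(order, len(areas)):
        page = areas[current_order].take()
        if page is None:
            continue
        size = 1 << current_order
        high = current_order
        while high > order:  # same loop shape as expand()
            high -= 1
            size >>= 1
            areas[high].add(page + size, to_list_1=True)
        return page
    return None
```

Starting from a single free order-3 block at page 0, eight consecutive order-0 calls to `alloc` hand back pages 0 through 7 in ascending order, which is the contiguous ordering the original code produces.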

>
>> area->nr_free++;
>> set_page_order(&page[size], high);
>> }
>> @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
>>
>> /* Find a page of the appropriate size in the preferred list */
>> for (current_order = order; current_order < MAX_ORDER; ++current_order) {
>> + bool fl0, fl1;
>> area = &(zone->free_area[current_order]);
>> - if (list_empty(&area->free_list[migratetype]))
>> + fl0 = list_empty(&area->free_list_0[migratetype]);
>> + fl1 = list_empty(&area->free_list_1[migratetype]);
>> + if (fl0 && fl1)
>> continue;
>>
>> - page = list_entry(area->free_list[migratetype].next,
>> - struct page, lru);
>> + if (fl0)
>> + page = list_entry(area->free_list_1[migratetype].next,
>> + struct page, lru);
>> + else
>> + page = list_entry(area->free_list_0[migratetype].next,
>> + struct page, lru);
>
> By altering whether it's the head or tail free pages are added to, you
> can achieve a similar effect.
>
>> list_del(&page->lru);
>> rmv_page_order(page);
>> area->nr_free--;
>> @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone,
>> order = page_order(page);
>> list_del(&page->lru);
>> list_add(&page->lru,
>> - &zone->free_area[order].free_list[migratetype]);
>> + &zone->free_area[order].free_list_0[migratetype]);
>> page += 1 << order;
>> pages_moved += 1 << order;
>> }
>> @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>> for (current_order = MAX_ORDER-1; current_order >= order;
>> --current_order) {
>> for (i = 0; i < MIGRATE_TYPES - 1; i++) {
>> + bool fl0, fl1;
>> migratetype = fallbacks[start_migratetype][i];
>>
>> /* MIGRATE_RESERVE handled later if necessary */
>> @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
>> continue;
>>
>> area = &(zone->free_area[current_order]);
>> - if (list_empty(&area->free_list[migratetype]))
>> +
>> +
>> + fl0 = list_empty(&area->free_list_0[migratetype]);
>> + fl1 = list_empty(&area->free_list_1[migratetype]);
>> +
>> + if (fl0 && fl1)
>> continue;
>>
>> - page = list_entry(area->free_list[migratetype].next,
>> - struct page, lru);
>> + if (fl0)
>> + page = list_entry(area->free_list_1[migratetype].next,
>> + struct page, lru);
>> + else
>> + page = list_entry(area->free_list_0[migratetype].next,
>> + struct page, lru);
>> area->nr_free--;
>>
>> /*
>> @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone)
>> }
>>
>> for_each_migratetype_order(order, t) {
>> - list_for_each(curr, &zone->free_area[order].free_list[t]) {
>> + list_for_each(curr, &zone->free_area[order].free_list_0[t]) {
>> + unsigned long i;
>> +
>> + pfn = page_to_pfn(list_entry(curr, struct page, lru));
>> + for (i = 0; i < (1UL << order); i++)
>> + swsusp_set_page_free(pfn_to_page(pfn + i));
>> + }
>> + list_for_each(curr, &zone->free_area[order].free_list_1[t]) {
>> unsigned long i;
>>
>> pfn = page_to_pfn(list_entry(curr, struct page, lru));
>> @@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
>> {
>> int order, t;
>> for_each_migratetype_order(order, t) {
>> - INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
>> + INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]);
>> + INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]);
>> zone->free_area[order].nr_free = 0;
>> }
>> }
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index c81321f..613ef1e 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>
>> area = &(zone->free_area[order]);
>>
>> - list_for_each(curr, &area->free_list[mtype])
>> + list_for_each(curr, &area->free_list_0[mtype])
>> + freecount++;
>> + list_for_each(curr, &area->free_list_1[mtype])
>> freecount++;
>> seq_printf(m, "%6lu ", freecount);
>> }
>
> No more than the low_latency switch, I think this will help some
> workloads in terms of fragmentation but hurt others that depend on the
> ordering of pages being returned.
Hopefully not, if my considerations above are correct.
> There is a fair amount of overhead
> introduced here as well with branches and a lot of extra lists although
> I believe that could be mitigated.
>
> What are the results if you just alter whether it's the head or tail of
> the list that is used in __free_one_page()?
In that case, it would alter the ordering, but not that of the
pages returned by expand.
In fact, only the order of the pages returned by free would be
affected, and that order may already be quite disordered.
If that order does not need to be preserved, I can prepare a new version
with a single list.

BTW, if we only guarantee that pages returned by expand are well
ordered, this patch will increase the ordered-ness of the stream of
allocated pages, since it will increase the probability that
allocations go into expand (since frees will more likely create high
order combined pages). So it will also improve the workloads that
prefer ordered allocations.

>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab
>

--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-11-30 15:48:34

Subject: Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32

On Mon, Nov 30, 2009 at 01:54:04PM +0100, Corrado Zoccolo wrote:
> On Mon, Nov 30, 2009 at 1:04 PM, Mel Gorman <[email protected]> wrote:
> > On Sun, Nov 29, 2009 at 04:11:15PM +0100, Corrado Zoccolo wrote:
> >> On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote:
> >> > On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote:
> >> > > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <[email protected]> wrote:
> >> > > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote:
> >> > > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <[email protected]> wrote:
> >> > > >
> >> > > > How would one go about selecting the proper ratio at which to disable
> >> > > > the low_latency logic?
> >> > >
> >> > > Can we measure the dirty ratio when the allocation failures start to
> >> > > happen?
> >> >
> >> > Would the number of dirty pages in the page allocation failure message to
> >> > kern.log be enough? You won't get them all because of printk suppress but
> >> > it's something. Alternatively, tell me exactly what stats from /proc you
> >> > want and I'll stick a monitor on there. Assuming you want nr_dirty vs total
> >> > number of pages though, the monitor tends to execute too late to be useful.
> >> >
> >> Since I wanted to go deeper in the understanding, but my system is healthy,
> >> I devised a measure of fragmentation, and wanted to chart it to understand
> >> what was going wrong. A perl script that produces gnuplot compatible output is provided:
> >>
> >> use strict;
> >> select(STDOUT);
> >> $|=1;
> >> do {
> >> open (my $bf, "< /proc/buddyinfo") or die;
> >> open (my $up, "< /proc/uptime") or die;
> >> my $now = <$up>;
> >> chomp $now;
> >> print $now;
> >> while(<$bf>) {
> >>     next unless /Node (\d+), zone\s+([a-zA-Z]+)\s+(.+)$/;
> >>     my ($frag, $tot, $val) = (0,0,1);
> >>     map { $frag += $_; $tot += $val * $_; $val <<= 1;} ($3 =~ /\d+/g);
> >>     print "\t", $frag/$tot;
> >> }
> >> print "\n";
> >> sleep 1;
> >> } while(1);
> >>
> >> My definition of fragmentation is just the number of fragments / the number of pages:
> >> * It is 1 only when all pages are of order 0
> >> * it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used)
> >> * to be sure that a order k allocation succeeds, the fragmentation should be <= 2^-k
> >>
> >
> > In practice, the ordering of page allocations and frees are not random
> > but it's ok for the purposes here.
> >
> > Also when considering fragmentation, I'd take into account the order of the
> > desired allocation as fragmentations at or over that size are not contributing
> > to fragmentation in a negative way. I'd usually express it in terms of free
> > pages instead of total pages as well to avoid large fluctuations when reclaim
> > is working. We can work with this measure for the moment though to avoid
> > getting side-tracked on what fragmentation is.
> >
> >> I observed the mainline kernel during normal usage, and found that:
> >> * the fragmentation is very low after boot (< 1%).
> >> * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations).
> >> * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment.
> >
> > All three of these observations are expected.
> >
> >> * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8).
> >
> > Again, this is expected. Page cache pages stay resident until
> > reclaimed. If they are clean, they are not really contributing to
> > fragmentation in any way that matters as they should be quickly found
> > and discarded in most cases. In the networking case, it's depending on
> > kswapd to find and reclaim the pages fast enough.
>
> If you need an order 5 page, how would kswapd work?
> Will it free randomly some order 0 pages until a merge magically happens?

No, it won't. There is contiguity-aware reclaim logic called "lumpy reclaim"
which is used for high-order pages. The next LRU page for reclaiming is
a cursor page and the naturally-aligned block of pages around it are also
considered for reclaim so that a high-order page gets freed.
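
As a rough illustration of the "naturally-aligned block" part only (this is not the reclaim code itself, which also has to decide whether each page in the block can actually be reclaimed), the PFN range around a cursor page is just an alignment mask. `lumpy_block` is a made-up name.

```python
def lumpy_block(cursor_pfn, order):
    """PFN range [start, end) of the aligned 2**order block holding cursor_pfn."""
    size = 1 << order
    start = cursor_pfn & ~(size - 1)  # round down to a multiple of 2**order
    return start, start + size
```

For an order-5 request and a cursor page at PFN 1000, the candidate block is PFNs 992 to 1023.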

> Unless the dirty ratio is really high, there should already be plenty
> of contiguous non-dirty pages in the page cache that could be freed,
> but if you use an LRU policy to evict, you can go through a lot of
> freeing before a merge will happen.
>

Indeed. There is no need to go into details but if only order-0 pages
were being reclaimed, an extremely large percentage of memory would have
to be freed to get an order-5 page.
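
A toy simulation gives a feel for this. It is a sketch under strong assumptions: every page starts in use, pages are freed one at a time in uniformly random order (a stand-in for an LRU order that is unrelated to physical placement), and `fraction_freed_until_block` is a hypothetical helper, not anything in the kernel.

```python
import random


def fraction_freed_until_block(total_pages, order, seed):
    """Free pages in random order until one aligned 2**order block is all free.

    Returns the fraction of memory that had to be freed. total_pages is
    assumed to be a multiple of 2**order.
    """
    rng = random.Random(seed)
    size = 1 << order
    free_in_block = [0] * (total_pages // size)
    pfns = list(range(total_pages))
    rng.shuffle(pfns)
    for freed, pfn in enumerate(pfns, start=1):
        block = pfn // size
        free_in_block[block] += 1
        if free_in_block[block] == size:
            return freed / total_pages
    return 1.0
```

Running it with `order=5` for a few seeds and comparing against `order=0` shows how much more has to be freed when nothing steers reclaim toward contiguous pages.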

> >> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high.
> >> >
> >> > Two major differences. 1, the previous non-high-order tests had also
> >> > run sysbench and iozone so the starting conditions are different. I had
> >> > disabled those tests to get some of the high-order figures before I went
> >> > offline. However, the starting conditions are probably not as important as
> >> > the fact that kswapd is working to free order-2 pages and staying awake
> >> > until watermarks are reached. kswapd working harder is probably making a
> >> > big difference.
> >> >
> >>
> >> From my observation, running a program that fills the page cache before a test has a large impact on the fragmentation.
> >
> > While this is true, during the course of the test, the old page cache
> > should be discarded quickly. It's not as abrupt as dropping the page
> > cache but the end result should be similar in the majority of cases -
> > the exception being when atomic allocations are a major factor.
>
> For my I/O scheduler tests I use an external disk, to be able to
> monitor exactly what is happening.
> If I don't do a sync & drop cache before starting a test, I usually
> see writeback happening on the main disk, even if the only activity on
> the machine is writing a sequential file to my external disk. If that
> writeback is done in the context of my test process, this will alter
> the result.

Why does the writeback kick in late? I thought pages were meant to be written
back after a configurable interval of time had passed.

> And with high order allocations, depending on how do you free page
> cache, it can be even worse than that.
>
> >
> >> On the other hand, I saw that the problems with high order allocations started
> >> around 2.6.31, where we didn't have any low_latency patch.
> >
> > While this is true, there appear to be many sources of the high order
> > allocation failures. While low_latency is not the original source, it
> > does not appear to have helped either. Even without high-order
> > allocations being involved, disabling low_latency performs much better
> > in low-memory situations.
>
> Can you try reproducing:
> http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html
> in a low memory scenario, to substantiate your claim?
>

I can try but it'll take a few days to get around to it. I'm still trying
to identify other sources of the problems from between 2.6.30 and
2.6.32-rc8. It'll be tricky to test what you ask because it might not just
be low-memory that is the problem but low memory + enough pressure that
processes are stalling waiting on reclaim.

> >> After a one-day study of the VM, I found another way to reduce the fragmentation.
> >> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high,
> >> and decreases over time, if the system is lightly used, even without dropping caches.
> >> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order
> >> allocations are usually serviced by the other zones (more likely than with mainline allocator).
> >>
> >> The idea is to have 2 freelists for each zone.
> >> The free_list_0 has the pages that are less likely to cause a higher-order merge, since the buddy of their compound is not free.
> >> The free_list_1 contains the other ones.
> >> When expanding, we put pages into free_list_1. When freeing, we put them in the proper one by checking the buddy of the compound.
> >> And when extracting, we always extract from free_list_0 first,
> >
> > This is subtle, but as well as increased overhead in the page allocator, I'd
> > expect this to break the page-ordering when a caller is allocation many numbers
> > of order-0 pages. Some IO controllers get a boost by the pages coming back
> > in physically contiguous order which happens if a high-order page is being
> > split towards the beginning of the stream of requests. Previous attempts at
> > altering how coalescing and splitting to reduce fragmentation with methods
> > similar to yours have fallen foul of this.
>
> I took extreme care in not disrupting the page ordering. In fact, I
> thought, too, to a single list solution, but it could cause page
> reordering (since I would have used add_tail to add to the other
> list).
>

You're right, this way does preserve the page ordering.

> >
> >> and fall back on the other if the first is empty.
> >> In this way, we keep free longer the pages that are more likely to cause a big merge.
> >> Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation.
> >>
> >> It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look.
> >>
> >> Signed-off-by: Corrado Zoccolo <[email protected]>
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index 6f75617..6427361 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page)
> >>  }
> >>
> >>  struct free_area {
> >> -	struct list_head	free_list[MIGRATE_TYPES];
> >> +	struct list_head	free_list_0[MIGRATE_TYPES];
> >> +	struct list_head	free_list_1[MIGRATE_TYPES];
> >>  	unsigned long		nr_free;
> >>  };
> >>
> >> diff --git a/kernel/kexec.c b/kernel/kexec.c
> >> index f336e21..aee5ef5 100644
> >> --- a/kernel/kexec.c
> >> +++ b/kernel/kexec.c
> >> @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void)
> >>  	VMCOREINFO_OFFSET(zone, free_area);
> >>  	VMCOREINFO_OFFSET(zone, vm_stat);
> >>  	VMCOREINFO_OFFSET(zone, spanned_pages);
> >> -	VMCOREINFO_OFFSET(free_area, free_list);
> >> +	VMCOREINFO_OFFSET(free_area, free_list_0);
> >> +	VMCOREINFO_OFFSET(free_area, free_list_1);
> >>  	VMCOREINFO_OFFSET(list_head, next);
> >>  	VMCOREINFO_OFFSET(list_head, prev);
> >>  	VMCOREINFO_OFFSET(vm_struct, addr);
> >>  	VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER);
> >>  	log_buf_kexec_setup();
> >> -	VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES);
> >> +	VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES);
> >> +	VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES);
> >>  	VMCOREINFO_NUMBER(NR_FREE_PAGES);
> >>  	VMCOREINFO_NUMBER(PG_lru);
> >>  	VMCOREINFO_NUMBER(PG_private);
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index cdcedf6..5f488d8 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
> >>  		int migratetype)
> >>  {
> >>  	unsigned long page_idx;
> >> +	unsigned long combined_idx;
> >> +	bool high_order_free = false;
> >>
> >>  	if (unlikely(PageCompound(page)))
> >>  		if (unlikely(destroy_compound_page(page, order)))
> >> @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
> >>  	VM_BUG_ON(bad_range(zone, page));
> >>
> >>  	while (order < MAX_ORDER-1) {
> >> -		unsigned long combined_idx;
> >>  		struct page *buddy;
> >>
> >>  		buddy = __page_find_buddy(page, page_idx, order);
> >> @@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
> >>  		order++;
> >>  	}
> >>  	set_page_order(page, order);
> >> -	list_add(&page->lru,
> >> -		&zone->free_area[order].free_list[migratetype]);
> >> +
> >> +	if (order < MAX_ORDER-1) {
> >> +		struct page *parent_page, *ppage_buddy;
> >> +		combined_idx = __find_combined_index(page_idx, order);
> >> +		parent_page = page + combined_idx - page_idx;
> >
> > parent_page is a bad name here. It's not the parent of anything. What I
> > think you're looking for is the lowest page of the pair of buddies that
> > was last considered for merging.
>
> Right, this should be the combined page, to keep naming consistent
> with combined_idx.
>
> >
> >> +		ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1);
> >> +		high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1);
> >> +	}
> >
> > And you are checking if when one buddy of this pair frees, will it then
> > be merged with the next-highest order. If so, you want to delay reusing
> > that page for allocation.
>
> Exactly.
> If you have two streams of allocations, with different average
> lifetime (and with the long lifetime allocations having a slower
> rate), this will make very probable that the long lifetime allocations
> span a smaller set of compounds.

I see the logic.

> >
> >> +
> >> +	if (high_order_free)
> >> +		list_add(&page->lru,
> >> +			&zone->free_area[order].free_list_1[migratetype]);
> >> +	else
> >> +		list_add(&page->lru,
> >> +			&zone->free_area[order].free_list_0[migratetype]);
> >
> > You could have avoided the extra list to some extent by altering whether
> > it was the head or tail of the list the page was added to. It would have
> > had a similar effect of the page not being used for longer with slightly
> > less overhead.
>
> Right, but the order of insertions at the tail would be reversed.
>

True, but maybe it doesn't matter. What is important is the order in which
pages are returned during allocation and after a high-order page is
split.
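
The single-list alternative can be pictured with a double-ended queue. This is only a toy model of the idea, assuming allocation always takes from the head; `single_list_order` is a made-up name, and the two insert calls stand in for list_add() and list_add_tail().

```python
from collections import deque


def single_list_order(frees):
    """Order in which a head-first allocator would hand the freed pages back.

    frees is an iterable of (page, likely_to_merge) pairs in freeing order.
    """
    free_list = deque()
    for page, likely_to_merge in frees:
        if likely_to_merge:
            free_list.append(page)      # like list_add_tail(): reused last
        else:
            free_list.appendleft(page)  # like list_add(): reused first
    return list(free_list)
```

Pages added at the head come back in reverse freeing order, while pages added at the tail come back in freeing order. That is the reordering being discussed: it only concerns pages that went through free, not pages produced by splitting a high-order block.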

> >>  	zone->free_area[order].nr_free++;
> >>  }
> >>
> >> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page,
> >>  		high--;
> >>  		size >>= 1;
> >>  		VM_BUG_ON(bad_range(zone, &page[size]));
> >> -		list_add(&page[size].lru, &area->free_list[migratetype]);
> >> +		list_add(&page[size].lru, &area->free_list_1[migratetype]);
> >
> > I think this here will damage the contiguous ordering of pages being
> > returned to callers.
>
> This shouldn't damage the order. In fact, expand always inserts in the
> free_list_1, in the same order as the original code inserted in the
> free_list. And if we hit expand, then the free_list_0 is empty, so all
> allocations will be serviced from free_list_1 in the same order as the
> original code.
>
> >
> >>  		area->nr_free++;
> >>  		set_page_order(&page[size], high);
> >>  	}
> >> @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
> >>
> >>  	/* Find a page of the appropriate size in the preferred list */
> >>  	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
> >> +		bool fl0, fl1;
> >>  		area = &(zone->free_area[current_order]);
> >> -		if (list_empty(&area->free_list[migratetype]))
> >> +		fl0 = list_empty(&area->free_list_0[migratetype]);
> >> +		fl1 = list_empty(&area->free_list_1[migratetype]);
> >> +		if (fl0 && fl1)
> >>  			continue;
> >>
> >> -		page = list_entry(area->free_list[migratetype].next,
> >> -							struct page, lru);
> >> +		if (fl0)
> >> +			page = list_entry(area->free_list_1[migratetype].next,
> >> +				struct page, lru);
> >> +		else
> >> +			page = list_entry(area->free_list_0[migratetype].next,
> >> +				struct page, lru);
> >
> > By altering whether it's the head or tail free pages are added to, you
> > can achieve a similar effect.
> >
> >> ? ? ? ? ? ? ? list_del(&page->lru);
> >> ? ? ? ? ? ? ? rmv_page_order(page);
> >> ? ? ? ? ? ? ? area->nr_free--;
> >> @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone,
> >> ? ? ? ? ? ? ? order = page_order(page);
> >> ? ? ? ? ? ? ? list_del(&page->lru);
> >> ? ? ? ? ? ? ? list_add(&page->lru,
> >> - ? ? ? ? ? ? ? ? ? ? &zone->free_area[order].free_list[migratetype]);
> >> + ? ? ? ? ? ? ? ? ? ? &zone->free_area[order].free_list_0[migratetype]);
> >> ? ? ? ? ? ? ? page += 1 << order;
> >> ? ? ? ? ? ? ? pages_moved += 1 << order;
> >> ? ? ? }
> >> @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> >>         for (current_order = MAX_ORDER-1; current_order >= order;
> >>                                                 --current_order) {
> >>                 for (i = 0; i < MIGRATE_TYPES - 1; i++) {
> >> +                       bool fl0, fl1;
> >>                         migratetype = fallbacks[start_migratetype][i];
> >>
> >>                         /* MIGRATE_RESERVE handled later if necessary */
> >> @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
> >>                                 continue;
> >>
> >>                         area = &(zone->free_area[current_order]);
> >> -                       if (list_empty(&area->free_list[migratetype]))
> >> +
> >> +
> >> +                       fl0 = list_empty(&area->free_list_0[migratetype]);
> >> +                       fl1 = list_empty(&area->free_list_1[migratetype]);
> >> +
> >> +                       if (fl0 && fl1)
> >>                                 continue;
> >>
> >> -                       page = list_entry(area->free_list[migratetype].next,
> >> -                                       struct page, lru);
> >> +                       if (fl0)
> >> +                               page = list_entry(area->free_list_1[migratetype].next,
> >> +                                               struct page, lru);
> >> +                       else
> >> +                               page = list_entry(area->free_list_0[migratetype].next,
> >> +                                               struct page, lru);
> >>                         area->nr_free--;
> >>
> >>                         /*
> >> @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone)
> >>                 }
> >>
> >>         for_each_migratetype_order(order, t) {
> >> -               list_for_each(curr, &zone->free_area[order].free_list[t]) {
> >> +               list_for_each(curr, &zone->free_area[order].free_list_0[t]) {
> >> +                       unsigned long i;
> >> +
> >> +                       pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >> +                       for (i = 0; i < (1UL << order); i++)
> >> +                               swsusp_set_page_free(pfn_to_page(pfn + i));
> >> +               }
> >> +               list_for_each(curr, &zone->free_area[order].free_list_1[t]) {
> >>                         unsigned long i;
> >>
> >>                         pfn = page_to_pfn(list_entry(curr, struct page, lru));
> >> @@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
> >>  {
> >>         int order, t;
> >>         for_each_migratetype_order(order, t) {
> >> -               INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
> >> +               INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]);
> >> +               INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]);
> >>                 zone->free_area[order].nr_free = 0;
> >>         }
> >>  }
> >> diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> index c81321f..613ef1e 100644
> >> --- a/mm/vmstat.c
> >> +++ b/mm/vmstat.c
> >> @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> >>
> >>                         area = &(zone->free_area[order]);
> >>
> >> -                       list_for_each(curr, &area->free_list[mtype])
> >> +                       list_for_each(curr, &area->free_list_0[mtype])
> >> +                               freecount++;
> >> +                       list_for_each(curr, &area->free_list_1[mtype])
> >>                                 freecount++;
> >>                         seq_printf(m, "%6lu ", freecount);
> >>                 }
> >
> > No more than the low_latency switch, I think this will help some
> > workloads in terms of fragmentation but hurt others that depend on the
> > ordering of pages being returned.
>
> Hopefully not, if my considerations above are correct.

Right, it doesn't affect the ordering of pages returned. The cost is
additional branches and twice as many free lists, but it's still very
interesting.
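
For comparison, here is a minimal userspace sketch of the single-list
alternative raised earlier in the thread: allocation always takes the first
entry, and the free path chooses head or tail. All names (toy_node,
toy_free_to_head, toy_free_to_tail, toy_alloc_from_head) are hypothetical,
and which frees should go to the tail is an assumption the thread has not
settled. It is not kernel code, only an illustration of how head/tail
placement can stand in for a second list.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on list_head. */
struct toy_node {
	struct toy_node *prev, *next;
	int pfn;	/* stands in for the page this entry describes */
};

static void toy_list_init(struct toy_node *head)
{
	head->prev = head->next = head;
}

static void toy_insert_between(struct toy_node *n, struct toy_node *prev,
			       struct toy_node *next)
{
	assert(n && prev && next);
	n->prev = prev;
	n->next = next;
	prev->next = n;
	next->prev = n;
}

/* Pages we are happy to hand out again soon go to the head. */
static void toy_free_to_head(struct toy_node *head, struct toy_node *n)
{
	toy_insert_between(n, head, head->next);
}

/* Pages we would rather leave alone for now go to the tail (assumed policy). */
static void toy_free_to_tail(struct toy_node *head, struct toy_node *n)
{
	toy_insert_between(n, head->prev, head);
}

/* Allocation takes the first entry, or returns NULL if the list is empty. */
static struct toy_node *toy_alloc_from_head(struct toy_node *head)
{
	struct toy_node *n = head->next;

	if (n == head)
		return NULL;
	n->prev->next = n->next;
	n->next->prev = n->prev;
	return n;
}
```

With this layout, entries freed to the tail are only handed out once every
entry freed to the head is gone, which is the same preference the patch
expresses by trying free_list_0 before free_list_1, without the second list
or the extra emptiness checks.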

> > There is a fair amount of overhead
> > introduced here as well with branches and a lot of extra lists although
> > I believe that could be mitigated.
> >
> > What are the results if you just alter whether it's the head or tail of
> > the list that is used in __free_one_page()?
>
> In that case, it would alter the ordering, but not the one of the
> pages returned by expand.
> In fact, only the order of the pages returned by free will be
> affected, and in that case maybe it is already quite disordered.
> If that order is not needed to be kept, I can prepare a new version
> with a single list.
>

The ordering of free does not need to be preserved. The important
property is that, when a high-order page is split by expand(),
subsequent allocations use the contiguous pages.
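
To make that property concrete, here is a toy userspace model of the split.
It is a sketch only: toy_zone, toy_push, toy_expand and toy_alloc are
hypothetical names, the per-order free lists are plain LIFO arrays of start
PFNs, and migrate types are ignored. The one thing it borrows from expand()
is that the lower half is kept and the upper half is put back on the free
list at each step.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ORDER	4	/* orders 0..3, arbitrary toy value */
#define TOY_NPAGES	(1u << (TOY_MAX_ORDER - 1))

/* One LIFO stack of free block start PFNs per order. */
struct toy_zone {
	unsigned int free_pfn[TOY_MAX_ORDER][TOY_NPAGES];
	size_t nr_free[TOY_MAX_ORDER];
};

static void toy_push(struct toy_zone *z, unsigned int order, unsigned int pfn)
{
	assert(order < TOY_MAX_ORDER && z->nr_free[order] < TOY_NPAGES);
	z->free_pfn[order][z->nr_free[order]++] = pfn;
}

/*
 * Split a block of order 'high' down to order 'low'. At each step the
 * upper half goes back on the free list and the lower half is kept.
 */
static unsigned int toy_expand(struct toy_zone *z, unsigned int pfn,
			       unsigned int low, unsigned int high)
{
	while (high > low) {
		high--;
		toy_push(z, high, pfn + (1u << high));
	}
	return pfn;
}

/* Allocate one block of 'order'; returns its start PFN, or -1 if none. */
static int toy_alloc(struct toy_zone *z, unsigned int order)
{
	unsigned int cur;

	for (cur = order; cur < TOY_MAX_ORDER; cur++) {
		unsigned int pfn;

		if (z->nr_free[cur] == 0)
			continue;
		pfn = z->free_pfn[cur][--z->nr_free[cur]];
		return (int)toy_expand(z, pfn, order, cur);
	}
	return -1;
}
```

Starting from a single free order-3 block at PFN 0, eight successive order-0
allocations in this model return PFNs 0 through 7 in order, because each
allocation picks up the remainder left by the previous split before touching
a larger block.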

> BTW, if we only guarantee that pages returned by expand are well
> ordered, this patch will increase the ordered-ness of the stream of
> allocated pages, since it will increase the probability that
> allocations go into expand (since frees will more likely create high
> order combined pages). So it will also improve the workloads that
> prefer ordered allocations.
>

That's a distinct possibility.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-11-30 17:24:19