2009-11-12 19:30:15

by Mel Gorman

Subject: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

Sorry for the long delay in posting another version. Testing is extremely
time-consuming and I wasn't getting to work on this as much as I'd have liked.

Changelog since V2
o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
testing, it made latencies even worse as kswapd slept more on high-order
congestion causing order-0 direct reclaims.
o Added changes to how congestion_wait() works
o Added a number of new patches altering the behaviour of reclaim

Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and while they are generally brushed away, there has been a large
increase in them recently and there are a number of possible areas the
problem could be in - core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are;

[Bug #14141] order 2 page allocation failures in iwlagn
[Bug #14141] order 2 page allocation failures (generic)
[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
[No BZ ID] page allocation failure message kernel 2.6.31.4 (tty-related)

The following is a series of patches that bring the behaviour of reclaim
and the page allocator more in line with 2.6.30.

Patches 1-3 should be tested first. The testing I've done shows that the
behaviour of the page allocator and congestion_wait() is closer to 2.6.30
than in the vanilla kernels.

It'd be nice to have 2 more tests, applying each patch on top and noting
any behaviour change, i.e. ideally there would be results for

o patches 1+2+3
o patches 1+2+3+4
o patches 1+2+3+4+5

Of course, any test results are welcome. The rest of the mail is the
results of my own tests.

I've tested against 2.6.31 and 2.6.32-rc6. I've somewhat replicated the
problem in Bug #14141 and believe the other bugs are variations of the same
style of problem. The basic reproduction case was;

1. X86-64 AMD Phenom and X86 P4 booted with mem=512MB. The expectation is
that any machine will do as long as it is restricted to 512MB for the
size of workload involved.

2. A crypted work partition and swap partition were created. On my
own setup, I gave no passphrase so it'd be easier to activate without
interaction, but there are multiple options. I should have taken better
notes but the setup goes something like this;

cryptsetup create -y crypt-partition /dev/sda5
pvcreate /dev/mapper/crypt-partition
vgcreate crypt-volume /dev/mapper/crypt-partition
lvcreate -L 5G -n crypt-logical crypt-volume
lvcreate -L 2G -n crypt-swap crypt-volume
mkfs -t ext3 /dev/crypt-volume/crypt-logical
mkswap /dev/crypt-volume/crypt-swap

3. With the partition mounted on /scratch, I
cd /scratch
mkdir music
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6

4. On a normal partition, I expand a tarball containing test scripts available at
http://www.csn.ul.ie/~mel/postings/latency-20091112/latency-tests-with-results.tar.gz

There are two helper programs that run as part of the test - a fake
music player and a fake gitk.

The fake music player uses rsync with bandwidth limits to start
downloading a music folder from another machine. It's bandwidth
limited to simulate playing music over NFS. I believe it generates
similar, if not identical, traffic to a music player. It occurred to me
afterwards that if one patched ogg123 to print a line when 1/10th
of a second's worth of music was played, it could be used as an
indirect measure of desktop interactivity and help pin down pesky
"audio skips" bug reports.

The fake gitk is based on observing roughly what gitk does using
strace. It loads all the logs into a large buffer and then builds a
very basic hash map of parent to child commits. The data is kept in
an in-memory buffer rather than simply streamed because just reading
the logs was not enough to generate swap. It then discards the data
and does it all over again in a loop for a small number of iterations
so the test is finite. Every time it processes a large number of
commits, it outputs a line to stdout so that stalls can be observed.
Ideal behaviour is that commits are read at a constant rate and
latencies look flat.
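
As a rough shell approximation of that loop (the real fake-gitk is in the
tarball and also keeps the raw log text in memory, which this sketch skips;
the pass count and progress interval are arbitrary):

#!/bin/bash
# Slurp the git log, build a crude parent->child map, report progress,
# then discard everything and repeat so the test is finite.
cd /scratch/linux-2.6 || exit 1
for pass in 1 2 3; do
    declare -A children=()
    count=0
    while read -r commit parents; do
        for p in $parents; do
            children[$p]="${children[$p]} $commit"
        done
        count=$((count + 1))
        if [ $((count % 10000)) -eq 0 ]; then
            echo "pass $pass: processed $count commits"
        fi
    done < <(git log --all --pretty='%H %P')
    unset children
done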

Output from the two programs is piped through another script -
latency-output. It records how far into the test it was when each
line was output and the latency since the previous line appeared.
The latency should always be very smooth. Because pipes buffer IO,
everything is run by expect_unbuffered, which is available from
expect-dev on Debian at least.
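
Conceptually it does little more than the following (the real
latency-output script is in the tarball):

# Prefix each line from the worker with elapsed test time and the latency
# since the previous line appeared.
start=$(date +%s.%N)
last=$start
while read -r line; do
    now=$(date +%s.%N)
    printf "time %8.2f latency %6.2f %s\n" \
        "$(echo "$now - $start" | bc)" \
        "$(echo "$now - $last" | bc)" \
        "$line"
    last=$now
done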

All the tests are driven via run-test.sh. While the tests run,
it records kern.log to track page allocation failures, samples
nr_writeback at regular intervals and tracks Page IO and Swap IO.
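
The monitoring side boils down to something like the following sketch
(the kern.log path is distro-specific and the counters sampled here are
only a guess at what the real driver records):

# Every few seconds, count allocation failures and sample writeback,
# page IO and swap IO counters from /proc/vmstat.
while sleep 5; do
    failures=$(grep -c "page allocation failure" /var/log/kern.log)
    awk '/^(nr_writeback|pgpgin|pgpgout|pswpin|pswpout) / { printf " %s=%s", $1, $2 }' /proc/vmstat
    echo " failures=$failures"
done >> monitor.log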

5. For running an actual test, a kernel is built and booted, the
crypted partition is activated, lvm is restarted,
/dev/crypt-volume/crypt-logical is mounted on /scratch, all
swap partitions are turned off and then the swap partition on
/dev/crypt-volume/crypt-swap is activated.

6. Run run-test.sh from the tarball (roughly as sketched below).
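
Putting steps 5 and 6 together, the sequence after boot goes something
like this (exact details depend on the setup from step 2):

cryptsetup create crypt-partition /dev/sda5
vgchange -ay crypt-volume
mount /dev/crypt-volume/crypt-logical /scratch
swapoff -a
swapon /dev/crypt-volume/crypt-swap
./run-test.sh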

To evaluate the patches, I considered three basic metrics.

o The length of time it takes fake-gitk to complete on average
o How often and how long fake-gitk stalled for
o How long was spent in congestion_wait

All generated data is in the tarball.

On X86, the results I got were

2.6.30-0000000-force-highorder Elapsed:10:59.095 Failures:0

2.6.31-0000000-force-highorder Elapsed:11:53.505 Failures:0
2.6.31-revert-8aa7e847 Elapsed:14:01.595 Failures:0
2.6.31-0000012-pgalloc-2.6.30 Elapsed:13:32.237 Failures:0
2.6.31-0000123-congestion-both Elapsed:12:44.170 Failures:0
2.6.31-0001234-kswapd-quick-recheck Elapsed:10:35.327 Failures:0
2.6.31-0012345-adjust-priority Elapsed:11:02.995 Failures:0

2.6.32-rc6-0000000-force-highorder Elapsed:18:18.562 Failures:0
2.6.32-rc6-revert-8aa7e847 Elapsed:10:29.278 Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:13:32.393 Failures:0
2.6.32-rc6-0000123-congestion-both Elapsed:14:55.265 Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:13:35.628 Failures:0
2.6.32-rc6-0012345-adjust-priority Elapsed:12:41.278 Failures:0

The 0000000-force-highorder kernels are vanilla kernels patched so that
network receive always results in an order-2 allocation. This machine wasn't
suffering page allocation failures even under this circumstance. However,
note how slow 2.6.32-rc6 is and how much the revert helps. With the patches
applied, performance is comparable.

Latencies were generally reduced with the patches applied. 2.6.32-rc6 was
particularly crazy, with long stalls measured over the duration of the test,
but with the patches applied its latencies are comparable to 2.6.30.

congestion_wait() behaviour is more in line with 2.6.30 after the
patches, with similar amounts of time being spent waiting. In general,
2.6.32-rc6-0012345-adjust-priority waits for longer than 2.6.30 or the
reverted kernels did. It also waits in more places, such as inside
shrink_inactive_list(), where it didn't before. Forcing behaviour like
2.6.30 resulted in good figures but I couldn't justify the patches with
anything more solid than "in tests, it behaves well even though it doesn't
make a lot of sense".

On X86-64, the results I got were

2.6.30-0000000-force-highorder Elapsed:09:48.545 Failures:0

2.6.31-0000000-force-highorder Elapsed:09:13.020 Failures:0
2.6.31-revert-8aa7e847 Elapsed:09:02.120 Failures:0
2.6.31-0000012-pgalloc-2.6.30 Elapsed:08:52.742 Failures:0
2.6.31-0000123-congestion-both Elapsed:08:59.375 Failures:0
2.6.31-0001234-kswapd-quick-recheck Elapsed:09:19.208 Failures:0
2.6.31-0012345-adjust-priority Elapsed:09:39.225 Failures:0

2.6.32-rc6-0000000-force-highorder Elapsed:19:38.585 Failures:5
2.6.32-rc6-revert-8aa7e847 Elapsed:17:21.257 Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:18:56.682 Failures:1
2.6.32-rc6-0000123-congestion-both Elapsed:16:08.340 Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:18:11.200 Failures:7
2.6.32-rc6-0012345-adjust-priority Elapsed:21:33.158 Failures:0

Failures were down and my impression was that they were much harder to
trigger. Performance on mainline is still not as good as 2.6.30. On
this particular machine, I was able to force performance to be in line
but not with any patch I could justify in the general case.

Latencies were slightly reduced by applying the patches against 2.6.31.
Against 2.6.32-rc6, applying the patches reduced the latencies considerably
but they are still significant. I'll continue to investigate what can be
done to improve this further.

Again, congestion_wait() is more in line with 2.6.30 when the patches
are applied. Similarly to X86, almost identical behaviour can be forced
by waiting on BLK_ASYNC_BOTH for each caller to congestion_wait() in the
reclaim and allocator paths.

Bottom line, the patches made triggering allocation failures much harder
in a number of instances, and latencies are reduced when the system
is under load. I will keep looking around this area - particularly the
performance under load on 2.6.32-rc6 - but with 2.6.32 almost out the door,
I am releasing what I have now.


2009-11-12 19:30:12

by Mel Gorman

Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as well
but currently, kswapd is not woken up after direct reclaim fails. For order-0
allocations, this makes little difference but if there is a heavy mix of
higher-order allocations that direct reclaim is failing for, it might mean
that kswapd is not rewoken for higher orders as much as it was previously.

This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdcedf6..250d055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

+restart:
wake_all_kswapd(order, zonelist, high_zoneidx);

-restart:
/*
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
--
1.6.5

2009-11-12 20:29:43

by Chris Mason

Subject: Re: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Thu, Nov 12, 2009 at 07:30:06PM +0000, Mel Gorman wrote:
> Sorry for the long delay in posting another version. Testing is extremely
> time-consuming and I wasn't getting to work on this as much as I'd have liked.
>
> Changelog since V2
> o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> testing, it made latencies even worse as kswapd slept more on high-order
> congestion causing order-0 direct reclaims.
> o Added changes to how congestion_wait() works
> o Added a number of new patches altering the behaviour of reclaim
>
> Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
> failures. A significant number of these have been high-order GFP_ATOMIC
> failures and while they are generally brushed away, there has been a large
> increase in them recently and there are a number of possible areas the
> problem could be in - core vm, page writeback and a specific driver. The
> bugs affected by this that I am aware of are;

Thanks for all the time you've spent on this one. Let me start with
some more questions about the workload ;)

> 2. A crypted work partition and swap partition was created. On my
> own setup, I gave no passphrase so it'd be easier to activate without
> interaction but there are multiple options. I should have taken better
> notes but the setup goes something like this;
>
> cryptsetup create -y crypt-partition /dev/sda5
> pvcreate /dev/mapper/crypt-partition
> vgcreate crypt-volume /dev/mapper/crypt-partition
> lvcreate -L 5G -n crypt-logical crypt-volume
> lvcreate -L 2G -n crypt-swap crypt-volume
> mkfs -t ext3 /dev/crypt-volume/crypt-logical
> mkswap /dev/crypt-volume/crypt-swap
>
> 3. With the partition mounted on /scratch, I
> cd /scratch
> mkdir music
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6
>
> 4. On a normal partition, I expand a tarball containing test scripts available at
> http://www.csn.ul.ie/~mel/postings/latency-20091112/latency-tests-with-results.tar.gz
>
> There are two helper programs that run as part of the test - a fake
> music player and a fake gitk.
>
> The fake music player uses rsync with bandwidth limits to start
> downloading a music folder from another machine. It's bandwidth
> limited to simulate playing music over NFS.

So the workload is gitk reading a git repo and a program reading data
over the network. Which part of the workload writes to disk?

-chris

2009-11-12 22:01:37

by Chris Mason

Subject: Re: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Thu, Nov 12, 2009 at 03:27:48PM -0500, Chris Mason wrote:
> On Thu, Nov 12, 2009 at 07:30:06PM +0000, Mel Gorman wrote:
> > Sorry for the long delay in posting another version. Testing is extremely
> > time-consuming and I wasn't getting to work on this as much as I'd have liked.
> >
> > Changelog since V2
> > o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> > testing, it made latencies even worse as kswapd slept more on high-order
> > congestion causing order-0 direct reclaims.
> > o Added changes to how congestion_wait() works
> > o Added a number of new patches altering the behaviour of reclaim
> >
> > Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
> > failures. A significant number of these have been high-order GFP_ATOMIC
> > failures and while they are generally brushed away, there has been a large
> > increase in them recently and there are a number of possible areas the
> > problem could be in - core vm, page writeback and a specific driver. The
> > bugs affected by this that I am aware of are;
>
> Thanks for all the time you've spent on this one. Let me start with
> some more questions about the workload ;)
>
> So the workload is gitk reading a git repo and a program reading data
> over the network. Which part of the workload writes to disk?

Sorry for the self reply, I started digging through your data (man,
that's a lot of data ;). I took another tour through dm-crypt and
things make more sense now.

dm-crypt has two different single threaded workqueues for each dm-crypt
device. The first one is meant to deal with the actual encryption and
decryption, and the second one is meant to do the IO.

So the path for a write looks something like this:

filesystem -> crypt thread -> encrypt the data -> io thread -> disk

And the path for read looks something like this:

filesystem -> io thread -> disk -> crypt thread -> decrypt data -> FS

One thread does encryption and one thread does IO, and these threads are
shared for reads and writes. The end result is that all of the sync
reads get stuck behind any async write congestion and all of the async
writes get stuck behind any sync read congestion.

It's almost like you need to check for both sync and async congestion
before you have any hopes of a new IO making progress.

The confusing part is that dm hasn't gotten any worse in this regard
since 2.6.30, but the workload here is generating more sync reads
(hopefully from gitk and swapin) than async writes (from the low
bandwidth rsync). So in general if you were to change mm/*.c to wait
for sync congestion instead of async, things should appear better.

The punch line is that the btrfs guy thinks we can solve all of this with
just one more thread. If we change dm-crypt to have a thread dedicated
to sync IO and a thread dedicated to async IO the system should smooth
out.

-chris

2009-11-13 02:48:39

by Chris Mason

Subject: Re: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Thu, Nov 12, 2009 at 05:00:05PM -0500, Chris Mason wrote:

[ ...]

>
> The punch line is that the btrfs guy thinks we can solve all of this with
> just one more thread. If we change dm-crypt to have a thread dedicated
> to sync IO and a thread dedicated to async IO the system should smooth
> out.

This is pretty likely to set your dm data on fire. It's only for Mel
who starts his script w/mkfs.

It adds the second thread and more importantly makes sure the kcryptd
thread doesn't get stuck waiting for requests.

-chris

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ed10381..295ffeb 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -94,6 +94,7 @@ struct crypt_config {
struct bio_set *bs;

struct workqueue_struct *io_queue;
+ struct workqueue_struct *async_io_queue;
struct workqueue_struct *crypt_queue;

/*
@@ -691,7 +692,10 @@ static void kcryptd_queue_io(struct dm_crypt_io *io)
struct crypt_config *cc = io->target->private;

INIT_WORK(&io->work, kcryptd_io);
- queue_work(cc->io_queue, &io->work);
+ if (io->base_bio->bi_rw & (1 << BIO_RW_SYNCIO))
+ queue_work(cc->io_queue, &io->work);
+ else
+ queue_work(cc->async_io_queue, &io->work);
}

static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io,
@@ -759,8 +763,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)

/* Encryption was already finished, submit io now */
if (crypt_finished) {
- kcryptd_crypt_write_io_submit(io, r, 0);
-
+ kcryptd_crypt_write_io_submit(io, r, 1);
/*
* If there was an error, do not try next fragments.
* For async, error is processed in async handler.
@@ -1120,6 +1123,12 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
} else
cc->iv_mode = NULL;

+ cc->async_io_queue = create_singlethread_workqueue("kcryptd_async_io");
+ if (!cc->async_io_queue) {
+ ti->error = "Couldn't create kcryptd io queue";
+ goto bad_async_io_queue;
+ }
+
cc->io_queue = create_singlethread_workqueue("kcryptd_io");
if (!cc->io_queue) {
ti->error = "Couldn't create kcryptd io queue";
@@ -1139,6 +1148,8 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
bad_crypt_queue:
destroy_workqueue(cc->io_queue);
bad_io_queue:
+ destroy_workqueue(cc->async_io_queue);
+bad_async_io_queue:
kfree(cc->iv_mode);
bad_ivmode_string:
dm_put_device(ti, cc->dev);
@@ -1166,6 +1177,7 @@ static void crypt_dtr(struct dm_target *ti)
struct crypt_config *cc = (struct crypt_config *) ti->private;

destroy_workqueue(cc->io_queue);
+ destroy_workqueue(cc->async_io_queue);
destroy_workqueue(cc->crypt_queue);

if (cc->req)

2009-11-13 13:00:08

by Chris Mason

Subject: [PATCH] make crypto unplug fix V3

This is still likely to set your dm data on fire. It is only meant for
testers that start with mkfs and don't have any valuable dm data.

It includes my patch from last night, along with changes to force dm to
unplug when its IO queues empty.

The problem goes like this:

Process: submit read bio
dm: put bio onto work queue
process: unplug
dm: work queue finds bio, does a generic_make_request

The end result is that we miss the unplug completely. dm-crypt needs to
unplug for sync bios. This patch also changes it to unplug whenever the
queue is empty, which is far from ideal but better than missing the
unplugs.

This doesn't completely fix the io stalls I'm seeing with dm-crypt, but it's
my best guess. If it works, I'll break it up and submit it for real to
the dm people.

-chris

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ed10381..729ae01 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -94,8 +94,12 @@ struct crypt_config {
struct bio_set *bs;

struct workqueue_struct *io_queue;
+ struct workqueue_struct *async_io_queue;
struct workqueue_struct *crypt_queue;

+ atomic_t sync_bios_in_queue;
+ atomic_t async_bios_in_queue;
+
/*
* crypto related data
*/
@@ -679,11 +683,29 @@ static void kcryptd_io_write(struct dm_crypt_io *io)
static void kcryptd_io(struct work_struct *work)
{
struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+ struct crypt_config *cc = io->target->private;
+ int zero_sync = 0;
+ int zero_async = 0;
+ int was_sync = 0;
+
+ if (io->base_bio->bi_rw & (1 << BIO_RW_SYNCIO)) {
+ zero_sync = atomic_dec_and_test(&cc->sync_bios_in_queue);
+ was_sync = 1;
+ } else
+ zero_async = atomic_dec_and_test(&cc->async_bios_in_queue);

if (bio_data_dir(io->base_bio) == READ)
kcryptd_io_read(io);
else
kcryptd_io_write(io);
+
+ if ((was_sync && zero_sync) ||
+ (!was_sync && zero_async &&
+ atomic_read(&cc->sync_bios_in_queue) == 0)) {
+ struct backing_dev_info *bdi;
+ bdi = blk_get_backing_dev_info(io->base_bio->bi_bdev);
+ blk_run_backing_dev(bdi, NULL);
+ }
}

static void kcryptd_queue_io(struct dm_crypt_io *io)
@@ -691,7 +713,13 @@ static void kcryptd_queue_io(struct dm_crypt_io *io)
struct crypt_config *cc = io->target->private;

INIT_WORK(&io->work, kcryptd_io);
- queue_work(cc->io_queue, &io->work);
+ if (io->base_bio->bi_rw & (1 << BIO_RW_SYNCIO)) {
+ atomic_inc(&cc->sync_bios_in_queue);
+ queue_work(cc->io_queue, &io->work);
+ } else {
+ atomic_inc(&cc->async_bios_in_queue);
+ queue_work(cc->async_io_queue, &io->work);
+ }
}

static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io,
@@ -759,8 +787,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)

/* Encryption was already finished, submit io now */
if (crypt_finished) {
- kcryptd_crypt_write_io_submit(io, r, 0);
-
+ kcryptd_crypt_write_io_submit(io, r, 1);
/*
* If there was an error, do not try next fragments.
* For async, error is processed in async handler.
@@ -1120,6 +1147,15 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
} else
cc->iv_mode = NULL;

+ atomic_set(&cc->sync_bios_in_queue, 0);
+ atomic_set(&cc->async_bios_in_queue, 0);
+
+ cc->async_io_queue = create_singlethread_workqueue("kcryptd_async_io");
+ if (!cc->async_io_queue) {
+ ti->error = "Couldn't create kcryptd io queue";
+ goto bad_async_io_queue;
+ }
+
cc->io_queue = create_singlethread_workqueue("kcryptd_io");
if (!cc->io_queue) {
ti->error = "Couldn't create kcryptd io queue";
@@ -1139,6 +1175,8 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
bad_crypt_queue:
destroy_workqueue(cc->io_queue);
bad_io_queue:
+ destroy_workqueue(cc->async_io_queue);
+bad_async_io_queue:
kfree(cc->iv_mode);
bad_ivmode_string:
dm_put_device(ti, cc->dev);
@@ -1166,6 +1204,7 @@ static void crypt_dtr(struct dm_target *ti)
struct crypt_config *cc = (struct crypt_config *) ti->private;

destroy_workqueue(cc->io_queue);
+ destroy_workqueue(cc->async_io_queue);
destroy_workqueue(cc->crypt_queue);

if (cc->req)

2009-11-13 13:44:04

by Mel Gorman

Subject: Re: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Thu, Nov 12, 2009 at 05:00:05PM -0500, Chris Mason wrote:
> On Thu, Nov 12, 2009 at 03:27:48PM -0500, Chris Mason wrote:
> > On Thu, Nov 12, 2009 at 07:30:06PM +0000, Mel Gorman wrote:
> > > Sorry for the long delay in posting another version. Testing is extremely
> > > time-consuming and I wasn't getting to work on this as much as I'd have liked.
> > >
> > > Changelog since V2
> > > o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> > > testing, it made latencies even worse as kswapd slept more on high-order
> > > congestion causing order-0 direct reclaims.
> > > o Added changes to how congestion_wait() works
> > > o Added a number of new patches altering the behaviour of reclaim
> > >
> > > Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
> > > failures. A significant number of these have been high-order GFP_ATOMIC
> > > failures and while they are generally brushed away, there has been a large
> > > increase in them recently and there are a number of possible areas the
> > > problem could be in - core vm, page writeback and a specific driver. The
> > > bugs affected by this that I am aware of are;
> >
> > Thanks for all the time you've spent on this one. Let me start with
> > some more questions about the workload ;)
> >
> > So the workload is gitk reading a git repo and a program reading data
> > over the network. Which part of the workload writes to disk?
>
> Sorry for the self reply, I started digging through your data (man,
> that's a lot of data ;).

Yeah, sorry about that. Because I lacked a credible explanation as to
why waiting on sync really made such a difference, I had little choice
but to punt everything I had for people to dig through.

To be clear, I'm not actually running gitk. The fake-gitk is reading the
commits into memory and building a tree in a similar fashion to what gitk
does. I didn't want to use gitk itself because there wasn't a way of measuring
whether it was stalling or not, other than looking at it and making a guess.

> I took another tour through dm-crypt and
> things make more sense now.
>
> dm-crypt has two different single threaded workqueues for each dm-crypt
> device. The first one is meant to deal with the actual encryption and
> decryption, and the second one is meant to do the IO.
>
> So the path for a write looks something like this:
>
> filesystem -> crypt thread -> encrypt the data -> io thread -> disk
>
> And the path for read looks something like this:
>
> filesystem -> io thread -> disk -> crypt thread -> decrypt data -> FS
>
> One thread does encryption and one thread does IO, and these threads are
> shared for reads and writes. The end result is that all of the sync
> reads get stuck behind any async write congestion and all of the async
> writes get stuck behind any sync read congestion.
>
> It's almost like you need to check for both sync and async congestion
> before you have any hopes of a new IO making progress.
>
> The confusing part is that dm hasn't gotten any worse in this regard
> since 2.6.30 but the workload here is generating more sync reads
> (hopefully from gitk and swapin) than async writes (from the low
> bandwidth rsync). So in general if you were to change mm/*.c wait
> for sync congestion instead of async, things should appear better.
>

Thanks very much for that explanation. It makes a lot of sense and
explains why waiting on sync-congestion made such a difference on the
test setup.

> The punch line is that the btrfs guy thinks we can solve all of this with
> just one more thread. If we change dm-crypt to have a thread dedicated
> to sync IO and a thread dedicated to async IO the system should smooth
> out.
>

I see you have posted another patch so I'll test that out first before
looking into that.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 17:34:48

by Mel Gorman

Subject: Re: [PATCH] make crypto unplug fix V3

On Fri, Nov 13, 2009 at 07:58:12AM -0500, Chris Mason wrote:
> This is still likely to set your dm data on fire. It is only meant for
> testers that start with mkfs and don't have any valuable dm data.
>

The good news is that my room remains fire-free. Despite swap also
running from dm-crypt, I had no corruption or instability issues.

Here is an updated set of results for fake-gitk running.

X86
2.6.30-0000000-force-highorder Elapsed:12:08.908 Failures:0
2.6.31-0000000-force-highorder Elapsed:10:56.283 Failures:0
2.6.31-0000006-dm-crypt-unplug Elapsed:11:51.653 Failures:0
2.6.31-0000012-pgalloc-2.6.30 Elapsed:12:26.587 Failures:0
2.6.31-0000123-congestion-both Elapsed:10:55.298 Failures:0
2.6.31-0001234-kswapd-quick-recheck Elapsed:18:01.523 Failures:0
2.6.31-0123456-dm-crypt-unplug Elapsed:10:45.720 Failures:0
2.6.31-revert-8aa7e847 Elapsed:15:08.020 Failures:0
2.6.32-rc6-0000000-force-highorder Elapsed:16:20.765 Failures:4
2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:13:42.920 Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:16:13.380 Failures:1
2.6.32-rc6-0000123-congestion-both Elapsed:18:39.118 Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:15:04.398 Failures:0
2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:12:50.438 Failures:0
2.6.32-rc6-revert-8aa7e847 Elapsed:20:50.888 Failures:0

X86-64
2.6.30-0000000-force-highorder Elapsed:10:37.300 Failures:0
2.6.31-0000000-force-highorder Elapsed:08:49.338 Failures:0
2.6.31-0000006-dm-crypt-unplug Elapsed:09:37.840 Failures:0
2.6.31-0000012-pgalloc-2.6.30 Elapsed:15:49.690 Failures:0
2.6.31-0000123-congestion-both Elapsed:09:18.790 Failures:0
2.6.31-0001234-kswapd-quick-recheck Elapsed:08:39.268 Failures:0
2.6.31-0123456-dm-crypt-unplug Elapsed:08:20.965 Failures:0
2.6.31-revert-8aa7e847 Elapsed:08:07.457 Failures:0
2.6.32-rc6-0000000-force-highorder Elapsed:18:29.103 Failures:1
2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:25:53.515 Failures:3
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:19:55.570 Failures:6
2.6.32-rc6-0000123-congestion-both Elapsed:17:29.255 Failures:2
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:14:41.068 Failures:0
2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:15:48.028 Failures:1
2.6.32-rc6-revert-8aa7e847 Elapsed:14:48.647 Failures:0

The numbering in the kernel name indicates which patches are applied. I tested
the dm-crypt patch both in isolation and in combination with the patches
in this series.

Basically, the dm-crypt-unplug patch makes a small difference in performance
overall, mostly slight gains and losses. There was one massive regression
with the dm-crypt patch applied to 2.6.32-rc6 but at the moment, I don't
know what caused it.

In general, the patch reduces the amount of time direct reclaimers are
spending on congestion_wait.

> It includes my patch from last night, along with changes to force dm to
> unplug when its IO queues empty.
>
> The problem goes like this:
>
> Process: submit read bio
> dm: put bio onto work queue
> process: unplug
> dm: work queue finds bio, does a generic_make_request
>
> The end result is that we miss the unplug completely. dm-crypt needs to
> unplug for sync bios. This patch also changes it to unplug whenever the
> queue is empty, which is far from ideal but better than missing the
> unplugs.
>
> This doesn't completely fix io stalls I'm seeing with dm-crypt, but its
> my best guess. If it works, I'll break it up and submit for real to
> the dm people.
>

Out of curiosity, how are you measuring IO stalls? In the tests I'm doing,
the worker processes output their progress and it should be at a steady
rate. I considered a stall to be an excessive delay between updates, which
is a pretty indirect measure.

> -chris
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index ed10381..729ae01 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -94,8 +94,12 @@ struct crypt_config {
> struct bio_set *bs;
>
> struct workqueue_struct *io_queue;
> + struct workqueue_struct *async_io_queue;
> struct workqueue_struct *crypt_queue;
>
> + atomic_t sync_bios_in_queue;
> + atomic_t async_bios_in_queue;
> +
> /*
> * crypto related data
> */
> @@ -679,11 +683,29 @@ static void kcryptd_io_write(struct dm_crypt_io *io)
> static void kcryptd_io(struct work_struct *work)
> {
> struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
> + struct crypt_config *cc = io->target->private;
> + int zero_sync = 0;
> + int zero_async = 0;
> + int was_sync = 0;
> +
> + if (io->base_bio->bi_rw & (1 << BIO_RW_SYNCIO)) {
> + zero_sync = atomic_dec_and_test(&cc->sync_bios_in_queue);
> + was_sync = 1;
> + } else
> + zero_async = atomic_dec_and_test(&cc->async_bios_in_queue);
>
> if (bio_data_dir(io->base_bio) == READ)
> kcryptd_io_read(io);
> else
> kcryptd_io_write(io);
> +
> + if ((was_sync && zero_sync) ||
> + (!was_sync && zero_async &&
> + atomic_read(&cc->sync_bios_in_queue) == 0)) {
> + struct backing_dev_info *bdi;
> + bdi = blk_get_backing_dev_info(io->base_bio->bi_bdev);
> + blk_run_backing_dev(bdi, NULL);
> + }
> }
>
> static void kcryptd_queue_io(struct dm_crypt_io *io)
> @@ -691,7 +713,13 @@ static void kcryptd_queue_io(struct dm_crypt_io *io)
> struct crypt_config *cc = io->target->private;
>
> INIT_WORK(&io->work, kcryptd_io);
> - queue_work(cc->io_queue, &io->work);
> + if (io->base_bio->bi_rw & (1 << BIO_RW_SYNCIO)) {
> + atomic_inc(&cc->sync_bios_in_queue);
> + queue_work(cc->io_queue, &io->work);
> + } else {
> + atomic_inc(&cc->async_bios_in_queue);
> + queue_work(cc->async_io_queue, &io->work);
> + }
> }
>
> static void kcryptd_crypt_write_io_submit(struct dm_crypt_io *io,
> @@ -759,8 +787,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
>
> /* Encryption was already finished, submit io now */
> if (crypt_finished) {
> - kcryptd_crypt_write_io_submit(io, r, 0);
> -
> + kcryptd_crypt_write_io_submit(io, r, 1);
> /*
> * If there was an error, do not try next fragments.
> * For async, error is processed in async handler.
> @@ -1120,6 +1147,15 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> } else
> cc->iv_mode = NULL;
>
> + atomic_set(&cc->sync_bios_in_queue, 0);
> + atomic_set(&cc->async_bios_in_queue, 0);
> +
> + cc->async_io_queue = create_singlethread_workqueue("kcryptd_async_io");
> + if (!cc->async_io_queue) {
> + ti->error = "Couldn't create kcryptd io queue";
> + goto bad_async_io_queue;
> + }
> +
> cc->io_queue = create_singlethread_workqueue("kcryptd_io");
> if (!cc->io_queue) {
> ti->error = "Couldn't create kcryptd io queue";
> @@ -1139,6 +1175,8 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> bad_crypt_queue:
> destroy_workqueue(cc->io_queue);
> bad_io_queue:
> + destroy_workqueue(cc->async_io_queue);
> +bad_async_io_queue:
> kfree(cc->iv_mode);
> bad_ivmode_string:
> dm_put_device(ti, cc->dev);
> @@ -1166,6 +1204,7 @@ static void crypt_dtr(struct dm_target *ti)
> struct crypt_config *cc = (struct crypt_config *) ti->private;
>
> destroy_workqueue(cc->io_queue);
> + destroy_workqueue(cc->async_io_queue);
> destroy_workqueue(cc->crypt_queue);
>
> if (cc->req)
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 18:42:12

by Chris Mason

Subject: Re: [PATCH] make crypto unplug fix V3

On Fri, Nov 13, 2009 at 05:34:46PM +0000, Mel Gorman wrote:
> On Fri, Nov 13, 2009 at 07:58:12AM -0500, Chris Mason wrote:
> > This is still likely to set your dm data on fire. It is only meant for
> > testers that start with mkfs and don't have any valuable dm data.
> >
>
> The good news is that my room remains fire-free. Despite swap also
> running from dm-crypt, I had no corruption or instability issues.

Ok, definitely not so convincing I'd try and shove it into a late rc.

>
> Here is an updated set of results for fake-gitk running.
>
> X86
> 2.6.30-0000000-force-highorder Elapsed:12:08.908 Failures:0
> 2.6.31-0000000-force-highorder Elapsed:10:56.283 Failures:0
> 2.6.31-0000006-dm-crypt-unplug Elapsed:11:51.653 Failures:0
> 2.6.31-0000012-pgalloc-2.6.30 Elapsed:12:26.587 Failures:0
> 2.6.31-0000123-congestion-both Elapsed:10:55.298 Failures:0
> 2.6.31-0001234-kswapd-quick-recheck Elapsed:18:01.523 Failures:0
> 2.6.31-0123456-dm-crypt-unplug Elapsed:10:45.720 Failures:0
> 2.6.31-revert-8aa7e847 Elapsed:15:08.020 Failures:0
> 2.6.32-rc6-0000000-force-highorder Elapsed:16:20.765 Failures:4
> 2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:13:42.920 Failures:0
> 2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:16:13.380 Failures:1
> 2.6.32-rc6-0000123-congestion-both Elapsed:18:39.118 Failures:0
> 2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:15:04.398 Failures:0
> 2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:12:50.438 Failures:0
> 2.6.32-rc6-revert-8aa7e847 Elapsed:20:50.888 Failures:0
>
> X86-64
> 2.6.30-0000000-force-highorder Elapsed:10:37.300 Failures:0
> 2.6.31-0000000-force-highorder Elapsed:08:49.338 Failures:0
> 2.6.31-0000006-dm-crypt-unplug Elapsed:09:37.840 Failures:0
> 2.6.31-0000012-pgalloc-2.6.30 Elapsed:15:49.690 Failures:0
> 2.6.31-0000123-congestion-both Elapsed:09:18.790 Failures:0
> 2.6.31-0001234-kswapd-quick-recheck Elapsed:08:39.268 Failures:0
> 2.6.31-0123456-dm-crypt-unplug Elapsed:08:20.965 Failures:0
> 2.6.31-revert-8aa7e847 Elapsed:08:07.457 Failures:0
> 2.6.32-rc6-0000000-force-highorder Elapsed:18:29.103 Failures:1
> 2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:25:53.515 Failures:3
> 2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:19:55.570 Failures:6
> 2.6.32-rc6-0000123-congestion-both Elapsed:17:29.255 Failures:2
> 2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:14:41.068 Failures:0
> 2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:15:48.028 Failures:1
> 2.6.32-rc6-revert-8aa7e847 Elapsed:14:48.647 Failures:0
>
> The numbering in the kernel indicates what patches are applied. I tested
> the dm-crypt patch both in isolation and in combination with the patches
> in this series.
>
> Basically, the dm-crypt-unplug makes a small difference in performance
> overall, mostly slight gains and losses. There was one massive regression
> with the dm-crypt patch applied to 2.6.32-rc6 but at the moment, I don't
> know what that is.

How consistent are your numbers between runs? I was trying to match
this up with your last email and things were pretty different.

>
> In general, the patch reduces the amount of time direct reclaimers are
> spending on congestion_wait.
>
> > It includes my patch from last night, along with changes to force dm to
> > unplug when its IO queues empty.
> >
> > The problem goes like this:
> >
> > Process: submit read bio
> > dm: put bio onto work queue
> > process: unplug
> > dm: work queue finds bio, does a generic_make_request
> >
> > The end result is that we miss the unplug completely. dm-crypt needs to
> > unplug for sync bios. This patch also changes it to unplug whenever the
> > queue is empty, which is far from ideal but better than missing the
> > unplugs.
> >
> > This doesn't completely fix io stalls I'm seeing with dm-crypt, but its
> > my best guess. If it works, I'll break it up and submit for real to
> > the dm people.
> >
>
> Out of curiousity, how are you measuring IO stalls? In the tests I'm doing,
> the worker processes output their progress and it should be at a steady
> rate. I considered a stall to be an excessive delay between updates which
> is a pretty indirect measure.

I just setup a crypto disk and did dd if=/dev/zero of=/mnt/foo bs=1M

If you watch vmstat 1, there's supposed to be a constant stream of IO to
the disk. If a whole second goes by with zero IO, we're doing something
wrong; I get a number of multi-second stalls where we are just waiting
for IO to happen.
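
Something like the following sketch catches those dead seconds (it assumes
vmstat's default column layout, where bi and bo are fields 9 and 10):

dd if=/dev/zero of=/mnt/foo bs=1M &
# Skip the two header lines, then flag any one-second sample where both
# block-in and block-out are zero.
vmstat 1 | awk 'NR > 2 && $9 == 0 && $10 == 0 { print "second " NR-2 ": no IO" }'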

Most of the time I was able to catch a sysrq-w for it; someone was
waiting on a read to finish. It isn't completely clear to me if the
unplugging is working properly.

-chris

2009-11-13 20:29:16

by Mel Gorman

Subject: Re: [PATCH] make crypto unplug fix V3

On Fri, Nov 13, 2009 at 01:40:04PM -0500, Chris Mason wrote:
> On Fri, Nov 13, 2009 at 05:34:46PM +0000, Mel Gorman wrote:
> > On Fri, Nov 13, 2009 at 07:58:12AM -0500, Chris Mason wrote:
> > > This is still likely to set your dm data on fire. It is only meant for
> > > testers that start with mkfs and don't have any valuable dm data.
> > >
> >
> > The good news is that my room remains fire-free. Despite swap also
> > running from dm-crypt, I had no corruption or instability issues.
>
> Ok, definitely not so convincing I'd try and shove it into a late rc.
>
> >
> > Here is an updated set of results for fake-gitk running.
> >
> > X86
> > 2.6.30-0000000-force-highorder Elapsed:12:08.908 Failures:0
> > 2.6.31-0000000-force-highorder Elapsed:10:56.283 Failures:0
> > 2.6.31-0000006-dm-crypt-unplug Elapsed:11:51.653 Failures:0
> > 2.6.31-0000012-pgalloc-2.6.30 Elapsed:12:26.587 Failures:0
> > 2.6.31-0000123-congestion-both Elapsed:10:55.298 Failures:0
> > 2.6.31-0001234-kswapd-quick-recheck Elapsed:18:01.523 Failures:0
> > 2.6.31-0123456-dm-crypt-unplug Elapsed:10:45.720 Failures:0
> > 2.6.31-revert-8aa7e847 Elapsed:15:08.020 Failures:0
> > 2.6.32-rc6-0000000-force-highorder Elapsed:16:20.765 Failures:4
> > 2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:13:42.920 Failures:0
> > 2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:16:13.380 Failures:1
> > 2.6.32-rc6-0000123-congestion-both Elapsed:18:39.118 Failures:0
> > 2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:15:04.398 Failures:0
> > 2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:12:50.438 Failures:0
> > 2.6.32-rc6-revert-8aa7e847 Elapsed:20:50.888 Failures:0
> >
> > X86-64
> > 2.6.30-0000000-force-highorder Elapsed:10:37.300 Failures:0
> > 2.6.31-0000000-force-highorder Elapsed:08:49.338 Failures:0
> > 2.6.31-0000006-dm-crypt-unplug Elapsed:09:37.840 Failures:0
> > 2.6.31-0000012-pgalloc-2.6.30 Elapsed:15:49.690 Failures:0
> > 2.6.31-0000123-congestion-both Elapsed:09:18.790 Failures:0
> > 2.6.31-0001234-kswapd-quick-recheck Elapsed:08:39.268 Failures:0
> > 2.6.31-0123456-dm-crypt-unplug Elapsed:08:20.965 Failures:0
> > 2.6.31-revert-8aa7e847 Elapsed:08:07.457 Failures:0
> > 2.6.32-rc6-0000000-force-highorder Elapsed:18:29.103 Failures:1
> > 2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:25:53.515 Failures:3
> > 2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:19:55.570 Failures:6
> > 2.6.32-rc6-0000123-congestion-both Elapsed:17:29.255 Failures:2
> > 2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:14:41.068 Failures:0
> > 2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:15:48.028 Failures:1
> > 2.6.32-rc6-revert-8aa7e847 Elapsed:14:48.647 Failures:0
> >
> > The numbering in the kernel indicates what patches are applied. I tested
> > the dm-crypt patch both in isolation and in combination with the patches
> > in this series.
> >
> > Basically, the dm-crypt-unplug makes a small difference in performance
> > overall, mostly slight gains and losses. There was one massive regression
> > with the dm-crypt patch applied to 2.6.32-rc6 but at the moment, I don't
> > know what that is.
>
> How consistent are your numbers between runs? I was trying to match
> this up with your last email and things were pretty different.
>

The figures from the first mail were based on kernels that were not
instrumented. It so happened that this run was based on an instrumented
kernel to get the congestion_wait figures so the results are different.

However, the results vary a lot. In some cases, it will spike just as
2.6.32-rc6-0000006-dm-crypt-unplug did. The reported figure is the
average of 4 fake-gitk runs. I don't have the standard deviation handy
but it's high.

> >
> > In general, the patch reduces the amount of time direct reclaimers are
> > spending on congestion_wait.
> >
> > > It includes my patch from last night, along with changes to force dm to
> > > unplug when its IO queues empty.
> > >
> > > The problem goes like this:
> > >
> > > Process: submit read bio
> > > dm: put bio onto work queue
> > > process: unplug
> > > dm: work queue finds bio, does a generic_make_request
> > >
> > > The end result is that we miss the unplug completely. dm-crypt needs to
> > > unplug for sync bios. This patch also changes it to unplug whenever the
> > > queue is empty, which is far from ideal but better than missing the
> > > unplugs.
> > >
> > > This doesn't completely fix io stalls I'm seeing with dm-crypt, but its
> > > my best guess. If it works, I'll break it up and submit for real to
> > > the dm people.
> > >
> >
> > Out of curiousity, how are you measuring IO stalls? In the tests I'm doing,
> > the worker processes output their progress and it should be at a steady
> > rate. I considered a stall to be an excessive delay between updates which
> > is a pretty indirect measure.
>
> I just setup a crypto disk and did dd if=/dev/zero of=/mnt/foo bs=1M
>
> If you watch vmstat 1, there's supposed to be a constant steam of IO to
> the disk. If a whole second goes by with zero IO, we're doing something
> wrong, I get a number of multi-second stalls where we are just waiting
> for IO to happen.
>

Ok, cool.

> Most of the time I was able to catch a sysrq-w for it, someone was
> waiting on a read to finish. It isn't completely clear to me if the
> unplugging is working properly.
>

The machines I'm using are now tied up doing the same tests with dm-crypt,
and then I need to rerun dm-crypt with the changed patches. It'll be
early next week before the machines free up again for me to investigate your
patch more, but initial results look promising at least.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-16 16:46:09

by Milan Broz

Subject: Re: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On 11/13/2009 03:46 AM, Chris Mason wrote:
> On Thu, Nov 12, 2009 at 05:00:05PM -0500, Chris Mason wrote:
>
> [ ...]
>
>>
>> The punch line is that the btrfs guy thinks we can solve all of this with
>> just one more thread. If we change dm-crypt to have a thread dedicated
>> to sync IO and a thread dedicated to async IO the system should smooth
>> out.

Please, can you cc DM maintainers with these kind of patches? dm-devel list at least.

Note that the crypt requests can already be processed synchronously or
asynchronously, depending on the crypto module used (async in the case of
some hw acceleration).

Adding another queue makes the situation more complicated, and because the
crypt requests can be queued in the crypto layer I am not sure that this
solution will help in this situation at all.
(Try to run that with AES-NI acceleration for example.)


Milan
--
[email protected]

2009-11-16 18:37:39

by Chris Mason

Subject: Re: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Mon, Nov 16, 2009 at 05:44:07PM +0100, Milan Broz wrote:
> On 11/13/2009 03:46 AM, Chris Mason wrote:
> > On Thu, Nov 12, 2009 at 05:00:05PM -0500, Chris Mason wrote:
> >
> > [ ...]
> >
> >>
> >> The punch line is that the btrfs guy thinks we can solve all of this with
> >> just one more thread. If we change dm-crypt to have a thread dedicated
> >> to sync IO and a thread dedicated to async IO the system should smooth
> >> out.
>
> Please, can you cc DM maintainers with these kind of patches? dm-devel list at least.
>

Well, my current patch is a hack. If I had come up with a proven theory
(hopefully Mel can prove it ;), it definitely would have gone through
the dm-devel lists.

> Note that the crypt requests can be already processed synchronously or asynchronously,
> depending on used crypto module (async it is in the case of some hw acceleration).
>
> Adding another queue make the situation more complicated and because the crypt
> requests can be queued in crypto layer I am not sure that this solution will help
> in this situation at all.
> (Try to run that with AES-NI acceleration for example.)

The problem is that async threads still imply a kind of ordering.
If there's a fifo serviced by one thread or 10, the latency ramifications
are very similar for a new entry on the list. We have to wait for a
large portion of the low-prio items in order to service a high prio
item.

With a queue dedicated to sync requests and one dedicated to async,
you'll get better read latencies. Btrfs has a similar problem around
the crc helper threads and it ends up solving things with two different
lists (high and low prio) processed by one thread.

-chris

2009-11-19 08:12:19

by Mel Gorman

Subject: Re: [PATCH 0/7] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Mon, Nov 16, 2009 at 01:36:13PM -0500, Chris Mason wrote:
> On Mon, Nov 16, 2009 at 05:44:07PM +0100, Milan Broz wrote:
> > On 11/13/2009 03:46 AM, Chris Mason wrote:
> > > On Thu, Nov 12, 2009 at 05:00:05PM -0500, Chris Mason wrote:
> > >
> > > [ ...]
> > >
> > >>
> > >> The punch line is that the btrfs guy thinks we can solve all of this with
> > >> just one more thread. If we change dm-crypt to have a thread dedicated
> > >> to sync IO and a thread dedicated to async IO the system should smooth
> > >> out.
> >
> > Please, can you cc DM maintainers with these kind of patches? dm-devel list at least.
> >
>
> Well, my current patch is a hack. If I had come up with a proven theory
> (hopefully Mel can prove it ;), it definitely would have gone through
> the dm-devel lists.
>

I can't prove it for sure but the workload might not be targeted enough
to show better or worse read latencies.

I adjusted the workload to run fake-gitk multiple times to get a better
sense of the deviation between runs.

On X86-64, the timings were
2.6.30-0000000-force-highorder Elapsed:10:52.218(stddev:008.085) Failures:0
2.6.31-0000000-force-highorder Elapsed:11:32.258(stddev:130.779) Failures:0
2.6.31-0012345-kswapd-stay-awake-when-min Elapsed:09:34.662(stddev:022.239) Failures:0
2.6.31-0123456-dm-crypt-unplug Elapsed:10:28.718(stddev:060.897) Failures:0
2.6.32-rc6-0000000-force-highorder Elapsed:27:53.686(stddev:207.508) Failures:37
2.6.32-rc6-0012345-kswapd-stay-awake-when-min Elapsed:27:26.735(stddev:221.214) Failures:6
2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:27:35.462(stddev:205.017) Failures:4

On X86, they were
2.6.30-0000000-force-highorder Elapsed:13:36.768(stddev:019.514) Failures:0
2.6.31-0000000-force-highorder Elapsed:16:27.922(stddev:134.839) Failures:0
2.6.31-0000006-dm-crypt-unplug Elapsed:15:47.160(stddev:183.488) Failures:0
2.6.31-0012345-kswapd-stay-awake-when-min Elapsed:18:32.458(stddev:182.164) Failures:0
2.6.31-0123456-dm-crypt-unplug Elapsed:17:07.482(stddev:210.404) Failures:0
2.6.32-rc6-0000000-force-highorder Elapsed:26:08.763(stddev:123.926) Failures:4
2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:17:57.550(stddev:254.412) Failures:1
2.6.32-rc6-0012345-kswapd-stay-awake-when-min Elapsed:25:03.435(stddev:234.685) Failures:1
2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:25:21.382(stddev:211.252) Failures:0

(I forgot to queue up the dm-crypt patches on their own for X86-64 which
is why the results are missing).

While the dm-crypt patch shows small differences, they are well within
the noise for each run of fake-gitk so I can't draw any major conclusion
from it.

On X86 for 2.6.31, roughly the same amount of time is spent in
congestion_wait() with or without the patch. On 2.6.32-rc6, the time
kswapd spends congested on the ASYNC queue is reduced by about 20%, both
compared against mainline and against a kernel with the other patches in
the series applied. There is very little difference in congestion on the
SYNC queue.

On X86-64 for 2.6.31, the story is slightly different. I don't think
it's an architecture thing because the X86-64 machine has twice as many
cores as the X86 test machine. Here, the time spent in congestion_wait()
on the ASYNC queue remains roughly the same but the time spent on the SYNC
queue for direct reclaim is reduced by almost a third. Against
2.6.32-rc6, there was very little difference.

Again, it's hard to draw solid conclusions from this. I know from other
testing that the low_latency tunable for the IO scheduler is an important
factor for the performance of this test on 2.6.32-rc6, so disabling it
might show a clearer picture, but right now I can't say for sure it's an
improvement.

> > Note that the crypt requests can be already processed synchronously or asynchronously,
> > depending on used crypto module (async it is in the case of some hw acceleration).
> >
> > Adding another queue make the situation more complicated and because the crypt
> > requests can be queued in crypto layer I am not sure that this solution will help
> > in this situation at all.
> > (Try to run that with AES-NI acceleration for example.)
>
> The problem is that async threads still imply a kind of ordering.
> If there's a fifo serviced by one thread or 10, the latency ramifications
> are very similar for a new entry on the list. We have to wait for a
> large portion of the low-prio items in order to service a high prio
> item.
>
> With a queue dedicated to sync requests and one dedicated to async,
> you'll get better read latencies. Btrfs has a similar problem around
> the crc helper threads and it ends up solving things with two different
> lists (high and low prio) processed by one thread.
>
> -chris
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab