2011-03-08 16:45:50

by Mario 'BitKoenig' Holbe

Subject: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

Hello,

dm-crypt in 2.6.38 changed to per-CPU workqueues to increase its
performance by parallelizing encryption across multiple CPUs.
This modification seems to cause (massive) performance drops for
multiple parallel dm-crypt instances...

I'm running a 4-disk RAID0 on top of 4 independent dm-crypt(aes-xts)
devices on a Core2Quad 3GHz. This setup did overcome the single-CPU
limitation from previous versions and utilized all 4 cores for
encryption.
The throughput of this array drops from 282MB/s sustained read (dd,
single process) with 2.6.37.3 down to 133MB/s with 2.6.38-rc8 (which
nearly equals the single-disk throughput of 128MB/s - just in case this
matters).

This indicates way less parallelization now with 2.6.38 than before.
I don't think this was intentional :)

The dm-crypt per-CPU workqueues got introduced in 2.6.38 with
c029772125594e31eb1a5ad9e0913724ed9891f2
Reverting dm-crypt.c to the version before this commit regains the same
throughput as with 2.6.37.
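
For reference, reverting just that file can be done roughly like this
(a sketch, assuming a kernel git tree; the commit hash is the one above):

# restore dm-crypt.c as it was before the per-CPU workqueue commit
git checkout c029772125594e31eb1a5ad9e0913724ed9891f2^ -- drivers/md/dm-crypt.c
# then rebuild the dm-crypt module (or the whole kernel) and re-test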


Submitters/Signers of c029772125594e31eb1a5ad9e0913724ed9891f2 CC:ed.


Thanks for your work & regards
Mario
--
There are two major products that come from Berkeley: LSD and UNIX.
We don't believe this to be a coincidence. -- Jeremy S. Anderson



2011-03-08 17:35:14

by Milan Broz

Subject: Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On 03/08/2011 05:45 PM, Mario 'BitKoenig' Holbe wrote:
> dm-crypt in 2.6.38 changed to per-CPU workqueues to increase its
> performance by parallelizing encryption across multiple CPUs.
> This modification seems to cause (massive) performance drops for
> multiple parallel dm-crypt instances...
>
> I'm running a 4-disk RAID0 on top of 4 independent dm-crypt(aes-xts)
> devices on a Core2Quad 3GHz. This setup did overcome the single-CPU
> limitation from previous versions and utilized all 4 cores for
> encryption.
> The throughput of this array drops from 282MB/s sustained read (dd,
> single process) with 2.6.37.3 down to 133MB/s with 2.6.38-rc8 (which
> nearly equals the single-disk throughput of 128MB/s - just in case this
> matters).
>
> This indicates way less parallelization now with 2.6.38 than before.
> I don't think this was intentional :)

Well, it depends. I never suggested this kind of workaround, because
you basically hardcoded (in the device stacking) how many parallel
instances of dmcrypt (ideally == number of CPU cores) can run effectively.

Previously there was no CPU affinity, so the dmcrypt thread simply ran
on some core.

With the current design, the IO is encrypted by the CPU which submitted it.

If you have RAID0, it probably means that one IO is split into stripe
chunks and these all try to encrypt on the same core (in "parallel").
(I need to test what actually happens, though.)

If you use one dmcrypt instance over RAID0, you will now probably get
much better throughput. (Even with one process generating IOs,
the bios are, surprisingly, submitted on different CPUs. But this time
it really runs in parallel.)

Maybe we can find some compromise, but I basically prefer the current design,
which provides much better behaviour for most configurations.

Milan

2011-03-08 19:24:21

by Mario 'BitKoenig' Holbe

Subject: Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On Tue, Mar 08, 2011 at 06:35:01PM +0100, Milan Broz wrote:
> On 03/08/2011 05:45 PM, Mario 'BitKoenig' Holbe wrote:
> > dm-crypt in 2.6.38 changed to per-CPU workqueues to increase its
> > performance by parallelizing encryption across multiple CPUs.
> > This modification seems to cause (massive) performance drops for
> > multiple parallel dm-crypt instances...
> Well, it depends. I never suggested this kind of workaround, because
> you basically hardcoded (in the device stacking) how many parallel
> instances of dmcrypt (ideally == number of CPU cores) can run effectively.

Yes. But it was the best one could get :)

> With the current design, the IO is encrypted by the CPU which submitted it.
...
> If you use one dmcrypt instance over RAID0, you will now probably get
> much better throughput. (Even with one process generating IOs,
> the bios are, surprisingly, submitted on different CPUs. But this time
> it really runs in parallel.)

Mh, not really. I just tested this with kernels freshly booted into
emergency mode, with udev started to create device nodes:

# cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo1 /dev/sdc
...
# cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo4 /dev/sdf
# mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/mapper/foo[1-4]
# dd if=/dev/md/foo of=/dev/null bs=1M count=20k

2.6.37: 291MB/s 2.6.38: 139MB/s

# mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/sd[c-f]
# cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo /dev/md/foo
# dd if=/dev/mapper/foo of=/dev/null bs=1M count=20k

2.6.37: 126MB/s 2.6.38: 138MB/s

So... performance drops on .37 (as expected) and nothing changes on .38
(contrary to expectations).

Those results, btw., differ dramatically when using tmpfs-backed
loop-devices instead of hard disks:

raid0 over crypted loops:
2.6.37: 285MB/s 2.6.38: 324MB/s
crypted raid0 over loops:
2.6.37: 119MB/s 2.6.38: 225MB/s

Here the results do indeed change - even if they are not what one
would expect.

All those constructs are read-only and can hence be tested on any
block device that happens to be available. Setting the devices read-only
beforehand would probably be a good idea, to compensate for being short
on sleep or whatever.
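
For example (a sketch; device names as in the commands above):

# mark the backing disks read-only before building the stacks
for d in /dev/sd[c-f]; do blockdev --setro "$d"; done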

> Maybe we can find some compromise, but I basically prefer the current design,
> which provides much better behaviour for most configurations.

Hmmm...


regards
Mario
--
File names are infinite in length where infinity is set to 255 characters.
-- Peter Collinson, "The Unix File System"



2011-03-08 20:07:28

by Milan Broz

Subject: Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On 03/08/2011 08:23 PM, Mario 'BitKoenig' Holbe wrote:
> On Tue, Mar 08, 2011 at 06:35:01PM +0100, Milan Broz wrote:
>> On 03/08/2011 05:45 PM, Mario 'BitKoenig' Holbe wrote:
>>> dm-crypt in 2.6.38 changed to per-CPU workqueues to increase its
>>> performance by parallelizing encryption across multiple CPUs.
>>> This modification seems to cause (massive) performance drops for
>>> multiple parallel dm-crypt instances...
>> Well, it depends. I never suggested this kind of workaround, because
>> you basically hardcoded (in the device stacking) how many parallel
>> instances of dmcrypt (ideally == number of CPU cores) can run effectively.
>
> Yes. But it was the best one could get :)

I know...

>
>> With the current design, the IO is encrypted by the CPU which submitted it.
> ...
>> If you use one dmcrypt instance over RAID0, you will now probably get
>> much better throughput. (Even with one process generating IOs,
>> the bios are, surprisingly, submitted on different CPUs. But this time
>> it really runs in parallel.)
>
> Mh, not really. I just tested this with kernels freshly booted into
> emergency mode, with udev started to create device nodes:
>
> # cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo1 /dev/sdc
> ...
> # cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo4 /dev/sdf
> # mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/mapper/foo[1-4]
> # dd if=/dev/md/foo of=/dev/null bs=1M count=20k
>
> 2.6.37: 291MB/s 2.6.38: 139MB/s
>
> # mdadm -B -l raid0 -n 4 -c 256 /dev/md/foo /dev/sd[c-f]
> # cryptsetup -c aes-xts-plain -s 256 -h sha256 -d /dev/urandom create foo /dev/md/foo
> # dd if=/dev/mapper/foo of=/dev/null bs=1M count=20k
>
> 2.6.37: 126MB/s 2.6.38: 138MB/s
>
> So... performance drops on .37 (as expected) and nothing changes on .38
> (contrary to expectations).

Could you please also try writes? I get better results for writes than for reads here.
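
A corresponding write test might look roughly like this (a sketch only,
and destructive - it overwrites whatever is on the array):

# write instead of read, for the raid0-over-dmcrypt stacking above
dd if=/dev/zero of=/dev/md/foo bs=1M count=20k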

Anyway, the patch provides parallel processing if the IO is submitted from
different CPUs; it does not provide any load balancing if everything is
submitted from one process.
(It seems that is a side effect of something else...)
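
One way to force submission from several CPUs with plain dd might be
something like this (a sketch; the offsets and sizes are only illustrative):

# four readers pinned to different CPUs, each reading a different 5GB region
for cpu in 0 1 2 3; do
    taskset -c $cpu dd if=/dev/mapper/foo of=/dev/null bs=1M count=5k skip=$((cpu * 5120)) &
done
wait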

So unfortunately, while for some people it is a huge improvement, in this
case it just causes trouble.

We need to investigate whether some change on top of the current code can
provide better results here.

Milan

2011-03-08 20:18:11

by Mario 'BitKoenig' Holbe

Subject: Re: [dm-crypt] dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On Tue, Mar 08, 2011 at 09:07:10PM +0100, Milan Broz wrote:
> Could you please also try writes? I get better results for writes than for reads here.

No, I can't, sorry. I don't have *that* many spare devices to try with.
However, if you are able to reproduce my read results, your write results
should be similar to what I'd get.


Mario
--
The secret that the NSA could read the Iranian secrets was more
important than any specific Iranian secrets that the NSA could
read. -- Bruce Schneier



2011-03-10 16:57:59

by Andi Kleen

Subject: Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On Tue, Mar 08, 2011 at 05:45:08PM +0100, Mario 'BitKoenig' Holbe wrote:
> Hello,
>
> dm-crypt in 2.6.38 changed to per-CPU workqueues to increase its
> performance by parallelizing encryption across multiple CPUs.
> This modification seems to cause (massive) performance drops for
> multiple parallel dm-crypt instances...
>
> I'm running a 4-disk RAID0 on top of 4 independent dm-crypt(aes-xts)
> devices on a Core2Quad 3GHz. This setup did overcome the single-CPU
> limitation from previous versions and utilized all 4 cores for
> encryption.
> The throughput of this array drops from 282MB/s sustained read (dd,
> single process) with 2.6.37.3 down to 133MB/s with 2.6.38-rc8 (which

It will be better with multiple processes running on different CPUs.
The new design is really for multiple processes.

Do you actually use dd for production or is this just a benchmark?
(if yes: newsflash: use a better benchmark)
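
Something like fio can generate that kind of load - a sketch, assuming fio
is installed and the array sits at /dev/md/foo as in the earlier mails:

# four sequential readers in parallel, direct IO, aggregated result
fio --name=seqread --filename=/dev/md/foo --rw=read --bs=1M \
    --direct=1 --numjobs=4 --size=5g --group_reporting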

-Andi

2011-03-10 17:54:54

by Mario 'BitKoenig' Holbe

Subject: Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On Thu, Mar 10, 2011 at 08:57:30AM -0800, Andi Kleen wrote:
> On Tue, Mar 08, 2011 at 05:45:08PM +0100, Mario 'BitKoenig' Holbe wrote:
> > I'm running a 4-disk RAID0 on top of 4 independent dm-crypt(aes-xts)
> > devices on a Core2Quad 3GHz. This setup did overcome the single-CPU
> Do you actually use dd for production or is this just a benchmark?

The array is streaming most of the time, i.e. single-process sequential
read or write (read mostly) for large chunks of data.
So, no and yes, but...

> (if yes: newsflash: use a better benchmark)

this makes dd quite a valid benchmark for me in this case.

> It will be better with multiple processes running on different CPUs.
> The new design is really for multiple processes.

Of course it is. What bothers me is that I can't get back my old
performance in my case, whatever I do.

I don't know what kind of parallelism padata uses, i.e. whether a
padata-based solution would suffer from the same limitations as the
current dm-crypt/kcryptd parallelism or not.

With the current approach:
Would it be possible to make CPU affinity configurable for *single*
kcryptd instances? Either in the sense of nailing a specific kcryptd to a
specific CPU, or (what would be better for me, I guess) in the sense of
completely removing CPU affinity from a specific kcryptd, like it was
before?


Mario
--
There is nothing more deceptive than an obvious fact.
-- Sherlock Holmes by Arthur Conan Doyle



2011-03-11 01:18:45

by Andi Kleen

Subject: Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

> Would it be possible to make CPU affinity configurable for *single*
> kcryptd instances? Either in the sense of nailing a specific kcryptd to a
> specific CPU, or (what would be better for me, I guess) in the sense of
> completely removing CPU affinity from a specific kcryptd, like it was
> before?

I don't think that's a good idea. You probably need to find some way
to make pcrypt (the parallel crypt layer) work for dmcrypt. That may
actually give you even more speedup than your old hack, because
it can balance over more cores.

Or get a system with AES-NI -- that usually solves it too.
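
Whether a CPU has it can be checked quickly - a sketch:

# AES-NI shows up as the "aes" flag; the accelerated cipher is aesni_intel
grep -o -m1 '\baes\b' /proc/cpuinfo && modprobe aesni_intel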

Frankly I don't think it's a very interesting case, the majority
of workloads are not like that.

-Andi

2011-03-11 18:04:13

by Mario 'BitKoenig' Holbe

Subject: Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

I pondered for a long time whether to reply to this or not, but sorry, I
couldn't resist.

On Thu, Mar 10, 2011 at 05:18:42PM -0800, Andi Kleen <[email protected]> wrote:
> You probably need to find some way
> to make pcrypt (the parallel crypt layer) work for dmcrypt. That may
> actually give you even more speedup than your old hack, because
> it can balance over more cores.

"my" old "hack" balances well as long as the number of stripes is equal
or greater than the number of cores.
And for my specific case... it's hard to balance over more than 4 cores
on a Core2Quad :)

> Or get a system with AES-NI -- that usually solves it too.

Honi soit qui mal y pense (shame on him who thinks evil of it).
Of course I understand that Intel's primary goal is to sell new
hardware, and hence I understand that you are required to tell me this.
However, based on the AES-NI benchmarks from the linux-crypto ML,
even with AES-NI it would be hard, if not impossible, to regain my
(non-AES-NI!) pre-.38 performance with the .38 dm-crypt parallelization
approach.

> Frankly I don't think it's a very interesting case, the majority
> of workloads are not like that.

Well, I'm not sure we understand each other.
Probably my use case is a little bit special, but that's not the point.

The main point is that the .38 dm-crypt parallelization approach kills
performance on *every* RAID0-over-dm-crypt setup. A setup which, I
believe, is not as uncommon as you may think, because it was the only
way to spread disk encryption over multiple CPUs until .38.

Up to .37, due to the lack of CPU affinity, accessing (reading or writing)
one stripe in the RAID0 always spread over min(#core, #kcryptd) cores.
Now with .38 the same access will only ever utilize a single core, because
all the chunks of the stripe are (obviously) accessed on the same core:
either the multiple underlying kcryptds of the old stacking now block each
other, or with dm-crypt-over-RAID0 there is only one kcryptd involved in
serving the request at all. Hence, for single requests the new approach
always decreases throughput and increases latency. The latency increase
holds even for multi-process workloads.
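
This is easy to watch from a second terminal while the dd runs - a sketch,
assuming the sysstat tools are installed:

# per-CPU utilization, refreshed every second; on .38 with raid0-over-dmcrypt
# essentially only one core should carry the encryption load
mpstat -P ALL 1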

For your approach to at least match the old one, it requires
min(#core, #kcryptd) parallel requests all the time, assuming latency
doesn't matter and disk seek times are zero (now you tell me to get
X25s, right? :)).


Mario
--
There are two major products that come from Berkeley: LSD and UNIX.
We don't believe this to be a coincidence. -- Jeremy S. Anderson



2011-03-11 18:30:11

by Milan Broz

Subject: Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On 03/11/2011 07:03 PM, Mario 'BitKoenig' Holbe wrote:

>> You probably need to find some way
>> to make pcrypt (the parallel crypt layer) work for dmcrypt. That may
>> actually give you even more speedup than your old hack, because
>> it can balance over more cores.

dmcrypt is already using the async crypto interface, so it is ready
for parallelization at this level.

Perhaps the problem is that pcrypt is not yet implemented for the needed
algorithms?
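
Which implementation a mapping actually resolved to can be checked once the
device is set up - a sketch; the exact driver names will of course vary:

# show the xts(aes) instances the kernel has instantiated
grep -B1 -A7 '^name.*xts(aes)' /proc/crypto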

Milan

2011-03-11 18:37:47

by Andi Kleen

Subject: Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On Fri, Mar 11, 2011 at 07:29:58PM +0100, Milan Broz wrote:
> On 03/11/2011 07:03 PM, Mario 'BitKoenig' Holbe wrote:
>
> >> You probably need to find some way
> >> to make pcrypt (the parallel crypt layer) work for dmcrypt. That may
> >> actually give you even more speedup than your old hack, because
> >> it can balance over more cores.
>
> dmcrypt is already using the async crypto interface, so it is ready
> for parallelization at this level.
>
> Perhaps the problem is that pcrypt is not yet implemented for the needed
> algorithms?

It needs some glue according to Herbert. I forgot the details.

-Andi
--
[email protected] -- Speaking for myself only

2011-03-12 01:06:04

by Herbert Xu

Subject: Re: dm-crypt: Performance Regression 2.6.37 -> 2.6.38-rc8

On Fri, Mar 11, 2011 at 10:36:54AM -0800, Andi Kleen wrote:
>
> It needs some glue according to Herbert. I forgot the details.

As we don't want to have pcrypt on by default, it needs to be loaded
by hand. What we lack is a clean way to instantiate it.

For now you have to do something like

modprobe tcrypt alg="pcrypt(authenc(hmac(sha1-generic),cbc(aes-asm)))" type=3
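
Afterwards the instantiated pcrypt template should be visible - a sketch:

# verify the pcrypt instance got registered
grep pcrypt /proc/crypto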

Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt