2013-03-26 12:27:33

by Alasdair G Kergon

Subject: Re: [dm-devel] dm-crypt performance

[Adding dm-crypt + linux-kernel]

On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
> I performed some dm-crypt performance tests as Mike suggested.
>
> It turns out that unbound workqueue performance has improved somewhere
> between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the
> patches for hand-built dispatch are no longer needed.
>
> For RAID-0 composed of two disks with total throughput 260MB/s, the
> unbound workqueue performs as well as the hand-built dispatch (both
> sustain the 260MB/s transfer rate).
>
> For ramdisk, unbound workqueue performs better than hand-built dispatch
> (620MB/s vs 400MB/s). Unbound workqueue with the patch that Mike suggested
> (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves
> performance slightly on ramdisk compared to 3.8 (700MB/s vs. 620MB/s).
>
>
>
> However, there is still the problem with request ordering. Milan found out
> that under some circumstances parallel dm-crypt has worse performance than
> the previous dm-crypt code. I found out that this is not caused by
> deficiencies in the code that distributes work to individual processors.
> Performance drop is caused by the fact that distributing write bios to
> multiple processors causes the encryption to finish out of order and the
> I/O scheduler is unable to merge these out-of-order bios.
>
> The deadline and noop schedulers perform better (only 50% slowdown
> compared to old dm-crypt), CFQ performs very badly (8 times slowdown).
>
>
> If I sort the requests in dm-crypt to come out in the same order as they
> were received, there is no longer any slowdown, the new crypt performs as
> well as the old crypt, but the last time I submitted the patches, people
> objected to sorting requests in dm-crypt, saying that the I/O scheduler
> should sort them. But it doesn't. This problem still persists in the
> current kernels.
>
>
> For best performance we could use the unbound workqueue implementation
> with request sorting, if people don't object to the request sorting being
> done in dm-crypt.


On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
> FYI, XFS also does its own request ordering for the metadata buffers,
> because it knows the needed ordering and has a bigger view than the
> I/O scheduler, especially CFQ. You at least have a precedent in a widely
> used subsystem for this code.


So please post this updated version of the patches for a wider group of
people to try out.

Alasdair


2013-03-26 20:06:32

by Milan Broz

Subject: Re: [dm-devel] dm-crypt performance

On 26.3.2013 13:27, Alasdair G Kergon wrote:
> [Adding dm-crypt + linux-kernel]

Thanks.

>
> On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
>> I performed some dm-crypt performance tests as Mike suggested.
>>
>> It turns out that unbound workqueue performance has improved somewhere
>> between kernel 3.2 (when I made the dm-crypt patches) and 3.8, so the
>> patches for hand-built dispatch are no longer needed.
>>
>> For RAID-0 composed of two disks with total throughput 260MB/s, the
>> unbound workqueue performs as well as the hand-built dispatch (both
>> sustain the 260MB/s transfer rate).
>>
>> For ramdisk, unbound workqueue performs better than hand-built dispatch
>> (620MB/s vs 400MB/s). Unbound workqueue with the patch that Mike suggested
>> (git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git) improves
>> performance slightly on ramdisk compared to 3.8 (700MB/s vs. 620MB/s).

I found that ramdisk tests are usually quite misleading for dmcrypt.
Better to use some fast SSD, ideally in RAID0 (so you get >500MB/s or so).
Also be sure you compare recent machines which use AES-NI.
For reference, the null cipher (no crypt, data copy only) works as well,
but this is not a real-world scenario.

After introducing Andi's patches, we created a performance regression
for people who created "RAID over several dmcrypt devices".
(All IOs were processed by one core.) A rare use case, but several people
complained.
But most people reported that the current approach works much better
(even with a stupid dd test - I think it is because the page cache submits
requests from different CPUs so it in fact runs in parallel).

But using dd with direct-io is a trivial way to simulate the "problem".
(I guess we all like using dd for performance testing... :-])

>> However, there is still the problem with request ordering. Milan found out
>> that under some circumstances parallel dm-crypt has worse performance than
>> the previous dm-crypt code. I found out that this is not caused by
>> deficiencies in the code that distributes work to individual processors.
>> Performance drop is caused by the fact that distributing write bios to
>> multiple processors causes the encryption to finish out of order and the
>> I/O scheduler is unable to merge these out-of-order bios.

If the IO scheduler is unable to merge these requests because of out-of-order
bios, please try to FIX the IO scheduler and do not invent workarounds in dmcrypt.
(With recent accelerated crypto this should not happen so often btw.)

I know it is not easy but I really do not like that the "little-walled
device-mapper garden" contains something that should be done on a different
layer (again).

>> The deadline and noop schedulers perform better (only 50% slowdown
>> compared to old dm-crypt), CFQ performs very badly (8 times slowdown).
>>
>>
>> If I sort the requests in dm-crypt to come out in the same order as they
>> were received, there is no longer any slowdown, the new crypt performs as
>> well as the old crypt, but the last time I submitted the patches, people
>> objected to sorting requests in dm-crypt, saying that the I/O scheduler
>> should sort them. But it doesn't. This problem still persists in the
>> current kernels.

I have probably no vote here anymore but for the record: I am strictly
against any sorting of requests in dmcrypt. My reasons are:

- dmcrypt should be a simple transparent layer (doing one thing - encryption);
sorting of requests was always primarily the IO scheduler's domain
(which has well-known knobs to control it already)

- Are we sure we are not introducing yet another side channel in disk
encryption? (An unprivileged user can measure timing here.)
(Perhaps a stupid reason but please do not prefer performance to security
in encryption. We have enough timing attacks for AES implementations already...)

- In my testing (several months ago) the output was very unstable - in some
situations it helped, in others it was worse. I no longer have hard
data but some test output was sent to Alasdair.

>> For best performance we could use the unbound workqueue implementation
>> with request sorting, if people don't object to the request sorting being
>> done in dm-crypt.

So again:

- why is the IO scheduler not working properly here? Does it need some extensions?
If fixed, it can help even in some other non-dmcrypt IO patterns.
(I mean dmcrypt could set some special parameter for the underlying device queue
automagically to fine-tune sorting parameters.)

- can we have some cpu-bound workqueue which automatically switches to unbound
(relocates work to another cpu) if it detects some saturation watermark etc.?
(Again, this could be used in other code.)
http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html
(Yes, I see skepticism there :-)

> On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
>> FYI, XFS also does its own request ordering for the metadata buffers,
>> because it knows the needed ordering and has a bigger view than the
>> I/O scheduler, especially CFQ. You at least have a precedent in a widely
>> used subsystem for this code.

Nice. But XFS is a much more complex system.
Isn't it enough that multipath uses its own IO queue (so we have one IO scheduler
on top of another, and now we have metadata IO sorting in XFS on top of it,
and we are planning one more in dmcrypt? Is that really a good approach?)

Milan

2013-03-26 20:28:59

by Mike Snitzer

Subject: Re: dm-crypt performance

On Tue, Mar 26 2013 at 4:05pm -0400,
Milan Broz <[email protected]> wrote:

> >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
> >
> >>For best performance we could use the unbound workqueue implementation
> >>with request sorting, if people don't object to the request sorting being
> >>done in dm-crypt.
>
> So again:
>
> - why is the IO scheduler not working properly here? Does it need some extensions?
> If fixed, it can help even in some other non-dmcrypt IO patterns.
> (I mean dmcrypt could set some special parameter for the underlying device queue
> automagically to fine-tune sorting parameters.)

Not sure, but IO scheduler changes are fairly slow to materialize given
the potential for adverse side-effects. Are you so surprised that a
shotgun blast of IOs might make the IO scheduler less optimal than if
some basic sorting were done at the layer above?

> - can we have some cpu-bound workqueue which automatically switch to unbound
> (relocates work to another cpu) if it detects some saturation watermark etc?
> (Again, this can be used in other code.
> http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html
> (Yes, I see skepticism there :-)

Question for Tejun? (now cc'd).

> >On Tue, Mar 26, 2013 at 02:52:29AM -0400, Christoph Hellwig wrote:
> >>FYI, XFS also does it's own request ordering for the metadata buffers,
> >>because it knows the needed ordering and has a bigger view than than
> >>than especially CFQ. You at least have precedence in a widely used
> >>subsystem for this code.
>
> Nice. But XFS is a much more complex system.
> Isn't it enough that multipath uses its own IO queue (so we have one IO scheduler
> on top of another, and now we have metadata IO sorting in XFS on top of it,
> and we are planning one more in dmcrypt? Is that really a good approach?)

Multipath's request_queue is the only one with an active IO scheduler;
the requests are dispatched directly to the underlying devices' queues
without any IO scheduling.

As for dm-crypt: as you know it is bio-based, so it is already dealing
with out-of-order IOs (no benefit from an upper-level IO scheduler). It seems
relatively clear, from Mikulas' results, that maybe you're hoping for a
bit too much magic from the IO scheduler gnomes that lurk on LKML. BTW,
pretty sure btrfs takes care to maintain some IO dispatch ordering too.

Mike

2013-03-26 21:00:33

by Milan Broz

Subject: Re: dm-crypt performance

On 26.3.2013 21:28, Mike Snitzer wrote:
> On Tue, Mar 26 2013 at 4:05pm -0400,
> Milan Broz <[email protected]> wrote:
>
>>> On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
>>>
>>>> For best performance we could use the unbound workqueue implementation
>>>> with request sorting, if people don't object to the request sorting being
>>>> done in dm-crypt.
>>
>> So again:
>>
>> - why is the IO scheduler not working properly here? Does it need some extensions?
>> If fixed, it can help even in some other non-dmcrypt IO patterns.
>> (I mean dmcrypt could set some special parameter for the underlying device queue
>> automagically to fine-tune sorting parameters.)
>
> Not sure, but IO scheduler changes are fairly slow to materialize given
> the potential for adverse side-effects. Are you so surprised that a
> shotgun blast of IOs might make the IO scheduler less optimal than if
> some basic sorting were done at the layer above?

All I said is that I think the problems should be solved on the proper layer,
where all the mechanisms to properly control it already exist.
Only if that is not possible should we use such workarounds.

CPU-bound IO in dmcrypt has been in the kernel for >2 years and I know about just
a few cases where it caused real problems. Maybe I am mistaken - then now is
the ideal time for people to complain :)

Anyway, are we talking about the same Mikulas patch I tested months ago,
or do you have something new?
I mean this part from the series of dmcrypt patches:
http://mbroz.fedorapeople.org/dm-crypt/3.6-rc/dm-crypt-25-sort-writes.patch

Milan

2013-03-28 18:53:37

by Tejun Heo

Subject: Re: dm-crypt performance

Hello,

(cc'ing Vivek and Jens for the iosched related bits)

On Tue, Mar 26, 2013 at 04:28:38PM -0400, Mike Snitzer wrote:
> On Tue, Mar 26 2013 at 4:05pm -0400,
> Milan Broz <[email protected]> wrote:
>
> > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
> > >
> > >>For best performance we could use the unbound workqueue implementation
> > >>with request sorting, if people don't object to the request sorting being
> > >>done in dm-crypt.
> >
> > So again:
> >
> > - why is the IO scheduler not working properly here? Does it need some extensions?
> > If fixed, it can help even in some other non-dmcrypt IO patterns.
> > (I mean dmcrypt could set some special parameter for the underlying device queue
> > automagically to fine-tune sorting parameters.)
>
> Not sure, but IO scheduler changes are fairly slow to materialize given
> the potential for adverse side-effects. Are you so surprised that a
> shotgun blast of IOs might make the IO scheduler less optimal than if
> some basic sorting were done at the layer above?

My memory is already pretty hazy but Vivek should be able to correct
me if I say something nonsense. The thing is, the order and timing
of IOs coming down from upper layers have certain meanings to ioscheds
and they exploit those patterns to do better scheduling.

Reordering IOs randomly actually makes certain information about the
IO stream lost and makes ioscheds mis-classify the IO stream -
e.g. what could have been classified as "mostly consecutive streaming
IO" could after such reordering fail to be detected as such. Sure,
ioscheds can probably be improved to compensate for such temporary
localized reorderings but nothing is free and given that most of the
upper stacks already do a pretty good job of issuing IOs orderly when
possible, it would be a bit silly to do more than usually necessary in
ioscheds.

So, no, I don't think maintaining IO order in stacking drivers is a
bad idea. I actually think all stacking drivers should do that;
otherwise, they really are destroying actual useful side-band
information.

> > - can we have some cpu-bound workqueue which automatically switch to unbound
> > (relocates work to another cpu) if it detects some saturation watermark etc?
> > (Again, this can be used in other code.
> > http://www.redhat.com/archives/dm-devel/2012-August/msg00288.html
> > (Yes, I see skepticism there :-)
>
> Question for Tejun? (now cc'd).

Unbound workqueues went through quite a bit of improvements lately and
are currently growing NUMA affinity support. Once merged, all unbound
work items issued on a NUMA node will be processed in the same NUMA
node, which should mitigate some, unfortunately not all, of the
disadvantages compared to per-cpu ones. Mikulas, can you share more
about your test setup? Was it a NUMA machine? Which wq branch did
you use?

The NUMA affinity support would have a less severe but similar issue as
per-cpu. If all IOs are being issued from one node while other nodes
are idle, that specific node can get saturated. NUMA affinity support
is adjustable both from inside the kernel and from userland via sysfs, so
there are control knobs for corner cases.

As for maintaining CPU or NUMA affinity until the CPU / node is
saturated and spilling to other CPUs/nodes beyond that, yeah, an
interesting idea. It's non-trivial and would have to incorporate a
lot of notions on "load" similar to the scheduler. It really becomes
a generic load balancing problem as it'd be pointless and actually
harmful to, say, spill work items to each other between two saturated
NUMA nodes.

So, if the brunt of scattering workload across random CPUs can be
avoided by NUMA affinity, that could be a reasonable tradeoff, I
think.

Thanks.

--
tejun

2013-03-28 19:34:12

by Vivek Goyal

Subject: Re: dm-crypt performance

On Thu, Mar 28, 2013 at 11:53:27AM -0700, Tejun Heo wrote:
> Hello,
>
> (cc'ing Vivek and Jens for the iosched related bits)
>
> On Tue, Mar 26, 2013 at 04:28:38PM -0400, Mike Snitzer wrote:
> > On Tue, Mar 26 2013 at 4:05pm -0400,
> > Milan Broz <[email protected]> wrote:
> >
> > > >On Mon, Mar 25, 2013 at 11:47:22PM -0400, Mikulas Patocka wrote:
> > > >
> > > >>For best performance we could use the unbound workqueue implementation
> > > >>with request sorting, if people don't object to the request sorting being
> > > >>done in dm-crypt.
> > >
> > > So again:
> > >
> > > - why is the IO scheduler not working properly here? Does it need some extensions?
> > > If fixed, it can help even in some other non-dmcrypt IO patterns.
> > > (I mean dmcrypt could set some special parameter for the underlying device queue
> > > automagically to fine-tune sorting parameters.)
> >
> > Not sure, but IO scheduler changes are fairly slow to materialize given
> > the potential for adverse side-effects. Are you so surprised that a
> > shotgun blast of IOs might make the IO scheduler less optimal than if
> > some basic sorting were done at the layer above?
>
> My memory is already pretty hazy but Vivek should be able to correct
> me if I say something nonsense. The thing is, the order and timings
> of IOs coming down from upper layers has certain meanings to ioscheds
> and they exploit those patterns to do better scheduling.
>
> Reordering IOs randomly actually makes certain information about the
> IO stream lost and makes ioscheds mis-classify the IO stream -
> e.g. what could have been classified as "mostly consecutive streaming
> IO" could after such reordering fail to be detected as such. Sure,
> ioscheds can probably be improved to compensate for such temporary
> localized reorderings but nothing is free and given that most of the
> upper stacks already do a pretty good job of issuing IOs orderly when
> possible, it would be a bit silly to do more than usually necessary in
> ioscheds.
>
> So, no, I don't think maintaining IO order in stacking drivers is a
> bad idea. I actually think all stacking drivers should do that;
> otherwise, they really are destroying actual useful side-band
> information.

I am curious why out-of-order bios are a problem. Doesn't the elevator
already merge bios with existing requests, and if merging does not
happen then requests are sorted in order? So why is ordering not
happening properly with dm-crypt? What additional info does dm-crypt have
that lets it do better ordering than the IO scheduler?

CFQ might be seeing a bigger performance hit because we maintain
per-process queues and kernel threads might not be sharing the IO context
(I am not sure). So if all the crypto threads can share the IO
context, at least it will make sure all IO from them goes into a
single queue.

So it would help if somebody could explain why the existing merging
and sorting logic is not working well with dm-crypt and what additional
info dm-crypt has which can help it do a better job.

Thanks
Vivek

2013-03-28 19:44:57

by Tejun Heo

Subject: Re: dm-crypt performance

Hello,

On Thu, Mar 28, 2013 at 03:33:43PM -0400, Vivek Goyal wrote:
> I am curious why out-of-order bios are a problem. Doesn't the elevator
> already merge bios with existing requests, and if merging does not
> happen then requests are sorted in order? So why is ordering not
> happening properly with dm-crypt? What additional info does dm-crypt have
> that lets it do better ordering than the IO scheduler?

Hmmm... well, for one, it doesn't only change ordering. It also
changes the timings. Before, the iosched would get a contiguous stream of
IOs when the queue gets unplugged (BTW, how does dm-crypt handle
plugging? If not handled properly, it could definitely affect a lot
of things.) With multiple threads doing encryption in the middle, the
iosched could get scattered IOs which could easily span multiple
millisecs. Even if context tagging was done properly, it could easily
lead to much less efficient IO patterns to hardware.

Keeping IO order combined with proper plug handling would not only
keep the ordering constant but also the relative timing of events,
which is an important factor when scheduling IOs.

> CFQ might be seeing a bigger performance hit because we maintain
> per-process queues and kernel threads might not be sharing the IO context
> (I am not sure). So if all the crypto threads can share the IO
> context, at least it will make sure all IO from them goes into a
> single queue.

Right, this is important too although I fail to see how workqueue
vs. custom dispatch would make any difference here. dm-crypt should
definitely be using bio_associate_current().

Thanks.

--
tejun

2013-03-28 20:38:58

by Vivek Goyal

Subject: Re: dm-crypt performance

On Thu, Mar 28, 2013 at 12:44:43PM -0700, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 28, 2013 at 03:33:43PM -0400, Vivek Goyal wrote:
> > I am curious why out-of-order bios are a problem. Doesn't the elevator
> > already merge bios with existing requests, and if merging does not
> > happen then requests are sorted in order? So why is ordering not
> > happening properly with dm-crypt? What additional info does dm-crypt have
> > that lets it do better ordering than the IO scheduler?
>
> Hmmm... well, for one, it doesn't only change ordering. It also
> changes the timings. Before, the iosched would get a contiguous stream of
> IOs when the queue gets unplugged (BTW, how does dm-crypt handle
> plugging? If not handled properly, it could definitely affect a lot
> of things.) With multiple threads doing encryption in the middle, the
> iosched could get scattered IOs which could easily span multiple
> millisecs. Even if context tagging was done properly, it could easily
> lead to much less efficient IO patterns to hardware.
>
> Keeping IO order combined with proper plug handling would not only
> keep the ordering constant but also the relative timing of events,
> which is an important factor when scheduling IOs.

If the timing of unordered IO is an issue, then dm-crypt can try
to batch IO submission using blk_start_plug()/blk_finish_plug(). That way
dm-crypt can batch bios and control submission, and there should not
be a need to put specific ordering logic in dm-crypt.

So if there are multiple threads doing crypto, they end up submitting
bios after the crypto operation, or queue them somewhere and there is a single
submitter thread.

If completion of crypto is an issue, then I think it is very hard to
determine whether extra waiting helps with throughput or hurts. If
dm-crypt can decide that somehow, then I guess they can just try
to do batch submission of IO from the various crypto threads and see
if it helps with performance. (At some point, the submitter
thread will become a bottleneck.)

>
> > CFQ might be seeing a bigger performance hit because we maintain
> > per-process queues and kernel threads might not be sharing the IO context
> > (I am not sure). So if all the crypto threads can share the IO
> > context, at least it will make sure all IO from them goes into a
> > single queue.
>
> Right, this is important too although I fail to see how workqueue
> vs. custom dispatch would make any difference here. dm-crypt should
> definitely be using bio_associate_current().

Agreed. bio_associate_current() will at least help keep all bios
of a single context in a single queue and promote more merging (if submission
happens in the right time frame).

Thanks
Vivek

2013-03-28 20:45:28

by Tejun Heo

Subject: Re: dm-crypt performance

Hello,

On Thu, Mar 28, 2013 at 04:38:08PM -0400, Vivek Goyal wrote:
> If timing of unordered IO is an issue, then dm-crypt can try
> to batch IO submission using blk_start_plug()/blk_finish_plug(). That way
> dm-crypt can batch bio and control submission and there should not
> be a need to put specific ordering logic in dm-crypt.

Yes, it has to preserve and propagate the plugging boundaries, and if
you think about the implementation, maintaining issue order doesn't
really need to be "sorted" per se. Just keep the list of bios
received but still going through encryption in the received order, with
a counter of in-progress bios in the plugging boundary. Link the
outputs to the source bios somehow and when the counter hits zero,
issue them in the same order. While keeping the specific order itself
might not be essential, it's not gonna add any significant complexity
or runtime overhead and I think it generally is a good idea for
stacking drivers to preserve as much information and context as
possible in general.

Thanks.

--
tejun

2013-04-09 17:52:12

by Mikulas Patocka

Subject: dm-crypt parallelization patches

Hi

I placed the dm-crypt parallelization patches at:
http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/

The patches parallelize dm-crypt and make it possible to use all processor
cores.


The patch dm-crypt-remove-percpu.patch removes some percpu variables and
replaces them with per-request variables.

The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the
encryption workqueue, allowing the encryption to be distributed to all
CPUs in the system.

The patch dm-crypt-offload-writes-to-thread.patch moves submission of all
write requests to a single thread.

The patch dm-crypt-sort-requests.patch sorts write requests submitted by a
single thread. The requests are sorted according to the sector number;
an rb-tree is used for efficient sorting.

Some usage notes:

* turn off automatic cpu frequency scaling (or set it to the "performance"
governor) - cpufreq doesn't recognize the encryption workload correctly;
sometimes it underclocks all the CPU cores when there is some encryption
work to do, resulting in bad performance

* when using a filesystem on an encrypted dm-crypt device, reduce the maximum
request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute
"dm-2" with the real name of your dm-crypt device). Note that having too
big requests means that there is a small number of requests and they
cannot be distributed to all available processors in parallel - this
results in worse performance. Having too small requests results in high
per-request overhead and also reduced performance. So you must find the
optimal request size for your system and workload. For me, when testing
this on ramdisk, the optimal size is 8KiB.

---

Now, the problem with the I/O scheduler: when doing performance testing, it
turns out that the parallel version is sometimes worse than the previous
implementation.

When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an
ext2 filesystem on a 15k SCSI disk and run this command

time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt
--direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5
--name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11
--name=job12

the results are this:
CFQ scheduler:
--------------
no patches:
21.9s
patch 1:
21.7s
patches 1,2:
2:33s
patches 1,2 (+ nr_requests = 1280000)
2:18s
patches 1,2,3:
20.7s
patches 1,2,3,4:
20.7s

deadline scheduler:
-------------------
no patches:
27.4s
patch 1:
27.4s
patches 1,2:
27.8s
patches 1,2,3:
29.6s
patches 1,2,3,4:
29.6s


We can see that CFQ performs badly with patch 2, but improves with
patch 3. All that patch 3 does is move write requests from the
encryption threads to a separate thread.

So it seems that CFQ has some deficiency in that it cannot merge adjacent
requests issued by different processes.

The problem is this:
- we have a 256k direct-I/O write request
- it is broken into 4k bios (because we run on dm-loop on a filesystem with
a 4k block size)
- encryption of these 4k bios is distributed to 12 processes on a 12-core
machine
- encryption finishes out of order and in different processes; 4k bios
with encrypted data are submitted to CFQ
- CFQ doesn't merge them
- the disk is flooded with random 4k write requests, and performs much
worse than with 256k requests

Increasing nr_requests to 1280000 helps a little, but not much - it is
still an order of magnitude slower.

I'd like to ask if someone who knows the CFQ scheduler (Jens?) could look
at it and find out why it doesn't merge requests from different processes.

Why do I have to do a seemingly senseless operation (hand over write
requests to a separate thread) in patch 3 to improve performance?

Mikulas

2013-04-09 17:58:01

by Tejun Heo

Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> The patch dm-crypt-sort-requests.patch sorts write requests submitted by a
> single thread. The requests are sorted according to the sector number,
> rb-tree is used for efficient sorting.

Hmmm? Why not just keep the issuing order along with plugging
boundaries?

> So it seems that CFQ has some deficiency in that it cannot merge adjacent
> requests issued by different processes.

As I wrote before, please use bio_associate_current(). Currently,
dm-crypt is completely messing up all the context information that cfq
depends on to schedule IOs. Of course, it doesn't perform well.

Thanks.

--
tejun

2013-04-09 18:08:26

by Mikulas Patocka

Subject: Re: dm-crypt parallelization patches



On Tue, 9 Apr 2013, Tejun Heo wrote:

> On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> > The patch dm-crypt-sort-requests.patch sorts write requests submitted by a
> > single thread. The requests are sorted according to the sector number,
> > rb-tree is used for efficient sorting.
>
> Hmmm? Why not just keep the issuing order along with plugging
> boundaries?

What do you mean?

I used to have a patch that keeps the order of requests as they were
introduced, but sorting the requests according to sector number is a bit
simpler.

> > So it seems that CFQ has some deficiency in that it cannot merge adjacent
> > requests issued by different processes.
>
> As I wrote before, please use bio_associate_current(). Currently,
> dm-crypt is completely messing up all the context information that cfq
> depends on to schedule IOs. Of course, it doesn't perform well.

bio_associate_current() is only valid on a system with cgroups and there
are no cgroups in the kernel where I tested it. It is an empty function:

static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }

Mikulas

> Thanks.
>
> --
> tejun
>

2013-04-09 18:09:12

by Mikulas Patocka

Subject: Re: [dm-devel] dm-crypt performance



On Tue, 26 Mar 2013, Milan Broz wrote:

> - Are we sure we are not introducing yet another side channel in disk
> encryption? (An unprivileged user can measure timing here.)
> (Perhaps a stupid reason but please do not prefer performance to security
> in encryption. We have enough timing attacks for AES implementations already...)

So use Serpent - it is implemented without any data-dependent lookup
tables, so it has no timing attacks.

AES uses data-dependent lookup tables; on a CPU with hyperthreading, the
second thread can observe the L1 cache footprint left by the first thread and
get some information about the data being encrypted...

Mikulas

2013-04-09 18:10:38

by Tejun Heo

Subject: Re: dm-crypt parallelization patches

Hey,

On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote:
> > Hmmm? Why not just keep the issuing order along with plugging
> > boundaries?
>
> What do you mean?
>
> I used to have a patch that keeps order of requests as they were
> introduced, but sorting the requests according to sector number is a bit
> simpler.

You're still destroying the context information. Please just keep the
issuing order along with plugging boundaries.

> > As I wrote before, please use bio_associate_current(). Currently,
> > dm-crypt is completely messing up all the context information that cfq
> > depends on to schedule IOs. Of course, it doesn't perform well.
>
> bio_associate_current() is only valid on a system with cgroups and there
> are no cgroups on the kernel where I tested it. It is an empty function:
>
> static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }

Yeah, because blkcg was the only user. Please feel free to drop the
ifdefs. It covers both iocontext and cgroup association.

Thanks.

--
tejun

2013-04-09 18:36:23

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 01:51:43PM -0400, Mikulas Patocka wrote:
> Hi
>
> I placed the dm-crypt parallelization patches at:
> http://people.redhat.com/~mpatocka/patches/kernel/dm-crypt-paralelizace/current/
>
> The patches parallelize dm-crypt and make it possible to use all processor
> cores.
>
>
> The patch dm-crypt-remove-percpu.patch removes some percpu variables and
> replaces them with per-request variables.
>
> The patch dm-crypt-unbound-workqueue.patch sets WQ_UNBOUND on the
> encryption workqueue, allowing the encryption to be distributed to all
> CPUs in the system.
>
> The patch dm-crypt-offload-writes-to-thread.patch moves submission of all
> write requests to a single thread.
>
> The patch dm-crypt-sort-requests.patch sorts write requests submitted by a
> single thread. The requests are sorted according to the sector number;
> an rb-tree is used for efficient sorting.
>
> Some usage notes:
>
> * turn off automatic cpu frequency scaling (or set it to the "performance"
> governor) - cpufreq doesn't recognize the encryption workload correctly;
> sometimes it underclocks all the CPU cores when there is encryption
> work to do, resulting in bad performance
>
> * when using a filesystem on an encrypted dm-crypt device, reduce the maximum
> request size with "/sys/block/dm-2/queue/max_sectors_kb" (substitute
> "dm-2" with the real name of your dm-crypt device). Note that overly
> big requests mean that there is a small number of requests and they
> cannot be distributed to all available processors in parallel - this
> results in worse performance. Overly small requests result in high
> per-request overhead and also reduced performance. So you must find the
> optimal request size for your system and workload. For me, when testing
> this on ramdisk, the optimum is 8KiB.
>
> ---
>
> Now, the problem with I/O scheduler: when doing performance testing, it
> turns out that the parallel version is sometimes worse than the previous
> implementation.
>
> When I create a 4.3GiB dm-crypt device on top of dm-loop on top of an
> ext2 filesystem on a 15k SCSI disk and run this command
>
> time fio --rw=randrw --size=64M --bs=256k --filename=/dev/mapper/crypt
> --direct=1 --name=job1 --name=job2 --name=job3 --name=job4 --name=job5
> --name=job6 --name=job7 --name=job8 --name=job9 --name=job10 --name=job11
> --name=job12
>
> the results are this:
> CFQ scheduler:
> --------------
> no patches:
> 21.9s
> patch 1:
> 21.7s
> patches 1,2:
> 2:33s
> patches 1,2 (+ nr_requests = 1280000)
> 2:18s
> patches 1,2,3:
> 20.7s
> patches 1,2,3,4:
> 20.7s
>
> deadline scheduler:
> -------------------
> no patches:
> 27.4s
> patch 1:
> 27.4s
> patches 1,2:
> 27.8s
> patches 1,2,3:
> 29.6s
> patches 1,2,3,4:
> 29.6s
>
>
> We can see that CFQ performs badly with patch 2, but improves with
> patch 3. All that patch 3 does is move write requests from the
> encryption threads to a separate thread.
>
> So it seems that CFQ has a deficiency: it cannot merge adjacent
> requests issued by different processes.
>

CFQ does not merge requests across different cfq queues (cfqq). Each
queue is associated with one iocontext. So in this case each worker
thread is submitting its own bios and each 4K bio goes into a
separate cfqq, hence no merging takes place.

The moment you applied patch 3, where a single thread submitted the bios,
each bio went into a single queue and could get merged.

So either use a single thread to submit bios or, better, use
bio_associate_current() (as Tejun suggested) on the original 256K bio.
(Hopefully the bio's iocontext association is retained when you
split the bio into smaller pieces.)

> The problem is this:
> - we have a 256k write direct-i/o request
> - it is broken into 4k bios (because we run on dm-loop on a filesystem with
> a 4k block size)
> - encryption of these 4k bios is distributed to 12 processes on a 12-core
> machine
> - encryption finishes out of order and in different processes; 4k bios
> with encrypted data are submitted to CFQ
> - CFQ doesn't merge them
> - the disk is flooded with random 4k write requests, and performs much
> worse than with 256k requests
>

Thanks
Vivek

2013-04-09 18:43:15

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 11:10:31AM -0700, Tejun Heo wrote:
> Hey,
>
> On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote:
> > > Hmmm? Why not just keep the issuing order along with plugging
> > > boundaries?
> >
> > What do you mean?
> >
> > I used to have a patch that keeps order of requests as they were
> > introduced, but sorting the requests according to sector number is a bit
> > simpler.
>
> You're still destroying the context information. Please just keep the
> issuing order along with plugging boundaries.

I guess the plugging boundary is more important than issuing order, as the
block layer should take care of merging the bios and putting them in the
right order (attempt_plug_merge()).

But to make use of the plugging boundary, one would probably still need
submission from a single thread.

And if one is using a single thread for submission, one will still get
good performance (even if not using bio_associate_current()), as by
default all bios will go into the submitting thread's context.

Thanks
Vivek

2013-04-09 18:57:29

by Tejun Heo

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

Hello,

On Tue, Apr 09, 2013 at 02:42:48PM -0400, Vivek Goyal wrote:
> I guess plugging boundary is more important than issuing order as
> block layer should take care of mering the bio and put in right
> order (attempt_plug_merge()).

Yeah, the exact order probably doesn't affect things too much but it's
just a nice design principle to follow - if you're gonna step in in
the middle and meddle with requests, preserve as much context as
reasonably possible, and it's not like preserving that order is
difficult.

> But to make use of plugging boundary, one would probably still need
> submission using single thread.

It doesn't have to be a specific task. Whoever finishes the last bio /
segment / whatever in the plugging domain can issue all of them. I
probably am missing details but the overall mechanism can be pretty
simple. Just keep the bios from the same plugging domain in the
received order along with an atomic counter and issue them all when
the counter hits zero. No need to fiddle with sorting or whatever.

> And if one is using single thread for submission, one will still get
> good performance (even if you are not using bio_associate_current()), as
> by default all bio will go to submitting thread's context.

And destroy all per-ioc and cgroup logics in block layer in the
process.

Thanks.

--
tejun

2013-04-09 18:59:33

by Milan Broz

[permalink] [raw]
Subject: Re: [dm-crypt] [dm-devel] dm-crypt performance

On 9.4.2013 20:08, Mikulas Patocka wrote:
>
>
> On Tue, 26 Mar 2013, Milan Broz wrote:
>
>> - Are we sure we are not introducing another side channel in disk
>> encryption? (An unprivileged user can measure timing here.)
>> (Perhaps a stupid reason, but please do not prefer performance to security
>> in encryption. We already have enough timing attacks on AES implementations...)
>
> So use serpent - it is implemented without any data-dependent lookup
> tables, so it has no timing attacks.

I wish switching to something other than AES were such a simple technical
issue for many people. But e.g. just try it in FIPS mode, where AES is the only option :-)

Anyway, using bio_associate_current() seems to be the right way to try now...

Milan

2013-04-09 19:14:11

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 11:57:21AM -0700, Tejun Heo wrote:

[..]
> And destroy all per-ioc and cgroup logics in block layer in the
> process.

Oh, I am in no way suggesting don't use bio_associate_current(). I am
just trying to analyze the performance issue right now and saying that,
as far as performance is concerned, one will get it back even if one
does not use bio_associate_current().

Yes, but one needs to use bio_associate_current() to make sure bios
are attributed to the right cgroup and associated with the right task. This
should help solve the long-standing issue of a task losing its ioprio
when a dm-crypt target is in the stack.

Thanks
Vivek

2013-04-09 19:42:51

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches



On Tue, 9 Apr 2013, Tejun Heo wrote:

> Hey,
>
> On Tue, Apr 09, 2013 at 02:08:06PM -0400, Mikulas Patocka wrote:
> > > Hmmm? Why not just keep the issuing order along with plugging
> > > boundaries?
> >
> > What do you mean?
> >
> > I used to have a patch that keeps order of requests as they were
> > introduced, but sorting the requests according to sector number is a bit
> > simpler.
>
> You're still destroying the context information. Please just keep the
> issuing order along with plugging boundaries.
>
> > > As I wrote before, please use bio_associate_current(). Currently,
> > > dm-crypt is completely messing up all the context information that cfq
> > > depends on to schedule IOs. Of course, it doesn't perform well.
> >
> > bio_associate_current() is only valid on a system with cgroups and there
> > are no cgroups on the kernel where I tested it. It is an empty function:
> >
> > static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
>
> Yeah, because blkcg was the only user. Please feel free to drop the
> ifdefs. It covers both iocontext and cgroup association.
>
> Thanks.

If I drop the ifdefs, it doesn't compile (because other cgroup stuff is
missing).

So I enabled bio cgroups.

bio_associate_current can't be used, because by the time we allocate the
outgoing write bio, we are no longer in the process that submitted the
original bio.

Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -
in the submitting process I record the "ioc" and "css" fields in the
"dm_crypt_io" structure and set these fields on all outgoing bios. It has
no effect on performance; it is as bad as if I hadn't done it.

Mikulas

---
(this is the patch that I used, to be applied after
dm-crypt-unbound-workqueue.patch)

---
drivers/md/dm-crypt.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

Index: linux-3.8.6-fast/drivers/md/dm-crypt.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-09 20:32:41.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-09 21:29:12.000000000 +0200
@@ -20,6 +20,7 @@
#include <linux/backing-dev.h>
#include <linux/atomic.h>
#include <linux/scatterlist.h>
+#include <linux/cgroup.h>
#include <asm/page.h>
#include <asm/unaligned.h>
#include <crypto/hash.h>
@@ -60,6 +61,9 @@ struct dm_crypt_io {
int error;
sector_t sector;
struct dm_crypt_io *base_io;
+
+ struct io_context *ioc;
+ struct cgroup_subsys_state *css;
};

struct dm_crypt_request {
@@ -797,6 +801,14 @@ static struct bio *crypt_alloc_buffer(st
if (!clone)
return NULL;

+ if (unlikely(io->base_io != NULL)) {
+ clone->bi_ioc = io->base_io->ioc;
+ clone->bi_css = io->base_io->css;
+ } else {
+ clone->bi_ioc = io->ioc;
+ clone->bi_css = io->css;
+ }
+
clone_init(io, clone);
*out_of_pages = 0;

@@ -859,6 +871,9 @@ static struct dm_crypt_io *crypt_io_allo
io->ctx.req = NULL;
atomic_set(&io->io_pending, 0);

+ io->ioc = NULL;
+ io->css = NULL;
+
return io;
}

@@ -884,6 +899,14 @@ static void crypt_dec_pending(struct dm_

if (io->ctx.req)
mempool_free(io->ctx.req, cc->req_pool);
+
+ if (io->ioc) {
+ put_io_context(io->ioc);
+ }
+ if (io->css) {
+ css_put(io->css);
+ }
+
mempool_free(io, cc->io_pool);

if (likely(!base_io))
@@ -927,6 +950,9 @@ static void crypt_endio(struct bio *clon
if (rw == WRITE)
crypt_free_buffer_pages(cc, clone);

+ clone->bi_ioc = NULL;
+ clone->bi_css = NULL;
+
bio_put(clone);

if (rw == READ && !error) {
@@ -1658,6 +1684,21 @@ static int crypt_map(struct dm_target *t

io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector));

+ if (current->io_context) {
+ struct io_context *ioc = current->io_context;
+ struct cgroup_subsys_state *css;
+
+ get_io_context_active(ioc);
+ io->ioc = ioc;
+
+ /* associate blkcg if exists */
+ rcu_read_lock();
+ css = task_subsys_state(current, blkio_subsys_id);
+ if (css && css_tryget(css))
+ io->css = css;
+ rcu_read_unlock();
+ }
+
if (bio_data_dir(io->base_bio) == READ) {
if (kcryptd_io_read(io, GFP_NOWAIT))
kcryptd_queue_io(io);

2013-04-09 19:53:08

by Tejun Heo

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> If I drop ifdefs, it doesn't compile (because other cgroup stuff it
> missing).
>
> So I enabled bio cgroups.
>
> bio_associate_current can't be used, because by the time we allocate the
> outgoing write bio, we are no longer in the process that submitted the
> original bio.

Oh, I suppose it'd need some massaging to selectively turn off the
cgroup part.

> Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -

and we probably need to change that to bio_associate_task().

> in the submitting process I record "ioc" and "css" fields in "dm_crypt_io"
> structure and set these fields on all outgoing bios. It has no effect on
> performance, it is as bad as if I hadn't done it.

A good way to verify that the tagging is correct would be configuring
io limits in the block cgroup and seeing whether the limits are correctly
applied when going through dm-crypt (please test with direct-io or
reads; writeback is horribly broken, sorry). If the tagging is
working correctly, maybe plugging is the overriding factor?
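A sketch of the verification Tejun suggests, assuming the cgroup-v1 blkio controller is mounted at the usual path; the cgroup name, device numbers, and rate below are examples only, not values from this thread:

```shell
# create a test cgroup and throttle reads on the underlying device
# (substitute the real major:minor of the backing disk for 8:16)
mkdir /sys/fs/cgroup/blkio/crypt-test
echo "8:16 1048576" > /sys/fs/cgroup/blkio/crypt-test/blkio.throttle.read_bps_device
echo $$ > /sys/fs/cgroup/blkio/crypt-test/tasks

# direct reads through dm-crypt should now be capped near 1 MiB/s
# if the ioc/css tagging survives the trip through the workqueues
dd if=/dev/mapper/crypt of=/dev/null bs=1M count=64 iflag=direct
```

If the dd rate is unthrottled, the association is being lost somewhere in the stack.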

Thanks.

--
tejun

2013-04-09 20:32:51

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches



On Tue, 9 Apr 2013, Tejun Heo wrote:

> On Tue, Apr 09, 2013 at 03:42:16PM -0400, Mikulas Patocka wrote:
> > If I drop ifdefs, it doesn't compile (because other cgroup stuff it
> > missing).
> >
> > So I enabled bio cgroups.
> >
> > bio_associate_current can't be used, because by the time we allocate the
> > outgoing write bio, we are no longer in the process that submitted the
> > original bio.
>
> Oh, I suppose it'd need some massaging to selectively turn off the
> cgroup part.
>
> > Anyway, I tried to reproduce in dm-crypt what bio_associate_current does -
>
> and we probably need to change that to bio_associate_task().

Generally, we shouldn't associate bios with the "current" task in device
mapper targets. For example, suppose that we have two stacked dm-crypt
targets:

The "current" process pointer in the lower dm-crypt target's request
function always points to the workqueue thread of the upper dm-crypt target
that submits the bios. So if we associate the bio with "current" in the lower
target, we are associating it with a preallocated workqueue thread and we
have already lost the information about who submitted it.

You should associate a bio with a task when you create the bio and "md"
and "dm" midlayers should just forward this association to lower layer
bios.

> > in the submitting process I record "ioc" and "css" fields in "dm_crypt_io"
> > structure and set these fields on all outgoing bios. It has no effect on
> > performance, it is as bad as if I hadn't done it.
>
> A good way to verify that the tagging is correct would be configuring
> io limits in the block cgroup and seeing whether the limits are correctly
> applied when going through dm-crypt (please test with direct-io or
> reads; writeback is horribly broken, sorry). If the tagging is
> working correctly, maybe plugging is the overriding factor?
>
> Thanks.

It doesn't work because device mapper on the underlying layers ignores
bi_ioc and bi_css.

If I make device mapper forward bi_ioc and bi_css to outgoing bios, it
improves performance (from 2:30 to 1:30), but it is still far from
perfect.

Mikulas

---

dm: forward cgroup context

This patch makes dm forward associated cgroup context to cloned bios.

Signed-off-by: Mikulas Patocka <[email protected]>

---
drivers/md/dm.c | 9 +++++++++
fs/bio.c | 2 ++
2 files changed, 11 insertions(+)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-09 22:00:36.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c 2013-04-09 22:19:40.000000000 +0200
@@ -453,6 +453,10 @@ static void free_io(struct mapped_device

static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
{
+#ifdef CONFIG_BLK_CGROUP
+ tio->clone.bi_ioc = NULL;
+ tio->clone.bi_css = NULL;
+#endif
bio_put(&tio->clone);
}

@@ -1124,6 +1128,11 @@ static struct dm_target_io *alloc_tio(st
clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
tio = container_of(clone, struct dm_target_io, clone);

+#ifdef CONFIG_BLK_CGROUP
+ tio->clone.bi_ioc = ci->bio->bi_ioc;
+ tio->clone.bi_css = ci->bio->bi_css;
+#endif
+
tio->io = ci->io;
tio->ti = ti;
memset(&tio->info, 0, sizeof(tio->info));

2013-04-09 21:02:09

by Tejun Heo

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

Hey,

On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote:
> > and we probably need to change that to bio_associate_task().
>
> Generally, we shouldn't associate bios with "current" task in device
> mapper targets. For example suppose that we have two stacked dm-crypt
> targets:

It only follows the first association so it doesn't matter how many
layers it goes through. That said, yeah, there could be situations
where @task is available but the bio's already in the hands of a
different task. If that's the case, change it to
bio_associate_task(@task).

> It doesn't work because device mapper on the underlying layers ignores
> bi_ioc and bi_css.
>
> If I make device mapper forward bi_ioc and bi_css to outgoing bios, it
> improves performance (from 2:30 to 1:30), but it is still far from
> perfect.

For testing, copying bi_ioc and bi_css directly is fine but please add
another interface to copy those for the actual code. Say,
bio_copy_association(@to_bio, @from_bio) or whatever.

As for the performance loss, I'm somewhat confident in saying the
remaining difference would be from ignoring plugging boundaries.

Thanks.

--
tejun

2013-04-09 21:04:04

by Tejun Heo

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 02:02:01PM -0700, Tejun Heo wrote:
> For testing, copying bi_ioc and bi_css directly is fine but please add
> another interface to copy those for the actual code. Say,
> bio_copy_association(@to_bio, @from_bio) or whatever.

Another and probably better possibility is just remembering the
issuing task (you would of course need to hold an extra ref as long as
you wanna use it) and use bio_associate_task() on it when creating new
bios.

Thanks.

--
tejun

2013-04-09 21:07:54

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote:

[..]
> Generally, we shouldn't associate bios with "current" task in device
> mapper targets. For example suppose that we have two stacked dm-crypt
> targets:
>
> The "current" process pointer in the lower dm-crypt target's request
> function always points to the workqueue thread of the upper dm-crypt target
> that submits the bios. So if we associate the bio with "current" in the lower
> target, we are associating it with a preallocated workqueue thread and we
> have already lost the information about who submitted it.
>
> You should associate a bio with a task when you create the bio and "md"
> and "dm" midlayers should just forward this association to lower layer
> bios.

bio_associate_current() returns -EBUSY if the bio has already been
associated with an io context.

So in a stack, if every driver calls bio_associate_current() upon bio
submission, it will automatically make sure the bio gets associated with
the submitter task in the top-level device, and the calls by lower-level
devices will be ignored.

Lower-level devices, I think, just need to make sure this context
info is propagated to cloned bios.


[..]
> +#ifdef CONFIG_BLK_CGROUP
> + tio->clone.bi_ioc = ci->bio->bi_ioc;
> + tio->clone.bi_css = ci->bio->bi_css;

You also need to take references to the ioc and css objects. I guess a
helper function would be better. Maybe something like:

bio_associate_bio_context(bio1, bio2)

which initializes bio2's context from bio1's context.

Thanks
Vivek

2013-04-09 21:18:42

by Mikulas Patocka

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches



On Tue, 9 Apr 2013, Vivek Goyal wrote:

> On Tue, Apr 09, 2013 at 04:32:28PM -0400, Mikulas Patocka wrote:
>
> [..]
> > Generally, we shouldn't associate bios with "current" task in device
> > mapper targets. For example suppose that we have two stacked dm-crypt
> > targets:
> >
> > The "current" process pointer in the lower dm-crypt target's request
> > function always points to the workqueue thread of the upper dm-crypt target
> > that submits the bios. So if we associate the bio with "current" in the lower
> > target, we are associating it with a preallocated workqueue thread and we
> > have already lost the information about who submitted it.
> >
> > You should associate a bio with a task when you create the bio and "md"
> > and "dm" midlayers should just forward this association to lower layer
> > bios.
>
> bio_associate_current() returns -EBUSY if the bio has already been
> associated with an io context.
>
> So in a stack, if every driver calls bio_associate_current() upon bio
> submission, it will automatically make sure the bio gets associated with
> the submitter task in the top-level device, and the calls by lower-level
> devices will be ignored.

The stacking drivers do not pass the same bio to each other.

The stacking driver receives a bio, allocates zero or more new bios and
sends these new bios to the lower layer. So you need to propagate
ownership from the received bio to the newly allocated bios; you don't
want to associate a newly allocated bio with the "current" process.

> Lower level devices I think just need to make sure this context
> info is propogated to cloned bios.
>
>
> [..]
> > +#ifdef CONFIG_BLK_CGROUP
> > + tio->clone.bi_ioc = ci->bio->bi_ioc;
> > + tio->clone.bi_css = ci->bio->bi_css;
>
> You also need to take references to ioc and css objects. I guess a helper
> function will be better. May be something like.

The lifetime of the "tio" structure is shorter than the lifetime of
"ci->bio". So we don't need to increment the reference.

We only need to increment the reference if we copy ownership to a new bio
that could have a longer lifetime than the original bio. But this situation
is very rare - in most stacking drivers the newly allocated bio has a
shorter lifetime than the original one.

> bio_associate_bio_context(bio1, bio2)
>
> And this initialize bio2's context with bio1's context.

Yes, that would be ok.

> Thanks
> Vivek

Mikulas

2013-04-10 19:24:57

by Vivek Goyal

[permalink] [raw]
Subject: Re: dm-crypt parallelization patches

On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote:

[..]
> > bio_associate_current() returns -EBUSY if the bio has already been
> > associated with an io context.
> >
> > So in a stack, if every driver calls bio_associate_current() upon bio
> > submission, it will automatically make sure the bio gets associated with
> > the submitter task in the top-level device, and the calls by lower-level
> > devices will be ignored.
>
> The stacking drivers do not pass the same bio to each other.
>
> The stacking driver receives a bio, allocates zero or more new bios and
> sends these new bios to the lower layer. So you need to propagate
> ownership from the received bio to the newly allocated bios; you don't
> want to associate a newly allocated bio with the "current" process.

Actually I was asking to call bio_associate_current() for the incoming
bio in the driver, and not for the newly created bios by the driver.

For any newly created bios on behalf of this incoming bio, we need to
copy the context so that this context info can be propagated down the
stack.


>
> > Lower level devices I think just need to make sure this context
> > info is propogated to cloned bios.
> >
> >
> > [..]
> > > +#ifdef CONFIG_BLK_CGROUP
> > > + tio->clone.bi_ioc = ci->bio->bi_ioc;
> > > + tio->clone.bi_css = ci->bio->bi_css;
> >
> > You also need to take references to ioc and css objects. I guess a helper
> > function will be better. May be something like.
>
> The lifetime of the "tio" structure is shorter than the lifetime of
> "ci->bio". So we don't need to increment the reference.
>
> We only need to increment the reference if we copy ownership to a new bio
> that could have a longer lifetime than the original bio. But this situation
> is very rare - in most stacking drivers the newly allocated bio has a
> shorter lifetime than the original one.

I think it is not a good idea to rely on the fact that cloned or newly
created bios will have a shorter lifetime than the original bio. In fact,
when the bio completes and you free it up, bio_disassociate_task() will try
to put the io context and blkcg references. So it is important to take
a reference if you are copying context info into any newly created bio.

Thanks
Vivek

2013-04-10 23:43:35

by Mikulas Patocka

[permalink] [raw]
Subject: [PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)



On Wed, 10 Apr 2013, Vivek Goyal wrote:

> On Tue, Apr 09, 2013 at 05:18:25PM -0400, Mikulas Patocka wrote:
>
> [..]
> > > bio_associate_current() returns -EBUSY if the bio has already been
> > > associated with an io context.
> > >
> > > So in a stack, if every driver calls bio_associate_current() upon bio
> > > submission, it will automatically make sure the bio gets associated with
> > > the submitter task in the top-level device, and the calls by lower-level
> > > devices will be ignored.
> >
> > The stacking drivers do not pass the same bio to each other.
> >
> > The stacking driver receives a bio, allocates zero or more new bios and
> > sends these new bios to the lower layer. So you need to propagate
> > ownership from the received bio to the newly allocated bios; you don't
> > want to associate a newly allocated bio with the "current" process.
>
> Actually I was asking to call bio_associate_current() for the incoming
> bio in the driver and not for the newly created bios by the driver.

Yes, I think it's better to call it in the driver than in the upper layer,
because if the driver doesn't forward the bio to a worker thread, we don't
have to call bio_associate_current() and we save a few atomic
instructions.

> For any newly created bios on behalf of this incoming bio, we need to
> copy the context so that this context info can be propogated down the
> stack.

See this patch. It implements cgroup associations for dm core and dm-crypt
target.

Do you think the interface is correct? (i.e. can I start modifying more
dm targets to use it?)

> > We only need to increment the reference if we copy ownership to a new bio
> > that could have a longer lifetime than the original bio. But this situation
> > is very rare - in most stacking drivers the newly allocated bio has a
> > shorter lifetime than the original one.
>
> I think it is not a good idea to rely on the fact that cloned or newly
> created bios will have a shorter lifetime than the original bio. In fact,
> when the bio completes and you free it up, bio_disassociate_task() will try
> to put the io context and blkcg references. So it is important to take
> a reference if you are copying context info into any newly created bio.

We clear the association with bio_clear_context before freeing the bio.

> Thanks
> Vivek


---

dm: retain cgroup context

This patch makes dm and dm-crypt target retain cgroup context.

New functions bio_clone_context and bio_clear_context are introduced.
bio_associate_current and bio_disassociate_task are exported to modules.

dm core is changed so that it copies the context to cloned bio. dm
associates the bio with current process if it is going to offload bio to a
thread.

dm-crypt copies the context to outgoing bios and associates the bio with
current process.

Signed-off-by: Mikulas Patocka <[email protected]>

---
drivers/md/dm-crypt.c | 17 ++++++++++++++---
drivers/md/dm.c | 5 +++++
fs/bio.c | 2 ++
include/linux/bio.h | 39 +++++++++++++++++++++++++++++++++++++++
4 files changed, 60 insertions(+), 3 deletions(-)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-10 14:39:28.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c 2013-04-10 20:09:52.000000000 +0200
@@ -453,6 +453,7 @@ static void free_io(struct mapped_device

static void free_tio(struct mapped_device *md, struct dm_target_io *tio)
{
+ bio_clear_context(&tio->clone);
bio_put(&tio->clone);
}

@@ -521,6 +522,8 @@ static void queue_io(struct mapped_devic
{
unsigned long flags;

+ bio_associate_current(bio);
+
spin_lock_irqsave(&md->deferred_lock, flags);
bio_list_add(&md->deferred, bio);
spin_unlock_irqrestore(&md->deferred_lock, flags);
@@ -1124,6 +1127,8 @@ static struct dm_target_io *alloc_tio(st
clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
tio = container_of(clone, struct dm_target_io, clone);

+ bio_clone_context(ci->bio, &tio->clone);
+
tio->io = ci->io;
tio->ti = ti;
memset(&tio->info, 0, sizeof(tio->info));
Index: linux-3.8.6-fast/include/linux/bio.h
===================================================================
--- linux-3.8.6-fast.orig/include/linux/bio.h 2013-04-10 14:38:56.000000000 +0200
+++ linux-3.8.6-fast/include/linux/bio.h 2013-04-10 20:14:08.000000000 +0200
@@ -291,6 +291,15 @@ extern void bvec_free_bs(struct bio_set
extern unsigned int bvec_nr_vecs(unsigned short idx);

#ifdef CONFIG_BLK_CGROUP
+/*
+ * bio_associate_current associates the bio with the current process. It must be
+ * called by any block device driver that passes the bio to a different process
+ * to be processed. It must be called in the original process.
+ * bio_associate_current does nothing if the bio is already associated.
+ *
+ * bio_disassociate_task disassociates the bio from the task. It is called
+ * automatically at bio destruction.
+ */
int bio_associate_current(struct bio *bio);
void bio_disassociate_task(struct bio *bio);
#else /* CONFIG_BLK_CGROUP */
@@ -299,6 +308,36 @@ static inline void bio_disassociate_task
#endif /* CONFIG_BLK_CGROUP */

/*
+ * bio_clone_context copies cgroup context from the original bio to the new bio.
+ * It is used by bio midlayer drivers that create new bio based on an original
+ * bio and forward it to the lower layer.
+ *
+ * No reference counts are incremented - it is assumed that the lifetime of the
+ * new bio is shorter than the lifetime of the original bio. If the new bio can
+ * outlive the old bio, the caller must increment the reference counts.
+ *
+ * Before freeing the new bio, the caller must clear the context with
+ * bio_clear_context function. If bio_clear_context were not called, the
+ * reference counts would be decremented on both the new and the original bio,
+ * resulting in a crash due to reference count underflow.
+ */
+static inline void bio_clone_context(struct bio *orig, struct bio *new)
+{
+#ifdef CONFIG_BLK_CGROUP
+ new->bi_ioc = orig->bi_ioc;
+ new->bi_css = orig->bi_css;
+#endif
+}
+
+static inline void bio_clear_context(struct bio *bio)
+{
+#ifdef CONFIG_BLK_CGROUP
+ bio->bi_ioc = NULL;
+ bio->bi_css = NULL;
+#endif
+}
+
+/*
* bio_set is used to allow other portions of the IO system to
* allocate their own private memory pools for bio and iovec structures.
* These memory pools in turn all allocate from the bio_slab
Index: linux-3.8.6-fast/drivers/md/dm-crypt.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-10 14:38:56.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-10 19:52:56.000000000 +0200
@@ -181,6 +181,7 @@ struct crypt_config {
static struct kmem_cache *_crypt_io_pool;

static void clone_init(struct dm_crypt_io *, struct bio *);
+static void clone_free(struct bio *);
static void kcryptd_queue_crypt(struct dm_crypt_io *io);
static u8 *iv_of_dmreq(struct crypt_config *cc, struct dm_crypt_request *dmreq);

@@ -846,7 +847,7 @@ static struct bio *crypt_alloc_buffer(st
}

if (!clone->bi_size) {
- bio_put(clone);
+ clone_free(clone);
return NULL;
}

@@ -945,7 +946,7 @@ static void crypt_endio(struct bio *clon
if (rw == WRITE)
crypt_free_buffer_pages(cc, clone);

- bio_put(clone);
+ clone_free(clone);

if (rw == READ && !error) {
kcryptd_queue_crypt(io);
@@ -966,6 +967,14 @@ static void clone_init(struct dm_crypt_i
clone->bi_end_io = crypt_endio;
clone->bi_bdev = cc->dev->bdev;
clone->bi_rw = io->base_bio->bi_rw;
+
+ bio_clone_context(io->base_bio, clone);
+}
+
+static void clone_free(struct bio *clone)
+{
+ bio_clear_context(clone);
+ bio_put(clone);
}

static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
@@ -1026,7 +1035,7 @@ static void kcryptd_crypt_write_io_submi

if (unlikely(io->error < 0)) {
crypt_free_buffer_pages(cc, clone);
- bio_put(clone);
+ clone_free(clone);
crypt_dec_pending(io);
return;
}
@@ -1692,6 +1701,8 @@ static int crypt_map(struct dm_target *t
return DM_MAPIO_REMAPPED;
}

+ bio_associate_current(bio);
+
io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector));

if (bio_data_dir(io->base_bio) == READ) {
Index: linux-3.8.6-fast/fs/bio.c
===================================================================
--- linux-3.8.6-fast.orig/fs/bio.c 2013-04-10 19:49:13.000000000 +0200
+++ linux-3.8.6-fast/fs/bio.c 2013-04-10 19:50:10.000000000 +0200
@@ -1703,6 +1703,7 @@ int bio_associate_current(struct bio *bi

return 0;
}
+EXPORT_SYMBOL(bio_associate_current);

/**
* bio_disassociate_task - undo bio_associate_current()
@@ -1719,6 +1720,7 @@ void bio_disassociate_task(struct bio *b
bio->bi_css = NULL;
}
}
+EXPORT_SYMBOL(bio_disassociate_task);

#endif /* CONFIG_BLK_CGROUP */

2013-04-10 23:50:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

On Wed, Apr 10, 2013 at 07:42:59PM -0400, Mikulas Patocka wrote:
> /*
> + * bio_clone_context copies cgroup context from the original bio to the new bio.
> + * It is used by bio midlayer drivers that create new bio based on an original
> + * bio and forward it to the lower layer.
> + *
> + * No reference counts are incremented - it is assumed that the lifetime of the
> + * new bio is shorter than the lifetime of the original bio. If the new bio can
> + * outlive the old bio, the caller must increment the reference counts.
> + *
> + * Before freeing the new bio, the caller must clear the context with the
> + * bio_clear_context function. If bio_clear_context were not called, the
> + * reference counts would be decremented on both the new and the original bio,
> + * resulting in a crash due to reference count underflow.
> + */
> +static inline void bio_clone_context(struct bio *orig, struct bio *new)
> +{
> +#ifdef CONFIG_BLK_CGROUP
> + new->bi_ioc = orig->bi_ioc;
> + new->bi_css = orig->bi_css;

Hmmm... Let's not do this. Sure, you'd be saving several instructions
but the gain is unlikely to be significant given that those cachelines
are likely to be hot anyway. Also, please name it
bio_copy_association().

Thanks.

--
tejun

2013-04-11 19:49:51

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)



On Wed, 10 Apr 2013, Tejun Heo wrote:

> On Wed, Apr 10, 2013 at 07:42:59PM -0400, Mikulas Patocka wrote:
> > /*
> > + * bio_clone_context copies cgroup context from the original bio to the new bio.
> > + * It is used by bio midlayer drivers that create new bio based on an original
> > + * bio and forward it to the lower layer.
> > + *
> > + * No reference counts are incremented - it is assumed that the lifetime of the
> > + * new bio is shorter than the lifetime of the original bio. If the new bio can
> > + * outlive the old bio, the caller must increment the reference counts.
> > + *
> > + * Before freeing the new bio, the caller must clear the context with the
> > + * bio_clear_context function. If bio_clear_context were not called, the
> > + * reference counts would be decremented on both the new and the original bio,
> > + * resulting in a crash due to reference count underflow.
> > + */
> > +static inline void bio_clone_context(struct bio *orig, struct bio *new)
> > +{
> > +#ifdef CONFIG_BLK_CGROUP
> > + new->bi_ioc = orig->bi_ioc;
> > + new->bi_css = orig->bi_css;
>
> Hmmm... Let's not do this. Sure, you'd be saving several instructions
> but the gain is unlikely to be significant given that those cachelines
> are likely to be hot anyway. Also, please name it
> bio_copy_association().
>
> Thanks.
>
> --
> tejun

If the bi_css pointer points to a structure that is shared between
processes, using an atomic instruction causes cache line bouncing - it doesn't
cost a few instructions, it costs two to three hundred cycles.

I modified the patch to use a new flag, BIO_DROP_CGROUP_REFCOUNT, to note that
the refcount must be decremented - if the flag is set, refcounts are
decremented when the bio is destroyed; if it is not set, the references are
borrowed from the upper-layer bio.

It is less bug-prone than the previous patch.
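The borrow-or-own rule the flag encodes can be sketched in plain userspace C. All names below are hypothetical stand-ins, not the kernel structures: an owning association takes a reference and sets the flag; a clone borrows the pointer with the flag clear; teardown drops the reference only when the flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct io_context / cgroup_subsys_state;
 * none of these names are the real kernel API. */
struct ctx { int refcnt; };

struct fake_bio {
	struct ctx *css;
	bool drop_ref;		/* models BIO_DROP_CGROUP_REFCOUNT */
};

static void ctx_get(struct ctx *c) { c->refcnt++; }
static void ctx_put(struct ctx *c) { c->refcnt--; }

/* Owning association: take a reference and remember to drop it. */
static void fake_associate(struct fake_bio *bio, struct ctx *c)
{
	ctx_get(c);
	bio->css = c;
	bio->drop_ref = true;
}

/* Borrowing clone: copy the pointer, take no reference. The clone
 * must not outlive the bio it borrows from. */
static void fake_clone_context(struct fake_bio *bio, struct fake_bio *src)
{
	assert(bio->css == NULL);
	bio->css = src->css;
	bio->drop_ref = false;
}

/* Teardown: only an owning bio drops the reference. */
static void fake_disassociate(struct fake_bio *bio)
{
	if (bio->css && bio->drop_ref)
		ctx_put(bio->css);
	bio->css = NULL;
	bio->drop_ref = false;
}
```

The invariant the comment in the patch states - the clone's lifetime is shorter than the original's - is exactly what makes the reference-free copy safe here.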

Mikulas

---

dm: retain cgroup context

This patch makes dm and dm-crypt target retain cgroup context.

New function bio_clone_context is introduced. It copies cgroup context
from one bio to another without incrementing the reference count. A new
bio flag BIO_DROP_CGROUP_REFCOUNT specifies that cgroup refcounts should
be decremented when the bio is freed.

bio_associate_current and bio_disassociate_task are exported to modules.

dm core is changed so that it copies the context to cloned bios. dm
associates the bio with the current process if it is going to offload the
bio to a thread.

dm-crypt associates the bio with the current process and copies the
context to outgoing bios.

Signed-off-by: Mikulas Patocka <[email protected]>

---
drivers/md/dm-crypt.c | 4 ++++
drivers/md/dm.c | 7 +++++--
fs/bio.c | 11 +++++++++--
include/linux/bio.h | 27 +++++++++++++++++++++++++++
include/linux/blk_types.h | 3 +++
5 files changed, 48 insertions(+), 4 deletions(-)

Index: linux-3.8.6-fast/drivers/md/dm.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm.c 2013-04-11 17:29:09.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm.c 2013-04-11 19:33:47.000000000 +0200
@@ -1124,6 +1124,8 @@ static struct dm_target_io *alloc_tio(st
clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, ci->md->bs);
tio = container_of(clone, struct dm_target_io, clone);

+ bio_clone_context(&tio->clone, ci->bio);
+
tio->io = ci->io;
tio->ti = ti;
memset(&tio->info, 0, sizeof(tio->info));
@@ -1469,9 +1471,10 @@ static void _dm_request(struct request_q
if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
dm_put_live_table(md, srcu_idx);

- if (bio_rw(bio) != READA)
+ if (bio_rw(bio) != READA) {
+ bio_associate_current(bio);
queue_io(md, bio);
- else
+ } else
bio_io_error(bio);
return;
}
Index: linux-3.8.6-fast/include/linux/bio.h
===================================================================
--- linux-3.8.6-fast.orig/include/linux/bio.h 2013-04-11 17:29:07.000000000 +0200
+++ linux-3.8.6-fast/include/linux/bio.h 2013-04-11 19:34:11.000000000 +0200
@@ -291,6 +291,15 @@ extern void bvec_free_bs(struct bio_set
extern unsigned int bvec_nr_vecs(unsigned short idx);

#ifdef CONFIG_BLK_CGROUP
+/*
+ * bio_associate_current associates the bio with the current process. It should
+ * be called by any block device driver that passes the bio to a different
+ * process to be processed. It must be called in the original process.
+ * bio_associate_current does nothing if the bio is already associated.
+ *
+ * bio_disassociate_task dissociates the bio from the task. It is called
+ * automatically at bio destruction.
+ */
int bio_associate_current(struct bio *bio);
void bio_disassociate_task(struct bio *bio);
#else /* CONFIG_BLK_CGROUP */
@@ -299,6 +308,24 @@ static inline void bio_disassociate_task
#endif /* CONFIG_BLK_CGROUP */

/*
+ * bio_clone_context copies cgroup context from the original bio to the new bio.
+ * It is used by bio midlayer drivers that create new bio based on an original
+ * bio and forward it to the lower layer.
+ *
+ * No reference counts are incremented - it is assumed that the lifetime of the
+ * new bio is shorter than the lifetime of the original bio. If the new bio can
+ * outlive the old bio, the caller must increment the reference counts.
+ */
+static inline void bio_clone_context(struct bio *bio, struct bio *bio_src)
+{
+#ifdef CONFIG_BLK_CGROUP
+ BUG_ON(bio->bi_ioc != NULL);
+ bio->bi_ioc = bio_src->bi_ioc;
+ bio->bi_css = bio_src->bi_css;
+#endif
+}
+
+/*
* bio_set is used to allow other portions of the IO system to
* allocate their own private memory pools for bio and iovec structures.
* These memory pools in turn all allocate from the bio_slab
Index: linux-3.8.6-fast/drivers/md/dm-crypt.c
===================================================================
--- linux-3.8.6-fast.orig/drivers/md/dm-crypt.c 2013-04-11 17:29:07.000000000 +0200
+++ linux-3.8.6-fast/drivers/md/dm-crypt.c 2013-04-11 19:33:48.000000000 +0200
@@ -966,6 +966,8 @@ static void clone_init(struct dm_crypt_i
clone->bi_end_io = crypt_endio;
clone->bi_bdev = cc->dev->bdev;
clone->bi_rw = io->base_bio->bi_rw;
+
+ bio_clone_context(clone, io->base_bio);
}

static int kcryptd_io_read(struct dm_crypt_io *io, gfp_t gfp)
@@ -1692,6 +1694,8 @@ static int crypt_map(struct dm_target *t
return DM_MAPIO_REMAPPED;
}

+ bio_associate_current(bio);
+
io = crypt_io_alloc(cc, bio, dm_target_offset(ti, bio->bi_sector));

if (bio_data_dir(io->base_bio) == READ) {
Index: linux-3.8.6-fast/fs/bio.c
===================================================================
--- linux-3.8.6-fast.orig/fs/bio.c 2013-04-11 17:29:07.000000000 +0200
+++ linux-3.8.6-fast/fs/bio.c 2013-04-11 19:30:58.000000000 +0200
@@ -1690,6 +1690,8 @@ int bio_associate_current(struct bio *bi
if (!ioc)
return -ENOENT;

+ bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT;
+
/* acquire active ref on @ioc and associate */
get_io_context_active(ioc);
bio->bi_ioc = ioc;
@@ -1703,6 +1705,7 @@ int bio_associate_current(struct bio *bi

return 0;
}
+EXPORT_SYMBOL(bio_associate_current);

/**
* bio_disassociate_task - undo bio_associate_current()
@@ -1711,14 +1714,18 @@ int bio_associate_current(struct bio *bi
void bio_disassociate_task(struct bio *bio)
{
if (bio->bi_ioc) {
- put_io_context(bio->bi_ioc);
+ if (bio_flagged(bio, BIO_DROP_CGROUP_REFCOUNT))
+ put_io_context(bio->bi_ioc);
bio->bi_ioc = NULL;
}
if (bio->bi_css) {
- css_put(bio->bi_css);
+ if (bio_flagged(bio, BIO_DROP_CGROUP_REFCOUNT))
+ css_put(bio->bi_css);
bio->bi_css = NULL;
}
+ bio->bi_flags &= ~(1UL << BIO_DROP_CGROUP_REFCOUNT);
}
+EXPORT_SYMBOL(bio_disassociate_task);

#endif /* CONFIG_BLK_CGROUP */

Index: linux-3.8.6-fast/include/linux/blk_types.h
===================================================================
--- linux-3.8.6-fast.orig/include/linux/blk_types.h 2013-04-11 17:29:07.000000000 +0200
+++ linux-3.8.6-fast/include/linux/blk_types.h 2013-04-11 17:29:10.000000000 +0200
@@ -114,6 +114,9 @@ struct bio {
#define BIO_FS_INTEGRITY 9 /* fs owns integrity data, not block layer */
#define BIO_QUIET 10 /* Make BIO Quiet */
#define BIO_MAPPED_INTEGRITY 11/* integrity metadata has been remapped */
+#ifdef CONFIG_BLK_CGROUP
+#define BIO_DROP_CGROUP_REFCOUNT 12 /* decrement cgroup context refcount */
+#endif

/*
* Flags starting here get preserved by bio_reset() - this includes

2013-04-11 19:52:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

On Thu, Apr 11, 2013 at 03:49:20PM -0400, Mikulas Patocka wrote:
> If the bi_css pointer points to a structure that is shared between
> processes, using an atomic instruction causes cache line bouncing - it doesn't
> cost a few instructions, it costs two to three hundred cycles.
>
> I modified the patch to use new flag BIO_DROP_CGROUP_REFCOUNT to note that
> the refcount must be decremented - if the flag is set, refcounts must be
> decremented when bio is destroyed, if it is not set, references are
> borrowed from upper layer bio.
>
> It is less bug-prone than the previous patch.

If this becomes an actual bottleneck, the right thing to do is making
css ref per-cpu. Please stop messing around with refcounting.

NACK.

--
tejun

2013-04-11 20:00:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> If this becomes an actual bottleneck, the right thing to do is making
> css ref per-cpu. Please stop messing around with refcounting.

If you think this kind of hackery is acceptable, you really need to
re-evaluate your priorities in making engineering decisions. In
tightly coupled code, maybe, but you're trying to introduce utterly
broken error-prone thing as a generic block layer API. I mean, are
you for real?

--
tejun

2013-04-12 00:06:44

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)



On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> > If this becomes an actual bottleneck, the right thing to do is making
> > css ref per-cpu. Please stop messing around with refcounting.
>
> If you think this kind of hackery is acceptable, you really need to
> re-evaluate your priorities in making engineering decisions. In
> tightly coupled code, maybe, but you're trying to introduce utterly
> broken error-prone thing as a generic block layer API. I mean, are
> you for real?
>
> --
> tejun

All that I can tell you is that adding an empty atomic operation
"cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
to bio_clone_context and bio_disassociate_task increases the time to run a
benchmark from 23 to 40 seconds.

Every single atomic reference in the block layer is measurable.
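A userspace sketch (not kernel code; the 64-byte cache-line size is an assumption) of why even a "do-nothing" atomic is measurable: every atomic operation on a counter shared by all cores drags its cache line between them, while padded per-thread slots stay core-local. Only the counts are asserted here; the timing gap between the two paths is what the benchmark below measures.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define ITERS    100000L

/* One counter shared by all threads: every atomic op bounces its
 * cache line between cores. */
static atomic_long shared;

/* Per-thread slots padded to an assumed 64-byte cache line: each
 * core keeps its own line, so there is no bouncing. */
static struct { long v; char pad[64 - sizeof(long)]; } percpu[NTHREADS];

static void *hammer(void *arg)
{
	long id = (long)arg;

	for (long i = 0; i < ITERS; i++) {
		atomic_fetch_add(&shared, 1);	/* contended path */
		percpu[id].v++;			/* uncontended path */
	}
	return NULL;
}

static long run(void)
{
	pthread_t t[NTHREADS];
	long sum = 0;

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, hammer, (void *)i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	for (int i = 0; i < NTHREADS; i++)
		sum += percpu[i].v;
	return sum;
}
```

Both paths end with the same totals; only the shared atomic pays the cache-line traffic.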



How did I measure it:

(1) use dm SRCU patches
(http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/)
that replace some atomic accesses in device mapper with SRCU. The patches
will likely be included in the kernel to improve performance.

(2) use the patch v2 that I posted in this thread

(3) add bio_associate_current(bio) to _dm_request (so that each bio is
associated with a process even if it is not offloaded to a workqueue)

(4) change bio_clone_context to actually increase reference counts:
static inline void bio_clone_context(struct bio *bio, struct bio *bio_src)
{
#ifdef CONFIG_BLK_CGROUP
	BUG_ON(bio->bi_ioc != NULL);
	if (bio_src->bi_ioc) {
		get_io_context_active(bio_src->bi_ioc);
		bio->bi_ioc = bio_src->bi_ioc;
		if (bio_src->bi_css && css_tryget(bio_src->bi_css))
			bio->bi_css = bio_src->bi_css;
		bio->bi_flags |= 1UL << BIO_DROP_CGROUP_REFCOUNT;
	}
#endif
}

(5) add "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt,
bio->bi_css->refcnt)" to bio_clone_context and bio_disassociate_task


Now, measuring:
- create a 4GiB ramdisk, fill it with dd so that it is allocated
- create 5 nested device mapper linear targets on it
- run "time fio --rw=randrw --size=1G --bs=512
--filename=/dev/mapper/linear5 --direct=1 --name=job1 --name=job2
--name=job3 --name=job4 --name=job5 --name=job6 --name=job7
--name=job8 --name=job9 --name=job10 --name=job11 --name=job12"
(it was run on a 12-core machine, so there are 12 concurrent jobs)


If I measure kernel (4), the benchmark takes 23 seconds. For kernel (5) it
takes 40 seconds.

Mikulas

2013-04-12 00:23:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
> All that I can tell you is that adding an empty atomic operation
> "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
> to bio_clone_context and bio_disassociate_task increases the time to run a
> benchmark from 23 to 40 seconds.

Right, linear target on ramdisk, very realistic, and you know what,
hell with dm, let's just hand code everything into submit_bio(). I'm
sure it will speed up your test case significantly.

If this actually matters, improve it in *sane* way. Make the refcnts
per-cpu and not use atomic ops. In fact, we already have proposed
implementation of percpu refcnt which is being used by aio restructure
patches and likely to be included in some form. It's not quite ready
yet, so please work on something useful like that instead of
continuing this nonsense.
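The per-cpu refcount Tejun refers to can be sketched as a toy, single-threaded model (the real implementation needs RCU to switch modes safely; nothing here is kernel API, and all names are made up): get and put touch only a per-CPU slot, and the shared counter is used only after the reference is "killed".

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy model of a per-cpu reference count; single-threaded, so no
 * RCU and no atomics. Nothing here is real kernel API. */
struct toy_pcpu_ref {
	long cpu_count[NR_CPUS];	/* fast path: one slot per CPU */
	long atomic_count;		/* slow path, used after kill */
	bool killed;
};

static void toy_ref_get(struct toy_pcpu_ref *r, int cpu)
{
	if (!r->killed)
		r->cpu_count[cpu]++;	/* no shared cache line touched */
	else
		r->atomic_count++;	/* would be a real atomic op */
}

static void toy_ref_put(struct toy_pcpu_ref *r, int cpu)
{
	if (!r->killed)
		r->cpu_count[cpu]--;
	else
		r->atomic_count--;
}

/* Switch to the shared counter and fold in the per-cpu deltas;
 * returns the number of references still outstanding. */
static long toy_ref_kill(struct toy_pcpu_ref *r)
{
	r->killed = true;
	for (int c = 0; c < NR_CPUS; c++)
		r->atomic_count += r->cpu_count[c];
	return r->atomic_count;
}
```

The point of the design: the hot get/put path never touches a shared cache line, so the cost that the benchmark above exposed simply disappears.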

--
tejun

2013-04-12 05:59:39

by Milan Broz

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context

On 12.4.2013 2:22, Tejun Heo wrote:
> On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
>> All that I can tell you is that adding an empty atomic operation
>> "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
>> to bio_clone_context and bio_disassociate_task increases the time to run a
>> benchmark from 23 to 40 seconds.
>
> Right, linear target on ramdisk, very realistic, and you know what,
> hell with dm, let's just hand code everything into submit_bio(). I'm
> sure it will speed up your test case significantly.
>
> If this actually matters, improve it in *sane* way. Make the refcnts
> per-cpu and not use atomic ops. In fact, we already have proposed
> implementation of percpu refcnt which is being used by aio restructure
> patches and likely to be included in some form. It's not quite ready
> yet, so please work on something useful like that instead of
> continuing this non-sense.

Hey, what's going on here?

Seems the dmcrypt problem has transformed into a block-level refcount flame :)

Mikulas, please, can you talk to Tejun and find some better way to
solve the DM & block-level bio context handling here?
(Ideally with some realistic scenario - you have enough hw in Red Hat to try;
some raid0 SSDs with linear on top should be a good example)
and later (when agreed) implement it in dmcrypt?

I definitely do not want dmcrypt to become the guinea pig here; it should
remain as simple as possible and should do transparent _encryption_,
not any inline device-mapper super-optimizing games.


Thanks,
Milan

2013-04-12 18:01:31

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)



On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 12:52:03PM -0700, Tejun Heo wrote:
> > If this becomes an actual bottleneck, the right thing to do is making
> > css ref per-cpu. Please stop messing around with refcounting.
>
> If you think this kind of hackery is acceptable, you really need to
> re-evaluate your priorities in making engineering decisions. In
> tightly coupled code, maybe, but you're trying to introduce utterly
> broken error-prone thing as a generic block layer API. I mean, are
> you for real?
>
> --
> tejun

Please describe what is wrong with the code. Why do you call it hackery?

When device mapper is creating a cloned bio for the lower layer, it is
already assumed that the cloned bio has a shorter lifetime than the original
bio it was created from.

The device mapper copies a part of the bio vector from the original bio to
the cloned bio; it copies pointers to pages without increasing the reference
counts on those pages. As long as the original bio is not returned with
bio_endio, the pages must exist, so there is no need to increase their
reference counts.

Now, if copying pointers without increasing reference counts is OK for
pages, why do you think it is not OK for cgroup context?

Why do you call this bug-prone? - how do you think a bug could happen? If
someone in device mapper erroneously ends the master bio while the cloned
bio is still in progress, there is already a memory corruption bug (the
cloned bio vector points to potentially freed pages) and safeguarding the
cgroup pointers won't fix it.


So if you think that reference counts should be incremented by every clone
of the original bio, what kind of bug should it protect against? If we
don't increment reference counts for pages, why should we do it for cgroup
pointers?

Mikulas

2013-04-12 18:17:28

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)



On Thu, 11 Apr 2013, Tejun Heo wrote:

> On Thu, Apr 11, 2013 at 08:06:10PM -0400, Mikulas Patocka wrote:
> > All that I can tell you is that adding an empty atomic operation
> > "cmpxchg(&bio->bi_css->refcnt, bio->bi_css->refcnt, bio->bi_css->refcnt);"
> > to bio_clone_context and bio_disassociate_task increases the time to run a
> > benchmark from 23 to 40 seconds.
>
> Right, linear target on ramdisk, very realistic, and you know what,
> hell with dm, let's just hand code everything into submit_bio(). I'm
> sure it will speed up your test case significantly.

The purpose of this benchmarking is not to optimize linear target on
ramdisk. The purpose is to optimize the kernel for upcoming massively
parallel servers, with possibly hundreds of energy-efficient cores or so.
On these systems every single atomic reference really becomes a
bottleneck. And since I don't have such a massively parallel server, I am
testing it on a 12-core machine with a ramdisk - the performance drop due to
atomic accesses can be measured even there.

I already eliminated most of the atomic operations with this patch
http://people.redhat.com/~mpatocka/patches/kernel/dm-lock-optimization/dm-optimize.patch
And I don't see the sense in adding more for cgroups, especially if it can
be easily avoided.

Mikulas

2013-04-12 18:29:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> So if you think that reference counts should be incremented by every clone
> of the original bio, what kind of bug should it protect against? If we
> don't increment reference counts for pages, why should we do it for cgroup
> pointers?

These things are called trade-offs. You look at the overhead of the
things and how complex / fragile things get when certain shortcuts are
taken and how well contained and easy to verify / debug when things go
wrong and then make your choice.

Do the two really look the same to you? The page refs are much more
expensive, mostly contained in and the main focus of dm. ioc/css refs
aren't that expensive to begin with, css refcnting is widely scattered
across the kernel, the association interface is likely to be used by
any entity issuing IOs asynchronously soonish, and there is much saner
way to improve it - which would be beneficial not only to block / dm
but everyone else using it.

Something being done in one place doesn't automatically make it okay
everywhere else. We can and do use hackery but *with* discretion.

If you still can't understand, I'm not sure what more I can tell you.

--
tejun

2013-04-15 13:02:38

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)



On Fri, 12 Apr 2013, Tejun Heo wrote:

> On Fri, Apr 12, 2013 at 02:01:08PM -0400, Mikulas Patocka wrote:
> > So if you think that reference counts should be incremented by every clone
> > of the original bio, what kind of bug should it protect against? If we
> > don't increment reference counts for pages, why should we do it for cgroup
> > pointers?
>
> These things are called trade-offs. You look at the overhead of the
> things and how complex / fragile things get when certain shortcuts are
> taken and how well contained and easy to verify / debug when things go
> wrong and then make your choice.

So what are we trading here, and for what?


The patch eliminates the atomic references when passing the bio down the
stack of bio drivers.


The patch adds a flag BIO_DROP_CGROUP_REFCOUNT, modifies it at two places
and tests it on two places - that is not big.

The flag BIO_DROP_CGROUP_REFCOUNT is never supposed to be used outside the bio
cgroup functions, so it doesn't complicate the interface to other subsystems.

The patch is not bug-prone, because we already must make sure that the
cloned bio has a shorter lifetime than the master bio - so the patch doesn't
introduce any new possibilities for bugs.


> Do the two really look the same to you? The page refs are much more
> expensive, mostly contained in and the main focus of dm. ioc/css refs
> aren't that expensive to begin with, css refcnting is widely scattered

ioc is per-task, so it is likely to be cached (but there are processors
that have slow atomic operations even on cached data - on Pentium 4 it
takes about 100 cycles). But css is shared between tasks and produces the
cache ping-pong effect.

> across the kernel, the association interface is likely to be used by
> any entity issuing IOs asynchronously soonish, and there is much saner
> way to improve it - which would be beneficial not only to block / dm
> but everyone else using it.
>
> Something being done in one place doesn't automatically make it okay
> everywhere else. We can and do use hackery but *with* discretion.
>
> If you still can't understand, I'm not sure what more I can tell you.
>
> --
> tejun

I don't know what's wrong with 4 lines of code to manipulate a flag.

I understand that you don't want to do something complicated and bug-prone
to improve performance. But the patch is neither complicated nor
bug-prone.

Mikulas

2013-04-16 17:24:41

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

Hey,

On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> The patch is not bug-prone, because we already must make sure that the
> cloned bio has shorter lifetime than the master bio - so the patch doesn't
> introduce any new possibilities to make bugs.

The whole world isn't composed of only your code. As I said
repeatedly, you're introducing an API which is misleading and can
easily cause subtle bugs which are very difficult to reproduce.

Imagine it being used to tag a metadata or checksum update bio being
sent down while processing another bio and used to "clone" the context
of the original bio. It'll work most of the time even if the original
bio gets completed first but it'll break when it gets really unlucky -
e.g. racing with other operations which can put the base css ref, and
it'll be hellish to reproduce and everyone would have to pay for your
silly hack.

> > Do the two really look the same to you? The page refs are much more
> > expensive, mostly contained in and the main focus of dm. ioc/css refs
> > aren't that expensive to begin with, css refcnting is widely scattered
>
> ioc is per-task, so it is likely to be cached (but there are processors
> that have slow atomic operations even on cached data - on Pentium 4 it
> takes about 100 cycles). But css is shared between tasks and produces the
> cache ping-pong effect.

For $DEITY's sake, how many times do I have to tell you to use a per-cpu
reference count? Why do I have to repeat the same story over and over
again? What part of "make the reference count per-cpu" don't you get?
It's not a complicated message.

At this point, I can't even understand why or what the hell you're
arguing. There's a clearly better way to do it and you're just
repeating yourself like a broken record that your hack in itself isn't
broken.

So, if you wanna continue that way for whatever reason, you have my
firm nack and I'm outta this thread.

Bye bye.

--
tejun

2013-04-16 19:41:53

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)



On Tue, 16 Apr 2013, Tejun Heo wrote:

> Hey,
>
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the
> > cloned bio has shorter lifetime than the master bio - so the patch doesn't
> > introduce any new possibilities to make bugs.
>
> The whole world isn't composed of only your code. As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
>
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio and used to "clone" the context
> of the original bio. It'll work most of the time even if the original
> bio gets completed first but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref, and
> it'll be hellish to reproduce and everyone would have to pay for your
> silly hack.

That's why the comment at the function says: "it is assumed that the
lifetime of the new bio is shorter than the lifetime of the original bio.
If the new bio can outlive the old bio, the caller must increment the
reference counts." - do you think it is so bad that someone will use
the function without reading the comment?

Anyway, the situation that you describe could only happen in dm-bufio or
dm-kcopyd files, so it's easy to control and increment the reference
counts there. There are no other places in device mapper where we create
bios that live longer than the original one.

Mikulas

2013-04-18 16:48:12

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

On Tue, Apr 16 2013 at 1:24pm -0400,
Tejun Heo <[email protected]> wrote:

> Hey,
>
> On Mon, Apr 15, 2013 at 09:02:06AM -0400, Mikulas Patocka wrote:
> > The patch is not bug-prone, because we already must make sure that the
> > cloned bio has shorter lifetime than the master bio - so the patch doesn't
> > introduce any new possibilities to make bugs.
>
> The whole world isn't composed of only your code. As I said
> repeatedly, you're introducing an API which is misleading and can
> easily cause subtle bugs which are very difficult to reproduce.
>
> Imagine it being used to tag a metadata or checksum update bio being
> sent down while processing another bio and used to "clone" the context
> of the original bio. It'll work most of the time even if the original
> bio gets completed first but it'll break when it gets really unlucky -
> e.g. racing with other operations which can put the base css ref, and
> it'll be hellish to reproduce and everyone would have to pay for your
> silly hack.
>
> > > Do the two really look the same to you? The page refs are much more
> > > expensive, mostly contained in and the main focus of dm. ioc/css refs
> > > aren't that expensive to begin with, css refcnting is widely scattered
> >
> > ioc is per-task, so it is likely to be cached (but there are processors
> > that have slow atomic operations even on cached data - on Pentium 4 it
> > takes about 100 cycles). But css is shared between tasks and produces the
> > cache ping-pong effect.
>
> For $DEITY's sake, how many times do I have to tell you to use a per-cpu
> reference count? Why do I have to repeat the same story over and over
> again? What part of "make the reference count per-cpu" don't you get?
> It's not a complicated message.
>
> At this point, I can't even understand why or what the hell you're
> arguing. There's a clearly better way to do it and you're just
> repeating yourself like a broken record that your hack in itself isn't
> broken.
>
> So, if you wanna continue that way for whatever reason, you have my
> firm nack and I'm outta this thread.
>
> Bye bye.

Hey Tejun,

I see you nack and raise you with: please reconsider in the near term.

Your point about not wanting to introduce a generic block interface that
isn't "safe" for all users is noted. But, as Mikulas has repeatedly said,
DM does _not_ ever need to do the refcounting. So it seems a bit absurd
to require that DM stand up an interface that uses per-cpu refcounting.
That is a fair amount of churn that DM will never have a need to take
advantage of.

So why not introduce __bio_copy_association(bio1, bio2) and add a BUG_ON
in it if bio2 isn't a clone of bio1?

When there is a need for async IO to have more scalable refcounting that
would be the time to introduce bio_copy_association that uses per-cpu
refcounting (and yes we could then even nuke __bio_copy_association).

It just seems a bit burdensome to ask Mikulas to add this
infrastructure when DM doesn't need it at all. I do understand your
desire to steer the kernel toward where it needs to be for future
use-cases, but I think it is generally best to introduce complexity
only when there is an actual need.

Your insights are amazingly helpful and I think it is unfortunate that
this refcounting issue has overshadowed the positive advancements in
dm-crypt scaling. I'm just looking to see if we can carry on with
__bio_copy_association as a temporary intermediate step.

Thanks,
Mike

2013-04-18 17:03:29

by Tejun Heo

Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

Hello, Mike.

On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> I see you nack and raise you with: please reconsider in the near term.

The thing is that percpu-refcnting is already in mostly-ready form, so
unless this dm series is planned to be merged for v3.10-rc1, I don't
see the need for a separate near-term solution. Mikulas can just do
the normal refcnting, and cgroup will most likely adopt per-cpu refs
anyway once percpu-refcnting is in, so it should all fall into place
pretty quickly. For devel / testing / whatever, Mikulas can surely
turn off cgroup, right?

Thanks.

--
tejun

2013-05-22 18:51:13

by Mike Snitzer

Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

On Thu, Apr 18 2013 at 1:03pm -0400,
Tejun Heo <[email protected]> wrote:

> Hello, Mike.
>
> On Thu, Apr 18, 2013 at 12:47:42PM -0400, Mike Snitzer wrote:
> > I see you nack and raise you with: please reconsider in the near term.
>
> The thing is that percpu-refcnting is already in mostly-ready form, so
> unless this dm series is planned to be merged for v3.10-rc1, I don't
> see the need for a separate near-term solution. Mikulas can just do
> the normal refcnting, and cgroup will most likely adopt per-cpu refs
> anyway once percpu-refcnting is in, so it should all fall into place
> pretty quickly. For devel / testing / whatever, Mikulas can surely
> turn off cgroup, right?

Hey Tejun,

Was wondering: how is percpu-refcounting coming along? Do you have a
pointer to the code that can be pulled in for use by Mikulas' dm-crypt
changes?

Would be nice to get this stuff sorted out for the 3.11 merge window.

Mike

2013-05-22 19:48:57

by Tejun Heo

Subject: Re: [PATCH v2] make dm and dm-crypt forward cgroup context (was: dm-crypt parallelization patches)

Hey,

On Wed, May 22, 2013 at 02:50:14PM -0400, Mike Snitzer wrote:
> Was wondering: how is percpu-refcounting coming along? Do you have a
> pointer to the code that can be pulled in for use by Mikulas' dm-crypt
> changes?
>
> Would be nice to get this stuff sorted out for the 3.11 merge window.

Still in progress. Waiting for Kent's next round and yeah I do hope
so too as I really wanna convert css refcnting to it regardless of
dm-crypt work. The thread is at

https://patchwork.kernel.org/patch/2562111/

Thanks.

--
tejun