Hi,
I wanted to jump on the bandwagon of people reporting problems with
commit a1b89132dc4f ("dm crypt: use WQ_HIGHPRI for the IO and crypt
workqueues").
Specifically I've been tracking down communication errors when talking
to our Embedded Controller (EC) over SPI. I found that communication
errors happened _much_ more frequently on newer kernels than older
ones. Using ftrace I managed to track the problem down to the dm
crypt patch. ...and, indeed, reverting that patch gets rid of the
vast majority of my errors.
If you want to see the ftrace of my high priority worker getting
blocked for 7.5 ms, you can see:
https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=392715
In my case I'm looking at solving my problems by bumping the CrOS EC
transfers fully up to real time priority. ...but given that there are
other reports of problems with the dm-crypt priority (notably I found
https://bugzilla.kernel.org/show_bug.cgi?id=199857) maybe we should
also come up with a different solution for dm-crypt?
Thanks!
-Doug
On Mon, May 13 2019 at 12:18pm -0400,
Doug Anderson <[email protected]> wrote:
> Hi,
>
> I wanted to jump on the bandwagon of people reporting problems with
> commit a1b89132dc4f ("dm crypt: use WQ_HIGHPRI for the IO and crypt
> workqueues").
>
> Specifically I've been tracking down communication errors when talking
> to our Embedded Controller (EC) over SPI. I found that communication
> errors happened _much_ more frequently on newer kernels than older
> ones. Using ftrace I managed to track the problem down to the dm
> crypt patch. ...and, indeed, reverting that patch gets rid of the
> vast majority of my errors.
>
> If you want to see the ftrace of my high priority worker getting
> blocked for 7.5 ms, you can see:
>
> https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=392715
>
>
> In my case I'm looking at solving my problems by bumping the CrOS EC
> transfers fully up to real time priority. ...but given that there are
> other reports of problems with the dm-crypt priority (notably I found
> https://bugzilla.kernel.org/show_bug.cgi?id=199857) maybe we should
> also come up with a different solution for dm-crypt?
>
Any chance you can test how behaviour changes if you remove
WQ_CPU_INTENSIVE? e.g.:
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 692cddf3fe2a..c97d5d807311 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2827,8 +2827,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 
         ret = -ENOMEM;
         cc->io_queue = alloc_workqueue("kcryptd_io/%s",
-                       WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
-                       1, devname);
+                       WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
         if (!cc->io_queue) {
                 ti->error = "Couldn't create kcryptd io queue";
                 goto bad;
@@ -2836,11 +2835,10 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 
         if (test_bit(DM_CRYPT_SAME_CPU, &cc->flags))
                 cc->crypt_queue = alloc_workqueue("kcryptd/%s",
-                               WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
-                               1, devname);
+                               WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
         else
                 cc->crypt_queue = alloc_workqueue("kcryptd/%s",
-                               WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
+                               WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND,
                                num_online_cpus(), devname);
         if (!cc->crypt_queue) {
                 ti->error = "Couldn't create kcryptd queue";
Hi,
On Mon, May 13, 2019 at 10:15 AM Mike Snitzer <[email protected]> wrote:
> On Mon, May 13 2019 at 12:18pm -0400,
> Doug Anderson <[email protected]> wrote:
>
> > Hi,
> >
> > I wanted to jump on the bandwagon of people reporting problems with
> > commit a1b89132dc4f ("dm crypt: use WQ_HIGHPRI for the IO and crypt
> > workqueues").
> >
> > Specifically I've been tracking down communication errors when talking
> > to our Embedded Controller (EC) over SPI. I found that communication
> > errors happened _much_ more frequently on newer kernels than older
> > ones. Using ftrace I managed to track the problem down to the dm
> > crypt patch. ...and, indeed, reverting that patch gets rid of the
> > vast majority of my errors.
> >
> > If you want to see the ftrace of my high priority worker getting
> > blocked for 7.5 ms, you can see:
> >
> > https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=392715
> >
> >
> > In my case I'm looking at solving my problems by bumping the CrOS EC
> > transfers fully up to real time priority. ...but given that there are
> > other reports of problems with the dm-crypt priority (notably I found
> > https://bugzilla.kernel.org/show_bug.cgi?id=199857) maybe we should
> > also come up with a different solution for dm-crypt?
> >
>
> Any chance you can test how behaviour changes if you remove
> WQ_CPU_INTENSIVE? e.g.:
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 692cddf3fe2a..c97d5d807311 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -2827,8 +2827,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>
> ret = -ENOMEM;
> cc->io_queue = alloc_workqueue("kcryptd_io/%s",
> - WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> - 1, devname);
> + WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
> if (!cc->io_queue) {
> ti->error = "Couldn't create kcryptd io queue";
> goto bad;
> @@ -2836,11 +2835,10 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
>
> if (test_bit(DM_CRYPT_SAME_CPU, &cc->flags))
> cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> - WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> - 1, devname);
> + WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
> else
> cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> - WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
> + WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND,
> num_online_cpus(), devname);
> if (!cc->crypt_queue) {
> ti->error = "Couldn't create kcryptd queue";
It's not trivially easy for me to test. My previous failure
cases were leaving a few devices "idle" over a long period of time. I
did that on 3 machines last night and didn't see any failures. Thus
removing "WQ_CPU_INTENSIVE" may have made things better. Before I say
for sure I'd want to test for longer / redo the test a few times,
since I've seen the problem go away on its own before (just by
timing/luck) and then re-appear.
Do you have a theory about why removing WQ_CPU_INTENSIVE would help?
---
NOTE: in trying to reproduce problems more quickly I actually came up
with a better test case for the problem I was seeing. I found that I
can reproduce my own problems much better with this test:
dd if=/dev/zero of=/var/log/foo.txt bs=4M count=512&
while true; do
ectool version > /dev/null;
done
It should be noted that "/var" is on the encrypted stateful partition on
my system, so throwing data at it stresses dm-crypt. It should also be
noted that somehow "/var" also ends up traversing a loopback device
(this becomes relevant below).
With the above test:
1. With a mainline kernel that has commit 37a186225a0c
("platform/chrome: cros_ec_spi: Transfer messages at high priority"):
I see failures.
2. With a mainline kernel that has commit 37a186225a0c plus removing
WQ_CPU_INTENSIVE in dm-crypt: I still see failures.
3. With a mainline kernel that has commit 37a186225a0c plus removing
high priority (but keeping CPU intensive) in dm-crypt: I still see
failures.
4. With a mainline kernel that has commit 37a186225a0c plus removing
high priority (but keeping CPU intensive) in dm-crypt plus removing
set_user_nice() in loop_prepare_queue(): I get a pass!
5. With a mainline kernel that has commit 37a186225a0c plus removing
set_user_nice() in loop_prepare_queue() plus leaving dm-crypt alone: I
see failures.
6. With a mainline kernel that has commit 37a186225a0c plus removing
set_user_nice() in loop_prepare_queue() plus removing WQ_CPU_INTENSIVE
in dm-crypt: I still see failures.
7. With my new "cros_ec at realtime" series and no other patches, I get a pass!
tl;dr: High priority (even without CPU_INTENSIVE) definitely interferes
with my high priority work, starving it for > 8 ms, but dm-crypt isn't
unique here--loopback devices also have problems.
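
(For reference, the loop driver change used in cases 4-6 above is nothing
fancy: it is just dropping the nice bump on the loop worker thread. A rough
sketch against the drivers/block/loop.c of that era, from memory -- hunk
offsets omitted and context lines approximate:)

--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ ... @@ static int loop_prepare_queue(struct loop_device *lo)
         kthread_init_worker(&lo->worker);
         lo->worker_task = kthread_run(loop_kthread_worker_fn,
                         &lo->worker, "loop%d", lo->lo_number);
         if (IS_ERR(lo->worker_task))
                 return -ENOMEM;
-        set_user_nice(lo->worker_task, MIN_NICE);
         return 0;
 }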
-Doug
On Tue, May 14 2019 at 12:47pm -0400,
Doug Anderson <[email protected]> wrote:
> Hi,
>
> On Mon, May 13, 2019 at 10:15 AM Mike Snitzer <[email protected]> wrote:
>
> > On Mon, May 13 2019 at 12:18pm -0400,
> > Doug Anderson <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I wanted to jump on the bandwagon of people reporting problems with
> > > commit a1b89132dc4f ("dm crypt: use WQ_HIGHPRI for the IO and crypt
> > > workqueues").
> > >
> > > Specifically I've been tracking down communication errors when talking
> > > to our Embedded Controller (EC) over SPI. I found that communication
> > > errors happened _much_ more frequently on newer kernels than older
> > > ones. Using ftrace I managed to track the problem down to the dm
> > > crypt patch. ...and, indeed, reverting that patch gets rid of the
> > > vast majority of my errors.
> > >
> > > If you want to see the ftrace of my high priority worker getting
> > > blocked for 7.5 ms, you can see:
> > >
> > > https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=392715
> > >
> > >
> > > In my case I'm looking at solving my problems by bumping the CrOS EC
> > > transfers fully up to real time priority. ...but given that there are
> > > other reports of problems with the dm-crypt priority (notably I found
> > > https://bugzilla.kernel.org/show_bug.cgi?id=199857) maybe we should
> > > also come up with a different solution for dm-crypt?
> > >
> >
> > Any chance you can test how behaviour changes if you remove
> > WQ_CPU_INTENSIVE? e.g.:
> >
> > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > index 692cddf3fe2a..c97d5d807311 100644
> > --- a/drivers/md/dm-crypt.c
> > +++ b/drivers/md/dm-crypt.c
> > @@ -2827,8 +2827,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> >
> > ret = -ENOMEM;
> > cc->io_queue = alloc_workqueue("kcryptd_io/%s",
> > - WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> > - 1, devname);
> > + WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
> > if (!cc->io_queue) {
> > ti->error = "Couldn't create kcryptd io queue";
> > goto bad;
> > @@ -2836,11 +2835,10 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> >
> > if (test_bit(DM_CRYPT_SAME_CPU, &cc->flags))
> > cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> > - WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> > - 1, devname);
> > + WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
> > else
> > cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> > - WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
> > + WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND,
> > num_online_cpus(), devname);
> > if (!cc->crypt_queue) {
> > ti->error = "Couldn't create kcryptd queue";
>
> It's not totally trivially easy for me to test. My previous failure
> cases were leaving a few devices "idle" over a long period of time. I
> did that on 3 machines last night and didn't see any failures. Thus
> removing "WQ_CPU_INTENSIVE" may have made things better. Before I say
> for sure I'd want to test for longer / redo the test a few times,
> since I've seen the problem go away on its own before (just by
> timing/luck) and then re-appear.
What you shared below seems to indicate that removing WQ_CPU_INTENSIVE
didn't work.
> Do you have a theory about why removing WQ_CPU_INTENSIVE would help?
Reading this comment is what made me think to ask:
https://bugzilla.kernel.org/show_bug.cgi?id=199857#c4
> NOTE: in trying to reproduce problems more quickly I actually came up
> with a better test case for the problem I was seeing. I found that I
> can reproduce my own problems much better with this test:
>
> dd if=/dev/zero of=/var/log/foo.txt bs=4M count=512&
> while true; do
> ectool version > /dev/null;
> done
>
> It should be noted that "/var" is on encrypted stateful on my system
> so throwing data at it stresses dm-crypt. It should also be noted
> that somehow "/var" also ends up traversing through a loopback device
> (this becomes relevant below):
>
>
> With the above test:
>
> 1. With a mainline kernel that has commit 37a186225a0c
> ("platform/chrome: cros_ec_spi: Transfer messages at high priority"):
> I see failures.
>
> 2. With a mainline kernel that has commit 37a186225a0c plus removing
> WQ_CPU_INTENSIVE in dm-crypt: I still see failures.
>
> 3. With a mainline kernel that has commit 37a186225a0c plus removing
> high priority (but keeping CPU intensive) in dm-crypt: I still see
> failures.
>
> 4. With a mainline kernel that has commit 37a186225a0c plus removing
> high priority (but keeping CPU intensive) in dm-crypt plus removing
> set_user_nice() in loop_prepare_queue(): I get a pass!
>
> 5. With a mainline kernel that has commit 37a186225a0c plus removing
> set_user_nice() in loop_prepare_queue() plus leaving dm-crypt alone: I
> see failures.
>
> 6. With a mainline kernel that has commit 37a186225a0c plus removing
> set_user_nice() in loop_prepare_queue() plus removing WQ_CPU_INTENSIVE
> in dm-crypt: I still see failures
>
> 7. With my new "cros_ec at realtime" series and no other patches, I get a pass!
>
>
> tl;dr: High priority (even without CPU_INTENSIVE) definitely causes
> interference with my high priority work starving it for > 8 ms, but
> dm-crypt isn't unique here--loopback devices also have problems.
Well I read it all ;)
I don't have a commit 37a186225a0c; the original commit in question is
a1b89132dc4, right?
But I think we need a deeper understanding from workqueue maintainers on
what the right way forward is here. I cc'd Tejun in my previous reply
but IIRC he no longer looks after the workqueue code.
I think it'd be good for you to work with the original author of commit
a1b89132dc4 (Tim, on cc) to see if you can reach consensus on what works
for both of your requirements.
Given 7 above, if your new "cros_ec at realtime" series fixes it... ship
it?
Mike
Hi,
On Tue, May 14, 2019 at 10:29 AM Mike Snitzer <[email protected]> wrote:
> > tl;dr: High priority (even without CPU_INTENSIVE) definitely causes
> > interference with my high priority work starving it for > 8 ms, but
> > dm-crypt isn't unique here--loopback devices also have problems.
>
> Well I read it all ;)
>
> I don't have a commit 37a186225a0c; the original commit in question is
> a1b89132dc4, right?
commit 37a186225a0c ("platform/chrome: cros_ec_spi: Transfer messages
at high priority") is only really relevant to my particular test case
of using cros_ec to reproduce the problem.
> But I think we need a deeper understanding from workqueue maintainers on
> what the right way forward is here. I cc'd Tejun in my previous reply
> but IIRC he no longer looks after the workqueue code.
>
> I think it'd be good for you to work with the original author of commit
> a1b89132dc4 (Tim, on cc) to see if you can reach consensus on what works
> for both of your requirements.
Basically what I decided in the end was that the workqueue code didn't
offer enough flexibility in terms of priorities. To get realtime
priority I needed to fall back to using kthread_create_worker() to
create my worker. Presumably if you want something nicer than the
"min_nice" priority you get with the high priority workqueue flag, then
you'd have to do something similar (but moving in the opposite
direction).
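
To give that a concrete shape, here's a rough sketch of the approach
(made-up names like my_dev and my_xfer_work_fn, not the actual cros_ec_spi
code; the priority value is arbitrary):

/*
 * Sketch only: create a dedicated kthread worker for the time-critical
 * transfers and raise it to realtime (SCHED_FIFO) so that niced-up
 * workqueues and the loop worker can't starve it.
 */
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <uapi/linux/sched/types.h>

struct my_dev {
        struct kthread_worker *xfer_worker;
        struct kthread_work xfer_work;
};

static void my_xfer_work_fn(struct kthread_work *work)
{
        /* Do the time-critical SPI transfer here. */
}

static int my_init_xfer_worker(struct my_dev *dev)
{
        struct sched_param param = { .sched_priority = MAX_RT_PRIO / 2 };

        dev->xfer_worker = kthread_create_worker(0, "my_xfer");
        if (IS_ERR(dev->xfer_worker))
                return PTR_ERR(dev->xfer_worker);

        /* Realtime beats any nice level a workqueue flag can hand out. */
        sched_setscheduler_nocheck(dev->xfer_worker->task, SCHED_FIFO, &param);

        kthread_init_work(&dev->xfer_work, my_xfer_work_fn);
        return 0;
}

Work then gets queued with kthread_queue_work(dev->xfer_worker,
&dev->xfer_work) and flushed with kthread_flush_work() as usual; the win
over a workqueue is that you can give the backing thread whatever
scheduling policy and priority you want.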
> Given 7 above, if your new "cros_ec at realtime" series fixes it.. ship
> it?
Yeah, that's the plan. Right now I have
<https://lkml.kernel.org/r/[email protected]>
but Guenter pointed out some embarrassing bugs in my patch, so I'll
post a v4 tomorrow. I pointed to a Chrome OS review that has a
preview of my v4 if for some reason you can't wait. That can be found
at <https://crrev.com/c/1612165>.
-Doug