Date: Tue, 14 May 2019 13:29:38 -0400
From: Mike Snitzer
To: Doug Anderson
Cc: Tim Murray, Guenter Roeck, Enric Balletbo i Serra, Vito Caputo, LKML,
	dm-devel@redhat.com, Tejun Heo
Subject: Re: Problems caused by dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
Message-ID: <20190514172938.GA31835@redhat.com>
References: <20190513171519.GA26166@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, May 14 2019 at 12:47pm -0400,
Doug Anderson wrote:

> Hi,
>
> On Mon, May 13, 2019 at 10:15 AM Mike Snitzer wrote:
> >
> > On Mon, May 13 2019 at 12:18pm -0400,
> > Doug Anderson wrote:
> >
> > > Hi,
> > >
> > > I wanted to jump on the bandwagon of people reporting problems with
> > > commit a1b89132dc4f ("dm crypt: use WQ_HIGHPRI for the IO and crypt
> > > workqueues").
> > >
> > > Specifically I've been tracking down communication errors when talking
> > > to our Embedded Controller (EC) over SPI. I found that communication
> > > errors happened _much_ more frequently on newer kernels than older
> > > ones. Using ftrace I managed to track the problem down to the dm
> > > crypt patch. ...and, indeed, reverting that patch gets rid of the
> > > vast majority of my errors.
> > >
> > > If you want to see the ftrace of my high priority worker getting
> > > blocked for 7.5 ms, you can see:
> > >
> > > https://bugs.chromium.org/p/chromium/issues/attachmentText?aid=392715
> > >
> > >
> > > In my case I'm looking at solving my problems by bumping the CrOS EC
> > > transfers fully up to real time priority. ...but given that there are
> > > other reports of problems with the dm-crypt priority (notably I found
> > > https://bugzilla.kernel.org/show_bug.cgi?id=199857) maybe we should
> > > also come up with a different solution for dm-crypt?
> > >
> >
> > Any chance you can test how behaviour changes if you remove
> > WQ_CPU_INTENSIVE?
> > e.g.:
> >
> > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > index 692cddf3fe2a..c97d5d807311 100644
> > --- a/drivers/md/dm-crypt.c
> > +++ b/drivers/md/dm-crypt.c
> > @@ -2827,8 +2827,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> >
> >  	ret = -ENOMEM;
> >  	cc->io_queue = alloc_workqueue("kcryptd_io/%s",
> > -				       WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> > -				       1, devname);
> > +				       WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
> >  	if (!cc->io_queue) {
> >  		ti->error = "Couldn't create kcryptd io queue";
> >  		goto bad;
> > @@ -2836,11 +2835,10 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
> >
> >  	if (test_bit(DM_CRYPT_SAME_CPU, &cc->flags))
> >  		cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> > -						  WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM,
> > -						  1, devname);
> > +						  WQ_HIGHPRI | WQ_MEM_RECLAIM, 1, devname);
> >  	else
> >  		cc->crypt_queue = alloc_workqueue("kcryptd/%s",
> > -						  WQ_HIGHPRI | WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_UNBOUND,
> > +						  WQ_HIGHPRI | WQ_MEM_RECLAIM | WQ_UNBOUND,
> >  						  num_online_cpus(), devname);
> >  	if (!cc->crypt_queue) {
> >  		ti->error = "Couldn't create kcryptd queue";
>
> It's not totally trivially easy for me to test. My previous failure
> cases were leaving a few devices "idle" over a long period of time. I
> did that on 3 machines last night and didn't see any failures. Thus
> removing "WQ_CPU_INTENSIVE" may have made things better. Before I say
> for sure I'd want to test for longer / redo the test a few times,
> since I've seen the problem go away on its own before (just by
> timing/luck) and then re-appear.

What you shared below seems to indicate that removing WQ_CPU_INTENSIVE
didn't work.

> Do you have a theory about why removing WQ_CPU_INTENSIVE would help?

Reading this comment is what made me think to ask:
https://bugzilla.kernel.org/show_bug.cgi?id=199857#c4

> NOTE: in trying to reproduce problems more quickly I actually came up
> with a better test case for the problem I was seeing. I found that I
> can reproduce my own problems much better with this test:
>
>   dd if=/dev/zero of=/var/log/foo.txt bs=4M count=512&
>   while true; do
>     ectool version > /dev/null;
>   done
>
> It should be noted that "/var" is on encrypted stateful on my system
> so throwing data at it stresses dm-crypt. It should also be noted
> that somehow "/var" also ends up traversing through a loopback device
> (this becomes relevant below):
>
>
> With the above test:
>
> 1. With a mainline kernel that has commit 37a186225a0c
> ("platform/chrome: cros_ec_spi: Transfer messages at high priority"):
> I see failures.
>
> 2. With a mainline kernel that has commit 37a186225a0c plus removing
> WQ_CPU_INTENSIVE in dm-crypt: I still see failures.
>
> 3. With a mainline kernel that has commit 37a186225a0c plus removing
> high priority (but keeping CPU intensive) in dm-crypt: I still see
> failures.
>
> 4. With a mainline kernel that has commit 37a186225a0c plus removing
> high priority (but keeping CPU intensive) in dm-crypt plus removing
> set_user_nice() in loop_prepare_queue(): I get a pass!
>
> 5. With a mainline kernel that has commit 37a186225a0c plus removing
> set_user_nice() in loop_prepare_queue() plus leaving dm-crypt alone: I
> see failures.
>
> 6. With a mainline kernel that has commit 37a186225a0c plus removing
> set_user_nice() in loop_prepare_queue() plus removing WQ_CPU_INTENSIVE
> in dm-crypt: I still see failures
>
> 7. With my new "cros_ec at realtime" series and no other patches, I get a pass!
>
>
> tl;dr: High priority (even without CPU_INTENSIVE) definitely causes
> interference with my high priority work starving it for > 8 ms, but
> dm-crypt isn't unique here--loopback devices also have problems.
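
For anyone else wanting to try the quoted test case, here is a minimal
sketch of the same dd-plus-ectool loop, bounded so that it terminates
and prints a failure count. The count_failures helper, the RUN_REPRO
and ITERATIONS variables, and treating a non-zero ectool exit status as
a failed EC transfer are assumptions for illustration, not something
taken from the report above.

```shell
#!/bin/sh
# Sketch of the quoted reproducer with a bounded loop and a failure count.
# Assumptions (not from the thread): the helper name count_failures, the
# RUN_REPRO/ITERATIONS variables, and "non-zero exit status == failure".

# Run a command N times and print how many of the runs exited non-zero.
count_failures() {
    iterations=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$iterations" ]; do
        "$@" > /dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Only start the stress test when explicitly asked, so that sourcing this
# file just to get the helper has no side effects.
if [ "${RUN_REPRO:-0}" = 1 ]; then
    # Background write load on the encrypted stateful partition, as quoted.
    dd if=/dev/zero of=/var/log/foo.txt bs=4M count=512 &
    dd_pid=$!

    # Poll the EC while the write load is running and report the failures.
    count_failures "${ITERATIONS:-1000}" ectool version

    # Stop the writer if it is still going and remove the scratch file.
    kill "$dd_pid" 2>/dev/null
    wait "$dd_pid" 2>/dev/null
    rm -f /var/log/foo.txt
fi
```

Unlike the quoted "while true" loop this stops after ITERATIONS polls,
so the dd may finish before or after the polling does; pick the
iteration count to cover the window you care about.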

Well I read it all ;)

I don't have a commit 37a186225a0c, the original commit in question is
a1b89132dc4 right?

But I think we need a deeper understanding from workqueue maintainers on
what the right way forward is here. I cc'd Tejun in my previous reply
but IIRC he no longer looks after the workqueue code.

I think it'd be good for you to work with the original author of commit
a1b89132dc4 (Tim, on cc) to see if you can reach consensus on what works
for both of your requirements.

Given 7 above, if your new "cros_ec at realtime" series fixes it.. ship
it?

Mike