From: Jeffrey Hugo
To: paulmck@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
    pprakash@codeaurora.org, Josh Triplett, Steven Rostedt,
    Mathieu Desnoyers, Lai Jiangshan, Jens Axboe,
    Sebastian Andrzej Siewior, Thomas Gleixner, Richard Cochran,
    Boris Ostrovsky, Richard Weinberger
Subject: Re: [BUG] Deadlock due due to interactions of block, RCU, and cpu offline
Date: Thu, 29 Jun 2017 10:29:12 -0600
In-Reply-To: <20170628001130.GB3721@linux.vnet.ibm.com>

On 6/27/2017 6:11 PM, Paul E. McKenney wrote:
> On Tue, Jun 27, 2017 at 04:32:09PM -0600, Jeffrey Hugo wrote:
>> On 6/22/2017 9:34 PM, Paul E. McKenney wrote:
>>> On Wed, Jun 21, 2017 at 09:18:53AM -0700, Paul E.
McKenney wrote:
>>>> No worries, and I am very much looking forward to seeing the results of
>>>> your testing.
>>>
>>> And please see below for an updated patch based on LKML review and
>>> more intensive testing.
>>>
>>
>> I spent some time on this today. It didn't go as I expected. I
>> validated the issue is reproducible as before on 4.11 and 4.12 rcs 1
>> through 4. However, the version of stress-ng that I was using ran
>> into constant errors starting with rc5, making it nearly impossible
>> to make progress toward reproduction. Upgrading stress-ng to tip
>> fixes the errors; however, I've still been unable to repro the issue.
>>
>> It's my unfounded suspicion that something went in between rc4 and
>> rc5 which changed the timing, and didn't actually fix the issue. I
>> will run the test overnight for 5 hours to try to repro.
>>
>> The patch you sent appears to be based on linux-next, and appears to
>> have a number of dependencies which prevent it from cleanly applying
>> on anything current that I'm able to repro on at this time. Do you
>> want to provide a rebased version of the patch which applies to say
>> 4.11? I could easily test that and report back.
>
> Here is a very lightly tested backport to v4.11.
>

Works for me. I always reproduced the lockup within 2 minutes on stock
4.11. With the change applied, I was able to test for 2 hours under the
same conditions, and 4 hours with the full system, without encountering
an issue.

Feel free to add:
Tested-by: Jeffrey Hugo

I'm going to go back to 4.12-rc5 and see if I can either repro the
issue or identify what changed. Hopefully I can get to linux-next and
double check the original version of the change as well.

-- 
Jeffrey Hugo
Qualcomm Datacenter Technologies as an affiliate of Qualcomm
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a
Linux Foundation Collaborative Project.