Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc
From: Brian King
Date: Fri, 18 Aug 2017 16:57:59 -0500
To: Bart Van Assche, linuxppc-dev@lists.ozlabs.org, abdhalee@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org, hch@lst.de, linux-scsi@vger.kernel.org,
    sfr@canb.auug.org.au, sachinp@linux.vnet.ibm.com, linux-next@vger.kernel.org,
    hare@suse.com, mpe@ellerman.id.au
Message-Id: <0f7e2114-eba1-f149-ea80-d32d8b6d212a@linux.vnet.ibm.com>
In-Reply-To: <1503092473.2622.17.camel@wdc.com>
References: <1502902815.3305.22.camel@abdul.in.ibm.com> <1502904072.2421.3.camel@wdc.com>
    <2f686064-3e32-df8d-134f-962b5181da9d@linux.vnet.ibm.com> <1502985161.2615.8.camel@wdc.com>
    <71fb9c1b-9f3f-acdc-8bb5-aa1240aea763@linux.vnet.ibm.com> <1503092473.2622.17.camel@wdc.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 08/18/2017 04:41 PM, Bart Van Assche wrote:
> On Fri, 2017-08-18 at 16:04 -0500, Brian King wrote:
>> I think I have an understanding of what is going on and why Bart's patch is causing problems for ipr.
>> I can work around the boot hang in ipr, but ultimately I think we need to figure out a fix
>> in scsi / block. I added some tracing and confirmed it's not a matter of commands getting stuck
>> in ipr. The issue is we are retrying failed commands until we finally run out of time. This is
>> what I see:
>>
>> 1. sd_revalidate_disk calls scsi_report_opcode
>> 2. ipr RAID arrays don't support MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES
>> 3. ipr returns the command with DID_ERROR
>> 4. scsi_decide_disposition goes to maybe_retry, increments scmd->retries, and returns NEEDS_RETRY
>> 5. scsi_softirq_done calls scsi_queue_insert to requeue the command, which calls scsi_mq_requeue_cmd
>> 6. With Bart's change, we then clear RQF_DONTPREP in this path, while prior we did not
>> 7. This results in the command getting scmd->retries zeroed out when it gets re-queued,
>>    since we go through prep again and we lose our retry counter, resulting in lots and lots of retries.
>> 8. Since the default command timeout for an ipr RAID array is 120 seconds, these retries go on for
>>    quite a long time...
>> 9. Finally, the command has been retried so long we trip over the overall retry timer
>>    in scsi_softirq_done and we time out the command.
>>
>> I'll follow up with a patch to ipr to work around the hang, but I think we need to somehow preserve
>> the retry counter in the scsi command, as this will likely cause issues with other drivers.
>
> Hello Brian,
>
> Thanks for the detailed analysis. This is very helpful. Have you considered
> changing the ipr driver such that it terminates REPORT SUPPORTED OPERATION
> CODES commands with the appropriate check condition code instead of DID_ERROR?

Yes. That data is actually in the sense buffer, but since I'm also setting DID_ERROR,
scsi_decide_disposition isn't using it. I've got a patch to do just as you suggest,
to stop setting DID_ERROR when there is more detailed error data available, but it
will need some additional testing before I submit it, as it will impact much more than
just this case.

To add to my analysis above, #9 should not be there... It looks like jiffies_at_alloc
would also be getting reinitialized in this case, resulting in a perpetual retry, which
is what I was seeing.

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
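
The retry-counter reset described in steps 4 to 7 of the quoted analysis can be illustrated with a small toy model. This is only a sketch in Python, not kernel code: `ToyCommand`, `prep`, `dispatch_until_done`, `ALLOWED_RETRIES`, and `MAX_ATTEMPTS` are hypothetical names invented for the illustration, the retry limit of 5 is an arbitrary assumption, and the `jiffies_at_alloc` timer is not modeled at all. The only behavior it tries to capture is that sending a requeued command back through prep zeroes its retry counter, so the limit is never reached.

```python
from dataclasses import dataclass

ALLOWED_RETRIES = 5   # hypothetical per-command retry limit for this toy model
MAX_ATTEMPTS = 1000   # safety cap so the simulation itself always terminates


@dataclass
class ToyCommand:
    """Stand-in for a SCSI command; only the state discussed in the thread."""
    retries: int = 0          # models scmd->retries
    needs_prep: bool = True   # True models RQF_DONTPREP being clear


def prep(cmd):
    """Model of the prep step: (re)initializes per-command state."""
    cmd.retries = 0
    cmd.needs_prep = False


def dispatch_until_done(clear_dontprep_on_requeue):
    """Return how many times a command that always fails gets dispatched.

    Returns MAX_ATTEMPTS if the command never runs out of retries.
    """
    cmd = ToyCommand()
    for attempt in range(1, MAX_ATTEMPTS + 1):
        if cmd.needs_prep:
            prep(cmd)
        # The device always fails the command (models the DID_ERROR completion).
        cmd.retries += 1
        if cmd.retries > ALLOWED_RETRIES:
            return attempt  # retries exhausted, stop requeueing
        # Requeue; optionally force the command back through prep.
        if clear_dontprep_on_requeue:
            cmd.needs_prep = True
    return MAX_ATTEMPTS


if __name__ == "__main__":
    print(dispatch_until_done(False))  # prints 6
    print(dispatch_until_done(True))   # prints 1000
```

With the counter preserved across requeues, the always-failing command is given up on after `ALLOWED_RETRIES + 1` dispatches. When every requeue goes back through prep, the counter never exceeds 1 and only the artificial `MAX_ATTEMPTS` cap ends the loop, which corresponds to the perpetual retry reported above.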