Date: Fri, 01 May 2020 09:19:25 +0800
From: Can Guo
To: Bart Van Assche
Cc: asutoshd@codeaurora.org, nguyenb@codeaurora.org, hongwus@codeaurora.org,
 rnayak@codeaurora.org, stanley.chu@mediatek.com, alim.akhtar@samsung.com,
 beanhuo@micron.com, Avri.Altman@wdc.com, bjorn.andersson@linaro.org,
 linux-scsi@vger.kernel.org, kernel-team@android.com, saravanak@google.com,
 salyzyn@google.com, "James E.J. Bottomley", "Martin K. Petersen", open list
Subject: Re: [PATCH v3 1/1] scsi: pm: Balance pm_only counter of request queue during system resume
References: <1588219805-25794-1-git-send-email-cang@codeaurora.org>
 <9e15123e-4315-15cd-3d23-2df6144bd376@acm.org>
 <1ef85ee212bee679f7b2927cbbc79cba@codeaurora.org>
Message-ID: <0d9a1e88b0477e8a04b091b9532923f5@codeaurora.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2020-05-01 04:32, Bart Van Assche wrote:
> On 2020-04-29 22:40, Can Guo wrote:
>> On 2020-04-30 13:08, Bart Van Assche wrote:
>>> On 2020-04-29 21:10, Can Guo wrote:
>>>> During system resume, scsi_resume_device() decreases a request queue's
>>>> pm_only counter if the scsi device was quiesced before. But after that,
>>>> if the scsi device's RPM status is RPM_SUSPENDED, the pm_only counter is
>>>> still held (non-zero). Current scsi resume hook only sets the RPM status
>>>> of the scsi device and its request queue to RPM_ACTIVE, but leaves the
>>>> pm_only counter unchanged. This may make the request queue's pm_only
>>>> counter remain non-zero after resume hook returns, hence those who are
>>>> waiting on the mq_freeze_wq would never be woken up. Fix this by calling
>>>> blk_post_runtime_resume() if pm_only is non-zero to balance the pm_only
>>>> counter which is held by the scsi device's RPM ops.
>>>
>>> How was this issue discovered? How has this patch been tested?
>>
>> As the issue was found after system resumes, so the issue was discovered
>> during system suspend/resume test, and it is very easy to be replicated.
>> After system resumes, if this issue hits some scsi devices, all bios sent
>> to their request queues are blocked, which may cause a system hang if the
>> scsi devices are vital to system functionality.
>>
>> To make sure the patch work well, we have tested system suspend/resume
>> and made sure no system hang happen due to request queues got blocked
>> by imbalanced pm_only counter.
>
> Thanks, that's very interesting information. My concern with this patch
> is that the power management code is not the only caller of
> blk_set_pm_only() / blk_clear_pm_only(). E.g. the SCSI SPI code also
> calls scsi_device_quiesce() and scsi_device_resume(). These last
> functions call blk_set_pm_only() and blk_clear_pm_only(). More calls of
> scsi_device_quiesce() and scsi_device_resume() might be added in the
> future.
>
> Has it been considered to test directly whether a SCSI device has been
> runtime suspended instead of relying on blk_queue_pm_only()? How about
> using pm_runtime_status_suspended() or adding a function in
> block/blk-pm.h that checks whether q->rpm_status == RPM_SUSPENDED?
>
> Thanks,
>
> Bart.

Hi Bart,

Please let me address your concern.

First of all, scsi_device_quiesce() may be called multiple times, but it
can increase the pm_only counter of an sdev's request queue only once. If
the sdev has already been quiesced, scsi_device_set_state(sdev,
SDEV_QUIESCE) in scsi_device_quiesce() returns -EINVAL (illegal state
transition), and blk_clear_pm_only() is then called to undo the increment.
So no matter how many times scsi_device_quiesce() is called, it increases
pm_only only once. scsi_device_resume() is the same: it calls
blk_clear_pm_only() only once, and only if the sdev was quiesced.

So, in a word, after scsi_device_resume() returns in
scsi_dev_type_resume(), the pm_only counter should be 1 (if the sdev's
runtime power status is RPM_SUSPENDED) or 0 (if the sdev's runtime power
status is RPM_ACTIVE).

> Has it been considered to test directly whether a SCSI device has been
> runtime suspended instead of relying on blk_queue_pm_only()? How about
> using pm_runtime_status_suspended() or adding a function in
> block/blk-pm.h that checks whether q->rpm_status == RPM_SUSPENDED?

Yes, I originally wrote the patch that way, and it also worked well. The
two approaches are actually equivalent. I prefer the current code because
it lets us be confident that pm_only is 0 after scsi_dev_type_resume()
returns. Different reviewers may have different opinions, but either way
works well.

Thanks,

Can Guo.
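
The counting argument in this reply can be illustrated with a small
userspace C model. It is only a sketch, not kernel code: struct model_sdev
and every model_* function are hypothetical stand-ins, and it assumes that
runtime suspend holds exactly one pm_only reference, as the patch
description states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-ins for kernel state; none of this is the real API. */
enum rpm_status { RPM_ACTIVE, RPM_SUSPENDED };
enum sdev_state { SDEV_RUNNING, SDEV_QUIESCE };

struct model_sdev {
	int pm_only;           /* models the request queue's pm_only counter */
	enum rpm_status rpm;   /* models the runtime PM status */
	enum sdev_state state; /* models the sdev state */
	bool quiesced;         /* true once a quiesce call has succeeded */
};

/* Models blk_set_pm_only(): take one reference. */
static void model_set_pm_only(struct model_sdev *s)
{
	s->pm_only++;
}

/* Models blk_clear_pm_only(): drop one reference. */
static void model_clear_pm_only(struct model_sdev *s)
{
	assert(s->pm_only > 0);
	s->pm_only--;
}

/* Assumption: runtime suspend holds one pm_only reference. */
static void model_runtime_suspend(struct model_sdev *s)
{
	model_set_pm_only(s);
	s->rpm = RPM_SUSPENDED;
}

/* Quiesce as described above: a repeated call fails and undoes its increment. */
static int model_quiesce(struct model_sdev *s)
{
	model_set_pm_only(s);
	if (s->state == SDEV_QUIESCE) {
		/* Illegal state transition: net change to pm_only is zero. */
		model_clear_pm_only(s);
		return -EINVAL;
	}
	s->state = SDEV_QUIESCE;
	s->quiesced = true;
	return 0;
}

/* Resume as described above: drops the reference once, only if quiesced. */
static void model_resume(struct model_sdev *s)
{
	if (s->quiesced) {
		s->quiesced = false;
		model_clear_pm_only(s);
	}
	if (s->state == SDEV_QUIESCE)
		s->state = SDEV_RUNNING;
}

/* System resume with the proposed fix: balance the leftover reference. */
static void model_system_resume(struct model_sdev *s)
{
	model_resume(s);
	if (s->pm_only > 0)
		model_clear_pm_only(s); /* stands in for blk_post_runtime_resume() */
	s->rpm = RPM_ACTIVE;
}
```

In this model, starting from a runtime-suspended device, two
model_quiesce() calls followed by model_resume() leave pm_only at 1, and
model_system_resume() then brings it to 0.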