Received: by 10.223.164.202 with SMTP id h10csp4743162wrb; Mon, 20 Nov 2017 22:11:50 -0800 (PST) X-Google-Smtp-Source: AGs4zMbbOSS/jk5afkjaDHmbY/dpeaHATkeeMXHQmZ7nq5oo6GNDXcZ4Psu5RhBVJ+3CBFbyc/Yd X-Received: by 10.84.232.76 with SMTP id f12mr16142305pln.195.1511244710798; Mon, 20 Nov 2017 22:11:50 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1511244710; cv=none; d=google.com; s=arc-20160816; b=nLP6vDPY/9Af4mIGhb83SDFvweORO8YYZm4Nd5ngZCUYXd6+8+q8gWhLKJ0pyareik fnaN+iktUSc7jN9Yv2Lcck7X+l7oWrMH4Pe0mfw7J9FRDpkxyiuKR6RTVDDiN7PbUXUN 8+gTV/z/9W844L0tjZAYt/L9AFo/3Q9VDUN703q+dqga4AhwbYXa3RZRVzWS0VKE6zRo Sjxn5fLo+c8kU1MbxvSL5Hcq3NIoiB2riv5fO9XmAV/BaKh+dsB+F1McspBNjCSQot9N uScRZfTuGJ+7KpPCzQCo7iYmwc0/ZoZccLACmZn08egh9A62TxJXfUXrDbMIbIfVVDUM OYug== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:dkim-signature :arc-authentication-results; bh=0GQ1tIZldwLkg/03Kdn2Y5wiKYDWTg4G4Rew3Ug6yw4=; b=fsOYozO4/z0PA69kIc5RlE7YqFgH5zXh69p+UyYTkGDyDFTYxlTuLMdZBairlBYuCz ce8QBNHNz+iaxa8YqLV61y4U001/8vtmFRF6nWr10aFaXJbMlkMmDcIuAKKZ0AjftmkS ilfDVJnK0ROi3Jf7SfmrUxT8+RPOmdxf7IOi84lBvDUK7AscUVOHA/WwS5ltG+TXm+rE OryQgyDd003mqRNfZar8+omIVOZx0JvxQ1eszztGToipxCpsP/AMBL+sXZBTIpUu7iKu gzlpuE6yELZjhWpLSCgcNFgObyARehm80GFjw/xoRAcXecI3vUP+WrM5WUp1WuZt6dlr 9IFw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=Pk4O/OM/; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id f6si10083687plf.342.2017.11.20.22.11.39; Mon, 20 Nov 2017 22:11:50 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=Pk4O/OM/; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752834AbdKUGLB (ORCPT + 73 others); Tue, 21 Nov 2017 01:11:01 -0500 Received: from mail-ot0-f194.google.com ([74.125.82.194]:41157 "EHLO mail-ot0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750794AbdKUGK7 (ORCPT ); Tue, 21 Nov 2017 01:10:59 -0500 Received: by mail-ot0-f194.google.com with SMTP id b54so9621732otd.8; Mon, 20 Nov 2017 22:10:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=0GQ1tIZldwLkg/03Kdn2Y5wiKYDWTg4G4Rew3Ug6yw4=; b=Pk4O/OM/6HWrMYCBoQhnEPBpUmeQOJO1+1cK77aF+YwBrPnyKxHVNHspCBgcvVIO+j EJJysa/R5Dn6H/oxrZwpqBtrWAnmMEq/YtYRvyQgqrAUuOcZP9eskKA5k9mZ6QuiNu/H 8AgjLX9pKortCjx6u6xCTS5aafPPxmMqS8F5idmWrCtS67f6rZ1xIrioY1XmVDUqJtgU hWCIMLsNXDME4a5QVB9avqi8FXQVMPN/Cn/vXmww1z+SkfuiDWbjAkcz2LWGYlJlRAsB jlxagOuKMtAU5Q2Llfngq4IqQs0vXNvBPGR/x09VnytlmAhIBe2DGkWA19VPb9UwUIpd EL8g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=0GQ1tIZldwLkg/03Kdn2Y5wiKYDWTg4G4Rew3Ug6yw4=; b=jTbVyvRyEZl8jwEwfXYjJuxKtwgQ8Z/YiydWlF/6e+hDzEWC7Jn3FRz5taEMdPxBJS Yi6Enwx38tFBjU2VH/Ij/9Q8KAdg4a5AMvNCPX0TSnwyVWTyFhRnCwkAHw8I50OZR1Sz NT2Sc5zLEQbhksgQzrCXozs6Oh7BM+t/uUqMLZGNW2G9+OwSkw5jX89cy9tXGpRPTNSF M/WgcLzbhq5hpEfdH8JJxXkt3a8FyDy2E21tknmlc6fHz1u5ZBq9p1iGdeqGzohKCf8d RWRUYiS2eGEh/46PwrXRNLU8RyROKB3hqWbm8yq8o7b/OthKvoHRGEArM2sIgl9Opfbq DPZg== X-Gm-Message-State: AJaThX5YLyhk94ShRQpZAZqNUJ8vgb7ydA/eBI9ffseO3Z5uriH8CAsc haqhIAO+5GJH1pc/pYpAGGM= X-Received: by 10.157.55.181 with SMTP id x50mr11205860otb.252.1511244658728; Mon, 20 Nov 2017 22:10:58 -0800 (PST) Received: from [192.168.0.2] (cpe-24-27-59-32.austin.res.rr.com. [24.27.59.32]) by smtp.gmail.com with ESMTPSA id y9sm1342722oie.18.2017.11.20.22.10.56 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 20 Nov 2017 22:10:57 -0800 (PST) Subject: Re: [PATCH] scsi/eh: fix hang adding ehandler wakeups after decrementing host_busy To: Pavel Tikhomirov Cc: "James E . J . Bottomley" , "Martin K . Petersen" , Christoph Hellwig , linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, Konstantin Khorenko , devel@openvz.org References: <20170905125424.15412-1-ptikhomirov@virtuozzo.com> <496d6c6e-565b-9ebb-59b4-d4b17e0d9a62@virtuozzo.com> From: Stuart Hayes Message-ID: <9941b46c-72be-e171-2e4b-cb12ac5c439d@gmail.com> Date: Tue, 21 Nov 2017 00:10:55 -0600 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 MIME-Version: 1.0 In-Reply-To: <496d6c6e-565b-9ebb-59b4-d4b17e0d9a62@virtuozzo.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-Antivirus: Avast (VPS 171120-14, 11/20/2017), Outbound message X-Antivirus-Status: Clean Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Pavel, It turns out that the error handler on our systems was not getting woken up for a different reason... I submitted a patch earlier today that fixes the issue I were seeing (I CCed you on the patch). Before I got my hands on the failing system and was able to root cause it, I was pretty sure that your patch was going to fix our issue, because after I examined the code paths, I couldn't find any other reason that the error handler would not get woken up. I tried forcing the bug that your patch fixes to occur, by compiling in some mdelay()s in a key place or two in the scsi code, but it never failed for me that way. With my patch, several systems that previously failed in 10 minutes or less successfully ran for many days. Thanks, Stuart On 11/9/2017 8:54 AM, Pavel Tikhomirov wrote: >> Are there any issues with this patch (https://patchwork.kernel.org/patch/9938919/) that Pavel Tikhomirov submitted back in September?  I am willing to help if there's anything I can do to help get it accepted. > > Hi, Stuart, I asked James Bottomley about the patch status offlist and it seems that the problem is - patch lacks testing and review. I would highly appreciate review from your side and anyone who wants to participate! > > And if you can confirm that the patch solves the problem on your environment with no side effects please add "Tested-by:" tag also. > > Thanks, Pavel > > On 09/05/2017 03:54 PM, Pavel Tikhomirov wrote: >> We have a problem on several our nodes with scsi EH. Imagine such an >> order of execution of two threads: >> >> CPU1 scsi_eh_scmd_add        CPU2 scsi_host_queue_ready >> /* shost->host_busy == 1 initialy */ >> >>                 if (shost->shost_state == SHOST_RECOVERY) >>                     /* does not get here */ >>                     return 0; >> >> lock(shost->host_lock); >> shost->shost_state = SHOST_RECOVERY; >> >>                 busy = shost->host_busy++; >>                 /* host->can_queue == 1 initialy, busy == 1 >>                  * - go to starved label */ >>                 lock(shost->host_lock) /* wait */ >> >> shost->host_failed++; >> /* shost->host_busy == 2, shost->host_failed == 1 */ >> call scsi_eh_wakeup(shost) { >>     if (host_busy == host_failed) { >>         /* does not get here */ >>         wake_up_process(shost->ehandler) >>     } >> } >> unlock(shost->host_lock) >> >>                 /* acquire lock */ >>                 shost->host_busy--; >> >> Finaly we do not wakeup scsi_error_handler and all other commands >> coming will hang as we are in never ending recovery state as there >> is no one left to wakeup handler. >> >> So scsi disc in these host becomes unresponsive and all bio on node >> hangs. (We trigger these problem when scsi cmnds to DVD drive timeout.) >> >> Main idea of the fix is to try to do wake up every time we decrement >> host_busy or increment host_failed(the latter is already OK). >> >> Now the very *last* one of busy threads getting host_lock after >> decrementing host_busy will see all write operations on host's >> shost_state, host_busy and host_failed completed thanks to implied >> memory barriers on spin_lock/unlock, so at the time of busy==failed >> we will trigger wakeup in at least one thread. (Thats why putting >> recovery and failed checks under lock) >> >> Signed-off-by: Pavel Tikhomirov >> --- >>   drivers/scsi/scsi_lib.c | 21 +++++++++++++++++---- >>   1 file changed, 17 insertions(+), 4 deletions(-) >> >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c >> index f6097b89d5d3..6c99221d60aa 100644 >> --- a/drivers/scsi/scsi_lib.c >> +++ b/drivers/scsi/scsi_lib.c >> @@ -320,12 +320,11 @@ void scsi_device_unbusy(struct scsi_device *sdev) >>       if (starget->can_queue > 0) >>           atomic_dec(&starget->target_busy); >>   +    spin_lock_irqsave(shost->host_lock, flags); >>       if (unlikely(scsi_host_in_recovery(shost) && >> -             (shost->host_failed || shost->host_eh_scheduled))) { >> -        spin_lock_irqsave(shost->host_lock, flags); >> +             (shost->host_failed || shost->host_eh_scheduled))) >>           scsi_eh_wakeup(shost); >> -        spin_unlock_irqrestore(shost->host_lock, flags); >> -    } >> +    spin_unlock_irqrestore(shost->host_lock, flags); >>         atomic_dec(&sdev->device_busy); >>   } >> @@ -1503,6 +1502,13 @@ static inline int scsi_host_queue_ready(struct request_queue *q, >>       spin_unlock_irq(shost->host_lock); >>   out_dec: >>       atomic_dec(&shost->host_busy); >> + >> +    spin_lock_irq(shost->host_lock); >> +    if (unlikely(scsi_host_in_recovery(shost) && >> +             (shost->host_failed || shost->host_eh_scheduled))) >> +        scsi_eh_wakeup(shost); >> +    spin_unlock_irq(shost->host_lock); >> + >>       return 0; >>   } >>   @@ -1964,6 +1970,13 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, >>     out_dec_host_busy: >>       atomic_dec(&shost->host_busy); >> + >> +    spin_lock_irq(shost->host_lock); >> +    if (unlikely(scsi_host_in_recovery(shost) && >> +             (shost->host_failed || shost->host_eh_scheduled))) >> +        scsi_eh_wakeup(shost); >> +    spin_unlock_irq(shost->host_lock); >> + >>   out_dec_target_busy: >>       if (scsi_target(sdev)->can_queue > 0) >>           atomic_dec(&scsi_target(sdev)->target_busy); >> > --- This email has been checked for viruses by Avast antivirus software. https://www.avast.com/antivirus From 1583600702235619492@xxx Thu Nov 09 14:55:17 +0000 2017 X-GM-THRID: 1577704452897311330 X-Gmail-Labels: Inbox,Category Forums,HistoricalUnread