Date: Mon, 17 Aug 2020 07:42:36 +0100
From: Christoph Hellwig <hch@infradead.org>
To: Xianting Tian <xianting_tian@126.com>
Cc: axboe@kernel.dk, linux-block@vger.kernel.org,
        linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org
Subject: Re: [PATCH] block: don't read block device if it's invalid
Message-ID: <20200817064236.GA22917@infradead.org>
In-Reply-To: <1597153386-87954-1-git-send-email-xianting_tian@126.com>
References: <1597153386-87954-1-git-send-email-xianting_tian@126.com>

On Tue, Aug 11, 2020 at 09:43:06AM -0400, Xianting Tian wrote:
> We found several processes in 'D' state after an nvme device was
> hot-removed. The call trace is shown below: process 848 acquired the
> lock 'bdev->bd_mutex' in blkdev_reread_part(), but was scheduled out
> while waiting for its IO to complete. That IO will never complete,
> because the device has been hot-removed, so 'bdev->bd_mutex' is never
> unlocked. As a result, every other process that needs the same lock
> blocks on it.
>
> When an nvme device is hot-removed, the kernel starts a thread to
> handle the removal, as the call trace of process 1111504 shows below.
> The relevant part of the nvme_kill_queues() call chain is listed in
> detail below: 'NVME_NS_DEAD' is set, and when nvme_revalidate_disk()
> runs it sees 'NVME_NS_DEAD' set and calls 'set_capacity(disk, 0)' to
> set the disk capacity to 0.
>
>   nvme_kill_queues()
>     if (test_and_set_bit(NVME_NS_DEAD, &ns->flags)) return;
>     revalidate_disk(disk)
>       disk->fops->revalidate_disk(disk)  <= for nvme this is nvme_revalidate_disk()
>       mutex_lock(&bdev->bd_mutex)
>
> This patch reduces the probability of the problem. Before taking
> 'bdev->bd_mutex' in blkdev_reread_part(), check whether the disk
> capacity is 0 and, if it is, return early. That avoids the scenario
> above: once a hot-removed nvme device has had its capacity set to 0,
> a process like 848 that wants to reread the device returns directly
> from blkdev_reread_part() and never takes 'bdev->bd_mutex', a lock it
> could never release itself because its IO cannot complete.

We need to fix this for real; as you state, this at best reduces the
window in which the race can happen.

I think our main problem is that, because of bd_mutex, we can't update
the block device size from arbitrary context. If we instead add an
irqsave spinlock just for the size, we'd get rid of that limitation
and could stop papering over the problem. Give me a little time to try
to do that.
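
For concreteness, the early-return check described in the quoted text
would sit in blkdev_reread_part() roughly as follows. This is a sketch
against the v5.8-era helpers, not the exact diff from the referenced
message; the surrounding permission and validity checks are elided, and
the -EIO return value is a guess.

	static int blkdev_reread_part(struct block_device *bdev)
	{
		int ret;

		/*
		 * The proposed check: nvme_kill_queues() has already
		 * called set_capacity(disk, 0) for a dead namespace,
		 * so a zero capacity means a partition rescan could
		 * only block forever on IO that will never complete.
		 */
		if (!get_capacity(bdev->bd_disk))
			return -EIO;

		mutex_lock(&bdev->bd_mutex);
		ret = bdev_disk_changed(bdev, false);
		mutex_unlock(&bdev->bd_mutex);
		return ret;
	}

Treating zero capacity as "device gone" keeps the BLKRRPART path away
from bd_mutex in the post-removal case, but a removal that zeroes the
capacity just after the check still hits the same window; that is the
objection in the reply above.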
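
And a minimal sketch of the direction the reply suggests, assuming a
dedicated lock is added to struct block_device; the names
'bd_size_lock' and 'bd_set_nr_sectors' are hypothetical, not an
existing API:

	struct block_device {
		/* ... existing fields ... */
		spinlock_t	bd_size_lock;	/* serializes size updates */
	};

	/*
	 * Update the block device size from any context, including
	 * atomic paths a removal handler may run in. Holding the
	 * spinlock also serializes i_size_write() callers, which
	 * i_size_write() requires on 32-bit SMP.
	 */
	static void bd_set_nr_sectors(struct block_device *bdev,
				      sector_t sectors)
	{
		unsigned long flags;

		spin_lock_irqsave(&bdev->bd_size_lock, flags);
		i_size_write(bdev->bd_inode, (loff_t)sectors << SECTOR_SHIFT);
		spin_unlock_irqrestore(&bdev->bd_size_lock, flags);
	}

Unlike bd_mutex, an irqsave spinlock can be taken from atomic context,
so the size update no longer forces the caller into sleepable context;
that is the "arbitrary context" limitation the reply refers to.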