Date: Tue, 15 Jan 2019 11:23:56 +0800
From: Ming Lei
To: Steven Rostedt
Cc: Jens Axboe, LKML, Linus Torvalds, Andrew Morton, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Clark Williams, Bart Van Assche
Subject: Re: Real deadlock being suppressed in sbitmap
Message-ID: <20190115032355.GE10121@ming.t460p>
References: <20190114121414.450ab4ea@gandalf.local.home>
In-Reply-To: <20190114121414.450ab4ea@gandalf.local.home>

Hi Steven,

On Mon, Jan 14, 2019 at 12:14:14PM -0500, Steven Rostedt wrote:
> It was brought to my attention (by this creating a splat in the RT tree
> too) that we have this code:
> 
> static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
> {
> 	unsigned long mask, val;
> 	unsigned long __maybe_unused flags;
> 	bool ret = false;
> 
> 	/* Silence bogus lockdep warning */
> #if defined(CONFIG_LOCKDEP)
> 	local_irq_save(flags);
> #endif
> 	spin_lock(&sb->map[index].swap_lock);
> 
> Commit 58ab5e32e6f ("sbitmap: silence bogus lockdep IRQ warning")
> states the following:
> 
>     For this case, it's a false positive. The swap_lock is used from process
>     context only, when we swap the bits in the word and cleared mask. We
>     also end up doing that when we are getting a driver tag, from the
>     blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
>     IRQs disabled. However, this isn't from an actual IRQ, it's still
>     process context.
> 
> The thing is, lockdep doesn't define a lock as "irq-safe" based on it
> being taken with interrupts disabled or not. It detects when locks are
> used in actual interrupts. Further in that commit we have this:
> 
> [ 106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 106.098231] 000000004c43fa71 (&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
> [ 106.099431]
> [ 106.099431] and this task is already holding:
> [ 106.100229] 000000007eec8b2f (&(&hctx->dispatch_wait_lock)->rlock){....}, at: blk_mq_dispatch_rq_list+0x4c1/0xd7c
> [ 106.101630] which would create a new lock dependency:
> [ 106.102326]  (&(&hctx->dispatch_wait_lock)->rlock){....} -> (&(&sb->map[i].swap_lock)->rlock){+.+.}
> 
> Saying that you are trying to take the swap_lock while holding the
> dispatch_wait_lock.
> 
> [ 106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
> [ 106.104580]  (&sbq->ws[i].wait){..-.}
> 
> Which means that there's already a chain of:
> 
>   sbq->ws[i].wait -> dispatch_wait_lock
> 
> [ 106.104582]
> [ 106.104582] ... which became SOFTIRQ-irq-safe at:
> [ 106.105751]   _raw_spin_lock_irqsave+0x4b/0x82
> [ 106.106284]   __wake_up_common_lock+0x119/0x1b9
> [ 106.106825]   sbitmap_queue_wake_up+0x33f/0x383
> [ 106.107456]   sbitmap_queue_clear+0x4c/0x9a
> [ 106.108046]   __blk_mq_free_request+0x188/0x1d3
> [ 106.108581]   blk_mq_free_request+0x23b/0x26b
> [ 106.109102]   scsi_end_request+0x345/0x5d7
> [ 106.109587]   scsi_io_completion+0x4b5/0x8f0
> [ 106.110099]   scsi_finish_command+0x412/0x456
> [ 106.110615]   scsi_softirq_done+0x23f/0x29b
> [ 106.111115]   blk_done_softirq+0x2a7/0x2e6
> [ 106.111608]   __do_softirq+0x360/0x6ad
> [ 106.112062]   run_ksoftirqd+0x2f/0x5b
> [ 106.112499]   smpboot_thread_fn+0x3a5/0x3db
> [ 106.113000]   kthread+0x1d4/0x1e4
> [ 106.113457]   ret_from_fork+0x3a/0x50
> 
> We see that sbq->ws[i].wait was taken from a softirq context.

Actually sbq->ws[i].wait is taken from softirq context only in the
single-queue case, see __blk_mq_complete_request(). For multiple
queues, sbq->ws[i].wait is taken from hardirq context.
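To make the context difference concrete, here is a trimmed-down sketch of
the relevant branch in __blk_mq_complete_request() as I remember it; it
omits the remote-completion/IPI handling, so treat it as an illustration
rather than the exact source:

	/*
	 * Illustration only, not the exact block layer code: where the
	 * completion path -- and therefore sbitmap_queue_wake_up(), which
	 * takes sbq->ws[i].wait -- ends up running.
	 */
	static void __blk_mq_complete_request_sketch(struct request *rq)
	{
		struct request_queue *q = rq->q;

		if (q->nr_hw_queues == 1) {
			/*
			 * Single hw queue: completion is punted to
			 * BLOCK_SOFTIRQ, so sbq->ws[i].wait is acquired in
			 * softirq context.
			 */
			__blk_complete_request(rq);
			return;
		}

		/*
		 * Multiple hw queues: ->complete() is invoked straight from
		 * the hardirq handler, so sbq->ws[i].wait is acquired in
		 * hardirq context.
		 */
		q->mq_ops->complete(rq);
	}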
> 
> 
> 
> [ 106.131226] Chain exists of:
> [ 106.131226]   &sbq->ws[i].wait --> &(&hctx->dispatch_wait_lock)->rlock --> &(&sb->map[i].swap_lock)->rlock
> 
> This is telling us that we now have a chain of:
> 
>   sbq->ws[i].wait -> dispatch_wait_lock -> swap_lock
> 
> [ 106.131226]
> [ 106.132865]  Possible interrupt unsafe locking scenario:
> [ 106.132865]
> [ 106.133659]        CPU0                    CPU1
> [ 106.134194]        ----                    ----
> [ 106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
> [ 106.135318]                                local_irq_disable();
> [ 106.136014]                                lock(&sbq->ws[i].wait);
> [ 106.136747]                                lock(&(&hctx->dispatch_wait_lock)->rlock);
> [ 106.137742]   <Interrupt>
> [ 106.138110]     lock(&sbq->ws[i].wait);
> [ 106.138625]
> [ 106.138625]  *** DEADLOCK ***
> [ 106.138625]
> 
> I need to make this more than just two levels deep. Here's the issue:
> 
> 	CPU0			CPU1			CPU2
> 	----			----			----
>   lock(swap_lock)
> 			local_irq_disable()
> 			lock(dispatch_lock);
> 						local_irq_disable()
> 						lock(sbq->ws[i].wait)
> 						lock(dispatch_lock)
> 			lock(swap_lock)
>   <interrupt>
>   lock(sbq->ws[i].wait)

I guess the above 'dispatch_lock' is actually 'dispatch_wait_lock',
which is always acquired after sbq->ws[i].wait is held, so I think the
above description of CPU1/CPU2 may not be possible or correct.

Thinking about the original lockdep log further, it does look like a
real deadlock:

[ 106.132865]  Possible interrupt unsafe locking scenario:
[ 106.132865]
[ 106.133659]        CPU0                    CPU1
[ 106.134194]        ----                    ----
[ 106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
[ 106.135318]                                local_irq_disable();
[ 106.136014]                                lock(&sbq->ws[i].wait);
[ 106.136747]                                lock(&(&hctx->dispatch_wait_lock)->rlock);
[ 106.137742]   <Interrupt>
[ 106.138110]     lock(&sbq->ws[i].wait);

Given that 'swap_lock' can be acquired from blk_mq_dispatch_rq_list()
via blk_mq_get_driver_tag() directly, the above deadlock may be
possible.

It sounds like the correct fix may be the one below, and the irqsave
cost should be fine given that sbitmap_deferred_clear() is only
triggered when a word runs out of free bits.

--
diff --git a/lib/sbitmap.c b/lib/sbitmap.c
index 65c2d06250a6..24d62d7894cb 100644
--- a/lib/sbitmap.c
+++ b/lib/sbitmap.c
@@ -26,14 +26,11 @@ static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
 {
 	unsigned long mask, val;
-	unsigned long __maybe_unused flags;
+	unsigned long flags;
 	bool ret = false;
 
 	/* Silence bogus lockdep warning */
-#if defined(CONFIG_LOCKDEP)
-	local_irq_save(flags);
-#endif
-	spin_lock(&sb->map[index].swap_lock);
+	spin_lock_irqsave(&sb->map[index].swap_lock, flags);
 
 	if (!sb->map[index].cleared)
 		goto out_unlock;
@@ -54,10 +51,7 @@ static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
 	ret = true;
 out_unlock:
-	spin_unlock(&sb->map[index].swap_lock);
-#if defined(CONFIG_LOCKDEP)
-	local_irq_restore(flags);
-#endif
+	spin_unlock_irqrestore(&sb->map[index].swap_lock, flags);
 	return ret;
 }

Thanks,
Ming
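For reference, with the patch applied the locking in
sbitmap_deferred_clear() would read roughly as below; this is just a
sketch assembled from the snippet quoted above plus the diff, with the
middle of the function (the part that swaps the cleared bits into the
word) elided:

	static inline bool sbitmap_deferred_clear(struct sbitmap *sb, int index)
	{
		unsigned long mask, val;
		unsigned long flags;
		bool ret = false;

		/*
		 * irqsave actually closes the race instead of only hiding the
		 * lockdep report behind CONFIG_LOCKDEP: no interrupt can come
		 * in on this CPU while swap_lock is held.
		 */
		spin_lock_irqsave(&sb->map[index].swap_lock, flags);

		if (!sb->map[index].cleared)
			goto out_unlock;

		/* ... swap sb->map[index].cleared into sb->map[index].word ... */

		ret = true;
	out_unlock:
		spin_unlock_irqrestore(&sb->map[index].swap_lock, flags);
		return ret;
	}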