Subject: Re: INFO: task hung in blk_queue_enter
To: Tetsuo Handa, Bart.VanAssche@wdc.com, dvyukov@google.com
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
 jthumshirn@suse.de, alan.christopher.jenkins@gmail.com,
 syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com,
 martin.petersen@oracle.com, dan.j.williams@intel.com, hch@lst.de,
 oleksandr@natalenko.name, ming.lei@redhat.com, martin@lichtvoll.de,
 hare@suse.com, syzkaller-bugs@googlegroups.com,
 ross.zwisler@linux.intel.com, keith.busch@intel.com,
 linux-ext4@vger.kernel.org
From: Jens Axboe
Date: Fri, 1 Jun 2018 11:52:45 -0600
In-Reply-To: <25708e84-6f35-04c3-a2e4-6854f0ed9e78@I-love.SAKURA.ne.jp>
On 6/1/18 4:10 AM, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>> not a race condition while folding percpu counter values into atomic counter
>> value. That is, for some reason, someone who is responsible for calling
>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>> unable to call percpu_ref_put().
>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>
> I found the someone. It was already there in the backtrace...
>
> ----------------------------------------
> [   62.065852] a.out   D    0  4414   4337 0x00000000
> [   62.067677] Call Trace:
> [   62.068545]  __schedule+0x40b/0x860
> [   62.069726]  schedule+0x31/0x80
> [   62.070796]  schedule_timeout+0x1c1/0x3c0
> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
> [   62.073670]  blk_queue_enter+0x218/0x520
> [   62.074985]  ? remove_wait_queue+0x70/0x70
> [   62.076361]  generic_make_request+0x3d/0x540
> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
> [   62.079147]  ? bio_clone_fast+0x2c/0x70
> [   62.080456]  blk_queue_split+0x29b/0x560
> [   62.081772]  ? blk_queue_split+0x29b/0x560
> [   62.083162]  blk_mq_make_request+0x7c/0x430
> [   62.084562]  generic_make_request+0x276/0x540
> [   62.086034]  submit_bio+0x6e/0x140
> [   62.087185]  ? submit_bio+0x6e/0x140
> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
> [   62.089681]  do_mpage_readpage+0x328/0x730
> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
> [   62.092726]  mpage_readpages+0x120/0x190
> [   62.094034]  ? check_disk_change+0x70/0x70
> [   62.095454]  ? check_disk_change+0x70/0x70
> [   62.096849]  ? alloc_pages_current+0x65/0xd0
> [   62.098277]  blkdev_readpages+0x18/0x20
> [   62.099568]  __do_page_cache_readahead+0x298/0x360
> [   62.101157]  ondemand_readahead+0x1f6/0x490
> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
> [   62.103995]  page_cache_sync_readahead+0x29/0x40
> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
> [   62.107067]  ? futex_wait+0x221/0x240
> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
> [   62.109654]  blkdev_read_iter+0x30/0x40
> [   62.110954]  generic_file_splice_read+0xc5/0x140
> [   62.112538]  do_splice_to+0x74/0x90
> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
> [   62.116773]  do_splice_direct+0x8a/0xb0
> [   62.118056]  do_sendfile+0x1aa/0x390
> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
> [   62.120666]  do_syscall_64+0x6e/0x210
> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> ----------------------------------------
>
> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
> assumption that blk_queue_enter() from recursively called generic_make_request()
> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>
> ----------------------------------------
> generic_make_request(struct bio *bio) {
>         if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>                 if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>                         bio_wouldblock_error(bio);
>                 else
>                         bio_io_error(bio);
>                 return ret;
>         }
>         (...snipped...)
>         ret = q->make_request_fn(q, bio);
>         (...snipped...)
>         if (q)
>                 blk_queue_exit(q);
> }
> ----------------------------------------
>
> where q->make_request_fn == blk_mq_make_request which does
>
> ----------------------------------------
> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>         blk_queue_split(q, &bio);
> }
>
> blk_queue_split(struct request_queue *q, struct bio **bio) {
>         generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
> }
> ----------------------------------------
>
> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
> percpu_ref_kill() are called by blk_freeze_queue_start()...
>
> Now, it is up to you about how to fix this race problem.

Ahh, nicely spotted! One idea would be the one below. For this case,
we're recursing, so we can either do a non-block queue enter, or we
can just do a live enter.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..7de12e0dcc3d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 		if (bio->bi_opf & REQ_NOWAIT)
 			flags = BLK_MQ_REQ_NOWAIT;
-		if (blk_queue_enter(q, flags) < 0) {
+		if (bio->bi_opf & REQ_QUEUE_ENTERED)
+			blk_queue_enter_live(q);
+		else if (blk_queue_enter(q, flags) < 0) {
 			if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
 				bio_wouldblock_error(bio);
 			else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..e1063e8f7eda 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 		/* there isn't chance to merge the splitted bio */
 		split->bi_opf |= REQ_NOMERGE;
 
+		/*
+		 * Since we're recursing into make_request here, ensure
+		 * that we mark this bio as already having entered the queue.
+		 * If not, and the queue is going away, we can get stuck
+		 * forever on waiting for the queue reference to drop. But
+		 * that will never happen, as we're already holding a
+		 * reference to it.
+		 */
+		(*bio)->bi_opf |= REQ_QUEUE_ENTERED;
+
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
 		generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..a6bc789b83e7 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -282,6 +282,8 @@ enum req_flag_bits {
 	/* command specific flags for REQ_OP_WRITE_ZEROES: */
 	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
 
+	__REQ_QUEUE_ENTERED,	/* queue already entered for this request */
+
 	/* for driver use */
 	__REQ_DRV,
@@ -302,9 +304,8 @@ enum req_flag_bits {
 #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
 #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
 #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
-
 #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
-
+#define REQ_QUEUE_ENTERED	(1ULL << __REQ_QUEUE_ENTERED)
 #define REQ_DRV			(1ULL << __REQ_DRV)
 
 #define REQ_FAILFAST_MASK \

-- 
Jens Axboe