Subject: Re: INFO: task hung in blk_queue_enter
To: Ming Lei
Cc: Tetsuo Handa, Bart.VanAssche@wdc.com, dvyukov@google.com,
    linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
    jthumshirn@suse.de, alan.christopher.jenkins@gmail.com,
    syzbot+c4f9cebf9d651f6e54de@syzkaller.appspotmail.com,
    martin.petersen@oracle.com, dan.j.williams@intel.com, hch@lst.de,
    oleksandr@natalenko.name, martin@lichtvoll.de, hare@suse.com,
    syzkaller-bugs@googlegroups.com, ross.zwisler@linux.intel.com,
    keith.busch@intel.com, linux-ext4@vger.kernel.org
References: <43327033306c3dd2f7c3717d64ce22415b6f3451.camel@wdc.com>
    <6db16aa3a7c56b6dcca2d10b4e100a780c740081.camel@wdc.com>
    <201805220652.BFH82351.SMQFFOJOtFOVLH@I-love.SAKURA.ne.jp>
    <201805222020.FEJ82897.OFtJMFHOVLQOSF@I-love.SAKURA.ne.jp>
    <25708e84-6f35-04c3-a2e4-6854f0ed9e78@I-love.SAKURA.ne.jp>
    <20180601234946.GA655@ming.t460p>
From: Jens Axboe
Message-ID: <95c419d8-7f19-f9c0-a53f-3d381fe93176@kernel.dk>
Date: Fri, 1 Jun 2018 18:49:10 -0600
In-Reply-To: <20180601234946.GA655@ming.t460p>

On 6/1/18 5:49 PM, Ming Lei wrote:
> On Fri, Jun 01, 2018 at 11:52:45AM -0600, Jens Axboe wrote:
>> On 6/1/18 4:10 AM, Tetsuo Handa wrote:
>>> Tetsuo Handa wrote:
>>>> Since sum of percpu_count did not change after percpu_ref_kill(), this is
>>>> not a race condition while folding percpu counter values into atomic counter
>>>> value. That is, for some reason, someone who is responsible for calling
>>>> percpu_ref_put(&q->q_usage_counter) (presumably via blk_queue_exit()) is
>>>> unable to call percpu_ref_put().
>>>> But I don't know how to find someone who is failing to call percpu_ref_put()...
>>>
>>> I found the someone. It was already there in the backtrace...
>>>
>>> ----------------------------------------
>>> [   62.065852] a.out           D    0  4414   4337 0x00000000
>>> [   62.067677] Call Trace:
>>> [   62.068545]  __schedule+0x40b/0x860
>>> [   62.069726]  schedule+0x31/0x80
>>> [   62.070796]  schedule_timeout+0x1c1/0x3c0
>>> [   62.072159]  ? __next_timer_interrupt+0xd0/0xd0
>>> [   62.073670]  blk_queue_enter+0x218/0x520
>>> [   62.074985]  ? remove_wait_queue+0x70/0x70
>>> [   62.076361]  generic_make_request+0x3d/0x540
>>> [   62.077785]  ? __bio_clone_fast+0x6b/0x80
>>> [   62.079147]  ? bio_clone_fast+0x2c/0x70
>>> [   62.080456]  blk_queue_split+0x29b/0x560
>>> [   62.081772]  ? blk_queue_split+0x29b/0x560
>>> [   62.083162]  blk_mq_make_request+0x7c/0x430
>>> [   62.084562]  generic_make_request+0x276/0x540
>>> [   62.086034]  submit_bio+0x6e/0x140
>>> [   62.087185]  ? submit_bio+0x6e/0x140
>>> [   62.088384]  ? guard_bio_eod+0x9d/0x1d0
>>> [   62.089681]  do_mpage_readpage+0x328/0x730
>>> [   62.091045]  ? __add_to_page_cache_locked+0x12e/0x1a0
>>> [   62.092726]  mpage_readpages+0x120/0x190
>>> [   62.094034]  ? check_disk_change+0x70/0x70
>>> [   62.095454]  ? check_disk_change+0x70/0x70
>>> [   62.096849]  ? alloc_pages_current+0x65/0xd0
>>> [   62.098277]  blkdev_readpages+0x18/0x20
>>> [   62.099568]  __do_page_cache_readahead+0x298/0x360
>>> [   62.101157]  ondemand_readahead+0x1f6/0x490
>>> [   62.102546]  ? ondemand_readahead+0x1f6/0x490
>>> [   62.103995]  page_cache_sync_readahead+0x29/0x40
>>> [   62.105539]  generic_file_read_iter+0x7d0/0x9d0
>>> [   62.107067]  ? futex_wait+0x221/0x240
>>> [   62.108303]  ? trace_hardirqs_on+0xd/0x10
>>> [   62.109654]  blkdev_read_iter+0x30/0x40
>>> [   62.110954]  generic_file_splice_read+0xc5/0x140
>>> [   62.112538]  do_splice_to+0x74/0x90
>>> [   62.113726]  splice_direct_to_actor+0xa4/0x1f0
>>> [   62.115209]  ? generic_pipe_buf_nosteal+0x10/0x10
>>> [   62.116773]  do_splice_direct+0x8a/0xb0
>>> [   62.118056]  do_sendfile+0x1aa/0x390
>>> [   62.119255]  __x64_sys_sendfile64+0x4e/0xc0
>>> [   62.120666]  do_syscall_64+0x6e/0x210
>>> [   62.121909]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> ----------------------------------------
>>>
>>> The someone is blk_queue_split() from blk_mq_make_request() who depends on an
>>> assumption that blk_queue_enter() from recursively called generic_make_request()
>>> does not get blocked due to percpu_ref_tryget_live(&q->q_usage_counter) failure.
>>>
>>> ----------------------------------------
>>> generic_make_request(struct bio *bio) {
>>> 	if (blk_queue_enter(q, flags) < 0) { /* <= percpu_ref_tryget_live() succeeds. */
>>> 		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>> 			bio_wouldblock_error(bio);
>>> 		else
>>> 			bio_io_error(bio);
>>> 		return ret;
>>> 	}
>>> 	(...snipped...)
>>> 	ret = q->make_request_fn(q, bio);
>>> 	(...snipped...)
>>> 	if (q)
>>> 		blk_queue_exit(q);
>>> }
>>> ----------------------------------------
>>>
>>> where q->make_request_fn == blk_mq_make_request which does
>>>
>>> ----------------------------------------
>>> blk_mq_make_request(struct request_queue *q, struct bio *bio) {
>>> 	blk_queue_split(q, &bio);
>>> }
>>>
>>> blk_queue_split(struct request_queue *q, struct bio **bio) {
>>> 	generic_make_request(*bio); /* <= percpu_ref_tryget_live() fails and waits until atomic_read(&q->mq_freeze_depth) becomes 0. */
>>> }
>>> ----------------------------------------
>>>
>>> and meanwhile atomic_inc_return(&q->mq_freeze_depth) and
>>> percpu_ref_kill() are called by blk_freeze_queue_start()...
>>>
>>> Now, it is up to you about how to fix this race problem.
>>
>> Ahh, nicely spotted! One idea would be the one below. For this case,
>> we're recursing, so we can either do a non-block queue enter, or we
>> can just do a live enter.
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 85909b431eb0..7de12e0dcc3d 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>>
>>  	if (bio->bi_opf & REQ_NOWAIT)
>>  		flags = BLK_MQ_REQ_NOWAIT;
>> -	if (blk_queue_enter(q, flags) < 0) {
>> +	if (bio->bi_opf & REQ_QUEUE_ENTERED)
>> +		blk_queue_enter_live(q);
>> +	else if (blk_queue_enter(q, flags) < 0) {
>>  		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
>>  			bio_wouldblock_error(bio);
>>  		else
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index 782940c65d8a..e1063e8f7eda 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
>>  		/* there isn't chance to merge the splitted bio */
>>  		split->bi_opf |= REQ_NOMERGE;
>>
>> +		/*
>> +		 * Since we're recursing into make_request here, ensure
>> +		 * that we mark this bio as already having entered the queue.
>> +		 * If not, and the queue is going away, we can get stuck
>> +		 * forever on waiting for the queue reference to drop. But
>> +		 * that will never happen, as we're already holding a
>> +		 * reference to it.
>> +		 */
>> +		(*bio)->bi_opf |= REQ_QUEUE_ENTERED;
>> +
>>  		bio_chain(split, *bio);
>>  		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
>>  		generic_make_request(*bio);
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 17b18b91ebac..a6bc789b83e7 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -282,6 +282,8 @@ enum req_flag_bits {
>>  	/* command specific flags for REQ_OP_WRITE_ZEROES: */
>>  	__REQ_NOUNMAP,		/* do not free blocks when zeroing */
>>
>> +	__REQ_QUEUE_ENTERED,	/* queue already entered for this request */
>> +
>>  	/* for driver use */
>>  	__REQ_DRV,
>>
>> @@ -302,9 +304,8 @@ enum req_flag_bits {
>>  #define REQ_RAHEAD		(1ULL << __REQ_RAHEAD)
>>  #define REQ_BACKGROUND		(1ULL << __REQ_BACKGROUND)
>>  #define REQ_NOWAIT		(1ULL << __REQ_NOWAIT)
>> -
>>  #define REQ_NOUNMAP		(1ULL << __REQ_NOUNMAP)
>> -
>> +#define REQ_QUEUE_ENTERED	(1ULL << __REQ_QUEUE_ENTERED)
>>  #define REQ_DRV			(1ULL << __REQ_DRV)
>>
>>  #define REQ_FAILFAST_MASK \
>
> bio clone inherits .bi_opf, so this way may not work for cloned bio.
>
> Given most of bio_split() is run in .make_request_fn(), holding the
> queue ref may not be needed in case of bio splitting, so another
> candidate fix might be to introduce one extra parameter of
> 'need_queue_ref' to generic_make_request().

I did consider that, but that's pretty silly for a corner case and
you'd have to change a crap ton of calls to it. I think it'd be
cleaner to clear the bit when we need to, potentially even adding a
debug check to blk_queue_enter_live() that complains if the ref was
not already elevated. Though that would be expensive, compared to the
percpu inc now.
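For the debug check, something along these lines is what I mean -
completely untested sketch, assuming blk_queue_enter_live() stays a
thin wrapper around percpu_ref_get(). There's no cheap, exact test for
"the caller already holds a ref"; percpu_ref_is_zero() only trips once
the ref has been killed and fully dropped, so treat this as the idea
rather than the final check:

static inline void blk_queue_enter_live(struct request_queue *q)
{
	/*
	 * Callers must already hold a reference to the queue. This is
	 * only a sanity check: percpu_ref_is_zero() can only trigger
	 * after the ref has been killed and has dropped to zero, so a
	 * run without warnings proves nothing by itself.
	 */
	WARN_ON_ONCE(percpu_ref_is_zero(&q->q_usage_counter));
	percpu_ref_get(&q->q_usage_counter);
}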

Not saying the bit is necessarily the best way forward, but I do like
it a LOT more than adding an argument to generic_make_request. It
should have been a bio flag to begin with, that better captures the
scope and also avoids clones inheriting anything they shouldn't. It's
also not applicable to a request.

diff --git a/block/blk-core.c b/block/blk-core.c
index 85909b431eb0..8f1063f50863 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2392,7 +2392,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 
 	if (bio->bi_opf & REQ_NOWAIT)
 		flags = BLK_MQ_REQ_NOWAIT;
-	if (blk_queue_enter(q, flags) < 0) {
+	if (bio_flagged(bio, BIO_QUEUE_ENTERED))
+		blk_queue_enter_live(q);
+	else if (blk_queue_enter(q, flags) < 0) {
 		if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))
 			bio_wouldblock_error(bio);
 		else
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 782940c65d8a..9ee9474e579c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -210,6 +210,16 @@ void blk_queue_split(struct request_queue *q, struct bio **bio)
 		/* there isn't chance to merge the splitted bio */
 		split->bi_opf |= REQ_NOMERGE;
 
+		/*
+		 * Since we're recursing into make_request here, ensure
+		 * that we mark this bio as already having entered the queue.
+		 * If not, and the queue is going away, we can get stuck
+		 * forever on waiting for the queue reference to drop. But
+		 * that will never happen, as we're already holding a
+		 * reference to it.
+		 */
+		bio_set_flag(*bio, BIO_QUEUE_ENTERED);
+
 		bio_chain(split, *bio);
 		trace_block_split(q, split, (*bio)->bi_iter.bi_sector);
 		generic_make_request(*bio);
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 17b18b91ebac..4687d7c99076 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -186,6 +186,8 @@ struct bio {
 				 * throttling rules. Don't do it again. */
 #define BIO_TRACE_COMPLETION	10	/* bio_endio() should trace the final completion
 					 * of this bio. */
+#define BIO_QUEUE_ENTERED	11	/* can use blk_queue_enter_live() */
+
 /* See BVEC_POOL_OFFSET below before adding new flags */
 
 /*
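
And if we do go with the flag, clearing it at the point where it's
consumed keeps it from leaking into a later resubmission of the same
bio. Roughly (again untested, on top of the above, using the existing
bio_flagged()/bio_clear_flag() helpers), the entry branch in
generic_make_request() would become:

	if (bio_flagged(bio, BIO_QUEUE_ENTERED)) {
		/*
		 * Recursive submit from blk_queue_split() - the caller
		 * already holds a queue reference, so take a live ref
		 * and drop the marker again.
		 */
		bio_clear_flag(bio, BIO_QUEUE_ENTERED);
		blk_queue_enter_live(q);
	} else if (blk_queue_enter(q, flags) < 0) {
		...
	}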

-- 
Jens Axboe