From: Bart Van Assche
To: "lduncan@suse.com", "tang.chen@huawei.com", "cleech@redhat.com", "axboe@kernel.dk"
CC: "linux-scsi@vger.kernel.org", "linux-kernel@vger.kernel.org", "guijianfeng@huawei.com", "zhengchuan@huawei.com"
Subject: Re: [iscsi] Deadlock occurred when network is in error
Date: Mon, 14 Aug 2017 15:17:17 +0000
Message-ID: <1502723836.2333.3.camel@wdc.com>
References: <22E823DBB7698E489DC113638F7470729C17B6@DGGEMM506-MBX.china.huawei.com>
In-Reply-To: <22E823DBB7698E489DC113638F7470729C17B6@DGGEMM506-MBX.china.huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 2017-08-14 at 11:23 +0000, Tangchen (UVP) wrote:
> Problem 2:
>
> ***************
> [What it looks like]
> ***************
> When remove a scsi device, and the network error happens, __blk_drain_queue() could hang forever.
>
> # cat /proc/19160/stack
> [] msleep+0x1d/0x30
> [] __blk_drain_queue+0xe4/0x160
> [] blk_cleanup_queue+0x106/0x2e0
> [] __scsi_remove_device+0x52/0xc0 [scsi_mod]
> [] scsi_remove_device+0x2b/0x40 [scsi_mod]
> [] sdev_store_delete_callback+0x10/0x20 [scsi_mod]
> [] sysfs_schedule_callback_work+0x15/0x80
> [] process_one_work+0x169/0x340
> [] worker_thread+0x183/0x490
> [] kthread+0x96/0xa0
> [] kernel_thread_helper+0x4/0x10
> [] 0xffffffffffffffff
>
> The request queue of this device was stopped. So the following check will be true forever:
> __blk_run_queue()
> {
> 	if (unlikely(blk_queue_stopped(q)))
> 		return;
>
> 	__blk_run_queue_uncond(q);
> }
>
> So __blk_run_queue_uncond() will never be called, and the process hang.
>
> [ ... ]
>
> ****************
> [How to reproduce]
> ****************
> Unfortunately I cannot reproduce it in the latest kernel.
> The script below will help to reproduce, but not very often.
>
> # create network error
> tc qdisc add dev eth1 root netem loss 60%
>
> # restart iscsid and rescan scsi bus again and again
> while [ 1 ]
> do
> 	systemctl restart iscsid
> 	rescan-scsi-bus (http://manpages.ubuntu.com/manpages/trusty/man8/rescan-scsi-bus.8.html)
> done

This should have been fixed by commit 36e3cf273977 ("scsi: Avoid that
SCSI queues get stuck"). The first mainline kernel that includes this
commit is kernel v4.11.

> void __blk_run_queue(struct request_queue *q)
> {
> -	if (unlikely(blk_queue_stopped(q)))
> +	if (unlikely(blk_queue_stopped(q)) && unlikely(!blk_queue_dying(q)))
> 		return;
>
> 	__blk_run_queue_uncond(q);

Are you aware that the single queue block layer is on its way out and
will be removed sooner or later? Please focus your testing on scsi-mq.

Regarding the above patch: it is wrong because it will cause lockups
during path removal for other block drivers. Please drop this patch.

Bart.
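
Note for readers of the archive: the hang described in the quoted report
can be illustrated with a small user-space model. This is only a sketch
under stated assumptions, not kernel code. struct model_queue and the
model_*() functions are hypothetical names invented for illustration, and
the max_iters bound exists only so that the model terminates (the loop in
the quoted stack trace sleeps and retries instead).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the request queue state involved. */
struct model_queue {
	bool stopped;	/* models blk_queue_stopped(q) */
	int pending;	/* requests still waiting to be dispatched */
};

/* Models __blk_run_queue_uncond(): dispatch one pending request. */
static void model_run_queue_uncond(struct model_queue *q)
{
	if (q->pending > 0)
		q->pending--;
}

/* Models the check quoted in the report: a stopped queue is never run. */
static void model_run_queue(struct model_queue *q)
{
	if (q->stopped)
		return;

	model_run_queue_uncond(q);
}

/*
 * Models the drain loop: keep running the queue until nothing is pending.
 * Returns true if the queue drained within max_iters attempts, false if
 * requests were still pending when the model gave up.
 */
static bool model_drain_queue(struct model_queue *q, int max_iters)
{
	assert(max_iters >= 0);

	for (int i = 0; i < max_iters; i++) {
		if (q->pending == 0)
			return true;
		model_run_queue(q);
	}
	return q->pending == 0;
}
```

In this model, a queue with stopped set and pending requests never makes
progress, so model_drain_queue() returns false no matter how large
max_iters is. A queue that is not stopped drains after as many attempts
as it has pending requests.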