From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Christoph Hellwig, Zhao Heming, Song Liu
Subject: [PATCH 4.14 124/323] md: md_open returns -EBUSY when entering racing area
Date: Thu, 20 May 2021 11:20:16 +0200
Message-Id: <20210520092124.358628919@linuxfoundation.org>
In-Reply-To: <20210520092120.115153432@linuxfoundation.org>
References: <20210520092120.115153432@linuxfoundation.org>
X-Mailer: git-send-email 2.31.1
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

From: Zhao Heming

commit 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 upstream.

commit d3374825ce57 ("md: make devices disappear when they are no longer
needed.") introduced protection between mddev creation and removal.
md_open shouldn't create an mddev when the all_mddevs list doesn't
contain it. With the current code logic, it is very easy to trigger a
soft lockup in a non-preempt environment.

This patch changes the md_open return value from -ERESTARTSYS to -EBUSY,
which breaks the infinite retry when md_open enters the racing area.

This patch only partly fixes the soft lockup. A full fix needs mddev_find
to be split into two functions, mddev_find and mddev_find_or_alloc, with
md_open calling the new mddev_find (which only searches).

For more detail, please refer to Christoph's "split mddev_find" patch
in later commits.

*** env ***
kvm-qemu VM 2C1G with 2 iscsi luns
kernel should be non-preempt

*** script ***

The script below triggers the lockup almost every time.

```
 1  node1="mdcluster1"
 2  node2="mdcluster2"
 3
 4  mdadm -Ss
 5  ssh ${node2} "mdadm -Ss"
 6  wipefs -a /dev/sda /dev/sdb
 7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
     /dev/sdb --assume-clean
 8
 9  for i in {1..10}; do
10     echo ==== $i ====;
11
12     echo "test ...."
13     ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
14     sleep 1
15
16     echo "clean ....."
17     ssh ${node2} "mdadm -Ss"
18  done
```

I use an mdcluster env to trigger the soft lockup, but it isn't an
mdcluster-specific bug. Stopping an md array in an mdcluster env does
more work than for a non-cluster array, which leaves enough of a time
gap for the kernel to run md_open.

*** stack ***

```
[ 884.226509] mddev_put+0x1c/0xe0 [md_mod]
[ 884.226515] md_open+0x3c/0xe0 [md_mod]
[ 884.226518] __blkdev_get+0x30d/0x710
[ 884.226520] ? bd_acquire+0xd0/0xd0
[ 884.226522] blkdev_get+0x14/0x30
[ 884.226524] do_dentry_open+0x204/0x3a0
[ 884.226531] path_openat+0x2fc/0x1520
[ 884.226534] ? seq_printf+0x4e/0x70
[ 884.226536] do_filp_open+0x9b/0x110
[ 884.226542] ? md_release+0x20/0x20 [md_mod]
[ 884.226543] ? seq_read+0x1d8/0x3e0
[ 884.226545] ? kmem_cache_alloc+0x18a/0x270
[ 884.226547] ? do_sys_open+0x1bd/0x260
[ 884.226548] do_sys_open+0x1bd/0x260
[ 884.226551] do_syscall_64+0x5b/0x1e0
[ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9
```

*** rootcause ***

"mdadm -A" (or other array assemble commands) starts a daemon
"mdadm --monitor" by default. When "mdadm -Ss" is running, the stop
action wakes up "mdadm --monitor". The "--monitor" daemon immediately
reads /proc/mdstat. At this point the mddev still exists in the kernel,
so /proc/mdstat still shows the md device, which makes "mdadm --monitor"
open /dev/md0.

The earlier "mdadm -Ss" is a removing action, while the "mdadm --monitor"
open triggers md_open, which is a creating action. The two race.

```
thread running "mdadm -Ss":
md_release
  mddev_put deletes mddev from all_mddevs
  queue_work for mddev_delayed_delete
  at this time, "/dev/md0" is still available for opening

thread running "mdadm --monitor ...":
md_open
 + mddev_find can't find mddev of /dev/md0, and create a new mddev and
 |    return.
 + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
      -ERESTARTSYS.
```

In a non-preempt kernel, the "mdadm --monitor" thread keeps occupying the
current CPU, so the mddev_delayed_delete work queued by the "mdadm -Ss"
thread cannot be scheduled.

A preempt kernel can also hit the above race, but the kernel doesn't let
one thread run on a CPU all the time. After some time, the later
"mdadm -A" (see script line 13 above) calls md_alloc to allocate a new
gendisk for the mddev. The md_open check
"if (mddev->gendisk != bdev->bd_disk)" then no longer triggers and
md_open returns 0 to the caller, so the soft lockup is broken.
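
The retry behaviour described above can be illustrated with a small
userspace model. This is only a sketch, not kernel code: it assumes the
caller of md_open restarts the open whenever it sees -ERESTARTSYS, and
the names struct model, model_md_open, model_blkdev_get and max_tries
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <errno.h>	/* EBUSY */
#include <stdbool.h>

/* ERESTARTSYS is kernel-internal and absent from userspace <errno.h>;
 * 512 is its value in include/linux/errno.h. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical model state. stale_disk stays true for as long as the
 * queued delayed-delete work has not run. */
struct model {
	bool stale_disk;
	int open_calls;
};

/* Stand-in for md_open(): while the stale disk is still present, the
 * open loses the race and fails with -race_err. */
int model_md_open(struct model *m, int race_err)
{
	m->open_calls++;
	if (m->stale_disk)
		return -race_err;
	return 0;
}

/* Stand-in for the caller (assumption: it restarts on -ERESTARTSYS).
 * Nothing inside the loop yields the CPU, so nothing ever clears
 * stale_disk. max_tries exists only so that the model terminates. */
int model_blkdev_get(struct model *m, int race_err, int max_tries)
{
	int ret = 0;

	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		ret = model_md_open(m, race_err);
		if (ret != -ERESTARTSYS)
			break;	/* any other result ends the retry */
	}
	return ret;
}
```

With stale_disk set, model_blkdev_get(&m, ERESTARTSYS, 1000) uses up all
1000 attempts, while model_blkdev_get(&m, EBUSY, 1000) returns -EBUSY
after a single call to model_md_open.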

Cc: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig
Signed-off-by: Zhao Heming
Signed-off-by: Song Liu
Signed-off-by: Greg Kroah-Hartman
---
 drivers/md/md.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 368cad6cd53a..464cca5d5952 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -7821,8 +7821,7 @@ static int md_open(struct block_device *bdev, fmode_t mode)
 		/* Wait until bdev->bd_disk is definitely gone */
 		if (work_pending(&mddev->del_work))
 			flush_workqueue(md_misc_wq);
-		/* Then retry the open from the top */
-		return -ERESTARTSYS;
+		return -EBUSY;
 	}
 	BUG_ON(mddev != bdev->bd_disk->private_data);
 
-- 
2.31.1