Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Yu Kuai
To: Donald Buczek, Dragan Stancevic, Yu Kuai, song@kernel.org
Cc: guoqing.jiang@linux.dev, it+raid@molgen.mpg.de, linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org, msmith626@gmail.com, "yangerkun@huawei.com", "yukuai (C)"
Date: Mon, 25 Sep 2023 09:11:12 +0800
Message-ID: <80e0f8aa-6d53-3109-37c0-b07c5a4b558c@huaweicloud.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On 2023/09/24 22:35, Donald Buczek wrote:
> On 9/17/23 10:55, Donald Buczek wrote:
>> On 9/14/23 08:03, Donald Buczek wrote:
>>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>>> Hi Donald-
>>>>
>>>> [...]
>>>> Here is a list of changes for 6.1:
>>>>
>>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in
>>>> action_store()
>>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in
>>>> action_store"
>>>
>>> Thanks!
>>>
>>> I've put these patches on v6.1.52. I've started a script which
>>> transitions the three md-devices of a very active backup server
>>> through idle->check->idle every 6 minutes a few hours ago. It went
>>> through ~400 iterations till now. No lock-ups so far.
>>
>> Oh dear, looks like the deadlock problem is _not_ fixed with these
>> patches.
>
> Some more info after another incident:
>
> - We've hit the deadlock with 5.15.131 (so it is NOT introduced by any
>   of the above patches)
> - The symptoms are not exactly the same as with the original year-old
>   problem. Differences:
>   - mdX_raid6 is NOT busy looping
>   - /sys/devices/virtual/block/mdX/md/array_state says "active" not
>     "write pending"
>   - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does
>     not resolve the deadlock
>   - After hours in the deadlock state the system resumed operation when
>     a script of mine read(!) lots of sysfs files.
> - But in both cases, `echo idle >
>   /sys/devices/virtual/block/mdX/md/sync_action` hangs, as do all I/O
>   operations on the raid.
>
> The fact that we didn't hit the problem for many months on 5.15.94 might
> hint that it was introduced between 5.15.94 and 5.15.131
>
> We'll try to reproduce the problem on a test machine for analysis, but
> this may take time (vacation imminent for one...).
>
> But it's not like these patches caused the problem. And maybe they _did_
> fix the original problem, as we didn't hit that one.

Sorry for the late reply. Yes, this looks like a different problem. I'm
pretty confident that the original problem is fixed, since echo
idle/frozen doesn't hold the lock 'reconfig_mutex' while waiting for
sync_thread to be done.

I'll check patches between 5.15.94 and 5.15.131.

Thanks,
Kuai

>
> Best
>
>   Donald
>
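
For anyone trying to reproduce this on a test machine, the stress test Donald describes (cycling each array through idle -> check -> idle) could be sketched roughly as below. This is not his actual script, which does not appear in the thread. The function names, the two timing parameters and the injectable `root` and `sleep` arguments are assumptions made for illustration; only the sysfs path layout and the "check" / "idle" values are taken from the messages above.

```python
import time
from pathlib import Path

# Path layout as quoted in the thread: <root>/<mdX>/md/sync_action
SYSFS_ROOT = Path("/sys/devices/virtual/block")


def sync_action_path(device, root=SYSFS_ROOT):
    """Return the sync_action attribute file for one md device."""
    return Path(root) / device / "md" / "sync_action"


def set_sync_action(device, action, root=SYSFS_ROOT):
    """Write an action such as 'check' or 'idle' to sync_action.

    On an affected kernel the write of 'idle' is the step reported to hang.
    """
    sync_action_path(device, root).write_text(action + "\n")


def get_sync_action(device, root=SYSFS_ROOT):
    """Read back the current sync_action value."""
    return sync_action_path(device, root).read_text().strip()


def cycle(devices, iterations, check_seconds, idle_seconds,
          root=SYSFS_ROOT, sleep=time.sleep):
    """Run idle -> check -> idle on every device, `iterations` times.

    check_seconds / idle_seconds are hypothetical knobs; the thread only
    says one full cycle was run every 6 minutes.
    Returns the number of completed iterations.
    """
    completed = 0
    for _ in range(iterations):
        for dev in devices:
            set_sync_action(dev, "check", root)
        sleep(check_seconds)
        for dev in devices:
            set_sync_action(dev, "idle", root)
        sleep(idle_seconds)
        completed += 1
    return completed
```

Run as root, a call such as `cycle(["md0", "md1", "md2"], 400, 180, 180)` would approximate the 6-minute cycle mentioned above; the device names here are placeholders. Because a hung write to sync_action blocks the calling process, a stalled iteration counter is itself the symptom to watch for.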