Date: Sun, 24 Sep 2023 16:35:59 +0200
Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
From: Donald Buczek
To: Dragan Stancevic, Yu Kuai, song@kernel.org
Cc: guoqing.jiang@linux.dev, it+raid@molgen.mpg.de, linux-kernel@vger.kernel.org, linux-raid@vger.kernel.org, msmith626@gmail.com, "yangerkun@huawei.com"

On 9/17/23 10:55, Donald Buczek wrote:
> On 9/14/23 08:03, Donald Buczek wrote:
>> On 9/13/23 16:16, Dragan Stancevic wrote:
>>> Hi Donald-
>>> [...]
>>> Here is a list of changes for 6.1:
>>>
>>> e5e9b9cb71a0 md: factor out a helper to wake up md_thread directly
>>> f71209b1f21c md: enhance checking in md_check_recovery()
>>> 753260ed0b46 md: wake up 'resync_wait' at last in md_reap_sync_thread()
>>> 130443d60b1b md: refactor idle/frozen_sync_thread() to fix deadlock
>>> 6f56f0c4f124 md: add a mutex to synchronize idle and frozen in action_store()
>>> 64e5e09afc14 md: refactor action_store() for 'idle' and 'frozen'
>>> a865b96c513b Revert "md: unlock mddev before reap sync_thread in action_store"
>>
>> Thanks!
>>
>> I've put these patches on v6.1.52.
>> I've started a script which transitions the three md-devices of a very active backup server through idle->check->idle every 6 minutes a few hours ago. It went through ~400 iterations till now. No lock-ups so far.
>
> Oh dear, looks like the deadlock problem is _not_ fixed with these patches.

Some more info after another incident:

- We've hit the deadlock with 5.15.131 (so it is NOT introduced by any of the above patches).
- The symptoms are not exactly the same as with the original year-old problem. Differences:
  - mdX_raid6 is NOT busy looping
  - /sys/devices/virtual/block/mdX/md/array_state says "active" not "write pending"
  - `echo active > /sys/devices/virtual/block/mdX/md/array_state` does not resolve the deadlock
  - After hours in the deadlock state the system resumed operation when a script of mine read(!) lots of sysfs files.
- But in both cases, `echo idle > /sys/devices/virtual/block/mdX/md/sync_action` hangs, as do all I/O operations on the raid.

The fact that we didn't hit the problem for many months on 5.15.94 might hint that it was introduced between 5.15.94 and 5.15.131.

We'll try to reproduce the problem on a test machine for analysis, but this may take time (vacation imminent for one...).

But it's not like these patches caused the problem. And maybe they _did_ fix the original problem, as we didn't hit that one.

Best

  Donald

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433
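
For anyone wanting to reproduce the idle->check->idle cycling described above, here is a minimal sketch in Python. It is not the script used on the backup server, which was not posted. The function names, the `check_seconds` parameter and its 60-second default are assumptions made for illustration; only the sysfs paths and the values written to `sync_action` come from the mail.

```python
import time
from pathlib import Path

# Default sysfs location, as quoted in the mail. The directory is a parameter
# so the helper can also be pointed at a scratch directory for a dry run.
SYSFS_ROOT = Path("/sys/devices/virtual/block")


def md_sysfs_dir(name, root=SYSFS_ROOT):
    """Return the md attribute directory for a device name such as 'md0'."""
    return Path(root) / name / "md"


def cycle_sync_action(md_dir, check_seconds=60.0, sleep=time.sleep):
    """Run one idle->check->idle transition and return array_state.

    check_seconds (how long the check runs before it is interrupted) is an
    assumed value; the mail only states that a cycle happens every 6 minutes.
    """
    md_dir = Path(md_dir)
    action = md_dir / "sync_action"
    action.write_text("check\n")   # start a check
    sleep(check_seconds)           # let the check run for a while
    action.write_text("idle\n")    # in the reported deadlock, this write hangs
    return (md_dir / "array_state").read_text().strip()
```

On a real system this needs root, and a loop around `cycle_sync_action(md_sysfs_dir("md0"))` with a 360-second period would approximate the test. Since the write of `idle` is the call that blocks in the failure case, a caller that wants to detect the hang would have to run it in a separate process and watch for a timeout.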