Subject: Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition
To: Donald Buczek, Song Liu, linux-raid@vger.kernel.org, Linux Kernel Mailing List, it+raid@molgen.mpg.de
From: Guoqing Jiang
Date: Wed, 20 Jan 2021 17:33:38 +0100
In-Reply-To: <55e30408-ac63-965f-769f-18be5fd5885c@molgen.mpg.de>
Hi Donald,

On 1/19/21 12:30, Donald Buczek wrote:
> Dear md-raid people,
>
> I've reported a problem in this thread in December:
>
> "We are using raid6 on several servers. Occasionally we had failures,
> where a mdX_raid6 process seems to go into a busy loop and all I/O to
> the md device blocks. We've seen this on various kernel versions."
>
> It was clear that this is related to "mdcheck" running, because we
> found that the command which should stop the scrubbing in the morning
> (`echo idle > /sys/devices/virtual/block/md1/md/sync_action`) was also
> blocked.
>
> On 12/21/20, I reported that the problem might be caused by a failure
> of the underlying block device driver, because I had found "inflight"
> counters of the block devices not being zero. However, this is not the
> case. We were able to run into the mdX_raid6 looping condition a few
> times again, but the non-zero inflight counters have not been observed
> again.
>
> I was able to collect a lot of additional information from a blocked
> system.
>
> - The `echo idle > /sys/devices/virtual/block/md1/md/sync_action`
> command is waiting at kthread_stop to stop the sync thread. [
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L7989 ]
>
> - The sync thread ("md1_resync") does not finish, because it's blocked
> at:
>
> [<0>] raid5_get_active_stripe+0x4c4/0x660     # [1]
> [<0>] raid5_sync_request+0x364/0x390
> [<0>] md_do_sync+0xb41/0x1030
> [<0>] md_thread+0x122/0x160
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> [1] https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L735
>
> - Yes, gdb confirms that `conf->cache_state` is 0x03
> (R5_INACTIVE_BLOCKED + R5_ALLOC_MORE).

The resync thread is blocked because it can't get a sh (stripe_head)
from the inactive list; that is why R5_ALLOC_MORE and
R5_INACTIVE_BLOCKED are set, and why `echo idle >
/sys/devices/virtual/block/md1/md/sync_action` can't stop the sync
thread.

> - We have lots of active stripes:
>
> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/stripe_cache_active
> 27534

There are too many active stripes, so this condition is false:

    atomic_read(&conf->active_stripes) < (conf->max_nr_stripes * 3 / 4)

and raid5_get_active_stripe has to wait until it becomes true. So
either max_nr_stripes has to be increased or active_stripes has to be
decreased; a condensed sketch of the wait follows, then the two
options.
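For reference, this is roughly where md1_resync sits, condensed from
raid5_get_active_stripe() in drivers/md/raid5.c (not the verbatim code;
the hash/lock details are trimmed and the exact shape depends on the
kernel version):

    /* condensed from raid5_get_active_stripe(), drivers/md/raid5.c */
    sh = get_free_stripe(conf, hash);
    if (!sh && !test_bit(R5_DID_ALLOC, &conf->cache_state))
            /* ask raid5d to grow the stripe cache */
            set_bit(R5_ALLOC_MORE, &conf->cache_state);
    if (!sh) {
            set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
            /* sleeps until a sh is back on the inactive list AND
             * active_stripes has dropped below max_nr_stripes * 3 / 4 */
            wait_event_lock_irq(conf->wait_for_stripe,
                    !list_empty(conf->inactive_list + hash) &&
                    (atomic_read(&conf->active_stripes)
                      < (conf->max_nr_stripes * 3 / 4) ||
                     !test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)),
                    *(conf->hash_locks + hash));
            clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
    }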
1. Increase max_nr_stripes

Since the "mdX_raid6 process seems to go into a busy loop" and
R5_ALLOC_MORE is set, if there is enough memory, I suppose mdX_raid6
(raid5d) could allocate new stripes in grow_one_stripe and increase
max_nr_stripes. So please check the memory usage of your system.

Another thing you can try is to increase the number of sh manually by
writing a bigger number to stripe_cache_size, if there is enough
memory.

2. Decrease active_stripes

This is supposed to be done by do_release_stripe, but STRIPE_HANDLE is
set:

> - handle_stripe() doesn't make progress:
>
> echo "func handle_stripe =pflt" > /sys/kernel/debug/dynamic_debug/control
>
> In dmesg, we see the debug output from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4925
> but never from
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4958:
>
> [171908.896651] [1388] handle_stripe:4929: handling stripe 4947089856,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
> [171908.896657] [1388] handle_stripe:4929: handling stripe 4947089872,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
> [171908.896669] [1388] handle_stripe:4929: handling stripe 4947089912,
> state=0x2029 cnt=1, pd_idx=9, qd_idx=10, check:4, reconstruct:0
>
> The sector numbers repeat after some time. We see only the following
> variants of stripe state and "check":
>
> state=0x2031 cnt=1, check:0 # ACTIVE        +INSYNC+REPLACED+IO_STARTED, check_state_idle
> state=0x2029 cnt=1, check:4 # ACTIVE+SYNCING       +REPLACED+IO_STARTED, check_state_check_result
> state=0x2009 cnt=1, check:0 # ACTIVE+SYNCING                +IO_STARTED, check_state_idle
>
> - We have MD_SB_CHANGE_PENDING set:

STRIPE_HANDLE stays set because of the MD_SB_CHANGE_PENDING flag, so
do_release_stripe can't call atomic_dec(&conf->active_stripes).

Normally, SB_CHANGE_PENDING is cleared in md_update_sb, which could be
called by md_reap_sync_thread. But md_reap_sync_thread is stuck
unregistering the sync thread (which is blocked in
raid5_get_active_stripe).

Still, I don't fully understand why mdX_raid6 is in a busy loop. Maybe
raid5d can't break out of the while(1) loop because
"if (!batch_size && !released)" is false. I would assume released is
'0', since:

> released_stripes = {
>     first = 0x0
> }

And __get_priority_stripe fetches a sh from conf->handle_list and
deletes it from handle_list, handle_stripe marks the sh with
STRIPE_HANDLE, then do_release_stripe puts the sh back onto
handle_list. So batch_size can't be '0', and handle_active_stripes
repeats the process: a busy loop. This is my best guess to explain the
situation; see the condensed raid5d loop in the P.S. below.

> root@deadbird:~/linux_problems/mdX_raid6_looping# cat /sys/block/md1/md/array_state
> write-pending
>
> gdb confirms that sb_flags = 6 (MD_SB_CHANGE_CLEAN + MD_SB_CHANGE_PENDING)

Since rdev_set_badblocks could set them, could you check whether there
are bad blocks on the underlying devices (sd*)?

> So it can be assumed that handle_stripe breaks out at
> https://elixir.bootlin.com/linux/latest/source/drivers/md/raid5.c#L4939
>
> - The system can manually be freed from the deadlock:
>
> When `echo active > /sys/block/md1/md/array_state` is used, the
> scrubbing and other I/O continue. Probably because of
> https://elixir.bootlin.com/linux/latest/source/drivers/md/md.c#L4520

Hmm, it seems that clearing SB_CHANGE_PENDING did the trick, so the
blocked processes could make progress.

> I, of course, don't fully understand it yet. Any ideas?
>
> I append some data from a hanging raid... (mddev, r5conf and a sample
> stripe_head from handle_list with its first disks)

These data did help the investigation!

Thanks,
Guoqing
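P.S. The while(1) loop I mean, heavily condensed from raid5d() in
drivers/md/raid5.c (again not verbatim; bitmap flushing, bio retry and
most locking are trimmed):

    /* condensed from raid5d(), drivers/md/raid5.c */
    while (1) {
            released = release_stripe_list(conf, conf->temp_inactive_list);
            /* ... */
            batch_size = handle_active_stripes(conf, ANY_GROUP, NULL,
                                               conf->temp_inactive_list);
            if (!batch_size && !released)
                    break;  /* the only exit from the loop */
            handled += batch_size;
            if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
                    spin_unlock_irq(&conf->device_lock);
                    md_check_recovery(mddev);
                    spin_lock_irq(&conf->device_lock);
            }
    }

As long as do_release_stripe keeps putting each sh straight back on
handle_list, batch_size stays positive on every pass and the break is
never reached, which would match the busy loop you observe.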