Received-SPF: pass (google.com: domain of linux-kernel+bounces-94925-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) client-ip=147.75.199.223;
Subject: Re: [REGRESSION] 6.7.1: md: raid5 hang and unresponsive system;
 successfully bisected
To: junxiao.bi@oracle.com, Song Liu <song@kernel.org>,
 Linux regressions mailing list <regressions@lists.linux.dev>
Cc: gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org,
 linux-raid@vger.kernel.org, stable@vger.kernel.org,
 Dan Moulding <dan@danm.net>, "yukuai (C)" <yukuai3@huawei.com>
References: <20240123005700.9302-1-dan@danm.net>
 <20240220230658.11069-1-dan@danm.net>
 <7efac6e0-32df-457e-9d21-4945c69328f8@leemhuis.info>
 <CAPhsuW5QYTjBvjAjf8SdcKmPGO20e5-p57n6af5FaXudSiOCmg@mail.gmail.com>
 <739634c3-3e21-44dd-abb1-356cf54e54fd@oracle.com>
From: Yu Kuai <yukuai1@huaweicloud.com>
Message-ID: <cd0f300f-49a0-235a-099c-e76af5ed7c05@huaweicloud.com>
Date: Thu, 7 Mar 2024 11:34:57 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
 Thunderbird/60.8.0
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <739634c3-3e21-44dd-abb1-356cf54e54fd@oracle.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

Hi,

在 2024/03/02 4:26, junxiao.bi@oracle.com 写道:
> Hi Dan & Song,
> 
> I have not root cause this yet, but would like share some findings from 
> the vmcore Dan shared. From what i can see, this doesn't look like a md 
> issue, but something wrong with block layer or below.

I would like to take a look at vmcore as well. How dose Dan sharing the
vmcore? I don't find it in the thread.

Thanks,
Kuai

> 
> 1. There were multiple process hung by IO over 15mins.
> 
> crash> ps -m | grep UN
> [0 00:15:50.424] [UN]  PID: 957      TASK: ffff88810baa0ec0  CPU: 1 
> COMMAND: "jbd2/dm-3-8"
> [0 00:15:56.151] [UN]  PID: 1835     TASK: ffff888108a28ec0  CPU: 2 
> COMMAND: "dd"
> [0 00:15:56.187] [UN]  PID: 876      TASK: ffff888108bebb00  CPU: 3 
> COMMAND: "md0_reclaim"
> [0 00:15:56.185] [UN]  PID: 1914     TASK: ffff8881015e6740  CPU: 1 
> COMMAND: "kworker/1:2"
> [0 00:15:56.255] [UN]  PID: 403      TASK: ffff888101351d80  CPU: 7 
> COMMAND: "kworker/u21:1"
> 
> 2. Let pick md0_reclaim to take a look, it is waiting done super_block 
> update. We can see there were two pending superblock write and other 
> pending io for the underling physical disk, which caused these process 
> hung.
> 
> crash> bt 876
> PID: 876      TASK: ffff888108bebb00  CPU: 3    COMMAND: "md0_reclaim"
>   #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac
>   #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82
>   #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a
>   #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609
>   #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4
>   #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef
>   #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee
>   #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98
>   #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1
> 
> crash> mddev.pending_writes,disks 0xffff888108335800
>    pending_writes = {
>      counter = 2  <<<<<<<<<< 2 active super block write
>    },
>    disks = {
>      next = 0xffff88810ce85a00,
>      prev = 0xffff88810ce84c00
>    },
> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending 
> 0xffff88810ce85a00
> ffff88810ce85a00
>    kobj.name = 0xffff8881067c1a00 "dev-dm-1",
>    nr_pending = {
>      counter = 0
>    },
> ffff8881083ace00
>    kobj.name = 0xffff888100a93280 "dev-sde",
>    nr_pending = {
>      counter = 10 <<<<
>    },
> ffff8881010ad200
>    kobj.name = 0xffff8881012721c8 "dev-sdc",
>    nr_pending = {
>      counter = 8 <<<<<
>    },
> ffff88810ce84c00
>    kobj.name = 0xffff888100325f08 "dev-sdd",
>    nr_pending = {
>      counter = 2 <<<<<
>    },
> 
> 3. From block layer, i can find the inflight IO for md superblock write 
> which has been pending 955s which matches with the hung time of 
> "md0_reclaim"
> 
> crash> 
> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io 
> ffff888103b4c300
>    q = 0xffff888103a00d80,
>    mq_hctx = 0xffff888103c5d200,
>    cmd_flags = 38913,
>    rq_flags = 139408,
>    start_time_ns = 1504179024146,
>    bio = 0x0,
>    biotail = 0xffff888120758e40,
>    state = MQ_RQ_COMPLETE,
>    __data_len = 0,
>    flush = {
>      seq = 3, <<<< REQ_FSEQ_PREFLUSH |  REQ_FSEQ_DATA
>      saved_end_io = 0x0
>    },
>    end_io = 0xffffffff815186e0 <mq_flush_data_end_io>,
> 
> crash> p tk_core.timekeeper.tkr_mono.base
> $1 = 2459916243002
> crash> eval 2459916243002-1504179024146
> hexadecimal: de86609f28
>      decimal: 955737218856  <<<<<<< IO pending time is 955s
>        octal: 15720630117450
>       binary: 
> 0000000000000000000000001101111010000110011000001001111100101000
> 
> crash> bio.bi_iter,bi_end_io 0xffff888120758e40
>    bi_iter = {
>      bi_sector = 8, <<<< super block offset
>      bi_size = 0,
>      bi_idx = 0,
>      bi_bvec_done = 0
>    },
>    bi_end_io = 0xffffffff817dca50 <super_written>,
> crash> dev -d | grep ffff888103a00d80
>      8 ffff8881003ab000   sdd        ffff888103a00d80       0 0     0
> 
> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it is 
> still pending. That's because each md superblock write was marked with 
> REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 steps: pre_flush, 
> data, and post_flush. Once each step complete, it will be marked in 
> "request.flush.seq", here the value is 3, which is REQ_FSEQ_PREFLUSH | 
> REQ_FSEQ_DATA, so the last step "post_flush" has not be done.  Another 
> wired thing is that blk_flush_queue.flush_data_in_flight is still 1 even 
> "data" step already done.
> 
> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200
>    fq = 0xffff88810332e240,
> crash> blk_flush_queue 0xffff88810332e240
> struct blk_flush_queue {
>    mq_flush_lock = {
>      {
>        rlock = {
>          raw_lock = {
>            {
>              val = {
>                counter = 0
>              },
>              {
>                locked = 0 '\000',
>                pending = 0 '\000'
>              },
>              {
>                locked_pending = 0,
>                tail = 0
>              }
>            }
>          }
>        }
>      }
>    },
>    flush_pending_idx = 1,
>    flush_running_idx = 1,
>    rq_status = 0 '\000',
>    flush_pending_since = 4296171408,
>    flush_queue = {{
>        next = 0xffff88810332e250,
>        prev = 0xffff88810332e250
>      }, {
>        next = 0xffff888103b4c348, <<<< the request is in this list
>        prev = 0xffff888103b4c348
>      }},
>    flush_data_in_flight = 1,  >>>>>> still 1
>    flush_rq = 0xffff888103c2e000
> }
> 
> crash> list 0xffff888103b4c348
> ffff888103b4c348
> ffff88810332e260
> 
> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of hw queue
>    tag = -1,
>    state = MQ_RQ_IDLE,
>    ref = {
>      counter = 0
>    },
> 
> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may have 
> some issue which leading to the io request from md layer stayed in a 
> partial complete statue. I can't see how this can be related with the 
> commit bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in 
> raid5d"")
> 
> 
> Dan,
> 
> Are you able to reproduce using some regular scsi disk, would like to 
> rule out whether this is related with virtio-scsi?
> 
> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the 
> official mainline v6.8-rc5 without any other patches?
> 
> 
> Thanks,
> 
> Junxiao.
> 
> On 2/23/24 6:13 PM, Song Liu wrote:
>> Hi,
>>
>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten
>> Leemhuis) <regressions@leemhuis.info> wrote:
>>> On 21.02.24 00:06, Dan Moulding wrote:
>>>> Just a friendly reminder that this regression still exists on the
>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a
>>>> development system to 6.8-rc5 today and immediately hit this issue
>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree.
>>> Song Liu, what's the status here? I aware that you fixed with quite a
>>> few regressions recently, but it seems like resolving this one is
>>> stalled. Or were you able to reproduce the issue or make some progress
>>> and I just missed it?
>> Sorry for the delay with this issue. I have been occupied with some
>> other stuff this week.
>>
>> I haven't got luck to reproduce this issue. I will spend more time 
>> looking
>> into it next week.
>>
>>> And if not, what's the way forward here wrt to the release of 6.8?
>>> Revert the culprit and try again later? Or is that not an option for one
>>> reason or another?
>> If we don't make progress with it in the next week, we will do the 
>> revert,
>> same as we did with stable kernels.
>>
>>> Or do we assume that this is not a real issue? That it's caused by some
>>> oddity (bit-flip in the metadata or something like that?) only to be
>>> found in Dan's setup?
>> I don't think this is because of oddities. Hopefully we can get more
>> information about this soon.
>>
>> Thanks,
>> Song
>>
>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
>>> -- 
>>> Everything you wanna know about Linux kernel regression tracking:
>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>> If I did something stupid, please tell me, as explained on that page.
>>>
>>> #regzbot poke
>>>
> 
> .
>