Received: by 2002:a05:6a10:2726:0:0:0:0 with SMTP id ib38csp2237058pxb; Wed, 30 Mar 2022 20:17:12 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzI9LmrXnOlIbhGsz5qKlanJoCHP+vZJ5OUFBerxhqRQxwgzIICQ61loUQwmCVI+EAUd8jA X-Received: by 2002:a17:90b:1d0e:b0:1c9:b74e:494 with SMTP id on14-20020a17090b1d0e00b001c9b74e0494mr3512356pjb.238.1648696632315; Wed, 30 Mar 2022 20:17:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1648696632; cv=none; d=google.com; s=arc-20160816; b=IBCm/3MwgfI8iqB3nMwkrUWXHbTw2HtMeK/+T3T18XFnk9jFaO8cJBdo9Fu0C7grQG s5s1QE5EZexNF2lChc8zIdD/2bRZdhG/zUgQlbq43+Y6IoYq5C8iN9nGblPZ8tw9AhVu e+rbZanzL2/gdcLAQvOePBeY016Xx6Xi5x5D8fkKyo2YKYXYwBEftVtQMItxrzYYEo4O JCbYJVCcEVIGd720358EdsguM3jCDgHCE0qTTa7y0uJYb05UwixlsWtVKHduba+E3pBX HApGFVHSzgFsare4ZU3gF6QgIjUXSBJ7bL1koo6Qi33qQ1Y6dQv7twnB+7pSaU53ZvJj V4AQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:subject:user-agent:mime-version:date:message-id; bh=b8fnVDmV1fQtqL/bYX2EyKNGYVvxpSWus9eoYMzC1Do=; b=SHOzHuQGc2OnQVfrK5DHx+FUZaHmZ4BR2Sn4l0CibSyXG2H8eBhNgj6ffUCUg+MQ+4 5bTKuvFRIhLCcRk3uOLdJLVJKoaKFF3zLg33xFGxZ6RH2uqvE80+PSs1dxLr+a3mp1/G EeI3zObWN49oQbY9ud9C3ji/UoFLKYKf3HNWehT0D+mk1CIxetysnjPinoo24rmoA0ke sbE+2p0BGFJLCzbNEM8hpDRh6KMfiCHpjqpSxbOT9nf/BYZ+F7ZAAp6NDCVIWb1+u4qK nR2ftwYMCA64otcbl4ZMqB7opBFd4D2FxwET2oauFvn1Y1cCamTT3dQCdCatJaXvPhyo vxbg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [2620:137:e000::1:18]) by mx.google.com with ESMTPS id i6-20020a170902c94600b00153bffd0af8si24667810pla.491.2022.03.30.20.17.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 30 Mar 2022 20:17:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=huawei.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 6EA8DFFFA2; Wed, 30 Mar 2022 19:49:34 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345496AbiC3NBh (ORCPT + 99 others); Wed, 30 Mar 2022 09:01:37 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39824 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1345425AbiC3NBg (ORCPT ); Wed, 30 Mar 2022 09:01:36 -0400 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 92785D5E81; Wed, 30 Mar 2022 05:59:49 -0700 (PDT) Received: from fraeml706-chm.china.huawei.com (unknown [172.18.147.226]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4KT60y1nY3z67bMY; Wed, 30 Mar 2022 20:57:54 +0800 (CST) Received: from lhreml724-chm.china.huawei.com (10.201.108.75) by fraeml706-chm.china.huawei.com (10.206.15.55) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.2375.24; Wed, 30 Mar 2022 14:59:46 +0200 Received: from [10.47.83.59] (10.47.83.59) by lhreml724-chm.china.huawei.com (10.201.108.75) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2375.24; Wed, 30 Mar 2022 13:59:45 +0100 Message-ID: <08717833-19bb-8aaa-4f24-2989a9f56cd3@huawei.com> Date: Wed, 30 Mar 2022 13:59:43 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.6.1 Subject: Re: filesystem corruption with "scsi: core: Reallocate device's budget map on queue depth change" To: Andrea Righi , Ming Lei CC: Martin Wilck , Bart Van Assche , "James E.J. Bottomley" , "Martin K. Petersen" , , References: From: John Garry In-Reply-To: Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [10.47.83.59] X-ClientProxiedBy: lhreml745-chm.china.huawei.com (10.201.108.195) To lhreml724-chm.china.huawei.com (10.201.108.75) X-CFilter-Loop: Reflected X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A, RDNS_NONE,SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 30/03/2022 12:21, Andrea Righi wrote: > On Wed, Mar 30, 2022 at 11:38:02AM +0100, John Garry wrote: >> On 30/03/2022 11:11, Andrea Righi wrote: >>> Hello, >>> >>> after this commit I'm experiencing some filesystem corruptions at boot >>> on a power9 box with an aacraid controller. >>> >>> At the moment I'm running a 5.15.30 kernel; when the filesystem is >>> mounted at boot I see the following errors in the console: About "scsi: core: Reallocate device's budget map on queue depth change" being added to a stable kernel, I am not sure if this was really a fix or just a memory optimisation. >>> >>> Begin: Will now check root file system ... fsck from util-linux 2.36.1 >>> [/usr/sbin/fsck.ext4 (1) -- /dev/sda2] fsck.ext4 -a -C0 /dev/sda2 >>> root: clean, 99646/122101760 files, 11187342/488376336 blocks >>> done. >>> [ 4.636613] sd 0:2:0:0: [sda] tag#257 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.636655] sd 0:2:0:0: [sda] tag#257 CDB: Read(10) 28 00 00 00 4c 10 00 00 08 00 >>> [ 4.636689] blk_update_request: I/O error, dev sda, sector 19472 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.636734] sd 0:2:0:0: [sda] tag#258 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.636772] sd 0:2:0:0: [sda] tag#258 CDB: Read(10) 28 00 00 00 4c 18 00 00 08 00 >>> [ 4.636796] blk_update_request: I/O error, dev sda, sector 19480 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.636840] sd 0:2:0:0: [sda] tag#260 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.636877] sd 0:2:0:0: [sda] tag#260 CDB: Read(10) 28 00 00 00 4c 28 00 00 08 00 >>> [ 4.636901] blk_update_request: I/O error, dev sda, sector 19496 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.636944] sd 0:2:0:0: [sda] tag#259 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.636971] sd 0:2:0:0: [sda] tag#259 CDB: Read(10) 28 00 00 00 4c 20 00 00 08 00 >>> [ 4.637005] blk_update_request: I/O error, dev sda, sector 19488 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.637049] sd 0:2:0:0: [sda] tag#262 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.637085] sd 0:2:0:0: [sda] tag#262 CDB: Read(10) 28 00 00 00 4c 38 00 00 08 00 >>> [ 4.637118] blk_update_request: I/O error, dev sda, sector 19512 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.637161] sd 0:2:0:0: [sda] tag#264 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.637197] sd 0:2:0:0: [sda] tag#264 CDB: Read(10) 28 00 00 00 4c 48 00 00 08 00 >>> [ 4.637221] blk_update_request: I/O error, dev sda, sector 19528 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.637270] sd 0:2:0:0: [sda] tag#284 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.637306] sd 0:2:0:0: [sda] tag#284 CDB: Read(10) 28 00 00 00 4c e8 00 00 08 00 >>> [ 4.637332] blk_update_request: I/O error, dev sda, sector 19688 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.637375] sd 0:2:0:0: [sda] tag#286 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.637411] sd 0:2:0:0: [sda] tag#286 CDB: Read(10) 28 00 00 00 4c f8 00 00 08 00 >>> [ 4.637444] blk_update_request: I/O error, dev sda, sector 19704 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.637481] blk_update_request: I/O error, dev sda, sector 19664 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.637485] sd 0:2:0:0: [sda] tag#282 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.637487] sd 0:2:0:0: [sda] tag#287 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s >>> [ 4.637491] sd 0:2:0:0: [sda] tag#287 CDB: Read(10) 28 00 00 00 4d 00 00 00 08 00 >>> [ 4.637491] sd 0:2:0:0: [sda] tag#282 CDB: Read(10) 28 00 00 00 4c d8 00 00 08 00 >>> [ 4.637494] blk_update_request: I/O error, dev sda, sector 19672 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0 >>> [ 4.747771] EXT4-fs (sda2): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none. >>> We need to find where that memory corruption is coming from. Maybe the root is that too many commands are being sent to the disk. Ming? >>> If I reboot multiple times fsck requires a manual fix and I get dropped >>> to the initramfs shell. Some times the filesystem gets corrupted and I >>> need to redeploy the box. >>> >>> If I use the same kernel with this commit reverted I can reboot as many >>> times as I want without any failure: >>> >>> 813c6871f76b ("scsi: core: Reallocate device's budget map on queue depth change") >> >> I would not have thought that this causes possible corruption. >> >>> >>> For now I've just reverted the commit, but I'll try to add some >>> debugging and collect more info. >>> >>> Let me know if there's any specific test that you want me to try. >>> >> >> Please try this: >> https://lore.kernel.org/linux-scsi/yq1ee2kumrh.fsf@ca-mkp.ca.oracle.com/T/#t >> >> It never made 5.17, which I would have hoped for. > > Thanks John! It looks like this one is actually fixing the problem. > I rebooted multiple times and I didn't get any I/O error or corruption. > > If you want you can add my: > > Tested-by: Andrea Righi > . We should prob add something like this also: --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1708,7 +1708,9 @@ static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx, blk_status_t ret; int reason; - WARN_ON_ONCE(cmd->budget_token < 0); + if (WARN_ON_ONCE(cmd->budget_token < 0 || + cmd->budget_token >= sdev->queue_depth)) + return BLK_STS_IOERR; Thanks, John