Date: Tue, 13 Jun 2023 00:31:20 -0400
From: "Theodore Ts'o"
To: Zhang Yi
Cc: linux-ext4@vger.kernel.org, adilger.kernel@dilger.ca, jack@suse.cz,
    yi.zhang@huawei.com, yukuai3@huawei.com, chengzhihao1@huawei.com
Subject: Re: [PATCH v3 4/6] jbd2: Fix wrongly judgement for buffer head
    removing while doing checkpoint
Message-ID: <20230613043120.GB1584772@mit.edu>
References: <20230606135928.434610-1-yi.zhang@huaweicloud.com>
    <20230606135928.434610-5-yi.zhang@huaweicloud.com>
In-Reply-To: <20230606135928.434610-5-yi.zhang@huaweicloud.com>

There is something about this patch which is causing test runs to hang
when running "gce-xfstests -c ext4/adv -C 10 generic/475" at least
60-70% of the time.

When I took a closer look, the problem seems to be that e2fsck is
hanging after a SEGV when running e2fsck -nf on the block device. This
then causes the check script to hang, until the test appliance's safety
timer triggers and forces a shutdown of the test VM and aborts the test
run.

The cause of the hang is clearly an e2fsprogs bug --- no matter how
corrupted the file system is, e2fsck should never crash or hang. So
something is clearly going wrong with e2fsck:

    ...
    Symlink /p1/dc/d14/dee/l154 (inode #2898) is invalid.
    Clear? no

    Entry 'l154' in /p1/dc/d14/dee (2753) has an incorrect filetype (was 7, should be 0).
    Fix? no

    corrupted size vs. prev_size
    Signal (6) SIGABRT si_code=SI_TKILL

(Note: "corrupted size vs. prev_size" is issued by glibc when malloc's
internal data structures have been corrupted. So there is definitely
something going very wrong with e2fsck.)

That being said, if I run the same test on the parent commit (patch 3/6,
jbd2: remove journal_clean_one_cp_list()), e2fsck does *not* hang or
crash, and the regression tests complete. So this patch is changing the
behavior of the kernel in terms of the file system that is left behind
after a large number of injected I/O errors.

My plan therefore is to drop patches 4/6 through 6/6 of this patch
series. This will allow at least the "long standing metadata corruption
issue that happens from time to time" to be addressed, and it will give
us time to study what's going on here in more detail.

I've captured the compressed file system image which is causing e2fsck
(version 1.47.0) to corrupt malloc's data structures, and I'll try to
see what Address Sanitizer or valgrind shows about what's going on.

Looking at the patch, it looks pretty innocuous, and I don't understand
how this could be making a significant enough difference that it's
causing e2fsck, which had previously been working fine, to now start
tossing its cookies.

If you could double check the patch and see if you see anything that I
might have missed in my code review, I'd really appreciate it.

Thanks,

					- Ted