Received: by 2002:a05:6a10:1287:0:0:0:0 with SMTP id d7csp1344094pxv; Fri, 23 Jul 2021 06:11:56 -0700 (PDT) X-Google-Smtp-Source: ABdhPJx2HKGZbIk8QXMtuIoqZFq25R/s2izn9pQfFSv3Mbm4DS5ibs1KPNy8/nJzPzXAI6kNoPij X-Received: by 2002:a17:907:76b8:: with SMTP id jw24mr4532678ejc.375.1627045916306; Fri, 23 Jul 2021 06:11:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1627045916; cv=none; d=google.com; s=arc-20160816; b=yu7MYhgLmxMPtpJVBR8k0yEkzfP4Jfk/PdnfuT9MYS36GjZjiUTJnw0OwJCZ3XoMXP ZXeWLJRb7kgkEEFX5URYpR7qAJRwdQgwxXisiOWg6aeEDvRpOGdifk8RxVPCO3pn8o63 UX6UXaY2jT4q7Tu4p/n9s2GOCp8rXD8BY+Vn7+SoKB7zBjuHpsFDCUFUpp7m0lNyDX5n KGOrw9Bqz7XnTXF13QtfExtBebMz0I+Z/m5+SPfTUQYVVJFwpInXsz1Frt8ZKWHwA38c uO+7Ji7cCeAnbgwHl0Br0oSAceLBr0MhuWzYfvzqr6swzhEBQDAkvIpl6nUOOD8PSya1 AxRg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to :mime-version:user-agent:date:message-id:from:references:cc:to :subject; bh=N91cpAVrlhFnXM+UT5o5Hc9tGElcjAMtVoU7tyzGD/o=; b=bkLSoHQb9HX1FhN7brIqz1jBmBcuXOf/sxPki26lzuAjBPhlsRvcqXNlRIyzpgfkWU u6lYj6j1xQoJ7TP3PXpyxrM2PHUHUeT/WtJiwzG2dZMqtyKsqGdmZXmlwA/H60NY668C nYVgKhUyY8/ZmFwHCd9OxaDnSOiqv0EHhPjzNdEIa/KNv5woPT8Sf14gNiQ6ZcRx0rc/ D6i4PBIdDfqhlmEhVVl5zReWpXOHZqGNZ1YAfmmhf7t/npCMby41EmsRVa0ZhZmdnuYi 616RCRlnkZN2/pIy88hnNi+5SQ9Gwv3OhH4gLXLhnHZbBwo1M4w+gzK2DY8PO18nbnWa fEvw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=huawei.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id v6si31054536ejw.426.2021.07.23.06.11.25; Fri, 23 Jul 2021 06:11:56 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=huawei.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235074AbhGWMai (ORCPT + 99 others); Fri, 23 Jul 2021 08:30:38 -0400 Received: from szxga01-in.huawei.com ([45.249.212.187]:15054 "EHLO szxga01-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235072AbhGWMai (ORCPT ); Fri, 23 Jul 2021 08:30:38 -0400 Received: from dggemv711-chm.china.huawei.com (unknown [172.30.72.55]) by szxga01-in.huawei.com (SkyGuard) with ESMTP id 4GWV3j0KgDzZrdv; Fri, 23 Jul 2021 21:07:45 +0800 (CST) Received: from dggema766-chm.china.huawei.com (10.1.198.208) by dggemv711-chm.china.huawei.com (10.1.198.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256) id 15.1.2176.2; Fri, 23 Jul 2021 21:11:09 +0800 Received: from [10.174.177.210] (10.174.177.210) by dggema766-chm.china.huawei.com (10.1.198.208) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256_P256) id 15.1.2176.2; Fri, 23 Jul 2021 21:11:08 +0800 Subject: Re: [PATCH] ext4: flush s_error_work before journal destroy in ext4_fill_super To: Theodore Ts'o CC: , , References: <20210720062409.960734-1-yangerkun@huawei.com> From: yangerkun Message-ID: Date: Fri, 23 Jul 2021 21:11:08 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.10.2 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8"; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: [10.174.177.210] X-ClientProxiedBy: dggems702-chm.china.huawei.com (10.3.19.179) To dggema766-chm.china.huawei.com (10.1.198.208) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org 在 2021/7/23 20:11, Theodore Ts'o 写道: > On Tue, Jul 20, 2021 at 02:24:09PM +0800, yangerkun wrote: >> 'commit c92dc856848f ("ext4: defer saving error info from atomic >> context")' and '2d01ddc86606 ("ext4: save error info to sb through journal >> if available")' add s_error_work to fix checksum error problem. But the >> error path in ext4_fill_super can lead the follow BUG_ON. > > Can you share with me your test case? Your patch will result in the > shrinker potentially not getting released in some error paths (which > will cause other kernel panics), and in any case, once the journal is > destroyed here: > >> @@ -5173,15 +5173,15 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) >> >> ext4_xattr_destroy_cache(sbi->s_ea_block_cache); >> sbi->s_ea_block_cache = NULL; >> +failed_mount3a: >> + ext4_es_unregister_shrinker(sbi); >> +failed_mount3: >> + flush_work(&sbi->s_error_work); >> >> if (sbi->s_journal) { >> jbd2_journal_destroy(sbi->s_journal); >> sbi->s_journal = NULL; >> } >> -failed_mount3a: >> - ext4_es_unregister_shrinker(sbi); >> -failed_mount3: >> - flush_work(&sbi->s_error_work); > > sbi->s_journal is set to NULL, which means that in > flush_stashed_error_work(), journal will be NULL, which means we won't > call start_this_handle and so this change will not make a difference > given the kernel stack trace in the commit description. For example, before wo goto failed_mount_wq, we may meet some error and will goto ext4_handle_error which can call schedule_work(&EXT4_SB(sb)->s_error_work). So the work may start concurrent with ext4_fill_super goto failed_mount_wq. There does not have any lock to protect the concurrent read and modifies for sbi->s_journal. Injection fault some delay between jbd2_journal_destory and sbi->s_journal and We can injection fault while we do mount and add some delay like follow. We will get some panic report easily... failed_mount_wq: ext4_xattr_destroy_cache(sbi->s_ea_inode_cache); sbi->s_ea_inode_cache = NULL; ext4_xattr_destroy_cache(sbi->s_ea_block_cache); sbi->s_ea_block_cache = NULL; if (sbi->s_journal) { jbd2_journal_destroy(sbi->s_journal); msleep(300); <---- add some delay sbi->s_journal = NULL; } So we need to make sure to work will exists while we destroy the journal. It maybe better to fix it by move the flush_work before destory journal. > > Thanks, > > - Ted > . >