Received: by 2002:a05:6a10:1d13:0:0:0:0 with SMTP id pp19csp1585663pxb; Sun, 22 Aug 2021 22:47:29 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyV3JgKYngkPmdLopheKyE9aVIT4qxCmzArAi+12IayK/VLJSGja+thcPVOY7hocvKFHr+k X-Received: by 2002:a17:906:d0d5:: with SMTP id bq21mr33591057ejb.470.1629697648754; Sun, 22 Aug 2021 22:47:28 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1629697648; cv=none; d=google.com; s=arc-20160816; b=PB36pkB7FKSQ/Ld3u+Tn8/l40CFeM/GyHA6MR8EzTYpm8/qHKB7bP09VkSE8Tkh9gD PNEtwD6nhry9E6zVH+zylU5T+z2KOelzcLo/4dg3lSqBRiT4M12pHVAeDHTwoRwUFF4o CpS9774feayji7qulD4MQ6UDyluUClgezQynOYfZXrc6NQm/cAI4L/uffukE1DuIfBM3 sCS8mFa2biR0N7eIbokbXQam9qyesGYcf8thdw4EAVp7bp18NBwSEM3N+DguqYt6UjXV S6e7f66syZyk2//AqTIuBn+55H2xG6uKkGVCAfRSGAYkCCRJymABVEby7O/h6pjPEx1r VRsQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:content-language :in-reply-to:mime-version:user-agent:date:message-id:from:references :cc:to:subject; bh=aX6HRCO/KTtYm1P6szxyQBQbvKd2g6GntStE1YGjmJU=; b=e0XLGYZ+hekRv07cs4Zz/T8I9BKrPQbDN1+Aw2TPRPMBCTkQ94DBoFaou0gxy+cUbX FIqrXjSDHHRwrNMeCQXLLN7OILii8DFJ1J9cPJZvooLne7jxow6aLwgeLca1+O039JIT GTl1rxF0MAe0xNyRuyt4WeR+n01c0bMFWVdpuPj2ue6AdEP00zrl/NCUBr3KighanSQ1 sXOjphY8L0GJSKJlQLXdOLvhJb0hMb8rBAcsAHle692ggKN/z0sqNWU0Iy9kKZZP4Vxv BkVCIA8YYCco++eChw5X0lrgnPuRE/FMd8uEKaGfKNJMysALkBMsiPEN7Igq7q0YcdY5 A0Kg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id t9si13517973ejj.606.2021.08.22.22.46.51; Sun, 22 Aug 2021 22:47:28 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233781AbhHWFrP (ORCPT + 99 others); Mon, 23 Aug 2021 01:47:15 -0400 Received: from out30-133.freemail.mail.aliyun.com ([115.124.30.133]:42507 "EHLO out30-133.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231267AbhHWFrP (ORCPT ); Mon, 23 Aug 2021 01:47:15 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R101e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=alimailimapcm10staff010182156082;MF=jefflexu@linux.alibaba.com;NM=1;PH=DS;RN=5;SR=0;TI=SMTPD_---0UlD8Sr1_1629697591; Received: from admindeMacBook-Pro-2.local(mailfrom:jefflexu@linux.alibaba.com fp:SMTPD_---0UlD8Sr1_1629697591) by smtp.aliyun-inc.com(127.0.0.1); Mon, 23 Aug 2021 13:46:31 +0800 Subject: Re: [PATCH] ext4: fix reserved space counter leakage To: Eric Whitney , Joseph Qi Cc: tytso@mit.edu, adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org References: <20210819091351.19297-1-jefflexu@linux.alibaba.com> <20210820164556.GA30851@localhost.localdomain> <20210822215214.GA12669@localhost.localdomain> From: JeffleXu Message-ID: <24adeb24-feaa-c32e-a4d2-143ba1f7f9c1@linux.alibaba.com> Date: Mon, 23 Aug 2021 13:46:31 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.12.0 MIME-Version: 1.0 In-Reply-To: <20210822215214.GA12669@localhost.localdomain> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On 8/23/21 5:52 AM, Eric Whitney wrote: > * Joseph Qi : >> >> >> On 8/22/21 9:06 PM, Joseph Qi wrote: >>> >>> >>> On 8/21/21 12:45 AM, Eric Whitney wrote: >>>> * Jeffle Xu : >>>>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM, >>>>> previously reserved space is not released as the error handling, >>>>> in which case @s_dirtyclusters_counter is left over. Since this delayed >>>>> extent failes to be inserted into extent status tree, when inode is >>>>> written back, the extra @s_dirtyclusters_counter won't be subtracted and >>>>> remains there forever. >>>>> >>>>> This can leads to /sys/fs/ext4//delayed_allocation_blocks remains >>>>> non-zero even when syncfs is executed on the filesystem. >>>>> >>>> >>>> Hi: >>>> >>>> I think the fix below looks fine. However, this comment doesn't look right >>>> to me. Are you really seeing delayed_allocation_blocks values that remain >>>> incorrectly elevated across last closes (or across file system unmounts and >>>> remounts)? s_dirtyclusters_counter isn't written out to stable storage - >>>> it's an in-memory only variable that's created when a file is first opened >>>> and destroyed on last close. >>>> >>> >>> Actually we've encountered a real case in our production environment, >>> which has about 20G space lost (df - du = ~20G). >>> After some investigation, we've confirmed that it cause by leaked >>> s_dirtyclusters_counter (~5M), and even we do manually sync, it remains. >>> Since there is no error messages, we've checked all logic around >>> s_dirtyclusters_counter and found this. Also we can manually inject >>> error and reproduce the leaked s_dirtyclusters_counter. >>> > > Sure - as I noted, the fix looks good - I agree that you could see inaccurate > s_dirtyclusters_counter (and i_reserved_data_blocks) values. This is a good > catch and a good fix. It's the comment I find misleading / inaccurate, and > I'd like to see that improved for the sake of developers reading commit > histories in the future. > > Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict > path sounds good to me - I'd considered doing that in the past but never > actually did it. Thanks, it will be added in v2. > >> >> BTW, it's a runtime lost, but not about on-disk. >> If umount and then mount it again, it becomes normal. But >> application also should be restarted... > > And this is where the comment could use a little help. "when inode is > written back, the extra @s_dirtyclusters_counter won't be subtracted and > remains there forever" suggests to me that s_dirtyclusters_counter is > being persisted on stable storage. But as you note, simply umounting and > remounting the filesystem clears up the problem. (And in my rush to get > my feedback out earlier I incorrectly stated that s_dirtyclusters_counter > would get created and destroyed on first open and last close - that's > i_reserved_data_blocks, of course.) > > So, in order to speed things along, please allow me to suggest some edits > for the commit comment: > > When ext4_insert_delayed block receives and recovers from an error from > ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the > space it has reserved for that block insertion as it should. One effect > of this bug is that s_dirtyclusters_counter is not decremented and remains > incorrectly elevated until the file system has been unmounted. This can > result in premature ENOSPC returns and apparent loss of free space. > > Another effect of this bug is that /sys/fs/ext4//delayed_allocation_blocks > can remain non-zero even after syncfs has been executed on the filesystem. > > Does that sound right? Thanks, I will update the commit log in v2. -- Thanks, Jeffle