Received: by 2002:a05:6a10:1d13:0:0:0:0 with SMTP id pp19csp1475116pxb; Sun, 22 Aug 2021 18:43:18 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzmC9Qy10areLSYeZKu51f2ud8ARVPe/GtKZI5ILHSADvBQKawOpZIzu+6tNigtPSJCZ76v X-Received: by 2002:a05:6402:70c:: with SMTP id w12mr33833937edx.288.1629682998695; Sun, 22 Aug 2021 18:43:18 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1629682998; cv=none; d=google.com; s=arc-20160816; b=Xe5B3u4YZPp8lqmWY6YGx+FsBqkzhFpUeSMXKsnqQEHR5r4EYMVz9NmnMfghNqEHEa jrfjouKmm5UITyxSdEJTgK2oon8Po5eNykVh7IYadYDX/lu7N1nDYobTDm3eNJrlvSg4 JWoI8yCcOguvs7CxcU2gS7UGeXlmBjjpvm+4JTaThx5/e97pkd2L0PpskJ5rEDdVemk0 4G4sa8rPeYZCkvQdgM3RG7M7EQC/zL9YY+D5CbSUwzAJB0MQfpRnglYhv156cAd4V+LZ O+Gr/+VtYMnGKYLyZdUoQXWYcgJqz62qSCbtCJwEf47YBIl23v304s2dUkN91my2k26Y 6uxg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:content-language :in-reply-to:mime-version:user-agent:date:message-id:from:references :cc:to:subject; bh=dZmQTkxfn+HOcJz4dzYIgRk45PzsNgIdw3AhOdLp0PU=; b=0cwKeWA+9fy5joUsWqGnVY2PaCWzvqMUooTW1gWWBbQ/H6iKfTu5iL5L8Ysb8ihR7K 6bvNuHG4Jo1MY11O1CMqthZHXb2Qt0T/rWEm8/sAVACxRiQzaBgXTvPYzO9CfpohjlBF D5xb+3wzVv4eDWBh+P6JY2J0ozB2wyq5Gn/4qrp93uk+oa3PNOsKnpXycdC1Iw6hMXZi 2CJS+O5OTAU4OWUmI0wexlLgD0/Pe2nh77xn4ClhGzY3mY0F7JXbfViI3H/vszwOfIbM mA2wUEu64gW5bCGtTSlmHk/Q52OAUz2lP9PE054ACxHZ8yvd0ZEn191Wgbhv6wRmY84A 5sfQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id nc37si7076006ejc.528.2021.08.22.18.42.46; Sun, 22 Aug 2021 18:43:18 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-ext4-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234566AbhHWBnC (ORCPT + 99 others); Sun, 22 Aug 2021 21:43:02 -0400 Received: from out30-57.freemail.mail.aliyun.com ([115.124.30.57]:39486 "EHLO out30-57.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234497AbhHWBnB (ORCPT ); Sun, 22 Aug 2021 21:43:01 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R421e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01424;MF=joseph.qi@linux.alibaba.com;NM=1;PH=DS;RN=5;SR=0;TI=SMTPD_---0Ul7OwVr_1629682937; Received: from B-D1K7ML85-0059.local(mailfrom:joseph.qi@linux.alibaba.com fp:SMTPD_---0Ul7OwVr_1629682937) by smtp.aliyun-inc.com(127.0.0.1); Mon, 23 Aug 2021 09:42:17 +0800 Subject: Re: [PATCH] ext4: fix reserved space counter leakage To: Eric Whitney Cc: Jeffle Xu , tytso@mit.edu, adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org References: <20210819091351.19297-1-jefflexu@linux.alibaba.com> <20210820164556.GA30851@localhost.localdomain> <20210822215214.GA12669@localhost.localdomain> From: Joseph Qi Message-ID: <9cc97eac-ce0b-4492-08f7-32b3884e24d7@linux.alibaba.com> Date: Mon, 23 Aug 2021 09:42:17 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 MIME-Version: 1.0 In-Reply-To: <20210822215214.GA12669@localhost.localdomain> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On 8/23/21 5:52 AM, Eric Whitney wrote: > * Joseph Qi : >> >> >> On 8/22/21 9:06 PM, Joseph Qi wrote: >>> >>> >>> On 8/21/21 12:45 AM, Eric Whitney wrote: >>>> * Jeffle Xu : >>>>> When ext4_es_insert_delayed_block() returns error, e.g., ENOMEM, >>>>> previously reserved space is not released as the error handling, >>>>> in which case @s_dirtyclusters_counter is left over. Since this delayed >>>>> extent failes to be inserted into extent status tree, when inode is >>>>> written back, the extra @s_dirtyclusters_counter won't be subtracted and >>>>> remains there forever. >>>>> >>>>> This can leads to /sys/fs/ext4//delayed_allocation_blocks remains >>>>> non-zero even when syncfs is executed on the filesystem. >>>>> >>>> >>>> Hi: >>>> >>>> I think the fix below looks fine. However, this comment doesn't look right >>>> to me. Are you really seeing delayed_allocation_blocks values that remain >>>> incorrectly elevated across last closes (or across file system unmounts and >>>> remounts)? s_dirtyclusters_counter isn't written out to stable storage - >>>> it's an in-memory only variable that's created when a file is first opened >>>> and destroyed on last close. >>>> >>> >>> Actually we've encountered a real case in our production environment, >>> which has about 20G space lost (df - du = ~20G). >>> After some investigation, we've confirmed that it cause by leaked >>> s_dirtyclusters_counter (~5M), and even we do manually sync, it remains. >>> Since there is no error messages, we've checked all logic around >>> s_dirtyclusters_counter and found this. Also we can manually inject >>> error and reproduce the leaked s_dirtyclusters_counter. >>> > > Sure - as I noted, the fix looks good - I agree that you could see inaccurate > s_dirtyclusters_counter (and i_reserved_data_blocks) values. This is a good > catch and a good fix. It's the comment I find misleading / inaccurate, and > I'd like to see that improved for the sake of developers reading commit > histories in the future. > > Also, Gao Xiang's idea of checking i_reserved_data_blocks in the inode evict > path sounds good to me - I'd considered doing that in the past but never > actually did it. > >> >> BTW, it's a runtime lost, but not about on-disk. >> If umount and then mount it again, it becomes normal. But >> application also should be restarted... > > And this is where the comment could use a little help. "when inode is > written back, the extra @s_dirtyclusters_counter won't be subtracted and > remains there forever" suggests to me that s_dirtyclusters_counter is > being persisted on stable storage. But as you note, simply umounting and > remounting the filesystem clears up the problem. (And in my rush to get > my feedback out earlier I incorrectly stated that s_dirtyclusters_counter > would get created and destroyed on first open and last close - that's > i_reserved_data_blocks, of course.) > > So, in order to speed things along, please allow me to suggest some edits > for the commit comment: > > When ext4_insert_delayed block receives and recovers from an error from > ext4_es_insert_delayed_block(), e.g., ENOMEM, it does not release the > space it has reserved for that block insertion as it should. One effect > of this bug is that s_dirtyclusters_counter is not decremented and remains > incorrectly elevated until the file system has been unmounted. This can > result in premature ENOSPC returns and apparent loss of free space. > > Another effect of this bug is that /sys/fs/ext4//delayed_allocation_blocks > can remain non-zero even after syncfs has been executed on the filesystem. > > Does that sound right? > Yes, looks better. Thanks for your comments. We'll update the commit log in v2. Thanks, Joseph