Date: Fri, 18 Feb 2022 11:18:14 +0800
From: Gao Xiang <hsiangkao@linux.alibaba.com>
To: "Theodore Y. Ts'o"
Cc: Shyam Prasad N, David Howells, Steve French, linux-ext4@vger.kernel.org,
    Jeffle Xu, bo.liu@linux.alibaba.com, tao.peng@linux.alibaba.com
Subject: Re: Regarding ext4 extent allocation strategy

Hi Ted and David,

On Tue, Jul 13, 2021 at 07:39:16AM -0400, Theodore Y. Ts'o wrote:
> On Tue, Jul 13, 2021 at 12:22:14PM +0530, Shyam Prasad N wrote:
> >
> > Our team in Microsoft, which works on the Linux SMB3 client kernel
> > filesystem, has recently been exploring the use of fscache on top of
> > ext4 for caching the network filesystem data for some customer
> > workloads.
> >
> > However, the maintainer of fscache (David Howells) recently warned us
> > that a few other extent-based filesystem developers pointed out a
> > theoretical bug in the current implementation of fscache/cachefiles.
> > It currently does not maintain separate metadata for the cached data
> > it holds, but instead uses the sparseness of the underlying filesystem
> > to track the ranges of the data being cached.
> > The bug that has been pointed out with this is that the underlying
> > filesystems could bridge holes between data ranges with zeroes, or
> > punch holes in data ranges that contain zeroes. (@David please add if I
> > missed something.)
> >
> > David has already begun working on the fix to this by maintaining the
> > metadata of the cached ranges in fscache itself.
> > However, since it could take some time for this fix to be approved and
> > then backported by various distros, I'd like to understand if there is
> > a potential problem in using fscache on top of ext4 without the fix.
> > If ext4 doesn't do any such optimizations on the data ranges, or has a
> > way to disable such optimizations, I think we'll be okay to use the
> > older versions of fscache even without the fix mentioned above.
>
> Yes, the tuning knob you are looking for is:
>
> What:        /sys/fs/ext4/<disk>/extent_max_zeroout_kb
> Date:        August 2012
> Contact:     "Theodore Ts'o"
> Description:
>              The maximum number of kilobytes which will be zeroed
>              out in preference to creating a new uninitialized
>              extent when manipulating an inode's extent tree.  Note
>              that using a larger value will increase the
>              variability of time necessary to complete a random
>              write operation (since a 4k random write might turn
>              into a much larger write due to the zeroout
>              operation).
>
> (From Documentation/ABI/testing/sysfs-fs-ext4)
>
> The basic idea here is that with a random workload, with HDDs, the
> cost of writing a 16k random write is not much more than the time to
> write a 4k random write; that is, the cost of HDD seeks dominates.
> There is also a cost in having many additional entries in the extent
> tree.  So if we have a fallocated region, e.g.:
>
>     +-------------+---+---+---+----------+---+---+---------+
> ... + Uninit (U)  | W | U | W |  Uninit  | W | U | Written | ...
>     +-------------+---+---+---+----------+---+---+---------+
>
> It's more efficient to have the extent tree look like this:
>
>     +-------------+-----------+----------+---+---+---------+
> ... + Uninit (U)  |  Written  |  Uninit  | W | U | Written | ...
>     +-------------+-----------+----------+---+---+---------+
>
> And just simply write zeros to the first "U" in the above figure.
>
> The default value of extent_max_zeroout_kb is 32k.  This optimization
> can be disabled by setting extent_max_zeroout_kb to 0.  The downside
> of this is a potential degradation of a random write workload (using
> for example the fio benchmark program) on that file system.
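
(As an aside, for anyone who wants to try the knob described above: a
minimal userspace sketch follows. It assumes the cache filesystem sits
on /dev/sdb, so the knob path is /sys/fs/ext4/sdb/extent_max_zeroout_kb,
and it must run as root; adjust the device name for your setup. The same
effect can of course be had with a plain "echo 0" into that file.)

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Assumed device: the ext4 fs backing the cache is /dev/sdb. */
        const char *knob = "/sys/fs/ext4/sdb/extent_max_zeroout_kb";
        int fd = open(knob, O_WRONLY);

        if (fd < 0) {
                perror(knob);
                return 1;
        }
        /* Writing "0" disables the zeroout optimization entirely. */
        if (write(fd, "0", 1) != 1) {
                perror("write");
                close(fd);
                return 1;
        }
        close(fd);
        return 0;
}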
As far as I understand what cachefiles does, it just truncates a sparse
file with a big hole and then does direct I/O _only_, all the time, to
fill the holes. But the description above is all about (un)written
extents, which already have physical blocks allocated, just without any
data initialization, so ext4 can zero out the middle extent and merge
these extents into one bigger written extent. However, IMO, that is not
what the current cachefiles behavior is... I think it would be rare for
a local fs to allocate blocks across real holes on a direct I/O write,
zero them out and merge the extents, since at the very least that
touches disk quota.

David pointed me to this message yesterday since we're building an
on-demand read feature on top of cachefiles as well. But I still fail
to understand why the current cachefiles behavior is wrong. Could you
kindly leave more hints about this? Many thanks!

Thanks,
Gao Xiang

> Cheers,
>
>                                         - Ted
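
(To make the cachefiles side of this concrete, below is a rough sketch
of the kind of sparse-file presence check being discussed; it is not
actual cachefiles code, and the object path is made up for illustration.
The point is that a hole means "not cached": if the backing filesystem
ever reported data in a range that cachefiles itself never wrote, a
check like this would wrongly treat that range as cached.)

#define _GNU_SOURCE        /* for SEEK_DATA */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Returns 1 if any byte in [start, start + len) is backed by data. */
static int range_has_data(int fd, off_t start, off_t len)
{
        off_t data = lseek(fd, start, SEEK_DATA);

        if (data < 0)                 /* ENXIO: nothing but hole up to EOF */
                return 0;
        return data < start + len;    /* first data offset falls inside range */
}

int main(void)
{
        /* Hypothetical backing file path, for illustration only. */
        int fd = open("/var/cache/fscache/example-object", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        printf("first 1MiB cached? %s\n",
               range_has_data(fd, 0, 1 << 20) ? "yes" : "no");
        close(fd);
        return 0;
}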