MIME-Version: 1.0
References: <CAGQ4T_Jne-bxdP9rMNBzqXw16a4kD4FM=F5VuGgUbczj5WgCLA@mail.gmail.com>
 <Yqz8a0ggTjIU3h7T@mit.edu> <CAGQ4T_J-43q5xszJK8yDTUt14NGjjQACK4Z1RST-ZQkju3xSzQ@mail.gmail.com>
In-Reply-To: <CAGQ4T_J-43q5xszJK8yDTUt14NGjjQACK4Z1RST-ZQkju3xSzQ@mail.gmail.com>
From:   Santosh S <santosh.letterz@gmail.com>
Date:   Fri, 17 Jun 2022 20:41:07 -0400
Message-ID: <CAGQ4T_J0+B=QzVLmNY0MBEWhuCno6=6DLVk1q-oZu_K52mt=4A@mail.gmail.com>
Subject: Re: Overwrite faster than fallocate
To:     "Theodore Ts'o" <tytso@mit.edu>
Cc:     linux-ext4@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Precedence: bulk

On Fri, Jun 17, 2022 at 7:56 PM Santosh S <santosh.letterz@gmail.com> wrote:
>
>  On Fri, Jun 17, 2022 at 6:13 PM Theodore Ts'o <tytso@mit.edu> wrote:
> >
> > On Fri, Jun 17, 2022 at 12:38:20PM -0400, Santosh S wrote:
> > > Dear ext4 developers,
> > >
> > > This is my test - preallocate a large file (2G) and then do sequential
> > > 4K direct-io writes to that file, with fdatasync after every write.
> > > I am preallocating using fallocate mode 0. I noticed that if the 2G
> > > file is pre-written rather than fallocate'd I get more than twice the
> > > throughput. I could reproduce this with fio. The storage is nvme.
> > > Kernel version is 5.3.18 on Suse.
> > >
> > > Am I doing something wrong or is this difference expected? Any
> > > suggestion to get a better throughput without actually pre-writing the
> > > file.
> >
> > This is, alas, expected.  The reason for this is because when you use
> > fallocate, the extent is marked as uninitialized, so that when you
> > read from those newly allocated blocks, you don't see previously
> > written data belonging to deleted files.  These files could contain
> > someone else's e-mail, or medical information, etc.  So if we didn't
> > do this, it would be a walking, talking HIPAA or PCI violation.
> >
> > So when you write to an fallocated region, and then call fdatasync(2),
> > we need to update the metadata blocks to clear the uninitialized bit
> > so that when you read from the file after a crash, you actually get
> > the data that was written.  So the fdatasync(2) operation is quite the
> > heavyweight operation, since it requires a journal commit because of the
> > required metadata update.  When you do an overwrite, there is no need
> > to force a metadata update and journal update, which is why write(2)
> > plus fdatasync(2) is much lighter weight when you do an overwrite.
> >
> > What enterprise databases (e.g., Oracle Enterprise Database and IBM's
> > Informix DB) tend to do is to fallocate a chunk of space (say,
> > 16MB or 32MB), because for legacy Unix OSes, this tends to enable some
> > file systems' block allocators to be more likely to allocate a
> > contiguous block range, and then immediately write zeros on that 16 or
> > 32MB, plus a fdatasync(2).  This fdatasync(2) would update the extent
> > tree once to mark that 16MB or 32MB of the database's tablespace file
> > as initialized, so you only pay the metadata update once,
> > instead of every few dozen kilobytes as you write each database commit
> > into the tablespace file.
> >
> > There is also an old, out of tree patch which enables an fallocate
> > mode called "no hide stale", which marks the extent tree blocks which
> > are allocated using fallocate(2) as initialized.  This substantially
> > speeds things up, but it is potentially a walking, talking, HIPAA or
> > PCI violation in that revealing previously written data is considered
> > a horrible security violation by most file system developers.
> >
> > If you know, say, that a cluster file system is the only user of the
> > file system, and all data is written encrypted at rest using a
> > per-user key, such that exposing stale data is not a security
> > disaster, the "no hide stale" flag could be "safe" in that highly
> > specialized use case.
> >
> > But that assumes that file system authors can trust application
> > writers not to do something stupid and insecure, and historically,
> > file system authors (possibly with good reason, given bitter past
> > experience) don't trust application writers not to do something which is
> > very easy, and gooses performance, even if it has terrible side
> > effects on either data robustness or data security.
> >
> > Effectively, the no hide stale flag could be considered an "Attractive
> > Nuisance"[1] and so support for this feature has never been accepted
> > into the mainline kernel, nor into any distro kernels, since the
> > distribution companies don't want to be held liable for making an
> > "attractive nuisance" that might enable application authors to shoot
> > themselves in the foot.
> >
> > [1] https://en.wikipedia.org/wiki/Attractive_nuisance_doctrine
> >
> > In any case, the technique of fallocate(2) plus zero-fill-write plus
> > fdatasync(2) isn't *that* slow, and is only needed when you are first
> > extending the tablespace file.  In the steady state, most database
> > applications tend to be overwriting space, so this isn't an issue.
> >
> > In any case, if you need to get that last 5% or so of performance ---
> > say, if you are an enterprise database company interested in
> > taking a full page advertisement on the back cover of Business Week
> > Magazine touting how your enterprise database benchmarks are better
> > than the competition --- the simple solution is to use a raw block
> > device.  Of course, most end users want the convenience of the file
> > system, but that's not the point if you are engaging in
> > benchmarketing.   :-)
> >
> > Cheers,
> >
> >                                                 - Ted
>
> Thank you for a comprehensive answer :-)
>
> I have one more question - when I gradually increase the i/o transfer
> size the performance degradation begins to lessen and at 32K it is
> similar to the "overwriting the file" case. I assume this is because
> the metadata update is now spread over 32K of data rather than 4K.
> However, my understanding is that, in my case, an extent should
> represent the max 128MiB of data and so the clearing of the
> uninitialized bit for an extent should happen once every 128MiB, so
> then why is a higher transfer size making a difference?
>

I think I understand. The metadata update is not just clearing the
uninitialized bit; it must also update the high water mark that records
the length of the initialized part of the extent.
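
As a rough back-of-the-envelope sketch of the amortization discussed
above (Python; the helper name sync_count is hypothetical, and it
assumes, per Ted's explanation, that every write plus fdatasync(2) into
a fallocated region forces one metadata update), this counts how many
syncs it takes to fill the 2G test file at different transfer sizes:

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3


def sync_count(file_size, io_size):
    """Number of write+fdatasync pairs needed to fill file_size sequentially.

    Assumption: one fdatasync follows every write, as in the test described.
    """
    return -(-file_size // io_size)  # ceiling division


if __name__ == "__main__":
    # 4K and 32K are the transfer sizes from the test; 128 MiB is the
    # maximum extent size mentioned above, shown only for comparison.
    for io_size in (4 * KIB, 32 * KIB, 128 * MIB):
        count = sync_count(2 * GIB, io_size)
        print(f"{io_size // KIB} KiB per write: {count} fdatasync calls")
```

If each of those syncs carries a metadata update, going from 4K to 32K
writes cuts the number of updates by a factor of eight, which would be
consistent with the gap closing as the transfer size grows.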

> Santosh
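
For reference, a minimal sketch of the chunked "fallocate, zero-fill,
then one fdatasync" technique Ted describes for databases. This is an
illustration under assumptions, not how any particular database does
it: the function name extend_zeroed and the 1 MiB write block are made
up for the example, and os.posix_fallocate is used as a stand-in for
fallocate(2) with mode 0.

```python
import os

CHUNK = 16 * 1024 * 1024  # 16MB, one of the chunk sizes mentioned above


def extend_zeroed(fd, offset, length=CHUNK, block=1024 * 1024):
    """Preallocate [offset, offset + length), zero-fill it, fdatasync once.

    Returns the offset just past the initialized region.
    """
    # Reserve the space first so the allocator can try for a contiguous range.
    os.posix_fallocate(fd, offset, length)

    # Overwrite the whole reserved range with zeros.
    zeros = bytes(block)
    pos = offset
    end = offset + length
    while pos < end:
        n = min(block, end - pos)
        pos += os.pwrite(fd, zeros[:n], pos)  # pwrite may write fewer bytes

    # One sync for the whole chunk, so the metadata update is paid once
    # rather than after every small write.
    os.fdatasync(fd)
    return end
```

Later small writes into the returned range are then overwrites of
already-initialized blocks, which is the cheap case described in the
thread.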