Hi,
For things like database journals using fallocate(0) is not sufficient,
as writing into the pre-allocated data with O_DIRECT | O_DSYNC
writes requires the unwritten extents to be converted, which in turn
requires journal operations.
The performance difference in a journalling workload (lots of
sequential, low-iodepth, often small writes) is quite remarkable. Even
on quite fast devices:
andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
/dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
262144+0 records in
262144+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
The way around that, from a database's perspective, is obviously to just
overwrite the file "manually" after fallocate()ing it, utilizing larger
writes, and then to recycle the file.
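In C that workaround looks roughly like the sketch below (path, sizes
and chunk size are placeholders, error handling omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define JOURNAL_SIZE (1024L * 1024 * 1024)
#define CHUNK_SIZE   (8 * 1024 * 1024)

int prepare_journal(const char *path)
{
	int fd = open(path, O_CREAT | O_WRONLY, 0600);
	void *buf;

	/* reserve the space up front, so ENOSPC surfaces early */
	fallocate(fd, 0, 0, JOURNAL_SIZE);

	/* then force the extents into "written" state with large writes */
	posix_memalign(&buf, 4096, CHUNK_SIZE);
	memset(buf, 0, CHUNK_SIZE);
	for (off_t off = 0; off < JOURNAL_SIZE; off += CHUNK_SIZE)
		pwrite(fd, buf, CHUNK_SIZE, off);
	fsync(fd);

	free(buf);
	return fd;
}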
But that's a fair bit of unnecessary IO from userspace, and it's IO that
the kernel can do more efficiently on a number of types of block
devices, e.g. by utilizing write-zeroes.
Which brings me to $subject:
Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
doesn't convert extents into unwritten extents, but instead uses
blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
myself, but ...
Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
sense, as that'd work reasonably efficiently to initialize newly
allocated space as well as for zeroing out previously used file space.
As blkdev_issue_zeroout() already has a fallback path it seems this
should be doable without too much concern for which devices have write
zeroes, and which do not?
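To make the intended API concrete, a caller might look like this
(FALLOC_FL_ZERO_RANGE_BUT_REALLY is the joke name from the subject
line, not an existing constant):

/* hypothetical flag, for illustration only */
if (fallocate(fd, FALLOC_FL_ZERO_RANGE_BUT_REALLY, 0, file_size) != 0) {
	/* e.g. EOPNOTSUPP: fall back to writing zeroes from userspace */
}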
Greetings,
Andres Freund
On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> Hi,
>
> For things like database journals using fallocate(0) is not sufficient,
> as writing into the pre-allocated data with O_DIRECT | O_DSYNC
> writes requires the unwritten extents to be converted, which in turn
> requires journal operations.
>
> The performance difference in a journalling workload (lots of
> sequential, low-iodepth, often small writes) is quite remarkable. Even
> on quite fast devices:
>
> andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
> /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
>
> andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
>
> andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
>
>
> The way around that, from a database's perspective, is obviously to just
> overwrite the file "manually" after fallocate()ing it, utilizing larger
> writes, and then to recycle the file.
>
>
> But that's a fair bit of unnecessary IO from userspace, and it's IO that
> the kernel can do more efficiently on a number of types of block
> devices, e.g. by utilizing write-zeroes.
>
>
> Which brings me to $subject:
>
> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> doesn't convert extents into unwritten extents, but instead uses
> blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> myself, but ...
>
> Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
> sense, as that'd work reasonably efficiently to initialize newly
> allocated space as well as for zeroing out previously used file space.
>
>
> As blkdev_issue_zeroout() already has a fallback path it seems this
> should be doable without too much concern for which devices have write
> zeroes, and which do not?
Question: do you want the kernel to write zeroes even for devices that
don't support accelerated zeroing? Since I assume that if the fallocate
fails you'll fall back to writing zeroes from userspace anyway...
Second question: Would it help to have a FALLOC_FL_DRY_RUN flag that
could be used to probe if a file supports fallocate without actually
changing anything? I'm (separately) pursuing a fix for the loop device
not being able to figure out if a file actually supports a particular
fallocate mode.
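(Hypothetically, the probe would look something like this; the flag
doesn't exist today:)

if (fallocate(fd, FALLOC_FL_DRY_RUN | FALLOC_FL_ZERO_RANGE, 0, len) == 0) {
	/* mode supported; the real call should succeed */
} else if (errno == EOPNOTSUPP) {
	/* pick a different strategy before touching the file */
}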
--D
> Greetings,
>
> Andres Freund
Hi,
On 2021-01-04 10:19:58 -0800, Darrick J. Wong wrote:
> On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> > Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> > doesn't convert extents into unwritten extents, but instead uses
> > blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> > myself, but ...
> >
> > Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
> > sense, as that'd work reasonably efficiently to initialize newly
> > allocated space as well as for zeroing out previously used file space.
> >
> >
> > As blkdev_issue_zeroout() already has a fallback path it seems this
> > should be doable without too much concern for which devices have write
> > zeroes, and which do not?
>
> Question: do you want the kernel to write zeroes even for devices that
> don't support accelerated zeroing?
I don't have a strong opinion on it. A complex userland application can
do a bit better job managing queue depth etc, but otherwise I suspect
doing the IO from kernel will win by a small bit. And the queue-depth
issue presumably would be relevant for write-zeroes as well, making me
lean towards just using the fallback.
> Since I assume that if the fallocate fails you'll fall back to writing
> zeroes from userspace anyway...
And there are non-Linux platforms as well, at least that's the rumor I hear.
> Second question: Would it help to have a FALLOC_FL_DRY_RUN flag that
> could be used to probe if a file supports fallocate without actually
> changing anything? I'm (separately) pursuing a fix for the loop device
> not being able to figure out if a file actually supports a particular
> fallocate mode.
Hm. I can see some potential uses of such a flag, but I haven't really
wished for it so far.
Greetings,
Andres Freund
On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
>
> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> doesn't convert extents into unwritten extents, but instead uses
> blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> myself, but ...
One thing to note is that there are some devices which support a write
zeros operation, but where it is *less* performant than actually
writing zeros via DMA'ing zero pages. Yes, that's insane.
Unfortunately, there are insane devices out there....
This is not hypothetical; I know this because we tried using write
zeros in mke2fs, and I got regression complaints where
mke2fs/mkfs.ext4 got substantially slower for some devices.
That doesn't mean that your proposal shouldn't be adopted. But it
would be a good idea to have some kind of tuning knob to disable the
use of zeroout (whether in the block device, file system, or in
userspace), and/or some way to automatically figure out whether using
zeroout is actually a win, since most users aren't going to be up to
adjusting a manual tuning knob.
- Ted
On Mon, Jan 04, 2021 at 02:17:05PM -0500, Theodore Ts'o wrote:
> One thing to note is that there are some devices which support a write
> zeros operation, but where it is *less* performant than actually
> writing zeros via DMA'ing zero pages. Yes, that's insane.
> Unfortunately, there are insane devices out there....
We already have quirks to disable commands, in NVMe, SCSI and ATA.
This sounds like another quirk to throw on the bonfire ("Yes, this
device claims to support Write Zeroes, but don't use it"). Indeed,
NVMe already has precisely this quirk, NVME_QUIRK_DISABLE_WRITE_ZEROES.
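An entry in the driver's PCI ID table (drivers/nvme/host/pci.c) has
this shape; the IDs below are made up for illustration, not copied
from the real table:

static const struct pci_device_id nvme_id_table[] = {
	/* ... */
	{ PCI_DEVICE(0x1234, 0x5678),	/* hypothetical broken controller */
		.driver_data = NVME_QUIRK_DISABLE_WRITE_ZEROES, },
	/* ... */
};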
On 04/01/2021 21.10, Andres Freund wrote:
> Hi,
>
> On 2021-01-04 10:19:58 -0800, Darrick J. Wong wrote:
>> On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
>>> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
>>> doesn't convert extents into unwritten extents, but instead uses
>>> blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
>>> myself, but ...
>>>
>>> Doing so as a variant of FALLOC_FL_ZERO_RANGE seems to make the most
>>> sense, as that'd work reasonably efficiently to initialize newly
>>> allocated space as well as for zeroing out previously used file space.
>>>
>>>
>>> As blkdev_issue_zeroout() already has a fallback path it seems this
>>> should be doable without too much concern for which devices have write
>>> zeroes, and which do not?
>> Question: do you want the kernel to write zeroes even for devices that
>> don't support accelerated zeroing?
> I don't have a strong opinion on it. A complex userland application can
> do a bit better job managing queue depth etc, but otherwise I suspect
> doing the IO from kernel will win by a small bit. And the queue-depth
> issue presumably would be relevant for write-zeroes as well, making me
> lean towards just using the fallback.
>
The new flag will avoid requiring DMA to transfer the entire file size,
and perhaps can be implemented in the device by just adjusting metadata.
So there is potential for the new flag to be much more efficient.
But note it will need to be plumbed down to md and dm to be generally
useful.
Hi,
On 2021-01-04 14:17:05 -0500, Theodore Ts'o wrote:
> One thing to note is that there are some devices which support a write
> zeros operation, but where it is *less* performant than actually
> writing zeros via DMA'ing zero pages. Yes, that's insane.
> Unfortunately, there are insane devices out there....
That doesn't surprise me at all, unfortunately. I'm planning to send a
proposal to allow disabling a device's use of fua for similar reasons...
> That doesn't mean that your proposal shouldn't be adopted. But it
> would be a good idea to have some kind of tuning knob to disable the
> use of zeroout (whether in the block device, file system, or in
> userspace), and/or some way to automatically figure out whether using
> zeroout is actually a win, since most users aren't going to be up to
> adjusting a manual tuning knob.
A block device knob seems to make sense to me. There already is
/sys/block/*/queue/write_zeroes_max_bytes
and it seems like it could make sense to add a sibling entry to allow
tuning that? Presumably with quirks (as suggested by Matthew) to
choose a sensible default?
It's not quite analogous, but there's precedent in the
max_hw_sectors_kb/max_sectors_kb and discard_max_bytes /
discard_max_hw_bytes pairs, and it seems like something vaguely in
that direction could make sense?
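For what it's worth, the existing attribute is already usable as a
capability probe from userspace; 0 means the device doesn't support
write-zeroes at all. A sketch (device name taken from the session
upthread):

#include <stdio.h>

unsigned long long write_zeroes_max(void)
{
	unsigned long long v = 0;
	FILE *f = fopen("/sys/block/nvme1n1/queue/write_zeroes_max_bytes", "r");

	if (f) {
		fscanf(f, "%llu", &v);
		fclose(f);
	}
	return v;
}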
Greetings,
Andres Freund
On 1/4/21 1:17 PM, Theodore Ts'o wrote:
> On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
>>
>> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
>> doesn't convert extents into unwritten extents, but instead uses
>> blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
>> myself, but ...
>
> One thing to note is that there are some devices which support a write
> zeros operation, but where it is *less* performant than actually
> writing zeros via DMA'ing zero pages. Yes, that's insane.
> Unfortunately, there are insane devices out there....
>
> This is not hypothetical; I know this because we tried using write
> zeros in mke2fs, and I got regression complaints where
> mke2fs/mkfs.ext4 got substantially slower for some devices.
Was this "libext2fs: mkfs.ext3 really slow on centos 8.2" ?
If so, wasn't the problem that it went from a few very large IOs to a
multitude of per-block fallocate calls, a problem which was fixed by
this commit?
commit 86d6153417ddaccbe3d1f4466a374716006581f4 (HEAD)
Author: Theodore Ts'o <[email protected]>
Date:   Sat Apr 25 11:41:24 2020 -0400

    libext2fs: batch calls to ext2fs_zero_blocks2()

    When allocating blocks for an indirect block mapped file, accumulate
    blocks to be zero'ed and then call ext2fs_zero_blocks2() to zero them
    in large chunks instead of block by block.

    This significantly speeds up mkfs.ext3 since we don't send a large
    number of ZERO_RANGE requests to the kernel, and while the kernel does
    batch write requests, it is not batching ZERO_RANGE requests. It's
    more efficient to batch in userspace in any case, since it avoids
    unnecessary system calls.

    Reported-by: Mario Schuknecht <[email protected]>
    Signed-off-by: Theodore Ts'o <[email protected]>
or do I have the wrong report above?
I ask because mkfs.xfs is now also using FALLOC_FL_ZERO_RANGE
Thanks,
-Eric
On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> Hi,
>
> For things like database journals using fallocate(0) is not sufficient,
> as writing into the pre-allocated data with O_DIRECT | O_DSYNC
> writes requires the unwritten extents to be converted, which in turn
> requires journal operations.
>
> The performance difference in a journalling workload (lots of
> sequential, low-iodepth, often small writes) is quite remarkable. Even
> on quite fast devices:
>
> andres@awork3:/mnt/t3$ grep /mnt/t3 /proc/mounts
> /dev/nvme1n1 /mnt/t3 xfs rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota 0 0
>
> andres@awork3:/mnt/t3$ fallocate -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 117.587 s, 9.1 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.69125 s, 291 MB/s
>
> andres@awork3:/mnt/t3$ fallocate -z -l $((1024*1024*1024)) test_file
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 109.398 s, 9.8 MB/s
>
> andres@awork3:/mnt/t3$ dd if=/dev/zero of=test_file bs=4096 conv=notrunc iflag=count_bytes count=$((1024*1024*1024)) oflag=direct,dsync
> 262144+0 records in
> 262144+0 records out
> 1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.76166 s, 285 MB/s
>
>
> The way around that, from a database's perspective, is obviously to just
> overwrite the file "manually" after fallocate()ing it, utilizing larger
> writes, and then to recycle the file.
>
>
> But that's a fair bit of unnecessary IO from userspace, and it's IO that
> the kernel can do more efficiently on a number of types of block
> devices, e.g. by utilizing write-zeroes.
>
>
> Which brings me to $subject:
>
> Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> doesn't convert extents into unwritten extents, but instead uses
> blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> myself, but ...
We have explicit requests from users (think initialising large VM
images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
zeroes manually.
Because those users want us to guarantee that FALLOC_FL_ZERO_RANGE
is *always* going to be faster than writing a large range of zeroes.
They also want FALLOC_FL_ZERO_RANGE to fail if it can't zero the
range by metadata manipulation and would need to write zeros,
because then they can make the choice on how to initialise the
device (e.g. at runtime, via on-demand ZERO_RANGE calls, by writing
zeroes to pad partial blocks, etc). That bird has already flown,
so we can't really do that retrospectively, but we really don't want
to make life worse for these users.
IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
zeros, we have users who explicitly don't want it to do this.
Perhaps we should add FALLOC_FL_CONVERT_RANGE, which tells the
filesystem to convert an unwritten range of zeros to a written range
by manually writing zeros. i.e. you do FALLOC_FL_ZERO_RANGE to zero
the range and fill holes using metadata manipulation, followed by
FALLOC_FL_WRITE_RANGE to then convert the "metadata zeros" to real
written zeros.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Hi,
On 2021-01-07 09:52:01 +1100, Dave Chinner wrote:
> On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> > Which brings me to $subject:
> >
> > Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> > doesn't convert extents into unwritten extents, but instead uses
> > blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> > myself, but ...
>
> We have explicit requests from users (think initialising large VM
> images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
> zeroes manually.
That behaviour makes a lot of sense for quite a few use cases - I wasn't
trying to make it sound like it should not be available. Nor that
FALLOC_FL_ZERO_RANGE should behave differently.
> IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
> zeros, we have users who explicitly don't want it to do this.
Right - which is why I was asking for a variant of FALLOC_FL_ZERO_RANGE
(jokingly named FALLOC_FL_ZERO_RANGE_BUT_REALLY in the subject), rather
than changing the behaviour.
> Perhaps we should add FALLOC_FL_CONVERT_RANGE, which tells the
> filesystem to convert an unwritten range of zeros to a written range
> by manually writing zeros. i.e. you do FALLOC_FL_ZERO_RANGE to zero
> the range and fill holes using metadata manipulation, followed by
> FALLOC_FL_WRITE_RANGE to then convert the "metadata zeros" to real
> written zeros.
Yep, something like that would do the trick. Perhaps
FALLOC_FL_MATERIALIZE_RANGE?
Greetings,
Andres Freund
On Wed, Jan 06, 2021 at 03:40:09PM -0800, Andres Freund wrote:
> Hi,
>
> On 2021-01-07 09:52:01 +1100, Dave Chinner wrote:
> > On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> > > Which brings me to $subject:
> > >
> > > Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> > > doesn't convert extents into unwritten extents, but instead uses
> > > blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> > > myself, but ...
> >
> > We have explicit requests from users (think initialising large VM
> > images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
> > zeroes manually.
>
> That behaviour makes a lot of sense for quite a few use cases - I wasn't
> trying to make it sound like it should not be available. Nor that
> FALLOC_FL_ZERO_RANGE should behave differently.
>
>
> > IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
> > zeros, we have users who explicitly don't want it to do this.
>
> Right - which is why I was asking for a variant of FALLOC_FL_ZERO_RANGE
> (jokingly named FALLOC_FL_ZERO_RANGE_BUT_REALLY in the subject), rather
> than changing the behaviour.
>
>
> > Perhaps we should add FALLOC_FL_CONVERT_RANGE, which tells the
> > filesystem to convert an unwritten range of zeros to a written range
> > by manually writing zeros. i.e. you do FALLOC_FL_ZERO_RANGE to zero
> > the range and fill holes using metadata manipulation, followed by
> > FALLOC_FL_WRITE_RANGE to then convert the "metadata zeros" to real
> > written zeros.
>
> Yep, something like that would do the trick. Perhaps
> FALLOC_FL_MATERIALIZE_RANGE?
[ FWIW, I really dislike the "RANGE" part of fallocate flag names.
It's redundant (fallocate always operates on a range!) and just
makes names unnecessarily longer. ]
I used "convert range" as the name explicitly because it has
specific meaning for extent space manipulation. i.e. we "convert"
extents from one state to another. "write range" is also has
explicit meaning, in that it will convert extents from unwritten to
written data.
In comparison, "materialise" is something undefined, and could be
easily thought to take something ephemeral (such as a hole) and turn
it into something real (an allocated extent). We wouldn't want this
operation to allocate space, so I think "materialise" is just too
much magic to encoding into an API for an explicit, well defined
state change.
We also have people asking for ZERO_RANGE to just flip existing
extents from written to unwritten (rather than the punch/preallocate
we do now). This is also a "convert" operation, just in the other
direction (from data to zeros rather than from zeros to data).
The observation I'm making here is that these "convert" operations
will both make SEEK_HOLE/SEEK_DATA behave differently for the
underlying data. Preallocated space is considered a HOLE, written
zeros are considered DATA. So we do expose the ability to check that
a "convert" operation has actually changed the state of the
underlying extents in either direction...
CONVERT_TO_DATA/CONVERT_TO_ZERO as an operational pair whose
behaviour is visible and easily testable via SEEK_HOLE/SEEK_DATA
makes a lot more sense to me. Also defining them to fail fast if
unwritten extents are not supported by the filesystem (i.e. they
should -never- physically write anything) would also allow
applications to fall back to ZERO_RANGE on filesystems that don't
support unwritten extents to explicitly write zeros if
CONVERT_TO_ZERO fails....
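i.e. verifying a convert operation from userspace would be as simple
as something like this sketch:

#define _GNU_SOURCE		/* for SEEK_HOLE */
#include <unistd.h>

int range_is_written(int fd, off_t start, off_t len)
{
	off_t hole = lseek(fd, start, SEEK_HOLE);

	/* if the nearest hole starts at or past the end of the range,
	 * the whole range is in written (DATA) state */
	return hole >= start + len;
}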
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jan 04, 2021 at 09:57:48PM +0200, Avi Kivity wrote:
> > I don't have a strong opinion on it. A complex userland application can
> > do a bit better job managing queue depth etc, but otherwise I suspect
> > doing the IO from kernel will win by a small bit. And the queue-depth
> > issue presumably would be relevant for write-zeroes as well, making me
> > lean towards just using the fallback.
> >
>
> The new flag will avoid requiring DMA to transfer the entire file size, and
> perhaps can be implemented in the device by just adjusting metadata. So
> there is potential for the new flag to be much more efficient.
We already support a WRITE_ZEROES operation, which many (but not all)
NVMe devices and some SCSI devices support. The blkdev_issue_zeroout
helper can use those, or fall back to writing actual zeroes.
XFS already has a XFS_IOC_ALLOCSP64 that is defined to actually
allocate written extents. It does not currently use
blkdev_issue_zeroout, but could be changed pretty trivially to do so.
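Usage is roughly like this sketch (going from memory of the xfsctl(3)
man page, so treat the details as approximate); the file is grown to
l_start with written, zeroed blocks:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <xfs/xfs.h>	/* XFS_IOC_ALLOCSP64, xfs_flock64_t */

int alloc_written_to(int fd, off_t new_size)
{
	xfs_flock64_t fl;

	memset(&fl, 0, sizeof(fl));
	fl.l_whence = SEEK_SET;
	fl.l_start = new_size;	/* grow the file to this size */
	fl.l_len = 0;

	return ioctl(fd, XFS_IOC_ALLOCSP64, &fl);
}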
> But note it will need to be plumbed down to md and dm to be generally
> useful.
DM and MD already support mddev_check_write_zeroes, at least for the
usual targets.
On Jan 12, 2021, at 11:16 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Mon, Jan 04, 2021 at 09:57:48PM +0200, Avi Kivity wrote:
>>> I don't have a strong opinion on it. A complex userland application can
>>> do a bit better job managing queue depth etc, but otherwise I suspect
>>> doing the IO from kernel will win by a small bit. And the queue-depth
>>> issue presumably would be relevant for write-zeroes as well, making me
>>> lean towards just using the fallback.
>>>
>>
>> The new flag will avoid requiring DMA to transfer the entire file size, and
>> perhaps can be implemented in the device by just adjusting metadata. So
>> there is potential for the new flag to be much more efficient.
>
> We already support a WRITE_ZEROES operation, which many (but not all)
> NVMe devices and some SCSI devices support. The blkdev_issue_zeroout
> helper can use those, or fall back to writing actual zeroes.
>
> XFS already has a XFS_IOC_ALLOCSP64 that is defined to actually
> allocate written extents. It does not currently use
> blkdev_issue_zeroout, but could be changed pretty trivially to do so.
>
>> But note it will need to be plumbed down to md and dm to be generally
>> useful.
>
> DM and MD already support mddev_check_write_zeroes, at least for the
> usual targets.
Similarly, ext4 also has EXT4_GET_BLOCKS_CREATE_ZERO that can allocate zero
filled extents rather than unwritten extents (without clobbering existing
data like FALLOC_FL_ZERO_RANGE does), and just needs a flag from fallocate()
to trigger it. This is plumbed down to blkdev_issue_zeroout() as well.
Cheers, Andreas
On Tue, Jan 12, 2021 at 11:39:58AM -0700, Andreas Dilger wrote:
> > XFS already has a XFS_IOC_ALLOCSP64 that is defined to actually
> > allocate written extents. It does not currently use
> > blkdev_issue_zeroout, but could be changed pretty trivially to do so.
> >
> >> But note it will need to be plumbed down to md and dm to be generally
> >> useful.
> >
> > DM and MD already support mddev_check_write_zeroes, at least for the
> > usual targets.
>
> Similarly, ext4 also has EXT4_GET_BLOCKS_CREATE_ZERO that can allocate zero
> filled extents rather than unwritten extents (without clobbering existing
> data like FALLOC_FL_ZERO_RANGE does), and just needs a flag from fallocate()
> to trigger it. This is plumbed down to blkdev_issue_zeroout() as well.
XFS_IOC_ALLOCSP64 actually is an ioctl that has been around since 1995
on IRIX (as an fcntl).
On Jan 12, 2021, at 11:43 AM, Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Jan 12, 2021 at 11:39:58AM -0700, Andreas Dilger wrote:
>>> XFS already has a XFS_IOC_ALLOCSP64 that is defined to actually
>>> allocate written extents. It does not currently use
>>> blkdev_issue_zeroout, but could be changed pretty trivially to do so.
>>>
>>>> But note it will need to be plumbed down to md and dm to be generally
>>>> useful.
>>>
>>> DM and MD already support mddev_check_write_zeroes, at least for the
>>> usual targets.
>>
>> Similarly, ext4 also has EXT4_GET_BLOCKS_CREATE_ZERO that can allocate zero
>> filled extents rather than unwritten extents (without clobbering existing
>> data like FALLOC_FL_ZERO_RANGE does), and just needs a flag from fallocate()
>> to trigger it. This is plumbed down to blkdev_issue_zeroout() as well.
>
> XFS_IOC_ALLOCSP64 actually is an ioctl that has been around since 1995
> on IRIX (as an fcntl).
I'm not against adding XFS_IOC_ALLOCSP64 to ext4, if applications are actually
using that.
It also makes sense to me that there be a fallocate() mode for allocating
zeroed blocks (which was the original request), since fallocate() is already
doing very similar things and is the central interface for managing block
allocation instead of having a filesystem-specific ioctl() to do this.
Cheers, Andreas
On Tue, Jan 12, 2021 at 11:51:07AM -0700, Andreas Dilger wrote:
> On Jan 12, 2021, at 11:43 AM, Christoph Hellwig <[email protected]> wrote:
> >
> > On Tue, Jan 12, 2021 at 11:39:58AM -0700, Andreas Dilger wrote:
> >>> XFS already has a XFS_IOC_ALLOCSP64 that is defined to actually
> >>> allocate written extents. It does not currently use
> >>> blkdev_issue_zeroout, but could be changed pretty trivially to do so.
> >>>
> >>>> But note it will need to be plumbed down to md and dm to be generally
> >>>> useful.
> >>>
> >>> DM and MD already support mddev_check_write_zeroes, at least for the
> >>> usual targets.
> >>
> >> Similarly, ext4 also has EXT4_GET_BLOCKS_CREATE_ZERO that can allocate zero
> >> filled extents rather than unwritten extents (without clobbering existing
> >> data like FALLOC_FL_ZERO_RANGE does), and just needs a flag from fallocate()
> >> to trigger it. This is plumbed down to blkdev_issue_zeroout() as well.
> >
> > XFS_IOC_ALLOCSP64 actually is an ioctl that has been around since 1995
> > on IRIX (as an fcntl).
>
> I'm not against adding XFS_IOC_ALLOCSP64 to ext4, if applications are actually
> using that.
<shudder> Some of them are, but--
ALLOCSP64 can only allocate pre-zeroed blocks as part of extending EOF,
whereas a new FZERO flag means that we can pre-zero an arbitrary range
of bytes in a file. I don't know if Avi or Andres' use cases demand that
kind of flexibility, but I know I'd rather go for the more powerful
interface.
--D
> It also makes sense to me that there be a fallocate() mode for allocating
> zeroed blocks (which was the original request), since fallocate() is already
> doing very similar things and is the central interface for managing block
> allocation instead of having a filesystem-specific ioctl() to do this.
>
> Cheers, Andreas
Hi,
On 2021-01-12 13:14:45 -0800, Darrick J. Wong wrote:
> ALLOCSP64 can only allocate pre-zeroed blocks as part of extending EOF,
> whereas a new FZERO flag means that we can pre-zero an arbitrary range
> of bytes in a file. I don't know if Avi or Andres' use cases demand that
> kind of flexibility, but I know I'd rather go for the more powerful
> interface.
Postgres/I don't at the moment have a need to allocate "written" zeroed
space anywhere but EOF. I can see some potential uses for more flexible
pre-zeroing in the future though, but not very near term.
Greetings,
Andres Freund
On 1/12/21 11:36 PM, Andres Freund wrote:
> Hi,
>
> On 2021-01-12 13:14:45 -0800, Darrick J. Wong wrote:
>> ALLOCSP64 can only allocate pre-zeroed blocks as part of extending EOF,
>> whereas a new FZERO flag means that we can pre-zero an arbitrary range
>> of bytes in a file. I don't know if Avi or Andres' use cases demand that
>> kind of flexibility, but I know I'd rather go for the more powerful
>> interface.
> Postgres/I don't at the moment have a need to allocate "written" zeroed
> space anywhere but EOF. I can see some potential uses for more flexible
> pre-zeroing in the future though, but not very near term.
>
Same here.
I also agree that it's better not to have the kernel fall back
internally on writing zeros, letting userspace do that. The assumption
is that WRITE SAME will be O(1)-ish and so can bypass scheduling
decisions, but if we need to write zeros, better let the application
throttle the rate.
On Jan 13, 2021, at 12:44 AM, Avi Kivity <[email protected]> wrote:
>
> On 1/12/21 11:36 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2021-01-12 13:14:45 -0800, Darrick J. Wong wrote:
>>> ALLOCSP64 can only allocate pre-zeroed blocks as part of extending EOF,
>>> whereas a new FZERO flag means that we can pre-zero an arbitrary range
>>> of bytes in a file. I don't know if Avi or Andres' use cases demand that
>>> kind of flexibility, but I know I'd rather go for the more powerful
>>> interface.
>> Postgres/I don't at the moment have a need to allocate "written" zeroed
>> space anywhere but EOF. I can see some potential uses for more flexible
>> pre-zeroing in the future though, but not very near term.
>>
>
> I also agree that it's better not to have the kernel fall back internally on writing zeros, letting userspace do that. The assumption is that WRITE SAME will be O(1)-ish and so can bypass scheduling decisions, but if we need to write zeros, better let the application throttle the rate.
Writing zeroes from userspace has a *lot* more overhead when there is a network
filesystem involved. It would be better to generate the zeroes on the server,
or directly in the disk than sending GB of zeroes over the network.
Cheers, Andreas