LinuxLists.cc - Re: [PATCH RFC 5/5] ext4: Add fallocate2() support

2020-02-28 21:22:05

Subject: Re: [PATCH RFC 5/5] ext4: Add fallocate2() support

On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <[email protected]> wrote:
> > On 27.02.2020 00:51, Andreas Dilger wrote:
> >> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <[email protected]> wrote:
> >> In that case, an interesting userspace interface would be an array of
> >> inode numbers (64-bit please) that should be packed together densely in
> >> the order they are provided (maybe a flag for that). That allows the
> >> filesystem the freedom to find the physical blocks for the allocation,
> >> while userspace can tell which files are related to each other.
> >
> > So, this interface is 3-in-1:
> >
> > 1)finds a placement for inodes extents;
>
> The target allocation size would be sum(size of inodes), which should
> be relatively small in your case).
>
> > 2)assigns this space to some temporary donor inode;
>
> Maybe yes, or just reserves that space from being allocated by anyone.
>
> > 3)calls ext4_move_extents() for each of them.
>
> ... using the target space that was reserved earlier
>
> > Do I understand you right?
>
> Correct. That is my "5 minutes thinking about an interface for grouping
> small files together without exposing kernel internals" proposal for this.

You don't need any special kernel interface with XFS for this. It is
simply:

mkdir tmpdir
create O_TMPFILEs in tmpdir

Now all the tmpfiles you create and their data will be co-located
around the location of the tmpdir inode. This is the natural
placement policy of the filesystem. i..e the filesystem assumes that
files in the same directory are all related, so will be accessed
together and so should be located in relatively close proximity to
each other.

This is a locality optimisation technique that is older than XFS. It
works remarkably well when the filesystem can spread directories
effectively across it's address space. It also allows userspace to
use simple techniques to group (or separate) data files as desired.
Indeed, this is how xfs_fsr directs locality for it's tmpfiles when
relocating/defragmenting data....

> > If so, then IMO it's good to start from two inodes, because here may code
> > a very difficult algorithm of placement of many inodes, which may require
> > much memory. Is this OK?
>
> Well, if the files are small then it won't be a lot of memory. Even so,
> the kernel would only need to copy a few MB at a time in order to get
> any decent performance, so I don't think that is a huge problem to have
> several MB of dirty data in flight.
>
> > Can we introduce a flag, that some of inode is unmovable?
>
> There are very few flags left in the ext4_inode->i_flags for use.
> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
> also have other semantics. The EXT4_NOTAIL_FL is for not merging the
> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
> so we might consider it a generic "do not merge" flag if set?

We've had that in XFS for as long as I can remember. Many
applications were sensitive to the exact layout of the files they
created themselves, so having xfs_fsr defrag/move them about would
cause performance SLAs to be broken.

Indeed, thanks to XFS, ext4 already has an interface that can be
used to set/clear a "no defrag" flag such as you are asking for.
It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
and it has special meaning for directories. From the 'man 3 xfsctl'
man page where this interface came from:

Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
No defragment file bit - the file should be skipped during a
defragmentation operation. When applied to a directory,
new files and directories created will inherit the no-defrag
bit.

> > Can this interface use a knowledge about underlining device discard granuality?
>
> As I wrote above, ext4+mballoc has a very good appreciation for alignment.
> That was written for RAID storage devices, but it doesn't matter what
> the reason is. It isn't clear if flash discard alignment is easily
> used (it may not be a power-of-two value or similar), but wouldn't be
> harmful to try.

Yup, XFS has the similar (but more complex) alignment controls for
directing allocation to match the underlying storage
characteristics. e.g. stripe unit is also the "small file size
threshold" where the allocation policy changes from packing to
aligning and separating.

> > In the answer to Dave, I wrote a proposition to make fallocate() care about
> > i_write_hint. Could you please comment what you think about that too?
>
> I'm not against that. How the two interact would need to be documented
> first and discussed to see if that makes sene, and then implemented.

Individual filesystems can make their own choices as to what they do
with write hints, including ignoring them and leaving it for the
storage device to decide where to physically place the data. Which,
in many cases, ignoring the hint is the right thing for the
filesystem to do...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-02-29 20:15:48

by Andreas Dilger

[permalink] [raw]

Subject: Re: [PATCH RFC 5/5] ext4: Add fallocate2() support

On Feb 28, 2020, at 2:16 PM, Dave Chinner <[email protected]> wrote:
>
> On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
>> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <[email protected]> wrote:
>>>
>>> So, this interface is 3-in-1:
>>>
>>> 1)finds a placement for inodes extents;
>>
>> The target allocation size would be sum(size of inodes), which should
>> be relatively small in your case).
>>
>>> 2)assigns this space to some temporary donor inode;
>>
>> Maybe yes, or just reserves that space from being allocated by anyone.
>>
>>> 3)calls ext4_move_extents() for each of them.
>>
>> ... using the target space that was reserved earlier
>>
>>> Do I understand you right?
>>
>> Correct. That is my "5 minutes thinking about an interface for grouping
>> small files together without exposing kernel internals" proposal for this.
>
> You don't need any special kernel interface with XFS for this. It is
> simply:
>
> mkdir tmpdir
> create O_TMPFILEs in tmpdir
>
> Now all the tmpfiles you create and their data will be co-located
> around the location of the tmpdir inode. This is the natural
> placement policy of the filesystem. i..e the filesystem assumes that
> files in the same directory are all related, so will be accessed
> together and so should be located in relatively close proximity to
> each other.

Sure, this will likely get inodes allocate _close_ to each other on
ext4 as well (the new directory will preferentially be located in a
group that has free space), but it doesn't necessarily result in
all of the files being packed densely. For 1MB+4KB and 1MB-4KB files
they will still prefer to be aligned on 1MB boundaries rather than
packed together.

>>> Can we introduce a flag, that some of inode is unmovable?
>>
>> There are very few flags left in the ext4_inode->i_flags for use.
>> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
>> also have other semantics. The EXT4_NOTAIL_FL is for not merging the
>> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
>> so we might consider it a generic "do not merge" flag if set?
>
> Indeed, thanks to XFS, ext4 already has an interface that can be
> used to set/clear a "no defrag" flag such as you are asking for.
> It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
> In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
> and it has special meaning for directories. From the 'man 3 xfsctl'
> man page where this interface came from:
>
> Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
> No defragment file bit - the file should be skipped during a
> defragmentation operation. When applied to a directory,
> new files and directories created will inherit the no-defrag
> bit.

The interface is not the limiting factor here, but rather the number
of flags available in the inode. Since chattr/lsattr from e2fsprogs
was used as "common ground" for a few years, there are a number of
flags in the namespace that don't actually have any meaning for ext4.

One of those flags is:

#define EXT4_NOTAIL_FL 0x00008000 /* file tail should not be merged */

This was added for Reiserfs, but it is not used by any other filesystem,
so generalizing it slightly to mean "no migrate" is reasonable. That
doesn't affect Reiserfs in any way, and it would still be possible to
also wire up the XFS_XFLAG_NODEFRAG bit to be stored as that flag.

It wouldn't be any issue at all to chose an arbitrary unused flag to
store this in ext4 inode internally, except that chattr/lsattr are used
by a variety of different filesystems, so whatever flag is chosen will
immediately also apply to any other filesystem that users use those
tools on.

Cheers, Andreas

Attachments:

signature.asc (890.00 B)
Message signed with OpenPGP

2020-03-01 00:07:21

by Dave Chinner

[permalink] [raw]

Subject: Re: [PATCH RFC 5/5] ext4: Add fallocate2() support

On Sat, Feb 29, 2020 at 01:12:52PM -0700, Andreas Dilger wrote:
> On Feb 28, 2020, at 2:16 PM, Dave Chinner <[email protected]> wrote:
> >
> > On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
> >> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <[email protected]> wrote:
> >>>
> >>> So, this interface is 3-in-1:
> >>>
> >>> 1)finds a placement for inodes extents;
> >>
> >> The target allocation size would be sum(size of inodes), which should
> >> be relatively small in your case).
> >>
> >>> 2)assigns this space to some temporary donor inode;
> >>
> >> Maybe yes, or just reserves that space from being allocated by anyone.
> >>
> >>> 3)calls ext4_move_extents() for each of them.
> >>
> >> ... using the target space that was reserved earlier
> >>
> >>> Do I understand you right?
> >>
> >> Correct. That is my "5 minutes thinking about an interface for grouping
> >> small files together without exposing kernel internals" proposal for this.
> >
> > You don't need any special kernel interface with XFS for this. It is
> > simply:
> >
> > mkdir tmpdir
> > create O_TMPFILEs in tmpdir
> >
> > Now all the tmpfiles you create and their data will be co-located
> > around the location of the tmpdir inode. This is the natural
> > placement policy of the filesystem. i..e the filesystem assumes that
> > files in the same directory are all related, so will be accessed
> > together and so should be located in relatively close proximity to
> > each other.
>
> Sure, this will likely get inodes allocate _close_ to each other on
> ext4 as well (the new directory will preferentially be located in a
> group that has free space), but it doesn't necessarily result in
> all of the files being packed densely. For 1MB+4KB and 1MB-4KB files
> they will still prefer to be aligned on 1MB boundaries rather than
> packed together.

Userspace can control that, too, simply by choosing to relocate only
small files into a single directory, then relocating the large files
in a separate set of operations after flushing the small files and
having the packed tightly.

Seriously, userspace has a *lot* of control of how data is located
and packed simply by grouping the IO it wants to be written together
into the same logical groups (i.e. directories) in the same temporal
IO domain...

> >>> Can we introduce a flag, that some of inode is unmovable?
> >>
> >> There are very few flags left in the ext4_inode->i_flags for use.
> >> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
> >> also have other semantics. The EXT4_NOTAIL_FL is for not merging the
> >> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
> >> so we might consider it a generic "do not merge" flag if set?
> >
> > Indeed, thanks to XFS, ext4 already has an interface that can be
> > used to set/clear a "no defrag" flag such as you are asking for.
> > It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
> > In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
> > and it has special meaning for directories. From the 'man 3 xfsctl'
> > man page where this interface came from:
> >
> > Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
> > No defragment file bit - the file should be skipped during a
> > defragmentation operation. When applied to a directory,
> > new files and directories created will inherit the no-defrag
> > bit.
>
> The interface is not the limiting factor here, but rather the number
> of flags available in the inode.

Yes, that's an internal ext4 issue, not a userspace API problem.

> Since chattr/lsattr from e2fsprogs
> was used as "common ground" for a few years, there are a number of
> flags in the namespace that don't actually have any meaning for ext4.

Yes, that's a shitty API bed that extN made for itself, isn't it?
We've sucked at API design for a long, long time. :/

But the chattr userspace application is also irrelevant to the
problem at hand: it already uses the FS_IOC_FS[GS]ETXATTR ioctl
interface for changing project quota IDs and the per-inode
inheritance flag. Hence how it manages the new flag is irrelevant,
but it also means we can't change the definition or behaviour of
existing flags it controls regardless of what filesystem those flags
act on.

> One of those flags is:
>
> #define EXT4_NOTAIL_FL 0x00008000 /* file tail should not be merged */
>
> This was added for Reiserfs, but it is not used by any other filesystem,
> so generalizing it slightly to mean "no migrate" is reasonable. That
> doesn't affect Reiserfs in any way, and it would still be possible to
> also wire up the XFS_XFLAG_NODEFRAG bit to be stored as that flag.

Yes, ext4 could do that, but we are not allowed to redefine long
standing userspace API flags to mean something completely different.
That's effectively what you are proposing here if you allow ext4 to
manipulate the same on-disk flag via both FS_NOTAIL_FL and
FS_XFLAG_NODEFRAG. ie. the FS_NOTAIL_FL flag is manipulated by
FS_IOC_[GS]ETFLAGS and is marked both as user visible and modifiable
by ext4 even though ti does nothing.

IOWs, to redefine this on-disk flag we would also need to have
EXT4_IOC_GETFLAGS / EXT4_IOC_SETFLAGS reject attempts to set/clear
FS_NOTAIL_FL with EOPNOTSUPP or EINVAL. Which we then risk breaking
applications that use this flag even though ext4 does not implement
anything other than setting/clearing the flag on demand.

IOWs, we cannot change the meaning of the EXT4_NOTAIL_FL on disk
flag, because that either changes the user visible behaviour of the
on-disk flag or it changes the definition of a userspace API flag to
mean something it was never meant to mean. Neither of those things
are acceptible changes to make to a generic userspace API.

> It wouldn't be any issue at all to chose an arbitrary unused flag to
> store this in ext4 inode internally, except that chattr/lsattr are used
> by a variety of different filesystems, so whatever flag is chosen will
> immediately also apply to any other filesystem that users use those
> tools on.

The impact on userspace is only a problem if you re-use a flag ext4
already exposes to userspace. And that is not allowed if it causes
the userspace API to be globally redefined for everyone. Which,
clearly, it would.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-03-02 10:34:38

by Kirill Tkhai

[permalink] [raw]

Subject: Re: [PATCH RFC 5/5] ext4: Add fallocate2() support

On 29.02.2020 00:16, Dave Chinner wrote:
> On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
>> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai <[email protected]> wrote:
>>> On 27.02.2020 00:51, Andreas Dilger wrote:
>>>> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai <[email protected]> wrote:
>>>> In that case, an interesting userspace interface would be an array of
>>>> inode numbers (64-bit please) that should be packed together densely in
>>>> the order they are provided (maybe a flag for that). That allows the
>>>> filesystem the freedom to find the physical blocks for the allocation,
>>>> while userspace can tell which files are related to each other.
>>>
>>> So, this interface is 3-in-1:
>>>
>>> 1)finds a placement for inodes extents;
>>
>> The target allocation size would be sum(size of inodes), which should
>> be relatively small in your case).
>>
>>> 2)assigns this space to some temporary donor inode;
>>
>> Maybe yes, or just reserves that space from being allocated by anyone.
>>
>>> 3)calls ext4_move_extents() for each of them.
>>
>> ... using the target space that was reserved earlier
>>
>>> Do I understand you right?
>>
>> Correct. That is my "5 minutes thinking about an interface for grouping
>> small files together without exposing kernel internals" proposal for this.
>
> You don't need any special kernel interface with XFS for this. It is
> simply:
>
> mkdir tmpdir
> create O_TMPFILEs in tmpdir
>
> Now all the tmpfiles you create and their data will be co-located
> around the location of the tmpdir inode. This is the natural
> placement policy of the filesystem. i..e the filesystem assumes that
> files in the same directory are all related, so will be accessed
> together and so should be located in relatively close proximity to
> each other.

Hm, but does this help for my problem? 1)allocate two files in the same directory
and then 2)move source files there?

In case of I have two 512K files ext4 allows the same:

1)fallocate() 1M continuous space (this should ends with success in case of disc
is not almost full);
2)move both files into newly allocated space.

But this doubles IO, since both of files have to be moved. The ideal solution
would be to allocate space around one of them and to move the second file there.

> This is a locality optimisation technique that is older than XFS. It
> works remarkably well when the filesystem can spread directories
> effectively across it's address space. It also allows userspace to
> use simple techniques to group (or separate) data files as desired.
> Indeed, this is how xfs_fsr directs locality for it's tmpfiles when
> relocating/defragmenting data....
>
>>> If so, then IMO it's good to start from two inodes, because here may code
>>> a very difficult algorithm of placement of many inodes, which may require
>>> much memory. Is this OK?
>>
>> Well, if the files are small then it won't be a lot of memory. Even so,
>> the kernel would only need to copy a few MB at a time in order to get
>> any decent performance, so I don't think that is a huge problem to have
>> several MB of dirty data in flight.
>>
>>> Can we introduce a flag, that some of inode is unmovable?
>>
>> There are very few flags left in the ext4_inode->i_flags for use.
>> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
>> also have other semantics. The EXT4_NOTAIL_FL is for not merging the
>> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
>> so we might consider it a generic "do not merge" flag if set?
>
> We've had that in XFS for as long as I can remember. Many
> applications were sensitive to the exact layout of the files they
> created themselves, so having xfs_fsr defrag/move them about would
> cause performance SLAs to be broken.
>
> Indeed, thanks to XFS, ext4 already has an interface that can be
> used to set/clear a "no defrag" flag such as you are asking for.
> It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
> In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
> and it has special meaning for directories. From the 'man 3 xfsctl'
> man page where this interface came from:
>
> Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
> No defragment file bit - the file should be skipped during a
> defragmentation operation. When applied to a directory,
> new files and directories created will inherit the no-defrag
> bit.
>
>>> Can this interface use a knowledge about underlining device discard granuality?
>>
>> As I wrote above, ext4+mballoc has a very good appreciation for alignment.
>> That was written for RAID storage devices, but it doesn't matter what
>> the reason is. It isn't clear if flash discard alignment is easily
>> used (it may not be a power-of-two value or similar), but wouldn't be
>> harmful to try.
>
> Yup, XFS has the similar (but more complex) alignment controls for
> directing allocation to match the underlying storage
> characteristics. e.g. stripe unit is also the "small file size
> threshold" where the allocation policy changes from packing to
> aligning and separating.
>
>>> In the answer to Dave, I wrote a proposition to make fallocate() care about
>>> i_write_hint. Could you please comment what you think about that too?
>>
>> I'm not against that. How the two interact would need to be documented
>> first and discussed to see if that makes sene, and then implemented.
>
> Individual filesystems can make their own choices as to what they do
> with write hints, including ignoring them and leaving it for the
> storage device to decide where to physically place the data. Which,
> in many cases, ignoring the hint is the right thing for the
> filesystem to do...
>
> Cheers,
>
> Dave.
>