This series expands atomic write support to filesystems, specifically
XFS. Since XFS rtvol supports extent alignment already, support will
initially be added there. When XFS forcealign feature is merged, then we
can similarly support atomic writes for a non-rtvol filesystem.
Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.
For XFS rtvol, support can be enabled through xfs_io command:
$xfs_io -c "chattr +W" filename
$xfs_io -c "lsattr -v" filename
[realtime, atomic-writes] filename
The FS needs to be formatted with a specific extent alignment size, like:
mkf.xfs -r rtdev=/dev/sdb,extsize=16K -d rtinherit=1 /dev/sda
This enables 16K atomic write support. There are no checks whether the
underlying HW actually supports that for enabling atomic writes with
xfs_io, though, so statx needs to be issued for a file to know atomic
write limits.
For supporting non-rtvol, we will require forcealign enabled. As such, a
dedicated xfs_io command to enable atomic writes for a regular FS may
be useful, which would enable FS_XFLAG_ATOMICWRITES, enable forcealign,
and set an extent alignment hint.
Baseline is following series (which is based on v6.8-rc1):
https://urldefense.com/v3/__https://lore.kernel.org/linux-nvme/[email protected]/T/*m4ad28b480a8e12eb51467e17208d98ca50041ff2__;Iw!!ACWV5N9M2RV99hQ!PKOcFzPtVYZ9uATl1BrTJmYanWxEtCKJPV-tTPDYqeTjuWmChXn08ZcmP_H07A9mxPyQ8wwjdSzgH0eYU_45MaIOJyEW$
Basic xfsprogs support at:
https://urldefense.com/v3/__https://github.com/johnpgarry/xfsprogs-dev/tree/atomicwrites__;!!ACWV5N9M2RV99hQ!PKOcFzPtVYZ9uATl1BrTJmYanWxEtCKJPV-tTPDYqeTjuWmChXn08ZcmP_H07A9mxPyQ8wwjdSzgH0eYU_45MTapy6qp$
John Garry (6):
fs: iomap: Atomic write support
fs: Add FS_XFLAG_ATOMICWRITES flag
fs: xfs: Support FS_XFLAG_ATOMICWRITES for rtvol
fs: xfs: Support atomic write for statx
fs: xfs: iomap atomic write support
fs: xfs: Set FMODE_CAN_ATOMIC_WRITE for FS_XFLAG_ATOMICWRITES set
fs/iomap/direct-io.c | 21 +++++++++++++++++-
fs/iomap/trace.h | 3 ++-
fs/xfs/libxfs/xfs_format.h | 8 +++++--
fs/xfs/libxfs/xfs_sb.c | 2 ++
fs/xfs/xfs_file.c | 2 ++
fs/xfs/xfs_inode.c | 22 +++++++++++++++++++
fs/xfs/xfs_inode.h | 7 ++++++
fs/xfs/xfs_ioctl.c | 19 ++++++++++++++--
fs/xfs/xfs_iomap.c | 41 ++++++++++++++++++++++++++++++++++
fs/xfs/xfs_iops.c | 45 ++++++++++++++++++++++++++++++++++++++
fs/xfs/xfs_iops.h | 4 ++++
fs/xfs/xfs_mount.h | 2 ++
fs/xfs/xfs_super.c | 4 ++++
include/linux/iomap.h | 1 +
include/uapi/linux/fs.h | 1 +
15 files changed, 176 insertions(+), 6 deletions(-)
--
2.31.1
Add a flag indicating that a regular file is enabled for atomic writes.
Signed-off-by: John Garry <[email protected]>
---
include/uapi/linux/fs.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index a0975ae81e64..b5b4e1db9576 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -140,6 +140,7 @@ struct fsxattr {
#define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
#define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
#define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
+#define FS_XFLAG_ATOMICWRITES 0x00020000 /* atomic writes enabled */
#define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
/* the read-only stuff doesn't really belong here, but any other place is
--
2.31.1
On Wed, Jan 24, 2024 at 02:26:41PM +0000, John Garry wrote:
> Add a flag indicating that a regular file is enabled for atomic writes.
This is a file attribute that mirrors an ondisk inode flag. Actual
support for untorn file writes (for now) depends on both the iflag and
the underlying storage devices, which we can only really check at statx
and pwrite time. This is the same story as FS_XFLAG_DAX, which signals
to the fs that we should try to enable the fsdax IO path on the file
(instead of the regular page cache), but applications have to query
STAT_ATTR_DAX to find out if they really got that IO path.
"try to enable atomic writes", perhaps?
(and the comment for FS_XFLAG_DAX ought to read "try to use DAX for IO")
--D
>
> Signed-off-by: John Garry <[email protected]>
> ---
> include/uapi/linux/fs.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index a0975ae81e64..b5b4e1db9576 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -140,6 +140,7 @@ struct fsxattr {
> #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
> #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
> #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
> +#define FS_XFLAG_ATOMICWRITES 0x00020000 /* atomic writes enabled */
> #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
>
> /* the read-only stuff doesn't really belong here, but any other place is
> --
> 2.31.1
>
>
On 02/02/2024 17:57, Darrick J. Wong wrote:
> On Wed, Jan 24, 2024 at 02:26:41PM +0000, John Garry wrote:
>> Add a flag indicating that a regular file is enabled for atomic writes.
>
> This is a file attribute that mirrors an ondisk inode flag. Actual
> support for untorn file writes (for now) depends on both the iflag and
> the underlying storage devices, which we can only really check at statx
> and pwrite time. This is the same story as FS_XFLAG_DAX, which signals
> to the fs that we should try to enable the fsdax IO path on the file
> (instead of the regular page cache), but applications have to query
> STAT_ATTR_DAX to find out if they really got that IO path.
To be clear, are you suggesting to add this info to the commit message?
>
> "try to enable atomic writes", perhaps? >
> (and the comment for FS_XFLAG_DAX ought to read "try to use DAX for IO")
To me that sounds like "try to use DAX for IO, and, if not possible,
fall back on some other method" - is that reality of what that flag does?
Thanks,
John
>
> --D
>
>>
>> Signed-off-by: John Garry <[email protected]>
>> ---
>> include/uapi/linux/fs.h | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
>> index a0975ae81e64..b5b4e1db9576 100644
>> --- a/include/uapi/linux/fs.h
>> +++ b/include/uapi/linux/fs.h
>> @@ -140,6 +140,7 @@ struct fsxattr {
>> #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
>> #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
>> #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
>> +#define FS_XFLAG_ATOMICWRITES 0x00020000 /* atomic writes enabled */
>> #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
>>
>> /* the read-only stuff doesn't really belong here, but any other place is
>> --
>> 2.31.1
>>
>>
On Wed, Jan 24, 2024 at 02:26:39PM +0000, John Garry wrote:
> This series expands atomic write support to filesystems, specifically
> XFS. Since XFS rtvol supports extent alignment already, support will
> initially be added there. When XFS forcealign feature is merged, then we
> can similarly support atomic writes for a non-rtvol filesystem.
Hi John,
Along with rtvol check, we can also have a simple check to see if the
FS blocksize itself is big enough to satisfy the atomic requirements.
For eg on machines with 64K page, we can have say 16k or 64k block sizes
which should be able to provide required allocation behavior for atomic
writes. In such cases we don't need rtvol.
Regards,
ojaswin
On 09/02/2024 07:14, Ojaswin Mujoo wrote:
> On Wed, Jan 24, 2024 at 02:26:39PM +0000, John Garry wrote:
>> This series expands atomic write support to filesystems, specifically
>> XFS. Since XFS rtvol supports extent alignment already, support will
>> initially be added there. When XFS forcealign feature is merged, then we
>> can similarly support atomic writes for a non-rtvol filesystem.
>
> Hi John,
>
> Along with rtvol check, we can also have a simple check to see if the
> FS blocksize itself is big enough to satisfy the atomic requirements.
> For eg on machines with 64K page, we can have say 16k or 64k block sizes
> which should be able to provide required allocation behavior for atomic
> writes. In such cases we don't need rtvol.
>
I suppose we could do, but I would rather just concentrate on rtvol
support initially, and there we do report atomic write unit min = FS
block size (even if rt extsize is unset).
In addition, I plan to initially just support atomic write unit min = FS
block size (for both rtvol and !rtvol).
Thanks,
John
On Fri, Feb 09, 2024 at 09:22:20AM +0000, John Garry wrote:
> On 09/02/2024 07:14, Ojaswin Mujoo wrote:
> > On Wed, Jan 24, 2024 at 02:26:39PM +0000, John Garry wrote:
> > > This series expands atomic write support to filesystems, specifically
> > > XFS. Since XFS rtvol supports extent alignment already, support will
> > > initially be added there. When XFS forcealign feature is merged, then we
> > > can similarly support atomic writes for a non-rtvol filesystem.
> >
> > Hi John,
> >
> > Along with rtvol check, we can also have a simple check to see if the
> > FS blocksize itself is big enough to satisfy the atomic requirements.
> > For eg on machines with 64K page, we can have say 16k or 64k block sizes
> > which should be able to provide required allocation behavior for atomic
> > writes. In such cases we don't need rtvol.
> >
> I suppose we could do, but I would rather just concentrate on rtvol support
> initially, and there we do report atomic write unit min = FS block size
> (even if rt extsize is unset).
Okay understood.
Thanks,
ojaswin
>
> In addition, I plan to initially just support atomic write unit min = FS
> block size (for both rtvol and !rtvol).
>
> Thanks,
> John
On Mon, Feb 05, 2024 at 12:58:30PM +0000, John Garry wrote:
> To me that sounds like "try to use DAX for IO, and, if not possible, fall
> back on some other method" - is that reality of what that flag does?
Yes. Of course for a fallback on XFS we need Darrick's swapext log
item. Which would be good to have..
From reading the series and the discussions with Darrick and Dave
I'm coming more and more back to my initial position that tying this
user visible feature to hardware limits is wrong and will just keep
on creating ever more painpoints in the future.
Based on that I suspect that doing proper software only atomic writes
using the swapext log item and selective always COW mode and making that
work should be the first step. We can then avoid that overhead for
properly aligned writs if the hardware supports it. For your Oracle
DB loads you'll set the alignment hints and maybe even check with
fiemap that everything is fine and will get the offload, but we also
provide a nice and useful API for less performance critical applications
that don't have to care about all these details.
John Garry <[email protected]> writes:
> This series expands atomic write support to filesystems, specifically
> XFS. Since XFS rtvol supports extent alignment already, support will
> initially be added there. When XFS forcealign feature is merged, then we
> can similarly support atomic writes for a non-rtvol filesystem.
>
> Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.
>
> For XFS rtvol, support can be enabled through xfs_io command:
> $xfs_io -c "chattr +W" filename
> $xfs_io -c "lsattr -v" filename
> [realtime, atomic-writes] filename
Hi John,
I first took your block atomic write patch series [1] and then applied this
series on top. I also compiled xfsprogs with chattr atomic write support from [2].
[1]: https://lore.kernel.org/linux-nvme/[email protected]/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2
[2]: https://github.com/johnpgarry/xfsprogs-dev/commits/atomicwrites/
But while setting +W attr, I see an Invalid argument error. Is there
anything I need to do first?
root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "chattr +W" /mnt1/test/f1
xfs_io: cannot set flags on /mnt1/test/f1: Invalid argument
root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "lsattr -v" /mnt1/test/f1
[realtime] /mnt1/test/f1
>
> The FS needs to be formatted with a specific extent alignment size, like:
> mkf.xfs -r rtdev=/dev/sdb,extsize=16K -d rtinherit=1 /dev/sda
>
> This enables 16K atomic write support. There are no checks whether the
> underlying HW actually supports that for enabling atomic writes with
> xfs_io, though, so statx needs to be issued for a file to know atomic
> write limits.
>
Here you say that xfs_io does not check whether underlying HW actually
supports atomic writes or not. So I am assuming xfs_io -c "chattr +W"
should have just worked?
Sorry, I am still in the process of going over the patches, but I thought let
me anyways ask this first.
-ritesh
On 13/02/2024 07:45, Ritesh Harjani (IBM) wrote:
> John Garry <[email protected]> writes:
>
>> This series expands atomic write support to filesystems, specifically
>> XFS. Since XFS rtvol supports extent alignment already, support will
>> initially be added there. When XFS forcealign feature is merged, then we
>> can similarly support atomic writes for a non-rtvol filesystem.
>>
>> Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.
>>
>> For XFS rtvol, support can be enabled through xfs_io command:
>> $xfs_io -c "chattr +W" filename
>> $xfs_io -c "lsattr -v" filename
>> [realtime, atomic-writes] filename
>
> Hi John,
>
> I first took your block atomic write patch series [1] and then applied this
> series on top. I also compiled xfsprogs with chattr atomic write support from [2].
>
> [1]: https://lore.kernel.org/linux-nvme/[email protected]/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2
> [2]: https://github.com/johnpgarry/xfsprogs-dev/commits/atomicwrites/
>
>
> But while setting +W attr, I see an Invalid argument error. Is there
> anything I need to do first?
>
> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "chattr +W" /mnt1/test/f1
> xfs_io: cannot set flags on /mnt1/test/f1: Invalid argument
>
> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "lsattr -v" /mnt1/test/f1
> [realtime] /mnt1/test/f1
Can you provide your full steps?
I'm doing something like:
# /mkfs.xfs -r rtdev=/dev/sdb,extsize=16k -d rtinherit=1 /dev/sda
meta-data=/dev/sda isize=512 agcount=4, agsize=22400 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=0 bigtime=1 inobtcount=1
nrext64=0
data = bsize=4096 blocks=89600, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =/dev/sdb extsz=16384 blocks=89600, rtextents=22400
# mount /dev/sda mnt -o rtdev=/dev/sdb
[ 5.553482] XFS (sda): EXPERIMENTAL atomic writes feature in use. Use
at your own risk!
[ 5.556752] XFS (sda): Mounting V5 Filesystem
6e0820e6-4d44-4c3e-89f2-21b4d4480f88
[ 5.602315] XFS (sda): Ending clean mount
#
# touch mnt/file
# /xfs_io -c "lsattr -v" mnt/file
[realtime] mnt/file
#
#
# /xfs_io -c "chattr +W" mnt/file
# /xfs_io -c "lsattr -v" mnt/file
[realtime, atomic-writes] mnt/file
And then we can check limits:
# /test-statx -a /root/mnt/file
dump_statx results=9fff
Size: 0 Blocks: 0 IO Block: 16384 regular file
Device: 08:00 Inode: 131 Links: 1
Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
Access: 2024-02-13 08:31:51.962900974+0000
Modify: 2024-02-13 08:31:51.962900974+0000
Change: 2024-02-13 08:31:51.969900974+0000
Birth: 2024-02-13 08:31:51.962900974+0000
stx_attributes_mask=0x603070
STATX_ATTR_WRITE_ATOMIC set
unit min: 4096
unit max: 16384
segments max: 1
Attributes: 0000000000400000 (........ ........ ........ ........
....... .?-..... ..--.... .---....)
#
#
Does xfs_io have a statx function? If so, I can add support for atomic
writes for statx there. In the meantime, that test-statx code is also on
my branch, and can be run on the block device file (to sanity check that
the rtvol device supports atomic writes).
Thanks,
John
John Garry <[email protected]> writes:
> On 13/02/2024 07:45, Ritesh Harjani (IBM) wrote:
>> John Garry <[email protected]> writes:
>>
>>> This series expands atomic write support to filesystems, specifically
>>> XFS. Since XFS rtvol supports extent alignment already, support will
>>> initially be added there. When XFS forcealign feature is merged, then we
>>> can similarly support atomic writes for a non-rtvol filesystem.
>>>
>>> Flag FS_XFLAG_ATOMICWRITES is added as an enabling flag for atomic writes.
>>>
>>> For XFS rtvol, support can be enabled through xfs_io command:
>>> $xfs_io -c "chattr +W" filename
>>> $xfs_io -c "lsattr -v" filename
>>> [realtime, atomic-writes] filename
>>
>> Hi John,
>>
>> I first took your block atomic write patch series [1] and then applied this
>> series on top. I also compiled xfsprogs with chattr atomic write support from [2].
>>
>> [1]: https://lore.kernel.org/linux-nvme/[email protected]/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2
>> [2]: https://github.com/johnpgarry/xfsprogs-dev/commits/atomicwrites/
>>
>>
>> But while setting +W attr, I see an Invalid argument error. Is there
>> anything I need to do first?
>>
>> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "chattr +W" /mnt1/test/f1
>> xfs_io: cannot set flags on /mnt1/test/f1: Invalid argument
>>
>> root@ubuntu:~# /root/xt/xfsprogs-dev/io/xfs_io -c "lsattr -v" /mnt1/test/f1
>> [realtime] /mnt1/test/f1
>
> Can you provide your full steps?
>
> I'm doing something like:
>
> # /mkfs.xfs -r rtdev=/dev/sdb,extsize=16k -d rtinherit=1 /dev/sda
> meta-data=/dev/sda isize=512 agcount=4, agsize=22400 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=1 finobt=1, sparse=1, rmapbt=0
> = reflink=0 bigtime=1 inobtcount=1
> nrext64=0
> data = bsize=4096 blocks=89600, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0, ftype=1
> log =internal log bsize=4096 blocks=16384, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =/dev/sdb extsz=16384 blocks=89600, rtextents=22400
> # mount /dev/sda mnt -o rtdev=/dev/sdb
> [ 5.553482] XFS (sda): EXPERIMENTAL atomic writes feature in use. Use
> at your own risk!
My bad, I missed to see your xfsprogs change involve setting this
feature flag as well during mkfs time itself. I wasn't using the right
mkfs utility.
> [ 5.556752] XFS (sda): Mounting V5 Filesystem
> 6e0820e6-4d44-4c3e-89f2-21b4d4480f88
> [ 5.602315] XFS (sda): Ending clean mount
> #
> # touch mnt/file
> # /xfs_io -c "lsattr -v" mnt/file
> [realtime] mnt/file
> #
> #
> # /xfs_io -c "chattr +W" mnt/file
> # /xfs_io -c "lsattr -v" mnt/file
> [realtime, atomic-writes] mnt/file
>
Yup, this seems to work fine. Thanks!
> And then we can check limits:
>
> # /test-statx -a /root/mnt/file
> dump_statx results=9fff
> Size: 0 Blocks: 0 IO Block: 16384 regular file
> Device: 08:00 Inode: 131 Links: 1
> Access: (0644/-rw-r--r--) Uid: 0 Gid: 0
> Access: 2024-02-13 08:31:51.962900974+0000
> Modify: 2024-02-13 08:31:51.962900974+0000
> Change: 2024-02-13 08:31:51.969900974+0000
> Birth: 2024-02-13 08:31:51.962900974+0000
> stx_attributes_mask=0x603070
> STATX_ATTR_WRITE_ATOMIC set
> unit min: 4096
> unit max: 16384
> segments max: 1
> Attributes: 0000000000400000 (........ ........ ........ ........
> ........ .?-..... ..--.... .---....)
> #
> #
>
> Does xfs_io have a statx function? If so, I can add support for atomic
> writes for statx there. In the meantime, that test-statx code is also on
> my branch, and can be run on the block device file (to sanity check that
> the rtvol device supports atomic writes).
>
> Thanks,
> John
On Mon, Feb 05, 2024 at 12:58:30PM +0000, John Garry wrote:
> On 02/02/2024 17:57, Darrick J. Wong wrote:
> > On Wed, Jan 24, 2024 at 02:26:41PM +0000, John Garry wrote:
> > > Add a flag indicating that a regular file is enabled for atomic writes.
> >
> > This is a file attribute that mirrors an ondisk inode flag. Actual
> > support for untorn file writes (for now) depends on both the iflag and
> > the underlying storage devices, which we can only really check at statx
> > and pwrite time. This is the same story as FS_XFLAG_DAX, which signals
> > to the fs that we should try to enable the fsdax IO path on the file
> > (instead of the regular page cache), but applications have to query
> > STAT_ATTR_DAX to find out if they really got that IO path.
>
> To be clear, are you suggesting to add this info to the commit message?
That and a S_ATOMICW flag for the inode that triggers the proposed
STAT_ATTR_ATOMICWRITES flag.
> > "try to enable atomic writes", perhaps? >
> > (and the comment for FS_XFLAG_DAX ought to read "try to use DAX for IO")
>
> To me that sounds like "try to use DAX for IO, and, if not possible, fall
> back on some other method" - is that reality of what that flag does?
As hch said, yes.
--D
> Thanks,
> John
>
> >
> > --D
> >
> > >
> > > Signed-off-by: John Garry <[email protected]>
> > > ---
> > > include/uapi/linux/fs.h | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index a0975ae81e64..b5b4e1db9576 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -140,6 +140,7 @@ struct fsxattr {
> > > #define FS_XFLAG_FILESTREAM 0x00004000 /* use filestream allocator */
> > > #define FS_XFLAG_DAX 0x00008000 /* use DAX for IO */
> > > #define FS_XFLAG_COWEXTSIZE 0x00010000 /* CoW extent size allocator hint */
> > > +#define FS_XFLAG_ATOMICWRITES 0x00020000 /* atomic writes enabled */
> > > #define FS_XFLAG_HASATTR 0x80000000 /* no DIFLAG for this */
> > > /* the read-only stuff doesn't really belong here, but any other place is
> > > --
> > > 2.31.1
> > >
> > >
>
>
On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> From reading the series and the discussions with Darrick and Dave
> I'm coming more and more back to my initial position that tying this
> user visible feature to hardware limits is wrong and will just keep
> on creating ever more painpoints in the future.
>
> Based on that I suspect that doing proper software only atomic writes
> using the swapext log item and selective always COW mode
Er, what are you thinking w.r.t. swapext and sometimescow? swapext
doesn't currently handle COW forks at all, and it can only exchange
between two of the same type of fork (e.g. both data forks or both attr
forks, no mixing).
Or will that be your next suggestion whenever I get back to fiddling
with the online fsck patches? ;)
> and making that
> work should be the first step. We can then avoid that overhead for
> properly aligned writs if the hardware supports it. For your Oracle
> DB loads you'll set the alignment hints and maybe even check with
> fiemap that everything is fine and will get the offload, but we also
> provide a nice and useful API for less performance critical applications
> that don't have to care about all these details.
I suspect they might want to fail-fast (back to standard WAL mode or
whatever) if the hardware support isn't available.
--D
On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> From reading the series and the discussions with Darrick and Dave
> I'm coming more and more back to my initial position that tying this
> user visible feature to hardware limits is wrong and will just keep
> on creating ever more painpoints in the future.
Yes, that's pretty much what I've been trying to say from the start.
The functionality atomic writes need from the filesystem is for
extent alignment constraints to be applied to all extent
manipulations, not just allocation. This is the same functionality
that DAX based XFS filesystems need to guarantee PMD aligned
extents.
IOWs, the required filesystem extent alignment functionality is not
specific to atomic writes and it is not specific to a particular
type of storage hardware.
If we implement the generic extent alignment constraints properly,
everything else from there is just a matter of configuring the
filesystem geometry to match the underlying hardware capability.
mkfs can do that for us, like it already does for RAID storage...
-Dave.
--
Dave Chinner
[email protected]
On Wed, Feb 14, 2024 at 10:50:22AM +1100, Dave Chinner wrote:
> The functionality atomic writes need from the filesystem is for
> extent alignment constraints to be applied to all extent
> manipulations, not just allocation. This is the same functionality
> that DAX based XFS filesystems need to guarantee PMD aligned
> extents.
>
> IOWs, the required filesystem extent alignment functionality is not
> specific to atomic writes and it is not specific to a particular
> type of storage hardware.
>
> If we implement the generic extent alignment constraints properly,
> everything else from there is just a matter of configuring the
> filesystem geometry to match the underlying hardware capability.
> mkfs can do that for us, like it already does for RAID storage...
Agreed.
But the one thing making atomic writes odd right now is that it
absolutely is required for operation right now, while for other
features is is somewhere between nice and important to have and not
a deal breaker.
So eithe we need to figure out a somewhat generic and not totally
XFS implementation specific user space interface to do the force
alignment (which this series tries to do). Or we make atomic
writes like the other features and ensure they still work without
the proper alignment if they suck. Doing that was my initial
gut feeling, and looking at other approaches just makes me tend
even stronger towards that.
>
> -Dave.
> --
> Dave Chinner
> [email protected]
---end quoted text---
On Tue, Feb 13, 2024 at 09:55:49AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> > From reading the series and the discussions with Darrick and Dave
> > I'm coming more and more back to my initial position that tying this
> > user visible feature to hardware limits is wrong and will just keep
> > on creating ever more painpoints in the future.
> >
> > Based on that I suspect that doing proper software only atomic writes
> > using the swapext log item and selective always COW mode
>
> Er, what are you thinking w.r.t. swapext and sometimescow?
What do you mean with sometimescow? Just normal reflinked inodes?
> swapext
> doesn't currently handle COW forks at all, and it can only exchange
> between two of the same type of fork (e.g. both data forks or both attr
> forks, no mixing).
>
> Or will that be your next suggestion whenever I get back to fiddling
> with the online fsck patches? ;)
Let's take a step back. If we want atomic write semantics without
hardware offload, what we need is to allocate new blocks and atomically
swap them into the data fork. Basicall an atomic version of
xfs_reflink_end_cow. But yes, the details of the current swapext
item might not be an exact fit, maybe it's just shared infrastructure
and concepts.
I'm not planning to make you do it, because such a log item would
generally be pretty useful for always COW mode.
> > and making that
> > work should be the first step. We can then avoid that overhead for
> > properly aligned writs if the hardware supports it. For your Oracle
> > DB loads you'll set the alignment hints and maybe even check with
> > fiemap that everything is fine and will get the offload, but we also
> > provide a nice and useful API for less performance critical applications
> > that don't have to care about all these details.
>
> I suspect they might want to fail-fast (back to standard WAL mode or
> whatever) if the hardware support isn't available.
Maybe for your particular DB use case. But there's plenty of
applications that just want atomic writes without building their
own infrastruture, including some that want pretty large chunks.
Also if a file system supports logging data (which I have an
XFS early prototype for that I plan to finish), we can even do
the small double writes more efficiently than the application,
all through the same interface.
On 13/02/2024 22:49, Dave Chinner wrote:
>> Does xfs_io have a statx function?
> Yes, it's right there in the man page:
>
> statx [ -v|-r ][ -m basic | -m all | -m <mask> ][ -FD ]
> Selected statistics from stat(2) and the XFS_IOC_GETXATTR system call on the current file.
> -v Show timestamps.
> -r Dump raw statx structure values.
> -m basic
> Set the field mask for the statx call to STATX_BASIC_STATS.
> -m all
> Set the the field mask for the statx call to STATX_ALL (default).
> -m <mask>
> Specify a numeric field mask for the statx call.
> -F Force the attributes to be synced with the server.
> -D Don't sync attributes with the server.
ok, I can check that out and look to add any support required for atomic
writes extension.
Thanks,
John
On Wed, Feb 14, 2024 at 08:45:59AM +0100, Christoph Hellwig wrote:
> On Tue, Feb 13, 2024 at 09:55:49AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 13, 2024 at 08:22:37AM +0100, Christoph Hellwig wrote:
> > > From reading the series and the discussions with Darrick and Dave
> > > I'm coming more and more back to my initial position that tying this
> > > user visible feature to hardware limits is wrong and will just keep
> > > on creating ever more painpoints in the future.
> > >
> > > Based on that I suspect that doing proper software only atomic writes
> > > using the swapext log item and selective always COW mode
> >
> > Er, what are you thinking w.r.t. swapext and sometimescow?
>
> What do you mean with sometimescow? Just normal reflinked inodes?
>
> > swapext
> > doesn't currently handle COW forks at all, and it can only exchange
> > between two of the same type of fork (e.g. both data forks or both attr
> > forks, no mixing).
> >
> > Or will that be your next suggestion whenever I get back to fiddling
> > with the online fsck patches? ;)
>
> Let's take a step back. If we want atomic write semantics without
> hardware offload, what we need is to allocate new blocks and atomically
> swap them into the data fork. Basicall an atomic version of
> xfs_reflink_end_cow. But yes, the details of the current swapext
> item might not be an exact fit, maybe it's just shared infrastructure
> and concepts.
Hmm. For rt reflink (whenever I get back to that, ha) I've been
starting to think that yes, we actually /do/ want to have a log item
that tracks the progress of remap and cow operations. That would solve
the problem of someone wanting to reflink a semi-written rtx.
That said, it might complicate the reflink code quite a bit since right
now it writes zeroes to the unwritten parts of an rt file's rtx so that
there's only one mapping record for the whole rtx, and then it remaps
them. That's most of why I haven't bothered to implement that solution.
> I'm not planning to make you do it, because such a log item would
> generally be pretty useful for always COW mode.
One other thing -- while I was refactoring the swapext code into
exch{range,maps}, it occurred to me that doing an exchange between the
cow and data forks isn't possible because log recovery won't be able to
do anything. There's no ondisk metadata to map a cow staging extent
back to the file it came from, which means we can't generally resume an
exchange operation.
However for a small write I guess you could simply queue all the log
intent items for all the changes needed and commit that.
> > > and making that
> > > work should be the first step. We can then avoid that overhead for
> > > properly aligned writs if the hardware supports it. For your Oracle
> > > DB loads you'll set the alignment hints and maybe even check with
> > > fiemap that everything is fine and will get the offload, but we also
> > > provide a nice and useful API for less performance critical applications
> > > that don't have to care about all these details.
> >
> > I suspect they might want to fail-fast (back to standard WAL mode or
> > whatever) if the hardware support isn't available.
>
> Maybe for your particular DB use case. But there's plenty of
> applications that just want atomic writes without building their
> own infrastruture, including some that want pretty large chunks.
>
> Also if a file system supports logging data (which I have an
> XFS early prototype for that I plan to finish), we can even do
> the small double writes more efficiently than the application,
> all through the same interface.
Heh. Ted's been trying to kill data=journal. Now we've found a use for
it after all. :)
--D
On Wed, Feb 21, 2024 at 08:56:15AM -0800, Darrick J. Wong wrote:
> Hmm. For rt reflink (whenever I get back to that, ha) I've been
> starting to think that yes, we actually /do/ want to have a log item
> that tracks the progress of remap and cow operations. That would solve
> the problem of someone wanting to reflink a semi-written rtx.
>
> That said, it might complicate the reflink code quite a bit since right
> now it writes zeroes to the unwritten parts of an rt file's rtx so that
> there's only one mapping record for the whole rtx, and then it remaps
> them. That's most of why I haven't bothered to implement that solution.
I'm still not sure that supporting reflinks for rtextsize > 1 is a good
idea..
> > I'm not planning to make you do it, because such a log item would
> > generally be pretty useful for always COW mode.
>
> One other thing -- while I was refactoring the swapext code into
> exch{range,maps}, it occurred to me that doing an exchange between the
> cow and data forks isn't possible because log recovery won't be able to
> do anything. There's no ondisk metadata to map a cow staging extent
> back to the file it came from, which means we can't generally resume an
> exchange operation.
Yeah.
> > Also if a file system supports logging data (which I have an
> > XFS early prototype for that I plan to finish), we can even do
> > the small double writes more efficiently than the application,
> > all through the same interface.
>
> Heh. Ted's been trying to kill data=journal. Now we've found a use for
> it after all. :)
Well.. unconditional logging of data just seems like a really bad idea.
Using it as an optimization for very small and/or synchronous writes
is a pretty common technique.