LinuxLists.cc - [RFC 0/9] ext4: Add direct-io atomic write support using fsawu

2024-03-02 07:42:38

Subject: [RFC 0/9] ext4: Add direct-io atomic write support using fsawu

Hello all,

This RFC series adds support for atomic writes to ext4 direct-io using
filesystem atomic write unit. It's built on top of John's "block atomic
write v5" series which adds RWF_ATOMIC flag interface to pwritev2() and enables
atomic write support in underlying device driver and block layer.

This series uses the same RWF_ATOMIC interface for adding atomic write support
to ext4's direct-io path. One can utilize it by 2 of the methods explained below.
((1)mkfs.ext4 -b <BS>, (2) with bigalloc).

Filesystem atomic write unit (fsawu):
============================================
Atomic writes within ext4 can be supported using below 3 methods -
1. On a large pagesize system (e.g. Power with 64k pagesize or aarch64 with 64k pagesize),
we can mkfs using different blocksizes. e.g. mkfs.ext4 -b <4k/8k/16k/32k/64k).
Now if the underlying HW device supports atomic writes, than a corresponding
blocksize can be chosen as a filesystem atomic write unit (fsawu) which
should be within the underlying hw defined [awu_min, awu_max] range.
For such filesystem, fsawu_[min|max] both are equal to blocksize (e.g. 16k)

On a smaller pagesize system this can be utilized when support for LBS is
complete (on ext4).

2. EXT4 already supports a feature called bigalloc. In that ext4 can handle
allocation in cluster size units. So for e.g. we can create a filesystem with
4k blocksize but with 64k clustersize. Such a configuration can also be used
to support atomic writes if the underlying hw device supports it.
In such case the fsawu_min will most likely be the filesystem blocksize and
fsawu_max will mostly likely be the cluster size.

So a user can do an atomic write of any size between [fsawu_min, fsawu_max]
range as long as it satisfies other constraints being laid out by HW device
(or by software stack) to support atomic writes.
e.g. len should be a power of 2, pos % len should be naturally
aligned and [start | end] (phys offsets) should not straddle over
an atomic write boundary.

3. EXT4 mballoc can be made aware of doing aligned block allocation for e.g. by
utilizing cr-0 allocation criteria. With this support, we won't be needing
to format a new filesystem and hopefully when the support for this in mballoc
is done, it can utilize the same interface/helper routines laid out in this
patch series. There is work going on in this aspect too in parallel [2]

Purpose of an early RFC:
(note only minimal testing has been done on this).
========================
Other than getting early review comments on the design, hopefully it should also
help folks in their discussion at LSFMM since there are various topic proposals
out there regarding atomic write support in xfs and ext4 [3][4].

How to utilize this support:
===========================
1. mkfs.ext4 -b 4096 -C 65536 /dev/<sdb> (scsi_debug or device with atomic write)
or mkfs.ext4 -b <BS=16k> if your platform supports it.
2. mount /dev/sdb /mnt
3. touch /mnt/f1
4. chattr +W /mnt/f1
5. xfs_io -dc "pwrite <pos> <len>" /mnt/f1

References:
===========
[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/linux-ext4/[email protected]/
[3]: https://www.spinics.net/lists/linux-xfs/msg81086.html
[4]: https://www.spinics.net/lists/linux-fsdevel/msg265226.html

John Garry (1):
fs: Add FS_XFLAG_ATOMICWRITES flag

Ritesh Harjani (IBM) (7):
fs: Reserve inode flag FS_ATOMICWRITES_FL for atomic writes
iomap: Add atomic write support for direct-io
ext4: Add statx and other atomic write helper routines
ext4: Adds direct-io atomic writes checks
ext4: Add an inode flag for atomic writes
ext4: Enable FMODE_CAN_ATOMIC_WRITE in open for direct-io
ext4: Adds atomic writes using fsawu

Ritesh Harjani (IBM) (1):
e2fsprogs/chattr: Supports atomic writes attribute

fs/ext4/ext4.h | 87 +++++++++++++++++++++++++++++++++++++++-
fs/ext4/file.c | 38 ++++++++++++++++--
fs/ext4/inode.c | 16 ++++++++
fs/ext4/ioctl.c | 11 +++++
fs/ext4/super.c | 1 +
fs/ioctl.c | 4 ++
fs/iomap/direct-io.c | 75 ++++++++++++++++++++++++++++++++--
fs/iomap/trace.h | 3 +-
include/linux/fileattr.h | 4 +-
include/linux/iomap.h | 1 +
include/uapi/linux/fs.h | 2 +
11 files changed, 232 insertions(+), 10 deletions(-)

--
2.39.2

2024-03-02 07:45:49

by Ritesh Harjani

[permalink] [raw]

Subject: [RFC 9/9] e2fsprogs/chattr: Supports atomic writes attribute

This adds 'W' which is atomic write attribute to chattr.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
lib/e2p/pf.c | 1 +
lib/ext2fs/ext2_fs.h | 2 +-
misc/chattr.1.in | 18 ++++++++++++++----
misc/chattr.c | 3 ++-
4 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/lib/e2p/pf.c b/lib/e2p/pf.c
index 81e3bb26..9b311477 100644
--- a/lib/e2p/pf.c
+++ b/lib/e2p/pf.c
@@ -45,6 +45,7 @@ static struct flags_name flags_array[] = {
{ EXT4_EXTENTS_FL, "e", "Extents" },
{ FS_NOCOW_FL, "C", "No_COW" },
{ FS_DAX_FL, "x", "DAX" },
+ { FS_ATOMICWRITES_FL, "W", "ATOMIC_WRITES" },
{ EXT4_CASEFOLD_FL, "F", "Casefold" },
{ EXT4_INLINE_DATA_FL, "N", "Inline_Data" },
{ EXT4_PROJINHERIT_FL, "P", "Project_Hierarchy" },
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index 0fc9c09a..f9dcf71f 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -346,7 +346,7 @@ struct ext2_dx_tail {
#define EXT4_EA_INODE_FL 0x00200000 /* Inode used for large EA */
/* EXT4_EOFBLOCKS_FL 0x00400000 was here */
#define FS_NOCOW_FL 0x00800000 /* Do not cow file */
-#define EXT4_SNAPFILE_FL 0x01000000 /* Inode is a snapshot */
+#define FS_ATOMICWRITES_FL 0x01000000 /* Inode can do atomic writes */
#define FS_DAX_FL 0x02000000 /* Inode is DAX */
#define EXT4_SNAPFILE_DELETED_FL 0x04000000 /* Snapshot is being deleted */
#define EXT4_SNAPFILE_SHRUNK_FL 0x08000000 /* Snapshot shrink has completed */
diff --git a/misc/chattr.1.in b/misc/chattr.1.in
index 50c54e7d..22757123 100644
--- a/misc/chattr.1.in
+++ b/misc/chattr.1.in
@@ -26,7 +26,7 @@ changes the file attributes on a Linux file system.
The format of a symbolic
.I mode
is
-.BR +-= [ aAcCdDeFijmPsStTux ].
+.BR +-= [ aAcCdDeFijmPsStTuxW ].
.PP
The operator
.RB ' + '
@@ -38,7 +38,7 @@ causes them to be removed; and
causes them to be the only attributes that the files have.
.PP
The letters
-.RB ' aAcCdDeFijmPsStTux '
+.RB ' aAcCdDeFijmPsStTuxW '
select the new attributes for the files:
append only
.RB ( a ),
@@ -74,8 +74,10 @@ top of directory hierarchy
.RB ( T ),
undeletable
.RB ( u ),
-and direct access for files
-.RB ( x ).
+direct access for files
+.RB ( x ),
+and atomic writes for files.
+.RB ( W ).
.PP
The following attributes are read-only, and may be listed by
.BR lsattr (1)
@@ -263,6 +265,14 @@ directory. If an existing directory has contained some files and
subdirectories, modifying the attribute on the parent directory doesn't
change the attributes on these files and subdirectories.
.TP
+.B W
+The 'W' attribute can only be set on a regular file. A file which has this
+attribute set can do untorn writes i.e. if an atomic write is requested by
+user with proper alignment and atomic flags set (such as RWF_ATOMIC), then
+a subsequent read to that block(s) will either read entire new data or entire
+old data (in case of a power failure). The block(s) written can never contain
+mix of both.
+.TP
.B V
A file with the 'V' attribute set has fs-verity enabled. It cannot be
written to, and the file system will automatically verify all data read
diff --git a/misc/chattr.c b/misc/chattr.c
index c7382a37..24db790e 100644
--- a/misc/chattr.c
+++ b/misc/chattr.c
@@ -86,7 +86,7 @@ static unsigned long sf;
static void usage(void)
{
fprintf(stderr,
- _("Usage: %s [-RVf] [-+=aAcCdDeijPsStTuFx] [-p project] [-v version] files...\n"),
+ _("Usage: %s [-RVf] [-+=aAcCdDeijPsStTuFxW] [-p project] [-v version] files...\n"),
program_name);
exit(1);
}
@@ -114,6 +114,7 @@ static const struct flags_char flags_array[] = {
{ EXT2_TOPDIR_FL, 'T' },
{ FS_NOCOW_FL, 'C' },
{ FS_DAX_FL, 'x' },
+ { FS_ATOMICWRITES_FL, 'W' },
{ EXT4_CASEFOLD_FL, 'F' },
{ 0, 0 }
};
--
2.39.2

2024-03-06 11:25:34

by John Garry

[permalink] [raw]

Subject: Re: [RFC 0/9] ext4: Add direct-io atomic write support using fsawu

On 02/03/2024 07:41, Ritesh Harjani (IBM) wrote:
> Hello all,
>
> This RFC series adds support for atomic writes to ext4 direct-io using
> filesystem atomic write unit. It's built on top of John's "block atomic
> write v5" series which adds RWF_ATOMIC flag interface to pwritev2() and enables
> atomic write support in underlying device driver and block layer.
>
> This series uses the same RWF_ATOMIC interface for adding atomic write support
> to ext4's direct-io path. One can utilize it by 2 of the methods explained below.
> ((1)mkfs.ext4 -b <BS>, (2) with bigalloc).
>
> Filesystem atomic write unit (fsawu):
> ============================================
> Atomic writes within ext4 can be supported using below 3 methods -
> 1. On a large pagesize system (e.g. Power with 64k pagesize or aarch64 with 64k pagesize),
> we can mkfs using different blocksizes. e.g. mkfs.ext4 -b <4k/8k/16k/32k/64k).
> Now if the underlying HW device supports atomic writes, than a corresponding
> blocksize can be chosen as a filesystem atomic write unit (fsawu) which
> should be within the underlying hw defined [awu_min, awu_max] range.
> For such filesystem, fsawu_[min|max] both are equal to blocksize (e.g. 16k)
>
> On a smaller pagesize system this can be utilized when support for LBS is
> complete (on ext4).
>
> 2. EXT4 already supports a feature called bigalloc. In that ext4 can handle
> allocation in cluster size units. So for e.g. we can create a filesystem with
> 4k blocksize but with 64k clustersize. Such a configuration can also be used
> to support atomic writes if the underlying hw device supports it.
> In such case the fsawu_min will most likely be the filesystem blocksize and
> fsawu_max will mostly likely be the cluster size.
>
> So a user can do an atomic write of any size between [fsawu_min, fsawu_max]
> range as long as it satisfies other constraints being laid out by HW device
> (or by software stack) to support atomic writes.
> e.g. len should be a power of 2, pos % len should be naturally
> aligned and [start | end] (phys offsets) should not straddle over
> an atomic write boundary.

JFYI, I gave this a quick try, and it seems to work ok. Naturally it
suffers from the same issue discussed at
https://lore.kernel.org/linux-fsdevel/[email protected]/
with regards to writing to partially written extents, which I have tried
to address properly in my v2 for that same series.

Thanks,
John

2024-03-08 20:26:05

by Ritesh Harjani

[permalink] [raw]

Subject: [RFC] ext4: Add support for ext4_map_blocks_atomic()

Currently ext4 exposes [fsawu_min, fsawu_max] size as
[blocksize, clustersize] (given the hw block device constraints are
larger than FS atomic write units).

That means a user should be allowed to -
1. pwrite 0 4k /mnt/test/f1
2. pwrite 0 16k /mnt/test/f1

w/o this patch the second atomic write will fail. Since current
ext4_map_blocks() will just return the already allocated extent length
to the iomap (which is less than the user requested write length).

So add ext4_map_blocks_atomic() function which can allocate full
requested length for doing an atomic write before returning to iomap.
With this we have -

1. touch /mnt1/test/f2
2. chattr +W /mnt1/test/f2
3. xfs_io -dc "pwrite -b 4k -A -V 1 0 4k" /mnt1/test/f2
wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0.0320 sec (124.630 KiB/sec and 31.1575 ops/sec)
4. filefrag -v /mnt1/test/f2
Filesystem type is: ef53
File size of /mnt1/test/f2 is 4096 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 9728.. 9728: 1: last,eof
/mnt1/test/f2: 1 extent found
5. xfs_io -dc "pwrite -b 16k -A -V 1 0 16k" /mnt1/test/f2
wrote 16384/16384 bytes at offset 0
16 KiB, 1 ops; 0.0337 sec (474.637 KiB/sec and 29.6648 ops/sec)
6. filefrag -v /mnt1/test/f2
Filesystem type is: ef53
File size of /mnt1/test/f2 is 16384 (4 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 3: 9728.. 9731: 4: last,eof
/mnt1/test/f2: 1 extent found

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---

Please note, that this is very minimal tested. But it serves as a PoC of what
can be done within ext4 to allow the usecase which John pointed out.

This also shows that every filesystem can have a different ways of doing aligned
allocations to support atomic writes. So lifting extent size hints to iomap
perhaps might become very XFS centric? Althouh as long as other filesystems are
not forced to follow that, I don't think it should be a problem.

fs/ext4/ext4.h | 2 ++
fs/ext4/inode.c | 40 +++++++++++++++++++++++++++++++++++++---
2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 529ca32b9813..1e9adc5d6569 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3702,6 +3702,8 @@ extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
ext4_io_end_t *io_end);
extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);
+extern int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map, int flags);
extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
int num,
struct ext4_ext_path *path);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ea009ca9085d..db273c7faf36 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,29 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
}
#endif /* ES_AGGRESSIVE_TEST */

+int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map, int flags)
+{
+ unsigned int mapped_len = 0, m_len = map->m_len;
+ ext4_lblk_t m_lblk = map->m_lblk;
+ int ret;
+
+ WARN_ON(!(flags & EXT4_GET_BLOCKS_CREATE));
+
+ do {
+ ret = ext4_map_blocks(handle, inode, map, flags);
+ if (ret < 0)
+ return ret;
+ mapped_len += map->m_len;
+ map->m_lblk += map->m_len;
+ map->m_len = m_len - mapped_len;
+ } while (mapped_len < m_len);
+
+ map->m_lblk = m_lblk;
+ map->m_len = mapped_len;
+ return mapped_len;
+}
+
/*
* The ext4_map_blocks() function tries to look up the requested blocks,
* and returns if the blocks are already mapped.
@@ -3315,7 +3338,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;

- ret = ext4_map_blocks(handle, inode, map, m_flags);
+ if (flags & IOMAP_ATOMIC)
+ ret = ext4_map_blocks_atomic(handle, inode, map, m_flags);
+ else
+ ret = ext4_map_blocks(handle, inode, map, m_flags);

/*
* We cannot fill holes in indirect tree based inodes as that could
@@ -3339,6 +3365,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
int ret;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;
+ unsigned int orig_len;

if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
@@ -3352,6 +3379,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
map.m_lblk = offset >> blkbits;
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+ orig_len = map.m_len;

if (flags & IOMAP_WRITE) {
/*
@@ -3362,9 +3390,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
*/
if (offset + length <= i_size_read(inode)) {
ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
- goto out;
+ if (map.m_flags & EXT4_MAP_MAPPED) {
+ if ((flags & IOMAP_ATOMIC && ret >= orig_len) ||
+ (!(flags & IOMAP_ATOMIC) && ret > 0))
+ goto out;
+
+ }
}
+ WARN_ON(map.m_lblk != offset >> blkbits);
+ map.m_len = orig_len;
ret = ext4_iomap_alloc(inode, &map, flags);
} else {
ret = ext4_map_blocks(NULL, inode, &map, 0);
--
2.39.2

2024-03-09 02:37:40

by Ritesh Harjani

[permalink] [raw]

Subject: Re: [RFC] ext4: Add support for ext4_map_blocks_atomic()

"Ritesh Harjani (IBM)" <[email protected]> writes:

> +int ext4_map_blocks_atomic(handle_t *handle, struct inode *inode,
> + struct ext4_map_blocks *map, int flags)
> +{
> + unsigned int mapped_len = 0, m_len = map->m_len;
> + ext4_lblk_t m_lblk = map->m_lblk;
> + int ret;
> +
> + WARN_ON(!(flags & EXT4_GET_BLOCKS_CREATE));
> +
> + do {
> + ret = ext4_map_blocks(handle, inode, map, flags);
> + if (ret < 0)
> + return ret;
> + mapped_len += map->m_len;
> + map->m_lblk += map->m_len;
> + map->m_len = m_len - mapped_len;
> + } while (mapped_len < m_len);
> +
> + map->m_lblk = m_lblk;
> + map->m_len = mapped_len;
> + return mapped_len;

ouch!
1. I need to make sure map.m_pblk is updated properly.
2. I need to make sure above call only happens with bigalloc.

Sorry about that. Generally not a good idea to send something that late
at night.
But I guess this can be fixed easily. so hopefully the algorithm should
still remain, more or less the same for ext4_map_blocks_atomic().

-ritesh

2024-03-14 15:52:45

by Ritesh Harjani

[permalink] [raw]

Subject: Re: [RFC] ext4: Add support for ext4_map_blocks_atomic()

John Garry <[email protected]> writes:

> On 08/03/2024 20:25, Ritesh Harjani (IBM) wrote:
>
> Hi Ritesh,
>
>> Currently ext4 exposes [fsawu_min, fsawu_max] size as
>> [blocksize, clustersize] (given the hw block device constraints are
>> larger than FS atomic write units).
>>
>> That means a user should be allowed to -
>> 1. pwrite 0 4k /mnt/test/f1
>> 2. pwrite 0 16k /mnt/test/f1
>>
>
> Previously you have mentioned 2 or 3 methods in which ext4 could support
> atomic writes. To avoid doubt, is this patch for the "Add intelligence
> in multi-block allocator of ext4 to provide aligned allocations (this
> option won't require any formatting)" method mentioned at
> https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> and same as method 3 at
> https://lore.kernel.org/linux-fsdevel/[email protected]/?

Hi John,

No. So this particular patch to add ext4_map_blocks_atomic() method is
only to support the usecase which you listed should work for a good user
behaviour. This is because, with bigalloc we advertizes fsawu_min and
fsawu_max as [blocksize, clustersize]
i.e.

That means a user should be allowed to -
1. pwrite 0 4k /mnt/test/f1
followed by
2. pwrite 0 16k /mnt/test/f1

So earlier we were failing the second 16k write at an offset where there
is already an existing extent smaller that 16k (that was because of the
assumption that the most of the users won't do such a thing).

But for a more general usecase, it is not difficult to support the
second 16k write in such a way for atomic writes with bigalloc,
so this patch just adds that support to this series.

-ritesh

>
>
> Thanks,
> John

2024-03-18 08:23:08

by John Garry

[permalink] [raw]

Subject: Re: [RFC] ext4: Add support for ext4_map_blocks_atomic()

On 14/03/2024 15:52, Ritesh Harjani (IBM) wrote:
>> and same as method 3 at
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/?__;!!ACWV5N9M2RV99hQ!Pb-HbBdm2OWUIGDFfG1OkemtRSy2LyHsc5s6WiyTtGHW4uGWV6sMkoVjmknmBydf_i6TF_CDqp7dR0Y-CGY8EIc$
> Hi John,
>
> No. So this particular patch to add ext4_map_blocks_atomic() method is
> only to support the usecase which you listed should work for a good user
> behaviour. This is because, with bigalloc we advertizes fsawu_min and
> fsawu_max as [blocksize, clustersize]
> i.e.
>
> That means a user should be allowed to -
> 1. pwrite 0 4k /mnt/test/f1
> followed by
> 2. pwrite 0 16k /mnt/test/f1
>
>
> So earlier we were failing the second 16k write at an offset where there
> is already an existing extent smaller that 16k (that was because of the
> assumption that the most of the users won't do such a thing).
>
> But for a more general usecase, it is not difficult to support the
> second 16k write in such a way for atomic writes with bigalloc,
> so this patch just adds that support to this series.

Is there some reason for which the generic iomap solution in
https://lore.kernel.org/linux-xfs/[email protected]/
won't work? That is, you would just need to set iomap->extent_shift
appropriately. I will note that we gate this feature on XFS based on
forcealign enabled for the inode - I am not sure if you would want this
always for bigalloc.

Thanks,
John