2024-04-22 14:43:23

by John Garry

Subject: [PATCH RFC 0/7] buffered block atomic writes

This series introduces a proof-of-concept for buffered block atomic
writes.

There is a requirement for userspace to be able to issue a write which
will not be torn due to hardware or some other failure. A solution is
presented in [0] and [1].

The series mentioned above only support atomic writes for direct IO. The
primary target of atomic (or untorn) writes is DBs like InnoDB/MySQL,
which require direct IO support. However, as mentioned in [2], there is
a desire to support atomic writes for DBs which use buffered writes, like
Postgres.

The issue raised in [2] was that the proposed API is not suitable for
buffered atomic writes. Specifically, since the API permits a range of
atomic write sizes, it is too difficult to track in the pagecache the
geometry of atomic writes which overlap with other atomic writes of
differing sizes and alignment. In addition, tracking and handling
overlapping atomic and non-atomic writes is also difficult.

In this series, buffered atomic writes are supported based upon the
following principles:
- A buffered atomic write requires the RWF_ATOMIC flag to be set, the same
as for direct IO. The other atomic write rules also apply, such as
power-of-2 size and natural alignment.
- For an inode, only a single buffered atomic write size is allowed. So for
statx, atomic_write_unit_min = atomic_write_unit_max always for
buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. Folios match
atomic writes well, as an atomic write must be a power-of-2 in size and
naturally aligned.
- A folio is tagged as "atomic" when atomically written. If any part of an
"atomic" folio is fully or partially overwritten with a non-atomic
write, the folio loses its atomicity. Indeed, issuing a non-atomic write
over an atomic write would typically be seen as a userspace bug.
- If userspace wants to guarantee that a buffered atomic write is written
to media atomically after the write syscall returns, it must use RWF_SYNC
or similar (along with RWF_ATOMIC). A usage sketch follows this list.
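
As an illustrative sketch only (RWF_ATOMIC and its numeric value are taken
from the API proposed in [0] and [1] and may change; the atomic unit size
would normally be discovered via statx, per patch 4/7, but is passed in
here for brevity), a buffered atomic write could be issued from userspace
like this:

#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_ATOMIC
#define RWF_ATOMIC	0x00000040	/* value from the proposed series */
#endif

/* Write one atomic unit at a naturally aligned file offset. */
static ssize_t atomic_buffered_write(int fd, off_t offset, size_t unit)
{
	struct iovec iov;
	void *buf;
	ssize_t ret;

	/* The write length must be a single power-of-2 atomic unit. */
	if (posix_memalign(&buf, unit, unit))
		return -1;
	memset(buf, 0xab, unit);

	iov.iov_base = buf;
	iov.iov_len = unit;

	/*
	 * No O_DIRECT: this goes through the pagecache. RWF_SYNC is what
	 * guarantees the atomic write has reached media on return.
	 */
	ret = pwritev2(fd, &iov, 1, offset, RWF_ATOMIC | RWF_SYNC);
	free(buf);
	return ret;
}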

This series supports buffered atomic writes for XFS only. I do have some
patches for buffered atomic writes for bdev file operations, but I did not
include them, as:
a. I don't know of any requirement for this support
b. atomic_write_unit_min and atomic_write_unit_max would be fixed at
PAGE_SIZE there. This is very limiting. However, an API like BLKBSZSET
could be added to allow userspace to program the values for
atomic_write_unit_{min, max}.
c. We may want to support atomic_write_unit_{min, max} < PAGE_SIZE, and
this would be more complicated to support.
d. I would like to see how the bs > ps work plays out there.

This series is just an early proof-of-concept, intended to show that the
API proposed for block atomic writes can also work for buffered IO. I would
like to unblock that direct IO series and have it merged.

Patches are based on [0], [1], and [3] (the bs > ps series). From the bs >
ps series, I borrowed an earlier filemap change which allows the
folio min and max order to be selected.

All patches can be found at:
https://github.com/johnpgarry/linux/tree/atomic-writes-v6.9-v6-fs-v2-buffered

[0] https://lore.kernel.org/linux-block/[email protected]/
[1] https://lore.kernel.org/linux-block/[email protected]/
[2] https://lore.kernel.org/linux-fsdevel/[email protected]/
[3] https://lore.kernel.org/linux-xfs/[email protected]/

John Garry (7):
fs: Rename STATX{_ATTR}_WRITE_ATOMIC -> STATX{_ATTR}_WRITE_ATOMIC_DIO
filemap: Change mapping_set_folio_min_order() ->
mapping_set_folio_orders()
mm: Add PG_atomic
fs: Add initial buffered atomic write support info to statx
fs: iomap: buffered atomic write support
fs: xfs: buffered atomic writes statx support
fs: xfs: Enable buffered atomic writes

block/bdev.c | 9 +++---
fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++++++++-----
fs/iomap/trace.h | 3 +-
fs/stat.c | 26 ++++++++++++-----
fs/xfs/libxfs/xfs_inode_buf.c | 8 +++++
fs/xfs/xfs_file.c | 12 ++++++--
fs/xfs/xfs_icache.c | 10 ++++---
fs/xfs/xfs_ioctl.c | 3 ++
fs/xfs/xfs_iops.c | 11 +++++--
include/linux/fs.h | 3 +-
include/linux/iomap.h | 1 +
include/linux/page-flags.h | 5 ++++
include/linux/pagemap.h | 20 ++++++++-----
include/trace/events/mmflags.h | 3 +-
include/uapi/linux/stat.h | 6 ++--
mm/filemap.c | 8 ++++-
16 files changed, 141 insertions(+), 40 deletions(-)

--
2.31.1



2024-04-22 14:43:45

by John Garry

Subject: [PATCH RFC 7/7] fs: xfs: Enable buffered atomic writes

Enable support for buffered atomic writes, in addition to the already
supported direct IO atomic writes.

For an inode with FS_XFLAG_ATOMICWRITES set, the folio mapping min and
max order are both set to the same value, which is derived from the
extent alignment size.

Atomic write support depends on forcealign. With forcealign, extent
sizes must be a power-of-2 and naturally aligned, and this matches
folios nicely.
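
As a worked example of the order calculation (a sketch assuming 4 KiB
blocks and 4 KiB pages, taking the XFS_B_TO_FSB() conversion below at
face value):

	/*
	 * extent alignment = 16 KiB, block size = 4 KiB, page size = 4 KiB:
	 *   XFS_B_TO_FSB(mp, 16384) = 4
	 *   ffs(4) - 1              = 2
	 * so min order == max order == 2, i.e. each atomic write is backed
	 * by exactly one 16 KiB folio in the pagecache.
	 */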

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/libxfs/xfs_inode_buf.c | 8 ++++++++
fs/xfs/xfs_file.c | 12 ++++++++++--
fs/xfs/xfs_ioctl.c | 3 +++
3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index abaef1137b97..38e058756b1e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -181,6 +181,7 @@ xfs_inode_from_disk(
struct inode *inode = VFS_I(ip);
int error;
xfs_failaddr_t fa;
+ struct xfs_mount *mp = ip->i_mount;

ASSERT(ip->i_cowfp == NULL);

@@ -261,6 +262,13 @@ xfs_inode_from_disk(
}
if (xfs_is_reflink_inode(ip))
xfs_ifork_init_cow(ip);
+
+ if (xfs_inode_atomicwrites(ip)) {
+ unsigned int folio_order = ffs(XFS_B_TO_FSB(mp, ip->i_extsize)) - 1;
+
+ mapping_set_folio_orders(VFS_I(ip)->i_mapping, folio_order, folio_order);
+ }
+
return 0;

out_destroy_data_fork:
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2fbefd60d753..d35869b5e4ce 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -782,6 +782,16 @@ xfs_file_buffered_write(
ssize_t ret;
bool cleared_space = false;
unsigned int iolock;
+ struct xfs_mount *mp = ip->i_mount;
+
+ if (iocb->ki_flags & IOCB_ATOMIC) {
+ unsigned int extsz_bytes = XFS_FSB_TO_B(mp, ip->i_extsize);
+
+ if (!generic_atomic_write_valid_size(iocb->ki_pos, from,
+ extsz_bytes, extsz_bytes)) {
+ return -EINVAL;
+ }
+ }

write_retry:
iolock = XFS_IOLOCK_EXCL;
@@ -1241,8 +1251,6 @@ static bool xfs_file_open_can_atomicwrite(
struct xfs_inode *ip = XFS_I(inode);
struct xfs_buftarg *target = xfs_inode_buftarg(ip);

- if (!(file->f_flags & O_DIRECT))
- return false;

if (!xfs_inode_atomicwrites(ip))
return false;
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d115f2601921..d6b146c999f6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1169,10 +1169,13 @@ xfs_ioctl_setattr_xflags(
}

if (atomic_writes) {
+ unsigned int folio_order = ffs(XFS_B_TO_FSB(mp, fa->fsx_extsize)) - 1;
+
if (!xfs_has_atomicwrites(mp))
return -EINVAL;
if (!(fa->fsx_xflags & FS_XFLAG_FORCEALIGN))
return -EINVAL;
+ mapping_set_folio_orders(VFS_I(ip)->i_mapping, folio_order, folio_order);
}

ip->i_diflags = xfs_flags2diflags(ip, fa->fsx_xflags);
--
2.31.1


2024-04-22 14:52:55

by John Garry

Subject: [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders()

Borrowed from:

https://lore.kernel.org/linux-fsdevel/[email protected]/
(credit given in due course)

We will need to be able to use only a single folio order for buffered
atomic writes, so allow the mapping folio order min and max to be set.

We still have the restriction of not being able to support order-1
folios - this limit will need to be lifted at some stage.
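
As a sketch of the intended use (mirroring the callers added later in this
series; min_order and order here stand for whatever the filesystem derives
from its block size and extent alignment respectively):

	/* Regular inodes: allow any large folio order from min upwards. */
	mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);

	/*
	 * Buffered atomic writes: pin the pagecache to a single folio
	 * order, so that each atomic write maps to exactly one folio.
	 */
	mapping_set_folio_orders(inode->i_mapping, order, order);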

Signed-off-by: John Garry <[email protected]>
---
fs/xfs/xfs_icache.c | 10 ++++++----
include/linux/pagemap.h | 20 +++++++++++++-------
mm/filemap.c | 8 +++++++-
3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 8fb5cf0f5a09..6186887bd6ff 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -89,8 +89,9 @@ xfs_inode_alloc(
/* VFS doesn't initialise i_mode or i_state! */
VFS_I(ip)->i_mode = 0;
VFS_I(ip)->i_state = 0;
- mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
- M_IGEO(mp)->min_folio_order);
+ mapping_set_folio_orders(VFS_I(ip)->i_mapping,
+ M_IGEO(mp)->min_folio_order,
+ MAX_PAGECACHE_ORDER);

XFS_STATS_INC(mp, vn_active);
ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -325,8 +326,9 @@ xfs_reinit_inode(
inode->i_rdev = dev;
inode->i_uid = uid;
inode->i_gid = gid;
- mapping_set_folio_min_order(inode->i_mapping,
- M_IGEO(mp)->min_folio_order);
+ mapping_set_folio_orders(inode->i_mapping,
+ M_IGEO(mp)->min_folio_order,
+ MAX_PAGECACHE_ORDER);
return error;
}

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index fc8eb9c94e9c..c22455fa28a1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
#endif

/*
- * mapping_set_folio_min_order() - Set the minimum folio order
+ * mapping_set_folio_orders() - Set the minimum and max folio order
* @mapping: The address_space.
* @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
*
* The filesystem should call this function in its inode constructor to
* indicate which base size of folio the VFS can use to cache the contents
@@ -376,15 +377,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
* Context: This should not be called while the inode is active as it
* is non-atomic.
*/
-static inline void mapping_set_folio_min_order(struct address_space *mapping,
- unsigned int min)
+
+static inline void mapping_set_folio_orders(struct address_space *mapping,
+ unsigned int min, unsigned int max)
{
- if (min > MAX_PAGECACHE_ORDER)
- min = MAX_PAGECACHE_ORDER;
+ if (min == 1)
+ min = 2;
+ if (max < min)
+ max = min;
+ if (max > MAX_PAGECACHE_ORDER)
+ max = MAX_PAGECACHE_ORDER;

mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
(min << AS_FOLIO_ORDER_MIN) |
- (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
+ (max << AS_FOLIO_ORDER_MAX);
}

/**
@@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
*/
static inline void mapping_set_large_folios(struct address_space *mapping)
{
- mapping_set_folio_min_order(mapping, 0);
+ mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
}

static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
diff --git a/mm/filemap.c b/mm/filemap.c
index d81530b0aac0..d5effe50ddcb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
no_page:
if (!folio && (fgp_flags & FGP_CREAT)) {
unsigned int min_order = mapping_min_folio_order(mapping);
- unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
+ unsigned int max_order = mapping_max_folio_order(mapping);
+ unsigned int order = FGF_GET_ORDER(fgp_flags);
int err;

+ if (order > max_order)
+ order = max_order;
+ else if (order < min_order)
+ order = max_order;
+
index = mapping_align_start_index(mapping, index);

if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
--
2.31.1


2024-04-25 14:48:06

by Pankaj Raghav (Samsung)

Subject: Re: [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders()

On Mon, Apr 22, 2024 at 02:39:18PM +0000, John Garry wrote:
> Borrowed from:
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/
> (credit given in due course)
>
> We will need to be able to use only a single folio order for buffered
> atomic writes, so allow the mapping folio order min and max to be set.

>
> We still have the restriction of not being able to support order-1
> folios - this limit will need to be lifted at some stage.

This is already supported upstream for file-backed folios:
commit: 8897277acfef7f70fdecc054073bea2542fc7a1b

> index fc8eb9c94e9c..c22455fa28a1 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> #endif
>
> /*
> - * mapping_set_folio_min_order() - Set the minimum folio order
> + * mapping_set_folio_orders() - Set the minimum and max folio order

In the new series (sorry forgot to CC you), I added a new helper called
mapping_set_folio_order_range() which does something similar to avoid
confusion based on willy's suggestion:
https://lore.kernel.org/linux-xfs/[email protected]/

mapping_set_folio_min_order() also sets the max folio order to
MAX_PAGECACHE_ORDER anyway. So no need to call it explicitly here?

> /**
> @@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
> */
> static inline void mapping_set_large_folios(struct address_space *mapping)
> {
> - mapping_set_folio_min_order(mapping, 0);
> + mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
> }
>
> static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d81530b0aac0..d5effe50ddcb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> no_page:
> if (!folio && (fgp_flags & FGP_CREAT)) {
> unsigned int min_order = mapping_min_folio_order(mapping);
> - unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
> + unsigned int max_order = mapping_max_folio_order(mapping);
> + unsigned int order = FGF_GET_ORDER(fgp_flags);
> int err;
>
> + if (order > max_order)
> + order = max_order;
> + else if (order < min_order)
> + order = max_order;

order = min_order; ?
--
Pankaj

2024-04-26 08:04:01

by John Garry

Subject: Re: [PATCH RFC 2/7] filemap: Change mapping_set_folio_min_order() -> mapping_set_folio_orders()

On 25/04/2024 15:47, Pankaj Raghav (Samsung) wrote:
> On Mon, Apr 22, 2024 at 02:39:18PM +0000, John Garry wrote:
>> Borrowed from:
>>
>> https://lore.kernel.org/linux-fsdevel/[email protected]/
>> (credit given in due course)
>>
>> We will need to be able to use only a single folio order for buffered
>> atomic writes, so allow the mapping folio order min and max to be set.
>
>>
>> We still have the restriction of not being able to support order-1
>> folios - this limit will need to be lifted at some stage.
>
> This is already supported upstream for file-backed folios:
> commit: 8897277acfef7f70fdecc054073bea2542fc7a1b

ok

>
>> index fc8eb9c94e9c..c22455fa28a1 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -363,9 +363,10 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>> #endif
>>
>> /*
>> - * mapping_set_folio_min_order() - Set the minimum folio order
>> + * mapping_set_folio_orders() - Set the minimum and max folio order
>
> In the new series (sorry forgot to CC you),

no worries, I saw it

> I added a new helper called
> mapping_set_folio_order_range() which does something similar to avoid
> confusion based on willy's suggestion:
> https://lore.kernel.org/linux-xfs/[email protected]/
>

Fine, I can include that

> mapping_set_folio_min_order() also sets the max folio order to
> MAX_PAGECACHE_ORDER anyway. So no need to call it explicitly here?
>

Here mapping_set_folio_min_order() is being replaced with
mapping_set_folio_order_range(), so I'm not sure why you mention that.
Regardless, I'll use your mapping_set_folio_order_range().

>> /**
>> @@ -400,7 +406,7 @@ static inline void mapping_set_folio_min_order(struct address_space *mapping,
>> */
>> static inline void mapping_set_large_folios(struct address_space *mapping)
>> {
>> - mapping_set_folio_min_order(mapping, 0);
>> + mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
>> }
>>
>> static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index d81530b0aac0..d5effe50ddcb 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -1898,9 +1898,15 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>> no_page:
>> if (!folio && (fgp_flags & FGP_CREAT)) {
>> unsigned int min_order = mapping_min_folio_order(mapping);
>> - unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>> + unsigned int max_order = mapping_max_folio_order(mapping);
>> + unsigned int order = FGF_GET_ORDER(fgp_flags);
>> int err;
>>
>> + if (order > max_order)
>> + order = max_order;
>> + else if (order < min_order)
>> + order = max_order;
>
> order = min_order; ?

right

Thanks,
John