2022-05-19 09:37:20

by Javier Martinez Canillas

Subject: [RFC PATCH 2/3] fat: add renameat2 RENAME_EXCHANGE flag support

The renameat2 RENAME_EXCHANGE flag allows two paths to be exchanged
atomically, but it is currently not supported by the Linux vfat filesystem
driver.

Add a vfat_rename_exchange() helper function that implements this support.

The super block lock is held for the whole operation to ensure atomicity.
In the error path, the changes already made are reverted, also with the
mutex held, making the whole operation transactional.

Signed-off-by: Javier Martinez Canillas <[email protected]>
---

fs/fat/namei_vfat.c | 153 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 152 insertions(+), 1 deletion(-)

diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index 88ccb2ee3537..6415a59eed13 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -1017,13 +1017,164 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
goto out;
}

+static int vfat_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
+ struct inode *new_dir, struct dentry *new_dentry)
+{
+ struct buffer_head *old_dotdot_bh = NULL, *new_dotdot_bh = NULL;
+ struct msdos_dir_entry *old_dotdot_de = NULL, *new_dotdot_de = NULL;
+ struct inode *old_inode, *new_inode;
+ struct timespec64 ts = current_time(old_dir);
+ loff_t old_i_pos, new_i_pos;
+ int err = 0, corrupt = 0;
+ struct super_block *sb = old_dir->i_sb;
+
+ old_inode = d_inode(old_dentry);
+ new_inode = d_inode(new_dentry);
+
+ /* Acquire super block lock for the operation to be atomic */
+ mutex_lock(&MSDOS_SB(sb)->s_lock);
+
+ /* if directories are not the same, get ".." info to update */
+ if (old_dir != new_dir) {
+ if (S_ISDIR(old_inode->i_mode))
+ if (fat_get_dotdot_entry(old_inode, &old_dotdot_bh, &old_dotdot_de)) {
+ err = -EIO;
+ goto out;
+ }
+
+ if (S_ISDIR(new_inode->i_mode))
+ if (fat_get_dotdot_entry(new_inode, &new_dotdot_bh, &new_dotdot_de)) {
+ err = -EIO;
+ goto out;
+ }
+ }
+
+ /* exchange the two dentries */
+ old_i_pos = MSDOS_I(old_inode)->i_pos;
+ new_i_pos = MSDOS_I(new_inode)->i_pos;
+
+ fat_detach(old_inode);
+ fat_detach(new_inode);
+
+ fat_attach(old_inode, new_i_pos);
+ fat_attach(new_inode, old_i_pos);
+
+ if (IS_DIRSYNC(old_dir)) {
+ err = fat_sync_inode(new_inode);
+ if (err)
+ goto error_exchange;
+ } else {
+ mark_inode_dirty(new_inode);
+ }
+
+ if (IS_DIRSYNC(new_dir)) {
+ err = fat_sync_inode(old_inode);
+ if (err)
+ goto error_exchange;
+ } else {
+ mark_inode_dirty(old_inode);
+ }
+
+ /* update ".." directory entry info */
+ if (old_dotdot_de) {
+ fat_set_start(old_dotdot_de, MSDOS_I(new_dir)->i_logstart);
+ mark_buffer_dirty_inode(old_dotdot_bh, old_inode);
+ if (IS_DIRSYNC(new_dir)) {
+ err = sync_dirty_buffer(old_dotdot_bh);
+ if (err)
+ goto error_old_dotdot;
+ }
+ drop_nlink(old_dir);
+ inc_nlink(new_dir);
+ }
+
+ if (new_dotdot_de) {
+ fat_set_start(new_dotdot_de, MSDOS_I(old_dir)->i_logstart);
+ mark_buffer_dirty_inode(new_dotdot_bh, new_inode);
+ if (IS_DIRSYNC(old_dir)) {
+ err = sync_dirty_buffer(new_dotdot_bh);
+ if (err)
+ goto error_new_dotdot;
+ }
+ drop_nlink(new_dir);
+ inc_nlink(old_dir);
+ }
+
+ /* update inode version and timestamps */
+ inode_inc_iversion(old_dir);
+ inode_inc_iversion(new_dir);
+ inode_inc_iversion(old_inode);
+ inode_inc_iversion(new_inode);
+
+ fat_truncate_time(old_dir, &ts, S_CTIME | S_MTIME);
+ fat_truncate_time(new_dir, &ts, S_CTIME | S_MTIME);
+
+ if (IS_DIRSYNC(old_dir))
+ (void)fat_sync_inode(old_dir);
+ else
+ mark_inode_dirty(old_dir);
+
+ if (IS_DIRSYNC(new_dir))
+ (void)fat_sync_inode(new_dir);
+ else
+ mark_inode_dirty(new_dir);
+out:
+ brelse(old_dotdot_bh);
+ brelse(new_dotdot_bh);
+ mutex_unlock(&MSDOS_SB(sb)->s_lock);
+
+ return err;
+
+error_new_dotdot:
+ /* data cluster is shared, serious corruption */
+ corrupt = 1;
+
+ if (new_dotdot_de) {
+ fat_set_start(new_dotdot_de, MSDOS_I(new_dir)->i_logstart);
+ mark_buffer_dirty_inode(new_dotdot_bh, new_inode);
+ corrupt |= sync_dirty_buffer(new_dotdot_bh);
+ }
+
+error_old_dotdot:
+ /* data cluster is shared, serious corruption */
+ corrupt = 1;
+
+ if (old_dotdot_de) {
+ fat_set_start(old_dotdot_de, MSDOS_I(old_dir)->i_logstart);
+ mark_buffer_dirty_inode(old_dotdot_bh, old_inode);
+ corrupt |= sync_dirty_buffer(old_dotdot_bh);
+ }
+
+error_exchange:
+ fat_detach(old_inode);
+ fat_detach(new_inode);
+
+ fat_attach(old_inode, old_i_pos);
+ fat_attach(new_inode, new_i_pos);
+
+ if (corrupt) {
+ corrupt |= fat_sync_inode(old_inode);
+ corrupt |= fat_sync_inode(new_inode);
+ }
+
+ if (corrupt < 0) {
+ fat_fs_error(new_dir->i_sb,
+ "%s: Filesystem corrupted (i_pos %lld, %lld)",
+ __func__, old_i_pos, new_i_pos);
+ }
+ goto out;
+}
+
static int vfat_rename2(struct user_namespace *mnt_userns, struct inode *old_dir,
struct dentry *old_dentry, struct inode *new_dir,
struct dentry *new_dentry, unsigned int flags)
{
- if (flags & ~RENAME_NOREPLACE)
+ if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
return -EINVAL;

+ if (flags & RENAME_EXCHANGE)
+ return vfat_rename_exchange(old_dir, old_dentry, new_dir, new_dentry);
+
/* VFS already handled RENAME_NOREPLACE, handle it as a normal rename */
return vfat_rename(old_dir, old_dentry, new_dir, new_dentry);
}
--
2.35.1



2022-05-23 06:42:58

by OGAWA Hirofumi

Subject: Re: [RFC PATCH 2/3] fat: add renameat2 RENAME_EXCHANGE flag support

Javier Martinez Canillas <[email protected]> writes:

> The renameat2 RENAME_EXCHANGE flag allows to atomically exchange two paths
> but is currently not supported by the Linux vfat filesystem driver.
>
> Add a vfat_rename_exchange() helper function that implements this support.
>
> The super block lock is acquired during the operation to ensure atomicity,
> and in the error path actions made are reversed also with the mutex held,
> making the whole operation transactional.

I haven't fully reviewed this yet (write order and races), but it basically
looks good.

> + /* if directories are not the same, get ".." info to update */
> + if (old_dir != new_dir) {
> + if (S_ISDIR(old_inode->i_mode))
> + if (fat_get_dotdot_entry(old_inode, &old_dotdot_bh, &old_dotdot_de)) {
> + err = -EIO;
> + goto out;
> + }
> + if (S_ISDIR(new_inode->i_mode))
> + if (fat_get_dotdot_entry(new_inode, &new_dotdot_bh, &new_dotdot_de)) {
> + err = -EIO;
> + goto out;
> + }
> + }

This may not be required by the Linux coding style, but please add {}

if () {
...
}

when the body is not a one-liner.

> + /* update ".." directory entry info */
> + if (old_dotdot_de) {
> + fat_set_start(old_dotdot_de, MSDOS_I(new_dir)->i_logstart);
> + mark_buffer_dirty_inode(old_dotdot_bh, old_inode);
> + if (IS_DIRSYNC(new_dir)) {
> + err = sync_dirty_buffer(old_dotdot_bh);
> + if (err)
> + goto error_old_dotdot;
> + }
> + drop_nlink(old_dir);
> + inc_nlink(new_dir);
> + }
> +
> + if (new_dotdot_de) {
> + fat_set_start(new_dotdot_de, MSDOS_I(old_dir)->i_logstart);
> + mark_buffer_dirty_inode(new_dotdot_bh, new_inode);
> + if (IS_DIRSYNC(old_dir)) {
> + err = sync_dirty_buffer(new_dotdot_bh);
> + if (err)
> + goto error_new_dotdot;
> + }
> + drop_nlink(new_dir);
> + inc_nlink(old_dir);
> + }

There is some copy-and-pasted code, for example the two blocks above. It may
be better to consolidate them into a function. If the duplication is
intentional, it is OK though.
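
(Illustrative sketch, not part of the patch or of any reply: one way the two
quoted ".." blocks might collapse into a single helper. To keep it compilable
outside the kernel, it uses hypothetical stand-in types and a stand-in for
sync_dirty_buffer(); the names dir_model, dotdot_model, model_sync and
update_dotdot_model are assumptions made for this example only.)

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects; only the fields used here. */
struct dir_model {
	int nlink;	/* link count of the directory */
	int logstart;	/* first cluster, as in MSDOS_I(dir)->i_logstart */
	bool dirsync;	/* models IS_DIRSYNC(dir) */
};

struct dotdot_model {
	int start;	/* cluster the ".." entry points at */
	bool dirty;	/* models a dirty buffer head */
};

/* Lets a caller force the modelled sync to fail. */
static bool model_sync_fails;

/* Stand-in for sync_dirty_buffer(): 0 on success, negative on error. */
static int model_sync(struct dotdot_model *de)
{
	if (model_sync_fails)
		return -5;	/* -EIO */
	de->dirty = false;
	return 0;
}

/*
 * Point one ".." entry at its new parent and move one link count from
 * the old parent to the new one. A NULL entry means the moved inode is
 * not a directory, so there is nothing to do.
 */
static int update_dotdot_model(struct dotdot_model *de,
			       struct dir_model *from, struct dir_model *to)
{
	int err;

	if (!de)
		return 0;

	de->start = to->logstart;
	de->dirty = true;
	if (to->dirsync) {
		err = model_sync(de);
		if (err)
			return err;	/* link counts are left untouched */
	}
	from->nlink--;
	to->nlink++;
	return 0;
}
```

With a helper of this shape, the two blocks would become one call with
(old_dir, new_dir) and one with (new_dir, old_dir), each followed by its own
error label.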

Thanks.
--
OGAWA Hirofumi <[email protected]>

2022-05-23 10:43:23

by Colin Walters

Subject: Re: [RFC PATCH 2/3] fat: add renameat2 RENAME_EXCHANGE flag support

On Thu, May 19, 2022, at 5:23 AM, Javier Martinez Canillas wrote:
> The renameat2 RENAME_EXCHANGE flag allows to atomically exchange two paths
> but is currently not supported by the Linux vfat filesystem driver.
>
> Add a vfat_rename_exchange() helper function that implements this support.
>
> The super block lock is acquired during the operation to ensure atomicity,
> and in the error path actions made are reversed also with the mutex held,
> making the whole operation transactional.

Transactional with respect to the mounted kernel, but AIUI because vfat does not have journaling, the semantics on hard failure are...unspecified? Is it possible for example we could see no file at all in the destination path?

This relates to https://github.com/ostreedev/ostree/issues/1951

TL;DR I'd been thinking that in order to have things be maximally robust we need to:

1. Write new desired bootloader config
2. fsync it
3. fsync containing directory (I guess for vfat really, syncfs())
4. remove old config, syncfs()

And here the bootloader would know to prefer the "new" file if it exists, and to delete the old one if it's still present on the next boot.
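
(Illustrative sketch, not from the thread: the four steps above written as a
small userspace C helper. The function name commit_new_config and the idea of
passing an already opened directory descriptor are assumptions for this
example. syncfs is invoked through syscall() only so that the sketch does not
depend on the glibc version. As discussed in this thread, none of this makes
the update atomic on disk for vfat.)

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Flush the whole filesystem that contains fd (raw syncfs syscall). */
static int sync_fs_of(int fd)
{
	return (int)syscall(SYS_syncfs, fd);
}

/*
 * Hypothetical helper: write data to new_name inside dir_fd, make it
 * durable, then remove old_name. Returns 0 on success and -1 on error
 * with errno set by the failing call.
 */
static int commit_new_config(int dir_fd, const char *new_name,
			     const char *old_name, const char *data)
{
	size_t len = strlen(data);
	int err = -1;
	int fd;

	/* 1. Write the new desired config. */
	fd = openat(dir_fd, new_name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, data, len) != (ssize_t)len)
		goto out;	/* a short write is treated as an error here */

	/* 2. fsync the file itself. */
	if (fsync(fd) < 0)
		goto out;

	/* 3. Flush the containing directory; syncfs() as suggested for vfat. */
	if (sync_fs_of(dir_fd) < 0)
		goto out;

	/* 4. Remove the old config and flush again. */
	if (unlinkat(dir_fd, old_name, 0) < 0)
		goto out;
	if (sync_fs_of(dir_fd) < 0)
		goto out;

	err = 0;
out:
	close(fd);
	return err;
}
```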

(Now obviously this is a small patch which will surely be generally useful, e.g. for tools that operate on things like mounted USB sticks, being able to do an atomic exchange at least from the running kernel PoV is just as useful as it is on other "regular" (and journaled) mounted filesystems)

So assuming we have this, I guess the flow could be:

1. rename_exchange(old, new)
2. syncfs()

? But that's assuming that the implementation of this doesn't e.g. have any "holes" where in theory we could flush an intermediate state.
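
(Illustrative sketch, not from the thread: the two-step flow above from
userspace. The helper name exchange_and_sync is an assumption for this
example. Raw syscalls are used so the sketch builds even where the C library
lacks a renameat2() wrapper. On a filesystem without RENAME_EXCHANGE support
the first step fails with EINVAL.)

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef RENAME_EXCHANGE
#define RENAME_EXCHANGE (1 << 1)	/* same value as the Linux UAPI headers */
#endif

/*
 * Hypothetical helper: atomically swap two names that live in the
 * directory referred to by dir_fd, then flush the filesystem.
 * Returns 0 on success and -1 on error with errno set.
 */
static int exchange_and_sync(int dir_fd, const char *a, const char *b)
{
	/* 1. rename_exchange(old, new) */
	if (syscall(SYS_renameat2, dir_fd, a, dir_fd, b, RENAME_EXCHANGE) < 0)
		return -1;

	/* 2. syncfs() on the filesystem that holds the directory */
	if (syscall(SYS_syncfs, dir_fd) < 0)
		return -1;

	return 0;
}
```

The syncfs() call only narrows the window in which a crash can expose an
intermediate on-disk state; whether such a state can exist depends on the
filesystem implementation.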



2022-05-23 15:34:39

by Javier Martinez Canillas

Subject: Re: [RFC PATCH 2/3] fat: add renameat2 RENAME_EXCHANGE flag support

Hello Colin,

Thanks for your feedback.

On 5/23/22 12:40, Colin Walters wrote:
> On Thu, May 19, 2022, at 5:23 AM, Javier Martinez Canillas wrote:
>> The renameat2 RENAME_EXCHANGE flag allows to atomically exchange two paths
>> but is currently not supported by the Linux vfat filesystem driver.
>>
>> Add a vfat_rename_exchange() helper function that implements this support.
>>
>> The super block lock is acquired during the operation to ensure atomicity,
>> and in the error path actions made are reversed also with the mutex held,
>> making the whole operation transactional.
>
> Transactional with respect to the mounted kernel, but AIUI because vfat does not have journaling, the semantics on hard failure are...unspecified? Is it possible for example we could see no file at all in the destination path?
>

That's correct, it's transactional within the constraints imposed by vfat.
That is, there's no journal replay that would be done if something gets
corrupted in the filesystem.

But I believe that's also true with any journaled filesystem and GRUB too?
GRUB doesn't mount filesystems; it just attempts to read them without doing
any journal replay. Even if it is able to detect that something is wrong
with the filesystem, it just tries on a best-effort basis, e.g.:

https://git.savannah.gnu.org/cgit/grub.git/commit/?id=777276063e2

About the semantics for a hard failure: that's not documented in the man
page for the renameat2(2) system call, but what most filesystems do AFAICT
is to revert the operation if possible and print an error.

I don't think that not having a file at all at destination is a possible
outcome of a failure since the function does a detach, attach and sync
and only the sync can fail.

If the sync fails, then the detach/attach are reverted and another sync
is attempted. If this succeeds, then the old state would be preserved
and if it fails, then no sync was made so it should be good too I think.
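
(Illustrative sketch, not part of the patch: a tiny userspace model of the
"exchange, sync, revert on failure" sequence described above. The names
inode_model, model_sync_inode, model_exchange and syncs_before_failure are
assumptions for this example. It models only the in-memory positions and says
nothing about what reaches the disk if power is lost.)

```c
/* Hypothetical stand-in for an in-core inode: only its dir-entry position. */
struct inode_model {
	long long i_pos;
};

/*
 * Number of model_sync_inode() calls that succeed before one fails.
 * A negative value means the modelled sync never fails.
 */
static int syncs_before_failure = -1;

/* Stand-in for fat_sync_inode(): 0 on success, negative on error. */
static int model_sync_inode(struct inode_model *inode)
{
	(void)inode;
	if (syncs_before_failure == 0)
		return -5;	/* -EIO */
	if (syncs_before_failure > 0)
		syncs_before_failure--;
	return 0;
}

/* Swap the two positions; restore both if either sync fails. */
static int model_exchange(struct inode_model *old_inode,
			  struct inode_model *new_inode)
{
	long long old_i_pos = old_inode->i_pos;
	long long new_i_pos = new_inode->i_pos;
	int err;

	/* detach + attach: each inode takes over the other's position */
	old_inode->i_pos = new_i_pos;
	new_inode->i_pos = old_i_pos;

	err = model_sync_inode(new_inode);
	if (!err)
		err = model_sync_inode(old_inode);

	if (err) {
		/* mirrors the error_exchange label: put both inodes back */
		old_inode->i_pos = old_i_pos;
		new_inode->i_pos = new_i_pos;
	}
	return err;
}
```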

But I'm not a filesystem expert so maybe someone else more familiar with
vfat and filesystems in general could chime in.

> This relates to https://github.com/ostreedev/ostree/issues/1951
>
> TL;DR I'd been thinking that in order to have things be maximally robust we need to:
>
> 1. Write new desired bootloader config
> 2. fsync it
> 3. fsync containing directory (I guess for vfat really, syncfs())
> 4. remove old config, syncfs()
>

Yes, I've seen that issue before but I (wrongly) understood that it was a
way to work around the lack of renameat2(..., RENAME_EXCHANGE) in vfat. On
a second read I see that you also mention the issue of journaled fs writes
vs no replay in the bootloader that I mentioned above. So it makes sense
to do the two-phase commit even for journaled filesystems.

> And here the bootloader would know to prefer the "new" file if it exists, and to delete the old one if it's still present on the next boot.
>

This is the disadvantage of this approach: we would then need to make
all bootloaders aware of the two-phase commit as well. I'm OK with that but
then I believe that we should document the expectations clearly as part
of https://systemd.io/BOOT_LOADER_SPECIFICATION/.

Anyway, I don't think this is the place to discuss that, and we should
just focus on the actual kernel patches :)

> (Now obviously this is a small patch which will surely be generally useful, e.g. for tools that operate on things like mounted USB sticks, being able to do an atomic exchange at least from the running kernel PoV is just as useful as it is on other "regular" (and journaled) mounted filesystems)
>

Agreed. I think that it wouldn't hurt to have this implementation in vfat.

> So assuming we have this, I guess the flow could be:
>
> 1. rename_exchange(old, new)
> 2. syncfs()
>

Correct. In fact, Alex pointed out to me that I should sync in the test too
before checking that the rename succeeded. I was mostly interested in whether
the logic worked even if only the in-memory representation or page cache was
used. But I've added a `sudo sync -f "${MNT_PATH}"` for the next iteration.

> ? But that's assuming that the implementation of this doesn't e.g. have any "holes" where in theory we could flush an intermediate state.
>

Ogawa said that he hasn't fully reviewed it yet but gave useful feedback that
I will also address in the next version. As said, this is my first
contribution to a filesystem driver, so it would be good if people with more
experience could let me know if there are holes in the implementation.

--
Best regards,

Javier Martinez Canillas
Linux Engineering
Red Hat


2022-05-23 15:38:32

by Javier Martinez Canillas

Subject: Re: [RFC PATCH 2/3] fat: add renameat2 RENAME_EXCHANGE flag support

Hello OGAWA,

Thanks a lot for your feedback.

On 5/22/22 19:42, OGAWA Hirofumi wrote:
> Javier Martinez Canillas <[email protected]> writes:
>
>> The renameat2 RENAME_EXCHANGE flag allows to atomically exchange two paths
>> but is currently not supported by the Linux vfat filesystem driver.
>>
>> Add a vfat_rename_exchange() helper function that implements this support.
>>
>> The super block lock is acquired during the operation to ensure atomicity,
>> and in the error path actions made are reversed also with the mutex held,
>> making the whole operation transactional.
>
> I'm not fully reviewed yet though (write order and race), basically
> looks like good.
>

Thanks for looking at the patch. I agree with all your remarks and will
address them in v2. Please let me know, once you have reviewed it, whether
it is OK from a write order and race point of view.

--
Best regards,

Javier Martinez Canillas
Linux Engineering
Red Hat


2022-05-23 17:08:24

by OGAWA Hirofumi

Subject: Re: [RFC PATCH 2/3] fat: add renameat2 RENAME_EXCHANGE flag support

Javier Martinez Canillas <[email protected]> writes:

>> So assuming we have this, I guess the flow could be:
>>
>> 1. rename_exchange(old, new)
>> 2. syncfs()
>>
>
> Correct. In fact, Alex pointed me out that I should do sync in the test too
> before checking that the rename succeeded. I was mostly interested that the
> logic worked even if only the in-memory representation or page cache was
> used. But I've added a `sudo sync -f "${MNT_PATH}"` for the next iteration.
>
>> ? But that's assuming that the implementation of this doesn't e.g. have any "holes" where in theory we could flush an intermediate state.
>>
>
> Ogawa said that didn't fully review it yet but gave useful feedback that I
> will also address in the next version. As said, is my first contribution to
> a filesystem driver so it would be good if people with more experience can
> let me know if there are holes in the implementation.

I'm not reading the emails about ostree and related topics, so I may not
understand the issue. But if you are expecting atomicity on disk (not
in-core), rename exchange can't provide it on vfat without a non-standard
extension such as adding a journal. And even syncfs(2) can't prevent rename
corruption; syncfs(2) can only minimize the race window.

If a power failure happens during a rename exchange, the file may be lost in
the worst case. (With a journal, the file could be recovered to its state
before or after the rename exchange during journal replay, but as you know
vfat can't do that.)

Thanks.
--
OGAWA Hirofumi <[email protected]>