LinuxLists.cc - [PATCH] ovl: skip overlayfs superblocks at global sync

2020-04-09 08:31:12

Subject: [PATCH] ovl: skip overlayfs superblocks at global sync

Stacked filesystems like overlayfs has no own writeback, but they have to
forward syncfs() requests to backend for keeping data integrity.

During global sync() each overlayfs instance calls method ->sync_fs()
for backend although it itself is in global list of superblocks too.
As a result one syscall sync() could write one superblock several times
and send multiple disk barriers.

This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.

Reported-by: Dmitry Monakhov <[email protected]>
Signed-off-by: Konstantin Khlebnikov <[email protected]>
---
fs/overlayfs/super.c | 5 +++--
fs/sync.c | 3 ++-
include/linux/fs.h | 2 ++
3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 732ad5495c92..59df13d16280 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -261,8 +261,8 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
return 0;

/*
- * If this is a sync(2) call or an emergency sync, all the super blocks
- * will be iterated, including upper_sb, so no need to do anything.
+ * Not called for sync(2) call or an emergency sync (SB_I_SKIP_SYNC).
+ * All the super blocks will be iterated, including upper_sb.
*
* If this is a syncfs(2) call, then we do need to call
* sync_filesystem() on upper_sb, but enough if we do it when being
@@ -1818,6 +1818,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
sb->s_xattr = ovl_xattr_handlers;
sb->s_fs_info = ofs;
sb->s_flags |= SB_POSIXACL;
+ sb->s_iflags |= SB_I_SKIP_SYNC;

err = -ENOMEM;
root_dentry = ovl_get_root(sb, upperpath.dentry, oe);
diff --git a/fs/sync.c b/fs/sync.c
index 4d1ff010bc5a..16c2630ee4bf 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -76,7 +76,8 @@ static void sync_inodes_one_sb(struct super_block *sb, void *arg)

static void sync_fs_one_sb(struct super_block *sb, void *arg)
{
- if (!sb_rdonly(sb) && sb->s_op->sync_fs)
+ if (!sb_rdonly(sb) && !(sb->s_iflags & SB_I_SKIP_SYNC) &&
+ sb->s_op->sync_fs)
sb->s_op->sync_fs(sb, *(int *)arg);
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4f6f59b4f22a..f186a966a36c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1409,6 +1409,8 @@ extern int send_sigurg(struct fown_struct *fown);
#define SB_I_IMA_UNVERIFIABLE_SIGNATURE 0x00000020
#define SB_I_UNTRUSTED_MOUNTER 0x00000040

+#define SB_I_SKIP_SYNC 0x00000100 /* Skip superblock at global sync */
+
/* Possible states of 'frozen' field */
enum {
SB_UNFROZEN = 0, /* FS is unfrozen */

2020-04-09 10:24:38

by Amir Goldstein

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
<[email protected]> wrote:
>
> Stacked filesystems like overlayfs has no own writeback, but they have to
> forward syncfs() requests to backend for keeping data integrity.
>
> During global sync() each overlayfs instance calls method ->sync_fs()
> for backend although it itself is in global list of superblocks too.
> As a result one syscall sync() could write one superblock several times
> and send multiple disk barriers.
>
> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>
> Reported-by: Dmitry Monakhov <[email protected]>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> ---

Seems reasonable.
You may add:
Reviewed-by: Amir Goldstein <[email protected]>

+CC: containers list

This bring up old memories.
I posted this way back to fix handling of emergency_remount() in the
presence of loop mounted fs:
https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/

But seems to me that emergency_sync() and sync(2) are equally broken
for this use case.

I wonder if anyone cares enough about resilience of loop mounted fs to try
and change the iterate_* functions to iterate supers/bdevs in reverse order...

Thanks,
Amir.

2020-04-09 11:30:10

by Konstantin Khlebnikov

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On 09/04/2020 13.23, Amir Goldstein wrote:
> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
> <[email protected]> wrote:
>>
>> Stacked filesystems like overlayfs has no own writeback, but they have to
>> forward syncfs() requests to backend for keeping data integrity.
>>
>> During global sync() each overlayfs instance calls method ->sync_fs()
>> for backend although it itself is in global list of superblocks too.
>> As a result one syscall sync() could write one superblock several times
>> and send multiple disk barriers.
>>
>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>>
>> Reported-by: Dmitry Monakhov <[email protected]>
>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>> ---
>
> Seems reasonable.
> You may add:
> Reviewed-by: Amir Goldstein <[email protected]>
>
> +CC: containers list

Thanks

>
> This bring up old memories.
> I posted this way back to fix handling of emergency_remount() in the
> presence of loop mounted fs:
> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
>
> But seems to me that emergency_sync() and sync(2) are equally broken
> for this use case.
>
> I wonder if anyone cares enough about resilience of loop mounted fs to try
> and change the iterate_* functions to iterate supers/bdevs in reverse order...

Now I see reason behind "sync; sync; sync; reboot" =)

Order old -> new allows to not miss new items if list modifies.
Might be important for some users.

bdev iteration seems already reversed: inode_sb_list_add adds to the head

>
> Thanks,
> Amir.
>

2020-04-09 11:51:25

by Amir Goldstein

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
<[email protected]> wrote:
>
> On 09/04/2020 13.23, Amir Goldstein wrote:
> > On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
> > <[email protected]> wrote:
> >>
> >> Stacked filesystems like overlayfs has no own writeback, but they have to
> >> forward syncfs() requests to backend for keeping data integrity.
> >>
> >> During global sync() each overlayfs instance calls method ->sync_fs()
> >> for backend although it itself is in global list of superblocks too.
> >> As a result one syscall sync() could write one superblock several times
> >> and send multiple disk barriers.
> >>
> >> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
> >>
> >> Reported-by: Dmitry Monakhov <[email protected]>
> >> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> >> ---
> >
> > Seems reasonable.
> > You may add:
> > Reviewed-by: Amir Goldstein <[email protected]>
> >
> > +CC: containers list
>
> Thanks
>
> >
> > This bring up old memories.
> > I posted this way back to fix handling of emergency_remount() in the
> > presence of loop mounted fs:
> > https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> >
> > But seems to me that emergency_sync() and sync(2) are equally broken
> > for this use case.
> >
> > I wonder if anyone cares enough about resilience of loop mounted fs to try
> > and change the iterate_* functions to iterate supers/bdevs in reverse order...
>
> Now I see reason behind "sync; sync; sync; reboot" =)
>
> Order old -> new allows to not miss new items if list modifies.
> Might be important for some users.
>

That's not the reason I suggested reverse order.
The reason is that with loop mounted fs, the correct order of flushing is:
1. sync loop mounted fs inodes => writes to loop image file
2. sync loop mounted fs sb => fsyncs the loop image file
3. sync the loop image host fs sb

With forward sb iteration order, #3 happens before #1, so the
loop mounted fs changes are not really being made durable by
a single sync(2) call.

> bdev iteration seems already reversed: inode_sb_list_add adds to the head
>

I think bdev iteration order will not make a difference in this case.
flushing /dev/loopX will not be needed and it happens too late
anyway.

Thanks,
Amir.

2020-04-09 12:05:43

by Konstantin Khlebnikov

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On 09/04/2020 14.48, Amir Goldstein wrote:
> On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
> <[email protected]> wrote:
>>
>> On 09/04/2020 13.23, Amir Goldstein wrote:
>>> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
>>> <[email protected]> wrote:
>>>>
>>>> Stacked filesystems like overlayfs has no own writeback, but they have to
>>>> forward syncfs() requests to backend for keeping data integrity.
>>>>
>>>> During global sync() each overlayfs instance calls method ->sync_fs()
>>>> for backend although it itself is in global list of superblocks too.
>>>> As a result one syscall sync() could write one superblock several times
>>>> and send multiple disk barriers.
>>>>
>>>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>>>>
>>>> Reported-by: Dmitry Monakhov <[email protected]>
>>>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
>>>> ---
>>>
>>> Seems reasonable.
>>> You may add:
>>> Reviewed-by: Amir Goldstein <[email protected]>
>>>
>>> +CC: containers list
>>
>> Thanks
>>
>>>
>>> This bring up old memories.
>>> I posted this way back to fix handling of emergency_remount() in the
>>> presence of loop mounted fs:
>>> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
>>>
>>> But seems to me that emergency_sync() and sync(2) are equally broken
>>> for this use case.
>>>
>>> I wonder if anyone cares enough about resilience of loop mounted fs to try
>>> and change the iterate_* functions to iterate supers/bdevs in reverse order...
>>
>> Now I see reason behind "sync; sync; sync; reboot" =)
>>
>> Order old -> new allows to not miss new items if list modifies.
>> Might be important for some users.
>>
>
> That's not the reason I suggested reverse order.
> The reason is that with loop mounted fs, the correct order of flushing is:
> 1. sync loop mounted fs inodes => writes to loop image file
> 2. sync loop mounted fs sb => fsyncs the loop image file
> 3. sync the loop image host fs sb
>
> With forward sb iteration order, #3 happens before #1, so the
> loop mounted fs changes are not really being made durable by
> a single sync(2) call.

If fs in loop mounted with barriers then sync_fs will issue
REQ_OP_FLUSH to loop device and trigger fsync() for image file.
Sync() might write something twice but data should be safe.
Without barriers this scenario is broken for sure.

Emergency remount R/O is other thing. It really needs reverse order.

>
>> bdev iteration seems already reversed: inode_sb_list_add adds to the head
>>
>
> I think bdev iteration order will not make a difference in this case.
> flushing /dev/loopX will not be needed and it happens too late
> anyway.
>
> Thanks,
> Amir.
>

2020-04-09 16:16:54

by Amir Goldstein

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On Thu, Apr 9, 2020 at 3:04 PM Konstantin Khlebnikov
<[email protected]> wrote:
>
>
>
> On 09/04/2020 14.48, Amir Goldstein wrote:
> > On Thu, Apr 9, 2020 at 2:28 PM Konstantin Khlebnikov
> > <[email protected]> wrote:
> >>
> >> On 09/04/2020 13.23, Amir Goldstein wrote:
> >>> On Thu, Apr 9, 2020 at 11:30 AM Konstantin Khlebnikov
> >>> <[email protected]> wrote:
> >>>>
> >>>> Stacked filesystems like overlayfs has no own writeback, but they have to
> >>>> forward syncfs() requests to backend for keeping data integrity.
> >>>>
> >>>> During global sync() each overlayfs instance calls method ->sync_fs()
> >>>> for backend although it itself is in global list of superblocks too.
> >>>> As a result one syscall sync() could write one superblock several times
> >>>> and send multiple disk barriers.
> >>>>
> >>>> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
> >>>>
> >>>> Reported-by: Dmitry Monakhov <[email protected]>
> >>>> Signed-off-by: Konstantin Khlebnikov <[email protected]>
> >>>> ---
> >>>
> >>> Seems reasonable.
> >>> You may add:
> >>> Reviewed-by: Amir Goldstein <[email protected]>
> >>>
> >>> +CC: containers list
> >>
> >> Thanks
> >>
> >>>
> >>> This bring up old memories.
> >>> I posted this way back to fix handling of emergency_remount() in the
> >>> presence of loop mounted fs:
> >>> https://lore.kernel.org/linux-ext4/CAA2m6vfatWKS1CQFpaRbii2AXiZFvQUjVvYhGxWTSpz+2rxDyg@mail.gmail.com/
> >>>
> >>> But seems to me that emergency_sync() and sync(2) are equally broken
> >>> for this use case.
> >>>
> >>> I wonder if anyone cares enough about resilience of loop mounted fs to try
> >>> and change the iterate_* functions to iterate supers/bdevs in reverse order...
> >>
> >> Now I see reason behind "sync; sync; sync; reboot" =)
> >>
> >> Order old -> new allows to not miss new items if list modifies.
> >> Might be important for some users.
> >>
> >
> > That's not the reason I suggested reverse order.
> > The reason is that with loop mounted fs, the correct order of flushing is:
> > 1. sync loop mounted fs inodes => writes to loop image file
> > 2. sync loop mounted fs sb => fsyncs the loop image file
> > 3. sync the loop image host fs sb
> >
> > With forward sb iteration order, #3 happens before #1, so the
> > loop mounted fs changes are not really being made durable by
> > a single sync(2) call.
>
> If fs in loop mounted with barriers then sync_fs will issue
> REQ_OP_FLUSH to loop device and trigger fsync() for image file.
> Sync() might write something twice but data should be safe.
> Without barriers this scenario is broken for sure.
>
> Emergency remount R/O is other thing. It really needs reverse order.
>

Correct. There is no problem with durability.
Although for some filesystems it would be more efficient to first
write and fsync the loop images and then sync_fs().
I can potentially result in less overall disk barriers.

Thanks,
Amir.

2020-04-11 22:34:55

by Dave Chinner

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote:
> Stacked filesystems like overlayfs has no own writeback, but they have to
> forward syncfs() requests to backend for keeping data integrity.
>
> During global sync() each overlayfs instance calls method ->sync_fs()
> for backend although it itself is in global list of superblocks too.
> As a result one syscall sync() could write one superblock several times
> and send multiple disk barriers.
>
> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.

Why wouldn't you just remove the ->sync_fs method from overlay?

I mean, if you don't need the filesystem to do anything special for
one specific data integrity sync_fs call, you don't need it for any
of them, yes?

-Dave.
--
Dave Chinner
[email protected]

2020-04-12 06:49:41

by Amir Goldstein

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On Sun, Apr 12, 2020 at 1:29 AM Dave Chinner <[email protected]> wrote:
>
> On Thu, Apr 09, 2020 at 11:29:47AM +0300, Konstantin Khlebnikov wrote:
> > Stacked filesystems like overlayfs has no own writeback, but they have to
> > forward syncfs() requests to backend for keeping data integrity.
> >
> > During global sync() each overlayfs instance calls method ->sync_fs()
> > for backend although it itself is in global list of superblocks too.
> > As a result one syscall sync() could write one superblock several times
> > and send multiple disk barriers.
> >
> > This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>
> Why wouldn't you just remove the ->sync_fs method from overlay?
>
> I mean, if you don't need the filesystem to do anything special for
> one specific data integrity sync_fs call, you don't need it for any
> of them, yes?
>

No, but I understand the confusion.

Say you have 1000 overlay sb's all of them using upper directories
from a single xfs sb (quite common for containers).

syncfs(2) of each overlay, must call sync_fs of xfs (see ovl_sync_fs)
sync(2) will call xfs sync_fs anyway, so there is no point in calling
ovl_sync_fs => xfs sync_fs 1000 more times.

Thanks,
Amir.

2020-04-21 09:38:37

by Miklos Szeredi

[permalink] [raw]

Subject: Re: [PATCH] ovl: skip overlayfs superblocks at global sync

On Thu, Apr 9, 2020 at 10:29 AM Konstantin Khlebnikov
<[email protected]> wrote:
>
> Stacked filesystems like overlayfs has no own writeback, but they have to
> forward syncfs() requests to backend for keeping data integrity.
>
> During global sync() each overlayfs instance calls method ->sync_fs()
> for backend although it itself is in global list of superblocks too.
> As a result one syscall sync() could write one superblock several times
> and send multiple disk barriers.
>
> This patch adds flag SB_I_SKIP_SYNC into sb->sb_iflags to avoid that.
>
> Reported-by: Dmitry Monakhov <[email protected]>
> Signed-off-by: Konstantin Khlebnikov <[email protected]>

Thanks, applied.

Miklos