2016-10-04 08:54:02

by Pierre Morel

[permalink] [raw]
Subject: [PATCH] return the right error in thaw_bdev()

When triggering thaw-filesystems via magic sysrq, the system freezes.

It is due to an error in thaw_bdev() for which I sent a
patch in august 2015, but the patch has not been applied and
the problem remains.

Since the original subject "always return an error..." may
be confusing, I changed the subject line and rebased the patch
on Linux 4.8-rc8.


Pierre Morel (1):
fs/block_dev.c: return the right error in thaw_bdev()

fs/block_dev.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

--
2.5.0


2016-10-04 08:53:48

by Pierre Morel

[permalink] [raw]
Subject: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()

When triggering thaw-filesystems via magic sysrq, the system enters a
loop in do_thaw_one(), as thaw_bdev() still returns success if
bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
error (and simplify the code a bit at the same time).

Reviewed-by: Eric Farman <[email protected]>
Reviewed-by: Cornelia Huck <[email protected]>
Signed-off-by: Pierre Morel <[email protected]>
---
fs/block_dev.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 08ae993..0e7a8f4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -302,14 +302,11 @@ int thaw_bdev(struct block_device *bdev, struct super_block *sb)
error = sb->s_op->thaw_super(sb);
else
error = thaw_super(sb);
- if (error) {
+ if (error)
bdev->bd_fsfreeze_count++;
- mutex_unlock(&bdev->bd_fsfreeze_mutex);
- return error;
- }
out:
mutex_unlock(&bdev->bd_fsfreeze_mutex);
- return 0;
+ return error;
}
EXPORT_SYMBOL(thaw_bdev);

--
2.5.0

2016-10-04 09:06:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()

On Tue 04-10-16 10:53:40, Pierre Morel wrote:
> When triggering thaw-filesystems via magic sysrq, the system enters a
> loop in do_thaw_one(), as thaw_bdev() still returns success if
> bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
> error (and simplify the code a bit at the same time).
>
> Reviewed-by: Eric Farman <[email protected]>
> Reviewed-by: Cornelia Huck <[email protected]>
> Signed-off-by: Pierre Morel <[email protected]>

The patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Jens, can you please merge this patch? Thanks!

Honza

> ---
> fs/block_dev.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 08ae993..0e7a8f4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -302,14 +302,11 @@ int thaw_bdev(struct block_device *bdev, struct super_block *sb)
> error = sb->s_op->thaw_super(sb);
> else
> error = thaw_super(sb);
> - if (error) {
> + if (error)
> bdev->bd_fsfreeze_count++;
> - mutex_unlock(&bdev->bd_fsfreeze_mutex);
> - return error;
> - }
> out:
> mutex_unlock(&bdev->bd_fsfreeze_mutex);
> - return 0;
> + return error;
> }
> EXPORT_SYMBOL(thaw_bdev);
>
> --
> 2.5.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-10-05 06:47:51

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()

On Tue, Oct 04, 2016 at 11:06:15AM +0200, Jan Kara wrote:
> On Tue 04-10-16 10:53:40, Pierre Morel wrote:
> > When triggering thaw-filesystems via magic sysrq, the system enters a
> > loop in do_thaw_one(), as thaw_bdev() still returns success if
> > bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
> > error (and simplify the code a bit at the same time).
> >
>
> The patch looks good.
>

Now that I had a closer look, while the patch indeed gets rid of the
infinite loop, the functionality itself does not work properly.

Note I'm not familiar with this code, so chances are I got details
wrong. I also don't know the original reasoning.

The current state is that you can freeze by calling either freeze_super
or freeze_bdev. The latter bumps bd_fsfreeze_count and calls
freeze_super. freeze_super does NOT modify bd_fsfreeze_count.

freeze_bdev is used by device mapper, xfs and e2fs.

freeze_super is used by the FIFREEZE ioctl.

This disparity leads to breakage.

Quick look suggests device freezing has its users and sb is optionally
present. So the fix would consist on making all freezers call the bdev
variant.

The j sysrq ends up not working in 2 ways.
1. It ends up calling do_thaw_one -> thaw_bdev. It will look at the
bd_fsfreeze_count counter and error out. But if the user froze the fs
with the ioctl, the counter is not modified and the fs in question ends
up not being thawed.

2. (assuming 1 is fixed) Since freezing through freeze_bdev supports
nesting, multiple invocations of the sysrq may be needed to actually
thaw.

Now, looking at *_bdev functions:

> struct super_block *freeze_bdev(struct block_device *bdev)
> {
> struct super_block *sb;
> int error = 0;
>
> mutex_lock(&bdev->bd_fsfreeze_mutex);
> if (++bdev->bd_fsfreeze_count > 1) {

No limit is put in place so in principle this will eventually turn negative.

> /*
> * We don't even need to grab a reference - the first call
> * to freeze_bdev grab an active reference and only the last
> * thaw_bdev drops it.
> */
> sb = get_super(bdev);
> if (sb)
> drop_super(sb);
> mutex_unlock(&bdev->bd_fsfreeze_mutex);

The code assumes nobody thaws the fs from under the consumer.

That is, code doing sb = freeze_bdev(b); ...; thaw_bdev(b, sb); risks
use-after-free if the user thawed the fs.

I did not check if current consumers hold the object in different ways
and thus avoiding the bug.

> return sb;
> }
>
> sb = get_active_super(bdev);
> if (!sb)
> goto out;
> if (sb->s_op->freeze_super)
> error = sb->s_op->freeze_super(sb);
> else
> error = freeze_super(sb);

This differs from the ioctl version, which has this check:

> if (sb->s_op->freeze_fs == NULL && sb->s_op->freeze_super == NULL)
> return -EOPNOTSUPP;

For a filesystem with freeze_fs == NULL, freeze_super will return 0 and
freeze_bdev will end up returnig 0.

I don't know if this turns into a real problem.

> int thaw_bdev(struct block_device *bdev, struct super_block *sb)
[snip]
> if (sb->s_op->thaw_super)
> error = sb->s_op->thaw_super(sb);
> else
> error = thaw_super(sb);
> if (error) {
> bdev->bd_fsfreeze_count++;
> mutex_unlock(&bdev->bd_fsfreeze_mutex);
> return error;
> }

The fact that someone else can thaw makes it very fishy to pass super
blocks around in the first place. In particular:
kernel freeze -> user thaw -> mount -> user freeze -> kernel thaw

Are we guaranteed that the sb returned for the kernel freeze is the same
which was used for user freeze?

That said, rough proposed fix is as follows:
- store the pointer to the sb used for freezing in bdev
- drop the sb argument from thaw_bdev
- optionally return a referenced sb from freeze_bdev, otherwise just
return NULL on success
- convert all freeze consumers to use *_bdev variants
- somewhat cosmetic, introduce nesting limit for freezes (say 2048 or
whatever, just to have something)

The ioctl used use freeze_bdev but this was changed with 18e9e5104fcd
"Introduce freeze_super and thaw_super for the fsfreeze ioctl" which
cites btrfs as the reason for freeze_super as opposed to freeze_bdev.

CC'ing the author for comments.

--
Mateusz Guzik

2016-10-05 09:59:13

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()

On Wed 05-10-16 08:47:42, Mateusz Guzik wrote:
> On Tue, Oct 04, 2016 at 11:06:15AM +0200, Jan Kara wrote:
> > On Tue 04-10-16 10:53:40, Pierre Morel wrote:
> > > When triggering thaw-filesystems via magic sysrq, the system enters a
> > > loop in do_thaw_one(), as thaw_bdev() still returns success if
> > > bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
> > > error (and simplify the code a bit at the same time).
> > >
> >
> > The patch looks good.
> >
>
> Now that I had a closer look, while the patch indeed gets rid of the
> infinite loop, the functionality itself does not work properly.
>
> Note I'm not familiar with this code, so chances are I got details
> wrong. I also don't know the original reasoning.
>
> The current state is that you can freeze by calling either freeze_super
> or freeze_bdev. The latter bumps bd_fsfreeze_count and calls
> freeze_super. freeze_super does NOT modify bd_fsfreeze_count.
>
> freeze_bdev is used by device mapper, xfs and e2fs.

Where is freeze_bdev() used by e2fs?

> freeze_super is used by the FIFREEZE ioctl.
>
> This disparity leads to breakage.
>
> Quick look suggests device freezing has its users and sb is optionally
> present. So the fix would consist on making all freezers call the bdev
> variant.

Except that e.g. for btrfs which does its own device management, there is
no bdev to freeze. As you properly note below, that was the reason why
FIFREEZE has been introduced. Also there are other filesystems which need
not necessarily exist on top of a block device but where freezing makes
sense (e.g. filesystems working directly on top of flash). So this is not
going to fly. That being said I fully agree that having two different paths
to freeze the filesystem which behave slightly differently is a headache.

> The j sysrq ends up not working in 2 ways.
> 1. It ends up calling do_thaw_one -> thaw_bdev. It will look at the
> bd_fsfreeze_count counter and error out. But if the user froze the fs
> with the ioctl, the counter is not modified and the fs in question ends
> up not being thawed.

Yes. Right fix for this would be to call thaw_super() for each superblock
from do_thaw_one() just to be sure.

> 2. (assuming 1 is fixed) Since freezing through freeze_bdev supports
> nesting, multiple invocations of the sysrq may be needed to actually
> thaw.

This is not true. The whole point of the loop in do_thaw_one() is to get
bd_fsfreeze_count to 0...

> Now, looking at *_bdev functions:
>
> > struct super_block *freeze_bdev(struct block_device *bdev)
> > {
> > struct super_block *sb;
> > int error = 0;
> >
> > mutex_lock(&bdev->bd_fsfreeze_mutex);
> > if (++bdev->bd_fsfreeze_count > 1) {
>
> No limit is put in place so in principle this will eventually turn negative.

Yeah, ok, send a fix...

> > /*
> > * We don't even need to grab a reference - the first call
> > * to freeze_bdev grab an active reference and only the last
> > * thaw_bdev drops it.
> > */
> > sb = get_super(bdev);
> > if (sb)
> > drop_super(sb);
> > mutex_unlock(&bdev->bd_fsfreeze_mutex);
>
> The code assumes nobody thaws the fs from under the consumer.
>
> That is, code doing sb = freeze_bdev(b); ...; thaw_bdev(b, sb); risks
> use-after-free if the user thawed the fs.
>
> I did not check if current consumers hold the object in different ways
> and thus avoiding the bug.

Actually, freeze_super() called from freeze_bdev() will get additional
superblock reference (sb->s_active) so superblock is guaranteed to stay
around until thaw_super() is called. But yes, if there's an admin error and
someone else calls thaw_super() before that_bdev() is called, there's a
possible use-after-free issue. So probably thaw_bdev() should better lookup
the superblock again instead of taking one as an argument.

> > return sb;
> > }
> >
> > sb = get_active_super(bdev);
> > if (!sb)
> > goto out;
> > if (sb->s_op->freeze_super)
> > error = sb->s_op->freeze_super(sb);
> > else
> > error = freeze_super(sb);
>
> This differs from the ioctl version, which has this check:
>
> > if (sb->s_op->freeze_fs == NULL && sb->s_op->freeze_super == NULL)
> > return -EOPNOTSUPP;
>
> For a filesystem with freeze_fs == NULL, freeze_super will return 0 and
> freeze_bdev will end up returnig 0.
>
> I don't know if this turns into a real problem.

So it would be good to make freeze_bdev() and FIFREEZE ioctl consistent.
However you have to be careful for regressions - e.g. filesystems that can
be mounted only read-only (e.g. isofs) implicitly do support freezing
although they have no ->freeze_fs (or ->freeze_super). Then you have
filesystems like UDF for which what freeze_super() does is enough so they
don't provide ->freeze_fs method. But I guess such filesystems could be
forced to provide it if they want to support freezing (e.g. UDF could mark
the LVID as closed on freeze to make fs snapshot fully consistent).

> > int thaw_bdev(struct block_device *bdev, struct super_block *sb)
> [snip]
> > if (sb->s_op->thaw_super)
> > error = sb->s_op->thaw_super(sb);
> > else
> > error = thaw_super(sb);
> > if (error) {
> > bdev->bd_fsfreeze_count++;
> > mutex_unlock(&bdev->bd_fsfreeze_mutex);
> > return error;
> > }
>
> The fact that someone else can thaw makes it very fishy to pass super
> blocks around in the first place. In particular:
> kernel freeze -> user thaw -> mount -> user freeze -> kernel thaw
>
> Are we guaranteed that the sb returned for the kernel freeze is the same
> which was used for user freeze?
>
> That said, rough proposed fix is as follows:
> - store the pointer to the sb used for freezing in bdev
> - drop the sb argument from thaw_bdev

No, don't store the pointer anywhere. Just look it up again in thaw_bdev().
If someone else thawed the filesystem, he could have also unmounted it.
We'll get proper errors in those cases if we look up again.

> - optionally return a referenced sb from freeze_bdev, otherwise just
> return NULL on success

I've checked and nobody really needs the returned sb from freeze_bdev() so
I would just stop returning it...

> - convert all freeze consumers to use *_bdev variants

That's not easily doable - see above.

> - somewhat cosmetic, introduce nesting limit for freezes (say 2048 or
> whatever, just to have something)

OK.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-10-05 20:22:05

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()

On Wed, Oct 05, 2016 at 11:59:03AM +0200, Jan Kara wrote:
> On Wed 05-10-16 08:47:42, Mateusz Guzik wrote:
> > On Tue, Oct 04, 2016 at 11:06:15AM +0200, Jan Kara wrote:
> > > On Tue 04-10-16 10:53:40, Pierre Morel wrote:
> > > > When triggering thaw-filesystems via magic sysrq, the system enters a
> > > > loop in do_thaw_one(), as thaw_bdev() still returns success if
> > > > bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
> > > > error (and simplify the code a bit at the same time).
> > > >
> > >
> > > The patch looks good.
> > >
> >
> > Now that I had a closer look, while the patch indeed gets rid of the
> > infinite loop, the functionality itself does not work properly.
> >
> > Note I'm not familiar with this code, so chances are I got details
> > wrong. I also don't know the original reasoning.
> >
> > The current state is that you can freeze by calling either freeze_super
> > or freeze_bdev. The latter bumps bd_fsfreeze_count and calls
> > freeze_super. freeze_super does NOT modify bd_fsfreeze_count.
> >
> > freeze_bdev is used by device mapper, xfs and e2fs.
>
> Where is freeze_bdev() used by e2fs?

typo - it's f2fs, and it's usage is copied from XFS. Both of those
usages can be completely ignored for th purposes of this argument,
because they are done in the context of {X,F2}FS_IOC_FSGOINGDOWN.
i.e ioctls to shutdown a filesystem and completely deny access to
it. Freezing is just means to an end - it's just used to prevent any
more IO from being issued by the block device while we do the
filesystem shutdown...

IOWs, These are never used in normal usage by users - they are a
diagnostic tool and, potentially, a get-out-of-gaol-free card that
can be played when something goes wrong with the underlying storage
and the filesystem hangs waiting for it.

Indeed, the code in both does:

sb = freeze_bdev()
if (sb && !IS_ERR(sb) {
do_shutdown()
thaw_bdev(sb)
}

Which is perfectly sane and will not cause any sort of unbalanced
freeze behaviour to occur unless the shutdown oopses, in which case
the user has bigger problems than a frozen device...

So in production systems, dm_suspend()->lock_fs()->freeze_bdev() is
the only path that will ever be exercised.

> > Now, looking at *_bdev functions:
> >
> > > struct super_block *freeze_bdev(struct block_device *bdev)
> > > {
> > > struct super_block *sb;
> > > int error = 0;
> > >
> > > mutex_lock(&bdev->bd_fsfreeze_mutex);
> > > if (++bdev->bd_fsfreeze_count > 1) {
> >
> > No limit is put in place so in principle this will eventually turn negative.
>
> Yeah, ok, send a fix...

FWIW, that will only overflow if someone tries to freeze their block
device more than *2 billion times* without a thaw. If that happens,
the user has bigger problems to worry about, like the days/weeks the
system couldn't write to the filesystem because it was frozen.... :/

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-10-05 20:35:52

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()

On 10/04/2016 03:06 AM, Jan Kara wrote:
> On Tue 04-10-16 10:53:40, Pierre Morel wrote:
>> When triggering thaw-filesystems via magic sysrq, the system enters a
>> loop in do_thaw_one(), as thaw_bdev() still returns success if
>> bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
>> error (and simplify the code a bit at the same time).
>>
>> Reviewed-by: Eric Farman <[email protected]>
>> Reviewed-by: Cornelia Huck <[email protected]>
>> Signed-off-by: Pierre Morel <[email protected]>
>
> The patch looks good. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> Jens, can you please merge this patch? Thanks!

Added, thanks all.

--
Jens Axboe