Date: Wed, 5 Oct 2016 11:59:03 +0200
From: Jan Kara <jack@suse.cz>
To: Mateusz Guzik <mguzik@redhat.com>
Cc: Jan Kara <jack@suse.cz>, Pierre Morel <pmorel@linux.vnet.ibm.com>,
        viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org,
        linux-fsdevel@vger.kernel.org, farman@linux.vnet.ibm.com,
        cornelia.huck@de.ibm.com, Jens Axboe <axboe@fb.com>,
        Josef Bacik <jbacik@fb.com>
Subject: Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev()
Message-ID: <20161005095903.GC7291@quack2.suse.cz>
References: <1475571220-2522-1-git-send-email-pmorel@linux.vnet.ibm.com>
 <1475571220-2522-2-git-send-email-pmorel@linux.vnet.ibm.com>
 <20161004090615.GF17515@quack2.suse.cz>
 <20161005064740.yekp2jl3nler2won@mguzik>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20161005064740.yekp2jl3nler2won@mguzik>
User-Agent: Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 6850
Lines: 173

On Wed 05-10-16 08:47:42, Mateusz Guzik wrote:
> On Tue, Oct 04, 2016 at 11:06:15AM +0200, Jan Kara wrote:
> > On Tue 04-10-16 10:53:40, Pierre Morel wrote:
> > > When triggering thaw-filesystems via magic sysrq, the system enters a
> > > loop in do_thaw_one(), as thaw_bdev() still returns success if
> > > bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return
> > > error (and simplify the code a bit at the same time).
> > > 
> > 
> > The patch looks good.
> > 
> 
> Now that I had a closer look, while the patch indeed gets rid of the
> infinite loop, the functionality itself does not work properly.
> 
> Note I'm not familiar with this code, so chances are I got details
> wrong. I also don't know the original reasoning.
> 
> The current state is that you can freeze by calling either freeze_super
> or freeze_bdev. The latter bumps bd_fsfreeze_count and calls
> freeze_super. freeze_super does NOT modify bd_fsfreeze_count.
> 
> freeze_bdev is used by device mapper, xfs and e2fs.

Where is freeze_bdev() used by e2fs?

> freeze_super is used by the FIFREEZE ioctl.
> 
> This disparity leads to breakage.
> 
> Quick look suggests device freezing has its users and sb is optionally
> present. So the fix would consist on making all freezers call the bdev
> variant.

Except that e.g. for btrfs which does its own device management, there is
no bdev to freeze. As you properly note below, that was the reason why
FIFREEZE has been introduced. Also there are other filesystems which need
not necessarily exist on top of a block device but where freezing makes
sense (e.g. filesystems working directly on top of flash). So this is not
going to fly. That being said I fully agree that having two different paths
to freeze the filesystem which behave slightly differently is a headache.
 
> The j sysrq ends up not working in 2 ways.
> 1. It ends up calling do_thaw_one -> thaw_bdev. It will look at the
> bd_fsfreeze_count counter and error out. But if the user froze the fs
> with the ioctl, the counter is not modified and the fs in question ends
> up not being thawed.

Yes. Right fix for this would be to call thaw_super() for each superblock
from do_thaw_one() just to be sure.

> 2. (assuming 1 is fixed) Since freezing through freeze_bdev supports
> nesting, multiple invocations of the sysrq may be needed to actually
> thaw.

This is not true. The whole point of the loop in do_thaw_one() is to get
bd_fsfreeze_count to 0...

> Now, looking at *_bdev functions:
> 
> > struct super_block *freeze_bdev(struct block_device *bdev)
> > {       
> >         struct super_block *sb;
> >         int error = 0;
> >         
> >         mutex_lock(&bdev->bd_fsfreeze_mutex);
> >         if (++bdev->bd_fsfreeze_count > 1) {
> 
> No limit is put in place so in principle this will eventually turn negative.

Yeah, ok, send a fix...

> >                 /*
> >                  * We don't even need to grab a reference - the first call
> >                  * to freeze_bdev grab an active reference and only the last
> >                  * thaw_bdev drops it.
> >                  */
> >                 sb = get_super(bdev);
> >                 if (sb) 
> >                         drop_super(sb);
> >                 mutex_unlock(&bdev->bd_fsfreeze_mutex);
> 
> The code assumes nobody thaws the fs from under the consumer.
> 
> That is, code doing sb = freeze_bdev(b); ...; thaw_bdev(b, sb); risks
> use-after-free if the user thawed the fs.
> 
> I did not check if current consumers hold the object in different ways
> and thus avoiding the bug.

Actually, freeze_super() called from freeze_bdev() will get additional
superblock reference (sb->s_active) so superblock is guaranteed to stay
around until thaw_super() is called. But yes, if there's an admin error and
someone else calls thaw_super() before that_bdev() is called, there's a
possible use-after-free issue. So probably thaw_bdev() should better lookup
the superblock again instead of taking one as an argument.

> >                 return sb;
> >         }
> >         
> >         sb = get_active_super(bdev);
> >         if (!sb)
> >                 goto out;
> >         if (sb->s_op->freeze_super)
> >                 error = sb->s_op->freeze_super(sb);
> >         else    
> >                 error = freeze_super(sb);
> 
> This differs from the ioctl version, which has this check:
> 
> >        if (sb->s_op->freeze_fs == NULL && sb->s_op->freeze_super == NULL)
> >                return -EOPNOTSUPP;
> 
> For a filesystem with freeze_fs == NULL, freeze_super will return 0 and
> freeze_bdev will end up returnig 0.
> 
> I don't know if this turns into a real problem.

So it would be good to make freeze_bdev() and FIFREEZE ioctl consistent.
However you have to be careful for regressions - e.g. filesystems that can
be mounted only read-only (e.g. isofs) implicitly do support freezing
although they have no ->freeze_fs (or ->freeze_super). Then you have
filesystems like UDF for which what freeze_super() does is enough so they
don't provide ->freeze_fs method. But I guess such filesystems could be
forced to provide it if they want to support freezing (e.g. UDF could mark
the LVID as closed on freeze to make fs snapshot fully consistent).

> > int thaw_bdev(struct block_device *bdev, struct super_block *sb)
> [snip]
> >         if (sb->s_op->thaw_super)
> >                 error = sb->s_op->thaw_super(sb);
> >         else
> >                 error = thaw_super(sb);
> >         if (error) {                                                           
> >                 bdev->bd_fsfreeze_count++;
> >                 mutex_unlock(&bdev->bd_fsfreeze_mutex); 
> >                 return error;
> >         } 
> 
> The fact that someone else can thaw makes it very fishy to pass super
> blocks around in the first place. In particular:
> kernel freeze -> user thaw -> mount -> user freeze -> kernel thaw
> 
> Are we guaranteed that the sb returned for the kernel freeze is the same
> which was used for user freeze?
> 
> That said, rough proposed fix is as follows:
> - store the pointer to the sb used for freezing in bdev
> - drop the sb argument from thaw_bdev

No, don't store the pointer anywhere. Just look it up again in thaw_bdev().
If someone else thawed the filesystem, he could have also unmounted it.
We'll get proper errors in those cases if we look up again.

> - optionally return a referenced sb from freeze_bdev, otherwise just
>   return NULL on success

I've checked and nobody really needs the returned sb from freeze_bdev() so
I would just stop returning it...

> - convert all freeze consumers to use *_bdev variants

That's not easily doable - see above.

> - somewhat cosmetic, introduce nesting limit for freezes (say 2048 or
>   whatever, just to have something)

OK.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR