Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754809AbcJEJ7N (ORCPT ); Wed, 5 Oct 2016 05:59:13 -0400 Received: from mx2.suse.de ([195.135.220.15]:51376 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753085AbcJEJ7L (ORCPT ); Wed, 5 Oct 2016 05:59:11 -0400 Date: Wed, 5 Oct 2016 11:59:03 +0200 From: Jan Kara To: Mateusz Guzik Cc: Jan Kara , Pierre Morel , viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, farman@linux.vnet.ibm.com, cornelia.huck@de.ibm.com, Jens Axboe , Josef Bacik Subject: Re: [PATCH] fs/block_dev.c: return the right error in thaw_bdev() Message-ID: <20161005095903.GC7291@quack2.suse.cz> References: <1475571220-2522-1-git-send-email-pmorel@linux.vnet.ibm.com> <1475571220-2522-2-git-send-email-pmorel@linux.vnet.ibm.com> <20161004090615.GF17515@quack2.suse.cz> <20161005064740.yekp2jl3nler2won@mguzik> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161005064740.yekp2jl3nler2won@mguzik> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6850 Lines: 173 On Wed 05-10-16 08:47:42, Mateusz Guzik wrote: > On Tue, Oct 04, 2016 at 11:06:15AM +0200, Jan Kara wrote: > > On Tue 04-10-16 10:53:40, Pierre Morel wrote: > > > When triggering thaw-filesystems via magic sysrq, the system enters a > > > loop in do_thaw_one(), as thaw_bdev() still returns success if > > > bd_fsfreeze_count == 0. To fix this, let thaw_bdev() always return > > > error (and simplify the code a bit at the same time). > > > > > > > The patch looks good. > > > > Now that I had a closer look, while the patch indeed gets rid of the > infinite loop, the functionality itself does not work properly. > > Note I'm not familiar with this code, so chances are I got details > wrong. I also don't know the original reasoning. > > The current state is that you can freeze by calling either freeze_super > or freeze_bdev. The latter bumps bd_fsfreeze_count and calls > freeze_super. freeze_super does NOT modify bd_fsfreeze_count. > > freeze_bdev is used by device mapper, xfs and e2fs. Where is freeze_bdev() used by e2fs? > freeze_super is used by the FIFREEZE ioctl. > > This disparity leads to breakage. > > Quick look suggests device freezing has its users and sb is optionally > present. So the fix would consist on making all freezers call the bdev > variant. Except that e.g. for btrfs which does its own device management, there is no bdev to freeze. As you properly note below, that was the reason why FIFREEZE has been introduced. Also there are other filesystems which need not necessarily exist on top of a block device but where freezing makes sense (e.g. filesystems working directly on top of flash). So this is not going to fly. That being said I fully agree that having two different paths to freeze the filesystem which behave slightly differently is a headache. > The j sysrq ends up not working in 2 ways. > 1. It ends up calling do_thaw_one -> thaw_bdev. It will look at the > bd_fsfreeze_count counter and error out. But if the user froze the fs > with the ioctl, the counter is not modified and the fs in question ends > up not being thawed. Yes. Right fix for this would be to call thaw_super() for each superblock from do_thaw_one() just to be sure. > 2. (assuming 1 is fixed) Since freezing through freeze_bdev supports > nesting, multiple invocations of the sysrq may be needed to actually > thaw. This is not true. The whole point of the loop in do_thaw_one() is to get bd_fsfreeze_count to 0... > Now, looking at *_bdev functions: > > > struct super_block *freeze_bdev(struct block_device *bdev) > > { > > struct super_block *sb; > > int error = 0; > > > > mutex_lock(&bdev->bd_fsfreeze_mutex); > > if (++bdev->bd_fsfreeze_count > 1) { > > No limit is put in place so in principle this will eventually turn negative. Yeah, ok, send a fix... > > /* > > * We don't even need to grab a reference - the first call > > * to freeze_bdev grab an active reference and only the last > > * thaw_bdev drops it. > > */ > > sb = get_super(bdev); > > if (sb) > > drop_super(sb); > > mutex_unlock(&bdev->bd_fsfreeze_mutex); > > The code assumes nobody thaws the fs from under the consumer. > > That is, code doing sb = freeze_bdev(b); ...; thaw_bdev(b, sb); risks > use-after-free if the user thawed the fs. > > I did not check if current consumers hold the object in different ways > and thus avoiding the bug. Actually, freeze_super() called from freeze_bdev() will get additional superblock reference (sb->s_active) so superblock is guaranteed to stay around until thaw_super() is called. But yes, if there's an admin error and someone else calls thaw_super() before that_bdev() is called, there's a possible use-after-free issue. So probably thaw_bdev() should better lookup the superblock again instead of taking one as an argument. > > return sb; > > } > > > > sb = get_active_super(bdev); > > if (!sb) > > goto out; > > if (sb->s_op->freeze_super) > > error = sb->s_op->freeze_super(sb); > > else > > error = freeze_super(sb); > > This differs from the ioctl version, which has this check: > > > if (sb->s_op->freeze_fs == NULL && sb->s_op->freeze_super == NULL) > > return -EOPNOTSUPP; > > For a filesystem with freeze_fs == NULL, freeze_super will return 0 and > freeze_bdev will end up returnig 0. > > I don't know if this turns into a real problem. So it would be good to make freeze_bdev() and FIFREEZE ioctl consistent. However you have to be careful for regressions - e.g. filesystems that can be mounted only read-only (e.g. isofs) implicitly do support freezing although they have no ->freeze_fs (or ->freeze_super). Then you have filesystems like UDF for which what freeze_super() does is enough so they don't provide ->freeze_fs method. But I guess such filesystems could be forced to provide it if they want to support freezing (e.g. UDF could mark the LVID as closed on freeze to make fs snapshot fully consistent). > > int thaw_bdev(struct block_device *bdev, struct super_block *sb) > [snip] > > if (sb->s_op->thaw_super) > > error = sb->s_op->thaw_super(sb); > > else > > error = thaw_super(sb); > > if (error) { > > bdev->bd_fsfreeze_count++; > > mutex_unlock(&bdev->bd_fsfreeze_mutex); > > return error; > > } > > The fact that someone else can thaw makes it very fishy to pass super > blocks around in the first place. In particular: > kernel freeze -> user thaw -> mount -> user freeze -> kernel thaw > > Are we guaranteed that the sb returned for the kernel freeze is the same > which was used for user freeze? > > That said, rough proposed fix is as follows: > - store the pointer to the sb used for freezing in bdev > - drop the sb argument from thaw_bdev No, don't store the pointer anywhere. Just look it up again in thaw_bdev(). If someone else thawed the filesystem, he could have also unmounted it. We'll get proper errors in those cases if we look up again. > - optionally return a referenced sb from freeze_bdev, otherwise just > return NULL on success I've checked and nobody really needs the returned sb from freeze_bdev() so I would just stop returning it... > - convert all freeze consumers to use *_bdev variants That's not easily doable - see above. > - somewhat cosmetic, introduce nesting limit for freezes (say 2048 or > whatever, just to have something) OK. Honza -- Jan Kara SUSE Labs, CR