We check whether a new handle can be started through
ext4_journal_check_start() and the function refuses to start the handle
when the filesystem is mounted with read-only. But now, when we remount
the filesystem with read-only option, already started handles are
allowed to be written on disk, but the subsequent metadata modification
using the handles are refused by ext4_journal_check_start().
As an example, in ext4_evict_inode(), i_size can be set to 0 using
a successfully started handle, but, when we remount the filesystem
with read-only option at that time, the subsequent ext4_truncate()
will be failed and the filesystem integrity will be damaged.
Therefore, we need to permit the metadata modification using already
started handles to be proceeded, even if s_flags of the filesystem is
set to MS_RDONLY.
Kitae found the problem and suggested the solution.
Signed-off-by: Kitae Lee <[email protected]>
Signed-off-by: Daeho Jeong <[email protected]>
---
fs/ext4/ext4_jbd2.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index e770c1ee..3b59e11 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -43,7 +43,7 @@ static int ext4_journal_check_start(struct super_block *sb)
journal_t *journal;
might_sleep();
- if (sb->s_flags & MS_RDONLY)
+ if (sb->s_flags & MS_RDONLY && ext4_journal_current_handle() == NULL)
return -EROFS;
WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
journal = EXT4_SB(sb)->s_journal;
--
1.7.9.5
On Mon 02-05-16 09:50:37, Daeho Jeong wrote:
> We check whether a new handle can be started through
> ext4_journal_check_start() and the function refuses to start the handle
> when the filesystem is mounted with read-only. But now, when we remount
> the filesystem with read-only option, already started handles are
> allowed to be written on disk, but the subsequent metadata modification
> using the handles are refused by ext4_journal_check_start().
>
> As an example, in ext4_evict_inode(), i_size can be set to 0 using
> a successfully started handle, but, when we remount the filesystem
> with read-only option at that time, the subsequent ext4_truncate()
> will be failed and the filesystem integrity will be damaged.
>
> Therefore, we need to permit the metadata modification using already
> started handles to be proceeded, even if s_flags of the filesystem is
> set to MS_RDONLY.
>
> Kitae found the problem and suggested the solution.
So can you share a reproducer for this issue? Because my initial thinking
is that checks during remount should fail the remount with EBUSY if there
is any modification outstanding... If they don't we have a racy remount and
fs-freezing code, which is a bug.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, May 02, 2016 at 09:50:37AM +0900, Daeho Jeong wrote:
> We check whether a new handle can be started through
> ext4_journal_check_start() and the function refuses to start the handle
> when the filesystem is mounted with read-only. But now, when we remount
> the filesystem with read-only option, already started handles are
> allowed to be written on disk, but the subsequent metadata modification
> using the handles are refused by ext4_journal_check_start().
>
> As an example, in ext4_evict_inode(), i_size can be set to 0 using
> a successfully started handle, but, when we remount the filesystem
> with read-only option at that time, the subsequent ext4_truncate()
> will be failed and the filesystem integrity will be damaged.
>
> Therefore, we need to permit the metadata modification using already
> started handles to be proceeded, even if s_flags of the filesystem is
> set to MS_RDONLY.
>
> Kitae found the problem and suggested the solution.
>
> Signed-off-by: Kitae Lee <[email protected]>
> Signed-off-by: Daeho Jeong <[email protected]>
Hmm, I'm not really comfortable with putting this hack in, since this
is papering over the real problem, which is that Android is trying to
use the emergency remount read-only sysrq option and this is
fundamentally unsafe. I'm not sure what else could break if it is
situation normal that there is active processes busily writing to the
file system and sysrq-u followed by reboot is the normal way the
Android kernel does a reboot.
A much better solution would be to change the Android userspace to
call the FIFREEZE ioctl on each mounted file system, and then call for
a reboot.
- Ted
> So can you share a reproducer for this issue? Because my initial thinking
> is that checks during remount should fail the remount with EBUSY if there
> is any modification outstanding... If they don't we have a racy remount and
> fs-freezing code, which is a bug.
Hi Jan,
Sorry to make you confused. My patch description was little wrong.
I meant only the emergency read-only remount case, not the normal read-only
remount case. Android is currently using emergency ro remount for the device
shutdown procedure.
We can easily reproduce this problem, but we didn't make a TC for xfstest yet.
If we did that, I will let you know.
> Hmm, I'm not really comfortable with putting this hack in, since this
> is papering over the real problem, which is that Android is trying to
> use the emergency remount read-only sysrq option and this is
> fundamentally unsafe. I'm not sure what else could break if it is
> situation normal that there is active processes busily writing to the
> file system and sysrq-u followed by reboot is the normal way the
> Android kernel does a reboot.
> A much better solution would be to change the Android userspace to
> call the FIFREEZE ioctl on each mounted file system, and then call for
> a reboot.
I agree with you. I know that current Android shutdown procedure is
not a safe way. But, without this patch, "even not in Android system",
when we trigger the emergency read-only remount while evicting inodes,
i_size of the inode becomes zero and the inode is not in orphan list,
but blocks of the inode are still allocated to the inode, because
ext4_truncate() will fail while stating the handle which was already started
by ext4_evict_inode(). This causes the filesystem inconsistency and
we will encounter an ext4 kernel panic in the next boot by this problem.
I think that this kind of filesystem crash can occur anywhere that
the same journal handle is repeatly used. During an atomic filesystem
operation, a part of the operation will succeed and the other one will fail.
On Fri, May 06, 2016 at 06:01:11AM +0000, Daeho Jeong wrote:
> > Hmm, I'm not really comfortable with putting this hack in, since this
> > is papering over the real problem, which is that Android is trying to
> > use the emergency remount read-only sysrq option and this is
> > fundamentally unsafe. I'm not sure what else could break if it is
> > situation normal that there is active processes busily writing to the
> > file system and sysrq-u followed by reboot is the normal way the
> > Android kernel does a reboot.
>
> > A much better solution would be to change the Android userspace to
> > call the FIFREEZE ioctl on each mounted file system, and then call for
> > a reboot.
>
> I agree with you. I know that current Android shutdown procedure is
> not a safe way. But, without this patch, "even not in Android system",
> when we trigger the emergency read-only remount while evicting inodes,
> i_size of the inode becomes zero and the inode is not in orphan list,
> but blocks of the inode are still allocated to the inode, because
> ext4_truncate() will fail while stating the handle which was already started
> by ext4_evict_inode(). This causes the filesystem inconsistency and
> we will encounter an ext4 kernel panic in the next boot by this problem.
>
> I think that this kind of filesystem crash can occur anywhere that
> the same journal handle is repeatly used. During an atomic filesystem
> operation, a part of the operation will succeed and the other one will fail.
Sure, but if this were a bug, then it could happen under a normal
crash scenario. In this particular case, under what circumstances
could i_size be set to zero *and* the inode is not on the orphan list
*and* ext4_evict_inode() is getting called?
If that could happen, then we could have problems without emergency
unmount, and we should fix that problem.
The real problem here is that we want emergency unmount to be
completely safe, butting MS_RDONLY randomly isn't safe against file
system corruption. The idea is you do this only when you have no
other choice, and the consequences would be worse --- and where you
would be prepared to do a file system consistency check afterwards.
We can either fix Android userspace, or we could add a per-file system
callback which tries (as much as possible) to make an emergency
unmount safe. In this particular case, it would probably involve
setting the "file system is corrupt flag" to force a file system
consistency check at the next reboot. Because if you use emergency
unmount, all bets are off, and there may be other problems....
- Ted
On May 6, 2016, at 7:00 AM, Theodore Ts'o <[email protected]> wrote:
>
> On Fri, May 06, 2016 at 06:01:11AM +0000, Daeho Jeong wrote:
>>> Hmm, I'm not really comfortable with putting this hack in, since this
>>> is papering over the real problem, which is that Android is trying to
>>> use the emergency remount read-only sysrq option and this is
>>> fundamentally unsafe. I'm not sure what else could break if it is
>>> situation normal that there is active processes busily writing to the
>>> file system and sysrq-u followed by reboot is the normal way the
>>> Android kernel does a reboot.
>>
>>> A much better solution would be to change the Android userspace to
>>> call the FIFREEZE ioctl on each mounted file system, and then call for
>>> a reboot.
>>
>> I agree with you. I know that current Android shutdown procedure is
>> not a safe way. But, without this patch, "even not in Android system",
>> when we trigger the emergency read-only remount while evicting inodes,
>> i_size of the inode becomes zero and the inode is not in orphan list,
>> but blocks of the inode are still allocated to the inode, because
>> ext4_truncate() will fail while stating the handle which was already started
>> by ext4_evict_inode(). This causes the filesystem inconsistency and
>> we will encounter an ext4 kernel panic in the next boot by this problem.
>>
>> I think that this kind of filesystem crash can occur anywhere that
>> the same journal handle is repeatly used. During an atomic filesystem
>> operation, a part of the operation will succeed and the other one will fail.
>
> Sure, but if this were a bug, then it could happen under a normal
> crash scenario. In this particular case, under what circumstances
> could i_size be set to zero *and* the inode is not on the orphan list
> *and* ext4_evict_inode() is getting called?
>
> If that could happen, then we could have problems without emergency
> unmount, and we should fix that problem.
>
> The real problem here is that we want emergency unmount to be
> completely safe, but setting MS_RDONLY randomly isn't safe against file
> system corruption. The idea is you do this only when you have no
> other choice, and the consequences would be worse --- and where you
> would be prepared to do a file system consistency check afterwards.
>
> We can either fix Android userspace, or we could add a per-file system
> callback which tries (as much as possible) to make an emergency
> unmount safe. In this particular case, it would probably involve
> setting the "file system is corrupt flag" to force a file system
> consistency check at the next reboot. Because if you use emergency
> unmount, all bets are off, and there may be other problems....
The problem is that emergency remount-ro doesn't block in-progress writes,
since most operations only check the MS_RDONLY at the start of an operation.
It would be possible for do_emergency_remount() call ->freeze_fs() first for
all the filesystems, then doing the remount read-only (would need a change to
do_remount_ro() to allow this)?
That ensures the filesystem is in a (more) consistent state when force
remount-ro is called (i.e. which doesn't block or return an error if there
are writers on the filesystem). I don't think this is an issue for normal
remount-ro, since there can't be any writes in progress.
This would also avoid the need for Android to know more about the internals
of how to remount read-only, and probably help normal sysadmins too.
Cheers, Andreas
On Fri, May 06, 2016 at 02:01:17PM -0600, Andreas Dilger wrote:
>
> The problem is that emergency remount-ro doesn't block in-progress writes,
> since most operations only check the MS_RDONLY at the start of an operation.
> It would be possible for do_emergency_remount() call ->freeze_fs() first for
> all the filesystems, then doing the remount read-only (would need a change to
> do_remount_ro() to allow this)?
I thought about doing that, but that would mean that the code path
might need to take some locks along the way, and if you have multiple
file systems, for which one has wedged, the do_emergency_remount()
function might end up blocking when it tries calling freeze_fs() on
one of the file system before it managed to get to the rest of the
file systems in the system.
This really goes to the question of what is do_emergency_remount()
for. If the goal is to minimize damage, then you want to keep things
as simple as possible, and to not allow any emergency remounts for any
file system to block.
If the goal is to allow the normal shutdown path to use this because
the userspace code is too lazy to do a proper shutdown of all user
processes, and too lazy to go through all of the mounted file systems
and individually call FIFREEZE, then sure, we could iterate over the
file systems and call freeze_fs() in kernel code. But I'm not really
sure I see the point......
> That ensures the filesystem is in a (more) consistent state when force
> remount-ro is called (i.e. which doesn't block or return an error if there
> are writers on the filesystem).
Right, but if the kernel is calling freeze_fs(), freeze_fs() might
block, and then what would we do?
- Ted
> The real problem here is that we want emergency unmount to be
> completely safe, butting MS_RDONLY randomly isn't safe against file
> system corruption. The idea is you do this only when you have no
> other choice, and the consequences would be worse --- and where you
> would be prepared to do a file system consistency check afterwards.
> We can either fix Android userspace, or we could add a per-file system
> callback which tries (as much as possible) to make an emergency
> unmount safe. In this particular case, it would probably involve
> setting the "file system is corrupt flag" to force a file system
> consistency check at the next reboot. Because if you use emergency
> unmount, all bets are off, and there may be other problems....
Actually, we had executed power on/off test repeatedly with 10 Android
devices for a month. But, during the test, just this problem only happened
repeatedly, even if it occurred very rarely. So we had concluded that we
had to fix this problem certainly.
However, I can see the point now. Android have misused the emergency
ro-remount and the filesystem crash by emergency ro-remount is not an issue.
I can understand the purpose of the emergency ro-remount.
I heard that the next version of Android will do full scan of ext4 filesystems
using e2fsck every boot-up, so the existing problem will be naturally resolved,
even if it might not be the right solution.
Thank you for your valuable comments.
On Sat, May 07, 2016 at 01:05:49PM +0000, Daeho Jeong wrote:
>
> Actually, we had executed power on/off test repeatedly with 10 Android
> devices for a month. But, during the test, just this problem only happened
> repeatedly, even if it occurred very rarely. So we had concluded that we
> had to fix this problem certainly.
>
> However, I can see the point now. Android have misused the emergency
> ro-remount and the filesystem crash by emergency ro-remount is not an issue.
> I can understand the purpose of the emergency ro-remount.
>
> I heard that the next version of Android will do full scan of ext4 filesystems
> using e2fsck every boot-up, so the existing problem will be naturally resolved,
> even if it might not be the right solution.
I've sent an inquiry to contacts I have in the Google Android team, to
make sure they're aware of the potential issues, whether or not it
gets addressed (by accident) in the next version of Android.
Something to consider is that it's not clear (for example) whether or
not some other file system, such as f2fs, will do the right thing with
such a scheme (and what kind of application data loss you might have,
even if the file systrem checker is run at boot-up). Or if you have a
process which is actively writing to a SD card containing a VFAT file
system at the time of the shutdown, whether or not it will do the
right thing.
So there are some bigger issues that probably need to be considered,
not just with ext4.
- Ted
On Fri 06-05-16 23:36:23, Ted Tso wrote:
> On Fri, May 06, 2016 at 02:01:17PM -0600, Andreas Dilger wrote:
> >
> > The problem is that emergency remount-ro doesn't block in-progress writes,
> > since most operations only check the MS_RDONLY at the start of an operation.
> > It would be possible for do_emergency_remount() call ->freeze_fs() first for
> > all the filesystems, then doing the remount read-only (would need a change to
> > do_remount_ro() to allow this)?
>
> I thought about doing that, but that would mean that the code path
> might need to take some locks along the way, and if you have multiple
> file systems, for which one has wedged, the do_emergency_remount()
> function might end up blocking when it tries calling freeze_fs() on
> one of the file system before it managed to get to the rest of the
> file systems in the system.
>
> This really goes to the question of what is do_emergency_remount()
> for. If the goal is to minimize damage, then you want to keep things
> as simple as possible, and to not allow any emergency remounts for any
> file system to block.
>
> If the goal is to allow the normal shutdown path to use this because
> the userspace code is too lazy to do a proper shutdown of all user
> processes, and too lazy to go through all of the mounted file systems
> and individually call FIFREEZE, then sure, we could iterate over the
> file systems and call freeze_fs() in kernel code. But I'm not really
> sure I see the point......
>
> > That ensures the filesystem is in a (more) consistent state when force
> > remount-ro is called (i.e. which doesn't block or return an error if there
> > are writers on the filesystem).
>
> Right, but if the kernel is calling freeze_fs(), freeze_fs() might
> block, and then what would we do?
Yeah, 100% agreed. Emergency remount is for the case where the system is
too hosed to allow for normal shutdown (e.g. kernel has oopsed while
holding fs locks so normal unmount attempt will just block) and you want to
limit fs damage. So emergency remount should not block and thus it cannot
wait for any outstanding writes.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR