Apparently, there are users out there with a single gigantic journalled
rootfs and some gnarly system software. If the user reboots into
"offline system update" mode to install a kernel update, the system
control software has no provision to kick the cute splash screen off its
writable file descriptor down in /var/log somewhere before unmounting or
remount-ro'ing, and thus reboots the system... with a live rw rootfs!
Since the journal may not have been checkpointed immediately prior to
the reboot, a subsequent invocation of the hapless user's grubby
bootloader sees obsolete metadata because the newest data is safely in
the log, but the log needs to be replayed. Weirdly, the bootloader is
fine with reading files off a dirty filesystem without replaying the log
(though really, can you imagine log replay in x86 real mode?), so it
reads the stale metadata and the boot fails until someone intervenes to
replay the journal.
Therefore, add a reboot hook to freeze all filesystems (which in general
will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
This is an unfortunate and insufficient workaround for multiple layers
of inadequate external software, but at least it will reduce boot time
surprises for the "OS updater failed to disengage the filesystem before
rebooting" case.
Seeing as the world has been drifting towards grubbiness (except for
those booting straight off a flabby unjournalled fs via firmware), this
seems like the least crappy solution to this problem. Yes, you're still
screwed in grub if the system crashes. :)
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/super.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/fs/super.c b/fs/super.c
index adb0c0d..4a9deaa 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -34,6 +34,7 @@
#include <linux/fsnotify.h>
#include <linux/lockdep.h>
#include <linux/user_namespace.h>
+#include <linux/reboot.h>
#include "internal.h"
@@ -1529,3 +1530,32 @@ int thaw_super(struct super_block *sb)
return 0;
}
EXPORT_SYMBOL(thaw_super);
+
+static void fsreboot_freeze_sb(struct super_block *sb, void *priv)
+{
+	int error;
+
+	up_read(&sb->s_umount);
+	error = freeze_super(sb);
+	down_read(&sb->s_umount);
+	if (error && error != -EBUSY)
+		printk(KERN_NOTICE "%s (%s): Unable to freeze, error=%d\n",
+		       sb->s_type->name, sb->s_id, error);
+}
+
+static int fsreboot_freeze(struct notifier_block *nb, ulong event, void *buf)
+{
+	iterate_supers(fsreboot_freeze_sb, NULL);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block fsreboot_notifier = {
+	.notifier_call = fsreboot_freeze,
+	.priority = INT_MAX,
+};
+
+static int __init fsreboot_init(void)
+{
+	return register_reboot_notifier(&fsreboot_notifier);
+}
+__initcall(fsreboot_init);
On Fri, May 19, 2017 at 3:20 AM, Darrick J. Wong
<[email protected]> wrote:
>
> Therefore, add a reboot hook to freeze all filesystems (which in general
> will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> This is an unfortunate and insufficient workaround for multiple layers
> of inadequate external software, but at least it will reduce boot time
> surprises for the "OS updater failed to disengage the filesystem before
> rebooting" case.
>
Darrick,
Did you consider how many support calls this will generate for a stuck
reboot command?
I can think of at least one situation where this is guaranteed to hang.
See this patch for the details:
https://patchwork.kernel.org/patch/6266791/
The referenced patch was applied to Android kernel to prevent
system crash on emergency remount-ro via sysrq trigger.
I don't know if it was even seriously considered by Al, because
I got no comment, but I do realize that the change of behavior
could generate support calls, so it's scary to make that change
in mainline.
I know it's not going to work around broken system software update,
but how about providing sysrq trigger for emergency_freeze_all()?
like emergency_remount(), but stronger.
And this time, iterate supers in reverse order like I suggested to
avoid loop mounted fs freeze dependencies.
There is one little tiny problem though. Eric used up the last sysrq trigger
key for emergency_thaw_all(). Do you see the irony in that? ;)
I am wondering how many people know about or use the emergency
thaw trigger, but one dodgy option is to use the 'j' trigger to toggle
thaw_all/freeze_all.
Another perhaps slightly less dodgy option is to trigger freeze_all
on a sequence of sysrq "emergency" triggers where it makes sense
and is least likely to change any existing behavior, for example:
echo u > /proc/sysrq-trigger
# Remember if do_emergency_remount() completed with failures
echo u > /proc/sysrq-trigger
# Escalate to emergency freeze
OR
echo u > /proc/sysrq-trigger
# Remember if do_emergency_remount() completed with failures
echo s > /proc/sysrq-trigger
# Sync *after* remount r/o? That must mean emergency freeze
I bet that system software that is already aware of and is issuing
emergency remount r/o trigger prior to reboot, won't see any harm
in adding an extra u/s trigger for good luck.
Do you know if the gnarly system software in question is issuing
emergency remount r/o prior to reboot?
Amir.
On Thu, May 18, 2017, at 08:20 PM, Darrick J. Wong wrote:
> Therefore, add a reboot hook to freeze all filesystems (which in general
> will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> This is an unfortunate and insufficient workaround for multiple layers
> of inadequate external software, but at least it will reduce boot time
> surprises for the "OS updater failed to disengage the filesystem before
> rebooting" case.
As a maintainer of one of those userspace tools (https://github.com/ostreedev/ostree),
which I don't think is the one in question here, but likely has the same
issue - I'd like to have some sort of API to fix this - maybe flush the journal *without*
remounting r/o?
Unlike the case you're talking about with rebooting into a special
update mode, libostree constructs a new root with hardlinks while
the system is running. Hence, system downtime is just reboot, like
dual-partition update systems, except we're more flexible.
Although hm...I guess an API to flush the journal would only narrow
the race.
Is the single partition case really just doomed?
On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> As a maintainer of one of those userspace tools (https://github.com/ostreedev/ostree),
> which I don't think is the one in question here, but likely has the same
> issue - I'd like to have some sort of API to fix this - maybe flush the journal *without*
> remounting r/o?
>
> Unlike the case you're talking about with rebooting into a special
> update mode, libostree constructs a new root with hardlinks while
> the system is running. Hence, system downtime is just reboot, like
> dual-partition update systems, except we're more flexible.
>
> Although hm...I guess an API to flush the journal would only narrow
> the race.
>
> Is the single partition case really just doomed?
One of the things that came up when Darrick and I discussed this on
the weekly ext4 developer's conference call was our mutual wonderment
that none of the userspace tools implemented a reboot by creating a
tmpfs chroot, pivoting into the chroot, and then unmounting all of the
remaining file systems.
This would also allow update schemes who want to enable various new
file system features, or upgrade the root file system somehow, to be
able to do so while the root file system is completely and cleanly
unmounted.
The other thing that would be useful is if grub2 would actually be
able to replay the file system journal --- but given that grub2 is
GPLv3, and both ext4 and xfs are GPLv2-only, and given that past
attempts of teams attempting to do clean room reimplementations of
complex code bases for licensing reasons only (cough, make_ext4fs,
*cough*) have not necessarily turned out well, I'm at least not going
to hold my breath.
- Ted
On Fri, May 19, 2017, at 11:27 AM, Theodore Ts'o wrote:
>
> One of the things that came up when Darrick and I discussed this on
> the weekly ext4 developer's conference call was our mutual wonderment
> that none of the userspace tools implemented a reboot by creating a
> tmpfs chroot, pivoting into the chroot, and then unmounting all of the
> remaining file systems.
On general purpose systems we have a tmpfs chroot already: the initramfs.
Although IIRC, systemd will only switch back to it on shutdown
if you have a root storage daemon enabled:
https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/
That said I'd like to focus on the harder case: supporting powerloss/system lockup on
single-partition systems. IMO, the shutdown case is just a special variant
of that where the user asked nicely for the system to halt =)
(See also https://en.wikipedia.org/wiki/Crash-only_software)
I was thinking about this a bit, and I think if userspace tools (like ostree)
*delayed* their updates to /boot until shutdown, then we could ensure
that on powerloss, the system is unchanged. (In a traditional dpkg/rpm
scenario where you only have one userspace root, you'd end up with
old kernel + new rootfs, but that's exactly the problem ostree solves)
That narrows the problem down to keeping `/boot` consistent at
shutdown time. AIUI, a problem here is that XFS doesn't flush the
journal on `syncfs`, only on unmount? And from what I can tell,
even the `XFS_IOC_FREEZE` ioctl won't do that either.
So as far as I can see, a userspace API to ensure the journal is
flushed on a mounted filesystem is going to be necessary for
the general case. I don't have a strong opinion on whether or not
that's `syncfs()` - if it's e.g. a `XFS_IOC_FREEZE` `_THAW` pair
that seems OK to me too.
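
A minimal sketch of what such a freeze/thaw pair could look like from
userspace, using the generic FIFREEZE/FITHAW ioctls from <linux/fs.h>.
The function name freeze_thaw_cycle() is made up for illustration, and
whether the freeze really checkpoints the log is up to the filesystem.
Both ioctls need CAP_SYS_ADMIN, and a process killed between the two
calls leaves the filesystem frozen.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>	/* FIFREEZE, FITHAW */
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Freeze and immediately thaw the filesystem that contains `path`
 * (hypothetical helper, not an existing API).
 * Returns 0 on success or a negative errno value on failure.
 */
int freeze_thaw_cycle(const char *path)
{
	int fd, ret = 0;

	assert(path != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;

	if (ioctl(fd, FIFREEZE, 0) < 0)
		ret = -errno;	/* nothing was frozen, so nothing to thaw */
	else if (ioctl(fd, FITHAW, 0) < 0)
		ret = -errno;	/* freeze worked, but the fs is still frozen */

	close(fd);
	return ret;
}
```

An updater would call something like freeze_thaw_cycle("/boot") after
writing the new kernel and bootloader config, and treat a nonzero
return as "the journal may still be dirty".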
On Fri, May 19, 2017, at 12:34 PM, Colin Walters wrote:
>
> So as far as I can see, a userspace API to ensure the journal is
> flushed on a mounted filesystem is going to be necessary for
> the general case. I don't have a strong opinion on whether or not
> that's `syncfs()` - if it's e.g. a `XFS_IOC_FREEZE` `_THAW` pair
> that seems OK to me too.
Or (thinking about this more) maybe we indeed could implement that today by pivoting back to
the initramfs, and using umount()+mount() as our "syncfs() + journal flush" implementation.
Basically when we have to update /boot, we unmount, then remount again and add
new kernel+initramfs, unmount, remount and mv(/boot/grub2.conf.new,/boot/grub2.conf),
then finally unmount again.
In the current design this would require keeping the initramfs resident in memory
just for this purpose, or to re-synthesize it on shutdown.
Not impossible, but it'd sure be simpler if say syncfs() had a flags argument
and there were a special "flush the journal" argument for this.
On Fri, May 19, 2017 at 12:34:29PM -0400, Colin Walters wrote:
> > One of the things that came up when Darrick and I discussed this on
> > the weekly ext4 developer's conference call was our mutual wonderment
> > that none of the userspace tools implemented a reboot by creating a
> > tmpfs chroot, pivoting into the chroot, and then unmounting all of the
> > remaining file systems.
>
> On general purpose systems we have a tmpfs chroot already: the initramfs.
Aren't we discarding the initramfs after we've pivoted away from it,
to save on memory? Keeping the tmpfs chroot around forever would be a
waste of memory, and in some cases, especially if you are using a
distribution kernel, the initramfs chroot can be rather large.
Creating a tmpfs chroot that was only good enough to manage the
shutdown would be pretty easy, though; you would need very few
files.
> That narrows the problem down to keeping `/boot` consistent at
> shutdown time. AIUI, a problem here is that XFS doesn't flush the
> journal on `syncfs`, only on unmount? And from what I can tell,
> even the `XFS_IOC_FREEZE` ioctl won't do that either.
I believe the log *is* checkpointed on an XFS_IOC_FREEZE.
- Ted
On Fri, May 19, 2017 at 11:29:04AM +0300, Amir Goldstein wrote:
> On Fri, May 19, 2017 at 3:20 AM, Darrick J. Wong
> <[email protected]> wrote:
>
> >
> > Therefore, add a reboot hook to freeze all filesystems (which in general
> > will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> > This is an unfortunate and insufficient workaround for multiple layers
> > of inadequate external software, but at least it will reduce boot time
> > surprises for the "OS updater failed to disengage the filesystem before
> > rebooting" case.
> >
>
> Darrick,
>
> Did you consider how many support calls this will generate for a stuck
> reboot command?
>
> I can think of at least one situation where this is guaranteed to hang.
> See this patch for the details:
> https://patchwork.kernel.org/patch/6266791/
>
> The referenced patch was applied to Android kernel to prevent
> system crash on emergency remount-ro via sysrq trigger.
Hmmm, I agree that we ought to avoid hanging on loopmounted filesystems,
and that iterating superblocks backwards is one (rough) way to do that.
> I don't know if it was even seriously considered by Al, because
> I got no comment, but I do realize that the change of behavior
> could generate support calls, so it's scary to make that change
> in mainline.
>
> I know it's not going to work around broken system software update,
> but how about providing sysrq trigger for emergency_freeze_all()?
> like emergency_remount(), but stronger.
> And this time, iterate supers in reverse order like I suggested to
> avoid loop mounted fs freeze dependencies.
>
> There is one little tiny problem though. Eric used up the last sysrq trigger
> key for emergency_thaw_all(). Do you see the irony in that? ;)
LOL.
> I am wondering how many people know about or use the emergency
> thaw trigger, but one dodgy option is to use the 'j' trigger to toggle
> thaw_all/freeze_all.
>
> Another perhaps slightly less dodgy option is to trigger freeze_all
> on a sequence of sysrq "emergency" triggers where it makes sense
> and is least likely to change any existing behavior, for example:
>
> echo u > /proc/sysrq-trigger
>
> # Remember if do_emergency_remount() completed with failures
>
> echo u > /proc/sysrq-trigger
>
> # Escalate to emergency freeze
Or maybe it's simpler just to have a counter -- three sysrq-u in a row
and we freeze all?
> OR
>
> echo u > /proc/sysrq-trigger
>
> # Remember if do_emergency_remount() completed with failures
>
> echo s > /proc/sysrq-trigger
>
> # Sync *after* remount r/o? That must mean emergency freeze
>
> I bet that system software that is already aware of and is issuing
> emergency remount r/o trigger prior to reboot, won't see any harm
> in adding an extra u/s trigger for good luck.
>
> Do you know if the gnarly system software in question is issuing
> emergency remount r/o prior to reboot?
It does not.
--D
>
> Amir.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> On Thu, May 18, 2017, at 08:20 PM, Darrick J. Wong wrote:
>
> > Therefore, add a reboot hook to freeze all filesystems (which in general
> > will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> > This is an unfortunate and insufficient workaround for multiple layers
> > of inadequate external software, but at least it will reduce boot time
> > surprises for the "OS updater failed to disengage the filesystem before
> > rebooting" case.
>
> As a maintainer of one of those userspace tools
> (https://github.com/ostreedev/ostree), which I don't think is the one
> in question here, but likely has the same issue - I'd like to have
> some sort of API to fix this - maybe flush the journal *without*
> remounting r/o?
The convention (at least among ext4 and xfs) is that fs freeze should be
checkpointing the journal.
> Unlike the case you're talking about with rebooting into a special
> update mode, libostree constructs a new root with hardlinks while
> the system is running. Hence, system downtime is just reboot, like
> dual-partition update systems, except we're more flexible.
>
> Although hm...I guess an API to flush the journal would only narrow
> the race.
>
> Is the single partition case really just doomed?
Probably. TBH given the current behavior of grub, I would always have a
separate /boot to minimize the amount it's allowed to touch. :)
--D
On Fri, May 19, 2017 at 11:27:34AM -0400, Theodore Ts'o wrote:
> On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> > As a maintainer of one of those userspace tools (https://github.com/ostreedev/ostree),
> > which I don't think is the one in question here, but likely has the same
> > issue - I'd like to have some sort of API to fix this - maybe flush the journal *without*
> > remounting r/o?
> >
> > Unlike the case you're talking about with rebooting into a special
> > update mode, libostree constructs a new root with hardlinks while
> > the system is running. Hence, system downtime is just reboot, like
> > dual-partition update systems, except we're more flexible.
> >
> > Although hm...I guess an API to flush the journal would only narrow
> > the race.
> >
> > Is the single partition case really just doomed?
>
> One of the things that came up when Darrick and I discussed this on
> the weekly ext4 developer's conference call was our mutual wonderment
> that none of the userspace tools implemented a reboot by creating a
> tmpfs chroot, pivoting into the chroot, and then unmounting all of the
> remaining file systems.
systemd seems to have the ability to do this -- if something dumps an
executable into /run/initramfs/shutdown (and remounts /run with 'exec')
then systemd will pivot to this script which can then kill everything it
needs and then unmount the filesystems. Or upgrade the fs. Seeing as
the rootfs is still mounted ro at the point that the shutdown script is
run, it could pull in whatever tools it wants.
Or inject malware, I guess. :P
In any case, I don't think it's unreasonable to want a system updater to
be able to detect that the fs containing vmlinuz and initrd hasn't
unmounted at the end of the upgrade, and therefore it needs to resort to
stronger tactics to forcibly unmount it before systemd reboots.
> This would also allow update schemes who want to enable various new
> file system features, or upgrade the root file system somehow, to be
> able to do so while the root file system is completely and cleanly
> unmounted.
>
> The other thing that would be useful is if grub2 would actually be
> able to replay the file system journal --- but given that grub2 is
Gross! :)
I don't think the XFS community will be enthusiastic about supporting
whatever wreckage may come out of that.
> GPLv3, and both ext4 and xfs are GPLv2-only, and given that past
> attempts of teams attempting to do clean room reimplementations of
> complex code bases for licensing reasons only (cough, make_ext4fs,
> *cough*) have not necessarily turned out well, I'm at least not going
> to hold my breath.
Err... yes, but that's a different thread altogether.
--D
>
> - Ted
On Fri 19-05-17 11:27:34, Ted Tso wrote:
> The other thing that would be useful is if grub2 would actually be
> able to replay the file system journal --- but given that grub2 is
> GPLv3, and both ext4 and xfs are GPLv2-only, and given that past
> attempts of teams attempting to do clean room reimplementations of
> complex code bases for licensing reasons only (cough, make_ext4fs,
> *cough*) have not necessarily turned out well, I'm at least not going
> to hold my breath.
A boot loader really should *not* write to the filesystem. Firstly, it would
have to be a completely separate codebase running under very different
constraints (real mode, no real memory management, etc.), so there's no easy
way to share the code with any other userspace libraries and thus the code
will be inherently buggy. Secondly, think of stuff like suspend to disk -
if someone touches the filesystem in any way behind the back of a suspended
kernel, file system corruption is very likely to follow sooner or later.
Just last year, I spent a couple of interesting days hunting down ext4
corruption on s390, only to find out that the boot procedure(*) there ended
up replaying the journal under a suspended kernel...
(*) Just for the ones interested in mainframe woes: s390 in SLES doesn't
use grub to parse the filesystem (as it is too difficult to access some
storage types from the boot loader AFAIU), so it uses a "first-stage" Linux
kernel to mount the root filesystem, find the proper kernel image there, and
then kexec into it...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Resurrecting this thread:
On Fri, May 19, 2017, at 03:01 PM, Darrick J. Wong wrote:
> On Fri, May 19, 2017 at 10:00:31AM -0400, Colin Walters wrote:
> > On Thu, May 18, 2017, at 08:20 PM, Darrick J. Wong wrote:
> >
> > > Therefore, add a reboot hook to freeze all filesystems (which in general
> > > will induce ext4/xfs/btrfs to checkpoint the log) just prior to reboot.
> > > This is an unfortunate and insufficient workaround for multiple layers
> > > of inadequate external software, but at least it will reduce boot time
> > > surprises for the "OS updater failed to disengage the filesystem before
> > > rebooting" case.
> >
> > As a maintainer of one of those userspace tools
> > (https://github.com/ostreedev/ostree), which I don't think is the one
> > in question here, but likely has the same issue - I'd like to have
> > some sort of API to fix this - maybe flush the journal *without*
> > remounting r/o?
>
> The convention (at least among ext4 and xfs) is that fs freeze should be
> checkpointing the journal.
OK, so I finally implemented this:
https://github.com/ostreedev/ostree/pull/1049
I had to go to some awkward lengths to try to make this safe; everything
in libostree is designed to be "crash only" - we're an update system
that doesn't install a SIGINT/SIGTERM handler, we just let the kernel
kill us, and that should always be safe. But if we're interrupted right after
we invoke FIFREEZE we'd leave the fs frozen.
Any objections to something like an ioctl (fd, FIFREEZETHAW, 0) ?
I was thinking about this more though, and while this obviously helps,
it's still just narrowing a window; if we have a system crash after
writing the config but before we've done a freeze-thaw, we still
have the journaled data problem.
In the end, the real fix is probably something like storing
multiple copies of the bootloader config with checksums that grub
can verify. Basically teach grub to try really hard to extract known-good
data from the FS. For file-level consistency that'd be pretty easy,
we could have e.g.
/boot/efi/grub.cfg
/boot/efi/grub.cfg.checksum (sha256 of grub.cfg)
/boot/efi/grub.cfg.orig
/boot/efi/grub.cfg.orig.checksum (sha256 of grub.cfg.orig)
etc.
But what I don't know offhand without diving a lot more into XFS
internals is how resilient such a scheme would be against the
outstanding journal writes for the directory. (Maybe it's more
resilient to use separate /boot/efi/grub-new and /boot/efi/grub-old
dirs?)
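
The file-level part of that scheme could be sketched roughly as
follows. Everything here is hypothetical: the struct and function names
are invented, and a 64-bit FNV-1a hash stands in for the sha256
mentioned above purely to keep the example short. It also says nothing
about the directory-level journal question, which would need a real
answer from the filesystem side.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in digest: 64-bit FNV-1a, NOT the sha256 proposed above. */
uint64_t fnv1a64(const unsigned char *data, size_t len)
{
	uint64_t h = 14695981039346656037ULL;	/* FNV offset basis */

	for (size_t i = 0; i < len; i++) {
		h ^= data[i];
		h *= 1099511628211ULL;		/* FNV prime */
	}
	return h;
}

/* One config copy plus the checksum stored next to it (hypothetical). */
struct cfg_candidate {
	const char *name;		/* e.g. "grub.cfg", "grub.cfg.orig" */
	const unsigned char *data;	/* file contents as read from disk */
	size_t len;
	uint64_t stored_sum;		/* contents of the matching .checksum */
};

/*
 * Return the index of the first candidate whose contents match its
 * stored checksum, or -1 if no copy verifies.
 */
int pick_good_config(const struct cfg_candidate *c, size_t n)
{
	assert(c != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		if (fnv1a64(c[i].data, c[i].len) == c[i].stored_sum)
			return (int)i;
	return -1;
}
```

Listing the newest copy first means the loader only falls back to the
older copy when the new one fails verification, e.g. because it was
only partially visible without log replay.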
> Any objections to something like an ioctl (fd, FIFREEZETHAW, 0) ?
It's going to be completely trivial, which argues for it. The only
points left would be bikeshedding over the name, and how to describe
its semantics.
> In the end, the real fix is probably something like storing
> multiple copies of the bootloader config with checksums that grub
> can verify. Basically teach grub to try really hard to extract known-good
> data from the FS. For file-level consistency that'd be pretty easy,
> we could have e.g.
The real answer is to have a filesystem that does the above for you
for the boot partition, e.g. one where the kernel and grub have
a common consistency protocol.
On Sat, Aug 05, 2017 at 07:16:21AM -0700, Christoph Hellwig wrote:
> > Any objections to something like an ioctl (fd, FIFREEZETHAW, 0) ?
>
> It's going to be completely trivial, which argues for it. The only
> points left would be bikeshedding over the name, and how to describe
> its semantics.
FSCHECKPOINT? Since that's your requirement anyway...
"Ensures that all filesystem metadata (which may be in a journal
somewhere) has been checkpointed back to disk." ?
--D
> > In the end, the real fix is probably something like storing
> > multiple copies of the bootloader config with checksums that grub
> > can verify. Basically teach grub to try really hard to extract known-good
> > data from the FS. For file-level consistency that'd be pretty easy,
> > we could have e.g.
>
> The real answer is to have a filesystem that does the above for you
> for the boot partition, e.g. one where the kernel and grub have
> a common consistency protocol.
On Sat, Aug 05, 2017 at 08:45:28AM -0700, Darrick J. Wong wrote:
> FSCHECKPOINT? Since that's your requirement anyway...
>
> "Ensures that all filesystem metadata (which may be in a journal
> somewhere) has been checkpointed back to disk." ?
What about a file system that is entirely log structured and just
garbage collects sometimes?
On Fri, Aug 11, 2017 at 03:02:30AM -0700, Christoph Hellwig wrote:
> On Sat, Aug 05, 2017 at 08:45:28AM -0700, Darrick J. Wong wrote:
> > FSCHECKPOINT? Since that's your requirement anyway...
> >
> > "Ensures that all filesystem metadata (which may be in a journal
> > somewhere) has been checkpointed back to disk." ?
>
> What about a file system that is entirely log structured and just
> garbage collects sometimes?
"Ensure that <insert insufficient engineering insult here> external fs
drivers can find files on disk." :P
--D
On Fri, Aug 11, 2017 at 09:26:20AM -0700, Darrick J. Wong wrote:
> "Ensure that <insert insufficient engineering insult here> external fs
> drivers can find files on disk." :P
If they are correctly implemented they can always access it anyway.