2018-03-06 19:05:15

by Lennart Sorensen

Subject: ext4 ignoring rootfs default mount options

While switching a system from using ext3 to ext4 (It's about time) I
discovered that setting default options for the filesystem using tune2fs
-o doesn't work for the root filesystem when mounted by the kernel itself.
Filesystems mounted from userspace with the mount command use the options
set just fine. The extended option set with tune2fs -E mount_opts=
works fine however. I am sure it also works fine for those using an initrd
(which is why almost no one would ever see this bug), since that uses the
mount command from userspace to mount the rootfs.

Specifically we did:

tune2fs -o nodelalloc /dev/sda1

at boot we got:
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)

Instead using:
tune2fs -E mount_opts=nodelalloc /dev/sda1

at boot we got:
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: nodelalloc; (null)

which seems better.

For filesystems mounted from userspace with the mount command, either
method works however. The first option however is what the comment in
fs/ext4/super.c suggests to use.

Of course I also got the messages:
EXT4-fs (sda1): Mount option "nodelalloc" incompatible with ext3
EXT4-fs (sda1): failed to parse options in superblock: nodelalloc
EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities

Those of course all ought to not be there. If it is ext4, I don't really
care if ext3 doesn't understand the options, it works fine with ext4.
I should not have to explicitly specify that it is ext4 just to avoid
those. It is not an ext3 filesystem.

And of course the last annoying thing I noticed is that /proc/mounts
doesn't actually tell you that nodelalloc is active when it is set
from the default mount options rather than from the mount command line
(or fstab). Lots of other non default options are explicitly handled,
but not delalloc. The only place you see it, is in the dmesg line
telling you what options the filesystem was mounted with.

--
Len Sorensen


2018-03-07 04:07:36

by Theodore Ts'o

Subject: Re: ext4 ignoring rootfs default mount options

On Tue, Mar 06, 2018 at 02:03:15PM -0500, Lennart Sorensen wrote:
> While switching a system from using ext3 to ext4 (It's about time) I
> discovered that setting default options for the filesystem using tune2fs
> -o doesn't work for the root filesystem when mounted by the kernel itself.
> Filesystems mounted from userspace with the mount command use the options
> set just fine. The extended option set with tune2fs -E mount_opts=
> works fine however.

Well.... it's not that it's being ignored. It's just a
misunderstanding of how a few things work. It's also that how we
handle mount options has evolved over time, leading to a situation
which is confusing.

First, tune2fs changes the default of ext4's mount options. This is
stated in the tune2fs man page:

-o [^]mount-option[,...]
       Set or clear the indicated default mount options in the
       filesystem. Default mount options can be overridden by mount
       options specified either in /etc/fstab(5) or on the command line
       arguments to mount(8). Older kernels may not support this
       feature; in particular, kernels which predate 2.4.20 will almost
       certainly ignore the default mount options field in the
       superblock.

Secondly, the message printed when a file system is mounted, e.g.:

> EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)

... shows the mount option string that is passed to the mount system
call.

The extended mount options string is different. It was something that
we added later. If it is present, the extended mount options string is
printed first, followed by a semicolon, followed by the string passed to
the mount system call.

Hence:

> tune2fs -E mount_opts=nodelalloc /dev/sda1
>
> at boot we got:
> EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: nodelalloc; (null)
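
To make the layout of that "Opts:" field concrete, here is a minimal shell
sketch that splits it into its two parts. It assumes the message format
quoted in this thread (other kernel versions may print something
different), and split_ext4_opts is a made-up helper name, not part of any
tool.

```shell
# Hypothetical helper: split the "Opts:" field of an EXT4-fs mount message
# into the superblock mount_opts string and the string passed to mount(2).
split_ext4_opts() {
    opts=${1#*Opts: }              # keep only the text after "Opts: "
    case $opts in
        *';'*)
            sb=${opts%%;*}         # before the semicolon: superblock mount_opts
            call=${opts#*; }       # after "; ": string given to the mount call
            ;;
        *)
            sb=                    # no semicolon: no extended mount options
            call=$opts
            ;;
    esac
    printf 'superblock=%s\nmount=%s\n' "$sb" "$call"
}

split_ext4_opts 'EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: nodelalloc; (null)'
```

For a message ending in "Opts: (null)" the superblock part comes back
empty. That only means no extended mount options string is stored;
defaults set with tune2fs -o never appear in this field.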


The description of -E option in the tune2fs man page talks about some
of this, but it's arguably confusing.

You can see exactly which mount options are active by looking at
the file /proc/fs/ext4/<dev>/options. So this is how you can prove to
yourself that tune2fs -o works.

root@kvm-xfstests:~# dmesg -n 7
root@kvm-xfstests:~# tune2fs -o nodelalloc /dev/vdc
tune2fs 1.44-WIP (06-Sep-2017)
root@kvm-xfstests:~# mount /dev/vdc /vdc
[ 27.389192] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# cat /proc/fs/ext4/vdc/options
rw
bsddf
nogrpid
block_validity
dioread_lock
nodiscard
nodelalloc
journal_checksum
barrier
auto_da_alloc
user_xattr
acl
noquota
resuid=0
resgid=0
errors=continue
commit=5
min_batch_time=0
max_batch_time=15000
stripe=0
data=ordered
inode_readahead_blks=32
init_itable=10
max_dir_size_kb=0
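
A small shell sketch of the same check, for use in scripts. The helper
name ext4_opt_active is hypothetical, and the file argument is whatever
/proc/fs/ext4/<dev>/options path applies to your device (vdc in the
session above).

```shell
# ext4_opt_active OPTION FILE
# Succeeds if OPTION appears as a complete line in FILE. Matching whole
# lines (-x) with a fixed string (-F) keeps "delalloc" from matching
# "nodelalloc".
ext4_opt_active() {
    grep -qxF -- "$1" "$2"
}

# Example use (the path is an assumption; substitute your own device):
#   ext4_opt_active nodelalloc /proc/fs/ext4/vdc/options && echo "delalloc is off"
```

Options that carry a value have to be given in full, e.g. "commit=5".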

> For filesystems mounted from userspace with the mount command, either
> method works however. The first option however is what the comment in
> fs/ext4/super.c suggests to use.
>
> Of course I also got the messages:
> EXT4-fs (sda1): Mount option "nodelalloc" incompatible with ext3
> EXT4-fs (sda1): failed to parse options in superblock: nodelalloc
> EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities

So what's happening here is something that has recently started
getting reported by users. Most modern distros use an initial
ramdisk to mount the root file system, and they use blkid to determine
the file system type, so the file system is mounted with the right type.
You only see these messages if the kernel itself is mounting the root
file system. An indication that this is what's happening is the
following message in dmesg:

[ 2.196149] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.

This message means the kernel fallback code was used to mount the file
system, not the initial ramdisk code in userspace.
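
If you want to check for that from a script, a rough sketch is below. It
only greps for the message text shown above, so it assumes the kernel log
still holds the boot messages and that the wording matches;
root_mounted_by_kernel is a made-up name.

```shell
# Reads kernel log text on stdin and succeeds if it contains the
# "VFS: Mounted root (...)" line described above.
root_mounted_by_kernel() {
    grep -q 'VFS: Mounted root ('
}

# Example use:
#   dmesg | root_mounted_by_kernel && echo "rootfs was mounted by the kernel itself"
```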

If you are using the kernel fallback code, it will first try to mount
the file system as ext3, and if you have "nodelalloc" in the extended
mount options in the superblock, that ext3 attempt will try to apply it
first and reject it. The messages you have quoted above are harmless.
But they are scaring users, so we are looking into ways to suppress them.

> And of course the last annoying thing I noticed is that /proc/mounts
> doesn't actually tell you that nodelalloc is active when it is set
> from the default mount options rather than from the mount command line
> (or fstab). Lots of other non default options are explicitly handled,
> but not delalloc. The only place you see it, is in the dmesg line
> telling you what options the filesystem was mounted with.

That's because /proc/mounts is trying to emulate the user-space
maintained /etc/mtab file. So we deliberately suppress default mount
options. If you take out this feature:

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 756f515b762d..e93b86f68da5 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2038,8 +2038,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 		if (((m->flags & (MOPT_SET|MOPT_CLEAR)) == 0) ||
 		    (m->flags & MOPT_CLEAR_ERR))
 			continue;
-		if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
-			continue; /* skip if same as the default */
+//		if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
+//			continue; /* skip if same as the default */
 		if ((want_set &&
 		     (sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
 		    (!want_set && (sbi->s_mount_opt & m->mount_opt)))


... then /proc/mounts looks a lot messier, and most users would not
like the result:

/dev/vdc /vdc ext4 rw,relatime,bsddf,nogrpid,block_validity,dioread_lock,nodiscard,nodelalloc,journal_checksum,barrier,auto_da_alloc,user_xattr,acl,noquota,data=ordered 0 0

If you really want the reliable "what are the mount options right
now", the place to look is /proc/fs/ext4/<device>/options, as
described above.

Cheers,

- Ted


2018-03-07 15:16:51

by Lennart Sorensen

Subject: Re: ext4 ignoring rootfs default mount options

On Tue, Mar 06, 2018 at 11:06:08PM -0500, Theodore Y. Ts'o wrote:
> On Tue, Mar 06, 2018 at 02:03:15PM -0500, Lennart Sorensen wrote:
> > While switching a system from using ext3 to ext4 (It's about time) I
> > discovered that setting default options for the filesystem using tune2fs
> > -o doesn't work for the root filesystem when mounted by the kernel itself.
> > Filesystems mounted from userspace with the mount command use the options
> > set just fine. The extended option set with tune2fs -E mount_opts=
> > works fine however.
>
> Well.... it's not that it's being ignored. It's just a
> misunderstanding of how a few things work. It's also that how we
> handle mount options has evolved over time, leading to a situation
> which is confusing.
>
> First, tune2fs changes the default of ext4's mount options. This is
> stated in the tune2fs man page:
>
> -o [^]mount-option[,...]
>        Set or clear the indicated default mount options in the
>        filesystem. Default mount options can be overridden by mount
>        options specified either in /etc/fstab(5) or on the command line
>        arguments to mount(8). Older kernels may not support this
>        feature; in particular, kernels which predate 2.4.20 will almost
>        certainly ignore the default mount options field in the
>        superblock.
>
> Secondly, the message printed when a file system is mounted, e.g.:
>
> > EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
>
> ... shows the mount option string that is passed to the mount system
> call.
>
> The extended mount options string is different. It was something that
> we added later. If it is present, the extended mount options string is
> printed first, followed by a semicolon, followed by the string passed to
> the mount system call.
>
> Hence:
>
> > tune2fs -E mount_opts=nodelalloc /dev/sda1
> >
> > at boot we got:
> > EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: nodelalloc; (null)
>
>
> The description of -E option in the tune2fs man page talks about some
> of this, but it's arguably confusing.
>
> You can see exactly which mount options are active by looking at
> the file /proc/fs/ext4/<dev>/options. So this is how you can prove to
> yourself that tune2fs -o works.

OK that does in fact seem to be the case. That's good.

> root@kvm-xfstests:~# dmesg -n 7
> root@kvm-xfstests:~# tune2fs -o nodelalloc /dev/vdc
> tune2fs 1.44-WIP (06-Sep-2017)
> root@kvm-xfstests:~# mount /dev/vdc /vdc
> [ 27.389192] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: (null)
> root@kvm-xfstests:~# cat /proc/fs/ext4/vdc/options
> rw
> bsddf
> nogrpid
> block_validity
> dioread_lock
> nodiscard
> nodelalloc
> journal_checksum
> barrier
> auto_da_alloc
> user_xattr
> acl
> noquota
> resuid=0
> resgid=0
> errors=continue
> commit=5
> min_batch_time=0
> max_batch_time=15000
> stripe=0
> data=ordered
> inode_readahead_blks=32
> init_itable=10
> max_dir_size_kb=0
>
> > For filesystems mounted from userspace with the mount command, either
> > method works however. The first option however is what the comment in
> > fs/ext4/super.c suggests to use.
> >
> > Of course I also got the messages:
> > EXT4-fs (sda1): Mount option "nodelalloc" incompatible with ext3
> > EXT4-fs (sda1): failed to parse options in superblock: nodelalloc
> > EXT4-fs (sda1): couldn't mount as ext3 due to feature incompatibilities
>
> So what's happening here is something that has recently started
> getting reported by users. Most modern distros use an initial
> ramdisk to mount the root file system, and they use blkid to determine
> the file system type, so the file system is mounted with the right type.
> You only see these messages if the kernel itself is mounting the root
> file system. An indication that this is what's happening is the
> following message in dmesg:
>
> [ 2.196149] VFS: Mounted root (ext4 filesystem) readonly on device 254:0.
>
> This message means the kernel fallback code was used to mount the file
> system, not the initial ramdisk code in userspace.
>
> If you are using the kernel fallback code, it will first try to mount
> the file system as ext3, and if you have "nodelalloc" in the extended
> mount options in the superblock, that ext3 attempt will try to apply it
> first and reject it. The messages you have quoted above are harmless.
> But they are scaring users, so we are looking into ways to suppress them.
>
> > And of course the last annoying thing I noticed is that /proc/mounts
> > doesn't actually tell you that nodelalloc is active when it is set
> > from the default mount options rather than from the mount command line
> > (or fstab). Lots of other non default options are explicitly handled,
> > but not delalloc. The only place you see it, is in the dmesg line
> > telling you what options the filesystem was mounted with.
>
> That's because /proc/mounts is trying to emulate the user-space
> maintained /etc/mtab file. So we deliberately suppress default mount
> options. If you take out this feature:
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 756f515b762d..e93b86f68da5 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2038,8 +2038,8 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
>  		if (((m->flags & (MOPT_SET|MOPT_CLEAR)) == 0) ||
>  		    (m->flags & MOPT_CLEAR_ERR))
>  			continue;
> -		if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
> -			continue; /* skip if same as the default */
> +//		if (!(m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt)))
> +//			continue; /* skip if same as the default */
>  		if ((want_set &&
>  		     (sbi->s_mount_opt & m->mount_opt) != m->mount_opt) ||
>  		    (!want_set && (sbi->s_mount_opt & m->mount_opt)))
>
>
> ... then /proc/mounts looks a lot messier, and most users would not
> like the result:
>
> /dev/vdc /vdc ext4 rw,relatime,bsddf,nogrpid,block_validity,dioread_lock,nodiscard,nodelalloc,journal_checksum,barrier,auto_da_alloc,user_xattr,acl,noquota,data=ordered 0 0

Yes that gets too messy.

> If you really want the reliable "what are the mount options right
> now", the place to look is /proc/fs/ext4/<device>/options, as
> described above.

But delalloc is the default for ext4, so a filesystem mounted with
nodelalloc ought to show that in /proc/mounts as far as I am concerned.
The comment in the code says anything that is different than the global
defaults and the filesystem defaults will be shown, but in this case it
is not. Maybe the comment is just wrong or unclear and this is actually
the intended behaviour. I don't think I like the behaviour if it is
intended to work this way. The /proc/fs/ext4/ option at least looks
workable. Strangely I found the function that implements it but couldn't
find anything using it for some reason. I must have just missed it
since it obviously is there.

--
Len Sorensen

2018-03-07 16:10:41

by Theodore Ts'o

Subject: Re: ext4 ignoring rootfs default mount options

On Wed, Mar 07, 2018 at 10:14:28AM -0500, Lennart Sorensen wrote:
>
> But delalloc is the default for ext4, so a filesystem mounted with
> nodelalloc ought to show that in /proc/mounts as far as I am concerned.
> The comment in the code says anything that is different than the global
> defaults and the filesystem defaults will be shown, but in this case it
> is not. Maybe the comment is just wrong or unclear and this is actually
> the intended behaviour. I don't think I like the behaviour if it is
> intended to work this way. The /proc/fs/ext4/ option at least looks
> workable. Strangely I found the function that implements it but couldn't
> find anything using it for some reason. I must have just missed it
> since it obviously is there.

This is where it's critical to understand that "tune2fs -o" changes
the *default* mount options. The key phrase in the comment is anything
different from the *filesystem* defaults (that is, the defaults of the
particular ext4 file system, as opposed to the global defaults). The
idea is that /proc/mounts and /etc/mtab show the options string that,
if used in /etc/fstab or on the mount command line, will replicate
the current mount option state. Furthermore, /proc/mounts shows the
minimal such set of mount options.

You may not like the behavior, but it's been this way forever, and the
reasoning behind it is that the low-level file system code doesn't
really know the actual mount option string that would be in
/etc/fstab or on the /sbin/mount command line. That's because the
/sbin/mount command actually parses the mount options, and things like
"ro" are actually translated into a bitflag passed to the mount system
call. So for example, it's impossible to know whether "rw" was in the
user-specified mount string, since we never see it by the time the
string gets sent to ext4_fill_super (in fact the kernel never sees
it). So when we try to make /proc/mounts a replacement for /etc/mtab
(since some people make /etc/mtab a symlink to /proc/mounts), it's
actually impossible. Trying to make it be the minimal set of options
was at least a consistent thing. That is, if you use "tune2fs -o
nodelalloc", it's not necessary to put nodelalloc in /etc/fstab or in
the mount command line. And hence, it should not be in /proc/mounts.
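
The rule can be sketched with shell arithmetic. This is only an
illustration of the "differs from the filesystem default" test, modelled
on the m->mount_opt & (sbi->s_mount_opt ^ def_mount_opt) expression in
_ext4_show_options(); the bit value and the function name are invented
for the example and are not the kernel's.

```shell
# Hypothetical bit value, for illustration only; the real flags are
# defined in fs/ext4/ext4.h.
DELALLOC=4

# differs_from_fs_default CURRENT DEFAULT BIT
# Succeeds when BIT differs between the current mount state and the
# filesystem's own defaults, i.e. when the option would be listed.
differs_from_fs_default() {
    [ $(( $3 & ($1 ^ $2) )) -ne 0 ]
}

# Case 1: "tune2fs -o nodelalloc", nothing extra at mount time.
# Default and current state both have delalloc off, so nothing is listed.
differs_from_fs_default 0 0 "$DELALLOC" || echo "case 1: nodelalloc not listed"

# Case 2: stock defaults (delalloc on), mounted with "-o nodelalloc".
# The current state differs from the default, so the option is listed.
differs_from_fs_default 0 "$DELALLOC" "$DELALLOC" && echo "case 2: nodelalloc listed"
```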

As for where ext4_seq_options_show() gets called, it's hard to spot
because we have a fair number of macro shortcuts in fs/ext4/sysfs.c
(which is where we put all of the pseudo file system support for ext4,
which means it includes procfs). Search for the macros
PROC_FILE_SHOW_DEFN and PROC_FILE_LIST.

- Ted

2018-03-07 16:16:31

by Lennart Sorensen

Subject: Re: ext4 ignoring rootfs default mount options

On Wed, Mar 07, 2018 at 11:08:56AM -0500, Theodore Y. Ts'o wrote:
> This is where it's critical to understand that "tune2fs -o" changes
> the *default* mount options. The key phrase in the comment is anything
> different from the *filesystem* defaults (that is, the defaults of the
> particular ext4 file system, as opposed to the global defaults). The
> idea is that /proc/mounts and /etc/mtab show the options string that,
> if used in /etc/fstab or on the mount command line, will replicate
> the current mount option state. Furthermore, /proc/mounts shows the
> minimal such set of mount options.
>
> You may not like the behavior, but it's been this way forever, and the
> reasoning behind it is that the low-level file system code doesn't
> really know the actual mount option string that would be in
> /etc/fstab or on the /sbin/mount command line. That's because the
> /sbin/mount command actually parses the mount options, and things like
> "ro" are actually translated into a bitflag passed to the mount system
> call. So for example, it's impossible to know whether "rw" was in the
> user-specified mount string, since we never see it by the time the
> string gets sent to ext4_fill_super (in fact the kernel never sees
> it). So when we try to make /proc/mounts a replacement for /etc/mtab
> (since some people make /etc/mtab a symlink to /proc/mounts), it's
> actually impossible. Trying to make it be the minimal set of options
> was at least a consistent thing. That is, if you use "tune2fs -o
> nodelalloc", it's not necessary to put nodelalloc in /etc/fstab or in
> the mount command line. And hence, it should not be in /proc/mounts.

OK, that makes sense. Thanks. I will work on convincing myself this
is how it should be.

> As for where ext4_seq_options_show() gets called, it's hard to spot
> because we have a fair number of macro shortcuts in fs/ext4/sysfs.c
> (which is where we put all of the pseudo file system support for ext4,
> which means it includes procfs). Search for the macros
> PROC_FILE_SHOW_DEFN and PROC_FILE_LIST.

Oh that's where it is.

--
Len Sorensen

2018-03-07 19:14:55

by Lennart Sorensen

Subject: Re: ext4 ignoring rootfs default mount options

On Wed, Mar 07, 2018 at 11:08:56AM -0500, Theodore Y. Ts'o wrote:
> This is where it's critical to understand that "tune2fs -o" changes
> the *default* mount options. The key phrase in the comment is anything
> different from the *filesystem* defaults (that is, the defaults of the
> particular ext4 file system, as opposed to the global defaults). The
> idea is that /proc/mounts and /etc/mtab show the options string that,
> if used in /etc/fstab or on the mount command line, will replicate
> the current mount option state. Furthermore, /proc/mounts shows the
> minimal such set of mount options.

One more question about this.

Trying to use tune2fs -E mount_opts to set some default options, and
can't figure out how to enter two options at once.

Doing:

tune2fs -E mount_opts=nodelalloc,nouser_xattr /dev/sda3

gives the error:

tune2fs 1.43.4 (31-Jan-2017)

Bad options specified.

Extended options are separated by commas, and may take an argument which
is set off by an equals ('=') sign.

Valid extended options are:
clear_mmp
hash_alg=<hash algorithm>
mount_opts=<extended default mount options>
stride=<RAID per-disk chunk size in blocks>
stripe_width=<RAID stride*data disks in blocks>
test_fs
^test_fs

Apparently it gets confused by the , in the argument to mount_opts.

Unfortunately , happens to be the separator required by ext4 for that
field. So how does one use it?

Sure in this case I can set one with -o and the other with -E, but in
general there seems to be a small problem here, probably only in user
space though. Seems tune2fs needs some change in how it deals with
extended options that contain commas.

--
Len Sorensen

2018-03-07 22:25:29

by Tyson Nottingham

Subject: Re: ext4 ignoring rootfs default mount options

On Wed, Mar 07, 2018 at 02:13:24PM -0500, Lennart Sorensen wrote:

> Trying to use tune2fs -E mount_opts to set some default options, and
> can't figure out how to enter two options at once.

Lennart,

I noticed this a while back, too. I don't believe you can currently. As a
workaround, you can use debugfs:

debugfs -w -R 'set_super_value mount_opts opt1,opt2,opt3' <dev>

Note that when printing superblock info, the e2fsprogs tools call the bit-based
default mount options "Default mount options" and the string-based default mount
options "Mount options".

Tyson

2018-03-07 22:52:01

by Theodore Ts'o

Subject: Re: ext4 ignoring rootfs default mount options

On Wed, Mar 07, 2018 at 02:13:24PM -0500, Lennart Sorensen wrote:
>
> Trying to use tune2fs -E mount_opts to set some default options, and
> can't figure out how to enter two options at once.
>
> ...
>
> Sure in this case I can set one with -o and the other with -E, but in
> general there seems to be a small problem here, probably only in user
> space though. Seems tune2fs needs some change in how it deals with
> extended options that contain commas.

Yes, this is a shortcoming in tune2fs. You can set extended mount
options using debugfs:

debugfs -w -R "set_super_value mount_opts foo,bar" /dev/sda1

... but there ought to be some way to support some kind of quoting
mechanism so that tune2fs can understand when a comma is part of an
extended option value, as opposed to separating extended options.
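
To illustrate the problem, here is a sketch of what splitting on every
comma does to the argument. It only mimics what the error text suggests
is happening; it is not the actual tune2fs parser, and
split_extended_opts is a made-up name.

```shell
# Naive splitting of a -E argument on every comma, one token per line.
split_extended_opts() {
    printf '%s\n' "$1" | tr ',' '\n'
}

split_extended_opts 'mount_opts=nodelalloc,nouser_xattr'
```

The second token that comes out, nouser_xattr, is not one of the valid
extended options in the list tune2fs prints, which would explain the
"Bad options specified." error. debugfs takes the whole value as one
argument, so it does not have this problem.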

Extended options haven't been used much, so it's not been something
that has gotten a lot of polish. Backing up for a bit, is there a
reason why you need so many mount options when mounting the root file
system? Specifically, why do you want to turn off delayed allocation?

- Ted

2018-03-15 18:36:52

by Lennart Sorensen

Subject: Re: ext4 ignoring rootfs default mount options

On Wed, Mar 07, 2018 at 05:50:43PM -0500, Theodore Y. Ts'o wrote:
> Yes, this is a shortcoming in tune2fs. You can set extended mount
> options using debugfs:
>
> debugfs -w -R "set_super_value mount_opts foo,bar" /dev/sda1
>
> ... but there ought to be some way to support some kind of quoting
> mechanism so that tune2fs can understand when a comma is part of an
> extended option value, as opposed to separating extended options.
>
> Extended options haven't been used much, so it's not been something
> that has gotten a lot of polish. Backing up for a bit, is there a
> reason why you need so many mount options when mounting the root file
> system? Specifically, why do you want to turn off delayed allocation?

Completely don't trust it. Also don't do much writing so no real benefit.
Seeing machines that were killed by the watchdog come back with blank
chunks in log files is not acceptable. Getting all user space software
fixed to actually sync properly in all the right places takes forever.

Now for my home machine (especially my mythtv box) it is absolutely on
and a great thing to have.

--
Len Sorensen