2007-05-21 19:49:14

by Kalpak Shah

Subject: [RFC][PATCH] Multiple mount protection

Hi,

There have been reported instances of a filesystem being mounted in two places at the same time, causing a lot of damage to the filesystem. This patch reserves superblock fields and an INCOMPAT flag for adding multiple mount protection (MMP) support within the ext4 filesystem itself. The superblock will hold a block number (s_mmp_block) pointing to an MMP structure, which contains a sequence number that a mounted filesystem updates every 5 seconds. Whenever a filesystem is mounted, it will first wait for s_mmp_interval seconds to make sure that the MMP sequence does not change. To make doubly sure, we then write a random sequence number into the MMP block and wait for another s_mmp_interval seconds. If the sequence number doesn't change, the mount will succeed. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated will be displayed. tune2fs can be used to set s_mmp_interval as desired.
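
The two-phase check described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the code from the patch: `struct mmp_block`, `mmp_check_before_mount()` and the `wait` callback are hypothetical names, and the on-disk block is replaced by an in-memory structure so that the logic can be exercised without a device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for the on-disk MMP block. */
struct mmp_block {
    uint32_t seq;   /* sequence number bumped by whoever has the fs mounted */
};

/* Hook that "waits" for the given number of seconds. In this sketch it is a
 * callback so that a simulated peer can modify the block in the meantime. */
typedef void (*mmp_wait_fn)(struct mmp_block *blk, unsigned int seconds);

/* Returns true if the mount may proceed, false if another node looks alive. */
static bool mmp_check_before_mount(struct mmp_block *blk,
                                   unsigned int interval,
                                   uint32_t random_seq,
                                   mmp_wait_fn wait)
{
    /* Phase 1: sample the sequence, wait, and make sure it did not move. */
    uint32_t seen = blk->seq;
    wait(blk, interval);
    if (blk->seq != seen)
        return false;

    /* Phase 2: write our own random sequence and wait once more. */
    blk->seq = random_seq;
    wait(blk, interval);
    return blk->seq == random_seq;
}

/* Simulated environment: nobody else touches the block while we wait. */
static void wait_idle(struct mmp_block *blk, unsigned int seconds)
{
    (void)blk;
    (void)seconds;
}

/* Simulated environment: a peer with the fs mounted bumps the sequence. */
static void wait_with_live_peer(struct mmp_block *blk, unsigned int seconds)
{
    (void)seconds;
    blk->seq++;
}
```

With `wait_idle` the check succeeds and leaves the random sequence in the block; with `wait_with_live_peer` it fails in the first phase.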

It will also protect against running e2fsck on a mounted filesystem by adding similar logic to ext2fs_open().

Any comments or views are welcome!

Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Kalpak Shah <[email protected]>

Index: e2fsprogs-1.40/lib/ext2fs/ext2_fs.h
===================================================================
--- e2fsprogs-1.40.orig/lib/ext2fs/ext2_fs.h
+++ e2fsprogs-1.40/lib/ext2fs/ext2_fs.h
@@ -568,8 +568,9 @@ struct ext2_super_block {
     __u16   s_want_extra_isize;     /* New inodes should reserve # bytes */
     __u32   s_flags;                /* Miscellaneous flags */
     __u16   s_raid_stride;          /* RAID stride */
-    __u16   s_pad;                  /* Padding */
-    __u32   s_reserved[166];        /* Padding to the end of the block */
+    __u16   s_mmp_interval;         /* Wait for # seconds in MMP checking */
+    __u64   s_mmp_block;            /* Block for multi-mount protection */
+    __u32   s_reserved[164];        /* Padding to the end of the block */
 };

 /*
@@ -631,10 +632,12 @@ struct ext2_super_block {
 #define EXT2_FEATURE_INCOMPAT_META_BG          0x0010
 #define EXT3_FEATURE_INCOMPAT_EXTENTS          0x0040
 #define EXT4_FEATURE_INCOMPAT_64BIT            0x0080
+#define EXT4_FEATURE_INCOMPAT_MMP              0x0100


 #define EXT2_FEATURE_COMPAT_SUPP       0
-#define EXT2_FEATURE_INCOMPAT_SUPP     (EXT2_FEATURE_INCOMPAT_FILETYPE)
+#define EXT2_FEATURE_INCOMPAT_SUPP     (EXT2_FEATURE_INCOMPAT_FILETYPE| \
+                                        EXT4_FEATURE_INCOMPAT_MMP)
 #define EXT2_FEATURE_RO_COMPAT_SUPP    (EXT2_FEATURE_RO_COMPAT_SPARSE_SUPER| \
                                         EXT2_FEATURE_RO_COMPAT_LARGE_FILE| \
                                         EXT2_FEATURE_RO_COMPAT_BTREE_DIR)
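
A quick way to convince oneself that the first hunk keeps the superblock size unchanged is a compile-time check on the affected tail of the structure. The sketch below is not part of the patch: the struct names `sb_tail_old` and `sb_tail_new` are hypothetical, only the fields from `s_flags` onward are reproduced, and it assumes that `s_flags` sits on an 8-byte boundary in the full structure so that the tail can be laid out on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Tail of the superblock before the patch (field names from the diff). */
struct sb_tail_old {
    uint32_t s_flags;
    uint16_t s_raid_stride;
    uint16_t s_pad;
    uint32_t s_reserved[166];
};

/* Tail of the superblock after the patch. */
struct sb_tail_new {
    uint32_t s_flags;
    uint16_t s_raid_stride;
    uint16_t s_mmp_interval;
    uint64_t s_mmp_block;
    uint32_t s_reserved[164];
};

/* 2 bytes of padding plus 166 words are replaced by a 2-byte interval,
 * an 8-byte block number and 164 words: 666 bytes either way. */
_Static_assert(sizeof(struct sb_tail_old) == sizeof(struct sb_tail_new),
               "patch must not change the superblock size");

/* The 64-bit field lands right after the two 16-bit fields, with no
 * compiler-inserted padding in front of it. */
_Static_assert(offsetof(struct sb_tail_new, s_mmp_block) == 8,
               "s_mmp_block must follow s_mmp_interval without padding");
```

If either assertion fails, the translation unit does not compile, so a layout mistake would be caught before any code runs.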


Thanks,
Kalpak.


2007-05-22 07:15:16

by Manoj Joseph

Subject: Re: [RFC][PATCH] Multiple mount protection

Kalpak Shah wrote:
> Hi,
>
> There have been reported instances of a filesystem having been
> mounted at 2 places at the same time causing a lot of damage to the
> filesystem. This patch reserves superblock fields and an INCOMPAT
> flag for adding multiple mount protection(MMP) support within the
> ext4 filesystem itself. The superblock will have a block number
> (s_mmp_block) which will hold a MMP structure which has a sequence
> number which will be periodically updated every 5 seconds by a
> mounted filesystem. Whenever a filesystem will be mounted it will
> wait for s_mmp_interval seconds to make sure that the MMP sequence
> does not change. To further make sure, we write a random sequence
> number into the MMP block and wait for another s_mmp_interval secs.
> If the sequence no. doesn't change then the mount will succeed. In
> case of failure, the nodename, bdevname and the time at which the MMP
> block was last updated will be displayed. tune2fs can be used to set
> s_mmp_interval as desired.

What would the default value of s_mmp_interval be? 5 seconds? more?

If I am not reading this wrong a mount will take more than
's_mmp_interval' seconds to complete. Wouldn't this be too much of a
penalty during boot up if the system has many 'mount at boot' filesystems?

Also, I am curious about this. Is there a test case for mounting the
same filesystem multiple times? Does this use different paths to reach
the device? Or is there a race? Or does it happen on a device shared by
multiple hosts?

-Manoj

2007-05-22 07:31:08

by Kalpak Shah

Subject: Re: [RFC][PATCH] Multiple mount protection

On Tue, 2007-05-22 at 12:45 +0530, Manoj Joseph wrote:
> Kalpak Shah wrote:
> > Hi,
> >
> > There have been reported instances of a filesystem having been
> > mounted at 2 places at the same time causing a lot of damage to the
> > filesystem. This patch reserves superblock fields and an INCOMPAT
> > flag for adding multiple mount protection(MMP) support within the
> > ext4 filesystem itself. The superblock will have a block number
> > (s_mmp_block) which will hold a MMP structure which has a sequence
> > number which will be periodically updated every 5 seconds by a
> > mounted filesystem. Whenever a filesystem will be mounted it will
> > wait for s_mmp_interval seconds to make sure that the MMP sequence
> > does not change. To further make sure, we write a random sequence
> > number into the MMP block and wait for another s_mmp_interval secs.
> > If the sequence no. doesn't change then the mount will succeed. In
> > case of failure, the nodename, bdevname and the time at which the MMP
> > block was last updated will be displayed. tune2fs can be used to set
> > s_mmp_interval as desired.
>
> What would the default value of s_mmp_interval be? 5 seconds? more?

I have set the default value to 6 seconds. Depending on specific
conditions (hardware, etc.) it can be increased using tune2fs.
>
> If I am not reading this wrong a mount will take more than
> 's_mmp_interval' seconds to complete. Wouldn't this be too much of a
> penalty during boot up if the system has many 'mount at boot' filesystems?

Yes, it may take up to s_mmp_interval*2 seconds to mount a
filesystem which has the INCOMPAT_MMP feature set. It's up to the user
to use this feature; if they find the penalty too large, they can do
away with it. This feature will mostly be used for filesystems used in
failover scenarios.
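
As a small worked example of that bound, the helpers below compute the worst-case extra delay for one mount and for several MMP-enabled filesystems mounted one after another. The function names are hypothetical and the serial-mount assumption is mine; filesystems mounted in parallel would not add up this way.

```c
/* Worst case for one mount: two waits of s_mmp_interval seconds each. */
static unsigned int mmp_worst_case_mount_delay(unsigned int interval)
{
    return 2 * interval;
}

/* Worst case for nfs MMP-enabled filesystems mounted strictly in sequence. */
static unsigned int mmp_worst_case_serial_delay(unsigned int interval,
                                                unsigned int nfs)
{
    return nfs * mmp_worst_case_mount_delay(interval);
}
```

With the default interval of 6 seconds this gives 12 seconds for a single filesystem and 60 seconds for five filesystems mounted serially.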

>
> Also, I am curious about this. Is there a test case for mounting the
> same filesystem multiple times? Does this use different paths to reach
> the device? Or is there a race? Or does it happen on a device shared by
> multiple hosts?
>

If you are using some HA software, there is the possibility of a race.
Yes it can happen on a device shared by multiple hosts.

A simple test case for this will be:
$ dd if=/dev/zero of=img0 bs=1M count=256
$ mke2fs -F -j img0
$ ln img0 img1
$ losetup /dev/loop0 img0
$ losetup /dev/loop1 img1
$ mount /dev/loop0 /mnt/loop0
$ mount /dev/loop1 /mnt/loop1

This currently succeeds, resulting in a multiple mount.

Thanks,
Kalpak.


> -Manoj
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2007-05-22 07:54:11

by Manoj Joseph

Subject: Re: [RFC][PATCH] Multiple mount protection

Kalpak Shah wrote:

>> Also, I am curious about this. Is there a test case for mounting the
>> same filesystem multiple times? Does this use different paths to reach
>> the device? Or is there a race? Or does it happen on a device shared by
>> multiple hosts?
>>
>
> If you are using some HA software, there is the possibility of a race.
> Yes it can happen on a device shared by multiple hosts.

Ah, if the HA-software doesn't deal with multiple mounts for filesystems
it is managing, then I would claim that the software is flawed. :)

But yes, turning on MMP would help.

It might also help to make the frequency at which the sequence number
gets updated (currently 5 sec) tunable. Would making that also a field in
the super block be a bad idea (set only by mke2fs/tune2fs)?

It might also be worthwhile to write the dev_t, the path of the device
and the hostname to the s_mmp_block, along with the random sequence. (I
assume there is enough space.) If the mount is being failed because of a
multiple mount scenario, these fields could be used to provide useful
diagnostics.

My $ 0.02. :)

Regards,
Manoj

2007-05-22 08:02:38

by Kalpak Shah

Subject: Re: [RFC][PATCH] Multiple mount protection

On Tue, 2007-05-22 at 13:23 +0530, Manoj Joseph wrote:
> Kalpak Shah wrote:
>
> >> Also, I am curious about this. Is there a test case for mounting the
> >> same filesystem multiple times? Does this use different paths to reach
> >> the device? Or is there a race? Or does it happen on a device shared by
> >> multiple hosts?
> >>
> >
> > If you are using some HA software, there is the possibility of a race.
> > Yes it can happen on a device shared by multiple hosts.
>
> Ah, if the HA-software doesn't deal with multiple mounts for filesystems
> it is managing, then I would claim that the software is flawed. :)

Well, it is known to happen so it wouldn't be bad to make sure.

>
> But yes, turning on MMP would help.
>
> It might also help to make the frequency at which sequence number gets
> updated (currently 5 sec) tunable. Would making that also a field in the
> super block be a bad idea (set only by mkfs/tunefs)?

Updating the MMP sequence too often would hurt the filesystem
performance.

>
> It might also be worthwhile to write the dev_t, the path of the device
> and the hostname to the s_mmp_block, along with the random sequence. (I
> assume there is enough space.) If the mount is being failed because of a
> multiple mount scenario, these fields could be used to provide useful
> diagnostics.

Yes, the dev_t, host name, the sequence and the time last updated would
all be printed. There is lots of space since we have an entire block.
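
To make the diagnostics idea concrete, here is a sketch of what such a block and the resulting message could look like. The layout is an assumption for illustration only: `struct mmp_info`, its field sizes and `mmp_format_conflict()` are hypothetical and are not taken from the patch.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical contents of the MMP block, based on the fields named in
 * this thread (sequence, last update time, node name, device name). */
struct mmp_info {
    uint32_t seq;            /* current MMP sequence number */
    uint64_t time;           /* time of last update, seconds since epoch */
    char     nodename[64];   /* host that last updated the block */
    char     bdevname[32];   /* block device name as seen by that host */
};

/* Even the smallest ext2/3/4 block size (1 KiB) leaves plenty of room. */
_Static_assert(sizeof(struct mmp_info) <= 1024,
               "MMP info must fit in a single filesystem block");

/* Format the message shown when a mount is refused. Returns the value of
 * snprintf(), i.e. the length the full message needs. */
static int mmp_format_conflict(const struct mmp_info *m, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "MMP: device %s in use on node %s (seq %u, last update %llu)",
                    m->bdevname, m->nodename,
                    (unsigned int)m->seq, (unsigned long long)m->time);
}
```

The message format is arbitrary; the point is only that everything needed to identify the other mounter fits comfortably in the block.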

> My $ 0.02. :)

Thanks. :)

>
> Regards,
> Manoj

2007-05-24 23:25:24

by Karel Zak

Subject: Re: [RFC][PATCH] Multiple mount protection

On Tue, May 22, 2007 at 01:04:42PM +0530, Kalpak Shah wrote:
> On Tue, 2007-05-22 at 12:45 +0530, Manoj Joseph wrote:
> > Kalpak Shah wrote:
> > > Hi,
> > >
> > > There have been reported instances of a filesystem having been
> > > mounted at 2 places at the same time causing a lot of damage to the
> > > filesystem. This patch reserves superblock fields and an INCOMPAT
> > > flag for adding multiple mount protection(MMP) support within the
> > > ext4 filesystem itself. The superblock will have a block number
> > > (s_mmp_block) which will hold a MMP structure which has a sequence
> > > number which will be periodically updated every 5 seconds by a
> > > mounted filesystem. Whenever a filesystem will be mounted it will
> > > wait for s_mmp_interval seconds to make sure that the MMP sequence
> > > does not change. To further make sure, we write a random sequence
> > > number into the MMP block and wait for another s_mmp_interval secs.
> > > If the sequence no. doesn't change then the mount will succeed. In
> > > case of failure, the nodename, bdevname and the time at which the MMP
> > > block was last updated will be displayed. tune2fs can be used to set
> > > s_mmp_interval as desired.

Frankly, I don't understand why we need this feature. The filesystem
limitations (=not ready for clusters) should be described in docs.
That's enough from my POV...

> >
> > What would the default value of s_mmp_interval be? 5 seconds? more?
>
> I have set the default value to 6 seconds. Depending on specific
> conditions (hardware, etc.) it can be increased using tunefs.
> >
> > If I am not reading this wrong a mount will take more than
> > 's_mmp_interval' seconds to complete. Wouldn't this be too much of a
> > penalty during boot up if the system has many 'mount at boot' filesystems?
>
> Yes it may take a maximum of s_mmp_interval*2 seconds to mount a
> filesystem which has INCOMPAT_MMP feature set. Its up to the user to use
> this feature, if he finds the penalty is too large, he can do away with
> this feature. This feature will mostly be used for filesystems used in
> failover scenarios.

I hope the feature will be disabled by default. It sounds strange
that I have to wait 6 secs to mount a FS if I (and 99% of Linux users)
don't need to share the same FS between two mountpoints.

I have 5 filesystems on my workstation = 30 secs penalty during boot?!

> > Also, I am curious about this. Is there a test case for mounting the
> > same filesystem multiple times? Does this use different paths to reach
> > the device? Or is there a race? Or does it happen on a device shared by
> > multiple hosts?
>
> If you are using some HA software, there is the possibility of a race.
> Yes it can happen on a device shared by multiple hosts.

That's the reason why people use OCFS or GFS.

> A simple test case for this will be:
> $ dd if=/dev/zero of=img0 bs=1M count=256
> $ mke2fs -F -j img0
> $ ln img0 img1
> $ losetup /dev/loop0 img0
> $ losetup /dev/loop1 img1
> $ mount /dev/loop0 /mnt/loop0
> $ mount /dev/loop1 /mnt/loop1
>
> This succeeds currently causing a multiple mount.

And what? That's wrong FS usage.

Karel

--
Karel Zak <[email protected]>

2007-05-25 06:40:39

by Kalpak Shah

Subject: Re: [RFC][PATCH] Multiple mount protection

On Fri, 2007-05-25 at 01:25 +0200, Karel Zak wrote:
> Frankly, I don't understand why we need this feature. The filesystem
> limitations (=not ready for clusters) should be described in docs.
> That's enough from my POV...

It is strongly recommended that an ext3/4 filesystem not be mounted in
more than one place. This feature just makes doubly sure of that, and
only if the user wants it.

> > >
> > > What would the default value of s_mmp_interval be? 5 seconds? more?
> >
> > I have set the default value to 6 seconds. Depending on specific
> > conditions (hardware, etc.) it can be increased using tunefs.
> > >
> > > If I am not reading this wrong a mount will take more than
> > > 's_mmp_interval' seconds to complete. Wouldn't this be too much of a
> > > penalty during boot up if the system has many 'mount at boot' filesystems?
> >
> > Yes it may take a maximum of s_mmp_interval*2 seconds to mount a
> > filesystem which has INCOMPAT_MMP feature set. Its up to the user to use
> > this feature, if he finds the penalty is too large, he can do away with
> > this feature. This feature will mostly be used for filesystems used in
> > failover scenarios.
>
> I hope the feature will be disabled by default. It sounds strange
> that I have to way 6 secs to mount a FS if I (and 99% of Linux users)
> needn't to share same FS between two mountpoint.
>
> I have 5 filesystems on my workstation = 30 secs penality during boot?!

This feature won't be enabled by default. It is entirely at the user's
discretion whether to enable it. It can be set by tune2fs
and can be disabled without unmounting the filesystem. So you won't have
to waste time during mounting unless you choose to.

>
> > > Also, I am curious about this. Is there a test case for mounting the
> > > same filesystem multiple times? Does this use different paths to reach
> > > the device? Or is there a race? Or does it happen on a device shared by
> > > multiple hosts?
> >
> > If you are using some HA software, there is the possibility of a race.
> > Yes it can happen on a device shared by multiple hosts.
>
> That's reason why people use OCFS or GFS.

OCFS and GFS are clustered file systems and hence provide read-write
support at multiple mount points.

Note that the MMP feature will also ensure that you can't run e2fsck on
a mounted filesystem. In short, the filesystem cannot be opened in
read-write mode by more than one entity.

>
> > A simple test case for this will be:
> > $ dd if=/dev/zero of=img0 bs=1M count=256
> > $ mke2fs -F -j img0
> > $ ln img0 img1
> > $ losetup /dev/loop0 img0
> > $ losetup /dev/loop1 img1
> > $ mount /dev/loop0 /mnt/loop0
> > $ mount /dev/loop1 /mnt/loop1
> >
> > This succeeds currently causing a multiple mount.
>
> And what? That's wrong FS usage.

Here I had just described a test case for reproducing multiple mounts.

Thanks,
Kalpak.

>
> Karel
>

2007-05-25 14:40:05

by Theodore Ts'o

Subject: Re: [RFC][PATCH] Multiple mount protection

Hi Kalpak,

On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
> It will also protect against running e2fsck on a mounted filesystem
> by adding similar logic to ext2fs_open().

Your patch didn't add this logic to ext2fs_open(); it just reserved
the space in the superblock.

I don't mind reserving the space so we don't have to worry about
conflicting superblock uses, but I'm still on the fence about actually
adding this functionality (a) into e2fsprogs, and (b) into the ext4
kernel code. I guess it depends on how complicated/icky the
implementation code is. The question as before is whether
the complexity is worth it, given that someone who is actually going
to be subject to accidentally mounting an ext3/4 filesystem on
multiple systems needs to be using an HA system anyway. So basically
this is just to protect against (a) a bug/failure in the HA subsystem,
and (b) the idiotic user that failed to realize he/she needed to set
up an HA subsystem in the first place. Granted, the universe is going
to create idiots at a faster rate than we can deal with them, but that's
why I'm still not 100% convinced the complexity is worth it.

To be fair, if I was on a L3 support team having to deal with these
idiots, I'd probably feel differently. :-)

- Ted

2007-05-25 19:35:26

by Jim Garlick

Subject: Re: [RFC][PATCH] Multiple mount protection

Hi Ted,

For what it's worth, we have several petabytes of data residing in
ext3 file systems, a large staff of mainly non-idiots, and HA s/w,
and I still feel strongly that multi-mount protection is a good idea.
People, software, and hardware all malfunction in myriad ways, and the
more you have, the greater the odds (or so it seems to us). This
relatively simple safeguard at the fs level has high value IMHO.

Regards,

Jim

On Fri, 25 May 2007, Theodore Tso wrote:

> Hi Kalpak,
>
> On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
>> It will also protect against running e2fsck on a mounted filesystem
>> by adding similar logic to ext2fs_open().
>
> Your patch didn't add this logic to ext2fs_open(); it just reserved
> the space in the superblock.
>
> I don't mind reserving the space so we don't have to worry about
> conflicting superblock uses, but I'm still on the fence about actually
> adding this functionality (a) into e2fsprogs, and (b) into the ext4
> kernel code. I guess it depends on how complicated/icky the
> implementation code is, I guess. The question as before is whether
> the complexity is worth it, given that someone who is actually going
> to be subject to accidentally mounting an ext3/4 filesystem on
> multiple systems needs to be using an HA system anyway. So basically
> this is just to protect against (a) a bug/failure in the HA subsystem,
> and (b) the idiotic user that failed to realized he/she needed to set
> up an HA subsystem in the first place. Granted, the universe is going
> to create idiots at a faster rate that we can deal with it, but that's
> why I'm still not 100% convinced the complexity is worth it.
>
> To be fair, if I was on a L3 support team having to deal with these
> idiots, I'd probably feel differently. :-)
>
> - Ted

2007-05-25 21:33:04

by Kalpak Shah

Subject: Re: [RFC][PATCH] Multiple mount protection

Hi Ted,

On Fri, 2007-05-25 at 10:39 -0400, Theodore Tso wrote:
> Hi Kalpak,
>
> On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
> > It will also protect against running e2fsck on a mounted filesystem
> > by adding similar logic to ext2fs_open().
>
> Your patch didn't add this logic to ext2fs_open(); it just reserved
> the space in the superblock.

Yeah, the earlier patch was just for reserving the fields.

>
> I don't mind reserving the space so we don't have to worry about
> conflicting superblock uses, but I'm still on the fence about actually
> adding this functionality (a) into e2fsprogs, and (b) into the ext4
> kernel code. I guess it depends on how complicated/icky the
> implementation code is, I guess.

I am attaching the kernel and e2fsprogs patches so that you can point out
any shortcomings in the implementation. These patches are still a WIP.

> The question as before is whether
> the complexity is worth it, given that someone who is actually going
> to be subject to accidentally mounting an ext3/4 filesystem on
> multiple systems needs to be using an HA system anyway. So basically
> this is just to protect against (a) a bug/failure in the HA subsystem,
> and (b) the idiotic user that failed to realized he/she needed to set
> up an HA subsystem in the first place. Granted, the universe is going
> to create idiots at a faster rate that we can deal with it, but that's
> why I'm still not 100% convinced the complexity is worth it.

Given the amount of damage that multiple mounts can cause to the
filesystem, it would be desirable to make doubly sure. Also the MMP
feature is quite uncomplicated and absolutely tunable.

Thanks for your views.

- Kalpak.
>
> To be fair, if I was on a L3 support team having to deal with these
> idiots, I'd probably feel differently. :-)
>
> - Ted


Attachments:
mmp.patch (9.86 kB)
e2fsprogs-mmp.patch (23.49 kB)

2007-05-30 20:54:59

by Kalpak Shah

Subject: Re: [RFC][PATCH] Multiple mount protection

On Sat, 2007-05-26 at 03:06 +0530, Kalpak Shah wrote:
> Hi Ted,
>
> On Fri, 2007-05-25 at 10:39 -0400, Theodore Tso wrote:
> > Hi Kalpak,
> >
> > On Tue, May 22, 2007 at 01:22:32AM +0530, Kalpak Shah wrote:
> > > It will also protect against running e2fsck on a mounted filesystem
> > > by adding similar logic to ext2fs_open().
> >
> > Your patch didn't add this logic to ext2fs_open(); it just reserved
> > the space in the superblock.
>
> Yeah the earlier patch for just reserving the fields.
>
> >
> > I don't mind reserving the space so we don't have to worry about
> > conflicting superblock uses, but I'm still on the fence about actually
> > adding this functionality (a) into e2fsprogs, and (b) into the ext4
> > kernel code. I guess it depends on how complicated/icky the
> > implementation code is, I guess.
>

Hi Ted,

So can I assume that the INCOMPAT_MMP flag and the s_mmp_interval and
s_mmp_block superblock fields will be reserved regardless of whether the
patches go into ext4? I had attached the patches in the last mail so you
can share your views on them.

Thanks,
Kalpak.

2007-05-31 16:16:29

by Theodore Ts'o

Subject: Re: [RFC][PATCH] Multiple mount protection

On Thu, May 31, 2007 at 02:28:33AM +0530, Kalpak Shah wrote:
>
> So can I assume that the INCOMPAT_MMP flag and the s_mmp_interval and
> s_mmp_block superblock fields will be reserved regardless of whether the
> patches go into ext4? I had attached the patches in the last mail so you
> can share your views on them.

Yes, I've reserved the code point and superblock fields. I'm not
going to add the INCOMPAT_MMP flag to the supported set until I get and
integrate the patch to ext2fs_open() that actually tests for the flag,
though, since that would be a bit silly.
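
The reason this matters is the way INCOMPAT flags are meant to work: code that does not list a flag as supported must refuse to touch the filesystem. A minimal sketch of that mask test follows. The MMP value is taken from the patch earlier in this thread, and 0x0002 is the standard value of EXT2_FEATURE_INCOMPAT_FILETYPE; the helper name `incompat_features_ok()` is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXT2_FEATURE_INCOMPAT_FILETYPE  0x0002
#define EXT4_FEATURE_INCOMPAT_MMP       0x0100   /* from the patch above */

/* A filesystem may be opened only if every INCOMPAT bit it sets is also
 * present in the mask of features the opening code claims to support. */
static bool incompat_features_ok(uint32_t fs_incompat, uint32_t supported)
{
    return (fs_incompat & ~supported) == 0;
}
```

Listing the MMP bit as supported before the open path actually performs the check would make this test pass while providing no protection, which is the situation being avoided here.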

I assume the patch will add a flag to ext2fs_open which skips the MMP
checking. After all, tune2fs is allowed to make changes to the
superblock while the filesystem is mounted. So it needs to be able to
open the filesystem read/only even if it is mounted.

Regards,

- Ted

2007-05-31 21:05:57

by Kalpak Shah

Subject: Re: [RFC][PATCH] Multiple mount protection

On Thu, 2007-05-31 at 12:16 -0400, Theodore Tso wrote:
> On Thu, May 31, 2007 at 02:28:33AM +0530, Kalpak Shah wrote:
> >
> > So can I assume that the INCOMPAT_MMP flag and the s_mmp_interval and
> > s_mmp_block superblock fields will be reserved regardless of whether the
> > patches go into ext4? I had attached the patches in the last mail so you
> > can share your views on them.
>
> Yes, i've reserved the code point and superblock fields.

Thanks.

> I'm not going to add INCOMPAT_MMP flag to the supported file until I get and
> integrate the patch ext2fs_open() that actually tests for the flag,
> though, since that would be a bit silly.
>
> I assume the patch will add a flag to ext2fs_open which skips the MMP
> checking.

Yes, I have added an EXT2_FLAG_SKIP_MMP flag to ext2fs_open() to bypass
MMP, which will be set if tune2fs is used with the -f option. Also, the
MMP check will not be run if the filesystem is being opened read-only.
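
The resulting decision can be summarised in a few lines of C. This is a sketch of the rules as stated in this mail, not the code from the attached patch: `mmp_check_needed()` is a hypothetical helper, and the numeric values given to the two open flags are placeholders chosen for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXT4_FEATURE_INCOMPAT_MMP  0x0100     /* from the patch in this thread */
#define EXT2_FLAG_RW               0x01       /* placeholder value */
#define EXT2_FLAG_SKIP_MMP         0x100000   /* placeholder value */

/* Decide whether an open should run the MMP check. */
static bool mmp_check_needed(uint32_t feature_incompat, int open_flags)
{
    if (!(feature_incompat & EXT4_FEATURE_INCOMPAT_MMP))
        return false;               /* feature not enabled on this fs */
    if (!(open_flags & EXT2_FLAG_RW))
        return false;               /* read-only opens skip the check */
    if (open_flags & EXT2_FLAG_SKIP_MMP)
        return false;               /* caller explicitly bypassed MMP */
    return true;
}
```

Only a read-write open of an MMP-enabled filesystem without the bypass flag ends up waiting for the check.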

Thanks,
Kalpak.

> After all, tune2fs is allowed to make changes to the
> superblock while the filesystem is mounted. So it needs to be able to
> open the filesystem read/only even if it is mounted.
>
> Regards,
>
> - Ted

2007-06-01 07:49:51

by Andi Kleen

Subject: Re: [RFC][PATCH] Multiple mount protection

Kalpak Shah <[email protected]> writes:

> Hi,
>
> There have been reported instances of a filesystem having been mounted at 2 places at the same time causing a lot of damage to the filesystem. This patch reserves superblock fields and an INCOMPAT flag for adding multiple mount protection(MMP) support within the ext4 filesystem itself. The superblock will have a block number (s_mmp_block) which will hold a MMP structure which has a sequence number which will be periodically updated every 5 seconds by a mounted filesystem. Whenever a filesystem will be mounted it will wait for s_mmp_interval seconds to make sure that the MMP sequence does not change. To further make sure, we write a random sequence number into the MMP block and wait for another s_mmp_interval secs. If the sequence no. doesn't change then the mount will succeed. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated will be displayed. tune2fs can be used to set s_mmp_interval as desired.

That will make laptop users very unhappy if you spin up their disks every 5 seconds. And
even on other systems it might reduce the MTBF if you write the super block much more
often than before. It might be better to set it up in some way to only increase
that number when the super block is written for some other reason anyways.

-Andi

2007-06-01 08:24:04

by Kalpak Shah

Subject: Re: [RFC][PATCH] Multiple mount protection

On Fri, 2007-06-01 at 10:46 +0200, Andi Kleen wrote:
> Kalpak Shah <[email protected]> writes:
>
> > Hi,
> >
> > There have been reported instances of a filesystem having been mounted at 2 places at the same time causing a lot of damage to the filesystem. This patch reserves superblock fields and an INCOMPAT flag for adding multiple mount protection(MMP) support within the ext4 filesystem itself. The superblock will have a block number (s_mmp_block) which will hold a MMP structure which has a sequence number which will be periodically updated every 5 seconds by a mounted filesystem. Whenever a filesystem will be mounted it will wait for s_mmp_interval seconds to make sure that the MMP sequence does not change. To further make sure, we write a random sequence number into the MMP block and wait for another s_mmp_interval secs. If the sequence no. doesn't change then the mount will succeed. In case of failure, the nodename, bdevname and the time at which the MMP block was last updated will be displayed. tune2fs can be used to set s_mmp_interval as desired.
>
> That will make laptop users very unhappy if you spin up their disks every 5 seconds. And
> even on other systems it might reduce the MTBF if you write the super block much more
> often than before. It might be better to set it up in some way to only increase
> that number when the super block is written for some other reason anyways.

The super block only saves the block number of the MMP block. So the
super block is not updated but the contents of the MMP block are updated
every 5 seconds.

If a user is unhappy with the 5-second interval, they can increase it,
with the caveat that mounting the filesystem will then take up to
2*mmp_interval seconds.

Thanks,
Kalpak.

>
> -Andi

2007-06-01 09:14:37

by Andreas Dilger

Subject: Re: [RFC][PATCH] Multiple mount protection

On Jun 01, 2007 10:46 +0200, Andi Kleen wrote:
> Kalpak Shah <[email protected]> writes:
> > There have been reported instances of a filesystem having been
> mounted at 2 places at the same time causing a lot of damage to the
> filesystem.... The superblock will have a block number (s_mmp_block)
> which will hold a MMP structure which has a sequence number which will be
> periodically updated every 5 seconds by a mounted filesystem.
>
> That will make laptop users very unhappy if you spin up their disks every
> 5 seconds. And even on other systems it might reduce the MTBF if you
> write the super block much more often than before. It might be better to
> set it up in some way to only increase that number when the super block is
> written for some other reason anyways.

It was mentioned before but deserves mentioning again that this will
be an optional feature, mostly for use on SANs, iSCSI, etc where a disk
might be accessed by multiple nodes at the same time. That means there
will not be any impact for desktop users waiting 10s for each of their
filesystems to mount.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2007-06-01 10:56:47

by Andi Kleen

Subject: Re: [RFC][PATCH] Multiple mount protection

> It was mentioned before but deserves mentioning again that this will
> be an optional feature, mostly for use on SANs, iSCSI, etc where a disk
> might be accessed by multiple nodes at the same time. That means there
> will not be any impact for desktop users waiting 10s for each of their
> filesystems to mount.

A safety feature that is optional? Doesn't sound very safe to me.
If the safety is needed it should probably be the default; otherwise
it isn't needed.

-Andi

2007-06-01 11:41:06

by Theodore Ts'o

Subject: Re: [RFC][PATCH] Multiple mount protection

On Fri, Jun 01, 2007 at 10:46:19AM +0200, Andi Kleen wrote:
>
> That will make laptop users very unhappy if you spin up their disks
> every 5 seconds. And even on other systems it might reduce the MTBF
> if you write the super block much more often than before. It might
> be better to set it up in some way to only increase that number when
> the super block is written for some other reason anyways.

You would never want to use this feature on a laptop; it would buy no
benefit for its costs, since with (all common) laptops, their hard
drives can't be shared with other machines in a cluster.

Unfortunately, it's not possible to do what you suggest, since one of
the whole points of increasing the sequence number every 5 seconds is
to act as a keep-alive, so another machine trying to access the shared
hard drive can tell whether or not the machine which currently had the
hard drive mounted is still alive or not.
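
The keep-alive behaviour can be modelled with a simulated clock. The sketch below is an illustration only: `struct mmp_writer`, `mmp_writer_tick()` and `mmp_peer_alive()` are hypothetical names, time advances in whole seconds, and no real I/O or sleeping takes place.

```c
#include <stdbool.h>
#include <stdint.h>

/* State of the node that has the filesystem mounted. */
struct mmp_writer {
    uint32_t     seq;          /* sequence number stored in the MMP block */
    unsigned int last_update;  /* simulated time of the last bump, seconds */
};

/* Called once per simulated second on the mounting node: bump the
 * sequence whenever update_interval seconds have passed. */
static void mmp_writer_tick(struct mmp_writer *w, unsigned int now,
                            unsigned int update_interval)
{
    if (now - w->last_update >= update_interval) {
        w->seq++;
        w->last_update = now;
    }
}

/* Observer on a second node: watch the block for `window` seconds and
 * report whether the sequence changed, i.e. whether the peer looks alive. */
static bool mmp_peer_alive(struct mmp_writer *w, bool writer_running,
                           unsigned int start, unsigned int window,
                           unsigned int update_interval)
{
    uint32_t seen = w->seq;

    for (unsigned int t = start + 1; t <= start + window; t++) {
        if (writer_running)
            mmp_writer_tick(w, t, update_interval);
    }
    return w->seq != seen;
}
```

In this model an observation window shorter than the update period can miss a live peer entirely, which is consistent with the default wait of 6 seconds being longer than the 5-second update period mentioned earlier in the thread.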

This is why I and others have been a little worried about implementing
this feature, since it adds complexity which has to be in a proper HA
system anyway, and what is there isn't really an optimal HA solution
(since it lacks STONITH) and so you have to implement the
functionality again _anyway_ using a proper HA solution.

The argument on the other side is that it protects against failed HA
solutions, and against users who are too stupid to know that they need
an HA solution. It does do the first; the second would only apply if
the users who were too stupid to realize they needed an HA solution
were smart enough to enable the MMP feature --- and because of its
many costs, including keeping the disk spun up on laptops, and
delaying the time required to mount the disk by 10 seconds, I don't
think it will ever be enabled by default. Hence, I don't really think
it helps the idiotic user problem.

But apparently a belt-and-suspenders approach to HA is comforting to
some users, and so I don't mind reserving the space. The code to
implement it still seems like more complexity than what should be in
the kernel. My suggestion would be to put it in a separate file, and
make it be something which has to be explicitly configured to enable
it, possibly as a module (but that may add too much extra hair). I
really don't think the save-the-stupid-user argument holds water, but
the belt-and-suspenders argument, IFF you are using a shared-disk setup,
is a valid one, although that is probably not a common setup.

Regards,

- Ted

2007-06-01 12:13:43

by Andi Kleen

Subject: Re: [RFC][PATCH] Multiple mount protection

> Unfortunately, it's not possible to do what you suggest, since one of
> the whole points of increasing the sequence number every 5 seconds is
> to act as a keep-alive, so another machine trying to access the shared

Clusters usually have other ways to do this, don't they?
Typically they have STONITH too. It's probably too simple-minded
to just replace a real cluster setup which also handles split
brain and other conditions. So it's purely against mistakes.

Besides, relying on it would seem dangerous because it is not synchronous
and you could do a lot of damage in 5 seconds.

The rationale for the lazy writing would be that, at least in usual
operation, the super block should be written regularly anyway; and since
the 5-second delay is only a not fully reliable heuristic, it probably
doesn't matter if the possible delay is a little longer.

-Andi

2007-06-01 13:52:45

by Theodore Ts'o

Subject: Re: [RFC][PATCH] Multiple mount protection

On Fri, Jun 01, 2007 at 02:13:39PM +0200, Andi Kleen wrote:
> > Unfortunately, it's not possible to do what you suggest, since one of
> > the whole points of increasing the sequence number every 5 seconds is
> > to act as a keep-alive, so another machine trying to access the shared
>
> Clusters usually have other ways to do this, don't they?
> Typically they have STONITH too. It's probably too simple-minded
> to just replace a real cluster setup which also handles split
> brain and other conditions. So it's purely against mistakes.

Yes, its only real value is to protect against Cluster-HA
malfunctions or misconfiguration.

> Besides, relying on it would seem dangerous because it is not synchronous
> and you could do a lot of damage in 5 seconds.

Well, the MMP feature is assigned an incompatible feature bit, so a
kernel which doesn't know about MMP will refuse to touch it; and a
kernel which does follow the MMP protocol will check the MMP block
(delaying the mount by 10 seconds) to make sure no other system is
using the block.
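
A minimal sketch of why the incompatible feature bit keeps MMP-unaware kernels away, assuming the usual ext2/3/4 rule that a mount is refused when the superblock advertises any incompat bit the implementation does not know. The macro names and bit values here are illustrative assumptions, not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values; treat them as assumptions, not the patch's. */
#define FEATURE_INCOMPAT_FILETYPE 0x0002u
#define FEATURE_INCOMPAT_MMP      0x0100u

/*
 * A filesystem may be mounted only if every incompat bit set in its
 * superblock is also in the set this implementation supports.
 */
bool can_mount(uint32_t fs_incompat, uint32_t supported_incompat)
{
    return (fs_incompat & ~supported_incompat) == 0;
}
```

An implementation whose supported mask lacks the MMP bit gets false for any MMP-enabled filesystem, so only MMP-aware code ever reaches the sequence-number check.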

So aside from being @#$%@# annoying (which is why it will never be the
default), it does work, modulo the problem that without STONITH or any
kind of I/O fencing, we do risk the other system coming back to life
and then modifying the filesystem in parallel. So as everyone has
said, this is not a solution that works in isolation, but is really only
a backup.

The question of whether the complexity and the 10-second mount delay
are worth it for what is only a backup solution is obviously going to
be a very subjective one --- and as I've said previously, I'm on the
fence on this.

- Ted

2007-06-01 18:00:10

by Andreas Dilger

Subject: Re: [RFC][PATCH] Multiple mount protection

On Jun 01, 2007 09:52 -0400, Theodore Tso wrote:
> On Fri, Jun 01, 2007 at 02:13:39PM +0200, Andi Kleen wrote:
> > Clusters usually have other ways to do this, don't they?
> > Typically they have STONITH too. It's probably too simple-minded
> > to just replace a real cluster setup which also handles split
> > brain and other conditions. So it's purely against mistakes.
>
> Yes, its only real value is to protect against Cluster-HA
> malfunctions or misconfiguration.

While I agree that HA systems _should_ be enough for this, in our
experience even with an HA system some people get it wrong (e.g.
manually mounting and bypassing HA, HA itself is broken, comms failure,
STONITH failure, etc).

I agree it is not intended to be a replacement for an HA/STONITH
solution, just belt & suspenders that would have saved hundreds of
TB of user data in several cases if it were available. We will
enable it by default on all of our filesystems, and of course I'd
advise anyone in a SAN environment (whether they _intend_ to have
shared disk access or not) to enable it also.

> > Besides, relying on it would seem dangerous because it is not synchronous
> > and you could do a lot of damage in 5 seconds.
>
> Well, the MMP feature is assigned an incompatible feature bit, so a
> kernel which doesn't know about MMP will refuse to touch it; and a
> kernel which does follow the MMP protocol will check the MMP block
> (delaying the mount by 10 seconds) to make sure no other system is
> using the block.

Correct. There is a "fast path" where it will wait a shorter time
during mount if the fs is reported cleanly unmounted. We can't skip
the check entirely, because 2 systems might be mounting at the same
time.
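
To make the fast path concrete, here is a small sketch of how the mount-time wait might be chosen. The full wait of two intervals follows from the two s_mmp_interval waits in the patch description; the shortened wait of half an interval is purely a hypothetical value, since the thread does not say how much shorter it is. The point is only that the wait never drops to zero.

```c
#include <stdbool.h>

/*
 * Returns how long (in seconds) a mount should wait on the MMP block.
 * interval_secs corresponds to s_mmp_interval; 0 is clamped to 1.
 */
unsigned int mmp_mount_wait_secs(unsigned int interval_secs,
                                 bool cleanly_unmounted)
{
    unsigned int full_wait, fast_wait;

    if (interval_secs == 0)
        interval_secs = 1;

    full_wait = 2 * interval_secs;  /* two intervals, e.g. 2 * 5s = 10s */
    fast_wait = interval_secs / 2;  /* hypothetical shortened wait */
    if (fast_wait == 0)
        fast_wait = 1;              /* never skip the check entirely */

    return cleanly_unmounted ? fast_wait : full_wait;
}
```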

> So aside from being @#$%@# annoying (which is why it will never be the
> default), it does work, modulo the problem that without STONITH or any
> kind of I/O fencing, we do risk the other system coming back to life
> and then modifying the filesystem in parallel. So as everyone has
> said, this is not a solution that works in isolation, but is really only
> a backup.

If kmmpd is not scheduled for more than 10s then it will re-read the
block to ensure that the local system is still the one in control. If
not, it will ext3_error() and (in our case at least) this will make the
client fs read-only. Even if there is some IO leakage from the local
client, this is far better than continuing to run with 2 systems writing
to the same disk.
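
Roughly, the stall check described above could be sketched as follows. This is not the kmmpd code from the patch: the struct, the function name and the read_only flag (standing in for ext3_error() plus the remount read-only) are assumptions made for illustration, and the on-disk block is simulated by a plain integer.

```c
#include <stdbool.h>
#include <stdint.h>

#define MMP_STALL_LIMIT_SECS 10     /* the "more than 10s" threshold */

/* Hypothetical per-mount state kept by the update thread. */
struct mmp_state {
    uint32_t my_seq;        /* last sequence number this node wrote */
    uint64_t last_update;   /* time of that write, in seconds */
    bool     read_only;     /* set once control has been lost */
};

/*
 * One pass of the update thread. *on_disk_seq simulates the MMP block.
 * Returns true if this node is still in control and wrote a new value.
 */
bool kmmpd_tick(struct mmp_state *st, uint32_t *on_disk_seq, uint64_t now)
{
    if (st->read_only)
        return false;

    /* After a long stall, re-read the block before touching it. */
    if (now - st->last_update > MMP_STALL_LIMIT_SECS &&
        *on_disk_seq != st->my_seq) {
        st->read_only = true;   /* another node took over: stop writing */
        return false;
    }

    st->my_seq++;
    *on_disk_seq = st->my_seq;
    st->last_update = now;
    return true;
}
```

A stall by itself is harmless in this sketch; the node only gives up when the stall coincides with a sequence number it did not write.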

Ideally there would also be block-layer functionality to fence the IO
on the local system (e.g. plug the elevator output; I don't think
anything could be done about IO already submitted to the device), but
the function I thought did this (set_device_rdonly()) is only checked
at mount time and is useless for this purpose.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.