2006-11-08 19:22:18

by Suzuki

[permalink] [raw]
Subject: Problem with multiple mounts

Hi,

I am working on a problem as mentioned below :

Problem description:

"I exported a disk partition using nbd protocol. On the nbd client, I
make reiserfs and run fsstress test case on this partition. At the same
time, I mount this partition on the nbd server. Then Oops appears as
following:"

ReiserFS: sda10: found reiserfs format "3.6" with standard journal
ReiserFS: sda10: using ordered data mode
ReiserFS: sda10: journal params: device sda10, size 8192, journal first
block 18, max trans len 1024, max batch 900, max commit age 30, max
trans age 30
ReiserFS: sda10: checking transaction log (sda10)

Oops: Kernel access of bad area, sig: 11 [#1]

Call Trace:
[C000000011333090] [C0000000001EDB70] .journal_read+0x165c/0x1b6c
(unreliable)
[C000000011333410] [C0000000001EF280] .journal_init+0xdc0/0xee8
[C000000011333530] [C0000000001CDBD8] .reiserfs_fill_super+0xa90/0x1e40
[C000000011333790] [C00000000011E988] .get_sb_bdev+0x208/0x31c
[C000000011333870] [C0000000001CA00C] .get_super_block+0x38/0x60
[C000000011333900] [C00000000011E260] .vfs_kern_mount+0xec/0x198
[C0000000113339B0] [C00000000011E3E0] .do_kern_mount+0x88/0xdc
[C000000011333A50] [C0000000001532CC] .do_mount+0xd50/0xe08
[C000000011333D60] [C000000000175090] .compat_sys_mount+0x368/0x448
[C000000011333E30] [C00000000000861C] syscall_exit+0x0/0x40

But, if we try the steps in the reverse order,

"mount the partition on nbd server first and then try fsstress tests on
the client side. This is just to ensure that the server is not seeing an
incomplete journal created by the client side runs."

Things work fine !

I doubt if this is due to the mount finding an incomplete journal
created by the client side fsstress runs in the first scenario.

My question is : Is this supported ? Mounting a filesystem which is
already mounted and replaying the ( - a may be incomplete- ) journal.


Thanks.

Suzuki


2006-11-08 20:33:56

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Problem with multiple mounts

On Wed, Nov 08, 2006 at 11:22:15AM -0800, Suzuki wrote:
> "I exported a disk partition using nbd protocol. On the nbd client, I
> make reiserfs and run fsstress test case on this partition. At the same
> time, I mount this partition on the nbd server. Then Oops appears as
> following:"
>
> ReiserFS: sda10: found reiserfs format "3.6" with standard journal
> ReiserFS: sda10: using ordered data mode
> ReiserFS: sda10: journal params: device sda10, size 8192, journal first
> block 18, max trans len 1024, max batch 900, max commit age 30, max
> trans age 30
> ReiserFS: sda10: checking transaction log (sda10)
>
> Oops: Kernel access of bad area, sig: 11 [#1]
>
> Call Trace:
> [C000000011333090] [C0000000001EDB70] .journal_read+0x165c/0x1b6c
> (unreliable)
> [C000000011333410] [C0000000001EF280] .journal_init+0xdc0/0xee8
> [C000000011333530] [C0000000001CDBD8] .reiserfs_fill_super+0xa90/0x1e40
> [C000000011333790] [C00000000011E988] .get_sb_bdev+0x208/0x31c
> [C000000011333870] [C0000000001CA00C] .get_super_block+0x38/0x60
> [C000000011333900] [C00000000011E260] .vfs_kern_mount+0xec/0x198
> [C0000000113339B0] [C00000000011E3E0] .do_kern_mount+0x88/0xdc
> [C000000011333A50] [C0000000001532CC] .do_mount+0xd50/0xe08
> [C000000011333D60] [C000000000175090] .compat_sys_mount+0x368/0x448
> [C000000011333E30] [C00000000000861C] syscall_exit+0x0/0x40
>
> But, if we try the steps in the reverse order,
>
> "mount the partition on nbd server first and then try fsstress tests on
> the client side. This is just to ensure that the server is not seeing an
> incomplete journal created by the client side runs."
>
> Things work fine !
>
> I doubt if this is due to the mount finding an incomplete journal
> created by the client side fsstress runs in the first scenario.
>
> My question is : Is this supported ? Mounting a filesystem which is
> already mounted and replaying the ( - a may be incomplete- ) journal.

Absolutely not supported. Unless you have a filesystem that is
specifically designed for simultanious read-write mount from multiple
places, then you can't. For performance reasons most systems cache
writes and updates in many cases, so the data read by one system may be
out of date because another system has an update waiting to go to disk.
You need a filesystem that has the ability for multiple systems to talk
to each other about updates and locking and such things. Look for a
cluster supporting filesystem or whatever is used to refer to a
filesystem that supports multiple hosts having it mounted to provide
redundant access. No normal filesystem can do it unless everyone has it
mounted read only. If you want to share it, use NFS. That's what it's
for.

--
Len Sorensen

2006-11-08 22:13:46

by David Masover

[permalink] [raw]
Subject: Re: Problem with multiple mounts

Suzuki wrote:

> "I exported a disk partition using nbd protocol.

Why?

I don't mean it's completely absurd, I'd just like to know what the
reason is for not using NFS. Perhaps you should look at DRBD?

What is the real goal here?

2006-11-08 22:39:04

by Suzuki

[permalink] [raw]
Subject: Re: Problem with multiple mounts

Lennart Sorensen wrote:
> On Wed, Nov 08, 2006 at 11:22:15AM -0800, Suzuki wrote:
>
>>"I exported a disk partition using nbd protocol. On the nbd client, I
>>make reiserfs and run fsstress test case on this partition. At the same
>>time, I mount this partition on the nbd server. Then Oops appears as
>>following:"
>>
>>ReiserFS: sda10: found reiserfs format "3.6" with standard journal
>>ReiserFS: sda10: using ordered data mode
>>ReiserFS: sda10: journal params: device sda10, size 8192, journal first
>>block 18, max trans len 1024, max batch 900, max commit age 30, max
>>trans age 30
>>ReiserFS: sda10: checking transaction log (sda10)
>>
>>Oops: Kernel access of bad area, sig: 11 [#1]
>>
>>Call Trace:
>>[C000000011333090] [C0000000001EDB70] .journal_read+0x165c/0x1b6c
>>(unreliable)
>>[C000000011333410] [C0000000001EF280] .journal_init+0xdc0/0xee8
>>[C000000011333530] [C0000000001CDBD8] .reiserfs_fill_super+0xa90/0x1e40
>>[C000000011333790] [C00000000011E988] .get_sb_bdev+0x208/0x31c
>>[C000000011333870] [C0000000001CA00C] .get_super_block+0x38/0x60
>>[C000000011333900] [C00000000011E260] .vfs_kern_mount+0xec/0x198
>>[C0000000113339B0] [C00000000011E3E0] .do_kern_mount+0x88/0xdc
>>[C000000011333A50] [C0000000001532CC] .do_mount+0xd50/0xe08
>>[C000000011333D60] [C000000000175090] .compat_sys_mount+0x368/0x448
>>[C000000011333E30] [C00000000000861C] syscall_exit+0x0/0x40
>>
>>But, if we try the steps in the reverse order,
>>
>>"mount the partition on nbd server first and then try fsstress tests on
>>the client side. This is just to ensure that the server is not seeing an
>>incomplete journal created by the client side runs."
>>
>>Things work fine !
>>
>>I doubt if this is due to the mount finding an incomplete journal
>>created by the client side fsstress runs in the first scenario.
>>
>>My question is : Is this supported ? Mounting a filesystem which is
>>already mounted and replaying the ( - a may be incomplete- ) journal.
>
>
> Absolutely not supported. Unless you have a filesystem that is
> specifically designed for simultanious read-write mount from multiple
> places, then you can't. For performance reasons most systems cache
> writes and updates in many cases, so the data read by one system may be
> out of date because another system has an update waiting to go to disk.
> You need a filesystem that has the ability for multiple systems to talk
> to each other about updates and locking and such things. Look for a
> cluster supporting filesystem or whatever is used to refer to a
> filesystem that supports multiple hosts having it mounted to provide
> redundant access. No normal filesystem can do it unless everyone has it
> mounted read only. If you want to share it, use NFS. That's what it's
> for.

Thanks for the response. This problem was reported by one of our test
team on 2.6.19. So, I wanted to confirm that what they are doing is not
supported !



Thanks,

Suzuki
>
> --
> Len Sorensen

2006-11-08 23:06:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: Problem with multiple mounts

On Nov 08, 2006 14:38 -0800, Suzuki wrote:
> Lennart Sorensen wrote:
> >>ReiserFS: sda10: checking transaction log (sda10)
> >>
> >>Oops: Kernel access of bad area, sig: 11 [#1]
> >>
> >>Call Trace:
> >>[C000000011333090] [C0000000001EDB70] .journal_read+0x165c/0x1b6c
> >>(unreliable)
> >>[C000000011333410] [C0000000001EF280] .journal_init+0xdc0/0xee8
> >>[C000000011333530] [C0000000001CDBD8] .reiserfs_fill_super+0xa90/0x1e40
> >>[C000000011333790] [C00000000011E988] .get_sb_bdev+0x208/0x31c
> >>[C000000011333870] [C0000000001CA00C] .get_super_block+0x38/0x60
> >>[C000000011333900] [C00000000011E260] .vfs_kern_mount+0xec/0x198
> >>[C0000000113339B0] [C00000000011E3E0] .do_kern_mount+0x88/0xdc
> >>[C000000011333A50] [C0000000001532CC] .do_mount+0xd50/0xe08
> >>[C000000011333D60] [C000000000175090] .compat_sys_mount+0x368/0x448
> >>[C000000011333E30] [C00000000000861C] syscall_exit+0x0/0x40
> >>
> >>My question is : Is this supported ? Mounting a filesystem which is
> >>already mounted and replaying the ( - a may be incomplete- ) journal.
>
> Thanks for the response. This problem was reported by one of our test
> team on 2.6.19. So, I wanted to confirm that what they are doing is not
> supported !

I would suggest that even while this is not supported, it would be prudent
to fix such a bug. It might be possible to hit a similar problem if there
is corruption of the on-disk data in the journal and oopsing the kernel
isn't a graceful way to deal with bad data on disk.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-11-09 15:15:56

by Lennart Sorensen

[permalink] [raw]
Subject: Re: Problem with multiple mounts

On Wed, Nov 08, 2006 at 04:06:23PM -0700, Andreas Dilger wrote:
> I would suggest that even while this is not supported, it would be prudent
> to fix such a bug. It might be possible to hit a similar problem if there
> is corruption of the on-disk data in the journal and oopsing the kernel
> isn't a graceful way to deal with bad data on disk.

On the other hand corrupt data at least doesn't change under you while
you are trying to figure out the filesystem. This particular use would
have meta data changing while you are trying to read it, making things
not be consistent with each other from one moment to another. There may
be nothing that can be done about it.

--
Len Sorensen

2006-11-09 16:37:09

by David Masover

[permalink] [raw]
Subject: Re: Problem with multiple mounts

Lennart Sorensen wrote:
> On Wed, Nov 08, 2006 at 04:06:23PM -0700, Andreas Dilger wrote:
>> I would suggest that even while this is not supported, it would be prudent
>> to fix such a bug. It might be possible to hit a similar problem if there
>> is corruption of the on-disk data in the journal and oopsing the kernel
>> isn't a graceful way to deal with bad data on disk.
>
> On the other hand corrupt data at least doesn't change under you while
> you are trying to figure out the filesystem.

It might.

I'd suspect that there can, in fact, be something done about this,
assuming good RAM. After all, a corrupt image won't crash a decent web
browser.

It might sacrifice a ton of performance, though. I suggest it shouldn't
be a priority to try to figure this out, and if it's ever implemented,
make it a mount option -- -o paranoid or something. Obviously we don't
care if a rescue disc takes forever, but we don't want to be waiting for
hours on our FS boot, which is why I have an initrd mount my Reiser4 FS
with "-o dont_load_bitmap"

(Yes, I realize the right way to do this is initramfs now. I'm too lazy
to figure out how to make that work.)