2001-04-26 15:47:47

by Alexander Viro

[permalink] [raw]
Subject: [PATCH] SMP race in ext2 - metadata corruption.

Ext2 does getblk+wait_on_buffer for new metadata blocks before
filling them with zeroes. While that is enough for single-processor,
on SMP we have the following race:

getblk gives us unlocked, non-uptodate bh
wait_on_buffer() does nothing
read from device locks it and starts IO
we zero it out.
on-disk data overwrites our zeroes.
we mark it dirty
bdflush writes the old data (_not_ zeroes) back to disk.

Result: crap in metadata block. Proposed fix: lock_buffer()/unlock_buffer()
around memset()/mark_buffer_uptodate() instead of wait_on_buffer() before
them.

Patch against 2.4.4-pre7 follows. Please, apply.
Al


--- S4-pre7/fs/ext2/inode.c Wed Apr 25 20:43:08 2001
+++ S4-pre7-ext2/fs/ext2/inode.c Thu Apr 26 11:36:11 2001
@@ -397,13 +397,13 @@
* the pointer to new one, then send parent to disk.
*/
bh = getblk(inode->i_dev, parent, blocksize);
- if (!buffer_uptodate(bh))
- wait_on_buffer(bh);
+ lock_buffer(bh);
memset(bh->b_data, 0, blocksize);
branch[n].bh = bh;
branch[n].p = (u32*) bh->b_data + offsets[n];
*branch[n].p = branch[n].key;
mark_buffer_uptodate(bh, 1);
+ unlock_buffer(bh);
mark_buffer_dirty_inode(bh, inode);
if (IS_SYNC(inode) || inode->u.ext2_i.i_osync) {
ll_rw_block (WRITE, 1, &bh);
@@ -587,10 +587,10 @@
struct buffer_head *bh;
bh = getblk(dummy.b_dev, dummy.b_blocknr, inode->i_sb->s_blocksize);
if (buffer_new(&dummy)) {
- if (!buffer_uptodate(bh))
- wait_on_buffer(bh);
+ lock_buffer(bh);
memset(bh->b_data, 0, inode->i_sb->s_blocksize);
mark_buffer_uptodate(bh, 1);
+ unlock_buffer(bh);
mark_buffer_dirty_inode(bh, inode);
}
return bh;


2001-04-26 18:18:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
> Ext2 does getblk+wait_on_buffer for new metadata blocks before
> filling them with zeroes. While that is enough for single-processor,
> on SMP we have the following race:
>
> getblk gives us unlocked, non-uptodate bh
> wait_on_buffer() does nothing
> read from device locks it and starts IO
> we zero it out.
> on-disk data overwrites our zeroes.
> we mark it dirty
> bdflush writes the old data (_not_ zeroes) back to disk.
>
> Result: crap in metadata block. Proposed fix: lock_buffer()/unlock_buffer()
> around memset()/mark_buffer_uptodate() instead of wait_on_buffer() before
> them.
>
> Patch against 2.4.4-pre7 follows. Please, apply.

correct. I bet other fs are affected as well btw.

Andrea

2001-04-26 18:26:06

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> correct. I bet other fs are affected as well btw.

If only... block_read() vs. block_write() has the same race. I'm going
through the list of all wait_on_buffer() users right now.

2001-04-26 18:52:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.


On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
>
> Ext2 does getblk+wait_on_buffer for new metadata blocks before
> filling them with zeroes. While that is enough for single-processor,
> on SMP we have the following race:
>
> getblk gives us unlocked, non-uptodate bh
> wait_on_buffer() does nothing
> read from device locks it and starts IO

I see the race, but I don't see how you can actually trigger it.

Exactly _who_ does the "read from device" part? Somebody doing a
"fsck" while the filesystem is mounted read-write and actively written
to? Yeah, you'd get disk corruption that way, but you'll get it regardless
of this bug.

There's nothing else that should be using that block at that stage. And if
there were, that would be a bug in itself, as far as I can tell. We've
just allocated it, and we're the only and exclusive owners of that block
on the disk. Anybody else who touches it is seriously broken.

Now, I don't disagree with your patch (it's just obviously cleaner to lock
it properly), but I don't think this is a real bug. I suspect that even
the wait-on-buffer is not strictly necessary: it's probably there to make
sure old write-backs have completed, but that doesn't really matter
either.

We used to have "breada()" do physical read-ahead that could have
triggered this, but we've long since gotten rid of that.

Or am I overlooking something?

Linus

2001-04-26 19:06:13

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thursday, April 26, 2001 02:24:26 PM -0400 Alexander Viro
<[email protected]> wrote:

>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
>> correct. I bet other fs are affected as well btw.
>
> If only... block_read() vs. block_write() has the same race. I'm going
> through the list of all wait_on_buffer() users right now.
>

Looks like reiserfs has it too when allocating tree blocks, but it should
be harder to hit. The fix should be small but it will take me a bit to
make sure it doesn't affect the rest of the balancing code.

-chris

2001-04-26 19:10:03

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Linus Torvalds wrote:

> I see the race, but I don't see how you can actually trigger it.
>
> Exactly _who_ does the "read from device" part? Somebody doing a
> "fsck" while the filesystem is mounted read-write and actively written
> to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> of this bug.

> There's nothing else that should be using that block at that stage. And if
> there were, that would be a bug in itself, as far as I can tell. We've
> just allocated it, and we're the only and exclusive owners of that block
> on the disk. Anybody else who touches it is seriously broken.

> Now, I don't disagree with your patch (it's just obviously cleaner to lock
> it properly), but I don't think this is a real bug. I suspect that even
> the wait-on-buffer is not strictly necessary: it's probably there to make
> sure old write-backs have completed, but that doesn't really matter
> either.
>
> We used to have "breada()" do physical read-ahead that could have
> triggered this, but we've long since gotten rid of that.
>
> Or am I overlooking something?

Somebody doing dd(1) _from_ that disk. Sure, he's bound to get crap.
But I really don't think that opening device for read should be able
to affect its contents in any way.

BTW, same race exists between block_read() and block_write(). And that
one is even more obviously wrong:

xterm A: xterm B:

dd if=/dev/hda of=/dev/hdb dd if=/dev/hdb of=/dev/null

result: some blocks on hdb retaining their old contents.

IMO "no matter what you read, you don't affect the contents" is a good
general principle. Sure, you can get crap if you read in the middle of
write. That's expected and sane. However, the final contents of file
depends only on the things done by writers.

Al

2001-04-26 19:19:44

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, I wrote:

> On Thu, 26 Apr 2001, Linus Torvalds wrote:
>
> > I see the race, but I don't see how you can actually trigger it.
> >
> > Exactly _who_ does the "read from device" part? Somebody doing a
> > "fsck" while the filesystem is mounted read-write and actively written
> > to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> > of this bug.

OK, I think I've a better explanation now:

Suppose /dev/hda1 is owned by root.disks and permissions are 640.
It is mounted read-write.

Process foo belongs to pfy.staff. PFY is included into disks, but doesn't
have root. I claim that he should be unable to cause fs corruption on
/dev/hda1.

Currently foo _can_ cause such corruption, even though it has nothing
resembling write permissions for device in question.

IMO it is wrong. I'm not saying that it's a real security problem. I'm
not saying that PFY is not idiot or that his actions make any sense.
However, I think that situation when he can do that without write
access to device is just plain wrong.

Does the above make sense?
Al

2001-04-26 19:22:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 11:49:14AM -0700, Linus Torvalds wrote:
>
> On Thu, Apr 26, 2001 at 11:45:47AM -0400, Alexander Viro wrote:
> >
> > Ext2 does getblk+wait_on_buffer for new metadata blocks before
> > filling them with zeroes. While that is enough for single-processor,
> > on SMP we have the following race:
> >
> > getblk gives us unlocked, non-uptodate bh
> > wait_on_buffer() does nothing
> > read from device locks it and starts IO
>
> I see the race, but I don't see how you can actually trigger it.
>
> Exactly _who_ does the "read from device" part? Somebody doing a

/sbin/dump

> the wait-on-buffer is not strictly necessary: it's probably there to make

maybe not but I need to check some more bit to be sure.

Andrea

2001-04-26 19:35:46

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> > the wait-on-buffer is not strictly necessary: it's probably there to make
>
> maybe not but I need to check some more bit to be sure.

Same scenario, but with read-in-progress started before we do getblk(). BTW,
old writeback is harmless - we will overwrite anyway. And _that_ can happen
without direct access to device - truncate() doesn't terminate writeout of
the indirect blocks it frees (IMO it should, but that's another story).

2001-04-26 19:45:15

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 09:15:57PM +0200, Andrea Arcangeli wrote:
> maybe not but I need to check some more bit to be sure.

yes we probably don't need it for fs against fs in 2.4 because we make
the new metadata block visible to a reader (splice) only after they're
all uptodate and all directory operations are serialized by the vfs
(with the i_zombie).

Andrea

2001-04-26 19:51:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> Same scenario, but with read-in-progress started before we do getblk(). BTW,

how can the read in progress see a branch that we didn't spliced yet? We
clear and mark uptodate the new part of the branch before it's visible
to any reader no? then in splice we write the key into the where->p and
the branch become visible to the readers but by that time the reader
won't start I/O because the buffer are just uptodate. I only had a short
look now and to verify Ingo's fix, so maybe I overlooked something.

> without direct access to device - truncate() doesn't terminate writeout of
> the indirect blocks it frees (IMO it should, but that's another story).

If the block is under I/O or dirty that's another story, the only issue
here is when the buffer block is new and not uptodate.

Andrea

2001-04-26 19:57:16

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > Same scenario, but with read-in-progress started before we do getblk(). BTW,
>
> how can the read in progress see a branch that we didn't spliced yet? We

fd = open("/dev/hda1", O_RDONLY);
read(fd, buf, sizeof(buf));

2001-04-26 20:11:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Alexander Viro wrote:
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > how can the read in progress see a branch that we didn't spliced yet? We
>
> fd = open("/dev/hda1", O_RDONLY);
> read(fd, buf, sizeof(buf));

Note that I think all these arguments are fairly bogus. Doing things like
"dump" on a live filesystem is stupid and dangerous (in my opinion it is
stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
discussion in itself), and there really are no valid uses for opening a
block device that is already mounted. More importantly, I don't think
anybody actually does.

The fact that you _can_ do so makes the patch valid, and I do agree with
Al on the "least surprise" issue. I've already applied the patch, in fact.
But the fact is that nobody should ever do the thing that could cause
problems.

Linus

2001-04-26 20:12:45

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 03:17:54PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, I wrote:
>
> > On Thu, 26 Apr 2001, Linus Torvalds wrote:
> >
> > > I see the race, but I don't see how you can actually trigger it.
> > >
> > > Exactly _who_ does the "read from device" part? Somebody doing a
> > > "fsck" while the filesystem is mounted read-write and actively written
> > > to? Yeah, you'd get disk corruption that way, but you'll get it regardless
> > > of this bug.
>
> OK, I think I've a better explanation now:
>
> Suppose /dev/hda1 is owned by root.disks and permissions are 640.
> It is mounted read-write.
>
> Process foo belongs to pfy.staff. PFY is included into disks, but doesn't
> have root. I claim that he should be unable to cause fs corruption on
> /dev/hda1.
>
> Currently foo _can_ cause such corruption, even though it has nothing
> resembling write permissions for device in question.
>
> IMO it is wrong. I'm not saying that it's a real security problem. I'm
> not saying that PFY is not idiot or that his actions make any sense.
> However, I think that situation when he can do that without write
> access to device is just plain wrong.
>
> Does the above make sense?

Sure. And as said `dump` has the same issues.

Andrea

2001-04-26 20:44:47

by Richard B. Johnson

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, 26 Apr 2001, Alexander Viro wrote:

>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > > the wait-on-buffer is not strictly necessary: it's probably there to make
> >
> > maybe not but I need to check some more bit to be sure.
>
> Same scenario, but with read-in-progress started before we do getblk(). BTW,
> old writeback is harmless - we will overwrite anyway. And _that_ can happen
> without direct access to device - truncate() doesn't terminate writeout of
> the indirect blocks it frees (IMO it should, but that's another story).
>

This seems to be the problem reported about a year ago, but never fixed.
It exists, even in early kernels.

mke2fs /dev/fd0
mount /dev/fd0 /mnt
cp stuff /mnt

lilo -C - <<EOF
boot = /dev/fd0
map = /mnt/map
backup = /dev/null
install=/mnt/boot.b
image=/mnt/vmlinuz
initrd=/mnt/initrd
root=/dev/ram0
EOF

umount /dev/fd0
cp /dev/fd0 raw.bin

The disk image, raw.bin, does NOT contain the image of the floppy.
Most of boot stuff added by lilo is missing. It will eventually
get there, but it's not there now, even though the floppy was
un-mounted!

A work-around was to do:

ioctl(fd, FDFLUSH, NULL);

... from a program before copying the image.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2001-04-26 20:44:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> What I'm saying above is that even without the wait_on_buffer ext2 can
> screwup itself because the splice happens after the buffer are just all
> uptodate so any "reader" (I mean any reader through ext2 not through
> block_dev) will never try to do a bread on that blocks before they're
> just zeroed and uptodate.

I assume you meant "..can _not_ screw up itself..", otherwise the rest of
the sentence doesn't seem to make much sense.

Linus

2001-04-26 20:51:17

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Linus Torvalds wrote:

> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Agreed.

> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.

IMO the real issue is in fuzzy rules for use of wait_on_buffer(). There is
one case when it's 100% correct - when we had done ll_rw_block() on that
bh at earlier point and want to make sure that it's completed. And there
is a lot of uses that are kinda-sorta correct for UP, but break on SMP.
unmap_buffer() was similar to that race. So are races in minix, sysvfs and
ufs. So is one in block_write() and here the problem is quite real - it's
not as idiotic as device/mounted fs races.

Basically, all legitimate cases are ones where we would be very unhappy
about buffer being not uptodate afterwards.
getblk(); if (!buffer_uptodate) wait_on_buffer();
is not in that class. It _is_ OK on UP as long as we don't block, but on
SMP it doesn't guarantee anything - buffer_head can be in any state
when we are done. IMO all such places require fixing.

How about adding
if (!buffer_uptodate(bh)) {
printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
show_task(current);
}
in the end of wait_on_buffer() for a while?
Al

2001-04-26 20:57:17

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Richard B. Johnson wrote:

> The disk image, raw.bin, does NOT contain the image of the floppy.
> Most of boot stuff added by lilo is missing. It will eventually
> get there, but it's not there now, even though the floppy was
> un-mounted!

I doubt that you can reproduce that on anything remotely recent.
All buffers are flushed when last user closes device.

2001-04-26 21:01:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:
> But the fact is that nobody should ever do the thing that could cause
> problems.

dump in 2.4 also gets uncoherent view of the data which make things even
worse than in 2.2 (to change that we should hash in the buffer hashtable
all the bh overlapped in the pagecache and no I'm not suggesting that
relax). The only reason it has a chance to work with ext2 is because
ext2 is very dumb and it misses an inode map and in turn inodes are at a
predictable location on disk so it cannot run totally out of control.

Andrea

2001-04-26 21:01:48

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 03:55:19PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > > Same scenario, but with read-in-progress started before we do getblk(). BTW,
> >
> > how can the read in progress see a branch that we didn't spliced yet? We
>
> fd = open("/dev/hda1", O_RDONLY);
> read(fd, buf, sizeof(buf));

You misunderstood the context of what I said, I perfectly know the race
you are talking about, I was answering Linus's question "the
wait_on_buffer isn't even necessary to protect ext2 against ext2". You
are talking about the other race that is "ext2" against "block_dev", and
I obviously agree on that one since the first place as I immediatly
answered you "correct".

What I'm saying above is that even without the wait_on_buffer ext2 can
screwup itself because the splice happens after the buffer are just all
uptodate so any "reader" (I mean any reader through ext2 not through
block_dev) will never try to do a bread on that blocks before they're
just zeroed and uptodate.

Andrea

2001-04-26 21:01:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 10:11:09PM +0200, Andrea Arcangeli wrote:
> On Thu, Apr 26, 2001 at 03:55:19PM -0400, Alexander Viro wrote:
> >
> >
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > > On Thu, Apr 26, 2001 at 03:34:00PM -0400, Alexander Viro wrote:
> > > > Same scenario, but with read-in-progress started before we do getblk(). BTW,
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > fd = open("/dev/hda1", O_RDONLY);
> > read(fd, buf, sizeof(buf));
>
> You misunderstood the context of what I said, I perfectly know the race
> you are talking about, I was answering Linus's question "the
> wait_on_buffer isn't even necessary to protect ext2 against ext2". You
> are talking about the other race that is "ext2" against "block_dev", and
> I obviously agree on that one since the first place as I immediatly
> answered you "correct".
>
> What I'm saying above is that even without the wait_on_buffer ext2 can
^^^ "cannot" of course
> screwup itself because the splice happens after the buffer are just all
> uptodate so any "reader" (I mean any reader through ext2 not through
> block_dev) will never try to do a bread on that blocks before they're
> just zeroed and uptodate.
>
> Andrea


Andrea

2001-04-26 21:13:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 04:49:20PM -0400, Alexander Viro wrote:
> getblk(); if (!buffer_uptodate) wait_on_buffer();
> is not in that class. It _is_ OK on UP as long as we don't block, but on
> SMP it doesn't guarantee anything - buffer_head can be in any state
> when we are done. IMO all such places require fixing.

Yes, actually it's probably ok for most of other "fs" against "fs" cases
because those fses still hold the big lock while handling metadata but
they should really use the lock_buffer way if they want to protect
against the block_dev accesses too.

> How about adding
> if (!buffer_uptodate(bh)) {
> printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> show_task(current);
> }
> in the end of wait_on_buffer() for a while?

At the _top_ of wait_on_buffer would be better then at the end.

Andrea

2001-04-26 21:16:42

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 01:26:15PM -0700, Linus Torvalds wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> >
> > What I'm saying above is that even without the wait_on_buffer ext2 can
> > screwup itself because the splice happens after the buffer are just all
> > uptodate so any "reader" (I mean any reader through ext2 not through
> > block_dev) will never try to do a bread on that blocks before they're
> > just zeroed and uptodate.
>
> I assume you meant "..can _not_ screw up itself..", otherwise the rest of

yes, it was a typo sorry.

Andrea

2001-04-26 21:23:17

by Andrzej Krzysztofowicz

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

>
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

I know a few people that often do:

dd if=/dev/hda1 of=/dev/hdc1
e2fsck /dev/hdc1

to make an "exact" copy of a currently working system.

Maybe it is stupid, but they do.
Fortunately, their systems are not SMP...

Andrzej

2001-04-26 23:27:15

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Thu, 26 Apr 2001, Andrea Arcangeli wrote:

> > How about adding
> > if (!buffer_uptodate(bh)) {
> > printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> > show_task(current);
> > }
> > in the end of wait_on_buffer() for a while?
>
> At the _top_ of wait_on_buffer would be better then at the end.

In that case ll_rw_block() + wait_on_buffer() (absolutely legitimate
combination) will scream at you.

2001-04-27 00:10:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 07:25:23PM -0400, Alexander Viro wrote:
>
>
> On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
>
> > > How about adding
> > > if (!buffer_uptodate(bh)) {
> > > printk(KERN_ERR "IO error or racy use of wait_on_buffer()");
> > > show_task(current);
> > > }
> > > in the end of wait_on_buffer() for a while?
> >
> > At the _top_ of wait_on_buffer would be better then at the end.
>
> In that case ll_rw_block() + wait_on_buffer() (absolutely legitimate
> combination) will scream at you.

--- 2.4.4pre7/include/linux/locks.h Thu Apr 26 05:22:11 2001
+++ 2.4.4pre7aa1/include/linux/locks.h Fri Apr 27 01:52:31 2001
@@ -18,6 +18,11 @@
{
if (test_bit(BH_Lock, &bh->b_state))
__wait_on_buffer(bh);
+ else if (!buffer_uptodate(bh)) {
+ __label__ here;
+ here:
+ printk(KERN_ERR "IO error or racy use of wait_on_buffer() from %p\n", &&here);
+ }
}

extern inline void lock_buffer(struct buffer_head * bh)

Andrea

2001-04-27 00:23:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Fri, 27 Apr 2001, Andrea Arcangeli wrote:
> + __label__ here;
> + here:
> + printk(KERN_ERR "IO error or racy use of wait_on_buffer() from %p\n", &&here);

Detail nit: don't do this. Use "current_text_addr()" instead. Simpler to
read, and gcc will actually do the right thing wrt inlining etc.

Linus

2001-04-27 00:42:06

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 05:19:26PM -0700, Linus Torvalds wrote:
> Detail nit: don't do this. Use "current_text_addr()" instead. Simpler to
> read, and gcc will actually do the right thing wrt inlining etc.

Agreed, thanks for the info.

Andrea

2001-04-27 11:45:22

by Richard B. Johnson

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, 26 Apr 2001, Alexander Viro wrote:

>
>
> On Thu, 26 Apr 2001, Richard B. Johnson wrote:
>
> > The disk image, raw.bin, does NOT contain the image of the floppy.
> > Most of boot stuff added by lilo is missing. It will eventually
> > get there, but it's not there now, even though the floppy was
> > un-mounted!
>
> I doubt that you can reproduce that on anything remotely recent.
> All buffers are flushed when last user closes device.
>
2.4.3

Buffers are not flushed (actually written) to disk. The floppy continues
to be written for 20 to 30 seconds after `umount` returns to
the shell. A program like `cp` , accessing the raw device during this time
does not get what will eventually be written.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.1 on an i686 machine (799.53 BogoMips).

"Memory is like gasoline. You use it up when you are running. Of
course you get it all back when you reboot..."; Actual explanation
obtained from the Micro$oft help desk.


2001-04-27 13:05:35

by Vojtech Pavlik

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:

> On Thu, 26 Apr 2001, Alexander Viro wrote:
> > On Thu, 26 Apr 2001, Andrea Arcangeli wrote:
> > >
> > > how can the read in progress see a branch that we didn't spliced yet? We
> >
> > fd = open("/dev/hda1", O_RDONLY);
> > read(fd, buf, sizeof(buf));
>
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

Actually this is done quite often, even on mounted fs's:

hdparm -t /dev/hda

> The fact that you _can_ do so makes the patch valid, and I do agree with
> Al on the "least surprise" issue. I've already applied the patch, in fact.
> But the fact is that nobody should ever do the thing that could cause
> problems.

--
Vojtech Pavlik
SuSE Labs

2001-04-27 13:25:46

by Alexander Viro

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.



On Fri, 27 Apr 2001, Vojtech Pavlik wrote:

> Actually this is done quite often, even on mounted fs's:
>
> hdparm -t /dev/hda

You would need either hdparm -t /dev/hda<something> or mounting the
whole /dev/hda.

Buffer cache for the disk is unrelated to buffer cache for parititions.

2001-04-27 14:30:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Thu, Apr 26, 2001 at 01:08:25PM -0700, Linus Torvalds wrote:
> Note that I think all these arguments are fairly bogus. Doing things like
> "dump" on a live filesystem is stupid and dangerous (in my opinion it is
> stupid and dangerous to use "dump" at _all_, but that's a whole 'nother
> discussion in itself), and there really are no valid uses for opening a
> block device that is already mounted. More importantly, I don't think
> anybody actually does.

You can use LVM snapshot volumes to do it safely.

-Andi

2001-04-27 14:33:16

by Ville Herva

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Fri, Apr 27, 2001 at 09:23:57AM -0400, you [Alexander Viro] claimed:
>
>
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
>
> > Actually this is done quite often, even on mounted fs's:
> >
> > hdparm -t /dev/hda
>
> You would need either hdparm -t /dev/hda<something> or mounting the
> whole /dev/hda.
>
> Buffer cache for the disk is unrelated to buffer cache for parititions.

Well, I for one have been running

hdparm -t /dev/md0
or
time head -c 1000m /dev/md0 > /dev/null

while /dev/md0 was mounted without realizing that this could be "stupid" or
that it could eat my data.

/dev/md0 on /backup-versioned type ext2 (rw)

I often cat(1) or head(1) partitions or devices (even mounted ones) if I
need dummy randomish test data for compression or tape drives (that I've
been having trouble with).

BTW: is 2.2 affected? 2.0?


-- v --

[email protected]

2001-04-27 16:53:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.


On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
>
> Actually this is done quite often, even on mounted fs's:
>
> hdparm -t /dev/hda

Note that this one happens to be ok.

The buffer cache is "virtual" in the sense that /dev/hda is a completely
separate name-space from /dev/hda1, even if there is some physical
overlap.

Linus

2001-04-27 17:37:11

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

Linus Torvalds wrote:
> On Fri, 27 Apr 2001, Neil Conway wrote:
> >
> > I'm surprised that dump is deprecated (by you at least ;-)). What to
> > use instead for backups on machines that can't umount disks regularly?
>
> Note that dump simply won't work reliably at all even in 2.4.x: the buffer
> cache and the page cache (where all the actual data is) are not
> coherent. This is only going to get even worse in 2.5.x, when the
> directories are moved into the page cache as well.

> Dump was a stupid program in the first place. Leave it behind.

Dump/restore are useful, on-line dump is silly. I am personally amazed
that on-line, mounted dump was -ever- supported. I guess it would work
if mounted ro...

dump is still the canonical solution, IMHO, for saving and restoring
filesystem metadata OFFLINE. tar/cpio can be taught to do stuff like
security ACLs and EAs and such, but such code and formats are not yet
standardized, and they do not approach dump when it comes to taking an
accurate snapshot of the filesystem.

--
Jeff Garzik | Disbelief, that's why you fail.
Building 1024 |
MandrakeSoft |

2001-04-27 18:04:50

by L A Walsh

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

Andrzej Krzysztofowicz wrote:

> I know a few people that often do:
>
> dd if=/dev/hda1 of=/dev/hdc1
> e2fsck /dev/hdc1
>
> to make an "exact" copy of a currently working system.

---
Presumably this isn't a problem is the source disks are either unmounted or mounted 'read-only' ?


--
The above thoughts and | They may have nothing to do with
writings are my own. | the opinions of my employer. :-)
L A Walsh | Trust Technology, Core Linux, SGI
[email protected] | Voice: (650) 933-5338



2001-04-27 18:18:22

by David Konerding

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Fri, Apr 27, 2001 at 11:02:17AM -0700, LA Walsh wrote:
> Andrzej Krzysztofowicz wrote:
>
> > I know a few people that often do:
> >
> > dd if=/dev/hda1 of=/dev/hdc1
> > e2fsck /dev/hdc1
> >
> > to make an "exact" copy of a currently working system.
>
> ---
> Presumably this isn't a problem is the source disks are either unmounted or mounted 'read-only' ?
>
>

I thought the known best solution on this was to use COW snapshots,
because then you copy the filesystem as exactly the state when the snapshot
was made, without impacting the writability of the filesystem while
the (potentially very long) dump is made?

I tried using this on LVM, but after seeing a few messages on the list about
kernel oopses happening with snapshots of filesystems with heavy write
activities, as well as experiencing serious problems with the LVM userspace
tools (they would core dump on startup if the LVM filesystem had any sort
of corruption or integrity failure) I decided to put it away until the LVM
folks managed to get a production version ready.

2001-04-27 19:22:30

by Shane

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Fri, Apr 27, 2001 at 09:52:19AM -0700, Linus Torvalds wrote:
>
> On Fri, 27 Apr 2001, Vojtech Pavlik wrote:
> >
> > Actually this is done quite often, even on mounted fs's:
> >
> > hdparm -t /dev/hda
>
> Note that this one happens to be ok.
>
> The buffer cache is "virtual" in the sense that /dev/hda is a completely
> separate name-space from /dev/hda1, even if there is some physical
> overlap.

Wouldn't something like "hdparm -t /dev/md0" trigger it
though. It is the same device as gets mounted as md
devices aren't partitioned.

Shane


--
Shane Wegner: [email protected]
http://www.cm.nu/~shane/
PGP: 1024D/FFE3035D
A0ED DAC4 77EC D674 5487
5B5C 4F89 9A4E FFE3 035D

2001-04-28 04:57:14

by Albert D. Cahalan

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

Linus Torvalds writes:

> The buffer cache is "virtual" in the sense that /dev/hda is a
> completely separate name-space from /dev/hda1, even if there
> is some physical overlap.

So the aliasing problems and elevator algorithm confusion remain?
Is this ever likely to change, and what is with the 1 kB assumptions?
(Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)

2001-04-28 08:31:47

by Matthias Urlichs

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

Martin Dalecki :
> tar/cpio and friends don't deal properly with
>
> a. holes inside of files.
> b. hardlinks between files.
>
??? GNU tar does both. The only thing it currently cannot handle is Not
Changing Anything: either atime or ctime _will_ be updated.

2001-04-28 10:11:29

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

On Sat, Apr 28 2001, Albert D. Cahalan wrote:
> Linus Torvalds writes:
>
> > The buffer cache is "virtual" in the sense that /dev/hda is a
> > completely separate name-space from /dev/hda1, even if there
> > is some physical overlap.
>
> So the aliasing problems and elevator algorithm confusion remain?

At least for the I/O scheduler confusion, requests to partitions will
remap the buffer location and this problem disappears nicely. It's not a
big issue, really.

> Is this ever likely to change, and what is with the 1 kB assumptions?
> (Hmmm, cruft left over from the 1 kB Minix filesystem blocks?)

What 1kB assumption?

--
Jens Axboe

2001-04-28 13:34:09

by Olaf Titz

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

> or such. tar/cpio and friends don't deal properly with
> a. holes inside of files.
> b. hardlinks between files.

GNU tar handles both of these. (Not particularly efficiently in the
case of sparse files, but that's a minor issue in this case.) See -S flag.

Perhaps more importantly, for a _robust_ backup solution which can
deal with partially unreadable tapes, you have pretty much no option
other than tar for the actual storage.

Olaf

2001-04-30 08:47:13

by Neil Conway

[permalink] [raw]
Subject: Re: [PATCH] SMP race in ext2 - metadata corruption.

Hiya.

Linus Torvalds wrote:
> So anybody who depends on "dump" getting backups right is already playing
> russian rulette with their backups. It's not at all guaranteed to get the
> right results - you may end up having stale data in the buffer cache that
> ends up being "backed up".
>
> Dump was a stupid program in the first place. Leave it behind.

Ouch. I just re-read the man page and it doesn't caution (*) against
using it on mounted filesystems. That probably means that there are
thousands of other losers like me using it on production machines.
Volunteers to (a) change the man page, (b) talk to the distros about
dumping "dump"?

> However, it may be that in the long run it would be advantageous to have a
> "filesystem maintenance interface" for doing things like backups and
> defragmentation..

Yup, sounds good.

Neil

(*) The KNOWNBUGS file mentions "possible" problems while dumping active
mounted filesystems, but I've (elsewhere) seen these characterised as no
real problem; also, this falls a long way short of discouraging use in
this fashion.