2008-10-14 14:50:40

by Richard Kojedzinszky

Subject: minix/ext2 + rd problem

Dear all,

I have an embedded system that uses a ramdisk with a minix filesystem
for /etc. With 2.4 kernels, and also with 2.6.19, the following commands
did exactly what I wanted: they created an image of /etc without
unmounting it:

# mount -o remount,ro /etc
# cat /dev/ram0 > /tmp/image
# mount -o remount,rw /etc

This gave me a consistent image of /etc in /tmp/image.

This still worked with kernel 2.6.23.14, but I recently upgraded to
2.6.26.2 and noticed that these commands no longer work. I wrote a simple
test script to check and reproduce the issue; it is attached.
Unfortunately, this does not work with ext2 either.

My Linux version is:
Linux version 2.6.26 ([email protected]) (gcc version 4.2.4 (Debian
4.2.4-3)) #27 SMP PREEMPT Tue Oct 14 15:19:30 CEST 2008

I use Debian lenny/sid.

Was there an intentional change that caused this behaviour, or was it
an accident?

Thanks in advance,

Kojedzinszky Richard
TvNetWork Nyrt.
E-mail: krichy (at) tvnetwork [dot] hu
PGP: 0x24E79141
Fingerprint = 6847 ECFF EF58 0C09 18A5 16CF 270F 0C6F 24E7 9141


Attachments:
test2.sh (435.00 B)
ver_linux.txt (1.18 kB)
.config (46.47 kB)

2008-10-15 04:16:45

by Nick Piggin

Subject: Re: minix/ext2 + rd problem

Hi,
Thanks for reporting this.

On Tue, Oct 14, 2008 at 04:43:55PM +0200, Richard Kojedzinszky wrote:
> Dear all,
>
> I have an embedded system that uses a ramdisk with a minix filesystem
> for /etc. With 2.4 kernels, and also with 2.6.19, the following commands
> did exactly what I wanted: they created an image of /etc without
> unmounting it:
>
> # mount -o remount,ro /etc
> # cat /dev/ram0 > /tmp/image
> # mount -o remount,rw /etc
>
> This gave me a consistent image of /etc in /tmp/image.
>
> This still worked with kernel 2.6.23.14, but I recently upgraded to
> 2.6.26.2 and noticed that these commands no longer work. I wrote a simple
> test script to check and reproduce the issue; it is attached.
> Unfortunately, this does not work with ext2 either.
>
> My Linux version is:
> Linux version 2.6.26 ([email protected]) (gcc version 4.2.4 (Debian
> 4.2.4-3)) #27 SMP PREEMPT Tue Oct 14 15:19:30 CEST 2008
>
> I use Debian lenny/sid.
>
> Was there an intentional change that caused this behaviour, or was it
> an accident?

That shouldn't have been changed on purpose, unless it is doing something
funny that worked by accident before.

/dev/ram will behave much more like any other block device now (with brd),
whereas previously it was probably more coherent between filesystem and
block device node. Hmm, does all of the filesystem's cache get written back
before remount,ro returns?

Could you try sticking a sync after remount,ro?

Thanks,
Nick

2008-10-15 08:19:47

by Richard Kojedzinszky

Subject: Re: minix/ext2 + rd problem

Dear Nick,

I tried a sync after the remount, but that did not help. What helped
was dropping the caches by echoing 3 to /proc/sys/vm/drop_caches, but even
that did not solve the problem completely, only in about 95% of cases.

But when I read the device with
# dd if=/dev/ram0 iflag=direct ...
then it worked. I think this bypassed some caches, and thus read the
actual data.
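
A minimal sketch of the snapshot sequence with the direct read folded in. The helper names `run` and `snapshot_direct`, the `DRY_RUN` switch and the `bs=1M` block size are assumptions made for illustration, not part of the original test script; with `DRY_RUN=1` the commands are only printed, so the sequence can be inspected without root.

```shell
#!/bin/sh
# Sketch: image a mounted filesystem's block device using an O_DIRECT read.
# DRY_RUN=1 prints each command instead of running it (root is needed otherwise).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snapshot_direct MOUNTPOINT DEVICE IMAGE   (hypothetical helper name)
snapshot_direct() {
    mnt=$1 dev=$2 img=$3
    run mount -o remount,ro "$mnt" || return 1
    run sync
    # iflag=direct opens the device with O_DIRECT, so the read bypasses
    # whatever pages are cached for the block device node.
    run dd if="$dev" of="$img" bs=1M iflag=direct
    status=$?
    # Always try to restore read-write access, even if the copy failed.
    run mount -o remount,rw "$mnt" || return 1
    return "$status"
}
```

Calling `snapshot_direct /etc /dev/ram0 /tmp/image` with `DRY_RUN=1` set prints the four commands in order; unset it to run them for real.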

Sadly, I experimented further, and it works as expected only with a
ramdisk; with a logical volume, for example, it behaves incorrectly.

Thanks in advance,
Kojedzinszky Richard
TvNetWork Nyrt.
E-mail: krichy (at) tvnetwork [dot] hu
PGP: 0x24E79141
Fingerprint = 6847 ECFF EF58 0C09 18A5 16CF 270F 0C6F 24E7 9141


2008-10-15 14:05:50

by Nick Piggin

Subject: Re: minix/ext2 + rd problem

On Wed, Oct 15, 2008 at 10:19:44AM +0200, Richard Kojedzinszky wrote:
> Dear Nick,
>
> I tried a sync after the remount, but that did not help. What helped
> was dropping the caches by echoing 3 to /proc/sys/vm/drop_caches, but even
> that did not solve the problem completely, only in about 95% of cases.
>
> But when I read the device with
> # dd if=/dev/ram0 iflag=direct ...
> then it worked. I think this bypassed some caches, and thus read the
> actual data.
>
> Sadly, I experimented further, and it works as expected only with a
> ramdisk; with a logical volume, for example, it behaves incorrectly.

I've reproduced this problem (ext2 image corruption flagged in e2fsck
even though it was remounted ro and marked clean in the sb).

Issuing a sync, then drop_caches, seems to fix it here for me.
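
The sync-then-drop order matters because drop_caches only discards clean pages, so dirty data has to be written back first. A minimal sketch of that sequence follows; the function name `flush_and_drop` and the `DROP_CACHES_FILE` override are assumptions added so the sketch can be tried without root, and as noted earlier in the thread this workaround was not reported to be fully reliable.

```shell
#!/bin/sh
# Sketch: write back dirty data, then ask the kernel to drop clean caches.

# flush_and_drop   (hypothetical helper name)
# DROP_CACHES_FILE exists only so the target can be redirected for a trial run.
flush_and_drop() {
    target=${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}
    sync || return 1
    # 3 = drop pagecache plus dentries and inodes (1 = pagecache only).
    echo 3 > "$target"
}
```

Run as root with no override, this writes to /proc/sys/vm/drop_caches; the image would be read from the device afterwards.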

On the other hand, I also see problems with inconsistencies even after
unmounting if I hold the /dev/ram0 device open with something else (which
causes the buffer cache not to be invalidated on unmount).

I think what is happening is that the block device is being modified
without going through the buffer cache (i.e. via pagecache or direct
writes), but the buffer cache doesn't get invalidated. So you get stale
data when reading from /dev/ram0.
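
One way to check for this kind of stale data is to take two images while the filesystem is read-only, one through the cache and one with a direct read, and compare them. The sketch below only does the comparison; the helper name `images_match` and the image paths in the comments are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: compare a buffered image with a direct-I/O image of the same device.
#
# The two images would be taken roughly like this (as root). Take the
# buffered copy first, since a direct read may itself invalidate cached pages:
#   cat /dev/ram0 > /tmp/image.buffered
#   dd if=/dev/ram0 of=/tmp/image.direct iflag=direct

# images_match BUFFERED_IMAGE DIRECT_IMAGE   (hypothetical helper name)
# Prints "coherent" if the files are byte-identical, "stale" otherwise.
images_match() {
    if cmp -s "$1" "$2"; then
        echo "coherent"
    else
        echo "stale"
        return 1
    fi
}
```

A "stale" result would be consistent with the cached view of the device lagging behind its backing store, although it does not prove that this is the cause.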

I don't think we're going to want the overhead in the kernel to detect
these kinds of aliases. It might be reasonable to flush the blockdev
on unmount and remount,ro after syncing the filesystem.

The old rd driver's backing store was actually its buffercache, so that
particular issue wouldn't cause aliasing.

Thanks,
Nick

2008-10-15 14:10:11

by Richard Kojedzinszky

Subject: Re: minix/ext2 + rd problem

Dear Nick,

Sorry for my stupid question, but how can I flush a blockdev? If I can
do it without unmounting the fs I will be happy.

Thanks in advance,


Kojedzinszky Richard
TvNetWork Nyrt.
E-mail: krichy (at) tvnetwork [dot] hu
PGP: 0x24E79141
Fingerprint = 6847 ECFF EF58 0C09 18A5 16CF 270F 0C6F 24E7 9141


2008-10-15 14:34:45

by Nick Piggin

Subject: Re: minix/ext2 + rd problem

On Wed, Oct 15, 2008 at 04:10:08PM +0200, Richard Kojedzinszky wrote:
> Dear Nick,
>
> Sorry for my stupid question, but how can I flush a blockdev? If I can
> do it without unmounting the fs I will be happy.
>
> Thanks in advance,

Not a stupid question. Actually I meant that maybe the kernel should do
the flush so people don't get surprised like this.

You can flush and invalidate the blockdev with the --flushbufs argument
to the blockdev command. However, you can't use this with ramdisk devices:
someone thought it would be a good idea to save on precious ioctl space
and implemented totally different semantics on that device with the
same ioctl (it throws away the underlying data as well as the cache).

Your direct io reads essentially do the same thing (and the kernel flushes
the cache in that case to avoid a similar aliasing problem). Actually I
would say direct IO to the block device is the safest option when you are
working on the block device like this (does the direct IO read work for
logical volumes?)
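
A minimal sketch of a guarded flush for non-ramdisk devices. The helper names `run` and `flush_blockdev`, the `DRY_RUN` switch and the `/dev/ram*` name check are assumptions for illustration; the guard reflects the behaviour described above for the kernels under discussion and may not apply to others.

```shell
#!/bin/sh
# Sketch: flush and invalidate a block device's buffers, but never a ramdisk.
# DRY_RUN=1 prints the command instead of running it (root is needed otherwise).

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# flush_blockdev DEVICE   (hypothetical helper name)
flush_blockdev() {
    case $1 in
        /dev/ram*)
            # On ramdisks the same ioctl is described as discarding the data too.
            echo "refusing to flush ramdisk $1" >&2
            return 1
            ;;
    esac
    run blockdev --flushbufs "$1"
}
```

For a ramdisk, the direct read (`dd ... iflag=direct`) remains the option suggested in this thread.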

Thanks,
Nick


2008-10-15 19:22:54

by Matthew Wilcox

Subject: Re: minix/ext2 + rd problem

On Wed, Oct 15, 2008 at 04:34:25PM +0200, Nick Piggin wrote:
> You can flush and invalidate the blockdev with the --flushbufs argument
> to the blockdev command. However, you can't use this with ramdisk devices:
> someone thought it would be a good idea to save on precious ioctl space
> and implemented totally different semantics on that device with the
> same ioctl (it throws away the underlying data as well as the cache).

What happens if we declare that a bug and fix it (and add a new ioctl to
actually throw away the data ... oh, wait, we have one, it's BLKDISCARD)?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-10-16 03:48:27

by Nick Piggin

Subject: Re: minix/ext2 + rd problem

On Wed, Oct 15, 2008 at 01:22:54PM -0600, Matthew Wilcox wrote:
> On Wed, Oct 15, 2008 at 04:34:25PM +0200, Nick Piggin wrote:
> > You can flush and invalidate the blockdev with the --flushbufs argument
> > to the blockdev command. However, you can't use this with ramdisk devices:
> > someone thought it would be a good idea to save on precious ioctl space
> > and implemented totally different semantics on that device with the
> > same ioctl (it throws away the underlying data as well as the cache).
>
> What happens if we declare that a bug and fix it (and add a new ioctl to
> actually throw away the data ... oh, wait, we have one, it's BLKDISCARD)?

Well... that's a good point. We probably could, because the worst someone
will see is their backing store memory does not get freed. It won't munch
someone's data.

I'd love to do this, OTOH we've had the old behaviour, apparently documented
and used by someone at some point, for a long time :(