2012-06-08 09:53:03

by Asdo

Subject: Sync does not flush to disk!?

Hello all
I don't exactly know where to ask this question...

I have a situation of

sda1 + sdb1 --> MD raid1
Above that is an ext4 filesystem. No LVM.

I am making changes to that filesystem (editing a file with vi) and then I am running
sync
sync
(twice)

then I am starting KVM in snapshot mode on the sda and sdb disks, so as
to virtualize the same system on which I am operating.

kvm -m 1024 -hda /dev/sda -hdb /dev/sdb -snapshot

The strange thing is that the virtual machine is NOT seeing the latest
changes to that file!

Then I tried to do:

for i in /dev/md? /dev/sda /dev/sdb ; do blockdev --flushbufs $i ; done

and restarted KVM,
and NOW it is seeing the changes.

In the past I had similar problems, and, not knowing about blockdev
--flushbufs, I ended up unmounting the filesystems and stopping the
RAIDs. That also appeared to actually commit stuff to disk.

So sync is not enough? Could somebody explain this to me?

Thank you


2012-06-08 11:39:35

by Asdo

Subject: Re: Sync does not flush to disk!?

On 06/08/12 11:53, Asdo wrote:
> .....
> Then I tried to do :
>
> for i in /dev/md? /dev/sda /dev/sdb ; do blockdev --flushbufs $i ; done
> ....
> and NOW it is seeing the changes.

After some further tests:
Flushing the buffers of just the MD devices produced an even stranger
intermediate state, in which the file being changed showed garbage
content coming from another old file.
Flushing just /dev/sda and /dev/sdb has worked the few times I tried
it, but I'm not sure it is enough in general.
Flushing everything appears to work reliably.

Still I am puzzled. Isn't "sync" from the shell supposed to be enough to
commit everything to disk, even in the case of a power failure?

Or is there any chance that KVM "sees" a version of sda and sdb which is
actually *older* than the actual content on the platters?

Thank you

2012-06-08 12:33:44

by NeilBrown

Subject: Re: Sync does not flush to disk!?

On Fri, 08 Jun 2012 11:53:14 +0200 Asdo <[email protected]> wrote:

> [...]
> So sync is not enough? Would somebody explain to me better?

There is a cache associated with /dev/sda and /dev/sdb which md does not make
any use of. The filesystem doesn't use it either. It is only used by
user-space reads from /dev/sda or /dev/sdb.
When you "sync" the filesystem, the new data is written out, but that cache is
not changed. When you then read from /dev/sda, you might get cached data,
which is stale.

blockdev --flushbufs
clears that cache so that subsequent reads come from the device, not from the
cache.

i.e. it is read caching that is causing the confusion you see, not write
caching.
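Neil's point can be demonstrated from the shell. This is only a sketch of
the mechanism he describes (run as root, against a disk you can afford to
poke at), not something to try on a production device:

```shell
# Reading the raw device populates the per-device buffer cache
# that md and the filesystem do not use:
dd if=/dev/sda of=/dev/null bs=4096 count=1

# Filesystem writes go through the page cache and, on sync, down
# to the platters -- but they do not update the copy read above.
sync                              # commits dirty filesystem data

# Invalidate the raw device's cache so the next read comes from
# the device itself rather than from the stale cached copy:
blockdev --flushbufs /dev/sda

# Alternatively, an O_DIRECT read bypasses that cache entirely:
dd if=/dev/sda of=/dev/null bs=4096 count=1 iflag=direct
```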

NeilBrown



2012-06-08 13:49:01

by Phil Turmel

Subject: Re: Sync does not flush to disk!?

On 06/08/2012 08:33 AM, NeilBrown wrote:
> On Fri, 08 Jun 2012 11:53:14 +0200 Asdo <[email protected]> wrote:
>
>> [...]
>> In the past I had similar problems, and not knowing about blockdev
>> --flushbufs I ended up dismounting the filesystems and stopping the
>> RAIDs. That also appeared to actually commit stuff to disk.

*Exactly*

>> So sync is not enough? Would somebody explain to me better?
>
> There is a cache associated with /dev/sda and /dev/sdb which md does not make
> any use of. The filesystem doesn't use it either. It is only used from
> user-space reads from /dev/sda or /dev/sdb.
> When you "sync" the filesystem, the new data is written out, but that cache
> is not changed. When you then read from /dev/sda, you might get cached data,
> which is stale.
>
> blockdev --flushbufs
> clears that cache so that subsequent reads come from the device, not from the
> cache.
>
> i.e. it is read caching that is causing the confusion you see, not write
> caching.

To put it another way: You can't safely access ext filesystems via
raw devices from two systems at once. The kernel caches won't be
synchronized, and you almost certainly *will* corrupt the contents.

You can unmount the FS and then pass the raid to the VM, or dismantle
the raid as well and let the VM assemble it.
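A sketch of those two hand-off options (the /dev/md0 name and the
/mnt/data mount point are made up for illustration; adjust to your setup):

```shell
# Option 1: unmount the filesystem, then hand the assembled array
# (not the raw member disks) to the guest:
umount /mnt/data
kvm -m 1024 -hda /dev/md0 -snapshot

# Option 2: dismantle the array on the host as well, and let the
# guest assemble it itself from the member disks:
umount /mnt/data
mdadm --stop /dev/md0
kvm -m 1024 -hda /dev/sda -hdb /dev/sdb -snapshot
```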

There are cluster filesystems that allow multiple mounts of shared
devices, though. I haven't played with them, so you might want to
do some googling.

Phil

2012-06-08 13:57:04

by Asdo

Subject: Re: Sync does not flush to disk!?

On 06/08/12 15:49, Phil Turmel wrote:
>
> To put it another way: You can't safely access ext filesystems via
> raw devices in two systems. The kernel cache won't be synchronized,
> and you almost certainly *will* corrupt the contents.

Thanks both of you for your explanations

I might say that it seems like a bad design to me: I have never before
seen a cache that is not updated by writes.
Here the cache content is *older* than the data on the real devices!?
If it were *newer*, that would be a known case (a writeback cache not
yet flushed), but *older*... never seen.

Thanks

2012-06-08 14:11:42

by Jan Kara

Subject: Re: Sync does not flush to disk!?

On Fri 08-06-12 15:57:04, Asdo wrote:
> On 06/08/12 15:49, Phil Turmel wrote:
> >
> >To put it another way: You can't safely access ext filesystems via
> >raw devices in two systems. The kernel cache won't be synchronized,
> >and you almost certainly *will* corrupt the contents.
>
> Thanks both of you for your explanations
>
> I might say that it seems to me a bad design: never before I saw a
> cache that is not updated by writes.
> Here the cache content is *older* than the data on the real devices!?
> if it was *newer*, there are known cases (writeback cache not
> flushed yet), but *older*... never seen.
Well, the problem is the inconsistency of the caches. There is one cache -
the page cache - used by filesystems to read & write file data, which is
addressed by (inode, offset). And there is another cache caching the whole
device, addressed by (device, offset). It would be too costly to keep these
two caches coherent, and most people don't care, so we don't.

BTW, if you configured KVM to use direct I/O or virtio when accessing the
devices (a good idea anyway), you wouldn't see this problem either.
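For instance, a sketch using the -drive syntax (the exact option spelling
depends on the QEMU/KVM version in use):

```shell
# cache=none opens the host devices with O_DIRECT, bypassing the
# host's page and buffer caches; if=virtio uses paravirtualized
# disk I/O; snapshot=on keeps guest writes out of the real disks.
kvm -m 1024 \
    -drive file=/dev/sda,if=virtio,cache=none,snapshot=on \
    -drive file=/dev/sdb,if=virtio,cache=none,snapshot=on
```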

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-06-08 14:16:13

by Jan Kara

Subject: Re: Sync does not flush to disk!?

On Fri 08-06-12 16:11:39, Jan Kara wrote:
> On Fri 08-06-12 15:57:04, Asdo wrote:
> > [...]
> Well, the problem is in inconsistency of caches. There is one cache -
> page cache - used by filesystems to read & write file data which is
> addressed by inode, offset. And there is another cache caching the whole
> device addressed by device, offset. It would be too costly to keep both
> these caches consistent and most people don't care so we don't.
>
> BTW, if you configured KVM to use direct IO or virt IO when accessing the
> devices (a good idea anyway), you wouldn't have the problems either.
Hmm, I didn't notice you actually keep the fs mounted on the host when
starting the guest. That is really asking for trouble - the host's data that
is cached in memory (and I'm not speaking just about file data but, more
importantly, also allocation information etc.) will not be updated when the
guest changes the filesystem, so the filesystem will almost certainly get
corrupted.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-06-08 18:28:26

by Theodore Ts'o

Subject: Re: Sync does not flush to disk!?

On Fri, Jun 08, 2012 at 03:57:04PM +0200, Asdo wrote:
>
> I might say that it seems to me a bad design: never before I saw a
> cache that is not updated by writes.
> Here the cache content is *older* than the data on the real devices!?
> if it was *newer*, there are known cases (writeback cache not
> flushed yet), but *older*... never seen.

It's not just a matter of keeping the caches in sync --- it's also a
simple matter of locking. If a file system is mounted on two systems
at the same time, there's no way (without using a cluster lock
manager, which is what a cluster file system like ocfs2 uses) to
prevent both systems from trying to modify a particular part of the
file system (an inode or a directory, for example) at the same time.

As a result, there's no way for a local disk file system to know when
a block has been modified out from under it, so that it can update its
inode cache (where the in-memory inode data structure looks quite
different from the on-disk inode table).

There is overhead in using a cluster file system, since it has to do
all of these extra checks to see if the block device has gotten
magically modified out from under it. So that's why most people won't
use a cluster file system if it is only going to be mounted on one
system at a time.

But if you are going to have a file system mounted in both the guest
and the host at the same time, you *have* to use a cluster file
system. Alternatively, you could have the guest access the file
system, as mounted on the host OS, via NFS.
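A minimal sketch of that NFS alternative (the /mnt/data export path and
the 192.168.122.x addresses are made up for illustration):

```shell
# On the host: export the mounted filesystem over NFS.
# /etc/exports entry (hypothetical path and guest subnet):
#   /mnt/data  192.168.122.0/24(rw,sync,no_subtree_check)
exportfs -ra

# In the guest: mount it over the network instead of touching the
# block devices directly (192.168.122.1 being the host, hypothetically):
mount -t nfs 192.168.122.1:/mnt/data /mnt/data
```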

Regards,

- Ted