2009-03-01 14:43:19

by Francis Moreau

Subject: Re: Question regarding concurrent accesses through block device and fs


[ Sorry for taking so long to answer, but I was away, I'm slow, and there is
a lot of complex code to dig through! ]

Nick Piggin writes:

> On Saturday 21 February 2009 01:10:24 Francis Moreau wrote:

[...]

>> - looking at unmap_underlying_metadata(), there's no code to deal with
>> metadata buffers. It gets the buffer and unmaps it whatever the type of
>> data it contains.
>
> That's why I say it only really works for buffer cache used by the same
> filesystem that is now known to be unused.
>

Hmm, I still don't understand what you mean by this, sorry for being slow.


[...]

>> What am I missing ?
>
> That we might complete the write of the new buffer before the
> old buffer is finished writing out?

Ah yes, actually I realize that I don't know where and when the inode
blocks are actually written to the disk!

It seems that write_inode(), called after the data is committed to the
disk, only marks the inode buffers as dirty and performs no I/O (at
least it looks that way for ext2 when its 'do_sync' parameter is 0, which
is the case when this method is called by write_inode()).

Could you enlighten me one more time?

Thanks
--
Francis


2009-03-01 15:33:36

by Nick Piggin

Subject: Re: Question regarding concurrent accesses through block device and fs

On Monday 02 March 2009 01:42:55 Francis Moreau wrote:
> [ Sorry for taking so long to answer, but I was away, I'm slow, and there is
> a lot of complex code to dig through! ]
>
> Nick Piggin writes:
> > On Saturday 21 February 2009 01:10:24 Francis Moreau wrote:
>
> [...]
>
> >> - looking at unmap_underlying_metadata(), there's no code to deal with
> >> metadata buffers. It gets the buffer and unmaps it whatever the type
> >> of data it contains.
> >
> > That's why I say it only really works for buffer cache used by the same
> > filesystem that is now known to be unused.
>
> Hmm, I still don't understand what you mean by this, sorry for being slow.

OK, the "buffercache", the cache of block device contents, is normally
thought of as metadata when it is being used by the filesystem (eg.
usually via bread() etc), or data when it is being read/written from
userspace via /dev/<blockdevice>.

In the former case, the buffer.c/filesystem code together know when a
metadata buffer is unused (because the filesystem has deallocated it),
so unmap_underlying_metadata will work there.

And it is insane to have a mounted filesystem and have userspace working
on the same block device, so unmap_underlying_metadata doesn't have to
care about that case. (IIRC some filesystem tools can do this, but there
are obviously a lot of tricks to it)


> >> What am I missing ?
> >
> > That we might complete the write of the new buffer before the
> > old buffer is finished writing out?
>
> Ah yes, actually I realize that I don't know where and when the inode
> blocks are actually written to the disk!
>
> It seems that write_inode(), called after the data is committed to the
> disk, only marks the inode buffers as dirty and performs no I/O (at
> least it looks that way for ext2 when its 'do_sync' parameter is 0, which
> is the case when this method is called by write_inode()).
>
> Could you enlighten me one more time?

Depends on the filesystem. Many do just use the buffercache as a
writeback cache for their metadata, and are happy to just let the
dirty page flushers write it out when it suits them (or when there
are explicit sync instructions given).

Most of the time, these filesystems don't really know or care when
exactly their metadata is under writeback.

2009-03-01 21:07:51

by Francis Moreau

Subject: Re: Question regarding concurrent accesses through block device and fs

Nick Piggin writes:

> On Monday 02 March 2009 01:42:55 Francis Moreau wrote:

[...]

> OK, the "buffercache", the cache of block device contents, is normally
> thought of as metadata when it is being used by the filesystem (eg.
> usually via bread() etc), or data when it is being read/written from
> userspace via /dev/<blockdevice>.
>
> In the former case, the buffer.c/filesystem code together know when a
> metadata buffer is unused (because the filesystem has deallocated it),
> so unmap_underlying_metadata will work there.
>
> And it is insane to have a mounted filesystem and have userspace working
> on the same block device, so unmap_underlying_metadata doesn't have to
> care about that case. (IIRC some filesystem tools can do this, but there
> are obviously a lot of tricks to it)

Thanks for clarifying this.

[...]

> Depends on the filesystem. Many do just use the buffercache as a
> writeback cache for their metadata, and are happy to just let the
> dirty page flushers write it out when it suits them

I guess you're talking about the pdflush threads here.

This is the case where I can't find when the metadata is actually
written back to the disk by the flushers. I looked at
writeback_inodes() but I failed to find it.

Could you point out the place in the code where this happens?

> (or when there are explicit sync instructions given).

yes I see where this happens in these cases.

> Most of the time, these filesystems don't really know or care when
> exactly their metadata is under writeback.

This sounds very weird to me, but I need to learn how things work
before making any serious comments.

thanks
--
Francis

2009-03-02 07:12:30

by Nick Piggin

Subject: Re: Question regarding concurrent accesses through block device and fs

On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
> Nick Piggin writes:

> > Depends on the filesystem. Many do just use the buffercache as a
> > writeback cache for their metadata, and are happy to just let the
> > dirty page flushers write it out when it suits them
>
> I guess you're talking about the pdflush threads here.

Yeah.


> This is the case where I can't find when the metadata is actually
> written back to the disk by the flushers. I looked at
> writeback_inodes() but I failed to find it.
>
> Could you point out the place in the code where this happens?


I guess it picks them up via their block device inodes.


> > (or when there are explicit sync instructions given).
>
> yes I see where this happens in these cases.
>
> > Most of the time, these filesystems don't really know or care when
> > exactly their metadata is under writeback.
>
> This sounds very weird to me, but I need to learn how things work
> before making any serious comments.

Why would they? They just operate on their metadata, and the buffer
cache is basically a transparent writeback cache to them. In the
same way, an application doesn't really know or care when exactly
its data is under writeback. unmap_underlying_metadata is the
important exception because Linux pagecache otherwise doesn't have
a good way to keep pagecache of different mappings coherent. So if
a block switches from buffercache to file mapping, it needs to be
made coherent.

When switching back the other way, the truncate code actually makes
sure of this, that there won't be blocks under writeout after
being deallocated.

Things do get more complicated with journalling file systems.
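
To make the aliasing problem concrete, here is a minimal toy sketch in
Python. It is not kernel code: the two dicts, the block number 5 and the
run() helper are all made up for illustration, and the "unmap" step is only
a stand-in for the effect unmap_underlying_metadata is meant to have. It
assumes the file data happens to be written out before the stale buffer.

```python
def run(unmap_stale_alias):
    """Toy model of one disk block cached through two different mappings."""
    disk = {5: "old metadata"}
    # Block 5 is cached via the block device mapping and is still dirty.
    blockdev_cache = {5: "dirty old metadata"}
    # The same disk block, later cached via a file mapping.
    file_cache = {}

    # The filesystem frees block 5 and hands it to a file as a data block.
    if unmap_stale_alias:
        # Hypothetical stand-in for the unmap step: forget the old alias
        # so that it can never be written out afterwards.
        blockdev_cache.pop(5, None)
    file_cache[5] = "new file data"

    # Writeout of the two mappings is not ordered against each other.
    disk[5] = file_cache[5]              # the new file data goes out first
    if 5 in blockdev_cache:
        disk[5] = blockdev_cache[5]      # the stale buffer lands on top
    return disk[5]


without_unmap = run(unmap_stale_alias=False)
with_unmap = run(unmap_stale_alias=True)
print(without_unmap)
print(with_unmap)
```

Without the unmap step the stale buffer overwrites the new file data on
disk; with it, only the file data reaches the block.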

2009-03-02 13:30:31

by Francis Moreau

Subject: Re: Question regarding concurrent accesses through block device and fs

Nick Piggin writes:

> On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
>> This is the case where I can't find when the metadata is actually
>> written back to the disk by the flushers. I looked at
>> writeback_inodes() but I failed to find it.
>>
>> Could you point out the place in the code where this happens?
>
> I guess it picks them up via their block device inodes.

Probably, but I can't find the actual place.

I looked at the place where pages are normally written back to disk (i.e.
in background_writeout()), but I can only see the writeback of data, not
metadata...

>> This sounds very weird to me, but I need to learn how things work
>> before making any serious comments.
>
> Why would they? They just operate on their metadata, and the buffer
> cache is basically a transparent writeback cache to them.

Well, the fact that metadata is written back to disk at an unknown point
in time means that we don't know in which order metadata and data
are written. So it means that data can be written before or after
metadata, or they can be mixed up.

And this sounds just weird to me. But as I said, I'm just a noob, so I
need to think and study more in this area, and I really have to see where
the actual writes of metadata happen in the code.

> In the same way, an application doesn't really know or care when
> exactly its data is under writeback.

Except that when dealing with the metadata of the fs, we can corrupt the
whole thing, I think.

> unmap_underlying_metadata is the important exception because Linux
> pagecache otherwise doesn't have a good way to keep pagecache of
> different mappings coherent. So if a block switches from buffercache
> to file mapping, it needs to be made coherent.
>
> When switching back the other way, the truncate code actually makes
> sure of this, that there won't be blocks under writeout after
> being deallocated.
>
> Things do get more complicated with journalling file systems.
>

I think I'll just forget about them; things are already complicated
enough without making them more obscure ;)

thanks
--
Francis

2009-03-03 03:53:35

by Nick Piggin

Subject: Re: Question regarding concurrent accesses through block device and fs

On Tuesday 03 March 2009 00:30:18 Francis Moreau wrote:
> Nick Piggin writes:
> > On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
> >> This is the case where I can't find when the metadata is actually
> >> written back to the disk by the flushers. I looked at
> >> writeback_inodes() but I failed to find it.
> >>
> >> Could you point out the place in the code where this happens?
> >
> > I guess it picks them up via their block device inodes.
>
> Probably, but I can't find the actual place.

It was an educated guess ;) I'm quite sure it does.


> I looked at the place where pages are normally written back to disk (i.e.
> in background_writeout()), but I can only see the writeback of data, not
> metadata...

What are you expecting writeback of metadata to look like? To the
core kernel it looks the same as writeback of data.


> >> This sounds very weird to me, but I need to learn how things work
> >> before making any serious comments.
> >
> > Why would they? They just operate on their metadata, and the buffer
> > cache is basically a transparent writeback cache to them.
>
> Well, the fact that metadata is written back to disk at an unknown point
> in time means that we don't know in which order metadata and data
> are written. So it means that data can be written before or after
> metadata, or they can be mixed up.

But the cache layer on top of that ensures it *appears* not to be mixed
up. A problem arises when the system crashes in the middle of this, and
we lose that information and see a mixed up filesystem. Hence journalling
filesystems.
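
As a rough illustration of that point, here is a toy Python sketch. It is
not how the kernel is structured: the two named "blocks", the update() and
writeout_one() helpers and the strings stored in them are all hypothetical.
It only shows that the view through the cache stays consistent while the
disk can be caught halfway if the machine dies between two writeouts.

```python
# Toy model: two metadata blocks that must agree with each other.
disk = {"bitmap": "block 7 free", "inode": "file has no blocks"}
cache = dict(disk)   # in-memory copies, all clean at first
dirty = []           # names of blocks waiting for writeout


def update(name, value):
    """Modify a block in the cache only and remember that it is dirty."""
    cache[name] = value
    if name not in dirty:
        dirty.append(name)


def writeout_one():
    """Write a single dirty block to disk, as a background flusher might."""
    name = dirty.pop(0)
    disk[name] = cache[name]


# Allocate block 7 to the file: both blocks change in memory.
update("inode", "file uses block 7")
update("bitmap", "block 7 in use")

# What running code sees through the cache: both updates, consistent.
view_through_cache = dict(cache)

writeout_one()                  # only the inode block reaches the disk...
view_after_crash = dict(disk)   # ...and then the power goes out

print(view_through_cache)
print(view_after_crash)
```

After the simulated crash the disk says the file uses block 7 while the
bitmap still marks it free, which is the kind of mixed-up state a journal
is meant to prevent.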

2009-03-12 08:06:01

by Francis Moreau

Subject: Re: Question regarding concurrent accesses through block device and fs

Hello Nick,

Sorry for the long delay before my answer, but I haven't had enough
time to dig into the kernel source.

On Tue, Mar 3, 2009 at 4:52 AM, Nick Piggin wrote:
> On Tuesday 03 March 2009 00:30:18 Francis Moreau wrote:
>> Nick Piggin writes:
>> > On Monday 02 March 2009 08:07:30 Francis Moreau wrote:
>> >> This is the case where I can't find when the metadata is actually
>> >> written back to the disk by the flushers. I looked at
>> >> writeback_inodes() but I failed to find it.
>> >>
>> >> Could you point out the place in the code where this happens?
>> >
>> > I guess it picks them up via their block device inodes.
>>
>> Probably, but I can't find the actual place.
>
> It was an educated guess ;) I'm quite sure it does.
>

OK, I think I get the idea now. I thought the block device's main purpose
was to handle block nodes such as /dev/sdX, but it isn't.

>
>> I looked at the place where pages are normally written back to disk (i.e.
>> in background_writeout()), but I can only see the writeback of data, not
>> metadata...
>
> What are you expecting writeback of metadata to look like? To the
> core kernel it looks the same as writeback of data.
>

I don't know. I just assumed that, since metadata is special because it
holds critical file system information, the kernel would treat it specially.

> But the cache layer on top of that ensures it *appears* not to be mixed
> up. A problem arises when the system crashes in the middle of this, and
> we lose that information and see a mixed up filesystem. Hence journalling
> filesystems.

OK, I guess I've won another tour of the kernel code ;) to understand how
the cache layer does that.

thanks a lot.
--
Francis

2009-03-12 08:23:15

by Nick Piggin

Subject: Re: Question regarding concurrent accesses through block device and fs

On Thursday 12 March 2009 19:05:39 Francis Moreau wrote:

> > It was an educated guess ;) I'm quite sure it does.
>
> OK, I think I get the idea now. I thought the block device's main purpose
> was to handle block nodes such as /dev/sdX, but it isn't.

Well, /dev/sdX access is important, at least to create and fsck the
filesystem ;) But for most Linux users, I think the majority of buffercache
accesses will be filesystem metadata accesses.


> >> I looked at the place where pages are normally written back to disk (i.e.
> >> in background_writeout()), but I can only see the writeback of data, not
> >> metadata...
> >
> > What are you expecting writeback of metadata to look like? To the
> > core kernel it looks the same as writeback of data.
>
> I don't know. I just assumed that, since metadata is special because it
> holds critical file system information, the kernel would treat it
> specially.

It is, but you have to look in the filesystems themselves to see that.
There are some exceptions to that -- eg. sync_mapping_buffers in
buffer.c where it writes out dirty metadata buffers that the filesystem
has attached to a file. But that's fsync driven rather than background
writeout.


> > But the cache layer on top of that ensures it *appears* not to be mixed
> > up. A problem arises when the system crashes in the middle of this, and
> > we lose that information and see a mixed up filesystem. Hence journalling
> > filesystems.
>
> OK, I guess I've won another tour of the kernel code ;) to understand how
> the cache layer does that.

Ignore details like crashes, direct IO and coherency between data mappings
and buffercache where things get a bit hairy, and it's just a writeback
cache. The last thing you write to some location will be what you get back
if you read from that location -- regardless of whether it is dirty or clean
or not present when you ask for it (and has to be read from disk).
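
That read-back guarantee can be sketched with a toy write-back cache in
Python. This is only an illustration under simple assumptions: the
WritebackCache class, its method names and the dict standing in for the
disk are invented here, and there is no concurrency, crash handling or
direct IO in the model.

```python
class WritebackCache:
    """Toy write-back cache over a dict standing in for the disk."""

    def __init__(self, disk):
        self.disk = disk      # block number -> bytes, the "on-disk" state
        self.cache = {}       # block number -> bytes, cached copies
        self.dirty = set()    # blocks whose cached copy is newer than disk

    def write(self, block, data):
        # Writes only touch the cache; the disk is updated later.
        self.cache[block] = data
        self.dirty.add(block)

    def read(self, block):
        # Not present: fetch from disk first, then serve from the cache.
        if block not in self.cache:
            self.cache[block] = self.disk[block]
        return self.cache[block]

    def flush(self):
        # Background writeout: push every dirty block to the disk.
        for block in list(self.dirty):
            self.disk[block] = self.cache[block]
        self.dirty.clear()

    def evict(self, block):
        # Only clean blocks may be dropped without losing data.
        if block not in self.dirty:
            self.cache.pop(block, None)


disk = {0: b"old"}
c = WritebackCache(disk)
not_present = c.read(0)   # not cached yet, so it is read from disk
c.write(0, b"new")
while_dirty = c.read(0)   # dirty: the disk still holds b"old"
c.flush()
while_clean = c.read(0)   # clean and still cached
c.evict(0)
after_evict = c.read(0)   # dropped from the cache, read from disk again
```

In every state after the write (dirty, clean, or evicted and re-read) the
reader gets back the last value written, which is all the cache promises.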

2009-03-12 09:00:54

by Francis Moreau

Subject: Re: Question regarding concurrent accesses through block device and fs

On Thu, Mar 12, 2009 at 9:22 AM, Nick Piggin wrote:
>
> Ignore details like crashes, direct IO and coherency between data mappings
> and buffercache where things get a bit hairy, and it's just a writeback
> cache. The last thing you write to some location will be what you get back
> if you read from that location -- regardless of whether it is dirty or clean
> or not present when you ask for it (and has to be read from disk).
>

Well yes, but I was wondering about the special case where the kernel
crashes or the power supply goes down: how does the kernel minimize the risk
of file system inconsistency? Hence my questions about metadata handling.

Thanks
--
Francis

2009-03-12 09:13:01

by Nick Piggin

Subject: Re: Question regarding concurrent accesses through block device and fs

On Thursday 12 March 2009 20:00:38 Francis Moreau wrote:
> On Thu, Mar 12, 2009 at 9:22 AM, Nick Piggin wrote:
> > Ignore details like crashes, direct IO and coherency between data
> > mappings and buffercache where things get a bit hairy, and it's just a
> > writeback cache. The last thing you write to some location will be what
> > you get back if you read from that location -- regardless of whether it
> > is dirty or clean or not present when you ask for it (and has to be read
> > from disk).
>
> Well yes, but I was wondering about the special case where the kernel
> crashes or the power supply goes down: how does the kernel minimize the risk
> of file system inconsistency? Hence my questions about metadata handling.

Well, there are journalling filesystems; other filesystems can do
synchronous metadata updates (write-through), which can help too. There is
really nothing that the generic pagecache/buffercache code does to try to
handle this because it is far too filesystem-specific.