2006-12-29 23:32:32

by Theodore Ts'o

Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> I think ext3 is terminally crap by now. It still uses buffer heads in
> places where it really really shouldn't, and as a result, things like
> directory accesses are simply slower than they should be. Sadly, I don't
> think ext4 is going to fix any of this, either.

Not just ext3; ocfs2 is using the jbd layer as well. I think we're
going to have to put this (a rework of jbd2 to use the page cache) on
the ext4 todo list, and work with the ocfs2 folks to try to come up
with something that suits their needs as well. Fortunately we have
this filesystem/storage summit thing coming up in the next few months,
and we can try to get some discussion going on the linux-ext4 mailing
list in the meantime. Unfortunately, I don't think this is going to
be trivial.

If we do get this fixed for ext4, one interesting question is whether
people would accept a patch to backport the fixes to ext3, given the
grief this is causing the page I/O and VM routines. OTOH, reiser3
probably has the same problems, and I suspect the changes needed to
make ext3 avoid buffer heads, especially in order to support
filesystem blocksizes < pagesize, are going to be sufficiently risky
in terms of introducing regressions to ext3 that they would probably
be rejected on those grounds. So unfortunately, we probably are going
to have to support flushes via buffer heads for the foreseeable
future.

- Ted


2006-12-29 23:59:51

by Linus Torvalds

Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)



On Fri, 29 Dec 2006, Theodore Tso wrote:
>
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_
later), but the real problem with it isn't actually the VM behaviour, but
simply the fact that cached performance absolutely _sucks_ with the buffer
cache.

With the physically indexed buffer cache thing, you end up always having
to do these complicated translations into block numbers for every single
access, and at some point when I benchmarked it, it was a huge overhead
for doing simple things like readdir.
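
To make the difference concrete, here is a rough sketch of the two
lookup paths - paraphrased, not the literal code from either tree:

	/* Context: inside readdir.  "dir" is the directory inode, "n" a
	 * page index into the directory, "blk" a directory-relative
	 * block number. */

	/* ext2-style, virtually indexed: the directory has its own
	 * address_space keyed by file offset, so the cached case is a
	 * single radix-tree probe. */
	struct page *page = read_mapping_page(dir->i_mapping, n, NULL);

	/* ext3-style, physically indexed: ext3_bread() has to go through
	 * ext3_getblk()/ext3_get_blocks_handle() to translate "blk" into
	 * a disk block number before the buffer cache can even be asked. */
	struct buffer_head *bh = ext3_bread(NULL, dir, blk, 0, &err);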

It's also a major pain for read-ahead, partly due to exactly that high
cost of translation - because you can't cheaply check whether the next
block is there, the cost of even asking the question "should I try to
read ahead?" is much, much higher. As a result, read-ahead is seriously
limited, because it's so expensive for the cached case (which is still
hopefully the _common_ case).
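
In concrete terms, the cheap probe you'd _like_ to be able to make looks
something like this (find_get_page() and page_cache_release() are the
real helpers, the rest is just a sketch):

	/* Virtually indexed: "is the next page already here?" is one
	 * cache probe - no I/O, no mapping walk. */
	struct page *page = find_get_page(dir->i_mapping, n + 1);
	int cached = (page != NULL);

	if (page)
		page_cache_release(page);

	/* Physically indexed: before you can probe the cache at all you
	 * have to run the filesystem's block mapping for blk + 1, which
	 * may itself have to read indirect blocks - so merely asking the
	 * question can cost you I/O. */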

So because read-ahead is limited, the non-cached case then _really_ sucks.

It was somewhat fixed in a really god-awful fashion by having
ext3_readdir() actually do _readahead_ through the page cache, even though
it does everything else through the buffer cache. And that just happens to
work because we hopefully have physically contiguous blocks, but when that
isn't true, the readahead doesn't do squat.
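
For reference, the hack looks roughly like this - paraphrased from
memory of fs/ext3/dir.c, so the argument lists are approximate:

	struct buffer_head map_bh = { .b_state = 0 };

	/* Translate the directory-relative block into a physical one... */
	err = ext3_get_blocks_handle(NULL, inode, blk, 1, &map_bh, 0, 0);
	if (err > 0)
		/* ...then read ahead through the *block device's* page
		 * cache at that physical offset.  This only helps while
		 * the directory blocks are physically contiguous. */
		page_cache_readahead(sb->s_bdev->bd_inode->i_mapping,
				&filp->f_ra, filp,
				map_bh.b_blocknr >>
					(PAGE_CACHE_SHIFT - inode->i_blkbits),
				1);
	bh = ext3_bread(handle, inode, blk, 0, &err);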

It's really quite fundamentally broken. But none of that causes any
problems for the VM, since directories cannot be mmap'ed anyway. But it's
really pitiful, and it really doesn't work very well. Of course, other
filesystems _also_ suck at this, and other operating systems have even
MORE problems, so people don't always seem to realize how horribly
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache
directories well. Open some horrible file manager on /usr/bin with cold
caches, and weep. The biggest problem is the inode indirection, but at
some point when I looked at why it sucked, it was doing basically
synchronous single-buffer reads on the directory too, because readahead
didn't work properly.

I was hoping that something like SpadFS would actually take off, because
it seemed to make a lot of good design choices (having inodes in-line in
the directory when there are no hardlinks is probably a requirement for a
good filesystem these days. The separate inode table had its uses, but
indirection in a filesystem really does suck, and stat information is too
important to be indirect unless it absolutely has to be).
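
To illustrate the point - and these layouts are made up for the example,
not SpadFS's actual on-disk format:

	/* Classic ext2/ext3-style entry: the name maps to an inode
	 * *number*, and every stat() has to chase that number into a
	 * physically separate inode table block. */
	struct classic_dirent {
		__u32	inode;
		__u16	rec_len;
		__u8	name_len;
		__u8	file_type;
		char	name[];
	};

	/* Inline-inode entry for the common no-hardlink case: the stat
	 * information lives right in the directory block, so a cold-cache
	 * "ls -l" or "find" touches far fewer distinct disk locations. */
	struct inline_dirent {
		__u16	rec_len;
		__u8	name_len;
		__u8	flags;		/* falls back to an external inode
					 * for hardlinks */
		__u32	mode;
		__u32	uid, gid;
		__u64	size;
		__u64	mtime, ctime;
		__u64	data_start;	/* first extent of the file data */
		char	name[];
	};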

But I suspect it needs more than somebody who just wants to get his thesis
written ;)

Linus

2006-12-30 00:06:07

by Andrew Morton

Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

On Fri, 29 Dec 2006 18:32:07 -0500
Theodore Tso <[email protected]> wrote:

> On Fri, Dec 29, 2006 at 02:42:51PM -0800, Linus Torvalds wrote:
> > I think ext3 is terminally crap by now. It still uses buffer heads in
> > places where it really really shouldn't, and as a result, things like
> > directory accesses are simply slower than they should be. Sadly, I don't
> > think ext4 is going to fix any of this, either.
>
> Not just ext3; ocfs2 is using the jbd layer as well. I think we're
> going to have to put this (a rework of jbd2 to use the page cache) on
> the ext4 todo list, and work with the ocfs2 folks to try to come up
> with something that suits their needs as well. Fortunately we have
> this filesystem/storage summit thing coming up in the next few months,
> and we can try to get some discussion going on the linux-ext4 mailing
> list in the meantime. Unfortunately, I don't think this is going to
> be trivial.

I suspect it would be insane to move any part of JBD (apart from the
ordered-data flush) to use pagecache. The whole thing is fundamentally
block-based. But only for metadata - there's no strong reason why ext3/4
needs to manipulate file data via buffer_heads if data=journal and chattr
+j aren't in use.

We could possibly move ext3/4 directories out of the blockdev pagecache and
into per-directory pagecache, but that wouldn't change anything - the
journalling would still be block-based.

Adam Richter spent considerable time a few years ago trying to make the
mpage code go direct-to-BIO in all cases and we eventually gave up. The
conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
and ugly to fully optimise away the "block" bit in the middle.
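
The shape of that layering, as a filesystem sees it, is essentially
ext2's readpage (quoted from memory):

	/* The page cache hands us a page, mpage asks ext2_get_block()
	 * for the physical block numbers (the "block" bit in the
	 * middle), and only then assembles and submits a BIO. */
	static int ext2_readpage(struct file *file, struct page *page)
	{
		return mpage_readpage(page, ext2_get_block);
	}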

buffer_heads become more important with large PAGE_CACHE_SIZE. I'd expect
nobh mode to be quite inefficient with some workloads on 64k pages. We
need that representation of the state (and location) of the block-sized
hunks which make up the page.
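
That per-block state is what struct buffer_head carries - abridged
field subset, from memory.  With 64k pages and 4k blocks every page
ends up with a ring of sixteen of these, linked via b_this_page:

	struct buffer_head {
		unsigned long		 b_state;	/* uptodate, dirty, mapped... */
		struct buffer_head	*b_this_page;	/* circular list of the page's buffers */
		struct page		*b_page;	/* the page this block backs */
		sector_t		 b_blocknr;	/* where the block lives on disk */
		size_t			 b_size;	/* block size */
		char			*b_data;	/* pointer into the page */
		struct block_device	*b_bdev;
		atomic_t		 b_count;	/* users of this buffer */
		/* ... */
	};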

> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> grief this is causing the page I/O and VM routines. OTOH, reiser3
> probably has the same problems, and I suspect the changes needed to
> make ext3 avoid buffer heads, especially in order to support
> filesystem blocksizes < pagesize, are going to be sufficiently risky
> in terms of introducing regressions to ext3 that they would probably
> be rejected on those grounds. So unfortunately, we probably are going
> to have to support flushes via buffer heads for the foreseeable
> future.

We'll see.

2006-12-30 00:51:40

by Linus Torvalds

Subject: Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)



On Fri, 29 Dec 2006, Andrew Morton wrote:
>
> Adam Richter spent considerable time a few years ago trying to make the
> mpage code go direct-to-BIO in all cases and we eventually gave up. The
> conceptual layering of page<->blocks<->bio is pretty clean, and it is hard
> and ugly to fully optimise away the "block" bit in the middle.

Using the buffer cache as a translation layer to the physical address is
fine. That's what _any_ block device will do.

I'm not at all saying that "buffer heads must go away". They work fine.

What I'm saying is that

- if you index by buffer heads, you're screwed.
- if you do IO by starting at buffer heads, you're screwed.

Both indexing and writeback decisions should be done at the page cache
layer. Then, when you actually need to do IO, you look at the buffers. But
you start from the "page". YOU SHOULD NEVER LOOK UP a buffer on its own
merits, and YOU SHOULD NEVER DO IO on a buffer head on its own cognizance.

So by all means keep the buffer heads as a way to keep the
"virtual->physical" translation. It's what they were designed for. But
they were _originally_ also designed for "lookup" and "driving the start
of IO", and that is wrong, and has been wrong for a long time now, because

- lookup based on physical address is fundamentally slow and inefficient.
You have to look up the virtual->physical translation somewhere else,
so it's by design an unnecessary indirection _and_ that "somewhere
else" is also by definition filesystem-specific, so you can't do any
of these things at the VFS layer.

Ergo: anything that needs to look up the physical address in order to
find the buffer head is BROKEN in this day and age. We look up the
_virtual_ page cache page, and then we can trivially find the buffer
heads within that page thanks to page->buffers.

Example: ext2 vs ext3 readdir. One of them sucks, the other doesn't.

- starting IO based on the physical entity is insane. It's insane exactly
_because_ the VM doesn't actually think in physical addresses, or in
buffer-sized blocks. The VM only really knows about whole pages, and
all the VM decisions fundamentally have to be page-based. We don't ever
"free a buffer". We free a whole page, and as such, doing writeback
based on buffers is pointless, because it doesn't actually say anything
about the "page state" which is what the VM tracks.

But neither of these means that "buffer_head" itself has to go away. They
both really boil down to the same thing: you should never KEY things by
the buffer head. All actions should be based on virtual indexes as far as
at all humanly possible.

Once you do lookup and locking and writeback _starting_ from the page,
it's then easy to look up the actual buffer head within the page, and use
that as a way to do the actual _IO_ on the physical address. So the buffer
heads still exist in ext2, for example, but they don't drive the show
quite as much.

(They still do in some areas: the allocation bitmaps, the xattr code etc.
But as long as none of those have big VM footprints, and as long as no
_common_ operations really care deeply, and as long as those data
structures never need to be touched by the VM or VFS layer, nobody will
ever really care).
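
In sketch form, the page-first pattern is just this - assuming a locked
page that already has buffers, and leaving out the error handling and
async-write bookkeeping that the real block_write_full_page() has to do:

	static void write_page_via_its_buffers(struct page *page)
	{
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;

		/* The *decision* to write was made on the page, by
		 * virtual index (dirty page, LRU, writeback).  The
		 * buffers are only consulted afterwards, to find the
		 * physical blocks to send down. */
		do {
			if (buffer_mapped(bh) && buffer_dirty(bh)) {
				clear_buffer_dirty(bh);
				submit_bh(WRITE, bh);	/* IO at bh->b_blocknr */
			}
			bh = bh->b_this_page;
		} while (bh != head);
	}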

The directory case comes up just because "readdir()" actually is very
common, and sometimes very slow. And it can have a big VM working set
footprint ("find"), so trying to be page-based actually really helps,
because it all drives things like writeback on the _right_ issues, and we
can do things like LRU's and writeback decisions on the level that really
matters.

I actually suspect that the inode tables could benefit from being in the
page cache too (although I think that the inode buffer address is actually
"physical", so there's no indirection for inode tables, which means that
the virtual vs physical addressing doesn't matter). For directories, there
definitely is a big cost to continually doing the virtual->physical
translation all the time.

Linus