2009-03-11 16:25:51

by Daniel Phillips

[permalink] [raw]
Subject: Tux3 report: Tux3 Git tree available

A Git tree for Tux3 kernel code is now available at:

git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git

To build a recent of Linus's kernel with Tux3:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
cd linux-tux3
make defconfig
make CONFIG_TUX3=y

Tux3 userspace code is available from our Mercurial repository. To
build Tux3 userspace code and create a Tux3 filesystem:

hg clone http://hg.tux3.org/tux3
cd tux3
make
./tux3 mkfs /dev/<testpartition>

The Git Tux3 code is identical to the code in the user/kernel directory
of the Mercurial repository. More than 80% of the kernel code also
executes in userspace under Fuse, or as a component of the "tux3"
utility command, or as part of the "make tests" unit tests. We plan to
continue our dual userspace/kernel development model, as it has proven
effective at producing decent code with a low bug count. For this
reason, please expect to see some wrappers in the kernel code, especially
around buffer operations. Not only do we need to be compatible with
a slightly different buffer API in userspace, but we plan to use these
wrappers to help move away from buffers completely in due course.

There are various other oddities to be seen where writing code for both
userspace and kernel was not completely straightforward. For example,
kernel has multiple kinds of inodes with conversions between generic
and specific inodes, while userspace has only one kind. We will attempt
to reduce these differences over time, and reduce the frequency of #ifdef
__KERNEL__ where bits of userspace appear in the kernel files.

Tux3 is strictly for developers at the moment. >>Crash recovery is not
implemented yet.<< If Tux3 oopses or the machine loses power before
unmounting a Tux3 filesystem, then you should reboot and recreate the
Tux3 filesystem with "tux3 mkfs".

A quick status report: Tux3 now runs fine as root fs, as was announced
last month. Of course nobody should use this for real work because if
it crashes, chances are it will not mount again. Crash recovery code is
now partly implemented both in userspace and kernel, so we can see how
it looks now, but it is not yet functional. There are a few known bugs
and deficiencies. We thought that it would be wrong of us to hog the
bugs for ourselves, thus depriving the wider community of the fun and
glory of fixing these together, so we are going to offer Tux3 for
review, and hopefully merging in the not too distant future, complete
with bugs, warts and deficiencies.

The good news is: Tux3 does appear to work pretty well. With 6,525
lines of kernel code, it is close to implementing an Ext3-level feature
set. Ext3 is 32,926 lines including JBD. Sure, we are short a few
comments, and some rather important code, like allocation policy.
However, at this point we can see that Tux3 is as light and tight as
originally hoped for, and we expect it to remain so throughout its life.
As a wild guess, when fleshed out with recovery, an effective allocation
policy, btree directories, versioning, and loads of comments, Tux3 will
still come in under 10,000 lines.

Development work is proceding on a number of fronts. Hirofumi has taken
over the logging and recovery work from me, I am working on out of space
handling, and a team of students at Pune Institute in India are busy
implementing block deduplication, more on that later.

Tux3 performance at this point is decent, much more so than I expected
since we have not yet begun serious optimization work. At least for
intial untar-like loads, Tux3 performs a little better than Ext3 and a
little worse than Ext2. When logging is fully functional, Tux3 will be
writing a few additional log blocks, but there should also be less
seeking, according to my calculations. In other words, when Tux3
recovery code is fully functional, it should be a speedup rather than a
slowdown. And there is at least one big, easy optimization that will
yield a substantial reduction in total blocks transfered. This ongoing
work promises to be a lot of fun. Since it would be selfish of us to
keep all that fun for ourselves, we think it is about time to offer up
Tux3 for review.

How do we want to go about this? Six and a half thousand lines of code
is a bit much to post to lkml. On the other hand, that hasn't stopped a
number of other artists from posting even bigger patches.

This is what we have today:

git diff 16b71fdf97599f1b1b7f38418ee9922d9f117396 | diffstat
Makefile | 1
tux3/Kconfig | 7
tux3/Makefile | 13
tux3/balloc.c | 314 ++++++++++++++++++
tux3/btree.c | 774 ++++++++++++++++++++++++++++++++++++++++++++++
tux3/commit.c | 241 ++++++++++++++
tux3/dir.c | 347 ++++++++++++++++++++
tux3/dleaf.c | 838 ++++++++++++++++++++++++++++++++++++++++++++++++++
tux3/filemap.c | 645 ++++++++++++++++++++++++++++++++++++++
tux3/iattr.c | 201 ++++++++++++
tux3/ileaf.c | 312 ++++++++++++++++++
tux3/inode.c | 564 ++++++++++++++++++++++++++++++++++
tux3/log.c | 180 ++++++++++
tux3/namei.c | 282 +++++++++++++++++
tux3/super.c | 253 +++++++++++++++
tux3/trace.h | 34 ++
tux3/tux3.h | 944 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
tux3/utility.c | 68 ++++
tux3/xattr.c | 508 ++++++++++++++++++++++++++++++
19 files changed, 6526 insertions(+)

The full patch is 191KB. We could try patchbombing lkml with that, one
patch per file, say.

Or should we do it with a url:

http://tux3.org/patches/tux3-2.6.29-rc7

Or just post one big patch? It's not that big.

Regards,

Daniel


2009-03-11 18:44:03

by Andrew Morton

[permalink] [raw]
Subject: Re: Tux3 report: Tux3 Git tree available

On Wed, 11 Mar 2009 09:25:37 -0700
Daniel Phillips <[email protected]> wrote:

> The full patch is 191KB. We could try patchbombing lkml with that, one
> patch per file, say.

One patch per file is OK.

Two considerations:


- It should be reviewable! People don't want to spend all their
review time scratching their heads wondering "wtf is this piece of
code supposed to do".

Thoughtfully commented data structures and functions are valuable
during the code-review stage. If you try to retrofit them later then
reviewers get justifiably grumpy about all the time they wasted
ineffectually trying to review the uncommented stuff.

- Send the code for review when it's ready for linux-next
integration. I don't think it's a good idea to have it reviewed,
then you all disappear and spend three months changing it and then
put it up for linux-next integration.

OTOH, there may well be large changes as a result of review, so
don't leave it too late and avoid the temptation to think of it as
"finished", because it won't be!


Drive-by review:

- please prefer to leave a blank line between end-of-locals and
start-of-code in each fucntion.

- C++ comments make checkpatch (and kernel developers) whine.

- The non-tux3-specific bitmap-handling functions in balloc.c
shouldn't exist, I suspect. Use core kernel helpers. If they don't
exist, add them.

- bytebits() should use hweight8()

- No new typedefs, please. That means block_t. If there is some
real need to make block_t a typedef (such as: its size varies
according to Kconfig options) then grumble, OK. But it should then
be called tux3_block_t.

- count_range() is an unsuitable identifier for a kernel-wide symbol.
Please review all global symbols in the fs, ensure that they are
prefixed with "tux3_". Or make them static, of course.

- It uses printf() and assert()? Kernel code uses printk() and
BUG_ON(). Confused.

- There's a lot of this:

int ended = 0, any = 0;
struct buffer_head *buffer = blockread(mapping(inode), block);
if (!buffer)
return -1;
unsigned bytes = blocksize - offset;
if (bytes > tail)
bytes = tail;
unsigned char *p = bufdata(buffer) + offset, *top = p + bytes;

Which will spew compilation warnings due to local variable
definitions coming after code.

Maybe this code is never to be compiled into the kernel image, but
we might as well be consistent.

- What's "L"?

printf("%Lx-", (L)begin);

- When 'bh' is known to be non-NULL, use put_bh() rather than brelse().

- Use __packed, not PACKED.

- Run `checkpatch --file', enjoy the result.

- get_buffer() looks like it's racy against truncate. Needs to lock
the page.

- eh?

typedef u64 fixed32; /* Tux3 time values */

- link_entry() looks dangerous.

- Move SB_MAGIC to magic.h, change name.

- remove private hexdump(), use lib/hexdump.c


Generally: the code is _very_ namespace-piggy. Lots and lots of very
generic-sounding identifiers.

2009-03-12 05:39:17

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Wednesday 11 March 2009, Andrew Morton wrote:
> On Wed, 11 Mar 2009 09:25:37 -0700
> Daniel Phillips <[email protected]> wrote:
>
> > The full patch is 191KB. We could try patchbombing lkml with that, one
> > patch per file, say.
>
> One patch per file is OK.

OK, the next iteration will be that. For today there is a new patch
here:

http://tux3.org/patches/tux3-2.6.29-rc7-2

and the Git repository is updated:

git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git

Any idea how I make this appear on the web interface at git.kernel.org?
The Mercurial version of the changes (identical except for applying to
tux3/user/kernel instead of linux/fs/tux3) are here:

http://hg.tux3.org/tux3

> Two considerations:
>
> - It should be reviewable! People don't want to spend all their
> review time scratching their heads wondering "wtf is this piece of
> code supposed to do".
>
> Thoughtfully commented data structures and functions are valuable
> during the code-review stage. If you try to retrofit them later then
> reviewers get justifiably grumpy about all the time they wasted
> ineffectually trying to review the uncommented stuff.

The next round of checkins will be a batch of informative comments. A
bit late to tackle that today.

> - Send the code for review when it's ready for linux-next
> integration. I don't think it's a good idea to have it reviewed,
> then you all disappear and spend three months changing it and then
> put it up for linux-next integration.
>
> OTOH, there may well be large changes as a result of review, so
> don't leave it too late and avoid the temptation to think of it as
> "finished", because it won't be!

Will do. I think it is about ready for review now, with mainly cosmetic
issues between here and there. It is hard indeed, but we have resisted
the temptation to jump into the shiny optimizing work, or even get some
of the major functionality like recovery fully working. By now, it is
pretty clear how we intend to go at these things and the important code
structure is in place and mostly working. But no useful purpose is
served by taking several more weeks to make it all work perfectly at the
cost of missing out on the thousand eyeballs effect. OK sure, in the
real workd it is more like the two dozen eyeballs effect, but that is
still two dozen more than we had yesterday.

> Drive-by review:
>
> - please prefer to leave a blank line between end-of-locals and
> start-of-code in each fucntion.

Done.

> - C++ comments make checkpatch (and kernel developers) whine.

Each of those C++ comments is meant to cause sufficient pain to force
Tux3 devs to fix the commented issues, in the full knowledge that the
alternative is an eternity of suffering the slings and arrows of
outraged hacker whining.

Not all are fixed in this round of updates, but they will surely all be
gone before "official" review begins.

> - The non-tux3-specific bitmap-handling functions in balloc.c
> shouldn't exist, I suspect. Use core kernel helpers. If they don't
> exist, add them.

It looks like lib/bitmap.c has something serviceable, though the endian
issues are a little less than clear on a quick reading.

By the way, the endian comment in bitmap.c needs some loving:

http://lxr.linux.no/linux+v2.6.28.7/lib/bitmap.c#L35

The s390 file needs to be config-independent:

- include/asm-s390/bitops.h
+ arch/s390/include/asm/bitops.h

and the ppc file is nowhere to be seen. I will put a driveby patch on
the to.do list.

A more substantive point: The claim that "the byte ordering of bitmaps
is more natural on little endian architectures... see the big-endian
headers" is not actually supported in any obvious way. This is an
important question for us, because the bitmap format is the only endian
issue in Tux3 not completely nailed down. We use big endian format for
everything except bitmaps, which are currently little endian. This does
not make any difference until we start optimizing by using multi-byte
loads and stores. At that point, bits get scrambled in registers.

This is a disk format issue, so we need to get started on resolving it.
For perfect consistency, we should assume bit 0 at the most significant
end of a big-endian byte. But if little endian bytes really are "more
natural", which is not at all clear, then maybe we should settle on a
mixed format with littled endian bytes that have to be byteswapped when
loaded as words. I dunno. In a perfect world I would know the answer
just from reading the headers above, but this world isn't perfect and
neither am I, so hopefully somebody can enlighten me.

As an intermediate step, all our generic bit range operations are now
moved to tux3/utility.c and soon we will try to use the lib/bitmap.c
code, endian issues notwithstanding (does anybody really know what that
word means?)

> - bytebits() should use hweight8()

Done.

> - No new typedefs, please. That means block_t. If there is some
> real need to make block_t a typedef (such as: its size varies
> according to Kconfig options) then grumble, OK. But it should then
> be called tux3_block_t.

Not done yet (biggish).

> - count_range() is an unsuitable identifier for a kernel-wide symbol.
> Please review all global symbols in the fs, ensure that they are
> prefixed with "tux3_". Or make them static, of course.

This symbol is only supposed to be shared between separate compilation
units within fs/tux3, not be visible outside our module. How do we do
that?

Anyway, this isn't used by kernel code at all right now, though it
should be. For now, it has been made #ifndef __KERNEL__ until we
figure out what to do about this general class of symbols.

> - It uses printf() and assert()? Kernel code uses printk() and
> BUG_ON(). Confused.

That is the dual userspace/kernel personality showing. Userspace code
really looks odd when filled with printfs, and BUG_ON is essentially a
nontraditional incarnation of Posix assert. So please just declare on
this: assert and printf are banned from kernel code or not? I have a
slight preference for not banned, given the prior art of the Posix
flavors, but however you want it is how it will be. Not changed for
now.

> - There's a lot of this:
>
> int ended = 0, any = 0;
> struct buffer_head *buffer = blockread(mapping(inode), block);
> if (!buffer)
> return -1;
> unsigned bytes = blocksize - offset;
> if (bytes > tail)
> bytes = tail;
> unsigned char *p = bufdata(buffer) + offset, *top = p + bytes;
>
> Which will spew compilation warnings due to local variable
> definitions coming after code.

We compile without warnings in kernel by putting this in our Makefile:

EXTRA_CFLAGS += -Werror -std=gnu99 -Wno-declaration-after-statement

Obviously, that raises the question of whether C99 syntax is banned in
kernel. Well, C99 comments have a long tradition of attracting flames,
so all those are on their way out of Tux3 code (though not out yet) but
inline declarations actually serve a useful purpose in code that is in
process of continuous refactoring. So I think we would like to keep
these in some files for the time being, that are in heavy development,
and change some of the "traditional" usage like VFS glue to K&R style.

Naturally I would prefer that this matter be left up to the individual
artist for non-core kernel code, but please just make a ruling on it.
If the ruling is "we don't brook none of that filthy new C syntax"
then we will start converting all the kernel code. Not started yet.

> Maybe this code is never to be compiled into the kernel image, but
> we might as well be consistent.

Almost all of it is compiled by both userspace and kernel.

> - What's "L"?
>
> printf("%Lx-", (L)begin);

A very handy way of working around 32/64 bit format string issues. We
just cast all the messy cases to (long long), aka (L). All other
solutions to this messy problem are worse in my opinion, but whatever
the ruling is, is what we will do. This is used heavily in tracing and
dumping code, which can all be turned off with ifdefs, so it doesn't
affect production kernel text.

> - When 'bh' is known to be non-NULL, use put_bh() rather than brelse().

We have wrapped nearly all buffer operations with a view to getting rid
of buffers entirely, which we are not that far away from. Brelse is
one we didn't get around to. Now changed to blockput(buffer), a wrapper
that calls put_bh in kernel and our buffer emulation code in userspace,
which never did allow null buffers.

By the way, getting rid of buffers is an example of a process we really
want to do "in public", to avoid the danger of getting too far out on
some shaky limb.

> - Use __packed, not PACKED.

Done, and I noticed that we are missing some packed struct attributes,
to be fixed.

> - Run `checkpatch --file', enjoy the result.

Surely you meant "enjoy" the result.

> - get_buffer() looks like it's racy against truncate. Needs to lock
> the page.

Thankyou. That is ground zero for current development.

> - eh?
>
> typedef u64 fixed32; /* Tux3 time values */

That is 32.32 fixed point format. Tux3 uses fixed point rather than
decimal-encoded representation of sub-second time fields, with a view
to supporting configurable time precision with shifts rather that yucky
decimal conversions. We want the yucky decimal conversions only at
the VFS interface points, not in the disk representation code. Anyway,
32.32 is just a placeholder for some parameterized shift/mask code that
is not yet implemented. To be done.

> - link_entry() looks dangerous.

Live fast, die young. It's an evolving interface. Suggestions?

I came up with some code for using single linked lists in much the same
way as traditional generic double linked lists, but saving a word per
pointer. Which Hirofumi adopted and got busy generalizing, thinking
that kernel could use this in a few bazillion places to save metric
truckloads of cache memory. Anyway, a work in progress, suggestions
welcome.

> - Move SB_MAGIC to magic.h, change name.

Done.

> - remove private hexdump(), use lib/hexdump.c
>
>
> Generally: the code is _very_ namespace-piggy. Lots and lots of very
> generic-sounding identifiers.

Righto. A way of sandboxing external symbols just to our own module
would be very helpful, otherwise we will get busy tux3_ing those.

Regards,

Daniel

2009-03-12 06:08:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Wed, 11 Mar 2009 22:38:55 -0700 Daniel Phillips <[email protected]> wrote:

> On Wednesday 11 March 2009, Andrew Morton wrote:
> > On Wed, 11 Mar 2009 09:25:37 -0700
> > Daniel Phillips <[email protected]> wrote:
> >
> > > The full patch is 191KB. We could try patchbombing lkml with that, one
> > > patch per file, say.
> >
> > One patch per file is OK.
>
> OK, the next iteration will be that. For today there is a new patch
> here:
>
> http://tux3.org/patches/tux3-2.6.29-rc7-2
>
> and the Git repository is updated:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/daniel/linux-tux3.git
>
> Any idea how I make this appear on the web interface at git.kernel.org?

Sorry, my git cluelessness is epic.

> > - count_range() is an unsuitable identifier for a kernel-wide symbol.
> > Please review all global symbols in the fs, ensure that they are
> > prefixed with "tux3_". Or make them static, of course.
>
> This symbol is only supposed to be shared between separate compilation
> units within fs/tux3, not be visible outside our module. How do we do
> that?

You can't. Bear in mind that we want the allyesconfig kernel to link
and run, and that includes tux3. So do it manually by prefixing
everything with tux3_ or whatever. (does it need the "3"?)

> Anyway, this isn't used by kernel code at all right now, though it
> should be. For now, it has been made #ifndef __KERNEL__ until we
> figure out what to do about this general class of symbols.
>
> > - It uses printf() and assert()? Kernel code uses printk() and
> > BUG_ON(). Confused.
>
> That is the dual userspace/kernel personality showing. Userspace code
> really looks odd when filled with printfs, and BUG_ON is essentially a
> nontraditional incarnation of Posix assert. So please just declare on
> this: assert and printf are banned from kernel code or not? I have a
> slight preference for not banned, given the prior art of the Posix
> flavors, but however you want it is how it will be. Not changed for
> now.

Don't care much. There's heaps of value in being able to
run/develop/test the fs code in userspace. I expect that this
capability will eventually bitrot, so it'd be best to plan for that in
some fashion.

umm, it _is_ primarily kernel code, after all. So it seems appropriate
to use BUG_ON() and printk(), etc and to provide versions of those for
the userspace incarnation?

> > - There's a lot of this:
> >
> > int ended = 0, any = 0;
> > struct buffer_head *buffer = blockread(mapping(inode), block);
> > if (!buffer)
> > return -1;
> > unsigned bytes = blocksize - offset;
> > if (bytes > tail)
> > bytes = tail;
> > unsigned char *p = bufdata(buffer) + offset, *top = p + bytes;
> >
> > Which will spew compilation warnings due to local variable
> > definitions coming after code.
>
> We compile without warnings in kernel by putting this in our Makefile:
>
> EXTRA_CFLAGS += -Werror -std=gnu99 -Wno-declaration-after-statement
>
> Obviously, that raises the question of whether C99 syntax is banned in
> kernel.

It is banned ;)

I'm not sure why, really - I have vague memories of Linus having an
episode... It seems an OK construct if used tastefully. Although it
does make it easy to hide nasty surprises.

> Well, C99 comments have a long tradition of attracting flames,
> so all those are on their way out of Tux3 code (though not out yet) but
> inline declarations actually serve a useful purpose in code that is in
> process of continuous refactoring. So I think we would like to keep
> these in some files for the time being, that are in heavy development,
> and change some of the "traditional" usage like VFS glue to K&R style.
>
> Naturally I would prefer that this matter be left up to the individual
> artist for non-core kernel code, but please just make a ruling on it.
> If the ruling is "we don't brook none of that filthy new C syntax"
> then we will start converting all the kernel code. Not started yet.

Well. As I say, it doesn't bother me much (but I like C++, so ignore
me). But it will make merge/review life harder for you at the outset.
How much harder I cannot predict. People will fixate on this issue
at the expense of everything else..

> > Maybe this code is never to be compiled into the kernel image, but
> > we might as well be consistent.
>
> Almost all of it is compiled by both userspace and kernel.
>
> > - What's "L"?
> >
> > printf("%Lx-", (L)begin);
>
> A very handy way of working around 32/64 bit format string issues. We
> just cast all the messy cases to (long long), aka (L). All other
> solutions to this messy problem are worse in my opinion, but whatever
> the ruling is, is what we will do. This is used heavily in tracing and
> dumping code, which can all be turned off with ifdefs, so it doesn't
> affect production kernel text.

What format string issues are we talking about here?

See, a number of them will be fixed real soon now (geologically
speaking) when various 64-bit architectures switch their s64/u64
implementation from `long' to `long long'.

But if "L" is being used for (say) sector_t then that problem will
remain. It would be conventional to just use the open-coded cast.

> > - link_entry() looks dangerous.
>
> Live fast, die young. It's an evolving interface. Suggestions?

It looks like it can be passed almost any pointer at all (I forget).
Perhaps work out which types this is really to be used on, split it up
into a number of fully-C-typed helper functions and call those rather
than diving straight into the lower-level link_entry().

Something like that..

> > Generally: the code is _very_ namespace-piggy. Lots and lots of very
> > generic-sounding identifiers.
>
> Righto. A way of sandboxing external symbols just to our own module
> would be very helpful, otherwise we will get busy tux3_ing those.

tux3_ away ;)

2009-03-12 08:33:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Wednesday 11 March 2009, Andrew Morton wrote:
> > > - count_range() is an unsuitable identifier for a kernel-wide symbol.
> > > Please review all global symbols in the fs, ensure that they are
> > > prefixed with "tux3_". Or make them static, of course.
> >
> > This symbol is only supposed to be shared between separate compilation
> > units within fs/tux3, not be visible outside our module. How do we do
> > that?
>
> You can't. Bear in mind that we want the allyesconfig kernel to link
> and run, and that includes tux3. So do it manually by prefixing
> everything with tux3_ or whatever. (does it need the "3"?)

It doesn't need the 3, and sometimes we leave the _ off as well. OK,
this project is now underway. (Actually, has been underway for a
while, the code was much more free form a few months ago.)

> > Anyway, this isn't used by kernel code at all right now, though it
> > should be. For now, it has been made #ifndef __KERNEL__ until we
> > figure out what to do about this general class of symbols.
> >
> > > - It uses printf() and assert()? Kernel code uses printk() and
> > > BUG_ON(). Confused.
> >
> > That is the dual userspace/kernel personality showing. Userspace code
> > really looks odd when filled with printfs, and BUG_ON is essentially a
> > nontraditional incarnation of Posix assert. So please just declare on
> > this: assert and printf are banned from kernel code or not? I have a
> > slight preference for not banned, given the prior art of the Posix
> > flavors, but however you want it is how it will be. Not changed for
> > now.
>
> Don't care much. There's heaps of value in being able to
> run/develop/test the fs code in userspace. I expect that this
> capability will eventually bitrot, so it'd be best to plan for that in
> some fashion.
>
> umm, it _is_ primarily kernel code, after all. So it seems appropriate
> to use BUG_ON() and printk(), etc and to provide versions of those for
> the userspace incarnation?

Will do.

> > Obviously, that raises the question of whether C99 syntax is banned in
> > kernel.
>
> It is banned ;)
>
> I'm not sure why, really - I have vague memories of Linus having an
> episode... It seems an OK construct if used tastefully. Although it
> does make it easy to hide nasty surprises.
> ...
> Well. As I say, it doesn't bother me much (but I like C++, so ignore
> me). But it will make merge/review life harder for you at the outset.
> How much harder I cannot predict. People will fixate on this issue
> at the expense of everything else..

Well, I suppose we will do something in the middle for now: change some
to K&R, and leave some of it as is where we expect it to be developed
heavily, like dleaf.c which is going to see whole bunch of work to
integrate versioning, so it really makes little sense to make it harder
to factor just before starting that work. Anyway, the C++ comments are
on their way out and after all that is the one people love to hate.

> > > - What's "L"?
> > >
> > > printf("%Lx-", (L)begin);
> >
> > A very handy way of working around 32/64 bit format string issues. We
> > just cast all the messy cases to (long long), aka (L). All other
> > solutions to this messy problem are worse in my opinion, but whatever
> > the ruling is, is what we will do. This is used heavily in tracing and
> > dumping code, which can all be turned off with ifdefs, so it doesn't
> > affect production kernel text.
>
> What format string issues are we talking about here?
>
> See, a number of them will be fixed real soon now (geologically
> speaking) when various 64-bit architectures switch their s64/u64
> implementation from `long' to `long long'.

Ah, that would be helpful. But not done yet? How long until it
happens, and does it make sense to wait, so we can reduce the number
of problems cases? And... will it be all 64 bit arches or just some?
Because this issue isn't solved if it isn't fixed for all arches.

There are a couple of issues, one is u64 being (long) instead of
(long long) as you say, and the other is variable type sizes like
loff_t. That specific one isn't actually a problem, we can just refuse
to support 32 bit libc file ops, but there may be others. We had a
world of pain before (L) arrived, then with (L) it was easy. Maybe
just edit them all to (long long) for now, and damn the line length.

> But if "L" is being used for (say) sector_t then that problem will
> remain. It would be conventional to just use the open-coded cast.

Sector_t is probably the biggest issue. I really think we should
continue to support 32 bit sector_t, after all it 64 bit sector_t
makes zero sense on a cell phone, at least this year's cell phone,
and there is a real possibility that Tux3 could make a lot of sense
on a cell phone if we don't lose sight of the downward scaling
considerations.

> > > - link_entry() looks dangerous.
> >
> > Live fast, die young. It's an evolving interface. Suggestions?
>
> It looks like it can be passed almost any pointer at all (I forget).
> Perhaps work out which types this is really to be used on, split it up
> into a number of fully-C-typed helper functions and call those rather
> than diving straight into the lower-level link_entry().
>
> Something like that..

Well it's exactly as dangerous as list_entry. In fs/tux3 we only have
a single use of it, which is another non-type-safe wrapper, which is
only used twice in the same file. So ok, these will be changed to a
type safe wrapper.

Thanks much for the driveby lovin. See you again in a few days, same
bat time, same bat channel.

Regards,

Daniel

2009-03-12 08:48:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009 19:33:02 Daniel Phillips wrote:
> On Wednesday 11 March 2009, Andrew Morton wrote:

> > > Obviously, that raises the question of whether C99 syntax is banned in
> > > kernel.
> >
> > It is banned ;)
> >
> > I'm not sure why, really - I have vague memories of Linus having an
> > episode... It seems an OK construct if used tastefully. Although it
> > does make it easy to hide nasty surprises.
> > ...
> > Well. As I say, it doesn't bother me much (but I like C++, so ignore
> > me). But it will make merge/review life harder for you at the outset.
> > How much harder I cannot predict. People will fixate on this issue
> > at the expense of everything else..
>
> Well, I suppose we will do something in the middle for now: change some
> to K&R, and leave some of it as is where we expect it to be developed
> heavily, like dleaf.c which is going to see whole bunch of work to
> integrate versioning, so it really makes little sense to make it harder
> to factor just before starting that work. Anyway, the C++ comments are
> on their way out and after all that is the one people love to hate.

I think they need to be fixed before merge. If the function is easier
to follow when you use this feature, IMO it indicates the function is
too big or badly written anyway.


> There are a couple of issues, one is u64 being (long) instead of
> (long long) as you say, and the other is variable type sizes like
> loff_t. That specific one isn't actually a problem, we can just refuse
> to support 32 bit libc file ops, but there may be others. We had a
> world of pain before (L) arrived, then with (L) it was easy. Maybe
> just edit them all to (long long) for now, and damn the line length.

Yes please do this. A significant style change like this that lots of
code already does I think is best first discussed as a standalone
change to kernel rather than everyone developing their own convention.

2009-03-12 09:00:32

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Hi Nick,

On Thursday 12 March 2009, Nick Piggin wrote:
> On Thursday 12 March 2009 19:33:02 Daniel Phillips wrote:
> > On Wednesday 11 March 2009, Andrew Morton wrote:
>
> > > > Obviously, that raises the question of whether C99 syntax is banned in
> > > > kernel.
> > >
> > > It is banned ;)
> > >
> > > I'm not sure why, really - I have vague memories of Linus having an
> > > episode... It seems an OK construct if used tastefully. Although it
> > > does make it easy to hide nasty surprises.
> > > ...
> > > Well. As I say, it doesn't bother me much (but I like C++, so ignore
> > > me). But it will make merge/review life harder for you at the outset.
> > > How much harder I cannot predict. People will fixate on this issue
> > > at the expense of everything else..
> >
> > Well, I suppose we will do something in the middle for now: change some
> > to K&R, and leave some of it as is where we expect it to be developed
> > heavily, like dleaf.c which is going to see whole bunch of work to
> > integrate versioning, so it really makes little sense to make it harder
> > to factor just before starting that work. Anyway, the C++ comments are
> > on their way out and after all that is the one people love to hate.
>
> I think they need to be fixed before merge. If the function is easier
> to follow when you use this feature, IMO it indicates the function is
> too big or badly written anyway.

It's not about being easier to follow as being easier to factor. Putting
the variables far away from point of use increases the busy work in
moving code around significantly, with a corresponding reduction in code
quality in the long run, because time is spent fiddling with declarations
instead of improving structure. That said, it no doubt will be changed
before merge. Not that I think there is a sensible reason for it, but
because it makes little sense to dig in over a cosmetic issue.

> > There are a couple of issues, one is u64 being (long) instead of
> > (long long) as you say, and the other is variable type sizes like
> > loff_t. That specific one isn't actually a problem, we can just refuse
> > to support 32 bit libc file ops, but there may be others. We had a
> > world of pain before (L) arrived, then with (L) it was easy. Maybe
> > just edit them all to (long long) for now, and damn the line length.
>
> Yes please do this. A significant style change like this that lots of
> code already does I think is best first discussed as a standalone
> change to kernel rather than everyone developing their own convention.

That will be in the next batch of changes. So... we offered our shiny
new convention, and I consider it voted down. All in a days work :)

Regards,

Daniel

2009-03-12 09:10:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009 20:00:18 Daniel Phillips wrote:
> Hi Nick,
>
> On Thursday 12 March 2009, Nick Piggin wrote:
> > On Thursday 12 March 2009 19:33:02 Daniel Phillips wrote:
> > > On Wednesday 11 March 2009, Andrew Morton wrote:
> > > > > Obviously, that raises the question of whether C99 syntax is banned
> > > > > in kernel.
> > > >
> > > > It is banned ;)
> > > >
> > > > I'm not sure why, really - I have vague memories of Linus having an
> > > > episode... It seems an OK construct if used tastefully. Although it
> > > > does make it easy to hide nasty surprises.
> > > > ...
> > > > Well. As I say, it doesn't bother me much (but I like C++, so ignore
> > > > me). But it will make merge/review life harder for you at the
> > > > outset. How much harder I cannot predict. People will fixate on this
> > > > issue at the expense of everything else..
> > >
> > > Well, I suppose we will do something in the middle for now: change some
> > > to K&R, and leave some of it as is where we expect it to be developed
> > > heavily, like dleaf.c which is going to see whole bunch of work to
> > > integrate versioning, so it really makes little sense to make it harder
> > > to factor just before starting that work. Anyway, the C++ comments are
> > > on their way out and after all that is the one people love to hate.
> >
> > I think they need to be fixed before merge. If the function is easier
> > to follow when you use this feature, IMO it indicates the function is
> > too big or badly written anyway.
>
> It's not about being easier to follow as being easier to factor. Putting
> the variables far away from point of use increases the busy work in
> moving code around significantly, with a corresponding reduction in code
> quality in the long run, because time is spent fiddling with declarations
> instead of improving structure. That said, it no doubt will be changed

Again, I would say if it takes so much more work to maintain, then the
functions are too big or badly written. But I guess it is a matter of
opinion, so...


> before merge. Not that I think there is a sensible reason for it, but
> because it makes little sense to dig in over a cosmetic issue.

... how about because it follows agreed convention?


> > > There are a couple of issues, one is u64 being (long) instead of
> > > (long long) as you say, and the other is variable type sizes like
> > > loff_t. That specific one isn't actually a problem, we can just refuse
> > > to support 32 bit libc file ops, but there may be others. We had a
> > > world of pain before (L) arrived, then with (L) it was easy. Maybe
> > > just edit them all to (long long) for now, and damn the line length.
> >
> > Yes please do this. A significant style change like this that lots of
> > code already does I think is best first discussed as a standalone
> > change to kernel rather than everyone developing their own convention.
>
> That will be in the next batch of changes. So... we offered our shiny
> new convention, and I consider it voted down. All in a days work :)

Cool. Maybe if you offer it as a standalone patch to the kernel, it
will get more attention? It's just not appropriate to put in with a
new filesystem.

2009-03-12 09:46:05

by Sam Ravnborg

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

>
> > > > - What's "L"?
> > > >
> > > > printf("%Lx-", (L)begin);
> > >
> > > A very handy way of working around 32/64 bit format string issues. We
> > > just cast all the messy cases to (long long), aka (L). All other
> > > solutions to this messy problem are worse in my opinion, but whatever
> > > the ruling is, is what we will do. This is used heavily in tracing and
> > > dumping code, which can all be turned off with ifdefs, so it doesn't
> > > affect production kernel text.
> >
> > What format string issues are we talking about here?
> >
> > See, a number of them will be fixed real soon now (geologically
> > speaking) when various 64-bit architectures switch their s64/u64
> > implementation from `long' to `long long'.
>
> Ah, that would be helpful. But not done yet? How long until it
> happens, and does it make sense to wait, so we can reduce the number
> of problems cases? And... will it be all 64 bit arches or just some?
> Because this issue isn't solved if it isn't fixed for all arches.

Most if not all will hit mainline during coming merge window.
When an arch changes u64 becomes unsigned long long.

It is considered a 'must have' for 64 bit archs these days so
do not workaround it.

Sam

2009-03-12 10:15:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Nick Piggin wrote:
> On Thursday 12 March 2009 20:00:18 Daniel Phillips wrote:
> > Hi Nick,
> >
> > On Thursday 12 March 2009, Nick Piggin wrote:
> > > On Thursday 12 March 2009 19:33:02 Daniel Phillips wrote:
> > > > On Wednesday 11 March 2009, Andrew Morton wrote:
> > > > > > Obviously, that raises the question of whether C99 syntax is banned
> > > > > > in kernel.
> > > > >
> > > > > It is banned ;)
> > > > >
> > > > > I'm not sure why, really - I have vague memories of Linus having an
> > > > > episode... It seems an OK construct if used tastefully. Although it
> > > > > does make it easy to hide nasty surprises.
> > > > > ...
> > > > > Well. As I say, it doesn't bother me much (but I like C++, so ignore
> > > > > me). But it will make merge/review life harder for you at the
> > > > > outset. How much harder I cannot predict. People will fixate on this
> > > > > issue at the expense of everything else..
> > > >
> > > > Well, I suppose we will do something in the middle for now: change some
> > > > to K&R, and leave some of it as is where we expect it to be developed
> > > > heavily, like dleaf.c which is going to see whole bunch of work to
> > > > integrate versioning, so it really makes little sense to make it harder
> > > > to factor just before starting that work. Anyway, the C++ comments are
> > > > on their way out and after all that is the one people love to hate.
> > >
> > > I think they need to be fixed before merge. If the function is easier
> > > to follow when you use this feature, IMO it indicates the function is
> > > too big or badly written anyway.
> >
> > It's not about being easier to follow as being easier to factor. Putting
> > the variables far away from point of use increases the busy work in
> > moving code around significantly, with a corresponding reduction in code
> > quality in the long run, because time is spent fiddling with declarations
> > instead of improving structure. That said, it no doubt will be changed
>
> Again, I would say if it takes so much more work to maintain, then the
> functions are too big or badly written. But I guess it is a matter of
> opinion, so...

It's not maintaining, it's developing.

Yes, this is a religious issue, pure and simple. So when in Rome...

By the way, I just spotted your fsblock effort on LWN, and I should
mention there is a lot of commonality with a side project we have
going in Tux3, called "block handles", which aims to get rid of buffers
entirely, leaving a tiny structure attached to the page->private that
just records the block states. Currently, four bits per block. This
can be done entirely _within_ a filesystem. We are already running
some of the code that has to be in place before switching over to this
model.

Tux3 block handles (as prototyped, not in the current code base) are
16 bytes per page, which for 1K block size on a 32 bit arch is a factor
of 14 savings, more on 64 bit arch. More importantly, it saves lots of
individual slab allocations, a feature I gather is part of fsblock as
well.

We represent up to 8 block states in one 16 byte object. To make this
work, we don't try to emulate the behavior of the venerable block
library, but first refactor the filesystem data flow so that we are
only using the parts of buffer_head that will be emulated by the block
handle. For example, no caching of physical block address. It keeps
changing in Tux3 anyway, so this is really a useless thing to do.

Anyway, that is more than I meant to write about it. Good luck to you,
you will need it. Keep in mind that some of the nastiest kernel bugs
ever arose from interactions between page and buffer state bits. You
may run into understandable reluctance to change stable filesystems
over to the new model. Even if it reduces the chance for confusion,
the fact that it touches cache state at all is going to make people
jumpy. I would suggest that you get Ext3 working perfectly to prove
your idea, no small task. One advantage: Ext3 uses block handles for
directories, as opposed to Ext2 which operates on pages. So Ext3 will
be easier to deal with in that one area. But with Ext3 you have to
deal with jbd too, which is going to keep you busy for a while.

The block handles patch is one of those fun things we have on hold for
the time being while we get the more mundane, but essential issues out
of the way, like whether we are going to run that commie C99 mindrot
filth out of town on a rail.

Regards,

Daniel

2009-03-12 10:25:46

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Sam Ravnborg wrote:
> >
> > > > > - What's "L"?
> > > > >
> > > > > printf("%Lx-", (L)begin);
> > > >
> > > > A very handy way of working around 32/64 bit format string issues. We
> > > > just cast all the messy cases to (long long), aka (L). All other
> > > > solutions to this messy problem are worse in my opinion, but whatever
> > > > the ruling is, is what we will do. This is used heavily in tracing and
> > > > dumping code, which can all be turned off with ifdefs, so it doesn't
> > > > affect production kernel text.
> > >
> > > What format string issues are we talking about here?
> > >
> > > See, a number of them will be fixed real soon now (geologically
> > > speaking) when various 64-bit architectures switch their s64/u64
> > > implementation from `long' to `long long'.
> >
> > Ah, that would be helpful. But not done yet? How long until it
> > happens, and does it make sense to wait, so we can reduce the number
> > of problems cases? And... will it be all 64 bit arches or just some?
> > Because this issue isn't solved if it isn't fixed for all arches.
>
> Most if not all will hit mainline during coming merge window.
> When an arch changes u64 becomes unsigned long long.
>
> It is considered a 'must have' for 64 bit archs these days so
> do not workaround it.

OK, well a mindless edit of (L) to (long long) isn't a workaround, just
a reshuffle. Then maybe when the 64 bit changes land in a month or two
we can lose most or all of those ugly things.

Regards,

Daniel

2009-03-12 11:03:48

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009 21:15:06 Daniel Phillips wrote:

> By the way, I just spotted your fsblock effort on LWN, and I should
> mention there is a lot of commonality with a side project we have
> going in Tux3, called "block handles", which aims to get rid of buffers
> entirely, leaving a tiny structure attached to the page->private that
> just records the block states. Currently, four bits per block. This
> can be done entirely _within_ a filesystem. We are already running
> some of the code that has to be in place before switching over to this
> model.
>
> Tux3 block handles (as prototyped, not in the current code base) are
> 16 bytes per page, which for 1K block size on a 32 bit arch is a factor
> of 14 savings, more on 64 bit arch. More importantly, it saves lots of
> individual slab allocations, a feature I gather is part of fsblock as
> well.

That's interesting. Do you handle 1K block sizes with 64K page size? :)

fsblock isn't quite as small. 20 bytes per block on a 32-bit arch. Yeah,
so it will do a single 80 byte allocation for 4K page 1K block.

That's good for cache efficiency. As far as total # slab allocations
themselves go, fsblock probably tends to do more of them than buffer.c
because it frees them proactively when their refcounts reach 0 (by
default, one can switch to a lazy mode like buffer heads).

That's one of the most important things, so we don't end up with lots
of these things lying around.

eg. I could make it 16 bytes I think, but it would be a little harder
and would make support for block size > page size much harder so I
wouldn't bother. Or share the refcount field for all blocks in a page
and just duplicate state etc, but it just makes code larger and slower
and harder.


> We represent up to 8 block states in one 16 byte object. To make this
> work, we don't try to emulate the behavior of the venerable block
> library, but first refactor the filesystem data flow so that we are
> only using the parts of buffer_head that will be emulated by the block
> handle. For example, no caching of physical block address. It keeps
> changing in Tux3 anyway, so this is really a useless thing to do.

fsblocks in their refcount mode don't tend to _cache_ physical block addresses
either, because they're only kept around for as long as they are required
(eg. to write out the page to avoid memory allocation deadlock problems).

But some filesystems don't do very fast block lookups and do want a cache.
I did a little extent map library on the side for that.


> Anyway, that is more than I meant to write about it. Good luck to you,
> you will need it. Keep in mind that some of the nastiest kernel bugs
> ever arose from interactions between page and buffer state bits. You

Yes, I even fixed several of them too :)

fsblock simplifies a lot of those games. It protects pagecache state and
fsblock state for all assocated blocks under a lock, so no weird ordering
issues, and the two are always kept coherent (to the point that I can do
writeout by walking dirty fsblocks in block device sector-order, although
that requires bloat to fsblock struct and isn't straightforward with
delalloc).

Of course it is new code so it will have more bugs, but it is better code.


> may run into understandable reluctance to change stable filesystems
> over to the new model. Even if it reduces the chance for confusion,
> the fact that it touches cache state at all is going to make people
> jumpy. I would suggest that you get Ext3 working perfectly to prove
> your idea, no small task. One advantage: Ext3 uses block handles for
> directories, as opposed to Ext2 which operates on pages. So Ext3 will
> be easier to deal with in that one area. But with Ext3 you have to
> deal with jbd too, which is going to keep you busy for a while.

It is basically already proven. It is faster with ext2 and it works with
XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't
really able to grok XFS enough to convert it). And works with minix
with larger block size than page size (except some places where core
pagecache code needs some hacking that I haven't got around to).

Yes an ext3 conversion would probably reveal some tweaks or fixes to
fsblock. I might try doing ext3 next. I suspect most of the problems
would be fitting ext3 to much stricter checks and consistency required
by fsblock, rather than adding ext3-required features to fsblock.

ext3 will be a tough one to convert because it is complex, very stable,
and widely used so there are lots of reasons not to make big changes to
it.


> The block handles patch is one of those fun things we have on hold for
> the time being while we get the more mundane

Good luck with it. I suspect that doing filesystem-specific layers to
duplicate basically the same functionality but slightly optimised for
the specific filesystem may not be a big win. As you say, this is where
lots of nasty problems have been, so sharing as much code as possible
is a really good idea.

I would be very interested in anything like this that could beat fsblock
in functionality or performance anywhere, even if it is taking shortcuts
by being less generic.

If there is a significant gain to be had from less generic, perhaps it
could still be made into a library usable by more than 1 fs.

2009-03-12 12:24:52

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Nick Piggin wrote:
> On Thursday 12 March 2009 21:15:06 Daniel Phillips wrote:
>
> > By the way, I just spotted your fsblock effort on LWN, and I should
> > mention there is a lot of commonality with a side project we have
> > going in Tux3, called "block handles", which aims to get rid of buffers
> > entirely, leaving a tiny structure attached to the page->private that
> > just records the block states. Currently, four bits per block. This
> > can be done entirely _within_ a filesystem. We are already running
> > some of the code that has to be in place before switching over to this
> > model.
> >
> > Tux3 block handles (as prototyped, not in the current code base) are
> > 16 bytes per page, which for 1K block size on a 32 bit arch is a factor
> > of 14 savings, more on 64 bit arch. More importantly, it saves lots of
> > individual slab allocations, a feature I gather is part of fsblock as
> > well.
>
> That's interesting. Do you handle 1K block sizes with 64K page size? :)

Not in its current incarnation. That would require 32 bytes worth of
state while the current code just has a 4 byte map (4 bits X 8 blocks).
I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has
somebody spotted a 64K page?

> fsblock isn't quite as small. 20 bytes per block on a 32-bit arch. Yeah,
> so it will do a single 80 byte allocation for 4K page 1K block.
>
> That's good for cache efficiency. As far as total # slab allocations
> themselves go, fsblock probably tends to do more of them than buffer.c
> because it frees them proactively when their refcounts reach 0 (by
> default, one can switch to a lazy mode like buffer heads).

I think that's a very good thing to do and intend to do the same. If
it shows on a profiler, then the filesystem should keep its own free
list to avoid whatever slab thing creates the bottleneck.

> That's one of the most important things, so we don't end up with lots
> of these things lying around.

Amen. Doing nothing.

> eg. I could make it 16 bytes I think, but it would be a little harder
> and would make support for block size > page size much harder so I
> wouldn't bother. Or share the refcount field for all blocks in a page
> and just duplicate state etc, but it just makes code larger and slower
> and harder.

Right, block handles share the refcount for all blocks on one page.

> > We represent up to 8 block states in one 16 byte object. To make this
> > work, we don't try to emulate the behavior of the venerable block
> > library, but first refactor the filesystem data flow so that we are
> > only using the parts of buffer_head that will be emulated by the block
> > handle. For example, no caching of physical block address. It keeps
> > changing in Tux3 anyway, so this is really a useless thing to do.
>
> fsblocks in their refcount mode don't tend to _cache_ physical block addresses
> either, because they're only kept around for as long as they are required
> (eg. to write out the page to avoid memory allocation deadlock problems).
>
> But some filesystems don't do very fast block lookups and do want a cache.
> I did a little extent map library on the side for that.

Sure, good plan. We are attacking the transfer path, so that all the
transfer state goes directly from the filesystem into a BIO and doesn't
need that twisty path back and forth to the block library. The BIO
remembers the physical address across the transfer cycle. If you must
still support those twisty paths for compatibility with the existing
buffer.c scheme, you have a much harder project.

> > Anyway, that is more than I meant to write about it. Good luck to you,
> > you will need it. Keep in mind that some of the nastiest kernel bugs
> > ever arose from interactions between page and buffer state bits. You
>
> Yes, I even fixed several of them too :)
>
> fsblock simplifies a lot of those games. It protects pagecache state and
> fsblock state for all assocated blocks under a lock, so no weird ordering
> issues, and the two are always kept coherent (to the point that I can do
> writeout by walking dirty fsblocks in block device sector-order, although
> that requires bloat to fsblock struct and isn't straightforward with
> delalloc).
>
> Of course it is new code so it will have more bugs, but it is better code.

I've started poking at it, starting with the manifesto. It's a pretty
big reading project.

> > The block handles patch is one of those fun things we have on hold for
> > the time being while we get the more mundane
>
> Good luck with it. I suspect that doing filesystem-specific layers to
> duplicate basically the same functionality but slightly optimised for
> the specific filesystem may not be a big win. As you say, this is where
> lots of nasty problems have been, so sharing as much code as possible
> is a really good idea.

The big win will come from avoiding the use of struct buffer_head as
an API element for mapping logical cache to disk, which is a narrow
constriction when the filesystem wants to do things with extents in
btrees. It is quite painful doing a btree probe for every ->get_block
the way it is now. We want probe... page page page page... submit bio
(or put it on a list for delayed allocation).

Once we have the desired, nice straight path above then we don't need
most of the fields in buffer_head, so tightening it up into a bitmap,
a refcount and a pointer back to the page makes a lot of sense. This
in itself may not make a huge difference, but the reduction in cache
pressure ought to be measurable and worth the not very many lines of
code for the implementation.

> I would be very interested in anything like this that could beat fsblock
> in functionality or performance anywhere, even if it is taking shortcuts
> by being less generic.
>
> If there is a significant gain to be had from less generic, perhaps it
> could still be made into a library usable by more than 1 fs.

I don't see any reason right off that it is not generic, except that it
does not try to fill the API role that buffer_head has, and so it isn't
a small, easy change to an existing filesystem. It ought to be useful
for new designs though. Mind you, the code hasn't been tried yet, it
is currently just a state-smashing API waiting for the filesystem to
evolve into the necessary shape, which is going to take another month
or two.

The Tux3 userspace buffer emulation already works much like the kernel
block handles will work, in that it doesn't cache a physical address,
and maintains cache state as a scalar value instead of a set of bits,
so we already have a fair amount of experience with the model. When it
does get to the top of the list of things to do, it should slot in
smoothly. At that point we could hand it to you to try your generic
API, which seems to implement similar ideas.

Regards,

Daniel

2009-03-12 12:32:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote:
> On Thursday 12 March 2009, Nick Piggin wrote:
> > That's interesting. Do you handle 1K block sizes with 64K page size? :)
>
> Not in its current incarnation. That would require 32 bytes worth of
> state while the current code just has a 4 byte map (4 bits X 8 blocks).
> I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has
> somebody spotted a 64K page?

I believe SGI ship their ia64 kernels configured this way. Certainly
16k ia64 kernels are common, which would (if I understand your scheme
correctly) be 8 bytes worth of state in your scheme.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-03-12 12:45:33

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009 23:32:30 Matthew Wilcox wrote:
> On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote:
> > On Thursday 12 March 2009, Nick Piggin wrote:
> > > That's interesting. Do you handle 1K block sizes with 64K page size? :)
> >
> > Not in its current incarnation. That would require 32 bytes worth of
> > state while the current code just has a 4 byte map (4 bits X 8 blocks).
> > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has
> > somebody spotted a 64K page?
>
> I believe SGI ship their ia64 kernels configured this way. Certainly
> 16k ia64 kernels are common, which would (if I understand your scheme
> correctly) be 8 bytes worth of state in your scheme.

I think some distros will (or do) ship configs with 64K page size for
ia64 and powerpc too. I think I have heard of people using 64K pages
with ARM. There was some (public) talk of x86-64 getting a 16K or 64K
page size too (and even if not HW, some people want to be able to go
bigger SW pagecache size).

I wouldn't expect 64K page and 1K block to be worth optimising for
(although 64K page systems could easily use older or shared 4K block
filesystems). But just keep in mind that a good solution should not
rely on PAGE_CACHE_SIZE for correctness.

2009-03-12 13:04:56

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote:
> On Thursday 12 March 2009, Nick Piggin wrote:

> > That's good for cache efficiency. As far as total # slab allocations
> > themselves go, fsblock probably tends to do more of them than buffer.c
> > because it frees them proactively when their refcounts reach 0 (by
> > default, one can switch to a lazy mode like buffer heads).
>
> I think that's a very good thing to do and intend to do the same. If
> it shows on a profiler, then the filesystem should keep its own free
> list to avoid whatever slab thing creates the bottleneck.

slab allocation/free fastpath is on the order of 100 cycles, which is
a cache miss. I have a feeling that actually doing lots of allocs and
frees can work better because it keeps reusing the same memory for
different objects being operated on, so you get fewer cache misses.
(anyway it doesn't seem to be measurable in fsblock when switching
between cached and refcounted mode).


> > fsblocks in their refcount mode don't tend to _cache_ physical block
> > addresses either, because they're only kept around for as long as they
> > are required (eg. to write out the page to avoid memory allocation
> > deadlock problems).
> >
> > But some filesystems don't do very fast block lookups and do want a
> > cache. I did a little extent map library on the side for that.
>
> Sure, good plan. We are attacking the transfer path, so that all the
> transfer state goes directly from the filesystem into a BIO and doesn't
> need that twisty path back and forth to the block library. The BIO
> remembers the physical address across the transfer cycle. If you must
> still support those twisty paths for compatibility with the existing
> buffer.c scheme, you have a much harder project.

I don't quite know what you mean. You have a set of dirty cache that
needs to be written. So you need to know the block addresses in order
to create the bio of course.

fsblock allocates the block and maps[*] the block at pagecache *dirty*
time, and holds onto it until writeout is finished. In something like
ext2, finding the offset->block map can require buffercache allocations
so it is technically deadlocky if you have to do it at writeout time.

[*] except in the case of delalloc. fsblock does its best, but for
complex filesystems like delalloc, some memory reservation would have
to be done by the fs.


> > > The block handles patch is one of those fun things we have on hold for
> > > the time being while we get the more mundane
> >
> > Good luck with it. I suspect that doing filesystem-specific layers to
> > duplicate basically the same functionality but slightly optimised for
> > the specific filesystem may not be a big win. As you say, this is where
> > lots of nasty problems have been, so sharing as much code as possible
> > is a really good idea.
>
> The big win will come from avoiding the use of struct buffer_head as
> an API element for mapping logical cache to disk, which is a narrow
> constriction when the filesystem wants to do things with extents in
> btrees. It is quite painful doing a btree probe for every ->get_block
> the way it is now. We want probe... page page page page... submit bio
> (or put it on a list for delayed allocation).
>
> Once we have the desired, nice straight path above then we don't need
> most of the fields in buffer_head, so tightening it up into a bitmap,
> a refcount and a pointer back to the page makes a lot of sense. This
> in itself may not make a huge difference, but the reduction in cache
> pressure ought to be measurable and worth the not very many lines of
> code for the implementation.

I haven't done much about this in fsblock yet. I think some things need
a bit of changing in the pagecache layer (in the block library, eg.
write_begin/write_end doesn't have enough info to reserve/allocate a big
range of blocks -- we need a callback higher up to tell the filesystem
that we will be writing xxx range in the file, so get things ready for
us).

As far as the per-block pagecache state (as opposed to the per-block fs
state), I don't see any reason it is a problem for efficiency. We have to
do per-page operations anyway.


> > I would be very interested in anything like this that could beat fsblock
> > in functionality or performance anywhere, even if it is taking shortcuts
> > by being less generic.
> >
> > If there is a significant gain to be had from less generic, perhaps it
> > could still be made into a library usable by more than 1 fs.
>
> I don't see any reason right off that it is not generic, except that it
> does not try to fill the API role that buffer_head has, and so it isn't
> a small, easy change to an existing filesystem. It ought to be useful
> for new designs though. Mind you, the code hasn't been tried yet, it
> is currently just a state-smashing API waiting for the filesystem to
> evolve into the necessary shape, which is going to take another month
> or two.
>
> The Tux3 userspace buffer emulation already works much like the kernel
> block handles will work, in that it doesn't cache a physical address,
> and maintains cache state as a scalar value instead of a set of bits,
> so we already have a fair amount of experience with the model. When it
> does get to the top of the list of things to do, it should slot in
> smoothly. At that point we could hand it to you to try your generic
> API, which seems to implement similar ideas.

Cool. I will be interested to see how it works.

Thanks,
Nick

2009-03-12 13:06:55

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Matthew Wilcox wrote:
> On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote:
> > On Thursday 12 March 2009, Nick Piggin wrote:
> > > That's interesting. Do you handle 1K block sizes with 64K page size? :)
> >
> > Not in its current incarnation. That would require 32 bytes worth of
> > state while the current code just has a 4 byte map (4 bits X 8 blocks).
> > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has
> > somebody spotted a 64K page?
>
> I believe SGI ship their ia64 kernels configured this way. Certainly
> 16k ia64 kernels are common, which would (if I understand your scheme
> correctly) be 8 bytes worth of state in your scheme.

Yes, correct, and after that the state object would have to expand by a
binary factor, which probably doesn't matter because at that scale it
is really small compared to the blocks it maps. And the mapped blocks
should just be metadata like index nodes, directory entry blocks and
bitmap blocks, which need per block data handles and locking while
regular file data can work in full pages, which is the same equation
that keeps the pain of buffer_heads down to a dull roar.

Regards,

Daniel

2009-03-12 13:13:25

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Nick Piggin wrote:
> On Thursday 12 March 2009 23:32:30 Matthew Wilcox wrote:
> > On Thu, Mar 12, 2009 at 05:24:33AM -0700, Daniel Phillips wrote:
> > > On Thursday 12 March 2009, Nick Piggin wrote:
> > > > That's interesting. Do you handle 1K block sizes with 64K page size? :)
> > >
> > > Not in its current incarnation. That would require 32 bytes worth of
> > > state while the current code just has a 4 byte map (4 bits X 8 blocks).
> > > I suppose a reasonable way to extend it would be 4 x 8 byte maps. Has
> > > somebody spotted a 64K page?
> >
> > I believe SGI ship their ia64 kernels configured this way. Certainly
> > 16k ia64 kernels are common, which would (if I understand your scheme
> > correctly) be 8 bytes worth of state in your scheme.
>
> I think some distros will (or do) ship configs with 64K page size for
> ia64 and powerpc too. I think I have heard of people using 64K pages
> with ARM. There was some (public) talk of x86-64 getting a 16K or 64K
> page size too (and even if not HW, some people want to be able to go
> bigger SW pagecache size).
>
> I wouldn't expect 64K page and 1K block to be worth optimising for
> (although 64K page systems could easily use older or shared 4K block
> filesystems). But just keep in mind that a good solution should not
> rely on PAGE_CACHE_SIZE for correctness.

Not worth optimizing for, but it better work. Which the current nasty
circular buffer list will, and I better keep that in mind for the next
round of effort on block handles.

On the other hand, 4K blocks on 64K pages better work really well or
those 64K systems will be turkeys.

Regards,

Daniel

2009-03-12 13:27:17

by Andi Kleen

[permalink] [raw]
Subject: Re: Tux3 report: Tux3 Git tree available

Andrew Morton <[email protected]> writes:

As a general comment I'm not sure it really makes all that sense to
integrate anywhere without recovery functionality because noone can
really use in that state. I think it would be better to concentrate
reviewing/testing efforts on other submitted file systems that
actually work and are used today (like ceph or nilfs2)

Also to be honest the estimate of adding btree directories in 3kLOC
of code seems very optimistic.

>
>> > - count_range() is an unsuitable identifier for a kernel-wide symbol.
>> > Please review all global symbols in the fs, ensure that they are
>> > prefixed with "tux3_". Or make them static, of course.
>>
>> This symbol is only supposed to be shared between separate compilation
>> units within fs/tux3, not be visible outside our module. How do we do
>> that?
>
> You can't. Bear in mind that we want the allyesconfig kernel to link
> and run, and that includes tux3. So do it manually by prefixing
> everything with tux3_ or whatever. (does it need the "3"?)

AFAIK it could be done by using special linker scripts for the sublinks.
But I'm not sure it's worth it because that would require listing of the symbols
in some file. Would need some Makefile magic.

For modules I also had the "namespace" patches some time ago which implemented
simple namespace support. Should probably resurrect that.

> I'm not sure why, really - I have vague memories of Linus having an
> episode... It seems an OK construct if used tastefully. Although it
> does make it easy to hide nasty surprises.

The original reason was gcc 2.95, but that got dropped.

Then I loved to use c99 mixed code/declarations (imho putting the declaration
near the code improves readability), but someone pointed
out that having them makes it harder to catch patch mistakes when
patch puts a hunk in the wrong place.

-Andi

--
[email protected] -- Speaking for myself only.

2009-03-12 14:00:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote:
> As far as the per-block pagecache state (as opposed to the per-block fs
> state), I don't see any reason it is a problem for efficiency. We have to
> do per-page operations anyway.

Why? Wouldn't it be nice if we could do arbitrary extents? I suppose
superpages or soft page sizes get us most of the way there, but the
rounding or pieces at the end are a bit of a pain. Sure, it'll be a
huge upheaval for the VM, but we're good at huge upheavals ;-)

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-03-12 14:20:16

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Friday 13 March 2009 00:59:40 Matthew Wilcox wrote:
> On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote:
> > As far as the per-block pagecache state (as opposed to the per-block fs
> > state), I don't see any reason it is a problem for efficiency. We have to
> > do per-page operations anyway.
>
> Why? Wouldn't it be nice if we could do arbitrary extents? I suppose
> superpages or soft page sizes get us most of the way there, but the
> rounding or pieces at the end are a bit of a pain. Sure, it'll be a
> huge upheaval for the VM, but we're good at huge upheavals ;-)

Sounds nice in theory. I personally think it would be very very hard to
do well, and wouldn't and up being much of a win if at all anyway.

But if you have some structure representing arbitrary pagecache extents,
then you would probably be able to naturally use those pagecache extents
themselves, or parallel very similar data structures for holding block
state.

2009-03-12 15:28:33

by Diego Calleja

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Jueves 12 Marzo 2009 09:33:02 Daniel Phillips escribi?:
> > umm, it _is_ primarily kernel code, after all. So it seems appropriate
> > to use BUG_ON() and printk(), etc and to provide versions of those for
> > the userspace incarnation?
>
> Will do.

btrfs-progs also runs some btrfs kernel code in userspace, tux3 could reuse
their "compatibility layer", it can be found at btrfs-progs/kerncompat.h

2009-03-12 16:18:44

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Andrew Morton <[email protected]> writes:

> - When 'bh' is known to be non-NULL, use put_bh() rather than brelse().

It sounds strange. Almost all bh is non-NULL. This means we are going to
replace almost all brelse() by put_bh()?

Thanks.
--
OGAWA Hirofumi <[email protected]>

2009-03-12 16:54:37

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Daniel Phillips <[email protected]> writes:

>> > > - What's "L"?
>> > >
>> > > printf("%Lx-", (L)begin);
>> >
>> > A very handy way of working around 32/64 bit format string issues. We
>> > just cast all the messy cases to (long long), aka (L). All other
>> > solutions to this messy problem are worse in my opinion, but whatever
>> > the ruling is, is what we will do. This is used heavily in tracing and
>> > dumping code, which can all be turned off with ifdefs, so it doesn't
>> > affect production kernel text.
>>
>> What format string issues are we talking about here?
>>
>> See, a number of them will be fixed real soon now (geologically
>> speaking) when various 64-bit architectures switch their s64/u64
>> implementation from `long' to `long long'.
>
> Ah, that would be helpful. But not done yet? How long until it
> happens, and does it make sense to wait, so we can reduce the number
> of problems cases? And... will it be all 64 bit arches or just some?
> Because this issue isn't solved if it isn't fixed for all arches.
>
> There are a couple of issues, one is u64 being (long) instead of
> (long long) as you say, and the other is variable type sizes like
> loff_t. That specific one isn't actually a problem, we can just refuse
> to support 32 bit libc file ops, but there may be others. We had a
> world of pain before (L) arrived, then with (L) it was easy. Maybe
> just edit them all to (long long) for now, and damn the line length.

BTW, those are almost because of userland issue. Kernel are more and
more using same type. But, glibc is not. And we (tux3) are sharing the
same code with kernel and userland. Some types are depending to
CONFIG_*, so if we have generic cast type like (L).

[The fatfs also has own type (llu), if it become generic, fatfs will
also be happy.]

Thanks.
--
OGAWA Hirofumi <[email protected]>

2009-03-12 17:01:20

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Nick Piggin <[email protected]> writes:

>> There are a couple of issues, one is u64 being (long) instead of
>> (long long) as you say, and the other is variable type sizes like
>> loff_t. That specific one isn't actually a problem, we can just refuse
>> to support 32 bit libc file ops, but there may be others. We had a
>> world of pain before (L) arrived, then with (L) it was easy. Maybe
>> just edit them all to (long long) for now, and damn the line length.
>
> Yes please do this. A significant style change like this that lots of
> code already does I think is best first discussed as a standalone
> change to kernel rather than everyone developing their own convention.

BTW, personally I wonder whether there are any suggestion - how early
stage do we try to start review process? Someone suggests it should be
early, but how early?

I know tux3 still have to work for various things, however, people join
to develop those, and see the progress? Or people just want to see the
result of various work later? ...
--
OGAWA Hirofumi <[email protected]>

2009-03-12 17:07:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thu, Mar 12, 2009 at 10:03:31PM +1100, Nick Piggin wrote:
> It is basically already proven. It is faster with ext2 and it works with
> XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't
> really able to grok XFS enough to convert it). And works with minix
> with larger block size than page size (except some places where core
> pagecache code needs some hacking that I haven't got around to).
>
> Yes an ext3 conversion would probably reveal some tweaks or fixes to
> fsblock. I might try doing ext3 next. I suspect most of the problems
> would be fitting ext3 to much stricter checks and consistency required
> by fsblock, rather than adding ext3-required features to fsblock.
>
> ext3 will be a tough one to convert because it is complex, very stable,
> and widely used so there are lots of reasons not to make big changes to
> it.

One possibility would be to do this with ext4 instead, since there are
fewer users, and it has more a "development" feel to it. OTOH, there
are poeple (including myself) who are using ext4 in production
already, and I'd appreciate not having my source trees on my laptop
getting toasted. :-)

Is it going to be possible to make the fsblock conversion being
something which is handled via CONFIG_EXT4_FSBLOCK #ifdefs, or are the
changes too invasive to really allow that? (Also note BTW that ocfs2
is also using jbd2, so we need to be careful we don't break ocfs2
while we're doing the fsblock conversion.)

- Ted

2009-03-12 20:04:49

by Andrew Morton

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Fri, 13 Mar 2009 01:18:29 +0900
OGAWA Hirofumi <[email protected]> wrote:

> Andrew Morton <[email protected]> writes:
>
> > - When 'bh' is known to be non-NULL, use put_bh() rather than brelse().
>
> It sounds strange. Almost all bh is non-NULL. This means we are going to
> replace almost all brelse() by put_bh()?
>

Well.. you can make up your own mind about this. If you see benefit
in the NULL-checking and extra debugging which brelse() provides then
continue to use brelse().

2009-03-12 20:46:56

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Andrew Morton <[email protected]> writes:

> On Fri, 13 Mar 2009 01:18:29 +0900
> OGAWA Hirofumi <[email protected]> wrote:
>
>> Andrew Morton <[email protected]> writes:
>>
>> > - When 'bh' is known to be non-NULL, use put_bh() rather than brelse().
>>
>> It sounds strange. Almost all bh is non-NULL. This means we are going to
>> replace almost all brelse() by put_bh()?
>>
>
> Well.. you can make up your own mind about this. If you see benefit
> in the NULL-checking and extra debugging which brelse() provides then
> continue to use brelse().

I thought someone started to convert it. Ok, personally, I think
NULL-check is just not needed always, and if it is needed, check it
explicitly.
--
OGAWA Hirofumi <[email protected]>

2009-03-12 21:02:27

by Roland Dreier

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

> By the way, the endian comment in bitmap.c needs some loving:
>
> http://lxr.linux.no/linux+v2.6.28.7/lib/bitmap.c#L35
>
> The s390 file needs to be config-independent:
>
> - include/asm-s390/bitops.h
> + arch/s390/include/asm/bitops.h
>
> and the ppc file is nowhere to be seen. I will put a driveby patch on
> the to.do list.

The ppc file is arch/powerpc/include/asm/bitops.h.

- R.

2009-03-12 21:24:24

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Andi Kleen wrote:
> to be honest the estimate of adding btree directories in 3kLOC
> of code seems very optimistic.

Is that a bet?

> >> This symbol is only supposed to be shared between separate compilation
> >> units within fs/tux3, not be visible outside our module. How do we do
> >> that?
> >
> > You can't. Bear in mind that we want the allyesconfig kernel to link
> > and run, and that includes tux3. So do it manually by prefixing
> > everything with tux3_ or whatever. (does it need the "3"?)
>
> AFAIK it could be done by using special linker scripts for the sublinks.
> But I'm not sure it's worth it because that would require listing of the symbols
> in some file. Would need some Makefile magic.
>
> For modules I also had the "namespace" patches some time ago which implemented
> simple namespace support. Should probably resurrect that.

Per-module name scope would be a nice creature comfort. Well, for now
the offending names will just get tux_ed.

Regards,

Daniel

2009-03-12 23:38:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thu, Mar 12, 2009 at 02:24:08PM -0700, Daniel Phillips wrote:
> On Thursday 12 March 2009, Andi Kleen wrote:
> > to be honest the estimate of adding btree directories in 3kLOC
> > of code seems very optimistic.
>
> Is that a bet?

No, just a general comment comparing with other read/write btree
implementations.

-Andi
--
[email protected] -- Speaking for myself only.

2009-03-13 09:33:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Friday 13 March 2009 04:06:53 Theodore Tso wrote:
> On Thu, Mar 12, 2009 at 10:03:31PM +1100, Nick Piggin wrote:
> > It is basically already proven. It is faster with ext2 and it works with
> > XFS delalloc, unwritten etc blocks (mostly -- except where I wasn't
> > really able to grok XFS enough to convert it). And works with minix
> > with larger block size than page size (except some places where core
> > pagecache code needs some hacking that I haven't got around to).
> >
> > Yes an ext3 conversion would probably reveal some tweaks or fixes to
> > fsblock. I might try doing ext3 next. I suspect most of the problems
> > would be fitting ext3 to much stricter checks and consistency required
> > by fsblock, rather than adding ext3-required features to fsblock.
> >
> > ext3 will be a tough one to convert because it is complex, very stable,
> > and widely used so there are lots of reasons not to make big changes to
> > it.
>
> One possibility would be to do this with ext4 instead, since there are
> fewer users, and it has more a "development" feel to it. OTOH, there

Yes I think ext4 would be the best candidate for the next conversion.


> are poeple (including myself) who are using ext4 in production
> already, and I'd appreciate not having my source trees on my laptop
> getting toasted. :-)

Definitely ;)


> Is it going to be possible to make the fsblock conversion being
> something which is handled via CONFIG_EXT4_FSBLOCK #ifdefs, or are the
> changes too invasive to really allow that? (Also note BTW that ocfs2
> is also using jbd2, so we need to be careful we don't break ocfs2
> while we're doing the fsblock conversion.)

Hmm, it would be difficult. I think once I get a patch working, it
wouldn't be too hard to maintain out of tree though (there tends to
be just a smallish number of patterns used many times). I'd start
by doing that, and see how it looks from there.

2009-03-15 02:41:28

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Nick Piggin wrote:
> On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote:
> > > fsblocks in their refcount mode don't tend to _cache_ physical block
> > > addresses either, because they're only kept around for as long as they
> > > are required (eg. to write out the page to avoid memory allocation
> > > deadlock problems).
> > >
> > > But some filesystems don't do very fast block lookups and do want a
> > > cache. I did a little extent map library on the side for that.
> >
> > Sure, good plan. We are attacking the transfer path, so that all the
> > transfer state goes directly from the filesystem into a BIO and doesn't
> > need that twisty path back and forth to the block library. The BIO
> > remembers the physical address across the transfer cycle. If you must
> > still support those twisty paths for compatibility with the existing
> > buffer.c scheme, you have a much harder project.
>
> I don't quite know what you mean. You have a set of dirty cache that
> needs to be written. So you need to know the block addresses in order
> to create the bio of course.
>
> fsblock allocates the block and maps[*] the block at pagecache *dirty*
> time, and holds onto it until writeout is finished.

As it happens, Tux3 also physically allocates each _physical_ metadata
block (i.e., what is currently called buffer cache) at the time it is
dirtied. I don't know if this is the best thing to do, but it is
interesting that you do the same thing. I also don't know if I want to
trust a library to get this right, before having completely proved out
the idea in a non-trival filesystem. But good luck with that! It
seems to me like a very good idea to take Ted up on his offer and try
out your library on Ext4. This is just a gut feeling, but I think you
will need many iterations to refine the idea. Just working, and even
showing benchmark improvement is not enough. If it is a core API
proposal, it needs a huge body of proof. If you end up like JBD with
just one user, because it actually only implements the semantics of
exactly one filesystem, then the extra overhead of unused generality
will just mean more lines of code to maintain and more places for bugs
to hide.

This is all general philosophy of course. Actually reading your code
would help a lot. By comparision, I intend the block handles library
to be a few hundred lines of code, including new incarnations of
buffer.c functionality like block_read/write_*. If this is indeed
possible, and it does the job with 4 bytes per block on a 1K block/4K
page configuration as it does in the prototype, then I think I would
prefer a per-filesystem solution and let it evolve that way for a long
time before attempting a library. But that is just me.

I suppose you would like to see some code?

> In something like
> ext2, finding the offset->block map can require buffercache allocations
> so it is technically deadlocky if you have to do it at writeout time.

I am not sure what "technically" means. Pretty much everything you do
in this area has high deadlock risk. That is one of the things that
scares me about trying to handle every filesystem uniformly. How would
filesystem writers even know what the deadlock avoidance rules are,
thus what they need to do in their own filesystem to avoid it?

Anyway, the Tux3 reason for doing the allocation at dirty time is, this
is the only time the filesystem knows what the parent block of a given
metadata block is. Note that we move btree blocks around when they are
dirtied, and thus need to know the parent in order to update the parent
pointer to the child. This is a complication you will not run into in
any of the filesystems you have poked at so far. This subtle detail is
very much filesystem specific, or it is specific to the class of
filesystems that do remap on write. Good luck knowing how to generalize
that before Linux has seen even one of them up and doing real production
work.

> [*] except in the case of delalloc. fsblock does its best, but for
> complex filesystems like delalloc, some memory reservation would have
> to be done by the fs.

And that is a whole, huge and critical topic. Again, something that I
feel needs to be analyzed per filesystem, until we have considerably
more experience with the issues.

> > > > The block handles patch is one of those fun things we have on hold for
> > > > the time being while we get the more mundane
> > >
> > > Good luck with it. I suspect that doing filesystem-specific layers to
> > > duplicate basically the same functionality but slightly optimised for
> > > the specific filesystem may not be a big win. As you say, this is where
> > > lots of nasty problems have been, so sharing as much code as possible
> > > is a really good idea.
> >
> > The big win will come from avoiding the use of struct buffer_head as
> > an API element for mapping logical cache to disk, which is a narrow
> > constriction when the filesystem wants to do things with extents in
> > btrees. It is quite painful doing a btree probe for every ->get_block
> > the way it is now. We want probe... page page page page... submit bio
> > (or put it on a list for delayed allocation).
> >
> > Once we have the desired, nice straight path above then we don't need
> > most of the fields in buffer_head, so tightening it up into a bitmap,
> > a refcount and a pointer back to the page makes a lot of sense. This
> > in itself may not make a huge difference, but the reduction in cache
> > pressure ought to be measurable and worth the not very many lines of
> > code for the implementation.
>
> I haven't done much about this in fsblock yet. I think some things need
> a bit of changing in the pagecache layer (in the block library, eg.
> write_begin/write_end doesn't have enough info to reserve/allocate a big
> range of blocks -- we need a callback higher up to tell the filesystem
> that we will be writing xxx range in the file, so get things ready for
> us).

That would be write_cache_pages, it already exists and seems perfectly
serviceable.

> As far as the per-block pagecache state (as opposed to the per-block fs
> state), I don't see any reason it is a problem for efficiency. We have to
> do per-page operations anyway.

I don't see a distinction between page cache vs fs state for a block.
Tux3 has these scalar block states:

EMPTY - not read into cache yet
CLEAN - cache data matches disk data (which might be a hole)
DIRTY0 .. DIRTY3 - dirty in one of up to four pipelined delta updates

Besides the per-page block reference count (hmm, do we really need it?
Why not rely on the page reference count?) there is no cache-specific
state, it is all "fs" state.

To complete the enumeration of state Tux3 represents in block handles,
there is also a per-block lock bit, used for reading blocks, the same
as buffer lock. So far there is no writeback bit, which does not seem
to be needed, because the flow of block writeback is subtly different
from page writeback. I am not prepared to defend that assertion just
yet! But I think the reason for this is, there is no such thing as
redirty for metadata blocks in Tux3, there is only "dirty in a later
delta", and that implies redirecting the block to a new physical
location that has its own, separate block state. Anyway, this is a
pretty good example of why you may find it difficult to generalize your
library to handle every filesystem. Is there any existing filesystem
that works this way? How would you know in advance what features to
include in your library to handle it? Will some future filesystem
have very different requirements, not handled by your library? If you
have finally captured every feature, will they interact? Will all
these features be confusing to use and hard to analyze? I am not
saying you can't solve all these problems, just that it is bound to be
hard, take a long time, and might possibly end up less elegant than a
more lightweight approach that leaves the top level logic in the hands
of the filesystem.

Regards,

Daniel

2009-03-15 03:04:56

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Andi Kleen wrote:
> On Thu, Mar 12, 2009 at 02:24:08PM -0700, Daniel Phillips wrote:
> > On Thursday 12 March 2009, Andi Kleen wrote:
> > > to be honest the estimate of adding btree directories in 3kLOC
> > > of code seems very optimistic.
> >
> > Is that a bet?
>
> No, just a general comment comparing with other read/write btree
> implementations.

Htree is about 1650 lines in Ext3. The Tux3 implementation does not
have to implement its own custom btree code because it can use the
generic tux3/btree.c code (with index node operations generalized as
leaf operations already are). The Tux3 implementation will be shorter
if anything, because of using the generic btree code, no need for
backward compatibility, and having had several years to think about
how to handle the NFS readdir issue more compactly.

So I guess the Tux3 directory btree will come in somewhere around
1500 lines.

Regards,

Daniel

2009-03-15 03:25:06

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Matthew Wilcox wrote:
> On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote:
> > As far as the per-block pagecache state (as opposed to the per-block fs
> > state), I don't see any reason it is a problem for efficiency. We have to
> > do per-page operations anyway.
>
> Why? Wouldn't it be nice if we could do arbitrary extents? I suppose
> superpages or soft page sizes get us most of the way there, but the
> rounding or pieces at the end are a bit of a pain. Sure, it'll be a
> huge upheaval for the VM, but we're good at huge upheavals ;-)

Actually, filesystem extents tend to erode the argument for superpages.
There are three reasons we have seen for wanting big pages: 1) support
larger block buffers without adding messy changes to buffer.c; 2) TLB
efficiency; 3) less per-page state in kernel memory. TLB efficiency is
only there if the hardware supports it, which X86 arguably doesn't.
The main argument for larger block buffers is less per-block transfer
setup overhead, but the BIO model combined with filesystem extents
does that job better, or at least it will when filesystems learn to
take better advantage of this.

VM extents on the other hand could possibly do a really good job of
reducing per-page VM overhead, if anybody still cares about that now
that 64 bit machines rule the big iron world.

I expect implementing VM extents to be a brutally complex project, as
filesystem extents always turn out to be, even though one tends to
enter such projects thinking, how hard could this be? Answer: harder
than you think. But VM extents would be good for a modest speedup, so
somebody is sure to get brave enough to try it sometime.

Regards,

Daniel

2009-03-15 03:37:06

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, OGAWA Hirofumi wrote:
> Daniel Phillips <[email protected]> writes:
> >> > > - What's "L"?
> >> > >
> >> > > printf("%Lx-", (L)begin);
> >> >
> >> > A very handy way of working around 32/64 bit format string issues. We
> >> > just cast all the messy cases to (long long), aka (L). All other
> >> > solutions to this messy problem are worse in my opinion, but whatever
> >> > the ruling is, is what we will do. This is used heavily in tracing and
> >> > dumping code, which can all be turned off with ifdefs, so it doesn't
> >> > affect production kernel text.
> >>
> >> What format string issues are we talking about here?
> >>
> >> See, a number of them will be fixed real soon now (geologically
> >> speaking) when various 64-bit architectures switch their s64/u64
> >> implementation from `long' to `long long'.
> >
> > Ah, that would be helpful. But not done yet? How long until it
> > happens, and does it make sense to wait, so we can reduce the number
> > of problems cases? And... will it be all 64 bit arches or just some?
> > Because this issue isn't solved if it isn't fixed for all arches.
> >
> > There are a couple of issues, one is u64 being (long) instead of
> > (long long) as you say, and the other is variable type sizes like
> > loff_t. That specific one isn't actually a problem, we can just refuse
> > to support 32 bit libc file ops, but there may be others. We had a
> > world of pain before (L) arrived, then with (L) it was easy. Maybe
> > just edit them all to (long long) for now, and damn the line length.
>
> BTW, those are almost because of userland issue. Kernel are more and
> more using same type. But, glibc is not. And we (tux3) are sharing the
> same code with kernel and userland. Some types are depending to
> CONFIG_*, so if we have generic cast type like (L).
>
> [The fatfs also has own type (llu), if it become generic, fatfs will
> also be happy.]
>
> Thanks.

Maybe we should argue for some generic flavor of the (L)/(llu) idea
then. I suppose we should figure out exactly how much of our usage
will remain after the kernel issue is resolved. One small thing we
could do is make it a typedef instead of a macro. And spelling it
out completely as (long long) is not so bad, except it loses the
desirable property of being able to grep for the messy thing, and
adds a painful amount of useless line length, given how frequently
the issue shows up.

Regards,

Daniel

2009-03-15 03:45:30

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Sunday 15 March 2009 13:41:09 Daniel Phillips wrote:
> On Thursday 12 March 2009, Nick Piggin wrote:
> > On Thursday 12 March 2009 23:24:33 Daniel Phillips wrote:
> > > > fsblocks in their refcount mode don't tend to _cache_ physical block
> > > > addresses either, because they're only kept around for as long as
> > > > they are required (eg. to write out the page to avoid memory
> > > > allocation deadlock problems).
> > > >
> > > > But some filesystems don't do very fast block lookups and do want a
> > > > cache. I did a little extent map library on the side for that.
> > >
> > > Sure, good plan. We are attacking the transfer path, so that all the
> > > transfer state goes directly from the filesystem into a BIO and doesn't
> > > need that twisty path back and forth to the block library. The BIO
> > > remembers the physical address across the transfer cycle. If you must
> > > still support those twisty paths for compatibility with the existing
> > > buffer.c scheme, you have a much harder project.
> >
> > I don't quite know what you mean. You have a set of dirty cache that
> > needs to be written. So you need to know the block addresses in order
> > to create the bio of course.
> >
> > fsblock allocates the block and maps[*] the block at pagecache *dirty*
> > time, and holds onto it until writeout is finished.
>
> As it happens, Tux3 also physically allocates each _physical_ metadata
> block (i.e., what is currently called buffer cache) at the time it is
> dirtied. I don't know if this is the best thing to do, but it is
> interesting that you do the same thing. I also don't know if I want to
> trust a library to get this right, before having completely proved out
> the idea in a non-trival filesystem. But good luck with that! It

I'm not sure why it would be a big problem. fsblock isn't allocating
the block itself of course, it just asks the filesystem to. It's
trivial to do for fsblock.


> seems to me like a very good idea to take Ted up on his offer and try
> out your library on Ext4. This is just a gut feeling, but I think you
> will need many iterations to refine the idea. Just working, and even
> showing benchmark improvement is not enough. If it is a core API
> proposal, it needs a huge body of proof. If you end up like JBD with
> just one user, because it actually only implements the semantics of
> exactly one filesystem, then the extra overhead of unused generality
> will just mean more lines of code to maintain and more places for bugs
> to hide.

I don't know what you're thinking is so difficult with it. I've already
converted minix, ext2, and xfs and they seem to work fine. There is not
really fundamentally anything that buffer heads can do that fsblock can't.


> This is all general philosophy of course. Actually reading your code
> would help a lot. By comparision, I intend the block handles library
> to be a few hundred lines of code, including new incarnations of
> buffer.c functionality like block_read/write_*. If this is indeed
> possible, and it does the job with 4 bytes per block on a 1K block/4K
> page configuration as it does in the prototype, then I think I would
> prefer a per-filesystem solution and let it evolve that way for a long
> time before attempting a library. But that is just me.

If you're tracking pagecache state in these things, then I can't see how
it can get any easier just because it is smaller. In which case, your
concerns about duplicating functionality of this layer.


> I suppose you would like to see some code?
>
> > In something like
> > ext2, finding the offset->block map can require buffercache allocations
> > so it is technically deadlocky if you have to do it at writeout time.
>
> I am not sure what "technically" means. Pretty much everything you do

Technically means that it is deadlocky. Today, practically every Linux
filesystem technically has memory deadlocks. In practice, the mm does
keep reserves around to help this and so it is very very hard to hit.


> in this area has high deadlock risk. That is one of the things that
> scares me about trying to handle every filesystem uniformly. How would
> filesystem writers even know what the deadlock avoidance rules are,
> thus what they need to do in their own filesystem to avoid it?

The rule is simple: if forward progress requires resource allocation,
then you would ensure resource deadlocks are avoided or can be recovered
from.

I don't think many fs developers actually care very much, but obviously
a rewrite of such core functionality must not introduce such deadlocks
by design.


> Anyway, the Tux3 reason for doing the allocation at dirty time is, this
> is the only time the filesystem knows what the parent block of a given
> metadata block is. Note that we move btree blocks around when they are
> dirtied, and thus need to know the parent in order to update the parent
> pointer to the child. This is a complication you will not run into in
> any of the filesystems you have poked at so far. This subtle detail is
> very much filesystem specific, or it is specific to the class of
> filesystems that do remap on write. Good luck knowing how to generalize
> that before Linux has seen even one of them up and doing real production
> work.

Uh, this kind of stuff is completely not what fsblock would try to do.
fsblock gives the filesystem notifications when the block gets dirtied,
when the block is prepared for writeout, etc.

It is up to the filesystem to do everything else (with the postcondition
that the block is mapped after being prepared for writeout).


> > [*] except in the case of delalloc. fsblock does its best, but for
> > complex filesystems like delalloc, some memory reservation would have
> > to be done by the fs.
>
> And that is a whole, huge and critical topic. Again, something that I
> feel needs to be analyzed per filesystem, until we have considerably
> more experience with the issues.

Again, fsblock does as much as it can up to guaranteeing fsblock metadata
(and hence, any filesystem private data is attached to the fsblock) is
allocated as long as the block is dirty.

Of course the actual delalloc scheme is filesystem specific and can't be
handled by fsblock.


> > I haven't done much about this in fsblock yet. I think some things need
> > a bit of changing in the pagecache layer (in the block library, eg.
> > write_begin/write_end doesn't have enough info to reserve/allocate a big
> > range of blocks -- we need a callback higher up to tell the filesystem
> > that we will be writing xxx range in the file, so get things ready for
> > us).
>
> That would be write_cache_pages, it already exists and seems perfectly
> serviceable.

No it isn't. That's completely different.


> > As far as the per-block pagecache state (as opposed to the per-block fs
> > state), I don't see any reason it is a problem for efficiency. We have to
> > do per-page operations anyway.
>
> I don't see a distinction between page cache vs fs state for a block.
> Tux3 has these scalar block states:
>
> EMPTY - not read into cache yet
> CLEAN - cache data matches disk data (which might be a hole)
> DIRTY0 .. DIRTY3 - dirty in one of up to four pipelined delta updates
>
> Besides the per-page block reference count (hmm, do we really need it?
> Why not rely on the page reference count?) there is no cache-specific
> state, it is all "fs" state.

dirty / uptodate is a property of the cache.


> To complete the enumeration of state Tux3 represents in block handles,
> there is also a per-block lock bit, used for reading blocks, the same
> as buffer lock. So far there is no writeback bit, which does not seem
> to be needed, because the flow of block writeback is subtly different
> from page writeback. I am not prepared to defend that assertion just
> yet! But I think the reason for this is, there is no such thing as
> redirty for metadata blocks in Tux3, there is only "dirty in a later
> delta", and that implies redirecting the block to a new physical
> location that has its own, separate block state. Anyway, this is a
> pretty good example of why you may find it difficult to generalize your
> library to handle every filesystem. Is there any existing filesystem
> that works this way? How would you know in advance what features to
> include in your library to handle it? Will some future filesystem
> have very different requirements, not handled by your library? If you
> have finally captured every feature, will they interact? Will all
> these features be confusing to use and hard to analyze? I am not
> saying you can't solve all these problems, just that it is bound to be
> hard, take a long time, and might possibly end up less elegant than a
> more lightweight approach that leaves the top level logic in the hands
> of the filesystem.

It's not meant to handle every possible feature of every current and
future fs! It's meant to replace buffer-head. If there is some common
filesystem feature in future that makes sense to generalise and support
in fsblock then great.

2009-03-15 03:51:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote:
> On Thursday 12 March 2009, Matthew Wilcox wrote:
> > On Fri, Mar 13, 2009 at 12:04:40AM +1100, Nick Piggin wrote:
> > > As far as the per-block pagecache state (as opposed to the per-block fs
> > > state), I don't see any reason it is a problem for efficiency. We have
> > > to do per-page operations anyway.
> >
> > Why? Wouldn't it be nice if we could do arbitrary extents? I suppose
> > superpages or soft page sizes get us most of the way there, but the
> > rounding or pieces at the end are a bit of a pain. Sure, it'll be a
> > huge upheaval for the VM, but we're good at huge upheavals ;-)
>
> Actually, filesystem extents tend to erode the argument for superpages.
> There are three reasons we have seen for wanting big pages: 1) support
> larger block buffers without adding messy changes to buffer.c; 2) TLB
> efficiency; 3) less per-page state in kernel memory. TLB efficiency is
> only there if the hardware supports it, which X86 arguably doesn't.
> The main argument for larger block buffers is less per-block transfer
> setup overhead, but the BIO model combined with filesystem extents
> does that job better, or at least it will when filesystems learn to
> take better advantage of this.
>
> VM extents on the other hand could possibly do a really good job of
> reducing per-page VM overhead, if anybody still cares about that now
> that 64 bit machines rule the big iron world.
>
> I expect implementing VM extents to be a brutally complex project, as
> filesystem extents always turn out to be, even though one tends to
> enter such projects thinking, how hard could this be? Answer: harder
> than you think. But VM extents would be good for a modest speedup, so
> somebody is sure to get brave enough to try it sometime.

I don't think there is enough evidence to be able to make such an
assertion.

When you actually implement extent splitting and merging in a deadlock
free manner and synchronize everything properly I wouldn't be surprised
if it is slower most of the time. If it was significantly faster, then
memory fragmentation means that it is going to get significantly slower
over the uptime of the kernel, so you would have to virtually map the
kernel and implement memory defragmentation, at which point you get even
slower and more complex.

2009-03-15 03:54:22

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, OGAWA Hirofumi wrote:
> Nick Piggin <[email protected]> writes:
> >> There are a couple of issues, one is u64 being (long) instead of
> >> (long long) as you say, and the other is variable type sizes like
> >> loff_t. That specific one isn't actually a problem, we can just refuse
> >> to support 32 bit libc file ops, but there may be others. We had a
> >> world of pain before (L) arrived, then with (L) it was easy. Maybe
> >> just edit them all to (long long) for now, and damn the line length.
> >
> > Yes please do this. A significant style change like this that lots of
> > code already does I think is best first discussed as a standalone
> > change to kernel rather than everyone developing their own convention.
>
> BTW, personally I wonder whether there are any suggestion - how early
> stage do we try to start review process? Someone suggests it should be
> early, but how early?
>
> I know tux3 still have to work for various things, however, people join
> to develop those, and see the progress? Or people just want to see the
> result of various work later? ...

It is not just how quickly the work gets done, but the quality of work
that is more widely reviewed and has more total review cycles. I guess
we are proceeding about the right speed in that direction: not slow,
not fast. We won't be asking for merging immediately, just review.

There is a lot to be said for finishing up the most interesting
pre-merge bit, atomic commit, "in public", which of course we have done
all along, but it could be more public. I would like to see a few more
folks get the "aha" about what we are doing with the cache model, the
hybrid logging and other details.

Regards,

Daniel

2009-03-15 03:59:18

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, OGAWA Hirofumi wrote:
> Andrew Morton <[email protected]> writes:
>
> > On Fri, 13 Mar 2009 01:18:29 +0900
> > OGAWA Hirofumi <[email protected]> wrote:
> >
> >> Andrew Morton <[email protected]> writes:
> >>
> >> > - When 'bh' is known to be non-NULL, use put_bh() rather than brelse().
> >>
> >> It sounds strange. Almost all bh is non-NULL. This means we are going to
> >> replace almost all brelse() by put_bh()?
> >>
> >
> > Well.. you can make up your own mind about this. If you see benefit
> > in the NULL-checking and extra debugging which brelse() provides then
> > continue to use brelse().
>
> I thought someone started to convert it. Ok, personally, I think
> NULL-check is just not needed always, and if it is needed, check it
> explicitly.

I checked in a patch last week to convert all the brelses to "blockput"
(note how close the spelling is to Nick's block_put, an unsurprising
accident). This should mean a little less work when the time comes to
try out the block handles idea, and it surely doesn't hurt anything.

Regards,

Daniel

2009-03-15 04:03:40

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Thursday 12 March 2009, Roland Dreier wrote:
> > By the way, the endian comment in bitmap.c needs some loving:
> >
> > http://lxr.linux.no/linux+v2.6.28.7/lib/bitmap.c#L35
> >
> > The s390 file needs to be config-independent:
> >
> > - include/asm-s390/bitops.h
> > + arch/s390/include/asm/bitops.h
> >
> > and the ppc file is nowhere to be seen. I will put a driveby patch on
> > the to.do list.
>
> The ppc file is arch/powerpc/include/asm/bitops.h.

Ah right, so the commentary should just reference arch/ instead of
/..asm-<arch>. I am still hoping to hear an actual argument why little
endian bitops are "more natural" than big endian. I don't know of any
myself, either in terms of machine instructions, cycle times or
whatever.

Regards,

Daniel

2009-03-15 04:09:19

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Saturday 14 March 2009, Nick Piggin wrote:
> On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote:
> > I expect implementing VM extents to be a brutally complex project, as
> > filesystem extents always turn out to be, even though one tends to
> > enter such projects thinking, how hard could this be? Answer: harder
> > than you think. But VM extents would be good for a modest speedup, so
> > somebody is sure to get brave enough to try it sometime.
>
> I don't think there is enough evidence to be able to make such an
> assertion.
>
> When you actually implement extent splitting and merging in a deadlock
> free manner and synchronize everything properly I wouldn't be surprised
> if it is slower most of the time. If it was significantly faster, then
> memory fragmentation means that it is going to get significantly slower
> over the uptime of the kernel, so you would have to virtually map the
> kernel and implement memory defragmentation, at which point you get even
> slower and more complex.

You can make exactly the same argument about filesystem extents, and
we know that extents are faster there. So what is the fundamental
difference?

Regards,

Daniel

2009-03-15 04:14:46

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Sunday 15 March 2009 15:08:52 Daniel Phillips wrote:
> On Saturday 14 March 2009, Nick Piggin wrote:
> > On Sunday 15 March 2009 14:24:29 Daniel Phillips wrote:
> > > I expect implementing VM extents to be a brutally complex project, as
> > > filesystem extents always turn out to be, even though one tends to
> > > enter such projects thinking, how hard could this be? Answer: harder
> > > than you think. But VM extents would be good for a modest speedup, so
> > > somebody is sure to get brave enough to try it sometime.
> >
> > I don't think there is enough evidence to be able to make such an
> > assertion.
> >
> > When you actually implement extent splitting and merging in a deadlock
> > free manner and synchronize everything properly I wouldn't be surprised
> > if it is slower most of the time. If it was significantly faster, then
> > memory fragmentation means that it is going to get significantly slower
> > over the uptime of the kernel, so you would have to virtually map the
> > kernel and implement memory defragmentation, at which point you get even
> > slower and more complex.
>
> You can make exactly the same argument about filesystem extents, and
> we know that extents are faster there. So what is the fundamental
> difference?

Uh, aside from all the obvious fundamental differences there are, you
can only make such an assertion if performance characteristics and
usage patterns are very similar, nevermind fundamentally different...

2009-03-15 04:27:20

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Daniel Phillips <[email protected]> writes:

>> BTW, those are almost because of userland issue. Kernel are more and
>> more using same type. But, glibc is not. And we (tux3) are sharing the
>> same code with kernel and userland. Some types are depending to
>> CONFIG_*, so if we have generic cast type like (L).
>>
>> [The fatfs also has own type (llu), if it become generic, fatfs will
>> also be happy.]
>>
>> Thanks.
>
> Maybe we should argue for some generic flavor of the (L)/(llu) idea
> then. I suppose we should figure out exactly how much of our usage
> will remain after the kernel issue is resolved. One small thing we
> could do is make it a typedef instead of a macro.

It is already typedef?

typedef long long L; // widen for printf on 64 bit systems

> And spelling it out completely as (long long) is not so bad, except it
> loses the desirable property of being able to grep for the messy
> thing, and adds a painful amount of useless line length, given how
> frequently the issue shows up.

Yes. Well, it is depending on the warn/info/trace strategy of the
modules. I guess so many modules are not requiring it, because there is
no trace. But, if those are implementing the trace code or something
like it, I guess (long long) will bother devlopers.
--
OGAWA Hirofumi <[email protected]>

2009-03-15 21:45:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote:
> > As it happens, Tux3 also physically allocates each _physical_ metadata
> > block (i.e., what is currently called buffer cache) at the time it is
> > dirtied. I don't know if this is the best thing to do, but it is
> > interesting that you do the same thing. I also don't know if I want to
> > trust a library to get this right, before having completely proved out
> > the idea in a non-trival filesystem. But good luck with that! It
>
> I'm not sure why it would be a big problem. fsblock isn't allocating
> the block itself of course, it just asks the filesystem to. It's
> trivial to do for fsblock.

So the really unfortunate thing about allocating the block as soon as
the page is dirty is that it spikes out delayed allocation. By
delaying the physical allocation of the logical->physical mapping as
long as possible, the filesystem can select the best possible physical
location. XFS, for example, keeps a btree of free regions indexed by
size so that it can select the perfect location for a newly written
file which is 24k or 56k long. If fsblock forces the physical
allocation of blocks the moment the page is dirty, it will destroy
XFS's capability to select the perfect file.

In addition, XFS uses delayed allocation to avoid the problem of
uninitalized data becoming visible in the event of a crash. If
fsblock immediately allocates the physical block, then either the
unitialized data might become available on a system crash (which is a
security problem), or XFS is going to have to force all newly written
data blocks to disk before a commit. If that sounds familiar it's
what ext3's data=ordered mode does, and it's what is responsible for
the Firefox 3.0 fsync performance problem.

A similar issue exists for ext4's delayed allocation.

- Ted

2009-03-15 22:41:54

by Daniel Phillips

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Hi Ted,

On Sunday 15 March 2009, Theodore Tso wrote:
> On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote:
> > > As it happens, Tux3 also physically allocates each _physical_ metadata
> > > block (i.e., what is currently called buffer cache) at the time it is
> > > dirtied. I don't know if this is the best thing to do, but it is
> > > interesting that you do the same thing. I also don't know if I want to
> > > trust a library to get this right, before having completely proved out
> > > the idea in a non-trival filesystem. But good luck with that! It
> >
> > I'm not sure why it would be a big problem. fsblock isn't allocating
> > the block itself of course, it just asks the filesystem to. It's
> > trivial to do for fsblock.
>
> So the really unfortunate thing about allocating the block as soon as
> the page is dirty is that it spikes out delayed allocation. By
> delaying the physical allocation of the logical->physical mapping as
> long as possible, the filesystem can select the best possible physical
> location.

Tux3 does not dirty the metadata until data cache is flushed, so the
allocation decisions for data and metadata are made at the same time.
That is the reason for the distinction between physical metadata above,
and logical metadata such as directory data and bitmaps, which are
delayed. Though physical metadata is positioned when first dirtied,
physical metadata dirtying is delayed until delta commit.

Implementing this model (we are still working on it) requires taking
care of a lot of subtle details that are specific to the Tux3 cache
model. I have a hard time imagining those allocation decisions driven
by callbacks from a buffer-like library.

Regards,

Daniel

2009-03-16 05:12:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Sun, Mar 15, 2009 at 05:44:26PM -0400, Theodore Tso wrote:
> On Sun, Mar 15, 2009 at 02:45:04PM +1100, Nick Piggin wrote:
> > > As it happens, Tux3 also physically allocates each _physical_ metadata
> > > block (i.e., what is currently called buffer cache) at the time it is
> > > dirtied. I don't know if this is the best thing to do, but it is
> > > interesting that you do the same thing. I also don't know if I want to
> > > trust a library to get this right, before having completely proved out
> > > the idea in a non-trival filesystem. But good luck with that! It
> >
> > I'm not sure why it would be a big problem. fsblock isn't allocating
> > the block itself of course, it just asks the filesystem to. It's
> > trivial to do for fsblock.
>
> So the really unfortunate thing about allocating the block as soon as
> the page is dirty is that it spikes out delayed allocation. By
> delaying the physical allocation of the logical->physical mapping as
> long as possible, the filesystem can select the best possible physical
> location.

This is no different to the way delayed allocation with bufferheads
works. Both XFS and ext4 set the buffer_delay flag instead of
allocating up front so that later on in ->writepages we can do
optimal delayed allocation. AFAICT fsblock works the same way....

> XFS, for example, keeps a btree of free regions indexed by
> size so that it can select the perfect location for a newly written
> file which is 24k or 56k long.

Ah, no. It's far more complex than that. To begin with, XFS has
*two* freespace trees per allocation group - one indexed by extent size,
the other by extent starting block.

XFS looks for an exact or nearby extent start block match that is
big enough in the by-block tree. If it can't find a nearby match,
then it looks up a size match in the by-size tree. i.e. the
fundamental allocation assumption is that locality of data placement
matters far more than filling holes in the freespace trees.....

> In addition, XFS uses delayed allocation to avoid the problem of
> uninitalized data becoming visible in the event of a crash.

No it doesn't. Delayed allocation minimises the problem but doesn't
prevent it. It has been known for years (since before I joined SGI
in 2002) that there is a theoretical timing gap in XFS where the
allocation transaction can commit and a crash occur before data hits
the disk hence exposing stale data.

The reality is that no-one has ever reported exposing stale data in
this scenario, and there has been plenty of effort expended trying
to trigger it. Hence it has remained in the realm of a theoretical
problem....

> If
> fsblock immediately allocates the physical block, then either the
> unitialized data might become available on a system crash (which
> is a security problem), or XFS is going to have to force all newly
> written data blocks to disk before a commit. If that sounds
> familiar it's what ext3's data=ordered mode does, and it's what is
> responsible for the Firefox 3.0 fsync performance problem.

If this was to occur, the obvious solution to this problem is to
allocate unwritten extents and do conversion after data I/O
completion. That would result in correct metadata/data ordering in
all cases with only a small performance impact and without
introducing ext3-sync-the-world-like issues...

Ted, I appreciate you telling the world over and over again how bad
XFS is and what you think needs to be done to fix it. Truth is, this
would have been a much better email had you written about it from an
ext4 perspective. That way it wouldn't have been full of errors or
sound like a kid caught with his hand in the cookie jar:

"It's not my fault! I was only copying XFS! He did it first!"

Cheers,

Dave.
--
Dave Chinner
[email protected]

2009-03-16 06:39:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

Dave,

It wasn't my intention to say that XFS was bad; in fact, I thought I
was actually complementing XFS by talking about some of the advanced
features that XFS had (many of which I have always said that ext3 has,
and some of which ext4 still does not have, and probably never will
have). I stand corrected on some of the details that I got wrong.
What I was trying to say was that *if* (and perhaps I'm
misunderstanding fsblock) that fsblock is requiring that as soon as a
page is dirty, fsblock requests the filesystem to assign a block
allocation to the buffers attached to the dirty page, that this would
spike out delayed allocation, which would be unfortunate for *both*
ext4 and XFS.

But maybe I'm misunderstanding what fsblock is doing, and there isn't
a problem here.

Regards,

- Ted

2009-03-16 10:14:37

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Monday 16 March 2009 17:38:30 Theodore Tso wrote:
> Dave,
>
> It wasn't my intention to say that XFS was bad; in fact, I thought I
> was actually complementing XFS by talking about some of the advanced
> features that XFS had (many of which I have always said that ext3 has,
> and some of which ext4 still does not have, and probably never will
> have). I stand corrected on some of the details that I got wrong.
> What I was trying to say was that *if* (and perhaps I'm
> misunderstanding fsblock) that fsblock is requiring that as soon as a
> page is dirty, fsblock requests the filesystem to assign a block
> allocation to the buffers attached to the dirty page, that this would
> spike out delayed allocation, which would be unfortunate for *both*
> ext4 and XFS.
>
> But maybe I'm misunderstanding what fsblock is doing, and there isn't
> a problem here.

Yeah, Dave's understanding of fsblock is correct. I might have stated
things confusingly... fsblock allocates the in-memory fsblock metadata
structure (~= struct buffer_head) at the time of block dirtying. It
also asks the filesystem to respond to the dirtying event appropriately.
In the case of say ext2, this means allocating a block on disk. Wheras
XFS does the delalloc/reserve thing (yes, XFS appears to be working
with fsblock well enough to get this far).

fsblock really isn't too much different to buffer_head from an abstract
capability / functionality point of view except that it is often more
strict with things where I feel it makes sense.

So for this particular example; in buffer.c, buffers do tend to be
allocated when a page is dirtied, but not always, and even when they are,
they can get reclaimed while the page is still dirty. fsblock tigtens
this up.

2009-03-16 10:33:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [Tux3] Tux3 report: Tux3 Git tree available

On Monday 16 March 2009 09:41:35 Daniel Phillips wrote:
> Hi Ted,

> > So the really unfortunate thing about allocating the block as soon as
> > the page is dirty is that it spikes out delayed allocation. By
> > delaying the physical allocation of the logical->physical mapping as
> > long as possible, the filesystem can select the best possible physical
> > location.
>
> Tux3 does not dirty the metadata until data cache is flushed, so the
> allocation decisions for data and metadata are made at the same time.
> That is the reason for the distinction between physical metadata above,
> and logical metadata such as directory data and bitmaps, which are
> delayed. Though physical metadata is positioned when first dirtied,
> physical metadata dirtying is delayed until delta commit.
>
> Implementing this model (we are still working on it) requires taking
> care of a lot of subtle details that are specific to the Tux3 cache
> model. I have a hard time imagining those allocation decisions driven
> by callbacks from a buffer-like library.

The filesystem can get pagecache-block-dirty events in a few ways
(often a combination of):
write_begin/write_end, set_page_dirty, page_mkwrite, etc. Short of
implementing entirely your own write path (and even then you need to
hook at least page_mkwrite to catch mmapped writes, for completeness),
I don't see why a get_block(BLOCK_DIRTY) kind of callback is much
harder for you to imagine than any of the other callbacks. Actually
I imagine the block based callback should be easier for filesystems
that support any block size != page size because all the others are
page based.

I would like to hear firm details about any problems definitely,
because I would like to try to make it more generic even if your
filesystem won't use it :)

Now this is not to say the current buffer APIs are totally _optimal_.
As I said, I would like to see at least something along the lines of
"we are about to dirty range (x,y)" callback in the higher level
generic write code. But that's another story (which I am planning
to get to).