2008-12-31 03:36:07

by Daniel Phillips

Subject: Tux3 report: A Golden Copy

All versions of Tux3 are special, we think, but this one is a little
more special than all the rest: for the first time, it survives
repeated fsx-linux runs. That is mainly due to a burst of amazingly
accurate work by Hirofumi, and finishing up the SMP lock coverage.

Here is the HowTo again, lightly updated to install the first official
Tux3 Golden Copy:

    # Get a kernel tree:
    wget http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.26.5.tar.bz2
    tar -xjf linux-2.6.26.5.tar.bz2
    cd linux-2.6.26.5

    # Get the Christmas tux3 patch and patch the kernel:
    wget http://tux3.org/patches/tux3-2.6.26.5-4
    patch <tux3-2.6.26.5-4 -p1

    # Build linux with tux3:
    make defconfig
    make CONFIG_TUX3=y
    sudo make install

    # Get the Christmas tux3 userspace snapshot:
    wget http://tux3.org/downloads/snapshots/tux3-20081230.tar.gz
    tar -xzf tux3-20081230.tar.gz
    cd tux3/user
    make

    # Make a tux3 filesystem:
    sudo ./tux3 mkfs /dev/<testpartition>

Boot and mount!

    cat /proc/filesystems | grep tux3 && mount /dev/<testpartition> /mnt
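
The check-then-mount one-liner above can be wrapped in a small helper. This is only a sketch: the function name `fs_supported` and its optional second argument (an alternate file to read instead of /proc/filesystems, handy for trying it on canned data) are illustrative assumptions, not part of Tux3.

```shell
#!/bin/sh
# Return success if the named filesystem type appears in the kernel's
# list of registered filesystems.
# $1: filesystem name, e.g. tux3
# $2: file to read (hypothetical override; defaults to /proc/filesystems)
fs_supported() {
    # The filesystem name is the last field on each line; an optional
    # leading "nodev" marks filesystems that need no block device.
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

# Usage (not run here; needs root and a scratch partition):
#   fs_supported tux3 && mount /dev/<testpartition> /mnt
```

Matching the whole last field avoids the false positive that a plain substring grep could give if another filesystem name happened to contain "tux3".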

Recovery after crash or unexpected shutdown is easy:

    tux3 mkfs <volname>

In other words, do not expect Tux3 to survive this kind of treatment
just yet. Atomic commit is on the way, and is the last major change
planned before starting review.

Caveats are the same as before:

http://kerneltrap.org/mailarchive/linux-fsdevel/2008/12/26/4493214
"Tux3 for Christmas"

And finally the usual offer: If you drop by to lend a hand we will
treat you with respect and admiration for showing such courage, and we
will make you even more famous than you already are, by carving your
name in the Tux3 Hall of Fame:

http://tux3.org/about.html

Regards,

Daniel


2008-12-31 07:34:22

by sniper

Subject: Re: Tux3 report: A Golden Copy

On Wed, Dec 31, 2008 at 11:35 AM, Daniel Phillips wrote:
> [...]

Great, I have mounted a tux3 filesystem under UML using the instructions
in this mail, but I still can't debug it with gdb. Can anyone give me a
suggestion?

fqh@ubuntu:~/Desktop/linux-2.6.26.5$ gdb -args linux ubda=tuxroot ubdb=testdev
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...
(gdb) run
Starting program: /home/fqh/Desktop/linux-2.6.26.5/linux ubda=tuxroot ubdb=testdev
Locating the bottom of the address space ...
Program received signal SIGSEGV, Segmentation fault.
0x08066c7f in page_ok (page=<value optimized out>) at arch/um/os-Linux/sys-i386/task_size.c:31
31 n = *address;
(gdb)

2008-12-31 08:01:15

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Tuesday 30 December 2008 23:34, sniper wrote:
> Great, I have mounted tux3 filesystem under UML with stuffs in this mail,
> but I still can't debug it with gdb. Anyone gives me suggestion?

You just have to give a "cont" command a bunch of times and you will
eventually get to a command prompt. The reason is that UML uses the
segfault signal as part of its machine simulation, and there is no
existing way for UML and gdb to communicate in such a way that UML can
recognize that the signal came from its own code and filter it.

Jeff Dike is the expert on this, and Daniel Jacobowitz is the expert
on the gdb side. Fixing this would be a big effort, getting two complex
systems to cooperate better, with nontrivial API issues to solve. But
UML is such a wonderful kernel development tool that it might be worth
the effort.

In the meantime, you could just tell gdb to mask off all segfaults,
but that would be kind of problematic for debugging.

Regards,

Daniel

2008-12-31 08:15:21

by Justin P. Mattock

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Daniel Phillips wrote:
> On Tuesday 30 December 2008 23:34, sniper wrote:
>
>> Great, I have mounted tux3 filesystem under UML with stuffs in this mail,
>> but I still can't debug it with gdb. Anyone gives me suggestion?
>>
>
> You just have to give a "cont" command a bunch of times and you will
> eventually get to a command prompt. The reason for this is, uml uses
> the segfault interrupt as part of its machine simulation, and there
> is no existing way for uml and gdb to communicate in such a way that
> uml can recognize that the interrupt came from its own code and filter
> it.
>
> Jeff Dike is the expert on this, and Daniel Jacobowitz is the expert
> on the gdb side. Fixing this would be a big effort, getting two complex
> systems to cooperate better, with nontrivial API issues to solve. But
> UML is such a wonderful kernel development tool that it might be worth
> the effort.
>
> In the mean time, you could just tell gdb to mask off all segfaults,
> but would be kind of problematic for debugging.
>
> Regards,
>
> Daniel
Hmm.. seems like a redundancy.
Anyway, I looked at your site, but am still
confused about what tux3 is: what is tux3?
(At first I thought it was a video game, but I was wrong.)
Can I use tux3 to secure a Linux system, or is it for
something else?

regards;

Justin P. Mattock

2008-12-31 08:16:54

by sniper

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Wed, Dec 31, 2008 at 4:00 PM, Daniel Phillips wrote:
> On Tuesday 30 December 2008 23:34, sniper wrote:
>> Great, I have mounted tux3 filesystem under UML with stuffs in this mail,
>> but I still can't debug it with gdb. Anyone gives me suggestion?
>
> You just have to give a "cont" command a bunch of times and you will
> eventually get to a command prompt. The reason for this is, uml uses
> the segfault interrupt as part of its machine simulation, and there
> is no existing way for uml and gdb to communicate in such a way that
> uml can recognize that the interrupt came from its own code and filter
> it.

Yes, it works now. Thanks.

>
> Jeff Dike is the expert on this, and Daniel Jacobowitz is the expert
> on the gdb side. Fixing this would be a big effort, getting two complex
> systems to cooperate better, with nontrivial API issues to solve. But
> UML is such a wonderful kernel development tool that it might be worth
> the effort.
>
> In the mean time, you could just tell gdb to mask off all segfaults,
> but would be kind of problematic for debugging.
>
> Regards,
>
> Daniel
>

2008-12-31 08:31:39

by Dave Chinner

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Wed, Dec 31, 2008 at 12:00:54AM -0800, Daniel Phillips wrote:
> On Tuesday 30 December 2008 23:34, sniper wrote:
> > Great, I have mounted tux3 filesystem under UML with stuffs in this mail,
> > but I still can't debug it with gdb. Anyone gives me suggestion?
....
> In the mean time, you could just tell gdb to mask off all segfaults,
> but would be kind of problematic for debugging.

Not really. That's the default setting I use for XFS debugging. I
just put breakpoints on "panic" and "stop" (sometimes
bust_spinlocks) and just let the kernel panic routine handle the
segv which will trip a breakpoint. Then just walk back up the stack
to the function that triggered the real SEGV and go from there.

This is pretty much necessary for XFS debugging because
just mounting a filesystem causes 8-10 SEGV signals to occur.
You simply can't run xfsqa when that is occurring...
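
The setup described here can be captured in a gdb command file. The sketch below writes one; the helper name `write_uml_gdbinit` and the file name `uml.gdb` are made up for illustration, while `handle SIGSEGV nostop noprint pass` and `break` are standard gdb commands. Whether `panic` and `stop` resolve as breakpoint locations depends on the UML binary being debugged.

```shell
#!/bin/sh
# Write a gdb command file that lets UML's own SIGSEGVs pass through
# silently and stops only when the kernel reaches its panic path.
# $1: output path (hypothetical default: ./uml.gdb)
write_uml_gdbinit() {
    cat > "${1:-uml.gdb}" <<'EOF'
# Deliver SIGSEGV to UML without stopping or printing a message.
handle SIGSEGV nostop noprint pass
# Stop where the kernel reports a real failure.
break panic
break stop
EOF
}

# Usage (not run here):
#   write_uml_gdbinit uml.gdb
#   gdb -x uml.gdb --args linux ubda=tuxroot ubdb=testdev
```

Keeping the settings in a file passed with `-x` means nothing has to be added to a per-user .gdbinit on each development machine.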

Cheers,

Dave.
--
Dave Chinner

2008-12-31 09:41:08

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Wednesday 31 December 2008 00:31, Dave Chinner wrote:
> On Wed, Dec 31, 2008 at 12:00:54AM -0800, Daniel Phillips wrote:
> > On Tuesday 30 December 2008 23:34, sniper wrote:
> > > Great, I have mounted tux3 filesystem under UML with stuffs in this mail,
> > > but I still can't debug it with gdb. Anyone gives me suggestion?
> ....
> > In the mean time, you could just tell gdb to mask off all segfaults,
> > but would be kind of problematic for debugging.
>
> Not really. That's the default setting I use for XFS debugging. I
> just put breakpoints on "panic" and "stop" (sometimes
> bust_spinlocks) and just let the kernel panic routine handle the
> segv which will trip a breakpoint. Then just walk back up the stack
> to the function that triggered the real SEGV and go from there.
>
> This is pretty much necessary for XFS debugging because
> just mounting a filesystem causes 8-10 SEGV signals to occur.
> You simply can't run xfsqa when that is occurring...

Thanks for the howto. That was Jeff's suggestion also, but it would
be so much slicker if it was automagic, and first-time users would not
be constantly hitting this. I realize the difficulty. We have two
big tools that are not a perfect match, and don't know how to
cooperate to smooth out the bumps. Sigh. It was ever thus.

UML used to know about gdb, because it had to - the ptrace version of
UML could not be debugged as a regular task so gdb was execced from
UML. The result was very slick. You said "linux debug" and you would
land at the gdb command prompt from which you could continue, or
"linux debug=go" and it would continue without pausing, great for
rapid development. Now, it's all much more high tech and better I am
sure, but not as slick. You have to set up a .gdbinit, it's more to
do and not something you are going to get around to doing on every
machine you might develop on. So I typically end up just typing cont
and lots of carriage returns, not so pretty.

By the way, gdb -args linux ubda=... is very convenient.

Added Jeff to CC. Thanks Jeff, I still think UML is the best kernel
development tool ever.

Regards,

Daniel

2008-12-31 11:35:43

by Martin Steigerwald

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Wednesday 31 December 2008, Justin P. Mattock wrote:
> Daniel Phillips wrote:
> > On Tuesday 30 December 2008 23:34, sniper wrote:
> >> Great, I have mounted tux3 filesystem under UML with stuffs in this
> >> mail, but I still can't debug it with gdb. Anyone gives me
> >> suggestion?
> >
> > You just have to give a "cont" command a bunch of times and you will
> > eventually get to a command prompt. The reason for this is, uml uses
> > the segfault interrupt as part of its machine simulation, and there
> > is no existing way for uml and gdb to communicate in such a way that
> > uml can recognize that the interrupt came from its own code and
> > filter it.

[...]

> Hmm.. seems like a redundancy;
> Anyways I looked at your site, but am still
> confused at what tux3 is: what is tux3?
>
> (at first I thought it was a video game, but was wrong);
> can I use tux3 to secure a linux system or is it for
> something else?
>

Hmmm, I thought

---------------------------------------------------------------------
Tux3 is a write-anywhere, atomic commit, btree-based versioning
filesystem. It is the spiritual and moral successor of Tux2, the most
famous filesystem that was never released. The main purpose of Tux3 is to
embody Daniel Phillips's new ideas on storage data versioning. The
secondary goal is to provide a more efficient snapshotting and
replication method for the Zumastor NAS project, and a tertiary goal is
to be better than ZFS.
---------------------------------------------------------------------
http://tux3.org/

was pretty clear. What are you missing?

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7



2008-12-31 14:25:10

by Andi Kleen

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Daniel Phillips writes:
>
> Thanks for the howto. That was Jeff's suggestion also, but it would
> be so much slicker if it was automagic, and first-time users would not

Just provide a sample gdbrc that does all this somewhere in arch/um?
Then they would just have to source that file.

There used to be a similar one for kgdb which defined all sorts
of useful macros. Some of the BSDs even have their own collections
of useful gdb macros in tree; I always thought this was a good idea.

-Andi


2008-12-31 17:42:21

by Justin P. Mattock

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Martin Steigerwald wrote:
> On Wednesday 31 December 2008, Justin P. Mattock wrote:
>
>> Daniel Phillips wrote:
>>
>>> On Tuesday 30 December 2008 23:34, sniper wrote:
>>>
>>>> Great, I have mounted tux3 filesystem under UML with stuffs in this
>>>> mail, but I still can't debug it with gdb. Anyone gives me
>>>> suggestion?
>>>>
>>> You just have to give a "cont" command a bunch of times and you will
>>> eventually get to a command prompt. The reason for this is, uml uses
>>> the segfault interrupt as part of its machine simulation, and there
>>> is no existing way for uml and gdb to communicate in such a way that
>>> uml can recognize that the interrupt came from its own code and
>>> filter it.
>>>
>
> [...]
>
>
>> Hmm.. seems like a redundancy;
>> Anyways I looked at your site, but am still
>> confused at what tux3 is: what is tux3?
>>
>> (at first I thought it was a video game, but was wrong);
>> can I use tux3 to secure a linux system or is it for
>> something else?
>>
>>
>
> Hmmm, I thought
>
> ---------------------------------------------------------------------
> Tux3 is a write-anywhere, atomic commit, btree-based versioning
> filesystem. It is the spiritual and moral successor of Tux2, the most
> famous filesystem that was never released. The main purpose of Tux3 is to
> embody Daniel Phillips's new ideas on storage data versioning. The
> secondary goal is to provide a more efficient snapshotting and
> replication method for the Zumastor NAS project, and a tertiary goal is
> to be better than ZFS.
> ---------------------------------------------------------------------
> http://tux3.org/
>
> was pretty clear. What are you missing?
>
> Ciao,
>

I guess this is what is confusing to me:
atomic commit, btree-based versioning.

Regardless of how it's worded,
I'm wondering whether I should use this mechanism
or not.

regards;

Justin P. Mattock

2008-12-31 18:14:49

by sniper

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Sorry, I have run into another problem. After the system was completely
set up, the gdb interface is covered by the system's own console, so I
can't stop the running system to add new breakpoints and so on. I can
do those things with remote kgdb debugging, though.

Is there any way to fix this problem?

Thanks.

2008-12-31 18:18:21

by sniper

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Thu, Jan 1, 2009 at 2:14 AM, sniper wrote:
> Sorry I meet another problem. After the system was completely setupped,
Not "set up"; I meant "started up".

2009-01-01 09:56:24

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Wednesday 31 December 2008 10:14, sniper wrote:
> Sorry I meet another problem. After the system was completely setupped,
> the gdb interface will be covered with that system interaction interface.
> So, I can't stop the system's running and add some new break point etc.
> But I can do those operations in remote kgdb debugging.
>
> Any method to fix this problem?
>
> Thanks.

I haven't figured that one out. Usually I get into gdb from uml by
setting a breakpoint, fielding a crash, or executing an explicit
breakpoint: asm("int3"). But I have been unsuccessful so far in
breaking a running uml into the debugger from outside, ever since
SKAS arrived.

Regards,

Daniel

2009-01-01 14:46:58

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Thursday 01 January 2009 01:56, Daniel Phillips wrote:
> On Wednesday 31 December 2008 10:14, sniper wrote:
> > Sorry I meet another problem. After the system was completely setupped,
> > the gdb interface will be covered with that system interaction interface.
> > So, I can't stop the system's running and add some new break point etc.
> > But I can do those operations in remote kgdb debugging.
> >
> > Any method to fix this problem?
> >
> > Thanks.
>
> I haven't figured that one out...

...and now I have. From Jeff's docs:

Debian package: user-mode-linux-doc
file:///usr/share/doc/user-mode-linux-doc/html/debugging-skas.html

"If you need to interrupt UML, you can't ^C it because the terminal
is in raw mode, and the ^C will just hit whatever UML is running.
What you need to do is send the UML kernel thread a SIGINT from
another shell. It is normally the first process after the gdb"

I take no responsibility for the following command:

kill -INT $(pgrep linux | head -n1)
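
A slightly more selective version of that command is sketched below. It assumes, following the quoted docs, that the UML kernel thread is the `linux` process whose parent is gdb; the function name `uml_child_of` is hypothetical, and the process table is read from standard input so the selection logic can be tried on canned data.

```shell
#!/bin/sh
# Read "PID PPID COMMAND" lines on stdin and print the PID of the first
# process named "linux" whose parent is the given gdb PID.
# $1: PID of the gdb process that started UML
uml_child_of() {
    awk -v parent="$1" '$2 == parent && $3 == "linux" { print $1; exit }'
}

# Usage (not run here; gdb_pid is the PID of your own gdb session):
#   pid=$(ps -e -o pid= -o ppid= -o comm= | uml_child_of "$gdb_pid")
#   [ -n "$pid" ] && kill -INT "$pid"
```

Filtering on the parent PID avoids signalling an unrelated process that merely has "linux" in its name, which a bare `pgrep linux | head -n1` could pick up.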

Regards,

Daniel

2009-01-01 23:58:38

by Dave Chinner

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Thu, Jan 01, 2009 at 02:14:35AM +0800, sniper wrote:
> Sorry I meet another problem. After the system was completely setupped,
> the gdb interface will be covered with that system interaction interface.
> So, I can't stop the system's running and add some new break point etc.
> But I can do those operations in remote kgdb debugging.
>
> Any method to fix this problem?

Start up the uml console and send an "int" to the UML process.
That should drop it into the debugger. If that doesn't work,
pick the UML process that is burning CPU and send it a SIGTRAP ;)

Cheers,

Dave.
--
Dave Chinner

2009-01-02 20:17:30

by Martin Steigerwald

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Wednesday 31 December 2008, Justin P. Mattock wrote:
> Martin Steigerwald wrote:
> > On Wednesday 31 December 2008, Justin P. Mattock wrote:
> >> Daniel Phillips wrote:
> >>> On Tuesday 30 December 2008 23:34, sniper wrote:
> >>>> Great, I have mounted tux3 filesystem under UML with stuffs in
> >>>> this mail, but I still can't debug it with gdb. Anyone gives me
> >>>> suggestion?
> >>>
> >>> You just have to give a "cont" command a bunch of times and you
> >>> will eventually get to a command prompt. The reason for this is,
> >>> uml uses the segfault interrupt as part of its machine simulation,
> >>> and there is no existing way for uml and gdb to communicate in such
> >>> a way that uml can recognize that the interrupt came from its own
> >>> code and filter it.
> >
> > [...]
> >
> >> Hmm.. seems like a redundancy;
> >> Anyways I looked at your site, but am still
> >> confused at what tux3 is: what is tux3?
> >>
> >> (at first I thought it was a video game, but was wrong);
> >> can I use tux3 to secure a linux system or is it for
> >> something else?
> >
> > Hmmm, I thought
> >
> > ---------------------------------------------------------------------
> > Tux3 is a write-anywhere, atomic commit, btree-based versioning
> > filesystem. It is the spiritual and moral successor of Tux2, the most
> > famous filesystem that was never released. The main purpose of Tux3
> > is to embody Daniel Phillips's new ideas on storage data versioning.
> > The secondary goal is to provide a more efficient snapshotting and
> > replication method for the Zumastor NAS project, and a tertiary goal
> > is to be better than ZFS.
> > ---------------------------------------------------------------------
> > http://tux3.org/
> >
> > was pretty clear. What are you missing?
> >
> > Ciao,
>
> I guess this is what is confusing to me:
> atomic commit, btree-based versioning.

Ah, the buzz words. ;)

The tux3 mailing list contains quite a few design notes about these
concepts. I think others can give better answers about them - I
think I understood what it is for, not the implementation details. But
basically "atomic commit" is a strategy to keep the filesystem always in
a consistent state, and btree-based versioning allows keeping different
versions of a file / directory around. And unlike other filesystems, tux3
has this per inode and not for the complete filesystem. At least if I
understand correctly.

But at least it should be clear that tux3 is a filesystem and not a video
game ;).

> irregardless about how it's worded,
> I'm wondering if I should use this mechanism,
> or not.

Right now it's still in heavy development and not of release quality, i.e.
something to play around with and test if you want.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7



2009-01-02 20:37:16

by Justin P. Mattock

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Martin Steigerwald wrote:
> On Wednesday 31 December 2008, Justin P. Mattock wrote:
>
>> Martin Steigerwald wrote:
>>
>>> On Wednesday 31 December 2008, Justin P. Mattock wrote:
>>>
>>>> Daniel Phillips wrote:
>>>>
>>>>> On Tuesday 30 December 2008 23:34, sniper wrote:
>>>>>
>>>>>> Great, I have mounted tux3 filesystem under UML with stuffs in
>>>>>> this mail, but I still can't debug it with gdb. Anyone gives me
>>>>>> suggestion?
>>>>>>
>>>>> You just have to give a "cont" command a bunch of times and you
>>>>> will eventually get to a command prompt. The reason for this is,
>>>>> uml uses the segfault interrupt as part of its machine simulation,
>>>>> and there is no existing way for uml and gdb to communicate in such
>>>>> a way that uml can recognize that the interrupt came from its own
>>>>> code and filter it.
>>>>>
>>> [...]
>>>
>>>
>>>> Hmm.. seems like a redundancy;
>>>> Anyways I looked at your site, but am still
>>>> confused at what tux3 is: what is tux3?
>>>>
>>>> (at first I thought it was a video game, but was wrong);
>>>> can I use tux3 to secure a linux system or is it for
>>>> something else?
>>>>
>>> Hmmm, I thought
>>>
>>> ---------------------------------------------------------------------
>>> Tux3 is a write-anywhere, atomic commit, btree-based versioning
>>> filesystem. It is the spiritual and moral successor of Tux2, the most
>>> famous filesystem that was never released. The main purpose of Tux3
>>> is to embody Daniel Phillips's new ideas on storage data versioning.
>>> The secondary goal is to provide a more efficient snapshotting and
>>> replication method for the Zumastor NAS project, and a tertiary goal
>>> is to be better than ZFS.
>>> ---------------------------------------------------------------------
>>> http://tux3.org/
>>>
>>> was pretty clear. What are you missing?
>>>
>>> Ciao,
>>>
>> I guess this is what is confusing to me:
>> atomic commit, btree-based versioning.
>>
>
> Ah, the buzz words. ;)
>
> The tux3 mailing list contains quite some design notes about these
> concepts. I think others can give better answers about these concepts - I
> think I understood what it is for, not the implementation details. But
> basically "atomic commit" is a strategy to have the filesystem always in
> a consistent state and btree-based versioning allows to keep different
> versions of a file / directory around. And unlike other filesystem tux3
> has this per inode and not for the complete filesystem. At least if I
> understand correctly.
>
> But at least it should clear that tux3 is a filesystem and not a video
> game ;).
>
>
>> irregardless about how it's worded,
>> I'm wondering if I should use this mechanism,
>> or not.
>>
>
> Right now its still in heavy development and not of release quality. I.e.
> something to play around and test with if you want.
>
> Ciao,
>
Yeah, my bad for thinking tux3 was a video game
(don't ask why); I need to get glasses!!
When I was told it was a filesystem it clicked.
As for the atomic commit, thanks for the explanation
(I honestly had no idea what that meant).
As for testing and playing around with this, I have
an old Dell Inspiron hanging around; when
I have the time I'll have to give it a try.
Do I have to wipe out the ext3 partition
on it, or is that O.K.?

regards;

Justin P. Mattock

2009-01-02 22:45:53

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Friday 02 January 2009 12:17, Martin Steigerwald wrote:
> On Wednesday 31 December 2008, Justin P. Mattock wrote:
> > I guess this is what is confusing to me:
> > atomic commit, btree-based versioning.
>
> Ah, the buzz words. ;)
>
> The tux3 mailing list contains quite some design notes about these
> concepts. I think others can give better answers about these concepts - I
> think I understood what it is for, not the implementation details. But
> basically "atomic commit" is a strategy to have the filesystem always in
> a consistent state

Right. Atomic commit is a term that came from the database world and
was first applied to filesystems in an LKML message from Victor
Yodaiken back in 1998 as I dimly recall, and I adopted it to describe
the tree-based atomic update strategy I was developing for Tux2 at the
time. Tux3 uses a new logging variant that is supposed to avoid the
write-twice behavior of journalling and the recursive copy behavior of
WAFL, ZFS and Btrfs, so it should be pretty good at synchronous write
loads and generally reduce write traffic.

> and btree-based versioning allows to keep different
> versions of a file / directory around. And unlike other filesystem tux3
> has this per inode and not for the complete filesystem. At least if I
> understand correctly.

You do.

"Btree-based" and "versioning" are separate buzzwords. Tux3 is a btree
of btrees: the inode table is a btree, containing files that are
btrees. It was conceived to demonstrate a new method of versioning
files that puts the versioning information at the btree leaves instead
of having multiple independently rooted trees sharing subtrees:

Versioned pointers: a new method of representing snapshots
http://lwn.net/Articles/288896/

This approach lends itself to per-object versioning: each data pointer
and each inode attribute has its own version label. Making it work
per file and even per directory is a matter of clever mapping tricks to
turn global version numbers into per pointer version numbers.

But note that versioning support is still just a nice demo: the focus
has shifted to Tux3 as a general-purpose filesystem, with versioning
seen as a feature to be integrated after the basic Ext3-class
functionality is solid and reviewed.

> But at least it should clear that tux3 is a filesystem and not a video
> game ;).

It's kind of like a video game where you sneak through IRC channels
trying to frag bugs with your BFG.

Regards,

Daniel

2009-01-02 23:11:50

by Justin P. Mattock

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Daniel Phillips wrote:
> On Friday 02 January 2009 12:17, Martin Steigerwald wrote:
>
>> On Wednesday 31 December 2008, Justin P. Mattock wrote:
>>
>>> I guess this is what is confusing to me:
>>> atomic commit, btree-based versioning.
>>>
>> Ah, the buzz words. ;)
>>
>> The tux3 mailing list contains quite some design notes about these
>> concepts. I think others can give better answers about these concepts - I
>> think I understood what it is for, not the implementation details. But
>> basically "atomic commit" is a strategy to have the filesystem always in
>> a consistent state
>>
>
> Right. Atomic commit is a term that came from the database world and
> was first applied to filesystems in an LKML message from Victor
> Yodaiken back in 1998 as I dimly recall, and I adopted it to describe
> the tree-based atomic update strategy I was developing for Tux2 at the
> time. Tux3 uses a new logging variant that is supposed to avoid the
> write-twice behaviour of journalling and the recursive copy behavior of
> WAFL, ZFS and Btrfs, so should be pretty good at synchronous write
> loads and generally reduce write traffic.
>
>
>> and btree-based versioning allows to keep different
>> versions of a file / directory around. And unlike other filesystem tux3
>> has this per inode and not for the complete filesystem. At least if I
>> understand correctly.
>>
>
> You do.
>
> "Btree-based" and "versioning" are separate buzzwords. Tux3 is a btree
> of btrees: the inode table is a btree, containing files that are
> btrees. It was conceived to demonstrate a new method of versioning
> files that puts the versioning information at the btree leaves instead
> of having multiple independently rooted trees sharing subtrees:
>
> Versioned pointers: a new method of representing snapshots
> http://lwn.net/Articles/288896/
>
> This approach lends itself to per-object versioning: each data pointer
> and each inode attribute has its own version label. Making it work
> per file and even per directory is a matter of clever mapping tricks to
> turn global version numbers into per pointer version numbers.
>
> But note that versioning support is still just a nice demo: the focus
> has shifted to Tux3 as general purpose filesystem, with versioning
> seen as a feature to be integrated after the basic Ext3-class
> functionality is solid and reviewed.
>
>
>> But at least it should clear that tux3 is a filesystem and not a video
>> game ;).
>>
>
> It's kind of like a video game where you sneak through IRC channels
> trying to frag bugs with your BFG.
>
> Regards,
>
> Daniel
>
>
The game that came to mind when I first
heard of tux3 (I had to google a bit to find the name)
was Tux Racer. :^)
Quick question:
what is the state of security file labeling for
SELinux on this filesystem?


regards;

Justin P. Mattock

2009-01-03 01:19:42

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Friday 02 January 2009 15:11, Justin P. Mattock wrote:
> The game that came to mind when I first
> heard of tux3(I had to google a bit to find the name)
> was tux racer. :^)
> quick question:
> what is the state for security file labeling for
> SELinux on this filesystem?

There is a lot of interest in security labels. You are not the first
to ask.

Tux3 variable inode attributes are ideal for implementing security
labels efficiently, way more lightweight than extended attributes.
Otherwise, we would like to know exactly what people want.

Regards,

Daniel

2009-01-03 01:33:11

by Justin P. Mattock

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Daniel Phillips wrote:
> On Friday 02 January 2009 15:11, Justin P. Mattock wrote:
>
>> The game that came to mind when I first
>> heard of tux3(I had to google a bit to find the name)
>> was tux racer. :^)
>> quick question:
>> what is the state for security file labeling for
>> SELinux on this filesystem?
>>
>
> There is a lot of interest in security labels. You are not the first
> to ask.
>
> Tux3 variable inode attributes are ideal for implementing security
> labels efficiently, way more lightweight than extended attributes.
> Otherwise, we would like to know exactly what people want.
>
> Regards,
>
> Daniel
>
>
>
That's probably one of the main areas of
interest that I have in filesystems:
the ability to run a policy, etc.
As for what people want, that's tough
to say; my guess would be protection against file corruption,
then probably security, etc.

regards;

Justin P. Mattock

2009-01-03 03:03:39

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Friday 02 January 2009 17:32, Justin P. Mattock wrote:
> Daniel Phillips wrote:
> > On Friday 02 January 2009 15:11, Justin P. Mattock wrote:
> >
> >> The game that came to mind when I first
> >> heard of tux3(I had to google a bit to find the name)
> >> was tux racer. :^)
> >> quick question:
> >> what is the state for security file labeling for
> >> SELinux on this filesystem?
> >
> > There is a lot of interest in security labels. You are not the first
> > to ask.
> >
> > Tux3 variable inode attributes are ideal for implementing security
> > labels efficiently, way more lightweight than extended attributes.
> > Otherwise, we would like to know exactly what people want.
> >
> > Regards,
> >
> > Daniel
> >
> thats probably one of the main areas of
> interest that I have in filesystems,
> the ability to run a policy etc..
> As for what people want, thats tough
> to say, my guess would be file corruption,
> then probably security etc..

I meant, what do people specifically want in security? For SELinux,
probably the most important issue is efficient extended attribute
support, which Tux3 has a pretty good start on:

http://lwn.net/Articles/300416/
Tux3 gets a high speed atom smasher

One feature we are kicking around to make life easier for SELinux:
sometimes the filesystem can run while SELinux is not running, and
security labels will be wrong when SELinux re-enters the picture. We
have in mind to provide a persistent log of filesystem events that the
security system can attach to on startup and find out what went on in
its absence.

And it might be nice to provide direct access to Tux3's variable inode
attributes as I mentioned, letting the security system bypass the
heavyweight xattr paths. My thinking is, the more direct the security
path, the more likely it is to be secure, and the less overhead it has,
the more likely people are to use it. Somebody might want to play with
this idea and see if it makes a difference.

Of course, these features are secondary to base filesystem solidity,
which will be the main focus for the next little while, but now is the
time to talk about what you want, in case the design can be adjusted to
make it more practical.

More security wishes go here: ->[___________________]<-

Regards,

Daniel

2009-01-03 03:40:09

by Justin P. Mattock

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Daniel Phillips wrote:
> On Friday 02 January 2009 17:32, Justin P. Mattock wrote:
>
>> Daniel Phillips wrote:
>>
>>> On Friday 02 January 2009 15:11, Justin P. Mattock wrote:
>>>
>>>
>>>> The game that came to mind when I first
>>>> heard of tux3(I had to google a bit to find the name)
>>>> was tux racer. :^)
>>>> quick question:
>>>> what is the state for security file labeling for
>>>> SELinux on this filesystem?
>>>>
>>> There is a lot of interest in security labels. You are not the first
>>> to ask.
>>>
>>> Tux3 variable inode attributes are ideal for implementing security
>>> labels efficiently, way more lightweight than extended attributes.
>>> Otherwise, we would like to know exactly what people want.
>>>
>>> Regards,
>>>
>>> Daniel
>>>
>>>
>> thats probably one of the main areas of
>> interest that I have in filesystems,
>> the ability to run a policy etc..
>> As for what people want, thats tough
>> to say, my guess would be file corruption,
>> then probably security etc..
>>
>
> I meant, what do people specifically want in security. For SELinux,
> probably the most important issue is efficient extended attribute
> support, which Tux3 has a pretty good start on:
>
> http://lwn.net/Articles/300416/
> Tux3 gets a high speed atom smasher
>
>
That's some crazy stuff!! And just think, most of it is
simply magnets (but more complicated than that).
> One feature we are kicking around to make life easier for SELinux:
> sometimes the filesystem can run while SELinux is not running, and
> security labels will be wrong when SELinux re-enters the picture. We
> have in mind to provide a persistent log of filesystem events that the
> security system can attach to on startup and find out what went on in
> its absence.
>
>
That sounds nice:

find out what went on in
its absence.

> And it might be nice to provide direct access to Tux3's variable inode
> attributes as I mentioned, letting the security system bypass the
> heavyweight xattr paths. My thinking is, the more direct the security
> path, the more likely it is to be secure, and the less overhead it has,
> the more likely people are to use it. Somebody might want to play with
> this idea and see if it makes a difference.
That makes sense:

the more direct the security
path, the more likely it is to be secure, and the less overhead it has,

> Of course, these features are secondary to base filesystem solidity,
> which will be the main focus for the next little while, but now is the
> time to talk about what you want, in case the design can be adjusted to
> make it more practical.
>
> More security wishes go here: ->[___________________]<-
>
> Regards,
>
> Daniel
>
>

I guess the simplest wish would be to make sure that tux3
does support SELinux, so that people have more options
to work with (then worry about everything else later).
One of the main problems I have with OS X and Windows XP
is the lack of such options (I feel naked without SELinux).

regards;

Justin P. Mattock

2009-01-04 03:17:52

by Jamie Lokier

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Justin P. Mattock wrote:
> Thats some crazy stuff!! and just think most of it is
> simply magnets.(but more complicated than that)
> >One feature we are kicking around to make life easier for SELinux:
> >sometimes the filesystem can run while SELinux is not running, and
> >security labels will be wrong when SELinux re-enters the picture. We
> >have in mind to provide a persistent log of filesystem events that the
> >security system can attach to on startup and find out what went on in
> >its absence.
> >
> >
> That sounds nice:
>
> find out what went on in
> its absence.

That sounds like a feature Windows has had for many years now (since
Windows 2000?). It complements the Windows equivalent of
dnotify/inotify/fsnotify.

It's used for file indexing too (think equivalent to Spotlight,
Beagle, etc.), and other types of security scanning (think equivalent
to Tripwire).

I wonder why the people writing file indexing tools for Linux never
made a fuss about this. Inotify is ok for indexing, but means quite a
few minutes of intensive disk activity after each boot to rescan /home.

-- Jamie

2009-01-04 04:16:13

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Saturday 03 January 2009 19:17, Jamie Lokier wrote:
> Justin P. Mattock wrote:
> > Thats some crazy stuff!! and just think most of it is
> > simply magnets.(but more complicated than that)
> > >One feature we are kicking around to make life easier for SELinux:
> > >sometimes the filesystem can run while SELinux is not running, and
> > >security labels will be wrong when SELinux re-enters the picture. We
> > >have in mind to provide a persistent log of filesystem events that the
> > >security system can attach to on startup and find out what went on in
> > >its absence.
> > >
> > >
> > That sounds nice:
> >
> > find out what went on in
> > its absence.
>
> That sounds like a feature Windows had for many years now, (since
> Windows 2000?). It complements the Windows equivalent of
> dnotify/inotify/fsnotify.
>
> It's used for file indexing too (think equivalent to Spotlight,
> Beagle, etc.), and other types of security scanning (think equivalent
> to Tripwire).
>
> I wonder why the people writing file indexing tools for Linux never
> made a fuss about this. Inotify is ok for indexing, but means quite a
> few minutes of intensive disk activity after each boot to rescan /home.

Actually they did. It was a poke from Jos van den Oever, the Strigi
guy, that got me thinking about it; the security aspect came up later.

Regards,

Daniel

2009-01-04 04:29:27

by Justin P. Mattock

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Jamie Lokier wrote:
> Justin P. Mattock wrote:
>
>> Thats some crazy stuff!! and just think most of it is
>> simply magnets.(but more complicated than that)
>>
>>> One feature we are kicking around to make life easier for SELinux:
>>> sometimes the filesystem can run while SELinux is not running, and
>>> security labels will be wrong when SELinux re-enters the picture. We
>>> have in mind to provide a persistent log of filesystem events that the
>>> security system can attach to on startup and find out what went on in
>>> its absence.
>>>
>>>
>>>
>> That sounds nice:
>>
>> find out what went on in
>> its absence.
>>
>
> That sounds like a feature Windows had for many years now, (since
> Windows 2000?). It complements the Windows equivalent of
> dnotify/inotify/fsnotify.
>
> It's used for file indexing too (think equivalent to Spotlight,
> Beagle, etc.), and other types of security scanning (think equivalent
> to Tripwire).
>
> I wonder why the people writing file indexing tools for Linux never
> made a fuss about this. Inotify is ok for indexing, but means quite a
> few minutes of intensive disk activity after each boot to rescan /home.
>
> -- Jamie
>
>
Thanks for the info.
What about apps like git?
E.g. when a file is changed,
it knows that the file was changed
(not sure how it works; it remembers
the size or something like that).

Is file indexing smart enough (like git)
to know that the data was changed, or does it just
go by the name? When running SELinux I'm able to
change wpa_supplicant.conf with different SSIDs
and keys and there won't be a denial, but if I swap
out libflashplayer.so for a newer or identical plugin
I will receive a denial (bad example, but all I could think of).
So it does have an idea of when a file is changed.
Personally, I think having mechanisms that know exactly when
a file was changed internally is nice; this way
you are at least aware
that something has changed, and you know where.

regards;

Justin P. Mattock

2009-01-04 13:05:13

by Theodore Ts'o

Subject: Re: [Tux3] Tux3 report: A Golden Copy

On Sun, Jan 04, 2009 at 03:17:33AM +0000, Jamie Lokier wrote:
> Justin P. Mattock wrote:
> > >One feature we are kicking around to make life easier for SELinux:
> > >sometimes the filesystem can run while SELinux is not running, and
> > >security labels will be wrong when SELinux re-enters the picture. We
> > >have in mind to provide a persistent log of filesystem events that the
> > >security system can attach to on startup and find out what went on in
> > >its absence.
> > >
> That sounds like a feature Windows had for many years now, (since
> Windows 2000?). It complements the Windows equivalent of
> dnotify/inotify/fsnotify.

Arguably you want to do this in the VFS layer, not in the low-level
filesystem level if you want most applications to adopt it.

> It's used for file indexing too (think equivalent to Spotlight,
> Beagle, etc.), and other types of security scanning (think equivalent
> to Tripwire).

Eric Paris has a patch he's been proposing for a while now for a new
notify mechanism designed for anti-virus scanners...

- Ted

2009-01-05 01:10:38

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Hi Ted,

On Sunday 04 January 2009 05:04, Theodore Tso wrote:
> On Sun, Jan 04, 2009 at 03:17:33AM +0000, Jamie Lokier wrote:
> > Justin P. Mattock wrote:
> > > >One feature we are kicking around to make life easier for SELinux:
> > > >sometimes the filesystem can run while SELinux is not running, and
> > > >security labels will be wrong when SELinux re-enters the picture. We
> > > >have in mind to provide a persistent log of filesystem events that the
> > > >security system can attach to on startup and find out what went on in
> > > >its absence.
> > > >
> > That sounds like a feature Windows had for many years now, (since
> > Windows 2000?). It complements the Windows equivalent of
> > dnotify/inotify/fsnotify.
>
> Arguably you want to do this in the VFS layer, not in the low-level
> filesystem level if you want most applications to adopt it.

It has to be generic all right, but the VFS is not able to do the job
on its own. To be useful for indexing, the reported events must
already be persistently recorded, and the VFS has no idea about when
that happens. The filesystem is the expert on that subject, and it
must generate the events. I can't imagine a reasonable VFS-level
emulation, or what value the VFS would add by acting as middleman for
a stream of filesystem events.

The natural way to do this is for the filesystem to stream events
directly to the monitoring application over a pipe-like fd. Maybe a
library for event delivery could be shared by filesystems, to impose
a standard format. The role of the VFS would be simply to set up the
event connection, or to report that it is not supported.
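
As a rough illustration of the pipe-like delivery described above,
here is a minimal Python sketch. The record layout (sequence number,
event code, inode number) and the names RECORD, emit and read_events
are assumptions made up for this example; nothing here is an actual
Tux3 or VFS interface. It only shows how a fixed record format lets
any monitor decode the stream without filesystem-specific knowledge.

```python
import os
import struct

# Hypothetical fixed-size record: sequence number, event code, inode number.
# These fields are illustrative only and do not come from Tux3.
RECORD = struct.Struct("<QIQ")
EV_CREATE, EV_WRITE, EV_DELETE = 1, 2, 3


def emit(fd, seq, event, inode):
    """Filesystem side: write one fixed-format record to the stream."""
    os.write(fd, RECORD.pack(seq, event, inode))


def read_events(fd):
    """Monitor side: read records until the writer closes its end."""
    events = []
    buf = b""
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:
            break
        buf += chunk
        # Decode every complete record currently in the buffer.
        while len(buf) >= RECORD.size:
            events.append(RECORD.unpack(buf[:RECORD.size]))
            buf = buf[RECORD.size:]
    return events


read_fd, write_fd = os.pipe()
emit(write_fd, 1, EV_CREATE, 42)
emit(write_fd, 2, EV_WRITE, 42)
emit(write_fd, 3, EV_DELETE, 7)
os.close(write_fd)

received = read_events(read_fd)
os.close(read_fd)
print(received)
```

In this picture the shared library would own the record format, and
the VFS would only hand out the read end of the connection or report
that the filesystem does not support it.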

> > It's used for file indexing too (think equivalent to Spotlight,
> > Beagle, etc.), and other types of security scanning (think equivalent
> > to Tripwire).
>
> Eric Paris has a patch he's been proposing for a while now for a new
> notify mechanism designed for anti-virus scanners...

An event stream accurate enough to support indexing is a considerably
harder problem, I think.

Regards,

Daniel

2009-01-05 02:14:20

by Jamie Lokier

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Daniel Phillips wrote:
> > Arguably you want to do this in the VFS layer, not in the low-level
> > filesystem level if you want most applications to adopt it.
>
> It has to be generic all right, but the VFS is not able to do the job
> on its own. To be useful for indexing, the reported events must
> already be persistently recorded, and the VFS has no idea about when
> that happens. The filesystem is the expert on that subject, and it
> must generate the events. I can't imagine a reasonable VFS-level
> emulation, or what value the VFS would add by acting as middleman for
> a stream of filesystem events.

The VFS does have some helpful generic support for quotas, although
it also requires filesystem-specific help. This is quite similar.

I see what you mean about knowing when an event reaches _persistent_
storage. To be accurate, the event log must be folded into the
filesystem's transaction/commit model (including right use of barriers
etc.), and during journal/equivalent recovery, and fsck repair, the
event log must err on the side of too many rather than too few events.
(Or have a "rescan everything needed" event.)

An event log does not have to be _entirely_ accurate to be useful for
things like security scanning and indexing. It is enough that it errs
on the side of recording a few too many, causing a few more app level
checks.

On the other hand, when used for an audit trail, you never want extra
events to be logged.

It seems to me whatever transaction/commit support is needed for event
logging is similarly needed for accurate quotas.

I've read that sometimes quotas get out of sync with the real amount
of user data stored on some filesystems, and then need to be
recalculated with a filesystem scan. If true, this is unfortunate.

> The natural way to do this is for the filesystem to stream events
> directly to the monitoring application over a pipe-like fd. Maybe a
> library for event delivery could be shared by filesystems, to impose
> a standard format. The role of the VFS would be simply to set up the
> event connection, or to report that it is not supported.

There was an extension to inotify posted a few months ago to do this.
Additional events when something becomes persistent.

> An event stream accurate enough to support indexing is a considerably
> harder problem, I think.

Not really. It's enough if an indexer can efficiently find all changed
files since it was last running. That doesn't have to be an accurate
event stream. For example, simply having xattrs
"user.scanned.indexer_app_name" automatically deleted whenever the
file is modified, and recursively doing the same to parent
directories, would be enough in most cases. Not for hard links,
obviously, but indexers can treat those separately and detect them by
link count.
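
A toy model of that scheme, as a Python sketch. The tree is held in
memory rather than on a real filesystem, and the ToyTree class and its
methods are hypothetical names for this example; only the attribute
name comes from the text above. Modifying an object clears the mark on
it and on every ancestor, so a rescan can skip any subtree whose root
is still marked.

```python
import posixpath

MARK = "user.scanned.indexer_app_name"  # attribute name from the message above


class ToyTree:
    """In-memory stand-in for a filesystem that clears the mark on modification."""

    def __init__(self):
        self.xattrs = {"/": set()}      # path -> set of attribute names
        self.children = {"/": set()}    # directory path -> child paths

    def add(self, path, is_dir=False):
        parent = posixpath.dirname(path)
        self.children[parent].add(path)
        self.xattrs[path] = set()
        if is_dir:
            self.children[path] = set()
        self._clear_marks(path)

    def modify(self, path):
        self._clear_marks(path)

    def _clear_marks(self, path):
        # Drop the mark from the object and from every ancestor up to the root.
        while True:
            self.xattrs[path].discard(MARK)
            if path == "/":
                break
            path = posixpath.dirname(path)

    def rescan(self, path="/"):
        """Indexer side: visit only unmarked objects, then mark them scanned."""
        if MARK in self.xattrs[path]:
            return []
        visited = [path]
        for child in sorted(self.children.get(path, ())):
            visited += self.rescan(child)
        self.xattrs[path].add(MARK)
        return visited


tree = ToyTree()
tree.add("/home", is_dir=True)
tree.add("/home/a.txt")
tree.add("/etc", is_dir=True)
tree.add("/etc/fstab")
first = tree.rescan()        # everything is new, so everything is visited
tree.modify("/home/a.txt")
second = tree.rescan()       # only the path down to the modified file
print(second)                # ['/', '/home', '/home/a.txt']
```

The sketch does not model hard links, which is exactly the case the
scheme leaves to separate handling.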

There's one other application which needs *really accurate* event
notification delivery. That is, anything which caches the result of
reading one or more files (such as for example compiling a script and
its dependencies to an internal representation in memory or into
another disk file), but where the caching must be *absolutely*
reliably invalidated at the time it's checked so that the behaviour is
guaranteed identical to not caching.

That kind of app needs to be able to say "are there any change events
pending since I last looked?" efficiently for many files (e.g. inotify
is ok, 1 syscall for many files), but with the guarantee that when the
answer is "no change events", calling read() and stat() on all the
files really would see no changes. Networked inotify does not
guarantee this, because event reception is delayed.

-- Jamie

2009-01-08 02:51:28

by Daniel Phillips

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Hi Jamie,

On Sunday 04 January 2009 18:13, Jamie Lokier wrote:
> Daniel Phillips wrote:
> > > Arguably you want to do this in the VFS layer, not in the low-level
> > > filesystem level if you want most applications to adopt it.
> >
> > It has to be generic all right, but the VFS is not able to do the job
> > on its own. To be useful for indexing, the reported events must
> > already be persistently recorded, and the VFS has no idea about when
> > that happens. The filesystem is the expert on that subject, and it
> > must generate the events. I can't imagine a reasonable VFS-level
> > emulation, or what value the VFS would add by acting as middleman for
> > a stream of filesystem events.
>
> The VFS does have some helpful generic support for quotas, although
> it also requires filesystem-specific help. This is quite similar.

If the VFS stored the index on the filesystem then it would be similar,
but I don't think anybody will like the idea of the VFS operating an
indexer in-kernel. Given that the indexer is maintained by user space,
the kernel's job is just to deliver the events the user space indexer
needs, which is a very different activity pattern from the generic quota
file scheme.

> I see what you mean about knowing when an event reaches _persistent_
> storage. To be accurate, the event log must be folded into the
> filesystem's transaction/commit model (including right use of barriers
> etc.), and during journal/equivalent recovery, and fsck repair, the
> event log must err on the side of too many rather than too few events.
> (Or have a "rescan everything needed" event.)
>
> An event log does not have to be _entirely_ accurate to be useful for
> things like security scanning and indexing. It is enough that it errs
> on the side of recording a few too many, causing a few more app level
> checks.

Suppose a file delete event is sent, the external indexer dutifully
deletes its index entry for the file, then the machine crashes without
completing the delete transaction. On reboot, the file still exists
but it has leaked from the index. Ideas?

> On the other hand, when used for an audit trail, you never want extra
> events to be logged.
>
> It seems to me whatever transaction/commit support is needed for event
> logging is similarly needed for accurate quotas.
>
> I've read that sometimes quotas get out of sync with the real amount
> of user data stored on some filesystems, and then need to be
> recalculated with a filesystem scan. If true, this is unfortunate.

True, that. The quota file support really seems like it is driven
from the wrong end. It should just be a helpful library that the
filesystem calls at just the right time, to format quota blocks that
are otherwise managed by the filesystem however it chooses. When we
get to quota support, I think we will take a look at hooking into the
top level quota API instead of the generic quota file support. I
really hate the idea of recursive journal transactions you see in Ext3
as a result of the weird dance with the quota API. I don't know, maybe
it will all make sense when we get there, but chances are, Tux3 will
have to do something kooky too, to use generic quota file support.

> > The natural way to do this is for the filesystem to stream events
> > directly to the monitoring application over a pipe-like fd. Maybe a
> > library for event delivery could be shared by filesystems, to impose
> > a standard format. The role of the VFS would be simply to set up the
> > event connection, or to report that it is not supported.
>
> There was an extension to inotify posted a few months ago to do this.
> Additional events when something becomes persistent.

Do you have a pointer?

> > An event stream accurate enough to support indexing is a considerably
> > harder problem, I think.
>
> Not really. It's enough if an indexer can efficiently find all changed
> files since it was last running. That doesn't have to be an accurate
> event stream.

Actually, it is not much like an event stream at all, it's like a delta
stream. Looking at it that way suggests a new model: the indexer
receives periodic deltas from the filesystem and processes them to
find all the changes it is interested in. Attractive features of the
delta model:

- The filesystem must already make this persistent and completely
accurate.

- Deltas are efficient at consolidating large numbers of changes.

- The new crop of next gen snapshotting filesystems will all be
able to do this.

Just an alternative way of looking at the problem.
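
To make the consolidation point concrete, a small Python sketch. The
event names and the consolidate function are assumptions for this
example, not a proposed format: many events on the same path within
one delta fold into a single net change, and a file created and then
deleted inside the delta disappears entirely.

```python
def consolidate(events):
    """Fold a sequence of (op, path) events into one net change per path."""
    delta = {}
    for op, path in events:
        previous = delta.get(path)
        if op == "create":
            # Recreating something deleted earlier in this delta is a modification.
            delta[path] = "modify" if previous == "delete" else "create"
        elif op == "modify":
            # A new file that is then written is still just a new file.
            delta[path] = "create" if previous == "create" else "modify"
        elif op == "delete":
            if previous == "create":
                del delta[path]      # created and deleted in one delta: no net change
            else:
                delta[path] = "delete"
    return delta


events = [
    ("create", "/tmp/x"),
    ("modify", "/tmp/x"),
    ("delete", "/tmp/x"),
    ("modify", "/home/a"),
    ("modify", "/home/a"),
    ("delete", "/home/b"),
]
net = consolidate(events)
print(net)   # {'/home/a': 'modify', '/home/b': 'delete'}
```

Six events become two entries, and the temporary file never reaches
the indexer at all.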

> For example, simply having xattrs
> "user.scanned.indexer_app_name" automatically deleted whenever the
> file is modified, and recursively doing the same to parent
> directories, would be enough in most cases. Not for hard links,
> obviously, but indexers can treat those separately and detect them by
> link count.

Hand waving alert! Hard link handling is a basic requirement of any
indexer worthy of the name. This is my main litmus test for whether
an API proposal satisfies the ACID test. Do you have a specific
suggestion for indexing hard links?
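
For reference, the link-count detection mentioned in the quoted text
can be sketched in Python as below. Keying the index by (device,
inode) is an assumption added for this example, not something proposed
in this thread, and the helper names are hypothetical. It shows only
how a scanning indexer can notice that two names refer to one object;
it says nothing about how an event API should report changes to them.

```python
import os
import tempfile
from collections import defaultdict


def index_by_inode(root):
    """Map (device, inode) -> sorted list of paths under root naming that inode."""
    index = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            index[(st.st_dev, st.st_ino)].append(path)
    return {key: sorted(paths) for key, paths in index.items()}


def hard_linked(index):
    """Return only the inodes that are reachable by more than one name."""
    return {key: paths for key, paths in index.items() if len(paths) > 1}


with tempfile.TemporaryDirectory() as root:
    original = os.path.join(root, "report.txt")
    alias = os.path.join(root, "alias.txt")
    single = os.path.join(root, "single.txt")
    with open(original, "w") as f:
        f.write("data")
    with open(single, "w") as f:
        f.write("other")
    os.link(original, alias)            # second name for the same inode

    link_count = os.stat(original).st_nlink
    groups = hard_linked(index_by_inode(root))
    linked_names = [sorted(os.path.basename(p) for p in paths)
                    for paths in groups.values()]

print(link_count, linked_names)
```

Note that this only finds the other names by walking the tree, which
is the disk activity an event-driven indexer is trying to avoid.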

Anyway, I would prefer if the indexer could build its index using just
the event stream, which would create significantly less disk activity.
It should be able to rescan like you suggest, for a reality check.

> There's one other application which needs *really accurate* event
> notification delivery. That is, anything which caches the result of
> reading one or more files (such as for example compiling a script and
> its dependencies to an internal representation in memory or into
> another disk file), but where the caching must be *absolutely*
> reliably invalidated at the time it's checked so that the behaviour is
> guaranteed identical to not caching.

Good example. I think my position is, if the API doesn't support a
_completely_ accurate consistency model, I am not interested in
proposing it. It will be a few months before we're ready to add any
kind of log at all, and in that time, I hope to gather requirements and
get into some blue sky discussion of specific designs. It was the
Strigi guys who brought this up, and I'm sure they will be more than
willing to find the holes in specific proposals.

> That kind of app needs to be able to say "are there any change events
> pending since I last looked?" efficiently for many files (e.g. inotify
> is ok, 1 syscall for many files), but with the guarantee that when the
> answer is "no change events", calling read() and stat() on all the
> files really would see no changes. Networked inotify does not
> guarantee this, because event reception is delayed.
>
> -- Jamie

The delta model of thinking about this problem may help. If the indexer
is aware of the delta boundaries, it can be sure it has all the changes
as of exactly some delta. Then if the indexer and filesystem crash at
different times, they can sync back up. The indexer does have to
acknowledge receipt of each delta, so that the filesystem knows when it
can drop that part of the log.
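
A minimal Python sketch of that handshake, under the assumption that
deltas are numbered sequentially and that the indexer stores the
number of the last delta it applied alongside its index. DeltaLog and
Indexer are hypothetical names for this example only.

```python
class DeltaLog:
    """Filesystem side: keeps committed deltas until the indexer acknowledges them."""

    def __init__(self):
        self.deltas = {}       # delta number -> list of changes
        self.next_number = 1
        self.acked = 0         # highest delta number acknowledged by the indexer

    def commit(self, changes):
        self.deltas[self.next_number] = list(changes)
        self.next_number += 1

    def since(self, number):
        """Return (number, changes) for every retained delta newer than number."""
        return [(n, self.deltas[n]) for n in sorted(self.deltas) if n > number]

    def acknowledge(self, number):
        self.acked = max(self.acked, number)
        for n in [n for n in self.deltas if n <= self.acked]:
            del self.deltas[n]   # this part of the log is no longer needed


class Indexer:
    def __init__(self):
        self.applied = 0       # stored with the index, so it survives an indexer crash
        self.index = []

    def catch_up(self, log):
        for number, changes in log.since(self.applied):
            self.index.extend(changes)
            self.applied = number
            log.acknowledge(number)


log = DeltaLog()
indexer = Indexer()
log.commit(["create /a"])
indexer.catch_up(log)
log.commit(["modify /a"])       # committed while the indexer is not running
log.commit(["delete /a"])
retained_before = sorted(log.deltas)
indexer.catch_up(log)           # indexer restarts and resumes after delta 1
retained_after = sorted(log.deltas)
print(retained_before, retained_after, indexer.index)
```

The filesystem retains deltas 2 and 3 until the indexer comes back,
then drops them once they are acknowledged.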

In the common case where the index is stored on the filesystem being
indexed, it's interesting to note the behavior where the persistent log
is delivered to the indexer, which massages it and stores it in a file
on the filesystem, then lets the filesystem discard part of its log.
The persistent data moves from one place to another on the filesystem,
filtered by userspace as it goes. This seems to make some kind of
sense.

As for indexing in-memory filesystem changes before they arrive on
stable storage, I think that is the business of an inotify-type
mechanism. That seems to me to be a separate problem. I think we have
two layers of events mashed together here. One is the current view of
filesystem cache as required by Samba, for example, to export a current
view of files as they change, or by a desktop to refresh a directory
view when it changes. The other is the long term stable, checkpointed
view of the filesystem as required by an indexer. I think some head
scratching needs to be done about the relationship between these two
layers, and whether there are applications that actually need access to
both at the same time. If not, then two separate kinds of event stream
sounds like not such a bad idea.

Regards,

Daniel

2009-01-08 04:38:37

by Evgeniy Polyakov

Subject: Re: [Tux3] Tux3 report: A Golden Copy

Hi Daniel.

On Wed, Jan 07, 2009 at 06:50:59PM -0800, Daniel Phillips wrote:
> Suppose a file delete event is sent, the external indexer dutifully
> deletes its index entry for the file, then the machine crashes without
> completing the delete transaction. On reboot, the file still exists
> but it has leaked from the index. Ideas?

Send the delete event when the delete is completed?
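
A toy Python simulation of the difference, assuming a filesystem that
stages changes in memory and commits them atomically. ToyFS and run
are hypothetical names for this example. With early notification the
index loses the file after a crash; with events held until commit the
index and the disk stay in agreement.

```python
class ToyFS:
    """Toy filesystem: changes are staged in memory, then committed atomically."""

    def __init__(self, files):
        self.on_disk = set(files)     # state that survives a crash
        self.in_memory = set(files)   # state visible to running programs
        self.pending = []             # events for changes not yet committed
        self.sent = []                # events delivered to the indexer

    def delete(self, name, notify_early):
        self.in_memory.discard(name)
        event = ("delete", name)
        if notify_early:
            self.sent.append(event)     # delivered before the change is durable
        else:
            self.pending.append(event)  # held back until commit

    def commit(self):
        self.on_disk = set(self.in_memory)
        self.sent += self.pending
        self.pending = []

    def crash(self):
        self.in_memory = set(self.on_disk)  # uncommitted changes are lost
        self.pending = []


def run(notify_early):
    fs = ToyFS({"a", "b"})
    index = set(fs.on_disk)
    fs.delete("a", notify_early)
    fs.crash()                          # crash before the delete is committed
    for kind, name in fs.sent:
        if kind == "delete":
            index.discard(name)
    return index == fs.on_disk          # True if the index matches the disk


print(run(notify_early=True))   # False
print(run(notify_early=False))  # True
```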

> > There was an extension to inotify posted a few months ago to do this.
> > Additional events when something becomes persistent.
>
> Do you have a pointer?

Something like this:
http://lkml.org/lkml/2008/11/25/272

As for VFS vs. a filesystem-level indexer: what if the VFS could provide an
interface to the lower-level FS to get the event flow, which userspace
could then read the same way it reads a file, and, if no such operation is
supported by the filesystem, fall back to a whole-directory rescan...

--
Evgeniy Polyakov