2008-07-15 18:28:34

by Sage Weil

Subject: Recursive directory accounting for size, ctime, etc.

All-

Ceph is a new distributed file system for Linux designed for scalability
(terabytes to exabytes, tens to thousands of storage nodes), reliability,
and performance. The latest release (v0.3), aside from xattr support and
the usual slew of bugfixes, includes a unique (?) recursive accounting
infrastructure that allows statistics about all metadata nested beneath a
point in the directory hierarchy to be efficiently propagated up the tree.
Currently this includes a file and directory count, total bytes (summation
over file sizes), and most recent inode ctime. For example, for a
directory like /home, Ceph can efficiently report the total number of
files, directories, and bytes contained by that entire subtree of the
directory hierarchy.

The file size summation is the most interesting, as it effectively gives
you directory-based quota space accounting with fine granularity. In many
deployments, the quota _accounting_ is more important than actual
enforcement. Anybody who has had to figure out what has filled/is filling
up a large volume will appreciate how cumbersome and inefficient 'du' can
be for that purpose--especially when you're in a hurry.
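For comparison, this is the full-subtree walk that ordinary filesystems force on you, sketched against a hypothetical scratch tree under /tmp (paths and contents are illustrative only). It is exactly the scan that Ceph's recursive stats let you skip:

```shell
# Build a small scratch tree (illustrative, not part of any Ceph mount).
mkdir -p /tmp/rstat-demo/a/b
printf 'hello'   > /tmp/rstat-demo/a/one
printf 'worlds!' > /tmp/rstat-demo/a/b/two

# rfiles the slow way: visit every inode beneath the point.
find /tmp/rstat-demo -type f | wc -l

# rbytes the slow way: summation over file contents (i_size, as Ceph
# sums it, rather than the block counts 'du' reports).
find /tmp/rstat-demo -type f -exec cat {} + | wc -c
```

With recursive accounting, both numbers come back from a single stat-like query on the subtree root instead of a walk over every inode below it.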

There are currently two ways to access the recursive stats via a standard
shell. The first simply sets the directory st_size value to the
_recursive_ bytes ('rbytes') value (when the client is mounted with -o
rbytes). For example (watch the directory sizes),

$ tar jxf linux-2.6.24.3.tar.bz2
$ ls -l
total 8
drwxr-xr-x 1 root root 0 Jul 10 05:30 .
drwxr-xr-x 8 root root 4096 Jul 9 18:21 ..
drwxrwxr-x 1 root root 254025660 Feb 26 00:20 linux-2.6.24.3
$ du -s linux-2.6.24.3/
254237 linux-2.6.24.3/
$ ls -al linux-2.6.24.3/
total 281
drwxrwxr-x 1 root root 254025660 Feb 26 00:20 .
drwxr-xr-x 1 root root 0 Jul 10 05:30 ..
-rw-rw-r-- 1 root root 628 Feb 26 00:20 .gitignore
-rw-rw-r-- 1 root root 3657 Feb 26 00:20 .mailmap
-rw-rw-r-- 1 root root 18693 Feb 26 00:20 COPYING
-rw-rw-r-- 1 root root 92230 Feb 26 00:20 CREDITS
drwxrwxr-x 1 root root 8984828 Feb 26 00:20 Documentation
-rw-rw-r-- 1 root root 1596 Feb 26 00:20 Kbuild
-rw-rw-r-- 1 root root 93957 Feb 26 00:20 MAINTAINERS
-rw-rw-r-- 1 root root 53162 Feb 26 00:20 Makefile
-rw-rw-r-- 1 root root 16930 Feb 26 00:20 README
-rw-rw-r-- 1 root root 3119 Feb 26 00:20 REPORTING-BUGS
drwxrwxr-x 1 root root 44216036 Feb 26 00:20 arch
drwxrwxr-x 1 root root 349137 Feb 26 00:20 block
drwxrwxr-x 1 root root 959654 Feb 26 00:20 crypto
drwxrwxr-x 1 root root 118578205 Feb 26 00:20 drivers
drwxrwxr-x 1 root root 21526882 Feb 26 00:20 fs
drwxrwxr-x 1 root root 27456604 Feb 26 00:20 include
drwxrwxr-x 1 root root 99077 Feb 26 00:20 init
drwxrwxr-x 1 root root 170827 Feb 26 00:20 ipc
drwxrwxr-x 1 root root 2189735 Feb 26 00:20 kernel
drwxrwxr-x 1 root root 679502 Feb 26 00:20 lib
drwxrwxr-x 1 root root 1213804 Feb 26 00:20 mm
drwxrwxr-x 1 root root 12562134 Feb 26 00:20 net
drwxrwxr-x 1 root root 3940 Feb 26 00:20 samples
drwxrwxr-x 1 root root 1105977 Feb 26 00:20 scripts
drwxrwxr-x 1 root root 740395 Feb 26 00:20 security
drwxrwxr-x 1 root root 12888682 Feb 26 00:20 sound
drwxrwxr-x 1 root root 16269 Feb 26 00:20 usr

Note that st_blocks is _not_ recursively defined, so 'du' still behaves as
expected. If mounted with -o norbytes instead, the directory st_size is
the number of entries in the directory.

The second interface takes advantage of the fact (?) that read() on a
directory is more or less undefined. (Okay, that's not really true, but
it used to return encoded dirents or something similar, and more recently
returns -EISDIR. As far as I know, no sane application expects meaningful
data from read() on a directory...) So, assuming Ceph is mounted with -o
dirstat,

$ cat linux-2.6.24.3/
entries: 27
files: 9
subdirs: 18
rentries: 24418
rfiles: 23062
rsubdirs: 1356
rbytes: 254025660
rctime: 1215668428.051898000

Fields prefixed with 'r' are recursively defined, while
entries/files/subdirs are just for the one directory. 'rctime' is the most
recent ctime within the hierarchy, which should be useful for backup
software or anything else scanning the hierarchy for recent changes.
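As a sketch of how backup software might consume this: the dirstat text parses trivially with awk. The sample below is embedded verbatim from the listing above, since reproducing it live needs a dirstat-mounted Ceph client, and the cutoff timestamp is made up for illustration:

```shell
# Sample dirstat output, copied from the 'cat linux-2.6.24.3/' run above.
dirstat='entries: 27
files: 9
subdirs: 18
rentries: 24418
rfiles: 23062
rsubdirs: 1356
rbytes: 254025660
rctime: 1215668428.051898000'

# A hypothetical incremental-backup check: descend into this directory
# only if something beneath it changed since the last run.
last_run=1215600000
rctime=$(printf '%s\n' "$dirstat" | awk '/^rctime:/ { print int($2) }')
if [ "$rctime" -gt "$last_run" ]; then
    echo descend
else
    echo skip
fi
```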

Naturally, there are a few caveats:

- There is some built-in delay before statistics fully propagate up
toward the root of the hierarchy. Changes are propagated
opportunistically when lock/lease state allows, with an upper bound of (by
default) ~30 seconds for each level of directory nesting.

- Ceph internally distinguishes between multiple links to the same file
(there is a single 'primary' link, and then zero or more 'remote' links).
Only the primary link contributes toward the 'rbytes' total.

- The 'rbytes' summation is over i_size, not blocks used. That means
sparse files "appear" larger than the storage space they actually consume.

- Directories don't yet contribute anything to the 'rbytes' total. They
should probably include an estimate of the storage consumed by directory
metadata. For this reason, and because the size isn't rounded up to the
block size, the 'rbytes' total will usually be slightly smaller than what
you get from 'du'.

- Currently no stats for the root directory itself.
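The i_size-versus-blocks caveat is easy to reproduce on any local filesystem that supports sparse files (scratch path illustrative): a file with a single byte written at the end of a 1 MB range has a 1 MB i_size, which is what an 'rbytes'-style summation would see, while 'du' charges only the allocated blocks:

```shell
# Create a sparse file: one byte written at offset 1 MB - 1.
dd if=/dev/zero of=/tmp/rstat-sparse bs=1 count=1 seek=1048575 2>/dev/null

# i_size (what the rbytes summation sees) vs. allocated 512-byte blocks
# (what st_blocks and 'du' see).
stat -c '%s %b' /tmp/rstat-sparse
du -k /tmp/rstat-sparse
```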


I'm extremely interested in what people think of overloading the file
system interface in this way. Handy? Crufty? Dangerous? Does anybody
know of any applications that rely on or expect meaningful values for a
directory's i_size? Or read() a directory?


More information on the recursive accounting at

http://ceph.newdream.net/wiki/Recursive_accounting

and Ceph itself at

http://ceph.newdream.net/

Cheers-
sage


2008-07-15 19:47:22

by Andreas Dilger

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Jul 15, 2008 11:28 -0700, Sage Weil wrote:
> unique (?) recursive accounting
> infrastructure that allows statistics about all metadata nested beneath a
> point in the directory hierarchy to be efficiently propagated up the tree.
> Currently this includes a file and directory count, total bytes (summation
> over file sizes), and most recent inode ctime.

Interesting...

> Note that st_blocks is _not_ recursively defined, so 'du' still behaves as
> expected. If mounted with -o norbytes instead, the directory st_size is
> the number of entries in the directory.

Is it possible to extract an environment variable from the process
in the kernel to decide what behaviour to have (e.g. like LS_COLORS)?

> The second interface takes advantage of the fact (?) that read() on a
> directory is more or less undefined. (Okay, that's not really true, but
> it used to return encoded dirents or something similar, and more recently
> returns -EISDIR. As far as I know, no sane application expects meaningful
> data from read() on a directory...) So, assuming Ceph is mounted with -o
> dirstat,

Hmm, what about just creating a virtual xattr that can be had with
getfattr user.dirstats?

> - The 'rbytes' summation is over i_size, not blocks used. That means
> sparse files "appear" larger than the storage space they actually consume.

I'd think that in many cases it is more important to accumulate the
blocks count and not the size, since a single core file would throw
off the whole "hunt for the worst space consumer" approach.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2008-07-15 19:53:45

by J. Bruce Fields

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue, Jul 15, 2008 at 11:28:22AM -0700, Sage Weil wrote:
> Fields prefixed with 'r' are recursively defined, while
> entries/files/subdirs is just for the one directory. 'rctime' is the most
> recent ctime within the hierarchy, which should be useful for backup
> software or anything else scanning the hierarchy for recent changes.
>
> Naturally, there are a few caveats:
>
> - There is some built-in delay before statistics fully propagate up
> toward the root of the hierarchy. Changes are propagated
> opportunistically when lock/lease state allows, with an upper bound of (by
> default) ~30 seconds for each level of directory nesting.

That makes it less useful, e.g., for somebody with cached data trying to
validate their cache, or for something like git trying to check a
directory tree for changes.

> - Ceph internally distinguishes between multiple links to the same file
> (there is a single 'primary' link, and then zero or more 'remote' links).
> Only the primary link contributes toward the 'rbytes' total.

Is that only true for 'rbytes'?

--b.

>
> - The 'rbytes' summation is over i_size, not blocks used. That means
> sparse files "appear" larger than the storage space they actually consume.
>
> - Directories don't yet contribute anything to the 'rbytes' total. They
> should probably include an estimate of the storage consumed by directory
> metadata. For this reason, and because the size isn't rounded up to the
> block size, the 'rbytes' total will usually be slightly smaller than what
> you get from 'du'.
>
> - Currently no stats for the root directory itself.
>
>
> I'm extremely interested in what people think of overloading the file
> system interface in this way. Handy? Crufty? Dangerous? Does anybody
> know of any applications that rely on or expect meaningful values for a
> directory's i_size? Or read() a directory?
>
>
> More information on the recursive accounting at
>
> http://ceph.newdream.net/wiki/Recursive_accounting
>
> and Ceph itself at
>
> http://ceph.newdream.net/
>
> Cheers-
> sage

2008-07-15 20:26:53

by Sage Weil

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue, 15 Jul 2008, Andreas Dilger wrote:
> > Note that st_blocks is _not_ recursively defined, so 'du' still behaves as
> > expected. If mounted with -o norbytes instead, the directory st_size is
> > the number of entries in the directory.
>
> Is it possible to extract an environment variable from the process
> in the kernel to decide what behaviour to have (e.g. like LS_COLORS)?

That could work too. Currently the flag is changing the client's i_size,
but the conditional can go in place of generic_fillattr, where st_size is
set. I would worry about the overhead of looking at the environment for
every getattr, though.

> > The second interface takes advantage of the fact (?) that read() on a
> > directory is more or less undefined. (Okay, that's not really true, but
> > it used to return encoded dirents or something similar, and more recently
> > returns -EISDIR. As far as I know, no sane application expects meaningful
> > data from read() on a directory...) So, assuming Ceph is mounted with -o
> > dirstat,
>
> Hmm, what about just creating a virtual xattr that can be had with
> getfattr user.dirstats?

Yeah, or ceph.dirstats, which hopefully backup software would ignore?
(Not quite sure how the xattr 'namespaces' are intended to be used.) Not
quite as convenient as 'cat dir' for the user, but cleaner.

> > - The 'rbytes' summation is over i_size, not blocks used. That means
> > sparse files "appear" larger than the storage space they actually consume.
>
> I'd think that in many cases it is more important to accumulate the
> blocks count and not the size, since a single core file would throw
> off the whole "hunt for the worst space consumer" approach.

Yes. If and when the MDS actually stores blocks used, that could
trivially be supported as well. But currently sparseness is a function of
the objects on the storage nodes, so things like hole-finding and fiemap
will require probing objects.

thanks-
sage

2008-07-15 20:42:35

by Sage Weil

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> > - There is some built-in delay before statistics fully propagate up
> > toward the root of the hierarchy. Changes are propagated
> > opportunistically when lock/lease state allows, with an upper bound of (by
> > default) ~30 seconds for each level of directory nesting.
>
> That makes it less useful, e.g., for somebody with cached data trying to
> validate their cache, or for something like git trying to check a
> directory tree for changes.

Having fully up to date values would definitely be nice, but unfortunately
doesn't play nice with the fact that different parts of the directory
hierarchy may be managed by different metadata servers. A primary goal in
implementing this was to minimize any impact on performance. The uses I
had in mind were more in line with quota-based accounting than cache
validation.

I think I can adjust the propagation heuristics/timeouts to make updates
seem more or less immediate to a user in most cases, but that won't be
sufficient for a tool like git that needs to reliably identify very recent
updates. For backup software wanting a consistent file system image, it
should really be operating on a snapshot as well, in which case a delay
between taking the snapshot and starting the scan for changes would allow
those values to propagate.

> > - Ceph internally distinguishes between multiple links to the same file
> > (there is a single 'primary' link, and then zero or more 'remote' links).
> > Only the primary link contributes toward the 'rbytes' total.
>
> Is that only true for 'rbytes'?

The same goes for rctime. As far as the recursive stats go, the other
stats (file/directory counts) aren't affected. The primary/remote
hard link distinction is fundamental to the way metadata is internally
managed and stored by the MDS, though, if that's what you mean (inode
content is embedded with the primary link's directory metadata).

sage


>
> --b.
>
> >
> > - The 'rbytes' summation is over i_size, not blocks used. That means
> > sparse files "appear" larger than the storage space they actually consume.
> >
> > - Directories don't yet contribute anything to the 'rbytes' total. They
> > should probably include an estimate of the storage consumed by directory
> > metadata. For this reason, and because the size isn't rounded up to the
> > block size, the 'rbytes' total will usually be slightly smaller than what
> > you get from 'du'.
> >
> > - Currently no stats for the root directory itself.
> >
> >
> > I'm extremely interested in what people think of overloading the file
> > system interface in this way. Handy? Crufty? Dangerous? Does anybody
> > know of any applications that rely on or expect meaningful values for a
> > directory's i_size? Or read() a directory?
> >
> >
> > More information on the recursive accounting at
> >
> > http://ceph.newdream.net/wiki/Recursive_accounting
> >
> > and Ceph itself at
> >
> > http://ceph.newdream.net/
> >
> > Cheers-
> > sage

2008-07-15 20:48:24

by J. Bruce Fields

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue, Jul 15, 2008 at 01:41:25PM -0700, Sage Weil wrote:
> On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> > > - There is some built-in delay before statistics fully propagate up
> > > toward the root of the hierarchy. Changes are propagated
> > > opportunistically when lock/lease state allows, with an upper bound of (by
> > > default) ~30 seconds for each level of directory nesting.
> >
> > That makes it less useful, e.g., for somebody with cached data trying to
> > validate their cache, or for something like git trying to check a
> > directory tree for changes.
>
> Having fully up to date values would definitely be nice, but unfortunately
> doesn't play nice with the fact that different parts of the directory
> hierarchy may be managed by different metadata servers. A primary goal in
> implementing this was to minimize any impact on performance. The uses I
> had in mind were more in line with quota-based accounting than cache
> validation.

Fair enough.

> I think I can adjust the propagation heuristics/timeouts to make updates
> seem more or less immediate to a user in most cases, but that won't be
> sufficient for a tool like git that needs to reliably identify very recent
> updates. For backup software wanting a consistent file system image, it
> should really be operating on a snapshot as well, in which case a delay
> between taking the snapshot and starting the scan for changes would allow
> those values to propagate.
>
> > > - Ceph internally distinguishes between multiple links to the same file
> > > (there is a single 'primary' link, and then zero or more 'remote' links).
> > > Only the primary link contributes toward the 'rbytes' total.
> >
> > Is that only true for 'rbytes'?
>
> The same goes for rctime. As far as the recursive stats go, the other
> stats (file/directory counts) aren't affected. The primary/remote
> hard link distinction is fundamental to the way metadata is internally
> managed and stored by the MDS, though, if that's what you mean (inode
> content is embedded with the primary link's directory metadata).

I just wonder how one would explain to users (or application writers)
why changes to a file are reflected in the parent's rctime in one case,
and not in another, especially if the primary link is otherwise
indistinguishable from the others. The symptoms could be a bit
mysterious from their point of view.

--b.

2008-07-15 21:18:36

by Sage Weil

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> I just wonder how one would explain to users (or application writers)
> why changes to a file are reflected in the parent's rctime in one case,
> and not in another, especially if the primary link is otherwise
> indistinguishable from the others. The symptoms could be a bit
> mysterious from their point of view.

Yes. I'm not sure it can really be avoided, though. I'm trying to lift
the usual restriction of having to predefine what the
volume/subvolume/qtree boundary is and then disallowing links/renames
between them. When all of a file's links are contained within the
directory you're looking at (i.e. something that might be a subvolume
under that paradigm), things look sensible. If links span two directories
and you're looking at recursive stats for a dir containing only one of
them, then you're necessarily going to have some weirdness (you don't want
to double-count).

Making the primary/remote-ness visible to users somehow (via, say, a
virtual xattr) might help a bit. The bottom line, though, is that links
from multiple points in the namespace and a hierarchical view of file
_content_ aren't particularly compatible concepts...

sage

2008-07-15 21:44:57

by Jamie Lokier

Subject: Re: Recursive directory accounting for size, ctime, etc.

J. Bruce Fields wrote:
> I just wonder how one would explain to users (or application writers)
> why changes to a file are reflected in the parent's rctime in one case,
> and not in another, especially if the primary link is otherwise
> indistinguishable from the others. The symptoms could be a bit
> mysterious from their point of view.

Btw, what happens when the primary link is deleted? Does another link
become the primary link?

-- Jamie

2008-07-15 21:52:17

by Sage Weil

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue, 15 Jul 2008, Jamie Lokier wrote:
> J. Bruce Fields wrote:
> > I just wonder how one would explain to users (or application writers)
> > why changes to a file are reflected in the parent's rctime in one case,
> > and not in another, especially if the primary link is otherwise
> > indistinguishable from the others. The symptoms could be a bit
> > mysterious from their point of view.
>
> Btw, what happens when the primary link is deleted? Does another link
> become the primary link?

Yeah. It's initially moved to a hidden directory (along with
open-but-unlinked files), and then moved back into the hierarchy the next
time a remote link is used.

sage

2008-07-15 21:56:38

by Jamie Lokier

Subject: Re: Recursive directory accounting for size, ctime, etc.

Sage Weil wrote:
> Having fully up to date values would definitely be nice, but unfortunately
> doesn't play nice with the fact that different parts of the directory
> hierarchy may be managed by different metadata servers. A primary goal in
> implementing this was to minimize any impact on performance. The uses I
> had in mind were more in line with quota-based accounting than cache
> validation.
>
> I think I can adjust the propagation heuristics/timeouts to make updates
> seem more or less immediate to a user in most cases, but that won't be
> sufficient for a tool like git that needs to reliably identify very recent
> updates. For backup software wanting a consistent file system image, it
> should really be operating on a snapshot as well, in which case a delay
> between taking the snapshot and starting the scan for changes would allow
> those values to propagate.

I have a similar thing in a distributed database (with some
filesystem-like characteristics) I'm working on.

The way I handle propagating compound values derived from multiple
metadata servers, like that, is with leases. (Similar to fcntl
F_GETLEASE, Windows oplocks, and the CPU MESI protocol.)

E.g. when a single server is about to modify a file, it grabs a lease
covering the metadata for this file _plus_ leases for the aggregated
values for all parent directories, prior to allowing the file
modification. The first file modification will be delayed briefly to
do this, but then subsequent modifications, including to other files
covered by the same directories, are instant because those servers
already have leases. They can renew them asynchronously as needed.

When a client wants the aggregate values for a directory (i.e. total
size of all files recursively under it), it acquires a lease on that
directory only. To do that, it has to query all the metadata servers
which currently hold a lease covering that.

The net effect is that you can use the results for cache validation, as
in the git example. There's a network ping-pong if someone is alternately
modifying a file under the tree and reading the aggregate value from a
parent directory elsewhere, but at least the values are always
consistent. Most times, there is no ping-pong because that's not a
common scenario.

(In my project, you can also specify that some queries are allowed to
be a little out of date, to avoid lease acquisition delays if getting
an inaccurate result fast is better. That's useful for GUIs, but not
suitable for git-like cache validation.)

-- Jamie

2008-07-15 22:46:03

by J. Bruce Fields

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue, Jul 15, 2008 at 02:16:45PM -0700, Sage Weil wrote:
> On Tue, 15 Jul 2008, J. Bruce Fields wrote:
> > I just wonder how one would explain to users (or application writers)
> > why changes to a file are reflected in the parent's rctime in one case,
> > and not in another, especially if the primary link is otherwise
> > indistinguishable from the others. The symptoms could be a bit
> > mysterious from their point of view.
>
> Yes. I'm not sure it can really be avoided, though. I'm trying to lift
> the usual restriction of having to predefine what the
> volume/subvolume/qtree boundary is and then disallowing links/renames
> between them. When all of a file's links are contained within the
> directory you're looking at (i.e. something that might be a subvolume
> under that paradigm), things look sensible. If links span two directories
> and you're looking at recursive stats for a dir containing only one of
> them, then you're necessarily going to have some weirdness (you don't want
> to double-count).

Yeah, there's no clear right answer--that's partly why I was curious
about rctime specifically.

--b.

2008-08-07 20:10:38

by Pavel Machek

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Tue 2008-07-15 11:28:22, Sage Weil wrote:
> All-
>
> Ceph is a new distributed file system for Linux designed for scalability
> (terabytes to exabytes, tens to thousands of storage nodes), reliability,
> and performance. The latest release (v0.3), aside from xattr support and
> the usual slew of bugfixes, includes a unique (?) recursive accounting
> infrastructure that allows statistics about all metadata nested beneath a
> point in the directory hierarchy to be efficiently propagated up the tree.
> Currently this includes a file and directory count, total bytes (summation
> over file sizes), and most recent inode ctime. For example, for a
> directory like /home, Ceph can efficiently report the total number of
> files, directories, and bytes contained by that entire subtree of the
> directory hierarchy.
>
> The file size summation is the most interesting, as it effectively gives
> you directory-based quota space accounting with fine granularity. In many
> deployments, the quota _accounting_ is more important than actual
> enforcement. Anybody who has had to figure out what has filled/is filling
> up a large volume will appreciate how cumbersome and inefficient 'du' can
> be for that purpose--especially when you're in a hurry.
>
> There are currently two ways to access the recursive stats via a standard
> shell. The first simply sets the directory st_size value to the
> _recursive_ bytes ('rbytes') value (when the client is mounted with -o
> rbytes). For example (watch the directory sizes),
...

> Naturally, there are a few caveats:
>
> - There is some built-in delay before statistics fully propagate up
> toward the root of the hierarchy. Changes are propagated
> opportunistically when lock/lease state allows, with an upper bound of (by
> default) ~30 seconds for each level of directory nesting.

Having instant rctime would be very nice -- for stuff like locate and
speeding up kde startup.

> I'm extremely interested in what people think of overloading the file
> system interface in this way. Handy? Crufty? Dangerous? Does anybody

Too ugly to live.

What about new rstat() syscall?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-08-08 13:11:35

by John Stoffel

Subject: Re: Recursive directory accounting for size, ctime, etc.

>>>>> "Pavel" == Pavel Machek <[email protected]> writes:

Pavel> On Tue 2008-07-15 11:28:22, Sage Weil wrote:
>> All-
>>
>> Ceph is a new distributed file system for Linux designed for scalability
>> (terabytes to exabytes, tens to thousands of storage nodes), reliability,
>> and performance. The latest release (v0.3), aside from xattr support and
>> the usual slew of bugfixes, includes a unique (?) recursive accounting
>> infrastructure that allows statistics about all metadata nested beneath a
>> point in the directory hierarchy to be efficiently propagated up the tree.
>> Currently this includes a file and directory count, total bytes (summation
>> over file sizes), and most recent inode ctime. For example, for a
>> directory like /home, Ceph can efficiently report the total number of
>> files, directories, and bytes contained by that entire subtree of the
>> directory hierarchy.
>>
>> The file size summation is the most interesting, as it effectively gives
>> you directory-based quota space accounting with fine granularity. In many
>> deployments, the quota _accounting_ is more important than actual
>> enforcement. Anybody who has had to figure out what has filled/is filling
>> up a large volume will appreciate how cumbersome and inefficient 'du' can
>> be for that purpose--especially when you're in a hurry.
>>
>> There are currently two ways to access the recursive stats via a standard
>> shell. The first simply sets the directory st_size value to the
>> _recursive_ bytes ('rbytes') value (when the client is mounted with -o
>> rbytes). For example (watch the directory sizes),
Pavel> ...

>> Naturally, there are a few caveats:
>>
>> - There is some built-in delay before statistics fully propagate up
>> toward the root of the hierarchy. Changes are propagated
>> opportunistically when lock/lease state allows, with an upper bound of (by
>> default) ~30 seconds for each level of directory nesting.

Pavel> Having instant rctime would be very nice -- for stuff like locate and
Pavel> speeding up kde startup.

>> I'm extremely interested in what people think of overloading the file
>> system interface in this way. Handy? Crufty? Dangerous? Does anybody

Pavel> Too ugly to live.

Pavel> What about new rstat() syscall?

Or how about tying this into the quotactl() syscall and extending it a
bit? Say quotactl2(cmd,device,id,addr,path), which is probably just as
ugly, but seems to make better sense.

Me, I'd love to have this type of reporting on my filesystems,
especially since it would help me in my day job.

How this would look when exported over NFS is an issue, too.

John

2008-08-08 23:32:43

by Sage Weil

Subject: Re: Recursive directory accounting for size, ctime, etc.

On Fri, 8 Aug 2008, John Stoffel wrote:
> >>>>> "Pavel" == Pavel Machek <[email protected]> writes:
>
> Pavel> On Tue 2008-07-15 11:28:22, Sage Weil wrote:
> >> All-
> >>
> >> Ceph is a new distributed file system for Linux designed for scalability
> >> (terabytes to exabytes, tens to thousands of storage nodes), reliability,
> >> and performance. The latest release (v0.3), aside from xattr support and
> >> the usual slew of bugfixes, includes a unique (?) recursive accounting
> >> infrastructure that allows statistics about all metadata nested beneath a
> >> point in the directory hierarchy to be efficiently propagated up the tree.
> >> Currently this includes a file and directory count, total bytes (summation
> >> over file sizes), and most recent inode ctime. For example, for a
> >> directory like /home, Ceph can efficiently report the total number of
> >> files, directories, and bytes contained by that entire subtree of the
> >> directory hierarchy.
> >>
> >> The file size summation is the most interesting, as it effectively gives
> >> you directory-based quota space accounting with fine granularity. In many
> >> deployments, the quota _accounting_ is more important than actual
> >> enforcement. Anybody who has had to figure out what has filled/is filling
> >> up a large volume will appreciate how cumbersome and inefficient 'du' can
> >> be for that purpose--especially when you're in a hurry.
> >>
> >> There are currently two ways to access the recursive stats via a standard
> >> shell. The first simply sets the directory st_size value to the
> >> _recursive_ bytes ('rbytes') value (when the client is mounted with -o
> >> rbytes). For example (watch the directory sizes),
> Pavel> ...
>
> >> Naturally, there are a few caveats:
> >>
> >> - There is some built-in delay before statistics fully propagate up
> >> toward the root of the hierarchy. Changes are propagated
> >> opportunistically when lock/lease state allows, with an upper bound of (by
> >> default) ~30 seconds for each level of directory nesting.
>
> Pavel> Having instant rctime would be very nice -- for stuff like locate and
> Pavel> speeding up kde startup.
>
> >> I'm extremely interested in what people think of overloading the file
> >> system interface in this way. Handy? Crufty? Dangerous? Does anybody
>
> Pavel> Too ugly to live.

:)

> Pavel> What about new rstat() syscall?
>
> Or how about tying this into the quotactl() syscall and extending it a
> bit? Say quotactl2(cmd,device,id,addr,path) which is probably just as
> ugly, but seems to make better sense.

Introducing or modifying system calls makes for pretty interfaces, but is
a bit impractical (and overkill) to support something present in only one
filesystem.

So far I think Andreas' suggestion of using pseudo-xattrs is the cleanest
and simplest: it doesn't interfere with any existing interfaces
(provided the virtual xattr name is well chosen), and is usable via
standard command line tools like getfattr.

sage