2004-11-30 04:32:14

by John Richard Moser

Subject: Designing Another File System


Greetings.

I've been interested in file system design for a while, and I think I
may be able to design one. Being poor, I would like to patent/sell it
when done; however, whatever the result, I'd assuredly give a free
license to implement the resulting file system in GPL2 licensed code (if
you're not making money off it, I'm not making money off it). This is
similar in principle to how Hans Reiser sells his file system, I think.

I've examined briefly the overall concepts that go into several file
systems and pooled them together to create my basic requirements.
These include things such as:


- a new journaling method that scales up to high loads and down to
small devices

- localization of Inodes and related meta-data to prevent disk thrashing

- a scheme which allows Inodes to be dynamically allocated and
deallocated out of order

- 64 bit indices indicating the exact physical location on disk of
Inodes, giving an O(1) seek to the Inode itself

- A design based around preventing information leaks through guaranteed
secure erasure of data (insofar as the journal will even finish wiping
out data when replaying a transaction)
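
The 64-bit-index point can be sketched in a few lines; the record size
and the offset encoding here are my own assumptions, not part of the
proposal:

```c
#include <stdint.h>

/* Hypothetical encoding: the 64 bit inode index is a record number
 * into a flat inode area, so the on-disk byte offset is computed in
 * O(1) with no table walk or indirect lookup. */
#define INODE_RECORD_SIZE 128ULL   /* assumed size, not from the post */

static uint64_t inode_disk_offset(uint64_t inode_area_start, uint64_t ino)
{
    return inode_area_start + ino * INODE_RECORD_SIZE;
}
```

Seeking to an inode is then a single computed seek, which is the O(1)
property claimed above.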


I have 6 basic questions.


1) Can Unix utilities in general deal with 64 bit Inodes? (Most
programs, I assume, won't care; ls -i and df -i might have trouble)

2) Are there any other security concerns aside from information leaking
(deleted data being left over on disk, which may pop up in incompletely
written files)?

3) What basic information do I absolutely *need* in my super block?

4) What basic information do I absolutely *need* in my Inodes? (I'm
thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
links})

5) What basic information do I absolutely *need* in my directory
entries? (I'm thinking {name,inode})

6) How do I lay out a time field for atime/dtime/ctime/mtime?
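
For question 6, a conventional layout (my sketch; nothing in this
thread prescribes it) is a Unix-epoch seconds count widened to 64 bits,
with a separate sub-second field:

```c
#include <stdint.h>

/* One possible on-disk time field for atime/dtime/ctime/mtime:
 * 64 bit signed seconds since the Unix epoch plus 32 bit nanoseconds.
 * The 64 bit count avoids the 2038 rollover of a bare 32 bit field. */
struct fs_timestamp {
    int64_t  sec;   /* seconds since 1970-01-01T00:00:00Z */
    uint32_t nsec;  /* 0..999999999 */
} __attribute__((packed));  /* 12 bytes on disk (GCC-style packing) */
```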


A few more specifics: as Block size grows, the file system does not lose
efficiency. Fragment Blocks should skillfully subdivide huge Blocks
with a third-level index, increasing seek count by one (O(+1)) for the
information stored in the Fragment Block:


-Inode
|-Meta-data
||-Data Block
|||-Data (2 seeks)
||-Fragment Block
|||-Fragment
||||-Data (3 seeks)


Fragment Blocks can store anything but Inodes, so it would be advisable
to avoid huge Blocks; if a Block is 25% of the disk, for example, one
Block must be dedicated to serving up Inode information. Also, with
64KiB blocks, a 256TiB file system can be created. Larger Blocks will
allow larger file systems; a larger file system will predictably house
more files (unless it houses one gargantuan relational data base-- the
only plausible situation where my design would require more, smaller
blocks to be more efficient).
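
The 64KiB figure works out if one assumes 32-bit block numbers (an
assumption of mine; the post doesn't state the address width): 2^16
bytes per block times 2^32 addressable blocks is 2^48 bytes, i.e.
256TiB. A quick sanity check:

```c
#include <stdint.h>

/* Maximum filesystem size for a given block size, assuming 32 bit
 * block numbers (my assumption, not stated in the post). */
static uint64_t max_fs_bytes(uint64_t block_size)
{
    return block_size << 32;   /* bytes per block * number of blocks */
}
```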

Directories will have to be some sort of on disk BsomethingTree. . . B*,
B+, something. I have no idea how to implement this :D I'll assume I
treat the directory data as a logically contiguous file (yes I'm gonna
store directory data just like file data). I could use a string hash of
the Directory Entry name with disambiguation at the end, except my
options here seem limited:


- Use a 32 bit string hash, 4 levels, 8 bits per level. 256 entries of
32 bit {offset} for the next level per level, 1024B per level on unique
paths. Worst case 1 path per file; 65536 files in the worst case
measure 1 root (1KiB) + 256 L2 (256KiB) + 65536 L3 (64M) + 65536 L4
(64M), total 128MiB; collision rate is probably low. Ouch.

- Use a 16 bit string hash, 2 levels, 8 bits per level. 256 entries of
32 bit {offset} for the next level per level, 1024B per level on unique
path. Worst case 1 path per file; 65536 files in the worst case measure
1 root (1KiB) + 256 L2 (256KiB), no disambiguation, though the collision
rate is probably higher than that. Beyond 65536 entries, collisions appear.
Hitting a disambiguation situation is going to be fairly common.


I guess the second would be better? I can't locate any directories on
my drive with >2000 entries *shrug*. The end key is just the entry
{name,inode} pair.
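
The second scheme can be sketched as follows; the hash function is a
placeholder of mine (the post doesn't choose one), and the point is
only how the 16 bits split into two 8-bit radix levels:

```c
#include <stdint.h>

/* Placeholder string hash (djb2, folded to 16 bits).  The post does
 * not pick a hash function; this one is only for illustration. */
static uint16_t name_hash16(const char *name)
{
    uint32_t h = 5381;
    while (*name)
        h = h * 33 + (uint8_t)*name++;
    return (uint16_t)(h ^ (h >> 16));
}

/* The root table is indexed by the high byte, the second level by the
 * low byte; each table holds 256 32 bit offsets as described above. */
static uint8_t level1_index(uint16_t h) { return (uint8_t)(h >> 8); }
static uint8_t level2_index(uint16_t h) { return (uint8_t)(h & 0xff); }
```

Names that collide through both levels would then fall into the
disambiguation list mentioned above.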

Any ideas?

--
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.



2004-11-30 07:17:03

by Bernd Eckenfels

Subject: Re: Designing Another File System

In article <[email protected]> you wrote:
> when done; however, whatever the result, I'd assuredly give a free
> license to implement the resulting file system in GPL2 licensed code (if
> you're not making money off it, I'm not making money off it).

You can't restrict the GPL.

> I've examined briefly the overall concepts which go into several file
> systems and pooled them together to create my basic requirements. In
> these include things such as:

It is not a good idea to publish your ideas if you want to patent them.

> - - localization of Inodes and related meta-data to prevent disk thrashing

Why do you assume inodes?

> - - 64 bit indices indicating the exact physical location on disk of
> Inodes, giving a O(1) seek to the Inode itself

How is that different from Logical Block Numbers with a fixed cluster
factor? (besides it wastes more bits)

> - - A design based around preventing information leaking by guaranteed
> secure erasure of data (insofar that the journal even will finish wiping
> out data when replaying a transaction)

Hopefully optional.

> 2) Are there any other security concerns aside from information leaking
> (deleted data being left over on disk, which may pop up in incompletely
> written files)?

IT-Security is about availability and reliability also.

> 3) What basic information do I absolutely *need* in my super block?

How should WE know? (For mount you might want to have versioning and uuids)

> 5) What basic information do I absolutely *need* in my directory
> entries? (I'm thinking {name,inode})

Think about optimizations like NTFS, where you can sort directories by
attributes and store often-used attributes in the dir, so you don't need an
inode dereference (MFT lookup in the case of NTFS)

> Directories will have to be some sort of on disk BsomethingTree. . . B*,
> B+, something. I have no idea how to implement this :D I'll assume I
> treat the directory data as a logically contiguous file (yes I'm gonna
> store directory data just like file data). I could use a string hash of
> the Directory Entry name with disambiguation at the end, except my
> options here seem limited:

Don't forget to optimize directories for all 4: insertion, deletion,
lookup, and traversal.

> I guess the second would be better? I can't locate any directories on
> my drive with >2000 entries *shrug*. The end key is just the entry
> {name,inode} pair.

Do you want to run old style NNTP Servers?

> Any ideas?

In fact I rather miss a new idea here. So what makes your FS better? Why
should anyone use a non-open-source implementation if it does not provide
anything new?

Greetings
Bernd

2004-11-30 13:00:36

by Helge Hafting

Subject: Re: Designing Another File System

John Richard Moser wrote:

>
> Greetings.
>
> I've been interested in file system design for a while, and I think I
> may be able to design one. Being poor, I would like to patent/sell it
> when done;

Well, if you want to make money - you get to do the work. Including
the initial research into filesystem design.

> however, whatever the result, I'd assuredly give a free
> license to implement the resulting file system in GPL2 licensed code (if
> you're not making money off it, I'm not making money off it). This is
> similar in principle to how Hans Reiser sells his file system I think.

GPL means no restrictions, you can't prevent, say, redhat from
selling distro CDs with your GPL'ed fs. As for _patenting_ it, well,
people here generally dislike software patents for good reasons.
You don't need a patent for this either - a copyright is probably
what you're after.

>
> 2) Are there any other security concerns aside from information leaking
> (deleted data being left over on disk, which may pop up in incompletely
> written files)?

Have you considered what kind of data loss you'll get if power fails
before stuff is written to disk, or while it is being written? That sort
of thing is important to users, because it really happens a lot. Can you
design so a good fsck for your fs is possible?

> 6) How do I lay out a time field for atime/dtime/ctime/mtime?
>
Any way you like, as long as you actually store the information
in a retrievable way.

>
> A few more specifics, as Block size grows, the file system does not lose
> efficiency. Fragment Blocks should skillfully subdivide huge Blocks
> with a third level index, increasing seek time O(+1) for the information
> stored in the Fragment Block:
>
>
> - -Inode
> |-Meta-data
> ||-Data Block
> |||-Data (2 seeks)
> ||-Fragment Block
> |||-Fragment
> ||||-Data (3 seeks)
>
>
> Fragment Blocks can store anything but Inodes, so it would be advisable
> to avoid huge Blocks; if a Block is 25% of the disk, for example, one
> Block must be dedicated to serving up Inode information. Also, with
> 64KiB blocks, a 256TiB file system can be created. Larger Blocks will
> allow larger file systems; a larger file system will predictably house
> more files (unless it houses one gargantuan relational data base-- the
> only plausible situation where my design would require more, smaller
> blocks to be more efficient).
>
> Directories will have to be some sort of on disk BsomethingTree. . . B*,
> B+, something. I have no idea how to implement this :D I'll assume I

Then you have a lot to learn about fs design - before you can
design anything _sellable_. Take a look at the details of
existing linux fs'es - and be happy that you can do that at all because
they aren't patented or otherwise restricted.

> treat the directory data as a logically contiguous file (yes I'm gonna
> store directory data just like file data). I could use a string hash of
> the Directory Entry name with disambiguation at the end, except my
> options here seem limited:
>
>
> - - Use a 32 bit string hash, 4 levels, 8 bits per level. 256 entries of
> 32 bit {offset} for the next level per level, 1024B per level on unique
> paths. Worst case 1 path per file, 65536 files in the worst case
> measure 1 root (1KiB) + 256 L2 (256KiB) + 65536 L3 (64M) + 65536 L4
> (64M), total 128MiB, collision rate is probably low. Ouch.
>
> - - Use a 16 bit string hash, 2 levels, 8 bits per level. 256 entries of
> 32 bit {offset) for the next level per level, 1024B per level on unique
> path. Worst case 1 path per file, 65536 files in the worst case measure
> 1 root (1KiB) + 256 L2 (256KiB), no disambiguation, though collision
> rate is probably higher than that. Beyond 65536 creates collisions.
> Hitting a disambiguation situation is going to be fairly common.
>
>
> I guess the second would be better? I can't locate any directories on
> my drive with >2000 entries *shrug*. The end key is just the entry
> {name,inode} pair.

If you want this fs to be generically used, expect some people to show
up with millions of files in a directory or tens of millions.
They will not be amused if the lookup isn't O(1). . .

There are still fs'es around that don't look up names in O(1)
time, but some O(1) fs'es exist. So, to _sell_ you need to be among the
better ones.

> Any ideas?
>
Write a serious free fs, and get help here. Or go commercial - sell
your fs and pay your developers and helpers.

Helge Hafting

2004-11-30 17:34:37

by Alan

Subject: Re: Designing Another File System

On Maw, 2004-11-30 at 04:32, John Richard Moser wrote:
> I've been interested in file system design for a while, and I think I
> may be able to design one. Being poor, I would like to patent/sell it
> when done; however, whatever the result, I'd assuredly give a free
> license to implement the resulting file system in GPL2 licensed code

Several other vendors have followed this kind of dual model - MySQL,
Troll Tech and Sleepycat being three obvious examples.

> - - 64 bit indices indicating the exact physical location on disk of
> Inodes, giving a O(1) seek to the Inode itself

Until you get a bad block 8) or want to find it in memory (which is the
usual case)

> 1) Can Unix utilities in general deal with 64 bit Inodes? (Most
> programs I assume won't care; ls -i and df -i might have trouble)

You would have to ask a unix source code licensee. For Linux, inodes can
be 64bit on 64bit platforms, although you would undoubtedly find some
oddments that were not resolved.

> 4) What basic information do I absolutely *need* in my Inodes? (I'm
> thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
> links}

See POSIX 1003.1 and the Single Unix Specification. That defines the
behaviour.

> 5) What basic information do I absolutely *need* in my directory
> entries? (I'm thinking {name,inode})

Ditto

> 6) How do I lay out a time field for atime/dtime/ctime/mtime?

Internal issue to the file system.


2004-11-30 18:32:33

by Valdis Klētnieks

Subject: Re: Designing Another File System

On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:

(Somebody else can address the performance issues - I'll stick to the
parts I understand.. ;)

> - - A design based around preventing information leaking by guaranteed
> secure erasure of data (insofar that the journal even will finish wiping
> out data when replaying a transaction)

Consider carefully what threat model you have here - your choices will be
constrained by it. If you don't know your threat model, it's time to look
for other features to design around...

(your comment indicates a level of confusion regarding what paranoia is desired)

Note well that you need to define "guaranteed secure erasure" - at least for
US DOD contractors, DOD 5220.22-M requires 3 over-writes (a character, its
complement, and random), plus a verify. That's good for SECRET and below - but
TOP SECRET and higher still require physical destruction of the media. Also,
they punt on the issue of over-writing a sector that's been re-allocated by
the hardware (apparently the chances of critical secret data being left in
a reallocated block but still actually readable are "low enough" not to worry).
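
The 3-overwrite discipline described above looks roughly like this.
This is my user-space sketch over an in-memory region standing in for a
disk extent; it deliberately does not address the hard parts raised
here (reaching the physical sectors past caches, reallocated blocks,
or the verify step):

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Sketch of a DOD 5220.22-M style 3-pass overwrite: a fixed character,
 * its complement, then random bytes.  A real implementation must also
 * force the writes down to the physical sectors and verify them, which
 * a plain memory wipe cannot guarantee. */
static void three_pass_wipe(uint8_t *region, size_t len)
{
    const uint8_t pattern = 0x55;
    size_t i;

    for (i = 0; i < len; i++)
        region[i] = pattern;               /* pass 1: character */
    for (i = 0; i < len; i++)
        region[i] = (uint8_t)~pattern;     /* pass 2: complement */
    for (i = 0; i < len; i++)
        region[i] = (uint8_t)rand();       /* pass 3: random */
}
```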

Also, 5220-22.M is more concerned with the exposure of "You surplus the machine
and forgot to wipe the drives" - there's a whole *different* set of rules
regarding how you keep an attacker who steals the system (physical access
issues) or gets root access (this is a *tough* one) from scavenging the
drives....

> 2) Are there any other security concerns aside from information leaking
> (deleted data being left over on disk, which may pop up in incompletely
> written files)?

Which of the following are you worried about:

1) On a filesystem that does metadata journalling but not data journalling,
it's possible for a file to be extended by a write(), the metadata is
journalled, but the system fails before the actual data makes it to disk. As a
result, after the system comes up, stale data is present in the file, causing a
small data exposure and large reliability issues (opening a file with
OpenOffice will almost certainly cause Very Bad Errors if it's a file that was
in the process of being saved when the system went down, so you're actually
reading blocks of somebody else's old Fortran source code...) Note that this
exposure does *NOT* need you to clear data out of the journal - merely to
journal data (or find other ways to guarantee you never allocate a stale block
of data). This is why I suggest that you're unclear regarding your threat
model.

2) If you're worried about a well-funded attacker managing to scavenge secure
data out of the journal, what do you do about the attacker scavenging *the rest
of the disk, including existing files*? (Hint - cryptoloop is the correct
answer here, unless you think Jari Ruusu is right, in which case *his*
encrypted filesystem stuff is the right answer).

Alternatively, you may wish to consider a filesystem that does crypto on a
per-file basis - be prepared to deal with some very hairy key-management and
information leakage problems if you pursue this route...

In any case, both cryptoloop and Jari's loop-AES address the "disk captured by
the attacker" issues, but don't do much for "attacker gets root" - that's a
whole DIFFERENT set of problems...



2004-11-30 18:50:10

by Alan

Subject: Re: Designing Another File System

On Maw, 2004-11-30 at 18:28, [email protected] wrote:
> On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:
> they punt on the issue of over-writing a sector that's been re-allocated by
> the hardware (apparently the chances of critical secret data being left in
> a reallocated block but still actually readable are "low enough" not to worry).

I guess they never considered CF cards, which internally are log structured
and for which such erase operations are very close to pointless.

2004-11-30 19:03:31

by Jan Engelhardt

Subject: Re: Designing Another File System

>I guess they never consider CF cards which internally are log structured
>and for whom such erase operations are very close to pointless.

Don't forget WORM (in "multi session", of course).



Jan Engelhardt
--
ENOSPC

2004-11-30 19:14:24

by Valdis Klētnieks

Subject: Re: Designing Another File System

On Tue, 30 Nov 2004 17:46:10 GMT, Alan Cox said:
> On Maw, 2004-11-30 at 18:28, [email protected] wrote:
> > On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:
> > they punt on the issue of over-writing a sector that's been re-allocated by
> > the hardware (apparently the chances of critical secret data being left in
> > a reallocated block but still actually readable are "low enough" not to worry).
>
> I guess they never consider CF cards which internally are log structured
> and for whom such erase operations are very close to pointless.

The 3-overwrites is for "rigid disks" only, and for "sanitize" operations
required when the media is being released from continuous protection (basically,
when you're disposing of the drive). Clearing (for when the drive is being
kept, but re-used) only requires 1 pass. A CF card would be handled
under different rules - I'm pretty sure it would be treated as the appropriate
class of "memory" (but admit not knowing which technology it would be).

See "Clearing and sanitization matrix" from the bottom of:
http://www.dss.mil/isec/chapter8.htm



2004-11-30 19:22:37

by Phillip Lougher

Subject: Re: Designing Another File System

On Mon, 29 Nov 2004 23:32:05 -0500, John Richard Moser
<[email protected]> wrote:

> - - localization of Inodes and related meta-data to prevent disk thrashing

All filesystems place their filesystem metadata inside the inodes. If
you mean file metadata then please be more precise. This isn't
terribly new; recent posts have discussed how moving EAs/ACLs inside
the inode for ext3 has sped up performance.

>
> - - a scheme which allows Inodes to be dynamically allocated and
> deallocated out of order
>

Um, all filesystems do that, I think you're missing words to the
effect "without any performance loss or block fragmentation" !

> - - 64 bit indices indicating the exact physical location on disk of
> Inodes, giving a O(1) seek to the Inode itself

> 1) Can Unix utilities in general deal with 64 bit Inodes? (Most
> programs I assume won't care; ls -i and df -i might have trouble)
>

There seems to be some confusion here. The filesystem can use 64 bit
inode numbers internally but hide these 64 bits and instead present
munged 32 bit numbers to Linux.

The 64 bit resolution is only necessary within the filesystem dentry
lookup function to go from a directory name entry to the physical
inode location on disk. The inode number can then be reduced to 32
bits for 'presentation' to the VFS. AFAIK all file access is
through the dentry cache, so this is sufficient. The only problem is
that VFS iget() needs to be replaced with a filesystem-specific iget.
A number of filesystems do this. Squashfs internally uses 37 bit
inode numbers and presents them as 32 bit inode numbers in this way.
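
A minimal sketch of the presentation trick described here (the folding
scheme is mine, not the actual Squashfs code):

```c
#include <stdint.h>

/* Fold a 64 bit internal inode location into a 32 bit inode number
 * for presentation to the VFS (stat, ls -i).  Lookup still uses the
 * full 64 bit value inside the filesystem's own dentry lookup path. */
static uint32_t presented_ino(uint64_t internal)
{
    uint32_t ino = (uint32_t)internal ^ (uint32_t)(internal >> 32);
    return ino ? ino : 1;   /* avoid presenting reserved inode 0 */
}
```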

> 3) What basic information do I absolutely *need* in my super block?
> 4) What basic information do I absolutely *need* in my Inodes? (I'm
> thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
> links}

Very much depends on your filesystem. Cramfs is a good example of the
minimum you need to store to satisfy the Linux VFS. If you don't care
what they are almost anything can be invented (uid, gid, mode, atime,
dtime etc) and set to a useful default. The *absolute* minimum is
probably type, file/dir size, and file/dir data location on disk.

> I guess the second would be better? I can't locate any directories on
> my drive with >2000 entries *shrug*. The end key is just the entry
> {name,inode} pair.

I've had people trying to store 500,000 + files in a Squashfs
directory. Needless to say with the original directory implementation
this didn't work terribly well...

Phillip Lougher

2004-11-30 20:23:05

by Valdis Klētnieks

Subject: Re: Designing Another File System

On Tue, 30 Nov 2004 14:14:10 EST, [email protected] said:

> See "Clearing and sanitization matrix" from the bottom of:
> http://www.dss.mil/isec/chapter8.htm

"I'm sorry, that's an old, worn out magic word" ;)

For what it's worth, that table is in the original 1995 version.
"Change 2" in Feb 01 completely redid chapter 8, and removed said
table. It's unclear what, if any, replacement table should be used.
I'll follow up if I find out anything specific - the current text just says:

"Instructions on clearing, sanitization and release of IS media shall be issued
by the accrediting CSA."




2004-12-01 01:21:31

by John Richard Moser

Subject: Re: Designing Another File System




Helge Hafting wrote:
| John Richard Moser wrote:
|
|>
|> Greetings.
|>
|> I've been interested in file system design for a while, and I think I
|> may be able to design one. Being poor, I would like to patent/sell it
|> when done;
|
|
| Well, if you want to make money - you get to do the work. Including
| the initial research into filesystem design.
|

Eh, you're probably right; but I thought all idealistic open source
advocates considered the fruits of the work to be proper return to the
community >:P

Anyway, I dunno how I'd go about any of that. The thoughts have
crossed my mind though, as they should yours when you're poor and living
in your parents' house.

I'll definitely code it myself if I decide to sell licenses to use the
code in closed source products; I can't sell other peoples' work, it
doesn't work that way.

|> however, whatever the result, I'd assuredly give a free
|> license to implement the resulting file system in GPL2 licensed code (if
|> you're not making money off it, I'm not making money off it). This is
|> similar in principle to how Hans Reiser sells his file system I think.
|
|
| GPL means no restrictions, you can't prevent, say, redhat from
| selling distro CDs with your GPL'ed fs. As for _patenting_ it, well,
| people here generally dislike software patents for good reasons.
| You don't need a patent for this either - a copyright is probably
| what you're after.
|

MySQL, DansGuardian. They cost for business use.

As for "no restrictions," remember that the GPL means you can't just
incorporate the code into your closed source product. A few GPL
projects have been known to sell separate licenses (ReiserFS, I thought,
did this) to allow such an incorporation of their work. If, e.g., Apple
wants *me* to write them a driver for the FS, fine, they can pay me
royalties too.

|>
|> 2) Are there any other security concerns aside from information leaking
|> (deleted data being left over on disk, which may pop up in incompletely
|> written files)?
|
|
| Have you considered what kind of data loss you'll get if power fails
| before stuff is written to disk, or while it is being written? That sort
| of thing is important to users, because it really happens a lot. Can you
| design so a good fsck for your fs is possible?
|

Data loss can't really be helped. Either the data makes it to disk or
it doesn't; data goes after allocation, but even a full journal won't
save me from data loss. A rollback journal "might," if the entire change
to the file from fopen() to fclose() is exactly one transaction; otherwise, no.

As for a good fsck, call Sally. :)

Actually a fsck shouldn't be too hard; inode blocks and fragment blocks
will be self-identifying and indexing, so even if the bitmap is lost,
the whole thing can be rebuilt *if* you know the exact offset of the
data area, the FS version, and the block size.

|> 6) How do I lay out a time field for atime/dtime/ctime/mtime?
|>
| Any way you like, as long as you actually store the information
| in a retrieveable way.
|

What information is that?

|>
|> A few more specifics, as Block size grows, the file system does not lose
|> efficiency. Fragment Blocks should skillfully subdivide huge Blocks
|> with a third level index, increasing seek time O(+1) for the information
|> stored in the Fragment Block:
|>
|>
|> - -Inode
|> |-Meta-data
|> ||-Data Block
|> |||-Data (2 seeks)
|> ||-Fragment Block
|> |||-Fragment
|> ||||-Data (3 seeks)
|>
|>
|> Fragment Blocks can store anything but Inodes, so it would be advisable
|> to avoid huge Blocks; if a Block is 25% of the disk, for example, one
|> Block must be dedicated to serving up Inode information. Also, with
|> 64KiB blocks, a 256TiB file system can be created. Larger Blocks will
|> allow larger file systems; a larger file system will predictably house
|> more files (unless it houses one gargantuan relational data base-- the
|> only plausible situation where my design would require more, smaller
|> blocks to be more efficient).
|>
|> Directories will have to be some sort of on disk BsomethingTree. . . B*,
|> B+, something. I have no idea how to implement this :D I'll assume I
|
|
| Then you have a lot to learn about fs design - before you can
| design anything _sellable_. Take a look at the details of
| existing linux fs'es - and be happy that you can do that at all because
| they aren't patented or otherwise restricted.

I still think patents are a good thing, if the patent owners have half a
brain. For example, Judy Arrays ( http://judy.sf.net/ ) are patented by
HP; their design is LGPL though. "Patented" doesn't equate to "Out
of reach" by nature, just by controller.

|
|> treat the directory data as a logically contiguous file (yes I'm gonna
|> store directory data just like file data). I could use a string hash of
|> the Directory Entry name with disambiguation at the end, except my
|> options here seem limited:
|>
|>
|> - - Use a 32 bit string hash, 4 levels, 8 bits per level. 256 entries of
|> 32 bit {offset} for the next level per level, 1024B per level on unique
|> paths. Worst case 1 path per file, 65536 files in the worst case
|> measure 1 root (1KiB) + 256 L2 (256KiB) + 65536 L3 (64M) + 65536 L4
|> (64M), total 128MiB, collision rate is probably low. Ouch.
|>
|> - - Use a 16 bit string hash, 2 levels, 8 bits per level. 256 entries of
|> 32 bit {offset) for the next level per level, 1024B per level on unique
|> path. Worst case 1 path per file, 65536 files in the worst case measure
|> 1 root (1KiB) + 256 L2 (256KiB), no disambiguation, though collision
|> rate is probably higher than that. Beyond 65536 creates collisions.
|> Hitting a disambiguation situation is going to be fairly common.
|>
|>
|> I guess the second would be better? I can't locate any directories on
|> my drive with >2000 entries *shrug*. The end key is just the entry
|> {name,inode} pair.
|
|
| If you want this fs to be generically used, expect some people to show
| up with millions of files in a directory or tens of millions. They will
| not be amused if the lookup isn't O(1). . .
|
| There are still fs'es around that don't look up names in O(1)
| time, but some O(1) fs'es exist. So, to _sell_ you need to be among the
| better ones.
|
|> Any ideas?
|>
| Write a serious free fs, and get help here. Or go commercial - sell
| your fs and pay your developers and helpers.
|

Well, I'm not Hans Reiser, so I guess DARPA isn't going to give me $13M;
selling this thing (even if I can outrank everyone in the world by leaps
and bounds) isn't going to make me instantly rich. Even if it would,
nobody's gonna help me with it like that; and I don't have the skill to
do this myself.

Ah why not.

| Helge Hafting
|


2004-12-01 01:43:59

by John Richard Moser

Subject: Re: Designing Another File System




[email protected] wrote:
| On Mon, 29 Nov 2004 23:32:05 EST, John Richard Moser said:
|
| (Somebody else can address the performance issues - I'll stick to the
| parts I understand.. ;)
|
|
|>- - A design based around preventing information leaking by guaranteed
|>secure erasure of data (insofar that the journal even will finish wiping
|>out data when replaying a transaction)
|
|
| Consider carefully what threat model you have here - your choices will be
| constrained by it. If you don't know your threat model, it's time to look
| for other features to design around...
|

:)

| (your comment indicates a level of confusion regarding what paranoia is desired)
|
| Note well that you need to define "Guaranteed secure erasure" - at least for
| US DOD contractors, DOD 5220-22.M requires 3 over-writes (a character, its
| complement, and random) and verify). That's good for SECRET and below - but

I've read DOD 5220-22.M chapte. . . oh fuck it *cut* *paste*

Strips of up to 512 contiguous bytes may be written at a time during the
erasure, and may possibly consist of overwriting all affected locations
with a random pattern, binary zeros, binary ones, then a random
character and verifying, as derived by combining the methods given in
DOD 5220.22-M[6], Chapter 8, section 8-306, for sanitizing removable and
non-removable rigid disks using method (d) and EEPROMs (i.e. memory
cards for digital cameras and music players) (h).


^^^ From my original notes in the document I'm writing where I first
alluded to a new "secure" file system (which quickly became a much
higher goal).

| TOP SECRET and higher still require physical destruction of the media. Also,
| they punt on the issue of over-writing a sector that's been re-allocated by
| the hardware (apparently the chances of critical secret data being left in
| a reallocated block but still actually readable are "low enough" not to worry).
|

I can't make the FS burn the disk, sorry. You need to use kerosene.

| Also, 5220-22.M is more concerned with the exposure of "You surplus the machine
| and forgot to wipe the drives" - there's a whole *different* set of rules
| regarding how you keep an attacker who steals the system (physical access
| issues) or gets root access (this is a *tough* one) from scavenging the
| drives....
|

Hey, I'm designing for the general setting. Nobody said this was going
to be "We've taken care of everything, just rm -rf / and you're good to
go." There are certain guarantees that I can't make from the FILE SYSTEM;
they require outside forces. What I can guarantee is that you won't
crack open shopping_list.txt after a crash and see a part of the previously
erased secret_military_assault_on_iranian_nuclear_breeder_reactors.txt

|
|>2) Are there any other security concerns aside from information leaking
|>(deleted data being left over on disk, which may pop up in incompletely
|>written files)?
|
|
| Which of the following are you worried about:
|
| 1) On a filesystem that does metadata journalling but not data
| journalling, it's possible for a file to be extended by a write(), the
| metadata is journalled, but the system fails before the actual data
| makes it to disk. As a result, after the system comes up, stale data is
| present in the file, causing a small data exposure

Clipped that off already, that's what I was talking about in the
original message ("information leaking").

| and large reliability issues (opening a file with OpenOffice will
| almost certainly cause Very Bad Errors if it's a file that was in the
| process of being saved when the system went down, so you're actually
| reading blocks of somebody else's old Fortran source code...)

I can't help data loss; it's impossible. The data either gets to disk
or it doesn't. Unless one big write() was made between fopen() and
fclose(), you have a window of corrupted data if the machine goes down --
even with roll-forward and roll-back full data journaling.

| Note that this exposure does *NOT* need you to clear data out of the
| journal - merely to journal data (or find other ways to guarantee you
| never allocate a stale block of data). This is why I suggest that
| you're unclear regarding your threat model.
|

If I make numerous write() calls from an app, the fs will journal them
separately, or periodically, or run out of journal space (actually my
journal design would expand the journal until it ran out of FS space)
while trying to heuristically determine what constitutes one write and
prevent data destruction.

When you talk about data loss, remember that even full journaling relies
on data making it to disk. Stop trying to make it safe to flip the big
switch; you're not gonna do it.

| 2) If you're worried about a well-funded attacker managing to scavenge
| secure data out of the journal, what do you do about the attacker
| scavenging *the rest of the disk, including existing files*? (Hint -
| cryptoloop is the correct answer here, unless you think Jari Ruusu is
| right, in which case *his* encrypted filesystem stuff is the right
| answer).

Cryptoloop or crypto extensions are a nice answer, though the journal
needs to stay unencrypted. Cryptoloop + journal == bad, unless the
journal is on another device. :)

As for well-funded attackers, eh. The built-in secure erase is mainly
paranoia, although it makes `rm -rf /` THE utility to perform the above
erasure method on all file data. Encryption extensions are the only way
to ensure instant destruction (secure delete takes time) without a
degausser :) Of course, I could rm -f ~/.fs/my_cypher_key and a cypher
extension would be left keyless. Cypher *extension*, mind you; I'm not
designing encryption into the core of the FS, just leaving it as a
future possibility (via xattrs -- this is the same intention ext2
originally had, IIRC, regarding compression).

|
| Alternatively, you may wish to consider a filesystem that does crypto
| on a per-file basis - be prepared to deal with some very hairy
| key-management and information leakage problems if you pursue this
| route...
|

Yeah, I could embed the user's keys into the FS and use a utility to
erase them in an emergency; I could also use a special key loaded from a
floppy or USB stick and kept (frobnicated) in memory to encrypt "more
sensitive" files. Need to destroy the keys? blow_my_key or
blow_all_keys, and set fire to your USB stick.

| In any case, both cryptoloop and Jari's loop-AES address the "disk
| captured by the attacker" issues, but don't do much for "attacker gets
| root" - that's a whole DIFFERENT set of problems...
|

"Attacker gets root" is doable, but only with a multilayered
defense-in-depth matrix with host-based intrusion prevention systems.

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBrR/hhDd4aOud5P8RAoo+AJ9fdn18OaGn56I0LHxgA0LdDQiPHACfb/D5
V3MiVjkdazvy1x3ObdOzJ0w=
=Hh25
-----END PGP SIGNATURE-----

2004-12-01 01:46:54

by John Richard Moser

[permalink] [raw]
Subject: Re: Designing Another File System

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1



Phil Lougher wrote:
| On Mon, 29 Nov 2004 23:32:05 -0500, John Richard Moser
| <[email protected]> wrote:
|
|
|>- - localization of Inodes and related meta-data to prevent disk thrashing
|
|
| All filesystems place their filesystem metadata inside the inodes. If
| you mean file metadata then please be more precise. This isn't
| terribly new, recent posts have discussed how moving eas/acls inside
| the inode for ext3 has sped up performance.
|

I got the idea from FFS and its derivatives (UFS, UFS2, EXT2). I don't
want to store xattrs inside inodes though; I want them in the same block
as the inode. A few ms for the seek, but eh, the data's right there,
not on the other side of the disk.

|
|>- - a scheme which allows Inodes to be dynamically allocated and
|>deallocated out of order
|>
|
|
| Um, all filesystems do that, I think you're missing words to the
| effect "without any performance loss or block fragmentation" !

All filesystems allow you to create the FS with 1 inode total?


Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/hda1 7823616 272670 7550946 4% /

No, it looks like this one allocates many inodes and uses them as it
goes. Reiser has 0 inodes . . .

|
|
|>- - 64 bit indices indicating the exact physical location on disk of
|>Inodes, giving a O(1) seek to the Inode itself
|
|
|>1) Can Unix utilities in general deal with 64 bit Inodes? (Most
|>programs I assume won't care; ls -i and df -i might have trouble)
|>
|
|
| There seems to be some confusion here. The filesystem can use 64 bit
| inode numbers internally but hide these 64 bits and instead present
| munged 32 bit numbers to Linux.
|
| The 64 bit resolution is only necessary within the filesystem dentry
| lookup function to go from a directory name entry to the physical
| inode location on disk. The inode number can then be reduced to 32
| bits for 'presentation' to the VFS. AFAIK as all file access is
| through the dentry cache this is sufficient. The only problems are
| that VFS iget() needs to be replaced with a filesystem specific iget.
| A number of filesystems do this. Squashfs internally uses 37 bit
| inode numbers and presents them as 32 bit inode numbers in this way.
|

Ugly, but ok. What happens when I actually have >4G inodes though?
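The munging Phillip describes could look something like this sketch;
the XOR-fold and the `munge_ino` name are just one illustrative choice
of reduction, not what Squashfs actually does:

```c
#include <stdint.h>

/* Illustrative sketch: the filesystem keeps the full 64-bit on-disk
 * inode location internally, but presents a folded 32-bit inode
 * number to the VFS.  XOR-folding the halves is one simple scheme;
 * collisions are possible, which is exactly why the real lookup must
 * still go through the filesystem's own dentry/iget path. */
static uint32_t munge_ino(uint64_t disk_location)
{
    return (uint32_t)(disk_location ^ (disk_location >> 32));
}
```

The fold is only for presentation; name-to-location resolution still
uses the full 64 bits inside the filesystem.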

|
|>3) What basic information do I absolutely *need* in my super block?
|>4) What basic information do I absolutely *need* in my Inodes? (I'm
|>thinking {type,atime,dtime,ctime,mtime,posix_dac,meta_data_offset,size,\
|>links}
|
|
| Very much depends on your filesystem. Cramfs is a good example of the
| minimum you need to store to satisfy the Linux VFS. If you don't care
| what they are almost anything can be invented (uid, gid, mode, atime,
| dtime etc) and set to a useful default. The *absolute* minimum is
| probably type, file/dir size, and file/dir data location on disk.

I meant basic, not for me. Basic things a real Unix filesystem needs.
What *I* need comes from my head. :)

|
|
|>I guess the second would be better? I can't locate any directories on
|>my drive with >2000 entries *shrug*. The end key is just the entry
|>{name,inode} pair.
|
|
| I've had people trying to store 500,000 + files in a Squashfs
| directory. Needless to say with the original directory implementation
| this didn't work terribly well...
|

Ouch. Someone told me the directory had to be O(1) lookup . . . .

| Phillip Lougher
|

- --
All content of all messages exchanged herein are left in the
Public Domain, unless otherwise explicitly stated.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBrSGNhDd4aOud5P8RAlo0AJ4pxB/LMhgTvNW4GdMmaNA2/uM0wACfWR8+
kOxwHU3/mTUUNAAhda2rv+g=
=fsJV
-----END PGP SIGNATURE-----

2004-12-01 02:46:45

by Phillip Lougher

[permalink] [raw]
Subject: Re: Designing Another File System

On Tue, 30 Nov 2004 20:42:38 -0500, John Richard Moser
> Phil Lougher wrote:
> | Um, all filesystems do that, I think you're missing words to the
> | effect "without any performance loss or block fragmentation" !
>
> All filesystems allow you to create the FS with 1 inode total?
>
> Filesystem Inodes IUsed IFree IUse% Mounted on
> /dev/hda1 7823616 272670 7550946 4% /
>
> No, it looks like this one allocates many inodes and uses them as it
> goes. Reiser has 0 inodes . . .

Yes you're right. What I said was total rubbish. I read your
statement as meaning to dynamically allocate/deallocate inodes from a
set configured at filesystem creation...

> |
> | The 64 bit resolution is only necessary within the filesystem dentry
> | lookup function to go from a directory name entry to the physical
> | inode location on disk. The inode number can then be reduced to 32
> | bits for 'presentation' to the VFS. AFAIK as all file access is
> | through the dentry cache this is sufficient. The only problems are
> | that VFS iget() needs to be replaced with a filesystem specific iget.
> | A number of filesystems do this. Squashfs internally uses 37 bit
> | inode numbers and presents them as 32 bit inode numbers in this way.
> |
>
> Ugly, but ok. What happens when i actually have >4G inodes though?

Well this is an issue that affects all filesystems on 32-bit systems
(as Alan said inode numbers are 64 bits on 64 bit systems). To be
honest I've never let this worry me...

A 32-bit system can never cache 4G inodes in memory without falling
over, and so a simple solution would be to allocate the 32-bit inode
number dynamically (e.g. starting at one and going up, keeping track
of inode numbers still used for use when/if wraparound occurred), this
would guarantee inode number uniqueness for the subset of file inodes
in memory at any one time, with the drawback that inode numbers
allocated to files will change over filesystem mounts. Alternatively,
from reading fs/inode.c it appears that inode numbers only need to be
unique if the fast hashed inode cache lookup functions are used; there
are other inode cache lookup functions that can be used if inode
numbers are not unique.
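The first alternative might be sketched like this; the tiny fixed
table standing in for the kernel's inode cache, and the `alloc_ino` /
`ino_busy` names, are illustrative assumptions, not kernel API:

```c
#include <stdint.h>

/* Illustrative sketch: hand out 32-bit inode numbers starting at 1,
 * and on wraparound skip numbers still held by cached inodes.
 * A small fixed array stands in for the icache here. */
#define CACHED_MAX 8

static uint32_t in_use[CACHED_MAX];  /* 0 == free slot */
static uint32_t next_ino = 1;

static int ino_busy(uint32_t ino)
{
    for (int i = 0; i < CACHED_MAX; i++)
        if (in_use[i] == ino)
            return 1;
    return 0;
}

static uint32_t alloc_ino(void)
{
    uint32_t ino;

    do {
        ino = next_ino++;
        if (next_ino == 0)      /* wraparound: 0 is never a valid ino */
            next_ino = 1;
    } while (ino == 0 || ino_busy(ino));

    for (int i = 0; i < CACHED_MAX; i++)
        if (in_use[i] == 0) {   /* remember it while "cached" */
            in_use[i] = ino;
            break;
        }
    return ino;
}
```

This guarantees uniqueness only among inodes in memory at once, with
the drawback Phillip notes: numbers change across mounts.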

> | I've had people trying to store 500,000 + files in a Squashfs
> | directory. Needless to say with the original directory implementation
> | this didn't work terribly well...
> |
>
> Ouch. Someone told me the directory had to be O(1) lookup . . . .

Ideally yes, but ultimately with your filesystem you make the rules
:-) The Squashfs directory design was fast for the original expected
directory size (ideally <= 64K, maximum 512K) seen on embedded
systems. The next release of Squashfs has considerably improved
indexed directories which are O(1) for large directories. To be
released sometime soon, if anyone's interested...

Phillip Lougher

2004-12-01 04:32:33

by Bernd Eckenfels

[permalink] [raw]
Subject: Re: Designing Another File System

In article <[email protected]> you wrote:
> systems. The next release of Squashfs has considerably improved
> indexed directories which are O(1) for large directories.

Are you talking about time complexity of a named lookup over the
number of files in a directory? (You might also want to look at the
complexity relative to the name component length.) What data structure
that does not waste memory allows you those lookups? Even if you
consider hashing the name a constant O(1) operation (which it isn't),
you can still get collisions.
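The point can be made concrete with a toy chained hash directory:
hashing walks the name (so it is O(name length), not truly O(1)), and
colliding entries still force full string compares. The structure and
all names below are illustrative, not any real filesystem's layout:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative sketch: hashed directory lookup with collision chains. */
#define BUCKETS 64

struct dentry {
    const char *name;
    uint64_t ino;
    struct dentry *next;        /* collision chain */
};

static struct dentry *buckets[BUCKETS];

static unsigned hash_name(const char *name)
{
    unsigned h = 5381;          /* djb2: cost grows with name length */
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h % BUCKETS;
}

static struct dentry *dir_lookup(const char *name)
{
    for (struct dentry *d = buckets[hash_name(name)]; d; d = d->next)
        if (strcmp(d->name, name) == 0)  /* resolve collisions */
            return d;
    return 0;
}

static void dir_insert(struct dentry *d)
{
    unsigned b = hash_name(d->name);
    d->next = buckets[b];
    buckets[b] = d;
}
```

Average-case lookup is constant in the entry count for a well-sized
table, but the worst case (all entries in one chain) degrades to linear.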

Greetings
Bernd