2006-03-19 04:48:54

by Xin Zhao

[permalink] [raw]
Subject: Question regarding to store file system metadata in database

I was wondering why so few file systems use a database to store file
system metadata. Here, metadata primarily refers to directory entries.
For example, one could set up a database that stores each file's
pathname, its inode number, and some extended attributes, with the
pathname as the primary key. That gives us pathname-to-inode mapping as
well as other features such as fast search and extended file attribute
management. In contrast, storing file system entries in directory files
may result in slow dentry searches; I guess that's why ReiserFS and
some other file systems proposed B+-tree-like structures to manage file
entries. But why not simply use a database to provide the same feature?
Databases have been heavily optimized for fast search and should be
good at managing metadata.
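
To make the idea concrete, the table I have in mind would look roughly
like this (just a sketch of the schema; I use Python with sqlite3 only
to keep the example self-contained, and the table and column names are
made up):

import sqlite3

conn = sqlite3.connect("fsmeta.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dentries (
        path   TEXT PRIMARY KEY,   -- full pathname used as the key
        ino    INTEGER NOT NULL,   -- inode number the path resolves to
        xattrs TEXT                -- optional extended attributes (e.g. JSON)
    )
""")
conn.execute("INSERT OR REPLACE INTO dentries (path, ino, xattrs) VALUES (?, ?, ?)",
             ("/testdir/hello.txt", 12345, None))
conn.commit()

# pathname-to-inode lookup, i.e. what a lookup-heavy fs would do on open()
row = conn.execute("SELECT ino FROM dentries WHERE path = ?",
                   ("/testdir/hello.txt",)).fetchone()
print(row[0] if row else "not found")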

I guess one concern about this idea is the performance impact caused by
the database system. I ran a test on a MySQL database: I inserted about
1.2 million such records into an initially empty MySQL database. The
average insertion rate was about 300 entries per second, which I think
is fast enough to handle a normal file system load. I haven't tried the
query speed, but I believe it should be fast enough too (maybe I am
wrong; if so, please point that out).

Then I am a little curious why so few people use a database to store
file system metadata, although I know WinFS plans to use a database to
manage metadata. I guess one reason is that it is difficult for a
kernel-based file system driver to access a database, but this could be
addressed with an efficient kernel/user communication mechanism.
Another reason could be worry about the database system itself: if the
database system crashes, the file system stops functioning too.
However, the features needed by a file system are really a small part
of a database system; a reduced database system should be sufficient to
provide them and be stable enough to support a file system.

Can someone point out more issues that could become obstacles to using
a database to manage metadata for a file system?

Many thanks!
Xin


2006-03-19 05:04:15

by Mikado

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

Where would that database be located: on another filesystem, or on the
database-based filesystem itself?

Xin Zhao wrote:
> I was wondering why so few file systems use a database to store file
> system metadata. [...]
> Can someone point out more issues that could become obstacles to using
> a database to manage metadata for a file system?

2006-03-19 17:48:10

by Xin Zhao

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

Well, the database could reside on another file system, so the
database-based file system would be a secondary file system that
provides more features and better performance. I am not saying that the
database-based file system must be the only file system on the machine.

On 3/19/06, Mikado <[email protected]> wrote:
> Where would that database be located: on another filesystem, or on the
> database-based filesystem itself?
> [...]

2006-03-19 17:58:51

by Ming Zhang

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

A database can reside on a raw block device.

But 300 metadata IOPS is not that fast. ;)

ming

On Sun, 2006-03-19 at 12:48 -0500, Xin Zhao wrote:
> Well, the database could reside on another file system, so the
> database-based file system would be a secondary file system that
> provides more features and better performance. I am not saying that the
> database-based file system must be the only file system on the machine.
> [...]

2006-03-19 18:11:44

by Xin Zhao

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

Do you have any statistics on how many metadata accesses are required
for a heavily loaded file system? I don't have any in hand, but
intuitively I think 300 per second should be enough. If storing
metadata in a database does not hurt file system performance, and the
database allows flexible file searching, a database-based file system
might not be a bad idea. :)

Xin

On 3/19/06, Ming Zhang <[email protected]> wrote:
> A database can reside on a raw block device.
>
> But 300 metadata IOPS is not that fast. ;)
>
> ming
> [...]

2006-03-19 18:26:38

by Ming Zhang

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

No, I have no such statistics. Also, people always want it to be
faster, so it is never enough.

From another point of view, if such an fs is used by a mail server, a
large number of file create/close/modify operations will be vital for
it. 300/s is not enough for a busy mail server, of course.

A database-based file system would be useful for archiving. For heavy
online use? Not sure.

Also, wouldn't a database-based fs be too complex, when all the
benefits brought by a db can be provided by add-on utilities? Don't
find and grep fit your bill?

ming

On Sun, 2006-03-19 at 13:11 -0500, Xin Zhao wrote:
> Do you have any statistics on how many metadata accesses are required
> for a heavily loaded file system? I don't have any in hand, but
> intuitively I think 300 per second should be enough. If storing
> metadata in a database does not hurt file system performance, and the
> database allows flexible file searching, a database-based file system
> might not be a bad idea. :)
> [...]

2006-03-19 18:50:25

by Xin Zhao

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

I agree that people always want to access metadata faster. But if a
system seldom needs to do more than 300 pathname-to-inode translations
per second, then even if another system can do thousands of such
translations per second, the difference in overall file system
performance could be very small. Also, I have no real data on how often
a busy email server opens files, but I doubt it really needs to open
files 300 times per second.

Does anybody know how fast a file system can do pathname-to-inode
translation? I know the value can vary with the access pattern and the
size of the dentry cache, but an average figure should be sufficient
for our informal discussion.
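
Something like the following quick loop is what I have in mind for
getting a ballpark number on my own box (just an untested sketch; it
reuses the file list from my earlier test and mostly hits the dentry
cache):

import os, time

# paths collected earlier with: find /testdir -print > filelist
paths = [line.rstrip("\n") for line in open("filelist")]

start = time.time()
for p in paths:
    try:
        os.stat(p)              # forces a pathname-to-inode lookup
    except OSError:
        pass                    # file may have disappeared; ignore
elapsed = time.time() - start
print("%d lookups in %.2fs => %.0f lookups/sec"
      % (len(paths), elapsed, len(paths) / elapsed))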

Last, a database-based file system is not that complex. As a first
step, I am just proposing to store the pathname-to-inode-number mapping
in a database, so it is basically a single simple table. We don't
really need any fancy features provided by the db system; that's why I
said a reduced db system is enough. So the only difference between a
db-based file system and a regular one is that a regular file system
uses directory files to store entries, while a db-based file system
uses a database to achieve the same goal. A db looks like the more
efficient way. ;-)

Xin

On 3/19/06, Ming Zhang <[email protected]> wrote:
> No, I have no such statistics. Also, people always want it to be
> faster, so it is never enough.
>
> From another point of view, if such an fs is used by a mail server, a
> large number of file create/close/modify operations will be vital for
> it. 300/s is not enough for a busy mail server, of course.
>
> A database-based file system would be useful for archiving. For heavy
> online use? Not sure.
>
> Also, wouldn't a database-based fs be too complex, when all the
> benefits brought by a db can be provided by add-on utilities? Don't
> find and grep fit your bill?
> [...]

2006-03-19 19:47:27

by Al Viro

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Sun, Mar 19, 2006 at 01:50:22PM -0500, Xin Zhao wrote:
> Last, a database-based file system is not that complex. As a first
> step, I am just proposing to store the pathname-to-inode-number mapping
> in a database, so it is basically a single simple table. We don't
> really need any fancy features provided by the db system; that's why I
> said a reduced db system is enough. So the only difference between a
> db-based file system and a regular one is that a regular file system
> uses directory files to store entries, while a db-based file system
> uses a database to achieve the same goal. A db looks like the more
> efficient way. ;-)

Define "database". And explain how any of existing filesystems manages
to _not_ qualify your definition.

As for "more efficient"... 300 lookups per second is less than an
improvement. It's orders of magnitude worse than e.g. ext2; I don't
know in which world that would be considered more efficient, but I
certainly glad that I don't live there.

2006-03-19 21:24:25

by Ming Zhang

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Sun, 2006-03-19 at 15:45 -0500, Ian Young wrote:
> Indeed. I think the issue here is that someone doesn't grasp that the
> filesystem is already a database: that in fact paths and filenames are
> already metadata, with the inode as the pkey, and they are indexed by
> linked lists, b-trees, or hashtables, just as you might have going on
> inside the "black box" that is a database. The problem is that he
> thinks databases are somehow different because you use SQL to interact
> with them, when in fact the very same things that go on in filesystems
> are going on in databases; it's just that with a database you have
> strict pre-set agreements about what you'll be storing, so you can lay
> it out on-disk more efficiently. In a filesystem, there's no guarantee,
> or way to guarantee, "this group of things will always be written in
> 256-word chunks", so it seems like something you could optimise using
> "a database", but that only shows ignorance of what exactly databases
> do. And... I'm fairly sure someone could write a SQL-like 'find'
> implementation in userland, and then you could query your "database".

Agreed. All the fancy/complex algorithms and data structures like trees
are already used here, and since some file systems provide ACID as
well, a file system can be called a database, if someone really wants
to call it by that name. ;)


>
> A better suggestion would be along the lines of "why don't we
> standardize a common set of metadata types to expand on directory,
> filename, size, mtime, ctime, atime, owner, and group? Why not 'type'
> 'title' 'author' 'revision' 'comment' 'icon' 'readers' 'writers'
> 'executors' 'checksum' etc.. etc...?" Now, THAT's something I'd like
> to see.

That is why we have the VFS, right?


>
>
> Al Viro wrote:
> > Define "database". And explain how any of the existing filesystems
> > manages to _not_ qualify under your definition.
> > [...]

2006-03-19 21:34:13

by Ming Zhang

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Sun, 2006-03-19 at 13:50 -0500, Xin Zhao wrote:
> I agree that people always want to access metadata faster. But if a
> system seldom needs to do more than 300 pathname-to-inode translations
> per second, then even if another system can do thousands of such
> translations per second, the difference in overall file system
> performance could be very small. Also, I have no real data on how often
> a busy email server opens files, but I doubt it really needs to open
> files 300 times per second.

OK, so you want your httpd server to serve only 300 files per second?
When a single web page can contain 5-60 small images, you will go crazy
with that 300.


>
> Does anybody know how fast a file system can do pathname-to-inode
> translation? I know the value can vary with the access pattern and the
> size of the dentry cache, but an average figure should be sufficient
> for our informal discussion.
>
> Last, a database-based file system is not that complex. As a first
> step, I am just proposing to store the pathname-to-inode-number mapping
> in a database, so it is basically a single simple table. We don't
> really need any fancy features provided by the db system; that's why I
> said a reduced db system is enough. So the only difference between a
> db-based file system and a regular one is that a regular file system
> uses directory files to store entries, while a db-based file system
> uses a database to achieve the same goal. A db looks like the more
> efficient way. ;-)

First, others have already pointed out that there is already a mini-db
in there. Second, what is the point of using a regular db just to store
file names? Third, do you still think 300 is MORE efficient? Fourth, if
you think SQL is good here, then shall we use XML here too? ;)



>
> Xin
>
> On 3/19/06, Ming Zhang <[email protected]> wrote:
> > [...]

2006-03-19 22:59:46

by Alan

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Sad, 2006-03-18 at 23:48 -0500, Xin Zhao wrote:
> the database system. I ran a test on a MySQL database: I inserted about
> 1.2 million such records into an initially empty MySQL database. The
> average insertion rate was about 300 entries per second,

That's extremely slow for a file system.

> Then I am a little curious why so few people use a database to store
> file system metadata, although I know WinFS plans to use a database to
> manage metadata.

The one well-known example of a database as a file system (or it was
well known) was the Pick OS (now defunct, although the database system
lives on). They did manage to build an OS which had a database as a
file system.

The thing is, a database and a file system are the same thing anyway;
you'll find the same structures, like B-trees, used in some of each,
for example. They are just optimised for different kinds of queries. If
you want to know whether a db as an fs works, build a prototype and
see - you've already taken the first step, and with FUSE you can
prototype the rest in user space.
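
A skeleton for such a userspace prototype can be as small as this (a
rough sketch assuming the fusepy Python bindings; the in-memory dict
stands in for the proposed metadata database):

import errno, stat, sys, time
from fuse import FUSE, FuseOSError, Operations

class DbBackedFS(Operations):
    """Toy read-only fs whose path->inode table is a dict."""
    def __init__(self):
        now = time.time()
        self.table = {"/": 1, "/hello.txt": 2}      # pretend db rows
        self.attrs = {
            1: dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2, st_size=0,
                    st_ctime=now, st_mtime=now, st_atime=now),
            2: dict(st_mode=stat.S_IFREG | 0o644, st_nlink=1, st_size=0,
                    st_ctime=now, st_mtime=now, st_atime=now),
        }

    def getattr(self, path, fh=None):
        ino = self.table.get(path)                  # the "db" lookup
        if ino is None:
            raise FuseOSError(errno.ENOENT)
        return self.attrs[ino]

    def readdir(self, path, fh):
        return [".", "..", "hello.txt"] if path == "/" else [".", ".."]

if __name__ == "__main__":
    FUSE(DbBackedFS(), sys.argv[1], foreground=True)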

Alan

2006-03-19 23:44:06

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Sun, Mar 19, 2006 at 11:06:25PM +0000, Alan Cox wrote:
> The one well-known example of a database as a file system (or it was
> well known) was the Pick OS (now defunct, although the database system
> lives on). They did manage to build an OS which had a database as a
> file system.

The other well-known example would be OS/400 (or whatever IBM has
renamed it to by now).

2006-03-20 08:30:14

by Matti Aarnio

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Sun, Mar 19, 2006 at 01:11:41PM -0500, Xin Zhao wrote:
> Do you have any statistics on how many metadata accesses are required
> for a heavily loaded file system? I don't have any in hand, but
> intuitively I think 300 per second should be enough. If storing
> metadata in a database does not hurt file system performance, and the
> database allows flexible file searching, a database-based file system
> might not be a bad idea. :)
>
> Xin

Folks, first of all, DO NOT TOP POST!
You are attempting to violate causality. (At least it looks like that.)

One thing to realize about UNIX filesystems is that they are OPTIMIZED
for finding files (by a huge majority - 3 to 5 nines worth).
That means there are many cute caches to help lookups.

Cases where directory contents are modified (e.g. creat/rename/unlink)
are considered extremely rare, and there the only really interesting
thing is correctness.

In some very odd cases a lot more files are created/destroyed on a
system than is average - such applications include Squid and INN.
For INN this particular detail has been known to be a bottleneck, and
thus was born the Cyclic News Filesystem - a way to use pre-allocated
big files as storage space for news items.

> On 3/19/06, Ming Zhang <[email protected]> wrote:
> > A database can reside on a raw block device.
> >
> > But 300 metadata IOPS is not that fast. ;)
> >
> > ming

I should think that 300 IOPS is a lot for a single PC hard drive;
indeed, they usually can't go that fast.

Again, that is where all those marvellous caches come into play.

/Matti Aarnio

2006-03-20 13:10:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Sun, Mar 19, 2006 at 07:47:23PM +0000, Al Viro wrote:
> As for "more efficient"... 300 lookups per second is less than an
> improvement. It's orders of magnitude worse than e.g. ext2; I don't
> know in which world that would be considered more efficient, but I am
> certainly glad that I don't live there.

There are two problems... well, more, but in the performance domain,
at least two issues stick out like a sore thumb.

The first is throughput; as Al and others have already pointed out,
300 metadata operations per second is definitely nothing to write home
about. The second is latency: how much *time* does it take to perform
an individual operation, especially if you have to do an upcall from
the kernel to a userspace database process, the user space process
then has to dick around with its own general-purpose,
non-optimized-for-a-filesystem data structures, possibly make syscalls
back down into the kernel only to have the data blocks pushed back up
into userspace, and then finally return the result of the "stat"
system call back to the kernel so the kernel can ship it off to the
original process that called stat(2)?

Even WinFS, dropped from Microsoft Longwait, really wasn't using the
database to store all metadata. A better way of thinking about it is
as a forcible bundling of Microsoft's database product (European
regulators take note) with the OS; all of the low-level filesystem
operations are still being done the traditional way, and it's only
high-level indexing operations which are being done in userspace (and
only in userspace). It would be like taking the locate(1) userspace
program and claiming it was part of the filesystem; it's more about
packaging than anything else.

- Ted

2006-03-20 16:12:08

by Xin Zhao

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

OK. Sorry for causing so much confusion here. I have to clarify
several things before going further in the discussion.

First, my experiment that resulted in 300 insertions/sec was set up as
follows:
1. The testing code is written in Python.
2. I first create a file list using 'find /testdir -name "*" -print >
   filelist', and record the current time after the file list is created.
3. Then I start a loop that reads file pathnames line by line; for each
   line, I do a stat() to get the inode number, then create a record and
   insert it into the database.
4. After all records are inserted, I record the current time again and
   compute the elapsed time used to insert all the records (sketch below).
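
The core of the timed loop looks roughly like this (simplified; I use
sqlite3 here instead of the MySQL bindings just to keep the sketch
self-contained, and the table name is made up):

import os, sqlite3, time

conn = sqlite3.connect("fsmeta.db")
conn.execute("CREATE TABLE IF NOT EXISTS dentries (path TEXT PRIMARY KEY, ino INTEGER)")

paths = [line.rstrip("\n") for line in open("filelist")]

start = time.time()
for p in paths:
    try:
        ino = os.stat(p).st_ino     # note: stat() cost is counted in the timing too
    except OSError:
        continue
    conn.execute("INSERT OR REPLACE INTO dentries (path, ino) VALUES (?, ?)", (p, ino))
conn.commit()
elapsed = time.time() - start
print("%d inserts in %.1fs => %.0f inserts/sec"
      % (len(paths), elapsed, len(paths) / elapsed))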

From this setting, we can see the experiment is not very fair to the
database, because the time used to read the file list and do stat() is
also counted as database insertion time. As noted before, I ran that
experiment just to get some sense of how slow a database could be. If I
remove the file-read and stat() cost, I expect to see an improvement in
insertion speed; I will redo the experiment and report the result.
Still, 300/sec might be good enough to handle most scenarios. Yes, this
might not be good enough for a busy web server, though I still doubt a
web server needs to open that many files per second. Frequently
accessed files like small images are commonly cached rather than
requiring a file system access every time.

Second, I should give the background against which we are considering
the possibility of storing metadata in a database. We are currently
developing a file system that allows multiple virtual machines to share
a base software environment. With our current design, a new VM can be
deployed in several seconds by inheriting the file system of an
existing VM. If a VM is to modify a shared file, the file system does
copy-on-write to generate a private copy for this VM. Thus, there can
be multiple physical copies for one virtual pathname. Even more
complicated, a physical copy can be shared by an arbitrary subset of
VMs. Now let's consider how to support this using a regular file
system. You can treat VMs as clients or users of a standard Linux.
Consider the following scenario: VM2 inherits VM1's file system. The
physical copy for virtual file F is F.1. Then VM2 modifies file F and
gets its private copy F.2. Now VM3 inherits VM2's file system. The
inheritance graph is as follows:
VM1 --> VM2 --> VM3

Now VM3 wants to access virtual file F. It has to determine the right
physical copy. The right answer is F.2, but in the file system we have
both F.1 and F.2, so some mapping mechanism must be devised. No matter
how we manipulate the pathnames of the physical copies, several disk
accesses seem to be required for a mapping operation. That is the
reason we are considering a database to store metadata.
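
To illustrate the lookup we need (a toy sketch only; in the real system
the mapping would live in the metadata database): each VM records its
parent, and each copy-on-write event records the private physical copy
for that (VM, virtual path) pair. Resolving a path then walks up the
inheritance chain:

# parent of each VM in the inheritance chain (None = no parent)
parent = {"VM1": None, "VM2": "VM1", "VM3": "VM2"}

# private physical copies created by copy-on-write:
# (vm, virtual path) -> physical copy
private = {
    ("VM1", "/etc/F"): "F.1",   # VM1's original copy
    ("VM2", "/etc/F"): "F.2",   # VM2 modified F, so it got its own copy
}

def resolve(vm, vpath):
    """Return the physical copy that 'vm' should see for 'vpath'."""
    while vm is not None:
        if (vm, vpath) in private:
            return private[(vm, vpath)]
        vm = parent[vm]          # fall back to the VM we inherited from
    return None                  # no VM in the chain has this file

print(resolve("VM3", "/etc/F"))  # -> "F.2", VM2's copy, not VM1's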

We do know that many file systems already use db-like techniques to
index metadata, for example the B-tree used by ReiserFS and the HTree
used by ext3. But they do not provide the feature we need, which
exposes at least one fundamental limit: they do not support easy
extension of the metadata. So at least some extension must be made to
make the mapping efficient, and we thought: since they are already
using db-like techniques, why not simply use a DB? At the very least, a
DB makes it simple to extend a file's metadata. For example, in our
case we might also want to add a hash of the file content to a file's
metadata. This would allow us to merge several files with identical
contents into one to save disk space, which matters in our scenario
since we assume that many VMs use identical software environments.
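
The content-hash extension would then be just an extra column next to
the pathname-to-inode mapping, along these lines (sketch only; sqlite3
and the hypothetical 'dentries' table are reused from the earlier
examples, and the column name is made up):

import hashlib, sqlite3

def file_hash(path, blocksize=1 << 20):
    """SHA-1 of a file's contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            block = f.read(blocksize)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# Files whose content_sha1 values match are candidates for being merged
# back into a single physical copy shared by several VMs.
conn = sqlite3.connect("fsmeta.db")
conn.execute("CREATE TABLE IF NOT EXISTS dentries "
             "(path TEXT PRIMARY KEY, ino INTEGER, content_sha1 TEXT)")
conn.execute("UPDATE dentries SET content_sha1 = ? WHERE path = ?",
             (file_hash("/testdir/hello.txt"), "/testdir/hello.txt"))
conn.commit()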

Also, I am not proposing to use a db to store all metadata. As
mentioned before, currently I am just considering storing the
pathname-to-inode mapping; other attributes like atime and ctime are
stored the standard way. So this is essentially a layer above a
standard FS. Because only the open() syscall needs to access metadata
via communication across the kernel boundary, I am expecting a moderate
performance impact. But I am not sure about this. Does someone have any
experience with that?

Any further comments?

Xin


On 3/20/06, Theodore Ts'o <[email protected]> wrote:
> There are two problems... well, more, but in the performance domain,
> at least two issues stick out like a sore thumb.
> [...]

2006-03-20 19:36:58

by Xin Zhao

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

OK. Now I have more experimental results.

After excluding the cost of reading the file list and doing stat(),
the insertion rate becomes 587/sec instead of 300/sec. The query rate
is 2137/sec. I am running MySQL 4.1.11 on FC4, with a 2.8 GHz CPU and
1 GB of memory.

2137/sec seems good enough to handle pathname-to-inode resolution.
Does anyone have statistics on how many file opens per second a busy
file system sees?

Xin


On 3/20/06, Xin Zhao <[email protected]> wrote:
> OK. Sorry for causing so much confusion here. I have to clarify
> several things before go further on the discussion.
>
> First, my experiment that resulted in 300 insertions/sec was set up as follows:
> 1. the testing code is written in python
> 2. I first creates a file list using "find /testdir -name "*" -print
> > filelist", and record current time after the filelist is created.
> 3. Then, I started a loop to read file pathnames line by line, for
> each line, I do stat to get inode number, then I created a record and
> insert it into database
> 4. after all records are inserted, I recorded current time again and
> computed the elapsed time used to insert all records
>
> From this setting, we can see this experiment is not very fair for
> database, because the time used to read filelist and do stat() are
> also counted database insertion time. As noted before, I did that
> experiment just to get some sense how slow a database could be. If I
> remove the file read and stat() cost, I will expect to see an
> improvement of insertion speed. I will redo the experiment and report
> the result. Still, 300/sec might be good enough to handle most
> scenarios. Yes. this might not be good enough to handle a busy web
> server, while I still doubt a web server need to open so many files
> per second. The frequently accessed files like small images are
> commonly cached instead of requiring to access file system every time.
>
> Second, I might want to give the background on which we are
> considering the possibility of storing metadata in database. We are
> currently developing a file system that allows multiple virtual
> machines to share base software environment. With our current design,
> a new VM can be deployed in several seconds by inheriting the file
> system of an existing VM. If a VM is to modify a shared file, the file
> system will do copy-on-write to gernerate a private copy for this VM.
> Thus, there could be multiple physical copies for a virtual pathname.
> Even more complicated, a physical copy could be shared by arbitrary
> subset of VMs. Now let's consider how to support this using regular
> file system. You can treat VMs as clients or users of a standard
> linux. Consider the following scenario: VM2 inherit VM1's file
> system. The physical copy for virtual file F is F.1. Then, it modified
> file F and get its private copy F.2. Now VM3 inherit VM2's file
> system. The inherit graph is as follow:
> VM1-->VM2-->VM3
>
> Now VM3 wants to access virtual file F. It has to determine the right
> physical copy. The right answer is F.2. But in the file system, we
> have F.1 and F.2. So some mapping mechanism must be devised. No matter
> how we manipulate the pathname of physical copies, several disk
> accesses seem to be required for a mapping operation. That is the
> reason we are considering database to store metadata.
>
> We do know many file systems already use db like technique to index
> metadata. For example B tree used by ReiserFS and HTree used by Ext3.
> But they do not provide the feature we need. This at least exposes one
> fundamental limit: they do not support easy extension on metadata. So
> at least some extension must be made to make the mapping efficient. So
> we thought "since they are using db like technique, why not simply use
> DB? " At least a DB makes it simple to extend metadata of a file
> system. For example, in our case, we might also want to add hash value
> of file content into a file's metadata. This allows us to merge
> several files with identical contents into one for disk space saving,
> which is important in our scenario since we assume that many VMs uses
> identical software environment.
>
> Also, I am not proposing to use db to store all metadata. As mentioned
> before, currently I am just considering to store the pathname-inode
> mapping. Other attributes like atime, ctime are stored using standard
> way. So this is essentially a layer above standard FS. Because only
> open () syscall needs to access metadata with the communication across
> kernel boundary, I am expecting a moderate performance impact. But I
> am not sure about this. Someone has any experience on that?
>
> Any further comments?
>
> Xin
>
>
> On 3/20/06, Theodore Ts'o <[email protected]> wrote:
> > On Sun, Mar 19, 2006 at 07:47:23PM +0000, Al Viro wrote:
> > > As for "more efficient"... 300 lookups per second is less than an
> > > improvement. It's orders of magnitude worse than e.g. ext2; I don't
> > > know in which world that would be considered more efficient, but I
> > > certainly glad that I don't live there.
> >
> > There are two problems... well, more, but in the performance domain,
> > at least two issues that stick out like a sore thumb.
> >
> > The first is throughput, and as Al and others have already pointed out,
> > 300 metadata operations per second is definitely nothing to write home
> > about. The second is latency; how much *time* does it take to perform
> > an individual operation, especially if you have to do an upcall from
> > the kernel to a userspace database process, the userspace process
> > then has to dick around with its own general-purpose,
> > non-optimized-for-a-filesystem data structures, possibly make syscalls
> > back down into the kernel only to have the data blocks pushed back up
> > into userspace, and then finally return the result of the "stat"
> > system call back to the kernel so the kernel can ship it off to the
> > original process that called stat(2).
> >
> > Even WinFS, dropped from Microsoft Longwait, really wasn't using
> > the database to store all metadata. A better way of thinking about it
> > is as a forcible bundling of Microsoft's database product (European
> > regulators take note) with the OS; all of the low-level filesystem
> > operations are still being done the traditional way, and it's only the
> > high-level indexing operations which are being done in userspace (and
> > only in userspace). It would be like taking the locate(1)
> > userspace program and claiming it was part of the filesystem; it's
> > more about packaging than anything else.
> >
> > - Ted
> >
>

2006-03-20 19:59:07

by Al Viro

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Mon, Mar 20, 2006 at 02:36:51PM -0500, Xin Zhao wrote:
> OK. Now I have more experimental results.
>
> After excluding the cost of reading the file list and doing stat(), the
> insertion rate becomes 587/sec, instead of 300/sec. The query rate is
> 2137/sec. I am running MySQL 4.1.11 on FC4, with a 2.8 GHz CPU and 1 GB
> of memory.
>
> 2137/sec seems to be good enough to handle pathname-to-inode
> resolution. Does anyone have statistics on how many file opens occur in
> a busy file system?

This is still ridiculously slow. From cold cache (i.e. with a lot of IO),
cp -rl linux-2.6 foo1 takes 1.2s here. That's at least about 50000
operations. On a slower CPU, BTW, with half of the RAM you have.

Moreover,
al@duke:~/linux$ time mv foo1 foo2

real 0m0.002s
user 0m0.000s
sys 0m0.001s

Now, try _that_ on your setup. If you are using the entire pathname as the
key, you are FUBAR - updating the key in 20-odd thousand records is going
to _hurt_. If you are splitting the pathname into components, you've just
reproduced the fs directory structure and shown that your fs layout
is too fscking slow. Not to mention the fun with symlink implementation,
or handling mountpoints.

You are at least an order of magnitude off in performance (more, actually),
and I still don't see any reason for what you are trying to do.
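
To make the rename point concrete, here is a minimal sketch of what
"mv foo1 foo2" turns into when the full pathname is the primary key
(SQLite and this table layout are stand-ins for illustration only, not
anyone's actual design):

import sqlite3

# Hypothetical layout: one row per file, keyed by its full pathname.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE paths (path TEXT PRIMARY KEY, ino INTEGER)")
db.executemany("INSERT INTO paths VALUES (?, ?)",
               (("foo1/file-%d" % i, i) for i in range(20000)))

# mv foo1 foo2: a constant-time directory operation on a normal fs
# becomes a rewrite of the key in every record under the old prefix.
db.execute("UPDATE paths SET path = 'foo2/' || substr(path, length('foo1/') + 1) "
           "WHERE path LIKE 'foo1/%'")
db.commit()
print(db.execute("SELECT count(*) FROM paths WHERE path LIKE 'foo2/%'").fetchone())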

2006-03-20 21:08:32

by Matti Aarnio

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Mon, Mar 20, 2006 at 02:36:51PM -0500, Xin Zhao wrote:
>
> OK. Now I have more experimental results.
>
> After excluding the cost of reading the file list and doing stat(), the
> insertion rate becomes 587/sec, instead of 300/sec. The query rate is
> 2137/sec. I am running MySQL 4.1.11 on FC4, with a 2.8 GHz CPU and 1 GB
> of memory.
>
> 2137/sec seems to be good enough to handle pathname-to-inode
> resolution. Does anyone have statistics on how many file opens occur in
> a busy file system?
>
> Xin

What is wrong here, I think, is your preset assumption that
using a proper modern database will be faster. Yes, perhaps
it will, under some specific conditions.

As Gene Amdahl pointed out long ago, optimizing something
that forms 1% of the load will speed things up by at most that 1%.
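
As a back-of-the-envelope illustration of that bound (made-up numbers,
not measurements):

# Amdahl's law: accelerating a fraction p of the work by a factor s
# gives an overall speedup of 1 / ((1 - p) + p / s).
def overall_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# If lookups are 1% of the load, even an arbitrarily faster lookup
# path changes the total by about 1% at most.
print(overall_speedup(0.01, 10))    # ~1.009
print(overall_speedup(0.01, 1e9))   # ~1.0101, the 1% ceiling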

Could you instrument accounting of the directory-management primitive
operations? How many directory inserts/removes/lookups per
mounted filesystem (or for the entire system), including dcache
operations (they are already instrumented, I think), are used in
normal system operation?

If your system behaviour shows more than 1% of operations other than
lookups, try to find out _why_ that is.

So far Linux optimizes filesystem directory reads to the maximum.



Long ago I had a problem where I needed insertion into an
application-specific database from a data origination system -- I also
needed fast batch replication from one dataset copy to another. Doing
hash keying made inserts _slow_. Doing btree indexing and inserting in
key order made things fast. Not flushing the database at every insert
made it faster almost linearly in the flush interval. Not flushing the
database except at batch end made it maximally fast -- around 100 000
inserts per second, but it had to be pre-sorted data. (This was a
single SCSI disk back in 1996.) We had a requirement to do batch
inserts as fast as possible, similarly for the batch replication that
was used for maintenance, and a ten-thousand-fold speedup was well
worth the added complexities in the software.

That database also had about a "4 sigma" read-only property (nearly all
accesses were reads).
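
The flush-interval effect is easy to reproduce today; a minimal sketch,
with SQLite standing in for the original database (different engine,
disk and record format, so only the shape of the result carries over):

import sqlite3, time

def insert_rate(n, commit_every_insert):
    db = sqlite3.connect("flushbench.db")       # throwaway benchmark file
    db.execute("DROP TABLE IF EXISTS t")
    db.execute("CREATE TABLE t (k INTEGER PRIMARY KEY, v TEXT)")
    t0 = time.time()
    for i in range(n):                          # keys arrive pre-sorted: appends
        db.execute("INSERT INTO t VALUES (?, ?)", (i, "x"))
        if commit_every_insert:
            db.commit()                         # flush (and fsync) per insert
    db.commit()                                 # one flush at batch end
    db.close()
    return n / (time.time() - t0)

print("flush per insert   :", insert_rate(1000, True))
print("flush at batch end :", insert_rate(100000, False))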


/Matti Aarnio

2006-03-20 22:19:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Mon, Mar 20, 2006 at 10:13:43AM -0500, Xin Zhao wrote:
> Second, I might want to give the background against which we are
> considering the possibility of storing metadata in a database. We are
> currently developing a file system that allows multiple virtual
> machines to share a base software environment. With our current design,
> a new VM can be deployed in several seconds by inheriting the file
> system of an existing VM. If a VM is to modify a shared file, the file
> system will do copy-on-write to generate a private copy for this VM.
> Thus, there could be multiple physical copies for a virtual pathname.
> Even more complicated, a physical copy could be shared by an arbitrary
> subset of VMs. Now let's consider how to support this using a regular
> file system. You can treat VMs as clients or users of a standard
> Linux. Consider the following scenario: VM2 inherits VM1's file
> system. The physical copy for virtual file F is F.1. Then VM2 modifies
> file F and gets its private copy F.2. Now VM3 inherits VM2's file
> system. The inheritance graph is as follows:
> VM1-->VM2-->VM3

Why not leverage devicemapper, and implement multiple hierarchical
copy-on-write snapshots at the block device level? It would be much
easier....

> We do know many file systems already use DB-like techniques to index
> metadata, for example the B-tree used by ReiserFS and the HTree used by
> ext3. But they do not provide the feature we need. This at least exposes
> one fundamental limit: they do not support easy extension of metadata. So
> at least some extension must be made to make the mapping efficient, and
> we thought "since they are using DB-like techniques, why not simply use a
> DB?" At least a DB makes it simple to extend the metadata of a file
> system. For example, in our case, we might also want to add the hash
> value of the file content to a file's metadata. This would allow us to
> merge several files with identical contents into one to save disk space,
> which is important in our scenario since we assume that many VMs use
> identical software environments.

Why not use a DB? Because most databases are big and bloated and
not something you want to have in the kernel (not even Hans Reiser was
crazy enough to propose stuffing an SQL interpreter into the kernel :-)
--- and if you put the generic database (complete with SQL interpreter
and all the rest) in userspace, doing upcalls into userspace, and then
having to have the database interpret the SQL query, etc., takes time.

If you don't care about performance, by all means, try using FUSE and
implementing a user-space filesystem. It will be slow as all get-out,
but maybe it won't matter for your application.
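
Even before any FUSE or upcall machinery is involved, the extra layers
show up; a crude, in-process illustration (SQLite standing in for the
userspace database, hot-cache lstat() for the normal lookup path, and
/usr/share merely a convenient sample tree):

import itertools, os, sqlite3, time

# Sample a few thousand existing paths to look up.
paths = list(itertools.islice(
    (os.path.join(d, f) for d, _, fs in os.walk("/usr/share") for f in fs),
    5000))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE m (path TEXT PRIMARY KEY, ino INTEGER)")
db.executemany("INSERT INTO m VALUES (?, ?)",
               ((p, os.lstat(p).st_ino) for p in paths))

t0 = time.time()
for p in paths:
    os.lstat(p)                                          # kernel lookup, hot cache
t1 = time.time()
for p in paths:
    db.execute("SELECT ino FROM m WHERE path = ?", (p,)).fetchone()
t2 = time.time()

print("lstat()   : %8.0f lookups/sec" % (len(paths) / (t1 - t0)))
print("SQL query : %8.0f lookups/sec" % (len(paths) / (t2 - t1)))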

> Also, I am not proposing to use a DB to store all metadata. As mentioned
> before, currently I am just considering storing the pathname-to-inode
> mapping. Other attributes like atime and ctime are stored in the standard
> way. So this is essentially a layer above a standard FS. Because only the
> open() syscall needs to access metadata via communication across the
> kernel boundary, I am expecting a moderate performance impact. But I
> am not sure about this. Does anyone have experience with that?

That won't be just open(), but stat(), readdir(), unlink(), rename(),
etc. It's all going to depend on your workload and how much
filesystem access it requires. It certainly won't be a general-purpose
solution, and for some workloads it will be disastrously slow.
But hey, if you don't believe me, go ahead try implementing it.....

- Ted

2006-03-20 22:29:23

by Erez Zadok

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

You may be interested in a project we have which ported a subset of the
Berkeley Database (BDB) to the Linux kernel. BDB is much smaller than other
SQL-based DBMSs, and is thus a lot more suitable for embedding into kernels.
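
For a feel of the difference in weight, a BDB-style store is just a
key/value mapping with no query language; a tiny userspace illustration
using Python's dbm module (a stand-in, not the in-kernel port described
below):

import dbm

# A BDB-style key/value store: pathname -> inode number, no SQL layer.
with dbm.open("pathmap", "c") as db:
    db["/usr/bin/ls"] = "123456"          # made-up inode number
    print(int(db["/usr/bin/ls"]))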

We called our port KBDB. We built a simple transactional f/s on top of it,
called kbdbfs. We also had a short project called EAFS, which is a
stackable f/s that adds EA/ACL support to any f/s it's layered on top of.

Recently we also implemented a full-blown transactional f/s using ptrace
methods. You can find links to these projects, and in some cases papers and
software:

http://www.fsl.cs.sunysb.edu/project-kbdb.html
http://www.fsl.cs.sunysb.edu/project-amino.html
http://www.fsl.cs.sunysb.edu/project-goanna.html
http://www.fsl.cs.sunysb.edu/project-extattrfs.html

Erez.

2006-03-20 22:53:13

by Xin Zhao

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

Apparently this comparison is not 100% fair. In my experiment, I
randomly picked pathnames from 1.2 million path names to resolve the
inode numbers. But in your "cp -rl linux-2.6 foo1" experiment, you
essentially did directory entry lookups sequentially, which maximizes
the possible performance. If you do the same thing in a random
fashion, you will probably get much worse performance. As I said
before, I totally agree that 2000/sec is slow. But the point here is
whether 2000/sec is enough for most scenarios.

I am not saying the existing FS implementation is not efficient; I agree
that the file system has been fully optimized. What I want to say is
that, to support the complex mapping in the system I described before,
we might need some extension to existing file systems. The question is
what the best extension is. Consider how to allow users a and b to share
physical copy f.1, while allowing user c to use private copy f.2. The
mapping from virtual pathname to physical pathname should be transparent
to end users. That is, all the users should be able to access the right
file copies using the virtual path "f". The file system should be able
to tell the different identities apart and return the data from the
right physical copy. That's what we want to do, but it is hard to
achieve without some extension. :)
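
Concretely, the mapping we want behaves roughly like this toy sketch
(the dictionaries, VM names, and physical-copy names are made up for
illustration; the real thing would have to live below the VFS and be
persistent):

# Each VM inherits from at most one parent; a write gives the writing VM
# its own physical copy, and reads resolve along the inheritance chain.
parent = {"VM2": "VM1", "VM3": "VM2"}            # VM1 --> VM2 --> VM3
owned = {("VM1", "f"): "f.1"}                    # (VM, virtual path) -> physical copy

def resolve(vm, vpath):
    while vm is not None:
        phys = owned.get((vm, vpath))
        if phys is not None:
            return phys
        vm = parent.get(vm)                      # fall back to an ancestor's copy
    raise FileNotFoundError(vpath)

def cow_write(vm, vpath):
    if (vm, vpath) not in owned:                 # copy-on-write: private copy for this VM
        owned[(vm, vpath)] = "%s.%s" % (vpath, vm)
    return owned[(vm, vpath)]

print(resolve("VM3", "f"))   # f.1   (inherited all the way from VM1)
cow_write("VM2", "f")        # VM2 modifies f and gets its own copy
print(resolve("VM3", "f"))   # f.VM2 (VM3 inherits VM2's private copy)
print(resolve("VM1", "f"))   # f.1   (VM1 still sees the shared copy)

The open question is where such a table should live and how cheap its
lookups can be made.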

Xin

On 3/20/06, Al Viro <[email protected]> wrote:
> On Mon, Mar 20, 2006 at 02:36:51PM -0500, Xin Zhao wrote:
> > OK. Now I have more experimental results.
> >
> > After excluding the cost of reading the file list and doing stat(), the
> > insertion rate becomes 587/sec, instead of 300/sec. The query rate is
> > 2137/sec. I am running MySQL 4.1.11 on FC4, with a 2.8 GHz CPU and 1 GB
> > of memory.
> >
> > 2137/sec seems to be good enough to handle pathname-to-inode
> > resolution. Does anyone have statistics on how many file opens occur in
> > a busy file system?
>
> This is still ridiculously slow. From cold cache (i.e. with a lot of IO),
> cp -rl linux-2.6 foo1 takes 1.2s here. That's at least about 50000
> operations. On a slower CPU, BTW, with half of the RAM you have.
>
> Moreover,
> al@duke:~/linux$ time mv foo1 foo2
>
> real 0m0.002s
> user 0m0.000s
> sys 0m0.001s
>
> Now, try _that_ on your setup. If you are using the entire pathname as
> the key, you are FUBAR - updating the key in 20-odd thousand records is
> going to _hurt_. If you are splitting the pathname into components,
> you've just reproduced the fs directory structure and shown that your
> fs layout is too fscking slow. Not to mention the fun with symlink
> implementation, or handling mountpoints.
>
> You are at least an order of magnitude off in performance (more,
> actually), and I still don't see any reason for what you are trying to do.
>

2006-03-20 23:32:55

by Al Viro

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

On Mon, Mar 20, 2006 at 05:53:08PM -0500, Xin Zhao wrote:
> Apparently this comparison is not 100% fair. In my experiment, I
> randomly picked pathnames from 1.2 million path names to resolve the
> inode numbers. But in your "cp -rl linux-2.6 foo1" experiment, you
> essentially did directory entry lookups sequentially, which maximizes
> the possible performance. If you do the same thing in a random
> fashion, you will probably get much worse performance. As I said
> before, I totally agree that 2000/sec is slow. But the point here is
> whether 2000/sec is enough for most scenarios.

It is not; e.g. unpacking a tarball or running make(1) on even a
medium-sized tree is going to hurt that way. Moreover, the same
goes for a lot of scripts, etc.

And that's aside from the question of the CPU load you are inflicting -
it's not just 2000/sec, it's 2000/sec _and_ _nothing_ _else_ _gets_
_done_.

BTW, for real lookup speed, try find(1). From hot cache.

> I am not saying the existing FS implementation is not efficient; I agree
> that the file system has been fully optimized. What I want to say is
> that, to support the complex mapping in the system I described before,
> we might need some extension to existing file systems. The question is
> what the best extension is. Consider how to allow users a and b to share
> physical copy f.1, while allowing user c to use private copy f.2. The
> mapping from virtual pathname to physical pathname should be transparent
> to end users. That is, all the users should be able to access the right
> file copies using the virtual path "f". The file system should be able
> to tell the different identities apart and return the data from the
> right physical copy. That's what we want to do, but it is hard to
> achieve without some extension. :)

So what happens upon rename()?

2006-03-21 06:52:07

by Miklos Szeredi

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

> Why not use a DB? Because most databases are big and bloated and
> not something you want to have in the kernel (not even Hans Reiser was
> crazy enough to propose stuffing an SQL interpreter into the kernel :-)
> --- and if you put the generic database (complete with SQL interpreter
> and all the rest) in userspace, doing upcalls into userspace, and then
> having to have the database interpret the SQL query, etc., takes time.
>
> If you don't care about performance, by all means, try using FUSE and
> implementing a user-space filesystem. It will be slow as all get-out,
> but maybe it won't matter for your application.

Something like this has already been done:

http://www.noofs.org/

Miklos

2006-03-21 20:10:28

by Pavel Machek

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

Hi!

> Second, I might want to give the background against which we are
> considering the possibility of storing metadata in a database. We are
> currently developing a file system that allows multiple virtual
> machines to share a base software environment. With our current design,
> a new VM can be deployed in several seconds by inheriting the file
> system of an existing VM. If a VM is to modify a shared file, the file
> system will do copy-on-write to generate a private copy for this VM.
> Thus, there could be multiple physical copies for a virtual pathname.
> Even more complicated, a physical copy could be shared by an arbitrary
> subset of VMs. Now let's consider how to support this using a regular
> file system. You can treat VMs as clients or users of a standard
> Linux. Consider the following scenario: VM2 inherits VM1's file
> system. The physical copy for virtual file F is F.1. Then VM2 modifies
> file F and gets its private copy F.2. Now VM3 inherits VM2's file
> system. The inheritance graph is as follows:
> VM1-->VM2-->VM3
>
> Now VM3 wants to access virtual file F. It has to determine the right
> physical copy. The right answer is F.2. But in the file system, we
> have F.1 and F.2. So some mapping mechanism must be devised. No matter
> how we manipulate the pathnames of physical copies, several disk
> accesses seem to be required for a mapping operation. That is the
> reason we are considering a database to store metadata.

Hardlinks? ext3cow (google it)?
Pavel

--
Picture of sleeping (Linux) penguin wanted...

2006-03-22 15:21:48

by Xin Zhao

[permalink] [raw]
Subject: Re: Question regarding to store file system metadata in database

Many thanks for so many helpful comments!

First of all, the reason I kept arguing is to ask for more reasoned
arguments about this idea. This idea may not be good enough, but it is
worth thinking about, I think. Actually, even I am still hesitant to use
a database to store the pathname-inode mapping. I just wanted to get more
ideas or solid data from you guys to convince myself that a database is
not a good idea. I don't really want to spend several months and end
up with a slow file system. ;-) Now I think I am pretty much
convinced. :)

Al raised a good question on how our system handles rename. Our
current approach is very simple. We first determine whether the
requesting VM owns the physical copy corresponding to virtual file
FOO. If it does, we just go ahead and rename this physical copy as a
regular file system would; otherwise, we have to create a "special"
symbolic link with the modified name, say "BAR", as the private copy of
this VM. The link "BAR" points to the original physical copy. In the
future, when this VM accesses file BAR, it will get a symbolic link
which directs this VM to the right data blocks. The reason we say
"special symbolic link" is that we don't regard this symbolic link as
a real symbolic link; it is just used to point to the physical copy. A
special flag in the inode will indicate that this symbolic link is
special. Thus, to guest VMs, BAR still looks like a normal file
instead of a symbolic link. This helps avoid confusion.
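
In rough pseudo-Python, the rename bookkeeping looks like this (a toy
sketch; the table names and the notion of a flagged "special link" are
ours, not an existing on-disk format):

# Per-VM ownership table: (VM, virtual name) -> physical copy it owns.
owned = {("VM1", "FOO"): "FOO.1"}
# Private "special links": (VM, virtual name) -> shared physical copy.
special_link = {}

def vm_rename(vm, old_name, new_name, phys):
    """phys is the physical copy that old_name currently resolves to for vm."""
    if owned.get((vm, old_name)) == phys:
        # The VM owns the copy: do an ordinary rename of the physical file.
        del owned[(vm, old_name)]
        owned[(vm, new_name)] = phys
    else:
        # Shared copy: leave it alone and give this VM a private pointer
        # under the new name, flagged so it is shown as a regular file.
        special_link[(vm, new_name)] = phys

vm_rename("VM1", "FOO", "BAR", "FOO.1")   # owner: real rename
vm_rename("VM2", "FOO", "BAR", "FOO.1")   # non-owner: special link only
print(owned)          # {('VM1', 'BAR'): 'FOO.1'}
print(special_link)   # {('VM2', 'BAR'): 'FOO.1'}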

Ted and Pavel pointed out two alternatives that can potentially achieve
the same goal. Both devicemapper and ext3cow allow users to build
snapshots of a file system. We can then start from the snapshot and
leverage the COW technique to allow users to modify certain files. I did
know these two systems before. However, these two systems only work
when one copy (version) is needed at a time. For example, ext3cow can
keep multiple versions of a file. One version is the latest version
while the rest are kept with an epoch number. When one needs to access a
past version, he or she needs to specify which version is desired.
This solution does not directly apply to our scenario. In our
scenario, multiple VMs may need to access a virtual file FOO. As
different VMs are allowed to own their private copies, multiple
physical copies of FOO may exist simultaneously. A better way to map a
virtual pathname to the right physical copy for a VM is needed.

Now let's talk about devicemapper; actually, several other block-level
versioning systems like Venti also use a similar solution. They build a
read-only snapshot and do COW on that snapshot when some files need to
be changed. VMWare already uses a similar approach to support fast
cloning and data sharing of VMs. There are several problems with this
solution. First, if multiple VMs share one block device and do
copy-on-write on this device, after several updates in those VMs we
will end up with a split storage system. To evaluate the extent, we
ran VMWare and installed a standard distribution of Windows XP on a
virtual machine, took a snapshot, installed all of the recommended
updates, and then took another snapshot to see the amount of data that
had been modified. After installing all the patches, we found that
VMWare had allocated 3.5 GB of disk space in addition to the data that
it read from the parent snapshot. If one hundred virtual machines were
cloned after running all the updates, the extra 3.5 GB would be a
one-time cost. If they were cloned before, however, then one hundred
times as much space, 350 GB, would be required to store the same data.

Another problem with block-level sharing is the difficulty of garbage
collection. Suppose block A is used by file FOO and shared by VM1 and
VM2. When VM1 and VM2 want to delete file FOO, they will simply modify
their block bitmaps (assuming they use ext3) to release block A. However,
we have no way to connect those block bitmap updates with the release of
the shared block A, so block A will never be reused even when it is not
used by any VM.
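
What block-level sharing lacks is a place to keep a reference count on
the shared copy; where the sharing is visible, the bookkeeping is simple
(a toy sketch, names made up):

# Reference counting over shared physical copies/blocks: each VM that can
# reach copy "A" holds one reference; space is reclaimed only at zero.
refcount = {"A": 2}                       # A is shared by VM1 and VM2

def vm_delete(vm, phys):
    refcount[phys] -= 1
    if refcount[phys] == 0:
        del refcount[phys]
        print("free", phys)               # only now can the space be released
    else:
        print(vm, "dropped a reference;", refcount[phys], "remaining")

vm_delete("VM1", "A")   # VM1 deletes FOO: A is still referenced by VM2
vm_delete("VM2", "A")   # VM2 deletes FOO: A can finally be freed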

Hardlinks are of course another potential approach, but they also need an
extension to map (virtual path, VMID) to (physical path). Using some
special naming mechanism is not good enough, as several VMs could share
the same physical copy. One possibility is to keep a private directory
tree for each VM in which all files are hardlinks. However, if VM2
inherits VM1's file system, we would have to create hardlinks for all
files in VM1's file system (as sketched below). This could be a problem.
Moreover, creating a private dir tree for every VM will increase
dentry space significantly and reduce the effectiveness of the dcache; I
don't know to what extent, though. Lastly, hardlinks have some
restrictions. For example, a hardlink must reside on the same device as
the target file, and one cannot create hardlinks for directories in many
OSes. These restrictions will be another problem.
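
For reference, the per-VM hardlink tree mentioned above would be
materialized roughly like this (a sketch; the /vmroots paths are
hypothetical, and the point is the one link() call per file at inherit
time):

import os

def inherit_tree(src_root, dst_root):
    """Clone src_root into dst_root: directories are recreated (they
    cannot be hardlinked), regular files become hardlinks."""
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel), exist_ok=True)
        for name in filenames:
            os.link(os.path.join(dirpath, name),          # must be the same device
                    os.path.join(dst_root, rel, name))

# inherit_tree("/vmroots/VM1", "/vmroots/VM2")   # hypothetical paths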

Xin

On 3/21/06, Pavel Machek <[email protected]> wrote:
> Hi!
>
> > Second, I might want to give the background against which we are
> > considering the possibility of storing metadata in a database. We are
> > currently developing a file system that allows multiple virtual
> > machines to share a base software environment. With our current design,
> > a new VM can be deployed in several seconds by inheriting the file
> > system of an existing VM. If a VM is to modify a shared file, the file
> > system will do copy-on-write to generate a private copy for this VM.
> > Thus, there could be multiple physical copies for a virtual pathname.
> > Even more complicated, a physical copy could be shared by an arbitrary
> > subset of VMs. Now let's consider how to support this using a regular
> > file system. You can treat VMs as clients or users of a standard
> > Linux. Consider the following scenario: VM2 inherits VM1's file
> > system. The physical copy for virtual file F is F.1. Then VM2 modifies
> > file F and gets its private copy F.2. Now VM3 inherits VM2's file
> > system. The inheritance graph is as follows:
> > VM1-->VM2-->VM3
> >
> > Now VM3 wants to access virtual file F. It has to determine the right
> > physical copy. The right answer is F.2. But in the file system, we
> > have F.1 and F.2. So some mapping mechanism must be devised. No matter
> > how we manipulate the pathnames of physical copies, several disk
> > accesses seem to be required for a mapping operation. That is the
> > reason we are considering a database to store metadata.
>
> Hardlinks? ext3cow (google it)?
> Pavel
>
> --
> Picture of sleeping (Linux) penguin wanted...
>