I see many people interested in designing new filesystems for
different purposes, and one of the common tasks every filesystem
designer faces is managing device blocks.
I just thought of starting a new project that aims to create an
advanced, scalable, high-performance block-level storage management
layer for the Linux kernel.
This layer should provide low-level storage services to simplify the
development of filesystems, DBMSs, LDAP servers, or any other
application that requires high-performance storage.
The planned features are:
1- Very fast block allocation ( using balanced trees for tracking free
blocks comes to mind, but I still think it is too early to decide on
the design; a rough sketch of the interface I imagine follows this
list ).
2- Support for multi-disk/multi-host storage pool.
3- Meta data storage and block storage can be isolated for better
performance.
4- Meta data and block replication options.
5- Transactional options for journaling filesystems or transactional
databases.
6- Supports clustering through lock managers, where multiple hosts can
read/write to the same storage devices concurrently ( suitable for SANs ).
7- Transparent recovery from corruption or hardware failure.
8- Direct access from userland ( for DBMS, LDAP, and other userland
applications ).
9- Plugin support ( like that of reiserfs 4 ).
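To make feature 1 a little more concrete, here is a very rough sketch
of the kind of allocation interface I imagine the layer exposing
( every name below is made up for illustration; nothing about the
design is settled ):

#include <stdint.h>

typedef uint64_t blk_t;

struct blk_pool;                        /* opaque handle for a storage pool */

struct blk_extent {
        blk_t start;                    /* first block of the extent */
        blk_t count;                    /* number of contiguous blocks */
};

/*
 * Try to allocate up to 'goal' contiguous blocks near 'hint'.
 * The extent actually obtained is returned in *out; a negative
 * return value indicates an error.
 */
int blk_alloc_extent(struct blk_pool *pool, blk_t hint, blk_t goal,
                     struct blk_extent *out);

/* Return an extent to the pool's free-space structures. */
int blk_free_extent(struct blk_pool *pool, const struct blk_extent *ext);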
If you know of any similar effort, or of any technical obstacle I am
missing, please let me know.
Ramy
Ramy M. Hassan wrote:
> 1- Very fast block allocation ( using balanced trees for tracking free
> blocks comes to mind, but I still think it is too early to decide on
> the design ).
You most likely want to use extents in addition to whatever else you use
(i.e., trees, etc.).
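Something along these lines, say -- just a sketch of what a free-extent
node in such a tree might look like, not what any existing filesystem
actually does:

#include <stdint.h>

typedef uint64_t blk_t;

/* A run of free blocks, kept in a balanced tree keyed by 'start'
 * (an rbtree in the kernel; bare pointers here to keep the sketch short). */
struct free_extent {
        blk_t start;                    /* first free block */
        blk_t len;                      /* number of contiguous free blocks */
        struct free_extent *left, *right;
};

/* Merge adjacent extents on free so the tree stays small when
 * free space is mostly contiguous. */
static int extents_mergeable(const struct free_extent *a,
                             const struct free_extent *b)
{
        return a->start + a->len == b->start;
}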
> 2- Support for multi-disk/multi-host storage pool.
You're mixing layers here. MD and DM already work in this area.
> 3- Meta data storage and block storage can be isolated for better
> performance.
There is support for a journal on a different device in the generic JBD
code that ext3 uses, and in reiserfs (possibly others as well). That may
be a good place to work from.
Are there any examples of this in other OSes? If you put the meta-data
on a separate drive, it would be an inherently seeky load. How does
this compare to putting raid below the mixed data and meta-data block
device?
> 4- Meta data and block replication options.
Coda and Intermezzo do this in a filesystem independent way already.
This can add flexibility.
> 5- Transactional options for journaling filesystems or transactional
> databases.
Isn't journaling inherently transaction based already?
> 6- Supports clustering through lock managers, where multiple hosts can
> read/write to the same storage devices concurrently ( suitable for SANs ).
This is going to be a very heavy layer, and few people will use it if it
isn't very light (or can be configured that way).
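Even a minimal interface for that ends up looking something like the
sketch below ( entirely hypothetical names, not OpenDLM's or anyone
else's real API ), and everything behind it -- lock mastering,
recovery, fencing -- is where the weight comes from:

#include <stddef.h>

/* The classic DLM lock modes. */
enum blk_lock_mode { BLK_LOCK_NL, BLK_LOCK_CR, BLK_LOCK_CW,
                     BLK_LOCK_PR, BLK_LOCK_PW, BLK_LOCK_EX };

struct blk_lockspace;                   /* one per shared storage pool */

/*
 * Acquire 'mode' on the resource named by 'resource'/'len'.
 * 'notify' is called when another node wants a conflicting lock,
 * so the holder can flush its caches and drop the lock.
 */
int blk_lock(struct blk_lockspace *ls, const void *resource, size_t len,
             enum blk_lock_mode mode,
             void (*notify)(void *arg), void *arg);

int blk_unlock(struct blk_lockspace *ls, const void *resource, size_t len);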
> 7- Transparent recovery from corruption or hardware failure.
Journaling in ext3 is block based, and in the rest it is "virtual"
(descriptions of the actions are stored in the journal, not the entire
block of meta-data -- at least when you're not running in data
journaling mode).
How do you plan on integrating your proposed layer with these two very
different approaches to filesystem journaling?
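To illustrate the difference, the two styles end up logging records
that look roughly like this ( made-up layouts, not the real on-disk
formats of JBD or any other filesystem ):

#include <stdint.h>

/* Block (physical) journaling: the journal carries complete copies
 * of the metadata blocks that were modified. */
struct phys_journal_record {
        uint64_t blocknr;               /* home location of the block */
        uint8_t  data[4096];            /* full image of the new contents */
};

/* "Virtual" (logical) journaling: the journal describes the operation
 * and replay re-applies it. */
struct logical_journal_record {
        uint32_t op;                    /* e.g. add-directory-entry */
        uint64_t dir_ino;               /* directory being modified */
        uint64_t child_ino;             /* inode being linked */
        char     name[256];             /* entry name */
};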
> 8- Direct access from userland ( for DBMS, LDAP, and other userland
> applications ).
You have separate userspace and kernel implementations, right?
> 9- Plugin support ( like that of reiserfs 4 ).
This can be good or bad. Make sure it doesn't bloat your layer.
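In practice a "plugin" here is usually just a table of operations that
can be swapped per object, something like the purely illustrative
sketch below; the cost is an extra pointer indirection on every call
plus whatever code each plugin drags in:

#include <stdint.h>

struct blk_object;                      /* whatever item the layer hands out */

/* Different plugins could supply allocation policy, checksumming,
 * compression, and so on. */
struct blk_plugin_ops {
        const char *name;
        int  (*init)(struct blk_object *obj);
        int  (*read)(struct blk_object *obj, uint64_t blk, void *buf);
        int  (*write)(struct blk_object *obj, uint64_t blk, const void *buf);
        void (*release)(struct blk_object *obj);
};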
Mike
On Saturday March 6, [email protected] wrote:
>
> > 2- Support for multi-disk/multi-host storage pool.
>
> You're mixing layers here. MD and DM already work in this area.
>
I would probably disagree here.
I think it makes much more sense for a filesystem to know about
multiple devices than for MD or DM to combine a bunch of devices into
the illusion of one big device, only to have the filesystem chop that
big device into little files....
(Note that I wouldn't expect a filesystem to include raid5 style
behaviour, and probably wouldn't expect raid1 like behaviour, but
having the filesystem do striping and inter-device migration itself
seems eminently sensible.)
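The striping side of that is just arithmetic once the filesystem knows
its member devices. A minimal sketch ( names made up, assuming
equal-sized devices and a fixed stripe unit ):

#include <stdint.h>

#define STRIPE_BLOCKS 128               /* stripe unit, in filesystem blocks */

struct stripe_map {
        unsigned int ndevs;             /* number of member devices */
};

/* Map a logical block to (device index, block on that device) with
 * simple round-robin striping, raid0 style. */
static void stripe_lookup(const struct stripe_map *m, uint64_t logical,
                          unsigned int *dev, uint64_t *devblk)
{
        uint64_t stripe = logical / STRIPE_BLOCKS;      /* which stripe unit */
        uint64_t offset = logical % STRIPE_BLOCKS;      /* offset within it */

        *dev    = stripe % m->ndevs;
        *devblk = (stripe / m->ndevs) * STRIPE_BLOCKS + offset;
}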
However, I don't see much value in the suggestion of a new layer that
provides lots of services to filesystems. I strongly suspect that no
filesystem would want to use them. Look at "jbd". It is designed to
provide a journalling layer for any filesystem, but how many
filesystems use it? Just one - ext3 - the one it was designed for.
NeilBrown
Neil Brown wrote:
> On Saturday March 6, [email protected] wrote:
>
>>>2- Support for multi-disk/multi-host storage pool.
>>
>>You're mixing layers here. MD and DM already work in this area.
>>
>
>
> I would probably disagree here.
> I think it makes much more sense for a filesystem to know about
> multiple devices than for MD or DM to combine a bunch of devices into
> the illusion of one big device, only to have the filesystem chop that
> big device into little files....
>
> (Note that I wouldn't expect a filesystem to include raid5 style
> behaviour, and probably wouldn't expect raid1 like behaviour, but
> having the filesystem do striping and inter-device migration itself
> seems eminently sensible.)
>
I saw something doing that in a SAN. I don't know if it was at the
filesystem level, though.
> However, I don't see much value in the suggestion of a new layer that
> provides lots of services to filesystems. I strongly suspect that no
> filesystem would want to use them. Look at "jbd". It is designed to
> provide a journalling layer for any filesystem, but how many
> filesystems use it? Just one - ext3 - the one it was designed for.
Since JBD is "Journaled Block Device", does that mean it is meant for
block-based journaling rather than "virtual" journaling (I don't think
I'm using the right term, so please correct me)?
Mike
Ramy M. Hassan wrote:
> I see many people interested in designing new filesystems for
> different purposes, and one of the common tasks every filesystem
> designer faces is managing device blocks.
>
> If you know of any similar effort, or of any technical obstacle I am
> missing, please let me know.
Please take a look at EVMS before you reinvent the wheel here.
http://evms.sourceforge.net
>
> Ramy
--
Jeremy Jackson
Coplanar Networks
http://www.coplanar.net