We have designed a new stackable file system that we called RAIF:
Redundant Array of Independent Filesystems.
Similar to Unionfs, RAIF is a fan-out file system and can be mounted over
many different disk-based, memory, network, and distributed file systems.
RAIF can use the stable and maintained code of the other file systems and
thus stay simple itself. Similar to standard RAID, RAIF can replicate the
data or store it with parity on any subset of the lower file systems. RAIF
has three main advantages over traditional driver-level RAID systems:
1. RAIF can be mounted over any set of file systems. This allows users to
create many more useful configurations. For example, it is possible to
replicate the data on local and remote disks, or to stripe the data on
the local hard drives and keep the parity (or even ECC to tolerate
multiple failures) on the remote server(s). In the latter case, all the
read requests will be satisfied from the fast local disks and no local
disk space will be spent on parity.
2. RAIF is a file system and thus has access to the meta-data. This allows
it to store different files differently. For example, RAIF can replicate
important files (*.c, *.doc, etc) on all the lower file systems and
stripe the multimedia files with parity at the same time.
3. It is sometimes more convenient to work with file systems than devices as
the lower storage. For example, it is possible to mount RAIF over a
directory on an existing file system. The data is represented as files
on the lower file systems. Therefore, in the case of replication, any
lower file system is an exact replica of the RAIF file system. This also
makes it easy to back up the data on the lower file systems using
existing tools.
We have performed some benchmarking on a 3GHz PC with 2GB of RAM and U320
SCSI disks. Compared to the Linux RAID driver, RAIF has overheads of about
20-25% under the Postmark v1.5 benchmark in case of striping and
replication. In case of RAID4 and RAID5-like configurations, RAIF performed
about two times *better* than software RAID and even better than an Adaptec
2120S RAID5 controller. This is because RAIF is located above file system
caches and can cache parity as normal data when needed. We have more
performance details in a technical report, if anyone is interested.
We started the project in April 2004. Right now I am using it as my
/home/kolya file system at home. We believe that at this stage RAIF is
mature enough for others to try it out. The code is available at:
<ftp://ftp.fsl.cs.sunysb.edu/pub/raif/>
The code requires no kernel patches and compiles for a wide range of kernels
as a module. The latest kernel we used it for is 2.6.13 and we are in the
process of porting it to 2.6.19.
We will be happy to hear back from you.
Nikolai Joukov on behalf of the RAIF team.
----------------------------------
Filesystems and Storage Laboratory
Stony Brook University
Nikolai Joukov wrote:
> replication. In case of RAID4 and RAID5-like configurations, RAIF performed
> about two times *better* than software RAID and even better than an Adaptec
> 2120S RAID5 controller. This is because RAIF is located above file system
> caches and can cache parity as normal data when needed. We have more
> performance details in a technical report, if anyone is interested.
This doesn't make sense to me. You do not want to cache the parity
data. It only needs to be used to validate the data blocks when the
stripe is read, and after that, you only want to cache the data, and
throw out the parity. Caching the parity as well will pollute the cache
and thus, should lower performance due to more important data being
thrown out.
> Nikolai Joukov wrote:
> > replication. In case of RAID4 and RAID5-like configurations, RAIF performed
> > about two times *better* than software RAID and even better than an Adaptec
> > 2120S RAID5 controller. This is because RAIF is located above file system
> > caches and can cache parity as normal data when needed. We have more
> > performance details in a technical report, if anyone is interested.
>
> This doesn't make sense to me. You do not want to cache the parity
> data. It only needs to be used to validate the data blocks when the
> stripe is read, and after that, you only want to cache the data, and
> throw out the parity. Caching the parity as well will pollute the cache
> and thus, should lower performance due to more important data being
> thrown out.
This happens automatically: unused parity pages are treated as unused
pages and get reused to cache something else. Also, the parity
never gets cached if you do not write the data (or recover the data).
However, if you use the same parity page over and over you do not need to
fetch it from the disk again.
By the way, unlike most other stackable file systems, RAIF does not cache
the data (or parity) multiple times: it only caches the data at its own
level and not at the level of the lower file systems.
Nikolai.
-------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> > Nikolai Joukov wrote:
> > > replication. In case of RAID4 and RAID5-like configurations, RAIF performed
> > > about two times *better* than software RAID and even better than an Adaptec
> > > 2120S RAID5 controller. This is because RAIF is located above file system
> > > caches and can cache parity as normal data when needed. We have more
> > > performance details in a technical report, if anyone is interested.
> >
> > This doesn't make sense to me. You do not want to cache the parity
> > data. It only needs to be used to validate the data blocks when the
> > stripe is read, and after that, you only want to cache the data, and
> > throw out the parity. Caching the parity as well will pollute the cache
> > and thus, should lower performance due to more important data being
> > thrown out.
>
> This happens automatically: unused parity pages are treated as unused
> pages and get reused to cache something else. Also, the parity
> never gets cached if you do not write the data (or recover the data).
> However, if you use the same parity page over and over you do not need to
> fetch it from the disk again.
To avoid confusion here: data recovery is not the only situation when it
is necessary to read the parity. Existing parity is also necessary for
writes that are smaller than the page size.
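For a concrete picture, this is the classic RAID small-write rule: the new
parity is the old parity XOR the old data XOR the new data, so a sub-page
write requires reading both the old data and the old parity first. A
minimal C sketch (illustrative only, not RAIF's actual code):

  #include <stddef.h>

  /*
   * Read-modify-write parity update for a partial write:
   *   new_parity = old_parity ^ old_data ^ new_data
   */
  static void parity_rmw(unsigned char *parity,
                         const unsigned char *old_data,
                         const unsigned char *new_data, size_t len)
  {
          size_t i;

          for (i = 0; i < len; i++)
                  parity[i] ^= old_data[i] ^ new_data[i];
  }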
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
>We have designed a new stackable file system that we called RAIF:
>Redundant Array of Independent Filesystems.
>
>Similar to Unionfs, RAIF is a fan-out file system and can be mounted over
>many different disk-based, memory, network, and distributed file systems.
>RAIF can use the stable and maintained code of the other file systems and
>thus stay simple itself. Similar to standard RAID, RAIF can replicate the
>data or store it with parity on any subset of the lower file systems. RAIF
>has three main advantages over traditional driver-level RAID systems:
>
>1. RAIF can be mounted over any set of file systems. This allows users to
> create many more useful configurations. For example, it is possible to
> replicate the data on local and remote disks, or to stripe the data on
> the local hard drives and keep the parity (or even ECC to tolerate
> multiple failures) on the remote server(s). In the latter case, all the
> read requests will be satisfied from the fast local disks and no local
> disk space will be spent on parity.
As for striping on a simplistic level, look at the Equal File
Distribution patch for unionfs :-)
http://www.mail-archive.com/[email protected]/msg01936.html
Files are stored normally so that after the union is unmounted, the
files appear in one piece (unlike real RAID0 over two block devices).
-`J'
> >We have designed a new stackable file system that we called RAIF:
> >Redundant Array of Independent Filesystems.
> >
> >Similar to Unionfs, RAIF is a fan-out file system and can be mounted over
> >many different disk-based, memory, network, and distributed file systems.
> >RAIF can use the stable and maintained code of the other file systems and
> >thus stay simple itself. Similar to standard RAID, RAIF can replicate the
> >data or store it with parity on any subset of the lower file systems. RAIF
> >has three main advantages over traditional driver-level RAID systems:
> >
> >1. RAIF can be mounted over any set of file systems. This allows users to
> > create many more useful configurations. For example, it is possible to
> > replicate the data on local and remote disks, or to stripe the data on
> > the local hard drives and keep the parity (or even ECC to tolerate
> > multiple failures) on the remote server(s). In the latter case, all the
> > read requests will be satisfied from the fast local disks and no local
> > disk space will be spent on parity.
>
> As for striping on a simplistic level, look at the Equal File
> Distribution patch for unionfs :-)
>
> http://www.mail-archive.com/[email protected]/msg01936.html
>
> Files are stored normally so that after the union is unmounted, the
> files appear in one piece (unlike real RAID0 over two block devices).
RAIF supports rules that describe how to store particular files or groups
of files. A rule with RAIF level 0 (which is similar to RAID level 0) and
a special striping unit size = '-1' will do the same (distribute the
files on the lower file systems) for files that match any given file name
pattern. A rule with level 4 and striping unit size = '-1' will
distribute files on several file systems and store an extra copy of the
files on a dedicated file system (e.g., an NFS mount with lots of space).
Now guess what RAIF's level 6 will do with a special striping unit
size = '-1' :-)
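To make this concrete, here is a sketch in C of what such a rule table
could look like. This is illustrative only - the rule format, the names,
and the fnmatch() matching are made up for the example and are not RAIF's
actual configuration syntax:

  #include <fnmatch.h>
  #include <stddef.h>

  /* Hypothetical rule table; the real RAIF rule format differs. */
  struct rule {
          const char *pattern;  /* file name glob                         */
          int level;            /* RAIF level: 0, 1, 4, 5, 6, ...         */
          int unit;             /* striping unit; -1 = whole-file units   */
  };

  static const struct rule rules[] = {
          { "*.mp3", 0, -1 },   /* distribute whole files across branches */
          { "*.c",   1,  0 },   /* replicate sources on every branch      */
          { "*.iso", 4, -1 },   /* distribute + extra copy on one branch  */
  };

  static int raif_level_for(const char *name)
  {
          size_t i;

          for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++)
                  if (fnmatch(rules[i].pattern, name, 0) == 0)
                          return rules[i].level;
          return 1;  /* default: replicate */
  }

The first matching rule decides how a file is stored, which is how several
per-file-type policies can coexist within a single mount.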
Nikolai.
----------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> Nikolai Joukov wrote:
> > We have designed a new stackable file system that we called RAIF:
> > Redundant Array of Independent Filesystems.
>
> Great!
>
> > We have performed some benchmarking on a 3GHz PC with 2GB of RAM and U320
> > SCSI disks. Compared to the Linux RAID driver, RAIF has overheads of
> > about 20-25% under the Postmark v1.5 benchmark in case of striping and
> > replication. In case of RAID4 and RAID5-like configurations, RAIF
> > performed about two times *better* than software RAID and even better than
> > an Adaptec 2120S RAID5 controller.
>
> I am not surprised. RAID 4/5/6 performance is highly sensitive to the
> underlying hw, and thus needs a fair amount of fine tuning.
Nevertheless, performance is not the biggest advantage of RAIF. For
read-biased workloads RAID is always slightly faster than RAIF. The
biggest advantages of RAIF are flexible configurations (e.g., can combine
NFS and local file systems), per-file-type storage policies, and the fact
that files are stored as files on the lower file systems (which is
convenient).
> > This is because RAIF is located above
> > file system caches and can cache parity as normal data when needed. We
> > have more performance details in a technical report, if anyone is
> > interested.
>
> Definitely interested. Can you give a link?
The main focus of the paper is on a general OS profiling method and not
on RAIF. However, it has some details about the RAIF benchmarking with
Postmark in Chapter 9:
<http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf>
Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
operation under the same Postmark workload.
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
On Friday 15 December 2006 10:01, Nikolai Joukov wrote:
> > Nikolai Joukov wrote:
> > > We have designed a new stackable file system that we called RAIF:
> > > Redundant Array of Independent Filesystems.
> >
> > Great!
Yes, definitely...
I see the major benefit being in the mobile, industrial and embedded systems
arena. Perhaps this might come as a surprise to people, but a very large and
ever-growing number (perhaps even most) of Linux devices don't use block
devices for storage. Instead they use flash file systems or NFS, neither of
which uses local block devices.
It looks like RAIF gives a way to provide redundancy etc on these devices.
> >
> > > We have performed some benchmarking on a 3GHz PC with 2GB of RAM and
> > > U320 SCSI disks. Compared to the Linux RAID driver, RAIF has overheads
> > > of about 20-25% under the Postmark v1.5 benchmark in case of striping
> > > and replication. In case of RAID4 and RAID5-like configurations, RAIF
> > > performed about two times *better* than software RAID and even better
> > > than an Adaptec 2120S RAID5 controller.
> >
> > I am not surprised. RAID 4/5/6 performance is highly sensitive to the
> > underlying hw, and thus needs a fair amount of fine tuning.
>
> Nevertheless, performance is not the biggest advantage of RAIF. For
> read-biased workloads RAID is always slightly faster than RAIF. The
> biggest advantages of RAIF are flexible configurations (e.g., can combine
> NFS and local file systems), per-file-type storage policies, and the fact
> that files are stored as files on the lower file systems (which is
> convenient).
>
> > > This is because RAIF is located above
> > > file system caches and can cache parity as normal data when needed. We
> > > have more performance details in a technical report, if anyone is
> > > interested.
> >
> > Definitely interested. Can you give a link?
>
> The main focus of the paper is on a general OS profiling method and not
> on RAIF. However, it has some details about the RAIF benchmarking with
> Postmark in Chapter 9:
>
> <http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf>
>
> Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
> operation under the same Postmark workload.
>
> Nikolai.
> ---------------------
> Nikolai Joukov, Ph.D.
> Filesystems and Storage Laboratory
> Stony Brook University
Nikolai Joukov wrote:
>
> <http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf>
>
> Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
> operation under the same Postmark workload.
>
> Nikolai.
> ---------------------
> Nikolai Joukov, Ph.D.
> Filesystems and Storage Laboratory
> Stony Brook University
>
>
Well, Congratulations, Doctor!! [Must be nice to be exiled to Stony
Brook!! Oh, well, not I]
For some reason, I can not connect to the above link, but I may not need
to. Does [should] it contain a link/pointer to the underlying source
code? This concept sounds very interesting, and I am sure that many of
us would like to look closer, and maybe even get a taste.
Here's hoping that source exists, and that it is available for us.
Thanks
b-
> > We started the project in April 2004. Right now I am using it as my
> > /home/kolya file system at home. We believe that at this stage RAIF is
> > mature enough for others to try it out. The code is available at:
> >
> > <ftp://ftp.fsl.cs.sunysb.edu/pub/raif/>
> >
> > The code requires no kernel patches and compiles for a wide range of
> > kernels as a module. The latest kernel we used it for is 2.6.13 and we
> > are in the process of porting it to 2.6.19.
> >
> > We will be happy to hear back from you.
>
> When removing a file from the underlying branch, the oops below happens.
> Wouldn't it be possible to just fail the branch instead of oopsing?
This is a known problem of all Linux stackable file systems. Users are
not supposed to change the file systems below mounted stackable file
systems (but they can read them). One of the ways to enforce it is to use
overlay mounts. For example, mount the lower file systems at
/raif/b0 ... /raif/bN and then mount RAIF at /raif. Stackable file
systems recently started getting into the kernel and we hope that there
will be a better solution for this problem in the future. Having said
that, you are right: failing the branch would be the right thing to do.
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> Well, Congratulations, Doctor!! [Must be nice to be exiled to Stony
> Brook!! Oh, well, not I]
Long Island is a very nice place with lots of wineries and perfect sandy
beaches - don't envy :-)
> Here's hoping that source exists, and that it is available for us.
I guess, you are subscribed to the linux-raid list only. Unfortunately, I
didn't CC my post to that list and one of the replies was CC'd there
without the link. The original post is available here:
<http://marc.theaimsgroup.com/?l=linux-fsdevel&m=116603282106036&w=2>
And the link to the sources is:
<ftp://ftp.fsl.cs.sunysb.edu/pub/raif/>
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
Nikolai Joukov wrote:
> > Nikolai Joukov wrote:
> > > We have designed a new stackable file system that we called RAIF:
> > > Redundant Array of Independent Filesystems.
> >
> > Great!
> >
> > > We have performed some benchmarking on a 3GHz PC with 2GB of RAM and
> > > U320 SCSI disks. Compared to the Linux RAID driver, RAIF has
> > > overheads of about 20-25% under the Postmark v1.5 benchmark in case of
> > > striping and replication. In case of RAID4 and RAID5-like
> > > configurations, RAIF performed about two times *better* than software
> > > RAID and even better than an Adaptec 2120S RAID5 controller.
> >
> > I am not surprised. RAID 4/5/6 performance is highly sensitive to the
> > underlying hw, and thus needs a fair amount of fine tuning.
>
> Nevertheless, performance is not the biggest advantage of RAIF. For
> read-biased workloads RAID is always slightly faster than RAIF. The
> biggest advantages of RAIF are flexible configurations (e.g., can combine
> NFS and local file systems), per-file-type storage policies, and the fact
> that files are stored as files on the lower file systems (which is
> convenient).
Ok, I was just about to inform you of a three NFS-branch RAIF which was
unable to fill the net pipe. So it looks like a 25% performance hit across
the board. It should be possible to reduce that to sub-3% once RAIF
matures, don't you think?
> > > This is because RAIF is located above
> > > file system caches and can cache parity as normal data when needed.
> > > We have more performance details in a technical report, if anyone is
> > > interested.
> >
> > Definitely interested. Can you give a link?
>
> The main focus of the paper is on a general OS profiling method and not
> on RAIF. However, it has some details about the RAIF benchmarking with
> Postmark in Chapter 9:
>
> <http://www.fsl.cs.sunysb.edu/docs/joukov-phdthesis/thesis.pdf>
>
> Figures 9.7 and 9.8 also show profiles of the Linux RAID5 and RAIF5
> operation under the same Postmark workload.
Thanks!
--
Al
Nikolai Joukov wrote:
> > > We started the project in April 2004. Right now I am using it as my
> > > /home/kolya file system at home. We believe that at this stage RAIF
> > > is mature enough for others to try it out. The code is available at:
> > >
> > > <ftp://ftp.fsl.cs.sunysb.edu/pub/raif/>
> > >
> > > The code requires no kernel patches and compiles for a wide range of
> > > kernels as a module. The latest kernel we used it for is 2.6.13 and
> > > we are in the process of porting it to 2.6.19.
> > >
> > > We will be happy to hear back from you.
> >
> > When removing a file from the underlying branch, the oops below happens.
> > Wouldn't it be possible to just fail the branch instead of oopsing?
>
> This is a known problem of all Linux stackable file systems. Users are
> not supposed to change the file systems below mounted stackable file
> systems (but they can read them). One of the ways to enforce it is to use
> overlay mounts. For example, mount the lower file systems at
> /raif/b0 ... /raif/bN and then mount RAIF at /raif. Stackable file
> systems recently started getting into the kernel and we hope that there
> will be a better solution for this problem in the future. Having said
> that, you are right: failing the branch would be the right thing to do.
Good. It seems that there is also some tmpfs/raif-over-nfs deadlock
situation. Can't really tell if it's the kernel or the raif, but when do
you think the patches could be brought into sync with the current mainline?
Thanks!
--
Al
On Wednesday 13 December 2006 12:47, Nikolai Joukov wrote:
> We have designed a new stackable file system that we called RAIF:
> Redundant Array of Independent Filesystems
Do you have a function similar to an EMC cloneset? Basically, a cloneset
tracks what has changed in both the source and target LUNs (drives). When one
updates the cloneset, the target is made identical to the source. It's a great
way to do backups. It's an important feature to be able to write to the
target drives.
I would love to see this working at a filesystem level.
Thanks
Ed Tomlinson
> On Friday 15 December 2006 10:01, Nikolai Joukov wrote:
> > > Nikolai Joukov wrote:
> > > > We have designed a new stackable file system that we called RAIF:
> > > > Redundant Array of Independent Filesystems.
> > >
> > > Great!
>
> Yes, definitely...
>
> I see the major benefit being in the mobile, industrial and embedded systems
> arena. Perhaps this might come as a surprise to people, but a very large and
> ever-growing number (perhaps even most) of Linux devices don't use block
> devices for storage. Instead they use flash file systems or NFS, neither of
> which uses local block devices.
>
> It looks like RAIF gives a way to provide redundancy etc on these devices.
Good point! Also, RAIF can store different file types differently.
Therefore, it is possible to mount RAIF over file systems with lots of
storage space and a flash file system (which usually has less space). In
this case, RAIF can be configured to use the flash to keep replicas of
only the most important data. And yes, thanks to the stackable nature of
RAIF, no explicit flash support is required. RAIF can reuse existing file
systems designed for flash media (e.g., JFFS2).
Nikolai.
> Nikolai Joukov wrote:
> > > Nikolai Joukov wrote:
> > > > We have designed a new stackable file system that we called RAIF:
> > > > Redundant Array of Independent Filesystems.
> > >
> > > Great!
> > >
> > > > We have performed some benchmarking on a 3GHz PC with 2GB of RAM and
> > > > U320 SCSI disks. Compared to the Linux RAID driver, RAIF has
> > > > overheads of about 20-25% under the Postmark v1.5 benchmark in case of
> > > > striping and replication. In case of RAID4 and RAID5-like
> > > > configurations, RAIF performed about two times *better* than software
> > > > RAID and even better than an Adaptec 2120S RAID5 controller.
> > >
> > > I am not surprised. RAID 4/5/6 performance is highly sensitive to the
> > > underlying hw, and thus needs a fair amount of fine tuning.
> >
> > Nevertheless, performance is not the biggest advantage of RAIF. For
> > read-biased workloads RAID is always slightly faster than RAIF. The
> > biggest advantages of RAIF are flexible configurations (e.g., can combine
> > NFS and local file systems), per-file-type storage policies, and the fact
> > that files are stored as files on the lower file systems (which is
> > convenient).
>
> Ok, I was just about to inform you of a three NFS-branch RAIF which was
> unable to fill the net pipe. So it looks like a 25% performance hit across
> the board. It should be possible to reduce that to sub-3% once RAIF
> matures, don't you think?
Hmmm. Which workload did you try? Which RAIF level did you use: RAIF0
(striping), replication (RAIF1, default), or striping with parity (RAIF4,
5, 6)? Which hardware did you use? RAIF has to consume extra CPU time,
and on older machines the overheads seem to be higher (CPU speeds are
improving at a faster pace than I/O speeds). Also, I guess you are
comparing RAIF mounted over three NFS branches with NFS alone, right?
It doesn't seem to be very fair to me :-)
Recently we solved the double-caching problem, which improved RAIF's
performance by an order of magnitude under I/O-intensive workloads.
(Normally, Linux stackable file systems cache the data twice.)
Unfortunately, many VFS meta-operations are synchronous (e.g., lookup).
RAIF has to wait on such operations in sequence for every branch
involved. (This is different from, say, the readpage operation: we
call readpage on all the branches right away and then wait for them
all to complete in parallel.) Sequential waiting on lookups should be OK for
multi-threaded workloads but may result in extra elapsed time for the
single-threaded workloads. Again, elegant solutions may require VFS API
changes. Alternatively, we can create kernel threads for every branch.
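To illustrate the asynchronous fan-out for data operations, here is a
rough sketch (the structure and names are hypothetical and locking and
error handling are elided - the real RAIF code is more involved):

  #include <linux/fs.h>
  #include <linux/pagemap.h>

  /* Hypothetical per-file fan-out state. */
  struct raif_fanout {
          int nbranches;
          struct file **lower_file;   /* one open file per branch  */
          struct page **lower_page;   /* locked pages to read into */
  };

  /*
   * Start the read on every branch before waiting on any of them, so
   * the branches perform their I/O in parallel.  Assumes each lower
   * page is locked, as ->readpage requires.
   */
  static int raif_readpage_fanout(struct raif_fanout *rf)
  {
          int i, ret, err = 0;

          /* Dispatch: kick off the read on every branch first. */
          for (i = 0; i < rf->nbranches; i++) {
                  ret = rf->lower_file[i]->f_mapping->a_ops->readpage(
                                  rf->lower_file[i], rf->lower_page[i]);
                  if (ret)
                          err = ret;
          }

          /* Then wait once; the lower reads overlap in time. */
          for (i = 0; i < rf->nbranches; i++)
                  wait_on_page_locked(rf->lower_page[i]);

          return err;
  }

Synchronous operations like lookup offer no such dispatch point, which is
why they are the harder case.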
I am not sure about 3% overheads in all the cases compared to NFS alone.
On one hand, there should be some price to pay for the extra
functionality. On the other hand, for some workloads RAIF should
even improve performance compared to a single NFS because of the load
distribution. In general, I agree that there are still many things we can
optimize.
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> Nikolai Joukov wrote:
> > > > We started the project in April 2004. Right now I am using it as my
> > > > /home/kolya file system at home. We believe that at this stage RAIF
> > > > is mature enough for others to try it out. The code is available at:
> > > >
> > > > <ftp://ftp.fsl.cs.sunysb.edu/pub/raif/>
> > > >
> > > > The code requires no kernel patches and compiles for a wide range of
> > > > kernels as a module. The latest kernel we used it for is 2.6.13 and
> > > > we are in the process of porting it to 2.6.19.
> > > >
> > > > We will be happy to hear back from you.
> > >
> > > When removing a file from the underlying branch, the oops below happens.
> > > Wouldn't it be possible to just fail the branch instead of oopsing?
> >
> > This is a known problem of all Linux stackable file systems. Users are
> > not supposed to change the file systems below mounted stackable file
> > systems (but they can read them). One of the ways to enforce it is to use
> > overlay mounts. For example, mount the lower file systems at
> > /raif/b0 ... /raif/bN and then mount RAIF at /raif. Stackable file
> > systems recently started getting into the kernel and we hope that there
> > will be a better solution for this problem in the future. Having said
> > that, you are right: failing the branch would be the right thing to do.
>
> Good. It seems that there is also some tmpfs/raif-over-nfs deadlock
> situation. Can't really tell if it's the kernel or the raif, but when do
> you think the patches could be brought into sync with the current mainline?
It would be great if you could send us more details about how to recreate
this deadlock and we will take a look at it. It would be even better if
you and everybody else who finds bugs in RAIF submit the bug reports to:
<http://bugzilla.fsl.cs.sunysb.edu>
We are in the process of porting RAIF to 2.6.19 right now. Should be done
in early January. The trick is that we are trying to keep the same source
good for a wide range of kernel versions. In fact, not too long ago we
even were able to compile it for 2.4.24!
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> On Wednesday 13 December 2006 12:47, Nikolai Joukov wrote:
> > We have designed a new stackable file system that we called RAIF:
> > Redundant Array of Independent Filesystems
>
> Do you have a function similar to an EMC cloneset? Basically, a cloneset
> tracks what has changed in both the source and target LUNs (drives). When one
> updates the cloneset, the target is made identical to the source. It's a great
> way to do backups. It's an important feature to be able to write to the
> target drives.
> I would love to see this working at a filesystem level.
Well, if you mount RAIF over your file system and a for-backups file
system, RAIF can replicate the files on both of them automatically. I
guess that's what you need.
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
On Friday 15 December 2006 15:11, Nikolai Joukov wrote:
> > On Wednesday 13 December 2006 12:47, Nikolai Joukov wrote:
> > > We have designed a new stackable file system that we called RAIF:
> > > Redundant Array of Independent Filesystems
> >
> > Do you have a function similar to an EMC cloneset? Basically, a cloneset
> > tracks what has changed in both the source and target LUNs (drives). When one
> > updates the cloneset, the target is made identical to the source. It's a great
> > way to do backups. It's an important feature to be able to write to the
> > target drives.
> > I would love to see this working at a filesystem level.
>
> Well, if you mount RAIF over your file system and a for-backups file
> system, RAIF can replicate the files on both of them automatically. I
> guess that's what you need.
Yes and no. The idea behind the cloneset is that most of the blocks (or
files) do not change in either source or target. This being the case, it's
only necessary to update the changed elements. This means updates are
incremental. Once the system has figured out what it needs to update, it is
usable, and if you access an element that should be updated you will see
the correctly updated version - even though background resyncing is still
in progress. This type of logic is great for backups.
Thanks
Ed Tomlinson
On Wed, 13 Dec 2006, Nikolai Joukov wrote:
> We have designed a new stackable file system that we called RAIF:
> Redundant Array of Independent Filesystems.
>
> Similar to Unionfs, RAIF is a fan-out file system and can be mounted over
> many different disk-based, memory, network, and distributed file systems.
> RAIF can use the stable and maintained code of the other file systems and
> thus stay simple itself. Similar to standard RAID, RAIF can replicate the
> data or store it with parity on any subset of the lower file systems. RAIF
> has three main advantages over traditional driver-level RAID systems:
this sounds very interesting. did you see the paper on chunkfs?
http://www.usenix.org/events/hotdep06/tech/prelim_papers/henson/henson_html/
it sounds as if you may be able to make a functional equivalent to
chunkfs with your raid0 mode.
David Lang
> > We have designed a new stackable file system that we called RAIF:
> > Redundant Array of Independent Filesystems.
> >
> > Similar to Unionfs, RAIF is a fan-out file system and can be mounted over
> > many different disk-based, memory, network, and distributed file systems.
> > RAIF can use the stable and maintained code of the other file systems and
> > thus stay simple itself. Similar to standard RAID, RAIF can replicate the
> > data or store it with parity on any subset of the lower file systems. RAIF
> > has three main advantages over traditional driver-level RAID systems:
>
> this sounds very interesting. did you see the paper on chunkfs?
> http://www.usenix.org/events/hotdep06/tech/prelim_papers/henson/henson_html/
I saw Val at OSDI right before this HotDep talk and sure, I have seen the
paper :-)
> it sounds as if you may be able to make a functional equivalent to
> chunkfs with your raid0 mode.
I also have this feeling. RAIF0 is similar to chunkfs and allows more
flexibility. Not only can RAIF stripe the data across many small local file
systems (possibly located on multiple drives), it can also stripe the data
across remote file systems. In addition, it can keep the parity, use
per-file-type storage policies, etc. However, such a configuration would
mean lots and lots of lower file systems ( = branches = chunks). I am
afraid that in this case RAIF's performance would not be so great due to
VFS API restrictions on operations like lookup.
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> > The idea behind the cloneset is that most of the blocks (or files)
> > do not change in either source or target. This being the case, it's
> > only necessary to update the changed elements. This means updates are
> > incremental. Once the system has figured out what it needs to update,
> > it is usable, and if you access an element that should be updated you
> > will see the correctly updated version - even though background
> > resyncing is still in progress.
>
> I still can't tell what you're describing. With RAID1 as well, only
> changed elements ever get updated. I have two identical filesystems,
> members of a RAIF set. I change one file. One file in each member
> filesystem gets updated, and I again have two identical filesystems.
>
> How would a cloneset work differently, and how would it be better?
Thanks, Bryan. I was about to write almost the same.
> > This type of logic is great for backups.
>
> Can you give an example of using it for backup?
I guess, you can mount Versionfs (yet another stackable file system)
below RAIF and above one of the lower file systems or use some other
versioning file system such as ext3cow. This will allow rolling back to
any older file system version at any time.
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> I am looking at filling the net-pipe, and it only reaches 40-75% max, with
> some short 100% bursts, and a slow 10% start. It seems that caching
> somewhat delays the writes, which then batch up and sync at various speeds.
> So you have the cache really hiding slow sync speeds. To tune this, it may
> be helpful to turn off caching, which in turn could surface the actual
> bottlenecks.
Well, RAIF is an external kernel module and as such cannot change the
caching behavior much. The notorious problem of all Linux stackable file
systems is double-caching of data. Every stackable file system caches the
data at its own level and copies it from/to the lower file system's cached
pages when necessary. This has some advantages for file systems that
change the data (e.g., encrypt or compress). However, this effectively
reduces the system's cache memory size by two or more times.
Among all the existing stackable file systems, Tracefs and RAIF have the
highest requirements for low overheads. Here are the solutions to the
double-caching problem that we tried:
1. Redirect read and write requests to the lower file systems directly.
This allows RAIF to avoid caching data at its own level. However, this
optimization must be turned off as soon as a file is mmap'ed to avoid
cache inconsistencies. Also, even if the file is not mmap'ed, RAIF1
still keeps the data copies for all the lower branches. This
optimization is implemented in RAIF but is not turned on by default.
(We strip this and many other #ifdef'ed code fragments from the
code releases automatically.)
2. We cache the data at the RAIF level. When we write to the lower file
systems we do allocate the lower pages and do copy the data but we
also mark the lower pages with PG_reclaim flag before calling the
lower writepage operation. This releases all the lower pages right
after the write completes. This works fine for mmap'ed files and this
is the default RAIF behavior now. This solves the problem for most
workloads that mix reads and writes. For example, it improved
Postmark's performance several times. Unfortunately, this optimization
does not improve performance for big sequential writes - the workload
that you tried. So essentially, you had a quarter of your original page
cache while running your workload.
3. A known ideal solution for this problem is sharing of the cached pages
between file systems. We attempted to do it for Tracefs but the
resulting code is not beautiful and is potentially racy:
<http://marc.theaimsgroup.com/?l=linux-fsdevel&m=113193082115222&w=2>
Unfortunately, for fan-out file systems this solution requires even
more support from the OS. However, this is what most OSs do
(including BSD and Windows) but unfortunately not Linux :-(
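Here is a minimal sketch of the PG_reclaim trick from solution 2 (the
function name is hypothetical, and page locking and writeback setup are
elided; only the flag usage is the point):

  #include <linux/fs.h>
  #include <linux/mm.h>
  #include <linux/writeback.h>

  /*
   * After copying the upper data into the lower page, tag the lower
   * page with PG_reclaim so the VM evicts it as soon as writeback
   * completes.  The data then stays cached only at the RAIF level.
   * Assumes the lower page is locked, as ->writepage requires.
   */
  static int raif_write_lower_page(struct page *lower_page,
                                   struct writeback_control *wbc)
  {
          SetPageReclaim(lower_page);
          return lower_page->mapping->a_ops->writepage(lower_page, wbc);
  }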
> little overhead. So for RAIF to be viable, it needs to have low overhead,
> which doesn't seem impossible to implement, given RAIF's simple but
> beautiful approach.
Thanks a lot!
Nikolai.
---------------------
Nikolai Joukov, Ph.D.
Filesystems and Storage Laboratory
Stony Brook University
> > 3. A known ideal solution for this problem is sharing of the cached pages
> > between file systems. We attempted to do it for Tracefs but the
> > resulting code is not beautiful and is potentially racy:
> > <http://marc.theaimsgroup.com/?l=linux-fsdevel&m=113193082115222&w=2>
> > Unfortunately, for fan-out file systems this solution requires even
> > more support from the OS. However, this is what most OSs do
> > (including BSD and Windows) but unfortunately not Linux :-(
>
> VFS-hooks seem to be the cleanest solution not only for a stacked-fs, but
> also for many other situations. It's rather sad that linux hasn't seen the
> light yet.
Jeff Sipek just got his proposal for a paper/discussion topic accepted to
the Linux Storage and Filesystems workshop, co-located with FAST. The
topic for discussion will be what "surgery" the Linux kernel needs to
support stackable file systems properly. I hope it is an indicator that
the situation with support of the stackable file systems in Linux may
improve soon.
Nikolai.
> > Every stackable file system caches the data at its own level and
> > copies it from/to the lower file system's cached pages when necessary.
> > ...
> > this effectively reduces the system's cache memory size by two or more
> > times.
>
> It should not be that bad with a decent cache replacement policy; I
> wonder if observing the problem (that you corrected in the various ways
> you've described), you got some insight as to what exactly was happening.
I agree that appropriate replacement policies can partially eliminate
the double caching problem for stackable file systems. In fact, that's
exactly what RAIF does: it forces the data pages of the lower file
systems to be evicted right after they are written and are not needed
anymore. This solves the problem for most write-intensive workloads.
Without this optimization the situation is much worse because Linux is
trying to protect caches of different file systems from each other. But,
as you mentioned, any cache replacement policy is optimized for some set
of workloads and is bad for some other set of workloads. Also, caching
the data at multiple layers not just increases the memory consumption but
also adds CPU time overheads because of the data copying between the
pages. I believe that the real solution to the problem is the ability to
share data pages between file systems.
Nikolai.
> We are in the process of porting RAIF to 2.6.19 right now. Should be done
> in early January. The trick is that we are trying to keep the same source
> good for a wide range of kernel versions. In fact, not too long ago we
> even were able to compile it for 2.4.24!
>
> Nikolai.
We now have RAIF for the 2.6.19 kernel available at:
ftp://ftp.fsl.cs.sunysb.edu/pub/raif/raif-1.1.tar.gz
This version is more stable, but there are surely still some remaining
bugs, and we very much appreciate your feedback.
Thank you.
Chaitanya on behalf of the RAIF team.
Filesystems and Storage Laboratory
Stony Brook University