From: Andreas Dilger <adilger@clusterfs.com>
Date: Wed, 4 Sep 2002 02:39:16 -0600
To: Helge Hafting <helgehaf@aitel.hist.no>
Cc: ptb@it.uc3m.es, linux-kernel@vger.kernel.org
Subject: Re: [RFC] mount flag "direct" (fwd)
Message-ID: <20020904083916.GX32468@clusterfs.com>
Mail-Followup-To: Helge Hafting <helgehaf@aitel.hist.no>, ptb@it.uc3m.es,
	linux-kernel@vger.kernel.org
References: <200209032107.g83L71h10758@oboe.it.uc3m.es> <3D75B344.66D4166@aitel.hist.no>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3D75B344.66D4166@aitel.hist.no>
User-Agent: Mutt/1.4i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2471
Lines: 55

On Sep 04, 2002  09:16 +0200, Helge Hafting wrote:
> Your idea about re-reading stuff over and over isn't going to help 
> because that sort of thing consumes much more bandwith. Caches help
> because they _avoid_ data transfers.  So shared writeable data
> will happen, and it will use some sort of cache coherency,
> for performance reasons.

You assume too much about the applications.  For example, Oracle
does not want _any_ cacheing to be done by the OS, because it
manages the cache itself, and would rather allocate the full amount
of RAM itself instead of the OS duplicating data it is cacheing
internally.

Similarly, there are many "write only" applications that are only
hindered by OS cache, such as any kind of high-speed data recording
(video, particle accelerators, scientific computing, etc) which is
using most of the RAM for internal structures and wants the data it
writes to go directly to disk at the highest possible speed.

> I claim that making a new fs from scratch for the distributed
> case is easier than tweaking ext2 and 10-20 other existing fs'es
> to work in such an environment.  Making a new fs from scratch
> isn't such a big deal after all.

The problem isn't making a new fs, the problem is making a _good_
new fs.  It takes at least several years of development, testing,
tuning, etc to get just a local fs right, if not longer (i.e.
reiserfs, JFS, XFS, ext3, etc).  Add in the complexity of the
network side of things and it just gets that much harder to do
it all well.

We have taken the approach that local filesystems do a good job
with the "one node" assumption, so just use them as-is to
do a job they are good at.  All of the network and locking code
for Lustre is outside of the filesystem, and the "local" filesystems
are used for storing either the directory structure + attributes
(for the metadata server), or file data (for the storage targets).

Local filesystems can do both of those jobs very well already, so
no need to re-invent the wheel.

See http://www.lustre.org/docs.html for lots of papers and
documentation on the design of Lustre.

Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/