Hi, this is the latest set of gfs patches, it includes some minor munging
since the previous set. Andrew, could this be added to -mm? there's not
much in the way of pending changes.
http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch
http://redhat.com/~teigland/gfs2/20050901/broken-out/
I'd like to get a list of specific things remaining for merging. I
believe we've responded to everything from earlier reviews; they were very
helpful, and more would be excellent. The list begins with one item from
before that's still pending:
- Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists.
[cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.]
...
Thanks
Dave
On Thu, 2005-09-01 at 18:46 +0800, David Teigland wrote:
> Hi, this is the latest set of gfs patches, it includes some minor munging
> since the previous set. Andrew, could this be added to -mm? there's not
> much in the way of pending changes.
Can you post them here instead so that they can actually be reviewed?
David Teigland <[email protected]> wrote:
>
> Hi, this is the latest set of gfs patches, it includes some minor munging
> since the previous set. Andrew, could this be added to -mm?
Dumb question: why?
Maybe I was asleep, but I don't recall seeing much discussion or exposition
of
- Why the kernel needs two clustered filesystems
- Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
possibly gain (or vice versa)
- Relative merits of the two offerings
etc.
Maybe this has all been thrashed out and agreed to. If so, please remind me.
On Thu, 2005-09-01 at 18:46 +0800, David Teigland wrote:
> Hi, this is the latest set of gfs patches, it includes some minor munging
> since the previous set. Andrew, could this be added to -mm? there's not
> much in the way of pending changes.
>
> http://redhat.com/~teigland/gfs2/20050901/gfs2-full.patch
> http://redhat.com/~teigland/gfs2/20050901/broken-out/
+static inline void glock_put(struct gfs2_glock *gl)
+{
+ if (atomic_read(&gl->gl_count) == 1)
+ gfs2_glock_schedule_for_reclaim(gl);
+ gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
+ atomic_dec(&gl->gl_count);
+}
this code has a race
what is gfs2_assert() about anyway? please just use BUG_ON directly everywhere
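For illustration, a minimal sketch of a race-free variant, assuming gl_count is an ordinary
reference count: do the decrement and the last-reference test in one atomic step instead of
reading the counter separately (how the reclaim list holds its own reference is a separate
question the real code would have to answer).
/* Sketch only: atomic_dec_and_test() makes the decrement and the
 * "did we just drop the last reference?" check a single atomic step,
 * so two callers cannot race between the read and the decrement. */
static inline void glock_put(struct gfs2_glock *gl)
{
	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0);
	if (atomic_dec_and_test(&gl->gl_count))
		gfs2_glock_schedule_for_reclaim(gl);
}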
+static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head)
+{
+ int empty;
+ spin_lock(&gl->gl_spin);
+ empty = list_empty(head);
+ spin_unlock(&gl->gl_spin);
+ return empty;
+}
that looks like a racey interface to me... if so.. why bother locking at all?
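To illustrate the point: the return value of queue_empty() is stale as soon as gl_spin is
dropped, so a caller that acts on it has to hold the lock across both the test and the action
anyway, at which point the helper's internal locking adds nothing. A sketch of what a correct
caller ends up doing (example_caller() is hypothetical):
/* Hypothetical caller: the emptiness test is only meaningful while
 * gl_spin is still held, so the lock must cover the whole sequence. */
static void example_caller(struct gfs2_glock *gl, struct list_head *head)
{
	spin_lock(&gl->gl_spin);
	if (list_empty(head)) {
		/* ... act while the list is still known to be empty ... */
	}
	spin_unlock(&gl->gl_spin);
}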
+void gfs2_glock_hold(struct gfs2_glock *gl)
+{
+ glock_hold(gl);
+}
eh why?
+struct gfs2_holder *gfs2_holder_get(struct gfs2_glock *gl, unsigned int state,
+ int flags, int gfp_flags)
+{
+ struct gfs2_holder *gh;
+
+ gh = kmalloc(sizeof(struct gfs2_holder), GFP_KERNEL | gfp_flags);
this looks odd. Either you take flags or you don't.. this looks really half arsed and thus is really surprising
to all callers
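A sketch of the less surprising signature the comment is asking for: take the complete
allocation mask from the caller instead of OR-ing caller bits into an implicit GFP_KERNEL
(gfp_t stands for whatever type the tree uses for allocation flags; the holder setup is elided,
only the signature change is the point).
/* Sketch: callers pass GFP_KERNEL, GFP_NOFS, etc. explicitly and get
 * exactly the allocation behaviour they asked for. */
struct gfs2_holder *gfs2_holder_get(struct gfs2_glock *gl, unsigned int state,
				    int flags, gfp_t gfp)
{
	struct gfs2_holder *gh;

	gh = kmalloc(sizeof(struct gfs2_holder), gfp);
	if (!gh)
		return NULL;
	/* ... initialise the holder exactly as the original function does ... */
	return gh;
}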
static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
+ gi_filler_t filler)
+{
+ unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size);
+ char *buf;
+ unsigned int count = 0;
+ int error;
+
+ if (size > gi->gi_size)
+ size = gi->gi_size;
+
+ buf = kmalloc(size, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ error = filler(ip, gi, buf, size, &count);
+ if (error)
+ goto out;
+
+ if (copy_to_user(gi->gi_data, buf, count + 1))
+ error = -EFAULT;
where does count get a sensible value?
+static unsigned int handle_roll(atomic_t *a)
+{
+ int x = atomic_read(a);
+ if (x < 0) {
+ atomic_set(a, 0);
+ return 0;
+ }
+ return (unsigned int)x;
+}
this is just plain scary.
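For illustration, a variant of handle_roll() that does not silently discard increments landing
between the read and the reset, assuming a compare-and-exchange primitive for atomic_t is
available (sketch only):
/* Sketch: only reset the counter to zero if it still holds the negative
 * value we observed; otherwise re-read and retry, so concurrent
 * atomic_add()s are never lost. */
static unsigned int handle_roll(atomic_t *a)
{
	int x = atomic_read(a);

	while (x < 0) {
		int old = atomic_cmpxchg(a, x, 0);
		if (old == x)
			return 0;
		x = old;
	}
	return (unsigned int)x;
}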
you'll have to post the rest of your patches if you want anyone to look at them...
On 9/1/05, David Teigland <[email protected]> wrote:
> - Adapt the vfs so gfs (and other cfs's) don't need to walk vma lists.
> [cf. ops_file.c:walk_vm(), gfs works fine as is, but some don't like it.]
It works fine only if you don't care about playing well with other
clustered filesystems.
Pekka
On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:
> - Why the kernel needs two clustered filesystems
So delete reiserfs4, FAT, VFAT, ext2, and all the other "junk".
> - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> possibly gain (or vice versa)
>
> - Relative merits of the two offerings
You missed the important one - people actively use it and have been for
some years. Same reason we have NTFS, HPFS, and all the others. On
that alone it makes sense to include.
Alan
On Thu, Sep 01, 2005 at 03:49:18PM +0100, Alan Cox wrote:
> > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > possibly gain (or vice versa)
> >
> > - Relative merits of the two offerings
>
> You missed the important one - people actively use it and have been for
> some years. Same reason we have NTFS, HPFS, and all the others. On
> that alone it makes sense to include.
That's GFS. The submission is about a GFS2 that's on-disk incompatible
to GFS.
> That's GFS. The submission is about a GFS2 that's on-disk incompatible
> to GFS.
Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3
then. I think the main point still stands - we have always taken
multiple file systems on board and we have benefitted enormously from
having the competition between them instead of a diktat from the kernel
kremlin that 'foofs is the one true way'
Competition will decide if OCFS or GFS is better, or indeed if someone
comes along with another contender that is better still. And competition
will probably get the answer right.
The only thing that is important is we don't end up with each cluster fs
wanting different core VFS interfaces added.
Alan
On 2005-09-01T16:28:30, Alan Cox <[email protected]> wrote:
> Competition will decide if OCFS or GFS is better, or indeed if someone
> comes along with another contender that is better still. And competition
> will probably get the answer right.
Competition will come up with the same situation as with reiserfs and ext3
and XFS, namely that they'll all be maintained going forward because of,
uhm, political constraints ;-)
But then, as long as they _are_ maintained and play along nicely with
each other (which, btw, is needed already so that at least data can be
migrated...), I don't really see a problem with having two or three.
> The only thing that is important is we don't end up with each cluster fs
> wanting different core VFS interfaces added.
Indeed.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
-- Charles Darwin
On Thursday 01 September 2005 10:49, Alan Cox wrote:
> On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:
> > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > possibly gain (or vice versa)
> >
> > - Relative merits of the two offerings
>
> You missed the important one - people actively use it and have been for
> some years. Same reason we have NTFS, HPFS, and all the others. On
> that alone it makes sense to include.
I thought that gfs2 just appeared last month. Or is it really still just gfs?
If there are substantive changes from gfs to gfs2 then obviously they have
had practically zero testing, let alone posted benchmarks, testimonials, etc.
If it is really still just gfs then the silly-rename should be undone.
Regards,
Daniel
On Thursday 01 September 2005 06:46, David Teigland wrote:
> I'd like to get a list of specific things remaining for merging.
Where are the benchmarks and stability analysis? How many hours does it
survive Cerberus running on all nodes simultaneously? Where are the
testimonials from users? How long has there been a gfs2 filesystem? Note
that Reiser4 is still not in mainline a year after it was first offered; why
do you think gfs2 should be in mainline after one month?
So far, all catches are surface things like bogus spinlocks. Substantive
issues have not even begun to be addressed. Patience please, this is going
to take a while.
Regards,
Daniel
On Thu, Sep 01, 2005 at 04:28:30PM +0100, Alan Cox wrote:
> > That's GFS. The submission is about a GFS2 that's on-disk incompatible
> > to GFS.
>
> Just like say reiserfs3 and reiserfs4 or ext and ext2 or ext2 and ext3
> then. I think the main point still stands - we have always taken
> multiple file systems on board and we have benefitted enormously from
> having the competition between them instead of a diktat from the kernel
> kremlin that 'foofs is the one true way'
I didn't say anything against a particular fs, just that your previous
arguments were utter nonsense. In fact I think having two or more cluster
filesystems in the tree is a good thing. Whether the gfs2 code is mergeable
is a completely different question, and it seems at least debatable to
submit a filesystem for inclusion that's still pretty new.
While we're at it, I can't find anything describing what gfs2 is about,
what is lacking in gfs, what structural changes did you make, etc.
p.s. why is gfs2 in fs/gfs in the kernel tree?
Alan Cox <[email protected]> wrote:
>
> On Iau, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:
> > - Why the kernel needs two clustered filesystems
>
> So delete reiserfs4, FAT, VFAT, ext2, and all the other "junk".
Well, we did delete intermezzo.
I was looking for technical reasons, please.
> > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > possibly gain (or vice versa)
> >
> > - Relative merits of the two offerings
>
> You missed the important one - people actively use it and have been for
> some years. Same reason we have NTFS, HPFS, and all the others. On
> that alone it makes sense to include.
Again, that's not a technical reason. It's _a_ reason, sure. But what are
the technical reasons for merging gfs[2], ocfs2, both or neither?
If one can be grown to encompass the capabilities of the other then we're
left with a bunch of legacy code and wasted effort.
I'm not saying it's wrong. But I'd like to hear the proponents explain why
it's right, please.
On Thu, Sep 01, 2005 at 06:56:03PM +0100, Christoph Hellwig wrote:
> Whether the gfs2 code is mergeable is a completely different question,
> and it seems at least debatable to submit a filesystem for inclusion
I actually asked what needs to be done for merging. We appreciate the
feedback and are carefully studying and working on all of it as usual.
We'd also appreciate help, of course, if that sounds interesting to
anyone.
Thanks
Dave
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> what is gfs2_assert() about anyway? please just use BUG_ON directly
> everywhere
When a machine has many gfs file systems mounted at once it can be useful
to know which one failed. Does the following look ok?
#define gfs2_assert(sdp, assertion) \
do { \
if (unlikely(!(assertion))) { \
printk(KERN_ERR \
"GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
"GFS2: fsid=%s: function = %s\n" \
"GFS2: fsid=%s: file = %s, line = %u\n" \
"GFS2: fsid=%s: time = %lu\n", \
sdp->sd_fsname, # assertion, \
sdp->sd_fsname, __FUNCTION__, \
sdp->sd_fsname, __FILE__, __LINE__, \
sdp->sd_fsname, get_seconds()); \
BUG(); \
} \
} while (0)
On Fri, 2 September 2005 17:44:03 +0800, David Teigland wrote:
> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
>
> > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
>
> > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > everywhere
>
> When a machine has many gfs file systems mounted at once it can be useful
> to know which one failed. Does the following look ok?
>
> #define gfs2_assert(sdp, assertion) \
> do { \
> if (unlikely(!(assertion))) { \
> printk(KERN_ERR \
> "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
> "GFS2: fsid=%s: function = %s\n" \
> "GFS2: fsid=%s: file = %s, line = %u\n" \
> "GFS2: fsid=%s: time = %lu\n", \
> sdp->sd_fsname, # assertion, \
> sdp->sd_fsname, __FUNCTION__, \
> sdp->sd_fsname, __FILE__, __LINE__, \
> sdp->sd_fsname, get_seconds()); \
> BUG(); \
> } \
> } while (0)
That's a lot of string constants. I'm not sure how smart current
versions of gcc are, but older ones created a new constant for each
invocation of such a macro, iirc. So you might want to move the code
out of line.
Jörn
--
There's nothing better for promoting creativity in a medium than
making an audience feel "Hmm ... I could do better than that!"
-- Douglas Adams in a slashdot interview
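A sketch of the out-of-line form Jörn suggests above: one printing/BUG helper for the whole
build plus a thin per-callsite macro. gfs2_assert_fail() and the struct gfs2_sbd parameter are
illustrative names, not something taken from the patch.
/* Out-of-line: the long format string exists once in the image. */
void gfs2_assert_fail(struct gfs2_sbd *sdp, const char *assertion,
		      const char *function, const char *file, unsigned int line)
{
	printk(KERN_ERR
	       "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n"
	       "GFS2: fsid=%s: function = %s, file = %s, line = %u\n",
	       sdp->sd_fsname, assertion,
	       sdp->sd_fsname, function, file, line);
	BUG();
}

/* Per-callsite: only the test itself stays inline. */
#define gfs2_assert(sdp, assertion)					\
do {									\
	if (unlikely(!(assertion)))					\
		gfs2_assert_fail(sdp, #assertion, __FUNCTION__,		\
				 __FILE__, __LINE__);			\
} while (0)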
Andrew Morton <[email protected]> writes:
>
> > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > possibly gain (or vice versa)
> > >
> > > - Relative merits of the two offerings
> >
> > You missed the important one - people actively use it and have been for
> > some years. Same reason we have NTFS, HPFS, and all the others. On
> > that alone it makes sense to include.
>
> Again, that's not a technical reason. It's _a_ reason, sure. But what are
> the technical reasons for merging gfs[2], ocfs2, both or neither?
There seems to be clearly a need for a shared-storage fs of some sort
for HA clusters and virtualized usage (multiple guests sharing a
partition). Shared storage can be more efficient than network file
systems like NFS because the storage access is often more efficient
than network access and it is more reliable because it doesn't have a
single point of failure in form of the NFS server.
It's also a logical extension of the "failover on failure" clusters
many people run now - instead of only failing over the shared fs at
failure and keeping one machine idle the load can be balanced between
multiple machines at any time.
One argument to merge both might be that nobody really knows yet which
shared-storage file system (GFS or OCFS2) is better. The only way to
find out would be to let the user base try out both, and that's most
practical when they're merged.
Personally I think ocfs2 has nicer and cleaner code than GFS.
It seems to be more or less a 64bit ext3 with cluster support, while
GFS seems to reinvent a lot more things and has somewhat uglier code.
On the other hand GFS' cluster support seems to be more aimed
at being a universal cluster service open for other usages too,
which might be a good thing. OCFS2s cluster seems to be more
aimed at only serving the file system.
But which one works better in practice is really an open question.
The only thing that should be probably resolved is a common API
for at least the clustered lock manager. Having multiple
incompatible user space APIs for that would be sad.
-Andi
I have to correct an error in perspective, or at least in the wording of
it, in the following, because it affects how people see the big picture in
trying to decide how the filesystem types in question fit into the world:
>Shared storage can be more efficient than network file
>systems like NFS because the storage access is often more efficient
>than network access
The shared storage access _is_ network access. In most cases, it's a
fibre channel/FCP network. Nowadays, it's more and more common for it to
be a TCP/IP network just like the one folks use for NFS (but carrying
iSCSI instead of NFS). It's also been done with a handful of other
TCP/IP-based block storage protocols.
The reason the storage access is expected to be more efficient than the
NFS access is because the block access network protocols are supposed to
be more efficient than the file access network protocols.
In reality, I'm not sure there really is such a difference in efficiency
between the protocols. The demonstrated differences in efficiency, or at
least in speed, are due to other things that are different between a given
new shared block implementation and a given old shared file
implementation.
But there's another advantage to shared block over shared file that hasn't
been mentioned yet: some people find it easier to manage a pool of blocks
than a pool of filesystems.
>it is more reliable because it doesn't have a
>single point of failure in form of the NFS server.
This advantage isn't because it's shared (block) storage, but because it's
a distributed filesystem. There are shared storage filesystems (e.g. IBM
SANFS, ADIC StorNext) that have a centralized metadata or locking server
that makes them unreliable (or unscalable) in the same ways as an NFS
server.
--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.
As far as userspace dlm apis go, dlmfs already abstracts away a large part
of the dlm interaction, so writing a module against another dlm looks like
it wouldn't be too bad (startup of a lockspace is probably the most
difficult part there).
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]
On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> Alan Cox <[email protected]> wrote:
> > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > possibly gain (or vice versa)
> > >
> > > - Relative merits of the two offerings
> >
> > You missed the important one - people actively use it and have been for
> > some years. Same reason we have NTFS, HPFS, and all the others. On
> > that alone it makes sense to include.
>
> Again, that's not a technical reason. It's _a_ reason, sure. But what are
> the technical reasons for merging gfs[2], ocfs2, both or neither?
>
> If one can be grown to encompass the capabilities of the other then we're
> left with a bunch of legacy code and wasted effort.
GFS is an established fs, it's not going away, you'd be hard pressed to
find a more widely used cluster fs on Linux. GFS is about 10 years old
and has been in use by customers in production environments for about 5
years. It is a mature, stable file system with many features that have
been technically refined over years of experience and customer/user
feedback. The latest development cycle (GFS2) has focussed on improving
performance, it's not a new file system -- the "2" indicates that it's not
ondisk compatible with earlier versions.
OCFS2 is a new file system. I expect they'll want to optimize for their
own unique goals. When OCFS appeared everyone I know accepted it would
coexist with GFS, each in their niche like every other fs. That's good,
OCFS and GFS help each other technically even though they may eventually
compete in some areas (which can also be good.)
Dave
Here's a random summary of technical features:
- cluster infrastructure: a lot of work, perhaps as much as gfs itself,
has gone into the infrastructure surrounding and supporting gfs
- cluster infrastructure allows for easy cooperation with CLVM
- interchangeable lock/cluster modules: gfs interacts with the external
  infrastructure, including lock manager, through an interchangeable
module allowing the fs to be adapted to different environments.
- a "nolock" module can be plugged in to use gfs as a local fs
(can be selected at mount time, so any fs can be mounted locally)
- quotas, acls, cluster flocks, direct io, data journaling,
ordered/writeback journaling modes -- all supported
- gfs transparently switches to a different locking scheme for direct io
allowing parallel non-allocating writes with no lock contention
- posix locks -- supported, although it's being reworked for better
performance right now
- asynchronous locking, lock prefetching + read-ahead
- coherent shared-writeable memory mappings across the cluster
- nfs3 support (multiple nfs servers exporting one gfs is very common)
- extend fs online, add journals online
- full fs quiesce to allow for block level snapshot below gfs
- read-only mount
- "specatator" mount (like ro but no journal allocated for the mount,
no fencing needed for failed node that was mounted as specatator)
- infrastructure in place for live ondisk inode migration, fs shrink
- stuffed dinodes, small files are stored in the disk inode block
- tunable (fuzzy) atime updates
- fast, nondisruptive stat on files during non-allocating direct-io
- fast, nondisruptive statfs (df) even during heavy fs usage
- friendly handling of io errors: shut down fs and withdraw from cluster
- largest GFS cluster deployed was around 200 nodes, most are much smaller
- use many GFS file systems at once on a node and in a cluster
- customers use GFS for: scientific apps, HA, NFS serving, database,
others I'm sure
- graphical management tools for gfs, clvm, and the cluster infrastructure
exist and are improving quickly
On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote:
> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
>
> > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
>
> > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > everywhere
>
> When a machine has many gfs file systems mounted at once it can be useful
> to know which one failed. Does the following look ok?
>
> #define gfs2_assert(sdp, assertion) \
> do { \
> if (unlikely(!(assertion))) { \
> printk(KERN_ERR \
> "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
> "GFS2: fsid=%s: function = %s\n" \
> "GFS2: fsid=%s: file = %s, line = %u\n" \
> "GFS2: fsid=%s: time = %lu\n", \
> sdp->sd_fsname, # assertion, \
> sdp->sd_fsname, __FUNCTION__, \
> sdp->sd_fsname, __FILE__, __LINE__, \
> sdp->sd_fsname, get_seconds()); \
> BUG(); \
You will already get the __FUNCTION__ (and hence the __FILE__ info)
directly from the BUG() dump, as well as the time from the syslog
message (turn on the printk timestamps if you want a more fine grain
timestamp), so the majority of this macro is redundant with the BUG()
macro...
thanks,
greg k-h
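In that spirit, a minimal sketch of what is left once everything BUG() already reports is
dropped: just the per-mount identifier that BUG() cannot know about. Illustration only, not
necessarily the form the patch should settle on.
/* Sketch: report only which mount failed, and let BUG() supply the
 * file/function/line information and the register dump itself. */
#define gfs2_assert(sdp, assertion)					\
do {									\
	if (unlikely(!(assertion))) {					\
		printk(KERN_ERR						\
		       "GFS2: fsid=%s: assertion \"%s\" failed\n",	\
		       (sdp)->sd_fsname, #assertion);			\
		BUG();							\
	}								\
} while (0)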
On Friday 02 September 2005 17:17, Andi Kleen wrote:
> The only thing that should be probably resolved is a common API
> for at least the clustered lock manager. Having multiple
> incompatible user space APIs for that would be sad.
The only current users of dlms are cluster filesystems. There are zero users
of the userspace dlm api. Therefore, the (g)dlm userspace interface actually
has nothing to do with the needs of gfs. It should be taken out of the gfs
patch and merged later, when or if user space applications emerge that need
it. Maybe in the meantime it will be possible to come up with a userspace
dlm api that isn't completely repulsive.
Also, note that the only reason the two current dlms are in-kernel is because
it supposedly cuts down on userspace-kernel communication with the cluster
filesystems. Then why should a userspace application bother with an
awkward interface to an in-kernel dlm? This is obviously suboptimal. Why
not have a userspace dlm for userspace apps, if indeed there are any
userspace apps that would need to use dlm-style synchronization instead of
more typical socket-based synchronization, or Posix locking, which is already
exposed via a standard api?
There is actually nothing wrong with having multiple, completely different
dlms active at the same time. There is no urgent need to merge them into the
one true dlm. It would be a lot better to let them evolve separately and
pick the winner a year or two from now. Just think of the dlm as part of the
cfs until then.
What does have to be resolved is a common API for node management. It is not
just cluster filesystems and their lock managers that have to interface to
node management. Below the filesystem layer, cluster block devices and
cluster volume management need to be coordinated by the same system, and
above the filesystem layer, applications also need to be hooked into it.
This work is, in a word, incomplete.
Regards,
Daniel
On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote:
> On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> > Alan Cox <[email protected]> wrote:
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > > possibly gain (or vice versa)
> > > >
> > > > - Relative merits of the two offerings
> > >
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason we have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> >
> > Again, that's not a technical reason. It's _a_ reason, sure. But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
> >
> > If one can be grown to encompass the capabilities of the other then we're
> > left with a bunch of legacy code and wasted effort.
>
> GFS is an established fs, it's not going away, you'd be hard pressed to
> find a more widely used cluster fs on Linux. GFS is about 10 years old
> and has been in use by customers in production environments for about 5
> years.
but you submitted GFS2 not GFS.
On Saturday 03 September 2005 02:14, Arjan van de Ven wrote:
> On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> > > Alan Cox <[email protected]> wrote:
> > > > > - Why GFS is better than OCFS2, or has functionality which
> > > > > OCFS2 cannot possibly gain (or vice versa)
> > > > >
> > > > > - Relative merits of the two offerings
> > > >
> > > > You missed the important one - people actively use it and
> > > > have been for some years. Same reason we have NTFS, HPFS,
> > > > and all the others. On that alone it makes sense to include.
> > >
> > > Again, that's not a technical reason. It's _a_ reason, sure.
> > > But what are the technical reasons for merging gfs[2], ocfs2,
> > > both or neither?
> > >
> > > If one can be grown to encompass the capabilities of the other
> > > then we're left with a bunch of legacy code and wasted effort.
> >
> > GFS is an established fs, it's not going away, you'd be hard
> > pressed to find a more widely used cluster fs on Linux. GFS is
> > about 10 years old and has been in use by customers in production
> > environments for about 5 years.
>
> but you submitted GFS2 not GFS.
I'd rather not step into the middle of this mess, but you clipped out
a good portion that explains why he talks about GFS when he submitted
GFS2. Let me quote the post you've pulled that partial paragraph
from: "The latest development cycle (GFS2) has focused on improving
performance, it's not a new file system -- the "2" indicates that it's
not ondisk compatible with earlier versions."
In other words he didn't submit the original, but the new version of
it that is not compatible with the original GFS on-disk format.
While it is clear that GFS2 cannot claim the large installed user
base or the proven capacity of the original (it is, after all, a new
version that has incompatibilities) it can claim that as its
heritage and what it's aiming towards, the same as ext3 can (and
does) claim the power and reliability of ext2.
In this case I've been following this thread just for the hell of it
and I've noticed that there are some people who seem to not want to
even think of having GFS2 included in a mainline kernel for personal
and not technical reasons. That does not describe most of the people
on this list, many of whom have helped debug the code (among other
things), but it does describe a few.
I'll go back to being quiet now...
DRH
On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> As far as userspace dlm apis go, dlmfs already abstracts away a large part
> of the dlm interaction...
Dumb question, why can't you use sysfs for this instead of rolling your own?
Side note: you seem to have deleted all the 2.6.12-rc4 patches. Perhaps you
forgot that there are dozens of lkml archives pointing at them?
Regards,
Daniel
On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > As far as userspace dlm apis go, dlmfs already abstracts away a large part
> > of the dlm interaction...
>
> Dumb question, why can't you use sysfs for this instead of rolling your own?
because it's totally different. have a look at what it does.
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
>
> >
> > Again, that's not a technical reason. It's _a_ reason, sure. But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
cluster filesystems are very common; there are companies that had/have a
whole business around it, veritas, polyserve, ex-sistina, thus now
redhat, ibm, tons of companies out there sell this, big bucks. as
someone said, it's different than nfs because for certain things there
is less overhead but there are many other reasons, it makes it a lot
easier to create a clustered nfs server so you create a cfs on a set of
disks with a number of nodes and export that fs from all those, you can
easily do loadbalancing for applications, you have a lot of
infrastructure where people have invested in that allows for shared
storage...
for ocfs we have tons of production customers running many terabyte
databases on a cfs. why? because dealing with the raw disk from a number
of nodes sucks. because nfs is pretty broken for a lot of stuff, there
is no consistency across nodes when each machine nfs mounts a server
partition. yes nfs can be used for things but cfs's are very useful for
many things nfs just can't do. want a list ?
companies building failover for services like to use things like this,
it creates a non single point of failure kind of setup much more easily.
andso on and so on, yes there are alternatives out there but fact is
that a lot of folks like to use it, have been using it for ages, and
want to be using it.
from an implementation point of view, as folks here have already said,
we've tried our best to implement things as a real linux filesystem, no
abstractions to have something generic, it's clean and as tight as can
be for a lot of stuff. and compared to other cfs's it's pretty darned
nice, however I think it's silly to have competition between ocfs2 and
gfs2. they are different just like the ton of local filesystems are
different and people like to use one or the other. david said gfs
is popular and has been around, well, I can list you tons of folks that
have been using our stuff 24/7 for years (for free) just as well. it's
different. that's that.
it'd be really nice if mainline kernel had it/them included. it would be
a good start to get more folks involved and instead of years of talk on
maillists that end up in nothing actually end up with folks
participating and contributing.
In article <[email protected]> you wrote:
> for ocfs we have tons of production customers running many terabyte
> databases on a cfs. why ? because dealing with the raw disk froma number
> of nodes sucks. because nfs is pretty broken for a lot of stuff, there
> is no consistency across nodes when each machine nfs mounts a server
> partition. yes nfs can be used for things but cfs's are very useful for
> many things nfs just can't do. want a list ?
Oh, that's interesting, I never thought about putting data files (tablespaces)
in a clustered file system. Does that mean you can run supported RAC on
shared ocfs2 files and anybody is using that? Do you see this go away with
ASM?
Greetings
Bernd
On Sat, Sep 03, 2005 at 08:14:00AM +0200, Arjan van de Ven wrote:
> On Sat, 2005-09-03 at 13:18 +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:21:04PM -0700, Andrew Morton wrote:
> > > Alan Cox <[email protected]> wrote:
> > > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > > > possibly gain (or vice versa)
> > > > >
> > > > > - Relative merits of the two offerings
> > > >
> > > > You missed the important one - people actively use it and have been for
> > > > > some years. Same reason we have NTFS, HPFS, and all the others. On
> > > > that alone it makes sense to include.
> > >
> > > Again, that's not a technical reason. It's _a_ reason, sure. But what are
> > > the technical reasons for merging gfs[2], ocfs2, both or neither?
> > >
> > > If one can be grown to encompass the capabilities of the other then we're
> > > left with a bunch of legacy code and wasted effort.
> >
> > GFS is an established fs, it's not going away, you'd be hard pressed to
> > find a more widely used cluster fs on Linux. GFS is about 10 years old
> > and has been in use by customers in production environments for about 5
> > years.
>
> but you submitted GFS2 not GFS.
Just a new version, not a big difference. The ondisk format changed a
little, making it incompatible with the previous versions. We'd been
holding out on the format change for a long time and thought now would be
a sensible time to finally do it.
This is also about timing things conveniently. Each GFS version coincides
with a development cycle and we decided to wait for this version/cycle to
move code upstream. So, we have new version, format change, and code
upstream all together, but it's still the same GFS to us.
As with _any_ new version (involving ondisk formats or not) we need to
thoroughly test everything to fix the inevitable bugs and regressions
that are introduced; there's nothing new or surprising about that.
About the name -- we need to support customers running both versions for a
long time. The "2" was added to make that process a little easier and
clearer for people, that's all. If the 2 is really distressing we could
rip it off, but there seem to be as many file systems ending in digits
as not these days...
Dave
On Saturday 03 September 2005 06:35, David Teigland wrote:
> Just a new version, not a big difference. The ondisk format changed a
> little making it incompatible with the previous versions. We'd been
> holding out on the format change for a long time and thought now would be
> a sensible time to finally do it.
What exactly was the format change, and for what purpose?
On Saturday 03 September 2005 02:46, Wim Coekaerts wrote:
> On Sat, Sep 03, 2005 at 02:42:36AM -0400, Daniel Phillips wrote:
> > On Friday 02 September 2005 20:16, Mark Fasheh wrote:
> > > As far as userspace dlm apis go, dlmfs already abstracts away a large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling your
> > own?
>
> because it's totally different. have a look at what it does.
You create a dlm domain when a directory is created. You create a lock
resource when a file of that name is opened. You lock the resource when the
file is opened. You access the lvb by read/writing the file. Why doesn't
that fit the configfs-nee-sysfs model? If it does, the payoff will be about
500 lines saved.
This little dlm fs is very slick, but grossly inefficient. Maybe efficiency
doesn't matter here since it is just your slow-path userspace tools taking
these locks. Please do not even think of proposing this as a way to export a
kernel-based dlm for general purpose use!
Your userdlm.c file has some hidden gold in it. You have factored the dlm
calls far more attractively than the bad old bazillion-parameter Vaxcluster
legacy. You are almost in system call zone there. (But note my earlier
comment on dlms in general: until there are dlm-based applications, merging a
general-purpose dlm API is pointless and has nothing to do with getting your
filesystem merged.)
Regards,
Daniel
On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote:
> that fit the configfs-nee-sysfs model? If it does, the payoff will be about
> 500 lines saved.
I'm still awaiting your merge of ext3 and reiserfs, because you
can save probably 500 lines having a filesystem that can create reiser
and ext3 files at the same time.
Joel
--
Life's Little Instruction Book #267
"Lie on your back and look at the stars."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
Joel Becker <[email protected]> wrote:
>
> On Sat, Sep 03, 2005 at 06:21:26PM -0400, Daniel Phillips wrote:
> > that fit the configfs-nee-sysfs model? If it does, the payoff will be about
> > 500 lines saved.
>
> I'm still awaiting your merge of ext3 and reiserfs, because you
> can save probably 500 lines having a filesystem that can create reiser
> and ext3 files at the same time.
oy. Daniel is asking a legitimate question.
If there's duplicated code in there then we should seek to either make the
code multi-purpose or place the common or reusable parts into a library
somewhere.
If neither approach is applicable or practical for *every single function*
then fine, please explain why. AFAIR that has not been done.
On Sat, Sep 03, 2005 at 06:32:41PM -0700, Andrew Morton wrote:
> If there's duplicated code in there then we should seek to either make the
> code multi-purpose or place the common or reusable parts into a library
> somewhere.
Regarding sysfs and configfs, that's a whole 'nother
conversation. I've not yet come up with a function involved that is
identical, but that's a response here for another email.
Understanding that Daniel is talking about dlmfs, dlmfs is far
more similar to devptsfs, tmpfs, and even sockfs and pipefs than it is
to sysfs. I don't see him proposing that sockfs and devptsfs be folded
into sysfs.
dlmfs is *tiny*. The VFS interface is less than his claimed 500
lines of savings. The few VFS callbacks do nothing but call DLM
functions. You'd have to replace this VFS glue with sysfs glue, and
probably save very few lines of code.
In addition, sysfs cannot support the dlmfs model. In dlmfs,
mkdir(2) creates a directory representing a DLM domain and mknod(2)
creates the user representation of a lock. sysfs doesn't support
mkdir(2) or mknod(2) at all.
More than mkdir() and mknod(), however, dlmfs uses open(2) to
acquire locks from userspace. O_RDONLY acquires a shared read lock (PR
in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a
trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock
is released via close(2). If a process dies, close(2) happens. In
other words, ->release() handles all the cleanup for normal and abnormal
termination.
sysfs does not allow hooking into ->open() or ->release(). So
this model, and the inherent lifetiming that comes with it, cannot be
used. If dlmfs was changed to use a less intuitive model that fits
sysfs, all the handling of lifetimes and cleanup would have to be added.
This would make it more complex, not less complex. It would give it a
larger code size, not a smaller one. In the end, it would be harder to
maintain, less intuitive to use, and larger.
Joel
--
"Anything that is too stupid to be spoken is sung."
- Voltaire
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Saturday 03 September 2005 23:06, Joel Becker wrote:
> dlmfs is *tiny*. The VFS interface is less than his claimed 500
> lines of savings.
It is 640 lines.
> The few VFS callbacks do nothing but call DLM
> functions. You'd have to replace this VFS glue with sysfs glue, and
> probably save very few lines of code.
> In addition, sysfs cannot support the dlmfs model. In dlmfs,
> mkdir(2) creates a directory representing a DLM domain and mknod(2)
> creates the user representation of a lock. sysfs doesn't support
> mkdir(2) or mknod(2) at all.
I said "configfs" in the email to which you are replying.
> More than mkdir() and mknod(), however, dlmfs uses open(2) to
> acquire locks from userspace. O_RDONLY acquires a shared read lock (PR
> in VMS parlance). O_RDWR gets an exclusive lock (X). O_NONBLOCK is a
> trylock. Here, dlmfs is using the VFS for complete lifetiming. A lock
> is released via close(2). If a process dies, close(2) happens. In
> other words, ->release() handles all the cleanup for normal and abnormal
> termination.
>
> sysfs does not allow hooking into ->open() or ->release(). So
> this model, and the inherent lifetiming that comes with it, cannot be
> used.
Configfs has a per-item release method. Configfs has a group open method.
What is it that configfs can't do, or can't be made to do trivially?
> If dlmfs was changed to use a less intuitive model that fits
> sysfs, all the handling of lifetimes and cleanup would have to be added.
The model you came up with for dlmfs is beyond cute, it's downright clever.
Why mar that achievement by then failing to capitalize on the framework you
already have in configfs?
By the way, do you agree that dlmfs is too inefficient to be an effective way
of exporting your dlm api to user space, except for slow-path applications
like you have here?
Regards,
Daniel
On Sun, Sep 04, 2005 at 12:22:36AM -0400, Daniel Phillips wrote:
> It is 640 lines.
It's 450 without comments and blank lines. Please, don't tell
me that comments to help understanding are bloat.
> I said "configfs" in the email to which you are replying.
To wit:
> Daniel Phillips said:
> > Mark Fasheh said:
> > > as far as userspace dlm apis go, dlmfs already abstracts away a
> > > large
> > > part of the dlm interaction...
> >
> > Dumb question, why can't you use sysfs for this instead of rolling
> > your
> > own?
You asked why dlmfs can't go into sysfs, and I responded.
Joel
--
"I don't want to achieve immortality through my work; I want to
achieve immortality through not dying."
- Woody Allen
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Sunday 04 September 2005 00:30, Joel Becker wrote:
> You asked why dlmfs can't go into sysfs, and I responded.
And you got me! In the heat of the moment I overlooked the fact that you and
Greg haven't agreed to the merge yet ;-)
Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the
same paradigm: drive the kernel logic from user-initiated vfs methods. You
already have nearly all the right methods in nearly all the right places.
Regards,
Daniel
Daniel Phillips <[email protected]> wrote:
>
> The model you came up with for dlmfs is beyond cute, it's downright clever.
Actually I think it's rather sick. Taking O_NONBLOCK and making it a
lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
acquire a clustered filesystem lock". Not even close.
It would be much better to do something which explicitly and directly
expresses what you're trying to do rather than this strange "lets do this
because the names sound the same" thing.
What happens when we want to add some new primitive which has no posix-file
analog?
Waaaay too cute. Oh well, whatever.
On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
So, you'd like a new flag name? That can be done.
> What happens when we want to add some new primitive which has no posix-file
> analog?
The point of dlmfs is not to express every primitive that the
DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS
locking scheme. Nor should it. The point isn't to use a filesystem
interface for programs that need all the flexibility and power of the
VMS DLM. The point is a simple system that programs needing the basic
operations can use. Even shell scripts.
Joel
--
"You must remember this:
A kiss is just a kiss,
A sigh is just a sigh.
The fundamental rules apply
As time goes by."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> Clearly, I ought to have asked why dlmfs can't be done by configfs. It is the
> same paradigm: drive the kernel logic from user-initiated vfs methods. You
> already have nearly all the right methods in nearly all the right places.
configfs, like sysfs, does not support ->open() or ->release()
callbacks. And it shouldn't. The point is to hide the complexity and
make it easier to plug into.
A client object should not ever have to know or care that it is
being controlled by a filesystem. It only knows that it has a tree of
items with attributes that can be set or shown.
Joel
--
"In a crisis, don't hide behind anything or anybody. They're going
to find you anyway."
- Paul "Bear" Bryant
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
Joel Becker <[email protected]> wrote:
>
> > What happens when we want to add some new primitive which has no posix-file
> > analog?
>
> The point of dlmfs is not to express every primitive that the
> DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS
> locking scheme. Nor should it. The point isn't to use a filesystem
> interface for programs that need all the flexibility and power of the
> VMS DLM. The point is a simple system that programs needing the basic
> operations can use. Even shell scripts.
Are you saying that the posix-file lookalike interface provides access to
part of the functionality, but there are other APIs which are used to
access the rest of the functionality? If so, what is that interface, and
why cannot that interface offer access to 100% of the functionality, thus
making the posix-file tricks unnecessary?
On Sunday 04 September 2005 01:00, Joel Becker wrote:
> On Sun, Sep 04, 2005 at 12:51:10AM -0400, Daniel Phillips wrote:
> > Clearly, I ought to have asked why dlmfs can't be done by configfs. It
> > is the same paradigm: drive the kernel logic from user-initiated vfs
> > methods. You already have nearly all the right methods in nearly all the
> > right places.
>
> configfs, like sysfs, does not support ->open() or ->release()
> callbacks.
struct configfs_item_operations {
	void (*release)(struct config_item *);
	ssize_t (*show)(struct config_item *, struct attribute *, char *);
	ssize_t (*store)(struct config_item *, struct attribute *, const char *, size_t);
	int (*allow_link)(struct config_item *src, struct config_item *target);
	int (*drop_link)(struct config_item *src, struct config_item *target);
};

struct configfs_group_operations {
	struct config_item *(*make_item)(struct config_group *group, const char *name);
	struct config_group *(*make_group)(struct config_group *group, const char *name);
	int (*commit_item)(struct config_item *item);
	void (*drop_item)(struct config_group *group, struct config_item *item);
};
You do have ->release and ->make_item/group.
If I may hand you a more substantive argument: you don't support user-driven
creation of files in configfs, only directories. Dlmfs supports user-created
files. But you know, there isn't actually a good reason not to support
user-created files in configfs, as dlmfs demonstrates.
Anyway, goodnight.
Regards,
Daniel
On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality? If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?
Currently, this is all the interface that the OCFS2 DLM
provides. But yes, if you wanted to provide the rest of the VMS
functionality (something that GFS2's DLM does), you'd need to use a more
concrete interface.
IMHO, it's worthwhile to have a simple interface, one already
used by mkfs.ocfs2, mount.ocfs2, fsck.ocfs2, etc. This is an interface
that can and is used by shell scripts even (we do this to test the DLM).
If you make it a C-library-only interface, you've just restricted the
subset of folks that can use it, while adding programming complexity.
I think that a simple fs-based interface can coexist with a more
complex one. FILE* doesn't give you the flexibility of read()/write(),
but I wouldn't remove it :-)
Joel
--
"In the beginning, the universe was created. This has made a lot
of people very angry, and is generally considered to have been a
bad move."
- Douglas Adams
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Sun, Sep 04, 2005 at 01:52:29AM -0400, Daniel Phillips wrote:
> You do have ->release and ->make_item/group.
->release is like kobject release. It's a free callback, not a
callback from close.
> If I may hand you a more substantive argument: you don't support user-driven
> creation of files in configfs, only directories. Dlmfs supports user-created
> files. But you know, there isn't actually a good reason not to support
> user-created files in configfs, as dlmfs demonstrates.
It is outside the domain of configfs. Just because it can be
done does not mean it should be. configfs isn't a "thing to create
files". It's an interface to creating kernel items. The actual
filesystem representation isn't the end, it's just the means.
Joel
--
"In the room the women come and go
Talking of Michelangelo."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.
What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
is a bit unfortunate, but really it just needs a bit to express that -
nobody over here cares what it's called.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]
On Sunday 04 September 2005 00:46, Andrew Morton wrote:
> Daniel Phillips <[email protected]> wrote:
> > The model you came up with for dlmfs is beyond cute, it's downright
> > clever.
>
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.
Now, I see the ocfs2 guys are all ready to back down on this one, but I will
at least argue weakly in favor.
Sick is a nice word for it, but it is actually not that far off. Normally,
this fs will acquire a lock whenever the user creates a virtual file and the
create will block until the global lock arrives. With O_NONBLOCK, it will
return, erm... ETXTBSY (!) immediately. Is that not what O_NONBLOCK is
supposed to accomplish?
> It would be much better to do something which explicitly and directly
> expresses what you're trying to do rather than this strange "lets do this
> because the names sound the same" thing.
>
> What happens when we want to add some new primitive which has no posix-file
> analog?
>
> Waaaay too cute. Oh well, whatever.
The explicit way is syscalls or a set of ioctls, which he already has the
makings of. If there is going to be a userspace api, I would hope it looks
more like the contents of userdlm.c than the traditional Vaxcluster API,
which sucks beyond belief.
Another explicit way is to do it with a whole set of virtual attributes
instead of just a single file trying to capture the whole model. That is
really unappealing, but I am afraid that is exactly what a whole lot of
sysfs/configfs usage is going to end up looking like.
But more to the point: we have no urgent need for a userspace dlm api at the
moment. Nothing will break if we just put that issue off for a few months,
quite the contrary.
If the only user is their tools I would say let it go ahead and be cute, even
sickeningly so. It is not supposed to be a general dlm api, at least that is
my understanding. It is just supposed to be an interface for their tools.
Of course it would help to know exactly how those tools use it. Too sleepy
to find out tonight...
Regards,
Daniel
Mark Fasheh <[email protected]> wrote:
>
> On Sat, Sep 03, 2005 at 09:46:53PM -0700, Andrew Morton wrote:
> > Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> > lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> > me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> > acquire a clustered filesystem lock". Not even close.
>
> What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> is a bit unfortunate, but really it just needs a bit to express that -
> nobody over here cares what it's called.
The whole idea of reinterpreting file operations to mean something utterly
different just seems inappropriate to me.
You get a lot of goodies when using a filesystem - the ability for
unrelated processes to look things up, resource release on exit(), etc. If
those features are valuable in the ocfs2 context then fine. But I'd have
thought that it would be saner and more extensible to add new syscalls
(perhaps taking fd's) rather than overloading the open() mode in this
manner.
Daniel Phillips <[email protected]> wrote:
>
> If the only user is their tools I would say let it go ahead and be cute, even
> sickeningly so. It is not supposed to be a general dlm api, at least that is
> my understanding. It is just supposed to be an interface for their tools.
> Of course it would help to know exactly how those tools use it.
Well I'm not saying "don't do this". I'm saying "eww" and "why?".
If there is already a richer interface into all this code (such as a
syscall one) and it's feasible to migrate the open() tricksies to that API
in the future if it all comes unstuck then OK. That's why I asked (thus
far unsuccessfully):
Are you saying that the posix-file lookalike interface provides
access to part of the functionality, but there are other APIs which are
used to access the rest of the functionality? If so, what is that
interface, and why cannot that interface offer access to 100% of the
functionality, thus making the posix-file tricks unnecessary?
On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK.
> That's why I asked (thus far unsuccessfully):
I personally was under the impression that "syscalls are not
to be added". I'm also wary of the effort required to hook into process
exit. Not to mention all the lifetiming that has to be written again.
On top of that, we lose our cute ability to shell script it. We
find this very useful in testing, and think others would in practice.
> Are you saying that the posix-file lookalike interface provides
> access to part of the functionality, but there are other APIs which are
> used to access the rest of the functionality? If so, what is that
> interface, and why cannot that interface offer access to 100% of the
> functionality, thus making the posix-file tricks unnecessary?
I thought I stated this in my other email. We're not intending
to extend dlmfs. It pretty much covers the simple DLM usage required of
a simple interface. The OCFS2 DLM does not provide any other
functionality.
If the OCFS2 DLM grew more functionality, or you consider the
GFS2 DLM that already has it (and a less intuitive interface via sysfs
IIRC), I would contend that dlmfs still has a place. It's simple to use
and understand, and it's usable from shell scripts and other simple
code.
Joel
--
"The first thing we do, let's kill all the lawyers."
-Henry VI, IV:ii
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
Joel Becker <[email protected]> wrote:
>
> On Sun, Sep 04, 2005 at 12:28:28AM -0700, Andrew Morton wrote:
> > If there is already a richer interface into all this code (such as a
> > syscall one) and it's feasible to migrate the open() tricksies to that API
> > in the future if it all comes unstuck then OK.
> > That's why I asked (thus far unsuccessfully):
>
> I personally was under the impression that "syscalls are not
> to be added".
We add syscalls all the time. Whichever user<->kernel API is considered to
be most appropriate, use it.
> I'm also wary of the effort required to hook into process
> exit.
I'm not questioning the use of a filesystem. I'm questioning this
overloading of normal filesystem system calls. For example (and this is
just an example! there's also mknod, mkdir, O_RDWR, O_EXCL...) it would be
more usual to do
fd = open("/sys/whatever", ...);
err = sys_dlm_trylock(fd);
I guess your current implementation prevents /sys/whatever from ever
appearing if the trylock failed. Dunno if that's valuable.
> Not to mention all the lifetiming that has to be written again.
> On top of that, we lose our cute ability to shell script it. We
> find this very useful in testing, and think others would in practice.
>
> > Are you saying that the posix-file lookalike interface provides
> > access to part of the functionality, but there are other APIs which are
> > used to access the rest of the functionality? If so, what is that
> > interface, and why cannot that interface offer access to 100% of the
> > functionality, thus making the posix-file tricks unnecessary?
>
> I thought I stated this in my other email. We're not intending
> to extend dlmfs.
Famous last words ;)
> It pretty much covers the simple DLM usage required of
> a simple interface. The OCFS2 DLM does not provide any other
> functionality.
> If the OCFS2 DLM grew more functionality, or you consider the
> GFS2 DLM that already has it (and a less intuitive interface via sysfs
> IIRC), I would contend that dlmfs still has a place. It's simple to use
> and understand, and it's usable from shell scripts and other simple
> code.
(wonders how to do O_NONBLOCK from a script)
I don't buy the general "fs is nice because we can script it" argument,
really. You can just write a few simple applications which provide access
to the syscalls (or the fs!) and then write scripts around those.
Yes, you suddenly need to get a little tarball into users' hands and that's
a hassle. And I sometimes think we let this hassle guide kernel interfaces
(mutters something about /sbin/hotplug), and that's sad.
On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> > is a bit unfortunate, but really it just needs a bit to express that -
> > nobody over here cares what it's called.
>
> The whole idea of reinterpreting file operations to mean something utterly
> different just seems inappropriate to me.
Putting aside trylock for a minute, I'm not sure how utterly different the
operations are. You create a lock resource by creating a file named after
it. You get a lock (fd) at read or write level on the resource by calling
open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
Now that we've got an fd, lock value blocks are naturally represented as
file data which can be read(2) or written(2).
Close(2) drops the lock.
A really trivial usage example from shell:
node1$ echo "hello world" > mylock
node2$ cat mylock
hello world
I could always give a more useful one after I get some sleep :)
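A minimal C sketch of that same pattern, purely for illustration -- the
mount point, domain and lock names below are made up and not part of
any published dlmfs interface:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char lvb[64];
	ssize_t n;

	/* open(2) with O_RDWR takes the lock at write (exclusive) level */
	int fd = open("/dlm/mydomain/mylock", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* the lock value block is simply the file contents */
	n = read(fd, lvb, sizeof(lvb) - 1);
	if (n > 0) {
		lvb[n] = '\0';
		printf("previous LVB: %s\n", lvb);
	}

	/* update the LVB for the next holder */
	lseek(fd, 0, SEEK_SET);
	write(fd, "hello world\n", 12);

	/* close(2) drops the lock; so does process exit */
	close(fd);
	return 0;
}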
> You get a lot of goodies when using a filesystem - the ability for
> unrelated processes to look things up, resource release on exit(), etc. If
> those features are valuable in the ocfs2 context then fine.
Right, they certainly are and I think Joel, in another e-mail on this
thread, explained well the advantages of using a filesystem.
> But I'd have thought that it would be saner and more extensible to add new
> syscalls (perhaps taking fd's) rather than overloading the open() mode in
> this manner.
The idea behind dlmfs was to very simply export a small set of cluster dlm
operations to userspace. Given that goal, I felt that a whole set of system
calls would have been overkill. That said, I think perhaps I should clarify
that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
and (imho) intuitive one which could be trivially accessed from any software
which just knows how to read and write files.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]
Mark Fasheh <[email protected]> wrote:
>
> On Sun, Sep 04, 2005 at 12:23:43AM -0700, Andrew Morton wrote:
> > > What would be an acceptable replacement? I admit that O_NONBLOCK -> trylock
> > > is a bit unfortunate, but really it just needs a bit to express that -
> > > nobody over here cares what it's called.
> >
> > The whole idea of reinterpreting file operations to mean something utterly
> > different just seems inappropriate to me.
> Putting aside trylock for a minute, I'm not sure how utterly different the
> operations are. You create a lock resource by creating a file named after
> it. You get a lock (fd) at read or write level on the resource by calling
> open(2) with the appropriate mode (O_RDONLY, O_WRONLY/O_RDWR).
> Now that we've got an fd, lock value blocks are naturally represented as
> file data which can be read(2) or written(2).
> Close(2) drops the lock.
>
> A really trivial usage example from shell:
>
> node1$ echo "hello world" > mylock
> node2$ cat mylock
> hello world
>
> I could always give a more useful one after I get some sleep :)
It isn't extensible though. One couldn't retain this approach while adding
(random cfs ignorance exposure) upgrade-read, downgrade-write,
query-for-various-runtime-stats, priority modification, whatever.
> > You get a lot of goodies when using a filesystem - the ability for
> > unrelated processes to look things up, resource release on exit(), etc. If
> > those features are valuable in the ocfs2 context then fine.
> Right, they certainly are and I think Joel, in another e-mail on this
> thread, explained well the advantages of using a filesystem.
>
> > But I'd have thought that it would be saner and more extensible to add new
> > syscalls (perhaps taking fd's) rather than overloading the open() mode in
> > this manner.
> The idea behind dlmfs was to very simply export a small set of cluster dlm
> operations to userspace. Given that goal, I felt that a whole set of system
> calls would have been overkill. That said, I think perhaps I should clarify
> that I don't intend dlmfs to become _the_ userspace dlm api, just a simple
> and (imho) intuitive one which could be trivially accessed from any software
> which just knows how to read and write files.
Well, as I say. Making it a filesystem is superficially attractive, but
once you've built a super-dooper enterprise-grade infrastructure on top of
it all, nobody's going to touch the fs interface by hand and you end up
wondering why it's there, adding baggage.
Not that I'm questioning the fs interface! It has useful permission
management, monitoring and resource releasing characteristics. I'm
questioning the open() tricks. I guess from Joel's tiny description, the
filesystem's interpretation of mknod and mkdir looks sensible enough.
On Sun, Sep 04, 2005 at 01:18:05AM -0700, Andrew Morton wrote:
> > I thought I stated this in my other email. We're not intending
> > to extend dlmfs.
>
> Famous last words ;)
Heh, of course :-)
> I don't buy the general "fs is nice because we can script it" argument,
> really. You can just write a few simple applications which provide access
> to the syscalls (or the fs!) and then write scripts around those.
I can't see how that works easily. I'm not worried about a
tarball (eventually Red Hat and SuSE and Debian would have it). I'm
thinking about this shell:
exec 7</dlm/domainxxxx/lock1
do stuff
exec 7</dev/null
If someone kills the shell while stuff is doing, the lock is unlocked
because fd 7 is closed. However, if you have an application to do the
locking:
takelock domainxxx lock1
do stuff
droplock domainxxx lock1
When someone kills the shell, the lock is leaked, because droplock isn't
called. And SEGV/QUIT/-9 (especially -9, folks love it too much) are
handled by the first example but not by the second.
Joel
--
"Same dancers in the same old shoes.
You get too careful with the steps you choose.
You don't care about winning but you don't want to lose
After the thrill is gone."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
Joel Becker <[email protected]> wrote:
>
> I can't see how that works easily. I'm not worried about a
> tarball (eventually Red Hat and SuSE and Debian would have it). I'm
> thinking about this shell:
>
> exec 7</dlm/domainxxxx/lock1
> do stuff
> exec 7</dev/null
>
> If someone kills the shell while stuff is doing, the lock is unlocked
> because fd 7 is closed. However, if you have an application to do the
> locking:
>
> takelock domainxxx lock1
> do stuff
> droplock domainxxx lock1
>
> When someone kills the shell, the lock is leaked, because droplock isn't
> called. And SEGV/QUIT/-9 (especially -9, folks love it too much) are
> handled by the first example but not by the second.
take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"
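Such a wrapper is only a handful of lines of C if a dlmfs-style mount is
available underneath; this sketch is hypothetical (the /dlm path and the
tool itself are invented) and it still leans on the fd-drops-lock
behaviour of the filesystem:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[256];
	int fd, status;

	if (argc != 4) {
		fprintf(stderr, "usage: %s domain lock command\n", argv[0]);
		return 1;
	}

	/* taking the lock is just creating/opening the lock file */
	snprintf(path, sizeof(path), "/dlm/%s/%s", argv[1], argv[2]);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* run the command while the lock is held */
	status = system(argv[3]);

	/*
	 * Explicit drop for clarity; if this wrapper is killed, even
	 * with -9, the kernel closes the fd and the lock is released.
	 */
	close(fd);
	return status == -1 ? 1 : WEXITSTATUS(status);
}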
On Sun, Sep 04, 2005 at 02:18:36AM -0700, Andrew Morton wrote:
> take-and-drop-lock -d domainxxx -l lock1 -e "do stuff"
Ahh, but then you have to have lots of scripts somewhere in
the path, or do massive inline scripts, especially if you want to take
another lock in there somewhere.
It's doable, but it's nowhere near as easy. :-)
Joel
--
"I always thought the hardest questions were those I could not answer.
Now I know they are the ones I can never ask."
- Charlie Watkins
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
> takelock domainxxx lock1
> do stuff
> droplock domainxxx lock1
>
> When someone kills the shell, the lock is leaked, because droplock isn't
> called.
Why not open the lock resource (or the lock space) instead of
individual locks as file? It then looks like this:
open lock space file
takelock lockresource lock1
do stuff
droplock lockresource lock1
close lock space file
Then if you are killed, the ->release of the lock space file should take
care of cleaning up all the locks.
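A sketch of what that could look like from userspace; the device node
and the command strings here are invented purely to illustrate the shape
of such an interface, not to describe any existing one:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* one textual command per write(); the kernel side would parse it */
static int lockspace_cmd(int fd, const char *cmd)
{
	return write(fd, cmd, strlen(cmd)) < 0 ? -1 : 0;
}

int main(void)
{
	/* hypothetical per-lockspace node */
	int fd = open("/dev/dlm/lockspace0", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	lockspace_cmd(fd, "lock lock1 EX");	/* takelock lockresource lock1 */
	/* ... do stuff while holding lock1 ... */
	lockspace_cmd(fd, "unlock lock1");	/* droplock lockresource lock1 */

	/*
	 * Closing the lockspace fd -- or dying -- triggers ->release(),
	 * which would clean up any locks still held in this lockspace.
	 */
	close(fd);
	return 0;
}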
On Sunday 04 September 2005 03:28, Andrew Morton wrote:
> If there is already a richer interface into all this code (such as a
> syscall one) and it's feasible to migrate the open() tricksies to that API
> in the future if it all comes unstuck then OK. That's why I asked (thus
> far unsuccessfully):
>
> Are you saying that the posix-file lookalike interface provides
> access to part of the functionality, but there are other APIs which are
> used to access the rest of the functionality? If so, what is that
> interface, and why cannot that interface offer access to 100% of the
> functionality, thus making the posix-file tricks unnecessary?
There is no such interface at the moment, nor is one needed in the immediate
future. Let's look at the arguments for exporting a dlm to userspace:
1) Since we already have a dlm in kernel, why not just export that and save
100K of userspace library? Answer: because we don't want userspace-only
dlm features bulking up the kernel. Answer #2: the extra syscalls and
interface baggage serve no useful purpose.
2) But we need to take locks in the same lockspaces as the kernel dlm(s)!
Answer: only support tools need to do that. A cut-down locking api is
entirely appropriate for this.
3) But the kernel dlm is the only one we have! Answer: easily fixed, a
simple matter of coding. But please bear in mind that dlm-style
synchronization is probably a bad idea for most cluster applications,
particularly ones that already do their synchronization via sockets.
In other words, exporting the full dlm api is a red herring. It has nothing
to do with getting cluster filesystems up and running. It is really just
marketing: it sounds like a great thing for userspace to get a dlm "for
free", but it isn't free, it contributes to kernel bloat and it isn't even
the most efficient way to do it.
If after considering that, we _still_ want to export a dlm api from kernel,
then can we please take the necessary time and get it right? The full api
requires not only syscall-style elements, but asynchronous events as well,
similar to aio. I do not think anybody has a good answer to this today, nor
do we even need it to begin porting applications to cluster filesystems.
Oracle guys: what is the distributed locking API for RAC? Is the RAC team
waiting with bated breath to adopt your kernel-based dlm? If not, why not?
Regards,
Daniel
Hi!
> - read-only mount
> - "specatator" mount (like ro but no journal allocated for the mount,
> no fencing needed for failed node that was mounted as specatator)
I'd call it "real-read-only", and yes, that's very usefull
mount. Could we get it for ext3, too?
Pavel
--
if you have sharp zaurus hardware you don't need... you know my address
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> > no fencing needed for failed node that was mounted as specatator)
>
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?
In OCFS2 we call readonly+journal+connected-to-cluster "soft
readonly". We're a live node, other nodes know we exist, and we can
flush pending transactions during the rw->ro transition. In addition,
we can allow a ro->rw transition.
The no-journal+no-cluster-connection mode we call "hard
readonly". This is the mode you get when a device itself is readonly,
because you can't do *anything*.
Joel
--
"Lately I've been talking in my sleep.
Can't imagine what I'd have to say.
Except my world will be right
When love comes back my way."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Fri, Sep 02, 2005 at 10:28:21PM -0700, Greg KH wrote:
> On Fri, Sep 02, 2005 at 05:44:03PM +0800, David Teigland wrote:
> > On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> >
> > > + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> >
> > > what is gfs2_assert() about anyway? please just use BUG_ON directly
> > > everywhere
> >
> > When a machine has many gfs file systems mounted at once it can be useful
> > to know which one failed. Does the following look ok?
> >
> > #define gfs2_assert(sdp, assertion) \
> > do { \
> > if (unlikely(!(assertion))) { \
> > printk(KERN_ERR \
> > "GFS2: fsid=%s: fatal: assertion \"%s\" failed\n" \
> > "GFS2: fsid=%s: function = %s\n" \
> > "GFS2: fsid=%s: file = %s, line = %u\n" \
> > "GFS2: fsid=%s: time = %lu\n", \
> > sdp->sd_fsname, # assertion, \
> > sdp->sd_fsname, __FUNCTION__, \
> > sdp->sd_fsname, __FILE__, __LINE__, \
> > sdp->sd_fsname, get_seconds()); \
> > BUG(); \
>
> You will already get the __FUNCTION__ (and hence the __FILE__ info)
> directly from the BUG() dump, as well as the time from the syslog
> message (turn on the printk timestamps if you want a more fine grain
> timestamp), so the majority of this macro is redundant with the BUG()
> macro...
Joern already suggested moving this out of line and into a function (as it
was before) to avoid repeating string constants. In that case the
function, file and line from BUG aren't useful. We now have this, does it
look ok?
void gfs2_assert_i(struct gfs2_sbd *sdp, char *assertion, const char *function,
char *file, unsigned int line)
{
panic("GFS2: fsid=%s: fatal: assertion \"%s\" failed\n"
"GFS2: fsid=%s: function = %s, file = %s, line = %u\n",
sdp->sd_fsname, assertion,
sdp->sd_fsname, function, file, line);
}
#define gfs2_assert(sdp, assertion) \
do { \
if (unlikely(!(assertion))) { \
gfs2_assert_i((sdp), #assertion, \
__FUNCTION__, __FILE__, __LINE__); \
} \
} while (0)
On Sat, Sep 03, 2005 at 10:41:40PM -0700, Andrew Morton wrote:
> Joel Becker <[email protected]> wrote:
> >
> > > What happens when we want to add some new primitive which has no
> > > posix-file analog?
> >
> > The point of dlmfs is not to express every primitive that the
> > DLM has. dlmfs cannot express the CR, CW, and PW levels of the VMS
> > locking scheme. Nor should it. The point isn't to use a filesystem
> > interface for programs that need all the flexibility and power of the
> > VMS DLM. The point is a simple system that programs needing the basic
> > operations can use. Even shell scripts.
>
> Are you saying that the posix-file lookalike interface provides access to
> part of the functionality, but there are other APIs which are used to
> access the rest of the functionality? If so, what is that interface, and
> why cannot that interface offer access to 100% of the functionality, thus
> making the posix-file tricks unnecessary?
We're using our dlm quite a bit in user space and require the full dlm
API. It's difficult to export the full API through a pseudo fs like
dlmfs, so we've not found it a very practical approach. That said, it's a
nice idea and I'd be happy if someone could map a more complete dlm API
onto it.
We export our full dlm API through read/write/poll on a misc device. All
user space apps use the dlm through a library as you'd expect. The
library communicates with the dlm_device kernel module through
read/write/poll and the dlm_device module talks with the actual dlm:
linux/drivers/dlm/device.c. If there's a better way to do this, via a
pseudo fs or not, we'd be pleased to try it.
Dave
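To make the shape of that concrete, here is a rough sketch of how a
library might drive such a misc device from userspace; the device path,
record layouts and op codes below are placeholders, not the real
dlm_device ABI:

#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct dlm_request {		/* placeholder request record */
	uint32_t op;		/* e.g. 1 = lock, 2 = unlock */
	uint32_t mode;
	char name[64];
};

struct dlm_result {		/* placeholder completion record */
	uint32_t status;
	uint32_t lkid;
};

int main(void)
{
	struct dlm_request req = { .op = 1, .mode = 3, .name = "res1" };
	struct dlm_result res;
	struct pollfd pfd;

	int fd = open("/dev/dlm-device", O_RDWR);	/* placeholder path */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* write() submits the request to the kernel dlm */
	if (write(fd, &req, sizeof(req)) != sizeof(req))
		perror("write");

	/* poll() waits for the asynchronous completion */
	pfd.fd = fd;
	pfd.events = POLLIN;
	poll(&pfd, 1, -1);

	/* read() collects the completion */
	if (read(fd, &res, sizeof(res)) == sizeof(res))
		printf("status %u, lock id %u\n", res.status, res.lkid);

	close(fd);
	return 0;
}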
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> +void gfs2_glock_hold(struct gfs2_glock *gl)
> +{
> + glock_hold(gl);
> +}
>
> eh why?
You removed the comment stating exactly why, see below. If that's not an
accepted technique in the kernel, say so and I'll be happy to change it
here and elsewhere.
Thanks,
Dave
static inline void glock_hold(struct gfs2_glock *gl)
{
gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0);
atomic_inc(&gl->gl_count);
}
/**
* gfs2_glock_hold() - As glock_hold(), but suitable for exporting
* @gl: The glock to hold
*
*/
void gfs2_glock_hold(struct gfs2_glock *gl)
{
glock_hold(gl);
}
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> +static unsigned int handle_roll(atomic_t *a)
> +{
> + int x = atomic_read(a);
> + if (x < 0) {
> + atomic_set(a, 0);
> + return 0;
> + }
> + return (unsigned int)x;
> +}
>
> this is just plain scary.
Not really, it was just resetting atomic statistics counters when they
became negative. Unnecessary, though, so removed.
Dave
On Sun, Sep 04, 2005 at 10:33:44PM +0200, Pavel Machek wrote:
> Hi!
>
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> > no fencing needed for failed node that was mounted as specatator)
>
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?
This is a bit of a digression, but it's quite a bit different from
what ocfs2 is doing, where it is not necessary to replay the journal
in order to assure filesystem consistency.
In the ext3 case, the only time when read-only isn't quite read-only
is when the filesystem was unmounted uncleanly and the journal needs
to be replayed in order for the filesystem to be consistent. Mounting
the filesystem read-only without replaying the journal could and very
likely would result in the filesystem reporting filesystem consistency
problems, and if the filesystem is mounted with the reboot-on-errors
option, well....
- Ted
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > +void gfs2_glock_hold(struct gfs2_glock *gl)
> > +{
> > + glock_hold(gl);
> > +}
> >
> > eh why?
On 9/5/05, David Teigland <[email protected]> wrote:
> You removed the comment stating exactly why, see below. If that's not an
> accepted technique in the kernel, say so and I'll be happy to change it
> here and elsewhere.
Is there a reason why users of gfs2_glock_hold() cannot use
glock_hold() directly?
Pekka
On Mon, Sep 05, 2005 at 01:54:28AM -0400, Theodore Ts'o wrote:
> In the ext3 case, the only time when read-only isn't quite read-only
> is when the filesystem was unmounted uncleanly and the journal needs
> to be replayed in order for the filesystem to be consistent.
Right, and OCFS2 is going to try to keep the behavior of only using the
journal for recovery in normal (soft) read-only operation.
Unfortunately other cluster nodes could die at any moment, which can
complicate things as we are now required to do recovery on them to ensure
file system consistency.
Recovery of course includes things like orphan dir cleanup, etc., so we need a
journal around for those transactions. To simplify all this, I'm just going
to have it load the journal as it normally does (as opposed to only when the
local node has a dirty journal) because it could be used at any moment.
Btw, I'm curious to know how useful folks find the ext3 mount options
errors=continue and errors=panic. I'm extremely likely to implement the
errors=read-only behavior as default in OCFS2 and I'm wondering whether the
other two are worth looking into.
--Mark
--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]
On Mon, Sep 05, 2005 at 09:32:59AM +0300, Pekka Enberg wrote:
> On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> > > +void gfs2_glock_hold(struct gfs2_glock *gl)
> > > +{
> > > + glock_hold(gl);
> > > +}
> > >
> > > eh why?
>
> On 9/5/05, David Teigland <[email protected]> wrote:
> > You removed the comment stating exactly why, see below. If that's not an
> > accepted technique in the kernel, say so and I'll be happy to change it
> > here and elsewhere.
>
> Is there a reason why users of gfs2_glock_hold() cannot use
> glock_hold() directly?
Either set could be trivially removed. It's such an insignificant issue
that I've removed glock_hold and put. For the record,
within glock.c we consistently paired inlined versions of:
glock_hold()
glock_put()
we wanted external versions to be appropriately named so we had:
gfs2_glock_hold()
gfs2_glock_put()
still not sure if that technique is acceptable in this crowd or not.
Dave
On 9/5/05, David Teigland <[email protected]> wrote:
> Either set could be trivially removed. It's such an insignificant issue
> that I've removed glock_hold and put. For the record,
>
> within glock.c we consistently paired inlined versions of:
> glock_hold()
> glock_put()
>
> we wanted external versions to be appropriately named so we had:
> gfs2_glock_hold()
> gfs2_glock_put()
>
> still not sure if that technique is acceptable in this crowd or not.
You still didn't answer my question why you needed two versions,
though. AFAIK you didn't, which makes the other one a redundant
wrapper, which is discouraged in kernel code.
Pekka
Hi!
> > > - read-only mount
> > > - "specatator" mount (like ro but no journal allocated for the mount,
> > > no fencing needed for failed node that was mounted as specatator)
> >
> > I'd call it "real-read-only", and yes, that's very usefull
> > mount. Could we get it for ext3, too?
>
> This is a bit of a degression, but it's quite a bit different from
> what ocfs2 is doing, where it is not necessary to replay the journal
> in order to assure filesystem consistency.
>
> In the ext3 case, the only time when read-only isn't quite read-only
> is when the filesystem was unmounted uncleanly and the journal needs
> to be replayed in order for the filesystem to be consistent.
Yes, I know... And that is going to be a disaster when you are
attempting to recover data from a failing hard drive (and absolutely do
not want to write there).
There's a better reason, too. I do swsusp. Then I'd like to boot with
/ mounted read-only (so that I can read my config files, some
binaries, and maybe the suspended image), but I absolutely may not write
to disk at this point, because I still want to resume.
Currently distros do that using initrd, but that does not allow you to
store the suspended image into a file, and is slightly hard to set up.
Pavel
--
if you have sharp zaurus hardware you don't need... you know my address
David Teigland <[email protected]> wrote:
>
> We export our full dlm API through read/write/poll on a misc device.
>
inotify did that for a while, but we ended up going with a straight syscall
interface.
How fat is the dlm interface? ie: how many syscalls would it take?
On Mon, 5 September 2005 11:47:39 +0800, David Teigland wrote:
>
> Joern already suggested moving this out of line and into a function (as it
> was before) to avoid repeating string constants. In that case the
> function, file and line from BUG aren't useful. We now have this, does it
> look ok?
Ok wrt. my concerns, but not with Greg's. BUG() still gives you
everything that you need, except:
o fsid
Notice how this list is just one entry long? ;)
So how about
#define gfs2_assert(sdp, assertion) do { \
	if (unlikely(!(assertion))) { \
		printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname); \
		BUG(); \
	} \
} while (0)
Or, to move the constant out of line again
void __gfs2_assert(struct gfs2_sbd *sdp)
{
	printk(KERN_ERR "GFS2: fsid=%s\n", sdp->sd_fsname);
}
#define gfs2_assert(sdp, assertion) do { \
	if (unlikely(!(assertion))) { \
		__gfs2_assert(sdp); \
		BUG(); \
	} \
} while (0)
Jörn
--
Admonish your friends privately, but praise them openly.
-- Publilius Syrus
On Mon, Sep 05, 2005 at 10:58:08AM +0200, Jörn Engel wrote:
> #define gfs2_assert(sdp, assertion) do { \
> 	if (unlikely(!(assertion))) { \
> 		printk(KERN_ERR "GFS2: fsid=%s\n", (sdp)->sd_fsname); \
> 		BUG(); \
> 	} \
> } while (0)
OK thanks,
Dave
On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> David Teigland <[email protected]> wrote:
> >
> > We export our full dlm API through read/write/poll on a misc device.
> >
>
> inotify did that for a while, but we ended up going with a straight syscall
> interface.
>
> How fat is the dlm interface? ie: how many syscalls would it take?
Four functions:
create_lockspace()
release_lockspace()
lock()
unlock()
Dave
David Teigland <[email protected]> wrote:
>
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <[email protected]> wrote:
> > >
> > > We export our full dlm API through read/write/poll on a misc device.
> > >
> >
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> >
> > How fat is the dlm interface? ie: how many syscalls would it take?
>
> Four functions:
> create_lockspace()
> release_lockspace()
> lock()
> unlock()
Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
is likely to object if we reserve those slots.
On Monday 05 September 2005 05:19, Andrew Morton wrote:
> David Teigland <[email protected]> wrote:
> > On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > > David Teigland <[email protected]> wrote:
> > > > We export our full dlm API through read/write/poll on a misc device.
> > >
> > > inotify did that for a while, but we ended up going with a straight
> > > syscall interface.
> > >
> > > How fat is the dlm interface? ie: how many syscalls would it take?
> >
> > Four functions:
> > create_lockspace()
> > release_lockspace()
> > lock()
> > unlock()
>
> Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> is likely to object if we reserve those slots.
Better take a look at the actual parameter lists to those calls before jumping
to conclusions...
Regards,
Daniel
On Mon, Sep 05, 2005 at 02:19:48AM -0700, Andrew Morton wrote:
> David Teigland <[email protected]> wrote:
> > Four functions:
> > create_lockspace()
> > release_lockspace()
> > lock()
> > unlock()
>
> Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> is likely to object if we reserve those slots.
Patrick is really the expert in this area and he's off this week, but
based on what he's done with the misc device I don't see why there'd be
more than two or three parameters for any of these.
Dave
Hi,
On Sun, 2005-09-04 at 21:33, Pavel Machek wrote:
> > - read-only mount
> > - "specatator" mount (like ro but no journal allocated for the mount,
> > no fencing needed for failed node that was mounted as specatator)
>
> I'd call it "real-read-only", and yes, that's very usefull
> mount. Could we get it for ext3, too?
I don't want to pollute the ext3 paths with extra checks for the case
when there's no journal struct at all. But a dummy journal struct that
isn't associated with an on-disk journal and that can never, ever go
writable would certainly be pretty easy to do.
But mount -o readonly gives you most of what you want already. An
always-readonly option would be different in some key ways --- for a
start, it would be impossible to perform journal recovery if that's
needed, as that still needs journal and superblock write access. That's
not necessarily a good thing.
And you *still* wouldn't get something that could act as a spectator to
a filesystem mounted writable elsewhere on a SAN, because updates on the
other node wouldn't invalidate cached data on the readonly node. So is
this really a useful combination?
About the only combination I can think of that really makes sense in
this context is if you have a busted filesystem that somehow can't be
recovered --- either the journal is broken or the underlying device is
truly readonly --- and you want to mount without recovery in order to
attempt to see what you can find. That's asking for data corruption,
but that may be better than getting no data at all.
But that is something that could be done with a "-o skip-recovery" mount
option, which would necessarily imply always-readonly behaviour.
--Stephen
On Mon, Sep 05, 2005 at 10:27:35AM +0200, Pavel Machek wrote:
>
> There's a better reason, too. I do swsusp. Then I'd like to boot with
> / mounted read-only (so that I can read my config files, some
> binaries, and maybe suspended image), but I absolutely may not write
> to disk at this point, because I still want to resume.
>
You could _hope_ that the filesystem is consistent enough that it is
safe to try to read config files, binaries, etc. without running the
journal, but there is absolutely no guarantee that this is the case.
I'm not sure you want to depend on that for swsusp.
One potential solution that would probably meet your needs is a dm
hack which reads in the blocks in the journal, and then uses the most
recent block in the journal in preference to the version on disk.
- Ted
On Mon, Sep 05, 2005 at 12:09:23AM -0700, Mark Fasheh wrote:
> Btw, I'm curious to know how useful folks find the ext3 mount options
> errors=continue and errors=panic. I'm extremely likely to implement the
> errors=read-only behavior as default in OCFS2 and I'm wondering whether the
> other two are worth looking into.
For a single-user system errors=panic is definitely very useful on the
system disk, since that's the only way that we can force an fsck, and
also abort a server that might be failing and returning erroneous
information to its clients. Think of it is as i/o fencing when you're
not sure that the system is going to be performing correctly.
Whether or not this is useful for ocfs2 is a different matter. If
it's only for data volumes, and if the only way to fix filesystem
inconsistencies on a cluster filesystem is to request all nodes in the
cluster to unmount the filesystem and then arrange to run ocfs2's fsck
on the filesystem, then forcing every single cluster in the node to
panic is probably counterproductive. :-)
- Ted
On 2005-09-03T01:57:31, Daniel Phillips <[email protected]> wrote:
> The only current users of dlms are cluster filesystems. There are zero users
> of the userspace dlm api.
That is incorrect, and you're contradicting yourself here:
> What does have to be resolved is a common API for node management. It is not
> just cluster filesystems and their lock managers that have to interface to
> node management. Below the filesystem layer, cluster block devices and
> cluster volume management need to be coordinated by the same system, and
> above the filesystem layer, applications also need to be hooked into it.
> This work is, in a word, incomplete.
The Cluster Volume Management of LVM2 for example _does_ use simple
cluster-wide locks, and some OCFS2 scripts, I seem to recall, do too.
(EVMS2 in cluster-mode uses a verrry simple locking scheme which is
basically operated by the failover software and thus uses a different
model.)
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
 -- Charles Darwin
On 2005-09-03T09:27:41, Bernd Eckenfels <[email protected]> wrote:
> Oh that's interesting, I never thought about putting data files (tablespaces)
> in a clustered file system. Does that mean you can run supported RAC on
> shared ocfs2 files and anybody is using that?
That is the whole point why OCFS exists ;-)
> Do you see this go away with ASM?
No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
benefits in several aspects from a general-purpose SAN-backed CFS.
Sincerely,
Lars Marowsky-Brée <[email protected]>
--
High Availability & Clustering
SUSE Labs, Research and Development
SUSE LINUX Products GmbH - A Novell Business
"Ignorance more frequently begets confidence than does knowledge"
 -- Charles Darwin
On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> On 2005-09-03T01:57:31, Daniel Phillips <[email protected]> wrote:
> > The only current users of dlms are cluster filesystems. There are zero
> > users of the userspace dlm api.
>
> That is incorrect...
Application users Lars, sorry if I did not make that clear. The issue is
whether we need to export an all-singing-all-dancing dlm api from kernel to
userspace today, or whether we can afford to take the necessary time to get
it right while application writers take their time to have a good think about
whether they even need it.
> ...and you're contradicting yourself here:
How so? Above talks about dlm, below talks about cluster membership.
> > What does have to be resolved is a common API for node management. It is
> > not just cluster filesystems and their lock managers that have to
> > interface to node management. Below the filesystem layer, cluster block
> > devices and cluster volume management need to be coordinated by the same
> > system, and above the filesystem layer, applications also need to be
> > hooked into it. This work is, in a word, incomplete.
Regards,
Daniel
On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > On 2005-09-03T01:57:31, Daniel Phillips <[email protected]> wrote:
> > > The only current users of dlms are cluster filesystems. There are zero
> > > users of the userspace dlm api.
> >
> > That is incorrect...
>
> Application users Lars, sorry if I did not make that clear. The issue is
> whether we need to export an all-singing-all-dancing dlm api from kernel to
> userspace today, or whether we can afford to take the necessary time to get
> it right while application writers take their time to have a good think about
> whether they even need it.
>
If Linux fully supported OpenVMS DLM semantics we could start thinking about
moving our application onto a Linux box because our alpha server is aging.
That's just my user application writer $0.02.
--
Dmitry
On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
> > create_lockspace()
> > release_lockspace()
> > lock()
> > unlock()
>
> Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> is likely to object if we reserve those slots.
If the locks are not file descriptors then answer the following:
- How are they ref counted
- What are the cleanup semantics
- How do I pass a lock between processes (AF_UNIX sockets won't work now)
- How do I poll on a lock coming free.
- What are the semantics of lock ownership
- What rules apply for inheritance
- How do I access a lock across threads.
- What is the permission model.
- How do I attach audit to it
- How do I write SELinux rules for it
- How do I use mount to make namespaces appear in multiple vservers
and thats for starters...
Every so often someone decides that a deeply un-unix interface with new
syscalls is a good idea. Every time history proves them totally bonkers.
There are cases for new system calls but this doesn't seem one of them.
Look at system 5 shared memory, look at system 5 ipc, and so on. You
can't use common interfaces on them, you can't select on them, you can't
sanely pass them by fd passing.
All our existing locking uses the following behaviour
fd = open(namespace, options)
fcntl(.. lock ...)
blah
flush
fcntl(.. unlock ...)
close
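Spelled out with POSIX record locks, that pattern is roughly the
following (a minimal sketch; the lock file path is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	struct flock fl = {
		.l_type   = F_WRLCK,	/* exclusive lock */
		.l_whence = SEEK_SET,
		.l_start  = 0,
		.l_len    = 0,		/* 0 = lock the whole file */
	};

	int fd = open("/var/lock/mylock", O_RDWR | O_CREAT, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (fcntl(fd, F_SETLKW, &fl) < 0) {	/* block until granted */
		perror("fcntl");
		return 1;
	}

	/* ... do work under the lock, flush state ... */

	fl.l_type = F_UNLCK;
	fcntl(fd, F_SETLK, &fl);	/* explicit unlock */

	close(fd);	/* close also drops any fcntl locks held on the file */
	return 0;
}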
Unfortunately some people here seem to have forgotten WHY we do things
this way.
1. The semantics of file descriptors are well understood by users and by
programs. That makes programming easier and keeps code size down
2. Everyone knows how close() works including across fork
3. FD passing is an obscure art but understood and just works
4. Poll() is a standard understood interface
5. Ownership of files is a standard model
6. FD passing across fork/exec is controlled in a standard way
7. The semantics for threaded applications are defined
8. Permissions are a standard model
9. Audit just works with the same tools
10. SELinux just works with the same tools
11. I don't need specialist applications to see the system state (the
whole point of sysfs yet someone wants to break it all again)
12. fcntl fd locking is a posix standard interface with precisely
defined semantics. Our extensions including leases are very powerful
13. And yes - fcntl fd locking supports mandatory locking too. That also
is standards based with precise semantics.
Everyone understands how to use the existing locking operations. So if
you use the existing interfaces with some small extensions if neccessary
everyone understands how to use cluster locks. Isn't that neat....
On Sad, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:
> Actually I think it's rather sick. Taking O_NONBLOCK and making it a
> lock-manager trylock because they're kinda-sorta-similar-sounding? Spare
> me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to
> acquire a clustered filesystem lock". Not even close.
The semantics of O_NONBLOCK on many other devices are "trylock"
semantics. OSS audio has those semantics for example, as do regular
files in the presence of SYS5 mandatory locks. While the latter is "try
lock, do operation and then drop lock", the drivers using O_NDELAY are
very definitely providing trylock semantics.
I am curious why a lock manager uses open to implement its locking
semantics rather than using the locking API (POSIX locks etc) however.
Alan
On Mon, Sep 05, 2005 at 05:24:33PM +0800, David Teigland wrote:
> On Mon, Sep 05, 2005 at 01:54:08AM -0700, Andrew Morton wrote:
> > David Teigland <[email protected]> wrote:
> > >
> > > We export our full dlm API through read/write/poll on a misc device.
> > >
> >
> > inotify did that for a while, but we ended up going with a straight syscall
> > interface.
> >
> > How fat is the dlm interface? ie: how many syscalls would it take?
>
> Four functions:
> create_lockspace()
> release_lockspace()
> lock()
> unlock()
FWIW, it looks like we can agree on the core interface. ocfs2_dlm
exports essentially the same functions:
dlm_register_domain()
dlm_unregister_domain()
dlmlock()
dlmunlock()
I also implemented dlm_migrate_lockres() to explicitly remaster a lock
on another node, but this isn't used by any callers today (except for
debugging purposes). There is also some wiring between the fs and the
dlm (eviction callbacks) to deal with some ordering issues between the
two layers, but these could go if we get stronger membership.
There are quite a few other functions in the "full" spec(1) that we
didn't even attempt, either because we didn't require direct
user<->kernel access or we just didn't need the function. As for the
rather thick set of parameters expected in dlm calls, we managed to get
dlmlock down to *ahem* eight, and the rest are fairly slim.
Looking at the misc device that gfs uses, it seems like there is a pretty
much complete interface to the same calls you have in kernel, validated
on the write() calls to the misc device. With dlmfs, we were seeking to
lock down and simplify user access by using standard ast/bast/unlockast
calls, using a file descriptor as an opaque token for a single lock,
letting the vfs lifetime on this fd help with abnormal termination, etc.
I think both the misc device and dlmfs are helpful and not necessarily
mutually exclusive, and probably both are better approaches than
exporting everything via loads of syscalls (which seems to be the
VMS/opendlm model).
-kurt
1. http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf
Kurt C. Hackel
Oracle
[email protected]
Alan Cox <[email protected]> wrote:
>
> On Llu, 2005-09-05 at 02:19 -0700, Andrew Morton wrote:
> > > create_lockspace()
> > > release_lockspace()
> > > lock()
> > > unlock()
> >
> > Neat. I'd be inclined to make them syscalls then. I don't suppose anyone
> > is likely to object if we reserve those slots.
>
> If the locks are not file descriptors then answer the following:
>
> - How are they ref counted
> - What are the cleanup semantics
> - How do I pass a lock between processes (AF_UNIX sockets won't work now)
> - How do I poll on a lock coming free.
> - What are the semantics of lock ownership
> - What rules apply for inheritance
> - How do I access a lock across threads.
> - What is the permission model.
> - How do I attach audit to it
> - How do I write SELinux rules for it
> - How do I use mount to make namespaces appear in multiple vservers
>
> and thats for starters...
Return an fd from create_lockspace().
On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
> That is the whole point why OCFS exists ;-)
The whole point of the Oracle cluster filesystem as it was described in old
papers was about pfiles, control files and software, because you can easily
use direct block access (with ASM) for tablespaces.
> No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
> benefits in several aspects from a general-purpose SAN-backed CFS.
Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
replicated filesystem makes more sense), I am just not sure if anybody sane
would use it for tablespaces.
I guess I have to correct the article in my German IT blog :) (if somebody
can name production customers).
Regards,
Bernd
--
http://itblog.eckenfels.net/archives/54-Cluster-Filesysteme.html
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
> On Mon, Sep 05, 2005 at 04:16:31PM +0200, Lars Marowsky-Bree wrote:
> > That is the whole point why OCFS exists ;-)
>
> The whole point of the Oracle cluster filesystem as it was described in old
> papers was about pfiles, control files and software, because you can easily
> use direct block access (with ASM) for tablespaces.
The original OCFS was intended for use with pfiles and control files but
very definitely *not* software (the ORACLE_HOME). It was not remotely
general purpose. It also predated ASM by about a year or so, and the
two solutions are complementary. Either one is a good choice for Oracle
datafiles, depending upon your needs.
> > No. Beyond the table spaces, there's also ORACLE_HOME; a cluster
> > benefits in several aspects from a general-purpose SAN-backed CFS.
>
> Yes, I don't dispute the usefulness of OCFS for ORA_HOME (besides, I think a
> replicated filesystem makes more sense), I am just not sure if anybody sane
> would use it for tablespaces.
Too many to mention here, but let's just say that some of the largest
databases are running Oracle datafiles on top of OCFS1. Very large
companies with very important data.
> I guess I have to correct the article in my German IT blog :) (if somebody
> can name production customers).
Yeah you should definitely update your blog ;-) If you need named
references, we can give you loads of those.
-kurt
Kurt C. Hackel
Oracle
[email protected]
On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
> > - How are they ref counted
> > - What are the cleanup semantics
> > - How do I pass a lock between processes (AF_UNIX sockets won't work now)
> > - How do I poll on a lock coming free.
> > - What are the semantics of lock ownership
> > - What rules apply for inheritance
> > - How do I access a lock across threads.
> > - What is the permission model.
> > - How do I attach audit to it
> > - How do I write SELinux rules for it
> > - How do I use mount to make namespaces appear in multiple vservers
> >
> > and thats for starters...
>
> Return an fd from create_lockspace().
That only answers about four of the questions. The rest only come out if
create_lockspace behaves like a file system - in other words
create_lockspace is better known as either mkdir or mount.
It's certainly viable to make the lock/unlock functions take an fd; it's
just not clear why the current lock/unlock functions we have won't do
the job. Being able to extend the functionality to leases later on may
be very powerful indeed and will fit the existing API.
Alan Cox <[email protected]> wrote:
>
> On Llu, 2005-09-05 at 12:53 -0700, Andrew Morton wrote:
> > > - How are they ref counted
> > > - What are the cleanup semantics
> > > - How do I pass a lock between processes (AF_UNIX sockets won't work now)
> > > - How do I poll on a lock coming free.
> > > - What are the semantics of lock ownership
> > > - What rules apply for inheritance
> > > - How do I access a lock across threads.
> > > - What is the permission model.
> > > - How do I attach audit to it
> > > - How do I write SELinux rules for it
> > > - How do I use mount to make namespaces appear in multiple vservers
> > >
> > > and thats for starters...
> >
> > Return an fd from create_lockspace().
>
> That only answers about four of the questions. The rest only come out if
> create_lockspace behaves like a file system - in other words
> create_lockspace is better known as either mkdir or mount.
But David said that "We export our full dlm API through read/write/poll on
a misc device.". That miscdevice will simply give us an fd. Hence my
suggestion that the miscdevice be done away with in favour of a dedicated
syscall which returns an fd.
What does a filesystem have to do with this?
On Sun, Sep 04, 2005 at 09:37:15AM +0100, Alan Cox wrote:
> I am curious why a lock manager uses open to implement its locking
> semantics rather than using the locking API (POSIX locks etc) however.
Because it is simple (how do you fcntl(2) from a shell fd?), has no
ranges (what do you do with ranges passed in to fcntl(2) and you don't
support them?), and has a well-known fork(2)/exec(2) pattern. fcntl(2)
has a known but less intuitive fork(2) pattern.
The real reason, though, is that we never considered fcntl(2).
We could never think of a case when a process wanted a lock fd open but
not locked. At least, that's my recollection. Mark might have more to
comment.
Joel
--
"In the room the women come and go
Talking of Michaelangelo."
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Mon, Sep 05, 2005 at 10:24:03PM +0200, Bernd Eckenfels wrote:
> The whole point of the orcacle cluster filesystem as it was described in old
> papers was about pfiles, control files and software, because you can easyly
> use direct block access (with ASM) for tablespaces.
OCFS, the original filesystem, only works for datafiles,
logfiles, and other database data. It's currently used in serious anger
by several major customers. Oracle's websites must have a list of them
somewhere. We're talking many terabytes of datafiles.
> Yes, I dont dispute the usefullness of OCFS for ORA_HOME (beside I think a
> replicated filesystem makes more sense), I am just nor sure if anybody sane
> would use it for tablespaces.
OCFS2, the new filesystem, is fully general purpose. It
supports all the usual stuff, is quite fast, and is what we expect folks
to use for both ORACLE_HOME and datafiles in the future. Customers can,
of course, use ASM or even raw devices. OCFS2 is as fast as raw
devices, and far more manageable, so raw devices are probably not a
choice for the future. ASM has its own management advantages, and we
certainly expect customers to like it as well. But that doesn't mean
people won't use OCFS2 for datafiles depending on their environment or
needs.
--
"The first requisite of a good citizen in this republic of ours
is that he shall be able and willing to pull his weight."
- Theodore Roosevelt
Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127
On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > On 2005-09-03T01:57:31, Daniel Phillips <[email protected]> wrote:
> > > > The only current users of dlms are cluster filesystems. There are
> > > > zero users of the userspace dlm api.
> > >
> > > That is incorrect...
> >
> > Application users Lars, sorry if I did not make that clear. The issue is
> > whether we need to export an all-singing-all-dancing dlm api from kernel
> > to userspace today, or whether we can afford to take the necessary time
> > to get it right while application writers take their time to have a good
> > think about whether they even need it.
>
> If Linux fully supported OpenVMS DLM semantics we could start thinking
> about moving our application onto a Linux box because our alpha server is
> aging.
>
> That's just my user application writer $0.02.
What stops you from trying it with the patch? That kind of feedback would be
worth way more than $0.02.
Regards,
Daniel
On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > On 2005-09-03T01:57:31, Daniel Phillips <[email protected]> wrote:
> > > > > The only current users of dlms are cluster filesystems. There are
> > > > > zero users of the userspace dlm api.
> > > >
> > > > That is incorrect...
> > >
> > > Application users Lars, sorry if I did not make that clear. The issue is
> > > whether we need to export an all-singing-all-dancing dlm api from kernel
> > > to userspace today, or whether we can afford to take the necessary time
> > > to get it right while application writers take their time to have a good
> > > think about whether they even need it.
> >
> > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > about moving our application onto a Linux box because our alpha server is
> > aging.
> >
> > That's just my user application writer $0.02.
>
> What stops you from trying it with the patch? That kind of feedback would be
> worth way more than $0.02.
>
We do not have such plans at the moment and I prefer spending my free
time on tinkering with the kernel, not rewriting some in-house application.
Besides, DLM is not the only thing that does not have a drop-in
replacement in Linux.
You just said you did not know if there are any potential users for the
full DLM and I said there are some.
--
Dmitry
On Monday 05 September 2005 22:03, Dmitry Torokhov wrote:
> On Monday 05 September 2005 19:57, Daniel Phillips wrote:
> > On Monday 05 September 2005 12:18, Dmitry Torokhov wrote:
> > > On Monday 05 September 2005 10:49, Daniel Phillips wrote:
> > > > On Monday 05 September 2005 10:14, Lars Marowsky-Bree wrote:
> > > > > On 2005-09-03T01:57:31, Daniel Phillips <[email protected]> wrote:
> > > > > > The only current users of dlms are cluster filesystems. There
> > > > > > are zero users of the userspace dlm api.
> > > > >
> > > > > That is incorrect...
> > > >
> > > > Application users Lars, sorry if I did not make that clear. The
> > > > issue is whether we need to export an all-singing-all-dancing dlm api
> > > > from kernel to userspace today, or whether we can afford to take the
> > > > necessary time to get it right while application writers take their
> > > > time to have a good think about whether they even need it.
> > >
> > > If Linux fully supported OpenVMS DLM semantics we could start thinking
> > > about moving our application onto a Linux box because our alpha server
> > > is aging.
> > >
> > > That's just my user application writer $0.02.
> >
> > What stops you from trying it with the patch? That kind of feedback
> > would be worth way more than $0.02.
>
> We do not have such plans at the moment and I prefer spending my free
> time on tinkering with kernel, not rewriting some in-house application.
> Besides, DLM is not the only thing that does not have a drop-in
> replacement in Linux.
>
> You just said you did not know if there are any potential users for the
> full DLM and I said there are some.
I did not say "potential", I said there are zero dlm applications at the
moment. Nobody has picked up the prototype (g)dlm api, used it in an
application and said "gee this works great, look what it does".
I also claim that most developers who think that using a dlm for application
synchronization would be really cool are probably wrong. Use sockets for
synchronization exactly as for a single-node, multi-tasking application and
you will end up with less code, more obviously correct code, probably more
efficient and... you get an optimal, single-node version for free.
And I also claim that there is precious little reason to have a full-featured
dlm in-kernel. Being in-kernel has no benefit for a userspace application.
But being in-kernel does add kernel bloat, because there will be extra
features lathered on that are not needed by the only in-kernel user, the
cluster filesystem.
In the case of your port, you'd be better off hacking up a userspace library
to provide OpenVMS dlm semantics exactly, not almost.
By the way, you said "alpha server" not "alpha servers", was that just a slip?
Because if you don't have a cluster then why are you using a dlm?
Regards,
Daniel
On Monday 05 September 2005 23:02, Daniel Phillips wrote:
>
> By the way, you said "alpha server" not "alpha servers", was that just a slip?
> Because if you don't have a cluster then why are you using a dlm?
>
No, it is not a slip. The application is running on just one node, so we
do not really use the "distributed" part. However, we make heavy use of the
rest of the lock manager features, especially lock value blocks.
--
Dmitry
On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > By the way, you said "alpha server" not "alpha servers", was that just a
> > slip? Because if you don't have a cluster then why are you using a dlm?
>
> No, it is not a slip. The application is running on just one node, so we
> do not really use "distributed" part. However we make heavy use of the
> rest of lock manager features, especially lock value blocks.
Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature
without even having the excuse you were forced to use it. Why don't you just
have a daemon that sends your values over a socket? That should be all of a
day's coding.
Anyway, thanks for sticking your head up, and sorry if it sounds aggressive.
But you nicely supported my claim that most who think they should be using a
dlm, really shouldn't.
Regards,
Daniel
On Monday 05 September 2005 23:58, Daniel Phillips wrote:
> On Tuesday 06 September 2005 00:07, Dmitry Torokhov wrote:
> > On Monday 05 September 2005 23:02, Daniel Phillips wrote:
> > > By the way, you said "alpha server" not "alpha servers", was that just a
> > > slip? Because if you don't have a cluster then why are you using a dlm?
> >
> > No, it is not a slip. The application is running on just one node, so we
> > do not really use "distributed" part. However we make heavy use of the
> > rest of lock manager features, especially lock value blocks.
>
> Urk, so you imprinted on the clunkiest, most pathetically limited dlm feature
> without even having the excuse you were forced to use it. Why don't you just
> have a daemon that sends your values over a socket? That should be all of a
> day's coding.
>
Umm, because when most of the code was written, TCP and the rest were the
clunkiest code out there? Plus, having a daemon introduces problems with
cleanup (say the process dies for one reason or another) whereas having it in
the OS takes care of that.
> Anyway, thanks for sticking your head up, and sorry if it sounds aggressive.
> But you nicely supported my claim that most who think they should be using a
> dlm, really shouldn't.
Heh, do you think it is a bit premature to dismiss something even without
ever seeing the code?
--
Dmitry
On Monday 05 September 2005 19:37, Joel Becker wrote:
> OCFS2, the new filesystem, is fully general purpose. It
> supports all the usual stuff, is quite fast...
So I have heard, but isn't it time to quantify that? How do you think you
would stack up here:
http://www.caspur.it/Files/2005/01/10/1105354214692.pdf
Regards,
Daniel
On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> do you think it is a bit premature to dismiss something even without
> ever seeing the code?
You told me you are using a dlm for a single-node application, is there
anything more I need to know?
Regards,
Daniel
On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
>
> You told me you are using a dlm for a single-node application, is there
> anything more I need to know?
>
I would still like to know why you consider it a "sin". On OpenVMS it is
fast, provides a way of cleaning up and does not introduce a single point
of failure as is the case with a daemon. And if we ever want to spread
the load between 2 boxes we can easily do it. Why would I not want to use
it?
--
Dmitry
On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > do you think it is a bit premature to dismiss something even without
> > > ever seeing the code?
> >
> > You told me you are using a dlm for a single-node application, is there
> > anything more I need to know?
>
> I would still like to know why you consider it a "sin". On OpenVMS it is
> fast, provides a way of cleaning up...
There is something hard about handling EPIPE?
> and does not introduce single point
> of failure as it is the case with a daemon. And if we ever want to spread
> the load between 2 boxes we easily can do it.
But you said it runs on an aging Alpha, surely you do not intend to expand it
to two aging Alphas? And what makes you think that socket-based
synchronization keeps you from spreading out the load over multiple boxes?
> Why would I not want to use it?
It is not the right tool for the job from what you have told me. You want to
get a few bytes of information from one task to another? Use a socket, as
God intended.
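For illustration, here is a minimal sketch of the sort of thing meant - one
process handing a small "value block" to another over a Unix-domain socket.
The socket path and the 32-byte block size are invented for the example:

/* Sketch only: a writer hands a small "value block" to a reader over a
 * Unix-domain socket.  Path and block size are arbitrary; error handling
 * (EPIPE and friends) is left to the caller. */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#define VALUE_BLOCK_SIZE 32
#define SOCK_PATH "/tmp/lvb-demo.sock"

static int send_value_block(const char *block)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	int fd = socket(AF_UNIX, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    write(fd, block, VALUE_BLOCK_SIZE) != VALUE_BLOCK_SIZE) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}

The receiving daemon just accept()s, reads 32 bytes per connection and keeps
the latest block around, which is about all a single-node value block gives you.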
Regards,
Daniel
On Fri, Sep 02, 2005 at 11:17:08PM +0200, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
>
> >
> > > > - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot
> > > > possibly gain (or vice versa)
> > > >
> > > > - Relative merits of the two offerings
> > >
> > > You missed the important one - people actively use it and have been for
> > > some years. Same reason with have NTFS, HPFS, and all the others. On
> > > that alone it makes sense to include.
> >
> > Again, that's not a technical reason. It's _a_ reason, sure. But what are
> > the technical reasons for merging gfs[2], ocfs2, both or neither?
>
> There seems to be clearly a need for a shared-storage fs of some sort
> for HA clusters and virtualized usage (multiple guests sharing a
> partition). Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access and it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.
>
> It's also a logical extension of the "failover on failure" clusters
> many people run now - instead of only failing over the shared fs at
> failure and keeping one machine idle the load can be balanced between
> multiple machines at any time.
>
> One argument to merge both might be that nobody really knows yet which
> shared-storage file system (GFS or OCFS2) is better. The only way to
> find out would be to let the user base try out both, and that's most
> practical when they're merged.
>
> Personally I think ocfs2 has nicer&cleaner code than GFS.
> It seems to be more or less a 64bit ext3 with cluster support, while
The "more or less" is what bothers me here - the first time I heard this,
it sounded a little misleading, as I expected to find some kind of a
patch to ext3 to make it 64 bit with extents and cluster support.
Now I understand it a little better (thanks to Joel and Mark).
And herein lies the issue where I tend to agree with Andrew
-- it's really nice to have multiple filesystems innovating freely in
their niches and eventually proving themselves in practice, without
being bogged down by legacy etc. But at the same time, is there enough
thought and discussion about where the fragmentation/diversification is really
warranted, vs improving what is already there, or say incorporating
the best of one into another, maybe over a period of time?
The number of filesystems seems to just keep growing, and supporting
all of them isn't easy -- for users it isn't really easy to switch from
one to another, and the justifications for choosing between them are
sometimes confusing and burdensome from an administrator standpoint
- one filesystem is good in certain conditions, another in others,
stability levels may vary etc, and it's not always possible to predict
which aspect to prioritize.
Now, with filesystems that have been around in production for a long
time, the on-disk format becomes a major constraining factor, and the
reason for having various legacy support around. Likewise, for some
special purpose filesystems there really is a niche usage. But for new
and sufficiently general purpose filesystems, with new on-disk structure,
isn't it worth thinking this through and trying to get it right?
Yeah, it is a lot of work upfront ... but with double the people working
on something, it just might get much better than what they could achieve
individually. Sometimes.
BTW, I don't know if it is worth it in this particular case, but just
something that worries me in general.
> GFS seems to reinvent a lot more things and has somewhat uglier code.
> On the other hand GFS' cluster support seems to be more aimed
> at being a universal cluster service open for other usages too,
> which might be a good thing. OCFS2s cluster seems to be more
> aimed at only serving the file system.
>
> But which one works better in practice is really an open question.
True, but what usually ends up happening is that this question can
never quite be answered in black and white. So both just continue
to exist and apps need to support both ... convergence becomes impossible
and long term duplication inevitable.
So at least having a clear demarcation/guideline of what situations
each is suitable for upfront would be a good thing. That might also
get some cross ocfs-gfs and ocfs-ext3 reviews in the process :)
Regards
Suparna
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India
On Maw, 2005-09-06 at 02:48 -0400, Daniel Phillips wrote:
> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without
> > ever seeing the code?
>
> You told me you are using a dlm for a single-node application, is there
> anything more I need to know?
That's standard practice for many non-Unix operating systems. It means
your code supports failover without much additional work, and it provides
all the functionality for locks on a single node too.
On 9/6/05, Daniel Phillips <[email protected]> wrote:
> On Tuesday 06 September 2005 02:55, Dmitry Torokhov wrote:
> > On Tuesday 06 September 2005 01:48, Daniel Phillips wrote:
> > > On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > > > do you think it is a bit premature to dismiss something even without
> > > > ever seeing the code?
> > >
> > > You told me you are using a dlm for a single-node application, is there
> > > anything more I need to know?
> >
> > I would still like to know why you consider it a "sin". On OpenVMS it is
> > fast, provides a way of cleaning up...
>
> There is something hard about handling EPIPE?
>
Just the fact that you want me to handle it ;)
> > and does not introduce single point
> > of failure as it is the case with a daemon. And if we ever want to spread
> > the load between 2 boxes we easily can do it.
>
> But you said it runs on an aging Alpha, surely you do not intend to expand it
> to two aging Alphas?
You would be right if I were designing this right now. Now roll back 10 - 12
years, and I have a shiny new Alpha. Would you criticize me then for using a
mechanism that allowed the application to be spread easily across several
nodes with minimal changes if needed?
What you fail to realize is that there are applications that run and will
continue to run for a long time.
> And what makes you think that socket-based
> synchronization keeps you from spreading out the load over multiple boxes?
>
> > Why would I not want to use it?
>
> It is not the right tool for the job from what you have told me. You want to
> get a few bytes of information from one task to another? Use a socket, as
> God intended.
>
Again, when TCP/IP is not the native network stack and libc socket
routines are not readily available, a DLM starts looking much more
viable.
--
Dmitry
On Thu, Sep 01, 2005 at 01:35:23PM +0200, Arjan van de Ven wrote:
> +static inline void glock_put(struct gfs2_glock *gl)
> +{
> + if (atomic_read(&gl->gl_count) == 1)
> + gfs2_glock_schedule_for_reclaim(gl);
> + gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
> + atomic_dec(&gl->gl_count);
> +}
>
> this code has a race
The first two lines of the function with the race are non-essential and
could be removed. In the common case where there's no race, they just add
efficiency by moving the glock to the reclaim list immediately.
Otherwise, the scand thread would do it later when actively trying to
reclaim glocks.
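(For reference, an illustrative race-free shape for a put like this, using
atomic_dec_and_test so that only the thread performing the final decrement
schedules the reclaim. Whether scheduling reclaim after rather than before
the final drop is acceptable here is a separate question, so treat it purely
as a sketch of the idiom:)

static inline void glock_put(struct gfs2_glock *gl)
{
	gfs2_assert(gl->gl_sbd, atomic_read(&gl->gl_count) > 0,);
	/* the atomic op itself detects the last put, so two CPUs can
	   never both see a count of 1 and double-schedule the reclaim */
	if (atomic_dec_and_test(&gl->gl_count))
		gfs2_glock_schedule_for_reclaim(gl);
}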
> +static inline int queue_empty(struct gfs2_glock *gl, struct list_head *head)
> +{
> + int empty;
> + spin_lock(&gl->gl_spin);
> + empty = list_empty(head);
> + spin_unlock(&gl->gl_spin);
> + return empty;
> +}
>
> that looks like a racey interface to me... if so.. why bother locking at
> all?
The spinlock protects the list but is not the primary method of
synchronizing processes that are working with a glock.
When the list is in fact empty, there will be no race, and the locking
wouldn't be necessary. In this case, the "glmutex" in the code fragment
below is preventing any change in the list, so we can safely release the
spinlock immediately.
When the list is not empty, then a process could be adding another entry
to the list without "glmutex" locked [1], making the spinlock necessary.
In this case we quit after queue_empty() returns and don't do anything
else, so releasing the spinlock immediately was still safe.
[1] A process that already holds a glock (i.e. has a "holder" struct on
the gl_holders list) is allowed to hold it again by adding another holder
struct to the same list. It adds the second hold without locking glmutex.
if (gfs2_glmutex_trylock(gl)) {
        if (gl->gl_ops == &gfs2_inode_glops) {
                struct gfs2_inode *ip = get_gl2ip(gl);
                if (ip && !atomic_read(&ip->i_count))
                        gfs2_inode_destroy(ip);
        }
        if (queue_empty(gl, &gl->gl_holders) &&
            gl->gl_state != LM_ST_UNLOCKED)
                handle_callback(gl, LM_ST_UNLOCKED);
        gfs2_glmutex_unlock(gl);
}
There is a second way that queue_empty() is used, and that's within
assertions that the list is empty. If the assertion is correct, locking
isn't necessary; locking is only needed if there's already another bug
causing the list to not be empty and the assertion to fail.
> static int gi_skeleton(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
> + gi_filler_t filler)
> +{
> + unsigned int size = gfs2_tune_get(ip->i_sbd, gt_lockdump_size);
> + char *buf;
> + unsigned int count = 0;
> + int error;
> +
> + if (size > gi->gi_size)
> + size = gi->gi_size;
> +
> + buf = kmalloc(size, GFP_KERNEL);
> + if (!buf)
> + return -ENOMEM;
> +
> + error = filler(ip, gi, buf, size, &count);
> + if (error)
> + goto out;
> +
> + if (copy_to_user(gi->gi_data, buf, count + 1))
> + error = -EFAULT;
>
> where does count get a sensible value?
from filler()
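For anyone else wondering, a hypothetical filler of that shape (the names and
the text it formats are invented here) would fill the buffer and report the
length back through count, something like:

static int example_filler(struct gfs2_inode *ip, struct gfs2_ioctl *gi,
			  char *buf, unsigned int size, unsigned int *count)
{
	int n = snprintf(buf, size, "example state dump\n");

	if (n < 0 || (unsigned int)n >= size)
		return -ENOBUFS;
	/* gi_skeleton() then copies count + 1 bytes, i.e. including the
	   terminating NUL that snprintf wrote */
	*count = n;
	return 0;
}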
We'll add comments in the code to document the things above.
Thanks,
Dave
>
> You removed the comment stating exactly why, see below. If that's not a
> accepted technique in the kernel, say so and I'll be happy to change it
> here and elsewhere.
> Thanks,
> Dave
entirely useless wrapping is not acceptable indeed.
I've just returned from holiday, so I'm late to this discussion; let me tell
you what we do now and why, and let's see what's wrong with it.
Currently the library create_lockspace() call returns an FD upon which all lock
operations happen. The FD is onto a misc device, one per lockspace, so if you
want lockspace protection it can happen at that level. There is no protection
applied to locks within a lockspace nor do I think it's helpful to do so to be
honest. Using a misc device limits you to <255 lockspaces, depending on the other
uses of misc, but this is just for userland-visible lockspaces - it does not
affect GFS filesystems, for instance.
Lock/convert/unlock operations are done using write calls on that lockspace FD.
Callbacks are implemented using poll and read on the FD; read will return data
blocks (one per callback) as long as there are active callbacks to process. The
current read functionality behaves more like a SOCK_PACKET than a data stream
which some may not like but then you're going to need to know what you're
reading from the device anyway.
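To make that model concrete, here is roughly how an application drives such a
lockspace FD. The request and result structures below are stand-ins, not the
real dlm_device layout; only the open/write/poll/read pattern is the point:

#include <poll.h>
#include <unistd.h>

struct demo_lock_request { int cmd; int mode; char name[64]; };
struct demo_lock_result { int status; int lockid; char lvb[32]; };

static void lockspace_loop(int ls_fd)
{
	struct demo_lock_request req = { 0 };	/* filled in by the caller */
	struct demo_lock_result res;
	struct pollfd pfd = { .fd = ls_fd, .events = POLLIN };

	write(ls_fd, &req, sizeof(req));	/* queue an asynchronous lock op */

	for (;;) {
		poll(&pfd, 1, -1);	/* wait for a completion or blocking callback */
		if (read(ls_fd, &res, sizeof(res)) == sizeof(res)) {
			/* one callback per read, SOCK_PACKET-style; res
			   carries lock status, value block, etc. */
		}
	}
}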
ioctl/fcntl isn't really useful for DLM locks because you can't do asynchronous
operations on them - the lock has to succeed or fail in the one operation - if
you want a callback for completion (or blocking notification) you have to poll
the lockspace FD anyway and then you might as well go back to using read and
write because at least they are something of a matched pair. Something similar
applies, I think, to a syscall interface.
Another reason the existing fcntl interface isn't appropriate is that it's not
locking the same kind of thing. Current Unix fcntl calls lock byte ranges. DLM
locks arbitrary names and has a much richer list of lock modes. Adding another
fcntl just runs into the problems mentioned above.
The other reason we use read for callbacks is that there is information to be
passed back: lock status, value block and (possibly) query information.
While having an FD per lock sounds like a nice unixy idea I don't think it would
work very well in practice. Applications with hundreds or thousands of locks
(such as databases) would end up with huge pollfd structs to manage, and
while it helps the refcounting (currently the nastiest bit of the current
dlm_device code) it removes the possibility of having persistent locks that exist
after the process exits - a handy feature that some people do use, though I
don't think it's in the currently submitted DLM code. One FD per lock also gives
each lock two handles, the lock ID used internally by the DLM and the FD used
externally by the application which I think is a little confusing.
I don't think a dlmfs is useful, personally. The features you can export from it
are either minimal compared to the full DLM functionality (so you have to export
the rest by some other means anyway) or are going to be so un-filesystemlike as
to be very awkward to use. Doing lock operations in shell scripts is all very
cool but how often do you /really/ need to do that?
I'm not saying that what we have is perfect - far from it - but we have thought
about how this works, and what we came up with seems like a good compromise
between providing full DLM functionality to userspace and using unix features. But
we're very happy to listen to other ideas - and have been doing so, I hope.
--
patrick
On 9/14/05, Patrick Caulfield <[email protected]> wrote:
> I've just returned from holiday so I'm late to this discussion so let me tell
> you what we do now and why and lets see what's wrong with it.
>
> [snip lots of useful stuff]
>
> While having an FD per lock sounds like a nice unixy idea I don't think it would
> work very well in practice. Applications with hundreds or thousands of locks
> (such as databases) would end up with huge pollfd structs to manage, and it
Or rather use epoll instead of poll/select to avoid scalability issues.
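(A minimal sketch of the epoll pattern meant here, assuming the hypothetical
one-FD-per-lock model; lock_fds[] stands for the already-opened lock FDs:)

#include <sys/epoll.h>

static void watch_locks(int *lock_fds, int nr_locks)
{
	struct epoll_event ev, events[64];
	int epfd = epoll_create(nr_locks);
	int i, n;

	for (i = 0; i < nr_locks; i++) {
		ev.events = EPOLLIN;
		ev.data.fd = lock_fds[i];
		epoll_ctl(epfd, EPOLL_CTL_ADD, lock_fds[i], &ev);
	}
	for (;;) {
		n = epoll_wait(epfd, events, 64, -1);
		for (i = 0; i < n; i++) {
			/* read the pending callback from events[i].data.fd */
		}
	}
}

The per-wakeup cost is then proportional to the number of ready FDs rather
than the total number of locks.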
> while it helps the refcounting (currently the nastiest bit of the current
> dlm_device code) removes the possibility of having persistent locks that exist
The refcounting is already implemented for FDs and will be kept
working 100% of the time because if it failed the kernel would not
even boot for most people due to initscripts leaking massive memory at
exit time. Not that GFS will not be maintained, but the
kernel-at-large has more people behind it.
> after the process exits - a handy feature that some people do use, though I
> don't think it's in the currently submitted DLM code. One FD per lock also gives
Perhaps persistent locks could be registered with a daemon by passing
the lock FDs to it over a unix domain socket (fd passing with SCM_RIGHTS).
> each lock two handles, the lock ID used internally by the DLM and the FD used
> externally by the application which I think is a little confusing.
>
> I don't think a dlmfs is useful, personally. The features you can export from it
> are either minimal compared to the full DLM functionality (so you have to export
> the rest by some other means anyway) or are going to be so un-filesystemlike as
> to be very awkward to use. Doing lock operations in shell scripts is all very
> cool but how often do you /really/ need to do that?
>
> I'm not saying that what we have is perfect - far from it - but we have thought
> about how this works and what we came up with seems like a good compromise
> between providing full DLM functionality to userspace using unix features. But
> we're very happy to listen to other ideas - and have been doing I hope.
>
> --
>
> patrick
--
Greetz, Antonio Vargas aka winden of network
http://wind.codepixel.com/
Things are not what they seem, except when they seem to be what they really are.