2001-11-03 17:47:51

by Richard Gooch

[permalink] [raw]
Subject: [PATCH] devfs v196 available

Hi, all. Version 196 of my devfs patch is now available from:
http://www.atnf.csiro.au/~rgooch/linux/kernel-patches.html
The devfs FAQ is also available here.

Patch directly available from:
ftp://ftp.??.kernel.org/pub/linux/kernel/people/rgooch/v2.4/devfs-patch-current.gz

AND:
ftp://ftp.atnf.csiro.au/pub/people/rgooch/linux/kernel-patches/v2.4/devfs-patch-current.gz

This is against 2.4.14-pre7. Highlights of this release:

- Fixed race in <devfsd_ioctl> when setting event mask
Thanks to Kari Hurtta <[email protected]>

- Avoid deadlock in <devfs_follow_link> by using temporary buffer

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]


2001-11-16 22:57:23

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

Hi,

On Sat, 3 Nov 2001, Richard Gooch wrote:

> Hi, all. Version 196 of my devfs patch is now available from:

Some small comments:
- You should consider to integrate devfs into the dcache (check e.g.
ramfs), currently you duplicate lots of infrastructure, which you get
for free (and faster) from the dcache. It would save you also lots of
locking.
- you should delay the path generation in try_modload to the read
function, then you're not limited here in the pathname length and don't
have to abuse the stack.
- you should use "%.*s" if you want to print a dentry name (no need to
copy it).
- you should do something about the recursive calls, it's an invitation to
abuse them.
- symlink/slave handling of tapes/disk/cdroms is maybe better done in
userspace.
- uid/gid in devfsd_buf_entry is 16 bit
- devfs is not a database! E.g. devfs has no business to store the
char/block device ops table. So devfs_get_ops is wrong here, the same
for devfs_[gs]et_info. devfs has to stay optional and storing/retrieving
certain device data should not be done in different ways, this is only
asking for trouble because of subtle differences. The problem so far is
that we have no [bc]dev_t, where this info should be stored, but this
hopefully changes early 2.5. So the path to get e.g. to the ops table
should be "kdev_t -> [bc]dev_t -> ops" and not "kdev_t -> search whole
devfs tree -> no ops" (because you missed manual dev nodes).
- the simple event mechanism looks prone to DOS attacks (even if all races
are gone). Events are too easily delayed or even dropped. This makes
the events unreliable and unsuitable for any serious use.
- in _devfs_make_parent_for_leaf you shouldn't simply return if
_devfs_append_entry fails, because someone else might have created the
directory since _devfs_descend.

Especially the first point is very important, this was even suggested by
Linus already more than 3 years ago. devfs could be a very thin layer, but
right now it's far bigger than it had to be.

bye, Roman



2001-11-17 21:17:28

by Richard Gooch

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

Roman Zippel writes:
> On Sat, 3 Nov 2001, Richard Gooch wrote:
>
> > Hi, all. Version 196 of my devfs patch is now available from:
>
> Some small comments:
> - You should consider to integrate devfs into the dcache (check e.g.
> ramfs), currently you duplicate lots of infrastructure, which you get
> for free (and faster) from the dcache. It would save you also lots of
> locking.

That's something that I've already said I'm planning, but that's a 2.5
thing, and I'm waiting for the dcache to be split from the icache.
Even if the current, horribly bloated, struct inode is trimmed down by
removing that union, it will still be too big for my liking. Hence I
want the dcache/icache split.

> - you should delay the path generation in try_modload to the read
> function, then you're not limited here in the pathname length and don't
> have to abuse the stack.

Yep. Now that I've got the refcounting, I can do the thing I'm doing
with other events (a for loop to increment the refcounts before
returning) for the lookup event as well. In fact, it will allow me to
clean up the devfsd_read() code as well (I never liked that test for
the event type).

> - you should use "%.*s" if you want to print a dentry name (no need to
> copy it).

Yep, good idea.

> - you should do something about the recursive calls, it's an
> invitation to abuse them.

That's why I've tagged them as such. I've tried to keep their stack
usage low. Switching to a non-recursive algorithm has it's own
problems: it's a lot harder to understand, and it's much harder to get
right in the first place. Non-recursive algorithms are more fragile.

> - symlink/slave handling of tapes/disk/cdroms is maybe better done
> in userspace.

No, because you want to be able to mount your root FS using
"root=/dev/cdroms/cdrom0", for example.

> - uid/gid in devfsd_buf_entry is 16 bit

Actually, it's uid_t and gid_t. Just like struct inode uses. When
these are increased to 32 bits, devfs will automatically be 32 bits
for these as well.

> - devfs is not a database! E.g. devfs has no business to store the
> char/block device ops table. So devfs_get_ops is wrong here, the
> same for devfs_[gs]et_info. devfs has to stay optional and
> storing/retrieving certain device data should not be done in
> different ways, this is only asking for trouble because of subtle
> differences. The problem so far is that we have no [bc]dev_t,
> where this info should be stored, but this hopefully changes early
> 2.5. So the path to get e.g. to the ops table should be "kdev_t ->
> [bc]dev_t -> ops" and not "kdev_t -> search whole devfs tree -> no
> ops" (because you missed manual dev nodes).

Again, you're talking about 2.5 stuff. If and when we get a decent
block device structure, I'll look at using that. It's questionable
whether we'll ever get a generic char device structure, since they're
all so different. So for char devices, the ops pointer should be kept
inside the devfs entry.

As for devfs_get_ops(), that has to stay until you can change over all
the SGI IA64 drivers, which (ab)use it heavily. API change-> 2.5.

> - the simple event mechanism looks prone to DOS attacks (even if all
> races are gone). Events are too easily delayed or even
> dropped. This makes the events unreliable and unsuitable for any
> serious use.

I know about the dropped events: people with some ISDN cards get them,
as did I when I was testing my sd-many patch. That's been on my list
of things to fix, once the locking and refcounting was working. It was
planned to be a v1.1 fix:-)

> - in _devfs_make_parent_for_leaf you shouldn't simply return if
> _devfs_append_entry fails, because someone else might have created
> the directory since _devfs_descend.

Yes, I know. That too was planned to be a v1.1 improvement.

> Especially the first point is very important, this was even
> suggested by Linus already more than 3 years ago. devfs could be a
> very thin layer, but right now it's far bigger than it had to be.

Yeah, on more than one occasion I've stated that I'd love to rip out
all the tree management code in devfs. I'm waiting for VFS changes to
make that palatable. Don't think for a minute that v1.x (where x is
small) is my final vision of devfs. v1.x was supposed to be the
race-free version which did it's own tree management (v0.x being the
racy, development version prior to v1.0). And v2.0 is supposed to be
the lightweight-let-the-dcache-do-the-work version. I hope that 2.5
will see further VFS improvements.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-11-18 12:07:41

by Xavier Bestel

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

le sam 17-11-2001 ? 22:16, Richard Gooch a ?crit :
> Roman Zippel writes:
> > - symlink/slave handling of tapes/disk/cdroms is maybe better done
> > in userspace.
>
> No, because you want to be able to mount your root FS using
> "root=/dev/cdroms/cdrom0", for example.

"userspace" should probably be the "extended initrd" stuff we'll have in
2.5.

Xav


2001-11-18 16:01:32

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

Hi,

Richard Gooch wrote:

> That's something that I've already said I'm planning, but that's a 2.5
> thing, and I'm waiting for the dcache to be split from the icache.
> Even if the current, horribly bloated, struct inode is trimmed down by
> removing that union, it will still be too big for my liking. Hence I
> want the dcache/icache split.

Why not do it the other way around? Such a change would require changes
to all filesystems. If devfs would already make better use of existing
infrastructure, it would be far easier to take the needs of devfs into
account, instead of trying to fit it in later. A nice side effect would
be people could help with the conversion.

> > - you should do something about the recursive calls, it's an
> > invitation to abuse them.
>
> That's why I've tagged them as such. I've tried to keep their stack
> usage low. Switching to a non-recursive algorithm has it's own
> problems: it's a lot harder to understand, and it's much harder to get
> right in the first place. Non-recursive algorithms are more fragile.

_devfs_walk_path: There is already nicely working code in fs/namei.c.
unregister: you can easily go back with the parent pointer.
find_by_dev: this is just wrong (see below)

> > - symlink/slave handling of tapes/disk/cdroms is maybe better done
> > in userspace.
>
> No, because you want to be able to mount your root FS using
> "root=/dev/cdroms/cdrom0", for example.

That's about the only use of it at boot time and can be achieved easier.

> Again, you're talking about 2.5 stuff. If and when we get a decent
> block device structure, I'll look at using that.

The correct interface in 2.4 is get_blkfops. devfs has to use it like
everyone else.

> It's questionable
> whether we'll ever get a generic char device structure, since they're
> all so different. So for char devices, the ops pointer should be kept
> inside the devfs entry.

No, find_by_dev is just wrong, scanning a whole tree for a specific
information is broken. Timing is absolutely unpredictable and prone to
abuse. The current implementation can even return wrong information.
The correct place to store char device info is fs/char_dev.c.

> As for devfs_get_ops(), that has to stay until you can change over all
> the SGI IA64 drivers, which (ab)use it heavily. API change-> 2.5.

If it depends on devfs, it's simply broken and must be fixed, that has
nothing to do with the API.

> Yeah, on more than one occasion I've stated that I'd love to rip out
> all the tree management code in devfs. I'm waiting for VFS changes to
> make that palatable.

Do the changes now and you can push the responsibility to VFS, but the
last you should do is to maintain bloat.

bye, Roman

2001-11-18 17:34:14

by Tony Hoyle

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

On Fri, 16 Nov 2001 22:59:05 +0000, Roman Zippel wrote:

> Hi,
>
> On Sat, 3 Nov 2001, Richard Gooch wrote:
>
>> Hi, all. Version 196 of my devfs patch is now available from:
>
I noticed with the 2.4.15pre kernels it doesn't support my LS-120 floppy
- there are no devfs calls in ide-floppy.c. How hard would it be to add
these, or is it a 2.5 issue?

Tony

2001-11-21 13:11:41

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

Hi,

On Tue, 20 Nov 2001, Richard Gooch wrote:

> Delayed events are harmless, since devfs ensures correct ordering.

It's not about ordering, timing is currently unpredictable, anything
timing sensitive has to be very careful to touch anything in devfs.

> After consideration, I've decided to dynamically grow the event buffer
> as required, and free up space as it's not needed.

You should use a slab cache and acknowledge events as soon as they are
finished. Right now all processes are waiting until the devfsd is
completely finished and restarted at the same time. This is currently
limited by dropping events, with a dynamic event queue this can become a
problem.

> Since devfsd has to
> wait for a module load to complete, it's not a good idea to block
> waiting for free space in the event buffer (potential deadlock).

What do you mean?

bye, Roman

2001-11-21 18:56:58

by Richard Gooch

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

Roman Zippel writes:
> Hi,
>
> On Tue, 20 Nov 2001, Richard Gooch wrote:
>
> > Delayed events are harmless, since devfs ensures correct ordering.
>
> It's not about ordering, timing is currently unpredictable, anything
> timing sensitive has to be very careful to touch anything in devfs.

Are you talking about user-space or kernel-space? I assume you mean
user-space. And it's only a problem if you have module autoloading
enabled, *and* you are attempting a lookup of a non-existant inode.
I'd argue that if you have applications which are timing sensitive,
you shouldn't be autoloading modules anyway, you should have them
statically loaded or built into the kernel. Loading a module remains a
heavyweight operation.

> > After consideration, I've decided to dynamically grow the event buffer
> > as required, and free up space as it's not needed.
>
> You should use a slab cache and acknowledge events as soon as they
> are finished. Right now all processes are waiting until the devfsd
> is completely finished and restarted at the same time. This is
> currently limited by dropping events, with a dynamic event queue
> this can become a problem.

I'm concerned about the overhead of allocating an entry. With the
current scheme, the common case is that the buffer doesn't overflow,
and hence it's usually "good enough". In addition, it's very fast,
since it takes few instructions to "allocate" an entry (grab a lock,
test and update some pointers).

If the slab cache is fast enough on allocations (for the common case),
then it has clear advantages, since it makes cleaning up very easy,
and hence it's easier to wake up waiters as events are processed.

Changing to a wake-per-event-processed scheme will require some
careful consideration, though, since the current devfs+devfsd behviour
ensures there are few surprises (everyone waits until devfsd has
finished, so you know the system is in a "finished" state before
providing access). I'm not going to be rushed into changing it until
I'm satisfied that the new behaviour is reasonable. This is definately
not a v1.1 thing.

> > Since devfsd has to
> > wait for a module load to complete, it's not a good idea to block
> > waiting for free space in the event buffer (potential deadlock).
>
> What do you mean?

One thought I had was to have event generators wait if the queue is
full, until devfsd consumes (at least) one event. But that's no good
if devfsd is responsible for generating those events (i.e. REGISTER
events when devsfd loads a module). Since devsfd can't process new
events while waiting for the module to finish loading, in effect
devfsd would have to wait on itself: deadlock.

I tried to come up with ways around this deadlock, but none of them
were pretty, so I abandoned that approach (fortunately before cutting
any code:-). So a dynamic buffer (possibly a slab cache) is probably
the only reasonable solution.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]

2001-11-23 04:06:57

by Richard Gooch

[permalink] [raw]
Subject: Re: [PATCH] devfs v196 available

Tony Hoyle writes:
> On Fri, 16 Nov 2001 22:59:05 +0000, Roman Zippel wrote:
>
> > Hi,
> >
> > On Sat, 3 Nov 2001, Richard Gooch wrote:
> >
> >> Hi, all. Version 196 of my devfs patch is now available from:
> >
> I noticed with the 2.4.15pre kernels it doesn't support my LS-120
> floppy - there are no devfs calls in ide-floppy.c. How hard would
> it be to add these, or is it a 2.5 issue?

A kind person is shipping me a (partly working) LS-120 drive. I should
finally be able to play with this myself. This issue will get resolved
in due course.

Regards,

Richard....
Permanent: [email protected]
Current: [email protected]