I have a small problem with a module I'm writing. It uses a
pseudo-filesystem, rather like futexes and pipes, so that it can hand
out special file descriptors on request.
Following the examples of pipe.c and futex.c, but specifically for a 2.4
kernel, I'm doing this:
static DECLARE_FSTYPE (mymod_fs_type, "mymod_fs",
                       mymod_read_super, FS_NOMOUNT);

static int __init mymod_init (void)
{
        int err = register_filesystem (&mymod_fs_type);
        if (err)
                return err;
        mymod_mnt = kern_mount (&mymod_fs_type);
        if (IS_ERR (mymod_mnt)) {
                unregister_filesystem (&mymod_fs_type);
                return PTR_ERR (mymod_mnt);
        }
        return 0;
}

static void __exit mymod_exit (void)
{
        mntput (mymod_mnt);
        unregister_filesystem (&mymod_fs_type);
}
Unfortunately, when I come to _unload_ the module, it can't be unloaded
because the kern_mount increments the module reference count.
(pipe.c in 2.4 appears to have the same problem, but of course nobody
can ever unload it anyway so it doesn't matter).
I'm handing out file descriptors rather like futexes from 2.5: they all
share the same dentry, which is the root of the filesystem. In my
code's case, that dentry is created in `mymod_read_super' (just the same
way as 2.4 pipe.c).
Somehow, it looks like I need to mount the filesystem when it first
generates an fd, and unmount it when the last fd is destroyed -- but is
it safe to unmount the filesystem _within_ a release function of an
inode on the filesystem?
Either that, or I need something else.
Thanks,
-- Jamie
On Wednesday 04 September 2002 20:02, Jamie Lokier wrote:
> I have a small problem with a module I'm writing. It uses a
> pseudo-filesystem, rather like futexes and pipes, so that it can hand
> out special file descriptors on request.
>
> Following the examples of pipe.c and futex.c, but specifically for a 2.4
> kernel, I'm doing this:
>
> static DECLARE_FSTYPE (mymod_fs_type, "mymod_fs",
> mymod_read_super, FS_NOMOUNT);
>
>
> static int __init mymod_init (void)
> {
> int err = register_filesystem (&mymod_fs_type);
> if (err)
> return err;
> mymod_mnt = kern_mount (&mymod_fs_type);
> if (IS_ERR (mymod_mnt)) {
> unregister_filesystem (&mymod_fs_type);
> return PTR_ERR (mymod_mnt);
> }
> }
>
> static void __exit mymod_exit (void)
> {
> mntput (mymod_mnt);
> unregister_filesystem (&mymod_fs_type);
> }
>
> Unfortunately, when I come to _unload_ the module, it can't be unloaded
> because the kern_mount increments the module reference count.
>
> (pipe.c in 2.4 appears to have the same problem, but of course nobody
> can ever unload it anyway so it doesn't matter).
>
> I'm handing out file descriptors rather like futexes from 2.5: they all
> share the same dentry, which is the root of the filesystem. In my
> code's case, that dentry is created in `mymod_read_super' (just the same
> way as 2.4 pipe.c).
>
> Somehow, it looks like I need to mount the filesystem when it first
> generates an fd, and unmount it when the last fd is destroyed -- but is
> it safe to unmount the filesystem _within_ a release function of an
> inode on the filesystem?
>
> Either that, or I need something else.
It's a good example of why the module interface is stupidly wrong, and
__exit needs to be called by the module unloader, returning 0 if it's
ok to unload. Then your __exit can check whatever condition it's interested
in and, if all is well, do the kern_umount.
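Concretely, I mean something like this (a hypothetical sketch, not how the
2.4 unloader behaves today; mymod_open_count is a made-up private counter):

static int __exit mymod_exit (void)
{
        if (atomic_read (&mymod_open_count))
                return -EBUSY;          /* unloader keeps the module */
        mntput (mymod_mnt);
        unregister_filesystem (&mymod_fs_type);
        return 0;                       /* safe to free the module */
}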
--
Daniel
On Sat, 7 Sep 2002, Daniel Phillips wrote:
> On Wednesday 04 September 2002 20:02, Jamie Lokier wrote:
> > I have a small problem with a module I'm writing. It uses a
> > pseudo-filesystem, rather like futexes and pipes, so that it can hand
> > out special file descriptors on request.
> >
> > Following the examples of pipe.c and futex.c, but specifically for a 2.4
> > kernel, I'm doing this:
> >
> > static DECLARE_FSTYPE (mymod_fs_type, "mymod_fs",
> > mymod_read_super, FS_NOMOUNT);
> >
> >
> > static int __init mymod_init (void)
> > {
> > int err = register_filesystem (&mymod_fs_type);
> > if (err)
> > return err;
> > mymod_mnt = kern_mount (&mymod_fs_type);
> > if (IS_ERR (mymod_mnt)) {
> > unregister_filesystem (&mymod_fs_type);
> > return PTR_ERR (mymod_mnt);
> > }
> > }
> >
> > static void __exit mymod_exit (void)
> > {
> > mntput (mymod_mnt);
> > unregister_filesystem (&mymod_fs_type);
> > }
> >
> > Unfortunately, when I come to _unload_ the module, it can't be unloaded
> > because the kern_mount increments the module reference count.
> >
> > (pipe.c in 2.4 appears to have the same problem, but of course nobody
> > can ever unload it anyway so it doesn't matter).
> >
> > I'm handing out file descriptors rather like futexes from 2.5: they all
> > share the same dentry, which is the root of the filesystem. In my
> > code's case, that dentry is created in `mymod_read_super' (just the same
> > way as 2.4 pipe.c).
> >
> > Somehow, it looks like I need to mount the filesystem when it first
> > generates an fd, and unmount it when the last fd is destroyed -- but is
> > it safe to unmount the filesystem _within_ a release function of an
> > inode on the filesystem?
> >
> > Either that, or I need something else.
You need something else - namely, you need the code to follow the semantics
you want for the lifetime of that beast.
If your rules are "it's pinned as long as there are opened files created
by foo()" - very well, there are two variants. The basic idea is the same
- have sum of ->mnt_count for all vfsmounts of our type bumped whenever we
call foo() and dropped whenever the final fput() is done on a file created by foo().
(a) Trivial variant:

in foo() have

        file->f_vfsmnt = do_kern_mount(...);
        file->f_dentry = dget(file->f_vfsmnt->mnt_root);

and lose kern_mount() in init.

(b) Slightly optimized one:

        spinlock_t lock;
        int count;
        struct vfsmount *mnt;

in foo()

        p = NULL;
        spin_lock(&lock);
        if (!count++) {
                /* slow path */
                count--;
                spin_unlock(&lock);
                p = do_kern_mount(...);
                spin_lock(&lock);
                if (!count++)
                        mnt = p;
        }
        mntget(mnt);
        spin_unlock(&lock);
        mntput(p);
        file->f_vfsmnt = mnt;
        file->f_dentry = dget(mnt->mnt_root);

and in ->release() for these files

        spin_lock(&lock);
        if (!--count)
                mnt = NULL;
        spin_unlock(&lock);

(and lose kern_mount() in init, again)
In both variants destruction of superblock is triggered by final fput() -
upon mntput(file->f_vfsmnt); the difference being that (b) takes some
care to avoid allocation of new vfsmounts if we already have one pinned
down by opened file - whether that optimization is worth the trouble depends
on intended use.
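For concreteness, variant (a) fleshed out as a 2.4-style sketch; the mymod_*
names and the descriptor plumbing are invented, error handling is minimal,
and the do_kern_mount() argument list is from memory:

static int mymod_new_fd(void)
{
        struct vfsmount *mnt;
        struct file *file;
        int fd;

        /* one vfsmount per file: this replaces kern_mount() in init */
        mnt = do_kern_mount("mymod_fs", 0, "mymod_fs", NULL);
        if (IS_ERR(mnt))
                return PTR_ERR(mnt);

        file = get_empty_filp();
        if (!file) {
                mntput(mnt);
                return -ENFILE;
        }

        fd = get_unused_fd();
        if (fd < 0) {
                put_filp(file);
                mntput(mnt);
                return fd;
        }

        file->f_vfsmnt = mnt;                   /* pins the superblock ...    */
        file->f_dentry = dget(mnt->mnt_root);   /* ... until the final fput() */
        file->f_op = &mymod_fops;               /* placeholder file_operations */
        file->f_mode = FMODE_READ | FMODE_WRITE;

        fd_install(fd, file);
        return fd;
}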
> It's a good example of why the module interface is stupidly wrong, and
> __exit needs to be called by the module unloader, returning 0 if it's
> ok to unload. Then your __exit can whatever condition it's interested
> in and, if all is well, do the kern_umount.
BS. Instead of playing silly buggers with "oh, we will start exiting
and maybe we'll bail out, let's just hope we won't find that we want
to do that after we'd destroyed something" you need to decide what kind
of rules you really want for the module lifetime. The rest is trivial.
Again, variant (a) (which is absolutely straightforward - add one line
in foo(), modify one line in foo(), delete one line in init) is enough
to give the desired rules. Optimizing it if needed is not too hard -
see (b) for one possible variant...
Alexander Viro wrote:
> If your rules are "it's pinned as long as there are opened files created
> by foo()" - very well, there are two variants. The basic idea is the same
> - have sum of ->mnt_count for all vfsmounts of our type bumped whenever we
> call foo() and drop whenever final fput() is done on a file created by foo().
Thanks -- that's what I implemented, except I used a semaphore instead
of a spinlock.
I wanted to check that it's safe to call `mntput' from `->release()',
which seems like quite a dubious thing to depend on. But if you say it
is safe, that's cool.
> > It's a good example of why the module interface is stupidly wrong, and
> > __exit needs to be called by the module unloader, returning 0 if it's
> > ok to unload. Then your __exit can whatever condition it's interested
> > in and, if all is well, do the kern_umount.
>
> BS. Instead of playing silly buggers with "oh, we will start exiting
> and maybe we'll bail out, let's just hope we won't find that we want
> to do that after we'd destroyed something" you need to decide what kind
> of rules you really want for the module lifetime. The rest is trivial.
> Again, variant (a) (which is absolutely straightforward - add one line
> in foo(), modify one line in foo(), delete one line in init) is enough
> to give the desired rules. Optimizing it if needed is not too hard -
> see (b) for one possible variant...
Unfortunately, your suggestion, which I ended up implementing, is not
safe from race conditions.
The problem comes during the call to `->release()'. If that's really
the last reference to the module, then as soon as I call `mntput' the
module might be unloaded. In practice this doesn't happen, but if there
were a long scheduling delay... (see CONFIG_PREEMPT), it could.
-- Jamie
On Sat, 7 Sep 2002, Jamie Lokier wrote:
> Alexander Viro wrote:
> > If your rules are "it's pinned as long as there are opened files created
> > by foo()" - very well, there are two variants. The basic idea is the same
> > - have sum of ->mnt_count for all vfsmounts of our type bumped whenever we
> > call foo() and drop whenever final fput() is done on a file created by foo().
>
> Thanks -- that's what I implemented, except I used a semaphore instead
> of a spinlock.
>
> I wanted to check that it's safe to call `mntput' from `->release()',
> which seems like quite a dubious thing to depend on. But if you say it
> is safe, that's cool.
It is neither safe nor needed. Please, look at the previous posting again -
neither variant calls mntput() in ->release().
Now, __fput() _does_ call mntput() - always. And yes, if that happens to
be the final reference - it's OK.
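For reference, the tail of 2.4 __fput() goes roughly like this (paraphrased
from memory, not a verbatim copy of fs/file_table.c):

        if (file->f_op && file->f_op->release)
                file->f_op->release(inode, file);       /* module code runs here */
        /* ... */
        dput(dentry);
        mntput(mnt);    /* may drop the final reference to the vfsmount and
                         * shut the superblock down - by this point we are
                         * no longer executing the module's code */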
Alexander Viro wrote:
> It is neither safe nor needed. Please, look at the previous posting again -
> neither variant calls mntput() in ->release().
>
> Now, __fput() _does_ call mntput() - always. And yes, if that happens to
> be the final reference - it's OK.
Thanks, that's really nice.
I'd assumed `kern_mount' was similar to mounting a normal filesystem,
but in a non-existent namespace. Traditionally in unix you can't
unmount a filesystem when it's in use, and mounts don't disappear when
the last file being used on them disappears.
But you've rather cutely arranged that these kinds of mount _do_
disappear when the last file being used on them disappears. Clever, if
a bit disturbing.
-- Jamie
On Sun, 8 Sep 2002, Jamie Lokier wrote:
> Alexander Viro wrote:
> > It is neither safe nor needed. Please, look at the previous posting again -
> > neither variant calls mntput() in ->release().
> >
> > Now, __fput() _does_ call mntput() - always. And yes, if that happens to
> > be the final reference - it's OK.
>
> Thanks, that's really nice.
>
> I'd assumed `kern_mount' was similar to mounting a normal filesystem,
> but in a non-existent namespace. Traditionally in unix you can't
> unmount a filesystem when its in use, and mounts don't disappear when
> the last file being used on them disappears.
>
> But you've rather cutely arranged that these kinds of mount _do_
> disappear when the last file being used on them disappears. Clever, if
> a bit disturbing.
Normal mount _is_ do_kern_mount() + pinning down the resulting vfsmount +
attaching it to mountpoint.
The second part is what keeps them alive until fs is unmounted. Moreover,
umount itself is just
        - check refcount, return -EBUSY if the thing is busy and
          MNT_DETACH is not given
        - detach from mountpoint
        - drop the reference
Notice that umount _with_ MNT_DETACH will happily skip that check - and
that will simply detach filesystem from mountpoint with fs shutdown
postponed until it stops being busy.
IOW, filesystem shutdown is _always_ a garbage collection - there's nothing
special about vfsmounts that are not mounted somewhere. Check for fs
being busy in umount(2) is just compatibility code - the kernel is perfectly
happy with dissolving mountpoints, busy or not. If stuff mounted is not
busy it will be shut down immediately, if not - when it stops being busy.
It's actually very similar to relation between open(), unlink() and close() -
you can unlink opened files and their contents will be there until the
final close(); then it will be killed.
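A userspace illustration of the MNT_DETACH point, assuming a glibc new
enough to expose umount2() and MNT_DETACH:

#include <sys/mount.h>

/* lazy unmount: detach now, let the kernel shut the filesystem down
 * whenever it stops being busy - i.e. shutdown as garbage collection */
int lazy_umount(const char *mountpoint)
{
        return umount2(mountpoint, MNT_DETACH);
}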
On Sunday 08 September 2002 04:21, Jamie Lokier wrote:
> Alexander Viro wrote:
> > It is neither safe nor needed. Please, look at the previous posting again -
> > neither variant calls mntput() in ->release().
> >
> > Now, __fput() _does_ call mntput() - always. And yes, if that happens to
> > be the final reference - it's OK.
>
> Thanks, that's really nice.
>
> I'd assumed `kern_mount' was similar to mounting a normal filesystem,
> but in a non-existent namespace. Traditionally in unix you can't
> unmount a filesystem when its in use, and mounts don't disappear when
> the last file being used on them disappears.
>
> But you've rather cutely arranged that these kinds of mount _do_
> disappear when the last file being used on them disappears. Clever, if
> a bit disturbing.
And it's not a good way to drive module unloading. It is rmmod that
should cause a module to be unloaded, not close. The final close
*allows* the module to be unloaded, it does not *cause* it to be. So
to get the expected behaviour, you have to lather on some other fanciful
construction to garbage collect modules ready to be unloaded, or to let
rmmod inquire the state of the module in the process of attempting to
unload it, and not trigger the nasty races we've discussed. Enter
fancy locking regime that 3 people in the world actually understand.
--
Daniel
Daniel Phillips wrote:
> > But you've rather cutely arranged that these kinds of mount _do_
> > disappear when the last file being used on them disappears. Clever, if
> > a bit disturbing.
>
> And it's not a good way to drive module unloading. It is rmmod that
> should cause a module to be unloaded, not close. The final close
> *allows* the module to be unloaded, it does not *cause* it to be. So
> to get the expected behaviour, you have to lather on some other fanciful
> construction to garbage collect modules ready to be unloaded, or to let
> rmmod inquire the state of the module in the process of attempting to
> unload it, and not trigger the nasty races we've discussed. Enter
> fancy locking regime that 3 people in the world actually understand.
Eh? In this case, Al Viro's scheme is really simple and works: the
kern_mount keeps the module use-count non-zero so long as any file
descriptors are using the module's filesystem. fput() decrements the
use-count at a safe time -- no race conditions.
The expected behaviour is as it has always been: rmmod fails if anyone
is using the module, and succeeds if nobody is using the module. The
garbage collection of modules is done using "rmmod -a" periodically, as
it always has been.
-- Jamie
On Monday 09 September 2002 21:48, Jamie Lokier wrote:
> Daniel Phillips wrote:
> > > But you've rather cutely arranged that these kinds of mount _do_
> > > disappear when the last file being used on them disappears. Clever, if
> > > a bit disturbing.
> >
> > And it's not a good way to drive module unloading. It is rmmod that
> > should cause a module to be unloaded, not close. The final close
> > *allows* the module to be unloaded, it does not *cause* it to be. So
> > to get the expected behaviour, you have to lather on some other fanciful
> > construction to garbage collect modules ready to be unloaded, or to let
> > rmmod inquire the state of the module in the process of attempting to
> > unload it, and not trigger the nasty races we've discussed. Enter
> > fancy locking regime that 3 people in the world actually understand.
>
> Eh? In this case, Al Viro's scheme is really simple and works: the
> kern_mount keeps the module use-count non-zero so long as any file
> descriptors are using the module's filesystem. fput() decrements the
> use-count at a safe time -- no race conditions.
It wasn't obvious to you, was it? So how can you call it simple? For
modules, we need something that is truly simple, and having __exit return
a flag is it. That is not to say that Al's mechanism is wrong, at least
as far as filesystem modules go. In this case you'd rely on (part of)
it to determine the correct flag to return. It's just that this is not
the most straightforward mechanism for the job, and therefore wrong.
Besides, how much sense does it make for __exit to be incapable of
returning an error code?
> The expected behaviour is as it has always been: rmmod fails if anyone
> is using the module, and succeeds if nobody is using the module. The
> garbage collection of modules is done using "rmmod -a" periodically, as
> it always has been.
Al's analysis is all focussed on the filesystem case, where you can
prove assertions about whether the subsystem defined by the module is
active or not. This doesn't cover the whole range of module applications.
There is a significant class of module types that must rely on scheduling
techniques to prove inactivity. My suggestion covers both, Al has his
blinders on.
Designs are not always correct just because they work.
--
Daniel
On Monday 09 September 2002 21:48, Jamie Lokier wrote:
> The expected behaviour is as it has always been: rmmod fails if anyone
> is using the module, and succeeds if nobody is using the module. The
> garbage collection of modules is done using "rmmod -a" periodically, as
> it always has been.
When you say 'rmmod modulename' the module is supposed to be removed, if
it can be. That is the user's expectation, and qualifies as 'obviously
correct'.
Garbage collecting should *not* be the primary mechanism for removing
modules, that is what rmmod is for. Neither should a filesystem module
magically disappear from the system just because the last mount went
away, unless the module writer very specifically desires that. This is
where the obfuscating opinion is coming from: Al has come up with an
application where he wants the magic disappearing behavior and wants
to impose it on the rest of the world, regardless of whether it makes
sense.
--
Daniel
On Monday 09 September 2002 21:48, Jamie Lokier wrote:
> The expected behaviour is as it has always been: rmmod fails if anyone
> is using the module, and succeeds if nobody is using the module. The
> garbage collection of modules is done using "rmmod -a" periodically, as
> it always has been.
Actually, it would be more useful if I stated the following simple fact:
Returning a flag from __exit definitively gets rid of one race, that is
the race where a module's memory can be freed while __exit is active.
To get rid of this race by other means you have to put in place some
fancy mechanism. This alone should be enough reason to do it the way
I suggest, besides the fact that it is a simpler, more obvious and more
robust interface.
--
Daniel
Daniel Phillips wrote:
> > The expected behaviour is as it has always been: rmmod fails if anyone
> > is using the module, and succeeds if nobody is using the module. The
> > garbage collection of modules is done using "rmmod -a" periodically, as
> > it always has been.
>
> When you say 'rmmod modulename' the module is supposed to be removed, if
> it can be. That is the user's expectation, and qualifies as 'obviously
> correct'.
>
> Garbage collecting should *not* be the primary mechanism for removing
> modules, that is what rmmod is for. Neither should a filesystem module
> magically disappear from the system just because the last mount went
> away, unless the module writer very specifically desires that. This is
> where the obfuscating opinion is coming from: Al has come up with an
> application where he wants the magic disappearing behavior and wants
> to impose it on the rest of the world, regardless of whether it makes
> sense.
I think you've misunderstood. The module does _not_ disappear when the
last file reference is closed. Its reference count is decremented,
that is all. Just the same as if you managed the reference count
yourself. You still need rmmod to actually remove the module.
-- Jamie
Daniel Phillips wrote:
> It wasn't obvious to you, was it. So how can you call it simple.
I have to agree.
> [...] This doesn't cover the whole range of module applications.
> There is a significant class of module types that must rely on
> sheduling techniques to prove inactivity. My suggestion covers both,
> Al has his blinders on.
Unfortunately, having cleanup_module() return a value doesn't necessarily
make things simpler. Sure, it's a general solution, but it's not always
easier to use.
Typically, your module's resources are protected by a lock or so.
cleanup_module() could take this lock, check any private reference
counts, and (because it has the lock) decide whether to proceed with
unregistering the module's resources. Once it begins unregistering
resources, it's pretty committed to unregistering them all and saying it
exited ok.
Unfortunately, once it has the lock, and the reference counts are all
zero, it's still _not_ generally safe to clean up a module.
This is because any other function, for example a release() op on a
file, or a remove() op on a PCI device, can't take a module's private
lock, decrement the private reference count, release the lock and
return. There's a race between releasing the lock and returning, where
it still isn't safe to remove the module's memory.
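Spelled out, the racy pattern looks like this (the mymod_* names are
invented):

static int mymod_release (struct inode *inode, struct file *file)
{
        spin_lock (&mymod_lock);
        mymod_users--;                  /* module's private refcount */
        spin_unlock (&mymod_lock);
        /* If cleanup_module() in another task sees mymod_users == 0 right
         * now, takes the lock and tears everything down, the text of this
         * very function can be freed before the return below executes. */
        return 0;
}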
Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is
enabled.
In other words, the module's idea of whether its own resources are no
longer in use _must_ be released by code outside the module - or at
very least, locks protecting that information must be released outside
the module.
> Designs are not always correct just because they work.
Unfortunately it's not immediately clear to me that having
cleanup_module() be able to abort an exit actually helps.
Doing so with RCU-style "wait until none of my module functions could
possibly be in their race window" might work, though. If you could 100%
trust GCC's sibling call optimisation, variations on
`spin_unlock_and_return' and `up_and_return' could be written.
But even if you can write code that's safe, is it likely to be
understood by most module authors? If not, it's no better than Al
Viro's filesystem method.
-- Jamie
On Tuesday 10 September 2002 02:44, Jamie Lokier wrote:
> Daniel Phillips wrote:
> > It wasn't obvious to you, was it. So how can you call it simple.
>
> I have to agree.
>
> > [...] This doesn't cover the whole range of module applications.
> > There is a significant class of module types that must rely on
> > sheduling techniques to prove inactivity. My suggestion covers both,
> > Al has his blinders on.
>
> Unfortunately having cleanup_module() return a value don't necessarily
> make things simpler. Sure, it's a general solution, but it's not always
> easier to use.
It's always as easy to use, or easier. It's certainly more obvious.
> Typically, your module's resources are protected by a lock or so.
> cleanup_module() could take this lock, check any private reference
> counts, and (because it has the lock) decide whether to proceed with
> unregistering the module's resources. Once it begins unregistering
> resources, it's pretty committed to unregistering them all and saying it
> exited ok.
And failing silently if it can't?
> Unfortunately, once it has the lock, and the reference counts are all
> zero, it's still _not_ generally safe to cleanup up a module.
>
> This is because any other function, for example a release() op on a
> file, or a remove() op on a PCI device, can't take a module's private
> lock, decrement the private reference count, release the lock and
> return. There's a race between releasing the lock and returning, where
> it still isn't safe to remove the module's memory.
>
> Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is
> enabled.
That's exactly the race that is removed by having the module subsystem call
__exit to remove the module. The module subsystem checks __exit's flag on
return and releases the lock itself, so there is no window after releasing
the lock when the releasor is still executing in the module.
> In other words, the module's idea of whether it's own resources are no
> longer in use _must_ be released by code outside the module - or at
> very least, locks protecting that information must be released outside
> the module.
Yup, that's exactly what I've proposed, in the simplest way possible.
> > Designs are not always correct just because they work.
>
> Unfortunately it's not immediately clear to me that having
> cleanup_module() be able to abort an exit actually helps.
Silent failure is about the worst thing that we could possibly design into
the system, but that's not even the issue I'm going on about - because I
think it's so appallingly obvious it's wrong, I assume everybody can see that.
Rather, it's the simple fact that this is the obvious interface a naive
person would expect, and nobody has presented a rational argument for not
using it.
> Doing so with RCU-style "wait until none of my module functions could
> possible be in their race window" might work, though. If you could 100%
> trust GCC's sibling call optimisation, variations on
> `spin_unlock_and_return' and `up_and_return' could be written.
>
> But even if you can write code that's safe, is it likely to be
> understood by most module authors? If not, it's no better than Al
> Viro's filesystem method.
It solves one of the races in a tidy, obviously correct way. There are other
races that are much more difficult (which don't for the most part apply to
filesystem modules) where we have to do cute things to be sure that all
threads are out of the module; however, that is an orthogonal issue, and when
last sighted, it had a workable solution on the table, which requires each
task to schedule once. CONFIG_PREEMPT is a trivial issue: just increment the
preempt counter and nobody will preempt on you while you run the magic
quiescence test. The module runs this test, by the way, so that only modules
that can't figure this out by some less intrusive means have to impose
themselves on the rest of the system this way.
Anyway, this does not replace Al's filesystem method, it *uses* it, but only
where appropriate.
Of course we can design a more complex method for accomplishing the same
thing, but why? Have you looked at module.c, by the way? If you have and
you like it, you are one sick puppy ;-)
--
Daniel
On Mon, 9 Sep 2002, Jamie Lokier wrote:
> Daniel Phillips wrote:
> > > The expected behaviour is as it has always been: rmmod fails if anyone
> > > is using the module, and succeeds if nobody is using the module. The
> > > garbage collection of modules is done using "rmmod -a" periodically, as
> > > it always has been.
> >
> > When you say 'rmmod modulename' the module is supposed to be removed, if
> > it can be. That is the user's expectation, and qualifies as 'obviously
> > correct'.
> >
> > Garbage collecting should *not* be the primary mechanism for removing
> > modules, that is what rmmod is for. Neither should a filesystem module
> > magically disappear from the system just because the last mount went
> > away, unless the module writer very specifically desires that. This is
> > where the obfuscating opinion is coming from: Al has come up with an
> > application where he wants the magic disappearing behavior and wants
> > to impose it on the rest of the world, regardless of whether it makes
> > sense.
Huh?
> I think you've misunderstood. The module does _not_ disappear when the
> last file reference is closed. It's reference count is decremented,
> that is all. Just the same as if you managed the reference count
> yourself. You still need rmmod to actually remove the module.
Never let the facts to stand in a way of a rant. Or presume that ability to
write implies ability to read, for that matter...
Daniel Phillips wrote:
> > ... Once it begins unregistering
> > resources, it's pretty committed to unregistering them all and saying it
> > exited ok.
>
> And failing silently if it can't?
It could make a noise, but it will still be a failure. Once you've
unregistered one of the module's resources (e.g. a PCI device
registration), there is no turning back. If the next thing you
unregister (e.g. a filesystem) fails, what are you going to do? Try to
re-register the PCI device?
Basically, once you've started unregistering resources, if there's an
error you're left in an inconsistent state. You can try to get back to
one of the states "module installed" or "module removed", but there's no
guarantee of either -- and trying to do that would certainly complicate
cleanup_module() functions.
> > Unfortunately, once it has the lock, and the reference counts are all
> > zero, it's still _not_ generally safe to cleanup up a module.
> >
> > This is because any other function, for example a release() op on a
> > file, or a remove() op on a PCI device, can't take a module's private
> > lock, decrement the private reference count, release the lock and
> > return. There's a race between releasing the lock and returning, where
> > it still isn't safe to remove the module's memory.
> >
> > Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is
> > enabled.
>
> That's exactly the race that is removed by having the module subsystem call
> __exit to remove the module. Since the module subsystem checks the __exit's
> flag on return and releases the lock, so there is no window after releasing
> the lock when the releasor is still executing in the module.
Please say if you mean "cleanup_module()" when you say "__exit".
Assuming you do, how can __exit safely decide whether to return 0 or -1?
> > In other words, the module's idea of whether it's own resources are no
> > longer in use _must_ be released by code outside the module - or at
> > very least, locks protecting that information must be released outside
> > the module.
>
> Yup, that's exactly what I've proposed, in the simplest way possible.
Then you've completely confused me, because you seem to be saying that a
module's own cleanup function returns a success/failure result to the
module subsystem. Or is that not what you mean when you say __exit?
(Which isn't the name of a function, btw).
> Silent failure is about the worst thing that we could possibly design into
> the system, but that's not even the issue I'm going on about - because I
> think it's so appallingly obvious it's wrong, I assume everybody can
> see that.
I'm sure there would be no harm in cleanup_module() returning an error
code, but if it did all we could do is log it.
> Rather, it's the simple fact that this is the obvious interface a naive
> person would expect, and nobody has presented a rational argument for not
> using it.
Well it's not obvious, because either I don't understand what you are
describing, or you don't understand when I explain that it doesn't work :-)
> [...] we have to do cute things to be sure that all threads are out of
> the module, however, that is an othogonal issue, and when last
> sighted, it had a workable solution on the table, which requires each
> task to schedule once. Config_preempt is a trivial issue: just
> increment the preempt counter and nobody will preempt on you while you
> run the magic quiescence test.
You have to increment the preempt counter in the threads, not in the
path that's testing for a quiescent state, so that when a thread does
set_finished_flag/dec_ref_count -> unlock -> exit, there can be no
preempt between the unlock and the exit.
This is all made simpler by using complete_and_exit() these days.
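(A sketch of what I mean, assuming the module runs a kernel thread; the
mymod_* names are invented:)

static DECLARE_COMPLETION (mymod_thread_exited);
static int mymod_stop;

static int mymod_thread (void *unused)
{
        while (!mymod_stop) {
                /* ... do the module's work, sleep, etc. ... */
                schedule ();
        }
        /* Signal the waiter and exit without ever returning into module
         * text: no window between "I'm finished" and "I'm gone". */
        complete_and_exit (&mymod_thread_exited, 0);
}

static void mymod_stop_thread (void)
{
        mymod_stop = 1;
        wait_for_completion (&mymod_thread_exited);
}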
> Of course we can design a more complex method for accomplishing the same
> thing, but why?
Well, first can you please explain what you mean by "calling __exit"?
Do you mean "calling cleanup_module()", which is the the function in a
module that's declared using the `module_exit' macro?
If so then I maintain there are non-trivial races that bite nearly every
use of MOD_DEC_USE_COUNT from within the module.
-- Jamie
On Tuesday 10 September 2002 03:56, Jamie Lokier wrote:
> Daniel Phillips wrote:
> > > ... Once it begins unregistering
> > > resources, it's pretty committed to unregistering them all and saying it
> > > exited ok.
> >
> > And failing silently if it can't?
>
> It could make a noise, but it will still be a failure. Once you've
> unregistered one of the module's resources (e.g. a PCI device
> registration), there is no turning back. If the next thing you
> unregister (e.g. a filesystem) fails, what are you going to do? Try to
> re-register the PCI device?
The module unloader is going to look at the error code and decide if it can
do something about it (it probably can't). Otherwise, it will report the
error in a standard, sane way, and the deluxe version will attempt to wall
off the damage. What would you prefer, that __exit should fail silently,
leaving various resources in random states, and that the remaining,
functional part of the system should never be told about it? Granted,
that's a time-honored approach to error handling, but it falls far short
of adequate.
> Basically, once you've started unregistering resources, if there's an
> error you're left in an inconsistent state. You can try to get back to
> one of the states "module installed" or "module removed", but there's no
> guarantee of either -- and trying to do that would certainly complicated
> cleanup_module() functions.
I'm not advocating that somebody solve the whole 'how do we recover' problem
this week. That's the kind of trouble IBM would go to on a mainframe, and
they'd get it right, but we have lots more low hanging fruit before we get
to that level of detail.
> > > Unfortunately, once it has the lock, and the reference counts are all
> > > zero, it's still _not_ generally safe to cleanup up a module.
> > >
> > > This is because any other function, for example a release() op on a
> > > file, or a remove() op on a PCI device, can't take a module's private
> > > lock, decrement the private reference count, release the lock and
> > > return. There's a race between releasing the lock and returning, where
> > > it still isn't safe to remove the module's memory.
> > >
> > > Even waiting for a schedule to happen won't help if CONFIG_PREEMPT is
> > > enabled.
> >
> > That's exactly the race that is removed by having the module subsystem call
> > __exit to remove the module. Since the module subsystem checks the __exit's
> > flag on return and releases the lock, so there is no window after releasing
> > the lock when the releasor is still executing in the module.
>
> Please say if you mean "cleanup_module()" when you say "__exit".
Yes, I mean:
        -void cleanup_module(void)
        +int cleanup_module(void)

Or possibly:

        +int cleanup_module(struct module *module, int funny_flags)
> Assuming you do, how can __exit safely decide whether to return 0 or -1?
Well, I'd suggest negative for an error code, or positive if the error is
simply "this module still seems to have this many users". For filesystems
it simply looks at the use count, which Al maintains appropriately. Locking
has been arranged by module.c such that this won't increase during the
course of the cleanup_module call. The module exit routine does:
        if (module->count)
                return module->count;

        return clean_up_whatever();
For more difficult cases, such as SCM where we can't know so easily if there
are tasks executing in the module, it has to be more like:
        if (module->count)
                return module->count;

        if ((howmany = there_are_still_users_magic_test()))
                return howmany;

        return clean_up_whatever();
> > > In other words, the module's idea of whether it's own resources are no
> > > longer in use _must_ be released by code outside the module - or at
> > > very least, locks protecting that information must be released outside
> > > the module.
> >
> > Yup, that's exactly what I've proposed, in the simplest way possible.
>
> Then you've completely confused me, because you seem to be saying that a
> module's own cleanup function returns a success/failure result to the
> module subsystem.
That's exactly what I'm saying.
> > Silent failure is about the worst thing that we could possibly design into
> > the system, but that's not even the issue I'm going on about - because I
> > think it's so appallingly obvious it's wrong, I assume everybody can
> > see that.
>
> I'm sure there would be no harm in cleanup_module() returning an error
> code, but if it did all we could do is log it.
That's a start. If we were IBM, writing a mainframe executive, we'd also
clean up all the resources, which we'd have kept track of carefully.
But that's a little ambitious for us at the moment, given that we don't
even have clear agreement that errors should be reported.
> > Rather, it's the simple fact that this is the obvious interface a naive
> > person would expect, and nobody has presented a rational argument for not
> > using it.
>
> Well it's not obvious, because either I don't understand what you are
> describing, or you don't understand when I explain that it doesn't work :-)
Eh. You now have sufficient data to explain to me why it doesn't work,
if it really doesn't.
> > [...] we have to do cute things to be sure that all threads are out of
> > the module, however, that is an othogonal issue, and when last
> > sighted, it had a workable solution on the table, which requires each
> > task to schedule once. Config_preempt is a trivial issue: just
> > increment the preempt counter and nobody will preempt on you while you
> > run the magic quiescence test.
>
> You have to increment the preempt counter in the threads, not in the
> path that's testing for a quiescent state, so that when a thread does
> set_finished_flag/dec_ref_count -> unlock -> exit, there can be no
> preempt between the unlock and the exit.
>
> This is all made simpler by using complete_and_exit() these days.
Hmm, lovely documentation there. Different rant, calm down, ignore ;-)
It's not just the little detail of how you get out crashlessly after
the test, the hard part is proving no other threads are in the
module. And yes, you have to increment *all* the preempt counts,
I did indeed omit that detail, nonetheless, it's just a detail. I'm
willing to believe the scheduling crew have the there_are_still_users
magic test problem under control, at least they seemed to be headed in
that direction last time it got beaten to death on the list.
> If so then I maintain there are non-trivial races that bite nearly every
> use of MOD_DEC_USE_COUNT from within the module.
Let's see if you can find one in the above.
--
Daniel
Daniel Phillips wrote:
> Locking
> has been arranged by module.c such that this won't increase during the
> course of the cleanup_module call. The module exit routine does:
>
> if (module->count)
> return module->count;
>
> return clean_up_whatever();
If any code within the module does `MOD_DEC_USE_COUNT; return;', you
have a race condition.
That's why nowadays you shouldn't see this. The main reference points
to a module are either function calls from another module (already count
as references), or a file/filesystem/blockdev/netdevice: these all have
an `owner' field now, so that the MOD_DEC_USE_COUNT can take place
outside their modules.
If you do still see MOD_DEC_USE_COUNT in a module, then there's a race
condition. Some task calls the function which does MOD_DEC_USE_COUNT,
and some _other_ task calls sys_delete_module(). Lo, the module is
cleaned up and destroyed by the second task while it's in the small
window just before `return' in the first task.
Given that, you need to either (a) not call MOD_DEC_USE_COUNT in the
module, and use the `owner' fields of various structures instead, or (b)
do something quite clever involving a quiescent state and some flags.
Note that (b) is not trivial, as you can't just do
`wait_for_quiescent_state()' followed by `if (module->count)': it's
possible that someone may call MOD_INC_USE_COUNT after the quiescent
state has passed, but before you finish cleaning up.
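For (a), the 2.4 idiom is just to fill in the ->owner field and never touch
the count inside the module (the mymod_* names are placeholders):

static struct file_operations mymod_fops = {
        owner:          THIS_MODULE,    /* fops_get()/fops_put() in the VFS
                                         * adjust the module count through
                                         * this, outside the module */
        read:           mymod_read,
        release:        mymod_release,  /* no MOD_DEC_USE_COUNT in here */
};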
-- Jamie
ps. I'm not sure what "SCM" means, so can't comment on that.
On Tuesday 10 September 2002 05:26, Jamie Lokier wrote:
> Daniel Phillips wrote:
> > Locking
> > has been arranged by module.c such that this won't increase during the
> > course of the cleanup_module call. The module exit routine does:
> >
> > if (module->count)
> > return module->count;
> >
> > return clean_up_whatever();
>
> If any code within the module does `MOD_DEC_USE_COUNT; return;', you
> have a race condition.
>
> That's why nowadays you shouldn't see this. The main reference points
> to a module are either function calls from another module (already count
> as references), or a file/filesystem/blockdev/netdevice: these all have
> an `owner' field now, so that the MOD_DEC_USE_COUNT can take place
> outside their modules.
Thanks for telling me this, but I know it. The manner in which the
module use count is protected is completely irrelevant to the code I
showed. The cleanup_module just knows the count is protected and
stable.
> If you do still see MOD_DEC_USE_COUNT in a module, then there's a race
> condition. Some task calls the function which does MOD_DEC_USE_COUNT,
> and some _other_ task calls sys_delete_module(). Lo, the module is
> cleaned up and destroyed by the second task while it's in the small
> window just before `return' in the first task.
>
> Given that, you need to either (a) not call MOD_DEC_USE_COUNT in the
> module, and use the `owner' fields of various structures instead, or (b)
> something quite clever involving a quiescent state and some flags.
>
> Note that (b) is not trivial, as you can't just do
> `wait_for_quiescent_state()' followed by `if (module->count)': it's
> possible that someone may call MOD_INC_USE_COUNT after the quiescent
> state has passed, but before you finish cleaning up.
You promised to find a race or some other deficiency in my code, but
instead you pointed out some other interesting, though incidental, fact.
If you look at my code snippet for cleanup_module, you'll see it
doesn't decrement anything, though it could if it wanted to, because
it's called under protection provided by module.c.
The way the module keeps track of its unloadable/not unloadable
state is up to it, which makes sense, n'est-ce pas? Given that
there is no way you can make the MOD_DEC_USE_COUNT concept work
for SCM, for example.
There is in fact no choice other than to bite the bullet and
implement some form of magic_wait_for_quiescent_state(), however
hard that may be, to be used in the cases where some precise
accounting method simply isn't possible. The alternative is to
forbid unloading of some types of modules, and I hope you
realize what kind of criticism that would invite.
Anyway, solutions to the difficulties are known. We are simply
quibbling over whether we should use an obscure interface or a
simple, transparent one.
--
Daniel
[email protected] (Daniel Phillips) wrote on 09.09.02 in <E17oUzP-0006tu-00@starship>:
> On Monday 09 September 2002 21:48, Jamie Lokier wrote:
> > The expected behaviour is as it has always been: rmmod fails if anyone
> > is using the module, and succeeds if nobody is using the module. The
> > garbage collection of modules is done using "rmmod -a" periodically, as
> > it always has been.
>
> Actually, it would be more useful if I stated the following simple fact:
> Returning a flag from __exit definitively gets rid of one race, that is
> the race where a module's memory can be freed while __exit is active.
>
> To get rid of this race by other means you have to put in place some
> fancy mechanism. This alone should be enough reason to do it the way
> I suggest, besides the fact that it is a simpler, more obvious and more
> robust interface.
I just love the way you propose insanely hard-to-get-right, hard-to-
understand, and complicated interfaces with known broken cases you admit
to, to replace simple, obviously correct and race-free mechanisms, all
with a supposed goal to make things simpler and safer.
Is there some fancy word to describe something that looks like irony,
sarcasm, or satire, except for the fact it's actually meant entirely
seriously?
MfG Kai
On Tuesday 10 September 2002 02:44, Jamie Lokier wrote:
> Typically, your module's resources are protected by a lock or so.
> cleanup_module() could take this lock, check any private reference
> counts, and (because it has the lock) decide whether to proceed with
> unregistering the module's resources.
Actually, there did turn out to be a problem with the code I showed
yesterday, resulting from the fact that cleanup_module wants to be able to
sleep. Because of this, there is no good way for module.c to protect
module->count from try_inc_mod_count, not without changing the spinlock in
try_inc_mod_count into a semaphore, which isn't nice either, because
try_inc_mod_count is called under at least one other spinlock in super.c.
There's a simple solution though: examine the module->count under the same
spinlock as try_inc_mod_count, which is what sys_delete_module does. We just
encapsulate that check in a handy wrapper and define it as part of the
try_inc_mod_count interface. At this point the thing is generalized to the
point where the module count isn't used at all by module.c, so the same
interface will also accommodate the still-under-construction magic
wait-for-quiescent-state(), needed for modules that don't fit the mod_count model.
The nice thing is, hardly any work is required to accomplish this, though
it's hard to look at module.c for more than a few seconds without seeing more
things that want cleaning up.
--
Daniel
Hi,
On Tue, 10 Sep 2002, Daniel Phillips wrote:
> There's a simple solution though: examine the module->count under the same
> spinlock as try_inc_mod_count, which is what sys_delete_module does. We just
> encapsulate that check in a handy wrapper and define it as part of the
> try_inc_mod_count interface. At this point the thing is generalized to the
> point where the module count isn't used at all by module.c, so the same
> interface will also accomodate the still-under-construction magic wait for
> quiescent state(), needed for modules that don't fit the mod_count model.
I implemented something like this some time ago. If module->count isn't
used by module.c anymore, why should it be in the module structure?
Consequently I removed it from the module struct (which of course breaks
unloading of all modules, so I'll probably reintroduce it with a big
warning). If the count isn't in the module structure, the locking becomes
much simpler. More info is here
http://marc.theaimsgroup.com/?l=linux-kernel&m=102754132716703&w=2
bye, Roman
On Mon, 9 Sep 2002, Daniel Phillips wrote:
> > The expected behaviour is as it has always been: rmmod fails if anyone
> > is using the module, and succeeds if nobody is using the module. The
> > garbage collection of modules is done using "rmmod -a" periodically, as
> > it always has been.
I can't disagree with any of that, but the idea of releasing resources
when they are not in use and preventing new use of the resource from the
time of the release request is hardly new to UNIX. It's exactly the way
filespace is treated when a file is unlinked while open.
> Al's analysis is all focussed on the filesystem case, where you can
> prove assertions about whether the subsystem defined by the module is
> active or not. This doesn't cover the whole range of module applications.
Which if true doesn't preclude solving the problems it does address.
> There is a significant class of module types that must rely on sheduling
> techniques to prove inactivity. My suggestion covers both, Al has his
> blinders on.
Clearly there are people who don't agree, and you don't help by insulting
them. I'd like a single solution, but it looks as if Al's idea will work
for filesystems, and it's relatively simple.
> Designs are not always correct just because they work.
Solutions which work are most certainly correct, and no degree of elegance
will make a non-working solution correct for any definition of correct I
use. I think you mean "optimal" here, and clearly an optimal solution is
simple, reliable, efficient, general, and easy to code and understand.
Anything less is subject to improvement, and that certainly applies to
module removal ;-)
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Hi Roman,
On Tuesday 10 September 2002 12:17, Roman Zippel wrote:
> I implemented something like this some time ago. If module->count isn't
> used by module.c anymore, why should it be in the module structure?
> Consequently I removed it from the module struct (what breaks of course
> unloading of all modules, so I'll probably reintroduce it with big a
> warning). If the count isn't in the module structure, the locking will
> become quite simpler. More info is here
> http://marc.theaimsgroup.com/?l=linux-kernel&m=102754132716703&w=2
Ah, I remember your original post but I didn't fully understand what you were
going on about at the time. Now I do, and we're on the same page. Not only
that, but I'm getting to the point with this messy problem where I can
articulate the issues clearly, so let's try to do it now. Please ack/nack me
as I go.
First, let's identify the central issue that's causing conceptual problems
for some who've sniffed at these issues. It's simple: there are two broad
categories of modules:
- Those for which we account exactly for all task entry/exits
- Those for which we do not, or cannot account current usage exactly
Let's call the former "counting" and the latter "noncounting" modules. The
number of tasks executing in a given module is the count of task entries
minus the count of task exits. Obviously this is a big problem when we have
no good way of counting accurately, but it is not an insoluble problem, and a
solution is presented here, based on earlier discussions on lkml.
Preventing Unload Races in Counting Modules
-------------------------------------------
Module unload races in counting modules are easy to handle. Al Viro has
written code that handles those in a passably competent way and we will
improve Al's approach incrementally as described below.
I'll digress for a moment and ask the question "why are counting modules
countable?". Well, using filesystems as an example, the vfs initiates each
and every call into a filesystem module, and every vfs high level function is
structured in such a way that all calls into a filesystem module return
before the vfs goes on to do something else. So it's easy to write
accounting code that relies on this behavior, and in fact, the accounting job
reduces to simply counting the number of times a given module is mounted vs
unmounted.
Even in this simple situation there are possible races, and one of the races
is dealt with by Al's odd-looking try_inc_mod_count function, which protects
the incrementing of its counter in three ways: 1) by using atomic operations
2) by protecting incrementing and testing with a spin lock 3) by additionally
protecting incrementing with a state flag. Decrementing the counter is
protected only by the use of atomic decrement, and turns out not to need any
more protection than that (granted, this interesting fact is largely useless
in terms of cycles saved).
So let's look at the curious structure of the count increment operation,
try_inc_mod_count. This acts on a counter embedded in struct module. (As
you point out, it doesn't need to, it can act on a counter embedded in the
struct filesystem instead; there is a very good reason for doing that, which
I'll get back to.)
I won't show Al's version; I'll show my improved version, which
substitutes a special state of the counter variable for Al's state flag, a
small but useful change:
int try_inc_mod_count(struct module *module)
{
        int result = 1;

        if (module) {
                spin_lock(&unload_lock);
                if (atomic_read(&module->count) == -1)
                        result = 0;
                else
                        atomic_inc(&module->count);
                spin_unlock(&unload_lock);
        }
        return result;
}
If the counter has the value -1 then it can't be incremented and this
function returns 0. This corresponds to the state where unloading is in
progress, and for filesystems, means that an attempted mount will fail. If
the counter has any other value then that value is incremented, the function
returns true, and the mount proceeds. Getting rid of the state flag
simplifies the later step of removing the counter from struct module.
The only other thing worth noting about this function is that, to use it, you
have to be able to locate the module given a struct filesystem, hence the
->owner field in that structure. With the techniques you and I have
identified, we can dispense with the ->owner field, removing considerable
cruft in the process. However, for the moment I think we should produce a
patch that aims narrowly at getting rid of the unload races, and do the
spinoff cleanups after.
The other main part of this mechanism is in sys_delete_module, in which the
test for zero module count and setting of the state flag are protected by the
same spinlock as above. Our improvement here will be to encapsulate this
functionality in a module helper function, that looks something like this:
int try_mod_count(struct module *module)
{
        int count;

        spin_lock(&unload_lock);
        count = atomic_read(&module->count);
        if (!count)
                atomic_set(&module->count, -1);
        spin_unlock(&unload_lock);
        return count;
}
This function returns zero if the (counting) module has no users, and returns
the (approximate) number of users otherwise. If the function returns zero,
then additionally, the module count is set to -1, a state in which
try_inc_mod_count is prevented from incrementing it, so after the spinlock is
released, any pending mounts will fail as they should, because we have been
asked to unload the module and are in the process of doing so.
The remaining possible races here are currently closed by the BKL, and we
will change that to a semaphore in due course. The fact that we use a
sleeping lock for this purpose means that a module's cleanup_function is
unrestricted in terms of scheduling.
The (counting) module's cleanup_function will therefore look something like
this:
        int error = try_mod_count(module);

        if (!error) {
                unregister_filesystem(foo_filesystem);
                cleanup_foo(foo_filesystem);
        }
        return error;
And we will later design the 'module' parameter out of this, because it is no
longer needed by generic code and so the associated cruft can be eliminated.
As a programmer/guru friend once said: the code "won't have its thumb tied to
its nose any more".
Note that our returned error can also take on a negative value (i.e., a
standard error code) should module unloading fail for some other reason than
the module simply having users. Additionally, a return of -1 means "you
already killed me, stupid, I'm already dead, and I already unregistered and
cleaned up my filesystem". The state the module is in when its count is -1
is interesting and potentially useful: in this state the module can legally
be reinitialized/reactivated, instead of just being removed from the module
list and destroyed.
(Roman, is -1 an ok error code to return here, given that there could be
other negative error values returned as well?)
Now we can look at the revised version of sys_delete_module, which has the
code:
        BUG_ON(!mod->cleanup);
        if ((error = mod->cleanup()))
                goto out_error;
        free_module(module);
and free_module will no longer call the cleanup_function. Nice and simple,
easy to understand, is it not? What's more, we don't have to change this
code at all to handle the more difficult case of noncounting modules.
Preventing Unload Races in Noncounting Modules
----------------------------------------------
Noncounting modules don't keep track of the number of task
entry/exits, typically because the cost of even a simple inc/dec wrapper in
generic kernel code is unacceptable, and because the wrappers just plain look
ugly. The proposed Linux Security Module is a good example of a noncounting
kind of module: many vfs functions that currently execute security checks
using inline code must henceforth call indirectly through a module to do
it. Linus will never buy it if these calls are inefficient. At present,
some of these security tests consist of just a few instructions, so wrapping
an accounting interface around those is going to show up on an execution
profile, never mind the extra source code.
This is far from the only reason why a module might be noncounting, but it's
sufficient to show why we can't ignore the problem. Fortunately, there is a
well-understood way to deal with such modules, consisting of two steps:
- Stub out all the module call points
- Wait until every running task on the system has scheduled at least once
To make this work, we rely on the rule that no module code may sleep/schedule
unless it first increments a counter. The counter incremented will, at
present, be the module->count, but that is entirely up to the module itself;
module.c will no longer care how it keeps track, as long as it does.
With CONFIG_PREEMPT code, we must additionally disable preemption until the
module quiescence test has completed. I'm not going to go further into this,
because this is deep scheduling-fu, and I haven't got working code to show
yet. Suffice to say that we have the technology to build a
magic_wait_for_quiescence, and we must now proceed to do that. (Robert, are
you reading?) A noncounting module uses the test as follows, in its
cleanup_function:
unregister_callpoints(...);
magic_wait_for_quiescence();
return cleanup_foo(...);
Blessed relief, that was easy. Now it's just a matter of coding ;-)
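To make the inc/dec rule concrete, a hook in such a noncounting module that
does need to block might look roughly like this (a sketch with invented
names, not code from any actual patch):
static int foo_hook_might_sleep(struct foo_object *obj)
{
        int err;

        /* The fast path never touches a counter; we only take a
           reference here because we are about to sleep. */
        if (!try_inc_mod_count(THIS_MODULE))
                return -EAGAIN;         /* module is on its way out */

        err = foo_wait_for_object(obj); /* may schedule */

        atomic_dec(&THIS_MODULE->count);        /* drop the reference again */
        return err;
}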
One can easily imagine a kind of module that needs both the try_mod_count
mechanism and the magic_wait_for_quiescence, or one that also has to wait for
some daemon to exit, or a module that supports more than a single subsystem,
possibly of very different kinds. All these variations are handled in the
obvious way: by customizing the cleanup_function, with the aid of helper
functions. This allows us to avoid the strong temptation to over-engineer
the module interface, trying to anticipate every possible kind of module that
someone might write in the future.
Now there's the question "if this is such a great approach, why not make all
modules work this way, not just filesystems?". Easy: the magic scheduling
approach impacts the scheduler, however lightly, and worse, we cannot put an
upper bound on the time needed for magic_wait_for_quiescence to complete. So
the try_count approach is preferable, where it works.
Implementation Issues
---------------------
OK, I think we agree on the general outline of the work that has to be done.
If not, please shout now, otherwise, feel free to niggle me to death in the
usual way ;-)
Let me touch on just one more issue: we need to do this work without breaking
every existing module in the world. In other words, we need a compatibility
wrapper to get us through the period, possibly as long as a year or two, when
there are still a lot of modules both in-kernel and outside, whose
cleanup_functions just return void. Module.c will detect this and do the
magic_wait_for_quiescence following the module's cleanup_function. Thus,
such modules will be no more broken than they ever were.
What we want is to define a new 'module_kill' macro, which will live alongside
a backwards-compatible 'module_exit' macro for now. The latter generates a
function call hook that module.c detects and wraps like this:
if (mod->old_cleanup) {
        WARN(1, "Please fix your old-style module");
        if ((error = try_mod_count(module)))
                goto out_error;
        mod->old_cleanup();
        /* as described above: old-style void cleanups get the wait here */
        magic_wait_for_quiescence();
        free_module(module);
} else {
        <nice new interface>
}
Miscellaneous other improvements we might want to make are:
- Use a semaphore instead of lock_kernel
- Clean up the generally substandard code in module.c
- Move the semaphore into struct module
- Move unload_lock into module
- Other things I've forgotten for the moment
But these can all wait. We have races now, and we have proposals on the
table that smell like over-engineering. So we need to do the work I've
described above, promptly, and get it into circulation.
Comments/flames/races?
--
Daniel
> Now there's the question "if this is such a great approach, why not make
> all modules work this way, not just filesystems?". Easy: the magic
> scheduling approach impacts the scheduler, however lightly, and worse,
> we cannot put an upper bound on the time needed for
You are in principle describing RCU. These guys seem to have solved the
problem.
> magic_wait_for_quiescence to complete. So the try_count approach is
> preferable, where it works.
But the try_count approach hurts every user of the defined interfaces,
even if modules are not used. Is the impact on the scheduler limited
to the time magic_wait_for_quiescence is running?
If so, the approach looks superior.
Regards
Oliver
On Wednesday 11 September 2002 20:53, Oliver Neukum wrote:
> > Now there's the question "if this is such a great approach, why not make
> > all modules work this way, not just filesystems?". Easy: the magic
> > scheduling approach impacts the scheduler, however lightly, and worse,
> > we cannot put an upper bound on the time needed for
>
> You are in principle describing RCU. These guys seem to have solved the
> problem.
Eh, how about that, and it was obvious to me. I wonder what that
means, if anything. (No, I never knew anything about RCU other
than the name.) Anyway, in case I sound too snippy here, here's
a hearty thanks to IBM for contributing that IP to Linux without
a fuss.
> > magic_wait_for_quiescence to complete. So the try_count approach is
> > preferable, where it works.
>
> But the try_count approach hurts every user of the defined interfaces,
> even if modules are not used.
Well, it works great for filesystems, which is my point. And I'll
bet beer that somebody out there will find another application
where it works well. It's all about choice, and the thing about
the proposed interface is, it gives you the flexibility to make
that choice. The fact that it's also the simplest interface is
just nice.
> Is the impact on the scheduler limited
> to the time magic_wait_for_quiescence is running.
> If so, the approach looks superior.
It definitely is not superior for filesystem modules. However,
"damn good" would be a nice level of functionality to aim for.
The more we come to rely on virtual filesystems, the more we care
that the load/unload procedure should be reliable and fast.
Don't forget that point about not being able to put an upper
bound on the time required to complete the magic wait.
Are you hinting that we only need, and only ever will need, one
mechanism here, so the module doesn't need to return a result
from cleanup_module? If so then I should add that we don't just
have varying requirements for cleanup style between modules, but
other problems, such as single modules that support multiple
services, which are common in the pcmcia world.
--
Daniel
On Wednesday 11 September 2002 17:28, Bill Davidsen wrote:
> On Mon, 9 Sep 2002, Daniel Phillips wrote:
>
> > > The expected behaviour is as it has always been: rmmod fails if anyone
> > > is using the module, and succeeds if nobody is using the module. The
> > > garbage collection of modules is done using "rmmod -a" periodically, as
> > > it always has been.
>
> I can't disagree with any of that, but the idea of releasing resources
> when they are not in use and preventing new use of the resource from the
> time of the release request is hardly new to UNIX. It's exactly the way
> filespace is treated when a file is unlinked while open.
It's a little different for modules. For one thing, modules are
sometimes used like switches: to make the system behave differently,
plug in some module, to make it stop doing that, remove the module
again. The user wants those transitions to happen *now*, like a
switch. Sure, we can make various guru arguments about why the user
is wrong, but in the end, the user is never wrong, in the customer
sense.
Anyway, this problem is under control now and the nice solution also
happens to be the simplest, so what more could you ask for.
> > Al's analysis is all focussed on the filesystem case, where you can
> > prove assertions about whether the subsystem defined by the module is
> > active or not. This doesn't cover the whole range of module applications.
>
> Which if true doesn't preclude solving the problems it does address.
Oh absolutely. Please see my recent [RFC] Raceless module interface.
<flamebait>Not to say that I was unable to improve Al's code.</flamebait>
> > There is a significant class of module types that must rely on scheduling
> > techniques to prove inactivity. My suggestion covers both, Al has his
> > blinders on.
>
> Clearly there are people who don't agree, and you don't help by insulting
> them.
I wish I could agree with you about that, but Al is a special case.
I think he actually *likes* it, like a Pit Bull likes fighting. He
certainly became easier to deal with after I fell into the habit of
returning each insult promptly and unambiguously. Note: Al is the one
person on the list that I deal with this way, and he worked really
hard to get there.
> I'd like a single solution, but it looks as if Al's idea will work
> for filesystems, and it's relatively simple.
And what do you propose to do about the modules it doesn't work for?
Sorry, rhetorical question, the solution is laid out in the above
[RFC].
> > Designs are not always correct just because they work.
>
> Solutions which work are most certainly correct, and no degree of elegance
> will make a non-working solution correct for any definition of correct I
> use. I think you mean "optimal" here, and clearly an optimal solution is
> simple, reliable, efficient, general, and easy to code and understand.
> Anything less is subject to improvement, and that certainly applies to
> module removal ;-)
Yes, well please read the [RFC] and pass judgement on whether it's
optimal or not.
--
Daniel
> > > magic_wait_for_quiescence to complete. So the try_count approach is
> > > preferable, where it works.
> >
> > But the try_count approach hurts every user of the defined interfaces,
> > even if modules are not used.
>
> Well, it works great for filesystems, which is my point. And I'll
> bet beer that somebody out there will find another application
> where it works well. It's all about choice, and the thing about
How would the other version work less well for filesystems ?
How long unloading takes doesn't matter, as long as it's safe.
The overhead while running is the issue, nothing else.
> the proposed interface is, it gives you the flexibility to make
> that choice. The fact that it's also the simplest interface is
> just nice.
That makes no difference if you have to support two interfaces.
> > Is the impact on the scheduler limited
> > to the time magic_wait_for_quiescence is running.
> > If so, the approach looks superior.
>
> It definitely is not superior for filesystem modules. However,
> "damm good" would be a nice level of functionality to aim for.
> The more we come to rely on virtual filesystem, the more we care
> that the load/unload procedure should be reliable and fast.
What? Unloading is a rare event in any case.
> Don't forget that point about not being able to put an upper
> bound on the time required to complete the magic wait.
>
> Are you hinting that we only need, and only ever will need, one
> mechanism here, so the module doesn't need to return a result
> from cleanup_module? If so then I should add that we don't just
Returning the error value is good.
> have varying requirements for cleanup style between modules, but
> other problems, such as single modules that support multiple
> services, which are common in the pcmcia world.
Regards
Oliver
On Wednesday 11 September 2002 22:29, Oliver Neukum wrote:
> > > > magic_wait_for_quiescence to complete. So the try_count approach is
> > > > preferable, where it works.
> > >
> > > But the try_count approach hurts every user of the defined interfaces,
> > > even if modules are not used.
> >
> > Well, it works great for filesystems, which is my point. And I'll
> > bet beer that somebody out there will find another application
> > where it works well. It's all about choice, and the thing about
>
> How would the other version work less well for filesystems ?
> How long unloading takes doesn't matter, as long as it's safe.
Really, that's not so, there are limits. 30 seconds? Whatever.
Remember, during this time the service provided by the module is
unavailable, so this is denial-of-service land. You could of
course put in extra code to abort the unload process on demand,
but, hmm, it probably wouldn't work ;-)
And even if it did work, why would that be better than staying
with a simple, proven solution that works (for filesystems)?
> The overhead while running is the issue, nothing else.
Au contraire, there are a variety of issues, see above for one.
> > the proposed interface is, it gives you the flexibility to make
> > that choice. The fact that it's also the simplest interface is
> > just nice.
>
> That makes no difference if you have to support two interfaces.
Mea culpa, I failed to communicate. The whole point of my [RFC]
is that one new interface does both jobs, and potentially more
besides, by the simple expedient of pushing the support for
variants down into module land where it belongs, and at the same
time, making these variants dead simple to write. Probably with
the same or less code than we have now. Downside?
> > > Is the impact on the scheduler limited
> > > to the time magic_wait_for_quiescence is running.
> > > If so, the approach looks superior.
> >
> > It definitely is not superior for filesystem modules. However,
> > "damm good" would be a nice level of functionality to aim for.
> > The more we come to rely on virtual filesystem, the more we care
> > that the load/unload procedure should be reliable and fast.
>
> What? Unloading is a rare event in any case.
Maybe for you it is. Anyway, I prefer that my computer does
whatever I tell it to as quickly as it can, that's why I pay so
much to run Linux on it ;-)
> > Don't forget that point about not being able to put an upper
> > bound on the time required to complete the magic wait.
> >
> > Are you hinting that we only need, and only ever will need, one
> > mechanism here, so the module doesn't need to return a result
> > from cleanup_module? If so then I should add that we don't just
>
> Returning the error value is good.
:-)
--
Daniel
Daniel Phillips wrote:
> Really, that's not so, there are limits. 30 seconds? Whatever.
> Remember, during this time the service provided by the module is
> unavailable, so this is denial-of-service land. You could of
> course put in extra code to abort the unload process on demand,
> but, hmm, it probably wouldn't work ;-)
If you're going to do it right, you should fix that denial-of-service by
waiting until the module has finished unloading and then demand-loading
the module again.
Ideally, those periodic "rmmod -a" calls should _never_ cause a
denial-of-service.
-- Jamie
On Wednesday 11 September 2002 23:26, Jamie Lokier wrote:
> Daniel Phillips wrote:
> > Really, that's not so, there are limits. 30 seconds? Whatever.
> > Remember, during this time the service provided by the module is
> > unavailable, so this is denial-of-service land. You could of
> > course put in extra code to abort the unload process on demand,
> > but, hmm, it probably wouldn't work ;-)
>
> If you're going to do it right, you should fix that denial-of-service by
> waiting until the module has finished unloading and then demand-loading
> the module again.
That doesn't make the DoS go away, it just makes it a little
harder to trigger. Anyway, one thing we could do if the rest
of the module mechanism is up to it, is know that somebody is
trying to reactivate a module that has just returned from
module_cleanup(), and immediately reactivate it instead of
freeing it, hoping to save some disk activity - if this turns
out to be a real problem, that is. The null solution is likely
the winner here.
> Ideally, those periodic "rmmod -a" calls should _never_ cause a
> denial-of-service.
Goodness no. By the way, nobody has asked me how rmmod -a is
to be implemented. Oh well.
--
Daniel
In message <E17pCKQ-0007Sz-00@starship> you write:
> Hi Roman,
>
> On Tuesday 10 September 2002 12:17, Roman Zippel wrote:
> > I implemented something like this some time ago. If module->count isn't
> > used by module.c anymore, why should it be in the module structure?
> > Consequently I removed it from the module struct (what breaks of course
> > unloading of all modules, so I'll probably reintroduce it with a big
> > warning). If the count isn't in the module structure, the locking will
> > become quite simpler. More info is here
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=102754132716703&w=2
>
> Ah, I remember your original post but I didn't fully understand what you were
I hate people who can't be concise. It's a sign of sloppy thinking.
1) You only need reference counts if you want to unload a module.
2) A module can control its own reference counts safely if it does not
sleep without holding a reference, and you use the rcu patch's
synchronize_kernel() primitive.
3) Relying on *every* driver to control its own reference counts is a
recipe for disaster: some subsystems will want to control module
counts for their users.
4) Moving reference counts out of the module and into the particular
objects is *not* a good idea, since per-cpu cache-friendly
refcounting schemes are (almost by definition) about
SMP_CACHE_BYTES*NR_CPUS in size.
Hope I haven't missed anything,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
In message <E17pFKr-0007V7-00@starship> you write:
> On Wednesday 11 September 2002 23:26, Jamie Lokier wrote:
> > Daniel Phillips wrote:
> > > Really, that's not so, there are limits. 30 seconds? Whatever.
> > > Remember, during this time the service provided by the module is
> > > unavailable, so this is denial-of-service land. You could of
> > > course put in extra code to abort the unload process on demand,
> > > but, hmm, it probably wouldn't work ;-)
> >
> > If you're going to do it right, you should fix that denial-of-service by
> > waiting until the module has finished unloading and then demand-loading
> > the module again.
>
> That doesn't make the DoS go away, it just makes it a little
> harder to trigger. Anyway, one thing we could do if the rest
> of the module mechanism is up to it, is know that somebody is
> trying to reactivate a module that has just returned from
> module_cleanup(), and immediately reactivate it instead of
> freeing it, hoping to save some disk activity - if this turns
> out to be a real problem, that is. The null solution is likely
> the winner here.
Ah, yes, four-point module interface: init, start, stop, destroy.
Then you can call stop, realize the module is not at zero refcnt (you
lost a race), then call start again. Similar thing if someone
requests a stopped module.
Now you're going to have to change request_module() so the kernel can
realize that you're requesting a module which already exists
(request_module()'s effect currently depends on /etc/modules.conf of
course).
Now, of course, your module interface is starting to look really
complex, too. Because you want to solve the "half-loaded" problem, so
you really want "init" to reserve resources, and "start" to register
them (ie. start can't fail). So every register_xxx interface needs to
be split into reserve_xxx and use_xxx.
We can do all this, of course. I have an awful lot of patches. But
I'm not really happy with any of them.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Rusty Russell wrote:
> Ah, yes, four-point module interface: init, start, stop, destroy.
> Then you can call stop, realize the module is not at zero refcnt (you
> lost a race), then call start again. Similar thing if someone
> requests a stopped module.
That's one fairly complicated way of going about it.
Simpler is three points: init, stop, destroy. Stop is that part before
you call synchronise_kernel() - it's not something you can turn back
from.
If somebody needs the module, and it is currently inside its
cleanup_module() function, they simply wait until destroy finishes, and
the module is unloaded, and then _reload_ the module from scratch. So,
let the disk I/O happen. This is a rare case.
> Now you're going to have to change request_module() so the kernel can
> realize that you're requesting a module which already exists
> (request_module()'s effect currently depends on /etc/modules.conf of
> course).
Not at all. Keep request_module() simple, and change module loading,
like this:
1. Just before a module's cleanup_module() function is called, mark
the module as "unloading". This should force
try_inc_mod_use_count() to fail, causing its caller to behave like
the associated resource (e.g. filesystem) isn't actually
registered, and to call request_module().
2. request_module() should simply ignore modules marked as
"unloading". It should proceed to call "insmod" etc.
3. sys_create_module() or sys_init_module() should block if there is
a module of the same name currently in the "unloading" state.
They should block until that module's cleanup_module() returns.
4. At this point, the new instance of the module will initialise,
request_module() calls will return and the callers which called
try_inc_mod_use_count() in step 1 will continue with the resource
they needed.
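For concreteness, a caller-side sketch of how steps 1-4 fit together could
look like this (find_filesystem, the ->owner field and try_inc_mod_use_count
are used loosely here, as illustrations rather than exact current interfaces):
struct file_system_type *get_fs(const char *name)
{
        struct file_system_type *fs;

        /* Step 1: a module marked "unloading" makes the try fail, so
           we behave as though the filesystem were not registered. */
        fs = find_filesystem(name);
        if (fs && try_inc_mod_use_count(fs->owner))
                return fs;

        /* Step 2: request_module() ignores the dying instance and
           loads a fresh one; step 3 makes the loader block until the
           old cleanup_module() has returned. */
        request_module(name);

        /* Step 4: the new instance is registered now, so try again. */
        fs = find_filesystem(name);
        if (fs && try_inc_mod_use_count(fs->owner))
                return fs;
        return NULL;
}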
> Now, of course, your module interface is starting to look really
> complex, too. Because you want to solve the "half-loaded" problem, so
> you really want "init" to reserve resources, and "start" to register
> them (ie. start can't fail). So every register_xxx interface needs to
> be split into reserve_xxx and use_xxx.
I don't see the point in this at all. What half-loaded problem? If a
module is destroyed, then reloaded and fails to initialise properly:
tough. It lost the resources fair and square.
> We can do all this, of course. I have an awful lot of patches. But
> I'm not really happy with any of them.
What do you think of the idea described above?: To mark modules as
"unloading" (as we do now), use synchronise_kernel() for an rcu-style
safety pause (as you suggested), and change module loading so that a
dying module is waited for before its replacement takes over?
-- Jamie
In message <[email protected]> you write:
> I don't see the point in this at all.
Yes, I'm starting to realize that.
Frankly, I'm sick of pointing out the same problems individually to
every shallow thinker who thinks that "it's easy".
The fundamental problems with modules are as follows:
A) Many places in the kernel do not increment module reference counts
for you, and it is currently difficult to write a module which safely
looks after its own reference counts (see
net/ipv4/netfilter/ip_conntrack_core.c's ip_conntrack_cleanup(), and
the sketch after this list)
B) We do not handle the "half init problem" where a module fails to load, eg.
a = register_xxx();
b = register_yyy();
if (!b) {
unregister_xxx(a);
return -EBARF;
}
Someone can start using "a", and we are in trouble when we remove
the failed module.
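For concreteness, the sort of thing (A) forces on a module today looks
roughly like the sketch below, loosely modelled on the ip_conntrack_cleanup()
loop with invented names (it is not a quote of that code):
static void __exit mymod_cleanup(void)
{
        unregister_my_hooks();

        /* Tasks already inside a hook may still hold objects; keep
           flushing and yielding until they are all gone.  Nothing
           bounds how long this takes. */
        while (atomic_read(&mymod_use_count) != 0) {
                flush_my_objects();
                schedule();
        }
}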
Suggested solutions come in several forms, with different mixtures of
the following:
1) Poor man's pageable kernel:
o Every interface in the kernel takes a struct module *, and
looks after the reference counts before and after it jumps in.
o Variants with/without try_inc_mod_count() or
two-stage unload.
o try_inc_mod_count variant solves the half-init problem
2) Modules carry their own overhead:
o Modules can look after their own reference counts.
o "don't sleep until you've grabbed a refcount" & wait
for quiescence.
o Two-stage module delete (stop & destroy).
o More advanced variations allow "restart" of stopped
module if race is lost.
o Still leaves the half-init problem
3) Deprecate module unload:
o Don't unload modules except by kernel hacking config option.
o Unloading in-use modules: "don't do that".
o Still leaves the half-init problem
4) Two-stage module init
o "Reserve" (can fail) and "use" (can't fail).
o Solves ONLY the half-init problem
5) Primordial Soup:
o Module loading/activation and deactivation/destruction occur
in a state similar to the kernel at boot.
o You need to stop all tasks but the loader and some kernel
threads, even if module load sleeps.
o You probably want to defer almost all interrupts, too.
o Or you can run module_init in interrupt context,
which breaks lots of register_xxx interfaces.
(1) PRO: Simple to write modules.
CON: try_inc_mod_count is a source of random subsystem failure.
CON: Everyone pays the inc/dec price even if they're not a module.
CON: Massive infrastructure change
CON: Modules started as a cute hack, are they really worth this cost?
(2) PRO: Non-modules don't pay the overhead for inc/dec.
CON: Harder to write modules than (1).
CON: Gracefully handling the "lost the race between stop and
cleanup" adds complexity
(3) PRO: Infrastructure simple.
PRO: Module implementation simple.
CON: Some broken drivers (eg. 16-bit PCMCIA) require unload to
reset (PCMCIA rewrite and/or suspend hooks can help here).
CON: Judging "unused" from userspace is fun & racy on real systems.
CON: We never remove features from Linux.
(4) PRO: Some drivers register things before setting up their internal
state already (which works fine at boot). This would solve
that, too.
PRO: Can be implemented naively, so if a module doesn't have
two-stage init, we just never free the module memory (still
dangerous since there other resources, but no worse than
before).
CON: Requires more work for module authors in future.
(5) PRO: Infrastructure simple.
PRO: Module implementation simple.
PRO: Would fix drivers which register too early, too.
CON: I'm not smart enough to implement it (esp. before 31 October).
CON: Scheduler magic lots of fun to keep common case efficient.
CON: Deferring interrupts may have interesting side effects.
CON: May not be possible without deadlock?
CON: "stop the world" during module load may piss off some.
Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Thursday 12 September 2002 04:09, Jamie Lokier wrote:
> 1. Just before a module's cleanup_module() function is called, mark
> the module as "unloading". This should force
> try_inc_mod_use_count() to fail, causing its caller to behave like
> the associated resource (e.g. filesystem) isn't actually
> registered, and to call request_module().
My proposal produces the same effect using a slightly different
arrangement: module_cleanup() is called, and the first thing it
does is bail out with an error if there are users, or puts the
module into the inactive state (count=-1). This forces
try_inc_mod_use_count to fail, as you say.
> 2. request_module() should simply ignore modules marked as
> "unloading". It should proceed to call "insmod" etc.
>
> 3. sys_create_module() or sys_init_module() should block if there is
> a module of the same name currently in the "unloading" state.
> They should block until that module's cleanup_module() returns.
OK, this is a really good example of why replacing the lock_kernel
with a dedicated module_sem is good: if we do that, the right thing
happens. Insmod will block on the module_sem until the module is
completely unloaded and removed from the list. By the time it gets
the semaphore, everything will be in a pristine state.
The BKL on the other hand will happily relinquish control if the
cleanup sleeps, messing things up nicely.
> 4. At this point, the new instance of the module will initialise,
> request_module() calls will return and the callers which called
> try_inc_mod_use_count() in step 1 will continue with the resource
> they needed.
Yes, I don't see a problem. The problems start when you try to
shortcut this cycle. Given that a shortcut is hardly going to save
much at all, due to caching, I'd call it a waste of effort.
--
Daniel
On Thursday 12 September 2002 05:13, Rusty Russell wrote:
> B) We do not handle the "half init problem" where a module fails to load, eg.
> a = register_xxx();
> b = register_yyy();
> if (!b) {
> unregister_xxx(a);
> return -EBARF;
> }
> Someone can start using "a", and we are in trouble when we remove
> the failed module.
No we are not. The module remains in the 'stopped' state
throughout the entire initialization process, as it should and
does, in my model.
--
Daniel
On Thu, 12 Sep 2002, Daniel Phillips wrote:
> On Thursday 12 September 2002 05:13, Rusty Russell wrote:
> > B) We do not handle the "half init problem" where a module fails to load, eg.
> > a = register_xxx();
> > b = register_yyy();
> > if (!b) {
> > unregister_xxx(a);
> > return -EBARF;
> > }
> > Someone can start using "a", and we are in trouble when we remove
> > the failed module.
>
> No we are not. The module remains in the 'stopped' state
> throughout the entire initialization process, as it should and
> does, in my model.
Bzzzert. At the very least, for block devices we need to be able to open
disks during module initialization.
Al, fully expecting a stack of mind-boggling (and broken) kludges to be
posted...
In message <E17pLKe-0007ds-00@starship> you write:
> On Thursday 12 September 2002 05:53, Alexander Viro wrote:
> > Bzzzert.
>
> Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your
> peers as "shallow thinkers".
I've outlined the variations on solutions multiple times. When people
repeat variations and use the word "simple", they are thinking
shallowly.
I had patience with this thread the first few times I had to explain
it, but no longer, as you can tell.
And it's clear that, shallow thinker or not, you have trouble reading 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Thu, 12 Sep 2002 05:47:57 +0200
Daniel Phillips <[email protected]> wrote:
> On Thursday 12 September 2002 05:13, Rusty Russell wrote:
> > B) We do not handle the "half init problem" where a module fails to load, eg.
> > a = register_xxx();
> > b = register_yyy();
> > if (!b) {
> > unregister_xxx(a);
> > return -EBARF;
> > }
> > Someone can start using "a", and we are in trouble when we remove
> > the failed module.
>
> No we are not. The module remains in the 'stopped' state
> throughout the entire initialization process, as it should and
> does, in my model.
Um, so register_xxx interfaces all use try_inc_mod_count (ie. a
struct module * extra arg to register_xxx)? Or those entry points
are not protected by try_inc_mod_count, so it must bump the refcnt, so
you need to sleep in load until the module becomes unused again.
You have the same issue in the "wait for schedule" case on unload,
where someone jumps in while you are unregistering. My implementation
decided to ignore this problem (ie. potentially wait forever with the
module half-loaded) but it is an issue.
Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy
On Thursday 12 September 2002 05:53, Alexander Viro wrote:
> On Thu, 12 Sep 2002, Daniel Phillips wrote:
>
> > On Thursday 12 September 2002 05:13, Rusty Russell wrote:
> > > B) We do not handle the "half init problem" where a module fails to load, eg.
> > > a = register_xxx();
> > > b = register_yyy();
> > > if (!b) {
> > > unregister_xxx(a);
> > > return -EBARF;
> > > }
> > > Someone can start using "a", and we are in trouble when we remove
> > > the failed module.
> >
> > No we are not. The module remains in the 'stopped' state
> > throughout the entire initialization process, as it should and
> > does, in my model.
>
> Bzzzert.
Rusty, writing "Bzzzert" on lkml is juvenile, as is referring to your
peers as "shallow thinkers". Ok, let's see if there is some content
in your post.
> At the very least, for block devices we need to be able to open
> disks during module initialization.
Huh? Would you please back up and try to make a coherent point?
I'd love to point out why you're wrong, but you didn't actually
say anything.
> Al, fully expecting a stack of mind-boggling (and broken) kludges to be
> posted...
As I recall, you are the one who proposed eliminating the ability
to unload modules entirely, because you were not able to solve the
unload races. It's a good thing that people with more sense
shouted you down.
Now, let's be civilized about this. If you have points, make them,
we do not need them decorated with sound effects.
--
Daniel
On Thursday 12 September 2002 06:40, Rusty Russell wrote:
> In message <E17pLKe-0007ds-00@starship> you write:
> > On Thursday 12 September 2002 05:53, Alexander Viro wrote:
> > > Bzzzert.
> >
> > Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your
> > peers as "shallow thinkers".
>
> I've outlined the variations on solutions multiple times. When people
> repeat variations and use the word "simple", they are thinking
> shallowly.
>
> I had patience with this thread the first few times I had to explain
> it, but no longer, as you can tell.
Well I can understand that when people are just talking and not coding,
but when the latter happens, it's worth trying to make something of it.
Just look at Roman's patch: there is a lot of value in it, though some
central issues are indeed left hanging.
> And it's clear that, shallow thinker or not, you have trouble reading 8)
Oh crap, I misattributed the funny sound effects to you, I should have
known. Sorry about that. Hey, it's *long* after midnight here.
--
Daniel
On Thursday 12 September 2002 06:52, Rusty Russell wrote:
> On Thu, 12 Sep 2002 05:47:57 +0200
> Daniel Phillips <[email protected]> wrote:
>
> > On Thursday 12 September 2002 05:13, Rusty Russell wrote:
> > > B) We do not handle the "half init problem" where a module fails to
> > > load, eg.
> > > a = register_xxx();
> > > b = register_yyy();
> > > if (!b) {
> > > unregister_xxx(a);
> > > return -EBARF;
> > > }
> > > Someone can start using "a", and we are in trouble when we remove
> > > the failed module.
> >
> > No we are not. The module remains in the 'stopped' state
> > throughout the entire initialization process, as it should and
> > does, in my model.
>
> Um, so register_xxx interfaces all use try_inc_mod_count (ie. a
> struct module * extra arg to register_xxx)?
In that case they are filesystems or some other thing that fits the
model closely enough for try_inc_mod_count to be efficient. This
case is easy and solved as far as I'm concerned, do correct me if
I'm wrong. Assuming I'm not wrong, on to a hard and interesting
case.
> Or those entry points
> are not protected by try_inc_mod_count, so it must bump the refcnt, so
> you need to sleep in load until the module becomes unused again.
Let's consider LSM as a really hard case, and let's suppose that the
register_*'s just plug new functions into function tables. These functions
are just called indirectly, there is no module entry/exit accounting
whatsoever, except that the inc/dec rule must be followed where module code
itself might sleep.
Anyway, at the -EBARF point, all the function pointers have been restored to
their original state, but we may have some tasks executing in the module
already, which got in while _xxx was there. The solution is to do the
magic_wait_for_quiescence (yes I know it has a real name, I forgot it) before
returning -EBARF. Refcounts never come into it.
What did I miss? It was too easy.
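In code, the failure path being described is roughly this (names invented,
magic_wait_for_quiescence as above):
static int __init mymod_init(void)
{
        register_xxx(&mymod_xxx_ops);           /* first table entry is in */

        if (register_yyy(&mymod_yyy_ops) < 0) {
                unregister_xxx(&mymod_xxx_ops); /* pointers back to normal */
                magic_wait_for_quiescence();    /* flush tasks already inside us */
                return -EBARF;
        }
        return 0;
}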
Ah, I'm glossing over how the "I need to sleep" intramodule refcounts
interact with the magic quiescence test. Let's see if some sleep fixes that.
I doubt this is any central issue though. We could, for instance, pass the
address of the count variable to the quiescence tester, and it will be
examined at schedule time, however that works (I promise to find out soon).
> You have the same issue in the "wait for schedule" case on unload,
> where someone jumps in while you are unregistering.
We don't. Our LSM module will first restore all the function pointers to
their original state, then wait for everyone to schedule.
> My implementation
> decided to ignore this problem (ie. potentially wait forever with the
> module half-loaded) but it is an issue.
I don't see the endless wait. It looks to me like it's just the time
required for everyone to schedule, which is unbounded all right, but not in
any interesting sense.
--
Daniel
In message <E17pLKe-0007ds-00@starship> you write:
> As I recall, you are the one who proposed eliminating the ability
> to unload modules entirely, because you were not able to solve the
> unload races.
Huh? I was able to solve the module races, in multiple ways. I even
had an implementation, using two stage init and two stage remove. But
it's not *simple*, and pushing additional constraints on kernel driver
authors may be worse than not being able to remove modules.
> It's a good thing that people with more sense shouted you down.
There was (understandably) some resistance to removing a "working"
feature, but noone produced a new workable alternative.
The *point* of my presentation was to make people think of
alternatives, especially, if we can't (or don't want to) solve all the
unload races, is it OK to say "well, you can't unload those kinds of
modules"?
If I had come up with a clear winner, I might have saved myself the
pain of standing up in front of 70 of my peers and admitting that I
wasn't smart enough, ferchissakes.
But it doesn't help if you don't listen, then sprout the "one true
solution" which turns out to be a wordy combination of two previously
discussed schemes, with the disadvantages of both.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
In message <E17pMzL-0007fx-00@starship> you write:
> On Thursday 12 September 2002 06:52, Rusty Russell wrote:
> > Um, so register_xxx interfaces all use try_inc_mod_count (ie. a
> > struct module * extra arg to register_xxx)?
>
> In that case they are filesystems or some other thing that fits the
> model closely enough for try_ind_mod_count to be efficient. This
> case is easy and solved as far as I'm concerned, do correct me if
> I'm wrong. Assuming I'm not wrong, on to a hard and interesting
> case.
No, let's talk about a simple notifier chain. Say you want to know
when CPUs go up and down...
> > Or those entry points
> > are not protected by try_inc_mod_count, so it must bump the refcnt, so
> > you need to sleep in load until the module becomes unused again.
>
> Let's consider LSM as a really hard case, and let's suppose that the
> register_*'s just plug new functions into function tables. These functions
> are just called indirectly, there is no module entry/exit accounting
> whatsoever, except that the inc/dec rule must be followed where module code
> itself might sleep.
>
> Anyway, at the -EBARF point, all the function pointers have been restored to
> their original state, but we may have some tasks executing in the module
> already, which got in while _xxx was there. The solution is to do the
> magic_wait_for_quiescence (yes I know it has a real name, I forgot it) before
> returning -EBARF. Refcounts never come into it.
>
> What did I miss? It was too easy.
The notifier routine did the right thing, and did:
MOD_INC_USE_COUNT();
wait_for_some_event();
It's sitting there now, sleeping inside the module (which has refcnt
1) and we returned -EBARF from module_init().
> I doubt this is any central issue though. We could, for instance, pass the
> address of the count variable to the quiescence tester, and it will be
> examined at schedule time, however that works (I promise to find out soon).
You can sleep in some way before actually kfreeing the failed-to-load
module memory; preempt makes this a bit trickier but the theory is the
same (it was preempt that broke my implementation, yes it was that
long ago). The point is that you could be sleeping for a *long*
time...
> I don't see the endless wait. It looks to me like it's just the time
> required for everyone to schedule, which is unbounded all right, but not in
> any interesting sense.
It's not the waiting to schedule. It's the waiting for the refcnt to
drop to zero.
I don't mind saying "we don't have to handle that corner case", but
you're left with the messiness of two classes of kernel interface:
those which handle your refcounting and those which don't. And the
module author better get right which ones are which (remember, we're
talking about people who are *not* kernel gurus, writing their first
and last kernel module on paid time for their company).
This is where the "how important is module unload?" thing comes in.
You could get rid of the bugs by having a list of "module count safe"
interfaces (ie. exported symbols), and if a module uses any others,
you simply refuse to unload it. Then no modules need *ever* worry
about their own reference counts.
Now, the trick I have in my back pocket is a distributed reference
count scheme (unimplemented so far on Linux, but another idea stolen
from Paul McKenney's Dynix/PTX team). You use per-cpu counts *until*
you want to see if the count is zero (ie. module unload), at which
point you flip them into (slow) shared-counter mode (handwave,
handwave, I know).
So it *might* be cheap to apply this to every interface, and migrate
interfaces across into the "module safe" category on an as-we-need-it
basis.
Hope that clarifies my thinking this week...
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
> To make this work, we rely on the rule that no module code may
> sleep/schedule unless it first increments a counter. The counter
> incremented will, at present, be the module->count, but that is entirely
> up to the module itself; module.c will no longer care how it keeps
> track, as long as it does.
>
> With config_preempt code, we must additionally disable preemption until
> the module quiescence test has completed. I'm not going to go further
> into this, because this is deep scheduling-fu, and I haven't got working
> code to show yet. Suffice to say that we have the technology to build a
> magic_wait_for_quiescence, and we must now proceed to do that. (Robert,
> are you reading?) A noncounting module uses the test as follows, in its
> cleanup_function:
>
> unregister_callpoints(...);
> magic_wait_for_quiescence();
> return cleanup_foo(...);
Please correct me if I am wrong, but ...
Task A                         Task B                          counter
call_module()                                                  +1
schedule()                                                     still 1
                               unregister_callpoints()         still 1
                               magic_wait_for_quiescence();    still 1
call_module_second_func()  ->  Won't work
So by trying to unregister a module you make a module unusable
for an unspecified amount of time.
Regards
Oliver
Hi,
On Wed, 11 Sep 2002, Daniel Phillips wrote:
> Ah, I remember your original post but I didn't fully understand what you were
> going on about at the time. Now I do, and we're on the same page. Not only
> that, but I'm getting to the point with this messy problem where I can
> articulate the issues clearly, so let's try to do it now. Please ack/nack me
> as I go.
>
> First, let's identify the central issue that's causing conceptual problems
> for some who've sniffed at these issues. It's simple, there are two broad
> categories of modules:
Sigh, please let's analyze the complete problem first, before making any
more kludges.
Every (loaded) module is at least registered with two hooks in the kernel
- the module structure and a driver structure (e.g. file_system_type for
fs). During unloading we have to remove these hooks safely again.
A module basically goes through these stages during its lifetime:
1. module load
2. module init
3. module exit
4. module unload
The problem now is stage 3, as it's not allowed to fail. This means before
we even start with module exit, we have to make sure nobody has any
reference and nobody gets any new reference to the module through any
hook (the alternative would be to possibly wait forever in the exit
function to wait for these references to go away). So nobody is allowed to
increment the module count or to get any pointer to the driver; this, on
the other hand, means that before you take any reference to the driver, you have
to check the state of the module, so any driver hook must include a
pointer to the module structure. This is also what makes the locking
interesting, since module structure and driver hooks are protected with
different locks, both have to be taken to safely get a single reference to
the driver.
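Concretely, taking a reference through a driver hook today looks roughly
like this (an illustration of the two-lock dance, not the literal fs code):
struct file_system_type *get_fs_reference(const char *name)
{
        struct file_system_type *fs;

        read_lock(&file_systems_lock);          /* lock #1: the hook list */
        fs = find_filesystem(name);
        if (fs && fs->owner && !try_inc_mod_count(fs->owner))
                fs = NULL;              /* lock #2 is taken inside
                                           try_inc_mod_count(): the module
                                           is already going away */
        read_unlock(&file_systems_lock);
        return fs;
}
It is exactly this second lock, and the owner check, that the rest of this
mail argues we can drop once the exit function is allowed to fail.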
So how does this whole picture change, if we allow the module exit to
fail? Module load/unload and module init/exit can become two independent
stages. This means in order to remove a driver hook, we don't have to
check the module state anymore. Module exit only has to make sure, that
all driver hooks and all references to the driver are gone. How it does
that exactly is up to the driver, whether it uses reference counts, RCU or
other scheduler tricks doesn't matter. We gain complete flexibility here,
only by allowing the exit function to fail.
It's really important to understand this consequence in order to
understand my patch (e.g. try_inc_mod_count() became a dummy). With the
exception of module loading I completely disabled the old interface, but
it probably has to be reenabled (with a big warning) until enough modules
are converted. The start/stop methods I added are completely irrelevant
to solving this problem; they only give us more flexibility in the module
handling, e.g. we can disable new fs mounts, while the fs still has
something mounted.
Anyway, the important consequence is that we can sanely clean up the
module interface this way. The module load/unload phase is private to the
module code, the module init/exit phase is private to the driver. This
means while the module or the driver is in control it doesn't have to
care about the state of the other part.
The core problem is actually not that difficult, you only have to forget
all the old cruft, it's only causing confusion and is not worth fixing.
bye, Roman
In message <Pine.LNX.4.44.0209121043160.28515-100000@serv> you write:
> Sigh, please let's analyze the complete problem first, before making any
> more kludges.
> Every (loaded) module is at least registered with two hooks in the kernel
> - the module structure and a driver structure (e.g. file_system_type for
> fs). During unloading we have to remove these hooks safely again.
> A module basically goes through these stages during its lifetime:
> 1. module load
> 2. module init
> 3. module exit
> 4. module unload
> The problem now is stage 3, as it's not allowed to fail. This means before
Nope, that's one of the two problems. Read my previous post: the
other is partial initialization.
Your patch is two-stage delete, with the addition of a usecount
function. So you have to move the usecount from the module to each
object it registers: great for filesystems, but I don't think it buys
you anything (since they were easy anyway).
Moreover, I don't see where your patch prevented someone increasing
the module count during try_unregister_module(), so that check is
pointless (do it in userspace unless they specify rmmod -f).
Alexey said we needed two-stage module delete back in 1999, so this is
not a new idea...
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Hi,
On Thu, 12 Sep 2002, Rusty Russell wrote:
> Nope, that's one of the two problems. Read my previous post: the
> other is partial initialization.
>
> Your patch is two-stage delete, with the additional of a usecount
> function. So you have to move the usecount from the module to each
> object it registers: great for filesystems, but I don't think it buys
> you anything (since they were easy anyway).
I'm aware of the init problem; what I described was the core problem,
which prevents any further cleanup.
The usecount is optional; the only important question a module must be
able to answer is: Are there any objects/references belonging to the
module? It's a simple yes/no question. If a module can't answer that, it
likely has more problems than just module unloading.
How that interface is exactly done is open for discussion and needs to be
specified.
> Moreover, I don't see where your patch prevented someone increasing
> the module count during try_unregister_module(), so that check is
> pointless (do it in userspace unless they specify rmmod -f).
I don't see your problem: if someone looks up a module, he gets a
reference to the module structure; if a reference count goes to zero, the
module must be completely freed. State changes are protected with a
separate lock; if a module is loaded/unloaded, an extra reference is used
to prevent module removal.
bye, Roman
On Thu, 12 Sep 2002, Rusty Russell wrote:
> Date: Thu, 12 Sep 2002 14:40:42 +1000
> From: Rusty Russell <[email protected]>
> To: Daniel Phillips <[email protected]>
> Cc: [email protected], Jamie Lokier <[email protected]>,
> Oliver Neukum <[email protected]>, Roman Zippel <[email protected]>,
> [email protected], [email protected]
> Subject: Re: [RFC] Raceless module interface
>
> In message <E17pLKe-0007ds-00@starship> you write:
> > On Thursday 12 September 2002 05:53, Alexander Viro wrote:
> > > Bzzzert.
> >
> > Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your
> > peers as "shallow thinkers".
>
> I've outlined the variations on solutions multiple times. When people
> repeat variations and use the word "simple", they are thinking
> shallowly.
>
> I had patience with this thread the first few times I had to explain
> it, but no longer, as you can tell.
>
> And it's clear that, shallow thinker or not, you have trouble reading 8)
> Rusty.
Write a FAQ then...
--
Gerhard Mack
[email protected]
<>< As a computer I find your faith in technology amusing.
In message <[email protected]> you write:
> On Thu, 12 Sep 2002, Rusty Russell wrote:
>
> > Date: Thu, 12 Sep 2002 14:40:42 +1000
> > From: Rusty Russell <[email protected]>
> > To: Daniel Phillips <[email protected]>
> > Cc: [email protected], Jamie Lokier <[email protected]>,
> > Oliver Neukum <[email protected]>, Roman Zippel <[email protected]>,
> > [email protected], [email protected]
> > Subject: Re: [RFC] Raceless module interface
> >
> > In message <E17pLKe-0007ds-00@starship> you write:
> > > On Thursday 12 September 2002 05:53, Alexander Viro wrote:
> > > > Bzzzert.
> > >
> > > Rusty, writing "Bzzzert" on lkml is juvenile, as is refering to your
> > > peers as "shallow thinkers".
> >
> > I've outlined the variations on solutions multiple times. When people
> > repeat variations and use the word "simple", they are thinking
> > shallowly.
> >
> > I had patience with this thread the first few times I had to explain
> > it, but no longer, as you can tell.
> >
> > And it's clear that, shallow thinker or not, you have trouble reading 8)
> > Rusty.
>
> Write a FAQ then...
People who can't find the lkml archives will read the FAQ?
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write:
> Hi,
>
> On Thu, 12 Sep 2002, Rusty Russell wrote:
>
> > Nope, that's one of the two problems. Read my previous post: the
> > other is partial initialization.
> >
> > Your patch is two-stage delete, with the additional of a usecount
> > function. So you have to move the usecount from the module to each
> > object it registers: great for filesystems, but I don't think it buys
> > you anything (since they were easy anyway).
>
> I'm aware of the init problem, what I described was the core problem,
> which prevents any further cleanup.
I don't think of either of them as core, they are two problems.
> The usecount is optional, the only important question a module must be
Yes, for sure. So why check the use count at all?
> able to answer is: Are there any objects/references belonging to the
> module? It's a simple yes/no question. If a module can't answer that, it
> likely has more problem than just module unloading.
Ah, we're assuming you insert synchronize_kernel() between the call
to stop and the call to exit?
In which case *why* do you check the use count *inside* exit_affs_fs?
Why not get exit_module() to do "if (mod->usecount() != 0) return
-EBUSY; else mod->exit();"?
There's the other issue of symmetry. If you allocate memory in
start, do you clean it up in stop or exit? Similarly for other
resources: you call mod->exit() every time start fails, so exit is
supposed to check whether mod->start() succeeded?
Of course, separating start into "init" and "start" allows you to
solve the half-initialized problem as well as clarify the rules.
================
I disagree with pushing the count into the filesystem structure: it
changes nothing except hide the fact that you're only keeping
reference counts for the sake of the module. Filesystems are low
performance impact, but other subsystems really don't want that atomic
inc and dec for every access. See my implementation (for those in the
audience: http://www.kernel.org/pub/linux/kernel/people/rusty/ ):
if (down_interruptible(&module_mutex))
        return -EINTR;

mod = find_module(name);
if (!mod) {
        ret = -ENOENT;
        goto out;
}

if (!mod->stop && !mod->exit) {
        /* This module can't be removed */
        ret = -EBUSY;
        goto out;
}

if (mod->stop) {
        ret = mod->stop();
        if (ret != 0) goto out;
}

/* Remove symbols so module count can't increase */
spin_lock_irq(&modlist_lock);
list_del_init(&mod->symbols.list);
spin_unlock_irq(&modlist_lock);

/* This may or may not decrement to zero.  We may be waiting
   for someone else who is holding a reference. */
set_current_state(TASK_UNINTERRUPTIBLE);
mod->waiting = current;
module_put(mod);
schedule();

/* Wait to ensure that noone is executing inside a module right now */
synchronize_kernel();

/* Final destruction now noone is using it. */
if (mod->exit) mod->exit();
free_module(mod);
ret = 0;
Now, I chose to leave the usage count checks as a userspace problem,
and *not* to call ->start again in this implementation, but sleep
until the refcount hits 0.
This works better for me in ip_conntrack, since ideally the usage
count is a count of the number of packets we're tracking, which is
under control of the outside world, so I wanted an "rmmod -f".
I think either way is valid: effectively, where I do the wait, you do
the check and reinitialize the module ("oh oh!").
Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Friday 13 September 2002 03:30, Rusty Russell wrote:
> In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write:
> > The usecount is optional, the only important question a module must be
> > able to answer is: Are there any objects/references belonging to the
> > module? It's a simple yes/no question. If a module can't answer that, it
> > likely has more problem than just module unloading.
>
> Ah, we're assuming you insert synchronize_kernel() between the call
> to stop and the call to exit?
>
> In which case *why* do you check the use count *inside* exit_affs_fs?
> Why not get exit_module() to do "if (mod->usecount() != 0) return
> -EBUSY; else mod->exit();"?
Because mod->usecount may be a totally inadequate way of determining
if a module is busy. How does it work for LSM, for example?
> There's the other issue of symmetry. If you allocate memory, in
> start, do you clean it up in stop or exit?
Actually, I'm going to press you on why you think you even need a
two stage stop. I know you have your reasons, but I doubt there is
any effect you aim at that cannot be achieved with a single stage
stop/exit. Could you please summarize the argument in favor of the
two stage stop?
> Similarly for other
> resources: you call mod->exit() every time start fails, so that is
> supposed to check that mod->start() succeeded?
He does? That's not right. ->start should clean up after itself if
it fails, like any other good Linux citizen.
> Of course, separating start into "init" and "start" allows you to
> solve the half-initialized problem as well as clarify the rules.
I doubt it gives any new capability at all. The same with the
entrenched separation at the user level between create and init
module: what does it give you that an error exit from a single
create/init would not? Sure, I know it's not going to change,
but I'd like to know what the thinking was, and especially, if
there's a non-bogus reason, I'd like to know it.
--
Daniel
On Fri, Sep 13, 2002 at 11:30:47AM +1000, Paul 'Rusty' Russell wrote:
> In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write:
> > Hi,
> >
> > On Thu, 12 Sep 2002, Rusty Russell wrote:
> >
> > > Nope, that's one of the two problems. Read my previous post: the
> > > other is partial initialization.
> > >
> > > Your patch is two-stage delete, with the additional of a usecount
> > > function. So you have to move the usecount from the module to each
> > > object it registers: great for filesystems, but I don't think it buys
> > > you anything (since they were easy anyway).
> >
> > I'm aware of the init problem, what I described was the core problem,
> > which prevents any further cleanup.
>
> I don't think of either of them as core, they are two problems.
Actually, with one stage init, module unload is essentially a special
case of module load failure, consider:
module_init()
{
        /* initialize stuff */
        ...
        wait_event_interruptible(wq, 0 == 1);
        /* clean stuff up */
        ...
        return -EINTR;
}
--
David Gibson | For every complex problem there is a
[email protected] | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson
On Friday 13 September 2002 02:39, Rusty Russell wrote:
> In message <[email protected]> you write:
> > Write a FAQ then...
>
> People who can't find the lkml archives will read the FAQ?
>
> Rusty.
In this case it would be a "Frequently Argued Questions" because
there is no general agreement, and it is not just because everybody
but you is a shallow thinker ;-)
--
Daniel
In message <E17pg3H-0007pb-00@starship> you write:
> On Friday 13 September 2002 03:30, Rusty Russell wrote:
> > In message <Pine.LNX.4.44.0209121520300.28515-100000@serv> you write:
> > > The usecount is optional, the only important question a module must be
> > > able to answer is: Are there any objects/references belonging to the
> > > module? It's a simple yes/no question. If a module can't answer that, it
> > > likely has more problems than just module unloading.
> >
> > Ah, we're assuming you insert synchronize_kernel() between the call
> > to stop and the call to exit?
> >
> > In which case *why* do you check the use count *inside* exit_affs_fs?
> > Why not get exit_module() to do "if (mod->usecount() != 0) return
> > -EBUSY; else mod->exit();"?
>
> Because mod->usecount may be a totally inadequate way of determining
> if a module is busy. How does it work for LSM, for example?
As established previously: unless the hooks do it for you, you keep
track of it yourself before you sleep.
> > There's the other issue of symmetry. If you allocate memory, in
> > start, do you clean it up in stop or exit?
>
> Actually, I'm going to press you on why you think you even need a
> two stage stop. I know you have your reasons, but I doubt any of
> the effects you aim at cannot be achieved with a single stage
> stop/exit. Could you please summarize the argument in favor of the
> two stage stop?
Of course you can simulate a two-stage within a single-stage,
by doing int single_stage(void) { stage_one(); stage_two(); }, so
"need" is a bit strong.
Basically, you can't do stage two until you know no one is using the
module, but you can do stage one at any time. Stage 1
ensures that the refcount never *increases* by removing all external
references to the module (ie. deregister, etc). Stage 2 then knows
that if the refcnt is zero, there's no race and it can destroy the
module data structures.
The synchronize_kernel() covers those "I was about to bump the module
count!" races, as long as no one explicitly sleeps before mod_inc, or
after mod_dec.
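Concretely, a hedged sketch of those two stages with invented helper names
(my_unregister_hooks, my_private), where synchronize_kernel() is assumed to be
run by the module loader between stop and exit as proposed, not a stock 2.4
export:

#include <linux/errno.h>
#include <linux/slab.h>
#include <asm/atomic.h>

static atomic_t my_usecount = ATOMIC_INIT(0);
static void *my_private;                        /* module-wide state */

static void my_unregister_hooks(void) { }       /* placeholder deregistration */

/* Stage 1 (stop): remove every external reference so the use count
 * can only decrease from here on. */
static void my_stop(void)
{
        my_unregister_hooks();
}

/* Stage 2 (exit): run after synchronize_kernel() has flushed anyone who
 * was about to bump the count.  If the count is zero now it stays zero,
 * so tearing down the module data structures is race-free. */
static int my_exit(void)
{
        if (atomic_read(&my_usecount) != 0)
                return -EBUSY;          /* caller may retry, or re-start the module */
        kfree(my_private);
        my_private = NULL;
        return 0;
}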
> > Similarly for other
> > resources: you call mod->exit() every time start fails, so that is
> > supposed to check that mod->start() succeeded?
>
> He does? That's not right. ->start should clean up after itself if
> it fails, like any other good Linux citizen.
From my reading, yes.
> > Of course, separating start into "init" and "start" allows you to
> > solve the half-initialized problem as well as clarify the rules.
>
> I doubt it gives any new capability at all.
Since I explained what it does for you at the kernel summit, you
obviously aren't listening. If you split registration interfaces into
reserve (can fail) and use (can't fail), then you do:
int my_module_init(void)
{
        int ret;

        ret = reserve_foo();
        if (ret != 0)
                return ret;
        ret = reserve_bar();
        if (ret != 0)
                unreserve_foo();
        return ret;
}

void my_module_start(void)
{
        use_foo();
        use_bar();
}
Note the symmetry here with the exit case: no one can actually use the
module until my_module_start is called, so even if the reserve_bar()
fails, we're safe.
> The same with the entrenched separation at the user level between
> create and init module: what does it give you that an error exit
> from a single create/init would not?
That's done for entirely different reasons, as the userspace linker
needs to know the address of the module.
> Sure, I know it's not going to change, but I'd like to know what the
> thinking was, and especially, if there's a non-bogus reason, I'd
> like to know it.
You should probably start playing with my code if you're really
interested.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
Rusty Russell wrote:
>
> In message <[email protected]> you write:
> > I don't see the point in this at all.
>
> Yes, I'm starting to realize that.
>
> Frankly, I'm sick of pointing out the same problems individually to
> every shallow thinker who thinks that "it's easy".
>
> The fundamental problems with modules are as follows:
> A) Many places in the kernel do not increment module reference counts
> for you, and it is difficult currently to write a module which safely
> looks after its own reference counts (see
> net/ipv4/netfilter/ip_conntrack_core.c's ip_conntrack_cleanup())
This sort of thing makes correct unloading hard, but in my eyes it
is an argument for changing the module interface to something better
and say "that thing CAN'T be a module of its own!"
What if we *require* modules to use some open/close reference count that
don't change so often as to be a performance problem?
It makes the unloading/reloading problem easier, similiar to
mount/umount.
It'll work for lots of modules, such as:
* drivers for hardware devices, char, block, and NIC.
* filesystems
Things with no easy refcounting (where it cannot even be grafted on)
will have to be non-modular, or folded into some parent module.
Maybe IP connection tracking can't be modular with these rules - so
what? Make it compiled-in, or a part of the IPV4-module. So
if you really need to load and unload that, you unload the
ipv4_with_conntrack module and then load an ipv4_without_conntrack
module.
To me, it seems like the current module interface is too fine-grained,
and that causes trouble. I.e. the overhead of correct refcounting
is so high in some cases that it isn't used - causing trouble
with safe unloading. The solution is to say no to such modules.
Making them work isn't easy - but why try?
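Concretely, the open/close style counting described above looks roughly like
this for a 2.4 character driver (a sketch: the name "mychar" and the empty
methods are illustrative only; the owner field is what lets the VFS take and
drop the module count around open and release, so rmmod simply fails while
any descriptor is open):

#include <linux/module.h>
#include <linux/fs.h>

static int mychar_major;

static int mychar_open(struct inode *inode, struct file *file)
{
        return 0;               /* counting is handled via fops.owner */
}

static int mychar_release(struct inode *inode, struct file *file)
{
        return 0;
}

static struct file_operations mychar_fops = {
        owner:          THIS_MODULE,    /* 2.4-style initializer */
        open:           mychar_open,
        release:        mychar_release,
};

static int __init mychar_init(void)
{
        mychar_major = register_chrdev(0, "mychar", &mychar_fops);
        return (mychar_major < 0) ? mychar_major : 0;
}

static void __exit mychar_exit(void)
{
        unregister_chrdev(mychar_major, "mychar");
}

module_init(mychar_init);
module_exit(mychar_exit);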
Helge Hafting
Hi,
On Fri, 13 Sep 2002, Rusty Russell wrote:
> > I'm aware of the init problem, what I described was the core problem,
> > which prevents any further cleanup.
>
> I don't think of either of them as core, they are two problems.
Your init problem is easily solvable for try_inc_mod_count() users: just
add a check for MOD_INITIALIZING, so it will fail during module init. On
the other hand, forcing everyone to use try_inc_mod_count is a serious
problem.
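Roughly what that check could look like against the 2.4 helper in
kernel/module.c -- reconstructed from memory, so treat the details
(unload_lock, __MOD_INC_USE_COUNT) as assumptions rather than a quote of the
real source:

int try_inc_mod_count(struct module *mod)
{
        int res = 1;

        if (mod) {
                spin_lock(&unload_lock);
                /* Refuse a new reference while the module is either on its
                 * way out or still running its init routine. */
                if (mod->flags & (MOD_DELETED | MOD_INITIALIZING))
                        res = 0;
                else
                        __MOD_INC_USE_COUNT(mod);
                spin_unlock(&unload_lock);
        }
        return res;
}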
> > able to answer is: Are there any objects/references belonging to the
> > module? It's a simple yes/no question. If a module can't answer that, it
> > likely has more problems than just module unloading.
>
> Ah, we're assuming you insert synchronize_kernel() between the call
> to stop and the call to exit?
>
> In which case *why* do you check the use count *inside* exit_affs_fs?
> Why not get exit_module() to do "if (mod->usecount() != 0) return
> -EBUSY; else mod->exit();"?
>
> There's the other issue of symmetry. If you allocate memory, in
> start, do you clean it up in stop or exit? Similarly for other
> resources: you call mod->exit() every time start fails, so that is
> supposed to check that mod->start() succeeded?
Actually I consider removing the start/stop methods; if one can't get it
right without them, one won't get it right with them either, and it's easy
enough to add them later again.
The exit function should always be called after the init function (even if
it failed; I don't do it in the patch, that's a bug). The fs init/exit
would look like this then:
static int __init init_fs(void)
{
        inode_cachep = kmem_cache_create(...);
        if (!inode_cachep)
                return -ENOMEM;
        return register_filesystem(&fs_type);
}

static int __exit exit_fs(void)
{
        int err;

        err = unregister_filesystem(&fs_type);
        if (err)
                return err;
        kmem_cache_destroy(inode_cachep);
        inode_cachep = NULL;
        return 0;
}
> I disagree with pushing the count into the filesystem structure: it
> changes nothing except hide the fact that you're only keeping
> reference counts for the sake of the module. Filesystems are low
> performance impact, but other subsystems really don't want that atomic
> inc and dec for every access.
As I already said before, it doesn't have to be an atomic reference count.
> if (down_interruptible(&module_mutex))
>         return -EINTR;
I really don't like that the global module lock is held during calls to
the driver, e.g. it makes it impossible to request further modules during
module load.
> This works better for me in ip_conntrack, since ideally the usage
> count is a count of the number of packets we're tracking, which is
> under control of the outside world, so I wanted an "rmmod -f".
The ip_conntrack problem is a variation of the LSM problem and both are a
problem of the driver, not of the module code.
Basically you have to wait long enough until no new packet can call into
the module. This means after removing the hooks, you have to check how
many packets are busy and wait for them to be processed.
bye, Roman
On Friday 13 September 2002 08:51, Rusty Russell wrote:
> If you split registration interfaces into reserve (can fail) and
> use (can't fail), then you do:
>
> int my_module_init(void)
> {
>         int ret;
>
>         ret = reserve_foo();
>         if (ret != 0)
>                 return ret;
>         ret = reserve_bar();
>         if (ret != 0)
>                 unreserve_foo();
>         return ret;
> }
>
> void my_module_start(void)
> {
>         use_foo();
>         use_bar();
> }
Why is that different from:
int my_module_init(void)
{
        int ret;

        ret = reserve_foo();
        if (ret != 0)
                return ret;
        ret = reserve_bar();
        if (ret != 0) {
                unreserve_foo();
                return ret;
        }
        use_foo();
        use_bar();
        return 0;
}
> Note the symmetry here with the exit case: noone can actually use the
> module until my_module_start is called, so even if the reserve_bar()
> fails, we're safe.
And in my example above, nobody can actually use the module until
use_foo(). What's the difference?
> > Sure, I know it's not going to change, but I'd like to know what the
> > thinking was, and especially, if there's a non-bogus reason, I'd
> > like to know it.
>
> You should probably start playing with my code if you're really
> interested.
Of course, and you might consider actually reading my [RFC]. We still
disagree on whether your fat interface or my thin one is the right way
to go. Don't forget that the Unix way has traditionally been to use
the simplest interface that will do the job; if you propose a fat
interface you need to prove that the thin one cannot do the job.
--
Daniel
On Friday 13 September 2002 12:35, Roman Zippel wrote:
> Hi,
>
> On Fri, 13 Sep 2002, Rusty Russell wrote:
>
> > > I'm aware of the init problem, what I described was the core problem,
> > > which prevents any further cleanup.
> >
> > I don't think of either of them as core, they are two problems.
>
> Your init problem is easily solvable for try_inc_mod_count() users, just
> add a check for MOD_INITIALIZING, so it will fail during module init. On
> the other hand forcing everyone to use try_inc_mod_count is a serious
> problem.
Well now. We agree about that, do we not? I go on to conclude that it
is better to leave the job of monitoring module activity and determining
quiescence up to the module itself; do we agree about that as well?
> Actually I consider removing the start/stop methods, if one can't get it
> right without them, one won't get it right with them either and it's easy
> enough to add them later again.
Exactly my feeling.
> The exit function should always be called after the init function (even if
> it failed, I don't do it in the patch, that's a bug). The fs init/exit
> would look like this then:
Perhaps, but if so, the module should call the exit function in
its failure path itself. Doing the full exit whether it needs to be
done or not is wasteful and opens up new DoS opportunities.
In the example you give below you must rely on unregister_filesystem
tolerating unregistering a nonexistent filesystem. That's sloppy at
best, and you will have to ensure *every* helper used by ->exit is
similarly sloppy.
> static int __init init_fs(void)
> {
> inode_cachep = kmem_cache_create(...);
> if (!inode_cachep)
> return -ENOMEM;
> return register_filesystem(&fs_type);
> }
>
> static int __exit exit_fs(void)
> {
> int err;
> err = unregister_filesystem(&fs_type);
> if (err)
> return err;
> kmem_cache_destroy(inode_cachep);
> inode_cachep = NULL;
> return 0;
> }
--
Daniel
Hi,
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> On Friday 13 September 2002 08:51, Rusty Russell wrote:
> > [cool code]
>
> Why is that different from:
>
> [more code]
Because in your example, my_module_start() would not be able to run
separately, and because, as Rusty mentioned, my_module_init() can fail
separately (e.g. if there's no space to drop that).
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
On Friday 13 September 2002 15:52, Thunder from the hill wrote:
> Hi,
>
> On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > On Friday 13 September 2002 08:51, Rusty Russell wrote:
> > > [cool code]
> >
> > Why is that different from:
> >
> > [more code]
>
> Because in your example, my_module_start() would not be able to run
> separately
That's obvious. What hasn't been shown is why that's necessary.
Note: this is the *real* meaning of "begs the question". You answered
my question "why is it necessary the these to be separate" with "because
if they were not separate, then you could not use them separately". In
logical terms, it amounts to "A because A". This is a logical fallacy
called "begging the question".
When people say "begs the question", 99% of the time they really mean
"invites the question". As an exercise, try scanning lkml for "From
includes Torvalds" and "begs". Linus studied debating ;-)
--
Daniel
Hi,
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > Because in your example, my_module_start() would not be able to run
> > separately
>
> That's obvious. What hasn't been shown is why that's necessary.
Yeah, but it was also obvious that my_module_init() can fail this way.
Look, first we watch the module initialization, that is, we run the
critical stuff like resource allocation, data structure allocation, etc.
If we fail here, we can't load the module, because it would be inoperative
if we proceed. (Because the data simply isn't there.)
And possibly Rusty wanted to avoid a certain race, which is unrelated to
school and kids. Once we've initialized, the module can be used; using it
earlier is balderdash.
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
On Friday 13 September 2002 16:33, Thunder from the hill wrote:
> Hi,
>
> On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > > Because in your example, my_module_start() would not be able to run
> > > separately
> >
> > That's obvious. What hasn't been shown is why that's necessary.
>
> Yeah, but it was also obvious that my_module_init() can fail this way.
> Look, first we watch the module initialization, that is, we run the
> critical stuff like resource allocation, data structure allocation, etc.
> If we fail here, we can't load the module, because it would be inoperative
> if we proceed. (Because the data simply isn't there.)
>
> And possibly Rusty wanted to avoid a certain race, which is unrelated to
> school and kids. Once we've initialized, the module can be used, earlier
> is balderdash.
On the face of it, your post is content-free. To redeem yourself,
please identify the race Rusty avoided and show how I did not avoid
the same race.
--
Daniel
Hi,
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> On Friday 13 September 2002 16:33, Thunder from the hill wrote:
> > Look, first we watch the module initialization, that is, we run the
> > critical stuff like resource allocation, data structure allocation, etc.
> > If we fail here, we can't load the module, because it would be inoperative
> > if we proceed. (Because the data simply isn't there.)
-> if we don't do the starting here, we can operate on the data structures
earlier, since we know they're running free.
We could also run start again, even though it sounds buggy.
> > And possibly Rusty wanted to avoid a certain race, which is unrelated to
> > school and kids.
Ouch. I've referred to a comparison here which I've dropped later on. Sorry
if you felt insulted.
> please identify the race Rusty avoided and show how I did not avoid the
> same race.
I'm sure Rusty could do that better. However, there might be some weird
situations. For example, take someone trying to bring all modules down
the moment we init. We might start running in an unchecked environment, and
there we fail because there is no 'we' any more.
Thus rather module->init(). if (module) module->start(). Since then we can
be sure that the module is locked, and if somebody unloads it, he'll have
to wait for the use count to drop.
Or as another example, take someone trying to use the resources we claimed
before the module is really up. If you can rely on the module being known
to be up, you know what to do. Yes, usually that's not a really good example,
since resources ought to be locked as well.
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
On Friday 13 September 2002 16:59, Thunder from the hill wrote:
> Hi,
>
> On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > On Friday 13 September 2002 16:33, Thunder from the hill wrote:
> > > Look, first we watch the module initialization, that is, we run the
> > > critical stuff like resource allocation, data structure allocation, etc.
> > > If we fail here, we can't load the module, because it would be inoperative
> > > if we proceed. (Because the data simply isn't there.)
>
> -> if we don't do the starting here, we can operate on the data structures
> earlier, since we know they're running free.
>
> Also could we run start again, even though it sounds buggy.
We can do this without having start separate from init as well.
> > please identify the race Rusty avoided and show how I did not avoid the
> > same race.
>
> I'm sure Rusty could do that better.
I'd be surprised if Rusty can do it any better than you. It's hard to
show a race that doesn't exist, even harder to prove that a four-prong
interface is necessary in order to be able to handle it. The latter is
the question on the table.
> However, there might be some weird
> situations. For example, take someone trying to bring all modules down
> the moment we init. We might start running in unchecked environment, and
> there we fail because there is no 'we' any more.
Oh indeed, there are weird situations, but they apply equally to the
two-prong and the four-prong interfaces.
> Thus rather module->init(). if (module) module->start(). Since then we can
> be sure that the module is locked, and if somebody unloads it, he'll have
> to wait for the use count to drop.
This applies equally to the two-prong interface.
> Or as another example, take someone trying to use the resources we claimed
> before the module is really up. If you can rely on the module to be known
> to be up, you know what to do. Yes, usually that's not a really good example,
> since resources ought to be locked as well.
This applies equally to the two-prong interface. Do you see the pattern
yet?
--
Daniel
Hi,
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > The exit function should always be called after the init function (even if
> > it failed, I don't do it in the patch, that's a bug). The fs init/exit
> > would look like this then:
>
> Perhaps, but if so, the module itself should call the exit function in
> its failure path itself. Doing the full exit whether it needs to be
> done or not is wasteful and opens up new DoS opportunities.
The exit itself can fail as well, so it has to be done by the module code
anyway (until it succeeds).
What DoS opportunities are there? Module init failure is the exception
case and usually needs further attention, so we could actually disable
further attempts to load this module, unless the user specifically tells
us to.
> In the example you give below you must rely on unregister_filesystem
> tolerating unregistering a nonexistent filesystem. That's sloppy at
> best, and you will have to ensure *every* helper used by ->exit is
> similarly sloppy.
Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well.
bye, Roman
On Friday 13 September 2002 17:13, Roman Zippel wrote:
> Hi,
>
> On Fri, 13 Sep 2002, Daniel Phillips wrote:
>
> > > The exit function should always be called after the init function (even if
> > > it failed, I don't do it in the patch, that's a bug). The fs init/exit
> > > would look like this then:
> >
> > Perhaps, but if so, the module itself should call the exit function in
> > its failure path itself. Doing the full exit whether it needs to be
> > done or not is wasteful and opens up new DoS opportunities.
>
> The exit itself can fail as well, so it has to be done by the module code
> anyway (until it succeeds).
That's debatable. Arguably, a failed ->module_cleanup() should be
retried on every rmmod -a, but expecting module.c to just keep
retrying mindlessly on its own sounds too much like a busy wait.
> What DoS opportunities are there?
Suppose the module exit relies on synchronize_kernel. The attacker
can force repeated synchronize_kernels, knowing that module.c will
mindlessly do a synchronize_kernel every time a module init fails,
whether needed or not. Each synchronize_kernel takes an unbounded
amount of time to complete, across which module.c holds a lock.
> Module init failure is the exception
> > case and usually needs further attention, so we could actually disable
> further attempts to load this module, unless the user tells us
> specifically so.
Sure, you can fix it by lathering on more complexity. What you have
to do is explain why we should do that, when there is a simpler and
faster approach that doesn't introduce the problem in the first
place.
> > In the example you give below you must rely on unregister_filesystem
> > tolerating unregistering a nonexistent filesystem. That's sloppy at
> > best, and you will have to ensure *every* helper used by ->exit is
> > similarly sloppy.
>
> Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well.
That is sloppy. Different discussion.
I take it that the points you didn't reply to are points that you
agree with? (The main point being, that we both advocate a simple,
two-method interface for module load/unload.)
--
Daniel
Hi,
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> I'd be surprised if Rusty can do it any better than you. It's hard to
> show a race that doesn't exist, even harder to prove that a four-prong
> interface is necessary in order to be able to handle it. The latter is
> the question on the table.
Not as easy as missing one's point, yes.
> > However, there might be some weird situations. For example, take
> > someone trying to bring all modules down the moment we init. We might
> > start running in unchecked environment, and there we fail because
> > there is no 'we' any more.
>
> Oh indeed, there are weird situations, but they apply equally to the
> two-prong and the four-prong interfaces.
>
> > Thus rather module->init(). if (module) module->start(). Since then we can
> > be sure that the module is locked, and if somebody unloads it, he'll have
> > to wait for the use count to drop.
>
> This applies equally to the two-prong interface.
>
> > Or as another example, take someone trying to use the resources we claimed
> > before the module is really up. If you can rely on the module to be known
> > to be up, you know what to do. Yes, usually that's not a really good example,
> > since resources ought to be locked as well.
>
> This applies equally to the two-prong interface. Do you see the pattern
> yet?
Yes, but you don't seem to. (No, I don't want to insult you here.)
Just to draw that:
2p:

thread1                                    thread2
struct x *y = malloc(sizeof(struct x));
check y;
blah();                                    cleanup(y et al);
touch y->blah; /* bang */

4p:

thread1                                    thread2
struct x *y = malloc(sizeof(struct x));
check y;
struct z *a = malloc(sizeof(struct z));
check a;                                   cleanup(y et al);
struct u *v = malloc(sizeof(struct u));
check v;
return success;
(back in caller - from that moment we're protected against arbitrariness
from other threads)
check y; <-- detected
You know? We allocate, we lock, we check, we win.
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
On Friday 13 September 2002 17:27, Thunder from the hill wrote:
> > This applies equally to the two-prong interface. Do you see the pattern
> > yet?
>
> Yes, but you don't seem to. (No, I don't want to insult you here.)
>
> Just to draw that:
>
> 2p:
>
> thread1                                    thread2
> struct x *y = malloc(sizeof(struct x));
> check y;
> blah();                                    cleanup(y et al);
> touch y->blah; /* bang */
This can't happen because a semaphore serializes the load and unload.
(Currently, module.c uses lock_kernel, which is obviously inadequate.)
--
Daniel
Hi,
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > The exit itself can fail as well, so it has to be done by the module code
> > anyway (until it succeeds).
>
> That's debatable. Arguably, a failed ->module_cleanup() should be
> retried on every rmmod -a, but expecting module.c to just keep
> retrying mindlessly on its own sounds too much like a busy wait.
That's not what I meant: if module_init fails, the module goes directly to
the cleanup state and the module code calls module_exit. Depending on this
return value it continues to the exit state. Further exit attempts (if
necessary) are done on user request.
> > What DoS opportunities are there?
>
> Suppose the module exit relies on synchronize_kernel. The attacker
> can force repeated synchronize_kernels, knowing that module.c will
> mindlessly do a synchronize_kernel every time a module init fails,
> whether needed or not. Each synchronize_kernel takes an unbounded
> amount of time to complete, across which module.c holds a lock.
This can't happen:
if (hook) {
        hook = NULL;
        synchronize();
}
> > Module init failure is the exception
> > case and usually needs further attention, so we could actually disable
> > further attempts to load this module, unless the user tells us
> > specifically so.
>
> Sure, you can fix it by lathering on more complexity. What you have
> to do is explain why we should do that, when there is a simpler and
> faster approach that doesn't introduce the problem in the first
> place.
It doesn't add any complexity (at least not to the kernel). A simple
approach might be that a failed kernel module cannot be loaded with
modprobe anymore; this sort of policy can be done in userspace.
> I take it that the points you didn't reply to are points that you
> agree with? (The main point being, that we both advocate a simple,
> two-method interface for module load/unload.)
Basically yes, it's just that your initial RFC was more confusing than
helpful.
bye, Roman
On Friday 13 September 2002 08:51, Rusty Russell wrote:
> In message <E17pg3H-0007pb-00@starship> you write:
> > The same with the entrenched separation at the user level between
> > create and init module: what does it give you that an error exit
> > from a single create/init would not?
>
> That's done for entirely different reasons, as the userspace linker
> needs to know the address of the module.
Thanks for clearing that up. Perhaps a callback would have been
better, but it's obviously moot now.
--
Daniel
On Friday 13 September 2002 17:55, Roman Zippel wrote:
> Hi,
>
> On Fri, 13 Sep 2002, Daniel Phillips wrote:
>
> > > The exit itself can fail as well, so it has to be done by the module code
> > > anyway (until it succeeds).
> >
> > That's debatable. Arguably, a failed ->module_cleanup() should be
> > retried on every rmmod -a, but expecting module.c to just keep
> > retrying mindlessly on its own sounds too much like a busy wait.
>
> That's not what I meant, if module_init fails the module goes directly to
> the cleanup state and the module code calls module_exit. Depending on this
> return value it continues to the exit state.
What I don't like about that is that module_exit doesn't know the exact state
module_init was in when it encountered the error. Look at how error
cleanup is done normally: a bunch of gotos that jump into a sequence that
releases resources in reverse order to the way they were allocated, so
that each initialization state has a corresponding error cleanup state.
How are you going to recover that level of precision if the error
cleanup is in a separate function?
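For reference, the idiom in question looks like this, with placeholder helpers
(grab_a/grab_b/grab_c and the matching drop_ calls are not real kernel
functions; only the control flow matters):

static int grab_a(void) { return 0; }   /* placeholder acquisitions */
static int grab_b(void) { return 0; }
static int grab_c(void) { return 0; }
static void drop_a(void) { }            /* matching releases */
static void drop_b(void) { }

static int my_init(void)
{
        int err;

        err = grab_a();
        if (err)
                goto fail_a;
        err = grab_b();
        if (err)
                goto fail_b;
        err = grab_c();
        if (err)
                goto fail_c;
        return 0;

fail_c:
        drop_b();       /* undo exactly what succeeded, in reverse order */
fail_b:
        drop_a();
fail_a:
        return err;
}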
> Further exit attempts (if necessary) are done on user request.
Yep, another point nailed down. I'm keeping a list ;-)
> > > What DoS opportunities are there?
> >
> > Suppose the module exit relies on synchronize_kernel. The attacker
> > can force repeated synchronize_kernels, knowing that module.c will
> > mindlessly do a synchronize_kernel every time a module init fails,
> > whether needed or not. Each synchronize_kernel takes an unbounded
> > amount of time to complete, across which module.c holds a lock.
>
> This can't happen:
>
> if (hook) {
>         hook = NULL;
>         synchronize();
> }
Ah, this relies on synchronize_kernel only being called when necessary.
In general, that can only be known by the module itself. This is one
more point we agree on, and which differs from Rusty's proposal.
> > > Module init failure is the exception
> > > case and usually needs further attention, so we could actually disable
> > > further attempts to load this module, unless the user tells us
> > > specifically so.
> >
> > Sure, you can fix it by lathering on more complexity. What you have
> > to do is explain why we should do that, when there is a simpler and
> > faster approach that doesn't introduce the problem in the first
> > place.
>
> It doesn't add any complexity (at least not to the kernel). A simple
> approach might be that a failed kernel module cannot be loaded with
> modprobe anymore, this sort of policy can be done in userspace.
User space complexity is an issue too, given that there is an approach that
avoids complexity in both the kernel and user space.
> > I take it that the points you didn't reply to are points that you
> > agree with? (The main point being, that we both advocate a simple,
> > two-method interface for module load/unload.)
>
> Basically yes, it's just that your initial RFC was more confusing than
> helpful.
Humble apologies. I will rewrite it now that I've had some practice
at explaining the points. The second try should be much more concise
and backed with more actual code. Some of which I will shamelessly
lift from you and Rusty ;-)
--
Daniel
Hi,
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> That's debatable. Arguably, a failed ->module_cleanup() should be
> retried on every rmmod -a, but expecting module.c to just keep
> retrying mindlessly on its own sounds too much like a busy wait.
Hmmm. You might as well give it back to the user.
rmmod: remove failed: do it again!
So the cleanup code could as well just do it on its own.
> > Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well.
>
> That is sloppy. Different discussion.
What should kfree do in your opinion? BUG()?
doodle.c:12: attempted to free NULL pointer, as you know it already is.
> I take it that the points you didn't reply to are points that you
> agree with? (The main point being, that we both advocate a simple,
> two-method interface for module load/unload.)
You could even do it using three methods.
Thunder
--
--./../...-/. -.--/---/..-/.-./..././.-../..-. .---/..-/.../- .-
--/../-./..-/-/./--..-- ../.----./.-../.-.. --./../...-/. -.--/---/..-
.- -/---/--/---/.-./.-./---/.--/.-.-.-
--./.-/-.../.-./.././.-../.-.-.-
On Friday 13 September 2002 18:39, Thunder from the hill wrote:
> Hi,
>
> On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > That's debatable. Arguably, a failed ->module_cleanup() should be
> > retried on every rmmod -a, but expecting module.c to just keep
> > retrying mindlessly on its own sounds too much like a busy wait.
>
> Hmmm. You might as well give it back to the user.
>
> rmmod: remove failed: do it again!
That's what happens now. We can certainly improve the error message,
and actually, that just falls out from properly adding an error
return code to ->cleanup_module().
> So the cleanup code could as well just do it on its own.
???
> > > Why is that sloppy? E.g. kfree() happily accepts NULL pointers as well.
> >
> > That is sloppy. Different discussion.
>
> What should kfree do in your opinion? BUG()?
Yuppers:
static inline void kfree_test(void *object)
{
        if (object)
                kfree(object);
}

#define kfree_sloppy kfree_test
s/kfree/kfree_sloppy/g
(But see "different discussion" above.)
> doodle.c:12: attempted to free NULL pointer, as you know it already is.
Um. You know it's NULL and you freed it anyway?
(But see "different discussion" above.)
> > I take it that the points you didn't reply to are points that you
> > agree with? (The main point being, that we both advocate a simple,
> > two-method interface for module load/unload.)
>
> You could even do it using three methods.
Yes, or two, my favorite.
--
Daniel
On Saturday 07 September 2002 10:43 pm, Alexander Viro wrote:
> IOW, filesystem shutdown is _always_ a garbage collection - there's nothing
> special about vfsmounts that are not mounted somewhere. Check for fs
> being busy in umount(2) is just a compatibility code - kernel is perfectly
> happy with dissolving mountpoints, busy or not. If stuff mounted is not
> busy it will be shut down immediately, if not - when it stops being busy.
This wanders into something I've been trying to figure out for a while now.
I want to move an existing mount point. I can't unmount it, because it's got
a file open in it, but I want to move it to another binding in the tree and
make the original binding go away so the filesystem IT'S under can go away.
I can create a duplicate binding, but can't get the first binding to go away
while the filesystem's in use.
In case you're curious, the reason is I created a system where the whole root
partition is a zisofs image, with /boot, /etc, /root, and /tmp symlinks to
directories under /var. When the system is fully up, /var will be a mount
point for /dev/hda1 (ext3), and /home on /dev/hda3. / is loopback mounted
from a file ("firmware") under /var.
To make this work, I make an initrd, which mounts /dev/hda1 on /var, does an
losetup on /dev/loop0, exits and lets the boot continue (with root on
/dev/loop0), and then runs /sbin/init-first which does a mount --bind
/initrd/var to /var (duplicating the mount point), and then "exec
/sbin/init", and life goes on. (Obviously, init-first has to run before init
or else /etc is a symlink pointing to nothing. Haven't cleaned up /etc
enough that it can be read-only, not sure I ever will...)
The downside of duplicating the binding in init-first rather than moving it
is I can't unmount /initrd and free its memory. It's not a major burning
issue keeping me awake at night (caffeine manages to do that quite well on
its own, thank you), it's just two extra mount entries in /proc/mounts and
about four megabytes of wasted RAM on a system with a quarter gig of the
stuff, but it seems there should be a way to do this...
Since each filesystem only has one superblock (regardless of the number of
mount points), there's no way it can care too deeply about where it's
mounted. The losetup really shouldn't care too much either (fundamentally,
it's bound to an inode). So there's no theoretical reason "mount --move /old
/new" couldn't be made to work. There may need to be a loop to flush
dentries, and some locking, but that would be about it. (Mount points under
the mount point being moved could either be moved as well, or the action
could be denied if any exist. The first is cleaner, it's sort of the
opposite of chroot or pivot_root. P.S. yes, I tried a few variants of
pivot_root, that fails if either of the filesystems has any files open in
it...)
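For reference, the operation described here corresponds to an MS_MOVE flag to
mount(2), which is what "mount --move" uses; whether a particular 2.4 kernel
already has it depends on the exact version, so the following userspace call
is a sketch rather than a statement about the setup above, with the paths
taken from the example:

#include <stdio.h>
#include <sys/mount.h>

#ifndef MS_MOVE
#define MS_MOVE 8192    /* value from later kernel headers; an assumption if absent */
#endif

int main(void)
{
        /* Atomically detach the mount at /initrd/var and reattach it at /var,
         * carrying its open files along with it. */
        if (mount("/initrd/var", "/var", NULL, MS_MOVE, NULL) != 0) {
                perror("mount --move");
                return 1;
        }
        return 0;
}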
It could just be that, since the concept of mount point and filesystem
instance were only separated in 2.4, this is just a historical relic.
If so, I'd like to correct it. A filesystem instance can't go away with an
open file, but a mount point can...
Did I miss the ability to do this already, or are the userspace tools simply
not taking advantage of something the kernel can currently do, or do I need to
come up with a patch to make this work under the 2.4 kernel? (I've poked
around a bit, and vfs.txt does a pretty good job of describing how things
were in 2.1.99, back when mount points and filesystem instances were the same
thing...)
Rob
On Fri, 13 Sep 2002, Daniel Phillips wrote:
> > What DoS opportunities are there?
>
> Suppose the module exit relies on synchronize_kernel. The attacker
> can force repeated synchronize_kernels, knowing that module.c will
> mindlessly do a synchronize_kernel every time a module init fails,
> whether needed or not. Each synchronize_kernel takes an unbounded
> amount of time to complete, across which module.c holds a lock.
That's a good argument WRT the kernel autoloader, but not for manual load
by root. The system should not refuse to attempt an operation about which
root may have additional information (odd hardware or the like). We don't
want to have the kernel DOS itself, but root should be allowed to at least
try.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
In message <Pine.LNX.4.44.0209131131210.28515-100000@serv> you write:
> > I don't think of either of them as core, they are two problems.
>
> Your init problem is easily solvable for try_inc_mod_count() users, just
> add a check for MOD_INITIALIZING, so it will fail during module init. On
> the other hand forcing everyone to use try_inc_mod_count is a serious
> problem.
Well, yes. I think Daniel, yourself and I all share a dislike for
try_inc_mod_count() everywhere, but it *is* one solution.
> The exit function should always be called after the init function (even if
> it failed, I don't do it in the patch, that's a bug). The fs init/exit
> would look like this then:
I disagree with this, but really, it's a detail.
> > I disagree with pushing the count into the filesystem structure: it
> > changes nothing except hide the fact that you're only keeping
> > reference counts for the sake of the module. Filesystems are low
> > performance impact, but other subsystems really don't want that atomic
> > inc and dec for every access.
>
> As I already said before, it doesn't have to be an atomic reference count.
Please explain? It has to be atomic for all the interesting cases
(sure, fs mounting might be protected by a lock, but other things aren't).
> > if (down_interruptible(&module_mutex))
> >         return -EINTR;
>
> I really don't like that the global module lock is held during calls to
> the driver, e.g. it makes it impossible to request further modules during
> module load.
Yes, this is a problem. It's a little icky to fix though, since you
must disallow the *same* module being loaded. I can certainly do it:
well spotted.
> > This works better for me in ip_conntrack, since ideally the usage
> > count is a count of the number of packets we're tracking, which is
> > under control of the outside world, so I wanted an "rmmod -f".
>
> The ip_conntrack problem is a variation of the LSM problem and both are a
> problem of the driver not of the module code.
> Basically you have to wait long enough until no new packet can call into
> the module. This means after removing the hooks, you have to check how
> many packets are busy and wait for them to be processed.
Yes, which means an atomic reference count somewhere. At the moment I
spin inside the cleanup() routine (not nice at all).
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
In message <E17pqaz-000891-00@starship> you write:
> On Friday 13 September 2002 08:51, Rusty Russell wrote:
> > If you split registration interfaces into reserve (can fail) and
> > use (can't fail), then you do:
>
> Why is that different from:
>
> int my_module_init(void)
> {
>         int ret;
>
>         ret = reserve_foo();
>         if (ret != 0)
>                 return ret;
>         ret = reserve_bar();
>         if (ret != 0) {
>                 unreserve_foo();
>                 return ret;
>         }
>         use_foo();
>         use_bar();
>         return 0;
> }
As I said:
R> Of course you can simulate a two-stage within a single-stage,
R> by doing int single_stage(void) { stage_one(); stage_two(); }, so
R> "need" is a bit strong.
> Of course, and you might consider actually reading my [RFC].
You weighed into a debate without background, with a longwinded
"original" suggestion which wasn't, and then you accuse *me* of not
reading?
You divided modules into counting and non-counting. This is overly
simplistic. In fact, *interfaces* are divided into counting and
non-counting: a module may use both. A module which only uses
counting interfaces is trivially safe from unload races. The
interesting problem is modules which control their own reference
counts, because they use one or more non-counting interfaces.
Your "solution" does not work:
> unregister_callpoints(...);
> magic_wait_for_quiescence();
> return cleanup_foo(...);
In fact, it would look like:
> unregister_callpoints(...);
> synchronize_kernel();
> if (atomic_read(&usecount) != 0) {
>         reregister_callpoints(...);
>         return -EBUSY;
> }
> cleanup_foo();
Now, think what happens if reregister_callpoints() fails. So we need
"unuse_xxx" here then.
Now, *think* for one moment, from the point of view of the author of
one of these modules. Now, how are you going to explain the subtle
requirements of your "two stage in one" interfaces? Bear in mind that
the init races weren't even understood by anyone on linux-kernel until
two years ago, and you're dealing with a newbie kernel programmer.
Now do you see my preference for taking the weight off the shoulders
of module authors? It's just not sane to ask them to deal with these
fairly esoteric races, and expect them to get it right.
We could simply ban modules from using non-counting interfaces. Or we
could introduce two-stage registration interfaces and then simply ban
their unloading. Or we can make their unloading a kernel hacking
option. Or we can provide all the infrastructure, and allow the
module authors to set their own comfort level.
> Don't forget that the Unix way has traditionally been to use the
> simplest interface that will do the job; if you propose a fat
> interface you need to prove that the thin one cannot do the job.
Gee, really? You're so clever!
You patronising little shit,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.
On Monday 16 September 2002 04:17, Rusty Russell wrote:
> You weighed into a debate without background, with a longwinded
> "original" suggestion which wasn't, and then you accuse *me* of not
> reading?
Yup. If you had actually read it, I apologize, but you showed every
sign of having not read it.
> You divided modules into counting and non-counting. This is overly
> simplistic. In fact, *interfaces* are divided into counting and
> non-counting: a module may use both.
Read closely and you will see I covered that. Sure, my [rfc] may have
been less concise than the perfect piece of prose you would have
submitted in my place, but other than that, it seems to have held up
rather well under attack.
> A module which only uses
> counting interfaces is trivially safe from unload races. The
> interesting problem is module which control their own reference
> counts, because they use one or more non-counting interfaces.
>
> Your "solution" does not work:
>
> > unregister_callpoints(...);
> > magic_wait_for_quiescence();
> > return cleanup_foo(...);
>
> In fact, it would look like:
>
> > unregister_callpoints(...);
> > synchronize_kernel();
> > if (atomic_read(&usecount) != 0) {
> > reregister_callpoints(...);
> > return -EBUSY;
> > }
> > cleanup_foo();
Ah no, I don't like the second one; it will introduce erratic behaviour
in LSM. You will have a state where some security calls are implemented
and others not, and then you will return to a state where all are. Eep.
Granted, LSM will always have a state when the calls are partially
unplugged, but the next state after that should be complete removal.
The admin is going to be awfully surprised if something else happens.
So what I had in mind here (and did mention it) is to take the counter
into account in magic_wait_for_quiescence(), on the understanding that
this counter only counts possibly-sleeping states of module threads.
Anyway, how do you propose to handle this:
task in module:
        inc count
        sleep
                        synchronize_kernel:
                                schedule()
                                wake up on cpu 1
        wake up on cpu 0
        dec count
                                if (atomic_read(&usecount) != 0) /* false */
                                        free_module(module);
        BOOM!
> Now, think what happens if reregister_callpoints() fails. So we need
> "unuse_xxx" here then.
>
> Now, *think* for one moment, from the point of view of the author of
> one of these modules. Now, how are you going to explain the subtle
> requirements of your "two stage in one" interfaces? Bear in mind that
> the init races weren't even understood by anyone on linux-kernel until
> two years ago, and you're dealing with a newbie kernel programmer.
OK, *finally* we get to the core of your argument. You accused me of
being long-winded, but you have just taken several days to admit that
there is no inherent difference in capability between your proposal and
mine. Though to be sure, some of your commentary was informative and
useful, still, you should have taken the position you are now about to
take right from the beginning.
> Now do you see my preference for taking the weight off the shoulders
> of module authors? It's just not sane to ask them to deal with these
> fairly esoteric races, and expect them to get it right.
I feel no need to take any weight off the shoulders of the LSM authors;
they are big boys and they'd better know what they're doing. IMHO, they'd
appreciate maximum control at the insertion/removal stage.
But proc modules, device drivers, even filesystems: I strongly agree
we need a dumbed-down interface. However, please forgive me for not
immediately accepting the proposition that yours is the final word in
dumbed-down module interfaces. I'm thinking that a set of easy-to-use
library calls will be just as good and will win on the grounds of using
the simpler interface.
> We could simply ban modules from using non-counting interfaces.
No we can't. As I mentioned in my [rfc], there are some modules that
simply cannot use this technique, and I hope we've already rejected
the strategy of disallowing removal of such modules. (The Chinese
have a saying that covers this: "cut off toes to avoid worm coming".)
> Or we
> could introduce two-stage registration interfaces and then simply ban
> their unloading.
No, the complaining would never end.
> Or we can make their unloading a kernel hacking
> option. Or we can provide all the infrastructure, and allow the
> module authors to set their own comfort level.
This is a logical fallacy; let's call it "trifurcation". You have
presented three alternatives as if they were the only alternatives.
Two were bogus anyway; by either measure this last qualifies as pure
rhetoric.
In summary, you have retreated from the position that my interface
is somehow less capable than yours, and you have dug in to defend
the position that your interface is better for newbie programmers,
which appears to be your only argument at the moment.
If you turn out to be right, then good and my hat off to you. If
you are wrong, and I can present an interface that is just as easy
for newbie programmers, then I am right and you should thank me
for sticking to my guns and insisting that we go with the simplest
interface that will do the job.
--
Daniel
This is the third in my "Understanding the Principles of Argumentation"
series. Rusty has obligingly provided us with a fine example of an "ad
hominem attack":
On Monday 16 September 2002 04:17, Rusty Russell wrote:
> > Don't forget that the Unix way has traditionally been to use the
> > simplest interface that will do the job; if you propose a fat
> > interface you need to prove that the thin one cannot do the job.
>
> Gee, really? You're so clever!
>
> You patronising little shit,
An ad hominem attack is a logical fallacy where the arguer attacks the
person, rather than the issue:
http://home.mcn.net/~montanabw/fallacies.html
--
Daniel
Can you PLEASE discuss that in PRIVATE?
Thanks in advance.
On Mon, 16 Sep 2002, Daniel Phillips wrote:
> This is the third in my "Understanding the Principles of Argumentation"
> series. Rusty has obligingly provided us with a fine example of an "ad
> hominem attack":
>
> On Monday 16 September 2002 04:17, Rusty Russell wrote:
> > > Don't forget that the Unix way has traditionally been to use the
> > > simplest interface that will do the job; if you propose a fat
> > > interface you need to prove that the thin one cannot do the job.
> >
> > Gee, really? You're so clever!
> >
> > You patronising little shit,
>
> An ad hominem attack is a logical fallacy where the arguer attacks the
> person, rather than the issue:
>
> http://home.mcn.net/~montanabw/fallacies.html
>
>
--
Robinson Maureira Castillo
Asesor DAI
INACAP
In cases where the issue is extraordinarily boring and pointless, it's
sometimes more entertaining to attack the person.
His attack is not baseless, though. You are being a "patronizing little
shit". Solve the problem and move on. Do not argue for days about a
hangnail.
} This is the third in my "Understanding the Principles of Argumentation"
} series. Rusty has obligingly provided us with a fine example of an "ad
} hominem attack":
Rusty doesn't like people quoting him in signatures, but that one's a definite possibility...
Cheers, Dean McEwan. Currently hacking KGI, which I don't understand, oh and ask me about OpenModemTalk...
Hi,
On Mon, 16 Sep 2002, Rusty Russell wrote:
> > Your init problem is easily solvable for try_inc_mod_count() users, just
> > add a check for MOD_INITIALIZING, so it will fail during module init. On
> > the other hand forcing everyone to use try_inc_mod_count is a serious
> > problem.
>
> Well, yes. I think Daniel, yourself and I all share a dislike for
> try_inc_mod_count() everywhere, but it *is* one solution.
And it's an overly complex one; with my patch I demonstrated how it could
be removed and made simpler.
> > The exit function should always be called after the init function (even if
> > it failed, I don't do it in the patch, that's a bug). The fs init/exit
> > would look like this then:
>
> I disagree with this, but really, it's a detail.
I wouldn't call it a detail; it's an important part of the complete module
load/unload mechanism.
> > > I disagree with pushing the count into the filesystem structure: it
> > > changes nothing except hide the fact that you're only keeping
> > > reference counts for the sake of the module. Filesystems are low
> > > performance impact, but other subsystems really don't want that atomic
> > > inc and dec for every access.
> >
> > As I already said before, it doesn't have to be an atomic reference count.
>
> Please explain? It has to be atomic for all the interesting cases
> (sure, fs mounting might be protected by a lock, but other things aren't).
Let me explain it in a larger context. You and Daniel are making it really
more complex than necessary. Only the module itself can really answer the
question whether it's safe to unload or not. So all the module code
needs to do is some general module management and ask the module somehow,
when the user requests it, whether it's safe to unload. The easiest way
for modules to check for this is to use counters; it's very simple and
covers the majority of modules.
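A minimal sketch of that counter idea in the two-method scheme under
discussion (the names are invented, and it deliberately leaves aside the
in-flight-increment ordering questions argued about elsewhere in this
thread):

#include <linux/errno.h>
#include <asm/atomic.h>

static atomic_t my_users = ATOMIC_INIT(0);      /* bumped/dropped by the module */

static void my_unregister_hooks(void) { }       /* placeholder deregistration */

/* The module itself answers the "am I busy?" question and refuses
 * the unload if the answer is yes. */
static int my_exit(void)
{
        if (atomic_read(&my_users) != 0)
                return -EBUSY;
        my_unregister_hooks();
        return 0;
}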
The few modules that don't want to or can't use counters can use whatever
they want, and they usually know better how to synchronize with the part of
the kernel they installed their hooks into, without disturbing other parts
of the kernel too much.
I really don't like the synchronize calls as a general mechanism: either
they have to do too much and possibly disturb other parts without good
reason, or modules have to take care of it (e.g. not somehow calling
schedule() without extra protection). This makes the whole synchronize
mechanism very fragile, so I'd prefer to keep it in the modules which
really need it, where it can be better controlled.
> > The ip_conntrack problem is a variation of the LSM problem and both are a
> > problem of the driver not of the module code.
> > Basically you have to wait long enough until no new packet can call into
> > the module. This means after removing the hooks, you have to check how
> > many packets are busy and wait for them to be processed.
>
> Yes, which means an atomic reference count somewhere. At the moment I
> spin inside the cleanup() routine (not nice at all).
Packets don't come out of nowhere. I'm not familiar with the locking
there, but it should be possible to use a cheap nonatomic count, which is
only accessed read-only outside the normal locking.
bye, Roman
On Monday 16 September 2002 23:36, Roman Zippel wrote:
> > > > I disagree with pushing the count into the filesystem structure: it
> > > > changes nothing except hide the fact that you're only keeping
> > > > reference counts for the sake of the module. Filesystems are low
> > > > performance impact, but other subsystems really don't want that atomic
> > > > inc and dec for every access.
> > >
> > > As I already said before, it doesn't have to be an atomic reference count.
> >
> > Please explain? It has to be atomic for all the interesting cases
> > (sure, fs mounting might be protected by a lock, but other things aren't).
>
> Let me explain it in a larger context. You and Daniel are making it really
> more complex than necessary. Only the module itself can really answer the
> question whether it's safe to unload or not.
Excuse me, Roman, but that's the central thesis of my [rfc]. If I didn't
express it concisely enough to make that obvious, I apologize.
> So all the module code
> needs to do is some general module management and ask the module somehow,
> whether it's safe to unload, when the user requests it. The easiest way
> for modules to check for this is to use counters, it's very simple and
> covers the majority of modules.
Err, check. In the [rfc].
> The few modules that don't want/can't use counters can use whatever they
> want and they usually know better how to synchronize with the part of the
> kernel where they installed their hooks into, without disturbing too much
> other parts of the kernel.
Check. In the [rfc].
> I really don't like the synchronize calls as general mechanism, either
> they have to to do too much and possibly disturb other parts without good
> reason
Check. In the [rfc].
> or modules have to take care of it (e.g. don't call somehow
> schedule() without extra protection). This makes the whole synchronize
> mechanism very fragile, which I'd prefer to keep it in the modules which
> really need it, where it can be better controlled.
Yuppers. Are there any points at all we disagree on?
--
Daniel
[email protected] said:
> This is the third in my "Understanding the Principles of
> Argumentation" series. Rusty has obligingly provided us with a fine
> example of an "ad hominem attack":
It is my understanding that the ad hominem fallacy takes the form "you are a
patronising little shit, therefore what you say cannot be true".
Rusty's comment seemed to be only that you were a patronising little shit,
and not that this proved you to be incorrect -- hence it doesn't appear to
be a particularly fine example of 'ad hominem' at all.
--
dwmw2
Hi,
On Mon, 16 Sep 2002, Daniel Phillips wrote:
> > Let me explain it in a larger context. You and Daniel are making it really
> > more complex than necessary. Only the module itself can really answer the
> > question whether it's safe to unload or not.
>
> Excuse me, Roman, but that's the central thesis of my [rfc]. If I didn't
> express it concisely enough to make that obvious, I apologize.
Sorry, but even your later mails are not very concise; I'd rather wait for a
new rfc (preferably with a patch) before commenting on it. :-)
bye, Roman
On Tuesday 17 September 2002 00:31, David Woodhouse wrote:
> [email protected] said:
> > This is the third in my "Understanding the Principles of
> > Argumentation" series. Rusty has obligingly provided us with a fine
> > example of an "ad hominem attack":
>
> It is my understanding that the ad hominem fallacy takes the form "you are a
> patronising little shit, therefore what you say cannot be true".
>
> Rusty's comment seemed to be only that you were a patronising little shit,
> and not that this proved you to be incorrect -- hence it doesn't appear to
> be a particularly fine example of 'ad hominem' at all.
You are entirely incorrect. The issue raised was "given two interfaces with
the same functionality, choose the simpler of them". Instead of addressing
that issue, the author was attacked. Perfect example, as I said.
Now, if you think this thread is too long, why did you post to it?
--
Daniel
[email protected] said:
> > It is my understanding that the ad hominem fallacy takes the form "you
> > are a patronising little shit, therefore what you say cannot be true".
> >
> > Rusty's comment seemed to be only that you were a patronising little
> > shit, and not that this proved you to be incorrect -- hence it doesn't
> > appear to be a particularly fine example of 'ad hominem' at all.
> You are entirely incorrect. The issue raised was "given two
> interfaces with the same functionality, choose the simpler of them".
> Instead of addressing that issue, the author was attacked. Perfect
> example, as I said.
No, Daniel. Do try to pay attention.
The ad hominem fallacy takes the form "You are a patronising little shit,
therefore what you say cannot be true." Rusty's mail contained only the
first half of that 'logic', and hence was not an example of ad hominem at
all.
Stating that you are a patronising little shit was a digression and was
irrelevant to the argument.
Stating that _because_ you are a patronising little shit your claims
must therefore be wrong, would be an example of the ad hominem fallacy.
But as I said -- Rusty didn't do that; he only did the former.
See http://www.nizkor.org/features/fallacies/ad-hominem.html
Note the words "Second, this attack is taken to be evidence against the
claim or argument the person in question is making (or presenting)."
--
dwmw2