Hello,
> >
> > The suggestion was made that kernel module removal be deprecated or
> > removed. I'd like to note that there are two common uses for this
> > capability, and the problems addressed by module removal should be
> > kept in mind. These are in addition to the PCMCIA issue raised.
I saw this mail flash through the reflector. It is worrying to know
that this great feature is on the discussion table for removal.
My humble two cents: kernel modules are very much appreciated
in Linux embedded development. We, in embedded development, write
kernel modules/drivers to run our hardware. The kernel module feature
allows us to segregate the hardware-specific code from the rest of
Linux, and it also
allows us to upgrade the module code without reloading Linux. This
approach is very efficient for us in embedded products.
I hope others share this view, and help keep this feature as is.
Regards,
Michael.
Michael Nguyen wrote:
> I saw this mail flash through the reflector. It is worrying to know
> that this great feature is on the discussion table for removal.
One work-around that was suggested was to allow modules to be
superseded, i.e. the old module stays forever, but a new
version can be loaded in parallel. I must say that I'm very
sceptical about this idea, as it seems likely to just mask more
severe problems.
If I remember right, the main arguments why module removal can
race with references were:
- buggy modules that don't even know themselves if they still
serve a purpose or not (solution: fix 'em)
- likewise, but with the excuse that correctness was
sacrificed on the altar of performance
- references getting copied without the module knowing. Looks
like a problem in the subsystem managing the references.
(This was discussed mainly in the context of automating
reference tracking.)
- removal happening immediately after module usage count is
decremented to zero may unload module before module has
executed "return" instruction
While I can accept the theoretical possibility that some code
may indeed not be able to afford handling the module usage
count, I kind of doubt that such conditions exist in real life.
For the removal-before-return problem, I thought a bit about it
on my return flight. It would seem to me that an "atomic"
"decrement_module_count_and_return" function would do the trick.
That function would prepare to return from its caller, then
decrement the module count, and finally do the return. That way,
no resources of the caller would be used after the module usage
counter drops to zero. Obviously, any related cleanup would have
to happen before this. Also, you couldn't call
decrement_module_count_and_return from a function that gets
called from another function in the same module.
Not sure if such a solution for removal-before-return has been
considered/rejected yet. It would seem obvious enough.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://icapeople.epfl.ch/almesber/_____________________________________/
On Mon, 1 Jul 2002 22:40:34 -0300,
Werner Almesberger <[email protected]> wrote:
>If I remember right, the main arguments why module removal can
>race with references were:
>....
> - removal happening immediately after module usage count is
> decremented to zero may unload module before module has
> executed "return" instruction
>For the removal-before-return problem, I thought a bit about it
>on my return flight. It would seem to me that an "atomic"
>"decrement_module_count_and_return" function would do the trick.
This is just one symptom of the overall problem, which is module code
that adjusts its use count by executing code that belongs to the
module. The same problem exists on entry to a module function, the
module can be removed before MOD_INC_USE_COUNT is reached.
Apart from abandoning module removal, there are only two viable fixes:
1) Do the reference counting outside the module, before it is entered.
This is why Al Viro added the owner: THIS_MODULE, line to various
structures. The problem is that it spreads like a cancer. Every
structure that contains function pointers needs an owner field.
Every bit of code that dereferences a function pointer must first
bump the owner's use count (using atomic ops) and must cope with the
owner no longer existing.
Not only does this pollute all structures that contain function
pointers, it introduces overhead on every function dereference. All
of this just to cope with the relatively low possibility that a
module will be removed.
2) Introduce a delay after unregistering a module's services and before
removing the code from memory.
This puts all the penalty and complexity where it should be, in the
unload path. However it requires a two stage rmmod process (check
use count, unregister, delay, recheck use count, remove if safe)
so all module cleanup routines need to be split into unregister and
final remove routines.
This is relatively easy to do without preemption, it is
significantly harder with preempt. There are also unsolved problems
with long running device commands with callbacks (e.g. CD-R fixate)
and with kernel threads started from a module (must wait until
zombies have been reaped).
Rusty and I agree that option (2) is the only sane way to do module
unload, assuming that we retain module unloading. First decide if the
extra work is justified.
Keith Owens wrote:
> This is just one symptom of the overall problem, which is module code
> that adjusts its use count by executing code that belongs to the
> module. The same problem exists on entry to a module function, the
> module can be removed before MOD_INC_USE_COUNT is reached.
Ah yes, now I remember, thanks. I filed that under "improper reference
tracking". After all, why would anybody hold an uncounted reference in
the first place ?
I can understand that the exact number of references may be unknown,
e.g. if I pass a reference to some registration function, which may in
turn hand it to third parties, but why wouldn't I know that there is
at least one reference ?
If some other module B hands out uncounted references on behalf of
some module A, it would seem natural for B to make sure that it
collects them before getting unloaded (and thereby releasing A).
> 1) Do the reference counting outside the module, before it is entered.
Evil, agreed ;-)
- Werner
On Tue, 2 Jul 2002 00:11:52 -0300,
Werner Almesberger <[email protected]> wrote:
>Keith Owens wrote:
>> This is just one symptom of the overall problem, which is module code
>> that adjusts its use count by executing code that belongs to the
>> module. The same problem exists on entry to a module function, the
>> module can be removed before MOD_INC_USE_COUNT is reached.
>
>Ah yes, now I remember, thanks. I filed that under "improper reference
>tracking". After all, why would anybody hold an uncounted reference in
>the first place ?
All functions passed to registration routines by modules are uncounted
references. A module is loaded, registers its operations, and returns
from its init routine. At that point its use count is 0, even
though there are references to the module from tables outside the
module.
When the open routine (or its equivalent) is called, then the use count
is incremented from within the module. The executing code between
if (ops->open)
ops->open();
and MOD_INC_USE_COUNT in the module's open routine is racy; there is no
lock that prevents the module from being removed while the start of the
open routine is being executed.
Incrementing the use count at registration time is no good; it stops
the module from ever being unloaded. Operations are deregistered at
rmmod time, but setting the use count at registration prevents rmmod
from removing the module, so you can never deregister the operations.
Catch-22.
Module unload is not racy on UP without preempt. It is racy on SMP or
with preempt. It used to be safe on SMP because almost everything was
under the BKL, but that protection no longer exists.
Keith Owens wrote:
> Incrementing the use count at registration time is no good, it stops
> the module being unloaded. Operations are deregistered at rmmod time.
> Setting the use count at registration prevents rmmod from removing the
> module, so you cannot deregister the operations. Catch 22.
But those references go through the module exit function, which
acts like an implicit reference counter. So as long as
- module exit de-registers all of them (if it doesn't, we're
screwed anyhow), and
- the registry itself isn't racy (if it is, this is likely to
surface in other circumstances too, e.g. if a driver destroys
internal state immediately after de-registration)
they should be safe, shouldn't they ?
- Werner
P.S. mail.ocs.com.au thinks I'm a spammer :-(
Keith Owens wrote:
> On Mon, 1 Jul 2002 22:40:34 -0300,
> Werner Almesberger <[email protected]> wrote:
>
>>If I remember right, the main arguments why module removal can
>>race with references were:
>>....
>>- removal happening immediately after module usage count is
>> decremented to zero may unload module before module has
>> executed "return" instruction
>>For the removal-before-return problem, I thought a bit about it
>>on my return flight. It would seem to me that an "atomic"
>>"decrement_module_count_and_return" function would do the trick.
>
>
> This is just one symptom of the overall problem, which is module code
> that adjusts its use count by executing code that belongs to the
> module. The same problem exists on entry to a module function, the
> module can be removed before MOD_INC_USE_COUNT is reached.
>
> Apart from abandoning module removal, there are only two viable fixes:
>
> 1) Do the reference counting outside the module, before it is entered.
>
> This is why Al Viro added the owner: THIS_MODULE, line to various
> structures. The problem is that it spreads like a cancer. Every
> structure that contains function pointers needs an owner field.
> Every bit of code that dereferences a function pointer must first
> bump the owner's use count (using atomic ops) and must cope with the
> owner no longer existing.
>
> Not only does this pollute all structures that contain function
> pointers, it introduces overhead on every function dereference. All
> of this just to cope with the relatively low possibility that a
> module will be removed.
Only "first use" (i.e. ->open) functions need guarding against unloads.
Any subsequent functions are guaranteed to have a reference to the
module, and don't need to bother with the refcount. I have a few ideas
to optimize the refcounting better than it is now.
> 2) Introduce a delay after unregistering a module's services and before
> removing the code from memory.
>
> This puts all the penalty and complexity where it should be, in the
> unload path. However it requires a two stage rmmod process (check
> use count, unregister, delay, recheck use count, remove if safe)
> so all module cleanup routines need to be split into unregister and
> final remove routines.
>
> This is relatively easy to do without preemption, it is
> significantly harder with preempt. There are also unsolved problems
> with long running device commands with callbacks (e.g. CD-R fixate)
> and with kernel threads started from a module (must wait until
> zombies have been reaped).
The callbacks should hold references that would not allow the module to
unload. Other than that, this is the same problem the RCU folks are
working on.
> Rusty and I agree that option (2) is the only sane way to do module
> unload, assuming that we retain module unloading. First decide if the
> extra work is justified.
Freeing up the limited vmalloc address space should be justification enough.
--
Brian Gerst
On Tue, 02 Jul 2002 00:08:55 -0400,
Brian Gerst <[email protected]> wrote:
>Keith Owens wrote:
>> 1) Do the reference counting outside the module, before it is entered.
>> Not only does this pollute all structures that contain function
>> pointers, it introduces overhead on every function dereference. All
>> of this just to cope with the relatively low possibility that a
>> module will be removed.
>
>Only "first use" (i.e. ->open) functions need guarding against unloads.
>Any subsequent functions are guaranteed to have a reference to the
>module, and don't need to bother with the refcount. I have a few ideas
>to optimize the refcounting better than it is now.
Also the close routine, otherwise there is a window where the use count
is 0 but code is still executing in the module.
Network operations such as SIOCGIFHWADDR take an interface name and do
not call any 'open' routine. The only lock I can see around dev_ifsioc
is dev_base_lock, and AFAICT that will not protect against a module
being unloaded while SIOCGIFHWADDR is running. Even if dev_base_lock
does protect against module unload, it is far from obvious that it does.
For netfilter, the use count reflects the number of packets being
processed. Complex and potentially high overhead.
All of this requires that the module information be passed in multiple
structures and assumes that all code is careful about reference
counting the code it is about to execute. There has to be a better
way!
Keith Owens wrote:
> For netfilter, the use count reflects the number of packets being
> processed. Complex and potentially high overhead.
Good example - netfilter may access a huge number of tiny
modules when working on a packet. While this by itself is a
performance issue, the need to keep reference counts right
doesn't necessarily help.
> All of this requires that the module information be passed in multiple
> structures and assumes that all code is careful about reference
> counting the code it is about to execute.
It's not really just the module information. If I can, say, get
callbacks from something even after I unregister, I may well
have destroyed the data I need to process the callbacks, and
oops or worse.
> There has to be a better way!
Well yes, there are other approaches that make sure something is
not used, e.g.
If you can a) make sure no new references are generated, and b)
send a sequential marker through the subsystem, you can be sure
that all references are gone when the marker re-emerges. (Or use
multiple markers, e.g. one per CPU.)
Likewise, if you can disable restarting of a subsystem, and then
wait until the subsystem is idle, you're safe. E.g. tasklet_disable
works like this.
- Werner
Keith Owens wrote:
>
> On Mon, 1 Jul 2002 22:40:34 -0300,
> Werner Almesberger <[email protected]> wrote:
> >If I remember right, the main arguments why module removal can
> >race with references were:
> >....
> > - removal happening immediately after module usage count is
> > decremented to zero may unload module before module has
> > executed "return" instruction
> >For the removal-before-return problem, I thought a bit about it
> >on my return flight. It would seem to me that an "atomic"
> >"decrement_module_count_and_return" function would do the trick.
>
> This is just one symptom of the overall problem, which is module code
> that adjusts its use count by executing code that belongs to the
> module. The same problem exists on entry to a module function, the
> module can be removed before MOD_INC_USE_COUNT is reached.
>
> Apart from abandoning module removal, there are only two viable fixes:
>
> 1) Do the reference counting outside the module, before it is entered.
>
> This is why Al Viro added the owner: THIS_MODULE, line to various
> structures. The problem is that it spreads like a cancer.
How about letting the module handle the reference counting but
with one key difference from today's arrangement:
The _code_ for inc/decrementing is kept outside the modules, in
the permanent part of the kernel. So the module simply
does something like decrement_usecount(&my_use_count)
decrement_usecount will return to the module _if_ the count
didn't reach 0. If it does, it goes straight to whatever
usually happens after the module returns. (I.e. pop the
return address that goes to the module, and return to where
the module was supposed to return.)
The module gets an additional restriction: decrement_usecount
may not return so all cleanup etc. must be done first, so
only a return remains when it is called.
The "unsafe return" after decrementing to 0 is eliminated.
Helge Hafting
Helge Hafting wrote:
> decrement_usecount will return to the module _if_ the count
> didn't reach 0. If it does, it goes straight to whatever
> usually happens after the module returns.
This looks appealing, but you still need an unconditional
"decrement_module_count_and_return", like I proposed earlier,
because multiple decrement_usecount could race with removal,
so you'd still hit removed code.
Example:
...
decrement_usecount();
return; /* unsafe */
}
Now we have two CPUs executing this, and the third one will
remove the module.
Use count   CPU 1       CPU 2       CPU 3
    2       decrement   ...         ...
    1       return      decrement   ...
    0       -           return      remove
    0       *CRASH*     ...         ...
The only case where decrement_usecount would work is if all
possible callers of decrement_usecount are serialized. I
can't think of any case where this is true, and just moving
the decrementing before the return(s) wouldn't improve anything
but perhaps the code structure, so I'm not sure this would
be worth the potential and easily overlooked problems.
- Werner
I wrote:
> It's not really just the module information. If I can, say, get
> callbacks from something even after I unregister, I may well
> have destroyed the data I need to process the callbacks, and
> oops or worse.
Actually, if module exit synchronizes properly, even the
return-after-removal case shouldn't exist, because we'd simply
wait for this call to return.
Hmm, interesting. Did I just make the whole problem go away,
or is travel fatigue playing tricks on my brain ? :-)
- Werner
>> It's not really just the module information. If I can, say, get
>> callbacks from something even after I unregister, I may well
>> have destroyed the data I need to process the callbacks, and
>> oops or worse.
>
>Actually, if module exit synchronizes properly, even the
>return-after-removal case shouldn't exist, because we'd simply
>wait for this call to return.
>
>Hmm, interesting. Did I just make the whole problem go away,
>or is travel fatigue playing tricks on my brain ? :-)
That was one of the solutions proposed by Rusty, that is basically
waiting for all CPUs to have scheduled upon exit from module_exit
and before doing the actual removal.
Ben.
Benjamin Herrenschmidt wrote:
> That was one of the solutions proposed by Rusty, that is basically
> waiting for all CPUs to have scheduled upon exit from module_exit
> and before doing the actual removal.
No, that's not what I mean. If you have a de-registration function,
it may give you one of the following assurances:
1) none (e.g. if xxx_deregister just unlinks some ops structure,
but makes no attempt to squash concurrent accesses)
2) guarantee that no further access to ops structure will occur
after xxx_deregister returns, and
2a) one can probe for cached references/execution of callbacks,
e.g.
foo_deregister(&my_ops);
while (foo_running()) schedule();
/* can destroy local state/code now */
This is a common construct in the Linux kernel. Note that
none of the module code executes after foo_running returns,
so return-after-removal can't happen.
2b) we're guaranteed that any running callbacks or such will
have acquired their own locks/counts/etc. before
xxx_deregister returns. Modules would (also) use the use
count to protect code. Not sure if this happens anywhere
in the kernel. return-after-removal would be possible
here (unless we're going through a layer of trampolines
or such). E.g.
foo_deregister(&my_ops);
while (myself_running()) schedule();
/* may still have to execute code after unlocking */
while (MOD_IN_USE) schedule();
Note that this one still protects data !
3) guarantee that no direct effects of accesses to ops structure
occur after xxx_deregister returns, e.g.
foo_deregister(&my_ops);
/* can destroy local state/code now */
This also eliminates return-after-removal, because deregister
would wait for callbacks to return.
4a/4b) Like 3, but if the callback signals completion without
returning (i.e. to indicate that it has copied all shared data
and is now working on a private copy), we're back to case 2a or
2b. In the case of modules, the usage count would be used to
protect code, so decrement_and_return would be sufficient here.
I think these are the most common cases. The point is that only
case 1) leaves you completely in the dark, and only cases 2b and
4b are special when it comes to modules. Cases 2a, 3, and 4a are
always safe.
Moreover, case 1, and cases 2a, 2b, 4a, and 4b without
synchronizing after de-registration, can also cause problems in
non-module code, e.g.
bar_callback(void *my_data)
{
**(int **) my_data = 1;
}
...
int whatever = 0, *ptr = &whatever;
...
foo_register(&bar_ops,&ptr);
...
foo_deregister(&bar_ops); /* case 1, 2a, 2b, 4a, or 4b */
ptr = NULL;
/* BUG: bar_callback may still be running ! */
...
Such code would only be data-safe (but not code-safe) if all shared
data is static, e.g. if the callback writes to some hardware
register, and the address is either static or constant.
AFAIK, most de-registration functions should be in case 2a, 3, or
4a. Well, we probably have a bunch of cases 1, too :-)
Now, what does this all boil down to ? Cases 2a, 3, and 4a are
non-issues. Cases 2b and 4b still need decrement_and_return, so I
was a bit too optimistic in assuming we're totally safe. Case 1 is
pathological also for non-modules, unless a few unlikely
constraints are met.
- Werner
On Tue, 2 Jul 2002, Keith Owens wrote:
> Incrementing the use count at registration time is no good, it stops
> the module being unloaded. Operations are deregistered at rmmod time.
> Setting the use count at registration prevents rmmod from removing the
> module, so you cannot deregister the operations. Catch 22.
Right. So you should use try_inc_mod_count() before dereferencing the
registered pointers and treat failure as "not found".
> Module unload is not racy on UP without preempt. It is racy on SMP or
> with preempt. It used to be safe on SMP because almost everything was
> under the BKL, but that protection no longer exists.
<wry> You forgot "as long as nothing blocks". Which is more than brittle -
code changes routinely violate such warranties. </wry>
[Apologies for over-the-head reply]
> Keith Owens wrote:
> > 1) Do the reference counting outside the module, before it is entered.
> >
> > This is why Al Viro added the owner: THIS_MODULE, line to various
> > structures. The problem is that it spreads like a cancer. Every
> > structure that contains function pointers needs an owner field.
> > Every bit of code that dereferences a function pointer must first
> > bump the owner's use count (using atomic ops) and must cope with the
> > owner no longer existing.
a) How many places of that kind do we have?
b) How many structures have an 'owner' field? How many instances of
these structures exist, and how large is each of them?
> > Not only does this pollute all structures that contain function
> > pointers, it introduces overhead on every function dereference. All
Not true. As a matter of fact, for all such structures we have
activation points - open/mount/exec/etc. All subsequent uses are
protected by that - e.g. if ->write() is called outside of ->open()/->release()
range we are fucked regardless of modules, since ->write() can refer to the
structures created by ->open() and torn down by ->release().
I would really like to see hard data - either from you or from
Rusty - regarding the overhead you are talking about. As it is, the
only major class of structures that has ->owner is the file_operations
of character devices. Everything else is a red herring - too few instances
to talk about (e.g. each binfmt type compiled into the kernel wastes a
word - all 4 or 5 of them).
And with file_operations we are talking about a word in a structure
that currently consists of 18 words. And has 200-300 instances in the
kernel with every filesystem and every driver compiled in. In real-world
cases you'll have ~40 of these. Pardon me if I don't take the talk about
overhead seriously in this case. Especially since your quiescence code
will take more than a kilobyte.
Folks, how about you give some numbers - "spreads like a cancer"
would be great on a talk show, but I'd expect something better on l-k...
On Tue, 2 Jul 2002, Keith Owens wrote:
> For netfilter, the use count reflects the number of packets being
> processed. Complex and potentially high overhead.
<blink>
_Why_???
If some rule in your chains needs a module foo, you probably want foo
to stay around until that rule is removed. Even if there's no traffic
whatsoever.
So what's the problem with making use count equal to number of rules
referencing the module? You need exclusion between chain changes and
chain traversing anyway...
> All of this requires that the module information be passed in multiple
> structures and assumes that all code is careful about reference
> counting the code it is about to execute. There has to be a better
> way!
You know, I'd rather trust core code to do something right than expect
all drivers to follow any non-trivial rules (and "you should not block
in <areas>" _is_ non-trivial enough).
No comments on the network devices - I'll need to read through the code
to answer that one...
Hi!
> >> It's not really just the module information. If I can, say, get
> >> callbacks from something even after I unregister, I may well
> >> have destroyed the data I need to process the callbacks, and
> >> oops or worse.
> >
> >Actually, if module exit synchronizes properly, even the
> >return-after-removal case shouldn't exist, because we'd simply
> >wait for this call to return.
> >
> >Hmm, interesting. Did I just make the whole problem go away,
> >or is travel fatigue playing tricks on my brain ? :-)
>
> That was one of the solutions proposed by Rusty, that is basically
> waiting for all CPUs to have scheduled upon exit from module_exit
> and before doing the actual removal.
That seems reasonable... lighter version of freeze_processes() I was
thinking about.
Pavel
--
Worst form of spam? Adding advertisment signatures ala sourceforge.net.
What goes next? Inserting advertisment *into* email?
On Tue, 2 Jul 2002, Werner Almesberger wrote:
> Actually, if module exit synchronizes properly, even the
> return-after-removal case shouldn't exist, because we'd simply
> wait for this call to return.
>
> Hmm, interesting. Did I just make the whole problem go away,
> or is travel fatigue playing tricks on my brain ? :-)
You redefined it in what might be a useful way...
My question is, if the race condition involves use of a module after it's
removed, is removing it but leaving it in memory going to be any better?
Will a driver for an unregistered device be more likely to do the right
thing just because it's still in memory? Isn't the right thing to make
everything stop using the module before ending it, for any definition of
ending? I certainly can't define what the right thing to do would be
otherwise, at least in any general sense.
It seems slightly like that tree falling in the forest, and no one to hear
it. Much easier to handle removal right than service requests after close.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Bill Davidsen wrote:
> Isn't the right thing to make everything stop using the module before
> ending it, for any definition of ending?
This certainly seems to be the most understandable way, yes. I think
modules basically follow the principle illustrated below (that's case
4b in the taxonomy I posted earlier), where deregistration doesn't
stop everything, but which should be safe except for
return-after-removal:
foo_dispatch(...)
{
read_lock(global_foo_lock);
...
xxx = find_stuff();
...
xxx->whatever_op(xxx->data);
}
foo_deregister(...)
{
write_lock(global_foo_lock);
...
write_unlock(global_foo_lock);
}
bar_whatever_op(my_data)
{
lock(my_data); /* think MOD_INC_USE_COUNT */
...
read_unlock(global_foo_lock);
...
unlock(my_data); /* think MOD_DEC_USE_COUNT */
/* return-after-removal race if my_data also protects code,
not only data ! */
}
bar_main(...)
{
foo_register(&bar_ops,&bar_data);
...
foo_deregister(&bar_ops);
...
lock(bar_data); /* barrier */
unlock(bar_data);
...
destroy(&bar_data);
}
/* I'm a lazy bastard for not validating this with something like
Spin, so there may be races left. */
Now there are quite a few things you can omit, without ever noticing
problems. E.g. the read_unlock from bar_whatever_op could be before
the call to whatever_op in foo_dispatch, and all you get is a tiny
bar_whatever_op vs. destroy race. Likewise, moving the barrier in
bar_main may go unnoticed for a long time. If bar_data is never
destroyed, or synchronized by some other means, most of the locking
is actually superfluous, so in many situations, using this model
without strictly following every single detail is actually okay, at
least as far as data races are concerned.
So I think we need to distinguish the entry-after-removal race and
the return-after-removal race, because the former may also be a data
race, while the latter is typically only a code race. (The latter
becomes a data race if you touch shared data after releasing the
lock, but this would be a fairly obvious mistake.)
The entry-after-removal race is not module specific, and may exist
as a true data race in the kernel. (Didn't search for it - I'm not
happy with the fact that I wouldn't be able to catch it in the act
anyway, so I'd rather play a bit with infrastructure first.)
The return-after-removal race could be solved by not deallocating
module memory, or - probably cheaper, but requiring either code
changes or advanced gcc wizardry - by using decrement_and_return.
By the way, there are cases where MOD_DEC_USE_COUNT is followed by
code other than a direct return, e.g. (arbitrary example) in
drivers/net/aironet4500_core.c
This is correct if we can be sure that the use count never reaches
0 here, but then the whole inc/dec exercise is probably redundant.
("probably" as in "it doesn't have to be, but I'd be surprised if
it isn't"; I'll give an example in a later posting.)
However, if this MOD_DEC_USE_COUNT may ever decrement the count to
zero, we have something worse than the usual return-after-removal
race. This may even be a data race, so such code is suspicious in
any case.
> It seems slightly like that tree falling in the forest, and no one to hear
> it. Much easier to handle removal right than service requests after close.
I can understand why people don't want to use totally "safe"
deregistration, e.g.
- locking gets more complex and you may run into hairy deadlock
scenarios
- accomplishing timely removal may become more difficult
- nobody likes patches that change the world
So the entry/return-after-removal issues may still need to be
resolved. I'm mainly worried about entry-after-removal, because
it seems to me that this is likely to imply data races in
non-modules too, so try_inc_mod_count is neither sufficient (for it
doesn't help with non-modules) nor required (because fixing the
race would also eliminate entry-after-removal).
I've seen very few responses to my analysis. Does everybody think
I'm nuts (quite possible :-), or is this worth continuing ?
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://icapeople.epfl.ch/almesber/_____________________________________/
I wrote:
> This is correct if we can be sure that the use count never reaches
> 0 here, but then the whole inc/dec exercise is probably redundant.
> ("probably" as in "it doesn't have to be, but I'd be surprised if
> it isn't"; I'll give an example in a later posting.)
Okay, here's an almost correct example (except for the usual
return-after-removal, plus an unlikely data race described
below). foo_1 runs first:
foo_1(...)
{
    MOD_INC_USE_COUNT;
    ...
    initiate_asynchronous_execution_of(foo_2);
    rendezvous(foo_a);
    /* wait until foo_2 has incremented the use count */
    ...
    MOD_DEC_USE_COUNT;
    ...
    rendezvous(foo_b); /* release foo_2 */
    /* cool return-after-removal race */
}

foo_2(...)
{
    MOD_INC_USE_COUNT;
    rendezvous(foo_a);
    rendezvous(foo_b);
    MOD_DEC_USE_COUNT;
}
(The pseudo-primitive rendezvous(X) stops execution until all
"threads" have reached that point, then they all continue.)
I think the easiest solution is to simply declare such constructs
illegal.
Note that I'm using "return" loosely - whatever is used to implement
rendezvous() may execute instructions after unblocking, which would
race with removal too. Also note that in this case, we may, at least
theoretically, data race with removal if accessing foo_b after
unblocking foo_2.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://icapeople.epfl.ch/almesber/_____________________________________/
On Thu, 4 Jul 2002, Werner Almesberger wrote:
> Bill Davidsen wrote:
> > Isn't the right thing to make everything stop using the module before
> > ending it, for any definition of ending?
>
> This certainly seems to be the most understandable way, yes. I think
> modules follow basically the principle illustrated below (that's case
> 4b in the taxonomy I posted earlier), where deregistration doesn't
> stop everything, but which should be safe except for
> return-after-removal:
There seems to be no right thing to do when a module gets a service request for
something which is not active. Anything at that point would be likely to
corrupt data structures, oops, or leave a process in some unkillable
state.
> I can understand why people don't want to use totally "safe"
> deregistration, e.g.
>
> - locking gets more complex and you may run into hairy deadlock
> scenarios
> - accomplishing timely removal may become more difficult
> - nobody likes patches that change the world
Everybody loves them in development kernels ;-) Actually, I think this is
unlikely to result in more than a hang if it fails one way, or no worse
than what we have if it fails the other.
And if I understand the problem, it is only needed when either smp or
preempt are built in, so there would be no overhead for small systems, one
other possible objection to doing it safely.
> So the entry/return-after-removal issues may still need to be
> resolved. I'm mainly worried about entry-after-removal, because
> it seems to me that this is likely to imply data races in
> non-modules too, so try_inc_mod_count is neither sufficient (for it
> doesn't help with non-modules) nor required (because fixing the
> race would also eliminate entry-after-removal).
>
> I've seen very few responses to my analysis. Does everybody think
> I'm nuts (quite possible :-), or is this worth continuing ?
My read is that eliminating actual unloading of modules won't solve these
issues, it would just change them. Therefore if you're having a good time
looking at this I think it would make the kernel more stable, which will
be a goal of 2.6.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Werner Almesberger wrote:
> Okay, here's an almost correct example (except for the usual
> return-after-removal, plus an unlikely data race described
> below). foo_1 runs first:
This can be fixed if we assume a way to ask "is any CPU still executing
module code?".
To do this, have the `free_module' function use `smp_call_function' to
ask every CPU "are you executing code for module XXX?". The question is
answered by a routine which walks the stack, checking the instruction
pointer at each place on the stack to see whether it's inside the module
of interest.
Yes this is complex, but it's not that complex -- provided you can rely
on stack walking to find the module. (It wouldn't work if x86 used
-fomit-frame-pointer, for example).
So module removal works like this:
int free_module(...)
{
    /* ...stuff... */
    if (module->uc.usecount != 0)
        return -EAGAIN;
    if (smp_call_function(check_if_in_module, module, 1, 1))
        return -EAGAIN;
    /* remove module. */
}
Admittedly, this _only_ deals with this particular class of race
conditions.
Another possibility would be the RCU thing: execute the module's exit
function, but keep the module's memory allocated until some safe
scheduling point later, when you are sure that no CPU can possibly be
running the module.
These don't help at all with races against parallel opens -- food for
thought, though.
-- Jamie
Hi,
> To do this, have the `free_module' function use `smp_call_function' to
> ask every CPU "are you executing code for module XXX?". The question is
> answered by a routine which walks the stack, checking the instruction
> pointer at each place on the stack to see whether it's inside the module
> of interest.
>
> Yes this is complex, but it's not that complex -- provided you can rely
> on stack walking to find the module. (It wouldn't work if x86 used
> -fomit-frame-pointer, for example).
How do you find CPUs that are about to execute module code ?
IMHO you need to do this freeze trick before you check the module
usage count.
[..]
> Another possibility would be the RCU thing: execute the module's exit
> function, but keep the module's memory allocated until some safe
> scheduling point later, when you are sure that no CPU can possibly be
> running the module.
But what do you do if that CPU increases the module usage count?
Regards
Oliver
On Sun, 7 Jul 2002, Jamie Lokier wrote:
> Werner Almesberger wrote:
> > Okay, here's an almost correct example (except for the usual
> > return-after-removal, plus an unlikely data race described
> > below). foo_1 runs first:
>
> This can be fixed if we assume a way to ask "is any CPU still executing
> module code?".
>
> To do this, have the `free_module' function use `smp_call_function' to
> ask every CPU "are you executing code for module XXX?". The question is
> answered by a routine which walks the stack, checking the instruction
> pointer at each place on the stack to see whether it's inside the module
> of interest.
>
> Yes this is complex, but it's not that complex -- provided you can rely
> on stack walking to find the module. (It wouldn't work if x86 used
> -fomit-frame-pointer, for example).
Complex, may not be reliable, and the question is "are you now or will you
ever run code in or for module XXX" because there may be pointers for
threads, interrupt handlers, etc. Gets ugly doing it from the back.
I think you have to do it with the use count, and there may well be
modules you can't remove safely. But to add re-init code to modules,
define new ioctls to call it, etc, etc, doesn't seem satisfactory. I think
we really need to bump the use counter more carefully, to really know when
a module is in use, and when we can clear it out.
The smp case looks doable, the preempt case may be harder. I really like
the idea of simply queueing a remove and then doing it when the use count
drops to zero. But we have to provide a "don't use" state to keep new
things out. That's hard.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Oliver Neukum wrote:
> How do you find CPU's that are about to execute module code ?
>
> IMHO you need to do this freeze trick before you check the module
> usage count.
>
> [..]
> > Another possibility would be the RCU thing: execute the module's exit
> > function, but keep the module's memory allocated until some safe
> > scheduling point later, when you are sure that no CPU can possibly be
> > running the module.
>
> But what do you do if that CPU increases the module usage count?
Those are the cases where I said this does not help.
You basically need:
(a) to catch the exiting case properly
(b) to catch entry points
Catching the entry points is what the current `try_inc_mod_count' code
does. I can't think of another way to do that.
-- Jamie
Bill Davidsen wrote:
> I think you have to do it with the use count, and there may well be
> modules you can't remove safely.
I agree, this is the correct and clean thing to do.
It rather implies that any function in a module which calls
MOD_{INC,DEC}_USE_COUNT should always be called from a non-module
function which _itself_ protects the module from removal by temporarily
bumping the use count.
> But to add re-init code to modules,
> define new ioctls to call it, etc, etc, doesn't seem satisfactory. I think
> we really need to bump the use counter more carefully, to really know when
> a module is in use, and when we can clear it out.
>
> The smp case looks doable, the preempt case may be harder. I really like
> the idea of simply queueing a remove and then doing it when the use count
> drops to zero. But we have have to provide a "don't use" to keep new
> things out. That's hard.
That's already done in `try_inc_mod_count', it's just a bit slow. But
arguably its only impact is when you're getting a handle or creating an
object, which is usually relatively slow anyway, such as opening a
device or socket, or adding a firewall rule.
The more I think about it, the more I think the `owner' thing in
file_operations et al. is the right thing in all cases, and that Al Viro
is right about the overhead being reasonable. Perhaps the interface
could be made a little more generic (`{get,put}_module_reference'?), and
double check the corner cases such as when a module is in "don't use"
mode, blocking and scheduling a reload.
-- Jamie
On Monday 08 July 2002 02:46, Jamie Lokier wrote:
> Bill Davidsen wrote:
> > I think you have to do it with the use count, and there may well be
> > modules you can't remove safely.
>
> I agree, this is the correct and clean thing to do.
>
> It rather implies that any function in a module which calls
> MOD_{INC,DEC}_USE_COUNT should always be called from a non-module
> function which _itself_ protects the module from removal by temporarily
> bumping the use count.
That's not nice. It requires the calling code to know it's calling a
module and it imposes the inc/dec overhead on callers even when the
target isn't compiled as a module.
--
Daniel
> Catching the entry points is what the current `try_inc_mod_count' code
> does. I can't think of another way to do that.
The old "2.2 rules" are safe on UP without preempt. So if you
temporarily, through the freeze hack, can introduce these conditions,
you've solved it. Except that you need to deal with a failure due to
an elevated usage count.
IMHO you gain very little by finding partial solutions which won't
help you in the hard cases. Module unload is very rare. It just needs
to work. You need to optimise use of modules, not the unload.
Strictly speaking, by having owner fields you punish drivers compiled
statically. The effect is small and not worth doing something about,
but it should show the direction.
Regards
Oliver
On Mon, 8 Jul 2002, Daniel Phillips wrote:
> That's not nice. It requires the calling code to know it's calling a
> module and it imposes the inc/dec overhead on callers even when the
> target isn't compiled as a module.
I don't think so... the module knows, there's a bit of hard code for the
module anyway, so that could do the inc/dec and know when the module was
in the "about to be removed" state. It's not clear if "it works most of
the time now" means we just need to address some simple corner cases or if
the whole design of modules needs to be reimplemented to really sweep out
the corners, and they're not "simple" at all.
This has the feeling that someone could say "we could just..." at any
time, and the whole problem would vanish.
--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.
Jamie Lokier wrote:
> The more I think about it, the more I think the `owner' thing in
> file_operations et al. is the right thing in all cases,
What troubles me most in this discussion is that nobody seems to
think of how these issues affect non-modules.
Entry-after-removal is an anomaly in the subsystem making that call.
Fixing it for modules is fine as long as only modules will ever
un-register, but what do you do if non-module code ever wants to
un-register too ? (E.g. think hot-plugging or software devices.)
Invent some pseudo-module framework for it ? Require all code using
that subsystem to be a module ? What happens if there are multiple
registrations per chunk of code ?
Besides, a common pattern for those "hold this down, kick that, then
slowly count to ten, and now it's safe to release" solutions seems
to be that it takes roughly ten times as long to find the race
condition buried deep within them, than it took for the respective
predecessor.
I'm not sure the incremental band aid approach works in this case.
I'm sorry that I don't have to offer anything more constructive at
the moment, but I first want to build some testing framework. One
oops says more than ten pages of analysis explaining a race
condition :-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://icapeople.epfl.ch/almesber/_____________________________________/
On Mon, 8 Jul 2002 21:07:13 -0300,
Werner Almesberger <[email protected]> wrote:
>What troubles me most in this discussion is that nobody seems to
>think of how these issues affect non-modules.
>
>Entry-after-removal is an anomaly in the subsystem making that call.
>Fixing it for modules is fine as long as only modules will ever
>un-register, but what do you do if non-module code ever wants to
>un-register too ? (E.g. think hot-plugging or software devices.)
The only difference between module and non-module is whether the code
and data areas are freed or left in the kernel for the next user. All
other processing should be the same for hotplug. This is why
CONFIG_HOTPLUG retains the __exit code in the kernel instead of
discarding it at link time.
>Besides, a common pattern for those "hold this down, kick that, then
>slowly count to ten, and now it's safe to release" solutions seems
>to be that it takes roughly ten times as long to find the race
>condition buried deep within them, than it took for the respective
>predecessor.
I don't like 'wait for n seconds then release' methods, as you say they
just hide the problem. Neither Rusty nor I are suggesting that
approach. Rusty's idea is
Check use count.
Unregister.
Ensure that all code that had a reference to the module has either
dropped that reference or bumped the use count.
Check use count again, it may have been bumped above.
If use count is still 0, there are no stale references left. It is
now safe to unload, call the module unallocate routine then free its
area.
If use count != 0, either schedule for removal when the count goes to
0 or reregister. I prefer reregister, Rusty prefers delayed removal.
Ensuring that all code has either dropped stale references or bumped
the use count means that all processes that were not sleeping when
rmmod was started must proceed to a sync point. It is (and always has
been) illegal to sleep while using module code/data (note: using, not
referencing).
On Thu, Jul 04, 2002 at 03:29:29AM -0300, Werner Almesberger wrote:
> This certainly seems to be the most understandable way, yes. I think
> modules follow basically the principle illustrated below (that's case
> 4b in the taxonomy I posted earlier), where deregistration doesn't
> stop everything, but which should be safe except for
> return-after-removal:
>
> foo_dispatch(...)
> {
> read_lock(global_foo_lock);
> ...
> xxx = find_stuff();
> ...
> xxx->whatever_op(xxx->data);
> }
>
> foo_deregister(...)
> {
> write_lock(global_foo_lock);
> ...
> write_unlock(global_foo_lock);
> }
>
> bar_whatever_op(my_data)
> {
> lock(my_data); /* think MOD_INC_USE_COUNT */
> ...
> read_unlock(global_foo_lock);
> ...
> unlock(my_data); /* think MOD_DEC_USE_COUNT */
> /* return-after-removal race if my_data also protects code,
> not only data ! */
> }
>
> bar_main(...)
> {
> foo_register(&bar_ops,&bar_data);
> ...
> foo_deregister(&bar_ops);
> ...
> lock(bar_data); /* barrier */
> unlock(bar_data);
> ...
> destroy(&bar_data);
> }
Hi Werner,
I think the above works, but the locking is a bit hairy. How about the
following? (Taxonomy 3)
foo_dispatch(...)
{
    spin_lock(global_foo_lock);
    xxx = find_stuff(); // atomic_inc(xxx->refcnt)
    spin_unlock(global_foo_lock);
    ...
    xxx->whatever_op(xxx->data);
    ...
    put_stuff(xxx); // atomic_dec(xxx->refcnt)
}

foo_deregister(...)
{
    spin_lock(global_foo_lock);
    remove_stuff(xxx);
    spin_unlock(global_foo_lock);
    while (atomic_read(xxx->refcnt))
        schedule();
}

bar_whatever_op(my_data)
{
    /* no return-after-removal race because bar is pinned. */
}

bar_main(...)
{
    foo_register(&bar_ops, &bar_data);
    ...
    foo_deregister(&bar_ops);
    /* Now guaranteed that no references to bar_ops exist, and guaranteed
       that no concurrent accesses exist. */
    destroy(&bar_data);
}
I get the feeling you already had something like this in mind, but your
example above was rather complicated. The latter implementation is just
standard reference counting.
> [...]
> I can understand why people don't want to use totally "safe"
> deregistration, e.g.
>
> - locking gets more complex and you may run into hairy deadlock
> scenarios
> - accomplishing timely removal may become more difficult
> - nobody likes patches that change the world
With a standard reference counting implementation I don't see any of these
issues being a problem.
Did I miss something?
-Kevin
--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------
Kevin O'Connor wrote:
> I think the above works, but the locking is a bit hairy. How about the
> following. (Taxonomy 3)
Looks like an efficient and almost correct implementation of this,
yes.
One bug: you should use yield() instead of schedule() (2.5) or do
current->policy |= SCHED_YIELD;
before calling schedule (2.4). Or play it extra-safe and use a
traditional wait queue, even though this has more overhead.
> I get the feeling you already had something like this in mind, but your
> example above was rather complicated. The latter implementation is just
> standard reference counting.
Yup, my example was simply an approximation of the current module
code, with the obvious races fixed.
> With a standard reference counting implementation I don't see any of these
> issues being a problem..
Amazing, isn't it ? :-)
> Did I miss something?
I see two possible reasons why somebody may not want this solution:
- those wishing to squeeze the very last bit of performance out of
this may find holding the spin lock during the call a better
compromise than the reference counting
- in some cases, there could be a lock that's being held by the
caller of foo_deregister and that's also needed by
bar_whatever_op. In this case, you have to drop the lock while
waiting (or re-think the locking design).
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://icapeople.epfl.ch/almesber/_____________________________________/
Keith Owens wrote:
> The only difference between module and non-module is whether the code
> and data areas are freed or left in the kernel for the next user.
Yes, that's exactly why the entry-after-removal problem worries me:
this is something that also means trouble if you're a non-module.
Just remember the "timer pops after del_timer" problem a while back.
De-registration that leaves stuff behind is extremely dangerous, no
matter if you're a module or not.
I fully agree that return-after-removal is a different issue and
that there are several perfectly reasonable ways for dealing with
this.
> Ensuring that all code has either dropped stale references or bumped
> the use count means that all processes that were not sleeping when
> rmmod was started must proceed to a sync point. It is (and always has
> been) illegal to sleep while using module code/data (note: using, not
> referencing).
That's the problem with entry-after-removal: you may simply not be
able to know when all references are gone.
Since basically all references to modules go through some sort of
registration function, why not fix the registration functions ?
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://icapeople.epfl.ch/almesber/_____________________________________/
Hi,
On Tue, 9 Jul 2002, Keith Owens wrote:
> Check use count.
> Unregister.
> Ensure that all code that had a reference to the module has either
> dropped that reference or bumped the use count.
> Check use count again, it may have been bumped above.
> If use count is still 0, there are no stale references left. It is
> now safe to unload, call the module unallocate routine then free its
> area.
> If use count != 0, either schedule for removal when the count goes to
> 0 or reregister. I prefer reregister, Rusty prefers delayed removal.
>
> Ensuring that all code has either dropped stale references or bumped
> the use count means that all processes that were not sleeping when
> rmmod was started must proceed to a sync point. It is (and always has
> been) illegal to sleep while using module code/data (note: using, not
> referencing).
Who is changing the use count? How do you ensure that someone doesn't
cache a reference past that magic sync point?
IMO the real problem is that cleanup_module() can't fail. If modules
already do proper data accounting, it's not too difficult to detect if
there still exists a reference. In the driverfs example the problem is
that remove_driver() returns void instead of int, so it can't fail.
bye, Roman
On Tue, 9 Jul 2002, Roman Zippel wrote:
> Who is changing the use count? How do you ensure that someone doesn't
> cache a reference past that magic sync point?
> IMO the real problem is that cleanup_module() can't fail. If modules
> already do proper data accounting, it's not too difficult to detect if
> there still exists a reference. In the driverfs example the problem is
> that remove_driver() returns void instead of int, so it can't fail.
I tend to see this differently: cleanup_module() cannot fail, but it can
sleep. So it's perfectly fine to deregister, wait until all references are
gone, clean up and return. So a kind of two-stage unregister is already
happening.
It's different in that it does use explicit refcounting, but
when the right interfaces are provided, the driver author doesn't need to
care - the author should just call pci_unregister/netdev_unregister/..,
that'll sleep until all references are gone (which also means no one will
use callbacks into the module anymore) and be done.
--Kai
Hi,
On Tue, 9 Jul 2002, Kai Germaschewski wrote:
> I tend to see this differently: cleanup_module() cannot fail, but it can
> sleep. So it's perfectly fine to deregister, wait until all references are
> gone, clean up and return. So a kind of two-stage unregister is already
> happening.
That's a possibility, if you can live with a noninterruptible rmmod
process sleeping for a very long time...
> It's different in that it does use explicit refcounting, but
> when the right interfaces are provided, the driver author doesn't need to
> care - the author should just call pci_unregister/netdev_unregister/..,
> that'll sleep until all references are gone (which also means no one will
> use callbacks into the module anymore) and be done.
The unregister function can't prevent someone from starting to use the device
again (at least not with reasonable effort), but it can detect this. The
author should just check the return value, like he already (hopefully)
does during initialization.
bye, Roman
Keith Owens wrote:
> All functions passed to registration routines by modules are uncounted
> references. A module is loaded, registers its operations and exits
> from the cleanup routine. At that point its use count is 0, even
though there are references to the module from tables outside the
> module.
Perhaps we should have 2 counters: a use count and a reference count. You can
start the unload process at any time that the use count is 0, but you can't
unload the module's memory until the reference count is 0. This is the
2-stage unload, formalized with counters for each stage.
You could register references from outside the module. AFAIK, you can take
any pointer or function pointer and do a lookup to find out what module it
belongs to. Thus, when you pass pointers (in a registration) to another
module, then it calls some function (register_reference, say) on each pointer
and function pointer. When it can guarantee that it is about to discard a
pointer (and so it will never use it again), then it calls
unregister_reference.
You may not have to register each pointer in an ops structure...it might be
enough to just call register_reference on the ops structure itself, since that
would hold the module in place. Doesn't work if the ops structure is
dynamically allocated memory, but then register_reference would be able to
return a failure return code "NOT_MODULE_MEMORY" or some such.
--
The Villagers are Online! villagersonline.com
.[ (the fox.(quick,brown)) jumped.over(the dog.lazy) ]
.[ (a version.of(English).(precise.more)) is(possible) ]
?[ you want.to(help(develop(it))) ]