2002-09-18 02:14:47

by Rusty Russell

[permalink] [raw]
Subject: [PATCH] In-kernel module loader 1/7

[ Sorry Ingo, an hour late. Look hard at 1/7, sched.c BTW ]

Hi all,

I've rewritten my in-kernel module loader: this version breaks
much less existing code. Basically, we go to a model of
externally-controlled module refcounts with possibility of failure
(ie. try_inc_mod_count, now called try_module_get()).

There are some rough edges, but it Works for Me(TM). Read the
descriptions for caveats.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Name: Later primitive
Author: Rusty Russell
Section: Misc
Status: Tested on 2.5.35

D: This patch implements the wait_for_later() call needed for safe
D: module unload and hotplug CPU. It is a minimal subset of RCU which
D: works with a preemptible kernel.

diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22006-linux-2.5.35/include/linux/later.h .22006-linux-2.5.35.updated/include/linux/later.h
--- .22006-linux-2.5.35/include/linux/later.h 1970-01-01 10:00:00.000000000 +1000
+++ .22006-linux-2.5.35.updated/include/linux/later.h 2002-09-17 14:00:58.000000000 +1000
@@ -0,0 +1,60 @@
+#ifndef __LINUX_LATER_H
+#define __LINUX_LATER_H
+/* Do something after system has calmed down.
+ (c) 2002 Rusty Russell, IBM Corporation.
+*/
+#include <linux/config.h>
+#include <linux/sched.h>
+#include <linux/compiler.h>
+#include <asm/atomic.h>
+
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
+struct later
+{
+ struct later *next;
+ void (*func)(void *data);
+ void *data;
+};
+
+extern void noone_running(void);
+extern atomic_t runcounts[2];
+extern int which_runcount;
+
+/* We flip between two run counters. */
+static inline atomic_t *runcount(struct task_struct *task)
+{
+ return &runcounts[!(task->flags & PF_RUNCOUNT)];
+}
+
+/* Decrement counter on entering voluntary schedule. */
+static inline void put_runcount(void)
+{
+ /* If we hit zero and it's the not the active list... */
+ if (atomic_dec_and_test(runcount(current))
+ && (current->flags & PF_RUNCOUNT) != which_runcount)
+ noone_running();
+}
+
+/* Increment counter in leaving voluntary schedule. */
+static inline void get_runcount(struct task_struct *task)
+{
+ /* Sets PF_RUNCOUNT, or not */
+ task->flags = ((task->flags & ~PF_RUNCOUNT) | which_runcount);
+ atomic_inc(runcount(task));
+}
+
+/* Queues future request: may sleep on UP if func sleeps... */
+void do_later(struct later *head, void (*func)(void *data), void *data);
+
+/* Wait until it's later. */
+void wait_for_later(void);
+
+#else /* !SMP, !PREEMPT */
+struct later { };
+#define do_later(head, func, data) (func(data))
+static inline void put_runcount(void) { }
+static inline void get_runcount(void) { }
+
+extern inline void wait_for_later(void) { }
+#endif
+#endif /* __LINUX_LATER_H */
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22006-linux-2.5.35/include/linux/sched.h .22006-linux-2.5.35.updated/include/linux/sched.h
--- .22006-linux-2.5.35/include/linux/sched.h 2002-09-16 12:43:48.000000000 +1000
+++ .22006-linux-2.5.35.updated/include/linux/sched.h 2002-09-17 14:02:56.000000000 +1000
@@ -424,6 +424,7 @@ do { if (atomic_dec_and_test(&(tsk)->usa
#define PF_IOTHREAD 0x00020000 /* this thread is needed for doing I/O to swap */
#define PF_FROZEN 0x00040000 /* frozen for system suspend */
#define PF_SYNC 0x00080000 /* performing fsync(), etc */
+#define PF_RUNCOUNT 0x00100000 /* which run counter to use */

/*
* Ptrace flags
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22006-linux-2.5.35/kernel/Makefile .22006-linux-2.5.35.updated/kernel/Makefile
--- .22006-linux-2.5.35/kernel/Makefile 2002-08-11 15:31:43.000000000 +1000
+++ .22006-linux-2.5.35.updated/kernel/Makefile 2002-09-17 14:00:58.000000000 +1000
@@ -24,6 +24,11 @@ obj-$(CONFIG_MODULES) += ksyms.o
obj-$(CONFIG_PM) += pm.o
obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+ifeq ($(CONFIG_PREEMPT),y)
+obj-y += later.o
+else
+obj-$(CONFIG_SMP) += later.o
+endif

ifneq ($(CONFIG_IA64),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22006-linux-2.5.35/kernel/fork.c .22006-linux-2.5.35.updated/kernel/fork.c
--- .22006-linux-2.5.35/kernel/fork.c 2002-09-16 12:43:49.000000000 +1000
+++ .22006-linux-2.5.35.updated/kernel/fork.c 2002-09-17 14:03:06.000000000 +1000
@@ -28,6 +28,7 @@
#include <linux/security.h>
#include <linux/futex.h>
#include <linux/ptrace.h>
+#include <linux/later.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -816,6 +817,7 @@ static struct task_struct *copy_process(
else
p->exit_signal = clone_flags & CSIGNAL;
p->pdeath_signal = 0;
+ get_runcount(p);

/*
* Share the timeslice between parent and child, thus the
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22006-linux-2.5.35/kernel/later.c .22006-linux-2.5.35.updated/kernel/later.c
--- .22006-linux-2.5.35/kernel/later.c 1970-01-01 10:00:00.000000000 +1000
+++ .22006-linux-2.5.35.updated/kernel/later.c 2002-09-17 14:00:58.000000000 +1000
@@ -0,0 +1,148 @@
+/* Do something after system has calmed down.
+
+ Copyright (C) 2002 Rusty Russell, IBM Corporation
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+*/
+#include <linux/later.h>
+#include <linux/spinlock.h>
+#include <linux/completion.h>
+
+static spinlock_t later_lock = SPIN_LOCK_UNLOCKED;
+
+/* We track how many tasks are running or preempted. At opportune
+ times, we switch all new counting over to the alternate counter,
+ and when the old counter hits zero, we know that everyone has
+ voluntarily preempted. */
+
+/* One for the boot cpu idle thread. */
+atomic_t runcounts[2] = { ATOMIC_INIT(0), ATOMIC_INIT(1) };
+int which_runcount; /* = 0, means runcount[1] active. */
+
+/* The two lists of tasks. */
+static struct later *later_list[2]; /* = { NULL, NULL } */
+
+static inline int active_count(void)
+{
+ if (which_runcount == PF_RUNCOUNT)
+ return 0;
+ else
+ return 1;
+};
+
+/* We're done: process batch. */
+void noone_running(void)
+{
+ struct later *list;
+
+ /* Remove inactive list for processing. */
+ spin_lock_irq(&later_lock);
+ list = later_list[!active_count()];
+ later_list[!active_count()] = NULL;
+ spin_unlock_irq(&later_lock);
+
+ /* Callback usually frees the entry, so be careful */
+ while (list) {
+ struct later *next = list->next;
+ list->func(list->data);
+ list = next;
+ }
+}
+
+/* Queues future request: if nothing happening, switch queues. */
+void do_later(struct later *head, void (*func)(void *data), void *data)
+{
+ unsigned long flags;
+
+ head->func = func;
+ head->data = data;
+
+ spin_lock_irqsave(&later_lock, flags);
+ /* Add to list */
+ head->next = later_list[active_count()];
+ later_list[active_count()] = head;
+
+ /* If other list is empty, switch them. */
+ if (later_list[!active_count()] == NULL) {
+ /* Beware: which_runcount being read without lock by sched.c */
+ wmb();
+ which_runcount ^= PF_RUNCOUNT;
+ }
+ spin_unlock_irqrestore(&later_lock, flags);
+}
+
+/* Because of FASTCALL declaration of complete, we use this wrapper */
+static void wakeme(void *completion)
+{
+ complete(completion);
+}
+
+/* Wait until it's later. */
+void wait_for_later(void)
+{
+ DECLARE_COMPLETION(completion);
+ struct later later;
+
+ /* Queue it and wait... */
+ do_later(&later, wakeme, &completion);
+ wait_for_completion(&completion);
+}
+
+#if 0
+#include <linux/proc_fs.h>
+
+static int write_wait(struct file *file, const char *buffer,
+ unsigned long count, void *data)
+{
+ wait_for_later();
+ return count;
+}
+
+static int read_wait(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ char *p = page;
+ int len;
+
+ spin_lock_irq(&later_lock);
+ p += sprintf(p, "ACTIVE: %u (%p)\nOTHER: %u (%p)\n",
+ atomic_read(&runcounts[active_count()]),
+ later_list[active_count()],
+ atomic_read(&runcounts[!active_count()]),
+ later_list[!active_count()]);
+ spin_unlock_irq(&later_lock);
+
+ len = (p - page) - off;
+ if (len < 0)
+ len = 0;
+
+ *eof = (len <= count) ? 1 : 0;
+ *start = page + off;
+
+ return len;
+}
+
+static int __init create_wait_proc(void)
+{
+ struct proc_dir_entry *e;
+
+ e = create_proc_entry("wait_for_later", 0644, NULL);
+ e->write_proc = &write_wait;
+ e->read_proc = &read_wait;
+ return 0;
+}
+
+__initcall(create_wait_proc);
+#endif
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .22006-linux-2.5.35/kernel/sched.c .22006-linux-2.5.35.updated/kernel/sched.c
--- .22006-linux-2.5.35/kernel/sched.c 2002-09-16 12:43:49.000000000 +1000
+++ .22006-linux-2.5.35.updated/kernel/sched.c 2002-09-17 14:02:16.000000000 +1000
@@ -29,6 +29,7 @@
#include <linux/security.h>
#include <linux/notifier.h>
#include <linux/delay.h>
+#include <linux/later.h>

/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -938,7 +939,7 @@ asmlinkage void schedule(void)
runqueue_t *rq;
prio_array_t *array;
struct list_head *queue;
- int idx;
+ int idx, preempt;

if (unlikely(in_atomic()))
BUG();
@@ -946,6 +947,9 @@ asmlinkage void schedule(void)
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
#endif
+ preempt = (preempt_count() & PREEMPT_ACTIVE);
+ if (!preempt)
+ put_runcount();
need_resched:
preempt_disable();
prev = current;
@@ -959,7 +963,7 @@ need_resched:
* if entering off of a kernel preemption go straight
* to picking the next task.
*/
- if (unlikely(preempt_count() & PREEMPT_ACTIVE))
+ if (unlikely(preempt))
goto pick_next_task;

switch (prev->state) {
@@ -1020,6 +1024,9 @@ switch_tasks:
preempt_enable_no_resched();
if (test_thread_flag(TIF_NEED_RESCHED))
goto need_resched;
+
+ if (!preempt)
+ get_runcount(current);
}

#ifdef CONFIG_PREEMPT


2002-09-18 22:55:10

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Wed, 18 Sep 2002, Rusty Russell wrote:

> I've rewritten my in-kernel module loader: this version breaks
> much less existing code. Basically, we go to a model of
> externally-controlled module refcounts with possibility of failure
> (ie. try_inc_mod_count, now called try_module_get()).

You add a lot of complexity in an attempt to solve a quite simple problem.
I agree that the module load mechanism could be simplified, but why do you
want to do it in the kernel? Some general module management in userspace
is needed anyway and all you save is a single call to the kernel.
Most of the module related changes are rather cosmetic. You still have to
spread try_module_get() calls all over the kernel, one has to call it
before calling a function which might sleep, this means we will add it
everywhere just to be save. What makes it worse is that it's not generally
usable, a per object reference count can also be used for module
management, but the module count can't be used for object management. This
means for modules with more dynamic interfaces, they have to do twice as
much work.
I can only refer to my own patch again, which has most of the basic things
needed to sanely get out of this mess:
1. Allow module exit to fail. This gives modules far more control over
module management and the generic module code can be simplified.
2. The new module layout simplifies module loading, much more than
relocating isn't necessary, but keeps backward compability as long as
necessary. This means new modules can be loaded with old modutils and
modules using the old interface can be kept working for a while.

bye, Roman


2002-09-19 02:01:55

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209182313360.8911-100000@serv> you write:
> Hi,
>
> On Wed, 18 Sep 2002, Rusty Russell wrote:
>
> > I've rewritten my in-kernel module loader: this version breaks
> > much less existing code. Basically, we go to a model of
> > externally-controlled module refcounts with possibility of failure
> > (ie. try_inc_mod_count, now called try_module_get()).
>
> You add a lot of complexity in an attempt to solve a quite simple problem.
> I agree that the module load mechanism could be simplified, but why do you
> want to do it in the kernel?

Count the total lines of code in the kernel. It's less than it was
before. Even for ia64, it's around the same IIRC.

Now add the userspace code, and it's obviously far simpler. Not to
mention not having to worry about problems like insmod dying between
the two system calls...

I'm all for keeping things out of the kernel, but you can take things
too far. I was originally reluctant, but the beauty and simplicity of
doing it in-kernel changed my mind.

> Some general module management in userspace
> is needed anyway and all you save is a single call to the kernel.

Huh? Think about initramfs's insmod code. Think about busybox. Look
at my insmod code.

Sure, *modprobe* has some complexity to it, but a simple insmod is
trivial.

> Most of the module related changes are rather cosmetic. You still have to
> spread try_module_get() calls all over the kernel, one has to call it
> before calling a function which might sleep, this means we will add it
> everywhere just to be save.

Yes. It should be added everywhere. One idea is to mark interfaces
which are "safe", and if a module uses something else, refuse to
unload it.

> What makes it worse is that it's not generally usable, a per object
> reference count can also be used for module management, but the
> module count can't be used for object management. This means for
> modules with more dynamic interfaces, they have to do twice as much
> work.

Yes, you force reference counts in everything, whether you've got
modules (or module unloading) or not. You *do not* want to do this,
for performance reasons if nothing else.

> I can only refer to my own patch again, which has most of the basic things
> needed to sanely get out of this mess:
>
> 1. Allow module exit to fail. This gives modules far more control over
> module management and the generic module code can be simplified.

I really like this, myself (although I did it in two stages in my
earlier implementation, it's the same principle). I could use it
really nicely. You could use it really nicely. I can name a dozen
more that would get it right.

<Sigh>

Unfortunately, saying "you must understand module races to write a
module" is not realistic 8(. Look at this code for an example of
understanding of current modules among kernel developers: (lib/crc32.c):

/**
* init_crc32(): generates CRC32 tables
*
* On successful initialization, use count is increased.
* This guarantees that the library functions will stay resident
* in memory, and prevents someone from 'rmmod crc32' while
* a driver that needs it is still loaded.
* This also greatly simplifies drivers, as there's no need
* to call an initialization/cleanup function from each driver.
* Since crc32.o is a library module, there's no requirement
* that the user can unload it.
*/
static int __init init_crc32(void)
{
int rc1, rc2, rc;
rc1 = crc32init_le();
rc2 = crc32init_be();
rc = rc1 || rc2;
if (!rc) MOD_INC_USE_COUNT;
return rc;
}

> 2. The new module layout simplifies module loading, much more than
> relocating isn't necessary, but keeps backward compability as long as
> necessary. This means new modules can be loaded with old modutils and
> modules using the old interface can be kept working for a while.

The backwards compatibility issue is simple to handle in userspace:
try sys_query_module, and exec insmod.old if it doesn't give ENOSYS.

Sure, I'm lazy, but I'll release a 0.4 which does that I promise 8)

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-19 02:14:56

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Thu, Sep 19, 2002 at 11:00:23AM +1000, Rusty Russell wrote:
> In message <Pine.LNX.4.44.0209182313360.8911-100000@serv> you write:
> > Hi,
> >
> > On Wed, 18 Sep 2002, Rusty Russell wrote:
> >
> > > I've rewritten my in-kernel module loader: this version breaks
> > > much less existing code. Basically, we go to a model of
> > > externally-controlled module refcounts with possibility of failure
> > > (ie. try_inc_mod_count, now called try_module_get()).
> >
> > You add a lot of complexity in an attempt to solve a quite simple problem.
> > I agree that the module load mechanism could be simplified, but why do you
> > want to do it in the kernel?
>
> Count the total lines of code in the kernel. It's less than it was
> before. Even for ia64, it's around the same IIRC.
>
> Now add the userspace code, and it's obviously far simpler. Not to
> mention not having to worry about problems like insmod dying between
> the two system calls...
>
> I'm all for keeping things out of the kernel, but you can take things
> too far. I was originally reluctant, but the beauty and simplicity of
> doing it in-kernel changed my mind.

I still think that the kernel has no business knowing how to parse ELF
relocation. It's just as easy to parse it in userspace; and what do
you gain from moving the complexity from userspace to kernelspace?

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2002-09-19 03:53:32

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <[email protected]> you write:
> I still think that the kernel has no business knowing how to parse ELF
> relocation.

Thanks for your feedback.

> It's just as easy to parse it in userspace;

It is? Thanks for explaining, I thought it was significantly more
difficult.

> and what do you gain from moving the complexity from userspace to
> kernelspace?

Clearly, nothing. I apologize for wasting your time.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-19 10:40:06

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Thu, 19 Sep 2002, Rusty Russell wrote:

> > I agree that the module load mechanism could be simplified, but why do you
> > want to do it in the kernel?
>
> Count the total lines of code in the kernel. It's less than it was
> before. Even for ia64, it's around the same IIRC.

That's not really difficult.

> Now add the userspace code, and it's obviously far simpler. Not to
> mention not having to worry about problems like insmod dying between
> the two system calls...

A dying insmod is not really a problem. Removing the module after the
create call isn't a problem (I think that was already possible in my
version). Your current relocation code can be easily moved to userspace. I
agree that it can be far simpler, but doing two instead of one system call
doesn't really add any complexity.

> I'm all for keeping things out of the kernel, but you can take things
> too far. I was originally reluctant, but the beauty and simplicity of
> doing it in-kernel changed my mind.

Compared to the complexity of the current insmod I can I agree. On the
other hand with my module layout, I could load a module with ld and a few
lines of shell script (only the system calls are a bit tricky).

> > Most of the module related changes are rather cosmetic. You still have to
> > spread try_module_get() calls all over the kernel, one has to call it
> > before calling a function which might sleep, this means we will add it
> > everywhere just to be save.
>
> Yes. It should be added everywhere. One idea is to mark interfaces
> which are "safe", and if a module uses something else, refuse to
> unload it.

Now we get to the interesting part, cleaning up the module code is rather
simple to do, but getting everything else safe will be a bit more work and
how nicely it's done depends very much on which interfaces the module code
offers.

> > What makes it worse is that it's not generally usable, a per object
> > reference count can also be used for module management, but the
> > module count can't be used for object management. This means for
> > modules with more dynamic interfaces, they have to do twice as much
> > work.
>
> Yes, you force reference counts in everything, whether you've got
> modules (or module unloading) or not. You *do not* want to do this,
> for performance reasons if nothing else.

What else is try_module_get()? What is your problem with reference counts?
First of all only people _exporting_ symbols have to think about module
races, most drivers should just use interfaces, which are module safe.
Let's look again at the fs example I posted earlier:

static int __init init_fs(void)
{
inode_cachep = kmem_cache_create(...);
if (!inode_cachep)
return -ENOMEM;
return register_filesystem(&fs_type);
}

static int __exit exit_fs(void)
{
int err;
err = unregister_filesystem(&fs_type);
if (err)
return err;
kmem_cache_destroy(inode_cachep);
inode_cachep = NULL;
return 0;
}

Is this difficult to understand? I'n in doubt of this.
This is simple resource management, everyone writing kernel code has to do
this anyway, adding modules to this picture doesn't change it much. As
long as you have control over all your objects, you also know easily
whether it's safe to unload or not, but only the driver knows this, the
module code can only guess or it has to be told explicitely via
try_module_get().

bye, Roman

2002-09-19 12:54:03

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209191134360.28515-100000@serv> you write:
> Hi,
>
> On Thu, 19 Sep 2002, Rusty Russell wrote:
>
> > > I agree that the module load mechanism could be simplified, but why do yo
u
> > > want to do it in the kernel?
> >
> > Count the total lines of code in the kernel. It's less than it was
> > before. Even for ia64, it's around the same IIRC.
>
> That's not really difficult.

I know, I've done it for 5 architectures.

> > Now add the userspace code, and it's obviously far simpler. Not to
> > mention not having to worry about problems like insmod dying between
> > the two system calls...
>
> A dying insmod is not really a problem. Removing the module after the
> create call isn't a problem (I think that was already possible in my
> version). Your current relocation code can be easily moved to userspace. I
> agree that it can be far simpler, but doing two instead of one system call
> doesn't really add any complexity.

Of course it does. Logically there should be a lock between the two
operations: it's effectively an upcall to userspace, and that's messy.

> > I'm all for keeping things out of the kernel, but you can take things
> > too far. I was originally reluctant, but the beauty and simplicity of
> > doing it in-kernel changed my mind.
>
> Compared to the complexity of the current insmod I can I agree. On the
> other hand with my module layout, I could load a module with ld and a few
> lines of shell script (only the system calls are a bit tricky).

Do that on sparc64, x86_64 or ppc64 and I'll be really impressed. And
of course, good luck fitting libbfd into busybox!

> > > Most of the module related changes are rather cosmetic. You still have to
> > > spread try_module_get() calls all over the kernel, one has to call it
> > > before calling a function which might sleep, this means we will add it
> > > everywhere just to be save.
> >
> > Yes. It should be added everywhere. One idea is to mark interfaces
> > which are "safe", and if a module uses something else, refuse to
> > unload it.
>
> Now we get to the interesting part, cleaning up the module code is rather
> simple to do, but getting everything else safe will be a bit more work and
> how nicely it's done depends very much on which interfaces the module code
> offers.

Yes, see latest patchball (to follow).

> > > What makes it worse is that it's not generally usable, a per object
> > > reference count can also be used for module management, but the
> > > module count can't be used for object management. This means for
> > > modules with more dynamic interfaces, they have to do twice as much
> > > work.
> >
> > Yes, you force reference counts in everything, whether you've got
> > modules (or module unloading) or not. You *do not* want to do this,
> > for performance reasons if nothing else.
>
> What else is try_module_get()? What is your problem with reference counts?

Incrementing and decrementing a reference count for every network
protocol (think: one for IPv4, one for TCP) is going to do terrible
things for performance. Now, if we *only* do it when IPv4 is a
module, *and* we use bigrefs (which take about a page each 8(), it's
not so bad.

> Let's look again at the fs example I posted earlier:
>
> static int __init init_fs(void)
> {
> inode_cachep = kmem_cache_create(...);
> if (!inode_cachep)
> return -ENOMEM;
> return register_filesystem(&fs_type);
> }
>
> static int __exit exit_fs(void)
> {
> int err;
> err = unregister_filesystem(&fs_type);
> if (err)
> return err;
> kmem_cache_destroy(inode_cachep);
> inode_cachep = NULL;
> return 0;
> }
>
> Is this difficult to understand? I'n in doubt of this.

This still has the init problem (well, not this example but...)

Try doing this with IPv4, or LSM.

If every single object in the kernel is reference counted, *yes* you
can do this. But they're not, and they will never be. Changing them
over to use try_module_get() is feasible, though.

Of course, we could just say "you can't unload those module then!" and
all those problems disappear. You can see why I was so tempted 8)

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-19 13:49:53

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Thu, 19 Sep 2002, Rusty Russell wrote:

> If every single object in the kernel is reference counted, *yes* you
> can do this. But they're not, and they will never be. Changing them
> over to use try_module_get() is feasible, though.

Rusty, slowly I'm pissed. :(
I already said often enough, a module has only to answer the simple
question: Is it safe to unload the module? Reference counts is the
simplest way, but I'm sure even you can come up with other methods.
On the other hand you force on _everyone_ to use try_module_get(), which
are nothing else than reference counts. As I already wrote in my last
mail:
This is simple resource management, everyone writing kernel code has to do
this anyway, adding modules to this picture doesn't change it much. As
long as you have control over all your objects, you also know easily
whether it's safe to unload or not, but only the driver knows this, the
module code can only guess or it has to be told explicitely via
try_module_get().
In other words: If you get it right with try_module_get(), you will more
easily get it right with reference counts.

bye, Roman


2002-09-19 18:33:53

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Thu, Sep 19, 2002 at 03:54:40PM +0200, Roman Zippel wrote:
> I already said often enough, a module has only to answer the simple
> question: Is it safe to unload the module?

And with a LSM module, how can it answer that? There's no way, unless
we count every time someone calls into our module. And if you do that,
no one will even want to use your module, given the number of hooks, and
the paths those hooks are on (the speed hit would be horrible.)

I'm with Rusty, just don't let people unload modules, unless you are
running a development kernel, and "obviously" know what you are doing.

thanks,

greg k-h

2002-09-19 18:48:46

by Alan

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Thu, 2002-09-19 at 19:38, Greg KH wrote:
> And with a LSM module, how can it answer that? There's no way, unless
> we count every time someone calls into our module. And if you do that,
> no one will even want to use your module, given the number of hooks, and
> the paths those hooks are on (the speed hit would be horrible.)

So the LSM module always says no. Don't make other modules suffer

2002-09-19 20:06:47

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Thu, Sep 19, 2002 at 07:58:15PM +0100, Alan Cox wrote:
> On Thu, 2002-09-19 at 19:38, Greg KH wrote:
> > And with a LSM module, how can it answer that? There's no way, unless
> > we count every time someone calls into our module. And if you do that,
> > no one will even want to use your module, given the number of hooks, and
> > the paths those hooks are on (the speed hit would be horrible.)
>
> So the LSM module always says no. Don't make other modules suffer

Ok, I don't have a problem with that, I was just trying to point out
that not all modules can know when they are able to be unloaded, as
Roman stated.

thanks,

greg k-h

2002-09-19 20:05:58

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Thu, 19 Sep 2002, Greg KH wrote:

> > I already said often enough, a module has only to answer the simple
> > question: Is it safe to unload the module?
>
> And with a LSM module, how can it answer that? There's no way, unless
> we count every time someone calls into our module. And if you do that,
> no one will even want to use your module, given the number of hooks, and
> the paths those hooks are on (the speed hit would be horrible.)

Check Rusty's first patch. You basically have to uninstall all the hooks
first, then you have to wait for all running processes to advance enough,
that you can be sure a caller advanced far enough to a synchronization
point within the module (e.g. a try_module_get()). At this point you can
be sure whether you have module users or not.
Note that Rusty's patch doesn't help LSM at all, because he didn't
introduce the separate stop/exit calls.

> I'm with Rusty, just don't let people unload modules, unless you are
> running a development kernel, and "obviously" know what you are doing.

For the majority of modules unloading can be implemented safely and simple
modules don't have to care about module races, as long as the interfaces
they are using are safe.

bye, Roman

2002-09-19 20:38:40

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Thu, 19 Sep 2002, Greg KH wrote:

> > So the LSM module always says no. Don't make other modules suffer
>
> Ok, I don't have a problem with that, I was just trying to point out
> that not all modules can know when they are able to be unloaded, as
> Roman stated.

That's not really what I meant. Only the module can know whether it's safe
and somehow has to tell that module.c. The question is now how that should
be done.

bye, Roman

2002-09-19 23:59:59

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209191532110.8911-100000@serv> you write:
> Hi,
>
> On Thu, 19 Sep 2002, Rusty Russell wrote:
>
> > If every single object in the kernel is reference counted, *yes* you
> > can do this. But they're not, and they will never be. Changing them
> > over to use try_module_get() is feasible, though.
>
> Rusty, slowly I'm pissed. :(

Sorry if I haven't been as clear as I might wish. I shall try: I do
appreciate your patience.

1) You keep ignoring the load race problem. Your solution does not
solve that, so you will need something else as well.

2) Several places in the kernel do *not* keep reference counts, for
example net/core/dev.c's dev_add_pack and dev_remove_pack. You
want to add reference counts to all of them, but the only reason
for the reference counts is for module unload: you are penalizing
everyone just *in case* one is a module.

3) The cost of doing atomic_incs and decs on eg. our network performance
is simply unacceptable. The only way to avoid hitting the same
cacheline all the time is to use bigrefs, and the kernel bloat will
kill us (and they're still not free for the 99% of people who don't
have IPv4 and TCP as modules).

4) Your solution does not allow implementation of "rmmod -f" which
prevents module count from increasing, and removes it when it is
done. This is very nice when your usage count is controlled by an
external source (eg. your network).

Now, your proposal *can* be modified to address these things, but I'm
not sure you'll like your proposale when that's done 8(

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-20 03:57:38

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <[email protected]> you write:
> On Thu, Sep 19, 2002 at 03:54:40PM +0200, Roman Zippel wrote:
> > I already said often enough, a module has only to answer the simple
> > question: Is it safe to unload the module?
>
> And with a LSM module, how can it answer that? There's no way, unless
> we count every time someone calls into our module. And if you do that,
> no one will even want to use your module, given the number of hooks, and
> the paths those hooks are on (the speed hit would be horrible.)

Well, it's up to you. You *could* implement:

#define call_security(method , ...) \
({ int __ret; \
if (try_module_get(security_ops->owner)) { \
__ret = security_ops->method(__VA_ARGS__); \
module_put(security_ops->owner); \
} else \
/* If unloading or loading, default to "allow" */ \
__ret = 0; \
__ret; \
})

Now, if you don't have CONFIG_MODULES this becomes the code as it is
now. If you don't have CONFIG_MODULE_UNLOAD, this becomes:

if (security_ops->owner && security_ops->owner->live)
__ret = security_ops->method(__VA_ARGS__);
else
__ret = 0;

With the case of CONFIG_MODULE_UNLOAD, it looks *really* horrible (53
instructions with 6 branches: ehttp://www...). So I would recommend
register_security insert a shim and take the hit itself:

#ifdef CONFIG_MODULE_UNLOAD
if (ops->owner) {
module_ops->real = ops;
ops = module_ops;
}
#endif

Then module_ops does the inc/dec wrappers (well, it can optimize if
the ops can't sleep...).

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-20 04:27:46

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Fri, Sep 20, 2002 at 11:22:08AM +1000, Rusty Russell wrote:
> In message <[email protected]> you write:
> > On Thu, Sep 19, 2002 at 03:54:40PM +0200, Roman Zippel wrote:
> > > I already said often enough, a module has only to answer the simple
> > > question: Is it safe to unload the module?
> >
> > And with a LSM module, how can it answer that? There's no way, unless
> > we count every time someone calls into our module. And if you do that,
> > no one will even want to use your module, given the number of hooks, and
> > the paths those hooks are on (the speed hit would be horrible.)
>
> Well, it's up to you. You *could* implement:

<snip>

Ok, now that's just sick and twisted enough that it might just work. I
really don't want to use a macro for the security functions, but this
provides type safety, and... well... I'm at a loss of words, and just
amazed...

greg k-h

2002-09-20 09:23:51

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <[email protected]> you write:
> On Fri, Sep 20, 2002 at 11:22:08AM +1000, Rusty Russell wrote:
> > In message <[email protected]> you write:
> > > On Thu, Sep 19, 2002 at 03:54:40PM +0200, Roman Zippel wrote:
> > > > I already said often enough, a module has only to answer the simple
> > > > question: Is it safe to unload the module?
> > >
> > > And with a LSM module, how can it answer that? There's no way, unless
> > > we count every time someone calls into our module. And if you do that,
> > > no one will even want to use your module, given the number of hooks, and
> > > the paths those hooks are on (the speed hit would be horrible.)
> >
> > Well, it's up to you. You *could* implement:
>
> <snip>
>
> Ok, now that's just sick and twisted enough that it might just work. I
> really don't want to use a macro for the security functions, but this
> provides type safety, and... well... I'm at a loss of words, and just
> amazed...

Then you'll love this.

This allows modules to safely look after their own reference counts
even with preemption and without requiring rcu-like scheduler changes
(synchronize_kernel here simple schedules itself on each CPU in turn).

Thanks to Stephen Rothwell for the Intel asm (ie. the tricky bit).

[ Roman: I'm not really serious about this, but it maybe someone will
really want to control their own counts... ]

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

Name: module_put_return primitive for x86
Author: Rusty Russell
Status: Experimental
Depends: Module/modbase-try-i386.patch.gz

D: This patch implements module_put_return() for x86, which allows a
D: module to control its own reference counts. A module must never
D: use module_put() on itself, since this may result in the module
D: being removable immediately: this is the alternative, which
D: atomically decrements the count and returns.
D:
D: Each architecture would need to implement this.

diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.36-modbase-try-i386/arch/i386/kernel/module.c working-2.5.36-modbase-try-i386-decandret/arch/i386/kernel/module.c
--- working-2.5.36-modbase-try-i386/arch/i386/kernel/module.c 2002-09-20 13:32:20.000000000 +1000
+++ working-2.5.36-modbase-try-i386-decandret/arch/i386/kernel/module.c 2002-09-20 14:24:35.000000000 +1000
@@ -28,6 +28,9 @@
#define DEBUGP(fmt , ...)
#endif

+/* For the magic module return. */
+struct module_percpu module_percpu[NR_CPUS];
+
static void *alloc_and_zero(unsigned long size)
{
void *ret;
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.36-modbase-try-i386/arch/i386/lib/module.S working-2.5.36-modbase-try-i386-decandret/arch/i386/lib/module.S
--- working-2.5.36-modbase-try-i386/arch/i386/lib/module.S 1970-01-01 10:00:00.000000000 +1000
+++ working-2.5.36-modbase-try-i386-decandret/arch/i386/lib/module.S 2002-09-20 13:50:37.000000000 +1000
@@ -0,0 +1,33 @@
+/* Icky, icky, return trampoline for dying modules. Entered with
+ interrupts off. (C) 2002 Stephen Rothwell, IBM Corporation. */
+
+#include <asm/thread_info.h>
+
+.text
+.align 4
+.globl __magic_module_return
+
+#define MODULE_PERCPU_SIZE_ORDER 3
+
+__magic_module_return:
+ /* Save one working variable */
+ pushl %eax
+
+ /* Get CPU number from current. */
+ GET_THREAD_INFO(%eax)
+ movl TI_CPU(%eax), %eax
+
+ /* Push module_percpu[cpu].flags on the stack */
+ shll $MODULE_PERCPU_SIZE_ORDER, %eax
+ pushl module_percpu(%eax)
+
+ /* Put module_percpu[cpu].returnaddr into %eax */
+ movl module_percpu+4(%eax), %eax
+
+ /* Push returnaddr and restore eax */
+ xchgl %eax, 4(%esp)
+
+ /* Restore interrupts */
+ popf
+ /* Go home */
+ ret
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.36-modbase-try-i386/include/asm-i386/module.h working-2.5.36-modbase-try-i386-decandret/include/asm-i386/module.h
--- working-2.5.36-modbase-try-i386/include/asm-i386/module.h 2002-09-20 13:32:20.000000000 +1000
+++ working-2.5.36-modbase-try-i386-decandret/include/asm-i386/module.h 2002-09-20 14:24:07.000000000 +1000
@@ -8,4 +8,33 @@ struct mod_arch_specific
#define Elf_Shdr Elf32_Shdr
#define Elf_Sym Elf32_Sym
#define Elf_Ehdr Elf32_Ehdr
+
+/* Non-preemtible decrement the refcount and return. */
+#define module_put_return(firstarg , ...) \
+do { \
+ unsigned int cpu; \
+ unsigned long flags; \
+ \
+ local_irq_save(flags); \
+ if (unlikely(module_put(THIS_MODULE)) { \
+ module_percpu[cpu].flags = flags; \
+ module_percpu[cpu].returnaddr = ((void **)&(firstarg))[-1]; \
+ ((void **)&(firstarg))[-1] = __magic_module_return; \
+ } else \
+ local_irq_restore(flags); \
+ return __VA_ARGS__; \
+} while(0)
+
+/* FIXME: Use per-cpu vars --RR */
+struct module_percpu
+{
+ unsigned long flags;
+ void *returnaddr;
+};
+
+extern struct module_percpu module_percpu[NR_CPUS];
+
+/* Restore flags and return to caller. */
+extern void __magic_module_return(void);
+
#endif /* _ASM_I386_MODULE_H */
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal working-2.5.36-modbase-try-i386/kernel/module.c working-2.5.36-modbase-try-i386-decandret/kernel/module.c
--- working-2.5.36-modbase-try-i386/kernel/module.c 2002-09-20 13:32:20.000000000 +1000
+++ working-2.5.36-modbase-try-i386-decandret/kernel/module.c 2002-09-20 14:35:10.000000000 +1000
@@ -229,6 +229,9 @@ sys_delete_module(const char *name_user,
/* Since it's not live, this should monotonically decrease. */
bigref_wait_for_zero(&mod->use);

+ /* Wait in case doing own refcounts and using module_put_return */
+ synchronize_kernel();
+
/* Final destruction now noone is using it. */
mod->exit();
free_module(mod);

2002-09-20 09:28:06

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Fri, 20 Sep 2002, Rusty Russell wrote:

> 1) You keep ignoring the load race problem. Your solution does not
> solve that, so you will need something else as well.

Could you please explain, why you think I'm ignoring this problem?
(I think I don't, but I want to be sure we talk about the same problem.)

> 2) Several places in the kernel do *not* keep reference counts, for
> example net/core/dev.c's dev_add_pack and dev_remove_pack. You
> want to add reference counts to all of them, but the only reason
> for the reference counts is for module unload: you are penalizing
> everyone just *in case* one is a module.

For an alternative solution:
http://marc.theaimsgroup.com/?l=linux-kernel&m=103246649130126&w=2

> 3) The cost of doing atomic_incs and decs on eg. our network performance
> is simply unacceptable. The only way to avoid hitting the same
> cacheline all the time is to use bigrefs, and the kernel bloat will
> kill us (and they're still not free for the 99% of people who don't
> have IPv4 and TCP as modules).

Even your bigref is still overkill. When packets come in, you already have
to take _some_ lock, under the protection of this lock you can implement
cheap, simple, portable and cache friendly counts, which can be used for
synchronization.

> 4) Your solution does not allow implementation of "rmmod -f" which
> prevents module count from increasing, and removes it when it is
> done. This is very nice when your usage count is controlled by an
> external source (eg. your network).

Multiple possible solutions:
- Separate stop/exit is probably the most elegant solution.
- A call to exit does in any case start the removal of the module, that
means it starts removing interface (and which won't get reinstalled).
If there is still any user, exit will fail, you can try it later again
after you killed that user.
Anyway, almost any access to a driver goes through the filesystem and
there it's a well known problem of unlinked but still open files. Driver
access is pretty much the same problem to which you can apply the same
well known solutions.

bye, Roman

2002-09-21 04:14:35

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209201105330.338-100000@serv> you write:
> Hi,
>
> On Fri, 20 Sep 2002, Rusty Russell wrote:
>
> > 1) You keep ignoring the load race problem. Your solution does not
> > solve that, so you will need something else as well.
>
> Could you please explain, why you think I'm ignoring this problem?
> (I think I don't, but I want to be sure we talk about the same problem.)

initfn()
{
register_notifier(some_notifier); /* Entry point one */
if (register_filesystem(some_notifier2) != 0) {
unregister_notifier(some_notifier); /* This fails! */
return -EBUSY;
}

How does your solution of failing the unregister_notifier in this case
stop the race? I probably missed something here?

> > 2) Several places in the kernel do *not* keep reference counts, for
> > example net/core/dev.c's dev_add_pack and dev_remove_pack. You
> > want to add reference counts to all of them, but the only reason
> > for the reference counts is for module unload: you are penalizing
> > everyone just *in case* one is a module.
>
> For an alternative solution:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103246649130126&w=2

Yes, two stage remove. I liked it, except for the fact that you want
to be able to re-init the module in the case where you lose the race
between between I and II. Then you also need two-stage init (or a
re-init function), whereas the try_module_get solution makes this easy
(ie. no requirement on the module author).

> > 3) The cost of doing atomic_incs and decs on eg. our network performance
> > is simply unacceptable. The only way to avoid hitting the same
> > cacheline all the time is to use bigrefs, and the kernel bloat will
> > kill us (and they're still not free for the 99% of people who don't
> > have IPv4 and TCP as modules).
>
> Even your bigref is still overkill. When packets come in, you already have
> to take _some_ lock, under the protection of this lock you can implement
> cheap, simple, portable and cache friendly counts, which can be used for
> synchronization.

No: networking uses the network brlock for exactly this reason 8(

> > 4) Your solution does not allow implementation of "rmmod -f" which
> > prevents module count from increasing, and removes it when it is
> > done. This is very nice when your usage count is controlled by an
> > external source (eg. your network).
>
> Multiple possible solutions:
> - Separate stop/exit is probably the most elegant solution.
> - A call to exit does in any case start the removal of the module, that
> means it starts removing interface (and which won't get reinstalled).
> If there is still any user, exit will fail, you can try it later again
> after you killed that user.

If the exit fails and you fail the rmmod, you need to reinit the
module. Otherwise noone can use it, but it cannot be replaced (say
it's holding some io region or other resource).

If you want to wait, that may be OK, but if you want to abort the
unload, the module needs to give you a *third* function, and that's
where my "too much work for every module author" line gets crossed 8(

> Anyway, almost any access to a driver goes through the filesystem and
> there it's a well known problem of unlinked but still open files. Driver
> access is pretty much the same problem to which you can apply the same
> well known solutions.

Not sure what you mean here.
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-21 07:33:34

by Kevin O'Connor

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Fri, Sep 20, 2002 at 11:22:08AM +1000, Rusty Russell wrote:
> Well, it's up to you. You *could* implement:
>
> #define call_security(method , ...) \
> ({ int __ret; \
> if (try_module_get(security_ops->owner)) { \
> __ret = security_ops->method(__VA_ARGS__); \
> module_put(security_ops->owner); \
> } else \
> /* If unloading or loading, default to "allow" */ \
> __ret = 0; \
> __ret; \
> })
[...]
> Now, if you don't have CONFIG_MODULES this becomes the code as it is
> now.

Hi Rusty,

Please consider the following non-module code snippet:

int
sys_enable_foo_security()
{
foocache = kmalloc(100000);
register_security(&foo_ops);
}

int
sys_disable_foo_security()
{
unregister_security(&foo_ops);
kfree(foocache); // OOPS
}


If I follow Roman's argument correctly, the unload race is not module
specific. (The problem is that unregister_security() only asserts that no
new callers will be made to foo_ops, it doesn't guarantee that there are no
current callers.)

In the above example, one solution would be to reference count foocache.
However, another viable solution would be to ref-count the security_ops
field.

Anyway, given that the problem is a general resource management issue (and
not module specific), I think one could implement call_security() with less
overhead:

#define call_security(method , ...) \
({ int __ret; \
read_lock(&SecurityLock); \
__ret = security_ops->method(__VA_ARGS__); \
read_unlock(&SecurityLock); \
__ret; \
})

where (un)register_security used a write_lock to guard accesses to
security_ops changes.

This implementation is still a bit sluggish (as well as limiting), however
one could conceivable use RCU or a similar mechanism to further reduce
overhead of the common path.

-Kevin


P.S. it may also be possible for this alternate solution to work:

#define call_security(method , ...) \
({ int __ret; \
atomic_inc(&SecurityRefCount); \
__ret = security_ops->method(__VA_ARGS__); \
atomic_dec(&SecurityRefCount); \
})

where unregister_security set the security_ops field to a dummy value and
then waited for the ref-count to hit zero before returning. However, this
may depend too heavily on memory ordering..

--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| [email protected] 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------

2002-09-21 17:04:28

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Sat, 21 Sep 2002, Rusty Russell wrote:

> initfn()
> {
> register_notifier(some_notifier); /* Entry point one */
> if (register_filesystem(some_notifier2) != 0) {
> unregister_notifier(some_notifier); /* This fails! */
> return -EBUSY;
> }
>
> How does your solution of failing the unregister_notifier in this case
> stop the race? I probably missed something here?

You shouldn't do any cleanup in the init function and do it instead in
the exit function. That's the reason why I said earlier that calling exit
even after a failed init is not just an implementation detail. So your
functions would look like this:

int initfn()
{
int err;

err = register_notifier(&some_notifier);
if (err)
return err;
err = register_filesystem(&some_notifier2);
return err;
}

int exitfn()
{
int err;

err = unregister_filesystem(&some_notifier2);
if (err)
return err;
err = unregister_notifier(&some_notifier);
return err;
}

This means we get rid of duplicated cleanup paths and drivers are easier
to verify.
The unregister functions would look like this:

int unregister(notifier)
{
if (notifier->count == 0)
return 0;
unlink(notifier);
if (notifier->count > 1)
return -EBUSY;
...
notifier->count = 0;
return 0;
}

> > For an alternative solution:
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=103246649130126&w=2
>
> Yes, two stage remove. I liked it, except for the fact that you want
> to be able to re-init the module in the case where you lose the race
> between between I and II. Then you also need two-stage init (or a
> re-init function), whereas the try_module_get solution makes this easy
> (ie. no requirement on the module author).

If you insist on doing the synchronize call in the module code, then you
need two stages. On the other you also could simply do this in the driver
code:

if (hook) {
hook = NULL;
synchronize();
}
...


> > Even your bigref is still overkill. When packets come in, you already have
> > to take _some_ lock, under the protection of this lock you can implement
> > cheap, simple, portable and cache friendly counts, which can be used for
> > synchronization.
>
> No: networking uses the network brlock for exactly this reason 8(

It can't be that difficult. After you removed the hooks you only have to
wait for some time and I can't believe it's that difficult to calculate
the needed time. Using the number of incoming packets is one possibility,
the number of network interrupts should be another way.
You could even temporarily shutdown all network interfaces, if it's really
that difficult.

> > - A call to exit does in any case start the removal of the module, that
> > means it starts removing interface (and which won't get reinstalled).
> > If there is still any user, exit will fail, you can try it later again
> > after you killed that user.
>
> If the exit fails and you fail the rmmod, you need to reinit the
> module. Otherwise noone can use it, but it cannot be replaced (say
> it's holding some io region or other resource).

Why would I want to reinit the module after a failed exit? As long as
someone is still using the module, you cannot remove the module. If you
look at the unregister example above, you see that the hook is removed
first, so nobody can start using the module again. If a module exit fails
it means you have to take care of the remaining users to complete module
unload. This means if you want to force the removal of a module, you have
to kill its users.

> If you want to wait, that may be OK, but if you want to abort the
> unload, the module needs to give you a *third* function, and that's
> where my "too much work for every module author" line gets crossed 8(

What do you mean?

> > Anyway, almost any access to a driver goes through the filesystem and
> > there it's a well known problem of unlinked but still open files. Driver
> > access is pretty much the same problem to which you can apply the same
> > well known solutions.
>
> Not sure what you mean here.

You start using drivers like files by opening them, while it's open it can
be removed (made unreachable), but only when the last user is gone can the
resources be released. Treat a driver like a file (or maybe a complete
file system) and you can use a lot of Al-proof infrastructure for free.

bye, Roman

2002-09-23 00:18:07

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209211411010.338-100000@serv> you write:
> Hi,
>
> On Sat, 21 Sep 2002, Rusty Russell wrote:
>
> > initfn()
> > {
> > register_notifier(some_notifier); /* Entry point one */
> > if (register_filesystem(some_notifier2) != 0) {
> > unregister_notifier(some_notifier); /* This fails! */
> > return -EBUSY;
> > }
> >
> > How does your solution of failing the unregister_notifier in this case
> > stop the race? I probably missed something here?
>
> You shouldn't do any cleanup in the init function and do it instead in
> the exit function. That's the reason why I said earlier that calling exit
> even after a failed init is not just an implementation detail. So your
> functions would look like this:

Yeah, I realized this after I sent the mail. Sorry. The problem is
one level higher in your implementation:

error = start_module(mod);
if (error)
exit_module(mod);

If the exit fails here (because the module is now in use, so the
unregister_notifier() fails, you need to loop and yield, because you
have no way of saying "wake me when you can be unloaded".

> int initfn()
> {
> int err;
>
> err = register_notifier(&some_notifier);
> if (err)
> return err;
> err = register_filesystem(&some_notifier2);
> return err;
> }

OK, so we fail the some_notifier2 here, and call exitfn:

> int exitfn()
> {
> int err;
>
> err = unregister_filesystem(&some_notifier2);
> if (err)
> return err;

This will trigger (err == -ENOENT) I assume? I think you really need:

int exitfn()
{
int busy;

busy = (unregister_filesystem(&some_notifier2) == -EBUSY
|| unregister_filesystem(&some_notifier) == -EBUSY);
if (busy)
return -EBUSY;
return 0;
}

> If you insist on doing the synchronize call in the module code, then you
> need two stages. On the other you also could simply do this in the driver
> code:
>
> if (hook) {
> hook = NULL;
> synchronize();
> }
> ...

Hmm, at least if we force the module author to provide two functions,
they don't have to have a clear understanding of the subtleties. I
know, I'm a coward.

> > > Even your bigref is still overkill. When packets come in, you already hav
e
> > > to take _some_ lock, under the protection of this lock you can implement
> > > cheap, simple, portable and cache friendly counts, which can be used for
> > > synchronization.
> >
> > No: networking uses the network brlock for exactly this reason 8(
>
> It can't be that difficult. After you removed the hooks you only have to
> wait for some time and I can't believe it's that difficult to calculate
> the needed time. Using the number of incoming packets is one possibility,
> the number of network interrupts should be another way.
> You could even temporarily shutdown all network interfaces, if it's really
> that difficult.

Yes, you can actually unregister all your hooks and then take the
write brlock for a moment (this is what I do in the ip_conntrack
code). Let me think, in this case, functions would look like. Please
see if my understanding is correct:

static struct proc_dir_entry *proc;

int start_ip_conntrack(void)
{
ret = ip_conntrack_init();
if (ret < 0)
return ret;

proc = proc_net_create("ip_conntrack",0,list_conntracks);
if (!proc)
return -EBUSY;

ret = nf_register_hook(&ip_conntrack_in_ops);
if (ret < 0)
return ret;

ret = nf_register_hook(&ip_conntrack_local_out_ops);
if (ret < 0)
return ret;

ret = nf_register_hook(&ip_conntrack_out_ops);
if (ret < 0)
return ret;

ret = nf_register_hook(&ip_conntrack_local_in_ops);
return ret;
}

I'm not clear on the exact desired semantics of stop and exit? When
should stop() fail?

int stop_ip_conntrack(void)
{
nf_unregister_hook(&ip_conntrack_in_ops);
nf_unregister_hook(&ip_conntrack_local_out_ops);
nf_unregister_hook(&ip_conntrack_out_ops);
nf_unregister_hook(&ip_conntrack_local_in_ops);
/* Force synchronization */
br_write_lock_irq(BR_NETPROTO_LOCK);
br_write_unlock_irq(BR_NETPROTO_LOCK);
if (proc_net_destroy(proc) == -EBUSY)
return -EBUSY;
return 0;
}

Now, we put callback pointers inside packets, so we need to keep a
count of how many of those we have (packet_count):

int exit_ip_conntrack(void)
{
if (!bigref_is_zero(&packet_count))
return -EBUSY;

/* Woohoo! We're free, clean up! */
ip_conntrack_cleanup();
return 0;
}

int usecount_ip_conntrack(void)
{
return atomic_read(&proc->use)
+ bigref_approx_val(&packet_count);
}

Am I using this interface correctly? Since netfilter hooks can't
sleep, nf_register_hook can't be busy (ie. can't fail for that
reason). I added a -EBUSY return to proc_net_destroy().

> > > - A call to exit does in any case start the removal of the module, that
> > > means it starts removing interface (and which won't get reinstalled).
> > > If there is still any user, exit will fail, you can try it later again
> > > after you killed that user.
> >
> > If the exit fails and you fail the rmmod, you need to reinit the
> > module. Otherwise noone can use it, but it cannot be replaced (say
> > it's holding some io region or other resource).
>
> Why would I want to reinit the module after a failed exit? As long as

But, this is what you seem to do in try_unregister_module:

if (test_bit(MOD_INITIALIZED, &mod->flags)) {
res = exit_module(mod);
if (res) {
start_module(mod);
goto out;
}
}

> > If you want to wait, that may be OK, but if you want to abort the
> > unload, the module needs to give you a *third* function, and that's
> > where my "too much work for every module author" line gets crossed 8(
>
> What do you mean?

I thought you might want a "restart()" function, but I don't think you
do.

> > > Anyway, almost any access to a driver goes through the filesystem and
> > > there it's a well known problem of unlinked but still open files. Driver
> > > access is pretty much the same problem to which you can apply the same
> > > well known solutions.
> >
> > Not sure what you mean here.
>
> You start using drivers like files by opening them, while it's open it can
> be removed (made unreachable), but only when the last user is gone can the
> resources be released. Treat a driver like a file (or maybe a complete
> file system) and you can use a lot of Al-proof infrastructure for free.

Oh, I see what you are saying. Yes, this is two-stage delete. It's
very common in the networking code too, for similar reasons.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-23 00:18:09

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <[email protected]> you write:
> Please consider the following non-module code snippet:
>
> int
> sys_enable_foo_security()
> {
> foocache = kmalloc(100000);
> register_security(&foo_ops);
> }
>
> int
> sys_disable_foo_security()
> {
> unregister_security(&foo_ops);
> kfree(foocache); // OOPS
> }
>
>
> If I follow Roman's argument correctly, the unload race is not module
> specific. (The problem is that unregister_security() only asserts that no
> new callers will be made to foo_ops, it doesn't guarantee that there are no
> current callers.)

But in practice there are many resources which are only unregistered
in the "unloading module" case: certainly by far the most common
case. It's hard for most module authors to spot this kind of race.

> In the above example, one solution would be to reference count foocache.
> However, another viable solution would be to ref-count the security_ops
> field.
>
> Anyway, given that the problem is a general resource management issue (and
> not module specific), I think one could implement call_security() with less
> overhead:
>
> #define call_security(method , ...) \
> ({ int __ret; \
> read_lock(&SecurityLock); \
> __ret = security_ops->method(__VA_ARGS__); \
> read_unlock(&SecurityLock); \
> __ret; \
> })

Whack me with a cacheline... that's going to hurt on SMP. You could
use a brlock, though.

You're assuming security methods cannot sleep? If true, use a brlock
and be done with it. Otherwise, you'll need a refcount, and we're
back to bigrefs and synchronize_kernel. Adding a bigref to each
security_ops struct might be acceptable, since it's so big. Adding a
single "owner" field certainly is.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-24 10:11:57

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Mon, 23 Sep 2002, Rusty Russell wrote:

> Yeah, I realized this after I sent the mail. Sorry. The problem is
> one level higher in your implementation:
>
> error = start_module(mod);
> if (error)
> exit_module(mod);
>
> If the exit fails here (because the module is now in use, so the
> unregister_notifier() fails, you need to loop and yield, because you
> have no way of saying "wake me when you can be unloaded".

That's not the job of the kernel, the module will stay in the cleanup
stage, until the the user requests module removal. This is what actually
happens above:

mod->state = initializing;
err = mod->init();
if (err) {
mod->state = cleanup;
if (!mod->exit())
mod->state = exit;
} else
mod->state = running;

As long as the module is in the cleanup stage you can only call exit().

> OK, so we fail the some_notifier2 here, and call exitfn:
>
> > int exitfn()
> > {
> > int err;
> >
> > err = unregister_filesystem(&some_notifier2);
> > if (err)
> > return err;
>
> This will trigger (err == -ENOENT) I assume? I think you really need:

Look at the unregister example in the last mail. It will suceed silently.

> > If you insist on doing the synchronize call in the module code, then you
> > need two stages. On the other you also could simply do this in the driver
> > code:
> >
> > if (hook) {
> > hook = NULL;
> > synchronize();
> > }
> > ...
>
> Hmm, at least if we force the module author to provide two functions,
> they don't have to have a clear understanding of the subtleties. I
> know, I'm a coward.

This shouldn't be the standard method, so any module author using such
hooks should be aware of it's problems, so I don't see a problem here.

> I'm not clear on the exact desired semantics of stop and exit? When
> should stop() fail?

stop() would just unregister the interface and is not really expected to
fail. On the other it's also possible to do an implicit stop within the
exit call.

> int stop_ip_conntrack(void)
> {
> nf_unregister_hook(&ip_conntrack_in_ops);
> nf_unregister_hook(&ip_conntrack_local_out_ops);
> nf_unregister_hook(&ip_conntrack_out_ops);
> nf_unregister_hook(&ip_conntrack_local_in_ops);
> /* Force synchronization */
> br_write_lock_irq(BR_NETPROTO_LOCK);
> br_write_unlock_irq(BR_NETPROTO_LOCK);

nf_unregister_hook() already takes BR_NETPROTO_LOCK?

> if (proc_net_destroy(proc) == -EBUSY)
> return -EBUSY;

That goes into the exit routine and a proc_net_unlink() should be here.

> return 0;
> }
>
> Now, we put callback pointers inside packets, so we need to keep a
> count of how many of those we have (packet_count):
>
> int exit_ip_conntrack(void)
> {
> if (!bigref_is_zero(&packet_count))
> return -EBUSY;
>
> /* Woohoo! We're free, clean up! */
> ip_conntrack_cleanup();
> return 0;
> }

It's pretty simple to join the stop/exit functions into single functions.
I think it's better if the destroy function also does an unlink if
necessary and the start/stop functions should/can be introduced later.
BTW another way to avoid the packet count would to examine every active
packet and check for pointers to the module.

> int usecount_ip_conntrack(void)
> {
> return atomic_read(&proc->use)
> + bigref_approx_val(&packet_count);
> }

The usecount doesn't has to be exact, basically you could even return 0,
then you just have to prevent automatic removal somehow. The usecount is
mostly for the user and IMO should have a meaningful value for the user.

> > Why would I want to reinit the module after a failed exit? As long as
>
> But, this is what you seem to do in try_unregister_module:
>
> if (test_bit(MOD_INITIALIZED, &mod->flags)) {
> res = exit_module(mod);
> if (res) {
> start_module(mod);
> goto out;
> }
> }

Um, sorry if that code caused confusion, that code will change. The
start/stop will be introduced later. If the exit fails the module must
stay in the cleanup state and the next cleanup attempt should only be done
on explicit user request.
Maybe I really should update the patch. :) I'm not sure if the problem is
really that urgent that it can't wait for 2.7. On the other hand I could
produce a patch that preserves complete backward compability, but already
introduces the new infrastructure.

bye, Roman


2002-09-24 14:59:14

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209231954040.338-100000@serv> you write:
> Hi,
>
> On Mon, 23 Sep 2002, Rusty Russell wrote:
>
> > Yeah, I realized this after I sent the mail. Sorry. The problem is
> > one level higher in your implementation:
> >
> > error = start_module(mod);
> > if (error)
> > exit_module(mod);
> >
> > If the exit fails here (because the module is now in use, so the
> > unregister_notifier() fails, you need to loop and yield, because you
> > have no way of saying "wake me when you can be unloaded".
>
> That's not the job of the kernel, the module will stay in the cleanup
> stage, until the the user requests module removal. This is what actually
> happens above:
>
> mod->state = initializing;
> err = mod->init();
> if (err) {
> mod->state = cleanup;
> if (!mod->exit())
> mod->state = exit;
> } else
> mod->state = running;
>
> As long as the module is in the cleanup stage you can only call exit().

Ah, OK, I missed the bit test around the unregister_module(). You
leave it half-loaded. Sorry for being so obtuse.

> > > err = unregister_filesystem(&some_notifier2);
> > > if (err)
> > > return err;
> >
> > This will trigger (err == -ENOENT) I assume? I think you really need:
>
> Look at the unregister example in the last mail. It will suceed silently.

Does that really seem consistent? Unregister functions don't fail if
you're not registered, but fail if it can't unregister you because
you're busy?

> > Hmm, at least if we force the module author to provide two functions,
> > they don't have to have a clear understanding of the subtleties. I
> > know, I'm a coward.
>
> This shouldn't be the standard method, so any module author using such
> hooks should be aware of it's problems, so I don't see a problem here.

I still worry that they don't *know* when they need to do something
special. This is true of almost any implementation.

> > I'm not clear on the exact desired semantics of stop and exit? When
> > should stop() fail?
>
> stop() would just unregister the interface and is not really expected to
> fail. On the other it's also possible to do an implicit stop within the
> exit call.

I guess it would be clearer (to me at least) if it returned void then.

> > int stop_ip_conntrack(void)
> > {
> > nf_unregister_hook(&ip_conntrack_in_ops);
> > nf_unregister_hook(&ip_conntrack_local_out_ops);
> > nf_unregister_hook(&ip_conntrack_out_ops);
> > nf_unregister_hook(&ip_conntrack_local_in_ops);
> > /* Force synchronization */
> > br_write_lock_irq(BR_NETPROTO_LOCK);
> > br_write_unlock_irq(BR_NETPROTO_LOCK);
>
> nf_unregister_hook() already takes BR_NETPROTO_LOCK?

It does indeed. The synchronize step is not required here (it is in
the real code, which uses also uses an "ip_ct_attach" ptr).

> > if (proc_net_destroy(proc) == -EBUSY)
> > return -EBUSY;
>
> That goes into the exit routine and a proc_net_unlink() should be here.

Right, ok.

> > return 0;
> > }
> >
> > Now, we put callback pointers inside packets, so we need to keep a
> > count of how many of those we have (packet_count):
> >
> > int exit_ip_conntrack(void)
> > {
> > if (!bigref_is_zero(&packet_count))
> > return -EBUSY;
> >
> > /* Woohoo! We're free, clean up! */
> > ip_conntrack_cleanup();
> > return 0;
> > }
>
> It's pretty simple to join the stop/exit functions into single functions.
> I think it's better if the destroy function also does an unlink if
> necessary and the start/stop functions should/can be introduced later.
> BTW another way to avoid the packet count would to examine every active
> packet and check for pointers to the module.

We don't keep a list of packets anywhere useful: they're spread
throughout the system (well, not quite true, we do use a slab cache
8).

> Um, sorry if that code caused confusion, that code will change. The
> start/stop will be introduced later. If the exit fails the module must
> stay in the cleanup state and the next cleanup attempt should only be done
> on explicit user request.

Oh, OK.

> Maybe I really should update the patch. :) I'm not sure if the problem is
> really that urgent that it can't wait for 2.7. On the other hand I could
> produce a patch that preserves complete backward compability, but already
> introduces the new infrastructure.

Roman, I'm determined to put the in-kernel module linker in 2.5, and
these races will be solved in some way. It's too much obviously the
right thing, especially now I've implemented modversions, parameters
and licensing stuff on top of it. Now the implementations can change
between kernel versions without any pain, even if you don't agree that
it's markedly simpler and makes initramdisk feasible.

In summary, your approach requires (I hope I am being fair?):

1) Registration interfaces (where callbacks can sleep) must keep track
of reference counts of current users.

2) They must also implement two-stage delete (ie. proc_net_unlink()).

3) Their de-registration interfaces must fail if they are in use (and
for no other reason, including an uninitialized never-registered
object).

4) Module initialization should remove their cleanup paths, and ensure
that the exit function can be called if initialization fails
partway.

5) Modules which use unsafe interfaces must be very careful.

Now, my approach requires:
1) Registration interfaces (where callbacks can sleep) must keep track
of reference counts of current users through try_module_get(foo->owner),
and handle it failing.

2) Modules which use unsafe interfaces must be very careful.

See why I like the "extend try_inc_mod_count()" solution? It's not
perfect, but it's already there and it's simple.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-25 00:41:48

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Wed, 25 Sep 2002, Rusty Russell wrote:

> > Look at the unregister example in the last mail. It will suceed silently.
>
> Does that really seem consistent? Unregister functions don't fail if
> you're not registered, but fail if it can't unregister you because
> you're busy?

Unregister functions fail, if the resource is busy, if nothing is busy,
they'll suceed. It's like kfree() which silently accepts NULL pointers.

> > This shouldn't be the standard method, so any module author using such
> > hooks should be aware of it's problems, so I don't see a problem here.
>
> I still worry that they don't *know* when they need to do something
> special. This is true of almost any implementation.

As I said already earlier, anyone _exporting_ interfaces has to be aware
of the problems and document its usage.

> Roman, I'm determined to put the in-kernel module linker in 2.5, and
> these races will be solved in some way. It's too much obviously the
> right thing, especially now I've implemented modversions, parameters
> and licensing stuff on top of it. Now the implementations can change
> between kernel versions without any pain, even if you don't agree that
> it's markedly simpler and makes initramdisk feasible.

Will the usage of an initramdisk be a requirement for 2.6? Otherwise I
don't see a reason for such hurry. I don't know of Linus plans, but I
don't think it's ready yet. So far we have only lots of little pieces,
that still needs to be put together to a consistent system and not rushing
it sounds like a good idea to me.

> In summary, your approach requires (I hope I am being fair?):
>
> 1) Registration interfaces (where callbacks can sleep) must keep track
> of reference counts of current users.

Rusty, read my lips: They only have to know whether they are busy. It's a
_Bool not an atomic_t, e.g. this would work too:

if (!list_empty(users))
return -EBUSY;

> 2) They must also implement two-stage delete (ie. proc_net_unlink()).

That's a "can" requirement.

> See why I like the "extend try_inc_mod_count()" solution? It's not
> perfect, but it's already there and it's simple.

I explained already earlier why try_inc_mod_count() is a bad interface and
renaming it doesn't change much. The consequences of its usage are not
simple and its existence is not a good argument (at least not a technical
one).
My approach offers a gradual switch to a much more flexible interface,
which doesn't force drivers to use only module approved synchronization
methods. I keep most of the complexity (the module loading/relocation) in
user space, that means my patch has far less impact on the kernel. My
patch doesn't require new tools, but allows for an userspace insmod of
similiar complexity as your kernel loader. Making module removal a config
option is a really bad idea, either it works or it doesn't.
Can we please judge the approaches on a technical level and forget for a
moment the impending code freeze?

bye, Roman

2002-09-25 05:46:37

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209241739460.338-100000@serv> you write:
> Will the usage of an initramdisk be a requirement for 2.6? Otherwise I
> don't see a reason for such hurry. I don't know of Linus plans, but

That was the plan as far as I knew, hence recent klibc development.

> > 1) Registration interfaces (where callbacks can sleep) must keep track
> > of reference counts of current users.
>
> Rusty, read my lips: They only have to know whether they are busy. It's a
> _Bool not an atomic_t, e.g. this would work too:
>
> if (!list_empty(users))
> return -EBUSY;

Yes, same thing. Either way, register_xxx has a new requirement it
didn't have before, to keep track of the number of users.

> > 2) They must also implement two-stage delete (ie. proc_net_unlink()).
>
> That's a "can" requirement.

You didn't show me how to get rid of my ip_conntrack problem without
it in the previous mail.

That's a *real* example we have today. My code fixes it without
requiring a split of such interfaces.

> > See why I like the "extend try_inc_mod_count()" solution? It's not
> > perfect, but it's already there and it's simple.
>
> I explained already earlier why try_inc_mod_count() is a bad interface and
> renaming it doesn't change much. The consequences of its usage are not
> simple and its existence is not a good argument (at least not a technical
> one).

It's well understood, and already more widespread than any other real
solution. All other approaches being equal, it wins.

> My approach offers a gradual switch to a much more flexible interface,

No, *every* unregister function needs to change. What's gradual about
that?

> which doesn't force drivers to use only module approved synchronization
> methods. I keep most of the complexity (the module loading/relocation) in
> user space, that means my patch has far less impact on the kernel.

Look, you're wrong. Keeping it in userspace is complex; I have lines
of code and bytecounts for you to read. You can argue as much as you
like but I've been there, and you're simply not listening.

Moving this into the kernel has proven to be smaller, more flexible,
makes small initramdisks possible *and* makes the whole thing vastly
simpler. What part of this don't you understand?

> My patch doesn't require new tools, but allows for an userspace
> insmod of similiar complexity as your kernel loader.

I challenge you to write a 64-bit kernel linker in a 32-bit userspace
in the codesize of my in-kernel linker. I've tried it, and it's *not*
easy, and when you pile on all the extra kernel interfaces to support
it don't quite work, you end up *nowhere near*.

> Making module removal a config option is a really bad idea, either
> it works or it doesn't.

Maybe, but it gives an indication of how much it costs us to have this
feature, which is why I've kept it through development. I blame
CONFIG_PREEMPT for getting me into bad habits. 8)

> Can we please judge the approaches on a technical level and forget for a
> moment the impending code freeze?

I started implementing module race solutions back in early 2001. This
is not a rushed decision, sorry.

This is more heat than light now 8(
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-25 11:31:52

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Wed, 25 Sep 2002, Rusty Russell wrote:

> In message <Pine.LNX.4.44.0209241739460.338-100000@serv> you write:
> > Will the usage of an initramdisk be a requirement for 2.6? Otherwise I
> > don't see a reason for such hurry. I don't know of Linus plans, but
>
> That was the plan as far as I knew, hence recent klibc development.

Why this hurry? What is so badly broken, that it must be fixed before 2.6?

> > > 1) Registration interfaces (where callbacks can sleep) must keep track
> > > of reference counts of current users.
> >
> > Rusty, read my lips: They only have to know whether they are busy. It's a
> > _Bool not an atomic_t, e.g. this would work too:
> >
> > if (!list_empty(users))
> > return -EBUSY;
>
> Yes, same thing. Either way, register_xxx has a new requirement it
> didn't have before, to keep track of the number of users.

It's not a new requirement, before it had to use try_inc_mod_count() for
this.

> > > 2) They must also implement two-stage delete (ie. proc_net_unlink()).
> >
> > That's a "can" requirement.
>
> You didn't show me how to get rid of my ip_conntrack problem without
> it in the previous mail.
>
> That's a *real* example we have today. My code fixes it without
> requiring a split of such interfaces.

Where is the problem? You can use your bigref like you already
demonstrated and proc_net_destroy() also does a call to proc_net_unlink().

> > > See why I like the "extend try_inc_mod_count()" solution? It's not
> > > perfect, but it's already there and it's simple.
> >
> > I explained already earlier why try_inc_mod_count() is a bad interface and
> > renaming it doesn't change much. The consequences of its usage are not
> > simple and its existence is not a good argument (at least not a technical
> > one).
>
> It's well understood, and already more widespread than any other real
> solution. All other approaches being equal, it wins.

I don't get this, sorry.

> > My approach offers a gradual switch to a much more flexible interface,
>
> No, *every* unregister function needs to change. What's gradual about
> that?

It can be done gradually? The interfaces can be changed step by step.
Interfaces which already use try_inc_mod_count() can be converted
trivially as demonstrated by my patch.

> Moving this into the kernel has proven to be smaller, more flexible,
> makes small initramdisks possible *and* makes the whole thing vastly
> simpler. What part of this don't you understand?

So your argument is to move everything into the kernel to make the
initramdisk simpler?

> > My patch doesn't require new tools, but allows for an userspace
> > insmod of similiar complexity as your kernel loader.
>
> I challenge you to write a 64-bit kernel linker in a 32-bit userspace
> in the codesize of my in-kernel linker. I've tried it, and it's *not*
> easy, and when you pile on all the extra kernel interfaces to support
> it don't quite work, you end up *nowhere near*.

Rusty, have you understood, what my new module layout is all about? So far
I haven't heard a single argument for or against it. I haven't explained
it yet in detail, but I didn't got any questions about it either.
If you understood it, it will be certainly no problem for you to show me,
where my claim is flawed.

> > Making module removal a config option is a really bad idea, either
> > it works or it doesn't.
>
> Maybe, but it gives an indication of how much it costs us to have this
> feature, which is why I've kept it through development.

If we continue to build on bad interfaces, it will certainly cost us a
lot.

> > Can we please judge the approaches on a technical level and forget for a
> > moment the impending code freeze?
>
> I started implementing module race solutions back in early 2001. This
> is not a rushed decision, sorry.

What does this prove?
You're trying to get a monster patch into the kernel, which is mostly only
tested by you and breaks most architectures without good (urgent) reason.
Sorry, that I can't really feel comfortable with this.

bye, Roman

2002-09-25 12:50:41

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209251147250.338-100000@serv> you write:
> Rusty, have you understood, what my new module layout is all about?

Not a clue.

> If you understood it, it will be certainly no problem for you to show me,
> where my claim is flawed.

OK, assume all (sane) designs have some part inside the kernel, and
some part outside. My version almost all kernel: "here is a module and
some args".

Once you move the linking outside the kernel, you need to communicate
more information. You need some form of "allocate", and "here is all
the other symbol information", then "please tell me what you used so I
can update the reference counts" and "place linked module". So you've
added some complexity to deal with synchronization of these acts with
a userspace process.

Now, say your architecture decided that it wanted to try to allocate
its modules: it wants to allocate one part for init and one for core
(so the init can be easily discarded), but if they're not close
enough, it'll give up and allocate one big one and not discard init.

But half if it is in userspace, so you have to support both in the
kernel and both in userspace while you are in transition. Or, say the
kernel slightly changes the way it parses boot paramters and you
wanted the module to match? Or you wanted to change the version
encoding? Or something else I don't know about yet?

Let's look at what we can expect to remove from the kernel by moving
the linking stage out. Probably the most complex architecture to link
is ia64. And that linker is 507 lines (approx, it needs to be updated
to the latest patch). x86 is 130 lines. Add maybe 200 lines of
arch-independent code to help.

It it *now* clear why I'm not interested in saving a few hundred lines
of kernel code, even if the communication overhead didn't eat them up
again?

Hope this makes my point clear,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-25 21:23:43

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Wed, 25 Sep 2002, Rusty Russell wrote:

> Once you move the linking outside the kernel, you need to communicate
> more information. You need some form of "allocate", and "here is all
> the other symbol information", then "please tell me what you used so I
> can update the reference counts" and "place linked module". So you've
> added some complexity to deal with synchronization of these acts with
> a userspace process.

With the new module layout this will be avoided as much as possible. This
information is already mostly generated during compile time and only
extracted by insmod. So to avoid this we only need to store the
information in a way the kernel can find it itself without insmod help.
Now let's look at the new module interface:

DEFINE_MODULE
.init = init_affs_fs,
.exit = exit_affs_fs,
DEFINE_MODULE_END

DEFINE_MODULE is used to define a structure, which looks like this:

struct module __this_module = {
.ex_table_start = __ex_table_start,
.ex_table_end = __ex_table_end,
.syms_start = __syms_start,
.syms_end = __syms_end,

Now we can use a linker script like this:

SECTIONS
{
.module : { *(.module_struct) }
.text : { *(.text) }
.data : { *(.data) }
.syms : {
__syms_start = .;
*(__ksymtab)
__syms_end = .;
}
.ex_table : {
__ex_table_start = .;
*(__ex_table)
__ex_table_end = .;
}
.bss : { *(.bss) }
}

which can be used to add all the information usually added by insmod. So
ld does already most of the work and insmod only needs to do the
relocation and the kernel finds at the start of the module all
information it needs. The only information which is left to be added are
the module arguments.

> Now, say your architecture decided that it wanted to try to allocate
> its modules: it wants to allocate one part for init and one for core
> (so the init can be easily discarded), but if they're not close
> enough, it'll give up and allocate one big one and not discard init.
>
> But half if it is in userspace, so you have to support both in the
> kernel and both in userspace while you are in transition. Or, say the
> kernel slightly changes the way it parses boot paramters and you
> wanted the module to match? Or you wanted to change the version
> encoding? Or something else I don't know about yet?

A possible solution would be to do something what I'm planning to do for
kernel conf - export it as library, then external tools can link the
module linker in dynamically.

> Let's look at what we can expect to remove from the kernel by moving
> the linking stage out. Probably the most complex architecture to link
> is ia64. And that linker is 507 lines (approx, it needs to be updated
> to the latest patch). x86 is 130 lines. Add maybe 200 lines of
> arch-independent code to help.
>
> It it *now* clear why I'm not interested in saving a few hundred lines
> of kernel code, even if the communication overhead didn't eat them up
> again?

It's not only this. A kernel linker would force us to keep all symbol
information in kernel space. Doing the linking in userspace would allow us
to remove all symbol information from the kernel (and only keep them for
debugging, e.g. kksymoops), even module dependency tracking could be done
completely from userspace. Module related information could be reduced to
an absolute minimum, so that a nonmodular kernel had no advantage to a
modular kernel anymore.

bye, Roman

2002-09-26 01:53:44

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209252003370.338-100000@serv> you write:
> A possible solution would be to do something what I'm planning to do for
> kernel conf - export it as library, then external tools can link the
> module linker in dynamically.

That makes sense for kernel conf, but less sense to try to find the
library matching the kernel that you're running on now. We have
enough trouble finding the right modules 8) So you need a mechanism to
check that this is the right library for the kernel.

> > It it *now* clear why I'm not interested in saving a few hundred lines
> > of kernel code, even if the communication overhead didn't eat them up
> > again?
>
> It's not only this. A kernel linker would force us to keep all symbol
> information in kernel space.

No, just the exported symbols (about 40k on my kernel). We do
currently keep them uncompressed, which could be fixed (and kind of
shows how few people care about kernel bloat 8( )

If you load an initial ramdisk, which loads modules, then you need to
leave the initial ramdisk loaded so you can load more modules later
(for the symbols). Of if you assume the modules inside the initrd are
the same as the ones a later insmod sees, all you need is "what
symbols are loaded where?".

So:
o A interface to list of what modules are loaded and where.
o An interface to allocate some space in the kernel.
o An interface to put something into that space and call a given
function.

Now, you still have the "insmod gets killed" problem, so you really
want to have your "allocate" interface detect this. Easy: hand out a
file descriptor, so the module code gets called back when its closed,
and frees the memory.

Each interface requires two pieces of code, one inside the kernel, and
one in userspace. In this case, the creation of the interface is not
for compatibility reasons (in fact, it's such a problem that you need
matching code versions on both sides!), but just to try to keep some
code outside the kernel.

> Doing the linking in userspace would allow us
> to remove all symbol information from the kernel (and only keep them for
> debugging, e.g. kksymoops), even module dependency tracking could be done
> completely from userspace. Module related information could be reduced to
> an absolute minimum, so that a nonmodular kernel had no advantage to a
> modular kernel anymore.

It'll *always* be slightly smaller and slightly faster, unless you
don't optimize the CONFIG_MODULES=n case.

We're still talking about cutting a few hundred lines of code, and
tens of K, from the kernel in return for increasing complexity.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-26 23:33:42

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Thu, 26 Sep 2002, Rusty Russell wrote:

Hmm, no comments about the proposed module layout? I suppose you believe
me now, that a small and simple userspace insmod is possible. :)

> > A possible solution would be to do something what I'm planning to do for
> > kernel conf - export it as library, then external tools can link the
> > module linker in dynamically.
>
> That makes sense for kernel conf, but less sense to try to find the
> library matching the kernel that you're running on now. We have
> enough trouble finding the right modules 8) So you need a mechanism to
> check that this is the right library for the kernel.

You need to check that the module matches the kernel as well, so same
problem and possibly the same solution as well.

> If you load an initial ramdisk, which loads modules, then you need to
> leave the initial ramdisk loaded so you can load more modules later
> (for the symbols). Of if you assume the modules inside the initrd are
> the same as the ones a later insmod sees, all you need is "what
> symbols are loaded where?".
>
> So:
> o A interface to list of what modules are loaded and where.
> o An interface to allocate some space in the kernel.
> o An interface to put something into that space and call a given
> function.

Transfering data from the initramdisk to the final root shouldn't really
be a problem. Other tools might need that too.

> Now, you still have the "insmod gets killed" problem, so you really
> want to have your "allocate" interface detect this. Easy: hand out a
> file descriptor, so the module code gets called back when its closed,
> and frees the memory.

That's a userspace problem that is also solvable in userspace. A module
which is only allocated can also be removed like a normal module.
A better module interface to the kernel is desirable, but not a really
urgent yet.

> Each interface requires two pieces of code, one inside the kernel, and
> one in userspace. In this case, the creation of the interface is not
> for compatibility reasons (in fact, it's such a problem that you need
> matching code versions on both sides!), but just to try to keep some
> code outside the kernel.

What is new about this? Module loading is hardly the only kernel interface
which requires matching user space tools. Do you plan to change the module
interface every other release?

> > Doing the linking in userspace would allow us
> > to remove all symbol information from the kernel (and only keep them for
> > debugging, e.g. kksymoops), even module dependency tracking could be done
> > completely from userspace. Module related information could be reduced to
> > an absolute minimum, so that a nonmodular kernel had no advantage to a
> > modular kernel anymore.
>
> It'll *always* be slightly smaller and slightly faster, unless you
> don't optimize the CONFIG_MODULES=n case.

The speed difference can be fixed for most modules and the additional
memory overhead could be almost negligible.
Actually we do too little in the nonmodule case, for example if a module
fails the kernel will happily initialize any module which depends on it.

> We're still talking about cutting a few hundred lines of code, and
> tens of K, from the kernel in return for increasing complexity.

I rather keep the complexity in userspace than forcing it into the kernel,
where here the complexity is not bad and gives us much more flexibility.
I'm not completely opposed to a kernel, but completely disabling the user
space loader is IMO not acceptable (at least not until the new loader went
through one stable release).

bye, Roman

2002-09-27 01:06:39

by Scott Murray

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Fri, 27 Sep 2002, Roman Zippel wrote:

> Hi,
>
> On Thu, 26 Sep 2002, Rusty Russell wrote:
>
> Hmm, no comments about the proposed module layout? I suppose you believe
> me now, that a small and simple userspace insmod is possible. :)

You'll be waiting a while for a response:

On Wed, 25 Sep 2002, Rusty Russell wrote:

> Hi Linus,
>
> I leave in about 28 hours for my wedding + honeymoon, back 14
> October. So now is a good time to apply the module replacement
> patches. They survive stress testing here.
[snip]

Scott


--
Scott Murray
SOMA Networks, Inc.
Toronto, Ontario
e-mail: [email protected]

2002-09-27 01:29:48

by Roman Zippel

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

Hi,

On Thu, 26 Sep 2002, Scott Murray wrote:

> > Hmm, no comments about the proposed module layout? I suppose you believe
> > me now, that a small and simple userspace insmod is possible. :)
>
> You'll be waiting a while for a response:

Well, it seems that's a chance for someone else to jump into the
discussion. :)
Rusty's patch will have quite some impact on the kernel, so that I'm
wondering, why there are not more comments about it.

bye, Roman

2002-09-28 00:51:32

by David Lang

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

I will say that anything that requires more after-the-fact matching of
files on disk to the running system strikes me as a very bad idea.

I have had to do to many really ugly things with boot disks and old system
images just to keep things running at different times, I don't want to
make ANY assumptions that the currently running kernel+modules match
anything on the system.

I'm also the type that wants to compile monolithic kernels if possible to
eliminate any posibility of module mismatches so this is something that
I'll probably only use on my laptop so take it for what little it's worth
:-)

I've also been burned a few times by old utilities on a system when
installing a new kernel (I had a system running for a couple years where
ifconfig reported tx packets in the tx errors slot after an upgrade) and
anything that can reduce the possiblity of this is good.

David Lang

On Fri, 27 Sep 2002, Roman Zippel wrote:

> Date: Fri, 27 Sep 2002 03:34:56 +0200 (CEST)
> From: Roman Zippel <[email protected]>
> To: Scott Murray <[email protected]>
> Cc: Rusty Russell <[email protected]>, [email protected]
> Subject: Re: [PATCH] In-kernel module loader 1/7
>
> Hi,
>
> On Thu, 26 Sep 2002, Scott Murray wrote:
>
> > > Hmm, no comments about the proposed module layout? I suppose you believe
> > > me now, that a small and simple userspace insmod is possible. :)
> >
> > You'll be waiting a while for a response:
>
> Well, it seems that's a chance for someone else to jump into the
> discussion. :)
> Rusty's patch will have quite some impact on the kernel, so that I'm
> wondering, why there are not more comments about it.
>
> bye, Roman
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2002-09-30 15:27:56

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Thursday 19 September 2002 22:11, Greg KH wrote:
> On Thu, Sep 19, 2002 at 07:58:15PM +0100, Alan Cox wrote:
> > On Thu, 2002-09-19 at 19:38, Greg KH wrote:
> > > And with a LSM module, how can it answer that? There's no way, unless
> > > we count every time someone calls into our module. And if you do that,
> > > no one will even want to use your module, given the number of hooks, and
> > > the paths those hooks are on (the speed hit would be horrible.)
> >
> > So the LSM module always says no. Don't make other modules suffer
>
> Ok, I don't have a problem with that, I was just trying to point out
> that not all modules can know when they are able to be unloaded, as
> Roman stated.

Not being able to unload LSM would suck enormously. At last count, we
knew how to do this:

1) Unhook the function hooks (using a call table simplifies this)
2) Schedule on each CPU to ensure all tasks are out of the module
3) A schedule where the module count is incremented doesn't count

and we rely on the rule that and module code that could sleep must be
bracketed by inc/dec of the module count.

Did somebody come up with a reason why this will not work?

--
Daniel

2002-10-03 23:47:43

by Rob Landley

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Monday 30 September 2002 11:32 am, Daniel Phillips wrote:

> Not being able to unload LSM would suck enormously. At last count, we
> knew how to do this:
>
> 1) Unhook the function hooks (using a call table simplifies this)
> 2) Schedule on each CPU to ensure all tasks are out of the module
> 3) A schedule where the module count is incremented doesn't count
>
> and we rely on the rule that and module code that could sleep must be
> bracketed by inc/dec of the module count.
>
> Did somebody come up with a reason why this will not work?

Preemption?

Scheduling doesn't guarantee making any specific amount of progress within
the kernel with preemption enabled. I thought the preferred strategy was to
wait for the time slices to refill and then exhaust (since everybody has to
exhaust their time slices before anybody gets new ones. Unless I've missed
something...?)

Rob

2002-10-04 00:05:49

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Thursday 03 October 2002 20:53, Rob Landley wrote:
> On Monday 30 September 2002 11:32 am, Daniel Phillips wrote:
>
> > Not being able to unload LSM would suck enormously. At last count, we
> > knew how to do this:
> >
> > 1) Unhook the function hooks (using a call table simplifies this)
> > 2) Schedule on each CPU to ensure all tasks are out of the module
> > 3) A schedule where the module count is incremented doesn't count
> >
> > and we rely on the rule that and module code that could sleep must be
> > bracketed by inc/dec of the module count.
> >
> > Did somebody come up with a reason why this will not work?
>
> Preemption?

Turn it off.

> Scheduling doesn't guarantee making any specific amount of progress within
> the kernel with preemption enabled. I thought the preferred strategy was to
> wait for the time slices to refill and then exhaust (since everybody has to
> exhaust their time slices before anybody gets new ones. Unless I've missed
> something...?)

Probably ;-)

--
Daniel

2002-10-15 05:22:06

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <E17w2XF-0005oW-00@starship> you write:
> On Thursday 19 September 2002 22:11, Greg KH wrote:
> > On Thu, Sep 19, 2002 at 07:58:15PM +0100, Alan Cox wrote:
> > > On Thu, 2002-09-19 at 19:38, Greg KH wrote:
> > > > And with a LSM module, how can it answer that? There's no way, unless
> > > > we count every time someone calls into our module. And if you do that,
> > > > no one will even want to use your module, given the number of hooks, an
d
> > > > the paths those hooks are on (the speed hit would be horrible.)
> > >
> > > So the LSM module always says no. Don't make other modules suffer
> >
> > Ok, I don't have a problem with that, I was just trying to point out
> > that not all modules can know when they are able to be unloaded, as
> > Roman stated.
>
> Not being able to unload LSM would suck enormously. At last count, we
> knew how to do this:
>
> 1) Unhook the function hooks (using a call table simplifies this)
> 2) Schedule on each CPU to ensure all tasks are out of the module
> 3) A schedule where the module count is incremented doesn't count
>
> and we rely on the rule that and module code that could sleep must be
> bracketed by inc/dec of the module count.
>
> Did somebody come up with a reason why this will not work?

It won't quite work if the hooks can sleep. You can say "don't sleep"
or have a wedge which does the "try_inc_mod_count()" then calls into
the module (and returns some default if it can't inc the module count).

You can't disable preemption before calling in, because there is no
way to sleep with preemption disabled. 8(

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-15 05:21:03

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <Pine.LNX.4.44.0209261845420.338-100000@serv> you write:
> I'm not completely opposed to a kernel, but completely disabling the user
> space loader is IMO not acceptable (at least not until the new loader went
> through one stable release).

Huh? We're talking about 50 lines of code for insmod (modprobe is
harder, yes). But risking this instability is what unstable kernels
are *for*!

You're still talking about forcing us to maintain a *very intimate*
interface in order to remove 200 lines from the kernel. Ask Keith the
troubles of keeping modutils in sync with the kernel. I've been
following it for about 18 months, and I haven't envied the (excellent)
job he had to do.

200 lines! 8)
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-15 15:22:25

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Tuesday 15 October 2002 05:25, Rusty Russell wrote:
> In message <E17w2XF-0005oW-00@starship> you write:
> > Not being able to unload LSM would suck enormously. At last count, we
> > knew how to do this:
> >
> > 1) Unhook the function hooks (using a call table simplifies this)
> > 2) Schedule on each CPU to ensure all tasks are out of the module
> > 3) A schedule where the module count is incremented doesn't count
> >
> > and we rely on the rule that and module code that could sleep must be
> > bracketed by inc/dec of the module count.
> >
> > Did somebody come up with a reason why this will not work?
>
> It won't quite work if the hooks can sleep. You can say "don't sleep"
> or have a wedge which does the "try_inc_mod_count()" then calls into
> the module (and returns some default if it can't inc the module count).

Right. By coincidence, I found myself thinking about this very problem
as I re-materialized this morning. If TRY_INC_MOD_COUNT also ors a flag
(which it does now, for other reasons) then:

1) Clear the mod_inc flag
2) Unhook the function hooks
3) Schedule on each CPU
4) If the mod_inc flag is set, repeat from (1)

This should perform acceptably well, and would only be done in cases
where the existing TRY_INC_MOD_COUNT strategy can't be used.

> You can't disable preemption before calling in, because there is no
> way to sleep with preemption disabled. 8(

Why is that harder than bumping a counter that makes preempt_schedule
return without doing anything?

--
Daniel

2002-10-16 02:13:55

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <E181Tcc-0003k0-00@starship> you write:
> On Tuesday 15 October 2002 05:25, Rusty Russell wrote:
> > It won't quite work if the hooks can sleep. You can say "don't sleep"
> > or have a wedge which does the "try_inc_mod_count()" then calls into
> > the module (and returns some default if it can't inc the module count).
>
> Right. By coincidence, I found myself thinking about this very problem
> as I re-materialized this morning. If TRY_INC_MOD_COUNT also ors a flag
> (which it does now, for other reasons) then:
>
> 1) Clear the mod_inc flag
> 2) Unhook the function hooks
> 3) Schedule on each CPU
> 4) If the mod_inc flag is set, repeat from (1)
>
> This should perform acceptably well, and would only be done in cases
> where the existing TRY_INC_MOD_COUNT strategy can't be used.

This is basically the same technique used in my current patch. We set
module->live = 0, sychronize_kernel(), then look at reference count.
In this case, instead of setting a flag, try_inc_mod_count (aka
try_module_get()) bumps the refcount, to similar effect to the flag.

> > You can't disable preemption before calling in, because there is no
> > way to sleep with preemption disabled. 8(
>
> Why is that harder than bumping a counter that makes preempt_schedule
> return without doing anything?

Definitely. We could simply allow schedule() to be called when
preempt is disabled, but it's a useful debugging tool to not do that.
And, of course, disabling preemption widely kind of defeats the point
of having a preemptive kernel 8(

I really wish the security guys had gone down the macro path, with
something like

#define security_check(func, default_val, ...)
({ if (try_inc_mod_count(security_ops->owner))
security_ops->func(__VA_ARGS__);
else
default_val;
})

This also allows the whole thing to vanish if
CONFIG_SECURITY_CAPABILITIES=n, and allows more flexibility for
schemes like "always run with preemption disabled around security ops"
or whatever, rather than having to search for all the references to
security_ops.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-16 02:52:46

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Wednesday 16 October 2002 01:53, Rusty Russell wrote:
> In message <E181Tcc-0003k0-00@starship> you write:
> > On Tuesday 15 October 2002 05:25, Rusty Russell wrote:
> > > It won't quite work if the hooks can sleep. You can say "don't sleep"
> > > or have a wedge which does the "try_inc_mod_count()" then calls into
> > > the module (and returns some default if it can't inc the module count).
> >
> > Right. By coincidence, I found myself thinking about this very problem
> > as I re-materialized this morning. If TRY_INC_MOD_COUNT also ors a flag
> > (which it does now, for other reasons) then:
> >
> > 1) Clear the mod_inc flag
> > 2) Unhook the function hooks
> > 3) Schedule on each CPU
> > 4) If the mod_inc flag is set, repeat from (1)
> >
> > This should perform acceptably well, and would only be done in cases
> > where the existing TRY_INC_MOD_COUNT strategy can't be used.
>
> This is basically the same technique used in my current patch. We set
> module->live = 0, sychronize_kernel(), then look at reference count.
> In this case, instead of setting a flag, try_inc_mod_count (aka
> try_module_get()) bumps the refcount, to similar effect to the flag.

Yes, my earlier-posted algorithm was just plain wrong. So that's one
item out of the way.

> > > You can't disable preemption before calling in, because there is no
> > > way to sleep with preemption disabled. 8(
> >
> > Why is that harder than bumping a counter that makes preempt_schedule
> > return without doing anything?
>
> Definitely. We could simply allow schedule() to be called when
> preempt is disabled, but it's a useful debugging tool to not do that.

It doesn't strike me as difficult or costly to accomodate this.

> And, of course, disabling preemption widely kind of defeats the point
> of having a preemptive kernel 8(

It only needs to be turned off when unloading one of the "hard" modules.
This would be an incrementing disable to accodate simultaneous unloads.
During the unload your desktop might get a little bit less interactive,
but that's better than not being able to unload at all.

> I really wish the security guys had gone down the macro path, with
> something like
>
> #define security_check(func, default_val, ...)
> ({ if (try_inc_mod_count(security_ops->owner))
> security_ops->func(__VA_ARGS__);
> else
> default_val;
> })
>
> This also allows the whole thing to vanish if
> CONFIG_SECURITY_CAPABILITIES=n, and allows more flexibility for
> schemes like "always run with preemption disabled around security ops"
> or whatever, rather than having to search for all the references to
> security_ops.

Then everybody would complain about the extra overhead, no matter how
small it is. Conceptually, are there any outstanding issues with "hard"
way of unloading modules, assuming we can use the TRY_INC way[1] for
"easy" modules? One I don't recall being discussed, is the inherent
difficulty of unhooking an interface like LSM, one function at a time.

[1] Or a related way, to be determined via the tag-team mud-wrestling
method.

--
Daniel

2002-10-16 08:18:47

by Chris Wright

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

* Rusty Russell ([email protected]) wrote:
>
> I really wish the security guys had gone down the macro path, with
> something like
>
> #define security_check(func, default_val, ...)
> ({ if (try_inc_mod_count(security_ops->owner))
> security_ops->func(__VA_ARGS__);
> else
> default_val;
> })
>
> This also allows the whole thing to vanish if
> CONFIG_SECURITY_CAPABILITIES=n, and allows more flexibility for
> schemes like "always run with preemption disabled around security ops"
> or whatever, rather than having to search for all the references to
> security_ops.

You may be seeing something like this soon. Greg is working on some
conversions right now. I'm not sure what the final format will be,
but it should make the work above easier. Check out his '[RFC] change
format of LSM hooks' message from a few hours ago.

thanks,
-chris
--
Linux Security Modules http://lsm.immunix.org http://lsm.bkbits.net

2002-10-16 08:12:09

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <E181eOt-00044e-00@starship> you write:
> > Definitely. We could simply allow schedule() to be called when
> > preempt is disabled, but it's a useful debugging tool to not do that.
>
> It doesn't strike me as difficult or costly to accomodate this.

It's possible, of course.

> > And, of course, disabling preemption widely kind of defeats the point
> > of having a preemptive kernel 8(
>
> It only needs to be turned off when unloading one of the "hard" modules.
> This would be an incrementing disable to accodate simultaneous unloads.
> During the unload your desktop might get a little bit less interactive,
> but that's better than not being able to unload at all.

It needs to be turned off when dealing with any interface which might
be used by one of the hard modules. Which is pretty bad.

> > I really wish the security guys had gone down the macro path, with
> > something like
> >
> > #define security_check(func, default_val, ...)
> > ({ if (try_inc_mod_count(security_ops->owner))
> > security_ops->func(__VA_ARGS__);
> > else
> > default_val;
> > })
> Then everybody would complain about the extra overhead, no matter how
> small it is.

These people didn't even whimper when an extra (useless) indirect
function call got added at all these points. It's easy to make this
overhead only exist when a modular security module is in use (the
try_inc_mod_count wedge solution), and the overhead for non-modules is
one branch.

> Conceptually, are there any outstanding issues with "hard"
> way of unloading modules, assuming we can use the TRY_INC way[1] for
> "easy" modules? One I don't recall being discussed, is the inherent
> difficulty of unhooking an interface like LSM, one function at a time.

Without preempt, it was relatively easy (my initial code preceeded
preempt). With preempt, unless you touch the scheduler (I have code
for that too, Ingo doesn't like it anyway), a module can't control its
own reference counts.

Either way, how do you return "Pretend I wasn't here" in general, if
the module is being unloaded? Only the infrastructure knows what to
do.

If a module cannot control its own reference counts, every exported
interface which can sleep needs to do the try_inc_modcount thing, or a
module which uses it cannot be unloaded.

Now, there remains a subtle problem with the try_inc_mod_count
approach in general. It is the "spurious failure" problem, where
eg. a notifier cannot inc the module count, and so does not call the
registered notifier, but the module is still being initialized *OR* is
in the middle of an attempt to remove the module (which fails, and the
module is restored to "life").

Will this be a problem in general? I don't think so, since I couldn't
find an example, but it's possible.

The major advantage of this scheme is the simplicity for module
authors (for the vast majority, no change). Given the complete dogs
breakfast most module authors made of the current module count scheme,
that's a HUGE bonus.

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-16 17:27:25

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

(Warning, this is one of those reply-to-every-point posts, read on if you
have masochistic tendencies.)

On Wednesday 16 October 2002 08:11, Rusty Russell wrote:
> > > And, of course, disabling preemption widely kind of defeats the point
> > > of having a preemptive kernel 8(
> >
> > It only needs to be turned off when unloading one of the "hard" modules.
> > This would be an incrementing disable to accodate simultaneous unloads.
> > During the unload your desktop might get a little bit less interactive,
> > but that's better than not being able to unload at all.
>
> It needs to be turned off when dealing with any interface which might
> be used by one of the hard modules. Which is pretty bad.

What are you thinking about here? (Oh, I see it's elaborated below.)

As far as I can see, preemption only has to be disabled during the
synchronize_kernel phase of unloading that one module, and this requirement
is inherited neither by dependant or depending modules.

> > Conceptually, are there any outstanding issues with "hard"
> > way of unloading modules, assuming we can use the TRY_INC way[1] for
> > "easy" modules? One I don't recall being discussed, is the inherent
> > difficulty of unhooking an interface like LSM, one function at a time.
>
> Without preempt, it was relatively easy (my initial code preceeded
> preempt). With preempt, unless you touch the scheduler (I have code
> for that too, Ingo doesn't like it anyway), a module can't control its
> own reference counts.

The current proposal touches the scheduler only in the slow path of
preempt_schedule, does it not? It's hard to see what the objection to that
would be. By the way, do you when I say "schedule on every cpu", is that
exactly equivalent to your synchronize_kernel?

> Either way, how do you return "Pretend I wasn't here" in general, if
> the module is being unloaded? Only the infrastructure knows what to
> do.
>
> If a module cannot control its own reference counts, every exported
> interface which can sleep needs to do the try_inc_modcount thing, or a
> module which uses it cannot be unloaded.

But this question goes away if we get our preempt-off switch, no?

> Now, there remains a subtle problem with the try_inc_mod_count
> approach in general. It is the "spurious failure" problem, where
> eg. a notifier cannot inc the module count, and so does not call the
> registered notifier, but the module is still being initialized *OR* is
> in the middle of an attempt to remove the module (which fails, and the
> module is restored to "life").

Oh, I'm well aware of that one. This is a subtle design requirement that
just didn't get to the top of the list of things to talk about, because of
the other, competing things to talk about. To me, it's a given that the
interface must not introduce any dead periods as a result of, say, the user
attempting to unload a busy module. For "hard" modules there is no choice:
the first step must necessarily be to remove the function hooks and thus deny
the service the module was providing. For this kind of module, the unload
should either wait until it succeeds or generate a user-visible error if it
cannot. If the "hard" module also exports counting-style interfaces, then
these should all be determined to be in a quiescent state before proceding
with the unhooking.

For pure counting-style modules, it's easy to avoid this problem: the module
is placed in the can't-increment state if and only if the current count is
zero, and from that point on we know the unload will either succeed or fail
with an error.

> Will this be a problem in general? I don't think so, since I couldn't
> find an example, but it's possible.

Since it's easy to avoid, why leave the fuzz there?

Things get more interesting with (the proposed) rmmod -f. In this case
some as-yet-undefined mechanism would try to, say, unmount all the
filesystems using a given module. It's just too hard and too much work to
try do the shy suitor thing here, and determine beforehand if all the
unmounts will be successful. The easy way out is just to say "yes,
rmmod -f may be unsuccessful but still change the system state, deal with
it". An even easier way is just to put rmmod -f on the back burner for
the time being.

> The major advantage of this scheme is the simplicity for module
> authors (for the vast majority, no change). Given the complete dogs
> breakfast most module authors made of the current module count scheme,
> that's a HUGE bonus.

Yes, well, for "easy" modules, all the interfaces that are on the table
or in use are simple. Hard modules require some care on the part of the
author, but we knew that. IMHO, the important thing is to be able to state
the rules clearly.

--
Daniel

2002-10-16 23:02:57

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

In message <E181s3M-0004C7-00@starship> you write:
> (Warning, this is one of those reply-to-every-point posts, read on if you
> have masochistic tendencies.)

Yes, whip me harder.

> On Wednesday 16 October 2002 08:11, Rusty Russell wrote:
> > It needs to be turned off when dealing with any interface which might
> > be used by one of the hard modules. Which is pretty bad.
>
> As far as I can see, preemption only has to be disabled during the
> synchronize_kernel phase of unloading that one module, and this requirement
> is inherited neither by dependant or depending modules.

No, someone could already have been preempted before you start
synchronize_kernel().

> > > Conceptually, are there any outstanding issues with "hard"
> > > way of unloading modules, assuming we can use the TRY_INC way[1] for
> > > "easy" modules? One I don't recall being discussed, is the inherent
> > > difficulty of unhooking an interface like LSM, one function at a time.
> >
> > Without preempt, it was relatively easy (my initial code preceeded
> > preempt). With preempt, unless you touch the scheduler (I have code
> > for that too, Ingo doesn't like it anyway), a module can't control its
> > own reference counts.
>
> The current proposal touches the scheduler only in the slow path of
> preempt_schedule, does it not? It's hard to see what the objection to that
> would be. By the way, do you when I say "schedule on every cpu", is that
> exactly equivalent to your synchronize_kernel?

Yes, my original version does just that. The version in the kernel at
the moment (2.5.43) is functionally equivalent.

> > Either way, how do you return "Pretend I wasn't here" in general, if
> > the module is being unloaded? Only the infrastructure knows what to
> > do.
> >
> > If a module cannot control its own reference counts, every exported
> > interface which can sleep needs to do the try_inc_modcount thing, or a
> > module which uses it cannot be unloaded.
>
> But this question goes away if we get our preempt-off switch, no?

Not really. If the hook disables preemption before calling in, it
might as well simply do try_inc_mod_count() and be done with it.
ie. either way, we have to modify the hook: the module cannot control
its own reference counts.

> > Now, there remains a subtle problem with the try_inc_mod_count
> > approach in general. It is the "spurious failure" problem, where
> > eg. a notifier cannot inc the module count, and so does not call the
> > registered notifier, but the module is still being initialized *OR* is
> > in the middle of an attempt to remove the module (which fails, and the
> > module is restored to "life").
>
> For pure counting-style modules, it's easy to avoid this problem: the module
> is placed in the can't-increment state if and only if the current count is
> zero, and from that point on we know the unload will either succeed or fail
> with an error.

Still a race between the zero check and the can't-increment state
setting. This is what my current code does: rmmod itself checks (if
/proc/modules available), then the kernel sets the module to
can't-increment, then checks again. If the non-blocking flag is set,
it then re-animates the module and fails, otherwise it waits.

BTW, current patchset (2.5.43):

http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Misc/kbuild_object.patch.gz
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/implicit-init-removal.patch.gz
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/everyone-needs-init.patch.gz
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/module.patch.gz
http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/module-i386.patch.gz

> > Will this be a problem in general? I don't think so, since I couldn't
> > find an example, but it's possible.
>
> Since it's easy to avoid, why leave the fuzz there?
>
> Things get more interesting with (the proposed) rmmod -f. In this case
> some as-yet-undefined mechanism would try to, say, unmount all the
> filesystems using a given module. It's just too hard and too much work to
> try do the shy suitor thing here, and determine beforehand if all the
> unmounts will be successful. The easy way out is just to say "yes,
> rmmod -f may be unsuccessful but still change the system state, deal with
> it". An even easier way is just to put rmmod -f on the back burner for
> the time being.

I have two variants of this. First is the "wait" mentioned above, so
the module will eventually be removed (my latest patch, above, allows
^C to interrupt this). The second is the "die-mother-fucker-die"
version, which taints the kernel and just removes the damn thing. For
most people, this is better than a reboot, and will usually "work".

http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/force-unload.patch.gz
[ Doesn't apply currently, needs updating ]

> > The major advantage of this scheme is the simplicity for module
> > authors (for the vast majority, no change). Given the complete dogs
> > breakfast most module authors made of the current module count scheme,
> > that's a HUGE bonus.
>
> Yes, well, for "easy" modules, all the interfaces that are on the table
> or in use are simple. Hard modules require some care on the part of the
> author, but we knew that. IMHO, the important thing is to be able to state
> the rules clearly.

Agreed. I wrote a document to enumerate the conditions. I will
revise it today as I update the rest of my patches to 2.5.43...

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-17 01:50:45

by Daniel Phillips

[permalink] [raw]
Subject: Re: [PATCH] In-kernel module loader 1/7

On Thursday 17 October 2002 00:48, Rusty Russell wrote:
> > On Wednesday 16 October 2002 08:11, Rusty Russell wrote:
> > > It needs to be turned off when dealing with any interface which might
> > > be used by one of the hard modules. Which is pretty bad.
> >
> > As far as I can see, preemption only has to be disabled during the
> > synchronize_kernel phase of unloading that one module, and this requirement
> > is inherited neither by dependant or depending modules.
>
> No, someone could already have been preempted before you start
> synchronize_kernel().

I don't get that. The sequence is:

- turn off preemption
- unhook call points
- synchronize_kernel
- ...

which doesn't leave any preemption hole that I can see, so I can't comment
on a couple of the other points until you clear that one up.

> > > Now, there remains a subtle problem with the try_inc_mod_count
> > > approach in general. It is the "spurious failure" problem, where
> > > eg. a notifier cannot inc the module count, and so does not call the
> > > registered notifier, but the module is still being initialized *OR* is
> > > in the middle of an attempt to remove the module (which fails, and the
> > > module is restored to "life").
> >
> > For pure counting-style modules, it's easy to avoid this problem: the module
> > is placed in the can't-increment state if and only if the current count is
> > zero, and from that point on we know the unload will either succeed or fail
> > with an error.
>
> Still a race between the zero check and the can't-increment state
> setting.

But that one is easy: the zero check just takes the same spinlock as
TRY_INC_MOD_COUNT, then sets can't-increment only in the case the count
is zero, considerably simpler than:

> This is what my current code does: rmmod itself checks (if
> /proc/modules available), then the kernel sets the module to
> can't-increment, then checks again. If the non-blocking flag is set,
> it then re-animates the module and fails, otherwise it waits.

and leaves no window for spurious failure. The still-initializing case is
also easy, e.g., a filesystem module simply doesn't call register_filesystem
until it's completely ready to service calls, so nobody is able to do
TRY_INC_MOD_COUNT.

> BTW, current patchset (2.5.43):

Thanks, I'll read them all on the 21st ;-) The other thing I need to read
closely is Roman's strategy for changing the module format, and the weird
linker connections.

> ...The second is the "die-mother-fucker-die"
> version, which taints the kernel and just removes the damn thing. For
> most people, this is better than a reboot, and will usually "work".

Is there a case where removing a module would actually help? What is
the user going to do next, try to reinsert the same module?

> http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/force-unload.patch.gz
> [ Doesn't apply currently, needs updating ]

ERROR 404: Not Found.

--
Daniel

2002-10-17 07:49:49

by Rusty Russell

[permalink] [raw]
Subject:

In message <E181zuY-0004Fl-00@starship> you write:
> On Thursday 17 October 2002 00:48, Rusty Russell wrote:
> > > On Wednesday 16 October 2002 08:11, Rusty Russell wrote:
> > > > It needs to be turned off when dealing with any interface which might
> > > > be used by one of the hard modules. Which is pretty bad.
> > >
> > > As far as I can see, preemption only has to be disabled during the
> > > synchronize_kernel phase of unloading that one module, and this requireme
nt
> > > is inherited neither by dependant or depending modules.
> >
> > No, someone could already have been preempted before you start
> > synchronize_kernel().
>
> I don't get that. The sequence is:
>
> - turn off preemption
> - unhook call points
> - synchronize_kernel
> - ...
>
> which doesn't leave any preemption hole that I can see, so I can't comment
> on a couple of the other points until you clear that one up.

You mean that "turn off preemption" also wakes up anyone currently
preempted? Otherwise they're preempted just inside one of those call
points.

> > Still a race between the zero check and the can't-increment state
> > setting.
>
> But that one is easy: the zero check just takes the same spinlock as
> TRY_INC_MOD_COUNT, then sets can't-increment only in the case the count
> is zero, considerably simpler than:

The current spinlock is horrible. You could use a brlock, of course,
but I didn't mainly because of code bloat and speed. My current code
looks like:

static inline int try_module_get(struct module *module)
{
int ret = 1;

if (module) {
unsigned int cpu = get_cpu();
if (likely(module->ref[cpu].live))
local_inc(&module->ref[cpu].counter);
else
ret = 0;
put_cpu();
}
return ret;
}

Which is small enough to be inlined quite nicely, and very fast.
Adding br_read_lock_irqsave() starts to get big and slow (at that
point it's more likely we want to move the module case out of line).

> > This is what my current code does: rmmod itself checks (if
> > /proc/modules available), then the kernel sets the module to
> > can't-increment, then checks again. If the non-blocking flag is set,
> > it then re-animates the module and fails, otherwise it waits.
>
> and leaves no window for spurious failure. The still-initializing case is
> also easy, e.g., a filesystem module simply doesn't call register_filesystem
> until it's completely ready to service calls, so nobody is able to do
> TRY_INC_MOD_COUNT.

Consider some code which needs to know when cpus go up and down, so
registers a notifier. If the notifier fires before the init is
finished, the notifier code will fail to "try_inc_mod_count()" and
won't call it (it doesn't do try_inc_mod_count at the moment, but
that's a bug).

I don't know of any code which does this now, but it is at least a
theoretical problem.

> > BTW, current patchset (2.5.43):
>
> Thanks, I'll read them all on the 21st ;-) The other thing I need to read
> closely is Roman's strategy for changing the module format, and the weird
> linker connections.

Roman dislikes linking in the kernel. So did I until I wrote it: it's
really trivial (esp. compared with the code to coordinate with the
userspace linker properly). And it exists today. The linking takes
around 200 lines. But, let's say his solution is 500 lines shorter
than mine.

For those five hundred lines, the new parameter infrastructure and
module versioning changes can be done *without* requiring any changes
in modutils. If you've been following the module changes closely in
the last couple of years, you'll realize what a pain it has been to
introduce changes like licensing, etc. This frees up our hand.

IMHO, the benifits of having it in-kernel outweigh the slight extra
size.

> > ...The second is the "die-mother-fucker-die"
> > version, which taints the kernel and just removes the damn thing. For
> > most people, this is better than a reboot, and will usually "work".
>
> Is there a case where removing a module would actually help? What is
> the user going to do next, try to reinsert the same module?

Insert a fixed one, hopefully 8). I was thinking for kernel
developers, and general robustness (eg. an oops inside a module leaves
its refcount at 1).

> > http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/force-un
load.patch.gz
>
> ERROR 404: Not Found.

Damn my fingers. Updated (now applies on top of the others) but I
haven't tested this version yet (that's what I'm doing now):

http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/forceunload.patch.gz

Cheers!
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-17 14:43:33

by Roman Zippel

[permalink] [raw]
Subject: Re:

Hi,

On Thu, 17 Oct 2002, Rusty Russell wrote:

> Roman dislikes linking in the kernel. So did I until I wrote it: it's
> really trivial (esp. compared with the code to coordinate with the
> userspace linker properly). And it exists today. The linking takes
> around 200 lines. But, let's say his solution is 500 lines shorter
> than mine.

I believe you that linking in the kernel is simpler, but so would be a lot
of other things and the part I really dislike is to remove the ability to
keep it in user space.

bye, Roman

2002-10-17 14:50:31

by Kai Germaschewski

[permalink] [raw]
Subject: Re: your mail

On Thu, 17 Oct 2002, Rusty Russell wrote:

> > But that one is easy: the zero check just takes the same spinlock as
> > TRY_INC_MOD_COUNT, then sets can't-increment only in the case the count
> > is zero, considerably simpler than:
>
> The current spinlock is horrible. You could use a brlock, of course,
> but I didn't mainly because of code bloat and speed. My current code
> looks like:
>
> static inline int try_module_get(struct module *module)
> {
> int ret = 1;
>
> if (module) {
> unsigned int cpu = get_cpu();
> if (likely(module->ref[cpu].live))
> local_inc(&module->ref[cpu].counter);
> else
> ret = 0;
> put_cpu();
> }
> return ret;
> }

Since I made the mistake of getting involved into this discussion lately,
I wonder if this new method is going to be mandatory (the only one
available) or optional. I think there's two different kind of users, for
one modules which use an API which provides its own infrastructure for
dealing with modules via ->owner, on the other hand things like netfilter
(that's probably where you are coming from) where calls into a module,
which need protection are really frequent.

Note that for the vast majority of modules, dealing with unload races is
as simple as setting ->owner, for example filesystems, network drivers.

Sure, we need a global lock (unload_lock) when calling into these modules
initially, but these "binding/unbinding" calls are really rare. For
filesystems, they happen once per mount, for network drivers only for
ifconfig up/down. Afterwards, calling into the module (e.g. accessing the
mounted filesystem, xmitting/receiving data) doesn't have any overhead at
all compared to a linked-in filesystem/driver (well, ignore TLB misses)

I don't see a good reason to change this, in particular, since it provides
useful information to the user, that is the mod_use_count. It means "Is it
possible to successfully unload the module now?", and since looking at
the count and the actual unload is protected by unload_lock, the unload
will either succeed basically immediately, or fail with -EBUSY right away.

I see that your approach makes frequent calls into the module cheaper, but
I'm not totally convinced that the current safe interfaces need to change
just to accomodate rare cases like netfilter (there's most likely some
more cases like it, but the majority of modules is not).

Anyway, I may see further problems, but let me check first: Is your count
supposed to only count users which are currently executing in the module's
.text, or is it also to count references to data allocated in the module?
(I.e. when I register_netdev(), does that keep a reference to the module
even after the code has left the module's .text?)

--Kai

2002-10-17 17:14:13

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] change format of LSM hooks

On Thursday 17 October 2002 09:41, Rusty Russell wrote:
> In message <E181zuY-0004Fl-00@starship> you write:
> > On Thursday 17 October 2002 00:48, Rusty Russell wrote:
> > > > On Wednesday 16 October 2002 08:11, Rusty Russell wrote:
> > > > > It needs to be turned off when dealing with any interface which
> > > > > might be used by one of the hard modules. Which is pretty bad.
> > > >
> > > > As far as I can see, preemption only has to be disabled during the
> > > > synchronize_kernel phase of unloading that one module, and this
> > > > requirement is inherited neither by dependant or depending modules.
> > >
> > > No, someone could already have been preempted before you start
> > > synchronize_kernel().
> >
> > I don't get that. The sequence is:
> >
> > - turn off preemption
> > - unhook call points
> > - synchronize_kernel
> > - ...
> >
> > which doesn't leave any preemption hole that I can see, so I can't comment
> > on a couple of the other points until you clear that one up.
>
> You mean that "turn off preemption" also wakes up anyone currently
> preempted? Otherwise they're preempted just inside one of those call
> points.

Urk, yes, or just those preempted in kernel. Which looks doable and not
intrusive, modulo my limited scheduler knowledge. So synchronize_kernel
isn't dead yet, though it just got a new flesh wound.

> > > Still a race between the zero check and the can't-increment state
> > > setting.
> >
> > But that one is easy: the zero check just takes the same spinlock as
> > TRY_INC_MOD_COUNT, then sets can't-increment only in the case the count
> > is zero, considerably simpler than:
>
> The current spinlock is horrible.

Is it? You must be thinking about much more intensive use of the spinlock as
with per-op calls as opposed to per-attach (mount). I'd planned to make the
spinlocks per-module, but your per-cpu code looks just fine.

> > ...The still-initializing case is also easy, e.g., a filesystem module
> > simply doesn't call register_filesystem until it's completely ready to
> > service calls, so nobody is able to do TRY_INC_MOD_COUNT.
>
> Consider some code which needs to know when cpus go up and down, so
> registers a notifier. If the notifier fires before the init is
> finished, the notifier code will fail to "try_inc_mod_count()" and
> won't call it (it doesn't do try_inc_mod_count at the moment, but
> that's a bug).
>
> I don't know of any code which does this now, but it is at least a
> theoretical problem.

To resolve this, start the module in can-increment state, do the module
initialization, register the notifiers, and finally register the interface.
In other words, the module never needs to be in can't-increment state at
initialization. (The module writer must ensure they have the correct,
raceless initial state of whatever the notifiers are notifying about, which
strikes me as a little tricky in itself.)

> IMHO, the benifits of having it in-kernel outweigh the slight extra
> size.

I'll cast my vote for your in-kernel linker, in the mistaken belief that
democracy has anything to do with the question. Does this make progress
towards eliminating one of create_module or init_module?

> > > ...The second is the "die-mother-fucker-die"
> > > version, which taints the kernel and just removes the damn thing. For
> > > most people, this is better than a reboot, and will usually "work".
> >
> > Is there a case where removing a module would actually help? What is
> > the user going to do next, try to reinsert the same module?
>
> Insert a fixed one, hopefully 8).

I'd use this feature in filesystem development, regardless of the risks,
since I regularly do worse things to the kernel anyway. But don't you think
the rmmod parameter should be different for the ask-nicely vs the
shoot-in-the-head flavor? How about -f -f or -F for the latter?

--
Daniel

2002-10-17 17:23:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] change format of LSM hooks

On Thursday 17 October 2002 09:41, Rusty Russell wrote:
> http://www.kernel.org/pub/linux/kernel/people/rusty/patches/Module/forceunload.patch.gz

O_TRUNC in try_force is just a test hack?

--
Daniel

2002-10-18 02:19:39

by Rusty Russell

[permalink] [raw]
Subject: Re: [RFC] change format of LSM hooks

In message <E182EJl-0004Rd-00@starship> you write:
> > The current spinlock is horrible.
>
> Is it? You must be thinking about much more intensive use of the spinlock as
> with per-op calls as opposed to per-attach (mount). I'd planned to make the
> spinlocks per-module, but your per-cpu code looks just fine.

Actually, after I thought about it harder, I was inspired by your
point that the race can be prevented with a spinlock (as currently
done). I spent last night writing a new solution to this (I'm testing
it now), which works as follows:

/* We only need protection against local interrupts. */
#ifndef __HAVE_ARCH_LOCAL_INC
#define local_inc(x) atomic_inc(x)
#define local_dec(x) atomic_dec(x)
#endif

static inline int try_module_get(struct module *module)
{
int ret = 1;

if (module) {
unsigned int cpu = get_cpu();
if (likely(module->live))
local_inc(&module->ref[cpu]);
else
ret = 0;
put_cpu();
}
return ret;
}

static inline void module_put(struct module *module)
{
if (module) {
unsigned int cpu = get_cpu();
local_dec(&module->ref[cpu]);
/* Maybe they're waiting for us to drop reference? */
if (unlikely(!module->live))
wake_up_process(module->waiter);
put_cpu();
}
}

Now, these are both short, sweet, and cache-local. The unload stage
becomes harder. We stop all the reference counts by scheduling a
thread on every cpu, then having them all do a local_irq_disable.
This means we know that the refcounts won't change, and we can safely
sum them. If they're zero (or the non-block flag isn't set), we set
"module->live" to false and release the CPUs.

Now, in the sleeping case, I sleep uninterruptible, but drop the
module mutex first, so other module stuff can happen at the same time.

This completely prevents the spurious unload race.

> > I don't know of any code which does this now, but it is at least a
> > theoretical problem.
>
> To resolve this, start the module in can-increment state, do the module
> initialization, register the notifiers, and finally register the interface.
> In other words, the module never needs to be in can't-increment state at
> initialization. (The module writer must ensure they have the correct,
> raceless initial state of whatever the notifiers are notifying about, which
> strikes me as a little tricky in itself.)

Yes, this basically means two-state init for modules, which I shy
clear of in general. But if any modules care, they can do this
themselves, though, by setting "THIS_MODULE.live = 1;" to activate
the notifiers during their init routine.

Basically, with the other one solved, I'm happy to place this on the
shoulders of any module which really cares.

> > IMHO, the benifits of having it in-kernel outweigh the slight extra
> > size.
>
> I'll cast my vote for your in-kernel linker, in the mistaken belief that
> democracy has anything to do with the question. Does this make progress
> towards eliminating one of create_module or init_module?

Yes, query_module, create_module and /proc/ksyms all gone.

> > > > ...The second is the "die-mother-fucker-die"
>
> I'd use this feature in filesystem development, regardless of the risks,
> since I regularly do worse things to the kernel anyway. But don't you think
> the rmmod parameter should be different for the ask-nicely vs the
> shoot-in-the-head flavor? How about -f -f or -F for the latter?

Yes, currently:
rusty@mingo:~$ /sbin/rmmod
Usage: /sbin/rmmod [--wait|-f] <modulename>
The --wait option begins a module removal even if it is used
and will stop new users from accessing the module (so it
should eventually fall to zero).
The -f option forces a module unload, and may crash your
machine. This requires the Forced Module Removal option
when the kernel was compiled.

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-18 02:50:36

by Rusty Russell

[permalink] [raw]
Subject: Re: your mail

In message <[email protected]> yo
u write:
> Since I made the mistake of getting involved into this discussion lately,

My condolences. 8)

> I wonder if this new method is going to be mandatory (the only one
> available) or optional. I think there's two different kind of users, for
> one modules which use an API which provides its own infrastructure for
> dealing with modules via ->owner, on the other hand things like netfilter
> (that's probably where you are coming from) where calls into a module,
> which need protection are really frequent.

Mandatory for interfaces where the function can sleep (or be preempted).

> Note that for the vast majority of modules, dealing with unload races is
> as simple as setting ->owner, for example filesystems, network drivers.

Yes. We do not have complete coverage though, this policy would
extend it.

> I see that your approach makes frequent calls into the module cheaper, but
> I'm not totally convinced that the current safe interfaces need to change
> just to accomodate rare cases like netfilter (there's most likely some
> more cases like it, but the majority of modules is not).

They're not changing. The current users doing try_inc_mod_count() are
fine. It's the ones not doing it which are problematic.

> Anyway, I may see further problems, but let me check first: Is your count
> supposed to only count users which are currently executing in the module's
> .text, or is it also to count references to data allocated in the module?
> (I.e. when I register_netdev(), does that keep a reference to the module
> even after the code has left the module's .text?)

It's to protect entry to the function, but of course, some interfaces
(eg. filesystems) lend themselves very neatly to batching this at
mount/unmount time. Data is already protected by the usual means.

At risk of boring you, here's the document from the documentation
patch. Suggestions welcome.

+Writing Modules and the Interfaces To Be Used By Them: A Gentle Guide.
+Copyright 2002, Rusty Russell IBM Corportation
+
+Modules are running parts of the kernel which can be added, and
+sometimes removed, while the kernel is operational.
+
+There are several delicate issues involved in this procedure which
+indicate special care should be taken.
+
+There are three cases you need to be careful:
+
+1) Any code which creates an interface for callbacks (ie. almost any
+ function called register_*)
+ => See Rule #1
+
+2) Any modules which use (old) interfaces which do not obey Rule #1
+ => See Rule #2
+
+Rule #1: Module-safe Interfaces. Any interface which allows
+ registration of callbacks, must also allow registration of a
+ "struct module *owner", either in the structure or as a
+ function parameter, and it must use them to protect the
+ callbacks. See "MAKING INTERFACES SAFE".
+
+Exception #1: As an optimization, you may skip this protection if you
+ *know* that the callbacks are non-preemtible and never
+ sleep (eg. registration of interrupt handlers).
+
+
+Rule #2: Modules using unsafe interfaces. If your module is using any
+ interface which does not obey rule number 1, that means your
+ module functions may be called from the rest of the kernel
+ without the caller first doing a successful try_module_get().
+
+ You must not register a "module_cleanup" handler, and your module
+ cannot be unloaded except by force. You must be especially
+ careful in this case with initialization: see "INITIALIZING
+ MODULES WHICH USE UNSAFE INTERFACES".
+
+MAKING INTERFACES SAFE
+
+A caller must always call "try_module_get()" on a function pointers's
+owner before calling through that function pointer. If
+"try_module_get()" returns 0 (false), the function pointer must *not*
+be called, and the caller should pretend that registration does not
+exist: this means the (module) owner is closing down and doesn't want
+any more calls, or in the process of starting up and isn't ready yet.
+
+For many interfaces, this can be optimized by assuming that a
+structure containing function pointers has the same owner, and knowing
+that one function is always called before the others, such as the
+filesystem code which knows a mount must succeed before any other
+methods can be accessed.
+
+You must call "module_put()" on the owner sometime after you have
+called the function(s).
+
+If you cannot make your interface module-safe in this way, you can at
+least split registration into a "reserve" stage and an "activate"
+stage, so that modules can use the interface, even if they cannot
+(easily) unload.
+
+
+INITIALIZING MODULES WHICH USE UNSAFE INTERFACES
+
+Safe interfaces will never enter your module before module_init() has
+successfully finished, but unsafe interfaces may. The rule is simple:
+your init_module() function *must* succeed (by returning 0) if it has
+successfully used any unsafe interfaces.
+
+So, if you are only using ONE unsafe interface, simply use that
+interface last. Otherwise you will have to use printk() to report
+failure and leave the module initialized (but possibly useless).
+
+
+
+If you have questions about how to apply this document to your own
+modules, please ask [email protected] or [email protected].
+
+Thankyou,
+Rusty.

--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-10-18 21:44:27

by Kai Germaschewski

[permalink] [raw]
Subject: Re: your mail

On Fri, 18 Oct 2002, Rusty Russell wrote:

> > I wonder if this new method is going to be mandatory (the only one
> > available) or optional. I think there's two different kind of users, for
> > one modules which use an API which provides its own infrastructure for
> > dealing with modules via ->owner, on the other hand things like netfilter
> > (that's probably where you are coming from) where calls into a module,
> > which need protection are really frequent.
>
> Mandatory for interfaces where the function can sleep (or be preempted).

and is not protected by other means (try_inc_mod_count()), I presume.

> > I see that your approach makes frequent calls into the module cheaper, but
> > I'm not totally convinced that the current safe interfaces need to change
> > just to accomodate rare cases like netfilter (there's most likely some
> > more cases like it, but the majority of modules is not).
>
> They're not changing. The current users doing try_inc_mod_count() are
> fine. It's the ones not doing it which are problematic.

Alright, so I'm fine with it ;) (not that makes a difference, but...)

--Kai