I'm just curious if it is know that remove_proc_entry has an inherit
race condition? I have a modified kernel that would add and remove
stuff from the proc system and it would every so often crash. I traced
the bug to remove_proc_entry.
for (p = &parent->subdir; *p; p=&(*p)->next ) {
if (!proc_match(len, fn, *p))
continue;
Looking at proc_match
int proc_match(int len, const char *name, struct proc_dir_entry *de)
{
if (de->namelen != len)
return 0;
return !memcmp(name, de->name, len);
}
The bug would happen either at de->namelen in proc_match or in the loop
of p=&(*p)->next.
The race is if two threads remove two entries that are siblings. Since
p = &(*p)->next, and this is then dereferenced, the race is with *p
becoming NULL.
The way I'm fixing this is to put a lock around the call to
remove_proc_entry. But is this race already known and the solution is
to have the callers perform their own locking? Or is this an actual
bug? If it is not a bug, where's the documentation on having callers
protect it?
Thanks,
-- Steve
Working on a custom kernel that adds and removes proc entries quite a
bit, I discovered that remove_proc_entry is not protected against
multiple threads removing entries belonging to the same parent. At
first I thought that this is only a problem with my changes, but after
inspecting the vanilla kernel, I see that there's several places that
calls remove_proc_entry with the same parent (most noticeably
/proc/drivers).
I've added a global remove_proc_lock to protect this section of code. I
was going to add a lock to proc_dir_entry so that the locking is only
cut down to the same parent, but since this function is called so
infrequently, why waste more memory then is needed. One global lock
should not cause too much of a headache here.
I'm not sure if remove_proc_entry is called from interrupt context, so I
did a irqsave just in case.
-- Steve
Index: linux-2.6.15-rc7/fs/proc/generic.c
===================================================================
--- linux-2.6.15-rc7.orig/fs/proc/generic.c 2005-12-30 14:19:39.000000000 -0500
+++ linux-2.6.15-rc7/fs/proc/generic.c 2005-12-30 16:18:42.000000000 -0500
@@ -19,6 +19,7 @@
#include <linux/idr.h>
#include <linux/namei.h>
#include <linux/bitops.h>
+#include <linux/spinlock.h>
#include <asm/uaccess.h>
static ssize_t proc_file_read(struct file *file, char __user *buf,
@@ -27,6 +28,8 @@
size_t count, loff_t *ppos);
static loff_t proc_file_lseek(struct file *, loff_t, int);
+static DEFINE_SPINLOCK(remove_proc_lock);
+
int proc_match(int len, const char *name, struct proc_dir_entry *de)
{
if (de->namelen != len)
@@ -689,10 +692,13 @@
struct proc_dir_entry *de;
const char *fn = name;
int len;
+ unsigned long flags;
if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
goto out;
len = strlen(fn);
+
+ spin_lock_irqsave(&remove_proc_lock, flags);
for (p = &parent->subdir; *p; p=&(*p)->next ) {
if (!proc_match(len, fn, *p))
continue;
@@ -713,6 +719,7 @@
}
break;
}
+ spin_unlock_irqrestore(&remove_proc_lock, flags);
out:
return;
}
On Fri, 2005-12-30 at 16:28 -0500, Steven Rostedt wrote:
> + spin_lock_irqsave(&remove_proc_lock, flags);
...
> + spin_unlock_irqrestore(&remove_proc_lock, flags);
Why do an irqsave here? Are you not sure what context it gets called
from?
Daniel
On Fri, 30 Dec 2005, Daniel Walker wrote:
> On Fri, 2005-12-30 at 16:28 -0500, Steven Rostedt wrote:
>
> > + spin_lock_irqsave(&remove_proc_lock, flags);
> ...
> > + spin_unlock_irqrestore(&remove_proc_lock, flags);
>
> Why do an irqsave here? Are you not sure what context it gets called
> from?
>
[snipped from original message]
"I'm not sure if remove_proc_entry is called from interrupt context, so I
did a irqsave just in case."
;-)
-- Steve
Steven Rostedt wrote:
> I've added a global remove_proc_lock to protect this section of code. I
> was going to add a lock to proc_dir_entry so that the locking is only
> cut down to the same parent, but since this function is called so
> infrequently, why waste more memory then is needed. One global lock
> should not cause too much of a headache here.
Are you sure that it's the only place where we need guard ->subdir? It
looks like proc_lookup() and proc_readdir() use the BLK when walking that
list, so probably the best fix would be to use that lock everywhere else
->subdir is touched
-Mitch
On Fri, 2005-12-30 at 13:55 -0800, Mitchell Blank Jr wrote:
> Steven Rostedt wrote:
> > I've added a global remove_proc_lock to protect this section of code. I
> > was going to add a lock to proc_dir_entry so that the locking is only
> > cut down to the same parent, but since this function is called so
> > infrequently, why waste more memory then is needed. One global lock
> > should not cause too much of a headache here.
>
> Are you sure that it's the only place where we need guard ->subdir? It
> looks like proc_lookup() and proc_readdir() use the BLK when walking that
> list, so probably the best fix would be to use that lock everywhere else
> ->subdir is touched
Good point.
God, we should be getting rid of the stupid BKL, not add more. But
seeing that this is what is used to protect that list, I guess I'll add
it.
I'm also assuming that interrupt context wont use this.
-- Steve
Index: linux-2.6.15-rc7/fs/proc/generic.c
===================================================================
--- linux-2.6.15-rc7.orig/fs/proc/generic.c 2005-12-30 14:19:39.000000000 -0500
+++ linux-2.6.15-rc7/fs/proc/generic.c 2005-12-30 17:05:56.000000000 -0500
@@ -693,6 +693,8 @@
if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
goto out;
len = strlen(fn);
+
+ lock_kernel();
for (p = &parent->subdir; *p; p=&(*p)->next ) {
if (!proc_match(len, fn, *p))
continue;
@@ -713,6 +715,7 @@
}
break;
}
+ unlock_kernel();
out:
return;
}
On Fri, 2005-12-30 at 13:55 -0800, Mitchell Blank Jr wrote:
> Steven Rostedt wrote:
> > I've added a global remove_proc_lock to protect this section of code. I
> > was going to add a lock to proc_dir_entry so that the locking is only
> > cut down to the same parent, but since this function is called so
> > infrequently, why waste more memory then is needed. One global lock
> > should not cause too much of a headache here.
>
> Are you sure that it's the only place where we need guard ->subdir? It
> looks like proc_lookup() and proc_readdir() use the BLK when walking that
> list, so probably the best fix would be to use that lock everywhere else
> ->subdir is touched
Perhaps this is a good candidate to have the BKL removed from this
protection and replaced with a spin lock or something else. If I
remember, I'll look into that further.
-- Steve
On Fri, 2005-12-30 at 17:09 -0500, Steven Rostedt wrote:
> Index: linux-2.6.15-rc7/fs/proc/generic.c
> ===================================================================
> --- linux-2.6.15-rc7.orig/fs/proc/generic.c 2005-12-30 14:19:39.000000000 -0500
> +++ linux-2.6.15-rc7/fs/proc/generic.c 2005-12-30 17:05:56.000000000 -0500
> @@ -693,6 +693,8 @@
> if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
> goto out;
> len = strlen(fn);
> +
> + lock_kernel();
> for (p = &parent->subdir; *p; p=&(*p)->next ) {
> if (!proc_match(len, fn, *p))
> continue;
> @@ -713,6 +715,7 @@
> }
> break;
> }
> + unlock_kernel();
> out:
> return;
> }
>
FYI, to make sure that this solves the problem, I'm removing my locking
in my kernel and using this instead. It usually crashes in a day or
two, so I can say this works if it makes it three days.
-- Steve
Steven Rostedt <[email protected]> wrote:
>
> +static DEFINE_SPINLOCK(remove_proc_lock);
>
I'll take a closer look at this next week.
The official way of protecting the contents of a directory from concurrent
lookup or modification is to take its i_sem. But procfs is totally weird
and that approach may well not be practical here. We'd certainly prefer
not to rely upon lock_kernel().
On Fri, 30 Dec 2005, Andrew Morton wrote:
> Steven Rostedt <[email protected]> wrote:
> >
> > +static DEFINE_SPINLOCK(remove_proc_lock);
> >
>
> I'll take a closer look at this next week.
>
> The official way of protecting the contents of a directory from concurrent
> lookup or modification is to take its i_sem. But procfs is totally weird
> and that approach may well not be practical here. We'd certainly prefer
> not to rely upon lock_kernel().
>
Yeah, I thought about using the sem (or mutex ;) but remove_proc_entry is
used so darn much around the kernel, I wasn't sure it's not used in irq
context. So I'm not sure I like the idea of making a non-scheduling
function schedule. But it may not be a problem and it could very well be
ok to schedule here.
Also as Mitchell Blank pointed out, this list should probably be protected
everywhere by the same protection used, and an analysis should be done to
see what replacing the BKL affects.
-- Steve
On Sat, 2005-12-31 at 01:58 -0500, Steven Rostedt wrote:
> On Fri, 30 Dec 2005, Andrew Morton wrote:
>
> > Steven Rostedt <[email protected]> wrote:
> > >
> > > +static DEFINE_SPINLOCK(remove_proc_lock);
> > >
> >
> > I'll take a closer look at this next week.
> >
> > The official way of protecting the contents of a directory from concurrent
> > lookup or modification is to take its i_sem. But procfs is totally weird
> > and that approach may well not be practical here. We'd certainly prefer
> > not to rely upon lock_kernel().
> >
>
> Yeah, I thought about using the sem (or mutex ;) but remove_proc_entry is
> used so darn much around the kernel, I wasn't sure it's not used in irq
> context.
it can't be; "anything-VFS" like this can sleep if the file is busy etc
etc.
--- ./fs/proc/generic.c.proclk 2005-12-26 13:43:14.000000000 +0300
+++ ./fs/proc/generic.c 2005-12-31 11:48:16.000000000 +0300
@@ -27,6 +27,8 @@ static ssize_t proc_file_write(struct fi
size_t count, loff_t *ppos);
static loff_t proc_file_lseek(struct file *, loff_t, int);
+static DECLARE_RWSEM(proc_tree_sem);
+
int proc_match(int len, const char *name, struct proc_dir_entry *de)
{
if (de->namelen != len)
@@ -381,6 +383,7 @@ struct dentry *proc_lookup(struct inode
lock_kernel();
de = PDE(dir);
if (de) {
+ down_read(&proc_tree_sem);
for (de = de->subdir; de ; de = de->next) {
if (de->namelen != dentry->d_name.len)
continue;
@@ -392,6 +395,7 @@ struct dentry *proc_lookup(struct inode
break;
}
}
+ up_read(&proc_tree_sem);
}
unlock_kernel();
@@ -446,12 +450,13 @@ int proc_readdir(struct file * filp,
filp->f_pos++;
/* fall through */
default:
+ down_read(&proc_tree_sem);
de = de->subdir;
i -= 2;
for (;;) {
if (!de) {
ret = 1;
- goto out;
+ goto out_up;
}
if (!i)
break;
@@ -462,14 +467,18 @@ int proc_readdir(struct file * filp,
do {
if (filldir(dirent, de->name, de->namelen, filp->f_pos,
de->low_ino, de->mode >> 12) < 0)
- goto out;
+ goto out_up;
filp->f_pos++;
de = de->next;
} while (de);
+ up_read(&proc_tree_sem);
}
ret = 1;
out: unlock_kernel();
return ret;
+out_up:
+ up_read(&proc_tree_sem);
+ goto out;
}
/*
@@ -517,6 +526,7 @@ static int proc_register(struct proc_dir
if (dp->proc_iops == NULL)
dp->proc_iops = &proc_file_inode_operations;
}
+ de_get(dir);
return 0;
}
@@ -576,6 +586,7 @@ static struct proc_dir_entry *proc_creat
memset(ent, 0, sizeof(struct proc_dir_entry));
memcpy(((char *) ent) + sizeof(struct proc_dir_entry), fn, len + 1);
+ atomic_set(&ent->count, 1);
ent->name = ((char *) ent) + sizeof(*ent);
ent->namelen = len;
ent->mode = mode;
@@ -589,6 +600,7 @@ struct proc_dir_entry *proc_symlink(cons
{
struct proc_dir_entry *ent;
+ down_write(&proc_tree_sem);
ent = proc_create(&parent,name,
(S_IFLNK | S_IRUGO | S_IWUGO | S_IXUGO),1);
@@ -606,6 +618,7 @@ struct proc_dir_entry *proc_symlink(cons
ent = NULL;
}
}
+ up_write(&proc_tree_sem);
return ent;
}
@@ -614,6 +627,7 @@ struct proc_dir_entry *proc_mkdir_mode(c
{
struct proc_dir_entry *ent;
+ down_write(&proc_tree_sem);
ent = proc_create(&parent, name, S_IFDIR | mode, 2);
if (ent) {
ent->proc_fops = &proc_dir_operations;
@@ -624,6 +638,7 @@ struct proc_dir_entry *proc_mkdir_mode(c
ent = NULL;
}
}
+ up_write(&proc_tree_sem);
return ent;
}
@@ -633,7 +648,7 @@ struct proc_dir_entry *proc_mkdir(const
return proc_mkdir_mode(name, S_IRUGO | S_IXUGO, parent);
}
-struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
+static struct proc_dir_entry *__create_proc_entry(const char *name, mode_t mode,
struct proc_dir_entry *parent)
{
struct proc_dir_entry *ent;
@@ -665,6 +680,17 @@ struct proc_dir_entry *create_proc_entry
return ent;
}
+struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
+ struct proc_dir_entry *parent)
+{
+ struct proc_dir_entry *ent;
+
+ down_write(&proc_tree_sem);
+ ent = __create_proc_entry(name, mode, parent);
+ up_write(&proc_tree_sem);
+ return ent;
+}
+
void free_proc_entry(struct proc_dir_entry *de)
{
unsigned int ino = de->low_ino;
@@ -683,15 +709,13 @@ void free_proc_entry(struct proc_dir_ent
* Remove a /proc entry and free it if it's not currently in use.
* If it is in use, we set the 'deleted' flag.
*/
-void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
+static void __remove_proc_entry(const char *name, struct proc_dir_entry *parent)
{
struct proc_dir_entry **p;
struct proc_dir_entry *de;
const char *fn = name;
int len;
- if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
- goto out;
len = strlen(fn);
for (p = &parent->subdir; *p; p=&(*p)->next ) {
if (!proc_match(len, fn, *p))
@@ -699,20 +723,24 @@ void remove_proc_entry(const char *name,
de = *p;
*p = de->next;
de->next = NULL;
+ de_put(parent);
if (S_ISDIR(de->mode))
parent->nlink--;
proc_kill_inodes(de);
de->nlink = 0;
WARN_ON(de->subdir);
- if (!atomic_read(&de->count))
- free_proc_entry(de);
- else {
- de->deleted = 1;
- printk("remove_proc_entry: %s/%s busy, count=%d\n",
- parent->name, de->name, atomic_read(&de->count));
- }
+ de->deleted = 1;
+ de_put(de);
break;
}
-out:
- return;
+}
+
+void remove_proc_entry(const char *name, struct proc_dir_entry *parent)
+{
+ const char *fn = name;
+
+ down_write(&proc_tree_sem);
+ if (parent || xlate_proc_name(name, &parent, &fn) == 0)
+ __remove_proc_entry(name, parent);
+ up_write(&proc_tree_sem);
}
--- ./fs/proc/inode.c.proclk 2005-12-26 13:43:14.000000000 +0300
+++ ./fs/proc/inode.c 2005-12-31 11:48:16.000000000 +0300
@@ -21,34 +21,25 @@
extern void free_proc_entry(struct proc_dir_entry *);
-static inline struct proc_dir_entry * de_get(struct proc_dir_entry *de)
-{
- if (de)
- atomic_inc(&de->count);
- return de;
-}
-
/*
* Decrements the use count and checks for deferred deletion.
*/
-static void de_put(struct proc_dir_entry *de)
+void de_put(struct proc_dir_entry *de)
{
if (de) {
- lock_kernel();
if (!atomic_read(&de->count)) {
printk("de_put: entry %s already free!\n", de->name);
- unlock_kernel();
return;
}
if (atomic_dec_and_test(&de->count)) {
- if (de->deleted) {
- printk("de_put: deferred delete of %s\n",
- de->name);
- free_proc_entry(de);
+ if (!de->deleted) {
+ printk("de_put: entry %s is not removed yet\n",
+ de->name);
+ return;
}
- }
- unlock_kernel();
+ free_proc_entry(de);
+ }
}
}
--- ./include/linux/proc_fs.h.proclk 2005-12-26 13:43:16.000000000 +0300
+++ ./include/linux/proc_fs.h 2005-12-31 11:48:16.000000000 +0300
@@ -69,6 +69,14 @@ struct proc_dir_entry {
void *set;
};
+extern void de_put(struct proc_dir_entry *);
+static inline struct proc_dir_entry *de_get(struct proc_dir_entry *de)
+{
+ if (de)
+ atomic_inc(&de->count);
+ return de;
+}
+
struct kcore_list {
struct kcore_list *next;
unsigned long addr;
--- ./kernel/sysctl.c.proclk 2005-12-26 13:43:16.000000000 +0300
+++ ./kernel/sysctl.c 2005-12-31 11:48:37.000000000 +0300
@@ -1396,19 +1396,15 @@ static void unregister_proc_table(ctl_ta
continue;
}
- /*
- * In any case, mark the entry as goner; we'll keep it
- * around if it's busy, but we'll know to do nothing with
- * its fields. We are under sysctl_lock here.
- */
de->data = NULL;
-
- /* Don't unregister proc entries that are still being used.. */
- if (atomic_read(&de->count))
- continue;
-
table->de = NULL;
+ /*
+ * sys_sysctl can't find us, since we are removed from list.
+ * proc won't touch either, since de->data is NULL.
+ */
+ spin_unlock(&sysctl_lock);
remove_proc_entry(table->procname, root);
+ spin_lock(&sysctl_lock);
}
}
On Fri, 2005-12-30 at 15:46 -0800, Andrew Morton wrote:
> Steven Rostedt <[email protected]> wrote:
> >
> > +static DEFINE_SPINLOCK(remove_proc_lock);
> >
>
> I'll take a closer look at this next week.
>
> The official way of protecting the contents of a directory from concurrent
> lookup or modification is to take its i_sem. But procfs is totally weird
> and that approach may well not be practical here. We'd certainly prefer
> not to rely upon lock_kernel().
FWIW,
My test that would crash within two days has been running for three days
now with the lock_kernel patch. So, at least this fixes the problem,
whether we use another locking or not, it's good to know what to fix.
-- Steve
Steven Rostedt <[email protected]> wrote:
>
> On Fri, 2005-12-30 at 17:09 -0500, Steven Rostedt wrote:
>
> > Index: linux-2.6.15-rc7/fs/proc/generic.c
> > ===================================================================
> > --- linux-2.6.15-rc7.orig/fs/proc/generic.c 2005-12-30 14:19:39.000000000 -0500
> > +++ linux-2.6.15-rc7/fs/proc/generic.c 2005-12-30 17:05:56.000000000 -0500
> > @@ -693,6 +693,8 @@
> > if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
> > goto out;
> > len = strlen(fn);
> > +
> > + lock_kernel();
> > for (p = &parent->subdir; *p; p=&(*p)->next ) {
> > if (!proc_match(len, fn, *p))
> > continue;
> > @@ -713,6 +715,7 @@
> > }
> > break;
> > }
> > + unlock_kernel();
> > out:
> > return;
> > }
> >
>
> FYI, to make sure that this solves the problem, I'm removing my locking
> in my kernel and using this instead. It usually crashes in a day or
> two, so I can say this works if it makes it three days.
>
I guess the lock_kernel() approach is the way to go. Fixing a race and
de-BKLing procfs are separate exercises...
Did the patch work?
Kirill Korotaev <[email protected]> wrote:
>
> Hi Andrew,
>
> I have a full patch for this.
Please don't top-post. It makes things hard...
> I don't remember the details yet, but lock was not god here, we used
> semaphore. I pointed to this problem long ago when fixed error path in
> proc with moduleget.
>
> This patch protects proc_dir_entry tree with a proc_tree_sem semaphore.
> I suppose lock_kernel() can be removed later after checking that no proc
> handlers require it.
> Also this patch remakes de refcounters a bit making it more clear and
> more similar to dentry scheme - this is required to make sure that
> everything works correctly.
>
> Patch is against 2.6.15-rcX and was tested for about a week. Also works
> half a year on 2.6.8 :)
>
> [ patch which uses an rwsem for procfs and somewhat removes lock_kernel() ]
>
I worry about replacing a spinlock with a sleeping lock. In some
circumstances it can cause a complete scalability collapse and I suspect
this could happen with /proc. Although I guess the only fastpath here is
proc_readdir(), and as the lock is taken there for reading, we'll be OK..
The patch does leave some lock_kernel() calls behind. If we're going to do
this, I think they should all be removed?
Races in /proc have been plentiful and hard to find. The patch worries me,
frankly. I'd like to see quite a bit more description of the locking
schema and some demonstration that it's actually complete before taking the
plunge.
>>Hi Andrew,
>>
>>I have a full patch for this.
>
>
> Please don't top-post. It makes things hard...
do you prefer separate mails with patch and with reference to original
report? will do so.
>>I don't remember the details yet, but lock was not god here, we used
>>semaphore. I pointed to this problem long ago when fixed error path in
>>proc with moduleget.
>>
>>This patch protects proc_dir_entry tree with a proc_tree_sem semaphore.
>>I suppose lock_kernel() can be removed later after checking that no proc
>>handlers require it.
>>Also this patch remakes de refcounters a bit making it more clear and
>>more similar to dentry scheme - this is required to make sure that
>>everything works correctly.
>>
>>Patch is against 2.6.15-rcX and was tested for about a week. Also works
>>half a year on 2.6.8 :)
>>
>>[ patch which uses an rwsem for procfs and somewhat removes lock_kernel() ]
>>
>
>
> I worry about replacing a spinlock with a sleeping lock. In some
> circumstances it can cause a complete scalability collapse and I suspect
> this could happen with /proc. Although I guess the only fastpath here is
> proc_readdir(), and as the lock is taken there for reading, we'll be OK..
>
> The patch does leave some lock_kernel() calls behind. If we're going to do
> this, I think they should all be removed?
>
> Races in /proc have been plentiful and hard to find. The patch worries me,
> frankly. I'd like to see quite a bit more description of the locking
> schema and some demonstration that it's actually complete before taking the
> plunge.
ok, here goes some more descriptions:
1.
proc_readdir is a sleeping ops:
sys_getdents
`- vfs_readdir
`- proc_readdir
`- filldir
`- put_user/copy_to_user
that's why we must use semaphore. of course spinlock can be used too,
but this will case another problem: it must be dropped to call filldir, but
beofre it will be retaken the dentry being filldir-ed may be removed and
we won't see it's siblings in output.
2.
lock_kernel() is needed to handle at least simultaneous sys_read vs
create_proc_entry with sequental de->proc_fops = XXX. this can be
handled by passing fops directly to create_proc_entry.
i.e. there is a 3rd problem I pointed to you before:
create_proc_entry() and setting of de->owner/de->proc_fops is inatomic.
lock_kernel() is not a best way to protect against this, sure...
I would prefer to fix it with a separate patch somehow...
3.
refcounting:
each proc_dir_entry's refcounter is the reference from inode plus
references from children. once this count reaches zero - dentry is freed.
so on each proc_register() parent is get-ed, on each remove_proc_entry
parent is put-ed.
Kirill
On Wed, 4 Jan 2006, Andrew Morton wrote:
> Steven Rostedt <[email protected]> wrote:
> >
> > FYI, to make sure that this solves the problem, I'm removing my locking
> > in my kernel and using this instead. It usually crashes in a day or
> > two, so I can say this works if it makes it three days.
> >
>
> I guess the lock_kernel() approach is the way to go. Fixing a race and
> de-BKLing procfs are separate exercises...
>
> Did the patch work?
>
Sorry, I forgot to respond, because the test is still running ;)
So yes, it not only ran for three days, it ran for six. I just killed it.
-- Steve
Steven Rostedt wrote:
> > I guess the lock_kernel() approach is the way to go. Fixing a race and
> > de-BKLing procfs are separate exercises...
> >
> > Did the patch work?
>
> Sorry, I forgot to respond, because the test is still running ;)
>
> So yes, it not only ran for three days, it ran for six. I just killed it.
Have you looked at the other paths that touch ->subdir? Namely:
proc_devtree.c:
proc_device_tree_add_node() -- plays games with ->subdir directly
generic.c:
proc_create() -- calls xlate_proc_name() which touches ->subdir and
should therefore probably be called under BKL
proc_register() -- both the call to xlate_proc_name() and the
following accesses to ->subdir should be under BKL
-Mitch
Steven Rostedt <[email protected]> wrote:
>
> God, we should be getting rid of the stupid BKL, not add more. But
> seeing that this is what is used to protect that list, I guess I'll add
> it.
>
> I'm also assuming that interrupt context wont use this.
>
> -- Steve
>
> Index: linux-2.6.15-rc7/fs/proc/generic.c
> ===================================================================
> --- linux-2.6.15-rc7.orig/fs/proc/generic.c 2005-12-30 14:19:39.000000000 -0500
> +++ linux-2.6.15-rc7/fs/proc/generic.c 2005-12-30 17:05:56.000000000 -0500
> @@ -693,6 +693,8 @@
> if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
> goto out;
> len = strlen(fn);
> +
> + lock_kernel();
> for (p = &parent->subdir; *p; p=&(*p)->next ) {
> if (!proc_match(len, fn, *p))
> continue;
> @@ -713,6 +715,7 @@
> }
> break;
> }
> + unlock_kernel();
> out:
> return;
> }
OK, we're kind of screwed here.
Debug: sleeping function called from invalid context at include/asm/semaphore.h:105
in_atomic():1, irqs_disabled():0
Call Trace:<ffffffff8012689b>{__might_sleep+190} <ffffffff803e83ce>{lock_kernel+53}
<ffffffff801a6db2>{remove_proc_entry+74} <ffffffff8016b295>{poison_obj+58}
<ffffffff80134ee5>{unregister_proc_table+121} <ffffffff80134eb6>{unregister_proc_table+74}
<ffffffff80134eb6>{unregister_proc_table+74} <ffffffff80134eb6>{unregister_proc_table+74}
<ffffffff80134eb6>{unregister_proc_table+74} <ffffffff80134fdb>{unregister_sysctl_table+232}
<ffffffff803a2a0e>{ip_mc_dec_group+181} <ffffffff803e802e>{_write_lock_irqsave+32}
<ffffffff80133dc2>{local_bh_enable+114} <ffffffff803e82b3>{_write_unlock_bh+24}
<ffffffff8039ebfe>{devinet_sysctl_unregister+31} <ffffffff8039ecc1>{inetdev_destroy+171}
<ffffffff8039f1c1>{inet_del_ifa+509} <ffffffff8039f2dc>{inet_rtm_deladdr+268}
<ffffffff8036efea>{rtnetlink_rcv_msg+437} <ffffffff803761c3>{netlink_run_queue+140}
<ffffffff8036ee35>{rtnetlink_rcv_msg+0} <ffffffff8036f032>{rtnetlink_rcv+41}
<ffffffff803758af>{netlink_data_ready+23} <ffffffff80374d77>{netlink_sendskb+41}
<ffffffff80374ff4>{netlink_unicast+539} <ffffffff80375881>{netlink_sendmsg+667}
<ffffffff8035bedf>{sock_sendmsg+232} <ffffffff803e8203>{_read_unlock_irq+20}
<ffffffff80142e90>{autoremove_wake_function+0} <ffffffff80171a1e>{fget+157}
<ffffffff80142e90>{autoremove_wake_function+0} <ffffffff80171a1e>{fget+157}
<ffffffff8035bbea>{sockfd_lookup+18} <ffffffff8035d408>{sys_sendto+246}
<ffffffff803e80cc>{_spin_unlock_irqrestore+27} <ffffffff80238031>{__up_write+371}
<ffffffff8010db46>{system_call+126}
Because CONFIG_PREEMPT_BKL makes lock_kernel do down() and some callers of
remove_proc_entry() do it from inside spinlock.
And the spinlocking variant of lock_kernel() is also pretty much illegal,
because (in this case) whatever lock networking has taken may be taken
elsewhere inside lock_kernel(), so we have an ab/ba deadlock.
Best I can think of is that we need a private spinlock for this list.
Steven Rostedt <[email protected]> wrote:
>
> Working on a custom kernel that adds and removes proc entries quite a
> bit, I discovered that remove_proc_entry is not protected against
> multiple threads removing entries belonging to the same parent. At
> first I thought that this is only a problem with my changes, but after
> inspecting the vanilla kernel, I see that there's several places that
> calls remove_proc_entry with the same parent (most noticeably
> /proc/drivers).
>
> I've added a global remove_proc_lock to protect this section of code. I
> was going to add a lock to proc_dir_entry so that the locking is only
> cut down to the same parent, but since this function is called so
> infrequently, why waste more memory then is needed. One global lock
> should not cause too much of a headache here.
>
> I'm not sure if remove_proc_entry is called from interrupt context, so I
> did a irqsave just in case.
>
> -- Steve
>
>
> Index: linux-2.6.15-rc7/fs/proc/generic.c
> ===================================================================
> --- linux-2.6.15-rc7.orig/fs/proc/generic.c 2005-12-30 14:19:39.000000000 -0500
> +++ linux-2.6.15-rc7/fs/proc/generic.c 2005-12-30 16:18:42.000000000 -0500
> @@ -19,6 +19,7 @@
> #include <linux/idr.h>
> #include <linux/namei.h>
> #include <linux/bitops.h>
> +#include <linux/spinlock.h>
> #include <asm/uaccess.h>
>
> static ssize_t proc_file_read(struct file *file, char __user *buf,
> @@ -27,6 +28,8 @@
> size_t count, loff_t *ppos);
> static loff_t proc_file_lseek(struct file *, loff_t, int);
>
> +static DEFINE_SPINLOCK(remove_proc_lock);
> +
> int proc_match(int len, const char *name, struct proc_dir_entry *de)
> {
> if (de->namelen != len)
> @@ -689,10 +692,13 @@
> struct proc_dir_entry *de;
> const char *fn = name;
> int len;
> + unsigned long flags;
>
> if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
> goto out;
> len = strlen(fn);
> +
> + spin_lock_irqsave(&remove_proc_lock, flags);
> for (p = &parent->subdir; *p; p=&(*p)->next ) {
> if (!proc_match(len, fn, *p))
> continue;
> @@ -713,6 +719,7 @@
> }
> break;
> }
> + spin_unlock_irqrestore(&remove_proc_lock, flags);
> out:
> return;
> }
Aren't there other places where we need to take this lock? Code which
traverses that list, code which adds things to it?
On Sat, 7 Jan 2006, Andrew Morton wrote:
>
> Aren't there other places where we need to take this lock? Code which
> traverses that list, code which adds things to it?
>
Yeah, that patch was just a quick fix. I'll look more into that on
Monday. (My wife has too many chores for me this weekend ;)
-- Steve
On Sat, 2006-01-07 at 03:36 -0800, Andrew Morton wrote:
>
> Aren't there other places where we need to take this lock? Code which
> traverses that list, code which adds things to it?
>
Andrew,
How's this patch look? I tested this with the following module:
http://www.kihontech.com/tests/proc/proc_stress.c
Without the patch, I could hang the processes (the processes would
either crash, or just get stuck spinning inside the list). With the
patch, the module ran to completion each time.
I don't believe any of the calls are called from interrupt context, so I
held off from using the _irqsave versions of spin_lock.
-- Steve
Index: linux-2.6.15/fs/proc/generic.c
===================================================================
--- linux-2.6.15.orig/fs/proc/generic.c 2006-01-09 13:58:23.000000000 -0500
+++ linux-2.6.15/fs/proc/generic.c 2006-01-09 13:58:34.000000000 -0500
@@ -19,6 +19,7 @@
#include <linux/idr.h>
#include <linux/namei.h>
#include <linux/bitops.h>
+#include <linux/spinlock.h>
#include <asm/uaccess.h>
static ssize_t proc_file_read(struct file *file, char __user *buf,
@@ -27,6 +28,8 @@
size_t count, loff_t *ppos);
static loff_t proc_file_lseek(struct file *, loff_t, int);
+DEFINE_SPINLOCK(proc_subdir_lock);
+
int proc_match(int len, const char *name, struct proc_dir_entry *de)
{
if (de->namelen != len)
@@ -275,7 +278,9 @@
const char *cp = name, *next;
struct proc_dir_entry *de;
int len;
+ int rtn = 0;
+ spin_lock(&proc_subdir_lock);
de = &proc_root;
while (1) {
next = strchr(cp, '/');
@@ -287,13 +292,17 @@
if (proc_match(len, cp, de))
break;
}
- if (!de)
- return -ENOENT;
+ if (!de) {
+ rtn = -ENOENT;
+ goto out;
+ }
cp += len + 1;
}
*residual = cp;
*ret = de;
- return 0;
+out:
+ spin_unlock(&proc_subdir_lock);
+ return rtn;
}
static DEFINE_IDR(proc_inum_idr);
@@ -378,6 +387,7 @@
int error = -ENOENT;
lock_kernel();
+ spin_lock(&proc_subdir_lock);
de = PDE(dir);
if (de) {
for (de = de->subdir; de ; de = de->next) {
@@ -386,12 +396,15 @@
if (!memcmp(dentry->d_name.name, de->name, de->namelen)) {
unsigned int ino = de->low_ino;
+ spin_unlock(&proc_subdir_lock);
error = -EINVAL;
inode = proc_get_inode(dir->i_sb, ino, de);
+ spin_lock(&proc_subdir_lock);
break;
}
}
}
+ spin_unlock(&proc_subdir_lock);
unlock_kernel();
if (inode) {
@@ -445,11 +458,13 @@
filp->f_pos++;
/* fall through */
default:
+ spin_lock(&proc_subdir_lock);
de = de->subdir;
i -= 2;
for (;;) {
if (!de) {
ret = 1;
+ spin_unlock(&proc_subdir_lock);
goto out;
}
if (!i)
@@ -459,12 +474,16 @@
}
do {
+ /* filldir passes info to user space */
+ spin_unlock(&proc_subdir_lock);
if (filldir(dirent, de->name, de->namelen, filp->f_pos,
de->low_ino, de->mode >> 12) < 0)
goto out;
+ spin_lock(&proc_subdir_lock);
filp->f_pos++;
de = de->next;
} while (de);
+ spin_unlock(&proc_subdir_lock);
}
ret = 1;
out: unlock_kernel();
@@ -498,9 +517,13 @@
if (i == 0)
return -EAGAIN;
dp->low_ino = i;
+
+ spin_lock(&proc_subdir_lock);
dp->next = dir->subdir;
dp->parent = dir;
dir->subdir = dp;
+ spin_unlock(&proc_subdir_lock);
+
if (S_ISDIR(dp->mode)) {
if (dp->proc_iops == NULL) {
dp->proc_fops = &proc_dir_operations;
@@ -692,6 +715,8 @@
if (!parent && xlate_proc_name(name, &parent, &fn) != 0)
goto out;
len = strlen(fn);
+
+ spin_lock(&proc_subdir_lock);
for (p = &parent->subdir; *p; p=&(*p)->next ) {
if (!proc_match(len, fn, *p))
continue;
@@ -712,6 +737,7 @@
}
break;
}
+ spin_unlock(&proc_subdir_lock);
out:
return;
}
Index: linux-2.6.15/fs/proc/proc_devtree.c
===================================================================
--- linux-2.6.15.orig/fs/proc/proc_devtree.c 2006-01-09 13:58:23.000000000 -0500
+++ linux-2.6.15/fs/proc/proc_devtree.c 2006-01-09 14:05:10.000000000 -0500
@@ -112,9 +112,11 @@
* properties are quite unimportant for us though, thus we
* simply "skip" them here, but we do have to check.
*/
+ spin_lock(&proc_subdir_lock);
for (ent = de->subdir; ent != NULL; ent = ent->next)
if (!strcmp(ent->name, pp->name))
break;
+ spin_unlock(&proc_subdir_lock);
if (ent != NULL) {
printk(KERN_WARNING "device-tree: property \"%s\" name"
" conflicts with node in %s\n", pp->name,
Index: linux-2.6.15/include/linux/proc_fs.h
===================================================================
--- linux-2.6.15.orig/include/linux/proc_fs.h 2006-01-09 08:59:13.000000000 -0500
+++ linux-2.6.15/include/linux/proc_fs.h 2006-01-09 14:07:09.000000000 -0500
@@ -4,6 +4,7 @@
#include <linux/config.h>
#include <linux/slab.h>
#include <linux/fs.h>
+#include <linux/spinlock.h>
#include <asm/atomic.h>
/*
@@ -92,6 +93,8 @@
extern struct proc_dir_entry *proc_root_driver;
extern struct proc_dir_entry *proc_root_kcore;
+extern spinlock_t proc_subdir_lock;
+
extern void proc_root_init(void);
extern void proc_misc_init(void);
Ingo,
FYI
I just uploaded my 2.6.15-rt2-sr3 which includes the latest patch to fix
the bug in remove_proc_entry.
http://home.stny.rr.com/rostedt/patches/patch-2.6.15-rt2-sr3
Again, the module to test this is here:
http://www.kihontech.com/tests/proc/proc_stress.c
I tested it like the following:
# insmod proc_stress.ko & for i in `seq 1 10000`; do ls /proc/proc_tests; done
-- Steve
* Steven Rostedt <[email protected]> wrote:
> Ingo,
>
> FYI
>
> I just uploaded my 2.6.15-rt2-sr3 which includes the latest patch to
> fix the bug in remove_proc_entry.
>
> http://home.stny.rr.com/rostedt/patches/patch-2.6.15-rt2-sr3
thanks Steve - i've applied your fixes and have uploaded 2.6.15-rt3 to
the usual place:
http://redhat.com/~mingo/realtime-preempt/
(other than the version string it is the same as -rt2-sr3.)
Ingo
Andrew, here's a better explanation of the patch as well as my
Signed-off.
Description:
It has been discovered that the remove_proc_entry has a race in the
removing of entries in the proc file system that are siblings. There's
no protection around the traversing and removing of elements that belong
in the same subdirectory.
This subdirectory list is protected in other areas by the BKL. So the
BKL was at first used to protect this area too, but unfortunately,
remove_proc_entry may be called with spinlocks held. The BKL may
schedule, so this was not a solution.
The final solution was to add a new global spin lock to protect this
list, called proc_subdir_lock. This lock now protects the list in
remove_proc_entry, and I also went around looking for other areas that
this list is modified and added this protection there too. Care must be
taken since these locations call several functions that may also
schedule.
Since I don't see any location that these functions that modify the
subdirectory list are called by interrupts, the irqsave/restore versions
of the spin lock was _not_ used.
Signed-off-by: Steven Rostedt <[email protected]>
-- Steve