2005-11-29 07:17:24

by Grzegorz Nosek

[permalink] [raw]
Subject: [PATCH] race condition in procfs

Hello,

I found a race condition in procfs on SMP systems. The result is an
oops in processes like pidof. Apparently ->proc_read() gets passed a
potentially NULL pointer. The following micro-patch seems to fix it.

Best regards,
Grzegorz Nosek

--- linux-2.6/fs/proc/base.c.orig 2005-11-25 00:07:43.000000000 +0100
+++ linux-2.6/fs/proc/base.c 2005-11-28 11:44:11.000000000 +0100
@@ -718,6 +718,9 @@
ssize_t length;
struct task_struct *task = proc_task(inode);

+ if (!task)
+ return -ENOENT;
+
if (count > PROC_BLOCK_SIZE)
count = PROC_BLOCK_SIZE;
if (!(page = __get_free_page(GFP_KERNEL)))


2005-11-29 08:09:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

Grzegorz Nosek <[email protected]> wrote:
>
> Hello,
>
> I found a race condition in procfs on SMP systems. The result is an
> oops in processes like pidof. Apparently ->proc_read() gets passed a
> potentially NULL pointer.

Do you know what the race is?

How does one reproduce it?

> The following micro-patch seems to fix it.

It might be right, or it might be a workaround..

2005-11-29 08:38:33

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

2005/11/29, Andrew Morton <[email protected]>:
> > I found a race condition in procfs on SMP systems. The result is an
> > oops in processes like pidof. Apparently ->proc_read() gets passed a
> > potentially NULL pointer.
>
> Do you know what the race is?

Apparently it's a race between deleting a process and accessing its
/proc/pid entries. It came out in pidof while it was accessing
/proc/pid/stat (fs/proc/array.c:do_task_stat crashed on first
instruction - it was an inline function accessing task->state,
get_task_state IIRC). oops (with vserver history data - I'm using a
patch mentioned below) is attached.

>
> How does one reproduce it?

I managed to reproduce it (although not reliably) during high CPU load
and I/O (parallel kernel compiles) on SMP systems with the vserver
patch (http://linux-vserver.org, the exact patch is
http://vserver.13thfloor.at/Experimental/patch-2.6.14.2-vs2.1.0-rc8.diff),
but the vserver maintainer pointed out that it probably is a mainline
issue. We're not using 2.6 systems too much except for the vserver
test beds so I cannot tell if it happens on vanilla kernels.

>
> > The following micro-patch seems to fix it.
>
> It might be right, or it might be a workaround..
>

I'm not a kernel guru so it's just my proposal. Can it break anything?
An alternative _might_ be somewhat coarser task_struct locking
(do_task_stat grabs a spinlock but then it's already too late).
However, if no "right" solution appears, I'll keep using my two-liner
because it seems to help, at least in my setup.

Best regards,
Grzegorz Nosek


Attachments:
(No filename) (1.53 kB)
oops.s35 (8.23 kB)
Download all attachments

2005-11-29 13:25:15

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

> >
> > Do you know what the race is?
>
> Apparently it's a race between deleting a process and accessing its
> /proc/pid entries. It came out in pidof while it was accessing
> /proc/pid/stat (fs/proc/array.c:do_task_stat crashed on first
> instruction - it was an inline function accessing task->state,
> get_task_state IIRC). oops (with vserver history data - I'm using a
> patch mentioned below) is attached.
>
> >
> > How does one reproduce it?
>
> I managed to reproduce it (although not reliably) during high CPU load
> and I/O (parallel kernel compiles) on SMP systems with the vserver
> patch (http://linux-vserver.org, the exact patch is
> http://vserver.13thfloor.at/Experimental/patch-2.6.14.2-vs2.1.0-rc8.diff),
> but the vserver maintainer pointed out that it probably is a mainline
> issue. We're not using 2.6 systems too much except for the vserver
> test beds so I cannot tell if it happens on vanilla kernels.
>
> >
> > > The following micro-patch seems to fix it.
> >
> > It might be right, or it might be a workaround..
> >
>
> I'm not a kernel guru so it's just my proposal. Can it break anything?
> An alternative _might_ be somewhat coarser task_struct locking
> (do_task_stat grabs a spinlock but then it's already too late).
> However, if no "right" solution appears, I'll keep using my two-liner
> because it seems to help, at least in my setup.
>

Oh well, I got another oops in the very same place with the patch
applied. So now I surrounded the check with
read_[un]lock(&tasklist_lock) and added a check to do_task_stat (both
now have a printk). If it builds, boots and doesn't crash, I'll post
the patch.

Best regards,
Grzegorz Nosek

2005-11-29 14:04:12

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

2005/11/29, Grzegorz Nosek <[email protected]>:
>
> Oh well, I got another oops in the very same place with the patch
> applied. So now I surrounded the check with
> read_[un]lock(&tasklist_lock) and added a check to do_task_stat (both
> now have a printk). If it builds, boots and doesn't crash, I'll post
> the patch.

Froze solid a minute after booting :( netconsole didn't log anything
either. Back to the drawing board.

Best regards,
Grzegorz Nosek

2005-11-29 14:29:01

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

On Tue, 2005-11-29 at 15:04 +0100, Grzegorz Nosek wrote:
> 2005/11/29, Grzegorz Nosek <[email protected]>:
> >
> > Oh well, I got another oops in the very same place with the patch
> > applied. So now I surrounded the check with
> > read_[un]lock(&tasklist_lock) and added a check to do_task_stat (both
> > now have a printk). If it builds, boots and doesn't crash, I'll post
> > the patch.
>
> Froze solid a minute after booting :( netconsole didn't log anything
> either. Back to the drawing board.

Have you seen this crash the vanilla kernel? What exactly are you doing
to see the crash? If you have a script or something, could you post it.
I could spend some time helping you debug it too on one of my SMP boxes.

-- Steve


2005-11-29 14:39:28

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

2005/11/29, Steven Rostedt <[email protected]>:
> Have you seen this crash the vanilla kernel? What exactly are you doing
> to see the crash? If you have a script or something, could you post it.
> I could spend some time helping you debug it too on one of my SMP boxes.
>

I'm not really using vanilla 2.6 kernels and my setup would be quite
hard to run on a vanilla kernel.

The reproduceability of this bug varies. Sometimes it'll go for a few
days without happening, sometimes it's a matter of a few minutes. I'm
beginning to feel it's a vserver issue after all, somehow related to
pid virtualisation (it maps some vxi->vx_initpid to 1).

Thus I cannot provide a simple script to trigger the bug (I wish I
could) but often doing a -j8 kernel compile in a vserver is enough.

2005-11-29 14:49:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs


On Tue, 29 Nov 2005, Grzegorz Nosek wrote:

>
> I'm not really using vanilla 2.6 kernels and my setup would be quite
> hard to run on a vanilla kernel.
>
> The reproduceability of this bug varies. Sometimes it'll go for a few
> days without happening, sometimes it's a matter of a few minutes. I'm
> beginning to feel it's a vserver issue after all, somehow related to
> pid virtualisation (it maps some vxi->vx_initpid to 1).
>
> Thus I cannot provide a simple script to trigger the bug (I wish I
> could) but often doing a -j8 kernel compile in a vserver is enough.
>

What you are showing, would probably show up by others if this were a
vanilla kernel issue. I don't have an 8 way machine, just 2 way, but the
vanilla kernel is being used on many 8 ways out there, so I think you are
right that this _is_ a vserver issue.

Unless, of course, that the vserver is producing an obscure race in the
vanilla kernel that normal operations would seldom have. Just like the
PREEMPT_RT patch has discovered many race conditions that were in the
vanilla kernel but were not often a problem.

-- Steve

2005-11-29 15:22:25

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] shrinks dentry struct

--- linux-2.6.15-rc3/include/linux/dcache.h 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/include/linux/dcache.h 2005-11-29 12:04:54.000000000 +0100
@@ -95,14 +95,19 @@
struct qstr d_name;

struct list_head d_lru; /* LRU list */
- struct list_head d_child; /* child of parent list */
+ /*
+ * d_child and d_rcu can share memory
+ */
+ union {
+ struct list_head d_child; /* child of parent list */
+ struct rcu_head d_rcu;
+ } d_u;
struct list_head d_subdirs; /* our children */
struct list_head d_alias; /* inode alias list */
unsigned long d_time; /* used by d_revalidate */
struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
void *d_fsdata; /* fs-specific data */
- struct rcu_head d_rcu;
struct dcookie_struct *d_cookie; /* cookie, if any */
int d_mounted;
unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
--- linux-2.6.15-rc3/fs/dcache.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/dcache.c 2005-11-29 15:07:10.000000000 +0100
@@ -71,7 +71,7 @@

static void d_callback(struct rcu_head *head)
{
- struct dentry * dentry = container_of(head, struct dentry, d_rcu);
+ struct dentry * dentry = container_of(head, struct dentry, d_u.d_rcu);

if (dname_external(dentry))
kfree(dentry->d_name.name);
@@ -86,7 +86,7 @@
{
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
- call_rcu(&dentry->d_rcu, d_callback);
+ call_rcu(&dentry->d_u.d_rcu, d_callback);
}

/*
@@ -193,7 +193,7 @@
list_del(&dentry->d_lru);
dentry_stat.nr_unused--;
}
- list_del(&dentry->d_child);
+ list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
@@ -367,7 +367,7 @@
struct dentry * parent;

__d_drop(dentry);
- list_del(&dentry->d_child);
+ list_del(&dentry->d_u.d_child);
dentry_stat.nr_dentry--; /* For d_free, below */
dentry_iput(dentry);
parent = dentry->d_parent;
@@ -518,7 +518,7 @@
resume:
while (next != &this_parent->d_subdirs) {
struct list_head *tmp = next;
- struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
+ struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
/* Have we found a mount point ? */
if (d_mountpoint(dentry))
@@ -532,7 +532,7 @@
* All done at this level ... ascend and resume the search.
*/
if (this_parent != parent) {
- next = this_parent->d_child.next;
+ next = this_parent->d_u.d_child.next;
this_parent = this_parent->d_parent;
goto resume;
}
@@ -569,7 +569,7 @@
resume:
while (next != &this_parent->d_subdirs) {
struct list_head *tmp = next;
- struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
+ struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;

if (!list_empty(&dentry->d_lru)) {
@@ -610,7 +610,7 @@
* All done at this level ... ascend and resume the search.
*/
if (this_parent != parent) {
- next = this_parent->d_child.next;
+ next = this_parent->d_u.d_child.next;
this_parent = this_parent->d_parent;
#ifdef DCACHE_DEBUG
printk(KERN_DEBUG "select_parent: ascending to %s/%s, found=%d\n",
@@ -753,12 +753,12 @@
dentry->d_parent = dget(parent);
dentry->d_sb = parent->d_sb;
} else {
- INIT_LIST_HEAD(&dentry->d_child);
+ INIT_LIST_HEAD(&dentry->d_u.d_child);
}

spin_lock(&dcache_lock);
if (parent)
- list_add(&dentry->d_child, &parent->d_subdirs);
+ list_add(&dentry->d_u.d_child, &parent->d_subdirs);
dentry_stat.nr_dentry++;
spin_unlock(&dcache_lock);

@@ -1310,8 +1310,8 @@
/* Unhash the target: dput() will then get rid of it */
__d_drop(target);

- list_del(&dentry->d_child);
- list_del(&target->d_child);
+ list_del(&dentry->d_u.d_child);
+ list_del(&target->d_u.d_child);

/* Switch the names.. */
switch_names(dentry, target);
@@ -1322,15 +1322,15 @@
if (IS_ROOT(dentry)) {
dentry->d_parent = target->d_parent;
target->d_parent = target;
- INIT_LIST_HEAD(&target->d_child);
+ INIT_LIST_HEAD(&target->d_u.d_child);
} else {
do_switch(dentry->d_parent, target->d_parent);

/* And add them back to the (new) parent lists */
- list_add(&target->d_child, &target->d_parent->d_subdirs);
+ list_add(&target->d_u.d_child, &target->d_parent->d_subdirs);
}

- list_add(&dentry->d_child, &dentry->d_parent->d_subdirs);
+ list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
spin_unlock(&target->d_lock);
spin_unlock(&dentry->d_lock);
write_sequnlock(&rename_lock);
@@ -1568,7 +1568,7 @@
resume:
while (next != &this_parent->d_subdirs) {
struct list_head *tmp = next;
- struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
+ struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
if (d_unhashed(dentry)||!dentry->d_inode)
continue;
@@ -1579,7 +1579,7 @@
atomic_dec(&dentry->d_count);
}
if (this_parent != root) {
- next = this_parent->d_child.next;
+ next = this_parent->d_u.d_child.next;
atomic_dec(&this_parent->d_count);
this_parent = this_parent->d_parent;
goto resume;
--- linux-2.6.15-rc3/fs/autofs4/autofs_i.h 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/autofs4/autofs_i.h 2005-11-29 12:04:54.000000000 +0100
@@ -209,7 +209,7 @@
struct dentry *child;
int ret = 0;

- list_for_each_entry(child, &dentry->d_subdirs, d_child)
+ list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
if (simple_positive(child))
goto out;
ret = 1;
--- linux-2.6.15-rc3/fs/autofs4/expire.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/autofs4/expire.c 2005-11-29 12:04:54.000000000 +0100
@@ -105,7 +105,7 @@
next = this_parent->d_subdirs.next;
resume:
while (next != &this_parent->d_subdirs) {
- struct dentry *dentry = list_entry(next, struct dentry, d_child);
+ struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);

/* Negative dentry - give up */
if (!simple_positive(dentry)) {
@@ -138,7 +138,7 @@
}

if (this_parent != top) {
- next = this_parent->d_child.next;
+ next = this_parent->d_u.d_child.next;
this_parent = this_parent->d_parent;
goto resume;
}
@@ -163,7 +163,7 @@
next = this_parent->d_subdirs.next;
resume:
while (next != &this_parent->d_subdirs) {
- struct dentry *dentry = list_entry(next, struct dentry, d_child);
+ struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);

/* Negative dentry - give up */
if (!simple_positive(dentry)) {
@@ -199,7 +199,7 @@
}

if (this_parent != parent) {
- next = this_parent->d_child.next;
+ next = this_parent->d_u.d_child.next;
this_parent = this_parent->d_parent;
goto resume;
}
@@ -238,7 +238,7 @@
/* On exit from the loop expire is set to a dgot dentry
* to expire or it's NULL */
while ( next != &root->d_subdirs ) {
- struct dentry *dentry = list_entry(next, struct dentry, d_child);
+ struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);

/* Negative dentry - give up */
if ( !simple_positive(dentry) ) {
@@ -302,7 +302,7 @@
expired, (int)expired->d_name.len, expired->d_name.name);
spin_lock(&dcache_lock);
list_del(&expired->d_parent->d_subdirs);
- list_add(&expired->d_parent->d_subdirs, &expired->d_child);
+ list_add(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
spin_unlock(&dcache_lock);
return expired;
}
--- linux-2.6.15-rc3/fs/autofs4/inode.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/autofs4/inode.c 2005-11-29 12:04:54.000000000 +0100
@@ -91,7 +91,7 @@
next = this_parent->d_subdirs.next;
resume:
while (next != &this_parent->d_subdirs) {
- struct dentry *dentry = list_entry(next, struct dentry, d_child);
+ struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);

/* Negative dentry - don`t care */
if (!simple_positive(dentry)) {
@@ -117,7 +117,7 @@
if (this_parent != sbi->root) {
struct dentry *dentry = this_parent;

- next = this_parent->d_child.next;
+ next = this_parent->d_u.d_child.next;
this_parent = this_parent->d_parent;
spin_unlock(&dcache_lock);
DPRINTK("parent dentry %p %.*s",
--- linux-2.6.15-rc3/fs/coda/cache.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/coda/cache.c 2005-11-29 12:05:21.000000000 +0100
@@ -93,7 +93,7 @@
spin_lock(&dcache_lock);
list_for_each(child, &parent->d_subdirs)
{
- de = list_entry(child, struct dentry, d_child);
+ de = list_entry(child, struct dentry, d_u.d_child);
/* don't know what to do with negative dentries */
if ( ! de->d_inode )
continue;
--- linux-2.6.15-rc3/fs/libfs.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/libfs.c 2005-11-29 14:58:51.000000000 +0100
@@ -93,16 +93,16 @@
loff_t n = file->f_pos - 2;

spin_lock(&dcache_lock);
- list_del(&cursor->d_child);
+ list_del(&cursor->d_u.d_child);
p = file->f_dentry->d_subdirs.next;
while (n && p != &file->f_dentry->d_subdirs) {
struct dentry *next;
- next = list_entry(p, struct dentry, d_child);
+ next = list_entry(p, struct dentry, d_u.d_child);
if (!d_unhashed(next) && next->d_inode)
n--;
p = p->next;
}
- list_add_tail(&cursor->d_child, p);
+ list_add_tail(&cursor->d_u.d_child, p);
spin_unlock(&dcache_lock);
}
}
@@ -126,7 +126,7 @@
{
struct dentry *dentry = filp->f_dentry;
struct dentry *cursor = filp->private_data;
- struct list_head *p, *q = &cursor->d_child;
+ struct list_head *p, *q = &cursor->d_u.d_child;
ino_t ino;
int i = filp->f_pos;

@@ -153,7 +153,7 @@
}
for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
struct dentry *next;
- next = list_entry(p, struct dentry, d_child);
+ next = list_entry(p, struct dentry, d_u.d_child);
if (d_unhashed(next) || !next->d_inode)
continue;

@@ -261,7 +261,7 @@
int ret = 0;

spin_lock(&dcache_lock);
- list_for_each_entry(child, &dentry->d_subdirs, d_child)
+ list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
if (simple_positive(child))
goto out;
ret = 1;
--- linux-2.6.15-rc3/fs/ncpfs/dir.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/ncpfs/dir.c 2005-11-29 14:59:31.000000000 +0100
@@ -365,7 +365,7 @@
spin_lock(&dcache_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
- dent = list_entry(next, struct dentry, d_child);
+ dent = list_entry(next, struct dentry, d_u.d_child);
if ((unsigned long)dent->d_fsdata == fpos) {
if (dent->d_inode)
dget_locked(dent);
--- linux-2.6.15-rc3/fs/ncpfs/ncplib_kernel.h 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/ncpfs/ncplib_kernel.h 2005-11-29 14:59:31.000000000 +0100
@@ -196,7 +196,7 @@
spin_lock(&dcache_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
- dentry = list_entry(next, struct dentry, d_child);
+ dentry = list_entry(next, struct dentry, d_u.d_child);

if (dentry->d_fsdata == NULL)
ncp_age_dentry(server, dentry);
@@ -218,7 +218,7 @@
spin_lock(&dcache_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
- dentry = list_entry(next, struct dentry, d_child);
+ dentry = list_entry(next, struct dentry, d_u.d_child);
dentry->d_fsdata = NULL;
ncp_age_dentry(server, dentry);
next = next->next;
--- linux-2.6.15-rc3/fs/smbfs/cache.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/fs/smbfs/cache.c 2005-11-29 14:59:47.000000000 +0100
@@ -66,7 +66,7 @@
spin_lock(&dcache_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
- dentry = list_entry(next, struct dentry, d_child);
+ dentry = list_entry(next, struct dentry, d_u.d_child);
dentry->d_fsdata = NULL;
smb_age_dentry(server, dentry);
next = next->next;
@@ -100,7 +100,7 @@
spin_lock(&dcache_lock);
next = parent->d_subdirs.next;
while (next != &parent->d_subdirs) {
- dent = list_entry(next, struct dentry, d_child);
+ dent = list_entry(next, struct dentry, d_u.d_child);
if ((unsigned long)dent->d_fsdata == fpos) {
if (dent->d_inode)
dget_locked(dent);
--- linux-2.6.15-rc3/kernel/cpuset.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/kernel/cpuset.c 2005-11-29 15:01:01.000000000 +0100
@@ -304,7 +304,7 @@
spin_lock(&dcache_lock);
node = dentry->d_subdirs.next;
while (node != &dentry->d_subdirs) {
- struct dentry *d = list_entry(node, struct dentry, d_child);
+ struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
list_del_init(node);
if (d->d_inode) {
d = dget_locked(d);
@@ -316,7 +316,7 @@
}
node = dentry->d_subdirs.next;
}
- list_del_init(&dentry->d_child);
+ list_del_init(&dentry->d_u.d_child);
spin_unlock(&dcache_lock);
remove_dir(dentry);
}
--- linux-2.6.15-rc3/drivers/usb/core/inode.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/drivers/usb/core/inode.c 2005-11-29 12:04:54.000000000 +0100
@@ -186,7 +186,7 @@

down(&bus->d_inode->i_sem);

- list_for_each_entry(dev, &bus->d_subdirs, d_child)
+ list_for_each_entry(dev, &bus->d_subdirs, d_u.d_child)
if (dev->d_inode)
update_dev(dev);

@@ -203,7 +203,7 @@

down(&root->d_inode->i_sem);

- list_for_each_entry(bus, &root->d_subdirs, d_child) {
+ list_for_each_entry(bus, &root->d_subdirs, d_u.d_child) {
if (bus->d_inode) {
switch (S_IFMT & bus->d_inode->i_mode) {
case S_IFDIR:
@@ -319,7 +319,7 @@
spin_lock(&dcache_lock);

list_for_each(list, &dentry->d_subdirs) {
- struct dentry *de = list_entry(list, struct dentry, d_child);
+ struct dentry *de = list_entry(list, struct dentry, d_u.d_child);
if (usbfs_positive(de)) {
spin_unlock(&dcache_lock);
return 0;
--- linux-2.6.15-rc3/net/sunrpc/rpc_pipe.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/net/sunrpc/rpc_pipe.c 2005-11-29 15:01:01.000000000 +0100
@@ -492,7 +492,7 @@
repeat:
spin_lock(&dcache_lock);
list_for_each_safe(pos, next, &parent->d_subdirs) {
- dentry = list_entry(pos, struct dentry, d_child);
+ dentry = list_entry(pos, struct dentry, d_u.d_child);
spin_lock(&dentry->d_lock);
if (!d_unhashed(dentry)) {
dget_locked(dentry);
--- linux-2.6.15-rc3/security/selinux/selinuxfs.c 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6.15-rc3-ed/security/selinux/selinuxfs.c 2005-11-29 15:01:09.000000000 +0100
@@ -889,7 +889,7 @@
spin_lock(&dcache_lock);
node = de->d_subdirs.next;
while (node != &de->d_subdirs) {
- struct dentry *d = list_entry(node, struct dentry, d_child);
+ struct dentry *d = list_entry(node, struct dentry, d_u.d_child);
list_del_init(node);

if (d->d_inode) {


Attachments:
shrink_dentry_struct.patch (14.43 kB)

2005-11-30 02:07:20

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] shrinks dentry struct

Eric,

Would the following accomplish the same thing as your patch, to shrink
UP dentry structs back to 128 bytes, with a smaller and less intrusive
patch?

---

include/linux/dcache.h | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

--- 2.6.15-rc2-mm1.orig/include/linux/dcache.h 2005-11-29 17:45:51.977352268 -0800
+++ 2.6.15-rc2-mm1/include/linux/dcache.h 2005-11-29 18:04:59.151307979 -0800
@@ -95,19 +95,24 @@ struct dentry {
struct qstr d_name;

struct list_head d_lru; /* LRU list */
- struct list_head d_child; /* child of parent list */
+ union { /* Fit 32 bit UP dentry in 128 bytes */
+ struct list_head du_child; /* child of parent list */
+ struct rcu_head du_rcu;
+ } d_du;
struct list_head d_subdirs; /* our children */
struct list_head d_alias; /* inode alias list */
unsigned long d_time; /* used by d_revalidate */
struct dentry_operations *d_op;
struct super_block *d_sb; /* The root of the dentry tree */
void *d_fsdata; /* fs-specific data */
- struct rcu_head d_rcu;
struct dcookie_struct *d_cookie; /* cookie, if any */
int d_mounted;
unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
};

+#define d_child d_du.du_child
+#define d_rcu d_du.du_rcu
+
struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
int (*d_hash) (struct dentry *, struct qstr *);


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-30 02:14:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] shrinks dentry struct

Paul Jackson <[email protected]> wrote:
>
> Would the following accomplish the same thing as your patch, to shrink
> UP dentry structs back to 128 bytes, with a smaller and less intrusive
> patch?
>
> ...
> - struct list_head d_child; /* child of parent list */
> + union { /* Fit 32 bit UP dentry in 128 bytes */
> + struct list_head du_child; /* child of parent list */
> + struct rcu_head du_rcu;
> + } d_du;
> ...
>
> +#define d_child d_du.du_child
> +#define d_rcu d_du.du_rcu

Yes, but it's better to just do the big edit, rather than letting these
little namespace crufties accumulate over time.

Even better would be to ditch gcc-2.95.x and use an anonymous union, but
Hugh won't let me ;)

2005-11-30 02:43:20

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH] shrinks dentry struct

Andrew wrote:
> Yes, but it's better to just do the big edit, rather than letting these
> little namespace crufties accumulate over time.

Ok.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-30 06:56:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] shrinks dentry struct

On Tue, 29 Nov 2005, Andrew Morton wrote:
> Yes, but it's better to just do the big edit, rather than letting these
> little namespace crufties accumulate over time.
>
> Even better would be to ditch gcc-2.95.x and use an anonymous union, but
> Hugh won't let me ;)

Oh, I let you, but not for me (I don't want the eternal blame).

Hugh

2005-11-30 14:41:43

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

2005/11/29, Steven Rostedt <[email protected]>:
>
> What you are showing, would probably show up by others if this were a
> vanilla kernel issue. I don't have an 8 way machine, just 2 way, but the
> vanilla kernel is being used on many 8 ways out there, so I think you are
> right that this _is_ a vserver issue.

Yeah, I guess so. I also noticed that running an older build (w/o ACPI
so it sees only 2 CPUs due to lack of HT - it's a dual Xeon HT machine
so there are 4 logical CPUs) seems a bit more stable, but it happens
there too.

>
> Unless, of course, that the vserver is producing an obscure race in the
> vanilla kernel that normal operations would seldom have. Just like the
> PREEMPT_RT patch has discovered many race conditions that were in the
> vanilla kernel but were not often a problem.
>

I'm not using preemption. What made me just stare in wonder was when I
added a check in do_task_stat at the very beginning to the effect of:

if (!task) {
printk(...);
return -ENOENT;
}

/* dereference task as usual */

I *still* got the oops (and no message got logged). So either it is
used before the entry point (there is an occurrence of
sizeof(task->comm) but that should be statically determined by the
compiler, right?) or it is set to NULL in some magical way between the
check and usage (yep, it's still a race but the window should be
smaller I think).

The only place I can find a proc_inode.task field set to NULL is in
proc_pid_make_inode(). However, it is set to the value of task
parameter just a few instructions later. Am I right? Or can
proc_pid_make_inode get passed a NULL pointer?

I'm lost. Any assistance will be invaluable.

Best regards,
Grzegorz Nosek

2005-11-30 15:14:26

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

On Wed, 2005-11-30 at 15:41 +0100, Grzegorz Nosek wrote:
>
> I'm lost. Any assistance will be invaluable.

OK, Remove your patches, run the system where you can capture the log,
and provide a full output of the oops. Make sure you have
CONFIG_KALLSYMS set.

-- Steve


2005-11-30 15:29:45

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

2005/11/30, Steven Rostedt <[email protected]>:
>
> OK, Remove your patches, run the system where you can capture the log,
> and provide a full output of the oops. Make sure you have
> CONFIG_KALLSYMS set.
>

OK, attached an oops from netconsole.

TIA,
Grzegorz Nosek


Attachments:
(No filename) (272.00 B)
oops.s35 (8.23 kB)
Download all attachments

2005-11-30 16:26:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

(Andrew, this will be the last email that I include you on. I'm taking
you off unless you want to stay on this thread, and say so. I figure
that you get enough spam without having to read through this. I'll
obviously add you back if this results in a patch.)

On Wed, 2005-11-30 at 16:29 +0100, Grzegorz Nosek wrote:
> 2005/11/30, Steven Rostedt <[email protected]>:
> >
> > OK, Remove your patches, run the system where you can capture the log,
> > and provide a full output of the oops. Make sure you have
> > CONFIG_KALLSYMS set.
> >
>
> OK, attached an oops from netconsole.
>

The oops happened at address a01b50eb. Could you go into the compiled
directory run gdb on vmlinux and type li *0xa01b50eb and show what you
get.

For example:

~/work/ernie/linux-2.6.15-rc2-git5$ gdb vmlinux
GNU gdb 6.3-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-linux"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) li *0xc01019e0
0xc01019e0 is in sys_clone (arch/i386/kernel/process.c:768).
763 {
764 return do_fork(SIGCHLD, regs.esp, &regs, 0, NULL, NULL);
765 }
766
767 asmlinkage int sys_clone(struct pt_regs regs)
768 {
769 unsigned long clone_flags;
770 unsigned long newsp;
771 int __user *parent_tidptr, *child_tidptr;
772
(gdb)

Obviously, use "li *0xa01b50eb" instead of 0xc01019e0.

Now if you get an error in starting gdb like:

$ gdb vmlinux
GNU gdb 6.3.90_20051119-debian
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i486-linux-gnu"...(no debugging symbols found)
Using host libthread_db library "/lib/tls/libthread_db.so.1".

(gdb) li *0xa01b50eb
No symbol table is loaded. Use the "file" command.
(gdb)

Notice the (no debugging symbols found). Then start up make menuconfig,
goto "Kernel Hacking", set "Kernel Debugging" and "Compile the kernel
with debug info". And try again. It may also be helpful to have
"Compile the kernel with frame pointers" also set. If you do this, you
will probably need to use something other than the 0xa01b50eb. Look at
the output of the oops and see the following:

Nov 27 00:15:26 s35 [43281574.240000] CPU: 1
Nov 27 00:15:26 s35 [43281574.240000] EIP: 0060:[<a01b50eb>] Not tainted VLI
Nov 27 00:15:26 s35 [43281574.240000] EFLAGS: 00010257 (2.6.14.2amd64smp.17)

That EIP is what I want.

-- Steve


2005-11-30 17:23:14

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

2005/11/30, Steven Rostedt <[email protected]>:
> (Andrew, this will be the last email that I include you on. I'm taking
> you off unless you want to stay on this thread, and say so. I figure
> that you get enough spam without having to read through this. I'll
> obviously add you back if this results in a patch.)

(removed Andrew from the CC as well)

>
> On Wed, 2005-11-30 at 16:29 +0100, Grzegorz Nosek wrote:
> > 2005/11/30, Steven Rostedt <[email protected]>:
> > >
> > > OK, Remove your patches, run the system where you can capture the log,
> > > and provide a full output of the oops. Make sure you have
> > > CONFIG_KALLSYMS set.
> > >
> >
> > OK, attached an oops from netconsole.
> >
>
> The oops happened at address a01b50eb. Could you go into the compiled
> directory run gdb on vmlinux and type li *0xa01b50eb and show what you
> get.
>

OK, will send it as soon as I get my hands on it (I'm building a new
kernel at the moment with full debug info). In the meantime, if you
have a copy of fs/proc/array.o handy, have a look at do_task_stat
dissassembly and search for movzbl (%eax), %eax. Regardless of my
kernel config, architecture or whatever, the oops is in that
instruction (clearly a NULL pointer dereference). From some previous
debug build I found out (via objdump -dl) that it's apparently at the
entry point of the get_task_stat inline function.

Best regards,
Grzegorz Nosek

2005-12-01 20:38:03

by Grzegorz Nosek

[permalink] [raw]
Subject: Re: [PATCH] race condition in procfs

> > 2005/11/30, Steven Rostedt <[email protected]>:
> >
> > The oops happened at address a01b50eb. Could you go into the compiled
> > directory run gdb on vmlinux and type li *0xa01b50eb and show what you
> > get.

It turned out to be a bug in the vserver patches. Sent to maintainer.
As it's not a mainline issue, I'm not bothering you any more. Thanks
for debugging tips :)

Best regards,
Grzegorz Nosek

2005-12-03 01:16:12

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] remove unused blkp field in percpu_data

--- linux-2.6/include/linux/percpu.h 2005-11-29 04:51:27.000000000 +0100
+++ linux-2.6-ed/linux/percpu.h 2005-12-03 01:57:23.000000000 +0100
@@ -19,7 +19,6 @@

struct percpu_data {
void *ptrs[NR_CPUS];
- void *blkp;
};

/*


Attachments:
percpu_data.patch (231.00 B)

2005-12-13 18:03:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] shrinks dentry struct

On Tue, Nov 29, 2005 at 04:22:00PM +0100, Eric Dumazet wrote:
> Hi Andrew
>
> Could you add this patch to mm ?
>
> Thank you
>
> [PATCH] shrinks dentry struct
>
> Some long time ago, dentry struct was carefully tuned so that on 32 bits
> UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple
> of memory cache lines.
>
> Then RCU was added and dentry struct enlarged by two pointers, with nice
> results for SMP, but not so good on UP, because breaking the above tuning
> (128 + 8 = 136 bytes)
>
> This patch reverts this unwanted side effect, by using an union (d_u),
> where d_rcu and d_child are placed so that these two fields can share their
> memory needs.
>
> At the time d_free() is called (and d_rcu is really used), d_child is known
> to be empty and not touched by the dentry freeing.
>
> Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so
> the previous content of d_child is not needed if said dentry was unhashed
> but still accessed by a CPU because of RCU constraints)
>
> As dentry cache easily contains millions of entries, a size reduction is
> worth the extra complexity of the ugly C union.

Looks sound to me! Some opportunities for simplification below.

(Please accept my apologies for the delay -- some diversions turned out
to be more consuming than I had expected.)

Thanx, Paul

> Signed-off-by: Eric Dumazet <[email protected]>
>

> --- linux-2.6.15-rc3/include/linux/dcache.h 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/include/linux/dcache.h 2005-11-29 12:04:54.000000000 +0100
> @@ -95,14 +95,19 @@
> struct qstr d_name;
>
> struct list_head d_lru; /* LRU list */
> - struct list_head d_child; /* child of parent list */
> + /*
> + * d_child and d_rcu can share memory
> + */
> + union {
> + struct list_head d_child; /* child of parent list */
> + struct rcu_head d_rcu;
> + } d_u;
> struct list_head d_subdirs; /* our children */
> struct list_head d_alias; /* inode alias list */
> unsigned long d_time; /* used by d_revalidate */
> struct dentry_operations *d_op;
> struct super_block *d_sb; /* The root of the dentry tree */
> void *d_fsdata; /* fs-specific data */
> - struct rcu_head d_rcu;
> struct dcookie_struct *d_cookie; /* cookie, if any */
> int d_mounted;
> unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* small names */
> --- linux-2.6.15-rc3/fs/dcache.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/dcache.c 2005-11-29 15:07:10.000000000 +0100
> @@ -71,7 +71,7 @@
>
> static void d_callback(struct rcu_head *head)
> {
> - struct dentry * dentry = container_of(head, struct dentry, d_rcu);
> + struct dentry * dentry = container_of(head, struct dentry, d_u.d_rcu);
>
> if (dname_external(dentry))
> kfree(dentry->d_name.name);
> @@ -86,7 +86,7 @@
> {
> if (dentry->d_op && dentry->d_op->d_release)
> dentry->d_op->d_release(dentry);
> - call_rcu(&dentry->d_rcu, d_callback);
> + call_rcu(&dentry->d_u.d_rcu, d_callback);
> }
>
> /*
> @@ -193,7 +193,7 @@
> list_del(&dentry->d_lru);
> dentry_stat.nr_unused--;
> }
> - list_del(&dentry->d_child);
> + list_del(&dentry->d_u.d_child);
> dentry_stat.nr_dentry--; /* For d_free, below */
> /*drops the locks, at that point nobody can reach this dentry */
> dentry_iput(dentry);
> @@ -367,7 +367,7 @@
> struct dentry * parent;
>
> __d_drop(dentry);
> - list_del(&dentry->d_child);
> + list_del(&dentry->d_u.d_child);
> dentry_stat.nr_dentry--; /* For d_free, below */
> dentry_iput(dentry);
> parent = dentry->d_parent;
> @@ -518,7 +518,7 @@
> resume:
> while (next != &this_parent->d_subdirs) {
> struct list_head *tmp = next;
> - struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
> + struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
> next = tmp->next;
> /* Have we found a mount point ? */
> if (d_mountpoint(dentry))
> @@ -532,7 +532,7 @@
> * All done at this level ... ascend and resume the search.
> */
> if (this_parent != parent) {
> - next = this_parent->d_child.next;
> + next = this_parent->d_u.d_child.next;
> this_parent = this_parent->d_parent;
> goto resume;
> }
> @@ -569,7 +569,7 @@
> resume:
> while (next != &this_parent->d_subdirs) {
> struct list_head *tmp = next;
> - struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
> + struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
> next = tmp->next;
>
> if (!list_empty(&dentry->d_lru)) {
> @@ -610,7 +610,7 @@
> * All done at this level ... ascend and resume the search.
> */
> if (this_parent != parent) {
> - next = this_parent->d_child.next;
> + next = this_parent->d_u.d_child.next;
> this_parent = this_parent->d_parent;
> #ifdef DCACHE_DEBUG
> printk(KERN_DEBUG "select_parent: ascending to %s/%s, found=%d\n",
> @@ -753,12 +753,12 @@
> dentry->d_parent = dget(parent);
> dentry->d_sb = parent->d_sb;
> } else {
> - INIT_LIST_HEAD(&dentry->d_child);
> + INIT_LIST_HEAD(&dentry->d_u.d_child);
> }
>
> spin_lock(&dcache_lock);
> if (parent)
> - list_add(&dentry->d_child, &parent->d_subdirs);
> + list_add(&dentry->d_u.d_child, &parent->d_subdirs);
> dentry_stat.nr_dentry++;
> spin_unlock(&dcache_lock);
>
> @@ -1310,8 +1310,8 @@
> /* Unhash the target: dput() will then get rid of it */
> __d_drop(target);
>
> - list_del(&dentry->d_child);
> - list_del(&target->d_child);
> + list_del(&dentry->d_u.d_child);
> + list_del(&target->d_u.d_child);
>
> /* Switch the names.. */
> switch_names(dentry, target);
> @@ -1322,15 +1322,15 @@
> if (IS_ROOT(dentry)) {
> dentry->d_parent = target->d_parent;
> target->d_parent = target;
> - INIT_LIST_HEAD(&target->d_child);
> + INIT_LIST_HEAD(&target->d_u.d_child);
> } else {
> do_switch(dentry->d_parent, target->d_parent);
>
> /* And add them back to the (new) parent lists */
> - list_add(&target->d_child, &target->d_parent->d_subdirs);
> + list_add(&target->d_u.d_child, &target->d_parent->d_subdirs);
> }
>
> - list_add(&dentry->d_child, &dentry->d_parent->d_subdirs);
> + list_add(&dentry->d_u.d_child, &dentry->d_parent->d_subdirs);
> spin_unlock(&target->d_lock);
> spin_unlock(&dentry->d_lock);
> write_sequnlock(&rename_lock);
> @@ -1568,7 +1568,7 @@
> resume:
> while (next != &this_parent->d_subdirs) {
> struct list_head *tmp = next;
> - struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
> + struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
> next = tmp->next;
> if (d_unhashed(dentry)||!dentry->d_inode)
> continue;
> @@ -1579,7 +1579,7 @@
> atomic_dec(&dentry->d_count);
> }
> if (this_parent != root) {
> - next = this_parent->d_child.next;
> + next = this_parent->d_u.d_child.next;
> atomic_dec(&this_parent->d_count);
> this_parent = this_parent->d_parent;
> goto resume;
> --- linux-2.6.15-rc3/fs/autofs4/autofs_i.h 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/autofs4/autofs_i.h 2005-11-29 12:04:54.000000000 +0100
> @@ -209,7 +209,7 @@
> struct dentry *child;
> int ret = 0;
>
> - list_for_each_entry(child, &dentry->d_subdirs, d_child)
> + list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
> if (simple_positive(child))
> goto out;
> ret = 1;
> --- linux-2.6.15-rc3/fs/autofs4/expire.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/autofs4/expire.c 2005-11-29 12:04:54.000000000 +0100
> @@ -105,7 +105,7 @@
> next = this_parent->d_subdirs.next;
> resume:
> while (next != &this_parent->d_subdirs) {
> - struct dentry *dentry = list_entry(next, struct dentry, d_child);
> + struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>
> /* Negative dentry - give up */
> if (!simple_positive(dentry)) {
> @@ -138,7 +138,7 @@
> }
>
> if (this_parent != top) {
> - next = this_parent->d_child.next;
> + next = this_parent->d_u.d_child.next;
> this_parent = this_parent->d_parent;
> goto resume;
> }
> @@ -163,7 +163,7 @@
> next = this_parent->d_subdirs.next;
> resume:
> while (next != &this_parent->d_subdirs) {
> - struct dentry *dentry = list_entry(next, struct dentry, d_child);
> + struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>
> /* Negative dentry - give up */
> if (!simple_positive(dentry)) {
> @@ -199,7 +199,7 @@
> }
>
> if (this_parent != parent) {
> - next = this_parent->d_child.next;
> + next = this_parent->d_u.d_child.next;
> this_parent = this_parent->d_parent;
> goto resume;
> }
> @@ -238,7 +238,7 @@
> /* On exit from the loop expire is set to a dgot dentry
> * to expire or it's NULL */
> while ( next != &root->d_subdirs ) {
> - struct dentry *dentry = list_entry(next, struct dentry, d_child);
> + struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>
> /* Negative dentry - give up */
> if ( !simple_positive(dentry) ) {
> @@ -302,7 +302,7 @@
> expired, (int)expired->d_name.len, expired->d_name.name);
> spin_lock(&dcache_lock);
> list_del(&expired->d_parent->d_subdirs);
> - list_add(&expired->d_parent->d_subdirs, &expired->d_child);
> + list_add(&expired->d_parent->d_subdirs, &expired->d_u.d_child);
> spin_unlock(&dcache_lock);
> return expired;
> }
> --- linux-2.6.15-rc3/fs/autofs4/inode.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/autofs4/inode.c 2005-11-29 12:04:54.000000000 +0100
> @@ -91,7 +91,7 @@
> next = this_parent->d_subdirs.next;
> resume:
> while (next != &this_parent->d_subdirs) {
> - struct dentry *dentry = list_entry(next, struct dentry, d_child);
> + struct dentry *dentry = list_entry(next, struct dentry, d_u.d_child);
>
> /* Negative dentry - don`t care */
> if (!simple_positive(dentry)) {
> @@ -117,7 +117,7 @@
> if (this_parent != sbi->root) {
> struct dentry *dentry = this_parent;
>
> - next = this_parent->d_child.next;
> + next = this_parent->d_u.d_child.next;
> this_parent = this_parent->d_parent;
> spin_unlock(&dcache_lock);
> DPRINTK("parent dentry %p %.*s",
> --- linux-2.6.15-rc3/fs/coda/cache.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/coda/cache.c 2005-11-29 12:05:21.000000000 +0100
> @@ -93,7 +93,7 @@
> spin_lock(&dcache_lock);
> list_for_each(child, &parent->d_subdirs)
> {
> - de = list_entry(child, struct dentry, d_child);
> + de = list_entry(child, struct dentry, d_u.d_child);

The above list_entry() could be combined with the earlier list_for_each()
using list_for_each_entry().

> /* don't know what to do with negative dentries */
> if ( ! de->d_inode )
> continue;
> --- linux-2.6.15-rc3/fs/libfs.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/libfs.c 2005-11-29 14:58:51.000000000 +0100
> @@ -93,16 +93,16 @@
> loff_t n = file->f_pos - 2;
>
> spin_lock(&dcache_lock);
> - list_del(&cursor->d_child);
> + list_del(&cursor->d_u.d_child);
> p = file->f_dentry->d_subdirs.next;
> while (n && p != &file->f_dentry->d_subdirs) {
> struct dentry *next;
> - next = list_entry(p, struct dentry, d_child);
> + next = list_entry(p, struct dentry, d_u.d_child);

Should be possible to combine the list_entry() and the while() into
list_for_each_entry().

> if (!d_unhashed(next) && next->d_inode)
> n--;
> p = p->next;
> }
> - list_add_tail(&cursor->d_child, p);
> + list_add_tail(&cursor->d_u.d_child, p);
> spin_unlock(&dcache_lock);
> }
> }
> @@ -126,7 +126,7 @@
> {
> struct dentry *dentry = filp->f_dentry;
> struct dentry *cursor = filp->private_data;
> - struct list_head *p, *q = &cursor->d_child;
> + struct list_head *p, *q = &cursor->d_u.d_child;
> ino_t ino;
> int i = filp->f_pos;
>
> @@ -153,7 +153,7 @@
> }
> for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
> struct dentry *next;
> - next = list_entry(p, struct dentry, d_child);
> + next = list_entry(p, struct dentry, d_u.d_child);

Ditto...

> if (d_unhashed(next) || !next->d_inode)
> continue;
>
> @@ -261,7 +261,7 @@
> int ret = 0;
>
> spin_lock(&dcache_lock);
> - list_for_each_entry(child, &dentry->d_subdirs, d_child)
> + list_for_each_entry(child, &dentry->d_subdirs, d_u.d_child)
> if (simple_positive(child))
> goto out;
> ret = 1;
> --- linux-2.6.15-rc3/fs/ncpfs/dir.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/ncpfs/dir.c 2005-11-29 14:59:31.000000000 +0100
> @@ -365,7 +365,7 @@
> spin_lock(&dcache_lock);
> next = parent->d_subdirs.next;
> while (next != &parent->d_subdirs) {
> - dent = list_entry(next, struct dentry, d_child);
> + dent = list_entry(next, struct dentry, d_u.d_child);

Ditto...

> if ((unsigned long)dent->d_fsdata == fpos) {
> if (dent->d_inode)
> dget_locked(dent);
> --- linux-2.6.15-rc3/fs/ncpfs/ncplib_kernel.h 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/ncpfs/ncplib_kernel.h 2005-11-29 14:59:31.000000000 +0100
> @@ -196,7 +196,7 @@
> spin_lock(&dcache_lock);
> next = parent->d_subdirs.next;
> while (next != &parent->d_subdirs) {
> - dentry = list_entry(next, struct dentry, d_child);
> + dentry = list_entry(next, struct dentry, d_u.d_child);

Ditto...

>
> if (dentry->d_fsdata == NULL)
> ncp_age_dentry(server, dentry);
> @@ -218,7 +218,7 @@
> spin_lock(&dcache_lock);
> next = parent->d_subdirs.next;
> while (next != &parent->d_subdirs) {
> - dentry = list_entry(next, struct dentry, d_child);
> + dentry = list_entry(next, struct dentry, d_u.d_child);

Ditto...

> dentry->d_fsdata = NULL;
> ncp_age_dentry(server, dentry);
> next = next->next;
> --- linux-2.6.15-rc3/fs/smbfs/cache.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/fs/smbfs/cache.c 2005-11-29 14:59:47.000000000 +0100
> @@ -66,7 +66,7 @@
> spin_lock(&dcache_lock);
> next = parent->d_subdirs.next;
> while (next != &parent->d_subdirs) {
> - dentry = list_entry(next, struct dentry, d_child);
> + dentry = list_entry(next, struct dentry, d_u.d_child);

Ditto...

> dentry->d_fsdata = NULL;
> smb_age_dentry(server, dentry);
> next = next->next;
> @@ -100,7 +100,7 @@
> spin_lock(&dcache_lock);
> next = parent->d_subdirs.next;
> while (next != &parent->d_subdirs) {
> - dent = list_entry(next, struct dentry, d_child);
> + dent = list_entry(next, struct dentry, d_u.d_child);

Ditto...

> if ((unsigned long)dent->d_fsdata == fpos) {
> if (dent->d_inode)
> dget_locked(dent);
> --- linux-2.6.15-rc3/kernel/cpuset.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/kernel/cpuset.c 2005-11-29 15:01:01.000000000 +0100
> @@ -304,7 +304,7 @@
> spin_lock(&dcache_lock);
> node = dentry->d_subdirs.next;
> while (node != &dentry->d_subdirs) {
> - struct dentry *d = list_entry(node, struct dentry, d_child);
> + struct dentry *d = list_entry(node, struct dentry, d_u.d_child);

Ditto...

> list_del_init(node);
> if (d->d_inode) {
> d = dget_locked(d);
> @@ -316,7 +316,7 @@
> }
> node = dentry->d_subdirs.next;
> }
> - list_del_init(&dentry->d_child);
> + list_del_init(&dentry->d_u.d_child);
> spin_unlock(&dcache_lock);
> remove_dir(dentry);
> }
> --- linux-2.6.15-rc3/drivers/usb/core/inode.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/drivers/usb/core/inode.c 2005-11-29 12:04:54.000000000 +0100
> @@ -186,7 +186,7 @@
>
> down(&bus->d_inode->i_sem);
>
> - list_for_each_entry(dev, &bus->d_subdirs, d_child)
> + list_for_each_entry(dev, &bus->d_subdirs, d_u.d_child)
> if (dev->d_inode)
> update_dev(dev);
>
> @@ -203,7 +203,7 @@
>
> down(&root->d_inode->i_sem);
>
> - list_for_each_entry(bus, &root->d_subdirs, d_child) {
> + list_for_each_entry(bus, &root->d_subdirs, d_u.d_child) {
> if (bus->d_inode) {
> switch (S_IFMT & bus->d_inode->i_mode) {
> case S_IFDIR:
> @@ -319,7 +319,7 @@
> spin_lock(&dcache_lock);
>
> list_for_each(list, &dentry->d_subdirs) {
> - struct dentry *de = list_entry(list, struct dentry, d_child);
> + struct dentry *de = list_entry(list, struct dentry, d_u.d_child);

The list_entry() and list_for_each() could be combined.

> if (usbfs_positive(de)) {
> spin_unlock(&dcache_lock);
> return 0;
> --- linux-2.6.15-rc3/net/sunrpc/rpc_pipe.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/net/sunrpc/rpc_pipe.c 2005-11-29 15:01:01.000000000 +0100
> @@ -492,7 +492,7 @@
> repeat:
> spin_lock(&dcache_lock);
> list_for_each_safe(pos, next, &parent->d_subdirs) {
> - dentry = list_entry(pos, struct dentry, d_child);
> + dentry = list_entry(pos, struct dentry, d_u.d_child);
> spin_lock(&dentry->d_lock);
> if (!d_unhashed(dentry)) {
> dget_locked(dentry);
> --- linux-2.6.15-rc3/security/selinux/selinuxfs.c 2005-11-29 04:51:27.000000000 +0100
> +++ linux-2.6.15-rc3-ed/security/selinux/selinuxfs.c 2005-11-29 15:01:09.000000000 +0100
> @@ -889,7 +889,7 @@
> spin_lock(&dcache_lock);
> node = de->d_subdirs.next;
> while (node != &de->d_subdirs) {
> - struct dentry *d = list_entry(node, struct dentry, d_child);
> + struct dentry *d = list_entry(node, struct dentry, d_u.d_child);

The list_entry() and while() could be combined.

> list_del_init(node);
>
> if (d->d_inode) {

2005-12-13 18:25:10

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] shrinks dentry struct

Paul E. McKenney a ?crit :
> On Tue, Nov 29, 2005 at 04:22:00PM +0100, Eric Dumazet wrote:
>
>>Hi Andrew
>>
>>Could you add this patch to mm ?
>>
>>Thank you
>>
>>[PATCH] shrinks dentry struct
>>
>>Some long time ago, dentry struct was carefully tuned so that on 32 bits
>>UP, sizeof(struct dentry) was exactly 128, ie a power of 2, and a multiple
>>of memory cache lines.
>>
>>Then RCU was added and dentry struct enlarged by two pointers, with nice
>>results for SMP, but not so good on UP, because breaking the above tuning
>>(128 + 8 = 136 bytes)
>>
>>This patch reverts this unwanted side effect, by using an union (d_u),
>>where d_rcu and d_child are placed so that these two fields can share their
>>memory needs.
>>
>>At the time d_free() is called (and d_rcu is really used), d_child is known
>>to be empty and not touched by the dentry freeing.
>>
>>Lockless lookups only access d_name, d_parent, d_lock, d_op, d_flags (so
>>the previous content of d_child is not needed if said dentry was unhashed
>>but still accessed by a CPU because of RCU constraints)
>>
>>As dentry cache easily contains millions of entries, a size reduction is
>>worth the extra complexity of the ugly C union.
>
>
> Looks sound to me! Some opportunities for simplification below.
>
> (Please accept my apologies for the delay -- some diversions turned out
> to be more consuming than I had expected.)
>
> Thanx, Paul
>

Hi Paul

My patch only address the layout of dentry structure, basically a 'global
substitute' on various places.

Adding some 'optimizations' or simplifications was not the goal, so please
submit a patch if you feel the need for it :)

Thank you

Eric