2007-06-11 05:01:22

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

Hello, all.

Currently, there are several race conditions around dentry/inode
reclamation.

a. sysfs_dirent->s_dentry dereferencing in sysfs_readdir()

b. sysfs_dirent->s_dentry dereferencing in sysfs_drop_dentry()

c. sysfs_dirent->s_dentry clearing in sysfs_d_iput()

All aboves are done without synchronization and can cause oops if the
timing is right (or wrong).

These race conditions are difficult to trigger but with the attached
patch (sysfs-races.patch) and the following commands running
parallelly, all three are reliably reproducible (you may have to
change timings or disable others to trigger specific one).

1. while true; do insmod drivers/ata/libata.ko; insmod drivers/ata/ata_piix.ko; sleep 1; rmmod ata_piix; rmmod libata; sleep 1; echo -n . ; done
2. while true; do find /sys -type f | xargs cat > /dev/null; echo -n .; sleep 1; done
3. while true; do find /sys/class/scsi_disk -type f | sort | xargs cat > /dev/null; echo -n .; sleep 1; done
4. while true; do umount /sys; sleep 1; mount /sys; sleep 1; echo -n .; done

#1 assumes there are several devices attached to ata_piix controller.
Change #1 to any module or command which creates and removes sysfs
nodes repeatedly and adjust #3 to cat those sysfs nodes.

All known race conditions are fixed in the current -mm. #a is
replaced by adding sd->s_ino and allocating unique ino with ida. #b
and #c are fixed during sysfs_drop_dentry() rewrite. However, those
changes were too big to apply to 2.6.22-rcX or any stable branches.

This patchset contains three minimal backports of fixes in -mm. With
all patches in the patchset and sysfs-races.patch applied, kernel
survived ~20 hours of stress test without any problem.

Thanks.

--
tejun


Attachments:
(No filename) (1.68 kB)
sysfs-races.patch (1.16 kB)
Download all attachments

2007-06-11 05:03:10

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/3] sysfs: store sysfs inode nrs in s_ino to avoid readdir oopses

From: Eric Sandeen <[email protected]>

Backport of
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc1/2.6.22-rc1-mm1/broken-out/gregkh-driver-sysfs-allocate-inode-number-using-ida.patch

For regular files in sysfs, sysfs_readdir wants to traverse
sysfs_dirent->s_dentry->d_inode->i_ino to get to the inode number.
But, the dentry can be reclaimed under memory pressure, and there is
no synchronization with readdir. This patch follows Tejun's scheme of
allocating and storing an inode number in the new s_ino member of a
sysfs_dirent, when dirents are created, and retrieving it from there
for readdir, so that the pointer chain doesn't have to be traversed.

Tejun's upstream patch uses a new-ish "ida" allocator which brings
along some extra complexity; this -stable patch has a brain-dead
incrementing counter which does not guarantee uniqueness, but because
sysfs doesn't hash inodes as iunique expects, uniqueness wasn't
guaranteed today anyway.

Signed-off-by: Eric Sandeen <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/sysfs/dir.c | 16 +++++++++++-----
fs/sysfs/inode.c | 1 +
fs/sysfs/mount.c | 1 +
fs/sysfs/sysfs.h | 1 +
4 files changed, 14 insertions(+), 5 deletions(-)

Index: work1/fs/sysfs/dir.c
===================================================================
--- work1.orig/fs/sysfs/dir.c
+++ work1/fs/sysfs/dir.c
@@ -30,6 +30,14 @@ static struct dentry_operations sysfs_de
.d_iput = sysfs_d_iput,
};

+static unsigned int sysfs_inode_counter;
+ino_t sysfs_get_inum(void)
+{
+ if (unlikely(sysfs_inode_counter < 3))
+ sysfs_inode_counter = 3;
+ return sysfs_inode_counter++;
+}
+
/*
* Allocates a new sysfs_dirent and links it to the parent sysfs_dirent
*/
@@ -41,6 +49,7 @@ static struct sysfs_dirent * __sysfs_new
if (!sd)
return NULL;

+ sd->s_ino = sysfs_get_inum();
atomic_set(&sd->s_count, 1);
atomic_set(&sd->s_event, 1);
INIT_LIST_HEAD(&sd->s_children);
@@ -509,7 +518,7 @@ static int sysfs_readdir(struct file * f

switch (i) {
case 0:
- ino = dentry->d_inode->i_ino;
+ ino = parent_sd->s_ino;
if (filldir(dirent, ".", 1, i, ino, DT_DIR) < 0)
break;
filp->f_pos++;
@@ -538,10 +547,7 @@ static int sysfs_readdir(struct file * f

name = sysfs_get_name(next);
len = strlen(name);
- if (next->s_dentry)
- ino = next->s_dentry->d_inode->i_ino;
- else
- ino = iunique(sysfs_sb, 2);
+ ino = next->s_ino;

if (filldir(dirent, name, len, filp->f_pos, ino,
dt_type(next)) < 0)
Index: work1/fs/sysfs/inode.c
===================================================================
--- work1.orig/fs/sysfs/inode.c
+++ work1/fs/sysfs/inode.c
@@ -141,6 +141,7 @@ struct inode * sysfs_new_inode(mode_t mo
inode->i_mapping->a_ops = &sysfs_aops;
inode->i_mapping->backing_dev_info = &sysfs_backing_dev_info;
inode->i_op = &sysfs_inode_operations;
+ inode->i_ino = sd->s_ino;
lockdep_set_class(&inode->i_mutex, &sysfs_inode_imutex_key);

if (sd->s_iattr) {
Index: work1/fs/sysfs/mount.c
===================================================================
--- work1.orig/fs/sysfs/mount.c
+++ work1/fs/sysfs/mount.c
@@ -33,6 +33,7 @@ static struct sysfs_dirent sysfs_root =
.s_element = NULL,
.s_type = SYSFS_ROOT,
.s_iattr = NULL,
+ .s_ino = 1,
};

static void sysfs_clear_inode(struct inode *inode)
Index: work1/fs/sysfs/sysfs.h
===================================================================
--- work1.orig/fs/sysfs/sysfs.h
+++ work1/fs/sysfs/sysfs.h
@@ -5,6 +5,7 @@ struct sysfs_dirent {
void * s_element;
int s_type;
umode_t s_mode;
+ ino_t s_ino;
struct dentry * s_dentry;
struct iattr * s_iattr;
atomic_t s_event;

2007-06-11 05:04:22

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 3/3] sysfs: fix race condition around sd->s_dentry, take#2

Allowing attribute and symlink dentries to be reclaimed means
sd->s_dentry can change dynamically. However, updates to the field
are unsynchronized leading to race conditions. This patch adds
sysfs_lock and use it to synchronize updates to sd->s_dentry.

Due to the locking around ->d_iput, the check in sysfs_drop_dentry()
is complex. sysfs_lock only protect sd->s_dentry pointer itself. The
validity of the dentry is protected by dcache_lock, so whether dentry
is alive or not can only be tested while holding both locks.

This is minimal backport of sysfs_drop_dentry() rewrite in devel
branch.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/sysfs/dir.c | 22 ++++++++++++++++++++--
fs/sysfs/inode.c | 18 +++++++++++++++++-
fs/sysfs/sysfs.h | 1 +
3 files changed, 38 insertions(+), 3 deletions(-)

Index: work1/fs/sysfs/dir.c
===================================================================
--- work1.orig/fs/sysfs/dir.c
+++ work1/fs/sysfs/dir.c
@@ -13,14 +13,26 @@
#include "sysfs.h"

DECLARE_RWSEM(sysfs_rename_sem);
+spinlock_t sysfs_lock = SPIN_LOCK_UNLOCKED;

static void sysfs_d_iput(struct dentry * dentry, struct inode * inode)
{
struct sysfs_dirent * sd = dentry->d_fsdata;

if (sd) {
- BUG_ON(sd->s_dentry != dentry);
- sd->s_dentry = NULL;
+ /* sd->s_dentry is protected with sysfs_lock. This
+ * allows sysfs_drop_dentry() to dereference it.
+ */
+ spin_lock(&sysfs_lock);
+
+ /* The dentry might have been deleted or another
+ * lookup could have happened updating sd->s_dentry to
+ * point the new dentry. Ignore if it isn't pointing
+ * to this dentry.
+ */
+ if (sd->s_dentry == dentry)
+ sd->s_dentry = NULL;
+ spin_unlock(&sysfs_lock);
sysfs_put(sd);
}
iput(inode);
@@ -247,7 +259,10 @@ static int sysfs_attach_attr(struct sysf
}

dentry->d_fsdata = sysfs_get(sd);
+ /* protect sd->s_dentry against sysfs_d_iput */
+ spin_lock(&sysfs_lock);
sd->s_dentry = dentry;
+ spin_unlock(&sysfs_lock);
error = sysfs_create(dentry, (attr->mode & S_IALLUGO) | S_IFREG, init);
if (error) {
sysfs_put(sd);
@@ -269,7 +284,10 @@ static int sysfs_attach_link(struct sysf
int err = 0;

dentry->d_fsdata = sysfs_get(sd);
+ /* protect sd->s_dentry against sysfs_d_iput */
+ spin_lock(&sysfs_lock);
sd->s_dentry = dentry;
+ spin_unlock(&sysfs_lock);
err = sysfs_create(dentry, S_IFLNK|S_IRWXUGO, init_symlink);
if (!err) {
dentry->d_op = &sysfs_dentry_ops;
Index: work1/fs/sysfs/inode.c
===================================================================
--- work1.orig/fs/sysfs/inode.c
+++ work1/fs/sysfs/inode.c
@@ -246,9 +246,23 @@ static inline void orphan_all_buffers(st
*/
void sysfs_drop_dentry(struct sysfs_dirent * sd, struct dentry * parent)
{
- struct dentry * dentry = sd->s_dentry;
+ struct dentry *dentry = NULL;
struct inode *inode;

+ /* We're not holding a reference to ->s_dentry dentry but the
+ * field will stay valid as long as sysfs_lock is held.
+ */
+ spin_lock(&sysfs_lock);
+ spin_lock(&dcache_lock);
+
+ /* dget dentry if it's still alive */
+ if (sd->s_dentry && sd->s_dentry->d_inode)
+ dentry = dget_locked(sd->s_dentry);
+
+ spin_unlock(&dcache_lock);
+ spin_unlock(&sysfs_lock);
+
+ /* drop dentry */
if (dentry) {
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
@@ -268,6 +282,8 @@ void sysfs_drop_dentry(struct sysfs_dire
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
}
+
+ dput(dentry);
}
}

Index: work1/fs/sysfs/sysfs.h
===================================================================
--- work1.orig/fs/sysfs/sysfs.h
+++ work1/fs/sysfs/sysfs.h
@@ -33,6 +33,7 @@ extern const unsigned char * sysfs_get_n
extern void sysfs_drop_dentry(struct sysfs_dirent *sd, struct dentry *parent);
extern int sysfs_setattr(struct dentry *dentry, struct iattr *iattr);

+extern spinlock_t sysfs_lock;
extern struct rw_semaphore sysfs_rename_sem;
extern struct super_block * sysfs_sb;
extern const struct file_operations sysfs_dir_operations;

2007-06-11 05:13:46

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/3] sysfs: fix condition check in sysfs_drop_dentry()

The condition check doesn't make much sense as it basically always
succeeds. This causes NULL dereferencing on certain cases. It seems
that parentheses are put in the wrong place. Fix it.

Signed-off-by: Tejun Heo <[email protected]>
---
fs/sysfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: work1/fs/sysfs/inode.c
===================================================================
--- work1.orig/fs/sysfs/inode.c
+++ work1/fs/sysfs/inode.c
@@ -252,7 +252,7 @@ void sysfs_drop_dentry(struct sysfs_dire
if (dentry) {
spin_lock(&dcache_lock);
spin_lock(&dentry->d_lock);
- if (!(d_unhashed(dentry) && dentry->d_inode)) {
+ if (!d_unhashed(dentry) && dentry->d_inode) {
inode = dentry->d_inode;
spin_lock(&inode->i_lock);
__iget(inode);

2007-06-11 06:13:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

On Mon, 11 Jun 2007 14:01:07 +0900 Tejun Heo <[email protected]> wrote:

> Currently, there are several race conditions around dentry/inode
> reclamation.
>
> a. sysfs_dirent->s_dentry dereferencing in sysfs_readdir()
>
> b. sysfs_dirent->s_dentry dereferencing in sysfs_drop_dentry()
>
> c. sysfs_dirent->s_dentry clearing in sysfs_d_iput()
>
> All aboves are done without synchronization and can cause oops if the
> timing is right (or wrong).
>
> These race conditions are difficult to trigger but with the attached
> patch (sysfs-races.patch) and the following commands running
> parallelly, all three are reliably reproducible (you may have to
> change timings or disable others to trigger specific one).
>
> 1. while true; do insmod drivers/ata/libata.ko; insmod drivers/ata/ata_piix.ko; sleep 1; rmmod ata_piix; rmmod libata; sleep 1; echo -n . ; done
> 2. while true; do find /sys -type f | xargs cat > /dev/null; echo -n .; sleep 1; done
> 3. while true; do find /sys/class/scsi_disk -type f | sort | xargs cat > /dev/null; echo -n .; sleep 1; done
> 4. while true; do umount /sys; sleep 1; mount /sys; sleep 1; echo -n .; done
>
> #1 assumes there are several devices attached to ata_piix controller.
> Change #1 to any module or command which creates and removes sysfs
> nodes repeatedly and adjust #3 to cat those sysfs nodes.
>
> All known race conditions are fixed in the current -mm. #a is
> replaced by adding sd->s_ino and allocating unique ino with ida. #b
> and #c are fixed during sysfs_drop_dentry() rewrite. However, those
> changes were too big to apply to 2.6.22-rcX or any stable branches.
>
> This patchset contains three minimal backports of fixes in -mm. With
> all patches in the patchset and sysfs-races.patch applied, kernel
> survived ~20 hours of stress test without any problem.

So these are being proposed for 2.6.22?

I do wonder about Rafael's bug which he bisected down to
gregkh-driver-sysfs-use-singly-linked-list-for-sysfs_dirent-tree.patch.

If that won't be a problem in this patchset then I spose it's probably best
to go ahead with a 2.6.22 merge, but it's more a Greg thing than a me
thing.

I don't have a tree to merge these patches into, unless I drop all the
patches which are in Greg's tree.

Greg, can I leave it up to you to decide how we are to proceed here?

2007-06-11 06:16:04

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

Andrew Morton wrote:
>> This patchset contains three minimal backports of fixes in -mm. With
>> all patches in the patchset and sysfs-races.patch applied, kernel
>> survived ~20 hours of stress test without any problem.
>
> So these are being proposed for 2.6.22?

Yeap.

> I do wonder about Rafael's bug which he bisected down to
> gregkh-driver-sysfs-use-singly-linked-list-for-sysfs_dirent-tree.patch.
>
> If that won't be a problem in this patchset then I spose it's probably best
> to go ahead with a 2.6.22 merge, but it's more a Greg thing than a me
> thing.

I'm currently debugging that and it's irrelevant to these fixes. The
bug is introduced far after the fixes.

> I don't have a tree to merge these patches into, unless I drop all the
> patches which are in Greg's tree.
>
> Greg, can I leave it up to you to decide how we are to proceed here?

I can rebase all sysfs patches in -mm on top of linus#master + these
fixes if necessary.

Thanks.

--
tejun

2007-06-11 09:57:58

by Eduard-Gabriel Munteanu

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

*This message was transferred with a trial version of CommuniGate(r) Pro*
Tejun Heo wrote:

> This patchset contains three minimal backports of fixes in -mm. With
> all patches in the patchset and sysfs-races.patch applied, kernel
> survived ~20 hours of stress test without any problem.
>

Seriously, do you think adding a few sleeps fixes anything? It just
makes it harder to happen, and even harder to debug, so this is no good.
Race condition are fixed by proper locking, ordering and so on.

(No offence, this has to be the patch of the week. :) )

2007-06-11 10:03:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

Eduard-Gabriel Munteanu wrote:
> *This message was transferred with a trial version of CommuniGate(r) Pro*
> Tejun Heo wrote:
>
>> This patchset contains three minimal backports of fixes in -mm. With
>> all patches in the patchset and sysfs-races.patch applied, kernel
>> survived ~20 hours of stress test without any problem.
>>
>
> Seriously, do you think adding a few sleeps fixes anything? It just
> makes it harder to happen, and even harder to debug, so this is no good.
> Race condition are fixed by proper locking, ordering and so on.
>
> (No offence, this has to be the patch of the week. :) )

Dude, what are you smoking and can I get some? The attached patch is to
trigger the race conditions more easily for verification. Actual fixes
are in the three patches posted as reply to the head message.

--
tejun

2007-06-11 10:14:01

by Eduard-Gabriel Munteanu

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

*This message was transferred with a trial version of CommuniGate(r) Pro*
Tejun Heo wrote:
> Dude, what are you smoking and can I get some? The attached patch is to
> trigger the race conditions more easily for verification. Actual fixes
> are in the three patches posted as reply to the head message.

Sorry, I thought that was meant to be a fix.

2007-06-12 00:54:28

by Greg KH

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

On Mon, Jun 11, 2007 at 03:15:39PM +0900, Tejun Heo wrote:
> Andrew Morton wrote:
> >> This patchset contains three minimal backports of fixes in -mm. With
> >> all patches in the patchset and sysfs-races.patch applied, kernel
> >> survived ~20 hours of stress test without any problem.
> >
> > So these are being proposed for 2.6.22?
>
> Yeap.
>
> > I do wonder about Rafael's bug which he bisected down to
> > gregkh-driver-sysfs-use-singly-linked-list-for-sysfs_dirent-tree.patch.
> >
> > If that won't be a problem in this patchset then I spose it's probably best
> > to go ahead with a 2.6.22 merge, but it's more a Greg thing than a me
> > thing.
>
> I'm currently debugging that and it's irrelevant to these fixes. The
> bug is introduced far after the fixes.
>
> > I don't have a tree to merge these patches into, unless I drop all the
> > patches which are in Greg's tree.
> >
> > Greg, can I leave it up to you to decide how we are to proceed here?

Ok, I'll test them out, and if look sane pass them to Linus.

> I can rebase all sysfs patches in -mm on top of linus#master + these
> fixes if necessary.

Yeah, I'll need that if these look good enough, otherwise my tree will
stop applying :)

Give me some time tonight to do this...

thanks,

greg k-h

2007-06-12 23:47:58

by Greg KH

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

On Mon, Jun 11, 2007 at 03:15:39PM +0900, Tejun Heo wrote:
> Andrew Morton wrote:
> >> This patchset contains three minimal backports of fixes in -mm. With
> >> all patches in the patchset and sysfs-races.patch applied, kernel
> >> survived ~20 hours of stress test without any problem.
> >
> > So these are being proposed for 2.6.22?
>
> Yeap.
>
> > I do wonder about Rafael's bug which he bisected down to
> > gregkh-driver-sysfs-use-singly-linked-list-for-sysfs_dirent-tree.patch.
> >
> > If that won't be a problem in this patchset then I spose it's probably best
> > to go ahead with a 2.6.22 merge, but it's more a Greg thing than a me
> > thing.
>
> I'm currently debugging that and it's irrelevant to these fixes. The
> bug is introduced far after the fixes.
>
> > I don't have a tree to merge these patches into, unless I drop all the
> > patches which are in Greg's tree.
> >
> > Greg, can I leave it up to you to decide how we are to proceed here?
>
> I can rebase all sysfs patches in -mm on top of linus#master + these
> fixes if necessary.

Ok, I've sent these to Linus, so if he takes them, can you rebase your
patches on top of this and resend the whole tree to me (or just the ones
that needed to be modified, if that's easier.)

thanks again for this fix, I really appreciate it.

greg k-h

2007-06-13 04:51:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET 2.6.22-rc4] sysfs: fix race conditions

Greg KH wrote:
> On Mon, Jun 11, 2007 at 03:15:39PM +0900, Tejun Heo wrote:
>> Andrew Morton wrote:
>>>> This patchset contains three minimal backports of fixes in -mm. With
>>>> all patches in the patchset and sysfs-races.patch applied, kernel
>>>> survived ~20 hours of stress test without any problem.
>>> So these are being proposed for 2.6.22?
>> Yeap.
>>
>>> I do wonder about Rafael's bug which he bisected down to
>>> gregkh-driver-sysfs-use-singly-linked-list-for-sysfs_dirent-tree.patch.
>>>
>>> If that won't be a problem in this patchset then I spose it's probably best
>>> to go ahead with a 2.6.22 merge, but it's more a Greg thing than a me
>>> thing.
>> I'm currently debugging that and it's irrelevant to these fixes. The
>> bug is introduced far after the fixes.
>>
>>> I don't have a tree to merge these patches into, unless I drop all the
>>> patches which are in Greg's tree.
>>>
>>> Greg, can I leave it up to you to decide how we are to proceed here?
>> I can rebase all sysfs patches in -mm on top of linus#master + these
>> fixes if necessary.
>
> Ok, I've sent these to Linus, so if he takes them, can you rebase your
> patches on top of this and resend the whole tree to me (or just the ones
> that needed to be modified, if that's easier.)

Sure thing.

> thanks again for this fix, I really appreciate it.

:-)

--
tejun