2014-01-09 21:27:40

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()


While running stress tests on adding and deleting ftrace instances I
hit this bug:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
IP: [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160
PGD 63681067 PUD 7ddbe067 PMD 0
Oops: 0000 [#1] PREEMPT
CPU: 0 PID: 5634 Comm: ftrace-test-mki Not tainted 3.13.0-rc4-test-00033-gd2a6dde-dirty #20
Hardware name: /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
task: ffff880078375800 ti: ffff88007ddb0000 task.ti: ffff88007ddb0000
RIP: 0010:[<ffffffff812d8bc5>] [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160
RSP: 0018:ffff88007ddb1c48 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000800000 RCX: ffff88006dd43840
RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffff88006ee46000
RBP: ffff88007ddb1c88 R08: 0000000000000000 R09: ffff88007ddb1c54
R10: 6e6576652f6f6f66 R11: 0000000000000003 R12: 0000000000000000
R13: 0000000000000081 R14: ffff88006ee46000 R15: 0000000000000000
FS: 00007f217b5b6700(0000) GS:ffffffff81e21000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 0000000000000020 CR3: 000000006a0fe000 CR4: 00000000000007f0
Stack:
0000000000000081 ffff88006ee46000 0000000000000081 ffffffff812d8b45
ffff88006ee46000 0000000000000081 ffff880078375800 ffff880078375800
ffff88007ddb1c98 ffffffff812d358c ffff88007ddb1cb8 ffffffff811364f1
Call Trace:^M
[<ffffffff812d8b45>] ? selinux_inode_permission+0x5/0x160
[<ffffffff812d358c>] security_inode_permission+0x1c/0x30
[<ffffffff811364f1>] __inode_permission+0x41/0xa0
[<ffffffff81136568>] inode_permission+0x18/0x50
[<ffffffff811378b6>] link_path_walk+0x66/0x920
[<ffffffff810875a5>] ? __rcu_read_lock+0x5/0x20
[<ffffffff8113a9e6>] path_openat+0xa6/0x6c0
[<ffffffff8113b000>] ? path_openat+0x6c0/0x6c0
[<ffffffff810c4e69>] ? __trace_graph_entry+0x49/0xc0
[<ffffffff8112b196>] ? do_sys_open+0x146/0x240
[<ffffffff8113b043>] do_filp_open+0x43/0xa0
[<ffffffff8113b005>] ? do_filp_open+0x5/0xa0
[<ffffffff8112b196>] do_sys_open+0x146/0x240
[<ffffffff8112b2ae>] SyS_open+0x1e/0x20
[<ffffffff81948cd0>] system_call_fastpath+0x16/0x1b
Code: 84 a1 00 00 00 81 e3 00 20 00 00 89 d8 83 c8 02 40 f6 c6 04 0f 45 d8 40 f6 c6 08 74 71 80 cf 02 49 8b 46 38 4c 8d 4d cc 45 31 c0 <0f> b7 50 20 8b 70 1c 48 8b 41 70 89 d9 8b 78 04 e8 36 cf ff ff
RIP [<ffffffff812d8bc5>] selinux_inode_permission+0x85/0x160
RSP <ffff88007ddb1c48>
CR2: 0000000000000020
---[ end trace 9d800e5ac5059462 ]---


Investigating, I found that the inode->i_security was NULL, and the
dereference of it caused the oops.

in selinux_inode_permission():
----
isec = inode->i_security;

rc = avc_has_perm_noaudit(sid, isec->sid, isec->sclass, perms, 0, &avd);
----

Note, the crash came from stressing the deletion and reading of debugfs
files. I was not able to recreate this via normal files. But I'm not
sure they are safe. It may just be that the race window is much harder
to hit.

What seems to have happened (and what I have traced), is the file is
being opened at the same time the file or directory is being deleted.
As the dentry and inode locks are not held during the path walk, nor is
the inodes ref counts being incremented, there is nothing saving these
structures from being discarded except for an rcu_read_lock().

The rcu_read_lock() protects against freeing of the inode, but it does
not protect freeing of the inode_security_struct. Talking with Eric
Paris about this, it seems to be a generic issue with the
destroy_inode() calling security_inode_free() when the inode may still
be in use (protected by rcu). It seems that the true destruction of the
inode (done by __destroy_inode()) should also be protect by rcu.

Cc: [email protected]
Cc: Eric Paris <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>
---

This is based off of this thread:

https://lkml.org/lkml/2014/1/9/349

And perhaps is the true fix for:

http://oss.sgi.com/archives/xfs/2013-11/msg00709.html

diff --git a/fs/inode.c b/fs/inode.c
index 4bcdad3..a8f3b88 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -252,16 +252,17 @@ EXPORT_SYMBOL(__destroy_inode);
static void i_callback(struct rcu_head *head)
{
struct inode *inode = container_of(head, struct inode, i_rcu);
+ __destroy_inode(inode);
kmem_cache_free(inode_cachep, inode);
}

static void destroy_inode(struct inode *inode)
{
BUG_ON(!list_empty(&inode->i_lru));
- __destroy_inode(inode);
- if (inode->i_sb->s_op->destroy_inode)
+ if (inode->i_sb->s_op->destroy_inode) {
+ __destroy_inode(inode);
inode->i_sb->s_op->destroy_inode(inode);
- else
+ } else
call_rcu(&inode->i_rcu, i_callback);
}


2014-01-09 21:42:54

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, Jan 09, 2014 at 04:27:31PM -0500, Steven Rostedt wrote:
> Note, the crash came from stressing the deletion and reading of debugfs
> files. I was not able to recreate this via normal files. But I'm not
> sure they are safe. It may just be that the race window is much harder
> to hit.

But "normal" files have a 'destroy_inode' method. So you've basically
only fixed it for debugfs (and maybe a few other unusual filesystems).
Why doesn't the code look like this:

static void i_callback(struct rcu_head *head)
{
struct inode *inode = container_of(head, struct inode, i_rcu);
__destroy_inode(inode);
if (inode->i_sb->s_op->destroy_inode)
inode->i_sb->s_op->destroy_inode(inode);
else
kmem_cache_free(inode_cachep, inode);
}

static void destroy_inode(struct inode *inode)
{
BUG_ON(!list_empty(&inode->i_lru));
call_rcu(&inode->i_rcu, i_callback);
}

We'd then have to get rid of all the call_rcu() invocations in individual
filesystems' destroy_inode methods, but that doesn't sound like a bad
thing to me.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2014-01-09 21:50:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, 9 Jan 2014 14:42:39 -0700
Matthew Wilcox <[email protected]> wrote:

> On Thu, Jan 09, 2014 at 04:27:31PM -0500, Steven Rostedt wrote:
> > Note, the crash came from stressing the deletion and reading of debugfs
> > files. I was not able to recreate this via normal files. But I'm not
> > sure they are safe. It may just be that the race window is much harder
> > to hit.
>
> But "normal" files have a 'destroy_inode' method. So you've basically
> only fixed it for debugfs (and maybe a few other unusual filesystems).
> Why doesn't the code look like this:

Because I thought of that after I sent the email ;-)

Well, that's not really true. I don't know the semantics of the
destroy_inode() call. But I should have asked that in my change log.

>
> static void i_callback(struct rcu_head *head)
> {
> struct inode *inode = container_of(head, struct inode, i_rcu);
> __destroy_inode(inode);
> if (inode->i_sb->s_op->destroy_inode)
> inode->i_sb->s_op->destroy_inode(inode);
> else
> kmem_cache_free(inode_cachep, inode);
> }
>
> static void destroy_inode(struct inode *inode)
> {
> BUG_ON(!list_empty(&inode->i_lru));
> call_rcu(&inode->i_rcu, i_callback);
> }
>
> We'd then have to get rid of all the call_rcu() invocations in individual
> filesystems' destroy_inode methods, but that doesn't sound like a bad
> thing to me.
>

Which is another reason that I didn't do it, as I didn't know all the
happenings inside the ->destroy_inode() calls. But yeah, I agree with
this.

Also, can iput() sleep? If not then we are OK. Otherwise, we need to be
careful about any mutex being grabbed in those call backs, as the
rcu_callback can't sleep either.

-- Steve

2014-01-09 22:31:40

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, Jan 09, 2014 at 04:50:12PM -0500, Steven Rostedt wrote:

> > We'd then have to get rid of all the call_rcu() invocations in individual
> > filesystems' destroy_inode methods, but that doesn't sound like a bad
> > thing to me.

Check what e.g. XFS is doing...

> Which is another reason that I didn't do it, as I didn't know all the
> happenings inside the ->destroy_inode() calls. But yeah, I agree with
> this.
>
> Also, can iput() sleep? If not then we are OK. Otherwise, we need to be
> careful about any mutex being grabbed in those call backs, as the
> rcu_callback can't sleep either.

iput() definitely can sleep (that's when actual truncation and inode
freeing is done for opened-and-unlinked files - on the final iput() after
close()), but that' irrelevant here - fsnotify_delete_inode() grabs
a bunch of mutexes, which makes calling it from rcu callback no-go.

2014-01-09 23:11:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, Jan 10, 2014 at 06:41:03AM +0800, Linus Torvalds wrote:
> I think the sane short term fix is to make the kfree() of the i_security
> member be a rcu free, and not clear the member.

Interesting use case. ;-)

Thanx, Paul

> Not pretty, but should did this case..
>
> Linus
>
> On Jan 10, 2014 6:31 AM, "Al Viro" <[email protected]> wrote:
> >
> > iput() definitely can sleep (that's when actual truncation and inode
> > freeing is done for opened-and-unlinked files - on the final iput() after
> > close()), but that' irrelevant here - fsnotify_delete_inode() grabs
> > a bunch of mutexes, which makes calling it from rcu callback no-go.

2014-01-09 23:25:31

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, 10 Jan 2014 06:41:03 +0800
Linus Torvalds <[email protected]> wrote:

> I think the sane short term fix is to make the kfree() of the i_security
> member be a rcu free, and not clear the member.

You mean my first patch?

https://lkml.org/lkml/2014/1/9/349

-- Steve

>
> Not pretty, but should did this case..
>

2014-01-09 23:28:02

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, 9 Jan 2014 18:25:23 -0500
Steven Rostedt <[email protected]> wrote:

> On Fri, 10 Jan 2014 06:41:03 +0800
> Linus Torvalds <[email protected]> wrote:
>
> > I think the sane short term fix is to make the kfree() of the i_security
> > member be a rcu free, and not clear the member.
>
> You mean my first patch?
>
> https://lkml.org/lkml/2014/1/9/349
>

Oh wait, you said not to clear the member. Thus, the patch would look
like this:

Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-trace.git/security/selinux/hooks.c
===================================================================
--- linux-trace.git.orig/security/selinux/hooks.c
+++ linux-trace.git/security/selinux/hooks.c
@@ -234,6 +234,14 @@ static int inode_alloc_security(struct i
return 0;
}

+static void inode_free_rcu(struct rcu_head *head)
+{
+ struct inode_security_struct *isec;
+
+ isec = container_of(head, struct inode_security_struct, rcu);
+ kmem_cache_free(sel_inode_cache, isec);
+}
+
static void inode_free_security(struct inode *inode)
{
struct inode_security_struct *isec = inode->i_security;
@@ -244,8 +252,7 @@ static void inode_free_security(struct i
list_del_init(&isec->list);
spin_unlock(&sbsec->isec_lock);

- inode->i_security = NULL;
- kmem_cache_free(sel_inode_cache, isec);
+ call_rcu(&isec->rcu, inode_free_rcu);
}

static int file_alloc_security(struct file *file)
Index: linux-trace.git/security/selinux/include/objsec.h
===================================================================
--- linux-trace.git.orig/security/selinux/include/objsec.h
+++ linux-trace.git/security/selinux/include/objsec.h
@@ -38,7 +38,10 @@ struct task_security_struct {

struct inode_security_struct {
struct inode *inode; /* back pointer to inode object */
- struct list_head list; /* list of inode_security_struct */
+ union {
+ struct list_head list; /* list of inode_security_struct */
+ struct rcu_head rcu; /* for freeing the inode_security_struct */
+ };
u32 task_sid; /* SID of creating task */
u32 sid; /* SID of this object */
u16 sclass; /* security class of this object */

2014-01-09 23:46:03

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, 09 Jan 2014 18:37:06 -0500
Eric Paris <[email protected]> wrote:


> > Oh wait, you said not to clear the member. Thus, the patch would look
> > like this:
> >
> > Signed-off-by: Steven Rostedt <[email protected]>
>
> SMACK also needs this change somehow in smack_inode_free_security()
>
> but at least from an SELinux PoV, I think it's quick and easy, but wrong
> for maintainability...

I agree, but for this late in the -rc window, this may be the best
thing for now. We can think of a better solution for the future.

-- Steve

2014-01-09 23:37:46

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, 2014-01-09 at 18:27 -0500, Steven Rostedt wrote:
> On Thu, 9 Jan 2014 18:25:23 -0500
> Steven Rostedt <[email protected]> wrote:
>
> > On Fri, 10 Jan 2014 06:41:03 +0800
> > Linus Torvalds <[email protected]> wrote:
> >
> > > I think the sane short term fix is to make the kfree() of the i_security
> > > member be a rcu free, and not clear the member.
> >
> > You mean my first patch?
> >
> > https://lkml.org/lkml/2014/1/9/349
> >
>
> Oh wait, you said not to clear the member. Thus, the patch would look
> like this:
>
> Signed-off-by: Steven Rostedt <[email protected]>

SMACK also needs this change somehow in smack_inode_free_security()

but at least from an SELinux PoV, I think it's quick and easy, but wrong
for maintainability...


> Index: linux-trace.git/security/selinux/hooks.c
> ===================================================================
> --- linux-trace.git.orig/security/selinux/hooks.c
> +++ linux-trace.git/security/selinux/hooks.c
> @@ -234,6 +234,14 @@ static int inode_alloc_security(struct i
> return 0;
> }
>
> +static void inode_free_rcu(struct rcu_head *head)
> +{
> + struct inode_security_struct *isec;
> +
> + isec = container_of(head, struct inode_security_struct, rcu);
> + kmem_cache_free(sel_inode_cache, isec);
> +}
> +
> static void inode_free_security(struct inode *inode)
> {
> struct inode_security_struct *isec = inode->i_security;
> @@ -244,8 +252,7 @@ static void inode_free_security(struct i
> list_del_init(&isec->list);
> spin_unlock(&sbsec->isec_lock);
>
> - inode->i_security = NULL;
> - kmem_cache_free(sel_inode_cache, isec);
> + call_rcu(&isec->rcu, inode_free_rcu);
> }
>
> static int file_alloc_security(struct file *file)
> Index: linux-trace.git/security/selinux/include/objsec.h
> ===================================================================
> --- linux-trace.git.orig/security/selinux/include/objsec.h
> +++ linux-trace.git/security/selinux/include/objsec.h
> @@ -38,7 +38,10 @@ struct task_security_struct {
>
> struct inode_security_struct {
> struct inode *inode; /* back pointer to inode object */
> - struct list_head list; /* list of inode_security_struct */
> + union {
> + struct list_head list; /* list of inode_security_struct */
> + struct rcu_head rcu; /* for freeing the inode_security_struct */
> + };
> u32 task_sid; /* SID of creating task */
> u32 sid; /* SID of this object */
> u16 sclass; /* security class of this object */

2014-01-09 23:45:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, Jan 09, 2014 at 06:27:56PM -0500, Steven Rostedt wrote:
> On Thu, 9 Jan 2014 18:25:23 -0500
> Steven Rostedt <[email protected]> wrote:
>
> > On Fri, 10 Jan 2014 06:41:03 +0800
> > Linus Torvalds <[email protected]> wrote:
> >
> > > I think the sane short term fix is to make the kfree() of the i_security
> > > member be a rcu free, and not clear the member.
> >
> > You mean my first patch?
> >
> > https://lkml.org/lkml/2014/1/9/349
> >
>
> Oh wait, you said not to clear the member. Thus, the patch would look
> like this:
>
> Signed-off-by: Steven Rostedt <[email protected]>
>
> Index: linux-trace.git/security/selinux/hooks.c
> ===================================================================
> --- linux-trace.git.orig/security/selinux/hooks.c
> +++ linux-trace.git/security/selinux/hooks.c
> @@ -234,6 +234,14 @@ static int inode_alloc_security(struct i
> return 0;
> }
>
> +static void inode_free_rcu(struct rcu_head *head)
> +{
> + struct inode_security_struct *isec;
> +
> + isec = container_of(head, struct inode_security_struct, rcu);
> + kmem_cache_free(sel_inode_cache, isec);
> +}
> +
> static void inode_free_security(struct inode *inode)
> {
> struct inode_security_struct *isec = inode->i_security;
> @@ -244,8 +252,7 @@ static void inode_free_security(struct i
> list_del_init(&isec->list);
> spin_unlock(&sbsec->isec_lock);
>
> - inode->i_security = NULL;
> - kmem_cache_free(sel_inode_cache, isec);
> + call_rcu(&isec->rcu, inode_free_rcu);

Does not clearing ->i_security mean that RCU readers can traverse
this pointer after the invocation of call_rcu()? If so, this is
problematic. (If something else already prevents readers from getting
here, no problem.)

Thanx, Paul

> }
>
> static int file_alloc_security(struct file *file)
> Index: linux-trace.git/security/selinux/include/objsec.h
> ===================================================================
> --- linux-trace.git.orig/security/selinux/include/objsec.h
> +++ linux-trace.git/security/selinux/include/objsec.h
> @@ -38,7 +38,10 @@ struct task_security_struct {
>
> struct inode_security_struct {
> struct inode *inode; /* back pointer to inode object */
> - struct list_head list; /* list of inode_security_struct */
> + union {
> + struct list_head list; /* list of inode_security_struct */
> + struct rcu_head rcu; /* for freeing the inode_security_struct */
> + };
> u32 task_sid; /* SID of creating task */
> u32 sid; /* SID of this object */
> u16 sclass; /* security class of this object */
>

2014-01-09 23:53:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, Jan 10, 2014 at 7:37 AM, Eric Paris <[email protected]> wrote:
>
> but at least from an SELinux PoV, I think it's quick and easy, but wrong
> for maintainability...

Yeah, it's a hack, and it's wrong, and we should figure out how to do
it right. Likely we should just tie the lifetime of the i_security
member directly to the lifetime of the inode itself, and just make the
rule be that security_inode_free() gets called from whatever frees the
inode itself, and *not* have an extra rcu callback etc. But that
sounds like a bigger change than I'm comfy with right now, so the
hacky one might be the band-aid to do for stable..

The problem, of course, is that all the different filesystems have
their own inode allocations/freeing. Of course, they all tend to share
the same pattern ("call_rcu xyz_i_callback"), so maybe we could try to
make that a more generic thing? Like have a "free_inode" vfs callback,
and do the call_rcu delaying at the VFS level..

And maybe, just maybe, we could just say that that is what
"destroy_inode()" is, and that we will just call it from rcu context.
All the IO has hopefully been done earlier Yes/no?

Linus

2014-01-09 23:59:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, 9 Jan 2014 15:45:37 -0800
"Paul E. McKenney" <[email protected]> wrote:

> > static void inode_free_security(struct inode *inode)
> > {
> > struct inode_security_struct *isec = inode->i_security;
> > @@ -244,8 +252,7 @@ static void inode_free_security(struct i
> > list_del_init(&isec->list);
> > spin_unlock(&sbsec->isec_lock);
> >
> > - inode->i_security = NULL;
> > - kmem_cache_free(sel_inode_cache, isec);
> > + call_rcu(&isec->rcu, inode_free_rcu);
>
> Does not clearing ->i_security mean that RCU readers can traverse
> this pointer after the invocation of call_rcu()? If so, this is
> problematic. (If something else already prevents readers from getting
> here, no problem.)

This is called when we are about to free the inode. Look at
destroy_inode(). Basically, this is the same as doing:

call_rcu(&isec->rcu, inode_free_rcu);
call_rcu(&inode->i_rcu, i_callback);

Where i_callback() does the free of the inode.

If you can access inode->i_security, after a call_rcu, then you can
also access the inode itself that has just been freed.

Yes, technically, having two separate call_rcu(), the first grace
period can end before the second, but everything to remove the inode
from sight has already been set up before that first call_rcu() is
made. That means when the first call_rcu() is executed, the inode
should already be invisible to the readers.


- Steve

>
> Thanx, Paul
>
> > }
> >

2014-01-10 00:06:56

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, Jan 10, 2014 at 07:53:41AM +0800, Linus Torvalds wrote:
> On Fri, Jan 10, 2014 at 7:37 AM, Eric Paris <[email protected]> wrote:
> >
> > but at least from an SELinux PoV, I think it's quick and easy, but wrong
> > for maintainability...
>
> Yeah, it's a hack, and it's wrong, and we should figure out how to do
> it right. Likely we should just tie the lifetime of the i_security
> member directly to the lifetime of the inode itself, and just make the
> rule be that security_inode_free() gets called from whatever frees the
> inode itself, and *not* have an extra rcu callback etc. But that
> sounds like a bigger change than I'm comfy with right now, so the
> hacky one might be the band-aid to do for stable..
>
> The problem, of course, is that all the different filesystems have
> their own inode allocations/freeing. Of course, they all tend to share
> the same pattern ("call_rcu xyz_i_callback"), so maybe we could try to
> make that a more generic thing? Like have a "free_inode" vfs callback,
> and do the call_rcu delaying at the VFS level..
>
> And maybe, just maybe, we could just say that that is what
> "destroy_inode()" is, and that we will just call it from rcu context.
> All the IO has hopefully been done earlier Yes/no?

Check what XFS is doing ;-/ That's where those call_rcu() have come from.
Sure, we can separate the simple "just do call_rcu(...->free_inode)" case
and hit it whenever full ->free_inode is there and ->destroy_inode isn't.
Not too pretty, but removal of tons of boilerplate might be worth doing
that anyway. But ->destroy_inode() is still needed for cases where fs
has its own idea of inode lifetime rules. Again, check what XFS is doing
in that area...

There's an extra source of headache, BTW - what about the "LSM stacking"
crowd and their plans?

2014-01-10 00:09:27

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, Jan 10, 2014 at 12:06:42AM +0000, Al Viro wrote:

> and hit it whenever full ->free_inode is there and ->destroy_inode isn't.

Grr... "whenever ->free_inode is there and full ->destroy_inode isn't".
Apologies for sloppy editing...

2014-01-10 00:18:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, Jan 10, 2014 at 8:06 AM, Al Viro <[email protected]> wrote:
>
> Sure, we can separate the simple "just do call_rcu(...->free_inode)" case
> and hit it whenever full ->free_inode is there and ->destroy_inode isn't.
> Not too pretty, but removal of tons of boilerplate might be worth doing
> that anyway.

Yeah.

> But ->destroy_inode() is still needed for cases where fs
> has its own idea of inode lifetime rules. Again, check what XFS is doing
> in that area...

Ok, so we can't change destroy_inode, and we'd need to add a new op
for just freeing it.

Painful mainly because there are so many filesystems, but it shouldn't
be *complicated*.

> There's an extra source of headache, BTW - what about the "LSM stacking"
> crowd and their plans?

LSM stacking is a pipedream right now anyway, isn't it? It's been
talked about for years and years, I've never seen a patch-set that is
even remotely something we'd seriously consider.

Linus

2014-01-10 00:44:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Thu, Jan 09, 2014 at 06:59:07PM -0500, Steven Rostedt wrote:
> On Thu, 9 Jan 2014 15:45:37 -0800
> "Paul E. McKenney" <[email protected]> wrote:
>
> > > static void inode_free_security(struct inode *inode)
> > > {
> > > struct inode_security_struct *isec = inode->i_security;
> > > @@ -244,8 +252,7 @@ static void inode_free_security(struct i
> > > list_del_init(&isec->list);
> > > spin_unlock(&sbsec->isec_lock);
> > >
> > > - inode->i_security = NULL;
> > > - kmem_cache_free(sel_inode_cache, isec);
> > > + call_rcu(&isec->rcu, inode_free_rcu);
> >
> > Does not clearing ->i_security mean that RCU readers can traverse
> > this pointer after the invocation of call_rcu()? If so, this is
> > problematic. (If something else already prevents readers from getting
> > here, no problem.)
>
> This is called when we are about to free the inode. Look at
> destroy_inode(). Basically, this is the same as doing:
>
> call_rcu(&isec->rcu, inode_free_rcu);
> call_rcu(&inode->i_rcu, i_callback);
>
> Where i_callback() does the free of the inode.
>
> If you can access inode->i_security, after a call_rcu, then you can
> also access the inode itself that has just been freed.
>
> Yes, technically, having two separate call_rcu(), the first grace
> period can end before the second, but everything to remove the inode
> from sight has already been set up before that first call_rcu() is
> made. That means when the first call_rcu() is executed, the inode
> should already be invisible to the readers.

Got it, should be fine then, sorry for the noise.

Thanx, Paul

2014-01-10 02:36:59

by James Morris

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, 10 Jan 2014, Linus Torvalds wrote:

> > There's an extra source of headache, BTW - what about the "LSM stacking"
> > crowd and their plans?
>
> LSM stacking is a pipedream right now anyway, isn't it? It's been
> talked about for years and years, I've never seen a patch-set that is
> even remotely something we'd seriously consider.
>

And there's no good justification for it.


--
James Morris
<[email protected]>

2014-01-10 09:32:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, Jan 10, 2014 at 12:06:42AM +0000, Al Viro wrote:
> Check what XFS is doing ;-/ That's where those call_rcu() have come from.
> Sure, we can separate the simple "just do call_rcu(...->free_inode)" case
> and hit it whenever full ->free_inode is there and ->destroy_inode isn't.
> Not too pretty, but removal of tons of boilerplate might be worth doing
> that anyway. But ->destroy_inode() is still needed for cases where fs
> has its own idea of inode lifetime rules. Again, check what XFS is doing
> in that area...

Btw, I'd really love to get rid of the XFS ->destroy_inode abuse, it's
been a long time thorn in the flesh.

What's really needed there to make XFS behave more similar to everyone
else is a way for the filesystem to say: "I can't actually free this
inode right now, but I'll come back to you later". That's what we
actually do right now, except we pretend that the VFS inode gets freed,
while its memory lives on (punt intended).

2014-01-10 18:14:47

by Ben Myers

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

Christoph,

On Fri, Jan 10, 2014 at 01:31:48AM -0800, Christoph Hellwig wrote:
> On Fri, Jan 10, 2014 at 12:06:42AM +0000, Al Viro wrote:
> > Check what XFS is doing ;-/ That's where those call_rcu() have come from.
> > Sure, we can separate the simple "just do call_rcu(...->free_inode)" case
> > and hit it whenever full ->free_inode is there and ->destroy_inode isn't.
> > Not too pretty, but removal of tons of boilerplate might be worth doing
> > that anyway. But ->destroy_inode() is still needed for cases where fs
> > has its own idea of inode lifetime rules. Again, check what XFS is doing
> > in that area...
>
> Btw, I'd really love to get rid of the XFS ->destroy_inode abuse, it's
> been a long time thorn in the flesh.

I believe this behavior is related to freeing of an inode cluster.

> What's really needed there to make XFS behave more similar to everyone
> else is a way for the filesystem to say: "I can't actually free this
> inode right now, but I'll come back to you later".

This test might read something like: "If my link count has gone to zero, and I
am the last inode in my cluster to be freed, and there are other inodes from my
cluster incore, I cannot be freed."

Should be doable. Maybe there are other reasons.

-Ben

2014-01-11 10:32:46

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] vfs: Fix possible NULL pointer dereference in inode_permission()

On Fri, Jan 10, 2014 at 12:14:34PM -0600, Ben Myers wrote:
> > What's really needed there to make XFS behave more similar to everyone
> > else is a way for the filesystem to say: "I can't actually free this
> > inode right now, but I'll come back to you later".
>
> This test might read something like: "If my link count has gone to zero, and I
> am the last inode in my cluster to be freed, and there are other inodes from my
> cluster incore, I cannot be freed."

It's more complicated than that. In theory we would free the inode
easily as soon as the VFS wants it, but performance would be horrible
as we would have to synchronously write back the inode. Note that it
really matters for the interface, that just needs to be an: I won't
free this right now, but I'll call you back later when I can.