2017-03-23 14:54:50

by Johannes Berg

[permalink] [raw]
Subject: deadlock in synchronize_srcu() in debugfs?

Hi,

Before I go hunting - has anyone seen a deadlock in synchronize_srcu()
in debugfs_remove() before? We're observing that with our (backported,
but very recent) driver against 4.9 (and 4.10, I think), but there are
no backports of any debugfs things so the backport itself doesn't seem
like a likely problem.

sysrq-w shows a lot of tasks blocked on various locks (e.g. RTNL), but
the ultimate problem is the wireless stack getting blocked on
debugfs_remove_recursive(), in __synchronize_srcu(), in
wait_for_completion() (while holding lots of locks, hence the other
tasks getting stuck).

Thanks,
johannes


2017-03-23 15:30:12

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Thu, 2017-03-23 at 15:54 +0100, Johannes Berg wrote:

> Before I go hunting - has anyone seen a deadlock in
> synchronize_srcu() in debugfs_remove() before?

Isn't it possible for the following to happen?

CPU1 CPU2

mutex_lock(&M);
full_proxy_xyz();
srcu_read_lock(&debugfs_srcu);
real_fops->xyz();
mutex_lock(&M);
debugfs_remove(F);
synchronize_srcu(&debugfs_srcu);

-> deadlock?

I'm not convinced that this is the scenario I'm looking at, since then
it seems I should see the mutex_lock(&M) on CPU 2 with a backtrace
pointing to a full_proxy and the debugfs operation I recognize, but
lots of debugfs files acquire locks and it seems likely that it's not
always removed without holding those locks?

Am I missing something? I'll see if I can add lockdep annotations.

johannes

2017-03-23 15:36:24

by Nicolai Stange

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

Hi Johannes,

On Thu, Mar 23 2017, Johannes Berg wrote:

> Before I go hunting - has anyone seen a deadlock in synchronize_srcu()
> in debugfs_remove() before?

Not yet. How reproducible is this?


> We're observing that with our (backported, but very recent) driver
> against 4.9 (and 4.10, I think),

Do I understand it correctly that this driver has been backported from
4.11-rcX to 4.9/10 and that there isn't any issue with 4.11-rcX?


> but there are no backports of any debugfs things so the backport
> itself doesn't seem like a likely problem.

Right, there haven't been any SRCU related changes to debugfs after
4.8.


> sysrq-w shows a lot of tasks blocked on various locks (e.g. RTNL), but
> the ultimate problem is the wireless stack getting blocked on
> debugfs_remove_recursive(), in __synchronize_srcu(), in
> wait_for_completion() (while holding lots of locks, hence the other
> tasks getting stuck).

Could you share a complete backtrace? For example, is the
debugfs_remove_recursive() called from any debugfs file's fops and thus,
possibly from within a SRCU read side critical section?


Thanks,

Nicolai

2017-03-23 15:38:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Thu, Mar 23, 2017 at 03:54:46PM +0100, Johannes Berg wrote:
> Hi,
>
> Before I go hunting - has anyone seen a deadlock in synchronize_srcu()
> in debugfs_remove() before? We're observing that with our (backported,
> but very recent) driver against 4.9 (and 4.10, I think), but there are
> no backports of any debugfs things so the backport itself doesn't seem
> like a likely problem.
>
> sysrq-w shows a lot of tasks blocked on various locks (e.g. RTNL), but
> the ultimate problem is the wireless stack getting blocked on
> debugfs_remove_recursive(), in __synchronize_srcu(), in
> wait_for_completion() (while holding lots of locks, hence the other
> tasks getting stuck).

I have not seen this, but my usual question for __synchronize_srcu()
is if some other task is blocked holding srcu_read_lock() for that
same srcu_struct.

Thanx, Paul

2017-03-23 15:46:40

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Thu, 2017-03-23 at 08:37 -0700, Paul E. McKenney wrote:

> I have not seen this, but my usual question for __synchronize_srcu()
> is if some other task is blocked holding srcu_read_lock() for that
> same srcu_struct.
>

Not as far as I can see - but that was the scenario I was outlining in
my second email, I guess.

I'll need to reproduce it and a get a fuller view of the system, I only
have the "echo w > sysrq-trigger" output right now.

Thanks,
johannes

2017-03-23 15:47:08

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

Hi,

> Not yet. How reproducible is this?

Apparently quite. I haven't tried myself - it happens during some
automated test that I need to analyse further.

> > We're observing that with our (backported, but very recent) driver
> > against 4.9 (and 4.10, I think),
>
> Do I understand it correctly that this driver has been backported
> from 4.11-rcX to 4.9/10

Yes.

> and that there isn't any issue with 4.11-rcX?

No, I can't say this, we haven't run that test.

> > but there are no backports of any debugfs things so the backport
> > itself doesn't seem like a likely problem.
>
> Right, there haven't been any SRCU related changes to debugfs after
> 4.8.

Right.

> > sysrq-w shows a lot of tasks blocked on various locks (e.g. RTNL),
> > but
> > the ultimate problem is the wireless stack getting blocked on
> > debugfs_remove_recursive(), in __synchronize_srcu(), in
> > wait_for_completion() (while holding lots of locks, hence the other
> > tasks getting stuck).
>
> Could you share a complete backtrace? For example, is the
> debugfs_remove_recursive() called from any debugfs file's fops and
> thus, possibly from within a SRCU read side critical section?

No, it's called from netlink:

[  884.634857] wpa_supplicant  D    0  1769   1005 0x00000000
[  884.634874]  0000000000000000 ffff8ca50633d140 ffff8ca507b219c0 ffff8ca5455d4cc0
[  884.634898]  ffff8ca54f599d98 ffff97df431c36a0 ffffffff878dadf3 ffff8ca500000001
[  884.634927]  81ed67337c8469e4 ffff8ca54f599d98 0000932a07b219c0 ffff8ca507b219c0
[  884.634952] Call Trace:
[  884.634969]  [<ffffffff878dadf3>] ? __schedule+0x303/0xb00
[  884.634985]  [<ffffffff878db62d>] schedule+0x3d/0x90
[  884.635002]  [<ffffffff878e022c>] schedule_timeout+0x2fc/0x600
[  884.635021]  [<ffffffff870e8b06>] ? mark_held_locks+0x66/0x90
[  884.635041]  [<ffffffff878e16bc>] ? _raw_spin_unlock_irq+0x2c/0x40
[  884.635059]  [<ffffffff878dc8cc>] wait_for_completion+0xdc/0x110
[  884.635073]  [<ffffffff870bff90>] ? wake_up_q+0x80/0x80
[  884.635091]  [<ffffffff8710a46e>] __synchronize_srcu+0x11e/0x1c0
[  884.635109]  [<ffffffff87109510>] ? trace_raw_output_rcu_utilization+0x60/0x60
[  884.635131]  [<ffffffff8710a542>] synchronize_srcu+0x32/0x40
[  884.635145]  [<ffffffff873899ed>] debugfs_remove_recursive+0x17d/0x190
[  884.635239]  [<ffffffffc087b3be>] ieee80211_debugfs_key_remove+0x1e/0x30 [mac80211]
[  884.635333]  [<ffffffffc0840773>] __ieee80211_key_destroy+0x1b3/0x480 [mac80211]
[  884.635440]  [<ffffffffc0841807>] ieee80211_free_sta_keys+0x117/0x170 [mac80211]
[  884.635524]  [<ffffffffc0807b0c>] __sta_info_destroy_part2+0x4c/0x200 [mac80211]
[  884.635597]  [<ffffffffc0807fbd>] __sta_info_flush+0x10d/0x1a0 [mac80211]
[  884.635706]  [<ffffffffc086634b>] ieee80211_set_disassoc+0xcb/0x530 [mac80211]
[  884.635802]  [<ffffffffc086e3b6>] ieee80211_mgd_deauth+0x2e6/0x7b0 [mac80211]
[  884.635901]  [<ffffffffc08237c8>] ieee80211_deauth+0x18/0x20 [mac80211]
[  884.636024]  [<ffffffffc0673e8f>] cfg80211_mlme_deauth+0x14f/0x3b0 [cfg80211]
[  884.636110]  [<ffffffffc0649265>] nl80211_deauthenticate+0xe5/0x130 [cfg80211]
[  884.636133]  [<ffffffff877dc52c>] genl_family_rcv_msg+0x1bc/0x370
[  884.636151]  [<ffffffff877dc6e0>] ? genl_family_rcv_msg+0x370/0x370
[  884.636262]  [<ffffffff877dc760>] genl_rcv_msg+0x80/0xc0
[  884.636275]  [<ffffffff877dba87>] netlink_rcv_skb+0xa7/0xc0
[  884.636289]  [<ffffffff877dc148>] genl_rcv+0x28/0x40
[  884.636303]  [<ffffffff877db45b>] netlink_unicast+0x15b/0x210
[  884.636318]  [<ffffffff877db82a>] netlink_sendmsg+0x31a/0x3a0
[  884.636335]  [<ffffffff8777bb48>] sock_sendmsg+0x38/0x50
[  884.636354]  [<ffffffff8777c41c>] ___sys_sendmsg+0x26c/0x280
[  884.636378]  [<ffffffff8717b042>] ? ring_buffer_unlock_commit+0x32/0x290
[  884.636393]  [<ffffffff8718122e>] ? __buffer_unlock_commit+0x1e/0x40
[  884.636407]  [<ffffffff87181d12>] ? tracing_mark_write+0x162/0x2b0
[  884.636423]  [<ffffffff870e7419>] ? __lock_is_held+0x49/0x70
[  884.636440]  [<ffffffff8777d0a5>] __sys_sendmsg+0x45/0x80
[  884.636459]  [<ffffffff8777d0f2>] SyS_sendmsg+0x12/0x20
[  884.636477]  [<ffffffff878e1e45>] entry_SYSCALL_64_fastpath+0x23/0xc6


johannes

2017-03-24 08:56:56

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Thu, 2017-03-23 at 16:29 +0100, Johannes Berg wrote:
> Isn't it possible for the following to happen?
>
> CPU1 CPU2
>
> mutex_lock(&M);
> full_proxy_xyz();
> srcu_read_lock(&debugfs_srcu);
> real_fops->xyz();
> mutex_lock(&M);
> debugfs_remove(F);
> synchronize_srcu(&debugfs_srcu);


So I'm pretty sure that this can happen. I'm not convinced that it's
happening here, but still.

I tried to make lockdep flag it, but the only way I could get it to
flag it was to do this:

--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -235,7 +235,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
preempt_disable();
retval = __srcu_read_lock(sp);
preempt_enable();
- rcu_lock_acquire(&(sp)->dep_map);
+ lock_map_acquire(&(sp)->dep_map);
return retval;
}

@@ -249,7 +249,7 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)
__releases(sp)
{
- rcu_lock_release(&(sp)->dep_map);
+ lock_map_release(&(sp)->dep_map);
__srcu_read_unlock(sp, idx);
}

diff --git a/kernel/rcu/srcu.c b/kernel/rcu/srcu.c
index ef3bcfb15b39..0f9e542ca3f2 100644
--- a/kernel/rcu/srcu.c
+++ b/kernel/rcu/srcu.c
@@ -395,6 +395,9 @@ static void __synchronize_srcu(struct srcu_struct *sp, int trycount)
lock_is_held(&rcu_sched_lock_map),
"Illegal synchronize_srcu() in same-type SRCU (or in RCU) read-side critical section");

+ lock_map_acquire(&sp->dep_map);
+ lock_map_release(&sp->dep_map);
+
might_sleep();
init_completion(&rcu.completion);


The lock_map_acquire() in srcu_read_lock() is really not desired
though, since it will make recursion get flagged as bad. If I change
that to lock_map_acquire_read() though, the problem doesn't get flagged
for some reason. I thought it should.


Regardless though, I don't see a way to solve this problem for debugfs.
We have a ton of debugfs files in net/mac80211/debugfs.c that need to
acquire e.g. the RTNL (or other locks), and I'm not sure we can easily
avoid removing the debugfs files under the RTNL, since we get all our
configuration callbacks with the RTNL already held...

Need to think about that, but perhaps there's some other solution?

johannes

2017-03-24 09:26:48

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

Hi,

On Fri, 2017-03-24 at 09:56 +0100, Johannes Berg wrote:
> On Thu, 2017-03-23 at 16:29 +0100, Johannes Berg wrote:
> > Isn't it possible for the following to happen?
> >
> > CPU1 CPU2
> >
> > mutex_lock(&M); // acquires mutex
> > full_proxy_xyz();
> > srcu_read_lock(&debugfs_srcu);
> > real_fops->xyz();
> > mutex_lock(&M); // waiting for mutex
> > debugfs_remove(F);
> > synchronize_srcu(&debugfs_srcu);

> So I'm pretty sure that this can happen. I'm not convinced that it's
> happening here, but still.

I'm a bit confused, in that SRCU, of course, doesn't wait until all the
readers are done - that'd be a regular reader/writer lock or something.

However, it does (have to) wait until all the currently active read-
side sections have terminated, which still leads to a deadlock in the
example above, I think?

In his 2006 LWN article Paul wrote:

The designer of a given subsystem is responsible for: (1) ensuring
that SRCU read-side sleeping is bounded and (2) limiting the amount
of memory waiting for synchronize_srcu(). [1]

In the case of debugfs files acquiring locks, (1) can't really be
guaranteed, especially if those locks can be held while doing
synchronize_srcu() [via debugfs_remove], so I still think the lockdep
annotation needs to be changed to at least have some annotation at
synchronize_srcu() time so we can detect this.

Now, I still suspect there's some other bug here in the case that I'm
seeing, because I don't actually see the "mutex_lock(&M); // waiting"
piece in the traces. I'll need to run this with some tracing on Monday
when the test guys are back from the weekend.

I'm also not sure how I can possibly fix this in debugfs in mac80211
and friends, but that's perhaps a different story. Clearly, this
debugfs patch is a good thing - the code will likely have had use-
after-free problems in this situation without it. But flagging the
potential deadlocks would make it a lot easier to find them.

johannes

2017-03-24 17:46:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Fri, Mar 24, 2017 at 10:24:46AM +0100, Johannes Berg wrote:
> Hi,
>
> On Fri, 2017-03-24 at 09:56 +0100, Johannes Berg wrote:
> > On Thu, 2017-03-23 at 16:29 +0100, Johannes Berg wrote:
> > > Isn't it possible for the following to happen?
> > >
> > > CPU1 CPU2
> > >
> > > mutex_lock(&M); // acquires mutex
> > > full_proxy_xyz();
> > > srcu_read_lock(&debugfs_srcu);
> > > real_fops->xyz();
> > > mutex_lock(&M); // waiting for mutex
> > > debugfs_remove(F);
> > > synchronize_srcu(&debugfs_srcu);
>
> > So I'm pretty sure that this can happen. I'm not convinced that it's
> > happening here, but still.
>
> I'm a bit confused, in that SRCU, of course, doesn't wait until all the
> readers are done - that'd be a regular reader/writer lock or something.

Agreed, synchronize_srcu() does not have to wait for new readers
(as a reader/writer lock would), but it -does- have have to wait for
pre-existing readers, like the one shown in your example above.

> However, it does (have to) wait until all the currently active read-
> side sections have terminated, which still leads to a deadlock in the
> example above, I think?

Yes. CPU2 has a pre-existing reader that CPU1's synchronize_srcu()
must wait for. But CPU2's reader cannot end until CPU1 releases
its lock, which it cannot do until after CPU2's reader ends. Thus,
as you say, deadlock.

The rule is that if you are within any kind of RCU read-side critical
section, you cannot directly or indirectly wait for a grace period from
that same RCU flavor.

> In his 2006 LWN article Paul wrote:
>
> The designer of a given subsystem is responsible for: (1) ensuring
> that SRCU read-side sleeping is bounded and (2) limiting the amount
> of memory waiting for synchronize_srcu(). [1]
>
> In the case of debugfs files acquiring locks, (1) can't really be
> guaranteed, especially if those locks can be held while doing
> synchronize_srcu() [via debugfs_remove], so I still think the lockdep
> annotation needs to be changed to at least have some annotation at
> synchronize_srcu() time so we can detect this.

That would be very nice!

There are some challenges, though. This is OK:

CPU1 CPU2
i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
mutex_lock(&my_lock); i = srcu_read_lock(&mysrcu);
srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
mutex_unlock(&my_lock); srcu_read_unlock(&mysrcu, i);

CPU3
synchronize_srcu(&mylock);

This could be a deadlock for reader-writer locking, but not for SRCU.

This is also OK:

CPU1 CPU2
i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
mutex_lock(&my_lock); synchronize_srcu(&yoursrcu);
srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
mutex_unlock(&my_lock);

Here CPU1's read-side critical sections are for mysrcu, which is
independent of CPU2's grace period for yoursrcu.

So you could flag any lockdep cycle that contained a reader and a
synchronous grace period for the same flavor of RCU, where for SRCU the
identity of the srcu_struct structure is part of the flavor.

> Now, I still suspect there's some other bug here in the case that I'm
> seeing, because?I don't actually see the "mutex_lock(&M); // waiting"
> piece in the traces. I'll need to run this with some tracing on Monday
> when the test guys are back from the weekend.
>
> I'm also not sure how I can possibly fix this in debugfs in mac80211
> and friends,?but that's perhaps a different story. Clearly, this
> debugfs patch is a good thing - the code will likely have had use-
> after-free problems in this situation without it. But flagging the
> potential deadlocks would make it a lot easier to find them.

No argument here!

Thanx, Paul

2017-03-24 18:52:09

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?


> Yes.  CPU2 has a pre-existing reader that CPU1's synchronize_srcu()
> must wait for.  But CPU2's reader cannot end until CPU1 releases
> its lock, which it cannot do until after CPU2's reader ends.  Thus,
> as you say, deadlock.
>
> The rule is that if you are within any kind of RCU read-side critical
> section, you cannot directly or indirectly wait for a grace period
> from that same RCU flavor.

Right. This is indirect then, in a way.

> There are some challenges, though.  This is OK:
>
> CPU1 CPU2
> i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
> mutex_lock(&my_lock); i = srcu_read_lock(&mysrcu);
> srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
> mutex_unlock(&my_lock); srcu_read_unlock(&mysrcu, i);
>
> CPU3
> synchronize_srcu(&mylock);
>
> This could be a deadlock for reader-writer locking, but not for SRCU.

Hmm, yes, that's a good point. If srcu_read_lock() was read_lock, and
synchronize_srcu() was write_lock(), then the write_lock() could stop
CPU2's read_lock() from acquiring the lock, and thus cause a deadlock.

However, I'm not convinced that lockdep handles reader/writer locks
correctly to start with, right now, since it *didn't* actually trigger
any warnings when I annotated SRCU as a reader/writer lock.

> This is also OK:
> CPU1 CPU2
> i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
> mutex_lock(&my_lock); synchronize_srcu(&yoursrc
u);
> srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
> mutex_unlock(&my_lock);
>
> Here CPU1's read-side critical sections are for mysrcu, which is
> independent of CPU2's grace period for yoursrcu.

Right, but that's already covered by having separate a lockdep_map for
each SRCU subsystem (mysrcu, yoursrcu).

> So you could flag any lockdep cycle that contained a reader and a
> synchronous grace period for the same flavor of RCU, where for SRCU
> the identity of the srcu_struct structure is part of the flavor.

Right. Basically, I think SRCU should be like a reader/writer lock
(perhaps fixed to work right). The only difference seems to be the
scenario you outlined above (first of the two)?

Actually, given the scenario above, for lockdep purposes the
reader/writer lock is actually the same as a recursive lock, I guess?

You outlined a scenario in which the reader gets blocked due to a
writer (CPU3 doing a write_lock()) so the reader can still participate
in a deadlock cycle since it can - without any other locks being held
by CPU3 that participate - cause a deadlock between CPU1 and CPU2 here.
For lockdep then, even seeing the CPU1 and CPU2 scenarios should be
sufficient to flag a deadlock (*).

This part then isn't true for SRCU, because there forward progress will
still be made. So for SRCU, the "reader" side really needs to be
connected with a "writer" side to form a deadlock cycle, unlike for a
reader/writer lock.

johannes

(*) technically only after checking that write_lock() is ever used, but
... seems reasonable enough to assume that it will be used, since why
would anyone ever use a reader/writer lock if there are only readers?
That's a no-op.

2017-03-24 19:34:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Fri, Mar 24, 2017 at 07:51:47PM +0100, Johannes Berg wrote:
>
> > Yes.??CPU2 has a pre-existing reader that CPU1's synchronize_srcu()
> > must wait for.??But CPU2's reader cannot end until CPU1 releases
> > its lock, which it cannot do until after CPU2's reader ends.??Thus,
> > as you say, deadlock.
> >
> > The rule is that if you are within any kind of RCU read-side critical
> > section, you cannot directly or indirectly wait for a grace period
> > from that same RCU flavor.
>
> Right. This is indirect then, in a way.

Agreed, in a way. ;-)

> > There are some challenges, though.??This is OK:
> >
> > CPU1 CPU2
> > i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
> > mutex_lock(&my_lock); i = srcu_read_lock(&mysrcu);
> > srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
> > mutex_unlock(&my_lock); srcu_read_unlock(&mysrcu, i);
> >
> > CPU3
> > synchronize_srcu(&mylock);
> >
> > This could be a deadlock for reader-writer locking, but not for SRCU.
>
> Hmm, yes, that's a good point. If srcu_read_lock() was read_lock, and
> synchronize_srcu() was write_lock(), then the write_lock() could stop
> CPU2's read_lock() from acquiring the lock, and thus cause a deadlock.

Yes.

> However, I'm not convinced that lockdep handles reader/writer locks
> correctly to start with, right now, since it *didn't* actually trigger
> any warnings when I annotated SRCU as a reader/writer lock.

I haven't looked into lockdep enough to know either way.

> > This is also OK:
> > CPU1 CPU2
> > i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
> > mutex_lock(&my_lock); synchronize_srcu(&yoursrc
> u);
> > srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
> > mutex_unlock(&my_lock);
> >
> > Here CPU1's read-side critical sections are for mysrcu, which is
> > independent of CPU2's grace period for yoursrcu.
>
> Right, but that's already covered by having separate a lockdep_map for
> each SRCU subsystem (mysrcu, yoursrcu).

I hope so, but haven't proved that this would work in all possible cases.

> > So you could flag any lockdep cycle that contained a reader and a
> > synchronous grace period for the same flavor of RCU, where for SRCU
> > the identity of the srcu_struct structure is part of the flavor.
>
> Right. Basically, I think SRCU should be like a reader/writer lock
> (perhaps fixed to work right). The only difference seems to be the
> scenario you outlined above (first of the two)?
>
> Actually, given the scenario above, for lockdep purposes the
> reader/writer lock is actually the same as a recursive lock, I guess?

Except that a recursive reader/writer lock can still have deadlocks
involving the outermost reader that would not be deadlocks for the
equivalent SRCU scenarios.

> You outlined a scenario in which the reader gets blocked due to a
> writer (CPU3 doing a write_lock()) so the reader can still participate
> in a deadlock cycle since it can - without any other locks being held
> by CPU3 that participate - cause a deadlock between CPU1 and CPU2 here.
> For lockdep then, even seeing the CPU1 and CPU2 scenarios should be
> sufficient to flag a deadlock (*).

Might this be one of the reasons why lockdep has problems with
reader-writer locks?

> This part then isn't true for SRCU, because there forward progress will
> still be made. So for SRCU, the "reader" side really needs to be
> connected with a "writer" side to form a deadlock cycle, unlike for a
> reader/writer lock.

Yes, for SRCU, srcu_read_lock() itself never blocks, so it never
participates directly in a deadlock cycle. It has to be the case
that something within the SRCU read-side critical section blocks
and takes its place in the deadlock cycle.

Then again, if you didn't have something blocking within your SRCU
read-side critical section, why would you be using SRCU instead of
just plain RCU? ;-)

> johannes
>
> (*) technically only after checking that write_lock() is ever used, but
> ... seems reasonable enough to assume that it will be used, since why
> would anyone ever use a reader/writer lock if there are only readers?
> That's a no-op.

Makes sense to me! The only reasons I can come up with are things like
shutting lockdep up when it wants a given lock read-held or some such.

Thanx, Paul

2017-03-24 20:20:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Fri, Mar 24, 2017 at 12:33:22PM -0700, Paul E. McKenney wrote:
> On Fri, Mar 24, 2017 at 07:51:47PM +0100, Johannes Berg wrote:
> >
> > > Yes.??CPU2 has a pre-existing reader that CPU1's synchronize_srcu()
> > > must wait for.??But CPU2's reader cannot end until CPU1 releases
> > > its lock, which it cannot do until after CPU2's reader ends.??Thus,
> > > as you say, deadlock.
> > >
> > > The rule is that if you are within any kind of RCU read-side critical
> > > section, you cannot directly or indirectly wait for a grace period
> > > from that same RCU flavor.
> >
> > Right. This is indirect then, in a way.
>
> Agreed, in a way. ;-)
>
> > > There are some challenges, though.??This is OK:
> > >
> > > CPU1 CPU2
> > > i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
> > > mutex_lock(&my_lock); i = srcu_read_lock(&mysrcu);
> > > srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
> > > mutex_unlock(&my_lock); srcu_read_unlock(&mysrcu, i);
> > >
> > > CPU3
> > > synchronize_srcu(&mylock);
> > >
> > > This could be a deadlock for reader-writer locking, but not for SRCU.
> >
> > Hmm, yes, that's a good point. If srcu_read_lock() was read_lock, and
> > synchronize_srcu() was write_lock(), then the write_lock() could stop
> > CPU2's read_lock() from acquiring the lock, and thus cause a deadlock.
>
> Yes.
>
> > However, I'm not convinced that lockdep handles reader/writer locks
> > correctly to start with, right now, since it *didn't* actually trigger
> > any warnings when I annotated SRCU as a reader/writer lock.
>
> I haven't looked into lockdep enough to know either way.
>
> > > This is also OK:
> > > CPU1 CPU2
> > > i = srcu_read_lock(&mysrcu); mutex_lock(&my_lock);
> > > mutex_lock(&my_lock); synchronize_srcu(&yoursrc
> > u);
> > > srcu_read_unlock(&mysrcu, i); mutex_unlock(&my_lock);
> > > mutex_unlock(&my_lock);
> > >
> > > Here CPU1's read-side critical sections are for mysrcu, which is
> > > independent of CPU2's grace period for yoursrcu.
> >
> > Right, but that's already covered by having separate a lockdep_map for
> > each SRCU subsystem (mysrcu, yoursrcu).
>
> I hope so, but haven't proved that this would work in all possible cases.
>
> > > So you could flag any lockdep cycle that contained a reader and a
> > > synchronous grace period for the same flavor of RCU, where for SRCU
> > > the identity of the srcu_struct structure is part of the flavor.
> >
> > Right. Basically, I think SRCU should be like a reader/writer lock
> > (perhaps fixed to work right). The only difference seems to be the
> > scenario you outlined above (first of the two)?
> >
> > Actually, given the scenario above, for lockdep purposes the
> > reader/writer lock is actually the same as a recursive lock, I guess?
>
> Except that a recursive reader/writer lock can still have deadlocks
> involving the outermost reader that would not be deadlocks for the
> equivalent SRCU scenarios.
>
> > You outlined a scenario in which the reader gets blocked due to a
> > writer (CPU3 doing a write_lock()) so the reader can still participate
> > in a deadlock cycle since it can - without any other locks being held
> > by CPU3 that participate - cause a deadlock between CPU1 and CPU2 here.
> > For lockdep then, even seeing the CPU1 and CPU2 scenarios should be
> > sufficient to flag a deadlock (*).
>
> Might this be one of the reasons why lockdep has problems with
> reader-writer locks?
>
> > This part then isn't true for SRCU, because there forward progress will
> > still be made. So for SRCU, the "reader" side really needs to be
> > connected with a "writer" side to form a deadlock cycle, unlike for a
> > reader/writer lock.
>
> Yes, for SRCU, srcu_read_lock() itself never blocks, so it never
> participates directly in a deadlock cycle. It has to be the case
> that something within the SRCU read-side critical section blocks
> and takes its place in the deadlock cycle.
>
> Then again, if you didn't have something blocking within your SRCU
> read-side critical section, why would you be using SRCU instead of
> just plain RCU? ;-)
>
> > johannes
> >
> > (*) technically only after checking that write_lock() is ever used, but
> > ... seems reasonable enough to assume that it will be used, since why
> > would anyone ever use a reader/writer lock if there are only readers?
> > That's a no-op.
>
> Makes sense to me! The only reasons I can come up with are things like
> shutting lockdep up when it wants a given lock read-held or some such.

And I cannot resist adding this one:

CPU 1 CPU 2
i = srcu_read_lock(&s1); mutex_lock(&l1);
mutex_lock(&l1); synchronize_srcu(&s2);
mutex_unlock(&l1); mutex_unlock(&l1);
srcu_read_unlock(&s1, i);

CPU 3 CPU 4
i = srcu_read_lock(&s2); mutex_lock(&l2);
mutex_lock(&l2); synchronize_srcu(&s1);
mutex_unlock(&l2); mutex_unlock(&l2);
srcu_read_unlock(&s2, i);

Removing the SRCU statements from any of these CPU would break the
deadlock. This can be easily extended to a deadlock cycle involving
any number of srcu_struct structures.

But this would still be a cycle involving an srcu_read_lock() and a
synchronize_srcu() on the same srcu_struct, which is reassuring.

Thanx, Paul

2017-03-27 11:18:44

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Fri, 2017-03-24 at 13:20 -0700, Paul E. McKenney wrote:
>
> And I cannot resist adding this one:
>
> CPU 1 CPU 2
> i = srcu_read_lock(&s1); mutex_lock(&l1);
> mutex_lock(&l1); synchronize_srcu(&s2);
> mutex_unlock(&l1); mutex_unlock(&l1);
> srcu_read_unlock(&s1, i);
>
> CPU 3 CPU 4
> i = srcu_read_lock(&s2); mutex_lock(&l2);
> mutex_lock(&l2); synchronize_srcu(&s1);
> mutex_unlock(&l2); mutex_unlock(&l2);
> srcu_read_unlock(&s2, i);
>
> Removing the SRCU statements from any of these CPU would break the
> deadlock.  This can be easily extended to a deadlock cycle involving
> any number of srcu_struct structures.
>
> But this would still be a cycle involving an srcu_read_lock() and a
> synchronize_srcu() on the same srcu_struct, which is reassuring.

Right, you can cycle this indefinitely. lockdep has some kind of
maximum chain length I think. :)

johannes

2017-03-27 11:57:21

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

Hi,

> > Before I go hunting - has anyone seen a deadlock in
> > synchronize_srcu() in debugfs_remove() before?
>
> Not yet. How reproducible is this?

So ... this turned out to be a livelock of sorts.

We have a debugfs file (not upstream (yet?), it seems) that basically
blocks reading data.

At the point of system hanging, there was a process reading from that
file, with no data being generated.

A second process was trying to remove a completely unrelated debugfs
file (*), with the RTNL held.

A third and many other processes were waiting to acquire the RTNL.


Obviously, in light of things like nfp_net_debugfs_tx_q_read(),
wil_write_file_reset(), lowpan_short_addr_get() and quite a few more,
nobody in the whole system can now remove debugfs files while holding
the RTNL. Not sure how many people that affects, but it's IMHO a pretty
major new restriction, and one that isn't even flagged at all.


Similarly, nobody should be blocking in debugfs files, like we did in
ours, but also smsdvb_stats_read(), crtc_crc_open() look like they
could block for quite a while. Again, there's no warning here that
blocking in debugfs files can now indefinitely defer completely
unrelated debugfs_remove() calls in the entire system.

Overall, while I can solve this problem for our driver, possibly by
making the debugfs file return some dummy data periodically if no real
data exists, which may not easily be possible for all such files, I'm
not convinced that all of this really is the right thing to actually
impose. Perhaps if it was per directory, or per some kind of subsystem?

johannes

(*) before removing first first we'd obviously wake up and thereby more
or less terminate the readers first

2017-03-30 07:32:26

by Nicolai Stange

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

Hi Johannes,

On Mon, Mar 27 2017, Johannes Berg wrote:

>> > Before I go hunting - has anyone seen a deadlock in
>> > synchronize_srcu() in debugfs_remove() before?
>>
>> Not yet. How reproducible is this?
>
> So ... this turned out to be a livelock of sorts.
>
> We have a debugfs file (not upstream (yet?), it seems) that basically
> blocks reading data.
>
> At the point of system hanging, there was a process reading from that
> file, with no data being generated.

>
> A second process was trying to remove a completely unrelated debugfs
> file (*), with the RTNL held.

I wonder if holding the RTNL during the debugfs file removal is really
needed. I'll try to have a look in the next couple of days.

>
> A third and many other processes were waiting to acquire the RTNL.
>
>
> Obviously, in light of things like nfp_net_debugfs_tx_q_read(),
> wil_write_file_reset(), lowpan_short_addr_get() and quite a few more,
> nobody in the whole system can now remove debugfs files while holding
> the RTNL. Not sure how many people that affects, but it's IMHO a pretty
> major new restriction, and one that isn't even flagged at all.

To be honest, I didn't have this scenario, i.e. removing a debugfs file
under a lock, in mind when writing this removal protection series.

Thank you very much for your debugging work and for pointing me to this
sort of problem!

Summarizing, the problem is the call to the indefinitely blocking
srcu_synchronize() while having a lock held? I'll see whether I can ask
lockdep if any lock is held and spit out a warning then.


> Similarly, nobody should be blocking in debugfs files, like we did in
> ours, but also smsdvb_stats_read(), crtc_crc_open() look like they
> could block for quite a while.

Blocking in the debugfs files' fops shall be fine by itself, that's why
SRCU is used for the removal stuff.


> Again, there's no warning here that blocking in debugfs files can now
> indefinitely defer completely unrelated debugfs_remove() calls in the
> entire system.

Yes, there's only one global srcu_struct for debugfs. So far this hasn't
been a problem and if I understand things correctly, it's also not the
problem at hand? If it really becomes an issue, we can very well
introduce per directory srcu_structs as you suggested.


> Overall, while I can solve this problem for our driver, possibly by
> making the debugfs file return some dummy data periodically if no real
> data exists, which may not easily be possible for all such files, I'm
> not convinced that all of this really is the right thing to actually
> impose.

No, I agree: imposing dummy data reads certainly isn't.

Let me have a look
- whether holding the RTNL lock while removing the debugfs files is
actually needed and
- whether there is an easy way to spot similar scenarios and emit
a warning for them.

If this doesn't solve the problem, I'll have to think of a different way
to fix this...

> (*) before removing first first we'd obviously wake up and thereby more
> or less terminate the readers first

With the current implementation, I can't see an easy way to identify the
tasks blocking on a particular debugfs file. But maybe this is
resolvable and the way to go here...


Thanks,

Nicolai

2017-03-30 07:55:53

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Thu, 2017-03-30 at 09:32 +0200, Nicolai Stange wrote:
>
> I wonder if holding the RTNL during the debugfs file removal is
> really needed. I'll try to have a look in the next couple of days.

Yes, I'm pretty much convinced that it is. I considered doing a
deferred debugfs_remove() by holding the object around, but then I
can't be sure that I can later re-add a new object with the same
directory name, so I have much more complexity - I'm not even sure that
can be solved at all, *perhaps* by renaming in debugfs first, but
that's major new complexity.

Enough complexity that I'm considering just removing debugfs usage
entirely and invent new mechanisms, or use sysfs, or something else
instead.

> Summarizing, the problem is the call to the indefinitely blocking
> srcu_synchronize() while having a lock held? I'll see whether I can
> ask lockdep if any lock is held and spit out a warning then.

Half the thread here was about that - it's not easily doable because
you'd have to teach lockdep about the special SRCU semantics first.
Since it doesn't even seem to do read/write locks properly that's
probably a massive undertaking.

I also doubt that it's useful, because even if we did flag this sort of
situation, it can occur across very different drivers - for example the
netronome driver using rtnl_lock() inside its debugfs files, and
mac80211 removing a completely unrelated debugfs file within
rtnl_lock().

>
> > Similarly, nobody should be blocking in debugfs files, like we did
> > in ours, but also smsdvb_stats_read(), crtc_crc_open() look like
> > they could block for quite a while.
>
> Blocking in the debugfs files' fops shall be fine by itself, that's
> why SRCU is used for the removal stuff.

No, it isn't fine at all now! If I have a debugfs file - like the one I
had - that could block for external events (firmware, in my case), then
*any* other debugfs_remove() in the whole system would also block
indefinitely. That's a major problem! The two other files listed above
can also block waiting for external events, afaict. I'm told that there
are more files in other wireless drivers too.

Basically, even with SRCU being used, you cannot have blocking files.
You have to treat everything as O_NONBLOCK, because if you don't a
completely unrelated debugfs_remove() will block until the file
produces data. IMHO that's completely unacceptable.

> Yes, there's only one global srcu_struct for debugfs. So far this
> hasn't been a problem and if I understand things correctly, it's also
> not the problem at hand? If it really becomes an issue, we can very
> well introduce per directory srcu_structs as you suggested.

No, this is exactly the problem. If I have one blocking file, and
remove any completely unrelated file elsewhere in the system, I need to
wait for the blocking file to have produced data. That just doesn't
scale.

Using a separate SRCU subsystem per directory will go some way, but it
would still be useful to have lockdep annotations there.

Ultimately, I'm not sure I see why one couldn't just have a
reader/writer lock per *file*, which would be the ultimate granularity
to solve this. Obviously, a blocking file has to be aborted before
being removed itself, but there's nothing that says that you can't
remove any other file - even from the same directory - while this one
is in a blocking read.

> Let me have a look
> - whether holding the RTNL lock while removing the debugfs files is
>   actually needed and
> - whether there is an easy way to spot similar scenarios and emit
>   a warning for them.
>
> If this doesn't solve the problem, I'll have to think of a different
> way to fix this...

It solves - imho with unnecessary hardship on the caller of
debugfs_remove() - only half of the problem, namely the real deadlock.
It does nothing for the "blocking debugfs file" live-lock where forward
progress can be made as soon as the file has some data available -
which in practice in my scenario never happened as the data producer
was completely idle.

> > (*) before removing first first we'd obviously wake up and thereby
> > more or less terminate the readers first
>
> With the current implementation, I can't see an easy way to identify
> the tasks blocking on a particular debugfs file. But maybe this is
> resolvable and the way to go here...

No, this is unrelated. If I write a blocking debugfs file, then *of
course* I need to take care that before remove it I wake up all the
readers. But that's easy because I control how that file produces data,
so I can wake up the waitq and tell my own code inside the debugfs read
to return 0 (EOF).

This would be entirely inappropriate as a general solution because
you'd have to kill the userspace process or something...

johannes

2017-03-30 10:27:19

by Nicolai Stange

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

So, please correct me if I'm wrong, there are two problems with
indefinitely blocking debugfs files' fops:

1. The one which actually hung your system:
An indefinitely blocking debugfs_remove() while holding a lock.
Other tasks attempting to grab that same lock get stuck as well.

2. The other one you've found, namely that the locking granularity is
too coarse: a debugfs_remove() would get blocked by unrelated files'
pending fops.

AFAICS, the first one can't get resolved by simply refining the blocking
granularity: a debugfs_remove() on the indefinitely blocking file would
still block as well.

But:

On Thu, Mar 30 2017, Johannes Berg wrote:

> On Thu, 2017-03-30 at 09:32 +0200, Nicolai Stange wrote:
>>
>> I wonder if holding the RTNL during the debugfs file removal is
>> really needed. I'll try to have a look in the next couple of days.
>
> Yes, I'm pretty much convinced that it is. I considered doing a
> deferred debugfs_remove() by holding the object around, but then I
> can't be sure that I can later re-add a new object with the same
> directory name,

Ah, I see. What about making debugfs provide separate

- debugfs_remove_start(): unlink file (c.f. __debugfs_remove()), can be
called under a lock
and
- debugfs_remove_wait(): the synchronize_srcu(), must not get called
under any lock

?

This would solve 1.). It would still be nice to detect those situations,
i.e. calls to debugfs_remove() with some lock being held, though.

In short, I'd make calling debugfs_remove() with any lock being
held illegal.

What do you think?


> Half the thread here was about that - it's not easily doable because
> you'd have to teach lockdep about the special SRCU semantics first.
> Since it doesn't even seem to do read/write locks properly that's
> probably a massive undertaking.

I haven't looked into lockdep yet. So there is no way to ask lockdep
"is there any lock held?" from debugfs_remove() before doing the
synchonize_srcu()?


> I also doubt that it's useful, because even if we did flag this sort of
> situation, it can occur across very different drivers - for example the
> netronome driver using rtnl_lock() inside its debugfs files, and
> mac80211 removing a completely unrelated debugfs file within
> rtnl_lock().

I'm proposing to convert the latter to a
debugfs_remove_start()/debugfs_remove_wait() pair.


>> > Similarly, nobody should be blocking in debugfs files, like we did
>> > in ours, but also smsdvb_stats_read(), crtc_crc_open() look like
>> > they could block for quite a while.
>>
>> Blocking in the debugfs files' fops shall be fine by itself, that's
>> why SRCU is used for the removal stuff.
>
> No, it isn't fine at all now!

"Shall" in the sense of "it's a requirement" and if it isn't fulfilled,
it must be fixed. So I do agree with you here.


> If I have a debugfs file - like the one I
> had - that could block for external events (firmware, in my case), then
> *any* other debugfs_remove() in the whole system would also block
> indefinitely. That's a major problem!

Indeed.


> Ultimately, I'm not sure I see why one couldn't just have a
> reader/writer lock per *file*, which would be the ultimate granularity
> to solve this. Obviously, a blocking file has to be aborted before
> being removed itself, but there's nothing that says that you can't
> remove any other file - even from the same directory - while this one
> is in a blocking read.

When I did this, per-file reader/writer locks actuallt came to my mind
first. The problem here is that debugfs_use_file_start() must grab the
lock first and check whether the file has been deleted in the meanwhile.
But as it stands, there's nothing that would guarantee the existence of
the lock at the time it's to be taken.

Anyways, I'll have to think a little about possible solutions to mitigate
problem 2.)

Thanks,

Nicolai

2017-03-30 11:11:28

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Thu, 2017-03-30 at 12:27 +0200, Nicolai Stange wrote:
> So, please correct me if I'm wrong, there are two problems with
> indefinitely blocking debugfs files' fops:
>
> 1. The one which actually hung your system:
>    An indefinitely blocking debugfs_remove() while holding a lock.
>    Other tasks attempting to grab that same lock get stuck as well.
>
> 2. The other one you've found, namely that the locking granularity is
>    too coarse: a debugfs_remove() would get blocked by unrelated
> files'
>    pending fops.

No, this isn't really an accurate description of the two problems.

> AFAICS, the first one can't get resolved by simply refining the
> blocking granularity: a debugfs_remove() on the indefinitely blocking
> file would still block as well.

Correct.

The first problem - the one I ran into - is the following:

1)
A given debugfs file's .read() was waiting for some event to happen
(being a blocking file), and I was trying to debugfs_remove() some
completely unrelated file, this got stuck.
Due to me holding a lock while doing this debugfs_remove(), other tasks
*also* got stuck, but that's just a sub-problem - having the
debugfs_remove() of an unrelated file get stuck would already have been
a problem - the fact that other tasks also got stuck was just an
additional wrinkle.

Mind - this is a livelock of sorts - if the debugfs file will ever make
progress, the system can recover.

2)
There's a complete deadlock situation if this happens:

CPU1 CPU2

debugfs_file_read(file="foo") mutex_lock(&M);
srcu_read_lock(&debugfs_srcu); debugfs_remove(file="bar")
mutex_lock(&M); synchronize_srcu(&debugfs_srcu)

This is intrinsically unrecoverable.

> > Yes, I'm pretty much convinced that it is. I considered doing a
> > deferred debugfs_remove() by holding the object around, but then I
> > can't be sure that I can later re-add a new object with the same
> > directory name,
>
> Ah, I see. What about making debugfs provide separate
>
> - debugfs_remove_start(): unlink file (c.f. __debugfs_remove()), can
> be
>   called under a lock
> and
> - debugfs_remove_wait(): the synchronize_srcu(), must not get called
>   under any lock
>
> ?

I don't think it would really help much - the lock acquisition in my
case is in a completely different layer (cfg80211) than the code doing
debugfs_remove(), so delaying the debugfs_remove_wait() would mean
moving it somewhere else completely. Also, afaict you still have to
keep the object around until debugfs_remove_wait() has finished, so you
still have the name reuse problem.

> In short, I'd make calling debugfs_remove() with any lock being
> held illegal.
>
> What do you think?

I think I'll stop using debugfs if that happens - too much hassle.

> > Half the thread here was about that - it's not easily doable
> > because
> > you'd have to teach lockdep about the special SRCU semantics first.
> > Since it doesn't even seem to do read/write locks properly that's
> > probably a massive undertaking.
>
> I haven't looked into lockdep yet. So there is no way to ask lockdep
> "is there any lock held?" from debugfs_remove() before doing the
> synchonize_srcu()? 

That's probably possible, yes.

> > > > Similarly, nobody should be blocking in debugfs files, like we
> > > > did
> > > > in ours, but also smsdvb_stats_read(), crtc_crc_open() look
> > > > like
> > > > they could block for quite a while.
> > >
> > > Blocking in the debugfs files' fops shall be fine by itself,
> > > that's why SRCU is used for the removal stuff.
> >
> > No, it isn't fine at all now!
>
> "Shall" in the sense of "it's a requirement" and if it isn't
> fulfilled, it must be fixed. So I do agree with you here.

Ok. So let's assume that we allow blocking (indefinitely, at least
until you remove the file) for a debugfs file - then immediately due to
the way SRCU is used you've now made some debugfs_remove() calls block
indefinitely. Not just on the same file - that'd be fine and a bug,
because before you remove a file you should wake it up - but any other
file in the system.

Even splitting it into debugfs_remove_start() and debugfs_remove_wait()
will not do anything to fix this problem - debugfs_remove_wait() would
then block forever and the task that called it will not be able to make
any forward progress, until the completely unrelated other files
created some data or got closed.

This is the core of the problem really - that you're tying completely
unrelated processes together.

Therefore, to continue using SRCU in this way means that you have to
disallow blocking debugfs files. There may not be many of those, but
any single one of them would be a problem.

If we stop using SRCU this way we can discuss how we can fix it - but
anything more coarse grained than per-file (which really makes SRCU
unsuitable) would still have the same problem one way or another. And
we haven't even addressed the deadlock situation (2 above) either.

> When I did this, per-file reader/writer locks actuallt came to my
> mind first. The problem here is that debugfs_use_file_start() must
> grab the lock first and check whether the file has been deleted in
> the meanwhile. But as it stands, there's nothing that would guarantee
> the existence of the lock at the time it's to be taken.

That seems like a strange argument to me - something has to exist for a
process to be able to look up the file, and currently the proxy also
has to exist?

So when a file is created you can allocate the proxy for it, and if you
can look up the proxy object - perhaps even using plain RCU - then you
also have the lock? IOW, instead of storing just the real_fops in
d_fsdata, you can store a small object that holds a lock and the
real_fops. You can always access that object, and lock it, but the
real_fops inside it might eventually end up NULL which you handle
through proxying. No?

johannes

2017-03-31 09:03:37

by Nicolai Stange

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Thu, Mar 30 2017, Johannes Berg wrote:

> On Thu, 2017-03-30 at 12:27 +0200, Nicolai Stange wrote:
>> So, please correct me if I'm wrong, there are two problems with
>> indefinitely blocking debugfs files' fops:
>>
>> 1. The one which actually hung your system:
>>    An indefinitely blocking debugfs_remove() while holding a lock.
>>    Other tasks attempting to grab that same lock get stuck as well.
>>
>> 2. The other one you've found, namely that the locking granularity is
>>    too coarse: a debugfs_remove() would get blocked by unrelated
>> files'
>>    pending fops.
>
> No, this isn't really an accurate description of the two problems.
>
>> AFAICS, the first one can't get resolved by simply refining the
>> blocking granularity: a debugfs_remove() on the indefinitely blocking
>> file would still block as well.
>
> Correct.
>
> The first problem - the one I ran into - is the following:
>
> 1)
> A given debugfs file's .read() was waiting for some event to happen
> (being a blocking file), and I was trying to debugfs_remove() some
> completely unrelated file, this got stuck.

I got it now. I was missing the "completely unrelated file" part.
(Admittedly, a related file would have made no sense at all -- the
remover would have been responsible to cancel any indefinite blocking in
there, as you said).

> Due to me holding a lock while doing this debugfs_remove(), other tasks
> *also* got stuck, but that's just a sub-problem - having the
> debugfs_remove() of an unrelated file get stuck would already have been
> a problem - the fact that other tasks also got stuck was just an
> additional wrinkle.
>
> Mind - this is a livelock of sorts - if the debugfs file will ever make
> progress, the system can recover.
>


> 2)
> There's a complete deadlock situation if this happens:
>
> CPU1 CPU2
>
> debugfs_file_read(file="foo") mutex_lock(&M);
> srcu_read_lock(&debugfs_srcu); debugfs_remove(file="bar")
> mutex_lock(&M); synchronize_srcu(&debugfs_srcu)
>
> This is intrinsically unrecoverable.

Let's address this in a second step.


> This is the core of the problem really - that you're tying completely
> unrelated processes together.
>
> Therefore, to continue using SRCU in this way means that you have to
> disallow blocking debugfs files. There may not be many of those, but
> any single one of them would be a problem.
>
> If we stop using SRCU this way we can discuss how we can fix it - but
> anything more coarse grained than per-file (which really makes SRCU
> unsuitable) would still have the same problem one way or another. And
> we haven't even addressed the deadlock situation (2 above) either.
>
>> When I did this, per-file reader/writer locks actuallt came to my
>> mind first. The problem here is that debugfs_use_file_start() must
>> grab the lock first and check whether the file has been deleted in
>> the meanwhile. But as it stands, there's nothing that would guarantee
>> the existence of the lock at the time it's to be taken.
>
> That seems like a strange argument to me - something has to exist for a
> process to be able to look up the file, and currently the proxy also
> has to exist?

No, the proxies are created at file _open_ time and installed at the
struct file.

Rationale: there are potentially many debugfs files with only few of them
opened at a time and a proxy, i.e. a struct file_operations, is quite
large.


> So when a file is created you can allocate the proxy for it, and if you
> can look up the proxy object - perhaps even using plain RCU - then you
> also have the lock? IOW, instead of storing just the real_fops in
> d_fsdata, you can store a small object that holds a lock and the
> real_fops. You can always access that object, and lock it, but the
> real_fops inside it might eventually end up NULL which you handle
> through proxying. No?

As said, there isn't always a proxy object around.

Of course, attaching some sort of lock on a per-file basis should be
doable. I just refrained from doing it so far (and resorted to SRCU
instead) because I wasn't aware of those indefinite blockers and wanted
to avoid the additional complexity (namely avoiding use-after-frees on
that lock).

I'll work out a solution this weekend and send some RFC patches then.

Thanks for your clarifications!

Nicolai

2017-03-31 09:45:01

by Johannes Berg

[permalink] [raw]
Subject: Re: deadlock in synchronize_srcu() in debugfs?

On Fri, 2017-03-31 at 11:03 +0200, Nicolai Stange wrote:

> > 2)
> > There's a complete deadlock situation if this happens:
> >
> > CPU1 CPU2
> >
> > debugfs_file_read(file="foo") mutex_lock(&M);
> > srcu_read_lock(&debugfs_srcu); debugfs_remove(file="
> > bar")
> > mutex_lock(&M); synchronize_srcu(&de
> > bugfs_srcu)
> >
> > This is intrinsically unrecoverable.
>
> Let's address this in a second step.

I suspect that it's actually better to address both in the same step,
but whatever :)

> > That seems like a strange argument to me - something has to exist
> > for a process to be able to look up the file, and currently the
> > proxy also has to exist?
>
> No, the proxies are created at file _open_ time and installed at the
> struct file.
>
> Rationale: there are potentially many debugfs files with only few of
> them opened at a time and a proxy, i.e. a struct file_operations, is
> quite large.

Ok, that makes sense. But that's not really a show-stopper, is it?

You can either have a proxy or not have it at remove time, and if you
don't have one then you can remove safely, right? And if you do have a
proxy, then you have to write_lock() it.

Lookup of the proxy itself can still be protected by (S)RCU, but you
can't go into the debugfs file callbacks while you hold (S)RCU, so that
you can safely determine whether or not a proxy exists.

I'm handwaving though - there are problems here with freeing the proxy
again when you close a file. Perhaps something like
* first, remove the pointer and wait for a grace period
* write_lock() it to make sure nobody is still inside it
* delete it now

works.

> I'll work out a solution this weekend and send some RFC patches then.
>
Thanks!

johannes

2017-04-16 09:51:50

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 0/9] debugfs: per-file removal protection

Hello Greg,

this series implements the debugfs removal protection at file granularity,
meant to solve the livelock issue reported by Johannes [1]:

Task 1 Task 2
mutex_lock(&m);
debugfs_use_file_start(&d2, ...);
debugfs_remove(d1); mutex_lock(&m);

with d1 != d2.

In order to be able to store the additionally required per-dentry state,
a small container struct, debugfs_fsdata, will be allocated and installed
at ->d_fsdata.

The remaining question is if and how these debugfs_fsdata instances should
be freed. For one possible solution, please see
[9/9] ("debugfs: free debugfs_fsdata instances"). As stated in that patch's
description, I'm not convinced that it's a particularly good one -- it's
included only to show how a lock-free/RCU based scheme would look like.
In order to avoid frequent allocations and deallocations, I'd personally
not free the debugfs_fsdata instances at all: there would be exactly one
per debugfs file ever opened, which probably isn't too much. If you don't
agree, I can try and see how a simpler solution based on a global spinlock
would look like... Any advice on how to proceed with this welcome!


Thanks,

Nicolai

[1] http://lkml.kernel.org/r/[email protected]


Nicolai Stange (9):
debugfs: add support for more elaborate ->d_fsdata
debugfs: implement per-file removal protection
debugfs: debugfs_real_fops(): drop __must_hold sparse annotation
debugfs: convert to debugfs_file_get() and -put()
IB/hfi1: convert to debugfs_file_get() and -put()
debugfs: purge obsolete SRCU based removal protection
debugfs: call debugfs_real_fops() only after debugfs_file_get()
debugfs: defer debugfs_fsdata allocation to first usage
debugfs: free debugfs_fsdata instances

drivers/infiniband/hw/hfi1/debugfs.c | 20 +--
fs/debugfs/file.c | 272 +++++++++++++++++++++++------------
fs/debugfs/inode.c | 55 +++++--
fs/debugfs/internal.h | 15 ++
include/linux/debugfs.h | 33 +----
lib/Kconfig.debug | 1 -
6 files changed, 256 insertions(+), 140 deletions(-)

--
2.12.2

2017-04-16 09:52:01

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 3/9] debugfs: debugfs_real_fops(): drop __must_hold sparse annotation

Currently, debugfs_real_fops() is annotated with a
__must_hold(&debugfs_srcu) sparse annotation.

With the conversion of the SRCU based protection of users against
concurrent file removals to a per-file refcount based scheme, this becomes
wrong.

Drop this annotation.

Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 6 +-----
include/linux/debugfs.h | 3 +--
2 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 081d74d390a6..3bc3e2e69f80 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -98,13 +98,9 @@ EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
#define F_DENTRY(filp) ((filp)->f_path.dentry)

const struct file_operations *debugfs_real_fops(const struct file *filp)
- __must_hold(&debugfs_srcu)
{
struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
- /*
- * Neither the pointer to the struct file_operations, nor its
- * contents ever change -- srcu_dereference() is not needed here.
- */
+
return fsd->real_fops;
}
EXPORT_SYMBOL_GPL(debugfs_real_fops);
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index d1f1104c41ee..c65ff61b498c 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -95,8 +95,7 @@ int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)

void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);

-const struct file_operations *debugfs_real_fops(const struct file *filp)
- __must_hold(&debugfs_srcu);
+const struct file_operations *debugfs_real_fops(const struct file *filp);

int debugfs_file_get(struct dentry *dentry);
void debugfs_file_put(struct dentry *dentry);
--
2.12.2

2017-04-16 09:51:58

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 2/9] debugfs: implement per-file removal protection

Since commit 49d200deaa68 ("debugfs: prevent access to removed files'
private data"), accesses to a file's private data are protected from
concurrent removal by covering all file_operations with a SRCU read section
and sychronizing with those before returning from debugfs_remove() by means
of synchronize_srcu().

As pointed out by Johannes Berg, there are debugfs files with forever
blocking file_operations. Their corresponding SRCU read side sections would
block any debugfs_remove() forever as well, even unrelated ones. This
results in a livelock. Because a remover can't cancel any indefinite
blocking within foreign files, this is a problem.

Resolve this by introducing support for more granular protection on a
per-file basis.

This is implemented by introducing an 'active_users' refcount_t to the
per-file struct debugfs_fsdata state. At file creation time, it is set to
one and a debugfs_remove() will drop that initial reference. The new
debugfs_file_get() and debugfs_file_put(), intended to be used in place of
former debugfs_use_file_start() and debugfs_use_file_finish(), increment
and decrement it respectively. Once the count drops to zero,
debugfs_file_put() will signal a completion which is possibly being waited
for from debugfs_remove().
Thus, as long as there is a debugfs_file_get() not yet matched by a
corresponding debugfs_file_put() around, debugfs_remove() will block.

Actual users of debugfs_use_file_start() and -finish() will get converted
to the new debugfs_file_get() and debugfs_file_put() by followup patches.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
data")
Reported-by: Johannes Berg <[email protected]>
Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/debugfs/inode.c | 24 +++++++++++++++++++-----
fs/debugfs/internal.h | 2 ++
include/linux/debugfs.h | 11 +++++++++++
4 files changed, 80 insertions(+), 5 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 15655a1a0704..081d74d390a6 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -109,6 +109,54 @@ const struct file_operations *debugfs_real_fops(const struct file *filp)
}
EXPORT_SYMBOL_GPL(debugfs_real_fops);

+/**
+ * debugfs_file_get - mark the beginning of file data access
+ * @dentry: the dentry object whose data is being accessed.
+ *
+ * Up to a matching call to debugfs_file_put(), any successive call
+ * into the file removing functions debugfs_remove() and
+ * debugfs_remove_recursive() will block. Since associated private
+ * file data may only get freed after a successful return of any of
+ * the removal functions, you may safely access it after a successful
+ * call to debugfs_file_get() without worrying about lifetime issues.
+ *
+ * If -%EIO is returned, the file has already been removed and thus,
+ * it is not safe to access any of its data. If, on the other hand,
+ * it is allowed to access the file data, zero is returned.
+ */
+int debugfs_file_get(struct dentry *dentry)
+{
+ struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+ /* Avoid starvation of removers. */
+ if (d_unlinked(dentry))
+ return -EIO;
+
+ if (!refcount_inc_not_zero(&fsd->active_users))
+ return -EIO;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(debugfs_file_get);
+
+/**
+ * debugfs_file_put - mark the end of file data access
+ * @dentry: the dentry object formerly passed to
+ * debugfs_file_get().
+ *
+ * Allow any ongoing concurrent call into debugfs_remove() or
+ * debugfs_remove_recursive() blocked by a former call to
+ * debugfs_file_get() to proceed and return to its caller.
+ */
+void debugfs_file_put(struct dentry *dentry)
+{
+ struct debugfs_fsdata *fsd = dentry->d_fsdata;
+
+ if (refcount_dec_and_test(&fsd->active_users))
+ complete(&fsd->active_users_drained);
+}
+EXPORT_SYMBOL_GPL(debugfs_file_put);
+
static int open_proxy_open(struct inode *inode, struct file *filp)
{
const struct dentry *dentry = F_DENTRY(filp);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 1ae133064ca1..88d51e666c38 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -376,6 +376,7 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,

inode->i_fop = proxy_fops;
fsd->real_fops = real_fops;
+ refcount_set(&fsd->active_users, 1);
dentry->d_fsdata = fsd;

d_instantiate(dentry, inode);
@@ -633,18 +634,31 @@ struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
}
EXPORT_SYMBOL_GPL(debugfs_create_symlink);

+static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)
+{
+ struct debugfs_fsdata *fsd;
+
+ simple_unlink(d_inode(parent), dentry);
+ d_delete(dentry);
+ fsd = dentry->d_fsdata;
+ init_completion(&fsd->active_users_drained);
+ if (!refcount_dec_and_test(&fsd->active_users))
+ wait_for_completion(&fsd->active_users_drained);
+}
+
static int __debugfs_remove(struct dentry *dentry, struct dentry *parent)
{
int ret = 0;

if (simple_positive(dentry)) {
dget(dentry);
- if (d_is_dir(dentry))
+ if (d_is_dir(dentry)) {
ret = simple_rmdir(d_inode(parent), dentry);
- else
- simple_unlink(d_inode(parent), dentry);
- if (!ret)
- d_delete(dentry);
+ if (!ret)
+ d_delete(dentry);
+ } else {
+ __debugfs_remove_file(dentry, parent);
+ }
dput(dentry);
}
return ret;
diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
index 512601eed3ce..0eea99432840 100644
--- a/fs/debugfs/internal.h
+++ b/fs/debugfs/internal.h
@@ -21,6 +21,8 @@ extern const struct file_operations debugfs_full_proxy_file_operations;

struct debugfs_fsdata {
const struct file_operations *real_fops;
+ refcount_t active_users;
+ struct completion active_users_drained;
};

#endif /* _DEBUGFS_INTERNAL_H_ */
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index d614be21412a..d1f1104c41ee 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -98,6 +98,9 @@ void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);
const struct file_operations *debugfs_real_fops(const struct file *filp)
__must_hold(&debugfs_srcu);

+int debugfs_file_get(struct dentry *dentry);
+void debugfs_file_put(struct dentry *dentry);
+
ssize_t debugfs_attr_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos);
ssize_t debugfs_attr_write(struct file *file, const char __user *buf,
@@ -228,6 +231,14 @@ static inline void debugfs_use_file_finish(int srcu_idx)
__releases(&debugfs_srcu)
{ }

+static inline int debugfs_file_get(struct dentry *dentry)
+{
+ return 0;
+}
+
+static inline void debugfs_file_put(struct dentry *dentry)
+{ }
+
static inline ssize_t debugfs_attr_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos)
{
--
2.12.2

2017-04-16 09:52:02

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 4/9] debugfs: convert to debugfs_file_get() and -put()

Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() from the debugfs core itself to the new
debugfs_file_get() and debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
data")
Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 106 ++++++++++++++++++++++++++----------------------------
1 file changed, 50 insertions(+), 56 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 3bc3e2e69f80..ebf5b2faf91c 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -155,15 +155,12 @@ EXPORT_SYMBOL_GPL(debugfs_file_put);

static int open_proxy_open(struct inode *inode, struct file *filp)
{
- const struct dentry *dentry = F_DENTRY(filp);
+ struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = NULL;
- int srcu_idx, r;
+ int r = 0;

- r = debugfs_use_file_start(dentry, &srcu_idx);
- if (r) {
- r = -ENOENT;
- goto out;
- }
+ if (debugfs_file_get(dentry))
+ return -ENOENT;

real_fops = debugfs_real_fops(filp);
real_fops = fops_get(real_fops);
@@ -180,7 +177,7 @@ static int open_proxy_open(struct inode *inode, struct file *filp)
r = real_fops->open(inode, filp);

out:
- debugfs_use_file_finish(srcu_idx);
+ debugfs_file_put(dentry);
return r;
}

@@ -194,16 +191,16 @@ const struct file_operations debugfs_open_proxy_file_operations = {
#define FULL_PROXY_FUNC(name, ret_type, filp, proto, args) \
static ret_type full_proxy_ ## name(proto) \
{ \
- const struct dentry *dentry = F_DENTRY(filp); \
+ struct dentry *dentry = F_DENTRY(filp); \
const struct file_operations *real_fops = \
debugfs_real_fops(filp); \
- int srcu_idx; \
ret_type r; \
\
- r = debugfs_use_file_start(dentry, &srcu_idx); \
- if (likely(!r)) \
- r = real_fops->name(args); \
- debugfs_use_file_finish(srcu_idx); \
+ r = debugfs_file_get(dentry); \
+ if (unlikely(r)) \
+ return r; \
+ r = real_fops->name(args); \
+ debugfs_file_put(dentry); \
return r; \
}

@@ -228,18 +225,15 @@ FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
static unsigned int full_proxy_poll(struct file *filp,
struct poll_table_struct *wait)
{
- const struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = debugfs_real_fops(filp);
- int srcu_idx;
+ struct dentry *dentry = F_DENTRY(filp);
unsigned int r = 0;

- if (debugfs_use_file_start(dentry, &srcu_idx)) {
- debugfs_use_file_finish(srcu_idx);
+ if (debugfs_file_get(dentry))
return POLLHUP;
- }

r = real_fops->poll(filp, wait);
- debugfs_use_file_finish(srcu_idx);
+ debugfs_file_put(dentry);
return r;
}

@@ -283,16 +277,13 @@ static void __full_proxy_fops_init(struct file_operations *proxy_fops,

static int full_proxy_open(struct inode *inode, struct file *filp)
{
- const struct dentry *dentry = F_DENTRY(filp);
+ struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = NULL;
struct file_operations *proxy_fops = NULL;
- int srcu_idx, r;
+ int r = 0;

- r = debugfs_use_file_start(dentry, &srcu_idx);
- if (r) {
- r = -ENOENT;
- goto out;
- }
+ if (debugfs_file_get(dentry))
+ return -ENOENT;

real_fops = debugfs_real_fops(filp);
real_fops = fops_get(real_fops);
@@ -330,7 +321,7 @@ static int full_proxy_open(struct inode *inode, struct file *filp)
kfree(proxy_fops);
fops_put(real_fops);
out:
- debugfs_use_file_finish(srcu_idx);
+ debugfs_file_put(dentry);
return r;
}

@@ -341,13 +332,14 @@ const struct file_operations debugfs_full_proxy_file_operations = {
ssize_t debugfs_attr_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos)
{
+ struct dentry *dentry = F_DENTRY(file);
ssize_t ret;
- int srcu_idx;

- ret = debugfs_use_file_start(F_DENTRY(file), &srcu_idx);
- if (likely(!ret))
- ret = simple_attr_read(file, buf, len, ppos);
- debugfs_use_file_finish(srcu_idx);
+ ret = debugfs_file_get(dentry);
+ if (unlikely(ret))
+ return ret;
+ ret = simple_attr_read(file, buf, len, ppos);
+ debugfs_file_put(dentry);
return ret;
}
EXPORT_SYMBOL_GPL(debugfs_attr_read);
@@ -355,13 +347,14 @@ EXPORT_SYMBOL_GPL(debugfs_attr_read);
ssize_t debugfs_attr_write(struct file *file, const char __user *buf,
size_t len, loff_t *ppos)
{
+ struct dentry *dentry = F_DENTRY(file);
ssize_t ret;
- int srcu_idx;

- ret = debugfs_use_file_start(F_DENTRY(file), &srcu_idx);
- if (likely(!ret))
- ret = simple_attr_write(file, buf, len, ppos);
- debugfs_use_file_finish(srcu_idx);
+ ret = debugfs_file_get(dentry);
+ if (unlikely(ret))
+ return ret;
+ ret = simple_attr_write(file, buf, len, ppos);
+ debugfs_file_put(dentry);
return ret;
}
EXPORT_SYMBOL_GPL(debugfs_attr_write);
@@ -795,14 +788,14 @@ ssize_t debugfs_read_file_bool(struct file *file, char __user *user_buf,
{
char buf[3];
bool val;
- int r, srcu_idx;
+ int r;
+ struct dentry *dentry = F_DENTRY(file);

- r = debugfs_use_file_start(F_DENTRY(file), &srcu_idx);
- if (likely(!r))
- val = *(bool *)file->private_data;
- debugfs_use_file_finish(srcu_idx);
- if (r)
+ r = debugfs_file_get(dentry);
+ if (unlikely(r))
return r;
+ val = *(bool *)file->private_data;
+ debugfs_file_put(dentry);

if (val)
buf[0] = 'Y';
@@ -820,8 +813,9 @@ ssize_t debugfs_write_file_bool(struct file *file, const char __user *user_buf,
char buf[32];
size_t buf_size;
bool bv;
- int r, srcu_idx;
+ int r;
bool *val = file->private_data;
+ struct dentry *dentry = F_DENTRY(file);

buf_size = min(count, (sizeof(buf)-1));
if (copy_from_user(buf, user_buf, buf_size))
@@ -829,12 +823,11 @@ ssize_t debugfs_write_file_bool(struct file *file, const char __user *user_buf,

buf[buf_size] = '\0';
if (strtobool(buf, &bv) == 0) {
- r = debugfs_use_file_start(F_DENTRY(file), &srcu_idx);
- if (likely(!r))
- *val = bv;
- debugfs_use_file_finish(srcu_idx);
- if (r)
+ r = debugfs_file_get(dentry);
+ if (unlikely(r))
return r;
+ *val = bv;
+ debugfs_file_put(dentry);
}

return count;
@@ -896,14 +889,15 @@ static ssize_t read_file_blob(struct file *file, char __user *user_buf,
size_t count, loff_t *ppos)
{
struct debugfs_blob_wrapper *blob = file->private_data;
+ struct dentry *dentry = F_DENTRY(file);
ssize_t r;
- int srcu_idx;

- r = debugfs_use_file_start(F_DENTRY(file), &srcu_idx);
- if (likely(!r))
- r = simple_read_from_buffer(user_buf, count, ppos, blob->data,
- blob->size);
- debugfs_use_file_finish(srcu_idx);
+ r = debugfs_file_get(dentry);
+ if (unlikely(r))
+ return r;
+ r = simple_read_from_buffer(user_buf, count, ppos, blob->data,
+ blob->size);
+ debugfs_file_put(dentry);
return r;
}

--
2.12.2

2017-04-16 09:52:07

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 6/9] debugfs: purge obsolete SRCU based removal protection

Purge the SRCU based file removal race protection in favour of the new,
refcount based debugfs_file_get()/debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
data")
Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 48 ------------------------------------------------
fs/debugfs/inode.c | 7 -------
include/linux/debugfs.h | 19 -------------------
lib/Kconfig.debug | 1 -
4 files changed, 75 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index ebf5b2faf91c..7de733ccdf6c 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -22,7 +22,6 @@
#include <linux/slab.h>
#include <linux/atomic.h>
#include <linux/device.h>
-#include <linux/srcu.h>
#include <asm/poll.h>

#include "internal.h"
@@ -48,53 +47,6 @@ const struct file_operations debugfs_noop_file_operations = {
.llseek = noop_llseek,
};

-/**
- * debugfs_use_file_start - mark the beginning of file data access
- * @dentry: the dentry object whose data is being accessed.
- * @srcu_idx: a pointer to some memory to store a SRCU index in.
- *
- * Up to a matching call to debugfs_use_file_finish(), any
- * successive call into the file removing functions debugfs_remove()
- * and debugfs_remove_recursive() will block. Since associated private
- * file data may only get freed after a successful return of any of
- * the removal functions, you may safely access it after a successful
- * call to debugfs_use_file_start() without worrying about
- * lifetime issues.
- *
- * If -%EIO is returned, the file has already been removed and thus,
- * it is not safe to access any of its data. If, on the other hand,
- * it is allowed to access the file data, zero is returned.
- *
- * Regardless of the return code, any call to
- * debugfs_use_file_start() must be followed by a matching call
- * to debugfs_use_file_finish().
- */
-int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)
- __acquires(&debugfs_srcu)
-{
- *srcu_idx = srcu_read_lock(&debugfs_srcu);
- barrier();
- if (d_unlinked(dentry))
- return -EIO;
- return 0;
-}
-EXPORT_SYMBOL_GPL(debugfs_use_file_start);
-
-/**
- * debugfs_use_file_finish - mark the end of file data access
- * @srcu_idx: the SRCU index "created" by a former call to
- * debugfs_use_file_start().
- *
- * Allow any ongoing concurrent call into debugfs_remove() or
- * debugfs_remove_recursive() blocked by a former call to
- * debugfs_use_file_start() to proceed and return to its caller.
- */
-void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu)
-{
- srcu_read_unlock(&debugfs_srcu, srcu_idx);
-}
-EXPORT_SYMBOL_GPL(debugfs_use_file_finish);
-
#define F_DENTRY(filp) ((filp)->f_path.dentry)

const struct file_operations *debugfs_real_fops(const struct file *filp)
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 88d51e666c38..5550f11d60bd 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -27,14 +27,11 @@
#include <linux/parser.h>
#include <linux/magic.h>
#include <linux/slab.h>
-#include <linux/srcu.h>

#include "internal.h"

#define DEBUGFS_DEFAULT_MODE 0700

-DEFINE_SRCU(debugfs_srcu);
-
static struct vfsmount *debugfs_mount;
static int debugfs_mount_count;
static bool debugfs_registered;
@@ -692,8 +689,6 @@ void debugfs_remove(struct dentry *dentry)
inode_unlock(d_inode(parent));
if (!ret)
simple_release_fs(&debugfs_mount, &debugfs_mount_count);
-
- synchronize_srcu(&debugfs_srcu);
}
EXPORT_SYMBOL_GPL(debugfs_remove);

@@ -767,8 +762,6 @@ void debugfs_remove_recursive(struct dentry *dentry)
if (!__debugfs_remove(child, parent))
simple_release_fs(&debugfs_mount, &debugfs_mount_count);
inode_unlock(d_inode(parent));
-
- synchronize_srcu(&debugfs_srcu);
}
EXPORT_SYMBOL_GPL(debugfs_remove_recursive);

diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index c65ff61b498c..df4d6eff2aac 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -23,7 +23,6 @@

struct device;
struct file_operations;
-struct srcu_struct;

struct debugfs_blob_wrapper {
void *data;
@@ -43,8 +42,6 @@ struct debugfs_regset32 {

extern struct dentry *arch_debugfs_dir;

-extern struct srcu_struct debugfs_srcu;
-
#define DEFINE_DEBUGFS_ATTRIBUTE(__fops, __get, __set, __fmt) \
static int __fops ## _open(struct inode *inode, struct file *file) \
{ \
@@ -90,11 +87,6 @@ struct dentry *debugfs_create_automount(const char *name,
void debugfs_remove(struct dentry *dentry);
void debugfs_remove_recursive(struct dentry *dentry);

-int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)
- __acquires(&debugfs_srcu);
-
-void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);
-
const struct file_operations *debugfs_real_fops(const struct file *filp);

int debugfs_file_get(struct dentry *dentry);
@@ -219,17 +211,6 @@ static inline void debugfs_remove(struct dentry *dentry)
static inline void debugfs_remove_recursive(struct dentry *dentry)
{ }

-static inline int debugfs_use_file_start(const struct dentry *dentry,
- int *srcu_idx)
- __acquires(&debugfs_srcu)
-{
- return 0;
-}
-
-static inline void debugfs_use_file_finish(int srcu_idx)
- __releases(&debugfs_srcu)
-{ }
-
static inline int debugfs_file_get(struct dentry *dentry)
{
return 0;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 77fadface4f9..92759731854c 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -279,7 +279,6 @@ config PAGE_OWNER

config DEBUG_FS
bool "Debug Filesystem"
- select SRCU
help
debugfs is a virtual file system that kernel developers use to put
debugging files into. Enable this option to be able to read and
--
2.12.2

2017-04-16 09:52:20

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 8/9] debugfs: defer debugfs_fsdata allocation to first usage

Currently, __debugfs_create_file allocates one struct debugfs_fsdata
instance for every file created. However, there are potentially many
debugfs file around, most of which are never touched by userspace.

Thus, defer the allocations to the first usage, i.e. to the first
debugfs_file_get().

A dentry's ->d_fsdata starts out to point to the "real", user provided
fops. After a debugfs_fsdata instance has been allocated (and the real
fops pointer has been moved over into its ->real_fops member),
->d_fsdata is changed to point to it from then on. The two cases are
distinguished by setting BIT(0) for the real fops case.

struct debugfs_fsdata's foremost purpose is to track active users and to
make debugfs_remove() block until they are done. Since no debugfs_fsdata
instance means no active users, make debugfs_remove() return immediately
in this case.

Take care of possible races between debugfs_file_get() and
debugfs_remove(): either debugfs_remove() must see a debugfs_fsdata
instance and thus wait for possible active users or debugfs_file_get() must
see a dead dentry and return immediately.

Make a dentry's ->d_release(), i.e. debugfs_release_dentry(), check whether
->d_fsdata is actually a debugfs_fsdata instance before kfree()ing it.

Finally, the set of possible error codes returned from debugfs_file_get()
has grown from -EIO to -EIO and -ENOMEM. Make open_proxy_open() and
full_proxy_open() pass the -ENOMEM onwards to their callers.

Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 47 ++++++++++++++++++++++++++++++++++++++---------
fs/debugfs/inode.c | 36 +++++++++++++++++++-----------------
fs/debugfs/internal.h | 8 ++++++++
3 files changed, 65 insertions(+), 26 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index d92038c5f131..f4dfd7d0d625 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -53,6 +53,7 @@ const struct file_operations *debugfs_real_fops(const struct file *filp)
{
struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;

+ WARN_ON((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
return fsd->real_fops;
}
EXPORT_SYMBOL_GPL(debugfs_real_fops);
@@ -74,9 +75,35 @@ EXPORT_SYMBOL_GPL(debugfs_real_fops);
*/
int debugfs_file_get(struct dentry *dentry)
{
- struct debugfs_fsdata *fsd = dentry->d_fsdata;
+ struct debugfs_fsdata *fsd;
+ void *d_fsd;
+
+ d_fsd = READ_ONCE(dentry->d_fsdata);
+ if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
+ fsd = d_fsd;
+ } else {
+ fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
+ if (!fsd)
+ return -ENOMEM;
+
+ fsd->real_fops = (void *)((unsigned long)d_fsd &
+ ~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
+ refcount_set(&fsd->active_users, 1);
+ init_completion(&fsd->active_users_drained);
+ if (cmpxchg(&dentry->d_fsdata, d_fsd, fsd) != d_fsd) {
+ kfree(fsd);
+ fsd = READ_ONCE(dentry->d_fsdata);
+ }
+ }

- /* Avoid starvation of removers. */
+ /*
+ * In case of a successful cmpxchg() above, this check is
+ * strictly necessary and must follow it, see the comment in
+ * __debugfs_remove_file().
+ * OTOH, if the cmpxchg() hasn't been executed or wasn't
+ * successful, this serves the purpose of not starving
+ * removers.
+ */
if (d_unlinked(dentry))
return -EIO;

@@ -98,7 +125,7 @@ EXPORT_SYMBOL_GPL(debugfs_file_get);
*/
void debugfs_file_put(struct dentry *dentry)
{
- struct debugfs_fsdata *fsd = dentry->d_fsdata;
+ struct debugfs_fsdata *fsd = READ_ONCE(dentry->d_fsdata);

if (refcount_dec_and_test(&fsd->active_users))
complete(&fsd->active_users_drained);
@@ -109,10 +136,11 @@ static int open_proxy_open(struct inode *inode, struct file *filp)
{
struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = NULL;
- int r = 0;
+ int r;

- if (debugfs_file_get(dentry))
- return -ENOENT;
+ r = debugfs_file_get(dentry);
+ if (r)
+ return r == -EIO ? -ENOENT : r;

real_fops = debugfs_real_fops(filp);
real_fops = fops_get(real_fops);
@@ -233,10 +261,11 @@ static int full_proxy_open(struct inode *inode, struct file *filp)
struct dentry *dentry = F_DENTRY(filp);
const struct file_operations *real_fops = NULL;
struct file_operations *proxy_fops = NULL;
- int r = 0;
+ int r;

- if (debugfs_file_get(dentry))
- return -ENOENT;
+ r = debugfs_file_get(dentry);
+ if (r)
+ return r == -EIO ? -ENOENT : r;

real_fops = debugfs_real_fops(filp);
real_fops = fops_get(real_fops);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 5550f11d60bd..2360c17ec00a 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -184,7 +184,10 @@ static const struct super_operations debugfs_super_operations = {

static void debugfs_release_dentry(struct dentry *dentry)
{
- kfree(dentry->d_fsdata);
+ void *fsd = dentry->d_fsdata;
+
+ if (!((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT))
+ kfree(dentry->d_fsdata);
}

static struct vfsmount *debugfs_automount(struct path *path)
@@ -346,35 +349,25 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
{
struct dentry *dentry;
struct inode *inode;
- struct debugfs_fsdata *fsd;
-
- fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
- if (!fsd)
- return NULL;

if (!(mode & S_IFMT))
mode |= S_IFREG;
BUG_ON(!S_ISREG(mode));
dentry = start_creating(name, parent);

- if (IS_ERR(dentry)) {
- kfree(fsd);
+ if (IS_ERR(dentry))
return NULL;
- }

inode = debugfs_get_inode(dentry->d_sb);
- if (unlikely(!inode)) {
- kfree(fsd);
+ if (unlikely(!inode))
return failed_creating(dentry);
- }

inode->i_mode = mode;
inode->i_private = data;

inode->i_fop = proxy_fops;
- fsd->real_fops = real_fops;
- refcount_set(&fsd->active_users, 1);
- dentry->d_fsdata = fsd;
+ dentry->d_fsdata = (void *)((unsigned long)real_fops |
+ DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);

d_instantiate(dentry, inode);
fsnotify_create(d_inode(dentry->d_parent), dentry);
@@ -637,8 +630,17 @@ static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)

simple_unlink(d_inode(parent), dentry);
d_delete(dentry);
- fsd = dentry->d_fsdata;
- init_completion(&fsd->active_users_drained);
+
+ /*
+ * Paired with the closing smp_mb() implied by a successful
+ * cmpxchg() in debugfs_file_get(): either
+ * debugfs_file_get() must see a dead dentry or we must see a
+ * debugfs_fsdata instance at ->d_fsdata here (or both).
+ */
+ smp_mb();
+ fsd = READ_ONCE(dentry->d_fsdata);
+ if ((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)
+ return;
if (!refcount_dec_and_test(&fsd->active_users))
wait_for_completion(&fsd->active_users_drained);
}
diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
index 0eea99432840..cb1e8139c398 100644
--- a/fs/debugfs/internal.h
+++ b/fs/debugfs/internal.h
@@ -25,4 +25,12 @@ struct debugfs_fsdata {
struct completion active_users_drained;
};

+/*
+ * A dentry's ->d_fsdata either points to the real fops or to a
+ * dynamically allocated debugfs_fsdata instance.
+ * In order to distinguish between these two cases, a real fops
+ * pointer gets its lowest bit set.
+ */
+#define DEBUGFS_FSDATA_IS_REAL_FOPS_BIT BIT(0)
+
#endif /* _DEBUGFS_INTERNAL_H_ */
--
2.12.2

2017-04-16 09:52:37

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

Currently, a dentry's debugfs_fsdata instance is allocated from
debugfs_file_get() at first usage, i.e. at first file opening.

It won't ever get freed though.

Ideally, these instances would get freed after the last open file handle
gets closed again, that is from the fops' ->release().

Unfortunately, we can't hook easily into those ->release()ers of files
created through debugfs_create_file_unsafe(), i.e. of those proxied through
debugfs_open_proxy_file_operations rather than through the "full"
debugfs_full_proxy_file_operations proxy.

Hence, free unreferenced debugfs_fsdata instances from debugfs_file_put(),
with the drawback of a potential allocation + deallocation per
debugfs_file_get() + debugfs_file_put() pair, that is per fops invocation.

In addition to its former role of tracking pending fops, use the
->active_users for reference counting on the debugfs_fsdata instance
itself. In particular, don't keep a dummy reference to be dropped from
__debugfs_remove_file(): a d_delete()ed dentry and thus, request for
completion notification, is now signaled by the d_unlinked() dentry itself.

Once ->active_users drops to zero (and the dentry is still intact), free
the debugfs_fsdata instance from debugfs_file_put(). RCU protects any
concurrent debugfs_file_get() attempts to get a hold of the instance here.
Likewise for full_proxy_release() which lacks a call to debugfs_file_get().

Note that due to non-atomic updates to the d_unlinked() + ->d_fsdata pair,
care must be taken in order to avoid races between debugfs_file_put() and
debugfs_file_get() as well as __debugfs_remove_file(). Rather than
introducing a global lock, exploit the fact that there will ever be only a
single !d_unlinked() -> d_unlinked() transition and add memory barriers
where needed. Given the lack of proper benchmarking, that debugfs fops
aren't performance critical and that we've already got a potential
allocation/deallocation pair anyway, the added code complexity might be
highly questionable though.

Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 102 ++++++++++++++++++++++++++++++++++++++++----------
fs/debugfs/inode.c | 8 +++-
fs/debugfs/internal.h | 1 +
3 files changed, 90 insertions(+), 21 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index f4dfd7d0d625..b2cc25d44a39 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -22,6 +22,7 @@
#include <linux/slab.h>
#include <linux/atomic.h>
#include <linux/device.h>
+#include <linux/rcupdate.h>
#include <asm/poll.h>

#include "internal.h"
@@ -78,10 +79,39 @@ int debugfs_file_get(struct dentry *dentry)
struct debugfs_fsdata *fsd;
void *d_fsd;

- d_fsd = READ_ONCE(dentry->d_fsdata);
+ rcu_read_lock();
+retry:
+ d_fsd = rcu_dereference(dentry->d_fsdata);
if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
+ /*
+ * Paired with the control dependency in
+ * debugfs_file_put(): if we saw the debugfs_fsdata
+ * instance "restored" there but not the dead dentry,
+ * we'd erroneously instantiate a fresh debugfs_fsdata
+ * instance below.
+ */
+ smp_rmb();
+ if (d_unlinked(dentry)) {
+ rcu_read_unlock();
+ return -EIO;
+ }
+
fsd = d_fsd;
+ if (!refcount_inc_not_zero(&fsd->active_users)) {
+ /*
+ * A concurrent debugfs_file_put() dropped the
+ * count to zero and is about to free the
+ * debugfs_fsdata. Help out resetting the
+ * ->d_fsdata and retry.
+ */
+ d_fsd = (void *)((unsigned long)fsd->real_fops |
+ DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
+ RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);
+ goto retry;
+ }
+ rcu_read_unlock();
} else {
+ rcu_read_unlock();
fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
if (!fsd)
return -ENOMEM;
@@ -91,25 +121,28 @@ int debugfs_file_get(struct dentry *dentry)
refcount_set(&fsd->active_users, 1);
init_completion(&fsd->active_users_drained);
if (cmpxchg(&dentry->d_fsdata, d_fsd, fsd) != d_fsd) {
+ /*
+ * Another debugfs_file_get() has installed a
+ * debugfs_fsdata instance concurrently.
+ * Free ours and retry to grab a reference on
+ * the installed one.
+ */
kfree(fsd);
- fsd = READ_ONCE(dentry->d_fsdata);
+ rcu_read_lock();
+ goto retry;
+ }
+ /*
+ * In case of a successful cmpxchg() above, this check is
+ * strictly necessary and must follow it, see the comment in
+ * __debugfs_remove_file().
+ */
+ if (d_unlinked(dentry)) {
+ if (refcount_dec_and_test(&fsd->active_users))
+ complete(&fsd->active_users_drained);
+ return -EIO;
}
}

- /*
- * In case of a successful cmpxchg() above, this check is
- * strictly necessary and must follow it, see the comment in
- * __debugfs_remove_file().
- * OTOH, if the cmpxchg() hasn't been executed or wasn't
- * successful, this serves the purpose of not starving
- * removers.
- */
- if (d_unlinked(dentry))
- return -EIO;
-
- if (!refcount_inc_not_zero(&fsd->active_users))
- return -EIO;
-
return 0;
}
EXPORT_SYMBOL_GPL(debugfs_file_get);
@@ -126,9 +159,29 @@ EXPORT_SYMBOL_GPL(debugfs_file_get);
void debugfs_file_put(struct dentry *dentry)
{
struct debugfs_fsdata *fsd = READ_ONCE(dentry->d_fsdata);
+ void *d_fsd;

- if (refcount_dec_and_test(&fsd->active_users))
- complete(&fsd->active_users_drained);
+ if (refcount_dec_and_test(&fsd->active_users)) {
+ d_fsd = (void *)((unsigned long)fsd->real_fops |
+ DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
+ RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);
+ /* Paired with smp_mb() in __debugfs_remove_file(). */
+ smp_mb();
+ if (d_unlinked(dentry)) {
+ /*
+ * We have a control dependency paired with the
+ * smp_rmb() in debugfs_file_get() here.
+ *
+ * Restore the debugfs_fsdata instance into
+ * ->d_fsdata s.t. ->d_release() can free
+ * it.
+ */
+ WRITE_ONCE(dentry->d_fsdata, fsd);
+ complete(&fsd->active_users_drained);
+ } else {
+ kfree_rcu(fsd, rcu_head);
+ }
+ }
}
EXPORT_SYMBOL_GPL(debugfs_file_put);

@@ -221,9 +274,20 @@ static unsigned int full_proxy_poll(struct file *filp,
static int full_proxy_release(struct inode *inode, struct file *filp)
{
const struct dentry *dentry = F_DENTRY(filp);
- const struct file_operations *real_fops = debugfs_real_fops(filp);
const struct file_operations *proxy_fops = filp->f_op;
int r = 0;
+ void *d_fsd;
+ const struct file_operations *real_fops;
+
+ rcu_read_lock();
+ d_fsd = rcu_dereference(F_DENTRY(filp)->d_fsdata);
+ if ((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT) {
+ real_fops = (void *)((unsigned long)d_fsd &
+ ~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
+ } else {
+ real_fops = ((struct debugfs_fsdata *)d_fsd)->real_fops;
+ }
+ rcu_read_unlock();

/*
* We must not protect this against removal races here: the
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 2360c17ec00a..bacb4d6bf178 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -27,6 +27,7 @@
#include <linux/parser.h>
#include <linux/magic.h>
#include <linux/slab.h>
+#include <linux/rcupdate.h>

#include "internal.h"

@@ -636,13 +637,16 @@ static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)
* cmpxchg() in debugfs_file_get(): either
* debugfs_file_get() must see a dead dentry or we must see a
* debugfs_fsdata instance at ->d_fsdata here (or both).
+ *
+ * Also paired with the smp_mb() in debugfs_file_put(): if we
+ * see a debugfs_fsdata instance here, then debugfs_file_put()
+ * must see a dead dentry.
*/
smp_mb();
fsd = READ_ONCE(dentry->d_fsdata);
if ((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)
return;
- if (!refcount_dec_and_test(&fsd->active_users))
- wait_for_completion(&fsd->active_users_drained);
+ wait_for_completion(&fsd->active_users_drained);
}

static int __debugfs_remove(struct dentry *dentry, struct dentry *parent)
diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
index cb1e8139c398..0445bd7d11f2 100644
--- a/fs/debugfs/internal.h
+++ b/fs/debugfs/internal.h
@@ -23,6 +23,7 @@ struct debugfs_fsdata {
const struct file_operations *real_fops;
refcount_t active_users;
struct completion active_users_drained;
+ struct rcu_head rcu_head;
};

/*
--
2.12.2

2017-04-16 09:53:08

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 7/9] debugfs: call debugfs_real_fops() only after debugfs_file_get()

The current implementation of debugfs_real_fops() relies on a
debugfs_fsdata instance to be installed at ->d_fsdata.

With future patches introducing lazy allocation of these, this requirement
will be guaranteed to be fullfilled only inbetween a
debugfs_file_get()/debugfs_file_put() pair.

The full proxies' fops implemented by debugfs happen to be the only
offenders. Fix them up by moving their debugfs_real_fops() calls past those
to debugfs_file_get().

full_proxy_release() is special as it doesn't invoke debugfs_file_get() at
all. Leave it alone for now.

Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 7de733ccdf6c..d92038c5f131 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -144,13 +144,13 @@ const struct file_operations debugfs_open_proxy_file_operations = {
static ret_type full_proxy_ ## name(proto) \
{ \
struct dentry *dentry = F_DENTRY(filp); \
- const struct file_operations *real_fops = \
- debugfs_real_fops(filp); \
+ const struct file_operations *real_fops; \
ret_type r; \
\
r = debugfs_file_get(dentry); \
if (unlikely(r)) \
return r; \
+ real_fops = debugfs_real_fops(filp); \
r = real_fops->name(args); \
debugfs_file_put(dentry); \
return r; \
@@ -177,13 +177,14 @@ FULL_PROXY_FUNC(unlocked_ioctl, long, filp,
static unsigned int full_proxy_poll(struct file *filp,
struct poll_table_struct *wait)
{
- const struct file_operations *real_fops = debugfs_real_fops(filp);
struct dentry *dentry = F_DENTRY(filp);
unsigned int r = 0;
+ const struct file_operations *real_fops;

if (debugfs_file_get(dentry))
return POLLHUP;

+ real_fops = debugfs_real_fops(filp);
r = real_fops->poll(filp, wait);
debugfs_file_put(dentry);
return r;
--
2.12.2

2017-04-16 09:53:28

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 5/9] IB/hfi1: convert to debugfs_file_get() and -put()

Convert all calls to the now obsolete debugfs_use_file_start() and
debugfs_use_file_finish() to the new debugfs_file_get() and
debugfs_file_put() API.

Fixes: 49d200deaa68 ("debugfs: prevent access to removed files' private
data")
Signed-off-by: Nicolai Stange <[email protected]>
---
drivers/infiniband/hw/hfi1/debugfs.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/debugfs.c b/drivers/infiniband/hw/hfi1/debugfs.c
index dc2c1c993f04..ec78fe678179 100644
--- a/drivers/infiniband/hw/hfi1/debugfs.c
+++ b/drivers/infiniband/hw/hfi1/debugfs.c
@@ -72,13 +72,13 @@ static ssize_t hfi1_seq_read(
loff_t *ppos)
{
struct dentry *d = file->f_path.dentry;
- int srcu_idx;
ssize_t r;

- r = debugfs_use_file_start(d, &srcu_idx);
- if (likely(!r))
- r = seq_read(file, buf, size, ppos);
- debugfs_use_file_finish(srcu_idx);
+ r = debugfs_file_get(d);
+ if (unlikely(r))
+ return r;
+ r = seq_read(file, buf, size, ppos);
+ debugfs_file_put(d);
return r;
}

@@ -88,13 +88,13 @@ static loff_t hfi1_seq_lseek(
int whence)
{
struct dentry *d = file->f_path.dentry;
- int srcu_idx;
loff_t r;

- r = debugfs_use_file_start(d, &srcu_idx);
- if (likely(!r))
- r = seq_lseek(file, offset, whence);
- debugfs_use_file_finish(srcu_idx);
+ r = debugfs_file_get(d);
+ if (unlikely(r))
+ return r;
+ r = seq_lseek(file, offset, whence);
+ debugfs_file_put(d);
return r;
}

--
2.12.2

2017-04-16 09:53:45

by Nicolai Stange

[permalink] [raw]
Subject: [RFC PATCH 1/9] debugfs: add support for more elaborate ->d_fsdata

Currently, the user provided fops, "real_fops", are stored directly into
->d_fsdata.

In order to be able to store more per-file state and thus prepare for more
granular file removal protection, wrap the real_fops into a dynamically
allocated container struct, debugfs_fsdata.

A struct debugfs_fsdata gets allocated at file creation and freed from the
newly intoduced ->d_release().

Finally, move the implementation of debugfs_real_fops() out of the public
debugfs header such that struct debugfs_fsdata's declaration can be kept
private.

Signed-off-by: Nicolai Stange <[email protected]>
---
fs/debugfs/file.c | 12 ++++++++++++
fs/debugfs/inode.c | 22 +++++++++++++++++++---
fs/debugfs/internal.h | 4 ++++
include/linux/debugfs.h | 20 +++-----------------
4 files changed, 38 insertions(+), 20 deletions(-)

diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
index 354e2ab62031..15655a1a0704 100644
--- a/fs/debugfs/file.c
+++ b/fs/debugfs/file.c
@@ -97,6 +97,18 @@ EXPORT_SYMBOL_GPL(debugfs_use_file_finish);

#define F_DENTRY(filp) ((filp)->f_path.dentry)

+const struct file_operations *debugfs_real_fops(const struct file *filp)
+ __must_hold(&debugfs_srcu)
+{
+ struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
+ /*
+ * Neither the pointer to the struct file_operations, nor its
+ * contents ever change -- srcu_dereference() is not needed here.
+ */
+ return fsd->real_fops;
+}
+EXPORT_SYMBOL_GPL(debugfs_real_fops);
+
static int open_proxy_open(struct inode *inode, struct file *filp)
{
const struct dentry *dentry = F_DENTRY(filp);
diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 7fd4ec4bb214..1ae133064ca1 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -185,6 +185,11 @@ static const struct super_operations debugfs_super_operations = {
.evict_inode = debugfs_evict_inode,
};

+static void debugfs_release_dentry(struct dentry *dentry)
+{
+ kfree(dentry->d_fsdata);
+}
+
static struct vfsmount *debugfs_automount(struct path *path)
{
debugfs_automount_t f;
@@ -194,6 +199,7 @@ static struct vfsmount *debugfs_automount(struct path *path)

static const struct dentry_operations debugfs_dops = {
.d_delete = always_delete_dentry,
+ .d_release = debugfs_release_dentry,
.d_automount = debugfs_automount,
};

@@ -343,24 +349,34 @@ static struct dentry *__debugfs_create_file(const char *name, umode_t mode,
{
struct dentry *dentry;
struct inode *inode;
+ struct debugfs_fsdata *fsd;
+
+ fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
+ if (!fsd)
+ return NULL;

if (!(mode & S_IFMT))
mode |= S_IFREG;
BUG_ON(!S_ISREG(mode));
dentry = start_creating(name, parent);

- if (IS_ERR(dentry))
+ if (IS_ERR(dentry)) {
+ kfree(fsd);
return NULL;
+ }

inode = debugfs_get_inode(dentry->d_sb);
- if (unlikely(!inode))
+ if (unlikely(!inode)) {
+ kfree(fsd);
return failed_creating(dentry);
+ }

inode->i_mode = mode;
inode->i_private = data;

inode->i_fop = proxy_fops;
- dentry->d_fsdata = (void *)real_fops;
+ fsd->real_fops = real_fops;
+ dentry->d_fsdata = fsd;

d_instantiate(dentry, inode);
fsnotify_create(d_inode(dentry->d_parent), dentry);
diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
index b3e8443a1f47..512601eed3ce 100644
--- a/fs/debugfs/internal.h
+++ b/fs/debugfs/internal.h
@@ -19,4 +19,8 @@ extern const struct file_operations debugfs_noop_file_operations;
extern const struct file_operations debugfs_open_proxy_file_operations;
extern const struct file_operations debugfs_full_proxy_file_operations;

+struct debugfs_fsdata {
+ const struct file_operations *real_fops;
+};
+
#endif /* _DEBUGFS_INTERNAL_H_ */
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index 9174b0d28582..d614be21412a 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -45,23 +45,6 @@ extern struct dentry *arch_debugfs_dir;

extern struct srcu_struct debugfs_srcu;

-/**
- * debugfs_real_fops - getter for the real file operation
- * @filp: a pointer to a struct file
- *
- * Must only be called under the protection established by
- * debugfs_use_file_start().
- */
-static inline const struct file_operations *debugfs_real_fops(const struct file *filp)
- __must_hold(&debugfs_srcu)
-{
- /*
- * Neither the pointer to the struct file_operations, nor its
- * contents ever change -- srcu_dereference() is not needed here.
- */
- return filp->f_path.dentry->d_fsdata;
-}
-
#define DEFINE_DEBUGFS_ATTRIBUTE(__fops, __get, __set, __fmt) \
static int __fops ## _open(struct inode *inode, struct file *file) \
{ \
@@ -112,6 +95,9 @@ int debugfs_use_file_start(const struct dentry *dentry, int *srcu_idx)

void debugfs_use_file_finish(int srcu_idx) __releases(&debugfs_srcu);

+const struct file_operations *debugfs_real_fops(const struct file *filp)
+ __must_hold(&debugfs_srcu);
+
ssize_t debugfs_attr_read(struct file *file, char __user *buf,
size_t len, loff_t *ppos);
ssize_t debugfs_attr_write(struct file *file, const char __user *buf,
--
2.12.2

2017-04-17 16:01:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Sun, Apr 16, 2017 at 11:51:37AM +0200, Nicolai Stange wrote:
> Currently, a dentry's debugfs_fsdata instance is allocated from
> debugfs_file_get() at first usage, i.e. at first file opening.
>
> It won't ever get freed though.
>
> Ideally, these instances would get freed after the last open file handle
> gets closed again, that is from the fops' ->release().
>
> Unfortunately, we can't hook easily into those ->release()ers of files
> created through debugfs_create_file_unsafe(), i.e. of those proxied through
> debugfs_open_proxy_file_operations rather than through the "full"
> debugfs_full_proxy_file_operations proxy.
>
> Hence, free unreferenced debugfs_fsdata instances from debugfs_file_put(),
> with the drawback of a potential allocation + deallocation per
> debugfs_file_get() + debugfs_file_put() pair, that is per fops invocation.
>
> In addition to its former role of tracking pending fops, use the
> ->active_users for reference counting on the debugfs_fsdata instance
> itself. In particular, don't keep a dummy reference to be dropped from
> __debugfs_remove_file(): a d_delete()ed dentry and thus, request for
> completion notification, is now signaled by the d_unlinked() dentry itself.
>
> Once ->active_users drops to zero (and the dentry is still intact), free
> the debugfs_fsdata instance from debugfs_file_put(). RCU protects any
> concurrent debugfs_file_get() attempts to get a hold of the instance here.
> Likewise for full_proxy_release() which lacks a call to debugfs_file_get().
>
> Note that due to non-atomic updates to the d_unlinked() + ->d_fsdata pair,
> care must be taken in order to avoid races between debugfs_file_put() and
> debugfs_file_get() as well as __debugfs_remove_file(). Rather than
> introducing a global lock, exploit the fact that there will ever be only a
> single !d_unlinked() -> d_unlinked() transition and add memory barriers
> where needed. Given the lack of proper benchmarking, that debugfs fops
> aren't performance critical and that we've already got a potential
> allocation/deallocation pair anyway, the added code complexity might be
> highly questionable though.
>
> Signed-off-by: Nicolai Stange <[email protected]>

If you have not already done so, please run this with debug enabled,
especially CONFIG_PROVE_LOCKING=y (which implies CONFIG_PROVE_RCU=y).
This is important because there are configurations for which the deadlocks
you saw with SRCU turn into silent failure, including memory corruption.
CONFIG_PROVE_RCU=y will catch many of those situations.

(And yes, kfree_rcu() doesn't have that problem, but...)

Another issue called out inline.

Thanx, Paul

> ---
> fs/debugfs/file.c | 102 ++++++++++++++++++++++++++++++++++++++++----------
> fs/debugfs/inode.c | 8 +++-
> fs/debugfs/internal.h | 1 +
> 3 files changed, 90 insertions(+), 21 deletions(-)
>
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index f4dfd7d0d625..b2cc25d44a39 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -22,6 +22,7 @@
> #include <linux/slab.h>
> #include <linux/atomic.h>
> #include <linux/device.h>
> +#include <linux/rcupdate.h>
> #include <asm/poll.h>
>
> #include "internal.h"
> @@ -78,10 +79,39 @@ int debugfs_file_get(struct dentry *dentry)
> struct debugfs_fsdata *fsd;
> void *d_fsd;
>
> - d_fsd = READ_ONCE(dentry->d_fsdata);
> + rcu_read_lock();
> +retry:
> + d_fsd = rcu_dereference(dentry->d_fsdata);
> if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
> + /*
> + * Paired with the control dependency in
> + * debugfs_file_put(): if we saw the debugfs_fsdata
> + * instance "restored" there but not the dead dentry,
> + * we'd erroneously instantiate a fresh debugfs_fsdata
> + * instance below.
> + */
> + smp_rmb();
> + if (d_unlinked(dentry)) {
> + rcu_read_unlock();
> + return -EIO;
> + }
> +
> fsd = d_fsd;
> + if (!refcount_inc_not_zero(&fsd->active_users)) {
> + /*
> + * A concurrent debugfs_file_put() dropped the
> + * count to zero and is about to free the
> + * debugfs_fsdata. Help out resetting the
> + * ->d_fsdata and retry.
> + */
> + d_fsd = (void *)((unsigned long)fsd->real_fops |
> + DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);

This is an infrequent race, I hope? If on the other hand there is
a possibility of this branch being taken a huge number of times in
one call, it would be good to exit the RCU read-side critical section
before retrying.

> + goto retry;
> + }
> + rcu_read_unlock();
> } else {
> + rcu_read_unlock();
> fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
> if (!fsd)
> return -ENOMEM;
> @@ -91,25 +121,28 @@ int debugfs_file_get(struct dentry *dentry)
> refcount_set(&fsd->active_users, 1);
> init_completion(&fsd->active_users_drained);
> if (cmpxchg(&dentry->d_fsdata, d_fsd, fsd) != d_fsd) {
> + /*
> + * Another debugfs_file_get() has installed a
> + * debugfs_fsdata instance concurrently.
> + * Free ours and retry to grab a reference on
> + * the installed one.
> + */
> kfree(fsd);
> - fsd = READ_ONCE(dentry->d_fsdata);
> + rcu_read_lock();
> + goto retry;

And given this code path, why not put the retry: label before the
initial rcu_read_lock()? Same number of lines of code, rcu_read_lock()
and rcu_read_unlock() are very lightweight, the extra executions should
be rare, and you might be avoiding a grace-period-starvation problem.

> + }
> + /*
> + * In case of a successful cmpxchg() above, this check is
> + * strictly necessary and must follow it, see the comment in
> + * __debugfs_remove_file().
> + */
> + if (d_unlinked(dentry)) {
> + if (refcount_dec_and_test(&fsd->active_users))
> + complete(&fsd->active_users_drained);
> + return -EIO;
> }
> }
>
> - /*
> - * In case of a successful cmpxchg() above, this check is
> - * strictly necessary and must follow it, see the comment in
> - * __debugfs_remove_file().
> - * OTOH, if the cmpxchg() hasn't been executed or wasn't
> - * successful, this serves the purpose of not starving
> - * removers.
> - */
> - if (d_unlinked(dentry))
> - return -EIO;
> -
> - if (!refcount_inc_not_zero(&fsd->active_users))
> - return -EIO;
> -
> return 0;
> }
> EXPORT_SYMBOL_GPL(debugfs_file_get);
> @@ -126,9 +159,29 @@ EXPORT_SYMBOL_GPL(debugfs_file_get);
> void debugfs_file_put(struct dentry *dentry)
> {
> struct debugfs_fsdata *fsd = READ_ONCE(dentry->d_fsdata);
> + void *d_fsd;
>
> - if (refcount_dec_and_test(&fsd->active_users))
> - complete(&fsd->active_users_drained);
> + if (refcount_dec_and_test(&fsd->active_users)) {
> + d_fsd = (void *)((unsigned long)fsd->real_fops |
> + DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);
> + /* Paired with smp_mb() in __debugfs_remove_file(). */
> + smp_mb();
> + if (d_unlinked(dentry)) {
> + /*
> + * We have a control dependency paired with the
> + * smp_rmb() in debugfs_file_get() here.
> + *
> + * Restore the debugfs_fsdata instance into
> + * ->d_fsdata s.t. ->d_release() can free
> + * it.
> + */
> + WRITE_ONCE(dentry->d_fsdata, fsd);
> + complete(&fsd->active_users_drained);
> + } else {
> + kfree_rcu(fsd, rcu_head);
> + }
> + }
> }
> EXPORT_SYMBOL_GPL(debugfs_file_put);
>
> @@ -221,9 +274,20 @@ static unsigned int full_proxy_poll(struct file *filp,
> static int full_proxy_release(struct inode *inode, struct file *filp)
> {
> const struct dentry *dentry = F_DENTRY(filp);
> - const struct file_operations *real_fops = debugfs_real_fops(filp);
> const struct file_operations *proxy_fops = filp->f_op;
> int r = 0;
> + void *d_fsd;
> + const struct file_operations *real_fops;
> +
> + rcu_read_lock();
> + d_fsd = rcu_dereference(F_DENTRY(filp)->d_fsdata);
> + if ((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT) {
> + real_fops = (void *)((unsigned long)d_fsd &
> + ~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + } else {
> + real_fops = ((struct debugfs_fsdata *)d_fsd)->real_fops;
> + }
> + rcu_read_unlock();
>
> /*
> * We must not protect this against removal races here: the
> diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
> index 2360c17ec00a..bacb4d6bf178 100644
> --- a/fs/debugfs/inode.c
> +++ b/fs/debugfs/inode.c
> @@ -27,6 +27,7 @@
> #include <linux/parser.h>
> #include <linux/magic.h>
> #include <linux/slab.h>
> +#include <linux/rcupdate.h>
>
> #include "internal.h"
>
> @@ -636,13 +637,16 @@ static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)
> * cmpxchg() in debugfs_file_get(): either
> * debugfs_file_get() must see a dead dentry or we must see a
> * debugfs_fsdata instance at ->d_fsdata here (or both).
> + *
> + * Also paired with the smp_mb() in debugfs_file_put(): if we
> + * see a debugfs_fsdata instance here, then debugfs_file_put()
> + * must see a dead dentry.
> */
> smp_mb();
> fsd = READ_ONCE(dentry->d_fsdata);
> if ((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)
> return;
> - if (!refcount_dec_and_test(&fsd->active_users))
> - wait_for_completion(&fsd->active_users_drained);
> + wait_for_completion(&fsd->active_users_drained);
> }
>
> static int __debugfs_remove(struct dentry *dentry, struct dentry *parent)
> diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
> index cb1e8139c398..0445bd7d11f2 100644
> --- a/fs/debugfs/internal.h
> +++ b/fs/debugfs/internal.h
> @@ -23,6 +23,7 @@ struct debugfs_fsdata {
> const struct file_operations *real_fops;
> refcount_t active_users;
> struct completion active_users_drained;
> + struct rcu_head rcu_head;
> };
>
> /*
> --
> 2.12.2
>

2017-04-18 02:25:55

by kernel test robot

[permalink] [raw]
Subject: [lkp-robot] [debugfs] f3e7155d08: BUG:unable_to_handle_kernel


FYI, we noticed the following commit:

commit: f3e7155d085591ab58f0993ce633fea58c082b35 ("debugfs: implement per-file removal protection")
url: https://github.com/0day-ci/linux/commits/Nicolai-Stange/debugfs-per-file-removal-protection/20170416-175841


in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu kvm64,+ssse3 -smp 2 -m 8G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+------------------------------------------+------------+------------+
| | a2a6a4384e | f3e7155d08 |
+------------------------------------------+------------+------------+
| boot_successes | 4 | 2 |
| boot_failures | 2 | 4 |
| BUG:kernel_hang_in_test_stage | 2 | |
| BUG:unable_to_handle_kernel | 0 | 4 |
| Oops:#[##] | 0 | 4 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 4 |
+------------------------------------------+------------+------------+



[ 45.772683] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 45.772697] IP: __debugfs_remove+0x5c/0xc0
[ 45.772698] PGD 1fabb8067
[ 45.772699] PUD 1fa889067
[ 45.772700] PMD 0
[ 45.772700]
[ 45.772702] Oops: 0002 [#1] SMP
[ 45.772704] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver sr_mod cdrom sg ata_generic pata_acpi ppdev snd_pcm snd_timer snd soundcore pcspkr ata_piix parport_pc serio_raw i2c_piix4 libata floppy parport acpi_cpufreq ip_tables
[ 45.772721] CPU: 1 PID: 9314 Comm: mount.nfs Not tainted 4.11.0-rc6-00202-gf3e7155 #1
[ 45.772722] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 45.772724] task: ffff8801f9170000 task.stack: ffffc90006570000
[ 45.772726] RIP: 0010:__debugfs_remove+0x5c/0xc0
[ 45.772727] RSP: 0018:ffffc90006573798 EFLAGS: 00010246
[ 45.772729] RAX: 0000000000000000 RBX: ffff8802133b20c0 RCX: ffff88023fd1c270
[ 45.772730] RDX: ffffffff823bd720 RSI: ffffffff81c99fe0 RDI: 0000000000000018
[ 45.772731] RBP: ffffc900065737b0 R08: 0000000000000000 R09: 0000000000000246
[ 45.772732] R10: ffffc900065736b8 R11: ffff8802133b2900 R12: 0000000000000000
[ 45.772733] R13: ffff8802133b23c0 R14: ffffffff823bd724 R15: ffff8802133b2160
[ 45.772735] FS: 00007f4b07362880(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
[ 45.772736] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 45.772737] CR2: 0000000000000010 CR3: 000000020acda000 CR4: 00000000000006e0
[ 45.772743] Call Trace:
[ 45.772750] debugfs_remove_recursive+0xd4/0x1e0
[ 45.772758] rpc_clnt_debugfs_unregister+0x19/0x30
[ 45.772762] rpc_client_register+0x18a/0x1c0
[ 45.772765] rpc_new_client+0x1de/0x2e0
[ 45.772768] rpc_create_xprt+0x58/0x170
[ 45.772769] rpc_create+0xea/0x1c0
[ 45.772776] nfs_create_rpc_client+0xe8/0x130
[ 45.772814] nfs4_init_client+0x7e/0x290 [nfsv4]
[ 45.772820] ? __radix_tree_replace+0x8a/0x140
[ 45.772823] ? radix_tree_iter_tag_clear+0x1c/0x20
[ 45.772827] ? __rpc_init_priority_wait_queue+0x81/0xb0
[ 45.772830] ? rpc_init_wait_queue+0x13/0x20
[ 45.772847] ? nfs4_alloc_client+0x1d2/0x1e0 [nfsv4]
[ 45.772849] nfs_get_client+0x32a/0x400
[ 45.772868] nfs4_set_client+0x86/0x100 [nfsv4]
[ 45.772884] nfs4_create_server+0x114/0x330 [nfsv4]
[ 45.772899] nfs4_remote_mount+0x2e/0x60 [nfsv4]
[ 45.772905] mount_fs+0x38/0x150
[ 45.772908] ? alloc_vfsmnt+0x1b2/0x230
[ 45.772910] vfs_kern_mount+0x67/0x140
[ 45.772925] nfs_do_root_mount+0x84/0xc0 [nfsv4]
[ 45.772940] nfs4_try_mount+0x44/0xd0 [nfsv4]
[ 45.772942] ? get_nfs_version+0x27/0x90
[ 45.772945] nfs_fs_mount+0x64a/0xd10
[ 45.772951] ? pcpu_alloc+0x2e4/0x640
[ 45.772953] ? nfs_clone_super+0x130/0x130
[ 45.772954] ? param_set_portnr+0x50/0x50
[ 45.772957] mount_fs+0x38/0x150
[ 45.772958] ? mount_fs+0x38/0x150
[ 45.772960] ? alloc_vfsmnt+0x1b2/0x230
[ 45.772962] vfs_kern_mount+0x67/0x140
[ 45.772964] do_mount+0x1bf/0xc70
[ 45.772968] ? kmem_cache_alloc_trace+0x131/0x1b0
[ 45.772970] SyS_mount+0x83/0xd0
[ 45.772975] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 45.772977] RIP: 0033:0x7f4b06a2098a
[ 45.772978] RSP: 002b:00007ffd3656cfa8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[ 45.772980] RAX: ffffffffffffffda RBX: 00000000025d6670 RCX: 00007f4b06a2098a
[ 45.772981] RDX: 00000000025d5f50 RSI: 00000000025d5f30 RDI: 00000000025d5f10
[ 45.772982] RBP: 00007ffd3656d000 R08: 00000000025d68b0 R09: 0000000000000041
[ 45.772983] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
[ 45.772983] R13: 00007ffd3656cf60 R14: 00007ffd3656cff4 R15: 0000000000000003
[ 45.772985] Code: 8b 7c 24 30 48 89 de e8 f3 28 e6 ff 48 89 df e8 3b 22 e5 ff 4c 8b 63 78 48 c7 c2 20 d7 3b 82 48 c7 c6 e0 9f c9 81 49 8d 7c 24 18 <41> c7 44 24 10 00 00 00 00 4d 8d 6c 24 10 e8 a1 b6 cc ff 49 8d
[ 45.773008] RIP: __debugfs_remove+0x5c/0xc0 RSP: ffffc90006573798
[ 45.773009] CR2: 0000000000000010
[ 45.773109] ---[ end trace 01c0751f3e244643 ]---


To reproduce:

git clone https://github.com/01org/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
Xiaolong


Attachments:
(No filename) (5.32 kB)
config-4.11.0-rc6-00202-gf3e7155 (154.37 kB)
job-script (3.99 kB)
dmesg.xz (34.80 kB)
Download all attachments

2017-04-18 09:36:48

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC PATCH 8/9] debugfs: defer debugfs_fsdata allocation to first usage

On Sun, 2017-04-16 at 11:51 +0200, Nicolai Stange wrote:
>
> +++ b/fs/debugfs/file.c
> @@ -53,6 +53,7 @@ const struct file_operations
> *debugfs_real_fops(const struct file *filp)
>  {
>   struct debugfs_fsdata *fsd = F_DENTRY(filp)->d_fsdata;
>  
> + WARN_ON((unsigned long)fsd &
> DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
>   return fsd->real_fops;

I'm not a fan of BUG_ON(), but in this case, if you have a completely
bogus pointer here, and then you return fsd->real_fops which will be
even more bogus, and *then* you call a function from within it... that
seems like a recipe for disaster.

So either you could return some valid ops (perhaps
debugfs_noop_file_operations although those don't have .name or .poll,
so it doesn't cover everything), or you can just BUG_ON() here
directly, saving the incomprehensible crash later.

johannes

>  EXPORT_SYMBOL_GPL(debugfs_real_fops);
> @@ -74,9 +75,35 @@ EXPORT_SYMBOL_GPL(debugfs_real_fops);
>   */
>  int debugfs_file_get(struct dentry *dentry)
>  {
> - struct debugfs_fsdata *fsd = dentry->d_fsdata;
> + struct debugfs_fsdata *fsd;
> + void *d_fsd;
> +
> + d_fsd = READ_ONCE(dentry->d_fsdata);
> + if (!((unsigned long)d_fsd &
> DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
> + fsd = d_fsd;
> + } else {
> + fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
> + if (!fsd)
> + return -ENOMEM;
> +
> + fsd->real_fops = (void *)((unsigned long)d_fsd &
> + ~DEBUGFS_FSDATA_IS_REAL_FOPS
> _BIT);
> + refcount_set(&fsd->active_users, 1);
> + init_completion(&fsd->active_users_drained);
> + if (cmpxchg(&dentry->d_fsdata, d_fsd, fsd) != d_fsd)
> {
> + kfree(fsd);
> + fsd = READ_ONCE(dentry->d_fsdata);
> + }
> + }
>  
> - /* Avoid starvation of removers. */
> + /*
> +  * In case of a successful cmpxchg() above, this check is
> +  * strictly necessary and must follow it, see the comment in
> +  * __debugfs_remove_file().
> +  * OTOH, if the cmpxchg() hasn't been executed or wasn't
> +  * successful, this serves the purpose of not starving
> +  * removers.
> +  */
>   if (d_unlinked(dentry))
>   return -EIO;
>  
> @@ -98,7 +125,7 @@ EXPORT_SYMBOL_GPL(debugfs_file_get);
>   */
>  void debugfs_file_put(struct dentry *dentry)
>  {
> - struct debugfs_fsdata *fsd = dentry->d_fsdata;
> + struct debugfs_fsdata *fsd = READ_ONCE(dentry->d_fsdata);
>  
>   if (refcount_dec_and_test(&fsd->active_users))
>   complete(&fsd->active_users_drained);
> @@ -109,10 +136,11 @@ static int open_proxy_open(struct inode *inode,
> struct file *filp)
>  {
>   struct dentry *dentry = F_DENTRY(filp);
>   const struct file_operations *real_fops = NULL;
> - int r = 0;
> + int r;
>  
> - if (debugfs_file_get(dentry))
> - return -ENOENT;
> + r = debugfs_file_get(dentry);
> + if (r)
> + return r == -EIO ? -ENOENT : r;
>  
>   real_fops = debugfs_real_fops(filp);
>   real_fops = fops_get(real_fops);
> @@ -233,10 +261,11 @@ static int full_proxy_open(struct inode *inode,
> struct file *filp)
>   struct dentry *dentry = F_DENTRY(filp);
>   const struct file_operations *real_fops = NULL;
>   struct file_operations *proxy_fops = NULL;
> - int r = 0;
> + int r;
>  
> - if (debugfs_file_get(dentry))
> - return -ENOENT;
> + r = debugfs_file_get(dentry);
> + if (r)
> + return r == -EIO ? -ENOENT : r;
>  
>   real_fops = debugfs_real_fops(filp);
>   real_fops = fops_get(real_fops);
> diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
> index 5550f11d60bd..2360c17ec00a 100644
> --- a/fs/debugfs/inode.c
> +++ b/fs/debugfs/inode.c
> @@ -184,7 +184,10 @@ static const struct super_operations
> debugfs_super_operations = {
>  
>  static void debugfs_release_dentry(struct dentry *dentry)
>  {
> - kfree(dentry->d_fsdata);
> + void *fsd = dentry->d_fsdata;
> +
> + if (!((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT))
> + kfree(dentry->d_fsdata);
>  }
>  
>  static struct vfsmount *debugfs_automount(struct path *path)
> @@ -346,35 +349,25 @@ static struct dentry
> *__debugfs_create_file(const char *name, umode_t mode,
>  {
>   struct dentry *dentry;
>   struct inode *inode;
> - struct debugfs_fsdata *fsd;
> -
> - fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
> - if (!fsd)
> - return NULL;
>  
>   if (!(mode & S_IFMT))
>   mode |= S_IFREG;
>   BUG_ON(!S_ISREG(mode));
>   dentry = start_creating(name, parent);
>  
> - if (IS_ERR(dentry)) {
> - kfree(fsd);
> + if (IS_ERR(dentry))
>   return NULL;
> - }
>  
>   inode = debugfs_get_inode(dentry->d_sb);
> - if (unlikely(!inode)) {
> - kfree(fsd);
> + if (unlikely(!inode))
>   return failed_creating(dentry);
> - }
>  
>   inode->i_mode = mode;
>   inode->i_private = data;
>  
>   inode->i_fop = proxy_fops;
> - fsd->real_fops = real_fops;
> - refcount_set(&fsd->active_users, 1);
> - dentry->d_fsdata = fsd;
> + dentry->d_fsdata = (void *)((unsigned long)real_fops |
> + DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
>  
>   d_instantiate(dentry, inode);
>   fsnotify_create(d_inode(dentry->d_parent), dentry);
> @@ -637,8 +630,17 @@ static void __debugfs_remove_file(struct dentry
> *dentry, struct dentry *parent)
>  
>   simple_unlink(d_inode(parent), dentry);
>   d_delete(dentry);
> - fsd = dentry->d_fsdata;
> - init_completion(&fsd->active_users_drained);
> +
> + /*
> +  * Paired with the closing smp_mb() implied by a successful
> +  * cmpxchg() in debugfs_file_get(): either
> +  * debugfs_file_get() must see a dead dentry or we must see
> a
> +  * debugfs_fsdata instance at ->d_fsdata here (or both).
> +  */
> + smp_mb();
> + fsd = READ_ONCE(dentry->d_fsdata);
> + if ((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)
> + return;
>   if (!refcount_dec_and_test(&fsd->active_users))
>   wait_for_completion(&fsd->active_users_drained);
>  }
> diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
> index 0eea99432840..cb1e8139c398 100644
> --- a/fs/debugfs/internal.h
> +++ b/fs/debugfs/internal.h
> @@ -25,4 +25,12 @@ struct debugfs_fsdata {
>   struct completion active_users_drained;
>  };
>  
> +/*
> + * A dentry's ->d_fsdata either points to the real fops or to a
> + * dynamically allocated debugfs_fsdata instance.
> + * In order to distinguish between these two cases, a real fops
> + * pointer gets its lowest bit set.
> + */
> +#define DEBUGFS_FSDATA_IS_REAL_FOPS_BIT BIT(0)
> +
>  #endif /* _DEBUGFS_INTERNAL_H_ */

2017-04-18 09:39:39

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Mon, 2017-04-17 at 09:01 -0700, Paul E. McKenney wrote:

> If you have not already done so, please run this with debug enabled,
> especially CONFIG_PROVE_LOCKING=y (which implies CONFIG_PROVE_RCU=y).
> This is important because there are configurations for which the
> deadlocks you saw with SRCU turn into silent failure, including
> memory corruption.
> CONFIG_PROVE_RCU=y will catch many of those situations.

Can you elaborate on that? I think we may have had CONFIG_PROVE_RCU
enabled in the builds where we saw the problem, but I'm not sure.

Can you say which configurations you're thinking of? And perhaps what
kind of corruption you're thinking of also? I'm having a hard time
imagining any corruption that should happen?

Nicolai probably never even ran into this problem, though it should be
easy to reproduce.

johannes

2017-04-18 13:31:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Tue, Apr 18, 2017 at 11:39:27AM +0200, Johannes Berg wrote:
> On Mon, 2017-04-17 at 09:01 -0700, Paul E. McKenney wrote:
>
> > If you have not already done so, please run this with debug enabled,
> > especially CONFIG_PROVE_LOCKING=y (which implies CONFIG_PROVE_RCU=y).
> > This is important because there are configurations for which the
> > deadlocks you saw with SRCU turn into silent failure, including
> > memory corruption.
> > CONFIG_PROVE_RCU=y will catch many of those situations.
>
> Can you elaborate on that? I think we may have had CONFIG_PROVE_RCU
> enabled in the builds where we saw the problem, but I'm not sure.

CONFIG_PROVE_RCU=y will reliably catch things like this:

1. rcu_read_lock();
synchronize_rcu();
rcu_read_unlock();

With CONFIG_PROVE_RCU=n and CONFIG_PREEMPT=n, this will result in
too-short grace periods, which can free things out from under the
read-side critical section, which in turn can result in arbitrary
memory corruption. You might not even get a "scheduling while
atomic", though CONFIG_PREEMPT_COUNT=y will produce this message.

With CONFIG_PREEMPT=y, on the other hand, this should
deadlock in a manner similar to the earlier SRCU deadlocks
seen in debugfs.

2. rcu_read_lock();
schedule_timeout_interruptible(HZ);
rcu_read_unlock();

With CONFIG_PROVE_RCU=y and CONFIG_PREEMPT=y, this will just
work, more or less. Until someone runs with CONFIG_PREEMPT=n,
which will produce "scheduling while atomic". (I have a
fix for this queued for 4.13, FWIW, so that in the future
CONFIG_PROVE_RCU=y and CONFIG_PREEMPT=y will complain about
this. But for now, silent bug.)

There are more, but this should get you the flavor of the types
of bugs CONFIG_PROVE_RCU=y can locate for you.

> Can you say which configurations you're thinking of? And perhaps what
> kind of corruption you're thinking of also? I'm having a hard time
> imagining any corruption that should happen?

#1 is the silent corruption case given CONFIG_PROVE_RCU=n,
CONFIG_PREEMPT=n, and CONFIG_PREEMPT_COUNT=n.

> Nicolai probably never even ran into this problem, though it should be
> easy to reproduce.

I am just worried that the situation resulting in the earlier SRCU
deadlocks might be hiding behind CONFIG_PROVE_RCU=n, CONFIG_PREEMPT=n,
and CONFIG_PREEMPT_COUNT=n. Or some other bug hiding behind some
other set of Kconfig options.

Thanx, Paul

2017-04-18 13:40:38

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Tue, 2017-04-18 at 06:31 -0700, Paul E. McKenney wrote:
> On Tue, Apr 18, 2017 at 11:39:27AM +0200, Johannes Berg wrote:
> > On Mon, 2017-04-17 at 09:01 -0700, Paul E. McKenney wrote:
> >
> > > If you have not already done so, please run this with debug
> > > enabled,
> > > especially CONFIG_PROVE_LOCKING=y (which implies
> > > CONFIG_PROVE_RCU=y).
> > > This is important because there are configurations for which the
> > > deadlocks you saw with SRCU turn into silent failure, including
> > > memory corruption.
> > > CONFIG_PROVE_RCU=y will catch many of those situations.
> >
> > Can you elaborate on that? I think we may have had CONFIG_PROVE_RCU
> > enabled in the builds where we saw the problem, but I'm not sure.
>
> CONFIG_PROVE_RCU=y will reliably catch things like this:
>
> 1. rcu_read_lock();
> synchronize_rcu();
> rcu_read_unlock();

Ok, that's not something that happens here either.

> 2. rcu_read_lock();
> schedule_timeout_interruptible(HZ);
> rcu_read_unlock();

Neither is this happening.

> There are more, but this should get you the flavor of the types
> of bugs CONFIG_PROVE_RCU=y can locate for you.

Makes sense. However, the issue at hand is what we (you and I)
discussed earlier wrt. lockdep -- from SRCU's point of view everything
is actually OK, except that the one thread is waiting for something and
we can never finish the grace period, and thus synchronize_srcu() will
never return. But there's no real SRCU bug here.

> > Nicolai probably never even ran into this problem, though it should
> > be easy to reproduce.
>
> I am just worried that the situation resulting in the earlier SRCU
> deadlocks might be hiding behind CONFIG_PROVE_RCU=n,
> CONFIG_PREEMPT=n, and CONFIG_PREEMPT_COUNT=n.  Or some other bug
> hiding behind some other set of Kconfig options.

There's no SRCU deadlock though. I know exactly why it happens, in my
case, which is the following:

Thread 1
userspace: read(debugfs_file_1)
srcu_read_lock(&debugfs_srcu); // in debugfs bowels
wait_event_interruptible(...); // in my driver's debugfs read method

Thread 2:
debugfs_remove(debugfs_file_2);
srcu_synchronize(&debugfs_srcu); // in debugfs bowels


This is the live-lock. The deadlock is something I posited but never
ran into:

CPU 1 CPU 2
srcu_read_lock(&debugfs_srcu);
rtnl_lock();
rtnl_lock();
srcu_synchronize(&debugfs_srcu);

Again, no (S)RCU abuse here, just an ABBA deadlock.

johannes

2017-04-18 15:17:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Tue, Apr 18, 2017 at 03:40:32PM +0200, Johannes Berg wrote:
> On Tue, 2017-04-18 at 06:31 -0700, Paul E. McKenney wrote:
> > On Tue, Apr 18, 2017 at 11:39:27AM +0200, Johannes Berg wrote:
> > > On Mon, 2017-04-17 at 09:01 -0700, Paul E. McKenney wrote:
> > >
> > > > If you have not already done so, please run this with debug
> > > > enabled,
> > > > especially CONFIG_PROVE_LOCKING=y (which implies
> > > > CONFIG_PROVE_RCU=y).
> > > > This is important because there are configurations for which the
> > > > deadlocks you saw with SRCU turn into silent failure, including
> > > > memory corruption.
> > > > CONFIG_PROVE_RCU=y will catch many of those situations.
> > >
> > > Can you elaborate on that? I think we may have had CONFIG_PROVE_RCU
> > > enabled in the builds where we saw the problem, but I'm not sure.
> >
> > CONFIG_PROVE_RCU=y will reliably catch things like this:
> >
> > 1. rcu_read_lock();
> > synchronize_rcu();
> > rcu_read_unlock();
>
> Ok, that's not something that happens here either.
>
> > 2. rcu_read_lock();
> > schedule_timeout_interruptible(HZ);
> > rcu_read_unlock();
>
> Neither is this happening.
>
> > There are more, but this should get you the flavor of the types
> > of bugs CONFIG_PROVE_RCU=y can locate for you.
>
> Makes sense. However, the issue at hand is what we (you and I)
> discussed earlier wrt. lockdep -- from SRCU's point of view everything
> is actually OK, except that the one thread is waiting for something and
> we can never finish the grace period, and thus synchronize_srcu() will
> never return. But there's no real SRCU bug here.
>
> > > Nicolai probably never even ran into this problem, though it should
> > > be easy to reproduce.
> >
> > I am just worried that the situation resulting in the earlier SRCU
> > deadlocks might be hiding behind CONFIG_PROVE_RCU=n,
> > CONFIG_PREEMPT=n, and CONFIG_PREEMPT_COUNT=n.??Or some other bug
> > hiding behind some other set of Kconfig options.
>
> There's no SRCU deadlock though. I know exactly why it happens, in my
> case, which is the following:
>
> Thread 1
> userspace: read(debugfs_file_1)
> srcu_read_lock(&debugfs_srcu); // in debugfs bowels
> wait_event_interruptible(...); // in my driver's debugfs read method
>
> Thread 2:
> debugfs_remove(debugfs_file_2);
> srcu_synchronize(&debugfs_srcu); // in debugfs bowels
>
>
> This is the live-lock. The deadlock is something I posited but never
> ran into:
>
> CPU 1 CPU 2
> srcu_read_lock(&debugfs_srcu);
> rtnl_lock();
> rtnl_lock();
> srcu_synchronize(&debugfs_srcu);
>
> Again, no (S)RCU abuse here, just an ABBA deadlock.

OK, please accept my apologies for failing to follow the thread.

I nevertheless reiterate my advice to run at least some tests with
CONFIG_PROVE_RCU=y. And yes, it would be good to upgrade lockdep
to find the above theoretical deadlock.

Thanx, Paul

2017-04-18 15:20:44

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Tue, 2017-04-18 at 08:17 -0700, Paul E. McKenney wrote:

> > Again, no (S)RCU abuse here, just an ABBA deadlock.
>
> OK, please accept my apologies for failing to follow the thread.

No worries - just wanted to clarify this in case I was missing
something.

> I nevertheless reiterate my advice to run at least some tests with
> CONFIG_PROVE_RCU=y.  And yes, it would be good to upgrade lockdep
> to find the above theoretical deadlock.

Right. It won't matter for debugfs after this patchset, but it would
indeed be nice - I failed in my attempt though :)

johannes

2017-04-18 17:19:43

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Tue, Apr 18, 2017 at 05:20:38PM +0200, Johannes Berg wrote:
> On Tue, 2017-04-18 at 08:17 -0700, Paul E. McKenney wrote:
>
> > > Again, no (S)RCU abuse here, just an ABBA deadlock.
> >
> > OK, please accept my apologies for failing to follow the thread.
>
> No worries - just wanted to clarify this in case I was missing
> something.
>
> > I nevertheless reiterate my advice to run at least some tests with
> > CONFIG_PROVE_RCU=y.??And yes, it would be good to upgrade lockdep
> > to find the above theoretical deadlock.
>
> Right. It won't matter for debugfs after this patchset, but it would
> indeed be nice - I failed in my attempt though :)

Hey, I was hoping! ;-)

Thanx, Paul

2017-04-23 18:37:28

by Nicolai Stange

[permalink] [raw]
Subject: Re: [lkp-robot] [debugfs] f3e7155d08: BUG:unable_to_handle_kernel

Hi Xiaolong,

I'm encountering some difficulties running the reproducer, see below.
Any help is very welcome!


On Tue, Apr 18 2017, kernel test robot wrote:

> [ 45.772683] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
<snip>
> [ 45.772697] IP: __debugfs_remove+0x5c/0xc0
> [ 45.772743] Call Trace:
> [ 45.772750] debugfs_remove_recursive+0xd4/0x1e0
> [ 45.772758] rpc_clnt_debugfs_unregister+0x19/0x30
> [ 45.772762] rpc_client_register+0x18a/0x1c0
> [ 45.772765] rpc_new_client+0x1de/0x2e0
> [ 45.772768] rpc_create_xprt+0x58/0x170
> [ 45.772769] rpc_create+0xea/0x1c0
> [ 45.772776] nfs_create_rpc_client+0xe8/0x130
> [ 45.772814] nfs4_init_client+0x7e/0x290 [nfsv4]
> [ 45.772820] ? __radix_tree_replace+0x8a/0x140
> [ 45.772823] ? radix_tree_iter_tag_clear+0x1c/0x20
> [ 45.772827] ? __rpc_init_priority_wait_queue+0x81/0xb0
> [ 45.772830] ? rpc_init_wait_queue+0x13/0x20
> [ 45.772847] ? nfs4_alloc_client+0x1d2/0x1e0 [nfsv4]
<snip>
> [ 45.772985] Code: 8b 7c 24 30 48 89 de e8 f3 28 e6 ff 48 89 df e8 3b
> 22 e5 ff 4c 8b 63 78 48 c7 c2 20 d7 3b 82 48 c7 c6 e0 9f c9 81 49 8d
> 7c 24 18 <41> c7 44 24 10 00 00 00 00 4d 8d 6c 24 10 e8 a1 b6 cc ff 49
> 8d

Ok, that's

41 c7 44 24 10 00 00 movl $0x0,0x10(%r12)

which is probably the init_completion() in __debugfs_remove_file():

fsd = dentry->d_fsdata;
init_completion(&fsd->active_users_drained);

This would mean that fsd == NULL and this can happen only if the dentry
in question isn't a regular file but a symlink or whatever. So, an
additional d_is_reg() is needed here. I'll fix this in the next
iteration once I got the reproducer working.


> To reproduce:
>
> git clone https://github.com/01org/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
>

This gives

make: Entering directory '/home/nic/lkp-tests/bin/event'
gcc -c -o wakeup.o wakeup.c
gcc -o wakeup wakeup.o
rm -f wakeup.o
strip wakeup
make: Leaving directory '/home/nic/lkp-tests/bin/event'
cpio: root:lkp: invalid group
cpio: root:lkp: invalid group
cpio: root:lkp: invalid group
gzip: /home/nic/.lkp/cache/lkp-x86_64.cpio: No such file or directory
mv: cannot stat ‘/home/nic/.lkp/cache/lkp-x86_64.cpio.gz’: No such file or directory
mv: cannot stat ‘/home/nic/.lkp/cache/lkp-x86_64.cgz’: No such file or directory
result_root: /home/nic/.lkp//result/boot/1/vm-lkp-nex04-8G/debian-x86_64-2016-08-31.cgz/x86_64-rhel-7.2/gcc-6/f3e7155d085591ab58f0993ce633fea58c082b35/3
downloading initrds ...
/usr/bin/wget -q --local-encoding=UTF-8 --retry-connrefused --waitretry 1000 --tries 1000 https://github.com/0day-ci/lkp-qemu/raw/master/osimage/debian/debian-x86_64-2016-08-31.cgz -N -P /home/nic/.lkp/cache/osimage/debian
/usr/bin/wget -q --local-encoding=UTF-8 --retry-connrefused --waitretry 1000 --tries 1000 https://github.com/0day-ci/lkp-qemu/raw/master/osimage/deps/debian-x86_64-2016-08-31.cgz/lkp_2017-04-01.cgz -N -P /home/nic/.lkp/cache/osimage/deps/debian-x86_64-2016-08-31.cgz
Failed to download osimage/deps/debian-x86_64-2016-08-31.cgz/lkp_2017-04-01.cgz

Manual download of that very last lkp_2017-04-01.cgz file results in a 404
error. Please let me know if you need more details.


Thank you!

Nicolai

2017-04-24 06:38:12

by kernel test robot

[permalink] [raw]
Subject: Re: [lkp-robot] [debugfs] f3e7155d08: BUG:unable_to_handle_kernel

On 04/23, Nicolai Stange wrote:
>Hi Xiaolong,
>
>I'm encountering some difficulties running the reproducer, see below.
>Any help is very welcome!
>

Thanks for watching the report and trying the reproducer.

>
>On Tue, Apr 18 2017, kernel test robot wrote:
>
>> [ 45.772683] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
><snip>
>> [ 45.772697] IP: __debugfs_remove+0x5c/0xc0
>> [ 45.772743] Call Trace:
>> [ 45.772750] debugfs_remove_recursive+0xd4/0x1e0
>> [ 45.772758] rpc_clnt_debugfs_unregister+0x19/0x30
>> [ 45.772762] rpc_client_register+0x18a/0x1c0
>> [ 45.772765] rpc_new_client+0x1de/0x2e0
>> [ 45.772768] rpc_create_xprt+0x58/0x170
>> [ 45.772769] rpc_create+0xea/0x1c0
>> [ 45.772776] nfs_create_rpc_client+0xe8/0x130
>> [ 45.772814] nfs4_init_client+0x7e/0x290 [nfsv4]
>> [ 45.772820] ? __radix_tree_replace+0x8a/0x140
>> [ 45.772823] ? radix_tree_iter_tag_clear+0x1c/0x20
>> [ 45.772827] ? __rpc_init_priority_wait_queue+0x81/0xb0
>> [ 45.772830] ? rpc_init_wait_queue+0x13/0x20
>> [ 45.772847] ? nfs4_alloc_client+0x1d2/0x1e0 [nfsv4]
><snip>
>> [ 45.772985] Code: 8b 7c 24 30 48 89 de e8 f3 28 e6 ff 48 89 df e8 3b
>> 22 e5 ff 4c 8b 63 78 48 c7 c2 20 d7 3b 82 48 c7 c6 e0 9f c9 81 49 8d
>> 7c 24 18 <41> c7 44 24 10 00 00 00 00 4d 8d 6c 24 10 e8 a1 b6 cc ff 49
>> 8d
>
>Ok, that's
>
> 41 c7 44 24 10 00 00 movl $0x0,0x10(%r12)
>
>which is probably the init_completion() in __debugfs_remove_file():
>
> fsd = dentry->d_fsdata;
> init_completion(&fsd->active_users_drained);
>
>This would mean that fsd == NULL and this can happen only if the dentry
>in question isn't a regular file but a symlink or whatever. So, an
>additional d_is_reg() is needed here. I'll fix this in the next
>iteration once I got the reproducer working.
>
>
>> To reproduce:
>>
>> git clone https://github.com/01org/lkp-tests.git
>> cd lkp-tests
>> bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
>>
>
>This gives
>
> make: Entering directory '/home/nic/lkp-tests/bin/event'
> gcc -c -o wakeup.o wakeup.c
> gcc -o wakeup wakeup.o
> rm -f wakeup.o
> strip wakeup
> make: Leaving directory '/home/nic/lkp-tests/bin/event'
> cpio: root:lkp: invalid group
> cpio: root:lkp: invalid group
> cpio: root:lkp: invalid group
> gzip: /home/nic/.lkp/cache/lkp-x86_64.cpio: No such file or directory
> mv: cannot stat ‘/home/nic/.lkp/cache/lkp-x86_64.cpio.gz’: No such file or directory
> mv: cannot stat ‘/home/nic/.lkp/cache/lkp-x86_64.cgz’: No such file or directory
> result_root: /home/nic/.lkp//result/boot/1/vm-lkp-nex04-8G/debian-x86_64-2016-08-31.cgz/x86_64-rhel-7.2/gcc-6/f3e7155d085591ab58f0993ce633fea58c082b35/3
> downloading initrds ...
> /usr/bin/wget -q --local-encoding=UTF-8 --retry-connrefused --waitretry 1000 --tries 1000 https://github.com/0day-ci/lkp-qemu/raw/master/osimage/debian/debian-x86_64-2016-08-31.cgz -N -P /home/nic/.lkp/cache/osimage/debian
> /usr/bin/wget -q --local-encoding=UTF-8 --retry-connrefused --waitretry 1000 --tries 1000 https://github.com/0day-ci/lkp-qemu/raw/master/osimage/deps/debian-x86_64-2016-08-31.cgz/lkp_2017-04-01.cgz -N -P /home/nic/.lkp/cache/osimage/deps/debian-x86_64-2016-08-31.cgz
> Failed to download osimage/deps/debian-x86_64-2016-08-31.cgz/lkp_2017-04-01.cgz
>
>Manual download of that very last lkp_2017-04-01.cgz file results in a 404
>error. Please let me know if you need more details.

It's most likely we haven't upload the lkp_2017-04-01.cgz file to github.
I'll check it and get back to you later.

Thanks,
Xiaolong
>
>
>Thank you!
>
>Nicolai