by Paul E. McKenney

Subject: Re: [RFC PATCH 9/9] debugfs: free debugfs_fsdata instances

On Sun, Apr 16, 2017 at 11:51:37AM +0200, Nicolai Stange wrote:
> Currently, a dentry's debugfs_fsdata instance is allocated from
> debugfs_file_get() at first usage, i.e. at first file opening.
>
> It won't ever get freed though.
>
> Ideally, these instances would get freed after the last open file handle
> gets closed again, that is from the fops' ->release().
>
> Unfortunately, we can't hook easily into those ->release()ers of files
> created through debugfs_create_file_unsafe(), i.e. of those proxied through
> debugfs_open_proxy_file_operations rather than through the "full"
> debugfs_full_proxy_file_operations proxy.
>
> Hence, free unreferenced debugfs_fsdata instances from debugfs_file_put(),
> with the drawback of a potential allocation + deallocation per
> debugfs_file_get() + debugfs_file_put() pair, that is per fops invocation.
>
> In addition to its former role of tracking pending fops, use the
> ->active_users for reference counting on the debugfs_fsdata instance
> itself. In particular, don't keep a dummy reference to be dropped from
> __debugfs_remove_file(): a d_delete()ed dentry and thus, request for
> completion notification, is now signaled by the d_unlinked() dentry itself.
>
> Once ->active_users drops to zero (and the dentry is still intact), free
> the debugfs_fsdata instance from debugfs_file_put(). RCU protects any
> concurrent debugfs_file_get() attempts to get a hold of the instance here.
> Likewise for full_proxy_release() which lacks a call to debugfs_file_get().
>
> Note that due to non-atomic updates to the d_unlinked() + ->d_fsdata pair,
> care must be taken in order to avoid races between debugfs_file_put() and
> debugfs_file_get() as well as __debugfs_remove_file(). Rather than
> introducing a global lock, exploit the fact that there will ever be only a
> single !d_unlinked() -> d_unlinked() transition and add memory barriers
> where needed. Given the lack of proper benchmarking, that debugfs fops
> aren't performance critical and that we've already got a potential
> allocation/deallocation pair anyway, the added code complexity might be
> highly questionable though.
>
> Signed-off-by: Nicolai Stange <[email protected]>

If you have not already done so, please run this with debug enabled,
especially CONFIG_PROVE_LOCKING=y (which implies CONFIG_PROVE_RCU=y).
This is important because there are configurations for which the deadlocks
you saw with SRCU turn into silent failure, including memory corruption.
CONFIG_PROVE_RCU=y will catch many of those situations.

(And yes, kfree_rcu() doesn't have that problem, but...)

Another issue called out inline.

Thanx, Paul

> ---
> fs/debugfs/file.c | 102 ++++++++++++++++++++++++++++++++++++++++----------
> fs/debugfs/inode.c | 8 +++-
> fs/debugfs/internal.h | 1 +
> 3 files changed, 90 insertions(+), 21 deletions(-)
>
> diff --git a/fs/debugfs/file.c b/fs/debugfs/file.c
> index f4dfd7d0d625..b2cc25d44a39 100644
> --- a/fs/debugfs/file.c
> +++ b/fs/debugfs/file.c
> @@ -22,6 +22,7 @@
> #include <linux/slab.h>
> #include <linux/atomic.h>
> #include <linux/device.h>
> +#include <linux/rcupdate.h>
> #include <asm/poll.h>
>
> #include "internal.h"
> @@ -78,10 +79,39 @@ int debugfs_file_get(struct dentry *dentry)
> struct debugfs_fsdata *fsd;
> void *d_fsd;
>
> - d_fsd = READ_ONCE(dentry->d_fsdata);
> + rcu_read_lock();
> +retry:
> + d_fsd = rcu_dereference(dentry->d_fsdata);
> if (!((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)) {
> + /*
> + * Paired with the control dependency in
> + * debugfs_file_put(): if we saw the debugfs_fsdata
> + * instance "restored" there but not the dead dentry,
> + * we'd erroneously instantiate a fresh debugfs_fsdata
> + * instance below.
> + */
> + smp_rmb();
> + if (d_unlinked(dentry)) {
> + rcu_read_unlock();
> + return -EIO;
> + }
> +
> fsd = d_fsd;
> + if (!refcount_inc_not_zero(&fsd->active_users)) {
> + /*
> + * A concurrent debugfs_file_put() dropped the
> + * count to zero and is about to free the
> + * debugfs_fsdata. Help out resetting the
> + * ->d_fsdata and retry.
> + */
> + d_fsd = (void *)((unsigned long)fsd->real_fops |
> + DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);

This is an infrequent race, I hope? If on the other hand there is
a possibility of this branch being taken a huge number of times in
one call, it would be good to exit the RCU read-side critical section
before retrying.

> + goto retry;
> + }
> + rcu_read_unlock();
> } else {
> + rcu_read_unlock();
> fsd = kmalloc(sizeof(*fsd), GFP_KERNEL);
> if (!fsd)
> return -ENOMEM;
> @@ -91,25 +121,28 @@ int debugfs_file_get(struct dentry *dentry)
> refcount_set(&fsd->active_users, 1);
> init_completion(&fsd->active_users_drained);
> if (cmpxchg(&dentry->d_fsdata, d_fsd, fsd) != d_fsd) {
> + /*
> + * Another debugfs_file_get() has installed a
> + * debugfs_fsdata instance concurrently.
> + * Free ours and retry to grab a reference on
> + * the installed one.
> + */
> kfree(fsd);
> - fsd = READ_ONCE(dentry->d_fsdata);
> + rcu_read_lock();
> + goto retry;

And given this code path, why not put the retry: label before the
initial rcu_read_lock()? Same number of lines of code, rcu_read_lock()
and rcu_read_unlock() are very lightweight, the extra executions should
be rare, and you might be avoiding a grace-period-starvation problem.
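
A minimal userspace sketch of that control flow, in case it helps. It is
not the debugfs code: rcu_read_lock() and rcu_read_unlock() are replaced
by counting stubs, and file_get_sketch(), try_get_ref() and the counters
are hypothetical names made up for illustration. The only point is that
with the label placed before rcu_read_lock(), every retry leaves the
read-side critical section first, so no single section spans more than
one attempt.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins for the kernel primitives (assumption: stubs only). */
static int rcu_depth;     /* current read-side nesting depth */
static int rcu_sections;  /* read-side critical sections entered so far */
static int rcu_span;      /* attempts made inside the current section */
static int rcu_max_span;  /* largest number of attempts in any one section */

static void rcu_read_lock(void)
{
	rcu_depth++;
	rcu_sections++;
	rcu_span = 0;
}

static void rcu_read_unlock(void)
{
	assert(rcu_depth > 0);
	rcu_depth--;
}

/* Hypothetical: the first *races attempts lose a race and must be retried. */
static bool try_get_ref(int *races)
{
	rcu_span++;
	if (rcu_span > rcu_max_span)
		rcu_max_span = rcu_span;
	if (*races > 0) {
		(*races)--;
		return false;
	}
	return true;
}

static int file_get_sketch(int races)
{
retry:
	rcu_read_lock();
	if (!try_get_ref(&races)) {
		/* Leave the critical section before retrying. */
		rcu_read_unlock();
		goto retry;
	}
	rcu_read_unlock();
	return 0;
}
```

Calling file_get_sketch(3) in this model enters four short critical
sections rather than one long one, which is what keeps a burst of
retries from holding off a grace period.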

> + }
> + /*
> + * In case of a successful cmpxchg() above, this check is
> + * strictly necessary and must follow it, see the comment in
> + * __debugfs_remove_file().
> + */
> + if (d_unlinked(dentry)) {
> + if (refcount_dec_and_test(&fsd->active_users))
> + complete(&fsd->active_users_drained);
> + return -EIO;
> }
> }
>
> - /*
> - * In case of a successful cmpxchg() above, this check is
> - * strictly necessary and must follow it, see the comment in
> - * __debugfs_remove_file().
> - * OTOH, if the cmpxchg() hasn't been executed or wasn't
> - * successful, this serves the purpose of not starving
> - * removers.
> - */
> - if (d_unlinked(dentry))
> - return -EIO;
> -
> - if (!refcount_inc_not_zero(&fsd->active_users))
> - return -EIO;
> -
> return 0;
> }
> EXPORT_SYMBOL_GPL(debugfs_file_get);
> @@ -126,9 +159,29 @@ EXPORT_SYMBOL_GPL(debugfs_file_get);
> void debugfs_file_put(struct dentry *dentry)
> {
> struct debugfs_fsdata *fsd = READ_ONCE(dentry->d_fsdata);
> + void *d_fsd;
>
> - if (refcount_dec_and_test(&fsd->active_users))
> - complete(&fsd->active_users_drained);
> + if (refcount_dec_and_test(&fsd->active_users)) {
> + d_fsd = (void *)((unsigned long)fsd->real_fops |
> + DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + RCU_INIT_POINTER(dentry->d_fsdata, d_fsd);
> + /* Paired with smp_mb() in __debugfs_remove_file(). */
> + smp_mb();
> + if (d_unlinked(dentry)) {
> + /*
> + * We have a control dependency paired with the
> + * smp_rmb() in debugfs_file_get() here.
> + *
> + * Restore the debugfs_fsdata instance into
> + * ->d_fsdata s.t. ->d_release() can free
> + * it.
> + */
> + WRITE_ONCE(dentry->d_fsdata, fsd);
> + complete(&fsd->active_users_drained);
> + } else {
> + kfree_rcu(fsd, rcu_head);
> + }
> + }
> }
> EXPORT_SYMBOL_GPL(debugfs_file_put);
>
> @@ -221,9 +274,20 @@ static unsigned int full_proxy_poll(struct file *filp,
> static int full_proxy_release(struct inode *inode, struct file *filp)
> {
> const struct dentry *dentry = F_DENTRY(filp);
> - const struct file_operations *real_fops = debugfs_real_fops(filp);
> const struct file_operations *proxy_fops = filp->f_op;
> int r = 0;
> + void *d_fsd;
> + const struct file_operations *real_fops;
> +
> + rcu_read_lock();
> + d_fsd = rcu_dereference(F_DENTRY(filp)->d_fsdata);
> + if ((unsigned long)d_fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT) {
> + real_fops = (void *)((unsigned long)d_fsd &
> + ~DEBUGFS_FSDATA_IS_REAL_FOPS_BIT);
> + } else {
> + real_fops = ((struct debugfs_fsdata *)d_fsd)->real_fops;
> + }
> + rcu_read_unlock();
>
> /*
> * We must not protect this against removal races here: the
> diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
> index 2360c17ec00a..bacb4d6bf178 100644
> --- a/fs/debugfs/inode.c
> +++ b/fs/debugfs/inode.c
> @@ -27,6 +27,7 @@
> #include <linux/parser.h>
> #include <linux/magic.h>
> #include <linux/slab.h>
> +#include <linux/rcupdate.h>
>
> #include "internal.h"
>
> @@ -636,13 +637,16 @@ static void __debugfs_remove_file(struct dentry *dentry, struct dentry *parent)
> * cmpxchg() in debugfs_file_get(): either
> * debugfs_file_get() must see a dead dentry or we must see a
> * debugfs_fsdata instance at ->d_fsdata here (or both).
> + *
> + * Also paired with the smp_mb() in debugfs_file_put(): if we
> + * see a debugfs_fsdata instance here, then debugfs_file_put()
> + * must see a dead dentry.
> */
> smp_mb();
> fsd = READ_ONCE(dentry->d_fsdata);
> if ((unsigned long)fsd & DEBUGFS_FSDATA_IS_REAL_FOPS_BIT)
> return;
> - if (!refcount_dec_and_test(&fsd->active_users))
> - wait_for_completion(&fsd->active_users_drained);
> + wait_for_completion(&fsd->active_users_drained);
> }
>
> static int __debugfs_remove(struct dentry *dentry, struct dentry *parent)
> diff --git a/fs/debugfs/internal.h b/fs/debugfs/internal.h
> index cb1e8139c398..0445bd7d11f2 100644
> --- a/fs/debugfs/internal.h
> +++ b/fs/debugfs/internal.h
> @@ -23,6 +23,7 @@ struct debugfs_fsdata {
> const struct file_operations *real_fops;
> refcount_t active_users;
> struct completion active_users_drained;
> + struct rcu_head rcu_head;
> };
>
> /*
> --
> 2.12.2
>

2017-04-18 02:25:55

On 04/23, Nicolai Stange wrote:
>Hi Xiaolong,
>
>I'm encountering some difficulties running the reproducer, see below.
>Any help is very welcome!
>

Thanks for looking into the report and trying the reproducer.

>
>On Tue, Apr 18 2017, kernel test robot wrote:
>
>> [ 45.772683] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
><snip>
>> [ 45.772697] IP: __debugfs_remove+0x5c/0xc0
>> [ 45.772743] Call Trace:
>> [ 45.772750] debugfs_remove_recursive+0xd4/0x1e0
>> [ 45.772758] rpc_clnt_debugfs_unregister+0x19/0x30
>> [ 45.772762] rpc_client_register+0x18a/0x1c0
>> [ 45.772765] rpc_new_client+0x1de/0x2e0
>> [ 45.772768] rpc_create_xprt+0x58/0x170
>> [ 45.772769] rpc_create+0xea/0x1c0
>> [ 45.772776] nfs_create_rpc_client+0xe8/0x130
>> [ 45.772814] nfs4_init_client+0x7e/0x290 [nfsv4]
>> [ 45.772820] ? __radix_tree_replace+0x8a/0x140
>> [ 45.772823] ? radix_tree_iter_tag_clear+0x1c/0x20
>> [ 45.772827] ? __rpc_init_priority_wait_queue+0x81/0xb0
>> [ 45.772830] ? rpc_init_wait_queue+0x13/0x20
>> [ 45.772847] ? nfs4_alloc_client+0x1d2/0x1e0 [nfsv4]
><snip>
>> [ 45.772985] Code: 8b 7c 24 30 48 89 de e8 f3 28 e6 ff 48 89 df e8 3b
>> 22 e5 ff 4c 8b 63 78 48 c7 c2 20 d7 3b 82 48 c7 c6 e0 9f c9 81 49 8d
>> 7c 24 18 <41> c7 44 24 10 00 00 00 00 4d 8d 6c 24 10 e8 a1 b6 cc ff 49
>> 8d
>
>Ok, that's
>
> 41 c7 44 24 10 00 00 movl $0x0,0x10(%r12)
>
>which is probably the init_completion() in __debugfs_remove_file():
>
> fsd = dentry->d_fsdata;
> init_completion(&fsd->active_users_drained);
>
>This would mean that fsd == NULL and this can happen only if the dentry
>in question isn't a regular file but a symlink or whatever. So, an
>additional d_is_reg() is needed here. I'll fix this in the next
>iteration once I got the reproducer working.
>
>
>> To reproduce:
>>
>> git clone https://github.com/01org/lkp-tests.git
>> cd lkp-tests
>> bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
>>
>
>This gives
>
> make: Entering directory '/home/nic/lkp-tests/bin/event'
> gcc -c -o wakeup.o wakeup.c
> gcc -o wakeup wakeup.o
> rm -f wakeup.o
> strip wakeup
> make: Leaving directory '/home/nic/lkp-tests/bin/event'
> cpio: root:lkp: invalid group
> cpio: root:lkp: invalid group
> cpio: root:lkp: invalid group
> gzip: /home/nic/.lkp/cache/lkp-x86_64.cpio: No such file or directory
> mv: cannot stat ‘/home/nic/.lkp/cache/lkp-x86_64.cpio.gz’: No such file or directory
> mv: cannot stat ‘/home/nic/.lkp/cache/lkp-x86_64.cgz’: No such file or directory
> result_root: /home/nic/.lkp//result/boot/1/vm-lkp-nex04-8G/debian-x86_64-2016-08-31.cgz/x86_64-rhel-7.2/gcc-6/f3e7155d085591ab58f0993ce633fea58c082b35/3
> downloading initrds ...
> /usr/bin/wget -q --local-encoding=UTF-8 --retry-connrefused --waitretry 1000 --tries 1000 https://github.com/0day-ci/lkp-qemu/raw/master/osimage/debian/debian-x86_64-2016-08-31.cgz -N -P /home/nic/.lkp/cache/osimage/debian
> /usr/bin/wget -q --local-encoding=UTF-8 --retry-connrefused --waitretry 1000 --tries 1000 https://github.com/0day-ci/lkp-qemu/raw/master/osimage/deps/debian-x86_64-2016-08-31.cgz/lkp_2017-04-01.cgz -N -P /home/nic/.lkp/cache/osimage/deps/debian-x86_64-2016-08-31.cgz
> Failed to download osimage/deps/debian-x86_64-2016-08-31.cgz/lkp_2017-04-01.cgz
>
>Manual download of that very last lkp_2017-04-01.cgz file results in a 404
>error. Please let me know if you need more details.

Most likely we haven't uploaded the lkp_2017-04-01.cgz file to GitHub yet.
I'll check and get back to you later.

Thanks,
Xiaolong
>
>
>Thank you!
>
>Nicolai