2017-04-07 08:52:48

by Zefan Li

[permalink] [raw]
Subject: [PATCH] cgroup: avoid attaching a cgroup root to two different superblocks

Run this:

touch file0
for ((; ;))
{
mount -t cpuset xxx file0
}

And this concurrently:

touch file1
for ((; ;))
{
mount -t cpuset xxx file1
}

We'll trigger a warning like this:

------------[ cut here ]------------
WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
percpu_ref_kill_and_confirm called more than once on css_release!
CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
Call Trace:
dump_stack+0x63/0x84
__warn+0xd1/0xf0
warn_slowpath_fmt+0x5f/0x80
percpu_ref_kill_and_confirm+0x92/0xb0
cgroup_kill_sb+0x95/0xb0
deactivate_locked_super+0x43/0x70
deactivate_super+0x46/0x60
...
---[ end trace a79f61c2a2633700 ]---

Here's a race:

Thread A Thread B

cgroup1_mount()
# alloc a new cgroup root
cgroup_setup_root()
cgroup1_mount()
# no sb yet, returns NULL
kernfs_pin_sb()

# but succeeds in getting the refcnt,
# so re-use cgroup root
percpu_ref_tryget_live()
# alloc sb with cgroup root
cgroup_do_mount()

cgroup_kill_sb()
# alloc another sb with same root
cgroup_do_mount()

cgroup_kill_sb()

We end up using the same cgroup root for two different superblocks,
so percpu_ref_kill() will be called twice on the same root when the
two superblocks are destroyed.

We should fix to make sure the superblock pinning is really successful.

Cc: [email protected] # 3.16+
Reported-by: Dmitry Vyukov <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
---
kernel/cgroup/cgroup-v1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 1dc22f6..12e19f0 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1146,7 +1146,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* path is super cold. Let's just sleep a bit and retry.
*/
pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
- if (IS_ERR(pinned_sb) ||
+ if (IS_ERR_OR_NULL(pinned_sb) ||
!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
if (!IS_ERR_OR_NULL(pinned_sb))
--
1.8.3.1


2017-04-11 00:01:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] cgroup: avoid attaching a cgroup root to two different superblocks

On Fri, Apr 07, 2017 at 04:51:55PM +0800, Zefan Li wrote:
> We end up using the same cgroup root for two different superblocks,
> so percpu_ref_kill() will be called twice on the same root when the
> two superblocks are destroyed.
>
> We should fix to make sure the superblock pinning is really successful.
>
> Cc: [email protected] # 3.16+
> Reported-by: Dmitry Vyukov <[email protected]>
> Signed-off-by: Zefan Li <[email protected]>

Applied to cgroup/for-4.11-fixes.

Thanks.

--
tejun

2017-04-14 23:33:06

by Andrei Vagin

[permalink] [raw]
Subject: Re: cgroup: avoid attaching a cgroup root to two different superblocks

On Fri, Apr 14, 2017 at 04:27:37PM -0700, Andrei Vagin wrote:
> Hello,
>
> One of our CRIU tests hangs with this patch.
>
> Steps to reproduce:
> curl -o cgroupns.c https://gist.githubusercontent.com/avagin/f87c8a8bd2a0de9afcc74976327786bc/raw/5843701ef3679f50dd2427cf57a80871082eb28c/gistfile1.txt
> gcc cgroupns.c -o cgroupns
> ./cgroupns
> ./cgroupns

I've found a trivial reproducer:
mkdir /tmp/xxx
mount -t cgroup -o none,name=zdtmtst xxx /tmp/xxx
mkdir /tmp/xxx/xxx
umount /tmp/xxx
mount -t cgroup -o none,name=zdtmtst xxx /tmp/xxx

>
> [root@fc24 ~]# strace -s 256 -fe clone,unshare,setns,mount ./cgroupns
> mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = 0
> unshare(CLONE_NEWCGROUP) = 0
> clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe5da0b89d0) = 529
> strace: Process 529 attached
> [pid 529] setns(3, CLONE_NEWCGROUP) = 0
> [pid 529] +++ exited with 0 +++
> --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=529, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
> +++ exited with 0 +++
> [root@fc24 ~]# strace -s 256 -fe clone,unshare,setns,mount ./cgroupns
> mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
> mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
> mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
> mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
> mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
> mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
> ....
>
> Thanks,
> Andrei
>
> On Fri, Apr 07, 2017 at 04:51:55PM +0800, Li Zefan wrote:
> > Run this:
> >
> > touch file0
> > for ((; ;))
> > {
> > mount -t cpuset xxx file0
> > }
> >
> > And this concurrently:
> >
> > touch file1
> > for ((; ;))
> > {
> > mount -t cpuset xxx file1
> > }
> >
> > We'll trigger a warning like this:
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
> > percpu_ref_kill_and_confirm called more than once on css_release!
> > CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
> > Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> > Call Trace:
> > dump_stack+0x63/0x84
> > __warn+0xd1/0xf0
> > warn_slowpath_fmt+0x5f/0x80
> > percpu_ref_kill_and_confirm+0x92/0xb0
> > cgroup_kill_sb+0x95/0xb0
> > deactivate_locked_super+0x43/0x70
> > deactivate_super+0x46/0x60
> > ...
> > ---[ end trace a79f61c2a2633700 ]---
> >
> > Here's a race:
> >
> > Thread A Thread B
> >
> > cgroup1_mount()
> > # alloc a new cgroup root
> > cgroup_setup_root()
> > cgroup1_mount()
> > # no sb yet, returns NULL
> > kernfs_pin_sb()
> >
> > # but succeeds in getting the refcnt,
> > # so re-use cgroup root
> > percpu_ref_tryget_live()
> > # alloc sb with cgroup root
> > cgroup_do_mount()
> >
> > cgroup_kill_sb()
> > # alloc another sb with same root
> > cgroup_do_mount()
> >
> > cgroup_kill_sb()
> >
> > We end up using the same cgroup root for two different superblocks,
> > so percpu_ref_kill() will be called twice on the same root when the
> > two superblocks are destroyed.
> >
> > We should fix to make sure the superblock pinning is really successful.
> >
> > Cc: [email protected] # 3.16+
> > Reported-by: Dmitry Vyukov <[email protected]>
> > Signed-off-by: Zefan Li <[email protected]>
> > ---
> > kernel/cgroup/cgroup-v1.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> > index 1dc22f6..12e19f0 100644
> > --- a/kernel/cgroup/cgroup-v1.c
> > +++ b/kernel/cgroup/cgroup-v1.c
> > @@ -1146,7 +1146,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> > * path is super cold. Let's just sleep a bit and retry.
> > */
> > pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
> > - if (IS_ERR(pinned_sb) ||
> > + if (IS_ERR_OR_NULL(pinned_sb) ||
> > !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
> > mutex_unlock(&cgroup_mutex);
> > if (!IS_ERR_OR_NULL(pinned_sb))

2017-04-14 23:27:56

by Andrei Vagin

[permalink] [raw]
Subject: Re: cgroup: avoid attaching a cgroup root to two different superblocks

Hello,

One of our CRIU tests hangs with this patch.

Steps to reproduce:
curl -o cgroupns.c https://gist.githubusercontent.com/avagin/f87c8a8bd2a0de9afcc74976327786bc/raw/5843701ef3679f50dd2427cf57a80871082eb28c/gistfile1.txt
gcc cgroupns.c -o cgroupns
./cgroupns
./cgroupns

[root@fc24 ~]# strace -s 256 -fe clone,unshare,setns,mount ./cgroupns
mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = 0
unshare(CLONE_NEWCGROUP) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe5da0b89d0) = 529
strace: Process 529 attached
[pid 529] setns(3, CLONE_NEWCGROUP) = 0
[pid 529] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=529, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
+++ exited with 0 +++
[root@fc24 ~]# strace -s 256 -fe clone,unshare,setns,mount ./cgroupns
mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
mount("none", "/tmp/cgroupns.test/zdtmtst", "cgroup", 0, "none,name=zdtmtst") = ? ERESTARTNOINTR (To be restarted)
....

Thanks,
Andrei

On Fri, Apr 07, 2017 at 04:51:55PM +0800, Li Zefan wrote:
> Run this:
>
> touch file0
> for ((; ;))
> {
> mount -t cpuset xxx file0
> }
>
> And this concurrently:
>
> touch file1
> for ((; ;))
> {
> mount -t cpuset xxx file1
> }
>
> We'll trigger a warning like this:
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
> percpu_ref_kill_and_confirm called more than once on css_release!
> CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
> Call Trace:
> dump_stack+0x63/0x84
> __warn+0xd1/0xf0
> warn_slowpath_fmt+0x5f/0x80
> percpu_ref_kill_and_confirm+0x92/0xb0
> cgroup_kill_sb+0x95/0xb0
> deactivate_locked_super+0x43/0x70
> deactivate_super+0x46/0x60
> ...
> ---[ end trace a79f61c2a2633700 ]---
>
> Here's a race:
>
> Thread A Thread B
>
> cgroup1_mount()
> # alloc a new cgroup root
> cgroup_setup_root()
> cgroup1_mount()
> # no sb yet, returns NULL
> kernfs_pin_sb()
>
> # but succeeds in getting the refcnt,
> # so re-use cgroup root
> percpu_ref_tryget_live()
> # alloc sb with cgroup root
> cgroup_do_mount()
>
> cgroup_kill_sb()
> # alloc another sb with same root
> cgroup_do_mount()
>
> cgroup_kill_sb()
>
> We end up using the same cgroup root for two different superblocks,
> so percpu_ref_kill() will be called twice on the same root when the
> two superblocks are destroyed.
>
> We should fix to make sure the superblock pinning is really successful.
>
> Cc: [email protected] # 3.16+
> Reported-by: Dmitry Vyukov <[email protected]>
> Signed-off-by: Zefan Li <[email protected]>
> ---
> kernel/cgroup/cgroup-v1.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 1dc22f6..12e19f0 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -1146,7 +1146,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> * path is super cold. Let's just sleep a bit and retry.
> */
> pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
> - if (IS_ERR(pinned_sb) ||
> + if (IS_ERR_OR_NULL(pinned_sb) ||
> !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
> mutex_unlock(&cgroup_mutex);
> if (!IS_ERR_OR_NULL(pinned_sb))

2017-04-16 15:24:50

by Tejun Heo

[permalink] [raw]
Subject: Re: cgroup: avoid attaching a cgroup root to two different superblocks

On Fri, Apr 14, 2017 at 04:27:38PM -0700, Andrei Vagin wrote:
> Hello,
>
> One of our CRIU tests hangs with this patch.
>
> Steps to reproduce:
> curl -o cgroupns.c https://gist.githubusercontent.com/avagin/f87c8a8bd2a0de9afcc74976327786bc/raw/5843701ef3679f50dd2427cf57a80871082eb28c/gistfile1.txt
> gcc cgroupns.c -o cgroupns
> ./cgroupns
> ./cgroupns

Reverted the patch for now. Let's try again later.

Thanks.

--
tejun

2017-04-17 10:42:27

by Zefan Li

[permalink] [raw]
Subject: Re: cgroup: avoid attaching a cgroup root to two different superblocks

On 2017/4/15 7:32, Andrei Vagin wrote:
> On Fri, Apr 14, 2017 at 04:27:37PM -0700, Andrei Vagin wrote:
>> Hello,
>>
>> One of our CRIU tests hangs with this patch.
>>
>> Steps to reproduce:
>> curl -o cgroupns.c https://gist.githubusercontent.com/avagin/f87c8a8bd2a0de9afcc74976327786bc/raw/5843701ef3679f50dd2427cf57a80871082eb28c/gistfile1.txt
>> gcc cgroupns.c -o cgroupns
>> ./cgroupns
>> ./cgroupns
>
> I've found a trivial reproducer:
> mkdir /tmp/xxx
> mount -t cgroup -o none,name=zdtmtst xxx /tmp/xxx
> mkdir /tmp/xxx/xxx
> umount /tmp/xxx
> mount -t cgroup -o none,name=zdtmtst xxx /tmp/xxx
>

Now I remember why it didn't check NULL pointer... Could you try the following fix?
It also reverts my previous patch. I would appreciate if you run the full test suit,
to make sure it won't break anything.

PS: Tejun, I found recently I can no longer receive your emails. Don't know why...

=======

[PATCH] cgruop: avoid attaching a cgroup root to two different superblocks, take 2

Commit bfb0b80db5f9 is broken. Now we try to fix the race by delaying
the initialization of cgroup root refcnt until a superblock has been
allocated.

Cc: [email protected] # 3.16+
Reported-by: Dmitry Vyukov <[email protected]>
Reported-by: Andrei Vagin <[email protected]>
Signed-off-by: Zefan Li <[email protected]>
---
kernel/cgroup/cgroup-internal.h | 2 +-
kernel/cgroup/cgroup-v1.c | 18 ++++++++++++++++--
kernel/cgroup/cgroup.c | 8 ++++----
3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 9203bfb..e470268 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -163,7 +163,7 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,

void cgroup_free_root(struct cgroup_root *root);
void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
-int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
+int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
struct cgroup_root *root, unsigned long magic,
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 12e19f0..6ca9b12 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1072,6 +1072,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
struct cgroup_subsys *ss;
struct dentry *dentry;
int i, ret;
+ bool new_root = false;

cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);

@@ -1146,7 +1147,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
* path is super cold. Let's just sleep a bit and retry.
*/
pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
- if (IS_ERR_OR_NULL(pinned_sb) ||
+ if (IS_ERR(pinned_sb) ||
!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
if (!IS_ERR_OR_NULL(pinned_sb))
@@ -1181,10 +1182,11 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
ret = -ENOMEM;
goto out_unlock;
}
+ new_root = true;

init_cgroup_root(root, &opts);

- ret = cgroup_setup_root(root, opts.subsys_mask);
+ ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
if (ret)
cgroup_free_root(root);

@@ -1201,6 +1203,18 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
CGROUP_SUPER_MAGIC, ns);

/*
+ * There's a race window after we release cgroup_mutex and before
+ * allocating a superblock. Make sure a concurrent process won't
+ * be able to re-use the root during this window by delaying the
+ * initialization of root refcnt.
+ */
+ if (new_root) {
+ mutex_lock(&cgroup_mutex);
+ percpu_ref_reinit(&root->cgrp.self.refcnt);
+ mutex_unlock(&cgroup_mutex);
+ }
+
+ /*
* If @pinned_sb, we're reusing an existing root and holding an
* extra ref on its sb. Mount is complete. Put the extra ref.
*/
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 4885132..0f98010 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1640,7 +1640,7 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
}

-int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
+int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
{
LIST_HEAD(tmp_links);
struct cgroup *root_cgrp = &root->cgrp;
@@ -1656,8 +1656,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
root_cgrp->id = ret;
root_cgrp->ancestor_ids[0] = ret;

- ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, 0,
- GFP_KERNEL);
+ ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
+ ref_flags, GFP_KERNEL);
if (ret)
goto out;

@@ -4512,7 +4512,7 @@ int __init cgroup_init(void)
hash_add(css_set_table, &init_css_set.hlist,
css_set_hash(init_css_set.subsys));

- BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
+ BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0, 0));

mutex_unlock(&cgroup_mutex);

--
1.8.3.1

2017-04-18 04:09:34

by Andrei Vagin

[permalink] [raw]
Subject: Re: cgroup: avoid attaching a cgroup root to two different superblocks

On Mon, Apr 17, 2017 at 06:41:38PM +0800, Zefan Li wrote:
> On 2017/4/15 7:32, Andrei Vagin wrote:
> > On Fri, Apr 14, 2017 at 04:27:37PM -0700, Andrei Vagin wrote:
> >> Hello,
> >>
> >> One of our CRIU tests hangs with this patch.
> >>
> >> Steps to reproduce:
> >> curl -o cgroupns.c https://gist.githubusercontent.com/avagin/f87c8a8bd2a0de9afcc74976327786bc/raw/5843701ef3679f50dd2427cf57a80871082eb28c/gistfile1.txt
> >> gcc cgroupns.c -o cgroupns
> >> ./cgroupns
> >> ./cgroupns
> >
> > I've found a trivial reproducer:
> > mkdir /tmp/xxx
> > mount -t cgroup -o none,name=zdtmtst xxx /tmp/xxx
> > mkdir /tmp/xxx/xxx
> > umount /tmp/xxx
> > mount -t cgroup -o none,name=zdtmtst xxx /tmp/xxx
> >
>
> Now I remember why it didn't check NULL pointer... Could you try the following fix?
> It also reverts my previous patch. I would appreciate if you run the full test suit,
> to make sure it won't break anything.

It works for me. Thanks.

Tested-by: Andrei Vagin <[email protected]>

>
> PS: Tejun, I found recently I can no longer receive your emails. Don't know why...
>
> =======
>
> [PATCH] cgruop: avoid attaching a cgroup root to two different superblocks, take 2
>
> Commit bfb0b80db5f9 is broken. Now we try to fix the race by delaying
> the initialization of cgroup root refcnt until a superblock has been
> allocated.
>
> Cc: [email protected] # 3.16+
> Reported-by: Dmitry Vyukov <[email protected]>
> Reported-by: Andrei Vagin <[email protected]>
> Signed-off-by: Zefan Li <[email protected]>
> ---
> kernel/cgroup/cgroup-internal.h | 2 +-
> kernel/cgroup/cgroup-v1.c | 18 ++++++++++++++++--
> kernel/cgroup/cgroup.c | 8 ++++----
> 3 files changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index 9203bfb..e470268 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -163,7 +163,7 @@ int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
>
> void cgroup_free_root(struct cgroup_root *root);
> void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
> -int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
> +int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags);
> int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
> struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
> struct cgroup_root *root, unsigned long magic,
> diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
> index 12e19f0..6ca9b12 100644
> --- a/kernel/cgroup/cgroup-v1.c
> +++ b/kernel/cgroup/cgroup-v1.c
> @@ -1072,6 +1072,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> struct cgroup_subsys *ss;
> struct dentry *dentry;
> int i, ret;
> + bool new_root = false;
>
> cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
>
> @@ -1146,7 +1147,7 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> * path is super cold. Let's just sleep a bit and retry.
> */
> pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
> - if (IS_ERR_OR_NULL(pinned_sb) ||
> + if (IS_ERR(pinned_sb) ||
> !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
> mutex_unlock(&cgroup_mutex);
> if (!IS_ERR_OR_NULL(pinned_sb))
> @@ -1181,10 +1182,11 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> ret = -ENOMEM;
> goto out_unlock;
> }
> + new_root = true;
>
> init_cgroup_root(root, &opts);
>
> - ret = cgroup_setup_root(root, opts.subsys_mask);
> + ret = cgroup_setup_root(root, opts.subsys_mask, PERCPU_REF_INIT_DEAD);
> if (ret)
> cgroup_free_root(root);
>
> @@ -1201,6 +1203,18 @@ struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
> CGROUP_SUPER_MAGIC, ns);
>
> /*
> + * There's a race window after we release cgroup_mutex and before
> + * allocating a superblock. Make sure a concurrent process won't
> + * be able to re-use the root during this window by delaying the
> + * initialization of root refcnt.
> + */
> + if (new_root) {
> + mutex_lock(&cgroup_mutex);
> + percpu_ref_reinit(&root->cgrp.self.refcnt);
> + mutex_unlock(&cgroup_mutex);
> + }
> +
> + /*
> * If @pinned_sb, we're reusing an existing root and holding an
> * extra ref on its sb. Mount is complete. Put the extra ref.
> */
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 4885132..0f98010 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -1640,7 +1640,7 @@ void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts)
> set_bit(CGRP_CPUSET_CLONE_CHILDREN, &root->cgrp.flags);
> }
>
> -int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
> +int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask, int ref_flags)
> {
> LIST_HEAD(tmp_links);
> struct cgroup *root_cgrp = &root->cgrp;
> @@ -1656,8 +1656,8 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
> root_cgrp->id = ret;
> root_cgrp->ancestor_ids[0] = ret;
>
> - ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, 0,
> - GFP_KERNEL);
> + ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release,
> + ref_flags, GFP_KERNEL);
> if (ret)
> goto out;
>
> @@ -4512,7 +4512,7 @@ int __init cgroup_init(void)
> hash_add(css_set_table, &init_css_set.hlist,
> css_set_hash(init_css_set.subsys));
>
> - BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0));
> + BUG_ON(cgroup_setup_root(&cgrp_dfl_root, 0, 0));
>
> mutex_unlock(&cgroup_mutex);
>
> --
> 1.8.3.1
>

2017-04-18 06:39:34

by Tejun Heo

[permalink] [raw]
Subject: Re: cgroup: avoid attaching a cgroup root to two different superblocks

Hello, Li.

On Mon, Apr 17, 2017 at 06:41:38PM +0800, Zefan Li wrote:
> Now I remember why it didn't check NULL pointer... Could you try the following fix?
> It also reverts my previous patch. I would appreciate if you run the full test suit,
> to make sure it won't break anything.

Can you please refresh the patch on top of cgroup/for-next which
already has the original patch reverted? Let's apply this one for
for-4.12 w/o -stable. The bug isn't too likely to trigger in the wild
and the fix is kinda subtle. If necessary, we can last ask Greg to
pick it up once we know that the patch is safe.

> PS: Tejun, I found recently I can no longer receive your emails. Don't know why...

Hah, no idea. I didn't change anything on my side - all emails are
sent through gmail.com.

Thanks.

--
tejun