# cat test.sh
#! /bin/bash
mount -t cgroup -o cpu xxx /cgroup
umount /cgroup
mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup
It's because the cgroupfs_root of the first mount was under destruction
asynchronously.
Fix this by delaying and then retrying mount for this case.
v2:
- use percpu_ref_tryget_live() rather than introducing
percpu_ref_alive(). (Tejun)
- adjust comment.
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1c65f24..ae2b382 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1648,10 +1648,12 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
struct dentry *dentry;
int ret;
+ int i, j = -1;
bool new_sb;
/*
@@ -1677,6 +1679,23 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
+ /*
+ * Destruction of cgroup root is asynchronous, so we may fail to
+ * mount a cgroupfs if it immediately follows a umount. Let's wait
+ * a little bit and retry.
+ */
+ for_each_subsys(ss, i) {
+ if (!(opts.subsys_mask & (1 << i)))
+ continue;
+ if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+ mutex_unlock(&cgroup_mutex);
+ msleep(10);
+ ret = restart_syscall
kernfs_pin_sb() tries to get a refcnt of the superblock.
This will be used by cgroupfs.
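For context, a minimal sketch of the intended calling pattern (the function name inspect_hierarchy(), the kf_root argument and the return codes are made up for illustration; the real user is cgroup_mount() in a follow-up patch, which likewise drops the pin with deactivate_super()):

	/* illustrative only -- not part of this patch */
	static int inspect_hierarchy(struct kernfs_root *kf_root)
	{
		struct super_block *sb;

		sb = kernfs_pin_sb(kf_root, NULL);
		if (IS_ERR(sb))
			return -EAGAIN;		/* sb is being freed: retry later */
		if (!sb)
			return -ENOENT;		/* no superblock on this root yet */

		/* pinned: the superblock can't be destroyed under us */
		/* ... look at the hierarchy here ... */

		deactivate_super(sb);		/* drop the s_active ref we took */
		return 0;
	}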
v2:
- make kernfs_pin_sb() return pointer to the superblock.
- drop kernfs_drop_sb().
Signed-off-by: Li Zefan <[email protected]>
---
fs/kernfs/mount.c | 27 +++++++++++++++++++++++++++
include/linux/kernfs.h | 1 +
2 files changed, 28 insertions(+)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f25a7c0..616c5c4 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -210,6 +210,33 @@ void kernfs_kill_sb(struct super_block *sb)
kernfs_put(root_kn);
}
+/**
+ * kernfs_pin_sb: try to pin the superblock associated with a kernfs_root
+ * @root: the kernfs_root in question
+ * @ns: the namespace tag
+ *
+ * Pin the superblock so it won't be destroyed in subsequent operations.
+ * Returns the pinned superblock, NULL if there's no superblock associated
+ * with this kernfs_root, or ERR_PTR(-EINVAL) if the superblock is being freed.
+ */
+struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns)
+{
+ struct kernfs_super_info *info;
+ struct super_block *sb = NULL;
+
+ mutex_lock(&kernfs_mutex);
+ list_for_each_entry(info, &root->supers, node) {
+ if (info->ns == ns) {
+ sb = info->sb;
+ if (!atomic_inc_not_zero(&info->sb->s_active))
+ sb = ERR_PTR(-EINVAL);
+ break;
+ }
+ }
+ mutex_unlock(&kernfs_mutex);
+ return sb;
+}
+
void __init kernfs_init(void)
{
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 589318b..9096296 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -287,6 +287,7 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
struct kernfs_root *root, bool *new_sb_created,
const void *ns);
void kernfs_kill_sb(struct super_block *sb);
+struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
void kernfs_init(void);
--
1.8.0.2
We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.
Run two instances of this script concurrently:
for ((; ;))
{
mount -t cgroup -o cpuacct xxx /cgroup
umount /cgroup
}
After a while, I saw two mount processes stuck retrying: they were waiting
for a subsystem to become free, but the root associated with this subsystem
never got freed.
This can happen if thread A is in the process of killing the superblock
but hasn't called percpu_ref_kill() yet, and at this time thread B is
mounting the same cgroup root, finds the root in the root list, and its
percpu_ref_tryget_live() succeeds.
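Spelled out as a timeline (this only restates the scenario above, it is not new analysis):

	/*
	 *  thread A (umount)                   thread B (mount)
	 *  -----------------                   ----------------
	 *  killing the superblock, but
	 *  percpu_ref_kill() on the root's
	 *  refcnt hasn't been called yet
	 *                                      finds the same root in the
	 *                                      root list
	 *                                      percpu_ref_tryget_live()
	 *                                      succeeds, so the dying root
	 *                                      is picked up
	 *
	 *  The root is never freed, and later mounts keep looping in the
	 *  msleep(10) + restart_syscall() retry path, waiting for the
	 *  subsystems to become free.
	 */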
To fix this, we try to increase both the refcnt of the superblock and the
percpu refcnt of the cgroup root.
v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt,
because cgroup_root may have no superblock associated with it.
- adjust/add comments.
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index ae2b382..111b7c3 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1655,6 +1655,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i, j = -1;
bool new_sb;
+ struct super_block *sb = NULL;
/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1737,14 +1738,18 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
/*
* A root's lifetime is governed by its root cgroup.
- * tryget_live failure indicate that the root is being
- * destroyed. Wait for destruction to complete so that the
- * subsystems are free. We can use wait_queue for the wait
- * but this path is super cold. Let's just sleep for a bit
- * and retry.
+ * pin_sb and tryget_live failure indicate that the root is
+ * being destroyed. Wait for destruction to complete so that
+ * the subsystems are free. We can use wait_queue for the
+ * wait but this path is super cold. Let's just sleep for
+ * a bit and retry.
*/
- if (!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+ sb = kernfs_pin_sb(root->kf_root, NULL);
+ if (IS_ERR(sb) ||
+ !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
+ if (!IS_ERR_OR_NULL(sb))
+ deactivate_super(sb);
msleep(10);
ret = restart_syscall();
goto out_free;
@@ -1796,6 +1801,17 @@ out_free:
dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
+
+ if (sb) {
+ /*
+ * On success kernfs_mount() returns with sb->s_umount held,
+ * but kernfs_mount() also increases the superblock's refcnt,
+ * so calling deactivate_super() to drop the refcnt we got when
+ * looking up cgroup root won't acquire sb->s_umount again.
+ */
+ WARN_ON(new_sb);
+ deactivate_super(sb);
+ }
return dentry;
}
--
1.8.0.2
Oh sorry the cut&paste was incomplete. Here's the complete one:
================
From: Li Zefan <[email protected]>
Date: Thu, 12 Jun 2014 09:11:00 +0800
Subject: [PATCH v2 1/3] cgroup: fix mount failure in a corner case
# cat test.sh
#! /bin/bash
mount -t cgroup -o cpu xxx /cgroup
umount /cgroup
mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup
It's because the cgroupfs_root of the first mount was under destruction
asynchronously.
Fix this by delaying and then retrying mount for this case.
v2:
- use percpu_ref_tryget_live() rather than introducing
percpu_ref_alive(). (Tejun)
- adjust comment.
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1c65f24..6826227 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1648,10 +1648,12 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
struct dentry *dentry;
int ret;
+ int i, j = -1;
bool new_sb;
/*
@@ -1677,6 +1679,25 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
+ /*
+ * Destruction of cgroup root is asynchronous, so we may fail to
+ * mount a cgroupfs if it immediately follows a umount. Let's wait
+ * a little bit and retry.
+ */
+ for_each_subsys(ss, i) {
+ if (!(opts.subsys_mask & (1 << i)) ||
+ ss->root === &cgrp_dfl_root)
+ continue;
+
+ if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+ mutex_unlock(&cgroup_mutex);
+ msleep(10);
+ ret = restart_syscall();
+ goto out_free;
+ }
+ j = i;
+ }
+
for_each_root(root) {
bool name_match = false;
@@ -1763,6 +1784,14 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
+ for_each_subsys(ss, i) {
+ if (i > j)
+ break;
+ if (!(opts.subsys_mask & (1 << i)))
+ continue;
+ cgroup_put(&ss->root->cgrp);
+ }
+
if (ret)
return ERR_PTR(ret);
--
1.8.0.2
Made a mistake again.. :(
==============
From: Li Zefan <[email protected]>
Subject: [PATCH 1/3] cgroup: fix mount failure in a corner case
# cat test.sh
#! /bin/bash
mount -t cgroup -o cpu xxx /cgroup
umount /cgroup
mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup
It's because the cgroupfs_root of the first mount was under destruction
asynchronously.
Fix this by delaying and then retrying mount for this case.
v2:
- use percpu_ref_tryget_live() rather than introducing
percpu_ref_alive(). (Tejun)
- adjust comment.
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1c65f24..b94449f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1648,10 +1648,12 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
struct dentry *dentry;
int ret;
+ int i, j = -1;
bool new_sb;
/*
@@ -1677,6 +1679,25 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
+ /*
+ * Destruction of cgroup root is asynchronous, so we may fail to
+ * mount a cgroupfs if it immediately follows a umount. Let's wait
+ * a little bit and retry.
+ */
+ for_each_subsys(ss, i) {
+ if (!(opts.subsys_mask & (1 << i)) ||
+ ss->root == &cgrp_dfl_root)
+ continue;
+
+ if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+ mutex_unlock(&cgroup_mutex);
+ msleep(10);
+ ret = restart_syscall();
+ goto out_free;
+ }
+ j = i;
+ }
+
for_each_root(root) {
bool name_match = false;
@@ -1763,6 +1784,14 @@ out_free:
kfree(opts.release_agent);
kfree(opts.name);
+ for_each_subsys(ss, i) {
+ if (i > j)
+ break;
+ if (!(opts.subsys_mask & (1 << i)))
+ continue;
+ cgroup_put(&ss->root->cgrp);
+ }
+
if (ret)
return ERR_PTR(ret);
--
1.8.0.2
On Fri, Jun 27, 2014 at 03:10:48PM +0800, Li Zefan wrote:
> kernfs_pin_sb() tries to get a refcnt of the superblock.
>
> This will be used by cgroupfs.
Greg, this is pretty much cgroup specific due to the way cgroup
dynamically manages multiple hierarchies. Can I route this through
cgroup/for-3.16-fixes w/ stable cc'd?
Thanks.
--
tejun
On Fri, Jun 27, 2014 at 11:01:41AM -0400, Tejun Heo wrote:
> On Fri, Jun 27, 2014 at 03:10:48PM +0800, Li Zefan wrote:
> > kernfs_pin_sb() tries to get a refcnt of the superblock.
> >
> > This will be used by cgroupfs.
>
> Greg, this is pretty much cgroup specific due to the way cgroup
> dynamically manages multiple hierarchies. Can I route this through
> cgroup/for-3.16-fixes w/ stable cc'd?
Please do:
Acked-by: Greg Kroah-Hartman <[email protected]>
Hello, Li.
On Fri, Jun 27, 2014 at 05:13:12PM +0800, Li Zefan wrote:
> + for_each_subsys(ss, i) {
> + if (!(opts.subsys_mask & (1 << i)) ||
> + ss->root == &cgrp_dfl_root)
> + continue;
> +
> + if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
> + mutex_unlock(&cgroup_mutex);
> + msleep(10);
> + ret = restart_syscall();
> + goto out_free;
> + }
Why not just put it immediately? We know that it's not gonna be
destroyed while holding cgroup_mutex. It may look a bit weird but
this is a pretty special case anyway and deferring put doesn't buy
anything.
Thanks.
--
tejun
On 2014/6/28 19:58, Tejun Heo wrote:
> Hello, Li.
>
> On Fri, Jun 27, 2014 at 05:13:12PM +0800, Li Zefan wrote:
>> + for_each_subsys(ss, i) {
>> + if (!(opts.subsys_mask & (1 << i)) ||
>> + ss->root == &cgrp_dfl_root)
>> + continue;
>> +
>> + if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
>> + mutex_unlock(&cgroup_mutex);
>> + msleep(10);
>> + ret = restart_syscall();
>> + goto out_free;
>> + }
>
> Why not just put it immediately? We know that it's not gonna be
> destroyed while holding cgroup_mutex. It may look a bit weird but
> this is a pretty special case anyway and deferring put doesn't buy
> anything.
>
Yeah, this is better. :)
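For reference, the immediate put Tejun suggests would turn the hunk quoted above into something like this (a sketch of the suggestion, not a posted patch; it relies on cgroup_mutex still being held, so the root can't be destroyed between the tryget and the put):

	for_each_subsys(ss, i) {
		if (!(opts.subsys_mask & (1 << i)) ||
		    ss->root == &cgrp_dfl_root)
			continue;

		/*
		 * tryget_live failure means the root is being destroyed;
		 * back off and retry.  On success we don't need to keep the
		 * reference: cgroup_mutex prevents the root from going away,
		 * so drop it right away instead of deferring the put.
		 */
		if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
			mutex_unlock(&cgroup_mutex);
			msleep(10);
			ret = restart_syscall();
			goto out_free;
		}
		cgroup_put(&ss->root->cgrp);
	}

With the reference dropped on the spot, the j bookkeeping and the extra put loop at out_free are no longer needed.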