# cat test.sh
#! /bin/bash
mount -t cgroup -o cpu xxx /cgroup
umount /cgroup
mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup
It's because the cgroupfs_root of the first mount was under destruction
asynchronously.
Fix this by delaying and then retrying mount for this case.
v3:
- put the refcnt immediately after getting it. (Tejun)
v2:
- use percpu_ref_tryget_live() rather that introducing
percpu_ref_alive(). (Tejun)
- adjust comment.
Cc: <[email protected]> # 3.15
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1c65f24..d3662ac 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1648,10 +1648,12 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
struct dentry *dentry;
int ret;
+ int i;
bool new_sb;
/*
@@ -1677,6 +1679,25 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
+ /*
+ * Destruction of cgroup root is asynchronous, so we may fail to
+ * mount a cgroupfs if it immediately follows a umount. Let's wait
+ * a little bit and retry.
+ */
+ for_each_subsys(ss, i) {
+ if (!(opts.subsys_mask & (1 << i)) ||
+ ss->root == &cgrp_dfl_root)
+ continue;
+
+ if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+ mutex_unlock(&cgroup_mutex);
+ msleep(10);
+ ret = restart_syscall();
+ goto out_free;
+ }
+ cgroup_put(&ss->root->cgrp);
+ }
+
for_each_root(root) {
bool name_match = false;
--
1.8.0.2
kernfs_pin_sb() tries to get a refcnt of the superblock.
This will be used by cgroupfs.
v2:
- make kernfs_pin_sb() return the superblock.
- drop kernfs_drop_sb().
[ This is a prerequisite for a bugfix. ]
Cc: <[email protected]> # 3.15
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Li Zefan <[email protected]>
---
fs/kernfs/mount.c | 27 +++++++++++++++++++++++++++
include/linux/kernfs.h | 1 +
2 files changed, 28 insertions(+)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f25a7c0..616c5c4 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -210,6 +210,33 @@ void kernfs_kill_sb(struct super_block *sb)
kernfs_put(root_kn);
}
+/**
+ * kernfs_pin_sb: try to pin the superblock associated with a kernfs_root
+ * @kernfs_root: the kernfs_root in question
+ * @ns: the namespace tag
+ *
+ * Pin the superblock so the superblock won't be destroyed in subsequent
+ * operations. Return NULL if there's no superblock associated to this
+ * kernfs_root, or -EINVAL if the superblock is being freed.
+ */
+struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns)
+{
+ struct kernfs_super_info *info;
+ struct super_block *sb = NULL;
+
+ mutex_lock(&kernfs_mutex);
+ list_for_each_entry(info, &root->supers, node) {
+ if (info->ns == ns) {
+ sb = info->sb;
+ if (!atomic_inc_not_zero(&info->sb->s_active))
+ sb = ERR_PTR(-EINVAL);
+ break;
+ }
+ }
+ mutex_unlock(&kernfs_mutex);
+ return sb;
+}
+
void __init kernfs_init(void)
{
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 589318b..9096296 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -287,6 +287,7 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
struct kernfs_root *root, bool *new_sb_created,
const void *ns);
void kernfs_kill_sb(struct super_block *sb);
+struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
void kernfs_init(void);
--
1.8.0.2
We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.
Run two instances of this script concurrently:
for ((; ;))
{
mount -t cgroup -o cpuacct xxx /cgroup
umount /cgroup
}
After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.
This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().
To fix this, we try to increase both the refcnt of the superblock and the
percpu refcnt of cgroup root.
v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt,
because cgroup_root may have no superblock assosiated with it.
- adjust/add comments.
Cc: <[email protected]> # 3.15
Signed-off-by: Li Zefan <[email protected]>
---
kernel/cgroup.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d3662ac..11e40cf 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1655,6 +1655,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int ret;
int i;
bool new_sb;
+ struct super_block *sb = NULL;
/*
* The first time anyone tries to mount a cgroup, enable the list
@@ -1739,14 +1740,18 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
/*
* A root's lifetime is governed by its root cgroup.
- * tryget_live failure indicate that the root is being
- * destroyed. Wait for destruction to complete so that the
- * subsystems are free. We can use wait_queue for the wait
- * but this path is super cold. Let's just sleep for a bit
- * and retry.
+ * pin_sb and tryget_live failure indicate that the root is
+ * being destroyed. Wait for destruction to complete so that
+ * the subsystems are free. We can use wait_queue for the
+ * wait but this path is super cold. Let's just sleep for
+ * a bit and retry.
*/
- if (!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+ sb = kernfs_pin_sb(root->kf_root, NULL);
+ if (IS_ERR(sb) ||
+ !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
+ if (!IS_ERR_OR_NULL(sb))
+ deactivate_super(sb);
msleep(10);
ret = restart_syscall();
goto out_free;
@@ -1790,6 +1795,17 @@ out_free:
dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
+
+ if (sb) {
+ /*
+ * On success kernfs_mount() returns with sb->s_umount held,
+ * but kernfs_mount() also increases the superblock's refcnt,
+ * so calling deactivate_super() to drop the refcnt we got when
+ * looking up cgroup root won't acquire sb->s_umount again.
+ */
+ WARN_ON(new_sb);
+ deactivate_super(sb);
+ }
return dentry;
}
--
1.8.0.2
Applied with minor updates.
Thanks.
------- 8< -------
>From 970317aa48c6ef66cd023c039c2650c897bad927 Mon Sep 17 00:00:00 2001
From: Li Zefan <[email protected]>
Date: Mon, 30 Jun 2014 11:49:58 +0800
Subject: [PATCH 1/3] cgroup: fix mount failure in a corner case
# cat test.sh
#! /bin/bash
mount -t cgroup -o cpu xxx /cgroup
umount /cgroup
mount -t cgroup -o cpu,cpuacct xxx /cgroup
umount /cgroup
# ./test.sh
mount: xxx already mounted or /cgroup busy
mount: according to mtab, xxx is already mounted on /cgroup
It's because the cgroupfs_root of the first mount was under destruction
asynchronously.
Fix this by delaying and then retrying mount for this case.
v3:
- put the refcnt immediately after getting it. (Tejun)
v2:
- use percpu_ref_tryget_live() rather that introducing
percpu_ref_alive(). (Tejun)
- adjust comment.
tj: Updated the comment a bit.
Cc: <[email protected]> # 3.15
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index d9a8be9..6406866 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1648,10 +1648,12 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
struct dentry *dentry;
int ret;
+ int i;
bool new_sb;
/*
@@ -1677,6 +1679,27 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
goto out_unlock;
}
+ /*
+ * Destruction of cgroup root is asynchronous, so subsystems may
+ * still be dying after the previous unmount. Let's drain the
+ * dying subsystems. We just need to ensure that the ones
+ * unmounted previously finish dying and don't care about new ones
+ * starting. Testing ref liveliness is good enough.
+ */
+ for_each_subsys(ss, i) {
+ if (!(opts.subsys_mask & (1 << i)) ||
+ ss->root == &cgrp_dfl_root)
+ continue;
+
+ if (!percpu_ref_tryget_live(&ss->root->cgrp.self.refcnt)) {
+ mutex_unlock(&cgroup_mutex);
+ msleep(10);
+ ret = restart_syscall();
+ goto out_free;
+ }
+ cgroup_put(&ss->root->cgrp);
+ }
+
for_each_root(root) {
bool name_match = false;
--
1.9.3
Applied to cgroup/for-3.16-fixes with minor updates. Thanks.
------ 8< ------
>From 4e26445faad366d67d7723622bf6a60a6f0f5993 Mon Sep 17 00:00:00 2001
From: Li Zefan <[email protected]>
Date: Mon, 30 Jun 2014 11:50:28 +0800
Subject: [PATCH 2/3] kernfs: introduce kernfs_pin_sb()
kernfs_pin_sb() tries to get a refcnt of the superblock.
This will be used by cgroupfs.
v2:
- make kernfs_pin_sb() return the superblock.
- drop kernfs_drop_sb().
tj: Updated the comment a bit.
[ This is a prerequisite for a bugfix. ]
Cc: <[email protected]> # 3.15
Acked-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
fs/kernfs/mount.c | 30 ++++++++++++++++++++++++++++++
include/linux/kernfs.h | 1 +
2 files changed, 31 insertions(+)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d171b98..f973ae9 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -211,6 +211,36 @@ void kernfs_kill_sb(struct super_block *sb)
kernfs_put(root_kn);
}
+/**
+ * kernfs_pin_sb: try to pin the superblock associated with a kernfs_root
+ * @kernfs_root: the kernfs_root in question
+ * @ns: the namespace tag
+ *
+ * Pin the superblock so the superblock won't be destroyed in subsequent
+ * operations. This can be used to block ->kill_sb() which may be useful
+ * for kernfs users which dynamically manage superblocks.
+ *
+ * Returns NULL if there's no superblock associated to this kernfs_root, or
+ * -EINVAL if the superblock is being freed.
+ */
+struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns)
+{
+ struct kernfs_super_info *info;
+ struct super_block *sb = NULL;
+
+ mutex_lock(&kernfs_mutex);
+ list_for_each_entry(info, &root->supers, node) {
+ if (info->ns == ns) {
+ sb = info->sb;
+ if (!atomic_inc_not_zero(&info->sb->s_active))
+ sb = ERR_PTR(-EINVAL);
+ break;
+ }
+ }
+ mutex_unlock(&kernfs_mutex);
+ return sb;
+}
+
void __init kernfs_init(void)
{
kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 17aa1cc..20f4935 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -304,6 +304,7 @@ struct dentry *kernfs_mount_ns(struct file_system_type *fs_type, int flags,
struct kernfs_root *root, unsigned long magic,
bool *new_sb_created, const void *ns);
void kernfs_kill_sb(struct super_block *sb);
+struct super_block *kernfs_pin_sb(struct kernfs_root *root, const void *ns);
void kernfs_init(void);
--
1.9.3
Applied to cgroup/for-3.16-fixes with minor updates. Thanks.
------ 8< ------
>From 3a32bd72d77058d768dbb38183ad517f720dd1bc Mon Sep 17 00:00:00 2001
From: Li Zefan <[email protected]>
Date: Mon, 30 Jun 2014 11:50:59 +0800
Subject: [PATCH 3/3] cgroup: fix a race between cgroup_mount() and
cgroup_kill_sb()
We've converted cgroup to kernfs so cgroup won't be intertwined with
vfs objects and locking, but there are dark areas.
Run two instances of this script concurrently:
for ((; ;))
{
mount -t cgroup -o cpuacct xxx /cgroup
umount /cgroup
}
After a while, I saw two mount processes were stuck at retrying, because
they were waiting for a subsystem to become free, but the root associated
with this subsystem never got freed.
This can happen, if thread A is in the process of killing superblock but
hasn't called percpu_ref_kill(), and at this time thread B is mounting
the same cgroup root and finds the root in the root list and performs
percpu_ref_try_get().
To fix this, we try to increase both the refcnt of the superblock and the
percpu refcnt of cgroup root.
v2:
- we should try to get both the superblock refcnt and cgroup_root refcnt,
because cgroup_root may have no superblock assosiated with it.
- adjust/add comments.
tj: Updated comments. Renamed @sb to @pinned_sb.
Cc: <[email protected]> # 3.15
Signed-off-by: Li Zefan <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup.c | 33 ++++++++++++++++++++++++++-------
1 file changed, 26 insertions(+), 7 deletions(-)
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 6406866..70776ae 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1648,6 +1648,7 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
int flags, const char *unused_dev_name,
void *data)
{
+ struct super_block *pinned_sb = NULL;
struct cgroup_subsys *ss;
struct cgroup_root *root;
struct cgroup_sb_opts opts;
@@ -1740,15 +1741,23 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
}
/*
- * A root's lifetime is governed by its root cgroup.
- * tryget_live failure indicate that the root is being
- * destroyed. Wait for destruction to complete so that the
- * subsystems are free. We can use wait_queue for the wait
- * but this path is super cold. Let's just sleep for a bit
- * and retry.
+ * We want to reuse @root whose lifetime is governed by its
+ * ->cgrp. Let's check whether @root is alive and keep it
+ * that way. As cgroup_kill_sb() can happen anytime, we
+ * want to block it by pinning the sb so that @root doesn't
+ * get killed before mount is complete.
+ *
+ * With the sb pinned, tryget_live can reliably indicate
+ * whether @root can be reused. If it's being killed,
+ * drain it. We can use wait_queue for the wait but this
+ * path is super cold. Let's just sleep a bit and retry.
*/
- if (!percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
+ pinned_sb = kernfs_pin_sb(root->kf_root, NULL);
+ if (IS_ERR(pinned_sb) ||
+ !percpu_ref_tryget_live(&root->cgrp.self.refcnt)) {
mutex_unlock(&cgroup_mutex);
+ if (!IS_ERR_OR_NULL(pinned_sb))
+ deactivate_super(pinned_sb);
msleep(10);
ret = restart_syscall();
goto out_free;
@@ -1793,6 +1802,16 @@ out_free:
CGROUP_SUPER_MAGIC, &new_sb);
if (IS_ERR(dentry) || !new_sb)
cgroup_put(&root->cgrp);
+
+ /*
+ * If @pinned_sb, we're reusing an existing root and holding an
+ * extra ref on its sb. Mount is complete. Put the extra ref.
+ */
+ if (pinned_sb) {
+ WARN_ON(new_sb);
+ deactivate_super(pinned_sb);
+ }
+
return dentry;
}
--
1.9.3
Hello, Li.
I applied the three patches with some updates (mostly comments).
Dynamic root management is turning out to be pretty ugly but I think
we can clean it up a bit.
On Mon, Jun 30, 2014 at 11:50:59AM +0800, Li Zefan wrote:
> @@ -1790,6 +1795,17 @@ out_free:
> dentry = kernfs_mount(fs_type, flags, root->kf_root, &new_sb);
> if (IS_ERR(dentry) || !new_sb)
> cgroup_put(&root->cgrp);
> +
> + if (sb) {
> + /*
> + * On success kernfs_mount() returns with sb->s_umount held,
> + * but kernfs_mount() also increases the superblock's refcnt,
> + * so calling deactivate_super() to drop the refcnt we got when
> + * looking up cgroup root won't acquire sb->s_umount again.
> + */
> + WARN_ON(new_sb);
> + deactivate_super(sb);
> + }
So, @new_sb and @sb indicate the same conditions now. When we reuse
an existing root, we always reuse its sb too which means that we at
least no longer need the @new_sb ugliness. I actually kinda prefer it
as this means that the ugliness is confined to one ugly function -
kernfs_pin_sb() - instead of the generic kernfs_mount(). Can you
please prepare patches to clean these up? I'll pull in for-3.16-fixes
into for-3.17 and apply the cleanup on top.
Thanks!
--
tejun