2021-09-08 12:16:41

by Yi Tao

Subject: [RFC PATCH 0/2] support cgroup pool in v1

In a scenario where containers are started with high concurrency, in
order to control each container's use of system resources, it is
necessary to create a corresponding cgroup for each container and
attach its processes. The kernel uses the global cgroup_mutex lock to
protect the consistency of the data, which results in high long-tail
latency for cgroup-related operations during concurrent startup.
For example, the long-tail latency of creating a cgroup under each
subsystem is 900ms when starting 400 containers, which becomes a
performance bottleneck. The delay is composed of two parts: the time
spent in the critical section protected by cgroup_mutex, and the
scheduling delay of sleeping on the lock. The scheduling delay grows
as CPU load increases.

In order to solve this long-tail latency problem, we designed a cgroup
pool. The pool creates a certain number of cgroups in advance. When a
user creates a cgroup through the mkdir system call, a clean cgroup can
be obtained quickly from the pool. The cgroup pool draws on the idea of
cgroup rename: by creating the pool and renaming in advance, it shrinks
the critical section of cgroup creation and uses a spinlock separate
from cgroup_mutex, which reduces scheduling overhead on the one hand
and eases contention with process attach on the other.
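
As a purely illustrative sketch of the fast path described above (the
structure and helper names below, such as cgroup_pool and
cgroup_pool_pick(), are hypothetical and do not appear in the patches):

#include <linux/cgroup.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/* sketch only: a pre-created, still hidden cgroup waiting in the pool */
struct pooled_cgroup {
	struct list_head pool_node;	/* linkage in the pool's free list */
	struct cgroup *cgrp;		/* the clean, pre-created cgroup */
};

/* sketch only: the pool itself, protected by its own spinlock */
struct cgroup_pool {
	spinlock_t lock;		/* separate from cgroup_mutex */
	struct list_head free;		/* clean, pre-created cgroups */
	int nr_free;
};

/* mkdir fast path: hand out a pre-created cgroup without cgroup_mutex */
static struct pooled_cgroup *cgroup_pool_pick(struct cgroup_pool *pool)
{
	struct pooled_cgroup *pc = NULL;

	spin_lock(&pool->lock);
	if (!list_empty(&pool->free)) {
		pc = list_first_entry(&pool->free, struct pooled_cgroup,
				      pool_node);
		list_del(&pc->pool_node);
		pool->nr_free--;
	}
	spin_unlock(&pool->lock);

	return pc;			/* NULL means fall back to the slow path */
}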

The core idea of implementing the cgroup pool is to create a hidden
kernfs tree. Cgroups are implemented on top of the kernfs file system,
and users manipulate a cgroup through its kernfs files. Therefore, we
can create cgroups in advance and place them in a hidden kernfs tree so
that users cannot operate on them. When a user needs to create one, we
move a cgroup from the hidden tree to the requested location. Because
this only removes a node from one kernfs tree and inserts it into
another, it does not touch any other data of the cgroup or the related
subsystems, so the operation is very fast and there is no need to hold
cgroup_mutex. In this way we escape the limitation of cgroup_mutex and
shrink the critical section, but kernfs_rwsem still protects the
kernfs data structures, so the scheduling delay of sleeping on a lock
remains.
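
For illustration only (the wrapper name below is made up; the patches
actually go through the kernfs rename path), the move boils down to a
rename from the hidden tree into the visible parent:

#include <linux/kernfs.h>

/*
 * Sketch only: take a pre-created cgroup's kernfs node out of the
 * hidden tree and re-parent it where the user's mkdir asked for it.
 * Only the kernfs node moves; the cgroup and subsystem state set up
 * when the cgroup was pre-created stay untouched.
 */
static int cgroup_pool_install(struct kernfs_node *hidden_kn,
			       struct kernfs_node *new_parent,
			       const char *name)
{
	return kernfs_rename_ns(hidden_kn, new_parent, name, NULL);
}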

In order to avoid kernfs_rwsem as well, we introduce a pinned state for
kernfs nodes. When a node is pinned, the lock protecting its data
changes from kernfs_rwsem to a lock that can be specified. In the
cgroup pool scenario, the parent cgroup has a corresponding spinlock;
when the pool is enabled, the kernfs nodes of all cgroups under that
parent are set to the pinned state. Creating, deleting and moving these
kernfs nodes is then protected by the parent cgroup's spinlock, so data
consistency is not a problem.
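
With the pinned flag from patch 1, enabling and disabling this mode on
a pooled parent could look roughly like the sketch below; the wrapper
function names are hypothetical, while kernfs_set_pinned() and
kernfs_clear_pinned() are the helpers added in patch 1:

#include <linux/kernfs.h>
#include <linux/spinlock.h>

/*
 * Sketch only: switch every kernfs node under the pooled parent from
 * kernfs_rwsem to the parent's own spinlock, and back again.
 * pool_lock stands in for the per-parent spinlock described above.
 */
static void cgroup_pool_pin_subtree(struct kernfs_node *parent_kn,
				    spinlock_t *pool_lock)
{
	/* create/remove/rename under parent_kn now take pool_lock */
	kernfs_set_pinned(parent_kn, pool_lock);
}

static void cgroup_pool_unpin_subtree(struct kernfs_node *parent_kn)
{
	/* fall back to kernfs_rwsem for the whole subtree */
	kernfs_clear_pinned(parent_kn);
}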

After the pool is enabled, creating a cgroup takes the fast path and
obtains one from the pool; deleting a cgroup still takes the slow path.
When the resources in the pool run low, a delayed task is triggered and
the pool is replenished after a period of time. This is done to avoid
contending with cgroup creations that are happening right now and
hurting their performance. If the pool is exhausted and has not been
replenished in time, cgroup creation falls back to the slow path, so
users need to choose an appropriate pool size and refill delay.
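
The refill logic described above could be sketched as a delayed work
item; the structure is redefined here so the sketch stands alone, and
the field and helper names (size, low_watermark, refill_work,
refill_delay, cgroup_pool_add_one) are assumptions, not the patch code:

#include <linux/workqueue.h>

/* sketch only: refill-related bookkeeping for the hypothetical pool */
struct cgroup_pool {
	int nr_free;			/* cgroups currently in the pool */
	int size;			/* target pool size set by the user */
	int low_watermark;		/* refill when nr_free drops below this */
	unsigned long refill_delay;	/* delay in jiffies before refilling */
	struct delayed_work refill_work;
};

static void cgroup_pool_add_one(struct cgroup_pool *pool); /* slow-path creation */

/* delayed work: top the pool back up off the mkdir hot path */
static void cgroup_pool_refill_fn(struct work_struct *work)
{
	struct cgroup_pool *pool = container_of(to_delayed_work(work),
						struct cgroup_pool,
						refill_work);

	while (pool->nr_free < pool->size)
		cgroup_pool_add_one(pool);
}

/* called from the fast path after handing out a cgroup */
static void cgroup_pool_maybe_refill(struct cgroup_pool *pool)
{
	if (pool->nr_free < pool->low_watermark)
		schedule_delayed_work(&pool->refill_work, pool->refill_delay);
}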

What we did in these patches:
1. Add a pinned flag for kernfs nodes so that a node can be taken out
from under kernfs_rwsem and protected by another lock instead.
2. Add a pool_size interface used to enable and disable the cgroup
pool (a usage sketch follows this list).
3. Add an extra kernfs tree used to hide the cgroups in the pool.
4. Add a spinlock to protect the kernfs nodes of the cgroups in the pool.
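
As a purely illustrative usage sketch (the file name and location of
the pool_size interface below are guesses based on this description,
not taken from the patches):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Userspace sketch only: a container manager enabling the pool on a
 * v1 hierarchy.  The path of the pool_size file is hypothetical.
 */
int main(void)
{
	const char *path = "/sys/fs/cgroup/cpu/containers/pool_size";
	const char *val = "400\n";	/* pre-create enough cgroups for a burst */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open pool_size");
		return 1;
	}
	if (write(fd, val, strlen(val)) < 0)
		perror("write pool_size");
	close(fd);
	return 0;
}

Per the description above, writing 0 would presumably close the pool
again.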


Yi Tao (2):
add pinned flags for kernfs node
support cgroup pool in v1

fs/kernfs/dir.c | 74 ++++++++++++++++-------
include/linux/cgroup-defs.h | 16 +++++
include/linux/cgroup.h | 2 +
include/linux/kernfs.h | 14 +++++
kernel/cgroup/cgroup-v1.c | 139 ++++++++++++++++++++++++++++++++++++++++++++
kernel/cgroup/cgroup.c | 113 ++++++++++++++++++++++++++++++++++-
kernel/sysctl.c | 8 +++
7 files changed, 345 insertions(+), 21 deletions(-)

--
1.8.3.1


2021-09-08 12:55:43

by Yi Tao

Subject: [RFC PATCH 1/2] add pinned flags for kernfs node

This patch prepares for the implementation of the cgroup pool. If a
kernfs node is set to pinned, the data of this node is no longer
protected by kernfs internally; for the following operations, the
region normally protected by kernfs_rwsem is protected by the
specified spinlock instead:
1. renaming this node
2. removing this node
3. creating a child node

Suggested-by: Shanpei Chen <[email protected]>
Signed-off-by: Yi Tao <[email protected]>
---
fs/kernfs/dir.c | 74 ++++++++++++++++++++++++++++++++++++--------------
include/linux/kernfs.h | 14 ++++++++++
2 files changed, 68 insertions(+), 20 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index ba581429bf7b..68b05b5bc1a2 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -26,7 +26,6 @@

static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_rwsem);
return atomic_read(&kn->active) >= 0;
}

@@ -461,10 +460,9 @@ static void kernfs_drain(struct kernfs_node *kn)
{
struct kernfs_root *root = kernfs_root(kn);

- lockdep_assert_held_write(&kernfs_rwsem);
WARN_ON_ONCE(kernfs_active(kn));

- up_write(&kernfs_rwsem);
+ kernfs_unlock(kn);

if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +481,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- down_write(&kernfs_rwsem);
+ kernfs_lock(kn);
}

/**
@@ -722,7 +720,7 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;

- down_write(&kernfs_rwsem);
+ kernfs_lock(parent);

ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -753,7 +751,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}

- up_write(&kernfs_rwsem);
+ kernfs_unlock(parent);

/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -767,7 +765,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;

out_unlock:
- up_write(&kernfs_rwsem);
+ kernfs_unlock(parent);
return ret;
}

@@ -788,8 +786,6 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;

- lockdep_assert_held(&kernfs_rwsem);
-
if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
has_ns ? "required" : "invalid", parent->name, name);
@@ -1242,8 +1238,6 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;

- lockdep_assert_held_write(&kernfs_rwsem);
-
/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
return kernfs_leftmost_descendant(root);
@@ -1299,8 +1293,6 @@ static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- lockdep_assert_held_write(&kernfs_rwsem);
-
/*
* Short-circuit if non-root @kn has already finished removal.
* This is for kernfs_remove_self() which plays with active ref
@@ -1369,9 +1361,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- down_write(&kernfs_rwsem);
+ kernfs_lock(kn);
__kernfs_remove(kn);
- up_write(&kernfs_rwsem);
+ kernfs_unlock(kn);
}

/**
@@ -1525,13 +1517,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}

- down_write(&kernfs_rwsem);
+ kernfs_lock(parent);

kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);

- up_write(&kernfs_rwsem);
+ kernfs_unlock(parent);

if (kn)
return 0;
@@ -1557,7 +1549,9 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
if (!kn->parent)
return -EINVAL;

- down_write(&kernfs_rwsem);
+ /* if parent is pinned, parent->lock protects rename */
+ if (!kn->parent->pinned)
+ down_write(&kernfs_rwsem);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1576,7 +1570,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
/* rename kernfs_node */
if (strcmp(kn->name, new_name) != 0) {
error = -ENOMEM;
- new_name = kstrdup_const(new_name, GFP_KERNEL);
+ /* use GFP_ATOMIC to avoid sleep */
+ new_name = kstrdup_const(new_name, GFP_ATOMIC);
if (!new_name)
goto out;
} else {
@@ -1611,10 +1606,49 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- up_write(&kernfs_rwsem);
+ if (!kn->parent->pinned)
+ up_write(&kernfs_rwsem);
return error;
}

+/* Traverse all descendants and set pinned */
+void kernfs_set_pinned(struct kernfs_node *kn, spinlock_t *lock)
+{
+ struct kernfs_node *pos = NULL;
+
+ while ((pos = kernfs_next_descendant_post(pos, kn))) {
+ pos->pinned = true;
+ pos->lock = lock;
+ }
+}
+
+/* Traverse all descendants and clear pinned */
+void kernfs_clear_pinned(struct kernfs_node *kn)
+{
+ struct kernfs_node *pos = NULL;
+
+ while ((pos = kernfs_next_descendant_post(pos, kn))) {
+ pos->pinned = false;
+ pos->lock = NULL;
+ }
+}
+
+void kernfs_lock(struct kernfs_node *kn)
+{
+ if (!kn->pinned)
+ down_write(&kernfs_rwsem);
+ else
+ spin_lock(kn->lock);
+}
+
+void kernfs_unlock(struct kernfs_node *kn)
+{
+ if (!kn->pinned)
+ up_write(&kernfs_rwsem);
+ else
+ spin_unlock(kn->lock);
+}
+
/* Relationship between mode and the DT_xxx types */
static inline unsigned char dt_type(struct kernfs_node *kn)
{
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 1093abf7c28c..a70d96308c51 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -161,6 +161,13 @@ struct kernfs_node {
unsigned short flags;
umode_t mode;
struct kernfs_iattrs *iattr;
+
+ /*
+ * If pinned is true, use lock to protect remove, rename this kernfs
+ * node or create child kernfs node.
+ */
+ bool pinned;
+ spinlock_t *lock;
};

/*
@@ -415,6 +422,11 @@ int kernfs_xattr_set(struct kernfs_node *kn, const char *name,

struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
u64 id);
+
+void kernfs_set_pinned(struct kernfs_node *kn, spinlock_t *lock);
+void kernfs_clear_pinned(struct kernfs_node *kn);
+void kernfs_lock(struct kernfs_node *kn);
+void kernfs_unlock(struct kernfs_node *kn);
#else /* CONFIG_KERNFS */

static inline enum kernfs_node_type kernfs_type(struct kernfs_node *kn)
@@ -528,6 +540,8 @@ static inline void kernfs_kill_sb(struct super_block *sb) { }

static inline void kernfs_init(void) { }

+inline void kernfs_set_pinned(struct kernfs_node *kn, spinlock_t *lock) {}
+inline void kernfs_clear_pinned(struct kernfs_node *kn) {}
#endif /* CONFIG_KERNFS */

/**
--
1.8.3.1

2021-09-08 13:05:09

by Greg Kroah-Hartman

Subject: Re: [RFC PATCH 1/2] add pinned flags for kernfs node

On Wed, Sep 08, 2021 at 08:15:12PM +0800, Yi Tao wrote:
> This patch is preparing for the implementation of cgroup pool. If a
> kernfs node is set to pinned. the data of this node will no longer be
> protected by kernfs internally. When it performs the following actions,
> the area protected by kernfs_rwsem will be protected by the specific
> spinlock:
> 1.rename this node
> 2.remove this node
> 3.create child node
>
> Suggested-by: Shanpei Chen <[email protected]>
> Signed-off-by: Yi Tao <[email protected]>
> ---
> fs/kernfs/dir.c | 74 ++++++++++++++++++++++++++++++++++++--------------
> include/linux/kernfs.h | 14 ++++++++++
> 2 files changed, 68 insertions(+), 20 deletions(-)

Ugh, this is going to get messy fast.

Why are kernfs changes needed for this? kernfs creation is not
necessarily supposed to be "fast", what benchmark needs this type of
change to require the addition of this complexity?

And how are we going to audit things to ensure the "pinning" is working
properly?

thanks,

greg k-h

2021-09-08 13:05:23

by Greg Kroah-Hartman

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

On Wed, Sep 08, 2021 at 08:15:11PM +0800, Yi Tao wrote:
> In a scenario where containers are started with high concurrency, in
> order to control the use of system resources by the container, it is
> necessary to create a corresponding cgroup for each container and
> attach the process. The kernel uses the cgroup_mutex global lock to
> protect the consistency of the data, which results in a higher
> long-tail delay for cgroup-related operations during concurrent startup.
> For example, long-tail delay of creating cgroup under each subsystems
> is 900ms when starting 400 containers, which becomes bottleneck of
> performance. The delay is mainly composed of two parts, namely the
> time of the critical section protected by cgroup_mutex and the
> scheduling time of sleep. The scheduling time will increase with
> the increase of the cpu overhead.

Perhaps you shouldn't be creating that many containers all at once?
What normal workload requires this?

thanks,

greg k-h

2021-09-08 16:38:09

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

Hello,

On Wed, Sep 08, 2021 at 08:15:11PM +0800, Yi Tao wrote:
> In order to solve this long-tail delay problem, we designed a cgroup
> pool. The cgroup pool will create a certain number of cgroups in advance.
> When a user creates a cgroup through the mkdir system call, a clean cgroup
> can be quickly obtained from the pool. Cgroup pool draws on the idea of
> cgroup rename. By creating pool and rename in advance, it reduces the
> critical area of cgroup creation, and uses a spinlock different from
> cgroup_mutex, which reduces scheduling overhead on the one hand, and eases
> competition with attaching processes on the other hand.

I'm not sure this is the right way to go about it. There are more
conventional ways to improve scalability - making locking more granular and
hunting down specific operations which take long time. I don't think cgroup
management operations need the level of scalability which requires front
caching.

Thanks.

--
tejun

2021-09-10 02:14:32

by Yi Tao

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

On 2021/9/8 8:37 PM, Greg KH wrote:

> Perhaps you shouldn't be creating that many containers all at once?
> What normal workload requires this?

Thank you for your reply.

The scenario is function computing on the public cloud. Each function
computing instance is allocated about 0.1 CPU core and 100M of memory,
so on a high-end server, for example one with 104 cores and 384G of
memory, it is normal to create hundreds of containers at the same time
when a burst of requests comes.

thanks,

Yi Tao

2021-09-10 02:15:23

by Yi Tao

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

I am glad to receive your reply.

The cgroup pool is a relatively simple solution that I think can solve
the problem. I have tried making the locking more granular, but in the
end found it too difficult: cgroup_mutex protects almost every
operation related to cgroups, and without cgroup_mutex I have no idea
how to design a locking scheme that takes both concurrent performance
and the existing interfaces into account. Do you have any good advice?

thanks,

Yi Tao


On 2021/9/9 12:35 AM, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 08, 2021 at 08:15:11PM +0800, Yi Tao wrote:
>> In order to solve this long-tail delay problem, we designed a cgroup
>> pool. The cgroup pool will create a certain number of cgroups in advance.
>> When a user creates a cgroup through the mkdir system call, a clean cgroup
>> can be quickly obtained from the pool. Cgroup pool draws on the idea of
>> cgroup rename. By creating pool and rename in advance, it reduces the
>> critical area of cgroup creation, and uses a spinlock different from
>> cgroup_mutex, which reduces scheduling overhead on the one hand, and eases
>> competition with attaching processes on the other hand.
> I'm not sure this is the right way to go about it. There are more
> conventional ways to improve scalability - making locking more granular and
> hunting down specific operations which take long time. I don't think cgroup
> management operations need the level of scalability which requires front
> caching.
>
> Thanks.
>

2021-09-10 02:17:35

by Yi Tao

Subject: Re: [RFC PATCH 1/2] add pinned flags for kernfs node


On 2021/9/8 8:35 PM, Greg KH wrote:
> Why are kernfs changes needed for this? kernfs creation is not
> necessarily supposed to be "fast", what benchmark needs this type of
> change to require the addition of this complexity?

The implementation of the cgroup pool should have nothing to do with
kernfs, but during development I found that when there is background
CPU load, it takes a very significant amount of time for a process to
go from being woken up to actually running once it has been granted
the mutex.

Creating 400 cgroups concurrently takes about 80ms with no background
CPU load, but about 700ms when CPU usage is around 40%. Reducing
sched_wakeup_granularity_ns also reduces the time consumed, and
changing the mutex to a spinlock improves the situation considerably.

So to solve this problem, a mutex should not be used. The cgroup pool
relies on kernfs_rename, which takes kernfs_mutex, so I need to bypass
kernfs_mutex, and the pinned flag was added for that. Because the
locking of kernfs_rename has changed, the creation and deletion paths
of kernfs also have to change accordingly to keep the data consistent.

I admit that this is really not a very elegant design, but I don't
know how to make it better, so I am raising the problem and trying to
seek help from the community.

thanks,

Yi Tao

2021-09-10 06:03:50

by Greg Kroah-Hartman

Subject: Re: [RFC PATCH 1/2] add pinned flags for kernfs node

On Fri, Sep 10, 2021 at 10:14:28AM +0800, taoyi.ty wrote:
>
> On 2021/9/8 8:35 PM, Greg KH wrote:
> > Why are kernfs changes needed for this? kernfs creation is not
> > necessarily supposed to be "fast", what benchmark needs this type of
> > change to require the addition of this complexity?
>
> The implementation of the cgroup pool should have nothing
> to do with kernfs, but during the development process,
> I found that when there is a background cpu load, it takes
> a very significant time for a process to get the mutex from
> being awakened to starting execution.
>
> To create 400 cgroups concurrently, if there is no background
> cpu load, it takes about 80ms, but if the cpu usage rate is
> 40%, it takes about 700ms. If you reduce
> sched_wakeup_granularity_ns, the time consumption will also
> be reduced. If you change mutex to spinlock, the situation
> will be very much improved.
>
> So to solve this problem, mutex should not be used. The
> cgroup pool relies on kernfs_rename which uses
> kernfs_mutex, so I need to bypass kernfs_mutex and
> add a pinned flag for this.
>
> Because the lock mechanism of kernfs_rename has been
> changed, in order to maintain data consistency, the creation
> and deletion of kernfs have also been changed accordingly
>
> I admit that this is really not a very elegant design, but I don't
> know how to make it better, so I throw out the problem and
> try to seek help from the community.

Look at the changes to kernfs for 5.15-rc1 where a lot of the lock
contention was removed based on benchmarks where kernfs (through sysfs)
was accessed by lots of processes all at once.

That should help a bit in your case, but remember, the creation of
kernfs files is not the "normal" case, so it is not optimized at all.
We have optimized the access case, which is by far the most common.

good luck!

greg k-h

2021-09-10 06:05:22

by Greg Kroah-Hartman

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

On Fri, Sep 10, 2021 at 10:11:53AM +0800, taoyi.ty wrote:
> On 2021/9/8 8:37 PM, Greg KH wrote:
>
> > Perhaps you shouldn't be creating that many containers all at once?
> > What normal workload requires this?
>
> Thank you for your reply.
>
> The scenario is the function computing of the public
> cloud. Each instance of function computing will be
> allocated about 0.1 core cpu and 100M memory. On
> a high-end server, for example, 104 cores and 384G,
> it is normal to create hundreds of containers at the
> same time if burst of requests comes.

So it is a resource management issue on your side, right? Perhaps
stagger the creation of new containers to allow the overall creation
time to be less?

thanks,

greg k-h

2021-09-10 16:50:37

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

Hello,

On Fri, Sep 10, 2021 at 10:11:53AM +0800, taoyi.ty wrote:
> The scenario is the function computing of the public
> cloud. Each instance of function computing will be
> allocated about 0.1 core cpu and 100M memory. On
> a high-end server, for example, 104 cores and 384G,
> it is normal to create hundreds of containers at the
> same time if burst of requests comes.

This type of use case isn't something cgroup is good at, at least not
currently. The problem is that trying to scale management operations like
creating and destroying cgroups has implications on how each controller is
implemented - we want the hot paths which get used while cgroups are running
actively to be as efficient and scalable as possible even if that requires a
lot of extra preparation and lazy cleanup operations. We don't really want
to push for cgroup creation / destruction efficiency at the cost of hot path
overhead.

This has implications for use cases like you describe. Even if the kernel
pre-prepares cgroups to lower the latency of cgroup creation, it means that
the system would be doing a *lot* of extra managerial work, constantly
creating and destroying cgroups for not much actual work.

Usually, the right solution for this sort of situation is pooling cgroups
from userspace, which usually has much better insight into which cgroups
can be recycled and can also adjust the cgroup hierarchy to better fit the
use case (e.g. some rapid-cycling cgroups can benefit from higher-level
resource configurations).

So, it'd be great to make the managerial operations more efficient from
cgroup core side but there are inherent architectural reasons why
rapid-cycling use cases aren't and won't be prioritized.

Thanks.

--
tejun

2021-09-13 14:57:16

by Christian Brauner

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

On Fri, Sep 10, 2021 at 06:49:27AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Sep 10, 2021 at 10:11:53AM +0800, taoyi.ty wrote:
> > The scenario is the function computing of the public
> > cloud. Each instance of function computing will be
> > allocated about 0.1 core cpu and 100M memory. On
> > a high-end server, for example, 104 cores and 384G,
> > it is normal to create hundreds of containers at the
> > same time if burst of requests comes.
>
> This type of use case isn't something cgroup is good at, at least not
> currently. The problem is that trying to scale management operations like
> creating and destroying cgroups has implications on how each controller is
> implemented - we want the hot paths which get used while cgroups are running
> actively to be as efficient and scalable as possible even if that requires a
> lot of extra preparation and lazy cleanup operations. We don't really want
> to push for cgroup creation / destruction efficiency at the cost of hot path
> overhead.
>
> This has implications for use cases like you describe. Even if the kernel
> pre-prepare cgroups to low latency for cgroup creation, it means that the
> system would be doing a *lot* of managerial extra work creating and
> destroying cgroups constantly for not much actual work.
>
> Usually, the right solution for this sort of situations is pooling cgroups
> from the userspace which usually has a lot better insight into which cgroups
> can be recycled and can also adjust the cgroup hierarchy to better fit the
> use case (e.g. some rapid-cycling cgroups can benefit from higher-level
> resource configurations).

I had the same reaction and I wanted to do something like this before,
i.e. maintain a pool of pre-allocated cgroups in userspace. But there
were some problems.

Afaict, there is currently no way to prevent the deletion of empty
cgroups, especially newly created ones. So for example, if I have a
cgroup manager that prunes the cgroup tree whenever it detects empty
cgroups, it can delete cgroups that were pre-allocated. This is
something we have run into before.

A related problem is a crashed or killed container manager
(segfault, sigkill, etc.). It might not have had the chance to clean up
the cgroups it allocated for the container. If the container manager is
restarted it can't reuse the existing cgroup it finds because it has no
way of guaranteeing that, between the time it crashed and got restarted,
another program hasn't just created a cgroup with the same name. We
usually solve this by creating another cgroup with an index appended
until we find an unallocated one, with an arbitrary cut-off point (e.g.
1000) after which we require manual intervention by the user.

Right now, iirc, one can rmdir() an empty cgroup while someone still
holds a file descriptor open for it. This can lead to a situation where
a cgroup got created but, before anything was moved into it (via
clone3() or write()), someone else deleted it. What would already be
helpful is a way to prevent the deletion of a cgroup while someone
still holds an open reference to it. This would allow a pool of cgroups
to be created that can't simply be deleted.

Christian

2021-09-13 16:28:06

by Tejun Heo

Subject: Re: [RFC PATCH 0/2] support cgroup pool in v1

Hello,

On Mon, Sep 13, 2021 at 04:20:59PM +0200, Christian Brauner wrote:
> Afaict, there is currently now way to prevent the deletion of empty
> cgroups, especially newly created ones. So for example, if I have a
> cgroup manager that prunes the cgroup tree whenever they detect empty
> cgroups they can delete cgroups that were pre-allocated. This is
> something we have run into before.

systemd doesn't mess with cgroups behind a delegation point.

> A related problem is a crashed or killed container manager
> (segfault, sigkill, etc.). It might not have had the chance to cleanup
> cgroups it allocated for the container. If the container manager is
> restarted it can't reuse the existing cgroup it found because it has no
> way of guaranteeing whether in between the time it crashed and got
> restarted another program has just created a cgroup with the same name.
> We usually solve this by just creating another cgroup with an index
> appended until we we find an unallocated one setting an arbitrary cut
> off point until we require manual intervention by the user (e.g. 1000).
>
> Right now iirc, one can rmdir() an empty cgroup while someone still
> holds a file descriptor open for it. This can lead to situation where a
> cgroup got created but before moving into the cgroup (via clone3() or
> write()) someone else has deleted it. What would already be helpful is
> if one had a way to prevent the deletion of cgroups when someone still
> has an open reference to it. This would allow a pool of cgroups to be
> created that can't simply be deleted.

The above are problems common for any entity managing cgroup hierarchy.
Beyond the permission and delegation based access control, cgroup doesn't
have a mechanism to grant exclusive managerial operations to a specific
application. It's the userspace's responsibility to coordinate these
operations like in most other kernel interfaces.

Thanks.

--
tejun