When removing a cgroup directory, make sure all the csses associated
with the cgroup have been offlined, so that we can be sure the
resources allocated by those csses have been freed by the time rmdir
returns successfully.
Signed-off-by: Hongchen Zhang <[email protected]>
---
kernel/cgroup/cgroup.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e..12d3a14 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3020,6 +3020,31 @@ void cgroup_lock_and_drain_offline(struct cgroup *cgrp)
}
}
+/* wait for all of cgrp's csses to become offline */
+void cgroup_wait_css_offline(struct cgroup *cgrp)
+{
+ struct cgroup_subsys *ss;
+ int ssid;
+
+ lockdep_assert_held(&cgroup_mutex);
+ for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css = cgroup_css(cgrp, ss);
+ DEFINE_WAIT(wait);
+
+ if (!css || !percpu_ref_is_dying(&css->refcnt))
+ continue;
+
+ prepare_to_wait(&cgrp->offline_waitq, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ mutex_unlock(&cgroup_mutex);
+ schedule();
+ finish_wait(&cgrp->offline_waitq, &wait);
+
+ mutex_lock(&cgroup_mutex);
+ }
+}
+
/**
* cgroup_save_control - save control masks and dom_cgrp of a subtree
* @cgrp: root of the target subtree
@@ -5724,6 +5749,7 @@ int cgroup_rmdir(struct kernfs_node *kn)
if (!ret)
TRACE_CGROUP_PATH(rmdir, cgrp);
+ cgroup_wait_css_offline(cgrp);
cgroup_kn_unlock(kn);
return ret;
}
--
1.8.3.1
Hi Hongchen,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on tj-cgroup/for-next]
[also build test WARNING on v5.18 next-20220527]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/intel-lab-lkp/linux/commits/Hongchen-Zhang/cgroup-wait-for-css-offline-when-rmdir/20220527-104105
base: https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-next
config: um-i386_defconfig (https://download.01.org/0day-ci/archive/20220527/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-1) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/9cd51af8f62de826ed76ffb6a63dce3d14926a03
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Hongchen-Zhang/cgroup-wait-for-css-offline-when-rmdir/20220527-104105
git checkout 9cd51af8f62de826ed76ffb6a63dce3d14926a03
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=um SUBARCH=i386 SHELL=/bin/bash kernel/cgroup/
If you fix the issue, kindly add the following tag where applicable
Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
>> kernel/cgroup/cgroup.c:3024:6: warning: no previous prototype for 'cgroup_wait_css_offline' [-Wmissing-prototypes]
3024 | void cgroup_wait_css_offline(struct cgroup *cgrp)
| ^~~~~~~~~~~~~~~~~~~~~~~
vim +/cgroup_wait_css_offline +3024 kernel/cgroup/cgroup.c
3022
3023	/* wait for all of cgrp's csses to become offline */
> 3024 void cgroup_wait_css_offline(struct cgroup *cgrp)
3025 {
3026 struct cgroup_subsys *ss;
3027 int ssid;
3028
3029 lockdep_assert_held(&cgroup_mutex);
3030 for_each_subsys(ss, ssid) {
3031 struct cgroup_subsys_state *css = cgroup_css(cgrp, ss);
3032 DEFINE_WAIT(wait);
3033
3034 if (!css || !percpu_ref_is_dying(&css->refcnt))
3035 continue;
3036
3037 prepare_to_wait(&cgrp->offline_waitq, &wait,
3038 TASK_UNINTERRUPTIBLE);
3039
3040 mutex_unlock(&cgroup_mutex);
3041 schedule();
3042 finish_wait(&cgrp->offline_waitq, &wait);
3043
3044 mutex_lock(&cgroup_mutex);
3045 }
3046 }
3047
--
0-DAY CI Kernel Test Service
https://01.org/lkp
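The warning itself has a conventional fix: cgroup_wait_css_offline() is only
called from kernel/cgroup/cgroup.c in this patch, so the definition could
simply be marked static; otherwise a prototype belongs in a header that
cgroup.c already includes. A rough sketch, assuming cgroup-internal.h is a
suitable home for the declaration (that placement is an assumption, not
something the patch or the report specifies):

/*
 * One way to address -Wmissing-prototypes: declare the new helper in
 * kernel/cgroup/cgroup-internal.h (assumed placement).  Marking the
 * definition in cgroup.c static would work just as well, since the
 * patch adds no callers outside that file.
 */
void cgroup_wait_css_offline(struct cgroup *cgrp);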
On Fri, May 27, 2022 at 10:39:18AM +0800, Hongchen Zhang wrote:
> When removing a cgroup directory, make sure all the csses associated
> with the cgroup have been offlined, so that we can be sure the
> resources allocated by those csses have been freed by the time rmdir
> returns successfully.
Offlining doesn't guarantee that resources are freed and there's no definite
time limit on how long it'd take to free all resources. e.g. for memcg, if
there isn't sufficient memory pressure, its page cache can remain
indefinitely. Is there something practical you're trying to achieve?
Thanks.
--
tejun
On 2022/5/27 4:48 PM, Tejun Heo wrote:
> On Fri, May 27, 2022 at 10:39:18AM +0800, Hongchen Zhang wrote:
>> When removing a cgroup directory, make sure all the csses associated
>> with the cgroup have been offlined, so that we can be sure the
>> resources allocated by those csses have been freed by the time rmdir
>> returns successfully.
>
> Offlining doesn't guarantee that resources are freed and there's no definite
> time limit on how long it'd take to free all resources. e.g. for memcg, if
> there isn't sufficient memory pressure, its page cache can remain
> indefinitely. Is there something practical you're trying to achieve?
>
> Thanks.
>
Hi Tejun,
When I run LTP's memcg_test_3 testcase on an 8-node server, I get an
-ENOMEM error caused by there being no available ID left in
mem_cgroup_idr. The reason is that IDs are allocated from mem_cgroup_idr
faster than they are freed. In this specific case the IDs are allocated
and freed cyclically, so when we rmdir a cgroup directory we could
synchronize with the ID free by waiting for the memcg css to be offlined,
and then the ID can be reused in the next cycle.
Thanks.
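For context, the ID that runs out here comes from a global IDR capped at
MEM_CGROUP_ID_MAX, and an ID is only returned to the pool once the memcg css
has gone offline, which happens asynchronously after rmdir. Roughly (a
paraphrase, not a verbatim copy of mm/memcontrol.c; details vary by kernel
version):

	/*
	 * Paraphrased sketch of the memcg ID allocation: the ID space is
	 * capped at MEM_CGROUP_ID_MAX and is only replenished from the
	 * asynchronous css offline path, so a tight mkdir/rmdir loop can
	 * exhaust it before the offline work catches up.
	 */
	int id = idr_alloc(&mem_cgroup_idr, NULL, 1,
			   MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
	if (id < 0)
		goto fail;	/* mkdir then fails, seen above as -ENOMEM */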
Hello,
On Mon, May 30, 2022 at 09:53:51AM +0800, Hongchen Zhang wrote:
> When I run LTP's memcg_test_3 testcase on an 8-node server, I get an
> -ENOMEM error caused by there being no available ID left in
> mem_cgroup_idr. The reason is that IDs are allocated from mem_cgroup_idr
> faster than they are freed. In this specific case the IDs are allocated
> and freed cyclically, so when we rmdir a cgroup directory we could
> synchronize with the ID free by waiting for the memcg css to be offlined,
> and then the ID can be reused in the next cycle.
This is a micro-benchmark-specific problem and it doesn't make sense to
change the overall behavior for it, as the suggested change is neither
desirable nor logical. Maybe you can just incur the delay only after the
idr allocation fails and then retry?
Thanks.
--
tejun
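A minimal sketch of that retry-only-on-failure idea, on the memcg side;
everything here other than idr_alloc(), synchronize_rcu(), mem_cgroup_idr
and MEM_CGROUP_ID_MAX is a hypothetical name rather than existing kernel
code:

/*
 * Hypothetical sketch: only pay the waiting cost when the ID allocation
 * actually fails, then flush pending offlines and retry once.
 */
static int memcg_id_alloc_with_retry(void)
{
	int id;

	id = idr_alloc(&mem_cgroup_idr, NULL, 1,
		       MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
	if (id >= 0)
		return id;

	/* Let queued RCU callbacks and offline work make progress. */
	synchronize_rcu();
	wait_for_pending_memcg_offlines();	/* hypothetical helper */

	return idr_alloc(&mem_cgroup_idr, NULL, 1,
			 MEM_CGROUP_ID_MAX + 1, GFP_KERNEL);
}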
On Tue, May 31, 2022 at 07:19:31AM -1000, Tejun Heo wrote:
> Maybe just keep sequence numbers for started and completed offline
> operations and, on memcg alloc failure, wait for the completed count to
> reach the started count and retry? Note that we can get live-locked, so we
> have to remember the sequence number to wait for at the beginning. Or, even
> simpler, maybe it'd be enough to just do synchronize_rcu(), then wait on the
> offline waitqueue once, and retry.
Also, given that it's a memcg ID issue, it probably makes sense to
implement it on the memcg side.
Thanks.
--
tejun
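A minimal sketch of the started/completed sequence-counter idea quoted
above, kept on the memcg side; all names are hypothetical:

/* Hypothetical counters bumped by the offline start/finish paths. */
static atomic64_t memcg_offline_started;
static atomic64_t memcg_offline_completed;
static DECLARE_WAIT_QUEUE_HEAD(memcg_offline_waitq);

static void memcg_note_offline_start(void)
{
	atomic64_inc(&memcg_offline_started);
}

static void memcg_note_offline_done(void)
{
	atomic64_inc(&memcg_offline_completed);
	wake_up_all(&memcg_offline_waitq);
}

/*
 * Sample the started counter first and wait only until the completed
 * counter reaches that sample, so offlines that start after the
 * allocation failure cannot live-lock the waiter.
 */
static void memcg_wait_for_offlines(void)
{
	s64 target = atomic64_read(&memcg_offline_started);

	wait_event(memcg_offline_waitq,
		   atomic64_read(&memcg_offline_completed) >= target);
}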
On 2022/5/31 9:01 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, May 30, 2022 at 09:53:51AM +0800, Hongchen Zhang wrote:
>> When I run LTP's memcg_test_3 testcase on an 8-node server, I get an
>> -ENOMEM error caused by there being no available ID left in
>> mem_cgroup_idr. The reason is that IDs are allocated from mem_cgroup_idr
>> faster than they are freed. In this specific case the IDs are allocated
>> and freed cyclically, so when we rmdir a cgroup directory we could
>> synchronize with the ID free by waiting for the memcg css to be offlined,
>> and then the ID can be reused in the next cycle.
>
> This is a micro-benchmark-specific problem and it doesn't make sense to
> change the overall behavior for it, as the suggested change is neither
> desirable nor logical. Maybe you can just incur the delay only after the
> idr allocation fails and then retry?
>
> Thanks.
>
Hi Tejun,
Yes, the problem would disappear if we added some reasonable delay. But I
think we could instead increase MEM_CGROUP_ID_MAX to INT_MAX; then this
particular -ENOMEM error would never occur, even if the system were out of
memory.
Thanks.
Hello,
On Tue, May 31, 2022 at 11:49:53AM +0800, Hongchen Zhang wrote:
> Yes, the problem would disappear if we added some reasonable delay. But I think
It'd be better to wait for some group of operations to complete than to
insert explicit delays.
> we could instead increase MEM_CGROUP_ID_MAX to INT_MAX; then this particular
> -ENOMEM error would never occur, even if the system were out of memory.
Oh, you're hitting the memcg ID limit, not the css one. The memcg ID is
limited so that it doesn't consume too many bits in, I guess, struct page. I
don't think it'd make sense to increase the overall overhead to solve this
rather artificial problem, though.
Maybe just keep sequence numbers for started and completed offline
operations and, on memcg alloc failure, wait for the completed count to
reach the started count and retry? Note that we can get live-locked, so we
have to remember the sequence number to wait for at the beginning. Or, even
simpler, maybe it'd be enough to just do synchronize_rcu(), then wait on the
offline waitqueue once, and retry.
Thanks.
--
tejun