Date: Wed, 15 Jan 2020 20:36:11 -0800
Message-Id: <20200116043612.52782-1-surenb@google.com>
Subject: [PATCH 1/2] cgroup: allow deletion of cgroups containing only dying processes
From: Suren Baghdasaryan
To: surenb@google.com
Cc: tj@kernel.org, lizefan@huawei.com, hannes@cmpxchg.org,
    matthias.bgg@gmail.com, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-mediatek@lists.infradead.org, shuah@kernel.org, guro@fb.com,
    alex.shi@linux.alibaba.com, mkoutny@suse.com,
    linux-kselftest@vger.kernel.org, linger.lee@mediatek.com,
    tomcherry@google.com, kernel-team@android.com
X-Mailing-List: linux-kernel@vger.kernel.org

A cgroup containing only dying tasks will be seen as empty when a
userspace process reads its cgroup.procs or cgroup.tasks files. It
should be safe to delete such a cgroup, as it is considered empty.
However, if one of the dying tasks has not yet reached cgroup_exit, an
attempt to delete the cgroup will fail with EBUSY, because
cgroup_is_populated() will not consider it empty until all tasks reach
cgroup_exit. Such a condition can be triggered when a task consumes
large amounts of memory and spends enough time in exit_mm to create a
delay between the moment it is flagged as PF_EXITING and the moment it
reaches cgroup_exit.

Fix this by detecting cgroups containing only dying tasks during cgroup
destruction and proceeding with it while postponing the final step of
releasing the last reference until the last task reaches cgroup_exit.

Signed-off-by: Suren Baghdasaryan
Reported-by: JeiFeng Lee
Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
---
 include/linux/cgroup-defs.h |  3 ++
 kernel/cgroup/cgroup.c      | 65 +++++++++++++++++++++++++++++++++----
 2 files changed, 61 insertions(+), 7 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 63097cb243cb..f9bcccbac8dd 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -71,6 +71,9 @@ enum {
 
 	/* Cgroup is frozen. */
 	CGRP_FROZEN,
+
+	/* Cgroup is dead. */
+	CGRP_DEAD,
 };
 
 /* cgroup_root->flags */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 735af8f15f95..a99ebddd37d9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -795,10 +795,11 @@ static bool css_set_populated(struct css_set *cset)
  * that the content of the interface file has changed. This can be used to
  * detect when @cgrp and its descendants become populated or empty.
  */
-static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
+static bool cgroup_update_populated(struct cgroup *cgrp, bool populated)
 {
 	struct cgroup *child = NULL;
 	int adj = populated ? 1 : -1;
+	bool state_change = false;
 
 	lockdep_assert_held(&css_set_lock);
 
@@ -817,6 +818,7 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 		if (was_populated == cgroup_is_populated(cgrp))
 			break;
 
+		state_change = true;
 		cgroup1_check_for_release(cgrp);
 		TRACE_CGROUP_PATH(notify_populated, cgrp,
 				  cgroup_is_populated(cgrp));
@@ -825,6 +827,21 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 		child = cgrp;
 		cgrp = cgroup_parent(cgrp);
 	} while (cgrp);
+
+	return state_change;
+}
+
+static void cgroup_prune_dead(struct cgroup *cgrp)
+{
+	lockdep_assert_held(&css_set_lock);
+
+	do {
+		/* put the base reference if cgroup was already destroyed */
+		if (!cgroup_is_populated(cgrp) &&
+		    test_bit(CGRP_DEAD, &cgrp->flags))
+			percpu_ref_kill(&cgrp->self.refcnt);
+		cgrp = cgroup_parent(cgrp);
+	} while (cgrp);
 }
 
 /**
@@ -838,11 +855,15 @@ static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
 static void css_set_update_populated(struct css_set *cset, bool populated)
 {
 	struct cgrp_cset_link *link;
+	bool state_change;
 
 	lockdep_assert_held(&css_set_lock);
 
-	list_for_each_entry(link, &cset->cgrp_links, cgrp_link)
-		cgroup_update_populated(link->cgrp, populated);
+	list_for_each_entry(link, &cset->cgrp_links, cgrp_link) {
+		state_change = cgroup_update_populated(link->cgrp, populated);
+		if (state_change && !populated)
+			cgroup_prune_dead(link->cgrp);
+	}
 }
 
 /*
@@ -5458,8 +5479,26 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 	 * Only migration can raise populated from zero and we're already
 	 * holding cgroup_mutex.
 	 */
-	if (cgroup_is_populated(cgrp))
-		return -EBUSY;
+	if (cgroup_is_populated(cgrp)) {
+		struct css_task_iter it;
+		struct task_struct *task;
+
+		/*
+		 * cgroup_is_populated does not account for exiting tasks
+		 * that did not reach cgroup_exit yet. Check if all the tasks
+		 * in this cgroup are exiting.
+		 */
+		css_task_iter_start(&cgrp->self, 0, &it);
+		do {
+			task = css_task_iter_next(&it);
+		} while (task && (task->flags & PF_EXITING));
+		css_task_iter_end(&it);
+
+		if (task) {
+			/* cgroup is indeed populated */
+			return -EBUSY;
+		}
+	}
 
 	/*
 	 * Make sure there's no live children. We can't test emptiness of
@@ -5510,8 +5549,20 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
 
 	cgroup_bpf_offline(cgrp);
 
-	/* put the base reference */
-	percpu_ref_kill(&cgrp->self.refcnt);
+	/*
+	 * Take css_set_lock because of the possible race with
+	 * cgroup_update_populated.
+	 */
+	spin_lock_irq(&css_set_lock);
+	/* The last task might have died since we last checked */
+	if (cgroup_is_populated(cgrp)) {
+		/* mark cgroup for future destruction */
+		set_bit(CGRP_DEAD, &cgrp->flags);
+	} else {
+		/* put the base reference */
+		percpu_ref_kill(&cgrp->self.refcnt);
+	}
+	spin_unlock_irq(&css_set_lock);
 
 	return 0;
 };
-- 
2.25.0.rc1.283.g88dfdc4193-goog
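
The deferred-release logic described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: every `model_*` name is hypothetical, a flat task array stands in for the css_set links, and the hierarchy walk, css_set_lock, and the real percpu_ref machinery are left out. It shows destruction failing with EBUSY while a live task remains, succeeding once all remaining tasks are exiting, and the final reference being dropped only when the last task passes its exit point.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for kernel state; not kernel API. */
struct model_task {
	bool exiting;	/* models PF_EXITING being set */
	bool exited;	/* models the task having passed cgroup_exit */
};

struct model_cgroup {
	struct model_task *tasks;
	size_t nr_tasks;
	bool dead;		/* models the CGRP_DEAD flag */
	bool ref_killed;	/* models percpu_ref_kill() having run */
};

/* Models cgroup_is_populated(): true until every task has exited. */
static bool model_is_populated(const struct model_cgroup *cg)
{
	for (size_t i = 0; i < cg->nr_tasks; i++)
		if (!cg->tasks[i].exited)
			return true;
	return false;
}

/* Models the patched destroy path: only a live task blocks removal. */
static int model_destroy(struct model_cgroup *cg)
{
	for (size_t i = 0; i < cg->nr_tasks; i++) {
		const struct model_task *t = &cg->tasks[i];

		if (!t->exited && !t->exiting)
			return -EBUSY;	/* cgroup is indeed populated */
	}

	if (model_is_populated(cg))
		cg->dead = true;	/* defer the final reference drop */
	else
		cg->ref_killed = true;	/* nothing left, drop it now */
	return 0;
}

/* Models a task reaching its exit point; prunes a dead, empty cgroup. */
static void model_task_exit(struct model_cgroup *cg, size_t idx)
{
	bool was_populated = model_is_populated(cg);

	cg->tasks[idx].exited = true;
	if (was_populated && !model_is_populated(cg) && cg->dead)
		cg->ref_killed = true;
}

/* Walks through the scenario from the commit message; returns 0. */
int model_demo(void)
{
	struct model_task tasks[2] = {
		{ .exiting = true,  .exited = false },
		{ .exiting = false, .exited = false },
	};
	struct model_cgroup cg = { .tasks = tasks, .nr_tasks = 2 };

	/* One task is still live, so removal must fail. */
	assert(model_destroy(&cg) == -EBUSY);

	/* Once every task is exiting, removal succeeds but is deferred. */
	tasks[1].exiting = true;
	assert(model_destroy(&cg) == 0);
	assert(cg.dead && !cg.ref_killed);

	/* The reference is dropped only when the last task has exited. */
	model_task_exit(&cg, 0);
	assert(!cg.ref_killed);
	model_task_exit(&cg, 1);
	assert(cg.ref_killed);
	return 0;
}
```

Calling `model_demo()` from a `main()` runs the whole scenario. The model keeps the "is everything exiting?" check and the "mark dead or release now" decision in one function, mirroring how the patch places both in cgroup_destroy_locked(); in the kernel the second step additionally needs css_set_lock to avoid racing with cgroup_update_populated().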