2012-10-26 11:38:01

by Michal Hocko

Subject: memcg/cgroup: do not fail on pre_destroy callbacks

Hi,
memcg is the only controller which might fail in its pre_destroy
callback which makes the cgroup core more complicated for no good
reason. This is an attempt to change this unfortunate state.

I have previously posted this as an RFC https://lkml.org/lkml/2012/10/17/246
and the feedback was mostly positive. Nobody seems to see any issues with
the approach so let's move on from the RFC. The patchset still needs a
good portion of testing and I am working on it. I would also like to see
some Acks ;)
The patchset is posted as v3 because some of the patches went through two
revisions during the RFC.

The first two patches are just clean ups. They could be merged even
without the rest.

The real change, although the code is not changed that much, is the 3rd
patch. It changes the way we handle mem_cgroup_move_parent failures. We
have to realize that all those failures are *temporary*: we are either
racing with the page removal or the page is temporarily off the LRU
because of migration or global reclaim. As a result we no longer fail
mem_cgroup_force_empty_list if the page cannot be moved to the parent
and instead retry until the LRU is empty.
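
To illustrate, the per-LRU loop in mem_cgroup_force_empty_list turns
into an unconditional retry (a condensed sketch of patch 3 below; the
zone lru_lock handling and the page_cgroup lookup are elided):

	busy = NULL;
	do {
		/* take the page from the tail of the LRU list ... */
		if (mem_cgroup_move_parent(page, pc, memcg)) {
			/* the failure is only temporary so retry */
			busy = page;
			cond_resched();
		} else
			busy = NULL;
	} while (!list_empty(list));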

The 4th patch is for the cgroup core. I have moved cgroup_call_pre_destroy
to after the point where the css references are frozen and the group is
marked as removed, which means that all css_tryget calls will fail, no
new task can attach to the group and no new child group can be added.
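
Roughly, the tail of cgroup_rmdir then ends up looking like this (a
condensed sketch of patch 4 below; error paths and the rmdir waitqueue
handling are elided):

	/*
	 * The group has been marked removed while cgroup_mutex was
	 * held, so css_tryget fails and no new task or child group
	 * can show up anymore.
	 */
	mutex_unlock(&cgroup_mutex);

	/*
	 * pre_destroy callbacks now run against a group which is
	 * already dead for everybody else, so they cannot fail.
	 */
	cgroup_call_pre_destroy(cgrp);
	return 0;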

Tejun is planning to build on top of that and make some more cleanups
in the cgroup core (namely get rid of the whole retry code in
cgroup_rmdir).
This makes an unfortunate inter-tree dependency between Andrew's and
Tejun's trees, therefore I have based all the work on the 3.6 kernel so
that it can be merged into Tejun's cgroup tree as well as into the -mm
git tree (Andrew will see all the changes from linux-next). I do not
like to push memcg changes through any tree other than Andrew's, but
this seems to be easier as other cgroup changes will probably depend on
Tejun's cleanups. Is everybody OK with this?

The last two patches are trivial follow-ups to the cgroup core change:
now that we know that nobody will interfere with us, we can drop the
no-task && no-children checks.

See the specific patches for the changelogs.

Michal Hocko (6):
memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts
memcg: root_cgroup cannot reach mem_cgroup_move_parent
memcg: Simplify mem_cgroup_force_empty_list error handling
cgroups: forbid pre_destroy callback to fail
memcg: make mem_cgroup_reparent_charges non failing
hugetlb: do not fail in hugetlb_cgroup_pre_destroy

Cumulative diffstat:
kernel/cgroup.c | 30 ++++-------
mm/hugetlb_cgroup.c | 11 ++--
mm/memcontrol.c | 148 ++++++++++++++++++++++++++++++---------------------
3 files changed, 99 insertions(+), 90 deletions(-)


2012-10-26 11:38:09

by Michal Hocko

Subject: [PATCH v3 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent

The root cgroup cannot be destroyed so we never hit it down the
mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't
even try to do anything if called for the root.

This means that mem_cgroup_move_parent doesn't have to bother with the
root cgroup and it can assume it can always move charges upwards.

Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 07d92b8..916132a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2715,9 +2715,7 @@ static int mem_cgroup_move_parent(struct page *page,
unsigned long uninitialized_var(flags);
int ret;

- /* Is ROOT ? */
- if (mem_cgroup_is_root(child))
- return -EINVAL;
+ VM_BUG_ON(mem_cgroup_is_root(child));

ret = -EBUSY;
if (!get_page_unless_zero(page))
@@ -3823,6 +3821,8 @@ static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
int ret;

+ if (mem_cgroup_is_root(memcg))
+ return -EINVAL;
css_get(&memcg->css);
ret = mem_cgroup_force_empty(memcg);
css_put(&memcg->css);
--
1.7.10.4

2012-10-26 11:38:07

by Michal Hocko

Subject: [PATCH v3 1/6] memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts

mem_cgroup_force_empty did two separate things depending on the free_all
parameter from the very beginning. It either reclaimed as many pages as
possible and moved the rest to the parent, or just moved the charges to
the parent. The first variant is used as the memory.force_empty callback
while the latter is used from mem_cgroup_pre_destroy.

The games played around the gotos are far from nice and there is no
reason to keep those two functions inside one. Let's split them and
also move the responsibility for css reference counting to their
callers to make the code easier.

This patch doesn't have any functional changes.

Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 72 ++++++++++++++++++++++++++++++++-----------------------
1 file changed, 42 insertions(+), 30 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 795e525..07d92b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3739,27 +3739,21 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
}

/*
- * make mem_cgroup's charge to be 0 if there is no task.
+ * make mem_cgroup's charge to be 0 if there is no task by moving
+ * all the charges and pages to the parent.
* This enables deleting this mem_cgroup.
+ *
+ * Caller is responsible for holding css reference on the memcg.
*/
-static int mem_cgroup_force_empty(struct mem_cgroup *memcg, bool free_all)
+static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
- int ret;
- int node, zid, shrink;
- int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
struct cgroup *cgrp = memcg->css.cgroup;
+ int node, zid;
+ int ret;

- css_get(&memcg->css);
-
- shrink = 0;
- /* should free all ? */
- if (free_all)
- goto try_to_free;
-move_account:
do {
- ret = -EBUSY;
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
- goto out;
+ return -EBUSY;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
drain_all_stock_sync(memcg);
@@ -3783,27 +3777,34 @@ move_account:
cond_resched();
/* "ret" should also be checked to ensure all lists are empty. */
} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);
-out:
- css_put(&memcg->css);
+
return ret;
+}
+
+/*
+ * Reclaims as many pages from the given memcg as possible and moves
+ * the rest to the parent.
+ *
+ * Caller is responsible for holding css reference for memcg.
+ */
+static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
+{
+ int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+ struct cgroup *cgrp = memcg->css.cgroup;

-try_to_free:
/* returns EBUSY if there is a task or if we come here twice. */
- if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children) || shrink) {
- ret = -EBUSY;
- goto out;
- }
+ if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
+ return -EBUSY;
+
/* we call try-to-free pages for make this cgroup empty */
lru_add_drain_all();
/* try to free all pages in this cgroup */
- shrink = 1;
while (nr_retries && res_counter_read_u64(&memcg->res, RES_USAGE) > 0) {
int progress;

- if (signal_pending(current)) {
- ret = -EINTR;
- goto out;
- }
+ if (signal_pending(current))
+ return -EINTR;
+
progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL,
false);
if (!progress) {
@@ -3814,13 +3815,19 @@ try_to_free:

}
lru_add_drain();
- /* try move_account...there may be some *locked* pages. */
- goto move_account;
+ return mem_cgroup_reparent_charges(memcg);
}

static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
{
- return mem_cgroup_force_empty(mem_cgroup_from_cont(cont), true);
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ int ret;
+
+ css_get(&memcg->css);
+ ret = mem_cgroup_force_empty(memcg);
+ css_put(&memcg->css);
+
+ return ret;
}


@@ -5003,8 +5010,13 @@ free_out:
static int mem_cgroup_pre_destroy(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
+ int ret;

- return mem_cgroup_force_empty(memcg, false);
+ css_get(&memcg->css);
+ ret = mem_cgroup_reparent_charges(memcg);
+ css_put(&memcg->css);
+
+ return ret;
}

static void mem_cgroup_destroy(struct cgroup *cont)
--
1.7.10.4

2012-10-26 11:38:30

by Michal Hocko

Subject: [PATCH v3 6/6] hugetlb: do not fail in hugetlb_cgroup_pre_destroy

Now that pre_destroy callbacks are called from a context where no task
can attach to the group and no child group can be added, there is no
way left for hugetlb_cgroup_pre_destroy to fail.

Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
mm/hugetlb_cgroup.c | 11 +++--------
1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index a3f358f..dc595c6 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -159,14 +159,9 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
{
struct hstate *h;
struct page *page;
- int ret = 0, idx = 0;
+ int idx = 0;

do {
- if (cgroup_task_count(cgroup) ||
- !list_empty(&cgroup->children)) {
- ret = -EBUSY;
- goto out;
- }
for_each_hstate(h) {
spin_lock(&hugetlb_lock);
list_for_each_entry(page, &h->hugepage_activelist, lru)
@@ -177,8 +172,8 @@ static int hugetlb_cgroup_pre_destroy(struct cgroup *cgroup)
}
cond_resched();
} while (hugetlb_cgroup_have_usage(cgroup));
-out:
- return ret;
+
+ return 0;
}

int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
--
1.7.10.4

2012-10-26 11:38:50

by Michal Hocko

Subject: [PATCH v3 5/6] memcg: make mem_cgroup_reparent_charges non failing

Now that pre_destroy callbacks are called from a context where no task
can attach to the group and no child group can be added, there is no
way left for mem_cgroup_pre_destroy to fail.
mem_cgroup_pre_destroy also doesn't have to take a reference to the
memcg's css because all css' are marked dead already.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5a1d584..34284b8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3763,14 +3763,12 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
*
* Caller is responsible for holding css reference on the memcg.
*/
-static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
struct cgroup *cgrp = memcg->css.cgroup;
int node, zid;

do {
- if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
- return -EBUSY;
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
drain_all_stock_sync(memcg);
@@ -3796,8 +3794,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
* charge before adding to the LRU.
*/
} while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0);
-
- return 0;
}

/*
@@ -3834,7 +3830,9 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)

}
lru_add_drain();
- return mem_cgroup_reparent_charges(memcg);
+ mem_cgroup_reparent_charges(memcg);
+
+ return 0;
}

static int mem_cgroup_force_empty_write(struct cgroup *cont, unsigned int event)
@@ -5031,13 +5029,9 @@ free_out:
static int mem_cgroup_pre_destroy(struct cgroup *cont)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cont);
- int ret;

- css_get(&memcg->css);
- ret = mem_cgroup_reparent_charges(memcg);
- css_put(&memcg->css);
-
- return ret;
+ mem_cgroup_reparent_charges(memcg);
+ return 0;
}

static void mem_cgroup_destroy(struct cgroup *cont)
--
1.7.10.4

2012-10-26 11:39:29

by Michal Hocko

Subject: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

mem_cgroup_force_empty_list currently tries to remove all pages from
the given LRU. To work around temporary failures (EBUSY returned by
mem_cgroup_move_parent) it uses a margin over the current number of LRU
pages and returns true if there are still some pages left on the list.

If we consider that mem_cgroup_move_parent fails only when it is racing
with somebody else removing (uncharging) the page or when the page is
being migrated, then it is obvious that all those failures are only
temporary and so we can safely retry later.
Let's get rid of the safety margin and make the loop really wait for
the empty LRU. The caller should still make sure that all charges have
been removed from the res_counter because mem_cgroup_replace_page_cache
might add a page to the LRU after the list_empty check (it doesn't touch
the res_counter though).
This catches most of the cases except for shmem, which might call
mem_cgroup_replace_page_cache with a page which is not charged and on
the LRU yet, but this was the case also without this patch. In order to
fix this we need a guarantee that try_get_mem_cgroup_from_page falls
back to the current mm's cgroup, so we need css_tryget to fail. This
will be fixed up in a later patch because it needs help from the cgroup
core (pre_destroy has to be called after the css is cleared).

Although mem_cgroup_pre_destroy can still fail (if a new task or a new
sub-group appears) there is no reason to retry the pre_destroy callback
from the cgroup core. This means that __DEPRECATED_clear_css_refs has
lost its meaning and can be removed.

Changes since v2
- remove __DEPRECATED_clear_css_refs

Changes since v1
- use kerndoc
- be more specific about mem_cgroup_move_parent possible failures

Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
---
mm/memcontrol.c | 76 +++++++++++++++++++++++++++++++++++--------------------
1 file changed, 48 insertions(+), 28 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 916132a..5a1d584 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2702,10 +2702,27 @@ out:
return ret;
}

-/*
- * move charges to its parent.
+/**
+ * mem_cgroup_move_parent - moves page to the parent group
+ * @page: the page to move
+ * @pc: page_cgroup of the page
+ * @child: page's cgroup
+ *
+ * move charges to its parent or the root cgroup if the group has no
+ * parent (aka use_hierarchy==0).
+ * Although this might fail (get_page_unless_zero, isolate_lru_page or
+ * mem_cgroup_move_account fails) the failure is always temporary and
+ * it signals a race with a page removal/uncharge or migration. In the
+ * first case the page is on the way out and it will vanish from the LRU
+ * on the next attempt and the call should be retried later.
+ * Isolation from the LRU fails only if page has been isolated from
+ * the LRU since we looked at it and that usually means either global
+ * reclaim or migration going on. The page will either get back to the
+ * LRU or vanish.
+ * Finally mem_cgroup_move_account fails only if the page got uncharged
+ * (!PageCgroupUsed) or moved to a different group. The page will
+ * disappear in the next attempt.
*/
-
static int mem_cgroup_move_parent(struct page *page,
struct page_cgroup *pc,
struct mem_cgroup *child)
@@ -2732,8 +2749,10 @@ static int mem_cgroup_move_parent(struct page *page,
if (!parent)
parent = root_mem_cgroup;

- if (nr_pages > 1)
+ if (nr_pages > 1) {
+ VM_BUG_ON(!PageTransHuge(page));
flags = compound_lock_irqsave(page);
+ }

ret = mem_cgroup_move_account(page, nr_pages,
pc, child, parent);
@@ -3683,17 +3702,22 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
return nr_reclaimed;
}

-/*
+/**
+ * mem_cgroup_force_empty_list - clears LRU of a group
+ * @memcg: group to clear
+ * @node: NUMA node
+ * @zid: zone id
+ * @lru: lru to clear
+ *
* Traverse a specified page_cgroup list and try to drop them all. This doesn't
- * reclaim the pages page themselves - it just removes the page_cgroups.
- * Returns true if some page_cgroups were not freed, indicating that the caller
- * must retry this operation.
+ * reclaim the pages themselves - pages are moved to the parent (or root)
+ * group.
*/
-static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
+static void mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
int node, int zid, enum lru_list lru)
{
struct mem_cgroup_per_zone *mz;
- unsigned long flags, loop;
+ unsigned long flags;
struct list_head *list;
struct page *busy;
struct zone *zone;
@@ -3702,11 +3726,8 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
mz = mem_cgroup_zoneinfo(memcg, node, zid);
list = &mz->lruvec.lists[lru];

- loop = mz->lru_size[lru];
- /* give some margin against EBUSY etc...*/
- loop += 256;
busy = NULL;
- while (loop--) {
+ do {
struct page_cgroup *pc;
struct page *page;

@@ -3732,8 +3753,7 @@ static bool mem_cgroup_force_empty_list(struct mem_cgroup *memcg,
cond_resched();
} else
busy = NULL;
- }
- return !list_empty(list);
+ } while (!list_empty(list));
}

/*
@@ -3747,7 +3767,6 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
{
struct cgroup *cgrp = memcg->css.cgroup;
int node, zid;
- int ret;

do {
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
@@ -3755,28 +3774,30 @@ static int mem_cgroup_reparent_charges(struct mem_cgroup *memcg)
/* This is for making all *used* pages to be on LRU. */
lru_add_drain_all();
drain_all_stock_sync(memcg);
- ret = 0;
mem_cgroup_start_move(memcg);
for_each_node_state(node, N_HIGH_MEMORY) {
- for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
enum lru_list lru;
for_each_lru(lru) {
- ret = mem_cgroup_force_empty_list(memcg,
+ mem_cgroup_force_empty_list(memcg,
node, zid, lru);
- if (ret)
- break;
}
}
- if (ret)
- break;
}
mem_cgroup_end_move(memcg);
memcg_oom_recover(memcg);
cond_resched();
- /* "ret" should also be checked to ensure all lists are empty. */
- } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0 || ret);

- return ret;
+ /*
+ * This is a safety check because mem_cgroup_force_empty_list
+ * could have raced with mem_cgroup_replace_page_cache callers
+ * so the lru seemed empty but the page could have been added
+ * right after the check. RES_USAGE should be safe as we always
+ * charge before adding to the LRU.
+ */
+ } while (res_counter_read_u64(&memcg->res, RES_USAGE) > 0);
+
+ return 0;
}

/*
@@ -5618,7 +5639,6 @@ struct cgroup_subsys mem_cgroup_subsys = {
.base_cftypes = mem_cgroup_files,
.early_init = 0,
.use_id = 1,
- .__DEPRECATED_clear_css_refs = true,
};

#ifdef CONFIG_MEMCG_SWAP
--
1.7.10.4

2012-10-26 11:39:28

by Michal Hocko

Subject: [PATCH v3 4/6] cgroups: forbid pre_destroy callback to fail

Now that the mem_cgroup_pre_destroy callback doesn't fail (other than
when racing with a task attach or a child group creation) we can
finally safely move on and forbid all the callbacks to fail.
The last missing piece is moving cgroup_call_pre_destroy after
cgroup_clear_css_refs so that css_tryget fails and no new charges for
the memcg can happen.
We cannot, however, call cgroup_call_pre_destroy right at that point
because we cannot call mem_cgroup_pre_destroy with the cgroup_lock held
(see 3fa59dfb "cgroup: fix potential deadlock in pre_destroy") so we
have to move it to after the lock is released.

Changes since v1
- Li Zefan pointed out that mem_cgroup_pre_destroy cannot be called with
cgroup_lock held

Signed-off-by: Michal Hocko <[email protected]>
---
kernel/cgroup.c | 30 +++++++++---------------------
1 file changed, 9 insertions(+), 21 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 7981850..e9e2287b 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -855,7 +855,7 @@ static struct inode *cgroup_new_inode(umode_t mode, struct super_block *sb)
* Call subsys's pre_destroy handler.
* This is called before css refcnt check.
*/
-static int cgroup_call_pre_destroy(struct cgroup *cgrp)
+static void cgroup_call_pre_destroy(struct cgroup *cgrp)
{
struct cgroup_subsys *ss;
int ret = 0;
@@ -864,15 +864,8 @@ static int cgroup_call_pre_destroy(struct cgroup *cgrp)
if (!ss->pre_destroy)
continue;

- ret = ss->pre_destroy(cgrp);
- if (ret) {
- /* ->pre_destroy() failure is being deprecated */
- WARN_ON_ONCE(!ss->__DEPRECATED_clear_css_refs);
- break;
- }
+ BUG_ON(ss->pre_destroy(cgrp));
}
-
- return ret;
}

static void cgroup_diput(struct dentry *dentry, struct inode *inode)
@@ -4151,7 +4144,6 @@ again:
mutex_unlock(&cgroup_mutex);
return -EBUSY;
}
- mutex_unlock(&cgroup_mutex);

/*
* In general, subsystem has no css->refcnt after pre_destroy(). But
@@ -4164,17 +4156,6 @@ again:
*/
set_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);

- /*
- * Call pre_destroy handlers of subsys. Notify subsystems
- * that rmdir() request comes.
- */
- ret = cgroup_call_pre_destroy(cgrp);
- if (ret) {
- clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
- return ret;
- }
-
- mutex_lock(&cgroup_mutex);
parent = cgrp->parent;
if (atomic_read(&cgrp->count) || !list_empty(&cgrp->children)) {
clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
@@ -4196,6 +4177,7 @@ again:
return -EINTR;
goto again;
}
+
/* NO css_tryget() can success after here. */
finish_wait(&cgroup_rmdir_waitq, &wait);
clear_bit(CGRP_WAIT_ON_RMDIR, &cgrp->flags);
@@ -4234,6 +4216,12 @@ again:
spin_unlock(&cgrp->event_list_lock);

mutex_unlock(&cgroup_mutex);
+
+ /*
+ * Call pre_destroy handlers of subsys. Notify subsystems
+ * that rmdir() request comes.
+ */
+ cgroup_call_pre_destroy(cgrp);
return 0;
}

--
1.7.10.4

2012-10-29 13:45:37

by Glauber Costa

Subject: Re: [PATCH v3 1/6] memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts

On 10/26/2012 03:37 PM, Michal Hocko wrote:
> mem_cgroup_force_empty did two separate things depending on the free_all
> parameter from the very beginning. It either reclaimed as many pages as
> possible and moved the rest to the parent, or just moved the charges to
> the parent. The first variant is used as the memory.force_empty callback
> while the latter is used from mem_cgroup_pre_destroy.
>
> The games played around the gotos are far from nice and there is no
> reason to keep those two functions inside one. Let's split them and
> also move the responsibility for css reference counting to their
> callers to make the code easier.
>
> This patch doesn't have any functional changes.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>

Reviewed-by: Glauber Costa <[email protected]>

2012-10-29 13:48:15

by Glauber Costa

Subject: Re: [PATCH v3 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent

On 10/26/2012 03:37 PM, Michal Hocko wrote:
> The root cgroup cannot be destroyed so we never hit it down the
> mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't
> even try to do anything if called for the root.
>
> This means that mem_cgroup_move_parent doesn't have to bother with the
> root cgroup and it can assume it can always move charges upwards.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>

I think it would be safer to have this folded into the last patch, to
avoid a weird intermediate state (especially for force_empty). Being a
single statement, it doesn't confuse review so much.

However, this is also pretty much just a nitpick, do as you prefer.

Reviewed-by: Glauber Costa <[email protected]>

2012-10-29 13:52:09

by Michal Hocko

Subject: Re: [PATCH v3 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent

On Mon 29-10-12 17:48:00, Glauber Costa wrote:
> On 10/26/2012 03:37 PM, Michal Hocko wrote:
> > The root cgroup cannot be destroyed so we never hit it down the
> > mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't
> > even try to do anything if called for the root.
> >
> > This means that mem_cgroup_move_parent doesn't have to bother with the
> > root cgroup and it can assume it can always move charges upwards.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > Reviewed-by: Tejun Heo <[email protected]>
>
> I think it would be safer to have this folded into the last patch, to
> avoid a weird intermediate state (especially for force_empty).

force_empty excludes the root cgroup explicitly, so there is no way to
fail here. I have kept the VM_BUG_ON for future reference but it can
also go away completely.

> Being a single statement, it doesn't confuse review so much.
>
> However, this is also pretty much just a nitpick, do as you prefer.
>
> Reviewed-by: Glauber Costa <[email protected]>
>

--
Michal Hocko
SUSE Labs

2012-10-29 13:59:00

by Glauber Costa

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling


>
> Changes since v1
> - use kerndoc
> - be more specific about mem_cgroup_move_parent possible failures
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>
Reviewed-by: Glauber Costa <[email protected]>

> + * move charges to its parent or the root cgroup if the group has no
> + * parent (aka use_hierarchy==0).
> + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> + * mem_cgroup_move_account fails) the failure is always temporary and
> + * it signals a race with a page removal/uncharge or migration. In the
> + * first case the page is on the way out and it will vanish from the LRU
> + * on the next attempt and the call should be retried later.
> + * Isolation from the LRU fails only if page has been isolated from
> + * the LRU since we looked at it and that usually means either global
> + * reclaim or migration going on. The page will either get back to the
> + * LRU or vanish.

I just wonder for how long can it go in the worst case?

2012-10-29 14:04:28

by Glauber Costa

Subject: Re: [PATCH v3 4/6] cgroups: forbid pre_destroy callback to fail

On 10/26/2012 03:37 PM, Michal Hocko wrote:
> Now that the mem_cgroup_pre_destroy callback doesn't fail (other than
> when racing with a task attach or a child group creation) we can
> finally safely move on and forbid all the callbacks to fail.
> The last missing piece is moving cgroup_call_pre_destroy after
> cgroup_clear_css_refs so that css_tryget fails and no new charges for
> the memcg can happen.
> We cannot, however, call cgroup_call_pre_destroy right at that point
> because we cannot call mem_cgroup_pre_destroy with the cgroup_lock held
> (see 3fa59dfb "cgroup: fix potential deadlock in pre_destroy") so we
> have to move it to after the lock is released.
>

If we don't have the cgroup lock held, how safe is the following
statement in mem_cgroup_reparent_charges():

if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
return -EBUSY;

?

IIUC, although this is not generally safe, it would be safe here
because at this point we are expected to have already set the removed
bit in the css. If that is the case, however, this condition can never
trigger and becomes useless - in which case you may want to remove it
from patch 1.

2012-10-29 14:06:48

by Glauber Costa

Subject: Re: [PATCH v3 4/6] cgroups: forbid pre_destroy callback to fail

On 10/29/2012 06:04 PM, Glauber Costa wrote:
> On 10/26/2012 03:37 PM, Michal Hocko wrote:
>> Now that the mem_cgroup_pre_destroy callback doesn't fail (other than
>> when racing with a task attach or a child group creation) we can
>> finally safely move on and forbid all the callbacks to fail.
>> The last missing piece is moving cgroup_call_pre_destroy after
>> cgroup_clear_css_refs so that css_tryget fails and no new charges for
>> the memcg can happen.
>> We cannot, however, call cgroup_call_pre_destroy right at that point
>> because we cannot call mem_cgroup_pre_destroy with the cgroup_lock held
>> (see 3fa59dfb "cgroup: fix potential deadlock in pre_destroy") so we
>> have to move it to after the lock is released.
>>
>
> If we don't have the cgroup lock held, how safe is the following
> statement in mem_cgroup_reparent_charges():
>
> if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> return -EBUSY;
>
> ?
>
> IIUC, although this is not generally safe, it would be safe here
> because at this point we are expected to have already set the removed
> bit in the css. If that is the case, however, this condition can never
> trigger and becomes useless - in which case you may want to remove it
> from patch 1.
>
Which I just saw you doing in patch5... =)


2012-10-29 14:07:50

by Glauber Costa

Subject: Re: [PATCH v3 5/6] memcg: make mem_cgroup_reparent_charges non failing

On 10/26/2012 03:37 PM, Michal Hocko wrote:
> Now that pre_destroy callbacks are called from a context where no task
> can attach to the group and no child group can be added, there is no
> way left for mem_cgroup_pre_destroy to fail.
> mem_cgroup_pre_destroy also doesn't have to take a reference to the
> memcg's css because all css' are marked dead already.
>
> Signed-off-by: Michal Hocko <[email protected]>

Good change.

Reviewed-by: Glauber Costa <[email protected]>

2012-10-29 14:08:58

by Glauber Costa

Subject: Re: [PATCH v3 6/6] hugetlb: do not fail in hugetlb_cgroup_pre_destroy

On 10/26/2012 03:37 PM, Michal Hocko wrote:
> Now that pre_destroy callbacks are called from a context where no task
> can attach to the group and no child group can be added, there is no
> way left for hugetlb_cgroup_pre_destroy to fail.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>

Same as Patch5:
Reviewed-by: Glauber Costa <[email protected]>

2012-10-29 14:15:41

by Michal Hocko

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

On Mon 29-10-12 17:58:45, Glauber Costa wrote:
>
> >
> > Changes since v1
> > - use kerndoc
> > - be more specific about mem_cgroup_move_parent possible failures
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > Reviewed-by: Tejun Heo <[email protected]>
> Reviewed-by: Glauber Costa <[email protected]>

Thanks!

> > + * move charges to its parent or the root cgroup if the group has no
> > + * parent (aka use_hierarchy==0).
> > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > + * mem_cgroup_move_account fails) the failure is always temporary and
> > + * it signals a race with a page removal/uncharge or migration. In the
> > + * first case the page is on the way out and it will vanish from the LRU
> > + * on the next attempt and the call should be retried later.
> > + * Isolation from the LRU fails only if page has been isolated from
> > + * the LRU since we looked at it and that usually means either global
> > + * reclaim or migration going on. The page will either get back to the
> > + * LRU or vanish.
>
> I just wonder for how long can it go in the worst case?

That's a good question and to be honest I have no idea. The point is
that it will terminate eventually and that the group is on its way out,
so the time to complete the removal is not a big deal IMHO. We had
basically the same situation previously, when we would need to repeat
the rmdir loop on EBUSY. The only change is that we do not have to
retry anymore.

So the key point is to check whether my assumption about the failures
being temporary is correct and that we cannot block the rest of the
kernel/userspace from proceeding while we are waiting for finalization.
I believe this is true but... (famous last words?)

--
Michal Hocko
SUSE Labs

2012-10-29 14:17:58

by Michal Hocko

Subject: Re: [PATCH v3 4/6] cgroups: forbid pre_destroy callback to fail

On Mon 29-10-12 18:06:34, Glauber Costa wrote:
> On 10/29/2012 06:04 PM, Glauber Costa wrote:
> > On 10/26/2012 03:37 PM, Michal Hocko wrote:
> >> Now that the mem_cgroup_pre_destroy callback doesn't fail (other than
> >> when racing with a task attach or a child group creation) we can
> >> finally safely move on and forbid all the callbacks to fail.
> >> The last missing piece is moving cgroup_call_pre_destroy after
> >> cgroup_clear_css_refs so that css_tryget fails and no new charges for
> >> the memcg can happen.
> >> We cannot, however, call cgroup_call_pre_destroy right at that point
> >> because we cannot call mem_cgroup_pre_destroy with the cgroup_lock held
> >> (see 3fa59dfb "cgroup: fix potential deadlock in pre_destroy") so we
> >> have to move it to after the lock is released.
> >>
> >
> > If we don't have the cgroup lock held, how safe is the following
> > statement in mem_cgroup_reparent_charges():
> >
> > if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> > return -EBUSY;
> >
> > ?
> >
> > IIUC, although this is not generally safe, it would be safe here
> > because at this point we are expected to have already set the removed
> > bit in the css. If that is the case, however, this condition can never
> > trigger and becomes useless - in which case you may want to remove it
> > from patch 1.
> >
> Which I just saw you doing in patch5... =)

Yes, I just wanted to keep this one cgroup-core only, to make further
cgroup clean ups easier. Dropping the check earlier in the series could
introduce regressions, which I tried to avoid as much as possible.

Thanks

--
Michal Hocko
SUSE Labs

2012-10-29 15:09:48

by Glauber Costa

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

>>> + * move charges to its parent or the root cgroup if the group has no
>>> + * parent (aka use_hierarchy==0).
>>> + * Although this might fail (get_page_unless_zero, isolate_lru_page or
>>> + * mem_cgroup_move_account fails) the failure is always temporary and
>>> + * it signals a race with a page removal/uncharge or migration. In the
>>> + * first case the page is on the way out and it will vanish from the LRU
>>> + * on the next attempt and the call should be retried later.
>>> + * Isolation from the LRU fails only if page has been isolated from
>>> + * the LRU since we looked at it and that usually means either global
>>> + * reclaim or migration going on. The page will either get back to the
>>> + * LRU or vanish.
>>
>> I just wonder for how long can it go in the worst case?
>
> That's a good question and to be honest I have no idea. The point is
> that it will terminate eventually and that the group is on its way out,
> so the time to complete the removal is not a big deal IMHO. We had
> basically the same situation previously, when we would need to repeat
> the rmdir loop on EBUSY. The only change is that we do not have to
> retry anymore.
>
> So the key point is to check whether my assumption about the failures
> being temporary is correct and that we cannot block the rest of the
> kernel/userspace from proceeding while we are waiting for finalization.
> I believe this is true but... (famous last words?)
>
At least for me, it seems that this will hold.

2012-10-29 22:00:27

by Andrew Morton

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

On Mon, 29 Oct 2012 17:58:45 +0400
Glauber Costa <[email protected]> wrote:

> > + * move charges to its parent or the root cgroup if the group has no
> > + * parent (aka use_hierarchy==0).
> > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > + * mem_cgroup_move_account fails) the failure is always temporary and
> > + * it signals a race with a page removal/uncharge or migration. In the
> > + * first case the page is on the way out and it will vanish from the LRU
> > + * on the next attempt and the call should be retried later.
> > + * Isolation from the LRU fails only if page has been isolated from
> > + * the LRU since we looked at it and that usually means either global
> > + * reclaim or migration going on. The page will either get back to the
> > + * LRU or vanish.
>
> I just wonder for how long can it go in the worst case?

If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!

2012-10-29 23:26:17

by Tejun Heo

Subject: Re: memcg/cgroup: do not fail on pre_destroy callbacks

Hello, Michal.

> Tejun is planning to build on top of that and make some more cleanups
> in the cgroup core (namely get rid of the whole retry code in
> cgroup_rmdir).

I applied 1-3 to the following branch which is based on top of v3.6.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-destroy-updates

I'll follow up with updates to the destroy path which will replace #4.
#5 and #6 should be stackable on top. So, Andrew, there will likely be
a conflict in the near future. Just dropping #4-#6 till Michal and I
sort it out should be enough.

Thanks.

--
tejun

2012-10-30 10:36:06

by Michal Hocko

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

On Mon 29-10-12 15:00:22, Andrew Morton wrote:
> On Mon, 29 Oct 2012 17:58:45 +0400
> Glauber Costa <[email protected]> wrote:
>
> > > + * move charges to its parent or the root cgroup if the group has no
> > > + * parent (aka use_hierarchy==0).
> > > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > > + * mem_cgroup_move_account fails) the failure is always temporary and
> > > + * it signals a race with a page removal/uncharge or migration. In the
> > > + * first case the page is on the way out and it will vanish from the LRU
> > > + * on the next attempt and the call should be retried later.
> > > + * Isolation from the LRU fails only if page has been isolated from
> > > + * the LRU since we looked at it and that usually means either global
> > > + * reclaim or migration going on. The page will either get back to the
> > > + * LRU or vanish.
> >
> > I just wonder for how long can it go in the worst case?
>
> If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!

You are right, if the rmdir (resp. echo > force_empty) at SCHED_FIFO
races with put_page (on a shared page) which gets preempted after
put_page_testzero and before __page_cache_release then we are screwed:

put_page(page)
put_page_testzero
<preempted and page still on LRU>
mem_cgroup_force_empty_list
page = list_entry(list->prev, struct page, lru);
mem_cgroup_move_parent(page)
get_page_unless_zero <fails>
cond_resched() <scheduled again>

The race window is really small but it is definitely possible. I am not
happy about this state and it should probably be mentioned in the
patch description, but I do not see any way around it (short of hacks
like sched_setscheduler on the current task which is, ehm...) that
would still keep the do-not-fail contract here.

Can we consider this a corner case (it is much easier to kill a
machine with SCHED_FIFO than this anyway), or is the concern strong
enough that we should come up with a solution before this can get
merged?

--
Michal Hocko
SUSE Labs

2012-10-30 23:37:44

by Michal Hocko

Subject: Re: memcg/cgroup: do not fail on pre_destroy callbacks

On Mon 29-10-12 16:26:02, Tejun Heo wrote:
> Hello, Michal.
>
> > Tejun is planning to build on top of that and make some more cleanups
> > in the cgroup core (namely get rid of the whole retry code in
> > cgroup_rmdir).
>
> I applied 1-3 to the following branch which is based on top of v3.6.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-destroy-updates

Ok, Andrew dropped all the patches from his tree and I set up this
branch for automerging into the -mm git tree.

> I'll follow up with updates to the destroy path which will replace #4.
> #5 and #6 should be stackable on top.

Could you take care of them and apply those two on top of the first one
which guarantees that css_tryget fails and no new task can appear in the
group (aka #4 without the follow-up cleanups)? That way Andrew doesn't
have to care about them later.

Thanks!
--
Michal Hocko
SUSE Labs

2012-10-31 16:29:49

by Johannes Weiner

Subject: Re: [PATCH v3 1/6] memcg: split mem_cgroup_force_empty into reclaiming and reparenting parts

On Fri, Oct 26, 2012 at 01:37:28PM +0200, Michal Hocko wrote:
> mem_cgroup_force_empty did two separate things depending on the free_all
> parameter from the very beginning. It either reclaimed as many pages as
> possible and moved the rest to the parent, or just moved the charges to
> the parent. The first variant is used as the memory.force_empty callback
> while the latter is used from mem_cgroup_pre_destroy.
>
> The games played around the gotos are far from nice and there is no
> reason to keep those two functions inside one. Let's split them and
> also move the responsibility for css reference counting to their
> callers to make the code easier.
>
> This patch doesn't have any functional changes.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2012-10-31 16:31:49

by Johannes Weiner

Subject: Re: [PATCH v3 2/6] memcg: root_cgroup cannot reach mem_cgroup_move_parent

On Fri, Oct 26, 2012 at 01:37:29PM +0200, Michal Hocko wrote:
> The root cgroup cannot be destroyed so we never hit it down the
> mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't
> even try to do anything if called for the root.
>
> This means that mem_cgroup_move_parent doesn't have to bother with the
> root cgroup and it can assume it can always move charges upwards.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Tejun Heo <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2012-10-31 21:30:50

by Andrew Morton

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

On Tue, 30 Oct 2012 11:35:59 +0100
Michal Hocko <[email protected]> wrote:

> > If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
>
> ...
>
> Can we consider this as a corner case (it is much easier to kill a
> machine with SCHED_FIFO than this anyway) or the concern is really
> strong and we should come with a solution before this can get merged?

Sure. SCHED_FIFO can be used to permanently block all kernel threads,
which pretty quickly results in a very sick kernel. My observation was
just a general moan about the SCHED_FIFO wontfix problem :)

2012-11-13 21:11:06

by Johannes Weiner

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

On Tue, Oct 30, 2012 at 11:35:59AM +0100, Michal Hocko wrote:
> On Mon 29-10-12 15:00:22, Andrew Morton wrote:
> > On Mon, 29 Oct 2012 17:58:45 +0400
> > Glauber Costa <[email protected]> wrote:
> >
> > > > + * move charges to its parent or the root cgroup if the group has no
> > > > + * parent (aka use_hierarchy==0).
> > > > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > > > + * mem_cgroup_move_account fails) the failure is always temporary and
> > > > + * it signals a race with a page removal/uncharge or migration. In the
> > > > + * first case the page is on the way out and it will vanish from the LRU
> > > > + * on the next attempt and the call should be retried later.
> > > > + * Isolation from the LRU fails only if page has been isolated from
> > > > + * the LRU since we looked at it and that usually means either global
> > > > + * reclaim or migration going on. The page will either get back to the
> > > > + * LRU or vanish.
> > >
> > > I just wonder for how long can it go in the worst case?
> >
> > If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
>
> You are right, if the rmdir (resp. echo > force_empty) at SCHED_FIFO
> races with put_page (on a shared page) which gets preempted after
> put_page_testzero and before __page_cache_release then we are screwed:
>
> put_page(page)
> put_page_testzero
> <preempted and page still on LRU>
> mem_cgroup_force_empty_list
> page = list_entry(list->prev, struct page, lru);
> mem_cgroup_move_parent(page)
> get_page_unless_zero <fails>
> cond_resched() <scheduled again>
>
> The race window is really small but it is definitely possible. I am not
> happy about this state and it should probably be mentioned in the
> patch description, but I do not see any way around it (short of hacks
> like sched_setscheduler on the current task which is, ehm...) that
> would still keep the do-not-fail contract here.
>
> Can we consider this a corner case (it is much easier to kill a
> machine with SCHED_FIFO than this anyway), or is the concern strong
> enough that we should come up with a solution before this can get
> merged?

Wouldn't the much bigger race window be reclaim having the page
isolated and SCHED_FIFO preventing the putback?

I also don't think this is a new class of problem, though.

Would it make sense to stick a wait_on_page_locked() in there just so
that we don't busy-spin on a page under migration/reclaim?
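
Something along these lines in mem_cgroup_force_empty_list, maybe
(completely untested sketch, just to illustrate the idea):

	if (mem_cgroup_move_parent(page, pc, memcg)) {
		/* found busy one */
		busy = page;
		/*
		 * Don't busy-spin while reclaim/migration holds the
		 * page locked; sleep until the lock is released.
		 */
		wait_on_page_locked(page);
		cond_resched();
	} else
		busy = NULL;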

2012-11-13 21:13:31

by Johannes Weiner

Subject: Re: [PATCH v3 4/6] cgroups: forbid pre_destroy callback to fail

On Fri, Oct 26, 2012 at 01:37:31PM +0200, Michal Hocko wrote:
> Now that the mem_cgroup_pre_destroy callback doesn't fail (other than
> when racing with a task attach or a child group creation) we can
> finally safely move on and forbid all the callbacks to fail.
> The last missing piece is moving cgroup_call_pre_destroy after
> cgroup_clear_css_refs so that css_tryget fails and no new charges for
> the memcg can happen.
> We cannot, however, call cgroup_call_pre_destroy right at that point
> because we cannot call mem_cgroup_pre_destroy with the cgroup_lock held
> (see 3fa59dfb "cgroup: fix potential deadlock in pre_destroy") so we
> have to move it to after the lock is released.
>
> Changes since v1
> - Li Zefan pointed out that mem_cgroup_pre_destroy cannot be called with
> cgroup_lock held
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2012-11-14 13:59:35

by Michal Hocko

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

On Tue 13-11-12 16:10:41, Johannes Weiner wrote:
> On Tue, Oct 30, 2012 at 11:35:59AM +0100, Michal Hocko wrote:
> > On Mon 29-10-12 15:00:22, Andrew Morton wrote:
> > > On Mon, 29 Oct 2012 17:58:45 +0400
> > > Glauber Costa <[email protected]> wrote:
> > >
> > > > > + * move charges to its parent or the root cgroup if the group has no
> > > > > + * parent (aka use_hierarchy==0).
> > > > > + * Although this might fail (get_page_unless_zero, isolate_lru_page or
> > > > > + * mem_cgroup_move_account fails) the failure is always temporary and
> > > > > + * it signals a race with a page removal/uncharge or migration. In the
> > > > > + * first case the page is on the way out and it will vanish from the LRU
> > > > > + * on the next attempt and the call should be retried later.
> > > > > + * Isolation from the LRU fails only if page has been isolated from
> > > > > + * the LRU since we looked at it and that usually means either global
> > > > > + * reclaim or migration going on. The page will either get back to the
> > > > > + * LRU or vanish.
> > > >
> > > > I just wonder for how long can it go in the worst case?
> > >
> > > If the kernel is uniprocessor and the caller is SCHED_FIFO: ad infinitum!
> >
> > You are right, if the rmdir (resp. echo > force_empty) at SCHED_FIFO
> > races with put_page (on a shared page) which gets preempted after
> > put_page_testzero and before __page_cache_release then we are screwed:
> >
> > put_page(page)
> > put_page_testzero
> > <preempted and page still on LRU>
> > mem_cgroup_force_empty_list
> > page = list_entry(list->prev, struct page, lru);
> > mem_cgroup_move_parent(page)
> > get_page_unless_zero <fails>
> > cond_resched() <scheduled again>
> >
> > The race window is really small but it is definitely possible. I am not
> > happy about this state and it should probably be mentioned in the
> > patch description, but I do not see any way around it (short of hacks
> > like sched_setscheduler on the current task which is, ehm...) that
> > would still keep the do-not-fail contract here.
> >
> > Can we consider this a corner case (it is much easier to kill a
> > machine with SCHED_FIFO than this anyway), or is the concern strong
> > enough that we should come up with a solution before this can get
> > merged?
>
> Wouldn't the much bigger race window be reclaim having the page
> isolated and SCHED_FIFO preventing it from putback?

We wouldn't see the page on the LRU then, right?

> I also don't think this is a new class of problem, though.
>
> Would it make sense to stick a wait_on_page_locked() in there just so
> that we don't busy spin on a page under migration/reclaim?

Hmm, this would also mean that get_page_unless_zero would fail as well
and so we would schedule in mem_cgroup_force_empty_list. It is true that
there might be no other runnable task, so we could busy-loop; so yes,
this would help. Care to cook up the patch?

Thanks
--
Michal Hocko
SUSE Labs

2012-11-14 18:33:53

by Johannes Weiner

Subject: Re: [PATCH v3 3/6] memcg: Simplify mem_cgroup_force_empty_list error handling

On Wed, Nov 14, 2012 at 02:59:30PM +0100, Michal Hocko wrote:
> On Tue 13-11-12 16:10:41, Johannes Weiner wrote:
> > Would it make sense to stick a wait_on_page_locked() in there just so
> > that we don't busy-spin on a page under migration/reclaim?
>
> Hmm, this would also mean that get_page_unless_zero would fail as well
> and so we would schedule in mem_cgroup_force_empty_list. It is true that
> there might be no other runnable task, so we could busy-loop; so yes,
> this would help. Care to cook up the patch?

Eventually get_page_unless_zero() would fail, but we could still spin
on a page while it's off the LRU and migration performs writeback on
it. E.g. cond_resched() does not necessarily schedule just because
there is another runnable task, I think; it's voluntary preemption
when the task needs rescheduling anyway, not a yield.

Maybe not worth bothering...