2019-01-02 22:07:29

by Yang Shi

Subject: [RFC PATCH 0/3] mm: memcontrol: delayed force empty


Currently, force empty reclaims memory synchronously when writing to
memory.force_empty. It may take some time to return, and subsequent
operations are blocked until it completes. Although it can be interrupted
by a signal, it still seems suboptimal.

Now that css offline is handled by a worker, and the typical use case of
force empty is right before memcg offline, handling force empty in the css
offline worker sounds reasonable.

The user may write any value to memory.force_empty, but I suppose the most
commonly used values are 0 and 1. To avoid breaking existing applications,
writing 0 or 1 still does force empty synchronously; any other value tells
the kernel to do the force empty in the css offline worker.
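
To illustrate, a minimal usage sketch of the proposed semantics (the mount
point, the cgroup name and the value 2 are just examples, not part of the
patch):

mkdir /sys/fs/cgroup/memory/job
# ... run the workload, then detach its tasks ...

# old behavior: reclaim synchronously before removing the memcg
echo 0 > /sys/fs/cgroup/memory/job/memory.force_empty

# new behavior: any value other than 0 or 1 defers the reclaim to the
# css offline worker, so the write returns immediately
echo 2 > /sys/fs/cgroup/memory/job/memory.force_empty
rmdir /sys/fs/cgroup/memory/job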

Patch #1: Fix some obsolete information about force_empty in the document
Patch #2: A minor improvement to skip swap for force_empty
Patch #3: Implement delayed force_empty

Yang Shi (3):
doc: memcontrol: fix the obsolete content about force empty
mm: memcontrol: do not try to do swap when force empty
mm: memcontrol: delay force empty to css offline

Documentation/cgroup-v1/memory.txt | 15 ++++++++++-----
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 20 +++++++++++++++++++-
3 files changed, 31 insertions(+), 6 deletions(-)


2019-01-02 22:16:46

by Yang Shi

Subject: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty

The typical use case of force empty is to try to reclaim as much memory as
possible before offlining a memcg. Since there should be no tasks attached
to an offlining memcg, their anonymous pages would have already been freed
or uncharged. Even if anonymous pages get swapped out, they are still
charged to the memory+swap counter. So it sounds pointless to swap for
force empty.
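
For instance, with swap accounting enabled this can be seen from the v1
counters (a rough illustration only; the paths assume the test memcg used
in the script below):

cat /sys/fs/cgroup/memory/test/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/test/memory.memsw.usage_in_bytes
# Swapping anon pages out lowers memory.usage_in_bytes but not
# memory.memsw.usage_in_bytes, so the memcg does not really get emptier.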

I tried to dig into the history of this; it was introduced by
commit 8c7c6e34a125 ("memcg: mem+swap controller core"), but there is
no clue about why it was done this way in the first place.

The simple test script below shows a slight file cache reclaim improvement
when swap is on.

echo 3 > /proc/sys/vm/drop_caches
mkdir /sys/fs/cgroup/memory/test
echo 30 > /sys/fs/cgroup/memory/test/memory.swappiness
echo $$ >/sys/fs/cgroup/memory/test/cgroup.procs
cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}'
dd if=/dev/zero of=/mnt/test bs=1M count=1024
ping localhost > /dev/null &
echo 1 > /sys/fs/cgroup/memory/test/memory.force_empty
killall ping
echo $$ >/sys/fs/cgroup/memory/cgroup.procs
cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}'
rmdir /sys/fs/cgroup/memory/test
cat /proc/meminfo | grep ^Cached|awk -F" " '{print $2}'

The page cache numbers (Cached from /proc/meminfo, in kB) are:

                      w/o patch    w/ patch
before force empty      1088792     1088784
after force empty         41492       39428
reclaimed               1047300     1049356

Without doing swap, force empty reclaims about 2MB more memory out of the
1GB of page cache.

Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e1469b..bbf39b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2872,7 +2872,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;

progress = try_to_free_mem_cgroup_pages(memcg, 1,
- GFP_KERNEL, true);
+ GFP_KERNEL, false);
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
--
1.8.3.1


2019-01-02 22:19:24

by Yang Shi

Subject: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty

We don't reparent page cache anymore when offlining a memcg, so update the
force_empty related content accordingly.

Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
Documentation/cgroup-v1/memory.txt | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index 3682e99..8e2cb1d 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -70,7 +70,7 @@ Brief summary of control files.
memory.soft_limit_in_bytes # set/show soft limit of memory usage
memory.stat # show various statistics
memory.use_hierarchy # set/show hierarchical account enabled
- memory.force_empty # trigger forced move charge to parent
+ memory.force_empty # trigger forced page reclaim
memory.pressure_level # set memory pressure notifications
memory.swappiness # set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
@@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
the cgroup will be reclaimed and as many pages reclaimed as possible.

The typical use case for this interface is before calling rmdir().
- Because rmdir() moves all pages to parent, some out-of-use page caches can be
- moved to the parent. If you want to avoid that, force_empty will be useful.
+ Though rmdir() offlines the memcg, the memcg may still stay around due to
+ charged file caches. Some out-of-use page caches may stay charged until
+ memory pressure happens. If you want to avoid that, force_empty will be useful.

Also, note that when memory.kmem.limit_in_bytes is set the charges due to
kernel pages will still be seen. This is not considered a failure and the
--
1.8.3.1


2019-01-02 22:25:45

by Yang Shi

Subject: [PATCH 3/3] mm: memcontrol: delay force empty to css offline

Currently, force empty reclaims memory synchronously when writing to
memory.force_empty. It may take some time to return, and subsequent
operations are blocked until it completes. Although it can be interrupted
by a signal, it still seems suboptimal.

Now that css offline is handled by a worker, and the typical use case of
force empty is right before memcg offline, handling force empty in the css
offline worker sounds reasonable.

The user may write any value to memory.force_empty, but I suppose the most
commonly used values are 0 and 1. To avoid breaking existing applications,
writing 0 or 1 still does force empty synchronously; any other value tells
the kernel to do the force empty in the css offline worker.

Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Signed-off-by: Yang Shi <[email protected]>
---
Documentation/cgroup-v1/memory.txt | 8 ++++++--
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 18 ++++++++++++++++++
3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index 8e2cb1d..313d45f 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -452,11 +452,15 @@ About use_hierarchy, see Section 6.

5.1 force_empty
memory.force_empty interface is provided to make cgroup's memory usage empty.
- When writing anything to this
+ When writing 0 or 1 to this

# echo 0 > memory.force_empty

- the cgroup will be reclaimed and as many pages reclaimed as possible.
+ the cgroup will be reclaimed and as many pages reclaimed as possible
+ synchronously.
+
+ Writing any other value to this file will delay the memory reclaim until
+ css offline.

The typical use case for this interface is before calling rmdir().
Though rmdir() offlines the memcg, the memcg may still stay around due to
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7ab2120..48a5cf2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -311,6 +311,8 @@ struct mem_cgroup {
struct list_head event_list;
spinlock_t event_list_lock;

+ bool delayed_force_empty;
+
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bbf39b5..620b6c5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2888,10 +2888,25 @@ static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
+ unsigned long val;
+ ssize_t ret;
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));

if (mem_cgroup_is_root(memcg))
return -EINVAL;
+
+ buf = strstrip(buf);
+
+ ret = kstrtoul(buf, 10, &val);
+ if (ret < 0)
+ return ret;
+
+ if (val != 0 && val != 1) {
+ memcg->delayed_force_empty = true;
+ return nbytes;
+ }
+
+ memcg->delayed_force_empty = false;
return mem_cgroup_force_empty(memcg) ?: nbytes;
}

@@ -4531,6 +4546,9 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup_event *event, *tmp;

+ if (memcg->delayed_force_empty)
+ mem_cgroup_force_empty(memcg);
+
/*
* Unregister events and notify userspace.
* Notify userspace about cgroup removing only after rmdir of cgroup
--
1.8.3.1


2019-01-03 00:47:15

by Shakeel Butt

Subject: Re: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty

On Wed, Jan 2, 2019 at 12:07 PM Yang Shi <[email protected]> wrote:
>
> We don't do page cache reparent anymore when offlining memcg, so update
> force empty related content accordingly.
>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

> ---
> Documentation/cgroup-v1/memory.txt | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index 3682e99..8e2cb1d 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -70,7 +70,7 @@ Brief summary of control files.
> memory.soft_limit_in_bytes # set/show soft limit of memory usage
> memory.stat # show various statistics
> memory.use_hierarchy # set/show hierarchical account enabled
> - memory.force_empty # trigger forced move charge to parent
> + memory.force_empty # trigger forced page reclaim
> memory.pressure_level # set memory pressure notifications
> memory.swappiness # set/show swappiness parameter of vmscan
> (See sysctl's vm.swappiness)
> @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
> the cgroup will be reclaimed and as many pages reclaimed as possible.
>
> The typical use case for this interface is before calling rmdir().
> - Because rmdir() moves all pages to parent, some out-of-use page caches can be
> - moved to the parent. If you want to avoid that, force_empty will be useful.
> + Though rmdir() offlines memcg, but the memcg may still stay there due to
> + charged file caches. Some out-of-use page caches may keep charged until
> + memory pressure happens. If you want to avoid that, force_empty will be useful.
>
> Also, note that when memory.kmem.limit_in_bytes is set the charges due to
> kernel pages will still be seen. This is not considered a failure and the
> --
> 1.8.3.1
>

2019-01-03 02:18:12

by Shakeel Butt

Subject: Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty

On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <[email protected]> wrote:
>
> The typical usecase of force empty is to try to reclaim as much as
> possible memory before offlining a memcg. Since there should be no
> attached tasks to offlining memcg, the tasks anonymous pages would have
> already been freed or uncharged.

Anon pages can come from tmpfs files as well.

> Even though anonymous pages get
> swapped out, but they still get charged to swap space. So, it sounds
> pointless to do swap for force empty.
>

I understand that force_empty is typically used before rmdir'ing a
memcg but it might be used differently by some users. We use this
interface to test memory reclaim behavior (anon and file).

Anyways, I am not against changing the behavior, we can adapt
internally but there might be other users using this interface
differently.

thanks,
Shakeel

2019-01-03 14:03:42

by Michal Hocko

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

On Thu 03-01-19 04:05:30, Yang Shi wrote:
>
> Currently, force empty reclaims memory synchronously when writing to
> memory.force_empty. It may take some time to return and the afterwards
> operations are blocked by it. Although it can be interrupted by signal,
> it still seems suboptimal.

Why is it suboptimal? We are doing that operation on behalf of the
process requesting it. Why should anybody else pay for it? In other
words, why should we hide the overhead?

> Now css offline is handled by worker, and the typical usecase of force
> empty is before memcg offline. So, handling force empty in css offline
> sounds reasonable.

Hmm, so I guess you are talking about
echo 1 > $MEMCG/force_empty
rmdir $MEMCG

and you are complaining that the operation takes too long. Right? Why do
you care actually?
--
Michal Hocko
SUSE Labs

2019-01-03 14:17:43

by Michal Hocko

Subject: Re: [PATCH 1/3] doc: memcontrol: fix the obsolete content about force empty

On Thu 03-01-19 04:05:31, Yang Shi wrote:
> We don't do page cache reparent anymore when offlining memcg, so update
> force empty related content accordingly.
>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Signed-off-by: Yang Shi <[email protected]>

Thanks for the clean up.

Acked-by: Michal Hocko <[email protected]>

> ---
> Documentation/cgroup-v1/memory.txt | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
> index 3682e99..8e2cb1d 100644
> --- a/Documentation/cgroup-v1/memory.txt
> +++ b/Documentation/cgroup-v1/memory.txt
> @@ -70,7 +70,7 @@ Brief summary of control files.
> memory.soft_limit_in_bytes # set/show soft limit of memory usage
> memory.stat # show various statistics
> memory.use_hierarchy # set/show hierarchical account enabled
> - memory.force_empty # trigger forced move charge to parent
> + memory.force_empty # trigger forced page reclaim
> memory.pressure_level # set memory pressure notifications
> memory.swappiness # set/show swappiness parameter of vmscan
> (See sysctl's vm.swappiness)
> @@ -459,8 +459,9 @@ About use_hierarchy, see Section 6.
> the cgroup will be reclaimed and as many pages reclaimed as possible.
>
> The typical use case for this interface is before calling rmdir().
> - Because rmdir() moves all pages to parent, some out-of-use page caches can be
> - moved to the parent. If you want to avoid that, force_empty will be useful.
> + Though rmdir() offlines memcg, but the memcg may still stay there due to
> + charged file caches. Some out-of-use page caches may keep charged until
> + memory pressure happens. If you want to avoid that, force_empty will be useful.
>
> Also, note that when memory.kmem.limit_in_bytes is set the charges due to
> kernel pages will still be seen. This is not considered a failure and the
> --
> 1.8.3.1
>

--
Michal Hocko
SUSE Labs

2019-01-03 23:30:42

by Yang Shi

Subject: Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty



On 1/2/19 1:45 PM, Shakeel Butt wrote:
> On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <[email protected]> wrote:
>> The typical usecase of force empty is to try to reclaim as much as
>> possible memory before offlining a memcg. Since there should be no
>> attached tasks to offlining memcg, the tasks anonymous pages would have
>> already been freed or uncharged.
> Anon pages can come from tmpfs files as well.

Yes, but they are charged to swap space as regular anon pages.

>
>> Even though anonymous pages get
>> swapped out, but they still get charged to swap space. So, it sounds
>> pointless to do swap for force empty.
>>
> I understand that force_empty is typically used before rmdir'ing a
> memcg but it might be used differently by some users. We use this
> interface to test memory reclaim behavior (anon and file).

Thanks for sharing your use case. So, you use this for testing only?

>
> Anyways, I am not against changing the behavior, we can adapt
> internally but there might be other users using this interface
> differently.

Thanks.

Yang

>
> thanks,
> Shakeel


2019-01-03 23:34:51

by Shakeel Butt

Subject: Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty

On Thu, Jan 3, 2019 at 8:57 AM Yang Shi <[email protected]> wrote:
>
>
>
> On 1/2/19 1:45 PM, Shakeel Butt wrote:
> > On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <[email protected]> wrote:
> >> The typical usecase of force empty is to try to reclaim as much as
> >> possible memory before offlining a memcg. Since there should be no
> >> attached tasks to offlining memcg, the tasks anonymous pages would have
> >> already been freed or uncharged.
> > Anon pages can come from tmpfs files as well.
>
> Yes, but they are charged to swap space as regular anon pages.
>

The point was that the lifetime of tmpfs anon pages is not tied to any
task. Even though there aren't any tasks attached to a memcg, the tmpfs
anon pages will remain charged. Other than that, the old anon pages of
a task which has migrated away might still be charged to the old
memcg (if move_charge_at_immigrate is not set).

> >
> >> Even though anonymous pages get
> >> swapped out, but they still get charged to swap space. So, it sounds
> >> pointless to do swap for force empty.
> >>
> > I understand that force_empty is typically used before rmdir'ing a
> > memcg but it might be used differently by some users. We use this
> > interface to test memory reclaim behavior (anon and file).
>
> Thanks for sharing your usecase. So, you uses this for test only?
>

Yes.

Shakeel

2019-01-03 23:35:20

by Yang Shi

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/3/19 2:12 AM, Michal Hocko wrote:
> On Thu 03-01-19 04:05:30, Yang Shi wrote:
>> Currently, force empty reclaims memory synchronously when writing to
>> memory.force_empty. It may take some time to return and the afterwards
>> operations are blocked by it. Although it can be interrupted by signal,
>> it still seems suboptimal.
> Why it is suboptimal? We are doing that operation on behalf of the
> process requesting it. What should anybody else pay for it? In other
> words why should we hide the overhead?

Please see the below explanation.

>
>> Now css offline is handled by worker, and the typical usecase of force
>> empty is before memcg offline. So, handling force empty in css offline
>> sounds reasonable.
> Hmm, so I guess you are talking about
> echo 1 > $MEMCG/force_empty
> rmdir $MEMCG
>
> and you are complaining that the operation takes too long. Right? Why do
> you care actually?

We have some use cases which create and remove memcgs very frequently,
and the tasks in the memcg may just access files which are unlikely to be
accessed by anyone else. So, we prefer to force_empty the memcg before
rmdir'ing it to reclaim the page cache, so that it doesn't accumulate and
incur unnecessary memory pressure, since memory pressure may trigger
direct reclaim and harm some latency-sensitive applications.

And the create/remove might be run sequentially in a script (there might
be a lot of scripts or applications running in parallel doing this), i.e.
mkdir cg1
do something
echo 0 > cg1/memory.force_empty
rmdir cg1

mkdir cg2
...

The creation of the next memcg might be blocked by the force_empty for a
long time if there is a lot of page cache, so the overall throughput of
the system may suffer.
And it is not that urgent to reclaim the page cache right away, nor is it
that important who pays the cost; we just need a mechanism to reclaim the
pages within a short while. The overhead could be smoothed out by the
background workqueue.

And, the patch still keeps the old behavior, just in case someone else
still depends on it.

Thanks,
Yang



2019-01-03 23:53:43

by Yang Shi

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/3/19 10:13 AM, Michal Hocko wrote:
> On Thu 03-01-19 09:33:14, Yang Shi wrote:
>>
>> On 1/3/19 2:12 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 04:05:30, Yang Shi wrote:
>>>> Currently, force empty reclaims memory synchronously when writing to
>>>> memory.force_empty. It may take some time to return and the afterwards
>>>> operations are blocked by it. Although it can be interrupted by signal,
>>>> it still seems suboptimal.
>>> Why it is suboptimal? We are doing that operation on behalf of the
>>> process requesting it. What should anybody else pay for it? In other
>>> words why should we hide the overhead?
>> Please see the below explanation.
>>
>>>> Now css offline is handled by worker, and the typical usecase of force
>>>> empty is before memcg offline. So, handling force empty in css offline
>>>> sounds reasonable.
>>> Hmm, so I guess you are talking about
>>> echo 1 > $MEMCG/force_empty
>>> rmdir $MEMCG
>>>
>>> and you are complaining that the operation takes too long. Right? Why do
>>> you care actually?
>> We have some usecases which create and remove memcgs very frequently, and
>> the tasks in the memcg may just access the files which are unlikely accessed
>> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
>> reclaim the page cache so that they don't get accumulated to incur
>> unnecessary memory pressure. Since the memory pressure may incur direct
>> reclaim to harm some latency sensitive applications.
> Yes, this makes sense to me.
>
>> And, the create/remove might be run in a script sequentially (there might be
>> a lot scripts or applications are run in parallel to do this), i.e.
>> mkdir cg1
>> do something
>> echo 0 > cg1/memory.force_empty
>> rmdir cg1
>>
>> mkdir cg2
>> ...
>>
>> The creation of the afterwards memcg might be blocked by the force_empty for
>> long time if there are a lot page caches, so the overall throughput of the
>> system may get hurt.
> Is there any reason for your scripts to be strictly sequential here? In
> other words why cannot you offload those expensive operations to a
> detached context in _userspace_?

I would say it does not have to be strictly sequential. The above script
is just an example to illustrate the pattern. But sometimes it may hit
such a pattern due to the complicated cluster and container scheduling in
the production environment; for example, the creation process might be
scheduled to the same CPU which is doing force_empty. I have to say I
don't know too much about the internals of the container scheduling.

Thanks,
Yang



2019-01-03 23:53:43

by Yang Shi

Subject: Re: [PATCH 2/3] mm: memcontrol: do not try to do swap when force empty



On 1/3/19 9:03 AM, Shakeel Butt wrote:
> On Thu, Jan 3, 2019 at 8:57 AM Yang Shi <[email protected]> wrote:
>>
>>
>> On 1/2/19 1:45 PM, Shakeel Butt wrote:
>>> On Wed, Jan 2, 2019 at 12:06 PM Yang Shi <[email protected]> wrote:
>>>> The typical usecase of force empty is to try to reclaim as much as
>>>> possible memory before offlining a memcg. Since there should be no
>>>> attached tasks to offlining memcg, the tasks anonymous pages would have
>>>> already been freed or uncharged.
>>> Anon pages can come from tmpfs files as well.
>> Yes, but they are charged to swap space as regular anon pages.
>>
> The point was the lifetime of tmpfs anon pages are not tied to any
> task. Even though there aren't any task attached to a memcg, the tmpfs
> anon pages will remain charged. Other than that, the old anon pages of
> a task which have migrated away might still be charged to the old
> memcg (if move_charge_at_immigrate is not set).

Yes, my understanding is that even though they are swapped out, they are
still charged to memsw for cgroup v1. force_empty is supposed to reclaim
as much memory as possible, and here I suppose "reclaim" also means
"uncharge".

Even though the anon pages are still charged to the old memcg, they will
be moved to the new memcg when the old one is deleted, or the pages will
just be released if the task is killed.

So, IMHO, I don't see the point of swapping anon pages when doing
force_empty.

Thanks,
Yang

>>>> Even though anonymous pages get
>>>> swapped out, but they still get charged to swap space. So, it sounds
>>>> pointless to do swap for force empty.
>>>>
>>> I understand that force_empty is typically used before rmdir'ing a
>>> memcg but it might be used differently by some users. We use this
>>> interface to test memory reclaim behavior (anon and file).
>> Thanks for sharing your usecase. So, you uses this for test only?
>>
> Yes.
>
> Shakeel


2019-01-03 23:56:14

by Michal Hocko

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

On Thu 03-01-19 09:33:14, Yang Shi wrote:
>
>
> On 1/3/19 2:12 AM, Michal Hocko wrote:
> > On Thu 03-01-19 04:05:30, Yang Shi wrote:
> > > Currently, force empty reclaims memory synchronously when writing to
> > > memory.force_empty. It may take some time to return and the afterwards
> > > operations are blocked by it. Although it can be interrupted by signal,
> > > it still seems suboptimal.
> > Why it is suboptimal? We are doing that operation on behalf of the
> > process requesting it. What should anybody else pay for it? In other
> > words why should we hide the overhead?
>
> Please see the below explanation.
>
> >
> > > Now css offline is handled by worker, and the typical usecase of force
> > > empty is before memcg offline. So, handling force empty in css offline
> > > sounds reasonable.
> > Hmm, so I guess you are talking about
> > echo 1 > $MEMCG/force_empty
> > rmdir $MEMCG
> >
> > and you are complaining that the operation takes too long. Right? Why do
> > you care actually?
>
> We have some usecases which create and remove memcgs very frequently, and
> the tasks in the memcg may just access the files which are unlikely accessed
> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
> reclaim the page cache so that they don't get accumulated to incur
> unnecessary memory pressure. Since the memory pressure may incur direct
> reclaim to harm some latency sensitive applications.

Yes, this makes sense to me.

> And, the create/remove might be run in a script sequentially (there might be
> a lot scripts or applications are run in parallel to do this), i.e.
> mkdir cg1
> do something
> echo 0 > cg1/memory.force_empty
> rmdir cg1
>
> mkdir cg2
> ...
>
> The creation of the afterwards memcg might be blocked by the force_empty for
> long time if there are a lot page caches, so the overall throughput of the
> system may get hurt.

Is there any reason for your scripts to be strictly sequential here? In
other words why cannot you offload those expensive operations to a
detached context in _userspace_?
--
Michal Hocko
SUSE Labs

2019-01-03 23:57:55

by Michal Hocko

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

On Thu 03-01-19 11:10:00, Yang Shi wrote:
>
>
> On 1/3/19 10:53 AM, Michal Hocko wrote:
> > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > >
> > > On 1/3/19 10:13 AM, Michal Hocko wrote:
[...]
> > > > Is there any reason for your scripts to be strictly sequential here? In
> > > > other words why cannot you offload those expensive operations to a
> > > > detached context in _userspace_?
> > > I would say it has not to be strictly sequential. The above script is just
> > > an example to illustrate the pattern. But, sometimes it may hit such pattern
> > > due to the complicated cluster scheduling and container scheduling in the
> > > production environment, for example the creation process might be scheduled
> > > to the same CPU which is doing force_empty. I have to say I don't know too
> > > much about the internals of the container scheduling.
> > In that case I do not see a strong reason to implement the offloding
> > into the kernel. It is an additional code and semantic to maintain.
>
> Yes, it does introduce some additional code and semantic, but IMHO, it is
> quite simple and very straight forward, isn't it? Just utilize the existing
> css offline worker. And, that a couple of lines of code do improve some
> throughput issues for some real usecases.

I do not really care that it is a few LOC. It is more important that it
conflates force_empty into the offlining logic. There was a good reason to
remove reparenting/emptying the memcg during offline. Considering that you
can offload force_empty from userspace trivially, I do not see any reason
to implement it in the kernel.

> > I think it is more important to discuss whether we want to introduce
> > force_empty in cgroup v2.
>
> We would prefer have it in v2 as well.

Then bring this up in a separate email thread please.
--
Michal Hocko
SUSE Labs

2019-01-03 23:58:51

by Yang Shi

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/3/19 10:53 AM, Michal Hocko wrote:
> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>
>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 09:33:14, Yang Shi wrote:
>>>> On 1/3/19 2:12 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 04:05:30, Yang Shi wrote:
>>>>>> Currently, force empty reclaims memory synchronously when writing to
>>>>>> memory.force_empty. It may take some time to return and the afterwards
>>>>>> operations are blocked by it. Although it can be interrupted by signal,
>>>>>> it still seems suboptimal.
>>>>> Why it is suboptimal? We are doing that operation on behalf of the
>>>>> process requesting it. What should anybody else pay for it? In other
>>>>> words why should we hide the overhead?
>>>> Please see the below explanation.
>>>>
>>>>>> Now css offline is handled by worker, and the typical usecase of force
>>>>>> empty is before memcg offline. So, handling force empty in css offline
>>>>>> sounds reasonable.
>>>>> Hmm, so I guess you are talking about
>>>>> echo 1 > $MEMCG/force_empty
>>>>> rmdir $MEMCG
>>>>>
>>>>> and you are complaining that the operation takes too long. Right? Why do
>>>>> you care actually?
>>>> We have some usecases which create and remove memcgs very frequently, and
>>>> the tasks in the memcg may just access the files which are unlikely accessed
>>>> by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
>>>> reclaim the page cache so that they don't get accumulated to incur
>>>> unnecessary memory pressure. Since the memory pressure may incur direct
>>>> reclaim to harm some latency sensitive applications.
>>> Yes, this makes sense to me.
>>>
>>>> And, the create/remove might be run in a script sequentially (there might be
>>>> a lot scripts or applications are run in parallel to do this), i.e.
>>>> mkdir cg1
>>>> do something
>>>> echo 0 > cg1/memory.force_empty
>>>> rmdir cg1
>>>>
>>>> mkdir cg2
>>>> ...
>>>>
>>>> The creation of the afterwards memcg might be blocked by the force_empty for
>>>> long time if there are a lot page caches, so the overall throughput of the
>>>> system may get hurt.
>>> Is there any reason for your scripts to be strictly sequential here? In
>>> other words why cannot you offload those expensive operations to a
>>> detached context in _userspace_?
>> I would say it has not to be strictly sequential. The above script is just
>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>> due to the complicated cluster scheduling and container scheduling in the
>> production environment, for example the creation process might be scheduled
>> to the same CPU which is doing force_empty. I have to say I don't know too
>> much about the internals of the container scheduling.
> In that case I do not see a strong reason to implement the offloding
> into the kernel. It is an additional code and semantic to maintain.

Yes, it does introduce some additional code and semantics, but IMHO it is
quite simple and very straightforward, isn't it? It just utilizes the
existing css offline worker. And that couple of lines of code does improve
some throughput issues for some real use cases.

>
> I think it is more important to discuss whether we want to introduce
> force_empty in cgroup v2.

We would prefer to have it in v2 as well.

Thanks,
Yang



2019-01-03 23:59:04

by Michal Hocko

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

On Thu 03-01-19 10:40:54, Yang Shi wrote:
>
>
> On 1/3/19 10:13 AM, Michal Hocko wrote:
> > On Thu 03-01-19 09:33:14, Yang Shi wrote:
> > >
> > > On 1/3/19 2:12 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 04:05:30, Yang Shi wrote:
> > > > > Currently, force empty reclaims memory synchronously when writing to
> > > > > memory.force_empty. It may take some time to return and the afterwards
> > > > > operations are blocked by it. Although it can be interrupted by signal,
> > > > > it still seems suboptimal.
> > > > Why it is suboptimal? We are doing that operation on behalf of the
> > > > process requesting it. What should anybody else pay for it? In other
> > > > words why should we hide the overhead?
> > > Please see the below explanation.
> > >
> > > > > Now css offline is handled by worker, and the typical usecase of force
> > > > > empty is before memcg offline. So, handling force empty in css offline
> > > > > sounds reasonable.
> > > > Hmm, so I guess you are talking about
> > > > echo 1 > $MEMCG/force_empty
> > > > rmdir $MEMCG
> > > >
> > > > and you are complaining that the operation takes too long. Right? Why do
> > > > you care actually?
> > > We have some usecases which create and remove memcgs very frequently, and
> > > the tasks in the memcg may just access the files which are unlikely accessed
> > > by anyone else. So, we prefer force_empty the memcg before rmdir'ing it to
> > > reclaim the page cache so that they don't get accumulated to incur
> > > unnecessary memory pressure. Since the memory pressure may incur direct
> > > reclaim to harm some latency sensitive applications.
> > Yes, this makes sense to me.
> >
> > > And, the create/remove might be run in a script sequentially (there might be
> > > a lot scripts or applications are run in parallel to do this), i.e.
> > > mkdir cg1
> > > do something
> > > echo 0 > cg1/memory.force_empty
> > > rmdir cg1
> > >
> > > mkdir cg2
> > > ...
> > >
> > > The creation of the afterwards memcg might be blocked by the force_empty for
> > > long time if there are a lot page caches, so the overall throughput of the
> > > system may get hurt.
> > Is there any reason for your scripts to be strictly sequential here? In
> > other words why cannot you offload those expensive operations to a
> > detached context in _userspace_?
>
> I would say it has not to be strictly sequential. The above script is just
> an example to illustrate the pattern. But, sometimes it may hit such pattern
> due to the complicated cluster scheduling and container scheduling in the
> production environment, for example the creation process might be scheduled
> to the same CPU which is doing force_empty. I have to say I don't know too
> much about the internals of the container scheduling.

In that case I do not see a strong reason to implement the offloading
into the kernel. It is additional code and semantics to maintain.

I think it is more important to discuss whether we want to introduce
force_empty in cgroup v2.
--
Michal Hocko
SUSE Labs

2019-01-04 00:24:38

by Michal Hocko

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

On Thu 03-01-19 11:49:32, Yang Shi wrote:
>
>
> On 1/3/19 11:23 AM, Michal Hocko wrote:
> > On Thu 03-01-19 11:10:00, Yang Shi wrote:
> > >
> > > On 1/3/19 10:53 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > > > On 1/3/19 10:13 AM, Michal Hocko wrote:
> > [...]
> > > > > > Is there any reason for your scripts to be strictly sequential here? In
> > > > > > other words why cannot you offload those expensive operations to a
> > > > > > detached context in _userspace_?
> > > > > I would say it has not to be strictly sequential. The above script is just
> > > > > an example to illustrate the pattern. But, sometimes it may hit such pattern
> > > > > due to the complicated cluster scheduling and container scheduling in the
> > > > > production environment, for example the creation process might be scheduled
> > > > > to the same CPU which is doing force_empty. I have to say I don't know too
> > > > > much about the internals of the container scheduling.
> > > > In that case I do not see a strong reason to implement the offloding
> > > > into the kernel. It is an additional code and semantic to maintain.
> > > Yes, it does introduce some additional code and semantic, but IMHO, it is
> > > quite simple and very straight forward, isn't it? Just utilize the existing
> > > css offline worker. And, that a couple of lines of code do improve some
> > > throughput issues for some real usecases.
> > I do not really care it is few LOC. It is more important that it is
> > conflating force_empty into offlining logic. There was a good reason to
> > remove reparenting/emptying the memcg during the offline. Considering
> > that you can offload force_empty from userspace trivially then I do not
> > see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be
> offloaded from userspace *trivially*. IOWs the container scheduler may
> unexpectedly overcommit something due to the stall of synchronous force
> empty, which can't be figured out by userspace before it actually happens.
> The scheduler doesn't know how long force_empty would take. If the
> force_empty could be offloaded by kernel, it would make scheduler's life
> much easier. This is not something userspace could do.

What exactly prevents
(
echo 1 > $memcg/force_empty
rmdir $memcg
) &

so that this sequence doesn't really block anything?
--
Michal Hocko
SUSE Labs

2019-01-04 00:24:38

by Yang Shi

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/3/19 11:23 AM, Michal Hocko wrote:
> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>
>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
> [...]
>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>> other words why cannot you offload those expensive operations to a
>>>>> detached context in _userspace_?
>>>> I would say it has not to be strictly sequential. The above script is just
>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>> due to the complicated cluster scheduling and container scheduling in the
>>>> production environment, for example the creation process might be scheduled
>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>> much about the internals of the container scheduling.
>>> In that case I do not see a strong reason to implement the offloding
>>> into the kernel. It is an additional code and semantic to maintain.
>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>> quite simple and very straight forward, isn't it? Just utilize the existing
>> css offline worker. And, that a couple of lines of code do improve some
>> throughput issues for some real usecases.
> I do not really care it is few LOC. It is more important that it is
> conflating force_empty into offlining logic. There was a good reason to
> remove reparenting/emptying the memcg during the offline. Considering
> that you can offload force_empty from userspace trivially then I do not
> see any reason to implement it in the kernel.

Er, I may not have articulated it well in the earlier email: force_empty
cannot be offloaded from userspace *trivially*. IOW, the container
scheduler may unexpectedly overcommit something due to the stall of a
synchronous force empty, which can't be figured out by userspace before it
actually happens. The scheduler doesn't know how long force_empty would
take. If force_empty could be offloaded by the kernel, it would make the
scheduler's life much easier. This is not something userspace could do.

>
>>> I think it is more important to discuss whether we want to introduce
>>> force_empty in cgroup v2.
>> We would prefer have it in v2 as well.
> Then bring this up in a separate email thread please.

Sure. Will prepare the patches later.

Thanks,
Yang



2019-01-04 06:27:36

by Yang Shi

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/3/19 12:01 PM, Michal Hocko wrote:
> On Thu 03-01-19 11:49:32, Yang Shi wrote:
>>
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>> production environment, for example the creation process might be scheduled
>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>> much about the internals of the container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloding
>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>> css offline worker. And, that a couple of lines of code do improve some
>>>> throughput issues for some real usecases.
>>> I do not really care it is few LOC. It is more important that it is
>>> conflating force_empty into offlining logic. There was a good reason to
>>> remove reparenting/emptying the memcg during the offline. Considering
>>> that you can offload force_empty from userspace trivially then I do not
>>> see any reason to implement it in the kernel.
>> Er, I may not articulate in the earlier email, force_empty can not be
>> offloaded from userspace *trivially*. IOWs the container scheduler may
>> unexpectedly overcommit something due to the stall of synchronous force
>> empty, which can't be figured out by userspace before it actually happens.
>> The scheduler doesn't know how long force_empty would take. If the
>> force_empty could be offloaded by kernel, it would make scheduler's life
>> much easier. This is not something userspace could do.
> What exactly prevents
> (
> echo 1 > $memecg/force_empty
> rmdir $memcg
> ) &
>
> so that this sequence doesn't really block anything?

We have "restarting the same name job" logic in our usecase (I'm not
quite sure why they do so). Basically, it means to create memcg with the
exact same name right after the old one is deleted, but may have
different limit or other settings. The creation has to wait for rmdir is
done. Even though rmdir is done in background like the above, the stall
still exists since rmdir simply is waiting for force_empty.
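
Roughly, the pattern looks like this (a simplified sketch; the job name
and the limit value are made up):

echo 0 > /sys/fs/cgroup/memory/job-foo/memory.force_empty  # may stall for a long time
rmdir /sys/fs/cgroup/memory/job-foo
# recreating the same name cannot succeed until the rmdir above is done
mkdir /sys/fs/cgroup/memory/job-foo
echo 4G > /sys/fs/cgroup/memory/job-foo/memory.limit_in_bytes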

Thanks,
Yang



2019-01-04 10:21:41

by Michal Hocko

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

On Thu 03-01-19 20:15:30, Yang Shi wrote:
>
>
> On 1/3/19 12:01 PM, Michal Hocko wrote:
> > On Thu 03-01-19 11:49:32, Yang Shi wrote:
> > >
> > > On 1/3/19 11:23 AM, Michal Hocko wrote:
> > > > On Thu 03-01-19 11:10:00, Yang Shi wrote:
> > > > > On 1/3/19 10:53 AM, Michal Hocko wrote:
> > > > > > On Thu 03-01-19 10:40:54, Yang Shi wrote:
> > > > > > > On 1/3/19 10:13 AM, Michal Hocko wrote:
> > > > [...]
> > > > > > > > Is there any reason for your scripts to be strictly sequential here? In
> > > > > > > > other words why cannot you offload those expensive operations to a
> > > > > > > > detached context in _userspace_?
> > > > > > > I would say it has not to be strictly sequential. The above script is just
> > > > > > > an example to illustrate the pattern. But, sometimes it may hit such pattern
> > > > > > > due to the complicated cluster scheduling and container scheduling in the
> > > > > > > production environment, for example the creation process might be scheduled
> > > > > > > to the same CPU which is doing force_empty. I have to say I don't know too
> > > > > > > much about the internals of the container scheduling.
> > > > > > In that case I do not see a strong reason to implement the offloding
> > > > > > into the kernel. It is an additional code and semantic to maintain.
> > > > > Yes, it does introduce some additional code and semantic, but IMHO, it is
> > > > > quite simple and very straight forward, isn't it? Just utilize the existing
> > > > > css offline worker. And, that a couple of lines of code do improve some
> > > > > throughput issues for some real usecases.
> > > > I do not really care it is few LOC. It is more important that it is
> > > > conflating force_empty into offlining logic. There was a good reason to
> > > > remove reparenting/emptying the memcg during the offline. Considering
> > > > that you can offload force_empty from userspace trivially then I do not
> > > > see any reason to implement it in the kernel.
> > > Er, I may not articulate in the earlier email, force_empty can not be
> > > offloaded from userspace *trivially*. IOWs the container scheduler may
> > > unexpectedly overcommit something due to the stall of synchronous force
> > > empty, which can't be figured out by userspace before it actually happens.
> > > The scheduler doesn't know how long force_empty would take. If the
> > > force_empty could be offloaded by kernel, it would make scheduler's life
> > > much easier. This is not something userspace could do.
> > What exactly prevents
> > (
> > echo 1 > $memecg/force_empty
> > rmdir $memcg
> > ) &
> >
> > so that this sequence doesn't really block anything?
>
> We have "restarting the same name job" logic in our usecase (I'm not quite
> sure why they do so). Basically, it means to create memcg with the exact
> same name right after the old one is deleted, but may have different limit
> or other settings. The creation has to wait for rmdir is done. Even though
> rmdir is done in background like the above, the stall still exists since
> rmdir simply is waiting for force_empty.

OK, I see. This is an important detail you didn't mention previously (or
at least I didn't understand it). One thing is still not clear to me.
"Restarting the same job" sounds as if the memcg itself could be
recycled as well. You are saying that the setting might change but if
that is about limits then we should handle that just fine. Or what other
kinds of setting changes wouldn't work properly?

If the recycling is not possible then I would suggest not reusing the
force_empty interface but adding wipe_on_destruction or a similar new knob
which would enforce reclaim on offlining. It seems we have several
people asking for something like that already.
--
Michal Hocko
SUSE Labs

2019-01-04 18:12:41

by Yang Shi

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/4/19 12:55 AM, Michal Hocko wrote:
> On Thu 03-01-19 20:15:30, Yang Shi wrote:
>>
>> On 1/3/19 12:01 PM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:49:32, Yang Shi wrote:
>>>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>>>> [...]
>>>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>>>> detached context in _userspace_?
>>>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>>>> production environment, for example the creation process might be scheduled
>>>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>>>> much about the internals of the container scheduling.
>>>>>>> In that case I do not see a strong reason to implement the offloding
>>>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>>>> css offline worker. And, that a couple of lines of code do improve some
>>>>>> throughput issues for some real usecases.
>>>>> I do not really care it is few LOC. It is more important that it is
>>>>> conflating force_empty into offlining logic. There was a good reason to
>>>>> remove reparenting/emptying the memcg during the offline. Considering
>>>>> that you can offload force_empty from userspace trivially then I do not
>>>>> see any reason to implement it in the kernel.
>>>> Er, I may not articulate in the earlier email, force_empty can not be
>>>> offloaded from userspace *trivially*. IOWs the container scheduler may
>>>> unexpectedly overcommit something due to the stall of synchronous force
>>>> empty, which can't be figured out by userspace before it actually happens.
>>>> The scheduler doesn't know how long force_empty would take. If the
>>>> force_empty could be offloaded by kernel, it would make scheduler's life
>>>> much easier. This is not something userspace could do.
>>> What exactly prevents
>>> (
>>> echo 1 > $memecg/force_empty
>>> rmdir $memcg
>>> ) &
>>>
>>> so that this sequence doesn't really block anything?
>> We have "restarting the same name job" logic in our usecase (I'm not quite
>> sure why they do so). Basically, it means to create memcg with the exact
>> same name right after the old one is deleted, but may have different limit
>> or other settings. The creation has to wait for rmdir is done. Even though
>> rmdir is done in background like the above, the stall still exists since
>> rmdir simply is waiting for force_empty.
> OK, I see. This is an important detail you didn't mention previously (or
> at least I didn't understand it). One thing is still not clear to me.

Sorry, I should have articulated this in the first place.

> "Restarting the same job" sounds as if the memcg itself could be
> recycled as well. You are saying that the setting might change but if
> that is about limits then we should handle that just fine. Or what other
> kind of setting changes that wouldn't work properly?

We did try resizing the limit, but it may also incur costly direct reclaim
and block something. Other than this, we also want to reset all the
counters/stats to get clearer and cleaner resource isolation, since the
container may run different jobs although they use the same name.

>
> If the recycling is not possible then I would suggest to not reuse
> force_empty interface but add wipe_on_destruction or similar new knob
> which would enforce reclaim on offlining. It seems we have several
> people asking for something like that already.

We did have a new knob in our in-house implementation; it just did
force_empty on offlining.

So, you mean to have a new knob that just does force empty on offlining,
and to keep force_empty's current behavior, right?

Thanks,
Yang



2019-01-04 20:05:21

by Greg Thelen

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty

Yang Shi <[email protected]> wrote:

> On 1/3/19 11:23 AM, Michal Hocko wrote:
>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>
>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>> [...]
>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>> other words why cannot you offload those expensive operations to a
>>>>>> detached context in _userspace_?
>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>> production environment, for example the creation process might be scheduled
>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>> much about the internals of the container scheduling.
>>>> In that case I do not see a strong reason to implement the offloding
>>>> into the kernel. It is an additional code and semantic to maintain.
>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>> css offline worker. And, that a couple of lines of code do improve some
>>> throughput issues for some real usecases.
>> I do not really care it is few LOC. It is more important that it is
>> conflating force_empty into offlining logic. There was a good reason to
>> remove reparenting/emptying the memcg during the offline. Considering
>> that you can offload force_empty from userspace trivially then I do not
>> see any reason to implement it in the kernel.
>
> Er, I may not articulate in the earlier email, force_empty can not be
> offloaded from userspace *trivially*. IOWs the container scheduler may
> unexpectedly overcommit something due to the stall of synchronous force
> empty, which can't be figured out by userspace before it actually
> happens. The scheduler doesn't know how long force_empty would take. If
> the force_empty could be offloaded by kernel, it would make scheduler's
> life much easier. This is not something userspace could do.

If kernel workqueues are doing more work (i.e. force_empty processing),
then it seems like the time to offline could grow. I'm not sure if
that's important.

I assume that if we make force_empty an async side effect of rmdir, then
the user space scheduler would not be able to immediately assume the
rmdir'd container's memory is available without subjecting a new container
to direct reclaim. So it seems like user space would need a mechanism to
wait for reclaim: either the existing sync force_empty or polling
meminfo/etc. waiting for free memory to appear.
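
For instance, a hypothetical polling sketch (NEED_KB is a made-up
threshold, not an existing knob):

# wait until enough free memory shows up before launching the new container
NEED_KB=1048576   # e.g. 1GB, tuned per workload
while [ "$(awk '/^MemFree:/ {print $2}' /proc/meminfo)" -lt "$NEED_KB" ]; do
        sleep 1
done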

>>>> I think it is more important to discuss whether we want to introduce
>>>> force_empty in cgroup v2.
>>> We would prefer have it in v2 as well.
>> Then bring this up in a separate email thread please.
>
> Sure. Will prepare the patches later.
>
> Thanks,
> Yang

2019-01-04 21:46:24

by Yang Shi

Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/4/19 12:03 PM, Greg Thelen wrote:
> Yang Shi <[email protected]> wrote:
>
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>> production environment, for example the creation process might be scheduled
>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>> much about the internals of the container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloding
>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>> css offline worker. And, that a couple of lines of code do improve some
>>>> throughput issues for some real usecases.
>>> I do not really care it is few LOC. It is more important that it is
>>> conflating force_empty into offlining logic. There was a good reason to
>>> remove reparenting/emptying the memcg during the offline. Considering
>>> that you can offload force_empty from userspace trivially then I do not
>>> see any reason to implement it in the kernel.
>> Er, I may not articulate in the earlier email, force_empty can not be
>> offloaded from userspace *trivially*. IOWs the container scheduler may
>> unexpectedly overcommit something due to the stall of synchronous force
>> empty, which can't be figured out by userspace before it actually
>> happens. The scheduler doesn't know how long force_empty would take. If
>> the force_empty could be offloaded by kernel, it would make scheduler's
>> life much easier. This is not something userspace could do.
> If kernel workqueues are doing more work (i.e. force_empty processing),
> then it seems like the time to offline could grow. I'm not sure if
> that's important.

Yes, it would grow. I'm not sure how important that is either, but it
seems fine with our workloads.

The reclaim can be placed at the last step of offline, and it can be
interrupted by signals, e.g. a fatal signal as in the current code.
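
For reference, the interruptible loop I mean is essentially the existing
one in mem_cgroup_force_empty(), simplified below (the last argument of
try_to_free_mem_cgroup_pages() is false with patch 2/3 applied); patch
3/3 would just drive this same loop from the css offline path:

/*
 * Simplified from mem_cgroup_force_empty() in mm/memcontrol.c.  The
 * signal_pending() check is what makes the loop interruptible.
 */
static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
{
	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;

	lru_add_drain_all();
	drain_all_stock(memcg);

	/* try to free all pages in this cgroup */
	while (nr_retries && page_counter_read(&memcg->memory)) {
		int progress;

		if (signal_pending(current))
			return -EINTR;

		progress = try_to_free_mem_cgroup_pages(memcg, 1,
							GFP_KERNEL, false);
		if (!progress) {
			nr_retries--;
			/* maybe some writeback is necessary */
			congestion_wait(BLK_RW_ASYNC, HZ/10);
		}
	}

	return 0;
}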

>
> I assume that if we make force_empty an async side effect of rmdir then
> the user space scheduler would not be able to immediately assume the
> rmdir'd container's memory is available without subjecting a new container
> to direct reclaim. So it seems like user space would use a mechanism to
> wait for reclaim: either the existing sync force_empty or polling
> meminfo/etc waiting for free memory to appear.

Yes, that is the expected side effect; the memory reclaim would happen a
short while later. In this series I keep the synchronous reclaim behavior
of force_empty by checking the written value. Michal suggested a new knob
to do the offline reclaim and keep force_empty intact.

I think which knob to use is at the user's discretion.
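
To make the "checking the written value" part concrete, the dispatch is
conceptually along the lines of the sketch below. It is an illustration
of the idea only, not the exact patch; the delay_force_empty field is a
made-up name.

/*
 * Sketch: 0 or 1 keeps today's synchronous behaviour, any other value
 * just marks the memcg so that reclaim happens later from the css
 * offline worker.
 */
static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
					    char *buf, size_t nbytes,
					    loff_t off)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
	unsigned long val;
	int ret;

	if (mem_cgroup_is_root(memcg))
		return -EINVAL;

	ret = kstrtoul(strstrip(buf), 10, &val);
	if (ret)
		return ret;

	if (val == 0 || val == 1)
		ret = mem_cgroup_force_empty(memcg);	/* sync, as before */
	else
		memcg->delay_force_empty = true;	/* defer to css offline */

	return ret ?: nbytes;
}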

Thanks,
Yang

>
>>>>> I think it is more important to discuss whether we want to introduce
>>>>> force_empty in cgroup v2.
>>>> We would prefer have it in v2 as well.
>>> Then bring this up in a separate email thread please.
>> Sure. Will prepare the patches later.
>>
>> Thanks,
>> Yang


2019-01-04 23:03:01

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/4/19 12:03 PM, Greg Thelen wrote:
> Yang Shi <[email protected]> wrote:
>
>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>> [...]
>>>>>>> Is there any reason for your scripts to be strictly sequential here? In
>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>> detached context in _userspace_?
>>>>>> I would say it has not to be strictly sequential. The above script is just
>>>>>> an example to illustrate the pattern. But, sometimes it may hit such pattern
>>>>>> due to the complicated cluster scheduling and container scheduling in the
>>>>>> production environment, for example the creation process might be scheduled
>>>>>> to the same CPU which is doing force_empty. I have to say I don't know too
>>>>>> much about the internals of the container scheduling.
>>>>> In that case I do not see a strong reason to implement the offloding
>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>> Yes, it does introduce some additional code and semantic, but IMHO, it is
>>>> quite simple and very straight forward, isn't it? Just utilize the existing
>>>> css offline worker. And, that a couple of lines of code do improve some
>>>> throughput issues for some real usecases.
>>> I do not really care it is few LOC. It is more important that it is
>>> conflating force_empty into offlining logic. There was a good reason to
>>> remove reparenting/emptying the memcg during the offline. Considering
>>> that you can offload force_empty from userspace trivially then I do not
>>> see any reason to implement it in the kernel.
>> Er, I may not articulate in the earlier email, force_empty can not be
>> offloaded from userspace *trivially*. IOWs the container scheduler may
>> unexpectedly overcommit something due to the stall of synchronous force
>> empty, which can't be figured out by userspace before it actually
>> happens. The scheduler doesn't know how long force_empty would take. If
>> the force_empty could be offloaded by kernel, it would make scheduler's
>> life much easier. This is not something userspace could do.
> If kernel workqueues are doing more work (i.e. force_empty processing),
> then it seems like the time to offline could grow. I'm not sure if
> that's important.

One thing I can think of is that this may slow down the recycling of
memcg ids, which could lead to memcg id exhaustion under some extreme
workloads. But I don't see this as a problem in our workload.
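
For background, the numbers behind that concern (my summary of the memcg
code of this era, so double-check against the source rather than treating
this as a quote):

/*
 * The memcg id space is only 16 bits wide, and an id is handed back
 * via mem_cgroup_id_put() at the end of mem_cgroup_css_offline(), so
 * whatever lengthens css offline also delays the point at which the
 * id can be reused for a new memcg.
 */
#define MEM_CGROUP_ID_SHIFT	16
#define MEM_CGROUP_ID_MAX	USHRT_MAX	/* at most 65535 live ids */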

Thanks,
Yang

>
> I assume that if we make force_empty an async side effect of rmdir then
> the user space scheduler would not be able to immediately assume the
> rmdir'd container's memory is available without subjecting a new container
> to direct reclaim. So it seems like user space would use a mechanism to
> wait for reclaim: either the existing sync force_empty or polling
> meminfo/etc waiting for free memory to appear.
>
>>>>> I think it is more important to discuss whether we want to introduce
>>>>> force_empty in cgroup v2.
>>>> We would prefer have it in v2 as well.
>>> Then bring this up in a separate email thread please.
>> Sure. Will prepare the patches later.
>>
>> Thanks,
>> Yang


2019-01-04 23:08:16

by Yang Shi

[permalink] [raw]
Subject: Re: [RFC PATCH 0/3] mm: memcontrol: delayed force empty



On 1/4/19 2:57 PM, Yang Shi wrote:
>
>
> On 1/4/19 12:03 PM, Greg Thelen wrote:
>> Yang Shi <[email protected]> wrote:
>>
>>> On 1/3/19 11:23 AM, Michal Hocko wrote:
>>>> On Thu 03-01-19 11:10:00, Yang Shi wrote:
>>>>> On 1/3/19 10:53 AM, Michal Hocko wrote:
>>>>>> On Thu 03-01-19 10:40:54, Yang Shi wrote:
>>>>>>> On 1/3/19 10:13 AM, Michal Hocko wrote:
>>>> [...]
>>>>>>>> Is there any reason for your scripts to be strictly sequential
>>>>>>>> here? In
>>>>>>>> other words why cannot you offload those expensive operations to a
>>>>>>>> detached context in _userspace_?
>>>>>>> I would say it has not to be strictly sequential. The above
>>>>>>> script is just
>>>>>>> an example to illustrate the pattern. But, sometimes it may hit
>>>>>>> such pattern
>>>>>>> due to the complicated cluster scheduling and container
>>>>>>> scheduling in the
>>>>>>> production environment, for example the creation process might
>>>>>>> be scheduled
>>>>>>> to the same CPU which is doing force_empty. I have to say I
>>>>>>> don't know too
>>>>>>> much about the internals of the container scheduling.
>>>>>> In that case I do not see a strong reason to implement the offloding
>>>>>> into the kernel. It is an additional code and semantic to maintain.
>>>>> Yes, it does introduce some additional code and semantic, but
>>>>> IMHO, it is
>>>>> quite simple and very straight forward, isn't it? Just utilize the
>>>>> existing
>>>>> css offline worker. And, that a couple of lines of code do improve
>>>>> some
>>>>> throughput issues for some real usecases.
>>>> I do not really care it is few LOC. It is more important that it is
>>>> conflating force_empty into offlining logic. There was a good
>>>> reason to
>>>> remove reparenting/emptying the memcg during the offline. Considering
>>>> that you can offload force_empty from userspace trivially then I do
>>>> not
>>>> see any reason to implement it in the kernel.
>>> Er, I may not articulate in the earlier email, force_empty can not be
>>> offloaded from userspace *trivially*. IOWs the container scheduler may
>>> unexpectedly overcommit something due to the stall of synchronous force
>>> empty, which can't be figured out by userspace before it actually
>>> happens. The scheduler doesn't know how long force_empty would take. If
>>> the force_empty could be offloaded by kernel, it would make scheduler's
>>> life much easier. This is not something userspace could do.
>> If kernel workqueues are doing more work (i.e. force_empty processing),
>> then it seems like the time to offline could grow. I'm not sure if
>> that's important.
>
> One thing I can think of is that this may slow down the recycling of
> memcg ids, which could lead to memcg id exhaustion under some extreme
> workloads. But I don't see this as a problem in our workload.

Actually, sync force_empty should have the same side effect.

Yang

>
> Thanks,
> Yang
>
>>
>> I assume that if we make force_empty an async side effect of rmdir then
>> the user space scheduler would not be able to immediately assume the
>> rmdir'd container's memory is available without subjecting a new container
>> to direct reclaim. So it seems like user space would use a mechanism to
>> wait for reclaim: either the existing sync force_empty or polling
>> meminfo/etc waiting for free memory to appear.
>>
>>>>>> I think it is more important to discuss whether we want to introduce
>>>>>> force_empty in cgroup v2.
>>>>> We would prefer have it in v2 as well.
>>>> Then bring this up in a separate email thread please.
>>> Sure. Will prepare the patches later.
>>>
>>> Thanks,
>>> Yang
>