LinuxLists.cc - [PATCH v3] memcg: schedule high reclaim for remote memcgs on high

2019-01-10 17:46:53

Subject: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

If a memcg is over high limit, memory reclaim is scheduled to run on
return-to-userland. However it is assumed that the memcg is the current
process's memcg. With remote memcg charging for kmem or swapping in a
page charged to remote memcg, current process can trigger reclaim on
remote memcg. So, scheduling reclaim on return-to-userland for remote
memcgs will ignore the high reclaim altogether. So, record the memcg
needing high reclaim and trigger high reclaim for that memcg on
return-to-userland. However if the memcg is already recorded for high
reclaim and the recorded memcg is not the descendant of the memcg
needing high reclaim, punt the high reclaim to the work queue.

Signed-off-by: Shakeel Butt <[email protected]>
---
Changelog since v2:
- TIF_NOTIFY_RESUME can be set from places other than try_charge() in
which case current->memcg_high_reclaim will be null. Correctly handle
such scenarios.

Changelog since v1:
- Punt high reclaim of a memcg to work queue only if the recorded memcg
is not its descendant.

include/linux/sched.h | 3 +++
kernel/fork.c | 1 +
mm/memcontrol.c | 22 ++++++++++++++++------
3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7d08562eeec7..5e6690042497 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1172,6 +1172,9 @@ struct task_struct {

 	/* Used by memcontrol for targeted memcg charge: */
 	struct mem_cgroup *active_memcg;
+
+	/* Used by memcontrol for high reclaim: */
+	struct mem_cgroup *memcg_high_reclaim;
 #endif

 #ifdef CONFIG_BLK_CGROUP
diff --git a/kernel/fork.c b/kernel/fork.c
index 1b0fde63d831..85da44137847 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -918,6 +918,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)

 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+	tsk->memcg_high_reclaim = NULL;
 #endif
 	return tsk;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 953d4ba8a595..18f4aefbe0bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2168,14 +2168,17 @@ static void high_work_func(struct work_struct *work)
 void mem_cgroup_handle_over_high(void)
 {
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
-	struct mem_cgroup *memcg;
+	struct mem_cgroup *memcg = current->memcg_high_reclaim;

 	if (likely(!nr_pages))
 		return;

-	memcg = get_mem_cgroup_from_mm(current->mm);
+	if (!memcg)
+		memcg = get_mem_cgroup_from_mm(current->mm);
+
 	reclaim_high(memcg, nr_pages, GFP_KERNEL);
 	css_put(&memcg->css);
+	current->memcg_high_reclaim = NULL;
 	current->memcg_nr_pages_over_high = 0;
 }

@@ -2329,10 +2332,10 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * If the hierarchy is above the normal consumption range, schedule
 	 * reclaim on returning to userland. We can perform reclaim here
 	 * if __GFP_RECLAIM but let's always punt for simplicity and so that
-	 * GFP_KERNEL can consistently be used during reclaim. @memcg is
-	 * not recorded as it most likely matches current's and won't
-	 * change in the meantime. As high limit is checked again before
-	 * reclaim, the cost of mismatch is negligible.
+	 * GFP_KERNEL can consistently be used during reclaim. Record the memcg
+	 * for the return-to-userland high reclaim. If the memcg is already
+	 * recorded and the recorded memcg is not the descendant of the memcg
+	 * needing high reclaim, punt the high reclaim to the work queue.
 	 */
 	do {
 		if (page_counter_read(&memcg->memory) > memcg->high) {
@@ -2340,6 +2343,13 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 			if (in_interrupt()) {
 				schedule_work(&memcg->high_work);
 				break;
+			} else if (!current->memcg_high_reclaim) {
+				css_get(&memcg->css);
+				current->memcg_high_reclaim = memcg;
+			} else if (!mem_cgroup_is_descendant(
+					current->memcg_high_reclaim, memcg)) {
+				schedule_work(&memcg->high_work);
+				break;
 			}
 			current->memcg_nr_pages_over_high += batch;
 			set_notify_resume(current);
--
2.20.1.97.g81188d93c3-goog
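
The record-or-punt decision in the try_charge() hunk can be modelled in
user space. The following is only a sketch under stated assumptions: the
struct and function names (toy_memcg, toy_task, check_high,
handle_over_high) are hypothetical stand-ins rather than kernel API, a
plain counter stands in for css_get()/css_put(), and a flag stands in
for schedule_work() on high_work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup (assumption, not the kernel type). */
struct toy_memcg {
	struct toy_memcg *parent;	/* NULL for the root */
	unsigned long usage;		/* pages currently charged */
	unsigned long high;		/* memory.high, in pages */
	int refs;			/* models css_get()/css_put() */
	bool work_queued;		/* models schedule_work(&memcg->high_work) */
};

/* Toy stand-in for the task_struct fields the patch touches. */
struct toy_task {
	struct toy_memcg *high_reclaim;	/* models memcg_high_reclaim */
	unsigned int nr_pages_over_high;
	bool notify_resume;		/* models set_notify_resume() */
};

/* True if @memcg is @root or lies below it in the hierarchy. */
static bool is_descendant(const struct toy_memcg *memcg,
			  const struct toy_memcg *root)
{
	for (; memcg; memcg = memcg->parent)
		if (memcg == root)
			return true;
	return false;
}

/* Models the over-high walk at the end of try_charge() in this patch. */
static void check_high(struct toy_task *cur, struct toy_memcg *memcg,
		       unsigned int batch, bool in_irq)
{
	assert(cur && memcg);
	for (; memcg; memcg = memcg->parent) {
		if (memcg->usage <= memcg->high)
			continue;
		if (in_irq) {
			memcg->work_queued = true;
			break;
		} else if (!cur->high_reclaim) {
			memcg->refs++;			/* record and pin */
			cur->high_reclaim = memcg;
		} else if (!is_descendant(cur->high_reclaim, memcg)) {
			memcg->work_queued = true;	/* unrelated: punt */
			break;
		}
		cur->nr_pages_over_high += batch;
		cur->notify_resume = true;
		break;
	}
}

/* Models mem_cgroup_handle_over_high(); returns the memcg reclaimed. */
static struct toy_memcg *handle_over_high(struct toy_task *cur,
					  struct toy_memcg *own)
{
	struct toy_memcg *target = cur->high_reclaim;

	if (!cur->nr_pages_over_high)
		return NULL;
	if (!target) {			/* flag was set from elsewhere */
		target = own;
		target->refs++;
	}
	/* reclaim_high(target, nr_pages, GFP_KERNEL) would run here. */
	target->refs--;
	cur->high_reclaim = NULL;
	cur->nr_pages_over_high = 0;
	return target;
}
```

In this model, a first over-high charge records the memcg and takes a
reference. A later over-high charge to a sibling memcg is queued to the
work queue and leaves the recorded memcg untouched.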

2019-01-11 21:21:49

by Johannes Weiner

Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

Hi Shakeel,

On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> If a memcg is over high limit, memory reclaim is scheduled to run on
> return-to-userland. However it is assumed that the memcg is the current
> process's memcg. With remote memcg charging for kmem or swapping in a
> page charged to remote memcg, current process can trigger reclaim on
> remote memcg. So, scheduling reclaim on return-to-userland for remote
> memcgs will ignore the high reclaim altogether. So, record the memcg
> needing high reclaim and trigger high reclaim for that memcg on
> return-to-userland. However if the memcg is already recorded for high
> reclaim and the recorded memcg is not the descendant of the memcg
> needing high reclaim, punt the high reclaim to the work queue.

The idea behind remote charging is that the thread allocating the
memory is not responsible for that memory, but a different cgroup
is. Why would the same thread then have to work off any high excess
this could produce in that unrelated group?

Say you have an inotify/dnotify listener that is restricted in its
memory use - now everybody sending notification events from outside
that listener's group would get throttled on a cgroup over which it
has no control. That sounds like a recipe for priority inversions.

It seems to me we should only do reclaim-on-return when current is in
the ill-behaved cgroup, and punt everything else - interrupts and
remote charges - to the workqueue.
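
The policy proposed here can be sketched as a small decision function.
This is an interpretation under stated assumptions, not code from the
thread: "current is in the ill-behaved cgroup" is read as "current's
memcg is the over-high memcg or one of its descendants", and the names
cg, high_action and high_policy are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cgroup node (assumption, not the kernel's struct mem_cgroup). */
struct cg {
	struct cg *parent;	/* NULL for the root */
	unsigned long usage;	/* pages currently charged */
	unsigned long high;	/* memory.high, in pages */
};

enum high_action {
	HIGH_NONE,		/* not over the high limit */
	HIGH_RECLAIM_ON_RETURN,	/* current works off the excess itself */
	HIGH_PUNT_TO_WORKQUEUE,	/* a kworker works off the excess */
};

/* True if @cg is @root or lies below it in the hierarchy. */
static bool cg_is_descendant(const struct cg *cg, const struct cg *root)
{
	for (; cg; cg = cg->parent)
		if (cg == root)
			return true;
	return false;
}

/*
 * @task_cg: the cgroup the charging task belongs to.
 * @charged: the cgroup that was just charged and may be over high.
 */
static enum high_action high_policy(const struct cg *task_cg,
				    const struct cg *charged, bool in_irq)
{
	assert(task_cg && charged);
	if (charged->usage <= charged->high)
		return HIGH_NONE;
	/* Interrupts and remote charges never throttle the charger. */
	if (in_irq || !cg_is_descendant(task_cg, charged))
		return HIGH_PUNT_TO_WORKQUEUE;
	return HIGH_RECLAIM_ON_RETURN;
}
```

With a sender and a listener in sibling cgroups, an event charged to an
over-high listener yields HIGH_PUNT_TO_WORKQUEUE for the sender, so the
sender is not throttled by a limit it does not control.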

2019-01-11 22:56:04

by Shakeel Butt

Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

Hi Johannes,

On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <[email protected]> wrote:
>
> Hi Shakeel,
>
> On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > If a memcg is over high limit, memory reclaim is scheduled to run on
> > return-to-userland. However it is assumed that the memcg is the current
> > process's memcg. With remote memcg charging for kmem or swapping in a
> > page charged to remote memcg, current process can trigger reclaim on
> > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > memcgs will ignore the high reclaim altogether. So, record the memcg
> > needing high reclaim and trigger high reclaim for that memcg on
> > return-to-userland. However if the memcg is already recorded for high
> > reclaim and the recorded memcg is not the descendant of the memcg
> > needing high reclaim, punt the high reclaim to the work queue.
>
> The idea behind remote charging is that the thread allocating the
> memory is not responsible for that memory, but a different cgroup
> is. Why would the same thread then have to work off any high excess
> this could produce in that unrelated group?
>
> Say you have an inotify/dnotify listener that is restricted in its
> memory use - now everybody sending notification events from outside
> that listener's group would get throttled on a cgroup over which it
> has no control. That sounds like a recipe for priority inversions.
>
> It seems to me we should only do reclaim-on-return when current is in
> the ill-behaved cgroup, and punt everything else - interrupts and
> remote charges - to the workqueue.

This is what v1 of this patch was doing but Michal suggested to do
what this version is doing. Michal's argument was that the current is
already charging and maybe reclaiming a remote memcg then why not do
the high excess reclaim as well.

Personally I don't have any strong opinion either way. What I actually
wanted was to punt this high reclaim to some process in that remote
memcg. However I didn't explore much on that direction thinking if
that complexity is worth it. Maybe I should at least explore it, so,
we can compare the solutions. What do you think?

Shakeel

2019-01-13 19:55:31

by Michal Hocko

Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> Hi Johannes,
>
> On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <[email protected]> wrote:
> >
> > Hi Shakeel,
> >
> > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > If a memcg is over high limit, memory reclaim is scheduled to run on
> > > return-to-userland. However it is assumed that the memcg is the current
> > > process's memcg. With remote memcg charging for kmem or swapping in a
> > > page charged to remote memcg, current process can trigger reclaim on
> > > > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > > > memcgs will ignore the high reclaim altogether. So, record the memcg
> > > > needing high reclaim and trigger high reclaim for that memcg on
> > > > return-to-userland. However if the memcg is already recorded for high
> > > > reclaim and the recorded memcg is not the descendant of the memcg
> > > needing high reclaim, punt the high reclaim to the work queue.
> >
> > The idea behind remote charging is that the thread allocating the
> > memory is not responsible for that memory, but a different cgroup
> > is. Why would the same thread then have to work off any high excess
> > this could produce in that unrelated group?
> >
> > > Say you have an inotify/dnotify listener that is restricted in its
> > memory use - now everybody sending notification events from outside
> > that listener's group would get throttled on a cgroup over which it
> > has no control. That sounds like a recipe for priority inversions.
> >
> > It seems to me we should only do reclaim-on-return when current is in
> > the ill-behaved cgroup, and punt everything else - interrupts and
> > remote charges - to the workqueue.
>
> This is what v1 of this patch was doing but Michal suggested to do
> what this version is doing. Michal's argument was that the current is
> already charging and maybe reclaiming a remote memcg then why not do
> the high excess reclaim as well.

Johannes has a good point about the priority inversion problems which I
haven't thought about.

> Personally I don't have any strong opinion either way. What I actually
> wanted was to punt this high reclaim to some process in that remote
> memcg. However I didn't explore much on that direction thinking if
> that complexity is worth it. Maybe I should at least explore it, so,
> we can compare the solutions. What do you think?

My question would be whether we really care all that much. Do we know of
workloads which would generate a large high limit excess?
--
Michal Hocko
SUSE Labs

2019-01-14 20:20:14

by Shakeel Butt

Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <[email protected]> wrote:
>
> On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > Hi Johannes,
> >
> > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <[email protected]> wrote:
> > >
> > > Hi Shakeel,
> > >
> > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > If a memcg is over high limit, memory reclaim is scheduled to run on
> > > > return-to-userland. However it is assumed that the memcg is the current
> > > > process's memcg. With remote memcg charging for kmem or swapping in a
> > > > page charged to remote memcg, current process can trigger reclaim on
> > > > > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > > > > memcgs will ignore the high reclaim altogether. So, record the memcg
> > > > > needing high reclaim and trigger high reclaim for that memcg on
> > > > > return-to-userland. However if the memcg is already recorded for high
> > > > > reclaim and the recorded memcg is not the descendant of the memcg
> > > > needing high reclaim, punt the high reclaim to the work queue.
> > >
> > > The idea behind remote charging is that the thread allocating the
> > > memory is not responsible for that memory, but a different cgroup
> > > is. Why would the same thread then have to work off any high excess
> > > this could produce in that unrelated group?
> > >
> > > > Say you have an inotify/dnotify listener that is restricted in its
> > > memory use - now everybody sending notification events from outside
> > > that listener's group would get throttled on a cgroup over which it
> > > has no control. That sounds like a recipe for priority inversions.
> > >
> > > It seems to me we should only do reclaim-on-return when current is in
> > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > remote charges - to the workqueue.
> >
> > This is what v1 of this patch was doing but Michal suggested to do
> > what this version is doing. Michal's argument was that the current is
> > already charging and maybe reclaiming a remote memcg then why not do
> > the high excess reclaim as well.
>
> Johannes has a good point about the priority inversion problems which I
> haven't thought about.
>
> > Personally I don't have any strong opinion either way. What I actually
> > wanted was to punt this high reclaim to some process in that remote
> > memcg. However I didn't explore much on that direction thinking if
> > that complexity is worth it. Maybe I should at least explore it, so,
> > we can compare the solutions. What do you think?
>
> My question would be whether we really care all that much. Do we know of
> workloads which would generate a large high limit excess?
>

The current semantics of memory.high is that it can be breached under
extreme conditions. However any workload where memory.high is used and
a lot of remote memcg charging happens (inotify/dnotify example given
by Johannes or swapping in tmpfs file or shared memory region) the
memory.high breach will become common.

Shakeel

2019-01-15 08:55:35

by Michal Hocko

Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <[email protected]> wrote:
> >
> > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > Hi Johannes,
> > >
> > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <[email protected]> wrote:
> > > >
> > > > Hi Shakeel,
> > > >
> > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > If a memcg is over high limit, memory reclaim is scheduled to run on
> > > > > return-to-userland. However it is assumed that the memcg is the current
> > > > > process's memcg. With remote memcg charging for kmem or swapping in a
> > > > > page charged to remote memcg, current process can trigger reclaim on
> > > > > > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > > > > > memcgs will ignore the high reclaim altogether. So, record the memcg
> > > > > > needing high reclaim and trigger high reclaim for that memcg on
> > > > > > return-to-userland. However if the memcg is already recorded for high
> > > > > > reclaim and the recorded memcg is not the descendant of the memcg
> > > > > needing high reclaim, punt the high reclaim to the work queue.
> > > >
> > > > The idea behind remote charging is that the thread allocating the
> > > > memory is not responsible for that memory, but a different cgroup
> > > > is. Why would the same thread then have to work off any high excess
> > > > this could produce in that unrelated group?
> > > >
> > > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > memory use - now everybody sending notification events from outside
> > > > that listener's group would get throttled on a cgroup over which it
> > > > has no control. That sounds like a recipe for priority inversions.
> > > >
> > > > It seems to me we should only do reclaim-on-return when current is in
> > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > remote charges - to the workqueue.
> > >
> > > This is what v1 of this patch was doing but Michal suggested to do
> > > what this version is doing. Michal's argument was that the current is
> > > already charging and maybe reclaiming a remote memcg then why not do
> > > the high excess reclaim as well.
> >
> > Johannes has a good point about the priority inversion problems which I
> > haven't thought about.
> >
> > > Personally I don't have any strong opinion either way. What I actually
> > > wanted was to punt this high reclaim to some process in that remote
> > > memcg. However I didn't explore much on that direction thinking if
> > > that complexity is worth it. Maybe I should at least explore it, so,
> > > we can compare the solutions. What do you think?
> >
> > My question would be whether we really care all that much. Do we know of
> > workloads which would generate a large high limit excess?
> >
>
> The current semantics of memory.high is that it can be breached under
> extreme conditions. However any workload where memory.high is used and
> a lot of remote memcg charging happens (inotify/dnotify example given
> by Johannes or swapping in tmpfs file or shared memory region) the
> memory.high breach will become common.

This is exactly what I am asking about. Is this something that can
happen easily? Remote charges on their own should be rare, no?
--
Michal Hocko
SUSE Labs

2019-01-16 10:01:21

by Shakeel Butt

Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko <[email protected]> wrote:
>
> On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> > On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <[email protected]> wrote:
> > >
> > > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > > Hi Johannes,
> > > >
> > > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <[email protected]> wrote:
> > > > >
> > > > > Hi Shakeel,
> > > > >
> > > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > > If a memcg is over high limit, memory reclaim is scheduled to run on
> > > > > > return-to-userland. However it is assumed that the memcg is the current
> > > > > > process's memcg. With remote memcg charging for kmem or swapping in a
> > > > > > page charged to remote memcg, current process can trigger reclaim on
> > > > > > > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > > > > > > memcgs will ignore the high reclaim altogether. So, record the memcg
> > > > > > > needing high reclaim and trigger high reclaim for that memcg on
> > > > > > > return-to-userland. However if the memcg is already recorded for high
> > > > > > > reclaim and the recorded memcg is not the descendant of the memcg
> > > > > > needing high reclaim, punt the high reclaim to the work queue.
> > > > >
> > > > > The idea behind remote charging is that the thread allocating the
> > > > > memory is not responsible for that memory, but a different cgroup
> > > > > is. Why would the same thread then have to work off any high excess
> > > > > this could produce in that unrelated group?
> > > > >
> > > > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > > memory use - now everybody sending notification events from outside
> > > > > that listener's group would get throttled on a cgroup over which it
> > > > > has no control. That sounds like a recipe for priority inversions.
> > > > >
> > > > > It seems to me we should only do reclaim-on-return when current is in
> > > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > > remote charges - to the workqueue.
> > > >
> > > > This is what v1 of this patch was doing but Michal suggested to do
> > > > what this version is doing. Michal's argument was that the current is
> > > > already charging and maybe reclaiming a remote memcg then why not do
> > > > the high excess reclaim as well.
> > >
> > > Johannes has a good point about the priority inversion problems which I
> > > haven't thought about.
> > >
> > > > Personally I don't have any strong opinion either way. What I actually
> > > > wanted was to punt this high reclaim to some process in that remote
> > > > memcg. However I didn't explore much on that direction thinking if
> > > > that complexity is worth it. Maybe I should at least explore it, so,
> > > > we can compare the solutions. What do you think?
> > >
> > > My question would be whether we really care all that much. Do we know of
> > > workloads which would generate a large high limit excess?
> > >
> >
> > The current semantics of memory.high is that it can be breached under
> > extreme conditions. However any workload where memory.high is used and
> > a lot of remote memcg charging happens (inotify/dnotify example given
> > by Johannes or swapping in tmpfs file or shared memory region) the
> > memory.high breach will become common.
>
> This is exactly what I am asking about. Is this something that can
> happen easily? Remote charges on their own should be rare, no?
>

At the moment, for kmem we can do remote charging for fanotify,
inotify and buffer_head and for anon pages we can do remote charging
on swap in. Now based on the workload's cgroup setup the remote
charging can be very frequent or rare.

At Google, remote charging is very frequent but since we are still on
cgroup-v1 and do not use memory.high, the issue this patch is fixing
is not observed. However for the adoption of cgroup-v2, this fix is
needed.

Shakeel
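
Remote charging as described above can be illustrated with a toy model.
This is a sketch under stated assumptions: toy_task.active plays the
role of the active_memcg field visible in the patch context, and the
helpers use_memcg, unuse_memcg, toy_charge and is_remote_charge are
hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy cgroup with a single usage counter (assumption). */
struct toy_memcg {
	const char *name;
	unsigned long usage;		/* pages currently charged */
};

/* Toy task: its own memcg plus an optional charge override. */
struct toy_task {
	struct toy_memcg *own;		/* memcg the task lives in */
	struct toy_memcg *active;	/* models active_memcg, may be NULL */
};

/* Redirect subsequent charges by @t to @target. */
static void use_memcg(struct toy_task *t, struct toy_memcg *target)
{
	assert(t && target);
	t->active = target;
}

/* Stop redirecting; charges go to the task's own memcg again. */
static void unuse_memcg(struct toy_task *t)
{
	assert(t);
	t->active = NULL;
}

/* A charge is remote when it lands in a memcg other than the task's. */
static bool is_remote_charge(const struct toy_task *t)
{
	return t->active && t->active != t->own;
}

/* Charge @nr_pages and return the memcg that was billed. */
static struct toy_memcg *toy_charge(struct toy_task *t,
				    unsigned long nr_pages)
{
	struct toy_memcg *m = t->active ? t->active : t->own;

	m->usage += nr_pages;
	return m;
}
```

For example, a task in a "sender" cgroup that queues an event for a
listener would wrap the allocation in use_memcg(&t, &listener) and
unuse_memcg(&t), so the pages are billed to the listener's cgroup even
though the sender's thread performs the allocation.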

2019-01-16 16:49:51

by Michal Hocko

Subject: Re: [PATCH v3] memcg: schedule high reclaim for remote memcgs on high_work

On Tue 15-01-19 11:38:23, Shakeel Butt wrote:
> On Mon, Jan 14, 2019 at 11:25 PM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 14-01-19 12:18:07, Shakeel Butt wrote:
> > > On Sun, Jan 13, 2019 at 10:34 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > > On Fri 11-01-19 14:54:32, Shakeel Butt wrote:
> > > > > Hi Johannes,
> > > > >
> > > > > On Fri, Jan 11, 2019 at 12:59 PM Johannes Weiner <[email protected]> wrote:
> > > > > >
> > > > > > Hi Shakeel,
> > > > > >
> > > > > > On Thu, Jan 10, 2019 at 09:44:32AM -0800, Shakeel Butt wrote:
> > > > > > > If a memcg is over high limit, memory reclaim is scheduled to run on
> > > > > > > return-to-userland. However it is assumed that the memcg is the current
> > > > > > > process's memcg. With remote memcg charging for kmem or swapping in a
> > > > > > > page charged to remote memcg, current process can trigger reclaim on
> > > > > > > > remote memcg. So, scheduling reclaim on return-to-userland for remote
> > > > > > > > memcgs will ignore the high reclaim altogether. So, record the memcg
> > > > > > > > needing high reclaim and trigger high reclaim for that memcg on
> > > > > > > > return-to-userland. However if the memcg is already recorded for high
> > > > > > > > reclaim and the recorded memcg is not the descendant of the memcg
> > > > > > > needing high reclaim, punt the high reclaim to the work queue.
> > > > > >
> > > > > > The idea behind remote charging is that the thread allocating the
> > > > > > memory is not responsible for that memory, but a different cgroup
> > > > > > is. Why would the same thread then have to work off any high excess
> > > > > > this could produce in that unrelated group?
> > > > > >
> > > > > > > Say you have an inotify/dnotify listener that is restricted in its
> > > > > > memory use - now everybody sending notification events from outside
> > > > > > that listener's group would get throttled on a cgroup over which it
> > > > > > has no control. That sounds like a recipe for priority inversions.
> > > > > >
> > > > > > It seems to me we should only do reclaim-on-return when current is in
> > > > > > the ill-behaved cgroup, and punt everything else - interrupts and
> > > > > > remote charges - to the workqueue.
> > > > >
> > > > > This is what v1 of this patch was doing but Michal suggested to do
> > > > > what this version is doing. Michal's argument was that the current is
> > > > > already charging and maybe reclaiming a remote memcg then why not do
> > > > > the high excess reclaim as well.
> > > >
> > > > Johannes has a good point about the priority inversion problems which I
> > > > haven't thought about.
> > > >
> > > > > Personally I don't have any strong opinion either way. What I actually
> > > > > wanted was to punt this high reclaim to some process in that remote
> > > > > memcg. However I didn't explore much on that direction thinking if
> > > > > that complexity is worth it. Maybe I should at least explore it, so,
> > > > > we can compare the solutions. What do you think?
> > > >
> > > > My question would be whether we really care all that much. Do we know of
> > > > workloads which would generate a large high limit excess?
> > > >
> > >
> > > The current semantics of memory.high is that it can be breached under
> > > extreme conditions. However any workload where memory.high is used and
> > > a lot of remote memcg charging happens (inotify/dnotify example given
> > > by Johannes or swapping in tmpfs file or shared memory region) the
> > > memory.high breach will become common.
> >
> > This is exactly what I am asking about. Is this something that can
> > happen easily? Remote charges on their own should be rare, no?
> >
>
> At the moment, for kmem we can do remote charging for fanotify,
> inotify and buffer_head and for anon pages we can do remote charging
> on swap in. Now based on the workload's cgroup setup the remote
> charging can be very frequent or rare.
>
> At Google, remote charging is very frequent but since we are still on
> cgroup-v1 and do not use memory.high, the issue this patch is fixing
> is not observed. However for the adoption of cgroup-v2, this fix is
> needed.

Adding some numbers into the changelog would be really valuable to judge
the urgency and the scale of the problem. If we are going via kworker
then it is also important to evaluate what kind of effect on the system
this has. How big an excess can we get? Why don't those memcgs
resolve the excess by themselves on the first direct charge? Is it
possible that kworkers simply swamp the system with many parallel memcgs
with remote charges?

In other words we need deeper analysis of the problem and the solution.
--
Michal Hocko
SUSE Labs