We have received regression reports from users whose workloads moved
into containers and subsequently encountered new latencies. For some
users these were a nuisance, but for others it meant missing their SLA
response times. We tracked those delays down to cgroup limits, which
inject direct reclaim stalls into the workload where previously all
reclaim was handled by kswapd.
This patch adds asynchronous reclaim to the memory.high cgroup limit
while keeping direct reclaim as a fallback. In our testing, this
eliminated all direct reclaim from the affected workload.
memory.high has a grace buffer of about 4% between when it becomes
exceeded and when allocating threads get throttled. We can use the
same buffer for the async reclaimer to operate in. If the worker
cannot keep up and the grace buffer is exceeded, allocating threads
will fall back to direct reclaim before getting throttled.
For irq-context, there's already async memory.high enforcement. Re-use
that work item for all allocating contexts, but switch it to the
unbound workqueue so reclaim work doesn't compete with the workload.
The work item is per cgroup, which means the workqueue infrastructure
will create at maximum one worker thread per reclaiming cgroup.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
mm/vmscan.c | 10 +++++++--
2 files changed, 54 insertions(+), 16 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf02e3ef3ed9..bad838d9c2bb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1446,6 +1446,10 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
seq_buf_printf(&s, "pgsteal %lu\n",
memcg_events(memcg, PGSTEAL_KSWAPD) +
memcg_events(memcg, PGSTEAL_DIRECT));
+ seq_buf_printf(&s, "pgscan_direct %lu\n",
+ memcg_events(memcg, PGSCAN_DIRECT));
+ seq_buf_printf(&s, "pgsteal_direct %lu\n",
+ memcg_events(memcg, PGSTEAL_DIRECT));
seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGACTIVATE),
memcg_events(memcg, PGACTIVATE));
seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGDEACTIVATE),
@@ -2235,10 +2239,19 @@ static void reclaim_high(struct mem_cgroup *memcg,
static void high_work_func(struct work_struct *work)
{
+ unsigned long high, usage;
struct mem_cgroup *memcg;
memcg = container_of(work, struct mem_cgroup, high_work);
- reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
+
+ high = READ_ONCE(memcg->high);
+ usage = page_counter_read(&memcg->memory);
+
+ if (usage <= high)
+ return;
+
+ set_worker_desc("cswapd/%llx", cgroup_id(memcg->css.cgroup));
+ reclaim_high(memcg, usage - high, GFP_KERNEL);
}
/*
@@ -2304,15 +2317,22 @@ void mem_cgroup_handle_over_high(void)
unsigned long pflags;
unsigned long penalty_jiffies, overage;
unsigned int nr_pages = current->memcg_nr_pages_over_high;
+ bool tried_direct_reclaim = false;
struct mem_cgroup *memcg;
if (likely(!nr_pages))
return;
- memcg = get_mem_cgroup_from_mm(current->mm);
- reclaim_high(memcg, nr_pages, GFP_KERNEL);
current->memcg_nr_pages_over_high = 0;
+ memcg = get_mem_cgroup_from_mm(current->mm);
+ high = READ_ONCE(memcg->high);
+recheck:
+ usage = page_counter_read(&memcg->memory);
+
+ if (usage <= high)
+ goto out;
+
/*
* memory.high is breached and reclaim is unable to keep up. Throttle
* allocators proactively to slow down excessive growth.
@@ -2325,12 +2345,6 @@ void mem_cgroup_handle_over_high(void)
* overage amount.
*/
- usage = page_counter_read(&memcg->memory);
- high = READ_ONCE(memcg->high);
-
- if (usage <= high)
- goto out;
-
/*
* Prevent division by 0 in overage calculation by acting as if it was a
* threshold of 1 page
@@ -2369,6 +2383,16 @@ void mem_cgroup_handle_over_high(void)
if (penalty_jiffies <= HZ / 100)
goto out;
+ /*
+ * It's possible async reclaim just isn't able to keep
+ * up. Before we go to sleep, try direct reclaim.
+ */
+ if (!tried_direct_reclaim) {
+ reclaim_high(memcg, nr_pages, GFP_KERNEL);
+ tried_direct_reclaim = true;
+ goto recheck;
+ }
+
/*
* If we exit early, we're guaranteed to die (since
* schedule_timeout_killable sets TASK_KILLABLE). This means we don't
@@ -2544,13 +2568,21 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
do {
if (page_counter_read(&memcg->memory) > memcg->high) {
+ /*
+ * Kick off the async reclaimer, which should
+ * be doing most of the work to avoid latency
+ * in the workload. But also check in on its
+ * progress before resuming to userspace, in
+ * case we need to do direct reclaim, or even
+ * throttle the allocating thread if reclaim
+ * cannot keep up with allocation demand.
+ */
+ queue_work(system_unbound_wq, &memcg->high_work);
/* Don't bother a random interrupted task */
- if (in_interrupt()) {
- schedule_work(&memcg->high_work);
- break;
+ if (!in_interrupt()) {
+ current->memcg_nr_pages_over_high += batch;
+ set_notify_resume(current);
}
- current->memcg_nr_pages_over_high += batch;
- set_notify_resume(current);
break;
}
} while ((memcg = parent_mem_cgroup(memcg)));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74e8edce83ca..d6085115c7f2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1947,7 +1947,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
reclaim_stat->recent_scanned[file] += nr_taken;
- item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
+ if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
+ item = PGSCAN_KSWAPD;
+ else
+ item = PGSCAN_DIRECT;
if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
@@ -1961,7 +1964,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);
- item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
+ if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
+ item = PGSTEAL_KSWAPD;
+ else
+ item = PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_reclaimed);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
--
2.24.1
On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> We have received regression reports from users whose workloads moved
> into containers and subsequently encountered new latencies. For some
> users these were a nuisance, but for others it meant missing their SLA
> response times. We tracked those delays down to cgroup limits, which
> inject direct reclaim stalls into the workload where previously all
> reclaim was handled by kswapd.
I am curious why this is unexpected when the high limit is explicitly
documented as a throttling mechanism.
> This patch adds asynchronous reclaim to the memory.high cgroup limit
> while keeping direct reclaim as a fallback. In our testing, this
> eliminated all direct reclaim from the affected workload.
Who is accounted for all the work? Unless I am missing something this
just gets hidden in the system activity and that might hurt the
isolation. I do see how moving the work to a different context is
desirable but this work has to be accounted properly when it is going to
become a normal mode of operation (rather than a rare exception like the
existing irq context handling).
> memory.high has a grace buffer of about 4% between when it becomes
> exceeded and when allocating threads get throttled. We can use the
> same buffer for the async reclaimer to operate in. If the worker
> cannot keep up and the grace buffer is exceeded, allocating threads
> will fall back to direct reclaim before getting throttled.
>
> For irq-context, there's already async memory.high enforcement. Re-use
> that work item for all allocating contexts, but switch it to the
> unbound workqueue so reclaim work doesn't compete with the workload.
> The work item is per cgroup, which means the workqueue infrastructure
> will create at maximum one worker thread per reclaiming cgroup.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
> mm/vmscan.c | 10 +++++++--
> 2 files changed, 54 insertions(+), 16 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cf02e3ef3ed9..bad838d9c2bb 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1446,6 +1446,10 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> seq_buf_printf(&s, "pgsteal %lu\n",
> memcg_events(memcg, PGSTEAL_KSWAPD) +
> memcg_events(memcg, PGSTEAL_DIRECT));
> + seq_buf_printf(&s, "pgscan_direct %lu\n",
> + memcg_events(memcg, PGSCAN_DIRECT));
> + seq_buf_printf(&s, "pgsteal_direct %lu\n",
> + memcg_events(memcg, PGSTEAL_DIRECT));
> seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGACTIVATE),
> memcg_events(memcg, PGACTIVATE));
> seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGDEACTIVATE),
> @@ -2235,10 +2239,19 @@ static void reclaim_high(struct mem_cgroup *memcg,
>
> static void high_work_func(struct work_struct *work)
> {
> + unsigned long high, usage;
> struct mem_cgroup *memcg;
>
> memcg = container_of(work, struct mem_cgroup, high_work);
> - reclaim_high(memcg, MEMCG_CHARGE_BATCH, GFP_KERNEL);
> +
> + high = READ_ONCE(memcg->high);
> + usage = page_counter_read(&memcg->memory);
> +
> + if (usage <= high)
> + return;
> +
> + set_worker_desc("cswapd/%llx", cgroup_id(memcg->css.cgroup));
> + reclaim_high(memcg, usage - high, GFP_KERNEL);
> }
>
> /*
> @@ -2304,15 +2317,22 @@ void mem_cgroup_handle_over_high(void)
> unsigned long pflags;
> unsigned long penalty_jiffies, overage;
> unsigned int nr_pages = current->memcg_nr_pages_over_high;
> + bool tried_direct_reclaim = false;
> struct mem_cgroup *memcg;
>
> if (likely(!nr_pages))
> return;
>
> - memcg = get_mem_cgroup_from_mm(current->mm);
> - reclaim_high(memcg, nr_pages, GFP_KERNEL);
> current->memcg_nr_pages_over_high = 0;
>
> + memcg = get_mem_cgroup_from_mm(current->mm);
> + high = READ_ONCE(memcg->high);
> +recheck:
> + usage = page_counter_read(&memcg->memory);
> +
> + if (usage <= high)
> + goto out;
> +
> /*
> * memory.high is breached and reclaim is unable to keep up. Throttle
> * allocators proactively to slow down excessive growth.
> @@ -2325,12 +2345,6 @@ void mem_cgroup_handle_over_high(void)
> * overage amount.
> */
>
> - usage = page_counter_read(&memcg->memory);
> - high = READ_ONCE(memcg->high);
> -
> - if (usage <= high)
> - goto out;
> -
> /*
> * Prevent division by 0 in overage calculation by acting as if it was a
> * threshold of 1 page
> @@ -2369,6 +2383,16 @@ void mem_cgroup_handle_over_high(void)
> if (penalty_jiffies <= HZ / 100)
> goto out;
>
> + /*
> + * It's possible async reclaim just isn't able to keep
> + * up. Before we go to sleep, try direct reclaim.
> + */
> + if (!tried_direct_reclaim) {
> + reclaim_high(memcg, nr_pages, GFP_KERNEL);
> + tried_direct_reclaim = true;
> + goto recheck;
> + }
> +
> /*
> * If we exit early, we're guaranteed to die (since
> * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
> @@ -2544,13 +2568,21 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> */
> do {
> if (page_counter_read(&memcg->memory) > memcg->high) {
> + /*
> + * Kick off the async reclaimer, which should
> + * be doing most of the work to avoid latency
> + * in the workload. But also check in on its
> + * progress before resuming to userspace, in
> + * case we need to do direct reclaim, or even
> + * throttle the allocating thread if reclaim
> + * cannot keep up with allocation demand.
> + */
> + queue_work(system_unbound_wq, &memcg->high_work);
> /* Don't bother a random interrupted task */
> - if (in_interrupt()) {
> - schedule_work(&memcg->high_work);
> - break;
> + if (!in_interrupt()) {
> + current->memcg_nr_pages_over_high += batch;
> + set_notify_resume(current);
> }
> - current->memcg_nr_pages_over_high += batch;
> - set_notify_resume(current);
> break;
> }
> } while ((memcg = parent_mem_cgroup(memcg)));
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 74e8edce83ca..d6085115c7f2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1947,7 +1947,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
> reclaim_stat->recent_scanned[file] += nr_taken;
>
> - item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
> + if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> + item = PGSCAN_KSWAPD;
> + else
> + item = PGSCAN_DIRECT;
> if (!cgroup_reclaim(sc))
> __count_vm_events(item, nr_scanned);
> __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
> @@ -1961,7 +1964,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>
> spin_lock_irq(&pgdat->lru_lock);
>
> - item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
> + if (current_is_kswapd() || (cgroup_reclaim(sc) && current_work()))
> + item = PGSTEAL_KSWAPD;
> + else
> + item = PGSTEAL_DIRECT;
> if (!cgroup_reclaim(sc))
> __count_vm_events(item, nr_reclaimed);
> __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
> --
> 2.24.1
>
--
Michal Hocko
SUSE Labs
On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > We have received regression reports from users whose workloads moved
> > into containers and subsequently encountered new latencies. For some
> > users these were a nuisance, but for others it meant missing their SLA
> > response times. We tracked those delays down to cgroup limits, which
> > inject direct reclaim stalls into the workload where previously all
> > reclaim was handled by kswapd.
>
> I am curious why this is unexpected when the high limit is explicitly
> documented as a throttling mechanism.
Memory.high is supposed to curb aggressive growth using throttling
instead of OOM killing. However, if the workload has plenty of easily
reclaimable memory and just needs to recycle a couple of cache pages
to permit an allocation, there is no need to throttle the workload -
just as there wouldn't be any need to trigger the OOM killer.
So it's not unexpected, but it's unnecessarily heavy-handed: since
memory.high allows some flexibility around the target size, we can
move the routine reclaim activity (cache recycling) out of the main
execution stream of the workload, just like we do with kswapd. If that
cannot keep up, we can throttle and do direct reclaim.
It doesn't change the memory.high semantics, but it allows exploiting
the fact that we have SMP systems and can parallelize the bookkeeping.
> > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > while keeping direct reclaim as a fallback. In our testing, this
> > eliminated all direct reclaim from the affected workload.
>
> Who is accounted for all the work? Unless I am missing something this
> just gets hidden in the system activity and that might hurt the
> isolation. I do see how moving the work to a different context is
> desirable but this work has to be accounted properly when it is going to
> become a normal mode of operation (rather than a rare exception like the
> existing irq context handling).
Yes, the plan is to account it to the cgroup on whose behalf we're
doing the work.
The problem is that we have a general lack of usable CPU control right
now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
For workloads that are contended on CPU, we cannot enable the CPU
controller because the scheduling latencies are too high. And for
workloads that aren't CPU contended, well, it doesn't really matter
where the reclaim cycles are accounted to.
Once we have the CPU controller up to speed, we can add annotations
like these to account stretches of execution to specific
cgroups. There just isn't much point to do it before we can actually
enable CPU control on the real workloads where it would matter.
[ This is generally work in progress: for example, if you isolate
workloads with memory.low, kswapd cpu time isn't accounted to the
cgroup that causes it. Swap IO issued by kswapd isn't accounted to
the group that is getting swapped. The memory consumed by struct
cgroup itself, the percpu allocations for the vmstat arrays etc.,
which is sizable, are not accounted to the cgroup that creates
subgroups, and so forth. ]
Michal Hocko writes:
>On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
>> We have received regression reports from users whose workloads moved
>> into containers and subsequently encountered new latencies. For some
>> users these were a nuisance, but for others it meant missing their SLA
>> response times. We tracked those delays down to cgroup limits, which
>> inject direct reclaim stalls into the workload where previously all
>> reclaim was handled by kswapd.
>
>I am curious why this is unexpected when the high limit is explicitly
>documented as a throttling mechanism.
Throttling is what one expects if one's workload does not respect the high
threshold (i.e. if it's not possible to reclaim the requisite number of pages),
but you don't expect throttling just because you're brushing up against the
threshold, because we only reclaim at certain points (e.g. return to usermode).
That is, the workload may be well-behaved and it's just we didn't get around to
completing reclaim yet. In that case, throttling the workload when there's no
evidence it's misbehaving seems unduly harsh, hence the ~4% grace, with
exponential penalties as there's more evidence the workload is pathological.
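(For reference, the return-to-usermode hook in kernels of this vintage
looks roughly like this, abridged from include/linux/tracehook.h:)
static inline void tracehook_notify_resume(struct pt_regs *regs)
{
	clear_thread_flag(TIF_NOTIFY_RESUME);
	smp_mb__after_atomic();
	if (unlikely(current->task_works))
		task_work_run();
	/* deferred memory.high enforcement runs here */
	mem_cgroup_handle_over_high();
	blkcg_maybe_throttle_current();
}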
So sure, memory.high is a throttling mechanism if you *exceed* the stated
bounds of your allocation, but this is about even those applications which are
well-behaved, and just brushing against the bounds of it, as is expected on a
system where the bottleneck is at the cgroup rather than being global.
On Wed, 19 Feb 2020 19:37:31 +0100 Michal Hocko <[email protected]> wrote:
> On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > We have received regression reports from users whose workloads moved
> > into containers and subsequently encountered new latencies. For some
> > users these were a nuisance, but for others it meant missing their SLA
> > response times. We tracked those delays down to cgroup limits, which
> > inject direct reclaim stalls into the workload where previously all
> > reclaim was handled by kswapd.
>
> I am curious why this is unexpected when the high limit is explicitly
> documented as a throttling mechanism.
Yes, this sounds like a feature-not-a-bug.
But what was the nature of these stalls? If they were "stuck in D
state waiting for something" then that's throttling. If they were
"unexpected bursts of in-kernel CPU activity" then I see a better case.
On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > We have received regression reports from users whose workloads moved
> > > into containers and subsequently encountered new latencies. For some
> > > users these were a nuisance, but for others it meant missing their SLA
> > > response times. We tracked those delays down to cgroup limits, which
> > > inject direct reclaim stalls into the workload where previously all
> > > reclaim was handled by kswapd.
> >
> > I am curious why this is unexpected when the high limit is explicitly
> > documented as a throttling mechanism.
>
> Memory.high is supposed to curb aggressive growth using throttling
> instead of OOM killing. However, if the workload has plenty of easily
> reclaimable memory and just needs to recycle a couple of cache pages
> to permit an allocation, there is no need to throttle the workload -
> just as there wouldn't be any need to trigger the OOM killer.
>
> So it's not unexpected, but it's unnecessarily heavy-handed: since
> memory.high allows some flexibility around the target size, we can
> move the routine reclaim activity (cache recycling) out of the main
> execution stream of the workload, just like we do with kswapd. If that
> cannot keep up, we can throttle and do direct reclaim.
>
> It doesn't change the memory.high semantics, but it allows exploiting
> the fact that we have SMP systems and can parallelize the bookkeeping.
Thanks, this describes the problem much better and I believe this all
belongs to the changelog.
> > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > while keeping direct reclaim as a fallback. In our testing, this
> > > eliminated all direct reclaim from the affected workload.
> >
> > Who is accounted for all the work? Unless I am missing something this
> > just gets hidden in the system activity and that might hurt the
> > isolation. I do see how moving the work to a different context is
> > desirable but this work has to be accounted properly when it is going to
> > become a normal mode of operation (rather than a rare exception like the
> > existing irq context handling).
>
> Yes, the plan is to account it to the cgroup on whose behalf we're
> doing the work.
OK, great, because honestly I am not really sure we can merge this work
without that being handled, I am afraid. We've had similar attempts
- mostly to parallelize work on behalf of the process (e.g. address space
tear down) - and the proper accounting was always the main obstacle so we
really need to handle this problem for other reasons. This doesn't sound
very different. And your example of a workload not meeting SLAs just
shows that the amount of the work required for the high limit reclaim
can be non-trivial. Somebody has to do that work and we cannot simply
allow everybody else to pay for that.
> The problem is that we have a general lack of usable CPU control right
> now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
> For workloads that are contended on CPU, we cannot enable the CPU
> controller because the scheduling latencies are too high. And for
> workloads that aren't CPU contended, well, it doesn't really matter
> where the reclaim cycles are accounted to.
>
> Once we have the CPU controller up to speed, we can add annotations
> like these to account stretches of execution to specific
> cgroups. There just isn't much point to do it before we can actually
> enable CPU control on the real workloads where it would matter.
>
> [ This is generally work in progress: for example, if you isolate
> workloads with memory.low, kswapd cpu time isn't accounted to the
> cgroup that causes it. Swap IO issued by kswapd isn't accounted to
> the group that is getting swapped.
Well, kswapd is a system activity and as such it is acceptable that it
is accounted to the system. But in this case we are talking about a
memcg configuration which influences all other workloads by stealing CPU
cycles from them without much throttling on the consumer side -
especially when the memory is reclaimable without a lot of sleeping or
contention on locks etc.
I am absolutely aware that we will never achieve a perfect isolation due
to all sorts of shared data structures, lock contention and what not but
this patch alone just allows spill over to unaccounted work way too
easily IMHO.
> The memory consumed by struct
> cgroup itself, the percpu allocations for the vmstat arrays etc.,
> which is sizable, are not accounted to the cgroup that creates
> subgroups, and so forth. ]
--
Michal Hocko
SUSE Labs
On Wed, Feb 19, 2020 at 08:53:32PM +0100, Michal Hocko wrote:
> On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> > On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > > We have received regression reports from users whose workloads moved
> > > > into containers and subsequently encountered new latencies. For some
> > > > users these were a nuisance, but for others it meant missing their SLA
> > > > response times. We tracked those delays down to cgroup limits, which
> > > > inject direct reclaim stalls into the workload where previously all
> > > > reclaim was handled by kswapd.
> > >
> > > I am curious why this is unexpected when the high limit is explicitly
> > > documented as a throttling mechanism.
> >
> > Memory.high is supposed to curb aggressive growth using throttling
> > instead of OOM killing. However, if the workload has plenty of easily
> > reclaimable memory and just needs to recycle a couple of cache pages
> > to permit an allocation, there is no need to throttle the workload -
> > just as there wouldn't be any need to trigger the OOM killer.
> >
> > So it's not unexpected, but it's unnecessarily heavy-handed: since
> > memory.high allows some flexibility around the target size, we can
> > move the routine reclaim activity (cache recycling) out of the main
> > execution stream of the workload, just like we do with kswapd. If that
> > cannot keep up, we can throttle and do direct reclaim.
> >
> > It doesn't change the memory.high semantics, but it allows exploiting
> > the fact that we have SMP systems and can parallelize the bookkeeping.
>
> Thanks, this describes the problem much better and I believe this all
> belongs to the changelog.
Ok.
> > > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > > while keeping direct reclaim as a fallback. In our testing, this
> > > > eliminated all direct reclaim from the affected workload.
> > >
> > > Who is accounted for all the work? Unless I am missing something this
> > > just gets hidden in the system activity and that might hurt the
> > > isolation. I do see how moving the work to a different context is
> > > desirable but this work has to be accounted properly when it is going to
> > > become a normal mode of operation (rather than a rare exception like the
> > > existing irq context handling).
> >
> > Yes, the plan is to account it to the cgroup on whose behalf we're
> > doing the work.
>
> OK, great, because honestly I am not really sure we can merge this work
> without that being handled, I am afraid. We've had similar attempts
> - mostly to parallelize work on behalf of the process (e.g. address space
> tear down) - and the proper accounting was always the main obstacle so we
> really need to handle this problem for other reasons. This doesn't sound
> very different. And your example of a workload not meeting SLAs just
> shows that the amount of the work required for the high limit reclaim
> can be non-trivial. Somebody has to do that work and we cannot simply
> allow everybody else to pay for that.
>
> > The problem is that we have a general lack of usable CPU control right
> > now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
> > For workloads that are contended on CPU, we cannot enable the CPU
> > controller because the scheduling latencies are too high. And for
> > workloads that aren't CPU contended, well, it doesn't really matter
> > where the reclaim cycles are accounted to.
> >
> > Once we have the CPU controller up to speed, we can add annotations
> > like these to account stretches of execution to specific
> > cgroups. There just isn't much point to do it before we can actually
> > enable CPU control on the real workloads where it would matter.
> >
> > [ This is generally work in progress: for example, if you isolate
> > workloads with memory.low, kswapd cpu time isn't accounted to the
> > cgroup that causes it. Swap IO issued by kswapd isn't accounted to
> > the group that is getting swapped.
>
> Well, kswapd is a system activity and as such it is acceptable that it
> is accounted to the system. But in this case we are talking about a
> memcg configuration which influences all other workloads by stealing CPU
> cycles from them
From a user perspective this isn't a meaningful distinction.
If I partition my memory among containers and one cgroup is acting
out, I would want the culprit to be charged for the cpu cycles the
reclaim is causing. Whether I divide my machine up using memory.low or
using memory.max doesn't really matter: I'm choosing between the two
based on a *memory policy* I want to implement - work-conserving vs
non-conserving. I shouldn't have to worry about the kernel tracking
CPU cycles properly in the respective implementations of these knobs.
So kswapd is very much a cgroup-attributable activity, *especially* if
I'm using memory.low to delineate different memory domains.
> without much throttling on the consumer side - especially when the
> memory is reclaimable without a lot of sleeping or contention on
> locks etc.
The limiting factor on the consumer side is IO. Reading a page is way
more costly than reclaiming it, which is why we built our isolation
stack starting with memory and IO control and are only now getting to
working on proper CPU isolation.
> I am absolutely aware that we will never achieve a perfect isolation due
> to all sorts of shared data structures, lock contention and what not but
> this patch alone just allows spill over to unaccounted work way too
> easily IMHO.
I understand your concern about CPU cycles escaping, and I share
it. My point is that this patch isn't adding a problem that isn't
already there, nor is it that much of a practical concern at the time
of this writing given the state of CPU isolation in general.
On Wed, Feb 19, 2020 at 11:31:39AM -0800, Andrew Morton wrote:
> But what was the nature of these stalls? If they were "stuck in D
> state waiting for something" then that's throttling. If they were
> "unexpected bursts of in-kernel CPU activity" then I see a better case.
It was both.
However, the workload was able to perform with no direct reclaim
activity and no stalls, while memory.high semantics were never
violated. This means that allocation rate was not outstripping reclaim
rate, just that both of these things happened in bursts and sputters.
If the workload isn't exceeding the memory.high grace buffer, it seems
unnecessary to make it wait behind a reclaim invocation that gets
delayed on IO but will succeed before the workload comes back for the
next allocation. We can just let it get on with its thing, actually
work with the memory it just obtained, while background reclaim will
free pages as they become reclaimable. This is a win-win situation.
On Wed, Feb 19, 2020 at 08:53:32PM +0100, Michal Hocko wrote:
> On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> > On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > > while keeping direct reclaim as a fallback. In our testing, this
> > > > eliminated all direct reclaim from the affected workload.
> > >
> > > Who is accounted for all the work? Unless I am missing something this
> > > just gets hidden in the system activity and that might hurt the
> > > isolation. I do see how moving the work to a different context is
> > > desirable but this work has to be accounted properly when it is going to
> > > become a normal mode of operation (rather than a rare exception like the
> > > existing irq context handling).
> >
> > Yes, the plan is to account it to the cgroup on whose behalf we're
> > doing the work.
How are you planning to do that?
I've been thinking about how to account a kernel thread's CPU usage to a cgroup
on and off while working on the parallelizing Michal mentions below. A few
approaches are described here:
https://lore.kernel.org/linux-mm/[email protected]/
> shows that the amount of the work required for the high limit reclaim
> can be non-trivial. Somebody has to do that work and we cannot simply
> allow everybody else to pay for that.
>
> > The problem is that we have a general lack of usable CPU control right
> > now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
> > For workloads that are contended on CPU, we cannot enable the CPU
> > controller because the scheduling latencies are too high. And for
> > workloads that aren't CPU contended, well, it doesn't really matter
> > where the reclaim cycles are accounted to.
> >
> > Once we have the CPU controller up to speed, we can add annotations
> > like these to account stretches of execution to specific
> > cgroups. There just isn't much point to do it before we can actually
> > enable CPU control on the real workloads where it would matter.
Which annotations do you mean? I didn't see them when skimming through Rik's
work or in this patch.
On Wed, Feb 19, 2020 at 04:41:12PM -0500, Daniel Jordan wrote:
> On Wed, Feb 19, 2020 at 08:53:32PM +0100, Michal Hocko wrote:
> > On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> > > On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > > > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > > > while keeping direct reclaim as a fallback. In our testing, this
> > > > > eliminated all direct reclaim from the affected workload.
> > > >
> > > > Who is accounted for all the work? Unless I am missing something this
> > > > just gets hidden in the system activity and that might hurt the
> > > > isolation. I do see how moving the work to a different context is
> > > > desirable but this work has to be accounted properly when it is going to
> > > > become a normal mode of operation (rather than a rare exception like the
> > > > existing irq context handling).
> > >
> > > Yes, the plan is to account it to the cgroup on whose behalf we're
> > > doing the work.
>
> How are you planning to do that?
>
> I've been thinking about how to account a kernel thread's CPU usage to a cgroup
> on and off while working on the parallelizing Michal mentions below. A few
> approaches are described here:
>
> https://lore.kernel.org/linux-mm/[email protected]/
What we do for the IO controller is execute the work unthrottled but
charge the cgroup on whose behalf we are executing with whatever cost
or time or bandwidth that was incurred. The cgroup will pay off this
debt when it requests more of that resource.
This is from blk-iocost.c:
/*
* We're over budget. If @bio has to be issued regardless,
* remember the abs_cost instead of advancing vtime.
* iocg_kick_waitq() will pay off the debt before waking more IOs.
* This way, the debt is continuously paid off each period with the
* actual budget available to the cgroup. If we just wound vtime,
* we would incorrectly use the current hw_inuse for the entire
* amount which, for example, can lead to the cgroup staying
* blocked for a long time even with substantially raised hw_inuse.
*/
if (bio_issue_as_root_blkg(bio) || fatal_signal_pending(current)) {
atomic64_add(abs_cost, &iocg->abs_vdebt);
iocg_kick_delay(iocg, &now, cost);
return;
}
blk-iolatency.c has similar provisions. bio_issue_as_root_blkg() says this:
/**
* bio_issue_as_root_blkg - see if this bio needs to be issued as root blkg
* @return: true if this bio needs to be submitted with the root blkg context.
*
* In order to avoid priority inversions we sometimes need to issue a bio as if
* it were attached to the root blkg, and then backcharge to the actual owning
* blkg. The idea is we do bio_blkcg() to look up the actual context for the
* bio and attach the appropriate blkg to the bio. Then we call this helper and
* if it is true run with the root blkg for that queue and then do any
* backcharging to the originating cgroup once the io is complete.
*/
static inline bool bio_issue_as_root_blkg(struct bio *bio)
{
return (bio->bi_opf & (REQ_META | REQ_SWAP)) != 0;
}
The plan for the CPU controller is similar. When a remote execution
begins, flush the current runtime accumulated (update_curr) and
associate the current thread with another cgroup (similar to
current->active_memcg); when remote execution is done, flush the
runtime delta to that cgroup and unset the remote context.
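Roughly like this (a sketch only; the helper names and the task field
are hypothetical, modeled on current->active_memcg):
/* begin charging current's CPU time to @cgrp instead of its own group */
void use_cpu_of_cgroup(struct cgroup *cgrp)
{
	/* settle the runtime accumulated so far with the old owner */
	task_sched_runtime(current);
	/* hypothetical task field, analogous to current->active_memcg */
	current->active_cpu_cgroup = cgrp;
}
/* end the remote section, flush the runtime delta to the cgroup */
void unuse_cpu_of_cgroup(void)
{
	task_sched_runtime(current);
	current->active_cpu_cgroup = NULL;
}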
For async reclaim, whether that's kswapd or the work item that I'm
adding here, we'd want the cycles to go to the cgroup whose memory is
being reclaimed.
> > > The problem is that we have a general lack of usable CPU control right
> > > now - see Rik's work on this: https://lkml.org/lkml/2019/8/21/1208.
> > > For workloads that are contended on CPU, we cannot enable the CPU
> > > controller because the scheduling latencies are too high. And for
> > > workloads that aren't CPU contended, well, it doesn't really matter
> > > where the reclaim cycles are accounted to.
> > >
> > > Once we have the CPU controller up to speed, we can add annotations
> > > like these to account stretches of execution to specific
> > > cgroups. There just isn't much point to do it before we can actually
> > > enable CPU control on the real workloads where it would matter.
>
> Which annotations do you mean? I didn't see them when skimming through Rik's
> work or in this patch.
Sorry, they're not in Rik's patch. My point was that we haven't gotten
to making such fine-grained annotations because the CPU isolation as a
whole isn't something we have working in practice right now. It's not
relevant who is spending the cycles if we cannot enable CPU control.
On Wed 19-02-20 16:17:35, Johannes Weiner wrote:
> On Wed, Feb 19, 2020 at 08:53:32PM +0100, Michal Hocko wrote:
> > On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
[...]
> > > [ This is generally work in progress: for example, if you isolate
> > > workloads with memory.low, kswapd cpu time isn't accounted to the
> > > cgroup that causes it. Swap IO issued by kswapd isn't accounted to
> > > the group that is getting swapped.
> >
> > Well, kswapd is a system activity and as such it is acceptable that it
> > is accounted to the system. But in this case we are talking about a
> > memcg configuration which influences all other workloads by stealing CPU
> > cycles from them
>
> From a user perspective this isn't a meaningful distinction.
>
> If I partition my memory among containers and one cgroup is acting
> out, I would want the culprit to be charged for the cpu cycles the
> reclaim is causing. Whether I divide my machine up using memory.low or
> using memory.max doesn't really matter: I'm choosing between the two
> based on a *memory policy* I want to implement - work-conserving vs
> non-conserving. I shouldn't have to worry about the kernel tracking
> CPU cycles properly in the respective implementations of these knobs.
>
> So kswapd is very much a cgroup-attributable activity, *especially* if
> I'm using memory.low to delineate different memory domains.
While I understand what you are saying I do not think this is easily
achievable with the current implementation. The biggest problem I can
see is that you do not have clear information about whom to charge for
the memory shortage on a particular NUMA node with a pure low limit
based balancing because the limit is not NUMA aware. Besides that the
origin of the memory pressure might be outside of any memcg. You can
punish/account all memcgs in excess in some manner, e.g. proportionally
to their size/excess but I am not really sure how fair that will
be. Sounds like an interesting project but also sounds like a tangent
to this patch.
High/Max limits are quite different because they are dealing with
the internal memory pressure and you can attribute it to the
cgroup/hierarchy which is in excess. There is a clear domain to reclaim
from. This is an easier model to reason about IMHO.
> > without much throttling on the consumer side - especially when the
> > memory is reclaimable without a lot of sleeping or contention on
> > locks etc.
>
> The limiting factor on the consumer side is IO. Reading a page is way
> more costly than reclaiming it, which is why we built our isolation
> stack starting with memory and IO control and are only now getting to
> working on proper CPU isolation.
>
> > I am absolutely aware that we will never achieve a perfect isolation due
> > to all sorts of shared data structures, lock contention and what not but
> > this patch alone just allows spill over to unaccounted work way too
> > easily IMHO.
>
> I understand your concern about CPU cycles escaping, and I share
> it. My point is that this patch isn't adding a problem that isn't
> already there, nor is it that much of a practical concern at the time
> of this writing given the state of CPU isolation in general.
I beg to differ here. The CPU controller should be able to isolate user
contexts performing high limit reclaim now. Your patch makes a large
part of that work unaccounted, which might be seen as a regression for
workloads that partition the system using the high limit and also rely
on the CPU controller because they are CPU sensitive.
Without the CPU controller support this patch is not complete and I do
not see an absolute must to merge it ASAP because it is not a regression
fix or something we cannot live without.
--
Michal Hocko
SUSE Labs
On Thu, Feb 20, 2020 at 10:46:39AM +0100, Michal Hocko wrote:
> On Wed 19-02-20 16:17:35, Johannes Weiner wrote:
> > On Wed, Feb 19, 2020 at 08:53:32PM +0100, Michal Hocko wrote:
> > > On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> [...]
> > > > [ This is generally work in progress: for example, if you isolate
> > > > workloads with memory.low, kswapd cpu time isn't accounted to the
> > > > cgroup that causes it. Swap IO issued by kswapd isn't accounted to
> > > > the group that is getting swapped.
> > >
> > > Well, kswapd is a system activity and as such it is acceptable that it
> > > is accounted to the system. But in this case we are talking about a
> > > memcg configuration which influences all other workloads by stealing CPU
> > > cycles from them
> >
> > From a user perspective this isn't a meaningful distinction.
> >
> > If I partition my memory among containers and one cgroup is acting
> > out, I would want the culprit to be charged for the cpu cycles the
> > reclaim is causing. Whether I divide my machine up using memory.low or
> > using memory.max doesn't really matter: I'm choosing between the two
> > based on a *memory policy* I want to implement - work-conserving vs
> > non-conserving. I shouldn't have to worry about the kernel tracking
> > CPU cycles properly in the respective implementations of these knobs.
> >
> > So kswapd is very much a cgroup-attributable activity, *especially* if
> > I'm using memory.low to delineate different memory domains.
>
> While I understand what you are saying I do not think this is easily
> achievable with the current implementation. The biggest problem I can
> see is that you do not have clear information about whom to charge for
> the memory shortage on a particular NUMA node with a pure low limit
> based balancing because the limit is not NUMA aware. Besides that the
> origin of the memory pressure might be outside of any memcg. You can
> punish/account all memcgs in excess in some manner, e.g. proportionally
> to their size/excess but I am not really sure how fair that will
> be. Sounds like an interesting project but also sounds like a tangent
> to this patch.
>
> High/Max limits are quite different because they are dealing with
> the internal memory pressure and you can attribute it to the
> cgroup/hierarchy which is in excess. There is a clear domain to reclaim
> from. This is an easier model to reason about IMHO.
They're not different. memory.low is just a usage limit that happens
to be enforced lazily rather than immediately.
If I'm setting memory.high or memory.max and I allocate beyond it, my
memory will be reclaimed with the limit as the target.
If I'm setting memory.low and I allocate beyond it, my memory will
eventually be reclaimed with the limit as the target.
In either case, the cgroup who allocated the memory that is being
reclaimed is the one obviously responsible for the reclaim work. Why
would the time of limit enforcement change that?
If on the other hand an allocation reclaims you below your limit, such
as can happen with a NUMA-bound allocation, whether it's high, max, or
low, then that's their cost to pay. But it's not really that important
what we do in that case - the memcg settings aren't NUMA aware so that
whole scenario is out of the purview of the controller anyway.
To illustrate, the reclaim cycles could be attributed along these lines
(a sketch; use_cpu_of_cgroup() and unuse_cpu_of_cgroup() are
hypothetical helpers):
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d6085115c7f2..24fe6e9e64b1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2651,6 +2651,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ bool account_cpu = current_is_kswapd() || current_work();
unsigned long reclaimed;
unsigned long scanned;
@@ -2673,6 +2674,7 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
continue;
}
memcg_memory_event(memcg, MEMCG_LOW);
+ account_cpu = false;
break;
case MEMCG_PROT_NONE:
/*
@@ -2688,11 +2690,17 @@ static void shrink_node_memcgs(pg_data_t *pgdat, struct scan_control *sc)
reclaimed = sc->nr_reclaimed;
scanned = sc->nr_scanned;
+ if (account_cpu)
+ use_cpu_of_cgroup(memcg->css.cgroup);
+
shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg,
sc->priority);
+ if (account_cpu)
+ unuse_cpu_of_cgroup();
+
/* Record the group's reclaim efficiency */
vmpressure(sc->gfp_mask, memcg, false,
sc->nr_scanned - scanned,
> > > without much throttling on the consumer side - especially when the
> > > memory is reclaimable without a lot of sleeping or contention on
> > > locks etc.
> >
> > The limiting factor on the consumer side is IO. Reading a page is way
> > more costly than reclaiming it, which is why we built our isolation
> > stack starting with memory and IO control and are only now getting to
> > working on proper CPU isolation.
> >
> > > I am absolutely aware that we will never achieve a perfect isolation due
> > > to all sorts of shared data structures, lock contention and what not but
> > > this patch alone just allows spill over to unaccounted work way too
> > > easily IMHO.
> >
> > I understand your concern about CPU cycles escaping, and I share
> > it. My point is that this patch isn't adding a problem that isn't
> > already there, nor is it that much of a practical concern at the time
> > of this writing given the state of CPU isolation in general.
>
> I beg to differ here. The CPU controller should be able to isolate user
> contexts performing high limit reclaim now. Your patch makes a large
> part of that work unaccounted, which might be seen as a regression for
> workloads that partition the system using the high limit and also rely
> on the CPU controller because they are CPU sensitive.
>
> Without the CPU controller support this patch is not complete and I do
> not see an absolute must to merge it ASAP because it is not a regression
> fix or something we cannot live without.
I think you're still thinking in a cgroup1 reality, where you would
set a memory limit in isolation and then eat a ton of CPU pushing up
against it.
In comprehensive isolation setups implemented in cgroup2, "heavily"
reclaimed containers are primarily IO bound on page faults, refaults,
writeback. The reclaim cost is a small part of it, and as I said, in a
magnitude range for which the CPU controller is currently too heavy.
We can carry this patch out of tree until the CPU controller is fixed,
but I think the reasoning to keep it out is not actually based on the
practical reality of a cgroup2 world.
+Peter
On Wed, Feb 19, 2020 at 05:08:59PM -0500, Johannes Weiner wrote:
> On Wed, Feb 19, 2020 at 04:41:12PM -0500, Daniel Jordan wrote:
> > On Wed, Feb 19, 2020 at 08:53:32PM +0100, Michal Hocko wrote:
> > > On Wed 19-02-20 14:16:18, Johannes Weiner wrote:
> > > > On Wed, Feb 19, 2020 at 07:37:31PM +0100, Michal Hocko wrote:
> > > > > On Wed 19-02-20 13:12:19, Johannes Weiner wrote:
> > > > > > This patch adds asynchronous reclaim to the memory.high cgroup limit
> > > > > > while keeping direct reclaim as a fallback. In our testing, this
> > > > > > eliminated all direct reclaim from the affected workload.
> > > > >
> > > > > Who is accounted for all the work? Unless I am missing something this
> > > > > just gets hidden in the system activity and that might hurt the
> > > > > isolation. I do see how moving the work to a different context is
> > > > > desirable but this work has to be accounted properly when it is going to
> > > > > become a normal mode of operation (rather than a rare exception like the
> > > > > existing irq context handling).
> > > >
> > > > Yes, the plan is to account it to the cgroup on whose behalf we're
> > > > doing the work.
> >
> > How are you planning to do that?
> >
> > I've been thinking about how to account a kernel thread's CPU usage to a cgroup
> > on and off while working on the parallelizing Michal mentions below. A few
> > approaches are described here:
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
>
> What we do for the IO controller is execute the work unthrottled but
> charge the cgroup on whose behalf we are executing with whatever cost
> or time or bandwith that was incurred. The cgroup will pay off this
> debt when it requests more of that resource.
>
[snip code pointers]
Thanks! Figuring out how the io controllers dealt with remote charging was on
my list, this makes it easier.
> The plan for the CPU controller is similar. When a remote execution
> begins, flush the current runtime accumulated (update_curr) and
> associate the current thread with another cgroup (similar to
> current->active_memcg); when remote execution is done, flush the
> runtime delta to that cgroup and unset the remote context.
Ok, consistency with io and memory is one advantage to doing it that way.
Creating kthreads in cgroups also seems viable so far, and it's unclear whether
either approach is significantly simpler or more maintainable than the other,
at least to me.
Is someone on your side working on remote charging right now? I was planning
to post an RFD comparing these soon and it would make sense to include them.
On Thu, Feb 20, 2020 at 10:45:24AM -0500, Daniel Jordan wrote:
> Ok, consistency with io and memory is one advantage to doing it that way.
> Creating kthreads in cgroups also seems viable so far, and it's unclear whether
> either approach is significantly simpler or more maintainable than the other,
> at least to me.
The problem with the separate kthread approach is that many of these
work units are tiny, and cgroup membership might not be known, or might
not agree with the processing context, from the beginning.
For example, the ownership of network packets can't be determined till
processing has progressed quite a bit in shared contexts, and each item
is too small to bounce around. The only viable way I can think of is
splitting aggregate overhead according to the number of packets (or
some other trivially measurable quantity) processed.
Anything sitting in the reclaim layer is the same. Reclaim should be
charged to the cgroup whose memory is reclaimed *but* shouldn't block
other cgroups which are waiting for that memory. It has to happen in
the context of the highest priority entity waiting for memory but the
costs incurred must be charged to the memory owners.
So, one way or the other, I think we'll need back charging and once
back charging is needed for big ticket items like network and reclaim,
it's kinda silly to use separate mechanisms for other stuff.
> Is someone on your side working on remote charging right now? I was planning
> to post an RFD comparing these soon and it would make sense to include them.
It's been on the to do list but nobody is working on it yet.
Thanks.
--
tejun
On Thu, Feb 20, 2020 at 10:56:51AM -0500, Tejun Heo wrote:
> On Thu, Feb 20, 2020 at 10:45:24AM -0500, Daniel Jordan wrote:
> > Ok, consistency with io and memory is one advantage to doing it that way.
> > Creating kthreads in cgroups also seems viable so far, and it's unclear whether
> > either approach is significantly simpler or more maintainable than the other,
> > at least to me.
>
> The problem with the separate kthread approach is that many of these
> work units are tiny, and cgroup membership might not be known, or might
> not agree with the processing context, from the beginning.
The amount of work wouldn't seem to matter as long as the kernel thread stays
in the cgroup and lives long enough. There's only the one-time cost of
attaching it when it's forked. That seems doable for unbound workqueues (the
async reclaim), but may not be for the network packets.
The membership and context issues are pretty compelling though. Good to know,
I'll keep it in mind as I think this through.
> For example, the ownership of network packets can't be determined till
> processing has progressed quite a bit in shared contexts, and each item
> is too small to bounce around. The only viable way I can think of is
> splitting aggregate overhead according to the number of packets (or
> some other trivially measurable quantity) processed.
>
> Anything sitting in the reclaim layer is the same. Reclaim should be
> charged to the cgroup whose memory is reclaimed *but* shouldn't block
> other cgroups which are waiting for that memory. It has to happen in
> the context of the highest priority entity waiting for memory but the
> costs incurred must be charged to the memory owners.
>
> So, one way or the other, I think we'll need back charging and once
> back charging is needed for big ticket items like network and reclaim,
> it's kinda silly to use separate mechanisms for other stuff.
Yes, having both would appear to be redundant.
> > Is someone on your side working on remote charging right now? I was planning
> > to post an RFD comparing these soon and it would make sense to include them.
>
> It's been on the to do list but nobody is working on it yet.
Ok, thanks.
Hello,
On Thu, Feb 20, 2020 at 01:23:26PM -0500, Daniel Jordan wrote:
> The amount of work wouldn't seem to matter as long as the kernel thread stays
> in the cgroup and lives long enough. There's only the one-time cost of
> attaching it when it's forked. That seems doable for unbound workqueues (the
> async reclaim), but may not be for the network packets.
The setup cost can be lazily optimized, but it'd still have to bounce the
tiny pieces of work to different threads instead of processing them in
one fell swoop from the same context, which most likely is gonna be
untenably expensive.
Thanks.
--
tejun
On Thu, Feb 20, 2020 at 01:45:45PM -0500, Tejun Heo wrote:
> The setup cost can be lazily optimized, but it'd still have to bounce the
> tiny pieces of work to different threads instead of processing them in
> one fell swoop from the same context, which most likely is gonna be
> untenably expensive.
I see, your last mail is clearer now. If it's easy to do, a pointer to where
this happens would help so we're on the same page.
thanks,
Daniel
Hello, Daniel.
On Thu, Feb 20, 2020 at 02:55:35PM -0500, Daniel Jordan wrote:
> On Thu, Feb 20, 2020 at 01:45:45PM -0500, Tejun Heo wrote:
> > The setup cost can be lazily optimized, but it'd still have to bounce the
> > tiny pieces of work to different threads instead of processing them in
> > one fell swoop from the same context, which most likely is gonna be
> > untenably expensive.
>
> I see, your last mail is clearer now. If it's easy to do, a pointer to where
> this happens would help so we're on the same page.
Network packet rx is the clearest example I think, but you already
mentioned it. Reclaim is less so but when kswapd reclaims, it walks
everybody, and there can be a lot of small cgroups.
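The reclaim walk in question is the mem_cgroup_iter() loop in
shrink_node_memcgs(), roughly:
	memcg = mem_cgroup_iter(target_memcg, NULL, NULL);
	do {
		/* shrink this memcg's LRUs and slab caches */
	} while ((memcg = mem_cgroup_iter(target_memcg, memcg, NULL)));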
Thanks.
--
tejun
On Wed, Feb 19, 2020 at 10:12 AM Johannes Weiner <[email protected]> wrote:
>
> We have received regression reports from users whose workloads moved
> into containers and subsequently encountered new latencies. For some
> users these were a nuisance, but for others it meant missing their SLA
> response times. We tracked those delays down to cgroup limits, which
> inject direct reclaim stalls into the workload where previously all
> reclaim was handled by kswapd.
>
> This patch adds asynchronous reclaim to the memory.high cgroup limit
> while keeping direct reclaim as a fallback. In our testing, this
> eliminated all direct reclaim from the affected workload.
>
> memory.high has a grace buffer of about 4% between when it becomes
> exceeded and when allocating threads get throttled. We can use the
> same buffer for the async reclaimer to operate in. If the worker
> cannot keep up and the grace buffer is exceeded, allocating threads
> will fall back to direct reclaim before getting throttled.
>
> For irq-context, there's already async memory.high enforcement. Re-use
> that work item for all allocating contexts, but switch it to the
> unbound workqueue so reclaim work doesn't compete with the workload.
> The work item is per cgroup, which means the workqueue infrastructure
> will create at maximum one worker thread per reclaiming cgroup.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 60 +++++++++++++++++++++++++++++++++++++------------
> mm/vmscan.c | 10 +++++++--
This reminds me of the per-memcg kswapd proposal from LSFMM 2018
(https://lwn.net/Articles/753162/).
If I understand this correctly, the use-case is that, instead of
direct reclaiming (potentially in latency-sensitive tasks), the job
prefers a background, non-latency-sensitive task to do the reclaim. I
am wondering if we can use the memory.high notification along with a
new memcg interface (like memory.try_to_free_pages) to implement a
user space background reclaimer. That would resolve the cpu accounting
concerns, as the user space background reclaimer can share the cpu
cost with the task.
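Roughly, the event loop of such a userspace reclaimer could look like
the sketch below. Note this is only a sketch: memory.try_to_free_pages
is the hypothetical interface from above, not anything upstream, the
cgroup path is made up, and error handling is omitted.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *cg = "/sys/fs/cgroup/job";	/* assumed cgroup path */
	char path[256], buf[256];
	struct pollfd pfd;

	snprintf(path, sizeof(path), "%s/memory.events", cg);
	pfd.fd = open(path, O_RDONLY);
	pfd.events = POLLPRI;

	for (;;) {
		/* cgroup2 event files notify pollers on every change */
		poll(&pfd, 1, -1);
		lseek(pfd.fd, 0, SEEK_SET);	/* re-read to re-arm */
		read(pfd.fd, buf, sizeof(buf));

		/* hypothetical interface, takes "<bytes> <flags>" */
		snprintf(path, sizeof(path),
			 "%s/memory.try_to_free_pages", cg);
		int fd = open(path, O_WRONLY);
		dprintf(fd, "%lu %u\n", 64UL << 20, 0U);
		close(fd);
	}
}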
One concern with this approach is that the memory.high notification
may come too late, after the latency-sensitive task has already faced
the stall. We can either introduce a threshold notification or another
notification-only limit like memory.near_high, set based on the job's
rate of allocations; when usage hits this limit, the kernel just
notifies user space.
Shakeel
On Wed, Feb 26, 2020 at 12:25:33PM -0800, Shakeel Butt wrote:
> [...]
>
> This reminds me of the per-memcg kswapd proposal from LSFMM 2018
> (https://lwn.net/Articles/753162/).
Ah yes, I remember those discussions. :)
One thing that has changed since we last tried to implement this is
the workqueue concurrency code. We don't have to worry about a single
thread or fixed threads per cgroup, because the workqueue code has
improved significantly in handling concurrency demands, and having one
work item per cgroup ensures we have anywhere between zero threads and
one thread per cgroup doing this reclaim work, completely on demand.
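To illustrate the property (generic names, not the patch itself):

#include <linux/workqueue.h>

struct memcg_like {
	struct work_struct async_reclaim_work;
	/* ... */
};

static void async_reclaim_fn(struct work_struct *work)
{
	struct memcg_like *memcg = container_of(work, struct memcg_like,
						async_reclaim_work);

	/* reclaim memcg back below its high limit here */
	(void)memcg;
}

/* called once when the group is created */
static void async_reclaim_init(struct memcg_like *memcg)
{
	INIT_WORK(&memcg->async_reclaim_work, async_reclaim_fn);
}

/*
 * Charge path: queue_work() on an already-pending item is a no-op, and
 * the workqueue core keeps each item non-reentrant, so anywhere from
 * zero to one unbound worker is reclaiming a given group at any time.
 */
static void kick_async_reclaim(struct memcg_like *memcg)
{
	queue_work(system_unbound_wq, &memcg->async_reclaim_work);
}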
Also, with cgroup2, memory and cpu always have overlapping control
domains, so the question of who to account the work to becomes a much
easier one to answer.
> If I understand this correctly, the use-case is that, instead of
> direct reclaiming (potentially in latency-sensitive tasks), the job
> prefers a background, non-latency-sensitive task to do the reclaim. I
> am wondering if we can use the memory.high notification along with a
> new memcg interface (like memory.try_to_free_pages) to implement a
> user space background reclaimer. That would resolve the cpu accounting
> concerns, as the user space background reclaimer can share the cpu
> cost with the task.
The idea is not necessarily that the background reclaimer is lower
priority work, but that it can execute in parallel on a separate CPU
instead of being forced into the execution stream of the main work.
So we should be able to fully resolve this problem inside the kernel,
without going through userspace, by accounting CPU cycles used by the
background reclaim worker to the cgroup that is being reclaimed.
> One concern with this approach is that the memory.high notification
> may come too late, after the latency-sensitive task has already faced
> the stall. We can either introduce a threshold notification or another
> notification-only limit like memory.near_high, set based on the job's
> rate of allocations; when usage hits this limit, the kernel just
> notifies user space.
Yeah, I think it would be a pretty drastic expansion of the memory
controller's interface.
On Wed, Feb 26, 2020 at 2:26 PM Johannes Weiner <[email protected]> wrote:
> [...]
>
> So we should be able to fully resolve this problem inside the kernel,
> without going through userspace, by accounting CPU cycles used by the
> background reclaim worker to the cgroup that is being reclaimed.
>
> [...]
>
> Yeah, I think it would be a pretty drastic expansion of the memory
> controller's interface.
I understand the concern about expanding the interface when the
problem can be resolved within the kernel, but there are genuine
use-cases which can only be fulfilled by such interfaces. We have a
distributed caching service which manages its caches in anon pages and
tracks their hotness. On near stall/oom/memory_pressure it is
preferable to drop a cold cache known to the application in user space
than to let the kernel swap it out and face a stall on fault, since
the caches are replicated and other nodes can serve them. For such
workloads, kernel reclaim does not help. What would be your
recommendation for such a workload? I can envision memory.high + PSI
notifications, but note that these are based on stalls, which the
application wants to avoid.
Shakeel
On Wed, Feb 26, 2020 at 03:36:50PM -0800, Shakeel Butt wrote:
> [...]
>
> I understand the concern about expanding the interface when the
> problem can be resolved within the kernel, but there are genuine
> use-cases which can only be fulfilled by such interfaces. We have a
> distributed caching service which manages its caches in anon pages and
> tracks their hotness. On near stall/oom/memory_pressure it is
> preferable to drop a cold cache known to the application in user space
> than to let the kernel swap it out and face a stall on fault, since
> the caches are replicated and other nodes can serve them. For such
> workloads, kernel reclaim does not help. What would be your
> recommendation for such a workload? I can envision memory.high + PSI
> notifications, but note that these are based on stalls, which the
> application wants to avoid.
Oh sure, for such a use-case it would make sense to explore some sort
of usage notification system.
My observation was in the context of getting kernel background reclaim
cycles accounted properly, not for userspace reclaim algorithms :)
On 2/26/20 12:25 PM, Shakeel Butt wrote:
> On Wed, Feb 19, 2020 at 10:12 AM Johannes Weiner <[email protected]> wrote:
>> [...]
> This reminds me of the per-memcg kswapd proposal from LSFMM 2018
> (https://lwn.net/Articles/753162/).
Thanks for bringing this up.
>
> If I understand this correctly, the use-case is that, instead of
> direct reclaiming (potentially in latency-sensitive tasks), the job
> prefers a background, non-latency-sensitive task to do the reclaim. I
> am wondering if we can use the memory.high notification along with a
> new memcg interface (like memory.try_to_free_pages) to implement a
> user space background reclaimer. That would resolve the cpu accounting
> concerns, as the user space background reclaimer can share the cpu
> cost with the task.
Actually, I'm interested in how you implement the userspace reclaimer.
Via a new syscall or a variant of an existing syscall?
>
> One concern with this approach is that the memory.high notification
> may come too late, after the latency-sensitive task has already faced
> the stall. We can either introduce a threshold notification or another
> notification-only limit like memory.near_high, set based on the job's
> rate of allocations; when usage hits this limit, the kernel just
> notifies user space.
Yes, the sole purpose of the background reclaimer is to avoid direct
reclaim for latency-sensitive workloads. Our in-house implementation
has a high watermark and a low watermark, both of which are lower than
the limit or high. The background reclaimer is triggered once
available memory drops below the low watermark, and then keeps
reclaiming until available memory is back above the high watermark. It
is much the same as how the global watermarks work.
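In rough pseudo-C, the control flow is something like this (all names
are hypothetical, not our actual implementation):

struct memcg_wmarks {
	unsigned long low;	/* start background reclaim below this */
	unsigned long high;	/* stop background reclaim above this */
};

static void background_reclaim(struct memcg_wmarks *w)
{
	/* triggered when available memory drops below the low watermark */
	if (memcg_available_bytes() >= w->low)
		return;

	/* ... and keep reclaiming until the high watermark is restored */
	while (memcg_available_bytes() < w->high)
		memcg_reclaim_bytes(SZ_1M);
}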
On 2/26/20 2:26 PM, Johannes Weiner wrote:
> [...]
>
> One thing that has changed since we last tried to implement this is
> the workqueue concurrency code. We don't have to worry about a single
> thread or fixed threads per cgroup, because the workqueue code has
> improved significantly in handling concurrency demands, and having one
> work item per cgroup ensures we have anywhere between zero threads and
> one thread per cgroup doing this reclaim work, completely on demand.
Yes, exactly. Our in-house implementation was just converted to use a
workqueue instead of a dedicated kernel thread for each cgroup.
>
> Also, with cgroup2, memory and cpu always have overlapping control
> domains, so the question of who to account the work to becomes a much
> easier one to answer.
>
>> [...]
> The idea is not necessarily that the background reclaimer is lower
> priority work, but that it can execute in parallel on a separate CPU
> instead of being forced into the execution stream of the main work.
>
> So we should be able to fully resolve this problem inside the kernel,
> without going through userspace, by accounting CPU cycles used by the
> background reclaim worker to the cgroup that is being reclaimed.
Actually I'm wondering if we really need to account the CPU cycles
used by the background reclaimer. For our use-case (this may not be
general), the purpose of the background reclaimer is to keep
latency-sensitive workloads out of direct reclaim (avoiding the stall
from direct reclaim). In effect it just "steals" CPU cycles from
lower-priority or best-effort workloads to guarantee that
latency-sensitive workloads behave well. If the "stolen" CPU cycles
are accounted, the latency-sensitive workloads would get throttled
from somewhere else later, e.g. by CPU shares.
We definitely don't want the background reclaimer to eat all CPU
cycles. So the whole background reclaimer is opt-in. The higher-level
cluster management and administration components make sure the cgroups
are set up correctly, i.e. enabled for specific cgroups, watermarks
set up properly, etc.
Of course, this may not be universal and may just be fine for some
specific configurations or use-cases.
On Wed, Feb 26, 2020 at 3:59 PM Yang Shi <[email protected]> wrote:
>
> [...]
>
> Actually, I'm interested in how you implement the userspace reclaimer.
> Via a new syscall or a variant of an existing syscall?
>
We have a per-memcg interface, memory.try_to_free_pages, to which user
space can echo two numbers: the number of bytes to reclaim and a byte
representing flags (I/O allowed, just reclaim zombies, etc.). However,
nowadays we are just using it for zombie cleanup.
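For illustration, driving it amounts to something like this (the flag
encoding shown is an assumption, and the interface itself is
out-of-tree):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int memcg_try_to_free(const char *cgdir, unsigned long bytes,
			     unsigned int flags)
{
	char path[256];
	int fd, n;

	snprintf(path, sizeof(path), "%s/memory.try_to_free_pages", cgdir);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* "<bytes to reclaim> <flags>" */
	n = dprintf(fd, "%lu %u\n", bytes, flags);
	close(fd);
	return n < 0 ? -1 : 0;
}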
Shakeel
On Wed, Feb 26, 2020 at 4:12 PM Yang Shi <[email protected]> wrote:
>
> [...]
>
> Actually I'm wondering if we really need to account the CPU cycles
> used by the background reclaimer. For our use-case (this may not be
> general), the purpose of the background reclaimer is to keep
> latency-sensitive workloads out of direct reclaim (avoiding the stall
> from direct reclaim). In effect it just "steals" CPU cycles from
> lower-priority or best-effort workloads to guarantee that
> latency-sensitive workloads behave well. If the "stolen" CPU cycles
> are accounted, the latency-sensitive workloads would get throttled
> from somewhere else later, e.g. by CPU shares.
>
> We definitely don't want the background reclaimer to eat all CPU
> cycles. So the whole background reclaimer is opt-in. The higher-level
> cluster management and administration components make sure the cgroups
> are set up correctly, i.e. enabled for specific cgroups, watermarks
> set up properly, etc.
>
> Of course, this may not be universal and may just be fine for some
> specific configurations or use-cases.
>
IMHO this makes a very good case for a user space background
reclaimer. We have much more flexibility to run that reclaimer in the
same cgroup whose memory it's reclaiming (i.e. sharing the cpu quota),
or maybe in a separate cgroup with CPU stolen from best-effort or
low-priority jobs.
Shakeel
On Wed 26-02-20 16:12:23, Yang Shi wrote:
[...]
> Actually I'm wondering if we really need to account the CPU cycles used by
> the background reclaimer. For our use-case (this may not be general), the
> purpose of the background reclaimer is to keep latency-sensitive workloads
> out of direct reclaim (avoiding the stall from direct reclaim). In effect
> it just "steals" CPU cycles from lower-priority or best-effort workloads to
> guarantee that latency-sensitive workloads behave well. If the "stolen" CPU
> cycles are accounted, the latency-sensitive workloads would get throttled
> from somewhere else later, e.g. by CPU shares.
I believe we need to, because that work is not for free, and so you
are essentially stealing those CPU cycles from everybody else outside
of your throttled cgroup.
--
Michal Hocko
SUSE Labs
On Wed, Feb 26, 2020 at 04:12:23PM -0800, Yang Shi wrote:
> On 2/26/20 2:26 PM, Johannes Weiner wrote:
> > So we should be able to fully resolve this problem inside the kernel,
> > without going through userspace, by accounting CPU cycles used by the
> > background reclaim worker to the cgroup that is being reclaimed.
>
> Actually I'm wondering if we really need to account the CPU cycles used by
> the background reclaimer. For our use-case (this may not be general), the
> purpose of the background reclaimer is to keep latency-sensitive workloads
> out of direct reclaim (avoiding the stall from direct reclaim). In effect
> it just "steals" CPU cycles from lower-priority or best-effort workloads to
> guarantee that latency-sensitive workloads behave well. If the "stolen" CPU
> cycles are accounted, the latency-sensitive workloads would get throttled
> from somewhere else later, e.g. by CPU shares.
That doesn't sound right.
"Not accounting" isn't an option. If we don't annotate the reclaim
work, the cycles will go to the root cgroup. That means that the
latency-sensitive workload can steal cycles from the low-pri job, yes,
but also that the low-pri job can steal from the high-pri one.
Say your two workloads on the system are a web server and a compile
job and the CPU shares are allocated 80:20. The compile job will cause
most of the reclaim. If the reclaim cycles can escape to the root
cgroup, the compile job will effectively consume more than 20 shares
and the high-pri job will get less than 80.
But let's say we executed all background reclaim in the low-pri group,
to allow the high-pri group to steal cycles from the low-pri group,
but not the other way round. Again an 80:20 CPU distribution. Now the
reclaim work competes with the compile job over a very small share of
CPU. The reclaim work that the high priority job is relying on is
running at low priority. That means that the compile job can cause the
web server to go into direct reclaim. That's a priority inversion.
> We definitely don't want the background reclaimer to eat all CPU
> cycles. So the whole background reclaimer is opt-in. The higher-level
> cluster management and administration components make sure the cgroups
> are set up correctly, i.e. enabled for specific cgroups, watermarks set
> up properly, etc.
>
> Of course, this may not be universal and may just be fine for some
> specific configurations or use-cases.
Yes, I suspect it works for you because you set up watermarks on the
high-pri job but not on the background jobs, thus making sure only
high-pri jobs can steal cycles from the rest of the system.
However, we do want low-pri jobs to have background reclaim as well. A
compile job may not be latency-sensitive, but it still benefits from a
throughput POV when the reclaim work runs concurrently. And if there
are idle CPU cycles available that the high-pri work isn't using right
now, it would be wasteful not to make use of them.
So yes, I can see how such an accounting loophole can be handy. By
letting reclaim CPU cycles sneak out of containment, you can kind of
use it for high-pri jobs. Or rather *one* high-pri job, because more
than one becomes unsafe again, where one can steal a large number of
cycles from others at the same priority.
But it's more universally useful to properly account CPU cycles that
are actually consumed by a cgroup, to that cgroup, and then reflect
the additional CPU explicitly in the CPU weight configuration. That
way you can safely have background reclaim on jobs of all priorities.