Leonardo has reported [1] that pcp memcg charge draining can interfere
with cpu isolated workloads. Said draining is done from a WQ context,
with a pcp worker scheduled on each CPU that holds any cached charges
for a specific memcg hierarchy. The draining is not a particularly
common operation [2]. It can be triggered from userspace though, so some
care is definitely due.
Leonardo has tried to address the issue by allowing remote charge
draining [3]. That approach requires additional locking to synchronize
the pcp caches between a remote draining cpu and local pcp consumers.
Even though the proposed lock was per-cpu, there is still a potential
for contention and less predictable behavior.
This patchset addresses the issue from a different angle. Rather than
dealing with the potential synchronization, cpus which are isolated are
simply never scheduled to be drained. This means that a small amount of
charges could be left lying around, waiting for a later use, or they are
flushed when a different memcg is charged from the same cpu. More
details are in patch 2. The first patch, from Frederic, implements an
abstraction to tell whether a specific cpu has been isolated and
therefore requires special treatment.
The patchset is on top of Andrew's mm-unstable tree. I am not sure
which tree is best to route both patches through, but unless there are
any special requirements for the cpu isolation parts, pushing this via
Andrew seems like the easiest choice.
Frederic Weisbecker (1):
sched/isolation: Add cpu_is_isolated() API
Michal Hocko (1):
memcg: do not drain charge pcp caches on remote isolated cpus
include/linux/sched/isolation.h | 12 ++++++++++++
mm/memcontrol.c | 2 +-
2 files changed, 13 insertions(+), 1 deletion(-)
[1] https://lore.kernel.org/all/[email protected]/T/#u
[2] https://lore.kernel.org/all/[email protected]/T/#u
[3] https://lore.kernel.org/all/[email protected]/T/#u
From: Michal Hocko <[email protected]>
Leonardo Bras has noticed that pcp charge cache draining might be
disruptive to workloads relying on 'isolated cpus', a feature commonly
used by workloads sensitive to interruption and context switching, such
as vRAN and Industrial Control Systems.
There are essentially two ways to approach the issue. We can either
allow the pcp cache to be drained from a cpu other than the local one,
or avoid the remote flushing on isolated cpus.
The current pcp charge cache is heavily optimized for performance and
relies on always being accessed from its own cpu. That means it only
requires a local_lock (preempt_disable on !RT) and the draining is
handed over to the pcp WQ so that it still happens locally.
The former solution (remote draining) would require additional locking
to prevent local charges from racing with the draining. This adds an
atomic operation to the otherwise simple arithmetic fast path in
try_charge. Another concern is that the remote draining can cause lock
contention for the isolated workloads and therefore interfere with them
indirectly via user space interfaces.
The other option is to avoid scheduling the draining on isolated cpus
altogether. That means those remote cpus would keep their charges even
after drain_all_stock returns. This is certainly not optimal either, but
it shouldn't really cause any major problems. In the worst case (many
isolated cpus with cached charges - each of them holding up to
MEMCG_CHARGE_BATCH, i.e. 64 pages) the memory consumption of a memcg
would be artificially higher than what can be immediately used from
other cpus.
Theoretically a memcg OOM killer could be triggered prematurely. It is
currently not really clear whether this is a practical problem though. A
tight memcg limit would be counterproductive for cpu isolated workloads
pretty much by definition, because any reclaim induced by the memcg
limit could break user space timing expectations, as those workloads
usually expect to execute in userspace most of the time.
Charges could also be left behind on memcg removal. Any future charge
on those isolated cpus will drain that pcp cache, so this won't be a
permanent leak.
Considering the pros and cons of both approaches, this patch implements
the second option and simply does not schedule remote draining if the
target cpu is isolated. This solution is much simpler. It doesn't add
any new locking and it is more predictable from the user space POV.
Should the premature memcg OOM become a real life problem, we can
revisit this decision.
Cc: Leonardo Brás <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Reported-by: Leonardo Bras <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
Suggested-by: Roman Gushchin <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0524add35cae..12559c08d976 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2366,7 +2366,7 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
!test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
if (cpu == curcpu)
drain_local_stock(&stock->work);
- else
+ else if (!cpu_is_isolated(cpu))
schedule_work_on(cpu, &stock->work);
}
}
--
2.30.2
From: Frederic Weisbecker <[email protected]>
Provide this new API to check if a CPU has been isolated either through
the isolcpus= or nohz_full= kernel parameters.
It aims at avoiding kernel work that can be safely spared on CPUs
running sensitive workloads that can't bear any disturbance, such as
pcp cache draining.
Suggested-by: Michal Hocko <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/sched/isolation.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index 8c15abd67aed..fe1a46f30d24 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -46,6 +46,12 @@ static inline bool housekeeping_enabled(enum hk_type type)
static inline void housekeeping_affine(struct task_struct *t,
enum hk_type type) { }
+
+static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
+{
+ return true;
+}
+
static inline void housekeeping_init(void) { }
#endif /* CONFIG_CPU_ISOLATION */
@@ -58,4 +64,10 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
return true;
}
+static inline bool cpu_is_isolated(int cpu)
+{
+ return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
+ !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
+}
+
#endif /* _LINUX_SCHED_ISOLATION_H */
--
2.30.2
On Fri, Mar 17, 2023 at 03:33:13PM -0300, Marcelo Tosatti wrote:
> On Fri, Mar 17, 2023 at 02:44:47PM +0100, Michal Hocko wrote:
> > From: Frederic Weisbecker <[email protected]>
> >
> > Provide this new API to check if a CPU has been isolated either through
> > isolcpus= or nohz_full= kernel parameter.
> >
> > It aims at avoiding kernel load deemed to be safely spared on CPUs
> > running sensitive workload that can't bear any disturbance, such as
> > pcp cache draining.
>
> Hi Michal,
>
> This makes no sense to me.
>
> HK_TYPE_DOMAIN is set when isolcpus=domain is configured.
> HK_TYPE_TICK is set when nohz_full= is configured.
>
> The use-cases i am aware of use either:
>
> isolcpus=managed_irq,... nohz_full=
> OR
> isolcpus=domain,managed_irq,... nohz_full=
>
> So what is the point of this function again?
>
> Perhaps it made sense along with, but now does not make sense
> anymore:
>
> Subject: [PATCH 1/2] sched/isolation: Merge individual nohz_full features into a common housekeeping flag
>
> The individual isolation features turned on by nohz_full were initially
> split in order for each of them to be tunable through cpusets. However
> plans have changed in favour of an interface (be it cpusets or sysctl)
> grouping all these features to be turned on/off altogether. Then should
> the need ever arise, the interface can still be expanded to handle the
> individual isolation features.
>
> But Michal can just use housekeeping_test_cpu(cpu, HK_TYPE_TICK) and
> the convertion of nohz_full features into a common housekeeping flag
> can convert that to something else later?
Actually introducing cpu_is_isolated() seems fine, but it can call
housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
On Fri, Mar 17, 2023 at 02:44:47PM +0100, Michal Hocko wrote:
> From: Frederic Weisbecker <[email protected]>
>
> Provide this new API to check if a CPU has been isolated either through
> isolcpus= or nohz_full= kernel parameter.
>
> It aims at avoiding kernel load deemed to be safely spared on CPUs
> running sensitive workload that can't bear any disturbance, such as
> pcp cache draining.
Hi Michal,
This makes no sense to me.
HK_TYPE_DOMAIN is set when isolcpus=domain is configured.
HK_TYPE_TICK is set when nohz_full= is configured.
The use cases I am aware of use either:
isolcpus=managed_irq,... nohz_full=
OR
isolcpus=domain,managed_irq,... nohz_full=
So what is the point of this function again?
Perhaps it made sense alongside the following, but now it does not make
sense anymore:
Subject: [PATCH 1/2] sched/isolation: Merge individual nohz_full features into a common housekeeping flag
The individual isolation features turned on by nohz_full were initially
split in order for each of them to be tunable through cpusets. However
plans have changed in favour of an interface (be it cpusets or sysctl)
grouping all these features to be turned on/off altogether. Then should
the need ever arise, the interface can still be expanded to handle the
individual isolation features.
But Michal can just use housekeeping_test_cpu(cpu, HK_TYPE_TICK), and
the conversion of nohz_full features into a common housekeeping flag
can convert that to something else later?
>
> Suggested-by: Michal Hocko <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/sched/isolation.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
> index 8c15abd67aed..fe1a46f30d24 100644
> --- a/include/linux/sched/isolation.h
> +++ b/include/linux/sched/isolation.h
> @@ -46,6 +46,12 @@ static inline bool housekeeping_enabled(enum hk_type type)
>
> static inline void housekeeping_affine(struct task_struct *t,
> enum hk_type type) { }
> +
> +static inline bool housekeeping_test_cpu(int cpu, enum hk_type type)
> +{
> + return true;
> +}
> +
> static inline void housekeeping_init(void) { }
> #endif /* CONFIG_CPU_ISOLATION */
>
> @@ -58,4 +64,10 @@ static inline bool housekeeping_cpu(int cpu, enum hk_type type)
> return true;
> }
>
> +static inline bool cpu_is_isolated(int cpu)
> +{
> + return !housekeeping_test_cpu(cpu, HK_TYPE_DOMAIN) ||
> + !housekeeping_test_cpu(cpu, HK_TYPE_TICK);
> +}
> +
> #endif /* _LINUX_SCHED_ISOLATION_H */
> --
> 2.30.2
>
>
On Fri, Mar 17, 2023 at 6:44 AM Michal Hocko <[email protected]> wrote:
>
> From: Michal Hocko <[email protected]>
>
> Leonardo Bras has noticed that pcp charge cache draining might be
> disruptive on workloads relying on 'isolated cpus', a feature commonly
> used on workloads that are sensitive to interruption and context
> switching such as vRAN and Industrial Control Systems.
>
> There are essentially two ways how to approach the issue. We can either
> allow the pcp cache to be drained on a different rather than a local cpu
> or avoid remote flushing on isolated cpus.
>
> The current pcp charge cache is really optimized for high performance
> and it always relies to stick with its cpu. That means it only requires
> local_lock (preempt_disable on !RT) and draining is handed over to pcp
> WQ to drain locally again.
>
> The former solution (remote draining) would require to add an additional
> locking to prevent local charges from racing with the draining. This
> adds an atomic operation to otherwise simple arithmetic fast path in the
> try_charge path. Another concern is that the remote draining can cause a
> lock contention for the isolated workloads and therefore interfere with
> it indirectly via user space interfaces.
>
> Another option is to avoid draining scheduling on isolated cpus
> altogether. That means that those remote cpus would keep their charges
> even after drain_all_stock returns. This is certainly not optimal either
> but it shouldn't really cause any major problems. In the worst case
> (many isolated cpus with charges - each of them with MEMCG_CHARGE_BATCH
> i.e 64 page) the memory consumption of a memcg would be artificially
> higher than can be immediately used from other cpus.
>
> Theoretically a memcg OOM killer could be triggered pre-maturely.
> Currently it is not really clear whether this is a practical problem
> though. Tight memcg limit would be really counter productive to cpu
> isolated workloads pretty much by definition because any memory
> reclaimed induced by memcg limit could break user space timing
> expectations as those usually expect execution in the userspace most of
> the time.
>
> Also charges could be left behind on memcg removal. Any future charge on
> those isolated cpus will drain that pcp cache so this won't be a
> permanent leak.
>
> Considering cons and pros of both approaches this patch is implementing
> the second option and simply do not schedule remote draining if the
> target cpu is isolated. This solution is much more simpler. It doesn't
> add any new locking and it is more more predictable from the user space
> POV. Should the pre-mature memcg OOM become a real life problem, we can
> revisit this decision.
>
> Cc: Leonardo Brás <[email protected]>
> Cc: Marcelo Tosatti <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Muchun Song <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Reported-by: Leonardo Bras <[email protected]>
> Acked-by: Roman Gushchin <[email protected]>
> Suggested-by: Roman Gushchin <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
Hi Michal,
I love your patch! Yet something to improve:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.3-rc2 next-20230317]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230317134448.11082-3-mhocko%40kernel.org
patch subject: [PATCH 2/2] memcg: do not drain charge pcp caches on remote isolated cpus
config: x86_64-randconfig-a014 (https://download.01.org/0day-ci/archive/20230318/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/b376cfcf40a43276e1280950bb926fdf3521940a
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
git checkout b376cfcf40a43276e1280950bb926fdf3521940a
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> mm/memcontrol.c:2369:14: error: implicit declaration of function 'cpu_is_isolated' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
else if (!cpu_is_isolated(cpu))
^
1 error generated.
vim +/cpu_is_isolated +2369 mm/memcontrol.c
2331
2332 /*
2333 * Drains all per-CPU charge caches for given root_memcg resp. subtree
2334 * of the hierarchy under it.
2335 */
2336 static void drain_all_stock(struct mem_cgroup *root_memcg)
2337 {
2338 int cpu, curcpu;
2339
2340 /* If someone's already draining, avoid adding running more workers. */
2341 if (!mutex_trylock(&percpu_charge_mutex))
2342 return;
2343 /*
2344 * Notify other cpus that system-wide "drain" is running
2345 * We do not care about races with the cpu hotplug because cpu down
2346 * as well as workers from this path always operate on the local
2347 * per-cpu data. CPU up doesn't touch memcg_stock at all.
2348 */
2349 migrate_disable();
2350 curcpu = smp_processor_id();
2351 for_each_online_cpu(cpu) {
2352 struct memcg_stock_pcp *stock = &per_cpu(memcg_stock, cpu);
2353 struct mem_cgroup *memcg;
2354 bool flush = false;
2355
2356 rcu_read_lock();
2357 memcg = stock->cached;
2358 if (memcg && stock->nr_pages &&
2359 mem_cgroup_is_descendant(memcg, root_memcg))
2360 flush = true;
2361 else if (obj_stock_flush_required(stock, root_memcg))
2362 flush = true;
2363 rcu_read_unlock();
2364
2365 if (flush &&
2366 !test_and_set_bit(FLUSHING_CACHED_CHARGE, &stock->flags)) {
2367 if (cpu == curcpu)
2368 drain_local_stock(&stock->work);
> 2369 else if (!cpu_is_isolated(cpu))
2370 schedule_work_on(cpu, &stock->work);
2371 }
2372 }
2373 migrate_enable();
2374 mutex_unlock(&percpu_charge_mutex);
2375 }
2376
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
Hi Michal,
I love your patch! Yet something to improve:
[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on linus/master v6.3-rc2 next-20230317]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20230317134448.11082-3-mhocko%40kernel.org
patch subject: [PATCH 2/2] memcg: do not drain charge pcp caches on remote isolated cpus
config: x86_64-randconfig-a013 (https://download.01.org/0day-ci/archive/20230318/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/b376cfcf40a43276e1280950bb926fdf3521940a
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Michal-Hocko/sched-isolation-Add-cpu_is_isolated-API/20230317-214621
git checkout b376cfcf40a43276e1280950bb926fdf3521940a
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
mm/memcontrol.c: In function 'drain_all_stock':
>> mm/memcontrol.c:2369:35: error: implicit declaration of function 'cpu_is_isolated' [-Werror=implicit-function-declaration]
2369 | else if (!cpu_is_isolated(cpu))
| ^~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
On Sat, 18 Mar 2023 06:22:40 +0800 kernel test robot <[email protected]> wrote:
> mm/memcontrol.c: In function 'drain_all_stock':
> >> mm/memcontrol.c:2369:35: error: implicit declaration of function 'cpu_is_isolated' [-Werror=implicit-function-declaration]
> 2369 | else if (!cpu_is_isolated(cpu))
> | ^~~~~~~~~~~~~~~
Thanks.
--- a/mm/memcontrol.c~memcg-do-not-drain-charge-pcp-caches-on-remote-isolated-cpus-fix
+++ a/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/resume_user_mode.h>
#include <linux/psi.h>
#include <linux/seq_buf.h>
+#include <linux/sched/isolation.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
_
On Fri 17-03-23 16:32:29, Andrew Morton wrote:
> On Sat, 18 Mar 2023 06:22:40 +0800 kernel test robot <[email protected]> wrote:
>
> > mm/memcontrol.c: In function 'drain_all_stock':
> > >> mm/memcontrol.c:2369:35: error: implicit declaration of function 'cpu_is_isolated' [-Werror=implicit-function-declaration]
> > 2369 | else if (!cpu_is_isolated(cpu))
> > | ^~~~~~~~~~~~~~~
>
> Thanks.
>
> --- a/mm/memcontrol.c~memcg-do-not-drain-charge-pcp-caches-on-remote-isolated-cpus-fix
> +++ a/mm/memcontrol.c
> @@ -63,6 +63,7 @@
> #include <linux/resume_user_mode.h>
> #include <linux/psi.h>
> #include <linux/seq_buf.h>
> +#include <linux/sched/isolation.h>
> #include "internal.h"
> #include <net/sock.h>
> #include <net/ip.h>
Thanks a lot Andrew!
> _
--
Michal Hocko
SUSE Labs
On Fri 17-03-23 15:35:05, Marcelo Tosatti wrote:
> On Fri, Mar 17, 2023 at 03:33:13PM -0300, Marcelo Tosatti wrote:
> > On Fri, Mar 17, 2023 at 02:44:47PM +0100, Michal Hocko wrote:
> > > From: Frederic Weisbecker <[email protected]>
> > >
> > > Provide this new API to check if a CPU has been isolated either through
> > > isolcpus= or nohz_full= kernel parameter.
> > >
> > > It aims at avoiding kernel load deemed to be safely spared on CPUs
> > > running sensitive workload that can't bear any disturbance, such as
> > > pcp cache draining.
> >
> > Hi Michal,
> >
> > This makes no sense to me.
> >
> > HK_TYPE_DOMAIN is set when isolcpus=domain is configured.
> > HK_TYPE_TICK is set when nohz_full= is configured.
> >
> > The use-cases i am aware of use either:
> >
> > isolcpus=managed_irq,... nohz_full=
> > OR
> > isolcpus=domain,managed_irq,... nohz_full=
> >
> > So what is the point of this function again?
> >
> > Perhaps it made sense along with, but now does not make sense
> > anymore:
> >
> > Subject: [PATCH 1/2] sched/isolation: Merge individual nohz_full features into a common housekeeping flag
> >
> > The individual isolation features turned on by nohz_full were initially
> > split in order for each of them to be tunable through cpusets. However
> > plans have changed in favour of an interface (be it cpusets or sysctl)
> > grouping all these features to be turned on/off altogether. Then should
> > the need ever arise, the interface can still be expanded to handle the
> > individual isolation features.
> >
> > But Michal can just use housekeeping_test_cpu(cpu, HK_TYPE_TICK) and
> > the convertion of nohz_full features into a common housekeeping flag
> > can convert that to something else later?
>
> Actually introducing cpu_is_isolated() seems fine, but it can call
> housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
This is not really my area. Frederic, could you have a look please?
--
Michal Hocko
SUSE Labs
On Sat 18-03-23 11:23:50, Hillf Danton wrote:
> On 17 Mar 2023 14:44:48 +0100 Michal Hocko <[email protected]>
> > Leonardo Bras has noticed that pcp charge cache draining might be
> > disruptive on workloads relying on 'isolated cpus', a feature commonly
> > used on workloads that are sensitive to interruption and context
> > switching such as vRAN and Industrial Control Systems.
> >
> > There are essentially two ways how to approach the issue. We can either
> > allow the pcp cache to be drained on a different rather than a local cpu
> > or avoid remote flushing on isolated cpus.
> >
> > The current pcp charge cache is really optimized for high performance
> > and it always relies to stick with its cpu. That means it only requires
> > local_lock (preempt_disable on !RT) and draining is handed over to pcp
> > WQ to drain locally again.
> >
> > The former solution (remote draining) would require to add an additional
> > locking to prevent local charges from racing with the draining. This
> > adds an atomic operation to otherwise simple arithmetic fast path in the
> > try_charge path. Another concern is that the remote draining can cause a
> > lock contention for the isolated workloads and therefore interfere with
> > it indirectly via user space interfaces.
> >
> > Another option is to avoid draining scheduling on isolated cpus
> > altogether. That means that those remote cpus would keep their charges
> > even after drain_all_stock returns. This is certainly not optimal either
> > but it shouldn't really cause any major problems. In the worst case
> > (many isolated cpus with charges - each of them with MEMCG_CHARGE_BATCH
> > i.e 64 page) the memory consumption of a memcg would be artificially
> > higher than can be immediately used from other cpus.
> >
> > Theoretically a memcg OOM killer could be triggered pre-maturely.
> > Currently it is not really clear whether this is a practical problem
> > though. Tight memcg limit would be really counter productive to cpu
> > isolated workloads pretty much by definition because any memory
> > reclaimed induced by memcg limit could break user space timing
> > expectations as those usually expect execution in the userspace most of
> > the time.
> >
> > Also charges could be left behind on memcg removal. Any future charge on
> > those isolated cpus will drain that pcp cache so this won't be a
> > permanent leak.
> >
> > Considering cons and pros of both approaches this patch is implementing
> > the second option and simply do not schedule remote draining if the
> > target cpu is isolated. This solution is much more simpler. It doesn't
> > add any new locking and it is more more predictable from the user space
> > POV. Should the pre-mature memcg OOM become a real life problem, we can
> > revisit this decision.
>
> JFYI feel free to take a look at the non-housekeeping CPUs [1].
>
> [1] https://lore.kernel.org/lkml/[email protected]/
Such an approach would require remote draining and I hope I have
explained why that is not a preferred way in this case. Other than that
I do agree with Christoph that a generic approach would be really nice.
--
Michal Hocko
SUSE Labs
Le Sat, Mar 18, 2023 at 09:04:38AM +0100, Michal Hocko a écrit :
> On Fri 17-03-23 15:35:05, Marcelo Tosatti wrote:
> > On Fri, Mar 17, 2023 at 03:33:13PM -0300, Marcelo Tosatti wrote:
> > > On Fri, Mar 17, 2023 at 02:44:47PM +0100, Michal Hocko wrote:
> > > > From: Frederic Weisbecker <[email protected]>
> > > >
> > > > Provide this new API to check if a CPU has been isolated either through
> > > > isolcpus= or nohz_full= kernel parameter.
> > > >
> > > > It aims at avoiding kernel load deemed to be safely spared on CPUs
> > > > running sensitive workload that can't bear any disturbance, such as
> > > > pcp cache draining.
> > >
> > > Hi Michal,
> > >
> > > This makes no sense to me.
> > >
> > > HK_TYPE_DOMAIN is set when isolcpus=domain is configured.
> > > HK_TYPE_TICK is set when nohz_full= is configured.
> > >
> > > The use-cases i am aware of use either:
> > >
> > > isolcpus=managed_irq,... nohz_full=
> > > OR
> > > isolcpus=domain,managed_irq,... nohz_full=
> > >
> > > So what is the point of this function again?
> > >
> > > Perhaps it made sense along with, but now does not make sense
> > > anymore:
> > >
> > > Subject: [PATCH 1/2] sched/isolation: Merge individual nohz_full features into a common housekeeping flag
> > >
> > > The individual isolation features turned on by nohz_full were initially
> > > split in order for each of them to be tunable through cpusets. However
> > > plans have changed in favour of an interface (be it cpusets or sysctl)
> > > grouping all these features to be turned on/off altogether. Then should
> > > the need ever arise, the interface can still be expanded to handle the
> > > individual isolation features.
> > >
> > > But Michal can just use housekeeping_test_cpu(cpu, HK_TYPE_TICK) and
> > > the convertion of nohz_full features into a common housekeeping flag
> > > can convert that to something else later?
> >
> > Actually introducing cpu_is_isolated() seems fine, but it can call
> > housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
>
> This is not really my area. Frederic, could you have a look please?
The point is to have a function that tells if either nohz_full= or
isolcpus=[domain] has been passed for the given CPU.
Because I assumed that both would be interested in avoiding that flush
noise, wouldn't it be the case?
On Fri, Mar 24, 2023 at 11:35:35PM +0100, Frederic Weisbecker wrote:
> Le Sat, Mar 18, 2023 at 09:04:38AM +0100, Michal Hocko a écrit :
> > On Fri 17-03-23 15:35:05, Marcelo Tosatti wrote:
> > > On Fri, Mar 17, 2023 at 03:33:13PM -0300, Marcelo Tosatti wrote:
> > > > On Fri, Mar 17, 2023 at 02:44:47PM +0100, Michal Hocko wrote:
> > > > > From: Frederic Weisbecker <[email protected]>
> > > > >
> > > > > Provide this new API to check if a CPU has been isolated either through
> > > > > isolcpus= or nohz_full= kernel parameter.
> > > > >
> > > > > It aims at avoiding kernel load deemed to be safely spared on CPUs
> > > > > running sensitive workload that can't bear any disturbance, such as
> > > > > pcp cache draining.
> > > >
> > > > Hi Michal,
> > > >
> > > > This makes no sense to me.
> > > >
> > > > HK_TYPE_DOMAIN is set when isolcpus=domain is configured.
> > > > HK_TYPE_TICK is set when nohz_full= is configured.
> > > >
> > > > The use-cases i am aware of use either:
> > > >
> > > > isolcpus=managed_irq,... nohz_full=
> > > > OR
> > > > isolcpus=domain,managed_irq,... nohz_full=
> > > >
> > > > So what is the point of this function again?
> > > >
> > > > Perhaps it made sense along with, but now does not make sense
> > > > anymore:
> > > >
> > > > Subject: [PATCH 1/2] sched/isolation: Merge individual nohz_full features into a common housekeeping flag
> > > >
> > > > The individual isolation features turned on by nohz_full were initially
> > > > split in order for each of them to be tunable through cpusets. However
> > > > plans have changed in favour of an interface (be it cpusets or sysctl)
> > > > grouping all these features to be turned on/off altogether. Then should
> > > > the need ever arise, the interface can still be expanded to handle the
> > > > individual isolation features.
> > > >
> > > > But Michal can just use housekeeping_test_cpu(cpu, HK_TYPE_TICK) and
> > > > the conversion of nohz_full features into a common housekeeping flag
> > > > can convert that to something else later?
> > >
> > > Actually introducing cpu_is_isolated() seems fine, but it can call
> > > housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
> >
> > This is not really my area. Frederic, could you have a look please?
>
> The point is to have a function that tells if either nohz_full= or
> isolcpus=[domain] has been passed for the given CPU.
>
> Because I assumed that both would be interested in avoiding that flush
> noise, wouldn't it be the case?
Yes, that is the case. But as a note: for the two main types of
configuration performed (one uses isolcpus=[domain] and the other
cgroups, for isolating processes) nohz_full= is always set.
So just testing for nohz_full= would be sufficient (which perhaps would
make the code simpler).
On Mon, Mar 27, 2023 at 07:24:54AM -0300, Marcelo Tosatti wrote:
> On Fri, Mar 24, 2023 at 11:35:35PM +0100, Frederic Weisbecker wrote:
> > On Sat, Mar 18, 2023 at 09:04:38AM +0100, Michal Hocko wrote:
> > > On Fri 17-03-23 15:35:05, Marcelo Tosatti wrote:
> > > > On Fri, Mar 17, 2023 at 03:33:13PM -0300, Marcelo Tosatti wrote:
> > > > > On Fri, Mar 17, 2023 at 02:44:47PM +0100, Michal Hocko wrote:
> > > > > > From: Frederic Weisbecker <[email protected]>
> > > > > >
> > > > > > Provide this new API to check if a CPU has been isolated either through
> > > > > > isolcpus= or nohz_full= kernel parameter.
> > > > > >
> > > > > > It aims at avoiding kernel load deemed to be safely spared on CPUs
> > > > > > running sensitive workloads that can't bear any disturbance, such as
> > > > > > pcp cache draining.
> > > > >
> > > > > Hi Michal,
> > > > >
> > > > > This makes no sense to me.
> > > > >
> > > > > HK_TYPE_DOMAIN is set when isolcpus=domain is configured.
> > > > > HK_TYPE_TICK is set when nohz_full= is configured.
> > > > >
> > > > > The use-cases I am aware of use either:
> > > > >
> > > > > isolcpus=managed_irq,... nohz_full=
> > > > > OR
> > > > > isolcpus=domain,managed_irq,... nohz_full=
> > > > >
> > > > > So what is the point of this function again?
> > > > >
> > > > > Perhaps it made sense along with the patch below, but it no longer
> > > > > does:
> > > > >
> > > > > Subject: [PATCH 1/2] sched/isolation: Merge individual nohz_full features into a common housekeeping flag
> > > > >
> > > > > The individual isolation features turned on by nohz_full were initially
> > > > > split in order for each of them to be tunable through cpusets. However
> > > > > plans have changed in favour of an interface (be it cpusets or sysctl)
> > > > > grouping all these features to be turned on/off altogether. Then should
> > > > > the need ever arise, the interface can still be expanded to handle the
> > > > > individual isolation features.
> > > > >
> > > > > But Michal can just use housekeeping_test_cpu(cpu, HK_TYPE_TICK) and
> > > > > the conversion of nohz_full features into a common housekeeping flag
> > > > > can convert that to something else later?
> > > >
> > > > Actually introducing cpu_is_isolated() seems fine, but it can call
> > > > housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
> > >
> > > This is not really my area. Frederic, could you have a look please?
> >
> > The point is to have a function that tells if either nohz_full= or
> > isolcpus=[domain] has been passed for the given CPU.
> >
> > Because I assumed that both would be interested in avoiding that flush
> > noise, wouldn't it be the case?
>
> Yes, that is the case. But as a note: for the two main types of
> configuration performed (one uses isolcpus=[domain] and the other
> cgroups, for isolating processes) nohz_full= is always set.
>
> So just testing for nohz_full= would be sufficient (which perhaps would
> make the code simpler).
OK, then all that is needed is to test tick_nohz_full_cpu(target), right?
On Mon 27-03-23 07:24:54, Marcelo Tosatti wrote:
> On Fri, Mar 24, 2023 at 11:35:35PM +0100, Frederic Weisbecker wrote:
> > On Sat, Mar 18, 2023 at 09:04:38AM +0100, Michal Hocko wrote:
> > > On Fri 17-03-23 15:35:05, Marcelo Tosatti wrote:
[...]
> > > > Actually introducing cpu_is_isolated() seems fine, but it can call
> > > > housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
> > >
> > > This is not really my area. Frederic, could you have a look please?
> >
> > The point is to have a function that tells if either nohz_full= or
> > isolcpus=[domain] has been passed for the given CPU.
> >
> > Because I assumed that both would be interested in avoiding that flush
> > noise, wouldn't it be the case?
>
> Yes, that is the case. But as a note: for the two main types of
> configuration performed (one uses isolcpus=[domain] and the other
> cgroups, for isolating processes) nohz_full= is always set.
>
> So just testing for nohz_full= would be sufficient (which perhaps would
> make the code simpler).
I do not see any mention of that assumption under Documentation/. Is
this a best practice documented anywhere, or does it just happen to be
the case with the workloads you deal with?
--
Michal Hocko
SUSE Labs
On Tue, Mar 28, 2023 at 01:48:02PM +0200, Michal Hocko wrote:
> On Mon 27-03-23 07:24:54, Marcelo Tosatti wrote:
> > On Fri, Mar 24, 2023 at 11:35:35PM +0100, Frederic Weisbecker wrote:
> > > On Sat, Mar 18, 2023 at 09:04:38AM +0100, Michal Hocko wrote:
> > > > On Fri 17-03-23 15:35:05, Marcelo Tosatti wrote:
> [...]
> > > > > Actually introducing cpu_is_isolated() seems fine, but it can call
> > > > > housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
> > > >
> > > > This is not really my area. Frederic, could you have a look please?
> > >
> > > The point is to have a function that tells if either nohz_full= or
> > > isolcpus=[domain] has been passed for the given CPU.
> > >
> > > Because I assumed that both would be interested in avoiding that flush
> > > noise, wouldn't it be the case?
> >
> > Yes, that is the case. But as a note: for the two main types of
> > configuration performed (one uses isolcpus=[domain] and the other
> > cgroups, for isolating processes) nohz_full= is always set.
> >
> > So just testing for nohz_full= would be sufficient (which perhaps would
> > make the code simpler).
>
> I do not see any mention of that assumption under Documentation/.
Documentation/admin-guide/kernel-per-CPU-kthreads.rst
SCHED_SOFTIRQ
-------------
Do all of the following:
1. Avoid sending scheduler IPIs to the CPU to be de-jittered,
for example, ensure that at most one runnable kthread is present
on that CPU. If a thread that expects to run on the de-jittered
CPU awakens, the scheduler will send an IPI that can result in
a subsequent SCHED_SOFTIRQ.
2. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered
is marked as an adaptive-ticks CPU using the "nohz_full="
boot parameter. This reduces the number of scheduler-clock
interrupts that the de-jittered CPU receives, minimizing its
chances of being selected to do the load balancing work that
runs in SCHED_SOFTIRQ context.
> Is this a best practice documented anywhere, or does it just happen to
> be the case with the workloads you deal with?
Option 2. However Frederic seems interested in matching the exported
toggles with the known use-case classes.
For example, for this guide:
http://www.comfilewiki.co.kr/en/doku.php?id=comfilepi:improving_real-time_performance:index
Using nohz_full= would be a benefit (and it's not currently set,
perhaps due to not knowing all the options?).
AFAIU the workloads for which disabling nohz_full= is a benefit are those
where switching between nohz_full mode and sched-tick-enabled mode
(which involves reprogramming the local timer) happens often and is
therefore worth avoiding, for example switching between one runnable
task and more than one runnable task (and vice versa).
On Wed 29-03-23 11:20:21, Marcelo Tosatti wrote:
> On Tue, Mar 28, 2023 at 01:48:02PM +0200, Michal Hocko wrote:
> > On Mon 27-03-23 07:24:54, Marcelo Tosatti wrote:
> > > On Fri, Mar 24, 2023 at 11:35:35PM +0100, Frederic Weisbecker wrote:
> > > > On Sat, Mar 18, 2023 at 09:04:38AM +0100, Michal Hocko wrote:
> > > > > On Fri 17-03-23 15:35:05, Marcelo Tosatti wrote:
> > [...]
> > > > > > Actually introducing cpu_is_isolated() seems fine, but it can call
> > > > > > housekeeping_test_cpu(cpu, HK_TYPE_TICK) AFAICS.
> > > > >
> > > > > This is not really my area. Frederic, could you have a look please?
> > > >
> > > > The point is to have a function that tells if either nohz_full= or
> > > > isolcpus=[domain] has been passed for the given CPU.
> > > >
> > > > Because I assumed that both would be interested in avoiding that flush
> > > > noise, wouldn't it be the case?
> > >
> > > Yes, that is the case. But as a note: for the two main types of
> > > configuration performed (one uses isolcpus=[domain] and the other
> > > cgroups, for isolating processes) nohz_full= is always set.
> > >
> > > So just testing for nohz_full= would be sufficient (which perhaps would
> > > make the code simpler).
> >
> > I do not see any mention of that assumption under Documentation/.
>
> Documentation/admin-guide/kernel-per-CPU-kthreads.rst
>
> SCHED_SOFTIRQ
> -------------
>
> Do all of the following:
>
> 1. Avoid sending scheduler IPIs to the CPU to be de-jittered,
> for example, ensure that at most one runnable kthread is present
> on that CPU. If a thread that expects to run on the de-jittered
> CPU awakens, the scheduler will send an IPI that can result in
> a subsequent SCHED_SOFTIRQ.
> 2. CONFIG_NO_HZ_FULL=y and ensure that the CPU to be de-jittered
> is marked as an adaptive-ticks CPU using the "nohz_full="
> boot parameter. This reduces the number of scheduler-clock
> interrupts that the de-jittered CPU receives, minimizing its
> chances of being selected to do the load balancing work that
> runs in SCHED_SOFTIRQ context.
Quite hidden and easy to miss if you are only aware of isolcpus.
> > Is this a best practice documented anywhere, or does it just happen to
> > be the case with the workloads you deal with?
>
> Option 2. However Frederic seems interested in matching the exported
> toggles with the known use-case classes.
>
> For example, for this guide:
> http://www.comfilewiki.co.kr/en/doku.php?id=comfilepi:improving_real-time_performance:index
>
> Using nohz_full= would be a benefit (and it's not currently set,
> perhaps due to not knowing all the options?).
>
> AFAIU the workloads for which disabling nohz_full= is a benefit are those
> where switching between nohz_full mode and sched-tick-enabled mode
> (which involves reprogramming the local timer) happens often and is
> therefore worth avoiding, for example switching between one runnable
> task and more than one runnable task (and vice versa).
The patch from Frederic tests for both. You seem to be arguing for
reducing the test and I still do not understand why. Sure, some workloads
(following the above) will likely use nohz_full= as well, but does it
make sense to build that expectation into the higher-level logic? What
is the actual benefit?
--
Michal Hocko
SUSE Labs
On Thu, Mar 30, 2023 at 03:28:52PM +0200, Michal Hocko wrote:
> > > Is this a best practice documented anywhere, or does it just happen to
> > > be the case with the workloads you deal with?
> >
> > Option 2. However Frederic seems interested in matching the exported
> > toggles with the known use-case classes.
> >
> > For example, for this guide:
> > http://www.comfilewiki.co.kr/en/doku.php?id=comfilepi:improving_real-time_performance:index
> >
> > Using nohz_full= would be a benefit (and it's not currently set,
> > perhaps due to not knowing all the options?).
> >
> > AFAIU the workloads for which disabling nohz_full= is a benefit are those
> > where switching between nohz_full mode and sched-tick-enabled mode
> > (which involves reprogramming the local timer) happens often and is
> > therefore worth avoiding, for example switching between one runnable
> > task and more than one runnable task (and vice versa).
>
> The patch from Frederic tests for both. You seem to be arguing for
> reducing the test and I still do not understand why. Sure, some workloads
> (following the above) will likely use nohz_full= as well, but does it
> make sense to build that expectation into the higher-level logic? What
> is the actual benefit?
Just thinking of simpler code. Feel free to maintain the patch as-is if
you see fit.