LinuxLists.cc - [PATCH] nohz: don't kick non-idle CPUs in tick_nohz_full_kick

2018-07-12 18:20:48

Subject: [PATCH] nohz: don't kick non-idle CPUs in tick_nohz_full_kick_cpu()

IIUC, tick_nohz_full_kick_cpu() is intended to wakeup idle CPUs
that will not be poked by scheduler because they are actually
nohz_full.

But in fact this function kicks all CPUs listed in tick_nohz_full_mask,
namely:
- idle CPUs;
- CPUs runnung normal tasks;
- CPUs running isolated tasks [1];

For normal tasks it introduces unneeded latency, and for isolated tasks
it's fatal because isolation gets broken and task receives SIGKILL.

The patch below makes tick_nohz_full_kick_cpu() kicking only idle CPUs.
Non-idle nohz_full CPUs will observe changed system settings just like
non-idle normal (i.e. not nohz_full) CPUs, at next reschedule.

[1] https://lkml.org/lkml/2017/11/3/589

Signed-off-by: Yury Norov <[email protected]>
---
kernel/time/tick-sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c026145eba2f..1c24c700e75a 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -247,7 +247,7 @@ static void tick_nohz_full_kick(void)
*/
void tick_nohz_full_kick_cpu(int cpu)
{
- if (!tick_nohz_full_cpu(cpu))
+ if (!(tick_nohz_full_cpu(cpu) && idle_cpu(cpu)))
return;

irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
--
2.17.1

2018-07-16 15:34:09

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [PATCH] nohz: don't kick non-idle CPUs in tick_nohz_full_kick_cpu()

On Thu, Jul 12, 2018 at 09:19:22PM +0300, Yury Norov wrote:
> IIUC, tick_nohz_full_kick_cpu() is intended to wakeup idle CPUs
> that will not be poked by scheduler because they are actually
> nohz_full.

Not exactly. It is intended to trigger an interrupt on a nohz_full
CPU that may be running in userspace without any tick. The irq_exit()
code let us reprogramm the tick with the latest dependency updates.

>
> But in fact this function kicks all CPUs listed in tick_nohz_full_mask,
> namely:
> - idle CPUs;
> - CPUs runnung normal tasks;
> - CPUs running isolated tasks [1];
>
> For normal tasks it introduces unneeded latency, and for isolated tasks
> it's fatal because isolation gets broken and task receives SIGKILL.

So this patch applies on Chris series right? For now there is no such
distinction between normal and isolated tasks. Any task running in a
nohz_full CPU is considered to be isolated.

> The patch below makes tick_nohz_full_kick_cpu() kicking only idle CPUs.
> Non-idle nohz_full CPUs will observe changed system settings just like
> non-idle normal (i.e. not nohz_full) CPUs, at next reschedule.

That's not exactly what we want. In fact when a task runs in a nohz_full CPU,
it may not meet any reschedule interrupt for a long while. This is why we have
tick_nohz_full_kick_cpu() in order to force a nohz_full CPU to see the latest
changes.

Thanks.

2018-07-19 20:24:14

by Yury Norov

[permalink] [raw]

Subject: Re: [PATCH] nohz: don't kick non-idle CPUs in tick_nohz_full_kick_cpu()

On Mon, Jul 16, 2018 at 05:31:10PM +0200, Frederic Weisbecker wrote:
> External Email
>
> On Thu, Jul 12, 2018 at 09:19:22PM +0300, Yury Norov wrote:
> > IIUC, tick_nohz_full_kick_cpu() is intended to wakeup idle CPUs
> > that will not be poked by scheduler because they are actually
> > nohz_full.
>
> Not exactly. It is intended to trigger an interrupt on a nohz_full
> CPU that may be running in userspace without any tick. The irq_exit()
> code let us reprogramm the tick with the latest dependency updates.
>
> >
> > But in fact this function kicks all CPUs listed in tick_nohz_full_mask,
> > namely:
> > - idle CPUs;
> > - CPUs runnung normal tasks;
> > - CPUs running isolated tasks [1];
> >
> > For normal tasks it introduces unneeded latency, and for isolated tasks
> > it's fatal because isolation gets broken and task receives SIGKILL.
>
> So this patch applies on Chris series right?

This patch may be applied on master. That's why I sent it to you.

> For now there is no such
> distinction between normal and isolated tasks. Any task running in a
> nohz_full CPU is considered to be isolated.
>
> > The patch below makes tick_nohz_full_kick_cpu() kicking only idle CPUs.
> > Non-idle nohz_full CPUs will observe changed system settings just like
> > non-idle normal (i.e. not nohz_full) CPUs, at next reschedule.
>
> That's not exactly what we want. In fact when a task runs in a nohz_full CPU,
> it may not meet any reschedule interrupt for a long while. This is why we have
> tick_nohz_full_kick_cpu() in order to force a nohz_full CPU to see the latest
> changes.

OK, got it.

So if my understanding correct, there is 'soft isolation' which is
nohz_full, and 'hard isolation' which is Chris' task_isonation feature. For
soft isolation, the desirable behavior is to receive interrupts generated
by tick_nohz_full_kick_cpu(), and for hard isolation it's obviously not
desirable because it kills application.

The patch below adds check against task isolation in tick_nohz_full_kick_cpu().
It is on top of Chris' series. Is it OK from nohz point of view?

---

While here. I just wonder, on my system IRQs are sent to nohz_full CPUs
at every incoming ssh connection. The trace is like this:
[ 206.835533] Call trace:
[ 206.848411] [<ffff00000889f984>] dump_stack+0x84/0xa8
[ 206.853455] [<ffff0000081ea308>] _task_isolation_remote+0x130/0x140
[ 206.859714] [<ffff0000081bf5ec>] irq_work_queue_on+0xcc/0xfc
[ 206.865365] [<ffff0000081478ac>] tick_nohz_full_kick_cpu+0x88/0x94
[ 206.871536] [<ffff000008147930>] tick_nohz_dep_set_all+0x78/0xa8
[ 206.877533] [<ffff000008147b58>] tick_nohz_dep_set_signal+0x28/0x34
[ 206.883792] [<ffff0000081421fc>] set_process_cpu_timer+0xd0/0x128
[ 206.889876] [<ffff0000081422ac>] update_rlimit_cpu+0x58/0x7c
[ 206.895528] [<ffff0000083aa3d0>] selinux_bprm_committing_creds+0x180/0x1fc
[ 206.902394] [<ffff00000839e394>] security_bprm_committing_creds+0x40/0x5c
[ 206.909173] [<ffff00000828c4a0>] install_exec_creds+0x20/0x6c
[ 206.914911] [<ffff0000082e15b0>] load_elf_binary+0x368/0xbb8
[ 206.920561] [<ffff00000828d09c>] search_binary_handler+0xb8/0x224
[ 206.926645] [<ffff00000828d99c>] do_execveat_common+0x44c/0x5f0
[ 206.932555] [<ffff00000828db78>] do_execve+0x38/0x44
[ 206.937510] [<ffff00000828dd74>] SyS_execve+0x34/0x44

I suspect that scp, ssh tunneling and similar network activities will source
ticks on nohz_full CPUs as well. On high-loaded server it may generate
significant interrupt traffic on nohz_full CPUs. Is it desirable behavior?

---
Yury

From 9be3c9996c06319a8070ac182291d08acfdc588d Mon Sep 17 00:00:00 2001
From: Yury Norov <[email protected]>
Date: Tue, 17 Jul 2018 12:40:49 +0300
Subject: [PATCH] task_isolation: don't kick isolated CPUs with
tick_nohz_full_kick_cpu()
To: Chris Metcalf <[email protected]>,
Frederic Weisbecker <[email protected]>
Cc: Ingo Molnar <[email protected]>,
Thomas Gleixner <[email protected]>,
"Goutham, Sunil" <[email protected]>,
[email protected]

On top of Chris Metcalf series:
https://lkml.org/lkml/2017/11/3/589

tick_nohz_full_kick_cpu() currently interrupts CPUs that may run isolated
task. It's not desirable because that kick will kill isolated application.

The patch below adds check against task isolation in
tick_nohz_full_kick_cpu() to prevent breaking the isolation.

Signed-off-by: Yury Norov <[email protected]>
---
include/linux/isolation.h | 7 +++++++
kernel/isolation.c | 6 ------
kernel/time/tick-sched.c | 5 +++--
3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/isolation.h b/include/linux/isolation.h
index b7f0a9085b13..fad606cdcd5e 100644
--- a/include/linux/isolation.h
+++ b/include/linux/isolation.h
@@ -158,6 +158,12 @@ static inline void task_isolation_user_exit(void)
#endif
}

+static inline bool is_isolation_cpu(int cpu)
+{
+ return task_isolation_map != NULL &&
+ cpumask_test_cpu(cpu, task_isolation_map);
+}
+
#else /* !CONFIG_TASK_ISOLATION */
static inline int task_isolation_request(unsigned int flags) { return -EINVAL; }
static inline void task_isolation_start(void) { }
@@ -172,6 +178,7 @@ static inline void task_isolation_remote_cpumask_interrupt(
const struct cpumask *mask, const char *fmt, ...) { }
static inline void task_isolation_signal(struct task_struct *task) { }
static inline void task_isolation_user_exit(void) { }
+static inline bool is_isolation_cpu(int cpu) { return 0; }
#endif

#endif /* _LINUX_ISOLATION_H */
diff --git a/kernel/isolation.c b/kernel/isolation.c
index 1e39a1493e76..05db247924ef 100644
--- a/kernel/isolation.c
+++ b/kernel/isolation.c
@@ -41,12 +41,6 @@ static int __init task_isolation_init(void)
}
core_initcall(task_isolation_init)

-static inline bool is_isolation_cpu(int cpu)
-{
- return task_isolation_map != NULL &&
- cpumask_test_cpu(cpu, task_isolation_map);
-}
-
/* Enable stack backtraces of any interrupts of task_isolation cores. */
static bool task_isolation_debug;
static int __init task_isolation_debug_func(char *str)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c026145eba2f..91928a6afd81 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -23,6 +23,7 @@
#include <linux/sched/clock.h>
#include <linux/sched/stat.h>
#include <linux/sched/nohz.h>
+#include <linux/isolation.h>
#include <linux/module.h>
#include <linux/irq_work.h>
#include <linux/posix-timers.h>
@@ -242,12 +243,12 @@ static void tick_nohz_full_kick(void)
}

/*
- * Kick the CPU if it's full dynticks in order to force it to
+ * Kick the CPU if it's full dynticks and not isolated in order to force it to
* re-evaluate its dependency on the tick and restart it if necessary.
*/
void tick_nohz_full_kick_cpu(int cpu)
{
- if (!tick_nohz_full_cpu(cpu))
+ if (!tick_nohz_full_cpu(cpu) || is_isolation_cpu(cpu))
return;

irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
--
2.17.1

2018-07-20 17:25:02

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH] nohz: don't kick non-idle CPUs in tick_nohz_full_kick_cpu()

On Thu, 19 Jul 2018, Yury Norov wrote:
> While here. I just wonder, on my system IRQs are sent to nohz_full CPUs
> at every incoming ssh connection. The trace is like this:
> [ 206.835533] Call trace:
> [ 206.848411] [<ffff00000889f984>] dump_stack+0x84/0xa8
> [ 206.853455] [<ffff0000081ea308>] _task_isolation_remote+0x130/0x140
> [ 206.859714] [<ffff0000081bf5ec>] irq_work_queue_on+0xcc/0xfc
> [ 206.865365] [<ffff0000081478ac>] tick_nohz_full_kick_cpu+0x88/0x94
> [ 206.871536] [<ffff000008147930>] tick_nohz_dep_set_all+0x78/0xa8
> [ 206.877533] [<ffff000008147b58>] tick_nohz_dep_set_signal+0x28/0x34
> [ 206.883792] [<ffff0000081421fc>] set_process_cpu_timer+0xd0/0x128
> [ 206.889876] [<ffff0000081422ac>] update_rlimit_cpu+0x58/0x7c
> [ 206.895528] [<ffff0000083aa3d0>] selinux_bprm_committing_creds+0x180/0x1fc
> [ 206.902394] [<ffff00000839e394>] security_bprm_committing_creds+0x40/0x5c
> [ 206.909173] [<ffff00000828c4a0>] install_exec_creds+0x20/0x6c
> [ 206.914911] [<ffff0000082e15b0>] load_elf_binary+0x368/0xbb8
> [ 206.920561] [<ffff00000828d09c>] search_binary_handler+0xb8/0x224
> [ 206.926645] [<ffff00000828d99c>] do_execveat_common+0x44c/0x5f0
> [ 206.932555] [<ffff00000828db78>] do_execve+0x38/0x44
> [ 206.937510] [<ffff00000828dd74>] SyS_execve+0x34/0x44
>
> I suspect that scp, ssh tunneling and similar network activities will source
> ticks on nohz_full CPUs as well. On high-loaded server it may generate
> significant interrupt traffic on nohz_full CPUs. Is it desirable behavior?

Supsicions and desirable are not really technical interesting aspects.

Just from looking at the stack trace it's obvious that there is a CPU TIME
rlimit on that newly spawned sshd. That's nothing what the kernel
imposes. That's what user space sets.

Now the actual mechanism which does that, i.e. set_process_cpu_timer() ends
up IPI'ing _ALL_ nohz full CPUs for no real good reason. In the exec path
this is really pointless because the new process is not running yet and it
is single threaded. So forcing a IPI to all cpus is pretty pointless.

In fact the state of the task/process for which update_rlimit_cpu(() is
called is known, so the IPI can really be either avoided completely or
restricted to the CPUs on which this process can run or actually runs.

Fredric?

Thanks,

tglx

2018-07-23 13:17:44

by Frederic Weisbecker

[permalink] [raw]

Subject: Re: [PATCH] nohz: don't kick non-idle CPUs in tick_nohz_full_kick_cpu()

On Fri, Jul 20, 2018 at 07:24:00PM +0200, Thomas Gleixner wrote:
> On Thu, 19 Jul 2018, Yury Norov wrote:
> > While here. I just wonder, on my system IRQs are sent to nohz_full CPUs
> > at every incoming ssh connection. The trace is like this:
> > [ 206.835533] Call trace:
> > [ 206.848411] [<ffff00000889f984>] dump_stack+0x84/0xa8
> > [ 206.853455] [<ffff0000081ea308>] _task_isolation_remote+0x130/0x140
> > [ 206.859714] [<ffff0000081bf5ec>] irq_work_queue_on+0xcc/0xfc
> > [ 206.865365] [<ffff0000081478ac>] tick_nohz_full_kick_cpu+0x88/0x94
> > [ 206.871536] [<ffff000008147930>] tick_nohz_dep_set_all+0x78/0xa8
> > [ 206.877533] [<ffff000008147b58>] tick_nohz_dep_set_signal+0x28/0x34
> > [ 206.883792] [<ffff0000081421fc>] set_process_cpu_timer+0xd0/0x128
> > [ 206.889876] [<ffff0000081422ac>] update_rlimit_cpu+0x58/0x7c
> > [ 206.895528] [<ffff0000083aa3d0>] selinux_bprm_committing_creds+0x180/0x1fc
> > [ 206.902394] [<ffff00000839e394>] security_bprm_committing_creds+0x40/0x5c
> > [ 206.909173] [<ffff00000828c4a0>] install_exec_creds+0x20/0x6c
> > [ 206.914911] [<ffff0000082e15b0>] load_elf_binary+0x368/0xbb8
> > [ 206.920561] [<ffff00000828d09c>] search_binary_handler+0xb8/0x224
> > [ 206.926645] [<ffff00000828d99c>] do_execveat_common+0x44c/0x5f0
> > [ 206.932555] [<ffff00000828db78>] do_execve+0x38/0x44
> > [ 206.937510] [<ffff00000828dd74>] SyS_execve+0x34/0x44
> >
> > I suspect that scp, ssh tunneling and similar network activities will source
> > ticks on nohz_full CPUs as well. On high-loaded server it may generate
> > significant interrupt traffic on nohz_full CPUs. Is it desirable behavior?
>
> Supsicions and desirable are not really technical interesting aspects.
>
> Just from looking at the stack trace it's obvious that there is a CPU TIME
> rlimit on that newly spawned sshd. That's nothing what the kernel
> imposes. That's what user space sets.
>
> Now the actual mechanism which does that, i.e. set_process_cpu_timer() ends
> up IPI'ing _ALL_ nohz full CPUs for no real good reason. In the exec path
> this is really pointless because the new process is not running yet and it
> is single threaded. So forcing a IPI to all cpus is pretty pointless.
>
> In fact the state of the task/process for which update_rlimit_cpu(() is
> called is known, so the IPI can really be either avoided completely or
> restricted to the CPUs on which this process can run or actually runs.
>
> Fredric?

Indeed, so far the tick dependency code is lazy and IPIs everywhere when we
add either a thread or a process timer.

We want to make sure that any thread target, running somewhere without a tick,
sees the new tick dependency.

So in the case a single thread, I can easily fix that and IPI the CPU it's running
in if any. In the case of a thread group, I'm concerned about the performance
penalty to walk through each of them and IPI only those running. But probably we'll
have to come into that in the end.

Thanks.