I am not sure what to call this kernel option, but we need something
like it. I see drivers and the kernel spawning processes on the nohz
cores. The name kthread does not really capture the purpose.
os_cpus=? highlatency_cpus=?
Subject: Restrict kernel spawning of threads to a specified set of cpus.
Currently the kernel by default allows kernel threads to be spawned on
any cpu. This is a problem for low latency applications that want to
avoid OS actions on specific processors.
Add a kernel option that restricts kthread and usermode spawning
to a specific set of processors. Also set the affinity of
init by default to the restricted set, since we certainly do not
want userspace daemons etc. to be started there either.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux/include/linux/cpumask.h
===================================================================
--- linux.orig/include/linux/cpumask.h 2013-09-05 14:55:32.033229179 -0500
+++ linux/include/linux/cpumask.h 2013-09-05 14:55:32.021229296 -0500
@@ -44,6 +44,7 @@ extern int nr_cpu_ids;
* cpu_present_mask - has bit 'cpu' set iff cpu is populated
* cpu_online_mask - has bit 'cpu' set iff cpu available to scheduler
* cpu_active_mask - has bit 'cpu' set iff cpu available to migration
+ * cpu_kthread_mask - has bit 'cpu' set iff general kernel threads allowed
*
* If !CONFIG_HOTPLUG_CPU, present == possible, and active == online.
*
@@ -80,6 +81,7 @@ extern const struct cpumask *const cpu_p
extern const struct cpumask *const cpu_online_mask;
extern const struct cpumask *const cpu_present_mask;
extern const struct cpumask *const cpu_active_mask;
+extern const struct cpumask *const cpu_kthread_mask;
#if NR_CPUS > 1
#define num_online_cpus() cpumask_weight(cpu_online_mask)
Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c 2013-09-05 14:55:32.033229179 -0500
+++ linux/init/main.c 2013-09-05 14:55:32.025229258 -0500
@@ -882,6 +882,7 @@ static noinline void __init kernel_init_
do_basic_setup();
+ set_cpus_allowed_ptr(current, cpu_kthread_mask);
/* Open the /dev/console on the rootfs, this should never fail */
if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
pr_err("Warning: unable to open an initial console.\n");
Index: linux/kernel/cpu.c
===================================================================
--- linux.orig/kernel/cpu.c 2013-09-05 14:55:32.033229179 -0500
+++ linux/kernel/cpu.c 2013-09-05 14:55:32.025229258 -0500
@@ -677,6 +677,19 @@ static DECLARE_BITMAP(cpu_active_bits, C
const struct cpumask *const cpu_active_mask = to_cpumask(cpu_active_bits);
EXPORT_SYMBOL(cpu_active_mask);
+static DECLARE_BITMAP(cpu_kthread_bits, CONFIG_NR_CPUS) __read_mostly
+ = CPU_BITS_ALL;
+const struct cpumask *const cpu_kthread_mask = to_cpumask(cpu_kthread_bits);
+EXPORT_SYMBOL(cpu_kthread_mask);
+
+static int __init kthread_setup(char *str)
+{
+ cpulist_parse(str, (struct cpumask *)&cpu_kthread_bits);
+ return 1;
+}
+__setup("kthread=", kthread_setup);
+
+
void set_cpu_possible(unsigned int cpu, bool possible)
{
if (possible)
Index: linux/kernel/kthread.c
===================================================================
--- linux.orig/kernel/kthread.c 2013-09-05 14:55:32.033229179 -0500
+++ linux/kernel/kthread.c 2013-09-05 14:55:32.025229258 -0500
@@ -282,7 +282,7 @@ struct task_struct *kthread_create_on_no
* The kernel thread should not inherit these properties.
*/
sched_setscheduler_nocheck(create.result, SCHED_NORMAL, &param);
- set_cpus_allowed_ptr(create.result, cpu_all_mask);
+ set_cpus_allowed_ptr(create.result, cpu_kthread_mask);
}
return create.result;
}
@@ -450,7 +450,7 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
- set_cpus_allowed_ptr(tsk, cpu_all_mask);
+ set_cpus_allowed_ptr(tsk, cpu_kthread_mask);
set_mems_allowed(node_states[N_MEMORY]);
current->flags |= PF_NOFREEZE;
Index: linux/Documentation/kernel-parameters.txt
===================================================================
--- linux.orig/Documentation/kernel-parameters.txt 2013-09-05 14:55:32.033229179 -0500
+++ linux/Documentation/kernel-parameters.txt 2013-09-05 14:58:38.839366991 -0500
@@ -1400,6 +1400,16 @@ bytes respectively. Such letter suffixes
kstack=N [X86] Print N words from the kernel stack
in oops dumps.
+ kthread= [KNL, SMP] Only run kernel threads on the specified
+ list of processors. The kernel will start threads
+ on the indicated processors only (unless there
+ are specific reasons to run a thread with
+ different affinities). This can be used to make
+ init start on certain processors and also to
+ control where kmod and other user space threads
+ are being spawned. Allows keeping kernel threads
+ away from certain cores unless absolutely necessary.
+
kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs.
Default is 0 (don't ignore, but inject #GP)
Index: linux/kernel/kmod.c
===================================================================
--- linux.orig/kernel/kmod.c 2013-09-05 14:55:24.000000000 -0500
+++ linux/kernel/kmod.c 2013-09-05 14:56:29.412657249 -0500
@@ -209,8 +209,8 @@ static int ____call_usermodehelper(void
flush_signal_handlers(current, 1);
spin_unlock_irq(&current->sighand->siglock);
- /* We can run anywhere, unlike our parent keventd(). */
- set_cpus_allowed_ptr(current, cpu_all_mask);
+ /* We can run only where init is allowed to run. */
+ set_cpus_allowed_ptr(current, cpu_kthread_mask);
/*
* Our parent is keventd, which runs with elevated scheduling priority.
Hi,
On Thu, Sep 5, 2013 at 11:07 PM, Christoph Lameter <[email protected]> wrote:
> I am not sure how to call this kernel option but we need something like
> that. I see drivers and the kernel spawning processes on the nohz cores.
> The name kthread is not really catching the purpose.
>
> os_cpus=? highlatency_cpus=?
>
First off, thank you for doing this. It is very useful :-)
Currently if one wishes to run a single task on an isolated CPU with
as little interference as possible, one needs to pass the
rcu_nocbs, isolcpus and nohz_full parameters, and now the kthread
parameter, all pretty much with the same values.
I know some people won't like this, but can we perhaps fold all these
into a single parameter, perhaps even the existing isolcpus?
Thanks,
Gilad
--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
-- Jean-Baptiste Queru
On Tue, 2013-09-10 at 09:05 +0300, Gilad Ben-Yossef wrote:
> Hi,
>
> On Thu, Sep 5, 2013 at 11:07 PM, Christoph Lameter <[email protected]> wrote:
> > I am not sure how to call this kernel option but we need something like
> > that. I see drivers and the kernel spawning processes on the nohz cores.
> > The name kthread is not really catching the purpose.
> >
> > os_cpus=? highlatency_cpus=?
> >
>
> First off, thank you for doing this. It is very useful :-)
>
> Currently if one wishes to run a single task on an isolated CPU with
> as little interference as possible, one needs to pass
> rcu_nocbs, isolcpus, nohz_full parameters and now kthread parameter,
> all pretty much with the same values
>
> I know some people won't like this, but can we perhaps fold all these
> into a single parameter, perhaps even the existing isolcpus?
isolcpus is supposed to go away, as cpusets can isolate CPUs, and can
turn off load balancing.
-Mike
Hi,
On Tue, Sep 10, 2013 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
>
> On Tue, 2013-09-10 at 09:05 +0300, Gilad Ben-Yossef wrote:
> > Hi,
> >
> > On Thu, Sep 5, 2013 at 11:07 PM, Christoph Lameter <[email protected]> wrote:
> > > I am not sure how to call this kernel option but we need something like
> > > that. I see drivers and the kernel spawning processes on the nohz cores.
> > > The name kthread is not really catching the purpose.
> > >
> > > os_cpus=? highlatency_cpus=?
> > >
> >
> > First off, thank you for doing this. It is very useful :-)
> >
> > Currently if one wishes to run a single task on an isolated CPU with
> > as little interference as possible, one needs to pass
> > rcu_nocbs, isolcpus, nohz_full parameters and now kthread parameter,
> > all pretty much with the same values
> >
> > I know some people won't like this, but can we perhaps fold all these
> > into a single parameter, perhaps even the existing isolcpus?
>
> isolcpus is supposed to go away, as cpusets can isolate CPUs, and can
> turn off load balancing.
>
And I'm all for that. I think cpusets is a much more elegant solution.
But... AFAIK currently cpusets cannot migrate timers that were
registered on a cpu prior to it being isolated via cpuset, designate
RCU-offloaded CPUs, set cpus as full nohz capable, or - it seems from
this patch - keep certain kernel threads off a cpu.
This is no fault of cpusets, but it still means there are workloads
that it can't support at this time.
So long as we must have a kernel boot option, I prefer to have one and
not four of them. Think of it this way - when we put all these
capabilities into cpusets, we'll have just one kernel option to kill
and not four.
Does that make sense?
Gilad
>
> -Mike
>
--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
-- Jean-Baptiste Queru
On Tue, 2013-09-10 at 09:59 +0300, Gilad Ben-Yossef wrote:
> Hi,
>
>
> On Tue, Sep 10, 2013 at 9:47 AM, Mike Galbraith <[email protected]> wrote:
> >
> > On Tue, 2013-09-10 at 09:05 +0300, Gilad Ben-Yossef wrote:
> > > Hi,
> > >
> > > On Thu, Sep 5, 2013 at 11:07 PM, Christoph Lameter <[email protected]> wrote:
> > > > I am not sure how to call this kernel option but we need something like
> > > > that. I see drivers and the kernel spawning processes on the nohz cores.
> > > > The name kthread is not really catching the purpose.
> > > >
> > > > os_cpus=? highlatency_cpus=?
> > > >
> > >
> > > First off, thank you for doing this. It is very useful :-)
> > >
> > > Currently if one wishes to run a single task on an isolated CPU with
> > > as little interference as possible, one needs to pass
> > > rcu_nocbs, isolcpus, nohz_full parameters and now kthread parameter,
> > > all pretty much with the same values
> > >
> > > I know some people won't like this, but can we perhaps fold all these
> > > into a single parameter, perhaps even the existing isolcpus?
> >
> > isolcpus is supposed to go away, as cpusets can isolate CPUs, and can
> > turn off load balancing.
> >
>
> And I'm all for that. I think cpusets is a much more elegant solution.
>
> But... AFAIK currently cpusets cannot migrate timers that were registered on
> a cpu prior to it being isolated via cpuset, designate RCU off loaded CPUs or
> sets cpus as full nohz capable, or - it seems from this patch, keep off certain
> kernel thread off a cpu.
>
> This is no fault of cpusets, but it still means there are work loads
> that it can't
> support at this time.
>
> So long as we must have a kernel boot option, I prefer to have one
> and not four of
> them. Think of it this way - when we put all these capabilities into
> cpusets, we'll have
> just one kernel option to kill and not four.
>
> Does that makes sense?
Hammering on the wrong spot makes removing isolcpus take longer, and
adds up to more hammering in the long run, no? Hearing you mention
isolcpus, I just thought I should mention that it wants to go away, so
might not be the optimal spot for isolation related tinkering.
-Mike
On Tue, Sep 10, 2013 at 10:26 AM, Mike Galbraith <[email protected]> wrote:
>
> Hammering on the wrong spot makes removing isolcpus take longer, and
> adds up to more hammering in the long run, no? Hearing you mention
> isolcpus, I just thought I should mention that it wants to go away, so
> might not be the optimal spot for isolation related tinkering.
OK, so I'll bite - isolcpus currently has special magic to do its
thing, but AFAIK part of the reason isolcpus works "better" (for some
definition of better, for some workloads) is simply because it blocks
migration earlier than you get with cpusets.
What if we re-did the implementation of isolcpus as creating a
cpuset with migration off as early as possible in the boot process,
prior to spawning init?
So basically, isolcpus becomes just a way to configure a cpuset early?
Gilad
--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
-- Jean-Baptiste Queru
On Tue, 2013-09-10 at 10:56 +0300, Gilad Ben-Yossef wrote:
> What if we re-did the implementation of isolcpu as creating an
> cpuset with migration off as early as possible in the boot process, prior to
> spawning init?
>
> So basically, isolcpus becomes just a way to configure a cpuset early?
Makes perfect sense to me.
-Mike
On 09/05/2013 03:07:37 PM, Christoph Lameter wrote:
> I am not sure how to call this kernel option but we need something
> like
> that. I see drivers and the kernel spawning processes on the nohz
> cores.
> The name kthread is not really catching the purpose.
Can't you just use the CPU affinity of PID 1 for this? Since it's a
process that's always there and already has a mask and all. No need for
a new interface...
Rob
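Rob's suggestion can already be poked at from userspace; a minimal sketch, assuming procfs and util-linux's `taskset` are available (the cpulist `0-3` is just an example):

```shell
# Read init's current affinity via procfs:
grep Cpus_allowed_list /proc/1/status

# Changing it needs root, e.g.:
#   taskset -pc 0-3 1
# but by that time init may already have spawned children carrying
# the old (unrestricted) mask.
```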
On Tue, 10 Sep 2013, Gilad Ben-Yossef wrote:
> as little interference as possible, one needs to pass
> rcu_nocbs, isolcpus, nohz_full parameters and now kthread parameter,
> all pretty much with the same values
>
> I know some people won't like this, but can we perhaps fold all these
> into a single parameter, perhaps even the existing isolcpus?
I have made similar suggestions before. Maybe even autoconfigure the
whole thing? Dedicate the first processor on each NUMA node to
high-latency OS tasks and keep the rest as noise-free as possible?
On Tue, 10 Sep 2013, Gilad Ben-Yossef wrote:
> On Tue, Sep 10, 2013 at 10:26 AM, Mike Galbraith <[email protected]> wrote:
>
> >
> > Hammering on the wrong spot makes removing isolcpus take longer, and
> > adds up to more hammering in the long run, no? Hearing you mention
> > isolcpus, I just thought I should mention that it wants to go away, so
> > might not be the optimal spot for isolation related tinkering.
>
>
> OK, so I'll bite - isolcpu currently has special magic to do its thing but AFAIK
> part of the reason isolcpu works "better" (for some definition of
> better, for some
> work loads) is simply because it blocks migration earlier than you get with
> cpusets.
>
> What if we re-did the implementation of isolcpu as creating an
> cpuset with migration off as early as possible in the boot process, prior to
> spawning init?
>
> So basically, isolcpus becomes just a way to configure a cpuset early?
I surely wish we had the ability to use tickless without the need for
things like cpusets etc.
isolcpus is broken as far as I can tell. Let's lay it to rest and come
up with a sane way to configure these things. Autoconfig if possible.
On Tue, 10 Sep 2013, Rob Landley wrote:
> On 09/05/2013 03:07:37 PM, Christoph Lameter wrote:
> > I am not sure how to call this kernel option but we need something like
> > that. I see drivers and the kernel spawning processes on the nohz cores.
> > The name kthread is not really catching the purpose.
>
> Can't you just use the CPU affinity of PID 1 for this? Since it's a process
> that's always there and already has a mask and all. No need for a new
> interface...
How would you set the affinity of pid 1 before init starts spawning
threads?
On Tue, 2013-09-10 at 21:10 +0000, Christoph Lameter wrote:
> On Tue, 10 Sep 2013, Gilad Ben-Yossef wrote:
>
> > On Tue, Sep 10, 2013 at 10:26 AM, Mike Galbraith <[email protected]> wrote:
> >
> > >
> > > Hammering on the wrong spot makes removing isolcpus take longer, and
> > > adds up to more hammering in the long run, no? Hearing you mention
> > > isolcpus, I just thought I should mention that it wants to go away, so
> > > might not be the optimal spot for isolation related tinkering.
> >
> >
> > OK, so I'll bite - isolcpu currently has special magic to do its thing but AFAIK
> > part of the reason isolcpu works "better" (for some definition of
> > better, for some
> > work loads) is simply because it blocks migration earlier than you get with
> > cpusets.
> >
> > What if we re-did the implementation of isolcpu as creating an
> > cpuset with migration off as early as possible in the boot process, prior to
> > spawning init?
> >
> > So basically, isolcpus becomes just a way to configure a cpuset early?
>
> I surely wish we had the ability to use tickless without the need for
> things like cpusets etc.
Mind saying why? To me, creating properties of exclusive sets of CPUs
that the interface which manages sets and their properties is not fully
aware of is a dainbramaged thing to do.
-Mike
On Wed, 11 Sep 2013, Mike Galbraith wrote:
> Mind saying why? To me, creating properties of exclusive sets of CPUs
> that the interface which manages sets and their properties is not fully
> aware of is a dainbramaged thing to do.
cpusets is being replaced by cgroups. And the mechanism adds some
significant latencies to the core memory management processing path.
Also many folks in finance like to deal directly with the hardware
(processor numbers, affinity masks etc). There are already numerous
ways to specify these masks. Pretty well established. Digging down a
cpuset hierarchy is a bit tedious. Then these cpusets can also overlap,
which makes the whole setup difficult.
If cpusets can be used on top then ok but I would like it not to be
required to have that compiled in.
On Wed, 2013-09-11 at 14:21 +0000, Christoph Lameter wrote:
> On Wed, 11 Sep 2013, Mike Galbraith wrote:
>
> > Mind saying why? To me, creating properties of exclusive sets of CPUs
> > that the interface which manages sets and their properties is not fully
> > aware of is a dainbramaged thing to do.
>
> cpusets is being replaced by cgropus. And the mechanism adds some
> significant latencies to core memory management processing path.
You don't have to use or even configure in all controllers.
> Also many folks in finance like to deal directly with the hardware
> (processor numbers, affinity masks etc). There are already numerous ways
> to specify these masks. Pretty well established. Digging down a cpuset
> hierachy is a bit tedious. Then these cpusets can also overlap which
> makes the whole setup difficult.
These kinds of things have to be exclusive set attributes of course;
overlapping nohz_tick/full/off just ain't gonna work very well.
I hacked it up for my rt kernel to turn the tick on/off, and disable rt
load balancing (cpupri adds jitter) on a per exclusive set basis. The
cpuset bit is easy. Connecting buttons to scheduler and whatnot can
make cute little "You'd better not EVER submit this" warts though :)
> If cpusets can be used on top then ok but I would like it not to be
> required to have that compiled in.
IMHO, it makes much more sense to unify set attributes in cpusets,
fixing up or griping about whatever annoys HPC boxen/folks.
But whatever, I only piped in to mention that isolcpus wants to die, and
I've done that, so I can pipe-down now.
-Mike
Here is a draft of a patch to do autoconfig if CONFIG_NO_HZ_FULL_ALL is
set.
Subject: Simple autoconfig for tickless system
This is on top of the prior patch that restricts the cpus that
kthreadd can spawn kernel threads on.
It ensures that one processor per node is kept in regular
HZ mode and also adds that cpu to cpu_kthread_mask so that
OS services (like kswapd etc) can run.
On a two node system two processors will be available for kthread and OS services.
The rest will be tickless and kept as free from OS services as possible.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c 2013-09-05 09:10:59.000000000 -0500
+++ linux/kernel/time/tick-sched.c 2013-09-11 11:46:59.387888072 -0500
@@ -330,7 +330,30 @@ static int tick_nohz_init_all(void)
}
err = 0;
cpumask_setall(tick_nohz_full_mask);
+
+ /* Exempt boot processor and use it for OS services */
cpumask_clear_cpu(smp_processor_id(), tick_nohz_full_mask);
+ cpumask_set_cpu(smp_processor_id(), (struct cpumask *)cpu_kthread_mask);
+
+ /* And one processor for each NUMA node */
+ for_each_node(node) {
+ const struct cpumask *m = cpumask_of_node(node);
+
+ /* Boot node ? */
+ if (node == numa_node_id())
+ continue;
+
+ /*
+ * Exempt the first processor on each node that has
+ * processors available.
+ */
+ if (cpumask_weight(m)) {
+ int cpu = cpumask_first(m);
+
+ cpumask_clear_cpu(cpu, tick_nohz_full_mask);
+ cpumask_set_cpu(cpu, (struct cpumask *)cpu_kthread_mask);
+ }
+ }
tick_nohz_full_running = true;
#endif
return err;
Index: linux/kernel/cpu.c
===================================================================
--- linux.orig/kernel/cpu.c 2013-09-11 10:45:47.686052132 -0500
+++ linux/kernel/cpu.c 2013-09-11 11:49:34.122210075 -0500
@@ -682,12 +682,14 @@ static DECLARE_BITMAP(cpu_kthread_bits,
const struct cpumask *const cpu_kthread_mask = to_cpumask(cpu_kthread_bits);
EXPORT_SYMBOL(cpu_kthread_mask);
+#ifndef CONFIG_NO_HZ_FULL_ALL
static int __init kthread_setup(char *str)
{
cpulist_parse(str, (struct cpumask *)&cpu_kthread_bits);
return 1;
}
__setup("kthread=", kthread_setup);
+#endif
void set_cpu_possible(unsigned int cpu, bool possible)
On Wed, Sep 11, 2013 at 02:21:06PM +0000, Christoph Lameter wrote:
> On Wed, 11 Sep 2013, Mike Galbraith wrote:
>
> > Mind saying why? To me, creating properties of exclusive sets of CPUs
> > that the interface which manages sets and their properties is not fully
> > aware of is a dainbramaged thing to do.
>
> cpusets is being replaced by cgropus.
You are confusing me. Cpusets is a cgroups subsystem, how can it be replaced
by it?
On Thu, Sep 05, 2013 at 08:07:37PM +0000, Christoph Lameter wrote:
> I am not sure how to call this kernel option but we need something like
> that. I see drivers and the kernel spawning processes on the nohz cores.
> The name kthread is not really catching the purpose.
>
> os_cpus=? highlatency_cpus=?
>
>
> Subject: Restrict kernel spawning of threads to a specified set of cpus.
>
> Currently the kernel by default allows kernel threads to be spawned on
> any cpu. This is a problem for low latency applications that want to
> avoid Os actions on specific processors.
>
> Add a kernel option that restrict kthread and usermode spawning
> to a specific set of processors. Also sets the affinities of
> init by default to the restricted set since we certainly do not
> want userspace daemons etc to be started there either.
>
> Signed-off-by: Christoph Lameter <[email protected]>
Why not do this from userspace instead?
Thanks.
On Wed, 2013-09-11 at 23:36 +0200, Frederic Weisbecker wrote:
> On Wed, Sep 11, 2013 at 02:21:06PM +0000, Christoph Lameter wrote:
> > On Wed, 11 Sep 2013, Mike Galbraith wrote:
> >
> > > Mind saying why? To me, creating properties of exclusive sets of CPUs
> > > that the interface which manages sets and their properties is not fully
> > > aware of is a dainbramaged thing to do.
> >
> > cpusets is being replaced by cgropus.
>
> You are confusing me. Cpusets is a cgroups subsystem, how can it be replaced
> by it?
Yeah, the only irritant I know of is the cpuset API variability. It has
a backward compatibility mount option, so anything other than the user
mounting makes the API selection decision for him/her. systemd mounts
cpuset, i.e. OS component pokes OS API backward compatibility button,
breaking OS API backward compatibility for the user, who then has to
squabble with OS component over button possession if he wants his old
cpuset API using toys to continue to work.
-Mike
On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> Why not do this from userspace instead?
Because the cpumasks are hardcoded in the kernel code.
On Thu, Sep 12, 2013 at 02:10:56PM +0000, Christoph Lameter wrote:
> On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
>
> > Why not do this from userspace instead?
>
> Because the cpumasks are hardcoded in the kernel code.
>
Ok but you can change the affinity of a kthread from userspace, as
long as you define a cpu set that is among that kthread's cpus allowed.
On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> On Thu, Sep 12, 2013 at 02:10:56PM +0000, Christoph Lameter wrote:
> > On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> >
> > > Why not do this from userspace instead?
> >
> > Because the cpumasks are hardcoded in the kernel code.
> >
>
> Ok but you can change the affinity of a kthread from userspace, as
> long as you define a cpu set that is among that kthread's cpus allowed.
Ok but at that point kthreadd has already spawned a lot of kernel
threads.
The same is true for init and kmod.
On Thu, Sep 12, 2013 at 02:22:43PM +0000, Christoph Lameter wrote:
> On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
>
> > On Thu, Sep 12, 2013 at 02:10:56PM +0000, Christoph Lameter wrote:
> > > On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> > >
> > > > Why not do this from userspace instead?
> > >
> > > Because the cpumasks are hardcoded in the kernel code.
> > >
> >
> > Ok but you can change the affinity of a kthread from userspace, as
> > long as you define a cpu set that is among that kthread's cpus allowed.
>
> Ok but at that point kthread has already spawned a lot of kernel threads.
>
> The same is true for init and kmod.
>
Ok but then we just need to set the affinity of all these kthreads.
A simple lookup on /proc/[0-9]+/ should do the trick.
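Frederic's /proc walk could be sketched as below; the `reaffine_all` helper and its dry-run behaviour are made up for illustration, and actually applying the masks needs root plus util-linux's `taskset`:

```shell
# Print (dry-run) the taskset invocations that would pin every task
# currently visible in /proc to the given housekeeping cpulist.
reaffine_all() {
    cpus="$1"
    for d in /proc/[0-9]*; do
        pid="${d#/proc/}"
        # The real version would run: taskset -pc "$cpus" "$pid"
        echo "taskset -pc $cpus $pid"
    done
}

reaffine_all 0
```

As Paul and Christoph point out later in the thread, this is inherently racy: any task forked after the scan is missed.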
On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> > > Ok but you can change the affinity of a kthread from userspace, as
> > > long as you define a cpu set that is among that kthread's cpus allowed.
> >
> > Ok but at that point kthread has already spawned a lot of kernel threads.
> >
> > The same is true for init and kmod.
> >
>
> Ok but then we just need to set the affinity of all these kthreads.
> A simple lookup on /proc/[0-9]+/ should do the trick.
Yea but the kernel option makes it easy. No extras needed. The kernel
brings up user space cleanly configured and ready to go.
This also allows us to clean up kernel uses of cpumasks in such a way
that proper thread placement for various other uses (reclaim, f.e.
kswapd) is possible.
On Thu, Sep 12, 2013 at 02:52:56PM +0000, Christoph Lameter wrote:
> On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
>
> > > > Ok but you can change the affinity of a kthread from userspace, as
> > > > long as you define a cpu set that is among that kthread's cpus allowed.
> > >
> > > Ok but at that point kthread has already spawned a lot of kernel threads.
> > >
> > > The same is true for init and kmod.
> > >
> >
> > Ok but then we just need to set the affinity of all these kthreads.
> > A simple lookup on /proc/[0-9]+/ should do the trick.
>
> Yea but the kernel option makes it easy. No extras needed. Kernel brings
> it up user space cleanly configured and ready to go.
Ok but really that's just two lines of bash. I really wish we didn't
complicate core kernel code for that.
I think we all agree that the big issue here is that CPU isolation
requires setting up a fragmented set of features and it's not at all
obvious to do it correctly: full dynticks, rcu nocbs, kthread affinity,
timer_list, hrtimers, workqueues, IPIs, etc...
So IMHO what is missing is a reliable userspace tool that can handle
all that: do the checks on pre-requirements, handle kthread and even
user task affinity, tweak some sysctl stuff to turn off features that
generate noise, etc...
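One pre-flight check such a tool would need is verifying that the isolation-related boot parameters were all given the same cpulist. The parameter names below are real, but `check_isolation_cmdline` is purely a hypothetical sketch:

```shell
# Report any isolation-related boot parameter that is missing or whose
# cpulist disagrees with the others.
check_isolation_cmdline() {
    cmdline="$1"
    ref=""
    for key in nohz_full rcu_nocbs isolcpus kthread; do
        val=$(echo "$cmdline" | tr ' ' '\n' | sed -n "s/^$key=//p")
        [ -z "$val" ] && { echo "missing: $key"; continue; }
        [ -z "$ref" ] && ref="$val"
        [ "$val" = "$ref" ] || echo "mismatch: $key=$val (expected $ref)"
    done
}

# In real use: check_isolation_cmdline "$(cat /proc/cmdline)"
check_isolation_cmdline "quiet nohz_full=1-7 rcu_nocbs=1-7 isolcpus=1-7"
# prints: missing: kthread
```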
> This also allows us to cleanup kernel uses of cpumasks in such a way that
> proper thread placement for various other uses (reclaim f.e. kswpad) is
> possible.
Same here, a central tool should be able to solve that.
On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> > Yea but the kernel option makes it easy. No extras needed. Kernel brings
> > it up user space cleanly configured and ready to go.
>
> Ok but really that's just two lines of bash. I really wish we don't complicate
> core kernel code for that.
Thread placement is an issue in general for the future. The more
hardware threads we get, the more aware of thread placement we need to
become, because caches become more important for performance.
Disturbing the cache of another thread is significant, as is moving a
thread away from its default placement, because memory accesses will
have to be done again.
> > This also allows us to cleanup kernel uses of cpumasks in such a way that
> > proper thread placement for various other uses (reclaim f.e. kswpad) is
> > possible.
>
> Same here, a central tool should be able to solve that.
I think this is something that belongs in the kernel, under the
consideration of the developers. The user space scripts that I have
seen are not that clean and they are strongly kernel version dependent.
On Thu, Sep 12, 2013 at 05:11:04PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 12, 2013 at 02:52:56PM +0000, Christoph Lameter wrote:
> > On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> >
> > > > > Ok but you can change the affinity of a kthread from userspace, as
> > > > > long as you define a cpu set that is among that kthread's cpus allowed.
> > > >
> > > > Ok but at that point kthread has already spawned a lot of kernel threads.
> > > >
> > > > The same is true for init and kmod.
> > > >
> > >
> > > Ok but then we just need to set the affinity of all these kthreads.
> > > A simple lookup on /proc/[0-9]+/ should do the trick.
> >
> > Yea but the kernel option makes it easy. No extras needed. Kernel brings
> > it up user space cleanly configured and ready to go.
>
> Ok but really that's just two lines of bash. I really wish we don't complicate
> core kernel code for that.
OK, I will bite... How do you handle the case where you have collected
all the kthreads, one of the kthreads spawns another kthread, then you
set affinity on the collected kthreads, which does not include the newly
spawned one?
Thanx, Paul
> I think we all agree that the big issue here is that CPU isolation requires to set up
> a fragmented set of features and it's not at all obvious to do it correctly: full dynticks,
> rcu nocbs, kthreads affinity, timer_list, hrtimers, workqueues, IPIs, etc...
>
> So IMHO what is missing is a reliable userspace tool that can handle all that: do
> the checks on pre-requirements, handle the kthreads and even user task affinity, tweak
> some sysctl stuffs to turn off features that generate noise, etc...
>
> > This also allows us to cleanup kernel uses of cpumasks in such a way that
> > proper thread placement for various other uses (reclaim f.e. kswpad) is
> > possible.
>
> Same here, a central tool should be able to solve that.
>
Let me just say that the user space approach does not work because the
kernel sets the cpumask to all and then spawns a thread, f.e. for
usermodehelper.
This means we would have to run a daemon that keeps scanning for errant
threads and then moves them. But at that point the damage would already
have been done. Short-lived threads would never be caught.
So I think the kernel based approach is unavoidable.
Look at this in kernel/kmod.c:
static int ____call_usermodehelper(void *data)
{
struct subprocess_info *sub_info = data;
struct cred *new;
int retval;
spin_lock_irq(&current->sighand->siglock);
flush_signal_handlers(current, 1);
spin_unlock_irq(&current->sighand->siglock);
/* We can run anywhere, unlike our parent keventd(). */
set_cpus_allowed_ptr(current, cpu_all_mask);
!!!!! No chance to catch this from user space.
....
retval = do_execve(sub_info->path,
(const char __user *const __user *)sub_info->argv,
(const char __user *const __user *)sub_info->envp);
if (!retval)
....
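For concreteness, the userspace daemon approach being dismissed here would look roughly like the sketch below (the helper names and the empty-cmdline heuristic for spotting kernel threads are mine, not from any posted patch). It also illustrates the race: anything forked after the /proc scan completes, like the usermodehelper above, is simply missed.

```c
/* Sketch of the racy userspace approach: walk /proc, treat tasks with an
 * empty cmdline as kernel threads, and push them onto a housekeeping mask.
 * Anything forked after the scan is missed, and per-cpu kthreads refuse
 * the affinity change anyway. */
#define _GNU_SOURCE
#include <sched.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static int is_kernel_thread(pid_t pid)
{
	char path[64], byte;
	FILE *f;
	size_t n;

	snprintf(path, sizeof(path), "/proc/%d/cmdline", pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	n = fread(&byte, 1, 1, f);	/* kernel threads have an empty cmdline */
	fclose(f);
	return n == 0;
}

/* Returns the number of kernel threads successfully re-affined, or -1. */
int reaffine_kernel_threads(const cpu_set_t *housekeeping)
{
	DIR *d = opendir("/proc");
	struct dirent *e;
	int moved = 0;

	if (!d)
		return -1;
	while ((e = readdir(d))) {
		if (!isdigit((unsigned char)e->d_name[0]))
			continue;
		pid_t pid = atoi(e->d_name);
		if (is_kernel_thread(pid) &&
		    sched_setaffinity(pid, sizeof(*housekeeping),
				      housekeeping) == 0)
			moved++;
		/* A kthread spawned right here is never caught. */
	}
	closedir(d);
	return moved;
}
```

Needs root to actually touch other tasks; and even run in a loop, it can never close the window.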
On Thu, Sep 12, 2013 at 03:42:21PM +0000, Christoph Lameter wrote:
> Let me just say that the user space approach does not work because the
> kernel sets the cpumask to all and then spawns a thread f.e. for
> usermodehelper.
>
> This means we would have to run a daemon that keeps scanning for errant
> threads and then moves them. But at that point the damage would already
> have been done. Short term threads would never be caught.
>
> So I think the kernel based approach is unavoidable.
>
> Look at this in kernel/kmod.c:
>
> static int ____call_usermodehelper(void *data)
> {
> struct subprocess_info *sub_info = data;
> struct cred *new;
> int retval;
>
> spin_lock_irq(&current->sighand->siglock);
> flush_signal_handlers(current, 1);
> spin_unlock_irq(&current->sighand->siglock);
>
> /* We can run anywhere, unlike our parent keventd(). */
> set_cpus_allowed_ptr(current, cpu_all_mask);
>
>
> !!!!! No chance to catch this from user space.
>
>
>
> ....
>
> retval = do_execve(sub_info->path,
> (const char __user *const __user *)sub_info->argv,
> (const char __user *const __user *)sub_info->envp);
> if (!retval)
>
>
> ....
>
Yeah, setting the threads' affinity is racy from userspace in any case. By the time
one scans /proc for tasks, some others can be forked concurrently.
So yeah it's a problem in theory. Now in practice, I have yet to be convinced because
this should be solved after a few iterations in /proc in most cases.
Now the issue doesn't only concern kthreads but all tasks in the system.
If we really want to solve that race, then maybe we can think of a kernel parameter
that sets the initial affinity of init and then let it get naturally inherited
through the whole tree.
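What Frederic describes is the ordinary fork() semantics: a restricted CPU mask is copied into every child, so clamping init (or kthreadd) once at boot would propagate down the whole tree. A minimal userspace illustration of that inheritance (the function name is mine):

```c
/* Demonstrates that a task's CPU affinity, once restricted, is inherited
 * across fork() -- the mechanism a boot-time clamp on init would rely on. */
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Returns 0 if the child inherited the parent's restricted mask. */
int child_inherits_affinity(void)
{
	cpu_set_t parent, child;
	pid_t pid;
	int status;

	CPU_ZERO(&parent);
	CPU_SET(0, &parent);		/* restrict ourselves to CPU 0 */
	if (sched_setaffinity(0, sizeof(parent), &parent))
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {			/* child: check the inherited mask */
		sched_getaffinity(0, sizeof(child), &child);
		_exit(CPU_EQUAL(&parent, &child) ? 0 : 1);
	}
	waitpid(pid, &status, 0);
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```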
On Thu, Sep 12, 2013 at 08:39:22AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 12, 2013 at 05:11:04PM +0200, Frederic Weisbecker wrote:
> > On Thu, Sep 12, 2013 at 02:52:56PM +0000, Christoph Lameter wrote:
> > > On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> > >
> > > > > > Ok but you can change the affinity of a kthread from userspace, as
> > > > > > long as you define a cpu set that is among that kthread's cpus allowed.
> > > > >
> > > > > Ok but at that point kthread has already spawned a lot of kernel threads.
> > > > >
> > > > > The same is true for init and kmod.
> > > > >
> > > >
> > > > Ok but then we just need to set the affinity of all these kthreads.
> > > > A simple lookup on /proc/[0-9]+/ should do the trick.
> > >
> > > Yea but the kernel option makes it easy. No extras needed. The kernel brings
> > > up user space cleanly configured and ready to go.
> >
> > Ok but really that's just two lines of bash. I really wish we don't complicate
> > core kernel code for that.
>
> OK, I will bite... How do you handle the case where you have collected
> all the kthreads, one of the kthreads spawns another kthread, then you
> set affinity on the collected kthreads, which does not include the newly
> spawned one?
Just offline the CPUs you want to isolate, affine your kthreads and re-online
the CPUs.
If you're lucky enough to have 1024 CPUs, a winter night should be enough ;-)
On Thu, Sep 12, 2013 at 03:32:20PM +0000, Christoph Lameter wrote:
> On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
>
> > Yea but the kernel option makes it easy. No extras needed. The kernel brings
> > up user space cleanly configured and ready to go.
> >
> > Ok but really that's just two lines of bash. I really wish we don't complicate
> > core kernel code for that.
>
> Thread placement is an issue in general for the future. The more hardware
> threads we get the more aware of thread placement we need to become
> because caches become more important for performance. Disturbing the cache
of another thread is significant, and so is moving a thread away from its
default placement, because memory accesses will have to be done again.
Sure I expect the CPU load balancer will do crazy stuff in the future with
the spread of NUMA, involving the kernel a lot in such decision making.
But although I'm no scheduler expert, I suspect this will entangle finer grained
data than a big fat kthread mask :)
>
> > > This also allows us to cleanup kernel uses of cpumasks in such a way that
> > proper thread placement for various other uses (reclaim f.e. kswapd) is
> > > possible.
> >
> > Same here, a central tool should be able to solve that.
>
> I think this is something that belongs in the kernel under consideration
> of the developers. The user space scripts that I have seen are not
that clean and they are strongly kernel version dependent.
The fact that no nice stuff has been done in userspace for this yet doesn't
mean it has to be done in the kernel.
On Thu, Sep 12, 2013 at 08:35:05PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 12, 2013 at 08:39:22AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 12, 2013 at 05:11:04PM +0200, Frederic Weisbecker wrote:
> > > On Thu, Sep 12, 2013 at 02:52:56PM +0000, Christoph Lameter wrote:
> > > > On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> > > >
> > > > > > > Ok but you can change the affinity of a kthread from userspace, as
> > > > > > > long as you define a cpu set that is among that kthread's cpus allowed.
> > > > > >
> > > > > > Ok but at that point kthread has already spawned a lot of kernel threads.
> > > > > >
> > > > > > The same is true for init and kmod.
> > > > > >
> > > > >
> > > > > Ok but then we just need to set the affinity of all these kthreads.
> > > > > A simple lookup on /proc/[0-9]+/ should do the trick.
> > > >
> > > > Yea but the kernel option makes it easy. No extras needed. The kernel brings
> > > > up user space cleanly configured and ready to go.
> > >
> > > Ok but really that's just two lines of bash. I really wish we don't complicate
> > > core kernel code for that.
> >
> > OK, I will bite... How do you handle the case where you have collected
> > all the kthreads, one of the kthreads spawns another kthread, then you
> > set affinity on the collected kthreads, which does not include the newly
> > spawned one?
>
> Just offline the CPUs you want to isolate, affine your kthreads and re-online
> the CPUs.
>
> If you're lucky enough to have 1024 CPUs, a winter night should be enough ;-)
Running at RT prio 99 to reduce the probability of respawns? ;-)
Thanx, Paul
On Thu, Sep 12, 2013 at 08:30:25PM +0200, Frederic Weisbecker wrote:
> Now the issue doesn't only concern kthreads but all tasks in the system.
No, only kernel threads, all other tasks have a parent they inherit
(namespace, cgroup, affinity etc..) context from.
> If we really want to solve that race, then may be we can think of a kernel_parameter
No bloody kernel params. I'd much rather create a pointless kthread to
act as usermodehelper parent that people can set context on (move it
into cgroups, set affinity, whatever) so it automagically propagates to
all userspace helper thingies.
Is there anything other than usermodehelper we need to be concerned
with? One that comes to mind would be unbound workqueue threads. Do we
want to share the parent with usermodehelpers or have these two classes
have different parents?
On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
> So yeah it's a problem in theory. Now in practice, I have yet to be convinced because
> this should be solved after a few iterations in /proc in most cases.
I have seen some drivers regularly spawning threads all over the machine.
This is a practical issue that I am addressing.
>
> Now the issue doesn't only concern kthreads but all tasks in the system.
> If we really want to solve that race, then maybe we can think of a kernel parameter
> that sets the initial affinity of init and then let it get naturally inherited
> through the whole tree.
This patch that we are discussing does exactly that.
On Fri, Sep 13, 2013 at 01:45:55PM +0000, Christoph Lameter wrote:
> On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
>
> > So yeah it's a problem in theory. Now in practice, I have yet to be convinced because
> > this should be solved after a few iterations in /proc in most cases.
>
> I have seen some drivers regularly spawning threads all over the machine.
> This is a practical issue that I am addressing.
> >
> > Now the issue doesn't only concern kthreads but all tasks in the system.
> > If we really want to solve that race, then maybe we can think of a kernel parameter
> > that sets the initial affinity of init and then let it get naturally inherited
> > through the whole tree.
>
> This patch that we are discussing does exactly that.
>
Indeed, I just looked at that again and your cpu_kthread_mask actually also applies to init.
cpu_init_mask would be a better name I think.
On Fri, 13 Sep 2013, Frederic Weisbecker wrote:
> Indeed, I just looked at that again and your cpu_kthread_mask actually also applies to init.
> cpu_init_mask would be a better name I think.
Yea the naming is iffy. I want to get a general direction on how we are
going to address these issues before putting more work into it. Any ideas
on how to do this in a nice way that makes it easy for everyone involved
would be appreciated.
There is a second stage to this which comes with NUMA systems. In that
case we need to have at least one processor reserved for the OS to do
reclaim and stuff like that. That is why I also posted the following patch
that amends some things. Not tested, just an idea of how to address these
issues. And it also does not do the placement of kswapd and other MM
specific threads yet.
Subject: Simple autoconfig for tickless system
This is on top of the prior patch that restricts the cpus that
kthread can spawn processes on.
It ensures that one processor per node is kept in regular
HZ mode and also adds that cpu to the kthread_mask so that
OS services (like kswapd etc) can run.
On a two node system two processors will be available for kthread and OS services.
The rest will be tickless and kept as free from OS services as possible.
Signed-off-by: Christoph Lameter <[email protected]>
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c 2013-09-05 09:10:59.000000000 -0500
+++ linux/kernel/time/tick-sched.c 2013-09-11 11:46:59.387888072 -0500
@@ -330,7 +330,30 @@ static int tick_nohz_init_all(void)
}
err = 0;
cpumask_setall(tick_nohz_full_mask);
+
+ /* Exempt boot processor and use it for OS services */
cpumask_clear_cpu(smp_processor_id(), tick_nohz_full_mask);
+	cpumask_set_cpu(smp_processor_id(), (struct cpumask *)cpu_kthread_mask);
+
+ /* And one processor for each NUMA node */
+ for_each_node(node) {
+ struct cpumask *m = cpumask_of_node(node);
+
+ /* Boot node ? */
+ if (node == numa_node_id())
+ continue;
+
+ /*
+ * Exempt the first processor on each node that has
+ * processors available.
+ */
+ if (cpumask_weight(m)) {
+ int cpu = cpumask_first(m);
+
+ cpumask_clear_cpu(cpu, tick_nohz_full_mask);
+			cpumask_set_cpu(cpu, (struct cpumask *)cpu_kthread_mask);
+ }
+ }
tick_nohz_full_running = true;
#endif
return err;
Index: linux/kernel/cpu.c
===================================================================
--- linux.orig/kernel/cpu.c 2013-09-11 10:45:47.686052132 -0500
+++ linux/kernel/cpu.c 2013-09-11 11:49:34.122210075 -0500
@@ -682,12 +682,14 @@ static DECLARE_BITMAP(cpu_kthread_bits,
const struct cpumask *const cpu_kthread_mask = to_cpumask(cpu_kthread_bits);
EXPORT_SYMBOL(cpu_kthread_mask);
+#ifndef CONFIG_NO_HZ_FULL_ALL
static int __init kthread_setup(char *str)
{
cpulist_parse(str, (struct cpumask *)&cpu_kthread_bits);
return 1;
}
__setup("kthread=", kthread_setup);
+#endif
void set_cpu_possible(unsigned int cpu, bool possible)
On Fri, Sep 13, 2013 at 01:54:53PM +0000, Christoph Lameter wrote:
>
> > If we really want to solve that race, then maybe we can think of a kernel parameter
> >
> > No bloody kernel params. I'd much rather create a pointless kthread to
> > act as usermodehelper parent that people can set context on (move it
> > into cgroups, set affinity, whatever) so it automagically propagates to
> > all userspace helper thingies.
> >
> > Is there anything other than usermodehelper we need to be concerned
> > with? One that comes to mind would be unbound workqueue threads. Do we
> > want to share the parent with usermodehelpers or have these two classes
> > have different parents?
>
> So you want to keep those silly racy move-all-threads-to-some-cpus scripts
> around?
No, creating a parent for them closes the race. It should also make it
lots easier to find the kids by using ppid.
> A kernel parameter would allow a clean bootup with threads
> starting out on the specific processors we want them to.
Blergh, no. A kernel should boot, a kernel should allow you to configure
things, a kernel should not be limited to boot time settings.
> Also there is even more work ahead to deal with things like kswapd,
> writeback threads, compaction and various other scanners that should also
> be restricted. Mostly one thread per node is sufficient. This is not
> simple to do from user space.
IIRC we have one kswapd per node, not sure about the others. And why is
this not simple from userspace? All these are long-running threads and
from a quick look they can have their affinity changed.
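Finding the kids by ppid, as Peter suggests, is easy enough from /proc; a sketch (helper name mine). The only subtlety in parsing /proc/&lt;pid&gt;/stat is skipping past the parenthesised comm, which may itself contain spaces or parentheses:

```c
/* Look up a task's parent pid from field 4 of /proc/<pid>/stat.
 * The comm in field 2 can contain spaces and ')' characters, so scan
 * from the last ')' rather than splitting on whitespace. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

pid_t ppid_of(pid_t pid)
{
	char path[64], buf[512], *p;
	FILE *f;
	size_t n;
	int ppid;

	snprintf(path, sizeof(path), "/proc/%d/stat", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	p = strrchr(buf, ')');		/* end of the comm field */
	if (!p || sscanf(p + 1, " %*c %d", &ppid) != 1)
		return -1;		/* %*c skips the state character */
	return (pid_t)ppid;
}
```

With a dedicated usermodehelper parent, enumerating its children would then be a single pass over /proc comparing ppid_of() against the parent's pid.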
> > If we really want to solve that race, then maybe we can think of a kernel parameter
>
> No bloody kernel params. I'd much rather create a pointless kthread to
> act as usermodehelper parent that people can set context on (move it
> into cgroups, set affinity, whatever) so it automagically propagates to
> all userspace helper thingies.
>
> Is there anything other than usermodehelper we need to be concerned
> with? One that comes to mind would be unbound workqueue threads. Do we
> want to share the parent with usermodehelpers or have these two classes
> have different parents?
So you want to keep those silly racy move-all-threads-to-some-cpus scripts
around? A kernel parameter would allow a clean bootup with threads
starting out on the specific processors we want them to.
Also there is even more work ahead to deal with things like kswapd,
writeback threads, compaction and various other scanners that should also
be restricted. Mostly one thread per node is sufficient. This is not
simple to do from user space.
Hmmm... usermodehelper is based on workqueues. I guess this will
ultimately come down to modifying the workqueue behavior for
WORK_CPU_UNBOUND?
If WORK_CPU_UNBOUND could mean to limit process execution to the affinity
of kthreadd then we are fine.
That would also benefit many other workqueue events that may otherwise
disturb tickless cpus.
On Fri, 13 Sep 2013, Peter Zijlstra wrote:
> No, creating a parent for them closes the race. It should also make it
> lots easier to find the kids by using ppid.
Ok if all spawning is done from kthreadd then that works.
> > A kernel parameter would allow a clean bootup with threads
> > starting out on the specific processors we want them to.
>
> Blergh, no. A kernel should boot, a kernel should allow you to configure
> things, a kernel should not be limited to boot time settings.
The kernel is not limited but can decide where to place threads. The
threads spawned for general user space services are limited to a set of
cpus unless there is an explicit override. The intent is to keep as much
processing as possible away from the tickless processors.
On Thu, Sep 12, 2013 at 5:16 PM, Frederic Weisbecker <[email protected]> wrote:
> On Thu, Sep 12, 2013 at 02:10:56PM +0000, Christoph Lameter wrote:
>> On Thu, 12 Sep 2013, Frederic Weisbecker wrote:
>>
>> > Why not do this from userspace instead?
>>
>> Because the cpumasks are hardcoded in the kernel code.
>>
>
> Ok but you can change the affinity of a kthread from userspace, as
> long as you define a cpu set that is among that kthread's cpus allowed.
There is also the problem of kernel threads registering timers. We
don't have a good way to migrate those yet, I believe.
Gilad
--
Gilad Ben-Yossef
Chief Coffee Drinker
[email protected]
Israel Cell: +972-52-8260388
US Cell: +1-973-8260388
http://benyossef.com
"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
-- Jean-Baptiste Queru
On Thu, Sep 12, 2013 at 9:35 PM, Frederic Weisbecker <[email protected]> wrote:
>
> Just offline the CPUs you want to isolate, affine your kthreads and re-online
> the CPUs.
>
> If you're lucky enough to have 1024 CPUs, a winter night should be enough ;-)
Great, I have 4,096 CPUs. I guess I have to wait for the winter solstice :-)
Gilad
On Fri, Sep 13, 2013 at 03:40:40PM +0000, Christoph Lameter wrote:
> Hmmm... usermodehelper is based on workqueues. I guess this will
> ultimately come down to modifying the workqueue behavior for
> WORK_CPU_UNBOUND?
You don't need to keep it like that -- in fact I would suggest removing
that dependency and creating an extra (explicit) unbound thread spawner
that both usermodehelper and kworker can use for unbound threads.
> If WORK_CPU_UNBOUND could mean to limit process execution to the affinity
> of kthreadd then we are fine.
No, kthreadd must stay clean, it must not have any affinity nor be part
of any cgroup like thing. We must have means of spawning clean kthreads,
therefore we need to create a new parent for these special cases.