2007-12-17 22:43:51

by Michal Schmidt

[permalink] [raw]
Subject: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO

kthreadd, the creator of other kernel threads, runs as a normal
priority task. This is a potential for priority inversion when a task
wants to spawn a high-priority kernel thread. A middle priority
SCHED_FIFO task can block kthreadd's execution indefinitely and thus
prevent the timely creation of the high-priority kernel thread.

This causes a practical problem. When a runaway real-time task is
eating 100% CPU and we attempt to put the CPU offline, sometimes we
block while waiting for the creation of the highest-priority
"kstopmachine" thread.

The fix is to run kthreadd with the highest possible SCHED_FIFO
priority. Its children must still run as slightly negatively reniced
SCHED_NORMAL tasks.

Signed-off-by: Michal Schmidt <[email protected]>

diff --git a/kernel/kthread.c b/kernel/kthread.c
index dcfe724..a7ce932 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -94,10 +94,17 @@ static void create_kthread(struct kthread_create_info *create)
if (pid < 0) {
create->result = ERR_PTR(pid);
} else {
+ struct sched_param param = { .sched_priority = 0 };
wait_for_completion(&create->started);
read_lock(&tasklist_lock);
create->result = find_task_by_pid(pid);
read_unlock(&tasklist_lock);
+ /*
+ * We (kthreadd) run with SCHED_FIFO, but we don't want
+ * the kthreads we create to have it too by default.
+ */
+ sched_setscheduler(create->result, SCHED_NORMAL, &param);
+ set_user_nice(create->result, -5);
}
complete(&create->done);
}
@@ -217,11 +224,12 @@ EXPORT_SYMBOL(kthread_stop);
int kthreadd(void *unused)
{
struct task_struct *tsk = current;
+ struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };

/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
- set_user_nice(tsk, -5);
+ sched_setscheduler(tsk, SCHED_FIFO, &param);
set_cpus_allowed(tsk, CPU_MASK_ALL);

current->flags |= PF_NOFREEZE;


2007-12-17 23:00:48

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO

On Mon, 2007-12-17 at 23:43 +0100, Michal Schmidt wrote:

> kthreadd, the creator of other kernel threads, runs as a normal
> priority task. This is a potential for priority inversion when a task
> wants to spawn a high-priority kernel thread. A middle priority
> SCHED_FIFO task can block kthreadd's execution indefinitely and thus
> prevent the timely creation of the high-priority kernel thread.
>
> This causes a practical problem. When a runaway real-time task is
> eating 100% CPU and we attempt to put the CPU offline, sometimes we
> block while waiting for the creation of the highest-priority
> "kstopmachine" thread.
>
> The fix is to run kthreadd with the highest possible SCHED_FIFO
> priority. Its children must still run as slightly negatively reniced
> SCHED_NORMAL tasks.
>
> Signed-off-by: Michal Schmidt <[email protected]>
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index dcfe724..a7ce932 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -94,10 +94,17 @@ static void create_kthread(struct kthread_create_info *create)
> if (pid < 0) {
> create->result = ERR_PTR(pid);
> } else {
> + struct sched_param param = { .sched_priority = 0 };
> wait_for_completion(&create->started);
> read_lock(&tasklist_lock);
> create->result = find_task_by_pid(pid);
> read_unlock(&tasklist_lock);
> + /*
> + * We (kthreadd) run with SCHED_FIFO, but we don't want
> + * the kthreads we create to have it too by default.
> + */
> + sched_setscheduler(create->result, SCHED_NORMAL, &param);
> + set_user_nice(create->result, -5);
> }
> complete(&create->done);
> }
> @@ -217,11 +224,12 @@ EXPORT_SYMBOL(kthread_stop);
> int kthreadd(void *unused)
> {
> struct task_struct *tsk = current;
> + struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
>
> /* Setup a clean context for our children to inherit. */
> set_task_comm(tsk, "kthreadd");
> ignore_signals(tsk);
> - set_user_nice(tsk, -5);
> + sched_setscheduler(tsk, SCHED_FIFO, &param);
> set_cpus_allowed(tsk, CPU_MASK_ALL);
>
> current->flags |= PF_NOFREEZE;

I looked at this internally over the weekend.

Acked-by: Jon Masters <[email protected]>

2007-12-22 09:31:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO

On Mon, 17 Dec 2007 23:43:14 +0100 Michal Schmidt <[email protected]> wrote:

> kthreadd, the creator of other kernel threads, runs as a normal
> priority task. This is a potential for priority inversion when a task
> wants to spawn a high-priority kernel thread. A middle priority
> SCHED_FIFO task can block kthreadd's execution indefinitely and thus
> prevent the timely creation of the high-priority kernel thread.
>
> This causes a practical problem. When a runaway real-time task is
> eating 100% CPU and we attempt to put the CPU offline, sometimes we
> block while waiting for the creation of the highest-priority
> "kstopmachine" thread.
>
> The fix is to run kthreadd with the highest possible SCHED_FIFO
> priority. Its children must still run as slightly negatively reniced
> SCHED_NORMAL tasks.

Did you hit this problem with the stock kernel, or have you been working on
other stuff?

A locked-up SCHED_FIFO process will cause kernel threads all sorts of
problems. You've hit one instance, but there will be others. (pdflush
stops working, for one).

The general approach we've taken to this is "don't do that". Yes, we could
boost lots of kernel threads in the way which this patch does but this
actually takes control *away* from userspace. Userspace no longer has the
ability to guarantee itself minimum possible latency without getting
preempted by kernel threads.

And yes, giving userspace this minimum-latency capability does imply that
userspace has a responsibility to not 100% starve kernel threads. It's a
reasonable compromise, I think?

2007-12-22 09:53:18

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO


On Sat, 2007-12-22 at 01:30 -0800, Andrew Morton wrote:
> On Mon, 17 Dec 2007 23:43:14 +0100 Michal Schmidt <[email protected]> wrote:
>
> > kthreadd, the creator of other kernel threads, runs as a normal
> > priority task. This is a potential for priority inversion when a task
> > wants to spawn a high-priority kernel thread. A middle priority
> > SCHED_FIFO task can block kthreadd's execution indefinitely and thus
> > prevent the timely creation of the high-priority kernel thread.
> >
> > This causes a practical problem. When a runaway real-time task is
> > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > block while waiting for the creation of the highest-priority
> > "kstopmachine" thread.
> >
> > The fix is to run kthreadd with the highest possible SCHED_FIFO
> > priority. Its children must still run as slightly negatively reniced
> > SCHED_NORMAL tasks.
>
> Did you hit this problem with the stock kernel, or have you been working on
> other stuff?

This kind of problem is *far* more likely to happen on the -RT kernel
(more example cases), but it's also a general problem too.

> A locked-up SCHED_FIFO process will cause kernel threads all sorts of
> problems. You've hit one instance, but there will be others. (pdflush
> stops working, for one).

Right. Agreed that this is just one "fix" out of many possibly needed,
if upstream wants to address this kind of problem.

> The general approach we've taken to this is "don't do that". Yes, we could
> boost lots of kernel threads in the way which this patch does but this
> actually takes control *away* from userspace. Userspace no longer has the
> ability to guarantee itself minimum possible latency without getting
> preempted by kernel threads.
>
> And yes, giving userspace this minimum-latency capability does imply that
> userspace has a responsibility to not 100% starve kernel threads. It's a
> reasonable compromise, I think?

So, user tasks running with SCHED_FIFO should be able to lock a system?
I guess I see both sides of this argument - yes, it's userspace at
fault, but in other cases when userspace is at fault, we take action
(OOM, segfault, others). Isn't this situation just another case where
the kernel needs to avoid the evils of userland going awry?

Jon.

2007-12-22 10:12:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO

On Sat, 22 Dec 2007 04:52:50 -0500 Jon Masters <[email protected]> wrote:

> > The general approach we've taken to this is "don't do that". Yes, we could
> > boost lots of kernel threads in the way which this patch does but this
> > actually takes control *away* from userspace. Userspace no longer has the
> > ability to guarantee itself minimum possible latency without getting
> > preempted by kernel threads.
> >
> > And yes, giving userspace this minimum-latency capability does imply that
> > userspace has a responsibility to not 100% starve kernel threads. It's a
> > reasonable compromise, I think?
>
> So, user tasks running with SCHED_FIFO should be able to lock a system?

yup. root can damage the system in all sorts of ways.

> I guess I see both sides of this argument - yes, it's userspace at
> fault, but in other cases when userspace is at fault, we take action
> (OOM, segfault, others). Isn't this situation just another case where
> the kernel needs to avoid the evils of userland going awry?

Well... the problem is that if we add a safety net to catch run-away
SCHED_FIFO processes, we've permanently degraded the service which we
provide to well-behaved programs.

Should there be a watchdog which checks for a process which has run
realtime for a certain period and which then takes some action? Such as
descheduling it for a while, generating warnings, demoting its policy,
killing it etc?

2007-12-22 10:18:37

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO

On Sat, 2007-12-22 at 02:11 -0800, Andrew Morton wrote:

> Should there be a watchdog which checks for a process which has run
> realtime for a certain period and which then takes some action? Such as
> descheduling it for a while, generating warnings, demoting its policy,
> killing it etc?

Using the analogy of the OOM killer being called, this wouldn't be a bad
idea IMO - especially if it were configurable, and off by default. This
couldn't be done in a kernel thread (unless some new priority level were
created for it), but I think the details can be worked out later.

Michal - what do you think is the best upstream solution?

Jon.

2007-12-22 10:39:36

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO


On Sat, 2007-12-22 at 04:52 -0500, Jon Masters wrote:

> So, user tasks running with SCHED_FIFO should be able to lock a system?
> I guess I see both sides of this argument - yes, it's userspace at
> fault, but in other cases when userspace is at fault, we take action
> (OOM, segfault, others). Isn't this situation just another case where
> the kernel needs to avoid the evils of userland going awry?

FYI, Ingo queued the below.

http://lkml.org/lkml/2007/10/31/344

-Mike



2007-12-22 10:53:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO

On Sat, 22 Dec 2007 11:39:30 +0100 Mike Galbraith <[email protected]> wrote:

>
> On Sat, 2007-12-22 at 04:52 -0500, Jon Masters wrote:
>
> > So, user tasks running with SCHED_FIFO should be able to lock a system?
> > I guess I see both sides of this argument - yes, it's userspace at
> > fault, but in other cases when userspace is at fault, we take action
> > (OOM, segfault, others). Isn't this situation just another case where
> > the kernel needs to avoid the evils of userland going awry?
>
> FYI, Ingo queued the below.
>
> http://lkml.org/lkml/2007/10/31/344
>

That's pretty different of course, but rlimit might be a suitable interface
for implementing RLIMIT_MAX_CONTINUOUS_RT_MILLISECONDS.

2007-12-22 11:21:43

by Jon Masters

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO


On Sat, 2007-12-22 at 02:52 -0800, Andrew Morton wrote:
> On Sat, 22 Dec 2007 11:39:30 +0100 Mike Galbraith <[email protected]> wrote:
>
> >
> > On Sat, 2007-12-22 at 04:52 -0500, Jon Masters wrote:
> >
> > > So, user tasks running with SCHED_FIFO should be able to lock a system?
> > > I guess I see both sides of this argument - yes, it's userspace at
> > > fault, but in other cases when userspace is at fault, we take action
> > > (OOM, segfault, others). Isn't this situation just another case where
> > > the kernel needs to avoid the evils of userland going awry?
> >
> > FYI, Ingo queued the below.
> >
> > http://lkml.org/lkml/2007/10/31/344
> >
>
> That's pretty different of course, but rlimit might be a suitable interface
> for implementing RLIMIT_MAX_CONTINUOUS_RT_MILLISECONDS.

Right. Just when I thought there was no point in continuing (Ingo having
invented everything already, as usual), this specific problem remains.

I'll try bouncing some ideas off Ingo later.

Jon.

2007-12-23 08:50:27

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] kthread: run kthreadd with max priority SCHED_FIFO


On Sat, 2007-12-22 at 02:52 -0800, Andrew Morton wrote:
> On Sat, 22 Dec 2007 11:39:30 +0100 Mike Galbraith <[email protected]> wrote:
>
> >
> > On Sat, 2007-12-22 at 04:52 -0500, Jon Masters wrote:
> >
> > > So, user tasks running with SCHED_FIFO should be able to lock a system?
> > > I guess I see both sides of this argument - yes, it's userspace at
> > > fault, but in other cases when userspace is at fault, we take action
> > > (OOM, segfault, others). Isn't this situation just another case where
> > > the kernel needs to avoid the evils of userland going awry?
> >
> > FYI, Ingo queued the below.
> >
> > http://lkml.org/lkml/2007/10/31/344
> >
>
> That's pretty different of course, but rlimit might be a suitable interface
> for implementing RLIMIT_MAX_CONTINUOUS_RT_MILLISECONDS.

I'd extend Peter's rt safety net instead: mark for forced requeue when
the soft limit is hit, or add that as an intermediate stage. Possibly
add a demotion stage as well. I wouldn't try to select lower priority
tasks, which RLIMIT_MAX_CONTINUOUS_RT_MILLISECONDS implies to me.

-Mike

2008-01-07 10:06:23

by Michal Schmidt

[permalink] [raw]
Subject: [PATCH] kthread: always create the kernel threads with normal priority

On Sat, 22 Dec 2007 01:30:21 -0800
Andrew Morton <[email protected]> wrote:

> On Mon, 17 Dec 2007 23:43:14 +0100 Michal Schmidt
> <[email protected]> wrote:
>
> > kthreadd, the creator of other kernel threads, runs as a normal
> > priority task. This is a potential for priority inversion when a
> > task wants to spawn a high-priority kernel thread. A middle priority
> > SCHED_FIFO task can block kthreadd's execution indefinitely and thus
> > prevent the timely creation of the high-priority kernel thread.
> >
> > This causes a practical problem. When a runaway real-time task is
> > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > block while waiting for the creation of the highest-priority
> > "kstopmachine" thread.
> >
> > The fix is to run kthreadd with the highest possible SCHED_FIFO
> > priority. Its children must still run as slightly negatively reniced
> > SCHED_NORMAL tasks.
>
> Did you hit this problem with the stock kernel, or have you been
> working on other stuff?

This was with RHEL5 and with current Fedora kernels.

> A locked-up SCHED_FIFO process will cause kernel threads all sorts of
> problems. You've hit one instance, but there will be others.
> (pdflush stops working, for one).
>
> The general approach we've taken to this is "don't do that". Yes, we
> could boost lots of kernel threads in the way which this patch does
> but this actually takes control *away* from userspace. Userspace no
> longer has the ability to guarantee itself minimum possible latency
> without getting preempted by kernel threads.
>
> And yes, giving userspace this minimum-latency capability does imply
> that userspace has a responsibility to not 100% starve kernel
> threads. It's a reasonable compromise, I think?

You're right. We should not run kthreadd with SCHED_FIFO by default.
But the user should be able to change it using chrt if he wants to
avoid this particular problem. So how about this instead?:



kthreadd, the creator of other kernel threads, runs as a normal priority task.
This is a potential for priority inversion when a task wants to spawn a
high-priority kernel thread. A middle priority SCHED_FIFO task can block
kthreadd's execution indefinitely and thus prevent the timely creation of the
high-priority kernel thread.

This causes a practical problem. When a runaway real-time task is eating 100%
CPU and we attempt to put the CPU offline, sometimes we block while waiting for
the creation of the highest-priority "kstopmachine" thread.

This could be solved by always running kthreadd with the highest possible
SCHED_FIFO priority, but that would be undesirable policy decision in the
kernel. kthreadd would cause unwanted latencies even for the realtime users who
know what they're doing.

Let's not make the decision for the user. Just allow the administrator to
change kthreadd's priority safely if he chooses to do it. Ensure that the
kernel threads are created with the usual nice level even if kthreadd's
priority is changed from the default.

Signed-off-by: Michal Schmidt <[email protected]>
---
kernel/kthread.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index dcfe724..e832a85 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -94,10 +94,21 @@ static void create_kthread(struct kthread_create_info *create)
if (pid < 0) {
create->result = ERR_PTR(pid);
} else {
+ struct sched_param param = { .sched_priority = 0 };
wait_for_completion(&create->started);
read_lock(&tasklist_lock);
create->result = find_task_by_pid(pid);
read_unlock(&tasklist_lock);
+ /*
+ * root may want to change our (kthreadd's) priority to
+ * realtime to solve a corner case priority inversion problem
+ * (a realtime task consuming 100% CPU blocking the creation of
+ * kernel threads). The kernel thread should not inherit the
+ * higher priority. Let's always create it with the usual nice
+ * level.
+ */
+ sched_setscheduler(create->result, SCHED_NORMAL, &param);
+ set_user_nice(create->result, -5);
}
complete(&create->done);
}
--
1.5.3.3

2008-01-07 10:26:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority

On Mon, 7 Jan 2008 11:06:03 +0100 Michal Schmidt <[email protected]> wrote:

> On Sat, 22 Dec 2007 01:30:21 -0800
> Andrew Morton <[email protected]> wrote:
>
> > On Mon, 17 Dec 2007 23:43:14 +0100 Michal Schmidt
> > <[email protected]> wrote:
> >
> > > kthreadd, the creator of other kernel threads, runs as a normal
> > > priority task. This is a potential for priority inversion when a
> > > task wants to spawn a high-priority kernel thread. A middle priority
> > > SCHED_FIFO task can block kthreadd's execution indefinitely and thus
> > > prevent the timely creation of the high-priority kernel thread.
> > >
> > > This causes a practical problem. When a runaway real-time task is
> > > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > > block while waiting for the creation of the highest-priority
> > > "kstopmachine" thread.
> > >
> > > The fix is to run kthreadd with the highest possible SCHED_FIFO
> > > priority. Its children must still run as slightly negatively reniced
> > > SCHED_NORMAL tasks.
> >
> > Did you hit this problem with the stock kernel, or have you been
> > working on other stuff?
>
> This was with RHEL5 and with current Fedora kernels.
>
> > A locked-up SCHED_FIFO process will cause kernel threads all sorts of
> > problems. You've hit one instance, but there will be others.
> > (pdflush stops working, for one).
> >
> > The general approach we've taken to this is "don't do that". Yes, we
> > could boost lots of kernel threads in the way which this patch does
> > but this actually takes control *away* from userspace. Userspace no
> > longer has the ability to guarantee itself minimum possible latency
> > without getting preempted by kernel threads.
> >
> > And yes, giving userspace this minimum-latency capability does imply
> > that userspace has a responsibility to not 100% starve kernel
> > threads. It's a reasonable compromise, I think?
>
> You're right. We should not run kthreadd with SCHED_FIFO by default.
> But the user should be able to change it using chrt if he wants to
> avoid this particular problem. So how about this instead?:
>
>
>
> kthreadd, the creator of other kernel threads, runs as a normal priority task.
> This is a potential for priority inversion when a task wants to spawn a
> high-priority kernel thread. A middle priority SCHED_FIFO task can block
> kthreadd's execution indefinitely and thus prevent the timely creation of the
> high-priority kernel thread.
>
> This causes a practical problem. When a runaway real-time task is eating 100%
> CPU and we attempt to put the CPU offline, sometimes we block while waiting for
> the creation of the highest-priority "kstopmachine" thread.
>
> This could be solved by always running kthreadd with the highest possible
> SCHED_FIFO priority, but that would be undesirable policy decision in the
> kernel. kthreadd would cause unwanted latencies even for the realtime users who
> know what they're doing.
>
> Let's not make the decision for the user. Just allow the administrator to
> change kthreadd's priority safely if he chooses to do it. Ensure that the
> kernel threads are created with the usual nice level even if kthreadd's
> priority is changed from the default.
>
> Signed-off-by: Michal Schmidt <[email protected]>
> ---
> kernel/kthread.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index dcfe724..e832a85 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -94,10 +94,21 @@ static void create_kthread(struct kthread_create_info *create)
> if (pid < 0) {
> create->result = ERR_PTR(pid);
> } else {
> + struct sched_param param = { .sched_priority = 0 };
> wait_for_completion(&create->started);
> read_lock(&tasklist_lock);
> create->result = find_task_by_pid(pid);
> read_unlock(&tasklist_lock);
> + /*
> + * root may want to change our (kthreadd's) priority to
> + * realtime to solve a corner case priority inversion problem
> + * (a realtime task consuming 100% CPU blocking the creation of
> + * kernel threads). The kernel thread should not inherit the
> + * higher priority. Let's always create it with the usual nice
> + * level.
> + */
> + sched_setscheduler(create->result, SCHED_NORMAL, &param);
> + set_user_nice(create->result, -5);
> }
> complete(&create->done);
> }

Seems reasonable.

As a followup thing, we now have two hard-coded magical -5's in kthread.c.
It'd be nice to add a #define for this.

It'd be nicer to work out where on earth that -5 came from too ;)

Readers might wonder why kthreadd children disinherit kthreadd's policy and
priority, but retain its cpus_allowed (and whatever other stuff root could have
altered?)

2008-01-07 11:09:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority


> > This causes a practical problem. When a runaway real-time task is
> > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > block while waiting for the creation of the highest-priority
> > "kstopmachine" thread.

sched-devel.git has new mechanisms against runaway RT tasks. There's a
new RLIMIT_RTTIME rlimit - if an RT task exceeds that rlimit then it is
sent SIGXCPU.

there's also a new group scheduling extension that is driven via a
sysctl:

/proc/sys/kernel/sched_rt_ratio

this way if a user has a runaway RT task, other users (and root) will
still have some CPU time left. (in Peter's latest patchset that is
replaced via rt_runtime_ns - but this is a detail)

so instead of the never-ending arms race of kernel thread priorities
against RT task priorities, we are going towards making RT tasks safer
on a policy level.

Ingo

2008-01-07 11:23:05

by Remy Bohmer

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority

Hello Michal and Andrew,

> Let's not make the decision for the user. Just allow the administrator to
> change kthreadd's priority safely if he chooses to do it. Ensure that the
> kernel threads are created with the usual nice level even if kthreadd's
> priority is changed from the default.

Last year, I posted a patchset (that was meant for Preempt-RT at that
time) to be able to prioritise the interrupt-handler-threads (which
are kthreads) and softirq-threads from the kernel commandline. See
http://lkml.org/lkml/2007/12/19/208

Maybe we can find a way to use a similar mechanism as I used in my
patchset for the priorities of the remaining kthreads.
I do not like the way of forcing userland to change the priorities,
because that would require a userland with the chrt tool installed,
and that is not that practical for embedded systems (in which there
could be cases that there is no userland at all, or the init-process
is the whole embedded application). In that case an option to do it on
the kernel commandline is more practical.

I propose this kernel cmd-line option:
kthread_pmap=somethread:50,otherthread:12,34

Then threads can be started as SCHED_NORMAL, and when overruled inside
the kthread-sources itself, or by the kernel commandline, the user can
set them to something else.

What do you think of this?

(notice that I am reworking the review comments I received on this
patch-series right now, and that I can take such change into account
immediately)

Kind Regards,

Remy

2008/1/7, Michal Schmidt <[email protected]>:
> On Sat, 22 Dec 2007 01:30:21 -0800
> Andrew Morton <[email protected]> wrote:
>
> > On Mon, 17 Dec 2007 23:43:14 +0100 Michal Schmidt
> > <[email protected]> wrote:
> >
> > > kthreadd, the creator of other kernel threads, runs as a normal
> > > priority task. This is a potential for priority inversion when a
> > > task wants to spawn a high-priority kernel thread. A middle priority
> > > SCHED_FIFO task can block kthreadd's execution indefinitely and thus
> > > prevent the timely creation of the high-priority kernel thread.
> > >
> > > This causes a practical problem. When a runaway real-time task is
> > > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > > block while waiting for the creation of the highest-priority
> > > "kstopmachine" thread.
> > >
> > > The fix is to run kthreadd with the highest possible SCHED_FIFO
> > > priority. Its children must still run as slightly negatively reniced
> > > SCHED_NORMAL tasks.
> >
> > Did you hit this problem with the stock kernel, or have you been
> > working on other stuff?
>
> This was with RHEL5 and with current Fedora kernels.
>
> > A locked-up SCHED_FIFO process will cause kernel threads all sorts of
> > problems. You've hit one instance, but there will be others.
> > (pdflush stops working, for one).
> >
> > The general approach we've taken to this is "don't do that". Yes, we
> > could boost lots of kernel threads in the way which this patch does
> > but this actually takes control *away* from userspace. Userspace no
> > longer has the ability to guarantee itself minimum possible latency
> > without getting preempted by kernel threads.
> >
> > And yes, giving userspace this minimum-latency capability does imply
> > that userspace has a responsibility to not 100% starve kernel
> > threads. It's a reasonable compromise, I think?
>
> You're right. We should not run kthreadd with SCHED_FIFO by default.
> But the user should be able to change it using chrt if he wants to
> avoid this particular problem. So how about this instead?:
>
>
>
> kthreadd, the creator of other kernel threads, runs as a normal priority task.
> This is a potential for priority inversion when a task wants to spawn a
> high-priority kernel thread. A middle priority SCHED_FIFO task can block
> kthreadd's execution indefinitely and thus prevent the timely creation of the
> high-priority kernel thread.
>
> This causes a practical problem. When a runaway real-time task is eating 100%
> CPU and we attempt to put the CPU offline, sometimes we block while waiting for
> the creation of the highest-priority "kstopmachine" thread.
>
> This could be solved by always running kthreadd with the highest possible
> SCHED_FIFO priority, but that would be undesirable policy decision in the
> kernel. kthreadd would cause unwanted latencies even for the realtime users who
> know what they're doing.
>
> Let's not make the decision for the user. Just allow the administrator to
> change kthreadd's priority safely if he chooses to do it. Ensure that the
> kernel threads are created with the usual nice level even if kthreadd's
> priority is changed from the default.
>
> Signed-off-by: Michal Schmidt <[email protected]>
> ---
> kernel/kthread.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index dcfe724..e832a85 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -94,10 +94,21 @@ static void create_kthread(struct kthread_create_info *create)
> if (pid < 0) {
> create->result = ERR_PTR(pid);
> } else {
> + struct sched_param param = { .sched_priority = 0 };
> wait_for_completion(&create->started);
> read_lock(&tasklist_lock);
> create->result = find_task_by_pid(pid);
> read_unlock(&tasklist_lock);
> + /*
> + * root may want to change our (kthreadd's) priority to
> + * realtime to solve a corner case priority inversion problem
> + * (a realtime task consuming 100% CPU blocking the creation of
> + * kernel threads). The kernel thread should not inherit the
> + * higher priority. Let's always create it with the usual nice
> + * level.
> + */
> + sched_setscheduler(create->result, SCHED_NORMAL, &param);
> + set_user_nice(create->result, -5);
> }
> complete(&create->done);
> }
> --
> 1.5.3.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-01-07 13:12:27

by Michal Schmidt

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority

On Mon, 7 Jan 2008 12:22:51 +0100
"Remy Bohmer" <[email protected]> wrote:

> Hello Michal and Andrew,
>
> > Let's not make the decision for the user. Just allow the
> > administrator to change kthreadd's priority safely if he chooses to
> > do it. Ensure that the kernel threads are created with the usual
> > nice level even if kthreadd's priority is changed from the default.
>
> Last year, I posted a patchset (that was meant for Preempt-RT at that
> time) to be able to prioritise the interrupt-handler-threads (which
> are kthreads) and softirq-threads from the kernel commandline. See
> http://lkml.org/lkml/2007/12/19/208
>
> Maybe we can find a way to use a similar mechanism as I used in my
> patchset for the priorities of the remaining kthreads.
> I do not like the way of forcing userland to change the priorities,
> because that would require a userland with the chrt tool installed,
> and that is not that practical for embedded systems (in which there
> could be cases that there is no userland at all, or the init-process
> is the whole embedded application). In that case an option to do it on
> the kernel commandline is more practical.
>
> I propose this kernel cmd-line option:
> kthread_pmap=somethread:50,otherthread:12,34

I see. kthreadd would look up the priority for itself and
kthread_create would consult the map for all other kernel threads.
That should work.
Your sirq_pmap would not be needed anymore, as kthread_pmap could be
used for softirq threads too, right?

Michal

2008-01-07 13:19:20

by Michal Schmidt

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority

On Mon, 7 Jan 2008 02:25:13 -0800
Andrew Morton <[email protected]> wrote:

> On Mon, 7 Jan 2008 11:06:03 +0100 Michal Schmidt
> <[email protected]> wrote:
>
> > On Sat, 22 Dec 2007 01:30:21 -0800
> > Andrew Morton <[email protected]> wrote:
> >
> > > On Mon, 17 Dec 2007 23:43:14 +0100 Michal Schmidt
> > > <[email protected]> wrote:
> > >
> > > > kthreadd, the creator of other kernel threads, runs as a normal
> > > > priority task. This is a potential for priority inversion when a
> > > > task wants to spawn a high-priority kernel thread. A middle
> > > > priority SCHED_FIFO task can block kthreadd's execution
> > > > indefinitely and thus prevent the timely creation of the
> > > > high-priority kernel thread.
> > > > This causes a practical problem. When a runaway real-time task
> > > > is eating 100% CPU and we attempt to put the CPU offline,
> > > > sometimes we block while waiting for the creation of the
> > > > highest-priority "kstopmachine" thread.
> > > >
> > > > The fix is to run kthreadd with the highest possible SCHED_FIFO
> > > > priority. Its children must still run as slightly negatively
> > > > reniced SCHED_NORMAL tasks.
> > >
> > > Did you hit this problem with the stock kernel, or have you been
> > > working on other stuff?
> >
> > This was with RHEL5 and with current Fedora kernels.
> >
> > > A locked-up SCHED_FIFO process will cause kernel threads all
> > > sorts of problems. You've hit one instance, but there will be
> > > others. (pdflush stops working, for one).
> > >
> > > The general approach we've taken to this is "don't do that".
> > > Yes, we could boost lots of kernel threads in the way which this
> > > patch does but this actually takes control *away* from
> > > userspace. Userspace no longer has the ability to guarantee
> > > itself minimum possible latency without getting preempted by
> > > kernel threads.
> > >
> > > And yes, giving userspace this minimum-latency capability does
> > > imply that userspace has a responsibility to not 100% starve
> > > kernel threads. It's a reasonable compromise, I think?
> >
> > You're right. We should not run kthreadd with SCHED_FIFO by default.
> > But the user should be able to change it using chrt if he wants to
> > avoid this particular problem. So how about this instead?:
> >
> >
> >
> > kthreadd, the creator of other kernel threads, runs as a normal
> > priority task. This is a potential for priority inversion when a
> > task wants to spawn a high-priority kernel thread. A middle
> > priority SCHED_FIFO task can block kthreadd's execution
> > indefinitely and thus prevent the timely creation of the
> > high-priority kernel thread.
> >
> > This causes a practical problem. When a runaway real-time task is
> > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > block while waiting for the creation of the highest-priority
> > "kstopmachine" thread.
> >
> > This could be solved by always running kthreadd with the highest
> > possible SCHED_FIFO priority, but that would be undesirable policy
> > decision in the kernel. kthreadd would cause unwanted latencies
> > even for the realtime users who know what they're doing.
> >
> > Let's not make the decision for the user. Just allow the
> > administrator to change kthreadd's priority safely if he chooses to
> > do it. Ensure that the kernel threads are created with the usual
> > nice level even if kthreadd's priority is changed from the default.
> >
> > Signed-off-by: Michal Schmidt <[email protected]>
> > ---
> > kernel/kthread.c | 11 +++++++++++
> > 1 files changed, 11 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/kthread.c b/kernel/kthread.c
> > index dcfe724..e832a85 100644
> > --- a/kernel/kthread.c
> > +++ b/kernel/kthread.c
> > @@ -94,10 +94,21 @@ static void create_kthread(struct
> > kthread_create_info *create) if (pid < 0) {
> > create->result = ERR_PTR(pid);
> > } else {
> > + struct sched_param param = { .sched_priority = 0 };
> > wait_for_completion(&create->started);
> > read_lock(&tasklist_lock);
> > create->result = find_task_by_pid(pid);
> > read_unlock(&tasklist_lock);
> > + /*
> > + * root may want to change our (kthreadd's)
> > priority to
> > + * realtime to solve a corner case priority
> > inversion problem
> > + * (a realtime task consuming 100% CPU blocking
> > the creation of
> > + * kernel threads). The kernel thread should not
> > inherit the
> > + * higher priority. Let's always create it with
> > the usual nice
> > + * level.
> > + */
> > + sched_setscheduler(create->result, SCHED_NORMAL,
> > &param);
> > + set_user_nice(create->result, -5);
> > }
> > complete(&create->done);
> > }
>
> Seems reasonable.
>
> As a followup thing, we now have two hard-coded magical -5's in
> kthread.c. It'd be nice to add a #define for this.

Done.

> It'd be nicer to work out where on earth that -5 came from too ;)
>
> Readers might wonder why kthreadd children disinherit kthreadd's
> policy and priority, but retain its cpus_allowed (and whatever other
> stuff root could have altered?)

It's a good idea to reset the CPU mask too.


Allow the administrator to change kthreadd's priority and affinity.
Ensure that the kernel threads are created with the usual nice level
and affinity even if kthreadd's properties were changed from the
default.

Signed-off-by: Michal Schmidt <[email protected]>

diff --git a/kernel/kthread.c b/kernel/kthread.c
index dcfe724..0ac8878 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -15,6 +15,8 @@
#include <linux/mutex.h>
#include <asm/semaphore.h>

+#define KTHREAD_NICE_LEVEL (-5)
+
static DEFINE_SPINLOCK(kthread_create_lock);
static LIST_HEAD(kthread_create_list);
struct task_struct *kthreadd_task;
@@ -94,10 +96,18 @@ static void create_kthread(struct kthread_create_info *create)
if (pid < 0) {
create->result = ERR_PTR(pid);
} else {
+ struct sched_param param = { .sched_priority = 0 };
wait_for_completion(&create->started);
read_lock(&tasklist_lock);
create->result = find_task_by_pid(pid);
read_unlock(&tasklist_lock);
+ /*
+ * root may have changed our (kthreadd's) priority or CPU mask.
+ * The kernel thread should not inherit these properties.
+ */
+ sched_setscheduler(create->result, SCHED_NORMAL, &param);
+ set_user_nice(create->result, KTHREAD_NICE_LEVEL);
+ set_cpus_allowed(create->result, CPU_MASK_ALL);
}
complete(&create->done);
}
@@ -221,7 +231,7 @@ int kthreadd(void *unused)
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
- set_user_nice(tsk, -5);
+ set_user_nice(tsk, KTHREAD_NICE_LEVEL);
set_cpus_allowed(tsk, CPU_MASK_ALL);

current->flags |= PF_NOFREEZE;

2008-01-07 15:53:28

by Remy Bohmer

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority

Hello Michal,

> > Maybe we can find a way to use a similar mechanism as I used in my
> > patchset for the priorities of the remaining kthreads.
> > I do not like the way of forcing userland to change the priorities,
> > because that would require a userland with the chrt tool installed,
> > and that is not that practical for embedded systems (in which there
> > could be cases that there is no userland at all, or the init-process
> > is the whole embedded application). In that case an option to do it on
> > the kernel commandline is more practical.
> >
> > I propose this kernel cmd-line option:
> > kthread_pmap=somethread:50,otherthread:12,34
>
> I see. kthreadd would look up the priority for itself and
> kthread_create would consult the map for all other kernel threads.
> That should work.
> Your sirq_pmap would not be needed anymore, as kthread_pmap could be
> used for softirq threads too, right?

That is correct. The soft-irqs are just ordinary kernel-threads, but
irq_pmap is still needed, to set the priority of a certain interrupt
handler.
In this case it also possible to set the prio of the IRQ-kthreads as
well as the prio of a certain interrupt handler. This might give some
conflicts, and I have to check how to resolve these.

Kind Regards,

Remy


2008/1/7, Michal Schmidt <[email protected]>:
> On Mon, 7 Jan 2008 12:22:51 +0100
> "Remy Bohmer" <[email protected]> wrote:
>
> > Hello Michal and Andrew,
> >
> > > Let's not make the decision for the user. Just allow the
> > > administrator to change kthreadd's priority safely if he chooses to
> > > do it. Ensure that the kernel threads are created with the usual
> > > nice level even if kthreadd's priority is changed from the default.
> >
> > Last year, I posted a patchset (that was meant for Preempt-RT at that
> > time) to be able to prioritise the interrupt-handler-threads (which
> > are kthreads) and softirq-threads from the kernel commandline. See
> > http://lkml.org/lkml/2007/12/19/208
> >
> > Maybe we can find a way to use a similar mechanism as I used in my
> > patchset for the priorities of the remaining kthreads.
> > I do not like the way of forcing userland to change the priorities,
> > because that would require a userland with the chrt tool installed,
> > and that is not that practical for embedded systems (in which there
> > could be cases that there is no userland at all, or the init-process
> > is the whole embedded application). In that case an option to do it on
> > the kernel commandline is more practical.
> >
> > I propose this kernel cmd-line option:
> > kthread_pmap=somethread:50,otherthread:12,34
>
> I see. kthreadd would look up the priority for itself and
> kthread_create would consult the map for all other kernel threads.
> That should work.
> Your sirq_pmap would not be needed anymore, as kthread_pmap could be
> used for softirq threads too, right?
>
> Michal
>

2008-01-07 17:30:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority

On Mon, 7 Jan 2008 12:09:04 +0100 Ingo Molnar <[email protected]> wrote:

>
> > > This causes a practical problem. When a runaway real-time task is
> > > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > > block while waiting for the creation of the highest-priority
> > > "kstopmachine" thread.
>
> sched-devel.git has new mechanisms against runaway RT tasks. There's a
> new RLIMIT_RTTIME rlimit - if an RT task exceeds that rlimit then it is
> sent SIGXCPU.

Is that "total RT CPU time" or "elapsed time since last schedule()"?

If the former, it is not useful for this problem.

> there's also a new group scheduling extension that is driven via a
> sysctl:
>
> /proc/sys/kernel/sched_rt_ratio
>
> this way if a user has a runaway RT task, other users (and root) will
> still have some CPU time left. (in Peter's latest patchset that is
> replaced via rt_runtime_ns - but this is a detail)

Doesn't this make the RT task non-RT? Would need to understand more
details to tell.

> so instead of the never-ending arms race of kernel thread priorities
> against RT task priorities, we are going towards making RT tasks safer
> on a policy level.
>
> Ingo

2008-01-07 17:48:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority


On Mon, 2008-01-07 at 09:29 -0800, Andrew Morton wrote:
> On Mon, 7 Jan 2008 12:09:04 +0100 Ingo Molnar <[email protected]> wrote:
>
> >
> > > > This causes a practical problem. When a runaway real-time task is
> > > > eating 100% CPU and we attempt to put the CPU offline, sometimes we
> > > > block while waiting for the creation of the highest-priority
> > > > "kstopmachine" thread.
> >
> > sched-devel.git has new mechanisms against runaway RT tasks. There's a
> > new RLIMIT_RTTIME rlimit - if an RT task exceeds that rlimit then it is
> > sent SIGXCPU.
>
> Is that "total RT CPU time" or "elapsed time since last schedule()"?
>
> If the former, it is not useful for this problem.
>
> > there's also a new group scheduling extension that is driven via a
> > sysctl:
> >
> > /proc/sys/kernel/sched_rt_ratio
> >
> > this way if a user has a runaway RT task, other users (and root) will
> > still have some CPU time left. (in Peter's latest patchset that is
> > replaced via rt_runtime_ns - but this is a detail)
>
> Doesn't this make the RT task non-RT? Would need to understand more
> details to tell.

Its an artifact of rt group scheduling. Each group will have to specify
a period and runtime limit therein (and the normalized sum thereof must
not exceed the total time available - otherwise the set is not
schedulable).

So say we have two groups A and B. A has a period of 2 seconds and a
runtime limit of 1, that gives him an avg of 50% cpu time. If B then has
a period of 1 second with a runtime limit of .25s (avg 25%) the total
time required to schedule the realtime groups would be 75% on average.

Without group scheduling everything is considered one group but we still
have the period and runtime limits.

So as long as the realtime cpu usage fits within the given limits it
acts as before. Once it exceeds its limit it will be capped hard - which
is ok, since it exceeded its hard limit, and realtime applications are
supposed to be deterministic and thus be able to tell how much time
they'd require. [ If only this model were true, but its a model
frequently used and quite accepted ]



Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2008-01-08 09:55:31

by Michal Schmidt

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority

On Mon, 7 Jan 2008 09:29:56 -0800
Andrew Morton <[email protected]> wrote:

> On Mon, 7 Jan 2008 12:09:04 +0100 Ingo Molnar <[email protected]> wrote:
>
> >
> > > > This causes a practical problem. When a runaway real-time task
> > > > is eating 100% CPU and we attempt to put the CPU offline,
> > > > sometimes we block while waiting for the creation of the
> > > > highest-priority "kstopmachine" thread.
> >
> > sched-devel.git has new mechanisms against runaway RT tasks.
> > There's a new RLIMIT_RTTIME rlimit - if an RT task exceeds that
> > rlimit then it is sent SIGXCPU.
>
> Is that "total RT CPU time" or "elapsed time since last schedule()"?
>
> If the former, it is not useful for this problem.

It's "runtime since last sleep" so it is useful.

I still think the kthread patch is good to have anyway. The user can
have other reasons to change kthreadd's priority/cpumask.

Michal

2008-01-08 16:23:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] kthread: always create the kernel threads with normal priority


* Michal Schmidt <[email protected]> wrote:

> Allow the administrator to change kthreadd's priority and affinity.
> Ensure that the kernel threads are created with the usual nice level
> and affinity even if kthreadd's properties were changed from the
> default.
>
> Signed-off-by: Michal Schmidt <[email protected]>

thanks Michal, applied your patch to sched-devel.git.

Ingo