2010-06-30 09:26:05

by KOSAKI Motohiro

Subject: [mmotm 0611][PATCH 00/11] various OOM bugfixes v3

Hi

Here is the updated series of various OOM fixes.
Most of the fixes are trivial.

One big improvement is Luis's dying-task priority boost patch.
It is necessary for the RT folks.


oom: don't try to kill oom_unkillable child
oom: oom_kill_process() doesn't select kthread child
oom: make oom_unkillable_task() helper function
oom: oom_kill_process() need to check p is unkillable
oom: /proc/<pid>/oom_score treat kernel thread honestly
oom: kill duplicate OOM_DISABLE check
oom: move OOM_DISABLE check from oom_kill_task to out_of_memory()
oom: cleanup has_intersects_mems_allowed()
oom: remove child->mm check from oom_kill_process()
oom: give the dying task a higher priority
oom: multi threaded process coredump don't make deadlock

fs/proc/base.c | 5 ++-
mm/oom_kill.c | 100 +++++++++++++++++++++++++++++++++++++++-----------------
2 files changed, 73 insertions(+), 32 deletions(-)


2010-06-30 09:27:22

by KOSAKI Motohiro

Subject: [PATCH 01/11] oom: don't try to kill oom_unkillable child

badness() currently cares about neither CPUSET nor mempolicy. So when
the OOM killer scans a victim's children, it may pick a child whose
nodemask is disjoint from the OOM context and kill an innocent process
whose death cannot free memory on the nodes under pressure.

This patch fixes it.

Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26ae697..0aeacb2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -429,7 +429,7 @@ static int oom_kill_task(struct task_struct *p)

static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
unsigned long points, struct mem_cgroup *mem,
- const char *message)
+ nodemask_t *nodemask, const char *message)
{
struct task_struct *victim = p;
struct task_struct *child;
@@ -469,6 +469,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
continue;
if (mem && !task_in_mem_cgroup(child, mem))
continue;
+ if (!has_intersects_mems_allowed(child, nodemask))
+ continue;

/* badness() returns 0 if the thread is unkillable */
child_points = badness(child, uptime.tv_sec);
@@ -519,7 +521,7 @@ retry:
if (!p || PTR_ERR(p) == -1UL)
goto out;

- if (oom_kill_process(p, gfp_mask, 0, points, mem,
+ if (oom_kill_process(p, gfp_mask, 0, points, mem, NULL,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -678,7 +680,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
* non-zero, current could not be killed so we must fallback to
* the tasklist scan.
*/
- if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+ if (!oom_kill_process(current, gfp_mask, order, 0, NULL, nodemask,
"Out of memory (oom_kill_allocating_task)"))
return;
}
@@ -697,7 +699,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}

- if (oom_kill_process(p, gfp_mask, order, points, NULL,
+ if (oom_kill_process(p, gfp_mask, order, points, NULL, nodemask,
"Out of memory"))
goto retry;
read_unlock(&tasklist_lock);
--
1.6.5.2


2010-06-30 09:27:57

by KOSAKI Motohiro

Subject: [PATCH 02/11] oom: oom_kill_process() doesn't select kthread child

select_bad_process() has a PF_KTHREAD check, but oom_kill_process()
doesn't. That means oom_kill_process() may choose the wrong task,
especially when a child is a kernel thread using use_mm().

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0aeacb2..dc8589e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -467,6 +467,8 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

if (child->mm == p->mm)
continue;
+ if (child->flags & PF_KTHREAD)
+ continue;
if (mem && !task_in_mem_cgroup(child, mem))
continue;
if (!has_intersects_mems_allowed(child, nodemask))
--
1.6.5.2


2010-06-30 09:28:42

by KOSAKI Motohiro

Subject: [PATCH 03/11] oom: make oom_unkillable_task() helper function

We now have the same task checks in two places. Unify them.

Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 33 ++++++++++++++++++++++-----------
1 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dc8589e..a4a5439 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -101,6 +101,26 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
return NULL;
}

+/* return true if the task is not adequate as candidate victim task. */
+static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask)
+{
+ if (is_global_init(p))
+ return true;
+ if (p->flags & PF_KTHREAD)
+ return true;
+
+ /* When mem_cgroup_out_of_memory() and p is not member of the group */
+ if (mem && !task_in_mem_cgroup(p, mem))
+ return true;
+
+ /* p may not have freeable memory in nodemask */
+ if (!has_intersects_mems_allowed(p, nodemask))
+ return true;
+
+ return false;
+}
+
/**
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
@@ -295,12 +315,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
for_each_process(p) {
unsigned long points;

- /* skip the init task and kthreads */
- if (is_global_init(p) || (p->flags & PF_KTHREAD))
- continue;
- if (mem && !task_in_mem_cgroup(p, mem))
- continue;
- if (!has_intersects_mems_allowed(p, nodemask))
+ if (oom_unkillable_task(p, mem, nodemask))
continue;

/*
@@ -467,11 +482,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

if (child->mm == p->mm)
continue;
- if (child->flags & PF_KTHREAD)
- continue;
- if (mem && !task_in_mem_cgroup(child, mem))
- continue;
- if (!has_intersects_mems_allowed(child, nodemask))
+ if (oom_unkillable_task(child, mem, nodemask))
continue;

/* badness() returns 0 if the thread is unkillable */
--
1.6.5.2


2010-06-30 09:29:26

by KOSAKI Motohiro

Subject: [PATCH 04/11] oom: oom_kill_process() need to check p is unkillable

When oom_kill_allocating_task is enabled, the task passed to
oom_kill_process() is not selected by select_bad_process(); it is
simply the task that called out_of_memory(). That means the task can be
unkillable, so check for that first.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index a4a5439..ee00817 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -687,7 +687,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
check_panic_on_oom(constraint, gfp_mask, order);

read_lock(&tasklist_lock);
- if (sysctl_oom_kill_allocating_task) {
+ if (sysctl_oom_kill_allocating_task &&
+ !oom_unkillable_task(current, NULL, nodemask)) {
/*
* oom_kill_process() needs tasklist_lock held. If it returns
* non-zero, current could not be killed so we must fallback to
--
1.6.5.2


2010-06-30 09:30:24

by KOSAKI Motohiro

Subject: [PATCH 05/11] oom: /proc/<pid>/oom_score treat kernel thread honestly

If a kernel thread is using use_mm(), badness() returns a positive
value. This is not a big issue in itself because the in-kernel callers
handle it correctly, but there is one exception: /proc/<pid>/oom_score
calls badness() directly and doesn't check whether the task is a
regular process.

As another example, /proc/1/oom_score returns a non-zero value even
though init is unkillable. This inconsistency is a bit confusing for
admins.

This patch fixes it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
fs/proc/base.c | 5 +++--
mm/oom_kill.c | 13 +++++++------
2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 28099a1..56b8d3e 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -428,7 +428,8 @@ static const struct file_operations proc_lstats_operations = {
#endif

/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+ nodemask_t *nodemask, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points = 0;
@@ -437,7 +438,7 @@ static int proc_oom_score(struct task_struct *task, char *buffer)
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
if (pid_alive(task))
- points = badness(task, uptime.tv_sec);
+ points = badness(task, NULL, NULL, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ee00817..fcbd21b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -139,8 +139,8 @@ static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
* algorithm has been meticulously tuned to meet the principle
* of least surprise ... (be careful when you change it)
*/
-
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, struct mem_cgroup *mem,
+ const nodemask_t *nodemask, unsigned long uptime)
{
unsigned long points, cpu_time, run_time;
struct task_struct *child;
@@ -150,6 +150,8 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
unsigned long utime;
unsigned long stime;

+ if (oom_unkillable_task(p, mem, nodemask))
+ return 0;
if (oom_adj == OOM_DISABLE)
return 0;

@@ -351,7 +353,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
if (p->signal->oom_adj == OOM_DISABLE)
continue;

- points = badness(p, uptime.tv_sec);
+ points = badness(p, mem, nodemask, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
@@ -482,11 +484,10 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

if (child->mm == p->mm)
continue;
- if (oom_unkillable_task(child, mem, nodemask))
- continue;

/* badness() returns 0 if the thread is unkillable */
- child_points = badness(child, uptime.tv_sec);
+ child_points = badness(child, mem, nodemask,
+ uptime.tv_sec);
if (child_points > victim_points) {
victim = child;
victim_points = child_points;
--
1.6.5.2


2010-06-30 09:31:07

by KOSAKI Motohiro

Subject: [PATCH 06/11] oom: kill duplicate OOM_DISABLE check


select_bad_process() and badness() perform the same OOM_DISABLE check.
This patch removes one of them.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fcbd21b..f5d903f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -350,9 +350,6 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
*ppoints = ULONG_MAX;
}

- if (p->signal->oom_adj == OOM_DISABLE)
- continue;
-
points = badness(p, mem, nodemask, uptime.tv_sec);
if (points > *ppoints || !chosen) {
chosen = p;
--
1.6.5.2


2010-06-30 09:31:40

by KOSAKI Motohiro

Subject: [PATCH 07/11] oom: move OOM_DISABLE check from oom_kill_task to out_of_memory()

Currently, if oom_kill_allocating_task is enabled and current has
OOM_DISABLE set, the following printk in oom_kill_process() is emitted
twice:

pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
message, task_pid_nr(p), p->comm, points);

So the OOM_DISABLE check should happen earlier.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f5d903f..75f6dbc 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -424,7 +424,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
static int oom_kill_task(struct task_struct *p)
{
p = find_lock_task_mm(p);
- if (!p || p->signal->oom_adj == OOM_DISABLE) {
+ if (!p) {
task_unlock(p);
return 1;
}
@@ -686,7 +686,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,

read_lock(&tasklist_lock);
if (sysctl_oom_kill_allocating_task &&
- !oom_unkillable_task(current, NULL, nodemask)) {
+ !oom_unkillable_task(current, NULL, nodemask) &&
+ (current->signal->oom_adj != OOM_DISABLE)) {
/*
* oom_kill_process() needs tasklist_lock held. If it returns
* non-zero, current could not be killed so we must fallback to
--
1.6.5.2


2010-06-30 09:32:15

by KOSAKI Motohiro

Subject: [PATCH 08/11] oom: cleanup has_intersects_mems_allowed()

has_intersects_mems_allowed() currently has its own thread-iteration
logic, but it should use while_each_thread().

This slightly improves code readability.

Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 75f6dbc..136708b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -69,8 +69,8 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
if (cpuset_mems_allowed_intersects(current, tsk))
return true;
}
- tsk = next_thread(tsk);
- } while (tsk != start);
+ } while_each_thread(start, tsk);
+
return false;
}
#else
--
1.6.5.2


2010-06-30 09:32:51

by KOSAKI Motohiro

Subject: [PATCH 09/11] oom: remove child->mm check from oom_kill_process()


Current "child->mm == p->mm" mean prevent to select vfork() task.
But we don't have any reason to don't consider vfork().

Remvoed.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 3 ---
1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 136708b..b5678bf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,6 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
list_for_each_entry(child, &t->children, sibling) {
unsigned long child_points;

- if (child->mm == p->mm)
- continue;
-
/* badness() returns 0 if the thread is unkillable */
child_points = badness(child, mem, nodemask,
uptime.tv_sec);
--
1.6.5.2


2010-06-30 09:33:28

by KOSAKI Motohiro

Subject: [PATCH 10/11] oom: give the dying task a higher priority

From: Luis Claudio R. Goncalves <[email protected]>

In a system under heavy load it was observed that even after the
oom-killer selects a task to die, the task may take a long time to die.

Right after sending a SIGKILL to the task selected by the oom-killer,
this task has its priority increased so that it can exit() soon,
freeing memory. That is accomplished by:

/*
* We give our sacrificial lamb high priority and access to
* all the memory it needs. That way it should be able to
* exit() and clear out its resources quickly...
*/
p->rt.time_slice = HZ;
set_tsk_thread_flag(p, TIF_MEMDIE);

It sounds plausible to give the dying task an even higher priority, to
make sure it is scheduled sooner and frees the desired memory. It was
suggested on LKML to use SCHED_FIFO:1, the lowest RT priority, so that
this task won't interfere with any running RT task.

If the dying task is already an RT task, it is left untouched.
Another good suggestion, implemented here, was to avoid boosting the
dying task's priority in the case of a mem_cgroup OOM.

Signed-off-by: Luis Claudio R. Goncalves <[email protected]>
Cc: Minchan Kim <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 34 +++++++++++++++++++++++++++++++---
1 files changed, 31 insertions(+), 3 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b5678bf..0858b18 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -82,6 +82,24 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
#endif /* CONFIG_NUMA */

/*
+ * If this is a system OOM (not a memcg OOM) and the task selected to be
+ * killed is not already running at high (RT) priorities, speed up the
+ * recovery by boosting the dying task to the lowest FIFO priority.
+ * That helps with the recovery and avoids interfering with RT tasks.
+ */
+static void boost_dying_task_prio(struct task_struct *p,
+ struct mem_cgroup *mem)
+{
+ struct sched_param param = { .sched_priority = 1 };
+
+ if (mem)
+ return;
+
+ if (!rt_task(p))
+ sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
+}
+
+/*
* The process p may have detached its own ->mm while exiting or through
* use_mm(), but one or more of its subthreads may still have a valid
* pointer. Return p, or any of its subthreads with a valid ->mm, with
@@ -421,7 +439,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
}

#define K(x) ((x) << (PAGE_SHIFT-10))
-static int oom_kill_task(struct task_struct *p)
+static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
{
p = find_lock_task_mm(p);
if (!p) {
@@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);

- p->rt.time_slice = HZ;
+
set_tsk_thread_flag(p, TIF_MEMDIE);
force_sig(SIGKILL, p);
+
+ /*
+ * We give our sacrificial lamb high priority and access to
+ * all the memory it needs. That way it should be able to
+ * exit() and clear out its resources quickly...
+ */
+ boost_dying_task_prio(p, mem);
+
return 0;
}
#undef K
@@ -460,6 +486,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
if (p->flags & PF_EXITING) {
set_tsk_thread_flag(p, TIF_MEMDIE);
+ boost_dying_task_prio(p, mem);
return 0;
}

@@ -489,7 +516,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
}
} while_each_thread(p, t);

- return oom_kill_task(victim);
+ return oom_kill_task(victim, mem);
}

/*
@@ -670,6 +697,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
if (fatal_signal_pending(current)) {
set_thread_flag(TIF_MEMDIE);
+ boost_dying_task_prio(current, NULL);
return;
}

--
1.6.5.2


2010-06-30 09:34:12

by KOSAKI Motohiro

Subject: [PATCH 11/11] oom: multi threaded process coredump don't make deadlock

Oleg pointed out that the current PF_EXITING check is wrong, because
PF_EXITING is a per-thread flag, not a per-process flag. He said:

Two threads, group-leader L and its sub-thread T. T dumps core.
In this case both threads have ->mm != NULL, and L has PF_EXITING.

The first problem is that select_bad_process() always returns -1 in
this case (even if the caller is T, this doesn't matter).

The second problem is that we should set TIF_MEMDIE on T, not L.

I think we can remove this dubious PF_EXITING check eventually, but as
a first step, this patch adds protection for the multi-threaded case.

Cc: Oleg Nesterov <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/oom_kill.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0858b18..b04e557 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -360,7 +360,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
* the process of exiting and releasing its resources.
* Otherwise we could get an easy OOM deadlock.
*/
- if ((p->flags & PF_EXITING) && p->mm) {
+ if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
if (p != current)
return ERR_PTR(-1UL);

--
1.6.5.2


2010-06-30 09:35:16

by KOSAKI Motohiro

Subject: Re: [PATCH 10/11] oom: give the dying task a higher priority


Sorry, I forgot to cc Luis. Resending.


(intentional full quote)

> From: Luis Claudio R. Goncalves <[email protected]>
>
> In a system under heavy load it was observed that even after the
> oom-killer selects a task to die, the task may take a long time to die.
>
> Right after sending a SIGKILL to the task selected by the oom-killer,
> this task has its priority increased so that it can exit() soon,
> freeing memory. That is accomplished by:
>
> /*
> * We give our sacrificial lamb high priority and access to
> * all the memory it needs. That way it should be able to
> * exit() and clear out its resources quickly...
> */
> p->rt.time_slice = HZ;
> set_tsk_thread_flag(p, TIF_MEMDIE);
>
> It sounds plausible to give the dying task an even higher priority, to
> make sure it is scheduled sooner and frees the desired memory. It was
> suggested on LKML to use SCHED_FIFO:1, the lowest RT priority, so that
> this task won't interfere with any running RT task.
>
> If the dying task is already an RT task, it is left untouched.
> Another good suggestion, implemented here, was to avoid boosting the
> dying task's priority in the case of a mem_cgroup OOM.
>
> Signed-off-by: Luis Claudio R. Goncalves <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/oom_kill.c | 34 +++++++++++++++++++++++++++++++---
> 1 files changed, 31 insertions(+), 3 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index b5678bf..0858b18 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -82,6 +82,24 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> #endif /* CONFIG_NUMA */
>
> /*
> + * If this is a system OOM (not a memcg OOM) and the task selected to be
> + * killed is not already running at high (RT) priorities, speed up the
> + * recovery by boosting the dying task to the lowest FIFO priority.
> + * That helps with the recovery and avoids interfering with RT tasks.
> + */
> +static void boost_dying_task_prio(struct task_struct *p,
> + struct mem_cgroup *mem)
> +{
> + struct sched_param param = { .sched_priority = 1 };
> +
> + if (mem)
> + return;
> +
> + if (!rt_task(p))
> + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> +}
> +
> +/*
> * The process p may have detached its own ->mm while exiting or through
> * use_mm(), but one or more of its subthreads may still have a valid
> * pointer. Return p, or any of its subthreads with a valid ->mm, with
> @@ -421,7 +439,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> }
>
> #define K(x) ((x) << (PAGE_SHIFT-10))
> -static int oom_kill_task(struct task_struct *p)
> +static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
> {
> p = find_lock_task_mm(p);
> if (!p) {
> @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> K(get_mm_counter(p->mm, MM_FILEPAGES)));
> task_unlock(p);
>
> - p->rt.time_slice = HZ;
> +
> set_tsk_thread_flag(p, TIF_MEMDIE);
> force_sig(SIGKILL, p);
> +
> + /*
> + * We give our sacrificial lamb high priority and access to
> + * all the memory it needs. That way it should be able to
> + * exit() and clear out its resources quickly...
> + */
> + boost_dying_task_prio(p, mem);
> +
> return 0;
> }
> #undef K
> @@ -460,6 +486,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> */
> if (p->flags & PF_EXITING) {
> set_tsk_thread_flag(p, TIF_MEMDIE);
> + boost_dying_task_prio(p, mem);
> return 0;
> }
>
> @@ -489,7 +516,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> }
> } while_each_thread(p, t);
>
> - return oom_kill_task(victim);
> + return oom_kill_task(victim, mem);
> }
>
> /*
> @@ -670,6 +697,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> */
> if (fatal_signal_pending(current)) {
> set_thread_flag(TIF_MEMDIE);
> + boost_dying_task_prio(current, NULL);
> return;
> }
>
> --
> 1.6.5.2
>
>
>


2010-06-30 14:03:09

by Minchan Kim

Subject: Re: [PATCH 04/11] oom: oom_kill_process() need to check p is unkillable

On Wed, Jun 30, 2010 at 06:29:22PM +0900, KOSAKI Motohiro wrote:
> When oom_kill_allocating_task is enabled, the task passed to
> oom_kill_process() is not selected by select_bad_process(); it is
> simply the task that called out_of_memory(). That means the task can
> be unkillable, so check for that first.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim

2010-06-30 14:04:09

by Minchan Kim

Subject: Re: [PATCH 05/11] oom: /proc/<pid>/oom_score treat kernel thread honestly

On Wed, Jun 30, 2010 at 06:30:19PM +0900, KOSAKI Motohiro wrote:
> If a kernel thread is using use_mm(), badness() returns a positive
> value. This is not a big issue in itself because the in-kernel callers
> handle it correctly, but there is one exception: /proc/<pid>/oom_score
> calls badness() directly and doesn't check whether the task is a
> regular process.
>
> As another example, /proc/1/oom_score returns a non-zero value even
> though init is unkillable. This inconsistency is a bit confusing for
> admins.

Hmm. If it is really a problem, could we solve it in proc_oom_score itself?
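
Something like this untested sketch, maybe (just an illustration; the
added checks are my guess at what proc_oom_score would need, keeping
badness() untouched):

static int proc_oom_score(struct task_struct *task, char *buffer)
{
	unsigned long points = 0;
	struct timespec uptime;

	do_posix_clock_monotonic_gettime(&uptime);
	read_lock(&tasklist_lock);
	/* report 0 for tasks the OOM killer would never select */
	if (pid_alive(task) && !(task->flags & PF_KTHREAD) &&
	    !is_global_init(task))
		points = badness(task, uptime.tv_sec);
	read_unlock(&tasklist_lock);
	return sprintf(buffer, "%lu\n", points);
}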

--
Kind regards,
Minchan Kim

2010-06-30 14:10:14

by Minchan Kim

Subject: Re: [PATCH 06/11] oom: kill duplicate OOM_DISABLE check

On Wed, Jun 30, 2010 at 06:31:00PM +0900, KOSAKI Motohiro wrote:
>
> select_bad_process() and badness() perform the same OOM_DISABLE check.
> This patch removes one of them.
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2010-06-30 14:19:54

by Minchan Kim

Subject: Re: [PATCH 03/11] oom: make oom_unkillable_task() helper function

On Wed, Jun 30, 2010 at 06:28:37PM +0900, KOSAKI Motohiro wrote:
> We now have the same task checks in two places. Unify them.
>
> Reviewed-by: Minchan Kim <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/oom_kill.c | 33 ++++++++++++++++++++++-----------
> 1 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dc8589e..a4a5439 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -101,6 +101,26 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
> return NULL;
> }
>
> +/* return true if the task is not adequate as candidate victim task. */
> +static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
> + const nodemask_t *nodemask)
> +{
> + if (is_global_init(p))
> + return true;
> + if (p->flags & PF_KTHREAD)
> + return true;
> +
> + /* When mem_cgroup_out_of_memory() and p is not member of the group */
> + if (mem && !task_in_mem_cgroup(p, mem))
> + return true;
> +
> + /* p may not have freeable memory in nodemask */
> + if (!has_intersects_mems_allowed(p, nodemask))
> + return true;
> +
> + return false;
> +}
> +

I came back to this patch while reviewing 7/11.
Why didn't you check p->signal->oom_adj == OOM_DISABLE here?
I haven't fully worked out what the code looks like after all your
patches are applied, but I think the check belongs in this function,
as the function's name suggests.


--
Kind regards,
Minchan Kim

2010-06-30 14:20:47

by Minchan Kim

Subject: Re: [PATCH 07/11] oom: move OOM_DISABLE check from oom_kill_task to out_of_memory()

On Wed, Jun 30, 2010 at 06:31:36PM +0900, KOSAKI Motohiro wrote:
> Currently, if oom_kill_allocating_task is enabled and current has
> OOM_DISABLE set, the following printk in oom_kill_process() is
> emitted twice:
>
> pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
> message, task_pid_nr(p), p->comm, points);
>
> So the OOM_DISABLE check should happen earlier.

If we check it in oom_unkillable_task, we don't need this patch.

--
Kind regards,
Minchan Kim

2010-06-30 14:24:39

by Minchan Kim

Subject: Re: [PATCH 02/11] oom: oom_kill_process() doesn't select kthread child

On Wed, Jun 30, 2010 at 06:27:52PM +0900, KOSAKI Motohiro wrote:
> select_bad_process() has a PF_KTHREAD check, but oom_kill_process()
> doesn't. That means oom_kill_process() may choose the wrong task,
> especially when a child is a kernel thread using use_mm().

Is it possible for a child to be a kthread even though its parent isn't?

--
Kind regards,
Minchan Kim

2010-06-30 14:30:26

by Minchan Kim

Subject: Re: [PATCH 09/11] oom: remove child->mm check from oom_kill_process()

On Wed, Jun 30, 2010 at 06:32:44PM +0900, KOSAKI Motohiro wrote:
>
> Current "child->mm == p->mm" mean prevent to select vfork() task.
> But we don't have any reason to don't consider vfork().

I guess "child->mm == p->mm" is for losing the minimal amount of work done
as comment say. But frankly speaking, I don't understand it, either.
Maybe "One shot, two kill" problem?

Andrea. Could you explain it?

--
Kind regards,
Minchan Kim

2010-06-30 14:40:23

by Minchan Kim

Subject: Re: [PATCH 10/11] oom: give the dying task a higher priority

On Wed, Jun 30, 2010 at 06:35:08PM +0900, KOSAKI Motohiro wrote:
>
> Sorry, I forgot to cc Luis. Resending.
>
>
> (intentional full quote)
>
> > From: Luis Claudio R. Goncalves <[email protected]>
> >
> > In a system under heavy load it was observed that even after the
> > oom-killer selects a task to die, the task may take a long time to die.
> >
> > Right after sending a SIGKILL to the task selected by the oom-killer,
> > this task has its priority increased so that it can exit() soon,
> > freeing memory. That is accomplished by:
> >
> > /*
> > * We give our sacrificial lamb high priority and access to
> > * all the memory it needs. That way it should be able to
> > * exit() and clear out its resources quickly...
> > */
> > p->rt.time_slice = HZ;
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > It sounds plausible to give the dying task an even higher priority, to
> > make sure it is scheduled sooner and frees the desired memory. It was
> > suggested on LKML to use SCHED_FIFO:1, the lowest RT priority, so that
> > this task won't interfere with any running RT task.
> >
> > If the dying task is already an RT task, it is left untouched.
> > Another good suggestion, implemented here, was to avoid boosting the
> > dying task's priority in the case of a mem_cgroup OOM.
> >
> > Signed-off-by: Luis Claudio R. Goncalves <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > Signed-off-by: KOSAKI Motohiro <[email protected]>

Reviewed-by: Minchan Kim <[email protected]>

The code itself seems fine, so I give my Reviewed-by.
But this patch might break the fairness of normal processes in corner
cases. If keeping the system working is more important than fairness
between processes, it does make sense. But the scheduler folks might
have a different opinion.

So at the least, we need ACKs from the scheduler folks.
Cc'ed Ingo, Peter, Thomas.

> > ---
> > mm/oom_kill.c | 34 +++++++++++++++++++++++++++++++---
> > 1 files changed, 31 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index b5678bf..0858b18 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -82,6 +82,24 @@ static bool has_intersects_mems_allowed(struct task_struct *tsk,
> > #endif /* CONFIG_NUMA */
> >
> > /*
> > + * If this is a system OOM (not a memcg OOM) and the task selected to be
> > + * killed is not already running at high (RT) priorities, speed up the
> > + * recovery by boosting the dying task to the lowest FIFO priority.
> > + * That helps with the recovery and avoids interfering with RT tasks.
> > + */
> > +static void boost_dying_task_prio(struct task_struct *p,
> > + struct mem_cgroup *mem)
> > +{
> > + struct sched_param param = { .sched_priority = 1 };
> > +
> > + if (mem)
> > + return;
> > +
> > + if (!rt_task(p))
> > + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> > +}
> > +
> > +/*
> > * The process p may have detached its own ->mm while exiting or through
> > * use_mm(), but one or more of its subthreads may still have a valid
> > * pointer. Return p, or any of its subthreads with a valid ->mm, with
> > @@ -421,7 +439,7 @@ static void dump_header(struct task_struct *p, gfp_t gfp_mask, int order,
> > }
> >
> > #define K(x) ((x) << (PAGE_SHIFT-10))
> > -static int oom_kill_task(struct task_struct *p)
> > +static int oom_kill_task(struct task_struct *p, struct mem_cgroup *mem)
> > {
> > p = find_lock_task_mm(p);
> > if (!p) {
> > @@ -434,9 +452,17 @@ static int oom_kill_task(struct task_struct *p)
> > K(get_mm_counter(p->mm, MM_FILEPAGES)));
> > task_unlock(p);
> >
> > - p->rt.time_slice = HZ;
> > +
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > force_sig(SIGKILL, p);
> > +
> > + /*
> > + * We give our sacrificial lamb high priority and access to
> > + * all the memory it needs. That way it should be able to
> > + * exit() and clear out its resources quickly...
> > + */
> > + boost_dying_task_prio(p, mem);
> > +
> > return 0;
> > }
> > #undef K
> > @@ -460,6 +486,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > */
> > if (p->flags & PF_EXITING) {
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > + boost_dying_task_prio(p, mem);
> > return 0;
> > }
> >
> > @@ -489,7 +516,7 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > }
> > } while_each_thread(p, t);
> >
> > - return oom_kill_task(victim);
> > + return oom_kill_task(victim, mem);
> > }
> >
> > /*
> > @@ -670,6 +697,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > */
> > if (fatal_signal_pending(current)) {
> > set_thread_flag(TIF_MEMDIE);
> > + boost_dying_task_prio(current, NULL);
> > return;
> > }
> >
> > --
> > 1.6.5.2
> >
> >
> >
>
>
>

--
Kind regards,
Minchan Kim

2010-07-01 00:07:15

by KOSAKI Motohiro

Subject: Re: [PATCH 03/11] oom: make oom_unkillable_task() helper function

> On Wed, Jun 30, 2010 at 06:28:37PM +0900, KOSAKI Motohiro wrote:
> > We now have the same task checks in two places. Unify them.
> >
> > Reviewed-by: Minchan Kim <[email protected]>
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > ---
> > mm/oom_kill.c | 33 ++++++++++++++++++++++-----------
> > 1 files changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index dc8589e..a4a5439 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -101,6 +101,26 @@ static struct task_struct *find_lock_task_mm(struct task_struct *p)
> > return NULL;
> > }
> >
> > +/* return true if the task is not adequate as candidate victim task. */
> > +static bool oom_unkillable_task(struct task_struct *p, struct mem_cgroup *mem,
> > + const nodemask_t *nodemask)
> > +{
> > + if (is_global_init(p))
> > + return true;
> > + if (p->flags & PF_KTHREAD)
> > + return true;
> > +
> > + /* When mem_cgroup_out_of_memory() and p is not member of the group */
> > + if (mem && !task_in_mem_cgroup(p, mem))
> > + return true;
> > +
> > + /* p may not have freeable memory in nodemask */
> > + if (!has_intersects_mems_allowed(p, nodemask))
> > + return true;
> > +
> > + return false;
> > +}
> > +
>
> I came back to this patch while reviewing 7/11.
> Why didn't you check p->signal->oom_adj == OOM_DISABLE here?
> I haven't fully worked out what the code looks like after all your
> patches are applied, but I think the check belongs in this function,
> as the function's name suggests.

That was to preserve select_bad_process()'s semantics. It has:

	for_each_process(p) {
		if (oom_unkillable_task(p, mem, nodemask))
			continue;

		if (thread_group_empty(p) && (p->flags & PF_EXITING) && p->mm) {
			if (p != current)
				return ERR_PTR(-1UL);

			chosen = p;
			*ppoints = ULONG_MAX;
		}

		if (oom_adj == OOM_DISABLE)
			continue;

That is, the current OOM killer intends to pick a PF_EXITING process
even if it has OOM_DISABLE. (Practically, it doesn't kill it; it only
gives the allocation bonus (TIF_MEMDIE) to the PF_EXITING process.)

My series of trivial fixes doesn't intend to make large semantic
changes.


2010-07-01 00:07:13

by KOSAKI Motohiro

Subject: Re: [PATCH 02/11] oom: oom_kill_process() doesn't select kthread child

> On Wed, Jun 30, 2010 at 06:27:52PM +0900, KOSAKI Motohiro wrote:
> > select_bad_process() has a PF_KTHREAD check, but oom_kill_process()
> > doesn't. That means oom_kill_process() may choose the wrong task,
> > especially when a child is a kernel thread using use_mm().
>
> Is it possible for a child to be a kthread even though its parent isn't?

It usually doesn't happen, but a crappy driver can do any strange
thing freely. As I said, the OOM code should make conservative
assumptions as far as possible.
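
For example, nothing stops a driver from doing something like this
(a hypothetical sketch, not code from any real driver; error handling
omitted):

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/mmu_context.h>

static int crappy_worker(void *data)
{
	struct mm_struct *mm = data;	/* reference handed over by the ioctl */

	use_mm(mm);	/* this kernel thread now has ->mm != NULL */
	/* ... touch user mappings on behalf of the caller ... */
	unuse_mm(mm);
	mmput(mm);
	return 0;
}

/* called from an ioctl, i.e. in the user process's context */
static long crappy_ioctl_start(void)
{
	struct mm_struct *mm = get_task_mm(current);

	if (!mm)
		return -EINVAL;
	/* the new kernel thread becomes a child of current */
	kernel_thread(crappy_worker, mm, CLONE_FS | CLONE_FILES);
	return 0;
}

Such a thread is a PF_KTHREAD child of a user process, and while it
holds the borrowed ->mm, badness() gives it a positive score.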


2010-07-01 00:07:09

by KOSAKI Motohiro

Subject: Re: [PATCH 07/11] oom: move OOM_DISABLE check from oom_kill_task to out_of_memory()

> On Wed, Jun 30, 2010 at 06:31:36PM +0900, KOSAKI Motohiro wrote:
> > Currently, if oom_kill_allocating_task is enabled and current has
> > OOM_DISABLE set, the following printk in oom_kill_process() is
> > emitted twice:
> >
> > pr_err("%s: Kill process %d (%s) score %lu or sacrifice child\n",
> > message, task_pid_nr(p), p->comm, points);
> >
> > So the OOM_DISABLE check should happen earlier.
>
> If we check it in oom_unkillable_task, we don't need this patch.

Yup, but please read my comment on [3/11].


2010-07-01 00:07:11

by KOSAKI Motohiro

Subject: Re: [PATCH 05/11] oom: /proc/<pid>/oom_score treat kernel thread honestly

> On Wed, Jun 30, 2010 at 06:30:19PM +0900, KOSAKI Motohiro wrote:
> > If a kernel thread is using use_mm(), badness() returns a positive
> > value. This is not a big issue in itself because the in-kernel
> > callers handle it correctly, but there is one exception:
> > /proc/<pid>/oom_score calls badness() directly and doesn't check
> > whether the task is a regular process.
> >
> > As another example, /proc/1/oom_score returns a non-zero value even
> > though init is unkillable. This inconsistency is a bit confusing
> > for admins.
>
> Hmm. If it is really a problem, could we solve it in proc_oom_score itself?

Probably not a good idea. From a maintenance point of view, all
OOM-related code should be gathered in oom_kill.c.
If you dislike adding this mess to badness(), I'd rather make a
badness_for_oom_score() helper or something like it instead.
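
Roughly like the following (an untested sketch of the idea, assuming
badness() keeps its current signature):

/* for /proc/<pid>/oom_score only; badness() itself stays unchanged */
unsigned long badness_for_oom_score(struct task_struct *p,
				    unsigned long uptime)
{
	if (oom_unkillable_task(p, NULL, NULL))
		return 0;
	return badness(p, uptime);
}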


2010-07-01 13:38:56

by Minchan Kim

Subject: Re: [PATCH 02/11] oom: oom_kill_process() doesn't select kthread child

On Thu, Jul 01, 2010 at 09:07:02AM +0900, KOSAKI Motohiro wrote:
> > On Wed, Jun 30, 2010 at 06:27:52PM +0900, KOSAKI Motohiro wrote:
> > > select_bad_process() has a PF_KTHREAD check, but oom_kill_process()
> > > doesn't. That means oom_kill_process() may choose the wrong task,
> > > especially when a child is a kernel thread using use_mm().
> >
> > Is it possible for a child to be a kthread even though its parent isn't?
>
> It usually doesn't happen, but a crappy driver can do any strange
> thing freely. As I said, the OOM code should make conservative
> assumptions as far as possible.

Okay. You replace this check with oom_unkillable_task() later in the
series. oom_unkillable_task() is a generic function, so the kthread
check in oom_kill_process() is trivial, I think.

Reviewed-by: Minchan Kim <[email protected]>


>
>
>

--
Kind regards,
Minchan Kim

2010-07-01 14:36:17

by Minchan Kim

Subject: Re: [PATCH 05/11] oom: /proc/<pid>/oom_score treat kernel thread honestly

On Thu, Jul 01, 2010 at 09:07:02AM +0900, KOSAKI Motohiro wrote:
> > On Wed, Jun 30, 2010 at 06:30:19PM +0900, KOSAKI Motohiro wrote:
> > > If a kernel thread is using use_mm(), badness() returns a positive
> > > value. This is not a big issue in itself because the in-kernel
> > > callers handle it correctly, but there is one exception:
> > > /proc/<pid>/oom_score calls badness() directly and doesn't check
> > > whether the task is a regular process.
> > >
> > > As another example, /proc/1/oom_score returns a non-zero value
> > > even though init is unkillable. This inconsistency is a bit
> > > confusing for admins.
> >
> > Hmm. If it is really a problem, could we solve it in proc_oom_score itself?
>
> Probably not a good idea. From a maintenance point of view, all
> OOM-related code should be gathered in oom_kill.c.
> If you dislike adding this mess to badness(), I'd rather make a
> badness_for_oom_score() helper

I am looking forward to seeing your next series.
Thanks, Kosaki.

P.S.)
I think that when a patch series grows beyond about 10 patches, it
would be better to include, or point to a URL for, an all-at-once
patch along with the series.

In the case of your series, later patches change earlier ones, which
can make review hard unless the reviewer merges the patches into a
tree to see the final result.

--
Kind regards,
Minchan Kim

2010-07-02 21:50:14

by Andrew Morton

Subject: Re: [PATCH 10/11] oom: give the dying task a higher priority

On Wed, 30 Jun 2010 18:33:23 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> +static void boost_dying_task_prio(struct task_struct *p,
> + struct mem_cgroup *mem)
> +{
> + struct sched_param param = { .sched_priority = 1 };
> +
> + if (mem)
> + return;
> +
> + if (!rt_task(p))
> + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> +}

We can actually make `param' static here. That saves a teeny bit of
code and a little bit of stack. The oom-killer can be called when
we're using a lot of stack.

But if we make that change we really should make the param arg to
sched_setscheduler_nocheck() be const. I did that (and was able to
convert lots of callers to use a static `param') but to complete the
job we'd need to chase through all the security goop, fixing up
security_task_setscheduler() and callees, and I got bored.


include/linux/sched.h | 2 +-
kernel/kthread.c | 2 +-
kernel/sched.c | 4 ++--
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/workqueue.c | 2 +-
6 files changed, 9 insertions(+), 7 deletions(-)

diff -puN kernel/kthread.c~a kernel/kthread.c
--- a/kernel/kthread.c~a
+++ a/kernel/kthread.c
@@ -131,7 +131,7 @@ struct task_struct *kthread_create(int (
wait_for_completion(&create.done);

if (!IS_ERR(create.result)) {
- struct sched_param param = { .sched_priority = 0 };
+ static struct sched_param param = { .sched_priority = 0 };
va_list args;

va_start(args, namefmt);
diff -puN kernel/workqueue.c~a kernel/workqueue.c
--- a/kernel/workqueue.c~a
+++ a/kernel/workqueue.c
@@ -962,7 +962,7 @@ init_cpu_workqueue(struct workqueue_stru

static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct workqueue_struct *wq = cwq->wq;
const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;
diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
--- a/kernel/stop_machine.c~a
+++ a/kernel/stop_machine.c
@@ -291,7 +291,7 @@ repeat:
static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
unsigned int cpu = (unsigned long)hcpu;
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
struct task_struct *p;
diff -puN kernel/sched.c~a kernel/sched.c
--- a/kernel/sched.c~a
+++ a/kernel/sched.c
@@ -4570,7 +4570,7 @@ static bool check_same_owner(struct task
}

static int __sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param, bool user)
+ const struct sched_param *param, bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
@@ -4734,7 +4734,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
* but our caller might not have that capability.
*/
int sched_setscheduler_nocheck(struct task_struct *p, int policy,
- struct sched_param *param)
+ const struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, false);
}
diff -puN kernel/softirq.c~a kernel/softirq.c
--- a/kernel/softirq.c~a
+++ a/kernel/softirq.c
@@ -827,7 +827,9 @@ static int __cpuinit cpu_callback(struct
cpumask_any(cpu_online_mask));
case CPU_DEAD:
case CPU_DEAD_FROZEN: {
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = {
+ .sched_priority = MAX_RT_PRIO-1,
+ };

p = per_cpu(ksoftirqd, hotcpu);
per_cpu(ksoftirqd, hotcpu) = NULL;
diff -puN include/linux/sched.h~a include/linux/sched.h
--- a/include/linux/sched.h~a
+++ a/include/linux/sched.h
@@ -1924,7 +1924,7 @@ extern int task_curr(const struct task_s
extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int,
- struct sched_param *);
+ const struct sched_param *);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
_

2010-07-06 00:49:11

by KOSAKI Motohiro

Subject: Re: [PATCH 10/11] oom: give the dying task a higher priority

(cc to James)

> On Wed, 30 Jun 2010 18:33:23 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > +static void boost_dying_task_prio(struct task_struct *p,
> > + struct mem_cgroup *mem)
> > +{
> > + struct sched_param param = { .sched_priority = 1 };
> > +
> > + if (mem)
> > + return;
> > +
> > + if (!rt_task(p))
> > + sched_setscheduler_nocheck(p, SCHED_FIFO, &param);
> > +}
>
> We can actually make `param' static here. That saves a teeny bit of
> code and a little bit of stack. The oom-killer can be called when
> we're using a lot of stack.
>
> But if we make that change we really should make the param arg to
> sched_setscheduler_nocheck() be const. I did that (and was able to
> convert lots of callers to use a static `param') but to complete the
> job we'd need to chase through all the security goop, fixing up
> security_task_setscheduler() and callees, and I got bored.

OK, I've finished this work. I made two patches, one for the security
tree and one for core. The diffstat is below.

I'll post the patches as replies to this mail.


KOSAKI Motohiro (2):
security: add const to security_task_setscheduler()
sched: make sched_param argument static variables in some
sched_setscheduler() callers

include/linux/sched.h | 5 +++--
include/linux/security.h | 9 +++++----
kernel/irq/manage.c | 4 +++-
kernel/kthread.c | 2 +-
kernel/sched.c | 6 +++---
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/trace/trace_selftest.c | 2 +-
kernel/watchdog.c | 2 +-
kernel/workqueue.c | 2 +-
security/commoncap.c | 2 +-
security/security.c | 4 ++--
security/selinux/hooks.c | 3 ++-
security/smack/smack_lsm.c | 2 +-
14 files changed, 28 insertions(+), 21 deletions(-)

- kosaki


>
>
> include/linux/sched.h | 2 +-
> kernel/kthread.c | 2 +-
> kernel/sched.c | 4 ++--
> kernel/softirq.c | 4 +++-
> kernel/stop_machine.c | 2 +-
> kernel/workqueue.c | 2 +-
> 6 files changed, 9 insertions(+), 7 deletions(-)
>
> diff -puN kernel/kthread.c~a kernel/kthread.c
> --- a/kernel/kthread.c~a
> +++ a/kernel/kthread.c
> @@ -131,7 +131,7 @@ struct task_struct *kthread_create(int (
> wait_for_completion(&create.done);
>
> if (!IS_ERR(create.result)) {
> - struct sched_param param = { .sched_priority = 0 };
> + static struct sched_param param = { .sched_priority = 0 };
> va_list args;
>
> va_start(args, namefmt);
> diff -puN kernel/workqueue.c~a kernel/workqueue.c
> --- a/kernel/workqueue.c~a
> +++ a/kernel/workqueue.c
> @@ -962,7 +962,7 @@ init_cpu_workqueue(struct workqueue_stru
>
> static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
> {
> - struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> + static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> struct workqueue_struct *wq = cwq->wq;
> const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
> struct task_struct *p;
> diff -puN kernel/stop_machine.c~a kernel/stop_machine.c
> --- a/kernel/stop_machine.c~a
> +++ a/kernel/stop_machine.c
> @@ -291,7 +291,7 @@ repeat:
> static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
> unsigned long action, void *hcpu)
> {
> - struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
> + static struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
> unsigned int cpu = (unsigned long)hcpu;
> struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
> struct task_struct *p;
> diff -puN kernel/sched.c~a kernel/sched.c
> --- a/kernel/sched.c~a
> +++ a/kernel/sched.c
> @@ -4570,7 +4570,7 @@ static bool check_same_owner(struct task
> }
>
> static int __sched_setscheduler(struct task_struct *p, int policy,
> - struct sched_param *param, bool user)
> + const struct sched_param *param, bool user)
> {
> int retval, oldprio, oldpolicy = -1, on_rq, running;
> unsigned long flags;
> @@ -4734,7 +4734,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
> * but our caller might not have that capability.
> */
> int sched_setscheduler_nocheck(struct task_struct *p, int policy,
> - struct sched_param *param)
> + const struct sched_param *param)
> {
> return __sched_setscheduler(p, policy, param, false);
> }
> diff -puN kernel/softirq.c~a kernel/softirq.c
> --- a/kernel/softirq.c~a
> +++ a/kernel/softirq.c
> @@ -827,7 +827,9 @@ static int __cpuinit cpu_callback(struct
> cpumask_any(cpu_online_mask));
> case CPU_DEAD:
> case CPU_DEAD_FROZEN: {
> - struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
> + static struct sched_param param = {
> + .sched_priority = MAX_RT_PRIO-1,
> + };
>
> p = per_cpu(ksoftirqd, hotcpu);
> per_cpu(ksoftirqd, hotcpu) = NULL;
> diff -puN include/linux/sched.h~a include/linux/sched.h
> --- a/include/linux/sched.h~a
> +++ a/include/linux/sched.h
> @@ -1924,7 +1924,7 @@ extern int task_curr(const struct task_s
> extern int idle_cpu(int cpu);
> extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
> extern int sched_setscheduler_nocheck(struct task_struct *, int,
> - struct sched_param *);
> + const struct sched_param *);
> extern struct task_struct *idle_task(int cpu);
> extern struct task_struct *curr_task(int cpu);
> extern void set_curr_task(int cpu, struct task_struct *p);
> _
>


2010-07-06 00:50:19

by KOSAKI Motohiro

Subject: [PATCH 1/2] security: add const to security_task_setscheduler()

No security module should change the sched_param parameter of
security_task_setscheduler(). Doing so is not only meaningless, but
can also cause harmful results if the caller passes a static variable.

This patch adds const to it.

Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/security.h | 9 +++++----
security/commoncap.c | 2 +-
security/security.c | 4 ++--
security/selinux/hooks.c | 3 ++-
security/smack/smack_lsm.c | 2 +-
5 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/include/linux/security.h b/include/linux/security.h
index 5bcb395..07e94e5 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -74,7 +74,8 @@ extern int cap_file_mmap(struct file *file, unsigned long reqprot,
extern int cap_task_fix_setuid(struct cred *new, const struct cred *old, int flags);
extern int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
unsigned long arg4, unsigned long arg5);
-extern int cap_task_setscheduler(struct task_struct *p, int policy, struct sched_param *lp);
+extern int cap_task_setscheduler(struct task_struct *p, int policy,
+ const struct sched_param *lp);
extern int cap_task_setioprio(struct task_struct *p, int ioprio);
extern int cap_task_setnice(struct task_struct *p, int nice);
extern int cap_syslog(int type, bool from_file);
@@ -1501,7 +1502,7 @@ struct security_operations {
int (*task_getioprio) (struct task_struct *p);
int (*task_setrlimit) (unsigned int resource, struct rlimit *new_rlim);
int (*task_setscheduler) (struct task_struct *p, int policy,
- struct sched_param *lp);
+ const struct sched_param *lp);
int (*task_getscheduler) (struct task_struct *p);
int (*task_movememory) (struct task_struct *p);
int (*task_kill) (struct task_struct *p,
@@ -1750,8 +1751,8 @@ int security_task_setnice(struct task_struct *p, int nice);
int security_task_setioprio(struct task_struct *p, int ioprio);
int security_task_getioprio(struct task_struct *p);
int security_task_setrlimit(unsigned int resource, struct rlimit *new_rlim);
-int security_task_setscheduler(struct task_struct *p,
- int policy, struct sched_param *lp);
+int security_task_setscheduler(struct task_struct *p, int policy,
+ const struct sched_param *lp);
int security_task_getscheduler(struct task_struct *p);
int security_task_movememory(struct task_struct *p);
int security_task_kill(struct task_struct *p, struct siginfo *info,
diff --git a/security/commoncap.c b/security/commoncap.c
index 4e01599..b74d460 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -726,7 +726,7 @@ static int cap_safe_nice(struct task_struct *p)
* specified task, returning 0 if permission is granted, -ve if denied.
*/
int cap_task_setscheduler(struct task_struct *p, int policy,
- struct sched_param *lp)
+ const struct sched_param *lp)
{
return cap_safe_nice(p);
}
diff --git a/security/security.c b/security/security.c
index 7461b1b..6151322 100644
--- a/security/security.c
+++ b/security/security.c
@@ -785,8 +785,8 @@ int security_task_setrlimit(unsigned int resource, struct rlimit *new_rlim)
return security_ops->task_setrlimit(resource, new_rlim);
}

-int security_task_setscheduler(struct task_struct *p,
- int policy, struct sched_param *lp)
+int security_task_setscheduler(struct task_struct *p, int policy,
+ const struct sched_param *lp)
{
return security_ops->task_setscheduler(p, policy, lp);
}
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 5c9f25b..dd136bd 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -3385,7 +3385,8 @@ static int selinux_task_setrlimit(unsigned int resource, struct rlimit *new_rlim
return 0;
}

-static int selinux_task_setscheduler(struct task_struct *p, int policy, struct sched_param *lp)
+static int selinux_task_setscheduler(struct task_struct *p, int policy,
+ const struct sched_param *lp)
{
int rc;

diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
index 07abc9c..c3336f1 100644
--- a/security/smack/smack_lsm.c
+++ b/security/smack/smack_lsm.c
@@ -1280,7 +1280,7 @@ static int smack_task_getioprio(struct task_struct *p)
* Return 0 if read access is permitted
*/
static int smack_task_setscheduler(struct task_struct *p, int policy,
- struct sched_param *lp)
+ const struct sched_param *lp)
{
int rc;

--
1.6.5.2


2010-07-06 00:51:39

by KOSAKI Motohiro

Subject: [PATCH 2/2] sched: make sched_param argument static variables in some sched_setscheduler() callers

Andrew Morton pointed out that almost all sched_setscheduler() callers
use a fixed parameter that can be converted to static. This reduces
runtime memory waste a bit.

Reported-by: Andrew Morton <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
include/linux/sched.h | 5 +++--
kernel/irq/manage.c | 4 +++-
kernel/kthread.c | 2 +-
kernel/sched.c | 6 +++---
kernel/softirq.c | 4 +++-
kernel/stop_machine.c | 2 +-
kernel/trace/trace_selftest.c | 2 +-
kernel/watchdog.c | 2 +-
kernel/workqueue.c | 2 +-
9 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cfcd13e..97b7fe3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1917,9 +1917,10 @@ extern int task_nice(const struct task_struct *p);
extern int can_nice(const struct task_struct *p, const int nice);
extern int task_curr(const struct task_struct *p);
extern int idle_cpu(int cpu);
-extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
+extern int sched_setscheduler(struct task_struct *, int,
+ const struct sched_param *);
extern int sched_setscheduler_nocheck(struct task_struct *, int,
- struct sched_param *);
+ const struct sched_param *);
extern struct task_struct *idle_task(int cpu);
extern struct task_struct *curr_task(int cpu);
extern void set_curr_task(int cpu, struct task_struct *p);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 3164ba7..ce1dbb8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -569,7 +569,9 @@ irq_thread_check_affinity(struct irq_desc *desc, struct irqaction *action) { }
*/
static int irq_thread(void *data)
{
- struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, };
+ static struct sched_param param = {
+ .sched_priority = MAX_USER_RT_PRIO/2,
+ };
struct irqaction *action = data;
struct irq_desc *desc = irq_to_desc(action->irq);
int wake, oneshot = desc->status & IRQ_ONESHOT;
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 83911c7..3b55a26 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -131,7 +131,7 @@ struct task_struct *kthread_create(int (*threadfn)(void *data),
wait_for_completion(&create.done);

if (!IS_ERR(create.result)) {
- struct sched_param param = { .sched_priority = 0 };
+ static struct sched_param param = { .sched_priority = 0 };
va_list args;

va_start(args, namefmt);
diff --git a/kernel/sched.c b/kernel/sched.c
index 18faf4d..8f2f4b4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4396,7 +4396,7 @@ static bool check_same_owner(struct task_struct *p)
}

static int __sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param, bool user)
+ const struct sched_param *param, bool user)
{
int retval, oldprio, oldpolicy = -1, on_rq, running;
unsigned long flags;
@@ -4540,7 +4540,7 @@ recheck:
* NOTE that the task may be already dead.
*/
int sched_setscheduler(struct task_struct *p, int policy,
- struct sched_param *param)
+ const struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, true);
}
@@ -4558,7 +4558,7 @@ EXPORT_SYMBOL_GPL(sched_setscheduler);
* but our caller might not have that capability.
*/
int sched_setscheduler_nocheck(struct task_struct *p, int policy,
- struct sched_param *param)
+ const struct sched_param *param)
{
return __sched_setscheduler(p, policy, param, false);
}
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 07b4f1b..be3ec11 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -827,7 +827,9 @@ static int __cpuinit cpu_callback(struct notifier_block *nfb,
cpumask_any(cpu_online_mask));
case CPU_DEAD:
case CPU_DEAD_FROZEN: {
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = {
+ .sched_priority = MAX_RT_PRIO-1
+ };

p = per_cpu(ksoftirqd, hotcpu);
per_cpu(ksoftirqd, hotcpu) = NULL;
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 70f8d90..da9bb42 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -291,7 +291,7 @@ repeat:
static int __cpuinit cpu_stop_cpu_callback(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO - 1 };
unsigned int cpu = (unsigned long)hcpu;
struct cpu_stopper *stopper = &per_cpu(cpu_stopper, cpu);
struct task_struct *p;
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 250e7f9..d2ebe6e 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -560,7 +560,7 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
static int trace_wakeup_test_thread(void *data)
{
/* Make this a RT thread, doesn't need to be too high */
- struct sched_param param = { .sched_priority = 5 };
+ static struct sched_param param = { .sched_priority = 5 };
struct completion *x = data;

sched_setscheduler(current, SCHED_FIFO, &param);
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 91b0b26..95d8463 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -310,7 +310,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
*/
static int watchdog(void *unused)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);

sched_setscheduler(current, SCHED_FIFO, &param);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 327d2de..036939a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -947,7 +947,7 @@ init_cpu_workqueue(struct workqueue_struct *wq, int cpu)

static int create_workqueue_thread(struct cpu_workqueue_struct *cwq, int cpu)
{
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+ static struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
struct workqueue_struct *wq = cwq->wq;
const char *fmt = is_wq_single_threaded(wq) ? "%s" : "%s/%d";
struct task_struct *p;
--
1.6.5.2


2010-07-06 22:14:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched: make sched_param argument a static variable in some sched_setscheduler() callers

On Tue, 2010-07-06 at 09:51 +0900, KOSAKI Motohiro wrote:
> Andrew Morton pointed out that almost all sched_setscheduler() callers
> use a fixed parameter, so it can be made static. This reduces runtime
> memory waste a bit.

We are replacing runtime waste with permanent waste?

>
> Reported-by: Andrew Morton <[email protected]>
> Signed-off-by: KOSAKI Motohiro <[email protected]>



> --- a/kernel/trace/trace_selftest.c
> +++ b/kernel/trace/trace_selftest.c
> @@ -560,7 +560,7 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
> static int trace_wakeup_test_thread(void *data)
> {
> /* Make this a RT thread, doesn't need to be too high */
> - struct sched_param param = { .sched_priority = 5 };
> + static struct sched_param param = { .sched_priority = 5 };
> struct completion *x = data;
>

This is a thread that runs on boot up to test the sched_wakeup tracer.
Then it is deleted and all memory is reclaimed.

Thus, this patch just took memory that was usable at run time and
removed it permanently.

Please Cc me on all tracing changes.

-- Steve

2010-07-06 23:13:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched: make sched_param argument a static variable in some sched_setscheduler() callers

On Tue, 06 Jul 2010 18:13:58 -0400 Steven Rostedt <[email protected]> wrote:

> On Tue, 2010-07-06 at 09:51 +0900, KOSAKI Motohiro wrote:
> > Andrew Morton pointed out that almost all sched_setscheduler() callers
> > use a fixed parameter, so it can be made static. This reduces runtime
> > memory waste a bit.
>
> We are replacing runtime waste with permanent waste?

Confused. kernel/trace/ appears to waste resources by design, so what's
the issue?

I don't think this change will cause more waste. It'll consume 4 bytes
of .data and will save a little more .text.

> >
> > Reported-by: Andrew Morton <[email protected]>
> > Signed-off-by: KOSAKI Motohiro <[email protected]>
>
>
>
> > --- a/kernel/trace/trace_selftest.c
> > +++ b/kernel/trace/trace_selftest.c
> > @@ -560,7 +560,7 @@ trace_selftest_startup_nop(struct tracer *trace, struct trace_array *tr)
> > static int trace_wakeup_test_thread(void *data)
> > {
> > /* Make this a RT thread, doesn't need to be too high */
> > - struct sched_param param = { .sched_priority = 5 };
> > + static struct sched_param param = { .sched_priority = 5 };
> > struct completion *x = data;
> >
>
> This is a thread that runs on boot up to test the sched_wakeup tracer.
> Then it is deleted and all memory is reclaimed.
>
> Thus, this patch just took memory that was usable at run time and
> removed it permanently.
>
> Please Cc me on all tracing changes.

Well if we're so worried about resource wastage then how about making
all boot-time-only text and data reside in __init and __initdata
sections rather than hanging around uselessly in memory for ever?
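
For reference, a minimal sketch of the annotations being described here
(the names are hypothetical); both the data and the function are freed
along with the rest of init memory once boot completes:

	#include <linux/init.h>
	#include <linux/sched.h>

	/* Placed in .init.data; discarded after boot. */
	static struct sched_param boot_param __initdata = {
		.sched_priority = 5,
	};

	/* Placed in .init.text; must not be called after boot. */
	static void __init boot_only_setup(struct task_struct *p)
	{
		sched_setscheduler(p, SCHED_FIFO, &boot_param);
	}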

Only that's going to be hard because we went and added pointers into
.init.text from .data due to `struct tracer.selftest', which will cause
a storm of section mismatch warnings. Doh, should have invoked the
selftests from initcalls. That might open up the possibility of running
the selftests by modprobing the selftest module, too.

And I _do_ wish the selftest module was modprobeable, rather than this
monstrosity:

#ifdef CONFIG_FTRACE_SELFTEST
/* Let selftest have access to static functions in this file */
#include "trace_selftest.c"
#endif

Really? Who had a tastebudectomy over there? At least call it
trace_selftest.inc or something, so poor schmucks don't go scrabbling
around wondering "how the hell does this thing get built oh no they
didn't really go and #include it did they?"

2010-07-06 23:49:52

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched: make sched_param argument a static variable in some sched_setscheduler() callers

On Tue, 2010-07-06 at 16:12 -0700, Andrew Morton wrote:

> Well if we're so worried about resource wastage then how about making
> all boot-time-only text and data reside in __init and __initdata
> sections rather than hanging around uselessly in memory for ever?

That would be a patch I would like :-)

I could probably do that when I get some time.

>
> Only that's going to be hard because we went and added pointers into
> .init.text from .data due to `struct tracer.selftest', which will cause
> a storm of section mismatch warnings. Doh, should have invoked the
> selftests from initcalls. That might open the opportunity of running
> the selftests by modprobing the selftest module, too.

They are called by initcalls. The initcalls register the tracers and
that is the time we call the selftest. No other time.

Is there a way to set up a function pointer that lets the section
checks know it is only called at bootup?

>
> And I _do_ wish the selftest module was modprobeable, rather than this
> monstrosity:

The selftests are done by individual tracers at boot up. It would be
hard to modprobe them at that time.


> #ifdef CONFIG_FTRACE_SELFTEST
> /* Let selftest have access to static functions in this file */
> #include "trace_selftest.c"
> #endif
>
> Really? Who had a tastebudectomy over there? At least call it
> trace_selftest.inc or something, so poor schmucks don't go scrabbling
> around wondering "how the hell does this thing get built oh no they
> didn't really go and #include it did they?"


Well this is also the way sched.c adds all its extra code. Making it
trace_selftest.inc would make it hard to know what the hell it was. And
also hard for editors to know what type of file it is, or things can be
missed with a 'find . -name "*.[ch]" | xargs grep blahblah'

Yes, the self tests are ugly and could probably do with an overhaul. Since
we are trying to get away from the tracer plugins anyway, they will
start disappearing when the plugins do.

We should have some main selftests anyway: the TRACE_EVENT tests (which
are not even in the trace_selftest.c file), the function tests (which
currently are), as well as the latency testers.

The trace_selftest.c should eventually be replaced with more compact
tests for the specific types of tracing.

-- Steve

2010-07-07 00:03:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched: make sched_param argument a static variable in some sched_setscheduler() callers

On Tue, 06 Jul 2010 19:49:47 -0400 Steven Rostedt <[email protected]> wrote:

> On Tue, 2010-07-06 at 16:12 -0700, Andrew Morton wrote:
>
> > Well if we're so worried about resource wastage then how about making
> > all boot-time-only text and data reside in __init and __initdata
> > sections rather than hanging around uselessly in memory for ever?
>
> That would be a patch I would like :-)
>
> I could probably do that when I get some time.
>
> >
> > Only that's going to be hard because we went and added pointers into
> > .init.text from .data due to `struct tracer.selftest', which will cause
> > a storm of section mismatch warnings. Doh, should have invoked the
> > selftests from initcalls. That might open the opportunity of running
> > the selftests by modprobing the selftest module, too.
>
> They are called by initcalls. The initcalls register the tracers and
> that is the time we call the selftest. No other time.

It should all be __init!

> Is there a way that we set up a function pointer to let the section
> checks know that it is only called at bootup?

<sticks his nose in modpost.c for the first time>

There are various whitelisting hacks in there based on the name of the
offending symbol. Search term: DEFAULT_SYMBOL_WHITE_LIST.

It'd be cleaner to just zap the tracer.selftest field altogether and
run the tests from initcalls if possible?
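
A sketch of that shape, with hypothetical names throughout; the point
is that once the .selftest pointer in long-lived .data is gone, the
whole test path can live in .init.text without modpost complaining:

	#include <linux/init.h>

	/* Hypothetical stand-in for one tracer's test body. */
	static int __init run_my_tracer_tests(void)
	{
		return 0;
	}

	static int __init my_tracer_selftest(void)
	{
		/* Runs once from do_initcalls(), then is discarded. */
		return run_my_tracer_tests();
	}
	late_initcall(my_tracer_selftest);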

> >
> > And I _do_ wish the selftest module was modprobeable, rather than this
> > monstrosity:
>
> The selftests are done by individual tracers at boot up. It would be
> hard to modprobe them at that time.

No, if tracer_selftest.o was linked into vmlinux then the tests get run
within do_initcalls(). If tracer_selftest.o is a module, then the tests
get run at modprobe-time. The latter option may not be terribly useful
but it comes basically for free as a reward for doing stuff correctly.
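
The mechanics being relied on, sketched with a hypothetical entry
point: module_init() decays to an ordinary initcall when the object is
built into vmlinux, so the same function serves both configurations:

	#include <linux/module.h>

	static int __init trace_selftest_entry(void)
	{
		/*
		 * Built-in: invoked from do_initcalls() during boot.
		 * Module: invoked at modprobe time.
		 */
		return 0;	/* the tests would run here */
	}
	module_init(trace_selftest_entry);
	MODULE_LICENSE("GPL");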

>
> > #ifdef CONFIG_FTRACE_SELFTEST
> > /* Let selftest have access to static functions in this file */
> > #include "trace_selftest.c"
> > #endif
> >
> > Really? Who had a tastebudectomy over there? At least call it
> > trace_selftest.inc or something, so poor schmucks don't go scrabbling
> > around wondering "how the hell does this thing get built oh no they
> > didn't really go and #include it did they?"
>
>
> Well this is also the way sched.c adds all its extra code.

The sched.c hack sucks too.

> Making it
> trace_selftest.inc would make it hard to know what the hell it was.

trace_selftest.i_really_suck?

> And
> also hard for editors to know what type of file it is, or things can be
> missed with a 'find . -name "*.[ch]" | xargs grep blahblah'

Well bad luck. Of _course_ it makes a mess. It's already a mess.

How's about just removing the `static' from whichever symbols
tracer_selftest needs? That'd surely be better than #including a .c
file.

> Yes, the self tests are ugly and can probably go with an overhaul. Since
> we are trying to get away from the tracer plugins anyway, they will
> start disappearing when the plugins do.
>
> We should have some main selftests anyway. Those are for the TRACE_EVENT
> tests (which are not even in the trace_selftest.c file, and the function
> testing which currently are, as well as the latency testers.
>
> The trace_selftest.c should eventually be replaced with more compact
> tests for the specific types of tracing.

2010-07-07 19:44:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched: make sched_param argument a static variable in some sched_setscheduler() callers

On Tue, 2010-07-06 at 17:02 -0700, Andrew Morton wrote:
> > Well this is also the way sched.c adds all its extra code.
>
> The sched.c hack sucks too.

Agreed, moving things to kernel/sched/ and adding some internal.h thing
could cure that, but I simply haven't gotten around to cleaning that
up.