2013-03-30 13:15:39

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 0/3] posix_timers: A few expiry caching fixes

Hi,

This series is on top of latest -mmotm.

[1/3] fixes some omitting sample capture after a build warning suddenly became visible
[2/3] fixes a wrong timer_gettime() result on a timer whose target has been reaped.
[3/3] needs some more thinking probably

Thanks.

Frederic Weisbecker (3):
posix-timers: Correctly get dying task time sample in
posix_cpu_timer_schedule()
posix_timers: Fix racy timer delta caching on task exit
posix_timers: Remove dead task timer expiry caching

kernel/posix-cpu-timers.c | 68 ++-------------------------------------------
1 files changed, 3 insertions(+), 65 deletions(-)

--
1.7.5.4


2013-03-30 13:15:51

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 2/3] posix_timers: Fix racy timer delta caching on task exit

When a task exits, we perform a caching of the remaining
cputime delta before expiring of its timers.

This is done from the following places:

* When the task is reaped. We iterate through its list of
posix cpu timers and store the remaining timer delta to
the timer struct instead of the absolute value.
(See posix_cpu_timers_exit() / posix_cpu_timers_exit_group() )

* When we call posix_cpu_timer_get() or posix_cpu_timer_schedule().
If the timer's task is considered dying when watched from these
places, the same conversion from absolute to relative expiry time
is performed. Then the given task's reference is released.
(See clear_dead_task() ).

The relevance of this caching is questionable but this is another
and deeper debate.

The big issue here is that these two sources of caching don't mix
up very well together.

More specifically, the caching can easily be done twice, resulting
in a wrong delta as it gets spuriously substracted a second time by
the elapsed clock. This can happen in the following scenario:

The task exits and gets reaped: we call posix_cpu_timers_exit()
and the absolute timer expiry values are converted to a relative
delta.

timer_gettime() -> posix_cpu_timer_get() is called and relies on
clear_dead_task() because tsk->exit_state == EXIT_DEAD.
The delta gets substracted again by the elapsed clock and we return
a wrong result.

To fix this, just remove the caching done on task reaping time.
It doesn't bring much value on its own. The caching done from
posix_cpu_timer_get/schedule is enough.

And it would also be hard to get it really right: we could make it
put and clear the target task in the timer struct so that readers
know if they are dealing with a relative cached of absolute value.
But it would be racy. The only safe way to do it would be to lock
the itimer->it_lock so that we know nobody reads the cputime expiry
value while we modify it and its target task reference. Doing so
would involve some funny workarounds to avoid circular lock against
the sighand lock. There is just no reason to maintain this.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
kernel/posix-cpu-timers.c | 22 +++++++++++-----------
1 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index afd79a9..877439b 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -387,14 +387,8 @@ static void cleanup_timers_list(struct list_head *head,
{
struct cpu_timer_list *timer, *next;

- list_for_each_entry_safe(timer, next, head, entry) {
+ list_for_each_entry_safe(timer, next, head, entry)
list_del_init(&timer->entry);
- if (timer->expires < curr) {
- timer->expires = 0;
- } else {
- timer->expires -= curr;
- }
- }
}

/*
@@ -442,15 +436,21 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
}

-static void clear_dead_task(struct k_itimer *timer, unsigned long long now)
+static void clear_dead_task(struct k_itimer *itimer, unsigned long long now)
{
+ struct cpu_timer_list *timer = &itimer->it.cpu;
+
/*
* That's all for this thread or process.
* We leave our residual in expires to be reported.
*/
- put_task_struct(timer->it.cpu.task);
- timer->it.cpu.task = NULL;
- timer->it.cpu.expires -= now;
+ put_task_struct(timer->task);
+ timer->task = NULL;
+ if (timer->expires < now) {
+ timer->expires = 0;
+ } else {
+ timer->expires -= now;
+ }
}

static inline int expires_gt(cputime_t expires, cputime_t new_exp)
--
1.7.5.4

2013-03-30 13:16:07

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 3/3] posix_timers: Remove dead task timer expiry caching

When reading a timer sample, posix_cpu_timer_get() and
posix_cpu_timer_schedule() both perform a caching of the timer
expiry time by converting its value from absolute to relative
if the task has exited.

The reason for this caching is not clear though, it could be:

1) For performance reasons: no need to calculate the delta after
the task has died, its cputime won't change anymore. We can
thus avoid some locking (sighand, tasklist_lock, rq->lock for
task_delta_exec(), ...), and various operations to calculate
the sample...

2) To keep the remaining delta for the timer available after the
task has died. When it gets reaped, its sighand disappears, so
accessing the process wide cputime through tsk->signal is probably
not safe.

Now, is the first reason really worth it? I have no idea if it is
a case we really want to optimize.

Considering the second reason, we return a disarmed zero'ed timer
when tsk->sighand == NULL. So if this is an assumed reason, it's
broken. And this case only concern process wide timers that
have their group leader reaped. The posix cpu timer shouldn't even
be available anymore at that time. Unless the group leader changed
since we called posix_cpu_timer_create() after an exec?

Anyway for now I'm sending this as an RFC because there may well
be subtle things I left behind.

Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
---
kernel/posix-cpu-timers.c | 67 +-------------------------------------------
1 files changed, 2 insertions(+), 65 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index 877439b..2388062 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -436,23 +436,6 @@ void posix_cpu_timers_exit_group(struct task_struct *tsk)
tsk->se.sum_exec_runtime + sig->sum_sched_runtime);
}

-static void clear_dead_task(struct k_itimer *itimer, unsigned long long now)
-{
- struct cpu_timer_list *timer = &itimer->it.cpu;
-
- /*
- * That's all for this thread or process.
- * We leave our residual in expires to be reported.
- */
- put_task_struct(timer->task);
- timer->task = NULL;
- if (timer->expires < now) {
- timer->expires = 0;
- } else {
- timer->expires -= now;
- }
-}
-
static inline int expires_gt(cputime_t expires, cputime_t new_exp)
{
return expires == 0 || expires > new_exp;
@@ -737,7 +720,6 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
{
unsigned long long now;
struct task_struct *p = timer->it.cpu.task;
- int clear_dead;

/*
* Easy part: convert the reload time.
@@ -750,23 +732,11 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
return;
}

- if (unlikely(p == NULL)) {
- /*
- * This task already died and the timer will never fire.
- * In this case, expires is actually the dead value.
- */
- dead:
- sample_to_timespec(timer->it_clock, timer->it.cpu.expires,
- &itp->it_value);
- return;
- }
-
/*
* Sample the clock to take the difference with the expiry time.
*/
if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
cpu_clock_sample(timer->it_clock, p, &now);
- clear_dead = p->exit_state;
} else {
read_lock(&tasklist_lock);
if (unlikely(p->sighand == NULL)) {
@@ -775,29 +745,16 @@ static void posix_cpu_timer_get(struct k_itimer *timer, struct itimerspec *itp)
* We can't even collect a sample any more.
* Call the timer disarmed, nothing else to do.
*/
- put_task_struct(p);
- timer->it.cpu.task = NULL;
timer->it.cpu.expires = 0;
+ itp->it_value.tv_sec = itp->it_value.tv_nsec = 0;
read_unlock(&tasklist_lock);
- goto dead;
+ return;
} else {
cpu_timer_sample_group(timer->it_clock, p, &now);
- clear_dead = (unlikely(p->exit_state) &&
- thread_group_empty(p));
}
read_unlock(&tasklist_lock);
}

- if (unlikely(clear_dead)) {
- /*
- * We've noticed that the thread is dead, but
- * not yet reaped. Take this opportunity to
- * drop our task ref.
- */
- clear_dead_task(timer, now);
- goto dead;
- }
-
if (now < timer->it.cpu.expires) {
sample_to_timespec(timer->it_clock,
timer->it.cpu.expires - now,
@@ -1027,22 +984,12 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
struct task_struct *p = timer->it.cpu.task;
unsigned long long now;

- if (unlikely(p == NULL))
- /*
- * The task was cleaned up already, no future firings.
- */
- goto out;
-
/*
* Fetch the current sample and update the timer's expiry time.
*/
if (CPUCLOCK_PERTHREAD(timer->it_clock)) {
cpu_clock_sample(timer->it_clock, p, &now);
bump_cpu_timer(timer, now);
- if (unlikely(p->exit_state)) {
- clear_dead_task(timer, now);
- goto out;
- }
read_lock(&tasklist_lock); /* arm_timer needs it. */
spin_lock(&p->sighand->siglock);
} else {
@@ -1056,15 +1003,6 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
timer->it.cpu.task = p = NULL;
timer->it.cpu.expires = 0;
goto out_unlock;
- } else if (unlikely(p->exit_state) && thread_group_empty(p)) {
- /*
- * We've noticed that the thread is dead, but
- * not yet reaped. Take this opportunity to
- * drop our task ref.
- */
- cpu_timer_sample_group(timer->it_clock, p, &now);
- clear_dead_task(timer, now);
- goto out_unlock;
}
spin_lock(&p->sighand->siglock);
cpu_timer_sample_group(timer->it_clock, p, &now);
@@ -1082,7 +1020,6 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
out_unlock:
read_unlock(&tasklist_lock);

-out:
timer->it_overrun_last = timer->it_overrun;
timer->it_overrun = -1;
++timer->it_requeue_pending;
--
1.7.5.4

2013-03-30 13:16:05

by Frederic Weisbecker

[permalink] [raw]
Subject: [PATCH 1/3] posix-timers: Correctly get dying task time sample in posix_cpu_timer_schedule()

In order to arm the next timer to schedule, we take a sample of the
current process or thread cputime.

If the task is dying though, we don't arm anything but we
cache the remaining timer expiration delta for further reads.

Something similar is performed in posix_cpu_timer_get() but
here we forget to take the process wide cputime sample
before caching it.

As a result we are storing random stack content, leading
every further reads of that timer to return junk values.

Fix this by taking the appropriate sample in the case of
process wide timers.

Reported-by: Andrew Morton <[email protected]>
Reported-by: Chen Gang <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: Stanislaw Gruszka <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Chen Gang <[email protected]>
---
kernel/posix-cpu-timers.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index edf94b6..afd79a9 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -1062,6 +1062,7 @@ void posix_cpu_timer_schedule(struct k_itimer *timer)
* not yet reaped. Take this opportunity to
* drop our task ref.
*/
+ cpu_timer_sample_group(timer->it_clock, p, &now);
clear_dead_task(timer, now);
goto out_unlock;
}
--
1.7.5.4

2013-04-01 00:08:31

by Chen Gang

[permalink] [raw]
Subject: Re: [PATCH 1/3] posix-timers: Correctly get dying task time sample in posix_cpu_timer_schedule()

On 2013??03??30?? 21:15, Frederic Weisbecker wrote:
> In order to arm the next timer to schedule, we take a sample of the
> current process or thread cputime.
>
> If the task is dying though, we don't arm anything but we
> cache the remaining timer expiration delta for further reads.
>
> Something similar is performed in posix_cpu_timer_get() but
> here we forget to take the process wide cputime sample
> before caching it.
>
> As a result we are storing random stack content, leading
> every further reads of that timer to return junk values.
>
> Fix this by taking the appropriate sample in the case of
> process wide timers.
>
> Reported-by: Andrew Morton <[email protected]>
> Reported-by: Chen Gang <[email protected]>
> Signed-off-by: Frederic Weisbecker <[email protected]>
> Cc: Stanislaw Gruszka <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Chen Gang <[email protected]>
> ---

thank you for mark me as reported by, although I reported too late
(Andrew Morton is the first reporter).

next, I should continue to try to find another issues about kernel.

:-)

--
Chen Gang

Asianux Corporation

2013-04-08 23:56:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/3] posix_timers: A few expiry caching fixes

On Sat, 30 Mar 2013 14:15:28 +0100 Frederic Weisbecker <[email protected]> wrote:

> This series is on top of latest -mmotm.
>
> [1/3] fixes some omitting sample capture after a build warning suddenly became visible
> [2/3] fixes a wrong timer_gettime() result on a timer whose target has been reaped.
> [3/3] needs some more thinking probably

Nobody seems to have applied these to anything yet.

The changelogs don't have an explicit description of the user-visible
impact of the bugs which were fixed. Providing such a description
makes it more likely that others will make the correct patch-scheduling
decisions ;)

2013-04-11 15:19:21

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 0/3] posix_timers: A few expiry caching fixes

On Mon, Apr 08, 2013 at 04:55:59PM -0700, Andrew Morton wrote:
> On Sat, 30 Mar 2013 14:15:28 +0100 Frederic Weisbecker <[email protected]> wrote:
>
> > This series is on top of latest -mmotm.
> >
> > [1/3] fixes some omitting sample capture after a build warning suddenly became visible
> > [2/3] fixes a wrong timer_gettime() result on a timer whose target has been reaped.
> > [3/3] needs some more thinking probably
>
> Nobody seems to have applied these to anything yet.
>
> The changelogs don't have an explicit description of the user-visible
> impact of the bugs which were fixed. Providing such a description
> makes it more likely that others will make the correct patch-scheduling
> decisions ;)

Ok, I'll iterate through these and try to get a better description of that side.

Thanks.

2013-04-11 15:27:06

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [PATCH 1/3] posix-timers: Correctly get dying task time sample in posix_cpu_timer_schedule()

On Mon, Apr 01, 2013 at 08:07:59AM +0800, Chen Gang wrote:
> On 2013年03月30日 21:15, Frederic Weisbecker wrote:
> > In order to arm the next timer to schedule, we take a sample of the
> > current process or thread cputime.
> >
> > If the task is dying though, we don't arm anything but we
> > cache the remaining timer expiration delta for further reads.
> >
> > Something similar is performed in posix_cpu_timer_get() but
> > here we forget to take the process wide cputime sample
> > before caching it.
> >
> > As a result we are storing random stack content, leading
> > every further reads of that timer to return junk values.
> >
> > Fix this by taking the appropriate sample in the case of
> > process wide timers.
> >
> > Reported-by: Andrew Morton <[email protected]>
> > Reported-by: Chen Gang <[email protected]>
> > Signed-off-by: Frederic Weisbecker <[email protected]>
> > Cc: Stanislaw Gruszka <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Chen Gang <[email protected]>
> > ---
>
> thank you for mark me as reported by, although I reported too late
> (Andrew Morton is the first reporter).

Bug reporters simply deserve to be credited, no sorting is required ;)

>
> next, I should continue to try to find another issues about kernel.

I encourage you to do so :)

Thanks.

2013-04-11 15:34:03

by Chen Gang

[permalink] [raw]
Subject: Re: [PATCH 1/3] posix-timers: Correctly get dying task time sample in posix_cpu_timer_schedule()

On 2013年04月11日 23:26, Frederic Weisbecker wrote:
> Bug reporters simply deserve to be credited, no sorting is required ;)
>

thank God.


>> >
>> > next, I should continue to try to find another issues about kernel.
> I encourage you to do so :)

I will continue.

:-)

--
Chen Gang

Asianux Corporation