2007-10-26 19:37:49

by Mathieu Desnoyers

Subject: Adding TIF_KERNEL_TRACE to x86_64

Hi Andi,

I am trying to add a TIF_KERNEL_TRACE flag to each architecture to get
system-wide activation of syscall tracing. However, I hit the following
issue on x86_64: a few processes segfault and others get a GPF when I
enable the flag on all processes. I am starting to think it might be
caused by an incorrect top of stack when we return from a
syscall/interrupt in these processes. It would happen if we hit the
following race:

1 - process A enters a syscall while TIF_KERNEL_TRACE is clear
2 - we activate TIF_KERNEL_TRACE
3 - process A returns from the syscall (with a wrong top of stack?) -> segfault.
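
To make the suspected sequence concrete, here is my reading of the
x86_64 fast path (a sketch in comments; I may well be missing
something):

/*
 * Suspected failure mode, from arch/x86_64/kernel/entry.S (2.6.23):
 *
 *   system_call:   TIF_KERNEL_TRACE is clear, so the fast path is
 *                  taken and FIXUP_TOP_OF_STACK never runs; the
 *                  RSP/SS/CS/EFLAGS slots of pt_regs are left stale.
 *
 *   (other CPU)    set_kernel_trace_flag_all_tasks() sets the flag.
 *
 *   sysret_check:  the work-flag test now sees the new flag, so we
 *                  leave through the slow (iret) exit path, which
 *                  rebuilds the user frame from those stale pt_regs
 *                  slots -> segfault or GPF in the process.
 */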

Am I on the right track?

Can this be a concern with TIF_SYSCALL_TRACE as well? (a potential race in
ptrace?)

Thanks for your input,

Mathieu

My x86_64 flags patch for 2.6.23.1 looks like this:

---
include/asm-x86_64/thread_info.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/include/asm-x86_64/thread_info.h
===================================================================
--- linux-2.6-lttng.orig/include/asm-x86_64/thread_info.h 2007-07-30 18:46:16.000000000 -0400
+++ linux-2.6-lttng/include/asm-x86_64/thread_info.h 2007-07-30 19:13:03.000000000 -0400
@@ -107,6 +107,7 @@ static inline struct thread_info *stack_
* Warning: layout of LSW is hardcoded in entry.S
*/
#define TIF_SYSCALL_TRACE 0 /* syscall trace active */
+#define TIF_KERNEL_TRACE 1 /* kernel trace active */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
@@ -125,6 +126,7 @@ static inline struct thread_info *stack_
#define TIF_FREEZE 23 /* is freezing for suspend */

#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
+#define _TIF_KERNEL_TRACE (1<<TIF_KERNEL_TRACE)
#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_SINGLESTEP (1<<TIF_SINGLESTEP)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
@@ -142,7 +144,7 @@ static inline struct thread_info *stack_

/* work to do on interrupt/exception return */
#define _TIF_WORK_MASK \
- (0x0000FFFF & ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SINGLESTEP|_TIF_SECCOMP))
+ (0x0000FFFF & ~(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_KERNEL_TRACE|_TIF_SINGLESTEP|_TIF_SECCOMP))
/* work to do on any return to user space */
#define _TIF_ALLWORK_MASK (0x0000FFFF & ~_TIF_SECCOMP)
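
For clarity, here is how I understand the two masks being consumed (my
paraphrase of the 2.6.23 conventions, as comments):

/*
 * _TIF_WORK_MASK    - tested on interrupt/exception return.  I mask
 *                     TIF_KERNEL_TRACE out here because there is no
 *                     syscall boundary to trace on that path.
 *
 * _TIF_ALLWORK_MASK - tested on syscall return (sysret_check).  Bit 1
 *                     lies inside 0x0000FFFF and is not masked out, so
 *                     a set TIF_KERNEL_TRACE forces the slow syscall
 *                     exit path, where the tracing hooks can run.
 */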

And the code to activate/deactivate the flags:

---
include/linux/sched.h | 3 +++
kernel/fork.c | 9 +++++++++
kernel/sched.c | 42 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 54 insertions(+)

Index: linux-2.6-lttng/include/linux/sched.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/sched.h 2007-10-12 12:02:09.000000000 -0400
+++ linux-2.6-lttng/include/linux/sched.h 2007-10-12 12:11:23.000000000 -0400
@@ -1953,6 +1953,9 @@ static inline void inc_syscw(struct task
}
#endif

+extern void clear_kernel_trace_flag_all_tasks(void);
+extern void set_kernel_trace_flag_all_tasks(void);
+
#endif /* __KERNEL__ */

#endif
Index: linux-2.6-lttng/kernel/fork.c
===================================================================
--- linux-2.6-lttng.orig/kernel/fork.c 2007-10-12 12:02:18.000000000 -0400
+++ linux-2.6-lttng/kernel/fork.c 2007-10-12 12:11:23.000000000 -0400
@@ -1241,6 +1241,15 @@ static struct task_struct *copy_process(
!cpu_online(task_cpu(p))))
set_task_cpu(p, smp_processor_id());

+ /*
+ * The state of the parent's TIF_KERNEL_TRACE flag may have changed
+ * since it was copied in dup_task_struct(), so we re-copy it here.
+ */
+ if (test_thread_flag(TIF_KERNEL_TRACE))
+ set_tsk_thread_flag(p, TIF_KERNEL_TRACE);
+ else
+ clear_tsk_thread_flag(p, TIF_KERNEL_TRACE);
+
/* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD))
p->real_parent = current->real_parent;
Index: linux-2.6-lttng/kernel/sched.c
===================================================================
--- linux-2.6-lttng.orig/kernel/sched.c 2007-10-12 12:02:17.000000000 -0400
+++ linux-2.6-lttng/kernel/sched.c 2007-10-12 12:11:55.000000000 -0400
@@ -7032,3 +7032,45 @@ struct cgroup_subsys cpu_cgroup_subsys =
};

#endif /* CONFIG_FAIR_CGROUP_SCHED */
+
+/**
+ * clear_kernel_trace_flag_all_tasks - clears all TIF_KERNEL_TRACE thread flags.
+ *
+ * This function iterates over all threads in the system to clear their
+ * TIF_KERNEL_TRACE flag. Setting the TIF_KERNEL_TRACE flag with the
+ * tasklist_lock held in copy_process() makes sure that once we finish clearing
+ * the thread flags, all threads have their flags cleared.
+ */
+void clear_kernel_trace_flag_all_tasks(void)
+{
+ struct task_struct *p;
+ struct task_struct *t;
+
+ read_lock(&tasklist_lock);
+ do_each_thread(p, t) {
+ clear_tsk_thread_flag(t, TIF_KERNEL_TRACE);
+ } while_each_thread(p, t);
+ read_unlock(&tasklist_lock);
+}
+EXPORT_SYMBOL_GPL(clear_kernel_trace_flag_all_tasks);
+
+/**
+ * set_kernel_trace_flag_all_tasks - sets all TIF_KERNEL_TRACE thread flags.
+ *
+ * This function iterates over all threads in the system to set their
+ * TIF_KERNEL_TRACE flag. Setting the TIF_KERNEL_TRACE flag with the
+ * tasklist_lock held in copy_process() makes sure that once we finish setting
+ * the thread flags, all threads have their flags set.
+ */
+void set_kernel_trace_flag_all_tasks(void)
+{
+ struct task_struct *p;
+ struct task_struct *t;
+
+ read_lock(&tasklist_lock);
+ do_each_thread(p, t) {
+ set_tsk_thread_flag(t, TIF_KERNEL_TRACE);
+ } while_each_thread(p, t);
+ read_unlock(&tasklist_lock);
+}
+EXPORT_SYMBOL_GPL(set_kernel_trace_flag_all_tasks);
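
For context, a tracer would drive these two exports roughly as follows
(a sketch only; the ltt_* names are hypothetical and not part of this
patch):

#include <linux/module.h>
#include <linux/sched.h>

/* Hypothetical trace-control entry points (sketch). */
static int ltt_trace_start(void)
{
	/* ... allocate buffers, arm the instrumentation ... */

	/* From here on, every thread takes the traced syscall path. */
	set_kernel_trace_flag_all_tasks();
	return 0;
}

static void ltt_trace_stop(void)
{
	/* Threads fall back to the fast sysret path. */
	clear_kernel_trace_flag_all_tasks();

	/* ... drain and free the buffers ... */
}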


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68


2007-10-27 18:08:47

by Mathieu Desnoyers

Subject: [PATCH] Fix x86_64 TIF_SYSCALL_TRACE race in entry.S

Fix x86_64 TIF_SYSCALL_TRACE race in entry.S

When the flag is inactive upon syscall entry and concurrently activated before
exit, we seem to reach a state where the top of stack is incorrect upon return
to user space.

Address this by fixing up the top of stack and jumping to
int_ret_from_sys_call if we detect that the thread flags have been modified.

We make sure that the thread flag read is coherent between our new test and
the ALLWORK_MASK test by first saving it in a register used for both
comparisons.

It applies on top of 2.6.23-mm1 or 2.6.23.1. If you think the
implementation is correct, I'll port it to 2.6.24-rc1.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Andi Kleen <[email protected]>
---
arch/x86_64/kernel/entry.S | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6-lttng/arch/x86_64/kernel/entry.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86_64/kernel/entry.S 2007-10-27 14:01:12.000000000 -0400
+++ linux-2.6-lttng/arch/x86_64/kernel/entry.S 2007-10-27 14:01:28.000000000 -0400
@@ -245,9 +245,11 @@ ret_from_sys_call:
/* edi: flagmask */
sysret_check:
GET_THREAD_INFO(%rcx)
+ movl threadinfo_flags(%rcx),%edx
+ testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),%edx
+ jnz ret_from_sys_call_trace
cli
TRACE_IRQS_OFF
- movl threadinfo_flags(%rcx),%edx
andl %edi,%edx
jnz sysret_careful
CFI_REMEMBER_STATE
@@ -278,6 +280,14 @@ sysret_careful:
CFI_ADJUST_CFA_OFFSET -8
jmp sysret_check

+ret_from_sys_call_trace:
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %rdi
+ movq %rsp,%rdi
+ LOAD_ARGS ARGOFFSET /* reload args from stack in case ptrace changed it */
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+
/* Handle a signal */
sysret_signal:
TRACE_IRQS_ON
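
For reference, FIXUP_TOP_OF_STACK is what completes the partial frame
left by the syscall fast path. Paraphrasing the 2.6.23 macro (as
comments):

/*
 * FIXUP_TOP_OF_STACK (arch/x86_64/kernel/entry.S): the syscall fast
 * path never builds a full iret frame, so this macro fills in the
 * missing pt_regs slots before running code that expects a complete
 * frame (ptrace, signal delivery, int_ret_from_sys_call):
 *
 *   RSP(%rsp)    <- %gs:pda_oldrsp   user stack pointer saved at entry
 *   SS(%rsp)     <- __USER_DS
 *   CS(%rsp)     <- __USER_CS
 *   EFLAGS(%rsp) <- R11(%rsp)        syscall stashed user rflags in r11
 *   RCX(%rsp)    <- -1               rcx held the user return rip
 */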
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-27 19:00:44

by Andi Kleen

Subject: Re: Adding TIF_KERNEL_TRACE to x86_64

On Fri, Oct 26, 2007 at 03:37:38PM -0400, Mathieu Desnoyers wrote:
>
> 1 - process A enters a syscall while TIF_KERNEL_TRACE is clear
> 2 - we activate TIF_KERNEL_TRACE
> 3 - process A returns from the syscall (with a wrong top of stack?) -> segfault.
>
> Am I on the right track?

Yes. The code was not designed to allow that. The syscall path
has been optimized to go as fast as possible.

>
> Can this be a concern with TIF_SYSCALL_TRACE as well? (a potential race in
> ptrace?)

ptrace only changes state in stopped processes, and those are not hanging
in syscalls. So no, there is no race in a standard kernel.

If you wanted to change it, the right way would probably be to test
for SYSCALL_TRACE too in the work flags and handle it there.

-Andi

2007-10-27 19:05:17

by Andi Kleen

Subject: Re: [PATCH] Fix x86_64 TIF_SYSCALL_TRACE race in entry.S

Mathieu Desnoyers <[email protected]> writes:

> We make sure that the thread flag read is coherent between our new test and
> the ALLWORK_MASK test by first saving it in a register used for both
> comparisons.
>

That doesn't make sense. If someone is setting those asynchronously, you
can always race.

You should really just stop the process like ptrace does before changing
such things.

-Andi

2007-10-28 21:15:38

by Mathieu Desnoyers

Subject: Re: [PATCH] Fix x86_64 TIF_SYSCALL_TRACE race in entry.S

* Andi Kleen ([email protected]) wrote:
> Mathieu Desnoyers <[email protected]> writes:
>
> > We make sure that the thread flag read is coherent between our new test
> > and the ALLWORK_MASK test by first saving it in a register used for both
> > comparisons.
> >
>
> That doesn't make sense. If someone is setting those asynchronously, you
> can always race.
>

Since setting the thread flag is an atomic operation, I would expect
setting or clearing it asynchronously from another thread to be valid
behavior. The only race I foresee happens if the code that uses the
thread flag reads it more than once and expects it to stay unchanged
between the reads.
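
To be explicit about what I mean by atomic: the flag setters boil down
to atomic bitops. Quoting the 2.6.23 helpers as I read them:

/* include/linux/thread_info.h */
static inline void set_ti_thread_flag(struct thread_info *ti, int flag)
{
	set_bit(flag, &ti->flags);	/* atomic read-modify-write */
}

/* include/linux/sched.h */
static inline void set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
	set_ti_thread_flag(task_thread_info(tsk), flag);
}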

> You should really just stop the process like ptrace does before changing
> such things.
>

Iterating over every thread running on the system and stopping each of
them when we start kernel tracing seems to have about the same impact as
throwing a brick into a quiet lake. :) I would prefer not to do that if
we can avoid it.

Here is a modified version where I add my test only in the path where we
already know that we have work to do, removing the extra test from the
performance-critical path. Would it be more acceptable?



Fix x86_64 TIF_SYSCALL_TRACE race in entry.S

When the flag is inactive upon syscall entry and concurrently activated before
exit, we seem to reach a state where the top of stack is incorrect upon return
to user space.

Address this by fixing up the top of stack and jumping to
int_ret_from_sys_call if we detect that the thread flags have been modified.

We make sure that the thread flag read is coherent between our new test and
the ALLWORK_MASK test by first saving it in a register used for both
comparisons.

Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Andi Kleen <[email protected]>
---
arch/x86_64/kernel/entry.S | 12 ++++++++++++
1 file changed, 12 insertions(+)

Index: linux-2.6-lttng/arch/x86_64/kernel/entry.S
===================================================================
--- linux-2.6-lttng.orig/arch/x86_64/kernel/entry.S 2007-10-27 14:01:12.000000000 -0400
+++ linux-2.6-lttng/arch/x86_64/kernel/entry.S 2007-10-28 16:33:56.000000000 -0400
@@ -267,6 +267,8 @@ sysret_check:
/* Handle reschedules */
/* edx: work, edi: workmask */
sysret_careful:
+ testl $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT|_TIF_SECCOMP),%edx
+ jnz ret_from_sys_call_trace
bt $TIF_NEED_RESCHED,%edx
jnc sysret_signal
TRACE_IRQS_ON
@@ -278,6 +280,16 @@ sysret_careful:
CFI_ADJUST_CFA_OFFSET -8
jmp sysret_check

+ret_from_sys_call_trace:
+ TRACE_IRQS_ON
+ sti
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %rdi
+ movq %rsp,%rdi
+ LOAD_ARGS ARGOFFSET /* reload args from stack in case ptrace changed it */
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+
/* Handle a signal */
sysret_signal:
TRACE_IRQS_ON


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2007-10-28 21:22:03

by Andi Kleen

Subject: Re: [PATCH] Fix x86_64 TIF_SYSCALL_TRACE race in entry.S

> Since setting the thread flag is an atomic operation, I would expect
> setting or clearing it asynchronously from another thread to be valid

It could be a very short stop. Also, do you start kernel tracing that often?

> Here is a modified version where I add my test only in the path where we
> already know that we have work to do, removing the extra test from the
> performance-critical path. Would it be more acceptable?

It's better, but stopping would be even better. I wouldn't be surprised
if there are other problems with thread flags changing asynchronously.

Also I object to you calling this a bug. It's a new feature.

-Andi

2007-10-28 22:31:50

by Mathieu Desnoyers

Subject: Re: [PATCH] Fix x86_64 TIF_SYSCALL_TRACE race in entry.S

* Andi Kleen ([email protected]) wrote:
> > Since setting the thread flag is an atomic operation, I would expect
> > setting or clearing it asynchronously from another thread to be valid
>
> It could be a very short stop. Also, do you start kernel tracing that often?
>

It's not a matter of how often I start tracing, but of what impact I
want this operation to have on a running production system. If I start
tracing on a server to try to detect particularly nasty race conditions,
I prefer not to interfere with normal execution too much. The same
applies when we try to figure out the source of unexpected latencies
experienced in user space: stopping the processes could be considered
too much of an impact on the system under study.

I was already reluctant to iterate over every thread to set a flag
(this was proposed by Martin and Rebecca in their Google ktrace
implementation), but I agreed to go forward with this solution because
of the performance benefits. However, I would prefer not to go as far
as stopping every process on the system at trace start/stop unless it's
the only solution left.

> > Here is a modified version where I add my test only in the path where we
> > already know that we have work to do, removing the extra test from the
> > performance-critical path. Would it be more acceptable?
>
> It's better, but stopping would be even better. I wouldn't be surprised
> if there are other problems with thread flags changing asynchronously.
>

Do you mean that architectures other than x86_64 could also assume that
the thread flags stay unchanged between two consecutive reads? If the
thread flags were not meant to be updated asynchronously, why would they
require atomic updates at all?


> Also I object to you calling this a bug. It's a new feature.
>

Agreed. ptrace seems to be correct as is. The fix would only be needed
if we plan to use the flags the way I described for TIF_KERNEL_TRACE.

Mathieu


--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68