2022-01-18 02:41:50

by Valentin Schneider

Subject: [PATCH v2 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not

Hi folks,

Problem
=======

Abhijeet pointed out that the following sequence of trace events can be
observed:

cat-1676 [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004
grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120
<idle>-0 [004] dNh3 103.010459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004

IOW, a not-yet-woken task gets presented as runnable and switched out in
favor of the idle task... Dietmar and I had a look; it turns out this
sequence can happen:

p->__state = TASK_INTERRUPTIBLE;
__schedule()
deactivate_task(p);
ttwu()
READ !p->on_rq
p->__state=TASK_WAKING
trace_sched_switch()
__trace_sched_switch_state()
task_state_index()
return 0;

TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0.
This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu()
spinning on p->on_cpu") which punted the TASK_WAKING write to the other
side of

smp_cond_load_acquire(&p->on_cpu, !VAL);

in ttwu().

Uwe reported on #linux-rt what I think is a similar issue with
TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so
you get prev_state=0 despite the task blocking on a lock.

Both of those are very confusing for tooling or even human observers.

Patches
=======

For the TASK_WAKING case, pushing the prev_state read in __schedule() down
to __trace_sched_switch_state() seems sufficient. That's patch 1.

For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as
prev_state, but
a) that is also susceptible to racing (see ttwu() / ttwu_state_match())
b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as
TASK_RUNNING.

Patch 2 adds TASK_RTLOCK_WAIT to TASK_REPORT instead.

Testing
=======

Briefly tested on an Arm Juno via ftrace+hackbench against
o tip/sched/core@82762d2af31a
o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT


I also had a look at the __schedule() disassembly as suggested by Peter; on x86
prev_state gets pushed to the stack and popped prior to the trace event static
call, the rest seems mostly unchanged (GCC 9.3).

Revisions
=========

RFC v1 -> v2
++++++++++++

o Collected tags (Steven, Sebastian)
o Patched missing trace_switch event clients (Steven)

Cheers,
Valentin

Valentin Schneider (2):
sched/tracing: Don't re-read p->state when emitting sched_switch event
sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

fs/proc/array.c | 3 ++-
include/linux/sched.h | 28 +++++++++++++++++-----------
include/trace/events/sched.h | 12 ++++++++----
kernel/sched/core.c | 4 ++--
kernel/trace/fgraph.c | 4 +++-
kernel/trace/ftrace.c | 4 +++-
kernel/trace/trace_events.c | 8 ++++++--
kernel/trace/trace_osnoise.c | 4 +++-
kernel/trace/trace_sched_switch.c | 1 +
kernel/trace/trace_sched_wakeup.c | 1 +
10 files changed, 46 insertions(+), 23 deletions(-)

--
2.25.1


2022-01-18 02:41:50

by Valentin Schneider

Subject: [PATCH v2 1/2] sched/tracing: Don't re-read p->state when emitting sched_switch event

As of commit

c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

the following sequence becomes possible:

p->__state = TASK_INTERRUPTIBLE;
__schedule()
deactivate_task(p);
ttwu()
READ !p->on_rq
p->__state=TASK_WAKING
trace_sched_switch()
__trace_sched_switch_state()
task_state_index()
return 0;

TASK_WAKING isn't in TASK_REPORT, so the task appears as TASK_RUNNING in
the trace event.

Prevent this by pushing the value read in __schedule() down to the trace
event.

Reported-by: Abhijeet Dharmapurikar <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
---
include/linux/sched.h | 11 ++++++++---
include/trace/events/sched.h | 11 +++++++----
kernel/sched/core.c | 4 ++--
kernel/trace/fgraph.c | 4 +++-
kernel/trace/ftrace.c | 4 +++-
kernel/trace/trace_events.c | 8 ++++++--
kernel/trace/trace_osnoise.c | 4 +++-
kernel/trace/trace_sched_switch.c | 1 +
kernel/trace/trace_sched_wakeup.c | 1 +
9 files changed, 34 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2e261adb8ea..d00837d12b9d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1616,10 +1616,10 @@ static inline pid_t task_pgrp_nr(struct task_struct *tsk)
#define TASK_REPORT_IDLE (TASK_REPORT + 1)
#define TASK_REPORT_MAX (TASK_REPORT_IDLE << 1)

-static inline unsigned int task_state_index(struct task_struct *tsk)
+static inline unsigned int __task_state_index(unsigned int tsk_state,
+ unsigned int tsk_exit_state)
{
- unsigned int tsk_state = READ_ONCE(tsk->__state);
- unsigned int state = (tsk_state | tsk->exit_state) & TASK_REPORT;
+ unsigned int state = (tsk_state | tsk_exit_state) & TASK_REPORT;

BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);

@@ -1629,6 +1629,11 @@ static inline unsigned int task_state_index(struct task_struct *tsk)
return fls(state);
}

+static inline unsigned int task_state_index(struct task_struct *tsk)
+{
+ return __task_state_index(READ_ONCE(tsk->__state), tsk->exit_state);
+}
+
static inline char task_index_to_char(unsigned int state)
{
static const char state_char[] = "RSDTtXZPI";
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 94640482cfe7..65e786756321 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -187,7 +187,9 @@ DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
TP_ARGS(p));

#ifdef CREATE_TRACE_POINTS
-static inline long __trace_sched_switch_state(bool preempt, struct task_struct *p)
+static inline long __trace_sched_switch_state(bool preempt,
+ unsigned int prev_state,
+ struct task_struct *p)
{
unsigned int state;

@@ -208,7 +210,7 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
* it for left shift operation to get the correct task->state
* mapping.
*/
- state = task_state_index(p);
+ state = __task_state_index(prev_state, p->exit_state);

return state ? (1 << (state - 1)) : state;
}
@@ -220,10 +222,11 @@ static inline long __trace_sched_switch_state(bool preempt, struct task_struct *
TRACE_EVENT(sched_switch,

TP_PROTO(bool preempt,
+ unsigned int prev_state,
struct task_struct *prev,
struct task_struct *next),

- TP_ARGS(preempt, prev, next),
+ TP_ARGS(preempt, prev_state, prev, next),

TP_STRUCT__entry(
__array( char, prev_comm, TASK_COMM_LEN )
@@ -239,7 +242,7 @@ TRACE_EVENT(sched_switch,
memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
__entry->prev_pid = prev->pid;
__entry->prev_prio = prev->prio;
- __entry->prev_state = __trace_sched_switch_state(preempt, prev);
+ __entry->prev_state = __trace_sched_switch_state(preempt, prev_state, prev);
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe53e510e711..a8799a2d8546 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4822,7 +4822,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
{
struct rq *rq = this_rq();
struct mm_struct *mm = rq->prev_mm;
- long prev_state;
+ unsigned int prev_state;

/*
* The previous task will have left us with a preempt_count of 2
@@ -6287,7 +6287,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
migrate_disable_switch(rq, prev);
psi_sched_switch(prev, next, !task_on_rq_queued(prev));

- trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next);
+ trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev_state, prev, next);

/* Also unlocks the rq: */
rq = context_switch(rq, prev, next, &rf);
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 22061d38fc00..19028e072cdb 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -415,7 +415,9 @@ static int alloc_retstack_tasklist(struct ftrace_ret_stack **ret_stack_list)

static void
ftrace_graph_probe_sched_switch(void *ignore, bool preempt,
- struct task_struct *prev, struct task_struct *next)
+ unsigned int prev_state,
+ struct task_struct *prev,
+ struct task_struct *next)
{
unsigned long long timestamp;
int index;
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 30bc880c3849..e296ddeec99f 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -7311,7 +7311,9 @@ ftrace_func_t ftrace_ops_get_func(struct ftrace_ops *ops)

static void
ftrace_filter_pid_sched_switch_probe(void *data, bool preempt,
- struct task_struct *prev, struct task_struct *next)
+ unsigned int prev_state,
+ struct task_struct *prev,
+ struct task_struct *next)
{
struct trace_array *tr = data;
struct trace_pid_list *pid_list;
diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 4021b9a79f93..6ddc6cc0d5d5 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -759,7 +759,9 @@ void trace_event_follow_fork(struct trace_array *tr, bool enable)

static void
event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,
- struct task_struct *prev, struct task_struct *next)
+ unsigned int prev_state,
+ struct task_struct *prev,
+ struct task_struct *next)
{
struct trace_array *tr = data;
struct trace_pid_list *no_pid_list;
@@ -783,7 +785,9 @@ event_filter_pid_sched_switch_probe_pre(void *data, bool preempt,

static void
event_filter_pid_sched_switch_probe_post(void *data, bool preempt,
- struct task_struct *prev, struct task_struct *next)
+ unsigned int prev_state,
+ struct task_struct *prev,
+ struct task_struct *next)
{
struct trace_array *tr = data;
struct trace_pid_list *no_pid_list;
diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 7520d43aed55..a8a2d17f858c 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1168,7 +1168,9 @@ thread_exit(struct osnoise_variables *osn_var, struct task_struct *t)
* used to record the beginning and to report the end of a thread noise window.
*/
static void
-trace_sched_switch_callback(void *data, bool preempt, struct task_struct *p,
+trace_sched_switch_callback(void *data, bool preempt,
+ unsigned int prev_state,
+ struct task_struct *p,
struct task_struct *n)
{
struct osnoise_variables *osn_var = this_cpu_osn_var();
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index e304196d7c28..993b0ed10d8c 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -22,6 +22,7 @@ static DEFINE_MUTEX(sched_register_mutex);

static void
probe_sched_switch(void *ignore, bool preempt,
+ unsigned int prev_state,
struct task_struct *prev, struct task_struct *next)
{
int flags;
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 2402de520eca..46429f9a96fa 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -426,6 +426,7 @@ tracing_sched_wakeup_trace(struct trace_array *tr,

static void notrace
probe_wakeup_sched_switch(void *ignore, bool preempt,
+ unsigned int prev_state,
struct task_struct *prev, struct task_struct *next)
{
struct trace_array_cpu *data;
--
2.25.1

2022-01-18 02:41:51

by Valentin Schneider

Subject: [PATCH v2 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.

The actual state is saved in p->saved_state, but reading it after reading
p->__state has a few issues:
o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
o ttwu_state_match() might have changed that to TASK_RUNNING

Add TASK_RTLOCK_WAIT to TASK_REPORT.

Reported-by: Uwe Kleine-König <[email protected]>
Signed-off-by: Valentin Schneider <[email protected]>
Reviewed-by: Steven Rostedt <[email protected]>
Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/proc/array.c | 3 ++-
include/linux/sched.h | 17 +++++++++--------
include/trace/events/sched.h | 1 +
3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ff869a66b34e..f4cae65529a6 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -128,9 +128,10 @@ static const char * const task_state_array[] = {
"X (dead)", /* 0x10 */
"Z (zombie)", /* 0x20 */
"P (parked)", /* 0x40 */
+ "L (rt-locked)", /* 0x80 */

/* states beyond TASK_REPORT: */
- "I (idle)", /* 0x80 */
+ "I (idle)", /* 0x100 */
};

static inline const char *get_task_state(struct task_struct *tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d00837d12b9d..18fd77578dae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,13 +91,14 @@ struct task_group;
#define EXIT_TRACE (EXIT_ZOMBIE | EXIT_DEAD)
/* Used in tsk->state again: */
#define TASK_PARKED 0x0040
-#define TASK_DEAD 0x0080
-#define TASK_WAKEKILL 0x0100
-#define TASK_WAKING 0x0200
-#define TASK_NOLOAD 0x0400
-#define TASK_NEW 0x0800
/* RT specific auxilliary flag to mark RT lock waiters */
-#define TASK_RTLOCK_WAIT 0x1000
+#define TASK_RTLOCK_WAIT 0x0080
+
+#define TASK_DEAD 0x0100
+#define TASK_WAKEKILL 0x0200
+#define TASK_WAKING 0x0400
+#define TASK_NOLOAD 0x0800
+#define TASK_NEW 0x1000
#define TASK_STATE_MAX 0x2000

/* Convenience macros for the sake of set_current_state: */
@@ -114,7 +115,7 @@ struct task_group;
#define TASK_REPORT (TASK_RUNNING | TASK_INTERRUPTIBLE | \
TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
__TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
- TASK_PARKED)
+ TASK_PARKED | TASK_RTLOCK_WAIT)

#define task_is_running(task) (READ_ONCE((task)->__state) == TASK_RUNNING)

@@ -1636,7 +1637,7 @@ static inline unsigned int task_state_index(struct task_struct *tsk)

static inline char task_index_to_char(unsigned int state)
{
- static const char state_char[] = "RSDTtXZPI";
+ static const char state_char[] = "RSDTtXZPLI";

BUILD_BUG_ON(1 + ilog2(TASK_REPORT_MAX) != sizeof(state_char) - 1);

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 65e786756321..f86ec9af19ff 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -261,6 +261,7 @@ TRACE_EVENT(sched_switch,
{ EXIT_DEAD, "X" },
{ EXIT_ZOMBIE, "Z" },
{ TASK_PARKED, "P" },
+ { TASK_RTLOCK_WAIT, "L" },
{ TASK_DEAD, "I" }) :
"R",

--
2.25.1

2022-01-18 02:59:08

by Eric W. Biederman

Subject: Re: [PATCH v2 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

Valentin Schneider <[email protected]> writes:

> TASK_RTLOCK_WAIT currently isn't part of TASK_REPORT, thus a task blocking
> on an rtlock will appear as having a task state == 0, IOW TASK_RUNNING.
>
> The actual state is saved in p->saved_state, but reading it after reading
> p->__state has a few issues:
> o that could still be TASK_RUNNING in the case of e.g. rt_spin_lock
> o ttwu_state_match() might have changed that to TASK_RUNNING
>
> Add TASK_RTLOCK_WAIT to TASK_REPORT.
>
> Reported-by: Uwe Kleine-König <[email protected]>
> Signed-off-by: Valentin Schneider <[email protected]>
> Reviewed-by: Steven Rostedt <[email protected]>
> Reviewed-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> fs/proc/array.c | 3 ++-
> include/linux/sched.h | 17 +++++++++--------
> include/trace/events/sched.h | 1 +
> 3 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index ff869a66b34e..f4cae65529a6 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -128,9 +128,10 @@ static const char * const task_state_array[] = {
> "X (dead)", /* 0x10 */
> "Z (zombie)", /* 0x20 */
> "P (parked)", /* 0x40 */
> + "L (rt-locked)", /* 0x80 */
>
> /* states beyond TASK_REPORT: */
> - "I (idle)", /* 0x80 */
> + "I (idle)", /* 0x100 */
> };

I think this is at least possibly an ABI break. I have a vague memory
that userspace is not ready to have new task states reported to it. Which
is why we encode some of our states the way we do.

Maybe it was just someone being very conservative.

Still if you are going to add new states to userspace and risk breaking
them can you do some basic analysis and report what ps and similar
programs do.

Simply changing userspace without even mentioning that you are changing
the userspace output of proc looks dangerous indeed.

Looking in the history commit 74e37200de8e ("proc: cleanup/simplify
get_task_state/task_state_array") seems to best document the concern
that userspace does not know how to handle new states.

The fact we have had a parked state for quite a few years despite that
concern seems to argue it is possible to extend the states. Or perhaps
it just argues that parked states are rare enough it does not matter.

It is definitely the case that the ps manpage documents the possible
states and as such they could be a part of anyone's shell scripts.

From the ps man page:
> Here are the different values that the s, stat and state output
> specifiers (header "STAT" or "S") will display to describe the
> state of a process:
>
> D uninterruptible sleep (usually IO)
> I Idle kernel thread
> R running or runnable (on run queue)
> S interruptible sleep (waiting for an event to complete)
> T stopped by job control signal
> t stopped by debugger during the tracing
> W paging (not valid since the 2.6.xx kernel)
> X dead (should never be seen)
> Z defunct ("zombie") process, terminated but not reaped by its parent
>

So it looks like a change that adds to the number of states in the
kernel should update the ps man page as well.

Eric

2022-01-20 19:19:51

by Valentin Schneider

Subject: Re: [PATCH v2 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

On 17/01/22 13:12, Eric W. Biederman wrote:
> Valentin Schneider <[email protected]> writes:
>> --- a/fs/proc/array.c
>> +++ b/fs/proc/array.c
>> @@ -128,9 +128,10 @@ static const char * const task_state_array[] = {
>> "X (dead)", /* 0x10 */
>> "Z (zombie)", /* 0x20 */
>> "P (parked)", /* 0x40 */
>> + "L (rt-locked)", /* 0x80 */
>>
>> /* states beyond TASK_REPORT: */
>> - "I (idle)", /* 0x80 */
>> + "I (idle)", /* 0x100 */
>> };
>
> I think this is at least possibly an ABI break. I have a vague memory
> that userspace is not ready to have new task states reported to it. Which
> is why we encode some of our states the way we do.
>
> Maybe it was just someone being very conservative.
>
> Still if you are going to add new states to userspace and risk breaking
> them can you do some basic analysis and report what ps and similar
> programs do.
>
> Simply changing userspace without even mentioning that you are changing
> the userspace output of proc looks dangerous indeed.
>

Yeah, you're right.

> Looking in the history commit 74e37200de8e ("proc: cleanup/simplify
> get_task_state/task_state_array") seems to best document the concern
> that userspace does not know how to handle new states.
>

Thanks for the sha1 and for digging around. Now, I read
74e37200de8e ("proc: cleanup/simplify get_task_state/task_state_array")
as "get_task_state() isn't clear vs what value is actually exposed to
userspace" rather than "get_task_state() could expose things userspace
doesn't know what to do with".

> The fact we have had a parked state for quite a few years despite that
> concern seems to argue it is possible to extend the states. Or perhaps
> it just argues that parked states are rare enough it does not matter.
>
> It is definitely the case that the ps manpage documents the possible
> states and as such they could be a part of anyone's shell scripts.
>

06eb61844d84 ("sched/debug: Add explicit TASK_IDLE printing") for instance
seems to suggest extending the states is OK, but you're right that this
then requires updating ps' manpage.

Alternatively, TASK_RTLOCK_WAIT could be masqueraded as
TASK_(UN)INTERRUPTIBLE when reported to userspace - it is actually somewhat
similar, unlike TASK_IDLE vs TASK_UNINTERRUPTIBLE for instance. The
handling in get_task_state() will be fugly, but it might be preferable over
exposing a detail userspace might not need to be made aware of?

> From the ps man page:
>> Here are the different values that the s, stat and state output
>> specifiers (header "STAT" or "S") will display to describe the
>> state of a process:
>>
>> D uninterruptible sleep (usually IO)
>> I Idle kernel thread
>> R running or runnable (on run queue)
>> S interruptible sleep (waiting for an event to complete)
>> T stopped by job control signal
>> t stopped by debugger during the tracing
>> W paging (not valid since the 2.6.xx kernel)
>> X dead (should never be seen)
>> Z defunct ("zombie") process, terminated but not reaped by its parent
>>
>
> So it looks like a change that adds to the number of states in the
> kernel should update the ps man page as well.
>
> Eric

2022-01-20 21:27:08

by Eric W. Biederman

Subject: Re: [PATCH v2 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

Valentin Schneider <[email protected]> writes:

> On 17/01/22 13:12, Eric W. Biederman wrote:
>> Valentin Schneider <[email protected]> writes:
>>> --- a/fs/proc/array.c
>>> +++ b/fs/proc/array.c
>>> @@ -128,9 +128,10 @@ static const char * const task_state_array[] = {
>>> "X (dead)", /* 0x10 */
>>> "Z (zombie)", /* 0x20 */
>>> "P (parked)", /* 0x40 */
>>> + "L (rt-locked)", /* 0x80 */
>>>
>>> /* states beyond TASK_REPORT: */
>>> - "I (idle)", /* 0x80 */
>>> + "I (idle)", /* 0x100 */
>>> };
>>
>> I think this is at least possibly an ABI break. I have a vague memory
>> that userspace is not ready to have new task states reported to it. Which
>> is why we encode some of our states the way we do.
>>
>> Maybe it was just someone being very conservative.
>>
>> Still if you are going to add new states to userspace and risk breaking
>> them can you do some basic analysis and report what ps and similar
>> programs do.
>>
>> Simply changing userspace without even mentioning that you are changing
>> the userspace output of proc looks dangerous indeed.
>>
>
> Yeah, you're right.
>
>> Looking in the history commit 74e37200de8e ("proc: cleanup/simplify
>> get_task_state/task_state_array") seems to best document the concern
>> that userspace does not know how to handle new states.
>>
>
> Thanks for the sha1 and for digging around. Now, I read
> 74e37200de8e ("proc: cleanup/simplify get_task_state/task_state_array")
> as "get_task_state() isn't clear vs what value is actually exposed to
> userspace" rather than "get_task_state() could expose things userspace
> doesn't know what to do with".

There is also commit abd50b39e783 ("wait: introduce EXIT_TRACE to avoid
the racy EXIT_DEAD->EXIT_ZOMBIE transition").

Which I think is more of what I was remembering.

>> The fact we have had a parked state for quite a few years despite that
>> concern seems to argue it is possible to extend the states. Or perhaps
>> it just argues that parked states are rare enough it does not matter.
>>
>> It is definitely the case that the ps manpage documents the possible
>> states and as such they could be a part of anyone's shell scripts.
>>
>
> 06eb61844d84 ("sched/debug: Add explicit TASK_IDLE printing") for instance
> seems to suggest extending the states is OK, but you're right that this
> then requires updating ps' manpage.
>
> Alternatively, TASK_RTLOCK_WAIT could be masqueraded as
> TASK_(UN)INTERRUPTIBLE when reported to userspace - it is actually somewhat
> similar, unlike TASK_IDLE vs TASK_UNINTERRUPTIBLE for instance. The
> handling in get_task_state() will be fugly, but it might be preferable over
> exposing a detail userspace might not need to be made aware of?

Right.

Frequently I have seen people do a cost/benefit analysis.

If the benefit is enough, and tracking down the userspace programs that
need to be verified to work with the change is inexpensive enough the
change is made. Always keeping in mind that if something was missed and
the change causes a regression the change will need to be reverted.

If there is little benefit or the cost to track down userspace is great
enough the work is put in to hide the change from userspace. Just
because it is too much trouble to expose it to userspace.

I honestly don't have any kind of sense about how hard it is to verify
that a userspace regression won't result from a change like this. I
just know that the question needs to be asked.

Eric

2022-01-21 19:57:52

by Valentin Schneider

Subject: Re: [PATCH v2 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

On 18/01/22 12:10, Eric W. Biederman wrote:
> Valentin Schneider <[email protected]> writes:
>>
>> Alternatively, TASK_RTLOCK_WAIT could be masqueraded as
>> TASK_(UN)INTERRUPTIBLE when reported to userspace - it is actually somewhat
>> similar, unlike TASK_IDLE vs TASK_UNINTERRUPTIBLE for instance. The
>> handling in get_task_state() will be fugly, but it might be preferable over
>> exposing a detail userspace might not need to be made aware of?
>
> Right.
>
> Frequently I have seen people do a cost/benefit analysis.
>
> If the benefit is enough, and tracking down the userspace programs that
> need to be verified to work with the change is inexpensive enough the
> change is made. Always keeping in mind that if something was missed and
> the change causes a regression the change will need to be reverted.
>
> If there is little benefit or the cost to track down userspace is great
> enough the work is put in to hide the change from userspace. Just
> because it is too much trouble to expose it to userspace.
>
> I honestly don't have any kind of sense about how hard it is to verify
> that a userspace regression won't result from a change like this. I
> just know that the question needs to be asked.
>

I see it as: does it actually make sense to expose a new state? All the
information this is conveying is: "this task took a lock that is
substituted by a sleepable lock under PREEMPT_RT". Now that you brought
this up, I don't really see much value in this vs just conveying that the
task is sleeping on a lock, i.e. just report the same as if it had gone
through rt_mutex_lock(), aka:

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d00837d12b9d..ac7b3eef4a61 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1626,6 +1626,14 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
if (tsk_state == TASK_IDLE)
state = TASK_REPORT_IDLE;

+ /*
+ * We're lying here, but rather than expose a completely new task state
+ * to userspace, we can make this appear as if the task had gone through
+ * a regular rt_mutex_lock() call.
+ */
+ if (tsk_state == TASK_RTLOCK_WAIT)
+ state = TASK_UNINTERRUPTIBLE;
+
return fls(state);
}


2022-01-21 19:58:41

by Eric W. Biederman

Subject: Re: [PATCH v2 2/2] sched/tracing: Add TASK_RTLOCK_WAIT to TASK_REPORT

Valentin Schneider <[email protected]> writes:

> On 18/01/22 12:10, Eric W. Biederman wrote:
>> Valentin Schneider <[email protected]> writes:
>>>
>>> Alternatively, TASK_RTLOCK_WAIT could be masqueraded as
>>> TASK_(UN)INTERRUPTIBLE when reported to userspace - it is actually somewhat
>>> similar, unlike TASK_IDLE vs TASK_UNINTERRUPTIBLE for instance. The
>>> handling in get_task_state() will be fugly, but it might be preferable over
>>> exposing a detail userspace might not need to be made aware of?
>>
>> Right.
>>
>> Frequently I have seen people do a cost/benefit analysis.
>>
>> If the benefit is enough, and tracking down the userspace programs that
>> need to be verified to work with the change is inexpensive enough the
>> change is made. Always keeping in mind that if something was missed and
>> the change causes a regression the change will need to be reverted.
>>
>> If there is little benefit or the cost to track down userspace is great
>> enough the work is put in to hide the change from userspace. Just
>> because it is too much trouble to expose it to userspace.
>>
>> I honestly don't have any kind of sense about how hard it is to verify
>> that a userspace regression won't result from a change like this. I
>> just know that the question needs to be asked.
>>
>
> I see it as: does it actually make sense to expose a new state? All the
> information this is conveying is: "this task took a lock that is
> substituted by a sleepable lock under PREEMPT_RT". Now that you brought
> this up, I don't really see much value in this vs just conveying that the
> task is sleeping on a lock, i.e. just report the same as if it had gone
> through rt_mutex_lock(), aka:

That seems reasonable to me.

Eric

>
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d00837d12b9d..ac7b3eef4a61 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1626,6 +1626,14 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
> if (tsk_state == TASK_IDLE)
> state = TASK_REPORT_IDLE;
>
> + /*
> + * We're lying here, but rather than expose a completely new task state
> + * to userspace, we can make this appear as if the task had gone through
> + * a regular rt_mutex_lock() call.
> + */
> + if (tsk_state == TASK_RTLOCK_WAIT)
> + state = TASK_UNINTERRUPTIBLE;
> +
> return fls(state);
> }
>