2008-07-27 12:12:17

by Oleg Nesterov

[permalink] [raw]
Subject: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

Without CONFIG_SMP, wait_task_inactive() is a no-op; this doesn't look right.
Shouldn't we also take CONFIG_PREEMPT into account?

Not that it really matters, just curious. kthread_bind() itself could be a
no-op without CONFIG_SMP, and ptrace_check_attach() shouldn't have real
problems, but still.


Also, the !SMP version of wait_task_inactive() always returns 1, which
doesn't conform to the comment near kernel/sched.c:wait_task_inactive().
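
For reference, the !SMP definition in question is essentially this (a sketch
of the include/linux/sched.h definition; the exact prototype may differ):

#ifdef CONFIG_SMP
extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
#else
static inline unsigned long wait_task_inactive(struct task_struct *p,
					       long match_state)
{
	return 1;	/* always claims success */
}
#endif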

Oleg.


2008-07-27 16:55:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT



On Sun, 27 Jul 2008, Oleg Nesterov wrote:
>
> Without CONFIG_SMP, wait_task_inactive() is a no-op; this doesn't look right.
> Shouldn't we also take CONFIG_PREEMPT into account?

The exit path had _better_ be non-preemptable until it hits the final
schedule that disables it (and that wait_task_inactive() was waiting for).
At least that used to be the rule.

Of course, historically the only user for this was just the "wait for
exit to complete" case. And that isn't really true any more, I guess. So I
dunno.

Linus

2008-07-27 20:06:13

by Roland McGrath

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

> Without CONFIG_SMP, wait_task_inactive() is a no-op; this doesn't look right.
> Shouldn't we also take CONFIG_PREEMPT into account?

wait_task_inactive is only called when task->state is nonzero (i.e. not
TASK_RUNNING). Preemption leaves a task in TASK_RUNNING, so a preempted
task shouldn't ever be passed to wait_task_inactive. I don't see the problem.

> Also, the !SMP version of wait_task_inactive() always returns 1, which
> doesn't conform to the comment near kernel/sched.c:wait_task_inactive().

You mean the "(its total switch count)" part of the comment?
The normative part was only meant to be "a positive number".


Thanks,
Roland

2008-07-28 12:53:39

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

On 07/27, Roland McGrath wrote:
>
> > Without CONFIG_SMP, wait_task_inactive() is a no-op; this doesn't look right.
> > Shouldn't we also take CONFIG_PREEMPT into account?
>
> wait_task_inactive is only called when task->state is nonzero (i.e. not
> TASK_RUNNING). Preemption leaves a task in TASK_RUNNING, so a preempted
> task shouldn't ever be passed to wait_task_inactive.

No, schedule() doesn't change prev->state when a task with ->state !=
TASK_RUNNING is preempted. Note this check

if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

in schedule().
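
For context, the surrounding logic in schedule() is roughly this (abridged
sketch, the signal_pending special case omitted):

	switch_count = &prev->nivcsw;
	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		/* a real sleep: leave the runqueue, count a voluntary switch */
		deactivate_task(rq, prev, 1);
		switch_count = &prev->nvcsw;
	}
	/* under PREEMPT_ACTIVE, prev stays on the runqueue and keeps its
	   ->state, even when that state is not TASK_RUNNING */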

Let's suppose the child does ptrace_stop(). It sets state = TASK_TRACED
and unlocks ->siglock.

If it is preempted by the parent which does ptrace_check_attach(),
wait_task_inactive() must wait until the child leaves the runqueue,
but the dummy version just returns success.

sys_ptrace() continues assuming that the child sleeps in TASK_TRACED,
while in fact it is running, despite its ->state == TASK_TRACED.
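
In other words, the interleaving on UP is:

	child:	ptrace_stop() sets ->state = TASK_TRACED, unlocks ->siglock,
		and is preempted while still on the runqueue
	parent:	ptrace_check_attach() calls wait_task_inactive(), and the
		dummy version returns 1 immediately
	parent:	sys_ptrace() proceeds, but the child is still runnable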


As I said, nothing really bad can happen, the child can't return to
user space or anything, but this just means that ptrace_check_attach()
afaics doesn't have strong reasons for wait_task_inactive().

> > Also, the !SMP version of wait_task_inactive() always returns 1, which
> > doesn't conform to the comment near kernel/sched.c:wait_task_inactive().
>
> You mean the "(its total switch count)" part of the comment?
> The normative part was only meant to be "a positive number".

I refer to this part of the comment:

If a second call a short while later returns the same number, the
caller can be sure that @p has remained unscheduled the whole time.

The dummy version always returns the same number == 1.


So. I think that wait_task_inactive() needs "defined(SMP) || defined(PREEMPT)"
and the dummy version should return ->nvcsw too.

Oleg.

2008-07-28 23:40:00

by Roland McGrath

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

> If it is preempted by the parent which does ptrace_check_attach(),
> wait_task_inactive() must wait until the child leaves the runqueue,
> but the dummy version just returns success.

I see your point.

> sys_ptrace() continues assuming that the child sleeps in TASK_TRACED,
> while in fact it is running, despite its ->state == TASK_TRACED.

For ptrace, the only real expectation has ever been that it's no longer on
the physical CPU, i.e. we are not racing with context switch itself. On a
uniprocessor, this can of course never happen.

The historical picture is that the preemption issue wasn't thought about
much. ptrace has always used lock_kernel(), and mostly this implied
disabling preemption anyway (there was CONFIG_PREEMPT_BKL for a while).
So it's moot there.

Even if preemption were an issue for ptrace, it's not a problem. All that
matters is that the tracee is not going to run any code that changes the
thread machine state ptrace accesses (pt_regs, thread.foo, etc). If ptrace
gets preempted, the tracee gets switched in, gets switched back out, and
the ptrace-calling thread gets switched back in, there is no problem. All
the flutter on the kernel memory ptrace might touch took place during the
context switches themselves, and every byte was back in the same place
between when ptrace got preempted and when it resumed its next instruction.

I can't speak to the kthread case. I suspect that set_task_cpu() is always
safe on !SMP PREEMPT, and that's why it's fine. But I'm not really sure.

> I refer to this part of the comment:
>
> If a second call a short while later returns the same number, the
> caller can be sure that @p has remained unscheduled the whole time.
>
> The dummy version always returns the same number == 1.

Right. For the general case where this is the contract wait_task_inactive
is expected to meet, it does matter. I think task_current_syscall() does
want this checked for the preempted uniprocessor case, for example.
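
The pattern there is roughly this (a sketch; sample_registers() is a
stand-in for the real register-sampling step):

	long state = target->state;	/* caller saw a blocked state */
	unsigned long ncsw;

	if (!state)
		return -EAGAIN;

	ncsw = wait_task_inactive(target, state);
	if (!ncsw ||
	    sample_registers(target) ||
	    wait_task_inactive(target, state) != ncsw)
		return -EAGAIN;		/* the target ran in between */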

> So. I think that wait_task_inactive() needs "defined(SMP) || defined(PREEMPT)"
> and the dummy version should return ->nvcsw too.

Is this what we want?

#ifdef CONFIG_SMP
extern unsigned long wait_task_inactive(struct task_struct *, long);
#else
static inline unsigned long wait_task_inactive(struct task_struct *p,
					       long match_state)
{
	unsigned long ret = 0;
	if (match_state) {
		preempt_disable();
		if (p->state == match_state)
			ret = (p->nvcsw << 1) | 1;
		preempt_enable();
	}
	return ret;
}
#endif


Thanks,
Roland

2008-07-29 12:18:14

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

On 07/28, Roland McGrath wrote:
>
> I can't speak to the kthread case. I suspect that set_task_cpu() is always
> safe on !SMP PREEMPT, and that's why it's fine.

Yes, kthread_bind() is fine, it changes nothing in *k if !SMP.

> > I refer to this part of the comment:
> >
> > If a second call a short while later returns the same number, the
> > caller can be sure that @p has remained unscheduled the whole time.
> >
> > The dummy version always returns the same number == 1.
>
> Right. For the general case where this is the contract wait_task_inactive
> is expected to meet, it does matter. I think task_current_syscall() does
> want this checked for the preempted uniprocessor case, for example.
>
> > So. I think that wait_task_inactive() needs "defined(SMP) || defined(PREEMPT)"
> > and the dummy version should return ->nvcsw too.
>
> Is this what we want?
>
> #ifdef CONFIG_SMP
> extern unsigned long wait_task_inactive(struct task_struct *, long);
> #else
> static inline unsigned long wait_task_inactive(struct task_struct *p,
> 					       long match_state)
> {
> 	unsigned long ret = 0;
> 	if (match_state) {
> 		preempt_disable();
> 		if (p->state == match_state)
> 			ret = (p->nvcsw << 1) | 1;
> 		preempt_enable();
> 	}
> 	return ret;
> }
> #endif

I don't think this is right.

Firstly, the above always fails if match_state == 0; this is not right.

But more importantly, we can't just check ->state == match_state. And
preempt_disable() buys nothing.

Let's look at task_current_syscall(). The "target" can set, say,
TASK_UNINTERRUPTIBLE many times, do a lot of syscalls, and not once
call schedule().

And the task remains fully preemptible even if it runs in
TASK_UNINTERRUPTIBLE state.


Let's suppose we implement kthread_set_nice() in the same manner
as kthread_bind(),

void kthread_set_nice(struct task_struct *k, long nice)
{
	wait_task_inactive(k, TASK_UNINTERRUPTIBLE);

	... just change ->prio/static_prio ...
}

the above is ugly of course, but it should be correct even with
!SMP && PREEMPT.


I think we need

#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
extern unsigned long wait_task_inactive(struct task_struct *, long);
#else
static inline unsigned long wait_task_inactive(struct task_struct *p,
					       long match_state)
{
	if (match_state && p->state != match_state)
		return 0;
	return p->nvcsw | LONG_MIN;	/* the same trick as in sched.c */
}
#endif

Oleg.

2008-08-01 01:28:34

by Roland McGrath

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

> I don't think this is right.
>
> Firstly, the above always fails if match_state == 0; this is not right.

A call with 0 is the "legacy case", where the return value is 0 and nothing
but the traditional wait_task_inactive behavior is expected. On UP, this
was a no-op before and still is.
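
I.e., a legacy caller just does something like

	wait_task_inactive(child, 0);	/* wait until it's off the CPU */

and ignores the return value.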

Anyway, this is moot since we are soon to have no callers that pass 0.

> But more importantly, we can't just check ->state == match_state. And
> preempt_disable() buys nothing.

It ensures that the samples of ->state and ->nvcsw both came while the
target could never have run in between. Without it, a preemption after the
->state check could mean the ->nvcsw value we use is from a later block in
a different state than the one intended.

> Let's look at task_current_syscall(). The "target" can set, say,
> TASK_UNINTERRUPTIBLE many times, do a lot of syscalls, and not once
> call schedule().
>
> And the task remains fully preemptible even if it runs in
> TASK_UNINTERRUPTIBLE state.

One of us is missing something basic. We are on the only CPU. If target
does *anything*, it means we got preempted, then target switched in, did
things, and then called schedule (including via preemption)--so that we
could possibly be running again now afterwards. That schedule call bumped
the counter after we sampled it. The second call done for "is it still
blocked afterwards?" will see a different count and abort. Am I confused?

Ah, I think it was me who was missing something when I let you talk me into
checking only ->nvcsw. It really should be ->nivcsw + ->nvcsw as I had it
originally (| LONG_MIN as you've done, a good trick). That makes what I
just said true in the preemption case. This bit:

if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {

will not hit, so switch_count = &prev->nivcsw; remains from before.
This is why it was nivcsw + nvcsw to begin with.

What am I missing here?


Thanks,
Roland

2008-08-01 11:46:35

by Oleg Nesterov

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

On 07/31, Roland McGrath wrote:
>
> > I don't think this is right.
> >
> > Firstly, the above always fails if match_state == 0; this is not right.
>
> A call with 0 is the "legacy case", where the return value is 0 and nothing
> but the traditional wait_task_inactive behavior is expected. On UP, this
> was a nop before and still is.
>
> Anyway, this is moot since we are soon to have no callers that pass 0.

This means we can't use wait_task_inactive(p) unless we know p->state
is already != TASK_RUNNING. OK, the current callers assume exactly this.

But perhaps we should change the state checks from "==" to "&"; note
that pfm_task_incompatible() checks task_is_stopped_or_traced().

> > But more importantly, we can't just check ->state == match_state. And
> > preempt_disable() buys nothing.
>
> It ensures that the samples of ->state and ->nvcsw both came while the
> target could never have run in between. Without it, a preemption after the
> ->state check could mean the ->nvcsw value we use is from a later block in
> a different state than the one intended.

I meant it buys nothing because the task can set its state to TASK_RUNNING
right after preempt_enable(): the task can be runnable despite the fact
that its state is (say) TASK_UNINTERRUPTIBLE.

Please see below.

> > Let's look at task_current_syscall(). The "target" can set, say,
> > TASK_UNINTERRUPTIBLE many times, do a lot of syscalls, and not once
> > call schedule().
> >
> > And the task remains fully preemptible even if it runs in
> > TASK_UNINTERRUPTIBLE state.
>
> One of us is missing something basic. We are on the only CPU. If target
> does *anything*, it means we got preempted, then target switched in, did
> things, and then called schedule (including via preemption)--so that we
> could possibly be running again now afterwards. That schedule call bumped
> the counter after we sampled it. The second call done for "is it still
> blocked afterwards?" will see a different count and abort. Am I confused?

I think we misunderstood each other. I never said the _current_ callers have
problems (except the dummy version can't just return 1 of course).

I proposed to synchronize the two implementations to avoid the possible
problems, and personally I still dislike the fact that the !SMP version
has very different behaviour.

But yes, if we only want to "fix" the current callers, we can make the !SMP
version

static inline unsigned long wait_task_inactive(struct task_struct *p,
					       long match_state)
{
	int ret = 0;

	preempt_disable();
	if (!match_state || p->state == match_state)
		ret = (p->nvcsw + p->nivcsw) | LONG_MIN;
	preempt_enable();

	return ret;
}

> Ah, I think it was me who was missing something when I let you talk me into
> checking only ->nvcsw. It really should be ->nivcsw + ->nvcsw as I had it
> originally (| LONG_MIN as you've done, a good trick). That makes what I
> just said true in the preemption case. This bit:
>
> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>
> will not hit, so switch_count = &prev->nivcsw; remains from before.
> This is why it was nivcsw + nvcsw to begin with.

This is fine on SMP, afaics.

More precisely, this is fine if wait_task_inactive(p) returns success only
when p->se.on_rq == 0, we shouldn't worry about preemption (nivcsw) at all.

Let's forget about the overflow, and suppose that 2 subsequent calls return
the same ->nvcsw. Every time the task leaves the runqueue (so that !.on_rq
becomes true), schedule() bumps its ->nvcsw. If the task was scheduled in
between we must notice this, because the second success means the task
was deactivate_task()'ed again.

We shouldn't look at nivcsw at all. While the task is deactivated, its
nivcsw/nvcsw can't be changed. If it is scheduled again, it must increment
->nvcsw before wait_task_inactive() can return success.
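
To spell out the sequence:

	1st call: p->se.on_rq == 0, p->nvcsw == N	-> success
		(if p is activated and runs, the next deactivate_task()
		 in schedule() bumps ->nvcsw to N+1)
	2nd call: p->se.on_rq == 0, p->nvcsw == N	-> p can't have been
		scheduled in between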

(Dmitry cc'ed to check my understanding).

Oleg.

2008-08-01 21:27:49

by Roland McGrath

[permalink] [raw]
Subject: Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

> This means we can't use wait_task_inactive(p) unless we know p->state
> is already != TASK_RUNNING. OK, the current callers assume exactly this.

Correct.

> But perhaps we should change the state checks from "==" to "&"; note
> that pfm_task_incompatible() checks task_is_stopped_or_traced().

No, the intent is "not changed at all", not "in some acceptable state". So
the caller should sample ->state, decide that value is ok, and pass that
same value in. Then wait_task_inactive succeeds only if exactly that value
stayed in place and the target got off the CPU. (If it came out and went
back into the exact same state before we got into wait_task_inactive, we
don't really care. The callers have outside means to know that either is
good enough for their purposes, or is impossible.)

> I meant it buys nothing because the task can set its state to TASK_RUNNING
> right after preempt_enable(): the task can be runnable despite the fact
> that its state is (say) TASK_UNINTERRUPTIBLE.

Right, but we don't care about that. The predicate really is "says
expected ->state, and is off the CPU", not "... and is not runnable".

> I think we misunderstood each other. I never said the _current_ callers have
> problems (except the dummy version can't just return 1 of course).

Callers like the current ones are the only reasons we have this function.
Its only purpose is to meet those needs. If those requirements are not as
strong as a "generic" specification of what the function does, then we can
just change the specification.

> I proposed to synchronize the two implementations to avoid the possible
> problems, and personally I still dislike the fact that the !SMP version
> has very different behaviour.

It's only "very different" if what you call its "behavior" is an abstract
specification based on what the SMP version happens to do, rather than
saying "its behavior" is just to meet the guarantees I set out earlier.
Meeting those guarantees is what the function is for. So it should do the
cheapest thing to satisfies them in each configuration.

> But yes, if we only want to "fix" the current callers, we can make the !SMP
> version
[...]
> 	int ret = 0;

unsigned long

>
> 	preempt_disable();
> 	if (!match_state || p->state == match_state)
> 		ret = (p->nvcsw + p->nivcsw) | LONG_MIN;
> 	preempt_enable();

In the !match_state case (nearly moot now anyway), there's no need to set
the return value. Callers passing 0 never expected a return value.

> This is fine on SMP, afaics.
>
> More precisely, this is fine if wait_task_inactive(p) returns success only
> when p->se.on_rq == 0, we shouldn't worry about preemption (nivcsw) at all.

Ok. I was looking at the !SMP PREEMPT context when I said that. I don't
dispute your rationale about the SMP case; it relies on scheduler innards
that I don't know at all well.


Thanks,
Roland