2009-07-07 23:58:24

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

Using early netconsole and gianfar driver this error pops up:

netconsole: timeout waiting for carrier

It appears that net/core/netpoll.c:netpoll_setup() is using
cond_resched() in a loop waiting for a carrier.

The thing is that cond_resched() is a no-op when system_state !=
SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
scheduled, therefore link detection doesn't work.

This patch fixes the issue by removing SYSTEM_RUNNING checks, which
might be dangerous to do, but looking at 'git blame' I can see that
the first SYSTEM_RUNNING check was introduced in this commit:

commit 8ba7b0a14b2ec19583bedbcdbea7f1c5008fc922
Author: Linus Torvalds <[email protected]>
Date: Mon Mar 6 17:38:49 2006 -0800

Add early-boot-safety check to cond_resched()

Just to be safe, we should not trigger a conditional reschedule during
the early boot sequence. We've historically done some questionable
early on, and the safety warnings in __might_sleep() are generally
turned off during that period, so there might be problems lurking.

This affects CONFIG_PREEMPT_VOLUNTARY, which takes over might_sleep() to
cause a voluntary conditional reschedule.

The commit doesn't mention that it fixes some particular problem,
though it surely breaks netconsole for drivers that are using phylib.

I tested the patch with CONFIG_PREEMPT_VOLUNTARY (on PowerPC), and it
didn't cause any issues.

Surely, the other way to fix the netconsole is to call schedule() or
msleep() directly, but I'm curious if removing the checks makes sense
nowadays.

Signed-off-by: Anton Vorontsov <[email protected]>
---
kernel/sched.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7c9098d..3899baa 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6560,8 +6560,7 @@ static void __cond_resched(void)

int __sched _cond_resched(void)
{
- if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
- system_state == SYSTEM_RUNNING) {
+ if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE)) {
__cond_resched();
return 1;
}
@@ -6579,7 +6578,7 @@ EXPORT_SYMBOL(_cond_resched);
*/
int cond_resched_lock(spinlock_t *lock)
{
- int resched = need_resched() && system_state == SYSTEM_RUNNING;
+ int resched = need_resched();
int ret = 0;

if (spin_needbreak(lock) || resched) {
@@ -6599,7 +6598,7 @@ int __sched cond_resched_softirq(void)
{
BUG_ON(!in_softirq());

- if (need_resched() && system_state == SYSTEM_RUNNING) {
+ if (need_resched()) {
local_bh_enable();
__cond_resched();
local_bh_disable();
--
1.6.3.3


2009-07-08 00:53:38

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

On 07/08, Anton Vorontsov wrote:
>
> but I'm curious if removing the checks makes sense
> nowadays.
...
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -6560,8 +6560,7 @@ static void __cond_resched(void)
>
> int __sched _cond_resched(void)
> {
> - if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
> - system_state == SYSTEM_RUNNING) {
> + if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE)) {
> __cond_resched();
> return 1;
> }

and, with CONFIG_PREEMPT preempt_schedule() does not check system_state,
so it looks really strange cond_resched() does check SYSTEM_RUNNING.



debug_smp_processor_id() looks strange too:

/*
* It is valid to assume CPU-locality during early bootup:
*/
if (system_state != SYSTEM_RUNNING)
goto out;

this doesn't look right, smp_init() is called before we set
SYSTEM_RUNNING.

Hmm, and

/*
* Kernel threads bound to a single CPU can safely use
* smp_processor_id():
*/
if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
goto out;

perhaps this should use PF_THREAD_BOUND ?

Oleg.

2009-07-08 06:24:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

On Wed, 2009-07-08 at 02:50 +0200, Oleg Nesterov wrote:

> /*
> * It is valid to assume CPU-locality during early bootup:
> */
> if (system_state != SYSTEM_RUNNING)
> goto out;
>
> this doesn't look right, smp_init() is called before we set
> SYSTEM_RUNNING.

The thing is, there's also ton's of code that might end up calling
cond_resched() and co before the scheduler is fully initialized. Doing
so would indeed mess things up.

Also, by definition we'd have to call smp_init() before SYSTEM_RUNNING,
because you simply cannot declare a system up and running when your core
functionality isn't initialized.

So I'd really rather preserve these checks -- I can even remember
running into some of these things a while back, but memory isn't
providing specific cases.

> Hmm, and
>
> /*
> * Kernel threads bound to a single CPU can safely use
> * smp_processor_id():
> */
> if (cpumask_equal(&current->cpus_allowed, cpumask_of(this_cpu)))
> goto out;
>
> perhaps this should use PF_THREAD_BOUND ?

That might predate PF_THREAD_BOUND, also I think this is more generic,
and I think we used it for that set_affinity dance we did Rusty 'fixed'
a while back.

2009-07-08 12:03:18

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

On Wed, Jul 08, 2009 at 08:24:23AM +0200, Peter Zijlstra wrote:
> On Wed, 2009-07-08 at 02:50 +0200, Oleg Nesterov wrote:
>
> > /*
> > * It is valid to assume CPU-locality during early bootup:
> > */
> > if (system_state != SYSTEM_RUNNING)
> > goto out;
> >
> > this doesn't look right, smp_init() is called before we set
> > SYSTEM_RUNNING.
>
> The thing is, there's also ton's of code that might end up calling
> cond_resched() and co before the scheduler is fully initialized.

Hm. Speaking of cond_resched*() only, then it should be pretty
safe to convert the SYSTEM_RUNNING checks to scheduler_running,
no? scheduler_running is set after sched_init().

Something like this

diff --git a/kernel/sched.c b/kernel/sched.c
index 7c9098d..555360b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6561,7 +6561,7 @@ static void __cond_resched(void)
int __sched _cond_resched(void)
{
if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
- system_state == SYSTEM_RUNNING) {
+ scheduler_running) {
__cond_resched();
return 1;
}
@@ -6579,7 +6579,7 @@ EXPORT_SYMBOL(_cond_resched);
*/
int cond_resched_lock(spinlock_t *lock)
{
- int resched = need_resched() && system_state == SYSTEM_RUNNING;
+ int resched = need_resched() && scheduler_running;
int ret = 0;

if (spin_needbreak(lock) || resched) {
@@ -6599,7 +6599,7 @@ int __sched cond_resched_softirq(void)
{
BUG_ON(!in_softirq());

- if (need_resched() && system_state == SYSTEM_RUNNING) {
+ if (need_resched() && scheduler_running) {
local_bh_enable();
__cond_resched();
local_bh_disable();

2009-07-08 12:12:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

On Wed, 2009-07-08 at 16:03 +0400, Anton Vorontsov wrote:
> On Wed, Jul 08, 2009 at 08:24:23AM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-07-08 at 02:50 +0200, Oleg Nesterov wrote:
> >
> > > /*
> > > * It is valid to assume CPU-locality during early bootup:
> > > */
> > > if (system_state != SYSTEM_RUNNING)
> > > goto out;
> > >
> > > this doesn't look right, smp_init() is called before we set
> > > SYSTEM_RUNNING.
> >
> > The thing is, there's also ton's of code that might end up calling
> > cond_resched() and co before the scheduler is fully initialized.
>
> Hm. Speaking of cond_resched*() only, then it should be pretty
> safe to convert the SYSTEM_RUNNING checks to scheduler_running,
> no? scheduler_running is set after sched_init().

Hmm, that might work, I'd have to audit sched_init_smp() as it seems to
do way too much...

2009-07-08 12:55:37

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

On Wed, Jul 08, 2009 at 02:12:25PM +0200, Peter Zijlstra wrote:
> On Wed, 2009-07-08 at 16:03 +0400, Anton Vorontsov wrote:
> > On Wed, Jul 08, 2009 at 08:24:23AM +0200, Peter Zijlstra wrote:
> > > On Wed, 2009-07-08 at 02:50 +0200, Oleg Nesterov wrote:
> > >
> > > > /*
> > > > * It is valid to assume CPU-locality during early bootup:
> > > > */
> > > > if (system_state != SYSTEM_RUNNING)
> > > > goto out;
> > > >
> > > > this doesn't look right, smp_init() is called before we set
> > > > SYSTEM_RUNNING.
> > >
> > > The thing is, there's also ton's of code that might end up calling
> > > cond_resched() and co before the scheduler is fully initialized.
> >
> > Hm. Speaking of cond_resched*() only, then it should be pretty
> > safe to convert the SYSTEM_RUNNING checks to scheduler_running,
> > no? scheduler_running is set after sched_init().
>
> Hmm, that might work, I'd have to audit sched_init_smp() as it seems to
> do way too much...

sched_init_smp() is called from the kernel_thread(), so
if the scheduler is not functional prior to kernel_thread(),
you're in trouble anyway, no? The point is that a lot of
code is calling schedule() prior to sched_init_smp()
(e.g. msleep(), mutexes), and there are no issues. So
should be no issues with cond_resched()?

--
Anton Vorontsov
email: [email protected]
irc://irc.freenode.net/bd2

2009-07-08 12:59:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

On Wed, 2009-07-08 at 16:55 +0400, Anton Vorontsov wrote:
> On Wed, Jul 08, 2009 at 02:12:25PM +0200, Peter Zijlstra wrote:
> > On Wed, 2009-07-08 at 16:03 +0400, Anton Vorontsov wrote:
> > > On Wed, Jul 08, 2009 at 08:24:23AM +0200, Peter Zijlstra wrote:
> > > > On Wed, 2009-07-08 at 02:50 +0200, Oleg Nesterov wrote:
> > > >
> > > > > /*
> > > > > * It is valid to assume CPU-locality during early bootup:
> > > > > */
> > > > > if (system_state != SYSTEM_RUNNING)
> > > > > goto out;
> > > > >
> > > > > this doesn't look right, smp_init() is called before we set
> > > > > SYSTEM_RUNNING.
> > > >
> > > > The thing is, there's also ton's of code that might end up calling
> > > > cond_resched() and co before the scheduler is fully initialized.
> > >
> > > Hm. Speaking of cond_resched*() only, then it should be pretty
> > > safe to convert the SYSTEM_RUNNING checks to scheduler_running,
> > > no? scheduler_running is set after sched_init().
> >
> > Hmm, that might work, I'd have to audit sched_init_smp() as it seems to
> > do way too much...
>
> sched_init_smp() is called from the kernel_thread(), so
> if the scheduler is not functional prior to kernel_thread(),
> you're in trouble anyway, no? The point is that a lot of
> code is calling schedule() prior to sched_init_smp()
> (e.g. msleep(), mutexes), and there are no issues. So
> should be no issues with cond_resched()?

Yeah, it should be good, I just got paranoid looking at
sched_init_smp().

2009-07-08 16:15:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()



On Wed, 8 Jul 2009, Peter Zijlstra wrote:

> On Wed, 2009-07-08 at 02:50 +0200, Oleg Nesterov wrote:
>
> > /*
> > * It is valid to assume CPU-locality during early bootup:
> > */
> > if (system_state != SYSTEM_RUNNING)
> > goto out;
> >
> > this doesn't look right, smp_init() is called before we set
> > SYSTEM_RUNNING.
>
> The thing is, there's also ton's of code that might end up calling
> cond_resched() and co before the scheduler is fully initialized. Doing
> so would indeed mess things up.

I forget what triggered me to add some of them, but we definitely had bugs
with the scheduler being called early, and having some really nasty oopses
happen early in the boot (and that early, they don't get saved, so the
reporting percentage goes down to something very low).

So I think that the patch Anton posted is probably the RightThing(tm) to
do, and in that sense I like it - but they make me very nervous. There's a
lot of "cond_resched()"s sprinkled about in helper routines, and if they
are ever called early, you're basically screwed.

> Also, by definition we'd have to call smp_init() before SYSTEM_RUNNING,
> because you simply cannot declare a system up and running when your core
> functionality isn't initialized.

Now, that part I'm not sure about.

We can consider a UP system to be running, and brining up the other CPU's
to be a matter of CPU hotplug.

It's not what we do _now_, of course. We don't force hotplug support on
people just because they want SMP, and we don't want to go through the
whole "rewrite locks" things twice etc.

> So I'd really rather preserve these checks -- I can even remember
> running into some of these things a while back, but memory isn't
> providing specific cases.

Yes. I get nervous too about the patch.

That said, I do agree that maybe SYSTEM_RUNNING isn't the right check.
Testing that the scheduler is initialized may be the more correct one. I
think the SYSTEM_RUNNING one just comes from that being used for other
debug issues.

Linus

2009-07-08 20:45:56

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH] sched: Make cond_resched*() available earlier

Using early netconsole and gianfar driver this error pops up:

netconsole: timeout waiting for carrier

It appears that net/core/netpoll.c:netpoll_setup() is using
cond_resched() in a loop waiting for a carrier.

The thing is that cond_resched() is a no-op when system_state !=
SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
scheduled, therefore link detection doesn't work.

The system_state check exists to avoid very early calls to the
scheduler. Though, using cond_resched() should be safe after scheduler
is initialized, so instead of checking the system_state, we can test
that the scheduler is running, therefore making cond_resched() and
friends available much earlier (but not too much).

Signed-off-by: Anton Vorontsov <[email protected]>
---

On Wed, Jul 08, 2009 at 02:58:46PM +0200, Peter Zijlstra wrote:
[...]
> > > > Hm. Speaking of cond_resched*() only, then it should be pretty
> > > > safe to convert the SYSTEM_RUNNING checks to scheduler_running,
> > > > no? scheduler_running is set after sched_init().
> > >
> > > Hmm, that might work, I'd have to audit sched_init_smp() as it seems to
> > > do way too much...
> >
> > sched_init_smp() is called from the kernel_thread(), so
> > if the scheduler is not functional prior to kernel_thread(),
> > you're in trouble anyway, no? The point is that a lot of
> > code is calling schedule() prior to sched_init_smp()
> > (e.g. msleep(), mutexes), and there are no issues. So
> > should be no issues with cond_resched()?
>
> Yeah, it should be good, I just got paranoid looking at
> sched_init_smp().

OK, thanks everyone for the reviews.

Here is an updated patch. As usual, tested on UP PowerPC, with and
without SMP, PREEMPT and PREEMPT_VOLUNTARY.

kernel/sched.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7c9098d..555360b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6561,7 +6561,7 @@ static void __cond_resched(void)
int __sched _cond_resched(void)
{
if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
- system_state == SYSTEM_RUNNING) {
+ scheduler_running) {
__cond_resched();
return 1;
}
@@ -6579,7 +6579,7 @@ EXPORT_SYMBOL(_cond_resched);
*/
int cond_resched_lock(spinlock_t *lock)
{
- int resched = need_resched() && system_state == SYSTEM_RUNNING;
+ int resched = need_resched() && scheduler_running;
int ret = 0;

if (spin_needbreak(lock) || resched) {
@@ -6599,7 +6599,7 @@ int __sched cond_resched_softirq(void)
{
BUG_ON(!in_softirq());

- if (need_resched() && system_state == SYSTEM_RUNNING) {
+ if (need_resched() && scheduler_running) {
local_bh_enable();
__cond_resched();
local_bh_disable();
--
1.6.3.3

2009-07-08 21:11:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

> On Wed, 8 Jul 2009 09:12:30 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
> That said, I do agree that maybe SYSTEM_RUNNING isn't the right check.
> Testing that the scheduler is initialized may be the more correct one. I
> think the SYSTEM_RUNNING one just comes from that being used for other
> debug issues.

Agreed. system_state is too general.

If we specifically want to know whether it is safe to call schedule() then
let's create a global boolean it_is_safe_to_call_schedule and test that,
rather than testing something which indirectly and unreliably implies "it
is safe to call schedule". If that boolean already exists then no-brainer.

All that being said, I wonder if the netconsole code should be using
msleep(1) instead. Spinning on cond_resched() is a bit rude. But one
would have to verify that it is safe to call schedule() at this time, and
for the netconsole caller, this is dubious.

2009-07-08 21:33:39

by Anton Vorontsov

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

On Wed, Jul 08, 2009 at 02:10:24PM -0700, Andrew Morton wrote:
> > On Wed, 8 Jul 2009 09:12:30 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
> > That said, I do agree that maybe SYSTEM_RUNNING isn't the right check.
> > Testing that the scheduler is initialized may be the more correct one. I
> > think the SYSTEM_RUNNING one just comes from that being used for other
> > debug issues.
>
> Agreed. system_state is too general.
>
> If we specifically want to know whether it is safe to call schedule() then
> let's create a global boolean it_is_safe_to_call_schedule and test that,
> rather than testing something which indirectly and unreliably implies "it
> is safe to call schedule". If that boolean already exists then no-brainer.
>
> All that being said, I wonder if the netconsole code should be using
> msleep(1) instead. Spinning on cond_resched() is a bit rude. But one
> would have to verify that it is safe to call schedule() at this time, and
> for the netconsole caller, this is dubious.

What do you mean by "verify that it is safe"? If it works,
can I assume that it's safe? ;-) It works, fwiw.

Thanks,

--
Anton Vorontsov
email: [email protected]
irc://irc.freenode.net/bd2

2009-07-08 21:49:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

(belatedly cc'ing netdev)

Original diagnosis:

: Using early netconsole and gianfar driver this error pops up:
:
: netconsole: timeout waiting for carrier
:
: It appears that net/core/netpoll.c:netpoll_setup() is using
: cond_resched() in a loop waiting for a carrier.
:
: The thing is that cond_resched() is a no-op when system_state !=
: SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
: scheduled, therefore link detection doesn't work

> On Thu, 9 Jul 2009 01:33:31 +0400 Anton Vorontsov <[email protected]> wrote:
> On Wed, Jul 08, 2009 at 02:10:24PM -0700, Andrew Morton wrote:
> > > On Wed, 8 Jul 2009 09:12:30 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
> > > That said, I do agree that maybe SYSTEM_RUNNING isn't the right check.
> > > Testing that the scheduler is initialized may be the more correct one. I
> > > think the SYSTEM_RUNNING one just comes from that being used for other
> > > debug issues.
> >
> > Agreed. system_state is too general.
> >
> > If we specifically want to know whether it is safe to call schedule() then
> > let's create a global boolean it_is_safe_to_call_schedule and test that,
> > rather than testing something which indirectly and unreliably implies "it
> > is safe to call schedule". If that boolean already exists then no-brainer.
> >
> > All that being said, I wonder if the netconsole code should be using
> > msleep(1) instead. Spinning on cond_resched() is a bit rude. But one
> > would have to verify that it is safe to call schedule() at this time, and
> > for the netconsole caller, this is dubious.
>
> What do you mean by "verify that it is safe"? If it works,
> can I assume that it's safe? ;-) It works, fwiw.
>

netconsole is supposed to be available as early as possible in boot for
obvious reasons. I'd say there's a decent risk now and in the future that
netconsole will be initialised prior to the scheduler being available.

In fact, if "netconsole: timeout waiting for carrier" newly added to
netpoll_setup() a depedency on the scheduler being available then perhaps
that was an incorrect change.

2009-07-08 22:20:21

by Anton Vorontsov

[permalink] [raw]
Subject: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

Using early netconsole and gianfar driver this error pops up:

netconsole: timeout waiting for carrier

It appears that net/core/netpoll.c:netpoll_setup() is using
cond_resched() in a loop waiting for a carrier.

The thing is that cond_resched() is a no-op when system_state !=
SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
scheduled, therefore link detection doesn't work.

I belive that the main problem is in cond_resched()[1], but despite
how the cond_resched() story ends, it might be a good idea to call
msleep(1) instead of cond_resched(), as suggested by Andrew Morton.

[1] http://lkml.org/lkml/2009/7/7/463

Signed-off-by: Anton Vorontsov <[email protected]>
---

On Wed, Jul 08, 2009 at 02:47:44PM -0700, Andrew Morton wrote:
> (belatedly cc'ing netdev)
>
> Original diagnosis:
>
> : Using early netconsole and gianfar driver this error pops up:
> :
> : netconsole: timeout waiting for carrier
> :
> : It appears that net/core/netpoll.c:netpoll_setup() is using
> : cond_resched() in a loop waiting for a carrier.
> :
> : The thing is that cond_resched() is a no-op when system_state !=
> : SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
> : scheduled, therefore link detection doesn't work
>
> > On Thu, 9 Jul 2009 01:33:31 +0400 Anton Vorontsov <[email protected]> wrote:
> > On Wed, Jul 08, 2009 at 02:10:24PM -0700, Andrew Morton wrote:
> > > > On Wed, 8 Jul 2009 09:12:30 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
> > > > That said, I do agree that maybe SYSTEM_RUNNING isn't the right check.
> > > > Testing that the scheduler is initialized may be the more correct one. I
> > > > think the SYSTEM_RUNNING one just comes from that being used for other
> > > > debug issues.
> > >
> > > Agreed. system_state is too general.
> > >
> > > If we specifically want to know whether it is safe to call schedule() then
> > > let's create a global boolean it_is_safe_to_call_schedule and test that,
> > > rather than testing something which indirectly and unreliably implies "it
> > > is safe to call schedule". If that boolean already exists then no-brainer.
> > >
> > > All that being said, I wonder if the netconsole code should be using
> > > msleep(1) instead. Spinning on cond_resched() is a bit rude. But one
> > > would have to verify that it is safe to call schedule() at this time, and
> > > for the netconsole caller, this is dubious.
> >
> > What do you mean by "verify that it is safe"? If it works,
> > can I assume that it's safe? ;-) It works, fwiw.
> >
>
> netconsole is supposed to be available as early as possible in boot for
> obvious reasons. I'd say there's a decent risk now and in the future that
> netconsole will be initialised prior to the scheduler being available.
>
> In fact, if "netconsole: timeout waiting for carrier" newly added to
> netpoll_setup() a depedency on the scheduler being available then perhaps
> that was an incorrect change.

'git blame' says that carrier detection code didn't change since 2.6.12
(where git history starts), PHYLIB is using workqueue since its
submission (2.6.13). And SYSTEM_RUNNING check was added in 2.6.16.
So it's not a new dependency.

The netpoll code is using msleep() just a few lines below cond_resched(),
so we won't make things worse. ;-)

Thanks!

net/core/netpoll.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 9675f31..df30feb 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -740,7 +740,7 @@ int netpoll_setup(struct netpoll *np)
np->name);
break;
}
- cond_resched();
+ msleep(1);
}

/* If carrier appears to come up instantly, we don't
--
1.6.3.3

2009-07-09 00:03:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib



On Thu, 9 Jul 2009, Anton Vorontsov wrote:
>
> The netpoll code is using msleep() just a few lines below cond_resched(),
> so we won't make things worse. ;-)

Yeah. That function is definitely sleeping. It does things like
kmalloc(GFP_KERNEL), rtnl_lock() and synchronize_rcu() etc too, so an
added msleep() is the least of our problems.

Afaik, it's called from a bog-standard "module_init()", which happens late
enough that everything works.

In fact, I wonder if we should set SYSTEM_RUNNING much earlier - _before_
doing the whole "do_initcalls()". By then we've set up all the core stuff
and enabled interrupts, so we really _are_ running. We just don't
necessarily have drivers, filesystems etc loaded yet. But anything that
happens late enough to be an initcall should be largely considered to
be during "normal code".

(The "early_initcall" cases are special - those really do happen pretty
early).

So ACK on Anton's patch, but I wonder if we _also_ should do the
following?

Looking at the people looking at SYSTEM_RUNNING, I do note some odd cases.
Why the heck does kernel/perf_counter.c do it, for example?

Linus

---
init/main.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/init/main.c b/init/main.c
index 2c5ade7..f10d9cd 100644
--- a/init/main.c
+++ b/init/main.c
@@ -788,6 +788,7 @@ static void __init do_initcalls(void)
{
initcall_t *call;

+ system_state = SYSTEM_RUNNING;
for (call = __early_initcall_end; call < __initcall_end; call++)
do_one_initcall(*call);

@@ -839,7 +840,6 @@ static noinline int init_post(void)
free_initmem();
unlock_kernel();
mark_rodata_ro();
- system_state = SYSTEM_RUNNING;
numa_default_policy();

if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)

2009-07-09 03:08:59

by David Miller

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

From: Linus Torvalds <[email protected]>
Date: Wed, 8 Jul 2009 17:01:14 -0700 (PDT)

> So ACK on Anton's patch,

I'll integrate it into my tree, thanks.

> but I wonder if we _also_ should do the following?

I think we should set SYSTEM_RUNNING earlier, makes sense
to me.

2009-07-09 07:57:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote:
> Looking at the people looking at SYSTEM_RUNNING, I do note some odd cases.
> Why the heck does kernel/perf_counter.c do it, for example?

Ah, those are the swcounter and other probe entry points. I've had
several cases where we called into the perf counter code from those
points before it was initialized, getting in kernel segfaults due to
dereferencing uninitialized data etc..

I could keep a variable that tracked the perf_counter_init() state, and
use that instead if you prefer?

2009-07-09 12:54:34

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 02:20 +0400, Anton Vorontsov wrote:
> Using early netconsole and gianfar driver this error pops up:
>
> netconsole: timeout waiting for carrier
>
> It appears that net/core/netpoll.c:netpoll_setup() is using
> cond_resched() in a loop waiting for a carrier.
>
> The thing is that cond_resched() is a no-op when system_state !=
> SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
> scheduled, therefore link detection doesn't work.
>
> I belive that the main problem is in cond_resched()[1], but despite
> how the cond_resched() story ends, it might be a good idea to call
> msleep(1) instead of cond_resched(), as suggested by Andrew Morton.
>
> [1] http://lkml.org/lkml/2009/7/7/463
>
> Signed-off-by: Anton Vorontsov <[email protected]>
> ---
>
> On Wed, Jul 08, 2009 at 02:47:44PM -0700, Andrew Morton wrote:
> > (belatedly cc'ing netdev)
> >
> > Original diagnosis:
> >
> > : Using early netconsole and gianfar driver this error pops up:
> > :
> > : netconsole: timeout waiting for carrier
> > :
> > : It appears that net/core/netpoll.c:netpoll_setup() is using
> > : cond_resched() in a loop waiting for a carrier.
> > :
> > : The thing is that cond_resched() is a no-op when system_state !=
> > : SYSTEM_RUNNING, and so drivers/net/phy/phy.c's state_queue is never
> > : scheduled, therefore link detection doesn't work
> >
> > > On Thu, 9 Jul 2009 01:33:31 +0400 Anton Vorontsov <[email protected]> wrote:
> > > On Wed, Jul 08, 2009 at 02:10:24PM -0700, Andrew Morton wrote:
> > > > > On Wed, 8 Jul 2009 09:12:30 -0700 (PDT) Linus Torvalds <[email protected]> wrote:
> > > > > That said, I do agree that maybe SYSTEM_RUNNING isn't the right check.
> > > > > Testing that the scheduler is initialized may be the more correct one. I
> > > > > think the SYSTEM_RUNNING one just comes from that being used for other
> > > > > debug issues.
> > > >
> > > > Agreed. system_state is too general.
> > > >
> > > > If we specifically want to know whether it is safe to call schedule() then
> > > > let's create a global boolean it_is_safe_to_call_schedule and test that,
> > > > rather than testing something which indirectly and unreliably implies "it
> > > > is safe to call schedule". If that boolean already exists then no-brainer.
> > > >
> > > > All that being said, I wonder if the netconsole code should be using
> > > > msleep(1) instead. Spinning on cond_resched() is a bit rude. But one
> > > > would have to verify that it is safe to call schedule() at this time, and
> > > > for the netconsole caller, this is dubious.
> > >
> > > What do you mean by "verify that it is safe"? If it works,
> > > can I assume that it's safe? ;-) It works, fwiw.
> > >
> >
> > netconsole is supposed to be available as early as possible in boot for
> > obvious reasons. I'd say there's a decent risk now and in the future that
> > netconsole will be initialised prior to the scheduler being available.
> >
> > In fact, if "netconsole: timeout waiting for carrier" newly added to
> > netpoll_setup() a depedency on the scheduler being available then perhaps
> > that was an incorrect change.
>
> 'git blame' says that carrier detection code didn't change since 2.6.12
> (where git history starts), PHYLIB is using workqueue since its
> submission (2.6.13). And SYSTEM_RUNNING check was added in 2.6.16.
> So it's not a new dependency.
>
> The netpoll code is using msleep() just a few lines below cond_resched(),
> so we won't make things worse. ;-)

I think that's an improvement with or without the SYSTEM_RUNNING fix.

Signed-off-by: Matt Mackall <[email protected]>

> Thanks!
>
> net/core/netpoll.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 9675f31..df30feb 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -740,7 +740,7 @@ int netpoll_setup(struct netpoll *np)
> np->name);
> break;
> }
> - cond_resched();
> + msleep(1);
> }
>
> /* If carrier appears to come up instantly, we don't

--
http://selenic.com : development and support for Mercurial and Linux

2009-07-09 12:58:20

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 09:56 +0200, Peter Zijlstra wrote:
> On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote:
> > Looking at the people looking at SYSTEM_RUNNING, I do note some odd cases.
> > Why the heck does kernel/perf_counter.c do it, for example?
>
> Ah, those are the swcounter and other probe entry points. I've had
> several cases where we called into the perf counter code from those
> points before it was initialized, getting in kernel segfaults due to
> dereferencing uninitialized data etc..
>
> I could keep a variable that tracked the perf_counter_init() state, and
> use that instead if you prefer?

Looks like that'd be more accurate. Linus's proposed patch might break
your current assumptions.

--
http://selenic.com : development and support for Mercurial and Linux

2009-07-09 13:29:47

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote:
>
> On Thu, 9 Jul 2009, Anton Vorontsov wrote:
> >
> > The netpoll code is using msleep() just a few lines below cond_resched(),
> > so we won't make things worse. ;-)
>
> Yeah. That function is definitely sleeping. It does things like
> kmalloc(GFP_KERNEL), rtnl_lock() and synchronize_rcu() etc too, so an
> added msleep() is the least of our problems.
>
> Afaik, it's called from a bog-standard "module_init()", which happens late
> enough that everything works.
>
> In fact, I wonder if we should set SYSTEM_RUNNING much earlier - _before_
> doing the whole "do_initcalls()".

Well there are two ways of consistently defining SYSTEM_RUNNING:

a) define it with reference to the well-understood notion of booting vs
running and don't switch it until handing off to init

b) define it with reference to its usage by an arbitrary user like
cond_resched()

In the latter case, we obviously need to move it to the earliest point
that scheduling is possible. But there are a number of things like

http://lxr.linux.no/linux+v2.6.30/kernel/printk.c#L228

that assume the definition is actually (a). We're currently within a
couple lines of a strict definition of (a) already, so I actually think
cond_resched() is just wrong (and we already know it broke a
previously-working user). It should perhaps be using another private
flag that gets set as soon as scheduling is up and running.

But I'd actually go further and say that it's unfortunate to be checking
extra flags in such an important inline, especially since the check is
false for all but the first couple seconds of run time. Seems like we
could avoid adding an extra check by artificially elevating the preempt
count in early boot (or at compile time) then dropping it when
scheduling becomes available.

--
http://selenic.com : development and support for Mercurial and Linux

2009-07-09 13:47:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 08:26 -0500, Matt Mackall wrote:
> On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote:
> >
> > On Thu, 9 Jul 2009, Anton Vorontsov wrote:
> > >
> > > The netpoll code is using msleep() just a few lines below cond_resched(),
> > > so we won't make things worse. ;-)
> >
> > Yeah. That function is definitely sleeping. It does things like
> > kmalloc(GFP_KERNEL), rtnl_lock() and synchronize_rcu() etc too, so an
> > added msleep() is the least of our problems.
> >
> > Afaik, it's called from a bog-standard "module_init()", which happens late
> > enough that everything works.
> >
> > In fact, I wonder if we should set SYSTEM_RUNNING much earlier - _before_
> > doing the whole "do_initcalls()".
>
> Well there are two ways of consistently defining SYSTEM_RUNNING:
>
> a) define it with reference to the well-understood notion of booting vs
> running and don't switch it until handing off to init

This makes the most sense IMHO.

> b) define it with reference to its usage by an arbitrary user like
> cond_resched()
>
> In the latter case, we obviously need to move it to the earliest point
> that scheduling is possible. But there are a number of things like
>
> http://lxr.linux.no/linux+v2.6.30/kernel/printk.c#L228
>
> that assume the definition is actually (a). We're currently within a
> couple lines of a strict definition of (a) already, so I actually think
> cond_resched() is just wrong (and we already know it broke a
> previously-working user). It should perhaps be using another private
> flag that gets set as soon as scheduling is up and running.

Right as mentioned before in this thread, we grew scheduler_running a
while back which could be used for this.

> But I'd actually go further and say that it's unfortunate to be checking
> extra flags in such an important inline, especially since the check is
> false for all but the first couple seconds of run time. Seems like we
> could avoid adding an extra check by artificially elevating the preempt
> count in early boot (or at compile time) then dropping it when
> scheduling becomes available.

Calling cond_resched() and co when !preemptable is an error so this
wouldn't actually work.


2009-07-09 14:22:00

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 15:46 +0200, Peter Zijlstra wrote:
> On Thu, 2009-07-09 at 08:26 -0500, Matt Mackall wrote:
> > On Wed, 2009-07-08 at 17:01 -0700, Linus Torvalds wrote:
> > >
> > > On Thu, 9 Jul 2009, Anton Vorontsov wrote:
> > > >
> > > > The netpoll code is using msleep() just a few lines below cond_resched(),
> > > > so we won't make things worse. ;-)
> > >
> > > Yeah. That function is definitely sleeping. It does things like
> > > kmalloc(GFP_KERNEL), rtnl_lock() and synchronize_rcu() etc too, so an
> > > added msleep() is the least of our problems.
> > >
> > > Afaik, it's called from a bog-standard "module_init()", which happens late
> > > enough that everything works.
> > >
> > > In fact, I wonder if we should set SYSTEM_RUNNING much earlier - _before_
> > > doing the whole "do_initcalls()".
> >
> > Well there are two ways of consistently defining SYSTEM_RUNNING:
> >
> > a) define it with reference to the well-understood notion of booting vs
> > running and don't switch it until handing off to init
>
> This makes the most sense IMHO.
>
> > b) define it with reference to its usage by an arbitrary user like
> > cond_resched()
> >
> > In the latter case, we obviously need to move it to the earliest point
> > that scheduling is possible. But there are a number of things like
> >
> > http://lxr.linux.no/linux+v2.6.30/kernel/printk.c#L228
> >
> > that assume the definition is actually (a). We're currently within a
> > couple lines of a strict definition of (a) already, so I actually think
> > cond_resched() is just wrong (and we already know it broke a
> > previously-working user). It should perhaps be using another private
> > flag that gets set as soon as scheduling is up and running.
>
> Right as mentioned before in this thread, we grew scheduler_running a
> while back which could be used for this.
>
> > But I'd actually go further and say that it's unfortunate to be checking
> > extra flags in such an important inline, especially since the check is
> > false for all but the first couple seconds of run time. Seems like we
> > could avoid adding an extra check by artificially elevating the preempt
> > count in early boot (or at compile time) then dropping it when
> > scheduling becomes available.
>
> Calling cond_resched() and co when !preemptable is an error so this
> wouldn't actually work.

Sorry if I was unclear. I'm suggesting setting the count so the existing
PREEMPT_ACTIVE test here fires:

int __sched _cond_resched(void)
{
if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
system_state == SYSTEM_RUNNING) {
__cond_resched();
return 1;
}
return 0;
}

That should be kosher.

--
http://selenic.com : development and support for Mercurial and Linux

2009-07-09 14:31:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 09:18 -0500, Matt Mackall wrote:
>
> Sorry if I was unclear. I'm suggesting setting the count so the existing
> PREEMPT_ACTIVE test here fires:
>
> int __sched _cond_resched(void)
> {
> if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
> system_state == SYSTEM_RUNNING) {
> __cond_resched();
> return 1;
> }
> return 0;
> }

Right, /me read preempt and thought a simple preempt inc but didn't read
the code. Shame on me.

So something like (utterly untested and such)

---

arch/x86/include/asm/thread_info.h | 2 +-
kernel/sched.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index b078352..7b5dbce 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -49,7 +49,7 @@ struct thread_info {
.exec_domain = &default_exec_domain, \
.flags = 0, \
.cpu = 0, \
- .preempt_count = 1, \
+ .preempt_count = 1 + PREEMPT_ACTIVE, \
.addr_limit = KERNEL_DS, \
.restart_block = { \
.fn = do_no_restart_syscall, \
diff --git a/kernel/sched.c b/kernel/sched.c
index fd3ac58..8e81162 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -6599,8 +6599,7 @@ static void __cond_resched(void)

int __sched _cond_resched(void)
{
- if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
- system_state == SYSTEM_RUNNING) {
+ if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE)) {
__cond_resched();
return 1;
}
@@ -9422,6 +9421,7 @@ void __init sched_init(void)
perf_counter_init();

scheduler_running = 1;
+ sub_preempt_count(PREEMPT_ACTIVE);
}

#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP

2009-07-09 14:46:18

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 16:31 +0200, Peter Zijlstra wrote:
> On Thu, 2009-07-09 at 09:18 -0500, Matt Mackall wrote:
> >
> > Sorry if I was unclear. I'm suggesting setting the count so the existing
> > PREEMPT_ACTIVE test here fires:
> >
> > int __sched _cond_resched(void)
> > {
> > if (need_resched() && !(preempt_count() & PREEMPT_ACTIVE) &&
> > system_state == SYSTEM_RUNNING) {
> > __cond_resched();
> > return 1;
> > }
> > return 0;
> > }
>
> Right, /me read preempt and thought a simple preempt inc but didn't read
> the code. Shame on me.
>
> So something like (utterly untested and such)

Yeah, that's what I had in mind. Probably throw in a define:

/* for disabling scheduling in early boot */
#define PREEMPT_EARLY (1 + PREEMPT_ACTIVE)

and slap a comment on the sub_preempt_count().

Does anything actually use scheduler_running yet? Perhaps my tree is
old.

Also, might_sleep's use of system_state probably bears revisiting.

--
http://selenic.com : development and support for Mercurial and Linux

2009-07-09 14:51:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 09:43 -0500, Matt Mackall wrote:

> Yeah, that's what I had in mind. Probably throw in a define:
>
> /* for disabling scheduling in early boot */
> #define PREEMPT_EARLY (1 + PREEMPT_ACTIVE)
>
> and slap a comment on the sub_preempt_count().

Right, and visit all the other arch init code ;-)

I'll wait to see if Ingo has anything to say about it and then complete
this thing.

> Does anything actually use scheduler_running yet? Perhaps my tree is
> old.

# git grep scheduler_running
kernel/sched.c:static __read_mostly int scheduler_running;
kernel/sched.c: scheduler_running = 1;
kernel/sched_rt.c: if (unlikely(!scheduler_running))
kernel/sched_rt.c: if (unlikely(!scheduler_running))

If memory serves there used to be more, but I think that migrated into
kernel/sched_clock.c, which has sched_clock_running.

> Also, might_sleep's use of system_state probably bears revisiting.

Yeah, all that code is from long before we had scheduler_running (which
was introduced around CFS/.23).

2009-07-09 15:09:26

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib

On Thu, 2009-07-09 at 16:51 +0200, Peter Zijlstra wrote:
> > Does anything actually use scheduler_running yet? Perhaps my tree is
> > old.
>
> # git grep scheduler_running
> kernel/sched.c:static __read_mostly int scheduler_running;
> kernel/sched.c: scheduler_running = 1;
> kernel/sched_rt.c: if (unlikely(!scheduler_running))
> kernel/sched_rt.c: if (unlikely(!scheduler_running))
>
> If memory serves there used to be more, but I think that migrated into
> kernel/sched_clock.c, which has sched_clock_running.

The static in the first line makes that a bit of a head-scratcher? Oh, I
see: it's living in sin. Going to avert my eyes for now.

--
http://selenic.com : development and support for Mercurial and Linux

2009-07-09 17:31:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] netpoll: Fix carrier detection for drivers that are using phylib



On Thu, 9 Jul 2009, Peter Zijlstra wrote:
>
> So something like (utterly untested and such)

This looks like a good patch. Please make it so - who knows what other
uses of cond_resched() we have in module init routines that might have
deadlocks without it. The netpoll case got fixed, but please just do this.

I'd like to do my system_state movement too (the thing is, when you load
drivers as modules you _will_ have "system_state == SYSTEM_RUNNING", so
any initcall that depends on it being "early boot" is already broken), but
there's no way that patch is appropriate for post-rc2. This one, however,
looks appropriate (modulo getting some testing, of course)

Linus

2009-07-10 12:06:21

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH/RFC] sched: Remove SYSTEM_RUNNING checks from cond_resched*()

Hi!

> > That said, I do agree that maybe SYSTEM_RUNNING isn't the right check.
> > Testing that the scheduler is initialized may be the more correct one. I
> > think the SYSTEM_RUNNING one just comes from that being used for other
> > debug issues.
>
> Agreed. system_state is too general.
>
> If we specifically want to know whether it is safe to call schedule() then
> let's create a global boolean it_is_safe_to_call_schedule and test that,
> rather than testing something which indirectly and unreliably implies "it
> is safe to call schedule". If that boolean already exists then no-brainer.

or maybe we could embed that check into schedule(), just returning
when scheduler is not ready?

And I always wondered... system_state is not protected by any kind of
lock and is not atomic_t... Does it all work by mistake?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html