2015-02-05 21:14:50

by Bruno Prémont

[permalink] [raw]
Subject: Re: Linux 3.19-rc5

On Thu, 29 January 2015 Linus Torvalds wrote:
> On Thu, Jan 29, 2015 at 5:12 PM, Linus Torvalds wrote:
> >
> > The WARN() was already changed to a WARN_ONCE().
>
> Oh, but I notice that the "__set_current_state(TASK_RUNNING) ends up
> always happening.
>
> So I think the right fix is to:
>
> - warn once like we do
>
> - but *not* do that __set_current_state() which was always total crap anyway
>
> Why do I say "total crap"? Because of two independent issues:
>
> (a) it actually changes behavior for a debug vs non-debug kernel,
> which is a really bad idea to begin with
>
> (b) it's really wrong. The whole "nested sleep" case was never a
> major bug to begin with, just a possible inefficiency where constant
> nested sleeps would possibly make the outer sleep not sleep. But that
> "could possibly make" case was the unlikely case, and the debug patch
> made it happen *all* the time by explicitly setting things running.
>
> So I think the proper patch is the attached.
>
> The comment is also crap. The comment says
>
> "Blocking primitives will set (and therefore destroy) current->state [...]"
>
> but the reality is that they *may* set it, and only in the unlikely
> slow-path where they actually block.
>
> So doing this in "__may_sleep()" is just bogus and horrible horrible
> crap. It turns the "harmless ugliness" into a real *harmful* bug. The
> key word of "__may_sleep()" is that "MAY" part. It's a debug thing to
> make relatively rare cases show up.
>
> PeterZ, please don't make "debugging" patches like this. Ever again.
> Because this was just stupid, and it took me too long to realize that
> despite the warning being shut up, the debug patch was still actively
> doing bad bad things.
>
> Ingo, maybe you'd want to apply this through the scheduler tree, the
> way you already did the WARN_ONCE() thing.
>
> Bruno, does this finally actually fix your pccard thing?

Tested the variant that was applied by running rc7 and it fixes the
endless loop.
The loop is now replaced by a single WARN() trace - I guess expected:

[ 3.083647] ------------[ cut here ]------------
[ 3.087477] WARNING: CPU: 0 PID: 67 at /usr/src/linux-git/kernel/sched/core.c:7300 __might_sleep+0x79/0x90()
[ 3.091357] do not call blocking ops when !TASK_RUNNING; state=1 set at [<c1442040>] pccardd+0xa0/0x3e0
[ 3.095232] Modules linked in:
[ 3.099020] CPU: 0 PID: 67 Comm: pccardd Not tainted 3.19.0-rc7-00003-g67288c4 #17
[ 3.102760] Hardware name: Acer TravelMate 660/TravelMate 660, BIOS 3A19 01/14/2004
[ 3.106504] c212def4 c212def4 c212deb4 c16caf23 c212dee4 c10416fd c1907334 c212df10
[ 3.110315] 00000043 c1907380 00001c84 c105a099 c105a099 c1442040 00000001 f5f54bc0
[ 3.114143] c212defc c104176e 00000009 c212def4 c1907334 c212df10 c212df30 c105a099
[ 3.117960] Call Trace:
[ 3.121703] [<c16caf23>] dump_stack+0x16/0x18
[ 3.125447] [<c10416fd>] warn_slowpath_common+0x7d/0xc0
[ 3.129172] [<c105a099>] ? __might_sleep+0x79/0x90
[ 3.132868] [<c105a099>] ? __might_sleep+0x79/0x90
[ 3.136500] [<c1442040>] ? pccardd+0xa0/0x3e0
[ 3.140092] [<c104176e>] warn_slowpath_fmt+0x2e/0x30
[ 3.143657] [<c105a099>] __might_sleep+0x79/0x90
[ 3.147209] [<c1442040>] ? pccardd+0xa0/0x3e0
[ 3.150747] [<c1442040>] ? pccardd+0xa0/0x3e0
[ 3.154256] [<c16cf447>] mutex_lock+0x17/0x2a
[ 3.157734] [<c1442089>] pccardd+0xe9/0x3e0
[ 3.161207] [<c1441fa0>] ? pcmcia_socket_uevent+0x30/0x30
[ 3.164660] [<c1056990>] kthread+0xa0/0xc0
[ 3.168059] [<c16d1040>] ret_from_kernel_thread+0x20/0x30
[ 3.171436] [<c10568f0>] ? kthread_worker_fn+0x140/0x140
[ 3.174796] ---[ end trace c3f708b642e3d8f0 ]---

>From my reading of the thread fixing pccardd/sched TASK_RUNNING usage/check
is another issue left for the future.

Thanks,
Bruno

> Linus


2015-02-06 11:51:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Linux 3.19-rc5

On Thu, Feb 05, 2015 at 10:14:36PM +0100, Bruno Pr?mont wrote:
> The loop is now replaced by a single WARN() trace - I guess expected:

> From my reading of the thread fixing pccardd/sched TASK_RUNNING usage/check
> is another issue left for the future.

Yeah, something like the below will make it go away -- under the
assumption that that comment is actually correct, I don't know, the
pcmcia people should probably write a better comment :/

Also, set_current_state(TASK_RUNNING) is almost always pointless, nobody
cares about that barrier, so make it go away.

---
drivers/pcmcia/cs.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/pcmcia/cs.c b/drivers/pcmcia/cs.c
index 5292db69c426..5678e161a17d 100644
--- a/drivers/pcmcia/cs.c
+++ b/drivers/pcmcia/cs.c
@@ -635,6 +635,12 @@ static int pccardd(void *__skt)
skt->sysfs_events = 0;
spin_unlock_irqrestore(&skt->thread_lock, flags);

+ /*
+ * Supposedly this is a rarely contended mutex and
+ * sleeping is therefore unlikely, the occasional
+ * extra loop iteration is harmless.
+ */
+ sched_annotate_sleep();
mutex_lock(&skt->skt_mutex);
if (events & SS_DETECT)
socket_detect_change(skt);
@@ -679,7 +685,7 @@ static int pccardd(void *__skt)
try_to_freeze();
}
/* make sure we are running before we exit */
- set_current_state(TASK_RUNNING);
+ __set_current_state(TASK_RUNNING);

/* shut down socket, if a device is still present */
if (skt->state & SOCKET_PRESENT) {

2015-02-06 16:02:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 3.19-rc5

On Fri, Feb 6, 2015 at 3:50 AM, Peter Zijlstra <[email protected]> wrote:
>
> Also, set_current_state(TASK_RUNNING) is almost always pointless, nobody
> cares about that barrier, so make it go away.

I'd rather not mix this with the patch, and wonder if we should just
do that globally with some preprocessor magic. We do have a fair
number of "set_current_state(TASK_RUNNING)" and at least for the
*documented* reason for the memory barrier, all of them could/should
be barrier-less.

So something like

if (__is_constant_p(state) && state == TASK_RUNNING)
tsk->state = state;
else
set_mb(tsk->state, state);

might be more general solution than randomly doing one at a time when
changing code around it..

Linus

2015-02-06 16:40:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: Linux 3.19-rc5

On Fri, Feb 06, 2015 at 08:02:41AM -0800, Linus Torvalds wrote:
> On Fri, Feb 6, 2015 at 3:50 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > Also, set_current_state(TASK_RUNNING) is almost always pointless, nobody
> > cares about that barrier, so make it go away.
>
> I'd rather not mix this with the patch, and wonder if we should just
> do that globally with some preprocessor magic. We do have a fair
> number of "set_current_state(TASK_RUNNING)"

138

> and at least for the
> *documented* reason for the memory barrier, all of them could/should
> be barrier-less.
>
> So something like
>
> if (__is_constant_p(state) && state == TASK_RUNNING)
> tsk->state = state;
> else
> set_mb(tsk->state, state);
>
> might be more general solution than randomly doing one at a time when
> changing code around it..

Yeah, or we could do the coccinelle thing and do a mass conversion.

I like the macro one though; I worry a wee bit about non-documented
cases through. If someone is doing something way subtle we'll break it
:/


---
Subject: sched: Avoid the full memory-barrier for set_current_state(TASK_RUNNING)

One should never need the full memory barrier implied by
set_current_state() to set TASK_RUNNING for the documented reason of
avoiding races against wakeup.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef98d2f..aea44c4eeed8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -243,6 +243,27 @@ extern char ___assert_task_state[1 - 2*!!(
((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
(task->flags & PF_FROZEN) == 0)

+/*
+ * set_current_state() includes a barrier so that the write of current->state
+ * is correctly serialised wrt the caller's subsequent test of whether to
+ * actually sleep:
+ *
+ * set_current_state(TASK_UNINTERRUPTIBLE);
+ * if (do_i_need_to_sleep())
+ * schedule();
+ *
+ * If the caller does not need such serialisation then use
+ * __set_current_state(). This is always true for TASK_RUNNING since
+ * there is no race against wakeup -- both write the same value.
+ */
+#define ___set_current_state(state) \
+do { \
+ if (__is_constant_p(state) && (state) == TASK_RUNNING) \
+ current->state = (state); \
+ else \
+ set_mb(current->state, (state)); \
+} while (0)
+
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP

#define __set_task_state(tsk, state_value) \
@@ -256,17 +277,6 @@ extern char ___assert_task_state[1 - 2*!!(
set_mb((tsk)->state, (state_value)); \
} while (0)

-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- * set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
- *
- * If the caller does not need such serialisation then use __set_current_state()
- */
#define __set_current_state(state_value) \
do { \
current->task_state_change = _THIS_IP_; \
@@ -275,7 +285,7 @@ extern char ___assert_task_state[1 - 2*!!(
#define set_current_state(state_value) \
do { \
current->task_state_change = _THIS_IP_; \
- set_mb(current->state, (state_value)); \
+ ___set_current_state(state_value); \
} while (0)

#else
@@ -285,21 +295,10 @@ extern char ___assert_task_state[1 - 2*!!(
#define set_task_state(tsk, state_value) \
set_mb((tsk)->state, (state_value))

-/*
- * set_current_state() includes a barrier so that the write of current->state
- * is correctly serialised wrt the caller's subsequent test of whether to
- * actually sleep:
- *
- * set_current_state(TASK_UNINTERRUPTIBLE);
- * if (do_i_need_to_sleep())
- * schedule();
- *
- * If the caller does not need such serialisation then use __set_current_state()
- */
#define __set_current_state(state_value) \
do { current->state = (state_value); } while (0)
#define set_current_state(state_value) \
- set_mb(current->state, (state_value))
+ ___set_current_state(state_value);

#endif


2015-02-06 16:46:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: Linux 3.19-rc5

On Fri, Feb 6, 2015 at 8:39 AM, Peter Zijlstra <[email protected]> wrote:
>
> I like the macro one though; I worry a wee bit about non-documented
> cases through. If someone is doing something way subtle we'll break it
> :/

Yeah, I agree. And I wonder how much we should even care. People who
do this thing by hand - rather than using the wait-event model - are
*likely* legacy stuff or just random drivers, or simply stuff that
isn't all that performance-sensitive.

So I dunno how worthwhile this all is.

Linus