2012-05-03 17:03:10

by Sune Mølgaard

Subject: Re: Boot failure since 3.3-rc?

Incidentally, I had to swap a wifi card, and bisecting now leads to a
different bad commit(?)

This is what it says is the culprit now (I wonder if I should bisect
again, and attempt booting maybe 3 or 4 times each time):

f94edacf998516ac9d849f7bc6949a703977a7f3 is the first bad commit
commit f94edacf998516ac9d849f7bc6949a703977a7f3
Author: Linus Torvalds <[email protected]>
Date: Fri Feb 17 21:48:54 2012 -0800

i387: move TS_USEDFPU flag from thread_info to task_struct

This moves the bit that indicates whether a thread has ownership of the
FPU from the TS_USEDFPU bit in thread_info->status to a word of its own
(called 'has_fpu') in task_struct->thread.has_fpu.

This fixes two independent bugs at the same time:

- changing 'thread_info->status' from the scheduler causes nasty
problems for the other users of that variable, since it is defined to
be thread-synchronous (that's what the "TS_" part of the naming was
supposed to indicate).

So perfectly valid code could (and did) do

ti->status |= TS_RESTORE_SIGMASK;

and the compiler was free to do that as separate load, or and store
instructions. Which can cause problems with preemption, since a task
switch could happen in between, and change the TS_USEDFPU bit. The
change to TS_USEDFPU would be overwritten by the final store.

In practice, this seldom happened, though, because the 'status' field
was seldom used more than once, so gcc would generally tend to
generate code that used a read-modify-write instruction and thus
happened to avoid this problem - RMW instructions are naturally low
fat and preemption-safe.

- On x86-32, the current_thread_info() pointer would, during interrupts
and softirqs, point to a *copy* of the real thread_info, because
x86-32 uses %esp to calculate the thread_info address, and thus the
separate irq (and softirq) stacks would cause these kinds of odd
thread_info copy aliases.

This is normally not a problem, since interrupts aren't supposed to
look at thread information anyway (what thread is running at
interrupt time really isn't very well-defined), but it confused the
heck out of irq_fpu_usable() and the code that tried to squirrel
away the FPU state.

(It also caused untold confusion for us poor kernel developers).

It also turns out that using 'task_struct' is actually much more natural
for most of the call sites that care about the FPU state, since they
tend to work with the task struct for other reasons anyway (ie
scheduling). And the FPU data that we are going to save/restore is
found there too.

Thanks to Arjan Van De Ven <[email protected]> for pointing us to
the %esp issue.

Cc: Arjan van de Ven <[email protected]>
Reported-and-tested-by: Raphael Prevost <[email protected]>
Acked-and-tested-by: Suresh Siddha <[email protected]>
Tested-by: Peter Anvin <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

:040000 040000 19548f49884c9745ecb3970321ff41b244d79b97 ec8b1a02dd7ef354f1be4c68767e4353819dd5fa M arch
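
For readers following along, the two bugs described in the quoted commit
message can be illustrated with a small user-space sketch. This is not
kernel code. The flag values, the 8 KiB THREAD_SIZE, and all three
function names are assumptions made up for illustration, and the
"task switch" is simulated by an ordinary assignment placed between the
load and the store.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values for this sketch (assumed, not taken from
 * the kernel headers). */
#define TS_USEDFPU         0x0001u
#define TS_RESTORE_SIGMASK 0x0008u

/* Assumed stack size; it must be a power of two for the masking below. */
#define THREAD_SIZE 8192u

/*
 * Bug 1, bad case: "status |= TS_RESTORE_SIGMASK" compiled as separate
 * load, or and store instructions, with a simulated task switch that
 * sets TS_USEDFPU in memory between the load and the store.
 */
unsigned int split_rmw_preempted(unsigned int status)
{
    unsigned int tmp = status;      /* load                              */
    status |= TS_USEDFPU;           /* simulated scheduler update        */
    tmp |= TS_RESTORE_SIGMASK;      /* or                                */
    status = tmp;                   /* store: scheduler's update is lost */
    return status;
}

/*
 * Bug 1, lucky case: the update is a single read-modify-write
 * instruction, so the simulated task switch can only land before or
 * after it, never in the middle.
 */
unsigned int single_rmw_preempted(unsigned int status)
{
    status |= TS_USEDFPU;           /* simulated scheduler update        */
    status |= TS_RESTORE_SIGMASK;   /* one indivisible RMW               */
    return status;
}

/*
 * Bug 2: deriving the thread_info address by masking the stack pointer.
 * Two different stacks (task stack vs. irq stack) give two different
 * addresses, so code running on the irq stack sees a different
 * thread_info than the task does.
 */
uintptr_t thread_info_from_sp(uintptr_t sp)
{
    assert((THREAD_SIZE & (THREAD_SIZE - 1)) == 0);
    return sp & ~((uintptr_t)THREAD_SIZE - 1);
}
```

Starting from a status of 0, the split version ends up with only
TS_RESTORE_SIGMASK set, while the single-instruction version keeps both
bits. Feeding thread_info_from_sp() a stack pointer from each of two
different stacks returns two different base addresses, which is the
aliasing that the commit message says confused irq_fpu_usable().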

For obvious reasons, this commit cannot be easily reverted, but help is
much appreciated!

/sune

--
Unix is not an 'a-ha' experience, it is more of a 'holy-shit' experience.
- Colin McFadyen


2012-05-11 15:46:16

by Sune Mølgaard

Subject: Re: Boot failure since 3.3-rc?

Sune Mølgaard wrote:
> Incidentally, I had to swap a wifi card, and bisecting now leads to a
> different bad commit(?)
>
> This is what it says is the culprit now (I wonder if I should bisect
> again, and attempt booting maybe 3 or 4 times each time):
>
> f94edacf998516ac9d849f7bc6949a703977a7f3 is the first bad commit
> commit f94edacf998516ac9d849f7bc6949a703977a7f3

Would anyone happen to know if this has been backported to the 3.0-series?

Just tried booting the latest ubuntu 11.10 kernel (based on 3.0) which
also failed. That should, naturally, be logged with the Ubuntu guys (and
it will be), but until then, if someone can positively say that the
above patch was backported, it might lend credence to the assumption
that it is indeed the culprit.

I have, btw., ordered a small display to hook up to the machine in order
to see where it fails.

Will report back...

Best regards,

Sune Mølgaard

--
First things first, but not necessarily in that order.
- Doctor Who