2006-11-06 14:17:54

by Arjan van de Ven

[permalink] [raw]
Subject: [patch] Regression in 2.6.19-rc microcode driver

Hi,

if the microcode driver is built in (rather than module) there are some,
ehm, interesting effects happening due to the new "call out to
userspace" behavior that is introduced.. and which runs too early. The
result is a boot hang; which is really nasty.

The patch below is a minimally safe patch to fix this regression for
2.6.19 by just not requesting actual microcode updates during early
boot. (That is a good idea in general anyway)

The "real" fix is a lot more complex given the entire cpu hotplug
scenario (during cpu hotplug you normally need to load the microcode as
well); but the interactions for that are just really messy at this
point; this fix at least makes it work and avoids a full detangle of
hotplug.

Signed-off-by: Arjan van de Ven <[email protected]>

--- linux-2.6.18/arch/i386/kernel/microcode.c.org 2006-11-06 14:50:37.000000000 +0100
+++ linux-2.6.18/arch/i386/kernel/microcode.c 2006-11-06 14:52:30.000000000 +0100
@@ -577,7 +577,7 @@ static void microcode_init_cpu(int cpu)
set_cpus_allowed(current, cpumask_of_cpu(cpu));
mutex_lock(&microcode_mutex);
collect_cpu_info(cpu);
- if (uci->valid)
+ if (uci->valid && system_state==SYSTEM_RUNNING)
cpu_request_microcode(cpu);
mutex_unlock(&microcode_mutex);
set_cpus_allowed(current, old);


2006-11-06 19:05:22

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

On Mon, 06 Nov 2006 15:15:38 +0100
Arjan van de Ven <[email protected]> wrote:

> Hi,
>
> if the microcode driver is built in (rather than module) there are some,
> ehm, interesting effects happening due to the new "call out to
> userspace" behavior that is introduced.. and which runs too early. The
> result is a boot hang; which is really nasty.
>
> The patch below is a minimally safe patch to fix this regression for
> 2.6.19 by just not requesting actual microcode updates during early
> boot. (That is a good idea in general anyway)
>
> The "real" fix is a lot more complex given the entire cpu hotplug
> scenario (during cpu hotplug you normally need to load the microcode as
> well); but the interactions for that are just really messy at this
> point; this fix at least makes it work and avoids a full detangle of
> hotplug.
>
> Signed-off-by: Arjan van de Ven <[email protected]>
>
> --- linux-2.6.18/arch/i386/kernel/microcode.c.org 2006-11-06 14:50:37.000000000 +0100
> +++ linux-2.6.18/arch/i386/kernel/microcode.c 2006-11-06 14:52:30.000000000 +0100
> @@ -577,7 +577,7 @@ static void microcode_init_cpu(int cpu)
> set_cpus_allowed(current, cpumask_of_cpu(cpu));
> mutex_lock(&microcode_mutex);
> collect_cpu_info(cpu);
> - if (uci->valid)
> + if (uci->valid && system_state==SYSTEM_RUNNING)
> cpu_request_microcode(cpu);
> mutex_unlock(&microcode_mutex);
> set_cpus_allowed(current, old);

Can we fix this by switching to late_initcall() or something like that?

2006-11-06 19:03:09

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

Andrew Morton wrote:
>> --- linux-2.6.18/arch/i386/kernel/microcode.c.org 2006-11-06 14:50:37.000000000 +0100
>> +++ linux-2.6.18/arch/i386/kernel/microcode.c 2006-11-06 14:52:30.000000000 +0100
>> @@ -577,7 +577,7 @@ static void microcode_init_cpu(int cpu)
>> set_cpus_allowed(current, cpumask_of_cpu(cpu));
>> mutex_lock(&microcode_mutex);
>> collect_cpu_info(cpu);
>> - if (uci->valid)
>> + if (uci->valid && system_state==SYSTEM_RUNNING)
>> cpu_request_microcode(cpu);
>> mutex_unlock(&microcode_mutex);
>> set_cpus_allowed(current, old);
>
> Can we fix this by switching to late_initcall() or something like that?

I will try but it then still runs before userspace (esp "init") is
alive so I'm not convinced it'll do the right thing

2006-11-06 19:19:10

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

Andrew Morton wrote:
> Can we fix this by switching to late_initcall() or something like that?

after testing this: the answer is "no." ;(

at least not without significant redesign on how this all interacts
(which includes cpuhotplug meeting sysfs which isn't all that pretty
already) which is imo not a 2.6.19 thing.

2006-11-07 01:20:43

by Shaohua Li

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

On Mon, 2006-11-06 at 15:15 +0100, Arjan van de Ven wrote:
> Hi,
>
> if the microcode driver is built in (rather than module) there are some,
> ehm, interesting effects happening due to the new "call out to
> userspace" behavior that is introduced.. and which runs too early. The
> result is a boot hang; which is really nasty.
>
> The patch below is a minimally safe patch to fix this regression for
> 2.6.19 by just not requesting actual microcode updates during early
> boot. (That is a good idea in general anyway)
>
> The "real" fix is a lot more complex given the entire cpu hotplug
> scenario (during cpu hotplug you normally need to load the microcode as
> well); but the interactions for that are just really messy at this
> point; this fix at least makes it work and avoids a full detangle of
> hotplug.
Yes, this is an issue which I documented in my patch. It's not a hang,
but a long delay if you have many cpus. Other drivers with firmware
request have the same issue if they are built-in. Maybe we should fix
the firmware request mechanism itself. I hope no distribution has
microcode driver built-in.

Thanks,
Shaohua

2006-11-07 02:02:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

On Tue, 07 Nov 2006 09:20:27 +0800
Shaohua Li <[email protected]> wrote:

> On Mon, 2006-11-06 at 15:15 +0100, Arjan van de Ven wrote:
> > Hi,
> >
> > if the microcode driver is built in (rather than module) there are some,
> > ehm, interesting effects happening due to the new "call out to
> > userspace" behavior that is introduced.. and which runs too early. The
> > result is a boot hang; which is really nasty.
> >
> > The patch below is a minimally safe patch to fix this regression for
> > 2.6.19 by just not requesting actual microcode updates during early
> > boot. (That is a good idea in general anyway)
> >
> > The "real" fix is a lot more complex given the entire cpu hotplug
> > scenario (during cpu hotplug you normally need to load the microcode as
> > well); but the interactions for that are just really messy at this
> > point; this fix at least makes it work and avoids a full detangle of
> > hotplug.
> Yes, this is an issue which I documented in my patch. It's not a hang,
> but a long delay if you have many cpus.

Due to the timeout? So it should come back after 10*num_online_cpus seconds?

Does Arjan have a lot of CPUs?

> Other drivers with firmware
> request have the same issue if they are built-in. Maybe we should fix
> the firmware request mechanism itself. I hope no distribution has
> microcode driver built-in.

But what would a fix look like? I think things would work OK if all the
appropriate stuff is present in initramfs, yes? We wouldn't want to break
that.

hm. kobject_uevent() stupidly returns void. If we were to fix that, is
there any reason why _request_firmware() should still wait for ten seconds
if kobject_uevent() returned a synchronous error? (ie:
__call_usermodehelper failed?)

Answer: yes. That won't work because request_firmware() uses
call_usermodehelper(wait=0) (iirc this bad thing was done because of
deadlock problems which were hard to fix properly).

But all it not lost - because call_usermodehelper() will use CLONE_VFORK I
_think_ we can still work out whether the child thread successfully exec'ed
a new program. It'd take a bit of hacking on the fork() code to make that
work though.

2006-11-07 09:13:25

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

>
> Due to the timeout? So it should come back after 10*num_online_cpus seconds?
>
> Does Arjan have a lot of CPUs?

eh yes, my test machine has quite a large number of those.

2006-11-07 09:24:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

On Tue, 07 Nov 2006 10:13:17 +0100
Arjan van de Ven <[email protected]> wrote:

> >
> > Due to the timeout? So it should come back after 10*num_online_cpus seconds?
> >
> > Does Arjan have a lot of CPUs?
>
> eh yes, my test machine has quite a large number of those.

So did it really hang?

2006-11-07 09:35:39

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

Andrew Morton wrote:
> On Tue, 07 Nov 2006 10:13:17 +0100
> Arjan van de Ven <[email protected]> wrote:
>
>>> Due to the timeout? So it should come back after 10*num_online_cpus seconds?
>>>
>>> Does Arjan have a lot of CPUs?
>> eh yes, my test machine has quite a large number of those.
>
> So did it really hang?

I'll retry and wait a few minutes ;(

2006-11-07 09:43:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Regression in 2.6.19-rc microcode driver

Andrew Morton wrote:
> On Tue, 07 Nov 2006 10:13:17 +0100
> Arjan van de Ven <[email protected]> wrote:
>
>>> Due to the timeout? So it should come back after 10*num_online_cpus seconds?
>>>
>>> Does Arjan have a lot of CPUs?
>> eh yes, my test machine has quite a large number of those.
>
> So did it really hang?

ok so it eventually continues.. just way past my "it hang" patience
window of about a minute