LinuxLists.cc - lockup when C1E and high-resolution timers enabled

2015-06-13 19:54:09

Subject: lockup when C1E and high-resolution timers enabled

Hi,

On the following configuration, I get a hard lockup under heavy
I/O load (using rsync):

- CONFIG_HIGH_RES_TIMERS=y
- CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
- Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
- BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
- Kernels: 4.1.0-rc6, 4.0.x, 3.16.x

Tests:
- add kernel parameter "idle=halt" -> system runs fine
- disable CONFIG_HIGH_RES_TIMERS -> system runs fine
- change motherboard and disable C1E -> system runs fine
- change CPU to AMD Phenom II X6 Processor -> system runs fine

$ cat /sys/devices/system/cpu/modalias
cpu:type:x86,ven0002fam0015mod0002:
feature:,0000,0001,0002,0003,0004,0005,0006,0007,0008,0009,000B,000C,
000D,000E,000F,0010,0011,0013,0017,0018,0019,001A,001C,0020,0021,0022,
0023,0024,0025,0026,0027,0028,0029,002B,002C,002D,002E,002F,0030,0031,
0034,0036,0037,0038,0039,003A,003B,003D,0064,0068,006E,0070,0071,0074,
0075,0078,007A,007C,0080,0081,0083,0089,008C,008D,0093,0094,0097,0099,
009A,009B,009C,009D,00C0,00C1,00C2,00C3,00C4,00C5,00C6,00C7,00C8,00C9,
00CA,00CB,00CC,00CD,00CF,00D0,00D1,00D3,00D5,00D6,00D7,00D8,00E1,00E2,
00E8,0105,0106,0107,0108,0109,010A,010B,010C,010D,010E,010F,0123

$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-8350 Eight-Core Processor
stepping : 0
microcode : 0x6000832
cpu MHz : 1400.000
cache size : 2048 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 16
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext
fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc
extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1
sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic
cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt
lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb arat cpb
hw_pstate npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid
decodeassists pausefilter pfthreshold vmmcall bmi1
bugs : fxsave_leak sysret_ss_attrs
bogomips : 8036.70
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro
<snip>

Any ideas?

Thanks
-- Christoph

2015-06-13 20:19:46

by Heinz Diehl

Subject: Re: lockup when C1E and high-resolution timers enabled

On 13.06.2015, Christoph Fritz wrote:

> - add kernel parameter "idle=halt" -> system runs fine
> - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
> - change motherboard and disable C1E -> system runs fine
> - change CPU to AMD Phenom II X6 Processor -> system runs fine

I have encountered quite a few C1E-related problems with different
Gigabyte mainboards in the last few years. Try booting with
acpi_skip_timer_override; it fixed most of those problems for me.

2015-06-13 20:59:00

by Christoph Fritz

Subject: Re: lockup when C1E and high-resolution timers enabled

On Sat, 2015-06-13 at 22:19 +0200, Heinz Diehl wrote:
> On 13.06.2015, Christoph Fritz wrote:
>
> > - add kernel parameter "idle=halt" -> system runs fine
> > - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
> > - change motherboard and disable C1E -> system runs fine
> > - change CPU to AMD Phenom II X6 Processor -> system runs fine
>
> I encountered quite some C1E related problems with different Gigabyte
> mainboards in the last few years. Try booting with
> acpi_skip_timer_override, it fixed most of those problems for me.
>

Thanks for the hint. I already tried that: with the kernel parameter
'acpi_skip_timer_override', the kernel doesn't boot at all.

2015-06-14 03:13:11

by Daniel J Blueman

Subject: Re: lockup when C1E and high-resolution timers enabled

On Sunday, June 14, 2015 at 4:00:06 AM UTC+8, Christoph Fritz wrote:
> Hi,
>
> on following computer configuration, I do get hard lockup under heavy
> IO-Load (using rsync):
>
> - CONFIG_HIGH_RES_TIMERS=y
> - CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
> - Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
> - BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
> - Kernels: 4.1.0-rc6, 4.0.x, 3.16.x
>
> Tests:
> - add kernel parameter "idle=halt" -> system runs fine
> - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
> - change motherboard and disable C1E -> system runs fine
> - change CPU to AMD Phenom II X6 Processor -> system runs fine
[..]

C1E disconnects HyperTransport links when all cores enter C1 (halt)
for a period of time; this is all at the platform level, so isn't due
to the kernel. The AMD AGESA code which controls the setup of this
mechanism is updated in the F2g BIOS:
http://www.gigabyte.com/products/product-page.aspx?pid=4717#bios

Did you try both BIOS releases with defaults?

If still issues, also try with the current family 10h microcode from
http://www.amd64.org/microcode/amd-ucode-latest.tar.bz2

Thanks,
Daniel
--
Daniel J Blueman

2015-06-14 04:40:07

by Christoph Fritz

Subject: Re: lockup when C1E and high-resolution timers enabled

On Sun, 2015-06-14 at 11:13 +0800, Daniel J Blueman wrote:
> On Sunday, June 14, 2015 at 4:00:06 AM UTC+8, Christoph Fritz wrote:
> > Hi,
> >
> > on following computer configuration, I do get hard lockup under heavy
> > IO-Load (using rsync):
> >
> > - CONFIG_HIGH_RES_TIMERS=y
> > - CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
> > - Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
> > - BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
> > - Kernels: 4.1.0-rc6, 4.0.x, 3.16.x
> >
> > Tests:
> > - add kernel parameter "idle=halt" -> system runs fine
> > - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
> > - change motherboard and disable C1E -> system runs fine
> > - change CPU to AMD Phenom II X6 Processor -> system runs fine
> [..]
>
> C1E disconnects HyperTransport links when all cores enter C1 (halt)
> for a period of time; this is all at the platform level, so isn't due
> to the kernel. The AMD AGESA code which controls the setup of this
> mechanism is updated in the F2g BIOS:
> http://www.gigabyte.com/products/product-page.aspx?pid=4717#bios
>
> Did you try both BIOS releases with defaults?

Yes, rechecked both versions: Same bad behaviour.

> If still issues, also try with the current family 10h microcode from
> http://www.amd64.org/microcode/amd-ucode-latest.tar.bz2

Don't you mean family 15h for 'AMD FX(tm)-8350'?

I'm already using the latest microcode:

[ 0.514490] microcode: CPU0: patch_level=0x06000822
[ 0.514497] microcode: CPU1: patch_level=0x06000822
[ 0.514508] microcode: CPU2: patch_level=0x06000822
[ 0.514519] microcode: CPU3: patch_level=0x06000822
[ 0.514529] microcode: CPU4: patch_level=0x06000822
[ 0.514540] microcode: CPU5: patch_level=0x06000822
[ 0.514550] microcode: CPU6: patch_level=0x06000822
[ 0.514561] microcode: CPU7: patch_level=0x06000822

2015-06-14 07:54:16

by Daniel J Blueman

Subject: Re: lockup when C1E and high-resolution timers enabled

On 14 June 2015 at 12:39, Christoph Fritz wrote:
> On Sun, 2015-06-14 at 11:13 +0800, Daniel J Blueman wrote:
>> On Sunday, June 14, 2015 at 4:00:06 AM UTC+8, Christoph Fritz wrote:
>> > Hi,
>> >
>> > on following computer configuration, I do get hard lockup under heavy
>> > IO-Load (using rsync):
>> >
>> > - CONFIG_HIGH_RES_TIMERS=y
>> > - CPU: AMD FX(tm)-8350 Eight-Core Processor (family 0x15 model 0x2)
>> > - Motherboard: 'GA-970A-UD3P (rev. 1.0)' AMD 970/SB950
>> > - BIOS: C1E enabled (on 'GA-970A-UD3P' there is no disable option)
>> > - Kernels: 4.1.0-rc6, 4.0.x, 3.16.x
>> >
>> > Tests:
>> > - add kernel parameter "idle=halt" -> system runs fine
>> > - disable CONFIG_HIGH_RES_TIMERS -> system runs fine
>> > - change motherboard and disable C1E -> system runs fine
>> > - change CPU to AMD Phenom II X6 Processor -> system runs fine
>> [..]
>>
>> C1E disconnects HyperTransport links when all cores enter C1 (halt)
>> for a period of time; this is all at the platform level, so isn't due
>> to the kernel. The AMD AGESA code which controls the setup of this
>> mechanism is updated in the F2g BIOS:
>> http://www.gigabyte.com/products/product-page.aspx?pid=4717#bios
>>
>> Did you try both BIOS releases with defaults?
>
> Yes, rechecked both versions: Same bad behaviour.
>
>> If still issues, also try with the current family 10h microcode from
>> http://www.amd64.org/microcode/amd-ucode-latest.tar.bz2
>
> Don't you mean family 15h for 'AMD FX(tm)-8350' ?
>
> already using latest microcode:

As a workaround, you can probably just disable message triggered C1E
(see the BKDG p399 [1]):

val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4
val=$((val &~(1 << 13))) # clear bit13 (MTC1eEn)
setpci -d 1022:1604 0xd4.l=$(printf %x $val) # write back

The chipset setup and behaviour is quite complex, so it's likely
Gigabyte haven't done their homework. The alternative is coreboot of
course.

Thanks,
Daniel

[1] http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf
--
Daniel J Blueman

2015-06-14 14:49:58

by Christoph Fritz

Subject: Re: lockup when C1E and high-resolution timers enabled

On Sun, 2015-06-14 at 15:54 +0800, Daniel J Blueman wrote:
> As a workaround, you can probably just disable message triggered C1E
> (see the BKDG p399 [1]):
>
> val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4

mhm... $(setpci -s 00:18.4 0xd4.l) returns zero, this can't be right.

2015-06-14 15:24:26

by Daniel J Blueman

Subject: Re: lockup when C1E and high-resolution timers enabled

On 14 June 2015 at 22:49, Christoph Fritz wrote:
> On Sun, 2015-06-14 at 15:54 +0800, Daniel J Blueman wrote:
>> As a workaround, you can probably just disable message triggered C1E
>> (see the BKDG p399 [1]):
>>
>> val=0x$(setpci -s 00:18.4 0xd4.l) # read D18F3xD4
>
> mhm... $(setpci -s 00:18.4 0xd4.l) returns zero, this can't be right.

Ahh, try:

val=0x$(setpci -s 00:18.3 0xd4.l) # read D18F3xD4
val=$((val &~(1 << 13))) # clear bit13 (MTC1eEn)
setpci -d 1022:1603 0xd4.l=$(printf %x $val) # write back
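
The same read-modify-write can be wrapped so that the bit arithmetic is checked separately from the hardware access. This is only a sketch: the function names clear_bit, bit_is_set and disable_mtc1e are made up for illustration, and it assumes D18F3xD4 sits at 00:18.3 as on the single-node system in this thread. A value written this way is not expected to survive a reboot.

```shell
# Clear one bit in a 32-bit register value.
# $1 = value as hex without 0x prefix (the format setpci prints)
# $2 = bit number to clear
clear_bit() {
    printf '%08x\n' $(( 0x$1 & ~(1 << $2) & 0xffffffff ))
}

# Exit status 0 if bit $2 is set in hex value $1, non-zero otherwise.
bit_is_set() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Hardware access: needs root and setpci; only runs when called explicitly.
# Assumption: device 00:18.3 is the function holding D18F3xD4.
disable_mtc1e() {
    dev="00:18.3"
    old=$(setpci -s "$dev" 0xd4.l) || return 1
    if bit_is_set "$old" 13; then
        new=$(clear_bit "$old" 13)
        setpci -s "$dev" 0xd4.l="$new" || return 1
        echo "D18F3xD4: $old -> $new"
    else
        echo "D18F3xD4: $old (bit 13 already clear)"
    fi
}
```

Calling disable_mtc1e as root prints the old and new register values, which makes it easy to confirm that only bit 13 changed.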
--
Daniel J Blueman

2015-06-14 16:47:58

by Borislav Petkov

Subject: Re: lockup when C1E and high-resolution timers enabled

On Sun, Jun 14, 2015 at 06:39:56AM +0200, Christoph Fritz wrote:
> Don't you mean family 15h for 'AMD FX(tm)-8350' ?
>
> already using latest microcode:
>
> [ 0.514490] microcode: CPU0: patch_level=0x06000822
> [ 0.514497] microcode: CPU1: patch_level=0x06000822
> [ 0.514508] microcode: CPU2: patch_level=0x06000822
> [ 0.514519] microcode: CPU3: patch_level=0x06000822
> [ 0.514529] microcode: CPU4: patch_level=0x06000822
> [ 0.514540] microcode: CPU5: patch_level=0x06000822
> [ 0.514550] microcode: CPU6: patch_level=0x06000822
> [ 0.514561] microcode: CPU7: patch_level=0x06000822

This is not the latest microcode.

This is the latest:

processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-8350 Eight-Core Processor
stepping : 0
microcode : 0x6000832
^^^^^^^^^

In your first email, you did have the latest:

$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-8350 Eight-Core Processor
stepping : 0
microcode : 0x6000832

so what changed?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-06-14 23:05:19

by Christoph Fritz

Subject: Re: lockup when C1E and high-resolution timers enabled

> > already using latest microcode:
> >
> > [ 0.514490] microcode: CPU0: patch_level=0x06000822
> > [ 0.514497] microcode: CPU1: patch_level=0x06000822
> > [ 0.514508] microcode: CPU2: patch_level=0x06000822
> > [ 0.514519] microcode: CPU3: patch_level=0x06000822
> > [ 0.514529] microcode: CPU4: patch_level=0x06000822
> > [ 0.514540] microcode: CPU5: patch_level=0x06000822
> > [ 0.514550] microcode: CPU6: patch_level=0x06000822
> > [ 0.514561] microcode: CPU7: patch_level=0x06000822
>
> This is not the latest microcode.

> so what changed?

nice catch, my bad -- forgot to post all microcode messages.

$ dmesg | grep microcode:
[ 0.514422] microcode: CPU0: patch_level=0x06000822
[ 0.514429] microcode: CPU1: patch_level=0x06000822
[ 0.514440] microcode: CPU2: patch_level=0x06000822
[ 0.514450] microcode: CPU3: patch_level=0x06000822
[ 0.514460] microcode: CPU4: patch_level=0x06000822
[ 0.514493] microcode: CPU5: patch_level=0x06000822
[ 0.514502] microcode: CPU6: patch_level=0x06000822
[ 0.514513] microcode: CPU7: patch_level=0x06000822
[ 0.514557] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
[ 3.909642] microcode: CPU0: new patch_level=0x06000832
[ 3.940694] microcode: CPU2: new patch_level=0x06000832
[ 3.955187] microcode: CPU4: new patch_level=0x06000832
[ 3.963403] microcode: CPU6: new patch_level=0x06000832
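
Since the log mixes boot-time and updated patch levels, and "new patch_level" lines appear for only four of the eight CPUs, a per-CPU summary of the last level reported can help when comparing against /proc/cpuinfo. A small sketch, assuming the log lines look like the ones above; the function name last_patch_levels is hypothetical:

```shell
# Read "dmesg | grep microcode:" text on stdin and print the last patch
# level logged for each CPU as "CPUn level", sorted by CPU name.
last_patch_levels() {
    awk '/microcode: CPU[0-9]+:.*patch_level=/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^CPU[0-9]+:$/)   { cpu = $i; sub(/:$/, "", cpu) }
            if ($i ~ /^patch_level=/) { lvl = $i; sub(/^patch_level=/, "", lvl) }
        }
        level[cpu] = lvl    # later lines overwrite earlier ones
    }
    END { for (c in level) print c, level[c] }' | sort
}
```

Usage: dmesg | grep microcode: | last_patch_levels. A CPU without a "new patch_level" line keeps its boot-time value in this summary, so the "microcode" field in /proc/cpuinfo for each processor remains the value to trust.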

2015-06-14 23:38:45

by Christoph Fritz

Subject: Re: lockup when C1E and high-resolution timers enabled

On Sun, 2015-06-14 at 23:24 +0800, Daniel J Blueman wrote:
> val=0x$(setpci -s 00:18.3 0xd4.l) # read D18F3xD4
> val=$((val &~(1 << 13))) # clear bit13 (MTC1eEn)
> setpci -d 1022:1603 0xd4.l=$(printf %x $val) # write back

This slows down the whole system dramatically:

- before: MTC1eEn set: Booting takes 11 secs
- after : MTC1eEn cleared: Booting takes 53 secs

2015-06-15 07:54:51

by Borislav Petkov

Subject: Re: lockup when C1E and high-resolution timers enabled

On Mon, Jun 15, 2015 at 01:05:07AM +0200, Christoph Fritz wrote:
> > > already using latest microcode:
> > >
> > > [ 0.514490] microcode: CPU0: patch_level=0x06000822
> > > [ 0.514497] microcode: CPU1: patch_level=0x06000822
> > > [ 0.514508] microcode: CPU2: patch_level=0x06000822
> > > [ 0.514519] microcode: CPU3: patch_level=0x06000822
> > > [ 0.514529] microcode: CPU4: patch_level=0x06000822
> > > [ 0.514540] microcode: CPU5: patch_level=0x06000822
> > > [ 0.514550] microcode: CPU6: patch_level=0x06000822
> > > [ 0.514561] microcode: CPU7: patch_level=0x06000822
> >
> > This is not the latest microcode.
>
> > so what changed?
>
> nice catch, my bad -- forgot to post all microcode messages.
>
> $ dmesg | grep microcode:
> [ 0.514422] microcode: CPU0: patch_level=0x06000822
> [ 0.514429] microcode: CPU1: patch_level=0x06000822
> [ 0.514440] microcode: CPU2: patch_level=0x06000822
> [ 0.514450] microcode: CPU3: patch_level=0x06000822
> [ 0.514460] microcode: CPU4: patch_level=0x06000822
> [ 0.514493] microcode: CPU5: patch_level=0x06000822
> [ 0.514502] microcode: CPU6: patch_level=0x06000822
> [ 0.514513] microcode: CPU7: patch_level=0x06000822
> [ 0.514557] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
> [ 3.909642] microcode: CPU0: new patch_level=0x06000832
> [ 3.940694] microcode: CPU2: new patch_level=0x06000832
> [ 3.955187] microcode: CPU4: new patch_level=0x06000832
> [ 3.963403] microcode: CPU6: new patch_level=0x06000832

Just to rule out the aspect that your issue might be fixed by microcode
but that microcode needs to be loaded early, can you enable the early
microcode loader, put the microcode in initrd as described here:

Documentation/x86/early-microcode.txt

and retry?
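
For reference, the procedure in that file boils down to placing the microcode container at kernel/x86/microcode/AuthenticAMD.bin inside a cpio archive and prepending that archive to the existing initrd. A rough sketch follows; the function names are invented here, and the firmware path in the comment (/lib/firmware/amd-ucode/microcode_amd_fam15h.bin) is an assumption about where the distribution installs the family 15h file.

```shell
# Stage the AMD microcode container at the path the early loader looks for.
# $1 = firmware file (assumed: /lib/firmware/amd-ucode/microcode_amd_fam15h.bin)
# $2 = staging directory (created if missing)
stage_amd_ucode() {
    mkdir -p "$2/kernel/x86/microcode" &&
    cat "$1" > "$2/kernel/x86/microcode/AuthenticAMD.bin"
}

# Pack the staging directory and prepend it to an existing initrd.
# $1 = staging directory (no trailing slash)
# $2 = existing initrd image, $3 = output image
# Needs cpio; only runs when called explicitly.
build_ucode_initrd() {
    ( cd "$1" && find . | cpio -o -H newc ) > "$1.cpio" &&
    cat "$1.cpio" "$2" > "$3"
}
```

The boot loader then has to point at the combined image. If the early loader picked the file up, the updated patch level should already show in the first microcode lines of dmesg rather than only after the later driver load.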

I'm working on having it built-in too, in the case where people don't
use initrd, but that's 64-bit only for now.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-06-15 08:01:50

by Christoph Fritz

Subject: Re: lockup when C1E and high-resolution timers enabled

On Mon, 2015-06-15 at 09:54 +0200, Borislav Petkov wrote:
> Just to rule out the aspect that your issue might be fixed by microcode
> but that microcode needs to be loaded early, can you enable the early
> microcode loader, put the microcode in initrd as described here:
>
> Documentation/x86/early-microcode.txt
>
> and retry?

I should have mentioned that I already tested this; it doesn't fix
the described lockup :-(

2015-06-15 08:10:14

by Borislav Petkov

Subject: Re: lockup when C1E and high-resolution timers enabled

On Mon, Jun 15, 2015 at 10:01:41AM +0200, Christoph Fritz wrote:
> I should have mentioned that, I already tested that, it doesn't fix
> the described lockup :-(

Hmm, can you boot with

"log_buf_len=16M ignore_loglevel debug initcall_debug apic=debug"

and send me full dmesg from that box?
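
Before capturing the log it is worth confirming that all of those parameters actually reached the kernel. A small sketch; the helper name missing_params is made up for this example:

```shell
# Print every requested parameter that is absent from a kernel command line.
# $1 = command line string, remaining arguments = parameters to look for
missing_params() {
    cmdline=" $1 "; shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;          # present as a whole word
            *) echo "$p" ;;       # not found
        esac
    done
}
```

Usage: missing_params "$(cat /proc/cmdline)" log_buf_len=16M ignore_loglevel debug initcall_debug apic=debug. If it prints nothing, everything was passed through, and dmesg > dmesg.txt captures the log for sending.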

Privately's fine too.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-07-13 07:32:44

by Christoph Fritz

Subject: Re: lockup when C1E and high-resolution timers enabled

Hi,

The whole error (freeze under heavy I/O with C1E enabled) can be
traced to the motherboard 'GA-970A-UD3P (rev. 1.0)'; replacing it with
an 'Asus M5A97 PLUS' fixes the problem here.

I'm not sure whether this particular GA-970A board is defective or the
model has a general problem.

Thanks
-- chf