2016-03-21 23:10:24

by Matthias Schiffer

[permalink] [raw]
Subject: Nonterministic hang during bootconsole/console handover on ath79

Hi,
we're experiencing weird nondeterministic hangs during bootconsole/console
handover on some ath79 systems on OpenWrt. I've seen this issue myself on
kernel 3.18.23~3.18.27 on a AR7241-based system, but according to other
reports ([1], [2]) kernel 4.1.x is affected as well, and other SoCs like
QCA953x likewise.

See the log below for the exact place it hangs; the log was taken in during
a good boot; a bad boot will just hang forever at the marked location. The
issue is extremely hard to debug, as changing the timing in any way (like
adding additional printk) will usually make it work without problems. (Even
recompiling the kernel with the same config, but different uname timestamp
will make the occurence more or less likely)

My theory is the following:

As soon as ttyS0 is detected and installed as the console, there are two
console drivers active on the serial port at the same time: early0 and
ttyS0. I suspect that the hang occurs when the primitive early0
implementation prom_putchar_ar71xx waits indefinitely on THRE, but the real
driver has just reset the serial controller in a way that makes THRE never
come. When the boot is successful, I also sometimes see just garbage
instead of the message "serial8250.0: ttyS0 at MMIO 0x18020000...", which
supports my idea that the kernel is trying to use the serial console while
it is not correctly setup.

Is is possible that the first "console [ttyS0] disabled" message caused by
the call chain

serial8250_register_8250_port
uart_remove_one_port
unregister_console

should rather unregister early0 (as ttyS0 is not even registered at this
point), so having both drivers active at the same time is avoided?

Does this make any sense? Or do you have any other idea what might cause this?

Thanks in advance,
Matthias


[1] https://dev.openwrt.org/ticket/21773
[2] https://dev.openwrt.org/ticket/21857


> [ 0.000000] Linux version 3.18.27 (neoraider@avalon) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r47335) ) #6 Sun Mar 20 22:40:15 CET 2016
> [ 0.000000] bootconsole [early0] enabled
> [ 0.000000] CPU0 revision is: 00019374 (MIPS 24Kc)
> [ 0.000000] SoC: Atheros AR7241 rev 1
> [ 0.000000] Determined physical RAM map:
> [ 0.000000] memory: 02000000 @ 00000000 (usable)
> [ 0.000000] Initrd not found or empty - disabling initrd
> [ 0.000000] Zone ranges:
> [ 0.000000] Normal [mem 0x00000000-0x01ffffff]
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x00000000-0x01ffffff]
> [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x01ffffff]
> [ 0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
> [ 0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
> [ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 8128
> [ 0.000000] Kernel command line: board=TL-WR841N-v7 console=ttyS0,115200 rootfstype=squashfs,jffs2 noinitrd
> [ 0.000000] PID hash table entries: 128 (order: -3, 512 bytes)
> [ 0.000000] Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
> [ 0.000000] Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
> [ 0.000000] Writing ErrCtl register=00000000
> [ 0.000000] Readback ErrCtl register=00000000
> [ 0.000000] Memory: 28324K/32768K available (2573K kernel code, 133K rwdata, 540K rodata, 264K init, 193K bss, 4444K reserved)
> [ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
> [ 0.000000] NR_IRQS:51
> [ 0.000000] Clocks: CPU:400.000MHz, DDR:400.000MHz, AHB:200.000MHz, Ref:5.000MHz
> [ 0.000000] Calibrating delay loop... 265.42 BogoMIPS (lpj=1327104)
> [ 0.080000] pid_max: default: 32768 minimum: 301
> [ 0.080000] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
> [ 0.090000] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
> [ 0.100000] NET: Registered protocol family 16
> [ 0.100000] MIPS: machine is TP-LINK TL-WR841N/ND v7
> [ 0.550000] registering PCI controller with io_map_base unset
> [ 0.560000] PCI host bridge to bus 0000:00
> [ 0.570000] pci_bus 0000:00: root bus resource [mem 0x10000000-0x13ffffff]
> [ 0.570000] pci_bus 0000:00: root bus resource [io 0x0000]
> [ 0.580000] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
> [ 0.580000] pci 0000:00:00.0: fixup device configuration
> [ 0.590000] pci 0000:00:00.0: BAR 0: assigned [mem 0x10000000-0x1000ffff 64bit]
> [ 0.590000] pci 0000:00:00.0: using irq 40 for pin 1
> [ 0.600000] Switched to clocksource MIPS
> [ 0.600000] NET: Registered protocol family 2
> [ 0.610000] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
> [ 0.610000] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
> [ 0.620000] TCP: Hash tables configured (established 1024 bind 1024)
> [ 0.630000] TCP: reno registered
> [ 0.630000] UDP hash table entries: 256 (order: 0, 4096 bytes)
> [ 0.640000] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
> [ 0.640000] NET: Registered protocol family 1
> [ 0.650000] futex hash table entries: 256 (order: -1, 3072 bytes)
> [ 0.660000] squashfs: version 4.0 (2009/01/31) Phillip Lougher
> [ 0.670000] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
> [ 0.680000] msgmni has been set to 55
> [ 0.680000] io scheduler noop registered
> [ 0.690000] io scheduler deadline registered (default)
> [ 0.690000] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
> [ 0.700000] console [ttyS0] disabled
>>> Boot will hang here <<<
> [ 0.730000] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 12500000) is a 16550A
> [ 0.740000] console [ttyS0] enabled
> [ 0.740000] console [ttyS0] enabled
> [ 0.740000] bootconsole [early0] disabled
> [ 0.740000] bootconsole [early0] disabled


Attachments:
signature.asc (819.00 B)
OpenPGP digital signature

2016-03-21 23:08:25

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On Tue, Mar 22, 2016 at 12:02:57AM +0100, Matthias Schiffer wrote:
> Hi,
> we're experiencing weird nondeterministic hangs during bootconsole/console
> handover on some ath79 systems on OpenWrt. I've seen this issue myself on
> kernel 3.18.23~3.18.27 on a AR7241-based system, but according to other
> reports ([1], [2]) kernel 4.1.x is affected as well, and other SoCs like
> QCA953x likewise.

Can you try 4.4 or ideally, 4.5? There's been a lot of console/tty
fixes/changes since the obsolete 3.18 kernel you are using...

thanks,

greg k-h

2016-03-22 00:52:48

by Matthias Schiffer

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On 03/22/2016 12:08 AM, Greg KH wrote:
> On Tue, Mar 22, 2016 at 12:02:57AM +0100, Matthias Schiffer wrote:
>> Hi,
>> we're experiencing weird nondeterministic hangs during bootconsole/console
>> handover on some ath79 systems on OpenWrt. I've seen this issue myself on
>> kernel 3.18.23~3.18.27 on a AR7241-based system, but according to other
>> reports ([1], [2]) kernel 4.1.x is affected as well, and other SoCs like
>> QCA953x likewise.
>
> Can you try 4.4 or ideally, 4.5? There's been a lot of console/tty
> fixes/changes since the obsolete 3.18 kernel you are using...
>
> thanks,
>
> greg k-h
>

With 4.4, I was not able to reproduce this hang, but I have no idea if this
is caused by an actual bugfix, or just random timing changes hiding the
bug. I suspect the latter might be the case (as I wrote in my first mail,
even minor differences in kernel images of the same version and the same
config make the hang more or less probable.) I was not yet able to test
4.5, as OpenWrt is a hell of kernel patches...

On 3.18, I also tried other things like disabling the early console
altogether, which also made the hang go away, but as even much smaller
changes hid the bug, this doesn't really say much.

The basic code path during the console handover seems to be the same in
3.18 and 4.4, even though a few functions have been moved; the relevant
part of the log looks the same:

> [ 0.756298] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
> [ 0.766754] console [ttyS0] disabled
> [ 0.790293] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 12500000) is a 16550A
> [ 0.798909] console [ttyS0] enabled
> [ 0.798909] console [ttyS0] enabled
> [ 0.805854] bootconsole [early0] disabled
> [ 0.805854] bootconsole [early0] disabled

So, in propect of an actual bugfix or backport, this boils down to two
questions, which I hope the serial or MIPS maintainers can answer me:

* Is it sane to have two console drivers using the same serial port? In
particular, is it sane for the early console to use the serial port after
serial8250_config_port has reset/configured it, but before the rest of the
setup of uart_configure_port has run? (this would be the case for the
message "serial8250.0: ttyS0 at MMIO...")
* Is it possible to get the serial controller into a state in which
early_printk might wait for THRE forever?

Thanks,
Matthias


Attachments:
signature.asc (819.00 B)
OpenPGP digital signature

2016-03-22 02:44:31

by Peter Hurley

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

Hi Matthias,

On 03/21/2016 04:02 PM, Matthias Schiffer wrote:
> Hi,
> we're experiencing weird nondeterministic hangs during bootconsole/console
> handover on some ath79 systems on OpenWrt. I've seen this issue myself on
> kernel 3.18.23~3.18.27 on a AR7241-based system, but according to other
> reports ([1], [2]) kernel 4.1.x is affected as well, and other SoCs like
> QCA953x likewise.
>
> See the log below for the exact place it hangs; the log was taken in during
> a good boot; a bad boot will just hang forever at the marked location. The
> issue is extremely hard to debug, as changing the timing in any way (like
> adding additional printk) will usually make it work without problems. (Even
> recompiling the kernel with the same config, but different uname timestamp
> will make the occurence more or less likely)
>
> My theory is the following:
>
> As soon as ttyS0 is detected and installed as the console, there are two
> console drivers active on the serial port at the same time: early0 and
> ttyS0. I suspect that the hang occurs when the primitive early0
> implementation prom_putchar_ar71xx waits indefinitely on THRE, but the real
> driver has just reset the serial controller in a way that makes THRE never
> come.

Doubtful.

console writes are performed with ints disabled, as is the 8250 driver's
autoconfig probing. Since this is a UP platform, as long as you're not
using the DEBUG_AUTOCONF switch in the 8250 driver, I don't think there's
a way for the boot console to be outputting while the 8250 driver is
configuring.

> When the boot is successful, I also sometimes see just garbage
> instead of the message "serial8250.0: ttyS0 at MMIO 0x18020000...", which
> supports my idea that the kernel is trying to use the serial console while
> it is not correctly setup.
>
> Is is possible that the first "console [ttyS0] disabled" message caused by
> the call chain
>
> serial8250_register_8250_port
> uart_remove_one_port
> unregister_console
>
> should rather unregister early0 (as ttyS0 is not even registered at this
> point), so having both drivers active at the same time is avoided?

No.

Having both consoles active briefly at the handover is not really a problem;
all consoles are serialized with each other. Only one console can be outputting
at any one time.

The unregister_console() producing the bogus "ttyS0 disabled" message
is wrong but harmless.

That part happens because the 8250 driver creates phantom ports which are
unregistered as new ports are probed, but the ttyS0 console hadn't actually
been registered yet.


> Does this make any sense? Or do you have any other idea what might cause this?

I wonder if autoconfig probing (that's what discovers the uart port type)
is broken.

You could test this hypothesis by setting the port type directly and
set UPF_FIXED_TYPE; ie., in arch/mips/ath79/dev-common.c

diff --git a/arch/mips/ath79/dev-common.c b/arch/mips/ath79/dev-common.c
index 516225d..3814a42 100644
--- a/arch/mips/ath79/dev-common.c
+++ b/arch/mips/ath79/dev-common.c
@@ -36,7 +36,8 @@ static struct plat_serial8250_port ath79_uart_data[] = {
{
.mapbase = AR71XX_UART_BASE,
.irq = ATH79_MISC_IRQ(3),
- .flags = AR71XX_UART_FLAGS,
+ .flags = AR71XX_UART_FLAGS | UPF_FIXED_TYPE,
+ .type = PORT_16550A,
.iotype = UPIO_MEM32,
.regshift = 2,
}, {


Regards,
Peter Hurley

> Thanks in advance,
> Matthias
>
>
> [1] https://dev.openwrt.org/ticket/21773
> [2] https://dev.openwrt.org/ticket/21857
>
>
>> [ 0.000000] Linux version 3.18.27 (neoraider@avalon) (gcc version 4.8.3 (OpenWrt/Linaro GCC 4.8-2014.04 r47335) ) #6 Sun Mar 20 22:40:15 CET 2016
>> [ 0.000000] bootconsole [early0] enabled
>> [ 0.000000] CPU0 revision is: 00019374 (MIPS 24Kc)
>> [ 0.000000] SoC: Atheros AR7241 rev 1
>> [ 0.000000] Determined physical RAM map:
>> [ 0.000000] memory: 02000000 @ 00000000 (usable)
>> [ 0.000000] Initrd not found or empty - disabling initrd
>> [ 0.000000] Zone ranges:
>> [ 0.000000] Normal [mem 0x00000000-0x01ffffff]
>> [ 0.000000] Movable zone start for each node
>> [ 0.000000] Early memory node ranges
>> [ 0.000000] node 0: [mem 0x00000000-0x01ffffff]
>> [ 0.000000] Initmem setup node 0 [mem 0x00000000-0x01ffffff]
>> [ 0.000000] Primary instruction cache 64kB, VIPT, 4-way, linesize 32 bytes.
>> [ 0.000000] Primary data cache 32kB, 4-way, VIPT, cache aliases, linesize 32 bytes
>> [ 0.000000] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 8128
>> [ 0.000000] Kernel command line: board=TL-WR841N-v7 console=ttyS0,115200 rootfstype=squashfs,jffs2 noinitrd
>> [ 0.000000] PID hash table entries: 128 (order: -3, 512 bytes)
>> [ 0.000000] Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
>> [ 0.000000] Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)
>> [ 0.000000] Writing ErrCtl register=00000000
>> [ 0.000000] Readback ErrCtl register=00000000
>> [ 0.000000] Memory: 28324K/32768K available (2573K kernel code, 133K rwdata, 540K rodata, 264K init, 193K bss, 4444K reserved)
>> [ 0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
>> [ 0.000000] NR_IRQS:51
>> [ 0.000000] Clocks: CPU:400.000MHz, DDR:400.000MHz, AHB:200.000MHz, Ref:5.000MHz
>> [ 0.000000] Calibrating delay loop... 265.42 BogoMIPS (lpj=1327104)
>> [ 0.080000] pid_max: default: 32768 minimum: 301
>> [ 0.080000] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
>> [ 0.090000] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
>> [ 0.100000] NET: Registered protocol family 16
>> [ 0.100000] MIPS: machine is TP-LINK TL-WR841N/ND v7
>> [ 0.550000] registering PCI controller with io_map_base unset
>> [ 0.560000] PCI host bridge to bus 0000:00
>> [ 0.570000] pci_bus 0000:00: root bus resource [mem 0x10000000-0x13ffffff]
>> [ 0.570000] pci_bus 0000:00: root bus resource [io 0x0000]
>> [ 0.580000] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
>> [ 0.580000] pci 0000:00:00.0: fixup device configuration
>> [ 0.590000] pci 0000:00:00.0: BAR 0: assigned [mem 0x10000000-0x1000ffff 64bit]
>> [ 0.590000] pci 0000:00:00.0: using irq 40 for pin 1
>> [ 0.600000] Switched to clocksource MIPS
>> [ 0.600000] NET: Registered protocol family 2
>> [ 0.610000] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
>> [ 0.610000] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
>> [ 0.620000] TCP: Hash tables configured (established 1024 bind 1024)
>> [ 0.630000] TCP: reno registered
>> [ 0.630000] UDP hash table entries: 256 (order: 0, 4096 bytes)
>> [ 0.640000] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
>> [ 0.640000] NET: Registered protocol family 1
>> [ 0.650000] futex hash table entries: 256 (order: -1, 3072 bytes)
>> [ 0.660000] squashfs: version 4.0 (2009/01/31) Phillip Lougher
>> [ 0.670000] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
>> [ 0.680000] msgmni has been set to 55
>> [ 0.680000] io scheduler noop registered
>> [ 0.690000] io scheduler deadline registered (default)
>> [ 0.690000] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
>> [ 0.700000] console [ttyS0] disabled
>>>> Boot will hang here <<<
>> [ 0.730000] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 12500000) is a 16550A
>> [ 0.740000] console [ttyS0] enabled
>> [ 0.740000] console [ttyS0] enabled
>> [ 0.740000] bootconsole [early0] disabled
>> [ 0.740000] bootconsole [early0] disabled
>

2016-03-22 02:51:28

by Peter Hurley

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On 03/21/2016 05:52 PM, Matthias Schiffer wrote:
> On 03/22/2016 12:08 AM, Greg KH wrote:
>> On Tue, Mar 22, 2016 at 12:02:57AM +0100, Matthias Schiffer wrote:
>>> Hi,
>>> we're experiencing weird nondeterministic hangs during bootconsole/console
>>> handover on some ath79 systems on OpenWrt. I've seen this issue myself on
>>> kernel 3.18.23~3.18.27 on a AR7241-based system, but according to other
>>> reports ([1], [2]) kernel 4.1.x is affected as well, and other SoCs like
>>> QCA953x likewise.
>>
>> Can you try 4.4 or ideally, 4.5? There's been a lot of console/tty
>> fixes/changes since the obsolete 3.18 kernel you are using...
>>
>> thanks,
>>
>> greg k-h
>>
>
> With 4.4, I was not able to reproduce this hang, but I have no idea if this
> is caused by an actual bugfix, or just random timing changes hiding the
> bug.

Can you continue testing with 4.4.x and see if it eventually reproduces?


> I suspect the latter might be the case (as I wrote in my first mail,
> even minor differences in kernel images of the same version and the same
> config make the hang more or less probable.) I was not yet able to test
> 4.5, as OpenWrt is a hell of kernel patches...
>
> On 3.18, I also tried other things like disabling the early console
> altogether, which also made the hang go away, but as even much smaller
> changes hid the bug, this doesn't really say much.

FWIW, printk() is not a small change; takes ~500us @ 115200


>
> The basic code path during the console handover seems to be the same in
> 3.18 and 4.4, even though a few functions have been moved; the relevant
> part of the log looks the same:
>
>> [ 0.756298] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
>> [ 0.766754] console [ttyS0] disabled
>> [ 0.790293] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 12500000) is a 16550A
>> [ 0.798909] console [ttyS0] enabled
>> [ 0.798909] console [ttyS0] enabled
>> [ 0.805854] bootconsole [early0] disabled
>> [ 0.805854] bootconsole [early0] disabled
>
> So, in propect of an actual bugfix or backport, this boils down to two
> questions, which I hope the serial or MIPS maintainers can answer me:
>
> * Is it sane to have two console drivers using the same serial port? In
> particular, is it sane for the early console to use the serial port after
> serial8250_config_port has reset/configured it, but before the rest of the
> setup of uart_configure_port has run? (this would be the case for the
> message "serial8250.0: ttyS0 at MMIO...")
> * Is it possible to get the serial controller into a state in which
> early_printk might wait for THRE forever?

I think I addressed these questions in my other reply; let me know if not.

Regards,
Peter Hurley

2016-03-22 05:41:18

by Antony Pavlov

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On Tue, 22 Mar 2016 00:02:57 +0100
Matthias Schiffer <[email protected]> wrote:

> Hi,
> we're experiencing weird nondeterministic hangs during bootconsole/console
> handover on some ath79 systems on OpenWrt. I've seen this issue myself on
> kernel 3.18.23~3.18.27 on a AR7241-based system, but according to other
> reports ([1], [2]) kernel 4.1.x is affected as well, and other SoCs like
> QCA953x likewise.
>
> See the log below for the exact place it hangs; the log was taken in during
> a good boot; a bad boot will just hang forever at the marked location. The
> issue is extremely hard to debug, as changing the timing in any way (like
> adding additional printk) will usually make it work without problems. (Even
> recompiling the kernel with the same config, but different uname timestamp
> will make the occurence more or less likely)
>
> My theory is the following:
>
> As soon as ttyS0 is detected and installed as the console, there are two
> console drivers active on the serial port at the same time: early0 and
> ttyS0. I suspect that the hang occurs when the primitive early0
> implementation prom_putchar_ar71xx waits indefinitely on THRE,

Can you use EJTAG to prove your theory?

--?
Best regards,
? Antony Pavlov

2016-03-22 09:50:45

by Matthias Schiffer

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On 03/22/2016 06:40 AM, Antony Pavlov wrote:
> On Tue, 22 Mar 2016 00:02:57 +0100
> Matthias Schiffer <[email protected]> wrote:
>
>> Hi,
>> we're experiencing weird nondeterministic hangs during bootconsole/console
>> handover on some ath79 systems on OpenWrt. I've seen this issue myself on
>> kernel 3.18.23~3.18.27 on a AR7241-based system, but according to other
>> reports ([1], [2]) kernel 4.1.x is affected as well, and other SoCs like
>> QCA953x likewise.
>>
>> See the log below for the exact place it hangs; the log was taken in during
>> a good boot; a bad boot will just hang forever at the marked location. The
>> issue is extremely hard to debug, as changing the timing in any way (like
>> adding additional printk) will usually make it work without problems. (Even
>> recompiling the kernel with the same config, but different uname timestamp
>> will make the occurence more or less likely)
>>
>> My theory is the following:
>>
>> As soon as ttyS0 is detected and installed as the console, there are two
>> console drivers active on the serial port at the same time: early0 and
>> ttyS0. I suspect that the hang occurs when the primitive early0
>> implementation prom_putchar_ar71xx waits indefinitely on THRE,
>
> Can you use EJTAG to prove your theory?
>
> --
> Best regards,
> Antony Pavlov
>

Unfortunately, the board I can reproduce this at the moment does not have
an accessible JTAG port. I hope I can reproduce it on other hardware which
is more developer-friendly, but no luck so far.

Matthias


Attachments:
signature.asc (819.00 B)
OpenPGP digital signature

2016-03-22 13:08:07

by Matthias Schiffer

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

>> My theory is the following:
>>
>> As soon as ttyS0 is detected and installed as the console, there are two
>> console drivers active on the serial port at the same time: early0 and
>> ttyS0. I suspect that the hang occurs when the primitive early0
>> implementation prom_putchar_ar71xx waits indefinitely on THRE, but the real
>> driver has just reset the serial controller in a way that makes THRE never
>> come.
>
> Doubtful.
>
> console writes are performed with ints disabled, as is the 8250 driver's
> autoconfig probing. Since this is a UP platform, as long as you're not
> using the DEBUG_AUTOCONF switch in the 8250 driver, I don't think there's
> a way for the boot console to be outputting while the 8250 driver is
> configuring.

I see.

>
>> When the boot is successful, I also sometimes see just garbage
>> instead of the message "serial8250.0: ttyS0 at MMIO 0x18020000...", which
>> supports my idea that the kernel is trying to use the serial console while
>> it is not correctly setup.
>>

>
> I wonder if autoconfig probing (that's what discovers the uart port type)
> is broken.
>
> You could test this hypothesis by setting the port type directly and
> set UPF_FIXED_TYPE; ie., in arch/mips/ath79/dev-common.c
>
> diff --git a/arch/mips/ath79/dev-common.c b/arch/mips/ath79/dev-common.c
> index 516225d..3814a42 100644
> --- a/arch/mips/ath79/dev-common.c
> +++ b/arch/mips/ath79/dev-common.c
> @@ -36,7 +36,8 @@ static struct plat_serial8250_port ath79_uart_data[] = {
> {
> .mapbase = AR71XX_UART_BASE,
> .irq = ATH79_MISC_IRQ(3),
> - .flags = AR71XX_UART_FLAGS,
> + .flags = AR71XX_UART_FLAGS | UPF_FIXED_TYPE,
> + .type = PORT_16550A,
> .iotype = UPIO_MEM32,
> .regshift = 2,
> }, {
>
>
> Regards,
> Peter Hurley

I've tried your patch and I can't reproduce the issue anymore with it; I
have no idea if this actually has to do something with the issue, or the
change of the code path just hid the bug again.

Regarding your other mail: with "small change", I was not talking about
adding an additional printk; as mentioned, even changing the numbers in
UTS_VERSION can hide the issue. I diffed a working and a broken kernel
image, and the UTS_VERSION is really the only difference. I have no idea
how to explain this. (OpenWrt uses an LZMA-compressed kernel image, so
after compression, the differences are much greater; but how these
differences would affect the kernel after decompression eludes me)

I'll continue searching for a board with accessible JTAG which exhibits
this issue. Given the heisenbuggy nature of the issue, getting to the root
cause is probably impossible without JTAG unless someone has an obvious
explanation...

Thanks,
Matthias


Attachments:
signature.asc (819.00 B)
OpenPGP digital signature

2016-03-22 15:39:01

by Peter Hurley

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On 03/22/2016 06:07 AM, Matthias Schiffer wrote:
> I've tried your patch and I can't reproduce the issue anymore with it; I
> have no idea if this actually has to do something with the issue, or the
> change of the code path just hid the bug again.
>
> Regarding your other mail: with "small change", I was not talking about
> adding an additional printk; as mentioned, even changing the numbers in
> UTS_VERSION can hide the issue. I diffed a working and a broken kernel
> image, and the UTS_VERSION is really the only difference. I have no idea
> how to explain this.

If _any_ change may hide the problem, that will make it impossible
to determine if any attempted fix actually works, regardless of what
debugging method you use.

FWIW, you could still use the boot console to debug the problem by
disabling the regular command-line console.

Regards,
Peter Hurley


> (OpenWrt uses an LZMA-compressed kernel image, so
> after compression, the differences are much greater; but how these
> differences would affect the kernel after decompression eludes me)
>
> I'll continue searching for a board with accessible JTAG which exhibits
> this issue. Given the heisenbuggy nature of the issue, getting to the root
> cause is probably impossible without JTAG unless someone has an obvious
> explanation...
>
> Thanks,
> Matthias
>

2016-03-23 17:40:58

by Matthias Schiffer

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On 03/22/2016 04:38 PM, Peter Hurley wrote:
> On 03/22/2016 06:07 AM, Matthias Schiffer wrote:
>> I've tried your patch and I can't reproduce the issue anymore with it; I
>> have no idea if this actually has to do something with the issue, or the
>> change of the code path just hid the bug again.
>>
>> Regarding your other mail: with "small change", I was not talking about
>> adding an additional printk; as mentioned, even changing the numbers in
>> UTS_VERSION can hide the issue. I diffed a working and a broken kernel
>> image, and the UTS_VERSION is really the only difference. I have no idea
>> how to explain this.
>
> If _any_ change may hide the problem, that will make it impossible
> to determine if any attempted fix actually works, regardless of what
> debugging method you use.
>
> FWIW, you could still use the boot console to debug the problem by
> disabling the regular command-line console.
>
> Regards,
> Peter Hurley

Hi,
it seems Peter was on the right track. With some help from Ralf, I was able
to narrow down the issue a bit, and I'm fairly sure the hang happens
somewhere in autoconfig().

autoconfig_16550a() is doing all kinds of weird checks to detect different
hardware by writing a lot of register values which are documented as
reserved in the AR7242 datasheet (there's a leaked version going around
that can be easily googled...), no idea if any of those are problematic.
Just setting UPF_FIXED_TYPE as suggested by Peter would avoid that code
altogether.

That being said, I found another minimal change that seems to fix the
issue: prom_putchar_ar71xx() in arch/mips/ath79/early_printk.c only waits
for UART_LSR_THRE, while serial_putc() in
drivers/tty/serial/8250/8250_early.c waits for (UART_LSR_TEMT |
UART_LSR_THRE). Adjusting arch/mips/ath79/early_printk.c in the same way
makes the hangs go away. Maybe the AR7242 doesn't like its serial config
registers being poked while there's still something in the FIFO? Waiting
for UART_LSR_TEMT seems like a good idea anyways to ensure that all
characters have been printed before autoconfig() starts taking things
apart. (Why do these two versions of essentially the same code exist anyways?)

Regards,
Matthias


Attachments:
signature.asc (819.00 B)
OpenPGP digital signature

2016-03-24 00:44:30

by Peter Hurley

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

Hi Matthias,

On 03/23/2016 10:40 AM, Matthias Schiffer wrote:
> On 03/22/2016 04:38 PM, Peter Hurley wrote:
>> On 03/22/2016 06:07 AM, Matthias Schiffer wrote:
>>> I've tried your patch and I can't reproduce the issue anymore with it; I
>>> have no idea if this actually has to do something with the issue, or the
>>> change of the code path just hid the bug again.
>>>
>>> Regarding your other mail: with "small change", I was not talking about
>>> adding an additional printk; as mentioned, even changing the numbers in
>>> UTS_VERSION can hide the issue. I diffed a working and a broken kernel
>>> image, and the UTS_VERSION is really the only difference. I have no idea
>>> how to explain this.
>>
>> If _any_ change may hide the problem, that will make it impossible
>> to determine if any attempted fix actually works, regardless of what
>> debugging method you use.
>>
>> FWIW, you could still use the boot console to debug the problem by
>> disabling the regular command-line console.
>>
>> Regards,
>> Peter Hurley
>
> Hi,
> it seems Peter was on the right track. With some help from Ralf, I was able
> to narrow down the issue a bit, and I'm fairly sure the hang happens
> somewhere in autoconfig().
>
> autoconfig_16550a() is doing all kinds of weird checks to detect different
> hardware by writing a lot of register values which are documented as
> reserved in the AR7242 datasheet (there's a leaked version going around
> that can be easily googled...), no idea if any of those are problematic.
> Just setting UPF_FIXED_TYPE as suggested by Peter would avoid that code
> altogether.

That's just a debugging patch and not appropriate for permanent use,
the reason being that this uart is _not_ 16550 compatible (or even 16450
compatible).

The three options for 8250 driver support for this part are:
1. Similar to the debugging patch, set UPF_FIXED_TYPE but set port type
to PORT_8250 instead. This will lose FIFO support so 115K won't be
possible and likely neither will 38400.

2. Set UPF_FIXED_TYPE but define a new PORT_* value and add support for
this PORT_* value to uart_config array, uapi headers, and anywhere
the scratch register is used.

3. As with 2. above but don't set UPF_FIXED_TYPE and add a probe function
that detects ports of this type to autoconfig(). I don't recommend this
method.

This requirement is independent of fixing prom_putchar_ar71xx().


> That being said, I found another minimal change that seems to fix the
> issue: prom_putchar_ar71xx() in arch/mips/ath79/early_printk.c only waits
> for UART_LSR_THRE, while serial_putc() in
> drivers/tty/serial/8250/8250_early.c waits for (UART_LSR_TEMT |
> UART_LSR_THRE). Adjusting arch/mips/ath79/early_printk.c in the same way
> makes the hangs go away.

Go ahead and do this as well.


> Maybe the AR7242 doesn't like its serial config
> registers being poked while there's still something in the FIFO? Waiting
> for UART_LSR_TEMT seems like a good idea anyways to ensure that all
> characters have been printed before autoconfig() starts taking things
> apart.

I agree.


> (Why do these two versions of essentially the same code exist anyways?)

earlyprintk command-line parameter is arch-specific and predates
the arch-independent earlycon support.

earlycon requires the arch to either supply a fixmap for the uart i/o address
or ioremap() must work by parse_early_param(); note how the mmio in
arch/mips/ath79/early_printk.c accesses the mmio space @ KSEG1ADDR(uart base addr)?

Can't do that in arch-independent earlycon code.

Regards,
Peter Hurley

2016-03-24 02:09:45

by Matthias Schiffer

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

>> autoconfig_16550a() is doing all kinds of weird checks to detect different
>> hardware by writing a lot of register values which are documented as
>> reserved in the AR7242 datasheet (there's a leaked version going around
>> that can be easily googled...), no idea if any of those are problematic.
>> Just setting UPF_FIXED_TYPE as suggested by Peter would avoid that code
>> altogether.
>
> That's just a debugging patch and not appropriate for permanent use,
> the reason being that this uart is _not_ 16550 compatible (or even 16450
> compatible).
>
> The three options for 8250 driver support for this part are:
> 1. Similar to the debugging patch, set UPF_FIXED_TYPE but set port type
> to PORT_8250 instead. This will lose FIFO support so 115K won't be
> possible and likely neither will 38400.
>
> 2. Set UPF_FIXED_TYPE but define a new PORT_* value and add support for
> this PORT_* value to uart_config array, uapi headers, and anywhere
> the scratch register is used.
>
> 3. As with 2. above but don't set UPF_FIXED_TYPE and add a probe function
> that detects ports of this type to autoconfig(). I don't recommend this
> method.
>
> This requirement is independent of fixing prom_putchar_ar71xx().
>

I can send patches for all of this, and I think that 2. would be the nicest
solution. I've noticed though that include/uapi/linux/serial_core.h is
experiencing a little "overflow": PORT_MAX_8250 has grown just below the
first non-8250 entry. Should I just add the new entry at the bottom (and
thus grow the uart_config array by ~85 unused entries)? What about
PORT_MAX_8250 (used at least in drivers/tty/serial/8250/8250_of.c)?

Regards,
Matthias


Attachments:
signature.asc (819.00 B)
OpenPGP digital signature

2016-03-24 03:17:12

by Peter Hurley

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

On 03/23/2016 07:09 PM, Matthias Schiffer wrote:
>>> autoconfig_16550a() is doing all kinds of weird checks to detect different
>>> hardware by writing a lot of register values which are documented as
>>> reserved in the AR7242 datasheet (there's a leaked version going around
>>> that can be easily googled...), no idea if any of those are problematic.
>>> Just setting UPF_FIXED_TYPE as suggested by Peter would avoid that code
>>> altogether.
>>
>> That's just a debugging patch and not appropriate for permanent use,
>> the reason being that this uart is _not_ 16550 compatible (or even 16450
>> compatible).
>>
>> The three options for 8250 driver support for this part are:
>> 1. Similar to the debugging patch, set UPF_FIXED_TYPE but set port type
>> to PORT_8250 instead. This will lose FIFO support so 115K won't be
>> possible and likely neither will 38400.
>>
>> 2. Set UPF_FIXED_TYPE but define a new PORT_* value and add support for
>> this PORT_* value to uart_config array, uapi headers, and anywhere
>> the scratch register is used.
>>
>> 3. As with 2. above but don't set UPF_FIXED_TYPE and add a probe function
>> that detects ports of this type to autoconfig(). I don't recommend this
>> method.
>>
>> This requirement is independent of fixing prom_putchar_ar71xx().
>>
>
> I can send patches for all of this, and I think that 2. would be the nicest
> solution. I've noticed though that include/uapi/linux/serial_core.h is
> experiencing a little "overflow": PORT_MAX_8250 has grown just below the
> first non-8250 entry.

Ugh. Thanks for noting this.

> Should I just add the new entry at the bottom (and
> thus grow the uart_config array by ~85 unused entries)? What about
> PORT_MAX_8250 (used at least in drivers/tty/serial/8250/8250_of.c)?

None of the above, unfortunately. Ok, plan B.

I need to clean off a dusty series that adds probe steering and
bugs pass-thru for 8250 sub-drivers and platform data. Then add a
UART_BUG_NOSCR to indicate a uart does not have a scratch register
(like this one). Then for this uart, set UPF_FIXED_TYPE and type to
PORT_16550A, with UART_BUG_NOSCR flag.

Couple of days because I'm in the middle of something else, ok?

Regards,
Peter Hurley

2016-03-25 13:00:00

by Gabor Juhos

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

2016.03.24. 4:17 keltezéssel, Peter Hurley írta:
> On 03/23/2016 07:09 PM, Matthias Schiffer wrote:
>>>> autoconfig_16550a() is doing all kinds of weird checks to detect different
>>>> hardware by writing a lot of register values which are documented as
>>>> reserved in the AR7242 datasheet (there's a leaked version going around
>>>> that can be easily googled...), no idea if any of those are problematic.
>>>> Just setting UPF_FIXED_TYPE as suggested by Peter would avoid that code
>>>> altogether.
>>>
>>> That's just a debugging patch and not appropriate for permanent use,
>>> the reason being that this uart is _not_ 16550 compatible (or even 16450
>>> compatible).
>>>
>>> The three options for 8250 driver support for this part are:
>>> 1. Similar to the debugging patch, set UPF_FIXED_TYPE but set port type
>>> to PORT_8250 instead. This will lose FIFO support so 115K won't be
>>> possible and likely neither will 38400.
>>>
>>> 2. Set UPF_FIXED_TYPE but define a new PORT_* value and add support for
>>> this PORT_* value to uart_config array, uapi headers, and anywhere
>>> the scratch register is used.
>>>
>>> 3. As with 2. above but don't set UPF_FIXED_TYPE and add a probe function
>>> that detects ports of this type to autoconfig(). I don't recommend this
>>> method.
>>>
>>> This requirement is independent of fixing prom_putchar_ar71xx().
>>>
>>
>> I can send patches for all of this, and I think that 2. would be the nicest
>> solution. I've noticed though that include/uapi/linux/serial_core.h is
>> experiencing a little "overflow": PORT_MAX_8250 has grown just below the
>> first non-8250 entry.
>
> Ugh. Thanks for noting this.
>
>> Should I just add the new entry at the bottom (and
>> thus grow the uart_config array by ~85 unused entries)? What about
>> PORT_MAX_8250 (used at least in drivers/tty/serial/8250/8250_of.c)?
>
> None of the above, unfortunately. Ok, plan B.
>
> I need to clean off a dusty series that adds probe steering and
> bugs pass-thru for 8250 sub-drivers and platform data. Then add a
> UART_BUG_NOSCR to indicate a uart does not have a scratch register
> (like this one). Then for this uart, set UPF_FIXED_TYPE and type to
> PORT_16550A, with UART_BUG_NOSCR flag.

Introducing the UART_BUG_NOSCR flag for this UART is pointless in my opinion,
because it does have a scratch register in fact. Even if it is not listed in the
datasheet of the SoCs, it exists.

I have tested that from the bootloader on the Netgear WNDR3700 board which uses
the AR7161 SoC:

ar7100> md.l 0xb802001c 1
b802001c: 00000000 ....
ar7100> mw.l 0xb802001c a5
ar7100> md.l 0xb802001c 1
b802001c: 000000a5 ....
ar7100> mw.l 0xb802001c 5a
ar7100> md.l 0xb802001c 1
b802001c: 0000005a ...Z
ar7100>

The same test is on the TL-WR842ND v1 board (AR7241 SoC):

ar7240> md.l b802001c 1
b802001c: 00000000 ....
ar7240> mw.l b802001c a5
ar7240> md.l b802001c 1
b802001c: 000000a5 ....
ar7240> mw.l b802001c 5a
ar7240> md.l b802001c 1
b802001c: 0000005a ...Z
ar7240>

And on the TL-WR841N v8 board (AR9341 Soc):

wasp> md.l b802001c 1
b802001c: 00000000 ....
wasp> mw.l b802001c a5
wasp> md.l b802001c 1
b802001c: 000000a5 ....
wasp> mw.l b802001c 5a
wasp> md.l b802001c 1
b802001c: 0000005a ...Z
wasp>

Although i did not test it on other SoCs, but i assume that the behaviour is the
same on those.

-Gabor

2016-03-25 15:24:47

by Peter Hurley

[permalink] [raw]
Subject: Re: Nonterministic hang during bootconsole/console handover on ath79

Hi Gabor,

On 03/25/2016 05:59 AM, Gabor Juhos wrote:
> 2016.03.24. 4:17 keltezéssel, Peter Hurley írta:
>> On 03/23/2016 07:09 PM, Matthias Schiffer wrote:
>>>>> autoconfig_16550a() is doing all kinds of weird checks to detect different
>>>>> hardware by writing a lot of register values which are documented as
>>>>> reserved in the AR7242 datasheet (there's a leaked version going around
>>>>> that can be easily googled...), no idea if any of those are problematic.
>>>>> Just setting UPF_FIXED_TYPE as suggested by Peter would avoid that code
>>>>> altogether.
>>>>
>>>> That's just a debugging patch and not appropriate for permanent use,
>>>> the reason being that this uart is _not_ 16550 compatible (or even 16450
>>>> compatible).
>>>>
>>>> The three options for 8250 driver support for this part are:
>>>> 1. Similar to the debugging patch, set UPF_FIXED_TYPE but set port type
>>>> to PORT_8250 instead. This will lose FIFO support so 115K won't be
>>>> possible and likely neither will 38400.
>>>>
>>>> 2. Set UPF_FIXED_TYPE but define a new PORT_* value and add support for
>>>> this PORT_* value to uart_config array, uapi headers, and anywhere
>>>> the scratch register is used.
>>>>
>>>> 3. As with 2. above but don't set UPF_FIXED_TYPE and add a probe function
>>>> that detects ports of this type to autoconfig(). I don't recommend this
>>>> method.
>>>>
>>>> This requirement is independent of fixing prom_putchar_ar71xx().
>>>>
>>>
>>> I can send patches for all of this, and I think that 2. would be the nicest
>>> solution. I've noticed though that include/uapi/linux/serial_core.h is
>>> experiencing a little "overflow": PORT_MAX_8250 has grown just below the
>>> first non-8250 entry.
>>
>> Ugh. Thanks for noting this.
>>
>>> Should I just add the new entry at the bottom (and
>>> thus grow the uart_config array by ~85 unused entries)? What about
>>> PORT_MAX_8250 (used at least in drivers/tty/serial/8250/8250_of.c)?
>>
>> None of the above, unfortunately. Ok, plan B.
>>
>> I need to clean off a dusty series that adds probe steering and
>> bugs pass-thru for 8250 sub-drivers and platform data. Then add a
>> UART_BUG_NOSCR to indicate a uart does not have a scratch register
>> (like this one). Then for this uart, set UPF_FIXED_TYPE and type to
>> PORT_16550A, with UART_BUG_NOSCR flag.
>
> Introducing the UART_BUG_NOSCR flag for this UART is pointless in my opinion,
> because it does have a scratch register in fact. Even if it is not listed in the
> datasheet of the SoCs, it exists.

Ok, then the part(s) should be compatible enough with the 8250 driver as it is.
If boot hang problem persists on these parts, then the autoconfig() probes may
still be a problem, and my debugging patch from earlier can be used to skirt
autoconfig().


> I have tested that from the bootloader on the Netgear WNDR3700 board which uses
> the AR7161 SoC:

Thanks for testing.

Regards,
Peter Hurley