2015-07-24 22:43:08

by Bjorn Helgaas

Subject: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

I regularly see faults like this on an APM X-Gene:

U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
32 KB ICACHE, 32 KB DCACHE
SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
...
Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
Internal error: : 96000010 [#1] SMP
Modules linked in:
CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
Hardware name: APM X-Gene Mustang board (DT)
task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
PC is at pci_generic_config_read32+0x4c/0xb8
LR is at pci_generic_config_read32+0x40/0xb8
pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
...
Call trace:
[<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
[<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
[<ffffffc0003496a8>] pci_read_config+0x15c/0x238
[<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
[<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
[<ffffffc0001c361c>] __vfs_read+0x44/0x128
[<ffffffc0001c3e28>] vfs_read+0x84/0x144
[<ffffffc0001c4764>] SyS_read+0x50/0xb0

# lspci
00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)
01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family

I first saw this on an ancient kernel and thought it was likely specific to
my environment, but I'm now using an almost unmodified v4.1 kernel and
still seeing it. Does anybody else see this? The box does have a PCI card
installed, but I haven't yet worked out what device's config space we're
trying to read.

Is there anything I can do to debug this? I'm not an arm64 guy, but my
impression is that this is a page fault, and the address seems to be in the
"cfg" area ioremapped by xgene_pcie_map_reg(), so I'm not sure this is
really a PCI issue -- maybe that page mapping got trashed by somebody else?
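
One way to narrow down which device's config space is being read when the
fault hits is to walk the devices through sysfs, which is the same path the
call trace above goes through (sysfs_kf_bin_read -> pci_read_config). The
following is a minimal Python sketch, not a tested tool. It assumes the usual
/sys/bus/pci/devices/<BDF>/config layout, and the helper names parse_ids()
and scan() are made up for illustration. It logs each device name before
touching it, so the last line on the console names the device whose read
aborted. Note that running it on an affected box may itself trigger the crash.

```python
import struct
from pathlib import Path


def parse_ids(header):
    """Return (vendor_id, device_id) from the first 4 bytes of config space."""
    # Both IDs are 16-bit little-endian fields at offsets 0 and 2.
    vendor, device = struct.unpack_from("<HH", header, 0)
    return vendor, device


def scan(root="/sys/bus/pci/devices"):
    """Read the ID words of every PCI device through its sysfs config file."""
    for dev in sorted(Path(root).iterdir()):
        # Log first and flush, so the device name survives a crash in the read.
        print(f"reading {dev.name} ...", flush=True)
        with open(dev / "config", "rb") as f:
            vendor, device = parse_ids(f.read(4))
        print(f"{dev.name}: {vendor:04x}:{device:04x}")


# On the target machine (hypothetical usage):
# scan()
```

Running scan() in a loop with a console log attached should show whether the
abort always follows a read of the same device.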

Bjorn


2015-07-25 00:05:51

by Duc Dang

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

Hi Bjorn,

On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
>
> I regularly see faults like this on an APM X-Gene:
>
> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> 32 KB ICACHE, 32 KB DCACHE
> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> ...
> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> Internal error: : 96000010 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> Hardware name: APM X-Gene Mustang board (DT)
> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> PC is at pci_generic_config_read32+0x4c/0xb8
> LR is at pci_generic_config_read32+0x40/0xb8
> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> ...
> Call trace:
> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> [<ffffffc0001c4764>] SyS_read+0x50/0xb0

The log shows that the kernel gets an exception when trying to access the
Mellanox card's configuration space. This is usually caused by suboptimal
PCIe SerDes parameters being used on your board, which result in poor link
quality.
The PCIe SerDes programming is done in U-Boot, so I suggest you upgrade
to our latest X-Gene U-Boot release.

To get the latest X-Gene U-Boot release, please use the official APM
support channel:
https://myapm.apm.com

If you don't have an account, please register one at myapm.apm.com
using the following link:
https://myapm.apm.com/user/register

>
> # lspci
> 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)
> 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
>
> I first saw this on an ancient kernel and thought it was likely specific to
> my environment, but I'm now using an almost unmodified v4.1 kernel and
> still seeing it. Does anybody else see this? The box does have a PCI card
> installed, but I haven't yet worked out what device's config space we're
> trying to read.
>
> Is there anything I can do to debug this? I'm not an arm64 guy, but my
> impression is that this is a page fault, and the address seems to be in the
> "cfg" area ioremapped by xgene_pcie_map_reg(), so I'm not sure this is
> really a PCI issue -- maybe that page mapping got trashed by somebody else?
>
> Bjorn


--
Regards,
Duc Dang.

2015-07-27 11:36:28

by Catalin Marinas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
> > I regularly see faults like this on an APM X-Gene:
> >
> > U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> > 32 KB ICACHE, 32 KB DCACHE
> > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> > ...
> > Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034

That's generated by an external device (PCIe root complex, card etc.)
and some mis-configured CPU setting.

> > Internal error: : 96000010 [#1] SMP
> > Modules linked in:
> > CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> > Hardware name: APM X-Gene Mustang board (DT)
> > task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> > PC is at pci_generic_config_read32+0x4c/0xb8
> > LR is at pci_generic_config_read32+0x40/0xb8
> > pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> > ...
> > Call trace:
> > [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> > [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> > [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> > [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> > [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> > [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> > [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> > [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>
> The log shows kernel gets an exception when trying to access Mellanox
> card configuration space. This is usually due to suboptimal PCIe
> SerDes parameters are using in your board, which will cause bad link
> quality.

I would have hoped that "suboptimal" means that it still works, albeit
not fully optimal ;).

> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> U-Boot upgrade to our latest X-Gene U-Boot release.
>
> In order to access latest X-Gene U-Boot release, please use APM
> official support channel:
> https://myapm.apm.com
>
> Please register an account at myapm.apm.com if you don't have one
> using following link:
> https://myapm.apm.com/user/register

Isn't the latest U-Boot source for X-Gene publicly available anywhere?
It's GPL code anyway, so it shouldn't contain proprietary code that
requires registration or click-through agreements.

--
Catalin

2015-07-28 14:38:33

by Dall, Elizabeth J

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On 07/24/2015 04:43 PM, Bjorn Helgaas wrote:
> I regularly see faults like this on an APM X-Gene:
>
> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> 32 KB ICACHE, 32 KB DCACHE
> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> ...
> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034

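The 0x96000010 is the value of the ESR register. Its EC field (bits
[31:26]) is 0x25, which decodes to "Data Abort taken without a change in
Exception level". The fault status code in the ISS field is 0x10, which
decodes to "Synchronous External abort, not on translation table walk",
so it matches the fault message above.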
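
For anyone who wants to check the decode by hand, here is a small Python
sketch that splits an ESR value into its fields. It assumes the ARMv8-A
ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0], with
the fault status code in ISS bits [5:0] for data aborts). The function name
decode_esr() and the lookup tables are illustrative only and cover just a
few encodings.

```python
# Partial lookup tables; only a handful of encodings are listed here.
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
    0x26: "SP alignment fault",
}
DFSC_NAMES = {
    0x10: "Synchronous External abort, not on translation table walk",
}


def decode_esr(esr):
    """Split a 32-bit ESR value into (EC, IL, ISS)."""
    ec = (esr >> 26) & 0x3F   # exception class, bits [31:26]
    il = (esr >> 25) & 0x1    # instruction length bit, bit 25
    iss = esr & 0x1FFFFFF     # instruction specific syndrome, bits [24:0]
    return ec, il, iss


ec, il, iss = decode_esr(0x96000010)
dfsc = iss & 0x3F             # fault status code, meaningful for data aborts
print(f"EC   = {ec:#04x} ({EC_NAMES.get(ec, 'not in table')})")
print(f"IL   = {il}")
print(f"DFSC = {dfsc:#04x} ({DFSC_NAMES.get(dfsc, 'not in table')})")
```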

-Betty Dall

2015-07-28 16:43:36

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang wrote:
> Hi Bjorn,
>
> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
>>
>> I regularly see faults like this on an APM X-Gene:
>>
>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> 32 KB ICACHE, 32 KB DCACHE
>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> ...
>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>> Internal error: : 96000010 [#1] SMP
>> Modules linked in:
>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> Hardware name: APM X-Gene Mustang board (DT)
>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> PC is at pci_generic_config_read32+0x4c/0xb8
>> LR is at pci_generic_config_read32+0x40/0xb8
>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> ...
>> Call trace:
>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>
> The log shows kernel gets an exception when trying to access Mellanox
> card configuration space. This is usually due to suboptimal PCIe
> SerDes parameters are using in your board, which will cause bad link
> quality.
> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> U-Boot upgrade to our latest X-Gene U-Boot release.

I installed U-Boot 1.15.12, which I thought was the latest. I'm still
seeing this issue regularly, approx once/hour.

2015-07-28 17:40:01

by Duc Dang

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Mon, Jul 27, 2015 at 4:36 AM, Catalin Marinas wrote:
> On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
>> > I regularly see faults like this on an APM X-Gene:
>> >
>> > U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> > 32 KB ICACHE, 32 KB DCACHE
>> > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> > ...
>> > Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>
> That's generated by an external device (PCIe root complex, card etc.)
> and some mis-configured CPU setting.
>
>> > Internal error: : 96000010 [#1] SMP
>> > Modules linked in:
>> > CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> > Hardware name: APM X-Gene Mustang board (DT)
>> > task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> > PC is at pci_generic_config_read32+0x4c/0xb8
>> > LR is at pci_generic_config_read32+0x40/0xb8
>> > pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> > ...
>> > Call trace:
>> > [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> > [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> > [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> > [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> > [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> > [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> > [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> > [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>
>> The log shows kernel gets an exception when trying to access Mellanox
>> card configuration space. This is usually due to suboptimal PCIe
>> SerDes parameters are using in your board, which will cause bad link
>> quality.
>
> I would have hoped that "suboptimal" means that it still works, albeit
> not fully optimal ;).

Yes, it should still work, but you may see crashes occasionally due to
link quality.

>
>> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> U-Boot upgrade to our latest X-Gene U-Boot release.
>>
>> In order to access latest X-Gene U-Boot release, please use APM
>> official support channel:
>> https://myapm.apm.com
>>
>> Please register an account at myapm.apm.com if you don't have one
>> using following link:
>> https://myapm.apm.com/user/register
>
> Isn't the latest U-Boot source for X-Gene publicly available anywhere?
> It's GPL code anyway, so it shouldn't have proprietary code to require
> registration, click-through agreements.

The APM X-Gene U-Boot isn't publicly available yet. If this is
required, though, we can set up a public Git repository hosted on an
APM server.

For now, customers who have a board from APM will have to use MyAPM to
get the U-Boot source and binary.
>
> --
> Catalin



--
Regards,
Duc Dang.

2015-07-28 17:46:00

by Duc Dang

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas wrote:
> On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang wrote:
>> Hi Bjorn,
>>
>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
>>>
>>> I regularly see faults like this on an APM X-Gene:
>>>
>>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>>> 32 KB ICACHE, 32 KB DCACHE
>>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>>> ...
>>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>>> Internal error: : 96000010 [#1] SMP
>>> Modules linked in:
>>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>>> Hardware name: APM X-Gene Mustang board (DT)
>>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>>> PC is at pci_generic_config_read32+0x4c/0xb8
>>> LR is at pci_generic_config_read32+0x40/0xb8
>>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>>> ...
>>> Call trace:
>>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>
>> The log shows kernel gets an exception when trying to access Mellanox
>> card configuration space. This is usually due to suboptimal PCIe
>> SerDes parameters are using in your board, which will cause bad link
>> quality.
>> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> U-Boot upgrade to our latest X-Gene U-Boot release.
>
> I installed U-Boot 1.15.12, which I thought was the latest. I'm still
> seeing this issue regularly, approx once/hour.

Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
version to use. Are you running any PCIe traffic test when the error
happens? I will try to reproduce the issue with my Mustang board as
well.

It would also be useful if you could share your "lspci -vvv" output
from the running board, so we can check whether any error status is
reported.
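
Checking for error status can be partly automated. The sketch below
(Python, illustrative only; set_flags() and report() are not part of any
existing tool) pulls out the flags that lspci prints with a trailing "+"
on status lines such as DevSta, UESta and CESta. It assumes the usual
"lspci -vvv" text format of "Name: Flag+ Flag- ...".

```python
import re

# A flag is a run of non-space characters followed by '+' or '-',
# terminated by whitespace or end of line.
FLAG = re.compile(r"(\S+?)([+-])(?=\s|$)")

# Registers whose set bits indicate a logged error or pending state.
STATUS_REGS = ("DevSta", "UESta", "CESta")


def set_flags(line):
    """Return the names of all flags marked '+' on one lspci line."""
    _, _, rest = line.partition(":")
    return [name for name, sign in FLAG.findall(rest) if sign == "+"]


def report(lspci_text):
    """Print the set flags of every status register found in the text."""
    for line in lspci_text.splitlines():
        name = line.strip().split(":", 1)[0]
        if name in STATUS_REGS:
            flags = set_flags(line)
            if flags:
                print(f"{name}: {' '.join(flags)}")


sample = "CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-"
print(set_flags(sample))
```

Feeding report() the saved output of "lspci -vvv" gives a short list of
the status bits that are set, which is easier to compare between runs
than the full dump.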

--
Regards,
Duc Dang.

2015-07-28 18:36:25

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Tue, Jul 28, 2015 at 12:39 PM, Duc Dang wrote:
> On Mon, Jul 27, 2015 at 4:36 AM, Catalin Marinas wrote:
>> On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
>>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
>>> > I regularly see faults like this on an APM X-Gene:
>>> >
>>> > U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>>> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>>> > 32 KB ICACHE, 32 KB DCACHE
>>> > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>>> > ...
>>> > Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>>
>> That's generated by an external device (PCIe root complex, card etc.)
>> and some mis-configured CPU setting.
>>
>>> > Internal error: : 96000010 [#1] SMP
>>> > Modules linked in:
>>> > CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>>> > Hardware name: APM X-Gene Mustang board (DT)
>>> > task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>>> > PC is at pci_generic_config_read32+0x4c/0xb8
>>> > LR is at pci_generic_config_read32+0x40/0xb8
>>> > pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>>> > ...
>>> > Call trace:
>>> > [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>>> > [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>>> > [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>>> > [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>>> > [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>>> > [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>>> > [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>>> > [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>>
>>> The log shows kernel gets an exception when trying to access Mellanox
>>> card configuration space. This is usually due to suboptimal PCIe
>>> SerDes parameters are using in your board, which will cause bad link
>>> quality.
>>
>> I would have hoped that "suboptimal" means that it still works, albeit
>> not fully optimal ;).
>
> Yes, it should still work, but you may see crashes occasionally due to
> link quality.

A crash seems like a too-severe response to a link quality issue.
Isn't there some way to retry the access or return an error, so we
don't have to crash the whole system?

2015-07-28 21:29:51

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas wrote:
> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang wrote:
> >> Hi Bjorn,
> >>
> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
> >>>
> >>> I regularly see faults like this on an APM X-Gene:
> >>>
> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >>> 32 KB ICACHE, 32 KB DCACHE
> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >>> ...
> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >>> Internal error: : 96000010 [#1] SMP
> >>> Modules linked in:
> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >>> Hardware name: APM X-Gene Mustang board (DT)
> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >>> PC is at pci_generic_config_read32+0x4c/0xb8
> >>> LR is at pci_generic_config_read32+0x40/0xb8
> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >>> ...
> >>> Call trace:
> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >>
> >> The log shows kernel gets an exception when trying to access Mellanox
> >> card configuration space. This is usually due to suboptimal PCIe
> >> SerDes parameters are using in your board, which will cause bad link
> >> quality.
> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >
> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still
> > seeing this issue regularly, approx once/hour.
>
> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> version to use. Are you running any PCIe traffic test when the error
> happens?

Nope, the machine was either idle or running a reboot test; no PCIe stress
test or anything.

> And it will be useful if you can share your "lspci -vvv" output when
> the board is running, we can check to see if there is any error status
> reported.

Here's some lspci output and info about the firmware I'm running.
Obviously this lspci output was collected before a crash. I have also
seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.

U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)

CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
32 KB ICACHE, 32 KB DCACHE
SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
Boot from SPI-NOR
Slimpro FW:
Ver: 2.4 (build 01.15.12.00 2015/05/20)
PMD: 970 mV
SOC: 950 mV
Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
I2C: ready
DRAM: ECC 32 GiB @ 1600MHz
SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
MMC: X-Gene SD/SDIO/eMMC: 0
PCIE0: (RC) X8 GEN-3 link up
00:00.0 - 10e8:e004 - Bridge device
01:00.0 - 15b3:1007 - Network controller

# lspci -vvv
00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
I/O behind bridge: 0000f000-00000fff
Memory behind bridge: 80000000-82ffffff
Prefetchable memory behind bridge: 0000000083000000-00000000830fffff
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
ExtTag- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
LnkCap: Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
ClockPM- Surprise+ LLActRep+ BwNot+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
Control: AttnInd Off, PwrInd Off, Power- Interlock-
SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
Changed: MRL- PresDet- LinkState+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
RootCap: CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [80] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [180 v1] #19
Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Kernel driver in use: pcieport

01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 226
Region 0: [virtual] Memory at e182000000 (32-bit, non-prefetchable) [size=1M]
Region 2: [virtual] Memory at e180000000 (32-bit, non-prefetchable) [size=32M]
[virtual] Expansion ROM at e183000000 [disabled] [size=1M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [9c] MSI-X: Enable- Count=64 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed unknown, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 0
ARICtl: MFVC- ACS-, Function Group: 0
Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
Capabilities: [154 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [18c v1] #19
Kernel modules: mlx4_core

2015-07-28 21:51:11

by Duc Dang

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas wrote:
> On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
>> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas wrote:
>> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang wrote:
>> >> Hi Bjorn,
>> >>
>> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas wrote:
>> >>>
>> >>> I regularly see faults like this on an APM X-Gene:
>> >>>
>> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> >>> 32 KB ICACHE, 32 KB DCACHE
>> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> >>> ...
>> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>> >>> Internal error: : 96000010 [#1] SMP
>> >>> Modules linked in:
>> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> >>> Hardware name: APM X-Gene Mustang board (DT)
>> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> >>> PC is at pci_generic_config_read32+0x4c/0xb8
>> >>> LR is at pci_generic_config_read32+0x40/0xb8
>> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> >>> ...
>> >>> Call trace:
>> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>> >>
>> >> The log shows kernel gets an exception when trying to access Mellanox
>> >> card configuration space. This is usually due to suboptimal PCIe
>> >> SerDes parameters are using in your board, which will cause bad link
>> >> quality.
>> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> >> U-Boot upgrade to our latest X-Gene U-Boot release.
>> >
>> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still
>> > seeing this issue regularly, approx once/hour.
>>
>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>> version to use. Are you running any PCIe traffic test when the error
>> happens?
>
> Nope, the machine was either idle or running a reboot test; no PCIe stress
> test or anything.
>
>> And it will be useful if you can share your "lspci -vvv" output when
>> the board is running, we can check to see if there is any error status
>> reported.
>
> Here's some lspci output and info about the firmware I'm running.
> Obviously this lspci output was collected before a crash. I have also
> seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.
>
> U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
>
> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> 32 KB ICACHE, 32 KB DCACHE
> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> Boot from SPI-NOR
> Slimpro FW:
> Ver: 2.4 (build 01.15.12.00 2015/05/20)
> PMD: 970 mV
> SOC: 950 mV
> Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
> I2C: ready
> DRAM: ECC 32 GiB @ 1600MHz
> SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
> MMC: X-Gene SD/SDIO/eMMC: 0
> PCIE0: (RC) X8 GEN-3 link up
> 00:00.0 - 10e8:e004 - Bridge device
> 01:00.0 - 15b3:1007 - Network controller
>
> # lspci -vvv
> 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Latency: 0
> Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
> I/O behind bridge: 0000f000-00000fff
> Memory behind bridge: 80000000-82ffffff
> Prefetchable memory behind bridge: 0000000083000000-00000000830fffff
> Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
> BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
> PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
> Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
> ExtTag- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
> RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
> MaxPayload 256 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
> LnkCap: Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
> ClockPM- Surprise+ LLActRep+ BwNot+
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
> SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
> Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
> SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
> Control: AttnInd Off, PwrInd Off, Power- Interlock-
> SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
> Changed: MRL- PresDet- LinkState+
> RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
> RootCap: CRSVisible-
> RootSta: PME ReqID 0000, PMEStatus- PMEPending-
> DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
> LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB

A Target Link Speed of "Unknown" is really strange. I also see the same
"Speed unknown" for the Mellanox card below.

> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB
> Capabilities: [80] Power Management version 3
> Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [100 v1] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> CEMsk: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [180 v1] #19
> Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
> Kernel driver in use: pcieport
>
> 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-

Mem and BusMaster are disabled. So this card is not functional?

> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> Interrupt: pin A routed to IRQ 226
> Region 0: [virtual] Memory at e182000000 (32-bit, non-prefetchable) [size=1M]
> Region 2: [virtual] Memory at e180000000 (32-bit, non-prefetchable) [size=32M]
> [virtual] Expansion ROM at e183000000 [disabled] [size=1M]
> Capabilities: [40] Power Management version 3
> Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> Capabilities: [9c] MSI-X: Enable- Count=64 Masked-

This may be unrelated, but MSI allocation fails for this card somehow.

> Vector table: BAR=0 offset=0007c000
> PBA: BAR=0 offset=0007d000
> Capabilities: [60] Express (v2) Endpoint, MSI 00
> DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
> ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> MaxPayload 128 bytes, MaxReadReq 512 bytes
> DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
> LnkCap: Port #8, Speed unknown, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
> ClockPM- Surprise- LLActRep- BwNot-
> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
> LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
> Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> Compliance De-emphasis: -6dB
> LnkSta2: Current De-emphasis Level: -6dB
> Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
> ARICap: MFVC- ACS-, Next Function: 0
> ARICtl: MFVC- ACS-, Function Group: 0
> Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx

The serial number here seems invalid. I have a different Mellanox model
(ConnectX-3, 15b3:1003) that shows a meaningful serial number:
Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30.

Do you have another PCIe card to try with the same reboot test on this board?

> Capabilities: [154 v2] Advanced Error Reporting
> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
> AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
> Capabilities: [18c v1] #19
> Kernel modules: mlx4_core

--
Regards,
Duc Dang.

2015-07-29 01:23:04

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas <[email protected]> wrote:
> > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <[email protected]> wrote:
> >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <[email protected]> wrote:
> >> >> Hi Bjorn,
> >> >>
> >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <[email protected]> wrote:
> >> >>>
> >> >>> I regularly see faults like this on an APM X-Gene:
> >> >>>
> >> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >> >>> 32 KB ICACHE, 32 KB DCACHE
> >> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >> >>> ...
> >> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >> >>> Internal error: : 96000010 [#1] SMP
> >> >>> Modules linked in:
> >> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >> >>> Hardware name: APM X-Gene Mustang board (DT)
> >> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >> >>> PC is at pci_generic_config_read32+0x4c/0xb8
> >> >>> LR is at pci_generic_config_read32+0x40/0xb8
> >> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >> >>> ...
> >> >>> Call trace:
> >> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >> >>
> >> >> The log shows kernel gets an exception when trying to access Mellanox
> >> >> card configuration space. This is usually due to suboptimal PCIe
> >> >> SerDes parameters are using in your board, which will cause bad link
> >> >> quality.
> >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >> >
> >> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still
> >> > seeing this issue regularly, approx once/hour.
> >>
> >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> >> version to use. Are you running any PCIe traffic test when the error
> >> happens?
> >
> > Nope, the machine was either idle or running a reboot test; no PCIe stress
> > test or anything.
> >
> >> And it will be useful if you can share your "lspci -vvv" output when
> >> the board is running, we can check to see if there is any error status
> >> reported.
> >
> > Here's some lspci output and info about the firmware I'm running.
> > Obviously this lspci output was collected before a crash. I have also
> > seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.
> >
> > U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
> >
> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> > 32 KB ICACHE, 32 KB DCACHE
> > SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> > Boot from SPI-NOR
> > Slimpro FW:
> > Ver: 2.4 (build 01.15.12.00 2015/05/20)
> > PMD: 970 mV
> > SOC: 950 mV
> > Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
> > I2C: ready
> > DRAM: ECC 32 GiB @ 1600MHz
> > SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
> > MMC: X-Gene SD/SDIO/eMMC: 0
> > PCIE0: (RC) X8 GEN-3 link up
> > 00:00.0 - 10e8:e004 - Bridge device
> > 01:00.0 - 15b3:1007 - Network controller
> >
> > # lspci -vvv
> > 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])

> > LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
>
> Target Link Speed unknown is really strange. I also saw the same "Link
> speed unknown" for Mellanox card below.

I think this is because I have a really old lspci. Here's the -xxx output:

00: e8 10 04 e0 07 00 10 00 04 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 f1 01 00 00
20: 00 80 f0 82 01 83 01 83 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 01 00 03 06 08 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

LnkCtl2 is at offset 0x30 in the PCIe capability, which starts at 0x40,
so it lives at config offset 0x70 and LnkCtl2 = 0x0043. I think that means
Target Link Speed (bits 3:0) is 0x3, or "Supported Link Speeds Vector field
bit 2". The Supported Link Speeds Vector in LnkCap2 (config offset 0x6c
here, which isn't decoded even by current upstream lspci) is 0x7, so
2.5GT/s, 5.0GT/s, and 8.0GT/s are all supported, with bit 2 being 8.0GT/s.
So I think a modern lspci would show "8.0GT/s".
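
The decoding above can be reproduced straight from the hex dump. What follows
is only a sketch in Python, assuming the standard PCIe capability layout
(LnkCap2 at +0x2c, LnkCtl2 at +0x30 from the start of the capability). The
helper names parse_dump(), find_capability() and le() are made up for this
illustration and are not part of any tool mentioned in this thread.

```python
# First 0x80 bytes of the Root Port config space, copied from the dump above.
DUMP = """\
00: e8 10 04 e0 07 00 10 00 04 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 f1 01 00 00
20: 00 80 f0 82 01 83 01 83 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
"""

PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_EXP = 0x10        # PCI Express capability ID
PCI_EXP_LNKCAP2 = 0x2c       # offsets relative to the capability start
PCI_EXP_LNKCTL2 = 0x30
SPEEDS = ["2.5GT/s", "5.0GT/s", "8.0GT/s"]  # vector bits 0, 1, 2


def parse_dump(text):
    """Turn 'lspci -xxx' style rows into a flat bytes object."""
    cfg = bytearray()
    for line in text.splitlines():
        _, _, data = line.partition(":")
        cfg += bytes.fromhex(data.strip())
    return bytes(cfg)


def find_capability(cfg, cap_id):
    """Walk the capability list; return the offset of cap_id or None."""
    ptr = cfg[PCI_CAPABILITY_LIST] & 0xfc
    for _ in range(48):                      # guard against list loops
        if not ptr or ptr + 1 >= len(cfg):
            break
        if cfg[ptr] == cap_id:
            return ptr
        ptr = cfg[ptr + 1] & 0xfc
    return None


def le(cfg, offset, size):
    """Read a little-endian register of `size` bytes."""
    return int.from_bytes(cfg[offset:offset + size], "little")


cfg = parse_dump(DUMP)
exp = find_capability(cfg, PCI_CAP_ID_EXP)
lnkctl2 = le(cfg, exp + PCI_EXP_LNKCTL2, 2)
lnkcap2 = le(cfg, exp + PCI_EXP_LNKCAP2, 4)

target = lnkctl2 & 0xf            # Target Link Speed, bits 3:0
vector = (lnkcap2 >> 1) & 0x7f    # Supported Link Speeds Vector, bits 7:1
supported = [s for bit, s in enumerate(SPEEDS) if vector & (1 << bit)]
# Encoding N selects vector bit N-1; only the three speeds above are mapped.
target_speed = SPEEDS[target - 1] if 1 <= target <= len(SPEEDS) else "unknown"

print(f"PCIe cap at {exp:#x}, LnkCtl2 = {lnkctl2:#06x}, LnkCap2 = {lnkcap2:#010x}")
print(f"supported: {supported}, target: {target_speed}")
```

The table of speeds only covers the three vector bits discussed here; any
other encoding is reported as "unknown" rather than guessed.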

> > 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
> > Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>
> Mem and BusMaster are disabled. So this card is not functional?

I don't know whether it's functional; I haven't tried to use it yet.

I typically don't even load the mlx4 driver, so most of the failures I'm
seeing are when the driver isn't loaded. User-space code is doing config
reads via /sys.
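
For anyone wanting to reproduce the kind of user-space access involved, here
is a rough sketch in Python of a config read through sysfs, decoding the
Command register bits that lspci prints as "I/O", "Mem" and "BusMaster". The
path in the comment assumes PCI domain 0000, and read_config() and
decode_command() are hypothetical helper names, not what my user-space code
actually uses. Each read of the sysfs "config" file ends up in
pci_read_config(), as in the call trace.

```python
import sys

PCI_COMMAND = 0x04
# Bit positions of the first three Command register flags.
COMMAND_BITS = {0: "I/O", 1: "Mem", 2: "BusMaster"}


def read_config(path, offset, size):
    """Read a little-endian value of `size` bytes at `offset` from a
    sysfs config file such as /sys/bus/pci/devices/0000:01:00.0/config."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(size)
    if len(data) != size:
        raise OSError(f"short read at {offset:#x} from {path}")
    return int.from_bytes(data, "little")


def decode_command(value):
    """Format the Command register the way lspci does (name+ or name-)."""
    return " ".join(
        f"{name}{'+' if value & (1 << bit) else '-'}"
        for bit, name in COMMAND_BITS.items()
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 readcfg.py /sys/bus/pci/devices/0000:01:00.0/config
    command = read_config(sys.argv[1], PCI_COMMAND, 2)
    print(f"Control: {decode_command(command)}")
```

A Command value of 0x0007, as in the Root Port dump above, decodes to
"I/O+ Mem+ BusMaster+", which agrees with the lspci -vvv output.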

> > Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
>
> The serial number here seems invalid. I have a Mellanox card but
> different model (ConnectX-3 15b3:1003) that shows meaningful serial
> number:
> Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30.

My fault, lspci actually showed a meaningful serial number; I removed
it in a misguided attempt to avoid exposing anything proprietary.

> Do you have another PCIe card to try on the same reboot test on this board?

I've seen this on at least two Mellanox cards. I'm running similar tests
on a different type of card now.

Bjorn

2015-07-29 15:55:18

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:

> > Do you have another PCIe card to try on the same reboot test on this board?
>
> I've seen this on at least two Mellanox cards. I'm running similar tests
> on a different type of card now.

FWIW, reboot tests on two machines with Mellanox cards failed, while the
same test on a machine with a different proprietary card succeeded.

2015-07-31 17:00:38

by Duc Dang

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <[email protected]> wrote:
> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>
>> > Do you have another PCIe card to try on the same reboot test on this board?
>>
>> I've seen this on at least two Mellanox cards. I'm running similar tests
>> on a different type of card now.
>
> FWIW, reboot tests on two machines with Mellanox cards failed, while the
> same test on a machine with a different proprietary card succeeded.

Thanks, Bjorn.

I don't have the same Mellanox card as yours, but I will also run a
similar reboot test to see if I hit the same issue with my card.

--
Regards,
Duc Dang.

2015-08-10 16:18:50

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <[email protected]> wrote:
> On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <[email protected]> wrote:
>> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>
>>> > Do you have another PCIe card to try on the same reboot test on this board?
>>>
>>> I've seen this on at least two Mellanox cards. I'm running similar tests
>>> on a different type of card now.
>>
>> FWIW, reboot tests on two machines with Mellanox cards failed, while the
>> same test on a machine with a different proprietary card succeeded.
>
> Thanks, Bjorn.
>
> I don't have the same Mellanox card as yours, but I will also run
> similar reboot test to see if I hit the same issue with my card.

Any more hints on this? Nothing has changed on my end, so of course
I'm still seeing this, always on machines with Mellanox, and never on
other machines. Could this be a hardware issue like a signal
integrity or margin issue? I don't know where to go from here because
I'm not a hardware person, and I don't know anything to do in
software.

Bjorn

2015-08-10 17:38:14

by Catalin Marinas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Mon, Aug 10, 2015 at 11:18:23AM -0500, Bjorn Helgaas wrote:
> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <[email protected]> wrote:
> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <[email protected]> wrote:
> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
> >>
> >>> > Do you have another PCIe card to try on the same reboot test on this board?
> >>>
> >>> I've seen this on at least two Mellanox cards. I'm running similar tests
> >>> on a different type of card now.
> >>
> >> FWIW, reboot tests on two machines with Mellanox cards failed, while the
> >> same test on a machine with a different proprietary card succeeded.
> >
> > Thanks, Bjorn.
> >
> > I don't have the same Mellanox card as yours, but I will also run
> > similar reboot test to see if I hit the same issue with my card.
>
> Any more hints on this? Nothing has changed on my end, so of course
> I'm still seeing this, always on machines with Mellanox, and never on
> other machines. Could this be a hardware issue like a signal
> integrity or margin issue? I don't know where to go from here because
> I'm not a hardware person, and I don't know anything to do in
> software.

Silly hack below, not actually a solution (and it may not even work):

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 94d98cd1aad8..e895e96b3d13 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -369,6 +369,14 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	return 1;
 }

+/*
+ * Retry the faulty access.
+ */
+static int do_good(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+	return 0;
+}
+
 static struct fault_info {
 	int (*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
 	int sig;
@@ -391,7 +399,7 @@ static struct fault_info {
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 1 permission fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 2 permission fault" },
 	{ do_page_fault, SIGSEGV, SEGV_ACCERR, "level 3 permission fault" },
-	{ do_bad, SIGBUS, 0, "synchronous external abort" },
+	{ do_good, SIGBUS, 0, "synchronous external abort" },
 	{ do_bad, SIGBUS, 0, "asynchronous external abort" },
 	{ do_bad, SIGBUS, 0, "unknown 18" },
 	{ do_bad, SIGBUS, 0, "unknown 19" },
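
To see why that one table entry matters, here is a toy model in Python (not
kernel code) of how the reported ESR value 0x96000010 selects it. It assumes
the ARMv8 ESR layout (EC in bits 31:26, IL in bit 25, fault status code in
bits 5:0) and that the handler is picked by indexing fault_info with
"esr & 63". The Python names below merely mirror the C ones, and the tables
hold only the two entries relevant here.

```python
ESR = 0x96000010
ADDR = 0xffffff8000110034


def decode_esr(esr):
    """Split an ESR value into the fields relevant to a data abort."""
    return {
        "ec": (esr >> 26) & 0x3f,   # exception class
        "il": (esr >> 25) & 0x1,    # instruction length bit
        "fsc": esr & 0x3f,          # fault status code, the table index
    }


def do_bad(addr, esr):
    return 1    # non-zero: not handled, the caller reports the fault


def do_good(addr, esr):
    return 0    # zero: treated as handled, the faulting access is retried


# Only the entries around index 16 are modelled here.
STOCK = {
    16: (do_bad, "synchronous external abort"),
    17: (do_bad, "asynchronous external abort"),
}
PATCHED = dict(STOCK)
PATCHED[16] = (do_good, "synchronous external abort")


def do_mem_abort(addr, esr, table):
    """Return the message that would be printed, or None if handled."""
    fn, name = table[esr & 63]
    if not fn(addr, esr):
        return None
    return f"Unhandled fault: {name} (0x{esr:08x}) at 0x{addr:016x}"


fields = decode_esr(ESR)
print(f"EC={fields['ec']:#x} IL={fields['il']} FSC={fields['fsc']:#x}")
print(do_mem_abort(ADDR, ESR, STOCK))
print(do_mem_abort(ADDR, ESR, PATCHED))
```

EC 0x25 is a data abort taken without a change in exception level, and fault
status 0x10 is the synchronous external abort. With the patched table the
fault is silently swallowed and the load is reissued, so if the abort is
persistent rather than transient the access would presumably just fault
again, over and over.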

--
Catalin

2015-08-10 17:42:44

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <[email protected]> wrote:
> On Monday, August 10, 2015, Bjorn Helgaas <[email protected]> wrote:
>>
>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <[email protected]> wrote:
>> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <[email protected]>
>> > wrote:
>> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>> >>
>> >>> > Do you have another PCIe card to try on the same reboot test on this
>> >>> > board?
>> >>>
>> >>> I've seen this on at least two Mellanox cards. I'm running similar
>> >>> tests
>> >>> on a different type of card now.
>> >>
>> >> FWIW, reboot tests on two machines with Mellanox cards failed, while
>> >> the
>> >> same test on a machine with a different proprietary card succeeded.
>> >
>> > Thanks, Bjorn.
>> >
>> > I don't have the same Mellanox card as yours, but I will also run
>> > similar reboot test to see if I hit the same issue with my card.
>>
>> Any more hints on this? Nothing has changed on my end, so of course
>> I'm still seeing this, always on machines with Mellanox, and never on
>> other machines. Could this be a hardware issue like a signal
>> integrity or margin issue? I don't know where to go from here because
>> I'm not a hardware person, and I don't know anything to do in
>> software.
>
>
> Hi Bjorn,
>
> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
> family, one card has 2 10G interfaces, the other one has 1 port that
> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
> the crash that you encounterred.
>
> Did you check if your Mellanox cards have latest firmware? I did see some
> link issues on my Mellanox cards with its old firmware before.

Good idea; I'll check that, too. Also, I just learned that these
cards are installed with an extender card because of some space issues,
so we're going to test again without the extender.

2015-08-10 19:07:45

by Duc Dang

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <[email protected]> wrote:
> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <[email protected]> wrote:
>> On Monday, August 10, 2015, Bjorn Helgaas <[email protected]> wrote:
>>>
>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <[email protected]> wrote:
>>> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <[email protected]>
>>> > wrote:
>>> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>> >>
>>> >>> > Do you have another PCIe card to try on the same reboot test on this
>>> >>> > board?
>>> >>>
>>> >>> I've seen this on at least two Mellanox cards. I'm running similar
>>> >>> tests
>>> >>> on a different type of card now.
>>> >>
>>> >> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>> >> the
>>> >> same test on a machine with a different proprietary card succeeded.
>>> >
>>> > Thanks, Bjorn.
>>> >
>>> > I don't have the same Mellanox card as yours, but I will also run
>>> > similar reboot test to see if I hit the same issue with my card.
>>>
>>> Any more hints on this? Nothing has changed on my end, so of course
>>> I'm still seeing this, always on machines with Mellanox, and never on
>>> other machines. Could this be a hardware issue like a signal
>>> integrity or margin issue? I don't know where to go from here because
>>> I'm not a hardware person, and I don't know anything to do in
>>> software.
>>
>>
>> Hi Bjorn,
>>
>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>> family, one card has 2 10G interfaces, the other one has 1 port that
>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>> the crash that you encounterred.
>>
>> Did you check if your Mellanox cards have latest firmware? I did see some
>> link issues on my Mellanox cards with its old firmware before.
>
> Good idea; I'll check that, too. Also, I just learned that these
> cards on installed with an extender card because of some space issues,
> so we're going to test again without the extender.

Hi Bjorn,

Are the other cards that passed your test installed directly in the
on-board PCIe slot?
If yes, then this is a good data point, and it will be useful to test
the case where your Mellanox cards are installed directly in the
on-board PCIe slot.

--
Regards,
Duc Dang.

2015-08-11 19:29:03

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <[email protected]> wrote:
> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <[email protected]> wrote:
>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <[email protected]> wrote:
>>> On Monday, August 10, 2015, Bjorn Helgaas <[email protected]> wrote:
>>>>
>>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <[email protected]> wrote:
>>>> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <[email protected]>
>>>> > wrote:
>>>> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>>> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>>> >>
>>>> >>> > Do you have another PCIe card to try on the same reboot test on this
>>>> >>> > board?
>>>> >>>
>>>> >>> I've seen this on at least two Mellanox cards. I'm running similar
>>>> >>> tests
>>>> >>> on a different type of card now.
>>>> >>
>>>> >> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>>> >> the
>>>> >> same test on a machine with a different proprietary card succeeded.
>>>> >
>>>> > Thanks, Bjorn.
>>>> >
>>>> > I don't have the same Mellanox card as yours, but I will also run
>>>> > similar reboot test to see if I hit the same issue with my card.
>>>>
>>>> Any more hints on this? Nothing has changed on my end, so of course
>>>> I'm still seeing this, always on machines with Mellanox, and never on
>>>> other machines. Could this be a hardware issue like a signal
>>>> integrity or margin issue? I don't know where to go from here because
>>>> I'm not a hardware person, and I don't know anything to do in
>>>> software.
>>>
>>>
>>> Hi Bjorn,
>>>
>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>>> family, one card has 2 10G interfaces, the other one has 1 port that
>>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>>> the crash that you encounterred.
>>>
>>> Did you check if your Mellanox cards have latest firmware? I did see some
>>> link issues on my Mellanox cards with its old firmware before.
>>
>> Good idea; I'll check that, too. Also, I just learned that these
>> cards on installed with an extender card because of some space issues,
>> so we're going to test again without the extender.
>
> Hi Bjorn,
>
> Are other cards that passed your test installed directly to the
> on-board PCIe slot?
> If yes, then this is a good data point and it will be useful to test
> the case where
> your Mellanox cards are directly installed into the on-board PCIe slot.

The cards that passed the test were installed directly, with no
extender. We removed the extender from one of the machines with the
Mellanox card and have not seen this issue since then. I think it's
very likely that the problem is related to using the extender.

Bjorn

2016-04-13 09:58:22

by Sudeep Holla

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

Hi,

(sorry for replying to this old thread, but I think it could be related
to the issue I have now)

On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <[email protected]> wrote:
> On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
>> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <[email protected]> wrote:
>> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <[email protected]> wrote:
>> >> Hi Bjorn,
>> >>
>> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <[email protected]> wrote:
>> >>>
>> >>> I regularly see faults like this on an APM X-Gene:
>> >>>
>> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> >>> 32 KB ICACHE, 32 KB DCACHE
>> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> >>> ...
>> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>> >>> Internal error: : 96000010 [#1] SMP
>> >>> Modules linked in:
>> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> >>> Hardware name: APM X-Gene Mustang board (DT)
>> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> >>> PC is at pci_generic_config_read32+0x4c/0xb8
>> >>> LR is at pci_generic_config_read32+0x40/0xb8
>> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> >>> ...
>> >>> Call trace:
>> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>> >>
>> >> The log shows kernel gets an exception when trying to access Mellanox
>> >> card configuration space. This is usually due to suboptimal PCIe
>> >> SerDes parameters are using in your board, which will cause bad link
>> >> quality.
>> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> >> U-Boot upgrade to our latest X-Gene U-Boot release.
>> >
>> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still
>> > seeing this issue regularly, approx once/hour.
>>
>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>> version to use. Are you running any PCIe traffic test when the error
>> happens?
>
> Nope, the machine was either idle or running a reboot test; no PCIe stress
> test or anything.
>

Was there any conclusion on this?
I am having a similar issue [1] on my Juno with the sky2 PCIe driver during reboot.

Regards,
Sudeep

[1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2

2016-04-13 13:21:07

by Bjorn Helgaas

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote:
> Hi,
>
> (sorry for replying on the old thread, but I found it could be related
> to the issue
> I have now)
>
> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <[email protected]> wrote:
> > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <[email protected]> wrote:
> >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <[email protected]> wrote:
> >> >> Hi Bjorn,
> >> >>
> >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <[email protected]> wrote:
> >> >>>
> >> >>> I regularly see faults like this on an APM X-Gene:
> >> >>>
> >> >>> U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >> >>> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >> >>> 32 KB ICACHE, 32 KB DCACHE
> >> >>> SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >> >>> ...
> >> >>> Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >> >>> Internal error: : 96000010 [#1] SMP
> >> >>> Modules linked in:
> >> >>> CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >> >>> Hardware name: APM X-Gene Mustang board (DT)
> >> >>> task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >> >>> PC is at pci_generic_config_read32+0x4c/0xb8
> >> >>> LR is at pci_generic_config_read32+0x40/0xb8
> >> >>> pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >> >>> ...
> >> >>> Call trace:
> >> >>> [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >> >>> [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >> >>> [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >> >>> [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >> >>> [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >> >>> [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >> >>> [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >> >>> [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >> >>
> >> >> The log shows kernel gets an exception when trying to access Mellanox
> >> >> card configuration space. This is usually due to suboptimal PCIe
> >> >> SerDes parameters are using in your board, which will cause bad link
> >> >> quality.
> >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >> >
> >> > I installed U-Boot 1.15.12, which I thought was the latest. I'm still
> >> > seeing this issue regularly, approx once/hour.
> >>
> >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> >> version to use. Are you running any PCIe traffic test when the error
> >> happens?
> >
> > Nope, the machine was either idle or running a reboot test; no PCIe stress
> > test or anything.
> >
>
> Was there any conclusion on this ?
> I am having similar issue[1] on my Juno with sky2 PCIe driver during reboot.

We found that the unhandled faults occurred when using an extender
card. After removing the extender card, we didn't see the faults any
more.

> [1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2

2016-04-13 13:29:21

by Sudeep Holla

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32



On 13/04/16 14:21, Bjorn Helgaas wrote:
> On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote:
>> Hi,
>>
>> (sorry for replying on the old thread, but I found it could be related
>> to the issue
>> I have now)
>>
>> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <[email protected]> wrote:

[...]

>>>>
>>>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>>>> version to use. Are you running any PCIe traffic test when the error
>>>> happens?
>>>
>>> Nope, the machine was either idle or running a reboot test; no PCIe stress
>>> test or anything.
>>>
>>
>> Was there any conclusion on this ?
>> I am having similar issue[1] on my Juno with sky2 PCIe driver during reboot.
>
> We found that the unhandled faults occurred when using an extender
> card. After removing the extender card, we didn't see the faults any
> more.
>

Thanks for the response. It's not related then; I saw the report referencing
reboot tests and hence linked them together. Sorry for the noise.

--
Regards,
Sudeep

>> [1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2

2016-04-13 22:17:56

by Jon Masters

Subject: Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32

On 04/13/2016 09:29 AM, Sudeep Holla wrote:
>
>
> On 13/04/16 14:21, Bjorn Helgaas wrote:
>> On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote:
>>> Hi,
>>>
>>> (sorry for replying on the old thread, but I found it could be related
>>> to the issue
>>> I have now)
>>>
>>> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <[email protected]>
>>> wrote:
>
> [...]
>
>>>>>
>>>>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>>>>> version to use. Are you running any PCIe traffic test when the error
>>>>> happens?
>>>>
>>>> Nope, the machine was either idle or running a reboot test; no PCIe
>>>> stress
>>>> test or anything.
>>>>
>>>
>>> Was there any conclusion on this ?
>>> I am having similar issue[1] on my Juno with sky2 PCIe driver during
>>> reboot.
>>
>> We found that the unhandled faults occurred when using an extender
>> card. After removing the extender card, we didn't see the faults any
>> more.
>>
>
> Thanks for the response. It's not related then, I saw report referencing
> reboot tests and hence linked them together. Sorry for the noise.

For the record, I've had success with this cable on X-Gene:

http://www.amazon.com/PCI-E-Riser-Flexible-Ribbon-Extension/dp/B00H8VVD00?ie=UTF8&psc=1&redirect=true&ref_=oh_aui_search_detailpage

But it's hit or miss. The only public platform where I've been reliably
able to use an extender cable so far is AMD Seattle. On that platform,
the PCIe IP is so rock solid that I can talk to very funky PCIe IP I've
implemented myself in an FPGA (and I can see link quality is fine too).

There's one other non-public platform so far where PCIe extenders work
without a single hitch as well, and a number where more work is needed.

Jon.

--
Computer Architect