2014-04-09 02:40:55

by Baoquan He

Subject: hpsa driver bug crack kernel down!

Hi,

The kernel is 3.14.0+, which I pulled just now.


[ 18.402695] systemd[1]: Set hostname to
<hp-sl4545g7-01.rhts.eng.bos.redhat.com>.
[ 18.408456] random: systemd urandom read with 70 bits of entropy
available
[ 18md[1]: Expecting device
dev-mapper-rhel_hp\x2d\x2dsl4545g7\x2d\x2d01\x2droot.device...
Expecting device
dev-mapper-rhel_hp\x2d\x2dsl4545g7\...droot.device...
[ 18.860704] systemd[1]: Starting -.slice.
[ OK ] Created slice -.slice.
[ 18.866030] systemd[1]: Created slice -.slice.
[ 18.869466] systemd[1]: Starting System Slice.
[ OK ] Created slice System Sl 18.939116] systemd[1]: Created
slice System Slice.
[ 18.976213] systemd[1]: Starting Slices.
[ OK ] Reached target Slices.
[ 18.981154] systemd[1]: Reached target Slices.
[ 18.984183] systemd[1]: Starting Timers.
[ OK ] Reached target Timers.
[ 18.989161] systemd[1]: Reached target Timers.
[ 18.992004] systemd[1]: Starting Journal Socket.
[ OK ] Listening on Journal Socket.
[ 18.997174] systemd[1]: Listening on Journal Socket.
[ 19.000702] systemd[1]: Starting dracut cmdline hook...
Starting dracut cmdline hook...
[ 19.006697] systemd[1]: Started Load KernModules.
[ 19.110408] systemd[1]: Starting Setup Virtual Console...
Starting Setup Virtual Console...
[ 19.116652] systemd[1]: Starting Journal Service...
Starting Journal Service...
[ OK ] Started Journal Service.
[ 19.127172] systemd[1]: Started Journal Service.
[ OK ] Listening on udev Kernel Socket.
[ 19.141504] systemd-journald[281]: Vac[ OK ] Listening on udev
Control Socket.
[ OK ] Reached target Sockets.
Starting Create list of required static device nodes...rrent
kernel...
Starting Apply Kernel Variables...
[ OK ] Reached target Swap.
[ OK ] Reached target Local File Systems.
[ OK ] Started dracut cmdline hook.
[ OK ] Started Setup Virtual Console.
[ OK ] Started Apply Kernel Variables.
[ OK ] Started Create list of required static device nodes ...current
kernel.
Starting Create static device nodes in /dev...
Starting dracut pre-udev hook...
[ OK ] Started Create static device nodes in /dev.
[ 20.247819] device-mapper: uevent: version 1.0.3
[ 20.251101] device-mapper: ioctl: 4.27.0-ioctl (2013-10-30)
initialised: [email protected]
[ OK ] Started dracut pre-udev hook.
Starting udev Kernel Device Manager...
[ 20.322923] systemd-udevd[335]: starting version 208
[ OK ] Started udev Kernel Device Manager.
Starting udev Coldplug all Devices...
Mounting Configuration File System...
[ OK ] Mounted Configuration File System.
[ OK ] Started udev Coldplug all Devices.
Starting dracut initqueue hook...
[ OK ][1] HP HPSA Driver (v 3.4.4-1)
[ 20.832850] hpsa 0000:05:00.0: can't disable ASPM; OS doesn't have
ASPM control
Reached target System Initialization.
[ 20.875178] ACPI: PCI Interrupt Link [I0C0] enabled at IRQ 36
[ 20.909000] hpsa 0000:05:00.0: MSIX
[ 20.911586] hpsa 0000:05:00.0: Logical aborts not supported
[ 20.916004] [drm] Initialized drm 1.1.0 20060810
[ 20.936139] hpsa 0000:05:00.0: hpsa0: <0x323b> at IRQ 73 using DAC
[ 20.956967] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 20.956997] IP: [<ffffffffa004b97f>] hpsa_enter_performant_mode+0x4ff/0x580 [hpsa]
[ 20.957003] PGD 0
[ 20.957012] Oops: 0002 [#1] SMP
[ 20.957035] Modules linked in: drm(+) libata hpsa(+) i2c_core dm_mirror dm_region_hash dm_log dm_mod
[ 20.957046] CPU: 10 PID: 341 Comm: systemd-udevd Not tainted 3.14.0+ #28
[ 20.957049] Hardware name: HP ProLiant SL4545 G7/, BIOS A31 12/08/2012
[ 20.957055] task: ffff880824191b40 ti: ffff88082309c000 task.ti: ffff88082309c000
[ 20.957078] RIP: 0010:[<ffffffffa004b97f>] [<ffffffffa004b97f>] hpsa_enter_performant_mode+0x4ff/0x580 [hpsa]
[ 20.957083] RSP: 0018:ffff88082309da18 EFLAGS: 00010297
[ 20.957088] RAX: 0000000000000000 RBX: 000000007c000167 RCX: 0000000000000004
[ 20.957091] RDX: 000000000000


2014-04-09 22:50:04

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> Hi,
>
> The kernel is 3.14.0+ which is pulled just now.

Cc'ing more people.

While the hpsa driver appears to be involved in some way, I'm not sure if
this is a related issue, but as of today's pull I'm getting another
problem that causes my DL980 not to come up.

*Massive* amounts of:

DMAR:[fault reason 02] Present bit in context entry is clear
dmar: DRHD: handling fault status reg 602
dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

Then:

hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
...
Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
...

Screenshot of the actual LOCKUP:
http://stgolabs.net/hpsa-hard-lockup-3.14+.png

While I haven't bisected, things worked fine at least until commit
39de65aa2c3e (April 2nd).

Any ideas?

Thanks,
Davidlohr
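
For anyone collecting these faults from a console log, here is a minimal sketch of pulling the fault reason, requesting device and address out of the DMAR lines quoted above. The regular expressions and the `parse_dmar_faults` name are assumptions made for illustration, written against the three sample lines only; they are not a kernel or tool API.

```python
import re

# Sample lines copied from the report above.
LOG = """\
DMAR:[fault reason 02] Present bit in context entry is clear
dmar: DRHD: handling fault status reg 602
dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
"""

# Hypothetical patterns, derived only from the sample lines.
REASON_RE = re.compile(r"DMAR:\[fault reason (\d+)\]\s*(.*)")
REQUEST_RE = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device "
    r"\[([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\] fault addr ([0-9a-f]+)"
)


def parse_dmar_faults(text):
    """Return a dict with the fault reason, requesting device and address."""
    info = {}
    for line in text.splitlines():
        m = REASON_RE.search(line)
        if m:
            info["reason_code"] = int(m.group(1))
            info["reason_text"] = m.group(2).strip()
            continue
        m = REQUEST_RE.search(line)
        if m:
            info["access"] = m.group(1)
            info["device"] = m.group(2)
            info["fault_addr"] = int(m.group(3), 16)
    return info


fault = parse_dmar_faults(LOG)
```

Note that the requesting device in the fault is 02:00.0, while the controller that hpsa reports as locked up is 0000:03:00.0.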

2014-04-09 23:08:29

by James Bottomley

Subject: Re: hpsa driver bug crack kernel down!

[+linux-scsi]
On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > Hi,
> >
> > The kernel is 3.14.0+ which is pulled just now.
>
> Cc'ing more people.
>
> While the hpsa driver appears to be involved in some way, I'm sure if
> this is a related issue, but as of today's pull I'm getting another
> problem that causes my DL980 not to come up.
>
> *Massive* amounts of:
>
> DMAR:[fault reason 02] Present bit in context entry is clear
> dmar: DRHD: handling fault status reg 602
> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>
> Then:
>
> hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> ...
> Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> ...
>
> Screenshot of the actual LOCKUP:
> http://stgolabs.net/hpsa-hard-lockup-3.14+.png
>
> While I haven't bisected, things worked fine until at least until commit
> 39de65aa2c3e (April 2nd).
>
> Any ideas?

Well, it's either a DMA remapping issue or a hpsa one. Your assertion
that everything worked fine until 39de65aa2c3e would tend to vindicate
hpsa, because all the hpsa changes went in before that under

Merge: 3e75c6d b2bff6c
Author: Linus Torvalds <[email protected]>
Date: Tue Apr 1 18:49:04 2014 -0700

Merge tag 'scsi-misc' of
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

can you revalidate that this commit works OK just to make sure?

James

2014-04-09 23:10:47

by James Bottomley

Subject: Re: hpsa driver bug crack kernel down!

On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> [+linux-scsi]
> On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > > Hi,
> > >
> > > The kernel is 3.14.0+ which is pulled just now.
> >
> > Cc'ing more people.
> >
> > While the hpsa driver appears to be involved in some way, I'm sure if
> > this is a related issue, but as of today's pull I'm getting another
> > problem that causes my DL980 not to come up.
> >
> > *Massive* amounts of:
> >
> > DMAR:[fault reason 02] Present bit in context entry is clear
> > dmar: DRHD: handling fault status reg 602
> > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> >
> > Then:
> >
> > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> > ...
> > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > ...
> >
> > Screenshot of the actual LOCKUP:
> > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> >
> > While I haven't bisected, things worked fine until at least until commit
> > 39de65aa2c3e (April 2nd).
> >
> > Any ideas?
>
> Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> that everything worked fine until 39de65aa2c3e would tend to vindicate
> hpsa, because all the hpsa changes went in before that under

Missing crucial info:

commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1

> Merge: 3e75c6d b2bff6c
> Author: Linus Torvalds <[email protected]>
> Date: Tue Apr 1 18:49:04 2014 -0700
>
> Merge tag 'scsi-misc' of
> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>
> can you revalidate that this commit works OK just to make sure?
>
> James


2014-04-09 23:40:22

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> > [+linux-scsi]
> > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > > > Hi,
> > > >
> > > > The kernel is 3.14.0+ which is pulled just now.
> > >
> > > Cc'ing more people.
> > >
> > > While the hpsa driver appears to be involved in some way, I'm sure if
> > > this is a related issue, but as of today's pull I'm getting another
> > > problem that causes my DL980 not to come up.
> > >
> > > *Massive* amounts of:
> > >
> > > DMAR:[fault reason 02] Present bit in context entry is clear
> > > dmar: DRHD: handling fault status reg 602
> > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> > >
> > > Then:
> > >
> > > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> > > ...
> > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > > ...
> > >
> > > Screenshot of the actual LOCKUP:
> > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> > >
> > > While I haven't bisected, things worked fine until at least until commit
> > > 39de65aa2c3e (April 2nd).
> > >
> > > Any ideas?
> >
> > Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> > that everything worked fine until 39de65aa2c3e would tend to vindicate
> > hpsa,

Hmm here you mean DMA, right?

> because all the hpsa changes went in before that under
> Missing crucial info:
>
> commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
>
> > Merge: 3e75c6d b2bff6c
> > Author: Linus Torvalds <[email protected]>
> > Date: Tue Apr 1 18:49:04 2014 -0700
> >
> > Merge tag 'scsi-misc' of
> > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> >
> > can you revalidate that this commit works OK just to make sure?

Ok so I don't see those DMA messages and system starts just fine. I'm
thinking perhaps something broke after the IO mmu stuff in commit
3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
causing the CPU stalls and just blame hpsa in the path as a side effect?

/me goes out to try the commit.

2014-04-09 23:50:31

by James Bottomley

Subject: Re: hpsa driver bug crack kernel down!

On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> > > [+linux-scsi]
> > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > > > > Hi,
> > > > >
> > > > > The kernel is 3.14.0+ which is pulled just now.
> > > >
> > > > Cc'ing more people.
> > > >
> > > > While the hpsa driver appears to be involved in some way, I'm sure if
> > > > this is a related issue, but as of today's pull I'm getting another
> > > > problem that causes my DL980 not to come up.
> > > >
> > > > *Massive* amounts of:
> > > >
> > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > > > dmar: DRHD: handling fault status reg 602
> > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> > > >
> > > > Then:
> > > >
> > > > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> > > > ...
> > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > > > ...
> > > >
> > > > Screenshot of the actual LOCKUP:
> > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> > > >
> > > > While I haven't bisected, things worked fine until at least until commit
> > > > 39de65aa2c3e (April 2nd).
> > > >
> > > > Any ideas?
> > >
> > > Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> > > that everything worked fine until 39de65aa2c3e would tend to vindicate
> > > hpsa,
>
> Hmm here you mean DMA, right?

No, it vindicates the hpsa changes ... they don't seem to be causing
problems until something goes wrong with dma remapping.

> > because all the hpsa changes went in before that under
> > Missing crucial info:
> >
> > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> >
> > > Merge: 3e75c6d b2bff6c
> > > Author: Linus Torvalds <[email protected]>
> > > Date: Tue Apr 1 18:49:04 2014 -0700
> > >
> > > Merge tag 'scsi-misc' of
> > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > >
> > > can you revalidate that this commit works OK just to make sure?
>
> Ok so I don't see those DMA messages and system starts just fine. I'm
> thinking perhaps something broke after the IO mmu stuff in commit
> 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> causing the CPU stalls and just blame hpsa in the path as a side effect?
>
> /me goes out to try the commit.

That's my guess. The DMAR messages are DMA remapping issues caused in
the IOMMU. If I had to guess, I'd say the DMAR fault message is
indicating the IOMMU is calling for a mapping address before it can
satisfy the driver read request, which is causing the hang apparently in
the hpsa driver.

I've added linux-pci to the cc; I think they deal with iommu issues on
x86.

James

2014-04-10 00:19:44

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> > On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> > > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> > > > [+linux-scsi]
> > > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > > > > > Hi,
> > > > > >
> > > > > > The kernel is 3.14.0+ which is pulled just now.
> > > > >
> > > > > Cc'ing more people.
> > > > >
> > > > > While the hpsa driver appears to be involved in some way, I'm sure if
> > > > > this is a related issue, but as of today's pull I'm getting another
> > > > > problem that causes my DL980 not to come up.
> > > > >
> > > > > *Massive* amounts of:
> > > > >
> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > > > > dmar: DRHD: handling fault status reg 602
> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> > > > >
> > > > > Then:
> > > > >
> > > > > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> > > > > ...
> > > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > > > > ...
> > > > >
> > > > > Screenshot of the actual LOCKUP:
> > > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> > > > >
> > > > > While I haven't bisected, things worked fine until at least until commit
> > > > > 39de65aa2c3e (April 2nd).
> > > > >
> > > > > Any ideas?
> > > >
> > > > Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> > > > that everything worked fine until 39de65aa2c3e would tend to vindicate
> > > > hpsa,
> >
> > Hmm here you mean DMA, right?
>
> No, it vindicates the hpsa changes ... they don't seem to be causing
> problems until something goes wrong with dma remapping.
>
> > > because all the hpsa changes went in before that under
> > > Missing crucial info:
> > >
> > > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> > >
> > > > Merge: 3e75c6d b2bff6c
> > > > Author: Linus Torvalds <[email protected]>
> > > > Date: Tue Apr 1 18:49:04 2014 -0700
> > > >
> > > > Merge tag 'scsi-misc' of
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > > >
> > > > can you revalidate that this commit works OK just to make sure?
> >
> > Ok so I don't see those DMA messages and system starts just fine. I'm
> > thinking perhaps something broke after the IO mmu stuff in commit
> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> > causing the CPU stalls and just blame hpsa in the path as a side effect?
> >
> > /me goes out to try the commit.
>
> That's my guess. The DMAR messages are DMA remapping issues caused in
> the IOMMU. If I had to guess, I'd say the DMAR fault message is
> indicating the IOMMU is calling for a mapping address before it can
> satisfy the driver read request, which is causing the hang apparently in
> the hpsa driver.
>
> I've added linux-pci to the cc; I think they deal with iommu issues on
> x86.

So that merge commit appears to be the culprit, I see both the DMA
messages and the lockup blaming hpsa...

2014-04-10 04:04:03

by Bjorn Helgaas

Subject: Re: hpsa driver bug crack kernel down!

[+cc Joerg, iommu list]

On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso <[email protected]> wrote:
> On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
>> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
>> > On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
>> > > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
>> > > > [+linux-scsi]
>> > > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
>> > > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
>> > > > > > Hi,
>> > > > > >
>> > > > > > The kernel is 3.14.0+ which is pulled just now.
>> > > > >
>> > > > > Cc'ing more people.
>> > > > >
>> > > > > While the hpsa driver appears to be involved in some way, I'm sure if
>> > > > > this is a related issue, but as of today's pull I'm getting another
>> > > > > problem that causes my DL980 not to come up.
>> > > > >
>> > > > > *Massive* amounts of:
>> > > > >
>> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
>> > > > > dmar: DRHD: handling fault status reg 602
>> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>> > > > >
>> > > > > Then:
>> > > > >
>> > > > > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
>> > > > > ...
>> > > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
>> > > > > ...
>> > > > >
>> > > > > Screenshot of the actual LOCKUP:
>> > > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
>> > > > >
>> > > > > While I haven't bisected, things worked fine until at least until commit
>> > > > > 39de65aa2c3e (April 2nd).
>> > > > >
>> > > > > Any ideas?
>> > > >
>> > > > Well, it's either a DMA remapping issue or a hpsa one. Your assertion
>> > > > that everything worked fine until 39de65aa2c3e would tend to vindicate
>> > > > hpsa,
>> >
>> > Hmm here you mean DMA, right?
>>
>> No, it vindicates the hpsa changes ... they don't seem to be causing
>> problems until something goes wrong with dma remapping.
>>
>> > > because all the hpsa changes went in before that under
>> > > Missing crucial info:
>> > >
>> > > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
>> > >
>> > > > Merge: 3e75c6d b2bff6c
>> > > > Author: Linus Torvalds <[email protected]>
>> > > > Date: Tue Apr 1 18:49:04 2014 -0700
>> > > >
>> > > > Merge tag 'scsi-misc' of
>> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>> > > >
>> > > > can you revalidate that this commit works OK just to make sure?
>> >
>> > Ok so I don't see those DMA messages and system starts just fine. I'm
>> > thinking perhaps something broke after the IO mmu stuff in commit
>> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
>> > causing the CPU stalls and just blame hpsa in the path as a side effect?
>> >
>> > /me goes out to try the commit.
>>
>> That's my guess. The DMAR messages are DMA remapping issues caused in
>> the IOMMU. If I had to guess, I'd say the DMAR fault message is
>> indicating the IOMMU is calling for a mapping address before it can
>> satisfy the driver read request, which is causing the hang apparently in
>> the hpsa driver.
>>
>> I've added linux-pci to the cc; I think they deal with iommu issues on
>> x86.
>
> So that merge commit appears to be the culprit, I see both the DMA
> messages and the lockup blaming hpsa...

My understanding so far (please correct me if I'm wrong):

39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
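
The narrowing above is a bisection over the history. A minimal sketch of the idea, assuming a linear ordering of the three commits (the ordering below follows the dates mentioned in the thread and is an assumption, as is the `first_bad` helper; real `git bisect` works on the full commit graph):

```python
def first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Assumed ordering, oldest first; test results as reported in the thread.
history = ["1a0b6abaea78", "39de65aa2c3e", "3f583bc21977"]
results = {"1a0b6abaea78": False, "39de65aa2c3e": False, "3f583bc21977": True}

culprit = first_bad(history, lambda commit: results[commit])
```

With only merge commits tested, this can point at a merge at best; finding the individual change still requires bisecting inside the merged branch.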

2014-04-10 06:32:41

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> [+cc Joerg, iommu list]
>
> On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso <[email protected]> wrote:
> > On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> >> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> >> > On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> >> > > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> >> > > > [+linux-scsi]
> >> > > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> >> > > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> >> > > > > > Hi,
> >> > > > > >
> >> > > > > > The kernel is 3.14.0+ which is pulled just now.
> >> > > > >
> >> > > > > Cc'ing more people.
> >> > > > >
> >> > > > > While the hpsa driver appears to be involved in some way, I'm sure if
> >> > > > > this is a related issue, but as of today's pull I'm getting another
> >> > > > > problem that causes my DL980 not to come up.
> >> > > > >
> >> > > > > *Massive* amounts of:
> >> > > > >
> >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> >> > > > > dmar: DRHD: handling fault status reg 602
> >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> >> > > > >
> >> > > > > Then:
> >> > > > >
> >> > > > > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> >> > > > > ...
> >> > > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> >> > > > > ...
> >> > > > >
> >> > > > > Screenshot of the actual LOCKUP:
> >> > > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> >> > > > >
> >> > > > > While I haven't bisected, things worked fine until at least until commit
> >> > > > > 39de65aa2c3e (April 2nd).
> >> > > > >
> >> > > > > Any ideas?
> >> > > >
> >> > > > Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> >> > > > that everything worked fine until 39de65aa2c3e would tend to vindicate
> >> > > > hpsa,
> >> >
> >> > Hmm here you mean DMA, right?
> >>
> >> No, it vindicates the hpsa changes ... they don't seem to be causing
> >> problems until something goes wrong with dma remapping.
> >>
> >> > > because all the hpsa changes went in before that under
> >> > > Missing crucial info:
> >> > >
> >> > > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> >> > >
> >> > > > Merge: 3e75c6d b2bff6c
> >> > > > Author: Linus Torvalds <[email protected]>
> >> > > > Date: Tue Apr 1 18:49:04 2014 -0700
> >> > > >
> >> > > > Merge tag 'scsi-misc' of
> >> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> >> > > >
> >> > > > can you revalidate that this commit works OK just to make sure?
> >> >
> >> > Ok so I don't see those DMA messages and system starts just fine. I'm
> >> > thinking perhaps something broke after the IO mmu stuff in commit
> >> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> >> > causing the CPU stalls and just blame hpsa in the path as a side effect?
> >> >
> >> > /me goes out to try the commit.
> >>
> >> That's my guess. The DMAR messages are DMA remapping issues caused in
> >> the IOMMU. If I had to guess, I'd say the DMAR fault message is
> >> indicating the IOMMU is calling for a mapping address before it can
> >> satisfy the driver read request, which is causing the hang apparently in
> >> the hpsa driver.
> >>
> >> I've added linux-pci to the cc; I think they deal with iommu issues on
> >> x86.
> >
> > So that merge commit appears to be the culprit, I see both the DMA
> > messages and the lockup blaming hpsa...
>
> My understanding so far (please correct me if I'm wrong):
>
> 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")

Yes, specifically (finally done bisecting):

commit 2e45528930388658603ea24d49cf52867b928d3e
Author: Jiang Liu <[email protected]>
Date: Wed Feb 19 14:07:36 2014 +0800

iommu/vt-d: Unify the way to process DMAR device scope array

Now we have a PCI bus notification based mechanism to update DMAR
device scope array, we could extend the mechanism to support boot
time initialization too, which will help to unify and simplify
the implementation.

Signed-off-by: Jiang Liu <[email protected]>
Signed-off-by: Joerg Roedel <[email protected]>

2014-04-10 07:15:44

by Joerg Roedel

Subject: Re: hpsa driver bug crack kernel down!

[+ David, VT-d maintainer ]

Jiang, David, can you please have a look into this issue?

Thanks,

Joerg

On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> > [+cc Joerg, iommu list]
> >
> > On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso <[email protected]> wrote:
> > > On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> > >> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> > >> > On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> > >> > > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> > >> > > > [+linux-scsi]
> > >> > > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > >> > > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > The kernel is 3.14.0+ which is pulled just now.
> > >> > > > >
> > >> > > > > Cc'ing more people.
> > >> > > > >
> > >> > > > > While the hpsa driver appears to be involved in some way, I'm sure if
> > >> > > > > this is a related issue, but as of today's pull I'm getting another
> > >> > > > > problem that causes my DL980 not to come up.
> > >> > > > >
> > >> > > > > *Massive* amounts of:
> > >> > > > >
> > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > >> > > > > dmar: DRHD: handling fault status reg 602
> > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> > >> > > > >
> > >> > > > > Then:
> > >> > > > >
> > >> > > > > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> > >> > > > > ...
> > >> > > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > >> > > > > ...
> > >> > > > >
> > >> > > > > Screenshot of the actual LOCKUP:
> > >> > > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> > >> > > > >
> > >> > > > > While I haven't bisected, things worked fine until at least until commit
> > >> > > > > 39de65aa2c3e (April 2nd).
> > >> > > > >
> > >> > > > > Any ideas?
> > >> > > >
> > >> > > > Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> > >> > > > that everything worked fine until 39de65aa2c3e would tend to vindicate
> > >> > > > hpsa,
> > >> >
> > >> > Hmm here you mean DMA, right?
> > >>
> > >> No, it vindicates the hpsa changes ... they don't seem to be causing
> > >> problems until something goes wrong with dma remapping.
> > >>
> > >> > > because all the hpsa changes went in before that under
> > >> > > Missing crucial info:
> > >> > >
> > >> > > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> > >> > >
> > >> > > > Merge: 3e75c6d b2bff6c
> > >> > > > Author: Linus Torvalds <[email protected]>
> > >> > > > Date: Tue Apr 1 18:49:04 2014 -0700
> > >> > > >
> > >> > > > Merge tag 'scsi-misc' of
> > >> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > >> > > >
> > >> > > > can you revalidate that this commit works OK just to make sure?
> > >> >
> > >> > Ok so I don't see those DMA messages and system starts just fine. I'm
> > >> > thinking perhaps something broke after the IO mmu stuff in commit
> > >> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> > >> > causing the CPU stalls and just blame hpsa in the path as a side effect?
> > >> >
> > >> > /me goes out to try the commit.
> > >>
> > >> That's my guess. The DMAR messages are DMA remapping issues caused in
> > >> the IOMMU. If I had to guess, I'd say the DMAR fault message is
> > >> indicating the IOMMU is calling for a mapping address before it can
> > >> satisfy the driver read request, which is causing the hang apparently in
> > >> the hpsa driver.
> > >>
> > >> I've added linux-pci to the cc; I think they deal with iommu issues on
> > >> x86.
> > >
> > > So that merge commit appears to be the culprit, I see both the DMA
> > > messages and the lockup blaming hpsa...
> >
> > My understanding so far (please correct me if I'm wrong):
> >
> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
>
> Yes, specifically (finally done bisecting):
>
> commit 2e45528930388658603ea24d49cf52867b928d3e
> Author: Jiang Liu <[email protected]>
> Date: Wed Feb 19 14:07:36 2014 +0800
>
> iommu/vt-d: Unify the way to process DMAR device scope array
>
> Now we have a PCI bus notification based mechanism to update DMAR
> device scope array, we could extend the mechanism to support boot
> time initialization too, which will help to unify and simplify
> the implementation.
>
> Signed-off-by: Jiang Liu <[email protected]>
> Signed-off-by: Joerg Roedel <[email protected]>
>

2014-04-10 08:34:19

by Jiang Liu

Subject: Re: hpsa driver bug crack kernel down!

Hi Baoquan,
Could you please provide the output of "lspci -vvvv"?
Is device "hpsa 0000:03:00.0" a legacy PCI device (non-PCIe)?
It may be related to the IOMMU driver.
Thanks!
Gerry

On 2014/4/10 12:03, Bjorn Helgaas wrote:
> [+cc Joerg, iommu list]
>
> On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso <[email protected]> wrote:
>> On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
>>> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
>>>> On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
>>>>> On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
>>>>>> [+linux-scsi]
>>>>>> On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
>>>>>>> On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> The kernel is 3.14.0+ which is pulled just now.
>>>>>>>
>>>>>>> Cc'ing more people.
>>>>>>>
>>>>>>> While the hpsa driver appears to be involved in some way, I'm sure if
>>>>>>> this is a related issue, but as of today's pull I'm getting another
>>>>>>> problem that causes my DL980 not to come up.
>>>>>>>
>>>>>>> *Massive* amounts of:
>>>>>>>
>>>>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>>> dmar: DRHD: handling fault status reg 602
>>>>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>>>>>>
>>>>>>> Then:
>>>>>>>
>>>>>>> hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
>>>>>>> ...
>>>>>>> Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
>>>>>>> ...
>>>>>>>
>>>>>>> Screenshot of the actual LOCKUP:
>>>>>>> http://stgolabs.net/hpsa-hard-lockup-3.14+.png
>>>>>>>
>>>>>>> While I haven't bisected, things worked fine until at least until commit
>>>>>>> 39de65aa2c3e (April 2nd).
>>>>>>>
>>>>>>> Any ideas?
>>>>>>
>>>>>> Well, it's either a DMA remapping issue or a hpsa one. Your assertion
>>>>>> that everything worked fine until 39de65aa2c3e would tend to vindicate
>>>>>> hpsa,
>>>>
>>>> Hmm here you mean DMA, right?
>>>
>>> No, it vindicates the hpsa changes ... they don't seem to be causing
>>> problems until something goes wrong with dma remapping.
>>>
>>>>> because all the hpsa changes went in before that under
>>>>> Missing crucial info:
>>>>>
>>>>> commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
>>>>>
>>>>>> Merge: 3e75c6d b2bff6c
>>>>>> Author: Linus Torvalds <[email protected]>
>>>>>> Date: Tue Apr 1 18:49:04 2014 -0700
>>>>>>
>>>>>> Merge tag 'scsi-misc' of
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
>>>>>>
>>>>>> can you revalidate that this commit works OK just to make sure?
>>>>
>>>> Ok so I don't see those DMA messages and system starts just fine. I'm
>>>> thinking perhaps something broke after the IO mmu stuff in commit
>>>> 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
>>>> causing the CPU stalls and just blame hpsa in the path as a side effect?
>>>>
>>>> /me goes out to try the commit.
>>>
>>> That's my guess. The DMAR messages are DMA remapping issues caused in
>>> the IOMMU. If I had to guess, I'd say the DMAR fault message is
>>> indicating the IOMMU is calling for a mapping address before it can
>>> satisfy the driver read request, which is causing the hang apparently in
>>> the hpsa driver.
>>>
>>> I've added linux-pci to the cc; I think they deal with iommu issues on
>>> x86.
>>
>> So that merge commit appears to be the culprit, I see both the DMA
>> messages and the lockup blaming hpsa...
>
> My understanding so far (please correct me if I'm wrong):
>
> 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")

2014-04-10 08:46:57

by Woodhouse, David

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
> [+ David, VT-d maintainer ]
>
> Jiang, David, can you please have a look into this issue?
>

> > > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > > >> > > > > dmar: DRHD: handling fault status reg 602
> > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000

That "Present bit in context entry is clear" fault means that we have
not set up *any* mappings for this PCI device… on this IOMMU.

> > Yes, specifically (finally done bisecting):
> >
> > commit 2e45528930388658603ea24d49cf52867b928d3e
> > Author: Jiang Liu <[email protected]>
> > Date: Wed Feb 19 14:07:36 2014 +0800
> >
> > iommu/vt-d: Unify the way to process DMAR device scope array

This commit is about how we decide which IOMMU a given PCI device is
attached to.

Thus, my first guess would be that we are quite happily setting up the
requested DMA maps on the *wrong* IOMMU, and then taking faults when the
device actually tries to do DMA.

However, I'm not 100% convinced of that. The fault address looks
suspiciously like a true physical address, not a virtual bus address of
the type that we'd normally allocate for a dma_map_* operation. Those
would start at 0xfffff000 and work downwards, typically.

Do you have 'iommu=pt' on the kernel command line? Can I see the full
dmesg as this system boots, and also a copy of the DMAR table?


We should also rate-limit DMA faults, which would avoid the lockup
failure mode. Bjorn, what should an IOMMU driver *do* when it detects
that a device is creating an endless stream of DMA faults and isn't
aborting the transaction?

I can set it to silent so that it just stops *reporting* the DMA faults
for that device... and I suppose I can re-enable them when I next see a
DMA mapping for it (although actually it'd be better to have a hook to
do that on FLR or something like that). But there must be a better
answer than that, surely? And I don't want to hack it up locally in
*one* specific IOMMU driver, any more than I have to.

On a POWER system with EEH, the kernel would end up isolating the
offending device completely, and subsequently resetting it...
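
To make the rate-limiting idea above concrete, here is a minimal sketch in Python rather than kernel C. It is not the VT-d driver's code: the class name `FaultRateLimiter`, the `interval` and `burst` parameters, and the per-device keying are all assumptions chosen for illustration. It reports at most `burst` faults per device in each time window and only counts the rest.

```python
from collections import defaultdict


class FaultRateLimiter:
    """Allow at most `burst` fault reports per device per `interval` seconds.

    Hypothetical illustration only; names and defaults are assumptions.
    """

    def __init__(self, interval=5.0, burst=10):
        self.interval = interval
        self.burst = burst
        # Per-device state: [window_start, reports_in_window, suppressed_count]
        self.state = defaultdict(lambda: [None, 0, 0])

    def should_report(self, device, now):
        st = self.state[device]
        if st[0] is None or now - st[0] >= self.interval:
            # A new window begins: reset the counters for this device.
            st[0], st[1], st[2] = now, 0, 0
        if st[1] < self.burst:
            st[1] += 1
            return True
        # Over budget for this window: count the fault but do not report it.
        st[2] += 1
        return False

    def suppressed(self, device):
        """Number of faults dropped for `device` in the current window."""
        return self.state[device][2]


# 1000 faults from the same device within one second, burst of 3.
limiter = FaultRateLimiter(interval=5.0, burst=3)
reported = [limiter.should_report("02:00.0", now=t * 0.001) for t in range(1000)]
```

Something like this only quiets the log. It does not stop the device from faulting, which is why the question of what to do with the device itself is still open.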

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation



2014-04-10 15:15:24

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David
<[email protected]> wrote:

>> > > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
>> > > >> > > > > dmar: DRHD: handling fault status reg 602
>> > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>
> That "Present bit in context entry is clear" fault means that we have
> not set up *any* mappings for this PCI device… on this IOMMU.
>
>> > Yes, specifically (finally done bisecting):
>> >
>> > commit 2e45528930388658603ea24d49cf52867b928d3e
>> > Author: Jiang Liu <[email protected]>
>> > Date: Wed Feb 19 14:07:36 2014 +0800
>> >
>> > iommu/vt-d: Unify the way to process DMAR device scope array
>
> This commit is about how we decide which IOMMU a given PCI device is
> attached to.
>
> Thus, my first guess would be that we are quite happily setting up the
> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> device actually tries to do DMA.
>
> However, I'm not 100% convinced of that. The fault address looks
> suspiciously like a true physical address, not a virtual bus address of
> the type that we'd normally allocate for a dma_map_* operation. Those
> would start at 0xfffff000 and work downwards, typically.

I like the "wrong IOMMU (or no IOMMU at all)" theory. If we didn't
connect the device with an IOMMU at all, that would explain the device
DMAing directly to a physical address, wouldn't it?

> Do you have 'iommu=pt' on the kernel command line? Can I see the full
> dmesg as this system boots, and also a copy of the DMAR table?
>
> We should also rate-limit DMA faults, which would avoid the lockup
> failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> that a device is creating an endless stream of DMA faults and isn't
> aborting the transaction?

You mentioned that POWER with EEH does something intelligent in this
case, but I'm not familiar with that code. We have AER support, which
can result in resetting a device, but I think DMA faults are reported
differently, and I don't think there's any nice existing way for PCI
to deal with them. Maybe there should be, though.

Bjorn

2014-04-10 15:36:56

by Linda Knippers

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On 4/10/2014 11:14 AM, Bjorn Helgaas wrote:
> On Thu, Apr 10, 2014 at 2:46 AM, Woodhouse, David
> <[email protected]> wrote:
>
>>>>>>>>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>>>>>>> dmar: DRHD: handling fault status reg 602
>>>>>>>>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>
>> That "Present bit in context entry is clear" fault means that we have
>> not set up *any* mappings for this PCI device… on this IOMMU.
>>
>>>> Yes, specifically (finally done bisecting):
>>>>
>>>> commit 2e45528930388658603ea24d49cf52867b928d3e
>>>> Author: Jiang Liu <[email protected]>
>>>> Date: Wed Feb 19 14:07:36 2014 +0800
>>>>
>>>> iommu/vt-d: Unify the way to process DMAR device scope array
>>
>> This commit is about how we decide which IOMMU a given PCI device is
>> attached to.
>>
>> Thus, my first guess would be that we are quite happily setting up the
>> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
>> device actually tries to do DMA.
>>
>> However, I'm not 100% convinced of that. The fault address looks
>> suspiciously like a true physical address, not a virtual bus address of
>> the type that we'd normally allocate for a dma_map_* operation. Those
>> would start at 0xfffff000 and work downwards, typically.
>
> I like the "wrong IOMMU (or no IOMMU at all)" theory. If we didn't
> connect the device with an IOMMU at all, that would explain the device
> DMAing directly to a physical address, wouldn't it?
>
>> Do you have 'iommu=pt' on the kernel command line? Can I see the full
>> dmesg as this system boots, and also a copy of the DMAR table?

This will be really helpful information. This box has devices with
RMRR records and if they're not set up correctly, DMAR faults can occur.

>>
>> We should also rate-limit DMA faults, which would avoid the lockup
>> failure mode. Bjorn, what should an IOMMU driver *do* when it detects
>> that a device is creating an endless stream of DMA faults and isn't
>> aborting the transaction?
>
> You mentioned that POWER with EEH does something intelligent in this
> case, but I'm not familiar with that code. We have AER support, which
> can result in resetting a device, but I think DMA faults are reported
> differently, and I don't think there's any nice existing way for PCI
> to deal with them. Maybe there should be, though.
>
> Bjorn

2014-04-10 15:37:55

by Woodhouse, David

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 09:14 -0600, Bjorn Helgaas wrote:
> > Thus, my first guess would be that we are quite happily setting up the
> > requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> > device actually tries to do DMA.
> >
> I like the "wrong IOMMU (or no IOMMU at all)" theory. If we didn't
> connect the device with an IOMMU at all, that would explain the device
> DMAing directly to a physical address, wouldn't it?

An unlikely failure mode. We're much more likely to see *wrong* IOMMU
than no IOMMU. And thus we'd still see the distinctive virtual addresses
just below 4GiB.

However, Rob's answer may solve that puzzle. If this is one of those
abominations where the device continues to do DMA to system memory even
after the OS is up and running and *thinks* it has control of the
hardware, then the offending address will be listed in an RMRR entry
(which tells the OS to set up a 1:1 mapping for access to certain memory
ranges for a given device). And will be inside an E820 reserved region.

A little odd that such an error would trigger only when we're actually
trying to initialise the device from the Linux driver, not as soon as we
enable the IOMMU. But all things are possible.

But the DMAR table and dmesg that I asked for would give us a bit more
information and hopefully let us stop speculating...

> > We should also rate-limit DMA faults, which would avoid the lockup
> > failure mode. Bjorn, what should an IOMMU driver *do* when it detects
> > that a device is creating an endless stream of DMA faults and isn't
> > aborting the transaction?
>
> You mentioned that POWER with EEH does something intelligent in this
> case, but I'm not familiar with that code. We have AER support, which
> can result in resetting a device, but I think DMA faults are reported
> differently, and I don't think there's any nice existing way for PCI
> to deal with them. Maybe there should be, though.

Quite frankly, I don't care how *you* deal with them, or even if you
can. All I want to know is how I tell you about the problem, because *I*
sure as hell don't want to be trying to deal with it in the IOMMU code.
That's a generic PCI layer thing. :)

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation



2014-04-10 15:43:31

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Tue, Apr 8, 2014 at 8:39 PM, Baoquan He <[email protected]> wrote:
> Hi,
>
> The kernel is 3.14.0+ which is pulled just now.
>
>
> [ 18.402695] systemd[1]: Set hostname to
> <hp-sl4545g7-01.rhts.eng.bos.redhat.com>.
> [ 18.408456] random: systemd urandom read with 70 bits of entropy
> available
> [ 18md[1]: Expecting device
> dev-mapper-rhel_hp\x2d\x2dsl4545g7\x2d\x2d01\x2droot.device...
> Expecting device
> dev-mapper-rhel_hp\x2d\x2dsl4545g7\...droot.device...
> [ 18.860704] systemd[1]: Starting -.slice.
> [ OK ] Created slice -.slice.
> [ 18.866030] systemd[1]: Created slice -.slice.
> [ 18.869466] systemd[1]: Starting System Slice.
> [ OK ] Created slice System Sl 18.939116] systemd[1]: Created
> slice System Slice.
> [ 18.976213] systemd[1]: Starting Slices.
> [ OK ] Reached target Slices.
> [ 18.981154] systemd[1]: Reached target Slices.
> [ 18.984183] systemd[1]: Starting Timers.
> [ OK ] Reached target Timers.
> [ 18.989161] systemd[1]: Reached target Timers.
> [ 18.992004] systemd[1]: Starting Journal Socket.
> [ OK ] Listening on Journal Socket.
> [ 18.997174] systemd[1]: Listening on Journal Socket.
> [ 19.000702] systemd[1]: Starting dracut cmdline hook...
> Starting dracut cmdline hook...
> [ 19.006697] systemd[1]: Started Load KernModules.
> [ 19.110408] systemd[1]: Starting Setup Virtual Console...
> Starting Setup Virtual Console...
> [ 19.116652] systemd[1]: Starting Journal Service...
> Starting Journal Service...
> [ OK ] Started Journal Service.
> [ 19.127172] systemd[1]: Started Journal Service.
> [ OK ] Listening on udev Kernel Socket.
> [ 19.141504] systemd-journald[281]: Vac[ OK ] Listening on udev
> Control Socket.
> [ OK ] Reached target Sockets.
> Starting Create list of required static device nodes...rrent
> kernel...
> Starting Apply Kernel Variables...
> [ OK ] Reached target Swap.
> [ OK ] Reached target Local File Systems.
> [ OK ] Started dracut cmdline hook.
> [ OK ] Started Setup Virtual Console.
> [ OK ] Started Apply Kernel Variables.
> [ OK ] Started Create list of required static device nodes ...current
> kernel.
> Starting Create static device nodes in /dev...
> Starting dracut pre-udev hook...
> [ OK ] Started Create static device nodes in /dev.
> [ 20.247819] device-mapper: uevent: version 1.0.3
> [ 20.251101] device-mapper: ioctl: 4.27.0-ioctl (2013-10-30)
> initialised: [email protected]
> [ OK ] Started dracut pre-udev hook.
> Starting udev Kernel Device Manager...
> [ 20.322923] systemd-udevd[335]: starting version 208
> [ OK ] Started udev Kernel Device Manager.
> Starting udev Coldplug all Devices...
> Mounting Configuration File System...
> [ OK ] Mounted Configuration File System.
> [ OK ] Started udev Coldplug all Devices.
> Starting dracut initqueue hook...
> [ OK ][1] HP HPSA Driver (v 3.4.4-1)
> [ 20.832850] hpsa 0000:05:00.0: can't disable ASPM; OS doesn't have
> ASPM control
> Reached target System Initialization.
> [ 20.875178] ACPI: PCI Interrupt Link [I0C0] enabled at IRQ 36
> [ 20.909000] hpsa 0000:05:00.0: MSIX
> [ 20.911586] hpsa 0000:05:00.0: Logical aborts not supported
> [ 20.916004] [drm] Initialized drm 1.1.0 20060810
> [ 20.936139] hpsa 0000:05:00.0: hpsa0: <0x323b> at IRQ 73 using DAC
> [ 20.956967] BUG: unable to handle kernel NULL pointer dereference at
> (null)
> [ 20.956997] IP: [<ffffffffa004b97f>]
> hpsa_enter_performant_mode+0x4ff/0x580 [hpsa]
> [ 20.957003] PGD 0
> [ 20.957012] Oops: 0002 [#1] SMP
> [ 20.957035] Modules linked in: drm(+) libata hpsa(+) i2c_core
> dm_mirror dm_region_hash dm_log dm_mod
> [ 20.957046] CPU: 10 PID: 341 Comm: systemd-udevd Not tainted 3.14.0+
> #28
> [ 20.957049] Hardware name: HP ProLiant SL4545 G7/, BIOS A31
> 12/08/2012
> [ 20.957055] task: ffff880824191b40 ti: ffff88082309c000 task.ti:
> ffff88082309c000
> [ 20.957078] RIP: 0010:[<ffffffffa004b97f>] [<ffffffffa004b97f>]
> hpsa_enter_performant_mode+0x4ff/0x580 [hpsa]
> [ 20.957083] RSP: 0018:ffff88082309da18 EFLAGS: 00010297
> [ 20.957088] RAX: 0000000000000000 RBX: 000000007c000167 RCX:
> 0000000000000004
> [ 20.957091] RDX: 000000000000

What happened with this original report? This looks like a different
problem than the DMA fault reported by Davidlohr. I'd start by
disassembling the hpsa module and matching the IP to a line.
Documentation/oops-tracing.txt might have useful tips on how to do
that.
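
As a rough sketch of the first step suggested above, the snippet below pulls the symbol, offset and function size out of an oops line and builds the matching gdb `list` expression. The helper names `parse_oops_symbol` and `gdb_list_command` are hypothetical, and the resulting command is only useful if gdb is run on an `hpsa.ko` built with debug info, which is an assumption here.

```python
import re

# Matches the "symbol+0xoffset/0xsize [module]" form used in oops output.
SYMBOL_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+(?P<offset>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)


def parse_oops_symbol(text):
    """Return symbol, offset, size and module from an oops line, or None."""
    m = SYMBOL_RE.search(text)
    if m is None:
        return None
    return {
        "symbol": m.group("symbol"),
        "offset": int(m.group("offset"), 16),
        "size": int(m.group("size"), 16),
        "module": m.group("module"),
    }


def gdb_list_command(info):
    """Build a gdb expression that lists the source near the faulting IP."""
    return "list *({}+{:#x})".format(info["symbol"], info["offset"])


# The IP line from the original report.
line = "IP: [<ffffffffa004b97f>] hpsa_enter_performant_mode+0x4ff/0x580 [hpsa]"
info = parse_oops_symbol(line)
command = gdb_list_command(info)
```

Pasting the resulting expression into a gdb session opened on the module should show the source line around the faulting instruction, provided the module binary matches the kernel that crashed.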

2014-04-10 15:54:25

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 16:34 +0800, Jiang Liu wrote:
> Hi Baoquan,
> Could you please help to give output of "lspci -vvvv"?

Attached.

> Is device "hpsa 0000:03:00.0" a legacy PCI device (non-PCIe)?
> It may be related to the IOMMU driver.

I honestly don't know. PCI is way out of my area of knowledge.


Attachments:
lspci.txt (41.95 kB)

2014-04-10 16:02:40

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

[+cc Steve and iss_storagedev, remove "storagedev" which bounced
(apparent typo)]

On Thu, Apr 10, 2014 at 9:43 AM, Bjorn Helgaas <[email protected]> wrote:
> On Tue, Apr 8, 2014 at 8:39 PM, Baoquan He <[email protected]> wrote:
>> Hi,
>>
>> The kernel is 3.14.0+ which is pulled just now.
>>
>>
>> [ OK ][1] HP HPSA Driver (v 3.4.4-1)
>> [ 20.832850] hpsa 0000:05:00.0: can't disable ASPM; OS doesn't have
>> ASPM control
>> Reached target System Initialization.
>> [ 20.875178] ACPI: PCI Interrupt Link [I0C0] enabled at IRQ 36
>> [ 20.909000] hpsa 0000:05:00.0: MSIX
>> [ 20.911586] hpsa 0000:05:00.0: Logical aborts not supported
>> [ 20.916004] [drm] Initialized drm 1.1.0 20060810
>> [ 20.936139] hpsa 0000:05:00.0: hpsa0: <0x323b> at IRQ 73 using DAC
>> [ 20.956967] BUG: unable to handle kernel NULL pointer dereference at
>> (null)
>> [ 20.956997] IP: [<ffffffffa004b97f>]
>> hpsa_enter_performant_mode+0x4ff/0x580 [hpsa]
>> [ 20.957003] PGD 0
>> [ 20.957012] Oops: 0002 [#1] SMP
>> [ 20.957035] Modules linked in: drm(+) libata hpsa(+) i2c_core
>> dm_mirror dm_region_hash dm_log dm_mod
>> [ 20.957046] CPU: 10 PID: 341 Comm: systemd-udevd Not tainted 3.14.0+
>> #28
>> [ 20.957049] Hardware name: HP ProLiant SL4545 G7/, BIOS A31
>> 12/08/2012
>> [ 20.957055] task: ffff880824191b40 ti: ffff88082309c000 task.ti:
>> ffff88082309c000
>> [ 20.957078] RIP: 0010:[<ffffffffa004b97f>] [<ffffffffa004b97f>]
>> hpsa_enter_performant_mode+0x4ff/0x580 [hpsa]
>> [ 20.957083] RSP: 0018:ffff88082309da18 EFLAGS: 00010297
>> [ 20.957088] RAX: 0000000000000000 RBX: 000000007c000167 RCX:
>> 0000000000000004
>> [ 20.957091] RDX: 000000000000
>
> What happened with this original report? This looks like a different
> problem than the DMA fault reported by Davidlohr. I'd start by
> disassembling the hpsa module and matching the IP to a line.
> Documentation/oops-tracing.txt might have useful tips on how to do
> that.

2014-04-10 16:05:25

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 16:34 +0800, Jiang Liu wrote:
> Hi Baoquan,
> Could you please help to give output of "lspci -vvvv"?

Reran as root, attached again.


Attachments:
lspci2.txt (144.16 kB)

2014-04-10 16:19:56

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 08:46 +0000, Woodhouse, David wrote:
> On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
> > [+ David, VT-d maintainer ]
> >
> > Jiang, David, can you please have a look into this issue?
> >
>
> > > > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > > > >> > > > > dmar: DRHD: handling fault status reg 602
> > > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>
> That "Present bit in context entry is clear" fault means that we have
> not set up *any* mappings for this PCI device… on this IOMMU.
>
> > > Yes, specifically (finally done bisecting):
> > >
> > > commit 2e45528930388658603ea24d49cf52867b928d3e
> > > Author: Jiang Liu <[email protected]>
> > > Date: Wed Feb 19 14:07:36 2014 +0800
> > >
> > > iommu/vt-d: Unify the way to process DMAR device scope array
>
> This commit is about how we decide which IOMMU a given PCI device is
> attached to.
>
> Thus, my first guess would be that we are quite happily setting up the
> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> device actually tries to do DMA.
>
> However, I'm not 100% convinced of that. The fault address looks
> suspiciously like a true physical address, not a virtual bus address of
> the type that we'd normally allocate for a dma_map_* operation. Those
> would start at 0xfffff000 and work downwards, typically.
>
> Do you have 'iommu=pt' on the kernel command line?

No.

> Can I see the full
> dmesg as this system boots, and also a copy of the DMAR table?

Attaching a dmesg from one of the kernels that boots. It doesn't appear
to have much of the related information... is there any debug config
option I can enable that might give you more data?


Attachments:
dmesg.out (100.37 kB)

2014-04-10 16:32:30

by Woodhouse, David

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
> > > > > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> >
> > That "Present bit in context entry is clear" fault means that we have
> > not set up *any* mappings for this PCI device… on this IOMMU.
> >
> > > > Yes, specifically (finally done bisecting):
> > > >
> > > > commit 2e45528930388658603ea24d49cf52867b928d3e
> > > > Author: Jiang Liu <[email protected]>
> > > > Date: Wed Feb 19 14:07:36 2014 +0800
> > > >
> > > > iommu/vt-d: Unify the way to process DMAR device scope array
> >
> > This commit is about how we decide which IOMMU a given PCI device is
> > attached to.
> >
> > Thus, my first guess would be that we are quite happily setting up the
> > requested DMA maps on the *wrong* IOMMU, and then taking faults when the
> > device actually tries to do DMA.
> >
> > However, I'm not 100% convinced of that. The fault address looks
> > suspiciously like a true physical address, not a virtual bus address of
> > the type that we'd normally allocate for a dma_map_* operation. Those
> > would start at 0xfffff000 and work downwards, typically.
> >
> > Do you have 'iommu=pt' on the kernel command line?
>
> No.
>
> > Can I see the full
> > dmesg as this system boots, and also a copy of the DMAR table?
>
> Attaching a dmesg from one of the kernels that boots. It doesn't appear
> to have much of the related information...

It shows us that the address 0x7f61e000 is in an E820-reserved region,
and that there's an RMRR covering that region for an unspecified PCI
device, but that's going to be the hpsa.

So if it isn't just a simple case of us assigning this device to the wrong
IOMMU, *perhaps* it's that we lose the RMRR when the driver takes
control of the device. RMRRs are generally expected to be a boot-time
thing, for things like legacy keyboard/mouse emulation via USB. Using
them while the system is *active* is... horrid. We've often not quite
handled that right.
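
The check described above, whether a faulting address lies inside a reserved range that an RMRR is supposed to keep identity-mapped, can be sketched as follows. Only the fault address 0x7f61e000 comes from this thread. The range bounds in `HYPOTHETICAL_RMRRS` are made up for illustration, since the real values would have to be read from the DMAR table and dmesg, and the helper name is an assumption as well.

```python
# (start, end, label) with inclusive bounds. These bounds are hypothetical
# placeholders, not values taken from the actual DMAR table.
HYPOTHETICAL_RMRRS = [
    (0x7F61E000, 0x7F61FFFF, "RMRR (hypothetical bounds)"),
]


def find_covering_range(addr, ranges):
    """Return the label of the first range containing `addr`, or None."""
    for start, end, label in ranges:
        if start <= addr <= end:
            return label
    return None


fault_addr = 0x7F61E000   # from the DMAR fault message in this thread
typical_iova = 0xFFFFF000  # where dma_map_* allocations would typically start

rmrr_hit = find_covering_range(fault_addr, HYPOTHETICAL_RMRRS)
iova_hit = find_covering_range(typical_iova, HYPOTHETICAL_RMRRS)
```

A hit for the fault address and a miss for the typical bus address would be consistent with the device doing DMA to an RMRR region whose 1:1 mapping is missing, rather than to a buffer mapped by the driver.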

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation



2014-04-10 20:46:38

by Stephen M. Cameron

[permalink] [raw]
Subject: Re: hpsa driver bug crack kernel down!

On Wed, Apr 09, 2014 at 11:32:37PM -0700, Davidlohr Bueso wrote:
> On Wed, 2014-04-09 at 22:03 -0600, Bjorn Helgaas wrote:
> > [+cc Joerg, iommu list]
> >
> > On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso <[email protected]> wrote:
> > > On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> > >> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> > >> > On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> > >> > > On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> > >> > > > [+linux-scsi]
> > >> > > > On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> > >> > > > > On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> > >> > > > > > Hi,
> > >> > > > > >
> > >> > > > > > The kernel is 3.14.0+ which is pulled just now.
> > >> > > > >
> > >> > > > > Cc'ing more people.
> > >> > > > >
> > >> > > > > While the hpsa driver appears to be involved in some way, I'm not sure if
> > >> > > > > this is a related issue, but as of today's pull I'm getting another
> > >> > > > > problem that causes my DL980 not to come up.
> > >> > > > >
> > >> > > > > *Massive* amounts of:
> > >> > > > >
> > >> > > > > DMAR:[fault reason 02] Present bit in context entry is clear
> > >> > > > > dmar: DRHD: handling fault status reg 602
> > >> > > > > dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> > >> > > > >
> > >> > > > > Then:
> > >> > > > >
> > >> > > > > hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> > >> > > > > ...
> > >> > > > > Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> > >> > > > > ...
> > >> > > > >
> > >> > > > > Screenshot of the actual LOCKUP:
> > >> > > > > http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> > >> > > > >
> > >> > > > > While I haven't bisected, things worked fine until at least commit
> > >> > > > > 39de65aa2c3e (April 2nd).
> > >> > > > >
> > >> > > > > Any ideas?
> > >> > > >
> > >> > > > Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> > >> > > > that everything worked fine until 39de65aa2c3e would tend to vindicate
> > >> > > > hpsa,
> > >> >
> > >> > Hmm here you mean DMA, right?
> > >>
> > >> No, it vindicates the hpsa changes ... they don't seem to be causing
> > >> problems until something goes wrong with dma remapping.
> > >>
> > >> > > because all the hpsa changes went in before that under
> > >> > > Missing crucial info:
> > >> > >
> > >> > > commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> > >> > >
> > >> > > > Merge: 3e75c6d b2bff6c
> > >> > > > Author: Linus Torvalds <[email protected]>
> > >> > > > Date: Tue Apr 1 18:49:04 2014 -0700
> > >> > > >
> > >> > > > Merge tag 'scsi-misc' of
> > >> > > > git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> > >> > > >
> > >> > > > can you revalidate that this commit works OK just to make sure?
> > >> >
> > >> > Ok so I don't see those DMA messages and system starts just fine. I'm
> > >> > thinking perhaps something broke after the IO mmu stuff in commit
> > >> > 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> > >> > causing the CPU stalls and just blame hpsa in the path as a side effect?
> > >> >
> > >> > /me goes out to try the commit.
> > >>
> > >> That's my guess. The DMAR messages are DMA remapping issues caused in
> > >> the IOMMU. If I had to guess, I'd say the DMAR fault message is
> > >> indicating the IOMMU is calling for a mapping address before it can
> > >> satisfy the driver read request, which is causing the hang apparently in
> > >> the hpsa driver.
> > >>
> > >> I've added linux-pci to the cc; I think they deal with iommu issues on
> > >> x86.
> > >
> > > So that merge commit appears to be the culprit, I see both the DMA
> > > messages and the lockup blaming hpsa...
> >
> > My understanding so far (please correct me if I'm wrong):
> >
> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")

^^^ this one, 1a0b6abaea78, did not work for me, crashing in
hpsa_enter_performant_mode(), which was surprising to me as I am
pretty sure I tried on this very same machine I'm using now
(DL360p with P420, P430 and P420i) with 3.14-rc-something plus
all the hpsa patches that I thought were merged in.

But now I am seeing:

[<ffffffffa0002bd0>] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa]
RSP: 0018:ffff88042c515a78 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88042c650000 RCX: 0000000000000004
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff88042c515b48 R08: 0000000000000000 R09: 000000008af03cc0
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88042c515a98
R13: 0000000060000104 R14: ffff88042c515ad8 R15: ffffffffa0001630
FS: 00007f86f7a38700(0000) GS:ffff88043f560000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
usb 1-1.6: new low-speed USB device number 3 using ehci-pci
CR2: 0000000000000000 CR3: 000000042c4c3000 CR4: 00000000000407e0
Stack:
0000000000008024 ffffffffa00000c0 ffffffffa0000be0 0000000000000000
0000000600000005 0000000800000007 0000000a00000009 0000000c0000000b
0000000e0000000d 000000100000000f 0000001200000011 0000000400000013
Call Trace:
[<ffffffffa00000c0>] ? SA5_fifo_full+0x20/0x20 [hpsa]
[<ffffffffa0000be0>] ? SA5_ioaccel_mode1_completed+0xd0/0xd0 [hpsa]
[<ffffffffa000aab6>] hpsa_put_ctlr_into_performant_mode+0x186/0x320 [hpsa]
[<ffffffffa0005132>] ? hpsa_allocate_sg_chain_blocks+0xa2/0xd0 [hpsa]
[<ffffffffa000b08b>] hpsa_init_one+0x43b/0x7d0 [hpsa]
[<ffffffff812bc34c>] local_pci_probe+0x4c/0xb0
[<ffffffff812bc439>] pci_call_probe+0x89/0xb0
[<ffffffff812bb074>] ? pci_match_device+0xc4/0xd0
[<ffffffff812bc719>] pci_device_probe+0x79/0xa0
[<ffffffff8138edd2>] ? driver_sysfs_add+0x82/0xb0
[<ffffffff8138f03c>] really_probe+0x6c/0x320
usb 1-1.6: New USB device found, idVendor=0624, idProduct=0341
usb 1-1.6: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 1-1.6: Product: HP 336047-B21
usb 1-1.6: Manufacturer: Avocent
input: Avocent HP 336047-B21 as
/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.64
hid-generic 0003:0624:0341.0001: input,hidraw0: USB HID v1.10 Keyboard
[Avocent0
input: Avocent HP 336047-B21 as
/devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.65
hid-generic 0003:0624:0341.0002: input,hidraw1: USB HID v1.10 Mouse [Avocent
HP1
[<ffffffff8138f337>] driver_probe_device+0x47/0xa0
[<ffffffff8138f43b>] __driver_attach+0xab/0xb0
[<ffffffff8138f390>] ? driver_probe_device+0xa0/0xa0
[<ffffffff8138f390>] ? driver_probe_device+0xa0/0xa0
[<ffffffff8138d1d4>] bus_for_each_dev+0x94/0xb0
[<ffffffff8138ecfe>] driver_attach+0x1e/0x20
[<ffffffff8138e6e0>] bus_add_driver+0x1b0/0x250
usb 2-1.3: new high-speed USB device number 3 using ehci-pci
[<ffffffffa0016000>] ? 0xffffffffa0015fff
[<ffffffff8138f9d4>] driver_register+0x64/0xf0
[<ffffffffa0016000>] ? 0xffffffffa0015fff
[<ffffffff812bc80c>] __pci_register_driver+0x4c/0x50
[<ffffffffa001601e>] hpsa_init+0x1e/0x20 [hpsa]
[<ffffffff810002a2>] do_one_initcall+0xd2/0x180
[<ffffffff810771a5>] ? __blocking_notifier_call_chain+0x65/0x80
[<ffffffff810c8154>] do_init_module+0x44/0x1b0
[<ffffffff810ca7c8>] load_module+0x5a8/0x6f0
[<ffffffff810c7a30>] ? __unlink_module+0x30/0x30
[<ffffffff81164c35>] ? __vmalloc_node+0x35/0x40
[<ffffffff810c7120>] ? module_sect_show+0x30/0x30
[<ffffffff810caa96>] SyS_init_module+0x96/0xc0
[<ffffffff81590d52>] system_call_fastpath+0x16/0x1b
Code: 89 45 8c 78 2c 31 f6 8d 4e 04 4c 89 e2 31 c0 0f 1f 40 00 39 0a 7d 0c usb
0
usb 2-1.3: New USB device strings: Mfr=0, Product=0, SerialNumber=0
hub 2-1.3:1.0: USB hub found
hub 2-1.3:1.0: 2 ports detected

83 c0 01 48 83 c2 04 83 f8 10 75 f0 48 63 d6 83 c6 01 39 f7 <41> 89 04 90 7d
d6
RIP [<ffffffffa0002bd0>] hpsa_enter_performant_mode+0x4c0/0x540 [hpsa]
RSP <ffff88042c515a78>
CR2: 0000000000000000
---[ end trace ab56f106199a4971 ]---


> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
>
> Yes, specifically (finally done bisecting):
>
> commit 2e45528930388658603ea24d49cf52867b928d3e
> Author: Jiang Liu <[email protected]>
> Date: Wed Feb 19 14:07:36 2014 +0800
>
> iommu/vt-d: Unify the way to process DMAR device scope array
>
> Now we have a PCI bus notification based mechanism to update DMAR
> device scope array, we could extend the mechanism to support boot
> time initialization too, which will help to unify and simplify
> the implementation.
>
> Signed-off-by: Jiang Liu <[email protected]>
> Signed-off-by: Joerg Roedel <[email protected]>

My git bisect appears to be converging on something else, something
within the hpsa patches that I sent up recently, unfortunately for
me. Will let you all know when it converges.

-- steve

2014-04-10 23:17:47

by Shuah Khan

Subject: Re: hpsa driver bug crack kernel down!

On Thu, Apr 10, 2014 at 2:45 PM, <[email protected]> wrote:
>> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")
>>
>> Yes, specifically (finally done bisecting):
>>
>> commit 2e45528930388658603ea24d49cf52867b928d3e
>> Author: Jiang Liu <[email protected]>
>> Date: Wed Feb 19 14:07:36 2014 +0800
>>
>> iommu/vt-d: Unify the way to process DMAR device scope array
>>
>> Now we have a PCI bus notification based mechanism to update DMAR
>> device scope array, we could extend the mechanism to support boot
>> time initialization too, which will help to unify and simplify
>> the implementation.
>>
>> Signed-off-by: Jiang Liu <[email protected]>
>> Signed-off-by: Joerg Roedel <[email protected]>
>
> My git bisect appears to be converging on something else, something
> within the hpsa patches that I sent up recently, unfortunately for
> me. Will let you all know when it converges.
>

This smells very much like the problem that was solved a couple of years
ago for the SI domain. It is likely that this path was broken by the DMAR
device scope array change. Please take a look to see whether the following
no longer occurs. It looks like the BIOS could be expecting this RMRR to
still be mapped.

	/*
	 * We want to prevent any device associated with an RMRR from
	 * getting placed into the SI Domain. This is done because
	 * problems exist when devices are moved in and out of domains
	 * and their respective RMRR info is lost. We exempt USB devices
	 * from this process due to their usage of RMRRs that are known
	 * to not be needed after BIOS hand-off to OS.
	 */
	if (device_has_rmrr(dev) &&
	    (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
		return 0;

-- Shuah

2014-04-11 01:37:02

by Baoquan He

Subject: Re: hpsa driver bug crack kernel down!

On 04/10/14 at 04:34pm, Jiang Liu wrote:
> Hi Baoquan,
> Could you please help to give output of "lspci -vvvv"?
> Is device "hpsa 0000:03:00.0" a legacy PCI device(non-PCIe)?
> It may have relationship with IOMMU driver.
> Thanks!
> Gerry

Hi,

I just saw your mail now. Do you still need the output of "lspci -vvvv"
on my test machine?

In fact, I didn't see any DMAR errors related to Intel VT-d issues.

If the output would be helpful, I can make a fresh build and collect it.

Thanks
Baoquan

>
> On 2014/4/10 12:03, Bjorn Helgaas wrote:
> > [+cc Joerg, iommu list]
> >
> > On Wed, Apr 9, 2014 at 6:19 PM, Davidlohr Bueso <[email protected]> wrote:
> >> On Wed, 2014-04-09 at 16:50 -0700, James Bottomley wrote:
> >>> On Wed, 2014-04-09 at 16:40 -0700, Davidlohr Bueso wrote:
> >>>> On Wed, 2014-04-09 at 16:10 -0700, James Bottomley wrote:
> >>>>> On Wed, 2014-04-09 at 16:08 -0700, James Bottomley wrote:
> >>>>>> [+linux-scsi]
> >>>>>> On Wed, 2014-04-09 at 15:49 -0700, Davidlohr Bueso wrote:
> >>>>>>> On Wed, 2014-04-09 at 10:39 +0800, Baoquan He wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> The kernel is 3.14.0+ which is pulled just now.
> >>>>>>>
> >>>>>>> Cc'ing more people.
> >>>>>>>
> >>>>>>> While the hpsa driver appears to be involved in some way, I'm not sure if
> >>>>>>> this is a related issue, but as of today's pull I'm getting another
> >>>>>>> problem that causes my DL980 not to come up.
> >>>>>>>
> >>>>>>> *Massive* amounts of:
> >>>>>>>
> >>>>>>> DMAR:[fault reason 02] Present bit in context entry is clear
> >>>>>>> dmar: DRHD: handling fault status reg 602
> >>>>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
> >>>>>>>
> >>>>>>> Then:
> >>>>>>>
> >>>>>>> hpsa 0000:03:00.0: Controller lockup detected: 0xffff0000
> >>>>>>> ...
> >>>>>>> Workqueue: events hpsa_monitor_ctlr_worker [hpsa]
> >>>>>>> ...
> >>>>>>>
> >>>>>>> Screenshot of the actual LOCKUP:
> >>>>>>> http://stgolabs.net/hpsa-hard-lockup-3.14+.png
> >>>>>>>
> >>>>>>> While I haven't bisected, things worked fine at least until commit
> >>>>>>> 39de65aa2c3e (April 2nd).
> >>>>>>>
> >>>>>>> Any ideas?
> >>>>>>
> >>>>>> Well, it's either a DMA remapping issue or a hpsa one. Your assertion
> >>>>>> that everything worked fine until 39de65aa2c3e would tend to vindicate
> >>>>>> hpsa,
> >>>>
> >>>> Hmm here you mean DMA, right?
> >>>
> >>> No, it vindicates the hpsa changes ... they don't seem to be causing
> >>> problems until something goes wrong with dma remapping.
> >>>
> >>>>> because all the hpsa changes went in before that under
> >>>>> Missing crucial info:
> >>>>>
> >>>>> commit 1a0b6abaea78f73d9bc0a2f6df2d9e4c917cade1
> >>>>>
> >>>>>> Merge: 3e75c6d b2bff6c
> >>>>>> Author: Linus Torvalds <[email protected]>
> >>>>>> Date: Tue Apr 1 18:49:04 2014 -0700
> >>>>>>
> >>>>>> Merge tag 'scsi-misc' of
> >>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
> >>>>>>
> >>>>>> can you revalidate that this commit works OK just to make sure?
> >>>>
> >>>> Ok so I don't see those DMA messages and system starts just fine. I'm
> >>>> thinking perhaps something broke after the IO mmu stuff in commit
> >>>> 3f583bc21977a608908b83d03ee2250426a5695c... could this be indirectly
> >>>> causing the CPU stalls and just blame hpsa in the path as a side effect?
> >>>>
> >>>> /me goes out to try the commit.
> >>>
> >>> That's my guess. The DMAR messages are DMA remapping issues caused in
> >>> the IOMMU. If I had to guess, I'd say the DMAR fault message is
> >>> indicating the IOMMU is calling for a mapping address before it can
> >>> satisfy the driver read request, which is causing the hang apparently in
> >>> the hpsa driver.
> >>>
> >>> I've added linux-pci to the cc; I think they deal with iommu issues on
> >>> x86.
> >>
> >> So that merge commit appears to be the culprit, I see both the DMA
> >> messages and the lockup blaming hpsa...
> >
> > My understanding so far (please correct me if I'm wrong):
> >
> > 39de65aa2c3e OK ("Merge branch 'i2c/for-next'")
> > 1a0b6abaea78 OK ("Merge tag 'scsi-misc'")
> > 3f583bc21977 BAD ("Merge tag 'iommu-updates-v3.15'")

2014-04-11 03:17:16

by Baoquan He

Subject: Re: hpsa driver bug crack kernel down!

On 04/10/14 at 04:34pm, Jiang Liu wrote:
> Hi Baoquan,
> Could you please help to give output of "lspci -vvvv"?
> Is device "hpsa 0000:03:00.0" a legacy PCI device(non-PCIe)?
> It may have relationship with IOMMU driver.
> Thanks!
> Gerry

Well, the machine the bug was reported on is an AMD machine, and it doesn't
have the IOMMU problem. David saw some DMAR errors, so his should
be an Intel machine, which uses VT-d.


2014-04-11 08:57:19

by David Woodhouse

Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 17:17 -0600, Shuah Khan wrote:
> This smells very much like the problem that was solved couple of years
> ago for SI domain. It is likely that path is broken with the DMAR
> device scope array change. Please take a look to see if the following
> no longer occurs. Looks like BIOS could be expecting this RMRR to be
> still mapped.
>
> /*
> * We want to prevent any device associated with an RMRR from
> * getting placed into the SI Domain. This is done because
> * problems exist when devices are moved in and out of domains
> * and their respective RMRR info is lost. We exempt USB devices
> * from this process due to their usage of RMRRs that are known
> * to not be needed after BIOS hand-off to OS.
> */
> if (device_has_rmrr(dev) &&
> (pdev->class >> 8) != PCI_CLASS_SERIAL_USB)
> return 0;

Yeah, I'd be inclined to agree.... although I've tested with graphics
*since* these patches. That's another case where we need to preserve the
RMRR mapping after the driver takes over — and it *was* working.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation



2014-04-11 09:21:30

by Woodhouse, David

Subject: Re: hpsa driver bug crack kernel down!

On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
> Attaching a dmesg from one of the kernels that boots. It doesn't appear
> to have much of the related information... is there any debug config
> option I can enable that might give you more data?

I'd like the contents of /sys/firmware/acpi/tables/DMAR please. And
please could you also apply this patch to both the last-working and
first-failing kernels and show me the output in both cases?

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index dd576c0..d52ac03 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -683,6 +683,12 @@ static struct intel_iommu *device_to_iommu(int segment, u8 bus, u8 devfn)
 out:
 	rcu_read_unlock();

+	if (iommu)
+		printk("Device %x:%02x:%02x.%d on IOMMU at %llx\n", segment, bus,
+		       PCI_SLOT(devfn), PCI_FUNC(devfn), drhd->reg_base_addr);
+	else
+		printk("Device %x:%02x:%02x.%d on no IOMMU\n", segment, bus,
+		       PCI_SLOT(devfn), PCI_FUNC(devfn));
 	return iommu;
 }



--
Sent with Evolution's ActiveSync support.

David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation






2014-04-14 07:01:30

by Jiang Liu

Subject: Re: hpsa driver bug crack kernel down!

Hi Davidlohr,
Thanks for the information!
According to the lspci output, device 0000:02:00.2 is the HP ILO
controller and device 0000:03:00.0 is the RAID controller. Both the ILO
and RAID controllers need to access the reserved memory range
[0x7f61e000 - 0x7f61ffff] in physical mode.

According to the dmesg output, the BIOS has reserved this memory and
the IOMMU has set up a 1:1 mapping for the ILO and RAID controllers to
access this range. The related log messages are below:
BIOS-e820: [mem 0x000000007f61d000-0x000000008fffffff] reserved
IOMMU: Setting RMRR:
IOMMU: Setting identity map for device 0000:03:00.0 [0x7f61e000 - 0x7f61ffff]
IOMMU: Setting identity map for device 0000:02:00.0 [0x7f61e000 - 0x7f61ffff]
IOMMU: Setting identity map for device 0000:02:00.2 [0x7f61e000 - 0x7f61ffff]

From the screenshot, device 0000:02:00.2 fails to access memory
address 0x7f61e000. That indicates the IOMMU driver fails to set up the
1:1 mapping of the reserved memory range for the ILO controller.
So could you please check whether you can observe boot messages
like "IOMMU: Setting identity map for device 0000:02:00.2
[0x7f61e000 - 0x7f61ffff]" with the failing kernel image?

It would be great if the boot messages could be saved when
the boot fails, so we could get more information from the log.

BTW, I have double-checked the related code, and still can't
find a reliable explanation for the regression :(

Thanks!
Gerry

On 2014/4/11 0:19, Davidlohr Bueso wrote:
> On Thu, 2014-04-10 at 08:46 +0000, Woodhouse, David wrote:
>> On Thu, 2014-04-10 at 09:15 +0200, Joerg Roedel wrote:
>>> [+ David, VT-d maintainer ]
>>>
>>> Jiang, David, can you please have a look into this issue?
>>>
>>
>>>>>>>>>>> DMAR:[fault reason 02] Present bit in context entry is clear
>>>>>>>>>>> dmar: DRHD: handling fault status reg 602
>>>>>>>>>>> dmar: DMAR:[DMA Read] Request device [02:00.0] fault addr 7f61e000
>>
>> That "Present bit in context entry is clear" fault means that we have
>> not set up *any* mappings for this PCI device… on this IOMMU.
>>
>>>> Yes, specifically (finally done bisecting):
>>>>
>>>> commit 2e45528930388658603ea24d49cf52867b928d3e
>>>> Author: Jiang Liu <[email protected]>
>>>> Date: Wed Feb 19 14:07:36 2014 +0800
>>>>
>>>> iommu/vt-d: Unify the way to process DMAR device scope array
>>
>> This commit is about how we decide which IOMMU a given PCI device is
>> attached to.
>>
>> Thus, my first guess would be that we are quite happily setting up the
>> requested DMA maps on the *wrong* IOMMU, and then taking faults when the
>> device actually tries to do DMA.
>>
>> However, I'm not 100% convinced of that. The fault address looks
>> suspiciously like a true physical address, not a virtual bus address of
>> the type that we'd normally allocate for a dma_map_* operation. Those
>> would start at 0xfffff000 and work downwards, typically.
>>
>> Do you have 'iommu=pt' on the kernel command line?
>
> No.
>
>> Can I see the full
>> dmesg as this system boots, and also a copy of the DMAR table?
>
> Attaching a dmesg from one of the kernels that boots. It doesn't appear
> to have much of the related information... is there any debug config
> option I can enable that might give you more data?
>

2014-04-14 08:58:00

by Jiang Liu

Subject: Re: hpsa driver bug crack kernel down!

Hi all,
I think I have found the root cause. It's a bug in matching the
device scope: the variable 'level' should be decremented when walking
up the PCI topology.
Could you please help to test the following patch?
Thanks!
Gerry

diff --git a/drivers/iommu/dmar.c b/drivers/iommu/dmar.c
index f445c10..1f8308c 100644
--- a/drivers/iommu/dmar.c
+++ b/drivers/iommu/dmar.c
@@ -152,7 +152,7 @@ dmar_alloc_pci_notify_info(struct pci_dev *dev, unsigned long event)
 	info->seg = pci_domain_nr(dev->bus);
 	info->level = level;
 	if (event == BUS_NOTIFY_ADD_DEVICE) {
-		for (tmp = dev, level--; tmp; tmp = tmp->bus->self) {
+		for (tmp = dev, level--; tmp; level--, tmp = tmp->bus->self) {
 			info->path[level].device = PCI_SLOT(tmp->devfn);
 			info->path[level].function = PCI_FUNC(tmp->devfn);
 			if (pci_is_root_bus(tmp->bus))



2014-04-14 15:46:09

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

Sorry for the delay, I've been having to take turns for this box.

On Fri, 2014-04-11 at 09:18 +0000, Woodhouse, David wrote:
> On Thu, 2014-04-10 at 09:19 -0700, Davidlohr Bueso wrote:
> > Attaching a dmesg from one of the kernels that boots. It doesn't appear
> > to have much of the related information... is there any debug config
> > option I can enable that might give you more data?
>
> I'd like the contents of /sys/firmware/acpi/tables/DMAR please.

Attached is the disassembly of the raw output.

> And
> please could you also apply this patch to both the last-working and
> first-failing kernels and show me the output in both cases?

So I still cannot get around getting the info for the first failing
kernel, but below is for the last working. Thanks.

Device 0:03:00.0 on IOMMU at a8000000
Device 0:03:00.0 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:02:00.0 [0x7f61e000 - 0x7f61ffff]
Device 0:02:00.0 on IOMMU at a8000000
Device 0:02:00.0 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:02:00.2 [0x7f61e000 - 0x7f61ffff]
Device 0:02:00.2 on IOMMU at a8000000
Device 0:02:00.2 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:00:1d.0 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.0 on IOMMU at a8000000
Device 0:00:1d.0 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:00:1d.1 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.1 on IOMMU at a8000000
Device 0:00:1d.1 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:00:1d.2 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.2 on IOMMU at a8000000
Device 0:00:1d.2 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:00:1d.3 [0x7f7e7000 - 0x7f7ecfff]
Device 0:00:1d.3 on IOMMU at a8000000
Device 0:00:1d.3 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:02:00.0 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.0 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:02:00.2 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.2 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:02:00.4 [0x7f7e7000 - 0x7f7ecfff]
Device 0:02:00.4 on IOMMU at a8000000
Device 0:02:00.4 on IOMMU at a8000000
IOMMU: Setting identity map for device 0000:00:1d.7 [0x7f7ee000 - 0x7f7effff]
Device 0:00:1d.7 on IOMMU at a8000000
Device 0:00:1d.7 on IOMMU at a8000000
IOMMU: Prepare 0-16MiB unity mapping for LPC
IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
Device 0:00:1f.0 on IOMMU at a8000000
Device 0:00:1f.0 on IOMMU at a8000000
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
Device 0:00:00.0 on IOMMU at a8000000
Device 0:00:01.0 on IOMMU at a8000000
Device 0:00:02.0 on IOMMU at a8000000
Device 0:00:03.0 on IOMMU at a8000000
Device 0:00:04.0 on IOMMU at a8000000
Device 0:00:05.0 on IOMMU at a8000000
Device 0:00:06.0 on IOMMU at a8000000
Device 0:00:07.0 on IOMMU at a8000000
Device 0:00:08.0 on IOMMU at a8000000
Device 0:00:09.0 on IOMMU at a8000000
Device 0:00:0a.0 on IOMMU at a8000000
Device 0:00:14.0 on IOMMU at a8000000
Device 0:00:1c.0 on IOMMU at a8000000
Device 0:00:1c.4 on IOMMU at a8000000
Device 0:00:1d.0 on IOMMU at a8000000
Device 0:00:1d.1 on IOMMU at a8000000
Device 0:00:1d.2 on IOMMU at a8000000
Device 0:00:1d.3 on IOMMU at a8000000
Device 0:00:1d.7 on IOMMU at a8000000
Device 0:00:1e.0 on IOMMU at a8000000
Device 0:00:1f.0 on IOMMU at a8000000
Device 0:04:00.0 on IOMMU at a8000000
Device 0:04:00.1 on IOMMU at a8000000
Device 0:04:00.2 on IOMMU at a8000000
Device 0:04:00.3 on IOMMU at a8000000
Device 0:03:00.0 on IOMMU at a8000000
Device 0:02:00.0 on IOMMU at a8000000
Device 0:02:00.2 on IOMMU at a8000000
Device 0:02:00.4 on IOMMU at a8000000
Device 0:01:03.0 on IOMMU at a8000000
Device 0:50:00.0 on IOMMU at ac000000
Device 0:50:01.0 on IOMMU at ac000000
Device 0:50:02.0 on IOMMU at ac000000
Device 0:50:03.0 on IOMMU at ac000000
Device 0:50:04.0 on IOMMU at ac000000
Device 0:50:05.0 on IOMMU at ac000000
Device 0:50:06.0 on IOMMU at ac000000
Device 0:50:07.0 on IOMMU at ac000000
Device 0:50:08.0 on IOMMU at ac000000
Device 0:50:09.0 on IOMMU at ac000000
Device 0:50:0a.0 on IOMMU at ac000000
Device 0:50:14.0 on IOMMU at a8000000
Device 0:a0:00.0 on IOMMU at b0000000
Device 0:a0:01.0 on IOMMU at b0000000
Device 0:a0:02.0 on IOMMU at b0000000
Device 0:a0:03.0 on IOMMU at b0000000
Device 0:a0:04.0 on IOMMU at b0000000
Device 0:a0:05.0 on IOMMU at b0000000
Device 0:a0:06.0 on IOMMU at b0000000
Device 0:a0:07.0 on IOMMU at b0000000
Device 0:a0:08.0 on IOMMU at b0000000
Device 0:a0:09.0 on IOMMU at b0000000
Device 0:a0:0a.0 on IOMMU at b0000000
Device 0:a0:14.0 on IOMMU at a8000000
Device 0:7c:00.0 on IOMMU at a8000000
Device 0:7c:08.0 on IOMMU at a8000000
Device 0:82:00.0 on IOMMU at a8000000
Device 0:82:08.0 on IOMMU at a8000000


Attachments:
DMAR.dsl (28.46 kB)

2014-04-14 16:20:04

by Jiang Liu

Subject: Re: hpsa driver bug crack kernel down!

Hi Davidlohr,
Thanks for providing the DMAR table. According to the DMAR
table, a bug in the IOMMU driver makes it fail to handle this entry:
[1D2h 0466 1] Device Scope Entry Type : 01
[1D3h 0467 1] Entry Length : 0A
[1D4h 0468 2] Reserved : 0000
[1D6h 0470 1] Enumeration ID : 00
[1D7h 0471 1] PCI Bus Number : 00
[1D8h 0472 2] PCI Path : 1C,04
[1DAh 0474 2] PCI Path : 00,02

The patch I sent out should fix this bug. Could you please give it
a try?
Thanks!
Gerry


2014-04-14 16:44:26

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
> Hi Davidlohr,
> Thanks for providing the DMAR table. According to the DMAR
> table, one bug in the iommu driver fails to handle this entry:
> [1D2h 0466 1] Device Scope Entry Type : 01
> [1D3h 0467 1] Entry Length : 0A
> [1D4h 0468 2] Reserved : 0000
> [1D6h 0470 1] Enumeration ID : 00
> [1D7h 0471 1] PCI Bus Number : 00
> [1D8h 0472 2] PCI Path : 1C,04
> [1DAh 0474 2] PCI Path : 00,02
>
> And the patch sent out by me should fix this bug. Could you please help
> to have a try?

Sorry, I am unable to find any patches from you regarding this issue...
I must be missing something. Could you please point me to the lkml link?

Thanks.
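
[Editorial note on the device scope entry quoted above.] A minimal sketch of how the ten bytes of that entry break down. It assumes the device scope layout from the VT-d specification: a 6-byte header (type, length, reserved, enumeration ID, start bus) followed by 2-byte device/function path entries. The raw bytes are rebuilt from the values in the dump; `parse_device_scope` is a hypothetical helper for illustration, not a kernel function.

```python
import struct

HEADER_LEN = 6  # type, length, reserved(2), enumeration ID, start bus


def parse_device_scope(raw: bytes) -> dict:
    """Decode one DMAR device scope entry into its fields."""
    if len(raw) < HEADER_LEN:
        raise ValueError("device scope header is 6 bytes")
    scope_type, length, _reserved, enum_id, start_bus = struct.unpack_from(
        "<BBHBB", raw, 0
    )
    if length < HEADER_LEN or length > len(raw) or (length - HEADER_LEN) % 2:
        raise ValueError("bad entry length")
    # Every 2 bytes after the header is one (device, function) hop.
    path = [(raw[i], raw[i + 1]) for i in range(HEADER_LEN, length, 2)]
    return {
        "type": scope_type,
        "length": length,
        "enumeration_id": enum_id,
        "start_bus": start_bus,
        "path": path,
    }


# Bytes taken from the dump: type 01, length 0A, reserved 0000,
# enumeration ID 00, bus 00, path 1C,04 then 00,02.
ENTRY = bytes([0x01, 0x0A, 0x00, 0x00, 0x00, 0x00, 0x1C, 0x04, 0x00, 0x02])
scope = parse_device_scope(ENTRY)
for device, function in scope["path"]:
    print(f"hop: device {device:02x} function {function}")
```

Entry Length 0A (10 bytes) minus the 6-byte header leaves 4 bytes, so the entry describes two hops: device 1c function 4 on bus 00, then device 00 function 2 on the bus behind that bridge. It is this two-hop form that the thread says the iommu driver failed to handle.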

2014-04-14 16:47:56

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

On Mon, 2014-04-14 at 09:44 -0700, Davidlohr Bueso wrote:
> On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
> > Hi Davidlohr,
> > Thanks for providing the DMAR table. According to the DMAR
> > table, one bug in the iommu driver fails to handle this entry:
> > [1D2h 0466 1] Device Scope Entry Type : 01
> > [1D3h 0467 1] Entry Length : 0A
> > [1D4h 0468 2] Reserved : 0000
> > [1D6h 0470 1] Enumeration ID : 00
> > [1D7h 0471 1] PCI Bus Number : 00
> > [1D8h 0472 2] PCI Path : 1C,04
> > [1DAh 0474 2] PCI Path : 00,02
> >
> > And the patch sent out by me should fix this bug. Could you please help
> > to have a try?
>
> Sorry, I am unable to find any patches from you regarding this issue...
> I must be missing something. Could you please point me to the lkml link?

Never mind, I got it internally. I'll let you know as soon as I can
test it later today.

2014-04-14 17:05:14

by Woodhouse, David

Subject: Re: hpsa driver bug crack kernel down!

On Mon, 2014-04-14 at 09:47 -0700, Davidlohr Bueso wrote:
> On Mon, 2014-04-14 at 09:44 -0700, Davidlohr Bueso wrote:
> > On Tue, 2014-04-15 at 00:19 +0800, Jiang Liu wrote:
> > > Hi Davidlohr,
> > > Thanks for providing the DMAR table. According to the DMAR
> > > table, one bug in the iommu driver fails to handle this entry:
> > > [1D2h 0466 1] Device Scope Entry Type : 01
> > > [1D3h 0467 1] Entry Length : 0A
> > > [1D4h 0468 2] Reserved : 0000
> > > [1D6h 0470 1] Enumeration ID : 00
> > > [1D7h 0471 1] PCI Bus Number : 00
> > > [1D8h 0472 2] PCI Path : 1C,04
> > > [1DAh 0474 2] PCI Path : 00,02
> > >
> > > And the patch sent out by me should fix this bug. Could you please help
> > > to have a try?
> >
> > Sorry, I am unable to find any patches from you regarding this issue...
> > I must be missing something. Could you please point me to the lkml link?
>
> Never mind, I got it internally. I'll let you know as soon as I can
> test it later today.

Thanks.

Jiang, if you can then let me have a copy with a signed-off-by I'll
shepherd it upstream along with your other patch which is already in my
iommu-2.6.git tree.

--
Sent with Evolution's ActiveSync support.

David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation






2014-04-14 18:08:28

by Davidlohr Bueso

Subject: Re: hpsa driver bug crack kernel down!

On Mon, 2014-04-14 at 16:57 +0800, Jiang Liu wrote:
> Hi all,
> I guess I found the root cause. It's a bug in matching
> device scope, variable 'level' should be decreased when walking up PCI
> topology.
> Could you please help to test following patch?
> Thanks!
> Gerry

Worked like a charm -- I no longer see all those DMAR messages and the
hpsa hard lockup is gone, thanks. Feel free to add my:

Reported-and-tested-by: Davidlohr Bueso <[email protected]>
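
[Editorial note on the root cause described above.] A rough sketch of why the index has to be decreased on every step when walking up the PCI topology. This is not the kernel code: `path_from_root`, the `decrement_each_hop` flag, and the two-node topology are assumptions made for illustration, with device/function numbers borrowed from the device scope entry quoted earlier in the thread. The "no decrement" branch is one plausible way the walk can go wrong, not a reproduction of the pre-patch source.

```python
def path_from_root(dev, parent_of, devfn, decrement_each_hop=True):
    """Build the (device, function) hops from the root bus down to dev.

    The walk goes upward from dev, so slots are filled from the end
    of the list toward the front.
    """
    # First pass: count how many hops lie between dev and the root bus.
    depth = 0
    node = dev
    while node is not None:
        depth += 1
        node = parent_of.get(node)

    path = [None] * depth
    level = depth
    if not decrement_each_hop:
        level -= 1  # stepped back once, then never again
    node = dev
    while node is not None:
        if decrement_each_hop:
            level -= 1  # one slot closer to the root per hop
        path[level] = devfn[node]
        node = parent_of.get(node)
    return path


# Hypothetical topology: an endpoint at 00.2 behind a bridge at 1c.4.
DEVFN = {"bridge": (0x1C, 4), "endpoint": (0x00, 2)}
PARENT_OF = {"endpoint": "bridge"}
SCOPE_PATH = [(0x1C, 4), (0x00, 2)]  # path listed in the device scope entry

good = path_from_root("endpoint", PARENT_OF, DEVFN)
bad = path_from_root("endpoint", PARENT_OF, DEVFN, decrement_each_hop=False)
print("matches scope:", good == SCOPE_PATH)
print("without per-hop decrement:", bad == SCOPE_PATH)
```

Without the per-hop decrement, every hop overwrites the same slot, so a device behind a bridge can never match a multi-hop scope path. A device sitting directly on the root bus has only one hop and would still match, which may be why only some devices were affected.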

2014-04-16 13:37:46

by Joerg Roedel

Subject: Re: hpsa driver bug crack kernel down!

Hey David,

On Mon, Apr 14, 2014 at 05:03:51PM +0000, Woodhouse, David wrote:
> Jiang, if you can then let me have a copy with a signed-off-by I'll
> shepherd it upstream along with your other patch which is already in my
> iommu-2.6.git tree.

What is the state of these fixes? I plan to send out a pull-request
before easter and hoped to include these fixes as well.

Thanks,

Joerg

2014-04-16 13:58:52

by Woodhouse, David

Subject: Re: hpsa driver bug crack kernel down!

On Wed, 2014-04-16 at 15:37 +0200, [email protected] wrote:
> Hey David,
>
> On Mon, Apr 14, 2014 at 05:03:51PM +0000, Woodhouse, David wrote:
> > Jiang, if you can then let me have a copy with a signed-off-by I'll
> > shepherd it upstream along with your other patch which is already in my
> > iommu-2.6.git tree.
>
> What is the state of these fixes? I plan to send out a pull-request
> before easter and hoped to include these fixes as well.

I'm travelling and was going to do some final testing and send out a
pull request after I got home tomorrow. But since you ask...

Please pull from
git://git.infradead.org/iommu-2.6.git

David Woodhouse (1):
      iommu/vt-d: Fix get_domain_for_dev() handling of upstream PCIe bridges

Jiang Liu (2):
      iommu/vt-d: fix memory leakage caused by commit ea8ea46
      iommu/vt-d: fix bug in matching PCI devices with DRHD/RMRR descriptors

 drivers/iommu/dmar.c        |  3 ++-
 drivers/iommu/intel-iommu.c | 10 +++++++---
 2 files changed, 9 insertions(+), 4 deletions(-)



--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation



2014-04-16 14:13:28

by Joerg Roedel

Subject: Re: hpsa driver bug crack kernel down!

On Wed, Apr 16, 2014 at 01:58:44PM +0000, Woodhouse, David wrote:
> On Wed, 2014-04-16 at 15:37 +0200, [email protected] wrote:
> > What is the state of these fixes? I plan to send out a pull-request
> > before easter and hoped to include these fixes as well.
>
> I'm travelling and was going to do some final testing and send out a
> pull request after I got home tomorrow. But since you ask...
>
> Please pull from
> git://git.infradead.org/iommu-2.6.git
>
> David Woodhouse (1):
>       iommu/vt-d: Fix get_domain_for_dev() handling of upstream PCIe bridges
>
> Jiang Liu (2):
>       iommu/vt-d: fix memory leakage caused by commit ea8ea46
>       iommu/vt-d: fix bug in matching PCI devices with DRHD/RMRR descriptors
>
>  drivers/iommu/dmar.c        |  3 ++-
>  drivers/iommu/intel-iommu.c | 10 +++++++---
>  2 files changed, 9 insertions(+), 4 deletions(-)

Pulled, thanks David. I will also do some additional testing before
sending it upstream.


Joerg