2010-08-10 17:37:38

by Sander Eikelenboom

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Hello Joerg,

The requested info is attached.
So that would mean a BIOS problem? (Those are not on my wishlist :-p)

--
Sander


Tuesday, August 10, 2010, 6:26:06 PM, you wrote:

> On Tue, Aug 10, 2010 at 04:48:50PM +0200, Sander Eikelenboom wrote:
>> Hi Joerg,
>>
>> Trying to boot 2.6.35 with the AMD IOMMU enabled on an MSI 890FXA-GD70
>> motherboard with AMD 890FX chipset results in the oops below; the
>> complete serial log is attached.

> Ok, I have a theory about what's going on. It looks like one of your
> devices aliases to a non-existent PCI BDF. But please send me the data
> I requested so I can verify this.

> Joerg




--
Best regards,
Sander mailto:[email protected]


Attachments:
.config (97.18 kB)
dmesg-2.6.35-iommu.txt (30.19 kB)
lspci-2.6.35-iommu.txt (55.07 kB)

2010-08-10 18:01:28

by Joerg Roedel

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

On Tue, Aug 10, 2010 at 06:57:45PM +0200, Sander Eikelenboom wrote:
> The requested info is attached.
> So that would mean a BIOS problem? (Those are not on my wishlist :-p)

Yeah, looks like a BIOS problem. But the driver should handle that
without crashing the system, so there is a bug in the driver too.

Problem is:

AMD-Vi: DEV_ALIAS_RANGE devid: 0a:01.0 flags: 00 devid_to: 0a:00.0
AMD-Vi: DEV_RANGE_END devid: 0a:1f.7

This means that PCI devices from 0a:01.0 to 0a:1f.7 may use their own
device-id or 0a:00.0. But a device with id 0a:00.0 is not present in
the system. From the lspci output it looks like your USB3 controller
should alias to 09:00.0. I will prepare a patch for you to fix the
crash, but I can't guarantee that your USB3 controller will work
afterwards. If you see IO-Page-Faults, please report them to me.
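
The alias-range semantics described above can be modeled in a few lines. This is a minimal Python sketch, not the kernel code: the function names (devid, fmt, build_alias_table, lookup_alias) and the "present" set are hypothetical and exist only for illustration. It packs bus:dev.fn into a 16-bit device id, expands one DEV_ALIAS_RANGE entry, and shows a guard that reports a missing alias target instead of assuming it exists.

```python
def devid(bus, dev, fn):
    """Pack bus:dev.fn into a 16-bit device id: bus[15:8], dev[7:3], fn[2:0]."""
    return (bus << 8) | (dev << 3) | fn


def fmt(d):
    """Format a 16-bit device id as bb:dd.f, as in the AMD-Vi log lines."""
    return f"{d >> 8:02x}:{(d >> 3) & 0x1f:02x}.{d & 0x7}"


def build_alias_table(first, last, alias_to):
    """Model a DEV_ALIAS_RANGE entry: every id in [first, last] maps to alias_to."""
    return {d: alias_to for d in range(first, last + 1)}


def lookup_alias(table, present, d):
    """Return the alias of d, or None if the alias is not a present device.

    Hypothetical guard: the caller must handle None instead of
    dereferencing per-device data that was never allocated.
    """
    alias = table.get(d, d)  # devices outside any range alias to themselves
    return alias if alias in present else None


# The range from the log: 0a:01.0 .. 0a:1f.7 aliases to 0a:00.0.
table = build_alias_table(devid(0x0a, 1, 0), devid(0x0a, 0x1f, 7), devid(0x0a, 0, 0))
# Devices assumed present for this example (0a:00.0 is deliberately absent).
present = {devid(0x0a, 1, 0), devid(0x0a, 1, 1), devid(0x0a, 1, 2), devid(0x09, 0, 0)}
print(fmt(table[devid(0x0a, 1, 2)]), lookup_alias(table, present, devid(0x0a, 1, 2)))
```

With 0a:00.0 absent from the present set, the lookup for any of the three USB functions returns None, which is the case the driver has to survive.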

Joerg

2010-08-10 18:05:31

by Sander Eikelenboom

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Hello Joerg,

Could you also provide a more specific description of what is wrong with the BIOS that I could forward to MSI, in the hope it will reach the BIOS engineers someday? :-)

--
Sander

Tuesday, August 10, 2010, 8:01:22 PM, you wrote:

> On Tue, Aug 10, 2010 at 06:57:45PM +0200, Sander Eikelenboom wrote:
>> The requested info is attached.
>> So that would mean a BIOS problem? (Those are not on my wishlist :-p)

> Yeah, looks like a BIOS problem. But the driver should handle that
> without crashing the system, so there is a bug in the driver too.

> Problem is:

> AMD-Vi: DEV_ALIAS_RANGE devid: 0a:01.0 flags: 00 devid_to: 0a:00.0
> AMD-Vi: DEV_RANGE_END devid: 0a:1f.7

> This means that PCI devices from 0a:01.0 to 0a:1f.7 may use their own
> device-id or 0a:00.0. But a device with id 0a:00.0 is not present in
> the system. From the lspci output it looks like your USB3 controller
> should alias to 09:00.0. I will prepare a patch for you to fix the
> crash, but I can't guarantee that your USB3 controller will work
> afterwards. If you see IO-Page-Faults, please report them to me.

> Joerg




--
Best regards,
Sander mailto:[email protected]

2010-08-10 20:28:42

by Joerg Roedel

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

On Tue, Aug 10, 2010 at 08:05:14PM +0200, Sander Eikelenboom wrote:
> Could you also provide a more specific description of what is wrong
> with the BIOS that I could forward to MSI, in the hope it will reach
> the BIOS engineers someday? :-)

Let's first prove that my theory is right before contacting MSI directly.
Can you try the attached patch? It should fix the boot crash. Once the
system has booted successfully, please try some USB device (make sure it
uses the separate USB controller; I guess the separate device is
responsible for USB 3, so try to plug a device into one of your USB 3
ports). When you have finished, please tell me whether it worked and
send me the full dmesg output of the system.

Joerg


Attachments:
(No filename) (712.00 B)
iommu-crash-fix.diff (415.00 B)

2010-08-10 20:36:40

by Sander Eikelenboom

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Hello Joerg,

Errr, which separate USB controller? It actually has:
- 1 PCIe USB 2.0 controller
- 2 PCIe USB 3.0 controllers (one of which includes a SATA controller as well)

(apart from the onboard stuff)

--
Sander


Tuesday, August 10, 2010, 10:28:39 PM, you wrote:

> On Tue, Aug 10, 2010 at 08:05:14PM +0200, Sander Eikelenboom wrote:
>> Could you also provide a more specific description of what is wrong
>> with the BIOS that I could forward to MSI, in the hope it will reach
>> the BIOS engineers someday? :-)

> Let's first prove that my theory is right before contacting MSI directly.
> Can you try the attached patch? It should fix the boot crash. Once the
> system has booted successfully, please try some USB device (make sure it
> uses the separate USB controller; I guess the separate device is
> responsible for USB 3, so try to plug a device into one of your USB 3
> ports). When you have finished, please tell me whether it worked and
> send me the full dmesg output of the system.

> Joerg




--
Best regards,
Sander mailto:[email protected]

2010-08-10 20:47:25

by Joerg Roedel

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Hi Sander,

On Tue, Aug 10, 2010 at 10:36:35PM +0200, Sander Eikelenboom wrote:
> Errr, which separate USB controller? It actually has:
> - 1 PCIe USB 2.0 controller
> - 2 PCIe USB 3.0 controllers (one of which includes a SATA controller as well)

The devices should be attached to this controller:

0a:01.0 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
0a:01.1 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
0a:01.2 USB Controller [0c03]: NEC Corporation USB 2.0 [1033:00e0] (rev 04) (prog-if 20 [EHCI])

The PCI devices associated with that controller alias to 0a:00.0, which
does not exist in your system (hence the crash). And the fact that these
devices have an alias makes me believe that the BIOS detects them as
legacy PCI devices. PCIe devices typically do not have aliases. Can you
send the lspci -t output so I can see which upstream bridge these
devices are connected to?

Joerg

2010-08-10 20:57:30

by Sander Eikelenboom

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Hmmm, the fun part seems to be that the USB devices on that USB2 controller seemed to work fine on Xen.
And I have some problems with Xen not being willing to pass things through with the USB3 controllers (supposedly due to the (extra) bridges);
those are the controllers at 04:00.0 and 08:00.0.

-[0000:00]-+-00.0
           +-00.2
           +-02.0-[0000:0d]--+-00.0
           |                 \-00.1
           +-05.0-[0000:0c]----00.0
           +-06.0-[0000:0b]----00.0
           +-0a.0-[0000:09-0a]----00.0-[0000:0a]--+-01.0
           |                                      +-01.1
           |                                      \-01.2
           +-0b.0-[0000:05-08]----00.0-[0000:06-08]--+-01.0-[0000:08]----00.0
           |                                         \-02.0-[0000:07]----00.0
           +-0d.0-[0000:04]----00.0
           +-11.0
           +-12.0
           +-12.2
           +-13.0
           +-13.2
           +-14.0
           +-14.3
           +-14.4-[0000:03]----06.0
           +-14.5
           +-15.0-[0000:02]--
           +-16.0
           +-16.2
           +-18.0
           +-18.1
           +-18.2
           +-18.3
           \-18.4

I had hoped things would become easier/better with my new mobo including an IOMMU :-)
Doesn't seem that way yet. Previously I had 2 USB 2.0 controllers (1x PCI, 1x PCIe) and 1 USB 3.0 controller (PCIe) passed through (with xen-swiotlb and no hardware IOMMU), and that worked fine grabbing video 24/7 for several weeks.


But let's hope for the best :-)

--
Sander




Tuesday, August 10, 2010, 10:47:21 PM, you wrote:

> Hi Sander,

> On Tue, Aug 10, 2010 at 10:36:35PM +0200, Sander Eikelenboom wrote:
>> Errr, which separate USB controller? It actually has:
>> - 1 PCIe USB 2.0 controller
>> - 2 PCIe USB 3.0 controllers (one of which includes a SATA controller as well)

> The devices should be attached to this controller:

> 0a:01.0 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
> 0a:01.1 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
> 0a:01.2 USB Controller [0c03]: NEC Corporation USB 2.0 [1033:00e0] (rev 04) (prog-if 20 [EHCI])

> The PCI devices associated with that controller alias to 0a:00.0, which
> does not exist in your system (hence the crash). And the fact that these
> devices have an alias makes me believe that the BIOS detects them as
> legacy PCI devices. PCIe devices typically do not have aliases. Can you
> send the lspci -t output so I can see which upstream bridge these
> devices are connected to?

> Joerg




--
Best regards,
Sander mailto:[email protected]

2010-08-10 21:25:55

by Joerg Roedel

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

On Tue, Aug 10, 2010 at 10:57:26PM +0200, Sander Eikelenboom wrote:
> Hmmm, the fun part seems to be that the USB devices on that USB2
> controller seemed to work fine on Xen.

Hmm, that's weird. In this case these devices probably do not alias at
all. But let's wait for the results when you test my patch.


> +-0a.0-[0000:09-0a]----00.0-[0000:0a]--+-01.0
> |                                      +-01.1
> |                                      \-01.2

Yeah, device 09:00.0 is a PCIe-to-PCI bridge and the additional USB
controllers are behind that bridge as legacy PCI devices. That's why the
BIOS sets up the alias entry. It should set up 09:00.0 instead of
0a:00.0 to make things work correctly.
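
The bridge relationship can be read off the lspci -t tree mechanically: the bracketed range after a bridge is its secondary-to-subordinate bus range, so the bridge directly upstream of a bus is the one whose secondary bus equals that bus number. A minimal Python sketch, assuming a hand-transcribed table of just the two bridges on this path (the BRIDGES table and the upstream_bridge name are hypothetical, for illustration only):

```python
# Bridges on the path to bus 0a, transcribed from the lspci -t output:
# bridge BDF -> (secondary bus, subordinate bus)
BRIDGES = {
    "00:0a.0": (0x09, 0x0a),
    "09:00.0": (0x0a, 0x0a),
}


def upstream_bridge(bus):
    """Return the BDF of the bridge whose secondary bus is `bus`, or None."""
    for bdf, (secondary, _subordinate) in BRIDGES.items():
        if secondary == bus:
            return bdf
    return None  # root bus, or a bus not covered by the table


# The USB functions 0a:01.0-2 sit on bus 0a, directly behind 09:00.0.
print(upstream_bridge(0x0a))
```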

Joerg

2010-08-10 21:37:06

by Sander Eikelenboom

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

It boots now, dmesg attached.


Tuesday, August 10, 2010, 10:47:21 PM, you wrote:

> Hi Sander,

> On Tue, Aug 10, 2010 at 10:36:35PM +0200, Sander Eikelenboom wrote:
>> Errr, which separate USB controller? It actually has:
>> - 1 PCIe USB 2.0 controller
>> - 2 PCIe USB 3.0 controllers (one of which includes a SATA controller as well)

> The devices should be attached to this controller:

> 0a:01.0 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
> 0a:01.1 USB Controller [0c03]: NEC Corporation USB [1033:0035] (rev 43) (prog-if 10 [OHCI])
> 0a:01.2 USB Controller [0c03]: NEC Corporation USB 2.0 [1033:00e0] (rev 04) (prog-if 20 [EHCI])

> The PCI devices associated with that controller alias to 0a:00.0, which
> does not exist in your system (hence the crash). And the fact that these
> devices have an alias makes me believe that the BIOS detects them as
> legacy PCI devices. PCIe devices typically do not have aliases. Can you
> send the lspci -t output so I can see which upstream bridge these
> devices are connected to?

> Joerg




--
Best regards,
Sander mailto:[email protected]


Attachments:
dmesg-amd-iommu-patched (94.06 kB)

2010-08-10 21:49:19

by Sander Eikelenboom

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Hi Joerg,

OK, it boots fine now, but plugging a USB device into the 2.0 controller (0a:01.*) results in a flood of error messages about the USB controller not functioning.
Running the same kernel with amd_iommu=off results in the device at least registering properly as a USB device (although trying to use it then resulted in an entirely new oops, probably in the driver of the video grabber).

--
Sander

Tuesday, August 10, 2010, 11:25:51 PM, you wrote:

> On Tue, Aug 10, 2010 at 10:57:26PM +0200, Sander Eikelenboom wrote:
>> Hmmm, the fun part seems to be that the USB devices on that USB2
>> controller seemed to work fine on Xen.

> Hmm, that's weird. In this case these devices probably do not alias at
> all. But let's wait for the results when you test my patch.


>> +-0a.0-[0000:09-0a]----00.0-[0000:0a]--+-01.0
>> |                                      +-01.1
>> |                                      \-01.2

> Yeah, device 09:00.0 is a PCIe-to-PCI bridge and the additional USB
> controllers are behind that bridge as legacy PCI devices. That's why the
> BIOS sets up the alias entry. It should set up 09:00.0 instead of
> 0a:00.0 to make things work correctly.

> Joerg




--
Best regards,
Sander mailto:[email protected]

2010-08-10 22:02:56

by Joerg Roedel

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Ok,

On Tue, Aug 10, 2010 at 11:36:59PM +0200, Sander Eikelenboom wrote:
> It boots now, dmesg attached.

AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x0000 address=0x0000000000001080 flags=0x0070]

So it indeed uses 0a:00.0 as the device id. That's weird, but it shows
that the BIOS is actually OK. I need to fix that in the driver.
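
For anyone checking their own dmesg against an alias entry, the event line above can be decoded mechanically. A minimal Python sketch, assuming the log format is exactly as shown in this message (the regular expression and the parse_io_page_fault name are hypothetical helpers, not part of any kernel tooling):

```python
import re

# Matches the IO_PAGE_FAULT event format as it appears in the line above.
EVENT_RE = re.compile(
    r"IO_PAGE_FAULT device=([0-9a-f]{2}):([0-9a-f]{2})\.([0-7]) "
    r"domain=(0x[0-9a-f]+) address=(0x[0-9a-f]+) flags=(0x[0-9a-f]+)"
)


def parse_io_page_fault(line):
    """Return the fields of an IO_PAGE_FAULT log line as a dict, or None."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    bus, dev, fn = int(m[1], 16), int(m[2], 16), int(m[3])
    return {
        "devid": (bus << 8) | (dev << 3) | fn,  # 16-bit requester id
        "domain": int(m[4], 16),
        "address": int(m[5], 16),
        "flags": int(m[6], 16),
    }


line = ("AMD-Vi: Event logged [IO_PAGE_FAULT device=0a:00.0 domain=0x0000 "
        "address=0x0000000000001080 flags=0x0070]")
event = parse_io_page_fault(line)
# The faulting requester id equals the alias target from the BIOS table.
print(hex(event["devid"]), event["devid"] == (0x0a << 8))
```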

Thanks,

Joerg

2010-08-10 22:24:23

by Joerg Roedel

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

On Tue, Aug 10, 2010 at 11:36:59PM +0200, Sander Eikelenboom wrote:
> It boots now, dmesg attached.

Ok, here is a quick and dirty patch which should make your system boot
again. It introduces other issues which will show up when you try to
assign the devices to a virtual machine. But at least the devices should
work again on bare metal.

Joerg


Attachments:
(No filename) (348.00 B)
iommu-alias-fix.diff (1.89 kB)

2010-08-11 17:54:47

by Sander Eikelenboom

Subject: Re: [2.6.35] AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40 BUG: unable to handle kernel NULL pointer dereference at 0000000000000198

Hello Joerg,

I had to apply the patch by hand, and found 2 typos:

arch/x86/kernel/amd_iommu.c: In function ‘do_attach’:
arch/x86/kernel/amd_iommu.c:1456: error: implicit declaration of function ‘set_dte_enry’
arch/x86/kernel/amd_iommu.c: In function ‘do_detach’:
arch/x86/kernel/amd_iommu.c:1486: error: implicit declaration of function ‘clear_dte_enry’
make[2]: *** [arch/x86/kernel/amd_iommu.o] Error 1



Should be "entry" of course.

--
Sander

Wednesday, August 11, 2010, 12:24:19 AM, you wrote:

> On Tue, Aug 10, 2010 at 11:36:59PM +0200, Sander Eikelenboom wrote:
>> It boots now, dmesg attached.

> Ok, here is a quick and dirty patch which should make your system boot
> again. It introduces other issues which will show up when you try to
> assign the devices to a virtual machine. But at least the devices should
> work again on bare metal.

> Joerg




--
Best regards,
Sander mailto:[email protected]