2007-06-29 00:58:56

by Robert Hancock

Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64

Zan Lynx wrote:
> On Thu, 2007-06-28 at 03:43 -0700, Andrew Morton wrote:
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/
>
>> +intel-iommu-dmar-detection-and-parsing-logic.patch
>> +intel-iommu-pci-generic-helper-function.patch
>> +intel-iommu-pci-generic-helper-function-fix.patch
>> +intel-iommu-clflush_cache_range-now-takes-size-param.patch
>> +intel-iommu-iova-allocation-and-management-routines.patch
>> +intel-iommu-iova-allocation-and-management-routines-fix.patch
>> +intel-iommu-iova-allocation-and-management-routines-fix-2.patch
>> +intel-iommu-intel-iommu-driver.patch
>> +intel-iommu-intel-iommu-driver-fix.patch
>> +intel-iommu-intel-iommu-driver-fix-2.patch
>> +intel-iommu-avoid-memory-allocation-failures-in-dma-map-api-calls.patch
>> +intel-iommu-intel-iommu-cmdline-option-forcedac.patch
>> +intel-iommu-dmar-fault-handling-support.patch
>> +intel-iommu-iommu-gfx-workaround.patch
>> +intel-iommu-iommu-floppy-workaround.patch
>> +intel-iommu-iommu-floppy-workaround-fix.patch
>> +intel-iommu-iommu-floppy-workaround-fix-fix.patch
>>
>> Intel IOMMU support
>
> I believe the above patch set is causing the problem. On my first try
> with rc6-mm1 I said Yes to the CONFIG_DMAR options. (I'm nearly as good
> as random option selection :-)
>
> The system panicked during boot, I believe it was trying to detect an
> Intel IOMMU. Later when I have a camera, I will try to post a
> screenshot of the backtrace. (I can't seem to get netconsole to work on
> boot, only in a module).
>
> When I recompiled without DMAR set, things seem to be working great. I
> seem to be getting better disk read throughput than rc3-mm1, by the way.
>
> This laptop is an AMD Athlon64 on a NForce3 running a 64-bit Gentoo
> build.
>
> I'll provide more details on request, and when I get the chance. This
> is a heads-up on the BUG in case someone has an "ah ha!" moment.

I took a picture of it, looks like the backtrace is:

NULL pointer dereference at 024
EIP:dmar_table_init+0x11
intel_iommu_init+0x30
pci_iommu_init+0xe
kernel_init+0x16e

Presumably something is NULL in dmar_table_init that wasn't expected to
be. I would guess it likely crashes on any system without an Intel
IOMMU in it.
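
The fault address fits that guess. The patch later in this thread shows parse_dmar_table() reading dmar->width straight after the cast from dmar_tbl, and a DMAR table begins with the standard 36-byte ACPI table header, so a NULL table pointer would fault at offset 0x24. A minimal user-space sketch of that layout follows. The struct and function names are hypothetical stand-ins, not the kernel's definitions, and it is only an assumption that parse_dmar_table() was inlined into dmar_table_init, which would explain the frame name in the backtrace.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the generic ACPI table header (36 bytes). */
struct acpi_header_sketch {
	char     signature[4];
	uint32_t length;
	uint8_t  revision;
	uint8_t  checksum;
	char     oem_id[6];
	char     oem_table_id[8];
	uint32_t oem_revision;
	char     asl_compiler_id[4];
	uint32_t asl_compiler_revision;
};

/* Hypothetical mirror of the DMAR table: header, then host address width. */
struct dmar_table_sketch {
	struct acpi_header_sketch header;
	uint8_t width;        /* host address width, the field read first */
	uint8_t flags;
	uint8_t reserved[10];
};

/* Offset that a read of ->width touches when the table pointer is NULL. */
static size_t width_fault_offset(void)
{
	return offsetof(struct dmar_table_sketch, width);
}
```

Under these assumptions width_fault_offset() evaluates to 0x24, the same value as the "NULL pointer dereference at 024" in the backtrace.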

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [email protected]
Home Page: http://www.roberthancock.com/


2007-06-29 01:14:48

by Shaohua Li

Subject: RE: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64



>-----Original Message-----
>From: Robert Hancock [mailto:[email protected]]
>Sent: Friday, June 29, 2007 8:59 AM
>To: Zan Lynx
>Cc: Andrew Morton; [email protected]; Raj, Ashok; Li, Shaohua;
>Keshavamurthy, Anil S
>Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64
>
>Zan Lynx wrote:
>> On Thu, 2007-06-28 at 03:43 -0700, Andrew Morton wrote:
>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/
>>
>>> +intel-iommu-dmar-detection-and-parsing-logic.patch
>>> +intel-iommu-pci-generic-helper-function.patch
>>> +intel-iommu-pci-generic-helper-function-fix.patch
>>> +intel-iommu-clflush_cache_range-now-takes-size-param.patch
>>> +intel-iommu-iova-allocation-and-management-routines.patch
>>> +intel-iommu-iova-allocation-and-management-routines-fix.patch
>>> +intel-iommu-iova-allocation-and-management-routines-fix-2.patch
>>> +intel-iommu-intel-iommu-driver.patch
>>> +intel-iommu-intel-iommu-driver-fix.patch
>>> +intel-iommu-intel-iommu-driver-fix-2.patch
>>> +intel-iommu-avoid-memory-allocation-failures-in-dma-map-api-calls.patch
>>> +intel-iommu-intel-iommu-cmdline-option-forcedac.patch
>>> +intel-iommu-dmar-fault-handling-support.patch
>>> +intel-iommu-iommu-gfx-workaround.patch
>>> +intel-iommu-iommu-floppy-workaround.patch
>>> +intel-iommu-iommu-floppy-workaround-fix.patch
>>> +intel-iommu-iommu-floppy-workaround-fix-fix.patch
>>>
>>> Intel IOMMU support
>>
>> I believe the above patch set is causing the problem. On my first try
>> with rc6-mm1 I said Yes to the CONFIG_DMAR options. (I'm nearly as good
>> as random option selection :-)
>>
>> The system panicked during boot, I believe it was trying to detect an
>> Intel IOMMU. Later when I have a camera, I will try to post a
>> screenshot of the backtrace. (I can't seem to get netconsole to work on
>> boot, only in a module).
>>
>> When I recompiled without DMAR set, things seem to be working great. I
>> seem to be getting better disk read throughput than rc3-mm1, by the way.
>>
>> This laptop is an AMD Athlon64 on a NForce3 running a 64-bit Gentoo
>> build.
>>
>> I'll provide more details on request, and when I get the chance. This
>> is a heads-up on the BUG in case someone has an "ah ha!" moment.
>
>I took a picture of it, looks like the backtrace is:
>
>NULL pointer dereference at 024
>EIP:dmar_table_init+0x11
>intel_iommu_init+0x30
>pci_iommu_init+0xe
>kernel_init+0x16e
>
>Presumably something is NULL in dmar_table_init that wasn't expected to
>be.. I would guess it likely crashes on any system without an Intel
>IOMMU in it.
How about something like below?


 int __init dmar_table_init(void)
 {
+	if (!dmar_tbl)
+		return -ENODEV;
 	parse_dmar_table();
 	if (list_empty(&dmar_drhd_units)) {
 		printk(KERN_ERR PREFIX "No DMAR devices found\n");
 		return -ENODEV;
 	}
 	return 0;
 }

2007-06-29 15:33:46

by Keshavamurthy, Anil S

Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64

On Thu, Jun 28, 2007 at 06:14:27PM -0700, Li, Shaohua wrote:
>
>
> >>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/
> >>
> >>> +intel-iommu-dmar-detection-and-parsing-logic.patch
[..]
> >
> >I took a picture of it, looks like the backtrace is:
> >
> >NULL pointer dereference at 024
> >EIP:dmar_table_init+0x11
> >intel_iommu_init+0x30
> >pci_iommu_init+0xe
> >kernel_init+0x16e
> >
> >Presumably something is NULL in dmar_table_init that wasn't expected to
> >be.. I would guess it likely crashes on any system without an Intel
> >IOMMU in it.
Yup, that is correct.

> How about something like below?
>
>
> int __init dmar_table_init(void)
> {
> + if (!dmar_tbl)
> + return -ENODEV;
> parse_dmar_table();
Why not check for NULL in the function where it is dereferenced?
Also, when there are no DMAR devices we need the printk below
on the console.

> if (list_empty(&dmar_drhd_units)) {
> printk(KERN_ERR PREFIX "No DMAR devices found\n");
> return -ENODEV;
> }
> return 0;
> }

Here is the revised patch of the above.
Andrew, please add this fix to
+intel-iommu-dmar-detection-and-parsing-logic.patch
------------------------------------------------

Check for dmar_tbl pointer as this can be NULL on
systems with no Intel VT-d support.

Signed-off-by: Anil S Keshavamurthy <[email protected]>

---
drivers/pci/dmar.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc4-mm2/drivers/pci/dmar.c
===================================================================
--- linux-2.6.22-rc4-mm2.orig/drivers/pci/dmar.c 2007-06-29 07:43:43.000000000 -0700
+++ linux-2.6.22-rc4-mm2/drivers/pci/dmar.c 2007-06-29 07:46:25.000000000 -0700
@@ -260,6 +260,8 @@
 	int ret = 0;

 	dmar = (struct acpi_table_dmar *)dmar_tbl;
+	if (!dmar)
+		return -ENODEV;

 	if (!dmar->width) {
 		printk (KERN_WARNING PREFIX "Zero: Invalid DMAR haw\n");
@@ -301,7 +303,7 @@

 	parse_dmar_table();
 	if (list_empty(&dmar_drhd_units)) {
-		printk(KERN_ERR PREFIX "No DMAR devices found\n");
+		printk(KERN_INFO PREFIX "No DMAR devices found\n");
 		return -ENODEV;
 	}
 	return 0;
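
The guard in the first hunk can be tried out in user space. The sketch below is only an illustration of the pattern (check the pointer in the function that dereferences it), not the kernel code: struct dmar_sketch, dmar_tbl_sketch and parse_dmar_table_sketch() are hypothetical stand-ins, and the -EINVAL return for a zero width is an assumption, since the hunk above shows only the warning.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct acpi_table_dmar; only width matters here. */
struct dmar_sketch {
	unsigned char width;
};

/* Stand-in for the global dmar_tbl: NULL when the firmware has no DMAR table. */
static struct dmar_sketch *dmar_tbl_sketch;

static int parse_dmar_table_sketch(void)
{
	struct dmar_sketch *dmar = dmar_tbl_sketch;

	/* The added guard: bail out before the first dereference. */
	if (!dmar)
		return -ENODEV;

	/* Assumed error code; the thread shows only the warning printk here. */
	if (!dmar->width)
		return -EINVAL;

	return 0;
}
```

With dmar_tbl_sketch left NULL the function returns -ENODEV instead of faulting, which models the case reported on the AMD machines in this thread.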

2007-06-29 16:24:18

by Muli Ben-Yehuda

Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64

On Fri, Jun 29, 2007 at 08:28:58AM -0700, Keshavamurthy, Anil S wrote:

> Here is the revised patch of the above.
> Andrew, please add this fix to
> +intel-iommu-dmar-detection-and-parsing-logic.patch
> ------------------------------------------------
>
> Check for dmar_tbl pointer as this can be NULL on
> systems with no Intel VT-d support.
>
> Signed-off-by: Anil S Keshavamurthy <[email protected]>
>
> ---
> drivers/pci/dmar.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6.22-rc4-mm2/drivers/pci/dmar.c
> ===================================================================
> --- linux-2.6.22-rc4-mm2.orig/drivers/pci/dmar.c 2007-06-29 07:43:43.000000000 -0700
> +++ linux-2.6.22-rc4-mm2/drivers/pci/dmar.c 2007-06-29 07:46:25.000000000 -0700
> @@ -260,6 +260,8 @@
> int ret = 0;
>
> dmar = (struct acpi_table_dmar *)dmar_tbl;
> + if (!dmar)
> + return -ENODEV;
>
> if (!dmar->width) {
> printk (KERN_WARNING PREFIX "Zero: Invalid DMAR haw\n");
> @@ -301,7 +303,7 @@
>
> parse_dmar_table();
> if (list_empty(&dmar_drhd_units)) {
> - printk(KERN_ERR PREFIX "No DMAR devices found\n");
> + printk(KERN_INFO PREFIX "No DMAR devices found\n");
> return -ENODEV;
> }
> return 0;

The convention is to print a KERN_DEBUG message if hardware is not
found when probing it, otherwise the boot messages become cluttered
with lots of "$FOO not found".

Cheers,
Muli

2007-06-29 19:28:41

by Keshavamurthy, Anil S

Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64

On Fri, Jun 29, 2007 at 12:23:43PM -0400, Muli Ben-Yehuda wrote:
> On Fri, Jun 29, 2007 at 08:28:58AM -0700, Keshavamurthy, Anil S wrote:
>
> > +++ linux-2.6.22-rc4-mm2/drivers/pci/dmar.c 2007-06-29 07:46:25.000000000 -0700
> > @@ -260,6 +260,8 @@
> > int ret = 0;
> >
> > dmar = (struct acpi_table_dmar *)dmar_tbl;
> > + if (!dmar)
> > + return -ENODEV;
> >
> > if (!dmar->width) {
> > printk (KERN_WARNING PREFIX "Zero: Invalid DMAR haw\n");
> > @@ -301,7 +303,7 @@
> >
> > parse_dmar_table();
> > if (list_empty(&dmar_drhd_units)) {
> > - printk(KERN_ERR PREFIX "No DMAR devices found\n");
> > + printk(KERN_INFO PREFIX "No DMAR devices found\n");
> > return -ENODEV;
> > }
> > return 0;
>
> The convention is to print a KERN_DEBUG message if hardware is not
> found when probing it, otherwise the boot messages become cluttered
> with lots of "$FOO not found".

Since this IOMMU support is built into the kernel, it is a
good idea to report that the device is not present. The
message above is printed only once and is consistent with the other
IOMMU implementations. At least it is useful when people
report bugs: we can make out whether the IOMMU is being detected
or not.

Here is what I see on my box.
[..]
"PCI-GART: No AMD northbridge found."
[..]
Calgary: detecting Calgary via BIOS EBDA area
Calgary: Unable to locate Rio Grande table in EBDA - bailing!
[..]

As you can see, I don't have either GART or Calgary on my box.

-Thanks,
Anil

2007-06-29 21:18:31

by Muli Ben-Yehuda

Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64

On Fri, Jun 29, 2007 at 12:23:53PM -0700, Keshavamurthy, Anil S wrote:

> Since this IOMMU support is built into the kernel, it is a good idea to
> report that the device is not present.

Yes - as a debug message.

> The message above is printed only once and is consistent with the other
> IOMMU implementations. At least it is useful when people report bugs: we
> can make out whether the IOMMU is being detected or not.

If a message says it was detected, it was; otherwise, it wasn't.

> Here is what I see on my box.
> [..]
> "PCI-GART: No AMD northbridge found."

You're right, that should be a debug message as well.

> [..]
> Calgary: detecting Calgary via BIOS EBDA area
> Calgary: Unable to locate Rio Grande table in EBDA - bailing!

These are KERN_DEBUG messages.

Cheers,
Muli
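
For readers weighing KERN_INFO against KERN_DEBUG here: the practical difference is whether the line reaches the console or only the kernel log buffer. A minimal user-space sketch of that filter follows. The numeric levels match the kernel's KERN_* prefixes, but the enum names, the macro and shown_on_console() are hypothetical, and the default console loglevel of 7 is an assumption that boot options or distribution settings can change.

```c
#include <stdbool.h>

/* Numeric severities behind the KERN_* prefixes; lower is more severe. */
enum {
	LVL_ERR     = 3,	/* KERN_ERR */
	LVL_WARNING = 4,	/* KERN_WARNING */
	LVL_INFO    = 6,	/* KERN_INFO */
	LVL_DEBUG   = 7		/* KERN_DEBUG */
};

/* Assumed default; can be overridden at boot or at run time. */
#define CONSOLE_LOGLEVEL_SKETCH 7

/* A message reaches the console only if it is more severe than the threshold. */
static bool shown_on_console(int msg_level, int console_loglevel)
{
	return msg_level < console_loglevel;
}
```

Under that assumed default, a KERN_INFO "No DMAR devices found" would appear on the console during boot, while the same line at KERN_DEBUG would stay in the log buffer for dmesg.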

2007-06-29 21:51:46

by Rafael J. Wysocki

Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64

On Friday, 29 June 2007 17:28, Keshavamurthy, Anil S wrote:
> On Thu, Jun 28, 2007 at 06:14:27PM -0700, Li, Shaohua wrote:
> >
> >
> > >>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/
> > >>
> > >>> +intel-iommu-dmar-detection-and-parsing-logic.patch
> [..]
> > >
> > >I took a picture of it, looks like the backtrace is:
> > >
> > >NULL pointer dereference at 024
> > >EIP:dmar_table_init+0x11
> > >intel_iommu_init+0x30
> > >pci_iommu_init+0xe
> > >kernel_init+0x16e
> > >
> > >Presumably something is NULL in dmar_table_init that wasn't expected to
> > >be.. I would guess it likely crashes on any system without an Intel
> > >IOMMU in it.
> Yup, that is correct.
>
> > How about something like below?
> >
> >
> > int __init dmar_table_init(void)
> > {
> > + if (!dmar_tbl)
> > + return -ENODEV;
> > parse_dmar_table();
> why not check for NULL in the function where it touched?
> Also when there are no DMAR devices we need the below
> printk on the console.
>
> > if (list_empty(&dmar_drhd_units)) {
> > printk(KERN_ERR PREFIX "No DMAR devices found\n");
> > return -ENODEV;
> > }
> > return 0;
> > }
>
> Here is the revised patch of the above.
> Andrew, please add this fix to
> +intel-iommu-dmar-detection-and-parsing-logic.patch
> ------------------------------------------------
>
> Check for dmar_tbl pointer as this can be NULL on
> systems with no Intel VT-d support.
>
> Signed-off-by: Anil S Keshavamurthy <[email protected]>

For the record, this patch fixes the boot crash on my AMD64-based test box.

Thanks,
Rafael


--
"Premature optimization is the root of all evil." - Donald Knuth

2007-06-30 18:55:33

by Andi Kleen

Subject: Re: 2.6.22-rc6-mm1 Intel DMAR crash on AMD x86_64

Muli Ben-Yehuda <[email protected]> writes:
>
> The convention is to print a KERN_DEBUG message if hardware is not
> found when probing it, otherwise the boot messages become cluttered
> with lots of "$FOO not found".

No, the convention is to print no message at all when nothing is found.
Some drivers fail this, but they're bad examples.

-Andi