2009-01-08 20:06:21

by Dirk Hohndel

Subject: git-latest: kernel oops in IOMMU setup


Running the latest git from Linus on a ThinkPad X200s with VT-d enabled,
I get a kernel oops (if I disable VT-d, this of course goes away).

The oops happens very early during boot in device_to_iommu (called
from domain_context_mapping_one).

Looking at the code dump and the disassembled function here's where
the error happens:

static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
{
	struct dmar_drhd_unit *drhd = NULL;
	int i;

	for_each_drhd_unit(drhd) {
		if (drhd->ignored)
			continue;

		for (i = 0; i < drhd->devices_cnt; i++)
			if (drhd->devices[i]->bus->number == bus &&
--> drhd->devices[0] is NULL
			    drhd->devices[i]->devfn == devfn)
				return drhd->iommu;


Given how early this happens it's a little hard to provide logs, etc. I
literally used boot_delay=100 and wrote things down by hand (forgot my
digital camera), and then added printk's to verify.

please let me know what other data I should collect.

The system ran fine with the 2.6.28 release kernel.

/D

--
Dirk Hohndel
Intel Open Source Technology Center


2009-01-08 21:41:38

by Grant Grundler

Subject: Re: git-latest: kernel oops in IOMMU setup

On Thu, Jan 08, 2009 at 12:05:38PM -0800, Dirk Hohndel wrote:
>
> latest git from Linus. On a Thinkpad x200s with VT-d enabled (if I
> disable VT-d, this of course goes away).
>
> The oops happens very early during boot in device_to_iommu (called
> from domain_context_mapping_one).
>
> Looking at the code dump and the disassembled function here's where
> the error happens:
>
> static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
> {
> struct dmar_drhd_unit *drhd = NULL;
> int i;
>
> for_each_drhd_unit(drhd) {
> if (drhd->ignored)
> continue;
>
> for (i = 0; i < drhd->devices_cnt; i++)
> if (drhd->devices[i]->bus->number == bus &&
> --> drhd->devices[0] is NULL
> drhd->devices[i]->devfn == devfn)
> return drhd->iommu;
>
>
> Given how early this happens it's a little hard to provide logs, etc. I
> literally used delay_boot=100 and wrote things down by hand (forgot my
> digital camera) and then added printk's to verify).
>
> please let me know what other data I should collect.

If you can, a back trace. Basically just need to know which caller
is tripping over this. But there can't be that many callers and they
are all in this file:
0 intel-iommu.c device_to_iommu            431 static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
1 intel-iommu.c domain_context_mapping_on 1471 iommu = device_to_iommu(bus, devfn);
2 intel-iommu.c domain_context_mapped     1593 iommu = device_to_iommu(pdev->bus->number, pdev->devfn);
3 intel-iommu.c domain_remove_dev_info    1684 iommu = device_to_iommu(info->bus, info->devfn);
4 intel-iommu.c vm_domain_remove_one_dev_ 2773 iommu = device_to_iommu(pdev->bus->number, pdev->devfn);
5 intel-iommu.c vm_domain_remove_one_dev_ 2803 if (device_to_iommu(info->bus, info->devfn) == iommu)
6 intel-iommu.c vm_domain_remove_all_dev_ 2836 iommu = device_to_iommu(info->bus, info->devfn);
7 intel-iommu.c intel_iommu_attach_device 3023 iommu = device_to_iommu(pdev->bus->number, pdev->devfn);

so it should be possible to figure out which one is called
before the dev is setup. It's unlikely to be anything with
"remove" in the name. :)

My guess is it's intel_iommu_attach_device being called "too early".

hth,
grant


>
> The system ran fine with the 2.6.28 release kernel.
>
> /D
>
> --
> Dirk Hohndel
> Intel Open Source Technology Center
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2009-01-08 21:57:29

by Dirk Hohndel

Subject: Re: git-latest: kernel oops in IOMMU setup

On Thu, 8 Jan 2009 14:41:16 -0700
Grant Grundler <[email protected]> wrote:

> On Thu, Jan 08, 2009 at 12:05:38PM -0800, Dirk Hohndel wrote:
> >
> > latest git from Linus. On a Thinkpad x200s with VT-d enabled (if I
> > disable VT-d, this of course goes away).
> >
> > The oops happens very early during boot in device_to_iommu (called
> > from domain_context_mapping_one).

Look here, that's where it's called from.
Do you want me to note down the complete backtrace?


> If you can, a back trace. Basically just need to know which caller
> is tripping over this. But there can't be that many callers and they
> are all in this file:
> ...
> so it should be possible to figure out which one is called
> before the dev is setup. It's unlikely to be anything with
> "remove" in the name. :)

correct - it's domain_context_mapping_one

--
Dirk Hohndel
Intel Open Source Technology Center

2009-01-09 00:59:57

by Weidong Han

Subject: RE: git-latest: kernel oops in IOMMU setup

Grant Grundler wrote:
> On Thu, Jan 08, 2009 at 12:05:38PM -0800, Dirk Hohndel wrote:
>>
>> latest git from Linus. On a Thinkpad x200s with VT-d enabled (if I
>> disable VT-d, this of course goes away).
>>
>> The oops happens very early during boot in device_to_iommu (called
>> from domain_context_mapping_one).
>>
>> Looking at the code dump and the disassembled function here's where
>> the error happens:
>>
>> static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn) {
>> struct dmar_drhd_unit *drhd = NULL;
>> int i;
>>
>> for_each_drhd_unit(drhd) {
>> if (drhd->ignored)
>> continue;
>>
>> for (i = 0; i < drhd->devices_cnt; i++)
>> if (drhd->devices[i]->bus->number == bus &&
>> --> drhd->devices[0] is NULL
>> drhd->devices[i]->devfn == devfn)
>> return drhd->iommu;
>>
>>
>> Given how early this happens it's a little hard to provide logs,
>> etc. I literally used delay_boot=100 and wrote things down by hand
>> (forgot my digital camera) and then added printk's to verify).
>>
>> please let me know what other data I should collect.
>
> If you can, a back trace. Basically just need to know which caller
> is tripping over this. But there can't be that many callers and they
> are all in this file:
> 0 intel-iommu.c device_to_iommu            431 static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
> 1 intel-iommu.c domain_context_mapping_on 1471 iommu = device_to_iommu(bus, devfn);
> 2 intel-iommu.c domain_context_mapped     1593 iommu = device_to_iommu(pdev->bus->number, pdev->devfn);
> 3 intel-iommu.c domain_remove_dev_info    1684 iommu = device_to_iommu(info->bus, info->devfn);
> 4 intel-iommu.c vm_domain_remove_one_dev_ 2773 iommu = device_to_iommu(pdev->bus->number, pdev->devfn);
> 5 intel-iommu.c vm_domain_remove_one_dev_ 2803 if (device_to_iommu(info->bus, info->devfn) == iommu)
> 6 intel-iommu.c vm_domain_remove_all_dev_ 2836 iommu = device_to_iommu(info->bus, info->devfn);
> 7 intel-iommu.c intel_iommu_attach_device 3023 iommu = device_to_iommu(pdev->bus->number, pdev->devfn);
>
> so it should be possible to figure out which one is called
> before the dev is setup. It's unlikely to be anything with
> "remove" in the name. :)
>
> My guess is it's intel_iommu_attach_device being called "too early".

Yes, please get the call trace. By the time device_to_iommu() is called,
the DMAR table should already have been parsed from the ACPI tables and
registered, so device_to_iommu() should not fail unless it is called
before DMAR has been parsed and registered.

Regards,
Weidong

>
> hth,
> grant
>
>
>>
>> The system ran fine with the 2.6.28 release kernel.
>>
>> /D
>>
>> --
>> Dirk Hohndel
>> Intel Open Source Technology Center
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/iommu

2009-01-09 02:05:42

by Dirk Hohndel

Subject: Re: git-latest: kernel oops in IOMMU setup

On Fri, 9 Jan 2009 08:58:46 +0800 "Han, Weidong"
<[email protected]> wrote:
> >>
> >> The oops happens very early during boot in device_to_iommu (called
> >> from domain_context_mapping_one).
> >>
> >> Looking at the code dump and the disassembled function here's where
> >> the error happens:
> >>
> >> static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn) {
> >> struct dmar_drhd_unit *drhd = NULL;
> >> int i;
> >>
> >> for_each_drhd_unit(drhd) {
> >> if (drhd->ignored)
> >> continue;
> >>
> >> for (i = 0; i < drhd->devices_cnt; i++)
> >> if (drhd->devices[i]->bus->number == bus &&
> >> --> drhd->devices[0] is NULL
> >> drhd->devices[i]->devfn == devfn)
> >> return drhd->iommu;
> >>
> >>
> >> Given how early this happens it's a little hard to provide logs,
> >> etc. I literally used delay_boot=100 and wrote things down by hand
> >> (forgot my digital camera) and then added printk's to verify).
> >>
> >> please let me know what other data I should collect.
> >
> yes, pls get the call trace. When device_to_iommu() is called, DMAR
> should be already parsed from acpi table and registered, so
> device_to_iommu() should not fail unless it's called earlier than
> DMAR is parsed and registered.

I updated to Linus' latest git (as your description made me wonder if
the async stuff might play a role here). I still get an oops - but at
a different spot and the system no longer hangs - it partly recovers
(but things aren't too well - for example my USB keyboard / mouse don't
work anymore).

Here's the oops:

Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.359578] ------------[ cut here ]------------
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.410579] WARNING: at arch/x86/mm/ioremap.c:240 __ioremap_caller+0x150/0x2bd()
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.461578] Hardware name: 7465CTO
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.512578] Modules linked in:
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.614579] Pid: 1, comm: swapper Not tainted 2.6.28 #12
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.665578] Call Trace:
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.767581] [<ffffffff81038b49>] warn_slowpath+0xb1/0xed
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.869580] [<ffffffff81028319>] ? change_page_attr_set_clr+0x13e/0x2e6
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 12.971580] [<ffffffff810275b2>] __ioremap_caller+0x150/0x2bd
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.073581] [<ffffffff81158363>] ? alloc_iommu+0x140/0x181
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.175580] [<ffffffff810277f2>] ioremap_nocache+0x12/0x14
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.277580] [<ffffffff81158363>] alloc_iommu+0x140/0x181
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.379581] [<ffffffff8166a5d6>] dmar_table_init+0x115/0x265
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.481580] [<ffffffff8165687b>] ? pci_iommu_init+0x0/0x17
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.583580] [<ffffffff8166abb1>] intel_iommu_init+0x16/0x8f3
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.685581] [<ffffffff813ce372>] ? mutex_lock+0x11/0x23
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.787581] [<ffffffff813bb9d1>] ? sysctl_net_init+0x1b/0x1f
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.889580] [<ffffffff8165687b>] ? pci_iommu_init+0x0/0x17
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 13.991580] [<ffffffff81656884>] pci_iommu_init+0x9/0x17
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.093581] [<ffffffff81009056>] _stext+0x56/0x12b
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.195581] [<ffffffff81071220>] ? register_irq_proc+0xa3/0xbf
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.297582] [<ffffffff810e0000>] ? proc_coredump_filter_write+0xe0/0xfe
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.399581] [<ffffffff8164e673>] kernel_init+0x139/0x191
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.501581] [<ffffffff8100d27a>] child_rip+0xa/0x20
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.603581] [<ffffffff8164e53a>] ? kernel_init+0x0/0x191
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.705581] [<ffffffff8100d270>] ? child_rip+0x0/0x20
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.756580] ---[ end trace 4eaa2a86a8e2da22 ]---
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.807580] IOMMU: can't map the region
Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 14.858580] DMAR:parse DMAR table failure.

later in the log file I find lots of these:

Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 40.403251] nommu_map_single: overflow 13a08b248+8 of device mask ffffffff

and finally

Jan 8 17:51:00 dhohndel-mobl4 kernel: [ 66.777166] hub 4-0:1.0: unable to enumerate USB device on port 2
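
The nommu_map_single lines above read as "bus address + length of device
mask": an 8-byte buffer at 0x13a08b248 lies above a 32-bit DMA mask of
0xffffffff, presumably because DMAR parsing failed earlier in this boot
and nothing remaps the buffer. A minimal sketch of that comparison follows.
The helper name fits_dma_mask is hypothetical, and the kernel's own check
may differ in detail.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's implementation: returns 1 if the
 * buffer [addr, addr + size) lies entirely at or below the device's DMA
 * mask, 0 otherwise.
 */
static int fits_dma_mask(uint64_t addr, uint64_t size, uint64_t mask)
{
	if (size == 0)
		return addr <= mask;
	/* Compare against mask - addr so that addr + size cannot wrap. */
	return addr <= mask && size - 1 <= mask - addr;
}
```

With the values from the log, fits_dma_mask(0x13a08b248, 8, 0xffffffff)
returns 0, which matches the "overflow" wording of the message.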

/D

--
Dirk Hohndel
Intel Open Source Technology Center

2009-01-09 04:52:40

by Dirk Hohndel

Subject: Re: git-latest: kernel oops in IOMMU setup

On Thu, 8 Jan 2009 18:05:15 -0800
Dirk Hohndel <[email protected]> wrote:

> On Fri, 9 Jan 2009 08:58:46 +0800 "Han, Weidong"
>
> I updated to Linus' latest git (as your description made me wonder if
> the async stuff might play a role here). I still get an oops - but at
> a different spot and the system no longer hangs - it partly recovers
> (but things aren't too well - for example my USB keyboard / mouse
> don't work anymore).

Spoke too soon. Rebooted and had the same hard lockup again. This time
I had my camera within reach, so here's the trace:

device_to_iommu+0x33/0x73
domain_context_mapping_one+0x37/0x335
domain_context_mapping+0x25/0xa7
iommu_prepare_identity+0xd7/0xf3
intel_iommu_init+0x4e4/0x8f3
? mutex_lock
? sysctl_net_init
? pci_iommu_init
pci_iommu_init
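
Each resolved frame above has the form symbol+0xoffset/0xsize: the offset
into the function where execution was, and the function's total size, both
in hex. A minimal sketch of splitting such a frame follows. The struct and
the helper parse_frame are hypothetical names for illustration, not a
kernel or tool API.

```c
#include <assert.h>
#include <stdio.h>

/* One "symbol+0xoffset/0xsize" frame as printed in an oops backtrace. */
struct frame {
	char symbol[64];
	unsigned long offset;	/* bytes from the start of the function */
	unsigned long size;	/* total size of the function in bytes */
};

/*
 * Hypothetical helper: returns 1 if text has the symbol+0xoff/0xsize form,
 * 0 otherwise (for example the unresolved "? mutex_lock" entries).
 */
static int parse_frame(const char *text, struct frame *out)
{
	assert(text && out);
	return sscanf(text, "%63[^+]+0x%lx/0x%lx",
		      out->symbol, &out->offset, &out->size) == 3;
}
```

For device_to_iommu+0x33/0x73 this gives offset 0x33 in a function of
0x73 bytes, i.e. the fault is a little under halfway into
device_to_iommu().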

I also have stack, code and register values. Let me know if you need
them. Or I can just post the picture :-)

Again, very latest git tree, VT-d enabled.

/D

--
Dirk Hohndel
Intel Open Source Technology Center

2009-01-09 06:55:00

by Weidong Han

Subject: RE: git-latest: kernel oops in IOMMU setup

Dirk Hohndel wrote:
> On Thu, 8 Jan 2009 18:05:15 -0800
> Dirk Hohndel <[email protected]> wrote:
>
>> On Fri, 9 Jan 2009 08:58:46 +0800 "Han, Weidong"
>>
>> I updated to Linus' latest git (as your description made me wonder if
>> the async stuff might play a role here). I still get an oops - but at
>> a different spot and the system no longer hangs - it partly recovers
>> (but things aren't too well - for example my USB keyboard / mouse
>> don't work anymore).
>
> Spoke too soon. Rebooted and had the same hard lockup again. This time
> I had my camera within reach, so here's the trace:
>
> device_to_iommu+0x33/0x73
> domain_context_mapping_one+0x37/0x335
> domain_context_mapping+0x25/0xa7
> iommu_prepare_identity+0xd7/0xf3
> intel_iommu_init+0x4e4/0x8f3
> ? mutex_lock
> ? sysctl_net_init
> ? pci_iommu_init
> pci_iommu_init
>
> I also have stack, code and register values. Let me know if you need
> them. Or I can just post the picture :-)
>
> Again, very latest git tree, VT-d enabled.
>
> /D

I tried the latest git tree and it works for me. The call trace above looks right.

Regards,
Weidong

2009-01-09 15:09:00

by Dirk Hohndel

Subject: Re: git-latest: kernel oops in IOMMU setup

On Fri, 9 Jan 2009 14:53:14 +0800
"Han, Weidong" <[email protected]> wrote:

> Dirk Hohndel wrote:
> > On Thu, 8 Jan 2009 18:05:15 -0800
> > Dirk Hohndel <[email protected]> wrote:
> >
> >> On Fri, 9 Jan 2009 08:58:46 +0800 "Han, Weidong"
> >>
> >> I updated to Linus' latest git (as your description made me wonder
> >> if the async stuff might play a role here). I still get an oops -
> >> but at a different spot and the system no longer hangs - it partly
> >> recovers (but things aren't too well - for example my USB
> >> keyboard / mouse don't work anymore).
> >
> > Spoke too soon. Rebooted and had the same hard lockup again. This
> > time I had my camera within reach, so here's the trace:
> >
> > device_to_iommu+0x33/0x73
> > domain_context_mapping_one+0x37/0x335
> > domain_context_mapping+0x25/0xa7
> > iommu_prepare_identity+0xd7/0xf3
> > intel_iommu_init+0x4e4/0x8f3
> > ? mutex_lock
> > ? sysctl_net_init
> > ? pci_iommu_init
> > pci_iommu_init
> >
> > I also have stack, code and register values. Let me know if you need
> > them. Or I can just post the picture :-)
> >
> > Again, very latest git tree, VT-d enabled.
> >
> > /D
>
> I tried latest git tree, it works for me. Above call trace looks
> right.

Spent some more time reading the code. Can't quite claim to understand
all of it, yet, but I notice that most everywhere else drhd->devices[i]
is checked to be != NULL before it is accessed. Why is it safe not to
do that in device_to_iommu()?

Would the patch below be a valid fix? It stops my system from hanging at
boot. But I wonder if there is an assertion that if drhd->ignored is 0
then drhd->devices[0..drhd->devices_cnt-1] is known to be != NULL and
therefore this test is just hiding a bug somewhere else...

/D

Signed-off-by: Dirk Hohndel <[email protected]>
---
drivers/pci/intel-iommu.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 235fb7a..3dfecb2 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -438,7 +438,8 @@ static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
 			continue;

 		for (i = 0; i < drhd->devices_cnt; i++)
-			if (drhd->devices[i]->bus->number == bus &&
+			if (drhd->devices[i] &&
+			    drhd->devices[i]->bus->number == bus &&
 			    drhd->devices[i]->devfn == devfn)
 				return drhd->iommu;

--
1.6.0.6
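
A minimal user-space model of the guarded loop in the patch above. The
structs here (struct dev, struct unit) and the function lookup_iommu are
simplified stand-ins invented for illustration, not the kernel's
definitions. The sketch only shows that a NULL slot is skipped instead of
dereferenced.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; not the real definitions. */
struct dev {
	unsigned char bus, devfn;
};

struct unit {
	int ignored;
	int devices_cnt;
	struct dev **devices;	/* a slot may be NULL */
	int iommu_id;		/* stands in for the struct intel_iommu pointer */
};

/* Returns the matching unit's iommu_id, or -1 if no unit claims the device. */
static int lookup_iommu(const struct unit *units, int nr_units,
			unsigned char bus, unsigned char devfn)
{
	int u, i;

	assert(units != NULL || nr_units == 0);
	for (u = 0; u < nr_units; u++) {
		if (units[u].ignored)
			continue;
		for (i = 0; i < units[u].devices_cnt; i++)
			if (units[u].devices[i] &&	/* the added NULL check */
			    units[u].devices[i]->bus == bus &&
			    units[u].devices[i]->devfn == devfn)
				return units[u].iommu_id;
	}
	return -1;
}
```

Without the first condition, a unit whose slot 0 is NULL would fault on
the very first comparison, which is the behaviour reported in this thread.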


--
Dirk Hohndel
Intel Open Source Technology Center

2009-01-09 16:16:34

by Zhao, Yu

Subject: Re: git-latest: kernel oops in IOMMU setup

Dirk Hohndel wrote:
> On Fri, 9 Jan 2009 14:53:14 +0800
> "Han, Weidong" <[email protected]> wrote:
>
>> Dirk Hohndel wrote:
>>> On Thu, 8 Jan 2009 18:05:15 -0800
>>> Dirk Hohndel <[email protected]> wrote:
>>>
>>>> On Fri, 9 Jan 2009 08:58:46 +0800 "Han, Weidong"
>>>>
>>>> I updated to Linus' latest git (as your description made me wonder
>>>> if the async stuff might play a role here). I still get an oops -
>>>> but at a different spot and the system no longer hangs - it partly
>>>> recovers (but things aren't too well - for example my USB
>>>> keyboard / mouse don't work anymore).
>>> Spoke too soon. Rebooted and had the same hard lockup again. This
>>> time I had my camera within reach, so here's the trace:
>>>
>>> device_to_iommu+0x33/0x73
>>> domain_context_mapping_one+0x37/0x335
>>> domain_context_mapping+0x25/0xa7
>>> iommu_prepare_identity+0xd7/0xf3
>>> intel_iommu_init+0x4e4/0x8f3
>>> ? mutex_lock
>>> ? sysctl_net_init
>>> ? pci_iommu_init
>>> pci_iommu_init
>>>
>>> I also have stack, code and register values. Let me know if you need
>>> them. Or I can just post the picture :-)
>>>
>>> Again, very latest git tree, VT-d enabled.
>>>
>>> /D
>> I tried latest git tree, it works for me. Above call trace looks
>> right.
>
> Spent some more time reading the code. Can't quite claim to understand
> all of it, yet, but I notice that most everywhere else drhd->devices[i]
> is checked to be != NULL before it is accessed. Why is it safe not to
> do that in device_to_iommu()?
>
> Would the patch below be a valid fix? It stops my system from hanging at
> boot. But I wonder if there is an assertion that if drhd->ignored is 0
> then drhd->devices[0..drhd->device_cnt] is known to be != NULL and
> therefore this test is just hiding a bug somewhere else...
>
> /D
>
> Signed-off-by: Dirk Hohndel <[email protected]>
> ---
> drivers/pci/intel-iommu.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> index 235fb7a..3dfecb2 100644
> --- a/drivers/pci/intel-iommu.c
> +++ b/drivers/pci/intel-iommu.c
> @@ -438,7 +438,8 @@ static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
> continue;
>
> for (i = 0; i < drhd->devices_cnt; i++)
> - if (drhd->devices[i]->bus->number == bus &&
> + if (drhd->devices[i] &&
> + drhd->devices[i]->bus->number == bus &&
> drhd->devices[i]->devfn == devfn)
> return drhd->iommu;
>

Did you see the following in the kernel messages?
	printk(KERN_WARNING PREFIX
		"Device scope device [%04x:%02x:%02x.%02x] not found\n",
		segment, scope->bus, path->dev, path->fn);

If yes, then
Acked-by: Yu Zhao <[email protected]>
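
The warning matters because it points at where a NULL slot can come from:
if a device named in the DMAR device scope is not found, the parser only
warns, and the slot for that entry stays empty while the entry count still
includes it. A minimal user-space sketch of that pattern follows. All names
here (struct dev, find_dev, fill_scope) are hypothetical and the logic is a
simplification, not the kernel's parsing code.

```c
#include <stddef.h>
#include <stdio.h>

struct dev {
	unsigned char bus, devfn;
};

/* Hypothetical bus lookup: returns NULL when the device is not present. */
static struct dev *find_dev(struct dev *present, int nr,
			    unsigned char bus, unsigned char devfn)
{
	int i;

	for (i = 0; i < nr; i++)
		if (present[i].bus == bus && present[i].devfn == devfn)
			return &present[i];
	return NULL;
}

/*
 * Fill one slot per scope entry. A missing device only produces a warning,
 * so its slot stays NULL while the caller's count still includes it.
 * Returns the number of slots left NULL.
 */
static int fill_scope(struct dev **slots, const struct dev *scope, int cnt,
		      struct dev *present, int nr)
{
	int i, missing = 0;

	for (i = 0; i < cnt; i++) {
		slots[i] = find_dev(present, nr, scope[i].bus, scope[i].devfn);
		if (!slots[i]) {
			fprintf(stderr, "scope device [%02x:%02x.%x] not found\n",
				(unsigned)scope[i].bus,
				(unsigned)scope[i].devfn >> 3,
				(unsigned)scope[i].devfn & 7);
			missing++;
		}
	}
	return missing;
}
```

Any later loop that walks all cnt slots has to tolerate the NULL entries,
which is what the proposed check in device_to_iommu() does.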

2009-01-09 16:35:00

by Dirk Hohndel

Subject: Re: git-latest: kernel oops in IOMMU setup

On Sat, 10 Jan 2009 00:16:22 +0800
"Zhao, Yu" <[email protected]> wrote:

> Dirk Hohndel wrote:
> > On Fri, 9 Jan 2009 14:53:14 +0800
> > "Han, Weidong" <[email protected]> wrote:
> >
> >> Dirk Hohndel wrote:
> >>> On Thu, 8 Jan 2009 18:05:15 -0800
> >>> Dirk Hohndel <[email protected]> wrote:
> >>>
> >>>> On Fri, 9 Jan 2009 08:58:46 +0800 "Han, Weidong"
> >>>>
> >>>> I updated to Linus' latest git (as your description made me
> >>>> wonder if the async stuff might play a role here). I still get
> >>>> an oops - but at a different spot and the system no longer hangs
> >>>> - it partly recovers (but things aren't too well - for example
> >>>> my USB keyboard / mouse don't work anymore).
> >>> Spoke too soon. Rebooted and had the same hard lockup again. This
> >>> time I had my camera within reach, so here's the trace:
> >>>
> >>> device_to_iommu+0x33/0x73
> >>> domain_context_mapping_one+0x37/0x335
> >>> domain_context_mapping+0x25/0xa7
> >>> iommu_prepare_identity+0xd7/0xf3
> >>> intel_iommu_init+0x4e4/0x8f3
> >>> ? mutex_lock
> >>> ? sysctl_net_init
> >>> ? pci_iommu_init
> >>> pci_iommu_init
> >>>
> >>> I also have stack, code and register values. Let me know if you
> >>> need them. Or I can just post the picture :-)
> >>>
> >>> Again, very latest git tree, VT-d enabled.
> >>>
> >>> /D
> >> I tried latest git tree, it works for me. Above call trace looks
> >> right.
> >
> > Spent some more time reading the code. Can't quite claim to
> > understand all of it, yet, but I notice that most everywhere else
> > drhd->devices[i] is checked to be != NULL before it is accessed.
> > Why is it safe not to do that in device_to_iommu()?
> >
> > Would the patch below be a valid fix? It stops my system from
> > hanging at boot. But I wonder if there is an assertion that if
> > drhd->ignored is 0 then drhd->devices[0..drhd->device_cnt] is known
> > to be != NULL and therefore this test is just hiding a bug
> > somewhere else...
> >
> > /D
> >
> > Signed-off-by: Dirk Hohndel <[email protected]>
> > ---
> > drivers/pci/intel-iommu.c | 3 ++-
> > 1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
> > index 235fb7a..3dfecb2 100644
> > --- a/drivers/pci/intel-iommu.c
> > +++ b/drivers/pci/intel-iommu.c
> > @@ -438,7 +438,8 @@ static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
> > continue;
> >
> > for (i = 0; i < drhd->devices_cnt; i++)
> > - if (drhd->devices[i]->bus->number == bus &&
> > + if (drhd->devices[i] &&
> > + drhd->devices[i]->bus->number == bus &&
> > drhd->devices[i]->devfn == devfn)
> > return drhd->iommu;
> >
>
> Did you see following in the kernel message?
> printk(KERN_WARNING PREFIX
> "Device scope device [%04x:%02x:%02x.%02x] not
> found\n", segment, scope->bus, path->dev, path->fn);
>
> If yes, then
> Acked-by: Yu Zhao <[email protected]>

Yes,

DMAR: Device scope device [0000:00:03:02] not found
DMAR: Device scope device [0000:00:03:02] not found
DMAR: Device scope device [0000:00:03:03] not found
DMAR: Device scope device [0000:00:03:03] not found

/D

--
Dirk Hohndel
Intel Open Source Technology Center

2009-01-09 16:45:52

by Zhao, Yu

Subject: Re: git-latest: kernel oops in IOMMU setup

Dirk Hohndel wrote:
> On Sat, 10 Jan 2009 00:16:22 +0800
> "Zhao, Yu" <[email protected]> wrote:
>
>> Dirk Hohndel wrote:
>>> On Fri, 9 Jan 2009 14:53:14 +0800
>>> "Han, Weidong" <[email protected]> wrote:
>>>
>>>> Dirk Hohndel wrote:
>>>>> On Thu, 8 Jan 2009 18:05:15 -0800
>>>>> Dirk Hohndel <[email protected]> wrote:
>>>>>
>>>>>> On Fri, 9 Jan 2009 08:58:46 +0800 "Han, Weidong"
>>>>>>
>>>>>> I updated to Linus' latest git (as your description made me
>>>>>> wonder if the async stuff might play a role here). I still get
>>>>>> an oops - but at a different spot and the system no longer hangs
>>>>>> - it partly recovers (but things aren't too well - for example
>>>>>> my USB keyboard / mouse don't work anymore).
>>>>> Spoke too soon. Rebooted and had the same hard lockup again. This
>>>>> time I had my camera within reach, so here's the trace:
>>>>>
>>>>> device_to_iommu+0x33/0x73
>>>>> domain_context_mapping_one+0x37/0x335
>>>>> domain_context_mapping+0x25/0xa7
>>>>> iommu_prepare_identity+0xd7/0xf3
>>>>> intel_iommu_init+0x4e4/0x8f3
>>>>> ? mutex_lock
>>>>> ? sysctl_net_init
>>>>> ? pci_iommu_init
>>>>> pci_iommu_init
>>>>>
>>>>> I also have stack, code and register values. Let me know if you
>>>>> need them. Or I can just post the picture :-)
>>>>>
>>>>> Again, very latest git tree, VT-d enabled.
>>>>>
>>>>> /D
>>>> I tried latest git tree, it works for me. Above call trace looks
>>>> right.
>>> Spent some more time reading the code. Can't quite claim to
>>> understand all of it, yet, but I notice that most everywhere else
>>> drhd->devices[i] is checked to be != NULL before it is accessed.
>>> Why is it safe not to do that in device_to_iommu()?
>>>
>>> Would the patch below be a valid fix? It stops my system from
>>> hanging at boot. But I wonder if there is an assertion that if
>>> drhd->ignored is 0 then drhd->devices[0..drhd->device_cnt] is known
>>> to be != NULL and therefore this test is just hiding a bug
>>> somewhere else...
>>>
>>> /D
>>>
>>> Signed-off-by: Dirk Hohndel <[email protected]>
>>> ---
>>> drivers/pci/intel-iommu.c | 3 ++-
>>> 1 files changed, 2 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
>>> index 235fb7a..3dfecb2 100644
>>> --- a/drivers/pci/intel-iommu.c
>>> +++ b/drivers/pci/intel-iommu.c
>>> @@ -438,7 +438,8 @@ static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
>>> continue;
>>>
>>> for (i = 0; i < drhd->devices_cnt; i++)
>>> - if (drhd->devices[i]->bus->number == bus &&
>>> + if (drhd->devices[i] &&
>>> + drhd->devices[i]->bus->number == bus &&
>>> drhd->devices[i]->devfn == devfn)
>>> return drhd->iommu;
>>>
>> Did you see following in the kernel message?
>> printk(KERN_WARNING PREFIX
>> "Device scope device [%04x:%02x:%02x.%02x] not
>> found\n", segment, scope->bus, path->dev, path->fn);
>>
>> If yes, then
>> Acked-by: Yu Zhao <[email protected]>
>
> Yes,
>
> DMAR: Device scope device [0000:00:03:02] not found
> DMAR: Device scope device [0000:00:03:02] not found
> DMAR: Device scope device [0000:00:03:03] not found
> DMAR: Device scope device [0000:00:03:03] not found

The laptop has a nasty BIOS; try updating it if you want to get rid of
this noise... assuming you are lucky enough :-)

2009-01-09 16:55:32

by Dirk Hohndel

Subject: Re: git-latest: kernel oops in IOMMU setup

On Sat, 10 Jan 2009 00:45:31 +0800
"Zhao, Yu" <[email protected]> wrote:
> > Yes,
> >
> > DMAR: Device scope device [0000:00:03:02] not found
> > DMAR: Device scope device [0000:00:03:02] not found
> > DMAR: Device scope device [0000:00:03:03] not found
> > DMAR: Device scope device [0000:00:03:03] not found
>
> The laptop has a nasty bios, try to update it if you want to get rid
> of these noises... assuming you are luck enough :-)

It's a Lenovo Thinkpad X200s - and I am running the latest BIOS (at
least according to their support website) :-(

kernel: thinkpad_acpi: ThinkPad ACPI Extras v0.21
kernel: thinkpad_acpi: http://ibm-acpi.sf.net/
kernel: thinkpad_acpi: ThinkPad BIOS 6DET33WW (1.10 ), EC 7XHT21WW-1.03
kernel: thinkpad_acpi: Lenovo ThinkPad X200s, model 7465CTO

/D

--
Dirk Hohndel
Intel Open Source Technology Center

2009-01-09 16:58:23

by Dirk Hohndel

Subject: [PATCH] Prevent oops at boot with VT-d


Resending with appropriate Subject

Signed-off-by: Dirk Hohndel <[email protected]>
Acked-by: Yu Zhao <[email protected]>
---
drivers/pci/intel-iommu.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 235fb7a..3dfecb2 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -438,7 +438,8 @@ static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
 			continue;

 		for (i = 0; i < drhd->devices_cnt; i++)
-			if (drhd->devices[i]->bus->number == bus &&
+			if (drhd->devices[i] &&
+			    drhd->devices[i]->bus->number == bus &&
 			    drhd->devices[i]->devfn == devfn)
 				return drhd->iommu;

--
1.6.0.6

2009-01-11 15:25:58

by Dirk Hohndel

Subject: [Resend][PATCH] Prevent oops at boot with VT-d

This wasn't included in 2.6.29-rc1.

With some broken BIOSes, when VT-d is enabled the DMAR device scope can
name devices that are not found, leaving NULL entries in drhd->devices[].
device_to_iommu() then dereferences a NULL pointer very early in boot.


Signed-off-by: Dirk Hohndel <[email protected]>
Acked-by: Yu Zhao <[email protected]>
---
drivers/pci/intel-iommu.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 235fb7a..3dfecb2 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -438,7 +438,8 @@ static struct intel_iommu *device_to_iommu(u8 bus, u8 devfn)
 			continue;

 		for (i = 0; i < drhd->devices_cnt; i++)
-			if (drhd->devices[i]->bus->number == bus &&
+			if (drhd->devices[i] &&
+			    drhd->devices[i]->bus->number == bus &&
 			    drhd->devices[i]->devfn == devfn)
 				return drhd->iommu;

--
1.6.0.6
--
Dirk Hohndel
Intel Open Source Technology Center