2011-03-29 23:36:07

by Mike Travis

Subject: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping

When the IOMMU is being used, each request for a DMA mapping requires
the intel_iommu code to look for some space in the DMA mapping table.
For most drivers this occurs for each transfer.

When there are many outstanding DMA mappings [as seems to be the case
with the 10GigE driver], the table grows large and the search for
space becomes increasingly time consuming. Performance for the
10GigE driver drops to about 10% of its capacity on a UV system
when the CPU count is large.

The workaround is to specify the iommu=pt option which sets up a 1:1
identity map for those devices that support enough DMA address bits to
cover the physical system memory. This is the "pass through" option.

But this can only be accomplished for devices that pass their
DMA data through the IOMMU (VTd). Host Bridge devices connected
to System Sockets do not pass their data through the VTd, so the
following error occurs:

IOMMU: hardware identity mapping for device 1000:3e:00.0
Failed to setup IOMMU pass-through
BUG: unable to handle kernel NULL pointer dereference at 000000000000001c

This patch fixes that problem by removing Host Bridge devices from
being identity mapped, given that they do not generate DMA ops anyway.

Signed-off-by: Mike Travis <[email protected]>
Reviewed-by: Mike Habeck <[email protected]>
---
drivers/pci/intel-iommu.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

--- linux.orig/drivers/pci/intel-iommu.c
+++ linux/drivers/pci/intel-iommu.c
@@ -46,6 +46,7 @@
#define ROOT_SIZE VTD_PAGE_SIZE
#define CONTEXT_SIZE VTD_PAGE_SIZE

+#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
#define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
#define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
#define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
@@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
* take them out of the 1:1 domain later.
*/
if (!startup)
- return pdev->dma_mask > DMA_BIT_MASK(32);
+ return pdev->dma_mask == DMA_BIT_MASK(64);

return 1;
}
@@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
return -EFAULT;

for_each_pci_dev(pdev) {
+ /* Skip PCI Host Bridge devices */
+ if (IS_HOSTBRIDGE_DEVICE(pdev))
+ continue;
if (iommu_should_identity_map(pdev, 1)) {
printk(KERN_INFO "IOMMU: %s identity mapping for device %s\n",
hw ? "hardware" : "software", pci_name(pdev));

--


2011-03-30 17:52:58

by Chris Wright

Subject: Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping

* Mike Travis ([email protected]) wrote:
> When the IOMMU is being used, each request for a DMA mapping requires
> the intel_iommu code to look for some space in the DMA mapping table.
> For most drivers this occurs for each transfer.
>
> When there are many outstanding DMA mappings [as seems to be the case
> with the 10GigE driver], the table grows large and the search for
> space becomes increasingly time consuming. Performance for the
> 10GigE driver drops to about 10% of its capacity on a UV system
> when the CPU count is large.

That's pretty poor. I've seen large overheads, but when that big it was
also related to issues in the 10G driver. Do you have profile data
showing this as the hotspot?

> The workaround is to specify the iommu=pt option which sets up a 1:1
> identity map for those devices that support enough DMA address bits to
> cover the physical system memory. This is the "pass through" option.
>
> But this can only be accomplished for devices that pass their
> DMA data through the IOMMU (VTd). Host Bridge devices connected
> to System Sockets do not pass their data through the VTd, so the
> following error occurs:
>
> IOMMU: hardware identity mapping for device 1000:3e:00.0
> Failed to setup IOMMU pass-through
> BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
>
> This patch fixes that problem by removing Host Bridge devices from
> being identity mapped, given that they do not generate DMA ops anyway.
>
> Signed-off-by: Mike Travis <[email protected]>
> Reviewed-by: Mike Habeck <[email protected]>
> ---
> drivers/pci/intel-iommu.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> --- linux.orig/drivers/pci/intel-iommu.c
> +++ linux/drivers/pci/intel-iommu.c
> @@ -46,6 +46,7 @@
> #define ROOT_SIZE VTD_PAGE_SIZE
> #define CONTEXT_SIZE VTD_PAGE_SIZE
>
> +#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
> #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
> #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
> #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
> @@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
> * take them out of the 1:1 domain later.
> */
> if (!startup)
> - return pdev->dma_mask > DMA_BIT_MASK(32);
> + return pdev->dma_mask == DMA_BIT_MASK(64);

This looks unrelated, why the change?

> return 1;
> }
> @@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
> return -EFAULT;
>
> for_each_pci_dev(pdev) {
> + /* Skip PCI Host Bridge devices */
> + if (IS_HOSTBRIDGE_DEVICE(pdev))
> + continue;
> if (iommu_should_identity_map(pdev, 1)) {

Should this host bridge check go into iommu_should_identity_map?
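
Something like this placement, perhaps (a sketch only, reusing the
IS_HOSTBRIDGE_DEVICE macro this patch adds; the existing checks in the
function are elided as a comment):

	static int iommu_should_identity_map(struct pci_dev *pdev, int startup)
	{
		/* Host bridges do not generate DMA of their own, so never
		 * pull them into the 1:1 identity domain. */
		if (IS_HOSTBRIDGE_DEVICE(pdev))
			return 0;

		/* ... existing IS_GFX_DEVICE/IS_ISA_DEVICE/IS_AZALIA and
		 * dma_mask checks remain unchanged ... */

		return 1;
	}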

I understand skipping the extra host bridges, but what is the NULL ptr deref
coming from? Just to be sure this isn't a bandaid.

thanks,
-chris

2011-03-30 18:30:59

by Mike Travis

Subject: Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping



Chris Wright wrote:
> * Mike Travis ([email protected]) wrote:
>> When the IOMMU is being used, each request for a DMA mapping requires
>> the intel_iommu code to look for some space in the DMA mapping table.
>> For most drivers this occurs for each transfer.
>>
>> When there are many outstanding DMA mappings [as seems to be the case
>> with the 10GigE driver], the table grows large and the search for
>> space becomes increasingly time consuming. Performance for the
>> 10GigE driver drops to about 10% of its capacity on a UV system
>> when the CPU count is large.
>
> That's pretty poor. I've seen large overheads, but when that big it was
> also related to issues in the 10G driver. Do you have profile data
> showing this as the hotspot?

Here's one from our internal bug report:

Here is a profile from a run with iommu=on iommu=pt (no forcedac)

uv48-sys was receiving and uv-debug sending.
ksoftirqd/640 was running at approx. 100% cpu utilization.
I had pinned the nttcp process on uv48-sys to cpu 64.

# Samples: 1255641
#
# Overhead Command Shared Object Symbol
# ........ ............. ............. ......
#
50.27% ksoftirqd/640 [kernel] [k] _spin_lock
27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping
...
0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]
0.42% ksoftirqd/640 [kernel] [k] ioat2_tx_submit_unlock [ioatdma]
0.29% ksoftirqd/640 [kernel] [k] uv_read_rtc
0.25% ksoftirqd/640 [kernel] [k] __alloc_skb
0.20% ksoftirqd/640 [kernel] [k] try_to_wake_up
0.19% ksoftirqd/640 [kernel] [k] ____cache_alloc_node
0.19% ksoftirqd/640 [kernel] [k] kmem_cache_free
0.19% ksoftirqd/640 [kernel] [k] __netdev_alloc_skb
0.18% ksoftirqd/640 [kernel] [k] tcp_v4_rcv
0.15% ksoftirqd/640 [kernel] [k] resched_task
0.15% ksoftirqd/640 [kernel] [k] tcp_data_queue
0.13% ksoftirqd/640 [kernel] [k] xfrm4_policy_check
0.11% ksoftirqd/640 [kernel] [k] get_page_from_freelist
0.10% ksoftirqd/640 [kernel] [k] sched_clock_cpu
0.10% ksoftirqd/640 [kernel] [k] sock_def_readable
...

I tracked this time down to identity_mapping() in this loop:

list_for_each_entry(info, &si_domain->devices, link)
if (info->dev == pdev)
return 1;

I didn't get the exact count, but there was approx 11,000 PCI devices
on this system. And this function was called for every page request
in each DMA request.

Here's an excerpt from our internal bug report:

I also looked at the cpu utilization on uv. It's at 22% for the nttcp process,
and ksoftirqd is not at the top, so I think this means the fix is working.

Another run
uv-debug:~/eddiem/nttcp-1.52 # ./nttcp -T -l 1048576 -P 60 192.168.1.2
Running for 60 seconds...
Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s
l51671728128 60.00 13.52 6889.4548 30582.1259 49278 821.29 3645.7
151671728128 60.00 12.53 6889.4660 32983.4024 123666 2061.07 9867.4

Trying it from the other side shows nttcp on uv at 44% cpu.

uv41-sys:~/eddiem/nttcp-1.52 # ./nttcp -T -l 1048576 -P 60 192.168.1.1
Running for 60 seconds...
Bytes Real s CPU s Real-MBit/s CPU-MBit/s Calls Real-C/s CPU-C/s
l51292143616 60.00 26.40 6838.9326 15544.4581 48917 815.28 1853.1
151292456796 60.00 7.35 6839.0407 55809.8528 93530 1558.84 12720.9


Note that our networking experts also tuned the 10GigE parameters which
helped bring the speed back up to almost line speed. (The 10GigE was
by far the most affected driver, but even the 1GigE driver lost performance.)

There were also changes for the irq_rebalancer and for disabling sched domains
2 and 3 (which were being hit by the idle_rebalancer). I remember sched domain 3
had all 4096 cpus but I forgot what sd 2 had.

Also, running the network test on the same node as where the cards were
helped as well.

If you really need them, I can sign up for some system time and get better
before/after profile data specifically for these IOMMU changes?

Thanks,
Mike

>
>> The workaround is to specify the iommu=pt option which sets up a 1:1
>> identity map for those devices that support enough DMA address bits to
>> cover the physical system memory. This is the "pass through" option.
>>
>> But this can only be accomplished for devices that pass their
>> DMA data through the IOMMU (VTd). Host Bridge devices connected
>> to System Sockets do not pass their data through the VTd, so the
>> following error occurs:
>>
>> IOMMU: hardware identity mapping for device 1000:3e:00.0
>> Failed to setup IOMMU pass-through
>> BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
>>
>> This patch fixes that problem by removing Host Bridge devices from
>> being identity mapped, given that they do not generate DMA ops anyway.
>>
>> Signed-off-by: Mike Travis <[email protected]>
>> Reviewed-by: Mike Habeck <[email protected]>
>> ---
>> drivers/pci/intel-iommu.c | 6 +++++-
>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> --- linux.orig/drivers/pci/intel-iommu.c
>> +++ linux/drivers/pci/intel-iommu.c
>> @@ -46,6 +46,7 @@
>> #define ROOT_SIZE VTD_PAGE_SIZE
>> #define CONTEXT_SIZE VTD_PAGE_SIZE
>>
>> +#define IS_HOSTBRIDGE_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_HOST)
>> #define IS_GFX_DEVICE(pdev) ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY)
>> #define IS_ISA_DEVICE(pdev) ((pdev->class >> 8) == PCI_CLASS_BRIDGE_ISA)
>> #define IS_AZALIA(pdev) ((pdev)->vendor == 0x8086 && (pdev)->device == 0x3a3e)
>> @@ -2183,7 +2184,7 @@ static int iommu_should_identity_map(str
>> * take them out of the 1:1 domain later.
>> */
>> if (!startup)
>> - return pdev->dma_mask > DMA_BIT_MASK(32);
>> + return pdev->dma_mask == DMA_BIT_MASK(64);
>
> This looks unrelated, why the change?
>
>> return 1;
>> }
>> @@ -2198,6 +2199,9 @@ static int __init iommu_prepare_static_i
>> return -EFAULT;
>>
>> for_each_pci_dev(pdev) {
>> + /* Skip PCI Host Bridge devices */
>> + if (IS_HOSTBRIDGE_DEVICE(pdev))
>> + continue;
>> if (iommu_should_identity_map(pdev, 1)) {
>
> Should this host bridge check go into iommu_should_identity_map?
>
> I understand skipping the extra host bridges, but what is the NULL ptr deref
> coming from? Just to be sure this isn't a bandaid.
>
> thanks,
> -chris
>

2011-03-30 19:15:52

by Chris Wright

Subject: Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping

* Mike Travis ([email protected]) wrote:
> Chris Wright wrote:
> >* Mike Travis ([email protected]) wrote:
> >> When the IOMMU is being used, each request for a DMA mapping requires
> >> the intel_iommu code to look for some space in the DMA mapping table.
> >> For most drivers this occurs for each transfer.
> >>
> >> When there are many outstanding DMA mappings [as seems to be the case
> >> with the 10GigE driver], the table grows large and the search for
> >> space becomes increasingly time consuming. Performance for the
> >> 10GigE driver drops to about 10% of its capacity on a UV system
> >> when the CPU count is large.
> >
> >That's pretty poor. I've seen large overheads, but when that big it was
> >also related to issues in the 10G driver. Do you have profile data
> >showing this as the hotspot?
>
> Here's one from our internal bug report:
>
> Here is a profile from a run with iommu=on iommu=pt (no forcedac)

OK, I was actually interested in the !pt case. But this is useful
still. The iova lookup being distinct from the identity_mapping() case.

> uv48-sys was receiving and uv-debug sending.
> ksoftirqd/640 was running at approx. 100% cpu utilization.
> I had pinned the nttcp process on uv48-sys to cpu 64.
>
> # Samples: 1255641
> #
> # Overhead Command Shared Object Symbol
> # ........ ............. ............. ......
> #
> 50.27% ksoftirqd/640 [kernel] [k] _spin_lock
> 27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping

> ...
> 0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
> 0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]

Note, ixgbe has had rx dma mapping issues (that's why I wondered what
was causing the massive slowdown under !pt mode).

<snip>
> I tracked this time down to identity_mapping() in this loop:
>
> list_for_each_entry(info, &si_domain->devices, link)
> if (info->dev == pdev)
> return 1;
>
> I didn't get the exact count, but there was approx 11,000 PCI devices
> on this system. And this function was called for every page request
> in each DMA request.

Right, so this is the list traversal (and wow, a lot of PCI devices).
Did you try a smarter data structure? (While there's room for another
bit in pci_dev, the bit is more about iommu implementation details than
anything at the pci level).

Or the domain_dev_info is cached in the archdata of device struct.
You should be able to just reference that directly.

Didn't think it through completely, but perhaps something as simple as:

return pdev->dev.archdata.iommu == si_domain;

thanks,
-chris

2011-03-30 19:25:50

by Mike Travis

Subject: Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping



Chris Wright wrote:
> * Mike Travis ([email protected]) wrote:
>> Chris Wright wrote:
>>> * Mike Travis ([email protected]) wrote:
>>>> When the IOMMU is being used, each request for a DMA mapping requires
>>>> the intel_iommu code to look for some space in the DMA mapping table.
>>>> For most drivers this occurs for each transfer.
>>>>
>>>> When there are many outstanding DMA mappings [as seems to be the case
>>>> with the 10GigE driver], the table grows large and the search for
>>>> space becomes increasingly time consuming. Performance for the
>>>> 10GigE driver drops to about 10% of its capacity on a UV system
>>>> when the CPU count is large.
>>> That's pretty poor. I've seen large overheads, but when that big it was
>>> also related to issues in the 10G driver. Do you have profile data
>>> showing this as the hotspot?
>> Here's one from our internal bug report:
>>
>> Here is a profile from a run with iommu=on iommu=pt (no forcedac)
>
> OK, I was actually interested in the !pt case. But this is useful
> still. The iova lookup being distinct from the identity_mapping() case.

I can get that as well, but having every device using maps caused its
own set of problems (hundreds of dma maps). Here's a list of devices
on the system under test. You can see that even 'minor' glitches can
get magnified when there are so many...

Blade Location NASID PCI Address X Display Device
----------------------------------------------------------------------
0 r001i01b00 0 0000:01:00.0 - Intel 82576 Gigabit Network Connection
. . . 0000:01:00.1 - Intel 82576 Gigabit Network Connection
. . . 0000:04:00.0 - LSI SAS1064ET Fusion-MPT SAS
. . . 0000:05:00.0 - Matrox MGA G200e
2 r001i01b02 4 0001:02:00.0 - Mellanox MT26428 InfiniBand
3 r001i01b03 6 0002:02:00.0 - Mellanox MT26428 InfiniBand
4 r001i01b04 8 0003:02:00.0 - Mellanox MT26428 InfiniBand
11 r001i01b11 22 0007:02:00.0 - Mellanox MT26428 InfiniBand
13 r001i01b13 26 0008:02:00.0 - Mellanox MT26428 InfiniBand
15 r001i01b15 30 0009:07:00.0 :0.0 nVidia GF100 [Tesla S2050]
. . . 0009:08:00.0 :1.1 nVidia GF100 [Tesla S2050]
18 r001i23b02 36 000b:02:00.0 - Mellanox MT26428 InfiniBand
20 r001i23b04 40 000c:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 000c:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 000c:04:00.0 - Mellanox MT26428 InfiniBand
23 r001i23b07 46 000d:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 000d:08:00.0 - nVidia GF100 [Tesla S2050]
25 r001i23b09 50 000e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 000e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 000e:04:00.0 - Mellanox MT26428 InfiniBand
26 r001i23b10 52 000f:02:00.0 - Mellanox MT26428 InfiniBand
27 r001i23b11 54 0010:02:00.0 - Mellanox MT26428 InfiniBand
29 r001i23b13 58 0011:02:00.0 - Mellanox MT26428 InfiniBand
31 r001i23b15 62 0012:02:00.0 - Mellanox MT26428 InfiniBand
34 r002i01b02 68 0013:01:00.0 - Mellanox MT26428 InfiniBand
35 r002i01b03 70 0014:02:00.0 - Mellanox MT26428 InfiniBand
36 r002i01b04 72 0015:01:00.0 - Mellanox MT26428 InfiniBand
41 r002i01b09 82 0018:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 0018:08:00.0 - nVidia GF100 [Tesla S2050]
43 r002i01b11 86 0019:01:00.0 - Mellanox MT26428 InfiniBand
45 r002i01b13 90 001a:01:00.0 - Mellanox MT26428 InfiniBand
48 r002i23b00 96 001c:07:00.0 - nVidia GF100 [Tesla S2050]
. . . 001c:08:00.0 - nVidia GF100 [Tesla S2050]
50 r002i23b02 100 001d:02:00.0 - Mellanox MT26428 InfiniBand
52 r002i23b04 104 001e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 001e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 001e:04:00.0 - Mellanox MT26428 InfiniBand
57 r002i23b09 114 0020:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
. . . 0020:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
. . . 0020:04:00.0 - Mellanox MT26428 InfiniBand
58 r002i23b10 116 0021:02:00.0 - Mellanox MT26428 InfiniBand
59 r002i23b11 118 0022:02:00.0 - Mellanox MT26428 InfiniBand
61 r002i23b13 122 0023:02:00.0 - Mellanox MT26428 InfiniBand
63 r002i23b15 126 0024:02:00.0 - Mellanox MT26428 InfiniBand

>
>> uv48-sys was receiving and uv-debug sending.
>> ksoftirqd/640 was running at approx. 100% cpu utilization.
>> I had pinned the nttcp process on uv48-sys to cpu 64.
>>
>> # Samples: 1255641
>> #
>> # Overhead Command Shared Object Symbol
>> # ........ ............. ............. ......
>> #
>> 50.27% ksoftirqd/640 [kernel] [k] _spin_lock
>> 27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping
>
>> ...
>> 0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
>> 0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]
>
> Note, ixgbe has had rx dma mapping issues (that's why I wondered what
> was causing the massive slowdown under !pt mode).

I think since this profile run, the network guys updated the ixgbe
driver with a later version. (I don't know the outcome of that test.)

>
> <snip>
>> I tracked this time down to identity_mapping() in this loop:
>>
>> list_for_each_entry(info, &si_domain->devices, link)
>> if (info->dev == pdev)
>> return 1;
>>
>> I didn't get the exact count, but there was approx 11,000 PCI devices
>> on this system. And this function was called for every page request
>> in each DMA request.
>
> Right, so this is the list traversal (and wow, a lot of PCI devices).

Most of the PCI devices were the 45 on each of 256 Nehalem sockets.
Also, there's a ton of bridges as well.

> Did you try a smarter data structure? (While there's room for another
> bit in pci_dev, the bit is more about iommu implementation details than
> anything at the pci level).
>
> Or the domain_dev_info is cached in the archdata of device struct.
> You should be able to just reference that directly.
>
> Didn't think it through completely, but perhaps something as simple as:
>
> return pdev->dev.archdata.iommu == si_domain;

I can try this, thanks!

>
> thanks,
> -chris

2011-03-30 19:57:58

by Chris Wright

Subject: Re: [PATCH 1/4] Intel pci: Remove Host Bridge devices from identity mapping

* Mike Travis ([email protected]) wrote:
> Chris Wright wrote:
> >OK, I was actually interested in the !pt case. But this is useful
> >still. The iova lookup being distinct from the identity_mapping() case.
>
> I can get that as well, but having every device using maps caused its
> own set of problems (hundreds of dma maps). Here's a list of devices
> on the system under test. You can see that even 'minor' glitches can
> get magnified when there are so many...

Yeah, I was focused on the overhead of actually mapping/unmapping an
address in the non-pt case.

> Blade Location NASID PCI Address X Display Device
> ----------------------------------------------------------------------
> 0 r001i01b00 0 0000:01:00.0 - Intel 82576 Gigabit Network Connection
> . . . 0000:01:00.1 - Intel 82576 Gigabit Network Connection
> . . . 0000:04:00.0 - LSI SAS1064ET Fusion-MPT SAS
> . . . 0000:05:00.0 - Matrox MGA G200e
> 2 r001i01b02 4 0001:02:00.0 - Mellanox MT26428 InfiniBand
> 3 r001i01b03 6 0002:02:00.0 - Mellanox MT26428 InfiniBand
> 4 r001i01b04 8 0003:02:00.0 - Mellanox MT26428 InfiniBand
> 11 r001i01b11 22 0007:02:00.0 - Mellanox MT26428 InfiniBand
> 13 r001i01b13 26 0008:02:00.0 - Mellanox MT26428 InfiniBand
> 15 r001i01b15 30 0009:07:00.0 :0.0 nVidia GF100 [Tesla S2050]
> . . . 0009:08:00.0 :1.1 nVidia GF100 [Tesla S2050]
> 18 r001i23b02 36 000b:02:00.0 - Mellanox MT26428 InfiniBand
> 20 r001i23b04 40 000c:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000c:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000c:04:00.0 - Mellanox MT26428 InfiniBand
> 23 r001i23b07 46 000d:07:00.0 - nVidia GF100 [Tesla S2050]
> . . . 000d:08:00.0 - nVidia GF100 [Tesla S2050]
> 25 r001i23b09 50 000e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 000e:04:00.0 - Mellanox MT26428 InfiniBand
> 26 r001i23b10 52 000f:02:00.0 - Mellanox MT26428 InfiniBand
> 27 r001i23b11 54 0010:02:00.0 - Mellanox MT26428 InfiniBand
> 29 r001i23b13 58 0011:02:00.0 - Mellanox MT26428 InfiniBand
> 31 r001i23b15 62 0012:02:00.0 - Mellanox MT26428 InfiniBand
> 34 r002i01b02 68 0013:01:00.0 - Mellanox MT26428 InfiniBand
> 35 r002i01b03 70 0014:02:00.0 - Mellanox MT26428 InfiniBand
> 36 r002i01b04 72 0015:01:00.0 - Mellanox MT26428 InfiniBand
> 41 r002i01b09 82 0018:07:00.0 - nVidia GF100 [Tesla S2050]
> . . . 0018:08:00.0 - nVidia GF100 [Tesla S2050]
> 43 r002i01b11 86 0019:01:00.0 - Mellanox MT26428 InfiniBand
> 45 r002i01b13 90 001a:01:00.0 - Mellanox MT26428 InfiniBand
> 48 r002i23b00 96 001c:07:00.0 - nVidia GF100 [Tesla S2050]
> . . . 001c:08:00.0 - nVidia GF100 [Tesla S2050]
> 50 r002i23b02 100 001d:02:00.0 - Mellanox MT26428 InfiniBand
> 52 r002i23b04 104 001e:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 001e:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 001e:04:00.0 - Mellanox MT26428 InfiniBand
> 57 r002i23b09 114 0020:01:00.0 - Intel 82599EB 10-Gigabit Network Connection
> . . . 0020:01:00.1 - Intel 82599EB 10-Gigabit Network Connection
> . . . 0020:04:00.0 - Mellanox MT26428 InfiniBand
> 58 r002i23b10 116 0021:02:00.0 - Mellanox MT26428 InfiniBand
> 59 r002i23b11 118 0022:02:00.0 - Mellanox MT26428 InfiniBand
> 61 r002i23b13 122 0023:02:00.0 - Mellanox MT26428 InfiniBand
> 63 r002i23b15 126 0024:02:00.0 - Mellanox MT26428 InfiniBand
>
> >
> >>uv48-sys was receiving and uv-debug sending.
> >>ksoftirqd/640 was running at approx. 100% cpu utilization.
> >>I had pinned the nttcp process on uv48-sys to cpu 64.
> >>
> >># Samples: 1255641
> >>#
> >># Overhead Command Shared Object Symbol
> >># ........ ............. ............. ......
> >>#
> >> 50.27% ksoftirqd/640 [kernel] [k] _spin_lock
> >> 27.43% ksoftirqd/640 [kernel] [k] iommu_no_mapping
> >
> >>...
> >> 0.48% ksoftirqd/640 [kernel] [k] iommu_should_identity_map
> >> 0.45% ksoftirqd/640 [kernel] [k] ixgbe_alloc_rx_buffers [ixgbe]
> >
> >Note, ixgbe has had rx dma mapping issues (that's why I wondered what
> >was causing the massive slowdown under !pt mode).
>
> I think since this profile run, the network guys updated the ixgbe
> driver with a later version. (I don't know the outcome of that test.)

OK. The ixgbe fix I was thinking of has been in since 2.6.34: 43634e82 (ixgbe:
Fix DMA mapping/unmapping issues when HWRSC is enabled on IOMMU enabled
kernels).

> ><snip>
> >>I tracked this time down to identity_mapping() in this loop:
> >>
> >> list_for_each_entry(info, &si_domain->devices, link)
> >> if (info->dev == pdev)
> >> return 1;
> >>
> >>I didn't get the exact count, but there was approx 11,000 PCI devices
> >>on this system. And this function was called for every page request
> >>in each DMA request.
> >
> >Right, so this is the list traversal (and wow, a lot of PCI devices).
>
> Most of the PCI devices were the 45 on each of 256 Nehalem sockets.
> Also, there's a ton of bridges as well.
>
> >Did you try a smarter data structure? (While there's room for another
> >bit in pci_dev, the bit is more about iommu implementation details than
> >anything at the pci level).
> >
> >Or the domain_dev_info is cached in the archdata of device struct.
> >You should be able to just reference that directly.
> >
> >Didn't think it through completely, but perhaps something as simple as:
> >
> > return pdev->dev.archdata.iommu == si_domain;
>
> I can try this, thanks!

Err, I guess that'd be info = archdata.iommu; info->domain == si_domain
(and probably need some sanity checking against things like
DUMMY_DEVICE_DOMAIN_INFO). But you get the idea.
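
For the record, a rough sketch of that direct check (untested, and assuming
the device_domain_info pointer cached in dev.archdata.iommu plus the
DUMMY_DEVICE_DOMAIN_INFO marker already used in intel-iommu.c):

	static int identity_mapping(struct pci_dev *pdev)
	{
		struct device_domain_info *info;

		if (likely(!iommu_identity_mapping))
			return 0;

		info = pdev->dev.archdata.iommu;
		if (!info || info == DUMMY_DEVICE_DOMAIN_INFO)
			return 0;

		/* O(1) pointer compare instead of walking si_domain->devices */
		return info->domain == si_domain;
	}

That would turn the per-mapping lookup into a pointer comparison rather than
a list walk over all ~11,000 devices.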

thanks,
-chris