2012-06-16 01:57:50

by Ulrich Drepper

[permalink] [raw]
Subject: SNB PCI root information

The PCI roots in multi-socket SNB are part of specific sockets. This
means optimization will need to know which socket the root is part of
and therefore which cores have direct access as opposed to over one or
more QPI links.

I tried to find this information in the /sys filesystem in kernels up
to the current upstream kernel. It seems there is actually nothing
like this.

There are the files /sys/devices/pci*/*/local_cpus which should
contain this information. For each device we would be able to get the
information about the local CPUs.

The SPARC OF handling seems to set the field, some Intel drivers seem
to try to do it in a different way.

The problem I have seen (at least on a Dell R620) is that the
dev_to_code() function returns -1 which indicates that no node
information is stored.

If I understand the code correctly, the numa_node field can be set
explicitly but is mostly inherited from the underlying device (bus
etc). Does this mean that the locality information should come from
the same place where the PCI root data structure is initialized?

This happens, if I'm not mistaken, in the ACPI table parsing. I've
disassembled the DSDT table and didn't find anything like this type of
information. At least I didn't see it. I also couldn't find anything
in the ACPI 5.0 spec.


The questions are:
a) am I missing something?
b) do BIOSes (perhaps from other manufacturers) provide the information?
c) can we get this fixed?
d) can we interpolate the information for platforms where the BIOSes
don't have the information?


2012-06-16 03:03:40

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Fri, Jun 15, 2012 at 6:57 PM, Ulrich Drepper <[email protected]> wrote:
> The PCI roots in multi-socket SNB are part of specific sockets. ?This
> means optimization will need to know which socket the root is part of
> and therefore which cores have direct access as opposed to over one or
> more QPI links.
>
> I tried to find this information in the /sys filesystem in kernels up
> to the current upstream kernel. ?It seems there is actually nothing
> like this.
>
> There are the files /sys/devices/pci*/*/local_cpus which should
> contain this information. ?For each device we would be able to get the
> information about the local CPUs.
>
> The SPARC OF handling seems to set the field, some Intel drivers seem
> to try to do it in a different way.
>
> The problem I have seen (at least on a Dell R620) is that the
> dev_to_code() function returns -1 which indicates that no node
> information is stored.
>
> If I understand the code correctly, the numa_node field can be set
> explicitly but is mostly inherited from the underlying device (bus
> etc). ?Does this mean that the locality information should come from
> the same place where the PCI root data structure is initialized?
>
> This happens, if I'm not mistaken, in the ACPI table parsing. ?I've
> disassembled the DSDT table and didn't find anything like this type of
> information. ?At least I didn't see it. ?I also couldn't find anything
> in the ACPI 5.0 spec.

yes, you should have _PXM for root bus in DSDT.

>
>
> The questions are:
> a) am I missing something?
> b) do BIOSes (perhaps from other manufacturers) provide the information?
> c) can we get this fixed?

get updated BIOS.

> d) can we interpolate the information for platforms where the BIOSes
> don't have the information?

in arch/x86/pci/acpi.c::pci_acpi_scan_root(), we have

node = -1;
#ifdef CONFIG_ACPI_NUMA
pxm = acpi_get_pxm(device->handle);
if (pxm >= 0)
node = pxm_to_node(pxm);
if (node != -1)
set_mp_bus_to_node(busnum, node);
else
#endif
node = get_mp_bus_to_node(busnum);

if (node != -1 && !node_online(node))
node = -1;

info = kzalloc(sizeof(*info), GFP_KERNEL);
if (!info) {
printk(KERN_WARNING "pci_bus %04x:%02x: "
"ignored (out of memory)\n", domain, busnum);
return NULL;
}

sd = &info->sd;
sd->domain = domain;
sd->node = node;

So kernel will check _PXM at first, or will use pre-probe host bridge info.
Now we only have that for amd k8 cpu.

We used to have same for intel IOH nehalem, and get bless from intel.
but that get removed at some point.
I have one local internal similar patch for SNB iio for crossing check
if BIOS set correctly.
but I don't think i will try to get blessing from intel to publish it.

So please get one updated bios from your vendor.

or we could add command line to pass those info, just like fake numa interface.

Thanks

Yinghai

2012-06-16 08:52:59

by Thomas Gleixner

[permalink] [raw]
Subject: Re: SNB PCI root information

On Fri, 15 Jun 2012, Yinghai Lu wrote:
> We used to have same for intel IOH nehalem, and get bless from intel.
> but that get removed at some point.

Why so?

> I have one local internal similar patch for SNB iio for crossing check
> if BIOS set correctly.
> but I don't think i will try to get blessing from intel to publish it.

Why do you need Intels blessing to post a useful patch?

Is that patch based on public available documentation and/or your own
findings?

If yes, then hell we don't care whether Intel likes it or not.

If not, then the question arises what Intel has to hide in that area.

Thanks,

tglx

2012-06-16 19:36:09

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Sat, Jun 16, 2012 at 1:52 AM, Thomas Gleixner <[email protected]> wrote:
>> I have one local internal similar patch for SNB iio for crossing check
>> if BIOS set correctly.
>> but I don't think i will try to get blessing from intel to publish it.
>
> Why do you need Intels blessing to post a useful patch?
>
> Is that patch based on public available documentation and/or your own
> findings?
>
> If yes, then hell we don't care whether Intel likes it or not.
>
> If not, then the question arises what Intel has to hide in that area.

it is with Intel NDA EDS or BIOS writing guide.

Thanks

Yinghai

2012-06-16 21:56:53

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: SNB PCI root information

On Fri, Jun 15, 2012 at 9:03 PM, Yinghai Lu <[email protected]> wrote:
> On Fri, Jun 15, 2012 at 6:57 PM, Ulrich Drepper <[email protected]> wrote:
>> The PCI roots in multi-socket SNB are part of specific sockets. ?This
>> means optimization will need to know which socket the root is part of
>> and therefore which cores have direct access as opposed to over one or
>> more QPI links.
>>
>> I tried to find this information in the /sys filesystem in kernels up
>> to the current upstream kernel. ?It seems there is actually nothing
>> like this.
>>
>> There are the files /sys/devices/pci*/*/local_cpus which should
>> contain this information. ?For each device we would be able to get the
>> information about the local CPUs.
>>
>> The SPARC OF handling seems to set the field, some Intel drivers seem
>> to try to do it in a different way.
>>
>> The problem I have seen (at least on a Dell R620) is that the
>> dev_to_code() function returns -1 which indicates that no node
>> information is stored.
>>
>> If I understand the code correctly, the numa_node field can be set
>> explicitly but is mostly inherited from the underlying device (bus
>> etc). ?Does this mean that the locality information should come from
>> the same place where the PCI root data structure is initialized?
>>
>> This happens, if I'm not mistaken, in the ACPI table parsing. ?I've
>> disassembled the DSDT table and didn't find anything like this type of
>> information. ?At least I didn't see it. ?I also couldn't find anything
>> in the ACPI 5.0 spec.
>
> yes, you should have _PXM for root bus in DSDT.
>
>>
>>
>> The questions are:
>> a) am I missing something?
>> b) do BIOSes (perhaps from other manufacturers) provide the information?
>> c) can we get this fixed?
>
> get updated BIOS.
>
>> d) can we interpolate the information for platforms where the BIOSes
>> don't have the information?
>
> in arch/x86/pci/acpi.c::pci_acpi_scan_root(), we have
>
> ? ? ? ?node = -1;
> #ifdef CONFIG_ACPI_NUMA
> ? ? ? ?pxm = acpi_get_pxm(device->handle);
> ? ? ? ?if (pxm >= 0)
> ? ? ? ? ? ? ? ?node = pxm_to_node(pxm);
> ? ? ? ?if (node != -1)
> ? ? ? ? ? ? ? ?set_mp_bus_to_node(busnum, node);
> ? ? ? ?else
> #endif
> ? ? ? ? ? ? ? ?node = get_mp_bus_to_node(busnum);
>
> ? ? ? ?if (node != -1 && !node_online(node))
> ? ? ? ? ? ? ? ?node = -1;
>
> ? ? ? ?info = kzalloc(sizeof(*info), GFP_KERNEL);
> ? ? ? ?if (!info) {
> ? ? ? ? ? ? ? ?printk(KERN_WARNING "pci_bus %04x:%02x: "
> ? ? ? ? ? ? ? ? ? ? ? "ignored (out of memory)\n", domain, busnum);
> ? ? ? ? ? ? ? ?return NULL;
> ? ? ? ?}
>
> ? ? ? ?sd = &info->sd;
> ? ? ? ?sd->domain = domain;
> ? ? ? ?sd->node = node;
>
> So kernel will check _PXM at first, or will use pre-probe host bridge info.
> Now we only have that for amd k8 cpu.
>
> We used to have same for intel IOH nehalem, ?and get bless from intel.
> but that get removed at some point.
> I have one local internal similar patch for SNB iio for crossing check
> if BIOS set correctly.
> but I don't think i will try to get blessing from intel to publish it.
>
> So please get one updated bios from your vendor.

If ACPI provides a perfectly usable generic way to describe this
topology and the vendor BIOS doesn't bother to use it, I'm not very
interested in trying to compensate for that BIOS deficiency by adding
a bunch of non-portable CPU-specific gunk to Linux.

2012-06-18 22:30:44

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Sat, Jun 16, 2012 at 5:56 PM, Bjorn Helgaas <[email protected]> wrote:
> If ACPI provides a perfectly usable generic way to describe this
> topology and the vendor BIOS doesn't bother to use it, I'm not very
> interested in trying to compensate for that BIOS deficiency by adding
> a bunch of non-portable CPU-specific gunk to Linux.

The problem is that all machines get this wrong. I've tested varies
models from Dell and HP and none of them have the _PXM entry and the
local_cpus fields are wrong. If there is a reasonably sane way to
compensate for broken BIOSes it should be considered. We all know how
good BIOS authors are...

2012-06-18 23:40:45

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Mon, Jun 18, 2012 at 3:30 PM, Ulrich Drepper <[email protected]> wrote:
> On Sat, Jun 16, 2012 at 5:56 PM, Bjorn Helgaas <[email protected]> wrote:
>> If ACPI provides a perfectly usable generic way to describe this
>> topology and the vendor BIOS doesn't bother to use it, I'm not very
>> interested in trying to compensate for that BIOS deficiency by adding
>> a bunch of non-portable CPU-specific gunk to Linux.
>
> The problem is that all machines get this wrong. ?I've tested varies
> models from Dell and HP and none of them have the _PXM entry and the
> local_cpus fields are wrong. ?If there is a reasonably sane way to
> compensate for broken BIOSes it should be considered. ?We all know how
> good BIOS authors are...

please check attached one. and you may append "pci=busnum_node=00:00,80:01"
to change node for root bus.
assume you have two root bus: 00, 80

Thanks

Yinghai


Attachments:
busnum_node.patch (1.57 kB)

2012-06-19 12:36:44

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: SNB PCI root information

On Mon, Jun 18, 2012 at 5:40 PM, Yinghai Lu <[email protected]> wrote:
> On Mon, Jun 18, 2012 at 3:30 PM, Ulrich Drepper <[email protected]> wrote:
>> On Sat, Jun 16, 2012 at 5:56 PM, Bjorn Helgaas <[email protected]> wrote:
>>> If ACPI provides a perfectly usable generic way to describe this
>>> topology and the vendor BIOS doesn't bother to use it, I'm not very
>>> interested in trying to compensate for that BIOS deficiency by adding
>>> a bunch of non-portable CPU-specific gunk to Linux.
>>
>> The problem is that all machines get this wrong. ?I've tested varies
>> models from Dell and HP and none of them have the _PXM entry and the
>> local_cpus fields are wrong. ?If there is a reasonably sane way to
>> compensate for broken BIOSes it should be considered. ?We all know how
>> good BIOS authors are...
>
> please check attached one. and ?you may append "pci=busnum_node=00:00,80:01"
> to change node for root bus.
> assume you have two root bus: 00, 80

I'm not opposed to something like this, if people think it's useful.

This patch sets the node quite early, before we even look at the _PXM
information in pci_acpi_scan_root(). That means if the BIOS does
supply a _PXM method and the user gives this argument, the
user-supplied info is silently overwritten. To me it would make more
sense to handle an option like this *after* we look for _PXM info.
That way it could be used to compensate for both missing and incorrect
_PXM info.

2012-06-19 18:20:13

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Tue, Jun 19, 2012 at 5:36 AM, Bjorn Helgaas <[email protected]> wrote:
>
> I'm not opposed to something like this, if people think it's useful.
>
> This patch sets the node quite early, before we even look at the _PXM
> information in pci_acpi_scan_root(). ?That means if the BIOS does
> supply a _PXM method and the user gives this argument, the
> user-supplied info is silently overwritten. ?To me it would make more
> sense to handle an option like this *after* we look for _PXM info.
> That way it could be used to compensate for both missing and incorrect
> _PXM info.

yes, we can only let user input and hostbridge touch that array.

but i'd like to only handle missing _PXM case.

If the BIOS provide wrong _PXM, that BIOS really should be fixed at first.

Yinghai

2012-06-20 17:11:48

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Tue, Jun 19, 2012 at 2:20 PM, Yinghai Lu <[email protected]> wrote:
> If the BIOS provide wrong _PXM, that BIOS really should be fixed at first.

Let's go there when we see BIOSes with the information...

In the meantime I tested the patch. It needs the small modification
below (iinterdiff-generated). The problem is the pci= parameter
handling uses ',' to separate parameters and therefore the second PCI
root information, separated by a comma, is interpreted as a new pci=
parameter. The patch below changes the separator to ';'. With that I
get the correct information after boot up.


Signed-off-by: Ulrich Drepper <[email protected]>

diff -u b/Documentation/kernel-parameters.txt
b/Documentation/kernel-parameters.txt
--- b/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2192,7 +2192,7 @@
realloc same as realloc=on
busnum_node=
Format:
- <bus>:<node>[, ...]
+ <bus>:<node>[;...]
Specifies node for bus
noari do not use PCIe ARI.
pcie_scan_all Scan all possible PCIe devices. Otherwise we
diff -u b/arch/x86/pci/common.c b/arch/x86/pci/common.c
--- b/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -509,7 +509,7 @@
}
set_mp_bus_to_node(bus, node);
p += count;
- if (*p != ',')
+ if (*p != ';')
break;
p++;
}

2012-06-20 17:18:15

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: SNB PCI root information

On Tue, Jun 19, 2012 at 12:20 PM, Yinghai Lu <[email protected]> wrote:
> On Tue, Jun 19, 2012 at 5:36 AM, Bjorn Helgaas <[email protected]> wrote:
>>
>> I'm not opposed to something like this, if people think it's useful.
>>
>> This patch sets the node quite early, before we even look at the _PXM
>> information in pci_acpi_scan_root(). ?That means if the BIOS does
>> supply a _PXM method and the user gives this argument, the
>> user-supplied info is silently overwritten. ?To me it would make more
>> sense to handle an option like this *after* we look for _PXM info.
>> That way it could be used to compensate for both missing and incorrect
>> _PXM info.
>
> yes, we can only let user input and hostbridge touch that array.
>
> but i'd like to only handle missing _PXM case.
>
> If the BIOS provide wrong _PXM, that BIOS really should be fixed at first.

I don't understand this. Is there an *advantage* to silently throwing
away the information the user specified on the command line? If the
user goes to the trouble of discovering and using a command line
argument, I think that user-supplied information should override
anything the kernel can figure out on its own. Ulrich, do you have an
opinion either way?

2012-06-20 17:59:46

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 1:17 PM, Bjorn Helgaas <[email protected]> wrote:
> I don't understand this.  Is there an *advantage* to silently throwing
> away the information the user specified on the command line?  If the
> user goes to the trouble of discovering and using a command line
> argument, I think that user-supplied information should override
> anything the kernel can figure out on its own.  Ulrich, do you have an
> opinion either way?

If the BIOS information we look for would be something generic then we
might want to have something to overwrite. But in this case it's a
very specific piece of information which only has one correct value
and I'd hope the BIOS writers get it right.

Assuming this there *might* be value in having it the way the patch
does now. If the BIOS changes it could in theory also renumber the
devices etc. In this case the kernel command line overwrite values
might become wrong while a newly-added _PXM entry might be right.
We've seen all kind of things happening on BIOS updates...

On the other hand it's easy enough to then remove the kernel command
line parameter. It's also probably more in line with other parameters
which overwrite information the kernel determines otherwise
automatically.

I'd be willing to go with Yinghai's recommendation and give the BIOS
writers the benefit of a doubt that they get things right. If they
prove to be incapable again we can still change the option handling to
overwrite the kernel setting regardless.

2012-06-20 18:37:41

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 10:59 AM, Ulrich Drepper <[email protected]> wrote:
> On Wed, Jun 20, 2012 at 1:17 PM, Bjorn Helgaas <[email protected]> wrote:
>
> I'd be willing to go with Yinghai's recommendation and give the BIOS
> writers the benefit of a doubt that they get things right. ?If they
> prove to be incapable again we can still change the option handling to
> overwrite the kernel setting regardless.

good, please check updated patch with fix from you.

Thanks

Yinghai


Attachments:
busnum_node_v2.patch (2.30 kB)

2012-06-20 18:47:42

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 11:59 AM, Ulrich Drepper <[email protected]> wrote:
> On Wed, Jun 20, 2012 at 1:17 PM, Bjorn Helgaas <[email protected]> wrote:
>> I don't understand this. ?Is there an *advantage* to silently throwing
>> away the information the user specified on the command line? ?If the
>> user goes to the trouble of discovering and using a command line
>> argument, I think that user-supplied information should override
>> anything the kernel can figure out on its own. ?Ulrich, do you have an
>> opinion either way?
>
> If the BIOS information we look for would be something generic then we
> might want to have something to overwrite. ?But in this case it's a
> very specific piece of information which only has one correct value
> and I'd hope the BIOS writers get it right.
>
> Assuming this there *might* be value in having it the way the patch
> does now. ?If the BIOS changes it could in theory also renumber the
> devices etc. ?In this case the kernel command line overwrite values
> might become wrong while a newly-added _PXM entry might be right.
> We've seen all kind of things happening on BIOS updates...
>
> On the other hand it's easy enough to then remove the kernel command
> line parameter. ?It's also probably more in line with other parameters
> which overwrite information the kernel determines otherwise
> automatically.
>
> I'd be willing to go with Yinghai's recommendation and give the BIOS
> writers the benefit of a doubt that they get things right. ?If they
> prove to be incapable again we can still change the option handling to
> overwrite the kernel setting regardless.

As far as I can tell, here's Yinghai's recommendation: the user
argument should not override BIOS _PXM because if the BIOS gets the
_PXM wrong, the user won't be able to work around it with the
argument, which will force the vendor to fix the BIOS.

I'm not buying it. The convention that user-supplied arguments always
take precedence is useful, easy to document, and matches user
expectations. It allows the user to work around both missing _PXM and
incorrect _PXM.

Bjorn

2012-06-20 19:28:14

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 11:46 AM, Bjorn Helgaas <[email protected]> wrote:

> As far as I can tell, here's Yinghai's recommendation: ?the user
> argument should not override BIOS _PXM because if the BIOS gets the
> _PXM wrong, the user won't be able to work around it with the
> argument, which will force the vendor to fix the BIOS.
>
> I'm not buying it. ?The convention that user-supplied arguments always
> take precedence is useful, easy to document, and matches user
> expectations. ?It allows the user to work around both missing _PXM and
> incorrect _PXM.

if the vendor provide _PXM, that _PXM should be right and be trusted.

if the vendor does not provide _PXM, we can have command line to input
it before user can get one updated BIOS from vendor.

Yinghai

2012-06-20 19:34:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: SNB PCI root information


* Yinghai Lu <[email protected]> wrote:

> On Wed, Jun 20, 2012 at 11:46 AM, Bjorn Helgaas <[email protected]> wrote:
>
> > As far as I can tell, here's Yinghai's recommendation: ?the
> > user argument should not override BIOS _PXM because if the
> > BIOS gets the _PXM wrong, the user won't be able to work
> > around it with the argument, which will force the vendor to
> > fix the BIOS.
> >
> > I'm not buying it. ?The convention that user-supplied
> > arguments always take precedence is useful, easy to
> > document, and matches user expectations. ?It allows the user
> > to work around both missing _PXM and incorrect _PXM.
>
> if the vendor provide _PXM, that _PXM should be right and be
> trusted.
>
> if the vendor does not provide _PXM, we can have command line
> to input it before user can get one updated BIOS from vendor.

So how about an incorrect _PXM, or a slightly inefficient one?
Why shouldn't it be possible for the user to override it?

I mean, if we create a parameter space that tweaks data then why
not make it complete and allow *all* firmware data to be
(optionally) modified, from the kernel boot line?

Thanks,

Ingo

2012-06-20 19:57:33

by Brice Goglin

[permalink] [raw]
Subject: Re: SNB PCI root information

Le 20/06/2012 21:28, Yinghai Lu a ?crit :
> On Wed, Jun 20, 2012 at 11:46 AM, Bjorn Helgaas <[email protected]> wrote:
>
>> As far as I can tell, here's Yinghai's recommendation: the user
>> argument should not override BIOS _PXM because if the BIOS gets the
>> _PXM wrong, the user won't be able to work around it with the
>> argument, which will force the vendor to fix the BIOS.
>>
>> I'm not buying it. The convention that user-supplied arguments always
>> take precedence is useful, easy to document, and matches user
>> expectations. It allows the user to work around both missing _PXM and
>> incorrect _PXM.
> if the vendor provide _PXM, that _PXM should be right and be trusted.
>

I agree that the most common problem is that _PXM is missing. But I've
seen at least some HP Westmere-EP platforms where _PXM exists but it is
wrong. Last time we checked, there was no BIOS update, even if we
reported the bug a while ago.

Brice

2012-06-20 20:04:49

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 3:34 PM, Ingo Molnar <[email protected]> wrote:
> I mean, if we create a parameter space that tweaks data then why
> not make it complete and allow *all* firmware data to be
> (optionally) modified, from the kernel boot line?

If there is proof that BIOSes get it wrong then just use this small
additional patch:


Signed-off-by: Ulrich Drepper <[email protected]>

diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
index fc09c27..7aceb84 100644
--- a/arch/x86/pci/acpi.c
+++ b/arch/x86/pci/acpi.c
@@ -387,16 +387,16 @@ struct pci_bus * __devinit
pci_acpi_scan_root(struct acpi_pci_root *root)
return NULL;
}

- node = -1;
+ node = get_mp_bus_to_node(busnum);
#ifdef CONFIG_ACPI_NUMA
- pxm = acpi_get_pxm(device->handle);
- if (pxm >= 0)
- node = pxm_to_node(pxm);
- if (node != -1)
- set_mp_bus_to_node(busnum, node);
- else
+ if (node == -1) {
+ pxm = acpi_get_pxm(device->handle);
+ if (pxm >= 0)
+ node = pxm_to_node(pxm);
+ if (node != -1)
+ set_mp_bus_to_node(busnum, node);
+ }
#endif
- node = get_mp_bus_to_node(busnum);

if (node != -1 && !node_online(node))
node = -1;

2012-06-20 20:17:04

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 2:04 PM, Ulrich Drepper <[email protected]> wrote:
> On Wed, Jun 20, 2012 at 3:34 PM, Ingo Molnar <[email protected]> wrote:
>> I mean, if we create a parameter space that tweaks data then why
>> not make it complete and allow *all* firmware data to be
>> (optionally) modified, from the kernel boot line?
>
> If there is proof that BIOSes get it wrong then just use this small
> additional patch:

It's not a question of "do BIOSes get this wrong?" The question is
"what does a user expect to happen when she supplies
'pci=busnum_node=00:00,80:01'?" I contend that the user expects us to
use that info whether the BIOS gave us correct info, wrong info, or
nothing at all.

Normally I'd be glad to munge this all together myself, but I'm
feeling a bit swamped, so if anybody wants this, it would help me out
to get a single complete patch.

> Signed-off-by: Ulrich Drepper <[email protected]>
>
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index fc09c27..7aceb84 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -387,16 +387,16 @@ struct pci_bus * __devinit
> pci_acpi_scan_root(struct acpi_pci_root *root)
> ? ? ? ? ? ? ? ?return NULL;
> ? ? ? ?}
>
> - ? ? ? node = -1;
> + ? ? ? node = get_mp_bus_to_node(busnum);
> ?#ifdef CONFIG_ACPI_NUMA
> - ? ? ? pxm = acpi_get_pxm(device->handle);
> - ? ? ? if (pxm >= 0)
> - ? ? ? ? ? ? ? node = pxm_to_node(pxm);
> - ? ? ? if (node != -1)
> - ? ? ? ? ? ? ? set_mp_bus_to_node(busnum, node);
> - ? ? ? else
> + ? ? ? if (node == -1) {
> + ? ? ? ? ? ? ? pxm = acpi_get_pxm(device->handle);
> + ? ? ? ? ? ? ? if (pxm >= 0)
> + ? ? ? ? ? ? ? ? ? ? ? node = pxm_to_node(pxm);
> + ? ? ? ? ? ? ? if (node != -1)
> + ? ? ? ? ? ? ? ? ? ? ? set_mp_bus_to_node(busnum, node);
> + ? ? ? }
> ?#endif
> - ? ? ? ? ? ? ? node = get_mp_bus_to_node(busnum);
>
> ? ? ? ?if (node != -1 && !node_online(node))
> ? ? ? ? ? ? ? ?node = -1;

2012-06-20 21:22:15

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 4:16 PM, Bjorn Helgaas <[email protected]> wrote:
> It's not a question of "do BIOSes get this wrong?"  The question is
> "what does a user expect to happen when she supplies
> 'pci=busnum_node=00:00,80:01'?"  I contend that the user expects us to
> use that info whether the BIOS gave us correct info, wrong info, or
> nothing at all.

Think about busnum_node as busnum_default_node. Then the behavior
makes perfect sense.

I contend that the really the only reason for overwriting is if there
are wrong BIOS.

2012-06-20 23:58:33

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 1:04 PM, Ulrich Drepper <[email protected]> wrote:
> On Wed, Jun 20, 2012 at 3:34 PM, Ingo Molnar <[email protected]> wrote:
>> I mean, if we create a parameter space that tweaks data then why
>> not make it complete and allow *all* firmware data to be
>> (optionally) modified, from the kernel boot line?
>
> If there is proof that BIOSes get it wrong then just use this small
> additional patch:
>
>
> Signed-off-by: Ulrich Drepper <[email protected]>
>
> diff --git a/arch/x86/pci/acpi.c b/arch/x86/pci/acpi.c
> index fc09c27..7aceb84 100644
> --- a/arch/x86/pci/acpi.c
> +++ b/arch/x86/pci/acpi.c
> @@ -387,16 +387,16 @@ struct pci_bus * __devinit
> pci_acpi_scan_root(struct acpi_pci_root *root)
> ? ? ? ? ? ? ? ?return NULL;
> ? ? ? ?}
>
> - ? ? ? node = -1;
> + ? ? ? node = get_mp_bus_to_node(busnum);
> ?#ifdef CONFIG_ACPI_NUMA
> - ? ? ? pxm = acpi_get_pxm(device->handle);
> - ? ? ? if (pxm >= 0)
> - ? ? ? ? ? ? ? node = pxm_to_node(pxm);
> - ? ? ? if (node != -1)
> - ? ? ? ? ? ? ? set_mp_bus_to_node(busnum, node);
> - ? ? ? else
> + ? ? ? if (node == -1) {
> + ? ? ? ? ? ? ? pxm = acpi_get_pxm(device->handle);
> + ? ? ? ? ? ? ? if (pxm >= 0)
> + ? ? ? ? ? ? ? ? ? ? ? node = pxm_to_node(pxm);
> + ? ? ? ? ? ? ? if (node != -1)
> + ? ? ? ? ? ? ? ? ? ? ? set_mp_bus_to_node(busnum, node);
> + ? ? ? }
> ?#endif
> - ? ? ? ? ? ? ? node = get_mp_bus_to_node(busnum);
>
> ? ? ? ?if (node != -1 && !node_online(node))
> ? ? ? ? ? ? ? ?node = -1;

it will have problem for amd platform that have _PXM and also have
hostbridge detection.

before the patch, _PXM will always be used at first.

after this patch, the one from host bridge will always get used at first.

need to check AMD cpu with two silicon chip in one package.

Thanks

Yinghai

2012-06-21 02:38:00

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 12:34 PM, Ingo Molnar <[email protected]> wrote:
>> if the vendor provide _PXM, that _PXM should be right and be
>> trusted.
>>
>> if the vendor does not provide _PXM, we can have command line
>> to input it before user can get one updated BIOS from vendor.
>
> So how about an incorrect _PXM, or a slightly inefficient one?
> Why shouldn't it be possible for the user to override it?

Try to keep the code simple.

>
> I mean, if we create a parameter space that tweaks data then why
> not make it complete and allow *all* firmware data to be
> (optionally) modified, from the kernel boot line?

that pxm/node for pci device should be consistent with srat table etc,
so better solution is that BIOS keep them consistent.

If BIOS provide _PXM for pci device, the _PXM should have more chance
to be right.

Anyway if you insist that it should cover that wrong case, let me
check if it could be done simply.

Thanks

Yinghai

2012-06-21 02:43:24

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 12:57 PM, Brice Goglin <[email protected]> wrote:
> Le 20/06/2012 21:28, Yinghai Lu a ?crit :
>> On Wed, Jun 20, 2012 at 11:46 AM, Bjorn Helgaas <[email protected]> wrote:
>
> I agree that the most common problem is that _PXM is missing. But I've
> seen at least some HP Westmere-EP platforms where _PXM exists but it is
> wrong. Last time we checked, there was no BIOS update, even if we
> reported the bug a while ago.
>
Do you have boot log or acpi dump?

I suspected that could be other problem. intel system before
sandbridge does not have IIO.
they will have ioh instead, and one ioh would connect to two cpu sockets.
but _PXM for root bus in dsdt only can return one value. Aka it is
acpi spec limitation.

for example, Sun x4800 8 sockets server would have bus 00, 40, 80, c0.
but _PXM will only return 0, 2, 3, 6.

Thanks

Yinghai

2012-06-21 03:50:07

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 7:37 PM, Yinghai Lu <[email protected]> wrote:
> On Wed, Jun 20, 2012 at 12:34 PM, Ingo Molnar <[email protected]> wrote:
>>> if the vendor provide _PXM, that _PXM should be right and be
>>> trusted.
>>>
>>> if the vendor does not provide _PXM, we can have command line
>>> to input it before user can get one updated BIOS from vendor.
>>
>> So how about an incorrect _PXM, or a slightly inefficient one?
>> Why shouldn't it be possible for the user to override it?
>
> Try to keep the code simple.
>
>>
>> I mean, if we create a parameter space that tweaks data then why
>> not make it complete and allow *all* firmware data to be
>> (optionally) modified, from the kernel boot line?
>
> that pxm/node for pci device should be consistent with srat table etc,
> so better solution is that BIOS keep them consistent.
>
> If BIOS provide _PXM for pci device, the _PXM should have more chance
> to be right.
>
> Anyway if you insist that it should cover that wrong case, let me
> check if it could be done simply.

please check -v3, and it will add 40 lines.

and -v2 is about 25 lines.

Thanks

Yinghai


Attachments:
busnum_node_v3.patch (3.86 kB)

2012-06-21 05:56:29

by Brice Goglin

[permalink] [raw]
Subject: Re: SNB PCI root information

Le 21/06/2012 04:43, Yinghai Lu a ?crit :
> On Wed, Jun 20, 2012 at 12:57 PM, Brice Goglin <[email protected]> wrote:
>> Le 20/06/2012 21:28, Yinghai Lu a ?crit :
>>> On Wed, Jun 20, 2012 at 11:46 AM, Bjorn Helgaas <[email protected]> wrote:
>> I agree that the most common problem is that _PXM is missing. But I've
>> seen at least some HP Westmere-EP platforms where _PXM exists but it is
>> wrong. Last time we checked, there was no BIOS update, even if we
>> reported the bug a while ago.
>>
> Do you have boot log or acpi dump?
>
> I suspected that could be other problem. intel system before
> sandbridge does not have IIO.
> they will have ioh instead, and one ioh would connect to two cpu sockets.
> but _PXM for root bus in dsdt only can return one value. Aka it is
> acpi spec limitation.

The machines I am talking about have two IOHs (one connected to each
socket). I only have an old 2.6.27 boot log at hand, I'll ask my admins
for more. Do you want the 2.6.27 dmesg anyway?

Brice

2012-06-21 12:18:01

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 11:50 PM, Yinghai Lu <[email protected]> wrote:
> please check -v3, and it will add 40 lines.

Won't this version print out something like

PCI: unknown option `busnum_node=xx:yy'


I can verify but that's what I saw when the use of ',' as the
separator caused mis-parsing.

I think you have to provide an extension to pcibios_setup in any case.

2012-06-21 16:22:40

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Thu, Jun 21, 2012 at 8:17 AM, Ulrich Drepper <[email protected]> wrote:
> On Wed, Jun 20, 2012 at 11:50 PM, Yinghai Lu <[email protected]> wrote:
>> please check -v3, and it will add 40 lines.
>
> Won't this version print out something like
>
>  PCI: unknown option `busnum_node=xx:yy'

It does.

If you apply the following patch on top of your version 3 patch it
seems to work and is slightly more efficient.


Signed-off-by: Ulrich Drepper <[email protected]>

diff -u b/arch/x86/pci/common.c b/arch/x86/pci/common.c
--- b/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -494,16 +494,22 @@
return 0;
}

+static const char *busnum_node_param;
+
+static void remember_busnum_node(const char *str)
+{
+ busnum_node_param = str;
+}
+
int get_user_busnum_node(int busnum)
{
int bus, node, count;
- char *p;
+ const char *p;

- p = strstr(boot_command_line, "busnum_node=");
+ p = busnum_node_param;
if (!p)
return -1;

- p += 12; /* strlen("busnum_node=") */
while (*p) {
count = 0;
if (sscanf(p, "%x:%x%n", &bus, &node, &count) != 2) {
@@ -607,6 +613,9 @@
} else if (!strcmp(str, "nocrs")) {
pci_probe |= PCI_ROOT_NO_CRS;
return NULL;
+ } else if (!strncmp(str, "busnum_node=", 12)) {
+ remember_busnum_node(str + 12);
+ return NULL;
} else if (!strcmp(str, "earlydump")) {
pci_early_dump_regs = 1;
return NULL;

2012-06-21 18:11:23

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Thu, Jun 21, 2012 at 9:22 AM, Ulrich Drepper <[email protected]> wrote:
> On Thu, Jun 21, 2012 at 8:17 AM, Ulrich Drepper <[email protected]> wrote:
>> On Wed, Jun 20, 2012 at 11:50 PM, Yinghai Lu <[email protected]> wrote:
>>> please check -v3, and it will add 40 lines.
>>
>> Won't this version print out something like
>>
>> ?PCI: unknown option `busnum_node=xx:yy'
>
> It does.

yes, but only when you have pci=busnum_node...

i changed grammar to only "busnum_node=..."

>
> If you apply the following patch on top of your version 3 patch it
> seems to work and is slightly more efficient.
>
>
> Signed-off-by: Ulrich Drepper <[email protected]>
>
> diff -u b/arch/x86/pci/common.c b/arch/x86/pci/common.c
> --- b/arch/x86/pci/common.c
> +++ b/arch/x86/pci/common.c
> @@ -494,16 +494,22 @@
> ? ? ? ?return 0;
> ?}
>
> +static const char *busnum_node_param;
> +
> +static void remember_busnum_node(const char *str)
> +{
> + ? ? ? busnum_node_param = str;
> +}
> +
> ?int get_user_busnum_node(int busnum)
> ?{
> ? ? ? ?int bus, node, count;
> - ? ? ? char *p;
> + ? ? ? const char *p;
>
> - ? ? ? p = strstr(boot_command_line, "busnum_node=");
> + ? ? ? p = busnum_node_param;
> ? ? ? ?if (!p)
> ? ? ? ? ? ? ? ?return -1;
>
> - ? ? ? p += 12; /* strlen("busnum_node=") */
> ? ? ? ?while (*p) {
> ? ? ? ? ? ? ? ?count = 0;
> ? ? ? ? ? ? ? ?if (sscanf(p, "%x:%x%n", &bus, &node, &count) != 2) {
> @@ -607,6 +613,9 @@
> ? ? ? ?} else if (!strcmp(str, "nocrs")) {
> ? ? ? ? ? ? ? ?pci_probe |= PCI_ROOT_NO_CRS;
> ? ? ? ? ? ? ? ?return NULL;
> + ? ? ? } else if (!strncmp(str, "busnum_node=", 12)) {
> + ? ? ? ? ? ? ? remember_busnum_node(str + 12);
> + ? ? ? ? ? ? ? return NULL;
> ? ? ? ?} else if (!strcmp(str, "earlydump")) {
> ? ? ? ? ? ? ? ?pci_early_dump_regs = 1;
> ? ? ? ? ? ? ? ?return NULL;

Yes, that would be better.

Thanks

Yinghai

2012-06-21 19:24:10

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Wed, Jun 20, 2012 at 10:56 PM, Brice Goglin <[email protected]> wrote:
>> I suspected that could be other problem. intel system before
>> sandbridge does not have IIO.
>> they will have ioh instead, and one ioh would connect to two cpu sockets.
>> but _PXM for root bus in dsdt only can return one value. Aka it is
>> acpi spec limitation.
>
> The machines I am talking about have two IOHs (one connected to each
> socket). I only have an old 2.6.27 boot log at hand, I'll ask my admins
> for more. Do you want the 2.6.27 dmesg anyway?
>

log from 2.6.27 should be good enough for this.

BIOS _PXM return 0 for both pci root buses?

those systems with two IOHs really deserve good BIOS.

Hope we can have more updated machines with open sourced firmware.

Thanks

Yinghai

2012-06-22 07:14:17

by Brice Goglin

[permalink] [raw]
Subject: Re: SNB PCI root information

Le 21/06/2012 21:24, Yinghai Lu a ?crit :
> On Wed, Jun 20, 2012 at 10:56 PM, Brice Goglin <[email protected]> wrote:
>>> I suspected that could be other problem. intel system before
>>> sandbridge does not have IIO.
>>> they will have ioh instead, and one ioh would connect to two cpu sockets.
>>> but _PXM for root bus in dsdt only can return one value. Aka it is
>>> acpi spec limitation.
>> The machines I am talking about have two IOHs (one connected to each
>> socket). I only have an old 2.6.27 boot log at hand, I'll ask my admins
>> for more. Do you want the 2.6.27 dmesg anyway?
>>
> log from 2.6.27 should be good enough for this.
>
> BIOS _PXM return 0 for both pci root buses?
>

Here's dmesg.
I can't say for sure whether _PXM returns 0 since I don't know how to
read all this :) But Linux puts the first socket cpumap in
/sys/bus/pci/devices/*/local_cpus (and that's wrong according to the
motherboard manual and according to the performance we see).

Brice


Attachments:
dmesg.txt (190.92 kB)

2012-06-22 17:28:59

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Fri, Jun 22, 2012 at 12:14 AM, Brice Goglin <[email protected]> wrote:

>> BIOS _PXM return 0 for both pci root buses?
>>
>
> Here's dmesg.
> I can't say for sure whether _PXM returns 0 since I don't know how to
> read all this :) But Linux puts the first socket cpumap in
> /sys/bus/pci/devices/*/local_cpus (and that's wrong according to the
> motherboard manual and according to the performance we see).
>

hi, looks like you system bios does not provide _PXM for the root bus.

Thanks

Yinghai

2012-06-22 20:38:20

by Brice Goglin

[permalink] [raw]
Subject: Re: SNB PCI root information

Le 22/06/2012 19:28, Yinghai Lu a ?crit :
> On Fri, Jun 22, 2012 at 12:14 AM, Brice Goglin <[email protected]> wrote:
>
>>> BIOS _PXM return 0 for both pci root buses?
>>>
>> Here's dmesg.
>> I can't say for sure whether _PXM returns 0 since I don't know how to
>> read all this :) But Linux puts the first socket cpumap in
>> /sys/bus/pci/devices/*/local_cpus (and that's wrong according to the
>> motherboard manual and according to the performance we see).
>>
> hi, looks like you system bios does not provide _PXM for the root bus.
>

So why does Linux say that all buses are close to socket 0 instead of
close to everything as usual?

Brice

2012-06-22 20:41:37

by Yinghai Lu

[permalink] [raw]
Subject: Re: SNB PCI root information

On Fri, Jun 22, 2012 at 1:38 PM, Brice Goglin <[email protected]> wrote:
> Le 22/06/2012 19:28, Yinghai Lu a ?crit :
>> On Fri, Jun 22, 2012 at 12:14 AM, Brice Goglin <[email protected]> wrote:
>>
>>>> BIOS _PXM return 0 for both pci root buses?
>>>>
>>> Here's dmesg.
>>> I can't say for sure whether _PXM returns 0 since I don't know how to
>>> read all this :) But Linux puts the first socket cpumap in
>>> /sys/bus/pci/devices/*/local_cpus (and that's wrong according to the
>>> motherboard manual and according to the performance we see).
>>>
>> hi, looks like you system bios does not provide _PXM for the root bus.
>>
>
> So why does Linux say that all buses are close to socket 0 instead of
> close to everything as usual?

if (bus && node != -1) {
#ifdef CONFIG_ACPI_NUMA
if (pxm >= 0)
dev_printk(KERN_DEBUG, &bus->dev,
"on NUMA node %d (pxm %d)\n", node, pxm);
#else
dev_printk(KERN_DEBUG, &bus->dev, "on NUMA node %d\n", node);
#endif
}

so can you boot with "debug ignore_loglevel" ?

Thanks

Yinghai

2012-06-25 09:07:36

by Brice Goglin

[permalink] [raw]
Subject: Re: SNB PCI root information

Le 22/06/2012 22:41, Yinghai Lu a ?crit :
> On Fri, Jun 22, 2012 at 1:38 PM, Brice Goglin <[email protected]> wrote:
>> Le 22/06/2012 19:28, Yinghai Lu a ?crit :
>>> On Fri, Jun 22, 2012 at 12:14 AM, Brice Goglin <[email protected]> wrote:
>>>
>>>>> BIOS _PXM return 0 for both pci root buses?
>>>>>
>>>> Here's dmesg.
>>>> I can't say for sure whether _PXM returns 0 since I don't know how to
>>>> read all this :) But Linux puts the first socket cpumap in
>>>> /sys/bus/pci/devices/*/local_cpus (and that's wrong according to the
>>>> motherboard manual and according to the performance we see).
>>>>
>>> hi, looks like you system bios does not provide _PXM for the root bus.
>>>
>> So why does Linux say that all buses are close to socket 0 instead of
>> close to everything as usual?
> if (bus && node != -1) {
> #ifdef CONFIG_ACPI_NUMA
> if (pxm >= 0)
> dev_printk(KERN_DEBUG, &bus->dev,
> "on NUMA node %d (pxm %d)\n", node, pxm);
> #else
> dev_printk(KERN_DEBUG, &bus->dev, "on NUMA node %d\n", node);
> #endif
> }
>
> so can you boot with "debug ignore_loglevel" ?
>

I can't find anything relevant in the debug log. The code has changed a
lot since my old SLES11 but I don't see any old message either. I guess
"bus && node != -1 && pxm>=0" evaluates to false.

Brice

2012-06-25 17:55:18

by Ulrich Drepper

[permalink] [raw]
Subject: Re: SNB PCI root information

On Thu, Jun 21, 2012 at 2:11 PM, Yinghai Lu <[email protected]> wrote:
> Yes, that would be better.
> [...]

Can you post a complete patch so that it can be queued up for inclusion?