2006-09-08 03:14:24

by Matt Domsch

[permalink] [raw]
Subject: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

Problem:
New Dell PowerEdge servers have 2 embedded ethernet ports, which are
labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and
in the printed documentation. Assuming no other add-in ethernet ports
in the system, Linux 2.4 kernels name these eth0 and eth1
respectively. Many people have come to expect this naming. Linux 2.6
kernels name these eth1 and eth0 respectively (backwards from
expectations). I also have reports that various Sun and HP servers
have similar behavior.


Root cause:
Linux 2.4 kernels walk the pci_devices list, which happens to be
sorted in breadth-first order (or pcbios_find_device order on i386,
which most often is breadth-first also). 2.6 kernels have both the pci_devices
list and the pci_bus_type.klist_devices list, the latter is what is
walked at driver load time to match the pci_id tables; this klist
happens to be in depth-first order.

On systems where, for physical routing reasons, NIC1 appears on a
lower bus number than NIC2, but NIC2's bridge is discovered first in
the depth-first ordering, NIC2 will be discovered before NIC1. If the
list were sorted breadth-first, NIC1 woudl be discovered before NIC2.

A PowerEdge 1955 system has the following topology which easily
exhibits the difference between depth-first and breadth-first device
lists.

-[0000:00]-+-00.0 Intel Corporation 5000P Chipset Memory Controller Hub
+-02.0-[0000:03-08]--+-00.0-[0000:04-07]--+-00.0-[0000:05-06]----00.0-[0000:06]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC2, 2.4 kernel name eth1, 2.6 kernel name eth0)
+-1c.0-[0000:01-02]----00.0-[0000:02]----00.0 Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (labeled NIC1, 2.4 kernel name eth0, 2.6 kernel name eth1)


Other factors, such as device driver load order and the presence of
PCI slots at various points in the bus hierarchy further complicate
this problem; I'm not trying to solve those here, just restore the
device order, and thus basic behavior, that 2.4 kernels had.


Solution:

The solution can come in multiple steps.

Suggested fix #1: kernel
Patch below sorts the two device lists into breadth-first ordering to
maintain compatibility with 2.4 kernels. It also overloads the
'pci=nosort' option to disable the breadth-first sort (and on i386 it
continues to disable the pcibios_find_device sort as well).

Suggested fix #2: udev rules from userland
Many people also have the expectation that embedded NICs are always
discovered before add-in NICs (which this patch does not try to do).
Using the PCI IRQ Routing Table provided by system BIOS, it's easy to
determine which PCI devices are embedded, or if add-in, which PCI slot
they're in. I'm working on a tool that would allow udev to name
ethernet devices in ascending embedded, slot 1 .. slot N order,
subsort by PCI bus/dev/fn breadth-first. It'll be possible to use it
independent of udev as well for those distributions that don't use
udev in their installers.

Suggested fix #3: system board routing rules
One can constrain the system board layout to put NIC1 ahead of NIC2
regardless of breadth-first or depth-first discovery order. This adds
a significant level of complexity to board routing, and may not be
possible in all instances (witness the above systems from several
major manufacturers). I don't want to encourage this particular train
of thought too far, at the expense of not doing #1 or #2 above.


Feedback appreciated. Patch tested on a Dell PowerEdge 1955 blade
with 2.6.18-rc5.

You'll also note I took some liberty and temporarily break the klist
abstraction to simplify and speed up the sort algorithm. I think
that's both safe and appropriate in this instance.

Thanks,
Matt

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com


diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b50595a..192435a 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1186,7 +1186,11 @@ running once the system is up.
nomsi [MSI] If the PCI_MSI kernel config parameter is
enabled, this kernel boot option can be used to
disable the use of MSI interrupts system-wide.
- nosort [IA-32] Don't sort PCI devices according to
+ nosort Don't sort PCI devices into breadth-first order.
+ This sorting is done to get a device
+ order compatible with older (<= 2.4) kernels.
+ and
+ [IA-32] Don't sort PCI devices according to
order given by the PCI BIOS. This sorting is
done to get a device order compatible with
older kernels.
diff --git a/arch/i386/pci/common.c b/arch/i386/pci/common.c
index 0a362e3..86657cc 100644
--- a/arch/i386/pci/common.c
+++ b/arch/i386/pci/common.c
@@ -189,6 +189,8 @@ static int __init pcibios_init(void)

pcibios_resource_survey();

+ if (!(pci_probe & PCI_NO_SORT))
+ pci_sort_breadthfirst();
#ifdef CONFIG_PCI_BIOS
if ((pci_probe & PCI_BIOS_SORT) && !(pci_probe & PCI_NO_SORT))
pcibios_sort();
@@ -203,6 +205,9 @@ char * __devinit pcibios_setup(char *st
if (!strcmp(str, "off")) {
pci_probe = 0;
return NULL;
+ } else if (!strcmp(str, "nosort")) {
+ pci_probe |= PCI_NO_SORT;
+ return NULL;
}
#ifdef CONFIG_PCI_BIOS
else if (!strcmp(str, "bios")) {
@@ -210,9 +215,6 @@ char * __devinit pcibios_setup(char *st
return NULL;
} else if (!strcmp(str, "nobios")) {
pci_probe &= ~PCI_PROBE_BIOS;
- return NULL;
- } else if (!strcmp(str, "nosort")) {
- pci_probe |= PCI_NO_SORT;
return NULL;
} else if (!strcmp(str, "biosirq")) {
pci_probe |= PCI_BIOS_IRQ_SCAN;
diff --git a/arch/i386/pci/pci.h b/arch/i386/pci/pci.h
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index c5a58d1..3c21c25 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1055,3 +1055,95 @@ EXPORT_SYMBOL(pci_scan_bridge);
EXPORT_SYMBOL(pci_scan_single_device);
EXPORT_SYMBOL_GPL(pci_scan_child_bus);
#endif
+
+static int pci_sort_bf_cmp(const struct pci_dev *a, const struct pci_dev *b)
+{
+ if (pci_domain_nr(a->bus) < pci_domain_nr(b->bus)) return -1;
+ else if (pci_domain_nr(a->bus) > pci_domain_nr(b->bus)) return 1;
+
+ if (a->bus->number < b->bus->number) return -1;
+ else if (a->bus->number > b->bus->number) return 1;
+
+ if (a->devfn < b->devfn) return -1;
+ else if (a->devfn > b->devfn) return 1;
+
+ return 0;
+}
+
+/*
+ * Yes, this forcably breaks the klist abstraction temporarily. It
+ * just wants to sort the klist, not change reference counts and
+ * take/drop locks rapidly in the process. It does all this while
+ * holding the lock for the list, so objects can't otherwise be
+ * added/removed while we're swizzling.
+ */
+
+static void pci_insertion_sort_klist(struct pci_dev *a, struct list_head *list,
+ int (*cmp)(const struct pci_dev *, const struct pci_dev *))
+{
+ struct list_head *pos;
+ struct klist_node *n;
+ struct device *dev;
+ struct pci_dev *b;
+ list_for_each(pos, list) {
+ n = container_of(pos, struct klist_node, n_node);
+ dev = container_of(n, struct device, knode_bus);
+ b = to_pci_dev(dev);
+ if (cmp(a, b) <= 0) {
+ list_move_tail(&a->dev.knode_bus.n_node, &b->dev.knode_bus.n_node);
+ return;
+ }
+ }
+ list_move_tail(&a->dev.knode_bus.n_node, list);
+}
+
+static void pci_sort_breadthfirst_klist(void)
+{
+ LIST_HEAD(sorted_devices);
+ struct list_head *pos, *tmp;
+ struct klist_node *n;
+ struct device *dev;
+ struct pci_dev *pdev;
+ spin_lock(&pci_bus_type.klist_devices.k_lock);
+ list_for_each_safe(pos, tmp, &pci_bus_type.klist_devices.k_list) {
+ n = container_of(pos, struct klist_node, n_node);
+ dev = container_of(n, struct device, knode_bus);
+ pdev = to_pci_dev(dev);
+ pci_insertion_sort_klist(pdev, &sorted_devices, pci_sort_bf_cmp);
+ }
+ list_splice(&sorted_devices, &pci_bus_type.klist_devices.k_list);
+ spin_unlock(&pci_bus_type.klist_devices.k_lock);
+}
+
+static void pci_insertion_sort_devices(struct pci_dev *a, struct list_head *list,
+ int (*cmp)(const struct pci_dev *, const struct pci_dev *))
+{
+ struct pci_dev *b;
+ list_for_each_entry(b, list, global_list) {
+ if (cmp(a, b) <= 0) {
+ list_move_tail(&a->global_list, &b->global_list);
+ return;
+ }
+ }
+ list_move_tail(&a->global_list, list);
+}
+
+static void pci_sort_breadthfirst_devices(void)
+{
+ LIST_HEAD(sorted_devices);
+ struct pci_dev *dev, *tmp;
+
+ down_write(&pci_bus_sem);
+ list_for_each_entry_safe(dev, tmp, &pci_devices, global_list) {
+ pci_insertion_sort_devices(dev, &sorted_devices, pci_sort_bf_cmp);
+ }
+ list_splice(&sorted_devices, &pci_devices);
+ up_write(&pci_bus_sem);
+}
+
+void pci_sort_breadthfirst(void)
+{
+ pci_sort_breadthfirst_devices();
+ pci_sort_breadthfirst_klist();
+}
+
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 8565b81..3011715 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -437,6 +437,7 @@ extern void pci_dev_put(struct pci_dev *
extern void pci_remove_bus(struct pci_bus *b);
extern void pci_remove_bus_device(struct pci_dev *dev);
void pci_setup_cardbus(struct pci_bus *bus);
+extern void pci_sort_breadthfirst(void);

/* Generic PCI functions exported to card drivers */


2006-09-08 15:49:36

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Thu, Sep 07, 2006 at 10:14:22PM -0500, Matt Domsch wrote:
> Problem:
> New Dell PowerEdge servers have 2 embedded ethernet ports, which are
> labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and
> in the printed documentation. Assuming no other add-in ethernet ports
> in the system, Linux 2.4 kernels name these eth0 and eth1
> respectively. Many people have come to expect this naming. Linux 2.6
> kernels name these eth1 and eth0 respectively (backwards from
> expectations). I also have reports that various Sun and HP servers
> have similar behavior.

This came up years back when 2.6 was something new, and the answer
then was 'bind the interface to the MAC address'.

Whilst your patch will fix the case that's currently broken (2.4->2.6),
doesn't it offer equal possibility to break existing setups when people move
from <=2.6.18 -> 2.6.19 ?

Dave

2006-09-08 16:18:17

by Matt Domsch

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Fri, Sep 08, 2006 at 11:56:39AM -0400, Dave Jones wrote:
> On Thu, Sep 07, 2006 at 10:14:22PM -0500, Matt Domsch wrote:
> > Problem:
> > New Dell PowerEdge servers have 2 embedded ethernet ports, which are
> > labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and
> > in the printed documentation. Assuming no other add-in ethernet ports
> > in the system, Linux 2.4 kernels name these eth0 and eth1
> > respectively. Many people have come to expect this naming. Linux 2.6
> > kernels name these eth1 and eth0 respectively (backwards from
> > expectations). I also have reports that various Sun and HP servers
> > have similar behavior.
>
> This came up years back when 2.6 was something new, and the answer
> then was 'bind the interface to the MAC address'.

Both Red Hat-based distros and openSuSE-based distros do something
like this with configuration files automatically. Red Hat's
anaconda/kudzu puts the HWADDR lines in the generated
/etc/sysconfig/network-scripts/ifcfg-* files. openSuSE's udev rules
puts lines in /etc/udev/rules.d/30-net_persistent_names.rules the
first time it discovers a new interface. Both methods are intended to
maintain a persistent name for each NIC, after being set up the first
time. Neither deals well with replacing one NIC with another - you
must edit the config files.

This works pretty well post-install. It doesn't work well at install
time, all the installers use the kernel's original names, and then
those names become the persistent names in the config files.


> Whilst your patch will fix the case that's currently broken (2.4->2.6),
> doesn't it offer equal possibility to break existing setups when people move
> from <=2.6.18 -> 2.6.19 ?

If they're using config files / udev rules as suggested, it shouldn't
break them.

If they're not, then yes, this could. Debian's
/etc/network/interfaces file allows use of hwaddr fields, though by
default it doesn't appear anything sets it up.

I'm open to suggestions on how *not* to break setups that don't use
the MAC addresses.

Thanks,
Matt

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2006-09-08 16:46:05

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Fri, Sep 08, 2006 at 11:56:39AM -0400, Dave Jones wrote:
> On Thu, Sep 07, 2006 at 10:14:22PM -0500, Matt Domsch wrote:
> > Problem:
> > New Dell PowerEdge servers have 2 embedded ethernet ports, which are
> > labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and
> > in the printed documentation. Assuming no other add-in ethernet ports
> > in the system, Linux 2.4 kernels name these eth0 and eth1
> > respectively. Many people have come to expect this naming. Linux 2.6
> > kernels name these eth1 and eth0 respectively (backwards from
> > expectations). I also have reports that various Sun and HP servers
> > have similar behavior.
>
> This came up years back when 2.6 was something new, and the answer
> then was 'bind the interface to the MAC address'.

It came up then? I don't recall it, if so, I hope I didn't say that...

We should fix this, it's obviously a regression. Thanks a lot Matt for
tracking this down and fixing it.

> Whilst your patch will fix the case that's currently broken (2.4->2.6),
> doesn't it offer equal possibility to break existing setups when people move
> from <=2.6.18 -> 2.6.19 ?

It might, but I'll take that heat, we do have a command line option that
returns the functionality to the "broken" way.

Let's see what happens in the next -mm.

thanks,

greg k-h

2006-09-08 16:55:46

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Fri, 2006-09-08 at 11:18 -0500, Matt Domsch wrote:
> On Fri, Sep 08, 2006 at 11:56:39AM -0400, Dave Jones wrote:
> > On Thu, Sep 07, 2006 at 10:14:22PM -0500, Matt Domsch wrote:
> > > Problem:
> > > New Dell PowerEdge servers have 2 embedded ethernet ports, which are
> > > labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and
> > > in the printed documentation. Assuming no other add-in ethernet ports
> > > in the system, Linux 2.4 kernels name these eth0 and eth1
> > > respectively. Many people have come to expect this naming. Linux 2.6
> > > kernels name these eth1 and eth0 respectively (backwards from
> > > expectations). I also have reports that various Sun and HP servers
> > > have similar behavior.
> >
> > This came up years back when 2.6 was something new, and the answer
> > then was 'bind the interface to the MAC address'.
>
> Both Red Hat-based distros and openSuSE-based distros do something
> like this with configuration files automatically. Red Hat's
> anaconda/kudzu puts the HWADDR lines in the generated
> /etc/sysconfig/network-scripts/ifcfg-* files. openSuSE's udev rules
> puts lines in /etc/udev/rules.d/30-net_persistent_names.rules the
> first time it discovers a new interface. Both methods are intended to
> maintain a persistent name for each NIC, after being set up the first
> time. Neither deals well with replacing one NIC with another - you
> must edit the config files.
>
> This works pretty well post-install. It doesn't work well at install
> time, all the installers use the kernel's original names, and then
> those names become the persistent names in the config files.
>
>
> > Whilst your patch will fix the case that's currently broken (2.4->2.6),
> > doesn't it offer equal possibility to break existing setups when people move
> > from <=2.6.18 -> 2.6.19 ?
>
> If they're using config files / udev rules as suggested, it shouldn't
> break them.
>
> If they're not, then yes, this could. Debian's
> /etc/network/interfaces file allows use of hwaddr fields, though by
> default it doesn't appear anything sets it up.
>
> I'm open to suggestions on how *not* to break setups that don't use
> the MAC addresses.

to be honest I really don't like the PCI ordering change thing for this.
It's just too fragile altogether to cause a fixed ordering as you want.

Maybe the kernel's initial ordering should do a numeric sort by mac
address or something.. (or userspace should)


2006-09-08 18:22:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Thu, 7 Sep 2006 22:14:22 -0500
Matt Domsch <[email protected]> wrote:

> @@ -189,6 +189,8 @@ static int __init pcibios_init(void)
>
> pcibios_resource_survey();
>
> + if (!(pci_probe & PCI_NO_SORT))
> + pci_sort_breadthfirst();
>
> ...
>
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1055,3 +1055,95 @@ EXPORT_SYMBOL(pci_scan_bridge);
> EXPORT_SYMBOL(pci_scan_single_device);
> EXPORT_SYMBOL_GPL(pci_scan_child_bus);
> #endif
> +
> +static int pci_sort_bf_cmp(const struct pci_dev *a, const struct pci_dev *b)
> +static void pci_insertion_sort_klist(struct pci_dev *a, struct list_head *list,
> +static void pci_sort_breadthfirst_klist(void)
> +static void pci_insertion_sort_devices(struct pci_dev *a, struct list_head *list,
> +static void pci_sort_breadthfirst_devices(void)
> +void pci_sort_breadthfirst(void)

I think all these functions can+should be __init?

> +extern void pci_sort_breadthfirst(void);

In which case this needs the __init tag too (new rule, due to frv (at least)).

2006-09-08 18:52:34

by Matt Domsch

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Fri, Sep 08, 2006 at 11:20:35AM -0700, Andrew Morton wrote:
> On Thu, 7 Sep 2006 22:14:22 -0500
> Matt Domsch <[email protected]> wrote:
>
> > @@ -189,6 +189,8 @@ static int __init pcibios_init(void)
> >
> > pcibios_resource_survey();
> >
> > + if (!(pci_probe & PCI_NO_SORT))
> > + pci_sort_breadthfirst();
> >
> > ...
> >
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> > @@ -1055,3 +1055,95 @@ EXPORT_SYMBOL(pci_scan_bridge);
> > EXPORT_SYMBOL(pci_scan_single_device);
> > EXPORT_SYMBOL_GPL(pci_scan_child_bus);
> > #endif
> > +
> > +static int pci_sort_bf_cmp(const struct pci_dev *a, const struct pci_dev *b)
> > +static void pci_insertion_sort_klist(struct pci_dev *a, struct list_head *list,
> > +static void pci_sort_breadthfirst_klist(void)
> > +static void pci_insertion_sort_devices(struct pci_dev *a, struct list_head *list,
> > +static void pci_sort_breadthfirst_devices(void)
> > +void pci_sort_breadthfirst(void)
>
> I think all these functions can+should be __init?
>
> > +extern void pci_sort_breadthfirst(void);
>
> In which case this needs the __init tag too (new rule, due to frv (at least)).

Will fix up immediately. I also see that it's making indirect calls
through the (*cmp)() pointer, which is interesting only of there's
other code that's going to be doing insertion sorting. For now, a
direct call to the comparison function would be a little cleaner.

Thanks for the review!
Matt

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2006-09-08 18:59:17

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Fri, Sep 08, 2006 at 11:20:35AM -0700, Andrew Morton wrote:
> > +extern void pci_sort_breadthfirst(void);
>
> In which case this needs the __init tag too (new rule, due to frv (at least)).

Some time ago I've asked David Howells about it:

> Does this apply to function prototypes?

No, because functions can't be crammed into such a small space.

2006-09-09 09:05:35

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

I wanted to note what Martin Mares just raised in the thread "State of
the Linux PCI Subsystem for 2.6.18-rc6":

> Changing the device order in the middle of the 2.6 cycle doesn't sound
> like a sane idea to me. Many people have changed their systems' configuration
> to adapt to the 2.6 ordering and this patch would break their setups.
> I have seen many such examples in my vicinity.
>
> I believe that not breaking existing 2.6 setups is much more important
> than keeping compatibility with 2.4 kernels, especially when the problem
> is discovered after more than 2 years after release of the first 2.6.


I'm not siding with either Martin or Matt explicitly, just noting that
there are good arguments for both sides.

Jeff



2006-09-10 09:34:24

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

Hi!

> I wanted to note what Martin Mares just raised in the
> thread "State of the Linux PCI Subsystem for 2.6.18-rc6":
>
> >Changing the device order in the middle of the 2.6
> >cycle doesn't sound
> >like a sane idea to me. Many people have changed their
> >systems' configuration
> >to adapt to the 2.6 ordering and this patch would break
> >their setups.
> >I have seen many such examples in my vicinity.
> >
> >I believe that not breaking existing 2.6 setups is much
> >more important
> >than keeping compatibility with 2.4 kernels, especially
> >when the problem
> >is discovered after more than 2 years after release of
> >the first 2.6.
>
>
> I'm not siding with either Martin or Matt explicitly,
> just noting that there are good arguments for both sides.

I agree with martin here. 'Lets break all the machines where people
are currently using 2.6.x, for benefit of people currently running
2.4.x' is *very* bad idea.
Pavel
--
Thanks for all the (sleeping) penguins.

2006-09-10 11:28:59

by Stefan Richter

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

Pavel Machek wrote:
> I agree with martin here. 'Lets break all the machines where people
> are currently using 2.6.x, for benefit of people currently running
> 2.4.x' is *very* bad idea.

Small correction: Not all but "only" setups which rely on device
ordering instead of persistent unique device properties are affected.
--
Stefan Richter
-=====-=-==- =--= -=-=-
http://arcgraph.de/sr/

2006-09-10 13:08:28

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Sat, Sep 09, 2006 at 05:05:25AM -0400, Jeff Garzik wrote:
> I wanted to note what Martin Mares just raised in the thread "State of
> the Linux PCI Subsystem for 2.6.18-rc6":
>
> >Changing the device order in the middle of the 2.6 cycle doesn't sound
> >like a sane idea to me. Many people have changed their systems'
> >configuration
> >to adapt to the 2.6 ordering and this patch would break their setups.
> >I have seen many such examples in my vicinity.
> >
> >I believe that not breaking existing 2.6 setups is much more important
> >than keeping compatibility with 2.4 kernels, especially when the problem
> >is discovered after more than 2 years after release of the first 2.6.
>
>
> I'm not siding with either Martin or Matt explicitly, just noting that
> there are good arguments for both sides.

I agree, there are.

But I know we explicitly tried to keep 2.4 compability in 2.6 for this
very thing. And the fact that almost no one has reported having
problems with it, leads me to believe that the sorting order isn't as
broken as you might think. Some machines, like the ones that Matt wrote
this patch for, need this change, but for all others, it's not really
needed.

Let's let this sit in -mm for a bit to see if anyone notices any
problems.

thanks,

greg k-h

2006-09-11 19:32:10

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

Arjan van de Ven wrote:
> On Fri, 2006-09-08 at 11:18 -0500, Matt Domsch wrote:
>> On Fri, Sep 08, 2006 at 11:56:39AM -0400, Dave Jones wrote:
>>> On Thu, Sep 07, 2006 at 10:14:22PM -0500, Matt Domsch wrote:
>>> > Problem:
>>> > New Dell PowerEdge servers have 2 embedded ethernet ports, which are
>>> > labeled NIC1 and NIC2 on the chassis, in the BIOS setup screens, and
>>> > in the printed documentation. Assuming no other add-in ethernet ports
>>> > in the system, Linux 2.4 kernels name these eth0 and eth1
>>> > respectively. Many people have come to expect this naming. Linux 2.6
>>> > kernels name these eth1 and eth0 respectively (backwards from
>>> > expectations). I also have reports that various Sun and HP servers
>>> > have similar behavior.
>>>
>>> This came up years back when 2.6 was something new, and the answer
>>> then was 'bind the interface to the MAC address'.
>> Both Red Hat-based distros and openSuSE-based distros do something
>> like this with configuration files automatically. Red Hat's
>> anaconda/kudzu puts the HWADDR lines in the generated
>> /etc/sysconfig/network-scripts/ifcfg-* files. openSuSE's udev rules
>> puts lines in /etc/udev/rules.d/30-net_persistent_names.rules the
>> first time it discovers a new interface. Both methods are intended to
>> maintain a persistent name for each NIC, after being set up the first
>> time. Neither deals well with replacing one NIC with another - you
>> must edit the config files.
>>
>> This works pretty well post-install. It doesn't work well at install
>> time, all the installers use the kernel's original names, and then
>> those names become the persistent names in the config files.
>>
>>
>>> Whilst your patch will fix the case that's currently broken (2.4->2.6),
>>> doesn't it offer equal possibility to break existing setups when people move
>>> from <=2.6.18 -> 2.6.19 ?
>> If they're using config files / udev rules as suggested, it shouldn't
>> break them.
>>
>> If they're not, then yes, this could. Debian's
>> /etc/network/interfaces file allows use of hwaddr fields, though by
>> default it doesn't appear anything sets it up.
>>
>> I'm open to suggestions on how *not* to break setups that don't use
>> the MAC addresses.
>
> to be honest I really don't like the PCI ordering change thing for this.
> It's just too fragile altogether to cause a fixed ordering as you want.
>
> Maybe the kernel's initial ordering should do a numeric sort by mac
> address or something.. (or userspace should)
>
>
That wouldn't match any existing setup, and would be subject to mid-list
insertions if a NIC were added/replaced. And that is fragile.

I was looking for an easy way to do PCI slot to MAC, and from there MAC
to IP, so any NIC plugged into a given slot could be called eth0 (for
instance) and given the "right" IP address, but that's not easy. Can be
done with some searching in /sys, but it's non-trivial.

--
Bill Davidsen <[email protected]>
Obscure bug of 2004: BASH BUFFER OVERFLOW - if bash is being run by a
normal user and is setuid root, with the "vi" line edit mode selected,
and the character set is "big5," an off-by-one errors occurs during
wildcard (glob) expansion.

2006-09-12 03:08:46

by Matt Domsch

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Mon, Sep 11, 2006 at 03:36:06PM -0400, Bill Davidsen wrote:
> Arjan van de Ven wrote:
> >Maybe the kernel's initial ordering should do a numeric sort by mac
> >address or something.. (or userspace should)
> >
> That wouldn't match any existing setup, and would be subject to mid-list
> insertions if a NIC were added/replaced. And that is fragile.
>
> I was looking for an easy way to do PCI slot to MAC, and from there MAC
> to IP, so any NIC plugged into a given slot could be called eth0 (for
> instance) and given the "right" IP address, but that's not easy. Can be
> done with some searching in /sys, but it's non-trivial.

So, I did almost this in
userspace. http://linux.dell.com/files/name_eths/. It uses the PCI
IRQ Routing table to determine if a PCI device is embedded on the
motherboard, or is in an add-in slot, and if so, which slot. It
orders the list of thereby possible PCI NICs with all the embeddeds
first, in ascending PCI breadth-first order, then orders all the
add-in NICs in PCI slot number ascending order, subsort PCI
breadth-first for those nifty multiport cards. It rewrites
modprobe.conf to load network drivers 'proper' order and outputs an
/etc/mactab file that can be used by the second half of the script to
write HWADDR lines into Red Hat-style ifcfg-eth* files, and into
openSuSE udev ethernet rules file.

This works great, until you add another NIC into an add-in slot
somewhere in the middle (e.g. you have 2 embedded NICs eth0 and eth1,
and a NIC card in PCI slot 4 eth2, then at some later point you add a
NIC card into PCI slot 2). You either have to manually configure a
name for the new card, or run name_eths again and expect the NIC in
PCI slot 2 to become eth2, and the one in slot4 to become eth3.

The pure C udev helper I'm working on behaves similarly, though would
negate the need to edit config files, as it assigns a name "on the
fly" at device discovery time.

For the relatively rare cases of adding a NIC, I'm OK with this. I
don't have a better way to handle it, but am open to ideas.

We could assign names like eth-embedded-1, eth-embedded-2,
eth-slot2-1, eth-slot4-1 if we wanted to change how people think of
ethernet names (and this would be similar to how large network
switches work: blade N, port M. We've got 15 usable chars in the name
after all...

Thanks,
Matt

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2006-09-12 20:59:00

by Bill Davidsen

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

Matt Domsch wrote:

>On Mon, Sep 11, 2006 at 03:36:06PM -0400, Bill Davidsen wrote:
>
>
>>Arjan van de Ven wrote:
>>
>>
>>>Maybe the kernel's initial ordering should do a numeric sort by mac
>>>address or something.. (or userspace should)
>>>
>>>
>>>
>>That wouldn't match any existing setup, and would be subject to mid-list
>>insertions if a NIC were added/replaced. And that is fragile.
>>
>>I was looking for an easy way to do PCI slot to MAC, and from there MAC
>>to IP, so any NIC plugged into a given slot could be called eth0 (for
>>instance) and given the "right" IP address, but that's not easy. Can be
>>done with some searching in /sys, but it's non-trivial.
>>
>>
>
>So, I did almost this in
>userspace. http://linux.dell.com/files/name_eths/. It uses the PCI
>IRQ Routing table to determine if a PCI device is embedded on the
>motherboard, or is in an add-in slot, and if so, which slot. It
>orders the list of thereby possible PCI NICs with all the embeddeds
>first, in ascending PCI breadth-first order, then orders all the
>add-in NICs in PCI slot number ascending order, subsort PCI
>breadth-first for those nifty multiport cards. It rewrites
>modprobe.conf to load network drivers 'proper' order and outputs an
>/etc/mactab file that can be used by the second half of the script to
>write HWADDR lines into Red Hat-style ifcfg-eth* files, and into
>openSuSE udev ethernet rules file.
>
>
I am not (generally) concerned with the embedded NICs, it's the slots I
have had bite me. I do see what you're doing, although I didn't do it
quite the same way.

>This works great, until you add another NIC into an add-in slot
>somewhere in the middle (e.g. you have 2 embedded NICs eth0 and eth1,
>and a NIC card in PCI slot 4 eth2, then at some later point you add a
>NIC card into PCI slot 2). You either have to manually configure a
>name for the new card, or run name_eths again and expect the NIC in
>PCI slot 2 to become eth2, and the one in slot4 to become eth3.
>
>
And this is why I want to do a slot to name translation, which may
result in gaps in the names. That doesn't seem to matter. Then I can put
a label over the slot, like "public" or "private" and know that adding
another NIC, or replacing a failed NIC will leave my labels intact.

>The pure C udev helper I'm working on behaves similarly, though would
>negate the need to edit config files, as it assigns a name "on the
>fly" at device discovery time.
>
>For the relatively rare cases of adding a NIC, I'm OK with this. I
>don't have a better way to handle it, but am open to ideas.
>
>
You have mine. I tried putting labels on the NIC, but the machine go to
live timezones away and repairs are done by whoever has the service
contract.

>We could assign names like eth-embedded-1, eth-embedded-2,
>eth-slot2-1, eth-slot4-1 if we wanted to change how people think of
>ethernet names (and this would be similar to how large network
>switches work: blade N, port M. We've got 15 usable chars in the name
>after all...
>
Opens the door for change, for sure. Since I have had to use multi-drop
NICs and nameset like e_slot1_jack2 is an interesting concept. Or
perhaps just append the jack index no matter how many connections it
has. So eth2-0 would be a NIC and socket. Then when I run out of slots
and have to go to a two or four post card, the primary name wouldn't change.

Your idea has just enough usefulness to be dangerous ;-) Thanks.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979

2006-09-15 03:53:17

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On 9/7/06, Matt Domsch <[email protected]> wrote:
> Problem:
> Many people have come to expect this naming. Linux 2.6
> kernels name these eth1 and eth0 respectively (backwards from
> expectations). I also have reports that various Sun and HP servers
> have similar behavior.
>

On RHEL3 the 32bit and 64bit versions order the NICs differently.
64bit RHEL3 orders it the same as 2.6.

I've got a lot of systems where the NIC LEDs are labelled. The labels
are correct for 2.6 but not for 2.4 32 bit. I'm suspect the labels
were designed for Windows originally.

My boss pointed out that this patch. It was supposed to make PCI bus
ordering match 2.4.
http://kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=ffdd6e8f870ca6dd0d9b9169b8c2e0fdbae99549
It's still there, why did it stop working?

Couldn't we just use the labelling from the DMI data to order the NICs?

regards,
dan carpenter

2006-09-15 13:02:33

by Matt Domsch

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On Thu, Sep 14, 2006 at 08:53:16PM -0700, Dan Carpenter wrote:
> On 9/7/06, Matt Domsch <[email protected]> wrote:
> >Problem:
> >Many people have come to expect this naming. Linux 2.6
> >kernels name these eth1 and eth0 respectively (backwards from
> >expectations). I also have reports that various Sun and HP servers
> >have similar behavior.
> >
>
> On RHEL3 the 32bit and 64bit versions order the NICs differently.
> 64bit RHEL3 orders it the same as 2.6.
>
> I've got a lot of systems where the NIC LEDs are labelled. The labels
> are correct for 2.6 but not for 2.4 32 bit. I'm suspect the labels
> were designed for Windows originally.

2.4 i386 by default resorts the list by what BIOS reports the order
should be. No other arches do this. So I'd expect it to be opposite
of what you say.

Care to send the output of 'lspci -tv' and the first 80 or so lines of
dmidecode? This is curious.


> My boss pointed out that this patch. It was supposed to make PCI bus
> ordering match 2.4.
> http://kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=ffdd6e8f870ca6dd0d9b9169b8c2e0fdbae99549
> It's still there, why did it stop working?
>
> Couldn't we just use the labelling from the DMI data to order the NICs?

Unfortunately, DMI data doesn't include enough information. It says
"there's a port called NIC1", but doesn't say where to find it in PCI
space. :-( I'm looking at ACPI 3.0 extensions, but am not finding
what I'm needing here either yet.

--
Matt Domsch
Software Architect
Dell Linux Solutions linux.dell.com & http://www.dell.com/linux
Linux on Dell mailing lists @ http://lists.us.dell.com

2006-09-16 02:39:10

by Dan Carpenter

[permalink] [raw]
Subject: Re: [PATCH 2.6.18-rc5] PCI: sort device lists breadth-first

On 9/15/06, Matt Domsch <[email protected]> wrote:
> On Thu, Sep 14, 2006 at 08:53:16PM -0700, Dan Carpenter wrote:
> > On 9/7/06, Matt Domsch <[email protected]> wrote:
> > >Problem:
> > >Many people have come to expect this naming. Linux 2.6
> > >kernels name these eth1 and eth0 respectively (backwards from
> > >expectations). I also have reports that various Sun and HP servers
> > >have similar behavior.
> > >
> >
> > On RHEL3 the 32bit and 64bit versions order the NICs differently.
> > 64bit RHEL3 orders it the same as 2.6.
> >
> > I've got a lot of systems where the NIC LEDs are labelled. The labels
> > are correct for 2.6 but not for 2.4 32 bit. I'm suspect the labels
> > were designed for Windows originally.
>
> 2.4 i386 by default resorts the list by what BIOS reports the order
> should be. No other arches do this. So I'd expect it to be opposite
> of what you say.
>
> Care to send the output of 'lspci -tv' and the first 80 or so lines of
> dmidecode? This is curious.
>

On this system the LEDs match the BIOS and the PXE boot order and the
2.6 ordering.

-+-[06]-+-01.0-[07]--
| +-01.1 Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
| +-02.0-[08]--
| \-02.1 Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC
\-[00]-+-00.0 nVidia Corporation CK804 Memory Controller
+-01.0 nVidia Corporation CK804 ISA Bridge
+-01.1 nVidia Corporation CK804 SMBus
+-02.0 nVidia Corporation CK804 USB Controller
+-02.1 nVidia Corporation CK804 USB Controller
+-06.0 nVidia Corporation CK804 IDE
+-07.0 nVidia Corporation CK804 Serial ATA Controller
+-08.0 nVidia Corporation CK804 Serial ATA Controller
+-09.0-[01]----07.0 ATI Technologies Inc Rage XL
+-0b.0-[02]----00.0 Broadcom Corporation NetXtreme BCM5721
Gigabit Ethernet PCI Express
+-0c.0-[03]--
+-0d.0-[04]----00.0 Broadcom Corporation NetXtreme BCM5721
Gigabit Ethernet PCI Express
+-0e.0-[05]--
+-18.0 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
+-18.1 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
+-18.2 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
DRAM Controller
+-18.3 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
+-19.0 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
+-19.1 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
+-19.2 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
DRAM Controller
\-19.3 Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control

# dmidecode 2.2
SMBIOS 2.3 present.
54 structures occupying 2065 bytes.
Table at 0x000FA560.
Handle 0x0000
DMI type 0, 20 bytes.
BIOS Information
Vendor: IR2300
Version: BIOS Version 1.50 Date: 03/17/06 13:23:14
Release Date: <BAD INDEX>
Address: 0xF0000
Runtime Size: 64 kB
ROM Size: 1024 kB
Characteristics:
ISA is supported
PCI is supported
PNP is supported
BIOS is upgradeable
BIOS shadowing is allowed
ESCD support is available
Boot from CD is supported
Selectable boot is supported
BIOS ROM is socketed
EDD is supported
3.5"/2.88 MB floppy services are supported (int 13h)
Print screen service is supported (int 5h)
8042 keyboard services are supported (int 9h)
Serial services are supported (int 14h)
CGA/mono video services are supported (int 10h)
ACPI is supported
USB legacy is supported
LS-120 boot is supported
ATAPI Zip drive boot is supported
BIOS boot specification is supported
Function key-initiated network boot is supported
Handle 0x0001
DMI type 1, 25 bytes.
System Information
Manufacturer: To Be Filled By O.E.M.
Product Name: IR2300
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
UUID: 00020003-0004-0005-0006-000700080009
Wake-up Type: Power Switch
Handle 0x0002
DMI type 2, 8 bytes.
Base Board Information
Manufacturer: InventecESC
Product Name: Tuna
Version: To be filled by O.E.M.
Serial Number: To be filled by O.E.M.
Handle 0x0003
DMI type 3, 17 bytes.
Chassis Information
Manufacturer: InventecESC
Type: Hand Held
Lock: Not Present
Version: To Be Filled By O.E.M.
Serial Number: To Be Filled By O.E.M.
Asset Tag: To Be Filled By O.E.M.
Boot-up State: Safe
Power Supply State: Safe
Thermal State: Safe
Security Status: None
OEM Information: 0x00000000

regards,
dan carpenter