2023-11-30 16:57:37

by Herve Codina

[permalink] [raw]
Subject: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi,

The commit 407d1a51921e ("PCI: Create device tree node for bridge")
creates of_node for PCI devices.
During the insertion handling of these new DT nodes done by of_platform,
new devices (struct device) are created.
For each PCI devices a struct device is already present (created and
handled by the PCI core).
Creating a new device from a DT node leads to some kind of wrong struct
device duplication to represent the exact same PCI device.

This patch series first introduces device_{add,remove}_of_node() in
order to add or remove a newly created of_node to an already existing
device.
Then it fixes the DT node creation for PCI devices to add or remove the
created node to the existing PCI device without any new device creation.

Best regards,
Hervé

Changes v1 -> v2
- Patch 1
Add 'Cc: [email protected]'

Herve Codina (2):
driver core: Introduce device_{add,remove}_of_node()
PCI: of: Attach created of_node to existing device

drivers/base/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++
drivers/pci/of.c | 15 +++++++--
include/linux/device.h | 2 ++
3 files changed, 89 insertions(+), 2 deletions(-)

--
2.42.0


2023-11-30 16:57:46

by Herve Codina

[permalink] [raw]
Subject: [PATCH v2 2/2] PCI: of: Attach created of_node to existing device

The commit 407d1a51921e ("PCI: Create device tree node for bridge")
creates of_node for PCI devices.
During the insertion handling of these new DT nodes done by of_platform,
new devices (struct device) are created.
For each PCI devices a struct device is already present (created and
handled by the PCI core).
Having a second struct device to represent the exact same PCI device is
not correct.

On the of_node creation, tell the of_platform that there is no need to
create a device for this node (OF_POPULATED flag), link this newly
created of_node to the already present device and tell fwnode that the
device attached to this of_node is ready (fwnode_dev_initialized()).

With this fix, the of_node are available in the sysfs device tree:
/sys/devices/platform/soc/d0070000.pcie/
+ of_node -> .../devicetree/base/soc/pcie@d0070000
+ pci0000:00
+ 0000:00:00.0
+ of_node -> .../devicetree/base/soc/pcie@d0070000/pci@0,0
+ 0000:01:00.0
+ of_node -> .../devicetree/base/soc/pcie@d0070000/pci@0,0/dev@0,0

On the of_node removal, revert the operations.

Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
Cc: [email protected]
Signed-off-by: Herve Codina <[email protected]>
---
drivers/pci/of.c | 15 +++++++++++++--
1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 51e3dd0ea5ab..5afd2731e876 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -615,7 +615,8 @@ void of_pci_remove_node(struct pci_dev *pdev)
np = pci_device_to_OF_node(pdev);
if (!np || !of_node_check_flag(np, OF_DYNAMIC))
return;
- pdev->dev.of_node = NULL;
+
+ device_remove_of_node(&pdev->dev);

of_changeset_revert(np->data);
of_changeset_destroy(np->data);
@@ -668,12 +669,22 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
if (ret)
goto out_free_node;

+ /*
+ * This of_node will be added to an existing device.
+ * Avoid any device creation and use the existing device
+ */
+ of_node_set_flag(np, OF_POPULATED);
+ np->fwnode.dev = &pdev->dev;
+ fwnode_dev_initialized(&np->fwnode, true);
+
ret = of_changeset_apply(cset);
if (ret)
goto out_free_node;

np->data = cset;
- pdev->dev.of_node = np;
+
+ /* Add the of_node to the existing device */
+ device_add_of_node(&pdev->dev, np);
kfree(name);

return;
--
2.42.0

2023-11-30 16:57:46

by Herve Codina

[permalink] [raw]
Subject: [PATCH v2 1/2] driver core: Introduce device_{add,remove}_of_node()

An of_node can be duplicated from an existing device using
device_set_of_node_from_dev() or initialized using device_set_node()
In both cases, these functions have to be called before the device_add()
call in order to have the of_node link created in the device sysfs
directory. Further more, these function cannot prevent any of_node
and/or fwnode overwrites.

When adding an of_node on an already present device, the following
operations need to be done:
- Attach the of_node if no of_node were already attached
- Attach the of_node as a fwnode if no fwnode were already attached
- Create the of_node sysfs link if needed

This is the purpose of device_add_of_node().
device_remove_of_node() reverts the operations done by
device_add_of_node().

Cc: [email protected]
Signed-off-by: Herve Codina <[email protected]>
---
drivers/base/core.c | 74 ++++++++++++++++++++++++++++++++++++++++++
include/linux/device.h | 2 ++
2 files changed, 76 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 2926f3b1f868..ac026187ac6a 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -5046,6 +5046,80 @@ void set_secondary_fwnode(struct device *dev, struct fwnode_handle *fwnode)
}
EXPORT_SYMBOL_GPL(set_secondary_fwnode);

+/**
+ * device_remove_of_node - Remove an of_node from a device
+ * @dev: device whose device-tree node is being removed
+ */
+void device_remove_of_node(struct device *dev)
+{
+ dev = get_device(dev);
+ if (!dev)
+ return;
+
+ if (!dev->of_node)
+ goto end;
+
+ sysfs_remove_link(&dev->kobj, "of_node");
+
+ if (dev->fwnode == of_fwnode_handle(dev->of_node))
+ dev->fwnode = NULL;
+
+ of_node_put(dev->of_node);
+ dev->of_node = NULL;
+
+end:
+ put_device(dev);
+}
+EXPORT_SYMBOL_GPL(device_remove_of_node);
+
+/**
+ * device_add_of_node - Add an of_node to an existing device
+ * @dev: device whose device-tree node is being added
+ * @of_node: of_node to add
+ */
+void device_add_of_node(struct device *dev, struct device_node *of_node)
+{
+ int ret;
+
+ if (!of_node)
+ return;
+
+ dev = get_device(dev);
+ if (!dev)
+ return;
+
+ if (dev->of_node) {
+ dev_warn(dev, "Replace node %pOF with %pOF\n", dev->of_node, of_node);
+ device_remove_of_node(dev);
+ }
+
+ dev->of_node = of_node_get(of_node);
+
+ if (!dev->fwnode)
+ dev->fwnode = of_fwnode_handle(of_node);
+
+ if (!dev->p) {
+ /*
+ * device_add() was not previously called.
+ * The of_node link will be created when device_add() is called.
+ */
+ goto end;
+ }
+
+ /*
+ * device_add() was previously called and so the of_node link was not
+ * created by device_add_class_symlinks().
+ * Create this link now.
+ */
+ ret = sysfs_create_link(&dev->kobj, of_node_kobj(of_node), "of_node");
+ if (ret)
+ dev_warn(dev, "Error %d creating of_node link\n", ret);
+
+end:
+ put_device(dev);
+}
+EXPORT_SYMBOL_GPL(device_add_of_node);
+
/**
* device_set_of_node_from_dev - reuse device-tree node of another device
* @dev: device whose device-tree node is being set
diff --git a/include/linux/device.h b/include/linux/device.h
index d7a72a8749ea..2b093e62907a 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1128,6 +1128,8 @@ int device_offline(struct device *dev);
int device_online(struct device *dev);
void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
void set_secondary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
+void device_add_of_node(struct device *dev, struct device_node *of_node);
+void device_remove_of_node(struct device *dev);
void device_set_of_node_from_dev(struct device *dev, const struct device *dev2);
void device_set_node(struct device *dev, struct fwnode_handle *fwnode);

--
2.42.0

2023-12-01 22:27:38

by Rob Herring

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
>
> Hi,
>
> The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> creates of_node for PCI devices.
> During the insertion handling of these new DT nodes done by of_platform,
> new devices (struct device) are created.
> For each PCI devices a struct device is already present (created and
> handled by the PCI core).
> Creating a new device from a DT node leads to some kind of wrong struct
> device duplication to represent the exact same PCI device.
>
> This patch series first introduces device_{add,remove}_of_node() in
> order to add or remove a newly created of_node to an already existing
> device.
> Then it fixes the DT node creation for PCI devices to add or remove the
> created node to the existing PCI device without any new device creation.

I think the simpler solution is to get the DT node created earlier. We
are just asking for pain if the DT node is set for the device at
different times compared to static DT nodes.

The following fixes the lack of of_node link. The DT unittest fails
with the change though and I don't see why.

Also, no idea if the bridge part works because my qemu setup doesn't
create bridges (anyone got a magic cmdline to create them?).

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 9c2137dae429..46b252bbe500 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
*/
pcibios_bus_add_device(dev);
pci_fixup_device(pci_fixup_final, dev);
- if (pci_is_bridge(dev))
- of_pci_make_dev_node(dev);
pci_create_sysfs_dev_files(dev);
pci_proc_attach_device(dev);
pci_bridge_d3_update(dev);
diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 51e3dd0ea5ab..e15eaf0127fc 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
return 0;

node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
+ if (!node && pci_is_bridge(dev))
+ of_pci_make_dev_node(dev);
if (!node)
return 0;

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index ea476252280a..e50b07fe5a63 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6208,9 +6208,9 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,
0x9a31, dpc_log_size);
* before driver probing, it might need to add a device tree node as the final
* fixup.
*/
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
-DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
+DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);

/*
* Devices known to require a longer delay before first config space access

2023-12-01 22:46:10

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Fri, Dec 01, 2023 at 04:26:45PM -0600, Rob Herring wrote:
> On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> ...

> Also, no idea if the bridge part works because my qemu setup doesn't
> create bridges (anyone got a magic cmdline to create them?).

I probably copied this from somewhere and certainly couldn't construct
it from scratch, but it did create a hierarchy like this:

00:04.0 bridge to [bus 01-04] (Root Port)
01:00.0 bridge to [bus 02-04] (Switch Upstream Port)
02:00.0 bridge to [bus 03] (Switch Downstream Port)
02:01.0 bridge to [bus 04] (Switch Downstream Port)
03:00.0 endpoint
04:00.0 endpoint

IMAGE=ubuntu.img
KERNEL=~/linux/arch/x86/boot/bzImage
IMGDIR=~/virt/img/

qemu-system-x86_64 -enable-kvm -s -m 2048 $IMAGE \
-device pcie-root-port,id=root_port1,chassis=1,slot=1 \
-device x3130-upstream,id=upstream_port1,bus=root_port1 \
-device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=2,slot=1 \
-device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=2,slot=2 \
-drive file=${IMGDIR}/nvme.qcow2,if=none,id=nvme1,snapshot=on \
-device nvme,drive=nvme1,serial=nvme1,cmb_size_mb=2048,bus=downstream_port1 \
-drive file=${IMGDIR}/nvme2.qcow2,if=none,id=nvme2,snapshot=on \
-device nvme,drive=nvme2,serial=nvme1,bus=downstream_port2 \
-virtfs local,id=home,path=/home/,security_model=mapped,mount_tag=home \
-nographic \
-kernel $KERNEL \
-append "root=/dev/sda2 rootfstype=ext4 console=ttyS0,38400n8"

2023-12-01 22:49:47

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] PCI: of: Attach created of_node to existing device

On Thu, Nov 30, 2023 at 05:56:59PM +0100, Herve Codina wrote:
> The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> creates of_node for PCI devices.
> During the insertion handling of these new DT nodes done by of_platform,
> new devices (struct device) are created.
> For each PCI devices a struct device is already present (created and
> handled by the PCI core).
> Having a second struct device to represent the exact same PCI device is
> not correct.

Can you rewrap this or, if you intend multiple paragraphs, add blank
lines between them?

> On the of_node creation, tell the of_platform that there is no need to
> create a device for this node (OF_POPULATED flag), link this newly
> created of_node to the already present device and tell fwnode that the
> device attached to this of_node is ready (fwnode_dev_initialized()).
>
> With this fix, the of_node are available in the sysfs device tree:
> /sys/devices/platform/soc/d0070000.pcie/
> + of_node -> .../devicetree/base/soc/pcie@d0070000
> + pci0000:00
> + 0000:00:00.0
> + of_node -> .../devicetree/base/soc/pcie@d0070000/pci@0,0
> + 0000:01:00.0
> + of_node -> .../devicetree/base/soc/pcie@d0070000/pci@0,0/dev@0,0
>
> On the of_node removal, revert the operations.
>
> Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> Cc: [email protected]
> Signed-off-by: Herve Codina <[email protected]>
> ---
> drivers/pci/of.c | 15 +++++++++++++--
> 1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index 51e3dd0ea5ab..5afd2731e876 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -615,7 +615,8 @@ void of_pci_remove_node(struct pci_dev *pdev)
> np = pci_device_to_OF_node(pdev);
> if (!np || !of_node_check_flag(np, OF_DYNAMIC))
> return;
> - pdev->dev.of_node = NULL;
> +
> + device_remove_of_node(&pdev->dev);
>
> of_changeset_revert(np->data);
> of_changeset_destroy(np->data);
> @@ -668,12 +669,22 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
> if (ret)
> goto out_free_node;
>
> + /*
> + * This of_node will be added to an existing device.
> + * Avoid any device creation and use the existing device
> + */
> + of_node_set_flag(np, OF_POPULATED);
> + np->fwnode.dev = &pdev->dev;
> + fwnode_dev_initialized(&np->fwnode, true);
> +
> ret = of_changeset_apply(cset);
> if (ret)
> goto out_free_node;
>
> np->data = cset;
> - pdev->dev.of_node = np;
> +
> + /* Add the of_node to the existing device */
> + device_add_of_node(&pdev->dev, np);
> kfree(name);
>
> return;
> --
> 2.42.0
>

2023-12-04 12:43:54

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi Rob,

On Fri, 1 Dec 2023 16:26:45 -0600
Rob Herring <[email protected]> wrote:

> On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> >
> > Hi,
> >
> > The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > creates of_node for PCI devices.
> > During the insertion handling of these new DT nodes done by of_platform,
> > new devices (struct device) are created.
> > For each PCI devices a struct device is already present (created and
> > handled by the PCI core).
> > Creating a new device from a DT node leads to some kind of wrong struct
> > device duplication to represent the exact same PCI device.
> >
> > This patch series first introduces device_{add,remove}_of_node() in
> > order to add or remove a newly created of_node to an already existing
> > device.
> > Then it fixes the DT node creation for PCI devices to add or remove the
> > created node to the existing PCI device without any new device creation.
>
> I think the simpler solution is to get the DT node created earlier. We
> are just asking for pain if the DT node is set for the device at
> different times compared to static DT nodes.
>
> The following fixes the lack of of_node link. The DT unittest fails
> with the change though and I don't see why.
>
> Also, no idea if the bridge part works because my qemu setup doesn't
> create bridges (anyone got a magic cmdline to create them?).
>
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 9c2137dae429..46b252bbe500 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> */
> pcibios_bus_add_device(dev);
> pci_fixup_device(pci_fixup_final, dev);
> - if (pci_is_bridge(dev))
> - of_pci_make_dev_node(dev);
> pci_create_sysfs_dev_files(dev);
> pci_proc_attach_device(dev);
> pci_bridge_d3_update(dev);
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index 51e3dd0ea5ab..e15eaf0127fc 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> return 0;
>
> node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> + if (!node && pci_is_bridge(dev))
> + of_pci_make_dev_node(dev);
> if (!node)
> return 0;

Maybe it is too early.
of_pci_make_dev_node() creates a node and fills some properties based on
some already set values available in the PCI device such as its struct resource
values.
We need to have some values set by the PCI infra in order to create our DT node
with correct values.

Best regards,
Hervé

>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index ea476252280a..e50b07fe5a63 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -6208,9 +6208,9 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL,
> 0x9a31, dpc_log_size);
> * before driver probing, it might need to add a device tree node as the final
> * fixup.
> */
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
> -DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5020, of_pci_make_dev_node);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_XILINX, 0x5021, of_pci_make_dev_node);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_REDHAT, 0x0005, of_pci_make_dev_node);
>
> /*
> * Devices known to require a longer delay before first config space access

2023-12-04 13:59:37

by Rob Herring

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
>
> Hi Rob,
>
> On Fri, 1 Dec 2023 16:26:45 -0600
> Rob Herring <[email protected]> wrote:
>
> > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > creates of_node for PCI devices.
> > > During the insertion handling of these new DT nodes done by of_platform,
> > > new devices (struct device) are created.
> > > For each PCI devices a struct device is already present (created and
> > > handled by the PCI core).
> > > Creating a new device from a DT node leads to some kind of wrong struct
> > > device duplication to represent the exact same PCI device.
> > >
> > > This patch series first introduces device_{add,remove}_of_node() in
> > > order to add or remove a newly created of_node to an already existing
> > > device.
> > > Then it fixes the DT node creation for PCI devices to add or remove the
> > > created node to the existing PCI device without any new device creation.
> >
> > I think the simpler solution is to get the DT node created earlier. We
> > are just asking for pain if the DT node is set for the device at
> > different times compared to static DT nodes.
> >
> > The following fixes the lack of of_node link. The DT unittest fails
> > with the change though and I don't see why.
> >
> > Also, no idea if the bridge part works because my qemu setup doesn't
> > create bridges (anyone got a magic cmdline to create them?).
> >
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index 9c2137dae429..46b252bbe500 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > */
> > pcibios_bus_add_device(dev);
> > pci_fixup_device(pci_fixup_final, dev);
> > - if (pci_is_bridge(dev))
> > - of_pci_make_dev_node(dev);
> > pci_create_sysfs_dev_files(dev);
> > pci_proc_attach_device(dev);
> > pci_bridge_d3_update(dev);
> > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > --- a/drivers/pci/of.c
> > +++ b/drivers/pci/of.c
> > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > return 0;
> >
> > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > + if (!node && pci_is_bridge(dev))
> > + of_pci_make_dev_node(dev);
> > if (!node)
> > return 0;
>
> Maybe it is too early.
> of_pci_make_dev_node() creates a node and fills some properties based on
> some already set values available in the PCI device such as its struct resource
> values.
> We need to have some values set by the PCI infra in order to create our DT node
> with correct values.

Indeed, that's probably the issue I'm having. In that case,
DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
device_add().

I think modifying sysfs after device_add() is going to race with
userspace. Userspace is notified of a new device, and then the of_node
link may or may not be there when it reads sysfs. Also, not sure if
we'll need DT modaliases with PCI devices, but they won't work if the
DT node is not set before device_add().

Rob

2023-12-04 15:30:48

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi Rob,

On Mon, 4 Dec 2023 07:59:09 -0600
Rob Herring <[email protected]> wrote:

[...]

> > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > index 9c2137dae429..46b252bbe500 100644
> > > --- a/drivers/pci/bus.c
> > > +++ b/drivers/pci/bus.c
> > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > */
> > > pcibios_bus_add_device(dev);
> > > pci_fixup_device(pci_fixup_final, dev);
> > > - if (pci_is_bridge(dev))
> > > - of_pci_make_dev_node(dev);
> > > pci_create_sysfs_dev_files(dev);
> > > pci_proc_attach_device(dev);
> > > pci_bridge_d3_update(dev);
> > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > --- a/drivers/pci/of.c
> > > +++ b/drivers/pci/of.c
> > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > return 0;
> > >
> > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > + if (!node && pci_is_bridge(dev))
> > > + of_pci_make_dev_node(dev);
> > > if (!node)
> > > return 0;
> >
> > Maybe it is too early.
> > of_pci_make_dev_node() creates a node and fills some properties based on
> > some already set values available in the PCI device such as its struct resource
> > values.
> > We need to have some values set by the PCI infra in order to create our DT node
> > with correct values.
>
> Indeed, that's probably the issue I'm having. In that case,
> DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> device_add().
>
> I think modifying sysfs after device_add() is going to race with
> userspace. Userspace is notified of a new device, and then the of_node
> link may or may not be there when it reads sysfs. Also, not sure if
> we'll need DT modaliases with PCI devices, but they won't work if the
> DT node is not set before device_add().

Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
fix your QEMU unittest ?

We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
and the device_add() call, the call to pci_set_msi_domain() is present.
MSIs are not supported currently but in the future ...

Related to DT modaliases, I don't think they are needed.
All drivers related to PCI device should be declared as pci_driver.
Correct me if I am wrong but I think that the core PCI will load the correct
module without any DT modalias.

Best regards,
Hervé

--
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

2023-12-04 16:48:25

by Lizhi Hou

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Here is the link I put in patch set cover letter

https://github.com/houlz0507/xoclv2/blob/pci-dt-0329/pci-dt-patch-0329/README

It has the steps I used to create and start qemu with pci bridges.
Hopefully this helps.


Lizhi

On 12/1/23 14:45, Bjorn Helgaas wrote:
> On Fri, Dec 01, 2023 at 04:26:45PM -0600, Rob Herring wrote:
>> On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
>> ...
>> Also, no idea if the bridge part works because my qemu setup doesn't
>> create bridges (anyone got a magic cmdline to create them?).
> I probably copied this from somewhere and certainly couldn't construct
> it from scratch, but it did create a hierarchy like this:
>
> 00:04.0 bridge to [bus 01-04] (Root Port)
> 01:00.0 bridge to [bus 02-04] (Switch Upstream Port)
> 02:00.0 bridge to [bus 03] (Switch Downstream Port)
> 02:01.0 bridge to [bus 04] (Switch Downstream Port)
> 03:00.0 endpoint
> 04:00.0 endpoint
>
> IMAGE=ubuntu.img
> KERNEL=~/linux/arch/x86/boot/bzImage
> IMGDIR=~/virt/img/
>
> qemu-system-x86_64 -enable-kvm -s -m 2048 $IMAGE \
> -device pcie-root-port,id=root_port1,chassis=1,slot=1 \
> -device x3130-upstream,id=upstream_port1,bus=root_port1 \
> -device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=2,slot=1 \
> -device xio3130-downstream,id=downstream_port2,bus=upstream_port1,chassis=2,slot=2 \
> -drive file=${IMGDIR}/nvme.qcow2,if=none,id=nvme1,snapshot=on \
> -device nvme,drive=nvme1,serial=nvme1,cmb_size_mb=2048,bus=downstream_port1 \
> -drive file=${IMGDIR}/nvme2.qcow2,if=none,id=nvme2,snapshot=on \
> -device nvme,drive=nvme2,serial=nvme1,bus=downstream_port2 \
> -virtfs local,id=home,path=/home/,security_model=mapped,mount_tag=home \
> -nographic \
> -kernel $KERNEL \
> -append "root=/dev/sda2 rootfstype=ext4 console=ttyS0,38400n8"

2023-12-04 23:04:04

by Rob Herring

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <[email protected]> wrote:
>
> Hi Rob,
>
> On Mon, 4 Dec 2023 07:59:09 -0600
> Rob Herring <[email protected]> wrote:
>
> [...]
>
> > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > index 9c2137dae429..46b252bbe500 100644
> > > > --- a/drivers/pci/bus.c
> > > > +++ b/drivers/pci/bus.c
> > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > > */
> > > > pcibios_bus_add_device(dev);
> > > > pci_fixup_device(pci_fixup_final, dev);
> > > > - if (pci_is_bridge(dev))
> > > > - of_pci_make_dev_node(dev);
> > > > pci_create_sysfs_dev_files(dev);
> > > > pci_proc_attach_device(dev);
> > > > pci_bridge_d3_update(dev);
> > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > --- a/drivers/pci/of.c
> > > > +++ b/drivers/pci/of.c
> > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > return 0;
> > > >
> > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > + if (!node && pci_is_bridge(dev))
> > > > + of_pci_make_dev_node(dev);
> > > > if (!node)
> > > > return 0;
> > >
> > > Maybe it is too early.
> > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > some already set values available in the PCI device such as its struct resource
> > > values.
> > > We need to have some values set by the PCI infra in order to create our DT node
> > > with correct values.
> >
> > Indeed, that's probably the issue I'm having. In that case,
> > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > device_add().
> >
> > I think modifying sysfs after device_add() is going to race with
> > userspace. Userspace is notified of a new device, and then the of_node
> > link may or may not be there when it reads sysfs. Also, not sure if
> > we'll need DT modaliases with PCI devices, but they won't work if the
> > DT node is not set before device_add().
>
> Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> fix your QEMU unittest ?

No...

And testing the bridge part crashes. That's because there's a
dependency on the bridge->subordinate to write out bus-range and
interrupt-map. I think the fix there is we should just not write those
properties. The bus range isn't needed because the kernel does its own
assignments. For interrupt-map, it is only needed if "interrupts" is
present in the child devices. If not present, then the standard PCI
swizzling is used. Alternatively, I think the interrupt mapping could
be simplified to just implement the standard swizzling at each level
which isn't dependent on any of the devices on the bus. I gave that a
go where each interrupt-map just points to the parent bridge, but ran
into an issue that the bridge nodes don't have a phandle. That should
be fixable, but I'd rather go with the first option. I suppose that
depends on how the interrupts downstream of the PCI device need to get
resolved. It could be that the PCI device serves as the interrupt
controller and can resolve the parent interrupt on its own (which may
be needed for ACPI host anyways).

> We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
> and the device_add() call, the call to pci_set_msi_domain() is present.
> MSIs are not supported currently but in the future ...

MSI's aren't ever described in PCI nodes. Only the host bridge. So I
don't think we should have problems there.

> Related to DT modaliases, I don't think they are needed.
> All drivers related to PCI device should be declared as pci_driver.
> Correct me if I am wrong but I think that the core PCI will load the correct
> module without any DT modalias.

Yes, you are probably right.

Rob

2023-12-05 08:05:26

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Mon, 4 Dec 2023 17:03:21 -0600
Rob Herring <[email protected]> wrote:

> On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <[email protected]> wrote:
> >
> > Hi Rob,
> >
> > On Mon, 4 Dec 2023 07:59:09 -0600
> > Rob Herring <[email protected]> wrote:
> >
> > [...]
> >
> > > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > > index 9c2137dae429..46b252bbe500 100644
> > > > > --- a/drivers/pci/bus.c
> > > > > +++ b/drivers/pci/bus.c
> > > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > > > */
> > > > > pcibios_bus_add_device(dev);
> > > > > pci_fixup_device(pci_fixup_final, dev);
> > > > > - if (pci_is_bridge(dev))
> > > > > - of_pci_make_dev_node(dev);
> > > > > pci_create_sysfs_dev_files(dev);
> > > > > pci_proc_attach_device(dev);
> > > > > pci_bridge_d3_update(dev);
> > > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > > --- a/drivers/pci/of.c
> > > > > +++ b/drivers/pci/of.c
> > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > return 0;
> > > > >
> > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > + if (!node && pci_is_bridge(dev))
> > > > > + of_pci_make_dev_node(dev);
> > > > > if (!node)
> > > > > return 0;
> > > >
> > > > Maybe it is too early.
> > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > some already set values available in the PCI device such as its struct resource
> > > > values.
> > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > with correct values.
> > >
> > > Indeed, that's probably the issue I'm having. In that case,
> > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > device_add().
> > >
> > > I think modifying sysfs after device_add() is going to race with
> > > userspace. Userspace is notified of a new device, and then the of_node
> > > link may or may not be there when it reads sysfs. Also, not sure if
> > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > DT node is not set before device_add().
> >
> > Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> > On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> > fix your QEMU unittest ?
>
> No...
>
> And testing the bridge part crashes. That's because there's a
> dependency on the bridge->subordinate to write out bus-range and
> interrupt-map. I think the fix there is we should just not write those
> properties. The bus range isn't needed because the kernel does its own
> assignments. For interrupt-map, it is only needed if "interrupts" is
> present in the child devices. If not present, then the standard PCI
> swizzling is used. Alternatively, I think the interrupt mapping could
> be simplified to just implement the standard swizzling at each level
> which isn't dependent on any of the devices on the bus. I gave that a
> go where each interrupt-map just points to the parent bridge, but ran
> into an issue that the bridge nodes don't have a phandle. That should
> be fixable, but I'd rather go with the first option. I suppose that
> depends on how the interrupts downstream of the PCI device need to get
> resolved. It could be that the PCI device serves as the interrupt
> controller and can resolve the parent interrupt on its own (which may
> be needed for ACPI host anyways).

About interrupt, I am a bit stuck on my side.
My dtso (applied at PCI device level) contains the following:
fragment@0 {
target-path="";
__overlay__ {
pci-ep-bus@0 {
compatible = "simple-bus";
#address-cells = <1>;
#size-cells = <1>;

/*
* map @0xe2000000 (32MB) to BAR0 (CPU)
* map @0xe0000000 (16MB) to BAR1 (AMBA)
*/
ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
0xe0000000 0x01 0x00 0x00 0x1000000>;

itc: itc {
compatible = "microchip,lan966x-itc";
#interrupt-cells = <1>;
interrupt-controller;
reg = <0xe00c0120 0x190>;
};

...
};
};
};

I have a 'simple-bus' with a 'ranges' property to translate the BAR addresses
then several devices. Among them a interrupt controller (itc). Its parent
interrupt is the one used by the PCI device (INTA).
I cannot describe this parent interrupt in the dtso because to that I need the
parent interrupt phandle which will be know only at runtime.
Of course, I can modified the overlay applied to tweak the 'interrupt' and
'interrupt-parent' in the itc node from my PCI device driver at runtime but I
would like to avoid this kind of tweak in the PCI device driver.
This kind of tweak is overlay dependent and needs to be done by each PCI
device driver that need to work with interrupts.

For BAR addresses translation, we use the 'ranges' property at the PCI device
node to translate 0 0 0 to BAR0, 1 0 0 to BAR1, ...
What do you think about a new 'irq-ranges' property to translate the irq number
and get the irq parent controller base.

irq-ranges = <child_irq_spec parent_irq_spec length>;

The idea would be to translate the child irq specifier (irq number +
possible flags) parent DT node.
if 'irq-ranges' present in parent translate irq spec then:
1) 'interrupt' present in parent.
In that case, if the translation match, we use the translated irq spec
and resolve the parent interrupt controller using existing ways from this
node. If it does not match: error.
2) 'interrupt-map' present in parent.
We use existing ways from this node with the translated irq spec to get the
parent interrupt controller.
3) 'interrupt-controller' present in parent.
We use this node as parent interrupt controller and the translated irq spec
4) no specific property present, update parent taking the parent's parent and
go again.

With this the overlay becomes:
pci-ep-bus@0 {
compatible = "simple-bus";
#address-cells = <1>;
#size-cells = <1>;
#interrupt-cells = <1>;

/*
* map @0xe2000000 (32MB) to BAR0 (CPU)
* map @0xe0000000 (16MB) to BAR1 (AMBA)
*/
ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
0xe0000000 0x01 0x00 0x00 0x1000000>;

/* Translate PCI IRQ*/
irq-ranges = <1 1 1>;

itc: itc {
compatible = "microchip,lan966x-itc";
#interrupt-cells = <1>;
interrupts = <1>;
interrupt-controller;
reg = <0xe00c0120 0x190>;
};

...
};

And the PCI device node built at runtime:
dev@0,0 {
#address-cells = <0x03>;
interrupts = <0x01>; <-- use parent interrupt-map
#size-cells = <0x02>;
compatible = "pci1055,9660\0pciclass,020000\0pciclass,0200";
ranges = ...;
irq-ranges = <1 0x01 1>
reg = <0x10000 0x00 0x00 0x00 0x00>;
};
or
dev@0,0 {
#address-cells = <0x03>;
interrupts = <NN>;
interrupts-parent = <phandle'
#size-cells = <0x02>;
compatible = "pci1055,9660\0pciclass,020000\0pciclass,0200";
ranges = ...;
irq-ranges = <1 NN 1>
reg = <0x10000 0x00 0x00 0x00 0x00>;
};

In the second case, interrup-map in bridges nodes is still needed to build the
'interrupts' / 'interrupts-parent' in devices nodes.
and irq-ranges (if it makes sense on your side) allow to get and interrupt from
the overlay without the need to know the irq parent phandle.

Regards,
Hervé

>
> > We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
> > and the device_add() call, the call to pci_set_msi_domain() is present.
> > MSIs are not supported currently but in the future ...
>
> MSI's aren't ever described in PCI nodes. Only the host bridge. So I
> don't think we should have problems there.
>
> > Related to DT modaliases, I don't think they are needed.
> > All drivers related to PCI device should be declared as pci_driver.
> > Correct me if I am wrong but I think that the core PCI will load the correct
> > module without any DT modalias.
>
> Yes, you are probably right.
>
> Rob



--
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

2023-12-05 18:54:10

by Lizhi Hou

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices


On 12/4/23 15:03, Rob Herring wrote:
> On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <[email protected]> wrote:
>> Hi Rob,
>>
>> On Mon, 4 Dec 2023 07:59:09 -0600
>> Rob Herring <[email protected]> wrote:
>>
>> [...]
>>
>>>>> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
>>>>> index 9c2137dae429..46b252bbe500 100644
>>>>> --- a/drivers/pci/bus.c
>>>>> +++ b/drivers/pci/bus.c
>>>>> @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
>>>>> */
>>>>> pcibios_bus_add_device(dev);
>>>>> pci_fixup_device(pci_fixup_final, dev);
>>>>> - if (pci_is_bridge(dev))
>>>>> - of_pci_make_dev_node(dev);
>>>>> pci_create_sysfs_dev_files(dev);
>>>>> pci_proc_attach_device(dev);
>>>>> pci_bridge_d3_update(dev);
>>>>> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
>>>>> index 51e3dd0ea5ab..e15eaf0127fc 100644
>>>>> --- a/drivers/pci/of.c
>>>>> +++ b/drivers/pci/of.c
>>>>> @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
>>>>> return 0;
>>>>>
>>>>> node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
>>>>> + if (!node && pci_is_bridge(dev))
>>>>> + of_pci_make_dev_node(dev);
>>>>> if (!node)
>>>>> return 0;
>>>> Maybe it is too early.
>>>> of_pci_make_dev_node() creates a node and fills some properties based on
>>>> some already set values available in the PCI device such as its struct resource
>>>> values.
>>>> We need to have some values set by the PCI infra in order to create our DT node
>>>> with correct values.
>>> Indeed, that's probably the issue I'm having. In that case,
>>> DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
>>> device_add().
>>>
>>> I think modifying sysfs after device_add() is going to race with
>>> userspace. Userspace is notified of a new device, and then the of_node
>>> link may or may not be there when it reads sysfs. Also, not sure if
>>> we'll need DT modaliases with PCI devices, but they won't work if the
>>> DT node is not set before device_add().
>> Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
>> On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
>> fix your QEMU unittest ?
> No...
>
> And testing the bridge part crashes. That's because there's a
> dependency on the bridge->subordinate to write out bus-range and
> interrupt-map. I think the fix there is we should just not write those
> properties. The bus range isn't needed because the kernel does its own
> assignments. For interrupt-map, it is only needed if "interrupts" is
Without 'bus-range', dtc will output a warning while compiling the
generated node.
> present in the child devices. If not present, then the standard PCI
> swizzling is used. Alternatively, I think the interrupt mapping could
> be simplified to just implement the standard swizzling at each level
> which isn't dependent on any of the devices on the bus. I gave that a
> go where each interrupt-map just points to the parent bridge, but ran
> into an issue that the bridge nodes don't have a phandle. That should
> be fixable, but I'd rather go with the first option. I suppose that

Do you mean it might be something similar with I posted in V12?

https://lore.kernel.org/lkml/[email protected]/


Thanks,

Lizhi

> depends on how the interrupts downstream of the PCI device need to get
> resolved. It could be that the PCI device serves as the interrupt
> controller and can resolve the parent interrupt on its own (which may
> be needed for ACPI host anyways).
>
>> We have to note that between the pci_fixup_device(pci_fixup_header, dev) call
>> and the device_add() call, the call to pci_set_msi_domain() is present.
>> MSIs are not supported currently but in the future ...
> MSI's aren't ever described in PCI nodes. Only the host bridge. So I
> don't think we should have problems there.
>
>> Related to DT modaliases, I don't think they are needed.
>> All drivers related to PCI device should be declared as pci_driver.
>> Correct me if I am wrong but I think that the core PCI will load the correct
>> module without any DT modalias.
> Yes, you are probably right.
>
> Rob

2023-12-07 22:52:34

by Rob Herring

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Tue, Dec 5, 2023 at 2:05 AM Herve Codina <[email protected]> wrote:
>
> On Mon, 4 Dec 2023 17:03:21 -0600
> Rob Herring <[email protected]> wrote:
>
> > On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <[email protected]> wrote:
> > >
> > > Hi Rob,
> > >
> > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > Rob Herring <[email protected]> wrote:
> > >
> > > [...]
> > >
> > > > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > > > index 9c2137dae429..46b252bbe500 100644
> > > > > > --- a/drivers/pci/bus.c
> > > > > > +++ b/drivers/pci/bus.c
> > > > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > > > > */
> > > > > > pcibios_bus_add_device(dev);
> > > > > > pci_fixup_device(pci_fixup_final, dev);
> > > > > > - if (pci_is_bridge(dev))
> > > > > > - of_pci_make_dev_node(dev);
> > > > > > pci_create_sysfs_dev_files(dev);
> > > > > > pci_proc_attach_device(dev);
> > > > > > pci_bridge_d3_update(dev);
> > > > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > > > --- a/drivers/pci/of.c
> > > > > > +++ b/drivers/pci/of.c
> > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > > return 0;
> > > > > >
> > > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > + if (!node && pci_is_bridge(dev))
> > > > > > + of_pci_make_dev_node(dev);
> > > > > > if (!node)
> > > > > > return 0;
> > > > >
> > > > > Maybe it is too early.
> > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > some already set values available in the PCI device such as its struct resource
> > > > > values.
> > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > with correct values.
> > > >
> > > > Indeed, that's probably the issue I'm having. In that case,
> > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > device_add().
> > > >
> > > > I think modifying sysfs after device_add() is going to race with
> > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > DT node is not set before device_add().
> > >
> > > Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> > > On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> > > fix your QEMU unittest ?
> >
> > No...

I think the problem is we aren't setting the fwnode, just the of_node
ptr, but I haven't had a chance to verify that.

> > And testing the bridge part crashes. That's because there's a
> > dependency on the bridge->subordinate to write out bus-range and
> > interrupt-map. I think the fix there is we should just not write those
> > properties. The bus range isn't needed because the kernel does its own
> > assignments. For interrupt-map, it is only needed if "interrupts" is
> > present in the child devices. If not present, then the standard PCI
> > swizzling is used. Alternatively, I think the interrupt mapping could
> > be simplified to just implement the standard swizzling at each level
> > which isn't dependent on any of the devices on the bus. I gave that a
> > go where each interrupt-map just points to the parent bridge, but ran
> > into an issue that the bridge nodes don't have a phandle. That should
> > be fixable, but I'd rather go with the first option. I suppose that
> > depends on how the interrupts downstream of the PCI device need to get
> > resolved. It could be that the PCI device serves as the interrupt
> > controller and can resolve the parent interrupt on its own (which may
> > be needed for ACPI host anyways).
>
> About interrupt, I am a bit stuck on my side.
> My dtso (applied at PCI device level) contains the following:
> fragment@0 {
> target-path="";
> __overlay__ {
> pci-ep-bus@0 {
> compatible = "simple-bus";
> #address-cells = <1>;
> #size-cells = <1>;
>
> /*
> * map @0xe2000000 (32MB) to BAR0 (CPU)
> * map @0xe0000000 (16MB) to BAR1 (AMBA)
> */
> ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
> 0xe0000000 0x01 0x00 0x00 0x1000000>;
>
> itc: itc {
> compatible = "microchip,lan966x-itc";
> #interrupt-cells = <1>;
> interrupt-controller;
> reg = <0xe00c0120 0x190>;
> };
>
> ...
> };
> };
> };
>
> I have a 'simple-bus' with a 'ranges' property to translate the BAR addresses
> then several devices. Among them a interrupt controller (itc). Its parent
> interrupt is the one used by the PCI device (INTA).
> I cannot describe this parent interrupt in the dtso because to that I need the
> parent interrupt phandle which will be know only at runtime.

But you don't. The logic to find the interrupt parent is walk up the
parent nodes until you find 'interrupt-parent' or
'#interrupt-controller' (and interrupt-map always has
#interrupt-controller). So your overlay just needs 'interrupts = <1>'
for INTA and it should all just work.

That of course implies that we need interrupt properties in all the
bridges which I was hoping to avoid. In the ACPI case, for DT
interrupt parsing to work, we're going to need to end up in an
'interrupt-controller' node somewhere. I think the options are either
we walk interrupt-map properties up to the host bridge which then
points to something or the PCI device is the interrupt controller. I
think the PCI device is the right place. How the downstream interrupts
are routed to PCI interrupts are defined by the device. That would
work the same way for both DT and ACPI. If you are concerned about
implementing in each driver needing this, some library functions can
mitigate that.

I'm trying to play around with the IRQ domains and get this to work,
but not having any luck yet.

> Of course, I can modified the overlay applied to tweak the 'interrupt' and
> 'interrupt-parent' in the itc node from my PCI device driver at runtime but I
> would like to avoid this kind of tweak in the PCI device driver.
> This kind of tweak is overlay dependent and needs to be done by each PCI
> device driver that need to work with interrupts.
>
> For BAR addresses translation, we use the 'ranges' property at the PCI device
> node to translate 0 0 0 to BAR0, 1 0 0 to BAR1, ...
> What do you think about a new 'irq-ranges' property to translate the irq number
> and get the irq parent controller base.
>
> irq-ranges = <child_irq_spec parent_irq_spec length>;

Seems fragile as you have to know something about the parent (the # of
cells), but you don't have the phandle. If you needed multiple
entries, you couldn't parse this.

Rob

2023-12-08 08:49:22

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi Rob,

On Thu, 7 Dec 2023 16:51:56 -0600
Rob Herring <[email protected]> wrote:

> On Tue, Dec 5, 2023 at 2:05 AM Herve Codina <[email protected]> wrote:
> >
> > On Mon, 4 Dec 2023 17:03:21 -0600
> > Rob Herring <[email protected]> wrote:
> >
> > > On Mon, Dec 4, 2023 at 9:30 AM Herve Codina <[email protected]> wrote:
> > > >
> > > > Hi Rob,
> > > >
> > > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > > Rob Herring <[email protected]> wrote:
> > > >
> > > > [...]
> > > >
> > > > > > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > > > > > index 9c2137dae429..46b252bbe500 100644
> > > > > > > --- a/drivers/pci/bus.c
> > > > > > > +++ b/drivers/pci/bus.c
> > > > > > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > > > > > */
> > > > > > > pcibios_bus_add_device(dev);
> > > > > > > pci_fixup_device(pci_fixup_final, dev);
> > > > > > > - if (pci_is_bridge(dev))
> > > > > > > - of_pci_make_dev_node(dev);
> > > > > > > pci_create_sysfs_dev_files(dev);
> > > > > > > pci_proc_attach_device(dev);
> > > > > > > pci_bridge_d3_update(dev);
> > > > > > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > > > > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > > > > > --- a/drivers/pci/of.c
> > > > > > > +++ b/drivers/pci/of.c
> > > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > > > return 0;
> > > > > > >
> > > > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > > + if (!node && pci_is_bridge(dev))
> > > > > > > + of_pci_make_dev_node(dev);
> > > > > > > if (!node)
> > > > > > > return 0;
> > > > > >
> > > > > > Maybe it is too early.
> > > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > > some already set values available in the PCI device such as its struct resource
> > > > > > values.
> > > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > > with correct values.
> > > > >
> > > > > Indeed, that's probably the issue I'm having. In that case,
> > > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > > device_add().
> > > > >
> > > > > I think modifying sysfs after device_add() is going to race with
> > > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > > DT node is not set before device_add().
> > > >
> > > > Ok, we can try using DECLARE_PCI_FIXUP_HEADER.
> > > > On your side, is moving from DECLARE_PCI_FIXUP_EARLY to DECLARE_PCI_FIXUP_HEADER
> > > > fix your QEMU unittest ?
> > >
> > > No...
>
> I think the problem is we aren't setting the fwnode, just the of_node
> ptr, but I haven't had a chance to verify that.
>
> > > And testing the bridge part crashes. That's because there's a
> > > dependency on the bridge->subordinate to write out bus-range and
> > > interrupt-map. I think the fix there is we should just not write those
> > > properties. The bus range isn't needed because the kernel does its own
> > > assignments. For interrupt-map, it is only needed if "interrupts" is
> > > present in the child devices. If not present, then the standard PCI
> > > swizzling is used. Alternatively, I think the interrupt mapping could
> > > be simplified to just implement the standard swizzling at each level
> > > which isn't dependent on any of the devices on the bus. I gave that a
> > > go where each interrupt-map just points to the parent bridge, but ran
> > > into an issue that the bridge nodes don't have a phandle. That should
> > > be fixable, but I'd rather go with the first option. I suppose that
> > > depends on how the interrupts downstream of the PCI device need to get
> > > resolved. It could be that the PCI device serves as the interrupt
> > > controller and can resolve the parent interrupt on its own (which may
> > > be needed for ACPI host anyways).
> >
> > About interrupt, I am a bit stuck on my side.
> > My dtso (applied at PCI device level) contains the following:
> > fragment@0 {
> > target-path="";
> > __overlay__ {
> > pci-ep-bus@0 {
> > compatible = "simple-bus";
> > #address-cells = <1>;
> > #size-cells = <1>;
> >
> > /*
> > * map @0xe2000000 (32MB) to BAR0 (CPU)
> > * map @0xe0000000 (16MB) to BAR1 (AMBA)
> > */
> > ranges = <0xe2000000 0x00 0x00 0x00 0x2000000
> > 0xe0000000 0x01 0x00 0x00 0x1000000>;
> >
> > itc: itc {
> > compatible = "microchip,lan966x-itc";
> > #interrupt-cells = <1>;
> > interrupt-controller;
> > reg = <0xe00c0120 0x190>;
> > };
> >
> > ...
> > };
> > };
> > };
> >
> > I have a 'simple-bus' with a 'ranges' property to translate the BAR addresses
> > then several devices. Among them a interrupt controller (itc). Its parent
> > interrupt is the one used by the PCI device (INTA).
> > I cannot describe this parent interrupt in the dtso because to that I need the
> > parent interrupt phandle which will be know only at runtime.
>
> But you don't. The logic to find the interrupt parent is walk up the
> parent nodes until you find 'interrupt-parent' or
> '#interrupt-controller' (and interrupt-map always has
> #interrupt-controller). So your overlay just needs 'interrupts = <1>'
> for INTA and it should all just work.

Yes, I tried some stuffs in that way...
>
> That of course implies that we need interrupt properties in all the
> bridges which I was hoping to avoid. In the ACPI case, for DT
> interrupt parsing to work, we're going to need to end up in an
> 'interrupt-controller' node somewhere. I think the options are either

... and I went up to that point.
Further more with that way, we need to update the addr value retrieved
from the device requesting the interrupt.
https://elixir.bootlin.com/linux/latest/source/drivers/of/irq.c#L343
Indeed, with the current 'interrupt-map' at bridges level, a addr value
update is needed at the PCI device level if the interrupt is requested
from some PCI device children.
This is were my (not so good) interrupt-ranges property could play a
role.

> we walk interrupt-map properties up to the host bridge which then
> points to something or the PCI device is the interrupt controller. I
> think the PCI device is the right place. How the downstream interrupts

Agree, the PCI device seems to be the best candidate.

> are routed to PCI interrupts are defined by the device. That would
> work the same way for both DT and ACPI. If you are concerned about
> implementing in each driver needing this, some library functions can
> mitigate that.
>
> I'm trying to play around with the IRQ domains and get this to work,
> but not having any luck yet.

May I help you on some points?

Got a system with a real hardware (Lan966x) and a setup using an ARM platform (full
DT) and an other one using an x86 platform (ACPI).
Also, I have some piece of code to create the PCI host bridge node.
Of course, this need to be first working on a full DT system but I can do some test
to be sure that can be still ok on a ACPI system.

>
> > Of course, I can modified the overlay applied to tweak the 'interrupt' and
> > 'interrupt-parent' in the itc node from my PCI device driver at runtime but I
> > would like to avoid this kind of tweak in the PCI device driver.
> > This kind of tweak is overlay dependent and needs to be done by each PCI
> > device driver that need to work with interrupts.
> >
> > For BAR addresses translation, we use the 'ranges' property at the PCI device
> > node to translate 0 0 0 to BAR0, 1 0 0 to BAR1, ...
> > What do you think about a new 'irq-ranges' property to translate the irq number
> > and get the irq parent controller base.
> >
> > irq-ranges = <child_irq_spec parent_irq_spec length>;
>
> Seems fragile as you have to know something about the parent (the # of
> cells), but you don't have the phandle. If you needed multiple
> entries, you couldn't parse this.
>

Right.
With the PCI device seen as an interrupt controller, there is no more need of
this interrup-ranges property.

Best regards,
Hervé

2023-12-14 14:31:40

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi Rob,

On Fri, 8 Dec 2023 09:48:40 +0100
Herve Codina <[email protected]> wrote:

> >
> > But you don't. The logic to find the interrupt parent is walk up the
> > parent nodes until you find 'interrupt-parent' or
> > '#interrupt-controller' (and interrupt-map always has
> > #interrupt-controller). So your overlay just needs 'interrupts = <1>'
> > for INTA and it should all just work.
>
> Yes, I tried some stuffs in that way...
> >
> > That of course implies that we need interrupt properties in all the
> > bridges which I was hoping to avoid. In the ACPI case, for DT
> > interrupt parsing to work, we're going to need to end up in an
> > 'interrupt-controller' node somewhere. I think the options are either
>
> ... and I went up to that point.
> Further more with that way, we need to update the addr value retrieved
> from the device requesting the interrupt.
> https://elixir.bootlin.com/linux/latest/source/drivers/of/irq.c#L343
> Indeed, with the current 'interrupt-map' at bridges level, a addr value
> update is needed at the PCI device level if the interrupt is requested
> from some PCI device children.
> This is were my (not so good) interrupt-ranges property could play a
> role.
>
> > we walk interrupt-map properties up to the host bridge which then
> > points to something or the PCI device is the interrupt controller. I
> > think the PCI device is the right place. How the downstream interrupts
>
> Agree, the PCI device seems to be the best candidate.
>
> > are routed to PCI interrupts are defined by the device. That would
> > work the same way for both DT and ACPI. If you are concerned about
> > implementing in each driver needing this, some library functions can
> > mitigate that.
> >
> > I'm trying to play around with the IRQ domains and get this to work,
> > but not having any luck yet.
>

Got some piece of code creating an interrupt controller at PCI device level.
To have it working, '#interrupt-cell = <1>' and 'interrupt-controller'
properties need to be set in the PCI device DT node.

I can set them when the PCI device DT node is created (add them in
of_pci_add_properties()) but as this interrupt controller is specific to the
PCI device driver (it needs to be created after the pci_enable_device() call
and will probably be device specific with MSI), it would make sense to have
these properties set by the PCI device driver itself or in the overlay it
applies.

But these properties creation done by the device driver itself (or the
overlay) lead to memory leaks.
Indeed, we cannot add properties to an existing node without memory
leaks. When a property is removed, the property is not freed but moved
to the node->deadprops list (of_remove_property()).
The properties present in the list will be free once the node itself is
removed.
In our use-case, the node (PCI device node) is not removed when the driver
is removed and probe again (rmmod/insmod).

Do you have any opinion about the '#interrupt-cell' and
'interrupt-controller' properties creation:

- Created by of_pci_add_properties():
No mem leak but done outside the specific device driver itself and done for
all PCI devices.
Maybe the driver will not create the interrupt controller, maybe it will
prefer an other value for '#interrupt-cell', maybe it will handle MSI and
will need to set other properties, ... All of these are device specific.

- Created by the device driver itself (or the overlay it applies):
The mem leak is present. Any idea about a way to fix that or at least having
a fix for some properties ?
I was thinking about avoiding to export properties (of_find_property()) and
so avoid references properties or introducing refcount on properties but
these modifications touch a lot of drivers and subsystems and are subject
to errors.
That's said, checking a property presence based on its name can be done without
exporting the property, as well as getting a single value. Iterating over array
values need some more complicated code.


Best regards,
Hervé

2023-12-15 13:52:29

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi Rob,

On Mon, 4 Dec 2023 07:59:09 -0600
Rob Herring <[email protected]> wrote:

> On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
> >
> > Hi Rob,
> >
> > On Fri, 1 Dec 2023 16:26:45 -0600
> > Rob Herring <[email protected]> wrote:
> >
> > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > The commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > > creates of_node for PCI devices.
> > > > During the insertion handling of these new DT nodes done by of_platform,
> > > > new devices (struct device) are created.
> > > > For each PCI devices a struct device is already present (created and
> > > > handled by the PCI core).
> > > > Creating a new device from a DT node leads to some kind of wrong struct
> > > > device duplication to represent the exact same PCI device.
> > > >
> > > > This patch series first introduces device_{add,remove}_of_node() in
> > > > order to add or remove a newly created of_node to an already existing
> > > > device.
> > > > Then it fixes the DT node creation for PCI devices to add or remove the
> > > > created node to the existing PCI device without any new device creation.
> > >
> > > I think the simpler solution is to get the DT node created earlier. We
> > > are just asking for pain if the DT node is set for the device at
> > > different times compared to static DT nodes.
> > >
> > > The following fixes the lack of of_node link. The DT unittest fails
> > > with the change though and I don't see why.
> > >
> > > Also, no idea if the bridge part works because my qemu setup doesn't
> > > create bridges (anyone got a magic cmdline to create them?).
> > >
> > > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > > index 9c2137dae429..46b252bbe500 100644
> > > --- a/drivers/pci/bus.c
> > > +++ b/drivers/pci/bus.c
> > > @@ -342,8 +342,6 @@ void pci_bus_add_device(struct pci_dev *dev)
> > > */
> > > pcibios_bus_add_device(dev);
> > > pci_fixup_device(pci_fixup_final, dev);
> > > - if (pci_is_bridge(dev))
> > > - of_pci_make_dev_node(dev);
> > > pci_create_sysfs_dev_files(dev);
> > > pci_proc_attach_device(dev);
> > > pci_bridge_d3_update(dev);
> > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > index 51e3dd0ea5ab..e15eaf0127fc 100644
> > > --- a/drivers/pci/of.c
> > > +++ b/drivers/pci/of.c
> > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > return 0;
> > >
> > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > + if (!node && pci_is_bridge(dev))
> > > + of_pci_make_dev_node(dev);
> > > if (!node)
> > > return 0;
> >
> > Maybe it is too early.
> > of_pci_make_dev_node() creates a node and fills some properties based on
> > some already set values available in the PCI device such as its struct resource
> > values.
> > We need to have some values set by the PCI infra in order to create our DT node
> > with correct values.
>
> Indeed, that's probably the issue I'm having. In that case,
> DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> device_add().
>
> I think modifying sysfs after device_add() is going to race with
> userspace. Userspace is notified of a new device, and then the of_node
> link may or may not be there when it reads sysfs. Also, not sure if
> we'll need DT modaliases with PCI devices, but they won't work if the
> DT node is not set before device_add().
>
> Rob

DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
just before the device_add() call.
Indeed, in order to fill the DT properties, resources need to be assigned
(needed for the 'ranges' property used for addresses translation).
The resources assignment is done after the call to device_add().

Some PCI sysfs files are already created after adding the device by the
pci_create_sysfs_dev_files() call:
https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347

Is it really an issue to add the of_node link to sysfs on an already present
device ?

Maybe we can add the of_node link before the device_add() call and add the
'ranges' property in the DT node later, once resources are assigned.
In that case, the race condition is not fixed but moved from the PCI device
to the DT node the device is pointing to.
With DT overlays and of_changeset_*(), modifying nodes is a normal behavior.
Is that acceptable for the 'ranges' property in this use-case ?

Best regards,
Hervé

2024-03-19 14:42:38

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi Rob,

On Thu, 14 Dec 2023 15:31:22 +0100
Herve Codina <[email protected]> wrote:

> > >
> > > But you don't. The logic to find the interrupt parent is walk up the
> > > parent nodes until you find 'interrupt-parent' or
> > > '#interrupt-controller' (and interrupt-map always has
> > > #interrupt-controller). So your overlay just needs 'interrupts = <1>'
> > > for INTA and it should all just work.
> >
> > Yes, I tried some stuffs in that way...
> > >
> > > That of course implies that we need interrupt properties in all the
> > > bridges which I was hoping to avoid. In the ACPI case, for DT
> > > interrupt parsing to work, we're going to need to end up in an
> > > 'interrupt-controller' node somewhere. I think the options are either
> >
> > ... and I went up to that point.
> > Further more with that way, we need to update the addr value retrieved
> > from the device requesting the interrupt.
> > https://elixir.bootlin.com/linux/latest/source/drivers/of/irq.c#L343
> > Indeed, with the current 'interrupt-map' at bridges level, a addr value
> > update is needed at the PCI device level if the interrupt is requested
> > from some PCI device children.
> > This is were my (not so good) interrupt-ranges property could play a
> > role.
> >
> > > we walk interrupt-map properties up to the host bridge which then
> > > points to something or the PCI device is the interrupt controller. I
> > > think the PCI device is the right place. How the downstream interrupts
> >
> > Agree, the PCI device seems to be the best candidate.
> >
> > > are routed to PCI interrupts are defined by the device. That would
> > > work the same way for both DT and ACPI. If you are concerned about
> > > implementing in each driver needing this, some library functions can
> > > mitigate that.
> > >
> > > I'm trying to play around with the IRQ domains and get this to work,
> > > but not having any luck yet.
> >
>
> Got some piece of code creating an interrupt controller at PCI device level.
> To have it working, '#interrupt-cell = <1>' and 'interrupt-controller'
> properties need to be set in the PCI device DT node.
>
> I can set them when the PCI device DT node is created (add them in
> of_pci_add_properties()) but as this interrupt controller is specific to the
> PCI device driver (it needs to be created after the pci_enable_device() call
> and will probably be device specific with MSI), it would make sense to have
> these properties set by the PCI device driver itself or in the overlay it
> applies.
>
> But these properties creation done by the device driver itself (or the
> overlay) lead to memory leaks.
> Indeed, we cannot add properties to an existing node without memory
> leaks. When a property is removed, the property is not freed but moved
> to the node->deadprops list (of_remove_property()).
> The properties present in the list will be free once the node itself is
> removed.
> In our use-case, the node (PCI device node) is not removed when the driver
> is removed and probe again (rmmod/insmod).
>
> Do you have any opinion about the '#interrupt-cell' and
> 'interrupt-controller' properties creation:
>
> - Created by of_pci_add_properties():
> No mem leak but done outside the specific device driver itself and done for
> all PCI devices.
> Maybe the driver will not create the interrupt controller, maybe it will
> prefer an other value for '#interrupt-cell', maybe it will handle MSI and
> will need to set other properties, ... All of these are device specific.
>
> - Created by the device driver itself (or the overlay it applies):
> The mem leak is present. Any idea about a way to fix that or at least having
> a fix for some properties ?
> I was thinking about avoiding to export properties (of_find_property()) and
> so avoid references properties or introducing refcount on properties but
> these modifications touch a lot of drivers and subsystems and are subject
> to errors.
> That's said, checking a property presence based on its name can be done without
> exporting the property, as well as getting a single value. Iterating over array
> values need some more complicated code.
>

I revive this quite old topic.

The primary goal of this series was to avoid a struct device duplication due
to the insertion of DT nodes related to PCI devices.
The series added the sysfs of_node symlink once the device is visible from
sysfs.
You proposed to create the DT node earlier, DECLARE_PCI_FIXUP_EARLY() instead
of DECLARE_PCI_FIXUP_FINAL() in order to have it set before the device_add()
call.
This raised some new issues because the DT node creation needs some information
set by the PCI core. DECLARE_PCI_FIXUP_HEADER() was the new candidate.
Issues were still present.
The 'ranges' property is needed and information needed for its computation
are set by the PCI core after the device_add() call.

At this point the discussion continued also on interrupts with the idea to
add the 'interrupt-controller' property in the PCI device node in order to
bypass all the interrupt mapping done in DT nodes.
Indeed, in order to have a working DT mapping, an 'interrupt-parent' phandle
is needed at some points and will be problematic with ACPI.

On my side I've got a piece of code to consider the PCI device as an interrup
controller. This work with the 'interrupt-controller' property.

The question raised:
Which part of code set the interrupt-controller property ?
- DT node creation:
Common to all PCI devices even if the interrupt are not handled by the PCI
device driver. Also '#interrupt-cells' is really device specific.
- Added by the PCI device driver itself
Seems the good place but we enter in overlay memleak issues

What is your opinion related to the best candidate for the
'interrupt-controller' and '#interrupt-cells' property creation ?

Back to the of_node link addition to sysfs.
Is it really an issue to add it on an already present device ?
If so, is it acceptable (not tested on my side) to create the of_node link
to an empty node before the device_add() call and then fill this node later
when needed information are set by the PCI core ?

With your answers, I hope to move forward on these topics.
Best regards,
Hervé

--
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

2024-03-19 15:25:23

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

[+cc Krzysztof]

On Fri, Dec 15, 2023 at 02:52:07PM +0100, Herve Codina wrote:
> On Mon, 4 Dec 2023 07:59:09 -0600
> Rob Herring <[email protected]> wrote:
> > On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
> > > On Fri, 1 Dec 2023 16:26:45 -0600
> > > Rob Herring <[email protected]> wrote:
> > > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > > > ...

> > > > --- a/drivers/pci/of.c
> > > > +++ b/drivers/pci/of.c
> > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > return 0;
> > > >
> > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > + if (!node && pci_is_bridge(dev))
> > > > + of_pci_make_dev_node(dev);
> > > > if (!node)
> > > > return 0;
> > >
> > > Maybe it is too early.
> > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > some already set values available in the PCI device such as its struct resource
> > > values.
> > > We need to have some values set by the PCI infra in order to create our DT node
> > > with correct values.
> >
> > Indeed, that's probably the issue I'm having. In that case,
> > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > device_add().
> >
> > I think modifying sysfs after device_add() is going to race with
> > userspace. Userspace is notified of a new device, and then the of_node
> > link may or may not be there when it reads sysfs. Also, not sure if
> > we'll need DT modaliases with PCI devices, but they won't work if the
> > DT node is not set before device_add().
>
> DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
> just before the device_add() call.
> Indeed, in order to fill the DT properties, resources need to be assigned
> (needed for the 'ranges' property used for addresses translation).
> The resources assignment is done after the call to device_add().

Do we need to know the actual address *value* before creating the
sysfs file, or is it enough to know that the file should *exist*, even
if the value may be changed later?

> Some PCI sysfs files are already created after adding the device by the
> pci_create_sysfs_dev_files() call:
> https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347
>
> Is it really an issue to add the of_node link to sysfs on an already
> present device ?

Yes, I think this would be an issue. We've been trying to get rid of
pci_create_sysfs_dev_files() altogether because there's a long history
of race issues related to it:

https://lore.kernel.org/r/1271099285.9831.13.camel@localhost/ WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot'
https://lore.kernel.org/r/[email protected]/ [2.6.35-rc1 regression] sysfs: cannot create duplicate filename ... XVR-600 related?
https://lore.kernel.org/r/20200716110423.xtfyb3n6tn5ixedh@pali/ PCI: Race condition in pci_create_sysfs_dev_files
https://lore.kernel.org/r/[email protected]/ PCI: Race condition in pci_create_sysfs_dev_files (can't boot)
https://bugzilla.kernel.org/show_bug.cgi?id=215515 sysfs: cannot create duplicate filename '.../0000:e0'

And several previous attempts to fix them:

https://lore.kernel.org/r/[email protected]/ Guard pci_create_sysfs_dev_files with atomic value
https://lore.kernel.org/r/[email protected] PCI/sysfs: get rid of pci_sysfs_init late_initcall
https://lore.kernel.org/r/[email protected]/ PCI/sysfs: Fix race in pci sysfs creation

Bjorn

2024-03-19 16:43:59

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

Hi Bjorn,

On Tue, 19 Mar 2024 10:25:13 -0500
Bjorn Helgaas <[email protected]> wrote:

> [+cc Krzysztof]
>
> On Fri, Dec 15, 2023 at 02:52:07PM +0100, Herve Codina wrote:
> > On Mon, 4 Dec 2023 07:59:09 -0600
> > Rob Herring <[email protected]> wrote:
> > > On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
> > > > On Fri, 1 Dec 2023 16:26:45 -0600
> > > > Rob Herring <[email protected]> wrote:
> > > > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > > > > ...
>
> > > > > --- a/drivers/pci/of.c
> > > > > +++ b/drivers/pci/of.c
> > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > return 0;
> > > > >
> > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > + if (!node && pci_is_bridge(dev))
> > > > > + of_pci_make_dev_node(dev);
> > > > > if (!node)
> > > > > return 0;
> > > >
> > > > Maybe it is too early.
> > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > some already set values available in the PCI device such as its struct resource
> > > > values.
> > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > with correct values.
> > >
> > > Indeed, that's probably the issue I'm having. In that case,
> > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > device_add().
> > >
> > > I think modifying sysfs after device_add() is going to race with
> > > userspace. Userspace is notified of a new device, and then the of_node
> > > link may or may not be there when it reads sysfs. Also, not sure if
> > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > DT node is not set before device_add().
> >
> > DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
> > just before the device_add() call.
> > Indeed, in order to fill the DT properties, resources need to be assigned
> > (needed for the 'ranges' property used for addresses translation).
> > The resources assignment is done after the call to device_add().
>
> Do we need to know the actual address *value* before creating the
> sysfs file, or is it enough to know that the file should *exist*, even
> if the value may be changed later?

I think, the problematic file is 'of_node'.
This file is a symlink present in the device directory pointing to the
node in a device-tree subdir.

How can we create this of_node symlink without having the device-tree
subdir available ?

>
> > Some PCI sysfs files are already created after adding the device by the
> > pci_create_sysfs_dev_files() call:
> > https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347
> >
> > Is it really an issue to add the of_node link to sysfs on an already
> > present device ?
>
> Yes, I think this would be an issue. We've been trying to get rid of
> pci_create_sysfs_dev_files() altogether because there's a long history
> of race issues related to it:
>
> https://lore.kernel.org/r/1271099285.9831.13.camel@localhost/ WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot'
> https://lore.kernel.org/r/[email protected]/ [2.6.35-rc1 regression] sysfs: cannot create duplicate filename ... XVR-600 related?
> https://lore.kernel.org/r/20200716110423.xtfyb3n6tn5ixedh@pali/ PCI: Race condition in pci_create_sysfs_dev_files
> https://lore.kernel.org/r/[email protected]/ PCI: Race condition in pci_create_sysfs_dev_files (can't boot)
> https://bugzilla.kernel.org/show_bug.cgi?id=215515 sysfs: cannot create duplicate filename '.../0000:e0'
>
> And several previous attempts to fix them:
>
> https://lore.kernel.org/r/[email protected]/ Guard pci_create_sysfs_dev_files with atomic value
> https://lore.kernel.org/r/[email protected] PCI/sysfs: get rid of pci_sysfs_init late_initcall
> https://lore.kernel.org/r/[email protected]/ PCI/sysfs: Fix race in pci sysfs creation
>

I am not sure we are facing in the same kind of issues.
The ones you mentioned are related to some sysfs duplication.
In the of_node case, the issue (if any) is that the symlink will be created
after the other device's file. Not sure that it can lead to some file
duplication.

Best regards,
Hervé
--
Hervé Codina, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

2024-03-19 16:54:47

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Tue, Mar 19, 2024 at 05:34:04PM +0100, Herve Codina wrote:
> On Tue, 19 Mar 2024 10:25:13 -0500
> Bjorn Helgaas <[email protected]> wrote:
> > On Fri, Dec 15, 2023 at 02:52:07PM +0100, Herve Codina wrote:
> > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > Rob Herring <[email protected]> wrote:
> > > > On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
> > > > > On Fri, 1 Dec 2023 16:26:45 -0600
> > > > > Rob Herring <[email protected]> wrote:
> > > > > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > > > > > ...
> >
> > > > > > --- a/drivers/pci/of.c
> > > > > > +++ b/drivers/pci/of.c
> > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > > return 0;
> > > > > >
> > > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > + if (!node && pci_is_bridge(dev))
> > > > > > + of_pci_make_dev_node(dev);
> > > > > > if (!node)
> > > > > > return 0;
> > > > >
> > > > > Maybe it is too early.
> > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > some already set values available in the PCI device such as its struct resource
> > > > > values.
> > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > with correct values.
> > > >
> > > > Indeed, that's probably the issue I'm having. In that case,
> > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > device_add().
> > > >
> > > > I think modifying sysfs after device_add() is going to race with
> > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > DT node is not set before device_add().
> > >
> > > DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
> > > just before the device_add() call.
> > > Indeed, in order to fill the DT properties, resources need to be assigned
> > > (needed for the 'ranges' property used for addresses translation).
> > > The resources assignment is done after the call to device_add().
> >
> > Do we need to know the actual address *value* before creating the
> > sysfs file, or is it enough to know that the file should *exist*, even
> > if the value may be changed later?
>
> I think, the problematic file is 'of_node'.
> This file is a symlink present in the device directory pointing to the
> node in a device-tree subdir.
>
> How can we create this of_node symlink without having the device-tree
> subdir available ?
>
> > > Some PCI sysfs files are already created after adding the device by the
> > > pci_create_sysfs_dev_files() call:
> > > https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347
> > >
> > > Is it really an issue to add the of_node link to sysfs on an already
> > > present device ?
> >
> > Yes, I think this would be an issue. We've been trying to get rid of
> > pci_create_sysfs_dev_files() altogether because there's a long history
> > of race issues related to it:
> >
> > https://lore.kernel.org/r/1271099285.9831.13.camel@localhost/ WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot'
> > https://lore.kernel.org/r/[email protected]/ [2.6.35-rc1 regression] sysfs: cannot create duplicate filename ... XVR-600 related?
> > https://lore.kernel.org/r/20200716110423.xtfyb3n6tn5ixedh@pali/ PCI: Race condition in pci_create_sysfs_dev_files
> > https://lore.kernel.org/r/[email protected]/ PCI: Race condition in pci_create_sysfs_dev_files (can't boot)
> > https://bugzilla.kernel.org/show_bug.cgi?id=215515 sysfs: cannot create duplicate filename '.../0000:e0'
> >
> > And several previous attempts to fix them:
> >
> > https://lore.kernel.org/r/[email protected]/ Guard pci_create_sysfs_dev_files with atomic value
> > https://lore.kernel.org/r/[email protected] PCI/sysfs: get rid of pci_sysfs_init late_initcall
> > https://lore.kernel.org/r/[email protected]/ PCI/sysfs: Fix race in pci sysfs creation
>
> I am not sure we are facing in the same kind of issues.
> The ones you mentioned are related to some sysfs duplication.
> In the of_node case, the issue (if any) is that the symlink will be created
> after the other device's file. Not sure that it can lead to some file
> duplication.

It sounds different and maybe not a problem. I saw
pci_create_sysfs_dev_files() and was afraid you planned to add
something there when we're trying to remove it.

Bjorn

2024-04-10 21:42:07

by Rob Herring

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Tue, Mar 19, 2024 at 11:34 AM Herve Codina <herve.codina@bootlincom> wrote:
>
> Hi Bjorn,
>
> On Tue, 19 Mar 2024 10:25:13 -0500
> Bjorn Helgaas <[email protected]> wrote:
>
> > [+cc Krzysztof]
> >
> > On Fri, Dec 15, 2023 at 02:52:07PM +0100, Herve Codina wrote:
> > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > Rob Herring <[email protected]> wrote:
> > > > On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
> > > > > On Fri, 1 Dec 2023 16:26:45 -0600
> > > > > Rob Herring <[email protected]> wrote:
> > > > > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > > > > > ...
> >
> > > > > > --- a/drivers/pci/of.c
> > > > > > +++ b/drivers/pci/of.c
> > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > > return 0;
> > > > > >
> > > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > + if (!node && pci_is_bridge(dev))
> > > > > > + of_pci_make_dev_node(dev);
> > > > > > if (!node)
> > > > > > return 0;
> > > > >
> > > > > Maybe it is too early.
> > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > some already set values available in the PCI device such as its struct resource
> > > > > values.
> > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > with correct values.
> > > >
> > > > Indeed, that's probably the issue I'm having. In that case,
> > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > device_add().
> > > >
> > > > I think modifying sysfs after device_add() is going to race with
> > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > DT node is not set before device_add().
> > >
> > > DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
> > > just before the device_add() call.
> > > Indeed, in order to fill the DT properties, resources need to be assigned
> > > (needed for the 'ranges' property used for addresses translation).
> > > The resources assignment is done after the call to device_add().
> >
> > Do we need to know the actual address *value* before creating the
> > sysfs file, or is it enough to know that the file should *exist*, even
> > if the value may be changed later?
>
> I think, the problematic file is 'of_node'.
> This file is a symlink present in the device directory pointing to the
> node in a device-tree subdir.
>
> How can we create this of_node symlink without having the device-tree
> subdir available ?
>
> >
> > > Some PCI sysfs files are already created after adding the device by the
> > > pci_create_sysfs_dev_files() call:
> > > https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347
> > >
> > > Is it really an issue to add the of_node link to sysfs on an already
> > > present device ?
> >
> > Yes, I think this would be an issue. We've been trying to get rid of
> > pci_create_sysfs_dev_files() altogether because there's a long history
> > of race issues related to it:
> >
> > https://lore.kernel.org/r/1271099285.9831.13.camel@localhost/ WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot'
> > https://lore.kernel.org/r/[email protected]/ [2.6.35-rc1 regression] sysfs: cannot create duplicate filename ... XVR-600 related?
> > https://lore.kernel.org/r/20200716110423.xtfyb3n6tn5ixedh@pali/ PCI: Race condition in pci_create_sysfs_dev_files
> > https://lore.kernel.org/r/[email protected]/ PCI: Race condition in pci_create_sysfs_dev_files (can't boot)
> > https://bugzilla.kernel.org/show_bug.cgi?id=215515 sysfs: cannot create duplicate filename '.../0000:e0'
> >
> > And several previous attempts to fix them:
> >
> > https://lore.kernel.org/r/[email protected]/ Guard pci_create_sysfs_dev_files with atomic value
> > https://lore.kernel.org/r/[email protected] PCI/sysfs: get rid of pci_sysfs_init late_initcall
> > https://lore.kernel.org/r/[email protected]/ PCI/sysfs: Fix race in pci sysfs creation
> >
>
> I am not sure we are facing in the same kind of issues.
> The ones you mentioned are related to some sysfs duplication.
> In the of_node case, the issue (if any) is that the symlink will be created
> after the other device's file. Not sure that it can lead to some file
> duplication.

Again, if you notify userspace and it wants to make some decisions
based on of_node, then it has to be there when the notification
happens. As Greg says frequently, you've raced with userspace and
lost.

I imagine Bjorn won't like it, but may we need another fixup point?

Rob

2024-04-11 14:07:28

by Herve Codina

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Wed, 10 Apr 2024 16:41:43 -0500
Rob Herring <[email protected]> wrote:

> On Tue, Mar 19, 2024 at 11:34 AM Herve Codina <[email protected]> wrote:
> >
> > Hi Bjorn,
> >
> > On Tue, 19 Mar 2024 10:25:13 -0500
> > Bjorn Helgaas <[email protected]> wrote:
> >
> > > [+cc Krzysztof]
> > >
> > > On Fri, Dec 15, 2023 at 02:52:07PM +0100, Herve Codina wrote:
> > > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > > Rob Herring <[email protected]> wrote:
> > > > > On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
> > > > > > On Fri, 1 Dec 2023 16:26:45 -0600
> > > > > > Rob Herring <[email protected]> wrote:
> > > > > > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > > > > > > ...
> > >
> > > > > > > --- a/drivers/pci/of.c
> > > > > > > +++ b/drivers/pci/of.c
> > > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > > > return 0;
> > > > > > >
> > > > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > > + if (!node && pci_is_bridge(dev))
> > > > > > > + of_pci_make_dev_node(dev);
> > > > > > > if (!node)
> > > > > > > return 0;
> > > > > >
> > > > > > Maybe it is too early.
> > > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > > some already set values available in the PCI device such as its struct resource
> > > > > > values.
> > > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > > with correct values.
> > > > >
> > > > > Indeed, that's probably the issue I'm having. In that case,
> > > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > > device_add().
> > > > >
> > > > > I think modifying sysfs after device_add() is going to race with
> > > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > > DT node is not set before device_add().
> > > >
> > > > DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
> > > > just before the device_add() call.
> > > > Indeed, in order to fill the DT properties, resources need to be assigned
> > > > (needed for the 'ranges' property used for addresses translation).
> > > > The resources assignment is done after the call to device_add().
> > >
> > > Do we need to know the actual address *value* before creating the
> > > sysfs file, or is it enough to know that the file should *exist*, even
> > > if the value may be changed later?
> >
> > I think, the problematic file is 'of_node'.
> > This file is a symlink present in the device directory pointing to the
> > node in a device-tree subdir.
> >
> > How can we create this of_node symlink without having the device-tree
> > subdir available ?
> >
> > >
> > > > Some PCI sysfs files are already created after adding the device by the
> > > > pci_create_sysfs_dev_files() call:
> > > > https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347
> > > >
> > > > Is it really an issue to add the of_node link to sysfs on an already
> > > > present device ?
> > >
> > > Yes, I think this would be an issue. We've been trying to get rid of
> > > pci_create_sysfs_dev_files() altogether because there's a long history
> > > of race issues related to it:
> > >
> > > https://lore.kernel.org/r/1271099285.9831.13.camel@localhost/ WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot'
> > > https://lore.kernel.org/r/[email protected]/ [2.6.35-rc1 regression] sysfs: cannot create duplicate filename ... XVR-600 related?
> > > https://lore.kernel.org/r/20200716110423.xtfyb3n6tn5ixedh@pali/ PCI: Race condition in pci_create_sysfs_dev_files
> > > https://lore.kernel.org/r/[email protected]/ PCI: Race condition in pci_create_sysfs_dev_files (can't boot)
> > > https://bugzilla.kernel.org/show_bug.cgi?id=215515 sysfs: cannot create duplicate filename '.../0000:e0'
> > >
> > > And several previous attempts to fix them:
> > >
> > > https://lore.kernel.org/r/[email protected]/ Guard pci_create_sysfs_dev_files with atomic value
> > > https://lore.kernel.org/r/[email protected] PCI/sysfs: get rid of pci_sysfs_init late_initcall
> > > https://lore.kernel.org/r/[email protected]/ PCI/sysfs: Fix race in pci sysfs creation
> > >
> >
> > I am not sure we are facing in the same kind of issues.
> > The ones you mentioned are related to some sysfs duplication.
> > In the of_node case, the issue (if any) is that the symlink will be created
> > after the other device's file. Not sure that it can lead to some file
> > duplication.
>
> Again, if you notify userspace and it wants to make some decisions
> based on of_node, then it has to be there when the notification
> happens. As Greg says frequently, you've raced with userspace and
> lost.
>
> I imagine Bjorn won't like it, but may we need another fixup point?
>
> Rob

I'am not sure that a fixup point can fix the issue.

In order to have the of_node available before the notification, we need
to a have the of_node set and filled before the device_add() call.

In order to create the 'ranges' property in the DT node, we need PCI
resources computed. These resources are computed after the device_add()
call.

Hervé

2024-04-11 20:58:05

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Attach DT nodes to existing PCI devices

On Thu, Apr 11, 2024 at 04:05:48PM +0200, Herve Codina wrote:
> On Wed, 10 Apr 2024 16:41:43 -0500
> Rob Herring <[email protected]> wrote:
> > On Tue, Mar 19, 2024 at 11:34 AM Herve Codina <[email protected]> wrote:
> > > On Tue, 19 Mar 2024 10:25:13 -0500
> > > Bjorn Helgaas <[email protected]> wrote:
> > > > On Fri, Dec 15, 2023 at 02:52:07PM +0100, Herve Codina wrote:
> > > > > On Mon, 4 Dec 2023 07:59:09 -0600
> > > > > Rob Herring <[email protected]> wrote:
> > > > > > On Mon, Dec 4, 2023 at 6:43 AM Herve Codina <[email protected]> wrote:
> > > > > > > On Fri, 1 Dec 2023 16:26:45 -0600
> > > > > > > Rob Herring <[email protected]> wrote:
> > > > > > > > On Thu, Nov 30, 2023 at 10:57 AM Herve Codina <[email protected]> wrote:
> > > > > > > > ...
> > > >
> > > > > > > > --- a/drivers/pci/of.c
> > > > > > > > +++ b/drivers/pci/of.c
> > > > > > > > @@ -31,6 +31,8 @@ int pci_set_of_node(struct pci_dev *dev)
> > > > > > > > return 0;
> > > > > > > >
> > > > > > > > node = of_pci_find_child_device(dev->bus->dev.of_node, dev->devfn);
> > > > > > > > + if (!node && pci_is_bridge(dev))
> > > > > > > > + of_pci_make_dev_node(dev);
> > > > > > > > if (!node)
> > > > > > > > return 0;
> > > > > > >
> > > > > > > Maybe it is too early.
> > > > > > > of_pci_make_dev_node() creates a node and fills some properties based on
> > > > > > > some already set values available in the PCI device such as its struct resource
> > > > > > > values.
> > > > > > > We need to have some values set by the PCI infra in order to create our DT node
> > > > > > > with correct values.
> > > > > >
> > > > > > Indeed, that's probably the issue I'm having. In that case,
> > > > > > DECLARE_PCI_FIXUP_HEADER should work. That's later, but still before
> > > > > > device_add().
> > > > > >
> > > > > > I think modifying sysfs after device_add() is going to race with
> > > > > > userspace. Userspace is notified of a new device, and then the of_node
> > > > > > link may or may not be there when it reads sysfs. Also, not sure if
> > > > > > we'll need DT modaliases with PCI devices, but they won't work if the
> > > > > > DT node is not set before device_add().
> > > > >
> > > > > DECLARE_PCI_FIXUP_HEADER is too early as well as doing the DT node creation
> > > > > just before the device_add() call.
> > > > > Indeed, in order to fill the DT properties, resources need to be assigned
> > > > > (needed for the 'ranges' property used for addresses translation).
> > > > > The resources assignment is done after the call to device_add().
> > > >
> > > > Do we need to know the actual address *value* before creating the
> > > > sysfs file, or is it enough to know that the file should *exist*, even
> > > > if the value may be changed later?
> > >
> > > I think, the problematic file is 'of_node'.
> > > This file is a symlink present in the device directory pointing to the
> > > node in a device-tree subdir.
> > >
> > > How can we create this of_node symlink without having the device-tree
> > > subdir available ?
> > >
> > > > > Some PCI sysfs files are already created after adding the device by the
> > > > > pci_create_sysfs_dev_files() call:
> > > > > https://elixir.bootlin.com/linux/v6.6/source/drivers/pci/bus.c#L347
> > > > >
> > > > > Is it really an issue to add the of_node link to sysfs on an already
> > > > > present device ?
> > > >
> > > > Yes, I think this would be an issue. We've been trying to get rid of
> > > > pci_create_sysfs_dev_files() altogether because there's a long history
> > > > of race issues related to it:
> > > >
> > > > https://lore.kernel.org/r/1271099285.9831.13.camel@localhost/ WARNING: at fs/sysfs/dir.c:451 sysfs_add_one: sysfs: cannot create duplicate filename '/devices/pci0000:00/0000:00:01.0/slot'
> > > > https://lore.kernel.org/r/[email protected]/ [2.6.35-rc1 regression] sysfs: cannot create duplicate filename ... XVR-600 related?
> > > > https://lore.kernel.org/r/20200716110423.xtfyb3n6tn5ixedh@pali/ PCI: Race condition in pci_create_sysfs_dev_files
> > > > https://lore.kernel.org/r/[email protected]/ PCI: Race condition in pci_create_sysfs_dev_files (can't boot)
> > > > https://bugzilla.kernel.org/show_bug.cgi?id=215515 sysfs: cannot create duplicate filename '.../0000:e0'
> > > >
> > > > And several previous attempts to fix them:
> > > >
> > > > https://lore.kernel.org/r/[email protected]/ Guard pci_create_sysfs_dev_files with atomic value
> > > > https://lore.kernel.org/r/[email protected] PCI/sysfs: get rid of pci_sysfs_init late_initcall
> > > > https://lore.kernel.org/r/[email protected]/ PCI/sysfs: Fix race in pci sysfs creation
> > > >
> > >
> > > I am not sure we are facing in the same kind of issues.
> > > The ones you mentioned are related to some sysfs duplication.
> > > In the of_node case, the issue (if any) is that the symlink will be created
> > > after the other device's file. Not sure that it can lead to some file
> > > duplication.
> >
> > Again, if you notify userspace and it wants to make some decisions
> > based on of_node, then it has to be there when the notification
> > happens. As Greg says frequently, you've raced with userspace and
> > lost.
> >
> > I imagine Bjorn won't like it, but may we need another fixup point?
>
> I'am not sure that a fixup point can fix the issue.
>
> In order to have the of_node available before the notification, we need
> to a have the of_node set and filled before the device_add() call.
>
> In order to create the 'ranges' property in the DT node, we need PCI
> resources computed. These resources are computed after the device_add()
> call.

I guess this is the problem that pci_assign_unassigned_resources() and
similar are called after device_add(), right? That seems kind of
problematic in general since device_add() exposes /sys/.../resource
before they may be valid. But I don't know how to fix that.

Bjorn