Hi,
The patch "of: address: Work around missing device_type property in
pcie nodes" by Marc Zyngier, d1ac0002dd297069bb8448c2764c9c31c4668441,
causes the "DUMO" variant of the gru-scarlet-inx, at the very least,
to not boot. A gru-kevin reportedly had no issues booting (further),
though it most likely had a different kernel configuration.
Using a SuzyQ cable, there is absolutely no serial output at boot,
while reverting the commit (and this commit alone) on top of v5.9-rc2
works just as it did with v5.9-rc1.
From this point on, I don't know what's the usual process, so bear with
me if I forgot to provide relevant information, or made a faux-pas by
CC-ing too many people or not enough.
Thanks.
Hi Samuel,
On 2020-08-29 21:54, Samuel Dionne-Riel wrote:
> Hi,
>
> The patch "of: address: Work around missing device_type property in
> pcie nodes" by Marc Zyngier, d1ac0002dd297069bb8448c2764c9c31c4668441,
> causes the "DUMO" variant of the gru-scarlet-inx, at the very least,
> to not boot. A gru-kevin reportedly had no issues booting (further),
> though it most likely had a different kernel configuration.
Do you have a pointer to the device-tree for this system? I couldn't
spot anything amiss in the scarlet-inx DT, but I'm not sure the
system you have is that exact one. Even a DTB would help.
The fact that Kevin still boots is a good indication that the issue
could be with with the board-specific changes layered on top of the
GRU base. My own rk3399 systems are running with this patch.
> Using a SuzyQ cable, there is absolutely no serial output at boot,
> while reverting the commit (and this commit alone) on top of v5.9-rc2
> works just as it did with v5.9-rc1.
Do you have "earlycon" on the kernel command-line?
> From this point on, I don't know what's the usual process, so bear with
> me if I forgot to provide relevant information, or made a faux-pas by
> CC-ing too many people or not enough.
No need to worry, and thank you for reporting the issue.
Could you try replacing the problematic patch with [1], and let me
know whether this changes anything on your end? This patch probably
isn't the right approach, but it would certainly help pointing me
in the right direction.
Thanks,
M.
[1] https://lore.kernel.org/lkml/[email protected]/
--
Jazz is not dead. It just smells funny...
On Sun, 30 Aug 2020 10:41:42 +0100
Marc Zyngier <[email protected]> wrote:
Hi,
Thanks for the reply.
> Hi Samuel,
>
> On 2020-08-29 21:54, Samuel Dionne-Riel wrote:
> > Hi,
> >
> > The patch "of: address: Work around missing device_type property in
> > pcie nodes" by Marc Zyngier,
> > d1ac0002dd297069bb8448c2764c9c31c4668441, causes the "DUMO" variant
> > of the gru-scarlet-inx, at the very least, to not boot. A gru-kevin
> > reportedly had no issues booting (further), though it most likely
> > had a different kernel configuration.
>
> Do you have a pointer to the device-tree for this system? I couldn't
> spot anything amiss in the scarlet-inx DT, but I'm not sure the
> system you have is that exact one. Even a DTB would help.
Is "arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet-inx.dts" what you
wanted? The FDT in use is the one that's present in the kernel tree for
the same revision. The one with the `compatible` property with a bunch
of `google,scarlet-rev*-sku6`. The build process for the kernel
partition (booting using depthcharge) ensures they are in sync always.
In any cases, from previous discussions with people involved with the
scarlet development, the only difference between all scarlets are the
display on some (innolux vs. kingdisplay). I would expect (and hope)
the issue would be the same on all.
> The fact that Kevin still boots is a good indication that the issue
> could be with with the board-specific changes layered on top of the
> GRU base. My own rk3399 systems are running with this patch.
>
> > Using a SuzyQ cable, there is absolutely no serial output at boot,
> > while reverting the commit (and this commit alone) on top of
> > v5.9-rc2 works just as it did with v5.9-rc1.
>
> Do you have "earlycon" on the kernel command-line?
I did not, I thought earlyprintk was earlycon... I had:
console=ttyS2,115200n8 earlyprintk=ttyS2,115200n8
Now with wither "earlycon" or "earlycon=uart8250,mmio32,0xff1a0000" I
somehow can't get any output, and it's not booting. That is with and
without the problematic patch, and also verified on v5.8. Odd.
So I would say that I don't have earlycon, and maybe can't? I'm open to
suggestions.
> > From this point on, I don't know what's the usual process, so bear
> > with me if I forgot to provide relevant information, or made a
> > faux-pas by CC-ing too many people or not enough.
>
> No need to worry, and thank you for reporting the issue.
>
> Could you try replacing the problematic patch with [1], and let me
> know whether this changes anything on your end? This patch probably
> isn't the right approach, but it would certainly help pointing me
> in the right direction.
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/
On top of v5.9-rc2 + revert d1ac0002dd297069bb8448c2764c9c31c4668441
$ curl https://lore.kernel.org/lkml/[...]/raw | git am
With the patch, with and without (the probably bad) earlycon, I get
the same result, hang at boot, no serial output.
Again, knowing that the hardware is not necessarily in the hands of
everyone, I'll be glad to try patches and configurations proposed to
further the understanding of the issues.
On Sun, 30 Aug 2020 10:41:42 +0100
Marc Zyngier <[email protected]> wrote:
Hi,
>
> Could you try replacing the problematic patch with [1], and let me
> know whether this changes anything on your end? This patch probably
> isn't the right approach, but it would certainly help pointing me
> in the right direction.
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/
Following through a bisect session to figure out why the Wi-Fi broke
between 5.8 and 5.9-rc1, I figured out something that you might have in
mind already.
It seems that anything that makes of_bus_pci_match return true will
cause this to happen. This is why your initial fix also fails.
I believe my understanding is right since applying the following on top
of 5.9-rc1 also produces the same result.
--- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
@@ -227,6 +227,7 @@ dmac_peri: dma-controller@ff6e0000 {
};
pcie0: pcie@f8000000 {
+ device_type = "pci";
compatible = "rockchip,rk3399-pcie";
reg = <0x0 0xf8000000 0x0 0x2000000>,
<0x0 0xfd000000 0x0 0x1000000>;
This was found out since the Wi-Fi pci-based ath10k Wi-Fi broke, with
2f96593ecc37e98bf99525f0629128080533867f, which changes stuff around
pci bus... things...
Am I understanding right that your fix(es) were related to the change
set where the commit is found?
My intuition is that the commit causing the boot issue could be related
to changes with PCI or PCIe subsystems, and that your fix for
of_bus_pci_match is a red herring, that only surfaced the existing
issue.
This is backed by applying the previous dts patch on top of 2f96593e,
and having Wi-Fi work. I would assume that between that commit and
5.9-rc1 there is a commit that causes the complete failure to boot,
which is unrelated to the first identified commit on 5.9-rc2.
And backed by a further bisection with this that points to
d84c572de1a360501d2e439ac632126f5facf59d being the actual change that
causes the tablet to fail to boot, as long as the pcie0 node is
identified as pci properly.
I am unsure if I should add as a Cc everyone involved in that change
set, though the author (coincidentally) is already in the original list
of recipients.
Any additional thoughts from this additional information?
--
Samuel Dionne-Riel
On 2020-08-31 08:18, Samuel Dionne-Riel wrote:
> On Sun, 30 Aug 2020 10:41:42 +0100
> Marc Zyngier <[email protected]> wrote:
>
> Hi,
>
>>
>> Could you try replacing the problematic patch with [1], and let me
>> know whether this changes anything on your end? This patch probably
>> isn't the right approach, but it would certainly help pointing me
>> in the right direction.
>>
>> [1]
>> https://lore.kernel.org/lkml/[email protected]/
>
> Following through a bisect session to figure out why the Wi-Fi broke
> between 5.8 and 5.9-rc1, I figured out something that you might have in
> mind already.
>
> It seems that anything that makes of_bus_pci_match return true will
> cause this to happen. This is why your initial fix also fails.
>
> I believe my understanding is right since applying the following on top
> of 5.9-rc1 also produces the same result.
>
> --- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> +++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
> @@ -227,6 +227,7 @@ dmac_peri: dma-controller@ff6e0000 {
> };
>
> pcie0: pcie@f8000000 {
> + device_type = "pci";
> compatible = "rockchip,rk3399-pcie";
> reg = <0x0 0xf8000000 0x0 0x2000000>,
> <0x0 0xfd000000 0x0 0x1000000>;
>
>
> This was found out since the Wi-Fi pci-based ath10k Wi-Fi broke, with
> 2f96593ecc37e98bf99525f0629128080533867f, which changes stuff around
> pci bus... things...
>
> Am I understanding right that your fix(es) were related to the change
> set where the commit is found?
>
> My intuition is that the commit causing the boot issue could be related
> to changes with PCI or PCIe subsystems, and that your fix for
> of_bus_pci_match is a red herring, that only surfaced the existing
> issue.
>
> This is backed by applying the previous dts patch on top of 2f96593e,
> and having Wi-Fi work. I would assume that between that commit and
> 5.9-rc1 there is a commit that causes the complete failure to boot,
> which is unrelated to the first identified commit on 5.9-rc2.
Ah, so actually anything that *enables pcie* kills your system.
Great investigative work!
>
> And backed by a further bisection with this that points to
> d84c572de1a360501d2e439ac632126f5facf59d being the actual change that
> causes the tablet to fail to boot, as long as the pcie0 node is
> identified as pci properly.
>
> I am unsure if I should add as a Cc everyone involved in that change
> set, though the author (coincidentally) is already in the original list
> of recipients.
I've deliberately moved Rob from Cc to To... ;-)
> Any additional thoughts from this additional information?
What you could do is to start looking at which of the pci_is_root_bus()
changes breaks PCIe on this system. The fact that it breaks on your
system and not on mine is a bit puzzling.
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On Mon, 31 Aug 2020 10:27:37 +0100
Marc Zyngier <[email protected]> wrote:
>
> Ah, so actually anything that *enables pcie* kills your system.
> Great investigative work!
>
> >
> > And backed by a further bisection with this that points to
> > d84c572de1a360501d2e439ac632126f5facf59d being the actual change
> > that causes the tablet to fail to boot, as long as the pcie0 node is
> > identified as pci properly.
> >
> > I am unsure if I should add as a Cc everyone involved in that change
> > set, though the author (coincidentally) is already in the original
> > list of recipients.
>
> I've deliberately moved Rob from Cc to To... ;-)
Thanks, I don't actually know who to write to exactly.
> > Any additional thoughts from this additional information?
>
> What you could do is to start looking at which of the
> pci_is_root_bus() changes breaks PCIe on this system. The fact that
> it breaks on your system and not on mine is a bit puzzling.
Let me show you, on top of v5.9-rc3 I can successfully boot using this
partial revert / adaptation of d84c572d. In addition, it also allows
the Wi-Fi to work again, compared to how it didn't in 5.9-rc1 or
5.9-rc[23] with the dumb revert of your fix.
So, if we number each pci_is_root_bus by order appearance, it is only
the second use, in rockchip_pcie_valid_device, which seem to cause
scarlet not to boot.
The patch (not actually a patch submission) reverts only that instance
of pci_is_root_bus, while also doing some leg work to put back some
functionally equivalent code that was refactored away since.
If there's anything else you want me to try, don't hesitate.
---
drivers/pci/controller/pcie-rockchip-host.c | 8 +++++++-
drivers/pci/controller/pcie-rockchip.h | 1 +
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
index 0bb2fb3e8a0b..5a27fa833fbd 100644
--- a/drivers/pci/controller/pcie-rockchip-host.c
+++ b/drivers/pci/controller/pcie-rockchip-host.c
@@ -79,7 +79,7 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
* do not read more than one device on the bus directly attached
* to RC's downstream side.
*/
- if (pci_is_root_bus(bus->parent) && dev > 0)
+ if (bus->primary == rockchip->root_bus_nr && dev > 0)
return 0;
return 1;
@@ -944,6 +944,7 @@ static int rockchip_pcie_probe(struct platform_device *pdev)
struct rockchip_pcie *rockchip;
struct device *dev = &pdev->dev;
struct pci_host_bridge *bridge;
+ struct resource *bus_res;
int err;
if (!dev->of_node)
@@ -983,6 +984,11 @@ static int rockchip_pcie_probe(struct platform_device *pdev)
if (err < 0)
goto err_deinit_port;
+ /* HACK; ~equiv to last param of pci_parse_request_of_pci_ranges */
+ bus_res = (resource_list_first_type(&bridge->windows, IORESOURCE_MEM))->res;
+
+ rockchip->root_bus_nr = bus_res->start;
+
err = rockchip_pcie_cfg_atu(rockchip);
if (err)
goto err_remove_irq_domain;
diff --git a/drivers/pci/controller/pcie-rockchip.h b/drivers/pci/controller/pcie-rockchip.h
index c7d0178fc8c2..0952fec7e34d 100644
--- a/drivers/pci/controller/pcie-rockchip.h
+++ b/drivers/pci/controller/pcie-rockchip.h
@@ -298,6 +298,7 @@ struct rockchip_pcie {
struct gpio_desc *ep_gpio;
u32 lanes;
u8 lanes_map;
+ u8 root_bus_nr;
int link_gen;
struct device *dev;
struct irq_domain *irq_domain;
--
2.25.4
Thanks again!
--
Samuel Dionne-Riel
On 2020-09-01 04:45, Samuel Dionne-Riel wrote:
> On Mon, 31 Aug 2020 10:27:37 +0100
> Marc Zyngier <[email protected]> wrote:
>>
>> Ah, so actually anything that *enables pcie* kills your system.
>> Great investigative work!
>>
>> >
>> > And backed by a further bisection with this that points to
>> > d84c572de1a360501d2e439ac632126f5facf59d being the actual change
>> > that causes the tablet to fail to boot, as long as the pcie0 node is
>> > identified as pci properly.
>> >
>> > I am unsure if I should add as a Cc everyone involved in that change
>> > set, though the author (coincidentally) is already in the original
>> > list of recipients.
>>
>> I've deliberately moved Rob from Cc to To... ;-)
>
> Thanks, I don't actually know who to write to exactly.
Given that this is a PCI regression, I guess the PCI maintainers
are the likely victims. Adding Bjorn and Lorenzo to the list in
addition to Rob.
You can find the relevant people by looking at the MAINTAINERS
file.
>> > Any additional thoughts from this additional information?
>>
>> What you could do is to start looking at which of the
>> pci_is_root_bus() changes breaks PCIe on this system. The fact that
>> it breaks on your system and not on mine is a bit puzzling.
>
> Let me show you, on top of v5.9-rc3 I can successfully boot using this
> partial revert / adaptation of d84c572d. In addition, it also allows
> the Wi-Fi to work again, compared to how it didn't in 5.9-rc1 or
> 5.9-rc[23] with the dumb revert of your fix.
>
> So, if we number each pci_is_root_bus by order appearance, it is only
> the second use, in rockchip_pcie_valid_device, which seem to cause
> scarlet not to boot.
>
> The patch (not actually a patch submission) reverts only that instance
> of pci_is_root_bus, while also doing some leg work to put back some
> functionally equivalent code that was refactored away since.
>
> If there's anything else you want me to try, don't hesitate.
>
> ---
> drivers/pci/controller/pcie-rockchip-host.c | 8 +++++++-
> drivers/pci/controller/pcie-rockchip.h | 1 +
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/controller/pcie-rockchip-host.c
> b/drivers/pci/controller/pcie-rockchip-host.c
> index 0bb2fb3e8a0b..5a27fa833fbd 100644
> --- a/drivers/pci/controller/pcie-rockchip-host.c
> +++ b/drivers/pci/controller/pcie-rockchip-host.c
> @@ -79,7 +79,7 @@ static int rockchip_pcie_valid_device(struct
> rockchip_pcie *rockchip,
> * do not read more than one device on the bus directly attached
> * to RC's downstream side.
> */
> - if (pci_is_root_bus(bus->parent) && dev > 0)
> + if (bus->primary == rockchip->root_bus_nr && dev > 0)
> return 0;
>
> return 1;
> @@ -944,6 +944,7 @@ static int rockchip_pcie_probe(struct
> platform_device *pdev)
> struct rockchip_pcie *rockchip;
> struct device *dev = &pdev->dev;
> struct pci_host_bridge *bridge;
> + struct resource *bus_res;
> int err;
>
> if (!dev->of_node)
> @@ -983,6 +984,11 @@ static int rockchip_pcie_probe(struct
> platform_device *pdev)
> if (err < 0)
> goto err_deinit_port;
>
> + /* HACK; ~equiv to last param of pci_parse_request_of_pci_ranges */
> + bus_res = (resource_list_first_type(&bridge->windows,
> IORESOURCE_MEM))->res;
> +
> + rockchip->root_bus_nr = bus_res->start;
> +
> err = rockchip_pcie_cfg_atu(rockchip);
> if (err)
> goto err_remove_irq_domain;
> diff --git a/drivers/pci/controller/pcie-rockchip.h
> b/drivers/pci/controller/pcie-rockchip.h
> index c7d0178fc8c2..0952fec7e34d 100644
> --- a/drivers/pci/controller/pcie-rockchip.h
> +++ b/drivers/pci/controller/pcie-rockchip.h
> @@ -298,6 +298,7 @@ struct rockchip_pcie {
> struct gpio_desc *ep_gpio;
> u32 lanes;
> u8 lanes_map;
> + u8 root_bus_nr;
> int link_gen;
> struct device *dev;
> struct irq_domain *irq_domain;
> --
> 2.25.4
>
>
> Thanks again!
Hmmm. It seems that the original commit (d84c572d) considered that
root_bus_nr was always zero, while it may not have been.
Rob, Lorenzo: do you guys have any idea what is going on here?
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On Tue, Sep 01, 2020 at 04:37:42PM +0100, Marc Zyngier wrote:
> On 2020-09-01 04:45, Samuel Dionne-Riel wrote:
> > On Mon, 31 Aug 2020 10:27:37 +0100
> > Marc Zyngier <[email protected]> wrote:
> > >
> > > Ah, so actually anything that *enables pcie* kills your system.
> > > Great investigative work!
> > >
> > > >
> > > > And backed by a further bisection with this that points to
> > > > d84c572de1a360501d2e439ac632126f5facf59d being the actual change
> > > > that causes the tablet to fail to boot, as long as the pcie0 node is
> > > > identified as pci properly.
> > > >
> > > > I am unsure if I should add as a Cc everyone involved in that change
> > > > set, though the author (coincidentally) is already in the original
> > > > list of recipients.
> > >
> > > I've deliberately moved Rob from Cc to To... ;-)
> >
> > Thanks, I don't actually know who to write to exactly.
>
> Given that this is a PCI regression, I guess the PCI maintainers
> are the likely victims. Adding Bjorn and Lorenzo to the list in
> addition to Rob.
>
> You can find the relevant people by looking at the MAINTAINERS
> file.
>
> > > > Any additional thoughts from this additional information?
> > >
> > > What you could do is to start looking at which of the
> > > pci_is_root_bus() changes breaks PCIe on this system. The fact that
> > > it breaks on your system and not on mine is a bit puzzling.
> >
> > Let me show you, on top of v5.9-rc3 I can successfully boot using this
> > partial revert / adaptation of d84c572d. In addition, it also allows
> > the Wi-Fi to work again, compared to how it didn't in 5.9-rc1 or
> > 5.9-rc[23] with the dumb revert of your fix.
> >
> > So, if we number each pci_is_root_bus by order appearance, it is only
> > the second use, in rockchip_pcie_valid_device, which seem to cause
> > scarlet not to boot.
> >
> > The patch (not actually a patch submission) reverts only that instance
> > of pci_is_root_bus, while also doing some leg work to put back some
> > functionally equivalent code that was refactored away since.
> >
> > If there's anything else you want me to try, don't hesitate.
> >
> > ---
> > drivers/pci/controller/pcie-rockchip-host.c | 8 +++++++-
> > drivers/pci/controller/pcie-rockchip.h | 1 +
> > 2 files changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/controller/pcie-rockchip-host.c
> > b/drivers/pci/controller/pcie-rockchip-host.c
> > index 0bb2fb3e8a0b..5a27fa833fbd 100644
> > --- a/drivers/pci/controller/pcie-rockchip-host.c
> > +++ b/drivers/pci/controller/pcie-rockchip-host.c
> > @@ -79,7 +79,7 @@ static int rockchip_pcie_valid_device(struct
> > rockchip_pcie *rockchip,
> > * do not read more than one device on the bus directly attached
> > * to RC's downstream side.
> > */
> > - if (pci_is_root_bus(bus->parent) && dev > 0)
> > + if (bus->primary == rockchip->root_bus_nr && dev > 0)
Can you dump bus->primary when this condition is hit please ?
Also on a working system (ie prior to regression) please drop the output
of:
lspci -t
here.
> > return 0;
> >
> > return 1;
> > @@ -944,6 +944,7 @@ static int rockchip_pcie_probe(struct
> > platform_device *pdev)
> > struct rockchip_pcie *rockchip;
> > struct device *dev = &pdev->dev;
> > struct pci_host_bridge *bridge;
> > + struct resource *bus_res;
> > int err;
> >
> > if (!dev->of_node)
> > @@ -983,6 +984,11 @@ static int rockchip_pcie_probe(struct
> > platform_device *pdev)
> > if (err < 0)
> > goto err_deinit_port;
> >
> > + /* HACK; ~equiv to last param of pci_parse_request_of_pci_ranges */
> > + bus_res = (resource_list_first_type(&bridge->windows,
> > IORESOURCE_MEM))->res;
IORESOURCE_MEM ? I am a bit puzzled by this hack, what is it supposed
to do ?
> > + rockchip->root_bus_nr = bus_res->start;
> > +
> > err = rockchip_pcie_cfg_atu(rockchip);
> > if (err)
> > goto err_remove_irq_domain;
> > diff --git a/drivers/pci/controller/pcie-rockchip.h
> > b/drivers/pci/controller/pcie-rockchip.h
> > index c7d0178fc8c2..0952fec7e34d 100644
> > --- a/drivers/pci/controller/pcie-rockchip.h
> > +++ b/drivers/pci/controller/pcie-rockchip.h
> > @@ -298,6 +298,7 @@ struct rockchip_pcie {
> > struct gpio_desc *ep_gpio;
> > u32 lanes;
> > u8 lanes_map;
> > + u8 root_bus_nr;
> > int link_gen;
> > struct device *dev;
> > struct irq_domain *irq_domain;
> > --
> > 2.25.4
> >
> >
> > Thanks again!
>
> Hmmm. It seems that the original commit (d84c572d) considered that
> root_bus_nr was always zero, while it may not have been.
>
> Rob, Lorenzo: do you guys have any idea what is going on here?
That's a possibility - it would also be useful to have a look at
the DTS to check the bus-range property.
Thanks for reporting it,
Lorenzo
On Tue, 1 Sep 2020 17:42:49 +0100
Lorenzo Pieralisi <[email protected]> wrote:
> On Tue, Sep 01, 2020 at 04:37:42PM +0100, Marc Zyngier wrote:
> > On 2020-09-01 04:45, Samuel Dionne-Riel wrote:
> > > - if (pci_is_root_bus(bus->parent) && dev > 0)
> > > + if (bus->primary == rockchip->root_bus_nr && dev > 0)
>
> Can you dump bus->primary when this condition is hit please ?
With the following diff
---
@@ -79,6 +79,8 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
* do not read more than one device on the bus directly attached
* to RC's downstream side.
*/
+ printk("[!!] // bus->parent (%d)\n", bus->parent);
+ printk("[!!] bus->primary (%d) == rockchip->root_bus_nr (%d) && dev (%d) > 0\n", bus->primary, rockchip->root_bus_nr, dev);
if (bus->primary == rockchip->root_bus_nr && dev > 0)
return 0;
--
I get two kind of results
[ 1.692913] [!!] // bus->parent (0)
[ 1.692917] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
and
[ 1.693055] [!!] // bus->parent (-256794624)
[ 1.693058] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
What I understand from my logging here is that in some situations the
new check from d84c572d checks against two different values, while
before d84c572d all checks at this location were equivalent.
Note that I have no idea what bus->parent's semantics are compared to
the other condition. I only thought about logging all the information
relevant to this condition.
> Also on a working system (ie prior to regression) please drop the
> output of:
>
> lspci -t
>
$ lspci -t
-[0000:00]---00.0-[01]----00.0
> > >
> > > + /* HACK; ~equiv to last param of
> > > pci_parse_request_of_pci_ranges */
> > > + bus_res = (resource_list_first_type(&bridge->windows,
> > > IORESOURCE_MEM))->res;
>
> IORESOURCE_MEM ? I am a bit puzzled by this hack, what is it supposed
> to do ?
It's not really supposed to do anything. I only needed access to
bus_res for bus_res->start to keep as root_bus_nr. My complete lack of
familiarity with all of this meant that I simply borrowed something
that was in use in another function to give me the bus_res.
Note that, while this hack is ugly, this was at first tested directly
against d84c572d, no hack needed. Against d84c572d, reverting the
change for the second call to pci_is_root_bus only made it work fine.
This is the equivalent (possibly bad) change.
> > Hmmm. It seems that the original commit (d84c572d) considered that
> > root_bus_nr was always zero, while it may not have been.
> >
> > Rob, Lorenzo: do you guys have any idea what is going on here?
>
> That's a possibility - it would also be useful to have a look at
> the DTS to check the bus-range property.
The DTS is arch/arm64/boot/dts/rockchip/rk3399-gru-scarlet-inx.dts,
systematically synced-up with the kernel source used to build the
running kernel.
Which, AFAICT, only has a bus-range set in
arch/arm64/boot/dts/rockchip/rk3399.dtsi.
pcie0: pcie@f8000000 {
[...]
bus-range = <0x0 0x1f>;
[...]
}
Thanks again!
--
Samuel Dionne-Riel
On Tue, Sep 01, 2020 at 02:33:56PM -0400, Samuel Dionne-Riel wrote:
> On Tue, 1 Sep 2020 17:42:49 +0100
> Lorenzo Pieralisi <[email protected]> wrote:
>
> > On Tue, Sep 01, 2020 at 04:37:42PM +0100, Marc Zyngier wrote:
> > > On 2020-09-01 04:45, Samuel Dionne-Riel wrote:
> > > > - if (pci_is_root_bus(bus->parent) && dev > 0)
> > > > + if (bus->primary == rockchip->root_bus_nr && dev > 0)
> >
> > Can you dump bus->primary when this condition is hit please ?
>
> With the following diff
>
> ---
> @@ -79,6 +79,8 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
> * do not read more than one device on the bus directly attached
> * to RC's downstream side.
> */
> + printk("[!!] // bus->parent (%d)\n", bus->parent);
Please print a pointer as a pointer and print both bus and bus->parent.
> + printk("[!!] bus->primary (%d) == rockchip->root_bus_nr (%d) && dev (%d) > 0\n", bus->primary, rockchip->root_bus_nr, dev);
> if (bus->primary == rockchip->root_bus_nr && dev > 0)
> return 0;
>
> --
>
> I get two kind of results
>
> [ 1.692913] [!!] // bus->parent (0)
> [ 1.692917] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
>
> and
>
> [ 1.693055] [!!] // bus->parent (-256794624)
> [ 1.693058] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
>
Looks like this is the condition that pci_is_root_bus(bus->parent) is
not hitting.
[...]
> > > > + /* HACK; ~equiv to last param of
> > > > pci_parse_request_of_pci_ranges */
> > > > + bus_res = (resource_list_first_type(&bridge->windows,
> > > > IORESOURCE_MEM))->res;
> >
> > IORESOURCE_MEM ? I am a bit puzzled by this hack, what is it supposed
> > to do ?
>
> It's not really supposed to do anything. I only needed access to
> bus_res for bus_res->start to keep as root_bus_nr. My complete lack of
> familiarity with all of this meant that I simply borrowed something
> that was in use in another function to give me the bus_res.
You are accessing a resource IORESOURCE_MEM that has nothing to do
with bus numbers.
s/IORESOURCE_MEM/IORESOURCE_BUS
should be better ;-)
Lorenzo
On Wed, 2 Sep 2020 17:01:19 +0100
Lorenzo Pieralisi <[email protected]> wrote:
> On Tue, Sep 01, 2020 at 02:33:56PM -0400, Samuel Dionne-Riel wrote:
>
> Please print a pointer as a pointer and print both bus and
> bus->parent.
Hopefully pointer as a pointer is %px. Not sure what else, if that's
wrong please tell.
---
@@ -79,6 +79,8 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
* do not read more than one device on the bus directly attached
* to RC's downstream side.
*/
+ printk("[!!] // bus (%px) bus->parent (%px)\n", bus, bus->parent);
+ printk("[!!] bus->primary (%d) == rockchip->root_bus_nr (%d) && dev (%d) > 0\n", bus->primary, rockchip->root_bus_nr, dev);
if (bus->primary == rockchip->root_bus_nr && dev > 0)
return 0;
--
Again, two values, verified with a bit of set and `sort -u`.
[ 1.691266] [!!] // bus (ffff0000ef9ab800) bus->parent (0000000000000000)
[ 1.691271] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
and
[ 1.697156] [!!] // bus (ffff0000ef9ac000) bus->parent (ffff0000ef9ab800)
[ 1.697160] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
First instance of each shown here. Last time I don't think it was.
> > + printk("[!!] bus->primary (%d) == rockchip->root_bus_nr
> > (%d) && dev (%d) > 0\n", bus->primary, rockchip->root_bus_nr, dev);
> > if (bus->primary == rockchip->root_bus_nr && dev > 0) return 0;
> >
> > --
> >
> > I get two kind of results
> >
> > [ 1.692913] [!!] // bus->parent (0)
> > [ 1.692917] [!!] bus->primary (0) == rockchip->root_bus_nr (0)
> > && dev (0) > 0
> >
> > and
> >
> > [ 1.693055] [!!] // bus->parent (-256794624)
> > [ 1.693058] [!!] bus->primary (0) == rockchip->root_bus_nr (0)
> > && dev (0) > 0
>
> Looks like this is the condition that pci_is_root_bus(bus->parent) is
> not hitting.
>
> You are accessing a resource IORESOURCE_MEM that has nothing to do
> with bus numbers.
>
> s/IORESOURCE_MEM/IORESOURCE_BUS
>
> should be better ;-)
At least correct, rather than luckily working.
Thanks, as always, anything I missed, or need more precisions, do ask.
--
Samuel Dionne-Riel
On Wed, Sep 02, 2020 at 11:47:56PM -0400, Samuel Dionne-Riel wrote:
> On Wed, 2 Sep 2020 17:01:19 +0100
> Lorenzo Pieralisi <[email protected]> wrote:
>
> > On Tue, Sep 01, 2020 at 02:33:56PM -0400, Samuel Dionne-Riel wrote:
> >
> > Please print a pointer as a pointer and print both bus and
> > bus->parent.
>
> Hopefully pointer as a pointer is %px. Not sure what else, if that's
> wrong please tell.
>
> ---
> @@ -79,6 +79,8 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
> * do not read more than one device on the bus directly attached
> * to RC's downstream side.
> */
> + printk("[!!] // bus (%px) bus->parent (%px)\n", bus, bus->parent);
> + printk("[!!] bus->primary (%d) == rockchip->root_bus_nr (%d) && dev (%d) > 0\n", bus->primary, rockchip->root_bus_nr, dev);
> if (bus->primary == rockchip->root_bus_nr && dev > 0)
> return 0;
>
> --
>
> Again, two values, verified with a bit of set and `sort -u`.
>
> [ 1.691266] [!!] // bus (ffff0000ef9ab800) bus->parent (0000000000000000)
> [ 1.691271] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
>
> and
>
> [ 1.697156] [!!] // bus (ffff0000ef9ac000) bus->parent (ffff0000ef9ab800)
> [ 1.697160] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
>
> First instance of each shown here. Last time I don't think it was.
Ok I think I understand what the problem is.
Can you give this patch a shot please ? I think we are dereferencing
a NULL pointer if bus is the root bus and dev == 0, we can rewrite
the check if this patch fixes the issue.
-- >8 --
diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
index 0bb2fb3e8a0b..72beda87b47f 100644
--- a/drivers/pci/controller/pcie-rockchip-host.c
+++ b/drivers/pci/controller/pcie-rockchip-host.c
@@ -79,7 +79,7 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
* do not read more than one device on the bus directly attached
* to RC's downstream side.
*/
- if (pci_is_root_bus(bus->parent) && dev > 0)
+ if (bus->parent && pci_is_root_bus(bus->parent) && dev > 0)
return 0;
return 1;
On Thu, Sep 3, 2020 at 3:19 AM Lorenzo Pieralisi
<[email protected]> wrote:
>
> On Wed, Sep 02, 2020 at 11:47:56PM -0400, Samuel Dionne-Riel wrote:
> > On Wed, 2 Sep 2020 17:01:19 +0100
> > Lorenzo Pieralisi <[email protected]> wrote:
> >
> > > On Tue, Sep 01, 2020 at 02:33:56PM -0400, Samuel Dionne-Riel wrote:
> > >
> > > Please print a pointer as a pointer and print both bus and
> > > bus->parent.
> >
> > Hopefully pointer as a pointer is %px. Not sure what else, if that's
> > wrong please tell.
> >
> > ---
> > @@ -79,6 +79,8 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
> > * do not read more than one device on the bus directly attached
> > * to RC's downstream side.
> > */
> > + printk("[!!] // bus (%px) bus->parent (%px)\n", bus, bus->parent);
> > + printk("[!!] bus->primary (%d) == rockchip->root_bus_nr (%d) && dev (%d) > 0\n", bus->primary, rockchip->root_bus_nr, dev);
> > if (bus->primary == rockchip->root_bus_nr && dev > 0)
> > return 0;
> >
> > --
> >
> > Again, two values, verified with a bit of set and `sort -u`.
> >
> > [ 1.691266] [!!] // bus (ffff0000ef9ab800) bus->parent (0000000000000000)
> > [ 1.691271] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
> >
> > and
> >
> > [ 1.697156] [!!] // bus (ffff0000ef9ac000) bus->parent (ffff0000ef9ab800)
> > [ 1.697160] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
> >
> > First instance of each shown here. Last time I don't think it was.
>
> Ok I think I understand what the problem is.
>
> Can you give this patch a shot please ? I think we are dereferencing
> a NULL pointer if bus is the root bus and dev == 0, we can rewrite
> the check if this patch fixes the issue.
Indeed. I checked all the other cases of pci_is_root_bus(bus->parent)
and they should be fine because they are only reached if !root_bus.
I would restructure the check like this instead:
diff --git a/drivers/pci/controller/pcie-rockchip-host.c
b/drivers/pci/controller/pcie-rockchip-host.c
index 0bb2fb3e8a0b..9b485bea8b92 100644
--- a/drivers/pci/controller/pcie-rockchip-host.c
+++ b/drivers/pci/controller/pcie-rockchip-host.c
@@ -72,14 +72,14 @@ static int rockchip_pcie_valid_device(struct
rockchip_pcie *rockchip,
struct pci_bus *bus, int dev)
{
/* access only one slot on each root port */
- if (pci_is_root_bus(bus) && dev > 0)
- return 0;
-
- /*
- * do not read more than one device on the bus directly attached
- * to RC's downstream side.
- */
- if (pci_is_root_bus(bus->parent) && dev > 0)
+ if (pci_is_root_bus(bus))
+ if (dev > 0)
+ return 0;
+ else if (pci_is_root_bus(bus->parent) && dev > 0)
+ /*
+ * do not read more than one device on the bus directly attached
+ * to RC's downstream side.
+ */
return 0;
return 1;
>
> -- >8 --
> diff --git a/drivers/pci/controller/pcie-rockchip-host.c b/drivers/pci/controller/pcie-rockchip-host.c
> index 0bb2fb3e8a0b..72beda87b47f 100644
> --- a/drivers/pci/controller/pcie-rockchip-host.c
> +++ b/drivers/pci/controller/pcie-rockchip-host.c
> @@ -79,7 +79,7 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
> * do not read more than one device on the bus directly attached
> * to RC's downstream side.
> */
> - if (pci_is_root_bus(bus->parent) && dev > 0)
> + if (bus->parent && pci_is_root_bus(bus->parent) && dev > 0)
> return 0;
>
> return 1;
On 2020-09-03 15:35, Rob Herring wrote:
> On Thu, Sep 3, 2020 at 3:19 AM Lorenzo Pieralisi
> <[email protected]> wrote:
>>
>> On Wed, Sep 02, 2020 at 11:47:56PM -0400, Samuel Dionne-Riel wrote:
>> > On Wed, 2 Sep 2020 17:01:19 +0100
>> > Lorenzo Pieralisi <[email protected]> wrote:
>> >
>> > > On Tue, Sep 01, 2020 at 02:33:56PM -0400, Samuel Dionne-Riel wrote:
>> > >
>> > > Please print a pointer as a pointer and print both bus and
>> > > bus->parent.
>> >
>> > Hopefully pointer as a pointer is %px. Not sure what else, if that's
>> > wrong please tell.
>> >
>> > ---
>> > @@ -79,6 +79,8 @@ static int rockchip_pcie_valid_device(struct rockchip_pcie *rockchip,
>> > * do not read more than one device on the bus directly attached
>> > * to RC's downstream side.
>> > */
>> > + printk("[!!] // bus (%px) bus->parent (%px)\n", bus, bus->parent);
>> > + printk("[!!] bus->primary (%d) == rockchip->root_bus_nr (%d) && dev (%d) > 0\n", bus->primary, rockchip->root_bus_nr, dev);
>> > if (bus->primary == rockchip->root_bus_nr && dev > 0)
>> > return 0;
>> >
>> > --
>> >
>> > Again, two values, verified with a bit of set and `sort -u`.
>> >
>> > [ 1.691266] [!!] // bus (ffff0000ef9ab800) bus->parent (0000000000000000)
>> > [ 1.691271] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
>> >
>> > and
>> >
>> > [ 1.697156] [!!] // bus (ffff0000ef9ac000) bus->parent (ffff0000ef9ab800)
>> > [ 1.697160] [!!] bus->primary (0) == rockchip->root_bus_nr (0) && dev (0) > 0
>> >
>> > First instance of each shown here. Last time I don't think it was.
>>
>> Ok I think I understand what the problem is.
>>
>> Can you give this patch a shot please ? I think we are dereferencing
>> a NULL pointer if bus is the root bus and dev == 0, we can rewrite
>> the check if this patch fixes the issue.
>
> Indeed. I checked all the other cases of pci_is_root_bus(bus->parent)
> and they should be fine because they are only reached if !root_bus.
>
> I would restructure the check like this instead:
>
> diff --git a/drivers/pci/controller/pcie-rockchip-host.c
> b/drivers/pci/controller/pcie-rockchip-host.c
> index 0bb2fb3e8a0b..9b485bea8b92 100644
> --- a/drivers/pci/controller/pcie-rockchip-host.c
> +++ b/drivers/pci/controller/pcie-rockchip-host.c
> @@ -72,14 +72,14 @@ static int rockchip_pcie_valid_device(struct
> rockchip_pcie *rockchip,
> struct pci_bus *bus, int dev)
> {
> /* access only one slot on each root port */
> - if (pci_is_root_bus(bus) && dev > 0)
> - return 0;
> -
> - /*
> - * do not read more than one device on the bus directly
> attached
> - * to RC's downstream side.
> - */
> - if (pci_is_root_bus(bus->parent) && dev > 0)
> + if (pci_is_root_bus(bus))
> + if (dev > 0)
> + return 0;
> + else if (pci_is_root_bus(bus->parent) && dev > 0)
Careful here, this else is relative to the *closest* if,
and not what the indentation suggests...
> + /*
> + * do not read more than one device on the bus directly
> attached
> + * to RC's downstream side.
> + */
> return 0;
>
> return 1;
M.
--
Jazz is not dead. It just smells funny...
On Thu, 03 Sep 2020 16:59:30 +0100
Marc Zyngier <[email protected]> wrote:
> On 2020-09-03 15:35, Rob Herring wrote:
> > On Thu, Sep 3, 2020 at 3:19 AM Lorenzo Pieralisi
> > <[email protected]> wrote:
> >>
> >> Ok I think I understand what the problem is.
> >>
> >> Can you give this patch a shot please ? I think we are
> >> dereferencing a NULL pointer if bus is the root bus and dev == 0,
> >> we can rewrite the check if this patch fixes the issue.
> >
> > Indeed. I checked all the other cases of
> > pci_is_root_bus(bus->parent) and they should be fine because they
> > are only reached if !root_bus.
> >
> > I would restructure the check like this instead:
> >
> > diff --git a/drivers/pci/controller/pcie-rockchip-host.c
> > b/drivers/pci/controller/pcie-rockchip-host.c
> > index 0bb2fb3e8a0b..9b485bea8b92 100644
> > --- a/drivers/pci/controller/pcie-rockchip-host.c
> > +++ b/drivers/pci/controller/pcie-rockchip-host.c
> > @@ -72,14 +72,14 @@ static int rockchip_pcie_valid_device(struct
> > rockchip_pcie *rockchip,
> > struct pci_bus *bus, int dev)
> > {
> > /* access only one slot on each root port */
> > - if (pci_is_root_bus(bus) && dev > 0)
> > - return 0;
> > -
> > - /*
> > - * do not read more than one device on the bus directly
> > attached
> > - * to RC's downstream side.
> > - */
> > - if (pci_is_root_bus(bus->parent) && dev > 0)
> > + if (pci_is_root_bus(bus))
> > + if (dev > 0)
> > + return 0;
> > + else if (pci_is_root_bus(bus->parent) && dev > 0)
>
> Careful here, this else is relative to the *closest* if,
> and not what the indentation suggests...
>
> > + /*
> > + * do not read more than one device on the bus
> > directly attached
> > + * to RC's downstream side.
> > + */
> > return 0;
> >
> > return 1;
>
>
> M.
Both patches work fine. They both allow the device to boot and Wi-Fi to
work as expected.
I will echo Marc's concerns over the indentation. I am only casually
acquainted with C and this indentation would bring to wrong conclusions
if it hadn't been pointed out to me.
Cc me on an eventual patch, FWIW I can give it a final test just in
case.
Thanks again,
--
Samuel Dionne-Riel