by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH 00/11] of: dma-ranges fixes and improvements

On Sun, Sep 29, 2019 at 01:16:20PM +0200, Arnd Bergmann wrote:
> On a semi-related note, Thierry asked about one aspect of the dma-ranges
> property recently, which is the behavior of dma_set_mask() and related
> functions when a driver sets a mask that is larger than the memory
> area in the bus-ranges but smaller than the available physical RAM.
> As I understood Thierry's problem and the current code, the generic
> dma_set_mask() will either reject the new mask entirely or override
> the mask set by of_dma_configure, but it fails to set a correct mask
> within the limitations of the parent bus in this case.

There days dma_set_mask will only reject a mask if it is too small
to be supported by the hardware. Larger than required masks are now
always accepted.

2019-09-30 08:59:07

by Thierry Reding

[permalink] [raw]

Subject: Re: [PATCH 00/11] of: dma-ranges fixes and improvements

On Mon, Sep 30, 2019 at 01:20:55AM -0700, Christoph Hellwig wrote:
> On Sun, Sep 29, 2019 at 01:16:20PM +0200, Arnd Bergmann wrote:
> > On a semi-related note, Thierry asked about one aspect of the dma-ranges
> > property recently, which is the behavior of dma_set_mask() and related
> > functions when a driver sets a mask that is larger than the memory
> > area in the bus-ranges but smaller than the available physical RAM.
> > As I understood Thierry's problem and the current code, the generic
> > dma_set_mask() will either reject the new mask entirely or override
> > the mask set by of_dma_configure, but it fails to set a correct mask
> > within the limitations of the parent bus in this case.
>
> There days dma_set_mask will only reject a mask if it is too small
> to be supported by the hardware. Larger than required masks are now
> always accepted.

Summarizing why this came up: the memory subsystem on Tegra194 has a
mechanism controlled by bit 39 of physical addresses. This is used to
support two variants of sector ordering for block linear formats. The
GPU uses a slightly different ordering than other MSS clients, so the
drivers have to set this bit depending on who they interoperate with.

I was running into this as I was adding support for IOMMU support for
the Ethernet controller on Tegra194. The controller has a HW feature
register that contains how many address bits it supports. This is 40
for Tegra194, corresponding to the number of address bits to the MSS.
Without IOMMU support that's not a problem because there are no systems
with 40 bits of system memory. However, if we enable IOMMU support, the
DMA/IOMMU code will allocate from the top of a 48-bit (constrained to
40 bits via the DMA mask) input address space. This causes bit 39 to be
set, which in turn will make the MSS reorder sectors and break network
communications.

Since this reordering takes place at the MSS level, this applies to all
MSS clients. Most of these clients always want bit 39 to be 0, whereas
the clients that can and want to make use of the reordering always want
bit 39 to be under their control, so they can control in a fine-grained
way when to set it.

This means that effectively all MSS clients can only address 39 bits, so
instead of hard-coding that for each driver I thought it'd make sense to
have a central place to configure this, so that all devices by default
are restricted to 39-bit addressing. However, with the current DMA API
implementation this causes a problem because the default 39-bit DMA mask
would get overwritten by the driver (as in the example of the Ethernet
controller setting a 40-bit DMA mask because that's what the hardware
supports).

I realize that this is somewhat exotic. On one hand it is correct for a
driver to say that the hardware supports 40-bit addressing (i.e. the
Ethernet controller can address bit 39), but from a system integration
point of view, using bit 39 is wrong, except in a very restricted set of
cases.

If I understand correctly, describing this with a dma-ranges property is
the right thing to do, but it wouldn't work with the current
implementation because drivers can still override a lower DMA mask with
a higher one.

Thierry

Attachments:

(No filename) (3.19 kB)
signature.asc (849.00 B)
Download all attachments

2019-09-30 09:22:43

On Mon, Sep 30, 2019 at 10:55:15AM +0100, Robin Murphy wrote:
> On 2019-09-30 9:56 am, Thierry Reding wrote:
> > On Mon, Sep 30, 2019 at 01:20:55AM -0700, Christoph Hellwig wrote:
> > > On Sun, Sep 29, 2019 at 01:16:20PM +0200, Arnd Bergmann wrote:
> > > > On a semi-related note, Thierry asked about one aspect of the dma-ranges
> > > > property recently, which is the behavior of dma_set_mask() and related
> > > > functions when a driver sets a mask that is larger than the memory
> > > > area in the bus-ranges but smaller than the available physical RAM.
> > > > As I understood Thierry's problem and the current code, the generic
> > > > dma_set_mask() will either reject the new mask entirely or override
> > > > the mask set by of_dma_configure, but it fails to set a correct mask
> > > > within the limitations of the parent bus in this case.
> > >
> > > There days dma_set_mask will only reject a mask if it is too small
> > > to be supported by the hardware. Larger than required masks are now
> > > always accepted.
> >
> > Summarizing why this came up: the memory subsystem on Tegra194 has a
> > mechanism controlled by bit 39 of physical addresses. This is used to
> > support two variants of sector ordering for block linear formats. The
> > GPU uses a slightly different ordering than other MSS clients, so the
> > drivers have to set this bit depending on who they interoperate with.
> >
> > I was running into this as I was adding support for IOMMU support for
> > the Ethernet controller on Tegra194. The controller has a HW feature
> > register that contains how many address bits it supports. This is 40
> > for Tegra194, corresponding to the number of address bits to the MSS.
> > Without IOMMU support that's not a problem because there are no systems
> > with 40 bits of system memory. However, if we enable IOMMU support, the
> > DMA/IOMMU code will allocate from the top of a 48-bit (constrained to
> > 40 bits via the DMA mask) input address space. This causes bit 39 to be
> > set, which in turn will make the MSS reorder sectors and break network
> > communications.
> >
> > Since this reordering takes place at the MSS level, this applies to all
> > MSS clients. Most of these clients always want bit 39 to be 0, whereas
> > the clients that can and want to make use of the reordering always want
> > bit 39 to be under their control, so they can control in a fine-grained
> > way when to set it.
> >
> > This means that effectively all MSS clients can only address 39 bits, so
> > instead of hard-coding that for each driver I thought it'd make sense to
> > have a central place to configure this, so that all devices by default
> > are restricted to 39-bit addressing. However, with the current DMA API
> > implementation this causes a problem because the default 39-bit DMA mask
> > would get overwritten by the driver (as in the example of the Ethernet
> > controller setting a 40-bit DMA mask because that's what the hardware
> > supports).
> >
> > I realize that this is somewhat exotic. On one hand it is correct for a
> > driver to say that the hardware supports 40-bit addressing (i.e. the
> > Ethernet controller can address bit 39), but from a system integration
> > point of view, using bit 39 is wrong, except in a very restricted set of
> > cases.
> >
> > If I understand correctly, describing this with a dma-ranges property is
> > the right thing to do, but it wouldn't work with the current
> > implementation because drivers can still override a lower DMA mask with
> > a higher one.
>
> But that sounds like exactly the situation for which we introduced
> bus_dma_mask. If "dma-ranges" is found, then we should initialise that to
> reflect the limitation. Drivers may subsequently set a larger mask based on
> what the device is natively capable of, but the DMA API internals should
> quietly clamp that down to the bus mask wherever it matters.
>
> Since that change, the initial value of dma_mask and coherent_dma_mask
> doesn't really matter much, as we expect drivers to reset them anyway (and
> in general they have to be able to enlarge them from a 32-bit default
> value).
>
> As far as I'm aware this has been working fine (albeit in equivalent ACPI
> form) for at least one SoC with 64-bit device masks, a 48-bit IOMMU, and a
> 44-bit interconnect in between - indeed if I avoid distraction long enough
> to set up the big new box under my desk, the sending of future emails will
> depend on it ;)

After applying this series it does indeed seem to be working. The only
thing I had to do was add a dma-ranges property to the DMA parent. I
ended up doing that via an interconnects property because the default
DMA parent on Tegra194 is /cbb which restricts #address-cells = <1> and
#size-cells = <1>, so it can't actually translate anything beyond 32
bits of system memory.

So I basically ended up making the memory controller an interconnect
provider, increasing #address-cells = <2> and #size-cells = <2> again
and then using a dma-ranges property like this:

dma-ranges = <0x0 0x0 0x0 0x80 0x0>;

to specify that only 39 bits should be used for addressing, leaving the
special bit 39 up to the driver to set as required.

Coincidentally making the memory controller an interconnect provider is
something that I was planning to do anyway in order to support memory
frequency scaling, so this all actually fits together pretty elegantly.

Thierry

Attachments:

(No filename) (5.39 kB)
signature.asc (849.00 B)
Download all attachments

2019-09-30 13:36:25

On Mon, 2019-09-30 at 16:24 -0500, Rob Herring wrote:
> On Mon, Sep 30, 2019 at 8:32 AM Nicolas Saenz Julienne
> <[email protected]> wrote:
> > On Mon, 2019-09-30 at 05:57 -0700, Christoph Hellwig wrote:
> > > On Thu, Sep 26, 2019 at 07:24:49PM -0500, Rob Herring wrote:
> > > > -int of_dma_configure(struct device *dev, struct device_node *np, bool
> > > > force_dma)
> > > > +int of_dma_configure(struct device *dev, struct device_node *parent,
> > > > bool
> > > > force_dma)
> > >
> > > This creates a > 80 char line.
> > >
> > > > {
> > > > u64 dma_addr, paddr, size = 0;
> > > > int ret;
> > > > bool coherent;
> > > > unsigned long offset;
> > > > const struct iommu_ops *iommu;
> > > > + struct device_node *np;
> > > > u64 mask;
> > > >
> > > > + np = dev->of_node;
> > > > + if (!np)
> > > > + np = parent;
> > > > + if (!np)
> > > > + return -ENODEV;
> > >
> > > I have to say I find the older calling convention simpler to understand.
> > > If we want to enforce the invariant I'd rather do that explicitly:
> > >
> > > if (dev->of_node && np != dev->of_node)
> > > return -EINVAL;
> >
> > As is, this would break Freescale Layerscape fsl-mc bus' dma_configure():
>
> This may break PCI too for devices that have a DT node.
>
> > static int fsl_mc_dma_configure(struct device *dev)
> > {
> > struct device *dma_dev = dev;
> >
> > while (dev_is_fsl_mc(dma_dev))
> > dma_dev = dma_dev->parent;
> >
> > return of_dma_configure(dev, dma_dev->of_node, 0);
> > }
> >
> > But I think that with this series, given the fact that we now treat the lack
> > of
> > dma-ranges as a 1:1 mapping instead of an error, we could rewrite the
> > function
> > like this:
>
> Now, I'm reconsidering allowing this abuse... It's better if the code
> which understands the bus structure in DT for a specific bus passes in
> the right thing. Maybe I should go back to Robin's version (below).
> OTOH, the existing assumption that 'dma-ranges' was in the immediate
> parent was an assumption on the bus structure which maybe doesn't
> always apply.
>
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index a45261e21144..6951450bb8f3 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -98,12 +98,15 @@ int of_dma_configure(struct device *dev, struct
> device_node *parent, bool force_
> u64 mask;
>
> np = dev->of_node;
> - if (!np)
> - np = parent;
> + if (np)
> + parent = of_get_dma_parent(np);
> + else
> + np = of_node_get(parent);
> if (!np)
> return -ENODEV;
>
> - ret = of_dma_get_range(np, &dma_addr, &paddr, &size);
> + ret = of_dma_get_range(parent, &dma_addr, &paddr, &size);
> + of_node_put(parent);
> if (ret < 0) {
> /*
> * For legacy reasons, we have to assume some devices need

I spent some time thinking about your comments and researching. I came to the
realization that both these solutions break the usage in
drivers/gpu/drm/sun4i/sun4i_backend.c:805. In that specific case both
'dev->of_node' and 'parent' exist yet the device receiving the configuration
and 'parent' aren't related in any way.

IOW we can't just use 'dev->of_node' as a starting point to walk upwards the
tree. We always have to respect whatever DT node the bus provided, and start
there. This clashes with the current solutions, as they are based on the fact
that we can use dev->of_node when present.

My guess at this point, if we're forced to honor that behaviour, is that we
have to create a new API for the PCI use case. Something the likes of
of_dma_configure_parent().

Regards,
Nicolas

Attachments:

signature.asc (499.00 B)
This is a digitally signed message part

2019-10-04 07:56:14

by Rob Herring (Arm)

[permalink] [raw]

Subject: Re: [PATCH 05/11] of: Ratify of_dma_configure() interface

On Tue, Oct 1, 2019 at 10:43 AM Nicolas Saenz Julienne
<[email protected]> wrote:
>
> On Mon, 2019-09-30 at 16:24 -0500, Rob Herring wrote:
> > On Mon, Sep 30, 2019 at 8:32 AM Nicolas Saenz Julienne
> > <[email protected]> wrote:
> > > On Mon, 2019-09-30 at 05:57 -0700, Christoph Hellwig wrote:
> > > > On Thu, Sep 26, 2019 at 07:24:49PM -0500, Rob Herring wrote:
> > > > > -int of_dma_configure(struct device *dev, struct device_node *np, bool
> > > > > force_dma)
> > > > > +int of_dma_configure(struct device *dev, struct device_node *parent,
> > > > > bool
> > > > > force_dma)
> > > >
> > > > This creates a > 80 char line.
> > > >
> > > > > {
> > > > > u64 dma_addr, paddr, size = 0;
> > > > > int ret;
> > > > > bool coherent;
> > > > > unsigned long offset;
> > > > > const struct iommu_ops *iommu;
> > > > > + struct device_node *np;
> > > > > u64 mask;
> > > > >
> > > > > + np = dev->of_node;
> > > > > + if (!np)
> > > > > + np = parent;
> > > > > + if (!np)
> > > > > + return -ENODEV;
> > > >
> > > > I have to say I find the older calling convention simpler to understand.
> > > > If we want to enforce the invariant I'd rather do that explicitly:
> > > >
> > > > if (dev->of_node && np != dev->of_node)
> > > > return -EINVAL;
> > >
> > > As is, this would break Freescale Layerscape fsl-mc bus' dma_configure():
> >
> > This may break PCI too for devices that have a DT node.
> >
> > > static int fsl_mc_dma_configure(struct device *dev)
> > > {
> > > struct device *dma_dev = dev;
> > >
> > > while (dev_is_fsl_mc(dma_dev))
> > > dma_dev = dma_dev->parent;
> > >
> > > return of_dma_configure(dev, dma_dev->of_node, 0);
> > > }
> > >
> > > But I think that with this series, given the fact that we now treat the lack
> > > of
> > > dma-ranges as a 1:1 mapping instead of an error, we could rewrite the
> > > function
> > > like this:
> >
> > Now, I'm reconsidering allowing this abuse... It's better if the code
> > which understands the bus structure in DT for a specific bus passes in
> > the right thing. Maybe I should go back to Robin's version (below).
> > OTOH, the existing assumption that 'dma-ranges' was in the immediate
> > parent was an assumption on the bus structure which maybe doesn't
> > always apply.
> >
> > diff --git a/drivers/of/device.c b/drivers/of/device.c
> > index a45261e21144..6951450bb8f3 100644
> > --- a/drivers/of/device.c
> > +++ b/drivers/of/device.c
> > @@ -98,12 +98,15 @@ int of_dma_configure(struct device *dev, struct
> > device_node *parent, bool force_
> > u64 mask;
> >
> > np = dev->of_node;
> > - if (!np)
> > - np = parent;
> > + if (np)
> > + parent = of_get_dma_parent(np);
> > + else
> > + np = of_node_get(parent);
> > if (!np)
> > return -ENODEV;
> >
> > - ret = of_dma_get_range(np, &dma_addr, &paddr, &size);
> > + ret = of_dma_get_range(parent, &dma_addr, &paddr, &size);
> > + of_node_put(parent);
> > if (ret < 0) {
> > /*
> > * For legacy reasons, we have to assume some devices need
>
> I spent some time thinking about your comments and researching. I came to the
> realization that both these solutions break the usage in
> drivers/gpu/drm/sun4i/sun4i_backend.c:805. In that specific case both
> 'dev->of_node' and 'parent' exist yet the device receiving the configuration
> and 'parent' aren't related in any way.

I knew there was some reason I didn't like those virtual DT nodes...

That does seem to be the oddest case. Several of the others are just
non-DT child platform devices. Perhaps we need a "copy the DMA config
from another struct device (or parent struct device)" function to
avoid using a DT function on a non-DT device.

> IOW we can't just use 'dev->of_node' as a starting point to walk upwards the
> tree. We always have to respect whatever DT node the bus provided, and start
> there. This clashes with the current solutions, as they are based on the fact
> that we can use dev->of_node when present.

Yes, you are right.

> My guess at this point, if we're forced to honor that behaviour, is that we
> have to create a new API for the PCI use case. Something the likes of
> of_dma_configure_parent().

I think of_dma_configure just has to work with the device_node of
either the device or the device parent and dev->of_node is never used
unless the caller sets it.

Rob

2019-10-07 17:54:43

by Nicolas Saenz Julienne

[permalink] [raw]

Subject: Re: [PATCH 05/11] of: Ratify of_dma_configure() interface

On Thu, 2019-10-03 at 20:53 -0500, Rob Herring wrote:
> > > > But I think that with this series, given the fact that we now treat the
> > > > lack
> > > > of
> > > > dma-ranges as a 1:1 mapping instead of an error, we could rewrite the
> > > > function
> > > > like this:
> > >
> > > Now, I'm reconsidering allowing this abuse... It's better if the code
> > > which understands the bus structure in DT for a specific bus passes in
> > > the right thing. Maybe I should go back to Robin's version (below).
> > > OTOH, the existing assumption that 'dma-ranges' was in the immediate
> > > parent was an assumption on the bus structure which maybe doesn't
> > > always apply.
> > >
> > > diff --git a/drivers/of/device.c b/drivers/of/device.c
> > > index a45261e21144..6951450bb8f3 100644
> > > --- a/drivers/of/device.c
> > > +++ b/drivers/of/device.c
> > > @@ -98,12 +98,15 @@ int of_dma_configure(struct device *dev, struct
> > > device_node *parent, bool force_
> > > u64 mask;
> > >
> > > np = dev->of_node;
> > > - if (!np)
> > > - np = parent;
> > > + if (np)
> > > + parent = of_get_dma_parent(np);
> > > + else
> > > + np = of_node_get(parent);
> > > if (!np)
> > > return -ENODEV;
> > >
> > > - ret = of_dma_get_range(np, &dma_addr, &paddr, &size);
> > > + ret = of_dma_get_range(parent, &dma_addr, &paddr, &size);
> > > + of_node_put(parent);
> > > if (ret < 0) {
> > > /*
> > > * For legacy reasons, we have to assume some devices need
> >
> > I spent some time thinking about your comments and researching. I came to
> > the
> > realization that both these solutions break the usage in
> > drivers/gpu/drm/sun4i/sun4i_backend.c:805. In that specific case both
> > 'dev->of_node' and 'parent' exist yet the device receiving the configuration
> > and 'parent' aren't related in any way.
>
> I knew there was some reason I didn't like those virtual DT nodes...
>
> That does seem to be the oddest case. Several of the others are just
> non-DT child platform devices. Perhaps we need a "copy the DMA config
> from another struct device (or parent struct device)" function to
> avoid using a DT function on a non-DT device.
>
> > IOW we can't just use 'dev->of_node' as a starting point to walk upwards the
> > tree. We always have to respect whatever DT node the bus provided, and start
> > there. This clashes with the current solutions, as they are based on the
> > fact
> > that we can use dev->of_node when present.
>
> Yes, you are right.
>
> > My guess at this point, if we're forced to honor that behaviour, is that we
> > have to create a new API for the PCI use case. Something the likes of
> > of_dma_configure_parent().
>
> I think of_dma_configure just has to work with the device_node of
> either the device or the device parent and dev->of_node is never used
> unless the caller sets it.

Fine, so given the following two distinct uses of
of_dma_configure(struct device *dev, struct device_node *np, bool ...):

- dev->of_node == np: Platform bus' typical use, we imperatively have to start
parsing dma-ranges from np's DMA parent, as the device we're configuring
might be a bus containing dma-ranges himself. For example a platform PCIe bus.

- dev->of_node != np: Here the bus is pulling some trick. The device might or
might not be represented in DT and np might be a bus or a device. But one
thing I realised is that device being configured never represents a memory
mapped bus. Assuming this assumption is acceptable, we can traverse the DT
tree starting from np and get a correct configuration as long as dma-ranges
not being present is interpreted as a 1:1 mapping.

The resulting code, which I tested on an RPi4, Freescale Layerscape and passes
OF's unit tests, looks like this:

int of_dma_configure(struct device *dev, struct device_node *np, bool force_dma)
{
u64 dma_addr, paddr, size = 0;
struct device_node *parent;
u64 mask;
int ret;

if (!np)
return -ENODEV;

parent = of_node_get(np);
if (dev->of_node == parent)
parent = of_get_next_dma_parent(np);

ret = of_dma_get_range(parent, &dma_addr, &paddr, &size);
of_node_put(parent);

[...]
}

Would that be acceptable?

Regards,
Nicolas

Attachments:

signature.asc (499.00 B)
This is a digitally signed message part