LinuxLists.cc - [PATCH RFC 0/4] arm64: support DMA zone starting above 4GB

2023-12-27 15:12:41

Subject: [PATCH RFC 0/4] arm64: support DMA zone starting above 4GB

DMA zones code assumes that DMA lower limit is zero. When there is no RAM
below 4GB, arm64 platform code sets DMA/DMA32 zone limits to cover the entire
RAM[0].

The platform I have has RAM starting at 32GB. Devices with 30-bit DMA mask are
mapped to 1GB at the bottom of RAM, between 32GB - 33GB. A DMA zone over the
entire RAM breaks DMA allocation for these devices.

In response to a previous RFC hack[1] Catalin Marinas suggested to add a
separate offset value as base address for the DMA zone. This RFC series
attempts to implement that suggestion.

With this series applied, the DMA zone covers the right RAM range for my
platform.

[0] See commit 791ab8b2e3db ("arm64: Ignore any DMA offsets in the
max_zone_phys() calculation")

[1] https://lore.kernel.org/all/9af8a19c3398e7dc09cfc1fbafed98d795d9f83e.1699464622.git.baruch@tkos.co.il/

Baruch Siach (4):
of: get dma area lower limit
of: unittest: add test for of_dma_get_cpu_limits() 'min' param
dma-direct: add offset to zone_dma_bits
arm64: mm: take DMA zone offset into account

arch/arm64/mm/init.c | 18 +++++++++++++-----
drivers/of/address.c | 38 +++++++++++++++++++++++++++-----------
drivers/of/unittest.c | 17 ++++++++++-------
include/linux/dma-direct.h | 1 +
include/linux/of.h | 11 ++++++++---
kernel/dma/direct.c | 10 ++++++----
kernel/dma/pool.c | 2 +-
kernel/dma/swiotlb.c | 5 +++--
8 files changed, 69 insertions(+), 33 deletions(-)

--
2.43.0

2023-12-27 15:12:45

by Baruch Siach

[permalink] [raw]

Subject: [PATCH RFC 4/4] arm64: mm: take DMA zone offset into account

Commit 791ab8b2e3db ("arm64: Ignore any DMA offsets in the
max_zone_phys() calculation") made DMA/DMA32 zones span the entire RAM
when RAM starts above 32-bits. This breaks hardware with DMA area that
start above 32-bits. But the commit log says that "we haven't noticed
any such hardware". It turns out that such hardware does exist.

One such platform has RAM starting at 32GB with an internal bus that has
the following DMA limits:

#address-cells = <2>;
#size-cells = <2>;
dma-ranges = <0x00 0xc0000000 0x08 0x00000000 0x00 0x40000000>;

Devices under this bus can see 1GB of DMA range between 3GB-4GB in each
device address space. This range is mapped to CPU memory at 32GB-33GB.
With current code DMA allocations for devices under this bus are not
limited to DMA area, leading to run-time allocation failure.

Modify 'zone_dma_bits' calculation (via dt_zone_dma_bits) to only cover
the actual DMA area starting at 'zone_dma_off'. Use the newly introduced
'min' parameter of of_dma_get_cpu_limits() to set 'zone_dma_off'.

DMA32 zone is useless in this configuration, so make its limit the same
as the DMA zone when the lower DMA limit is higher than 32-bits.

The result is DMA zone that properly reflects the hardware constraints
as follows:

[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000800000000-0x000000083fffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal [mem 0x0000000840000000-0x0000000bffffffff]

Suggested-by: Catalin Marinas <[email protected]>
Signed-off-by: Baruch Siach <[email protected]>
---
arch/arm64/mm/init.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index d6c723ae6fb0..4a8fd8394ce6 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -118,10 +118,11 @@ static void __init arch_reserve_crashkernel(void)
* limit. If DRAM starts above 32-bit, expand the zone to the maximum
* available memory, otherwise cap it at 32-bit.
*/
-static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
+static phys_addr_t __init max_zone_phys(unsigned int zone_bits,
+ phys_addr_t zone_off)
{
phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
- phys_addr_t phys_start = memblock_start_of_DRAM();
+ phys_addr_t phys_start = memblock_start_of_DRAM() - zone_off;

if (phys_start > U32_MAX)
zone_mask = PHYS_ADDR_MAX;
@@ -137,14 +138,19 @@ static void __init zone_sizes_init(void)
unsigned int __maybe_unused acpi_zone_dma_bits;
unsigned int __maybe_unused dt_zone_dma_bits;
phys_addr_t __maybe_unused max_cpu_address;
- phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32);
+ phys_addr_t __maybe_unused min_cpu_address;
+ phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32, 0);

#ifdef CONFIG_ZONE_DMA
acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
- of_dma_get_cpu_limits(NULL, &max_cpu_address, NULL);
- dt_zone_dma_bits = fls64(max_cpu_address);
+ of_dma_get_cpu_limits(NULL, &max_cpu_address, &min_cpu_address);
+ dt_zone_dma_bits = fls64(max_cpu_address - min_cpu_address);
zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
- arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
+ zone_dma_off = min_cpu_address;
+ arm64_dma_phys_limit = max_zone_phys(zone_dma_bits, zone_dma_off)
+ + zone_dma_off;
+ if (zone_dma_off > U32_MAX)
+ dma32_phys_limit = arm64_dma_phys_limit;
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
#endif
#ifdef CONFIG_ZONE_DMA32
--
2.43.0

2023-12-27 15:12:48

by Baruch Siach

[permalink] [raw]

Subject: [PATCH RFC 3/4] dma-direct: add offset to zone_dma_bits

Current code using zone_dma_bits assume that all addresses range in the
bits mask are suitable for DMA. For some existing platforms this
assumption is not correct. DMA range might have non zero lower limit.

Add 'zone_dma_off' for platform code to set base address for DMA zone.

Rename the dma_direct_supported() local 'min_mask' variable to better
describe its use as limit.

Suggested-by: Catalin Marinas <[email protected]>
Signed-off-by: Baruch Siach <[email protected]>
---
include/linux/dma-direct.h | 1 +
kernel/dma/direct.c | 10 ++++++----
kernel/dma/pool.c | 2 +-
kernel/dma/swiotlb.c | 5 +++--
4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 18aade195884..77c52019c1ff 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -13,6 +13,7 @@
#include <linux/swiotlb.h>

extern unsigned int zone_dma_bits;
+extern phys_addr_t zone_dma_off;

/*
* Record the mapping of CPU physical to DMA addresses for a given region.
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 73c95815789a..7a54fa71a174 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -21,6 +21,7 @@
* override the variable below for dma-direct to work properly.
*/
unsigned int zone_dma_bits __ro_after_init = 24;
+phys_addr_t zone_dma_off __ro_after_init;

static inline dma_addr_t phys_to_dma_direct(struct device *dev,
phys_addr_t phys)
@@ -59,7 +60,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
* zones.
*/
*phys_limit = dma_to_phys(dev, dma_limit);
- if (*phys_limit <= DMA_BIT_MASK(zone_dma_bits))
+ if (*phys_limit <= zone_dma_off + DMA_BIT_MASK(zone_dma_bits))
return GFP_DMA;
if (*phys_limit <= DMA_BIT_MASK(32))
return GFP_DMA32;
@@ -566,7 +567,7 @@ int dma_direct_mmap(struct device *dev, struct vm_area_struct *vma,

int dma_direct_supported(struct device *dev, u64 mask)
{
- u64 min_mask = (max_pfn - 1) << PAGE_SHIFT;
+ u64 min_limit = (max_pfn - 1) << PAGE_SHIFT;

/*
* Because 32-bit DMA masks are so common we expect every architecture
@@ -583,8 +584,9 @@ int dma_direct_supported(struct device *dev, u64 mask)
* part of the check.
*/
if (IS_ENABLED(CONFIG_ZONE_DMA))
- min_mask = min_t(u64, min_mask, DMA_BIT_MASK(zone_dma_bits));
- return mask >= phys_to_dma_unencrypted(dev, min_mask);
+ min_limit = min_t(u64, min_limit,
+ zone_dma_off + DMA_BIT_MASK(zone_dma_bits));
+ return mask >= phys_to_dma_unencrypted(dev, min_limit);
}

/*
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index b481c48a31a6..0103e932444d 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -70,7 +70,7 @@ static bool cma_in_zone(gfp_t gfp)
/* CMA can't cross zone boundaries, see cma_activate_area() */
end = cma_get_base(cma) + size - 1;
if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
- return end <= DMA_BIT_MASK(zone_dma_bits);
+ return end <= zone_dma_off + DMA_BIT_MASK(zone_dma_bits);
if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
return end <= DMA_BIT_MASK(32);
return true;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 33d942615be5..bd3071894d2b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -446,7 +446,8 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
if (!remap)
io_tlb_default_mem.can_grow = true;
if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA))
- io_tlb_default_mem.phys_limit = DMA_BIT_MASK(zone_dma_bits);
+ io_tlb_default_mem.phys_limit =
+ zone_dma_off + DMA_BIT_MASK(zone_dma_bits);
else if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32))
io_tlb_default_mem.phys_limit = DMA_BIT_MASK(32);
else
@@ -625,7 +626,7 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
}

gfp &= ~GFP_ZONEMASK;
- if (phys_limit <= DMA_BIT_MASK(zone_dma_bits))
+ if (phys_limit <= zone_dma_off + DMA_BIT_MASK(zone_dma_bits))
gfp |= __GFP_DMA;
else if (phys_limit <= DMA_BIT_MASK(32))
gfp |= __GFP_DMA32;
--
2.43.0

2023-12-27 15:12:56

by Baruch Siach

[permalink] [raw]

Subject: [PATCH RFC 1/4] of: get dma area lower limit

of_dma_get_max_cpu_address() returns the highest CPU address that
devices can use for DMA. The implicit assumption is that all CPU
addresses below that limit are suitable for DMA. However the
'dma-ranges' property this code uses also encodes a lower limit for DMA
that is potentially non zero.

Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower
limit for the same 'dma-ranges' property describing the high limit.

Update callers of of_dma_get_max_cpu_address(). No functional change
intended.

Signed-off-by: Baruch Siach <[email protected]>
---
arch/arm64/mm/init.c | 4 +++-
drivers/of/address.c | 38 +++++++++++++++++++++++++++-----------
drivers/of/unittest.c | 8 ++++----
include/linux/of.h | 11 ++++++++---
4 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 74c1db8ce271..d6c723ae6fb0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -136,11 +136,13 @@ static void __init zone_sizes_init(void)
unsigned long max_zone_pfns[MAX_NR_ZONES] = {0};
unsigned int __maybe_unused acpi_zone_dma_bits;
unsigned int __maybe_unused dt_zone_dma_bits;
+ phys_addr_t __maybe_unused max_cpu_address;
phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32);

#ifdef CONFIG_ZONE_DMA
acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
- dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
+ of_dma_get_cpu_limits(NULL, &max_cpu_address, NULL);
+ dt_zone_dma_bits = fls64(max_cpu_address);
zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
diff --git a/drivers/of/address.c b/drivers/of/address.c
index b59956310f66..51fa52bbe911 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -965,21 +965,25 @@ int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map)
#endif /* CONFIG_HAS_DMA */

/**
- * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
+ * of_dma_get_cpu_limits - Gets highest CPU address suitable for DMA
* @np: The node to start searching from or NULL to start from the root
+ * @max: Pointer to high address limit or NULL if not needed
+ * @min: Pointer to low address limit or NULL if not needed
*
* Gets the highest CPU physical address that is addressable by all DMA masters
- * in the sub-tree pointed by np, or the whole tree if NULL is passed. If no
- * DMA constrained device is found, it returns PHYS_ADDR_MAX.
+ * in the sub-tree pointed by np, or the whole tree if @np in NULL. If no
+ * DMA constrained device is found, @*max is PHYS_ADDR_MAX, and @*low is 0.
*/
-phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
+void __init of_dma_get_cpu_limits(struct device_node *np,
+ phys_addr_t *max, phys_addr_t *min)
{
phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
struct of_range_parser parser;
- phys_addr_t subtree_max_addr;
+ phys_addr_t min_cpu_addr = 0;
struct device_node *child;
struct of_range range;
const __be32 *ranges;
+ u64 cpu_start = 0;
u64 cpu_end = 0;
int len;

@@ -989,21 +993,33 @@ phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
ranges = of_get_property(np, "dma-ranges", &len);
if (ranges && len) {
of_dma_range_parser_init(&parser, np);
- for_each_of_range(&parser, &range)
- if (range.cpu_addr + range.size > cpu_end)
+ for_each_of_range(&parser, &range) {
+ if (range.cpu_addr + range.size > cpu_end) {
cpu_end = range.cpu_addr + range.size - 1;
+ cpu_start = range.cpu_addr;
+ }
+ }

- if (max_cpu_addr > cpu_end)
+ if (max_cpu_addr > cpu_end) {
max_cpu_addr = cpu_end;
+ min_cpu_addr = cpu_start;
+ }
}

for_each_available_child_of_node(np, child) {
- subtree_max_addr = of_dma_get_max_cpu_address(child);
- if (max_cpu_addr > subtree_max_addr)
+ phys_addr_t subtree_max_addr, subtree_min_addr;
+
+ of_dma_get_cpu_limits(child, &subtree_max_addr, &subtree_min_addr);
+ if (max_cpu_addr > subtree_max_addr) {
max_cpu_addr = subtree_max_addr;
+ min_cpu_addr = subtree_min_addr;
+ }
}

- return max_cpu_addr;
+ if (max)
+ *max = max_cpu_addr;
+ if (min)
+ *min = min_cpu_addr;
}

/**
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index e9e90e96600e..21d273a05ba6 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -908,7 +908,7 @@ static void __init of_unittest_changeset(void)
#endif
}

-static void __init of_unittest_dma_get_max_cpu_address(void)
+static void __init of_unittest_dma_get_cpu_limits(void)
{
struct device_node *np;
phys_addr_t cpu_addr;
@@ -922,9 +922,9 @@ static void __init of_unittest_dma_get_max_cpu_address(void)
return;
}

- cpu_addr = of_dma_get_max_cpu_address(np);
+ of_dma_get_cpu_limits(np, &cpu_addr, NULL);
unittest(cpu_addr == 0x4fffffff,
- "of_dma_get_max_cpu_address: wrong CPU addr %pad (expecting %x)\n",
+ "of_dma_get_cpu_limits: wrong CPU addr %pad (expecting %x)\n",
&cpu_addr, 0x4fffffff);
}

@@ -4104,7 +4104,7 @@ static int __init of_unittest(void)
of_unittest_changeset();
of_unittest_parse_interrupts();
of_unittest_parse_interrupts_extended();
- of_unittest_dma_get_max_cpu_address();
+ of_unittest_dma_get_cpu_limits();
of_unittest_parse_dma_ranges();
of_unittest_pci_dma_ranges();
of_unittest_bus_ranges();
diff --git a/include/linux/of.h b/include/linux/of.h
index 6a9ddf20e79a..629b402d81bf 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -445,7 +445,8 @@ int of_map_id(struct device_node *np, u32 id,
const char *map_name, const char *map_mask_name,
struct device_node **target, u32 *id_out);

-phys_addr_t of_dma_get_max_cpu_address(struct device_node *np);
+void of_dma_get_cpu_limits(struct device_node *np, phys_addr_t *max,
+ phys_addr_t *min);

struct kimage;
void *of_kexec_alloc_and_setup_fdt(const struct kimage *image,
@@ -865,9 +866,13 @@ static inline int of_map_id(struct device_node *np, u32 id,
return -EINVAL;
}

-static inline phys_addr_t of_dma_get_max_cpu_address(struct device_node *np)
+static inline void of_dma_get_cpu_limits(struct device_node *np,
+ phys_addr_t *max, phys_addr_t *min)
{
- return PHYS_ADDR_MAX;
+ if (max)
+ *max = PHYS_ADDR_MAX;
+ if (min)
+ *min = 0;
}

static inline const void *of_device_get_match_data(const struct device *dev)
--
2.43.0

2024-01-08 18:01:12

by Catalin Marinas

[permalink] [raw]

Subject: Re: [PATCH RFC 3/4] dma-direct: add offset to zone_dma_bits

On Wed, Dec 27, 2023 at 05:04:27PM +0200, Baruch Siach wrote:
> Current code using zone_dma_bits assume that all addresses range in the
> bits mask are suitable for DMA. For some existing platforms this
> assumption is not correct. DMA range might have non zero lower limit.
>
> Add 'zone_dma_off' for platform code to set base address for DMA zone.
>
> Rename the dma_direct_supported() local 'min_mask' variable to better
> describe its use as limit.
>
> Suggested-by: Catalin Marinas <[email protected]>

When I suggested taking the DMA offsets into account, that's not exactly
what I meant. Based on patch 4, it looks like zone_dma_off is equivalent
to the lower CPU address. Let's say a system has DRAM starting at 2GB
and all 32-bit DMA-capable devices has a DMA offset of 0. We want
ZONE_DMA32 to end at 4GB rather than 6GB.

> @@ -59,7 +60,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
> * zones.
> */
> *phys_limit = dma_to_phys(dev, dma_limit);
> - if (*phys_limit <= DMA_BIT_MASK(zone_dma_bits))
> + if (*phys_limit <= zone_dma_off + DMA_BIT_MASK(zone_dma_bits))
> return GFP_DMA;
> if (*phys_limit <= DMA_BIT_MASK(32))
> return GFP_DMA32;

Ah, you ignore the zone_dma_off for 32-bit calculations. But the
argument still stands, the start of DRAM does not necessarily mean that
all non-64-bit devices have such DMA offset.

The current dma_direct_optimal_gfp_mask() confuses me a bit, I think it
gives the wrong flag if we have a zone_dma_bits of 30 and a device with
a coherent_dma_mask of 31, it incorrectly ends up with GFP_DMA32 (I'm
ignoring dma offsets in this example). Luckily I don't think we have any
set up where this would fail. Basically if *phys_limit is strictly
smaller than DMA_BIT_MASK(32), we want GFP_DMA rather than GFP_DMA32
even if it is larger than DMA_BIT_MASK(zone_dma_bits).

Anyway, current mainline assumes that DMA_BIT_MASK(zone_dma_bits) and
DMA_BIT_MASK(32) are CPU addresses. The problem is that we may have the
start of RAM well above 4GB and neither ZONE_DMA nor ZONE_DMA32 upper
limits would be a power-of-two. We could change the DMA_BIT_MASK(...) to
be DMA address limits and we end up with something like:

static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
{
u64 dma_limit = min_not_zero(
dev->coherent_dma_mask,
dev->bus_dma_limit);
u64 dma32_limit = dma_to_phys(dev, DMA_BIT_MASK(32));

*phys_limit = dma_to_phys(dev, dma_limit);
if (*phys_limit > dma_limit)
return 0;
if (*phys_limit = dma32_limit)
return GFP_DMA32;
return GFP_DMA;
}

The alternative is to get rid of the *_bits variants and go for
zone_dma_limit and zone_dma32_limit in the generic code. For most
architectures they would match the current DMA_BIT_MASK(32) etc. but
arm64 would be able to set some higher values.

My preference would be to go for zone_dma{,32}_limit, it's easier to
change all the places where DMA_BIT_MASK({zone_dma_bits,32}) is used.

--
Catalin

2024-01-09 10:11:27

by Baruch Siach

[permalink] [raw]

Subject: Re: [PATCH RFC 3/4] dma-direct: add offset to zone_dma_bits

Hi Catalin,

On Mon, Jan 08 2024, Catalin Marinas wrote:
> On Wed, Dec 27, 2023 at 05:04:27PM +0200, Baruch Siach wrote:
>> Current code using zone_dma_bits assume that all addresses range in the
>> bits mask are suitable for DMA. For some existing platforms this
>> assumption is not correct. DMA range might have non zero lower limit.
>>
>> Add 'zone_dma_off' for platform code to set base address for DMA zone.
>>
>> Rename the dma_direct_supported() local 'min_mask' variable to better
>> describe its use as limit.
>>
>> Suggested-by: Catalin Marinas <[email protected]>
>
> When I suggested taking the DMA offsets into account, that's not exactly
> what I meant. Based on patch 4, it looks like zone_dma_off is equivalent
> to the lower CPU address. Let's say a system has DRAM starting at 2GB
> and all 32-bit DMA-capable devices has a DMA offset of 0. We want
> ZONE_DMA32 to end at 4GB rather than 6GB.

Patch 4 sets zone_dma_off to the lower limit from 'dma-ranges' property
that determines zone_dma_bits. This is not necessarily equivalent to
start of DRAM, though it happens to be that way on my platform.

>> @@ -59,7 +60,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
>> * zones.
>> */
>> *phys_limit = dma_to_phys(dev, dma_limit);
>> - if (*phys_limit <= DMA_BIT_MASK(zone_dma_bits))
>> + if (*phys_limit <= zone_dma_off + DMA_BIT_MASK(zone_dma_bits))
>> return GFP_DMA;
>> if (*phys_limit <= DMA_BIT_MASK(32))
>> return GFP_DMA32;
>
> Ah, you ignore the zone_dma_off for 32-bit calculations. But the
> argument still stands, the start of DRAM does not necessarily mean that
> all non-64-bit devices have such DMA offset.
>
> The current dma_direct_optimal_gfp_mask() confuses me a bit, I think it
> gives the wrong flag if we have a zone_dma_bits of 30 and a device with
> a coherent_dma_mask of 31, it incorrectly ends up with GFP_DMA32 (I'm
> ignoring dma offsets in this example). Luckily I don't think we have any
> set up where this would fail. Basically if *phys_limit is strictly
> smaller than DMA_BIT_MASK(32), we want GFP_DMA rather than GFP_DMA32
> even if it is larger than DMA_BIT_MASK(zone_dma_bits).
>
> Anyway, current mainline assumes that DMA_BIT_MASK(zone_dma_bits) and
> DMA_BIT_MASK(32) are CPU addresses. The problem is that we may have the
> start of RAM well above 4GB and neither ZONE_DMA nor ZONE_DMA32 upper
> limits would be a power-of-two. We could change the DMA_BIT_MASK(...) to
> be DMA address limits and we end up with something like:
>
> static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
> {
> u64 dma_limit = min_not_zero(
> dev->coherent_dma_mask,
> dev->bus_dma_limit);
> u64 dma32_limit = dma_to_phys(dev, DMA_BIT_MASK(32));
>
> *phys_limit = dma_to_phys(dev, dma_limit);
> if (*phys_limit > dma_limit)
> return 0;
> if (*phys_limit = dma32_limit)
> return GFP_DMA32;
> return GFP_DMA;
> }
>
> The alternative is to get rid of the *_bits variants and go for
> zone_dma_limit and zone_dma32_limit in the generic code. For most
> architectures they would match the current DMA_BIT_MASK(32) etc. but
> arm64 would be able to set some higher values.
>
> My preference would be to go for zone_dma{,32}_limit, it's easier to
> change all the places where DMA_BIT_MASK({zone_dma_bits,32}) is used.

Sounds good to me.

Thanks for your review of this confusing piece of code.

baruch

--
~. .~ Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
- [email protected] - tel: +972.52.368.4656, http://www.tkos.co.il -

2024-01-09 10:55:21

by Catalin Marinas

[permalink] [raw]

Subject: Re: [PATCH RFC 3/4] dma-direct: add offset to zone_dma_bits

On Tue, Jan 09, 2024 at 12:03:43PM +0200, Baruch Siach wrote:
> On Mon, Jan 08 2024, Catalin Marinas wrote:
> > On Wed, Dec 27, 2023 at 05:04:27PM +0200, Baruch Siach wrote:
> >> Current code using zone_dma_bits assume that all addresses range in the
> >> bits mask are suitable for DMA. For some existing platforms this
> >> assumption is not correct. DMA range might have non zero lower limit.
> >>
> >> Add 'zone_dma_off' for platform code to set base address for DMA zone.
> >>
> >> Rename the dma_direct_supported() local 'min_mask' variable to better
> >> describe its use as limit.
> >>
> >> Suggested-by: Catalin Marinas <[email protected]>
> >
> > When I suggested taking the DMA offsets into account, that's not exactly
> > what I meant. Based on patch 4, it looks like zone_dma_off is equivalent
> > to the lower CPU address. Let's say a system has DRAM starting at 2GB
> > and all 32-bit DMA-capable devices has a DMA offset of 0. We want
> > ZONE_DMA32 to end at 4GB rather than 6GB.
>
> Patch 4 sets zone_dma_off to the lower limit from 'dma-ranges' property
> that determines zone_dma_bits. This is not necessarily equivalent to
> start of DRAM, though it happens to be that way on my platform.

A bit better but it still assumes that all devices have the same DMA
offset which may not be the case.

--
Catalin

2024-01-09 14:10:43

by Baruch Siach

[permalink] [raw]

Subject: Re: [PATCH RFC 3/4] dma-direct: add offset to zone_dma_bits

Hi Catalin,

On Tue, Jan 09 2024, Catalin Marinas wrote:
> On Tue, Jan 09, 2024 at 12:03:43PM +0200, Baruch Siach wrote:
>> On Mon, Jan 08 2024, Catalin Marinas wrote:
>> > On Wed, Dec 27, 2023 at 05:04:27PM +0200, Baruch Siach wrote:
>> >> Current code using zone_dma_bits assume that all addresses range in the
>> >> bits mask are suitable for DMA. For some existing platforms this
>> >> assumption is not correct. DMA range might have non zero lower limit.
>> >>
>> >> Add 'zone_dma_off' for platform code to set base address for DMA zone.
>> >>
>> >> Rename the dma_direct_supported() local 'min_mask' variable to better
>> >> describe its use as limit.
>> >>
>> >> Suggested-by: Catalin Marinas <[email protected]>
>> >
>> > When I suggested taking the DMA offsets into account, that's not exactly
>> > what I meant. Based on patch 4, it looks like zone_dma_off is equivalent
>> > to the lower CPU address. Let's say a system has DRAM starting at 2GB
>> > and all 32-bit DMA-capable devices has a DMA offset of 0. We want
>> > ZONE_DMA32 to end at 4GB rather than 6GB.
>>
>> Patch 4 sets zone_dma_off to the lower limit from 'dma-ranges' property
>> that determines zone_dma_bits. This is not necessarily equivalent to
>> start of DRAM, though it happens to be that way on my platform.
>
> A bit better but it still assumes that all devices have the same DMA
> offset which may not be the case.

Current code calculates zone_dma_bits based on the lowest high limit of
all 'dma-ranges' properties. The assumption appears to be that this
limit fits all devices. This series does not change this assumption. It
only extends the logic to the lower limit of the "winning" 'dma-ranges'
to set the base address for DMA zone.

Moving to dma_zone_limit would not change that logic. Unless I'm missing
something.

Breaking the "one DMA zone fits all devices" assumption as Petr
suggested is a much larger change.

baruch

--
~. .~ Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
- [email protected] - tel: +972.52.368.4656, http://www.tkos.co.il -

2024-01-09 17:52:28

by Catalin Marinas

[permalink] [raw]

Subject: Re: [PATCH RFC 3/4] dma-direct: add offset to zone_dma_bits

On Tue, Jan 09, 2024 at 03:54:13PM +0200, Baruch Siach wrote:
> On Tue, Jan 09 2024, Catalin Marinas wrote:
> > On Tue, Jan 09, 2024 at 12:03:43PM +0200, Baruch Siach wrote:
> >> On Mon, Jan 08 2024, Catalin Marinas wrote:
> >> > On Wed, Dec 27, 2023 at 05:04:27PM +0200, Baruch Siach wrote:
> >> >> Current code using zone_dma_bits assume that all addresses range in the
> >> >> bits mask are suitable for DMA. For some existing platforms this
> >> >> assumption is not correct. DMA range might have non zero lower limit.
> >> >>
> >> >> Add 'zone_dma_off' for platform code to set base address for DMA zone.
> >> >>
> >> >> Rename the dma_direct_supported() local 'min_mask' variable to better
> >> >> describe its use as limit.
> >> >>
> >> >> Suggested-by: Catalin Marinas <[email protected]>
> >> >
> >> > When I suggested taking the DMA offsets into account, that's not exactly
> >> > what I meant. Based on patch 4, it looks like zone_dma_off is equivalent
> >> > to the lower CPU address. Let's say a system has DRAM starting at 2GB
> >> > and all 32-bit DMA-capable devices has a DMA offset of 0. We want
> >> > ZONE_DMA32 to end at 4GB rather than 6GB.
> >>
> >> Patch 4 sets zone_dma_off to the lower limit from 'dma-ranges' property
> >> that determines zone_dma_bits. This is not necessarily equivalent to
> >> start of DRAM, though it happens to be that way on my platform.
> >
> > A bit better but it still assumes that all devices have the same DMA
> > offset which may not be the case.
>
> Current code calculates zone_dma_bits based on the lowest high limit of
> all 'dma-ranges' properties. The assumption appears to be that this
> limit fits all devices. This series does not change this assumption. It
> only extends the logic to the lower limit of the "winning" 'dma-ranges'
> to set the base address for DMA zone.
>
> Moving to dma_zone_limit would not change that logic. Unless I'm missing
> something.

Indeed, the logic here stays the same. What doesn't work currently is
that we use fls64() of this address and we also cap it to 32 in the
arm64 zone_sizes_init(). On a system where RAM starts at 4GB and we have
a (for example) 30-bit device, zone_dma_bits ends up as 32 (fls64(5G)).
With your patch, IIUC, zone_dma_off would be 4GB in such scenario and
adding DMA_BIT_MASK(zone_dma_bits) to it results in 8GB, not the 5GB
limit that we actually need (the first GB of the RAM).

> Breaking the "one DMA zone fits all devices" assumption as Petr
> suggested is a much larger change.

I don't think we should go this way, we just need to make sure the DMA
zone does not extend above the lowest upper limit of the cpu addresses
in the 'dma-ranges' property. Basically what
of_dma_get_max_cpu_address() gives us but without any capping or
conversion into a power of two. This should cover those sub-32-bit
devices.

See the partial patch below, not really tested (and it breaks powerpc,
s390) but it's easier to discuss on code. In addition, we need to figure
out what to do with ZONE_DMA32 in such case. Do we consider the first
4GB of the RAM or we just don't bother with setting up this zone if the
RAM starts above 4GB, only rely on ZONE_DMA? I'd go with the latter but
needs some thinking.

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 74c1db8ce271..0a15628ece7e 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -113,36 +113,24 @@ static void __init arch_reserve_crashkernel(void)
low_size, high);
}

-/*
- * Return the maximum physical address for a zone accessible by the given bits
- * limit. If DRAM starts above 32-bit, expand the zone to the maximum
- * available memory, otherwise cap it at 32-bit.
- */
-static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
+static phys_addr_t __init max_zone_phys(phys_addr_t zone_mask)
{
- phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
- phys_addr_t phys_start = memblock_start_of_DRAM();
-
- if (phys_start > U32_MAX)
- zone_mask = PHYS_ADDR_MAX;
- else if (phys_start > zone_mask)
- zone_mask = U32_MAX;
-
return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
}

static void __init zone_sizes_init(void)
{
unsigned long max_zone_pfns[MAX_NR_ZONES] = {0};
- unsigned int __maybe_unused acpi_zone_dma_bits;
- unsigned int __maybe_unused dt_zone_dma_bits;
- phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32);
+ phys_addr_t __maybe_unused acpi_zone_dma_limit;
+ phys_addr_t __maybe_unused dt_zone_dma_limit;
+ phys_addr_t __maybe_unused dma32_phys_limit =
+ max_zone_phys(DMA_BIT_MASK(32));

#ifdef CONFIG_ZONE_DMA
- acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
- dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
- zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
- arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
+ acpi_zone_dma_limit = acpi_iort_dma_get_max_cpu_address();
+ dt_zone_dma_limit = of_dma_get_max_cpu_address(NULL);
+ zone_dma_limit = min(dt_zone_dma_limit, acpi_zone_dma_limit);
+ arm64_dma_phys_limit = max_zone_phys(zone_dma_limit);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
#endif
#ifdef CONFIG_ZONE_DMA32
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 18aade195884..f55de778612b 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -12,7 +12,7 @@
#include <linux/mem_encrypt.h>
#include <linux/swiotlb.h>

-extern unsigned int zone_dma_bits;
+extern phys_addr_t zone_dma_limit;

/*
* Record the mapping of CPU physical to DMA addresses for a given region.
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 73c95815789a..1e12c593b6f3 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -20,7 +20,7 @@
* it for entirely different regions. In that case the arch code needs to
* override the variable below for dma-direct to work properly.
*/
-unsigned int zone_dma_bits __ro_after_init = 24;
+phys_addr_t zone_dma_limit __ro_after_init = DMA_BIT_MASK(24);

static inline dma_addr_t phys_to_dma_direct(struct device *dev,
phys_addr_t phys)
@@ -59,7 +59,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 *phys_limit)
* zones.
*/
*phys_limit = dma_to_phys(dev, dma_limit);
- if (*phys_limit <= DMA_BIT_MASK(zone_dma_bits))
+ if (*phys_limit <= zone_dma_limit)
return GFP_DMA;
if (*phys_limit <= DMA_BIT_MASK(32))
return GFP_DMA32;
@@ -583,7 +583,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
* part of the check.
*/
if (IS_ENABLED(CONFIG_ZONE_DMA))
- min_mask = min_t(u64, min_mask, DMA_BIT_MASK(zone_dma_bits));
+ min_mask = min_t(u64, min_mask, zone_dma_limit);
return mask >= phys_to_dma_unencrypted(dev, min_mask);
}

diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index b481c48a31a6..af02948adfff 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -70,7 +70,7 @@ static bool cma_in_zone(gfp_t gfp)
/* CMA can't cross zone boundaries, see cma_activate_area() */
end = cma_get_base(cma) + size - 1;
if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
- return end <= DMA_BIT_MASK(zone_dma_bits);
+ return end <= zone_dma_limit;
if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
return end <= DMA_BIT_MASK(32);
return true;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 33d942615be5..be76816b3ff9 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -446,7 +446,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
if (!remap)
io_tlb_default_mem.can_grow = true;
if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA))
- io_tlb_default_mem.phys_limit = DMA_BIT_MASK(zone_dma_bits);
+ io_tlb_default_mem.phys_limit = zone_dma_limit;
else if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32))
io_tlb_default_mem.phys_limit = DMA_BIT_MASK(32);
else
@@ -625,7 +625,7 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, size_t bytes,
}

gfp &= ~GFP_ZONEMASK;
- if (phys_limit <= DMA_BIT_MASK(zone_dma_bits))
+ if (phys_limit <= zone_dma_limit)
gfp |= __GFP_DMA;
else if (phys_limit <= DMA_BIT_MASK(32))
gfp |= __GFP_DMA32;

--
Catalin

2024-01-18 11:13:28

by Baruch Siach

[permalink] [raw]

Subject: Re: [PATCH RFC 1/4] of: get dma area lower limit

Hi Christoph

On Wed, Jan 17 2024, Christoph Lameter (Ampere) wrote:
> On Wed, 27 Dec 2023, Baruch Siach wrote:
>> of_dma_get_max_cpu_address() returns the highest CPU address that
>> devices can use for DMA. The implicit assumption is that all CPU
>> addresses below that limit are suitable for DMA. However the
>> 'dma-ranges' property this code uses also encodes a lower limit for DMA
>> that is potentially non zero.
>
> All of memory can be used for DMA by default (==ZONE_NORMAL). ZONE_DMA defines
> a special range for devices that are unable to perform DMA to all of
> memory. Usually due to the lack of address bit support.
>
> So I guess that the platform in question here has as a general limit as to
> what address spaces I/O devices can do DMA to?

DMA to/from devices in bus with 'dma-ranges' property is limited to
address space described in 'dma-ranges'. The arm64 platform currently
uses 'dma-ranges' as a hint to set ZONE_DMA limits globally. This series
is meant to make ZONE_DMA limits adjustment code work better for
platforms where the lower DMA limit is above 4GB. This commit adds the
ability to extract the lower limit from 'dma-ranges'.

baruch

--
~. .~ Tk Open Systems
=}------------------------------------------------ooO--U--Ooo------------{=
- [email protected] - tel: +972.52.368.4656, http://www.tkos.co.il -