LinuxLists.cc - [PATCH v3 0/2] mm: fix cma allocation fail sometimes

2022-03-16 09:38:47

Subject: [PATCH v3 0/2] mm: fix cma allocation fail sometimes

We observed an issue with NXP 5.15 LTS kernel that dma_alloc_coherent()
may fail sometimes when there're multiple processes trying to allocate
CMA memory.

This issue can be very easily reproduced on MX6Q SDB board with latest
linux-next kernel by writing a test module creating 16 or 32 threads
allocating random size of CMA memory in parallel at the background.
Or simply enabling CONFIG_CMA_DEBUG, you can see endless of CMA alloc
retries during booting:
[ 1.452124] cma: cma_alloc(): memory range at (ptrval) is busy,retrying
....
(thousands of reties)

The root cause of this issue is that since commit a4efc174b382
("mm/cma.c: remove redundant cma_mutex lock"), CMA supports concurrent
memory allocation.
It's possible that the memory range process A try to alloc has already
been isolated by the allocation of process B during memory migration.

The problem here is that the memory range isolated during one allocation
by start_isolate_page_range() could be much bigger than the real size we
want to alloc due to the range is aligned to MAX_ORDER_NR_PAGES.

Taking an ARMv7 platform with 1G memory as an example, when MAX_ORDER_NR_PAGES
is big (e.g. 32M with max_order 14) and CMA memory is relatively small
(e.g. 128M), there're only 4 MAX_ORDER slot, then it's very easy that
all CMA memory may have already been isolated by other processes when
one trying to allocate memory using dma_alloc_coherent().
Since current CMA code will only scan one time of whole available CMA
memory, then dma_alloc_coherent() may easy fail due to contention with
other processes.

This patchset introduces a retry mechanism to rescan CMA bitmap for -EBUSY
error in case the target pageblock may has been temporarily isolated
by others and released later.

It also improves the CMA allocation performance by trying the next
MAX_ORDER_NR_PAGES range during reties rather than looping within the
same isolated range in small steps which wasting CPU mips.

The following test is based on linux-next: next-20211213.

Without the fix, it's easily fail.
# insmod cma_alloc.ko pnum=16
[ 274.322369] CMA alloc test enter: thread number: 16
[ 274.329948] cpu: 0, pid: 692, index 4 pages 144
[ 274.330143] cpu: 1, pid: 694, index 2 pages 44
[ 274.330359] cpu: 2, pid: 695, index 7 pages 757
[ 274.330760] cpu: 2, pid: 696, index 4 pages 144
[ 274.330974] cpu: 2, pid: 697, index 6 pages 512
[ 274.331223] cpu: 2, pid: 698, index 6 pages 512
[ 274.331499] cpu: 2, pid: 699, index 2 pages 44
[ 274.332228] cpu: 2, pid: 700, index 0 pages 7
[ 274.337421] cpu: 0, pid: 701, index 1 pages 38
[ 274.337618] cpu: 2, pid: 702, index 0 pages 7
[ 274.344669] cpu: 1, pid: 703, index 0 pages 7
[ 274.344807] cpu: 3, pid: 704, index 6 pages 512
[ 274.348269] cpu: 2, pid: 705, index 5 pages 148
[ 274.349490] cma: cma_alloc: reserved: alloc failed, req-size: 38 pages, ret: -16
[ 274.366292] cpu: 1, pid: 706, index 4 pages 144
[ 274.366562] cpu: 0, pid: 707, index 3 pages 128
[ 274.367356] cma: cma_alloc: reserved: alloc failed, req-size: 128 pages, ret: -16
[ 274.367370] cpu: 0, pid: 707, index 3 pages 128 failed
[ 274.371148] cma: cma_alloc: reserved: alloc failed, req-size: 148 pages, ret: -16
[ 274.375348] cma: cma_alloc: reserved: alloc failed, req-size: 144 pages, ret: -16
[ 274.384256] cpu: 2, pid: 708, index 0 pages 7
....

With the fix, 32 threads allocating in parallel can pass overnight
stress test.

root@imx6qpdlsolox:~# insmod cma_alloc.ko pnum=32
[ 112.976809] cma_alloc: loading out-of-tree module taints kernel.
[ 112.984128] CMA alloc test enter: thread number: 32
[ 112.989748] cpu: 2, pid: 707, index 6 pages 512
[ 112.994342] cpu: 1, pid: 708, index 6 pages 512
[ 112.995162] cpu: 0, pid: 709, index 3 pages 128
[ 112.995867] cpu: 2, pid: 710, index 0 pages 7
[ 112.995910] cpu: 3, pid: 711, index 2 pages 44
[ 112.996005] cpu: 3, pid: 712, index 7 pages 757
[ 112.996098] cpu: 3, pid: 713, index 7 pages 757
...
[41877.368163] cpu: 1, pid: 737, index 2 pages 44
[41877.369388] cpu: 1, pid: 736, index 3 pages 128
[41878.486516] cpu: 0, pid: 737, index 2 pages 44
[41878.486515] cpu: 2, pid: 739, index 4 pages 144
[41878.486622] cpu: 1, pid: 736, index 3 pages 128
[41878.486948] cpu: 2, pid: 735, index 7 pages 757
[41878.487279] cpu: 2, pid: 738, index 4 pages 144
[41879.526603] cpu: 1, pid: 739, index 3 pages 128
[41879.606491] cpu: 2, pid: 737, index 3 pages 128
[41879.606550] cpu: 0, pid: 736, index 0 pages 7
[41879.612271] cpu: 2, pid: 738, index 4 pages 144
...
v1:
https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/

v2:
https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/

Dong Aisheng (2):
mm: cma: fix allocation may fail sometimes
mm: cma: try next MAX_ORDER_NR_PAGES during retry

mm/cma.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)

--
2.25.1

2022-03-17 05:28:16

by Aisheng Dong

[permalink] [raw]

Subject: [PATCH v3 2/2] mm: cma: try next MAX_ORDER_NR_PAGES during retry

On an ARMv7 platform with 1G memory, when MAX_ORDER_NR_PAGES
is big (e.g. 32M with max_order 14) and CMA memory is relatively small
(e.g. 128M), we observed a huge number of repeat retries of CMA
allocation (1k+) during booting when allocating one page for each
of 3 mmc instance probe.

This is caused by CMA now supports concurrent allocation since commit
a4efc174b382 ("mm/cma.c: remove redundant cma_mutex lock").
The memory range, MAX_ORDER_NR_PAGES aligned, from which we are trying
to allocate memory may have already been acquired and isolated by others
(see: alloc_contig_range()).

Current cma_alloc() will retry the next area by a small step of
bitmap_no + mask + 1 which are very likely within the same isolated range
and fail again. So when the MAX_ORDER_NR_PAGES is big (e.g. 32M),
keep retrying in a small step become meaningless because it will be known
to fail at a huge number of times due to that memory range has been isolated
by others, especially when allocating only one or two pages.

Instead of looping in the same isolated range and wasting CPU mips a lot,
especially for big MAX_ORDER systems (e.g. 16M or 32M),
we try the next MAX_ORDER_NR_PAGES directly.

Doing this way can greatly mitigate the situtation.

Below is the original error log during booting:
[ 2.004804] cma: cma_alloc(cma (ptrval), count 1, align 0)
[ 2.010318] cma: cma_alloc(cma (ptrval), count 1, align 0)
[ 2.010776] cma: cma_alloc(): memory range at (ptrval) is busy, retrying
[ 2.010785] cma: cma_alloc(): memory range at (ptrval) is busy, retrying
[ 2.010793] cma: cma_alloc(): memory range at (ptrval) is busy, retrying
[ 2.010800] cma: cma_alloc(): memory range at (ptrval) is busy, retrying
[ 2.010807] cma: cma_alloc(): memory range at (ptrval) is busy, retrying
[ 2.010814] cma: cma_alloc(): memory range at (ptrval) is busy, retrying
.... (+1K retries)

After fix, the 1200+ reties can be reduced to 0.
Another test running 8 threads running dma_alloc_coherent() in parallel
shows that 1500+ retries dropped to ~145.

IOW this patch can improve the CMA allocation speed a lot when there're
enough CMA memory by reducing retries significantly.

Cc: Andrew Morton <[email protected]>
Cc: Marek Szyprowski <[email protected]>
Cc: Lecopzer Chen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Vlastimil Babka <[email protected]>
CC: [email protected] # 5.11+
Fixes: a4efc174b382 ("mm/cma.c: remove redundant cma_mutex lock")
Signed-off-by: Dong Aisheng <[email protected]>
---
v2->v3:
* Improve commit messeages
v1->v2:
* change to align with MAX_ORDER_NR_PAGES instead of pageblock_nr_pages
---
mm/cma.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 46a9fd9f92c4..46bc12fe28b3 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -496,8 +496,16 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,

trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn),
count, align);
- /* try again with a bit different memory target */
- start = bitmap_no + mask + 1;
+ /*
+ * Try again with a bit different memory target.
+ * Since memory isolated in alloc_contig_range() is aligned
+ * with MAX_ORDER_NR_PAGES, instead of retrying in a small
+ * step within the same isolated range, we try the next
+ * available memory range directly.
+ */
+ start = ALIGN(bitmap_no + mask + 1,
+ MAX_ORDER_NR_PAGES >> cma->order_per_bit);
+
}

trace_cma_alloc_finish(cma->name, pfn, page, count, align);
--
2.25.1