On a memcache setup with heavy anon usage and no swap, we routinely
see premature OOM kills with multiple gigabytes of free space left:

Node 0 Normal free:4978632kB [...] free_cma:4893276kB

This free space turns out to be CMA. We set CMA regions aside for
potential hugetlb users on all of our machines, figuring that even if
there aren't any, the memory is available to userspace allocations.

When the OOMs trigger, it's from unmovable and reclaimable allocations
that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
dominated by the anon pages.

Movable pages can be migrated out of CMA when necessary, but we don't
have a mechanism to migrate them *into* CMA to make room for unmovable
allocations. The only recourse we have for these pages is reclaim,
which due to a lack of swap is unavailable in our case.

Because we have more options for CMA pages, change the policy to
always fill up CMA first. This reduces the risk of premature OOMs.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 44 ++++++++++++++++++++------------------------
1 file changed, 20 insertions(+), 24 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..24b9102cd4f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1635,13 +1635,13 @@ static int fallbacks[MIGRATE_TYPES][MIGRATE_PCPTYPES - 1] = {
};
#ifdef CONFIG_CMA
-static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
+static __always_inline struct page *__rmqueue_cma(struct zone *zone,
unsigned int order)
{
return __rmqueue_smallest(zone, order, MIGRATE_CMA);
}
#else
-static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
+static inline struct page *__rmqueue_cma(struct zone *zone,
unsigned int order) { return NULL; }
#endif
@@ -2124,29 +2124,25 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
{
struct page *page;
- if (IS_ENABLED(CONFIG_CMA)) {
- /*
- * Balance movable allocations between regular and CMA areas by
- * allocating from CMA when over half of the zone's free memory
- * is in the CMA area.
- */
- if (alloc_flags & ALLOC_CMA &&
- zone_page_state(zone, NR_FREE_CMA_PAGES) >
- zone_page_state(zone, NR_FREE_PAGES) / 2) {
- page = __rmqueue_cma_fallback(zone, order);
- if (page)
- return page;
- }
+ /*
+ * Use up CMA first. Movable pages can be migrated out of CMA
+ * if necessary, but they cannot migrate into it to make room
+ * for unmovables elsewhere. The only recourse for them is
+ * then reclaim, which might be unavailable without swap. We
+ * want to reduce the risk of OOM with free CMA space left.
+ */
+ if (IS_ENABLED(CONFIG_CMA) && (alloc_flags & ALLOC_CMA)) {
+ page = __rmqueue_cma(zone, order);
+ if (page)
+ return page;
}
-retry:
- page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (alloc_flags & ALLOC_CMA)
- page = __rmqueue_cma_fallback(zone, order);
-
- if (!page && __rmqueue_fallback(zone, order, migratetype,
- alloc_flags))
- goto retry;
+
+ for (;;) {
+ page = __rmqueue_smallest(zone, order, migratetype);
+ if (page)
+ break;
+ if (!__rmqueue_fallback(zone, order, migratetype, alloc_flags))
+ break;
}
return page;
}
--
2.41.0
On a memcache setup with heavy anon usage and no swap, we routinely
see premature OOM kills with multiple gigabytes of free space left:

Node 0 Normal free:4978632kB [...] free_cma:4893276kB

This free space turns out to be CMA. We set CMA regions aside for
potential hugetlb users on all of our machines, figuring that even if
there aren't any, the memory is available to userspace allocations.

When the OOMs trigger, it's from unmovable and reclaimable allocations
that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
dominated by the anon pages.

Movable pages can be migrated out of CMA when necessary, but we don't
have a mechanism to migrate them *into* CMA to make room for unmovable
allocations. The only recourse we have for these pages is reclaim,
which due to a lack of swap is unavailable in our case.

Because we have more options for CMA pages, change the policy to
always fill up CMA first. This reduces the risk of premature OOMs.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 53 +++++++++++++++++++------------------------------
1 file changed, 20 insertions(+), 33 deletions(-)
I realized shortly after sending the first version that the code can
be further simplified by removing __rmqueue_cma_fallback() altogether.

Build, boot and runtime tested that CMA is indeed used up first.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7d3460c7a480..b257f9651ce9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1634,17 +1634,6 @@ static int fallbacks[MIGRATE_TYPES][MIGRATE_PCPTYPES - 1] = {
[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE },
};
-#ifdef CONFIG_CMA
-static __always_inline struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order)
-{
- return __rmqueue_smallest(zone, order, MIGRATE_CMA);
-}
-#else
-static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
- unsigned int order) { return NULL; }
-#endif
-
/*
* Move the free pages in a range to the freelist tail of the requested type.
* Note that start_page and end_pages are not aligned on a pageblock
@@ -2124,29 +2113,27 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype,
{
struct page *page;
- if (IS_ENABLED(CONFIG_CMA)) {
- /*
- * Balance movable allocations between regular and CMA areas by
- * allocating from CMA when over half of the zone's free memory
- * is in the CMA area.
- */
- if (alloc_flags & ALLOC_CMA &&
- zone_page_state(zone, NR_FREE_CMA_PAGES) >
- zone_page_state(zone, NR_FREE_PAGES) / 2) {
- page = __rmqueue_cma_fallback(zone, order);
- if (page)
- return page;
- }
+#ifdef CONFIG_CMA
+ /*
+ * Use up CMA first. Movable pages can be migrated out of CMA
+ * if necessary, but they cannot migrate into it to make room
+ * for unmovables elsewhere. The only recourse for them is
+ * then reclaim, which might be unavailable without swap. We
+ * want to reduce the risk of OOM with free CMA space left.
+ */
+ if (alloc_flags & ALLOC_CMA) {
+ page = __rmqueue_smallest(zone, order, MIGRATE_CMA);
+ if (page)
+ return page;
}
-retry:
- page = __rmqueue_smallest(zone, order, migratetype);
- if (unlikely(!page)) {
- if (alloc_flags & ALLOC_CMA)
- page = __rmqueue_cma_fallback(zone, order);
-
- if (!page && __rmqueue_fallback(zone, order, migratetype,
- alloc_flags))
- goto retry;
+#endif
+
+ for (;;) {
+ page = __rmqueue_smallest(zone, order, migratetype);
+ if (page)
+ break;
+ if (!__rmqueue_fallback(zone, order, migratetype, alloc_flags))
+ break;
}
return page;
}
--
2.41.0
On Wed, 26 Jul 2023 11:07:05 -0400 Johannes Weiner <[email protected]> wrote:
> On a memcache setup with heavy anon usage and no swap, we routinely
> see premature OOM kills with multiple gigabytes of free space left:
>
> Node 0 Normal free:4978632kB [...] free_cma:4893276kB
>
> This free space turns out to be CMA. We set CMA regions aside for
> potential hugetlb users on all of our machines, figuring that even if
> there aren't any, the memory is available to userspace allocations.
>
> When the OOMs trigger, it's from unmovable and reclaimable allocations
> that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
> dominated by the anon pages.
>
> Movable pages can be migrated out of CMA when necessary, but we don't
> have a mechanism to migrate them *into* CMA to make room for unmovable
> allocations. The only recourse we have for these pages is reclaim,
> which due to a lack of swap is unavailable in our case.
>
> Because we have more options for CMA pages, change the policy to
> always fill up CMA first. This reduces the risk of premature OOMs.
This conflicts significantly (and more than textually) with "mm:
optimization on page allocation when CMA enabled", which has been
languishing in mm-unstable in an inadequately reviewed state since May
11. Please suggest a way forward?
On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
> On a memcache setup with heavy anon usage and no swap, we routinely
> see premature OOM kills with multiple gigabytes of free space left:
>
> Node 0 Normal free:4978632kB [...] free_cma:4893276kB
>
> This free space turns out to be CMA. We set CMA regions aside for
> potential hugetlb users on all of our machines, figuring that even if
> there aren't any, the memory is available to userspace allocations.
>
> When the OOMs trigger, it's from unmovable and reclaimable allocations
> that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
> dominated by the anon pages.
>
>
> Because we have more options for CMA pages, change the policy to
> always fill up CMA first. This reduces the risk of premature OOMs.
I suspect it might cause regressions on small(er) devices where
a relatively small cma area (MBs) is often reserved for use by various
device drivers, which can't handle allocation failures well (even interim
allocation failures). Startup time can regress too: migrating pages out of
cma will take time.

And given the velocity of kernel upgrades on such devices, we won't learn
about it for the next couple of years.
> Movable pages can be migrated out of CMA when necessary, but we don't
> have a mechanism to migrate them *into* CMA to make room for unmovable
> allocations. The only recourse we have for these pages is reclaim,
> which due to a lack of swap is unavailable in our case.
Idk, should we introduce such a mechanism? Or use some alternative heuristic
that strikes a better compromise between those who need cma allocations to
always pass and those who use large cma areas for opportunistic huge page
allocations. Of course, we could add a boot flag/sysctl/per-cma-area flag,
but I doubt we really want that.
Thanks!
On 7/27/23 01:38, Roman Gushchin wrote:
> On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
>> On a memcache setup with heavy anon usage and no swap, we routinely
>> see premature OOM kills with multiple gigabytes of free space left:
>>
>> Node 0 Normal free:4978632kB [...] free_cma:4893276kB
>>
>> This free space turns out to be CMA. We set CMA regions aside for
>> potential hugetlb users on all of our machines, figuring that even if
>> there aren't any, the memory is available to userspace allocations.
>>
>> When the OOMs trigger, it's from unmovable and reclaimable allocations
>> that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
>> dominated by the anon pages.
>>
>>
>> Because we have more options for CMA pages, change the policy to
>> always fill up CMA first. This reduces the risk of premature OOMs.
>
> I suspect it might cause regressions on small(er) devices where
> a relatively small cma area (MBs) is often reserved for use by various
> device drivers, which can't handle allocation failures well (even interim
> allocation failures). Startup time can regress too: migrating pages out of
> cma will take time.
Agreed, we should be more careful here.
> And given the velocity of kernel upgrades on such devices, we won't learn
> about it for the next couple of years.
>
>> Movable pages can be migrated out of CMA when necessary, but we don't
>> have a mechanism to migrate them *into* CMA to make room for unmovable
>> allocations. The only recourse we have for these pages is reclaim,
>> which due to a lack of swap is unavailable in our case.
>
> Idk, should we introduce such a mechanism? Or use some alternative heuristic
> that strikes a better compromise between those who need cma allocations to
> always pass and those who use large cma areas for opportunistic huge page
> allocations. Of course, we could add a boot flag/sysctl/per-cma-area flag,
> but I doubt we really want that.
At some point the solution was supposed to be ZONE_MOVABLE:

https://lore.kernel.org/linux-mm/[email protected]/

But it was reverted due to (IIRC) some bugs, and Joonsoo going MIA.
> Thanks!
On Wed, Jul 26, 2023 at 04:38:11PM -0700, Roman Gushchin wrote:
> On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
> > On a memcache setup with heavy anon usage and no swap, we routinely
> > see premature OOM kills with multiple gigabytes of free space left:
> >
> > Node 0 Normal free:4978632kB [...] free_cma:4893276kB
> >
> > This free space turns out to be CMA. We set CMA regions aside for
> > potential hugetlb users on all of our machines, figuring that even if
> > there aren't any, the memory is available to userspace allocations.
> >
> > When the OOMs trigger, it's from unmovable and reclaimable allocations
> > that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
> > dominated by the anon pages.
> >
> >
> > Because we have more options for CMA pages, change the policy to
> > always fill up CMA first. This reduces the risk of premature OOMs.
>
> I suspect it might cause regressions on small(er) devices where
> a relatively small cma area (MBs) is often reserved for use by various
> device drivers, which can't handle allocation failures well (even interim
> allocation failures). Startup time can regress too: migrating pages out of
> cma will take time.
The page allocator is currently happy to give away all CMA memory to
movables before entering reclaim. It will use CMA even before falling
back to a different migratetype.
Do these small setups take special precautions to never fill memory?
Proactively trim file cache? Never swap? Because AFAICS, unless they
do so, this would only change the timing of when CMA fills up, not if.
> And given the velocity of kernel upgrades on such devices, we won't learn
> about it for the next couple of years.
That's true. However, a potential regression with this would show up
fairly early in kernel validation since CMA would fill up in a more
predictable timeline. And the change is easy to revert, too.
Given that we have a concrete problem with the current behavior, I
think it's fair to require a higher bar for proof that this will
indeed cause a regression elsewhere before raising the bar on the fix.
> > Movable pages can be migrated out of CMA when necessary, but we don't
> > have a mechanism to migrate them *into* CMA to make room for unmovable
> > allocations. The only recourse we have for these pages is reclaim,
> > which due to a lack of swap is unavailable in our case.
>
> Idk, should we introduce such a mechanism? Or use some alternative heuristic
> that strikes a better compromise between those who need cma allocations to
> always pass and those who use large cma areas for opportunistic huge page
> allocations. Of course, we could add a boot flag/sysctl/per-cma-area flag,
> but I doubt we really want that.
Right, having migration into CMA could be a viable option as well.
But I would like to learn more from CMA users and their expectations,
since there isn't currently a guarantee that CMA stays empty.
This patch would definitely be the simpler solution. It would also
shave some branches and cycles off the buddy hotpath for many users
that don't actively use CMA but have CONFIG_CMA=y (I checked archlinux
and Fedora, not sure about Suse).
On Thu, Jul 27, 2023 at 11:34:13AM -0400, Johannes Weiner wrote:
> On Wed, Jul 26, 2023 at 04:38:11PM -0700, Roman Gushchin wrote:
> > On Wed, Jul 26, 2023 at 10:53:04AM -0400, Johannes Weiner wrote:
> > > On a memcache setup with heavy anon usage and no swap, we routinely
> > > see premature OOM kills with multiple gigabytes of free space left:
> > >
> > > Node 0 Normal free:4978632kB [...] free_cma:4893276kB
> > >
> > > This free space turns out to be CMA. We set CMA regions aside for
> > > potential hugetlb users on all of our machines, figuring that even if
> > > there aren't any, the memory is available to userspace allocations.
> > >
> > > When the OOMs trigger, it's from unmovable and reclaimable allocations
> > > that aren't allowed to dip into CMA. The non-CMA regions meanwhile are
> > > dominated by the anon pages.
> > >
> > >
> > > Because we have more options for CMA pages, change the policy to
> > > always fill up CMA first. This reduces the risk of premature OOMs.
> >
> > I suspect it might cause regressions on small(er) devices where
> > a relatively small cma area (MBs) is often reserved for use by various
> > device drivers, which can't handle allocation failures well (even interim
> > allocation failures). Startup time can regress too: migrating pages out of
> > cma will take time.
>
> The page allocator is currently happy to give away all CMA memory to
> movables before entering reclaim. It will use CMA even before falling
> back to a different migratetype.
>
> Do these small setups take special precautions to never fill memory?
> Proactively trim file cache? Never swap? Because AFAICS, unless they
> do so, this would only change the timing of when CMA fills up, not if.
Imagine something like a web camera or a router. It boots up, brings up some
custom drivers/hardware, starts some daemons and runs forever. It might never
reach memory capacity, or it might take hours or days. The point is that
during initialization cma is fully available.
>
> > And given the velocity of kernel upgrades on such devices, we won't learn
> > about it for the next couple of years.
>
> That's true. However, a potential regression with this would show up
> fairly early in kernel validation since CMA would fill up in a more
> predictable timeline. And the change is easy to revert, too.
>
> Given that we have a concrete problem with the current behavior, I
> think it's fair to require a higher bar for proof that this will
> indeed cause a regression elsewhere before raising the bar on the fix.
I'm not opposing the change, just raising a concern. I expect that
we'll need a more complicated solution at some point anyway.
>
> > > Movable pages can be migrated out of CMA when necessary, but we don't
> > > have a mechanism to migrate them *into* CMA to make room for unmovable
> > > allocations. The only recourse we have for these pages is reclaim,
> > > which due to a lack of swap is unavailable in our case.
> >
> > Idk, should we introduce such a mechanism? Or use some alternative heuristic
> > that strikes a better compromise between those who need cma allocations to
> > always pass and those who use large cma areas for opportunistic huge page
> > allocations. Of course, we could add a boot flag/sysctl/per-cma-area flag,
> > but I doubt we really want that.
>
> Right, having migration into CMA could be a viable option as well.
>
> But I would like to learn more from CMA users and their expectations,
> since there isn't currently a guarantee that CMA stays empty.
This change makes cma allocations less deterministic. Where previously a cma
allocation almost always succeeded, with this change we'll see more interim
failures. (It's all about the period shortly after boot, when the majority of
memory is still free.)
>
> This patch would definitely be the simpler solution. It would also
> shave some branches and cycles off the buddy hotpath for many users
> that don't actively use CMA but have CONFIG_CMA=y (I checked archlinux
> and Fedora, not sure about Suse).
Yes, this is good.