Before 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
the alsa non-contiguous allocator always called the alsa fallback
allocator in the non-iommu case. This allocated non-contig memory
consisting of progressively smaller contiguous chunks. Allocation was
fast due to the OR-ing in of __GFP_NORETRY.
After 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
the code tries the dma non-contig allocator first, then falls back to
the alsa fallback allocator. In the non-iommu case, the former supports
only a single contiguous chunk.
We have observed experimentally that under heavy memory fragmentation,
allocating a large-ish contiguous chunk with __GFP_RETRY_MAYFAIL
triggers an indefinite hang in the dma non-contig allocator. This has
high impact, as an occurrence will trigger a device reboot, resulting in
loss of user state.
Fix the non-iommu path by letting dma_alloc_noncontiguous() fail quickly
so it does not get stuck looking for that elusive large contiguous chunk,
in which case we will fall back to the alsa fallback allocator.
Note that the iommu dma non-contiguous allocator is not affected. While
assembling an array of pages, it tries progressively smaller contiguous
allocations, and lets higher-order chunk allocations fail quickly.
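For illustration, the flag choice this patch makes can be sketched in plain
userspace C. This is a mock only: the MY_GFP_* constants and the has_iommu
flag are stand-ins for the kernel's gfp bits and the get_dma_ops() check,
not real definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Hypothetical stand-ins for the kernel's gfp bits */
#define MY_GFP_KERNEL        0x1u
#define MY_GFP_NORETRY       0x2u  /* fail fast, no reclaim retries */
#define MY_GFP_RETRY_MAYFAIL 0x4u  /* try harder, but never OOM-kill */

/* Mirrors the patched logic: without an IOMMU (get_dma_ops() == NULL),
 * dma_alloc_noncontiguous() degenerates to one large contiguous
 * allocation, so OR in NORETRY to let it fail quickly and reach the
 * ALSA fallback allocator instead of hanging. */
static gfp_t noncontig_gfp(bool has_iommu)
{
    gfp_t gfp = MY_GFP_KERNEL | MY_GFP_RETRY_MAYFAIL;

    if (!has_iommu)
        gfp |= MY_GFP_NORETRY;
    return gfp;
}
```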
Suggested-by: Sven van Ashbrook <[email protected]>
Suggested-by: Brian Geffon <[email protected]>
Fixes: 9d8e536d36e7 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
Cc: [email protected]
Cc: Sven van Ashbrook <[email protected]>
Cc: Brian Geffon <[email protected]>
Cc: Curtis Malainey <[email protected]>
Signed-off-by: Karthikeyan Ramasubramanian <[email protected]>
---
sound/core/memalloc.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
index f901504b5afc1..5f6526a0d731c 100644
--- a/sound/core/memalloc.c
+++ b/sound/core/memalloc.c
@@ -540,13 +540,18 @@ static void *snd_dma_noncontig_alloc(struct snd_dma_buffer *dmab, size_t size)
{
struct sg_table *sgt;
void *p;
+ gfp_t gfp_flags = DEFAULT_GFP;
#ifdef CONFIG_SND_DMA_SGBUF
if (cpu_feature_enabled(X86_FEATURE_XENPV))
return snd_dma_sg_fallback_alloc(dmab, size);
+
+ /* Non-IOMMU case: prevent allocator from searching forever */
+ if (!get_dma_ops(dmab->dev.dev))
+ gfp_flags |= __GFP_NORETRY;
#endif
sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
- DEFAULT_GFP, 0);
+ gfp_flags, 0);
#ifdef CONFIG_SND_DMA_SGBUF
if (!sgt && !get_dma_ops(dmab->dev.dev))
return snd_dma_sg_fallback_alloc(dmab, size);
--
2.43.0.687.g38aa6559b0-goog
On Wed, 14 Feb 2024 17:07:25 -0700 Karthikeyan Ramasubramanian <[email protected]>
> Before 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> the alsa non-contiguous allocator always called the alsa fallback
> allocator in the non-iommu case. This allocated non-contig memory
> consisting of progressively smaller contiguous chunks. Allocation was
> fast due to the OR-ing in of __GFP_NORETRY.
>
> After 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> the code tries the dma non-contig allocator first, then falls back to
> the alsa fallback allocator. In the non-iommu case, the former supports
> only a single contiguous chunk.
>
> We have observed experimentally that under heavy memory fragmentation,
> allocating a large-ish contiguous chunk with __GFP_RETRY_MAYFAIL
> triggers an indefinite hang in the dma non-contig allocator. This has
> high-impact, as an occurrence will trigger a device reboot, resulting in
> loss of user state.
>
> Fix the non-iommu path by letting dma_alloc_noncontiguous() fail quickly
> so it does not get stuck looking for that elusive large contiguous chunk,
> in which case we will fall back to the alsa fallback allocator.
The faster dma_alloc_noncontiguous() fails, the more likely the paper-over
in 9d8e536d36e7 fails to work, so this is another case of a band-aid instead
of mitigating the heavy fragmentation in the first place.
On Thu, 15 Feb 2024 04:45:27 +0100,
Hillf Danton wrote:
>
> On Wed, 14 Feb 2024 17:07:25 -0700 Karthikeyan Ramasubramanian <[email protected]>
> > Before 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> > the alsa non-contiguous allocator always called the alsa fallback
> > allocator in the non-iommu case. This allocated non-contig memory
> > consisting of progressively smaller contiguous chunks. Allocation was
> > fast due to the OR-ing in of __GFP_NORETRY.
> >
> > After 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> > the code tries the dma non-contig allocator first, then falls back to
> > the alsa fallback allocator. In the non-iommu case, the former supports
> > only a single contiguous chunk.
> >
> > We have observed experimentally that under heavy memory fragmentation,
> > allocating a large-ish contiguous chunk with __GFP_RETRY_MAYFAIL
> > triggers an indefinite hang in the dma non-contig allocator. This has
> > high-impact, as an occurrence will trigger a device reboot, resulting in
> > loss of user state.
> >
> > Fix the non-iommu path by letting dma_alloc_noncontiguous() fail quickly
> > so it does not get stuck looking for that elusive large contiguous chunk,
> > in which case we will fall back to the alsa fallback allocator.
>
> The faster dma_alloc_noncontiguous() fails the more likely the paperover
> in 9d8e536d36e7 fails to work, so this is another case of bandaid instead
> of mitigating heavy fragmentation at the first place.
Yes, the main problem is the indefinite hang from
dma_alloc_noncontiguous().
So, is the behavior more or less the same even if you pass
__GFP_RETRY_MAYFAIL to dma_alloc_noncontiguous()? Or is this flag
already implicitly set somewhere in the middle? It shouldn't hang
indefinitely, but other impacts on the system, like OOM-killer
kickoff, may be seen.
As of now, I'm inclined to take the suggested workaround. It'll work
in most cases. The original issue worked around by commit
9d8e536d36e7 still remains, and we need to address it differently.
thanks,
Takashi
Hi Takashi,
On Thu, Feb 15, 2024 at 3:40 AM Takashi Iwai <[email protected]> wrote:
>
> Yes, the main problem is the indefinite hang from
> dma_alloc_noncontiguous().
We have a publicly-visible test [1] which readily triggers the
indefinite hang on non-iommu Intel SoCs such as JasperLake.
As noted in the commit message, iommu SoCs are not currently
affected.
> So, is the behavior more or less same even if you pass
> __GFP_RETRY_MAYFAIL to dma_alloc_noncontiguous()? Or is this flag
> already implicitly set somewhere in the middle? It shouldn't hang
> indefinitely, but the other impact to the system like OOM-killer
> kickoff may be seen.
My incomplete understanding:
ALSA specifies __GFP_RETRY_MAYFAIL because it wants to prevent triggering
the OOM killer. This was __GFP_NORETRY in the not-too-distant past [2],
but that failed too quickly, which resulted in permanent loss of audio due
to failed firmware dma sg allocations.
In the iommu case, dma_alloc_noncontiguous() implements a backoff [3] loop
which ORs in __GFP_NORETRY except for minimum order allocations. We observe
experimentally that __GFP_RETRY_MAYFAIL does not get "stuck" on minimum order
allocations. So the iommu case is not affected.
In the non-iommu case however, dma_alloc_noncontiguous() actually becomes a
contiguous allocator, with no backoff loop. The commit introducing it [4]
states "This API is only properly implemented for dma-iommu and will simply
return a contigious chunk as a fallback." In this case we observe the indefinite
hang.
The ALSA fallback allocator is also not affected by the problem, as it does
not specify __GFP_RETRY_MAYFAIL, except in the XENPV case.
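The dma-iommu backoff behaviour described above can be modeled with a small
userspace sketch. This is a simulation only: try_alloc() is a hypothetical
stand-in for the buddy allocator, the gfp bit is a mock value, and the loop
shape only approximates the one in dma-iommu.

```c
#include <assert.h>

typedef unsigned int gfp_t;

#define GFP_NORETRY 0x2u  /* mock of __GFP_NORETRY */

/* Simulated buddy allocator under fragmentation: only chunks up to
 * max_free_order succeed (a hypothetical model, for illustration). */
static int max_free_order = 1;

static int try_alloc(int order, gfp_t gfp)
{
    (void)gfp;  /* the real allocator would honor NORETRY here */
    return order <= max_free_order; /* 1 = success */
}

/* Backoff loop in the style of the dma-iommu page allocator: start at
 * the largest order that fits, OR in NORETRY above order 0 so big
 * chunks fail fast, and step down on failure.  Returns the number of
 * chunks used, or -1 if even order-0 fails. */
static int alloc_backoff(int nr_pages, gfp_t gfp)
{
    int chunks = 0;

    while (nr_pages > 0) {
        int order = 0;

        while ((2 << order) <= nr_pages) /* largest order that fits */
            order++;
        for (;; order--) {
            gfp_t g = order ? (gfp | GFP_NORETRY) : gfp;

            if (try_alloc(order, g))
                break;
            if (order == 0)
                return -1;
        }
        nr_pages -= 1 << order;
        chunks++;
    }
    return chunks;
}
```

Under the simulated fragmentation (max_free_order = 1), an 8-page request
succeeds as four order-1 chunks instead of getting stuck on a single
order-3 allocation.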
Sven
[1] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/autotest/files/client/site_tests/power_LowMemorySuspend/control
[2] a61c7d8 ("ALSA: memalloc: use __GFP_RETRY_MAYFAIL for DMA mem allocs")
[3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/dma-iommu.c?h=v6.8-rc4#n903
[4] 7d5b573 ("dma-mapping: add a dma_alloc_noncontiguous API")
On Thu, 15 Feb 2024 17:07:38 +0100,
Sven van Ashbrook wrote:
>
> Hi Takashi,
>
> On Thu, Feb 15, 2024 at 3:40 AM Takashi Iwai <[email protected]> wrote:
> >
> > Yes, the main problem is the indefinite hang from
> > dma_alloc_noncontiguous().
>
> We have a publicly-visible test [1] which readily triggers the
> indefinite hang on non-iommu Intel SoCs such as JasperLake.
> As noted in the commit message, iommu SoCs are not currently
> affected.
>
> > So, is the behavior more or less same even if you pass
> > __GFP_RETRY_MAYFAIL to dma_alloc_noncontiguous()? Or is this flag
> > already implicitly set somewhere in the middle? It shouldn't hang
> > indefinitely, but the other impact to the system like OOM-killer
> > kickoff may be seen.
>
> My incomplete understanding:
>
> Alsa specifies __GFP_RETRY_MAYFAIL because it wants to prevent triggering
> the OOM killer.
Ah I forgot that we set that bit commonly in the flag.
> This was __GFP_NORETRY in the not-too-distant past [2],
> but that failed too quickly, which resulted in permanent loss of audio due
> to failed firmware dma sg allocations.
Hm, the change in commit a61c7d88d38c assumed that __GFP_RETRY_MAYFAIL
shouldn't have that big an impact. If the hang really happens with a
high-order allocation, it's dangerous to use it in allocations other
than the noncontiguous case (i.e. SNDRV_DMA_TYPE_DEV and co).
In a tight memory situation, a similar problem can then be triggered
quite easily.
> In the iommu case, dma_alloc_noncontiguous() implements a backoff [3] loop
> which ORs in __GFP_NORETRY except for minimum order allocations. We observe
> experimentally that __GFP_RETRY_MAYFAIL does not get "stuck" on minimum order
> allocations. So the iommu case is not affected.
>
> In the non-iommu case however, dma_alloc_noncontiguous() actually becomes a
> contiguous allocator, with no backoff loop. The commit introducing it [4]
> states "This API is only properly implemented for dma-iommu and will simply
> return a contigious chunk as a fallback." In this case we observe the indefinite
> hang.
>
> The alsa fallback allocator is also not affected by the problem, as it does
> not specify __GFP_RETRY_MAYFAIL. Except in the XENPV case.
So it sounds like we should go back to __GFP_NORETRY in general
for non-zero-order allocations, not only in the call you changed, as
__GFP_RETRY_MAYFAIL doesn't guarantee avoiding the hang.
How about the changes like below?
And, Kai, could you verify whether this change with the recent kernel
code won't give significant regressions for the issues you tried to
address with commit a61c7d88d38c?
thanks,
Takashi
--- a/sound/core/memalloc.c
+++ b/sound/core/memalloc.c
@@ -19,17 +19,22 @@
#include <sound/memalloc.h>
#include "memalloc_local.h"
-#define DEFAULT_GFP \
- (GFP_KERNEL | \
- __GFP_RETRY_MAYFAIL | /* don't trigger OOM-killer */ \
- __GFP_NOWARN) /* no stack trace print - this call is non-critical */
-
static const struct snd_malloc_ops *snd_dma_get_ops(struct snd_dma_buffer *dmab);
#ifdef CONFIG_SND_DMA_SGBUF
static void *snd_dma_sg_fallback_alloc(struct snd_dma_buffer *dmab, size_t size);
#endif
+/* default GFP bits for our allocations */
+static gfp_t default_gfp(size_t size)
+{
+ /* don't allocate intensively for high-order pages */
+ if (size > PAGE_SIZE)
+ return GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY;
+ else
+ return GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
+}
+
static void *__snd_dma_alloc_pages(struct snd_dma_buffer *dmab, size_t size)
{
const struct snd_malloc_ops *ops = snd_dma_get_ops(dmab);
@@ -281,7 +286,7 @@ static void *do_alloc_pages(struct device *dev, size_t size, dma_addr_t *addr,
bool wc)
{
void *p;
- gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+ gfp_t gfp = default_gfp(size);
again:
p = alloc_pages_exact(size, gfp);
@@ -466,7 +471,7 @@ static const struct snd_malloc_ops snd_dma_iram_ops = {
*/
static void *snd_dma_dev_alloc(struct snd_dma_buffer *dmab, size_t size)
{
- return dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr, DEFAULT_GFP);
+ return dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr, default_gfp(size));
}
static void snd_dma_dev_free(struct snd_dma_buffer *dmab)
@@ -511,7 +516,7 @@ static int snd_dma_wc_mmap(struct snd_dma_buffer *dmab,
#else
static void *snd_dma_wc_alloc(struct snd_dma_buffer *dmab, size_t size)
{
- return dma_alloc_wc(dmab->dev.dev, size, &dmab->addr, DEFAULT_GFP);
+ return dma_alloc_wc(dmab->dev.dev, size, &dmab->addr, default_gfp(size));
}
static void snd_dma_wc_free(struct snd_dma_buffer *dmab)
@@ -546,7 +551,7 @@ static void *snd_dma_noncontig_alloc(struct snd_dma_buffer *dmab, size_t size)
return snd_dma_sg_fallback_alloc(dmab, size);
#endif
sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
- DEFAULT_GFP, 0);
+ default_gfp(size), 0);
#ifdef CONFIG_SND_DMA_SGBUF
if (!sgt && !get_dma_ops(dmab->dev.dev))
return snd_dma_sg_fallback_alloc(dmab, size);
@@ -783,7 +788,7 @@ static void *snd_dma_sg_fallback_alloc(struct snd_dma_buffer *dmab, size_t size)
while (size > 0) {
chunk = min(size, chunk);
if (sgbuf->use_dma_alloc_coherent)
- p = dma_alloc_coherent(dmab->dev.dev, chunk, &addr, DEFAULT_GFP);
+ p = dma_alloc_coherent(dmab->dev.dev, chunk, &addr, default_gfp(size));
else
p = do_alloc_pages(dmab->dev.dev, chunk, &addr, false);
if (!p) {
@@ -871,7 +876,7 @@ static void *snd_dma_noncoherent_alloc(struct snd_dma_buffer *dmab, size_t size)
void *p;
p = dma_alloc_noncoherent(dmab->dev.dev, size, &dmab->addr,
- dmab->dev.dir, DEFAULT_GFP);
+ dmab->dev.dir, default_gfp(size));
if (p)
dmab->dev.need_sync = dma_need_sync(dmab->dev.dev, dmab->addr);
return p;
On Thu, Feb 15, 2024 at 10:03 AM Takashi Iwai <[email protected]> wrote:
>
> On Thu, 15 Feb 2024 17:07:38 +0100,
> Sven van Ashbrook wrote:
> >
> > Hi Takashi,
> >
> > On Thu, Feb 15, 2024 at 3:40 AM Takashi Iwai <[email protected]> wrote:
> > >
> > > Yes, the main problem is the indefinite hang from
> > > dma_alloc_noncontiguous().
> >
> > We have a publicly-visible test [1] which readily triggers the
> > indefinite hang on non-iommu Intel SoCs such as JasperLake.
> > As noted in the commit message, iommu SoCs are not currently
> > affected.
> >
> > > So, is the behavior more or less same even if you pass
> > > __GFP_RETRY_MAYFAIL to dma_alloc_noncontiguous()? Or is this flag
> > > already implicitly set somewhere in the middle? It shouldn't hang
> > > indefinitely, but the other impact to the system like OOM-killer
> > > kickoff may be seen.
> >
> > My incomplete understanding:
> >
> > Alsa specifies __GFP_RETRY_MAYFAIL because it wants to prevent triggering
> > the OOM killer.
>
> Ah I forgot that we set that bit commonly in the flag.
>
> > This was __GFP_NORETRY in the not-too-distant past [2],
> > but that failed too quickly, which resulted in permanent loss of audio due
> > to failed firmware dma sg allocations.
>
> Hm, the change in commit a61c7d88d38c assumed that __GFP_RETRY_MAYFAIL
> shouldn't have that big impact. If the hang really happens with a
> high order allocation, it's dangerous to use it in other code
> allocations than noncontiguous case (i.e. SNDRV_DMA_TYPE_DEV and co).
> In the tight memory situation, a similar problem can be triggered
> quite easily, then.
>
> > In the iommu case, dma_alloc_noncontiguous() implements a backoff [3] loop
> > which ORs in __GFP_NORETRY except for minimum order allocations. We observe
> > experimentally that __GFP_RETRY_MAYFAIL does not get "stuck" on minimum order
> > allocations. So the iommu case is not affected.
> >
> > In the non-iommu case however, dma_alloc_noncontiguous() actually becomes a
> > contiguous allocator, with no backoff loop. The commit introducing it [4]
> > states "This API is only properly implemented for dma-iommu and will simply
> > return a contigious chunk as a fallback." In this case we observe the indefinite
> > hang.
> >
> > The alsa fallback allocator is also not affected by the problem, as it does
> > not specify __GFP_RETRY_MAYFAIL. Except in the XENPV case.
>
> So it sounds like that we should go back for __GFP_NORETRY in general
> for non-zero order allocations, not only the call you changed, as
> __GFP_RETRY_MAYFAIL doesn't guarantee the stuck.
>
> How about the changes like below?
We are concerned that the change proposed below might break the fix
introduced by commit a61c7d88d38c ("FROMGIT: ALSA: memalloc: use
__GFP_RETRY_MAYFAIL for DMA mem allocs"). More specifically, in the
IOMMU case with a large allocation, the fallback algorithm in the DMA
IOMMU allocator[1] will not try really hard and hence may fail
prematurely when it gets to minimum-order allocations. This will
result in no audio when there is significant load on physical memory.
Thanks and Regards,
Karthikeyan.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/iommu/dma-iommu.c?h=v6.8-rc4#n903
>
> And, Kai, could you verify whether this change with the recent kernel
> code won't give significant regressions for the issues you tried to
> address with commit a61c7d88d38c?
>
>
> thanks,
>
> Takashi
>
> --- a/sound/core/memalloc.c
> +++ b/sound/core/memalloc.c
> @@ -19,17 +19,22 @@
> #include <sound/memalloc.h>
> #include "memalloc_local.h"
>
> -#define DEFAULT_GFP \
> - (GFP_KERNEL | \
> - __GFP_RETRY_MAYFAIL | /* don't trigger OOM-killer */ \
> - __GFP_NOWARN) /* no stack trace print - this call is non-critical */
> -
> static const struct snd_malloc_ops *snd_dma_get_ops(struct snd_dma_buffer *dmab);
>
> #ifdef CONFIG_SND_DMA_SGBUF
> static void *snd_dma_sg_fallback_alloc(struct snd_dma_buffer *dmab, size_t size);
> #endif
>
> +/* default GFP bits for our allocations */
> +static gfp_t default_gfp(size_t size)
> +{
> + /* don't allocate intensively for high-order pages */
> + if (size > PAGE_SIZE)
> + return GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY;
> + else
> + return GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
> +}
> +
> static void *__snd_dma_alloc_pages(struct snd_dma_buffer *dmab, size_t size)
> {
> const struct snd_malloc_ops *ops = snd_dma_get_ops(dmab);
> @@ -281,7 +286,7 @@ static void *do_alloc_pages(struct device *dev, size_t size, dma_addr_t *addr,
> bool wc)
> {
> void *p;
> - gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
> + gfp_t gfp = default_gfp(size);
>
> again:
> p = alloc_pages_exact(size, gfp);
> @@ -466,7 +471,7 @@ static const struct snd_malloc_ops snd_dma_iram_ops = {
> */
> static void *snd_dma_dev_alloc(struct snd_dma_buffer *dmab, size_t size)
> {
> - return dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr, DEFAULT_GFP);
> + return dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr, default_gfp(size));
> }
>
> static void snd_dma_dev_free(struct snd_dma_buffer *dmab)
> @@ -511,7 +516,7 @@ static int snd_dma_wc_mmap(struct snd_dma_buffer *dmab,
> #else
> static void *snd_dma_wc_alloc(struct snd_dma_buffer *dmab, size_t size)
> {
> - return dma_alloc_wc(dmab->dev.dev, size, &dmab->addr, DEFAULT_GFP);
> + return dma_alloc_wc(dmab->dev.dev, size, &dmab->addr, default_gfp(size));
> }
>
> static void snd_dma_wc_free(struct snd_dma_buffer *dmab)
> @@ -546,7 +551,7 @@ static void *snd_dma_noncontig_alloc(struct snd_dma_buffer *dmab, size_t size)
> return snd_dma_sg_fallback_alloc(dmab, size);
> #endif
> sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
> - DEFAULT_GFP, 0);
> + default_gfp(size), 0);
> #ifdef CONFIG_SND_DMA_SGBUF
> if (!sgt && !get_dma_ops(dmab->dev.dev))
> return snd_dma_sg_fallback_alloc(dmab, size);
> @@ -783,7 +788,7 @@ static void *snd_dma_sg_fallback_alloc(struct snd_dma_buffer *dmab, size_t size)
> while (size > 0) {
> chunk = min(size, chunk);
> if (sgbuf->use_dma_alloc_coherent)
> - p = dma_alloc_coherent(dmab->dev.dev, chunk, &addr, DEFAULT_GFP);
> + p = dma_alloc_coherent(dmab->dev.dev, chunk, &addr, default_gfp(size));
> else
> p = do_alloc_pages(dmab->dev.dev, chunk, &addr, false);
> if (!p) {
> @@ -871,7 +876,7 @@ static void *snd_dma_noncoherent_alloc(struct snd_dma_buffer *dmab, size_t size)
> void *p;
>
> p = dma_alloc_noncoherent(dmab->dev.dev, size, &dmab->addr,
> - dmab->dev.dir, DEFAULT_GFP);
> + dmab->dev.dir, default_gfp(size));
> if (p)
> dmab->dev.need_sync = dma_need_sync(dmab->dev.dev, dmab->addr);
> return p;
On Thu, 15 Feb 2024 18:03:01 +0100 Takashi Iwai <[email protected]> wrote:
>
> So it sounds like that we should go back for __GFP_NORETRY in general
> for non-zero order allocations, not only the call you changed, as
> __GFP_RETRY_MAYFAIL doesn't guarantee the stuck.
>
> How about the changes like below?
>
> +/* default GFP bits for our allocations */
> +static gfp_t default_gfp(size_t size)
> +{
> + /* don't allocate intensively for high-order pages */
> + if (size > PAGE_SIZE)
> + return GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY;
> + else
> + return GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
> +}
Looks like an overdose because both __GFP_NORETRY and __GFP_RETRY_MAYFAIL
are checked in __alloc_pages_slowpath().
--- x/sound/core/memalloc.c
+++ y/sound/core/memalloc.c
@@ -540,13 +540,20 @@ static void *snd_dma_noncontig_alloc(str
{
struct sg_table *sgt;
void *p;
+ gfp_t gfp = DEFAULT_GFP;
#ifdef CONFIG_SND_DMA_SGBUF
if (cpu_feature_enabled(X86_FEATURE_XENPV))
return snd_dma_sg_fallback_alloc(dmab, size);
+ /*
+ * Given fallback, quit allocation in case of PAGE_ALLOC_COSTLY_ORDER with
+ * lower orders handled by page allocator
+ */
+ if (!get_dma_ops(dmab->dev.dev))
+ gfp &= ~__GFP_RETRY_MAYFAIL;
#endif
- sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
- DEFAULT_GFP, 0);
+ sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir, gfp, 0);
+
#ifdef CONFIG_SND_DMA_SGBUF
if (!sgt && !get_dma_ops(dmab->dev.dev))
return snd_dma_sg_fallback_alloc(dmab, size);
On Thu, 15 Feb 2024 20:32:22 +0100,
Karthikeyan Ramasubramanian wrote:
>
> On Thu, Feb 15, 2024 at 10:03 AM Takashi Iwai <[email protected]> wrote:
> >
> > On Thu, 15 Feb 2024 17:07:38 +0100,
> > Sven van Ashbrook wrote:
> > >
> > > Hi Takashi,
> > >
> > > On Thu, Feb 15, 2024 at 3:40 AM Takashi Iwai <[email protected]> wrote:
> > > >
> > > > Yes, the main problem is the indefinite hang from
> > > > dma_alloc_noncontiguous().
> > >
> > > We have a publicly-visible test [1] which readily triggers the
> > > indefinite hang on non-iommu Intel SoCs such as JasperLake.
> > > As noted in the commit message, iommu SoCs are not currently
> > > affected.
> > >
> > > > So, is the behavior more or less same even if you pass
> > > > __GFP_RETRY_MAYFAIL to dma_alloc_noncontiguous()? Or is this flag
> > > > already implicitly set somewhere in the middle? It shouldn't hang
> > > > indefinitely, but the other impact to the system like OOM-killer
> > > > kickoff may be seen.
> > >
> > > My incomplete understanding:
> > >
> > > Alsa specifies __GFP_RETRY_MAYFAIL because it wants to prevent triggering
> > > the OOM killer.
> >
> > Ah I forgot that we set that bit commonly in the flag.
> >
> > > This was __GFP_NORETRY in the not-too-distant past [2],
> > > but that failed too quickly, which resulted in permanent loss of audio due
> > > to failed firmware dma sg allocations.
> >
> > Hm, the change in commit a61c7d88d38c assumed that __GFP_RETRY_MAYFAIL
> > shouldn't have that big impact. If the hang really happens with a
> > high order allocation, it's dangerous to use it in other code
> > allocations than noncontiguous case (i.e. SNDRV_DMA_TYPE_DEV and co).
> > In the tight memory situation, a similar problem can be triggered
> > quite easily, then.
> >
> > > In the iommu case, dma_alloc_noncontiguous() implements a backoff [3] loop
> > > which ORs in __GFP_NORETRY except for minimum order allocations. We observe
> > > experimentally that __GFP_RETRY_MAYFAIL does not get "stuck" on minimum order
> > > allocations. So the iommu case is not affected.
> > >
> > > In the non-iommu case however, dma_alloc_noncontiguous() actually becomes a
> > > contiguous allocator, with no backoff loop. The commit introducing it [4]
> > > states "This API is only properly implemented for dma-iommu and will simply
> > > return a contigious chunk as a fallback." In this case we observe the indefinite
> > > hang.
> > >
> > > The alsa fallback allocator is also not affected by the problem, as it does
> > > not specify __GFP_RETRY_MAYFAIL. Except in the XENPV case.
> >
> > So it sounds like that we should go back for __GFP_NORETRY in general
> > for non-zero order allocations, not only the call you changed, as
> > __GFP_RETRY_MAYFAIL doesn't guarantee the stuck.
> >
> > How about the changes like below?
>
> We are concerned that the below proposed change might break the fix
> introduced by commit a61c7d88d38c ("FROMGIT: ALSA: memalloc: use
> __GFP_RETRY_MAYFAIL for DMA mem allocs"). More specifically in the
> IOMMU case with a large allocation, the fallback algorithm in the DMA
> IOMMU allocator[1] will not try really hard and hence may fail
> prematurely when it gets to minimum order allocations. This will
> result in no audio when there is significant load on physical memory.
Hm, then we may give __GFP_RETRY_MAYFAIL specially for the iommu case,
too? Something like the v2 patch below.
thanks,
Takashi
--- a/sound/core/memalloc.c
+++ b/sound/core/memalloc.c
@@ -21,7 +21,11 @@
#define DEFAULT_GFP \
(GFP_KERNEL | \
- __GFP_RETRY_MAYFAIL | /* don't trigger OOM-killer */ \
+ __GFP_NORETRY | /* don't trigger OOM-killer */ \
+ __GFP_NOWARN) /* no stack trace print - this call is non-critical */
+#define DEFAULT_GFP_RETRY \
+ (GFP_KERNEL | \
+ __GFP_RETRY_MAYFAIL | /* try a bit harder */ \
__GFP_NOWARN) /* no stack trace print - this call is non-critical */
static const struct snd_malloc_ops *snd_dma_get_ops(struct snd_dma_buffer *dmab);
@@ -30,6 +34,13 @@ static const struct snd_malloc_ops *snd_dma_get_ops(struct snd_dma_buffer *dmab)
static void *snd_dma_sg_fallback_alloc(struct snd_dma_buffer *dmab, size_t size);
#endif
+/* default GFP bits for our allocations */
+static gfp_t default_gfp(size_t size)
+{
+ /* don't allocate intensively for high-order pages */
+ return (size > PAGE_SIZE) ? DEFAULT_GFP : DEFAULT_GFP_RETRY;
+}
+
static void *__snd_dma_alloc_pages(struct snd_dma_buffer *dmab, size_t size)
{
const struct snd_malloc_ops *ops = snd_dma_get_ops(dmab);
@@ -281,7 +292,7 @@ static void *do_alloc_pages(struct device *dev, size_t size, dma_addr_t *addr,
bool wc)
{
void *p;
- gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+ gfp_t gfp = default_gfp(size);
again:
p = alloc_pages_exact(size, gfp);
@@ -466,7 +477,7 @@ static const struct snd_malloc_ops snd_dma_iram_ops = {
*/
static void *snd_dma_dev_alloc(struct snd_dma_buffer *dmab, size_t size)
{
- return dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr, DEFAULT_GFP);
+ return dma_alloc_coherent(dmab->dev.dev, size, &dmab->addr, default_gfp(size));
}
static void snd_dma_dev_free(struct snd_dma_buffer *dmab)
@@ -511,7 +522,7 @@ static int snd_dma_wc_mmap(struct snd_dma_buffer *dmab,
#else
static void *snd_dma_wc_alloc(struct snd_dma_buffer *dmab, size_t size)
{
- return dma_alloc_wc(dmab->dev.dev, size, &dmab->addr, DEFAULT_GFP);
+ return dma_alloc_wc(dmab->dev.dev, size, &dmab->addr, default_gfp(size));
}
static void snd_dma_wc_free(struct snd_dma_buffer *dmab)
@@ -539,14 +550,20 @@ static const struct snd_malloc_ops snd_dma_wc_ops = {
static void *snd_dma_noncontig_alloc(struct snd_dma_buffer *dmab, size_t size)
{
struct sg_table *sgt;
+ gfp_t gfp = default_gfp(size);
void *p;
#ifdef CONFIG_SND_DMA_SGBUF
if (cpu_feature_enabled(X86_FEATURE_XENPV))
return snd_dma_sg_fallback_alloc(dmab, size);
+ /* dma_alloc_noncontiguous() with IOMMU is safe to pass
+ * __GFP_RETRY_MAYFAIL option for more intensive allocations
+ */
+ if (get_dma_ops(dmab->dev.dev))
+ gfp = DEFAULT_GFP_RETRY;
#endif
sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
- DEFAULT_GFP, 0);
+ gfp, 0);
#ifdef CONFIG_SND_DMA_SGBUF
if (!sgt && !get_dma_ops(dmab->dev.dev))
return snd_dma_sg_fallback_alloc(dmab, size);
@@ -783,7 +800,7 @@ static void *snd_dma_sg_fallback_alloc(struct snd_dma_buffer *dmab, size_t size)
while (size > 0) {
chunk = min(size, chunk);
if (sgbuf->use_dma_alloc_coherent)
- p = dma_alloc_coherent(dmab->dev.dev, chunk, &addr, DEFAULT_GFP);
+ p = dma_alloc_coherent(dmab->dev.dev, chunk, &addr, default_gfp(size));
else
p = do_alloc_pages(dmab->dev.dev, chunk, &addr, false);
if (!p) {
@@ -871,7 +888,7 @@ static void *snd_dma_noncoherent_alloc(struct snd_dma_buffer *dmab, size_t size)
void *p;
p = dma_alloc_noncoherent(dmab->dev.dev, size, &dmab->addr,
- dmab->dev.dir, DEFAULT_GFP);
+ dmab->dev.dir, default_gfp(size));
if (p)
dmab->dev.need_sync = dma_need_sync(dmab->dev.dev, dmab->addr);
return p;
On Fri, 16 Feb 2024 05:34:24 +0100,
Hillf Danton wrote:
>
> On Thu, 15 Feb 2024 18:03:01 +0100 Takashi Iwai <[email protected]> wrote:
> >
> > So it sounds like that we should go back for __GFP_NORETRY in general
> > for non-zero order allocations, not only the call you changed, as
> > __GFP_RETRY_MAYFAIL doesn't guarantee the stuck.
> >
> > How about the changes like below?
> >
> > +/* default GFP bits for our allocations */
> > +static gfp_t default_gfp(size_t size)
> > +{
> > + /* don't allocate intensively for high-order pages */
> > + if (size > PAGE_SIZE)
> > + return GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY;
> > + else
> > + return GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
> > +}
>
> Looks like an overdose because both __GFP_NORETRY and __GFP_RETRY_MAYFAIL
> are checked in __alloc_pages_slowpath().
If the check there worked as expected, this shouldn't have been a
problem, no?
The fact that we have to drop __GFP_RETRY_MAYFAIL indicates that the
handling there doesn't suffice -- at least for the audio operation.
thanks,
Takashi
On Fri, 16 Feb 2024 09:35:32 +0100,
Takashi Iwai wrote:
>
> On Fri, 16 Feb 2024 05:34:24 +0100,
> Hillf Danton wrote:
> >
> > On Thu, 15 Feb 2024 18:03:01 +0100 Takashi Iwai <[email protected]> wrote:
> > >
> > > So it sounds like that we should go back for __GFP_NORETRY in general
> > > for non-zero order allocations, not only the call you changed, as
> > > __GFP_RETRY_MAYFAIL doesn't guarantee the stuck.
> > >
> > > How about the changes like below?
> > >
> > > +/* default GFP bits for our allocations */
> > > +static gfp_t default_gfp(size_t size)
> > > +{
> > > + /* don't allocate intensively for high-order pages */
> > > + if (size > PAGE_SIZE)
> > > + return GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY;
> > > + else
> > > + return GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
> > > +}
> >
> > Looks like an overdose because both __GFP_NORETRY and __GFP_RETRY_MAYFAIL
> > are checked in __alloc_pages_slowpath().
>
> If the check there worked as expected, this shouldn't have been a
> problem, no?
>
> The fact that we have to drop __GFP_RETRY_MAYFAIL indicates that the
> handling there doesn't suffice -- at least for the audio operation.
Reconsidering this again, I wonder whether keeping __GFP_RETRY_MAYFAIL
makes sense. We did have __GFP_NORETRY for avoiding the OOM-killer.
But that was ages ago, and the memory allocation core has become smart
enough.
The side effect of __GFP_RETRY_MAYFAIL is that page reclaim and
compaction happen even for high-order allocations, and that must be
the issue we see now. For dma_alloc_noncontiguous() with IOMMU, this
wasn't visible because the loop there sets __GFP_NORETRY explicitly
except for minimal-order allocations.
So, basically, we could have achieved more or less the same effect just
by dropping __GFP_NORETRY from the old DEFAULT_GFP definition.
(Now it's a drop of __GFP_RETRY_MAYFAIL.)
OTOH, a slight concern with the drop of __GFP_RETRY_MAYFAIL is whether
allowing the OOM-killer for low-order allocations is acceptable or not.
There are four patterns of calling the allocators:
1. SNDRV_DMA_TYPE_DEV for large pages:
this is usually only once at driver probe, and the pages are
preserved via PCM buffer preallocation mechanism
2. SNDRV_DMA_TYPE_DEV for lower orders:
those are usually at probes for some communication buffers, and in
most cases they are kept by drivers, too
3. SNDRV_DMA_TYPE_NONCONTIG for large size:
this is called often, once per stream open, since the driver
doesn't keep the buffer.
4. SNDRV_DMA_TYPE_NONCONTIG for lower orders:
basically same as case 2.
That is, triggering the OOM-killer would be OK for 2 and 4, but we have
to avoid it for 3. So __GFP_RETRY_MAYFAIL would still be useful there.
And for 3, there are two paths:
- with IOMMU => we may pass __GFP_RETRY_MAYFAIL unconditionally to
dma_alloc_noncontiguous()
- without IOMMU => dma_alloc_noncontiguous() without MAYFAIL,
but fallback allocation should become conditional:
- higher order: default (or explicitly with NORETRY)
- lower order: MAYFAIL
OTOH, avoiding the OOM-killer wouldn't be bad even for 2 and 4 (as its
usefulness is dubious). Then conditionally setting MAYFAIL wouldn't be
bad for the other dma_alloc_coherent() & co calls, too.
Takashi
On Fri, 16 Feb 2024 09:35:32 +0100 Takashi Iwai <[email protected]> wrote:
> On Fri, 16 Feb 2024 05:34:24 +0100, Hillf Danton wrote:
> > On Thu, 15 Feb 2024 18:03:01 +0100 Takashi Iwai <[email protected]> wrote:
> > > So it sounds like we should go back to __GFP_NORETRY in general
> > > for non-zero order allocations, not only in the call you changed, as
> > > __GFP_RETRY_MAYFAIL doesn't guarantee avoiding the stall.
> > >
> > > How about the changes like below?
> > >
> > > +/* default GFP bits for our allocations */
> > > +static gfp_t default_gfp(size_t size)
> > > +{
> > > + /* don't allocate intensively for high-order pages */
> > > + if (size > PAGE_SIZE)
> > > + return GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY;
> > > + else
> > > + return GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL;
> > > +}
> >
> > Looks like overkill because both __GFP_NORETRY and __GFP_RETRY_MAYFAIL
> > are checked in __alloc_pages_slowpath().
>
> If the check there worked as expected, this shouldn't have been a
> problem, no?
>
> The fact that we have to drop __GFP_RETRY_MAYFAIL indicates that the
> handling there doesn't suffice -- at least for the audio operation.
Dropping the retry GFP flag makes no sense without checking the fallback
in the sound allocation context in this thread, nor without checking the
costly order in the page allocator.
OTOH the page allocator will never work in every corner case, particularly
where performance/response is expected against an insanely heavily
fragmented background.
Fragmentation [1] is not anything new, nor is fiddling with a couple of
flags a cure.
[1] https://lore.kernel.org/lkml/[email protected]/
Hi,
On Fri, 16 Feb 2024, Takashi Iwai wrote:
> On Fri, 16 Feb 2024 09:35:32 +0100, Takashi Iwai wrote:
> > The fact that we have to drop __GFP_RETRY_MAYFAIL indicates that the
> > handling there doesn't suffice -- at least for the audio operation.
>
> Reconsidering on this again, I wonder keeping __GFP_RETRY_MAYFAIL
> makes sense. We did have __GFP_NORETRY for avoiding OOM-killer.
> But it's been over ages, and the memory allocation core became smart
> enough.
>
> The side-effect of __GFP_RETRY_MAYFAIL is that the page reclaim and
> compaction happens even for high-order allocations, and that must be
For the original problem that led to "ALSA: memalloc: use
__GFP_RETRY_MAYFAIL for DMA mem allocs", reclaim for the low-order case
would be enough. I.e. the case was:
> OTOH, a slight concern with the drop of __GFP_RETRY_MAYFAIL is whether
> allowing OOM-killer for low order allocations is acceptable or not.
>
> There are four patterns of calling allocators:
[..]
> 3. SNDRV_DMA_TYPE_NONCONTIG for large size:
> this is called often, once per stream open, since the driver
> doesn't keep the buffer.
So with SOF we have an additional case where we do an allocation for the
DSP firmware (snd_dma_alloc_pages(SNDRV_DMA_TYPE_DEV_SG, ...)) and this
is called at system resume. With s/__GFP_RETRY_MAYFAIL/__GFP_NORETRY/,
these allocations failed (on an IOMMU-enabled Chromebook) at system
resume in a case where the system was not really running out of memory
(reclaim was possible). A failed allocation means there's no audio in
the system after resume, so we want to try harder.
But yeah, I think the proposed handling for (3) category would work. If
needed, we can further specialize the DSP firmware case with some hint
to snd_dma_alloc_pages().
Br, Kai
On Fri, 16 Feb 2024 13:19:54 +0100,
Kai Vehmanen wrote:
> [..]
OK, then how about the one below?
This changes:
- Back to __GFP_NORETRY as default
- Use __GFP_RETRY_MAYFAIL for SNDRV_DMA_TYPE_NONCONTIG with IOMMU;
this should cover the commit a61c7d88d38c
- Also use __GFP_RETRY_MAYFAIL for the SG-fallback allocations of the
minimal order, just like the IOMMU allocator does.
This should be less destructive, while still allowing more aggressive
allocations for SG buffers.
thanks,
Takashi
--- a/sound/core/memalloc.c
+++ b/sound/core/memalloc.c
@@ -21,9 +21,13 @@
#define DEFAULT_GFP \
(GFP_KERNEL | \
- __GFP_RETRY_MAYFAIL | /* don't trigger OOM-killer */ \
+ __GFP_NORETRY | /* don't trigger OOM-killer */ \
__GFP_NOWARN) /* no stack trace print - this call is non-critical */
+/* GFP flags to be used for low order pages, allowing reclaim and compaction */
+#define DEFAULT_GFP_RETRY \
+ (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
+
static const struct snd_malloc_ops *snd_dma_get_ops(struct snd_dma_buffer *dmab);
#ifdef CONFIG_SND_DMA_SGBUF
@@ -281,7 +285,11 @@ static void *do_alloc_pages(struct device *dev, size_t size, dma_addr_t *addr,
bool wc)
{
void *p;
- gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN;
+ gfp_t gfp = DEFAULT_GFP;
+
+ /* allow reclaim and compaction for low order pages */
+ if (size <= PAGE_SIZE)
+ gfp = DEFAULT_GFP_RETRY;
again:
p = alloc_pages_exact(size, gfp);
@@ -539,14 +547,18 @@ static const struct snd_malloc_ops snd_dma_wc_ops = {
static void *snd_dma_noncontig_alloc(struct snd_dma_buffer *dmab, size_t size)
{
struct sg_table *sgt;
+ gfp_t gfp = DEFAULT_GFP;
void *p;
#ifdef CONFIG_SND_DMA_SGBUF
if (cpu_feature_enabled(X86_FEATURE_XENPV))
return snd_dma_sg_fallback_alloc(dmab, size);
+ /* with IOMMU, it's safe to pass __GFP_RETRY_MAYFAIL with high order */
+ if (get_dma_ops(dmab->dev.dev))
+ gfp = DEFAULT_GFP_RETRY;
#endif
sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
- DEFAULT_GFP, 0);
+ gfp, 0);
#ifdef CONFIG_SND_DMA_SGBUF
if (!sgt && !get_dma_ops(dmab->dev.dev))
return snd_dma_sg_fallback_alloc(dmab, size);
On Fri, Feb 16, 2024 at 9:43 AM Takashi Iwai <[email protected]> wrote:
>
> OK, then how about the one like below?
>
> This changes:
> - Back to __GFP_NORETRY as default
> - Use __GFP_RETRY_MAYFAIL for SNDRV_DMA_TYPE_NONCONTIG with IOMMU;
> this should cover the commit a61c7d88d38c
> - Also use __GFP_RETRY_MAYFAIL for the SG-fallback allocations of the
> minimal order, just like IOMMU allocator does.
>
> This should be less destructive, while still allowing more aggressive
> allocations for SG buffers.
This one looks like it would keep the SOF firmware allocation issue at bay,
in both iommu and non-iommu cases.
If there is no further discussion in this thread, we'll stress test this
on iommu and non-iommu Chromebooks.
On Fri, 16 Feb 2024 17:22:45 +0100,
Sven van Ashbrook wrote:
>
> [..]
The test with my latest patch would be appreciated in any case.
thanks,
Takashi
On Thu, 15 Feb 2024 01:07:25 +0100,
Karthikeyan Ramasubramanian wrote:
>
> Before 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> the alsa non-contiguous allocator always called the alsa fallback
> allocator in the non-iommu case. This allocated non-contig memory
> consisting of progressively smaller contiguous chunks. Allocation was
> fast due to the OR-ing in of __GFP_NORETRY.
>
> After 9d8e536 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> the code tries the dma non-contig allocator first, then falls back to
> the alsa fallback allocator. In the non-iommu case, the former supports
> only a single contiguous chunk.
>
> We have observed experimentally that under heavy memory fragmentation,
> allocating a large-ish contiguous chunk with __GFP_RETRY_MAYFAIL
> triggers an indefinite hang in the dma non-contig allocator. This has
> high-impact, as an occurrence will trigger a device reboot, resulting in
> loss of user state.
>
> Fix the non-iommu path by letting dma_alloc_noncontiguous() fail quickly
> so it does not get stuck looking for that elusive large contiguous chunk,
> in which case we will fall back to the alsa fallback allocator.
>
> Note that the iommu dma non-contiguous allocator is not affected. While
> assembling an array of pages, it tries consecutively smaller contiguous
> allocations, and lets higher-order chunk allocations fail quickly.
>
> Suggested-by: Sven van Ashbrook <[email protected]>
> Suggested-by: Brian Geffon <[email protected]>
> Fixes: 9d8e536d36e7 ("ALSA: memalloc: Try dma_alloc_noncontiguous() at first")
> Cc: [email protected]
> Cc: Sven van Ashbrook <[email protected]>
> Cc: Brian Geffon <[email protected]>
> Cc: Curtis Malainey <[email protected]>
> Signed-off-by: Karthikeyan Ramasubramanian <[email protected]>
After chatting with Vlastimil, he recommended getting linux-mm people
involved, as such a problem basically shouldn't happen on the page
allocator side. So let's widen the audience.
To recap the thread: the problem is that a dma_alloc_noncontiguous()
call with high-order pages and __GFP_RETRY_MAYFAIL leads to an
indefinite stall. (It was __GFP_NORETRY beforehand.) This looks like
the code path with direct page allocation where no IOMMU is involved.
Karthikeyan, Sven, and co: could you guys show the stack trace at the
stall? This may shed more light on it.
Also, Vlastimil suggested that tracepoints would be helpful if it's
really in the page allocator, too.
Thanks!
Takashi
> ---
>
> sound/core/memalloc.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/sound/core/memalloc.c b/sound/core/memalloc.c
> index f901504b5afc1..5f6526a0d731c 100644
> --- a/sound/core/memalloc.c
> +++ b/sound/core/memalloc.c
> @@ -540,13 +540,18 @@ static void *snd_dma_noncontig_alloc(struct snd_dma_buffer *dmab, size_t size)
> {
> struct sg_table *sgt;
> void *p;
> + gfp_t gfp_flags = DEFAULT_GFP;
>
> #ifdef CONFIG_SND_DMA_SGBUF
> if (cpu_feature_enabled(X86_FEATURE_XENPV))
> return snd_dma_sg_fallback_alloc(dmab, size);
> +
> + /* Non-IOMMU case: prevent allocator from searching forever */
> + if (!get_dma_ops(dmab->dev.dev))
> + gfp_flags |= __GFP_NORETRY;
> #endif
> sgt = dma_alloc_noncontiguous(dmab->dev.dev, size, dmab->dev.dir,
> - DEFAULT_GFP, 0);
> + gfp_flags, 0);
> #ifdef CONFIG_SND_DMA_SGBUF
> if (!sgt && !get_dma_ops(dmab->dev.dev))
> return snd_dma_sg_fallback_alloc(dmab, size);
> --
> 2.43.0.687.g38aa6559b0-goog
>
On 2/19/24 12:36, Takashi Iwai wrote:
> On Thu, 15 Feb 2024 01:07:25 +0100,
> Karthikeyan Ramasubramanian wrote:
>>
>> [..]
>
> After chatting with Vlastimil, he recommended getting linux-mm people
> involved, as such a problem basically shouldn't happen on the page
> allocator side. So let's widen the audience.
>
> To recap the thread: the problem is that a dma_alloc_noncontiguous()
> call with high-order pages and __GFP_RETRY_MAYFAIL leads to an
> indefinite stall. (It was __GFP_NORETRY beforehand.) This looks like
> the code path with direct page allocation where no IOMMU is involved.
>
> Karthikeyan, Sven, and co: could you guys show the stack trace at the
> stall? This may shed more light on it.
Yeah, if the infinite loop with __GFP_RETRY_MAYFAIL happens in a call to
__alloc_pages and not in some retry loop around it in an upper layer (I
tried to check the dma functions but got lost quickly, so the exact call
stack would be useful), we definitely want to know the details. It
shouldn't happen for costly orders (>3) because the retries are
hard-limited for those despite apparent progress of reclaim or compaction.
> Also, Vlastimil suggested that tracepoints would be helpful if that's
> really in the page allocator, too.
>
>
> Thanks!
>
> Takashi
>
>> [..]
Takashi, Vlastimil: thanks so much for the engagement! See below.
> On 2/19/24 12:36, Takashi Iwai wrote:
> >
> > Karthikeyan, Sven, and co: could you guys show the stack trace at the
> > stall? This may give us more clear light.
Here are two typical stack traces at the stall. Note that the timer interrupt
is just a software watchdog that fires to generate the stack trace.
This is running the v6.1 kernel.
We should be able to reproduce this on v6.6 as well if need be.
<4>[310289.546429] <TASK>
<4>[310289.546431] asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[310289.546434] RIP: 0010:super_cache_count+0xc/0xea
<4>[310289.546438] Code: ff ff e8 48 ac e3 ff 4c 89 e0 48 83 c4 20 5b
41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc cc 0f 1f 44 00 00 f6 87 23
fc ff ff 20 <75> 08 31 c0 c3 cc cc cc cc cc 55 48 89 e5 41 57 41 56 41
54 53 49
<4>[310289.546440] RSP: 0018:ffffa64e8aed35c0 EFLAGS: 00000202
<4>[310289.546443] RAX: 0000000000000080 RBX: 0000000000000400 RCX:
0000000000000000
<4>[310289.546445] RDX: ffffffffa6d66bc8 RSI: ffffa64e8aed3610 RDI:
ffff9fd2873dbc30
<4>[310289.546447] RBP: ffffa64e8aed3660 R08: 0000000000000064 R09:
0000000000000000
<4>[310289.546449] R10: ffffffffa6e3b260 R11: ffffffffa5163a52 R12:
ffff9fd2873dbc50
<4>[310289.546451] R13: 0000000000046c00 R14: 0000000000000000 R15:
0000000000000000
<4>[310289.546453] ? super_cache_scan+0x199/0x199
<4>[310289.546457] shrink_slab+0xb3/0x37e
<4>[310289.546460] shrink_node+0x377/0x110e
<4>[310289.546464] ? sysvec_apic_timer_interrupt+0x17/0x80
<4>[310289.546467] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[310289.546471] try_to_free_pages+0x46e/0x857
<4>[310289.546475] ? psi_task_change+0x7f/0x9c
<4>[310289.546478] __alloc_pages_slowpath+0x4e2/0xe5c
<4>[310289.546482] __alloc_pages+0x225/0x2a2
<4>[310289.546486] __dma_direct_alloc_pages+0xed/0x1cb
<4>[310289.546489] dma_direct_alloc_pages+0x21/0xa3
<4>[310289.546493] dma_alloc_noncontiguous+0xd1/0x144
<4>[310289.546496] snd_dma_noncontig_alloc+0x45/0xe3
<4>[310289.546499] snd_dma_alloc_dir_pages+0x4f/0x81
<4>[310289.546502] hda_cl_stream_prepare+0x66/0x15e
[snd_sof_intel_hda_common (HASH:1255 1)]
<4>[310289.546510] hda_dsp_cl_boot_firmware+0xc4/0x2ca
[snd_sof_intel_hda_common (HASH:1255 1)]
<4>[310289.546518] snd_sof_run_firmware+0xca/0x2d7 [snd_sof (HASH:ecd9 2)]
<4>[310289.546526] ? hda_dsp_resume+0x97/0x1a7
[snd_sof_intel_hda_common (HASH:1255 1)]
<4>[310289.546534] sof_resume+0x155/0x251 [snd_sof (HASH:ecd9 2)]
<4>[310289.546542] ? pci_pm_suspend+0x1e7/0x1e7
<4>[310289.546546] dpm_run_callback+0x3c/0x132
<4>[310289.546549] device_resume+0x1f7/0x282
<4>[310289.546552] ? dpm_watchdog_set+0x54/0x54
<4>[310289.546555] async_resume+0x1f/0x5b
<4>[310289.546558] async_run_entry_fn+0x2b/0xc5
<4>[310289.546561] process_one_work+0x1be/0x381
<4>[310289.546564] worker_thread+0x20b/0x35b
<4>[310289.546568] kthread+0xde/0xf7
<4>[310289.546571] ? pr_cont_work+0x54/0x54
<4>[310289.546574] ? kthread_blkcg+0x32/0x32
<4>[310289.546577] ret_from_fork+0x1f/0x30
<4>[310289.546580] </TASK>
<4>[171032.151834] <TASK>
<4>[171032.151835] asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[171032.151839] RIP: 0010:_raw_spin_unlock_irq+0x10/0x28
<4>[171032.151842] Code: 2c 70 74 06 c3 cc cc cc cc cc 55 48 89 e5 e8
7e 30 2b ff 5d c3 cc cc cc cc cc 0f 1f 44 00 00 c6 07 00 fb 65 ff 0d
af b1 2c 70 <74> 06 c3 cc cc cc cc cc 55 48 89 e5 e8 56 30 2b ff 5d c3
cc cc cc
<4>[171032.151844] RSP: 0018:ffff942447b334d8 EFLAGS: 00000286
<4>[171032.151847] RAX: 0000000000000031 RBX: 0000000000000001 RCX:
0000000000000034
<4>[171032.151849] RDX: 0000000000000031 RSI: 0000000000000002 RDI:
ffffffff9103b1b0
<4>[171032.151851] RBP: ffff942447b33660 R08: 0000000000000032 R09:
0000000000000010
<4>[171032.151853] R10: ffffffff9103b370 R11: 00000000ffffffff R12:
ffffffff9103b160
<4>[171032.151855] R13: ffffd055000111c8 R14: 0000000000000000 R15:
0000000000000031
<4>[171032.151858] evict_folios+0xf9e/0x1307
<4>[171032.151861] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
<4>[171032.151866] shrink_node+0x2e8/0x110e
<4>[171032.151870] ? common_interrupt+0x1c/0x95
<4>[171032.151872] ? common_interrupt+0x1c/0x95
<4>[171032.151875] ? asm_common_interrupt+0x22/0x40
<4>[171032.151878] ? __compaction_suitable+0x7c/0x9d
<4>[171032.151882] try_to_free_pages+0x46e/0x857
<4>[171032.151885] ? psi_task_change+0x7f/0x9c
<4>[171032.151889] __alloc_pages_slowpath+0x4e2/0xe5c
<4>[171032.151893] __alloc_pages+0x225/0x2a2
<4>[171032.151896] __dma_direct_alloc_pages+0xed/0x1cb
<4>[171032.151900] dma_direct_alloc_pages+0x21/0xa3
<4>[171032.151903] dma_alloc_noncontiguous+0xd1/0x144
<4>[171032.151907] snd_dma_noncontig_alloc+0x45/0xe3
<4>[171032.151910] snd_dma_alloc_dir_pages+0x4f/0x81
<4>[171032.151913] hda_cl_stream_prepare+0x66/0x15e
[snd_sof_intel_hda_common (HASH:7df0 1)]
<4>[171032.151921] hda_dsp_cl_boot_firmware+0xc4/0x2ca
[snd_sof_intel_hda_common (HASH:7df0 1)]
<4>[171032.151929] snd_sof_run_firmware+0xca/0x2d7 [snd_sof (HASH:9f20 2)]
<4>[171032.151937] ? hda_dsp_resume+0x97/0x1a7
[snd_sof_intel_hda_common (HASH:7df0 1)]
<4>[171032.151945] sof_resume+0x155/0x251 [snd_sof (HASH:9f20 2)]
<4>[171032.151953] ? pci_pm_suspend+0x1e7/0x1e7
<4>[171032.151957] dpm_run_callback+0x3c/0x132
<4>[171032.151960] device_resume+0x1f7/0x282
<4>[171032.151962] ? dpm_watchdog_set+0x54/0x54
<4>[171032.151965] async_resume+0x1f/0x5b
<4>[171032.151968] async_run_entry_fn+0x2b/0xc5
<4>[171032.151971] process_one_work+0x1be/0x381
<4>[171032.151975] worker_thread+0x20b/0x35b
<4>[171032.151978] kthread+0xde/0xf7
<4>[171032.151981] ? pr_cont_work+0x54/0x54
<4>[171032.151984] ? kthread_blkcg+0x32/0x32
<4>[171032.151987] ret_from_fork+0x1f/0x30
<4>[171032.151991] </TASK>
On Mon, Feb 19, 2024 at 6:40 AM Vlastimil Babka <[email protected]> wrote:
>
> Yeah, if the infinite loop with __GFP_RETRY_MAYFAIL happens in a call to
> __alloc_pages and not in some retry loop around it in an upper layer (I
> tried to check the dma functions but got lost quickly, so the exact call
> stack would be useful), we definitely want to know the details. It
> shouldn't happen for costly orders (>3) because the retries are
> hard-limited for those despite apparent progress of reclaim or compaction.
Here are our notes of the indefinite stall we saw on v5.10 with iommu SoCs.
We did not pursue debugging the stall at the time, in favour of a work-around
with the gfp flags. Therefore we only have partial confidence in the notes
below. Take them with a block of salt, but they may point in a useful direction.
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with
   __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath [1] tries to get a page from the
   freelist. This fails because there is nothing free of that costly order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
   which bails out [2] because a zone is ready to be compacted; it
   pretends to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early [3]
   because __GFP_IO is not set (it's not passed by the snd allocator, and
   even if it were, we are suspending so the __GFP_IO flag would be
   cleared anyway).
5. page alloc believes reclaim progress was made (because of the pretense
   in item 3) and so it checks whether it should retry compaction. The
   compaction retry logic [4] thinks it should try again, because:
   a) reclaim is needed because of the early bail-out in item 4
   b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
>
> > Also, Vlastimil suggested that tracepoints would be helpful if that's
> > really in the page allocator, too.
> >
We might be able to generate traces by bailing out of the indefinite
stall using a timer, which should hopefully give us a device that's
"alive enough" to read the traces.
Can you advise which tracepoints you'd like to see? Is trace-cmd [5]
suitable to capture this?
[1] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/page_alloc.c;l=4654;drc=a16293af64a1f558dab9a5dd7fb05fdbc2b7c5c0
[2] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/vmscan.c;drc=44452e4236561f6e36ec587805a52b683e2804c9;l=6177
[3] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/compaction.c;l=2479;drc=d7b105aa1559e6c287f3f372044c21c7400b7784
[4] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/page_alloc.c;l=4171;drc=a16293af64a1f558dab9a5dd7fb05fdbc2b7c5c0
[5] https://chromium.googlesource.com/chromiumos/docs/+/HEAD/kernel_development.md#ftrace-debugging
On Tue, 20 Feb 2024 16:52:14 +0100,
Sven van Ashbrook wrote:
>
> Takashi, Vlastimil: thanks so much for the engagement! See below.
>
> > On 2/19/24 12:36, Takashi Iwai wrote:
> > >
> > > Karthikeyan, Sven, and co: could you guys show the stack trace at the
> > > stall? This may give us more clear light.
>
> Here are two typical stack traces at the stall. Note that the timer interrupt
> is just a software watchdog that fires to generate the stack trace.
> This is running the v6.1 kernel.
> We should be able to reproduce this on v6.6 as well if need be.
>
> [..]
Thanks!
Both look like the code path via async PM resume.
Were both from the runtime PM resume? Or the system resume?
Takashi
On 2/20/24 16:52, Sven van Ashbrook wrote:
> Takashi, Vlastimil: thanks so much for the engagement! See below.
>
>> On 2/19/24 12:36, Takashi Iwai wrote:
>> >
>> > Karthikeyan, Sven, and co: could you guys show the stack trace at the
>> > stall? This may give us more clear light.
> Here are our notes of the indefinite stall we saw on v5.10 with iommu SoCs.
> We did not pursue debugging the stall at the time, in favour of a work-around
> with the gfp flags. Therefore we only have partial confidence in the notes
> below. Take them with a block of salt, but they may point in a useful direction.
>
> 1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER) with
>    __GFP_RETRY_MAYFAIL set.
>
> 2. page alloc's __alloc_pages_slowpath [1] tries to get a page from the
>    freelist. This fails because there is nothing free of that costly order.
>
> 3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim, which
>    bails out [2] because a zone is ready to be compacted; it pretends to have
>    made a single page of progress.
>
> 4. page alloc tries to compact, but this always bails out early [3] because
>    __GFP_IO is not set (it's not passed by the snd allocator, and even if it
>    were, we are suspending so the __GFP_IO flag would be cleared anyway).
>
> 5. page alloc believes reclaim progress was made (because of the pretense in
>    item 3) and so it checks whether it should retry compaction. The compaction
>    retry logic [4] thinks it should try again, because:
>    a) reclaim is needed because of the early bail-out in item 4
>    b) a zonelist is suitable for compaction
>
> 6. goto 2. indefinite stall.
Thanks a lot, seems this can indeed happen even in 6.8-rc5. We're
mishandling the case where compaction is skipped due to lack of __GFP_IO,
which is indeed cleared in suspend/resume. I'll create a fix. Please don't
hesitate to report such issues the next time, even if not fully debugged :)
>>
>> > Also, Vlastimil suggested that tracepoints would be helpful if that's
>> > really in the page allocator, too.
>> >
>
> We might be able to generate traces by bailing out of the indefinite stall
> using a timer, which should hopefully give us a device that's "alive enough"
> to read the traces.
>
> Can you advise which tracepoints you'd like to see? Is trace-cmd [5] suitable
> to capture this?
>
> [1] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/page_alloc.c;l=4654;drc=a16293af64a1f558dab9a5dd7fb05fdbc2b7c5c0
> [2] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/vmscan.c;drc=44452e4236561f6e36ec587805a52b683e2804c9;l=6177
> [3] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/compaction.c;l=2479;drc=d7b105aa1559e6c287f3f372044c21c7400b7784
> [4] https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/third_party/kernel/v5.10/mm/page_alloc.c;l=4171;drc=a16293af64a1f558dab9a5dd7fb05fdbc2b7c5c0
> [5] https://chromium.googlesource.com/chromiumos/docs/+/HEAD/kernel_development.md#ftrace-debugging
Sven reports an infinite loop in __alloc_pages_slowpath() for costly
order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such
combination can happen in a suspend/resume context where a GFP_KERNEL
allocation can have __GFP_IO masked out via gfp_allowed_mask.
Quoting Sven:
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath tries to get a page from the
freelist. This fails because there is nothing free of that costly
order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
which bails out because a zone is ready to be compacted; it pretends
to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early because
__GFP_IO is not set (it's not passed by the snd allocator, and even
if it were, we are suspending so the __GFP_IO flag would be cleared
anyway).
5. page alloc believes reclaim progress was made (because of the
pretense in item 3) and so it checks whether it should retry
compaction. The compaction retry logic thinks it should try again,
because:
a) reclaim is needed because of the early bail-out in item 4
b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
(end quote)
The immediate root cause is confusing the COMPACT_SKIPPED returned from
__alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
indicating a lack of order-0 pages, and in step 5 evaluating that in
should_compact_retry() as a reason to retry, before incrementing and
limiting the number of retries. There are however other places that
wrongly assume that compaction can happen while we lack __GFP_IO.
To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
evaluation and switch the open-coded test in try_to_compact_pages() to
use it.
Also use the new helper in:
- compaction_ready(), which will make reclaim not bail out in step 3, so
there's at least one attempt to actually reclaim, even if chances are
small for a costly order
- in_reclaim_compaction() which will make should_continue_reclaim()
return false and we don't over-reclaim unnecessarily
- in __alloc_pages_slowpath() to set a local variable can_compact,
which is then used to avoid retrying reclaim/compaction for costly
allocations (step 5) if we can't compact and also to skip the early
compaction attempt that we do in some cases
Reported-by: Sven van Ashbrook <[email protected]>
Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%[email protected]/
Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
Cc: <[email protected]>
Signed-off-by: Vlastimil Babka <[email protected]>
---
include/linux/gfp.h | 9 +++++++++
mm/compaction.c | 7 +------
mm/page_alloc.c | 10 ++++++----
mm/vmscan.c | 5 ++++-
4 files changed, 20 insertions(+), 11 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..e2a916cf29c4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -353,6 +353,15 @@ static inline bool gfp_has_io_fs(gfp_t gfp)
return (gfp & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS);
}
+/*
+ * Check if the gfp flags allow compaction - GFP_NOIO is a really
+ * tricky context because the migration might require IO.
+ */
+static inline bool gfp_compaction_allowed(gfp_t gfp_mask)
+{
+ return IS_ENABLED(CONFIG_COMPACTION) && (gfp_mask & __GFP_IO);
+}
+
extern gfp_t vma_thp_gfp_mask(struct vm_area_struct *vma);
#ifdef CONFIG_CONTIG_ALLOC
diff --git a/mm/compaction.c b/mm/compaction.c
index 4add68d40e8d..b961db601df4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2723,16 +2723,11 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
unsigned int alloc_flags, const struct alloc_context *ac,
enum compact_priority prio, struct page **capture)
{
- int may_perform_io = (__force int)(gfp_mask & __GFP_IO);
struct zoneref *z;
struct zone *zone;
enum compact_result rc = COMPACT_SKIPPED;
- /*
- * Check if the GFP flags allow compaction - GFP_NOIO is really
- * tricky context because the migration might require IO
- */
- if (!may_perform_io)
+ if (!gfp_compaction_allowed(gfp_mask))
return COMPACT_SKIPPED;
trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..a663202045dc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4041,6 +4041,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac)
{
bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
+ bool can_compact = gfp_compaction_allowed(gfp_mask);
const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
struct page *page = NULL;
unsigned int alloc_flags;
@@ -4111,7 +4112,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* Don't try this for allocations that are allowed to ignore
* watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
*/
- if (can_direct_reclaim &&
+ if (can_direct_reclaim && can_compact &&
(costly_order ||
(order > 0 && ac->migratetype != MIGRATE_MOVABLE))
&& !gfp_pfmemalloc_allowed(gfp_mask)) {
@@ -4209,9 +4210,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
/*
* Do not retry costly high order allocations unless they are
- * __GFP_RETRY_MAYFAIL
+ * __GFP_RETRY_MAYFAIL and we can compact
*/
- if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
+ if (costly_order && (!can_compact ||
+ !(gfp_mask & __GFP_RETRY_MAYFAIL)))
goto nopage;
if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
@@ -4224,7 +4226,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
* implementation of the compaction depends on the sufficient amount
* of free memory (see __compaction_suitable)
*/
- if (did_some_progress > 0 &&
+ if (did_some_progress > 0 && can_compact &&
should_compact_retry(ac, order, alloc_flags,
compact_result, &compact_priority,
&compaction_retries))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..4255619a1a31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5753,7 +5753,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
/* Use reclaim/compaction for costly allocs or under memory pressure */
static bool in_reclaim_compaction(struct scan_control *sc)
{
- if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
+ if (gfp_compaction_allowed(sc->gfp_mask) && sc->order &&
(sc->order > PAGE_ALLOC_COSTLY_ORDER ||
sc->priority < DEF_PRIORITY - 2))
return true;
@@ -5998,6 +5998,9 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
{
unsigned long watermark;
+ if (!gfp_compaction_allowed(sc->gfp_mask))
+ return false;
+
/* Allocation can already succeed, nothing to do */
if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
sc->reclaim_idx, 0))
--
2.43.1
Hi Takashi,
On Wed, Feb 21, 2024 at 4:21 AM Takashi Iwai <[email protected]> wrote:
>
> Both look like the code path via async PM resume.
> Were both from the runtime PM resume? Or the system resume?
The large firmware allocation that triggers the stall happens in runtime
resume, so both runtime resume and system resume are affected.
Due to the way runtime PM works, the system suspend path is affected as
well: system suspend first wakes up any runtime-suspended devices before
executing their system suspend callback.
So we get:
system active, SOF runtime suspended (audio not active)
-> system suspend request
-> calls SOF runtime resume callback (stall happens here)
-> calls SOF system suspend callback
On Wed, Feb 21, 2024 at 4:58 AM Vlastimil Babka <[email protected]> wrote:
>
> Thanks a lot, seems this can indeed happen even in 6.8-rc5. We're
> mishandling the case where compaction is skipped due to lack of __GFP_IO,
> which is indeed cleared in suspend/resume. I'll create a fix.
I'm really grateful for the engagement here!!
> Please don't
> hesitate to report such issues the next time, even if not fully debugged :)
>
Will do :)
Thanks so much! We will stress-test this on our side.
We do this by exhausting memory and triggering many suspend/resume
cycles. This reliably reproduces the problem (before this patch).
Of course, as we all know, absence of evidence (no more stalls in stress tests)
does not equal evidence of absence (stalls are gone in all code paths).
On Mon, Feb 19, 2024 at 4:19 AM Takashi Iwai <[email protected]> wrote:
>
> On Fri, 16 Feb 2024 17:22:45 +0100,
> Sven van Ashbrook wrote:
> >
> > On Fri, Feb 16, 2024 at 9:43 AM Takashi Iwai <[email protected]> wrote:
> > >
> > > OK, then how about the one like below?
> > >
> > > This changes:
> > > - Back to __GFP_NORETRY as default
> > > - Use __GFP_RETRY_MAYFAIL for SNDRV_DMA_TYPE_NONCONTIG with IOMMU;
> > > this should cover the commit a61c7d88d38c
> > > - Also use __GFP_RETRY_MAYFAIL for the SG-fallback allocations of the
> > > minimal order, just like IOMMU allocator does.
> > >
> > > This should be less destructive, while still allowing more aggressive
> > > allocations for SG buffers.
> >
> > This one looks like it would keep the SOF firmware allocation issue at bay,
> > in both iommu and non-iommu cases.
> >
> > If there is no further discussion in this thread, we'll stress test this
> > on iommu and non-iommu Chromebooks.
>
> The test with my latest patch would be appreciated in any case.
In case you are still interested in the test results: the original
indefinite hang is not observed with your latest patchset. This holds
for both the IOMMU and non-IOMMU cases.
Thanks and Regards,
Karthikeyan.
>
>
> thanks,
>
> Takashi
>
With this patch, the test results have been promising so far. After 10
hours of stress testing, we have not hit the reported problem yet. We
will keep testing and report here if we hit the problem again. Thanks
for engaging with us.
On Wed, Feb 21, 2024 at 8:53 AM Sven van Ashbrook <[email protected]> wrote:
>
> Thanks so much ! We will stress test this on our side.
>
> We do this by exhausting memory and triggering many suspend/resume
> cycles. This reliably reproduces the problem (before this patch).
>
> Of course, as we all know, absence of evidence (no more stalls in stress tests)
> does not equal evidence of absence (stalls are gone in all code paths).
Vlastimil,
We noticed that this patch has now been added to Andrew Morton's
mm-hotfixes-unstable branch.
How can we help get it into mm-hotfixes-stable and, from there, into
mainline?
We are still stress-testing using low-memory suspend. Is there anything
else that is required, and that we can help with?
Sven
On Wed, Feb 21, 2024 at 6:44 AM Vlastimil Babka <[email protected]> wrote:
>
> Sven reports an infinite loop in __alloc_pages_slowpath() for costly
> order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such
> combination can happen in a suspend/resume context where a GFP_KERNEL
> allocation can have __GFP_IO masked out via gfp_allowed_mask.
>
> Quoting Sven:
>
> 1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
> with __GFP_RETRY_MAYFAIL set.
>
> 2. page alloc's __alloc_pages_slowpath tries to get a page from the
> freelist. This fails because there is nothing free of that costly
> order.
>
> 3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
> which bails out because a zone is ready to be compacted; it pretends
> to have made a single page of progress.
>
> 4. page alloc tries to compact, but this always bails out early because
> __GFP_IO is not set (it's not passed by the snd allocator, and even
> if it were, we are suspending so the __GFP_IO flag would be cleared
> anyway).
>
> 5. page alloc believes reclaim progress was made (because of the
> pretense in item 3) and so it checks whether it should retry
> compaction. The compaction retry logic thinks it should try again,
> because:
> a) reclaim is needed because of the early bail-out in item 4
> b) a zonelist is suitable for compaction
>
> 6. goto 2. indefinite stall.
>
> (end quote)
>
> The immediate root cause is confusing the COMPACT_SKIPPED returned from
> __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
> indicating a lack of order-0 pages, and in step 5 evaluating that in
> should_compact_retry() as a reason to retry, before incrementing and
> limiting the number of retries. There are however other places that
> wrongly assume that compaction can happen while we lack __GFP_IO.
>
> To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
> evaluation and switch the open-coded test in try_to_compact_pages() to
> use it.
>
> Also use the new helper in:
> - compaction_ready(), which will make reclaim not bail out in step 3, so
> there's at least one attempt to actually reclaim, even if chances are
> small for a costly order
> - in_reclaim_compaction() which will make should_continue_reclaim()
> return false and we don't over-reclaim unnecessarily
> - in __alloc_pages_slowpath() to set a local variable can_compact,
> which is then used to avoid retrying reclaim/compaction for costly
> allocations (step 5) if we can't compact and also to skip the early
> compaction attempt that we do in some cases
>
> Reported-by: Sven van Ashbrook <[email protected]>
> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%[email protected]/
> Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
> Cc: <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
On 2/26/24 17:09, Sven van Ashbrook wrote:
> Vlastimil,
>
> We noticed that this patch is now added to Andrew Morton's
> mm-hotfixes-unstable branch.
>
> How can we help to get this into mm-hotfixes-stable
> and from there, into mainline ?
A Tested-by: can't hurt, but of course you need to finish the testing first.
It's encouraging that you haven't hit the bug anymore, though!
> We are still stress-testing using low memory suspend. Anything else
> that is required, and we can help with?
I think that's already great enough, thanks!
> Sven
On Wed, Feb 21, 2024 at 4:44 AM Vlastimil Babka <[email protected]> wrote:
>
> Sven reports an infinite loop in __alloc_pages_slowpath() for costly
> order __GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such
> combination can happen in a suspend/resume context where a GFP_KERNEL
> allocation can have __GFP_IO masked out via gfp_allowed_mask.
>
> Quoting Sven:
>
> 1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
> with __GFP_RETRY_MAYFAIL set.
>
> 2. page alloc's __alloc_pages_slowpath tries to get a page from the
> freelist. This fails because there is nothing free of that costly
> order.
>
> 3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
> which bails out because a zone is ready to be compacted; it pretends
> to have made a single page of progress.
>
> 4. page alloc tries to compact, but this always bails out early because
> __GFP_IO is not set (it's not passed by the snd allocator, and even
> if it were, we are suspending so the __GFP_IO flag would be cleared
> anyway).
>
> 5. page alloc believes reclaim progress was made (because of the
> pretense in item 3) and so it checks whether it should retry
> compaction. The compaction retry logic thinks it should try again,
> because:
> a) reclaim is needed because of the early bail-out in item 4
> b) a zonelist is suitable for compaction
>
> 6. goto 2. indefinite stall.
>
> (end quote)
>
> The immediate root cause is confusing the COMPACT_SKIPPED returned from
> __alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
> indicating a lack of order-0 pages, and in step 5 evaluating that in
> should_compact_retry() as a reason to retry, before incrementing and
> limiting the number of retries. There are however other places that
> wrongly assume that compaction can happen while we lack __GFP_IO.
>
> To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
> evaluation and switch the open-coded test in try_to_compact_pages() to
> use it.
>
> Also use the new helper in:
> - compaction_ready(), which will make reclaim not bail out in step 3, so
> there's at least one attempt to actually reclaim, even if chances are
> small for a costly order
> - in_reclaim_compaction() which will make should_continue_reclaim()
> return false and we don't over-reclaim unnecessarily
> - in __alloc_pages_slowpath() to set a local variable can_compact,
> which is then used to avoid retrying reclaim/compaction for costly
> allocations (step 5) if we can't compact and also to skip the early
> compaction attempt that we do in some cases
>
> Reported-by: Sven van Ashbrook <[email protected]>
> Closes: https://lore.kernel.org/all/CAG-rBihs_xMKb3wrMO1%2B-%2Bp4fowP9oy1pa_OTkfxBzPUVOZF%[email protected]/
> Fixes: 3250845d0526 ("Revert "mm, oom: prevent premature OOM killer invocation for high order request"")
> Cc: <[email protected]>
> Signed-off-by: Vlastimil Babka <[email protected]>
Tested-by: Karthikeyan Ramasubramanian <[email protected]>