2018-02-08 06:30:59

by Kai-Heng Feng

[permalink] [raw]
Subject: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

A user with i386 instead of AMD64 machine reports [1] that commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”) causes a regression.
BUG_ON(PageHighMem(pg)) in drivers/media/common/saa7146/saa7146_core.c always gets triggered after that commit.

Commit 704b862f9efd ("mm/vmalloc.c: don't unconditonally use __GFP_HIGHMEM”) adjusts the mask logic, now the __GFP_HIGHMEM only gets applied when there is no GFP_DMA or GFP_DMA32.

So I tried to adjust its malloc to "__vmalloc(nr_pages * sizeof(struct scatterlist), GFP_KERNEL | GFP_DMA | __GFP_ZERO, PAGE_KERNEL)”, but both GFP_DMA or GFP_DMA32 still trigger the BUG_ON(PageHighMem()) macro.

Also there are other BUG_ON(PageHighMem()) in drivers/media, I think they will get hit by same regression in 32bit machine too.

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1742316

Kai-Heng



2018-02-08 13:07:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Thu, Feb 08, 2018 at 02:29:57PM +0800, Kai Heng Feng wrote:
> A user with i386 instead of AMD64 machine reports [1] that commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”) causes a regression.
> BUG_ON(PageHighMem(pg)) in drivers/media/common/saa7146/saa7146_core.c always gets triggered after that commit.

Well, the BUG_ON is wrong. You can absolutely have pages which are both
HighMem and under the 4GB boundary. Only the first 896MB (iirc) are LowMem,
and the next 3GB of pages are available to vmalloc_32().

> Also there are other BUG_ON(PageHighMem()) in drivers/media, I think they will get hit by same regression in 32bit machine too.

I fixed one of them. I think the other three are also bogus, but it's
hard to say; the comments say "DMA to HighMem might not work", and they
probably mean "above the 4GB boundary", but I really don't know.

(since two drivers now have this code, maybe it should be part of the core
MM API? Or maybe there's already something better they should be using?)

diff --git a/drivers/media/common/saa7146/saa7146_core.c b/drivers/media/common/saa7146/saa7146_core.c
index 9f7c5b0a6b45..329fd43228ff 100644
--- a/drivers/media/common/saa7146/saa7146_core.c
+++ b/drivers/media/common/saa7146/saa7146_core.c
@@ -160,7 +160,7 @@ static struct scatterlist* vmalloc_to_sg(unsigned char *virt, int nr_pages)
pg = vmalloc_to_page(virt);
if (NULL == pg)
goto err;
- BUG_ON(PageHighMem(pg));
+ BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
}
return sglist;
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index f412429cf5ba..b5ec74b9c867 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -77,7 +77,7 @@ static struct scatterlist *videobuf_vmalloc_to_sg(unsigned char *virt,
pg = vmalloc_to_page(virt);
if (NULL == pg)
goto err;
- BUG_ON(PageHighMem(pg));
+ BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
}
return sglist;

2018-02-08 17:58:17

by Laura Abbott

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On 02/08/2018 05:06 AM, Matthew Wilcox wrote:
> On Thu, Feb 08, 2018 at 02:29:57PM +0800, Kai Heng Feng wrote:
>> A user with i386 instead of AMD64 machine reports [1] that commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”) causes a regression.
>> BUG_ON(PageHighMem(pg)) in drivers/media/common/saa7146/saa7146_core.c always gets triggered after that commit.
>
> Well, the BUG_ON is wrong. You can absolutely have pages which are both
> HighMem and under the 4GB boundary. Only the first 896MB (iirc) are LowMem,
> and the next 3GB of pages are available to vmalloc_32().
>
>> Also there are other BUG_ON(PageHighMem()) in drivers/media, I think they will get hit by same regression in 32bit machine too.
>
> I fixed one of them. I think the other three are also bogus, but it's
> hard to say; the comments say "DMA to HighMem might not work", and they
> probably mean "above the 4GB boundary", but I really don't know.
>
> (since two drivers now have this code, maybe it should be part of the core
> MM API? Or maybe there's already something better they should be using?)
>

The comment at the top said the function was copied from videobuf-dma-sg.c
so might be worth making it common.

> diff --git a/drivers/media/common/saa7146/saa7146_core.c b/drivers/media/common/saa7146/saa7146_core.c
> index 9f7c5b0a6b45..329fd43228ff 100644
> --- a/drivers/media/common/saa7146/saa7146_core.c
> +++ b/drivers/media/common/saa7146/saa7146_core.c
> @@ -160,7 +160,7 @@ static struct scatterlist* vmalloc_to_sg(unsigned char *virt, int nr_pages)
> pg = vmalloc_to_page(virt);
> if (NULL == pg)
> goto err;
> - BUG_ON(PageHighMem(pg));
> + BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
> sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
> }
> return sglist;
> diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
> index f412429cf5ba..b5ec74b9c867 100644
> --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
> +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> @@ -77,7 +77,7 @@ static struct scatterlist *videobuf_vmalloc_to_sg(unsigned char *virt,
> pg = vmalloc_to_page(virt);
> if (NULL == pg)
> goto err;
> - BUG_ON(PageHighMem(pg));
> + BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
> sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
> }
> return sglist;
>

the vzalloc in this function needs to be switched to vmalloc32 if it
actually wants to guarantee 32-bit memory.

Thanks,
Laura

2018-02-08 18:19:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Thu, Feb 08, 2018 at 09:56:42AM -0800, Laura Abbott wrote:
> > +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> > @@ -77,7 +77,7 @@ static struct scatterlist *videobuf_vmalloc_to_sg(unsigned char *virt,
> > pg = vmalloc_to_page(virt);
> > if (NULL == pg)
> > goto err;
> > - BUG_ON(PageHighMem(pg));
> > + BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
> > sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
> > }
> > return sglist;
> >
>
> the vzalloc in this function needs to be switched to vmalloc32 if it
> actually wants to guarantee 32-bit memory.

Whoops, you got confused between the sglist allocation and the allocation
of the pages which will be mapped ...

2018-02-08 18:37:07

by Laura Abbott

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On 02/08/2018 10:18 AM, Matthew Wilcox wrote:
> On Thu, Feb 08, 2018 at 09:56:42AM -0800, Laura Abbott wrote:
>>> +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
>>> @@ -77,7 +77,7 @@ static struct scatterlist *videobuf_vmalloc_to_sg(unsigned char *virt,
>>> pg = vmalloc_to_page(virt);
>>> if (NULL == pg)
>>> goto err;
>>> - BUG_ON(PageHighMem(pg));
>>> + BUG_ON(page_to_pfn(pg) >= (1 << (32 - PAGE_SHIFT)));
>>> sg_set_page(&sglist[i], pg, PAGE_SIZE, 0);
>>> }
>>> return sglist;
>>>
>>
>> the vzalloc in this function needs to be switched to vmalloc32 if it
>> actually wants to guarantee 32-bit memory.
>
> Whoops, you got confused between the sglist allocation and the allocation
> of the pages which will be mapped ...
>

Ah yeah, clearly need more coffee this morning.

2018-02-08 23:21:07

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Thu, Feb 08, 2018 at 05:06:49AM -0800, Matthew Wilcox wrote:
> On Thu, Feb 08, 2018 at 02:29:57PM +0800, Kai Heng Feng wrote:
> > A user with i386 instead of AMD64 machine reports [1] that commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”) causes a regression.
> > BUG_ON(PageHighMem(pg)) in drivers/media/common/saa7146/saa7146_core.c always gets triggered after that commit.
>
> Well, the BUG_ON is wrong. You can absolutely have pages which are both
> HighMem and under the 4GB boundary. Only the first 896MB (iirc) are LowMem,
> and the next 3GB of pages are available to vmalloc_32().

... nevertheless, 19809c2da28a does in fact break vmalloc_32 on 32-bit. Look:

#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
#else
#define GFP_VMALLOC32 GFP_KERNEL
#endif

So we pass in GFP_KERNEL to __vmalloc_node, which calls __vmalloc_node_range
which calls __vmalloc_area_node, which ORs in __GFP_HIGHMEM.

So ... we could enable ZONE_DMA32 on 32-bit architectures. I don't know
what side-effects that might have; it's clearly only been tested on 64-bit
architectures so far.

It might be best to just revert 19809c2da28a and the follow-on 704b862f9efd.

2018-02-09 04:09:14

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Thu, Feb 08, 2018 at 03:20:04PM -0800, Matthew Wilcox wrote:
> So ... we could enable ZONE_DMA32 on 32-bit architectures. I don't know
> what side-effects that might have; it's clearly only been tested on 64-bit
> architectures so far.
>
> It might be best to just revert 19809c2da28a and the follow-on 704b862f9efd.

Alternatively, try this. It passes in GFP_DMA32 from vmalloc_32,
regardless of whether ZONE_DMA32 exists or not. If ZONE_DMA32 doesn't
exist, then we clear it in __vmalloc_area_node(), after using it to
determine that we shouldn't set __GFP_HIGHMEM.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 673942094328..91e8a95123c4 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1669,10 +1669,11 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
struct page **pages;
unsigned int nr_pages, array_size, i;
const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
- const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
- const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
- 0 :
- __GFP_HIGHMEM;
+ gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
+ if (!(alloc_mask & GFP_ZONEMASK))
+ alloc_mask |= __GFP_HIGHMEM;
+ if (!IS_ENABLED(CONFIG_ZONE_DMA32) && (alloc_mask & __GFP_DMA32))
+ alloc_mask &= ~__GFP_DMA32;

nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
array_size = (nr_pages * sizeof(struct page *));
@@ -1680,7 +1681,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
area->nr_pages = nr_pages;
/* Please note that the recursion is strictly bounded. */
if (array_size > PAGE_SIZE) {
- pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
+ pages = __vmalloc_node(array_size, 1, nested_gfp|__GFP_HIGHMEM,
PAGE_KERNEL, node, area->caller);
} else {
pages = kmalloc_node(array_size, nested_gfp, node);
@@ -1696,9 +1697,9 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
struct page *page;

if (node == NUMA_NO_NODE)
- page = alloc_page(alloc_mask|highmem_mask);
+ page = alloc_page(alloc_mask);
else
- page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
+ page = alloc_pages_node(node, alloc_mask, 0);

if (unlikely(!page)) {
/* Successfully allocated i pages, free them in __vunmap() */
@@ -1706,7 +1707,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
goto fail;
}
area->pages[i] = page;
- if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
+ if (gfpflags_allow_blocking(gfp_mask))
cond_resched();
}

@@ -1942,12 +1943,10 @@ void *vmalloc_exec(unsigned long size)
NUMA_NO_NODE, __builtin_return_address(0));
}

-#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
-#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
-#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
+#if defined(CONFIG_64BIT) && !defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
#else
-#define GFP_VMALLOC32 GFP_KERNEL
+#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
#endif

/**

2018-02-09 09:14:06

by Kai-Heng Feng

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

Hi Matthew,

> On Feb 9, 2018, at 12:08 PM, Matthew Wilcox <[email protected]> wrote:
> Alternatively, try this. It passes in GFP_DMA32 from vmalloc_32,
> regardless of whether ZONE_DMA32 exists or not. If ZONE_DMA32 doesn't
> exist, then we clear it in __vmalloc_area_node(), after using it to
> determine that we shouldn't set __GFP_HIGHMEM.

IIUC, I need to let drivers/media drivers start using vmalloc_32() with
your patch, right?

Kai-Heng

>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 673942094328..91e8a95123c4 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1669,10 +1669,11 @@ static void *__vmalloc_area_node(struct vm_struct
> *area, gfp_t gfp_mask,
> struct page **pages;
> unsigned int nr_pages, array_size, i;
> const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
> - const gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
> - const gfp_t highmem_mask = (gfp_mask & (GFP_DMA | GFP_DMA32)) ?
> - 0 :
> - __GFP_HIGHMEM;
> + gfp_t alloc_mask = gfp_mask | __GFP_NOWARN;
> + if (!(alloc_mask & GFP_ZONEMASK))
> + alloc_mask |= __GFP_HIGHMEM;
> + if (!IS_ENABLED(CONFIG_ZONE_DMA32) && (alloc_mask & __GFP_DMA32))
> + alloc_mask &= ~__GFP_DMA32;
>
> nr_pages = get_vm_area_size(area) >> PAGE_SHIFT;
> array_size = (nr_pages * sizeof(struct page *));
> @@ -1680,7 +1681,7 @@ static void *__vmalloc_area_node(struct vm_struct
> *area, gfp_t gfp_mask,
> area->nr_pages = nr_pages;
> /* Please note that the recursion is strictly bounded. */
> if (array_size > PAGE_SIZE) {
> - pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
> + pages = __vmalloc_node(array_size, 1, nested_gfp|__GFP_HIGHMEM,
> PAGE_KERNEL, node, area->caller);
> } else {
> pages = kmalloc_node(array_size, nested_gfp, node);
> @@ -1696,9 +1697,9 @@ static void *__vmalloc_area_node(struct vm_struct
> *area, gfp_t gfp_mask,
> struct page *page;
>
> if (node == NUMA_NO_NODE)
> - page = alloc_page(alloc_mask|highmem_mask);
> + page = alloc_page(alloc_mask);
> else
> - page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
> + page = alloc_pages_node(node, alloc_mask, 0);
>
> if (unlikely(!page)) {
> /* Successfully allocated i pages, free them in __vunmap() */
> @@ -1706,7 +1707,7 @@ static void *__vmalloc_area_node(struct vm_struct
> *area, gfp_t gfp_mask,
> goto fail;
> }
> area->pages[i] = page;
> - if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
> + if (gfpflags_allow_blocking(gfp_mask))
> cond_resched();
> }
>
> @@ -1942,12 +1943,10 @@ void *vmalloc_exec(unsigned long size)
> NUMA_NO_NODE, __builtin_return_address(0));
> }
>
> -#if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
> -#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
> -#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> +#if defined(CONFIG_64BIT) && !defined(CONFIG_ZONE_DMA32)
> #define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
> #else
> -#define GFP_VMALLOC32 GFP_KERNEL
> +#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
> #endif
>
> /**

2018-02-09 14:09:12

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Fri, Feb 09, 2018 at 05:12:56PM +0800, Kai Heng Feng wrote:
> Hi Matthew,
>
> > On Feb 9, 2018, at 12:08 PM, Matthew Wilcox <[email protected]> wrote:
> > Alternatively, try this. It passes in GFP_DMA32 from vmalloc_32,
> > regardless of whether ZONE_DMA32 exists or not. If ZONE_DMA32 doesn't
> > exist, then we clear it in __vmalloc_area_node(), after using it to
> > determine that we shouldn't set __GFP_HIGHMEM.
>
> IIUC, I need to let drivers/media drivers start using vmalloc_32() with your
> patch, right?

Hopefully those that need to already are. Otherwise they're broken
on 64-bit. I do see several places already using vmalloc_32().


2018-02-11 09:27:46

by Michal Hocko

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Thu 08-02-18 15:20:04, Matthew Wilcox wrote:
> On Thu, Feb 08, 2018 at 05:06:49AM -0800, Matthew Wilcox wrote:
> > On Thu, Feb 08, 2018 at 02:29:57PM +0800, Kai Heng Feng wrote:
> > > A user with i386 instead of AMD64 machine reports [1] that commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”) causes a regression.
> > > BUG_ON(PageHighMem(pg)) in drivers/media/common/saa7146/saa7146_core.c always gets triggered after that commit.
> >
> > Well, the BUG_ON is wrong. You can absolutely have pages which are both
> > HighMem and under the 4GB boundary. Only the first 896MB (iirc) are LowMem,
> > and the next 3GB of pages are available to vmalloc_32().
>
> ... nevertheless, 19809c2da28a does in fact break vmalloc_32 on 32-bit. Look:
>
> #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
> #define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
> #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> #define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
> #else
> #define GFP_VMALLOC32 GFP_KERNEL
> #endif
>
> So we pass in GFP_KERNEL to __vmalloc_node, which calls __vmalloc_node_range
> which calls __vmalloc_area_node, which ORs in __GFP_HIGHMEM.

Dohh. I have missed this. I was convinced that we always add GFP_DMA32
when doing vmalloc_32. Sorry about that. The above definition looks
quite weird to be honest. First of all do we have any 64b system without
both DMA and DMA32 zones? If yes, what is the actual semantic of
vmalloc_32? Or is there any magic forcing GFP_KERNEL into low 32b?

Also I would expect that __GFP_DMA32 should do the right thing on 32b
systems. So something like the below should do the trick
---
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 673942094328..2eab5d1ef548 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1947,7 +1947,8 @@ void *vmalloc_exec(unsigned long size)
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
#else
-#define GFP_VMALLOC32 GFP_KERNEL
+/* This should be only 32b systems */
+#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
#endif

/**

--
Michal Hocko
SUSE Labs

2018-02-11 11:31:22

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Sun, Feb 11, 2018 at 10:26:52AM +0100, Michal Hocko wrote:
> On Thu 08-02-18 15:20:04, Matthew Wilcox wrote:
> > ... nevertheless, 19809c2da28a does in fact break vmalloc_32 on 32-bit. Look:
> >
> > #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
> > #define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
> > #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> > #define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
> > #else
> > #define GFP_VMALLOC32 GFP_KERNEL
> > #endif
> >
> > So we pass in GFP_KERNEL to __vmalloc_node, which calls __vmalloc_node_range
> > which calls __vmalloc_area_node, which ORs in __GFP_HIGHMEM.
>
> Dohh. I have missed this. I was convinced that we always add GFP_DMA32
> when doing vmalloc_32. Sorry about that. The above definition looks
> quite weird to be honest. First of all do we have any 64b system without
> both DMA and DMA32 zones? If yes, what is the actual semantic of
> vmalloc_32? Or is there any magic forcing GFP_KERNEL into low 32b?

mmzone.h has the following, which may be inaccurate / out of date:

* parisc, ia64, sparc <4G
* s390 <2G
* arm Various
* alpha Unlimited or 0-16MB.
*
* i386, x86_64 and multiple other arches
* <16M.

It claims ZONE_DMA32 is x86-64 only, which is incorrect; it's now used
by arm64, ia64, mips, powerpc, tile.

> Also I would expect that __GFP_DMA32 should do the right thing on 32b
> systems. So something like the below should do the trick

Oh, I see. Because we have:

#ifdef CONFIG_ZONE_DMA32
#define OPT_ZONE_DMA32 ZONE_DMA32
#else
#define OPT_ZONE_DMA32 ZONE_NORMAL
#endif

we'll end up allocating from ZONE_NORMAL if a non-DMA32 architecture asks
for GFP_DMA32 memory. Thanks; I missed that.

I'd recommend this instead then:

#if defined(CONFIG_64BIT) && !defined(CONFIG_ZONE_DMA32)
#define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
#else
#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
#endif

I think it's clearer than the three-way #if.

Now, longer-term, perhaps we should do the following:

#ifdef CONFIG_ZONE_DMA32
#define OPT_ZONE_DMA32 ZONE_DMA32
#elif defined(CONFIG_64BIT)
#define OPT_ZONE_DMA OPT_ZONE_DMA
#else
#define OPT_ZONE_DMA32 ZONE_NORMAL
#endif

Then we wouldn't need the ifdef here and could always use GFP_DMA32
| GFP_KERNEL. Would need to audit current users and make sure they
wouldn't be broken by such a change.

I noticed a mistake in 704b862f9efd;

- pages = __vmalloc_node(array_size, 1, nested_gfp|__GFP_HIGHMEM,
+ pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,

We should unconditionally use __GFP_HIGHMEM here instead of highmem_mask
because this is where we allocate the array to hold the struct page
pointers. This can be allocated from highmem, and does not need to be
allocated from ZONE_NORMAL.

Similarly,

- if (gfpflags_allow_blocking(gfp_mask))
+ if (gfpflags_allow_blocking(gfp_mask|highmem_mask))

is not needed (it's not *wrong*, it was just an unnecessary change).

> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 673942094328..2eab5d1ef548 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1947,7 +1947,8 @@ void *vmalloc_exec(unsigned long size)
> #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> #define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
> #else
> -#define GFP_VMALLOC32 GFP_KERNEL
> +/* This should be only 32b systems */
> +#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
> #endif
>
> /**
>
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2018-02-11 12:06:16

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Sun, Feb 11, 2018 at 03:28:08AM -0800, Matthew Wilcox wrote:
> Now, longer-term, perhaps we should do the following:
>
> #ifdef CONFIG_ZONE_DMA32
> #define OPT_ZONE_DMA32 ZONE_DMA32
> #elif defined(CONFIG_64BIT)
> #define OPT_ZONE_DMA OPT_ZONE_DMA
> #else
> #define OPT_ZONE_DMA32 ZONE_NORMAL
> #endif
>
> Then we wouldn't need the ifdef here and could always use GFP_DMA32
> | GFP_KERNEL. Would need to audit current users and make sure they
> wouldn't be broken by such a change.

Argh, I forgot to say the most important thing. (For those newly invited
to the party, we're talking about drivers/media, in particular
drivers/media/common/saa7146/saa7146_core.c, functions
saa7146_vmalloc_build_pgtable and vmalloc_to_sg)

I think we're missing a function in our DMA API. These drivers don't
actually need physical memory below the 4GB mark. They need DMA addresses
which are below the 4GB mark. For machines with IOMMUs, this can mean
no restrictions on physical memory. If we don't have an IOMMU, then a
bounce buffer could be used (but would be slow) -- like the swiotlb.
So we should endeavour to allocate memory below the 4GB boundary on
systems with no IOMMU, but can allocate memory anywhere on systems with
an IOMMU.

For consistent / coherent memory, we have an allocation function.
But we don't have an allocation function for streaming memory, which is
what these drivers want. They also flush the DMA memory and then access
the memory through a different virtual mapping, which I'm not sure is
going to work well on virtually-indexed caches like SPARC and PA-RISC
(maybe not MIPS either?)

I think we want something like

struct scatterlist *dma_alloc_sg(struct device *dev, int *nents);
void dma_free_sg(struct device *dev, struct scatterlist *sg, int nents);

That lets individual architectures decide where to allocate, and handle
the tradeoff between allocating below 4GB and using bounce buffers.

I don't have a good answer to synchronising between device-view of
memory and CPU-view-through-vmalloc though. They're already calling
dma_sync_*_for_cpu(); do they need to also call a new vflush(void *p,
unsigned long len) function which can be a no-op on x86 and flushes the
range on SPARC/PA-RISC/... ?

2018-02-12 00:11:55

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Sun, Feb 11, 2018 at 04:05:15AM -0800, Matthew Wilcox wrote:
> On Sun, Feb 11, 2018 at 03:28:08AM -0800, Matthew Wilcox wrote:
> > Now, longer-term, perhaps we should do the following:
> >
> > #ifdef CONFIG_ZONE_DMA32
> > #define OPT_ZONE_DMA32 ZONE_DMA32
> > #elif defined(CONFIG_64BIT)
> > #define OPT_ZONE_DMA OPT_ZONE_DMA
> > #else
> > #define OPT_ZONE_DMA32 ZONE_NORMAL
> > #endif
>
> For consistent / coherent memory, we have an allocation function.
> But we don't have an allocation function for streaming memory, which is
> what these drivers want. They also flush the DMA memory and then access
> the memory through a different virtual mapping, which I'm not sure is
> going to work well on virtually-indexed caches like SPARC and PA-RISC
> (maybe not MIPS either?)

Perhaps I (and a number of other people ...) have misunderstood the
semantics of GFP_DMA32. Perhaps GFP_DMA32 is not "allocate memory below
4GB", perhaps it's "allocate memory which can be mapped below 4GB".
Machines with an IOMMU can use ZONE_NORMAL. Machines with no IOMMU can
choose to allocate memory with a physical address below 4GB.

After all, it has 'DMA' right there in the name. If someone's relying
on it to allocate physical memory below 4GB, they're arguably misusing it.

2018-02-12 12:03:30

by Michal Hocko

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

[I am crawling over a large backlog after vacation so I will get to
other emails in this thread later. Let's just fix the regression
first. The patch with the full changelog is at the end of this email.
CC Andrew - the original report is http://lkml.kernel.org/r/[email protected]]

On Sun 11-02-18 03:28:08, Matthew Wilcox wrote:
> On Sun, Feb 11, 2018 at 10:26:52AM +0100, Michal Hocko wrote:
> > On Thu 08-02-18 15:20:04, Matthew Wilcox wrote:
> > > ... nevertheless, 19809c2da28a does in fact break vmalloc_32 on 32-bit. Look:
> > >
> > > #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
> > > #define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
> > > #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> > > #define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
> > > #else
> > > #define GFP_VMALLOC32 GFP_KERNEL
> > > #endif
> > >
> > > So we pass in GFP_KERNEL to __vmalloc_node, which calls __vmalloc_node_range
> > > which calls __vmalloc_area_node, which ORs in __GFP_HIGHMEM.
> >
> > Dohh. I have missed this. I was convinced that we always add GFP_DMA32
> > when doing vmalloc_32. Sorry about that. The above definition looks
> > quite weird to be honest. First of all do we have any 64b system without
> > both DMA and DMA32 zones? If yes, what is the actual semantic of
> > vmalloc_32? Or is there any magic forcing GFP_KERNEL into low 32b?
>
> mmzone.h has the following, which may be inaccurate / out of date:
>
> * parisc, ia64, sparc <4G
> * s390 <2G
> * arm Various
> * alpha Unlimited or 0-16MB.
> *
> * i386, x86_64 and multiple other arches
> * <16M.
>
> It claims ZONE_DMA32 is x86-64 only, which is incorrect; it's now used
> by arm64, ia64, mips, powerpc, tile.

yes, nobody seem to keep this one in sync.

> > Also I would expect that __GFP_DMA32 should do the right thing on 32b
> > systems. So something like the below should do the trick
>
> Oh, I see. Because we have:
>
> #ifdef CONFIG_ZONE_DMA32
> #define OPT_ZONE_DMA32 ZONE_DMA32
> #else
> #define OPT_ZONE_DMA32 ZONE_NORMAL
> #endif
>
> we'll end up allocating from ZONE_NORMAL if a non-DMA32 architecture asks
> for GFP_DMA32 memory. Thanks; I missed that.

yep

> I'd recommend this instead then:
>
> #if defined(CONFIG_64BIT) && !defined(CONFIG_ZONE_DMA32)
> #define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
> #else
> #define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
> #endif
>
> I think it's clearer than the three-way #if.

I do not have a strong opinion here. I just wanted the change to be
obvious without meddling with the 64b ifdefs much. Follow up cleanups
are certainly possible.

> Now, longer-term, perhaps we should do the following:
>
> #ifdef CONFIG_ZONE_DMA32
> #define OPT_ZONE_DMA32 ZONE_DMA32
> #elif defined(CONFIG_64BIT)
> #define OPT_ZONE_DMA OPT_ZONE_DMA
> #else
> #define OPT_ZONE_DMA32 ZONE_NORMAL
> #endif
>
> Then we wouldn't need the ifdef here and could always use GFP_DMA32
> | GFP_KERNEL. Would need to audit current users and make sure they
> wouldn't be broken by such a change.

I am pretty sure improvements are possible.

> I noticed a mistake in 704b862f9efd;
>
> - pages = __vmalloc_node(array_size, 1, nested_gfp|__GFP_HIGHMEM,
> + pages = __vmalloc_node(array_size, 1, nested_gfp|highmem_mask,
>
> We should unconditionally use __GFP_HIGHMEM here instead of highmem_mask
> because this is where we allocate the array to hold the struct page
> pointers. This can be allocated from highmem, and does not need to be
> allocated from ZONE_NORMAL.

You seem to be right. nested_gfp doesn't include zone modifiers. Care to
send a patch?

> Similarly,
>
> - if (gfpflags_allow_blocking(gfp_mask))
> + if (gfpflags_allow_blocking(gfp_mask|highmem_mask))
>
> is not needed (it's not *wrong*, it was just an unnecessary change).

yes. highmem_mask has no influence on the blocking behavior.

The fix for the regressions should be

From 301c0acbce9dd80a854bba49c9db40991df0f9e4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 12 Feb 2018 10:37:19 +0100
Subject: [PATCH] vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b
systems
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Kai Heng Feng has noticed that BUG_ON(PageHighMem(pg)) triggers in
drivers/media/common/saa7146/saa7146_core.c since 19809c2da28a ("mm,
vmalloc: use __GFP_HIGHMEM implicitly”). saa7146_vmalloc_build_pgtable
uses vmalloc_32 and it is reasonable to expect that the resulting page
is not in highmem. The above commit aimed to add __GFP_HIGHMEM only
for those requests which do not specify any zone modifier gfp flag.
vmalloc_32 relies on GFP_VMALLOC32 which should do the right thing.
Except it has been missed that GFP_VMALLOC32 is an alias for GFP_KERNEL
on 32b architectures. Thanks to Matthew to notice this.

Fix the problem by unconditionally setting GFP_DMA32 in GFP_VMALLOC32
for !64b arches (as a bailout). This should do the right thing and use
ZONE_NORMAL which should be always below 4G on 32b systems.

Debugged-by: Matthew Wilcox <[email protected]>
Reported-by: Kai Heng Feng <[email protected]>
Fixes: 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly”)
Cc: stable
Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmalloc.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 673942094328..1d147078c469 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1947,7 +1947,11 @@ void *vmalloc_exec(unsigned long size)
#elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
#define GFP_VMALLOC32 GFP_DMA | GFP_KERNEL
#else
-#define GFP_VMALLOC32 GFP_KERNEL
+/*
+ * 64b systems should always have either DMA or DMA32 zones. For others
+ * GFP_DMA32 should do the right thing and use the normal zone.
+ */
+#define GFP_VMALLOC32 GFP_DMA32 | GFP_KERNEL
#endif

/**
--
2.15.1

--
Michal Hocko
SUSE Labs

2018-02-14 14:06:10

by Michal Hocko

[permalink] [raw]
Subject: Re: Regression after commit 19809c2da28a ("mm, vmalloc: use __GFP_HIGHMEM implicitly")

On Sun 11-02-18 15:51:07, Matthew Wilcox wrote:
> On Sun, Feb 11, 2018 at 04:05:15AM -0800, Matthew Wilcox wrote:
> > On Sun, Feb 11, 2018 at 03:28:08AM -0800, Matthew Wilcox wrote:
> > > Now, longer-term, perhaps we should do the following:
> > >
> > > #ifdef CONFIG_ZONE_DMA32
> > > #define OPT_ZONE_DMA32 ZONE_DMA32
> > > #elif defined(CONFIG_64BIT)
> > > #define OPT_ZONE_DMA OPT_ZONE_DMA
> > > #else
> > > #define OPT_ZONE_DMA32 ZONE_NORMAL
> > > #endif
> >
> > For consistent / coherent memory, we have an allocation function.
> > But we don't have an allocation function for streaming memory, which is
> > what these drivers want. They also flush the DMA memory and then access
> > the memory through a different virtual mapping, which I'm not sure is
> > going to work well on virtually-indexed caches like SPARC and PA-RISC
> > (maybe not MIPS either?)
>
> Perhaps I (and a number of other people ...) have misunderstood the
> semantics of GFP_DMA32. Perhaps GFP_DMA32 is not "allocate memory below
> 4GB", perhaps it's "allocate memory which can be mapped below 4GB".

Well, GFP_DMA32 is clearly under-documented. But I _believe_ the
intention was to really return a physical memory within 32b address
range.

> Machines with an IOMMU can use ZONE_NORMAL. Machines with no IOMMU can
> choose to allocate memory with a physical address below 4GB.

This would be something for the higher level allocator I think. The page
allocator is largely unaware of IOMMU or any remapping and that is good
IMHO.

> After all, it has 'DMA' right there in the name.

The name is misnomer following GFP_DMA which is arguably a better fit.
GFP_MEM32 would be a better name.

Btw. I believe the GFP_VMALLOC32 shows that our GFP_DM32 needs some
love. The user shouldn't really care about lowmem zones layout.
GFP_DMA32 should simply use the appropriate zone regardless the arch
specific details.

--
Michal Hocko
SUSE Labs