2002-04-26 18:27:19

by Russell King

Subject: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

Hi,

I've been looking at some of the ARM discontigmem implementations, and
have come across a nasty bug. To illustrate this, I'm going to take
part of the generic kernel, and use the Alpha implementation to
illustrate the problem we're facing on ARM.

I'm going to argue here that virt_to_page() can, in the discontigmem
case, produce rather a nasty bug when used with non-direct mapped
kernel memory arguments.

In mm/memory.c:remap_pte_range() we have the following code:

page = virt_to_page(__va(phys_addr));
if ((!VALID_PAGE(page)) || PageReserved(page))
set_pte(pte, mk_pte_phys(phys_addr, prot));

Let's look closely at the first line:

page = virt_to_page(__va(phys_addr));

Essentially, what we're trying to do here is convert a physical address
to a struct page pointer.

__va() is defined, on Alpha, to be:

#define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET))

so we produce a unique "va" for any physical address that is passed. No
problem so far. Now, let's look at virt_to_page() for the Alpha:

#define virt_to_page(kaddr) \
(ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))

Looks innocuous enough. However, look closer at ADDR_TO_MAPBASE:

#define ADDR_TO_MAPBASE(kaddr) \
NODE_MEM_MAP(KVADDR_TO_NID((unsigned long)(kaddr)))
#define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map)
#define NODE_DATA(n) (&((PLAT_NODE_DATA(n))->gendata))
#define PLAT_NODE_DATA(n) (plat_node_data[(n)])

Ok, so here we get the map base via:

plat_node_data[KVADDR_TO_NID((unsigned long)(kaddr))]->
gendata.node_mem_map

plat_node_data is declared as:

plat_pg_data_t *plat_node_data[MAX_NUMNODES];

Let's look more closely at KVADDR_TO_NID() and MAX_NUMNODES:

#define KVADDR_TO_NID(kaddr) PHYSADDR_TO_NID(__pa(kaddr))
#define __pa(x) ((unsigned long) (x) - PAGE_OFFSET)
#define PHYSADDR_TO_NID(pa) ALPHA_PA_TO_NID(pa)
#define ALPHA_PA_TO_NID(pa) ((pa) >> 36) /* 16 nodes max due 43bit kseg */

#define MAX_NUMNODES WILDFIRE_MAX_QBB
#define WILDFIRE_MAX_QBB 8 /* more than 8 requires other mods */

So, we have a maximum of 8 nodes total, and therefore the plat_node_data
array is 8 entries large.

Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has
opened /dev/mem and mapped some random bit of physical memory space)?

__pa returns a large positive number. We shift this large positive
number right by 36 bits, leaving 28 bits of large positive number, which
is larger than our total of 8 nodes.

We use this 28-bit number to index plat_node_data. Whoops.
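
To see the arithmetic go wrong in isolation, here is a minimal userspace
sketch using the constants quoted above (the kaddr value is illustrative):

#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET	0xfffffc0000000000UL	/* Alpha kseg base */
#define MAX_NUMNODES	8

int main(void)
{
	uint64_t kaddr = 0x0000000120000000UL;	/* below PAGE_OFFSET */
	uint64_t pa = kaddr - PAGE_OFFSET;	/* __pa(): wraps to a huge value */
	uint64_t nid = pa >> 36;		/* ALPHA_PA_TO_NID() */

	/* prints nid = 64, way past the 8-entry plat_node_data array */
	printf("pa = %#llx, nid = %llu\n",
	       (unsigned long long)pa, (unsigned long long)nid);
	return 0;
}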

And now, for the icing on the cake, take a look at Alpha's pte_page()
implementation:

unsigned long kvirt; \
struct page * __xx; \
\
kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT)); \
__xx = virt_to_page(kvirt); \
\
__xx; \


Someone *please* tell me where I'm wrong. I really want to be wrong,
because I can see the same thing happening (in theory, and one report
in practice from a developer) on a certain ARM platform.

On ARM, however, we have a cherry to add on top. __va() may alias certain
physical memory addresses to the same virtual memory address, which
makes:

VALID_PAGE(virt_to_page(__va(phys)))

completely nonsensical.
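
To illustrate the aliasing, here is a made-up, mask-based __va(); the
real machine-specific translation differs, but the effect is the same:

#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET 0xc0000000u
/* hypothetical: fold banked physical memory into one virtual window */
#define __va(x) (((x) & 0x07ffffffu) + PAGE_OFFSET)

int main(void)
{
	uint32_t phys_a = 0x10000000u;	/* bank 0 */
	uint32_t phys_b = 0x18000000u;	/* bank 1 */

	/* both lines print 0xc0000000: two distinct physical pages alias
	 * to one kernel virtual address, so virt_to_page() on the result
	 * cannot tell them apart */
	printf("__va(%#x) = %#x\n", phys_a, __va(phys_a));
	printf("__va(%#x) = %#x\n", phys_b, __va(phys_b));
	return 0;
}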

I'll try kicking myself 3 times to see if I wake up from this horrible
dream now. 8)

--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html


2002-04-26 22:46:27

by Andrea Arcangeli

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

On Fri, Apr 26, 2002 at 07:27:11PM +0100, Russell King wrote:
> Hi,
>
> I've been looking at some of the ARM discontigmem implementations, and
> have come across a nasty bug. To illustrate this, I'm going to take
> part of the generic kernel, and use the Alpha implementation to
> illustrate the problem we're facing on ARM.
>
> I'm going to argue here that virt_to_page() can, in the discontigmem
> case, produce rather a nasty bug when used with non-direct mapped
> kernel memory arguments.
>
> In mm/memory.c:remap_pte_range() we have the following code:
>
> page = virt_to_page(__va(phys_addr));
> if ((!VALID_PAGE(page)) || PageReserved(page))
> set_pte(pte, mk_pte_phys(phys_addr, prot));
>
> Let's look closely at the first line:
>
> page = virt_to_page(__va(phys_addr));
>
> Essentially, what we're trying to do here is convert a physical address
> to a struct page pointer.
>
> __va() is defined, on Alpha, to be:
>
> #define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET))
>
> so we produce a unique "va" for any physical address that is passed. No
> problem so far. Now, let's look at virt_to_page() for the Alpha:
>
> #define virt_to_page(kaddr) \
> (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
>
> Looks innocuous enough. However, look closer at ADDR_TO_MAPBASE:
>
> #define ADDR_TO_MAPBASE(kaddr) \
> NODE_MEM_MAP(KVADDR_TO_NID((unsigned long)(kaddr)))
> #define NODE_MEM_MAP(nid) (NODE_DATA(nid)->node_mem_map)
> #define NODE_DATA(n) (&((PLAT_NODE_DATA(n))->gendata))
> #define PLAT_NODE_DATA(n) (plat_node_data[(n)])
>
> Ok, so here we get the map base via:
>
> plat_node_data[KVADDR_TO_NID((unsigned long)(kaddr))]->
> gendata.node_mem_map
>
> plat_node_data is declared as:
>
> plat_pg_data_t *plat_node_data[MAX_NUMNODES];
>
> Let's look more closely at KVADDR_TO_NID() and MAX_NUMNODES:
>
> #define KVADDR_TO_NID(kaddr) PHYSADDR_TO_NID(__pa(kaddr))
> #define __pa(x) ((unsigned long) (x) - PAGE_OFFSET)
> #define PHYSADDR_TO_NID(pa) ALPHA_PA_TO_NID(pa)
> #define ALPHA_PA_TO_NID(pa) ((pa) >> 36) /* 16 nodes max due 43bit kseg */
>
> #define MAX_NUMNODES WILDFIRE_MAX_QBB
> #define WILDFIRE_MAX_QBB 8 /* more than 8 requires other mods */
>
> So, we have a maximum of 8 nodes total, and therefore the plat_node_data
> array is 8 entries large.
>
> Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has
> opened /dev/mem and mapped some random bit of physical memory space)?
>
> __pa returns a large positive number. We shift this large positive
> number right by 36 bits, leaving 28 bits of large positive number, which
> is larger than our total of 8 nodes.
>
> We use this 28-bit number to index plat_node_data. Whoops.
>
> And now, for the icing on the cake, take a look at Alpha's pte_page()
> implementation:
>
> unsigned long kvirt; \
> struct page * __xx; \
> \
> kvirt = (unsigned long)__va(pte_val(x) >> (32-PAGE_SHIFT)); \
> __xx = virt_to_page(kvirt); \
> \
> __xx; \
>
>
> Someone *please* tell me where I'm wrong. I really want to be wrong,
> because I can see the same thing happening (in theory, and one report
> in practice from a developer) on a certain ARM platform.
>
> On ARM, however, we have a cherry to add on top. __va() may alias certain
> physical memory addresses to the same virtual memory address, which
> makes:
>
> VALID_PAGE(virt_to_page(__va(phys)))
>
> completely nonsensical.

correct. This should fix it:

--- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~ Fri Apr 26 10:28:28 2002
+++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h Sat Apr 27 00:30:02 2002
@@ -106,8 +106,8 @@
#define kern_addr_valid(kaddr) test_bit(LOCAL_MAP_NR(kaddr), \
NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap)

-#define virt_to_page(kaddr) (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
-#define VALID_PAGE(page) (((page) - mem_map) < max_mapnr)
+#define virt_to_page(kaddr) (KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0)
+#define VALID_PAGE(page) ((page) != NULL)

#ifdef CONFIG_NUMA
#ifdef CONFIG_NUMA_SCHED

It still doesn't cover the ram between the end of a node and the start
of the next node, but at least on alpha-wildfire there can be nothing
mapped there (it's reserved for "more dimm ram" slots), and it would be
even more costly to check whether the address is in those inter-node
holes.

The invalid pages will now start at phys addr 64G*8, which is the
maximum ram that linux can handle on the wildfire. If you mmap the
inter-node ram via /dev/mem you risk trouble anyway, because there's no
dimm there and the effect is probably undefined or unpredictable; it's
simply a "mustn't do that". /dev/mem is a "root" thing, so the above
approach looks fine to me.
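
With the patch applied, the remap_pte_range() snippet quoted at the top
of the thread degrades gracefully for an out-of-range physical address;
a sketch of the intended behaviour:

/* kaddr beyond node 7: KVADDR_TO_NID() >= MAX_NUMNODES, so the patched
 * virt_to_page() returns 0 instead of indexing past plat_node_data[],
 * and the new VALID_PAGE() rejects it before PageReserved() can
 * dereference it (the || short-circuits). */
page = virt_to_page(__va(phys_addr));
if ((!VALID_PAGE(page)) || PageReserved(page))
	set_pte(pte, mk_pte_phys(phys_addr, prot));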

Andrea

2002-04-28 22:10:09

by Daniel Phillips

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

On Friday 26 April 2002 20:27, Russell King wrote:
> Hi,
>
> I've been looking at some of the ARM discontigmem implementations, and
> have come across a nasty bug. To illustrate this, I'm going to take
> part of the generic kernel, and use the Alpha implementation to
> illustrate the problem we're facing on ARM.
>
> I'm going to argue here that virt_to_page() can, in the discontigmem
> case, produce rather a nasty bug when used with non-direct mapped
> kernel memory arguments.

It's tough to follow, even when you know the code. While cooking my
config_nonlinear patch I noticed the line you're concerned about and
regarded it with deep suspicion. My patch does this:

- page = virt_to_page(__va(phys_addr));
+ page = phys_to_page(phys_addr);

And of course took care that phys_to_page does the right thing in all
cases.

<plug>
The new config_nonlinear was designed as a cleaner, more powerful
replacement for all non-numa uses of config_discontigmem.
</plug>

--
Daniel

2002-04-29 13:35:52

by Andrea Arcangeli

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

On Sun, Apr 28, 2002 at 12:10:20AM +0200, Daniel Phillips wrote:
> On Friday 26 April 2002 20:27, Russell King wrote:
> > Hi,
> >
> > I've been looking at some of the ARM discontigmem implementations, and
> > have come across a nasty bug. To illustrate this, I'm going to take
> > part of the generic kernel, and use the Alpha implementation to
> > illustrate the problem we're facing on ARM.
> >
> > I'm going to argue here that virt_to_page() can, in the discontigmem
> > case, produce rather a nasty bug when used with non-direct mapped
> > kernel memory arguments.
>
> It's tough to follow, even when you know the code. While cooking my
> config_nonlinear patch I noticed the line you're concerned about and
> regarded it with deep suspicion. My patch does this:
>
> - page = virt_to_page(__va(phys_addr));
> + page = phys_to_page(phys_addr);
>
> And of course took care that phys_to_page does the right thing in all
> cases.

The problem remains the same going from phys to page: the mem_map isn't
contiguous, and it choked when the phys addr was above the max ram
physaddr. The patch I posted a few days ago will fix it (modulo unused
ram space, but attempting to map unused ram space into the address space
is a bug in the first place).

>
> <plug>
> The new config_nonlinear was designed as a cleaner, more powerful
> replacement for all non-numa uses of config_discontigmem.
> </plug>

I may be wrong because I have only had a short look at it so far, but
the "non-numa" part is what I noticed too, and that's what renders it
not a very interesting option IMHO. Most discontigmem needs numa too. If
it cannot handle numa, it isn't worth adding the complexity there; with
numa we must view those chunks differently, not linearly. Also, there's
nothing magic that says mem_map must have a magical meaning; it isn't
worth preserving the mem_map thing. virt_to_page is a much cleaner
abstraction than doing mem_map + pfn by hand.

Andrea

2002-04-29 16:53:11

by Martin J. Bligh

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

>> page = virt_to_page(__va(phys_addr));
>>
>> ...
>>
>> __va() is defined, on Alpha, to be:
>>
>> # define __va(x) ((void *)((unsigned long) (x) + PAGE_OFFSET))
>>
>> ...
>>
>> Now, what happens if 'kaddr' is below PAGE_OFFSET (because the user has
>> opened /dev/mem and mapped some random bit of physical memory space)?

But we generated kaddr by using __va, as above? If the user mapped /dev/mem
and created a second possible answer for a P->V mapping, that seems
irrelevant, as long as __va always returns the "primary" mapping into kernel
virtual address space.

I'd agree we're lacking some error checking here (maybe virt_to_page should
be an inline that checks that kaddr really is a kernel virtual address), but I
can't see a real practical problem in the scenario you describe. Since other
people seem to be able to, maybe I'm missing something ;-)

I'm not sure whether your arch is 32-bit or 64-bit, but I see more of a
problem in this code if we do "page = virt_to_page(__va(phys_addr));" on a
physaddr that's in HIGHMEM on a 32-bit arch, in which case we get garbage
from the wrapping, and Daniel's "page = phys_to_page(phys_addr);" makes
infinitely more sense.
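
The wrapping is easy to demonstrate with 32-bit arithmetic (i386-style
PAGE_OFFSET; the physical address is illustrative):

#include <stdio.h>
#include <stdint.h>

#define PAGE_OFFSET 0xc0000000u		/* 3GB direct-map base */

int main(void)
{
	uint32_t phys_addr = 0x48000000u;	/* ~1.1GB, highmem territory */
	uint32_t kaddr = phys_addr + PAGE_OFFSET;	/* __va() wraps mod 2^32 */

	/* prints 0x8000000 -- far below PAGE_OFFSET, i.e. garbage */
	printf("__va(%#x) = %#x\n", phys_addr, kaddr);
	return 0;
}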

Martin.

2002-04-29 22:01:10

by Roman Zippel

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

Hi,

On Sat, 27 Apr 2002, Andrea Arcangeli wrote:

> correct. This should fix it:
>
> --- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~ Fri Apr 26 10:28:28 2002
> +++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h Sat Apr 27 00:30:02 2002
> @@ -106,8 +106,8 @@
> #define kern_addr_valid(kaddr) test_bit(LOCAL_MAP_NR(kaddr), \
> NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap)
>
> -#define virt_to_page(kaddr) (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
> -#define VALID_PAGE(page) (((page) - mem_map) < max_mapnr)
> +#define virt_to_page(kaddr) (KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0)
> +#define VALID_PAGE(page) ((page) != NULL)
>
> #ifdef CONFIG_NUMA
> #ifdef CONFIG_NUMA_SCHED

I'd prefer it if VALID_PAGE went away completely; that test was almost
always too late. What about the patch below? It even reduces the code
size by 1072 bytes (but is otherwise untested).
It introduces virt_to_valid_page and pte_valid_page, which check whether
the input is valid.

bye, Roman

Index: arch/arm/mach-arc/small_page.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/arm/mach-arc/small_page.c,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 small_page.c
--- arch/arm/mach-arc/small_page.c 15 Jan 2002 18:12:17 -0000 1.1.1.1
+++ arch/arm/mach-arc/small_page.c 29 Apr 2002 20:38:49 -0000
@@ -150,8 +150,8 @@ static void __free_small_page(unsigned l
unsigned long flags;
struct page *page;

- page = virt_to_page(spage);
- if (VALID_PAGE(page)) {
+ page = virt_to_valid_page(spage);
+ if (page) {

/*
* The container-page must be marked Reserved
Index: arch/arm/mm/fault-armv.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/arm/mm/fault-armv.c,v
retrieving revision 1.1.1.5
diff -u -p -r1.1.1.5 fault-armv.c
--- arch/arm/mm/fault-armv.c 14 Apr 2002 20:06:12 -0000 1.1.1.5
+++ arch/arm/mm/fault-armv.c 29 Apr 2002 19:18:37 -0000
@@ -240,9 +240,9 @@ make_coherent(struct vm_area_struct *vma
*/
void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
{
- struct page *page = pte_page(pte);
+ struct page *page = pte_valid_page(pte);

- if (VALID_PAGE(page) && page->mapping) {
+ if (page && page->mapping) {
if (test_and_clear_bit(PG_dcache_dirty, &page->flags))
__flush_dcache_page(page);

Index: arch/ia64/mm/init.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/ia64/mm/init.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 init.c
--- arch/ia64/mm/init.c 24 Apr 2002 19:35:43 -0000 1.1.1.3
+++ arch/ia64/mm/init.c 29 Apr 2002 20:39:05 -0000
@@ -147,7 +147,7 @@ free_initrd_mem (unsigned long start, un
printk(KERN_INFO "Freeing initrd memory: %ldkB freed\n", (end - start) >> 10);

for (; start < end; start += PAGE_SIZE) {
- if (!VALID_PAGE(virt_to_page(start)))
+ if (!virt_to_valid_page(start))
continue;
clear_bit(PG_reserved, &virt_to_page(start)->flags);
set_page_count(virt_to_page(start), 1);
Index: arch/mips/mm/umap.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/mips/mm/umap.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 umap.c
--- arch/mips/mm/umap.c 31 Jan 2002 22:19:02 -0000 1.1.1.2
+++ arch/mips/mm/umap.c 29 Apr 2002 19:17:45 -0000
@@ -116,8 +116,8 @@ void *vmalloc_uncached (unsigned long si
static inline void free_pte(pte_t page)
{
if (pte_present(page)) {
- struct page *ptpage = pte_page(page);
- if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+ struct page *ptpage = pte_valid_page(page);
+ if (!ptpage || PageReserved(ptpage))
return;
__free_page(ptpage);
if (current->mm->rss <= 0)
Index: arch/mips64/mm/umap.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/mips64/mm/umap.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 umap.c
--- arch/mips64/mm/umap.c 31 Jan 2002 22:19:51 -0000 1.1.1.2
+++ arch/mips64/mm/umap.c 29 Apr 2002 19:17:29 -0000
@@ -115,8 +115,8 @@ void *vmalloc_uncached (unsigned long si
static inline void free_pte(pte_t page)
{
if (pte_present(page)) {
- struct page *ptpage = pte_page(page);
- if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+ struct page *ptpage = pte_valid_page(page);
+ if (!ptpage || PageReserved(ptpage))
return;
__free_page(ptpage);
if (current->mm->rss <= 0)
Index: arch/sh/mm/fault.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sh/mm/fault.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 fault.c
--- arch/sh/mm/fault.c 31 Jan 2002 22:19:42 -0000 1.1.1.2
+++ arch/sh/mm/fault.c 29 Apr 2002 19:17:11 -0000
@@ -298,8 +298,8 @@ void update_mmu_cache(struct vm_area_str
return;

#if defined(__SH4__)
- page = pte_page(pte);
- if (VALID_PAGE(page) && !test_bit(PG_mapped, &page->flags)) {
+ page = pte_valid_page(pte);
+ if (page && !test_bit(PG_mapped, &page->flags)) {
unsigned long phys = pte_val(pte) & PTE_PHYS_MASK;
__flush_wback_region((void *)P1SEGADDR(phys), PAGE_SIZE);
__set_bit(PG_mapped, &page->flags);
Index: arch/sparc/mm/generic.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc/mm/generic.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 generic.c
--- arch/sparc/mm/generic.c 31 Jan 2002 22:19:00 -0000 1.1.1.2
+++ arch/sparc/mm/generic.c 29 Apr 2002 19:16:58 -0000
@@ -19,8 +19,8 @@ static inline void forget_pte(pte_t page
if (pte_none(page))
return;
if (pte_present(page)) {
- struct page *ptpage = pte_page(page);
- if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+ struct page *ptpage = pte_valid_page(page);
+ if (!ptpage || PageReserved(ptpage))
return;
page_cache_release(ptpage);
return;
Index: arch/sparc/mm/sun4c.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc/mm/sun4c.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 sun4c.c
--- arch/sparc/mm/sun4c.c 14 Apr 2002 20:05:32 -0000 1.1.1.3
+++ arch/sparc/mm/sun4c.c 29 Apr 2002 20:39:35 -0000
@@ -1327,7 +1327,7 @@ static __u32 sun4c_get_scsi_one(char *bu
unsigned long page;

page = ((unsigned long)bufptr) & PAGE_MASK;
- if (!VALID_PAGE(virt_to_page(page))) {
+ if (!virt_to_valid_page(page)) {
sun4c_flush_page(page);
return (__u32)bufptr; /* already locked */
}
@@ -2106,7 +2106,7 @@ static void sun4c_pte_clear(pte_t *ptep)
static int sun4c_pmd_bad(pmd_t pmd)
{
return (((pmd_val(pmd) & ~PAGE_MASK) != PGD_TABLE) ||
- (!VALID_PAGE(virt_to_page(pmd_val(pmd)))));
+ (!virt_to_valid_page(pmd_val(pmd))));
}

static int sun4c_pmd_present(pmd_t pmd)
Index: arch/sparc64/kernel/traps.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/kernel/traps.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 traps.c
--- arch/sparc64/kernel/traps.c 11 Feb 2002 18:49:01 -0000 1.1.1.3
+++ arch/sparc64/kernel/traps.c 29 Apr 2002 20:39:53 -0000
@@ -1284,9 +1284,9 @@ void cheetah_deferred_handler(struct pt_
}

if (recoverable) {
- struct page *page = virt_to_page(__va(afar));
+ struct page *page = virt_to_valid_page(__va(afar));

- if (VALID_PAGE(page))
+ if (page)
get_page(page);
else
recoverable = 0;
Index: arch/sparc64/mm/generic.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/mm/generic.c,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 generic.c
--- arch/sparc64/mm/generic.c 19 Mar 2002 01:27:51 -0000 1.1.1.3
+++ arch/sparc64/mm/generic.c 29 Apr 2002 19:14:35 -0000
@@ -19,8 +19,8 @@ static inline void forget_pte(pte_t page
if (pte_none(page))
return;
if (pte_present(page)) {
- struct page *ptpage = pte_page(page);
- if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+ struct page *ptpage = pte_valid_page(page);
+ if (!ptpage || PageReserved(ptpage))
return;
page_cache_release(ptpage);
return;
Index: arch/sparc64/mm/init.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/arch/sparc64/mm/init.c,v
retrieving revision 1.1.1.6
diff -u -p -r1.1.1.6 init.c
--- arch/sparc64/mm/init.c 14 Apr 2002 20:06:08 -0000 1.1.1.6
+++ arch/sparc64/mm/init.c 29 Apr 2002 19:14:15 -0000
@@ -187,11 +187,10 @@ extern void __update_mmu_cache(unsigned

void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t pte)
{
- struct page *page = pte_page(pte);
+ struct page *page = pte_valid_page(pte);
unsigned long pg_flags;

- if (VALID_PAGE(page) &&
- page->mapping &&
+ if (page && page->mapping &&
((pg_flags = page->flags) & (1UL << PG_dcache_dirty))) {
int cpu = ((pg_flags >> 24) & (NR_CPUS - 1UL));

@@ -260,10 +259,10 @@ static inline void flush_cache_pte_range
continue;

if (pte_present(pte) && pte_dirty(pte)) {
- struct page *page = pte_page(pte);
+ struct page *page = pte_valid_page(pte);
unsigned long pgaddr, uaddr;

- if (!VALID_PAGE(page) || PageReserved(page) || !page->mapping)
+ if (!page || PageReserved(page) || !page->mapping)
continue;
pgaddr = (unsigned long) page_address(page);
uaddr = address + offset;
Index: fs/proc/array.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/fs/proc/array.c,v
retrieving revision 1.1.1.7
diff -u -p -r1.1.1.7 array.c
--- fs/proc/array.c 14 Apr 2002 20:01:10 -0000 1.1.1.7
+++ fs/proc/array.c 29 Apr 2002 19:12:38 -0000
@@ -424,8 +424,8 @@ static inline void statm_pte_range(pmd_t
++*total;
if (!pte_present(page))
continue;
- ptpage = pte_page(page);
- if ((!VALID_PAGE(ptpage)) || PageReserved(ptpage))
+ ptpage = pte_valid_page(page);
+ if (!ptpage || PageReserved(ptpage))
continue;
++*pages;
if (pte_dirty(page))
Index: include/asm-cris/processor.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-cris/processor.h,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 processor.h
--- include/asm-cris/processor.h 31 Jan 2002 22:16:02 -0000 1.1.1.2
+++ include/asm-cris/processor.h 29 Apr 2002 20:40:17 -0000
@@ -101,8 +101,7 @@ unsigned long get_wchan(struct task_stru
({ \
unsigned long eip = 0; \
unsigned long regs = (unsigned long)user_regs(tsk); \
- if (regs > PAGE_SIZE && \
- VALID_PAGE(virt_to_page(regs))) \
+ if (regs > PAGE_SIZE && virt_to_valid_page(regs)) \
eip = ((struct pt_regs *)regs)->irp; \
eip; })

Index: include/asm-i386/page.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/page.h,v
retrieving revision 1.1.1.3
diff -u -p -r1.1.1.3 page.h
--- include/asm-i386/page.h 24 Feb 2002 23:11:41 -0000 1.1.1.3
+++ include/asm-i386/page.h 29 Apr 2002 21:09:09 -0000
@@ -132,7 +132,10 @@ static __inline__ int get_order(unsigned
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
#define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT))
-#define VALID_PAGE(page) ((page - mem_map) < max_mapnr)
+#define virt_to_valid_page(kaddr) ({ \
+ unsigned long __paddr = __pa(kaddr); \
+ __paddr < max_mapnr ? mem_map + (__paddr >> PAGE_SHIFT) : NULL; \
+})

#define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \
VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
Index: include/asm-i386/pgtable-2level.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-2level.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 pgtable-2level.h
--- include/asm-i386/pgtable-2level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1
+++ include/asm-i386/pgtable-2level.h 29 Apr 2002 21:13:29 -0000
@@ -57,6 +57,7 @@ static inline pmd_t * pmd_offset(pgd_t *
#define ptep_get_and_clear(xp) __pte(xchg(&(xp)->pte_low, 0))
#define pte_same(a, b) ((a).pte_low == (b).pte_low)
#define pte_page(x) (mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
+#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL)
#define pte_none(x) (!(x).pte_low)
#define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))

Index: include/asm-i386/pgtable-3level.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-3level.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 pgtable-3level.h
--- include/asm-i386/pgtable-3level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1
+++ include/asm-i386/pgtable-3level.h 29 Apr 2002 21:13:08 -0000
@@ -87,6 +87,7 @@ static inline int pte_same(pte_t a, pte_
}

#define pte_page(x) (mem_map+(((x).pte_low >> PAGE_SHIFT) | ((x).pte_high << (32 - PAGE_SHIFT))))
+#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL)
#define pte_none(x) (!(x).pte_low && !(x).pte_high)

static inline pte_t __mk_pte(unsigned long page_nr, pgprot_t pgprot)
Index: include/asm-m68k/processor.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-m68k/processor.h,v
retrieving revision 1.1.1.1
diff -u -p -r1.1.1.1 processor.h
--- include/asm-m68k/processor.h 26 Nov 2001 19:29:57 -0000 1.1.1.1
+++ include/asm-m68k/processor.h 29 Apr 2002 20:40:37 -0000
@@ -139,7 +139,7 @@ unsigned long get_wchan(struct task_stru
({ \
unsigned long eip = 0; \
if ((tsk)->thread.esp0 > PAGE_SIZE && \
- (VALID_PAGE(virt_to_page((tsk)->thread.esp0)))) \
+ (virt_to_valid_page((tsk)->thread.esp0))) \
eip = ((struct pt_regs *) (tsk)->thread.esp0)->pc; \
eip; })
#define KSTK_ESP(tsk) ((tsk) == current ? rdusp() : (tsk)->thread.usp)
Index: include/asm-sh/pgalloc.h
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/include/asm-sh/pgalloc.h,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 pgalloc.h
--- include/asm-sh/pgalloc.h 31 Jan 2002 22:15:51 -0000 1.1.1.2
+++ include/asm-sh/pgalloc.h 29 Apr 2002 19:11:43 -0000
@@ -105,9 +105,8 @@ static inline pte_t ptep_get_and_clear(p

pte_clear(ptep);
if (!pte_not_present(pte)) {
- struct page *page = pte_page(pte);
- if (VALID_PAGE(page)&&
- (!page->mapping || !(page->mapping->i_mmap_shared)))
+ struct page *page = pte_valid_page(pte);
+ if (page && (!page->mapping || !(page->mapping->i_mmap_shared)))
__clear_bit(PG_mapped, &page->flags);
}
return pte;
Index: mm/memory.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/memory.c,v
retrieving revision 1.1.1.9
diff -u -p -r1.1.1.9 memory.c
--- mm/memory.c 29 Apr 2002 17:30:38 -0000 1.1.1.9
+++ mm/memory.c 29 Apr 2002 20:38:17 -0000
@@ -76,8 +76,8 @@ mem_map_t * mem_map;
*/
void __free_pte(pte_t pte)
{
- struct page *page = pte_page(pte);
- if ((!VALID_PAGE(page)) || PageReserved(page))
+ struct page *page = pte_valid_page(pte);
+ if (!page || PageReserved(page))
return;
if (pte_dirty(pte))
set_page_dirty(page);
@@ -278,9 +278,8 @@ skip_copy_pte_range: address = (address
swap_duplicate(pte_to_swp_entry(pte));
goto cont_copy_pte_range;
}
- ptepage = pte_page(pte);
- if ((!VALID_PAGE(ptepage)) ||
- PageReserved(ptepage))
+ ptepage = pte_valid_page(pte);
+ if (!ptepage || PageReserved(ptepage))
goto cont_copy_pte_range;

/* If it's a COW mapping, write protect it both in the parent and the child */
@@ -356,8 +355,8 @@ static inline int zap_pte_range(mmu_gath
if (pte_none(pte))
continue;
if (pte_present(pte)) {
- struct page *page = pte_page(pte);
- if (VALID_PAGE(page) && !PageReserved(page))
+ struct page *page = pte_valid_page(pte);
+ if (page && !PageReserved(page))
freed ++;
/* This will eventually call __free_pte on the pte. */
tlb_remove_page(tlb, ptep, address + offset);
@@ -473,7 +472,7 @@ static struct page * follow_page(struct
if (pte_present(pte)) {
if (!write ||
(pte_write(pte) && pte_dirty(pte)))
- return pte_page(pte);
+ return pte_valid_page(pte);
}

out:
@@ -488,8 +487,6 @@ out:

static inline struct page * get_page_map(struct page *page)
{
- if (!VALID_PAGE(page))
- return 0;
return page;
}

@@ -860,12 +857,12 @@ static inline void remap_pte_range(pte_t
end = PMD_SIZE;
do {
struct page *page;
- pte_t oldpage;
+ pte_t oldpage, newpage;
oldpage = ptep_get_and_clear(pte);
-
- page = virt_to_page(__va(phys_addr));
- if ((!VALID_PAGE(page)) || PageReserved(page))
- set_pte(pte, mk_pte_phys(phys_addr, prot));
+ newpage = mk_pte_phys(phys_addr, prot);
+ page = pte_valid_page(newpage);
+ if (!page || PageReserved(page))
+ set_pte(pte, newpage);
forget_pte(oldpage);
address += PAGE_SIZE;
phys_addr += PAGE_SIZE;
@@ -978,8 +975,8 @@ static int do_wp_page(struct mm_struct *
{
struct page *old_page, *new_page;

- old_page = pte_page(pte);
- if (!VALID_PAGE(old_page))
+ old_page = pte_valid_page(pte);
+ if (!old_page)
goto bad_wp_page;

if (!TryLockPage(old_page)) {
Index: mm/msync.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/msync.c,v
retrieving revision 1.1.1.2
diff -u -p -r1.1.1.2 msync.c
--- mm/msync.c 14 Apr 2002 20:01:38 -0000 1.1.1.2
+++ mm/msync.c 29 Apr 2002 19:04:34 -0000
@@ -26,8 +26,8 @@ static int filemap_sync_pte(pte_t *ptep,
pte_t pte = *ptep;

if (pte_present(pte) && pte_dirty(pte)) {
- struct page *page = pte_page(pte);
- if (VALID_PAGE(page) && !PageReserved(page) && ptep_test_and_clear_dirty(ptep)) {
+ struct page *page = pte_valid_page(pte);
+ if (page && !PageReserved(page) && ptep_test_and_clear_dirty(ptep)) {
flush_tlb_page(vma, address);
set_page_dirty(page);
}
Index: mm/page_alloc.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/page_alloc.c,v
retrieving revision 1.1.1.8
diff -u -p -r1.1.1.8 page_alloc.c
--- mm/page_alloc.c 24 Apr 2002 19:31:04 -0000 1.1.1.8
+++ mm/page_alloc.c 29 Apr 2002 20:42:30 -0000
@@ -101,8 +101,6 @@ static void __free_pages_ok (struct page
BUG();
if (page->mapping)
BUG();
- if (!VALID_PAGE(page))
- BUG();
if (PageSwapCache(page))
BUG();
if (PageLocked(page))
@@ -294,8 +292,6 @@ static struct page * balance_classzone(z
BUG();
if (page->mapping)
BUG();
- if (!VALID_PAGE(page))
- BUG();
if (PageSwapCache(page))
BUG();
if (PageLocked(page))
@@ -474,7 +470,7 @@ void __free_pages(struct page *page, uns
void free_pages(unsigned long addr, unsigned int order)
{
if (addr != 0)
- __free_pages(virt_to_page(addr), order);
+ __free_pages(virt_to_valid_page(addr), order);
}

/*
Index: mm/slab.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/slab.c,v
retrieving revision 1.1.1.5
diff -u -p -r1.1.1.5 slab.c
--- mm/slab.c 13 Mar 2002 21:16:16 -0000 1.1.1.5
+++ mm/slab.c 29 Apr 2002 20:44:21 -0000
@@ -1415,7 +1415,7 @@ alloc_new_slab_nolock:
#if DEBUG
# define CHECK_NR(pg) \
do { \
- if (!VALID_PAGE(pg)) { \
+ if (!pg) { \
printk(KERN_ERR "kfree: out of range ptr %lxh.\n", \
(unsigned long)objp); \
BUG(); \
@@ -1439,7 +1439,7 @@ static inline void kmem_cache_free_one(k
{
slab_t* slabp;

- CHECK_PAGE(virt_to_page(objp));
+ CHECK_PAGE(virt_to_valid_page(objp));
/* reduces memory footprint
*
if (OPTIMIZE(cachep))
@@ -1519,7 +1519,7 @@ static inline void __kmem_cache_free (km
#ifdef CONFIG_SMP
cpucache_t *cc = cc_data(cachep);

- CHECK_PAGE(virt_to_page(objp));
+ CHECK_PAGE(virt_to_valid_page(objp));
if (cc) {
int batchcount;
if (cc->avail < cc->limit) {
@@ -1601,7 +1601,7 @@ void kmem_cache_free (kmem_cache_t *cach
{
unsigned long flags;
#if DEBUG
- CHECK_PAGE(virt_to_page(objp));
+ CHECK_PAGE(virt_to_valid_page(objp));
if (cachep != GET_PAGE_CACHE(virt_to_page(objp)))
BUG();
#endif
@@ -1626,7 +1626,7 @@ void kfree (const void *objp)
if (!objp)
return;
local_irq_save(flags);
- CHECK_PAGE(virt_to_page(objp));
+ CHECK_PAGE(virt_to_valid_page(objp));
c = GET_PAGE_CACHE(virt_to_page(objp));
__kmem_cache_free(c, (void*)objp);
local_irq_restore(flags);
Index: mm/vmalloc.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/vmalloc.c,v
retrieving revision 1.1.1.5
diff -u -p -r1.1.1.5 vmalloc.c
--- mm/vmalloc.c 24 Apr 2002 19:31:04 -0000 1.1.1.5
+++ mm/vmalloc.c 29 Apr 2002 18:59:39 -0000
@@ -45,8 +45,8 @@ static inline void free_area_pte(pmd_t *
if (pte_none(page))
continue;
if (pte_present(page)) {
- struct page *ptpage = pte_page(page);
- if (VALID_PAGE(ptpage) && (!PageReserved(ptpage)))
+ struct page *ptpage = pte_valid_page(page);
+ if (ptpage && (!PageReserved(ptpage)))
__free_page(ptpage);
continue;
}
Index: mm/vmscan.c
===================================================================
RCS file: /usr/src/cvsroot/linux-2.5/mm/vmscan.c,v
retrieving revision 1.1.1.7
diff -u -p -r1.1.1.7 vmscan.c
--- mm/vmscan.c 24 Apr 2002 19:31:04 -0000 1.1.1.7
+++ mm/vmscan.c 29 Apr 2002 18:57:37 -0000
@@ -206,9 +206,9 @@ static inline int swap_out_pmd(struct mm

do {
if (pte_present(*pte)) {
- struct page *page = pte_page(*pte);
+ struct page *page = pte_valid_page(*pte);

- if (VALID_PAGE(page) && !PageReserved(page)) {
+ if (page && !PageReserved(page)) {
count -= try_to_swap_out(mm, vma, address, pte, page, classzone);
if (!count) {
address += PAGE_SIZE;

2002-04-30 00:44:19

by Andrea Arcangeli

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

On Tue, Apr 30, 2002 at 12:00:50AM +0200, Roman Zippel wrote:
> Hi,
>
> On Sat, 27 Apr 2002, Andrea Arcangeli wrote:
>
> > correct. This should fix it:
> >
> > --- 2.4.19pre7aa2/include/asm-alpha/mmzone.h.~1~ Fri Apr 26 10:28:28 2002
> > +++ 2.4.19pre7aa2/include/asm-alpha/mmzone.h Sat Apr 27 00:30:02 2002
> > @@ -106,8 +106,8 @@
> > #define kern_addr_valid(kaddr) test_bit(LOCAL_MAP_NR(kaddr), \
> > NODE_DATA(KVADDR_TO_NID(kaddr))->valid_addr_bitmap)
> >
> > -#define virt_to_page(kaddr) (ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr))
> > -#define VALID_PAGE(page) (((page) - mem_map) < max_mapnr)
> > +#define virt_to_page(kaddr) (KVADDR_TO_NID((unsigned long) kaddr) < MAX_NUMNODES ? ADDR_TO_MAPBASE(kaddr) + LOCAL_MAP_NR(kaddr) : 0)
> > +#define VALID_PAGE(page) ((page) != NULL)
> >
> > #ifdef CONFIG_NUMA
> > #ifdef CONFIG_NUMA_SCHED
>
> I'd prefer it if VALID_PAGE went away completely; that test was almost
> always too late. What about the patch below? It even reduces the code size

it is _always_ too late indeed. I definitely agree with your proposal
to change the common code API; yours is a much saner API. But that's a
common-code change call: my objective was to fix the arch part without
changing the common code, and after all my patch will work exactly the
same as yours; it's just that you make the page != NULL check explicit
while I still use VALID_PAGE instead. Yours can also skip the overflow
check when we know the vaddr or the pte matches a valid ram page, so
it's a bit faster than my fix with discontigmem enabled. I'm not sure
it's worth changing that in 2.4, given that my two-line arch-contained
patch also works flawlessly. I have quite a lot of stuff pending in 2.4
that makes a huge difference to users, so I tend to leave the stuff that
doesn't make a difference to users for 2.5 only (it's a cleanup plus a
minor discontigmem optimization, after all). So I recommend you push it
to Linus after fixing the bugs below.

> --- include/asm-i386/page.h 24 Feb 2002 23:11:41 -0000 1.1.1.3
> +++ include/asm-i386/page.h 29 Apr 2002 21:09:09 -0000
> @@ -132,7 +132,10 @@ static __inline__ int get_order(unsigned
> #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
> #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
> #define virt_to_page(kaddr) (mem_map + (__pa(kaddr) >> PAGE_SHIFT))
> -#define VALID_PAGE(page) ((page - mem_map) < max_mapnr)
> +#define virt_to_valid_page(kaddr) ({ \
> + unsigned long __paddr = __pa(kaddr); \
> + __paddr < max_mapnr ? mem_map + (__paddr >> PAGE_SHIFT) : NULL; \
> +})
>
> #define VM_DATA_DEFAULT_FLAGS (VM_READ | VM_WRITE | VM_EXEC | \
> VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC)
> Index: include/asm-i386/pgtable-2level.h
> ===================================================================
> RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-2level.h,v
> retrieving revision 1.1.1.1
> diff -u -p -r1.1.1.1 pgtable-2level.h
> --- include/asm-i386/pgtable-2level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1
> +++ include/asm-i386/pgtable-2level.h 29 Apr 2002 21:13:29 -0000
> @@ -57,6 +57,7 @@ static inline pmd_t * pmd_offset(pgd_t *
> #define ptep_get_and_clear(xp) __pte(xchg(&(xp)->pte_low, 0))
> #define pte_same(a, b) ((a).pte_low == (b).pte_low)
> #define pte_page(x) (mem_map+((unsigned long)(((x).pte_low >> PAGE_SHIFT))))
> +#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL)
> #define pte_none(x) (!(x).pte_low)
> #define __mk_pte(page_nr,pgprot) __pte(((page_nr) << PAGE_SHIFT) | pgprot_val(pgprot))
>
> Index: include/asm-i386/pgtable-3level.h
> ===================================================================
> RCS file: /usr/src/cvsroot/linux-2.5/include/asm-i386/pgtable-3level.h,v
> retrieving revision 1.1.1.1
> diff -u -p -r1.1.1.1 pgtable-3level.h
> --- include/asm-i386/pgtable-3level.h 26 Nov 2001 19:29:55 -0000 1.1.1.1
> +++ include/asm-i386/pgtable-3level.h 29 Apr 2002 21:13:08 -0000
> @@ -87,6 +87,7 @@ static inline int pte_same(pte_t a, pte_
> }
>
> #define pte_page(x) (mem_map+(((x).pte_low >> PAGE_SHIFT) | ((x).pte_high << (32 - PAGE_SHIFT))))
> +#define pte_valid_page(x) (pte_val(x) < max_mapnr ? pte_page(x) : NULL)
> #define pte_none(x) (!(x).pte_low && !(x).pte_high)
>

max_mapnr is a pfn, not a physaddr; you're off by a factor of
2^PAGE_SHIFT. The fix is trivial, of course.
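
For reference, a sketch of the trivial fix, untested, in the style of
the patch above (the 3-level pte_valid_page would need the combined
pte_low/pte_high pfn instead):

#define virt_to_valid_page(kaddr) ({ \
	unsigned long __pfn = __pa(kaddr) >> PAGE_SHIFT; \
	__pfn < max_mapnr ? mem_map + __pfn : NULL; \
})

#define pte_valid_page(x) \
	((((x).pte_low >> PAGE_SHIFT) < max_mapnr) ? pte_page(x) : NULL)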

Andrea

2002-04-30 23:02:15

by Daniel Phillips

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

On Monday 29 April 2002 15:35, Andrea Arcangeli wrote:
> On Sun, Apr 28, 2002 at 12:10:20AM +0200, Daniel Phillips wrote:
> > On Friday 26 April 2002 20:27, Russell King wrote:
> > > Hi,
> > >
> > > I've been looking at some of the ARM discontigmem implementations, and
> > > have come across a nasty bug. To illustrate this, I'm going to take
> > > part of the generic kernel, and use the Alpha implementation to
> > > illustrate the problem we're facing on ARM.
> > >
> > > I'm going to argue here that virt_to_page() can, in the discontigmem
> > > case, produce rather a nasty bug when used with non-direct mapped
> > > kernel memory arguments.
> >
> > It's tough to follow, even when you know the code. While cooking my
> > config_nonlinear patch I noticed the line you're concerned about and
> > regarded it with deep suspicion. My patch does this:
> >
> > - page = virt_to_page(__va(phys_addr));
> > + page = phys_to_page(phys_addr);
> >
> > And of course took care that phys_to_page does the right thing in all
> > cases.
>
> The problem remains the same going from phys to page: the mem_map isn't
> contiguous, and it choked when the phys addr was above the max ram
> physaddr. The patch I posted a few days ago will fix it (modulo unused
> ram space, but attempting to map unused ram space into the address space
> is a bug in the first place).

My config_nonlinear patch does not suffer from the above problem. Here's the
code:

unsigned long vsection[MAX_SECTIONS];

static inline unsigned long phys_to_ordinal(phys_t p)
{
return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT);
}

static inline struct page *phys_to_page(unsigned long p)
{
return mem_map + phys_to_ordinal(p);
}

Nothing can go out of range. Sensible, no?

> > <plug>
> > The new config_nonlinear was designed as a cleaner, more powerful
> > replacement for all non-numa uses of config_discontigmem.
> > </plug>
>
> I may be wrong because I have only had a short look at it so far, but the
> "non-numa" part is what I noticed too, and that's what renders it not a very
> interesting option IMHO. Most discontigmem needs numa too.

I am, first and foremost, presenting config_nonlinear as a replacement for
config_discontig for *non-numa* uses of config_discontig. (Sorry if I'm
repeating myself here.)

There are also applications in numa. Please see the lse-tech archives for
details. I expect that, by taking a fresh look at the numa code in the
light of new work, the numa code can be cleaned up and simplified
considerably. But that's "further work". Config_nonlinear stands on its
own quite nicely.

> If it cannot
> handle numa, it isn't worth adding the complexity there,

It does not add complexity, it removes complexity. Please read the patch
more closely. It's very simple. It's also more powerful than
config_discontig.

> with numa we must view those chunks differently, not linearly.

Correct. Now, if you want to extend my patch to handle multiple mem_map
vectors, you do it by defining an ordinal_to_page and page_to_ordinal pair
of mappings.[1] Don't you think this is a nicer way to organize things?
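
Something along these lines, as a sketch only; every name below is
invented for illustration and none of it is in the current patch:

#define MAX_NODES 4	/* say */

/* one mem_map vector per node, plus cumulative page counts */
extern struct page *node_map[MAX_NODES];
extern unsigned long node_start[MAX_NODES + 1];

static inline struct page *ordinal_to_page(unsigned long ord)
{
	int nid = 0;

	while (ord >= node_start[nid + 1])	/* locate the owning node */
		nid++;
	return node_map[nid] + (ord - node_start[nid]);
}

static inline unsigned long page_to_ordinal(struct page *page, int nid)
{
	/* the caller knows the node, e.g. from the enclosing pgdat */
	return node_start[nid] + (page - node_map[nid]);
}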

> Also, there's nothing
> magic that says mem_map must have a magical meaning; it isn't worth
> preserving the mem_map thing. virt_to_page is a much cleaner abstraction
> than doing mem_map + pfn by hand.

True. The upcoming iteration of config_nonlinear moves all uses of
mem_map inside the per-arch page.h headers, so that mem_map need not
exist at all in configurations where there is no single mem_map.

[1] Since allocation needs to be aware of the separate zones,
_alloc_pages stays much as it is, but if we change all non-numa users
of config_discontig over to config_nonlinear then we can get rid of the
#ifdef CONFIG_NUMA's, by way of cleanup.

--
Daniel

2002-05-01 02:23:05

by Andrea Arcangeli

Subject: Re: Bug: Discontigmem virt_to_page() [Alpha,ARM,Mips64?]

On Tue, Apr 30, 2002 at 01:02:05AM +0200, Daniel Phillips wrote:
> My config_nonlinear patch does not suffer from the above problem. Here's the
> code:
>
> unsigned long vsection[MAX_SECTIONS];
>
> static inline unsigned long phys_to_ordinal(phys_t p)
> {
> return vsection[p >> SECTION_SHIFT] + ((p & SECTION_MASK) >> PAGE_SHIFT);
> }
>
> static inline struct page *phys_to_page(unsigned long p)
> {
> return mem_map + phys_to_ordinal(p);
> }
>
> Nothing can go out of range. Sensible, no?

Really, the above vsection[p >> SECTION_SHIFT] will overflow in the very
same case I fixed a few days ago for numa-alpha. The whole point is that
p isn't a ram page, and you assumed it is (the alpha code was also
assuming that, and that's why it overflowed the same way yours does).
Either that, or you're wasting huge amounts of ram on vsection on a
64-bit arch.
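
Concretely, assuming 64-bit longs (SECTION_SHIFT and the address below
are illustrative, not taken from the patch):

#include <stdio.h>

#define SECTION_SHIFT	30	/* 1GB sections, say */
#define MAX_SECTIONS	64	/* sized for 64GB of real ram */

int main(void)
{
	unsigned long p = 0x40000000000UL;	/* 4TB: no ram behind it */
	unsigned long idx = p >> SECTION_SHIFT;

	/* prints idx = 4096: vsection[idx] is far out of bounds unless
	 * MAX_SECTIONS covers the whole physical address space */
	printf("idx = %lu, MAX_SECTIONS = %d\n", idx, MAX_SECTIONS);
	return 0;
}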

After the above out-of-range bug is fixed, in practice it is the same
as the current discontigmem, except that with the current way you can
keep the page structures in the right node with numa. And again, I
cannot see any advantage in having a contiguous mem_map even for archs
with discontigmem but no numa (I think only ARM falls into that
category, btw).

> > > <plug>
> > > The new config_nonlinear was designed as a cleaner, more powerful
> > > replacement for all non-numa uses of config_discontigmem.
> > > </plug>
> >
> > I may be wrong because I have only had a short look at it so far, but the
> > "non-numa" part is what I noticed too, and that's what renders it not a very
> > interesting option IMHO. Most discontigmem needs numa too.
>
> I am, first and foremost, presenting config_nonlinear as a replacement for
> config_discontig for *non-numa* uses of config_discontig. (Sorry if I'm
> repeating myself here.)
>
> There are also applications in numa. Please see the lse-tech archives for
> details. I expect that, by taking a fresh look at the numa code in the
> light of new work, the numa code can be cleaned up and simplified
> considerably. But that's "further work". Config_nonlinear stands on its
> own quite nicely.

Tell me how an ARM machine will run faster with nonlinear; it does
nearly the same thing, except that it's a lesser abstraction that forces
a contiguous mem_map. The current code is much more powerful and carries
more information (the pgdat describes the whole memory topology to the
common code), and it's not going to be slower, so I don't see why we
should complicate the code with nonlinear. Personally, I hate having
more than one way of doing the same thing when there's no need for it;
the fewer ways there are, the less you have to keep in mind, the simpler
it is to understand, and the better (partly offtopic, but for the very
same reason, when I work in userspace I much prefer coding in python to
coding in perl).

> > If it cannot
> > handle numa, it isn't worth adding the complexity there,
>
> It does not add complexity, it removes complexity. Please read the patch
> more closely. It's very simple. It's also more powerful than
> config_discontig.

How? I may be overlooking something, but I would say it's anything but
more powerful. I don't see any "power" point in trying to keep the
mem_map contiguous. Please don't tell me it's more powerful, just tell
me why.

> > with numa we must view those chunks differently, not linearly.
>
> Correct. Now, if you want to extend my patch to handle multiple mem_map
> vectors, you do it by defining an ordinal_to_page and page_to_ordinal pair
> of mappings.[1] Don't you think this is a nicer way to organize things?

What's the advantage? Once you can have more than one mem_map, once
you've added this "vector", each mem_map will match a discontigmem
pgdat. Tell me a numa machine where there's a hole in the middle of a
node. The holes are always inter-node, never within the nodes
themselves. So nonlinear-numa should fall back to the straight mem_map
array pointed to by the pgdat all the time, just like it does right now.

The only advantage of nonlinear I can see would be a machine with a
huge hole in a node: with nonlinear you could avoid wasting mem_map on
the hole without having to add another pgdat that would otherwise break
numa assumptions about the pgdat. But I'm not aware of any machine with
holes on the order of gigabytes in the middle of a node; at the very
least, if that happens, it means the hardware of the machine is
misconfigured.

The very same problem would happen right now on x86 if there were a
huge hole in the physical ram: say you have 128M of ram, then a hole of
63G, and then the other physical 900M at offset 63G+128M. It will never
happen; that's broken hardware if you ever see anything like it.

At the very least I would wait for somebody with hardware that weird,
which intentionally does something like the above, to actually ask,
instead of overdesigning common-code abstractions, and there would also
be other ways to deal with such a situation without requiring a
contiguous mem_map.

> > Also, there's nothing
> > magic that says mem_map must have a magical meaning; it isn't worth
> > preserving the mem_map thing. virt_to_page is a much cleaner abstraction
> > than doing mem_map + pfn by hand.
>
> True. The upcoming iteration of config_nonlinear moves all uses of
> mem_map inside the per-arch page.h headers, so that mem_map need not
> exist at all in configurations where there is no single mem_map.

That's fine, and all correct kernel code already does that: nobody is
allowed to use mem_map in any common code anywhere (besides the
mm-proper internals when discontigmem is disabled).

Andrea