2010-07-12 15:53:57

by Minchan Kim

[permalink] [raw]
Subject: [RFC] Tight check of pfn_valid on sparsemem

Kukjin, could you test the patch below?
I don't have any sparsemem system. Sorry.

-- CUT DOWN HERE --

Kukjin reported an oops that happens when he changes min_free_kbytes:
http://www.spinics.net/lists/arm-kernel/msg92894.html
It is caused by the memory map layout on sparsemem.

The system has the following memory map:
     section 0              section 1              section 2
0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
SECTION_SIZE_BITS = 28 (256M)

This means section 0 is an incompletely filled section.
Nonetheless, the current pfn_valid of sparsemem checks pfns loosely:
it checks only the mem_section's validity.
So in the above case, a pfn at 0x25000000 can pass pfn_valid's validation check.
That is not what we want.

The following patch adds a valid-pfn-range check to pfn_valid on sparsemem.

Signed-off-by: Minchan Kim <[email protected]>
Reported-by: Kukjin Kim <[email protected]>

P.S.: This is just an RFC. If we agree on it, I will redo the patch against mmotm.

--

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..6c2147a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -979,6 +979,8 @@ struct mem_section {
	struct page_cgroup *page_cgroup;
	unsigned long pad;
#endif
+	unsigned long start_pfn;
+	unsigned long end_pfn;
};

#ifdef CONFIG_SPARSEMEM_EXTREME
@@ -1039,6 +1041,12 @@ static inline int valid_section(struct mem_section *section)
	return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
}

+static inline int valid_section_pfn(struct mem_section *section, unsigned long pfn)
+{
+	return ((section && (section->section_mem_map & SECTION_HAS_MEM_MAP)) &&
+		(section->start_pfn <= pfn && pfn < section->end_pfn));
+}
+
static inline int valid_section_nr(unsigned long nr)
{
	return valid_section(__nr_to_section(nr));
}
@@ -1053,7 +1061,7 @@ static inline int pfn_valid(unsigned long pfn)
{
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
-	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+	return valid_section_pfn(__nr_to_section(pfn_to_section_nr(pfn)), pfn);
}

diff --git a/mm/sparse.c b/mm/sparse.c
index 95ac219..bde9090 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -195,6 +195,8 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
		if (!ms->section_mem_map)
			ms->section_mem_map = sparse_encode_early_nid(nid) |
							SECTION_MARKED_PRESENT;
+		ms->start_pfn = start;
+		ms->end_pfn = end;
}
}



--
Kind regards,
Minchan Kim


2010-07-13 00:00:06

by Kukjin Kim

Subject: RE: [RFC] Tight check of pfn_valid on sparsemem

Minchan Kim wrote:
>
Hi :-)

> Kukjin, Could you test below patch?

Sure.

> I don't have any sparsemem system. Sorry.

No problem...
And in the same test, there was no problem ;-)

That means there was no kernel panic with your patch.

If you need other test on sparsemem system, please let me know.

Thanks.

Best regards,
Kgene.
--
Kukjin Kim <[email protected]>, Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.
>
> -- CUT DOWN HERE --
>
> Kukjin reported oops happen while he change min_free_kbytes
> http://www.spinics.net/lists/arm-kernel/msg92894.html
> It happen by memory map on sparsemem.
>
> The system has a memory map following as.
> section 0 section 1 section 2
> 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
> SECTION_SIZE_BITS 28(256M)
>
> It means section 0 is an incompletely filled section.
> Nontheless, current pfn_valid of sparsemem checks pfn loosely.
>
> It checks only mem_section's validation.
> So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
> It's not what we want.
>
> The Following patch adds check valid pfn range check on pfn_valid of sparsemem.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Reported-by: Kukjin Kim <[email protected]>
>
> P.S)
> It is just RFC. If we agree with this, I will make the patch on mmotm.
>
> --
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b4d109e..6c2147a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -979,6 +979,8 @@ struct mem_section {
> struct page_cgroup *page_cgroup;
> unsigned long pad;
> #endif
> + unsigned long start_pfn;
> + unsigned long end_pfn;
> };
>
> #ifdef CONFIG_SPARSEMEM_EXTREME
> @@ -1039,6 +1041,12 @@ static inline int valid_section(struct mem_section *section)
> return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
> }
>
> +static inline int valid_section_pfn(struct mem_section *section, unsigned long pfn)
> +{
> + return ((section && (section->section_mem_map & SECTION_HAS_MEM_MAP)) &&
> + (section->start_pfn <= pfn && pfn < section->end_pfn));
> +}
> +
> static inline int valid_section_nr(unsigned long nr)
> {
> return valid_section(__nr_to_section(nr));
> @@ -1053,7 +1061,7 @@ static inline int pfn_valid(unsigned long pfn)
> {
> if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
> return 0;
> - return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
> + return valid_section_pfn(__nr_to_section(pfn_to_section_nr(pfn)), pfn);
> }
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 95ac219..bde9090 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -195,6 +195,8 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
> if (!ms->section_mem_map)
> ms->section_mem_map = sparse_encode_early_nid(nid) |
> SECTION_MARKED_PRESENT;
> + ms->start_pfn = start;
> + ms->end_pfn = end;
> }
> }
>
>
>
> --
> Kind regards,
> Minchan Kim

2010-07-13 03:24:35

by Kamezawa Hiroyuki

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 13 Jul 2010 00:53:48 +0900
Minchan Kim <[email protected]> wrote:

> Kukjin, Could you test below patch?
> I don't have any sparsemem system. Sorry.
>
> -- CUT DOWN HERE --
>
> Kukjin reported oops happen while he change min_free_kbytes
> http://www.spinics.net/lists/arm-kernel/msg92894.html
> It happen by memory map on sparsemem.
>
> The system has a memory map following as.
> section 0 section 1 section 2
> 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
> SECTION_SIZE_BITS 28(256M)
>
> It means section 0 is an incompletely filled section.
> Nontheless, current pfn_valid of sparsemem checks pfn loosely.
>
> It checks only mem_section's validation.
> So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
> It's not what we want.
>
> The Following patch adds check valid pfn range check on pfn_valid of sparsemem.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Reported-by: Kukjin Kim <[email protected]>
>
> P.S)
> It is just RFC. If we agree with this, I will make the patch on mmotm.
>
> --
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b4d109e..6c2147a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -979,6 +979,8 @@ struct mem_section {
> struct page_cgroup *page_cgroup;
> unsigned long pad;
> #endif
> + unsigned long start_pfn;
> + unsigned long end_pfn;
> };
>

I have 2 concerns.
1. This doubles the size of mem_section, wasting too much memory, and it is not
   good for the cache. But yes, you could put this under some CONFIG for systems
   with a small number of mem_section[] entries.

2. This can't help in a case where a section has multiple small holes.


Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
==
Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
To handle that, pfn_valid() should check whether the memmap is really there or not.
For that purpose, __get_user() can be used.
This idea is from ia64_pfn_valid().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/mmzone.h | 12 ++++++++++++
mm/sparse.c | 17 +++++++++++++++++
2 files changed, 29 insertions(+)

Index: mmotm-2.6.35-0701/include/linux/mmzone.h
===================================================================
--- mmotm-2.6.35-0701.orig/include/linux/mmzone.h
+++ mmotm-2.6.35-0701/include/linux/mmzone.h
@@ -1047,12 +1047,24 @@ static inline struct mem_section *__pfn_
return __nr_to_section(pfn_to_section_nr(pfn));
}

+#ifndef CONFIG_ARCH_HAS_HOLES_IN_MEMMAP
static inline int pfn_valid(unsigned long pfn)
{
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
}
+#else
+extern int pfn_valid_mapped(unsigned long pfn);
+static inline int pfn_valid(unsigned long pfn)
+{
+	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
+		return 0;
+	if (!valid_section(__nr_to_section(pfn_to_section_nr(pfn))))
+		return 0;
+	return pfn_valid_mapped(pfn);
+}
+#endif

static inline int pfn_present(unsigned long pfn)
{
Index: mmotm-2.6.35-0701/mm/sparse.c
===================================================================
--- mmotm-2.6.35-0701.orig/mm/sparse.c
+++ mmotm-2.6.35-0701/mm/sparse.c
@@ -799,3 +799,20 @@ void sparse_remove_one_section(struct zo
free_section_usemap(memmap, usemap);
}
#endif
+
+#ifdef CONFIG_ARCH_HAS_HOLES_IN_MEMMAP
+int pfn_valid_mapped(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+	char *lastbyte = (char *)(page + 1) - 1;
+	char byte;
+
+	if (__get_user(byte, page) != 0)
+		return 0;
+
+	if ((((unsigned long)page) & PAGE_MASK) ==
+	    (((unsigned long)lastbyte) & PAGE_MASK))
+		return 1;
+	return (__get_user(byte, lastbyte) == 0);
+}
+#endif




2010-07-13 04:11:21

by Minchan Kim

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 12:19 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 13 Jul 2010 00:53:48 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Kukjin, Could you test below patch?
>> I don't have any sparsemem system. Sorry.
>>
>> -- CUT DOWN HERE --
>>
>> Kukjin reported oops happen while he change min_free_kbytes
>> http://www.spinics.net/lists/arm-kernel/msg92894.html
>> It happen by memory map on sparsemem.
>>
>> The system has a memory map following as.
>>      section 0             section 1              section 2
>> 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
>> SECTION_SIZE_BITS 28(256M)
>>
>> It means section 0 is an incompletely filled section.
>> Nontheless, current pfn_valid of sparsemem checks pfn loosely.
>>
>> It checks only mem_section's validation.
>> So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
>> It's not what we want.
>>
>> The Following patch adds check valid pfn range check on pfn_valid of sparsemem.
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>> Reported-by: Kukjin Kim <[email protected]>
>>
>> P.S)
>> It is just RFC. If we agree with this, I will make the patch on mmotm.
>>
>> --
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index b4d109e..6c2147a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -979,6 +979,8 @@ struct mem_section {
>>         struct page_cgroup *page_cgroup;
>>         unsigned long pad;
>> #endif
>> +       unsigned long start_pfn;
>> +       unsigned long end_pfn;
>> };
>>
>
> I have 2 concerns.
>  1. This makes mem_section twice. Wasting too much memory and not good for cache.
>    But yes, you can put this under some CONFIG which has small number of mem_section[].
>

I think the memory usage isn't a big deal, but for the cache, we can move the
fields to just after section_mem_map.

>  2. This can't be help for a case where a section has multiple small holes.

I agree. But this (a not-filled-section problem, not a punched hole) isn't
such a case. Still, it would be better to handle both together. :)

>
> Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
> ==
> Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
> To handle that, pfn_valid() should check there are really memmap or not.
> For that purpose, __get_user() can be used.

Look at free_unused_memmap. We don't unmap the ptes of the hole memmap.
Is __get_user still effective?




--
Kind regards,
Minchan Kim

2010-07-13 04:28:04

by Kamezawa Hiroyuki

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 13 Jul 2010 13:11:14 +0900
Minchan Kim <[email protected]> wrote:

> On Tue, Jul 13, 2010 at 12:19 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Tue, 13 Jul 2010 00:53:48 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >> Kukjin, Could you test below patch?
> >> I don't have any sparsemem system. Sorry.
> >>
> >> -- CUT DOWN HERE --
> >>
> >> Kukjin reported oops happen while he change min_free_kbytes
> >> http://www.spinics.net/lists/arm-kernel/msg92894.html
> >> It happen by memory map on sparsemem.
> >>
> >> The system has a memory map following as.
> >>      section 0             section 1              section 2
> >> 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
> >> SECTION_SIZE_BITS 28(256M)
> >>
> >> It means section 0 is an incompletely filled section.
> >> Nontheless, current pfn_valid of sparsemem checks pfn loosely.
> >>
> >> It checks only mem_section's validation.
> >> So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
> >> It's not what we want.
> >>
> >> The Following patch adds check valid pfn range check on pfn_valid of sparsemem.
> >>
> >> Signed-off-by: Minchan Kim <[email protected]>
> >> Reported-by: Kukjin Kim <[email protected]>
> >>
> >> P.S)
> >> It is just RFC. If we agree with this, I will make the patch on mmotm.
> >>
> >> --
> >>
> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >> index b4d109e..6c2147a 100644
> >> --- a/include/linux/mmzone.h
> >> +++ b/include/linux/mmzone.h
> >> @@ -979,6 +979,8 @@ struct mem_section {
> >>         struct page_cgroup *page_cgroup;
> >>         unsigned long pad;
> >>  #endif
> >> +       unsigned long start_pfn;
> >> +       unsigned long end_pfn;
> >>  };
> >>
> >
> > I have 2 concerns.
> >  1. This makes mem_section twice. Wasting too much memory and not good for cache.
> >    But yes, you can put this under some CONFIG which has small number of mem_section[].
> >
>
> I think memory usage isn't a big deal. but for cache, we can move
> fields into just after section_mem_map.
>
I don't think so. This additional field can eat up the amount of memory you saved
by unmapping.

> >  2. This can't be help for a case where a section has multiple small holes.
>
> I agree. But this(not punched hole but not filled section problem)
> isn't such case. But it would be better to handle it altogether. :)
>
> >
> > Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
> > ==
> > Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
> > To handle that, pfn_valid() should check there are really memmap or not.
> > For that purpose, __get_user() can be used.
>
> Look at free_unused_memmap. We don't unmap pte of hole memmap.
> Is __get_use effective, still?
>
__get_user() works through the TLB and page tables, checking whether the vaddr is really mapped or not.
If you get a SEGV, __get_user() returns -EFAULT. It works at page granularity.

Thanks,
-Kame

2010-07-13 06:04:05

by Minchan Kim

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 1:23 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 13 Jul 2010 13:11:14 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Tue, Jul 13, 2010 at 12:19 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> > On Tue, 13 Jul 2010 00:53:48 +0900
>> > Minchan Kim <[email protected]> wrote:
>> >
>> >> Kukjin, Could you test below patch?
>> >> I don't have any sparsemem system. Sorry.
>> >>
>> >> -- CUT DOWN HERE --
>> >>
>> >> Kukjin reported oops happen while he change min_free_kbytes
>> >> http://www.spinics.net/lists/arm-kernel/msg92894.html
>> >> It happen by memory map on sparsemem.
>> >>
>> >> The system has a memory map following as.
>> >>      section 0             section 1              section 2
>> >> 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
>> >> SECTION_SIZE_BITS 28(256M)
>> >>
>> >> It means section 0 is an incompletely filled section.
>> >> Nontheless, current pfn_valid of sparsemem checks pfn loosely.
>> >>
>> >> It checks only mem_section's validation.
>> >> So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
>> >> It's not what we want.
>> >>
>> >> The Following patch adds check valid pfn range check on pfn_valid of sparsemem.
>> >>
>> >> Signed-off-by: Minchan Kim <[email protected]>
>> >> Reported-by: Kukjin Kim <[email protected]>
>> >>
>> >> P.S)
>> >> It is just RFC. If we agree with this, I will make the patch on mmotm.
>> >>
>> >> --
>> >>
>> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> >> index b4d109e..6c2147a 100644
>> >> --- a/include/linux/mmzone.h
>> >> +++ b/include/linux/mmzone.h
>> >> @@ -979,6 +979,8 @@ struct mem_section {
>> >>         struct page_cgroup *page_cgroup;
>> >>         unsigned long pad;
>> >> #endif
>> >> +       unsigned long start_pfn;
>> >> +       unsigned long end_pfn;
>> >> };
>> >>
>> >
>> > I have 2 concerns.
>> >  1. This makes mem_section twice. Wasting too much memory and not good for cache.
>> >    But yes, you can put this under some CONFIG which has small number of mem_section[].
>> >
>>
>> I think memory usage isn't a big deal. but for cache, we can move
>> fields into just after section_mem_map.
>>
> I don't think so. This addtional field can eat up the amount of memory you saved
> by unmap.

Agree.

>
>> >  2. This can't be help for a case where a section has multiple small holes.
>>
>> I agree. But this(not punched hole but not filled section problem)
>> isn't such case. But it would be better to handle it altogether. :)
>>
>> >
>> > Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
>> > ==
>> > Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
>> > To handle that, pfn_valid() should check there are really memmap or not.
>> > For that purpose, __get_user() can be used.
>>
>> Look at free_unused_memmap. We don't unmap pte of hole memmap.
>> Is __get_use effective, still?
>>
> __get_user() works with TLB and page table, the vaddr is really mapped or not.
> If you got SEGV, __get_user() returns -EFAULT. It works per page granule.

I mean the following.
For example, suppose there is a struct page at 0x20000000.

int pfn_valid_mapped(unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);	/* hole page is at 0x20000000 */
	char *lastbyte = (char *)(page + 1) - 1;	/* lastbyte is 0x2000001f */
	char byte;

	/* We pass this test since free_unused_memmap doesn't unmap the pte */
	if (__get_user(byte, page) != 0)
		return 0;
	/*
	 * (0x20000000 & PAGE_MASK) == (0x2000001f & PAGE_MASK),
	 * so this returns 1, which is the wrong result.
	 */
	if ((((unsigned long)page) & PAGE_MASK) ==
	    (((unsigned long)lastbyte) & PAGE_MASK))
		return 1;
	return (__get_user(byte, lastbyte) == 0);
}

Am I missing something?


--
Kind regards,
Minchan Kim

2010-07-13 06:45:14

by Kamezawa Hiroyuki

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 13 Jul 2010 15:04:00 +0900
Minchan Kim <[email protected]> wrote:

> >> >  2. This can't be help for a case where a section has multiple small holes.
> >>
> >> I agree. But this(not punched hole but not filled section problem)
> >> isn't such case. But it would be better to handle it altogether. :)
> >>
> >> >
> >> > Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
> >> > ==
> >> > Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
> >> > To handle that, pfn_valid() should check there are really memmap or not.
> >> > For that purpose, __get_user() can be used.
> >>
> >> Look at free_unused_memmap. We don't unmap pte of hole memmap.
> >> Is __get_use effective, still?
> >>
> > __get_user() works with TLB and page table, the vaddr is really mapped or not.
> > If you got SEGV, __get_user() returns -EFAULT. It works per page granule.
>
> I mean following as.
> For example, there is a struct page in on 0x20000000.
>
> int pfn_valid_mapped(unsigned long pfn)
> {
> struct page *page = pfn_to_page(pfn); /* hole page is 0x2000000 */
> char *lastbyte = (char *)(page+1)-1; /* lastbyte is 0x2000001f */
> char byte;
>
> /* We pass this test since free_unused_memmap doesn't unmap pte */
> if(__get_user(byte, page) != 0)
> return 0;

Why? When the page size is 4096 bytes,

0x1ffff000 - 0x1fffffff and
0x20000000 - 0x20000fff are each a single page, and memory is mapped per page.

What we access with the above __get_user() is a byte at [0x20000000, 0x20000001),
and it's unmapped if 0x20000000 is unmapped.

Thanks,
-Kame

2010-07-13 07:21:57

by Russell King - ARM Linux

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 03:04:00PM +0900, Minchan Kim wrote:
> > __get_user() works with TLB and page table, the vaddr is really mapped or not.
> > If you got SEGV, __get_user() returns -EFAULT. It works per page granule.

Not in kernel space. It works on 1MB sections there.

Testing whether a page is mapped by __get_user is a hugely expensive
way to test whether a PFN is valid. It'd be cheaper to use our
flatmem implementation of pfn_valid() instead.

2010-07-13 07:39:08

by Kamezawa Hiroyuki

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 13 Jul 2010 08:20:09 +0100
Russell King - ARM Linux <[email protected]> wrote:

> On Tue, Jul 13, 2010 at 03:04:00PM +0900, Minchan Kim wrote:
> > > __get_user() works with TLB and page table, the vaddr is really mapped or not.
> > > If you got SEGV, __get_user() returns -EFAULT. It works per page granule.
>
> Not in kernel space. It works on 1MB sections there.
>
> Testing whether a page is mapped by __get_user is a hugely expensive
> way to test whether a PFN is valid.

Note: pfn_valid() is for checking "there is memmap".

> It'd be cheaper to use our flatmem implementation of pfn_valid() instead.
>
Hmm. IIUC, pfn_valid() succeeds in almost all cases if there is a section.
But yes, I'm not familiar with ARM.

I like another idea, which I've already shown: preparing _a_ page filled with
0x00004000 and mapping it into all the holes. PG_reserved will help in almost
all cases, even if it's ugly.

Anyway, sparsemem is designed around a memmap aligned to SECTION_SIZE.
Please avoid adding new spaghetti code without proper configs.
Thanks,
-Kame

2010-07-13 08:03:04

by Kamezawa Hiroyuki

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 13 Jul 2010 16:34:17 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> Anyway, sparsemem is designed to be aligned to SECTION_SIZE of memmap.
> Please avoid adding new Spaghetti code without proper configs.
> Thanks,

Ok, I realized I misunderstood it all. ARM doesn't unmap the memmap but reuses the pages
for the memmap without modifying the ptes. My routine only works when ARM uses
sparsemem_vmemmap. But yes, it doesn't.

Hmm... how about using pfn_valid() for FLATMEM, or avoiding SPARSEMEM?
If you want control at a lower granularity than SPARSEMEM, FLATMEM works better because ARM unmaps the memmap.
What is the reason for SPARSEMEM?

Thanks,
-Kame

2010-07-13 08:07:00

by Minchan Kim

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 3:40 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 13 Jul 2010 15:04:00 +0900
> Minchan Kim <[email protected]> wrote:
>
>> >> >  2. This can't be help for a case where a section has multiple small holes.
>> >>
>> >> I agree. But this(not punched hole but not filled section problem)
>> >> isn't such case. But it would be better to handle it altogether. :)
>> >>
>> >> >
>> >> > Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
>> >> > ==
>> >> > Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
>> >> > To handle that, pfn_valid() should check there are really memmap or not.
>> >> > For that purpose, __get_user() can be used.
>> >>
>> >> Look at free_unused_memmap. We don't unmap pte of hole memmap.
>> >> Is __get_use effective, still?
>> >>
>> > __get_user() works with TLB and page table, the vaddr is really mapped or not.
>> > If you got SEGV, __get_user() returns -EFAULT. It works per page granule.
>>
>> I mean following as.
>> For example, there is a struct page in on 0x20000000.
>>
>> int pfn_valid_mapped(unsigned long pfn)
>> {
>>        struct page *page = pfn_to_page(pfn); /* hole page is 0x2000000 */
>>        char *lastbyte = (char *)(page+1)-1;  /* lastbyte is 0x2000001f */
>>        char byte;
>>
>>        /* We pass this test since free_unused_memmap doesn't unmap pte */
>>        if(__get_user(byte, page) != 0)
>>                return 0;
>
> why ? When the page size is 4096 byte.
>
> ? ? ?0x1ffff000 - 0x1ffffffff
> ? ? ?0x20000000 - 0x200000fff are on the same page. And memory is mapped per page.

sizeof(struct page) is 32 bytes,
so lastbyte is the address of the struct page + 32 bytes - 1.

> What we access by above __get_user() is a byte at [0x20000000, 0x20000001)

Right.

> and it's unmapped if 0x20000000 is unmapped.

free_unused_memmap doesn't unmap the pte although it returns the page to
the buddy free list.

>
> Thanks,
> -Kame
>
>



--
Kind regards,
Minchan Kim

2010-07-13 08:07:18

by Kamezawa Hiroyuki

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 13 Jul 2010 16:58:08 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 13 Jul 2010 16:34:17 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > Anyway, sparsemem is designed to be aligned to SECTION_SIZE of memmap.
> > Please avoid adding new Spaghetti code without proper configs.
> > Thanks,
>
> Ok, I realized I misunderstand all. Arm doesn't unmap memmap but reuse the page
> for memmap without modifing ptes. My routine only works when ARM uses sparsemem_vmemmap.
> But yes, it isn't.
>
> Hmm...How about using pfn_valid() for FLATMEM or avoid using SPARSEMEM ?
> If you want conrols lower than SPARSEMEM, FLATMEM works better because ARM unmaps memmap.
I mean allocation of the memmap at a lower granularity than SPARSEMEM.


How about stopping the use of SPARSEMEM? What's the benefit? It just eats up
memory for mem_section[].

Sorry,
-Kame

2010-07-13 08:08:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 13 Jul 2010 17:06:56 +0900
Minchan Kim <[email protected]> wrote:

> On Tue, Jul 13, 2010 at 3:40 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Tue, 13 Jul 2010 15:04:00 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >> >> >  2. This can't be help for a case where a section has multiple small holes.
> >> >>
> >> >> I agree. But this(not punched hole but not filled section problem)
> >> >> isn't such case. But it would be better to handle it altogether. :)
> >> >>
> >> >> >
> >> >> > Then, my proposal for HOLES_IN_MEMMAP sparsemem is below.
> >> >> > ==
> >> >> > Some architectures unmap memmap[] for memory holes even with SPARSEMEM.
> >> >> > To handle that, pfn_valid() should check there are really memmap or not.
> >> >> > For that purpose, __get_user() can be used.
> >> >>
> >> >> Look at free_unused_memmap. We don't unmap pte of hole memmap.
> >> >> Is __get_use effective, still?
> >> >>
> >> > __get_user() works with TLB and page table, the vaddr is really mapped or not.
> >> > If you got SEGV, __get_user() returns -EFAULT. It works per page granule.
> >>
> >> I mean following as.
> >> For example, there is a struct page in on 0x20000000.
> >>
> >> int pfn_valid_mapped(unsigned long pfn)
> >> {
> >>        struct page *page = pfn_to_page(pfn); /* hole page is 0x2000000 */
> >>        char *lastbyte = (char *)(page+1)-1;  /* lastbyte is 0x2000001f */
> >>        char byte;
> >>
> >>        /* We pass this test since free_unused_memmap doesn't unmap pte */
> >>        if(__get_user(byte, page) != 0)
> >>                return 0;
> >
> > why ? When the page size is 4096 byte.
> >
> >      0x1ffff000 - 0x1ffffffff
> >      0x20000000 - 0x200000fff are on the same page. And memory is mapped per page.
>
> sizeof(struct page) is 32 byte.
> So lastbyte is address of struct page + 32 byte - 1.
>
> > What we access by above __get_user() is a byte at [0x20000000, 0x20000001)
>
> Right.
>
> > and it's unmapped if 0x20000000 is unmapped.
>
> free_unused_memmap doesn't unmap pte although it returns the page to
> free list of buddy.
>
OK, I understood. Please see my latest mail and ignore all the others.

-kame

2010-07-13 09:30:43

by Johannes Weiner

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 12:53:48AM +0900, Minchan Kim wrote:
> Kukjin, Could you test below patch?
> I don't have any sparsemem system. Sorry.
>
> -- CUT DOWN HERE --
>
> Kukjin reported oops happen while he change min_free_kbytes
> http://www.spinics.net/lists/arm-kernel/msg92894.html
> It happen by memory map on sparsemem.
>
> The system has a memory map following as.
> section 0 section 1 section 2
> 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
> SECTION_SIZE_BITS 28(256M)
>
> It means section 0 is an incompletely filled section.
> Nontheless, current pfn_valid of sparsemem checks pfn loosely.
>
> It checks only mem_section's validation.
> So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
> It's not what we want.
>
> The Following patch adds check valid pfn range check on pfn_valid of sparsemem.

Look at the declaration of struct mem_section for a second. It is
meant to partition address space uniformly into backed and unbacked
areas.

It comes with implicit size and offset information by means of
SECTION_SIZE_BITS and the section's index in the section array.

Now you are not okay with the _granularity_ but propose to change _the
model_ by introducing a subsection within each section and at the same
time make the concept of a section completely meaningless: its size
becomes arbitrary and its associated mem_map and flags will apply to
the subsection only.

My question is: if the sections are not fine-grained enough, why not
just make them finer?

The biggest possible section size to describe the memory population on
this machine accurately is 16M. Why not set SECTION_SIZE_BITS to 24?

2010-07-13 09:37:19

by Mel Gorman

Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 12:53:48AM +0900, Minchan Kim wrote:
> Kukjin, Could you test below patch?
> I don't have any sparsemem system. Sorry.
>
> -- CUT DOWN HERE --
>
> Kukjin reported oops happen while he change min_free_kbytes
> http://www.spinics.net/lists/arm-kernel/msg92894.html
> It happen by memory map on sparsemem.
>
> The system has a memory map following as.
> section 0 section 1 section 2
> 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
> SECTION_SIZE_BITS 28(256M)
>
> It means section 0 is an incompletely filled section.
> Nontheless, current pfn_valid of sparsemem checks pfn loosely.
>
> It checks only mem_section's validation.

Note that this is meant to be fine as far as sparsemem is concerned. This is
how it functions by design - if a section is valid, all pages within that
section are valid. Hence, I consider the comment "sparsemem checks pfn loosely"
unfair as it's as strong as required until someone breaks the model.

It's ARM that is punching holes here, and as the hole is at the beginning
of a section, the zone boundaries could have been adjusted after the hole
was punched.

Anyway...

> So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
> It's not what we want.
>
> The Following patch adds check valid pfn range check on pfn_valid of sparsemem.
>
> Signed-off-by: Minchan Kim <[email protected]>
> Reported-by: Kukjin Kim <[email protected]>
>
> P.S)
> It is just RFC. If we agree with this, I will make the patch on mmotm.
>
> --
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index b4d109e..6c2147a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -979,6 +979,8 @@ struct mem_section {
> struct page_cgroup *page_cgroup;
> unsigned long pad;
> #endif
> + unsigned long start_pfn;
> + unsigned long end_pfn;
> };

ouch, 8 to 16 bytes increase on a commonly-used structure. Minimally, I
would only expect this to be defined for
CONFIG_ARCH_HAS_HOLES_MEMORYMODEL. Similarly for the rest of the patch,
I'd expect the "strong" pfn_valid checks to only exist on ARM because
nothing else needs it.

>
> #ifdef CONFIG_SPARSEMEM_EXTREME
> @@ -1039,6 +1041,12 @@ static inline int valid_section(struct mem_section *section)
> return (section && (section->section_mem_map & SECTION_HAS_MEM_MAP));
> }
>
> +static inline int valid_section_pfn(struct mem_section *section, unsigned long pfn)
> +{
> + return ((section && (section->section_mem_map & SECTION_HAS_MEM_MAP)) &&
> + (section->start_pfn <= pfn && pfn < section->end_pfn));
> +}
> +
> static inline int valid_section_nr(unsigned long nr)
> {
> return valid_section(__nr_to_section(nr));
> @@ -1053,7 +1061,7 @@ static inline int pfn_valid(unsigned long pfn)
> {
> if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
> return 0;
> - return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
> + return valid_section_pfn(__nr_to_section(pfn_to_section_nr(pfn)), pfn);
> }
>
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 95ac219..bde9090 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -195,6 +195,8 @@ void __init memory_present(int nid, unsigned long start, unsigned long end)
> if (!ms->section_mem_map)
> ms->section_mem_map = sparse_encode_early_nid(nid) |
> SECTION_MARKED_PRESENT;
> + ms->start_pfn = start;
> + ms->end_pfn = end;
> }
> }
>

I'd also expect the patch to then delete memmap_valid_within and replace
it with a pfn_valid_within() check (which in turn should call the strong
version of pfn_valid()).

I prefer Kamezawa's suggestion of mapping on a ZERO_PAGE-like page full
of PageReserved struct pages because it would have better performance
and be more in line with maintaining the assumptions of the memory
model. If we go in this direction, I would strongly prefer it was an
ARM-only thing.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-07-13 09:48:07

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 10:37:00AM +0100, Mel Gorman wrote:
> I prefer Kamezawa's suggestion of mapping on a ZERO_PAGE-like page full
> of PageReserved struct pages because it would have better performance
> and be more in line with maintaining the assumptions of the memory
> model. If we go in this direction, I would strongly prefer it was an
> ARM-only thing.

As I've said, this is not possible without doing some serious page
manipulation.

Plus the pages that were there become unusable as they don't correspond
with a PFN or obey phys_to_virt(). So there's absolutely no point to
this.

Now, why do we free the holes in the mem_map - because these holes can
be extremely large. Every 512K of hole equates to one page of mem_map
array. Balance that against memory placed at 0xc0000000 physical on
some platforms, and with PHYSMEM_BITS at 32 and SECTION_SIZE_BITS at
19 - well, you do the maths. The result is certainly not pretty.

2010-07-13 10:00:52

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 10:46:12AM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 13, 2010 at 10:37:00AM +0100, Mel Gorman wrote:
> > I prefer Kamezawa's suggestion of mapping on a ZERO_PAGE-like page full
> > of PageReserved struct pages because it would have better performance
> > and be more in line with maintaining the assumptions of the memory
> > model. If we go in this direction, I would strongly prefer it was an
> > ARM-only thing.
>
> As I've said, this is not possible without doing some serious page
> manipulation.
>

Yep, x86 used to do something like this for discontig. It wasn't pretty.

> Plus the pages that where there become unusable as they don't correspond
> with a PFN or obey phys_to_virt(). So there's absolutely no point to
> this.
>
> Now, why do we free the holes in the mem_map - because these holes can
> be extremely large. Every 512K of hole equates to one page of mem_map
> array.

Sure, the holes might be large but at least they are contiguous. Is
there ever a case where you have

512K_Valid 512K_Hole 512K_Valid 512K_Hole

or is it typically

512K_hole 512K_hole ...... 512K_Valid 512K_Valid etc

If holes are typically contiguous, memmap is not allocated in the first place
and the savings from punching holes in memmap in the latter configuration
are minimal.

I recognise if you have a 2M section with a hole in it, you are
potentially wasting 3 pages on unused memmap. If this is really a problem,
Minchan's modification to sparsemem to increase the size of mem_section on
CONFIG_ARCH_HAS_HOLES_MEMORYMODEL is a messy option. I say messy because
it only works if the hole is on either end of the section and it's adding
quirks to the memory model.

> Balance that against memory placed at 0xc0000000 physical on
> some platforms, and with PHYSMEM_BITS at 32 and SECTION_SIZE_BITS at
> 19 - well, you do the maths. The result is certainly not pretty.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-07-13 15:43:49

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 11:30:06AM +0200, Johannes Weiner wrote:
> On Tue, Jul 13, 2010 at 12:53:48AM +0900, Minchan Kim wrote:
> > Kukjin, Could you test below patch?
> > I don't have any sparsemem system. Sorry.
> >
> > -- CUT DOWN HERE --
> >
> > Kukjin reported oops happen while he change min_free_kbytes
> > http://www.spinics.net/lists/arm-kernel/msg92894.html
> > It happen by memory map on sparsemem.
> >
> > The system has a memory map following as.
> > section 0 section 1 section 2
> > 0x20000000-0x25000000, 0x40000000-0x50000000, 0x50000000-0x58000000
> > SECTION_SIZE_BITS 28(256M)
> >
> > It means section 0 is an incompletely filled section.
> > Nontheless, current pfn_valid of sparsemem checks pfn loosely.
> >
> > It checks only mem_section's validation.
> > So in above case, pfn on 0x25000000 can pass pfn_valid's validation check.
> > It's not what we want.
> >
> > The Following patch adds check valid pfn range check on pfn_valid of sparsemem.
>
> Look at the declaration of struct mem_section for a second. It is
> meant to partition address space uniformly into backed and unbacked
> areas.
>
> It comes with implicit size and offset information by means of
> SECTION_SIZE_BITS and the section's index in the section array.
>
> Now you are not okay with the _granularity_ but propose to change _the
> model_ by introducing a subsection within each section and at the same
> time make the concept of a section completely meaningless: its size
> becomes arbitrary and its associated mem_map and flags will apply to
> the subsection only.
>
> My question is: if the sections are not fine-grained enough, why not
> just make them?
>
> The biggest possible section size to describe the memory population on
> this machine accurately is 16M. Why not set SECTION_SIZE_BITS to 24?

You're right. AFAIK, Kukjin tried it but Russell and others rejected it.
Let's wrap it up.

First of all, Thanks for joining good discussion, Kame, Hannes, Mel and
Russell.

The system has the following memory map.
0x20000000-0x25000000 (80M), 432M hole, 0x40000000-0x50000000 (256M),
0x50000000-0x58000000 (128M)

1) FLATMEM
If it uses FLATMEM, it wastes 864 (432M/512K) pages on memmap for the hole.
That's horrible.

2) SPARSEMEM(16M)
It makes 56 mem_sections. It costs 448 (56 * 8) bytes.
It doesn't make unused memmap. So good.

3) SPARSEMEM(256M)
It makes 3 mem_sections. It costs 24 (3 * 8) bytes.
And if we free the unused memmap on 176M (256M - 80M), we can save 352 pages.

3 is best for memory usage, but for 3 we would have to check pfn_valid more
tightly. My patch can do that check, but the mm guys didn't like it since it
makes the memory model messy for the sake of some funny architecture (ie,
sparsemem was designed on the assumption that sections contain no holes), and
it still has a problem if there is a hole in the middle of a section.

3's advantage over 2 in memory usage is not a big deal.
Even if the system uses its memory space fully (MAX_PHYSMEM_BITS 31), 2 just
consumes 1024 (128 * 8) bytes. So now I think the best solution is 2.

Russell. What do you think about it?

--
Kind regards,
Minchan Kim

2010-07-13 16:35:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Wed, 2010-07-14 at 00:43 +0900, Minchan Kim wrote:
> 3 is not a big deal than 2 about memory usage.
> If the system use memory space fully(MAX_PHYSMEM_BITS 31), it just consumes
> 1024(128 * 8) byte. So now I think best solution is 2.
>
> Russell. What do you think about it?

I'm not Russell, but I'll tell you what I think. :)

Make the sections 16MB. Your suggestion to add the start/end pfns
_doubles_ the size of the structure, and its size overhead. We have
systems with a pretty tremendous amount of memory with 16MB sections.

If you _really_ can't make the section size smaller, and the vast
majority of the sections are fully populated, you could hack something
in. We could, for instance, have a global list that's mostly readonly
which tells you which sections need to have their sizes closely
inspected. That would work OK if, for instance, you only needed to
check a couple of memory sections in the system. It'll start to suck if
you made the lists very long.

-- Dave

2010-07-13 16:44:36

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

Hi, Dave.

On Tue, Jul 13, 2010 at 09:35:33AM -0700, Dave Hansen wrote:
> On Wed, 2010-07-14 at 00:43 +0900, Minchan Kim wrote:
> > 3 is not a big deal than 2 about memory usage.
> > If the system use memory space fully(MAX_PHYSMEM_BITS 31), it just consumes
> > 1024(128 * 8) byte. So now I think best solution is 2.
> >
> > Russell. What do you think about it?
>
> I'm not Russell, but I'll tell you what I think. :)
>

No problem. :)

> Make the sections 16MB. You suggestion to add the start/end pfns

I hope so.

> _doubles_ the size of the structure, and its size overhead. We have
> systems with a pretty tremendous amount of memory with 16MB sections.

Yes, it does on server systems with several GB of memory.

>
> If you _really_ can't make the section size smaller, and the vast
> majority of the sections are fully populated, you could hack something
> in. We could, for instance, have a global list that's mostly readonly
> which tells you which sections need to be have their sizes closely
> inspected. That would work OK if, for instance, you only needed to
> check a couple of memory sections in the system. It'll start to suck if
> you made the lists very long.

Thanks for the advice. As I said, I hope Russell accepts the 16M section.

>
> -- Dave
>

--
Kind regards,
Minchan Kim

2010-07-13 18:40:22

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, Jul 13, 2010 at 05:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> How about stop using SPARSEMEM ? What's the benefit ? It just eats up
> memory for mem_section[].

The problem with that approach is that sometimes the mem_map array
doesn't fit into any memory banks.

We've gone around the loop of using flatmem with holes punched in it,
to using discontigmem, and now to using sparsemem. It seems none of
these solutions does what we need for ARM. I guess that's the price
we pay for not having memory architected to be at any particular place
in the physical memory map.

Lately we're even seeing setups where system memory is split into two
areas, and the second (higher physical address) bank is populated
before the lower one... These kinds of games are getting rather
stupid and idiotic, but we're not the hardware designers and so we have
to live with it - or just tell the folk who are porting the kernel to
these platforms that we'll never take their patches.

2010-07-13 20:47:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Tue, 2010-07-13 at 19:39 +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 13, 2010 at 05:02:22PM +0900, KAMEZAWA Hiroyuki wrote:
> > How about stop using SPARSEMEM ? What's the benefit ? It just eats up
> > memory for mem_section[].
>
> The problem with that approach is that sometimes the mem_map array
> doesn't fit into any memory banks.
>
> We've gone around the loop of using flatmem with holes punched in it,
> to using discontigmem, and now to using sparsemem. It seems none of
> these solutions does what we need for ARM. I guess that's the price
> we pay for not having memory architected to be at any particular place
> in the physical memory map.

What's the ARM hardware's maximum addressable memory these days? 4GB?

A 4GB system would have 256 sections, which means 256*2*sizeof(unsigned
long) for the mem_section[]. That's a pretty small amount of RAM.

What sizes are the holes that are being punched these days? Smaller
than 16MB?

-- Dave

2010-07-14 00:27:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Wed, 14 Jul 2010 01:44:23 +0900
Minchan Kim <[email protected]> wrote:

> > If you _really_ can't make the section size smaller, and the vast
> > majority of the sections are fully populated, you could hack something
> > in. We could, for instance, have a global list that's mostly readonly
> > which tells you which sections need to be have their sizes closely
> > inspected. That would work OK if, for instance, you only needed to
> > check a couple of memory sections in the system. It'll start to suck if
> > you made the lists very long.
>
> Thanks for advise. As I say, I hope Russell accept 16M section.
>

It seems what I needed was good sleep....
How about this if 16M section is not acceptable ?

== NOT TESTED AT ALL, EVEN NOT COMPILED ==

register address of mem_section to memmap itself's page struct's pg->private field.
This means the page is used for memmap of the section.
Otherwise, the page is used for other purpose and memmap has a hole.

---
arch/arm/mm/init.c | 11 ++++++++++-
include/linux/mmzone.h | 19 ++++++++++++++++++-
mm/sparse.c | 37 +++++++++++++++++++++++++++++++++++++
3 files changed, 65 insertions(+), 2 deletions(-)

Index: mmotm-2.6.35-0701/include/linux/mmzone.h
===================================================================
--- mmotm-2.6.35-0701.orig/include/linux/mmzone.h
+++ mmotm-2.6.35-0701/include/linux/mmzone.h
@@ -1047,11 +1047,28 @@ static inline struct mem_section *__pfn_
return __nr_to_section(pfn_to_section_nr(pfn));
}

+#ifdef CONFIG_SPARSEMEM_HAS_PIT
+void mark_memmap_pit(unsigned long start, unsigned long end, bool valid);
+static inline int page_valid(struct mem_section *ms, unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct page *__pg = virt_to_page(page);
+ return __pg->private == (unsigned long)ms;
+}
+#else
+static inline int page_valid(struct mem_section *ms, unsigned long pfn)
+{
+ return 1;
+}
+#endif
+
static inline int pfn_valid(unsigned long pfn)
{
+ struct mem_section *ms;
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
- return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+ ms = __nr_to_section(pfn_to_section_nr(pfn));
+ return valid_section(ms) && page_valid(ms, pfn);
}

static inline int pfn_present(unsigned long pfn)
Index: mmotm-2.6.35-0701/mm/sparse.c
===================================================================
--- mmotm-2.6.35-0701.orig/mm/sparse.c
+++ mmotm-2.6.35-0701/mm/sparse.c
@@ -615,6 +615,43 @@ void __init sparse_init(void)
free_bootmem(__pa(usemap_map), size);
}

+#ifdef CONFIG_SPARSEMEM_HAS_PIT
+/*
+ * Fill memmap's pg->private with a pointer to mem_section.
+ * pfn_valid() will check this later. (see include/linux/mmzone.h)
+ * The caller should call
+ * mark_memmap_pit(start, end, true) # for all allocated mem_map
+ * and, after that,
+ * mark_memmap_pit(start, end, false) # for all pits in mem_map.
+ * please see usage in ARM.
+ */
+void mark_memmap_pit(unsigned long start, unsigned long end, bool valid)
+{
+ struct mem_section *ms;
+ unsigned long pos, next;
+ struct page *pg;
+ void *memmap, *mend;
+
+ for (pos = start; pos < end; pos = next) {
+ next = (pos + PAGES_PER_SECTION) & PAGE_SECTION_MASK;
+ ms = __pfn_to_section(pos);
+ if (!valid_section(ms))
+ continue;
+ for (memmap = pfn_to_page(pos), mend = pfn_to_page(next - 1) + 1;
+ memmap < mend;
+ memmap += PAGE_SIZE) {
+ pg = virt_to_page(memmap);
+ if (valid)
+ pg->private = (unsigned long)ms;
+ else
+ pg->private = 0;
+ }
+ }
+}
+#endif
+
#ifdef CONFIG_MEMORY_HOTPLUG
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
Index: mmotm-2.6.35-0701/arch/arm/mm/init.c
===================================================================
--- mmotm-2.6.35-0701.orig/arch/arm/mm/init.c
+++ mmotm-2.6.35-0701/arch/arm/mm/init.c
@@ -234,6 +234,13 @@ static void __init arm_bootmem_free(stru
arch_adjust_zones(zone_size, zhole_size);

free_area_init_node(0, zone_size, min, zhole_size);
+
+#ifdef CONFIG_SPARSEMEM
+ for_each_bank(i, mi) {
+ mark_memmap_pit(bank_start_pfn(mi->bank[i]),
+ bank_end_pfn(mi->bank[i]), true);
+ }
+#endif
}

#ifndef CONFIG_SPARSEMEM
@@ -386,8 +393,10 @@ free_memmap(unsigned long start_pfn, uns
* If there are free pages between these,
* free the section of the memmap array.
*/
- if (pg < pgend)
+ if (pg < pgend) {
+ mark_memmap_pit(pg >> PAGE_SHIFT, pgend >> PAGE_SHIFT, false);
free_bootmem(pg, pgend - pg);
+ }
}

/*

2010-07-14 06:44:44

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

Hi, Kame.

On Wed, Jul 14, 2010 at 9:23 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 14 Jul 2010 01:44:23 +0900
> Minchan Kim <[email protected]> wrote:
>
>> > If you _really_ can't make the section size smaller, and the vast
>> > majority of the sections are fully populated, you could hack something
>> > in.  We could, for instance, have a global list that's mostly readonly
>> > which tells you which sections need to be have their sizes closely
>> > inspected.  That would work OK if, for instance, you only needed to
>> > check a couple of memory sections in the system.  It'll start to suck if
>> > you made the lists very long.
>>
>> Thanks for advise. As I say, I hope Russell accept 16M section.
>>
>
> It seems what I needed was good sleep....
> How about this if 16M section is not acceptable ?
>
> == NOT TESTED AT ALL, EVEN NOT COMPILED ==
>
> register address of mem_section to memmap itself's page struct's pg->private field.
> This means the page is used for memmap of the section.
> Otherwise, the page is used for other purpose and memmap has a hole.

It's a very good idea. :)
But can this handle the case where a memmap page holds struct page
descriptors for a hole?
I mean, one page can hold 128 page descriptors (4096 / 32).
Suppose 64 of those descriptors are valid but the remaining 64 cover a hole.
In that case, free_memmap doesn't free the page.

I think most systems will have memory aligned to 512K (4K * 128).
But I am not sure.
--
Kind regards,
Minchan Kim

2010-07-14 07:15:48

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Wed, 14 Jul 2010 15:44:41 +0900
Minchan Kim <[email protected]> wrote:

> Hi, Kame.
>
> On Wed, Jul 14, 2010 at 9:23 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Wed, 14 Jul 2010 01:44:23 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >> > If you _really_ can't make the section size smaller, and the vast
> >> > majority of the sections are fully populated, you could hack something
> >> > in.  We could, for instance, have a global list that's mostly readonly
> >> > which tells you which sections need to be have their sizes closely
> >> > inspected.  That would work OK if, for instance, you only needed to
> >> > check a couple of memory sections in the system.  It'll start to suck if
> >> > you made the lists very long.
> >>
> >> Thanks for advise. As I say, I hope Russell accept 16M section.
> >>
> >
> > It seems what I needed was good sleep....
> > How about this if 16M section is not acceptable ?
> >
> > == NOT TESTED AT ALL, EVEN NOT COMPILED ==
> >
> > register address of mem_section to memmap itself's page struct's pg->private field.
> > This means the page is used for memmap of the section.
> > Otherwise, the page is used for other purpose and memmap has a hole.
>
> It's a very good idea. :)
> But can this handle case that a page on memmap pages have struct page
> descriptor of hole?
> I mean one page can include 128 page descriptor(4096 / 32).
yes.

> In there, 64 page descriptor is valid but remain 64 page descriptor is on hole.
> In this case, free_memmap doesn't free the page.

yes. but in that case, there are valid page descriptors for the 64 pages
of hole. pfn_valid() should return true but PG_reserved is set.
(This is the usual behavior.)

My intention is that

- When all 128 page descriptors are unused, free_memmap() will free it.
In that case, clear page->private of the page backing the freed descriptors.

- When some of the page descriptors are used, free_memmap() can't free it
and page->private points to &mem_section. We may have memmap for a memory
hole, but pfn_valid() is a function to check whether there is memmap or not.
The behavior of pfn_valid() is valid.
Anyway, you can't free only half of a page.

If my code doesn't work as above, it's a bug.


Thanks,
-Kame

2010-07-14 07:35:25

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Wed, Jul 14, 2010 at 4:10 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Wed, 14 Jul 2010 15:44:41 +0900
> Minchan Kim <[email protected]> wrote:
>
>> Hi, Kame.
>>
>> On Wed, Jul 14, 2010 at 9:23 AM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> > On Wed, 14 Jul 2010 01:44:23 +0900
>> > Minchan Kim <[email protected]> wrote:
>> >
>> >> > If you _really_ can't make the section size smaller, and the vast
>> >> > majority of the sections are fully populated, you could hack something
>> >> > in.  We could, for instance, have a global list that's mostly readonly
>> >> > which tells you which sections need to be have their sizes closely
>> >> > inspected.  That would work OK if, for instance, you only needed to
>> >> > check a couple of memory sections in the system.  It'll start to suck if
>> >> > you made the lists very long.
>> >>
>> >> Thanks for advise. As I say, I hope Russell accept 16M section.
>> >>
>> >
>> > It seems what I needed was good sleep....
>> > How about this if 16M section is not acceptable ?
>> >
>> > == NOT TESTED AT ALL, EVEN NOT COMPILED ==
>> >
>> > register address of mem_section to memmap itself's page struct's pg->private field.
>> > This means the page is used for memmap of the section.
>> > Otherwise, the page is used for other purpose and memmap has a hole.
>>
>> It's a very good idea. :)
>> But can this handle case that a page on memmap pages have struct page
>> descriptor of hole?
>> I mean one page can include 128 page descriptor(4096 / 32).
> yes.
>
>> In there, 64 page descriptor is valid but remain 64 page descriptor is on hole.
>> In this case, free_memmap doesn't free the page.
>
> yes. but in that case, there are valid page decriptor for 64pages of holes.
> pfn_valid() should return true but PG_reserved is set.
> (This is usual behavior.)
>
> My intention is that
>
> ?- When all 128 page descriptors are unused, free_memmap() will free it.
> ? In that case, clear page->private of a page for freed page descriptors.
>
> ?- When some of page descriptors are used, free_memmap() can't free it
> ? and page->private points to &mem_section. We may have memmap for memory
> ? hole but pfn_valid() is a function to check there is memmap or not.
> ? The bahavior of pfn_valid() is valid.
> ? Anyway, you can't free only half of page.

Okay. I missed PageReserved.
Your idea seems good. :)

I looked at pagetypeinfo_showblockcount_print.
It doesn't check PageReserved. Instead, it does the ugly memmap_valid_within.
Can't we remove that and replace it with a PageReserved check?


--
Kind regards,
Minchan Kim

2010-07-14 07:44:10

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Wed, 14 Jul 2010 16:35:22 +0900
Minchan Kim <[email protected]> wrote:

> On Wed, Jul 14, 2010 at 4:10 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Wed, 14 Jul 2010 15:44:41 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> >> Hi, Kame.
> >>
> >> On Wed, Jul 14, 2010 at 9:23 AM, KAMEZAWA Hiroyuki
> >> <[email protected]> wrote:
> >> > On Wed, 14 Jul 2010 01:44:23 +0900
> >> > Minchan Kim <[email protected]> wrote:
> >> >
> >> >> > If you _really_ can't make the section size smaller, and the vast
> >> >> > majority of the sections are fully populated, you could hack something
> >> >> > in.  We could, for instance, have a global list that's mostly readonly
> >> >> > which tells you which sections need to be have their sizes closely
> >> >> > inspected.  That would work OK if, for instance, you only needed to
> >> >> > check a couple of memory sections in the system.  It'll start to suck if
> >> >> > you made the lists very long.
> >> >>
> >> >> Thanks for advise. As I say, I hope Russell accept 16M section.
> >> >>
> >> >
> >> > It seems what I needed was good sleep....
> >> > How about this if 16M section is not acceptable ?
> >> >
> >> > == NOT TESTED AT ALL, EVEN NOT COMPILED ==
> >> >
> >> > register address of mem_section to memmap itself's page struct's pg->private field.
> >> > This means the page is used for memmap of the section.
> >> > Otherwise, the page is used for other purpose and memmap has a hole.
> >>
> >> It's a very good idea. :)
> >> But can this handle case that a page on memmap pages have struct page
> >> descriptor of hole?
> >> I mean one page can include 128 page descriptor(4096 / 32).
> > yes.
> >
> >> In there, 64 page descriptor is valid but remain 64 page descriptor is on hole.
> >> In this case, free_memmap doesn't free the page.
> >
> > yes. but in that case, there are valid page decriptor for 64pages of holes.
> > pfn_valid() should return true but PG_reserved is set.
> > (This is usual behavior.)
> >
> > My intention is that
> >
> >  - When all 128 page descriptors are unused, free_memmap() will free it.
> >   In that case, clear page->private of a page for freed page descriptors.
> >
> >  - When some of page descriptors are used, free_memmap() can't free it
> >   and page->private points to &mem_section. We may have memmap for memory
> >   hole but pfn_valid() is a function to check there is memmap or not.
> >   The bahavior of pfn_valid() is valid.
> >   Anyway, you can't free only half of page.
>
> Okay. I missed PageReserved.
> Your idea seems to be good. :)
>
> I looked at pagetypeinfo_showblockcount_print.
> It doesn't check PageReserved. Instead of it, it does ugly memmap_valid_within.
> Can't we remove it and change it with PageReserved?
>
maybe. but I'm not sure how many archs use CONFIG_ARCH_HAS_HOLES_MEMORYMODEL.
Because my idea requires adding an arch-dependent hook, the enhancement of
pfn_valid() happens only when an arch supports it. So, you may need a
conservative path.

Anyway, I can't test the patch by myself. So, I pass the ball to the ARM guys.
Feel free to reuse my idea if you like.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Thanks,
-Kame



2010-07-14 07:50:51

by Kukjin Kim

[permalink] [raw]
Subject: RE: [RFC] Tight check of pfn_valid on sparsemem

KAMEZAWA Hiroyuki wrote:
>
> On Wed, 14 Jul 2010 01:44:23 +0900
> Minchan Kim <[email protected]> wrote:
>
> > > If you _really_ can't make the section size smaller, and the vast
> > > majority of the sections are fully populated, you could hack something
> > > in. We could, for instance, have a global list that's mostly readonly
> > > which tells you which sections need to be have their sizes closely
> > > inspected. That would work OK if, for instance, you only needed to
> > > check a couple of memory sections in the system. It'll start to suck if
> > > you made the lists very long.
> >
> > Thanks for advise. As I say, I hope Russell accept 16M section.
> >
Hi,

Thanks for your inputs.
>
> It seems what I needed was good sleep....
> How about this if 16M section is not acceptable ?
>
> == NOT TESTED AT ALL, EVEN NOT COMPILED ==

Yeah...

Couldn't build with s5pv210_defconfig when using the mmotm tree,
and couldn't apply your patch against latest mainline 35-rc5.

Could you please remake your patch against mainline 35-rc5?
Or...please let me know how I can test it on my board (smdkv210).

>
> register address of mem_section to memmap itself's page struct's pg->private field.
> This means the page is used for memmap of the section.
> Otherwise, the page is used for other purpose and memmap has a hole.
>
> ---
> arch/arm/mm/init.c | 11 ++++++++++-
> include/linux/mmzone.h | 19 ++++++++++++++++++-
> mm/sparse.c | 37 +++++++++++++++++++++++++++++++++++++
> 3 files changed, 65 insertions(+), 2 deletions(-)
>
> Index: mmotm-2.6.35-0701/include/linux/mmzone.h
> ===================================================================
> --- mmotm-2.6.35-0701.orig/include/linux/mmzone.h
> +++ mmotm-2.6.35-0701/include/linux/mmzone.h
> @@ -1047,11 +1047,28 @@ static inline struct mem_section *__pfn_
> return __nr_to_section(pfn_to_section_nr(pfn));
> }
>
> +#ifdef CONFIG_SPARSEMEM_HAS_PIT
> +void mark_memmap_pit(unsigned long start, unsigned long end, bool valid);
> +static inline int page_valid(struct mem_section *ms, unsigned long pfn)
> +{
> + struct page *page = pfn_to_page(pfn);
> + struct page *__pg = virt_to_page(page);
> + return __pg->private == ms;
> +}
> +#else
> +static inline int page_valid(struct mem_section *ms, unsigned long pfn)
> +{
> + return 1;
> +}
> +#endif
> +
> static inline int pfn_valid(unsigned long pfn)
> {
> + struct mem_section *ms;
> if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
> return 0;
> - return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
> + ms = __nr_to_section(pfn_to_section_nr(pfn));
> + return valid_section(ms) && page_valid(ms, pfn);
> }
>
> static inline int pfn_present(unsigned long pfn)
> Index: mmotm-2.6.35-0701/mm/sparse.c
> ===================================================================
> --- mmotm-2.6.35-0701.orig/mm/sparse.c
> +++ mmotm-2.6.35-0701/mm/sparse.c
> @@ -615,6 +615,43 @@ void __init sparse_init(void)
> free_bootmem(__pa(usemap_map), size);
> }
>
> +#ifdef CONFIG_SPARSEMEM_HAS_PIT
> +/*
> + * Fill memmap's pg->private with a pointer to mem_section.
> + * pfn_valid() will check this later. (see include/linux/mmzone.h)
> + * The caller should call
> + * mark_memmap_pit(start, end, true) # for all allocated mem_map
> + * and, after that,
> + * mark_memmap_pit(start, end, false) # for all pits in mem_map.
> + * please see usage in ARM.
> + */
> +void mark_memmap_pit(unsigned long start, unsigned long end, bool valid)
> +{
> + struct mem_section *ms;
> + unsigned long pos, next;
> + struct page *pg;
> + void *memmap, *end;
> + unsigned long mapsize = sizeof(struct page) * PAGES_PER_SECTION;
> +
> + for (pos = start;
> + pos < end; pos = next) {
> next = (pos + PAGES_PER_SECTION) & PAGE_SECTION_MASK;
> + ms = __pfn_to_section(pos);
> + if (!valid_section(ms))
> + continue;
> + for (memmap = pfn_to_page(pfn), end = pfn_to_page(next-1);
> + memmap != end + 1;
> + memmap += PAGE_SIZE) {
> + pg = virt_to_page(memmap);
> + if (valid)
> + pg->private = ms;
> + else
> + pg->private = NULL;
> + }
> + }
> +}
> +#endif
> +
> #ifdef CONFIG_MEMORY_HOTPLUG
> #ifdef CONFIG_SPARSEMEM_VMEMMAP
> static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
> Index: mmotm-2.6.35-0701/arch/arm/mm/init.c
> ===================================================================
> --- mmotm-2.6.35-0701.orig/arch/arm/mm/init.c
> +++ mmotm-2.6.35-0701/arch/arm/mm/init.c
> @@ -234,6 +234,13 @@ static void __init arm_bootmem_free(stru
> arch_adjust_zones(zone_size, zhole_size);
>
> free_area_init_node(0, zone_size, min, zhole_size);
> +
> +#ifdef CONFIG_SPARSEMEM
> + for_each_bank(i, mi) {
> + mark_memmap_pit(bank_start_pfn(mi->bank[i]),
> + bank_end_pfn(mi->bank[i]), true);
> + }
> +#endif
> }
>
> #ifndef CONFIG_SPARSEMEM
> @@ -386,8 +393,10 @@ free_memmap(unsigned long start_pfn, uns
> * If there are free pages between these,
> * free the section of the memmap array.
> */
> - if (pg < pgend)
> + if (pg < pgend) {
> + mark_memmap_pit(pg >> PAGE_SHIFT, pgend >> PAGE_SHIFT, false);
> free_bootmem(pg, pgend - pg);
> + }
> }
>
> /*



Thanks.

Best regards,
Kgene.
--
Kukjin Kim <[email protected]>, Senior Engineer,
SW Solution Development Team, Samsung Electronics Co., Ltd.

2010-07-14 08:14:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC] Tight check of pfn_valid on sparsemem

On Wed, 14 Jul 2010 16:50:25 +0900
Kukjin Kim <[email protected]> wrote:

> KAMEZAWA Hiroyuki wrote:
> >
> > On Wed, 14 Jul 2010 01:44:23 +0900
> > Minchan Kim <[email protected]> wrote:
> >
> > > > If you _really_ can't make the section size smaller, and the vast
> > > > majority of the sections are fully populated, you could hack something
> > > > in. We could, for instance, have a global list that's mostly readonly
> > > > which tells you which sections need to be have their sizes closely
> > > > inspected. That would work OK if, for instance, you only needed to
> > > > check a couple of memory sections in the system. It'll start to suck if
> > > > you made the lists very long.
> > >
> > > Thanks for advise. As I say, I hope Russell accept 16M section.
> > >
> Hi,
>
> Thanks for your inputs.
> >
> > It seems what I needed was good sleep....
> > How about this if 16M section is not acceptable ?
> >
> > == NOT TESTED AT ALL, EVEN NOT COMPILED ==
>
> Yeah...
>
> Couldn't build with s5pv210_defconfig when used mmotm tree,
> And couldn't apply your patch against latest mainline 35-rc5.
>
> Could you please remake your patch against mainline 35-rc5?
> Or...please let me know how I can test on my board(smdkv210).
>
Ahh..how brave you are.

my patch was against mmotm-07-01.
select SPARSEMEM_HAS_PIT in config.
(in menuconfig it will appear under Processor type and features.)

This is a fixed one; maybe mm/sparse.o can be compiled.
At least, arch-generic part is compiled.

(This config should be selected automatically via arm's config.
this patch is just for test.)

Signed-off-by: KAMEZAWA hiroyuki <[email protected]>
---
arch/arm/mm/init.c | 11 ++++++++++-
include/linux/mmzone.h | 19 ++++++++++++++++++-
mm/Kconfig | 5 +++++
mm/sparse.c | 38 +++++++++++++++++++++++++++++++++++++-
4 files changed, 70 insertions(+), 3 deletions(-)

Index: mmotm-2.6.35-0701/include/linux/mmzone.h
===================================================================
--- mmotm-2.6.35-0701.orig/include/linux/mmzone.h
+++ mmotm-2.6.35-0701/include/linux/mmzone.h
@@ -1047,11 +1047,28 @@ static inline struct mem_section *__pfn_
return __nr_to_section(pfn_to_section_nr(pfn));
}

+#ifdef CONFIG_SPARSEMEM_HAS_PIT
+void mark_memmap_pit(unsigned long start, unsigned long end, bool valid);
+static inline int page_valid(struct mem_section *ms, unsigned long pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+ struct page *__pg = virt_to_page(page);
+ return __pg->private == (unsigned long)ms;
+}
+#else
+static inline int page_valid(struct mem_section *ms, unsigned long pfn)
+{
+ return 1;
+}
+#endif
+
static inline int pfn_valid(unsigned long pfn)
{
+ struct mem_section *ms;
if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
return 0;
- return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
+ ms = __nr_to_section(pfn_to_section_nr(pfn));
+ return valid_section(ms) && page_valid(ms, pfn);
}

static inline int pfn_present(unsigned long pfn)
Index: mmotm-2.6.35-0701/mm/sparse.c
===================================================================
--- mmotm-2.6.35-0701.orig/mm/sparse.c
+++ mmotm-2.6.35-0701/mm/sparse.c
@@ -13,7 +13,6 @@
#include <asm/dma.h>
#include <asm/pgalloc.h>
#include <asm/pgtable.h>
-
/*
* Permanent SPARSEMEM data:
*
@@ -615,6 +614,43 @@ void __init sparse_init(void)
free_bootmem(__pa(usemap_map), size);
}

+#ifdef CONFIG_SPARSEMEM_HAS_PIT
+/*
+ * Fill memmap's pg->private with a pointer to mem_section.
+ * pfn_valid() will check this later. (see include/linux/mmzone.h)
+ * The caller should call
+ * mark_memmap_pit(start, end, true) # for all allocated mem_map
+ * and, after that,
+ * mark_memmap_pit(start, end, false) # for all pits in mem_map.
+ * please see usage in ARM.
+ */
+void mark_memmap_pit(unsigned long start, unsigned long end, bool valid)
+{
+ struct mem_section *ms;
+ unsigned long pos, next;
+ struct page *pg;
+ void *memmap, *mapend;
+
+ for (pos = start;
+ pos < end; pos = next) {
+ next = (pos + PAGES_PER_SECTION) & PAGE_SECTION_MASK;
+ ms = __pfn_to_section(pos);
+ if (!valid_section(ms))
+ continue;
+ for (memmap = (void*)pfn_to_page(pos),
+ mapend = pfn_to_page(next-1); /* the last page in section*/
+ memmap < mapend;
+ memmap += PAGE_SIZE) {
+ pg = virt_to_page(memmap);
+ if (valid)
+ pg->private = (unsigned long)ms;
+ else
+ pg->private = 0;
+ }
+ }
+}
+#endif
+
#ifdef CONFIG_MEMORY_HOTPLUG
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static inline struct page *kmalloc_section_memmap(unsigned long pnum, int nid,
Index: mmotm-2.6.35-0701/arch/arm/mm/init.c
===================================================================
--- mmotm-2.6.35-0701.orig/arch/arm/mm/init.c
+++ mmotm-2.6.35-0701/arch/arm/mm/init.c
@@ -234,6 +234,13 @@ static void __init arm_bootmem_free(stru
arch_adjust_zones(zone_size, zhole_size);

free_area_init_node(0, zone_size, min, zhole_size);
+
+#ifdef CONFIG_SPARSEMEM
+ for_each_bank(i, mi) {
+ mark_memmap_pit(bank_start_pfn(mi->bank[i]),
+ bank_end_pfn(mi->bank[i]), true);
+ }
+#endif
}

#ifndef CONFIG_SPARSEMEM
@@ -386,8 +393,10 @@ free_memmap(unsigned long start_pfn, uns
* If there are free pages between these,
* free the section of the memmap array.
*/
- if (pg < pgend)
+ if (pg < pgend) {
+ mark_memmap_pit(pg >> PAGE_SHIFT, pgend >> PAGE_SHIFT, false);
free_bootmem(pg, pgend - pg);
+ }
}

/*
Index: mmotm-2.6.35-0701/mm/Kconfig
===================================================================
--- mmotm-2.6.35-0701.orig/mm/Kconfig
+++ mmotm-2.6.35-0701/mm/Kconfig
@@ -128,6 +128,11 @@ config SPARSEMEM_VMEMMAP
pfn_to_page and page_to_pfn operations. This is the most
efficient option when sufficient kernel resources are available.

+config SPARSEMEM_HAS_PIT
+ bool "allow holes in sparsemem's memmap"
+ depends on SPARSEMEM && !SPARSEMEM_VMEMMAP
+ default n
+
# eventually, we can have this option just 'select SPARSEMEM'
config MEMORY_HOTPLUG
bool "Allow for memory hot-add"