2012-11-06 01:32:27

by Jiang Liu (Gerry)

Subject: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

Changeset 7f1290f2f2 tries to fix an issue when calculating
zone->present_pages, but it causes a regression on 32-bit systems with
HIGHMEM. With that changeset, function reset_zone_present_pages()
resets all zone->present_pages to zero, and fixup_zone_present_pages()
is called to recalculate zone->present_pages when the boot allocator frees
core memory pages into the buddy allocator. Because highmem pages are not
freed by the bootmem allocator, all highmem zones' present_pages become
zero.

Actually there's no need to recalculate present_pages for highmem zones
because the bootmem allocator never allocates pages from them. So fix the
regression by skipping highmem zones in reset_zone_present_pages()
and fixup_zone_present_pages().

Signed-off-by: Jiang Liu <[email protected]>
Signed-off-by: Jianguo Wu <[email protected]>
Reported-by: Maciej Rutecki <[email protected]>
Tested-by: Maciej Rutecki <[email protected]>
Cc: Chris Clayton <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: [email protected]
Cc: [email protected]

---

Hi Maciej,
Thanks for reporting and bisecting. We have analyzed the regression
and worked out a patch for it. Could you please help to verify whether it
fixes the regression?
Thanks!
Gerry

---
mm/page_alloc.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b74de6..2311f15 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
for_each_node_state(nid, N_HIGH_MEMORY) {
for (i = 0; i < MAX_NR_ZONES; i++) {
z = NODE_DATA(nid)->node_zones + i;
- z->present_pages = 0;
+ if (!is_highmem(z))
+ z->present_pages = 0;
}
}
}
@@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,

for (i = 0; i < MAX_NR_ZONES; i++) {
z = NODE_DATA(nid)->node_zones + i;
+ if (is_highmem(z))
+ continue;
+
zone_start_pfn = z->zone_start_pfn;
zone_end_pfn = zone_start_pfn + z->spanned_pages;
-
- /* if the two regions intersect */
if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
z->present_pages += min(end_pfn, zone_end_pfn) -
max(start_pfn, zone_start_pfn);
--
1.7.1


2012-11-06 10:23:47

by Chris Clayton

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d



On 11/06/12 01:31, Jiang Liu wrote:
> Changeset 7f1290f2f2 tries to fix a issue when calculating
> zone->present_pages, but it causes a regression to 32bit systems with
> HIGHMEM. With that changeset, function reset_zone_present_pages()
> resets all zone->present_pages to zero, and fixup_zone_present_pages()
> is called to recalculate zone->present_pages when boot allocator frees
> core memory pages into buddy allocator. Because highmem pages are not
> freed by bootmem allocator, all highmem zones' present_pages becomes
> zero.
>
> Actually there's no need to recalculate present_pages for highmem zone
> because bootmem allocator never allocates pages from them. So fix the
> regression by skipping highmem in function reset_zone_present_pages()
> and fixup_zone_present_pages().
>
> Signed-off-by: Jiang Liu <[email protected]>
> Signed-off-by: Jianguo Wu <[email protected]>
> Reported-by: Maciej Rutecki <[email protected]>
> Tested-by: Maciej Rutecki <[email protected]>
> Cc: Chris Clayton <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
>
> Hi Maciej,
> Thanks for reporting and bisecting. We have analyzed the regression
> and worked out a patch for it. Could you please help to verify whether it
> fix the regression?
> Thanks!
> Gerry
>

Thanks Gerry.

I've applied this patch to 3.7.0-rc4 and can confirm that it fixes the
problem I had with my laptop failing to resume after a suspend to disk.

Tested-by: Chris Clayton <[email protected]>

> ---
> mm/page_alloc.c | 8 +++++---
> 1 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b74de6..2311f15 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
> for_each_node_state(nid, N_HIGH_MEMORY) {
> for (i = 0; i < MAX_NR_ZONES; i++) {
> z = NODE_DATA(nid)->node_zones + i;
> - z->present_pages = 0;
> + if (!is_highmem(z))
> + z->present_pages = 0;
> }
> }
> }
> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>
> for (i = 0; i < MAX_NR_ZONES; i++) {
> z = NODE_DATA(nid)->node_zones + i;
> + if (is_highmem(z))
> + continue;
> +
> zone_start_pfn = z->zone_start_pfn;
> zone_end_pfn = zone_start_pfn + z->spanned_pages;
> -
> - /* if the two regions intersect */
> if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
> z->present_pages += min(end_pfn, zone_end_pfn) -
> max(start_pfn, zone_start_pfn);
>

2012-11-06 20:43:18

by Andrew Morton

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

On Tue, 6 Nov 2012 09:31:57 +0800
Jiang Liu <[email protected]> wrote:

> Changeset 7f1290f2f2 tries to fix a issue when calculating
> zone->present_pages, but it causes a regression to 32bit systems with
> HIGHMEM. With that changeset, function reset_zone_present_pages()
> resets all zone->present_pages to zero, and fixup_zone_present_pages()
> is called to recalculate zone->present_pages when boot allocator frees
> core memory pages into buddy allocator. Because highmem pages are not
> freed by bootmem allocator, all highmem zones' present_pages becomes
> zero.
>
> Actually there's no need to recalculate present_pages for highmem zone
> because bootmem allocator never allocates pages from them. So fix the
> regression by skipping highmem in function reset_zone_present_pages()
> and fixup_zone_present_pages().
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
> for_each_node_state(nid, N_HIGH_MEMORY) {
> for (i = 0; i < MAX_NR_ZONES; i++) {
> z = NODE_DATA(nid)->node_zones + i;
> - z->present_pages = 0;
> + if (!is_highmem(z))
> + z->present_pages = 0;
> }
> }
> }
> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>
> for (i = 0; i < MAX_NR_ZONES; i++) {
> z = NODE_DATA(nid)->node_zones + i;
> + if (is_highmem(z))
> + continue;
> +
> zone_start_pfn = z->zone_start_pfn;
> zone_end_pfn = zone_start_pfn + z->spanned_pages;
> -
> - /* if the two regions intersect */
> if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
> z->present_pages += min(end_pfn, zone_end_pfn) -
> max(start_pfn, zone_start_pfn);

This ... isn't very nice. It embeds within
reset_zone_present_pages() and fixup_zone_present_pages() knowledge
about their caller's state. Or, more specifically, it is embedding
knowledge about the overall state of the system when these functions
are called.

I mean, a function called "reset_zone_present_pages" should reset
->present_pages!

The fact that fixup_zone_present_pages() has multiple call sites makes
this all even more risky. And what are the interactions between this
and memory hotplug?

Can we find a cleaner fix?

Please tell us more about what's happening here. Is it the case that
reset_zone_present_pages() is being called *after* highmem has been
populated? If so, then fixup_zone_present_pages() should work
correctly for highmem? Or is it the case that highmem hasn't yet been
setup? IOW, what is the sequence of operations here?

Is the problem that we're *missing* a call to
fixup_zone_present_pages(), perhaps? If we call
fixup_zone_present_pages() after highmem has been populated,
fixup_zone_present_pages() should correctly fill in the highmem zone's
->present_pages?

2012-11-14 14:52:12

by Jiang Liu

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

On 11/07/2012 04:43 AM, Andrew Morton wrote:
> On Tue, 6 Nov 2012 09:31:57 +0800
> Jiang Liu <[email protected]> wrote:
>
>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>> zone->present_pages, but it causes a regression to 32bit systems with
>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>> is called to recalculate zone->present_pages when boot allocator frees
>> core memory pages into buddy allocator. Because highmem pages are not
>> freed by bootmem allocator, all highmem zones' present_pages becomes
>> zero.
>>
>> Actually there's no need to recalculate present_pages for highmem zone
>> because bootmem allocator never allocates pages from them. So fix the
>> regression by skipping highmem in function reset_zone_present_pages()
>> and fixup_zone_present_pages().
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>> for_each_node_state(nid, N_HIGH_MEMORY) {
>> for (i = 0; i < MAX_NR_ZONES; i++) {
>> z = NODE_DATA(nid)->node_zones + i;
>> - z->present_pages = 0;
>> + if (!is_highmem(z))
>> + z->present_pages = 0;
>> }
>> }
>> }
>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>
>> for (i = 0; i < MAX_NR_ZONES; i++) {
>> z = NODE_DATA(nid)->node_zones + i;
>> + if (is_highmem(z))
>> + continue;
>> +
>> zone_start_pfn = z->zone_start_pfn;
>> zone_end_pfn = zone_start_pfn + z->spanned_pages;
>> -
>> - /* if the two regions intersect */
>> if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
>> z->present_pages += min(end_pfn, zone_end_pfn) -
>> max(start_pfn, zone_start_pfn);
>
> This ... isn't very nice. It is embeds within
> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
> about their caller's state. Or, more specifically, it is emebedding
> knowledge about the overall state of the system when these functions
> are called.
>
> I mean, a function called "reset_zone_present_pages" should reset
> ->present_pages!
>
> The fact that fixup_zone_present_page() has multiple call sites makes
> this all even more risky. And what are the interactions between this
> and memory hotplug?
>
> Can we find a cleaner fix?
>
> Please tell us more about what's happening here. Is it the case that
> reset_zone_present_pages() is being called *after* highmem has been
> populated? If so, then fixup_zone_present_pages() should work
> correctly for highmem? Or is it the case that highmem hasn't yet been
> setup? IOW, what is the sequence of operations here?
>
> Is the problem that we're *missing* a call to
> fixup_zone_present_pages(), perhaps? If we call
> fixup_zone_present_pages() after highmem has been populated,
> fixup_zone_present_pages() should correctly fill in the highmem zone's
> ->present_pages?
Hi Andrew,
Sorry for the late response:(
I have done more investigation following your suggestions. Currently
we only call fixup_zone_present_pages() for memory freed by the bootmem
allocator, so HIGHMEM pages are missed. We could also call fixup_zone_present_pages()
for HIGHMEM pages, but that would require changing arch-specific code for x86, powerpc,
sparc, microblaze, arm, mips, um, tile etc. That seems a little heavyweight.
And sadly enough, I found the quick fix is still incomplete. The original
patch has another issue: reset_zone_present_pages() is only called on IA64,
so it will cause trouble for other arches that make use of "bootmem.c".
Then I felt a little guilty and tried to find a cleaner solution without
touching arch-specific code. But things are more complex than I expected and
I'm still working on that.
So how about totally reverting changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125,
and I will post another version once I find a cleaner way?
Thanks!
Gerry


2012-11-15 09:16:30

by Wen Congyang

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

Hi, Liu Jiang

At 11/14/2012 10:52 PM, Jiang Liu Wrote:
> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>> On Tue, 6 Nov 2012 09:31:57 +0800
>> Jiang Liu <[email protected]> wrote:
>>
>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>> zone->present_pages, but it causes a regression to 32bit systems with
>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>> is called to recalculate zone->present_pages when boot allocator frees
>>> core memory pages into buddy allocator. Because highmem pages are not
>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>> zero.
>>>
>>> Actually there's no need to recalculate present_pages for highmem zone
>>> because bootmem allocator never allocates pages from them. So fix the
>>> regression by skipping highmem in function reset_zone_present_pages()
>>> and fixup_zone_present_pages().
>>>
>>> ...
>>>
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>> for_each_node_state(nid, N_HIGH_MEMORY) {
>>> for (i = 0; i < MAX_NR_ZONES; i++) {
>>> z = NODE_DATA(nid)->node_zones + i;
>>> - z->present_pages = 0;
>>> + if (!is_highmem(z))
>>> + z->present_pages = 0;
>>> }
>>> }
>>> }
>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>
>>> for (i = 0; i < MAX_NR_ZONES; i++) {
>>> z = NODE_DATA(nid)->node_zones + i;
>>> + if (is_highmem(z))
>>> + continue;
>>> +
>>> zone_start_pfn = z->zone_start_pfn;
>>> zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>> -
>>> - /* if the two regions intersect */
>>> if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
>>> z->present_pages += min(end_pfn, zone_end_pfn) -
>>> max(start_pfn, zone_start_pfn);
>>
>> This ... isn't very nice. It is embeds within
>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>> about their caller's state. Or, more specifically, it is emebedding
>> knowledge about the overall state of the system when these functions
>> are called.
>>
>> I mean, a function called "reset_zone_present_pages" should reset
>> ->present_pages!
>>
>> The fact that fixup_zone_present_page() has multiple call sites makes
>> this all even more risky. And what are the interactions between this
>> and memory hotplug?
>>
>> Can we find a cleaner fix?
>>
>> Please tell us more about what's happening here. Is it the case that
>> reset_zone_present_pages() is being called *after* highmem has been
>> populated? If so, then fixup_zone_present_pages() should work
>> correctly for highmem? Or is it the case that highmem hasn't yet been
>> setup? IOW, what is the sequence of operations here?
>>
>> Is the problem that we're *missing* a call to
>> fixup_zone_present_pages(), perhaps? If we call
>> fixup_zone_present_pages() after highmem has been populated,
>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>> ->present_pages?
> Hi Andrew,
> Sorry for the late response:(
> I have done more investigations according to your suggestions. Currently
> we have only called fixup_zone_present_pages() for memory freed by bootmem
> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
> And sadly enough, I found the quick fix is still incomplete. The original
> patch still have another issue that, reset_zone_present_pages() is only called
> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
> Then I feel a little guilty and tried to find a cleaner solution without
> touching arch specific code. But things are more complex than my expectation and
> I'm still working on that.
> So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> and I will post another version once I found a cleaner way?

I think fixup_zone_present_pages() is very useful for memory hotplug.

We calculate zone->present_pages in free_area_init_core(), but its value is wrong.
That is why we fix it up in fixup_zone_present_pages().

What about this:
1. init zone->present_pages to the present pages in this zone (including bootmem)
2. don't reset zone->present_pages for HIGHMEM pages

We don't allocate bootmem from HIGHMEM, so its present_pages is initialized
correctly in step 1 and there is no need to fix it up in step 2.
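
A rough sketch of the two steps, just to show the shape of the change
(untested fragments, not the actual patch from laijs):

	/* step 1, in free_area_init_core(): count every existing page,
	 * bootmem included, instead of subtracting memmap_pages/dma_reserve */
	realsize = size - zone_absent_pages_in_node(nid, j, zholes_size);
	zone->present_pages = realsize;

	/* step 2, in reset_zone_present_pages(): leave highmem zones alone,
	 * since the bootmem allocator never frees pages into them */
	for_each_node_state(nid, N_HIGH_MEMORY)
		for (i = 0; i < MAX_NR_ZONES; i++) {
			z = NODE_DATA(nid)->node_zones + i;
			if (!is_highmem(z))
				z->present_pages = 0;
		}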

Is it OK?

If it is OK, I will resend the patch for step 1 (the patch is from laijs).

Thanks
Wen Congyang

> Thanks!
> Gerry
>

2012-11-15 11:28:24

by Bob Liu

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

On Thu, Nov 15, 2012 at 5:22 PM, Wen Congyang <[email protected]> wrote:
> Hi, Liu Jiang
>
> At 11/14/2012 10:52 PM, Jiang Liu Wrote:
>> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>>> On Tue, 6 Nov 2012 09:31:57 +0800
>>> Jiang Liu <[email protected]> wrote:
>>>
>>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>>> zone->present_pages, but it causes a regression to 32bit systems with
>>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>>> is called to recalculate zone->present_pages when boot allocator frees
>>>> core memory pages into buddy allocator. Because highmem pages are not
>>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>>> zero.
>>>>
>>>> Actually there's no need to recalculate present_pages for highmem zone
>>>> because bootmem allocator never allocates pages from them. So fix the
>>>> regression by skipping highmem in function reset_zone_present_pages()
>>>> and fixup_zone_present_pages().
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>>> for_each_node_state(nid, N_HIGH_MEMORY) {
>>>> for (i = 0; i < MAX_NR_ZONES; i++) {
>>>> z = NODE_DATA(nid)->node_zones + i;
>>>> - z->present_pages = 0;
>>>> + if (!is_highmem(z))
>>>> + z->present_pages = 0;
>>>> }
>>>> }
>>>> }
>>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>>
>>>> for (i = 0; i < MAX_NR_ZONES; i++) {
>>>> z = NODE_DATA(nid)->node_zones + i;
>>>> + if (is_highmem(z))
>>>> + continue;
>>>> +
>>>> zone_start_pfn = z->zone_start_pfn;
>>>> zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>>> -
>>>> - /* if the two regions intersect */
>>>> if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
>>>> z->present_pages += min(end_pfn, zone_end_pfn) -
>>>> max(start_pfn, zone_start_pfn);
>>>
>>> This ... isn't very nice. It is embeds within
>>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>>> about their caller's state. Or, more specifically, it is emebedding
>>> knowledge about the overall state of the system when these functions
>>> are called.
>>>
>>> I mean, a function called "reset_zone_present_pages" should reset
>>> ->present_pages!
>>>
>>> The fact that fixup_zone_present_page() has multiple call sites makes
>>> this all even more risky. And what are the interactions between this
>>> and memory hotplug?
>>>
>>> Can we find a cleaner fix?
>>>
>>> Please tell us more about what's happening here. Is it the case that
>>> reset_zone_present_pages() is being called *after* highmem has been
>>> populated? If so, then fixup_zone_present_pages() should work
>>> correctly for highmem? Or is it the case that highmem hasn't yet been
>>> setup? IOW, what is the sequence of operations here?
>>>
>>> Is the problem that we're *missing* a call to
>>> fixup_zone_present_pages(), perhaps? If we call
>>> fixup_zone_present_pages() after highmem has been populated,
>>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>>> ->present_pages?
>> Hi Andrew,
>> Sorry for the late response:(
>> I have done more investigations according to your suggestions. Currently
>> we have only called fixup_zone_present_pages() for memory freed by bootmem
>> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
>> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
>> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
>> And sadly enough, I found the quick fix is still incomplete. The original
>> patch still have another issue that, reset_zone_present_pages() is only called
>> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
>> Then I feel a little guilty and tried to find a cleaner solution without
>> touching arch specific code. But things are more complex than my expectation and
>> I'm still working on that.
>> So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>> and I will post another version once I found a cleaner way?
>
> I think fixup_zone_present_pages() are very useful for memory hotplug.
>

I might be missing something, but if memory hotplug is the only user that depends on
fixup_zone_present_pages(), why not revert changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
and add a check to offline_pages() like:
	if (zone->present_pages >= offlined_pages)
		zone->present_pages -= offlined_pages;
	else
		zone->present_pages = 0;

That's simpler and minimizes the effect on other parts of the kernel.

> We calculate zone->present_pages in free_area_init_core(), but its value is wrong.
> So it is why we fix it in fixup_zone_present_pages().
>
> What about this:
> 1. init zone->present_pages to the present pages in this zone(include bootmem)
> 2. don't reset zone->present_pages for HIGHMEM pages
>
> We don't allocate bootmem from HIGHMEM. So its present pages is inited in step1
> and there is no need to fix it in step2.
>
> Is it OK?
>
> If it is OK, I will resend the patch for step1(the patch is from laijs).
>

--
Regards,
--Bob

2012-11-15 14:23:20

by Wen Congyang

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

At 2012/11/15 19:28, Bob Liu Wrote:
> On Thu, Nov 15, 2012 at 5:22 PM, Wen Congyang<[email protected]> wrote:
>> Hi, Liu Jiang
>>
>> At 11/14/2012 10:52 PM, Jiang Liu Wrote:
>>> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>>>> On Tue, 6 Nov 2012 09:31:57 +0800
>>>> Jiang Liu<[email protected]> wrote:
>>>>
>>>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>>>> zone->present_pages, but it causes a regression to 32bit systems with
>>>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>>>> is called to recalculate zone->present_pages when boot allocator frees
>>>>> core memory pages into buddy allocator. Because highmem pages are not
>>>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>>>> zero.
>>>>>
>>>>> Actually there's no need to recalculate present_pages for highmem zone
>>>>> because bootmem allocator never allocates pages from them. So fix the
>>>>> regression by skipping highmem in function reset_zone_present_pages()
>>>>> and fixup_zone_present_pages().
>>>>>
>>>>> ...
>>>>>
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>>>> for_each_node_state(nid, N_HIGH_MEMORY) {
>>>>> for (i = 0; i< MAX_NR_ZONES; i++) {
>>>>> z = NODE_DATA(nid)->node_zones + i;
>>>>> - z->present_pages = 0;
>>>>> + if (!is_highmem(z))
>>>>> + z->present_pages = 0;
>>>>> }
>>>>> }
>>>>> }
>>>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>>>
>>>>> for (i = 0; i< MAX_NR_ZONES; i++) {
>>>>> z = NODE_DATA(nid)->node_zones + i;
>>>>> + if (is_highmem(z))
>>>>> + continue;
>>>>> +
>>>>> zone_start_pfn = z->zone_start_pfn;
>>>>> zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>>>> -
>>>>> - /* if the two regions intersect */
>>>>> if (!(zone_start_pfn>= end_pfn || zone_end_pfn<= start_pfn))
>>>>> z->present_pages += min(end_pfn, zone_end_pfn) -
>>>>> max(start_pfn, zone_start_pfn);
>>>>
>>>> This ... isn't very nice. It is embeds within
>>>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>>>> about their caller's state. Or, more specifically, it is emebedding
>>>> knowledge about the overall state of the system when these functions
>>>> are called.
>>>>
>>>> I mean, a function called "reset_zone_present_pages" should reset
>>>> ->present_pages!
>>>>
>>>> The fact that fixup_zone_present_page() has multiple call sites makes
>>>> this all even more risky. And what are the interactions between this
>>>> and memory hotplug?
>>>>
>>>> Can we find a cleaner fix?
>>>>
>>>> Please tell us more about what's happening here. Is it the case that
>>>> reset_zone_present_pages() is being called *after* highmem has been
>>>> populated? If so, then fixup_zone_present_pages() should work
>>>> correctly for highmem? Or is it the case that highmem hasn't yet been
>>>> setup? IOW, what is the sequence of operations here?
>>>>
>>>> Is the problem that we're *missing* a call to
>>>> fixup_zone_present_pages(), perhaps? If we call
>>>> fixup_zone_present_pages() after highmem has been populated,
>>>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>>>> ->present_pages?
>>> Hi Andrew,
>>> Sorry for the late response:(
>>> I have done more investigations according to your suggestions. Currently
>>> we have only called fixup_zone_present_pages() for memory freed by bootmem
>>> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
>>> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
>>> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
>>> And sadly enough, I found the quick fix is still incomplete. The original
>>> patch still have another issue that, reset_zone_present_pages() is only called
>>> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
>>> Then I feel a little guilty and tried to find a cleaner solution without
>>> touching arch specific code. But things are more complex than my expectation and
>>> I'm still working on that.
>>> So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>>> and I will post another version once I found a cleaner way?
>>
>> I think fixup_zone_present_pages() are very useful for memory hotplug.
>>
>
> I might miss something, but if memory hotplug is the only user depends on
> fixup_zone_present_pages().

IIRC, the watermark depends on zone->present_pages. But I haven't hit any
problem even if zone->present_pages is wrong.

> Why not reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> And add checking to offline_pages() like:
> if (zone->present_pages>= offlined_page)
> zone->present_pages -= offlined_pages;
> else
> zone->present_pages = 0;
>
> It's more simple and can minimize the effect to other parts of kernel.

Hmm, zone->present_pages may be 0 even when there is memory in this zone that is
onlined and in use. If zone->present_pages becomes 0, we will free the pcp
list for this zone, which will cause unexpected errors.

>
>> We calculate zone->present_pages in free_area_init_core(), but its value is wrong.
>> So it is why we fix it in fixup_zone_present_pages().
>>
>> What about this:
>> 1. init zone->present_pages to the present pages in this zone(include bootmem)
>> 2. don't reset zone->present_pages for HIGHMEM pages
>>
>> We don't allocate bootmem from HIGHMEM. So its present pages is inited in step1
>> and there is no need to fix it in step2.
>>
>> Is it OK?
>>
>> If it is OK, I will resend the patch for step1(the patch is from laijs).
>>
>

2012-11-15 15:40:49

by Jiang Liu

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

On 11/15/2012 05:22 PM, Wen Congyang wrote:
> Hi, Liu Jiang
>
> At 11/14/2012 10:52 PM, Jiang Liu Wrote:
>> On 11/07/2012 04:43 AM, Andrew Morton wrote:
>>> On Tue, 6 Nov 2012 09:31:57 +0800
>>> Jiang Liu <[email protected]> wrote:
>>>
>>>> Changeset 7f1290f2f2 tries to fix a issue when calculating
>>>> zone->present_pages, but it causes a regression to 32bit systems with
>>>> HIGHMEM. With that changeset, function reset_zone_present_pages()
>>>> resets all zone->present_pages to zero, and fixup_zone_present_pages()
>>>> is called to recalculate zone->present_pages when boot allocator frees
>>>> core memory pages into buddy allocator. Because highmem pages are not
>>>> freed by bootmem allocator, all highmem zones' present_pages becomes
>>>> zero.
>>>>
>>>> Actually there's no need to recalculate present_pages for highmem zone
>>>> because bootmem allocator never allocates pages from them. So fix the
>>>> regression by skipping highmem in function reset_zone_present_pages()
>>>> and fixup_zone_present_pages().
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6108,7 +6108,8 @@ void reset_zone_present_pages(void)
>>>> for_each_node_state(nid, N_HIGH_MEMORY) {
>>>> for (i = 0; i < MAX_NR_ZONES; i++) {
>>>> z = NODE_DATA(nid)->node_zones + i;
>>>> - z->present_pages = 0;
>>>> + if (!is_highmem(z))
>>>> + z->present_pages = 0;
>>>> }
>>>> }
>>>> }
>>>> @@ -6123,10 +6124,11 @@ void fixup_zone_present_pages(int nid, unsigned long start_pfn,
>>>>
>>>> for (i = 0; i < MAX_NR_ZONES; i++) {
>>>> z = NODE_DATA(nid)->node_zones + i;
>>>> + if (is_highmem(z))
>>>> + continue;
>>>> +
>>>> zone_start_pfn = z->zone_start_pfn;
>>>> zone_end_pfn = zone_start_pfn + z->spanned_pages;
>>>> -
>>>> - /* if the two regions intersect */
>>>> if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
>>>> z->present_pages += min(end_pfn, zone_end_pfn) -
>>>> max(start_pfn, zone_start_pfn);
>>>
>>> This ... isn't very nice. It is embeds within
>>> reset_zone_present_pages() and fixup_zone_present_pages() knowledge
>>> about their caller's state. Or, more specifically, it is emebedding
>>> knowledge about the overall state of the system when these functions
>>> are called.
>>>
>>> I mean, a function called "reset_zone_present_pages" should reset
>>> ->present_pages!
>>>
>>> The fact that fixup_zone_present_page() has multiple call sites makes
>>> this all even more risky. And what are the interactions between this
>>> and memory hotplug?
>>>
>>> Can we find a cleaner fix?
>>>
>>> Please tell us more about what's happening here. Is it the case that
>>> reset_zone_present_pages() is being called *after* highmem has been
>>> populated? If so, then fixup_zone_present_pages() should work
>>> correctly for highmem? Or is it the case that highmem hasn't yet been
>>> setup? IOW, what is the sequence of operations here?
>>>
>>> Is the problem that we're *missing* a call to
>>> fixup_zone_present_pages(), perhaps? If we call
>>> fixup_zone_present_pages() after highmem has been populated,
>>> fixup_zone_present_pages() should correctly fill in the highmem zone's
>>> ->present_pages?
>> Hi Andrew,
>> Sorry for the late response:(
>> I have done more investigations according to your suggestions. Currently
>> we have only called fixup_zone_present_pages() for memory freed by bootmem
>> allocator and missed HIGHMEM pages. We could also call fixup_zone_present_pages()
>> for HIGHMEM pages, but that will need to change arch specific code for x86, powerpc,
>> sparc, microblaze, arm, mips, um and tile etc. Seems a little overhead.
>> And sadly enough, I found the quick fix is still incomplete. The original
>> patch still have another issue that, reset_zone_present_pages() is only called
>> for IA64, so it will cause trouble for other arches which make use of "bootmem.c".
>> Then I feel a little guilty and tried to find a cleaner solution without
>> touching arch specific code. But things are more complex than my expectation and
>> I'm still working on that.
>> So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>> and I will post another version once I found a cleaner way?
>
> I think fixup_zone_present_pages() are very useful for memory hotplug.
>
> We calculate zone->present_pages in free_area_init_core(), but its value is wrong.
> So it is why we fix it in fixup_zone_present_pages().
>
> What about this:
> 1. init zone->present_pages to the present pages in this zone(include bootmem)
> 2. don't reset zone->present_pages for HIGHMEM pages
>
> We don't allocate bootmem from HIGHMEM. So its present pages is inited in step1
> and there is no need to fix it in step2.
Hi Congyang,

I feel that zone->present_pages has been abused. I guess it originally meant "physical pages
present in this zone", but now zone->present_pages is sometimes used as
"pages in this zone managed by the buddy system". So I'm trying to add a new
field "managed_pages" to struct zone, which accounts for pages managed by the buddy system.
That's why I thought the clean solution is a little complex:(

Why do we need "managed_pages"? With HIGHMEM enabled, there may be a big difference
between "present_pages" and "managed_pages" in ZONE_NORMAL, because ZONE_NORMAL also hosts
the page array (mem_map) for ZONE_HIGHMEM. That may disturb the page allocator or the scanner.
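
A back-of-the-envelope illustration (the numbers are made up, just to show the
order of magnitude), written as a small standalone userspace toy:

	#include <stdio.h>

	int main(void)
	{
		/* assumed toy numbers: 32-bit box with 8GiB RAM, 4KiB pages,
		 * 32-byte struct page */
		unsigned long pages_per_mib  = 256;
		unsigned long struct_page_sz = 32;
		unsigned long normal_pages   = 896  * pages_per_mib; /* ~896MiB lowmem */
		unsigned long highmem_pages  = 7296 * pages_per_mib; /* rest is highmem */

		/* mem_map for ZONE_HIGHMEM is carved out of ZONE_NORMAL */
		unsigned long highmem_memmap_pages =
			(highmem_pages * struct_page_sz + 4095) / 4096;

		printf("ZONE_NORMAL present_pages: %lu\n", normal_pages);
		printf("ZONE_NORMAL managed_pages: %lu (memmap overhead only)\n",
		       normal_pages - highmem_memmap_pages);
		printf("difference: %lu pages (~%luMiB)\n",
		       highmem_memmap_pages, highmem_memmap_pages / pages_per_mib);
		return 0;
	}

That's roughly 57MiB of lowmem which present_pages claims but the buddy system
never sees, before even counting the lowmem memmap and dma_reserve.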

What are your thoughts?

Thanks
Gerry

2012-11-15 19:24:57

by Andrew Morton

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

On Wed, 14 Nov 2012 22:52:03 +0800
Jiang Liu <[email protected]> wrote:

> So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> and I will post another version once I found a cleaner way?

We do need to get this regression fixed and I guess that a
straightforward revert is an acceptable way of doing that, for now.


I queued the below, with a plan to send it to Linus next week.


From: Andrew Morton <[email protected]>
Subject: revert "mm: fix-up zone present pages"

Revert

commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
Author: Jianguo Wu <[email protected]>
AuthorDate: Mon Oct 8 16:33:06 2012 -0700
Commit: Linus Torvalds <[email protected]>
CommitDate: Tue Oct 9 16:22:54 2012 +0900

mm: fix-up zone present pages


That patch tried to fix an issue when calculating zone->present_pages, but
it caused a regression on 32-bit systems with HIGHMEM. With that
changeset, reset_zone_present_pages() resets all zone->present_pages to
zero, and fixup_zone_present_pages() is called to recalculate
zone->present_pages when the boot allocator frees core memory pages into
the buddy allocator. Because highmem pages are not freed by the bootmem
allocator, all highmem zones' present_pages become zero.

Various options for improving the situation are being discussed but for
now, let's return to the 3.6 code.

Cc: Jianguo Wu <[email protected]>
Cc: Jiang Liu <[email protected]>
Cc: Petr Tesarik <[email protected]>
Cc: "Luck, Tony" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Yinghai Lu <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: David Rientjes <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/ia64/mm/init.c | 1 -
include/linux/mm.h | 4 ----
mm/bootmem.c | 10 +---------
mm/memory_hotplug.c | 7 -------
mm/nobootmem.c | 3 ---
mm/page_alloc.c | 34 ----------------------------------
6 files changed, 1 insertion(+), 58 deletions(-)

diff -puN arch/ia64/mm/init.c~revert-1 arch/ia64/mm/init.c
--- a/arch/ia64/mm/init.c~revert-1
+++ a/arch/ia64/mm/init.c
@@ -637,7 +637,6 @@ mem_init (void)

high_memory = __va(max_low_pfn * PAGE_SIZE);

- reset_zone_present_pages();
for_each_online_pgdat(pgdat)
if (pgdat->bdata->node_bootmem_map)
totalram_pages += free_all_bootmem_node(pgdat);
diff -puN include/linux/mm.h~revert-1 include/linux/mm.h
--- a/include/linux/mm.h~revert-1
+++ a/include/linux/mm.h
@@ -1684,9 +1684,5 @@ static inline unsigned int debug_guardpa
static inline bool page_is_guard(struct page *page) { return false; }
#endif /* CONFIG_DEBUG_PAGEALLOC */

-extern void reset_zone_present_pages(void);
-extern void fixup_zone_present_pages(int nid, unsigned long start_pfn,
- unsigned long end_pfn);
-
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff -puN mm/bootmem.c~revert-1 mm/bootmem.c
--- a/mm/bootmem.c~revert-1
+++ a/mm/bootmem.c
@@ -198,8 +198,6 @@ static unsigned long __init free_all_boo
int order = ilog2(BITS_PER_LONG);

__free_pages_bootmem(pfn_to_page(start), order);
- fixup_zone_present_pages(page_to_nid(pfn_to_page(start)),
- start, start + BITS_PER_LONG);
count += BITS_PER_LONG;
start += BITS_PER_LONG;
} else {
@@ -210,9 +208,6 @@ static unsigned long __init free_all_boo
if (vec & 1) {
page = pfn_to_page(start + off);
__free_pages_bootmem(page, 0);
- fixup_zone_present_pages(
- page_to_nid(page),
- start + off, start + off + 1);
count++;
}
vec >>= 1;
@@ -226,11 +221,8 @@ static unsigned long __init free_all_boo
pages = bdata->node_low_pfn - bdata->node_min_pfn;
pages = bootmem_bootmap_pages(pages);
count += pages;
- while (pages--) {
- fixup_zone_present_pages(page_to_nid(page),
- page_to_pfn(page), page_to_pfn(page) + 1);
+ while (pages--)
__free_pages_bootmem(page++, 0);
- }

bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);

diff -puN mm/memory_hotplug.c~revert-1 mm/memory_hotplug.c
--- a/mm/memory_hotplug.c~revert-1
+++ a/mm/memory_hotplug.c
@@ -106,7 +106,6 @@ static void get_page_bootmem(unsigned lo
void __ref put_page_bootmem(struct page *page)
{
unsigned long type;
- struct zone *zone;

type = (unsigned long) page->lru.next;
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
@@ -117,12 +116,6 @@ void __ref put_page_bootmem(struct page
set_page_private(page, 0);
INIT_LIST_HEAD(&page->lru);
__free_pages_bootmem(page, 0);
-
- zone = page_zone(page);
- zone_span_writelock(zone);
- zone->present_pages++;
- zone_span_writeunlock(zone);
- totalram_pages++;
}

}
diff -puN mm/nobootmem.c~revert-1 mm/nobootmem.c
--- a/mm/nobootmem.c~revert-1
+++ a/mm/nobootmem.c
@@ -116,8 +116,6 @@ static unsigned long __init __free_memor
return 0;

__free_pages_memory(start_pfn, end_pfn);
- fixup_zone_present_pages(pfn_to_nid(start >> PAGE_SHIFT),
- start_pfn, end_pfn);

return end_pfn - start_pfn;
}
@@ -128,7 +126,6 @@ unsigned long __init free_low_memory_cor
phys_addr_t start, end, size;
u64 i;

- reset_zone_present_pages();
for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
count += __free_memory_core(start, end);

diff -puN mm/page_alloc.c~revert-1 mm/page_alloc.c
--- a/mm/page_alloc.c~revert-1
+++ a/mm/page_alloc.c
@@ -6098,37 +6098,3 @@ void dump_page(struct page *page)
dump_page_flags(page->flags);
mem_cgroup_print_bad_page(page);
}
-
-/* reset zone->present_pages */
-void reset_zone_present_pages(void)
-{
- struct zone *z;
- int i, nid;
-
- for_each_node_state(nid, N_HIGH_MEMORY) {
- for (i = 0; i < MAX_NR_ZONES; i++) {
- z = NODE_DATA(nid)->node_zones + i;
- z->present_pages = 0;
- }
- }
-}
-
-/* calculate zone's present pages in buddy system */
-void fixup_zone_present_pages(int nid, unsigned long start_pfn,
- unsigned long end_pfn)
-{
- struct zone *z;
- unsigned long zone_start_pfn, zone_end_pfn;
- int i;
-
- for (i = 0; i < MAX_NR_ZONES; i++) {
- z = NODE_DATA(nid)->node_zones + i;
- zone_start_pfn = z->zone_start_pfn;
- zone_end_pfn = zone_start_pfn + z->spanned_pages;
-
- /* if the two regions intersect */
- if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
- z->present_pages += min(end_pfn, zone_end_pfn) -
- max(start_pfn, zone_start_pfn);
- }
-}
_

2012-11-15 21:25:04

by Chris Clayton

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d



On 11/15/12 19:24, Andrew Morton wrote:
> On Wed, 14 Nov 2012 22:52:03 +0800
> Jiang Liu <[email protected]> wrote:
>
>> So how about totally reverting the changeset 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
>> and I will post another version once I found a cleaner way?
>
> We do need to get this regression fixed and I guess that a
> straightforward revert is an acceptable way of doing that, for now.
>
>
> I queued the below, with a plan to send it to Linus next week.
>

I've applied this patch to v3.7-rc5-28-g79e979e and can confirm that it
fixes the problem I had with my laptop failing to resume (by either
freezing or rebooting) after a suspend to disk.

Tested-by: Chris Clayton <[email protected]>

>
> From: Andrew Morton <[email protected]>
> Subject: revert "mm: fix-up zone present pages"
>
> Revert
>
> commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> Author: Jianguo Wu <[email protected]>
> AuthorDate: Mon Oct 8 16:33:06 2012 -0700
> Commit: Linus Torvalds <[email protected]>
> CommitDate: Tue Oct 9 16:22:54 2012 +0900
>
> mm: fix-up zone present pages
>
>
> That patch tried to fix a issue when calculating zone->present_pages, but
> it caused a regression on 32bit systems with HIGHMEM. With that
> changeset, reset_zone_present_pages() resets all zone->present_pages to
> zero, and fixup_zone_present_pages() is called to recalculate
> zone->present_pages when the boot allocator frees core memory pages into
> buddy allocator. Because highmem pages are not freed by bootmem
> allocator, all highmem zones' present_pages becomes zero.
>
> Various options for improving the situation are being discussed but for
> now, let's return to the 3.6 code.
>
> Cc: Jianguo Wu <[email protected]>
> Cc: Jiang Liu <[email protected]>
> Cc: Petr Tesarik <[email protected]>
> Cc: "Luck, Tony" <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: David Rientjes <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> arch/ia64/mm/init.c | 1 -
> include/linux/mm.h | 4 ----
> mm/bootmem.c | 10 +---------
> mm/memory_hotplug.c | 7 -------
> mm/nobootmem.c | 3 ---
> mm/page_alloc.c | 34 ----------------------------------
> 6 files changed, 1 insertion(+), 58 deletions(-)
>
> diff -puN arch/ia64/mm/init.c~revert-1 arch/ia64/mm/init.c
> --- a/arch/ia64/mm/init.c~revert-1
> +++ a/arch/ia64/mm/init.c
> @@ -637,7 +637,6 @@ mem_init (void)
>
> high_memory = __va(max_low_pfn * PAGE_SIZE);
>
> - reset_zone_present_pages();
> for_each_online_pgdat(pgdat)
> if (pgdat->bdata->node_bootmem_map)
> totalram_pages += free_all_bootmem_node(pgdat);
> diff -puN include/linux/mm.h~revert-1 include/linux/mm.h
> --- a/include/linux/mm.h~revert-1
> +++ a/include/linux/mm.h
> @@ -1684,9 +1684,5 @@ static inline unsigned int debug_guardpa
> static inline bool page_is_guard(struct page *page) { return false; }
> #endif /* CONFIG_DEBUG_PAGEALLOC */
>
> -extern void reset_zone_present_pages(void);
> -extern void fixup_zone_present_pages(int nid, unsigned long start_pfn,
> - unsigned long end_pfn);
> -
> #endif /* __KERNEL__ */
> #endif /* _LINUX_MM_H */
> diff -puN mm/bootmem.c~revert-1 mm/bootmem.c
> --- a/mm/bootmem.c~revert-1
> +++ a/mm/bootmem.c
> @@ -198,8 +198,6 @@ static unsigned long __init free_all_boo
> int order = ilog2(BITS_PER_LONG);
>
> __free_pages_bootmem(pfn_to_page(start), order);
> - fixup_zone_present_pages(page_to_nid(pfn_to_page(start)),
> - start, start + BITS_PER_LONG);
> count += BITS_PER_LONG;
> start += BITS_PER_LONG;
> } else {
> @@ -210,9 +208,6 @@ static unsigned long __init free_all_boo
> if (vec & 1) {
> page = pfn_to_page(start + off);
> __free_pages_bootmem(page, 0);
> - fixup_zone_present_pages(
> - page_to_nid(page),
> - start + off, start + off + 1);
> count++;
> }
> vec >>= 1;
> @@ -226,11 +221,8 @@ static unsigned long __init free_all_boo
> pages = bdata->node_low_pfn - bdata->node_min_pfn;
> pages = bootmem_bootmap_pages(pages);
> count += pages;
> - while (pages--) {
> - fixup_zone_present_pages(page_to_nid(page),
> - page_to_pfn(page), page_to_pfn(page) + 1);
> + while (pages--)
> __free_pages_bootmem(page++, 0);
> - }
>
> bdebug("nid=%td released=%lx\n", bdata - bootmem_node_data, count);
>
> diff -puN mm/memory_hotplug.c~revert-1 mm/memory_hotplug.c
> --- a/mm/memory_hotplug.c~revert-1
> +++ a/mm/memory_hotplug.c
> @@ -106,7 +106,6 @@ static void get_page_bootmem(unsigned lo
> void __ref put_page_bootmem(struct page *page)
> {
> unsigned long type;
> - struct zone *zone;
>
> type = (unsigned long) page->lru.next;
> BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> @@ -117,12 +116,6 @@ void __ref put_page_bootmem(struct page
> set_page_private(page, 0);
> INIT_LIST_HEAD(&page->lru);
> __free_pages_bootmem(page, 0);
> -
> - zone = page_zone(page);
> - zone_span_writelock(zone);
> - zone->present_pages++;
> - zone_span_writeunlock(zone);
> - totalram_pages++;
> }
>
> }
> diff -puN mm/nobootmem.c~revert-1 mm/nobootmem.c
> --- a/mm/nobootmem.c~revert-1
> +++ a/mm/nobootmem.c
> @@ -116,8 +116,6 @@ static unsigned long __init __free_memor
> return 0;
>
> __free_pages_memory(start_pfn, end_pfn);
> - fixup_zone_present_pages(pfn_to_nid(start >> PAGE_SHIFT),
> - start_pfn, end_pfn);
>
> return end_pfn - start_pfn;
> }
> @@ -128,7 +126,6 @@ unsigned long __init free_low_memory_cor
> phys_addr_t start, end, size;
> u64 i;
>
> - reset_zone_present_pages();
> for_each_free_mem_range(i, MAX_NUMNODES, &start, &end, NULL)
> count += __free_memory_core(start, end);
>
> diff -puN mm/page_alloc.c~revert-1 mm/page_alloc.c
> --- a/mm/page_alloc.c~revert-1
> +++ a/mm/page_alloc.c
> @@ -6098,37 +6098,3 @@ void dump_page(struct page *page)
> dump_page_flags(page->flags);
> mem_cgroup_print_bad_page(page);
> }
> -
> -/* reset zone->present_pages */
> -void reset_zone_present_pages(void)
> -{
> - struct zone *z;
> - int i, nid;
> -
> - for_each_node_state(nid, N_HIGH_MEMORY) {
> - for (i = 0; i < MAX_NR_ZONES; i++) {
> - z = NODE_DATA(nid)->node_zones + i;
> - z->present_pages = 0;
> - }
> - }
> -}
> -
> -/* calculate zone's present pages in buddy system */
> -void fixup_zone_present_pages(int nid, unsigned long start_pfn,
> - unsigned long end_pfn)
> -{
> - struct zone *z;
> - unsigned long zone_start_pfn, zone_end_pfn;
> - int i;
> -
> - for (i = 0; i < MAX_NR_ZONES; i++) {
> - z = NODE_DATA(nid)->node_zones + i;
> - zone_start_pfn = z->zone_start_pfn;
> - zone_end_pfn = zone_start_pfn + z->spanned_pages;
> -
> - /* if the two regions intersect */
> - if (!(zone_start_pfn >= end_pfn || zone_end_pfn <= start_pfn))
> - z->present_pages += min(end_pfn, zone_end_pfn) -
> - max(start_pfn, zone_start_pfn);
> - }
> -}
> _
>

2012-11-15 21:27:09

by David Rientjes

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

On Thu, 15 Nov 2012, Andrew Morton wrote:

> From: Andrew Morton <[email protected]>
> Subject: revert "mm: fix-up zone present pages"
>
> Revert
>
> commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
> Author: Jianguo Wu <[email protected]>
> AuthorDate: Mon Oct 8 16:33:06 2012 -0700
> Commit: Linus Torvalds <[email protected]>
> CommitDate: Tue Oct 9 16:22:54 2012 +0900
>
> mm: fix-up zone present pages
>
>
> That patch tried to fix a issue when calculating zone->present_pages, but
> it caused a regression on 32bit systems with HIGHMEM. With that
> changeset, reset_zone_present_pages() resets all zone->present_pages to
> zero, and fixup_zone_present_pages() is called to recalculate
> zone->present_pages when the boot allocator frees core memory pages into
> buddy allocator. Because highmem pages are not freed by bootmem
> allocator, all highmem zones' present_pages becomes zero.
>
> Various options for improving the situation are being discussed but for
> now, let's return to the 3.6 code.
>
> Cc: Jianguo Wu <[email protected]>
> Cc: Jiang Liu <[email protected]>
> Cc: Petr Tesarik <[email protected]>
> Cc: "Luck, Tony" <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Yinghai Lu <[email protected]>
> Cc: Minchan Kim <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: David Rientjes <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>

Acked-by: David Rientjes <[email protected]>

2012-11-15 21:41:30

by David Rientjes

Subject: Re: [PATCH] mm: fix a regression with HIGHMEM introduced by changeset 7f1290f2f2a4d

On Thu, 15 Nov 2012, Jiang Liu wrote:

> I feel that zone->present_pages has been abused. I guess it means "physical pages
> present in this zone" originally, but now sometimes zone->present_pages is used as
> "pages in this zone managed by the buddy system".

Its definition is all pages spanned by the zone that are not reserved or otherwise
unavailable for the kernel to allocate from, and the implementation of
bootmem requires that its memory be considered "reserved" until freed.
It's used throughout the kernel to determine the amount of memory that is
allocatable in that zone from the page allocator since its reclaim
heuristics and watermarks depend on this memory being allocatable.

> So I'm trying to add a new
> field "managed_pages" into zone, which accounts for pages managed by buddy system.
> That's why I thought the clean solution is a little complex:(
>

You need to update the pgdat's node_present_pages to be consistent with
all of its zones' present_pages.

2012-11-18 16:09:16

by Jiang Liu

Subject: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

Commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tried to
resolve an issue caused by inaccurate zone->present_pages, but that
fix was incomplete and caused regressions with HIGHMEM. It has since been
reverted by commit 5576646 ("revert "mm: fix-up zone present pages"").

This is a follow-up patchset for the issue above. It introduces a
new field named "managed_pages" to struct zone, which counts the pages
in the zone managed by the buddy system, while zone->present_pages
is used to count pages existing in the zone, which is
spanned_pages - absent_pages.

That way, zone->present_pages is kept consistent with
pgdat->node_present_pages, which is the sum of the node's zones'
present_pages.
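
The intended relationships, written out as a small standalone sanity
check (the field names are illustrative, not the real kernel structures):

	#include <assert.h>

	struct zone_sketch {
		unsigned long spanned_pages;	/* total, including holes       */
		unsigned long absent_pages;	/* holes                        */
		unsigned long present_pages;	/* spanned_pages - absent_pages */
		unsigned long managed_pages;	/* pages handed to the buddy    */
	};

	static void check_zone_accounting(const struct zone_sketch *z)
	{
		/* present_pages counts existing pages, holes excluded */
		assert(z->present_pages == z->spanned_pages - z->absent_pages);
		/* the buddy system never manages more than what exists */
		assert(z->managed_pages <= z->present_pages);
	}

	int main(void)
	{
		/* toy values for a lowmem zone: memmap and dma_reserve are no
		 * longer subtracted from present_pages, only from managed_pages */
		struct zone_sketch normal = {
			.spanned_pages = 229376,
			.absent_pages  = 0,
			.present_pages = 229376,
			.managed_pages = 229376 - 14592,
		};
		check_zone_accounting(&normal);
		return 0;
	}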

This patchset has only been tested on x86_64 with nobootmem.c, so I need
help testing it on machines that:
1) use bootmem.c
2) have highmem

This patchset applies to "f4a75d2e Linux 3.7-rc6" from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Any comments and help are welcome!

Jiang Liu (5):
mm: introduce new field "managed_pages" to struct zone
mm: replace zone->present_pages with zone->managed_pages if
appreciated
mm: set zone->present_pages to number of existing pages in the zone
mm: provide more accurate estimation of pages occupied by memmap
mm: increase totalram_pages when free pages allocated by bootmem
allocator

include/linux/mmzone.h | 1 +
mm/bootmem.c | 14 ++++++++
mm/memory_hotplug.c | 6 ++++
mm/mempolicy.c | 2 +-
mm/nobootmem.c | 15 ++++++++
mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++-----------------
mm/vmscan.c | 16 ++++-----
mm/vmstat.c | 8 +++--
8 files changed, 108 insertions(+), 43 deletions(-)

--
1.7.9.5

2012-11-18 16:09:27

by Jiang Liu

Subject: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone

Currently a zone's present_pages is calculated as below, which is
inaccurate and may cause trouble for memory hotplug:
	spanned_pages - absent_pages - memmap_pages - dma_reserve

While fixing bugs caused by the inaccurate zone->present_pages, we found
that zone->present_pages has been abused. The field zone->present_pages
may have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.

For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/

This patchset introduces a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system", and
reverts zone->present_pages to counting "physical pages existing in a zone",
which also keeps it consistent with pgdat->node_present_pages.

We set an initial value for zone->managed_pages in
free_area_init_core(), and it will be adjusted later if the initial
value is inaccurate.

For DMA/normal zones, the initial value is set to:
(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when
the bootmem allocator frees all free pages to the buddy system in
function free_all_bootmem_node() and free_all_bootmem().

The bootmem allocator doesn't touch highmem pages, so each highmem zone's
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in free_area_init_core() and won't be updated anymore.

This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.

Signed-off-by: Jiang Liu <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/bootmem.c | 14 ++++++++++++++
mm/memory_hotplug.c | 5 +++++
mm/nobootmem.c | 15 +++++++++++++++
mm/page_alloc.c | 37 +++++++++++++++++++++++--------------
mm/vmstat.c | 6 ++++--
6 files changed, 62 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..c61fb6d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -480,6 +480,7 @@ struct zone {
*/
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
+ unsigned long managed_pages; /* pages managed by the Buddy */

/*
* rarely used fields:
diff --git a/mm/bootmem.c b/mm/bootmem.c
index f468185..a813e5b 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -229,6 +229,15 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
return count;
}

+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+ struct zone *z;
+
+ for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+ if (!is_highmem(z))
+ z->managed_pages = 0;
+}
+
/**
* free_all_bootmem_node - release a node's free pages to the buddy allocator
* @pgdat: node to be released
@@ -238,6 +247,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
{
register_page_bootmem_info_node(pgdat);
+ reset_node_lowmem_managed_pages(pgdat);
return free_all_bootmem_core(pgdat->bdata);
}

@@ -250,6 +260,10 @@ unsigned long __init free_all_bootmem(void)
{
unsigned long total_pages = 0;
bootmem_data_t *bdata;
+ struct pglist_data *pgdat;
+
+ for_each_online_pgdat(pgdat)
+ reset_node_lowmem_managed_pages(pgdat);

list_for_each_entry(bdata, &bdata_list, list)
total_pages += free_all_bootmem_core(bdata);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..acd3993 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info, struct page *page,
void __ref put_page_bootmem(struct page *page)
{
unsigned long type;
+ static DEFINE_MUTEX(ppb_lock);

type = (unsigned long) page->lru.next;
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
@@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
ClearPagePrivate(page);
set_page_private(page, 0);
INIT_LIST_HEAD(&page->lru);
+ mutex_lock(&ppb_lock);
__free_pages_bootmem(page, 0);
+ mutex_unlock(&ppb_lock);
}

}
@@ -514,6 +517,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
return ret;
}

+ zone->managed_pages += onlined_pages;
zone->present_pages += onlined_pages;
zone->zone_pgdat->node_present_pages += onlined_pages;
if (onlined_pages) {
@@ -961,6 +965,7 @@ repeat:
/* reset pagetype flags and makes migrate type to be MOVABLE */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
/* removal success */
+ zone->managed_pages -= offlined_pages;
zone->present_pages -= offlined_pages;
zone->zone_pgdat->node_present_pages -= offlined_pages;
totalram_pages -= offlined_pages;
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bd82f6b..9c55c18 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -137,6 +137,15 @@ unsigned long __init free_low_memory_core_early(int nodeid)
return count;
}

+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+ struct zone *z;
+
+ for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+ if (!is_highmem(z))
+ z->managed_pages = 0;
+}
+
/**
* free_all_bootmem_node - release a node's free pages to the buddy allocator
* @pgdat: node to be released
@@ -146,6 +155,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
{
register_page_bootmem_info_node(pgdat);
+ reset_node_lowmem_managed_pages(pgdat);

/* free_low_memory_core_early(MAX_NUMNODES) will be called later */
return 0;
@@ -158,6 +168,11 @@ unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
*/
unsigned long __init free_all_bootmem(void)
{
+ struct pglist_data *pgdat;
+
+ for_each_online_pgdat(pgdat)
+ reset_node_lowmem_managed_pages(pgdat);
+
/*
* We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..a41ee64 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -745,6 +745,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
set_page_count(p, 0);
}

+ page_zone(page)->managed_pages += 1 << order;
set_page_refcounted(page);
__free_pages(page, order);
}
@@ -2950,6 +2951,7 @@ void show_free_areas(unsigned int filter)
" isolated(anon):%lukB"
" isolated(file):%lukB"
" present:%lukB"
+ " managed:%lukB"
" mlocked:%lukB"
" dirty:%lukB"
" writeback:%lukB"
@@ -2979,6 +2981,7 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_ISOLATED_ANON)),
K(zone_page_state(zone, NR_ISOLATED_FILE)),
K(zone->present_pages),
+ K(zone->managed_pages),
K(zone_page_state(zone, NR_MLOCK)),
K(zone_page_state(zone, NR_FILE_DIRTY)),
K(zone_page_state(zone, NR_WRITEBACK)),
@@ -4455,48 +4458,54 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, realsize, memmap_pages;
+ unsigned long size, realsize, freesize, memmap_pages;

size = zone_spanned_pages_in_node(nid, j, zones_size);
- realsize = size - zone_absent_pages_in_node(nid, j,
+ realsize = freesize = size - zone_absent_pages_in_node(nid, j,
zholes_size);

/*
- * Adjust realsize so that it accounts for how much memory
+ * Adjust freesize so that it accounts for how much memory
* is used by this zone for memmap. This affects the watermark
* and per-cpu initialisations
*/
memmap_pages =
PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
- if (realsize >= memmap_pages) {
- realsize -= memmap_pages;
+ if (freesize >= memmap_pages) {
+ freesize -= memmap_pages;
if (memmap_pages)
printk(KERN_DEBUG
" %s zone: %lu pages used for memmap\n",
zone_names[j], memmap_pages);
} else
printk(KERN_WARNING
- " %s zone: %lu pages exceeds realsize %lu\n",
- zone_names[j], memmap_pages, realsize);
+ " %s zone: %lu pages exceeds freesize %lu\n",
+ zone_names[j], memmap_pages, freesize);

/* Account for reserved pages */
- if (j == 0 && realsize > dma_reserve) {
- realsize -= dma_reserve;
+ if (j == 0 && freesize > dma_reserve) {
+ freesize -= dma_reserve;
printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
zone_names[0], dma_reserve);
}

if (!is_highmem_idx(j))
- nr_kernel_pages += realsize;
- nr_all_pages += realsize;
+ nr_kernel_pages += freesize;
+ nr_all_pages += freesize;

zone->spanned_pages = size;
- zone->present_pages = realsize;
+ zone->present_pages = freesize;
+ /*
+ * Set an approximate value for lowmem here, it will be adjusted
+ * when the bootmem allocator frees pages into the buddy system.
+ * And all highmem pages will be managed by the buddy system.
+ */
+ zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
#ifdef CONFIG_NUMA
zone->node = nid;
- zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
+ zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
/ 100;
- zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
+ zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
#endif
zone->name = zone_names[j];
spin_lock_init(&zone->lock);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..e47d31c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,14 +992,16 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n high %lu"
"\n scanned %lu"
"\n spanned %lu"
- "\n present %lu",
+ "\n present %lu"
+ "\n managed %lu",
zone_page_state(zone, NR_FREE_PAGES),
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
zone->pages_scanned,
zone->spanned_pages,
- zone->present_pages);
+ zone->present_pages,
+ zone->managed_pages);

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu", vmstat_text[i],
--
1.7.9.5

2012-11-18 16:09:39

by Jiang Liu

[permalink] [raw]
Subject: [RFT PATCH v1 2/5] mm: replace zone->present_pages with zone->managed_pages if appreciated

Now that we have zone->managed_pages for "pages managed by the buddy
system in the zone", replace zone->present_pages with zone->managed_pages
wherever the caller really wants the number of pages managed by the
buddy system.

Signed-off-by: Jiang Liu <[email protected]>
---
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 32 ++++++++++++++++----------------
mm/vmscan.c | 16 ++++++++--------
mm/vmstat.c | 2 +-
4 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..8367070 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -147,7 +147,7 @@ static int is_valid_nodemask(const nodemask_t *nodemask)

for (k = 0; k <= policy_zone; k++) {
z = &NODE_DATA(nd)->node_zones[k];
- if (z->present_pages > 0)
+ if (z->managed_pages > 0)
return 1;
}
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a41ee64..fe1cf48 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2791,7 +2791,7 @@ static unsigned int nr_free_zone_pages(int offset)
struct zonelist *zonelist = node_zonelist(numa_node_id(), GFP_KERNEL);

for_each_zone_zonelist(zone, z, zonelist, offset) {
- unsigned long size = zone->present_pages;
+ unsigned long size = zone->managed_pages;
unsigned long high = high_wmark_pages(zone);
if (size > high)
sum += size - high;
@@ -2844,7 +2844,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
val->totalram = pgdat->node_present_pages;
val->freeram = node_page_state(nid, NR_FREE_PAGES);
#ifdef CONFIG_HIGHMEM
- val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
+ val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].managed_pages;
val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM],
NR_FREE_PAGES);
#else
@@ -3883,7 +3883,7 @@ static int __meminit zone_batchsize(struct zone *zone)
*
* OK, so we don't know how big the cache is. So guess.
*/
- batch = zone->present_pages / 1024;
+ batch = zone->managed_pages / 1024;
if (batch * PAGE_SIZE > 512 * 1024)
batch = (512 * 1024) / PAGE_SIZE;
batch /= 4; /* We effectively *= 4 below */
@@ -3967,7 +3967,7 @@ static void __meminit setup_zone_pageset(struct zone *zone)

if (percpu_pagelist_fraction)
setup_pagelist_highmark(pcp,
- (zone->present_pages /
+ (zone->managed_pages /
percpu_pagelist_fraction));
}
}
@@ -5077,8 +5077,8 @@ static void calculate_totalreserve_pages(void)
/* we treat the high watermark as reserved pages. */
max += high_wmark_pages(zone);

- if (max > zone->present_pages)
- max = zone->present_pages;
+ if (max > zone->managed_pages)
+ max = zone->managed_pages;
reserve_pages += max;
/*
* Lowmem reserves are not available to
@@ -5110,7 +5110,7 @@ static void setup_per_zone_lowmem_reserve(void)
for_each_online_pgdat(pgdat) {
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long present_pages = zone->present_pages;
+ unsigned long managed_pages = zone->managed_pages;

zone->lowmem_reserve[j] = 0;

@@ -5124,9 +5124,9 @@ static void setup_per_zone_lowmem_reserve(void)
sysctl_lowmem_reserve_ratio[idx] = 1;

lower_zone = pgdat->node_zones + idx;
- lower_zone->lowmem_reserve[j] = present_pages /
+ lower_zone->lowmem_reserve[j] = managed_pages /
sysctl_lowmem_reserve_ratio[idx];
- present_pages += lower_zone->present_pages;
+ managed_pages += lower_zone->managed_pages;
}
}
}
@@ -5145,14 +5145,14 @@ static void __setup_per_zone_wmarks(void)
/* Calculate total number of !ZONE_HIGHMEM pages */
for_each_zone(zone) {
if (!is_highmem(zone))
- lowmem_pages += zone->present_pages;
+ lowmem_pages += zone->managed_pages;
}

for_each_zone(zone) {
u64 tmp;

spin_lock_irqsave(&zone->lock, flags);
- tmp = (u64)pages_min * zone->present_pages;
+ tmp = (u64)pages_min * zone->managed_pages;
do_div(tmp, lowmem_pages);
if (is_highmem(zone)) {
/*
@@ -5166,7 +5166,7 @@ static void __setup_per_zone_wmarks(void)
*/
int min_pages;

- min_pages = zone->present_pages / 1024;
+ min_pages = zone->managed_pages / 1024;
if (min_pages < SWAP_CLUSTER_MAX)
min_pages = SWAP_CLUSTER_MAX;
if (min_pages > 128)
@@ -5235,7 +5235,7 @@ static void __meminit calculate_zone_inactive_ratio(struct zone *zone)
unsigned int gb, ratio;

/* Zone size in gigabytes */
- gb = zone->present_pages >> (30 - PAGE_SHIFT);
+ gb = zone->managed_pages >> (30 - PAGE_SHIFT);
if (gb)
ratio = int_sqrt(10 * gb);
else
@@ -5321,7 +5321,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
return rc;

for_each_zone(zone)
- zone->min_unmapped_pages = (zone->present_pages *
+ zone->min_unmapped_pages = (zone->managed_pages *
sysctl_min_unmapped_ratio) / 100;
return 0;
}
@@ -5337,7 +5337,7 @@ int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
return rc;

for_each_zone(zone)
- zone->min_slab_pages = (zone->present_pages *
+ zone->min_slab_pages = (zone->managed_pages *
sysctl_min_slab_ratio) / 100;
return 0;
}
@@ -5379,7 +5379,7 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
for_each_populated_zone(zone) {
for_each_possible_cpu(cpu) {
unsigned long high;
- high = zone->present_pages / percpu_pagelist_fraction;
+ high = zone->managed_pages / percpu_pagelist_fraction;
setup_pagelist_highmark(
per_cpu_ptr(zone->pageset, cpu), high);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..7240e89 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1935,7 +1935,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
* a reasonable chance of completing and allocating the page
*/
balance_gap = min(low_wmark_pages(zone),
- (zone->present_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+ (zone->managed_pages + KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
KSWAPD_ZONE_BALANCE_GAP_RATIO);
watermark = high_wmark_pages(zone) + balance_gap + (2UL << sc->order);
watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
@@ -2416,14 +2416,14 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
int classzone_idx)
{
- unsigned long present_pages = 0;
+ unsigned long managed_pages = 0;
int i;

for (i = 0; i <= classzone_idx; i++)
- present_pages += pgdat->node_zones[i].present_pages;
+ managed_pages += pgdat->node_zones[i].managed_pages;

/* A special case here: if zone has no page, we think it's balanced */
- return balanced_pages >= (present_pages >> 2);
+ return balanced_pages >= (managed_pages >> 2);
}

/*
@@ -2471,7 +2471,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
* is to sleep
*/
if (zone->all_unreclaimable) {
- balanced += zone->present_pages;
+ balanced += zone->managed_pages;
continue;
}

@@ -2479,7 +2479,7 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
i, 0))
all_zones_ok = false;
else
- balanced += zone->present_pages;
+ balanced += zone->managed_pages;
}

/*
@@ -2645,7 +2645,7 @@ loop_again:
* of the zone, whichever is smaller.
*/
balance_gap = min(low_wmark_pages(zone),
- (zone->present_pages +
+ (zone->managed_pages +
KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
KSWAPD_ZONE_BALANCE_GAP_RATIO);
/*
@@ -2712,7 +2712,7 @@ loop_again:
*/
zone_clear_flag(zone, ZONE_CONGESTED);
if (i <= *classzone_idx)
- balanced += zone->present_pages;
+ balanced += zone->managed_pages;
}

}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e47d31c..b2925d1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -142,7 +142,7 @@ int calculate_normal_threshold(struct zone *zone)
* 125 1024 10 16-32 GB 9
*/

- mem = zone->present_pages >> (27 - PAGE_SHIFT);
+ mem = zone->managed_pages >> (27 - PAGE_SHIFT);

threshold = 2 * fls(num_online_cpus()) * (1 + fls(mem));

--
1.7.9.5

2012-11-18 16:09:54

by Jiang Liu

[permalink] [raw]
Subject: [RFT PATCH v1 3/5] mm: set zone->present_pages to number of existing pages in the zone

Now that all users who really want "number of pages managed by the buddy
system" have been converted to use zone->managed_pages, set
zone->present_pages to what it really should be:
present_pages = spanned_pages - absent_pages;
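
As a made-up example: a zone spanning 262144 pfns with 4096 of them
falling in holes ends up with present_pages = 262144 - 4096 = 258048, no
matter how many of those pages the bootmem allocator later reserves; only
managed_pages reflects the reservations.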

Signed-off-by: Jiang Liu <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe1cf48..5b327d7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4494,7 +4494,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
nr_all_pages += freesize;

zone->spanned_pages = size;
- zone->present_pages = freesize;
+ zone->present_pages = realsize;
/*
* Set an approximate value for lowmem here, it will be adjusted
* when the bootmem allocator frees pages into the buddy system.
--
1.7.9.5

2012-11-18 16:10:07

by Jiang Liu

[permalink] [raw]
Subject: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

If SPARSEMEM is enabled, the kernel won't build page structures for
non-existing pages (holes) within a zone, so provide a more accurate
estimate of the pages occupied by the memmap if there are big holes
within the zone.

Also, the memmap for highmem zones is allocated from lowmem, so charge
those pages against nr_kernel_pages.
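
As a rough worked example (sizes are illustrative and sizeof(struct page)
depends on the configuration): with 4 KiB pages and a 32-byte struct page,
a highmem zone with 262144 present pages needs about 262144 * 32 bytes =
8 MiB of memmap, i.e. 2048 pages, and because that memmap lives in lowmem
those 2048 pages are subtracted from nr_kernel_pages.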

Signed-off-by: Jiang Liu <[email protected]>
---
mm/page_alloc.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5b327d7..eb25679 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)

#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

+static unsigned long calc_memmap_size(unsigned long spanned_pages,
+ unsigned long present_pages)
+{
+ unsigned long pages = spanned_pages;
+
+ /*
+ * Provide a more accurate estimation if there are big holes within
+ * the zone and SPARSEMEM is in use.
+ */
+ if (spanned_pages > present_pages + (present_pages >> 4) &&
+ IS_ENABLED(CONFIG_SPARSEMEM))
+ pages = present_pages;
+
+ return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -4469,8 +4485,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
* is used by this zone for memmap. This affects the watermark
* and per-cpu initialisations
*/
- memmap_pages =
- PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+ memmap_pages = calc_memmap_size(size, realsize);
if (freesize >= memmap_pages) {
freesize -= memmap_pages;
if (memmap_pages)
@@ -4491,6 +4506,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

if (!is_highmem_idx(j))
nr_kernel_pages += freesize;
+ /* Charge for highmem memmap if there are enough kernel pages */
+ else if (nr_kernel_pages > memmap_pages * 2)
+ nr_kernel_pages -= memmap_pages;
nr_all_pages += freesize;

zone->spanned_pages = size;
--
1.7.9.5

2012-11-18 16:10:16

by Jiang Liu

[permalink] [raw]
Subject: [RFT PATCH v1 5/5] mm: increase totalram_pages when free pages allocated by bootmem allocator

Function put_page_bootmem() is used to free pages allocated by the
bootmem allocator back to the buddy system, so it should increase
totalram_pages when freeing those pages.

Signed-off-by: Jiang Liu <[email protected]>
---
mm/memory_hotplug.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index acd3993..6b07e0c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -119,6 +119,7 @@ void __ref put_page_bootmem(struct page *page)
mutex_lock(&ppb_lock);
__free_pages_bootmem(page, 0);
mutex_unlock(&ppb_lock);
+ totalram_pages++;
}

}
--
1.7.9.5

2012-11-18 20:36:13

by Chris Clayton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

Hi Gerry.

On 11/18/12 16:07, Jiang Liu wrote:
> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
> resolve an issue caused by inaccurate zone->present_pages, but that
> fix is incomplete and causes regresions with HIGHMEM. And it has been
> reverted by commit
> 5576646 revert "mm: fix-up zone present pages"
>
> This is a following-up patchset for the issue above. It introduces a
> new field named "managed_pages" to struct zone, which counts pages
> managed by the buddy system from the zone. And zone->present_pages
> is used to count pages existing in the zone, which is
> spanned_pages - absent_pages.
>
> But that way, zone->present_pages will be kept in consistence with
> pgdat->node_present_pages, which is sum of zone->present_pages.
>
> This patchset has only been tested on x86_64 with nobootmem.c. So need
> help to test this patchset on machines:
> 1) use bootmem.c
> 2) have highmem
>
> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>

I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that
the kernel allows my system to resume from a suspend to disc. Although
my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):

[chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32 .config
CONFIG_X86_32=y
CONFIG_X86_32_SMP=y
CONFIG_X86_32_LAZY_GS=y
# CONFIG_X86_32_IRIS is not set
# CONFIG_NOHIGHMEM is not set
# CONFIG_HIGHMEM4G is not set
CONFIG_HIGHMEM64G=y
CONFIG_HIGHMEM=y

I can also say that a quick browse of the output of dmesg shows nothing
out of the ordinary. I have insufficient knowledge to comment on the
patches, but I will run the kernel over the next few days and report
back later in the week.

Chris

> Any comments and helps are welcomed!
>
> Jiang Liu (5):
> mm: introduce new field "managed_pages" to struct zone
> mm: replace zone->present_pages with zone->managed_pages if
> appreciated
> mm: set zone->present_pages to number of existing pages in the zone
> mm: provide more accurate estimation of pages occupied by memmap
> mm: increase totalram_pages when free pages allocated by bootmem
> allocator
>
> include/linux/mmzone.h | 1 +
> mm/bootmem.c | 14 ++++++++
> mm/memory_hotplug.c | 6 ++++
> mm/mempolicy.c | 2 +-
> mm/nobootmem.c | 15 ++++++++
> mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++-----------------
> mm/vmscan.c | 16 ++++-----
> mm/vmstat.c | 8 +++--
> 8 files changed, 108 insertions(+), 43 deletions(-)
>

2012-11-19 21:36:38

by Maciej Rutecki

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

On niedziela, 18 listopada 2012 o 17:07:25 Jiang Liu wrote:
> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
> resolve an issue caused by inaccurate zone->present_pages, but that
> fix is incomplete and causes regresions with HIGHMEM. And it has been
> reverted by commit
> 5576646 revert "mm: fix-up zone present pages"
>
> This is a following-up patchset for the issue above. It introduces a
> new field named "managed_pages" to struct zone, which counts pages
> managed by the buddy system from the zone. And zone->present_pages
> is used to count pages existing in the zone, which is
> spanned_pages - absent_pages.
>
> But that way, zone->present_pages will be kept in consistence with
> pgdat->node_present_pages, which is sum of zone->present_pages.
>
> This patchset has only been tested on x86_64 with nobootmem.c. So need
> help to test this patchset on machines:
> 1) use bootmem.c
> 2) have highmem
>
> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> Any comments and helps are welcomed!
>
> Jiang Liu (5):
> mm: introduce new field "managed_pages" to struct zone
> mm: replace zone->present_pages with zone->managed_pages if
> appreciated
> mm: set zone->present_pages to number of existing pages in the zone
> mm: provide more accurate estimation of pages occupied by memmap
> mm: increase totalram_pages when free pages allocated by bootmem
> allocator
>
> include/linux/mmzone.h | 1 +
> mm/bootmem.c | 14 ++++++++
> mm/memory_hotplug.c | 6 ++++
> mm/mempolicy.c | 2 +-
> mm/nobootmem.c | 15 ++++++++
> mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++-----------------
> mm/vmscan.c | 16 ++++-----
> mm/vmstat.c | 8 +++--
> 8 files changed, 108 insertions(+), 43 deletions(-)
Tested on 32-bit Linux with HIGHMEM. Seems to be OK.

Regards
--
Maciej Rutecki
http://www.mrutecki.pl

2012-11-19 23:38:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone

On Mon, 19 Nov 2012 00:07:26 +0800
Jiang Liu <[email protected]> wrote:

> Currently a zone's present_pages is calcuated as below, which is
> inaccurate and may cause trouble to memory hotplug.
> spanned_pages - absent_pages - memmap_pages - dma_reserve.
>
> During fixing bugs caused by inaccurate zone->present_pages, we found
> zone->present_pages has been abused. The field zone->present_pages
> may have different meanings in different contexts:
> 1) pages existing in a zone.
> 2) pages managed by the buddy system.
>
> For more discussions about the issue, please refer to:
> http://lkml.org/lkml/2012/11/5/866
> https://patchwork.kernel.org/patch/1346751/
>
> This patchset tries to introduce a new field named "managed_pages" to
> struct zone, which counts "pages managed by the buddy system". And
> revert zone->present_pages to count "physical pages existing in a zone",
> which also keep in consistence with pgdat->node_present_pages.
>
> We will set an initial value for zone->managed_pages in function
> free_area_init_core() and will be adjusted later if the initial value is
> inaccurate.
>
> For DMA/normal zones, the initial value is set to:
> (spanned_pages - absent_pages - memmap_pages - dma_reserve)
> Later zone->managed_pages will be adjusted to the accurate value when
> the bootmem allocator frees all free pages to the buddy system in
> function free_all_bootmem_node() and free_all_bootmem().
>
> The bootmem allocator doesn't touch highmem pages, so highmem zones'
> managed_pages is set to the accurate value "spanned_pages - absent_pages"
> in function free_area_init_core() and won't be updated anymore.
>
> This patch also adds a new field "managed_pages" to /proc/zoneinfo
> and sysrq showmem.

hoo boy, what a mess we made. I'd like to merge these patches and get
them into -next for some testing, but -next has stopped for a couple of
weeks. Oh well, let's see what can be done.

> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -480,6 +480,7 @@ struct zone {
> */
> unsigned long spanned_pages; /* total size, including holes */
> unsigned long present_pages; /* amount of memory (excluding holes) */
> + unsigned long managed_pages; /* pages managed by the Buddy */

Can you please add a nice big comment over these three fields which
fully describes what they do and the relationship between them?
Basically that stuff that's in the changelog.

Also, the existing comment tells us that spanned_pages and
present_pages are protected by span_seqlock but has not been updated to
describe the locking (if any) for managed_pages.

> /*
> * rarely used fields:
> diff --git a/mm/bootmem.c b/mm/bootmem.c
> index f468185..a813e5b 100644
> --- a/mm/bootmem.c
> +++ b/mm/bootmem.c
> @@ -229,6 +229,15 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
> return count;
> }
>
> +static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
> +{
> + struct zone *z;
> +
> + for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> + if (!is_highmem(z))

Needs a comment explaining why we skip the highmem zone, please.

> + z->managed_pages = 0;
> +}
> +
>
> ...
>
> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info, struct page *page,
> void __ref put_page_bootmem(struct page *page)
> {
> unsigned long type;
> + static DEFINE_MUTEX(ppb_lock);
>
> type = (unsigned long) page->lru.next;
> BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
> ClearPagePrivate(page);
> set_page_private(page, 0);
> INIT_LIST_HEAD(&page->lru);
> + mutex_lock(&ppb_lock);
> __free_pages_bootmem(page, 0);
> + mutex_unlock(&ppb_lock);

The mutex is odd. Nothing in the changelog, no code comment.
__free_pages_bootmem() is called from a lot of places but only this one
has locking. I'm madly guessing that the lock is here to handle two or
more concurrent memory hotpluggings, but I shouldn't need to guess!!

> }
>
> }
>
> ...
>

2012-11-19 23:42:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

On Mon, 19 Nov 2012 00:07:29 +0800
Jiang Liu <[email protected]> wrote:

> If SPARSEMEM is enabled, it won't build page structures for
> non-existing pages (holes) within a zone, so provide a more accurate
> estimation of pages occupied by memmap if there are big holes within
> the zone.
>
> And pages for highmem zones' memmap will be allocated from lowmem,
> so charge nr_kernel_pages for that.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
>
> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>
> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> + unsigned long present_pages)
> +{
> + unsigned long pages = spanned_pages;
> +
> + /*
> + * Provide a more accurate estimation if there are big holes within
> + * the zone and SPARSEMEM is in use.
> + */
> + if (spanned_pages > present_pages + (present_pages >> 4) &&
> + IS_ENABLED(CONFIG_SPARSEMEM))
> + pages = present_pages;
> +
> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> +}

Please explain the ">> 4" heuristic more completely - preferably in both
the changelog and code comments. Why can't we calculate this
requirement exactly? That might require a second pass, but that's OK for
code like this?

2012-11-20 02:13:12

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

On 11/19/2012 12:07 AM, Jiang Liu wrote:
> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
> resolve an issue caused by inaccurate zone->present_pages, but that
> fix is incomplete and causes regresions with HIGHMEM. And it has been
> reverted by commit
> 5576646 revert "mm: fix-up zone present pages"
>
> This is a following-up patchset for the issue above. It introduces a
> new field named "managed_pages" to struct zone, which counts pages
> managed by the buddy system from the zone. And zone->present_pages
> is used to count pages existing in the zone, which is
> spanned_pages - absent_pages.
>
> But that way, zone->present_pages will be kept in consistence with
> pgdat->node_present_pages, which is sum of zone->present_pages.
>
> This patchset has only been tested on x86_64 with nobootmem.c. So need
> help to test this patchset on machines:
> 1) use bootmem.c

Is it only x86_32 that uses bootmem.c instead of nobootmem.c? How can I confirm that?
> 2) have highmem
>
> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>
> Any comments and helps are welcomed!
>
> Jiang Liu (5):
> mm: introduce new field "managed_pages" to struct zone
> mm: replace zone->present_pages with zone->managed_pages if
> appreciated
> mm: set zone->present_pages to number of existing pages in the zone
> mm: provide more accurate estimation of pages occupied by memmap
> mm: increase totalram_pages when free pages allocated by bootmem
> allocator
>
> include/linux/mmzone.h | 1 +
> mm/bootmem.c | 14 ++++++++
> mm/memory_hotplug.c | 6 ++++
> mm/mempolicy.c | 2 +-
> mm/nobootmem.c | 15 ++++++++
> mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++-----------------
> mm/vmscan.c | 16 ++++-----
> mm/vmstat.c | 8 +++--
> 8 files changed, 108 insertions(+), 43 deletions(-)
>

2012-11-20 02:15:21

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

On 11/19/2012 12:07 AM, Jiang Liu wrote:
> If SPARSEMEM is enabled, it won't build page structures for
> non-existing pages (holes) within a zone, so provide a more accurate
> estimation of pages occupied by memmap if there are big holes within
> the zone.
>
> And pages for highmem zones' memmap will be allocated from lowmem,
> so charge nr_kernel_pages for that.
>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> mm/page_alloc.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5b327d7..eb25679 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
>
> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>
> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> + unsigned long present_pages)
> +{
> + unsigned long pages = spanned_pages;
> +
> + /*
> + * Provide a more accurate estimation if there are big holes within
> + * the zone and SPARSEMEM is in use.
> + */
> + if (spanned_pages > present_pages + (present_pages >> 4) &&
> + IS_ENABLED(CONFIG_SPARSEMEM))
> + pages = present_pages;
> +
> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> +}
> +
> /*
> * Set up the zone data structures:
> * - mark all pages reserved
> @@ -4469,8 +4485,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> * is used by this zone for memmap. This affects the watermark
> * and per-cpu initialisations
> */
> - memmap_pages =
> - PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
> + memmap_pages = calc_memmap_size(size, realsize);
> if (freesize >= memmap_pages) {
> freesize -= memmap_pages;
> if (memmap_pages)
> @@ -4491,6 +4506,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
>
> if (!is_highmem_idx(j))
> nr_kernel_pages += freesize;
> + /* Charge for highmem memmap if there are enough kernel pages */
> + else if (nr_kernel_pages > memmap_pages * 2)
> + nr_kernel_pages -= memmap_pages;

Since this is in the else branch, what happens if nr_kernel_pages is still
equal to 0 at this point during initialization?

> nr_all_pages += freesize;
>
> zone->spanned_pages = size;

2012-11-20 02:44:19

by Jiang Liu (Gerry)

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

On 2012-11-20 10:13, Jaegeuk Hanse wrote:
> On 11/19/2012 12:07 AM, Jiang Liu wrote:
>> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
>> resolve an issue caused by inaccurate zone->present_pages, but that
>> fix is incomplete and causes regresions with HIGHMEM. And it has been
>> reverted by commit
>> 5576646 revert "mm: fix-up zone present pages"
>>
>> This is a following-up patchset for the issue above. It introduces a
>> new field named "managed_pages" to struct zone, which counts pages
>> managed by the buddy system from the zone. And zone->present_pages
>> is used to count pages existing in the zone, which is
>> spanned_pages - absent_pages.
>>
>> But that way, zone->present_pages will be kept in consistence with
>> pgdat->node_present_pages, which is sum of zone->present_pages.
>>
>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>> help to test this patchset on machines:
>> 1) use bootmem.c
>
> If only x86_32 use bootmem.c instead of nobootmem.c? How could I confirm it?
Hi Jaegeuk,
Thanks for reviewing this patch set.
Currently x86/x86_64/Sparc have been converted to use nobootmem.c,
and other arches still use bootmem.c. So we need to test this patchset on
other arches, such as ARM. Yesterday we tested the patchset on an Itanium
platform, so bootmem.c should work as expected too.
Thanks!
Gerry

2012-11-20 03:20:33

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

On 11/20/2012 10:43 AM, Jiang Liu wrote:
> On 2012-11-20 10:13, Jaegeuk Hanse wrote:
>> On 11/19/2012 12:07 AM, Jiang Liu wrote:
>>> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
>>> resolve an issue caused by inaccurate zone->present_pages, but that
>>> fix is incomplete and causes regresions with HIGHMEM. And it has been
>>> reverted by commit
>>> 5576646 revert "mm: fix-up zone present pages"
>>>
>>> This is a following-up patchset for the issue above. It introduces a
>>> new field named "managed_pages" to struct zone, which counts pages
>>> managed by the buddy system from the zone. And zone->present_pages
>>> is used to count pages existing in the zone, which is
>>> spanned_pages - absent_pages.
>>>
>>> But that way, zone->present_pages will be kept in consistence with
>>> pgdat->node_present_pages, which is sum of zone->present_pages.
>>>
>>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>>> help to test this patchset on machines:
>>> 1) use bootmem.c
>> If only x86_32 use bootmem.c instead of nobootmem.c? How could I confirm it?
> Hi Jaegeuk,
> Thanks for review this patch set.
> Currently x86/x86_64/Sparc have been converted to use nobootmem.c,
> and other Arches still use bootmem.c. So need to test it on other Arches,
> such as ARM etc. Yesterday we have tested it patchset on an Itanium platform,
> so bootmem.c should work as expected too.

Hi Jiang,

Were any code changes needed in x86/x86_64 to match the nobootmem.c logic?
I mean, if I remove
config NO_BOOTMEM
def_bool y
in arch/x86/Kconfig, can x86/x86_64 still take advantage of bootmem.c
or not?

Regards,
Jaegeuk

> Thanks!
> Gerry
>
>

2012-11-20 03:49:11

by Jiang Liu (Gerry)

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

On 2012-11-20 11:20, Jaegeuk Hanse wrote:
> On 11/20/2012 10:43 AM, Jiang Liu wrote:
>> On 2012-11-20 10:13, Jaegeuk Hanse wrote:
>>> On 11/19/2012 12:07 AM, Jiang Liu wrote:
>> Hi Jaegeuk,
>> Thanks for review this patch set.
>> Currently x86/x86_64/Sparc have been converted to use nobootmem.c,
>> and other Arches still use bootmem.c. So need to test it on other Arches,
>> such as ARM etc. Yesterday we have tested it patchset on an Itanium platform,
>> so bootmem.c should work as expected too.
>
> Hi Jiang,
>
> If there are any codes changed in x86/x86_64 to meet nobootmem.c logic? I mean if remove
> config NO_BOOTMEM
> def_bool y
> in arch/x86/Kconfig, whether x86/x86_64 can take advantage of bootmem.c or not.
There were code changes in the x86/x86_64 arch directory to convert from
bootmem.c to nobootmem.c, so you can't simply comment out the NO_BOOTMEM
Kconfig item. The APIs provided by bootmem.c and nobootmem.c also differ.
For example, free_low_memory_core_early() is only provided by nobootmem.c
and init_bootmem_node() is only provided by bootmem.c.
Thanks!

2012-11-20 14:56:27

by Jiang Liu

[permalink] [raw]
Subject: Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone

On 11/20/2012 07:38 AM, Andrew Morton wrote:
> On Mon, 19 Nov 2012 00:07:26 +0800
> Jiang Liu <[email protected]> wrote:
>
>> Currently a zone's present_pages is calcuated as below, which is
>> inaccurate and may cause trouble to memory hotplug.
>> spanned_pages - absent_pages - memmap_pages - dma_reserve.
>>
>> During fixing bugs caused by inaccurate zone->present_pages, we found
>> zone->present_pages has been abused. The field zone->present_pages
>> may have different meanings in different contexts:
>> 1) pages existing in a zone.
>> 2) pages managed by the buddy system.
>>
>> For more discussions about the issue, please refer to:
>> http://lkml.org/lkml/2012/11/5/866
>> https://patchwork.kernel.org/patch/1346751/
>>
>> This patchset tries to introduce a new field named "managed_pages" to
>> struct zone, which counts "pages managed by the buddy system". And
>> revert zone->present_pages to count "physical pages existing in a zone",
>> which also keep in consistence with pgdat->node_present_pages.
>>
>> We will set an initial value for zone->managed_pages in function
>> free_area_init_core() and will be adjusted later if the initial value is
>> inaccurate.
>>
>> For DMA/normal zones, the initial value is set to:
>> (spanned_pages - absent_pages - memmap_pages - dma_reserve)
>> Later zone->managed_pages will be adjusted to the accurate value when
>> the bootmem allocator frees all free pages to the buddy system in
>> function free_all_bootmem_node() and free_all_bootmem().
>>
>> The bootmem allocator doesn't touch highmem pages, so highmem zones'
>> managed_pages is set to the accurate value "spanned_pages - absent_pages"
>> in function free_area_init_core() and won't be updated anymore.
>>
>> This patch also adds a new field "managed_pages" to /proc/zoneinfo
>> and sysrq showmem.
>
> hoo boy, what a mess we made. I'd like to merge these patches and get
> them into -next for some testing, but -next has stopped for a couple of
> weeks. Oh well, let's see what can be done.
Hi Andrew,
Really sorry for the delay. Within the last few weeks, I could only
find time after work hours or on weekends for programming :(

>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -480,6 +480,7 @@ struct zone {
>> */
>> unsigned long spanned_pages; /* total size, including holes */
>> unsigned long present_pages; /* amount of memory (excluding holes) */
>> + unsigned long managed_pages; /* pages managed by the Buddy */
>
> Can you please add a nice big comment over these three fields which
> fully describes what they do and the relationship between them?
> Basically that stuff that's in the changelog.
>
> Also, the existing comment tells us that spanned_pages and
> present_pages are protected by span_seqlock but has not been updated to
> describe the locking (if any) for managed_pages.
How about this?

/*
* spanned_pages is the total pages spanned by the zone, including
* holes, which is calculated as:
* spanned_pages = zone_end_pfn - zone_start_pfn;
*
* present_pages is physical pages existing within the zone, which
* is calculated as:
* present_pages = spanned_pages - absent_pages(pages in holes);
*
* managed_pages is present pages managed by the buddy system, which
* is calculated as (reserved_pages includes pages allocated by the
* bootmem allocator):
* managed_pages = present_pages - reserved_pages;
*
* So present_pages may be used by memory hotplug or memory power
* management logic to figure out unmanaged pages by checking
* (present_pages - managed_pages). And managed_pages should be used
* by page allocator and vm scanner to calculate all kinds of watermarks
* and thresholds.
*
* Lock Rules:
*
* zone_start_pfn, spanned_pages are protected by span_seqlock.
* It is a seqlock because it has to be read outside of zone->lock,
* and it is done in the main allocator path. But, it is written
* quite infrequently.
*
* The span_seq lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*
* Writing access to present_pages and managed_pages at runtime should
* be protected by lock_memory_hotplug()/unlock_memory_hotplug().
* Any reader who can't tolerate drift of present_pages and
* managed_pages should hold memory hotplug lock to get a stable value.
*/
unsigned long spanned_pages;
unsigned long present_pages;
unsigned long managed_pages;
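
To make the relationship concrete, a quick made-up check (numbers are
illustrative, not from any real machine):

	spanned_pages = zone_end_pfn - zone_start_pfn  = 262144
	present_pages = spanned_pages - absent_pages   = 262144 - 4096 = 258048
	managed_pages = present_pages - reserved_pages = 258048 - 2048 = 256000

so managed_pages <= present_pages <= spanned_pages should always hold.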


>
>> /*
>> * rarely used fields:
>> diff --git a/mm/bootmem.c b/mm/bootmem.c
>> index f468185..a813e5b 100644
>> --- a/mm/bootmem.c
>> +++ b/mm/bootmem.c
>> @@ -229,6 +229,15 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
>> return count;
>> }
>>
>> +static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
>> +{
>> + struct zone *z;
>> +
>> + for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
>> + if (!is_highmem(z))
>
> Needs a comment explaining why we skip the highmem zone, please.
How about this?
/*
* In free_area_init_core(), a highmem zone's managed_pages is set to
* present_pages, and the bootmem allocator doesn't allocate from highmem
* zones. So there's no need to recalculate managed_pages because all
* highmem pages will be managed by the buddy system. Here the highmem
* zone also includes the highmem movable zone.
*/


>> + z->managed_pages = 0;
>> +}
>> +
>>
>> ...
>>
>> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info, struct page *page,
>> void __ref put_page_bootmem(struct page *page)
>> {
>> unsigned long type;
>> + static DEFINE_MUTEX(ppb_lock);
>>
>> type = (unsigned long) page->lru.next;
>> BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
>> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
>> ClearPagePrivate(page);
>> set_page_private(page, 0);
>> INIT_LIST_HEAD(&page->lru);
>> + mutex_lock(&ppb_lock);
>> __free_pages_bootmem(page, 0);
>> + mutex_unlock(&ppb_lock);
>
> The mutex is odd. Nothing in the changelog, no code comment.
> __free_pages_bootmem() is called from a lot of places but only this one
> has locking. I'm madly guessing that the lock is here to handle two or
> more concurrent memory hotpluggings, but I shouldn't need to guess!!
Actually I'm a little hesitant about whether we should add a lock here.

All callers of __free_pages_bootmem() other than put_page_bootmem() are
only used at startup time. And currently the only caller of put_page_bootmem()
is already protected by pgdat_resize_lock(pgdat, &flags). So there's
no real need for the lock; it's just defensive.

I'm not sure which is the best solution here.
1) Add a comment to __free_pages_bootmem() stating that callers must
serialize themselves.
2) Use a dedicated lock to serialize updates to zone->managed_pages; this
needs modifications to page_alloc.c and memory_hotplug.c (a rough sketch
follows below).
3) Keep the above solution and serialize in put_page_bootmem().
What's your suggestion here?
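
For illustration, option 2 might look roughly like the sketch below (the
lock and helper names are hypothetical, not taken from the patches):

	static DEFINE_SPINLOCK(managed_pages_lock);

	static void adjust_zone_managed_pages(struct zone *zone, long delta)
	{
		spin_lock(&managed_pages_lock);
		zone->managed_pages += delta;
		spin_unlock(&managed_pages_lock);
	}

Both __free_pages_bootmem() and the hotplug paths would then call the
helper instead of touching zone->managed_pages directly.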

Thanks
Gerry

2012-11-20 15:18:51

by Jiang Liu

[permalink] [raw]
Subject: Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

On 11/20/2012 07:42 AM, Andrew Morton wrote:
> On Mon, 19 Nov 2012 00:07:29 +0800
> Jiang Liu <[email protected]> wrote:
>
>> If SPARSEMEM is enabled, it won't build page structures for
>> non-existing pages (holes) within a zone, so provide a more accurate
>> estimation of pages occupied by memmap if there are big holes within
>> the zone.
>>
>> And pages for highmem zones' memmap will be allocated from lowmem,
>> so charge nr_kernel_pages for that.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4435,6 +4435,22 @@ void __init set_pageblock_order(void)
>>
>> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>
>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>> + unsigned long present_pages)
>> +{
>> + unsigned long pages = spanned_pages;
>> +
>> + /*
>> + * Provide a more accurate estimation if there are big holes within
>> + * the zone and SPARSEMEM is in use.
>> + */
>> + if (spanned_pages > present_pages + (present_pages >> 4) &&
>> + IS_ENABLED(CONFIG_SPARSEMEM))
>> + pages = present_pages;
>> +
>> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>> +}
>
> Please explain the ">> 4" heuristc more completely - preferably in both
> the changelog and code comments. Why can't we calculate this
> requirement exactly? That might require a second pass, but that's OK for
> code like this?
Hi Andrew,
A normal x86 platform always has some holes within the DMA ZONE,
so the ">> 4" heuristic is there to avoid applying this adjustment to the
DMA ZONE on x86 platforms.
Because memmap_size is just an estimate, I feel it's OK to
remove the ">> 4" heuristic; that shouldn't affect much.

Thanks
Gerry

2012-11-20 16:03:26

by Jiang Liu

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

Thanks, Maciej!

On 11/20/2012 05:36 AM, Maciej Rutecki wrote:
> On niedziela, 18 listopada 2012 o 17:07:25 Jiang Liu wrote:
>> The commit 7f1290f2f2a4 ("mm: fix-up zone present pages") tries to
>> resolve an issue caused by inaccurate zone->present_pages, but that
>> fix is incomplete and causes regresions with HIGHMEM. And it has been
>> reverted by commit
>> 5576646 revert "mm: fix-up zone present pages"
>>
>> This is a following-up patchset for the issue above. It introduces a
>> new field named "managed_pages" to struct zone, which counts pages
>> managed by the buddy system from the zone. And zone->present_pages
>> is used to count pages existing in the zone, which is
>> spanned_pages - absent_pages.
>>
>> But that way, zone->present_pages will be kept in consistence with
>> pgdat->node_present_pages, which is sum of zone->present_pages.
>>
>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>> help to test this patchset on machines:
>> 1) use bootmem.c
>> 2) have highmem
>>
>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>
>> Any comments and helps are welcomed!
>>
>> Jiang Liu (5):
>> mm: introduce new field "managed_pages" to struct zone
>> mm: replace zone->present_pages with zone->managed_pages if
>> appreciated
>> mm: set zone->present_pages to number of existing pages in the zone
>> mm: provide more accurate estimation of pages occupied by memmap
>> mm: increase totalram_pages when free pages allocated by bootmem
>> allocator
>>
>> include/linux/mmzone.h | 1 +
>> mm/bootmem.c | 14 ++++++++
>> mm/memory_hotplug.c | 6 ++++
>> mm/mempolicy.c | 2 +-
>> mm/nobootmem.c | 15 ++++++++
>> mm/page_alloc.c | 89 +++++++++++++++++++++++++++++++-----------------
>> mm/vmscan.c | 16 ++++-----
>> mm/vmstat.c | 8 +++--
>> 8 files changed, 108 insertions(+), 43 deletions(-)
> Tested in 32 bit linux with HIGHMEM. Seems be OK.
>
> Regards
>

2012-11-20 19:19:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

On Tue, 20 Nov 2012 23:18:34 +0800
Jiang Liu <[email protected]> wrote:

> >> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> >> + unsigned long present_pages)
> >> +{
> >> + unsigned long pages = spanned_pages;
> >> +
> >> + /*
> >> + * Provide a more accurate estimation if there are big holes within
> >> + * the zone and SPARSEMEM is in use.
> >> + */
> >> + if (spanned_pages > present_pages + (present_pages >> 4) &&
> >> + IS_ENABLED(CONFIG_SPARSEMEM))
> >> + pages = present_pages;
> >> +
> >> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> >> +}
> >
> > Please explain the ">> 4" heuristc more completely - preferably in both
> > the changelog and code comments. Why can't we calculate this
> > requirement exactly? That might require a second pass, but that's OK for
> > code like this?
> Hi Andrew,
> A normal x86 platform always have some holes within the DMA ZONE,
> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
> ZONE on x86 platforms.
> Because the memmap_size is just an estimation, I feel it's OK to
> remove the ">> 4" heuristic, that shouldn't affect much.

Again: why can't we calculate this requirement exactly? That might
require a second pass, but that's OK for code like this?

2012-11-20 19:31:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone

On Tue, 20 Nov 2012 22:56:11 +0800
Jiang Liu <[email protected]> wrote:

> On 11/20/2012 07:38 AM, Andrew Morton wrote:
> > On Mon, 19 Nov 2012 00:07:26 +0800
> > Jiang Liu <[email protected]> wrote:
>
> ...
>
> > Also, the existing comment tells us that spanned_pages and
> > present_pages are protected by span_seqlock but has not been updated to
> > describe the locking (if any) for managed_pages.
> How about this?
>
> ...
>

Looks nice.

> >> + for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
> >> + if (!is_highmem(z))
> >
> > Needs a comment explaining why we skip the highmem zone, please.
> How about this?
>
> ...
>

Ditto.

> >> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info, struct page *page,
> >> void __ref put_page_bootmem(struct page *page)
> >> {
> >> unsigned long type;
> >> + static DEFINE_MUTEX(ppb_lock);
> >>
> >> type = (unsigned long) page->lru.next;
> >> BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> >> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
> >> ClearPagePrivate(page);
> >> set_page_private(page, 0);
> >> INIT_LIST_HEAD(&page->lru);
> >> + mutex_lock(&ppb_lock);
> >> __free_pages_bootmem(page, 0);
> >> + mutex_unlock(&ppb_lock);
> >
> > The mutex is odd. Nothing in the changelog, no code comment.
> > __free_pages_bootmem() is called from a lot of places but only this one
> > has locking. I'm madly guessing that the lock is here to handle two or
> > more concurrent memory hotpluggings, but I shouldn't need to guess!!
> Actually I'm a little hesitate whether we should add a lock here.
>
> All callers of __free_pages_bootmem() other than put_page_bootmem() should
> only be used at startup time. And currently the only caller of put_page_bootmem()
> has already been protected by pgdat_resize_lock(pgdat, &flags). So there's
> no real need for lock, just defensive.
>
> I'm not sure which is the best solution here.
> 1) add a comments into __free_pages_bootmem() to state that the caller should
> serialize themselves.
> 2) Use a dedicated lock to serialize updates to zone->managed_pages, this need
> modifications to page_alloc.c and memory_hotplug.c.
> 3) The above solution to serialize in put_page_bootmem().
> What's your suggestions here?

Firstly, let's be clear about what *data* we're protecting here. I
think it's only ->managed_pages?

I agree that no locking is needed during the init-time code.

So afaict we only need be concerned about concurrent updates to
->managed_pages via memory hotplug, and lock_memory_hotplug() is
sufficient there. We don't need to be concerned about readers of
managed_pages because it is an unsigned long (a u64 on 32-bit machines
would be a problem).

All correct? If so, the code is OK as-is and this can all be
described/formalised in code comments. If one wants to be really
confident, we could do something along the lines of

void mod_zone_managed_pages(struct zone *zone, signed long delta)
{
WARN_ON(system_state != SYSTEM_BOOTING &&
!is_locked_memory_hotplug());
zone->managed_pages += delta;
}

And yes, is_locked_memory_hotplug() is a dopey name.
[un]lock_memory_hotplug() should have been called
memory_hotplug_[un]lock()!

2012-11-21 14:37:52

by Jiang Liu

[permalink] [raw]
Subject: Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone

On 11/21/2012 03:31 AM, Andrew Morton wrote:

>>>> @@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info, struct page *page,
>>>> void __ref put_page_bootmem(struct page *page)
>>>> {
>>>> unsigned long type;
>>>> + static DEFINE_MUTEX(ppb_lock);
>>>>
>>>> type = (unsigned long) page->lru.next;
>>>> BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
>>>> @@ -115,7 +116,9 @@ void __ref put_page_bootmem(struct page *page)
>>>> ClearPagePrivate(page);
>>>> set_page_private(page, 0);
>>>> INIT_LIST_HEAD(&page->lru);
>>>> + mutex_lock(&ppb_lock);
>>>> __free_pages_bootmem(page, 0);
>>>> + mutex_unlock(&ppb_lock);
>>>
>>> The mutex is odd. Nothing in the changelog, no code comment.
>>> __free_pages_bootmem() is called from a lot of places but only this one
>>> has locking. I'm madly guessing that the lock is here to handle two or
>>> more concurrent memory hotpluggings, but I shouldn't need to guess!!
>> Actually I'm a little hesitate whether we should add a lock here.
>>
>> All callers of __free_pages_bootmem() other than put_page_bootmem() should
>> only be used at startup time. And currently the only caller of put_page_bootmem()
>> has already been protected by pgdat_resize_lock(pgdat, &flags). So there's
>> no real need for lock, just defensive.
>>
>> I'm not sure which is the best solution here.
>> 1) add a comments into __free_pages_bootmem() to state that the caller should
>> serialize themselves.
>> 2) Use a dedicated lock to serialize updates to zone->managed_pages, this need
>> modifications to page_alloc.c and memory_hotplug.c.
>> 3) The above solution to serialize in put_page_bootmem().
>> What's your suggestions here?
>
> Firstly, let's be clear about what *data* we're protecting here. I
> think it's only ->managed_pages?
Yes, we are just trying to protect ->managed_pages.

> I agree that no locking is needed during the init-time code.
>
> So afaict we only need be concerned about concurrent updates to
> ->managed_pages via memory hotplug, and lock_memory_hotplug() is
> sufficient there. We don't need to be concerned about readers of
> managed_pages because it is an unsigned long (a u64 on 32-bit machines
> would be a problem).
>
> All correct? If so, the code is OK as-is and this can all be
> described/formalised in code comments. If one wants to be really
> confident, we could do something along the lines of
OK, I will add some comments to describe this.

> void mod_zone_managed_pages(struct zone *zone, signed long delta)
> {
> WARN_ON(system_state != SYSTEM_BOOTING &&
> !is_locked_memory_hotplug());
> zone->managed_pages += delta;
> }
This seems to add a little overhead, because __free_pages_bootmem() is on
the hot path and will be called many times at boot time.

Regards!
Gerry

2012-11-21 14:53:00

by Jiang Liu

[permalink] [raw]
Subject: Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

On 11/21/2012 03:19 AM, Andrew Morton wrote:
> On Tue, 20 Nov 2012 23:18:34 +0800
> Jiang Liu <[email protected]> wrote:
>
>>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>>>> + unsigned long present_pages)
>>>> +{
>>>> + unsigned long pages = spanned_pages;
>>>> +
>>>> + /*
>>>> + * Provide a more accurate estimation if there are big holes within
>>>> + * the zone and SPARSEMEM is in use.
>>>> + */
>>>> + if (spanned_pages > present_pages + (present_pages >> 4) &&
>>>> + IS_ENABLED(CONFIG_SPARSEMEM))
>>>> + pages = present_pages;
>>>> +
>>>> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>>>> +}
>>>
>>> Please explain the ">> 4" heuristc more completely - preferably in both
>>> the changelog and code comments. Why can't we calculate this
>>> requirement exactly? That might require a second pass, but that's OK for
>>> code like this?
>> Hi Andrew,
>> A normal x86 platform always have some holes within the DMA ZONE,
>> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
>> ZONE on x86 platforms.
>> Because the memmap_size is just an estimation, I feel it's OK to
>> remove the ">> 4" heuristic, that shouldn't affect much.
>
> Again: why can't we calculate this requirement exactly? That might
> require a second pass, but that's OK for code like this?

Hi Andrew,
If there are holes within a zone, it may cost us one or two extra pages
for each populated region within the zone due to alignment, because the
memmap for each populated region may not be naturally aligned on a page
boundary.
The ">> 4" heuristic was originally meant to trade off these extra memmap
pages, especially for small zones like the DMA zone.
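
As a made-up illustration: with 4 KiB pages and a 32-byte struct page, a
populated region of 100 pages needs only 3200 bytes of memmap, yet that
memmap still occupies at least one full page, so a zone split into many
small populated regions can lose roughly a page per region beyond the
present_pages-based estimate. The ">> 4" check keeps the spanned_pages-based
estimate unless the holes exceed about 1/16 of present_pages, so small
zones like DMA are left alone.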
Thanks!

2012-11-21 15:08:11

by Jiang Liu

[permalink] [raw]
Subject: [RFT PATCH v2 1/5] mm: introduce new field "managed_pages" to struct zone

Currently a zone's present_pages is calculated as below, which is
inaccurate and may cause trouble to memory hotplug.
spanned_pages - absent_pages - memmap_pages - dma_reserve.

During fixing bugs caused by inaccurate zone->present_pages, we found
zone->present_pages has been abused. The field zone->present_pages
may have different meanings in different contexts:
1) pages existing in a zone.
2) pages managed by the buddy system.

For more discussions about the issue, please refer to:
http://lkml.org/lkml/2012/11/5/866
https://patchwork.kernel.org/patch/1346751/

This patchset introduces a new field named "managed_pages" to
struct zone, which counts "pages managed by the buddy system", and
reverts zone->present_pages to counting "physical pages existing in a
zone", which also keeps it consistent with pgdat->node_present_pages.

We will set an initial value for zone->managed_pages in function
free_area_init_core() and will adjust it later if the initial value is
inaccurate.

For DMA/normal zones, the initial value is set to:
(spanned_pages - absent_pages - memmap_pages - dma_reserve)
Later zone->managed_pages will be adjusted to the accurate value when
the bootmem allocator frees all free pages to the buddy system in
function free_all_bootmem_node() and free_all_bootmem().

The bootmem allocator doesn't touch highmem pages, so highmem zones'
managed_pages is set to the accurate value "spanned_pages - absent_pages"
in function free_area_init_core() and won't be updated anymore.

This patch also adds a new field "managed_pages" to /proc/zoneinfo
and sysrq showmem.

Signed-off-by: Jiang Liu <[email protected]>
---
include/linux/mmzone.h | 41 ++++++++++++++++++++++++++++++++++-------
mm/bootmem.c | 21 +++++++++++++++++++++
mm/memory_hotplug.c | 10 ++++++++++
mm/nobootmem.c | 22 ++++++++++++++++++++++
mm/page_alloc.c | 44 ++++++++++++++++++++++++++++++--------------
mm/vmstat.c | 6 ++++--
6 files changed, 121 insertions(+), 23 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a23923b..47f5672 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -469,17 +469,44 @@ struct zone {
unsigned long zone_start_pfn;

/*
- * zone_start_pfn, spanned_pages and present_pages are all
- * protected by span_seqlock. It is a seqlock because it has
- * to be read outside of zone->lock, and it is done in the main
- * allocator path. But, it is written quite infrequently.
+ * spanned_pages is the total pages spanned by the zone, including
+ * holes, which is calculated as:
+ * spanned_pages = zone_end_pfn - zone_start_pfn;
*
- * The lock is declared along with zone->lock because it is
+ * present_pages is physical pages existing within the zone, which
+ * is calculated as:
+ * present_pages = spanned_pages - absent_pages(pages in holes);
+ *
+ * managed_pages is present pages managed by the buddy system, which
+ * is calculated as (reserved_pages includes pages allocated by the
+ * bootmem allocator):
+ * managed_pages = present_pages - reserved_pages;
+ *
+ * So present_pages may be used by memory hotplug or memory power
+ * management logic to figure out unmanaged pages by checking
+ * (present_pages - managed_pages). And managed_pages should be used
+ * by page allocator and vm scanner to calculate all kinds of watermarks
+ * and thresholds.
+ *
+ * Lock Rules:
+ *
+ * zone_start_pfn, spanned_pages are protected by span_seqlock.
+ * It is a seqlock because it has to be read outside of zone->lock,
+ * and it is done in the main allocator path. But, it is written
+ * quite infrequently.
+ *
+ * The span_seq lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
+ *
+ * Writing access to present_pages and managed_pages at runtime should
+ * be protected by lock_memory_hotplug()/unlock_memory_hotplug().
+ * Any reader who can't tolerate drift of present_pages and
+ * managed_pages should hold the memory hotplug lock to get a stable value.
*/
- unsigned long spanned_pages; /* total size, including holes */
- unsigned long present_pages; /* amount of memory (excluding holes) */
+ unsigned long spanned_pages;
+ unsigned long present_pages;
+ unsigned long managed_pages;

/*
* rarely used fields:
diff --git a/mm/bootmem.c b/mm/bootmem.c
index f468185..4117294 100644
--- a/mm/bootmem.c
+++ b/mm/bootmem.c
@@ -229,6 +229,22 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
return count;
}

+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+ struct zone *z;
+
+ /*
+ * In free_area_init_core(), highmem zone's managed_pages is set to
+ * present_pages, and bootmem allocator doesn't allocate from highmem
+ * zones. So there's no need to recalculate managed_pages because all
+ * highmem pages will be managed by the buddy system. Here highmem
+ * zone also includes highmem movable zone.
+ */
+ for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+ if (!is_highmem(z))
+ z->managed_pages = 0;
+}
+
/**
* free_all_bootmem_node - release a node's free pages to the buddy allocator
* @pgdat: node to be released
@@ -238,6 +254,7 @@ static unsigned long __init free_all_bootmem_core(bootmem_data_t *bdata)
unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
{
register_page_bootmem_info_node(pgdat);
+ reset_node_lowmem_managed_pages(pgdat);
return free_all_bootmem_core(pgdat->bdata);
}

@@ -250,6 +267,10 @@ unsigned long __init free_all_bootmem(void)
{
unsigned long total_pages = 0;
bootmem_data_t *bdata;
+ struct pglist_data *pgdat;
+
+ for_each_online_pgdat(pgdat)
+ reset_node_lowmem_managed_pages(pgdat);

list_for_each_entry(bdata, &bdata_list, list)
total_pages += free_all_bootmem_core(bdata);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index e4eeaca..1a9e675 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -106,6 +106,7 @@ static void get_page_bootmem(unsigned long info, struct page *page,
void __ref put_page_bootmem(struct page *page)
{
unsigned long type;
+ static DEFINE_MUTEX(ppb_lock);

type = (unsigned long) page->lru.next;
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
@@ -115,7 +116,14 @@ void __ref put_page_bootmem(struct page *page)
ClearPagePrivate(page);
set_page_private(page, 0);
INIT_LIST_HEAD(&page->lru);
+
+ /*
+ * Please refer to comment for __free_pages_bootmem()
+ * for why we serialize here.
+ */
+ mutex_lock(&ppb_lock);
__free_pages_bootmem(page, 0);
+ mutex_unlock(&ppb_lock);
}

}
@@ -514,6 +522,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
return ret;
}

+ zone->managed_pages += onlined_pages;
zone->present_pages += onlined_pages;
zone->zone_pgdat->node_present_pages += onlined_pages;
if (onlined_pages) {
@@ -961,6 +970,7 @@ repeat:
/* reset pagetype flags and makes migrate type to be MOVABLE */
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
/* removal success */
+ zone->managed_pages -= offlined_pages;
zone->present_pages -= offlined_pages;
zone->zone_pgdat->node_present_pages -= offlined_pages;
totalram_pages -= offlined_pages;
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bd82f6b..b8294fc 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -137,6 +137,22 @@ unsigned long __init free_low_memory_core_early(int nodeid)
return count;
}

+static void reset_node_lowmem_managed_pages(pg_data_t *pgdat)
+{
+ struct zone *z;
+
+ /*
+ * In free_area_init_core(), highmem zone's managed_pages is set to
+ * present_pages, and bootmem allocator doesn't allocate from highmem
+ * zones. So there's no need to recalculate managed_pages because all
+ * highmem pages will be managed by the buddy system. Here highmem
+ * zone also includes highmem movable zone.
+ */
+ for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
+ if (!is_highmem(z))
+ z->managed_pages = 0;
+}
+
/**
* free_all_bootmem_node - release a node's free pages to the buddy allocator
* @pgdat: node to be released
@@ -146,6 +162,7 @@ unsigned long __init free_low_memory_core_early(int nodeid)
unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
{
register_page_bootmem_info_node(pgdat);
+ reset_node_lowmem_managed_pages(pgdat);

/* free_low_memory_core_early(MAX_NUMNODES) will be called later */
return 0;
@@ -158,6 +175,11 @@ unsigned long __init free_all_bootmem_node(pg_data_t *pgdat)
*/
unsigned long __init free_all_bootmem(void)
{
+ struct pglist_data *pgdat;
+
+ for_each_online_pgdat(pgdat)
+ reset_node_lowmem_managed_pages(pgdat);
+
/*
* We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..4418a04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -730,6 +730,13 @@ static void __free_pages_ok(struct page *page, unsigned int order)
local_irq_restore(flags);
}

+/*
+ * Read access to zone->managed_pages is safe because it is an unsigned long,
+ * but we still need to serialize writers. Currently all callers of
+ * __free_pages_bootmem() except put_page_bootmem() are only used at boot
+ * time. So to keep boot time short, we shift the burden of serializing
+ * writers to put_page_bootmem().
+ */
void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
{
unsigned int nr_pages = 1 << order;
@@ -745,6 +752,7 @@ void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
set_page_count(p, 0);
}

+ page_zone(page)->managed_pages += 1 << order;
set_page_refcounted(page);
__free_pages(page, order);
}
@@ -2950,6 +2958,7 @@ void show_free_areas(unsigned int filter)
" isolated(anon):%lukB"
" isolated(file):%lukB"
" present:%lukB"
+ " managed:%lukB"
" mlocked:%lukB"
" dirty:%lukB"
" writeback:%lukB"
@@ -2979,6 +2988,7 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_ISOLATED_ANON)),
K(zone_page_state(zone, NR_ISOLATED_FILE)),
K(zone->present_pages),
+ K(zone->managed_pages),
K(zone_page_state(zone, NR_MLOCK)),
K(zone_page_state(zone, NR_FILE_DIRTY)),
K(zone_page_state(zone, NR_WRITEBACK)),
@@ -4455,48 +4465,54 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
- unsigned long size, realsize, memmap_pages;
+ unsigned long size, realsize, freesize, memmap_pages;

size = zone_spanned_pages_in_node(nid, j, zones_size);
- realsize = size - zone_absent_pages_in_node(nid, j,
+ realsize = freesize = size - zone_absent_pages_in_node(nid, j,
zholes_size);

/*
- * Adjust realsize so that it accounts for how much memory
+ * Adjust freesize so that it accounts for how much memory
* is used by this zone for memmap. This affects the watermark
* and per-cpu initialisations
*/
memmap_pages =
PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
- if (realsize >= memmap_pages) {
- realsize -= memmap_pages;
+ if (freesize >= memmap_pages) {
+ freesize -= memmap_pages;
if (memmap_pages)
printk(KERN_DEBUG
" %s zone: %lu pages used for memmap\n",
zone_names[j], memmap_pages);
} else
printk(KERN_WARNING
- " %s zone: %lu pages exceeds realsize %lu\n",
- zone_names[j], memmap_pages, realsize);
+ " %s zone: %lu pages exceeds freesize %lu\n",
+ zone_names[j], memmap_pages, freesize);

/* Account for reserved pages */
- if (j == 0 && realsize > dma_reserve) {
- realsize -= dma_reserve;
+ if (j == 0 && freesize > dma_reserve) {
+ freesize -= dma_reserve;
printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
zone_names[0], dma_reserve);
}

if (!is_highmem_idx(j))
- nr_kernel_pages += realsize;
- nr_all_pages += realsize;
+ nr_kernel_pages += freesize;
+ nr_all_pages += freesize;

zone->spanned_pages = size;
- zone->present_pages = realsize;
+ zone->present_pages = freesize;
+ /*
+ * Set an approximate value for lowmem here; it will be adjusted
+ * when the bootmem allocator frees pages into the buddy system.
+ * And all highmem pages will be managed by the buddy system.
+ */
+ zone->managed_pages = is_highmem_idx(j) ? realsize : freesize;
#ifdef CONFIG_NUMA
zone->node = nid;
- zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
+ zone->min_unmapped_pages = (freesize*sysctl_min_unmapped_ratio)
/ 100;
- zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
+ zone->min_slab_pages = (freesize * sysctl_min_slab_ratio) / 100;
#endif
zone->name = zone_names[j];
spin_lock_init(&zone->lock);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..e47d31c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -992,14 +992,16 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n high %lu"
"\n scanned %lu"
"\n spanned %lu"
- "\n present %lu",
+ "\n present %lu"
+ "\n managed %lu",
zone_page_state(zone, NR_FREE_PAGES),
min_wmark_pages(zone),
low_wmark_pages(zone),
high_wmark_pages(zone),
zone->pages_scanned,
zone->spanned_pages,
- zone->present_pages);
+ zone->present_pages,
+ zone->managed_pages);

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu", vmstat_text[i],
--
1.7.9.5

2012-11-21 15:10:11

by Jiang Liu

[permalink] [raw]
Subject: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap

If SPARSEMEM is enabled, page structures are not built for non-existent
pages (holes) within a zone, so provide a more accurate estimate of the
pages occupied by the memmap when there are big holes within the zone.

And the memmap for highmem zones is allocated from lowmem, so charge it
to nr_kernel_pages.

Signed-off-by: Jiang Liu <[email protected]>
---
mm/page_alloc.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc10071..9bbac97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)

#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

+static unsigned long calc_memmap_size(unsigned long spanned_pages,
+ unsigned long present_pages)
+{
+ unsigned long pages = spanned_pages;
+
+ /*
+ * Provide a more accurate estimation if there are holes within
+ * the zone and SPARSEMEM is in use. If there are holes within the
+ * zone, each populated memory region may cost us one or two extra
+ * memmap pages due to alignment because memmap pages for each
+ * populated region may not be naturally aligned on a page boundary.
+ * So the (present_pages >> 4) heuristic is a tradeoff for that.
+ */
+ if (spanned_pages > present_pages + (present_pages >> 4) &&
+ IS_ENABLED(CONFIG_SPARSEMEM))
+ pages = present_pages;
+
+ return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -4476,8 +4496,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
* is used by this zone for memmap. This affects the watermark
* and per-cpu initialisations
*/
- memmap_pages =
- PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+ memmap_pages = calc_memmap_size(size, realsize);
if (freesize >= memmap_pages) {
freesize -= memmap_pages;
if (memmap_pages)
@@ -4498,6 +4517,9 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

if (!is_highmem_idx(j))
nr_kernel_pages += freesize;
+ /* Charge for highmem memmap if there are enough kernel pages */
+ else if (nr_kernel_pages > memmap_pages * 2)
+ nr_kernel_pages -= memmap_pages;
nr_all_pages += freesize;

zone->spanned_pages = size;
--
1.7.9.5

2012-11-21 19:31:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 1/5] mm: introduce new field "managed_pages" to struct zone

On Wed, 21 Nov 2012 22:36:56 +0800
Jiang Liu <[email protected]> wrote:

> > void mod_zone_managed_pages(struct zone *zone, signed long delta)
> > {
> > WARN_ON(system_state != SYSTEM_BOOTING &&
> > !is_locked_memory_hotplug());
> > zone->managed_pages += delta;
> > }
> This seems to add a little overhead because __free_pages_bootmem() is on the hot path
> and is called many times at boot time.

Maybe, maybe not. These things are measurable so let's not just guess.

But I'm not really recommending that we do this - there are all sorts
of things we *could* check and warn about, but we don't. Potential
errors in this area don't seem terribly important.

2012-11-21 19:35:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

On Wed, 21 Nov 2012 22:52:29 +0800
Jiang Liu <[email protected]> wrote:

> On 11/21/2012 03:19 AM, Andrew Morton wrote:
> > On Tue, 20 Nov 2012 23:18:34 +0800
> > Jiang Liu <[email protected]> wrote:
> >
> >>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> >>>> + unsigned long present_pages)
> >>>> +{
> >>>> + unsigned long pages = spanned_pages;
> >>>> +
> >>>> + /*
> >>>> + * Provide a more accurate estimation if there are big holes within
> >>>> + * the zone and SPARSEMEM is in use.
> >>>> + */
> >>>> + if (spanned_pages > present_pages + (present_pages >> 4) &&
> >>>> + IS_ENABLED(CONFIG_SPARSEMEM))
> >>>> + pages = present_pages;
> >>>> +
> >>>> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> >>>> +}
> >>>
> >>> Please explain the ">> 4" heuristic more completely - preferably in both
> >>> the changelog and code comments. Why can't we calculate this
> >>> requirement exactly? That might require a second pass, but that's OK for
> >>> code like this?
> >> Hi Andrew,
> >> A normal x86 platform always has some holes within the DMA ZONE,
> >> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
> >> ZONE on x86 platforms.
> >> Because the memmap_size is just an estimation, I feel it's OK to
> >> remove the ">> 4" heuristic, that shouldn't affect much.
> >
> > Again: why can't we calculate this requirement exactly? That might
> > require a second pass, but that's OK for code like this?
>
> Hi Andrew,
> If there are holes within a zone, each populated region within the zone may
> cost us one or two extra memmap pages due to alignment, because the memmap for
> each populated region may not be naturally aligned on a page boundary.

Right. So with an additional pass across the zone and a bit of
arithmetic, we can calculate the exact space requirement for memmap?
No need for kludgy heuristics?

> The ">> 4" heuristic was originally intended to trade off these extra memmap
> pages, especially for small zones like the DMA zone.

2012-11-22 19:08:42

by Jiang Liu

[permalink] [raw]
Subject: Re: [RFT PATCH v1 4/5] mm: provide more accurate estimation of pages occupied by memmap

On 11/22/2012 03:35 AM, Andrew Morton wrote:
> On Wed, 21 Nov 2012 22:52:29 +0800
> Jiang Liu <[email protected]> wrote:
>
>> On 11/21/2012 03:19 AM, Andrew Morton wrote:
>>> On Tue, 20 Nov 2012 23:18:34 +0800
>>> Jiang Liu <[email protected]> wrote:
>>>
>>>>>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>>>>>> + unsigned long present_pages)
>>>>>> +{
>>>>>> + unsigned long pages = spanned_pages;
>>>>>> +
>>>>>> + /*
>>>>>> + * Provide a more accurate estimation if there are big holes within
>>>>>> + * the zone and SPARSEMEM is in use.
>>>>>> + */
>>>>>> + if (spanned_pages > present_pages + (present_pages >> 4) &&
>>>>>> + IS_ENABLED(CONFIG_SPARSEMEM))
>>>>>> + pages = present_pages;
>>>>>> +
>>>>>> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>>>>>> +}
>>>>>
>>>>> Please explain the ">> 4" heuristic more completely - preferably in both
>>>>> the changelog and code comments. Why can't we calculate this
>>>>> requirement exactly? That might require a second pass, but that's OK for
>>>>> code like this?
>>>> Hi Andrew,
>>>> A normal x86 platform always has some holes within the DMA ZONE,
>>>> so the ">> 4" heuristic is to avoid applying this adjustment to the DMA
>>>> ZONE on x86 platforms.
>>>> Because the memmap_size is just an estimation, I feel it's OK to
>>>> remove the ">> 4" heuristic, that shouldn't affect much.
>>>
>>> Again: why can't we calculate this requirement exactly? That might
>>> require a second pass, but that's OK for code like this?
>>
>> Hi Andrew,
>> If there are holes within a zone, each populated region within the zone may
>> cost us one or two extra memmap pages due to alignment, because the memmap for
>> each populated region may not be naturally aligned on a page boundary.
>
> Right. So with an additional pass across the zone and a bit of
> arithmetic, we can calculate the exact space requirement for memmap?
> No need for kludgy heuristics?
Hi Andrew:
Happy Thanksgiving!

The way to calculate the exact space requirement for memmap seems a little
complex; it depends on:
CONFIG_SPARSEMEM_VMEMMAP
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
the arch implementation of alloc_remap()

Actually, the original motivation was to reduce the deviation on a platform
such as:
node 0: 0-2G, 4G-255G (a 2G hole between 2G and 4G)
node 1: 256G - 511G
node 2: 512G - 767G
node 3: 768G - 1023G
node 0: 1024G - 1026G (memory recovered from the hole)

So I just tried to reduce the deviation rather than calculate the memmap size exactly.

Regards!
Gerry
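
To put a rough number on the deviation in that example, here is a small
standalone sketch. The 4 KiB page size and 64-byte struct page are illustrative
assumptions, node 0's Normal zone is simplified to span 4G-1026G with only
4G-255G and 1024G-1026G populated, and a 64-bit unsigned long is assumed.

#include <stdio.h>

#define PAGE_SHIFT      12
#define GIB_PAGES       (1UL << (30 - PAGE_SHIFT))  /* pages per GiB, 4 KiB pages */
#define STRUCT_PAGE_SZ  64UL                        /* assumed struct page size */

int main(void)
{
        /* Node 0 Normal zone from the example above: it spans 4G-1026G because
         * of the region recovered at 1024G-1026G, but only ~253G is present. */
        unsigned long spanned = 1022 * GIB_PAGES;
        unsigned long present = (251 + 2) * GIB_PAGES;

        unsigned long memmap_spanned = spanned * STRUCT_PAGE_SZ >> PAGE_SHIFT;
        unsigned long memmap_present = present * STRUCT_PAGE_SZ >> PAGE_SHIFT;

        printf("spanned-based memmap estimate: %lu pages (~%lu MiB)\n",
               memmap_spanned, memmap_spanned >> (20 - PAGE_SHIFT));
        printf("present-based memmap estimate: %lu pages (~%lu MiB)\n",
               memmap_present, memmap_present >> (20 - PAGE_SHIFT));
        printf("deviation:                     ~%lu MiB over-accounted\n",
               (memmap_spanned - memmap_present) >> (20 - PAGE_SHIFT));
        return 0;
}

Under those assumptions the spanned-based estimate over-accounts roughly 12 GiB
of memmap on this layout, which is the deviation the present-based estimate is
meant to avoid.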

2012-11-22 20:41:25

by Chris Clayton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages

>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>> help to test this patchset on machines:
>> 1) use bootmem.c
>> 2) have highmem
>>
>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>
>
> I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that
> the kernel allows my system to resume from a suspend to disc. Although
> my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):
>
> [chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32 .config
> CONFIG_X86_32=y
> CONFIG_X86_32_SMP=y
> CONFIG_X86_32_LAZY_GS=y
> # CONFIG_X86_32_IRIS is not set
> # CONFIG_NOHIGHMEM is not set
> # CONFIG_HIGHMEM4G is not set
> CONFIG_HIGHMEM64G=y
> CONFIG_HIGHMEM=y
>
> I can also say that a quick browse of the output of dmesg, shows nothing
> out of the ordinary. I have insufficient knowledge to comment on the
> patches, but I will run the kernel over the next few days and report
> back later in the week.
>

Well, I've been running the kernel since Sunday and have had no problems
with my normal work mix of browsing the internet, video editing,
listening to music and building software. I'm now running a kernel built
with the new patches 1 and 4 from yesterday (plus the original 1, 2
and 5). All seems OK so far, including a couple of resumes from suspend
to disk.

2012-11-26 09:46:14

by Chris Clayton

[permalink] [raw]
Subject: Re: [RFT PATCH v1 0/5] fix up inaccurate zone->present_pages



On 11/22/12 09:23, Chris Clayton wrote:
>>> This patchset has only been tested on x86_64 with nobootmem.c. So need
>>> help to test this patchset on machines:
>>> 1) use bootmem.c
>>> 2) have highmem
>>>
>>> This patchset applies to "f4a75d2e Linux 3.7-rc6" from
>>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
>>>
>>
>> I've applied the five patches to Linus' 3.7.0-rc6 and can confirm that
>> the kernel allows my system to resume from a suspend to disc. Although
>> my laptop is 64 bit, I run a 32 bit kernel with HIGHMEM (I have 8GB RAM):
>>
>> [chris:~/kernel/tmp/linux-3.7-rc6-resume]$ grep -E HIGHMEM\|X86_32
>> .config
>> CONFIG_X86_32=y
>> CONFIG_X86_32_SMP=y
>> CONFIG_X86_32_LAZY_GS=y
>> # CONFIG_X86_32_IRIS is not set
>> # CONFIG_NOHIGHMEM is not set
>> # CONFIG_HIGHMEM4G is not set
>> CONFIG_HIGHMEM64G=y
>> CONFIG_HIGHMEM=y
>>
>> I can also say that a quick browse of the output of dmesg, shows nothing
>> out of the ordinary. I have insufficient knowledge to comment on the
>> patches, but I will run the kernel over the next few days and report
>> back later in the week.
>>
>
> Well, I've been running the kernel since Sunday and have had no problems
> with my normal work mix of browsing the internet, video editing,
> listening to music and building software. I'm now running a kernel built
> with the new patches 1 and 4 from yesterday (plus the original 1, 2
> and 5). All seems OK so far, including a couple of resumes from suspend
> to disk.
>
>

-rc6 with Gerry's patches has run fine, including numerous resumes from
suspend to disk, which fail (freezing or rebooting) without the
patches. I've now applied the patches to -rc7 (they apply with a few
offsets, but look OK) and will run that for a day or two.

2012-11-28 23:52:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap

On Wed, 21 Nov 2012 23:09:46 +0800
Jiang Liu <[email protected]> wrote:

> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap

How are people to test this? "does it boot"?

> If SPARSEMEM is enabled, page structures are not built for non-existent
> pages (holes) within a zone, so provide a more accurate estimate of the
> pages occupied by the memmap when there are big holes within the zone.
>
> And the memmap for highmem zones is allocated from lowmem, so charge it
> to nr_kernel_pages.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
>
> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>
> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
> + unsigned long present_pages)
> +{
> + unsigned long pages = spanned_pages;
> +
> + /*
> + * Provide a more accurate estimation if there are holes within
> + * the zone and SPARSEMEM is in use. If there are holes within the
> + * zone, each populated memory region may cost us one or two extra
> + * memmap pages due to alignment because memmap pages for each
> + * populated region may not be naturally aligned on a page boundary.
> + * So the (present_pages >> 4) heuristic is a tradeoff for that.
> + */
> + if (spanned_pages > present_pages + (present_pages >> 4) &&
> + IS_ENABLED(CONFIG_SPARSEMEM))
> + pages = present_pages;
> +
> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
> +}
> +

I suppose we should do this, although it makes no difference as the
compiler will inline calc_memmap_size() into its caller:

--- a/mm/page_alloc.c~mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap-fix
+++ a/mm/page_alloc.c
@@ -4526,8 +4526,8 @@ void __init set_pageblock_order(void)

#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

-static unsigned long calc_memmap_size(unsigned long spanned_pages,
- unsigned long present_pages)
+static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
+ unsigned long present_pages)
{
unsigned long pages = spanned_pages;

2012-11-29 02:25:42

by Jianguo Wu

[permalink] [raw]
Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap

On 2012/11/29 7:52, Andrew Morton wrote:

> On Wed, 21 Nov 2012 23:09:46 +0800
> Jiang Liu <[email protected]> wrote:
>
>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
>
> How are people to test this? "does it boot"?
>

I have tested this on x86_64, and it does boot.

Node 0, zone DMA
pages free 3972
min 1
low 1
high 1
scanned 0
spanned 4080
present 3979
managed 3972

Node 0, zone DMA32
pages free 448783
min 172
low 215
high 258
scanned 0
spanned 1044480
present 500799
managed 444545

Node 0, zone Normal
pages free 2375547
min 1394
low 1742
high 2091
scanned 0
spanned 3670016
present 3670016
managed 3585105
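
As a rough cross-check against the definitions introduced in patch 1/5
(spanned - present gives hole pages, present - managed gives pages that were
never released to the buddy allocator, e.g. early bootmem allocations and
reserved ranges), these figures work out to:

DMA:    4080 - 3979 = 101 hole pages;          3979 - 3972 = 7 unmanaged pages
DMA32:  1044480 - 500799 = 543681 hole pages;  500799 - 444545 = 56254 unmanaged pages
Normal: 3670016 - 3670016 = 0 hole pages;      3670016 - 3585105 = 84911 unmanaged pages

The exact split of the unmanaged counts is not shown by /proc/zoneinfo; this is
only an interpretation of the reported numbers.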

Thanks,
Jianguo Wu

>> If SPARSEMEM is enabled, page structures are not built for non-existent
>> pages (holes) within a zone, so provide a more accurate estimate of the
>> pages occupied by the memmap when there are big holes within the zone.
>>
>> And the memmap for highmem zones is allocated from lowmem, so charge it
>> to nr_kernel_pages.
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -4442,6 +4442,26 @@ void __init set_pageblock_order(void)
>>
>> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>>
>> +static unsigned long calc_memmap_size(unsigned long spanned_pages,
>> + unsigned long present_pages)
>> +{
>> + unsigned long pages = spanned_pages;
>> +
>> + /*
>> + * Provide a more accurate estimation if there are holes within
>> + * the zone and SPARSEMEM is in use. If there are holes within the
>> + * zone, each populated memory region may cost us one or two extra
>> + * memmap pages due to alignment because memmap pages for each
>> + * populated region may not be naturally aligned on a page boundary.
>> + * So the (present_pages >> 4) heuristic is a tradeoff for that.
>> + */
>> + if (spanned_pages > present_pages + (present_pages >> 4) &&
>> + IS_ENABLED(CONFIG_SPARSEMEM))
>> + pages = present_pages;
>> +
>> + return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT;
>> +}
>> +
>
> I suppose we should do this, although it makes no difference as the
> compiler will inline calc_memmap_size() into its caller:
>
> --- a/mm/page_alloc.c~mm-provide-more-accurate-estimation-of-pages-occupied-by-memmap-fix
> +++ a/mm/page_alloc.c
> @@ -4526,8 +4526,8 @@ void __init set_pageblock_order(void)
>
> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>
> -static unsigned long calc_memmap_size(unsigned long spanned_pages,
> - unsigned long present_pages)
> +static unsigned long __paginginit calc_memmap_size(unsigned long spanned_pages,
> + unsigned long present_pages)
> {
> unsigned long pages = spanned_pages;
>
>
>
> .
>


2012-11-29 10:52:16

by Chris Clayton

[permalink] [raw]
Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap

On 11/28/12 23:52, Andrew Morton wrote:
> On Wed, 21 Nov 2012 23:09:46 +0800
> Jiang Liu <[email protected]> wrote:
>
>> Subject: Re: [RFT PATCH v2 4/5] mm: provide more accurate estimation of pages occupied by memmap
>
> How are people to test this? "does it boot"?
>

I've been running kernels with Gerry's 5 patches applied for 11 days
now. This is on a 64bit laptop but with a 32bit kernel + HIGHMEM. I
joined the conversation because my laptop would not resume from suspend
to disk - it either froze or rebooted. With the patches applied the
laptop does successfully resume and has been stable.

Since Monday, I have been running a kernel with the patches (plus,
from today, the patch you mailed yesterday) applied to 3.7-rc7, without
problems.

Thanks,
Chris