Two cleanups for "[PATCH] mm/page_alloc: Skip non present sections on zone
initialization" [1], whereby one cleanup seems to also be a fix for a
(theoretical?) kernelcore=mirror case - unless I am messing something up :)
[1] https://lkml.kernel.org/r/[email protected]
David Hildenbrand (2):
mm/page_alloc: fix and rework pfn handling in memmap_init_zone()
mm: factor out next_present_section_nr()
include/linux/mmzone.h | 10 ++++++++++
mm/page_alloc.c | 20 ++++++++------------
mm/sparse.c | 10 ----------
3 files changed, 18 insertions(+), 22 deletions(-)
--
2.24.1
Let's move it to the header and use the shorter variant from
mm/page_alloc.c (the original one will also check
"__highest_present_section_nr + 1", which is not necessary). While at it,
make the section_nr in next_pfn() const.
In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once
we exceed __highest_present_section_nr, which doesn't make a difference in
the caller as it is big enough (>= all sane end_pfn).
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
include/linux/mmzone.h | 10 ++++++++++
mm/page_alloc.c | 11 ++---------
mm/sparse.c | 10 ----------
3 files changed, 12 insertions(+), 19 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c2bc309d1634..462f6873905a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1379,6 +1379,16 @@ static inline int pfn_present(unsigned long pfn)
return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
}
+static inline unsigned long next_present_section_nr(unsigned long section_nr)
+{
+ while (++section_nr <= __highest_present_section_nr) {
+ if (present_section_nr(section_nr))
+ return section_nr;
+ }
+
+ return -1;
+}
+
/*
* These are _only_ used during initialisation, therefore they
* can use __initdata ... They could have names to indicate
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a92791512077..26e8044e9848 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5852,18 +5852,11 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
/* Skip PFNs that belong to non-present sections */
static inline __meminit unsigned long next_pfn(unsigned long pfn)
{
- unsigned long section_nr;
+ const unsigned long section_nr = pfn_to_section_nr(++pfn);
- section_nr = pfn_to_section_nr(++pfn);
if (present_section_nr(section_nr))
return pfn;
-
- while (++section_nr <= __highest_present_section_nr) {
- if (present_section_nr(section_nr))
- return section_nr_to_pfn(section_nr);
- }
-
- return -1;
+ return section_nr_to_pfn(next_present_section_nr(section_nr));
}
#else
static inline __meminit unsigned long next_pfn(unsigned long pfn)
diff --git a/mm/sparse.c b/mm/sparse.c
index 3822ecbd8a1f..ac4a2bfae514 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -198,16 +198,6 @@ static void section_mark_present(struct mem_section *ms)
ms->section_mem_map |= SECTION_MARKED_PRESENT;
}
-static inline unsigned long next_present_section_nr(unsigned long section_nr)
-{
- do {
- section_nr++;
- if (present_section_nr(section_nr))
- return section_nr;
- } while ((section_nr <= __highest_present_section_nr));
-
- return -1;
-}
#define for_each_present_section_nr(start, section_nr) \
for (section_nr = next_present_section_nr(start-1); \
((section_nr != -1) && \
--
2.24.1
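As a rough userspace illustration of the difference between the two loop variants in this patch (a sketch, not kernel code: the presence table, NR_SECTIONS, highest_present, the lookups counter and the next_present_old/next_present_new names are all made up for the example), the removed variant tests one section past the highest present one before giving up, while the header variant tests the bound first:

```c
#include <stdbool.h>

/* Toy model: 8 sections; which ones are present is made up. */
#define NR_SECTIONS 8UL
static const bool present[NR_SECTIONS] = {
	true, false, false, true, false, true, false, false
};
/* Stands in for __highest_present_section_nr. */
static const unsigned long highest_present = 5;

/* Counts presence tests, to compare the two variants. */
static unsigned long lookups;

static bool present_section_nr(unsigned long nr)
{
	lookups++;
	return nr < NR_SECTIONS && present[nr];
}

/* Shape of the variant removed from mm/sparse.c: test, then check bound. */
static unsigned long next_present_old(unsigned long nr)
{
	do {
		nr++;
		if (present_section_nr(nr))
			return nr;
	} while (nr <= highest_present);

	return -1UL;
}

/* Shape of the variant moved to the header: check bound, then test. */
static unsigned long next_present_new(unsigned long nr)
{
	while (++nr <= highest_present) {
		if (present_section_nr(nr))
			return nr;
	}

	return -1UL;
}
```

Both functions return the same section numbers for every start value in this model; the only difference is the extra presence test on highest_present + 1 in the old shape.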
Let's update the pfn manually whenever we continue the loop. This makes
the code easier to read but also less error prone (and we can directly
fix one issue).
When overlap_memmap_init() returns true, pfn is updated to
"memblock_region_memory_end_pfn(r)". So it already points at the *next*
pfn to process. Incrementing the pfn another time is wrong; we might
leave one uninitialized. I spotted this by inspecting the code, so I have
no idea if this is relevant in practice (with kernelcore=mirror).
Fixes: a9a9e77fbf27 ("mm: move mirrored memory specific code outside of memmap_init_zone")
Cc: Pavel Tatashin <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oscar Salvador <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/page_alloc.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a41bd7341de1..a92791512077 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5905,18 +5905,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
}
#endif
- for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+ for (pfn = start_pfn; pfn < end_pfn; ) {
/*
* There can be holes in boot-time mem_map[]s handed to this
* function. They do not exist on hotplugged memory.
*/
if (context == MEMMAP_EARLY) {
if (!early_pfn_valid(pfn)) {
- pfn = next_pfn(pfn) - 1;
+ pfn = next_pfn(pfn);
continue;
}
- if (!early_pfn_in_nid(pfn, nid))
+ if (!early_pfn_in_nid(pfn, nid)) {
+ pfn++;
continue;
+ }
if (overlap_memmap_init(zone, &pfn))
continue;
if (defer_init(nid, pfn, end_pfn))
@@ -5944,6 +5946,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
cond_resched();
}
+ pfn++;
}
}
--
2.24.1
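The off-by-one described in this patch can be reproduced in a small userspace model. This is only a sketch under stated assumptions: overlap() is a simplified stand-in for overlap_memmap_init(), and NR_PFNS, MIRROR_START, MIRROR_END, init_old() and init_new() are hypothetical names chosen for the example.

```c
#include <stdbool.h>
#include <string.h>

#define NR_PFNS      16UL
/* Hypothetical mirrored region [MIRROR_START, MIRROR_END) to be skipped. */
#define MIRROR_START 4UL
#define MIRROR_END   8UL

/* Simplified stand-in for overlap_memmap_init(): jump to the region end. */
static bool overlap(unsigned long *pfn)
{
	if (*pfn >= MIRROR_START && *pfn < MIRROR_END) {
		*pfn = MIRROR_END; /* first pfn after the region */
		return true;
	}
	return false;
}

/* Loop shape before the patch: the for statement always increments. */
static void init_old(bool *initialized)
{
	unsigned long pfn;

	memset(initialized, 0, NR_PFNS * sizeof(*initialized));
	for (pfn = 0; pfn < NR_PFNS; pfn++) {
		if (overlap(&pfn))
			continue; /* pfn++ still runs, so MIRROR_END is skipped */
		initialized[pfn] = true;
	}
}

/* Loop shape after the patch: every path advances pfn explicitly. */
static void init_new(bool *initialized)
{
	unsigned long pfn;

	memset(initialized, 0, NR_PFNS * sizeof(*initialized));
	for (pfn = 0; pfn < NR_PFNS; ) {
		if (overlap(&pfn))
			continue; /* pfn already points at the next pfn to process */
		initialized[pfn] = true;
		pfn++;
	}
}
```

In this model init_old() leaves pfn 8 (the first pfn after the mirrored region) uninitialized, while init_new() initializes it; both skip the region itself.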
On Mon, Jan 13, 2020 at 03:40:35PM +0100, David Hildenbrand wrote:
> Let's move it to the header and use the shorter variant from
> mm/page_alloc.c (the original one will also check
> "__highest_present_section_nr + 1", which is not necessary). While at it,
> make the section_nr in next_pfn() const.
>
> In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once
> we exceed __highest_present_section_nr, which doesn't make a difference in
> the caller as it is big enough (>= all sane end_pfn).
>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> include/linux/mmzone.h | 10 ++++++++++
> mm/page_alloc.c | 11 ++---------
> mm/sparse.c | 10 ----------
> 3 files changed, 12 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c2bc309d1634..462f6873905a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1379,6 +1379,16 @@ static inline int pfn_present(unsigned long pfn)
> return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
> }
>
> +static inline unsigned long next_present_section_nr(unsigned long section_nr)
> +{
> + while (++section_nr <= __highest_present_section_nr) {
> + if (present_section_nr(section_nr))
> + return section_nr;
> + }
> +
> + return -1;
> +}
> +
> /*
> * These are _only_ used during initialisation, therefore they
> * can use __initdata ... They could have names to indicate
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a92791512077..26e8044e9848 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5852,18 +5852,11 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
> /* Skip PFNs that belong to non-present sections */
> static inline __meminit unsigned long next_pfn(unsigned long pfn)
> {
> - unsigned long section_nr;
> + const unsigned long section_nr = pfn_to_section_nr(++pfn);
>
> - section_nr = pfn_to_section_nr(++pfn);
> if (present_section_nr(section_nr))
> return pfn;
> -
> - while (++section_nr <= __highest_present_section_nr) {
> - if (present_section_nr(section_nr))
> - return section_nr_to_pfn(section_nr);
> - }
> -
> - return -1;
> + return section_nr_to_pfn(next_present_section_nr(section_nr));
This changes behaviour in a corner case: if next_present_section_nr()
returns -1, we call section_nr_to_pfn() on it. It's unlikely that this would
give a valid pfn, but I can't say for sure for all archs. I guess the worst
case scenario would be an endless loop over the same sections/pfns.
Have you considered this case?
--
Kirill A. Shutemov
> Am 13.01.2020 um 23:41 schrieb Kirill A. Shutemov <[email protected]>:
>
> On Mon, Jan 13, 2020 at 03:40:35PM +0100, David Hildenbrand wrote:
>> Let's move it to the header and use the shorter variant from
>> mm/page_alloc.c (the original one will also check
>> "__highest_present_section_nr + 1", which is not necessary). While at it,
>> make the section_nr in next_pfn() const.
>>
>> In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once
>> we exceed __highest_present_section_nr, which doesn't make a difference in
>> the caller as it is big enough (>= all sane end_pfn).
>>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Kirill A. Shutemov <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> include/linux/mmzone.h | 10 ++++++++++
>> mm/page_alloc.c | 11 ++---------
>> mm/sparse.c | 10 ----------
>> 3 files changed, 12 insertions(+), 19 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index c2bc309d1634..462f6873905a 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1379,6 +1379,16 @@ static inline int pfn_present(unsigned long pfn)
>> return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
>> }
>>
>> +static inline unsigned long next_present_section_nr(unsigned long section_nr)
>> +{
>> + while (++section_nr <= __highest_present_section_nr) {
>> + if (present_section_nr(section_nr))
>> + return section_nr;
>> + }
>> +
>> + return -1;
>> +}
>> +
>> /*
>> * These are _only_ used during initialisation, therefore they
>> * can use __initdata ... They could have names to indicate
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index a92791512077..26e8044e9848 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5852,18 +5852,11 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>> /* Skip PFNs that belong to non-present sections */
>> static inline __meminit unsigned long next_pfn(unsigned long pfn)
>> {
>> - unsigned long section_nr;
>> + const unsigned long section_nr = pfn_to_section_nr(++pfn);
>>
>> - section_nr = pfn_to_section_nr(++pfn);
>> if (present_section_nr(section_nr))
>> return pfn;
>> -
>> - while (++section_nr <= __highest_present_section_nr) {
>> - if (present_section_nr(section_nr))
>> - return section_nr_to_pfn(section_nr);
>> - }
>> -
>> - return -1;
>> + return section_nr_to_pfn(next_present_section_nr(section_nr));
>
> This changes behaviour in the corner case: if next_present_section_nr()
> returns -1, we call section_nr_to_pfn() for it. It's unlikely would give
> any valid pfn, but I can't say for sure for all archs. I guess the worst
> case scenrio would be endless loop over the same secitons/pfns.
>
> Have you considered the case?
Yes, see the patch description. We return -1 << PFN_SECTION_SHIFT, so a number close to the end of the address space (0xfff...000). (Will double check tomorrow if any 32bit arch could be problematic here)
Thanks!
>
> --
> Kirill A. Shutemov
>
> Am 13.01.2020 um 23:57 schrieb David Hildenbrand <[email protected]>:
>
>
>
>>> Am 13.01.2020 um 23:41 schrieb Kirill A. Shutemov <[email protected]>:
>>>
>>> On Mon, Jan 13, 2020 at 03:40:35PM +0100, David Hildenbrand wrote:
>>> Let's move it to the header and use the shorter variant from
>>> mm/page_alloc.c (the original one will also check
>>> "__highest_present_section_nr + 1", which is not necessary). While at it,
>>> make the section_nr in next_pfn() const.
>>>
>>> In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once
>>> we exceed __highest_present_section_nr, which doesn't make a difference in
>>> the caller as it is big enough (>= all sane end_pfn).
>>>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: Michal Hocko <[email protected]>
>>> Cc: Oscar Salvador <[email protected]>
>>> Cc: Kirill A. Shutemov <[email protected]>
>>> Signed-off-by: David Hildenbrand <[email protected]>
>>> ---
>>> include/linux/mmzone.h | 10 ++++++++++
>>> mm/page_alloc.c | 11 ++---------
>>> mm/sparse.c | 10 ----------
>>> 3 files changed, 12 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index c2bc309d1634..462f6873905a 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -1379,6 +1379,16 @@ static inline int pfn_present(unsigned long pfn)
>>> return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
>>> }
>>>
>>> +static inline unsigned long next_present_section_nr(unsigned long section_nr)
>>> +{
>>> + while (++section_nr <= __highest_present_section_nr) {
>>> + if (present_section_nr(section_nr))
>>> + return section_nr;
>>> + }
>>> +
>>> + return -1;
>>> +}
>>> +
>>> /*
>>> * These are _only_ used during initialisation, therefore they
>>> * can use __initdata ... They could have names to indicate
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index a92791512077..26e8044e9848 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5852,18 +5852,11 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>>> /* Skip PFNs that belong to non-present sections */
>>> static inline __meminit unsigned long next_pfn(unsigned long pfn)
>>> {
>>> - unsigned long section_nr;
>>> + const unsigned long section_nr = pfn_to_section_nr(++pfn);
>>>
>>> - section_nr = pfn_to_section_nr(++pfn);
>>> if (present_section_nr(section_nr))
>>> return pfn;
>>> -
>>> - while (++section_nr <= __highest_present_section_nr) {
>>> - if (present_section_nr(section_nr))
>>> - return section_nr_to_pfn(section_nr);
>>> - }
>>> -
>>> - return -1;
>>> + return section_nr_to_pfn(next_present_section_nr(section_nr));
>>
>> This changes behaviour in the corner case: if next_present_section_nr()
>> returns -1, we call section_nr_to_pfn() for it. It's unlikely would give
>> any valid pfn, but I can't say for sure for all archs. I guess the worst
>> case scenrio would be endless loop over the same secitons/pfns.
>>
>> Have you considered the case?
>
> Yes, see the patch description. We return -1 << PFN_SECTION_SHIFT, so a number close to the end of the address space (0xfff...000). (Will double check tomorrow if any 32bit arch could be problematic here)
... but thinking again, 0xfff... is certainly an invalid PFN, so this should work just fine.
(biggest possible pfn is -1 >> PFN_SHIFT)
But it's late in Germany, will double check tomorrow :)
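For reference, the arithmetic in this reply can be checked with a small userspace sketch. It assumes a 64-bit unsigned long (modelled as uint64_t), 4 KiB pages and 128 MiB sections, which gives PFN_SECTION_SHIFT = 15; other configurations use other values. sentinel_pfn() and max_addressable_pfn() are names made up for the example.

```c
#include <stdint.h>

/* Assumed values: 4 KiB pages, 128 MiB sections. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 27
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

/* uint64_t models a 64-bit unsigned long. */
static uint64_t section_nr_to_pfn(uint64_t section_nr)
{
	return section_nr << PFN_SECTION_SHIFT;
}

/* What next_pfn() yields once no present section is left. */
static uint64_t sentinel_pfn(void)
{
	return section_nr_to_pfn(UINT64_MAX); /* (unsigned long)-1 */
}

/* Largest pfn whose physical address still fits into 64 bits. */
static uint64_t max_addressable_pfn(void)
{
	return UINT64_MAX >> PAGE_SHIFT;
}
```

Under these assumptions the sentinel is 0xffffffffffff8000, far above the largest pfn that a 64-bit physical address can describe (0x000fffffffffffff).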
On Tue, Jan 14, 2020 at 12:02:00AM +0100, David Hildenbrand wrote:
>
>
> > Am 13.01.2020 um 23:57 schrieb David Hildenbrand <[email protected]>:
> >
> >
> >
> >>> Am 13.01.2020 um 23:41 schrieb Kirill A. Shutemov <[email protected]>:
> >>>
> >>> On Mon, Jan 13, 2020 at 03:40:35PM +0100, David Hildenbrand wrote:
> >>> Let's move it to the header and use the shorter variant from
> >>> mm/page_alloc.c (the original one will also check
> >>> "__highest_present_section_nr + 1", which is not necessary). While at it,
> >>> make the section_nr in next_pfn() const.
> >>>
> >>> In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once
> >>> we exceed __highest_present_section_nr, which doesn't make a difference in
> >>> the caller as it is big enough (>= all sane end_pfn).
> >>>
> >>> Cc: Andrew Morton <[email protected]>
> >>> Cc: Michal Hocko <[email protected]>
> >>> Cc: Oscar Salvador <[email protected]>
> >>> Cc: Kirill A. Shutemov <[email protected]>
> >>> Signed-off-by: David Hildenbrand <[email protected]>
> >>> ---
> >>> include/linux/mmzone.h | 10 ++++++++++
> >>> mm/page_alloc.c | 11 ++---------
> >>> mm/sparse.c | 10 ----------
> >>> 3 files changed, 12 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> >>> index c2bc309d1634..462f6873905a 100644
> >>> --- a/include/linux/mmzone.h
> >>> +++ b/include/linux/mmzone.h
> >>> @@ -1379,6 +1379,16 @@ static inline int pfn_present(unsigned long pfn)
> >>> return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
> >>> }
> >>>
> >>> +static inline unsigned long next_present_section_nr(unsigned long section_nr)
> >>> +{
> >>> + while (++section_nr <= __highest_present_section_nr) {
> >>> + if (present_section_nr(section_nr))
> >>> + return section_nr;
> >>> + }
> >>> +
> >>> + return -1;
> >>> +}
> >>> +
> >>> /*
> >>> * These are _only_ used during initialisation, therefore they
> >>> * can use __initdata ... They could have names to indicate
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index a92791512077..26e8044e9848 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -5852,18 +5852,11 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
> >>> /* Skip PFNs that belong to non-present sections */
> >>> static inline __meminit unsigned long next_pfn(unsigned long pfn)
> >>> {
> >>> - unsigned long section_nr;
> >>> + const unsigned long section_nr = pfn_to_section_nr(++pfn);
> >>>
> >>> - section_nr = pfn_to_section_nr(++pfn);
> >>> if (present_section_nr(section_nr))
> >>> return pfn;
> >>> -
> >>> - while (++section_nr <= __highest_present_section_nr) {
> >>> - if (present_section_nr(section_nr))
> >>> - return section_nr_to_pfn(section_nr);
> >>> - }
> >>> -
> >>> - return -1;
> >>> + return section_nr_to_pfn(next_present_section_nr(section_nr));
> >>
> >> This changes behaviour in the corner case: if next_present_section_nr()
> >> returns -1, we call section_nr_to_pfn() for it. It's unlikely would give
> >> any valid pfn, but I can't say for sure for all archs. I guess the worst
> >> case scenrio would be endless loop over the same secitons/pfns.
> >>
> >> Have you considered the case?
> >
> > Yes, see the patch description. We return -1 << PFN_SECTION_SHIFT, so a number close to the end of the address space (0xfff...000). (Will double check tomorrow if any 32bit arch could be problematic here)
>
> ... but thinking again, 0xfff... is certainly an invalid PFN, so this should work just fine.
>
> (biggest possible pfn is -1 >> PFN_SHIFT)
>
> But it‘s late in Germany, will double check tomorrow :)
If end_pfn happens to be more than -1UL << PFN_SECTION_SHIFT we are
screwed: the pfn is invalid, next_present_section_nr() returns -1, the
next iteration is on the same pfn and we have an endless loop.
The question is whether we can prove end_pfn is always less than
-1UL << PFN_SECTION_SHIFT in any configuration of any arch.
It is not obvious to me.
--
Kirill A. Shutemov
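The no-progress scenario raised here can be modelled in userspace. This is a hedged sketch, not the kernel loop: it assumes a 32-bit unsigned long (modelled as uint32_t), PFN_SECTION_SHIFT = 15, and that every pfn visited is invalid with no present section after it; next_pfn_past_end() and loop_terminates() are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

#define PFN_SECTION_SHIFT 15 /* assumed: 128 MiB sections, 4 KiB pages */

/*
 * next_pfn() as in the patch, reduced to the case where the section of the
 * pfn is not present and no present section follows: the result is always
 * section_nr_to_pfn(-1), regardless of the input pfn.
 */
static uint32_t next_pfn_past_end(uint32_t pfn)
{
	(void)pfn;
	return (uint32_t)(UINT32_MAX << PFN_SECTION_SHIFT);
}

/* Returns true if the loop ends within max_iter iterations. */
static bool loop_terminates(uint32_t start_pfn, uint32_t end_pfn,
			    unsigned int max_iter)
{
	uint32_t pfn = start_pfn;
	unsigned int i;

	for (i = 0; pfn < end_pfn; i++) {
		if (i == max_iter)
			return false; /* no progress: would spin forever */
		pfn = next_pfn_past_end(pfn);
	}
	return true;
}
```

In this model the loop ends as long as end_pfn is at most 0xffff8000; once end_pfn exceeds that value and the loop reaches the last section, pfn is reset to 0xffff8000 on every iteration and never reaches end_pfn.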
On 14.01.20 11:41, Kirill A. Shutemov wrote:
> On Tue, Jan 14, 2020 at 12:02:00AM +0100, David Hildenbrand wrote:
>>
>>
>>> Am 13.01.2020 um 23:57 schrieb David Hildenbrand <[email protected]>:
>>>
>>>
>>>
>>>>> Am 13.01.2020 um 23:41 schrieb Kirill A. Shutemov <[email protected]>:
>>>>>
>>>>> On Mon, Jan 13, 2020 at 03:40:35PM +0100, David Hildenbrand wrote:
>>>>> Let's move it to the header and use the shorter variant from
>>>>> mm/page_alloc.c (the original one will also check
>>>>> "__highest_present_section_nr + 1", which is not necessary). While at it,
>>>>> make the section_nr in next_pfn() const.
>>>>>
>>>>> In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once
>>>>> we exceed __highest_present_section_nr, which doesn't make a difference in
>>>>> the caller as it is big enough (>= all sane end_pfn).
>>>>>
>>>>> Cc: Andrew Morton <[email protected]>
>>>>> Cc: Michal Hocko <[email protected]>
>>>>> Cc: Oscar Salvador <[email protected]>
>>>>> Cc: Kirill A. Shutemov <[email protected]>
>>>>> Signed-off-by: David Hildenbrand <[email protected]>
>>>>> ---
>>>>> include/linux/mmzone.h | 10 ++++++++++
>>>>> mm/page_alloc.c | 11 ++---------
>>>>> mm/sparse.c | 10 ----------
>>>>> 3 files changed, 12 insertions(+), 19 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>>>> index c2bc309d1634..462f6873905a 100644
>>>>> --- a/include/linux/mmzone.h
>>>>> +++ b/include/linux/mmzone.h
>>>>> @@ -1379,6 +1379,16 @@ static inline int pfn_present(unsigned long pfn)
>>>>> return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
>>>>> }
>>>>>
>>>>> +static inline unsigned long next_present_section_nr(unsigned long section_nr)
>>>>> +{
>>>>> + while (++section_nr <= __highest_present_section_nr) {
>>>>> + if (present_section_nr(section_nr))
>>>>> + return section_nr;
>>>>> + }
>>>>> +
>>>>> + return -1;
>>>>> +}
>>>>> +
>>>>> /*
>>>>> * These are _only_ used during initialisation, therefore they
>>>>> * can use __initdata ... They could have names to indicate
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index a92791512077..26e8044e9848 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -5852,18 +5852,11 @@ overlap_memmap_init(unsigned long zone, unsigned long *pfn)
>>>>> /* Skip PFNs that belong to non-present sections */
>>>>> static inline __meminit unsigned long next_pfn(unsigned long pfn)
>>>>> {
>>>>> - unsigned long section_nr;
>>>>> + const unsigned long section_nr = pfn_to_section_nr(++pfn);
>>>>>
>>>>> - section_nr = pfn_to_section_nr(++pfn);
>>>>> if (present_section_nr(section_nr))
>>>>> return pfn;
>>>>> -
>>>>> - while (++section_nr <= __highest_present_section_nr) {
>>>>> - if (present_section_nr(section_nr))
>>>>> - return section_nr_to_pfn(section_nr);
>>>>> - }
>>>>> -
>>>>> - return -1;
>>>>> + return section_nr_to_pfn(next_present_section_nr(section_nr));
>>>>
>>>> This changes behaviour in the corner case: if next_present_section_nr()
>>>> returns -1, we call section_nr_to_pfn() for it. It's unlikely would give
>>>> any valid pfn, but I can't say for sure for all archs. I guess the worst
>>>> case scenrio would be endless loop over the same secitons/pfns.
>>>>
>>>> Have you considered the case?
>>>
>>> Yes, see the patch description. We return -1 << PFN_SECTION_SHIFT, so a number close to the end of the address space (0xfff...000). (Will double check tomorrow if any 32bit arch could be problematic here)
>>
>> ... but thinking again, 0xfff... is certainly an invalid PFN, so this should work just fine.
>>
>> (biggest possible pfn is -1 >> PFN_SHIFT)
>>
>> But it‘s late in Germany, will double check tomorrow :)
>
> If the end_pfn happens the be more than -1UL << PFN_SECTION_SHIFT we are
> screwed: the pfn is invalid, next_present_section_nr() returns -1, the
> next iterartion is on the same pfn and we have endless loop.
>
> The question is whether we can prove end_pfn is always less than
> -1UL << PFN_SECTION_SHIFT in any configuration of any arch.
>
> It is not obvious for me.
memmap_init_zone() is called for a physical memory region: pfn + size
(nr_pages)
The highest possible PFN you can have is "-1(unsigned long) >>
PFN_SHIFT". So even if you would want to add the very last section, the
PFN would still be smaller than -1UL << PFN_SECTION_SHIFT.
--
Thanks,
David / dhildenb
On Tue, Jan 14, 2020 at 11:49:19AM +0100, David Hildenbrand wrote:
> memmap_init_zone() is called for a physical memory region: pfn + size
> (nr_pages)
>
> The highest possible PFN you can have is "-1(unsigned long) >>
> PFN_SHIFT". So even if you would want to add the very last section, the
> PFN would still be smaller than -1UL << PFN_SECTION_SHIFT.
PFN_SHIFT? I guess you mean PAGE_SHIFT.
Of course PFN can be more than -1UL >> PAGE_SHIFT. Like on 32-bit x86 with
PAE it is ((1ULL << 36) - 1) >> PAGE_SHIFT. That's the whole reason for
PAE.
The highest possible PFN must fit into phys_addr_t when shifted left by
PAGE_SHIFT and must fit into unsigned long. It can be -1UL if
phys_addr_t is 64-bit.
Any other limitation I am missing?
--
Kirill A. Shutemov
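The PAE numbers in this reply work out as follows. A minimal sketch, assuming PAGE_SHIFT = 12 and a 32-bit unsigned long modelled as uint32_t; the function names are made up for the example.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumed: 4 KiB pages */

/* Highest pfn if physical addresses had to fit a 32-bit unsigned long. */
static uint32_t max_pfn_32bit_addr(void)
{
	return UINT32_MAX >> PAGE_SHIFT;
}

/* Highest pfn with 36-bit physical addresses, as with x86 PAE. */
static uint64_t max_pfn_pae(void)
{
	return ((1ULL << 36) - 1) >> PAGE_SHIFT;
}
```

With these values the PAE limit (0xffffff) is 16 times larger than -1UL >> PAGE_SHIFT on 32-bit (0xfffff), yet still fits comfortably into a 32-bit unsigned long.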
On 14.01.20 16:52, Kirill A. Shutemov wrote:
> On Tue, Jan 14, 2020 at 11:49:19AM +0100, David Hildenbrand wrote:
>> memmap_init_zone() is called for a physical memory region: pfn + size
>> (nr_pages)
>>
>> The highest possible PFN you can have is "-1(unsigned long) >>
>> PFN_SHIFT". So even if you would want to add the very last section, the
>> PFN would still be smaller than -1UL << PFN_SECTION_SHIFT.
>
> PFN_SHIFT? I guess you mean PAGE_SHIFT.
Yes :)
>
> Of course PFN can be more than -1UL >> PAGE_SHIFT. Like on 32-bit x86 with
> PAE it is ((1ULL << 36) - 1) >> PAGE_SHIFT. That's the whole reason for
> PAE.
You are right about PAE, but I think you agree that it is a special case.
>
> The highest possible PFN must fit into phys_addr_t when shifted left by
> PAGE_SHIFT and must fit into unsigned long. It's can be -1UL if
> phys_addr_t is 64-bit.
>
Right, and for 32-bit, that would mean (assuming something like a 12-bit
PAGE_SHIFT) that if you have a PFN of -1 (0xffffffff), the biggest possible
address is 0xfffffffffff (44-bit). In that case, the existing code would
already break because "end_pfn" (which is actually +1, pointing after the one
to initialize) would overflow to 0 and you would have an endless loop
in memmap_init_zone().
Now, after this change you not only get an endless loop when trying to
init the very last PFN, but also when trying to init a PFN in the very last
section (section_nr = -1, e.g., the last 128MB).
I don't think there is any sane use case where you initialize something
partially in the last section that is possible with any hardware address
extension mechanism.
--
Thanks,
David / dhildenb
On 14.01.20 17:50, David Hildenbrand wrote:
> On 14.01.20 16:52, Kirill A. Shutemov wrote:
>> On Tue, Jan 14, 2020 at 11:49:19AM +0100, David Hildenbrand wrote:
>>> memmap_init_zone() is called for a physical memory region: pfn + size
>>> (nr_pages)
>>>
>>> The highest possible PFN you can have is "-1(unsigned long) >>
>>> PFN_SHIFT". So even if you would want to add the very last section, the
>>> PFN would still be smaller than -1UL << PFN_SECTION_SHIFT.
>>
>> PFN_SHIFT? I guess you mean PAGE_SHIFT.
>
> Yes :)
>
>>
>> Of course PFN can be more than -1UL >> PAGE_SHIFT. Like on 32-bit x86 with
>> PAE it is ((1ULL << 36) - 1) >> PAGE_SHIFT. That's the whole reason for
>> PAE.
>
> You are right about PAE, but I think you agree that is is a special case.
>
>>
>> The highest possible PFN must fit into phys_addr_t when shifted left by
>> PAGE_SHIFT and must fit into unsigned long. It's can be -1UL if
>> phys_addr_t is 64-bit.
>>
>
> Right, and for 32bit, that would mean (assuming something like 12bit
> PAGE_SHIFT) if you have -1 (0xffffffff) that the biggest possible
> address is 0xfffffffffff (44bit). In that case, the existing code would
> already break because "end_pfn" (is actually +1, pointing after the one
> to initialize), would overflow to 0 and you would have an endless loop
> in memmap_init_zone().
Correction: If end_pfn overflows to 0, you would get no loop iteration
at all.
--
Thanks,
David / dhildenb
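The 44-bit figure and the corrected overflow behaviour can both be checked with a short userspace sketch. It assumes a 32-bit unsigned long (modelled as uint32_t) and PAGE_SHIFT = 12; last_byte_of_pfn() and count_iterations() are hypothetical helpers, and the loop is reduced to its bare iteration structure.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumed: 4 KiB pages */

/* Highest byte address covered by a pfn (needs a 64-bit phys_addr_t). */
static uint64_t last_byte_of_pfn(uint32_t pfn)
{
	return ((uint64_t)pfn << PAGE_SHIFT) | ((1ULL << PAGE_SHIFT) - 1);
}

/*
 * Number of iterations of a "pfn < end_pfn" loop when end_pfn is computed
 * as start_pfn + nr_pages in 32 bits, so that it may wrap around to 0.
 */
static uint32_t count_iterations(uint32_t start_pfn, uint32_t nr_pages)
{
	const uint32_t end_pfn = start_pfn + nr_pages; /* may wrap to 0 */
	uint32_t pfn, n = 0;

	for (pfn = start_pfn; pfn < end_pfn; pfn++)
		n++;
	return n;
}
```

When the range ends exactly at the top of the pfn space, end_pfn wraps to 0 and the loop body never runs, which matches the correction: no iteration at all rather than an endless loop.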
On Mon, 13 Jan 2020 15:40:33 +0100 David Hildenbrand <[email protected]> wrote:
> Two cleanups for "[PATCH] mm/page_alloc: Skip non present sections on zone
> initialization" [1], whereby one cleanup seems to also be a fix for a
> (theoretial?) kernelcore=mirror case - unless I am messing something up :)
>
I'm not seeing any acks or reviewed-by's on these two?
On Thu, Jan 30, 2020 at 08:30:59PM -0800, Andrew Morton wrote:
> On Mon, 13 Jan 2020 15:40:33 +0100 David Hildenbrand <[email protected]> wrote:
>
> > Two cleanups for "[PATCH] mm/page_alloc: Skip non present sections on zone
> > initialization" [1], whereby one cleanup seems to also be a fix for a
> > (theoretial?) kernelcore=mirror case - unless I am messing something up :)
> >
>
> I'm not seeing any acks or reviewed-by's on these two?
You can use mine:
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Mon, Jan 13, 2020 at 6:40 AM David Hildenbrand <[email protected]> wrote:
>
> Let's update the pfn manually whenever we continue the loop. This makes
> the code easier to read but also less error prone (and we can directly
> fix one issue).
>
> When overlap_memmap_init() returns true, pfn is updated to
> "memblock_region_memory_end_pfn(r)". So it already points at the *next*
> pfn to process. Incrementing the pfn another time is wrong, we might
> leave one uninitialized. I spotted this by inspecting the code, so I have
> no idea if this is relevant in practise (with kernelcore=mirror).
>
> Fixes: a9a9e77fbf27 ("mm: move mirrored memory specific code outside of memmap_init_zone")
> Cc: Pavel Tatashin <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Oscar Salvador <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/page_alloc.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a41bd7341de1..a92791512077 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5905,18 +5905,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> }
> #endif
>
> - for (pfn = start_pfn; pfn < end_pfn; pfn++) {
> + for (pfn = start_pfn; pfn < end_pfn; ) {
> /*
> * There can be holes in boot-time mem_map[]s handed to this
> * function. They do not exist on hotplugged memory.
> */
> if (context == MEMMAP_EARLY) {
> if (!early_pfn_valid(pfn)) {
> - pfn = next_pfn(pfn) - 1;
> + pfn = next_pfn(pfn);
> continue;
> }
> - if (!early_pfn_in_nid(pfn, nid))
> + if (!early_pfn_in_nid(pfn, nid)) {
> + pfn++;
> continue;
> + }
> if (overlap_memmap_init(zone, &pfn))
> continue;
> if (defer_init(nid, pfn, end_pfn))
I'm pretty sure this is a bit broken. The overlap_memmap_init is going
to return memblock_region_memory_end_pfn instead of the start of the
next region. I think that is going to stick you in a mirrored region
without advancing in that case. You would also need to have that case
do a pfn++ before the continue;
> @@ -5944,6 +5946,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
> set_pageblock_migratetype(page, MIGRATE_MOVABLE);
> cond_resched();
> }
> + pfn++;
> }
> }
>
> --
> 2.24.1
>
>
> Am 03.02.2020 um 22:35 schrieb Alexander Duyck <[email protected]>:
>
> On Mon, Jan 13, 2020 at 6:40 AM David Hildenbrand <[email protected]> wrote:
>>
>> Let's update the pfn manually whenever we continue the loop. This makes
>> the code easier to read but also less error prone (and we can directly
>> fix one issue).
>>
>> When overlap_memmap_init() returns true, pfn is updated to
>> "memblock_region_memory_end_pfn(r)". So it already points at the *next*
>> pfn to process. Incrementing the pfn another time is wrong, we might
>> leave one uninitialized. I spotted this by inspecting the code, so I have
>> no idea if this is relevant in practise (with kernelcore=mirror).
>>
>> Fixes: a9a9e77fbf27 ("mm: move mirrored memory specific code outside of memmap_init_zone")
>> Cc: Pavel Tatashin <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Oscar Salvador <[email protected]>
>> Cc: Kirill A. Shutemov <[email protected]>
>> Signed-off-by: David Hildenbrand <[email protected]>
>> ---
>> mm/page_alloc.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index a41bd7341de1..a92791512077 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5905,18 +5905,20 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
>> }
>> #endif
>>
>> - for (pfn = start_pfn; pfn < end_pfn; pfn++) {
>> + for (pfn = start_pfn; pfn < end_pfn; ) {
>> /*
>> * There can be holes in boot-time mem_map[]s handed to this
>> * function. They do not exist on hotplugged memory.
>> */
>> if (context == MEMMAP_EARLY) {
>> if (!early_pfn_valid(pfn)) {
>> - pfn = next_pfn(pfn) - 1;
>> + pfn = next_pfn(pfn);
>> continue;
>> }
>> - if (!early_pfn_in_nid(pfn, nid))
>> + if (!early_pfn_in_nid(pfn, nid)) {
>> + pfn++;
>> continue;
>> + }
>> if (overlap_memmap_init(zone, &pfn))
>> continue;
>> if (defer_init(nid, pfn, end_pfn))
>
> I'm pretty sure this is a bit broken. overlap_memmap_init() is going
> to return memblock_region_memory_end_pfn() instead of the start of the
> next region. I think that is going to leave you stuck in a mirrored
> region without advancing in that case. You would also need to have that
> case do a pfn++ before the continue.
Thanks for having a look.
Did you read the description regarding this change?
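
The point made in the patch description can be illustrated with a small userspace sketch. This is not the kernel code: `overlap_skip()`, `SKIP_START`, `SKIP_END`, `count_old()` and `count_new()` are hypothetical stand-ins, and the mirrored range is simply assumed to be pfns 4..7. The sketch only shows how a `continue` inside a `for (...; pfn++)` loop advances past a pfn that the skip helper had already selected as the next one to process.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: pfns in [SKIP_START, SKIP_END) model a mirrored region. */
#define SKIP_START 4UL
#define SKIP_END   8UL

/* Hypothetical stand-in for overlap_memmap_init(): on a hit, *pfn is
 * moved to the end of the region, i.e. the next pfn to process. */
static bool overlap_skip(unsigned long *pfn)
{
	if (*pfn >= SKIP_START && *pfn < SKIP_END) {
		*pfn = SKIP_END;
		return true;
	}
	return false;
}

/* Old style: the loop header increments pfn after every continue. */
static size_t count_old(unsigned long start, unsigned long end, bool *visited)
{
	size_t n = 0;

	assert(visited != NULL);
	for (unsigned long pfn = start; pfn < end; pfn++) {
		if (overlap_skip(&pfn))
			continue;	/* pfn++ still runs, so SKIP_END is skipped */
		visited[pfn] = true;
		n++;
	}
	return n;
}

/* New style: every path through the loop advances pfn explicitly. */
static size_t count_new(unsigned long start, unsigned long end, bool *visited)
{
	size_t n = 0;

	assert(visited != NULL);
	for (unsigned long pfn = start; pfn < end; ) {
		if (overlap_skip(&pfn))
			continue;	/* pfn already points at the next pfn */
		visited[pfn] = true;
		n++;
		pfn++;
	}
	return n;
}
```

Calling both functions on the range 0..11 with a zeroed `bool visited[12]` array, the old-style loop never marks pfn 8, while the new-style loop does.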
On Mon, Feb 3, 2020 at 1:44 PM David Hildenbrand <[email protected]> wrote:
> Thanks for having a look.
>
> Did you read the description regarding this change?
Actually, I hadn't read it all that closely, so my bad on that. The
part that caught my attention, though, was that
memblock_region_memory_end_pfn() uses PFN_DOWN to identify the end of
the memory region. Given that we probably shouldn't be touching any
PFN that may contain part of this memory, it might make more sense
to use memblock_region_reserved_end_pfn(), which uses PFN_UP, so that we
exclude all memory that is in the mirrored region, just in case
something doesn't end on a PFN-aligned boundary.
If we know that the mirrored region is going to always be page size
aligned then I guess you are good to go. That was the only thing I
wasn't sure about.
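
To make the rounding concern concrete, here is a minimal userspace sketch. It assumes 4 KiB pages and uses a hypothetical `struct region` with hypothetical helpers `memory_end_pfn()` and `reserved_end_pfn()`; these only mirror the PFN_DOWN and PFN_UP rounding described above and are not the memblock helpers themselves.

```c
#include <assert.h>

#define PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Hypothetical region descriptor: physical base address and size in bytes. */
struct region {
	unsigned long base;
	unsigned long size;
};

/* Rounds the end down: a partially covered last page is NOT included. */
static unsigned long memory_end_pfn(const struct region *r)
{
	assert(r->base + r->size >= r->base);	/* no address overflow */
	return PFN_DOWN(r->base + r->size);
}

/* Rounds the end up: a partially covered last page IS included. */
static unsigned long reserved_end_pfn(const struct region *r)
{
	assert(r->base + r->size >= r->base);	/* no address overflow */
	return PFN_UP(r->base + r->size);
}
```

For a page-aligned region both helpers return the same end pfn. They differ by exactly one pfn only when the region ends in the middle of a page, which is the case being asked about here.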
Reviewed-by: Alexander Duyck <[email protected]>
On 04.02.20 00:17, Alexander Duyck wrote:
> Actually, I hadn't read it all that closely, so my bad on that. The
> part that caught my attention, though, was that
> memblock_region_memory_end_pfn() uses PFN_DOWN to identify the end of
> the memory region. Given that we probably shouldn't be touching any
> PFN that may contain part of this memory, it might make more sense
> to use memblock_region_reserved_end_pfn(), which uses PFN_UP, so that we
> exclude all memory that is in the mirrored region, just in case
> something doesn't end on a PFN-aligned boundary.
>
> If we know that the mirrored region is going to always be page size
> aligned then I guess you are good to go. That was the only thing I
> wasn't sure about.
I think we can safely assume this for now. But I *think* we are fine
either way:
We are using memblock_region_memory_end_pfn() in all cases I spotted
(especially consistently in overlap_memmap_init()) - so there is never a
mismatch that could result in an endless loop.
Anyhow, having mirrored sub-page regions would be weird either way :)
(just like any zone that would end on sub-pages)
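
The "no endless loop" argument can be sketched in userspace as follows. The names `struct pfn_range`, `skip_mirrored()` and `walk()` are hypothetical, and the model is deliberately simplified: a skip only triggers when the pfn lies strictly below the same end pfn that it is then set to, so every iteration moves forward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pfn range [base_pfn, end_pfn), optionally mirrored. */
struct pfn_range {
	unsigned long base_pfn;
	unsigned long end_pfn;
	bool mirrored;
};

/* If *pfn lies inside a mirrored range, move it to that range's end.
 * The same end_pfn is used for the check and for the update. */
static bool skip_mirrored(const struct pfn_range *r, size_t n,
			  unsigned long *pfn)
{
	for (size_t i = 0; i < n; i++) {
		if (r[i].mirrored &&
		    *pfn >= r[i].base_pfn && *pfn < r[i].end_pfn) {
			*pfn = r[i].end_pfn;
			return true;
		}
	}
	return false;
}

/* Walk [start, end) and return the number of loop iterations. */
static unsigned long walk(const struct pfn_range *r, size_t n,
			  unsigned long start, unsigned long end)
{
	unsigned long steps = 0;

	for (unsigned long pfn = start; pfn < end; steps++) {
		unsigned long before = pfn;

		if (!skip_mirrored(r, n, &pfn))
			pfn++;
		assert(pfn > before);	/* strict progress on every pass */
	}
	return steps;
}
```

If the check and the update used differently rounded end pfns, the condition `*pfn < end_pfn` could hold while the update left pfn unchanged, and that is the mismatch ruled out above.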
>
> Reviewed-by: Alexander Duyck <[email protected]>
>
Thanks!
--
Thanks,
David / dhildenb