The comment is confusing. On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this
would only guarantee that the 2 lowest bits must be zero. On the
other hand, it claims that at least 3 bits are available, and 3 bits
are actually used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious. Let's fix the comment to make it clear how many
bits are available and why.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/mmzone.h | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..7522a6987595 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
- * a little bit of information. There should be at least
- * 3 bits here due to 32-bit alignment.
+ * a little bit of information. The pointer is calculated
+ * as mem_map - section_nr_to_pfn(pnum). The result is
+ * aligned to the minimum alignment of the two values:
+ *   1. All mem_map arrays are page-aligned.
+ *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
+ *      lowest bits. PFN_SECTION_SHIFT is arch-specific
+ *      (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
+ *      worst combination is powerpc with 256k pages,
+ *      which results in PFN_SECTION_SHIFT equal 6.
+ * To sum it up, at least 6 bits are available.
  */
 #define SECTION_MARKED_PRESENT (1UL<<0)
 #define SECTION_HAS_MEM_MAP (1UL<<1)
--
2.13.6
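The arithmetic behind the new comment can be sketched in user space. This is only an illustration under stated assumptions: encoded_free_bits() is a hypothetical helper, not a kernel function, and its parameters are plain numbers standing in for PAGE_SHIFT and SECTION_SIZE_BITS. It returns the lower bound the comment describes, namely the smaller of the page alignment and PFN_SECTION_SHIFT.

```c
#include <assert.h>

/*
 * Lower bound on the number of zero low bits in the encoded value
 * mem_map - section_nr_to_pfn(pnum), following the reasoning in the
 * comment: the minimum of the page alignment (page_shift bits) and
 * PFN_SECTION_SHIFT (section_size_bits - page_shift bits).
 *
 * Hypothetical helper for illustration; both arguments are plain
 * numbers, not the kernel macros.
 */
static unsigned int encoded_free_bits(unsigned int page_shift,
				      unsigned int section_size_bits)
{
	unsigned int pfn_section_shift;

	/* A section must be at least one page in size. */
	assert(section_size_bits >= page_shift);
	pfn_section_shift = section_size_bits - page_shift;

	return page_shift < pfn_section_shift ? page_shift
					      : pfn_section_shift;
}
```

For the powerpc case named in the comment, 256k pages mean a page shift of 18, and a PFN_SECTION_SHIFT of 6 then implies SECTION_SIZE_BITS of 24, so encoded_free_bits(18, 24) yields 6. The bound is conservative: the subtraction is struct page pointer arithmetic, so the PFN term is additionally scaled by sizeof(struct page).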
On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
[...]
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 67f2e3c38939..7522a6987595 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
>
> /*
> * We use the lower bits of the mem_map pointer to store
> - * a little bit of information. There should be at least
> - * 3 bits here due to 32-bit alignment.
> + * a little bit of information. The pointer is calculated
> + * as mem_map - section_nr_to_pfn(pnum). The result is
> + * aligned to the minimum alignment of the two values:
> + * 1. All mem_map arrays are page-aligned.
> + * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> + * lowest bits. PFN_SECTION_SHIFT is arch-specific
> + * (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> + * worst combination is powerpc with 256k pages,
> + * which results in PFN_SECTION_SHIFT equal 6.
> + * To sum it up, at least 6 bits are available.
> */
This is _much_ better indeed. Do you think we can go one step further
and add a BUG_ON to the sparse code to guarantee that every memmap
is indeed aligned properly, so that the bits covered by
SECTION_MAP_LAST_BIT-1 are never used?
Thanks!
> #define SECTION_MARKED_PRESENT (1UL<<0)
> #define SECTION_HAS_MEM_MAP (1UL<<1)
> --
> 2.13.6
--
Michal Hocko
SUSE Labs
On Fri, 19 Jan 2018 13:39:56 +0100
Michal Hocko <[email protected]> wrote:
> On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
> [...]
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 67f2e3c38939..7522a6987595 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
> >
> > /*
> > * We use the lower bits of the mem_map pointer to store
> > - * a little bit of information. There should be at least
> > - * 3 bits here due to 32-bit alignment.
> > + * a little bit of information. The pointer is calculated
> > + * as mem_map - section_nr_to_pfn(pnum). The result is
> > + * aligned to the minimum alignment of the two values:
> > + * 1. All mem_map arrays are page-aligned.
> > + * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> > + * lowest bits. PFN_SECTION_SHIFT is arch-specific
> > + * (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> > + * worst combination is powerpc with 256k pages,
> > + * which results in PFN_SECTION_SHIFT equal 6.
> > + * To sum it up, at least 6 bits are available.
> > */
>
> This is _much_ better indeed. Do you think we can go one step further
> and add a BUG_ON to the sparse code to guarantee that every memmap
> is indeed aligned properly, so that the bits covered by
> SECTION_MAP_LAST_BIT-1 are never used?
This is easy for the section_nr_to_pfn() part. I'd just add:
BUILD_BUG_ON(PFN_SECTION_SHIFT < SECTION_MAP_LAST_BIT);
But for the mem_map arrays... Do you mean adding a run-time BUG_ON into
all allocation paths?
Note that mem_map arrays can be allocated by:
a) __earlyonly_bootmem_alloc
b) memblock_virt_alloc_try_nid
c) memblock_virt_alloc_try_nid_raw
d) alloc_remap (only arch/tile still has it)
Some allocation paths are in mm/sparse.c, others are in
mm/sparse-vmemmap.c, so it becomes a bit messy, but since it's
a single line in each, it may work.
Petr T
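A minimal user-space sketch of the build-time half of this idea follows. The macro values are assumptions chosen for illustration (256k pages, a PFN_SECTION_SHIFT of 6, and three flag bits so that SECTION_MAP_LAST_BIT is 1UL << 3); the kernel defines the real values per architecture. C11 _Static_assert stands in for BUILD_BUG_ON, and low_bits_suffice() is a hypothetical run-time form of the same comparison. Note that PFN_SECTION_SHIFT is a bit count while SECTION_MAP_LAST_BIT is a mask value, so the sketch compares the mask against 1UL << PFN_SECTION_SHIFT rather than against the shift itself.

```c
#include <assert.h>

/* Assumed values for illustration only; the kernel sets these per arch. */
#define PAGE_SHIFT		18		/* 256k pages */
#define SECTION_SIZE_BITS	24		/* gives PFN_SECTION_SHIFT == 6 */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define SECTION_MAP_LAST_BIT	(1UL << 3)	/* assumed: three flag bits */

/* Build-time check: every flag bit must lie below bit PFN_SECTION_SHIFT. */
_Static_assert(SECTION_MAP_LAST_BIT <= (1UL << PFN_SECTION_SHIFT),
	       "not enough low bits for the section flags");

/* Run-time form of the same comparison, usable with arbitrary inputs. */
static int low_bits_suffice(unsigned long last_bit, unsigned int shift)
{
	/* Shifting by the full width of the type would be undefined. */
	assert(shift < 8 * sizeof(unsigned long));

	return last_bit <= (1UL << shift);
}
```

With these assumed values the static assertion holds, because 8 is not greater than 64. A configuration with only two guaranteed low bits would fail the same comparison.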
On Fri 19-01-18 14:21:33, Petr Tesarik wrote:
> On Fri, 19 Jan 2018 13:39:56 +0100
> Michal Hocko <[email protected]> wrote:
>
> > On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
> > [...]
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 67f2e3c38939..7522a6987595 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
> > >
> > > /*
> > > * We use the lower bits of the mem_map pointer to store
> > > - * a little bit of information. There should be at least
> > > - * 3 bits here due to 32-bit alignment.
> > > + * a little bit of information. The pointer is calculated
> > > + * as mem_map - section_nr_to_pfn(pnum). The result is
> > > + * aligned to the minimum alignment of the two values:
> > > + * 1. All mem_map arrays are page-aligned.
> > > + * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> > > + * lowest bits. PFN_SECTION_SHIFT is arch-specific
> > > + * (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> > > + * worst combination is powerpc with 256k pages,
> > > + * which results in PFN_SECTION_SHIFT equal 6.
> > > + * To sum it up, at least 6 bits are available.
> > > */
> >
> > This is _much_ better indeed. Do you think we can go one step further
> > and add a BUG_ON to the sparse code to guarantee that every memmap
> > is indeed aligned properly, so that the bits covered by
> > SECTION_MAP_LAST_BIT-1 are never used?
>
> This is easy for the section_nr_to_pfn() part. I'd just add:
>
> BUILD_BUG_ON(PFN_SECTION_SHIFT < SECTION_MAP_LAST_BIT);
>
> But for the mem_map arrays... Do you mean adding a run-time BUG_ON into
> all allocation paths?
>
> Note that mem_map arrays can be allocated by:
>
> a) __earlyonly_bootmem_alloc
> b) memblock_virt_alloc_try_nid
> c) memblock_virt_alloc_try_nid_raw
> d) alloc_remap (only arch/tile still has it)
>
> Some allocation paths are in mm/sparse.c, others are in
> mm/sparse-vmemmap.c, so it becomes a bit messy, but since it's
> a single line in each, it may work.
Yeah, it is a mess. So I will leave it up to you. I do not want to block
your comment update which is a nice improvement. So with or without the
runtime check feel free to add
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs
On Wed, 24 Jan 2018 13:43:53 +0100
Michal Hocko <[email protected]> wrote:
> On Fri 19-01-18 14:21:33, Petr Tesarik wrote:
> > On Fri, 19 Jan 2018 13:39:56 +0100
> > Michal Hocko <[email protected]> wrote:
> >
> > > On Fri 19-01-18 08:09:08, Petr Tesarik wrote:
> > > [...]
> > > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > > index 67f2e3c38939..7522a6987595 100644
> > > > --- a/include/linux/mmzone.h
> > > > +++ b/include/linux/mmzone.h
> > > > @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
> > > >
> > > > /*
> > > > * We use the lower bits of the mem_map pointer to store
> > > > - * a little bit of information. There should be at least
> > > > - * 3 bits here due to 32-bit alignment.
> > > > + * a little bit of information. The pointer is calculated
> > > > + * as mem_map - section_nr_to_pfn(pnum). The result is
> > > > + * aligned to the minimum alignment of the two values:
> > > > + * 1. All mem_map arrays are page-aligned.
> > > > + * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> > > > + * lowest bits. PFN_SECTION_SHIFT is arch-specific
> > > > + * (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> > > > + * worst combination is powerpc with 256k pages,
> > > > + * which results in PFN_SECTION_SHIFT equal 6.
> > > > + * To sum it up, at least 6 bits are available.
> > > > */
> > >
> > > This is _much_ better indeed. Do you think we can go one step further
> > > and add a BUG_ON to the sparse code to guarantee that every memmap
> > > is indeed aligned properly, so that the bits covered by
> > > SECTION_MAP_LAST_BIT-1 are never used?
> >
> > This is easy for the section_nr_to_pfn() part. I'd just add:
> >
> > BUILD_BUG_ON(PFN_SECTION_SHIFT < SECTION_MAP_LAST_BIT);
> >
> > But for the mem_map arrays... Do you mean adding a run-time BUG_ON into
> > all allocation paths?
> >
> > Note that mem_map arrays can be allocated by:
> >
> > a) __earlyonly_bootmem_alloc
> > b) memblock_virt_alloc_try_nid
> > c) memblock_virt_alloc_try_nid_raw
> > d) alloc_remap (only arch/tile still has it)
> >
> > Some allocation paths are in mm/sparse.c, others are in
> > mm/sparse-vmemmap.c, so it becomes a bit messy, but since it's
> > a single line in each, it may work.
>
> Yeah, it is a mess. So I will leave it up to you. I do not want to block
> your comment update which is a nice improvement. So with or without the
> runtime check feel free to add
Hell, since I have already taken the time to review all the allocation
paths, I can just also add those BUG_ONs. I was just curious if you had
a better idea than spraying them all around the place, but it seems you
don't. ;-)
In short, stay tuned for v2, which is now WIP.
Petr T
The comment is confusing. On the one hand, it refers to 32-bit
alignment (struct page alignment on 32-bit platforms), but this
would only guarantee that the 2 lowest bits must be zero. On the
other hand, it claims that at least 3 bits are available, and 3 bits
are actually used.

This is not broken, because there is a stronger alignment guarantee,
just less obvious. Let's fix the comment to make it clear how many
bits are available and why.

Although memmap arrays are allocated in various places, the
resulting pointer is encoded eventually, so I am adding a BUG_ON()
here to enforce at runtime that all expected bits are indeed
available.

I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
sufficient, because this part of the calculation can be easily
checked at build time.

Signed-off-by: Petr Tesarik <[email protected]>
---
include/linux/mmzone.h | 12 ++++++++++--
mm/sparse.c | 6 +++++-
2 files changed, 15 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 67f2e3c38939..7522a6987595 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
 
 /*
  * We use the lower bits of the mem_map pointer to store
- * a little bit of information. There should be at least
- * 3 bits here due to 32-bit alignment.
+ * a little bit of information. The pointer is calculated
+ * as mem_map - section_nr_to_pfn(pnum). The result is
+ * aligned to the minimum alignment of the two values:
+ *   1. All mem_map arrays are page-aligned.
+ *   2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
+ *      lowest bits. PFN_SECTION_SHIFT is arch-specific
+ *      (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
+ *      worst combination is powerpc with 256k pages,
+ *      which results in PFN_SECTION_SHIFT equal 6.
+ * To sum it up, at least 6 bits are available.
  */
 #define SECTION_MARKED_PRESENT (1UL<<0)
 #define SECTION_HAS_MEM_MAP (1UL<<1)
diff --git a/mm/sparse.c b/mm/sparse.c
index 2609aba121e8..6b8b5e91ceef 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -264,7 +264,11 @@ unsigned long __init node_memmap_size_bytes(int nid, unsigned long start_pfn,
  */
 static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long pnum)
 {
-	return (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
+	unsigned long coded_mem_map =
+		(unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
+	BUILD_BUG_ON(SECTION_MAP_LAST_BIT > (1UL<<PFN_SECTION_SHIFT));
+	BUG_ON(coded_mem_map & ~SECTION_MAP_MASK);
+	return coded_mem_map;
 }
 
 /*
--
2.13.6
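Because every allocation path ends in the encode step, a single check there covers all of them. The following is a minimal user-space model of that check, not the kernel code: encode(), encoding_is_clean() and encode_checked() are hypothetical names, addresses are handled as integers to avoid out-of-bounds pointer arithmetic, page_struct_size stands in for sizeof(struct page), the three-bit flag layout is an assumption, and assert() stands in for BUG_ON().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag layout: the three lowest bits carry section flags. */
#define SECTION_MAP_LAST_BIT	((uintptr_t)1 << 3)
#define SECTION_MAP_MASK	(~(SECTION_MAP_LAST_BIT - 1))

/*
 * Integer model of mem_map - section_nr_to_pfn(pnum). Pointer
 * subtraction is scaled by the element size, so the PFN is multiplied
 * by page_struct_size (a stand-in for sizeof(struct page)).
 */
static uintptr_t encode(uintptr_t mem_map_addr, uintptr_t section_start_pfn,
			size_t page_struct_size)
{
	return mem_map_addr - section_start_pfn * page_struct_size;
}

/* Nonzero if none of the flag bits are set in the encoded value. */
static int encoding_is_clean(uintptr_t coded)
{
	return (coded & ~SECTION_MAP_MASK) == 0;
}

/* Encode and abort if a flag bit is already occupied, like the BUG_ON(). */
static uintptr_t encode_checked(uintptr_t mem_map_addr,
				uintptr_t section_start_pfn,
				size_t page_struct_size)
{
	uintptr_t coded = encode(mem_map_addr, section_start_pfn,
				 page_struct_size);

	assert(encoding_is_clean(coded));
	return coded;
}
```

In this model an aligned array address such as 0x100000 with a section start PFN of 64 and a 64-byte element passes the check, while an address that is off by 4 bytes leaves bit 2 set in the encoded value and would trip the assertion.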
On Thu 25-01-18 10:05:16, Petr Tesarik wrote:
> The comment is confusing. On the one hand, it refers to 32-bit
> alignment (struct page alignment on 32-bit platforms), but this
> would only guarantee that the 2 lowest bits must be zero. On the
> other hand, it claims that at least 3 bits are available, and 3 bits
> are actually used.
>
> This is not broken, because there is a stronger alignment guarantee,
> just less obvious. Let's fix the comment to make it clear how many
> bits are available and why.
>
> Although memmap arrays are allocated in various places, the
> resulting pointer is encoded eventually, so I am adding a BUG_ON()
> here to enforce at runtime that all expected bits are indeed
> available.
>
> I have also added a BUILD_BUG_ON to check that PFN_SECTION_SHIFT is
> sufficient, because this part of the calculation can be easily
> checked at build time.
>
> Signed-off-by: Petr Tesarik <[email protected]>
Thank you. The check is much simpler than I originally thought.
Acked-by: Michal Hocko <[email protected]>
> ---
> include/linux/mmzone.h | 12 ++++++++++--
> mm/sparse.c | 6 +++++-
> 2 files changed, 15 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 67f2e3c38939..7522a6987595 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1166,8 +1166,16 @@ extern unsigned long usemap_size(void);
>
> /*
> * We use the lower bits of the mem_map pointer to store
> - * a little bit of information. There should be at least
> - * 3 bits here due to 32-bit alignment.
> + * a little bit of information. The pointer is calculated
> + * as mem_map - section_nr_to_pfn(pnum). The result is
> + * aligned to the minimum alignment of the two values:
> + * 1. All mem_map arrays are page-aligned.
> + * 2. section_nr_to_pfn() always clears PFN_SECTION_SHIFT
> + * lowest bits. PFN_SECTION_SHIFT is arch-specific
> + * (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the
> + * worst combination is powerpc with 256k pages,
> + * which results in PFN_SECTION_SHIFT equal 6.
> + * To sum it up, at least 6 bits are available.
> */
> #define SECTION_MARKED_PRESENT (1UL<<0)
> #define SECTION_HAS_MEM_MAP (1UL<<1)
> diff --git a/mm/sparse.c b/mm/sparse.c
> index 2609aba121e8..6b8b5e91ceef 100644
> --- a/mm/sparse.c
> +++ b/mm/sparse.c
> @@ -264,7 +264,11 @@ unsigned long __init node_memmap_size_bytes(int nid, unsigned long start_pfn,
> */
> static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long pnum)
> {
> - return (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
> + unsigned long coded_mem_map =
> + (unsigned long)(mem_map - (section_nr_to_pfn(pnum)));
> + BUILD_BUG_ON(SECTION_MAP_LAST_BIT > (1UL<<PFN_SECTION_SHIFT));
> + BUG_ON(coded_mem_map & ~SECTION_MAP_MASK);
> + return coded_mem_map;
> }
>
> /*
> --
> 2.13.6
--
Michal Hocko
SUSE Labs