2018-03-14 18:14:13

by Kani, Toshimitsu

Subject: [PATCH v2 0/2] fix memory leak / panic in ioremap huge pages

On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
may create pud/pmd mappings. Kernel panic was observed on arm64
systems with Cortex-A75 in the following steps as described by
Hanjun Guo. [1]

1. ioremap a 4K size; a valid page table is built.
2. iounmap it; pte0 is set to 0.
3. ioremap the same address with a 2M size; the pgd/pmd is unchanged,
then a new value is set for the pmd.
4. pte0 is leaked.
5. The CPU may take an exception because the old pmd is still in the
TLB, which will lead to a kernel panic.

This panic is not reproducible on x86 because INVLPG, called from
iounmap, purges all levels of entries associated with the purged
address. x86 still has the memory leak, however.

The patch changes the ioremap path to free unmapped page table(s) since
doing so in the unmap path has the following issues:

- The iounmap() path is shared with vunmap(). Since vmap() only
supports pte mappings, making vunmap() free a pte page would add
overhead for regular vmap users, who do not need a pte page freed.
- Checking if all entries in a pte page are cleared in the unmap path
is racy, and serializing this check is expensive.
- The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
Clearing a pud/pmd entry before the lazy TLB purges would require an
extra TLB purge.

Patch 01 adds the new interfaces as stubs, which work as a workaround
for this issue. Patch 01 is based on Hanjun's patch. [1]

Patch 02 fixes the issue on x86 by implementing the interfaces.
A separate patch (not included in this series) is necessary for arm64.

[1] https://patchwork.kernel.org/patch/10134581/

---
v2
- Added cc to stable (Andrew Morton)
- Added proper function headers (Matthew Wilcox)
- Added a description of why the fix is in the ioremap path (Will Deacon)

---
Toshi Kani (2):
1/2 mm/vmalloc: Add interfaces to free unmapped page table
2/2 x86/mm: implement free pmd/pte page interfaces

---
arch/arm64/mm/mmu.c | 10 ++++++++++
arch/x86/mm/pgtable.c | 44 +++++++++++++++++++++++++++++++++++++++++++
include/asm-generic/pgtable.h | 10 ++++++++++
lib/ioremap.c | 6 ++++--
4 files changed, 68 insertions(+), 2 deletions(-)


2018-03-14 18:15:01

by Kani, Toshimitsu

Subject: [PATCH v2 1/2] mm/vmalloc: Add interfaces to free unmapped page table

On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
may create pud/pmd mappings. Kernel panic was observed on arm64
systems with Cortex-A75 in the following steps as described by
Hanjun Guo.

1. ioremap a 4K size; a valid page table is built.
2. iounmap it; pte0 is set to 0.
3. ioremap the same address with a 2M size; the pgd/pmd is unchanged,
then a new value is set for the pmd.
4. pte0 is leaked.
5. The CPU may take an exception because the old pmd is still in the
TLB, which will lead to a kernel panic.

This panic is not reproducible on x86 because INVLPG, called from
iounmap, purges all levels of entries associated with the purged
address. x86 still has the memory leak, however.

The patch changes the ioremap path to free unmapped page table(s) since
doing so in the unmap path has the following issues:

- The iounmap() path is shared with vunmap(). Since vmap() only
supports pte mappings, making vunmap() free a pte page would add
overhead for regular vmap users, who do not need a pte page freed.
- Checking if all entries in a pte page are cleared in the unmap path
is racy, and serializing this check is expensive.
- The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
Clearing a pud/pmd entry before the lazy TLB purges would require an
extra TLB purge.

Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
which clear a given pud/pmd entry and free up a page for the lower
level entries.

This patch implements their stub functions on x86 and arm64, which
work as a workaround.

Reported-by: Lei Li <[email protected]>
Signed-off-by: Toshi Kani <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Wang Xuefeng <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Hanjun Guo <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Chintan Pandya <[email protected]>
Cc: <[email protected]>
---
arch/arm64/mm/mmu.c | 10 ++++++++++
arch/x86/mm/pgtable.c | 24 ++++++++++++++++++++++++
include/asm-generic/pgtable.h | 10 ++++++++++
lib/ioremap.c | 6 ++++--
4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8c704f1e53c2..2dbb2c9f1ec1 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -972,3 +972,13 @@ int pmd_clear_huge(pmd_t *pmdp)
pmd_clear(pmdp);
return 1;
}
+
+int pud_free_pmd_page(pud_t *pud)
+{
+ return pud_none(*pud);
+}
+
+int pmd_free_pte_page(pmd_t *pmd)
+{
+ return pmd_none(*pmd);
+}
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 004abf9ebf12..1eed7ed518e6 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -702,4 +702,28 @@ int pmd_clear_huge(pmd_t *pmd)

return 0;
}
+
+/**
+ * pud_free_pmd_page - Clear pud entry and free pmd page.
+ * @pud: Pointer to a PUD.
+ *
+ * Context: The pud range has been unmapped and TLB purged.
+ * Return: 1 if clearing the entry succeeded. 0 otherwise.
+ */
+int pud_free_pmd_page(pud_t *pud)
+{
+ return pud_none(*pud);
+}
+
+/**
+ * pmd_free_pte_page - Clear pmd entry and free pte page.
+ * @pmd: Pointer to a PMD.
+ *
+ * Context: The pmd range has been unmapped and TLB purged.
+ * Return: 1 if clearing the entry succeeded. 0 otherwise.
+ */
+int pmd_free_pte_page(pmd_t *pmd)
+{
+ return pmd_none(*pmd);
+}
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 2cfa3075d148..2490800f7c5a 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -983,6 +983,8 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot);
int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot);
int pud_clear_huge(pud_t *pud);
int pmd_clear_huge(pmd_t *pmd);
+int pud_free_pmd_page(pud_t *pud);
+int pmd_free_pte_page(pmd_t *pmd);
#else /* !CONFIG_HAVE_ARCH_HUGE_VMAP */
static inline int p4d_set_huge(p4d_t *p4d, phys_addr_t addr, pgprot_t prot)
{
@@ -1008,6 +1010,14 @@ static inline int pmd_clear_huge(pmd_t *pmd)
{
return 0;
}
+static inline int pud_free_pmd_page(pud_t *pud)
+{
+ return 0;
+}
+static inline int pmd_free_pte_page(pud_t *pmd)
+{
+ return 0;
+}
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */

#ifndef __HAVE_ARCH_FLUSH_PMD_TLB_RANGE
diff --git a/lib/ioremap.c b/lib/ioremap.c
index b808a390e4c3..54e5bbaa3200 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -91,7 +91,8 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,

if (ioremap_pmd_enabled() &&
((next - addr) == PMD_SIZE) &&
- IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
+ IS_ALIGNED(phys_addr + addr, PMD_SIZE) &&
+ pmd_free_pte_page(pmd)) {
if (pmd_set_huge(pmd, phys_addr + addr, prot))
continue;
}
@@ -117,7 +118,8 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,

if (ioremap_pud_enabled() &&
((next - addr) == PUD_SIZE) &&
- IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
+ IS_ALIGNED(phys_addr + addr, PUD_SIZE) &&
+ pud_free_pmd_page(pud)) {
if (pud_set_huge(pud, phys_addr + addr, prot))
continue;
}

2018-03-14 18:17:54

by Kani, Toshimitsu

Subject: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
clear a given pud/pmd entry and free up lower level page table(s).
Address range associated with the pud/pmd entry must have been purged
by INVLPG.

Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
Signed-off-by: Toshi Kani <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: <[email protected]>
---
arch/x86/mm/pgtable.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 1eed7ed518e6..34cda7e0551b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
*/
int pud_free_pmd_page(pud_t *pud)
{
- return pud_none(*pud);
+ pmd_t *pmd;
+ int i;
+
+ if (pud_none(*pud))
+ return 1;
+
+ pmd = (pmd_t *)pud_page_vaddr(*pud);
+
+ for (i = 0; i < PTRS_PER_PMD; i++)
+ if (!pmd_free_pte_page(&pmd[i]))
+ return 0;
+
+ pud_clear(pud);
+ free_page((unsigned long)pmd);
+
+ return 1;
}

/**
@@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
*/
int pmd_free_pte_page(pmd_t *pmd)
{
- return pmd_none(*pmd);
+ pte_t *pte;
+
+ if (pmd_none(*pmd))
+ return 1;
+
+ pte = (pte_t *)pmd_page_vaddr(*pmd);
+ pmd_clear(pmd);
+ free_page((unsigned long)pte);
+
+ return 1;
}
#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */

2018-03-14 22:39:43

by Andrew Morton

Subject: Re: [PATCH v2 1/2] mm/vmalloc: Add interfaces to free unmapped page table

On Wed, 14 Mar 2018 12:01:54 -0600 Toshi Kani <[email protected]> wrote:

> On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap()
> may create pud/pmd mappings. Kernel panic was observed on arm64
> systems with Cortex-A75 in the following steps as described by
> Hanjun Guo.
>
> 1. ioremap a 4K size; a valid page table is built.
> 2. iounmap it; pte0 is set to 0.
> 3. ioremap the same address with a 2M size; the pgd/pmd is unchanged,
> then a new value is set for the pmd.
> 4. pte0 is leaked.
> 5. The CPU may take an exception because the old pmd is still in the
> TLB, which will lead to a kernel panic.
>
> This panic is not reproducible on x86 because INVLPG, called from
> iounmap, purges all levels of entries associated with the purged
> address. x86 still has the memory leak, however.
>
> The patch changes the ioremap path to free unmapped page table(s) since
> doing so in the unmap path has the following issues:
>
> - The iounmap() path is shared with vunmap(). Since vmap() only
> supports pte mappings, making vunmap() free a pte page would add
> overhead for regular vmap users, who do not need a pte page freed.
> - Checking if all entries in a pte page are cleared in the unmap path
> is racy, and serializing this check is expensive.
> - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
> Clearing a pud/pmd entry before the lazy TLB purges would require an
> extra TLB purge.
>
> Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(),
> which clear a given pud/pmd entry and free up a page for the lower
> level entries.
>
> This patch implements their stub functions on x86 and arm64, which
> work as a workaround.
>

whoops.

--- a/include/asm-generic/pgtable.h~mm-vmalloc-add-interfaces-to-free-unmapped-page-table-fix
+++ a/include/asm-generic/pgtable.h
@@ -1014,7 +1014,7 @@ static inline int pud_free_pmd_page(pud_
{
return 0;
}
-static inline int pmd_free_pte_page(pud_t *pmd)
+static inline int pmd_free_pte_page(pmd_t *pmd)
{
return 0;
}
_


2018-03-15 07:40:47

by Chintan Pandya

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces



On 3/14/2018 11:31 PM, Toshi Kani wrote:
> Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> clear a given pud/pmd entry and free up lower level page table(s).
> Address range associated with the pud/pmd entry must have been purged
> by INVLPG.
>
> Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
> Signed-off-by: Toshi Kani <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: <[email protected]>
> ---
> arch/x86/mm/pgtable.c | 28 ++++++++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 1eed7ed518e6..34cda7e0551b 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
> */
> int pud_free_pmd_page(pud_t *pud)
> {
> - return pud_none(*pud);
> + pmd_t *pmd;
> + int i;
> +
> + if (pud_none(*pud))
> + return 1;
> +
> + pmd = (pmd_t *)pud_page_vaddr(*pud);
> +
> + for (i = 0; i < PTRS_PER_PMD; i++)
> + if (!pmd_free_pte_page(&pmd[i]))

This is a forced action and not optional. Also, pmd_free_pte_page()
doesn't return 0 in any case, so you could remove the _if_?

> + return 0;
> +
> + pud_clear(pud);
> + free_page((unsigned long)pmd);
> +
> + return 1;
> }
>
> /**
> @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> */
> int pmd_free_pte_page(pmd_t *pmd)
> {
> - return pmd_none(*pmd);
> + pte_t *pte;
> +
> + if (pmd_none(*pmd))

This should also check if the pmd is already huge. Same for the pud?

> + return 1;
> +
> + pte = (pte_t *)pmd_page_vaddr(*pmd);
> + pmd_clear(pmd);
> + free_page((unsigned long)pte);
> +
> + return 1;
> }
> #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */
>

Chintan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc. is a member of the Code Aurora Forum, a Linux Foundation
Collaborative Project

2018-03-15 14:29:22

by Kani, Toshimitsu

Subject: Re: [PATCH v2 1/2] mm/vmalloc: Add interfaces to free unmapped page table

On Wed, 2018-03-14 at 15:38 -0700, Andrew Morton wrote:
> On Wed, 14 Mar 2018 12:01:54 -0600 Toshi Kani <[email protected]> wrote:
:
>
> whoops.
>
> --- a/include/asm-generic/pgtable.h~mm-vmalloc-add-interfaces-to-free-unmapped-page-table-fix
> +++ a/include/asm-generic/pgtable.h
> @@ -1014,7 +1014,7 @@ static inline int pud_free_pmd_page(pud_
> {
> return 0;
> }
> -static inline int pmd_free_pte_page(pud_t *pmd)
> +static inline int pmd_free_pte_page(pmd_t *pmd)
> {
> return 0;
> }

Thanks Andrew for catching this!!
-Toshi

2018-03-15 14:52:56

by Kani, Toshimitsu

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Thu, 2018-03-15 at 13:09 +0530, Chintan Pandya wrote:
>
> On 3/14/2018 11:31 PM, Toshi Kani wrote:
> > Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> > clear a given pud/pmd entry and free up lower level page table(s).
> > Address range associated with the pud/pmd entry must have been purged
> > by INVLPG.
> >
> > Fixes: e61ce6ade404e ("mm: change ioremap to set up huge I/O mappings")
> > Signed-off-by: Toshi Kani <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: Borislav Petkov <[email protected]>
> > Cc: Matthew Wilcox <[email protected]>
> > Cc: <[email protected]>
> > ---
> > arch/x86/mm/pgtable.c | 28 ++++++++++++++++++++++++++--
> > 1 file changed, 26 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> > index 1eed7ed518e6..34cda7e0551b 100644
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -712,7 +712,22 @@ int pmd_clear_huge(pmd_t *pmd)
> > */
> > int pud_free_pmd_page(pud_t *pud)
> > {
> > - return pud_none(*pud);
> > + pmd_t *pmd;
> > + int i;
> > +
> > + if (pud_none(*pud))
> > + return 1;
> > +
> > + pmd = (pmd_t *)pud_page_vaddr(*pud);
> > +
> > + for (i = 0; i < PTRS_PER_PMD; i++)
> > + if (!pmd_free_pte_page(&pmd[i]))
>
> This is a forced action and not optional. Also, pmd_free_pte_page()
> doesn't return 0 in any case, so you could remove the _if_?

The code needs to be written per the interface definition, not per the
current implementation.

> > + return 0;
> > +
> > + pud_clear(pud);
> > + free_page((unsigned long)pmd);
> > +
> > + return 1;
> > }
> >
> > /**
> > @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> > */
> > int pmd_free_pte_page(pmd_t *pmd)
> > {
> > - return pmd_none(*pmd);
> > + pte_t *pte;
> > +
> > + if (pmd_none(*pmd))
>
> This should also check if the pmd is already huge. Same for the pud?

Not necessary. As described in the function header, one of the entry
conditions is that a given pmd range is unmapped. See
vunmap_pmd_range().
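
For reference, vunmap_pmd_range() reads roughly as follows (paraphrased
from the current mm/vmalloc.c, for illustration only); pmd entries are
cleared, or huge pmds dropped via pmd_clear_huge(), here, before
iounmap's lazy TLB purge runs:

static void vunmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end)
{
	pmd_t *pmd;
	unsigned long next;

	pmd = pmd_offset(pud, addr);
	do {
		next = pmd_addr_end(addr, end);
		if (pmd_clear_huge(pmd))	/* drop a huge pmd mapping */
			continue;
		if (pmd_none_or_clear_bad(pmd))
			continue;
		vunmap_pte_range(pmd, addr, next); /* clear the pte entries */
	} while (pmd++, addr = next, addr != end);
}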

Thanks,
-Toshi

2018-04-26 14:21:06

by Joerg Roedel

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

Hi Toshi, Andrew,

this patch(-set) is broken in several ways, please see below.

On Wed, Mar 14, 2018 at 12:01:55PM -0600, Toshi Kani wrote:
> Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> clear a given pud/pmd entry and free up lower level page table(s).
> Address range associated with the pud/pmd entry must have been purged
> by INVLPG.

An INVLPG before actually unmapping the page is useless, as other cores
or even speculative instruction execution can bring the TLB entry back
before the code actually unmaps the page.

> int pud_free_pmd_page(pud_t *pud)
> {
> - return pud_none(*pud);
> + pmd_t *pmd;
> + int i;
> +
> + if (pud_none(*pud))
> + return 1;
> +
> + pmd = (pmd_t *)pud_page_vaddr(*pud);
> +
> + for (i = 0; i < PTRS_PER_PMD; i++)
> + if (!pmd_free_pte_page(&pmd[i]))
> + return 0;
> +
> + pud_clear(pud);

TLB flush needed here, before the page is freed.

> + free_page((unsigned long)pmd);
> +
> + return 1;
> }
>
> /**
> @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> */
> int pmd_free_pte_page(pmd_t *pmd)
> {
> - return pmd_none(*pmd);
> + pte_t *pte;
> +
> + if (pmd_none(*pmd))
> + return 1;
> +
> + pte = (pte_t *)pmd_page_vaddr(*pmd);
> + pmd_clear(pmd);

Same here, TLB flush needed.

Further, this needs synchronization with the other page-tables in the
system when the kernel PMDs are not shared between processes. On x86-32
with PAE this triggers a BUG_ON() at arch/x86/mm/fault.c:268 because
the page-tables are not correctly synchronized.

> + free_page((unsigned long)pte);
> +
> + return 1;
> }
> #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */

2018-04-26 16:23:22

by Kani, Toshimitsu

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Thu, 2018-04-26 at 16:19 +0200, Joerg Roedel wrote:
> Hi Toshi, Andrew,
>
> this patch(-set) is broken in several ways, please see below.
>
> On Wed, Mar 14, 2018 at 12:01:55PM -0600, Toshi Kani wrote:
> > Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
> > clear a given pud/pmd entry and free up lower level page table(s).
> > Address range associated with the pud/pmd entry must have been purged
> > by INVLPG.
>
> An INVLPG before actually unmapping the page is useless, as other cores
> or even speculative instruction execution can bring the TLB entry back
> before the code actually unmaps the page.

Hi Joerg,

All pages under the pmd had been unmapped and then lazily TLB-purged
with INVLPG before coming to this code path. Speculation is not allowed
to access pages without a mapping.

> > int pud_free_pmd_page(pud_t *pud)
> > {
> > - return pud_none(*pud);
> > + pmd_t *pmd;
> > + int i;
> > +
> > + if (pud_none(*pud))
> > + return 1;
> > +
> > + pmd = (pmd_t *)pud_page_vaddr(*pud);
> > +
> > + for (i = 0; i < PTRS_PER_PMD; i++)
> > + if (!pmd_free_pte_page(&pmd[i]))
> > + return 0;
> > +
> > + pud_clear(pud);
>
> TLB flush needed here, before the page is freed.
>
> > + free_page((unsigned long)pmd);
> > +
> > + return 1;
> > }
> >
> > /**
> > @@ -724,6 +739,15 @@ int pud_free_pmd_page(pud_t *pud)
> > */
> > int pmd_free_pte_page(pmd_t *pmd)
> > {
> > - return pmd_none(*pmd);
> > + pte_t *pte;
> > +
> > + if (pmd_none(*pmd))
> > + return 1;
> > +
> > + pte = (pte_t *)pmd_page_vaddr(*pmd);
> > + pmd_clear(pmd);
>
> Same here, TLB flush needed.
>
> Further, this needs synchronization with the other page-tables in the
> system when the kernel PMDs are not shared between processes. On x86-32
> with PAE this triggers a BUG_ON() at arch/x86/mm/fault.c:268 because
> the page-tables are not correctly synchronized.

I think this is an issue with pmd mapping support on x86-32-PAE, not
with this patch. I think that code needs to be updated to sync at the
pud level.

Thanks,
-Toshi

2018-04-26 17:25:24

by Joerg Roedel

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Thu, Apr 26, 2018 at 04:21:19PM +0000, Kani, Toshi wrote:
> All pages under the pmd had been unmapped and then lazily TLB-purged
> with INVLPG before coming to this code path. Speculation is not allowed
> to access pages without a mapping.

CPUs have not only TLBs, but also page-walk caches, which cache
intermediary results of page-table walks and which are flushed together
with the TLB.

So the PMD entry you clear can still be in a page-walk cache and this
needs to be flushed too before you can free the PTE page. Otherwise
page-walks might still go to the page you just freed. That is especially
bad when the page is already reallocated and filled with other data.
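
So the required ordering is roughly the following (a sketch only, not a
fix: the current interfaces do not even receive the unmapped address,
and flush_tlb_kernel_range() only stands in for whatever flush
primitive ends up being used):

	pmd_clear(pmd);			/* drop the software reference */
	flush_tlb_kernel_range(addr, addr + PMD_SIZE);
					/* purge TLB and page-walk caches */
	free_page((unsigned long)pte);	/* only now is the page safe to reuse */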

> > Further, this needs synchronization with the other page-tables in the
> > system when the kernel PMDs are not shared between processes. On x86-32
> > with PAE this triggers a BUG_ON() at arch/x86/mm/fault.c:268 because
> > the page-tables are not correctly synchronized.
>
> I think this is an issue with pmd mapping support on x86-32-PAE, not
> with this patch. I think that code needs to be updated to sync at the
> pud level.

It is an issue with this patch, because this patch is for x86, and on
x86 every change to the kernel page-tables potentially needs to be
synchronized to the other page-tables. And this patch doesn't implement
that, which triggers a BUG_ON() under certain conditions.


Regards,

Joerg


2018-04-26 17:52:22

by Kani, Toshimitsu

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Thu, 2018-04-26 at 19:23 +0200, [email protected] wrote:
> On Thu, Apr 26, 2018 at 04:21:19PM +0000, Kani, Toshi wrote:
> > All pages under the pmd had been unmapped and then lazily TLB-purged
> > with INVLPG before coming to this code path. Speculation is not allowed
> > to access pages without a mapping.
>
> CPUs have not only TLBs, but also page-walk caches, which cache
> intermediary results of page-table walks and which are flushed together
> with the TLB.
>
> So the PMD entry you clear can still be in a page-walk cache and this
> needs to be flushed too before you can free the PTE page. Otherwise
> page-walks might still go to the page you just freed. That is especially
> bad when the page is already reallocated and filled with other data.

I do not understand why we need to flush processor caches here. x86
processor caches are coherent with MESI. So, clearing a PMD entry
modifies a cache entry on the processor associated with the address,
which in turn invalidates all stale cache entries on other processors.

> > > Further, this needs synchronization with the other page-tables in the
> > > system when the kernel PMDs are not shared between processes. On x86-32
> > > with PAE this triggers a BUG_ON() at arch/x86/mm/fault.c:268 because
> > > the page-tables are not correctly synchronized.
> >
> > I think this is an issue with pmd mapping support on x86-32-PAE, not
> > with this patch. I think that code needs to be updated to sync at the
> > pud level.
>
> It is an issue with this patch, because this patch is for x86, and on
> x86 every change to the kernel page-tables potentially needs to be
> synchronized to the other page-tables. And this patch doesn't implement
> that, which triggers a BUG_ON() under certain conditions.

The issue was introduced when pmd mapping support was added on x86/32,
which happened prior to this patch.

Thanks,
-Toshi

2018-04-26 20:09:24

by Joerg Roedel

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Thu, Apr 26, 2018 at 05:49:58PM +0000, Kani, Toshi wrote:
> On Thu, 2018-04-26 at 19:23 +0200, [email protected] wrote:
> > So the PMD entry you clear can still be in a page-walk cache and this
> > needs to be flushed too before you can free the PTE page. Otherwise
> > page-walks might still go to the page you just freed. That is especially
> > bad when the page is already reallocated and filled with other data.
>
> I do not understand why we need to flush processor caches here. x86
> processor caches are coherent with MESI. So, clearing a PMD entry
> modifies a cache entry on the processor associated with the address,
> which in turn invalidates all stale cache entries on other processors.

A page-walk cache is not the processor's data cache; it's a cache
similar to the TLB that speeds up page-walks by caching intermediate
results of previous page walks.


Thanks,

Joerg

2018-04-26 22:32:21

by Kani, Toshimitsu

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Thu, 2018-04-26 at 22:07 +0200, [email protected] wrote:
> On Thu, Apr 26, 2018 at 05:49:58PM +0000, Kani, Toshi wrote:
> > On Thu, 2018-04-26 at 19:23 +0200, [email protected] wrote:
> > > So the PMD entry you clear can still be in a page-walk cache and this
> > > needs to be flushed too before you can free the PTE page. Otherwise
> > > page-walks might still go to the page you just freed. That is especially
> > > bad when the page is already reallocated and filled with other data.
> >
> > I do not understand why we need to flush processor caches here. x86
> > processor caches are coherent with MESI. So, clearing a PMD entry
> > modifies a cache entry on the processor associated with the address,
> > which in turn invalidates all stale cache entries on other processors.
>
> A page-walk cache is not the processor's data cache; it's a cache
> similar to the TLB that speeds up page-walks by caching intermediate
> results of previous page walks.

Thanks for the clarification. After reading through the SDM one more
time, I agree that we need a TLB purge here. Here is my current
understanding.

- INVLPG purges both the TLB and the paging-structure caches. So, the
PMD cache was purged once.
- However, the processor may cache this PMD entry later during
speculation since it has the P-bit set. (This is where my
misunderstanding was. Speculation is not allowed to access a target
address, but it may still cache this PMD entry.)
- A single INVLPG on each processor purges this PMD cache. It does not
need a range purge (which was already done).

Does it sound right to you?

As for the BUG_ON issue, are you able to reproduce this issue? If so,
would you be able to test the fix?

Regards,
-Toshi

2018-04-27 07:38:55

by Joerg Roedel

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
> Thanks for the clarification. After reading through the SDM one more
> time, I agree that we need a TLB purge here. Here is my current
> understanding.
>
> - INVLPG purges both the TLB and the paging-structure caches. So, the
> PMD cache was purged once.
> - However, the processor may cache this PMD entry later during
> speculation since it has the P-bit set. (This is where my
> misunderstanding was. Speculation is not allowed to access a target
> address, but it may still cache this PMD entry.)
> - A single INVLPG on each processor purges this PMD cache. It does not
> need a range purge (which was already done).
>
> Does it sound right to you?

The right fix is to first synchronize the changes when the PMD/PUD is
cleared and then flush the TLB system-wide. After that is done you can
free the page.

But doing all that in the pud/pmd_free_pmd/pte_page() functions is too
expensive, as the TLB flush requires sending IPIs to all cores in the
system, and that on every call to the function.

So what needs to be done is to fix this in the high-level ioremap code:
first unmap all required PTE/PMD pages and collect them in a list. When
that is done, you can synchronize the changes with the other page-tables
in the system and do one system-wide TLB flush. When that is complete,
you can free the pages on the list that were collected while unmapping.

Then the new mappings can be established and again synchronized with the
other page-tables in the system.
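
In pseudo-code, something like the following (all of the ioremap_*
helpers here are made-up names, only to illustrate the ordering;
LIST_HEAD() and flush_tlb_kernel_range() are the real primitives):

	LIST_HEAD(tables);	/* page-table pages collected while unmapping */

	ioremap_unmap_range(addr, end, &tables);  /* 1. unmap, collect pages */
	ioremap_sync_pagetables(addr, end);       /* 2. sync other page-tables */
	flush_tlb_kernel_range(addr, end);        /*    one system-wide flush */
	ioremap_free_tables(&tables);             /* 3. now freeing is safe */
	ioremap_map_range(addr, end, phys_addr, prot); /* 4. new mappings... */
	ioremap_sync_pagetables(addr, end);       /*    ...synchronized again */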

> As for the BUG_ON issue, are you able to reproduce this issue? If so,
> would you be able to test the fix?

Yes, I can reproduce the BUG_ON with my PTI patches and a fedora-i386
VM.

I already ran into the issue before your patches were merged upstream,
but my "fix" is different because it just prevents huge-mappings when
there were smaller mappings before. See

e3e288121408 x86/pgtable: Don't set huge PUD/PMD on non-leaf entries

for details. This patch does not fix the base-problem, but hides it
again, as the real fix needs some more work across architectures.

Your patch actually makes the problem worse: without it, the PTE/PMD
pages were just leaked, so they could not be reused. But with your patch
the pages can be used again, and the page-walker might establish TLB
entries based on random content the new owner writes to them. This can
lead to all kinds of random and very hard to debug data corruption
issues.

So until we make the generic ioremap code in lib/ioremap.c smarter about
unmapping/remapping ranges the best solution is making my fix work again
by reverting your patch.


Thanks,

Joerg


2018-04-27 11:40:51

by Michal Hocko

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Fri 27-04-18 09:37:20, [email protected] wrote:
[...]
> So until we make the generic ioremap code in lib/ioremap.c smarter about
> unmapping/remapping ranges the best solution is making my fix work again
> by reverting your patch.

Could you reuse the same mmu_gather mechanism as we use in the
zap_*_range?
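
Something like this, roughly (a sketch only; whether the mmu_gather
machinery can simply be pointed at init_mm for the kernel ioremap range
is exactly the part that would need checking):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, &init_mm, start, end);
	/*
	 * ... clear the pud/pmd entries and queue the old table pages
	 * via pte_free_tlb()/pmd_free_tlb() instead of free_page() ...
	 */
	tlb_finish_mmu(&tlb, start, end); /* flush TLB, then free queued pages */
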
--
Michal Hocko
SUSE Labs

2018-04-27 11:48:47

by Joerg Roedel

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Fri, Apr 27, 2018 at 01:39:23PM +0200, Michal Hocko wrote:
> On Fri 27-04-18 09:37:20, [email protected] wrote:
> [...]
> > So until we make the generic ioremap code in lib/ioremap.c smarter about
> > unmapping/remapping ranges the best solution is making my fix work again
> > by reverting your patch.
>
> Could you reuse the same mmu_gather mechanism as we use in the
> zap_*_range?

Yeah, maybe, I haven't looked into the details yet. At least something
similar is needed.


Joerg

2018-04-27 11:54:04

by Chintan Pandya

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces



On 4/27/2018 1:07 PM, [email protected] wrote:
> On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
>> Thanks for the clarification. After reading through the SDM one more
>> time, I agree that we need a TLB purge here. Here is my current
>> understanding.
>>
>> - INVLPG purges both the TLB and the paging-structure caches. So, the
>> PMD cache was purged once.
>> - However, the processor may cache this PMD entry later during
>> speculation since it has the P-bit set. (This is where my
>> misunderstanding was. Speculation is not allowed to access a target
>> address, but it may still cache this PMD entry.)
>> - A single INVLPG on each processor purges this PMD cache. It does not
>> need a range purge (which was already done).
>>
>> Does it sound right to you?
>
> The right fix is to first synchronize the changes when the PMD/PUD is
> cleared and then flush the TLB system-wide. After that is done you can
> free the page.
>

I'm a bit confused here. Are you pointing to a race within the
ioremap/vmalloc framework while updating the page table, or a race
during TLB ops? Since the latter is arch dependent, I will not comment
on it. But if the race being discussed here is while altering page
tables, I'm not on the same page.

The current ioremap/vmalloc framework works with a reserved virtual
area for its own purpose. Within this virtual area, we maintain mutual
exclusion through a separate rbtree, which is of course synchronized.
In the __vunmap leg, we perform the page table ops first and then
release the virtual area for someone else to re-use. This way, we are
good without taking any additional locks for page table modifications.

If that's not the case, then I'm missing something here.

Also, I'm curious to know what race you are observing at your end.


Chintan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc. is a member of the Code Aurora Forum, a Linux Foundation
Collaborative Project

2018-04-27 12:49:53

by Joerg Roedel

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Fri, Apr 27, 2018 at 05:22:28PM +0530, Chintan Pandya wrote:
> I'm a bit confused here. Are you pointing to a race within the
> ioremap/vmalloc framework while updating the page table, or a race
> during TLB ops? Since the latter is arch dependent, I will not comment
> on it. But if the race being discussed here is while altering page
> tables, I'm not on the same page.

The race condition is between hardware and software. It is not
sufficient to just remove the software references to the page that is
about to be freed (by clearing the PMD/PUD); the hardware references in
the page-walk cache also need to be removed with a TLB flush.
Otherwise the hardware can use the freed (and possibly reused) page to
establish new TLB entries.



Joerg


2018-04-27 13:44:31

by Chintan Pandya

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces



On 4/27/2018 6:18 PM, [email protected] wrote:
> On Fri, Apr 27, 2018 at 05:22:28PM +0530, Chintan Pandya wrote:
>> I'm a bit confused here. Are you pointing to a race within the
>> ioremap/vmalloc framework while updating the page table, or a race
>> during TLB ops? Since the latter is arch dependent, I will not comment
>> on it. But if the race being discussed here is while altering page
>> tables, I'm not on the same page.
>
> The race condition is between hardware and software. It is not
> sufficient to just remove the software references to the page that is
> about to be freed (by clearing the PMD/PUD); the hardware references in
> the page-walk cache also need to be removed with a TLB flush.
> Otherwise the hardware can use the freed (and possibly reused) page to
> establish new TLB entries.

Sure! This is my understanding too (from the arm64 context).

>
>
>
> Joerg
>
>

Chintan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc. is a member of the Code Aurora Forum, a Linux Foundation
Collaborative Project

2018-04-27 14:33:39

by Kani, Toshimitsu

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Fri, 2018-04-27 at 09:37 +0200, [email protected] wrote:
> On Thu, Apr 26, 2018 at 10:30:14PM +0000, Kani, Toshi wrote:
> > Thanks for the clarification. After reading through the SDM one more
> > time, I agree that we need a TLB purge here. Here is my current
> > understanding.
> >
> > - INVLPG purges both the TLB and the paging-structure caches. So, the
> > PMD cache was purged once.
> > - However, the processor may cache this PMD entry later during
> > speculation since it has the P-bit set. (This is where my
> > misunderstanding was. Speculation is not allowed to access a target
> > address, but it may still cache this PMD entry.)
> > - A single INVLPG on each processor purges this PMD cache. It does not
> > need a range purge (which was already done).
> >
> > Does it sound right to you?
>
> The right fix is to first synchronize the changes when the PMD/PUD is
> cleared and then flush the TLB system-wide. After that is done you can
> free the page.

Agreed. This can be done on top of this patch.

> But doing all that in the pud/pmd_free_pmd/pte_page() functions is too
> expensive, as the TLB flush requires sending IPIs to all cores in the
> system, and that on every call to the function.
>
> So what needs to be done is to fix this in the high-level ioremap code:
> first unmap all required PTE/PMD pages and collect them in a list. When
> that is done, you can synchronize the changes with the other page-tables
> in the system and do one system-wide TLB flush. When that is complete,
> you can free the pages on the list that were collected while unmapping.
>
> Then the new mappings can be established and again synchronized with the
> other page-tables in the system.

Yes, and this patch was designed to work in such a way. Please note that
this patch added pud_free_pmd_page() and pmd_free_pte_page() to the
ioremap() path when and only when it creates a pud or pmd map. This
ensures the following preconditions are met without overhead.
- All pte entries for a target pud/pmd address range have been cleared.
- System-wide TLB purges have been done for a target pud/pmd address
range.

So, we can add step 2 on top of this patch.
1. Clear pud/pmd entry.
2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
3. Free its underlying pmd/pte page.
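
Applied to pmd_free_pte_page(), for example, the new step would land as
follows (a sketch; flush_tlb_all() is only a placeholder for whatever
purge primitive the new patch settles on):

int pmd_free_pte_page(pmd_t *pmd)
{
	pte_t *pte;

	if (pmd_none(*pmd))
		return 1;

	pte = (pte_t *)pmd_page_vaddr(*pmd);
	pmd_clear(pmd);			/* step 1: clear the entry */
	flush_tlb_all();		/* step 2: TO BE ADDED by the new patch */
	free_page((unsigned long)pte);	/* step 3: free the pte page */

	return 1;
}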

> > As for the BUG_ON issue, are you able to reproduce this issue? If so,
> > would you be able to test the fix?
>
> Yes, I can reproduce the BUG_ON with my PTI patches and a fedora-i386
> VM.

Great!

> I already ran into the issue before your patches were merged upstream,
> but my "fix" is different because it just prevents huge-mappings when
> there were smaller mappings before. See
>
> e3e288121408 x86/pgtable: Don't set huge PUD/PMD on non-leaf entries
>
> for details. This patch does not fix the base-problem, but hides it
> again, as the real fix needs some more work across architectures.

Right. Patch 1/2 of this series made the same fix as well. See:

b6bdb7517c3d mm/vmalloc: add interfaces to free unmapped page table

> Your patch actually makes the problem worse: without it, the PTE/PMD
> pages were just leaked, so they could not be reused. But with your patch
> the pages can be used again, and the page-walker might establish TLB
> entries based on random content the new owner writes to them. This can
> lead to all kinds of random and very hard to debug data corruption
> issues.
>
> So until we make the generic ioremap code in lib/ioremap.c smarter about
> unmapping/remapping ranges the best solution is making my fix work again
> by reverting your patch.

We do not need to revert this patch. We can make the above change I
mentioned.

Thanks,
-Toshi

2018-04-28 09:03:44

by Joerg Roedel

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> So, we can add step 2 on top of this patch.
> 1. Clear pud/pmd entry.
> 2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
> 3. Free its underlying pmd/pte page.

This still lacks the page-table synchronization and will thus not fix
the BUG_ON being triggered.

> We do not need to revert this patch. We can make the above change I
> mentioned.

Please note that we are not in the merge window anymore and that any fix
needs to be simple and obviously correct.


Thanks,

Joerg

2018-04-28 20:57:12

by Kani, Toshimitsu

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Sat, 2018-04-28 at 11:02 +0200, [email protected] wrote:
> On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> > So, we can add step 2 on top of this patch.
> > 1. Clear pud/pmd entry.
> > 2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
> > 3. Free its underlying pmd/pte page.
>
> This still lacks the page-table synchronization and will thus not fix
> the BUG_ON being triggered.

The BUG_ON issue is specific to PAE, which syncs at the pmd level.
x86/64 does not have this issue since it syncs at the pgd or p4d level.

> > We do not need to revert this patch. We can make the above change I
> > mentioned.
>
> Please note that we are not in the merge window anymore and that any fix
> needs to be simple and obviously correct.

Understood. Changing the x86/32 sync point is risky. So, I am going to
revert the free page handling for PAE.

Thanks,
-Toshi

2018-04-30 07:31:43

by Chintan Pandya

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces



On 4/29/2018 2:24 AM, Kani, Toshi wrote:
> On Sat, 2018-04-28 at 11:02 +0200, [email protected] wrote:
>> On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
>>> So, we can add step 2 on top of this patch.
>>> 1. Clear pud/pmd entry.
>>> 2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
>>> 3. Free its underlying pmd/pte page.
>>
>> This still lacks the page-table synchronization and will thus not fix
>> the BUG_ON being triggered.
>
> The BUG_ON issue is specific to PAE, which syncs at the pmd level.
> x86/64 does not have this issue since it syncs at the pgd or p4d level.
>
>>> We do not need to revert this patch. We can make the above change I
>>> mentioned.
>>
>> Please note that we are not in the merge window anymore and that any fix
>> needs to be simple and obviously correct.
>
> Understood. Changing the x86/32 sync point is risky. So, I am going to
> revert the free page handling for PAE.

Will this affect pmd_free_pte_page() & pud_free_pmd_page()'s existence
or their parameters? I'm asking because I have a similar change for
arm64 and am ready to send v9 patches.

I'm thinking of sharing my v9 patches in any case. If you are going to
do TLB invalidation within these APIs, my first patch will help.

>
> Thanks,
> -Toshi
>

Chintan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc. is a member of the Code Aurora Forum, a Linux Foundation
Collaborative Project

2018-04-30 13:44:31

by Kani, Toshimitsu

Subject: Re: [PATCH v2 2/2] x86/mm: implement free pmd/pte page interfaces

On Mon, 2018-04-30 at 13:00 +0530, Chintan Pandya wrote:
>
> On 4/29/2018 2:24 AM, Kani, Toshi wrote:
> > On Sat, 2018-04-28 at 11:02 +0200, [email protected] wrote:
> > > On Fri, Apr 27, 2018 at 02:31:51PM +0000, Kani, Toshi wrote:
> > > > So, we can add step 2 on top of this patch.
> > > > 1. Clear pud/pmd entry.
> > > > 2. System-wide TLB flush <-- TO BE ADDED BY NEW PATCH
> > > > 3. Free its underlying pmd/pte page.
> > >
> > > This still lacks the page-table synchronization and will thus not fix
> > > the BUG_ON being triggered.
> >
> > The BUG_ON issue is specific to PAE, which syncs at the pmd level.
> > x86/64 does not have this issue since it syncs at the pgd or p4d level.
> >
> > > > We do not need to revert this patch. We can make the above change I
> > > > mentioned.
> > >
> > > Please note that we are not in the merge window anymore and that any fix
> > > needs to be simple and obviously correct.
> >
> > Understood. Changing the x86/32 sync point is risky. So, I am going to
> > revert the free page handling for PAE.
>
> Will this affect pmd_free_pte_page() & pud_free_pmd_page()'s existence
> or their parameters? I'm asking because I have a similar change for
> arm64 and am ready to send v9 patches.

No, it won't. The change is only to the x86 side.

> I'm thinking of sharing my v9 patches in any case. If you are going to
> do TLB invalidation within these APIs, my first patch will help.

I will make my change on top of your v9 1/4 patch so that we can avoid
a merge conflict.

Thanks,
-Toshi