2019-08-28 14:32:51

by Thomas Gleixner

Subject: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

From: Song Liu <[email protected]>

pti_clone_pmds() assumes that the supplied address is either:

- properly PUD/PMD aligned
or
- the address is actually mapped which means that independent
of the mapping level (PUD/PMD/PTE) the next higher mapping
exist.

If that's not the case the unaligned address can be incremented by PUD or
PMD size wrongly. All callers supply mapped and/or aligned addresses, but
for robustness sake, it's better to handle that case proper and to emit a
warning.

Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/mm/pti.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u

pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ WARN_ON_ONCE(addr & PUD_MASK);
+ addr = round_up(addr + 1, PUD_SIZE);
continue;
}

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- addr += PMD_SIZE;
+ WARN_ON_ONCE(addr & PMD_MASK);
+ addr = round_up(addr + 1, PMD_SIZE);
continue;
}
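
A minimal userspace sketch of the address-advance arithmetic above, assuming the usual x86-64 sizes (2MB PMD, 1GB PUD) and a hypothetical unaligned address: the old increment lands past the next PUD boundary on another unaligned address, while round_up() advances exactly to the boundary.

/* Sketch only: userspace reimplementation, x86-64 sizes assumed. */
#include <stdio.h>

#define PMD_SIZE (1UL << 21) /* 2MB */
#define PUD_SIZE (1UL << 30) /* 1GB */

/* Equivalent to the kernel's round_up() for power-of-two sizes. */
#define round_up(x, y) (((x) + (y) - 1) & ~((y) - 1))

int main(void)
{
	unsigned long addr = 0x40001000UL; /* hypothetical, not PUD aligned */

	/* Old code: overshoots the next PUD boundary. */
	printf("addr + PUD_SIZE       = %#lx\n", addr + PUD_SIZE);

	/* New code: stops exactly at the next PUD boundary. */
	printf("round_up(addr+1, PUD) = %#lx\n", round_up(addr + 1, PUD_SIZE));
	return 0;
}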




2019-08-28 15:48:23

by Dave Hansen

Subject: Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

On 8/28/19 7:24 AM, Thomas Gleixner wrote:
> From: Song Liu <[email protected]>
>
> pti_clone_pmds() assumes that the supplied address is either:
>
> - properly PUD/PMD aligned
> or
> - the address is actually mapped which means that independent
> of the mapping level (PUD/PMD/PTE) the next higher mapping
> exist.
>
> If that's not the case the unaligned address can be incremented by PUD or
> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> for robustness sake, it's better to handle that case proper and to emit a
> warning.

Reviewed-by: Dave Hansen <[email protected]>

Song, did you ever root-cause the performance regression? I thought
there were still some mysteries there.

2019-08-28 15:53:56

by Thomas Gleixner

Subject: Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

On Wed, 28 Aug 2019, Dave Hansen wrote:
> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
> > From: Song Liu <[email protected]>
> >
> > pti_clone_pmds() assumes that the supplied address is either:
> >
> > - properly PUD/PMD aligned
> > or
> > - the address is actually mapped which means that independent
> > of the mapping level (PUD/PMD/PTE) the next higher mapping
> > exist.
> >
> > If that's not the case the unaligned address can be incremented by PUD or
> > PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> > for robustness sake, it's better to handle that case proper and to emit a
> > warning.
>
> Reviewed-by: Dave Hansen <[email protected]>
>
> Song, did you ever root-cause the performance regression? I thought
> there were still some mysteries there.

See Peter's series to rework the ftrace code patching ...

2019-08-28 18:02:18

by Song Liu

Subject: Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()



> On Aug 28, 2019, at 8:51 AM, Thomas Gleixner <[email protected]> wrote:
>
> On Wed, 28 Aug 2019, Dave Hansen wrote:
>> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
>>> From: Song Liu <[email protected]>
>>>
>>> pti_clone_pmds() assumes that the supplied address is either:
>>>
>>> - properly PUD/PMD aligned
>>> or
>>> - the address is actually mapped which means that independent
>>> of the mapping level (PUD/PMD/PTE) the next higher mapping
>>> exist.
>>>
>>> If that's not the case the unaligned address can be incremented by PUD or
>>> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
>>> for robustness sake, it's better to handle that case proper and to emit a
>>> warning.
>>
>> Reviewed-by: Dave Hansen <[email protected]>
>>
>> Song, did you ever root-cause the performance regression? I thought
>> there were still some mysteries there.
>
> See Peter's series to rework the ftrace code patching ...

Thanks Thomas.

Yes, in summary, enabling ftrace or kprobe-on-ftrace causes the kernel
to split PMDs in kernel text mapping.

Related question: while Peter's patches fix it for 5.3 kernel, they don't
apply cleanly over 5.2 kernel (which we are using). So I wonder what is
the best solution for 5.2 kernel. My patch also fixes the issue:

https://lore.kernel.org/lkml/[email protected]/

How about we apply this patch to upstream 5.2 kernel?

Thanks,
Song

2019-08-28 19:01:09

by Ingo Molnar

Subject: Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()


* Thomas Gleixner <[email protected]> wrote:

> From: Song Liu <[email protected]>
>
> pti_clone_pmds() assumes that the supplied address is either:
>
> - properly PUD/PMD aligned
> or
> - the address is actually mapped which means that independent
> of the mapping level (PUD/PMD/PTE) the next higher mapping
> exist.

s/independent/independently

s/exist/exists

> If that's not the case the unaligned address can be incremented by PUD or
> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> for robustness sake, it's better to handle that case proper and to emit a
> warning.

s/wrongly/incorrectly

s/robustness sake/robustness's sake

s/proper/properly

With that:

> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + WARN_ON_ONCE(addr & PUD_MASK);
> + addr = round_up(addr + 1, PUD_SIZE);
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> - addr += PMD_SIZE;
> + WARN_ON_ONCE(addr & PMD_MASK);
> + addr = round_up(addr + 1, PMD_SIZE);

So given that PUD_MASK and PMD_MASK are masking out the *offset*:

arch/x86/include/asm/pgtable_64_types.h:#define PMD_MASK (~(PMD_SIZE - 1))

Didn't we want something like:

WARN_ON_ONCE(addr & ~PUD_MASK);

WARN_ON_ONCE(addr & ~PMD_MASK);

to warn about an unaligned 'addr', or am I misreading the intent here?

Thanks,

Ingo
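
A small sketch of the mask arithmetic in question, using the x86-64 definition quoted above (2MB PMD, hypothetical addresses): PMD_MASK keeps the aligned part of an address, so "addr & PMD_MASK" is nonzero for almost any kernel address and would warn unconditionally; "addr & ~PMD_MASK" keeps only the offset inside the 2MB page and is nonzero exactly when addr is unaligned.

/* Sketch only: userspace reimplementation, x86-64 sizes assumed. */
#include <stdio.h>

#define PMD_SIZE (1UL << 21)
#define PMD_MASK (~(PMD_SIZE - 1))

int main(void)
{
	unsigned long aligned   = 0xffffffff81000000UL; /* hypothetical, 2MB aligned */
	unsigned long unaligned = 0xffffffff81001000UL; /* hypothetical, +4k */

	printf("aligned   & PMD_MASK  = %#lx\n", aligned & PMD_MASK);    /* nonzero: always warns */
	printf("aligned   & ~PMD_MASK = %#lx\n", aligned & ~PMD_MASK);   /* 0: no warning */
	printf("unaligned & ~PMD_MASK = %#lx\n", unaligned & ~PMD_MASK); /* 0x1000: warns as intended */
	return 0;
}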

2019-08-28 19:46:34

by Thomas Gleixner

Subject: Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

On Wed, 28 Aug 2019, Ingo Molnar wrote:
> > pmd = pmd_offset(pud, addr);
> > if (pmd_none(*pmd)) {
> > - addr += PMD_SIZE;
> > + WARN_ON_ONCE(addr & PMD_MASK);
> > + addr = round_up(addr + 1, PMD_SIZE);
>
> So given that PUD_MASK and PMD_MASK are masking out the *offset*:
>
> arch/x86/include/asm/pgtable_64_types.h:#define PMD_MASK (~(PMD_SIZE - 1))
>
> Didn't we want something like:
>
> WARN_ON_ONCE(addr & ~PUD_MASK);
>
> WARN_ON_ONCE(addr & ~PMD_MASK);
>
> to warn about an unaligned 'addr', or am I misreading the intent here?

Bah, right you are...

2019-08-28 20:07:02

by Thomas Gleixner

Subject: Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

On Wed, 28 Aug 2019, Song Liu wrote:
> > On Aug 28, 2019, at 8:51 AM, Thomas Gleixner <[email protected]> wrote:
> >
> > On Wed, 28 Aug 2019, Dave Hansen wrote:
> >> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
> >>> From: Song Liu <[email protected]>
> >>>
> >>> pti_clone_pmds() assumes that the supplied address is either:
> >>>
> >>> - properly PUD/PMD aligned
> >>> or
> >>> - the address is actually mapped which means that independent
> >>> of the mapping level (PUD/PMD/PTE) the next higher mapping
> >>> exist.
> >>>
> >>> If that's not the case the unaligned address can be incremented by PUD or
> >>> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
> >>> for robustness sake, it's better to handle that case proper and to emit a
> >>> warning.
> >>
> >> Reviewed-by: Dave Hansen <[email protected]>
> >>
> >> Song, did you ever root-cause the performance regression? I thought
> >> there were still some mysteries there.
> >
> > See Peter's series to rework the ftrace code patching ...
>
> Thanks Thomas.
>
> Yes, in summary, enabling ftrace or kprobe-on-ftrace causes the kernel
> to split PMDs in kernel text mapping.
>
> Related question: while Peter's patches fix it for 5.3 kernel, they don't
> apply cleanly over 5.2 kernel (which we are using). So I wonder what is
> the best solution for 5.2 kernel. My patch also fixes the issue:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> How about we apply this patch to upstream 5.2 kernel?

That's not how it works. We fix stuff upstream and it gets backported to
all affected kernels not just to the one you care about.

Aside of that I really disagree with that hack. You completely fail to
explain why that commit in question broke it and instead of fixing the
underlying issue you create a horrible workaround.

It took me ~10 minutes to analyze the root cause and I'm just booting the
test box with a proper fix which can be actually tagged for stable and can
be removed from upstream again once ftrace got moved over to text poke.

I'll post it once it's confirmed to work and I wrote a comprehensible
changelog.

Thanks,

tglx




2019-08-28 20:36:11

by Song Liu

Subject: Re: [patch 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()



> On Aug 28, 2019, at 1:05 PM, Thomas Gleixner <[email protected]> wrote:
>
> On Wed, 28 Aug 2019, Song Liu wrote:
>>> On Aug 28, 2019, at 8:51 AM, Thomas Gleixner <[email protected]> wrote:
>>>
>>> On Wed, 28 Aug 2019, Dave Hansen wrote:
>>>> On 8/28/19 7:24 AM, Thomas Gleixner wrote:
>>>>> From: Song Liu <[email protected]>
>>>>>
>>>>> pti_clone_pmds() assumes that the supplied address is either:
>>>>>
>>>>> - properly PUD/PMD aligned
>>>>> or
>>>>> - the address is actually mapped which means that independent
>>>>> of the mapping level (PUD/PMD/PTE) the next higher mapping
>>>>> exist.
>>>>>
>>>>> If that's not the case the unaligned address can be incremented by PUD or
>>>>> PMD size wrongly. All callers supply mapped and/or aligned addresses, but
>>>>> for robustness sake, it's better to handle that case proper and to emit a
>>>>> warning.
>>>>
>>>> Reviewed-by: Dave Hansen <[email protected]>
>>>>
>>>> Song, did you ever root-cause the performance regression? I thought
>>>> there were still some mysteries there.
>>>
>>> See Peter's series to rework the ftrace code patching ...
>>
>> Thanks Thomas.
>>
>> Yes, in summary, enabling ftrace or kprobe-on-ftrace causes the kernel
>> to split PMDs in kernel text mapping.
>>
>> Related question: while Peter's patches fix it for 5.3 kernel, they don't
>> apply cleanly over 5.2 kernel (which we are using). So I wonder what is
>> the best solution for 5.2 kernel. My patch also fixes the issue:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> How about we apply this patch to upstream 5.2 kernel?
>
> That's not how it works. We fix stuff upstream and it gets backported to
> all affected kernels not just to the one you care about.

Agreed. I am trying to back port Peter's patch set to 5.2 kernel. There
are 9 dependencies and some manual changes.

>
> Aside of that I really disagree with that hack. You completely fail to
> explain why that commit in question broke it and instead of fixing the
> underlying issue you create a horrible workaround.
>
> It took me ~10 minutes to analyze the root cause and I'm just booting the
> test box with a proper fix which can be actually tagged for stable and can
> be removed from upstream again once ftrace got moved over to text poke.
>
> I'll post it once it's confirmed to work and I wrote a comprehensible
> changelog.

This sounds great. Thanks!

Song

2019-08-28 20:57:31

by Thomas Gleixner

Subject: [patch V2 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

From: Song Liu <[email protected]>

pti_clone_pmds() assumes that the supplied address is either:

- properly PUD/PMD aligned
or
- the address is actually mapped which means that independent
of the mapping level (PUD/PMD/PTE) the next higher mapping
exist.

If that's not the case the unaligned address can be incremented by PUD or
PMD size wrongly. All callers supply mapped and/or aligned addresses, but
for robustness sake, it's better to handle that case proper and to emit a
warning.

[ tglx: Rewrote changelog and added WARN_ON_ONCE() ]

Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
V2: Negate P[UM]D_MASK for checking whether the offset part is 0
---
arch/x86/mm/pti.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u

pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ WARN_ON_ONCE(addr & ~PUD_MASK);
+ addr = round_up(addr + 1, PUD_SIZE);
continue;
}

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- addr += PMD_SIZE;
+ WARN_ON_ONCE(addr & ~PMD_MASK);
+ addr = round_up(addr + 1, PMD_SIZE);
continue;
}

2019-08-28 21:53:59

by Thomas Gleixner

Subject: Re: [patch V2 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

On Wed, 28 Aug 2019, Thomas Gleixner wrote:

> From: Song Liu <[email protected]>
>
> pti_clone_pmds() assumes that the supplied address is either:
>
> - properly PUD/PMD aligned
> or
> - the address is actually mapped which means that independent

Bah. I'm a moron. Forgot to fix the spell checker issues.

2019-08-28 21:56:20

by Thomas Gleixner

Subject: [patch V3 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

From: Song Liu <[email protected]>

pti_clone_pmds() assumes that the supplied address is either:

- properly PUD/PMD aligned
or
- the address is actually mapped which means that independently
of the mapping level (PUD/PMD/PTE) the next higher mapping
exists.

If that's not the case the unaligned address can be incremented by PUD or
PMD size incorrectly. All callers supply mapped and/or aligned addresses,
but for the sake of robustness it's better to handle that case properly and
to emit a warning.

[ tglx: Rewrote changelog and added WARN_ON_ONCE() ]

Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
V2: Negate P[UM]D_MASK for checking whether the offset part is 0
V3: Fix changelog
---
arch/x86/mm/pti.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u

pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ WARN_ON_ONCE(addr & ~PUD_MASK);
+ addr = round_up(addr + 1, PUD_SIZE);
continue;
}

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- addr += PMD_SIZE;
+ WARN_ON_ONCE(addr & ~PMD_MASK);
+ addr = round_up(addr + 1, PMD_SIZE);
continue;
}

2019-08-28 22:33:08

by Thomas Gleixner

Subject: [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text

ftrace does not use text_poke() for enabling trace functionality. It uses
its own mechanism and flips the whole kernel text to RW and back to RO.

The CPA rework removed a loop based check of 4k pages which tried to
preserve a large page by checking each 4k page whether the change would
actually cover all pages in the large page.

This resulted in endless loops for nothing as in testing it turned out that
it actually never preserved anything. Of course testing missed to include
ftrace, which is the one and only case which benefitted from the 4k loop.

As a consequence enabling function tracing or ftrace based kprobes results
in a full 4k split of the kernel text, which affects iTLB performance.

The kernel RO protection is the only valid case where this can actually
preserve large pages.

All other static protections (RO data, data NX, PCI, BIOS) are truly
static. So a conflict with those protections which results in a split
should only ever happen when a change of memory next to a protected region
is attempted. But these conflicts are rightfully splitting the large page
to preserve the protected regions. In fact a change to the protected
regions itself is a bug and is warned about.

Add an exception for the static protection check for kernel text RO when
the to be changed region spawns a full large page which allows to preserve
the large mappings. This also prevents the syslog to be spammed about CPA
violations when ftrace is used.

The exception needs to be removed once ftrace switched over to text_poke()
which avoids the whole issue.

Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
Reported-by: Song Liu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
---
arch/x86/mm/pageattr.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)

--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -516,7 +516,7 @@ static inline void check_conflict(int wa
*/
static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
unsigned long pfn, unsigned long npg,
- int warnlvl)
+ unsigned long lpsize, int warnlvl)
{
pgprotval_t forbidden, res;
unsigned long end;
@@ -535,9 +535,17 @@ static inline pgprot_t static_protection
check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
forbidden = res;

- res = protect_kernel_text_ro(start, end);
- check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
- forbidden |= res;
+ /*
+ * Special case to preserve a large page. If the change spawns the
+ * full large page mapping then there is no point to split it
+ * up. Happens with ftrace and is going to be removed once ftrace
+ * switched to text_poke().
+ */
+ if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
+ res = protect_kernel_text_ro(start, end);
+ check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
+ forbidden |= res;
+ }

/* Check the PFN directly */
res = protect_pci_bios(pfn, pfn + npg - 1);
@@ -819,7 +827,7 @@ static int __should_split_large_page(pte
* extra conditional required here.
*/
chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
- CPA_CONFLICT);
+ psize, CPA_CONFLICT);

if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
/*
@@ -855,7 +863,7 @@ static int __should_split_large_page(pte
* protection requirement in the large page.
*/
new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
- CPA_DETECT);
+ psize, CPA_DETECT);

/*
* If there is a conflict, split the large page.
@@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_dat
if (!cpa->force_static_prot)
goto set;

- prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
+ /* Hand in lpsize = 0 to enforce the protection mechanism */
+ prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);

if (pgprot_val(prot) == pgprot_val(ref_prot))
goto set;
@@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa
pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);

cpa_inc_4k_install();
- new_prot = static_protections(new_prot, address, pfn, 1,
+ /* Hand in lpsize = 0 to enforce the protection mechanism */
+ new_prot = static_protections(new_prot, address, pfn, 1, 0,
CPA_PROTECT);

new_prot = pgprot_clear_protnone_bits(new_prot);

2019-08-28 23:05:24

by Song Liu

Subject: Re: [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text


> On Aug 28, 2019, at 3:31 PM, Thomas Gleixner <[email protected]> wrote:
>
> ftrace does not use text_poke() for enabling trace functionality. It uses
> its own mechanism and flips the whole kernel text to RW and back to RO.
>
> The CPA rework removed a loop based check of 4k pages which tried to
> preserve a large page by checking each 4k page whether the change would
> actually cover all pages in the large page.
>
> This resulted in endless loops for nothing as in testing it turned out that
> it actually never preserved anything. Of course testing missed to include
> ftrace, which is the one and only case which benefitted from the 4k loop.
>
> As a consequence enabling function tracing or ftrace based kprobes results
> in a full 4k split of the kernel text, which affects iTLB performance.
>
> The kernel RO protection is the only valid case where this can actually
> preserve large pages.
>
> All other static protections (RO data, data NX, PCI, BIOS) are truly
> static. So a conflict with those protections which results in a split
> should only ever happen when a change of memory next to a protected region
> is attempted. But these conflicts are rightfully splitting the large page
> to preserve the protected regions. In fact a change to the protected
> regions itself is a bug and is warned about.
>
> Add an exception for the static protection check for kernel text RO when
> the to be changed region spawns a full large page which allows to preserve
> the large mappings. This also prevents the syslog to be spammed about CPA
> violations when ftrace is used.
>
> The exception needs to be removed once ftrace switched over to text_poke()
> which avoids the whole issue.
>
> Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
> Reported-by: Song Liu <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> Cc: [email protected]

This looks great. Much cleaner than my workaround.

Thanks!

Reviewed-and-tested-by: Song Liu <[email protected]>

We need this for v4.20 to v5.3 (assuming Peter's patches will land in 5.4).


> ---
> arch/x86/mm/pageattr.c | 26 ++++++++++++++++++--------
> 1 file changed, 18 insertions(+), 8 deletions(-)
>
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -516,7 +516,7 @@ static inline void check_conflict(int wa
> */
> static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
> unsigned long pfn, unsigned long npg,
> - int warnlvl)
> + unsigned long lpsize, int warnlvl)
> {
> pgprotval_t forbidden, res;
> unsigned long end;
> @@ -535,9 +535,17 @@ static inline pgprot_t static_protection
> check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
> forbidden = res;
>
> - res = protect_kernel_text_ro(start, end);
> - check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> - forbidden |= res;
> + /*
> + * Special case to preserve a large page. If the change spawns the
> + * full large page mapping then there is no point to split it
> + * up. Happens with ftrace and is going to be removed once ftrace
> + * switched to text_poke().
> + */
> + if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
> + res = protect_kernel_text_ro(start, end);
> + check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> + forbidden |= res;
> + }
>
> /* Check the PFN directly */
> res = protect_pci_bios(pfn, pfn + npg - 1);
> @@ -819,7 +827,7 @@ static int __should_split_large_page(pte
> * extra conditional required here.
> */
> chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
> - CPA_CONFLICT);
> + psize, CPA_CONFLICT);
>
> if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
> /*
> @@ -855,7 +863,7 @@ static int __should_split_large_page(pte
> * protection requirement in the large page.
> */
> new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
> - CPA_DETECT);
> + psize, CPA_DETECT);
>
> /*
> * If there is a conflict, split the large page.
> @@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_dat
> if (!cpa->force_static_prot)
> goto set;
>
> - prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
> + /* Hand in lpsize = 0 to enforce the protection mechanism */
> + prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);
>
> if (pgprot_val(prot) == pgprot_val(ref_prot))
> goto set;
> @@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa
> pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
>
> cpa_inc_4k_install();
> - new_prot = static_protections(new_prot, address, pfn, 1,
> + /* Hand in lpsize = 0 to enforce the protection mechanism */
> + new_prot = static_protections(new_prot, address, pfn, 1, 0,
> CPA_PROTECT);
>
> new_prot = pgprot_clear_protnone_bits(new_prot);

2019-08-28 23:25:17

by Ingo Molnar

Subject: Re: [patch V3 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()


* Thomas Gleixner <[email protected]> wrote:

> From: Song Liu <[email protected]>
>
> pti_clone_pmds() assumes that the supplied address is either:
>
> - properly PUD/PMD aligned
> or
> - the address is actually mapped which means that independently
> of the mapping level (PUD/PMD/PTE) the next higher mapping
> exists.
>
> If that's not the case the unaligned address can be incremented by PUD or
> PMD size incorrectly. All callers supply mapped and/or aligned addresses,
> but for the sake of robustness it's better to handle that case properly and
> to emit a warning.
>
> [ tglx: Rewrote changelog and added WARN_ON_ONCE() ]
>
> Signed-off-by: Song Liu <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
> ---
> V2: Negate P[UM]D_MASK for checking whether the offset part is 0
> V3: Fix changelog
> ---
> arch/x86/mm/pti.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + WARN_ON_ONCE(addr & ~PUD_MASK);
> + addr = round_up(addr + 1, PUD_SIZE);
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> - addr += PMD_SIZE;
> + WARN_ON_ONCE(addr & ~PMD_MASK);
> + addr = round_up(addr + 1, PMD_SIZE);
> continue;
> }

Reviewed-by: Ingo Molnar <[email protected]>

Thanks,

Ingo

2019-08-29 13:03:32

by Peter Zijlstra

Subject: Re: [PATCH] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text

On Thu, Aug 29, 2019 at 12:31:34AM +0200, Thomas Gleixner wrote:
> arch/x86/mm/pageattr.c | 26 ++++++++++++++++++--------
> 1 file changed, 18 insertions(+), 8 deletions(-)
>
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -516,7 +516,7 @@ static inline void check_conflict(int wa
> */
> static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
> unsigned long pfn, unsigned long npg,
> - int warnlvl)
> + unsigned long lpsize, int warnlvl)
> {
> pgprotval_t forbidden, res;
> unsigned long end;
> @@ -535,9 +535,17 @@ static inline pgprot_t static_protection
> check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
> forbidden = res;
>
> - res = protect_kernel_text_ro(start, end);
> - check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> - forbidden |= res;
> + /*
> + * Special case to preserve a large page. If the change spawns the
> + * full large page mapping then there is no point to split it
> + * up. Happens with ftrace and is going to be removed once ftrace
> + * switched to text_poke().
> + */
> + if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
> + res = protect_kernel_text_ro(start, end);
> + check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
> + forbidden |= res;
> + }

Right, so this allows the RW (doesn't enforce RO) and thereby doesn't
force split, when it is a whole large page.

>
> /* Check the PFN directly */
> res = protect_pci_bios(pfn, pfn + npg - 1);
> @@ -819,7 +827,7 @@ static int __should_split_large_page(pte
> * extra conditional required here.
> */
> chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
> - CPA_CONFLICT);
> + psize, CPA_CONFLICT);
>
> if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
> /*
> @@ -855,7 +863,7 @@ static int __should_split_large_page(pte
> * protection requirement in the large page.
> */
> new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
> - CPA_DETECT);
> + psize, CPA_DETECT);
>
> /*
> * If there is a conflict, split the large page.

And these are the callsites in __should_split_large_page(), and you
provide psize, and therefore we allow RW to preserve the large pages on
the kernel text.

> @@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_dat
> if (!cpa->force_static_prot)
> goto set;
>
> - prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
> + /* Hand in lpsize = 0 to enforce the protection mechanism */
> + prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);

This is when we've already decided to split, in which case we might as
well enforce the normal rules, and .lpsize=0 does just that.

>
> if (pgprot_val(prot) == pgprot_val(ref_prot))
> goto set;
> @@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa
> pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);
>
> cpa_inc_4k_install();
> - new_prot = static_protections(new_prot, address, pfn, 1,
> + /* Hand in lpsize = 0 to enforce the protection mechanism */
> + new_prot = static_protections(new_prot, address, pfn, 1, 0,
> CPA_PROTECT);

And here we check the protections of a single 4k page, in which case
large pages are irrelevant and again .lpsize=0 disables the new code.

>
> new_prot = pgprot_clear_protnone_bits(new_prot);


That all seems OK I suppose.

Acked-by: Peter Zijlstra (Intel) <[email protected]>
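
The condition Peter walks through above can be condensed into a small sketch (2MB large page and hypothetical values assumed): the Text RO restriction is skipped only when the change covers exactly one whole, properly aligned large page; any partial or misaligned change, and the lpsize = 0 callers, still go through protect_kernel_text_ro().

/* Sketch only: userspace reimplementation of the lpsize condition. */
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL
#define LPSIZE    (512 * PAGE_SIZE) /* 2MB large page */

static bool enforces_text_ro(unsigned long start, unsigned long npg,
			     unsigned long lpsize)
{
	/* Same condition as in static_protections() above. */
	return lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1));
}

int main(void)
{
	/* Whole, aligned 2MB page: RO check skipped, large page preserved. */
	printf("%d\n", enforces_text_ro(0x200000, 512, LPSIZE)); /* 0 */

	/* Partial change inside the large page: RO check still applies. */
	printf("%d\n", enforces_text_ro(0x201000, 1, LPSIZE));   /* 1 */

	/* lpsize == 0 (split and 4k install paths): always enforced. */
	printf("%d\n", enforces_text_ro(0x200000, 512, 0));      /* 1 */
	return 0;
}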

Subject: [tip: x86/urgent] x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 7af0145067bc429a09ac4047b167c0971c9f0dc7
Gitweb: https://git.kernel.org/tip/7af0145067bc429a09ac4047b167c0971c9f0dc7
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 29 Aug 2019 00:31:34 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Thu, 29 Aug 2019 20:48:44 +02:00

x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text

ftrace does not use text_poke() for enabling trace functionality. It uses
its own mechanism and flips the whole kernel text to RW and back to RO.

The CPA rework removed a loop based check of 4k pages which tried to
preserve a large page by checking each 4k page whether the change would
actually cover all pages in the large page.

This resulted in endless loops for nothing as in testing it turned out that
it actually never preserved anything. Of course testing missed to include
ftrace, which is the one and only case which benefitted from the 4k loop.

As a consequence enabling function tracing or ftrace based kprobes results
in a full 4k split of the kernel text, which affects iTLB performance.

The kernel RO protection is the only valid case where this can actually
preserve large pages.

All other static protections (RO data, data NX, PCI, BIOS) are truly
static. So a conflict with those protections which results in a split
should only ever happen when a change of memory next to a protected region
is attempted. But these conflicts are rightfully splitting the large page
to preserve the protected regions. In fact a change to the protected
regions itself is a bug and is warned about.

Add an exception for the static protection check for kernel text RO when
the to be changed region spawns a full large page which allows to preserve
the large mappings. This also prevents the syslog to be spammed about CPA
violations when ftrace is used.

The exception needs to be removed once ftrace switched over to text_poke()
which avoids the whole issue.

Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
Reported-by: Song Liu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Song Liu <[email protected]>
Reviewed-by: Song Liu <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/mm/pageattr.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a9a77a..e14e95e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -516,7 +516,7 @@ static inline void check_conflict(int warnlvl, pgprot_t prot, pgprotval_t val,
*/
static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
unsigned long pfn, unsigned long npg,
- int warnlvl)
+ unsigned long lpsize, int warnlvl)
{
pgprotval_t forbidden, res;
unsigned long end;
@@ -535,9 +535,17 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
forbidden = res;

- res = protect_kernel_text_ro(start, end);
- check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
- forbidden |= res;
+ /*
+ * Special case to preserve a large page. If the change spawns the
+ * full large page mapping then there is no point to split it
+ * up. Happens with ftrace and is going to be removed once ftrace
+ * switched to text_poke().
+ */
+ if (lpsize != (npg * PAGE_SIZE) || (start & (lpsize - 1))) {
+ res = protect_kernel_text_ro(start, end);
+ check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
+ forbidden |= res;
+ }

/* Check the PFN directly */
res = protect_pci_bios(pfn, pfn + npg - 1);
@@ -819,7 +827,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
* extra conditional required here.
*/
chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
- CPA_CONFLICT);
+ psize, CPA_CONFLICT);

if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
/*
@@ -855,7 +863,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
* protection requirement in the large page.
*/
new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
- CPA_DETECT);
+ psize, CPA_DETECT);

/*
* If there is a conflict, split the large page.
@@ -906,7 +914,8 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn,
if (!cpa->force_static_prot)
goto set;

- prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
+ /* Hand in lpsize = 0 to enforce the protection mechanism */
+ prot = static_protections(ref_prot, address, pfn, npg, 0, CPA_PROTECT);

if (pgprot_val(prot) == pgprot_val(ref_prot))
goto set;
@@ -1503,7 +1512,8 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);

cpa_inc_4k_install();
- new_prot = static_protections(new_prot, address, pfn, 1,
+ /* Hand in lpsize = 0 to enforce the protection mechanism */
+ new_prot = static_protections(new_prot, address, pfn, 1, 0,
CPA_PROTECT);

new_prot = pgprot_clear_protnone_bits(new_prot);

Subject: [tip: x86/pti] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

The following commit has been merged into the x86/pti branch of tip:

Commit-ID: 825d0b73cd7526b0bb186798583fae810091cbac
Gitweb: https://git.kernel.org/tip/825d0b73cd7526b0bb186798583fae810091cbac
Author: Song Liu <[email protected]>
AuthorDate: Wed, 28 Aug 2019 23:54:55 +02:00
Committer: Thomas Gleixner <[email protected]>
CommitterDate: Thu, 29 Aug 2019 20:52:52 +02:00

x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

pti_clone_pmds() assumes that the supplied address is either:

- properly PUD/PMD aligned
or
- the address is actually mapped which means that independently
of the mapping level (PUD/PMD/PTE) the next higher mapping
exists.

If that's not the case the unaligned address can be incremented by PUD or
PMD size incorrectly. All callers supply mapped and/or aligned addresses,
but for the sake of robustness it's better to handle that case properly and
to emit a warning.

[ tglx: Rewrote changelog and added WARN_ON_ONCE() ]

Signed-off-by: Song Liu <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/mm/pti.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524..a24487b 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ WARN_ON_ONCE(addr & ~PUD_MASK);
+ addr = round_up(addr + 1, PUD_SIZE);
continue;
}

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- addr += PMD_SIZE;
+ WARN_ON_ONCE(addr & ~PMD_MASK);
+ addr = round_up(addr + 1, PMD_SIZE);
continue;
}

2019-08-30 10:25:54

by Jörg Rödel

Subject: Re: [patch V3 1/2] x86/mm/pti: Handle unaligned address gracefully in pti_clone_pagetable()

On Wed, Aug 28, 2019 at 11:54:55PM +0200, Thomas Gleixner wrote:
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,13 +330,15 @@ pti_clone_pgtable(unsigned long start, u
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + WARN_ON_ONCE(addr & ~PUD_MASK);
> + addr = round_up(addr + 1, PUD_SIZE);
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> - addr += PMD_SIZE;
> + WARN_ON_ONCE(addr & ~PMD_MASK);
> + addr = round_up(addr + 1, PMD_SIZE);
> continue;
> }
>

Reviewed-by: Joerg Roedel <[email protected]>