2019-08-20 07:53:03

by Song Liu

Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
This is not accurate because addr may not be PUD_SIZE aligned.

In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
of this issue, including PMD for the irq entry table. For a memcache
like workload, this introduces about 4.5x more iTLB-load and about 2.5x
more iTLB-load-misses on a Skylake CPU.

This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
case.

Cc: [email protected] # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/x86/mm/pti.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ addr += PMD_SIZE;
continue;
}

--
2.17.1
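
A minimal userspace sketch of the arithmetic behind the bug, assuming the
x86_64 values of PMD_SIZE and PUD_SIZE; the macros and the start address
below are local stand-ins, not taken from the kernel headers:

#include <stdio.h>

#define PMD_SIZE (1UL << 21)	/* 2 MiB, x86_64 value */
#define PUD_SIZE (1UL << 30)	/* 1 GiB, x86_64 value */

int main(void)
{
	/* PMD aligned but not PUD aligned, like the irq entry table address. */
	unsigned long addr = 0xffffffff81a00000UL;

	/* What the buggy pud_none() path does: jump PUD_SIZE from an
	 * unaligned address, overshooting the next PUD boundary. */
	unsigned long buggy = addr + PUD_SIZE;

	/* Where the walk should resume: the next PUD boundary. */
	unsigned long next_pud = (addr & ~(PUD_SIZE - 1)) + PUD_SIZE;

	printf("buggy skip lands at  %#lx\n", buggy);
	printf("next PUD boundary is %#lx\n", next_pud);
	printf("PMDs never visited:  %lu\n", (buggy - next_pud) / PMD_SIZE);
	return 0;
}

Everything between the next PUD boundary and the overshoot (13 PMDs' worth
of address space in this example) is never visited, even though that
neighbouring PUD may be populated.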


2019-08-20 09:14:02

by Thomas Gleixner

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On Tue, 20 Aug 2019, Song Liu wrote:

> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.

You fail to explain how this happened. The code before the 32bit support
did always increase by PMD_SIZE. The 32bit support broke that.

> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.

This information is largely irrelevant. What matters is the fact that this
got broken and incorrectly forwards the address by PUD_SIZE which is wrong
if address is not PUD_SIZE aligned.

> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.

git grep 'This patch' Documentation/process/submitting-patches.rst

> Cc: [email protected] # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Signed-off-by: Song Liu <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> arch/x86/mm/pti.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;

The right fix is to skip forward to the next PUD boundary instead of doing
this in a loop with PMD_SIZE increments.

Thanks,

tglx

2019-08-20 10:03:27

by Peter Zijlstra

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.
>
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
>
> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.

> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
> continue;
> }

I'm thinking you're right in that there's a bug here, but I'm also
thinking your patch is both incomplete and broken.

What that code wants to do is skip to the end of the pud, a pmd_size
increase will not do that. And right below this, there's a second
instance of this exact pattern.

Did I get the below right?

---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
+ addr &= PUD_MASK;
addr += PUD_SIZE;
continue;
}

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
+ addr &= PMD_MASK;
addr += PMD_SIZE;
continue;
}
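
A quick userspace check of the masking variant above; PUD_SIZE and PUD_MASK
are defined locally with their x86_64 values and the addresses are only
illustrative:

#include <stdio.h>

#define PUD_SIZE (1UL << 30)		/* 1 GiB, x86_64 value */
#define PUD_MASK (~(PUD_SIZE - 1))

/* Mirrors the proposed "addr &= PUD_MASK; addr += PUD_SIZE;" sequence. */
static unsigned long skip_pud(unsigned long addr)
{
	addr &= PUD_MASK;	/* round down to the current PUD boundary */
	addr += PUD_SIZE;	/* then step to the next one */
	return addr;
}

int main(void)
{
	/* Unaligned: lands exactly on the next PUD boundary. */
	printf("%#lx -> %#lx\n", 0xffffffff81a00000UL,
	       skip_pud(0xffffffff81a00000UL));
	/* Already aligned: still advances by a full PUD, so the loop makes
	 * progress either way. */
	printf("%#lx -> %#lx\n", 0xffffffff80000000UL,
	       skip_pud(0xffffffff80000000UL));
	return 0;
}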

2019-08-20 11:18:32

by Thomas Gleixner

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?
>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> + addr &= PUD_MASK;
> addr += PUD_SIZE;

round_up(addr, PUD_SIZE);

perhaps?

> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> + addr &= PMD_MASK;
> addr += PMD_SIZE;
> continue;
> }
>

2019-08-20 13:18:58

by Song Liu

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE



> On Aug 20, 2019, at 2:12 AM, Thomas Gleixner <[email protected]> wrote:
>
> On Tue, 20 Aug 2019, Song Liu wrote:
>
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>
> You fail to explain how this happened. The code before the 32bit support
> did always increase by PMD_SIZE. The 32bit support broke that.

Will fix.

>
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> This information is largely irrelevant. What matters is the fact that this
> got broken and incorrectly forwards the address by PUD_SIZE which is wrong
> if address is not PUD_SIZE aligned.

We started looking into this because we could not explain the regression in
the iTLB miss rate. I guess the patch itself explains it pretty well, so the
original issue doesn't matter that much?

I will remove this part.

>
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
> git grep 'This patch' Documentation/process/submitting-patches.rst

Will fix.

>> Cc: [email protected] # v4.19+
>> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
>> Signed-off-by: Song Liu <[email protected]>
>> Cc: Joerg Roedel <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> ---
>> arch/x86/mm/pti.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> - addr += PUD_SIZE;
>> + addr += PMD_SIZE;
>
> The right fix is to skip forward to the next PUD boundary instead of doing
> this in a loop with PMD_SIZE increments.

Agreed.

Thanks,
Song

2019-08-20 13:21:00

by Song Liu

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE



> On Aug 20, 2019, at 3:00 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>>
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>>
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> - addr += PUD_SIZE;
>> + addr += PMD_SIZE;
>> continue;
>> }
>
> I'm thinking you're right in that there's a bug here, but I'm also
> thinking your patch is both incomplete and broken.
>
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?

Yes, you are right.

Thanks,
Song

>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> + addr &= PUD_MASK;
> addr += PUD_SIZE;
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> + addr &= PMD_MASK;
> addr += PMD_SIZE;
> continue;
> }

2019-08-20 13:23:03

by Song Liu

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE



> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]> wrote:
>
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>> What that code wants to do is skip to the end of the pud, a pmd_size
>> increase will not do that. And right below this, there's a second
>> instance of this exact pattern.
>>
>> Did I get the below right?
>>
>> ---
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..32b20b3cb227 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> + addr &= PUD_MASK;
>> addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);

I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".

Thanks,
Song

>
> perhaps?
>
>> continue;
>> }
>>
>> pmd = pmd_offset(pud, addr);
>> if (pmd_none(*pmd)) {
>> + addr &= PMD_MASK;
>> addr += PMD_SIZE;
>> continue;
>> }
>>

2019-08-20 13:24:27

by Rik van Riel

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On Tue, 2019-08-20 at 13:16 +0200, Thomas Gleixner wrote:
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > What that code wants to do is skip to the end of the pud, a
> > pmd_size
> > increase will not do that. And right below this, there's a second
> > instance of this exact pattern.
> >
> > Did I get the below right?
> >
> > ---
> > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > index b196524759ec..32b20b3cb227 100644
> > --- a/arch/x86/mm/pti.c
> > +++ b/arch/x86/mm/pti.c
> > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
> > unsigned long end,
> >
> > pud = pud_offset(p4d, addr);
> > if (pud_none(*pud)) {
> > + addr &= PUD_MASK;
> > addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);
>
> perhaps?

Won't that keep returning the same address forever
if addr & PUD_MASK == 0?
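
In other words, when addr is already PUD_SIZE aligned, round_up(addr,
PUD_SIZE) returns addr itself and the loop makes no progress. A tiny
userspace sketch, with ROUND_UP as a local stand-in for the kernel's
power-of-two round_up() and the x86_64 PUD_SIZE:

#include <stdio.h>

#define PUD_SIZE       (1UL << 30)			/* 1 GiB, x86_64 value */
#define ROUND_UP(x, y) ((((x) - 1) | ((y) - 1)) + 1)	/* power-of-two only */

int main(void)
{
	unsigned long aligned = 0xffffffff80000000UL;	/* already on a PUD boundary */

	/* round_up() of an already-aligned address is the address itself, so
	 * "addr = round_up(addr, PUD_SIZE)" would never advance past it. */
	printf("%#lx -> %#lx\n", aligned, ROUND_UP(aligned, PUD_SIZE));
	return 0;
}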

2019-08-20 13:41:07

by Thomas Gleixner

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On Tue, 20 Aug 2019, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]> wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> >> What that code wants to do is skip to the end of the pud, a pmd_size
> >> increase will not do that. And right below this, there's a second
> >> instance of this exact pattern.
> >>
> >> Did I get the below right?
> >>
> >> ---
> >> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> >> index b196524759ec..32b20b3cb227 100644
> >> --- a/arch/x86/mm/pti.c
> >> +++ b/arch/x86/mm/pti.c
> >> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >>
> >> pud = pud_offset(p4d, addr);
> >> if (pud_none(*pud)) {
> >> + addr &= PUD_MASK;
> >> addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".

Right you are.

2019-08-20 13:58:38

by Rik van Riel

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]>
> > wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > > What that code wants to do is skip to the end of the pud, a
> > > pmd_size
> > > increase will not do that. And right below this, there's a second
> > > instance of this exact pattern.
> > >
> > > Did I get the below right?
> > >
> > > ---
> > > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > > index b196524759ec..32b20b3cb227 100644
> > > --- a/arch/x86/mm/pti.c
> > > +++ b/arch/x86/mm/pti.c
> > > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
> > > unsigned long end,
> > >
> > > pud = pud_offset(p4d, addr);
> > > if (pud_none(*pud)) {
> > > + addr &= PUD_MASK;
> > > addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".

What does that do if start is less than PMD_SIZE
away from the next PUD_SIZE boundary?

How about: round_up(addr + 1, PUD_SIZE) ?
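
A small sketch comparing the two suggestions; ROUND_UP is again a local
stand-in for the kernel's power-of-two round_up(), and the start address is
an arbitrary page-aligned example sitting less than PMD_SIZE below a PUD
boundary:

#include <stdio.h>

#define PMD_SIZE       (1UL << 21)			/* 2 MiB, x86_64 value */
#define PUD_SIZE       (1UL << 30)			/* 1 GiB, x86_64 value */
#define ROUND_UP(x, y) ((((x) - 1) | ((y) - 1)) + 1)	/* power-of-two only */

int main(void)
{
	/* 0xff000 bytes short of the PUD boundary at 0xffffffff80000000. */
	unsigned long addr = 0xffffffff7ff01000UL;

	/* addr + PMD_SIZE already crosses that boundary, so this rounds up to
	 * the boundary *after* it and the whole intervening PUD is skipped. */
	printf("round_up(addr + PMD_SIZE, PUD_SIZE) = %#lx\n",
	       ROUND_UP(addr + PMD_SIZE, PUD_SIZE));

	/* addr + 1 only crosses the boundary when addr is already on it, so
	 * this always lands on the next boundary and always makes progress. */
	printf("round_up(addr + 1, PUD_SIZE)        = %#lx\n",
	       ROUND_UP(addr + 1, PUD_SIZE));
	return 0;
}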

2019-08-20 13:59:12

by Dave Hansen

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On 8/20/19 12:51 AM, Song Liu wrote:
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.

I was surprised that this manifests as a performance issue. Usually
messing up PTI page table manipulation means you get to experience the
joys of debugging triple faults. But, it makes sense if it's this line:

/*
* Note that this will undo _some_ of the work that
* pti_set_kernel_image_nonglobal() did to clear the
* global bit.
*/
pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);

which is restoring the Global bit.

*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
and shouldn't have a global kernel image. Could you confirm whether
PCIDs are supported on this CPU?

> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
> continue;
> }

Did we also bugger up this code:

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
addr += PMD_SIZE;
continue;
}

if we're on 32-bit and this:

#define PTI_LEVEL_KERNEL_IMAGE PTI_CLONE_PTE

and we get a hole walking to a non-PMD-aligned address?

2019-08-20 14:03:11

by Song Liu

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE



> On Aug 20, 2019, at 6:55 AM, Rik van Riel <[email protected]> wrote:
>
> On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
>>> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]>
>>> wrote:
>>>
>>> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>>>> What that code wants to do is skip to the end of the pud, a
>>>> pmd_size
>>>> increase will not do that. And right below this, there's a second
>>>> instance of this exact pattern.
>>>>
>>>> Did I get the below right?
>>>>
>>>> ---
>>>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>>>> index b196524759ec..32b20b3cb227 100644
>>>> --- a/arch/x86/mm/pti.c
>>>> +++ b/arch/x86/mm/pti.c
>>>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start,
>>>> unsigned long end,
>>>>
>>>> pud = pud_offset(p4d, addr);
>>>> if (pud_none(*pud)) {
>>>> + addr &= PUD_MASK;
>>>> addr += PUD_SIZE;
>>>
>>> round_up(addr, PUD_SIZE);
>>
>> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
>
> What does that do if start is less than PMD_SIZE
> away from the next PUD_SIZE boundary?

Great point!

>
> How about: round_up(addr + 1, PUD_SIZE) ?

Yes. How about this?

=========================== 8< ============================

From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
From: Song Liu <[email protected]>
Date: Mon, 19 Aug 2019 23:59:47 -0700
Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly

Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
This behavior changes after the 32-bit support: pti_clone_pgtable()
increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
addr may not be PUD_SIZE/PMD_SIZE aligned.

Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
in these two cases.

Cc: [email protected] # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/x86/mm/pti.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..1337494e22ef 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,

pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ addr = round_up(addr + 1, PUD_SIZE);
continue;
}

pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- addr += PMD_SIZE;
+ addr = round_up(addr + 1, PMD_SIZE);
continue;
}

--
2.17.1


2019-08-20 14:17:25

by Song Liu

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE



> On Aug 20, 2019, at 6:57 AM, Dave Hansen <[email protected]> wrote:
>
> On 8/20/19 12:51 AM, Song Liu wrote:
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> I was surprised that this manifests as a performance issue. Usually
> messing up PTI page table manipulation means you get to experience the
> jobs of debugging triple faults. But, it makes sense if its this line:
>
> /*
> * Note that this will undo _some_ of the work that
> * pti_set_kernel_image_nonglobal() did to clear the
> * global bit.
> */
> pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
>
> which is restoring the Global bit.
>
> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
> and shouldn't have a global kernel image. Could you confirm whether
> PCIDs are supported on this CPU?

Yes, pcid is listed in /proc/cpuinfo.

Thanks,
Song

2019-08-20 14:19:19

by Dave Hansen

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On 8/20/19 7:14 AM, Song Liu wrote:
>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>> and shouldn't have a global kernel image. Could you confirm whether
>> PCIDs are supported on this CPU?
> Yes, pcid is listed in /proc/cpuinfo.

So what's going on? Could you confirm exactly which pti_clone_pgtable()
is causing you problems? Do you have a theory as to why this manifests
as a performance problem rather than a functional one?

A diff of these:

/sys/kernel/debug/page_tables/current_user
/sys/kernel/debug/page_tables/current_kernel

before and after your patch might be helpful.

2019-08-20 16:07:24

by Song Liu

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE



> On Aug 20, 2019, at 7:18 AM, Dave Hansen <[email protected]> wrote:
>
> On 8/20/19 7:14 AM, Song Liu wrote:
>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>> and shouldn't have a global kernel image. Could you confirm whether
>>> PCIDs are supported on this CPU?
>> Yes, pcid is listed in /proc/cpuinfo.
>
> So what's going on? Could you confirm exactly which pti_clone_pgtable()
> is causing you problems? Do you have a theory as to why this manifests
> as a performance problem rather than a functional one?
>
> A diff of these:
>
> /sys/kernel/debug/page_tables/current_user
> /sys/kernel/debug/page_tables/current_kernel
>
> before and after your patch might be helpful.

I believe the difference is from the following entries (7 PMDs)

Before the patch:

current_kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
efi: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte


After the patch:

current_kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
efi: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd

current_kernel and kernel show the same data, though.

Thanks,
Song

2019-08-20 16:39:55

by Song Liu

Subject: Re: [PATCH] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE



> On Aug 20, 2019, at 9:05 AM, Song Liu <[email protected]> wrote:
>
>
>
>> On Aug 20, 2019, at 7:18 AM, Dave Hansen <[email protected]> wrote:
>>
>> On 8/20/19 7:14 AM, Song Liu wrote:
>>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>>> and shouldn't have a global kernel image. Could you confirm whether
>>>> PCIDs are supported on this CPU?
>>> Yes, pcid is listed in /proc/cpuinfo.
>>
>> So what's going on? Could you confirm exactly which pti_clone_pgtable()
>> is causing you problems? Do you have a theory as to why this manifests
>> as a performance problem rather than a functional one?
>>
>> A diff of these:
>>
>> /sys/kernel/debug/page_tables/current_user
>> /sys/kernel/debug/page_tables/current_kernel
>>
>> before and after your patch might be helpful.
>
> I believe the difference is from the following entries (7 PMDs)
>
> Before the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
> efi: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
> kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
>
>
> After the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> efi: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
>
> current_kernel and kernel show the same data, though.

A little more detail on how I got here.

We use huge pages for hot text, which reduces iTLB misses. As we
benchmarked the 5.2 based kernel (vs. the 4.16 based one), we found
~2.5x more iTLB misses.

To figure out the issue, I used a debug patch that dumps the page table
for a pid. The following is information from the workload pid.


For the 4.16 based kernel:

host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
0xffffffff81a00000-0xffffffff81c00000 2M ro PSE x pmd


For the 5.2 based kernel before this patch:

host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd


The 8MB of text in pmds is from user space. The 4.16 kernel has 1 pmd for
the irq entry table, while the 5.2 kernel doesn't have it.


For the 5.2 based kernel after this patch:

host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd


So after this patch, the 5.2 based kernel has 7 PMDs, compared to the 1 PMD
in the 4.16 kernel. This further reduces the iTLB miss rate.

Thanks,
Song

2019-08-20 16:57:53

by Rik van Riel

Subject: Re: [PATCH v2] x86/mm/pti: in pti_clone_pgtable() don't increase addr by PUD_SIZE

On Tue, 2019-08-20 at 10:00 -0400, Song Liu wrote:
>
> From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00
> 2001
> From: Song Liu <[email protected]>
> Date: Mon, 19 Aug 2019 23:59:47 -0700
> Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr
> properly
>
> Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> This behavior changes after the 32-bit support: pti_clone_pgtable()
> increases addr by PUD_SIZE for pud_none(*pud) case, and increases
> addr by
> PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate
> because
> addr may not be PUD_SIZE/PMD_SIZE aligned.
>
> Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> in these two cases.
>
> Cc: [email protected] # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for
> 32 bit")
> Signed-off-by: Song Liu <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>

This looks like it should do the trick!

Reviewed-by: Rik van Riel <[email protected]>

--
All Rights Reversed.

