pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
This is not accurate because addr may not be PUD_SIZE aligned.
In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
of this issue, including PMD for the irq entry table. For a memcache
like workload, this introduces about 4.5x more iTLB-load and about 2.5x
more iTLB-load-misses on a Skylake CPU.
This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
case.
Cc: [email protected] # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/x86/mm/pti.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..5a67c3015f59 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ addr += PMD_SIZE;
continue;
}
--
2.17.1
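To make the failure mode concrete: a minimal userspace sketch (not kernel
code; the start address is hypothetical, sizes as on x86_64 with 4-level
paging) of how the unaligned advance overshoots when the current PUD is
empty:

#include <stdio.h>

#define PMD_SIZE (1UL << 21)	/* 2 MiB on x86_64 */
#define PUD_SIZE (1UL << 30)	/* 1 GiB on x86_64 */

int main(void)
{
	/* Hypothetical start address, 16 MiB past a PUD boundary. */
	unsigned long addr = 0xffffffff81000000UL;

	/* Buggy advance: a full PUD_SIZE from the unaligned addr... */
	unsigned long buggy = addr + PUD_SIZE;

	/* ...but the intent is to skip only to the next PUD boundary. */
	unsigned long intended = (addr & ~(PUD_SIZE - 1)) + PUD_SIZE;

	/* The 16 MiB in between (8 PMDs' worth) is silently skipped. */
	printf("buggy:    %#lx\n", buggy);	/* 0xffffffffc1000000 */
	printf("intended: %#lx\n", intended);	/* 0xffffffffc0000000 */
	return 0;
}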
On Tue, 20 Aug 2019, Song Liu wrote:
> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.
You fail to explain how this happened. The code before the 32bit support
did always increase by PMD_SIZE. The 32bit support broke that.
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
This information is largely irrelevant. What matters is the fact that this
got broken and incorrectly forwards the address by PUD_SIZE which is wrong
if address is not PUD_SIZE aligned.
> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.
git grep 'This patch' Documentation/process/submitting-patches.rst
> Cc: [email protected] # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Signed-off-by: Song Liu <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> arch/x86/mm/pti.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
The right fix is to skip forward to the next PUD boundary instead of doing
this in a loop with PMD_SIZE increments.
Thanks,
tglx
On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
> This is not accurate because addr may not be PUD_SIZE aligned.
>
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issue, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
>
> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
> case.
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..5a67c3015f59 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
> continue;
> }
I'm thinking you're right in that there's a bug here, but I'm also
thinking your patch is both incomplete and broken.
What that code wants to do is skip to the end of the pud, a pmd_size
increase will not do that. And right below this, there's a second
instance of this exact pattern.
Did I get the below right?
---
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..32b20b3cb227 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
+ addr &= PUD_MASK;
addr += PUD_SIZE;
continue;
}
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
+ addr &= PMD_MASK;
addr += PMD_SIZE;
continue;
}
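The mask-then-add pattern above always lands exactly on the next PUD
boundary whether or not addr is aligned. A quick userspace check of the
arithmetic (illustrative only; the helper name is invented):

#include <assert.h>

#define PUD_SIZE (1UL << 30)
#define PUD_MASK (~(PUD_SIZE - 1))

/* Mirrors the two-line pattern in the diff above. */
static unsigned long next_pud_boundary(unsigned long addr)
{
	addr &= PUD_MASK;	/* round down to the current PUD boundary */
	addr += PUD_SIZE;	/* then step over the whole (empty) PUD */
	return addr;
}

int main(void)
{
	/* Unaligned and aligned inputs both advance to the next boundary. */
	assert(next_pud_boundary(0xffffffff81000000UL) == 0xffffffffc0000000UL);
	assert(next_pud_boundary(0xffffffff80000000UL) == 0xffffffffc0000000UL);
	return 0;
}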
On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?
>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> + addr &= PUD_MASK;
> addr += PUD_SIZE;
round_up(addr, PUD_SIZE);
perhaps?
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> + addr &= PMD_MASK;
> addr += PMD_SIZE;
> continue;
> }
>
> On Aug 20, 2019, at 2:12 AM, Thomas Gleixner <[email protected]> wrote:
>
> On Tue, 20 Aug 2019, Song Liu wrote:
>
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>
> You fail to explain how this happened. The code before the 32bit support
> did always increase by PMD_SIZE. The 32bit support broke that.
Will fix.
>
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> This information is largely irrelevant. What matters is the fact that this
> got broken and incorrectly forwards the address by PUD_SIZE which is wrong
> if address is not PUD_SIZE aligned.
We started looking into this because we could not explain the regression in
the iTLB miss rate. I guess the patch itself explains it pretty well, so the
original issue doesn't matter that much?
I will remove this part.
>
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
> git grep 'This patch' Documentation/process/submitting-patches.rst
Will fix.
>> Cc: [email protected] # v4.19+
>> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
>> Signed-off-by: Song Liu <[email protected]>
>> Cc: Joerg Roedel <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> ---
>> arch/x86/mm/pti.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> - addr += PUD_SIZE;
>> + addr += PMD_SIZE;
>
> The right fix is to skip forward to the next PUD boundary instead of doing
> this in a loop with PMD_SIZE increments.
Agreed.
Thanks,
Song
> On Aug 20, 2019, at 3:00 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Aug 20, 2019 at 12:51:28AM -0700, Song Liu wrote:
>> pti_clone_pgtable() increases addr by PUD_SIZE for pud_none(*pud) case.
>> This is not accurate because addr may not be PUD_SIZE aligned.
>>
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issue, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>>
>> This patch fixes this issue by adding PMD_SIZE to addr for pud_none()
>> case.
>
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..5a67c3015f59 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,7 +330,7 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> - addr += PUD_SIZE;
>> + addr += PMD_SIZE;
>> continue;
>> }
>
> I'm thinking you're right in that there's a bug here, but I'm also
> thinking your patch is both incomplete and broken.
>
> What that code wants to do is skip to the end of the pud, a pmd_size
> increase will not do that. And right below this, there's a second
> instance of this exact pattern.
>
> Did I get the below right?
Yes, you are right.
Thanks,
Song
>
> ---
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..32b20b3cb227 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> + addr &= PUD_MASK;
> addr += PUD_SIZE;
> continue;
> }
>
> pmd = pmd_offset(pud, addr);
> if (pmd_none(*pmd)) {
> + addr &= PMD_MASK;
> addr += PMD_SIZE;
> continue;
> }
> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]> wrote:
>
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>> What that code wants to do is skip to the end of the pud, a pmd_size
>> increase will not do that. And right below this, there's a second
>> instance of this exact pattern.
>>
>> Did I get the below right?
>>
>> ---
>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>> index b196524759ec..32b20b3cb227 100644
>> --- a/arch/x86/mm/pti.c
>> +++ b/arch/x86/mm/pti.c
>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>
>> pud = pud_offset(p4d, addr);
>> if (pud_none(*pud)) {
>> + addr &= PUD_MASK;
>> addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);
I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
Thanks,
Song
>
> perhaps?
>
>> continue;
>> }
>>
>> pmd = pmd_offset(pud, addr);
>> if (pmd_none(*pmd)) {
>> + addr &= PMD_MASK;
>> addr += PMD_SIZE;
>> continue;
>> }
>>
On Tue, 2019-08-20 at 13:16 +0200, Thomas Gleixner wrote:
> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > What that code wants to do is skip to the end of the pud, a pmd_size
> > increase will not do that. And right below this, there's a second
> > instance of this exact pattern.
> >
> > Did I get the below right?
> >
> > ---
> > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > index b196524759ec..32b20b3cb227 100644
> > --- a/arch/x86/mm/pti.c
> > +++ b/arch/x86/mm/pti.c
> > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >
> > pud = pud_offset(p4d, addr);
> > if (pud_none(*pud)) {
> > + addr &= PUD_MASK;
> > addr += PUD_SIZE;
>
> round_up(addr, PUD_SIZE);
>
> perhaps?
Won't that keep returning the same address forever
if addr is already PUD_SIZE aligned, i.e. (addr & ~PUD_MASK) == 0?
On Tue, 20 Aug 2019, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]> wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> >> What that code wants to do is skip to the end of the pud, a pmd_size
> >> increase will not do that. And right below this, there's a second
> >> instance of this exact pattern.
> >>
> >> Did I get the below right?
> >>
> >> ---
> >> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> >> index b196524759ec..32b20b3cb227 100644
> >> --- a/arch/x86/mm/pti.c
> >> +++ b/arch/x86/mm/pti.c
> >> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> >>
> >> pud = pud_offset(p4d, addr);
> >> if (pud_none(*pud)) {
> >> + addr &= PUD_MASK;
> >> addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
Right you are.
On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
> > On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]>
> > wrote:
> >
> > On Tue, 20 Aug 2019, Peter Zijlstra wrote:
> > > What that code wants to do is skip to the end of the pud, a pmd_size
> > > increase will not do that. And right below this, there's a second
> > > instance of this exact pattern.
> > >
> > > Did I get the below right?
> > >
> > > ---
> > > diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> > > index b196524759ec..32b20b3cb227 100644
> > > --- a/arch/x86/mm/pti.c
> > > +++ b/arch/x86/mm/pti.c
> > > @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> > >
> > > pud = pud_offset(p4d, addr);
> > > if (pud_none(*pud)) {
> > > + addr &= PUD_MASK;
> > > addr += PUD_SIZE;
> >
> > round_up(addr, PUD_SIZE);
>
> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
What does that do if start is less than PMD_SIZE
away from the next PUD_SIZE boundary?
How about: round_up(addr + 1, PUD_SIZE) ?
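The three candidates compared with concrete numbers (userspace sketch;
round_up() is written out for power-of-two alignment, which matches the
kernel's behavior for these sizes):

#include <assert.h>

#define PMD_SIZE (1UL << 21)
#define PUD_SIZE (1UL << 30)
#define round_up(x, y) (((x) + (y) - 1) & ~((y) - 1))

int main(void)
{
	unsigned long aligned = 0x40000000UL;	/* on a PUD boundary */
	unsigned long near = 0x80000000UL - PMD_SIZE / 2;	/* just below one */

	/* round_up(addr, PUD_SIZE): no forward progress when aligned. */
	assert(round_up(aligned, PUD_SIZE) == aligned);

	/*
	 * round_up(addr + PMD_SIZE, PUD_SIZE): when addr is within
	 * PMD_SIZE of a boundary, it jumps past an entire PUD.
	 */
	assert(round_up(near + PMD_SIZE, PUD_SIZE) == 0xC0000000UL);

	/* round_up(addr + 1, PUD_SIZE): always advances, never overshoots. */
	assert(round_up(aligned + 1, PUD_SIZE) == 0x80000000UL);
	assert(round_up(near + 1, PUD_SIZE) == 0x80000000UL);
	return 0;
}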
On 8/20/19 12:51 AM, Song Liu wrote:
> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
> of this issuse, including PMD for the irq entry table. For a memcache
> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
> more iTLB-load-misses on a Skylake CPU.
I was surprised that this manifests as a performance issue. Usually
messing up PTI page table manipulation means you get to experience the
joys of debugging triple faults. But, it makes sense if it's this line:
/*
* Note that this will undo _some_ of the work that
* pti_set_kernel_image_nonglobal() did to clear the
* global bit.
*/
pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
which is restoring the Global bit.
*But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
and shouldn't have a global kernel image. Could you confirm whether
PCIDs are supported on this CPU?
> pud = pud_offset(p4d, addr);
> if (pud_none(*pud)) {
> - addr += PUD_SIZE;
> + addr += PMD_SIZE;
> continue;
> }
Did we also bugger up this code:
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
addr += PMD_SIZE;
continue;
}
if we're on 32-bit and this:
#define PTI_LEVEL_KERNEL_IMAGE PTI_CLONE_PTE
and we get a hole walking to a non-PMD-aligned address?
> On Aug 20, 2019, at 6:55 AM, Rik van Riel <[email protected]> wrote:
>
> On Tue, 2019-08-20 at 09:21 -0400, Song Liu wrote:
>>> On Aug 20, 2019, at 4:16 AM, Thomas Gleixner <[email protected]>
>>> wrote:
>>>
>>> On Tue, 20 Aug 2019, Peter Zijlstra wrote:
>>>> What that code wants to do is skip to the end of the pud, a pmd_size
>>>> increase will not do that. And right below this, there's a second
>>>> instance of this exact pattern.
>>>>
>>>> Did I get the below right?
>>>>
>>>> ---
>>>> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
>>>> index b196524759ec..32b20b3cb227 100644
>>>> --- a/arch/x86/mm/pti.c
>>>> +++ b/arch/x86/mm/pti.c
>>>> @@ -330,12 +330,14 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>>>>
>>>> pud = pud_offset(p4d, addr);
>>>> if (pud_none(*pud)) {
>>>> + addr &= PUD_MASK;
>>>> addr += PUD_SIZE;
>>>
>>> round_up(addr, PUD_SIZE);
>>
>> I guess we need "round_up(addr + PMD_SIZE, PUD_SIZE)".
>
> What does that do if start is less than PMD_SIZE
> away from the next PUD_SIZE boundary?
Great point!
>
> How about: round_up(addr + 1, PUD_SIZE) ?
Yes. How about this?
=========================== 8< ============================
From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
From: Song Liu <[email protected]>
Date: Mon, 19 Aug 2019 23:59:47 -0700
Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly
Before the 32-bit support, pti_clone_pmds() always added PMD_SIZE to addr.
This behavior changed with the 32-bit support: pti_clone_pgtable() increases
addr by PUD_SIZE for the pud_none(*pud) case, and by PMD_SIZE for the
pmd_none(*pmd) case. However, this is incorrect because addr may not be
PUD_SIZE/PMD_SIZE aligned.
Fix this by rounding addr up to the next PUD_SIZE/PMD_SIZE boundary in
these two cases.
Cc: [email protected] # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Signed-off-by: Song Liu <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
arch/x86/mm/pti.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..1337494e22ef 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
pud = pud_offset(p4d, addr);
if (pud_none(*pud)) {
- addr += PUD_SIZE;
+ addr = round_up(addr + 1, PUD_SIZE);
continue;
}
pmd = pmd_offset(pud, addr);
if (pmd_none(*pmd)) {
- addr += PMD_SIZE;
+ addr = round_up(addr + 1, PMD_SIZE);
continue;
}
--
2.17.1
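For completeness, the fixed advance pattern can be exercised end to end in
a small userspace sketch (the sparse layout and the fake_pud_present()
stand-in are invented for illustration; the real walk lives in
pti_clone_pgtable()):

#include <stdio.h>

#define PMD_SIZE (1UL << 21)
#define PUD_SIZE (1UL << 30)
#define round_up(x, y) (((x) + (y) - 1) & ~((y) - 1))

/* Fake layout for the sketch: only the third 1 GiB PUD is populated. */
static int fake_pud_present(unsigned long addr)
{
	return (addr >> 30) == 2;
}

int main(void)
{
	unsigned long addr = 0x01000000UL;	/* unaligned start, 16 MiB */
	unsigned long end = 4UL << 30;		/* walk a 4 GiB range */
	unsigned long pmds = 0;

	while (addr < end) {
		if (!fake_pud_present(addr)) {
			/* The fixed advance: lands on the next PUD boundary
			 * and always makes forward progress. */
			addr = round_up(addr + 1, PUD_SIZE);
			continue;
		}
		pmds++;				/* "clone" one PMD */
		addr = round_up(addr + 1, PMD_SIZE);
	}

	/* 1 GiB populated / 2 MiB per PMD = 512 PMDs, none skipped. */
	printf("PMDs visited: %lu\n", pmds);
	return 0;
}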
> On Aug 20, 2019, at 6:57 AM, Dave Hansen <[email protected]> wrote:
>
> On 8/20/19 12:51 AM, Song Liu wrote:
>> In our x86_64 kernel, pti_clone_pgtable() fails to clone 7 PMDs because
>> of this issuse, including PMD for the irq entry table. For a memcache
>> like workload, this introduces about 4.5x more iTLB-load and about 2.5x
>> more iTLB-load-misses on a Skylake CPU.
>
> I was surprised that this manifests as a performance issue. Usually
> messing up PTI page table manipulation means you get to experience the
> joys of debugging triple faults. But, it makes sense if it's this line:
>
> /*
> * Note that this will undo _some_ of the work that
> * pti_set_kernel_image_nonglobal() did to clear the
> * global bit.
> */
> pti_clone_pgtable(start, end_clone, PTI_LEVEL_KERNEL_IMAGE);
>
> which is restoring the Global bit.
>
> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
> and shouldn't have a global kernel image. Could you confirm whether
> PCIDs are supported on this CPU?
Yes, pcid is listed in /proc/cpuinfo.
Thanks,
Song
On 8/20/19 7:14 AM, Song Liu wrote:
>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>> and shouldn't have a global kernel image. Could you confirm whether
>> PCIDs are supported on this CPU?
> Yes, pcid is listed in /proc/cpuinfo.
So what's going on? Could you confirm exactly which pti_clone_pgtable()
is causing you problems? Do you have a theory as to why this manifests
as a performance problem rather than a functional one?
A diff of these:
/sys/kernel/debug/page_tables/current_user
/sys/kernel/debug/page_tables/current_kernel
before and after your patch might be helpful.
> On Aug 20, 2019, at 7:18 AM, Dave Hansen <[email protected]> wrote:
>
> On 8/20/19 7:14 AM, Song Liu wrote:
>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>> and shouldn't have a global kernel image. Could you confirm whether
>>> PCIDs are supported on this CPU?
>> Yes, pcid is listed in /proc/cpuinfo.
>
> So what's going on? Could you confirm exactly which pti_clone_pgtable()
> is causing you problems? Do you have a theory as to why this manifests
> as a performance problem rather than a functional one?
>
> A diff of these:
>
> /sys/kernel/debug/page_tables/current_user
> /sys/kernel/debug/page_tables/current_kernel
>
> before and after your patch might be helpful.
I believe the difference is from the following entries (7 PMDs):
Before the patch:
current_kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
efi: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
After the patch:
current_kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
efi: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel and kernel show the same data, though.
Thanks,
Song
> On Aug 20, 2019, at 9:05 AM, Song Liu <[email protected]> wrote:
>
>
>
>> On Aug 20, 2019, at 7:18 AM, Dave Hansen <[email protected]> wrote:
>>
>> On 8/20/19 7:14 AM, Song Liu wrote:
>>>> *But*, that shouldn't get hit on a Skylake CPU since those have PCIDs
>>>> and shouldn't have a global kernel image. Could you confirm whether
>>>> PCIDs are supported on this CPU?
>>> Yes, pcid is listed in /proc/cpuinfo.
>>
>> So what's going on? Could you confirm exactly which pti_clone_pgtable()
>> is causing you problems? Do you have a theory as to why this manifests
>> as a performance problem rather than a functional one?
>>
>> A diff of these:
>>
>> /sys/kernel/debug/page_tables/current_user
>> /sys/kernel/debug/page_tables/current_kernel
>>
>> before and after your patch might be helpful.
>
> I believe the difference is from the following entries (7 PMDs):
>
> Before the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
> efi: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
> kernel: 0xffffffff81000000-0xffffffff81e04000 14352K ro GLB x pte
>
>
> After the patch:
>
> current_kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> efi: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> kernel: 0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
>
> current_kernel and kernel show the same data, though.
A little more detail on how I got here.
We use huge pages for hot text, which reduces iTLB misses. When we
benchmarked a 5.2-based kernel (vs. 4.16-based), we found ~2.5x more
iTLB misses.
To figure out the issue, I used a debug patch that dumps the page table
for a given pid. The following is the information for the workload pid.
For the 4.16 based kernel:
host-4.16 # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
0xffffffff81a00000-0xffffffff81c00000 2M ro PSE x pmd
For the 5.2 based kernel before this patch:
host-5.2-before # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
The 8MB of text in a PMD is from user space. The 4.16 kernel has 1 PMD for
the irq entry table, while the 5.2 kernel doesn't have it.
For the 5.2 based kernel after this patch:
host-5.2-after # grep "x pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000 8M USR ro PSE x pmd
0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
So after this patch, the 5.2-based kernel has 7 PMDs instead of the 1 PMD
in the 4.16 kernel. This further reduces the iTLB miss rate.
Thanks,
Song
On Tue, 2019-08-20 at 10:00 -0400, Song Liu wrote:
>
> From 9ae74cff4faf4710a11cb8da4c4a3f3404bd9fdd Mon Sep 17 00:00:00 2001
> From: Song Liu <[email protected]>
> Date: Mon, 19 Aug 2019 23:59:47 -0700
> Subject: [PATCH] x86/mm/pti: in pti_clone_pgtable(), increase addr properly
>
> Before the 32-bit support, pti_clone_pmds() always added PMD_SIZE to
> addr. This behavior changed with the 32-bit support: pti_clone_pgtable()
> increases addr by PUD_SIZE for the pud_none(*pud) case, and by PMD_SIZE
> for the pmd_none(*pmd) case. However, this is incorrect because addr may
> not be PUD_SIZE/PMD_SIZE aligned.
>
> Fix this by rounding addr up to the next PUD_SIZE/PMD_SIZE boundary in
> these two cases.
>
> Cc: [email protected] # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Signed-off-by: Song Liu <[email protected]>
> Cc: Joerg Roedel <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
This looks like it should do the trick!
Reviewed-by: Rik van Riel <[email protected]>
--
All Rights Reversed.