2015-07-22 17:13:43

by Catalin Marinas

Subject: [PATCH] mm: Flush the TLB for a single address in a huge page

When the page table entry is a huge page (and not a table), there is no
need to flush the TLB by range. This patch changes flush_tlb_range() to
flush_tlb_page() in functions where we know the pmd entry is a huge
page.

Signed-off-by: Catalin Marinas <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
---

Hi,

That's just a minor improvement but it saves iterating over each small
page in a huge page when a single TLB entry is used (we already have a
similar assumption in __tlb_adjust_range).

Thanks.

mm/pgtable-generic.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6b674e00153c..ff17eca26211 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -67,7 +67,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma,
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	if (changed) {
 		set_pmd_at(vma->vm_mm, address, pmdp, entry);
-		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+		flush_tlb_page(vma, address);
 	}
 	return changed;
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -101,7 +101,7 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 	young = pmdp_test_and_clear_young(vma, address, pmdp);
 	if (young)
-		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+		flush_tlb_page(vma, address);
 	return young;
 }
 #endif
@@ -128,7 +128,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	VM_BUG_ON(!pmd_trans_huge(*pmdp));
 	pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
-	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	flush_tlb_page(vma, address);
 	return pmd;
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -143,7 +143,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 	set_pmd_at(vma->vm_mm, address, pmdp, pmd);
 	/* tlb flush only to serialize against gup-fast */
-	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	flush_tlb_page(vma, address);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
@@ -195,7 +195,7 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 {
 	pmd_t entry = *pmdp;
 	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
-	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+	flush_tlb_page(vma, address);
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif


2015-07-22 21:39:39

by David Rientjes

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Wed, 22 Jul 2015, Catalin Marinas wrote:

> When the page table entry is a huge page (and not a table), there is no
> need to flush the TLB by range. This patch changes flush_tlb_range() to
> flush_tlb_page() in functions where we know the pmd entry is a huge
> page.
>
> Signed-off-by: Catalin Marinas <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> ---
>
> Hi,
>
> That's just a minor improvement but it saves iterating over each small
> page in a huge page when a single TLB entry is used (we already have a
> similar assumption in __tlb_adjust_range).
>

For x86 smp, this seems to mean the difference between unconditional
flush_tlb_page() and local_flush_tlb() due to
tlb_single_page_flush_ceiling, so I don't think this just removes the
iteration.
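
For reference, the x86 logic being referred to is roughly the following
(a paraphrase of the flush_tlb_mm_range() decision, not the exact
source, and variable names may differ slightly):

	/* Paraphrased sketch, not the exact kernel code. */
	if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
		base_pages_to_flush = (end - start) >> PAGE_SHIFT;

	if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
		local_flush_tlb();		/* full flush */
	} else {
		for (addr = start; addr < end; addr += PAGE_SIZE)
			__flush_tlb_single(addr);	/* invlpg */
	}

An HPAGE_PMD_SIZE range is 512 base pages, well above the ceiling of
33, so the range flush degenerates into a full local flush, whereas
flush_tlb_page() invalidates just the one address.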


2015-07-22 22:49:04

by Catalin Marinas

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On 22 July 2015 at 22:39, David Rientjes <[email protected]> wrote:
> On Wed, 22 Jul 2015, Catalin Marinas wrote:
>
>> When the page table entry is a huge page (and not a table), there is no
>> need to flush the TLB by range. This patch changes flush_tlb_range() to
>> flush_tlb_page() in functions where we know the pmd entry is a huge
>> page.
>>
>> Signed-off-by: Catalin Marinas <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Andrea Arcangeli <[email protected]>
>> ---
>>
>> Hi,
>>
>> That's just a minor improvement but it saves iterating over each small
>> page in a huge page when a single TLB entry is used (we already have a
>> similar assumption in __tlb_adjust_range).
>
> For x86 smp, this seems to mean the difference between unconditional
> flush_tlb_page() and local_flush_tlb() due to
> tlb_single_page_flush_ceiling, so I don't think this just removes the
> iteration.

You are right, on x86 the tlb_single_page_flush_ceiling seems to be
33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
always. I would say a single page TLB flush is more efficient than a
whole TLB flush but I'm not familiar enough with x86.

Alternatively, I could introduce a flush_tlb_pmd_huge_page (suggested
by Andrea separately) and let the architectures deal with this as they
see fit. The default definition would do a flush_tlb_range(vma,
address, address + HPAGE_SIZE). For arm64, I'll define it as
flush_tlb_page(vma, address).
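
Purely as an illustration of that alternative (nothing that exists in
the tree; the name is just the suggestion above, using HPAGE_PMD_SIZE
as in the patch), the generic fallback could look like:

	#ifndef flush_tlb_pmd_huge_page
	#define flush_tlb_pmd_huge_page(vma, addr) \
		flush_tlb_range(vma, addr, (addr) + HPAGE_PMD_SIZE)
	#endif

with an architecture that maps a huge pmd as a single TLB entry
(e.g. arm64) overriding it in its asm/tlbflush.h:

	#define flush_tlb_pmd_huge_page(vma, addr) \
		flush_tlb_page(vma, addr)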

--
Catalin

2015-07-22 23:05:23

by Dave Hansen

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On 07/22/2015 03:48 PM, Catalin Marinas wrote:
> You are right, on x86 the tlb_single_page_flush_ceiling seems to be
> 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
> always. I would say a single page TLB flush is more efficient than a
> whole TLB flush but I'm not familiar enough with x86.

The last time I looked, the instruction to invalidate a single page is
more expensive than the instruction to flush the entire TLB. We also
don't bother doing ranged flushes _ever_ for hugetlbfs TLB
invalidations, but that was just because the work done around commit
e7b52ffd4 didn't see any benefit.

That said, I can't imagine this will hurt anything. We also have TLBs
that can mix 2M and 4k pages and I don't think we did back when we put
that code in originally.

2015-07-23 10:49:48

by Catalin Marinas

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Thu, Jul 23, 2015 at 12:05:21AM +0100, Dave Hansen wrote:
> On 07/22/2015 03:48 PM, Catalin Marinas wrote:
> > You are right, on x86 the tlb_single_page_flush_ceiling seems to be
> > 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
> > always. I would say a single page TLB flush is more efficient than a
> > whole TLB flush but I'm not familiar enough with x86.
>
> The last time I looked, the instruction to invalidate a single page is
> more expensive than the instruction to flush the entire TLB.

I was thinking of the overall cost of re-populating the TLB after being
nuked rather than the instruction itself.

> We also don't bother doing ranged flushes _ever_ for hugetlbfs TLB
> invalidations, but that was just because the work done around commit
> e7b52ffd4 didn't see any benefit.

For huge pages, there are indeed fewer page table levels to fetch, so I
guess the impact is not significant. With virtualisation/nested pages,
at least on ARM, refilling the TLB for a guest would take longer (though
it's highly dependent on the microarchitecture implementation, i.e.
whether it caches the guest-PA to host-PA translation separately).

> That said, I can't imagine this will hurt anything. We also have TLBs
> that can mix 2M and 4k pages and I don't think we did back when we put
> that code in originally.

Another question is whether flushing a single address is enough for a
huge page. I assumed it is since tlb_remove_pmd_tlb_entry() only adjusts
the mmu_gather range by PAGE_SIZE (rather than HPAGE_SIZE) and no-one
complained so far. AFAICT, there are only 3 architectures that don't use
asm-generic/tlb.h but they all seem to handle this case:

arch/arm: it implements tlb_remove_pmd_tlb_entry() in a similar way to
the generic one

arch/s390: tlb_remove_pmd_tlb_entry() is a no-op

arch/ia64: does not support THP
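
For reference, the generic code I mean is roughly the following
(paraphrased from include/asm-generic/tlb.h of this era, so not
necessarily verbatim):

	static inline void __tlb_adjust_range(struct mmu_gather *tlb,
					      unsigned long address)
	{
		tlb->start = min(tlb->start, address);
		tlb->end = max(tlb->end, address + PAGE_SIZE);
	}

	#define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
		do {							\
			__tlb_adjust_range(tlb, address);		\
			__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
		} while (0)

i.e. the range grows by PAGE_SIZE, not HPAGE_PMD_SIZE, for a huge pmd.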

--
Catalin

2015-07-23 14:13:16

by Andrea Arcangeli

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Thu, Jul 23, 2015 at 11:49:38AM +0100, Catalin Marinas wrote:
> On Thu, Jul 23, 2015 at 12:05:21AM +0100, Dave Hansen wrote:
> > On 07/22/2015 03:48 PM, Catalin Marinas wrote:
> > > You are right, on x86 the tlb_single_page_flush_ceiling seems to be
> > > 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
> > > always. I would say a single page TLB flush is more efficient than a
> > > whole TLB flush but I'm not familiar enough with x86.
> >
> > The last time I looked, the instruction to invalidate a single page is
> > more expensive than the instruction to flush the entire TLB.
>
> I was thinking of the overall cost of re-populating the TLB after being
> nuked rather than the instruction itself.

Unless there are timing differences I'm not aware of between flushing
2MB TLB entries and flushing 4kb TLB entries with invlpg, the benchmarks
that were run to tune the optimal tlb_single_page_flush_ceiling value
should already guarantee us that this is a valid optimization (as we've
just got one entry, we're not even close to the 33 ceiling that makes it
more of a grey area).

> > That said, I can't imagine this will hurt anything. We also have TLBs
> > that can mix 2M and 4k pages and I don't think we did back when we put
> > that code in originally.

Dave, I'm confused about this. We should still stick to an invariant
that we can't ever mix 2M and 4k TLB entries if their mappings end up
overlapping on the same physical memory (if this isn't enforced in
common code, some x86 implementation errata triggers, and it really
oopses with machine checks so it's not just theoretical). Perhaps I
misunderstood what you meant with mix 2M and 4k pages though.

> Another question is whether flushing a single address is enough for a
> huge page. I assumed it is since tlb_remove_pmd_tlb_entry() only adjusts

That's the primary reason why the range flush is currently used (and
it must still be used in pmdp_collapse_flush, as that deals with 4k TLB
entries, but your patch correctly isn't touching that one).

I recall having used flush_tlb_page initially for the 2MB invalidates,
but then I switched to the range version purely to be safer. If we can
optimize this now I'd certainly be happy about that. Back then
tlb_remove_pmd_tlb_entry, which already started optimizing things in
this direction, didn't exist yet.

> the mmu_gather range by PAGE_SIZE (rather than HPAGE_SIZE) and
> no-one complained so far. AFAICT, there are only 3 architectures
> that don't use asm-generic/tlb.h but they all seem to handle this
> case:

Agreed that archs using the generic tlb.h that sets the tlb->end to
address+PAGE_SIZE should be fine with the flush_tlb_page.

> arch/arm: it implements tlb_remove_pmd_tlb_entry() in a similar way to
> the generic one
>
> arch/s390: tlb_remove_pmd_tlb_entry() is a no-op

I guess s390 is fine too, but I'm not convinced that the fact it won't
adjust tlb->start/end guarantees that flush_tlb_page is enough when a
single 2MB TLB entry has to be invalidated (not during range zapping).

For the range zapping, could the arch decide to unconditionally flush
the whole TLB without doing the tlb->start/end tracking by overriding
tlb_gather_mmu in a way that won't call __tlb_reset_range? There seems
to be quite some flexibility in the per-arch tlb_gather_mmu setup in
order to unconditionally set tlb->start/end to the total range zapped,
without actually narrowing it down during the pagetable walk.

This is why I was thinking a flush_tlb_pmd_huge_page might have been
safer. However, if hugetlbfs basically assumes flush_tlb_page works, as
Dave said, and if s390 is fine as well, I think we can just apply this
patch, which follows the generic tlb_remove_pmd_tlb_entry optimization.

Thanks,
Andrea

2015-07-23 14:41:48

by Dave Hansen

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On 07/23/2015 07:13 AM, Andrea Arcangeli wrote:
> On Thu, Jul 23, 2015 at 11:49:38AM +0100, Catalin Marinas wrote:
>> On Thu, Jul 23, 2015 at 12:05:21AM +0100, Dave Hansen wrote:
>>> On 07/22/2015 03:48 PM, Catalin Marinas wrote:
>>>> You are right, on x86 the tlb_single_page_flush_ceiling seems to be
>>>> 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
>>>> always. I would say a single page TLB flush is more efficient than a
>>>> whole TLB flush but I'm not familiar enough with x86.
>>>
>>> The last time I looked, the instruction to invalidate a single page is
>>> more expensive than the instruction to flush the entire TLB.
>>
>> I was thinking of the overall cost of re-populating the TLB after being
>> nuked rather than the instruction itself.
>
> Unless there are timing differences I'm not aware of between flushing
> 2MB TLB entries and flushing 4kb TLB entries with invlpg, the benchmarks
> that were run to tune the optimal tlb_single_page_flush_ceiling value
> should already guarantee us that this is a valid optimization (as we've
> just got one entry, we're not even close to the 33 ceiling that makes it
> more of a grey area).

We had a discussion about this a few weeks ago:

https://lkml.org/lkml/2015/6/25/666

The argument is that the CPU is so good at refilling the TLB that it
rarely waits on it, so the "cost" can be very very low.

>>> That said, I can't imagine this will hurt anything. We also have TLBs
>>> that can mix 2M and 4k pages and I don't think we did back when we put
>>> that code in originally.
>
> Dave, I'm confused about this. We should still stick to an invariant
> that we can't ever mix 2M and 4k TLB entries if their mappings end up
> overlapping on the same physical memory (if this isn't enforced in
> common code, some x86 implementation errata triggers, and it really
> oopses with machine checks so it's not just theoretical). Perhaps I
> misunderstood what you meant with mix 2M and 4k pages though.

On older CPUs we had dedicated 2M TLB slots. Now, we have an STLB that
can hold 2M and 4k entries at the same time. That will surely change
the performance profile enough that whatever testing we did in the past
is fairly stale now.

I didn't mean mixing 4k and 2M mappings for the same virtual address.

2015-07-23 15:58:10

by Andrea Arcangeli

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Thu, Jul 23, 2015 at 07:41:24AM -0700, Dave Hansen wrote:
> We had a discussion about this a few weeks ago:
>
> https://lkml.org/lkml/2015/6/25/666
>
> The argument is that the CPU is so good at refilling the TLB that it
> rarely waits on it, so the "cost" can be very very low.

That was about a new optimization adding more infrastructure to extend
this kind of optimization, but the infrastructure has a cost, so I'm
not sure it's relevant for this discussion, as there is no new
infrastructure or overhead here.

If we were doing close to 33 invlpg it would be a grey area, but it's
just one.

I also refer to code that is already in the kernel:

if (base_pages_to_flush > tlb_single_page_flush_ceiling) {

You wrote the patch that uses the tlb_single_page_flush_ceiling, so if
the above discussion is relevant with regard to flush_tlb_page, are you
implying that the above optimization in the kernel should also be
removed?

When these flush_tlb_range optimizations were introduced, benchmarks
showed that they helped, IIRC. Whether that's still true with the
latest CPUs I don't know, but there should be at least a subset of
those CPUs where this helps, so I doubt it should be removed for all
CPUs out there.

Also, if flush_tlb_page is slower on x86 (which I doubt), then x86
should implement it without invlpg, but for the common code it would
still make sense to use flush_tlb_page.
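
To make that concrete (purely a hypothetical sketch, not a proposal):
an arch that found invlpg too slow could keep the generic callers as
they are and simply widen its own flush_tlb_page(), e.g.

	/* hypothetical arch override, not real code */
	static inline void flush_tlb_page(struct vm_area_struct *vma,
					  unsigned long addr)
	{
		flush_tlb_mm(vma->vm_mm);	/* ignore addr, flush it all */
	}

while everyone else still benefits from the single-entry invalidate.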

> On older CPUs we had dedicated 2M TLB slots. Now, we have an STLB that
> can hold 2M and 4k entries at the same time. That will surely change
> the performance profile enough that whatever testing we did in the past
> is fairly stale now.
>
> I didn't mean mixing 4k and 2M mappings for the same virtual address.

Thanks for the clarification, got what you meant now.

I still can't see why flush_tlb_page isn't an obviously valid
optimization though.

The tlb_single_page_flush_ceiling optimization has nothing to do with
2MB pages. But if that is still valid (or if it has ever been valid
for older CPUs), why is flush_tlb_page not a valid optimization at
least for those older CPUs? Why is it worth doing single invalidates
on 4k pages and not on 2MB pages?

It surely was helpful to do invlpg invalidates on 4k pages, up to 33
in a row, on x86 CPUs, as you wrote the code quoted above to do
exactly that and it is still in the current kernel. So why are 2MB
pages different?

I don't see a relation between this and the fact that 2MB and 4KB TLB
entries aren't separated anymore. If anything, the fact that the TLB
can use the same TLB entry for 2MB and 4KB pages means doing invlpg on
2MB pages is even more helpful with newer CPUs, as there can be more
2MB TLB entries to preserve than before. So I'm slightly confused now.

Thanks,
Andrea

2015-07-23 16:16:54

by Catalin Marinas

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Thu, Jul 23, 2015 at 03:41:24PM +0100, Dave Hansen wrote:
> On 07/23/2015 07:13 AM, Andrea Arcangeli wrote:
> > On Thu, Jul 23, 2015 at 11:49:38AM +0100, Catalin Marinas wrote:
> >> On Thu, Jul 23, 2015 at 12:05:21AM +0100, Dave Hansen wrote:
> >>> On 07/22/2015 03:48 PM, Catalin Marinas wrote:
> >>>> You are right, on x86 the tlb_single_page_flush_ceiling seems to be
> >>>> 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
> >>>> always. I would say a single page TLB flush is more efficient than a
> >>>> whole TLB flush but I'm not familiar enough with x86.
> >>>
> >>> The last time I looked, the instruction to invalidate a single page is
> >>> more expensive than the instruction to flush the entire TLB.
> >>
> >> I was thinking of the overall cost of re-populating the TLB after being
> >> nuked rather than the instruction itself.
> >
> > Unless there are timing differences I'm not aware of between flushing
> > 2MB TLB entries and flushing 4kb TLB entries with invlpg, the benchmarks
> > that were run to tune the optimal tlb_single_page_flush_ceiling value
> > should already guarantee us that this is a valid optimization (as we've
> > just got one entry, we're not even close to the 33 ceiling that makes it
> > more of a grey area).
>
> We had a discussion about this a few weeks ago:
>
> https://lkml.org/lkml/2015/6/25/666
>
> The argument is that the CPU is so good at refilling the TLB that it
> rarely waits on it, so the "cost" can be very very low.

Interesting thread. I can see from Ingo's benchmarks that invlpg is much
more expensive than the cr3 write but I can't really comment on the
refill cost (it may be small with page table caching in L1/L2). The
problem with small/targeted benchmarks is that you don't see the overall
impact.

On ARM, most recent CPUs can cache intermediate page table levels in the
TLB (usually as VA->pte translation). ARM64 introduces a new TLB
flushing instruction that only touches the last level (pte, huge pmd).
In theory this should be cheaper overall since the CPU doesn't need to
refill intermediate levels. In practice, it's probably lost in the
noise.

Anyway, if you want to keep the option of a full TLB flush for x86 on
huge pages, I'm happy to repost a v2 with a separate
flush_tlb_pmd_huge_page that arch code can define as it sees fit.

--
Catalin

2015-07-23 16:49:33

by Catalin Marinas

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Thu, Jul 23, 2015 at 03:13:03PM +0100, Andrea Arcangeli wrote:
> On Thu, Jul 23, 2015 at 11:49:38AM +0100, Catalin Marinas wrote:
> > On Thu, Jul 23, 2015 at 12:05:21AM +0100, Dave Hansen wrote:
> > > On 07/22/2015 03:48 PM, Catalin Marinas wrote:
> > > > You are right, on x86 the tlb_single_page_flush_ceiling seems to be
> > > > 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
> > > > always. I would say a single page TLB flush is more efficient than a
> > > > whole TLB flush but I'm not familiar enough with x86.
> > >
> > > The last time I looked, the instruction to invalidate a single page is
> > > more expensive than the instruction to flush the entire TLB.
[...]
> > Another question is whether flushing a single address is enough for a
> > huge page. I assumed it is since tlb_remove_pmd_tlb_entry() only adjusts
[...]
> > the mmu_gather range by PAGE_SIZE (rather than HPAGE_SIZE) and
> > no-one complained so far. AFAICT, there are only 3 architectures
> > that don't use asm-generic/tlb.h but they all seem to handle this
> > case:
>
> Agreed that archs using the generic tlb.h that sets the tlb->end to
> address+PAGE_SIZE should be fine with the flush_tlb_page.
>
> > arch/arm: it implements tlb_remove_pmd_tlb_entry() in a similar way to
> > the generic one
> >
> > arch/s390: tlb_remove_pmd_tlb_entry() is a no-op
>
> I guess s390 is fine too, but I'm not convinced that the fact it won't
> adjust tlb->start/end guarantees that flush_tlb_page is enough when a
> single 2MB TLB entry has to be invalidated (not during range zapping).
>
> For the range zapping, could the arch decide to unconditionally flush
> the whole TLB without doing the tlb->start/end tracking by overriding
> tlb_gather_mmu in a way that won't call __tlb_reset_range? There seems
> to be quite some flexibility in the per-arch tlb_gather_mmu setup in
> order to unconditionally set tlb->start/end to the total range zapped,
> without actually narrowing it down during the pagetable walk.

You are right, looking at the s390 code, tlb_finish_mmu() flushes the
whole TLB, so the ranges don't seem to matter. I'm cc'ing the s390
maintainers to confirm whether this patch affects them in any way:

https://lkml.org/lkml/2015/7/22/521

IIUC, all the functions touched by this patch are implemented by s390 in
its specific way, so I don't think it makes any difference:

pmdp_set_access_flags
pmdp_clear_flush_young
pmdp_huge_clear_flush
pmdp_splitting_flush
pmdp_invalidate

--
Catalin

2015-07-23 16:52:52

by Dave Hansen

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On 07/23/2015 08:58 AM, Andrea Arcangeli wrote:
> You wrote the patch that uses the tlb_single_page_flush_ceiling, so if
> the above discussion is relevant with regard to flush_tlb_page, are you
> implying that the above optimization in the kernel should also be
> removed?

When I put that in, my goal was to bring consistency to how we handled
things without regressing anything. I was never able to measure any
nice macro-level benefits to a particular flush behavior.

We can also now just easily disable the ranged flushes if we want to, or
leave them in place for small flushes only.

> When these flush_tlb_range optimizations were introduced, benchmarks
> showed that they helped, IIRC. Whether that's still true with the
> latest CPUs I don't know, but there should be at least a subset of
> those CPUs where this helps, so I doubt it should be removed for all
> CPUs out there.

I tried to reproduce the results and had a difficult time doing so.

> The tlb_single_page_flush_ceiling optimization has nothing to do with
> 2MB pages. But if that is still valid (or if it has ever been valid
> for older CPUs), why is flush_tlb_page not a valid optimization at
> least for those older CPUs? Why is it worth doing single invalidates
> on 4k pages and not on 2MB pages?

I haven't seen any solid evidence that we should do it for one and not
the other.

> It surely was helpful to do invlpg invalidates on 4k pages, up to 33
> in a row, on x86 CPUs, as you wrote the code quoted above to do
> exactly that and it is still in the current kernel. So why are 2MB
> pages different?

They were originally different because the work that introduced 'invlpg'
didn't see a benefit from using 'invlpg' on 2M pages. I didn't
reevaluate it when I hacked on the code and just left it as it was.

It would be great if someone would go and collect some recent data on
using 'invlpg' on 2M pages!
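
Something as rough as the following would already be a start
(hypothetical kernel-module sketch, assuming 'addr' lies inside an
existing 2M mapping; you'd want serialisation, many iterations and
both the 2M and 4k cases for real numbers):

	#include <linux/kernel.h>
	#include <asm/tlbflush.h>
	#include <asm/timex.h>

	static void time_invlpg(unsigned long addr)
	{
		cycles_t t0, t1;

		t0 = get_cycles();
		__flush_tlb_single(addr);	/* invlpg */
		t1 = get_cycles();
		pr_info("invlpg took ~%llu cycles\n",
			(unsigned long long)(t1 - t0));
	}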

2015-07-23 16:55:36

by Dave Hansen

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On 07/23/2015 09:16 AM, Catalin Marinas wrote:
> Anyway, if you want to keep the option of a full TLB flush for x86 on
> huge pages, I'm happy to repost a v2 with a separate
> flush_tlb_pmd_huge_page that arch code can define as it sees fit.

I think your patch is fine on x86. We need to keep an eye out for any
regressions, but I think it's OK.

2015-07-23 17:13:37

by Andrea Arcangeli

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Thu, Jul 23, 2015 at 09:55:33AM -0700, Dave Hansen wrote:
> On 07/23/2015 09:16 AM, Catalin Marinas wrote:
> > Anyway, if you want to keep the option of a full TLB flush for x86 on
> > huge pages, I'm happy to repost a v2 with a separate
> > flush_tlb_pmd_huge_page that arch code can define as it sees fit.
>
> I think your patch is fine on x86. We need to keep an eye out for any
> regressions, but I think it's OK.

That's my view as well.

I've read more of the other thread and I quote Ingo:

" It barely makes sense for a 2 pages and gets exponentially
worse. It's probably done in microcode and its performance is
horrible. "

So in our case it's just 1 page (not 2, not 33), and considering it
avoids invalidating all other TLB entries, it's most certainly a win:
it requires zero additional infrastructure and, best of all, it also
avoids flushing the entire TLB on remote CPUs, again without
infrastructure, pfn arrays or multiple invlpg.

As further confirmation that for 1 entry invlpg is worth it, even
flush_tlb_page->flush_tlb_func invokes __flush_tlb_single in the IPI
handler instead of local_flush_tlb().

So the discussion there was about the additional infrastructure and a
flood of invlpg, perhaps more than 33; I agree a local_flush_tlb()
sounds better for that case.

The question left for x86 is whether invlpg is even slower for 2MB
pages than it is for 4k pages, but I'd be surprised if it were,
especially on newer CPUs where the TLB can use a different page size
for each TLB entry. The reason we didn't do flush_tlb_page before
wasn't related to such a concern, at least.

2015-07-24 07:17:58

by Martin Schwidefsky

Subject: Re: [PATCH] mm: Flush the TLB for a single address in a huge page

On Thu, 23 Jul 2015 17:49:21 +0100
Catalin Marinas <[email protected]> wrote:

> On Thu, Jul 23, 2015 at 03:13:03PM +0100, Andrea Arcangeli wrote:
> > On Thu, Jul 23, 2015 at 11:49:38AM +0100, Catalin Marinas wrote:
> > > On Thu, Jul 23, 2015 at 12:05:21AM +0100, Dave Hansen wrote:
> > > > On 07/22/2015 03:48 PM, Catalin Marinas wrote:
> > > > > You are right, on x86 the tlb_single_page_flush_ceiling seems to be
> > > > > 33, so for an HPAGE_SIZE range the code does a local_flush_tlb()
> > > > > always. I would say a single page TLB flush is more efficient than a
> > > > > whole TLB flush but I'm not familiar enough with x86.
> > > >
> > > > The last time I looked, the instruction to invalidate a single page is
> > > > more expensive than the instruction to flush the entire TLB.
> [...]
> > > Another question is whether flushing a single address is enough for a
> > > huge page. I assumed it is since tlb_remove_pmd_tlb_entry() only adjusts
> [...]
> > > the mmu_gather range by PAGE_SIZE (rather than HPAGE_SIZE) and
> > > no-one complained so far. AFAICT, there are only 3 architectures
> > > that don't use asm-generic/tlb.h but they all seem to handle this
> > > case:
> >
> > Agreed that archs using the generic tlb.h that sets the tlb->end to
> > address+PAGE_SIZE should be fine with the flush_tlb_page.
> >
> > > arch/arm: it implements tlb_remove_pmd_tlb_entry() in a similar way to
> > > the generic one
> > >
> > > arch/s390: tlb_remove_pmd_tlb_entry() is a no-op
> >
> > I guess s390 is fine too, but I'm not convinced that the fact it won't
> > adjust tlb->start/end guarantees that flush_tlb_page is enough when a
> > single 2MB TLB entry has to be invalidated (not during range zapping).

tlb_remove_pmd_tlb_entry() is a no-op because pmdp_get_and_clear_full()
already did the job. s390 is special with regard to TLB flushing: the
machines require that a pte/pmd be invalidated with a specific
instruction if there is a process that might be using the translation
path. In this case the IDTE instruction needs to be used, which sets
the invalid bit in the pmd *and* flushes the TLB at the same time. The
code still tries to be lazy and do batched flushes to improve
performance. All in all, quite complicated...

> > For the range zapping, could the arch decide to unconditionally flush
> > the whole TLB without doing the tlb->start/end tracking by overriding
> > tlb_gather_mmu in a way that won't call __tlb_reset_range? There seems
> > to be quite some flexibility in the per-arch tlb_gather_mmu setup in
> > order to unconditionally set tlb->start/end to the total range zapped,
> > without actually narrowing it down during the pagetable walk.
>
> You are right, looking at the s390 code, tlb_finish_mmu() flushes the
> whole TLB, so the ranges don't seem to matter. I'm cc'ing the s390
> maintainers to confirm whether this patch affects them in any way:
>
> https://lkml.org/lkml/2015/7/22/521
>
> IIUC, all the functions touched by this patch are implemented by s390 in
> its specific way, so I don't think it makes any difference:
>
> pmdp_set_access_flags
> pmdp_clear_flush_young
> pmdp_huge_clear_flush
> pmdp_splitting_flush
> pmdp_invalidate

tlb_finish_mmu may flush all entries for a specific address space, not
the whole TLB. And it does so only for batched operations. If all changes
to the page tables have been done with IPTE/IDTE then flush_mm will not
be set and no full address space flush is done.

But to answer the question: s390 is fine with the change outlined in
https://lkml.org/lkml/2015/7/22/521

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.