2011-02-17 17:15:11

by Peter Zijlstra

Subject: [PATCH 06/17] arm: mmu_gather rework

Fix up the arm mmu_gather code to conform to the new API.

Cc: Russell King <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/arm/include/asm/tlb.h | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)

Index: linux-2.6/arch/arm/include/asm/tlb.h
===================================================================
--- linux-2.6.orig/arch/arm/include/asm/tlb.h
+++ linux-2.6/arch/arm/include/asm/tlb.h
@@ -40,17 +40,11 @@ struct mmu_gather {
unsigned long range_end;
};

-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *
-tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
+static inline void
+tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned int full_mm_flush)
{
- struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
-
tlb->mm = mm;
tlb->fullmm = full_mm_flush;
-
- return tlb;
}

static inline void
@@ -61,8 +55,6 @@ tlb_finish_mmu(struct mmu_gather *tlb, u

/* keep the page table cache within bounds */
check_pgt_cache();
-
- put_cpu_var(mmu_gathers);
}

/*
@@ -101,7 +93,22 @@ tlb_end_vma(struct mmu_gather *tlb, stru
flush_tlb_range(vma, tlb->range_start, tlb->range_end);
}

-#define tlb_remove_page(tlb,page) free_page_and_swap_cache(page)
+static inline int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+ free_page_and_swap_cache(page);
+ return 0;
+}
+
+static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
+{
+ might_sleep();
+ __tlb_remove_page(tlb, page);
+}
+
+static inline void tlb_flush_mmu(struct mmu_gather *tlb)
+{
+}
+
#define pte_free_tlb(tlb, ptep, addr) pte_free((tlb)->mm, ptep)
#define pmd_free_tlb(tlb, pmdp, addr) pmd_free((tlb)->mm, pmdp)
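
For reference, a minimal sketch of how a caller uses the reworked interface, with the mmu_gather now living on the caller's stack rather than in the per-CPU mmu_gathers variable (the surrounding call sites in mm/ are paraphrased here, not quoted from this series):

	struct mmu_gather tlb;

	/* fullmm == 0: only part of the address space goes away */
	tlb_gather_mmu(&tlb, mm, 0);

	/* ... unmap_vmas()/free_pgtables() etc. operate on &tlb ... */

	tlb_finish_mmu(&tlb, start_addr, end_addr);

Because the structure is caller-allocated, nothing between tlb_gather_mmu() and tlb_finish_mmu() relies on staying on one CPU any more, which is what the preemptible mmu_gather series is after.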



2011-02-24 16:35:03

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> plain text document attachment
> (peter_zijlstra-arm-preemptible_mmu_gather.patch)
> Fix up the arm mmu_gather code to conform to the new API.

So akpm noted that this one doesn't apply anymore because of:

commit 06824ba824b3e9f2fedb38bee79af0643198ed7f
Author: Russell King <[email protected]>
Date: Sun Feb 20 12:16:45 2011 +0000

ARM: tlb: delay page freeing for SMP and ARMv7 CPUs

We need to delay freeing any mapped page on SMP and ARMv7 systems to
ensure that the data is not accessed by other CPUs, or is used for
speculative prefetch with ARMv7. This includes not only mapped pages
but also pages used for the page tables themselves.

This avoids races with the MMU/other CPUs accessing pages after they've
been freed but before we've invalidated the TLB.

Signed-off-by: Russell King <[email protected]>


Which raises a nice point about shift_arg_pages(), which calls
free_pgd_range(). The other architectures that look similar to arm in
this respect are ia64 and sh; do they suffer the same problem?

It doesn't look hard to fold the requirements for this into the generic
tlb range support (patch 14 in this series).

2011-02-25 18:05:52

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Thu, 2011-02-24 at 17:34 +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> > plain text document attachment
> > (peter_zijlstra-arm-preemptible_mmu_gather.patch)
> > Fix up the arm mmu_gather code to conform to the new API.
>
> So akpm noted that this one doesn't apply anymore because of:
>
> commit 06824ba824b3e9f2fedb38bee79af0643198ed7f
> Author: Russell King <[email protected]>
> Date: Sun Feb 20 12:16:45 2011 +0000
>
> ARM: tlb: delay page freeing for SMP and ARMv7 CPUs
>
> We need to delay freeing any mapped page on SMP and ARMv7 systems to
> ensure that the data is not accessed by other CPUs, or is used for
> speculative prefetch with ARMv7. This includes not only mapped pages
> but also pages used for the page tables themselves.
>
> This avoids races with the MMU/other CPUs accessing pages after they've
> been freed but before we've invalidated the TLB.
>
> Signed-off-by: Russell King <[email protected]>
>
>
> Which raises a nice point about shift_arg_pages() which calls
> free_pgd_range(), the other architectures that look similar to arm in
> this respect are ia64 and sh, do they suffer the same problem?
>
> It doesn't look hard to fold the requirements for this into the generic
> tlb range support (patch 14 in this series).

It looks like both ia64 and sh do indeed suffer there.

I've pulled my generic range tracking to the head of the series so that
I can convert ARM, IA64 and SH to generic tlb, solving it for those.

Russell, generic tlb doesn't look to need the extra logic you added for
the fs/exec.c case, but please double check the patches when I post
them.

In short, tlb_end_vma() will call flush_tlb_range() on the tracked range
and clear ->need_flush, so things like zap_page_range() will not then
also call tlb_flush().

In the case of shift_arg_pages() and unmap_region(), however, we first
call free_pgtables(), which might end up calling p??_free_tlb(); that
sets ->need_flush, and tlb_finish_mmu() will then end up calling
tlb_flush().

I'm not quite sure why you chose to add range tracking on
pte_free_tlb(); the only affected code path seems to be unmap_region(),
where you'll use a flush_tlb_range(), but it's buggy: the pte_free_tlb()
range is much larger than one page, and if you do it there you also need
it for all the other p??_free_tlb() functions.

The tlb flush after freeing page-tables is needed for things like
gup_fast() which needs to sync against them being freed.

So the stuff I have now will try its best to track ranges on zap_* while
clearing the page mapping, using flush_cache_range() and
flush_tlb_range(). But when it comes to tearing down the page-tables
themselves we'll punt and use a full mm flush, which seems a waste of
all that careful range tracking by zap_*.

One possibility would be to add tlb_start/end_vma() in
unmap_page_range(), except we don't need to flush the cache again; also,
it would be nice not to have to flush on tlb_end_vma() but to delay it
all to tlb_finish_mmu() where possible.

OK, let me try and hack up proper range tracking for free_*, that way I
can move the flush_tlb_range() from tlb_end_vma() and into
tlb_flush_mmu().
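
To make that concrete, here is a minimal sketch of the generic flow just described (the field names follow asm-generic/tlb.h conventions, but treat the bodies as an approximation of the proposed code, not the final patches):

	static inline void
	tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
	{
		if (!tlb->fullmm && tlb->need_flush) {
			/* flush only what zap_*_range() actually touched */
			flush_tlb_range(vma, tlb->range_start, tlb->range_end);
			tlb->need_flush = 0;
		}
	}

	#define pte_free_tlb(tlb, ptep, addr)				\
		do {							\
			(tlb)->need_flush = 1;	/* a page-table page went away */ \
			__pte_free_tlb(tlb, ptep, addr);		\
		} while (0)

	static inline void
	tlb_finish_mmu(struct mmu_gather *tlb, unsigned long start, unsigned long end)
	{
		/* only still set here if free_pgtables()/free_pgd_range() ran */
		if (tlb->need_flush)
			tlb_flush(tlb);
		/* ... free batched pages, check_pgt_cache(), etc ... */
	}

So zap_page_range() on its own ends with the range flush from tlb_end_vma(), while unmap_region() and shift_arg_pages() additionally get the tlb_flush() from tlb_finish_mmu() on account of the freed page tables.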

2011-02-25 19:45:46

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Fri, 2011-02-25 at 19:04 +0100, Peter Zijlstra wrote:
> On Thu, 2011-02-24 at 17:34 +0100, Peter Zijlstra wrote:
> > On Thu, 2011-02-17 at 17:23 +0100, Peter Zijlstra wrote:
> > > plain text document attachment
> > > (peter_zijlstra-arm-preemptible_mmu_gather.patch)
> > > Fix up the arm mmu_gather code to conform to the new API.
> >
> > So akpm noted that this one doesn't apply anymore because of:
> >
> > commit 06824ba824b3e9f2fedb38bee79af0643198ed7f
> > Author: Russell King <[email protected]>
> > Date: Sun Feb 20 12:16:45 2011 +0000
> >
> > ARM: tlb: delay page freeing for SMP and ARMv7 CPUs
> >
> > We need to delay freeing any mapped page on SMP and ARMv7 systems to
> > ensure that the data is not accessed by other CPUs, or is used for
> > speculative prefetch with ARMv7. This includes not only mapped pages
> > but also pages used for the page tables themselves.
> >
> > This avoids races with the MMU/other CPUs accessing pages after they've
> > been freed but before we've invalidated the TLB.
> >
> > Signed-off-by: Russell King <[email protected]>
> >
> >
> > Which raises a nice point about shift_arg_pages() which calls
> > free_pgd_range(), the other architectures that look similar to arm in
> > this respect are ia64 and sh, do they suffer the same problem?
> >
> > It doesn't look hard to fold the requirements for this into the generic
> > tlb range support (patch 14 in this series).
>
> It looks like both ia64 and sh do indeed suffer there.
>
> I've pulled my generic range tracking to the head of the series so that
> I can convert ARM, IA64 and SH to generic tlb solving it for those.
>
> Russell, generic tlb doesn't look to need the extra logic you added for
> the fs/exec.c case, but please double check the patches when I post
> them.
>
> In short, tlb_end_vma() will call flush_tlb_range() on the tracked range
> and clear ->need_flush, so things like zap_page_range() will not then
> also call tlb_flush().
>
> In case of shift_arg_pages() and unmap_region() however we first call
> free_pgtables() which might end up calling p??_free_tlb() which will
> then set ->need_flush, and tlb_finish_mmu() will then end up calling
> tlb_flush().
>
> I'm not quite sure why you chose to add range tracking on
> pte_free_tlb(), the only affected code path seems to be unmap_region()
> where you'll use a flush_tlb_range(), but its buggy, the pte_free_tlb()
> range is much larger than 1 page, and if you do it there you also need
> it for all the other p??_free_tlb() functions.
>
> The tlb flush after freeing page-tables is needed for things like
> gup_fast() which needs to sync against them being freed.
>
> So the stuff I have now will try its best to track ranges on zap_* while
> clearing the page mapping, will use flush_cache_range() and
> flush_tlb_range(). But when it comes to tearing down the page-tables
> themselves we'll punt and use a full mm flush, which seems a waste of
> all that careful range tracking by zap_*.
>
> One possibility would be to add tlb_start/end_vma() in
> unmap_page_range(), except we don't need to flush the cache again, also,
> it would be nice to not have to flush on tlb_end_vma() but delay it all
> to tlb_finish_mmu() where possible.
>
> OK, let me try and hack up proper range tracking for free_*, that way I
> can move the flush_tlb_range() from tlb_end_vma() and into
> tlb_flush_mmu().

Grmbl.. so doing that would require flush_tlb_range() to take an mm, not
a vma, but tile and arm both use the vma->vm_flags & VM_EXEC test to avoid
flushing their i-tlbs.

I'm tempted to make them flush i-tlbs unconditionally as it's still
better than hitting an mm-wide tlb flush due to the page table free.

Ideas?

2011-02-25 19:59:33

by Hugh Dickins

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Fri, Feb 25, 2011 at 11:45 AM, Peter Zijlstra <[email protected]> wrote:

> Grmbl.. so doing that would require flush_tlb_range() to take an mm, not
> a vma, but tile and arm both use the vma->flags & VM_EXEC test to avoid
> flushing their i-tlbs.
>
> I'm tempted to make them flush i-tlbs unconditionally as its still
> better than hitting an mm wide tlb flush due to the page table free.
>
> Ideas?

What's wrong with using vma->vm_mm?

Hugh

2011-02-25 21:53:37

by Russell King

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Fri, Feb 25, 2011 at 07:04:43PM +0100, Peter Zijlstra wrote:
> I'm not quite sure why you chose to add range tracking on
> pte_free_tlb(), the only affected code path seems to be unmap_region()
> where you'll use a flush_tlb_range(), but its buggy, the pte_free_tlb()
> range is much larger than 1 page, and if you do it there you also need
> it for all the other p??_free_tlb() functions.

My reasoning is to do with the way the LPAE stuff works. For the
explanation below, I'm going to assume a 2 level page table system
for simplicity.

The first thing to realise is that if we have L2 entries, then we'll
have unmapped them first using the usual tlb shootdown interfaces.

However, when we're freeing the page tables themselves, we should
already have removed the L2 entries, so all we have are the L1 entries.
In most 'normal' processors, these aren't cached in any way.

However, with LPAE, these are cached. I'm told that any TLB flush for an
address which is covered by the L1 entry will cause that cached entry to
be invalidated.

So really this is about getting rid of cached L1 entries, and not the
usual TLB lookaside entries that you'd come to expect.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2011-02-28 11:45:44

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Fri, 2011-02-25 at 21:51 +0000, Russell King wrote:
> On Fri, Feb 25, 2011 at 07:04:43PM +0100, Peter Zijlstra wrote:
> > I'm not quite sure why you chose to add range tracking on
> > pte_free_tlb(), the only affected code path seems to be unmap_region()
> > where you'll use a flush_tlb_range(), but its buggy, the pte_free_tlb()
> > range is much larger than 1 page, and if you do it there you also need
> > it for all the other p??_free_tlb() functions.
>
> My reasoning is to do with the way the LPAE stuff works. For the
> explaination below, I'm going to assume a 2 level page table system
> for simplicity.
>
> The first thing to realise is that if we have L2 entries, then we'll
> have unmapped them first using the usual tlb shootdown interfaces.
>
> However, when we're freeing the page tables themselves, we should
> already have removed the L2 entries, so all we have are the L1 entries.
> In most 'normal' processors, these aren't cached in any way.
>
> Howver, with LPAE, these are cached. I'm told that any TLB flush for an
> address which is covered by the L1 entry will cause that cached entry to
> be invalidated.
>
> So really this is about getting rid of cached L1 entries, and not the
> usual TLB lookaside entries that you'd come to expect.


Right, so the normal case is:

unmap_region()
  tlb_gather_mmu()
  unmap_vmas()
    for (; vma; vma = vma->vm_next)
      unmap_page_range()
        tlb_start_vma() -> flush cache range
        zap_*_range()
          ptep_get_and_clear_full() -> batch/track external tlbs
          tlb_remove_tlb_entry() -> batch/track external tlbs
          tlb_remove_page() -> track range/batch page
        tlb_end_vma() -> flush tlb range

  [ for architectures that have hardware page table walkers
    concurrent faults can still load the page tables ]

  free_pgtables()
    while (vma)
      unlink_*_vma()
      free_*_range()
        *_free_tlb()
  tlb_finish_mmu()

free vmas


Now, if we want to track ranges _and_ have hardware page table walkers
(ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
because the hardware walker could have re-populated the cache after
clearing the PTEs but before freeing the page tables.

What ARM does is it retains the last vma pointer and tracks
pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
hacky.

Mostly because of shift_arg_pages(), where we have:

shift_arg_pages()
tlb_gather_mmu()
free_*_range()
tlb_finish_mmu()

For which ARM now punts and does a full tlb invalidate (no vma pointer).
But also because it drags along that vma pointer, which might not at all
match the range it's actually going to invalidate (and hence its vm_flags
might not accurately reflect things -- at worst more expensive than
needed).

The reason I wanted flush_tlb_range() to take an mm_struct and not the
current vm_area_struct is because we can avoid doing the
flush_tlb_range() from tlb_end_vma() and delay the thing until
tlb_finish_mmu() without having to resort to such games as above. We
could simply track the full range over all VMAs and free'd page-tables
and do one range invalidate.

ARM uses vm_flags & VM_EXEC to see if it also needs to invalidate
I-TLBs, and TILE uses VM_EXEC and VM_HUGETLB.

For the I-TLBs we could easily use
ptep_get_and_clear_full()/tlb_remove_tlb_entry() and see if any of the
cleared PTEs had their executable bit set (both ARM and TILE seem to have
such a PTE bit).

I'm not sure what we can do about TILE's VM_HUGETLB usage though, if it
needs explicit flushes for huge ptes it might just have to issue
multiple tlb invalidates and do them from tlb_start_vma()/tlb_end_vma().

So my current proposal for generic mmu_gather (not fully coded yet) is
to provide a number of CONFIG_goo switches:

CONFIG_HAVE_RCU_TABLE_FREE - for architectures that do not walk the
linux page tables in hardware (Sparc64, PowerPC, etc), and others where
TLB flushing isn't delayed by disabling IRQs (Xen, s390).

CONFIG_HAVE_MMU_GATHER_RANGE - will track start,end ranges from
tlb_remove_tlb_entry() and p*_free_tlb() and issue
flush_tlb_range(mm,start,end) instead of mm-wide invalidates.

CONFIG_HAVE_MMU_GATHER_ITLB - will use
ptep_get_and_clear_full()/tlb_remove_tlb_entry() to test pte_exec() and
issue flush_itlb_range(mm,start,end).

Then there is the optimization s390 wants, which is to do a full mm tlb
flush for fullmm (exit_mmap()) at tlb_gather_mmu() and never again after
that, since there is guaranteed no concurrency to poke at anything.
AFAICT that should work on all architectures so we can do that
unconditionally.

So the biggest problem with implementing the above is TILE, where we
need to figure out wth to do with its hugetlb stuff.

The second biggest problem is with ARM and TILE, where we'd need to
implement flush_itlb_range(). I've already got a patch for all other
architectures to convert flush_tlb_range() to mm_struct.
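
A rough sketch of what the HAVE_MMU_GATHER_RANGE / HAVE_MMU_GATHER_ITLB combination could look like, following the flush_tlb_range(mm,...) and flush_itlb_range() names proposed above; tlb_track_range() and the need_flush_i field are made up here for illustration, and pte_exec() is assumed to exist on the architectures that would select the I-TLB option:

	#define tlb_track_range(tlb, addr, len)				\
		do {							\
			if ((addr) < (tlb)->range_start)		\
				(tlb)->range_start = (addr);		\
			if ((addr) + (len) > (tlb)->range_end)		\
				(tlb)->range_end = (addr) + (len);	\
		} while (0)

	#define tlb_remove_tlb_entry(tlb, ptep, addr)			\
		do {							\
			tlb_track_range(tlb, addr, PAGE_SIZE);		\
			if (pte_exec(*(ptep)))				\
				(tlb)->need_flush_i = 1;		\
			__tlb_remove_tlb_entry(tlb, ptep, addr);	\
		} while (0)

	#define pte_free_tlb(tlb, ptep, addr)				\
		do {							\
			/* the freed pte page mapped PMD_SIZE of address space */ \
			tlb_track_range(tlb, (addr) & PMD_MASK, PMD_SIZE); \
			__pte_free_tlb(tlb, ptep, addr);		\
		} while (0)

	static inline void tlb_flush_mmu(struct mmu_gather *tlb)
	{
		if (tlb->range_end > tlb->range_start) {
			flush_tlb_range(tlb->mm, tlb->range_start, tlb->range_end);
			if (tlb->need_flush_i)
				flush_itlb_range(tlb->mm, tlb->range_start, tlb->range_end);
			tlb->range_start = TASK_SIZE;
			tlb->range_end = 0;
			tlb->need_flush_i = 0;
		}
		/* ... then free the batched pages ... */
	}

That keeps the careful ranges from zap_* usable across free_pgtables() as well, at the cost of the flush_tlb_range() signature change discussed above.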

2011-02-28 12:01:13

by Russell King

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> Right, so the normal case is:
>
> unmap_region()
> tlb_gather_mmu()

The fullmm argument is important here as it specifies the mode.
tlb_gather_mmu(, 0)

> unmap_vmas()
> for (; vma; vma = vma->vm_next)
> unmao_page_range()
> tlb_start_vma() -> flush cache range
> zap_*_range()
> ptep_get_and_clear_full() -> batch/track external tlbs
> tlb_remove_tlb_entry() -> batch/track external tlbs
> tlb_remove_page() -> track range/batch page
> tlb_end_vma() -> flush tlb range

tlb_finish_mmu() -> nothing

>
> [ for architectures that have hardware page table walkers
> concurrent faults can still load the page tables ]
>
> free_pgtables()

tlb_gather_mmu(, 1)

> while (vma)
> unlink_*_vma()
> free_*_range()
> *_free_tlb()
> tlb_finish_mmu()

tlb_finish_mmu() -> flush tlb mm

>
> free vmas

So this is all fine. Note that we *don't* use the range stuff here.

> Now, if we want to track ranges _and_ have hardware page table walkers
> (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> because the hardware walker could have re-populated the cache after
> clearing the PTEs but before freeing the page tables.

No. The hardware walker won't re-populate the TLB after the page table
entries have been cleared - where would it get this information from if
not from the page tables?

> What ARM does is it retains the last vma pointer and tracks
> pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> hacky.

It may be hacky but then the TLB shootdown interface is hacky too. We
don't keep the vma around to re-use after tlb_end_vma() - if you think
that then you misunderstand what's going on. The vma pointer is kept
around as a cheap way of allowing tlb_finish_mmu() to distinguish
between the unmap_region() mode and the shift_arg_pages() mode.

> Mostly because of shift_arg_pages(), where we have:
>
> shift_arg_pages()
> tlb_gather_mmu()

tlb_gather_mmu(, 0)

> free_*_range()
> tlb_finish_mmu()

tlb_finish_mmu() does nothing without the ARM change.
tlb_finish_mmu() -> flush_tlb_mm() with the ARM change.

And this is where the bug was - these page table entries could find
their way into the TLB and persist after they've been torn down.

> For which ARM now punts and does a full tlb invalidate (no vma pointer).
> But also because it drags along that vma pointer, which might not at all
> match the range its actually going to invalidate (and hence its vm_flags
> might not accurately reflect things -- at worst more expensive than
> needed).

Where do you get that from? Where exactly in the above code would the
VMA pointer get set? In this case, it will be NULL, so we do a
flush_tlb_mm() for this case. We have to - we don't have any VMA to
deal with at this point.

> The reason I wanted flush_tlb_range() to take an mm_struct and not the
> current vm_area_struct is because we can avoid doing the
> flush_tlb_range() from tlb_end_vma() and delay the thing until
> tlb_finish_mmu() without having to resort to such games as above. We
> could simply track the full range over all VMAs and free'd page-tables
> and do one range invalidate.

No. That's stupid. Consider the case where you have to loop one page
at a time over the range (we do on ARM.) If we ended up with your
suggestion above, that means we could potentially have to loop 4K at a
time over 3GB of address space. That's idiotic when we have an
instruction which can flush the entire TLB for a particular thread.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2011-02-28 12:09:26

by Russell King

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, Feb 28, 2011 at 11:59:07AM +0000, Russell King wrote:
> It may be hacky but then the TLB shootdown interface is hacky too. We
> don't keep the vma around to re-use after tlb_end_vma() - if you think
> that then you misunderstand what's going on. The vma pointer is kept
> around as a cheap way of allowing tlb_finish_mmu() to distinguish
> between the unmap_region() mode and the shift_arg_pages() mode.

As I think I mentioned, the TLB shootdown interface either needs rewriting
from scratch as it's currently a broken design, or it needs tlb_gather_mmu()
to take a proper mode argument, rather than this useless 'fullmm' argument
which only gives half the story.

The fact is that the interface has three modes, and distinguishing between
them requires a certain amount of black magic. Explicitly, the !fullmm
case has two modes, and it requires implementations to remember whether
tlb_start_vma() has been called before tlb_finish_mmu() or not.

Maybe this will help you understand the ARM implementation - this doesn't
change the functionality, but may make things clearer.

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index 82dfe5d..73fb813 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -54,7 +54,7 @@
struct mmu_gather {
struct mm_struct *mm;
unsigned int fullmm;
- struct vm_area_struct *vma;
+ unsigned int byvma;
unsigned long range_start;
unsigned long range_end;
unsigned int nr;
@@ -68,23 +68,18 @@ DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
* code is used:
* 1. Unmapping a range of vmas. See zap_page_range(), unmap_region().
* tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- * tlb->vma will be non-NULL.
+ * tlb->byvma will be true.
* 2. Unmapping all vmas. See exit_mmap().
* tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- * tlb->vma will be non-NULL. Additionally, page tables will be freed.
+ * tlb->byvma will be true. Additionally, page tables will be freed.
* 3. Unmapping argument pages. See shift_arg_pages().
* tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- * tlb->vma will be NULL.
+ * tlb->byvma will be false.
*/
static inline void tlb_flush(struct mmu_gather *tlb)
{
- if (tlb->fullmm || !tlb->vma)
+ if (tlb->fullmm || !tlb->byvma)
flush_tlb_mm(tlb->mm);
- else if (tlb->range_end > 0) {
- flush_tlb_range(tlb->vma, tlb->range_start, tlb->range_end);
- tlb->range_start = TASK_SIZE;
- tlb->range_end = 0;
- }
}

static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
@@ -113,7 +108,7 @@ tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)

tlb->mm = mm;
tlb->fullmm = full_mm_flush;
- tlb->vma = NULL;
+ tlb->byvma = 0;
tlb->nr = 0;

return tlb;
@@ -149,7 +144,7 @@ tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
if (!tlb->fullmm) {
flush_cache_range(vma, vma->vm_start, vma->vm_end);
- tlb->vma = vma;
+ tlb->byvma = 1;
tlb->range_start = TASK_SIZE;
tlb->range_end = 0;
}
@@ -158,8 +153,11 @@ tlb_start_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
static inline void
tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
- if (!tlb->fullmm)
- tlb_flush(tlb);
+ if (!tlb->fullmm && tlb->range_end > 0) {
+ flush_tlb_range(vma, tlb->range_start, tlb->range_end);
+ tlb->range_start = TASK_SIZE;
+ tlb->range_end = 0;
+ }
}

static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2011-02-28 12:21:07

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, 2011-02-28 at 11:59 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> > Right, so the normal case is:
> >
> > unmap_region()
> > tlb_gather_mmu()
>
> The fullmm argument is important here as it specifies the mode.

Well, unmap_region() always has that 0; I've mentioned the fullmm mode
separately below, it's in many ways the easiest case to deal with.

> tlb_gather_mmu(, 0)
>
> > unmap_vmas()
> > for (; vma; vma = vma->vm_next)
> > unmao_page_range()
> > tlb_start_vma() -> flush cache range
> > zap_*_range()
> > ptep_get_and_clear_full() -> batch/track external tlbs
> > tlb_remove_tlb_entry() -> batch/track external tlbs
> > tlb_remove_page() -> track range/batch page
> > tlb_end_vma() -> flush tlb range
>
> tlb_finish_mmu() -> nothing
>
> >
> > [ for architectures that have hardware page table walkers
> > concurrent faults can still load the page tables ]
> >
> > free_pgtables()
>
> tlb_gather_mmu(, 1)
>
> > while (vma)
> > unlink_*_vma()
> > free_*_range()
> > *_free_tlb()
> > tlb_finish_mmu()
>
> tlb_finish_mmu() -> flush tlb mm
>
> >
> > free vmas
>
> So this is all fine. Note that we *don't* use the range stuff here.
>
> > Now, if we want to track ranges _and_ have hardware page table walkers
> > (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> > flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> > than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> > because the hardware walker could have re-populated the cache after
> > clearing the PTEs but before freeing the page tables.
>
> No. The hardware walker won't re-populate the TLB after the page table

Never said it would repopulate the TLB, just said it could repopulate
your cache thing and that it might still walk the page tables.

> entries have been cleared - where would it get this information from if
> not from the page tables?
>
> > What ARM does is it retains the last vma pointer and tracks
> > pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> > hacky.
>
> It may be hacky but then the TLB shootdown interface is hacky too. We
> don't keep the vma around to re-use after tlb_end_vma() - if you think
> that then you misunderstand what's going on. The vma pointer is kept
> around as a cheap way of allowing tlb_finish_mmu() to distinguish
> between the unmap_region() mode and the shift_arg_pages() mode.

Well, you most certainly use it in the unmap_region() case above.
tlb_end_vma() will do a flush_tlb_range(), but then your
__pte_free_tlb() will also track the range, and tlb_finish_mmu() will
then again issue a flush_tlb_range() using the last vma pointer.

You argued you need that second tlb flush to flush your cached level-1
entries for your LPAE mode (btw arm sucks for having all those docs
private).

> > Mostly because of shift_arg_pages(), where we have:
> >
> > shift_arg_pages()
> > tlb_gather_mmu()
>
> tlb_gather_mmu(, 0)
>
> > free_*_range()
> > tlb_finish_mmu()
>
> tlb_finish_mmu() does nothing without the ARM change.
> tlb_finish_mmu() -> flush_tlb_mm() with the ARM change.
>
> And this is where the bug was - these page table entries could find
> their way into the TLB and persist after they've been torn down.

Sure, I got that, you punt and do a full mm tlb invalidate (IA64 and SH
seem similarly challenged).

> > For which ARM now punts and does a full tlb invalidate (no vma pointer).
> > But also because it drags along that vma pointer, which might not at all
> > match the range its actually going to invalidate (and hence its vm_flags
> > might not accurately reflect things -- at worst more expensive than
> > needed).
>
> Where do you get that from? Where exactly in the above code would the
> VMA pointer get set? In this case, it will be NULL, so we do a
> flush_tlb_mm() for this case. We have to - we don't have any VMA to
> deal with at this point.

unmap_region()'s last tlb_start_vma(), with __pte_free_tlb() tracking
the range, will then get tlb_finish_mmu() to issue a second
flush_tlb_range().

> > The reason I wanted flush_tlb_range() to take an mm_struct and not the
> > current vm_area_struct is because we can avoid doing the
> > flush_tlb_range() from tlb_end_vma() and delay the thing until
> > tlb_finish_mmu() without having to resort to such games as above. We
> > could simply track the full range over all VMAs and free'd page-tables
> > and do one range invalidate.
>
> No. That's stupid. Consider the case where you have to loop one page
> at a time over the range (we do on ARM.) If we ended up with your
> suggestion above, that means we could potentially have to loop 4K at a
> time over 3GB of address space. That's idiotic when we have an
> instruction which can flush the entire TLB for a particular thread.

*blink* so you've implemented flush_tlb_range() as an iteration of
single page invalidates?

x86 could have done the same, I think; instead we chose to implement it
as a full mm invalidate simply because that's way cheaper in general.

You could also put a threshold in, if (end-start) >> PAGE_SHIFT > n,
flush everything if you want.

Anyway, I don't see how that's related to the I-TLB thing?

2011-02-28 12:26:28

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, 2011-02-28 at 12:06 +0000, Russell King wrote:
>
> As I think I mentioned, the TLB shootdown interface either needs rewriting
> from scratch as its currently a broken design, or it needs tlb_gather_mmu()
> to take a proper mode argument, rather than this useless 'fullmm' argument
> which only gives half the story.
>
> The fact is that the interface has three modes, and distinguishing between
> them requires a certain amount of black magic. Explicitly, the !fullmm
> case has two modes, and it requires implementations to remember whether
> tlb_start_vma() has been called before tlb_finish_mm() or not.
>
> Maybe this will help you understand the ARM implementation - this doesn't
> change the functionality, but may make things clearer.

I've actually implemented that, but it didn't really help much.

Mostly because you want your TLB flush to be after freeing the
page-tables, not before it.

So I want to avoid having to flush at tlb_end_vma() _and_ at
tlb_finish_mmu(), and doing that needs a flush_tlb_range() that doesn't
need a vma.

ARM also does the whole IPI thing on TLB flush, so a gup_fast()
implementation for arm would also need that TLB flush after page-table
tear-down, not on tlb_end_vma().

And once you want a single TLB invalidate, it doesn't matter if you want
to track ranges for p*_free_tlb() too.

2011-02-28 12:29:47

by Russell King

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, Feb 28, 2011 at 01:20:12PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 11:59 +0000, Russell King wrote:
> > On Mon, Feb 28, 2011 at 12:44:47PM +0100, Peter Zijlstra wrote:
> > > Right, so the normal case is:
> > >
> > > unmap_region()
> > > tlb_gather_mmu()
> >
> > The fullmm argument is important here as it specifies the mode.
>
> well, unmap_region always has that 0, I've mentioned the fullmm mode
> separately below, its in many way the easiest case to deal with.
>
> > tlb_gather_mmu(, 0)
> >
> > > unmap_vmas()
> > > for (; vma; vma = vma->vm_next)
> > > unmao_page_range()
> > > tlb_start_vma() -> flush cache range
> > > zap_*_range()
> > > ptep_get_and_clear_full() -> batch/track external tlbs
> > > tlb_remove_tlb_entry() -> batch/track external tlbs
> > > tlb_remove_page() -> track range/batch page
> > > tlb_end_vma() -> flush tlb range
> >
> > tlb_finish_mmu() -> nothing
> >
> > >
> > > [ for architectures that have hardware page table walkers
> > > concurrent faults can still load the page tables ]
> > >
> > > free_pgtables()
> >
> > tlb_gather_mmu(, 1)
> >
> > > while (vma)
> > > unlink_*_vma()
> > > free_*_range()
> > > *_free_tlb()
> > > tlb_finish_mmu()
> >
> > tlb_finish_mmu() -> flush tlb mm
> >
> > >
> > > free vmas
> >
> > So this is all fine. Note that we *don't* use the range stuff here.
> >
> > > Now, if we want to track ranges _and_ have hardware page table walkers
> > > (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> > > flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> > > than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> > > because the hardware walker could have re-populated the cache after
> > > clearing the PTEs but before freeing the page tables.
> >
> > No. The hardware walker won't re-populate the TLB after the page table
>
> Never said it would repopulate the TLB, just said it could repopulate
> your cache thing and that it might still walk the page tables.
>
> > entries have been cleared - where would it get this information from if
> > not from the page tables?
> >
> > > What ARM does is it retains the last vma pointer and tracks
> > > pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> > > hacky.
> >
> > It may be hacky but then the TLB shootdown interface is hacky too. We
> > don't keep the vma around to re-use after tlb_end_vma() - if you think
> > that then you misunderstand what's going on. The vma pointer is kept
> > around as a cheap way of allowing tlb_finish_mmu() to distinguish
> > between the unmap_region() mode and the shift_arg_pages() mode.
>
> Well, you most certainly use it in the unmap_region() case above.
> tlb_end_vma() will do a flush_tlb_range(), but then your
> __pte_free_tlb() will also track range and the tlb_finish_mmu() will
> then again issue a flush_tlb_range() using the last vma pointer.

Can you point out where pte_free_tlb() is used with unmap_region()?

> unmap_region()'s last tlb_start_vma(), with __pte_free_tlb() tracking
> range will then get tlb_finish_mmu() to issue a second
> flush_tlb_range().

I don't think it will because afaics pte_free_tlb() is never called in
the unmap_region() case.

> > No. That's stupid. Consider the case where you have to loop one page
> > at a time over the range (we do on ARM.) If we ended up with your
> > suggestion above, that means we could potentially have to loop 4K at a
> > time over 3GB of address space. That's idiotic when we have an
> > instruction which can flush the entire TLB for a particular thread.
>
> *blink* so you've implemented flush_tlb_range() as an iteration of
> single page invalidates?

Yes, because flush_tlb_range() is used at most over one VMA, which
typically will not be in the GB range, but a few MB at most.

> Anyway, I don't see how that's related to the I-TLB thing?

It's all related because I don't think you understand what's going on
here properly yet, and as such are getting rather mixed up and confused
about when flush_tlb_range() is called. As such, the whole
does-it-take-vma-or-mm argument is irrelevant, and therefore so is
the I-TLB stuff.

I put to you that pte_free_tlb() is not called in unmap_vmas(), and
as such the double-tlb-invalidate you talk about can't happen.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2011-02-28 12:50:02

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, 2011-02-28 at 12:28 +0000, Russell King wrote:
> Can you point out where pte_free_tlb() is used with unmap_region()?

unmap_region()
  free_pgtables()
    free_pgd_range()
      free_pud_range()
        free_pmd_range()
          free_pte_range()
            pte_free_tlb()


2011-02-28 12:52:44

by Russell King

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, Feb 28, 2011 at 01:49:02PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 12:28 +0000, Russell King wrote:
> > Can you point out where pte_free_tlb() is used with unmap_region()?
>
> unmap_region()
> free_pgtables()
> free_pgd_range()
> free_pud_range()
> free_pmd_range()
> free_pte_range()
> pte_free_tlb()

Damn it. Okay, I give up with this. The TLB shootdown interface is
total crap.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2011-02-28 13:04:16

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, 2011-02-28 at 12:50 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 01:49:02PM +0100, Peter Zijlstra wrote:
> > On Mon, 2011-02-28 at 12:28 +0000, Russell King wrote:
> > > Can you point out where pte_free_tlb() is used with unmap_region()?
> >
> > unmap_region()
> > free_pgtables()
> > free_pgd_range()
> > free_pud_range()
> > free_pmd_range()
> > free_pte_range()
> > pte_free_tlb()
>
> Damn it. Okay, I give up with this. The TLB shootdown interface is
> total crap.

:-)

There's a reason I'd like to make everybody use asm-generic/tlb.h and
unify all the crazy bits. Once there's common code everybody is forced
to think about this stuff instead of endlessly hacking their own
architecture to make it work without consideration for the rest of us.

Furthermore, I don't think it's actually too hard to do... (famous last
words).

2011-02-28 14:19:55

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
> unmap_region()
> tlb_gather_mmu()
> unmap_vmas()
> for (; vma; vma = vma->vm_next)
> unmao_page_range()
> tlb_start_vma() -> flush cache range

So why is this correct? Can't we race with a concurrent access to the
memory region (munmap() vs other thread access race)? While
unmap_region() callers will have removed the vma from the tree so faults
will not be satisfied, TLBs might still be present and allow us to
access the memory and thereby reload it into the cache.

> zap_*_range()
> ptep_get_and_clear_full() -> batch/track external tlbs
> tlb_remove_tlb_entry() -> batch/track external tlbs
> tlb_remove_page() -> track range/batch page
> tlb_end_vma() -> flush tlb range
>
> [ for architectures that have hardware page table walkers
> concurrent faults can still load the page tables ]
>
> free_pgtables()
> while (vma)
> unlink_*_vma()
> free_*_range()
> *_free_tlb()
> tlb_finish_mmu()
>
> free vmas

2011-02-28 14:59:43

by Russell King

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, Feb 28, 2011 at 03:18:47PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
> > unmap_region()
> > tlb_gather_mmu()
> > unmap_vmas()
> > for (; vma; vma = vma->vm_next)
> > unmao_page_range()
> > tlb_start_vma() -> flush cache range
>
> So why is this correct? Can't we race with a concurrent access to the
> memory region (munmap() vs other thread access race)? While
> unmap_region() callers will have removed the vma from the tree so faults
> will not be satisfied, TLBs might still be present and allow us to
> access the memory and thereby reloading it in the cache.

It is my understanding that code sections between tlb_gather_mmu() and
tlb_finish_mmu() are non-preemptible - that was the case once upon a
time when this stuff first appeared. If that's changed then that
change has introduced an unnoticed bug.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2011-02-28 15:04:30

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, 2011-02-28 at 14:57 +0000, Russell King wrote:
> On Mon, Feb 28, 2011 at 03:18:47PM +0100, Peter Zijlstra wrote:
> > On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
> > > unmap_region()
> > > tlb_gather_mmu()
> > > unmap_vmas()
> > > for (; vma; vma = vma->vm_next)
> > > unmao_page_range()
> > > tlb_start_vma() -> flush cache range
> >
> > So why is this correct? Can't we race with a concurrent access to the
> > memory region (munmap() vs other thread access race)? While
> > unmap_region() callers will have removed the vma from the tree so faults
> > will not be satisfied, TLBs might still be present and allow us to
> > access the memory and thereby reloading it in the cache.
>
> It is my understanding that code sections between tlb_gather_mmu() and
> tlb_finish_mmu() are non-preemptible - that was the case once upon a
> time when this stuff first appeared.

It is still so, but that doesn't help with SMP. The case mentioned above
has two threads running, one doing munmap() and the other poking at
the memory being unmapped.

Afaict, even when it's all non-preemptible, the remote cpu can
re-populate the cache you just flushed through existing TLB entries.

> If that's changed then that change has introduced an unnoticed bug.

I've got such a patch-set pending, but I cannot see how that would
change the semantics other than that the above race becomes possible on
a single CPU.


2011-02-28 15:17:31

by Russell King

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Mon, Feb 28, 2011 at 04:05:48PM +0100, Peter Zijlstra wrote:
> On Mon, 2011-02-28 at 14:57 +0000, Russell King wrote:
> > On Mon, Feb 28, 2011 at 03:18:47PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2011-02-28 at 12:44 +0100, Peter Zijlstra wrote:
> > > > unmap_region()
> > > > tlb_gather_mmu()
> > > > unmap_vmas()
> > > > for (; vma; vma = vma->vm_next)
> > > > unmao_page_range()
> > > > tlb_start_vma() -> flush cache range
> > >
> > > So why is this correct? Can't we race with a concurrent access to the
> > > memory region (munmap() vs other thread access race)? While
> > > unmap_region() callers will have removed the vma from the tree so faults
> > > will not be satisfied, TLBs might still be present and allow us to
> > > access the memory and thereby reloading it in the cache.
> >
> > It is my understanding that code sections between tlb_gather_mmu() and
> > tlb_finish_mmu() are non-preemptible - that was the case once upon a
> > time when this stuff first appeared.
>
> It is still so, but that doesn't help with SMP. The case mentioned above
> has two threads running, one doing munmap() and the other is poking at
> the memory being unmapped.

Luckily it's a no-op on SMP-capable CPUs (and actually is also a no-op
on any PIPT or VIPT ARM CPU).

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:

2011-03-01 22:05:43

by Chris Metcalf

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On 2/28/2011 6:44 AM, Peter Zijlstra wrote:
> [...]
> Now, if we want to track ranges _and_ have hardware page table walkers
> (ARM seems to be one such), we must flush TLBs at tlb_end_vma() because
> flush_tlb_range() requires a vma pointer (ARM and TILE actually use more
> than ->vm_mm), and on tlb_finish_mmu() issue a full mm wide invalidate
> because the hardware walker could have re-populated the cache after
> clearing the PTEs but before freeing the page tables.
>
> What ARM does is it retains the last vma pointer and tracks
> pte_free_tlb() range and uses that in tlb_finish_mmu(), which is a tad
> hacky.
>
> Mostly because of shift_arg_pages(), where we have:
>
> shift_arg_pages()
> tlb_gather_mmu()
> free_*_range()
> tlb_finish_mmu()
>
> For which ARM now punts and does a full tlb invalidate (no vma pointer).
> But also because it drags along that vma pointer, which might not at all
> match the range its actually going to invalidate (and hence its vm_flags
> might not accurately reflect things -- at worst more expensive than
> needed).
>
> The reason I wanted flush_tlb_range() to take an mm_struct and not the
> current vm_area_struct is because we can avoid doing the
> flush_tlb_range() from tlb_end_vma() and delay the thing until
> tlb_finish_mmu() without having to resort to such games as above. We
> could simply track the full range over all VMAs and free'd page-tables
> and do one range invalidate.
>
> ARM uses vm_flags & VM_EXEC to see if it also needs to invalidate
> I-TLBs, and TILE uses VM_EXEC and VM_HUGETLB.
>
> For the I-TLBs we could easily use
> ptep_get_and_clear_full()/tlb_remove_tlb_entry() and see if any of the
> cleared pte's had its executable bit set (both ARM and TILE seem to have
> such a PTE bit).

For Tile, the concern is that we want to make sure to invalidate the
i-cache. The I-TLB is handled by the regular TLB flush just fine, like the
other architectures. So our concern is that once we have cleared the page
table entries and invalidated the TLBs, we still have to deal with i-cache
lines in any core that may have run code from that page. The risk is that
the kernel might free, reallocate, and then run code from one of those
pages, all before the stale i-cache lines happened to be evicted.

The current Tile code flushes the icache explicitly at two different times:

1. Whenever we flush the TLB, since this is one time when we know who might
currently be using the page (via cpu_vm_mask) and we can flush all of them
easily, piggybacking on the infrastructure we use to flush remote TLBs.

2. Whenever we context switch, to handle the case where cpu 1 is running
process A, then switches to B, but another cpu still running process A
unmaps an executable page that was in cpu 1's icache. This way when cpu 1
switches back to A, it doesn't have to worry about any unmaps that occurred
while it was switched out.


> I'm not sure what we can do about TILE's VM_HUGETLB usage though, if it
> needs explicit flushes for huge ptes it might just have to issue
> multiple tlb invalidates and do them from tlb_start_vma()/tlb_end_vma().

I'm not too concerned about this. We can make the flush code check both
page sizes at a small cost in efficiency, relative to the overall cost of
global TLB invalidation.

> CONFIG_HAVE_MMU_GATHER_ITLB - will use
> ptep_get_and_clear_full()/tlb_remove_tlb_entry() to test pte_exec() and
> issue flush_itlb_range(mm,start,end).

So it sounds like the proposal for tile would be to piggy-back on
flush_itlb_range() and use it to flush the i-cache? It does seem like
there must be other Linux architectures with an incoherent i-cache out
there, and some existing solution we could just repurpose.

--
Chris Metcalf, Tilera Corp.
http://www.tilera.com

2011-03-02 10:52:42

by Peter Zijlstra

Subject: Re: [PATCH 06/17] arm: mmu_gather rework

On Tue, 2011-03-01 at 17:05 -0500, Chris Metcalf wrote:

> For Tile, the concern is that we want to make sure to invalidate the
> i-cache. The I-TLB is handled by the regular TLB flush just fine, like the
> other architectures. So our concern is that once we have cleared the page
> table entries and invalidated the TLBs, we still have to deal with i-cache
> lines in any core that may have run code from that page. The risk is that
> the kernel might free, reallocate, and then run code from one of those
> pages, all before the stale i-cache lines happened to be evicted.

From reading Documentation/cachetlb.txt, update_mmu_cache() can be used
to flush the i-cache whenever you install a pte with executable permissions,
and covers the particular case you mention above.

DaveM, any comment? You seem to be the one who wrote that document :-)
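
To make the cachetlb.txt suggestion concrete, a deferred i-cache flush hooked into update_mmu_cache() could look roughly like this (a sketch only; pte_exec() and the exact prototype are assumptions here, not tile code):

	void update_mmu_cache(struct vm_area_struct *vma,
			      unsigned long address, pte_t *ptep)
	{
		/*
		 * A translation was just installed; if it is executable,
		 * make sure stale i-cache lines for this page are gone
		 * before user space can run code through the new mapping.
		 */
		if (pte_exec(*ptep))
			flush_icache_page(vma, pte_page(*ptep));
	}

i.e. the i-cache is flushed lazily when an executable mapping is (re-)installed, instead of on every TLB shootdown.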

> The current Tile code flushes the icache explicitly at two different times:
>
> 1. Whenever we flush the TLB, since this is one time when we know who might
> currently be using the page (via cpu_vm_mask) and we can flush all of them
> easily, piggybacking on the infrastructure we use to flush remote TLBs.
>
> 2. Whenever we context switch, to handle the case where cpu 1 is running
> process A, then switches to B, but another cpu still running process A
> unmaps an executable page that was in cpu 1's icache. This way when cpu 1
> switches back to A, it doesn't have to worry about any unmaps that occurred
> while it was switched out.
>
>
> > I'm not sure what we can do about TILE's VM_HUGETLB usage though, if it
> > needs explicit flushes for huge ptes it might just have to issue
> > multiple tlb invalidates and do them from tlb_start_vma()/tlb_end_vma().
>
> I'm not too concerned about this. We can make the flush code check both
> page sizes at a small cost in efficiency, relative to the overall cost of
> global TLB invalidation.

OK, that's basically what I made it do now:

Index: linux-2.6/arch/tile/kernel/tlb.c
===================================================================
--- linux-2.6.orig/arch/tile/kernel/tlb.c
+++ linux-2.6/arch/tile/kernel/tlb.c
@@ -64,14 +64,13 @@ void flush_tlb_page(const struct vm_area
}
EXPORT_SYMBOL(flush_tlb_page);

-void flush_tlb_range(const struct vm_area_struct *vma,
+void flush_tlb_range(const struct mm_struct *mm,
unsigned long start, unsigned long end)
{
- unsigned long size = hv_page_size(vma);
- struct mm_struct *mm = vma->vm_mm;
- int cache = (vma->vm_flags & VM_EXEC) ? HV_FLUSH_EVICT_L1I : 0;
- flush_remote(0, cache, &mm->cpu_vm_mask, start, end - start, size,
- &mm->cpu_vm_mask, NULL, 0);
+ flush_remote(0, HV_FLUSH_EVICT_L1I, &mm->cpu_vm_mask,
+ start, end - start, PAGE_SIZE, &mm->cpu_vm_mask, NULL, 0);
+ flush_remote(0, 0, &mm->cpu_vm_mask,
+ start, end - start, HPAGE_SIZE, &mm->cpu_vm_mask, NULL, 0);
}

And I guess that if the update_mmu_cache() thing works out we can remove
the HV_FLUSH_EVICT_L1I thing.