2014-12-03 03:25:49

by Leonid Yegoshin

[permalink] [raw]
Subject: [PATCH] Revert "MIPS: Remove race window in page fault handling"

This reverts commit 64f23ab30b1fec86b91a48bf1f088840e5299971
(commit 2a4a8b1e5d9d343e13ff22e19af7b353f7b52d6f in Linux 'master')

Unfortunately the original commit effectively kills Imagination MIPS CPUs
performance because it enforces page cache flush each time then PTE is changed
for our CPUs. Basically - each CoW, each process start, each library attachment
or detachment. And it happens even in "fully-coherent" systems (in MIPS sense -
we have non coherent I-cache). Last tendency for increasing page size to 16K-64K
even exaggerate the performance loss.

Original commit discussion says:

Kernel call stacks showed one thread handling an illegal instruction
exception while another thread was somewhere around the
set_pte_at/update_mmu_cache calls for the same page.

If this is taken correct then it points to a major problem unrelated to MIPS -
if by chance a first thread hits the page just before set_pte_at then it should
get a non-present PTE: set_pte_at with _PAGE_PRESENT/_PAGE_VALID flags can be
used only on "disactivated" PTE, after pte_clear* functions, effectively -
non-present, non-valid. And application can get SIGSEGV. It is a major race
condition and kernel should not offer it to applications. And in my
understanding set_pte_at is used in cases then page is available after kernel
finishes page handling, at least in theory.

Test note: I ran MIPS 1004K with 8 VPEs on 4 cores around 1.5 month on SOAK test
and never seen that problem on 3.10 kernel.

Signed-off-by: Leonid Yegoshin <[email protected]>
---
arch/mips/include/asm/pgtable.h | 8 +++++---
arch/mips/mm/cache.c | 27 ++++++++-------------------
2 files changed, 13 insertions(+), 22 deletions(-)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index d6d1928539b1..cd5fbf42b864 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -122,9 +122,6 @@ do { \
} \
} while(0)

-extern void set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
- pte_t pteval);
-
#if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32)

#define pte_none(pte) (!(((pte).pte_low | (pte).pte_high) & ~_PAGE_GLOBAL))
@@ -148,6 +145,7 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
}
}
}
+#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)

static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
@@ -185,6 +183,7 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
}
#endif
}
+#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)

static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
@@ -401,12 +400,15 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)

extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
pte_t pte);
+extern void __update_cache(struct vm_area_struct *vma, unsigned long address,
+ pte_t pte);

static inline void update_mmu_cache(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
{
pte_t pte = *ptep;
__update_tlb(vma, address, pte);
+ __update_cache(vma, address, pte);
}

static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c
index 7e3ea7766822..f7b91d3a371d 100644
--- a/arch/mips/mm/cache.c
+++ b/arch/mips/mm/cache.c
@@ -119,36 +119,25 @@ void __flush_anon_page(struct page *page, unsigned long vmaddr)

EXPORT_SYMBOL(__flush_anon_page);

-static void mips_flush_dcache_from_pte(pte_t pteval, unsigned long address)
+void __update_cache(struct vm_area_struct *vma, unsigned long address,
+ pte_t pte)
{
struct page *page;
- unsigned long pfn = pte_pfn(pteval);
+ unsigned long pfn, addr;
+ int exec = (vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc;

+ pfn = pte_pfn(pte);
if (unlikely(!pfn_valid(pfn)))
return;
-
page = pfn_to_page(pfn);
if (page_mapping(page) && Page_dcache_dirty(page)) {
- unsigned long page_addr = (unsigned long) page_address(page);
-
- if (!cpu_has_ic_fills_f_dc ||
- pages_do_alias(page_addr, address & PAGE_MASK))
- flush_data_cache_page(page_addr);
+ addr = (unsigned long) page_address(page);
+ if (exec || pages_do_alias(addr, address & PAGE_MASK))
+ flush_data_cache_page(addr);
ClearPageDcacheDirty(page);
}
}

-void set_pte_at(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pteval)
-{
- if (cpu_has_dc_aliases || !cpu_has_ic_fills_f_dc) {
- if (pte_present(pteval))
- mips_flush_dcache_from_pte(pteval, addr);
- }
-
- set_pte(ptep, pteval);
-}
-
unsigned long _page_cachable_default;
EXPORT_SYMBOL(_page_cachable_default);


2014-12-03 09:40:10

by Lars Persson

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

Hi Leonid

First let me describe the mechanism of this race condition, which was a
fault in the kernel's MIPS architecture code. Specifically in its
implementation of lazy dcache flushing. AFAIK, it would only hit on
systems where the pagein code path writes to the page from the CPU.

The order of calls is:
flush_dcache_page() (from the FS's readpage)
set_pte_at()
update_mmu_cache()

The thread number one has executed the set_pte_at() when thread number
two hits the same page. It finds a valid PTE and proceeds to execute
code from a page that is not yet flushed to the point of I/D coherency.
That flush would happen in update_mmu_cache().

My patch does increase number of cache flushes for CoW yes and there
could be an optimization opportunity by playing tricks with the pte_t to
include information about executability of the mapping.

Reverting the patch is a big no-no, then we go back to a state of
undefined CPU behavior.

- Lars


On Wed, 2014-12-03 at 04:25 +0100, Leonid Yegoshin wrote:
> This reverts commit 64f23ab30b1fec86b91a48bf1f088840e5299971
> (commit 2a4a8b1e5d9d343e13ff22e19af7b353f7b52d6f in Linux 'master')
>
> Unfortunately the original commit effectively kills Imagination MIPS CPUs
> performance because it enforces page cache flush each time then PTE is changed
> for our CPUs. Basically - each CoW, each process start, each library attachment
> or detachment. And it happens even in "fully-coherent" systems (in MIPS sense -
> we have non coherent I-cache). Last tendency for increasing page size to 16K-64K
> even exaggerate the performance loss.
>
> Original commit discussion says:
>
> Kernel call stacks showed one thread handling an illegal instruction
> exception while another thread was somewhere around the
> set_pte_at/update_mmu_cache calls for the same page.
>
> If this is taken correct then it points to a major problem unrelated to MIPS -
> if by chance a first thread hits the page just before set_pte_at then it should
> get a non-present PTE: set_pte_at with _PAGE_PRESENT/_PAGE_VALID flags can be
> used only on "disactivated" PTE, after pte_clear* functions, effectively -
> non-present, non-valid. And application can get SIGSEGV. It is a major race
> condition and kernel should not offer it to applications. And in my
> understanding set_pte_at is used in cases then page is available after kernel
> finishes page handling, at least in theory.
>
> Test note: I ran MIPS 1004K with 8 VPEs on 4 cores around 1.5 month on SOAK test
> and never seen that problem on 3.10 kernel.
>
> Signed-off-by: Leonid Yegoshin <[email protected]>
>
>


2014-12-03 13:24:37

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

On Wed, Dec 03, 2014 at 10:31:44AM +0100, Lars Persson wrote:

> Hi Leonid
>
> First let me describe the mechanism of this race condition, which was a
> fault in the kernel's MIPS architecture code. Specifically in its
> implementation of lazy dcache flushing. AFAIK, it would only hit on
> systems where the pagein code path writes to the page from the CPU.
>
> The order of calls is:
> flush_dcache_page() (from the FS's readpage)
> set_pte_at()
> update_mmu_cache()
>
> The thread number one has executed the set_pte_at() when thread number
> two hits the same page. It finds a valid PTE and proceeds to execute
> code from a page that is not yet flushed to the point of I/D coherency.
> That flush would happen in update_mmu_cache().
>
> My patch does increase number of cache flushes for CoW yes and there
> could be an optimization opportunity by playing tricks with the pte_t to
> include information about executability of the mapping.
>
> Reverting the patch is a big no-no, then we go back to a state of
> undefined CPU behavior.

The performance issues of this patch were fairly obvious when I applied
the patch. At that time I choose correctness over performance. But it
needs proper sorting. Too massive performance impact also is a bug and
Leonid's sledgehammer approach to revoke the patch outright without
anything better to replace it is not the right way either!

Ralf

2014-12-03 14:03:27

by Lars Persson

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

It is the flush_dcache_page() that was called from the file-system
reading the page contents into memory.

- Lars

On Wed, 2014-12-03 at 14:42 +0100, Ralf Baechle wrote:
> Lars,
>
> normally set_pte_at() is invoked in a
>
> cache_flush_*()
> set_pte_at()
> tlb_flush_*()
>
> sequence. So I'm wondering if you're trying to fix something in set_pte_at
> that actually ought to be fixed in the cache_flush_*() function.
>
> I'm wondering, have you identified which cache flush function in particular
> was used in the sequence in your particular bug's case?
>
> Ralf


2014-12-03 14:09:17

by Ralf Baechle

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

Lars,

normally set_pte_at() is invoked in a

cache_flush_*()
set_pte_at()
tlb_flush_*()

sequence. So I'm wondering if you're trying to fix something in set_pte_at
that actually ought to be fixed in the cache_flush_*() function.

I'm wondering, have you identified which cache flush function in particular
was used in the sequence in your particular bug's case?

Ralf

2014-12-03 14:20:31

by Lars Persson

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

Ralf

I remember now that we have applied to our tree the proposed patch
titled "MIPS HIGHMEM fixes for cache aliasing and non-DMA I/O".

This patch changes the semantics of flush_dcache_page() by using
page_mapped() instead of mapping_mapped() to decide if the flush should
be lazy. Is it this change that makes us get lazy flushes for code
mappings and therefore exposing the problem ? The ARM port which has
made a similar change to set_pte_at() also uses page_mapped() to decide
if lazy flushing is possible.

If this is true, then upstream might not need my patch.

- Lars



On ons, 2014-12-03 at 14:42 +0100, Ralf Baechle wrote:
> Lars,
>
> normally set_pte_at() is invoked in a
>
> cache_flush_*()
> set_pte_at()
> tlb_flush_*()
>
> sequence. So I'm wondering if you're trying to fix something in set_pte_at
> that actually ought to be fixed in the cache_flush_*() function.
>
> I'm wondering, have you identified which cache flush function in particular
> was used in the sequence in your particular bug's case?
>
> Ralf


2014-12-03 19:28:53

by Leonid Yegoshin

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

Lars,

Do you have a stack trace or so then you found the second VPE between
set_pte_at and update_mmu_cache?
It would be interesting how it happens - generally, to get a consistent
SIGILL in applications due to misbehaviour of memory subsystem, the bug
in FS is not enough.

Hold on - do you use non-DMA file system?
If so, I advice you to try this simple patch:

Author: Leonid Yegoshin <[email protected]>
Date: Tue Apr 2 14:20:37 2013 -0700

MIPS: (opt) Fix of reading I-pages from non-DMA FS devices for ID
cache separation

This optional fix provides a D-cache flush for instruction code
pages on
page faults. In case of non-DMA block device a driver doesn't know
that it
reads I-page and doesn't flush D-cache generally on systems without
cache aliasing. And that takes toll during page fault of
instruction pages.

It is not a perfect fix, it should be considered as a temporary fix.
The permanent fix would track page origin in page cache and flushes
D-cache
during reception of page from driver only but not at each page fault.
It is not done yet.

Change-Id: I43f5943d6ce0509729179615f6b81e77803a34ac
Author: Leonid Yegoshin <[email protected]>
Signed-off-by: Leonid Yegoshin <[email protected]>(imported from
commit 6ebd22eb7a3d9873582ebe990a77094f971652ee)(imported from commit
0caf3b4a1eebb64572e81e4df6fdb3abf12c70

diff --git a/arch/mips/include/asm/cacheflush.h
b/arch/mips/include/asm/cacheflush.h
index 42e5fc682590..27b17b16a96d 100644
--- a/arch/mips/include/asm/cacheflush.h
+++ b/arch/mips/include/asm/cacheflush.h
@@ -61,6 +61,9 @@ static inline void flush_anon_page(struct
vm_area_struct *vma,
static inline void flush_icache_page(struct vm_area_struct *vma,
struct page *page)
{
+ if (cpu_has_dc_aliases ||
+ ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc))
+ __flush_dcache_page(page);
}

extern void (*flush_icache_range)(unsigned long start, unsigned long end);


It fixed crash problems with non-DMA FS in a couple of our customers.
Without it the non-DMA root FS crashes are catastrophic in aliasing
systems but it is still a problem for I-cache too but much rare.

Unfortunately, it is also a performance hit, however is less than run a
page cache flush at each PTE setup.

- Leonid.

On 12/03/2014 06:03 AM, Lars Persson wrote:
> It is the flush_dcache_page() that was called from the file-system
> reading the page contents into memory. - Lars On Wed, 2014-12-03 at
> 14:42 +0100, Ralf Baechle wrote:
>> Lars, normally set_pte_at() is invoked in a cache_flush_*()
>> set_pte_at() tlb_flush_*() sequence. So I'm wondering if you're
>> trying to fix something in set_pte_at that actually ought to be fixed
>> in the cache_flush_*() function. I'm wondering, have you identified
>> which cache flush function in particular was used in the sequence in
>> your particular bug's case? Ralf
>

2014-12-05 02:16:52

by Leonid Yegoshin

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

(repeat mesg, first one went to wrong place)

Lars,

Do you have a stack trace or so then you found the second VPE between
set_pte_at and update_mmu_cache?
It would be interesting how it happens - generally, to get a consistent
SIGILL in applications due to misbehaviour of memory subsystem, the bug
in FS is not enough.

Hold on - do you use non-DMA file system?
If so, I advice you to try this simple patch:

Author: Leonid Yegoshin <[email protected]>
Date: Tue Apr 2 14:20:37 2013 -0700

MIPS: (opt) Fix of reading I-pages from non-DMA FS devices for ID
cache separation

This optional fix provides a D-cache flush for instruction code
pages on
page faults. In case of non-DMA block device a driver doesn't know
that it
reads I-page and doesn't flush D-cache generally on systems without
cache aliasing. And that takes toll during page fault of
instruction pages.

It is not a perfect fix, it should be considered as a temporary fix.
The permanent fix would track page origin in page cache and flushes
D-cache
during reception of page from driver only but not at each page fault.
It is not done yet.

Change-Id: I43f5943d6ce0509729179615f6b81e77803a34ac
Author: Leonid Yegoshin <[email protected]>
Signed-off-by: Leonid Yegoshin <[email protected]>(imported from
commit 6ebd22eb7a3d9873582ebe990a77094f971652ee)(imported from commit
0caf3b4a1eebb64572e81e4df6fdb3abf12c70

arch/mips/include/asm/cacheflush.h:

@@ -61,6 +61,9 @@ static inline void flush_anon_page(struct
vm_area_struct *vma,
static inline void flush_icache_page(struct vm_area_struct *vma,
struct page *page)
{
+ if (cpu_has_dc_aliases ||
+ ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc))
+ __flush_dcache_page(page);
}

extern void (*flush_icache_range)(unsigned long start, unsigned
long end);


It fixed crash problems with non-DMA FS in a couple of our customers.
Without it the non-DMA root FS crashes are catastrophic in aliasing
systems but it is still a problem for I-cache too but much rare.

Unfortunately, it is also a performance hit, however is less than run a
page cache flush at each PTE setup. On 12/03/2014 06:03 AM, Lars Persson
wrote:
> It is the flush_dcache_page() that was called from the file-system
> reading the page contents into memory.
>
> - Lars
>
>

2014-12-05 09:33:12

by Lars Persson

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

Hi

Our setup includes both a non-DMA block device and a compressing
file-system (UBIFS). A flush_dcache_page() is issued by UBIFS so your
patch fixes another problem that we do not hit.

The stack trace is not available now. Do we need it for any further
analysis ? I think the mechanism of the race window is understood and it
depends on the __flush_dcache_page() deciding that the flush should be
postponed.

- Lars

On Fri, 2014-12-05 at 03:16 +0100, Leonid Yegoshin wrote:
> (repeat mesg, first one went to wrong place)
>
> Lars,
>
> Do you have a stack trace or so then you found the second VPE between
> set_pte_at and update_mmu_cache?
> It would be interesting how it happens - generally, to get a consistent
> SIGILL in applications due to misbehaviour of memory subsystem, the bug
> in FS is not enough.
>
> Hold on - do you use non-DMA file system?
> If so, I advice you to try this simple patch:
>
> Author: Leonid Yegoshin <[email protected]>
> Date: Tue Apr 2 14:20:37 2013 -0700
>
> MIPS: (opt) Fix of reading I-pages from non-DMA FS devices for ID
> cache separation
>
> This optional fix provides a D-cache flush for instruction code
> pages on
> page faults. In case of non-DMA block device a driver doesn't know
> that it
> reads I-page and doesn't flush D-cache generally on systems without
> cache aliasing. And that takes toll during page fault of
> instruction pages.
>
> It is not a perfect fix, it should be considered as a temporary fix.
> The permanent fix would track page origin in page cache and flushes
> D-cache
> during reception of page from driver only but not at each page fault.
> It is not done yet.
>
> Change-Id: I43f5943d6ce0509729179615f6b81e77803a34ac
> Author: Leonid Yegoshin <[email protected]>
> Signed-off-by: Leonid Yegoshin <[email protected]>(imported from
> commit 6ebd22eb7a3d9873582ebe990a77094f971652ee)(imported from commit
> 0caf3b4a1eebb64572e81e4df6fdb3abf12c70
>
> arch/mips/include/asm/cacheflush.h:
>
> @@ -61,6 +61,9 @@ static inline void flush_anon_page(struct
> vm_area_struct *vma,
> static inline void flush_icache_page(struct vm_area_struct *vma,
> struct page *page)
> {
> + if (cpu_has_dc_aliases ||
> + ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc))
> + __flush_dcache_page(page);
> }
>
> extern void (*flush_icache_range)(unsigned long start, unsigned
> long end);
>
>
> It fixed crash problems with non-DMA FS in a couple of our customers.
> Without it the non-DMA root FS crashes are catastrophic in aliasing
> systems but it is still a problem for I-cache too but much rare.
>
> Unfortunately, it is also a performance hit, however is less than run a
> page cache flush at each PTE setup. On 12/03/2014 06:03 AM, Lars Persson
> wrote:
> > It is the flush_dcache_page() that was called from the file-system
> > reading the page contents into memory.
> >
> > - Lars
> >
> >
>


2014-12-05 21:41:33

by Leonid Yegoshin

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

Lars,

On 12/05/2014 01:32 AM, Lars Persson wrote:
> Hi
>
> Our setup includes both a non-DMA block device and a compressing
> file-system (UBIFS). A flush_dcache_page() is issued by UBIFS so your
> patch fixes another problem that we do not hit.
>
> The stack trace is not available now. Do we need it for any further
> analysis ? I think the mechanism of the race window is understood and it
> depends on the __flush_dcache_page() deciding that the flush should be
> postponed.

Unfortunately, the research of original case is still needed.
I looked into all cases of update_mmu_cache() besides HUGE page support
and NUMA, and I see:

1. insert_pfn()
It is used to put a special page (read - VDSO) into memory map. No
cache flush is needed here because page is done and flushed during
system boot.

2. do_wp_page(), first occurrence
It has flush_cache_page() before it sets PTE in
ptep_set_access_flags(). This flush is unconditional and affects all caches.

3. do_wp_page(), second case
It is done after preparing a clear new page or after COW. COW has
an appropriate cache flush of destination in copy_user_highpage(). The
immediate use of cleared new page as instruction (you had SIGILL,
right?)... hm-m, something wrong in application in this case.

4. do_swap_page()
Well, it may be a case of flush_icache_page() is not used (see
below) and page is taken from non-DMA swap. But I also recommend to look
into

http://patchwork.linux-mips.org/patch/7615/

there is a bug in swap entry number presentation and it may affect your
system.

5. do_anonymous_page()
The similar to case (3) - cleared new page, using of it as
instruction page may point to some app problem.

6. do_set_pte()
It also has flush_icache_page() which may have impact if not
implemented, see below.

7. handle_pte_fault()
Page is not touched and cache flush is NO-OP.

8. remove_migration_pte()
Well, it is a place for suspicion. But it should not run in
parallel with any running thread - dirtying page while other thread is
running is a way to disaster.

So, you see - if I understand it correctly, there is no place for
failure... besides application misbehaviour or potential kernel bug in
migration. Of course, I may miss something and that is a reason why
stack trace is desirable.


> I think the mechanism of the race window is understood and it
> depends on the __flush_dcache_page() deciding that the flush should be
> postponed.

As I remember, you said you use HIGHMEM patch, right? It changes a
little __flush_dcache_page() and flush of any mapped page is not
postponed anymore. So, it has an immediate effect for application pages.

- Leonid.

>
>
> - Lars
>
> On Fri, 2014-12-05 at 03:16 +0100, Leonid Yegoshin wrote:
>> (repeat mesg, first one went to wrong place)
>>
>> Lars,
>>
>> Do you have a stack trace or so then you found the second VPE between
>> set_pte_at and update_mmu_cache?
>> It would be interesting how it happens - generally, to get a consistent
>> SIGILL in applications due to misbehaviour of memory subsystem, the bug
>> in FS is not enough.
>>
>> Hold on - do you use non-DMA file system?
>> If so, I advice you to try this simple patch:
>>
>> Author: Leonid Yegoshin <[email protected]>
>> Date: Tue Apr 2 14:20:37 2013 -0700
>>
>> MIPS: (opt) Fix of reading I-pages from non-DMA FS devices for ID
>> cache separation
>>
>> This optional fix provides a D-cache flush for instruction code
>> pages on
>> page faults. In case of non-DMA block device a driver doesn't know
>> that it
>> reads I-page and doesn't flush D-cache generally on systems without
>> cache aliasing. And that takes toll during page fault of
>> instruction pages.
>>
>> It is not a perfect fix, it should be considered as a temporary fix.
>> The permanent fix would track page origin in page cache and flushes
>> D-cache
>> during reception of page from driver only but not at each page fault.
>> It is not done yet.
>>
>> Change-Id: I43f5943d6ce0509729179615f6b81e77803a34ac
>> Author: Leonid Yegoshin <[email protected]>
>> Signed-off-by: Leonid Yegoshin <[email protected]>(imported from
>> commit 6ebd22eb7a3d9873582ebe990a77094f971652ee)(imported from commit
>> 0caf3b4a1eebb64572e81e4df6fdb3abf12c70
>>
>> arch/mips/include/asm/cacheflush.h:
>>
>> @@ -61,6 +61,9 @@ static inline void flush_anon_page(struct
>> vm_area_struct *vma,
>> static inline void flush_icache_page(struct vm_area_struct *vma,
>> struct page *page)
>> {
>> + if (cpu_has_dc_aliases ||
>> + ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc))
>> + __flush_dcache_page(page);
>> }
>>
>> extern void (*flush_icache_range)(unsigned long start, unsigned
>> long end);
>>
>>
>> It fixed crash problems with non-DMA FS in a couple of our customers.
>> Without it the non-DMA root FS crashes are catastrophic in aliasing
>> systems but it is still a problem for I-cache too but much rare.
>>
>> Unfortunately, it is also a performance hit, however is less than run a
>> page cache flush at each PTE setup. On 12/03/2014 06:03 AM, Lars Persson
>> wrote:
>>> It is the flush_dcache_page() that was called from the file-system
>>> reading the page contents into memory.
>>>
>>> - Lars
>>>
>>>
>
>

2014-12-08 09:18:59

by Lars Persson

[permalink] [raw]
Subject: Re: [PATCH] Revert "MIPS: Remove race window in page fault handling"

Hi

I have reconstructed the stack trace on the 3.14 kernel where we caught
the problem.

The call path was __do_fault() -> set_pte_at(). In later kernels it
corresponds to the do_set_pte().

You are right that there are calls to flush_icache_page(). I did not dig
into the code enough to check if page_mapped() will return true at that
point. If it does, then your proposed patch for flush_icache_page is a
better fix.

- Lars

On fre, 2014-12-05 at 22:41 +0100, Leonid Yegoshin wrote:
> Lars,
>
> On 12/05/2014 01:32 AM, Lars Persson wrote:
> > Hi
> >
> > Our setup includes both a non-DMA block device and a compressing
> > file-system (UBIFS). A flush_dcache_page() is issued by UBIFS so your
> > patch fixes another problem that we do not hit.
> >
> > The stack trace is not available now. Do we need it for any further
> > analysis ? I think the mechanism of the race window is understood and it
> > depends on the __flush_dcache_page() deciding that the flush should be
> > postponed.
>
> Unfortunately, the research of original case is still needed.
> I looked into all cases of update_mmu_cache() besides HUGE page support
> and NUMA, and I see:
>
> 1. insert_pfn()
> It is used to put a special page (read - VDSO) into memory map. No
> cache flush is needed here because page is done and flushed during
> system boot.
>
> 2. do_wp_page(), first occurrence
> It has flush_cache_page() before it sets PTE in
> ptep_set_access_flags(). This flush is unconditional and affects all caches.
>
> 3. do_wp_page(), second case
> It is done after preparing a clear new page or after COW. COW has
> an appropriate cache flush of destination in copy_user_highpage(). The
> immediate use of cleared new page as instruction (you had SIGILL,
> right?)... hm-m, something wrong in application in this case.
>
> 4. do_swap_page()
> Well, it may be a case of flush_icache_page() is not used (see
> below) and page is taken from non-DMA swap. But I also recommend to look
> into
>
> http://patchwork.linux-mips.org/patch/7615/
>
> there is a bug in swap entry number presentation and it may affect your
> system.
>
> 5. do_anonymous_page()
> The similar to case (3) - cleared new page, using of it as
> instruction page may point to some app problem.
>
> 6. do_set_pte()
> It also has flush_icache_page() which may have impact if not
> implemented, see below.
>
> 7. handle_pte_fault()
> Page is not touched and cache flush is NO-OP.
>
> 8. remove_migration_pte()
> Well, it is a place for suspicion. But it should not run in
> parallel with any running thread - dirtying page while other thread is
> running is a way to disaster.
>
> So, you see - if I understand it correctly, there is no place for
> failure... besides application misbehaviour or potential kernel bug in
> migration. Of course, I may miss something and that is a reason why
> stack trace is desirable.
>
>
> > I think the mechanism of the race window is understood and it
> > depends on the __flush_dcache_page() deciding that the flush should be
> > postponed.
>
> As I remember, you said you use HIGHMEM patch, right? It changes a
> little __flush_dcache_page() and flush of any mapped page is not
> postponed anymore. So, it has an immediate effect for application pages.
>
> - Leonid.
>
> >
> >
> > - Lars
> >
> > On Fri, 2014-12-05 at 03:16 +0100, Leonid Yegoshin wrote:
> >> (repeat mesg, first one went to wrong place)
> >>
> >> Lars,
> >>
> >> Do you have a stack trace or so then you found the second VPE between
> >> set_pte_at and update_mmu_cache?
> >> It would be interesting how it happens - generally, to get a consistent
> >> SIGILL in applications due to misbehaviour of memory subsystem, the bug
> >> in FS is not enough.
> >>
> >> Hold on - do you use non-DMA file system?
> >> If so, I advice you to try this simple patch:
> >>
> >> Author: Leonid Yegoshin <[email protected]>
> >> Date: Tue Apr 2 14:20:37 2013 -0700
> >>
> >> MIPS: (opt) Fix of reading I-pages from non-DMA FS devices for ID
> >> cache separation
> >>
> >> This optional fix provides a D-cache flush for instruction code
> >> pages on
> >> page faults. In case of non-DMA block device a driver doesn't know
> >> that it
> >> reads I-page and doesn't flush D-cache generally on systems without
> >> cache aliasing. And that takes toll during page fault of
> >> instruction pages.
> >>
> >> It is not a perfect fix, it should be considered as a temporary fix.
> >> The permanent fix would track page origin in page cache and flushes
> >> D-cache
> >> during reception of page from driver only but not at each page fault.
> >> It is not done yet.
> >>
> >> Change-Id: I43f5943d6ce0509729179615f6b81e77803a34ac
> >> Author: Leonid Yegoshin <[email protected]>
> >> Signed-off-by: Leonid Yegoshin <[email protected]>(imported from
> >> commit 6ebd22eb7a3d9873582ebe990a77094f971652ee)(imported from commit
> >> 0caf3b4a1eebb64572e81e4df6fdb3abf12c70
> >>
> >> arch/mips/include/asm/cacheflush.h:
> >>
> >> @@ -61,6 +61,9 @@ static inline void flush_anon_page(struct
> >> vm_area_struct *vma,
> >> static inline void flush_icache_page(struct vm_area_struct *vma,
> >> struct page *page)
> >> {
> >> + if (cpu_has_dc_aliases ||
> >> + ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc))
> >> + __flush_dcache_page(page);
> >> }
> >>
> >> extern void (*flush_icache_range)(unsigned long start, unsigned
> >> long end);
> >>
> >>
> >> It fixed crash problems with non-DMA FS in a couple of our customers.
> >> Without it the non-DMA root FS crashes are catastrophic in aliasing
> >> systems but it is still a problem for I-cache too but much rare.
> >>
> >> Unfortunately, it is also a performance hit, however is less than run a
> >> page cache flush at each PTE setup. On 12/03/2014 06:03 AM, Lars Persson
> >> wrote:
> >>> It is the flush_dcache_page() that was called from the file-system
> >>> reading the page contents into memory.
> >>>
> >>> - Lars
> >>>
> >>>
> >
> >
>