From: Atsushi Nemoto <[email protected]>
Problem:
1. There is a process containing two threads (T1 and T2). The
thread T1 calls fork(). Then the dup_mmap() function is called in T1's context.
static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
...
flush_cache_mm(current->mm);
... /* A */
(write-protect all Copy-On-Write pages)
... /* B */
flush_tlb_mm(current->mm);
...
2. When preemption happens between A and B (or on an SMP kernel), the
thread T2 can run and modify data on COW pages without a page fault
(the modified data will stay in the cache).
3. Some time after fork() has completed, the thread T2 may cause a page
fault by writing to a write-protected COW page.
4. Then the data of the COW page will be copied to a newly allocated
physical page (copy_cow_page()), which reads the data via the kernel
mapping. The kernel mapping can have a different 'color' from the
user space mapping of the thread T2 (dcache aliasing). Therefore
copy_cow_page() will copy stale data, and the modified data in the
cache will be lost (a small sketch of this color mismatch follows).
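To make the aliasing concrete, here is a tiny illustrative userspace C sketch;
the cache geometry and the two addresses are hypothetical, not the actual MIPS
constants. In a virtually indexed cache the line index comes from the virtual
address, so two mappings of the same physical page land on different lines
whenever their index bits above PAGE_SHIFT differ.

/*
 * Illustration only: hypothetical 32KB virtually indexed cache with
 * 4KB pages, so virtual address bits 12..14 select the "color".
 */
#include <stdio.h>

#define PAGE_SHIFT	12
#define CACHE_WAY_SIZE	0x8000UL	/* 32KB, hypothetical */
#define COLOR(vaddr)	(((vaddr) & (CACHE_WAY_SIZE - 1)) >> PAGE_SHIFT)

int main(void)
{
	unsigned long user_vaddr   = 0x2ab3000UL;	/* T2's user mapping */
	unsigned long kernel_vaddr = 0xc0104000UL;	/* kernel mapping    */

	/* Different colors: a copy done through the kernel mapping never
	 * sees the dirty lines indexed under the user mapping's color. */
	printf("user color %lu, kernel color %lu\n",
	       COLOR(user_vaddr), COLOR(kernel_vaddr));
	return 0;
}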
To let architecture code deal with this problem, allow it to override
copy_user_highpage() by defining __HAVE_ARCH_COPY_USER_HIGHPAGE in
<asm/page.h>.
The main part of this patch was originally written by Ralf Baechle;
Atsushi Nemoto did the debugging.
Signed-off-by: Atsushi Nemoto <[email protected]>
Signed-off-by: Ralf Baechle <[email protected]>
include/linux/highmem.h | 4 ++++
1 file changed, 4 insertions(+)
Index: upstream-linus/include/linux/highmem.h
===================================================================
--- upstream-linus.orig/include/linux/highmem.h 2006-10-17 00:14:43.000000000 +0100
+++ upstream-linus/include/linux/highmem.h 2006-10-17 00:15:21.000000000 +0100
@@ -94,6 +94,8 @@ static inline void memclear_highpage_flu
kunmap_atomic(kaddr, KM_USER0);
}
+#ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE
+
static inline void copy_user_highpage(struct page *to, struct page *from, unsigned long vaddr)
{
char *vfrom, *vto;
@@ -107,6 +109,8 @@ static inline void copy_user_highpage(st
smp_wmb();
}
+#endif
+
static inline void copy_highpage(struct page *to, struct page *from)
{
char *vfrom, *vto;
From: Atsushi Nemoto <[email protected]>
To allow a more efficient copy_user_highpage() on certain architectures,
a vma argument is added to that function and to cow_user_page(), allowing
the implementations of these functions to check for the VM_EXEC bit (a
rough sketch of such an implementation follows).
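As a sketch of what the extra argument enables (this mirrors the MIPS
implementation further down in this series; treat it as a simplified outline
rather than a drop-in implementation, and note that flush_data_cache_page()
is the MIPS name for the architecture's data cache flush hook):

/*
 * Outline of an arch-private copy_user_highpage() that uses the new
 * vma argument to skip the flush for non-executable mappings.
 */
void copy_user_highpage(struct page *to, struct page *from,
			unsigned long vaddr, struct vm_area_struct *vma)
{
	void *vto = kmap_atomic(to, KM_USER1);
	void *vfrom = kmap_atomic(from, KM_USER0);

	copy_page(vto, vfrom);
	kunmap_atomic(vfrom, KM_USER0);

	/* Only executable mappings need the extra flush to keep the
	 * I-cache coherent; plain data mappings can skip it. */
	if (vma->vm_flags & VM_EXEC)
		flush_data_cache_page((unsigned long)vto);

	kunmap_atomic(vto, KM_USER1);
}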
The main part of this patch was originally written by Ralf Baechle;
Atsushi Nemoto did the debugging.
Signed-off-by: Atsushi Nemoto <[email protected]>
Signed-off-by: Ralf Baechle <[email protected]>
include/linux/highmem.h | 3 ++-
mm/memory.c | 10 +++++-----
2 files changed, 7 insertions(+), 6 deletions(-)
Index: upstream-linus/mm/memory.c
===================================================================
--- upstream-linus.orig/mm/memory.c 2006-10-17 00:14:41.000000000 +0100
+++ upstream-linus/mm/memory.c 2006-10-17 00:15:40.000000000 +0100
@@ -1431,7 +1431,7 @@ static inline pte_t maybe_mkwrite(pte_t
return pte;
}
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va)
+static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
{
/*
* If the source page was a PFN mapping, we don't have
@@ -1453,9 +1453,9 @@ static inline void cow_user_page(struct
memset(kaddr, 0, PAGE_SIZE);
kunmap_atomic(kaddr, KM_USER0);
return;
-
+
}
- copy_user_highpage(dst, src, va);
+ copy_user_highpage(dst, src, va, vma);
}
/*
@@ -1566,7 +1566,7 @@ gotten:
new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
goto oom;
- cow_user_page(new_page, old_page, address);
+ cow_user_page(new_page, old_page, address, vma);
}
/*
@@ -2190,7 +2190,7 @@ retry:
page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!page)
goto oom;
- copy_user_highpage(page, new_page, address);
+ copy_user_highpage(page, new_page, address, vma);
page_cache_release(new_page);
new_page = page;
anon = 1;
Index: upstream-linus/include/linux/highmem.h
===================================================================
--- upstream-linus.orig/include/linux/highmem.h 2006-10-17 00:15:21.000000000 +0100
+++ upstream-linus/include/linux/highmem.h 2006-10-17 00:17:23.000000000 +0100
@@ -96,7 +96,8 @@ static inline void memclear_highpage_flu
#ifndef __HAVE_ARCH_COPY_USER_HIGHPAGE
-static inline void copy_user_highpage(struct page *to, struct page *from, unsigned long vaddr)
+static inline void copy_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
{
char *vfrom, *vto;
From: Atsushi Nemoto <[email protected]>
Provide a custom copy_user_highpage() to deal with aliasing issues on
MIPS. It uses kmap_coherent() to map a user page for the kernel with the
same color. Rewrite copy_to_user_page() and copy_from_user_page() with
the new interfaces to avoid extra cache flushing.
The main part of this patch was originally written by Ralf Baechle;
Atsushi Nemoto did the debugging.
Signed-off-by: Atsushi Nemoto <[email protected]>
Signed-off-by: Ralf Baechle <[email protected]>
arch/mips/mm/init.c | 213 ++++++++++++++++++++++++++++++++++++++++--
arch/mips/mm/pgtable-32.c | 7 -
arch/mips/mm/pgtable-64.c | 11 ++
include/asm-mips/cacheflush.h | 19 ---
include/asm-mips/fixmap.h | 14 ++
include/asm-mips/page.h | 17 +--
6 files changed, 242 insertions(+), 39 deletions(-)
Index: upstream-linus/arch/mips/mm/init.c
===================================================================
--- upstream-linus.orig/arch/mips/mm/init.c 2006-10-12 18:51:03.000000000 +0100
+++ upstream-linus/arch/mips/mm/init.c 2006-10-12 18:51:18.000000000 +0100
@@ -30,11 +30,39 @@
#include <asm/cachectl.h>
#include <asm/cpu.h>
#include <asm/dma.h>
+#include <asm/kmap_types.h>
#include <asm/mmu_context.h>
#include <asm/sections.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
#include <asm/tlb.h>
+#include <asm/fixmap.h>
+
+/* CP0 hazard avoidance. */
+#define BARRIER __asm__ __volatile__(".set noreorder\n\t" \
+ "nop; nop; nop; nop; nop; nop;\n\t" \
+ ".set reorder\n\t")
+
+/* Atomicity and interruptability */
+#ifdef CONFIG_MIPS_MT_SMTC
+
+#include <asm/mipsmtregs.h>
+
+#define ENTER_CRITICAL(flags) \
+ { \
+ unsigned int mvpflags; \
+ local_irq_save(flags);\
+ mvpflags = dvpe()
+#define EXIT_CRITICAL(flags) \
+ evpe(mvpflags); \
+ local_irq_restore(flags); \
+ }
+#else
+
+#define ENTER_CRITICAL(flags) local_irq_save(flags)
+#define EXIT_CRITICAL(flags) local_irq_restore(flags)
+
+#endif /* CONFIG_MIPS_MT_SMTC */
DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
@@ -80,13 +108,183 @@ unsigned long setup_zero_pages(void)
return 1UL << order;
}
-#ifdef CONFIG_HIGHMEM
-pte_t *kmap_pte;
-pgprot_t kmap_prot;
+/*
+ * These are almost like kmap_atomic / kunmap_atomic except they take an
+ * additional address argument as the hint.
+ */
#define kmap_get_fixmap_pte(vaddr) \
pte_offset_kernel(pmd_offset(pud_offset(pgd_offset_k(vaddr), (vaddr)), (vaddr)), (vaddr))
+#ifdef CONFIG_MIPS_MT_SMTC
+static pte_t *kmap_coherent_pte;
+static void __init kmap_coherent_init(void)
+{
+ unsigned long vaddr;
+
+ /* cache the first coherent kmap pte */
+ vaddr = __fix_to_virt(FIX_CMAP_BEGIN);
+ kmap_coherent_pte = kmap_get_fixmap_pte(vaddr);
+}
+#else
+static inline void kmap_coherent_init(void) {}
+#endif
+
+static inline void *kmap_coherent(struct page *page, unsigned long addr)
+{
+ enum fixed_addresses idx;
+ unsigned long vaddr, flags, entrylo;
+ unsigned long old_ctx;
+ pte_t pte;
+ unsigned int tlbidx;
+
+ inc_preempt_count();
+ idx = (addr >> PAGE_SHIFT) & (FIX_N_COLOURS - 1);
+#ifdef CONFIG_MIPS_MT_SMTC
+ idx += FIX_N_COLOURS * smp_processor_id();
+#endif
+ vaddr = __fix_to_virt(FIX_CMAP_END - idx);
+ pte = mk_pte(page, PAGE_KERNEL);
+#if defined(CONFIG_64BIT_PHYS_ADDR) && defined(CONFIG_CPU_MIPS32_R1)
+ entrylo = pte.pte_high;
+#else
+ entrylo = pte_val(pte) >> 6;
+#endif
+
+ ENTER_CRITICAL(flags);
+ old_ctx = read_c0_entryhi();
+ write_c0_entryhi(vaddr & (PAGE_MASK << 1));
+ write_c0_entrylo0(entrylo);
+ write_c0_entrylo1(entrylo);
+#ifdef CONFIG_MIPS_MT_SMTC
+ set_pte(kmap_coherent_pte - (FIX_CMAP_END - idx), pte);
+ /* preload TLB instead of local_flush_tlb_one() */
+ mtc0_tlbw_hazard();
+ tlb_probe();
+ BARRIER;
+ tlbidx = read_c0_index();
+ mtc0_tlbw_hazard();
+ if (tlbidx < 0)
+ tlb_write_random();
+ else
+ tlb_write_indexed();
+#else
+ tlbidx = read_c0_wired();
+ write_c0_wired(tlbidx + 1);
+ write_c0_index(tlbidx);
+ mtc0_tlbw_hazard();
+ tlb_write_indexed();
+#endif
+ tlbw_use_hazard();
+ write_c0_entryhi(old_ctx);
+ EXIT_CRITICAL(flags);
+
+ return (void*) vaddr;
+}
+
+#define UNIQUE_ENTRYHI(idx) (CKSEG0 + ((idx) << (PAGE_SHIFT + 1)))
+
+static inline void kunmap_coherent(struct page *page)
+{
+#ifndef CONFIG_MIPS_MT_SMTC
+ unsigned int wired;
+ unsigned long flags, old_ctx;
+
+ ENTER_CRITICAL(flags);
+ old_ctx = read_c0_entryhi();
+ wired = read_c0_wired() - 1;
+ write_c0_wired(wired);
+ write_c0_index(wired);
+ write_c0_entryhi(UNIQUE_ENTRYHI(wired));
+ write_c0_entrylo0(0);
+ write_c0_entrylo1(0);
+ mtc0_tlbw_hazard();
+ tlb_write_indexed();
+ write_c0_entryhi(old_ctx);
+ EXIT_CRITICAL(flags);
+#endif
+ dec_preempt_count();
+ preempt_check_resched();
+}
+
+void copy_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ void *vfrom, *vto;
+
+ vto = kmap_atomic(to, KM_USER1);
+ if (cpu_has_dc_aliases) {
+ vfrom = kmap_coherent(from, vaddr);
+ copy_page(vto, vfrom);
+ kunmap_coherent(from);
+ } else {
+ vfrom = kmap_atomic(from, KM_USER0);
+ copy_page(vto, vfrom);
+ kunmap_atomic(vfrom, KM_USER0);
+ }
+ if (((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) ||
+ pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK))
+ flush_data_cache_page((unsigned long)vto);
+ kunmap_atomic(vto, KM_USER1);
+ /* Make sure this page is cleared on other CPU's too before using it */
+ smp_wmb();
+}
+
+EXPORT_SYMBOL(copy_user_highpage);
+
+void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
+ struct page *to)
+{
+ if (cpu_has_dc_aliases) {
+ struct page *from = virt_to_page(vfrom);
+ vfrom = kmap_coherent(from, vaddr);
+ copy_page(vto, vfrom);
+ kunmap_coherent(from);
+ } else
+ copy_page(vto, vfrom);
+ if (!cpu_has_ic_fills_f_dc ||
+ pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK))
+ flush_data_cache_page((unsigned long)vto);
+}
+
+EXPORT_SYMBOL(copy_user_page);
+
+void copy_to_user_page(struct vm_area_struct *vma,
+ struct page *page, unsigned long vaddr, void *dst, const void *src,
+ unsigned long len)
+{
+ if (cpu_has_dc_aliases) {
+ void *vto = kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
+ memcpy(vto, src, len);
+ kunmap_coherent(page);
+ } else
+ memcpy(dst, src, len);
+ if ((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc)
+ flush_cache_page(vma, vaddr, page_to_pfn(page));
+}
+
+EXPORT_SYMBOL(copy_to_user_page);
+
+void copy_from_user_page(struct vm_area_struct *vma,
+ struct page *page, unsigned long vaddr, void *dst, const void *src,
+ unsigned long len)
+{
+ if (cpu_has_dc_aliases) {
+ void *vfrom =
+ kmap_coherent(page, vaddr) + (vaddr & ~PAGE_MASK);
+ memcpy(dst, vfrom, len);
+ kunmap_coherent(page);
+ } else
+ memcpy(dst, src, len);
+}
+
+EXPORT_SYMBOL(copy_from_user_page);
+
+
+#ifdef CONFIG_HIGHMEM
+pte_t *kmap_pte;
+pgprot_t kmap_prot;
+
static void __init kmap_init(void)
{
unsigned long kmap_vstart;
@@ -97,11 +295,12 @@ static void __init kmap_init(void)
kmap_prot = PAGE_KERNEL;
}
+#endif /* CONFIG_HIGHMEM */
-#ifdef CONFIG_32BIT
void __init fixrange_init(unsigned long start, unsigned long end,
pgd_t *pgd_base)
{
+#if defined(CONFIG_HIGHMEM) || defined(CONFIG_MIPS_MT_SMTC)
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -122,7 +321,7 @@ void __init fixrange_init(unsigned long
for (; (k < PTRS_PER_PMD) && (vaddr != end); pmd++, k++) {
if (pmd_none(*pmd)) {
pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
- set_pmd(pmd, __pmd(pte));
+ set_pmd(pmd, __pmd((unsigned long)pte));
if (pte != pte_offset_kernel(pmd, 0))
BUG();
}
@@ -132,9 +331,8 @@ void __init fixrange_init(unsigned long
}
j = 0;
}
+#endif
}
-#endif /* CONFIG_32BIT */
-#endif /* CONFIG_HIGHMEM */
#ifndef CONFIG_NEED_MULTIPLE_NODES
extern void pagetable_init(void);
@@ -175,6 +373,7 @@ void __init paging_init(void)
#ifdef CONFIG_HIGHMEM
kmap_init();
#endif
+ kmap_coherent_init();
max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
low = max_low_pfn;
Index: upstream-linus/arch/mips/mm/pgtable-32.c
===================================================================
--- upstream-linus.orig/arch/mips/mm/pgtable-32.c 2006-10-12 18:51:03.000000000 +0100
+++ upstream-linus/arch/mips/mm/pgtable-32.c 2006-10-12 18:51:18.000000000 +0100
@@ -31,9 +31,10 @@ void pgd_init(unsigned long page)
void __init pagetable_init(void)
{
-#ifdef CONFIG_HIGHMEM
unsigned long vaddr;
- pgd_t *pgd, *pgd_base;
+ pgd_t *pgd_base;
+#ifdef CONFIG_HIGHMEM
+ pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
@@ -44,7 +45,6 @@ void __init pagetable_init(void)
pgd_init((unsigned long)swapper_pg_dir
+ sizeof(pgd_t) * USER_PTRS_PER_PGD);
-#ifdef CONFIG_HIGHMEM
pgd_base = swapper_pg_dir;
/*
@@ -53,6 +53,7 @@ void __init pagetable_init(void)
vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
fixrange_init(vaddr, 0, pgd_base);
+#ifdef CONFIG_HIGHMEM
/*
* Permanent kmaps:
*/
Index: upstream-linus/arch/mips/mm/pgtable-64.c
===================================================================
--- upstream-linus.orig/arch/mips/mm/pgtable-64.c 2006-10-12 18:51:03.000000000 +0100
+++ upstream-linus/arch/mips/mm/pgtable-64.c 2006-10-12 18:51:18.000000000 +0100
@@ -8,6 +8,7 @@
*/
#include <linux/init.h>
#include <linux/mm.h>
+#include <asm/fixmap.h>
#include <asm/pgtable.h>
void pgd_init(unsigned long page)
@@ -52,7 +53,17 @@ void pmd_init(unsigned long addr, unsign
void __init pagetable_init(void)
{
+ unsigned long vaddr;
+ pgd_t *pgd_base;
+
/* Initialize the entire pgd. */
pgd_init((unsigned long)swapper_pg_dir);
pmd_init((unsigned long)invalid_pmd_table, (unsigned long)invalid_pte_table);
+
+ pgd_base = swapper_pg_dir;
+ /*
+ * Fixed mappings:
+ */
+ vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
+ fixrange_init(vaddr, 0, pgd_base);
}
Index: upstream-linus/include/asm-mips/cacheflush.h
===================================================================
--- upstream-linus.orig/include/asm-mips/cacheflush.h 2006-10-12 18:51:03.000000000 +0100
+++ upstream-linus/include/asm-mips/cacheflush.h 2006-10-12 18:51:18.000000000 +0100
@@ -55,24 +55,13 @@ extern void (*flush_icache_range)(unsign
#define flush_cache_vmap(start, end) flush_cache_all()
#define flush_cache_vunmap(start, end) flush_cache_all()
-static inline void copy_to_user_page(struct vm_area_struct *vma,
+extern void copy_to_user_page(struct vm_area_struct *vma,
struct page *page, unsigned long vaddr, void *dst, const void *src,
- unsigned long len)
-{
- if (cpu_has_dc_aliases)
- flush_cache_page(vma, vaddr, page_to_pfn(page));
- memcpy(dst, src, len);
- __flush_icache_page(vma, page);
-}
+ unsigned long len);
-static inline void copy_from_user_page(struct vm_area_struct *vma,
+extern void copy_from_user_page(struct vm_area_struct *vma,
struct page *page, unsigned long vaddr, void *dst, const void *src,
- unsigned long len)
-{
- if (cpu_has_dc_aliases)
- flush_cache_page(vma, vaddr, page_to_pfn(page));
- memcpy(dst, src, len);
-}
+ unsigned long len);
extern void (*flush_cache_sigtramp)(unsigned long addr);
extern void (*flush_icache_all)(void);
Index: upstream-linus/include/asm-mips/fixmap.h
===================================================================
--- upstream-linus.orig/include/asm-mips/fixmap.h 2006-10-12 18:51:03.000000000 +0100
+++ upstream-linus/include/asm-mips/fixmap.h 2006-10-12 18:51:18.000000000 +0100
@@ -45,8 +45,16 @@
* fix-mapped?
*/
enum fixed_addresses {
+#define FIX_N_COLOURS 8
+ FIX_CMAP_BEGIN,
+#ifdef CONFIG_MIPS_MT_SMTC
+ FIX_CMAP_END = FIX_CMAP_BEGIN + (FIX_N_COLOURS * NR_CPUS),
+#else
+ FIX_CMAP_END = FIX_CMAP_BEGIN + FIX_N_COLOURS,
+#endif
#ifdef CONFIG_HIGHMEM
- FIX_KMAP_BEGIN, /* reserved pte's for temporary kernel mappings */
+ /* reserved pte's for temporary kernel mappings */
+ FIX_KMAP_BEGIN = FIX_CMAP_END + 1,
FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
#endif
__end_of_fixed_addresses
@@ -70,9 +78,9 @@ extern void __set_fixmap (enum fixed_add
* at the top of mem..
*/
#if defined(CONFIG_CPU_TX39XX) || defined(CONFIG_CPU_TX49XX)
-#define FIXADDR_TOP (0xff000000UL - 0x2000)
+#define FIXADDR_TOP ((unsigned long)(long)(int)(0xff000000 - 0x20000))
#else
-#define FIXADDR_TOP (0xffffe000UL)
+#define FIXADDR_TOP ((unsigned long)(long)(int)0xfffe0000)
#endif
#define FIXADDR_SIZE (__end_of_fixed_addresses << PAGE_SHIFT)
#define FIXADDR_START (FIXADDR_TOP - FIXADDR_SIZE)
Index: upstream-linus/include/asm-mips/page.h
===================================================================
--- upstream-linus.orig/include/asm-mips/page.h 2006-10-12 18:51:03.000000000 +0100
+++ upstream-linus/include/asm-mips/page.h 2006-10-12 18:51:18.000000000 +0100
@@ -34,8 +34,6 @@
#ifndef __ASSEMBLY__
-#include <asm/cpu-features.h>
-
extern void clear_page(void * page);
extern void copy_page(void * to, void * from);
@@ -59,16 +57,13 @@ static inline void clear_user_page(void
flush_data_cache_page((unsigned long)addr);
}
-static inline void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
- struct page *to)
-{
- extern void (*flush_data_cache_page)(unsigned long addr);
+extern void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
+ struct page *to);
+struct vm_area_struct;
+extern void copy_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma);
- copy_page(vto, vfrom);
- if (!cpu_has_ic_fills_f_dc ||
- pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK))
- flush_data_cache_page((unsigned long)vto);
-}
+#define __HAVE_ARCH_COPY_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
Ralf Baechle wrote:
> From: Atsushi Nemoto <[email protected]>
>
> Problem:
>
> 1. There is a process containing two threads (T1 and T2). The
> thread T1 calls fork(). Then the dup_mmap() function is called in T1's context.
>
> static inline int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
> ...
> flush_cache_mm(current->mm);
> ... /* A */
> (write-protect all Copy-On-Write pages)
> ... /* B */
> flush_tlb_mm(current->mm);
> ...
>
> 2. When preemption happens between A and B (or on SMP kernel), the
> thread T2 can run and modify data on COW pages without page fault
> (modified data will stay in cache).
>
> 3. Some time after fork() completed, the thread T2 may cause a page
> fault by write-protect on a COW page.
>
> 4. Then data of the COW page will be copied to newly allocated
> physical page (copy_cow_page()). It reads data via kernel mapping.
> The kernel mapping can have different 'color' with user space
> mapping of the thread T2 (dcache aliasing). Therefore
> copy_cow_page() will copy stale data. Then the modified data in
> cache will be lost.
What about if you just flush the caches after write protecting all
COW pages? Would that work? Simpler? Better performance? (I don't know)
--
SUSE Labs, Novell Inc.
On Fri, Oct 20, 2006 at 03:46:35AM +1000, Nick Piggin wrote:
> What about if you just flush the caches after write protecting all
> COW pages? Would that work? Simpler? Better performance? (I don't know)
That would require changing the order of the cache flush and the TLB flush.
For architectures that require a valid translation in the TLB, the
cache flush has to be done first. Not sure if those architectures need
a writeable mapping for dirty cachelines - I think hypersparc was one
of them.
Ralf
Ralf Baechle wrote:
> On Fri, Oct 20, 2006 at 03:46:35AM +1000, Nick Piggin wrote:
>
>
>>What about if you just flush the caches after write protecting all
>>COW pages? Would that work? Simpler? Better performance? (I don't know)
>
>
> That would require changing the order of cache flush and tlb flush. To
Can't we just move flush_cache_mm to below the copy_page_range, before
the flush_tlb_mm?
> keep certain architectures that require a valid translation in the TLB
> the cacheflush has to be done first. Not sure if those architectures need
> a writeable mapping for dirty cachelines - I think hypersparc was one
> of them.
If the cache is dirty, then the TLB must have a writeable mapping in it,
mustn't it? If there is an architecture where this isn't the case, then
the current code is broken anyway, because in your example the T2 thread
is dirtying data right before the mapping gets write protected anyway.
--
SUSE Labs, Novell Inc.
From: Ralf Baechle <[email protected]>
Date: Thu, 19 Oct 2006 19:13:46 +0100
> That would require changing the order of cache flush and tlb flush.
> To keep certain architectures that require a valid translation in
> the TLB the cacheflush has to be done first. Not sure if those
> architectures need a writeable mapping for dirty cachelines - I
> think hypersparc was one of them.
There just has to be "a mapping" in the TLB so that the L2 cache can
translate the virtual address to a physical one for the writeback to
main memory.
David Miller wrote:
> From: Ralf Baechle <[email protected]>
> Date: Thu, 19 Oct 2006 19:13:46 +0100
>
>
>>That would require changing the order of cache flush and tlb flush.
>>To keep certain architectures that require a valid translation in
>>the TLB the cacheflush has to be done first. Not sure if those
>>architectures need a writeable mapping for dirty cachelines - I
>>think hypersparc was one of them.
>
>
> There just has to be "a mapping" in the TLB so that the L2 cache can
> translate the virtual address to a physical one for the writeback to
> main memory.
So moving the flush_cache_mm below the copy_page_range, to just
before the flush_tlb_mm, would work then? This would make the
race much smaller than with this patchset.
But doesn't that still leave a race?
What if another thread writes to cache after we have flushed it
but before flushing the TLBs? Although we've marked the the ptes
readonly, the CPU won't trap if the TLB is valid? There must be
some special way for the arch to handle this, but I can't see it.
--
SUSE Labs, Novell Inc.
On Sat, 21 Oct 2006, Nick Piggin wrote:
>
> So moving the flush_cache_mm below the copy_page_range, to just
> before the flush_tlb_mm, would work then? This would make the
> race much smaller than with this patchset.
>
> But doesn't that still leave a race?
>
> What if another thread writes to cache after we have flushed it
> but before flushing the TLBs? Although we've marked the ptes
> readonly, the CPU won't trap if the TLB is valid? There must be
> some special way for the arch to handle this, but I can't see it.
Why not do the cache flush _after_ the TLB flush? There's still a mapping,
and never mind that it's read-only: the _mapping_ still exists, and I
doubt any CPU will not do the writeback (the readonly bit had better
affect the _frontend_ of the memory pipeline, but affectign the back end
would be insane and very hard, since you can't raise a fault any more).
Hmm?
Linus
Linus Torvalds wrote:
>
> On Sat, 21 Oct 2006, Nick Piggin wrote:
>
>>So moving the flush_cache_mm below the copy_page_range, to just
>>before the flush_tlb_mm, would work then? This would make the
>>race much smaller than with this patchset.
>>
>>But doesn't that still leave a race?
>>
>>What if another thread writes to cache after we have flushed it
>>but before flushing the TLBs? Although we've marked the the ptes
>>readonly, the CPU won't trap if the TLB is valid? There must be
>>some special way for the arch to handle this, but I can't see it.
>
>
> Why not do the cache flush _after_ the TLB flush? There's still a mapping,
> and never mind that it's read-only: the _mapping_ still exists, and I
> doubt any CPU will not do the writeback (the readonly bit had better
> affect the _frontend_ of the memory pipeline, but affecting the back end
> would be insane and very hard, since you can't raise a fault any more).
I didn't think that would work if there is no TLB. But if the writeback
can cause a TLB reload, and then bypass the readonly protection, then
yes, that would close all races.
Of course, you may also want to do the racy cache flush before the
TLB flush as well, so you don't immediately take a load of TLB misses
to write it out.
--
SUSE Labs, Novell Inc.
On Sat, Oct 21, 2006 at 12:39:40AM +1000, Nick Piggin wrote:
> >>That would require changing the order of cache flush and tlb flush.
> >>To keep certain architectures that require a valid translation in
> >>the TLB the cacheflush has to be done first. Not sure if those
> >>architectures need a writeable mapping for dirty cachelines - I
> >>think hypersparc was one of them.
> >
> >
> >There just has to be "a mapping" in the TLB so that the L2 cache can
> >translate the virtual address to a physical one for the writeback to
> >main memory.
>
> So moving the flush_cache_mm below the copy_page_range, to just
> before the flush_tlb_mm, would work then? This would make the
> race much smaller than with this patchset.
90% of this changeset is MIPS-specific code. Of that in turn much is
just infrastructure which is already being used anyway.
> But doesn't that still leave a race?
Both calls would have to be done under the mmap_sem to close any races.
> What if another thread writes to cache after we have flushed it
> but before flushing the TLBs? Although we've marked the ptes
> readonly, the CPU won't trap if the TLB is valid? There must be
> some special way for the arch to handle this, but I can't see it.
There isn't really. Reordering with a patch like:
Signed-off-by: Ralf Baechle <[email protected]>
diff --git a/kernel/fork.c b/kernel/fork.c
index 29ebb30..28e51e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -202,7 +202,6 @@ static inline int dup_mmap(struct mm_str
struct mempolicy *pol;
down_write(&oldmm->mmap_sem);
- flush_cache_mm(oldmm);
/*
* Not linked in yet - no deadlock potential:
*/
@@ -287,8 +286,9 @@ static inline int dup_mmap(struct mm_str
}
retval = 0;
out:
- up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
+ flush_cache_mm(oldmm);
+ up_write(&mm->mmap_sem);
up_write(&oldmm->mmap_sem);
return retval;
fail_nomem_policy:
should close the hole for all affected architectures. I say should
because this patch would need another round of linux-arch review and I
haven't tested this patch myself yet.
But even so that doesn't change that I would really like to make
copy_user_highpage() an arch interface replacing copy_to_user_page.
The current way of doing things enforces a cacheflush on MIPS which itself
is pricy - 1,000 cycles when it's cheap but could be several times as
expensive. And as a side effect of the cacheflush the process breaking
a COW page will start with a cold page.
Or if an architecture wants to be clever about aliasing and uses the
vto argument of copy_user_page to create a non-conflicting mapping it
means the mapping setup by copy_user_highpage will be unused ...
Ralf
Ralf Baechle wrote:
> On Sat, Oct 21, 2006 at 12:39:40AM +1000, Nick Piggin wrote:
>>So moving the flush_cache_mm below the copy_page_range, to just
>>before the flush_tlb_mm, would work then? This would make the
>>race much smaller than with this patchset.
>
>
> 90% of this changeset are MIPS-specific code. Of that in turn much is
> just infrastructure which is already being used anyway.
>
>
>>But doesn't that still leave a race?
>
>
> Both calls would have to be done under the mmap_sem to close any races.
Of course.
>>What if another thread writes to cache after we have flushed it
>>but before flushing the TLBs? Although we've marked the ptes
>>readonly, the CPU won't trap if the TLB is valid? There must be
>>some special way for the arch to handle this, but I can't see it.
>
>
> There isn't really. Reordering with a patch like:
>
> Signed-off-by: Ralf Baechle <[email protected]>
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 29ebb30..28e51e0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -202,7 +202,6 @@ static inline int dup_mmap(struct mm_str
> struct mempolicy *pol;
>
> down_write(&oldmm->mmap_sem);
> - flush_cache_mm(oldmm);
> /*
> * Not linked in yet - no deadlock potential:
> */
> @@ -287,8 +286,9 @@ static inline int dup_mmap(struct mm_str
> }
> retval = 0;
> out:
> - up_write(&mm->mmap_sem);
You don't need to do that. Nothing will use the new mm.
You might also want a flush before the flush_tlb_mm.
Actually, we should turn this into a single call, so architectures
can optimise it however they like.
> flush_tlb_mm(oldmm);
> + flush_cache_mm(oldmm);
That does close the race. We just need to ensure that all architectures
can handle this case.
And we need to figure out whether any of the other code that follows the
same flush_cache_* .. flush_tlb_* is buggy in the presence of other
threads writing into the cache in between.
I suspect they may well be, and you probably only noticed the issue in
fork, because the window is so large.
Places where we want to remove the mapping completely are going to be
more tricky to fix. Either we need to transition to readonly, then flush,
then transition to invalid, or arch code needs to be reworked
(a single-operation cache and TLB flush will do the trick, but that
might need to do an IPI, which is bad; a rough sketch of such a combined
call follows).
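For illustration, a minimal sketch of the combined call suggested above; the
name flush_cache_and_tlb_mm() and the __HAVE_ARCH_... guard are invented here,
this is not an existing interface:

/*
 * Hypothetical combined hook: the generic fallback keeps today's
 * ordering, while an architecture can override it to flush cache and
 * TLB together (or to do the flush-before/flush-after dance).
 */
#ifndef __HAVE_ARCH_FLUSH_CACHE_AND_TLB_MM
static inline void flush_cache_and_tlb_mm(struct mm_struct *mm)
{
	flush_cache_mm(mm);	/* write back dirty aliases first */
	flush_tlb_mm(mm);	/* then drop the stale translations */
}
#endif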
> + up_write(&mm->mmap_sem);
> up_write(&oldmm->mmap_sem);
> return retval;
> fail_nomem_policy:
>
> should close the hole for all affected architectures. I say should
> because this patch would need another round of linux-arch review and I
> haven't tested this patch myself yet.
>
> But even so that doesn't change that I would really like to make
> copy_user_highpage() an arch interface replacing copy_to_user_page.
>
> The current way of doing things enforces a cacheflush on MIPS which itself
> is pricy - 1,000 cycles when it's cheap but could be several times as
> expensive. And as a side effect of the cacheflush the process breaking
> a COW page will start with a cold page.
>
> Or if an architecture wants to be clever about aliasing and uses the
> vto argument of copy_user_page to create a non-conflicting mapping it
> means the mapping setup by copy_user_highpage will be unused ...
OK, I'm not arguing against any other changes. Hmm... I don't see how you
can kmap_coherent the "from" page when it can be mapped into more than one
virtual address? Does the MIPS port restrict remapping to the same colour?
--
SUSE Labs, Novell Inc.
On Sat, 21 Oct 2006, Nick Piggin wrote:
>
> I didn't think that would work if there is no TLB. But if the writeback
> can cause a TLB reload, and then bypass the readonly protection, then
> yes would close all races.
On the other hand, doing the cache flush at COW time is "kind of
equivalent" to just doing it after the TLB flush. It's now just _much_
after the flush ;)
So maybe the COW D$ aliasing patch-series is just the right thing to do.
Not worry about D$ at _all_ when doing the actual fork, and only worry
about it on an actual COW event. Hmm?
Linus
Linus Torvalds wrote:
>
> On Sat, 21 Oct 2006, Nick Piggin wrote:
>
>>I didn't think that would work if there is no TLB. But if the writeback
>>can cause a TLB reload, and then bypass the readonly protection, then
>>yes would close all races.
>
>
> On the other hand, doing the cache flush at COW time is "kind of
> equivalent" to just doing it after the TLB flush. It's now just _much_
> after the flush ;)
>
> So maybe the COW D$ aliasing patch-series is just the right thing to do.
> Not worry about D$ at _all_ when doing the actual fork, and only worry
> about it on an actual COW event. Hmm?
Well if we have the calls in there, we should at least make them work
right for the architectures there now. At the moment the flush_cache_mm
before the copy_page_range wouldn't seem to do anything if you can still
have threads dirty the cache again through existing TLB entries.
I don't think that flushing on COW is exactly right though, because dirty
data can remain invisible if you're only doing reads (no write, no flush).
And if that cache gets written back at some point, you're going to see
supposedly RO data change underneath you. I think?
--
SUSE Labs, Novell Inc.
On Sat, 21 Oct 2006, Nick Piggin wrote:
> > So maybe the COW D$ aliasing patch-series is just the right thing to do. Not
> > worry about D$ at _all_ when doing the actual fork, and only worry about it
> > on an actual COW event. Hmm?
>
> Well if we have the calls in there, we should at least make them work
> right for the architectures there now. At the moment the flush_cache_mm
> before the copy_page_range wouldn't seem to do anything if you can still
> have threads dirty the cache again through existing TLB entries.
>
> I don't think that flushing on COW is exactly right though, because dirty
> data can remain invisible if you're only doing reads (no write, no flush).
You're right. A virtually indexed cache needs the flush _before_ we return
from the fork into a new process (since otherwise the dirty data won't be
visible in the new virtual address space).
So you've convinced me. Flushing at COW time _cannot_ be right, because it
by definition means that there has been a time when the new process didn't
see the dirty data in the case of a virtual index. And in the case of a
physical index it cannot matter.
So I think the right thing to do is to forget about the COW D$ series
(which probably _hides_ most of the problems in practice, so it "works"
that way) and instead go with Ralf's last patch that just moves the
flush_cache_mm() to after the TLB flush.
We do need to have all the architecture people (especially S390, which has
been very strange in this regard in the past) check that it's ok. The
_mappings_ are still valid, so S390 should be able to do the write-back,
but there may be architectures that would want to do the flush _both_
before and after (for performance reasons - if writing out dirty data
requires a TLB lookup, doing most of the writeback before is probably a
better thing, and then we can do a _second_ writeback after the flush to
close the race with some other thread dirtying the pages before the TLB
was marked read-only).
I added linux-arch and Martin Schwidefsky (s390) to the Cc:.
Guys, in case you missed the earlier discussion: there's a suggested patch
by Ralf Baechle on linux-kernel (but it does just the "flush after"
version, not the "perhaps we need it both before and after" thing I
theorise about above). Message-ID: [email protected].
Linus
Linus Torvalds wrote:
>
> On Sat, 21 Oct 2006, Nick Piggin wrote:
>
>>>So maybe the COW D$ aliasing patch-series is just the right thing to do. Not
>>>worry about D$ at _all_ when doing the actual fork, and only worry about it
>>>on an actual COW event. Hmm?
>>
>>Well if we have the calls in there, we should at least make them work
>>right for the architectures there now. At the moment the flush_cache_mm
>>before the copy_page_range wouldn't seem to do anything if you can still
>>have threads dirty the cache again through existing TLB entries.
>>
>>I don't think that flushing on COW is exactly right though, because dirty
>>data can remain invisible if you're only doing reads (no write, no flush).
>
>
> You're right. A virtually indexed cache needs the flush _before_ we return
> from the fork into a new process (since otherwise the dirty data won't be
> visible in the new virtual address space).
>
> So you've convinced me. Flushing at COW time _cannot_ be right, because it
> by definition means that there has been a time when the new process didn't
> see the dirty data in the case of a virtual index. And in the case of a
> physical index it cannot matter.
>
> So I think the right thing to do is to forget about the COW D$ series
> (which probably _hides_ most of the problems in practice, so it "works"
> that way) and instead go with Ralf's last patch that just moves the
> flush_cache_mm() to after the TLB flush.
So long as we don't move around the mmap semaphores, I'm OK with that
patch...
> We do need to have all the architecture people (especially S390, which has
> been very strange in this regard in the past) check that it's ok. The
> _mappings_ are still valid, so S390 should be able to do the write-back,
> but there may be architectures that would want to do the flush _both_
> before and after (for performance reasons - if writing out dirty data
> requires a TLB lookup, doing most of the writeback before is probably a
> better thing, and then we can do a _second_ writeback after the flush to
> close the race with some other thread dirtying the pages before the TLB
> was marked read-only).
Yes, that's my theory too. Probably the thing to aim for is replacing
that API with a new single call to flush caches and tlbs, and the
arch can do what best suits.
But for now, to get it actually *working*, moving the flush_cache_mm
seems like the first step.
> I added linux-arch and Martin Schwidefsky (s390) to the Cc:.
>
> Guys, in case you missed the earlier discussion: there's a suggested patch
> by Ralf Baechle on linux-kernel (but it does just the "flush after"
> version, not the "perhaps we need it both before and after" thing I
> theorise about above). Message-ID: [email protected].
As I mentioned there, we probably want to check that the other places
which flush caches before invalidating TLBs (eg. most of the kernel) are
OK in the presence of concurrent writes to valid TLBs from other threads.
--
SUSE Labs, Novell Inc.
From: Nick Piggin <[email protected]>
Date: Sat, 21 Oct 2006 00:39:40 +1000
> So moving the flush_cache_mm below the copy_page_range, to just
> before the flush_tlb_mm, would work then? This would make the
> race much smaller than with this patchset.
>
> But doesn't that still leave a race?
>
> What if another thread writes to cache after we have flushed it
> but before flushing the TLBs? Although we've marked the ptes
> readonly, the CPU won't trap if the TLB is valid? There must be
> some special way for the arch to handle this, but I can't see it.
Also, it is actually the case that doing page-by-page cache flushes
can be cheaper than flush_cache_mm() on certain cpus. Very few cpus
that need this cache flushing provide a "context" based cache flush.
On cpus like the mentioned hypersparc, there is no way to do a
"context" flush of the cache, so we flush the entire multi-megabyte L2
cache. Actually, it allows flushing only "user" cache lines, which
keeps the kernel cache lines in there, but still it's very expensive.
From: Linus Torvalds <[email protected]>
Date: Fri, 20 Oct 2006 08:49:35 -0700 (PDT)
> Why not do the cache flush _after_ the TLB flush? There's still a mapping,
> and never mind that it's read-only: the _mapping_ still exists, and I
> doubt any CPU will not do the writeback (the readonly bit had better
> affect the _frontend_ of the memory pipeline, but affecting the back end
> would be insane and very hard, since you can't raise a fault any more).
>
> Hmm?
You get an asynchronous fault from the L2 cache, and that's also what
happens when the TLB entry is missing during L2 writeback too. You
get a level 15 non-maskable IRQ when these asynchronous errors happen.
On Fri, 20 Oct 2006, David Miller wrote:
>
> You get an asynchronous fault from the L2 cache, and that's also what
> happens when the TLB entry is missing during L2 writeback too. You
> get a level 15 non-maskable IRQ when these asynchronous errors happen.
Well, sparc always was crud. I can see the missing tlb entry, but if it's
been turned read-only, the write-back should still work (it clearly _was_
writable when the write that dirtied the cacheline happened).
Anyway, if you cannot flush a read-only mapping, then the "flush at COW
fault time" won't work _either_, since the original mapping is still
read-only.
So regardless, the COW-time flush cannot work. But the "flush before the
TLB flush, and then flush after in case we had a race" approach should
work as well as it can in practice, no?
Linus
From: Linus Torvalds <[email protected]>
Date: Fri, 20 Oct 2006 12:54:17 -0700 (PDT)
> Well, sparc always was crud. I can see the missing tlb entry, but if it's
> been turned read-only, the write-back should still work (it clearly _was_
> writable when the write that dirtied the cacheline happened).
I did some more digging, here's what I think the hardware actually
does:
1) On L2 cacheline load, the "user" and "writable" protection
bits are propagated from the TLB entry into the L2 cache
line. Access checks are done on L2 cache hit using this
cached copy of the two protection bits.
2) On L2 dirty cacheline writeback, the physical address is
obtained from the TLB.
So what you guys are suggesting should probably work fine.
On Fri, 20 Oct 2006, David Miller wrote:
>
> I did some more digging, here's what I think the hardware actually
> does:
Ok, this sounds sane.
What should we do about this? How does this patch look to people?
(Totally untested, and I'm not sure we should even do that whole
"oldmm->mm_users" test, but I'm throwing it out here for discussion, in
case it matters for performance. The second D$ flush should obviously be
unnecessary for the common unthreaded case, which is why none of this has
mattered historically, I think).
Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off
on this, and somebody to write a nice explanation for the changelog (and
preferably do this through -mm too).
Linus
---
diff --git a/kernel/fork.c b/kernel/fork.c
index 29ebb30..14c6a1d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -287,8 +287,18 @@ static inline int dup_mmap(struct mm_str
}
retval = 0;
out:
- up_write(&mm->mmap_sem);
flush_tlb_mm(oldmm);
+ /*
+ * If we have other threads using the old mm, we need to
+ * flush the D$ again - the other threads might have dirtied
+ * it more before the TLB got flushed.
+ *
+ * After the flush, they can no longer dirty more pages,
+ * since they are now marked read-only, of course.
+ */
+ if (atomic_read(&oldmm->mm_users) != 1)
+ flush_cache_mm(oldmm);
+ up_write(&mm->mmap_sem);
up_write(&oldmm->mmap_sem);
return retval;
fail_nomem_policy:
On Fri, Oct 20, 2006 at 01:10:59PM -0700, Linus Torvalds wrote:
> On Fri, 20 Oct 2006, David Miller wrote:
> > I did some more digging, here's what I think the hardware actually
> > does:
>
> Ok, this sounds sane.
>
> What should we do about this? How does this patch look to people?
>
> (Totally untested, and I'm not sure we should even do that whole
> "oldmm->mm_users" test, but I'm throwing it out here for discussion, in
> case it matters for performance. The second D$ flush should obviously be
> unnecessary for the common unthreaded case, which is why none of this has
> mattered historically, I think).
>
> Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off
> on this, and somebody to write a nice explanation for the changelog (and
> preferably do this through -mm too).
Well, looking at do_wp_page() I'm now quite concerned about ARM and COW.
I can't see how this code could _possibly_ work with a virtually indexed
cache as it stands. Yet, the kernel does appear to work.
I'm afraid I'm utterly confused with the Linux MM in this day and age, so
I don't think I can even consider commenting on this change.
The majority of ARM caches are VIVT, so data read via the kernel mappings
definitely does not hit the same cache lines as data accessed via the user
mappings.
Our copy_user_page() function merely copies between the two kernel mappings
of the pages, so with VIVT caches it only ever touches the kernel mappings -
as it always has done since its original invention.
However, when I look at this code now, I see _no where_ where we synchronise
the cache between the userspace mapping and the kernel space mapping before
copying a COW page.
So I'm afraid I'm going to have to hold up my hand and say "I don't
understand the Linux MM anymore".
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
From: Russell King <[email protected]>
Date: Fri, 20 Oct 2006 21:59:29 +0100
> However, when I look at this code now, I see _no where_ where we synchronise
> the cache between the userspace mapping and the kernel space mapping before
> copying a COW page.
When the user obtains write access to the page, we'll flush.
Since there are many locations at which write access can be
obtained, there are many locations where the synchronization
is obtained.
One popular way to obtain the synchronization is to implement
flush_dcache_page() to flush, and implement clear_page() and
copy_user_page() to clear and copy pages in kernel space at
special temporary mappings whose virtual address will alias
up properly with userspace's mapping. That's why we pass a
virtual address to these two arch functions (a small sketch of this
scheme follows).
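As an illustrative sketch of that scheme (the helpers kmap_same_color()
and kunmap_same_color() are hypothetical stand-ins for something like the
MIPS kmap_coherent()/kunmap_coherent() earlier in this series, not an
existing API):

/*
 * Copy the source page through a kernel alias that shares the cache
 * color of the user virtual address 'vaddr', so the copy sees the
 * lines the user mapping dirtied and no alias flush is needed.
 */
void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
		    struct page *to)
{
	void *alias = kmap_same_color(virt_to_page(vfrom), vaddr);

	copy_page(vto, alias);
	kunmap_same_color(alias);
}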
On Fri, 20 Oct 2006, Russell King wrote:
>
> Well, looking at do_wp_page() I'm now quite concerned about ARM and COW.
> I can't see how this code could _possibly_ work with a virtually indexed
> cache as it stands. Yet, the kernel does appear to work.
It really shouldn't need any extra code, exactly because by the time it
hits any page-fault, the caches had better be in sync with the physical
page contents _anyway_ (yes, being virtual, the caches will _duplicate_
the contents, but since the pages are read-only, that aliasing should be
perfectly fine).
> I'm afraid I'm utterly confused with the Linux MM in this day and age, so
> I don't think I can even consider commenting on this change.
Well, we'd need somebody to verify that it still works, but quite frankly,
the likelihood of it breaking anything seems basically nil.
> However, when I look at this code now, I see _no where_ where we synchronise
> the cache between the userspace mapping and the kernel space mapping before
> copying a COW page.
At the COW, it should be synchronized already, exactly because we did the
flush_cache_mm() when we _created_ the COW mapping in the first place.
It's just that we weren't quite careful enough at that time (and even
then, that would only matter for some really really unlikely and strange
situations that only happen when you fork() from a _threaded_ environment,
so it shouldn't be anything you'd notice under normal load).
I think.
> So I'm afraid I'm going to have to hold up my hand and say "I don't
> understand the Linux MM anymore".
There are few enough people who understand it even though they're supposed
to. I certainly have to always go back and look and read the code when
there is anything subtle going on, and even then I want to be backed up by
one of the _competent_ people ;)
Linus
On Fri, Oct 20, 2006 at 02:06:19PM -0700, David Miller wrote:
> From: Russell King <[email protected]>
> Date: Fri, 20 Oct 2006 21:59:29 +0100
>
> > However, when I look at this code now, I see _no where_ where we synchronise
> > the cache between the userspace mapping and the kernel space mapping before
> > copying a COW page.
>
> When the user obtains write access to the page, we'll flush.
>
> Since there are many locations at which write access can be
> obtained, there are many locations where the synchronization
> is obtained.
>
> One popular way to obtain the synchronization is to implement
> flush_dcache_page() to flush, and implement clear_page() and
> copy_user_page() to clear and copy pages in kernel space at
> special temporary mappings whose virtual address will alias
> up properly with userspace's mapping. That's why we pass a
> virtual address to these two arch functions.
I did say I had a VIVT cache. With such a cache, the *only* place where
you can read data written via one mapping is via that very same mapping.
There is no other virtual address which will give you coherent access to
the data in another mapping.
The majority of ARMs to date have been VIVT, and the majority of Linux
kernels have worked fine (albeit the "recent" breakage of PIO block IO.)
I'm now in the situation where I come back to look at the MM code and,
to put it quite frankly, I can't see any possible way for ARM to work
with this code. In practice, however, it does appear to work. I just
can't see _why_ it's working.
Hence I'm raising the "I don't understand" flag and refraining from
endorsing the patch - I _can't_ endorse what I don't understand.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
On Fri, Oct 20, 2006 at 02:12:11PM -0700, Linus Torvalds wrote:
> On Fri, 20 Oct 2006, Russell King wrote:
> > Well, looking at do_wp_page() I'm now quite concerned about ARM and COW.
> > I can't see how this code could _possibly_ work with a virtually indexed
> > cache as it stands. Yet, the kernel does appear to work.
>
> It really shouldn't need any extra code, exactly because by the time it
> hits any page-fault, the caches had better be in sync with the physical
> page contents _anyway_ (yes, being virtual, the caches will _duplicate_
> the contents, but since the pages are read-only, that aliasing should be
> perfectly fine).
Oh, of course! That explains why it actually works as expected! Thanks
for filling back in that bit of swapped-out-years-ago-and-lost information.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
From: Russell King <[email protected]>
Date: Fri, 20 Oct 2006 22:17:23 +0100
> I did say I had a VIVT cache.
And everything I said was premised with this understanding :-)
On Fri, Oct 20, 2006 at 02:12:11PM -0700, Linus Torvalds wrote:
> > Well, looking at do_wp_page() I'm now quite concerned about ARM and COW.
> > I can't see how this code could _possibly_ work with a virtually indexed
> > cache as it stands. Yet, the kernel does appear to work.
>
> It really shouldn't need any extra code, exactly because by the time it
> hits any page-fault, the caches had better be in sync with the physical
> page contents _anyway_ (yes, being virtual, the caches will _duplicate_
> the contents, but since the pages are read-only, that aliasing should be
> perfectly fine).
Until yesterday I also thought multiple read-only copies wouldn't do any
harm. Well, until I learned about the wonderful behaviour of the PA8800
caches. PA8800 has VIPT primary caches, PIPT secondary caches. And the
sinister part - caches are exclusive, that is a cacheline is either in
L1 or L2 but never in both and can migrate between L1 and L2. Now
consider the following scenario:
o physical address P is mapped to two aliasing addresses V1 and V2
o a load from V1 results in a clean line in L1 caching P at index V1.
o a store to V2 results in a dirty line in L1 caching P at index V2.
o the line at V2 is getting written back to memory.
o a victim replacement of the line at V1 results in the _clean_ line
migrating back from L1 to L2.
-> another read from V2 will return stale data.
As a consequence flush_cache_mm() on PA (or at least PA8800) currently blows
away the entire cache, as Kyle McMartin just told me - the whole 1.5MB of L1
and 32MB of L2, making fork an ultra-heavy operation.
> It's just that we weren't quite careful enough at that time (and even
> then, that would only matter for some really really unlikely and strange
> situations that only happen when you fork() from a _threaded_ environment,
> so it shouldn't be anything you'd notice under normal load).
>
> I think.
The flush has been there for a very long time. I have had it in my tree since
~2.1.36 and I get the feeling nobody has ever seriously revisited
the issue since.
Ralf
On Fri, Oct 20, 2006 at 01:10:59PM -0700, Linus Torvalds wrote:
> Ok, this sounds sane.
>
> What should we do about this? How does this patch look to people?
>
> (Totally untested, and I'm not sure we should even do that whole
> "oldmm->mm_users" test, but I'm throwing it out here for discussion, in
> case it matters for performance. The second D$ flush should obviously be
> unnecessary for the common unthreaded case, which is why none of this has
> mattered historically, I think).
>
> Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off
> on this, and somebody to write a nice explanation for the changelog (and
> preferably do this through -mm too).
As a minimal solution your patch would work for MIPS but performance would be
suboptimal.
With my D-cache alias series applied the flush_cache_mm() in dup_mmap()
becomes entirely redundant. When I delete the call (not part of my patchset)
it means 12% faster fork. But I'm not proposing this for 2.6.19.
Note this does not make the flush_cache_mm() on process termination
redundant ...
Ralf
On Fri, 20 Oct 2006, Ralf Baechle wrote:
>
> As a minimal solution your patch would work for MIPS but performance would be
> suboptimal.
Not so.
> With my D-cache alias series applied the flush_cache_mm() in dup_mmap()
> becomes entirely redundant.
No it does not, as pointed out by Nick.
If there are dirty lines in the virtual cache, they _must_ be flushed long
before the COW happens. Because if they are not, they don't show up in the
child of the fork (which only sees its _own_ virtual cache). See?
> When I delete the call (not part of my patchset) it means 12% faster
> fork. But I'm not proposing this for 2.6.19.
I just suspect it means a _buggy_ fork.
It so happens (I think), that fork is big enough that it probably flushes
the L1 cache _anyway_.
Does MIPS have some kind of "flush_cache_mm()" that could only flush
user-level caches? Maybe the overhead is from flushing all dirty
cachelines, regardless of whether they are kernel or not (and dirty kernel
cachelines are going to be the most common by far in that path).
Linus
From: Linus Torvalds <[email protected]>
Date: Fri, 20 Oct 2006 15:02:39 -0700 (PDT)
> On Fri, 20 Oct 2006, Ralf Baechle wrote:
> > When I delete the call (not part of my patchset) it means 12% faster
> > fork. But I'm not proposing this for 2.6.19.
>
> I just suspect it means a _buggy_ fork.
>
> It so happens (I think), that fork is big enough that it probably flushes
> the L1 cache _anyway_.
My understanding is that this works because in Ralf's original patch
(which is the context in which he is removing the flush_cache_mm()
call), he uses kmap()/kunmap() to map the page(s) being accessed at a
kernel virtual address which will fall into the same cache color as
the user virtual address --> no alias problems.
Since he does this for every page touched on the kernel side during
dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in
fact become redundant.
On Fri, Oct 20, 2006 at 03:22:47PM -0700, David Miller wrote:
> > On Fri, 20 Oct 2006, Ralf Baechle wrote:
> > > When I delete the call (not part of my patchset) it means 12% faster
> > > fork. But I'm not proposing this for 2.6.19.
> >
> > I just suspect it means a _buggy_ fork.
> >
> > It so happens (I think), that fork is big enough that it probably flushes
> > the L1 cache _anyway_.
I doubt it; I've tested this on 64K I-cache VIPT, 64K D-cache VIPT.
> My understanding is that this works because in Ralf's original patch
> (which is the context in which he is removing the flush_cache_mm()
> call), he uses kmap()/kunmap() to map the page(s) being accessed at a
> kernel virtual address which will fall into the same cache color as
> the user virtual address --> no alias problems.
>
> Since he does this for every page touched on the kernel side during
> dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in
> fact become redundant.
Correct.
It means there is no cache flush operation left in the fork and COW code to
deal with aliases at all.
Another advantage of this strategy is that we will have fewer virtual
coherency exceptions to handle. A virtual coherency exception is raised
on some MIPS processors when they detect the creation of a cache alias.
This allows the software to clean up the caches. Neat as an alarm system for
alias debugging but rather expensive to service if large numbers are
raised, not available on all processors, and it also detects the creation of
harmless aliases of clean lines, thus a slight annoyance.
Ralf
On Fri, 20 Oct 2006, Ralf Baechle wrote:
>
> > My understanding is that this works because in Ralf's original patch
> > (which is the context in which he is removing the flush_cache_mm()
> > call), he uses kmap()/kunmap() to map the page(s) being accessed at a
> > kernel virtual address which will fall into the same cache color as
> > the user virtual address --> no alias problems.
> >
> > Since he does this for every page touched on the kernel side during
> > dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in
> > fact become redundant.
>
> Correct.
>
> It means no cache flush operation to deal with aliases at all left in
> fork and COW code.
Umm. That would seem to only happen to work for a direct-mapped virtually
indexed cache where the index is taken purely from the virtual address,
and there are no "process context" bits in the virtually indexed D$.
The moment there are process context bits involved, afaik you absolutely
_need_ to flush, because otherwise the other process will never pick up
the dirty state (which it would need to reload from memory).
That said, maybe nobody does that. Virtual caches are a total braindamage
in the first place, so hopefully they have limited use.
Linus
On Fri, Oct 20, 2006 at 04:28:37PM -0700, Linus Torvalds wrote:
> > > My understanding is that this works because in Ralf's original patch
> > > (which is the context in which he is removing the flush_cache_mm()
> > > call), he uses kmap()/kunmap() to map the page(s) being accessed at a
> > > kernel virtual address which will fall into the same cache color as
> > > the user virtual address --> no alias problems.
> > >
> > > Since he does this for every page touched on the kernel side during
> > > dup_mmap(), the existing flush_cache_mm() call in dup_mmap() does in
> > > fact become redundant.
> >
> > Correct.
> >
> > It means no cache flush operation to deal with aliases at all left in
> > fork and COW code.
>
> Umm. That would seem to only happen to work for a direct-mapped virtually
> indexed cache where the index is taken purely from the virtual address,
> and there are no "process context" bits in the virtually indexed D$.
No MIPS processor has something like that. See below.
> The moment there are process context bits involved, afaik you absolutely
> _need_ to flush, because otherwise the other process will never pick up
> the dirty state (which it would need to reload from memory).
Correct.
> That said, maybe nobody does that. Virtual caches are a total braindamage
> in the first place, so hopefully they have limited use.
On MIPS we never had pure virtual caches. The major variants in existence
are:
o D-cache PIPT, I-cache PIPT
o PIVT (no typo!)
Only the R6000 has this and it's not supported by Linux.
o D-cache VIPT, I-cache VIPT
This is by far the most common arrangement on any MIPS designed since '91.
A variant of these caches has hardware logic to detect cache aliases and
fix them automatically, and therefore is equivalent to PIPT even though
it is not implemented as PIPT. Obviously replaying the pipeline on an
alias costs a few cycles. SGI's R10000 family belongs in this class, and
the 24K/34K family of synthesizable cores by MIPS Technologies has this
as a synthesis option.
Another variant throws virtual coherency exceptions as I've explained in
another thread.
o D-cache PIPT, I-cache VIVT with additional address space tags.
o Cacheless. Not usually running Linux but heck, it's working anyway.
Be sure I'm sending CPU designers a strong message about aliases. And I
think they're slowly getting the message that kernel hackers like to poke
needles into voodoo dolls for aliases ;-)
Ralf
On Sat, 21 Oct 2006, Ralf Baechle wrote:
>
> > That said, maybe nobody does that. Virtual caches are a total braindamage
> > in the first place, so hopefully they have limited use.
>
> On MIPS we never had pure virtual caches.
Ok, so on MIPS my scenario doesn't matter.
I think (but may be mistaken) that ARM _does_ have pure virtual caches
with a process ID, but people have always ended up flushing them at
context switch simply because it just causes too much trouble.
Sparc? VIPT too? Davem?
I have absolutely zero clue about s390.
Anyway, it sounds to me like this is too big to decide for 2.6.19,
and as far as I can tell this is not a regression, right? Ie we've always
had the aliasing issue. Ralf?
But it would be good to have something for the early -rc1 sequence for
2.6.20, and maybe the MIPS COW D$ patches are it, if it has performance
advantages on MIPS that can also be translated to other virtual cache
users..
> Be sure I'm sending CPU designers a strong message about aliases.
Castration. That's the best solution. We don't want those people
procreating.
Linus
On Sat, Oct 21, 2006 at 02:47:56AM +1000, Nick Piggin wrote:
> >So maybe the COW D$ aliasing patch-series is just the right thing to do.
> >Not worry about D$ at _all_ when doing the actual fork, and only worry
> >about it on an actual COW event. Hmm?
>
> Well if we have the calls in there, we should at least make them work
> right for the architectures there now. At the moment the flush_cache_mm
> before the copy_page_range wouldn't seem to do anything if you can still
> have threads dirty the cache again through existing TLB entries.
If you're talking about getting 2.6.19 out of the door without risking
too much wreckage, I'm all for it.
> I don't think that flushing on COW is exactly right though, because dirty
> data can remain invisible if you're only doing reads (no write, no flush).
> And if that cache gets written back at some point, you're going to see
> supposedly RO data change underneath you. I think?
You will not be able to avoid aliases at COW breaking time by any kind of
cache flush. In a VIPT cache the userspace page is living at its
userspace address, but the current implementation of copy_user_page will
try to copy it at its kernel space address, and those two may not live
at the same cache index.
So, when breaking a COW mapping you have two options:
1) Copying involves touching a userspace mapped page at its kernel address.
   This may result in an alias, so appropriate cache flushing will be
   needed. On a MIPS system this flush itself is like 1,000 cycles but
   could easily be several times as much. Add the cost of bringing all the
   data that was blown away from the cache back later on.
2) Try to be smart and avoid the creation of aliases by setting up proper
   temporary mappings. Creating and tearing down a TLB mapping is dirt
   cheap.
The current strategy is #1, which happens to work for MIPS because at most
an alias between clean lines is created, which is harmless on MIPS but
would blow up on PA8800 without overkill flushing.
It is possible to implement the common sequence of fork + exec such that
the child never ever breaks a COW page, not even a stack page. So why
should a PIPT or VIPT cache be flushed in such a case? We can do better
than that.
Ralf
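Purely as an illustration of option 2, here is a sketch with hypothetical
helper names (map_same_colour()/unmap_same_colour() are stand-ins, not a
real kernel API). The idea is what the MIPS patch later in this thread
does with kmap_coherent(): map the page(s) involved in the COW copy at
temporary kernel addresses with the same cache colour as the user address,
so the copy itself cannot create a conflicting alias. (The actual patch
below maps only the source page this way and flushes the destination when
pages_do_alias() or VM_EXEC requires it.)

/* sketch only -- helper names are hypothetical */
static void cow_copy_without_alias(struct page *dst, struct page *src,
				   unsigned long uaddr)
{
	/* temporary kernel mappings with the same cache colour as uaddr */
	void *vfrom = map_same_colour(src, uaddr);
	void *vto   = map_same_colour(dst, uaddr);

	copy_page(vto, vfrom);

	unmap_same_colour(vto);
	unmap_same_colour(vfrom);
	/* nothing left to flush: both mappings share the user's cache index */
}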
Linus Torvalds writes:
> I think (but may be mistaken) that ARM _does_ have pure virtual caches
> with a process ID, but people have always ended up flushing them at
> context switch simply because it just causes too much trouble.
>
> Sparc? VIPT too? Davem?
There is one PowerPC embedded chip family, the PPC440, which has a
virtual i-cache with a process ID tag. The d-cache is sane though.
Of course, the i-cache being readonly means we avoid the nastier
issues.
Paul.
From: Linus Torvalds <[email protected]>
Date: Fri, 20 Oct 2006 17:38:32 -0700 (PDT)
> I think (but may be mistaken) that ARM _does_ have pure virtual caches
> with a process ID, but people have always ended up flushing them at
> context switch simply because it just causes too much trouble.
>
> Sparc? VIPT too? Davem?
sun4c is VIVT, but has no SMP variants.
sun4m has both VIPT and PIPT.
> But it would be good to have something for the early -rc1 sequence for
> 2.6.20, and maybe the MIPS COW D$ patches are it, if it has performance
> advantages on MIPS that can also be translated to other virtual cache
> users..
I think it could help for sun4m highmem configs.
On Fri, 20 Oct 2006, David Miller wrote:
>
> From: Linus Torvalds <[email protected]>
> Date: Fri, 20 Oct 2006 17:38:32 -0700 (PDT)
>
> > I think (but may be mistaken) that ARM _does_ have pure virtual caches
> > with a process ID, but people have always ended up flushing them at
> > context switch simply because it just causes too much trouble.
> >
> > Sparc? VIPT too? Davem?
>
> sun4c is VIVT, but has no SMP variants.
You don't need SMP - we have sleeping sections here, so even threads on UP
can trigger it.
Now, to trigger it you need to have
- virtual indexing not just by address, but by some "address space
identifier" thing too
- (in practice) a big enough cache that switching tasks wouldn't flush it
anyway.
> sun4m has both VIPT and PIPT.
>
> > But it would be good to have something for the early -rc1 sequence for
> > 2.6.20, and maybe the MIPS COW D$ patches are it, if it has performance
> > advantages on MIPS that can also be translated to other virtual cache
> > users..
>
> I think it could help for sun4m highmem configs.
Well, if you can re-create the performance numbers (Ralf - can you send
the full series with the final "remove the now unnecessary flush" to
Davem?), that will make deciding things easier, I think.
I suspect sparc, mips and arm are the main architectures where virtually
indexed caching really matters enough for this to be an issue at all.
Linus
From: Linus Torvalds <[email protected]>
Date: Fri, 20 Oct 2006 19:37:24 -0700 (PDT)
> On Fri, 20 Oct 2006, David Miller wrote:
> > I think it could help for sun4m highmem configs.
>
> Well, if you can re-create the performance numbers (Ralf - can you send
> the full series with the final "remove the now unnecessary flush" to
> Davem?), that will make deciding things easier, I think.
>
> I suspect sparc, mips and arm are the main architectures where virtually
> indexed caching really matters enough for this to be an issue at all.
Unfortunately, I don't have any sparc 32-bit systems any more,
so I can't really help out here. I just make sure the build
keeps working :-)
On Fri, 20 Oct 2006 22:41:22 +0100, Ralf Baechle <[email protected]> wrote:
> > It's just that we weren't quite careful enough at that time (and even
> > then, that would only matter for some really really unlikely and strange
> > situations that only happen when you fork() from a _threaded_ environment,
> > so it shouldn't be anything you'd notice under normal load).
> >
> > I think.
>
> The flush has been there for a very long time. I have had it in my tree
> since ~ 2.1.36 and I get the feeling nobody has seriously revisited the
> issue since.
I think calling fork() (or system() or popen() or so) in a threaded
program is neither very unlikely nor strange. But this breakage happens
very rarely indeed, especially on a non-preemptive kernel.
While debugging this issue, I used this test program and a slightly
modified kernel --- inserting yield() in the middle of dup_mmap().
With the modified kernel on a 32KB VIPT D$, running this test program
a few times could reproduce the breakage ("BAD!" messages). I heard the
PARISC people succeeded in reproducing it too.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sched.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Each thread keeps dirtying its stack buffer between sched_yield() calls;
 * if a concurrent fork() loses dirty cache lines, the check below fails.
 */
static void *thread_func(void *arg)
{
	unsigned char buf[2048], j;
	int i;

	for (j = 0; ; j++) {
		/* fill buf[] with j */
		memset(buf, j, sizeof(buf)/2);
		sched_yield();
		memset(buf + sizeof(buf)/2, j, sizeof(buf)/2);
		sched_yield();
		/* check buf[] contents */
		for (i = 0; i < sizeof(buf); i++) {
			if (buf[i] != j) {
				printf("BAD! %p (%d != %d)\n",
				       buf + i, buf[i], j);
				exit(1);
			}
		}
	}
}

int main(int argc, char *argv[])
{
	int i;
	pid_t pid;
	pthread_t tid;

	for (i = 0; i < 4; i++)
		pthread_create(&tid, NULL, thread_func, NULL);

	/* fork repeatedly while the threads above are running */
	for (i = 0; i < 100; i++) {
		pid = fork();
		if (pid == -1) {
			perror("fork");
			exit(1);
		}
		if (pid)
			waitpid(pid, NULL, 0);
		else
			exit(0);
	}
	return 0;
}
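(For reference: the program must be linked with -lpthread. On an
unaffected kernel it should exit silently after the 100 forks; with the
yield() inserted into dup_mmap() as described above it should print the
"BAD!" message and exit with status 1 within a few runs.)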
---
Atsushi Nemoto
On Fri, Oct 20, 2006 at 07:37:24PM -0700, Linus Torvalds wrote:
> Well, if you can re-create the performance numbers (Ralf - can you send
> the full series with the final "remove the now unnecessary flush" to
> Davem?), that will make deciding things easier, I think.
>
> I suspect sparc, mips and arm are the main architectures where virtually
> indexed caching really matters enough for this to be an issue at all.
What I was using for my fork benchmark was basically the series as posted
in this thread + the quick hack patch below.
I'll dig up some numbers for the posted patchset and will send them later.
Ralf
diff --git a/kernel/fork.c b/kernel/fork.c
index 29ebb30..c83d226 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -202,7 +202,6 @@ static inline int dup_mmap(struct mm_str
struct mempolicy *pol;
down_write(&oldmm->mmap_sem);
- flush_cache_mm(oldmm);
/*
* Not linked in yet - no deadlock potential:
*/
On Fri, Oct 20, 2006 at 07:37:24PM -0700, Linus Torvalds wrote:
> Well, if you can re-create the performance numbers (Ralf - can you send
> the full series with the final "remove the now unnecessary flush" to
> Davem?), that will make deciding things easier, I think.
Below are numbers and comments from Atsushi Nemoto on two Toshiba TX49
cores with 16K and 32K primary caches respectively. Each lmbench run was
repeated twice. The numbers were taken without the flush_cache_mm hack to
dup_mmap(), so the 12% on fork can easily be obtained in addition on VIPT
caches such as the TX49.
Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
TX49-16K without-patch 197 0.72 2.40 17.7 34.3 82.9 2.84 26.2 2500 9364 39.K
TX49-16K without-patch 197 0.73 2.40 17.7 34.4 82.9 2.85 26.1 2495 9337 39.K
TX49-16K without-patch 197 0.72 2.40 17.8 34.3 82.9 2.85 26.1 2501 9341 39.K
TX49-16K with-patch 197 0.72 2.39 20.1 31.9 82.9 2.85 20.2 2491 9101 38.K
TX49-16K with-patch 197 0.72 2.39 20.1 32.8 82.9 2.86 20.2 2496 9058 38.K
TX49-16K with-patch 197 0.72 2.39 20.1 32.8 82.9 2.85 20.3 2501 9074 38.K
TX49-32K without-patch 396 0.36 1.19 6.78 11.3 41.0 1.41 8.15 1246 4674 19.K
TX49-32K without-patch 396 0.36 1.19 6.78 11.3 41.0 1.41 8.17 1251 4680 19.K
TX49-32K without-patch 396 0.36 1.19 6.79 11.3 41.0 1.41 8.15 1250 4682 19.K
TX49-32K with-patch 396 0.36 1.19 6.79 10.2 41.0 1.41 8.14 1230 4638 19.K
TX49-32K with-patch 396 0.36 1.19 6.78 10.2 40.9 1.41 8.14 1241 4628 19.K
TX49-32K with-patch 396 0.36 1.19 6.79 10.2 40.9 1.41 8.14 1238 4627 19.K
A little bit faster on exec/proc and open/close (why?). Strange
results on sig/hndl and stat on TX49-16K again.
Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
TX49-16K without-patch 4.7800 87.1 36.5 97.2 47.8 101.1 40.8
TX49-16K without-patch 5.4000 88.4 28.9 96.2 39.6 101.2 40.8
TX49-16K without-patch 4.6800 84.5 32.7 96.8 46.5 100.2 42.9
TX49-16K with-patch 2.4600 82.7 34.0 93.5 50.5 97.2 43.3
TX49-16K with-patch 1.5200 87.7 33.6 95.1 42.4 99.7 43.6
TX49-16K with-patch 1.7700 86.2 34.1 95.8 49.0 99.2 41.7
TX49-32K without-patch 31.4 11.3 72.1 15.2 72.3 16.9
TX49-32K without-patch 30.4 11.6 72.2 16.1 73.2 15.1
TX49-32K without-patch 33.5 12.1 71.2 15.5 73.0 17.0
TX49-32K with-patch 30.9 11.5 72.3 17.4 73.1 17.5
TX49-32K with-patch 31.5 11.9 71.8 15.8 73.0 16.7
TX49-32K with-patch 32.5 10.4 71.7 16.0 72.5 16.6
No noticeable differences.
File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
TX49-16K without-patch 251.8 192.5 1212.1 397.6 500.0 4.388 7.32710 50.5
TX49-16K without-patch 254.7 193.9 1197.6 394.9 505.0 4.412 7.34230 50.5
TX49-16K without-patch 252.6 193.6 1212.1 399.4 499.0 4.758 7.33790 50.5
TX49-16K with-patch 251.8 192.2 1207.7 391.7 502.0 0.143 7.32320 50.5
TX49-16K with-patch 252.7 194.0 1200.5 393.7 505.0 0.108 7.32030 50.5
TX49-16K with-patch 252.0 191.8 1199.0 392.3 508.0 0.011 7.33150 50.5
TX49-32K without-patch 86.0 54.8 461.3 146.3 378.0 1.818 5.45460 25.0
TX49-32K without-patch 86.5 54.1 454.3 148.1 378.0 1.816 5.47120 25.0
TX49-32K without-patch 86.7 53.8 458.1 148.0 378.0 2.130 5.48540 25.0
TX49-32K with-patch 90.4 52.5 460.8 148.7 377.0 0.471 5.46340 25.0
TX49-32K with-patch 88.8 52.6 462.5 148.6 380.0 0.476 5.44630 25.0
TX49-32K with-patch 88.7 52.9 466.4 147.8 378.0 0.477 5.49560 25.0
Major improvements on Prot/Fault.
Ralf
With parts of the previous patch 3/3 having just been applied, this new
patch replaces it.
< ---- snip ---- >
From: Atsushi Nemoto <[email protected]>
MIPS: Fix COW D-cache aliasing on fork
Provide a custom copy_user_highpage() to deal with aliasing issues on
MIPS. It uses kmap_coherent() to map a user page for the kernel at an
address with the same cache color.
The main part of this patch was originally written by Ralf Baechle;
Atsushi Nemoto did the debugging.
Signed-off-by: Atsushi Nemoto <[email protected]>
Signed-off-by: Ralf Baechle <[email protected]>
arch/mips/mm/init.c | 25 +++++++++++++++++++++++++
include/asm-mips/page.h | 17 ++++++-----------
2 files changed, 31 insertions(+), 11 deletions(-)
Index: upstream-alias/arch/mips/mm/init.c
===================================================================
--- upstream-alias.orig/arch/mips/mm/init.c 2006-10-22 03:08:41.000000000 +0100
+++ upstream-alias/arch/mips/mm/init.c 2006-10-22 03:08:45.000000000 +0100
@@ -203,6 +203,31 @@ static inline void kunmap_coherent(struc
preempt_check_resched();
}
+void copy_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma)
+{
+ void *vfrom, *vto;
+
+ vto = kmap_atomic(to, KM_USER1);
+ if (cpu_has_dc_aliases) {
+ vfrom = kmap_coherent(from, vaddr);
+ copy_page(vto, vfrom);
+ kunmap_coherent(from);
+ } else {
+ vfrom = kmap_atomic(from, KM_USER0);
+ copy_page(vto, vfrom);
+ kunmap_atomic(vfrom, KM_USER0);
+ }
+ if (((vma->vm_flags & VM_EXEC) && !cpu_has_ic_fills_f_dc) ||
+ pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK))
+ flush_data_cache_page((unsigned long)vto);
+ kunmap_atomic(vto, KM_USER1);
+ /* Make sure this page is cleared on other CPU's too before using it */
+ smp_wmb();
+}
+
+EXPORT_SYMBOL(copy_user_highpage);
+
void copy_to_user_page(struct vm_area_struct *vma,
struct page *page, unsigned long vaddr, void *dst, const void *src,
unsigned long len)
Index: upstream-alias/include/asm-mips/page.h
===================================================================
--- upstream-alias.orig/include/asm-mips/page.h 2006-10-22 03:08:41.000000000 +0100
+++ upstream-alias/include/asm-mips/page.h 2006-10-22 03:08:45.000000000 +0100
@@ -34,8 +34,6 @@
#ifndef __ASSEMBLY__
-#include <asm/cpu-features.h>
-
extern void clear_page(void * page);
extern void copy_page(void * to, void * from);
@@ -59,16 +57,13 @@ static inline void clear_user_page(void
flush_data_cache_page((unsigned long)addr);
}
-static inline void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
- struct page *to)
-{
- extern void (*flush_data_cache_page)(unsigned long addr);
+extern void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
+ struct page *to);
+struct vm_area_struct;
+extern void copy_user_highpage(struct page *to, struct page *from,
+ unsigned long vaddr, struct vm_area_struct *vma);
- copy_page(vto, vfrom);
- if (!cpu_has_ic_fills_f_dc ||
- pages_do_alias((unsigned long)vto, vaddr & PAGE_MASK))
- flush_data_cache_page((unsigned long)vto);
-}
+#define __HAVE_ARCH_COPY_USER_HIGHPAGE
/*
* These are used to make use of C type-checking..
On Fri, 2006-10-20 at 13:10 -0700, Linus Torvalds wrote:
>
> On Fri, 20 Oct 2006, David Miller wrote:
> >
> > I did some more digging, here's what I think the hardware actually
> > does:
>
> Ok, this sounds sane.
>
> What should we do about this? How does this patch look to people?
>
> (Totally untested, and I'm not sure we should even do that whole
> "oldmm->mm_users" test, but I'm throwing it out here for discussion, in
> case it matters for performance. The second D$ flush should obviously be
> unnecessary for the common unthreaded case, which is why none of this has
> mattered historically, I think).
>
> Comments? We need ARM, MIPS, sparc and S390 at the very least to sign off
> on this, and somebody to write a nice explanation for the changelog (and
> preferably do this through -mm too).
On s390 you never have to worry about cache flushing. It is not stated
anywhere in the Principles of Operation, but the architecture has to be
PIPT, otherwise it couldn't possibly work. The best indication for it is
that there is no cache flush instruction. The view on data in memory is
always consistent.
--
blue skies,
Martin.
Martin Schwidefsky
Linux for zSeries Development & Services
IBM Deutschland Entwicklung GmbH
"Reality continues to ruin my life." - Calvin.
On Fri, Oct 20, 2006 at 05:38:32PM -0700, Linus Torvalds wrote:
> On Sat, 21 Oct 2006, Ralf Baechle wrote:
> > > That said, maybe nobody does that. Virtual caches are a total braindamage
> > > in the first place, so hopefully they have limited use.
> >
> > On MIPS we never had pure virtual caches.
>
> Ok, so on MIPS my scenario doesn't matter.
>
> I think (but may be mistaken) that ARM _does_ have pure virtual caches
> with a process ID, but people have always ended up flushing them at
> context switch simply because it just causes too much trouble.
Just read this, sorry.
The majority of ARM CPU implementations have pure virtual caches
_without_ process IDs. (Some have a nasty hack which involves
remapping the lower 32MB of virtual memory space to other areas
of the cache's virtual space, but obviously that limits you to
32MB of VM.)
Thankfully, with ARM version 6, they had an inkling of clue, and
decided to move to VIPT caches but with _optional_ aliasing, and
if the CPU design was Harvard there's a possibility for D/I cache
aliasing.
> > Be sure I'm sending a CPU designers a strong message about aliases.
>
> Castration. That's the best solution. We don't want those people
> procreating.
Absolutely. Can we start such a program in Cambridge, England ASAP
please?
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: