2002-06-13 20:15:41

by Andi Kleen

Subject: New version of pageattr caching conflict fix for 2.4


Here is the latest version of pageattr for i386. The patch changes
the caching attributes in the kernel mapping dynamically, to avoid
mappings of the same memory with different caching policies (as can
happen with the AGP GART), which can lead to data corruption on some
CPUs. It also fixes some other potential kernel mapping caching
conflicts.

The patch does not require turning off large pages for the kernel
globally; instead it changes the kernel large page mapping to 4K
mappings and back as needed.

It adds a new function

int change_page_attr(struct page *page, pgprot_t mapping)

Every time someone maps a page with write-back or write-combining
or another non-standard caching mode into any page table, the corresponding
kernel page must be remapped with this function. After that, the virtual
kernel address of the page (page_address(page)) has the caching policy
specified in mapping. In theory it could be used to write-protect kernel
pages too, but that's probably dangerous and should be avoided.
(Best is to use PAGE_KERNEL, PAGE_KERNEL_NOCACHE and perhaps
MAKE_GLOBAL(__PAGE_KERNEL | _PAGE_PWT) for write combining if you really
need it.) The function is currently i386 specific, but at least x86-64 needs
it too.

Note that some drivers previously did this with ioremap_nocache. That
is incorrect; such uses should be replaced with change_page_attr().
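
A minimal usage sketch (hypothetical driver code, not part of the patch;
the error handling is made up for illustration):

        struct page *page = alloc_page(GFP_KERNEL);
        if (!page)
                return -ENOMEM;
        /* make the kernel linear mapping of this page uncacheable */
        if (change_page_attr(page, PAGE_KERNEL_NOCACHE) < 0) {
                __free_page(page);
                return -ENOMEM;
        }
        /* ... hand virt_to_phys(page_address(page)) to the device ... */

        /* restore the standard write-back mapping before freeing */
        change_page_attr(page, PAGE_KERNEL);
        __free_page(page);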

Also available at:
ftp://ftp.suse.com/pub/people/ak/v2.4/pageattr-2.4.19pre9-3

AMD's version to fix the same problem:
ftp://ftp.suse.com/pub/people/ak/amd-adv-spec-caching-disable-2.4.19pre9-1

Thanks to Ben LaHaise who found some problems in the original version.

I will probably submit this version for 2.4 unless someone finds a grave
problem in it.

-Andi


diff -burpN -X ../KDIFX -x *.[ch]-* linux-2.4.19pre9/arch/i386/mm/Makefile linux-2.4.19pre9-work/arch/i386/mm/Makefile
--- linux-2.4.19pre9/arch/i386/mm/Makefile Fri Dec 29 23:07:20 2000
+++ linux-2.4.19pre9-work/arch/i386/mm/Makefile Wed Jun 5 02:35:11 2002
@@ -9,6 +9,7 @@

O_TARGET := mm.o

-obj-y := init.o fault.o ioremap.o extable.o
+obj-y := init.o fault.o ioremap.o extable.o pageattr.o
+export-objs := pageattr.o

include $(TOPDIR)/Rules.make
diff -burpN -X ../KDIFX -x *.[ch]-* linux-2.4.19pre9/arch/i386/mm/pageattr.c linux-2.4.19pre9-work/arch/i386/mm/pageattr.c
--- linux-2.4.19pre9/arch/i386/mm/pageattr.c Thu Jan 1 01:00:00 1970
+++ linux-2.4.19pre9-work/arch/i386/mm/pageattr.c Thu Jun 13 19:51:13 2002
@@ -0,0 +1,154 @@
+/*
+ * Copyright 2002 Andi Kleen, SuSE Labs.
+ * Thanks to Ben La Haise for precious feedback.
+ */
+
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+#include <asm/processor.h>
+
+/* Should move most of this stuff into the appropriate includes */
+#define PAGE_LARGE (_PAGE_PSE|_PAGE_PRESENT)
+#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1))
+#define LARGE_PAGE_SIZE (1UL << PMD_SHIFT)
+
+static inline pte_t *lookup_address(unsigned long address)
+{
+ pmd_t *pmd;
+ pgd_t *pgd = pgd_offset(&init_mm, address);
+
+ if ((pgd_val(*pgd) & PAGE_LARGE) == PAGE_LARGE)
+ return (pte_t *)pgd;
+ pmd = pmd_offset(pgd, address);
+ if ((pmd_val(*pmd) & PAGE_LARGE) == PAGE_LARGE)
+ return (pte_t *)pmd;
+
+ return pte_offset(pmd, address);
+}
+
+static pte_t *split_large_page(unsigned long address, pgprot_t prot)
+{
+ unsigned long addr;
+ int i;
+ pte_t *base = (pte_t *) __get_free_page(GFP_KERNEL);
+ if (!base)
+ return NULL;
+ BUG_ON(atomic_read(&virt_to_page(base)->count) != 1);
+ address = __pa(address);
+ addr = address & LARGE_PAGE_MASK;
+ for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
+ base[i] = mk_pte_phys(addr, addr == address ? prot : PAGE_KERNEL);
+ }
+ return base;
+}
+
+static void flush_kernel_map(void * address)
+{
+ if (boot_cpu_data.x86_model >= 4)
+ asm volatile("wbinvd":::"memory");
+ __flush_tlb_one(address);
+}
+
+static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+{
+ set_pte_atomic(kpte, pte); /* change init_mm */
+#ifndef CONFIG_X86_PAE
+ {
+ struct list_head *l;
+ spin_lock(&mmlist_lock);
+ list_for_each(l, &init_mm.mmlist) {
+ struct mm_struct *mm = list_entry(l, struct mm_struct, mmlist);
+ pmd_t *pmd = pmd_offset(pgd_offset(mm, address), address);
+ set_pte_atomic((pte_t *)pmd, pte);
+ }
+ spin_unlock(&mmlist_lock);
+ }
+#endif
+}
+
+/* no more special protections in this 2/4MB area - revert to a
+ large page again. */
+void revert_page(struct page *kpte_page, unsigned long address)
+{
+ pte_t *linear = (pte_t *)
+ pmd_offset(pgd_offset(&init_mm, address), address);
+ set_pmd_pte(linear, address,
+ mk_pte_phys(__pa(address & LARGE_PAGE_MASK),
+ __pgprot(_KERNPG_TABLE|_PAGE_PSE)));
+ __free_page(kpte_page);
+}
+
+/*
+ * Change the page attributes of an page in the linear mapping.
+ *
+ * This should be used when a page is mapped with a different caching policy
+ * than write-back somewhere - some CPUs do not like it when mappings with
+ * different caching policies exist. This changes the page attributes of the
+ * in kernel linear mapping too.
+ *
+ * The caller needs to ensure that there are no conflicting mappings elsewhere.
+ * This function only deals with the kernel linear map.
+ * When page is in highmem it must never be kmap'ed.
+ */
+int change_page_attr(struct page *page, pgprot_t prot)
+{
+ int err;
+ pte_t *kpte, *split;
+ unsigned long address;
+ struct page *kpte_page;
+
+#ifdef CONFIG_HIGHMEM
+ /* Hopefully not mapped anywhere else. */
+ if (page >= highmem_start_page)
+ return 0;
+#endif
+
+ address = (unsigned long)page_address(page);
+
+ down_write(&init_mm.mmap_sem);
+ kpte = lookup_address(address);
+ kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+ if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
+ if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ pte_t old = *kpte;
+ pte_t standard = mk_pte(page, PAGE_KERNEL);
+
+ /* Already a 4KB page. Just update it to the
+ * new protection. Doesn't need to be
+ * atomic. It doesn't matter if dirty bits get
+ * lost. */
+ set_pte_atomic(kpte, mk_pte(page, prot));
+ if (!memcmp(&old,&standard,sizeof(pte_t)))
+ atomic_inc(&kpte_page->count);
+ } else {
+ err = -ENOMEM;
+ split = split_large_page(address, prot);
+ if (!split)
+ goto out;
+ set_pmd_pte(kpte,address,
+ mk_pte(virt_to_page(split), PAGE_KERNEL));
+ }
+ } else if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
+ atomic_dec(&kpte_page->count);
+ }
+
+#ifdef CONFIG_SMP
+ smp_call_function(flush_kernel_map, (void *)address, 1, 1);
+#endif
+ flush_kernel_map((void *)address);
+
+ if (test_bit(X86_FEATURE_PSE, &boot_cpu_data.x86_capability) &&
+ (atomic_read(&kpte_page->count) == 1))
+ revert_page(kpte_page, address);
+ err = 0;
+ out:
+ up_write(&init_mm.mmap_sem);
+ return err;
+}
+
+EXPORT_SYMBOL(change_page_attr);
diff -burpN -X ../KDIFX -x *.[ch]-* linux-2.4.19pre9/drivers/char/agp/agpgart_be.c linux-2.4.19pre9-work/drivers/char/agp/agpgart_be.c
--- linux-2.4.19pre9/drivers/char/agp/agpgart_be.c Sun Jun 2 20:24:00 2002
+++ linux-2.4.19pre9-work/drivers/char/agp/agpgart_be.c Thu Jun 13 17:05:49 2002
@@ -397,7 +397,7 @@ int agp_unbind_memory(agp_memory * curr)
static void agp_generic_agp_enable(u32 mode)
{
struct pci_dev *device = NULL;
- u32 command, scratch, cap_id;
+ u32 command, scratch;
u8 cap_ptr;

pci_read_config_dword(agp_bridge.dev,
@@ -561,6 +561,7 @@ static int agp_generic_create_gatt_table
agp_bridge.current_size;
break;
}
+ temp = agp_bridge.current_size;
} else {
agp_bridge.aperture_size_idx = i;
}
@@ -578,23 +579,16 @@ static int agp_generic_create_gatt_table
}
table_end = table + ((PAGE_SIZE * (1 << page_order)) - 1);

- for (page = virt_to_page(table); page <= virt_to_page(table_end); page++)
+ for (page = virt_to_page(table); page <= virt_to_page(table_end); page++) {
SetPageReserved(page);
+ change_page_attr(page, PAGE_KERNEL_NOCACHE);
+ }

agp_bridge.gatt_table_real = (unsigned long *) table;
CACHE_FLUSH();
- agp_bridge.gatt_table = ioremap_nocache(virt_to_phys(table),
- (PAGE_SIZE * (1 << page_order)));
+ agp_bridge.gatt_table = agp_bridge.gatt_table_real;
CACHE_FLUSH();

- if (agp_bridge.gatt_table == NULL) {
- for (page = virt_to_page(table); page <= virt_to_page(table_end); page++)
- ClearPageReserved(page);
-
- free_pages((unsigned long) table, page_order);
-
- return -ENOMEM;
- }
agp_bridge.gatt_bus_addr = virt_to_phys(agp_bridge.gatt_table_real);

for (i = 0; i < num_entries; i++) {
@@ -651,7 +645,6 @@ static int agp_generic_free_gatt_table(v
* from the table.
*/

- iounmap(agp_bridge.gatt_table);
table = (char *) agp_bridge.gatt_table_real;
table_end = table + ((PAGE_SIZE * (1 << page_order)) - 1);

@@ -769,6 +762,10 @@ static unsigned long agp_generic_alloc_p
if (page == NULL) {
return 0;
}
+#ifdef __i386__
+ change_page_attr(page, __pgprot(__PAGE_KERNEL | _PAGE_PCD));
+#endif
+
get_page(page);
LockPage(page);
atomic_inc(&agp_bridge.current_memory_agp);
@@ -785,6 +782,11 @@ static void agp_generic_destroy_page(uns
}

page = virt_to_page(pt);
+#ifdef __i386__
+ change_page_attr(page, PAGE_KERNEL);
+#endif
+
+
put_page(page);
UnlockPage(page);
free_page((unsigned long) pt);
@@ -2242,16 +2244,8 @@ static int amd_create_page_map(amd_page_
return -ENOMEM;
}
SetPageReserved(virt_to_page(page_map->real));
- CACHE_FLUSH();
- page_map->remapped = ioremap_nocache(virt_to_phys(page_map->real),
- PAGE_SIZE);
- if (page_map->remapped == NULL) {
- ClearPageReserved(virt_to_page(page_map->real));
- free_page((unsigned long) page_map->real);
- page_map->real = NULL;
- return -ENOMEM;
- }
- CACHE_FLUSH();
+ change_page_attr(virt_to_page(page_map->real), PAGE_KERNEL_NOCACHE);
+ page_map->remapped = page_map->real;

for(i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
page_map->remapped[i] = agp_bridge.scratch_page;
@@ -2262,7 +2256,6 @@ static int amd_create_page_map(amd_page_

static void amd_free_page_map(amd_page_map *page_map)
{
- iounmap(page_map->remapped);
ClearPageReserved(virt_to_page(page_map->real));
free_page((unsigned long) page_map->real);
}
@@ -2744,27 +2737,22 @@ static void ali_cache_flush(void)

static unsigned long ali_alloc_page(void)
{
- struct page *page;
- u32 temp;
-
- page = alloc_page(GFP_KERNEL);
- if (page == NULL)
+ unsigned long p = agp_generic_alloc_page();
+ if (!p)
return 0;

- get_page(page);
- LockPage(page);
- atomic_inc(&agp_bridge.current_memory_agp);
-
+ /* probably not needed anymore */
global_cache_flush();

if (agp_bridge.type == ALI_M1541) {
+ u32 temp;
pci_read_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL, &temp);
pci_write_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL,
(((temp & ALI_CACHE_FLUSH_ADDR_MASK) |
- virt_to_phys(page_address(page))) |
+ virt_to_phys(p)) |
ALI_CACHE_FLUSH_EN ));
}
- return (unsigned long)page_address(page);
+ return p;
}

static void ali_destroy_page(unsigned long addr)
@@ -2786,11 +2774,7 @@ static void ali_destroy_page(unsigned lo
ALI_CACHE_FLUSH_EN));
}

- page = virt_to_page(pt);
- put_page(page);
- UnlockPage(page);
- free_page((unsigned long) pt);
- atomic_dec(&agp_bridge.current_memory_agp);
+ agp_generic_destroy_page(pt);
}

/* Setup function */
@@ -2871,16 +2855,8 @@ static int serverworks_create_page_map(s
return -ENOMEM;
}
SetPageReserved(virt_to_page(page_map->real));
- CACHE_FLUSH();
- page_map->remapped = ioremap_nocache(virt_to_phys(page_map->real),
- PAGE_SIZE);
- if (page_map->remapped == NULL) {
- ClearPageReserved(virt_to_page(page_map->real));
- free_page((unsigned long) page_map->real);
- page_map->real = NULL;
- return -ENOMEM;
- }
- CACHE_FLUSH();
+ change_page_attr(virt_to_page(page_map->real), PAGE_KERNEL_NOCACHE);
+ page_map->remapped = page_map->real;

for(i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
page_map->remapped[i] = agp_bridge.scratch_page;
@@ -2891,7 +2867,6 @@ static int serverworks_create_page_map(s

static void serverworks_free_page_map(serverworks_page_map *page_map)
{
- iounmap(page_map->remapped);
ClearPageReserved(virt_to_page(page_map->real));
free_page((unsigned long) page_map->real);
}
diff -burpN -X ../KDIFX -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable-2level.h linux-2.4.19pre9-work/include/asm-i386/pgtable-2level.h
--- linux-2.4.19pre9/include/asm-i386/pgtable-2level.h Thu Jul 26 22:40:32 2001
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable-2level.h Thu Jun 13 16:52:53 2002
@@ -40,6 +40,8 @@ static inline int pgd_present(pgd_t pgd)
* hook is made available.
*/
#define set_pte(pteptr, pteval) (*(pteptr) = pteval)
+#define set_pte_atomic(pteptr, pteval) (*(pteptr) = pteval)
+
/*
* (pmds are folded into pgds so this doesnt get actually called,
* but the define is needed for a generic inline function.)
diff -burpN -X ../KDIFX -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable-3level.h linux-2.4.19pre9-work/include/asm-i386/pgtable-3level.h
--- linux-2.4.19pre9/include/asm-i386/pgtable-3level.h Thu Jul 26 22:40:32 2001
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable-3level.h Thu Jun 13 17:15:14 2002
@@ -53,6 +53,9 @@ static inline void set_pte(pte_t *ptep,
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pgd(pgdptr,pgdval) \
set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))
+#define set_pte_atomic(pteptr,pteval) \
+ set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
+

/*
* Pentium-II erratum A13: in PAE mode we explicitly have to flush
diff -burpN -X ../KDIFX -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable.h linux-2.4.19pre9-work/include/asm-i386/pgtable.h
--- linux-2.4.19pre9/include/asm-i386/pgtable.h Mon Jun 3 21:15:18 2002
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable.h Thu Jun 13 20:27:09 2002
@@ -347,6 +347,9 @@ static inline pte_t pte_modify(pte_t pte
#define pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define swp_entry_to_pte(x) ((pte_t) { (x).val })

+struct page;
+int change_page_attr(struct page *, pgprot_t prot);
+
#endif /* !__ASSEMBLY__ */

/* Needs to be defined here and not in linux/mm.h, as it is arch dependent */


2002-06-16 16:48:06

by Andi Kleen

Subject: Re: another new version of pageattr caching conflict fix for 2.4

On Sun, Jun 16, 2002 at 04:08:49AM -0600, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> > On Fri, Jun 14, 2002 at 12:31:33PM -0500, Andrea Arcangeli wrote:
> > > On Fri, Jun 14, 2002 at 06:13:28PM +0200, Andi Kleen wrote:
> > > > +#ifdef CONFIG_HIGHMEM
> > > > + /* Hopefully not mapped anywhere else. */
> > > > + if (page >= highmem_start_page)
> > > > + return 0;
> > > > +#endif
> > >
> > > there's no hope here. If you don't want to code it right because nobody
> > > is exercising such path and flush both any per-cpu kmap-atomic slot and
> > > the page->virtual, please place a BUG() there or any more graceful
> > > failure notification.
> >
> > Ok done.
> >
> > I also removed the DRM change because it was buggy. I'm not 100% sure
> > yet that it is even a problem. Needs more investigation.
> >
> > BTW there is another corner case that was overlooked:
> > mmap of /dev/kmem, /proc/kcore, /dev/mem. To handle this correctly
> > the mmap functions would need to always walk the kernel page table
> > and force the correct caching attribute in the user mapping.
> > But what to do when there is already an existing mapping to user space?
>
> Don't allow the change_page_attr if page->count > 1 is an easy solution,
> and it probably suffices. Beyond that anything mmaped can be found

Erm, what should it do then? Fail the AGP map ?

> by walking into the backing address space, and then through the
> vm_area_structs found with i_mmap && i_mmap_shared. Of course the
> vm_area_structs may possibly need to break because of the multiple
> page protections.

I know that, but it seems rather racy, expensive and complicated.

>
> > > > +int change_page_attr(struct page *page, int numpages, pgprot_t prot)
> > > > +{
> > >
> > > this API is not the best, again I would recommend something on these lines:
> >
> > Hmm: I would really prefer to do the allocation in the caller.
> > If you want your API you could just do it as a simple wrapper to
> > the existing function (with the new numpages it would even be
> > efficient)
>
> > Using pgprot_t to convey the cacheability of a page appears to be an
> abuse. At the very least we need a PAGE_CACHE_MASK, to find just
> the cacheability attributes.

change_page_attr is more than just cacheability. You can use it e.g.
to write-protect kernel pages (if you wanted to do that for some
reason).

The current one is fine with PAGE_KERNEL_NOCACHE; you could also
define PAGE_KERNEL_WRITECOMBINING if you wanted.
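
(A sketch of what such a definition could look like in
include/asm-i386/pgtable.h, following the MAKE_GLOBAL(__PAGE_KERNEL |
_PAGE_PWT) hint from the original announcement; the name is made up:

        #define PAGE_KERNEL_WRITECOMBINING MAKE_GLOBAL(__PAGE_KERNEL | _PAGE_PWT)

With the default PAT this gives write-through; the PWT bit only selects
write combining on CPUs whose PAT has been reprogrammed accordingly.)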


> And we should really consider using the other cacheability page
> attributes on x86, not just cache enable/disable. Using just mtrr's
> is limiting in that you don't always have enough of them, and that
> sometimes valid ram is mapped uncacheable, because of the mtrr
> alignment restrictions.

Exposing it to user space in a more general way is tricky because
you need to avoid aliases with different caching attributes and also
change the kernel map as needed (it would surely be an interesting
project, but definitely requires more work).

You can already use WT. DRM does that already in fact.
Newer Intel/AMD CPUs allow setting up a more general PAT table with some
more modes, but to be honest I don't see the point of most of them,
except perhaps WC. Unfortunately there is no 'weak ordering like alpha/sparc64'
mode that could be used in the kernel :-)

-Andi

2002-06-16 18:00:51

by Eric W. Biederman

Subject: Re: another new version of pageattr caching conflict fix for 2.4

Andi Kleen <[email protected]> writes:

> On Sun, Jun 16, 2002 at 04:08:49AM -0600, Eric W. Biederman wrote:
> >
> > Don't allow the change_page_attr if page->count > 1 is an easy solution,
> > and it probably suffices. Beyond that anything mmaped can be found
>
> Erm, what should it do then? Fail the AGP map ?

Why not. If user-space has already mapped the memory one way, turning
around and using it another way is a problem. If the memory is
dynamically allocated for AGP I don't even see how this case
could occur.
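
Something like this at the top of change_page_attr() would do it (a
sketch of the idea, untested; the error code is an arbitrary pick):

        /* refuse pages that are visibly referenced elsewhere */
        if (atomic_read(&page->count) > 1)
                return -EBUSY;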

But the mmap case still has to apply the general cacheability of the
page from the kernel page tables (or wherever it is cached), if you
do the operation in the opposite order.
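
I.e. roughly this in the mmap path (a hypothetical sketch; it assumes
the patch's static lookup_address() were exported for the purpose):

        /* inherit the caching bits of the kernel linear mapping
           into the user mapping */
        pte_t *kpte = lookup_address((unsigned long)page_address(page));
        pgprot_val(vma->vm_page_prot) |=
                pte_val(*kpte) & (_PAGE_PCD | _PAGE_PWT);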

> > by walking into the backing address space, and then through the
> > vm_area_structs found with i_mmap && i_mmap_shared. Of course the
> > vm_area_structs may possibly need to break because of the multiple
> > page protections.
>
> I know that, but it seems rather racy, expensive and complicated.
>
> >
> > > > > +int change_page_attr(struct page *page, int numpages, pgprot_t prot)
> > > > > +{
> > > >
> > > > this API is not the best, again I would recommend something on these lines:
> > >
> > > Hmm: I would really prefer to do the allocation in the caller.
> > > If you want your API you could just do it as a simple wrapper to
> > > the existing function (with the new numpages it would even be
> > > efficient)
> >
> > Using pgprot_t to convey the cacheability of a page appears to be an
> > abuse. At the very least we need a PAGE_CACHE_MASK, to find just
> > the cacheability attributes.
>
> change_page_attr is more than just cacheability. You can use it e.g.
> to write-protect kernel pages (if you wanted to do that for some
> reason).
> The current one is fine with PAGE_KERNEL_NOCACHE; you could also
> define PAGE_KERNEL_WRITECOMBINING if you wanted.
>

Combining different operations like modifying cacheability and the
page protections worries me. Unless we name the operation something
like change_kernel_page_attr(), in which case we are only worrying
about a subset of the problem. Which it appears we are.

> > And we should really consider using the other cacheability page
> > attributes on x86, not just cache enable/disable. Using just mtrr's
> > is limiting in that you don't always have enough of them, and that
> > sometimes valid ram is mapped uncacheable, because of the mtrr
> > alignment restrictions.
>
> Exposing it to user space in a more general way is tricky because
> you need to avoid aliases with different caching attributes and also
> change the kernel map as needed (would be surely an interesting
> project, but definitely requires more work)

The kernel already exposes through /proc/mtrr the ability to
arbitrarily change the caching of pages. And since this is a case
of physical aliasing we need to make certain the mtrr's don't
conflict. So much of this is exposed to user space already.

> You can already use WT. DRM does that already in fact.
> Newer Intel/AMD CPUs allow setting up a more general PAT table with some
> more modes, but to be honest I don't see the point of most of them,
> except perhaps WC. Unfortunately there is no 'weak ordering like alpha/sparc64'
> mode that could be used in the kernel :-)

With the PAT table only write-back, write-combining and uncached interest
me. Given the number of BIOSes that don't set all of RAM to
write-back and the major performance penalty of running on uncached
RAM, having the kernel fix it would reduce a lot of headaches long
term.

Primarily I'm bringing up the connections because they may be worth
considering. For 2.4.19, unless the implementation is trivial, it
probably isn't wise to try for more than what is needed. For later
2.4.x kernels (2.4.20+) it is almost certainly worth doing something.

Eric

2002-06-16 18:43:16

by Andi Kleen

Subject: Re: another new version of pageattr caching conflict fix for 2.4

On Sun, Jun 16, 2002 at 11:50:51AM -0600, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> > On Sun, Jun 16, 2002 at 04:08:49AM -0600, Eric W. Biederman wrote:
> > >
> > > Don't allow the change_page_attr if page->count > 1 is an easy solution,
> > > and it probably suffices. Beyond that anything mmaped can be found
> >
> > Erm, what should it do then? Fail the AGP map ?
>
> Why not. If user-space has already mapped the memory one way, turning
> around and using it another way is a problem. If the memory is
> dynamically allocated for AGP I don't even see how this case
> could occur.

It's a random error. AGP randomly gets some page that is already mapped
by /dev/mem somewhere. There is nothing that it can do to avoid this.
As /dev/mem is root only and root can already cause much damage it is
probably not worth avoiding the particular breakage.

Now if someone used change_page_attr as a more general call and used
it on random memory that could be already mapped to user space in a more
legitimate way then he or she has to do much more work anyways (avoiding
alias etc.). Part of that more work would be to check the page count.
I don't think it should be done by the low level routine.


>
> The kernel already exposes through /proc/mtrr the ability to
> arbitrarily change the caching of pages. And since this is a case
> of physical aliasing we need to make certain the mtrr's don't
> conflict. So much of this is already exposed to user space already.

MTRRs work on physical, not virtual memory, so they have no aliasing
issues.

> > You can already use WT. DRM does that already in fact.
> > Newer Intel/AMD CPUs allow setting up a more general PAT table with some
> > more modes, but to be honest I don't see the point of most of them,
> > except perhaps WC. Unfortunately there is no 'weak ordering like alpha/sparc64'
> > mode that could be used in the kernel :-)
>
> > With the PAT table only write-back, write-combining and uncached interest
> > me. Given the number of BIOSes that don't set all of RAM to
> > write-back and the major performance penalty of running on uncached
> > RAM, having the kernel fix it would reduce a lot of headaches long
> > term.

Fixing the MTRRs is fine, but it is really outside the scope of my patch.
Just changing the kernel map wouldn't be enough to fix wrong MTRRs,
because it wouldn't cover highmem.

-Andi

2002-06-16 20:06:52

by Eric W. Biederman

Subject: Re: another new version of pageattr caching conflict fix for 2.4

Andi Kleen <[email protected]> writes:

> On Sun, Jun 16, 2002 at 11:50:51AM -0600, Eric W. Biederman wrote:
> > Andi Kleen <[email protected]> writes:
> >
> > > On Sun, Jun 16, 2002 at 04:08:49AM -0600, Eric W. Biederman wrote:
> > > >
> > > > Don't allow the change_page_attr if page->count > 1 is an easy solution,
> > > > and it probably suffices. Beyond that anything mmaped can be found
> > >
> > > Erm, what should it do then? Fail the AGP map ?
> >
> > Why not. If user-space has already mapped the memory one way, turning
> > around and using it another way is a problem. If the memory is
> > dynamically allocated for AGP I don't even see how this case
> > could occur.
>
> It's a random error. AGP randomly gets some page that is already mapped
> by /dev/mem somewhere. There is nothing that it can do to avoid this.
> As /dev/mem is root only and root can already cause much damage it is
> probably not worth avoiding the particular breakage.

Sorry my confusion I thought mmapping /dev/mem increased the page count,
in which case it would be easy to detect, and avoid. I just looked
and it doesn't so there appears no trivial way to detect or handle
that case.

> Now if someone used change_page_attr as a more general call and used
> it on random memory that could be already mapped to user space in a more
> legitimate way then he or she has to do much more work anyways (avoiding
> alias etc.). Part of that more work would be to check the page count.
> I don't think it should be done by the low level routine.
>
>
> >
> > The kernel already exposes through /proc/mtrr the ability to
> > arbitrarily change the caching of pages. And since this is a case
> > of physical aliasing we need to make certain the mtrr's don't
> > conflict. So much of this is already exposed to user space already.
>
> MTRRs work on physical, not virtual memory, so they have no aliasing
> issues.

Doesn't the AGP aperture cause a physical alias? Leading to the same
strange problems as if the agp aperture was marked write-back, and the
memory was marked uncacheable. My gut impression is to just make the
agp aperture write-back cacheable, and then we don't have to change
the kernel page table at all. Unfortunately I don't expect the host
bridge with the memory and agp controllers to like that mode,
especially as there are physical aliasing issues.

> > > You can already use WT. DRM does that already in fact.
> > > Newer Intel/AMD CPUs allow setting up a more general PAT table with some
> > > more modes, but to be honest I don't see the point of most of them,
> > > except perhaps WC. Unfortunately there is no 'weak ordering like
> > > alpha/sparc64' mode that could be used in the kernel :-)
> >
> > With the PAT table only write-back, write-combining and uncached interest
> > me. Given the number of BIOSes that don't set all of RAM to
> > write-back and the major performance penalty of running on uncached
> > RAM, having the kernel fix it would reduce a lot of headaches long
> > term.
>
> Fixing the MTRRs is fine, but it is really outside the scope of my patch.
> Just changing the kernel map wouldn't be enough to fix wrong MTRRs,
> because it wouldn't cover highmem.

My preferred fix is to use PAT, to override the buggy mtrrs. Which
brings up the same aliasing issues. Which makes it related but
outside the scope of the problem.

Eric

2002-06-16 23:37:33

by Andi Kleen

Subject: Re: another new version of pageattr caching conflict fix for 2.4

> > MTRRs work on physical, not virtual memory, so they have no aliasing
> > issues.
>
> Doesn't the AGP aperture cause a physical alias? Leading to strange

Yes. That's what this patch is all about.

> the same problems if the agp aperture was marked write-back, and the

AGP aperture is uncacheable, not write-back.

> memory was marked uncacheable. My gut impression is to just make the
> agp aperture write-back cacheable, and then we don't have to change
> the kernel page table at all. Unfortunately I don't expect the host

That would violate the AGP specification.

> bridge with the memory and agp controllers to like that mode,
> especially as there are physical aliasing issues.

exactly.

>
> > Fixing the MTRRs is fine, but it is really outside the scope of my patch.
> > Just changing the kernel map wouldn't be enough to fix wrong MTRRs,
> > because it wouldn't cover highmem.
>
> My preferred fix is to use PAT, to override the buggy mtrrs. Which
> brings up the same aliasing issues. Which makes it related but
> outside the scope of the problem.

I don't follow you here. IMHO it is much easier to fix the MTRRs in the
MTRR driver for those rare buggy BIOS (if they exist - I've never seen one)
than to hack up all of memory management just to get the right bits set.
I see no disadvantage of using the MTRRs and it is a lot simpler than
PAT and pte bits.


-Andi

2002-06-17 00:08:40

by Albert D. Cahalan

Subject: Re: another new version of pageattr caching conflict fix for 2.4

Andi Kleen writes:

>> the same problems if the agp aperture was marked write-back, and the
>
> AGP aperture is uncacheable, not write-back.
>
>> memory was marked uncacheable. My gut impression is to just make the
>> agp aperture write-back cacheable, and then we don't have to change
>> the kernel page table at all. Unfortunately I don't expect the host
>
> That would violate the AGP specification.
>
>> bridge with the memory and agp controllers to like that mode,
>> especially as there are physical aliasing issues.
>
> exactly.

You can do whatever you want, as long as...

1. you have cache control instructions and use them
2. the bridge ignores the coherency protocol (no machine check)

Most likely you should make the AGP memory write-back
cacheable. This requires some care regarding cache lines,
but ought to be faster.

>>> Fixing the MTRRs is fine, but it is really outside the scope of my patch.
>>> Just changing the kernel map wouldn't be enough to fix wrong MTRRs,
>>> because it wouldn't cover highmem.
>>
>> My preferred fix is to use PAT, to override the buggy mtrrs. Which
>> brings up the same aliasing issues. Which makes it related but
>> outside the scope of the problem.
>
> I don't follow you here. IMHO it is much easier to fix the MTRRs in the
> MTRR driver for those rare buggy BIOS (if they exist - I've never seen one)
> than to hack up all of memory management just to get the right bits set.
> I see no disadvantage of using the MTRRs and it is a lot simpler than
> PAT and pte bits.

For non-x86 one must "hack up all of memory management" anyway.

Example: There aren't any MTRRs on the PowerPC, but every page
has 4 memory type bits. It's not OK to map something more than
one way at the same time. Large "pages" (256 MB each) are used
to cover all of non-highmem physical memory.

2002-06-17 04:16:24

by Eric W. Biederman

Subject: Re: another new version of pageattr caching conflict fix for 2.4

Andi Kleen <[email protected]> writes:

> > > MTRRs work on physical, not virtual memory, so they have no aliasing
> > > issues.
> >
> > Doesn't the AGP aperture cause a physical alias? Leading to strange
>
> Yes. That's what this patch is all about.
>
> > the same problems if the agp aperture was marked write-back, and the
>
> AGP aperture is uncacheable, not write-back.
>
> > memory was marked uncacheable. My gut impression is to just make the
> > agp aperture write-back cacheable, and then we don't have to change
> > the kernel page table at all. Unfortunately I don't expect the host
>
> That would violate the AGP specification.
>
> > bridge with the memory and agp controllers to like that mode,
> > especially as there are physical aliasing issues.
>
> exactly.

All of which is an AGP design bug: if you want performance you don't
cripple your caches. But we have to live with it, so there's no use
tilting at windmills.

> > > Fixing the MTRRs is fine, but it is really outside the scope of my patch.
> > > Just changing the kernel map wouldn't be enough to fix wrong MTRRs,
> > > because it wouldn't cover highmem.
> >
> > My preferred fix is to use PAT, to override the buggy mtrrs. Which
> > brings up the same aliasing issues. Which makes it related but
> > outside the scope of the problem.
>
> I don't follow you here. IMHO it is much easier to fix the MTRRs in the
> MTRR driver for those rare buggy BIOS (if they exist - I've never seen one)

I've heard of several and dealt with one. The problem was essentially that
they ran out of mtrrs; the edges of free memory were too rough.

> than to hack up all of memory management just to get the right bits set.
> I see no disadvantage of using the MTRRs and it is a lot simpler than
> PAT and pte bits.

There are not enough MTRRs. And using the PAT bits to say memory is
write-back can be a constant. It just takes a little work to get in
place. Plus the weird assortment of consistency issues.

For most purposes PAT makes memory easier to deal with because you
can be as fine grained as you like. The difficulty is that you must
have consistent attributes across all of your virtual mappings.

The other case where PAT should help is when a machine has multiple cards
that can benefit from write-combining. Currently running out of mtrrs
is a problem.

Eric

2002-06-17 06:54:39

by Andrea Arcangeli

Subject: Re: another new version of pageattr caching conflict fix for 2.4

On Sun, Jun 16, 2002 at 01:56:51PM -0600, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> > On Sun, Jun 16, 2002 at 11:50:51AM -0600, Eric W. Biederman wrote:
> > > Andi Kleen <[email protected]> writes:
> > >
> > > > On Sun, Jun 16, 2002 at 04:08:49AM -0600, Eric W. Biederman wrote:
> > > > >
> > > > > Don't allow the change_page_attr if page->count > 1 is an easy solution,
> > > > > and it probably suffices. Beyond that anything mmaped can be found
> > > >
> > > > Erm, what should it do then? Fail the AGP map ?
> > >
> > > Why not. If user-space has already mapped the memory one way, turning
> > > around and using it another way is a problem. If the memory is
> > > dynamically allocated for AGP I don't even see how this case
> > > could occur.
> >
> > It's a random error. AGP randomly gets some page that is already mapped
> > by /dev/mem somewhere. There is nothing that it can do to avoid this.
> > As /dev/mem is root only and root can already cause much damage it is
> > probably not worth avoiding the particular breakage.
>
> Sorry my confusion I thought mmapping /dev/mem increased the page count,
> in which case it would be easy to detect, and avoid. I just looked
> and it doesn't so there appears no trivial way to detect or handle
> that case.

yep, because remap_page_range is also used on things that don't
have a page structure in the first place. We cannot trap /dev/mem
unless additional infrastructure is developed for it. I would ignore it;
/dev/mem should be necessary only for non-ram pages, and so it should
never be a problem. If it's a problem it falls into Al's category of
"doctor, it hurts".

> My preferred fix is to use PAT, to override the buggy mtrrs. Which
> brings up the same aliasing issues. Which makes it related but
> outside the scope of the problem.

I don't see why you keep wondering about cacheability attributes. We
simply need to avoid the aliasing, then you don't care about the cache
attributes. It is a matter of the agp code if it wants write-back,
write-combining or uncached then. The corruption happens because of the
physical aliases; if you remove the physical aliases in the cache then
you don't need to care about the cache-type.

So my suggestion, besides changing the API to the device drivers so that
the arch returns an array of pages ready to handle physical aliasing (so
moving the allocation/freeing into the arch/ section), is to mark the pte
of the aliased 4k pages as invalid. No need to care about cacheability
then. Just mark the pte invalid, flush the tlb, and flush the caches.
When you drop pages from the aperture, drop the agp mapping, flush the
tlb for the agp mapping, flush caches and re-enable the original
pagetable, possibly making it again a largepage. No need to care about
mtrr or pte cacheability attributes this way, and even better this way
we'll oops any bogus user that is trying to access the page via the
kernel direct mapping while there's a physical alias; such a guy should
use the agp virtual mapping instead, not the kernel direct mapping.
Marking the pte invalid is even simpler than marking it uncacheable,
so it's very simple to change Andi's patch that way.
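
With Andi's current interface that would be about this much (a sketch,
untested; __pgprot(0) clears _PAGE_PRESENT):

        /* page goes into the aperture: drop it from the kernel direct
           mapping so any stray access through it oopses */
        change_page_attr(page, 1, __pgprot(0));

        /* page leaves the aperture: re-enable the write-back mapping,
           possibly merging back into a largepage */
        change_page_attr(page, 1, PAGE_KERNEL);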

Andrea

2002-06-17 15:56:11

by Eric W. Biederman

Subject: Re: another new version of pageattr caching conflict fix for 2.4

Andrea Arcangeli <[email protected]> writes:

> On Sun, Jun 16, 2002 at 01:56:51PM -0600, Eric W. Biederman wrote:
> > Andi Kleen <[email protected]> writes:
> >
> > > On Sun, Jun 16, 2002 at 11:50:51AM -0600, Eric W. Biederman wrote:
> > > > Andi Kleen <[email protected]> writes:
> > > >
> > > > > On Sun, Jun 16, 2002 at 04:08:49AM -0600, Eric W. Biederman wrote:
> > > > > >
> > > > > > Don't allow the change_page_attr if page->count > 1 is an easy solution,
> > > > > > and it probably suffices. Beyond that anything mmaped can be found
> > > > >
> > > > > Erm, what should it do then? Fail the AGP map ?
> > > >
> > > > Why not. If user-space has already mapped the memory one way, turning
> > > > around and using it another way is a problem. If the memory is
> > > > dynamically allocated for AGP I don't even see how this case
> > > > could occur.
> > >
> > > It's a random error. AGP randomly gets some page that is already mapped
> > > by /dev/mem somewhere. There is nothing that it can do to avoid this.
> > > As /dev/mem is root only and root can already cause much damage it is
> > > probably not worth avoiding the particular breakage.
> >
> > Sorry my confusion I thought mmapping /dev/mem increased the page count,
> > in which case it would be easy to detect, and avoid. I just looked
> > and it doesn't so there appears no trivial way to detect or handle
> > that case.
>
> yep, because remap_page_range is also used on things that don't
> have a page structure in the first place. We cannot trap /dev/mem
> unless additional infrastructure is developed for it. I would ignore it;
> /dev/mem should be necessary only for non-ram pages, and so it should
> never be a problem. If it's a problem it falls into Al's category of
> "doctor, it hurts".

Which is fine as long as we recognize it will hurt if we use it.

> > My preferred fix is to use PAT, to override the buggy mtrrs. Which
> > brings up the same aliasing issues. Which makes it related but
> > outside the scope of the problem.
>
> I don't see why you keep wondering about cacheability attributes.

Because this patch is nipping at the edge of a more general problem,
and the practical purpose I can serve is to stir up ideas.

If the mtrrs were a good flexible general purpose solution we could just
mark all of the affected memory as uncacheable in the mtrrs, and not
even modify the page tables. As it is, this case is one of several
proofs that the mtrrs are significantly flexibility-impaired.

Beyond that, having cacheability attributes in the page table is not
limited to x86, and we are coming close to building the infrastructure
needed to handle this in general. Not that I advocate we do anything
except consider the case at this juncture, but having had the
conversation now lays the foundation for developing code that
handles the case of virtual aliases having the same cacheability
attributes.

> We
> simply need to avoid the aliasing, then you don't care about the cache
> attributes.

That we should avoid physical aliasing I agree. That is the real
problem even though the summary provided by AMD a little while ago
got this wrong.

> It is a matter of the agp code if it wants write-back,
> write-combining or uncached then. The corruption happens because of the
> physical aliases; if you remove the physical aliases in the cache then
> you don't need to care about the cache-type.

Potentially. If you start caching the data it will only work if the
agp controller participates in the cache-consistency protocol.

> So my suggestion, besides changing the API to the device drivers so that
> the arch returns an array of pages ready to handle physical aliasing (so
> moving the allocation/freeing into the arch/ section), is to mark the pte
> of the aliased 4k pages as invalid. No need to care about cacheability
> then. Just mark the pte invalid, flush the tlb, and flush the caches.
> When you drop pages from the aperture, drop the agp mapping, flush the
> tlb for the agp mapping, flush caches and re-enable the original
> pagetable, possibly making it again a largepage. No need to care about
> mtrr or pte cacheability attributes this way, and even better this way
> we'll oops any bogus user that is trying to access the page via the
> kernel direct mapping while there's a physical alias; such a guy should
> use the agp virtual mapping instead, not the kernel direct mapping.
> Marking the pte invalid is even simpler than marking it uncacheable,
> so it's very simple to change Andi's patch that way.

A relatively simple hook into the /dev/mem infrastructure would
probably be worth it as well. If you are going to oops the kernel,
giving user space an error is polite.

Eric

2002-06-14 01:03:40

by Benjamin LaHaise

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Thu, Jun 13, 2002 at 10:15:33PM +0200, Andi Kleen wrote:
> Thanks to Ben LaHaise who found some problems in the original version.
>
> I will probably submit this version for 2.4 unless someone finds a grave
> problem in this version.

This version is missing a few of the fixes included in my version:
it doesn't properly flush global tlb entries, or update the page
tables of processes which have copied 4MB page table entries into
their page tables. Also, the revert_page function must be called
before the tlb flush and free the page after the tlb flush, or
else tlb prefetching on the P4 can cache stale pmd pointers. I'd
suggest using the -C0 version which already has those fixes.
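
In other words the safe sequence is (a sketch; large_pte stands for the
reassembled large-page entry):

        set_pmd_pte(kpte, address, large_pte);  /* 1. revert the pmd */
        flush_tlb_all();                        /* 2. flush on all CPUs */
        __free_page(kpte_page);                 /* 3. free after the flush */

Freeing before the flush lets a P4 TLB prefetch walk the stale pmd into
a page that has already been reused.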

-ben

2002-06-14 01:24:34

by Andi Kleen

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Thu, Jun 13, 2002 at 09:03:39PM -0400, Benjamin LaHaise wrote:
> On Thu, Jun 13, 2002 at 10:15:33PM +0200, Andi Kleen wrote:
> > Thanks to Ben LaHaise who found some problems in the original version.
> >
> > I will probably submit this version for 2.4 unless someone finds a grave
> > problem in this version.
>
> This version is missing a few of the fixes included in my version:
> it doesn't properly flush global tlb entries, or update the page

Sure it does. INVLPG (__flush_tlb_one) flushes global entries.

> tables of processes which have copied 4MB page table entries into
> their page tables. Also, the revert_page function must be called

That's done in set_pmd_pte.

> before the tlb flush and free the page after the tlb flush, or
> else tlb prefetching on the P4 can cache stale pmd pointers. I'd

Fair point. Hmm, I had that correct, but somehow it got messed up again.

Another thing that probably needs to be added is that DRM needs
some change_page_attr() calls too, because it does private AGP mappings.

-Andi

2002-06-14 01:37:24

by Benjamin LaHaise

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 03:24:29AM +0200, Andi Kleen wrote:
> > This version is missing a few of the fixes included in my version:
> > it doesn't properly flush global tlb entries, or update the page
>
> Sure it does. INVLPG (__flush_tlb_one) flushes global entries.

It failed to do so in my testing. The only safe way of flushing
global entries is to disable them in cr4 before attempting the
tlb flush.

> > tables of processes which have copied 4MB page table entries into
> > their page tables. Also, the revert_page function must be called
>
> That's done in set_pmd_page.

Doh. I should consume coffee after waking up but before posting...

> > before the tlb flush and free the page after the tlb flush, or
> > else tlb prefetching on the P4 can cache stale pmd pointers. I'd
>
> Fair point. Hmm, I had that correct, but somehow it got messed up again.
>
> Another thing that probably needs to be added is that DRM needs
> some change_page_attr() calls too, because it does private AGP mappings.

I'd still prefer to get the typing of the page table manipulations
correct. Also, the code can prematurely revert to 4MB mappings if
the caller does anything awry, like changing the attributes back to
standard attributes on the same page twice. The usage of #ifdef __i386__
is inconsistent, too.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-06-14 04:00:26

by Andrea Arcangeli

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Thu, Jun 13, 2002 at 09:37:24PM -0400, Benjamin LaHaise wrote:
> On Fri, Jun 14, 2002 at 03:24:29AM +0200, Andi Kleen wrote:
> > > This version is missing a few of the fixes included in my version:
> > > it doesn't properly flush global tlb entries, or update the page
> >
> > Sure it does. INVLPG (__flush_tlb_one) flushes global entries.
>
> It failed to do so in my testing. The only safe way of flushing

just a quick comment on this bit: the x86 specs state invlpg must flush
global entries from the tlb too; see also kmap_prot as a practical
reference.

Andrea

2002-06-14 04:17:26

by Benjamin LaHaise

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 06:00:25AM +0200, Andrea Arcangeli wrote:
> just a quick comment on this bit: the x86 specs state invlpg must flush
> global entries from the tlb too; see also kmap_prot as a practical
> reference.

It's not the 4KB pages that I'm worried about so much as the 4MB pages.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-06-14 04:28:07

by Andi Kleen

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 12:17:26AM -0400, Benjamin LaHaise wrote:
> On Fri, Jun 14, 2002 at 06:00:25AM +0200, Andrea Arcangeli wrote:
> > just a quick comment on this bit: the x86 specs state invlpg must flush
> > global entries from the tlb too; see also kmap_prot as a practical
> > reference.
>
> It's not the 4KB pages that I'm worried about so much as the 4MB pages.

Both AMD x86-64 and Intel IA32 documentation state that INVLPG flushes
global TLB entries. The first version of change_page_attr did in fact use
__flush_tlb_all, but I changed it after checking the docs.

-Andi

2002-06-14 04:28:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: New version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 12:17:26AM -0400, Benjamin LaHaise wrote:
> On Fri, Jun 14, 2002 at 06:00:25AM +0200, Andrea Arcangeli wrote:
> > just a quick comment on this bit: the x86 specs state invlpg must flush
> > global entries from the tlb too; see also kmap_prot as a practical
> > reference.
>
> It's not the 4KB pages that I'm worried about so much as the 4MB pages.

I don't recall any spec mentioning that it doesn't work on large pages,
so I thought it was safe to assume it had to work there too. I don't see
why it shouldn't, but it is certainly best to play safe and double check.

IIRC the only bit to care about with the large pages is to flush with
the virtual address in the first half of the page, to avoid falling into
an erratum.

Andrea

2002-06-14 15:28:50

by Benjamin LaHaise

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 06:27:54AM +0200, Andi Kleen wrote:
> Both AMD x86-64 and Intel IA32 documentation state that INVLPG flushes
> global TLB entries. The first version of change_page_attr did in fact use
> __flush_tlb_all, but I changed it after checking the docs.

As Andrea pointed out, there is an erratum whereby 4MB pages aren't flushed
on the Athlon. If you mask off the low bits of the address for flushing,
that should fix the problem, and it sounds like a plausible explanation for
the failure I saw.
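
I.e. something like this, with LARGE_PAGE_MASK as defined in the patch
(a sketch; the followup version of the patch went with __flush_tlb_all
instead):

        __flush_tlb_one(address & LARGE_PAGE_MASK);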

-ben

ps. s/La Haise/LaHaise would be nice, too.
--
"You will be reincarnated as a toad; and you will be much happier."

2002-06-14 16:13:42

by Andi Kleen

Subject: Re: New version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 11:28:49AM -0400, Benjamin LaHaise wrote:
> On Fri, Jun 14, 2002 at 06:27:54AM +0200, Andi Kleen wrote:
> > Both AMD x86-64 and Intel IA32 documentation state that INVLPG flushes
> > global TLB entries. The first version of change_page_attr did in fact use
> > __flush_tlb_all, but I changed it after checking the docs.
>
> As Andrea pointed out, there is an errata whereby 4MB pages aren't flushed
> on the Athlon. If you mask off the low bits of the address for flushing,
> that should fix the problem, and sounds like a plausible explanation for
> the failure I saw.

Ok. I'm using __flush_tlb_all for now.

Here is the latest version; it also fixes DRM and includes some more cleanups.

Review please.

-Andi


diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/arch/i386/mm/Makefile linux-2.4.19pre9-work/arch/i386/mm/Makefile
--- linux-2.4.19pre9/arch/i386/mm/Makefile Fri Dec 29 23:07:20 2000
+++ linux-2.4.19pre9-work/arch/i386/mm/Makefile Wed Jun 5 02:35:11 2002
@@ -9,6 +9,7 @@

O_TARGET := mm.o

-obj-y := init.o fault.o ioremap.o extable.o
+obj-y := init.o fault.o ioremap.o extable.o pageattr.o
+export-objs := pageattr.o

include $(TOPDIR)/Rules.make
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/arch/i386/mm/pageattr.c linux-2.4.19pre9-work/arch/i386/mm/pageattr.c
--- linux-2.4.19pre9/arch/i386/mm/pageattr.c Thu Jan 1 01:00:00 1970
+++ linux-2.4.19pre9-work/arch/i386/mm/pageattr.c Fri Jun 14 18:07:58 2002
@@ -0,0 +1,168 @@
+/*
+ * Copyright 2002 Andi Kleen, SuSE Labs.
+ * Thanks to Ben LaHaise for precious feedback.
+ */
+
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+#include <asm/processor.h>
+
+/* Should move most of this stuff into the appropriate includes */
+#define PAGE_LARGE (_PAGE_PSE|_PAGE_PRESENT)
+#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1))
+#define LARGE_PAGE_SIZE (1UL << PMD_SHIFT)
+
+static inline pte_t *lookup_address(unsigned long address)
+{
+ pmd_t *pmd;
+ pgd_t *pgd = pgd_offset(&init_mm, address);
+
+ if ((pgd_val(*pgd) & PAGE_LARGE) == PAGE_LARGE)
+ return (pte_t *)pgd;
+ pmd = pmd_offset(pgd, address);
+ if ((pmd_val(*pmd) & PAGE_LARGE) == PAGE_LARGE)
+ return (pte_t *)pmd;
+
+ return pte_offset(pmd, address);
+}
+
+static struct page *split_large_page(unsigned long address, pgprot_t prot)
+{
+ int i;
+ unsigned long addr;
+ struct page *base = alloc_pages(GFP_KERNEL, 0);
+ pte_t *pbase;
+ if (!base)
+ return NULL;
+ address = __pa(address);
+ addr = address & LARGE_PAGE_MASK;
+ pbase = (pte_t *)page_address(base);
+ for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
+ pbase[i] = mk_pte_phys(addr,
+ addr == address ? prot : PAGE_KERNEL);
+ }
+ return base;
+}
+
+static void flush_kernel_map(void * address)
+{
+ /* Could use CLFLUSH here if the CPU supports it (Hammer,P4) */
+ if (boot_cpu_data.x86_model >= 4)
+ asm volatile("wbinvd":::"memory");
+ __flush_tlb_all();
+}
+
+static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+{
+ set_pte_atomic(kpte, pte); /* change init_mm */
+#ifndef CONFIG_X86_PAE
+ {
+ struct list_head *l;
+ spin_lock(&mmlist_lock);
+ list_for_each(l, &init_mm.mmlist) {
+ struct mm_struct *mm = list_entry(l, struct mm_struct, mmlist);
+ pmd_t *pmd = pmd_offset(pgd_offset(mm, address), address);
+ set_pte_atomic((pte_t *)pmd, pte);
+ }
+ spin_unlock(&mmlist_lock);
+ }
+#endif
+}
+
+/* no more special protections in this 2/4MB area - revert to a
+ large page again. */
+static inline void revert_page(struct page *kpte_page, unsigned long address)
+{
+ pte_t *linear = (pte_t *)
+ pmd_offset(pgd_offset(&init_mm, address), address);
+ set_pmd_pte(linear, address,
+ mk_pte_phys(__pa(address & LARGE_PAGE_MASK),
+ __pgprot(_KERNPG_TABLE|_PAGE_PSE)));
+}
+
+/*
+ * Change the page attributes of an page in the linear mapping.
+ *
+ * This should be used when a page is mapped with a different caching policy
+ * than write-back somewhere - some CPUs do not like it when mappings with
+ * different caching policies exist. This changes the page attributes of the
+ * in kernel linear mapping too.
+ *
+ * The caller needs to ensure that there are no conflicting mappings elsewhere.
+ * This function only deals with the kernel linear map.
+ * When page is in highmem it must never be kmap'ed.
+ */
+static int
+__change_page_attr(struct page *page, pgprot_t prot, struct page **oldpage)
+{
+ pte_t *kpte;
+ unsigned long address;
+ struct page *kpte_page;
+
+#ifdef CONFIG_HIGHMEM
+ /* Hopefully not mapped anywhere else. */
+ if (page >= highmem_start_page)
+ return 0;
+#endif
+ address = (unsigned long)page_address(page);
+
+ kpte = lookup_address(address);
+ kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+ if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
+ if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ pte_t old = *kpte;
+ pte_t standard = mk_pte(page, PAGE_KERNEL);
+
+ set_pte_atomic(kpte, mk_pte(page, prot));
+ if (pte_same(old,standard))
+ atomic_inc(&kpte_page->count);
+ } else {
+ struct page *split = split_large_page(address, prot);
+ if (!split)
+ return -ENOMEM;
+ set_pmd_pte(kpte,address,mk_pte(split, PAGE_KERNEL));
+ }
+ } else if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
+ atomic_dec(&kpte_page->count);
+ }
+
+ if (test_bit(X86_FEATURE_PSE, &boot_cpu_data.x86_capability) &&
+ (atomic_read(&kpte_page->count) == 1)) {
+ *oldpage = kpte_page;
+ revert_page(kpte_page, address);
+ }
+ return 0;
+}
+
+int change_page_attr(struct page *page, int numpages, pgprot_t prot)
+{
+ int err = 0;
+ struct page *fpage;
+ int i;
+
+ down_write(&init_mm.mmap_sem);
+ for (i = 0; i < numpages; i++, page++) {
+ fpage = NULL;
+ err = __change_page_attr(page, prot, &fpage);
+ if (err)
+ break;
+ if (fpage || i == numpages-1) {
+ void *address = page_address(page);
+#ifdef CONFIG_SMP
+ smp_call_function(flush_kernel_map, address, 1, 1);
+#endif
+ flush_kernel_map(address);
+ if (fpage)
+ __free_page(fpage);
+ }
+ }
+ up_write(&init_mm.mmap_sem);
+ return err;
+}
+
+EXPORT_SYMBOL(change_page_attr);
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/drivers/char/agp/agpgart_be.c linux-2.4.19pre9-work/drivers/char/agp/agpgart_be.c
--- linux-2.4.19pre9/drivers/char/agp/agpgart_be.c Sun Jun 2 20:24:00 2002
+++ linux-2.4.19pre9-work/drivers/char/agp/agpgart_be.c Fri Jun 14 17:37:46 2002
@@ -397,7 +397,7 @@ int agp_unbind_memory(agp_memory * curr)
static void agp_generic_agp_enable(u32 mode)
{
struct pci_dev *device = NULL;
- u32 command, scratch, cap_id;
+ u32 command, scratch;
u8 cap_ptr;

pci_read_config_dword(agp_bridge.dev,
@@ -561,6 +561,7 @@ static int agp_generic_create_gatt_table
agp_bridge.current_size;
break;
}
+ temp = agp_bridge.current_size;
} else {
agp_bridge.aperture_size_idx = i;
}
@@ -578,23 +579,21 @@ static int agp_generic_create_gatt_table
}
table_end = table + ((PAGE_SIZE * (1 << page_order)) - 1);

- for (page = virt_to_page(table); page <= virt_to_page(table_end); page++)
+ page = virt_to_page(table);
+
+#ifdef __i386__
+ if (change_page_attr(page, 1<<page_order, PAGE_KERNEL_NOCACHE) < 0)
+ return -ENOMEM;
+#endif
+ for (; page <= virt_to_page(table_end); page++) {
SetPageReserved(page);
+ }

agp_bridge.gatt_table_real = (unsigned long *) table;
CACHE_FLUSH();
- agp_bridge.gatt_table = ioremap_nocache(virt_to_phys(table),
- (PAGE_SIZE * (1 << page_order)));
+ agp_bridge.gatt_table = agp_bridge.gatt_table_real;
CACHE_FLUSH();

- if (agp_bridge.gatt_table == NULL) {
- for (page = virt_to_page(table); page <= virt_to_page(table_end); page++)
- ClearPageReserved(page);
-
- free_pages((unsigned long) table, page_order);
-
- return -ENOMEM;
- }
agp_bridge.gatt_bus_addr = virt_to_phys(agp_bridge.gatt_table_real);

for (i = 0; i < num_entries; i++) {
@@ -651,7 +650,6 @@ static int agp_generic_free_gatt_table(v
* from the table.
*/

- iounmap(agp_bridge.gatt_table);
table = (char *) agp_bridge.gatt_table_real;
table_end = table + ((PAGE_SIZE * (1 << page_order)) - 1);

@@ -769,6 +767,13 @@ static unsigned long agp_generic_alloc_p
if (page == NULL) {
return 0;
}
+#ifdef __i386__
+ if (change_page_attr(page, 1, PAGE_KERNEL_NOCACHE) < 0) {
+ __free_page(page);
+ return 0;
+ }
+#endif
+
get_page(page);
LockPage(page);
atomic_inc(&agp_bridge.current_memory_agp);
@@ -785,6 +790,11 @@ static void agp_generic_destroy_page(uns
}

page = virt_to_page(pt);
+#ifdef __i386__
+ change_page_attr(page, 1, PAGE_KERNEL);
+#endif
+
+
put_page(page);
UnlockPage(page);
free_page((unsigned long) pt);
@@ -2241,17 +2251,15 @@ static int amd_create_page_map(amd_page_
if (page_map->real == NULL) {
return -ENOMEM;
}
- SetPageReserved(virt_to_page(page_map->real));
- CACHE_FLUSH();
- page_map->remapped = ioremap_nocache(virt_to_phys(page_map->real),
- PAGE_SIZE);
- if (page_map->remapped == NULL) {
- ClearPageReserved(virt_to_page(page_map->real));
- free_page((unsigned long) page_map->real);
- page_map->real = NULL;
+#ifdef __i386__
+ if (change_page_attr(virt_to_page(page_map->real),
+ 1, PAGE_KERNEL_NOCACHE)) {
+ free_page(page_map->real);
return -ENOMEM;
}
- CACHE_FLUSH();
+#endif
+ SetPageReserved(virt_to_page(page_map->real));
+ page_map->remapped = page_map->real;

for(i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
page_map->remapped[i] = agp_bridge.scratch_page;
@@ -2262,7 +2270,6 @@ static int amd_create_page_map(amd_page_

static void amd_free_page_map(amd_page_map *page_map)
{
- iounmap(page_map->remapped);
ClearPageReserved(virt_to_page(page_map->real));
free_page((unsigned long) page_map->real);
}
@@ -2744,27 +2751,22 @@ static void ali_cache_flush(void)

static unsigned long ali_alloc_page(void)
{
- struct page *page;
- u32 temp;
-
- page = alloc_page(GFP_KERNEL);
- if (page == NULL)
+ unsigned long p = agp_generic_alloc_page();
+ if (!p)
return 0;

- get_page(page);
- LockPage(page);
- atomic_inc(&agp_bridge.current_memory_agp);
-
+ /* probably not needed anymore */
global_cache_flush();

if (agp_bridge.type == ALI_M1541) {
+ u32 temp;
pci_read_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL, &temp);
pci_write_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL,
(((temp & ALI_CACHE_FLUSH_ADDR_MASK) |
- virt_to_phys(page_address(page))) |
+ virt_to_phys(p)) |
ALI_CACHE_FLUSH_EN ));
}
- return (unsigned long)page_address(page);
+ return p;
}

static void ali_destroy_page(unsigned long addr)
@@ -2786,11 +2788,7 @@ static void ali_destroy_page(unsigned lo
ALI_CACHE_FLUSH_EN));
}

- page = virt_to_page(pt);
- put_page(page);
- UnlockPage(page);
- free_page((unsigned long) pt);
- atomic_dec(&agp_bridge.current_memory_agp);
+ agp_generic_destroy_page(pt);
}

/* Setup function */
@@ -2870,17 +2868,15 @@ static int serverworks_create_page_map(s
if (page_map->real == NULL) {
return -ENOMEM;
}
- SetPageReserved(virt_to_page(page_map->real));
- CACHE_FLUSH();
- page_map->remapped = ioremap_nocache(virt_to_phys(page_map->real),
- PAGE_SIZE);
- if (page_map->remapped == NULL) {
- ClearPageReserved(virt_to_page(page_map->real));
- free_page((unsigned long) page_map->real);
- page_map->real = NULL;
+#ifdef __i386__
+ if (change_page_attr(virt_to_page(page_map->real), 1,
+ PAGE_KERNEL_NOCACHE) < 0) {
+ free_page(page_map->real);
return -ENOMEM;
}
- CACHE_FLUSH();
+#endif
+ SetPageReserved(virt_to_page(page_map->real));
+ page_map->remapped = page_map->real;

for(i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
page_map->remapped[i] = agp_bridge.scratch_page;
@@ -2891,7 +2887,6 @@ static int serverworks_create_page_map(s

static void serverworks_free_page_map(serverworks_page_map *page_map)
{
- iounmap(page_map->remapped);
ClearPageReserved(virt_to_page(page_map->real));
free_page((unsigned long) page_map->real);
}
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/drivers/char/drm/drm_vm.h linux-2.4.19pre9-work/drivers/char/drm/drm_vm.h
--- linux-2.4.19pre9/drivers/char/drm/drm_vm.h Wed Jun 5 18:08:30 2002
+++ linux-2.4.19pre9-work/drivers/char/drm/drm_vm.h Fri Jun 14 06:05:08 2002
@@ -531,6 +531,11 @@ int DRM(mmap)(struct file *filp, struct
vma->vm_flags |= VM_IO; /* not in core dump */
}
offset = DRIVER_GET_REG_OFS();
+#ifdef __i386__
+ change_page_attr(vma->vm_pgoff,
+ (vma->vm_end - vma->vm_start)>>PAGE_SHIFT,
+ vma->vm_page_prot);
+#endif
#ifdef __sparc__
if (io_remap_page_range(vma->vm_start,
VM_OFFSET(vma) + offset,
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable-2level.h linux-2.4.19pre9-work/include/asm-i386/pgtable-2level.h
--- linux-2.4.19pre9/include/asm-i386/pgtable-2level.h Thu Jul 26 22:40:32 2001
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable-2level.h Thu Jun 13 16:52:53 2002
@@ -40,6 +40,8 @@ static inline int pgd_present(pgd_t pgd)
* hook is made available.
*/
#define set_pte(pteptr, pteval) (*(pteptr) = pteval)
+#define set_pte_atomic(pteptr, pteval) (*(pteptr) = pteval)
+
/*
* (pmds are folded into pgds so this doesnt get actually called,
* but the define is needed for a generic inline function.)
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable-3level.h linux-2.4.19pre9-work/include/asm-i386/pgtable-3level.h
--- linux-2.4.19pre9/include/asm-i386/pgtable-3level.h Thu Jul 26 22:40:32 2001
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable-3level.h Thu Jun 13 17:15:14 2002
@@ -53,6 +53,9 @@ static inline void set_pte(pte_t *ptep,
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pgd(pgdptr,pgdval) \
set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))
+#define set_pte_atomic(pteptr,pteval) \
+ set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
+

/*
* Pentium-II erratum A13: in PAE mode we explicitly have to flush
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable.h linux-2.4.19pre9-work/include/asm-i386/pgtable.h
--- linux-2.4.19pre9/include/asm-i386/pgtable.h Mon Jun 3 21:15:18 2002
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable.h Fri Jun 14 17:45:29 2002
@@ -347,6 +347,9 @@ static inline pte_t pte_modify(pte_t pte
#define pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define swp_entry_to_pte(x) ((pte_t) { (x).val })

+struct page;
+int change_page_attr(struct page *, int, pgprot_t prot);
+
#endif /* !__ASSEMBLY__ */

/* Needs to be defined here and not in linux/mm.h, as it is arch dependent */


2002-06-14 17:31:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: New version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 06:13:28PM +0200, Andi Kleen wrote:
> +#ifdef CONFIG_HIGHMEM
> + /* Hopefully not mapped anywhere else. */
> + if (page >= highmem_start_page)
> + return 0;
> +#endif

there's no hope here. If you don't want to code it right (because nobody
is exercising such a path) and flush both any per-cpu kmap-atomic slot and
the page->virtual mapping, please place a BUG() there or some more graceful
failure notification.

> +int change_page_attr(struct page *page, int numpages, pgprot_t prot)
> +{

this API is not the best; again, I would recommend something along these lines:

struct page ** physical_alias_alloc_pages(int numpages, unsigned int gfp_mask);
void physical_alias_free_pages(struct page **);

the semantics are trivial: physical_alias_alloc_pages() will return an
array of numpages entries (allocated with kmalloc) that points to all the
pages that have been prepared by the architectural call (of course those
two functions live in arch/) for the generation of a physical alias.

This allows the arch to allocate the pages with order >0; this way the
number of fragmented 2M/4M large pages will be reduced.

The only requirement is that you know the number of pages that you're
going to allocate; if some caller doesn't, we can also add this additional API:

struct page * physical_alias_alloc_page(unsigned int gfp_mask);
void physical_alias_free_page(struct page *);

Andrea

2002-06-14 18:05:47

by Andi Kleen

[permalink] [raw]
Subject: Re: another new version of pageattr caching conflict fix for 2.4

On Fri, Jun 14, 2002 at 12:31:33PM -0500, Andrea Arcangeli wrote:
> On Fri, Jun 14, 2002 at 06:13:28PM +0200, Andi Kleen wrote:
> > +#ifdef CONFIG_HIGHMEM
> > + /* Hopefully not mapped anywhere else. */
> > + if (page >= highmem_start_page)
> > + return 0;
> > +#endif
>
> there's no hope here. If you don't want to code it right (because nobody
> is exercising such a path) and flush both any per-cpu kmap-atomic slot and
> the page->virtual mapping, please place a BUG() there or some more graceful
> failure notification.

Ok done.

I also removed the DRM change because it was buggy. I'm not 100% sure yet
that it is even a problem; it needs more investigation.

BTW there is another corner case that was overlooked:
mmap of /dev/kmem, /proc/kcore, /dev/mem. To handle this correctly
the mmap functions would need to always walk the kernel page table
and force the correct caching attribute in the user mapping.
But what to do when there is already an existing mapping to user space?
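
For the first half of that, here is a sketch of what such an mmap path
could do (illustration only; it assumes lookup_address() from pageattr.c
were made available to drivers and that phys lies inside the direct
mapping, and it leaves the question about already existing user mappings
open):

/* Sketch: make a user mapping of physical memory inherit the caching
 * bits of the kernel linear mapping, so no conflicting alias is created.
 * Assumes phys is within the kernel direct mapping. */
static void force_kernel_cache_bits(struct vm_area_struct *vma,
				    unsigned long phys)
{
	pte_t *kpte = lookup_address((unsigned long)__va(phys));
	if (kpte)
		pgprot_val(vma->vm_page_prot) |=
			pte_val(*kpte) & (_PAGE_PCD | _PAGE_PWT);
}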


> > +int change_page_attr(struct page *page, int numpages, pgprot_t prot)
> > +{
>
> this API is not the best; again, I would recommend something along these lines:

Hmm: I would really prefer to do the allocation in the caller.
If you want your API you could just do it as a simple wrapper around
the existing function (with the new numpages argument it would even be
efficient).
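
For illustration, such a wrapper could look roughly like this (a sketch
only, not part of any posted patch: it uses Andrea's proposed names,
takes one contiguous allocation, and ignores the rounding an order-based
allocation implies):

/* Sketch: Andrea's proposed API as a thin wrapper around
 * change_page_attr(). Untested, for discussion only. */
struct page **physical_alias_alloc_pages(int numpages, unsigned int gfp_mask)
{
	int i, order = get_order(numpages << PAGE_SHIFT);
	struct page **pages = kmalloc(numpages * sizeof(*pages), GFP_KERNEL);
	struct page *base;

	if (!pages)
		return NULL;
	/* one contiguous allocation keeps large page fragmentation low */
	base = alloc_pages(gfp_mask, order);
	if (!base)
		goto out_free;
	if (change_page_attr(base, numpages, PAGE_KERNEL_NOCACHE) < 0) {
		__free_pages(base, order);
		goto out_free;
	}
	for (i = 0; i < numpages; i++)
		pages[i] = base + i;
	return pages;
out_free:
	kfree(pages);
	return NULL;
}

Freeing would mirror this: restore PAGE_KERNEL with change_page_attr(),
then release the block with __free_pages() and the array with kfree().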

New version appended.
Also available at ftp://ftp.suse.com/pub/people/ak/v2.4/pageattr-2.4.19pre9-6

-Andi


diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/arch/i386/mm/Makefile linux-2.4.19pre9-work/arch/i386/mm/Makefile
--- linux-2.4.19pre9/arch/i386/mm/Makefile Fri Dec 29 23:07:20 2000
+++ linux-2.4.19pre9-work/arch/i386/mm/Makefile Wed Jun 5 02:35:11 2002
@@ -9,6 +9,7 @@

O_TARGET := mm.o

-obj-y := init.o fault.o ioremap.o extable.o
+obj-y := init.o fault.o ioremap.o extable.o pageattr.o
+export-objs := pageattr.o

include $(TOPDIR)/Rules.make
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/arch/i386/mm/pageattr.c linux-2.4.19pre9-work/arch/i386/mm/pageattr.c
--- linux-2.4.19pre9/arch/i386/mm/pageattr.c Thu Jan 1 01:00:00 1970
+++ linux-2.4.19pre9-work/arch/i386/mm/pageattr.c Fri Jun 14 20:00:04 2002
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2002 Andi Kleen, SuSE Labs.
+ * Thanks to Ben LaHaise for precious feedback.
+ */
+
+#include <linux/config.h>
+#include <linux/mm.h>
+#include <linux/sched.h>
+#include <linux/highmem.h>
+#include <linux/module.h>
+#include <asm/uaccess.h>
+#include <asm/processor.h>
+
+/* Should move most of this stuff into the appropriate includes */
+#define PAGE_LARGE (_PAGE_PSE|_PAGE_PRESENT)
+#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1))
+#define LARGE_PAGE_SIZE (1UL << PMD_SHIFT)
+
+static inline pte_t *lookup_address(unsigned long address)
+{
+ pmd_t *pmd;
+ pgd_t *pgd = pgd_offset(&init_mm, address);
+
+ if ((pgd_val(*pgd) & PAGE_LARGE) == PAGE_LARGE)
+ return (pte_t *)pgd;
+ pmd = pmd_offset(pgd, address);
+ if ((pmd_val(*pmd) & PAGE_LARGE) == PAGE_LARGE)
+ return (pte_t *)pmd;
+
+ return pte_offset(pmd, address);
+}
+
+static struct page *split_large_page(unsigned long address, pgprot_t prot)
+{
+ int i;
+ unsigned long addr;
+ struct page *base = alloc_pages(GFP_KERNEL, 0);
+ pte_t *pbase;
+ if (!base)
+ return NULL;
+ address = __pa(address);
+ addr = address & LARGE_PAGE_MASK;
+ pbase = (pte_t *)page_address(base);
+ for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
+ pbase[i] = mk_pte_phys(addr,
+ addr == address ? prot : PAGE_KERNEL);
+ }
+ return base;
+}
+
+static void flush_kernel_map(void * address)
+{
+ /* Could use CLFLUSH here if the CPU supports it (Hammer,P4) */
+ if (boot_cpu_data.x86_model >= 4)
+ asm volatile("wbinvd":::"memory");
+ __flush_tlb_all();
+}
+
+static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+{
+ set_pte_atomic(kpte, pte); /* change init_mm */
+#ifndef CONFIG_X86_PAE
+ {
+ struct list_head *l;
+ spin_lock(&mmlist_lock);
+ list_for_each(l, &init_mm.mmlist) {
+ struct mm_struct *mm = list_entry(l, struct mm_struct, mmlist);
+ pmd_t *pmd = pmd_offset(pgd_offset(mm, address), address);
+ set_pte_atomic((pte_t *)pmd, pte);
+ }
+ spin_unlock(&mmlist_lock);
+ }
+#endif
+}
+
+/* no more special protections in this 2/4MB area - revert to a
+ large page again. */
+static inline void revert_page(struct page *kpte_page, unsigned long address)
+{
+ pte_t *linear = (pte_t *)
+ pmd_offset(pgd_offset(&init_mm, address), address);
+ set_pmd_pte(linear, address,
+ mk_pte_phys(__pa(address & LARGE_PAGE_MASK),
+ __pgprot(_KERNPG_TABLE|_PAGE_PSE)));
+}
+
+/*
+ * Change the page attributes of a page in the linear mapping.
+ *
+ * This should be used when a page is mapped with a different caching policy
+ * than write-back somewhere - some CPUs do not like it when mappings with
+ * different caching policies exist. This changes the page attributes of the
+ * in-kernel linear mapping too.
+ *
+ * The caller needs to ensure that there are no conflicting mappings elsewhere.
+ * This function only deals with the kernel linear map.
+ * When page is in highmem it must never be kmap'ed.
+ */
+static int
+__change_page_attr(struct page *page, pgprot_t prot, struct page **oldpage)
+{
+ pte_t *kpte;
+ unsigned long address;
+ struct page *kpte_page;
+
+#ifdef CONFIG_HIGHMEM
+ if (page >= highmem_start_page)
+ BUG();
+#endif
+ address = (unsigned long)page_address(page);
+
+ kpte = lookup_address(address);
+ kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+ if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
+ if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ pte_t old = *kpte;
+ pte_t standard = mk_pte(page, PAGE_KERNEL);
+
+ set_pte_atomic(kpte, mk_pte(page, prot));
+ if (pte_same(old,standard))
+ atomic_inc(&kpte_page->count);
+ } else {
+ struct page *split = split_large_page(address, prot);
+ if (!split)
+ return -ENOMEM;
+ set_pmd_pte(kpte,address,mk_pte(split, PAGE_KERNEL));
+ }
+ } else if ((pte_val(*kpte) & _PAGE_PSE) == 0) {
+ set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
+ atomic_dec(&kpte_page->count);
+ }
+
+ if (test_bit(X86_FEATURE_PSE, &boot_cpu_data.x86_capability) &&
+ (atomic_read(&kpte_page->count) == 1)) {
+ *oldpage = kpte_page;
+ revert_page(kpte_page, address);
+ }
+ return 0;
+}
+
+int change_page_attr(struct page *page, int numpages, pgprot_t prot)
+{
+ int err = 0;
+ struct page *fpage;
+ int i;
+
+ down_write(&init_mm.mmap_sem);
+ for (i = 0; i < numpages; i++, page++) {
+ fpage = NULL;
+ err = __change_page_attr(page, prot, &fpage);
+ if (err)
+ break;
+ if (fpage || i == numpages-1) {
+ void *address = page_address(page);
+#ifdef CONFIG_SMP
+ smp_call_function(flush_kernel_map, address, 1, 1);
+#endif
+ flush_kernel_map(address);
+ if (fpage)
+ __free_page(fpage);
+ }
+ }
+ up_write(&init_mm.mmap_sem);
+ return err;
+}
+
+EXPORT_SYMBOL(change_page_attr);
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/drivers/char/agp/agpgart_be.c linux-2.4.19pre9-work/drivers/char/agp/agpgart_be.c
--- linux-2.4.19pre9/drivers/char/agp/agpgart_be.c Sun Jun 2 20:24:00 2002
+++ linux-2.4.19pre9-work/drivers/char/agp/agpgart_be.c Fri Jun 14 17:37:46 2002
@@ -397,7 +397,7 @@ int agp_unbind_memory(agp_memory * curr)
static void agp_generic_agp_enable(u32 mode)
{
struct pci_dev *device = NULL;
- u32 command, scratch, cap_id;
+ u32 command, scratch;
u8 cap_ptr;

pci_read_config_dword(agp_bridge.dev,
@@ -561,6 +561,7 @@ static int agp_generic_create_gatt_table
agp_bridge.current_size;
break;
}
+ temp = agp_bridge.current_size;
} else {
agp_bridge.aperture_size_idx = i;
}
@@ -578,23 +579,21 @@ static int agp_generic_create_gatt_table
}
table_end = table + ((PAGE_SIZE * (1 << page_order)) - 1);

- for (page = virt_to_page(table); page <= virt_to_page(table_end); page++)
+ page = virt_to_page(table);
+
+#ifdef __i386__
+ if (change_page_attr(page, 1<<page_order, PAGE_KERNEL_NOCACHE) < 0)
+ return -ENOMEM;
+#endif
+ for (; page <= virt_to_page(table_end); page++) {
SetPageReserved(page);
+ }

agp_bridge.gatt_table_real = (unsigned long *) table;
CACHE_FLUSH();
- agp_bridge.gatt_table = ioremap_nocache(virt_to_phys(table),
- (PAGE_SIZE * (1 << page_order)));
+ agp_bridge.gatt_table = agp_bridge.gatt_table_real;
CACHE_FLUSH();

- if (agp_bridge.gatt_table == NULL) {
- for (page = virt_to_page(table); page <= virt_to_page(table_end); page++)
- ClearPageReserved(page);
-
- free_pages((unsigned long) table, page_order);
-
- return -ENOMEM;
- }
agp_bridge.gatt_bus_addr = virt_to_phys(agp_bridge.gatt_table_real);

for (i = 0; i < num_entries; i++) {
@@ -651,7 +650,6 @@ static int agp_generic_free_gatt_table(v
* from the table.
*/

- iounmap(agp_bridge.gatt_table);
table = (char *) agp_bridge.gatt_table_real;
table_end = table + ((PAGE_SIZE * (1 << page_order)) - 1);

@@ -769,6 +767,13 @@ static unsigned long agp_generic_alloc_p
if (page == NULL) {
return 0;
}
+#ifdef __i386__
+ if (change_page_attr(page, 1, PAGE_KERNEL_NOCACHE) < 0) {
+ __free_page(page);
+ return 0;
+ }
+#endif
+
get_page(page);
LockPage(page);
atomic_inc(&agp_bridge.current_memory_agp);
@@ -785,6 +790,11 @@ static void agp_generic_destroy_page(uns
}

page = virt_to_page(pt);
+#ifdef __i386__
+ change_page_attr(page, 1, PAGE_KERNEL);
+#endif
+
+
put_page(page);
UnlockPage(page);
free_page((unsigned long) pt);
@@ -2241,17 +2251,15 @@ static int amd_create_page_map(amd_page_
if (page_map->real == NULL) {
return -ENOMEM;
}
- SetPageReserved(virt_to_page(page_map->real));
- CACHE_FLUSH();
- page_map->remapped = ioremap_nocache(virt_to_phys(page_map->real),
- PAGE_SIZE);
- if (page_map->remapped == NULL) {
- ClearPageReserved(virt_to_page(page_map->real));
- free_page((unsigned long) page_map->real);
- page_map->real = NULL;
+#ifdef __i386__
+ if (change_page_attr(virt_to_page(page_map->real),
+ 1, PAGE_KERNEL_NOCACHE)) {
+ free_page(page_map->real);
return -ENOMEM;
}
- CACHE_FLUSH();
+#endif
+ SetPageReserved(virt_to_page(page_map->real));
+ page_map->remapped = page_map->real;

for(i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
page_map->remapped[i] = agp_bridge.scratch_page;
@@ -2262,7 +2270,6 @@ static int amd_create_page_map(amd_page_

static void amd_free_page_map(amd_page_map *page_map)
{
- iounmap(page_map->remapped);
ClearPageReserved(virt_to_page(page_map->real));
free_page((unsigned long) page_map->real);
}
@@ -2744,27 +2751,22 @@ static void ali_cache_flush(void)

static unsigned long ali_alloc_page(void)
{
- struct page *page;
- u32 temp;
-
- page = alloc_page(GFP_KERNEL);
- if (page == NULL)
+ unsigned long p = agp_generic_alloc_page();
+ if (!p)
return 0;

- get_page(page);
- LockPage(page);
- atomic_inc(&agp_bridge.current_memory_agp);
-
+ /* probably not needed anymore */
global_cache_flush();

if (agp_bridge.type == ALI_M1541) {
+ u32 temp;
pci_read_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL, &temp);
pci_write_config_dword(agp_bridge.dev, ALI_CACHE_FLUSH_CTRL,
(((temp & ALI_CACHE_FLUSH_ADDR_MASK) |
- virt_to_phys(page_address(page))) |
+ virt_to_phys(p)) |
ALI_CACHE_FLUSH_EN ));
}
- return (unsigned long)page_address(page);
+ return p;
}

static void ali_destroy_page(unsigned long addr)
@@ -2786,11 +2788,7 @@ static void ali_destroy_page(unsigned lo
ALI_CACHE_FLUSH_EN));
}

- page = virt_to_page(pt);
- put_page(page);
- UnlockPage(page);
- free_page((unsigned long) pt);
- atomic_dec(&agp_bridge.current_memory_agp);
+ agp_generic_destroy_page(pt);
}

/* Setup function */
@@ -2870,17 +2868,15 @@ static int serverworks_create_page_map(s
if (page_map->real == NULL) {
return -ENOMEM;
}
- SetPageReserved(virt_to_page(page_map->real));
- CACHE_FLUSH();
- page_map->remapped = ioremap_nocache(virt_to_phys(page_map->real),
- PAGE_SIZE);
- if (page_map->remapped == NULL) {
- ClearPageReserved(virt_to_page(page_map->real));
- free_page((unsigned long) page_map->real);
- page_map->real = NULL;
+#ifdef __i386__
+ if (change_page_attr(virt_to_page(page_map->real), 1,
+ PAGE_KERNEL_NOCACHE) < 0) {
+ free_page(page_map->real);
return -ENOMEM;
}
- CACHE_FLUSH();
+#endif
+ SetPageReserved(virt_to_page(page_map->real));
+ page_map->remapped = page_map->real;

for(i = 0; i < PAGE_SIZE / sizeof(unsigned long); i++) {
page_map->remapped[i] = agp_bridge.scratch_page;
@@ -2891,7 +2887,6 @@ static int serverworks_create_page_map(s

static void serverworks_free_page_map(serverworks_page_map *page_map)
{
- iounmap(page_map->remapped);
ClearPageReserved(virt_to_page(page_map->real));
free_page((unsigned long) page_map->real);
}
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable-2level.h linux-2.4.19pre9-work/include/asm-i386/pgtable-2level.h
--- linux-2.4.19pre9/include/asm-i386/pgtable-2level.h Thu Jul 26 22:40:32 2001
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable-2level.h Thu Jun 13 16:52:53 2002
@@ -40,6 +40,8 @@ static inline int pgd_present(pgd_t pgd)
* hook is made available.
*/
#define set_pte(pteptr, pteval) (*(pteptr) = pteval)
+#define set_pte_atomic(pteptr, pteval) (*(pteptr) = pteval)
+
/*
* (pmds are folded into pgds so this doesnt get actually called,
* but the define is needed for a generic inline function.)
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable-3level.h linux-2.4.19pre9-work/include/asm-i386/pgtable-3level.h
--- linux-2.4.19pre9/include/asm-i386/pgtable-3level.h Thu Jul 26 22:40:32 2001
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable-3level.h Thu Jun 13 17:15:14 2002
@@ -53,6 +53,9 @@ static inline void set_pte(pte_t *ptep,
set_64bit((unsigned long long *)(pmdptr),pmd_val(pmdval))
#define set_pgd(pgdptr,pgdval) \
set_64bit((unsigned long long *)(pgdptr),pgd_val(pgdval))
+#define set_pte_atomic(pteptr,pteval) \
+ set_64bit((unsigned long long *)(pteptr),pte_val(pteval))
+

/*
* Pentium-II erratum A13: in PAE mode we explicitly have to flush
diff -burpN -X ../KDIFX -x *-o -x *.[ch]-* linux-2.4.19pre9/include/asm-i386/pgtable.h linux-2.4.19pre9-work/include/asm-i386/pgtable.h
--- linux-2.4.19pre9/include/asm-i386/pgtable.h Mon Jun 3 21:15:18 2002
+++ linux-2.4.19pre9-work/include/asm-i386/pgtable.h Fri Jun 14 17:45:29 2002
@@ -347,6 +347,9 @@ static inline pte_t pte_modify(pte_t pte
#define pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define swp_entry_to_pte(x) ((pte_t) { (x).val })

+struct page;
+int change_page_attr(struct page *, int, pgprot_t prot);
+
#endif /* !__ASSEMBLY__ */

/* Needs to be defined here and not in linux/mm.h, as it is arch dependent */

2002-06-16 10:18:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: another new version of pageattr caching conflict fix for 2.4

Andi Kleen <[email protected]> writes:

> On Fri, Jun 14, 2002 at 12:31:33PM -0500, Andrea Arcangeli wrote:
> > On Fri, Jun 14, 2002 at 06:13:28PM +0200, Andi Kleen wrote:
> > > +#ifdef CONFIG_HIGHMEM
> > > + /* Hopefully not mapped anywhere else. */
> > > + if (page >= highmem_start_page)
> > > + return 0;
> > > +#endif
> >
> > there's no hope here. If you don't want to code it right (because nobody
> > is exercising such a path) and flush both any per-cpu kmap-atomic slot and
> > the page->virtual mapping, please place a BUG() there or some more graceful
> > failure notification.
>
> Ok done.
>
> I also removed the DRM change because it was buggy. I'm not 100% sure yet
> that it is even a problem; it needs more investigation.
>
> BTW there is another corner case that was overlooked:
> mmap of /dev/kmem, /proc/kcore, /dev/mem. To handle this correctly
> the mmap functions would need to always walk the kernel page table
> and force the correct caching attribute in the user mapping.
> But what to do when there is already an existing mapping to user space?

Disallowing change_page_attr when page->count > 1 is an easy solution,
and it probably suffices. Beyond that, anything mmapped can be found
by walking into the backing address space and then through the
vm_area_structs found via i_mmap && i_mmap_shared. Of course the
vm_area_structs may need to be split because of the multiple
page protections.
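
As a sketch (not in any posted patch), that check could sit at the top of
__change_page_attr(); the helper name and the use of page->count as a
"may be mapped elsewhere" signal are assumptions here:

/* Sketch: refuse attribute changes on pages somebody else still
 * references, since such a reference may carry another mapping. */
static int page_attr_change_safe(struct page *page)
{
	return atomic_read(&page->count) <= 1;
}

__change_page_attr() would then bail out early (say with -EBUSY) when
this fails, instead of silently creating a conflicting alias.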

> > > +int change_page_attr(struct page *page, int numpages, pgprot_t prot)
> > > +{
> >
> > this API is not the best; again, I would recommend something along these lines:
>
> Hmm: I would really prefer to do the allocation in the caller.
> If you want your API you could just do it as a simple wrapper around
> the existing function (with the new numpages argument it would even be
> efficient).

Using pgprot_t to convey the cacheability of a page appears to be an
abuse. At the very least we need a PAGE_CACHE_MASK to extract just
the cacheability attributes.
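
On i386 the caching policy of a mapping lives in the PCD and PWT bits of
the PTE, so such a mask could look like this (a sketch; the helper name
is made up, and neither define exists in the posted patch):

/* Sketch: the PTE bits that select the caching policy on i386. */
#define PAGE_CACHE_MASK (_PAGE_PCD | _PAGE_PWT)

/* true when two protections agree on cacheability alone */
#define pgprot_cache_equal(a, b) \
	((pgprot_val(a) & PAGE_CACHE_MASK) == \
	 (pgprot_val(b) & PAGE_CACHE_MASK))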

And we should really consider using the other cacheability page
attributes on x86, not just cache enable/disable. Using just MTRRs
is limiting in that you don't always have enough of them, and
sometimes valid RAM is mapped uncacheable because of the MTRR
alignment restrictions.

Eric