2009-07-10 16:59:51

by Pallipadi, Venkatesh

Subject: [PATCH 00/10] x86, PAT: PAT patchset

Another PAT patchset. Apart from minor cleanups, there are 3
relatively big changes here.
* Adding memtype tracking to io_mapping_* APIs. Currently these APIs hardcode
the type to WC.
* An rbtree added to keep track of PAT entries. This will avoid worst case
latency with PAT entry lookup.
* Tracking of RAM pages using 2 bits in page struct. This will give us the
ability to look up the type of any physical address.

This patchset covers almost all the items in our PAT todo list, except for:
* Make PAT APIs try MTRR setup internally when PAT is disabled.
* Need a set_pages_array_wc interface.

These patches are targeted for .32, and we are hoping to get some extended
testing before the upstream merge. I have tested this with some of my test
programs and also with the regression tests in Eric's intel-gpu-tools here
http://cgit.freedesktop.org/xorg/app/intel-gpu-tools/

Venkatesh Pallipadi (10):
x86, PAT: keep identity maps consistent with mmaps even when
pat_disabled
x86, PAT: ioremap to follow same PAT restrictions as other PAT users
x86, PAT: New i/f for driver to request memtype for IO regions
x86, PAT: Add PAT reserve free to io_mapping* APIs
x86, PAT: Add rbtree to do quick lookup in memtype tracking
x86, PAT: Generalize the use of page flag PG_uncached
x86, PAT: Use page flags to track memtypes of RAM pages
x86, PAT: Add lookup_memtype to get the current memtype of a paddr
x86, PAT: lookup the protection from memtype list on vm_insert_pfn()
x86, PAT: Sanity check remap_pfn_range for RAM region

arch/ia64/Kconfig | 4 +
arch/x86/Kconfig | 4 +
arch/x86/include/asm/cacheflush.h | 54 ++++++-
arch/x86/include/asm/iomap.h | 9 +-
arch/x86/include/asm/pat.h | 5 +
arch/x86/mm/iomap_32.c | 27 +++-
arch/x86/mm/ioremap.c | 17 +--
arch/x86/mm/pat.c | 352 +++++++++++++++++++++++++++----------
include/linux/io-mapping.h | 17 ++-
include/linux/page-flags.h | 4 +-
10 files changed, 376 insertions(+), 117 deletions(-)


2009-07-10 17:00:06

by Pallipadi, Venkatesh

Subject: [PATCH 01/10] x86, PAT: keep identity maps consistent with mmaps even when pat_disabled

Make reserve_memtype internally take care of the pat_disabled case and fall
back to default return values.

Remove the specific pat_disabled checks in track_* routines.

Change kernel_map_sync_memtype to sync identity map even when
pat_disabled.

This change ensures that, even for pat_disabled case, we take care of
keeping identity map in sync. Before this patch, in pat disabled case,
ioremap() keeps the identity maps in sync and other APIs like pci and
/dev/mem mmap don't, which is not a very consistent behavior.
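
The fallback described above can be sketched in userspace as follows. This
is only an illustration of the first hunk of this patch, not kernel code:
pat_disabled_fallback() and REQ_TYPE_ANY are hypothetical names, and the
PAGE_CACHE_* values are assumed to match the x86 _PAGE_CACHE_* encodings
(PWT in bit 3, PCD in bit 4).

```c
#include <assert.h>

/* Assumed encodings: PWT is bit 3, PCD is bit 4 of the page table entry. */
#define PAGE_CACHE_MASK     0x18UL
#define PAGE_CACHE_WB       0x00UL
#define PAGE_CACHE_WC       0x08UL
#define PAGE_CACHE_UC_MINUS 0x10UL
#define PAGE_CACHE_UC       0x18UL
#define REQ_TYPE_ANY        (~0UL)	/* hypothetical name for req_type == -1 */

/* Hypothetical helper: the type handed back when PAT is disabled. */
unsigned long pat_disabled_fallback(unsigned long req_type)
{
	if (req_type == REQ_TYPE_ANY)
		return PAGE_CACHE_WB;		/* no preference: plain WB */
	if (req_type == PAGE_CACHE_WC)
		return PAGE_CACHE_UC_MINUS;	/* WC request is downgraded to UC- */
	return req_type & PAGE_CACHE_MASK;	/* otherwise keep the requested bits */
}

/* Usage: one check per branch of the helper. */
void pat_disabled_fallback_examples(void)
{
	assert(pat_disabled_fallback(REQ_TYPE_ANY) == PAGE_CACHE_WB);
	assert(pat_disabled_fallback(PAGE_CACHE_WC) == PAGE_CACHE_UC_MINUS);
	assert(pat_disabled_fallback(PAGE_CACHE_UC) == PAGE_CACHE_UC);
}
```

The WC to UC- downgrade is the only branch this patch adds; the other two
branches were already present.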

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/mm/pat.c | 13 +++----------
1 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index e6718bb..d5af279 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -339,6 +339,8 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,
if (new_type) {
if (req_type == -1)
*new_type = _PAGE_CACHE_WB;
+ else if (req_type == _PAGE_CACHE_WC)
+ *new_type = _PAGE_CACHE_UC_MINUS;
else
*new_type = req_type & _PAGE_CACHE_MASK;
}
@@ -577,7 +579,7 @@ int kernel_map_sync_memtype(u64 base, unsigned long size, unsigned long flags)
{
unsigned long id_sz;

- if (!pat_enabled || base >= __pa(high_memory))
+ if (base >= __pa(high_memory))
return 0;

id_sz = (__pa(high_memory) < base + size) ?
@@ -677,9 +679,6 @@ int track_pfn_vma_copy(struct vm_area_struct *vma)
unsigned long vma_size = vma->vm_end - vma->vm_start;
pgprot_t pgprot;

- if (!pat_enabled)
- return 0;
-
/*
* For now, only handle remap_pfn_range() vmas where
* is_linear_pfn_mapping() == TRUE. Handling of
@@ -715,9 +714,6 @@ int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot,
resource_size_t paddr;
unsigned long vma_size = vma->vm_end - vma->vm_start;

- if (!pat_enabled)
- return 0;
-
/*
* For now, only handle remap_pfn_range() vmas where
* is_linear_pfn_mapping() == TRUE. Handling of
@@ -743,9 +739,6 @@ void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
resource_size_t paddr;
unsigned long vma_size = vma->vm_end - vma->vm_start;

- if (!pat_enabled)
- return;
-
/*
* For now, only handle remap_pfn_range() vmas where
* is_linear_pfn_mapping() == TRUE. Handling of
--
1.6.0.6

2009-07-10 17:00:23

by Pallipadi, Venkatesh

Subject: [PATCH 03/10] x86, PAT: New i/f for driver to request memtype for IO regions

Add new routines to request memtype for IO regions. This will currently
be a backend for io_mapping_* routines. But, it can also be made available
to drivers directly in future, in case it is needed.

reserve interface reserves the memory, makes sure we have a compatible
memory type available and keeps the identity map in sync when needed.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/include/asm/pat.h | 5 ++++
arch/x86/mm/pat.c | 49 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
index 7af14e5..e2c1668 100644
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -19,4 +19,9 @@ extern int free_memtype(u64 start, u64 end);
extern int kernel_map_sync_memtype(u64 base, unsigned long size,
unsigned long flag);

+int io_reserve_memtype(resource_size_t start, resource_size_t end,
+ unsigned long *type);
+
+void io_free_memtype(resource_size_t start, resource_size_t end);
+
#endif /* _ASM_X86_PAT_H */
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index d5af279..82d097c 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -498,6 +498,55 @@ int free_memtype(u64 start, u64 end)
}


+/**
+ * io_reserve_memtype - Request a memory type mapping for a region of memory
+ * @start: start (physical address) of the region
+ * @end: end (physical address) of the region
+ * @type: A pointer to memtype, with requested type. On success, requested
+ * or any other compatible type that was available for the region is returned
+ *
+ * On success, returns 0
+ * On failure, returns non-zero
+ */
+int io_reserve_memtype(resource_size_t start, resource_size_t end,
+ unsigned long *type)
+{
+ unsigned long req_type = *type;
+ unsigned long new_type;
+ int ret;
+
+ WARN_ON_ONCE(iomem_map_sanity_check(start, end - start));
+
+ ret = reserve_memtype(start, end, req_type, &new_type);
+ if (ret)
+ goto out_err;
+
+ if (!is_new_memtype_allowed(req_type, new_type))
+ goto out_free;
+
+ if (kernel_map_sync_memtype(start, end - start, new_type) < 0)
+ goto out_free;
+
+ *type = new_type;
+ return 0;
+
+out_free:
+ free_memtype(start, end);
+ ret = -EBUSY;
+out_err:
+ return ret;
+}
+
+/**
+ * io_free_memtype - Release a memory type mapping for a region of memory
+ * @start: start (physical address) of the region
+ * @end: end (physical address) of the region
+ */
+void io_free_memtype(resource_size_t start, resource_size_t end)
+{
+ free_memtype(start, end);
+}
+
pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
unsigned long size, pgprot_t vma_prot)
{
--
1.6.0.6

2009-07-10 17:00:42

by Pallipadi, Venkatesh

Subject: [PATCH 02/10] x86, PAT: ioremap to follow same PAT restrictions as other PAT users

ioremap has this hard-coded check for new type and requested type. That
check differs from other PAT users like /dev/mem mmap, remap_pfn_range
in only one condition where requested type is UC_MINUS and new type
is WC. Under that condition, ioremap fails. But other PAT interfaces succeed
with a WC mapping.

Change to make ioremap be in sync with other PAT APIs and use the same
macro as others. Also changes the error print to KERN_ERR instead of
pr_debug.
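
To make the one differing condition concrete, here is a small userspace
sketch comparing the two checks. The old rule is transcribed from the
comment this patch removes. The shared rule is an assumption derived from
the description above (the old rule minus the UC_MINUS to WC rejection),
not a copy of is_new_memtype_allowed(). The enum and all function names
are hypothetical.

```c
#include <assert.h>

enum memtype { MT_WB, MT_WC, MT_UC_MINUS, MT_UC };	/* illustrative values */

/* Old ioremap-only rule, from the comment removed by this patch. */
int ioremap_old_rule_allows(enum memtype req, enum memtype got)
{
	if (req == MT_UC_MINUS && (got == MT_WB || got == MT_WC))
		return 0;
	if (req == MT_WC && got == MT_WB)
		return 0;
	return 1;
}

/* Assumed shared rule: only a fallback to WB is rejected for UC- and WC. */
int shared_rule_allows(enum memtype req, enum memtype got)
{
	if ((req == MT_UC_MINUS || req == MT_WC) && got == MT_WB)
		return 0;
	return 1;
}

/* Counts the (requested, returned) pairs on which the two rules disagree. */
int count_rule_differences(void)
{
	int req, got, diff = 0;

	for (req = MT_WB; req <= MT_UC; req++)
		for (got = MT_WB; got <= MT_UC; got++)
			if (ioremap_old_rule_allows(req, got) !=
			    shared_rule_allows(req, got))
				diff++;
	return diff;
}

/* Usage: the single disagreement is a UC- request answered with WC. */
void rule_examples(void)
{
	assert(!ioremap_old_rule_allows(MT_UC_MINUS, MT_WC));
	assert(shared_rule_allows(MT_UC_MINUS, MT_WC));
	assert(count_rule_differences() == 1);
}
```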

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/mm/ioremap.c | 17 +++--------------
1 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 8a45093..aeaea8c 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -228,24 +228,13 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
prot_val, &new_prot_val);
if (retval) {
- pr_debug("Warning: reserve_memtype returned %d\n", retval);
+ printk(KERN_ERR "ioremap reserve_memtype failed %d\n", retval);
return NULL;
}

if (prot_val != new_prot_val) {
- /*
- * Do not fallback to certain memory types with certain
- * requested type:
- * - request is uc-, return cannot be write-back
- * - request is uc-, return cannot be write-combine
- * - request is write-combine, return cannot be write-back
- */
- if ((prot_val == _PAGE_CACHE_UC_MINUS &&
- (new_prot_val == _PAGE_CACHE_WB ||
- new_prot_val == _PAGE_CACHE_WC)) ||
- (prot_val == _PAGE_CACHE_WC &&
- new_prot_val == _PAGE_CACHE_WB)) {
- pr_debug(
+ if (!is_new_memtype_allowed(prot_val, new_prot_val)) {
+ printk(KERN_ERR
"ioremap error for 0x%llx-0x%llx, requested 0x%lx, got 0x%lx\n",
(unsigned long long)phys_addr,
(unsigned long long)(phys_addr + size),
--
1.6.0.6

2009-07-10 17:00:52

by Pallipadi, Venkatesh

Subject: [PATCH 04/10] x86, PAT: Add PAT reserve free to io_mapping* APIs

io_mapping_* interfaces were added, mainly for graphics drivers.
Make this interface go through the PAT reserve/free, instead of
hardcoding WC mapping. This makes sure that there are no
aliases due to unconditional WC setting.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/include/asm/iomap.h | 9 ++++++---
arch/x86/mm/iomap_32.c | 27 +++++++++++++++++++++++++--
include/linux/io-mapping.h | 17 ++++++++++++-----
3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/iomap.h b/arch/x86/include/asm/iomap.h
index 0e9fe1d..f35eb45 100644
--- a/arch/x86/include/asm/iomap.h
+++ b/arch/x86/include/asm/iomap.h
@@ -26,13 +26,16 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

-int
-is_io_mapping_possible(resource_size_t base, unsigned long size);
-
void *
iomap_atomic_prot_pfn(unsigned long pfn, enum km_type type, pgprot_t prot);

void
iounmap_atomic(void *kvaddr, enum km_type type);

+int
+iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot);
+
+void
+iomap_free(resource_size_t base, unsigned long size);
+
#endif /* _ASM_X86_IOMAP_H */
diff --git a/arch/x86/mm/iomap_32.c b/arch/x86/mm/iomap_32.c
index fe6f84c..84e236c 100644
--- a/arch/x86/mm/iomap_32.c
+++ b/arch/x86/mm/iomap_32.c
@@ -21,7 +21,7 @@
#include <linux/module.h>
#include <linux/highmem.h>

-int is_io_mapping_possible(resource_size_t base, unsigned long size)
+static int is_io_mapping_possible(resource_size_t base, unsigned long size)
{
#if !defined(CONFIG_X86_PAE) && defined(CONFIG_PHYS_ADDR_T_64BIT)
/* There is no way to map greater than 1 << 32 address without PAE */
@@ -30,7 +30,30 @@ int is_io_mapping_possible(resource_size_t base, unsigned long size)
#endif
return 1;
}
-EXPORT_SYMBOL_GPL(is_io_mapping_possible);
+
+int iomap_create_wc(resource_size_t base, unsigned long size, pgprot_t *prot)
+{
+ unsigned long flag = _PAGE_CACHE_WC;
+ int ret;
+
+ if (!is_io_mapping_possible(base, size))
+ return -EINVAL;
+
+ ret = io_reserve_memtype(base, base + size, &flag);
+ if (ret)
+ return ret;
+
+ *prot = __pgprot(__PAGE_KERNEL | flag);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(iomap_create_wc);
+
+void
+iomap_free(resource_size_t base, unsigned long size)
+{
+ io_free_memtype(base, base + size);
+}
+EXPORT_SYMBOL_GPL(iomap_free);

void *kmap_atomic_prot_pfn(unsigned long pfn, enum km_type type, pgprot_t prot)
{
diff --git a/include/linux/io-mapping.h b/include/linux/io-mapping.h
index 0adb0f9..97eb928 100644
--- a/include/linux/io-mapping.h
+++ b/include/linux/io-mapping.h
@@ -49,23 +49,30 @@ static inline struct io_mapping *
io_mapping_create_wc(resource_size_t base, unsigned long size)
{
struct io_mapping *iomap;
-
- if (!is_io_mapping_possible(base, size))
- return NULL;
+ pgprot_t prot;

iomap = kmalloc(sizeof(*iomap), GFP_KERNEL);
if (!iomap)
- return NULL;
+ goto out_err;
+
+ if (iomap_create_wc(base, size, &prot))
+ goto out_free;

iomap->base = base;
iomap->size = size;
- iomap->prot = pgprot_writecombine(__pgprot(__PAGE_KERNEL));
+ iomap->prot = prot;
return iomap;
+
+out_free:
+ kfree(iomap);
+out_err:
+ return NULL;
}

static inline void
io_mapping_free(struct io_mapping *mapping)
{
+ iomap_free(mapping->base, mapping->size);
kfree(mapping);
}

--
1.6.0.6

2009-07-10 17:02:02

by Pallipadi, Venkatesh

Subject: [PATCH 09/10] x86, PAT: lookup the protection from memtype list on vm_insert_pfn()

Look up the reserved memtype during vm_insert_pfn and use that memtype
for the new mapping. This takes care of handling the vm_insert_pfn()
interface in track_pfn_vma*/untrack_pfn_vma.
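
The protection update amounts to clearing the cache attribute bits and
OR-ing in the looked-up memtype. A minimal userspace sketch, assuming the
x86 encodings PWT = bit 3 and PCD = bit 4; prot_with_memtype() is a
hypothetical name and plain unsigned long stands in for pgprot_t:

```c
#include <assert.h>

/* Assumed encodings of the cache attribute bits in a page table entry. */
#define PAGE_CACHE_MASK     0x18UL
#define PAGE_CACHE_WB       0x00UL
#define PAGE_CACHE_WC       0x08UL
#define PAGE_CACHE_UC_MINUS 0x10UL

/* Replace the cache attribute bits of prot with memtype, keep the rest. */
unsigned long prot_with_memtype(unsigned long prot, unsigned long memtype)
{
	/* memtype must only carry cache attribute bits */
	assert((memtype & ~PAGE_CACHE_MASK) == 0);
	return (prot & ~PAGE_CACHE_MASK) | memtype;
}
```

For example, a prot value of 0x37 (UC- plus other bits) combined with a
looked-up type of WC yields 0x2f: the non-cache bits 0x27 are preserved
and only bits 3 and 4 change.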

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/mm/pat.c | 24 +++++++++---------------
1 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 71aa6f7..b629f75 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -848,11 +848,6 @@ int track_pfn_vma_copy(struct vm_area_struct *vma)
unsigned long vma_size = vma->vm_end - vma->vm_start;
pgprot_t pgprot;

- /*
- * For now, only handle remap_pfn_range() vmas where
- * is_linear_pfn_mapping() == TRUE. Handling of
- * vm_insert_pfn() is TBD.
- */
if (is_linear_pfn_mapping(vma)) {
/*
* reserve the whole chunk covered by vma. We need the
@@ -880,20 +875,24 @@ int track_pfn_vma_copy(struct vm_area_struct *vma)
int track_pfn_vma_new(struct vm_area_struct *vma, pgprot_t *prot,
unsigned long pfn, unsigned long size)
{
+ unsigned long flags;
resource_size_t paddr;
unsigned long vma_size = vma->vm_end - vma->vm_start;

- /*
- * For now, only handle remap_pfn_range() vmas where
- * is_linear_pfn_mapping() == TRUE. Handling of
- * vm_insert_pfn() is TBD.
- */
if (is_linear_pfn_mapping(vma)) {
/* reserve the whole chunk starting from vm_pgoff */
paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
return reserve_pfn_range(paddr, vma_size, prot, 0);
}

+ if (!pat_enabled)
+ return 0;
+
+ /* for vm_insert_pfn and friends, we set prot based on lookup */
+ flags = lookup_memtype(pfn << PAGE_SHIFT);
+ *prot = __pgprot((pgprot_val(vma->vm_page_prot) & (~_PAGE_CACHE_MASK)) |
+ flags);
+
return 0;
}

@@ -908,11 +907,6 @@ void untrack_pfn_vma(struct vm_area_struct *vma, unsigned long pfn,
resource_size_t paddr;
unsigned long vma_size = vma->vm_end - vma->vm_start;

- /*
- * For now, only handle remap_pfn_range() vmas where
- * is_linear_pfn_mapping() == TRUE. Handling of
- * vm_insert_pfn() is TBD.
- */
if (is_linear_pfn_mapping(vma)) {
/* free the whole chunk starting from vm_pgoff */
paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
--
1.6.0.6

2009-07-10 17:01:36

by Pallipadi, Venkatesh

Subject: [PATCH 08/10] x86, PAT: Add lookup_memtype to get the current memtype of a paddr

Add a new routine lookup_memtype() to get the current memtype based on
the PAT reserves and frees.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/mm/pat.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 45 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 1a9d0f0..71aa6f7 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -574,6 +574,51 @@ unlock_ret:


/**
+ * lookup_memtype - Looksup the memory type for a physical address
+ * @paddr: physical address of which memory type needs to be looked up
+ *
+ * Only to be called when PAT is enabled
+ *
+ * Returns _PAGE_CACHE_WB, _PAGE_CACHE_WC, _PAGE_CACHE_UC_MINUS or
+ * _PAGE_CACHE_UC
+ */
+static unsigned long lookup_memtype(u64 paddr)
+{
+ int rettype = _PAGE_CACHE_WB;
+ struct memtype *entry;
+
+ if (is_ISA_range(paddr, paddr + PAGE_SIZE - 1))
+ return rettype;
+
+ if (pat_pagerange_is_ram(paddr, paddr + PAGE_SIZE)) {
+ struct page *page;
+ spin_lock(&memtype_lock);
+ page = pfn_to_page(paddr >> PAGE_SHIFT);
+ rettype = get_page_memtype(page);
+ spin_unlock(&memtype_lock);
+ /*
+ * -1 from get_page_memtype() implies RAM page is in its
+ * default state and not reserved, and hence of type WB
+ */
+ if (rettype == -1)
+ rettype = _PAGE_CACHE_WB;
+
+ return rettype;
+ }
+
+ spin_lock(&memtype_lock);
+
+ entry = memtype_rb_search(&memtype_rbroot, paddr);
+ if (entry != NULL)
+ rettype = entry->type;
+ else
+ rettype = _PAGE_CACHE_UC_MINUS;
+
+ spin_unlock(&memtype_lock);
+ return rettype;
+}
+
+/**
* io_reserve_memtype - Request a memory type mapping for a region of memory
* @start: start (physical address) of the region
* @end: end (physical address) of the region
--
1.6.0.6

2009-07-10 17:01:08

by Pallipadi, Venkatesh

Subject: [PATCH 06/10] x86, PAT: Generalize the use of page flag PG_uncached

Only IA64 was using PG_uncached as of now. We now intend to use this bit
in x86 as well, to keep track of memory type of those addresses that
have page struct for them. So, generalize the use of that bit across
ia64 and x86.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/ia64/Kconfig | 4 ++++
arch/x86/Kconfig | 4 ++++
include/linux/page-flags.h | 4 ++--
3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 170042b..e624611 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -112,6 +112,10 @@ config IA64_UNCACHED_ALLOCATOR
bool
select GENERIC_ALLOCATOR

+config ARCH_USES_PG_UNCACHED
+ def_bool y
+ depends on IA64_UNCACHED_ALLOCATOR
+
config AUDIT_ARCH
bool
default y
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9e90fc5..031db02 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1420,6 +1420,10 @@ config X86_PAT

If unsure, say Y.

+config ARCH_USES_PG_UNCACHED
+ def_bool y
+ depends on X86_PAT
+
config EFI
bool "EFI runtime service support"
depends on ACPI
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e2e5ce5..2b87acf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -99,7 +99,7 @@ enum pageflags {
#ifdef CONFIG_HAVE_MLOCKED_PAGE_BIT
PG_mlocked, /* Page is vma mlocked */
#endif
-#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+#ifdef CONFIG_ARCH_USES_PG_UNCACHED
PG_uncached, /* Page has been mapped as uncached */
#endif
__NR_PAGEFLAGS,
@@ -257,7 +257,7 @@ PAGEFLAG_FALSE(Mlocked)
SETPAGEFLAG_NOOP(Mlocked) TESTCLEARFLAG_FALSE(Mlocked)
#endif

-#ifdef CONFIG_IA64_UNCACHED_ALLOCATOR
+#ifdef CONFIG_ARCH_USES_PG_UNCACHED
PAGEFLAG(Uncached, uncached)
#else
PAGEFLAG_FALSE(Uncached)
--
1.6.0.6

2009-07-10 17:01:26

by Pallipadi, Venkatesh

Subject: [PATCH 05/10] x86, PAT: Add rbtree to do quick lookup in memtype tracking

PAT memtype tracking uses a linear linked list to keep track of IO
(non-RAM) regions and their memtypes. The code used a last_accessed
pointer as a cache to speed up the lookup. As per discussions with
H. Peter Anvin a while back, having an rbtree here will avoid bad
performance in pathological cases where we may end up with a huge
linked list. This may not add any noticeable performance speedup
in the normal case, as the number of entries in the PAT memtype list
tends to be in the ~20-30 range. The patch removes the "cached_entry"
logic, as with the rbtree we have a more generic way of speeding up the
lookup.

With this patch, we use rbtree to do the quick lookup. We still use
linked list as the memtype range tracked can be of different sizes
and can overlap in different ways. We also keep track of usage counts
with linked list.

Example:
Multiple ioremaps with different sizes
uncached-minus @ 0xfffff00000-0xfffff04000
uncached-minus @ 0xfffff02000-0xfffff03000

And one userlevel mmap and the thread forks a new process
uncached-minus @ 0xbf453000-0xbf454000
uncached-minus @ 0xbf453000-0xbf454000
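
The lookup used here is a "floor" search: find the entry whose start is
equal to, or the closest below, the searched address, and continue the
list walk from there. A minimal userspace sketch of that search follows.
It uses a plain unbalanced binary search tree instead of the kernel
rbtree, and struct range_node, range_floor_search() and range_insert()
are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct range_node {
	unsigned long long start, end;
	struct range_node *left, *right;
};

/* Return the node with the largest start <= key, or NULL if there is none. */
struct range_node *range_floor_search(struct range_node *root,
				      unsigned long long key)
{
	struct range_node *last_lower = NULL;

	while (root) {
		if (root->start < key) {
			last_lower = root;	/* candidate, look for a closer one */
			root = root->right;
		} else if (root->start > key) {
			root = root->left;
		} else {
			return root;		/* exact match on start */
		}
	}
	return last_lower;
}

/* Insert n keyed on start; equal starts go to the left subtree. */
void range_insert(struct range_node **root, struct range_node *n)
{
	struct range_node **link = root;

	assert(n);
	n->left = n->right = NULL;
	while (*link)
		link = (n->start <= (*link)->start) ? &(*link)->left
						    : &(*link)->right;
	*link = n;
}
```

Because several entries can share the same start, as in the fork example
above, the node returned for an exact match is not necessarily the first
such entry in the list, so a caller still has to examine neighbouring
list entries.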

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/mm/pat.c | 106 +++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 86 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index 82d097c..c90f242 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -15,6 +15,7 @@
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/fs.h>
+#include <linux/rbtree.h>

#include <asm/cacheflush.h>
#include <asm/processor.h>
@@ -148,11 +149,10 @@ static char *cattr_name(unsigned long flags)
* areas). All the aliases have the same cache attributes of course.
* Zero attributes are represented as holes.
*
- * Currently the data structure is a list because the number of mappings
- * are expected to be relatively small. If this should be a problem
- * it could be changed to a rbtree or similar.
+ * The data structure is a list that is also organized as an rbtree
+ * sorted on the start address of memtype range.
*
- * memtype_lock protects the whole list.
+ * memtype_lock protects both the linear list and rbtree.
*/

struct memtype {
@@ -160,11 +160,53 @@ struct memtype {
u64 end;
unsigned long type;
struct list_head nd;
+ struct rb_node rb;
};

+static struct rb_root memtype_rbroot = RB_ROOT;
static LIST_HEAD(memtype_list);
static DEFINE_SPINLOCK(memtype_lock); /* protects memtype list */

+static struct memtype *memtype_rb_search(struct rb_root *root, u64 start)
+{
+ struct rb_node *node = root->rb_node;
+ struct memtype *last_lower = NULL;
+
+ while (node) {
+ struct memtype *data = container_of(node, struct memtype, rb);
+
+ if (data->start < start) {
+ last_lower = data;
+ node = node->rb_right;
+ } else if (data->start > start) {
+ node = node->rb_left;
+ } else
+ return data;
+ }
+
+ /* Will return NULL if there is no entry with its start <= start */
+ return last_lower;
+}
+
+static void memtype_rb_insert(struct rb_root *root, struct memtype *data)
+{
+ struct rb_node **new = &(root->rb_node);
+ struct rb_node *parent = NULL;
+
+ while (*new) {
+ struct memtype *this = container_of(*new, struct memtype, rb);
+
+ parent = *new;
+ if (data->start <= this->start)
+ new = &((*new)->rb_left);
+ else if (data->start > this->start)
+ new = &((*new)->rb_right);
+ }
+
+ rb_link_node(&data->rb, parent, new);
+ rb_insert_color(&data->rb, root);
+}
+
/*
* Does intersection of PAT memory type and MTRR memory type and returns
* the resulting memory type as PAT understands it.
@@ -218,9 +260,6 @@ chk_conflict(struct memtype *new, struct memtype *entry, unsigned long *type)
return -EBUSY;
}

-static struct memtype *cached_entry;
-static u64 cached_start;
-
static int pat_pagerange_is_ram(unsigned long start, unsigned long end)
{
int ram_page = 0, not_rampage = 0;
@@ -382,17 +421,19 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,

spin_lock(&memtype_lock);

- if (cached_entry && start >= cached_start)
- entry = cached_entry;
- else
+ entry = memtype_rb_search(&memtype_rbroot, new->start);
+ if (likely(entry != NULL)) {
+ /* To work correctly with list_for_each_entry_continue */
+ entry = list_entry(entry->nd.prev, struct memtype, nd);
+ } else {
entry = list_entry(&memtype_list, struct memtype, nd);
+ }

/* Search for existing mapping that overlaps the current range */
where = NULL;
list_for_each_entry_continue(entry, &memtype_list, nd) {
if (end <= entry->start) {
where = entry->nd.prev;
- cached_entry = list_entry(where, struct memtype, nd);
break;
} else if (start <= entry->start) { /* end > entry->start */
err = chk_conflict(new, entry, new_type);
@@ -400,8 +441,6 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,
dprintk("Overlap at 0x%Lx-0x%Lx\n",
entry->start, entry->end);
where = entry->nd.prev;
- cached_entry = list_entry(where,
- struct memtype, nd);
}
break;
} else if (start < entry->end) { /* start > entry->start */
@@ -409,8 +448,6 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,
if (!err) {
dprintk("Overlap at 0x%Lx-0x%Lx\n",
entry->start, entry->end);
- cached_entry = list_entry(entry->nd.prev,
- struct memtype, nd);

/*
* Move to right position in the linked
@@ -438,13 +475,13 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,
return err;
}

- cached_start = start;
-
if (where)
list_add(&new->nd, where);
else
list_add_tail(&new->nd, &memtype_list);

+ memtype_rb_insert(&memtype_rbroot, new);
+
spin_unlock(&memtype_lock);

dprintk("reserve_memtype added 0x%Lx-0x%Lx, track %s, req %s, ret %s\n",
@@ -456,7 +493,7 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,

int free_memtype(u64 start, u64 end)
{
- struct memtype *entry;
+ struct memtype *entry, *saved_entry;
int err = -EINVAL;
int is_range_ram;

@@ -474,17 +511,46 @@ int free_memtype(u64 start, u64 end)
return -EINVAL;

spin_lock(&memtype_lock);
+
+ entry = memtype_rb_search(&memtype_rbroot, start);
+ if (unlikely(entry == NULL))
+ goto unlock_ret;
+
+ /*
+ * Saved entry points to an entry with start same or less than what
+ * we searched for. Now go through the list in both directions to look
+ * for the entry that matches with both start and end, with list stored
+ * in sorted start address
+ */
+ saved_entry = entry;
list_for_each_entry(entry, &memtype_list, nd) {
if (entry->start == start && entry->end == end) {
- if (cached_entry == entry || cached_start == start)
- cached_entry = NULL;
+ rb_erase(&entry->rb, &memtype_rbroot);
+ list_del(&entry->nd);
+ kfree(entry);
+ err = 0;
+ break;
+ } else if (entry->start > start) {
+ break;
+ }
+ }
+
+ if (!err)
+ goto unlock_ret;

+ entry = saved_entry;
+ list_for_each_entry_reverse(entry, &memtype_list, nd) {
+ if (entry->start == start && entry->end == end) {
+ rb_erase(&entry->rb, &memtype_rbroot);
list_del(&entry->nd);
kfree(entry);
err = 0;
break;
+ } else if (entry->start < start) {
+ break;
}
}
+unlock_ret:
spin_unlock(&memtype_lock);

if (err) {
--
1.6.0.6

2009-07-10 17:01:48

by Pallipadi, Venkatesh

Subject: [PATCH 07/10] x86, PAT: Use page flags to track memtypes of RAM pages

Change reserve_ram_pages_type and free_ram_pages_type to use 2 page
flags to track UC_MINUS, WC, WB and default types. Previous RAM tracking
only distinguished WB from NonWB, so it could not track RAM memtypes
fully, and there was no way to get the actual reserved type by looking
at the page flags.

We use the memtype_lock spinlock for atomicity in dealing with
memtype tracking in struct page.
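
The two-flag encoding can be modelled in userspace as below. This is a
sketch of the mapping implemented by get_page_memtype() and
set_page_memtype() in this patch, with a hypothetical struct fake_page
and FLAG_* bits standing in for struct page, PG_uncached and PG_arch_1.
The PAGE_CACHE_* values are assumed x86 encodings.

```c
#include <assert.h>

#define PAGE_CACHE_WB       0x00UL
#define PAGE_CACHE_WC       0x08UL
#define PAGE_CACHE_UC_MINUS 0x10UL
#define MEMTYPE_DEFAULT     (~0UL)	/* stands in for the -1 "not reserved" */

#define FLAG_UNCACHED 0x1u	/* models PG_uncached */
#define FLAG_WC       0x2u	/* models PG_arch_1 used as PG_WC */

struct fake_page { unsigned int flags; };

/* Decode the two flags: 00 default, 01 WC, 10 UC-, 11 WB. */
unsigned long fake_get_page_memtype(const struct fake_page *pg)
{
	int uc, wc;

	assert(pg);
	uc = !!(pg->flags & FLAG_UNCACHED);
	wc = !!(pg->flags & FLAG_WC);
	if (!uc && !wc)
		return MEMTYPE_DEFAULT;
	if (!uc && wc)
		return PAGE_CACHE_WC;
	if (uc && !wc)
		return PAGE_CACHE_UC_MINUS;
	return PAGE_CACHE_WB;
}

/* Encode memtype into the two flags; anything else resets to default. */
void fake_set_page_memtype(struct fake_page *pg, unsigned long memtype)
{
	assert(pg);
	pg->flags &= ~(FLAG_UNCACHED | FLAG_WC);
	switch (memtype) {
	case PAGE_CACHE_WC:
		pg->flags |= FLAG_WC;
		break;
	case PAGE_CACHE_UC_MINUS:
		pg->flags |= FLAG_UNCACHED;
		break;
	case PAGE_CACHE_WB:
		pg->flags |= FLAG_UNCACHED | FLAG_WC;
		break;
	default:
		break;	/* both flags stay clear: default state */
	}
}
```

Note that an explicitly reserved WB page sets both flags, which keeps it
distinguishable from a page that was never reserved.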

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/include/asm/cacheflush.h | 54 +++++++++++++++++++++-
arch/x86/mm/pat.c | 91 ++++++++++++++++++++-----------------
2 files changed, 102 insertions(+), 43 deletions(-)

diff --git a/arch/x86/include/asm/cacheflush.h b/arch/x86/include/asm/cacheflush.h
index e55dfc1..b54f6af 100644
--- a/arch/x86/include/asm/cacheflush.h
+++ b/arch/x86/include/asm/cacheflush.h
@@ -43,8 +43,58 @@ static inline void copy_from_user_page(struct vm_area_struct *vma,
memcpy(dst, src, len);
}

-#define PG_non_WB PG_arch_1
-PAGEFLAG(NonWB, non_WB)
+#define PG_WC PG_arch_1
+PAGEFLAG(WC, WC)
+
+#ifdef CONFIG_X86_PAT
+/*
+ * X86 PAT uses page flags WC and Uncached together to keep track of
+ * memory type of pages that have backing page struct. X86 PAT supports 3
+ * different memory types, _PAGE_CACHE_WB, _PAGE_CACHE_WC and
+ * _PAGE_CACHE_UC_MINUS and fourth state where page's memory type has not
+ * been changed from its default (value of -1 used to denote this).
+ * Note we do not support _PAGE_CACHE_UC here.
+ *
+ * Caller must hold memtype_lock for atomicity.
+ */
+static inline unsigned long get_page_memtype(struct page *pg)
+{
+ if (!PageUncached(pg) && !PageWC(pg))
+ return -1;
+ else if (!PageUncached(pg) && PageWC(pg))
+ return _PAGE_CACHE_WC;
+ else if (PageUncached(pg) && !PageWC(pg))
+ return _PAGE_CACHE_UC_MINUS;
+ else
+ return _PAGE_CACHE_WB;
+}
+
+static inline void set_page_memtype(struct page *pg, unsigned long memtype)
+{
+ switch (memtype) {
+ case _PAGE_CACHE_WC:
+ ClearPageUncached(pg);
+ SetPageWC(pg);
+ break;
+ case _PAGE_CACHE_UC_MINUS:
+ SetPageUncached(pg);
+ ClearPageWC(pg);
+ break;
+ case _PAGE_CACHE_WB:
+ SetPageUncached(pg);
+ SetPageWC(pg);
+ break;
+ default:
+ case -1:
+ ClearPageUncached(pg);
+ ClearPageWC(pg);
+ break;
+ }
+}
+#else
+static inline unsigned long get_page_memtype(struct page *pg) { return -1; }
+static inline void set_page_memtype(struct page *pg, unsigned long memtype) { }
+#endif

/*
* The set_memory_* API can be used to change various attributes of a virtual
diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index c90f242..1a9d0f0 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -288,63 +288,61 @@ static int pat_pagerange_is_ram(unsigned long start, unsigned long end)
}

/*
- * For RAM pages, mark the pages as non WB memory type using
- * PageNonWB (PG_arch_1). We allow only one set_memory_uc() or
- * set_memory_wc() on a RAM page at a time before marking it as WB again.
- * This is ok, because only one driver will be owning the page and
- * doing set_memory_*() calls.
+ * For RAM pages, we use page flags to mark the pages with appropriate type.
+ * Here we do two pass:
+ * - Find the memtype of all the pages in the range, look for any conflicts
+ * - In case of no conflicts, set the new memtype for pages in the range
*
- * For now, we use PageNonWB to track that the RAM page is being mapped
- * as non WB. In future, we will have to use one more flag
- * (or some other mechanism in page_struct) to distinguish between
- * UC and WC mapping.
+ * Caller must hold memtype_lock for atomicity.
*/
static int reserve_ram_pages_type(u64 start, u64 end, unsigned long req_type,
unsigned long *new_type)
{
struct page *page;
- u64 pfn, end_pfn;
+ u64 pfn;
+
+ if (req_type == _PAGE_CACHE_UC) {
+ /* We do not support strong UC */
+ WARN_ON_ONCE(1);
+ req_type = _PAGE_CACHE_UC_MINUS;
+ }

for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
- page = pfn_to_page(pfn);
- if (page_mapped(page) || PageNonWB(page))
- goto out;
+ unsigned long type;

- SetPageNonWB(page);
+ page = pfn_to_page(pfn);
+ type = get_page_memtype(page);
+ if (type != -1) {
+ printk(KERN_INFO "reserve_ram_pages_type failed "
+ "0x%Lx-0x%Lx, track 0x%lx, req 0x%lx\n",
+ start, end, type, req_type);
+ if (new_type)
+ *new_type = type;
+
+ return -EBUSY;
+ }
}
- return 0;

-out:
- end_pfn = pfn;
- for (pfn = (start >> PAGE_SHIFT); pfn < end_pfn; ++pfn) {
+ if (new_type)
+ *new_type = req_type;
+
+ for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
page = pfn_to_page(pfn);
- ClearPageNonWB(page);
+ set_page_memtype(page, req_type);
}
-
- return -EINVAL;
+ return 0;
}

static int free_ram_pages_type(u64 start, u64 end)
{
struct page *page;
- u64 pfn, end_pfn;
+ u64 pfn;

for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
page = pfn_to_page(pfn);
- if (page_mapped(page) || !PageNonWB(page))
- goto out;
-
- ClearPageNonWB(page);
+ set_page_memtype(page, -1);
}
return 0;
-
-out:
- end_pfn = pfn;
- for (pfn = (start >> PAGE_SHIFT); pfn < end_pfn; ++pfn) {
- page = pfn_to_page(pfn);
- SetPageNonWB(page);
- }
- return -EINVAL;
}

/*
@@ -405,11 +403,16 @@ int reserve_memtype(u64 start, u64 end, unsigned long req_type,
*new_type = actual_type;

is_range_ram = pat_pagerange_is_ram(start, end);
- if (is_range_ram == 1)
- return reserve_ram_pages_type(start, end, req_type,
- new_type);
- else if (is_range_ram < 0)
+ if (is_range_ram == 1) {
+
+ spin_lock(&memtype_lock);
+ err = reserve_ram_pages_type(start, end, req_type, new_type);
+ spin_unlock(&memtype_lock);
+
+ return err;
+ } else if (is_range_ram < 0) {
return -EINVAL;
+ }

new = kmalloc(sizeof(struct memtype), GFP_KERNEL);
if (!new)
@@ -505,10 +508,16 @@ int free_memtype(u64 start, u64 end)
return 0;

is_range_ram = pat_pagerange_is_ram(start, end);
- if (is_range_ram == 1)
- return free_ram_pages_type(start, end);
- else if (is_range_ram < 0)
+ if (is_range_ram == 1) {
+
+ spin_lock(&memtype_lock);
+ err = free_ram_pages_type(start, end);
+ spin_unlock(&memtype_lock);
+
+ return err;
+ } else if (is_range_ram < 0) {
return -EINVAL;
+ }

spin_lock(&memtype_lock);

--
1.6.0.6

2009-07-10 17:02:21

by Pallipadi, Venkatesh

Subject: [PATCH 10/10] x86, PAT: Sanity check remap_pfn_range for RAM region

Add a sanity check for remap_pfn_range of RAM regions using
lookup_memtype(). Previously, we did not have any way to get the type of
RAM memory regions as they were tracked using a single bit in
page_struct (WB, nonWB). Now we can get the actual type from page struct
(WB, WC, UC_MINUS) and make sure the requester gets that type.

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
arch/x86/mm/pat.c | 24 +++++++++++++++++++++---
1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index b629f75..a6cace0 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -783,11 +783,29 @@ static int reserve_pfn_range(u64 paddr, unsigned long size, pgprot_t *vma_prot,
is_ram = pat_pagerange_is_ram(paddr, paddr + size);

/*
- * reserve_pfn_range() doesn't support RAM pages. Maintain the current
- * behavior with RAM pages by returning success.
+ * reserve_pfn_range() for RAM pages. We do not refcount to keep
+ * track of number of mappings of RAM pages. We can assert that
+ * the type requested matches the type of first page in the range.
*/
- if (is_ram != 0)
+ if (is_ram) {
+ if (!pat_enabled)
+ return 0;
+
+ flags = lookup_memtype(paddr);
+ if (want_flags != flags) {
+ printk(KERN_WARNING
+ "%s:%d map pfn RAM range req %s for %Lx-%Lx, got %s\n",
+ current->comm, current->pid,
+ cattr_name(want_flags),
+ (unsigned long long)paddr,
+ (unsigned long long)(paddr + size),
+ cattr_name(flags));
+ *vma_prot = __pgprot((pgprot_val(*vma_prot) &
+ (~_PAGE_CACHE_MASK)) |
+ flags);
+ }
return 0;
+ }

ret = reserve_memtype(paddr, paddr + size, want_flags, &flags);
if (ret)
--
1.6.0.6