2007-12-14 00:04:36

by Pallipadi, Venkatesh

[permalink] [raw]
Subject: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

Originally based on a patch from Eric Biederman, but heavily changed.

Forward port of pat-base.patch to x86 tree, with a bug fix.
Code was using 'PCD|PWT' i.e., PAT3 for WC mapping. So set the WC mapping at
correct PAT fields PA3/PA7.

TBD: KEXEC and other CPU offline paths may need pat_shutdown()?

Signed-off-by: Venkatesh Pallipadi <[email protected]>
Signed-off-by: Suresh Siddha <[email protected]>
---
Index: linux-2.6/arch/x86/kernel/setup64.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/setup64.c 2007-12-11 03:30:46.000000000 -0800
+++ linux-2.6/arch/x86/kernel/setup64.c 2007-12-11 03:42:08.000000000 -0800
@@ -291,9 +291,11 @@

fpu_init();

+ pat_init();
raw_local_save_flags(kernel_eflags);
}

void cpu_shutdown(void)
{
+ pat_shutdown();
}
Index: linux-2.6/arch/x86/mm/Makefile_64
===================================================================
--- linux-2.6.orig/arch/x86/mm/Makefile_64 2007-12-11 03:30:34.000000000 -0800
+++ linux-2.6/arch/x86/mm/Makefile_64 2007-12-11 03:42:08.000000000 -0800
@@ -2,7 +2,7 @@
# Makefile for the linux x86_64-specific parts of the memory manager.
#

-obj-y := init_64.o fault_64.o ioremap_64.o extable_64.o pageattr_64.o mmap_64.o
+obj-y := init_64.o fault_64.o ioremap_64.o extable_64.o pageattr_64.o mmap_64.o pat.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_NUMA) += numa_64.o
obj-$(CONFIG_K8_NUMA) += k8topology_64.o
Index: linux-2.6/arch/x86/mm/pat.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/arch/x86/mm/pat.c 2007-12-11 04:12:47.000000000 -0800
@@ -0,0 +1,57 @@
+/* Handle caching attributes in page tables (PAT) */
+#include <linux/mm.h>
+#include <linux/kernel.h>
+#include <linux/rbtree.h>
+#include <linux/gfp.h>
+#include <asm/msr.h>
+#include <asm/tlbflush.h>
+#include <asm/processor.h>
+
+static u64 boot_pat_state;
+
+enum {
+ PAT_UC = 0, /* uncached */
+ PAT_WC = 1, /* Write combining */
+ PAT_WT = 4, /* Write Through */
+ PAT_WP = 5, /* Write Protected */
+ PAT_WB = 6, /* Write Back (default) */
+ PAT_UC_MINUS = 7, /* UC, but can be overriden by MTRR */
+};
+
+#define PAT(x,y) ((u64)PAT_ ## y << ((x)*8))
+
+void __cpuinit pat_init(void)
+{
+ /* Set PWT+PCD to Write-Combining. All other bits stay the same */
+ if (cpu_has_pat) {
+ u64 pat;
+ /* PTE encoding used in Linux:
+ PAT
+ |PCD
+ ||PWT
+ |||
+ 000 WB default
+ 010 UC_MINUS _PAGE_PCD
+ 011 WC _PAGE_WC
+ PAT bit unused */
+ pat = PAT(0,WB) | PAT(1,WT) | PAT(2,UC_MINUS) | PAT(3,WC) |
+ PAT(4,WB) | PAT(5,WT) | PAT(6,UC_MINUS) | PAT(7,WC);
+ rdmsrl(MSR_IA32_CR_PAT, boot_pat_state);
+ wrmsrl(MSR_IA32_CR_PAT, pat);
+ __flush_tlb_all();
+ asm volatile("wbinvd");
+ }
+}
+
+#undef PAT
+
+void pat_shutdown(void)
+{
+ /* Restore CPU default pat state */
+ if (cpu_has_pat) {
+ wrmsrl(MSR_IA32_CR_PAT, boot_pat_state);
+ __flush_tlb_all();
+ asm volatile("wbinvd");
+ }
+}
+
Index: linux-2.6/arch/x86/pci/i386.c
===================================================================
--- linux-2.6.orig/arch/x86/pci/i386.c 2007-12-11 03:30:34.000000000 -0800
+++ linux-2.6/arch/x86/pci/i386.c 2007-12-11 03:42:08.000000000 -0800
@@ -300,8 +300,6 @@
int pci_mmap_page_range(struct pci_dev *dev, struct vm_area_struct *vma,
enum pci_mmap_state mmap_state, int write_combine)
{
- unsigned long prot;
-
/* I/O space cannot be accessed via normal processor loads and
* stores on this platform.
*/
@@ -311,14 +309,11 @@
/* Leave vm_pgoff as-is, the PCI space address is the physical
* address on this platform.
*/
- prot = pgprot_val(vma->vm_page_prot);
- if (boot_cpu_data.x86 > 3)
- prot |= _PAGE_PCD | _PAGE_PWT;
- vma->vm_page_prot = __pgprot(prot);
+ if (write_combine)
+ vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
+ else
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

- /* Write-combine setting is ignored, it is changed via the mtrr
- * interfaces on this platform.
- */
if (io_remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
vma->vm_end - vma->vm_start,
vma->vm_page_prot))
Index: linux-2.6/include/asm-x86/cpufeature_32.h
===================================================================
--- linux-2.6.orig/include/asm-x86/cpufeature_32.h 2007-12-11 03:30:34.000000000 -0800
+++ linux-2.6/include/asm-x86/cpufeature_32.h 2007-12-11 03:42:08.000000000 -0800
@@ -166,6 +166,8 @@
#define cpu_has_clflush boot_cpu_has(X86_FEATURE_CLFLSH)
#define cpu_has_bts boot_cpu_has(X86_FEATURE_BTS)

+#define cpu_has_pat boot_cpu_has(X86_FEATURE_PAT)
+
#endif /* __ASM_I386_CPUFEATURE_H */

/*
Index: linux-2.6/include/asm-x86/msr-index.h
===================================================================
--- linux-2.6.orig/include/asm-x86/msr-index.h 2007-12-11 03:30:34.000000000 -0800
+++ linux-2.6/include/asm-x86/msr-index.h 2007-12-11 03:42:08.000000000 -0800
@@ -63,6 +63,7 @@
#define MSR_IA32_LASTINTFROMIP 0x000001dd
#define MSR_IA32_LASTINTTOIP 0x000001de

+#define MSR_IA32_CR_PAT 0x00000277
#define MSR_IA32_MC0_CTL 0x00000400
#define MSR_IA32_MC0_STATUS 0x00000401
#define MSR_IA32_MC0_ADDR 0x00000402
Index: linux-2.6/include/asm-x86/pgtable_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/pgtable_64.h 2007-12-11 03:30:34.000000000 -0800
+++ linux-2.6/include/asm-x86/pgtable_64.h 2007-12-11 03:42:08.000000000 -0800
@@ -164,6 +164,12 @@
#define _PAGE_FILE 0x040 /* nonlinear file mapping, saved PTE; unset:swap */
#define _PAGE_GLOBAL 0x100 /* Global TLB entry */

+/* We redefine PWT|PCD to be write combining. PAT bit is not used */
+
+#define _PAGE_WC (_PAGE_PWT|_PAGE_PCD)
+
+#define _PAGE_CACHE_MASK (_PAGE_PWT|_PAGE_PCD)
+
#define _PAGE_PROTNONE 0x080 /* If not present */
#define _PAGE_NX (_AC(1,UL)<<_PAGE_BIT_NX)

@@ -203,6 +209,7 @@
#define PAGE_KERNEL_EXEC MAKE_GLOBAL(__PAGE_KERNEL_EXEC)
#define PAGE_KERNEL_RO MAKE_GLOBAL(__PAGE_KERNEL_RO)
#define PAGE_KERNEL_NOCACHE MAKE_GLOBAL(__PAGE_KERNEL_NOCACHE)
+#define PAGE_KERNEL_WC MAKE_GLOBAL(__PAGE_KERNEL_WC)
#define PAGE_KERNEL_VSYSCALL32 __pgprot(__PAGE_KERNEL_VSYSCALL)
#define PAGE_KERNEL_VSYSCALL MAKE_GLOBAL(__PAGE_KERNEL_VSYSCALL)
#define PAGE_KERNEL_LARGE MAKE_GLOBAL(__PAGE_KERNEL_LARGE)
@@ -299,8 +306,24 @@

/*
* Macro to mark a page protection value as "uncacheable".
+ * Accesses through a uncached translation bypasses the cache
+ * and do not allow for consecutive writes to be combined.
*/
-#define pgprot_noncached(prot) (__pgprot(pgprot_val(prot) | _PAGE_PCD | _PAGE_PWT))
+#define pgprot_noncached(prot) \
+ __pgprot((pgprot_val(prot) & ~_PAGE_CACHE_MASK) | _PAGE_PCD)
+
+/*
+ * Macro to make mark a page protection value as "write-combining".
+ * Accesses through a write-combining translation works bypasses the
+ * caches, but does allow for consecutive writes to be combined into
+ * single (but larger) write transactions.
+ * This is mostly useful for IO accesses, for memory it is often slower.
+ * It also implies uncached.
+ */
+#define pgprot_writecombine(prot) \
+ __pgprot((pgprot_val(prot) & ~_PAGE_CACHE_MASK) | _PAGE_WC)
+
+#define pgprot_nonstd(prot) (pgprot_val(prot) & _PAGE_CACHE_MASK)

static inline int pmd_large(pmd_t pte) {
return (pmd_val(pte) & __LARGE_PTE) == __LARGE_PTE;
@@ -414,6 +437,7 @@
#define pgtable_cache_init() do { } while (0)
#define check_pgt_cache() do { } while (0)

+/* AGP users use MTRRs for now. Need to add an ioctl to agpgart for WC */
#define PAGE_AGP PAGE_KERNEL_NOCACHE
#define HAVE_PAGE_AGP 1

Index: linux-2.6/include/asm-x86/processor_64.h
===================================================================
--- linux-2.6.orig/include/asm-x86/processor_64.h 2007-12-11 03:30:46.000000000 -0800
+++ linux-2.6/include/asm-x86/processor_64.h 2007-12-11 03:42:08.000000000 -0800
@@ -105,6 +105,8 @@
extern void init_scattered_cpuid_features(struct cpuinfo_x86 *c);
extern unsigned int init_intel_cacheinfo(struct cpuinfo_x86 *c);
extern unsigned short num_cache_leaves;
+extern void pat_init(void);
+extern void pat_shutdown(void);

/*
* Save the cr4 feature set we're using (ie

--


2007-12-14 00:49:10

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

> +void __cpuinit pat_init(void)
> +{
> + /* Set PWT+PCD to Write-Combining. All other bits stay the same */
> + if (cpu_has_pat) {

All the old CPUs (PPro etc.) with known PAT bugs need to clear this flag
now in their CPU init functions. It is fine to be aggressive there
because these old systems have lived so long without PAT they can do
so forever. So perhaps it's best to just white list it only for newer
CPUs on the Intel side at least.

Another problem is that there are some popular modules (ATI, Nvidia for once)
who reprogram the PAT registers on their own, likely different. Need some way to detect
that case I guess, otherwise lots of users will see strange malfunctions.
Maybe recheck after module load?

> + |||
> + 000 WB default
> + 010 UC_MINUS _PAGE_PCD
> + 011 WC _PAGE_WC
> + PAT bit unused */
> + pat = PAT(0,WB) | PAT(1,WT) | PAT(2,UC_MINUS) | PAT(3,WC) |
> + PAT(4,WB) | PAT(5,WT) | PAT(6,UC_MINUS) | PAT(7,WC);
> + rdmsrl(MSR_IA32_CR_PAT, boot_pat_state);
> + wrmsrl(MSR_IA32_CR_PAT, pat);
> + __flush_tlb_all();
> + asm volatile("wbinvd");

Have you double checked this is the full procedure from the manual? iirc there
were some steps missing.

-Andi

2007-12-14 03:52:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

[email protected] writes:

> Originally based on a patch from Eric Biederman, but heavily changed.
>
> Forward port of pat-base.patch to x86 tree, with a bug fix.
> Code was using 'PCD|PWT' i.e., PAT3 for WC mapping. So set the WC mapping at
> correct PAT fields PA3/PA7.

Well that wasn't from my original tested patch. Grr.

> TBD: KEXEC and other CPU offline paths may need pat_shutdown()?

> Index: linux-2.6/arch/x86/mm/Makefile_64
> ===================================================================
> --- linux-2.6.orig/arch/x86/mm/Makefile_64 2007-12-11 03:30:34.000000000 -0800
> +++ linux-2.6/arch/x86/mm/Makefile_64 2007-12-11 03:42:08.000000000 -0800
> @@ -2,7 +2,7 @@
> # Makefile for the linux x86_64-specific parts of the memory manager.
> #
>
> -obj-y := init_64.o fault_64.o ioremap_64.o extable_64.o pageattr_64.o mmap_64.o
> +obj-y := init_64.o fault_64.o ioremap_64.o extable_64.o pageattr_64.o mmap_64.o
> pat.o
> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
> obj-$(CONFIG_NUMA) += numa_64.o
> obj-$(CONFIG_K8_NUMA) += k8topology_64.o
> Index: linux-2.6/arch/x86/mm/pat.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6/arch/x86/mm/pat.c 2007-12-11 04:12:47.000000000 -0800
> @@ -0,0 +1,57 @@
> +/* Handle caching attributes in page tables (PAT) */
> +#include <linux/mm.h>
> +#include <linux/kernel.h>
> +#include <linux/rbtree.h>
> +#include <linux/gfp.h>
> +#include <asm/msr.h>
> +#include <asm/tlbflush.h>
> +#include <asm/processor.h>
> +
> +static u64 boot_pat_state;
> +
> +enum {
> + PAT_UC = 0, /* uncached */
> + PAT_WC = 1, /* Write combining */
> + PAT_WT = 4, /* Write Through */
> + PAT_WP = 5, /* Write Protected */
> + PAT_WB = 6, /* Write Back (default) */
> + PAT_UC_MINUS = 7, /* UC, but can be overriden by MTRR */
> +};
> +
> +#define PAT(x,y) ((u64)PAT_ ## y << ((x)*8))
> +
> +void __cpuinit pat_init(void)
> +{
> + /* Set PWT+PCD to Write-Combining. All other bits stay the same */
> + if (cpu_has_pat) {
> + u64 pat;
> + /* PTE encoding used in Linux:
> + PAT
> + |PCD
> + ||PWT
> + |||
> + 000 WB default
> + 010 UC_MINUS _PAGE_PCD
> + 011 WC _PAGE_WC
> + PAT bit unused */
> + pat = PAT(0,WB) | PAT(1,WT) | PAT(2,UC_MINUS) | PAT(3,WC) |
> + PAT(4,WB) | PAT(5,WT) | PAT(6,UC_MINUS) | PAT(7,WC);

I strongly object to this configuration.

The caching modes of interest are:
PAT_WB write-back or a close as the MTRRs will allow
used for WC today.
PAT_UC completely uncachable not overridable by MTRRs
and what we use today for pgprot_noncached
PAT_WC what isn't available for current use.

We should use:
> + pat = PAT(0,WB) | PAT(1,WT) | PAT(2,WC) | PAT(3,UC) |
> + PAT(4,WB) | PAT(5,WT) | PAT(6,WC) | PAT(7,UC);

Changing the UC- which currently allows write-combining if the MTRRs specify it,
to WC. This grandfathers in all of our current usage and changes the one
PAT type that could today and in legacy mode specify WC to really specify WC.

I don't know if we need to set the high half or not, that would depend
on the state of the PAT errata.

I do know we need to use the low 4 pat mappings to avoid most of the PAT
errata issues.

As for Andi's concern about modules playing games with the PAT mappings
if we don't redefine how we use the page table entries our exposure to
badly behaved modules more limited.

> + rdmsrl(MSR_IA32_CR_PAT, boot_pat_state);
> + wrmsrl(MSR_IA32_CR_PAT, pat);
> + __flush_tlb_all();
> + asm volatile("wbinvd");
> + }
> +}
> +
> +#undef PAT
> +
> +void pat_shutdown(void)
> +{
> + /* Restore CPU default pat state */
> + if (cpu_has_pat) {
> + wrmsrl(MSR_IA32_CR_PAT, boot_pat_state);
> + __flush_tlb_all();
> + asm volatile("wbinvd");
> + }
> +}
> +


Eric

2007-12-14 04:26:42

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

[email protected] (Eric W. Biederman) writes:


> We should use:
>> + pat = PAT(0,WB) | PAT(1,WT) | PAT(2,WC) | PAT(3,UC) |
>> + PAT(4,WB) | PAT(5,WT) | PAT(6,WC) | PAT(7,UC);
>
> Changing the UC- which currently allows write-combining if the MTRRs specify it,
> to WC. This grandfathers in all of our current usage and changes the one
> PAT type that could today and in legacy mode specify WC to really specify WC.
>
> I don't know if we need to set the high half or not, that would depend
> on the state of the PAT errata.
>
> I do know we need to use the low 4 pat mappings to avoid most of the PAT
> errata issues.
>
> As for Andi's concern about modules playing games with the PAT mappings
> if we don't redefine how we use the page table entries our exposure to
> badly behaved modules more limited.

Ok. My analysis here was wrong. Currently pgprot_noncached and
ioremap_nocache are out of sync. With ioremap_nocache only specifying
_PAGE_PCD and pgprot_noncached specifying _PAGE_PCD | _PAGE_PWT.

So I don't have a clue how someone could reprogram the mtrrs currently
and expect things to work.

...

If we bother to ask ioremap for memory that is not cached, the last
thing in the world we want is the MTRRs upgrading that to write combining.
So ioremap_nocache has been slightly buggy for ages. ioremap_nocache
and PAGE_KERNEL_NOCACHE should get _PAGE_PWT added to their
definitions.

Could we please get a cleanup patch at the beginning of this patchset
or that comes before it that fixes ioremap_nocache on x86?

That will make us a lot more git-bisect safe.


Eric








2007-12-14 10:25:45

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

> I do know we need to use the low 4 pat mappings to avoid most of the PAT
> errata issues.

They don't really matter. These are all very old systems who have run
fine for many years without PAT. It is no problem to let them
continue to do so and just disable PAT for them. So just clear pat bit in
CPU initialization for any CPUs with non trivial erratas in this
area.

PAT is only really needed on modern boxes.

Just someone needs to go through the old errata sheets and find
out on which CPUs it is needed to clear the bit.

> As for Andi's concern about modules playing games with the PAT mappings
> if we don't redefine how we use the page table entries our exposure to
> badly behaved modules more limited.

I would just recheck them after module load and if it happens
print a nasty message and program them back. e.g. kernel debuggers
need an after module notifier anyways, so it would be fine
to just add one and hook into that.

-Andi

2007-12-14 18:31:32

by Pallipadi, Venkatesh

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

On Fri, Dec 14, 2007 at 01:42:12AM +0100, Andi Kleen wrote:
> > +void __cpuinit pat_init(void)
> > +{
> > + /* Set PWT+PCD to Write-Combining. All other bits stay the same */
> > + if (cpu_has_pat) {
>
> All the old CPUs (PPro etc.) with known PAT bugs need to clear this flag
> now in their CPU init functions. It is fine to be aggressive there
> because these old systems have lived so long without PAT they can do
> so forever. So perhaps it's best to just white list it only for newer
> CPUs on the Intel side at least.

Yes. Enabling this only on relatively newer CPUs is safer. Will do that in next iteration of the patches.

> Another problem is that there are some popular modules (ATI, Nvidia for once)
> who reprogram the PAT registers on their own, likely different. Need some way to detect
> that case I guess, otherwise lots of users will see strange malfunctions.
> Maybe recheck after module load?

Yes. We can check that at load time. But they can still do bad things at runt ime, like say when 3D gets enabled etc??


> > + |||
> > + 000 WB default
> > + 010 UC_MINUS _PAGE_PCD
> > + 011 WC _PAGE_WC
> > + PAT bit unused */
> > + pat = PAT(0,WB) | PAT(1,WT) | PAT(2,UC_MINUS) | PAT(3,WC) |
> > + PAT(4,WB) | PAT(5,WT) | PAT(6,UC_MINUS) | PAT(7,WC);
> > + rdmsrl(MSR_IA32_CR_PAT, boot_pat_state);
> > + wrmsrl(MSR_IA32_CR_PAT, pat);
> > + __flush_tlb_all();
> > + asm volatile("wbinvd");
>
> Have you double checked this is the full procedure from the manual? iirc there
> were some steps missing.


Checking the manual for this. You are right, we had missed some steps here.
Actually, manual says on MP, PAT MSR on all CPUs must be consistent (even when they are not really using it in their page tables.
So, this will change the init and shutdown parts significantly and there may be some challenges with CPU offline and KEXEC. We will redo this part in next iteration.

Thanks,
Venki

2007-12-14 19:58:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

Andi Kleen wrote:
>> I do know we need to use the low 4 pat mappings to avoid most of the PAT
>> errata issues.
>
> They don't really matter. These are all very old systems who have run
> fine for many years without PAT. It is no problem to let them
> continue to do so and just disable PAT for them. So just clear pat bit in
> CPU initialization for any CPUs with non trivial erratas in this
> area.
>
> PAT is only really needed on modern boxes.

How many mapping types do we actually need? The only ones which are
likely to be used in practice are WB, UC, WC, which still leaves a
spare. (Any intended users of WP or WT?)

-hpa

2007-12-14 21:07:56

by Suresh Siddha

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

On Thu, Dec 13, 2007 at 08:48:45PM -0700, Eric W. Biederman wrote:
> > + pat = PAT(0,WB) | PAT(1,WT) | PAT(2,UC_MINUS) | PAT(3,WC) |
> > + PAT(4,WB) | PAT(5,WT) | PAT(6,UC_MINUS) | PAT(7,WC);
>
> I strongly object to this configuration.
>
> The caching modes of interest are:
> PAT_WB write-back or a close as the MTRRs will allow
> used for WC today.
> PAT_UC completely uncachable not overridable by MTRRs
> and what we use today for pgprot_noncached
> PAT_WC what isn't available for current use.
>
> We should use:
> > + pat = PAT(0,WB) | PAT(1,WT) | PAT(2,WC) | PAT(3,UC) |
> > + PAT(4,WB) | PAT(5,WT) | PAT(6,WC) | PAT(7,UC);
>
> Changing the UC- which currently allows write-combining if the MTRRs specify it,
> to WC. This grandfathers in all of our current usage and changes the one
> PAT type that could today and in legacy mode specify WC to really specify WC.

That seems reasonable. But looking at mainline kernel, ioremap_nocache()
actually uses UC_MINUS. Wonder why it is not using UC (like
pgprot_noncached). I think it is ok to change ioremap_nocache() to use UC.

thanks,
suresh

2007-12-14 21:11:27

by Suresh Siddha

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

On Thu, Dec 13, 2007 at 09:23:26PM -0700, Eric W. Biederman wrote:
> [email protected] (Eric W. Biederman) writes:
> Ok. My analysis here was wrong. Currently pgprot_noncached and
> ioremap_nocache are out of sync. With ioremap_nocache only specifying
> _PAGE_PCD and pgprot_noncached specifying _PAGE_PCD | _PAGE_PWT.
>
> So I don't have a clue how someone could reprogram the mtrrs currently
> and expect things to work.
>
> ...
>
> If we bother to ask ioremap for memory that is not cached, the last
> thing in the world we want is the MTRRs upgrading that to write combining.
> So ioremap_nocache has been slightly buggy for ages. ioremap_nocache
> and PAGE_KERNEL_NOCACHE should get _PAGE_PWT added to their
> definitions.
>
> Could we please get a cleanup patch at the beginning of this patchset
> or that comes before it that fixes ioremap_nocache on x86?
>
> That will make us a lot more git-bisect safe.

Ok. I will send a separate patch fixing ioremap_nocache on x86.

thanks,
suresh

2007-12-14 23:34:19

by Suresh Siddha

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

On Fri, Dec 14, 2007 at 01:10:39PM -0800, Siddha, Suresh B wrote:
> On Thu, Dec 13, 2007 at 09:23:26PM -0700, Eric W. Biederman wrote:
> > [email protected] (Eric W. Biederman) writes:
> > Ok. My analysis here was wrong. Currently pgprot_noncached and
> > ioremap_nocache are out of sync. With ioremap_nocache only specifying
> > _PAGE_PCD and pgprot_noncached specifying _PAGE_PCD | _PAGE_PWT.
> >
> > So I don't have a clue how someone could reprogram the mtrrs currently
> > and expect things to work.
> >
> > ...
> >
> > If we bother to ask ioremap for memory that is not cached, the last
> > thing in the world we want is the MTRRs upgrading that to write combining.
> > So ioremap_nocache has been slightly buggy for ages. ioremap_nocache
> > and PAGE_KERNEL_NOCACHE should get _PAGE_PWT added to their
> > definitions.
> >
> > Could we please get a cleanup patch at the beginning of this patchset
> > or that comes before it that fixes ioremap_nocache on x86?
> >
> > That will make us a lot more git-bisect safe.
>
> Ok. I will send a separate patch fixing ioremap_nocache on x86.

Appended the patch. x86 folks, please consider for x86 mm git tree. Thanks.

---
[patch] x86: Set strong uncacheable where UC is really desired

Also use _PAGE_PWT for all the mappings which need uncache mapping. Instead of
existing PAT2 which is UC- (and can be overwritten by MTRRs), we now use PAT3
which is strong uncacheable.

This makes it consistent with pgprot_noncached()

Signed-off-by: Suresh Siddha <[email protected]>
---

diff --git a/arch/x86/mm/ioremap_32.c b/arch/x86/mm/ioremap_32.c
index 0b27831..ef0f6a4 100644
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -119,7 +119,7 @@ EXPORT_SYMBOL(__ioremap);
void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
{
unsigned long last_addr;
- void __iomem *p = __ioremap(phys_addr, size, _PAGE_PCD);
+ void __iomem *p = __ioremap(phys_addr, size, _PAGE_PCD | _PAGE_PWT);
if (!p)
return p;

diff --git a/arch/x86/mm/ioremap_64.c b/arch/x86/mm/ioremap_64.c
index 6cac90a..8be3062 100644
--- a/arch/x86/mm/ioremap_64.c
+++ b/arch/x86/mm/ioremap_64.c
@@ -158,7 +158,7 @@ EXPORT_SYMBOL(__ioremap);

void __iomem *ioremap_nocache (unsigned long phys_addr, unsigned long size)
{
- return __ioremap(phys_addr, size, _PAGE_PCD);
+ return __ioremap(phys_addr, size, _PAGE_PCD | _PAGE_PWT);
}
EXPORT_SYMBOL(ioremap_nocache);

diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index ed3e70d..b1215e1 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -156,7 +156,7 @@ void paging_init(void);
extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
#define __PAGE_KERNEL_RX (__PAGE_KERNEL_EXEC & ~_PAGE_RW)
-#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD)
+#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_LARGE (__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC (__PAGE_KERNEL_EXEC | _PAGE_PSE)

diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
index 9b0ff47..4e4dcc4 100644
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -185,13 +185,13 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm, unsigned long
#define __PAGE_KERNEL_EXEC \
(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)
#define __PAGE_KERNEL_NOCACHE \
- (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_PCD | _PAGE_ACCESSED | _PAGE_NX)
+ (_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_PCD | _PAGE_PWT | _PAGE_ACCESSED | _PAGE_NX)
#define __PAGE_KERNEL_RO \
(_PAGE_PRESENT | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_NX)
#define __PAGE_KERNEL_VSYSCALL \
(_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED)
#define __PAGE_KERNEL_VSYSCALL_NOCACHE \
- (_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED | _PAGE_PCD)
+ (_PAGE_PRESENT | _PAGE_USER | _PAGE_ACCESSED | _PAGE_PCD | _PAGE_PWT)
#define __PAGE_KERNEL_LARGE \
(__PAGE_KERNEL | _PAGE_PSE)
#define __PAGE_KERNEL_LARGE_EXEC \

2007-12-15 07:57:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation


* Siddha, Suresh B <[email protected]> wrote:

> > Ok. I will send a separate patch fixing ioremap_nocache on x86.
>
> Appended the patch. x86 folks, please consider for x86 mm git tree.
> Thanks.

thanks, applied.

Ingo

2007-12-18 04:45:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

Andi Kleen <[email protected]> writes:

>> I do know we need to use the low 4 pat mappings to avoid most of the PAT
>> errata issues.
>
> They don't really matter. These are all very old systems who have run
> fine for many years without PAT. It is no problem to let them
> continue to do so and just disable PAT for them. So just clear pat bit in
> CPU initialization for any CPUs with non trivial erratas in this
> area.
>
> PAT is only really needed on modern boxes.
>
> Just someone needs to go through the old errata sheets and find
> out on which CPUs it is needed to clear the bit.

It has been ages now, but my impression when I wrote the patch that
current cores still had a few outstanding errata with using the
extended pat bits.

Further it was my impression was that if we just changed UC- to WC
we work on essentially everything, because PAT is always enabled
on the cores that support it.

Therefore since we only have 3 interesting caching modes.
WB, WC, UC. We should be very careful about reprogramming it
and we can ignore the errors.

As for the pat class errata about inconsistent mappings those are
reoccurring issues, that happen across all cpu types (x86/ppc/fred),
and every major core overhaul is likely to have them again.

Eric

2007-12-18 04:52:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC PATCH 02/12] PAT 64b: Basic PAT implementation

Venki Pallipadi <[email protected]> writes:

> Checking the manual for this. You are right, we had missed some steps here.
> Actually, manual says on MP, PAT MSR on all CPUs must be consistent (even when
> they are not really using it in their page tables.
> So, this will change the init and shutdown parts significantly and there may be
> some challenges with CPU offline and KEXEC. We will redo this part in next
> iteration.

Well the normal kexec path is no worse then reboot. The kdump path is a
mess but only a minor one, and with us only changing the UC- case we
can probably just ignore it and leave the system started with that
pat register set to WC :)

What we are doing really should be no worse the MTRR setup except
that disabling it at reboot is polite.

CPU online and offline that is weird, but so far it is always weird,
and I don't think ever quite correct.

Eric