2010-06-17 01:29:26

by Kenji Kaneshige

Subject: [BUG][PATCH 0/2 (v.2)] x86: ioremap() problem in X86_32 PAE

Hi,

Here is an updated version (v.2) of the patchset to fix an ioremap()
problem found in x86_32 PAE mode. The problem is that ioremap() maps
the wrong address for devices that are assigned physical addresses
higher than 32 bits (an ioat device in my case).

Changelog
----
v.1 => v.2:
Thanks to the feedback, I found that some of my v.1 patches were totally
wrong because they were based on my misunderstanding of the architectural
limit and the Linux memory management limit on physical addresses. In
addition, it turned out that there are bugs that handle physical addresses
improperly (the higher 32 bits are cleared unexpectedly) in other places in
the ioremap() code path as well. Major changes are:
- Removed wrong changes to phys_addr_valid() ([PATCH 2/4] in v.1).
- Removed wrong changes to the warning message in ioremap() ([PATCH 3/4]
in v.1).
- Changed not to use PHYSICAL_PAGE_MASK, because PHYSICAL_PAGE_MASK
would not work for physical addresses higher than 44 bits.
- Added fixes for the remaining physical address handling bugs in the
ioremap() code path, according to the feedback.
- Added a fix for s_show() in vmalloc.c to show high physical addresses properly.
----

I also found a bug in the PCI MSI-X code that handles physical addresses
improperly, which prevents MSI-X from working on my devices. I'll send a
patch for this to the PCI mailing list separately.

The v.2 patches are:

- [PATCH 1/2] x86: ioremap: fix wrong physical address handling
- [PATCH 2/2] x86: ioremap: fix normal ram range check

Thanks,
Kenji Kaneshige


2010-06-17 01:30:48

by Kenji Kaneshige

Subject: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

Current x86 ioremap() doesn't handle physical addresses higher than
32 bits properly in X86_32 PAE mode. When a physical address higher than
32 bits is passed to ioremap(), the higher 32 bits of the physical
address are wrongly cleared. Due to this bug, ioremap() can map the
wrong address into the linear address space.

In my case, a 64-bit MMIO region was assigned to a PCI device (an ioat
device) on my system. Because of the ioremap() bug, the wrong physical
address (instead of the MMIO region) was mapped into the linear address
space. Because of this, loading the ioatdma driver caused unexpected
behavior (kernel panic, kernel hangup, ...).

Signed-off-by: Kenji Kaneshige <[email protected]>

---
arch/x86/mm/ioremap.c | 12 +++++-------
include/linux/io.h | 4 ++--
include/linux/vmalloc.h | 2 +-
lib/ioremap.c | 10 +++++-----
mm/vmalloc.c | 2 +-
5 files changed, 14 insertions(+), 16 deletions(-)

Index: linux-2.6.34/arch/x86/mm/ioremap.c
===================================================================
--- linux-2.6.34.orig/arch/x86/mm/ioremap.c 2010-06-15 04:43:00.978332015 +0900
+++ linux-2.6.34/arch/x86/mm/ioremap.c 2010-06-15 05:32:59.291693007 +0900
@@ -62,8 +62,8 @@
static void __iomem *__ioremap_caller(resource_size_t phys_addr,
unsigned long size, unsigned long prot_val, void *caller)
{
- unsigned long pfn, offset, vaddr;
- resource_size_t last_addr;
+ unsigned long offset, vaddr;
+ resource_size_t pfn, last_pfn, last_addr;
const resource_size_t unaligned_phys_addr = phys_addr;
const unsigned long unaligned_size = size;
struct vm_struct *area;
@@ -100,10 +100,8 @@
/*
* Don't allow anybody to remap normal RAM that we're using..
*/
- for (pfn = phys_addr >> PAGE_SHIFT;
- (pfn << PAGE_SHIFT) < (last_addr & PAGE_MASK);
- pfn++) {
-
+ last_pfn = last_addr >> PAGE_SHIFT;
+ for (pfn = phys_addr >> PAGE_SHIFT; pfn < last_pfn; pfn++) {
int is_ram = page_is_ram(pfn);

if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
@@ -115,7 +113,7 @@
* Mappings have to be page-aligned
*/
offset = phys_addr & ~PAGE_MASK;
- phys_addr &= PAGE_MASK;
+ phys_addr = (phys_addr >> PAGE_SHIFT) << PAGE_SHIFT;
size = PAGE_ALIGN(last_addr+1) - phys_addr;

retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
Index: linux-2.6.34/include/linux/vmalloc.h
===================================================================
--- linux-2.6.34.orig/include/linux/vmalloc.h 2010-06-15 04:43:00.970258681 +0900
+++ linux-2.6.34/include/linux/vmalloc.h 2010-06-15 05:32:59.323364960 +0900
@@ -30,7 +30,7 @@
unsigned long flags;
struct page **pages;
unsigned int nr_pages;
- unsigned long phys_addr;
+ phys_addr_t phys_addr;
void *caller;
};

Index: linux-2.6.34/lib/ioremap.c
===================================================================
--- linux-2.6.34.orig/lib/ioremap.c 2010-06-15 04:43:00.970258681 +0900
+++ linux-2.6.34/lib/ioremap.c 2010-06-15 05:32:59.352457435 +0900
@@ -13,10 +13,10 @@
#include <asm/pgtable.h>

static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pte_t *pte;
- unsigned long pfn;
+ u64 pfn;

pfn = phys_addr >> PAGE_SHIFT;
pte = pte_alloc_kernel(pmd, addr);
@@ -31,7 +31,7 @@
}

static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pmd_t *pmd;
unsigned long next;
@@ -49,7 +49,7 @@
}

static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pud_t *pud;
unsigned long next;
@@ -67,7 +67,7 @@
}

int ioremap_page_range(unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pgd_t *pgd;
unsigned long start;
Index: linux-2.6.34/include/linux/io.h
===================================================================
--- linux-2.6.34.orig/include/linux/io.h 2010-06-15 04:43:00.971256515 +0900
+++ linux-2.6.34/include/linux/io.h 2010-06-15 05:32:59.377701457 +0900
@@ -29,10 +29,10 @@

#ifdef CONFIG_MMU
int ioremap_page_range(unsigned long addr, unsigned long end,
- unsigned long phys_addr, pgprot_t prot);
+ phys_addr_t phys_addr, pgprot_t prot);
#else
static inline int ioremap_page_range(unsigned long addr, unsigned long end,
- unsigned long phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot)
{
return 0;
}
Index: linux-2.6.34/mm/vmalloc.c
===================================================================
--- linux-2.6.34.orig/mm/vmalloc.c 2010-06-15 04:43:00.963255188 +0900
+++ linux-2.6.34/mm/vmalloc.c 2010-06-15 05:32:59.404457295 +0900
@@ -2403,7 +2403,7 @@
seq_printf(m, " pages=%d", v->nr_pages);

if (v->phys_addr)
- seq_printf(m, " phys=%lx", v->phys_addr);
+ seq_printf(m, " phys=%llx", (unsigned long long)v->phys_addr);

if (v->flags & VM_IOREMAP)
seq_printf(m, " ioremap");

2010-06-17 01:31:56

by Kenji Kaneshige

Subject: [PATCH 2/2] x86: ioremap: fix normal ram range check

The check for normal RAM in the x86 ioremap() code seems not to work for
the last page frame in the specified physical address range.

Signed-off-by: Kenji Kaneshige <[email protected]>

---
arch/x86/mm/ioremap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.34/arch/x86/mm/ioremap.c
===================================================================
--- linux-2.6.34.orig/arch/x86/mm/ioremap.c 2010-06-15 05:33:06.272381451 +0900
+++ linux-2.6.34/arch/x86/mm/ioremap.c 2010-06-15 05:33:21.086585690 +0900
@@ -101,7 +101,7 @@
* Don't allow anybody to remap normal RAM that we're using..
*/
last_pfn = last_addr >> PAGE_SHIFT;
- for (pfn = phys_addr >> PAGE_SHIFT; pfn < last_pfn; pfn++) {
+ for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
int is_ram = page_is_ram(pfn);

if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))

2010-06-17 02:50:55

by Matthew Wilcox

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On Thu, Jun 17, 2010 at 10:30:06AM +0900, Kenji Kaneshige wrote:
> Index: linux-2.6.34/arch/x86/mm/ioremap.c
> ===================================================================
> --- linux-2.6.34.orig/arch/x86/mm/ioremap.c 2010-06-15 04:43:00.978332015 +0900
> +++ linux-2.6.34/arch/x86/mm/ioremap.c 2010-06-15 05:32:59.291693007 +0900
> @@ -62,8 +62,8 @@
> static void __iomem *__ioremap_caller(resource_size_t phys_addr,
> unsigned long size, unsigned long prot_val, void *caller)
> {
> - unsigned long pfn, offset, vaddr;
> - resource_size_t last_addr;
> + unsigned long offset, vaddr;
> + resource_size_t pfn, last_pfn, last_addr;

I have a hard time understanding this change. pfn is always a physical
address shifted by PAGE_SHIFT. So a 32-bit pfn supports up to 44-bit
physical addresses. Are your addresses above 44-bits?

> @@ -115,7 +113,7 @@
> * Mappings have to be page-aligned
> */
> offset = phys_addr & ~PAGE_MASK;
> - phys_addr &= PAGE_MASK;
> + phys_addr = (phys_addr >> PAGE_SHIFT) << PAGE_SHIFT;

I'd rather see PAGE_MASK fixed. Would this work?

#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
-#define PAGE_MASK (~(PAGE_SIZE-1))
+#define PAGE_MASK (~(PAGE_SIZE-1ULL))

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2010-06-17 04:22:40

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On 06/16/2010 07:50 PM, Matthew Wilcox wrote:
> On Thu, Jun 17, 2010 at 10:30:06AM +0900, Kenji Kaneshige wrote:
>> Index: linux-2.6.34/arch/x86/mm/ioremap.c
>> ===================================================================
>> --- linux-2.6.34.orig/arch/x86/mm/ioremap.c 2010-06-15 04:43:00.978332015 +0900
>> +++ linux-2.6.34/arch/x86/mm/ioremap.c 2010-06-15 05:32:59.291693007 +0900
>> @@ -62,8 +62,8 @@
>> static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>> unsigned long size, unsigned long prot_val, void *caller)
>> {
>> - unsigned long pfn, offset, vaddr;
>> - resource_size_t last_addr;
>> + unsigned long offset, vaddr;
>> + resource_size_t pfn, last_pfn, last_addr;
>
> I have a hard time understanding this change. pfn is always a physical
> address shifted by PAGE_SHIFT. So a 32-bit pfn supports up to 44-bit
> physical addresses. Are your addresses above 44-bits?
>

I think they might be. Kenji?

-hpa

2010-06-17 04:56:10

by Kenji Kaneshige

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

(2010/06/17 13:22), H. Peter Anvin wrote:
> On 06/16/2010 07:50 PM, Matthew Wilcox wrote:
>> On Thu, Jun 17, 2010 at 10:30:06AM +0900, Kenji Kaneshige wrote:
>>> Index: linux-2.6.34/arch/x86/mm/ioremap.c
>>> ===================================================================
>>> --- linux-2.6.34.orig/arch/x86/mm/ioremap.c 2010-06-15
>>> 04:43:00.978332015 +0900
>>> +++ linux-2.6.34/arch/x86/mm/ioremap.c 2010-06-15 05:32:59.291693007
>>> +0900
>>> @@ -62,8 +62,8 @@
>>> static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>>> unsigned long size, unsigned long prot_val, void *caller)
>>> {
>>> - unsigned long pfn, offset, vaddr;
>>> - resource_size_t last_addr;
>>> + unsigned long offset, vaddr;
>>> + resource_size_t pfn, last_pfn, last_addr;
>>
>> I have a hard time understanding this change. pfn is always a physical
>> address shifted by PAGE_SHIFT. So a 32-bit pfn supports up to 44-bit
>> physical addresses. Are your addresses above 44-bits?
>>
>
> I think they might be. Kenji?

No. My addresses are in the 44-bit range (around fc000000000). So it is
not required for my problem. This change assumes that phys_addr can be
above 44 bits (up to 52 bits, and higher in the future?).

By the way, is there a Linux kernel limit on physical addresses above
44 bits in x86_32 PAE? For example, are pfns above 32 bits not supported?

#ifdef CONFIG_X86_PAE
/* 44=32+12, the limit we can fit into an unsigned long pfn */
#define __PHYSICAL_MASK_SHIFT 44
#define __VIRTUAL_MASK_SHIFT 32

If there is a 44-bit physical address limit, I think it's better to use
PHYSICAL_PAGE_MASK for masking the physical address, instead of "(phys_addr
>> PAGE_SHIFT) << PAGE_SHIFT". PHYSICAL_PAGE_MASK would become a
greater value when the 44-bit physical address limit is eliminated. And
maybe we need to change phys_addr_valid() to return an error if the physical
address is above (1 << __PHYSICAL_MASK_SHIFT)?

Thanks,
Kenji Kaneshige

2010-06-17 06:03:59

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On 06/16/2010 09:55 PM, Kenji Kaneshige wrote:
>>
>> I think they might be. Kenji?
>
> No. My addresses are in the 44-bits range (around fc000000000). So it is
> not required for my problem. This change assumes that phys_addr can be
> above 44-bits (up to 52-bits (and higher in the future?)).
>
> By the way, is there linux kernel limit regarding above 44-bits physical
> address in x86_32 PAE? For example, pfn above 32-bits is not supported?
>

There are probably places at which PFNs are held in 32-bit numbers,
although it would be good to track them down if it isn't too expensive
to fix them (i.e. doesn't affect generic code.)

This also affects paravirt systems, i.e. right now Xen has to locate all
32-bit guests below 64 GB, which limits its usefulness.

> #ifdef CONFIG_X86_PAE
> /* 44=32+12, the limit we can fit into an unsigned long pfn */
> #define __PHYSICAL_MASK_SHIFT 44
> #define __VIRTUAL_MASK_SHIFT 32
>
> If there is 44-bits physical address limit, I think it's better to use
> PHYSICAL_PAGE_MASK for masking physical address, instead of "(phys_addr
>>> PAGE_SHIFT) << PAGE_SHIFT)". The PHYSICAL_PAGE_MASK would become
> greater value when 44-bits physical address limit is eliminated. And
> maybe we need to change phys_addr_valid() returns error if physical
> address is above (1 << __PHYSICAL_MASK_SHIFT)?

The real question is how much we can fix without an unreasonable cost.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-06-17 06:21:59

by Kenji Kaneshige

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

(2010/06/17 15:03), H. Peter Anvin wrote:
> On 06/16/2010 09:55 PM, Kenji Kaneshige wrote:
>>>
>>> I think they might be. Kenji?
>>
>> No. My addresses are in the 44-bits range (around fc000000000). So it is
>> not required for my problem. This change assumes that phys_addr can be
>> above 44-bits (up to 52-bits (and higher in the future?)).
>>
>> By the way, is there linux kernel limit regarding above 44-bits physical
>> address in x86_32 PAE? For example, pfn above 32-bits is not supported?
>>
>
> There are probably places at which PFNs are held in 32-bit numbers,
> although it would be good to track them down if it isn't too expensive
> to fix them (i.e. doesn't affect generic code.)
>
> This also affects paravirt systems, i.e. right now Xen has to locate all
> 32-bit guests below 64 GB, which limits its usefulness.
>
>> #ifdef CONFIG_X86_PAE
>> /* 44=32+12, the limit we can fit into an unsigned long pfn */
>> #define __PHYSICAL_MASK_SHIFT 44
>> #define __VIRTUAL_MASK_SHIFT 32
>>
>> If there is 44-bits physical address limit, I think it's better to use
>> PHYSICAL_PAGE_MASK for masking physical address, instead of "(phys_addr
>>>> PAGE_SHIFT) << PAGE_SHIFT)". The PHYSICAL_PAGE_MASK would become
>> greater value when 44-bits physical address limit is eliminated. And
>> maybe we need to change phys_addr_valid() returns error if physical
>> address is above (1 << __PHYSICAL_MASK_SHIFT)?
>
> The real question is how much we can fix without an unreasonable cost.
>

Thank you very much. I understand the situation.

Thanks,
Kenji Kaneshige

2010-06-17 06:29:18

by Kenji Kaneshige

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

(2010/06/17 11:50), Matthew Wilcox wrote:
> On Thu, Jun 17, 2010 at 10:30:06AM +0900, Kenji Kaneshige wrote:
>> Index: linux-2.6.34/arch/x86/mm/ioremap.c
>> ===================================================================
>> --- linux-2.6.34.orig/arch/x86/mm/ioremap.c 2010-06-15 04:43:00.978332015 +0900
>> +++ linux-2.6.34/arch/x86/mm/ioremap.c 2010-06-15 05:32:59.291693007 +0900
>> @@ -62,8 +62,8 @@
>> static void __iomem *__ioremap_caller(resource_size_t phys_addr,
>> unsigned long size, unsigned long prot_val, void *caller)
>> {
>> - unsigned long pfn, offset, vaddr;
>> - resource_size_t last_addr;
>> + unsigned long offset, vaddr;
>> + resource_size_t pfn, last_pfn, last_addr;
>
> I have a hard time understanding this change. pfn is always a physical
> address shifted by PAGE_SHIFT. So a 32-bit pfn supports up to 44-bit
> physical addresses. Are your addresses above 44-bits?
>
>> @@ -115,7 +113,7 @@
>> * Mappings have to be page-aligned
>> */
>> offset = phys_addr & ~PAGE_MASK;
>> - phys_addr &= PAGE_MASK;
>> + phys_addr = (phys_addr >> PAGE_SHIFT) << PAGE_SHIFT;
>
> I'd rather see PAGE_MASK fixed. Would this work?
>
> #define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT)
> -#define PAGE_MASK (~(PAGE_SIZE-1))
> +#define PAGE_MASK (~(PAGE_SIZE-1ULL))
>

I think it should work. But I'm worried about regressions.
Now I think using PHYSICAL_PAGE_MASK (as my v.1 patch did) is a good idea
again. What do you think about this?

Thanks,
Kenji Kaneshige

2010-06-17 09:35:26

by Jeremy Fitzhardinge

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On 06/17/2010 07:03 AM, H. Peter Anvin wrote:
> On 06/16/2010 09:55 PM, Kenji Kaneshige wrote:
>
>>> I think they might be. Kenji?
>>>
>> No. My addresses are in the 44-bits range (around fc000000000). So it is
>> not required for my problem. This change assumes that phys_addr can be
>> above 44-bits (up to 52-bits (and higher in the future?)).
>>
>> By the way, is there linux kernel limit regarding above 44-bits physical
>> address in x86_32 PAE? For example, pfn above 32-bits is not supported?
>>
>>

That's an awkward situation. I would tend to suggest that you not
support this type of machine with a 32-bit kernel. Is it a sparse
memory system, or is there a device mapped in that range?

I guess it would be possible to special-case ioremap to allow the
creation of such mappings, but I don't know what kind of system-wide
fallout would happen as a result. The consequences of something trying
to extract a pfn from one of those ptes would be

> There are probably places at which PFNs are held in 32-bit numbers,
> although it would be good to track them down if it isn't too expensive
> to fix them (i.e. doesn't affect generic code.)
>

There are many places which hold pfns in 32 bit variables on 32 bit
systems; the standard type for pfns is "unsigned long", pretty much
everywhere in the kernel. It might be worth defining a pfn_t and
converting usage over to that, but it would be a pervasive change.

> This also affects paravirt systems, i.e. right now Xen has to locate all
> 32-bit guests below 64 GB, which limits its usefulness.
>

I don't think the limit is 64GB. A 32-bit PFN limits us to 2^44, which
is 16TB. (32-bit PV Xen guests have another unrelated limit of around
160GB of physical memory, because that is as much m2p table as will fit
into the Xen hole in the kernel mapping.)

>> #ifdef CONFIG_X86_PAE
>> /* 44=32+12, the limit we can fit into an unsigned long pfn */
>> #define __PHYSICAL_MASK_SHIFT 44
>> #define __VIRTUAL_MASK_SHIFT 32
>>
>> If there is 44-bits physical address limit, I think it's better to use
>> PHYSICAL_PAGE_MASK for masking physical address, instead of "(phys_addr
>>
>>>> PAGE_SHIFT) << PAGE_SHIFT)". The PHYSICAL_PAGE_MASK would become
>>>>
>> greater value when 44-bits physical address limit is eliminated. And
>> maybe we need to change phys_addr_valid() returns error if physical
>> address is above (1 << __PHYSICAL_MASK_SHIFT)?
>>
> The real question is how much we can fix without an unreasonable cost.
>

I think it would be a pretty large change. From Xen's perspective,
any machine even approximately approaching the 2^44 limit will be
capable of running Xen guests in hvm mode, so PV isn't really a concern.

J

2010-06-17 09:38:06

by Jeremy Fitzhardinge

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On 06/17/2010 10:35 AM, Jeremy Fitzhardinge wrote:
> I guess it would be possible to special-case ioremap to allow the
> creation of such mappings, but I don't know what kind of system-wide
> fallout would happen as a result. The consequences of something trying
> to extract a pfn from one of those ptes would be
>

...very bad, as it would result in truncated pfns and likely cause some
kind of corruption.

(oops, sent too early)

J

2010-06-17 13:47:01

by H. Peter Anvin

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On 06/17/2010 02:35 AM, Jeremy Fitzhardinge wrote:
>>>
>>> By the way, is there linux kernel limit regarding above 44-bits physical
>>> address in x86_32 PAE? For example, pfn above 32-bits is not supported?
>
> That's an awkward situation. I would tend to suggest that you not
> support this type of machine with a 32-bit kernel. Is it a sparse
> memory system, or is there a device mapped in that range?
>
> I guess it would be possible to special-case ioremap to allow the
> creation of such mappings, but I don't know what kind of system-wide
> fallout would happen as a result. The consequences of something trying
> to extract a pfn from one of those ptes would be
>
>> There are probably places at which PFNs are held in 32-bit numbers,
>> although it would be good to track them down if it isn't too expensive
>> to fix them (i.e. doesn't affect generic code.)
>>
>
> There are many places which hold pfns in 32 bit variables on 32 bit
> systems; the standard type for pfns is "unsigned long", pretty much
> everywhere in the kernel. It might be worth defining a pfn_t and
> converting usage over to that, but it would be a pervasive change.
>

I think you're right, and just making 2^44 work correctly would be good
enough. Doing special forwarding of all 52 bits of the real physical
address in the paravirt case (where it is self-contained and doesn't
spill into the rest of the kernel) would probably be a good thing, though.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-06-18 00:23:22

by Kenji Kaneshige

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

(2010/06/17 18:35), Jeremy Fitzhardinge wrote:
> On 06/17/2010 07:03 AM, H. Peter Anvin wrote:
>> On 06/16/2010 09:55 PM, Kenji Kaneshige wrote:
>>
>>>> I think they might be. Kenji?
>>>>
>>> No. My addresses are in the 44-bits range (around fc000000000). So it is
>>> not required for my problem. This change assumes that phys_addr can be
>>> above 44-bits (up to 52-bits (and higher in the future?)).
>>>
>>> By the way, is there linux kernel limit regarding above 44-bits physical
>>> address in x86_32 PAE? For example, pfn above 32-bits is not supported?
>>>
>>>
>
> That's an awkward situation. I would tend to suggest that you not
> support this type of machine with a 32-bit kernel. Is it a sparse
> memory system, or is there a device mapped in that range?
>

A device-mapped range in my case.
Fortunately, the address is in the 44-bit range. I'd like to focus on
making 2^44 work correctly this time.

Thanks,
Kenji Kaneshige




> I guess it would be possible to special-case ioremap to allow the
> creation of such mappings, but I don't know what kind of system-wide
> fallout would happen as a result. The consequences of something trying
> to extract a pfn from one of those ptes would be
>
>> There are probably places at which PFNs are held in 32-bit numbers,
>> although it would be good to track them down if it isn't too expensive
>> to fix them (i.e. doesn't affect generic code.)
>>
>
> There are many places which hold pfns in 32 bit variables on 32 bit
> systems; the standard type for pfns is "unsigned long", pretty much
> everywhere in the kernel. It might be worth defining a pfn_t and
> converting usage over to that, but it would be a pervasive change.
>
>> This also affects paravirt systems, i.e. right now Xen has to locate all
>> 32-bit guests below 64 GB, which limits its usefulness.
>>
>
> I don't think the limit is 64GB. A 32-bit PFN limits us to 2^44, which
> is 16TB. (32-bit PV Xen guests have another unrelated limit of around
> 160GB physical memory because that as much m2p table will fit into the
> Xen hole in the kernel mapping.)
>
>>> #ifdef CONFIG_X86_PAE
>>> /* 44=32+12, the limit we can fit into an unsigned long pfn */
>>> #define __PHYSICAL_MASK_SHIFT 44
>>> #define __VIRTUAL_MASK_SHIFT 32
>>>
>>> If there is 44-bits physical address limit, I think it's better to use
>>> PHYSICAL_PAGE_MASK for masking physical address, instead of "(phys_addr
>>>
>>>>> PAGE_SHIFT) << PAGE_SHIFT)". The PHYSICAL_PAGE_MASK would become
>>>>>
>>> greater value when 44-bits physical address limit is eliminated. And
>>> maybe we need to change phys_addr_valid() returns error if physical
>>> address is above (1 << __PHYSICAL_MASK_SHIFT)?
>>>
>> The real question is how much we can fix without an unreasonable cost.
>>
>
> I think it would be a pretty large change. From the Xen's perspective,
> any machine even approximately approaching the 2^44 limit will be
> capable of running Xen guests in hvm mode, so PV isn't really a concern.
>
> J
>
>

2010-06-18 00:33:25

by Kenji Kaneshige

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

(2010/06/17 22:46), H. Peter Anvin wrote:
> On 06/17/2010 02:35 AM, Jeremy Fitzhardinge wrote:
>>>>
>>>> By the way, is there linux kernel limit regarding above 44-bits physical
>>>> address in x86_32 PAE? For example, pfn above 32-bits is not supported?
>>
>> That's an awkward situation. I would tend to suggest that you not
>> support this type of machine with a 32-bit kernel. Is it a sparse
>> memory system, or is there a device mapped in that range?
>>
>> I guess it would be possible to special-case ioremap to allow the
>> creation of such mappings, but I don't know what kind of system-wide
>> fallout would happen as a result. The consequences of something trying
>> to extract a pfn from one of those ptes would be
>>
>>> There are probably places at which PFNs are held in 32-bit numbers,
>>> although it would be good to track them down if it isn't too expensive
>>> to fix them (i.e. doesn't affect generic code.)
>>>
>>
>> There are many places which hold pfns in 32 bit variables on 32 bit
>> systems; the standard type for pfns is "unsigned long", pretty much
>> everywhere in the kernel. It might be worth defining a pfn_t and
>> converting usage over to that, but it would be a pervasive change.
>>
>
> I think you're right, and just making 2^44 work correctly would be good
> enough. Doing special forwarding of all 52 bits of the real physical
> address in the paravirt case (where it is self-contained and doesn't
> spill into the rest of the kernel) would probably be a good thing, though.
>
> -hpa
>

I'll focus on making 2^44 work correctly. Then, I'll do the following
change in the next version of my patch.

- The v.2 patch uses resource_size_t for pfn. I'll keep using
resource_size_t for pfn in v.3 as well, because there is no reason to
leave it as "unsigned long".

- Use PHYSICAL_PAGE_MASK for masking the physical address, as the v.1
patch did. I think changing the definition of PAGE_MASK is a little risky.

Thanks,
Kenji Kaneshige

2010-07-09 04:24:29

by Simon Horman

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On Thu, Jun 17, 2010 at 10:35:19AM +0100, Jeremy Fitzhardinge wrote:
> On 06/17/2010 07:03 AM, H. Peter Anvin wrote:
> > On 06/16/2010 09:55 PM, Kenji Kaneshige wrote:

[snip]

> >> greater value when 44-bits physical address limit is eliminated. And
> >> maybe we need to change phys_addr_valid() returns error if physical
> >> address is above (1 << __PHYSICAL_MASK_SHIFT)?
> >>
> > The real question is how much we can fix without an unreasonable cost.
> >
>
> I think it would be a pretty large change. From the Xen's perspective,
> any machine even approximately approaching the 2^44 limit will be
> capable of running Xen guests in hvm mode, so PV isn't really a concern.

Hi Jeremy,

Is the implication of that statement that HVM is preferred where
supported by HW?

2010-07-09 05:33:11

by Jeremy Fitzhardinge

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On 07/08/2010 09:24 PM, Simon Horman wrote:
>> I think it would be a pretty large change. From the Xen's perspective,
>> any machine even approximately approaching the 2^44 limit will be
>> capable of running Xen guests in hvm mode, so PV isn't really a concern.
>>
> Hi Jeremy,
>
> Is the implication of that statement that HVM is preferred where
> supported by HW?
>

I wouldn't go that far; the PV vs HVM choice is pretty complex, and
depends on what your workload is and what hardware you have available.
All I meant was what I said: that if you're running on a machine with a
large amount of memory, then you should run your 32-bit domains as HVM
rather than PV. Though Xen could easily keep domains limited to memory
that they can actually use (it already does this, in fact).

J

2010-07-09 06:10:13

by Simon Horman

Subject: Re: [PATCH 1/2] x86: ioremap: fix wrong physical address handling

On Thu, Jul 08, 2010 at 10:33:08PM -0700, Jeremy Fitzhardinge wrote:
> On 07/08/2010 09:24 PM, Simon Horman wrote:
> >> I think it would be a pretty large change. From the Xen's perspective,
> >> any machine even approximately approaching the 2^44 limit will be
> >> capable of running Xen guests in hvm mode, so PV isn't really a concern.
> >>
> > Hi Jeremy,
> >
> > Is the implication of that statement that HVM is preferred where
> > supported by HW?
> >
>
> I wouldn't go that far; the PV vs HVM choice is pretty complex, and
> depends on what your workload is and what hardware you have available.
> All I meant was what I said: that if you're running on a machine with a
> large amount of memory, then you should run your 32-bit domains as HVM
> rather than PV. Though Xen could easily keep domains limited to memory
> that they can actually use (it already does this, in fact).

Hi Jeremy,

thanks for the clarification.

2010-07-09 18:31:54

by Kenji Kaneshige

Subject: [tip:x86/mm] x86, ioremap: Fix normal ram range check

Commit-ID: 2233576bf7b5d246593c3e06cab74d879b32b949
Gitweb: http://git.kernel.org/tip/2233576bf7b5d246593c3e06cab74d879b32b949
Author: Kenji Kaneshige <[email protected]>
AuthorDate: Thu, 17 Jun 2010 10:31:11 +0900
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 9 Jul 2010 10:51:56 -0700

x86, ioremap: Fix normal ram range check

Check for normal RAM in x86 ioremap() code seems to not work for the
last page frame in the specified physical address range.

Signed-off-by: Kenji Kaneshige <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/ioremap.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 9c8e3a7..299e4eb 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -101,7 +101,7 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
* Don't allow anybody to remap normal RAM that we're using..
*/
last_pfn = last_addr >> PAGE_SHIFT;
- for (pfn = phys_addr >> PAGE_SHIFT; pfn < last_pfn; pfn++) {
+ for (pfn = phys_addr >> PAGE_SHIFT; pfn <= last_pfn; pfn++) {
int is_ram = page_is_ram(pfn);

if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))

2010-07-09 18:35:08

by Kenji Kaneshige

Subject: [tip:x86/mm] x86, pae: Fix handling of large physical addresses in ioremap

Commit-ID: 39d8c3ff39443825b6a21b28249fc4904809203f
Gitweb: http://git.kernel.org/tip/39d8c3ff39443825b6a21b28249fc4904809203f
Author: Kenji Kaneshige <[email protected]>
AuthorDate: Thu, 17 Jun 2010 10:30:06 +0900
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 9 Jul 2010 10:51:39 -0700

x86, pae: Fix handling of large physical addresses in ioremap

Current x86 ioremap() doesn't handle physical addresses higher than
32 bits properly in X86_32 PAE mode. When a physical address higher
than 32 bits is passed to ioremap(), the upper 32 bits of the physical
address are cleared wrongly. Due to this bug, ioremap() can map the
wrong address into the linear address space.

In my case, a 64-bit MMIO region was assigned to a PCI device (an ioat
device) on my system. Because of the ioremap() bug, a wrong physical
address (instead of the MMIO region) was mapped into the linear
address space. Because of this, loading the ioatdma driver caused
unexpected behavior (kernel panic, kernel hangup, ...).

Signed-off-by: Kenji Kaneshige <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/ioremap.c | 12 +++++-------
include/linux/io.h | 4 ++--
include/linux/vmalloc.h | 2 +-
lib/ioremap.c | 10 +++++-----
mm/vmalloc.c | 2 +-
5 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index 12e4d2d..9c8e3a7 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -62,8 +62,8 @@ int ioremap_change_attr(unsigned long vaddr, unsigned long size,
static void __iomem *__ioremap_caller(resource_size_t phys_addr,
unsigned long size, unsigned long prot_val, void *caller)
{
- unsigned long pfn, offset, vaddr;
- resource_size_t last_addr;
+ unsigned long offset, vaddr;
+ resource_size_t pfn, last_pfn, last_addr;
const resource_size_t unaligned_phys_addr = phys_addr;
const unsigned long unaligned_size = size;
struct vm_struct *area;
@@ -100,10 +100,8 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
/*
* Don't allow anybody to remap normal RAM that we're using..
*/
- for (pfn = phys_addr >> PAGE_SHIFT;
- (pfn << PAGE_SHIFT) < (last_addr & PAGE_MASK);
- pfn++) {
-
+ last_pfn = last_addr >> PAGE_SHIFT;
+ for (pfn = phys_addr >> PAGE_SHIFT; pfn < last_pfn; pfn++) {
int is_ram = page_is_ram(pfn);

if (is_ram && pfn_valid(pfn) && !PageReserved(pfn_to_page(pfn)))
@@ -115,7 +113,7 @@ static void __iomem *__ioremap_caller(resource_size_t phys_addr,
* Mappings have to be page-aligned
*/
offset = phys_addr & ~PAGE_MASK;
- phys_addr &= PAGE_MASK;
+ phys_addr = (phys_addr >> PAGE_SHIFT) << PAGE_SHIFT;
size = PAGE_ALIGN(last_addr+1) - phys_addr;

retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
diff --git a/include/linux/io.h b/include/linux/io.h
index 6c7f0ba..7fd2d21 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -29,10 +29,10 @@ void __iowrite64_copy(void __iomem *to, const void *from, size_t count);

#ifdef CONFIG_MMU
int ioremap_page_range(unsigned long addr, unsigned long end,
- unsigned long phys_addr, pgprot_t prot);
+ phys_addr_t phys_addr, pgprot_t prot);
#else
static inline int ioremap_page_range(unsigned long addr, unsigned long end,
- unsigned long phys_addr, pgprot_t prot)
+ phys_addr_t phys_addr, pgprot_t prot)
{
return 0;
}
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 227c2a5..de05e96 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -30,7 +30,7 @@ struct vm_struct {
unsigned long flags;
struct page **pages;
unsigned int nr_pages;
- unsigned long phys_addr;
+ phys_addr_t phys_addr;
void *caller;
};

diff --git a/lib/ioremap.c b/lib/ioremap.c
index 14c6078..5730ecd 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -13,10 +13,10 @@
#include <asm/pgtable.h>

static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pte_t *pte;
- unsigned long pfn;
+ u64 pfn;

pfn = phys_addr >> PAGE_SHIFT;
pte = pte_alloc_kernel(pmd, addr);
@@ -31,7 +31,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
}

static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pmd_t *pmd;
unsigned long next;
@@ -49,7 +49,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
}

static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pud_t *pud;
unsigned long next;
@@ -67,7 +67,7 @@ static inline int ioremap_pud_range(pgd_t *pgd, unsigned long addr,
}

int ioremap_page_range(unsigned long addr,
- unsigned long end, unsigned long phys_addr, pgprot_t prot)
+ unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
{
pgd_t *pgd;
unsigned long start;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ae00746..b7e314b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2403,7 +2403,7 @@ static int s_show(struct seq_file *m, void *p)
seq_printf(m, " pages=%d", v->nr_pages);

if (v->phys_addr)
- seq_printf(m, " phys=%lx", v->phys_addr);
+ seq_printf(m, " phys=%llx", (unsigned long long)v->phys_addr);

if (v->flags & VM_IOREMAP)
seq_printf(m, " ioremap");

2010-07-09 18:44:12

by H. Peter Anvin

Subject: Re: [tip:x86/mm] x86, pae: Fix handling of large physical addresses in ioremap

On 07/09/2010 11:31 AM, tip-bot for Kenji Kaneshige wrote:
> Commit-ID: 39d8c3ff39443825b6a21b28249fc4904809203f
> Gitweb: http://git.kernel.org/tip/39d8c3ff39443825b6a21b28249fc4904809203f
> Author: Kenji Kaneshige <[email protected]>
> AuthorDate: Thu, 17 Jun 2010 10:30:06 +0900
> Committer: H. Peter Anvin <[email protected]>
> CommitDate: Fri, 9 Jul 2010 10:51:39 -0700
>
> x86, pae: Fix handling of large physical addresses in ioremap
>
> Current x86 ioremap() doesn't handle physical addresses higher than
> 32 bits properly in X86_32 PAE mode. When a physical address higher
> than 32 bits is passed to ioremap(), the upper 32 bits of the physical
> address are cleared wrongly. Due to this bug, ioremap() can map the
> wrong address into the linear address space.
>
> In my case, a 64-bit MMIO region was assigned to a PCI device (an ioat
> device) on my system. Because of the ioremap() bug, a wrong physical
> address (instead of the MMIO region) was mapped into the linear
> address space. Because of this, loading the ioatdma driver caused
> unexpected behavior (kernel panic, kernel hangup, ...).
>

Sorry, pushed the wrong version of this patch. I will push the correct
one shortly.

-hpa