LinuxLists.cc - [PATCH] x86: add phys addr validity check for /dev/mem mmap

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/02/2013 05:28 AM, Frantisek Hrbata wrote:
>
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index d8e8eef..39607c6 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
> #endif
> }
>
> +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
> +extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
> +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
> +
> #endif /* __KERNEL__ */
>
> extern void native_io_delay(void);
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 845df68..92ec31c 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -31,6 +31,8 @@
> #include <linux/sched.h>
> #include <asm/elf.h>
>
> +#include "physaddr.h"
> +
> struct __read_mostly va_alignment va_align = {
> .flags = -1,
> };
> @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
> mm->unmap_area = arch_unmap_area_topdown;
> }
> }
> +
> +int valid_phys_addr_range(phys_addr_t addr, size_t count)
> +{
> + return addr + count <= __pa(high_memory);
> +}
> +
> +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
> +{
> + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
> + return phys_addr_valid(addr);
> +}
>

Good initiative, but I think the implementation is worong. I suspect we
should use the number of physical address bits supported rather than
high_memory, since it is common and legal to use /dev/mem to access I/O
resources that are beyond the last byte of RAM.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2013-04-02 19:10:27

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On Tue, Apr 02, 2013 at 11:48:34AM -0700, H. Peter Anvin wrote:
> On 04/02/2013 05:28 AM, Frantisek Hrbata wrote:
> >
> > diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> > index d8e8eef..39607c6 100644
> > --- a/arch/x86/include/asm/io.h
> > +++ b/arch/x86/include/asm/io.h
> > @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
> > #endif
> > }
> >
> > +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
> > +extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
> > +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
> > +
> > #endif /* __KERNEL__ */
> >
> > extern void native_io_delay(void);
> > diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> > index 845df68..92ec31c 100644
> > --- a/arch/x86/mm/mmap.c
> > +++ b/arch/x86/mm/mmap.c
> > @@ -31,6 +31,8 @@
> > #include <linux/sched.h>
> > #include <asm/elf.h>
> >
> > +#include "physaddr.h"
> > +
> > struct __read_mostly va_alignment va_align = {
> > .flags = -1,
> > };
> > @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
> > mm->unmap_area = arch_unmap_area_topdown;
> > }
> > }
> > +
> > +int valid_phys_addr_range(phys_addr_t addr, size_t count)
> > +{
> > + return addr + count <= __pa(high_memory);
> > +}
> > +
> > +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
> > +{
> > + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
> > + return phys_addr_valid(addr);
> > +}
> >
>
> Good initiative, but I think the implementation is worong. I suspect we
> should use the number of physical address bits supported rather than
> high_memory, since it is common and legal to use /dev/mem to access I/O
> resources that are beyond the last byte of RAM.
>
> -hpa

Hi, this is exactly what the patch is doing imho. Note that the
valid_phys_addr_range(), which is using the high_memory, is the same as the
default one in drivers/char/mem.c(#ifndef ARCH_HAS_VALID_PHYS_ADDR_RANGE). I
just added x86 specific check for valid_mmap_phys_addr_range and moved both
functions to arch/x86/mm/mmap.c, rather then modifying the default generic ones.
This is how other archs(arm) are doing it.

Also valid_phys_addr_range is used just in read|write_mem and
valid_mmap_phys_addr_range is checked in mmap_mem and it calls phys_addr_valid

static inline int phys_addr_valid(resource_size_t addr)
{
#ifdef CONFIG_PHYS_ADDR_T_64BIT
return !(addr >> boot_cpu_data.x86_phys_bits);
#else
return 1;
#endif
}

I for sure could overlooked something, but this seems right to me.

Thank you!

>
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>

--
Frantisek Hrbata

2013-04-02 20:29:37

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/02/2013 12:10 PM, Frantisek Hrbata wrote:
>
> Hi, this is exactly what the patch is doing imho. Note that the
> valid_phys_addr_range(), which is using the high_memory, is the same as the
> default one in drivers/char/mem.c(#ifndef ARCH_HAS_VALID_PHYS_ADDR_RANGE). I
> just added x86 specific check for valid_mmap_phys_addr_range and moved both
> functions to arch/x86/mm/mmap.c, rather then modifying the default generic ones.
> This is how other archs(arm) are doing it.
>
> Also valid_phys_addr_range is used just in read|write_mem and
> valid_mmap_phys_addr_range is checked in mmap_mem and it calls phys_addr_valid
>
> static inline int phys_addr_valid(resource_size_t addr)
> {
> #ifdef CONFIG_PHYS_ADDR_T_64BIT
> return !(addr >> boot_cpu_data.x86_phys_bits);
> #else
> return 1;
> #endif
> }
>
> I for sure could overlooked something, but this seems right to me.
>

OK, this is really confusing ... which isn't a *huge* surprise (the
entire /dev/mem code has some gigantic bugs in it.)

I think I need to do more of an in-depth review. The other question is
why we don't call phys_addr_valid() everywhere.

-hpa

2013-04-02 20:52:28

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On Tue, Apr 02, 2013 at 01:29:12PM -0700, H. Peter Anvin wrote:
> On 04/02/2013 12:10 PM, Frantisek Hrbata wrote:
> >
> > Hi, this is exactly what the patch is doing imho. Note that the
> > valid_phys_addr_range(), which is using the high_memory, is the same as the
> > default one in drivers/char/mem.c(#ifndef ARCH_HAS_VALID_PHYS_ADDR_RANGE). I
> > just added x86 specific check for valid_mmap_phys_addr_range and moved both
> > functions to arch/x86/mm/mmap.c, rather then modifying the default generic ones.
> > This is how other archs(arm) are doing it.
> >
> > Also valid_phys_addr_range is used just in read|write_mem and
> > valid_mmap_phys_addr_range is checked in mmap_mem and it calls phys_addr_valid
> >
> > static inline int phys_addr_valid(resource_size_t addr)
> > {
> > #ifdef CONFIG_PHYS_ADDR_T_64BIT
> > return !(addr >> boot_cpu_data.x86_phys_bits);
> > #else
> > return 1;
> > #endif
> > }
> >
> > I for sure could overlooked something, but this seems right to me.
> >
>
> OK, this is really confusing ... which isn't a *huge* surprise (the
> entire /dev/mem code has some gigantic bugs in it.)
>
> I think I need to do more of an in-depth review. The other question is
> why we don't call phys_addr_valid() everywhere.

I'm not going to pretend I understand the code, but IMHO the
valid_phys_addr_range and valid_mmap_phys_addr_range in drivers/char/mem.c are
generic for all archs. If some arch wants specific version of those functions it
defines them in the arch specific code and define ARCH_HAS_VALID_PHYS_ADDR_RANGE.
The phys_addr_valid is x86 specific defined in arch/x86/mm/physaddr.h, so IMHO
it cannot be used in the generic checks. For example ARM has it's specific
checks in arch/arm/mm/mmap.c.

I reused phys_addr_valid because it is already used in ioremap(__ioremap_caller)
for the same purpose imho.

Thank you for looking into this.

>
> -hpa
>
>

--
Frantisek Hrbata

2013-04-04 01:12:01

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Hi H.Peter,
On 04/03/2013 02:48 AM, H. Peter Anvin wrote:
> On 04/02/2013 05:28 AM, Frantisek Hrbata wrote:
>> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>> index d8e8eef..39607c6 100644
>> --- a/arch/x86/include/asm/io.h
>> +++ b/arch/x86/include/asm/io.h
>> @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
>> #endif
>> }
>>
>> +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
>> +extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
>> +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
>> +
>> #endif /* __KERNEL__ */
>>
>> extern void native_io_delay(void);
>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>> index 845df68..92ec31c 100644
>> --- a/arch/x86/mm/mmap.c
>> +++ b/arch/x86/mm/mmap.c
>> @@ -31,6 +31,8 @@
>> #include <linux/sched.h>
>> #include <asm/elf.h>
>>
>> +#include "physaddr.h"
>> +
>> struct __read_mostly va_alignment va_align = {
>> .flags = -1,
>> };
>> @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>> mm->unmap_area = arch_unmap_area_topdown;
>> }
>> }
>> +
>> +int valid_phys_addr_range(phys_addr_t addr, size_t count)
>> +{
>> + return addr + count <= __pa(high_memory);
>> +}
>> +
>> +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
>> +{
>> + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
>> + return phys_addr_valid(addr);
>> +}
>>

Why we consider boot_cpu_data.x86_phys_bits instead of e820 map here?

>
> -hpa
>

2013-04-04 01:14:07

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/03/2013 06:11 PM, Simon Jeons wrote:
>
> Why we consider boot_cpu_data.x86_phys_bits instead of e820 map here?
>

Because x86_phys_bits is what controls how much address space the
processor has. e820 tells us how much *RAM* the machine has, or
specifically, how much RAM the machine had on boot.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2013-04-04 01:17:36

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Hi H.Peter,
On 04/04/2013 09:13 AM, H. Peter Anvin wrote:
> On 04/03/2013 06:11 PM, Simon Jeons wrote:
>> Why we consider boot_cpu_data.x86_phys_bits instead of e820 map here?
>>
> Because x86_phys_bits is what controls how much address space the
> processor has. e820 tells us how much *RAM* the machine has, or
> specifically, how much RAM the machine had on boot.

e820 also contain mmio, correct? So cpu should not access address beyond
e820 map(RAM+MMIO).

>
> -hpa
>

2013-04-04 01:32:36

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/03/2013 06:17 PM, Simon Jeons wrote:
>
> e820 also contain mmio, correct?

No.

> So cpu should not access address beyond
> e820 map(RAM+MMIO).

No.

-hpa

2013-04-04 01:53:22

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/04/2013 09:32 AM, H. Peter Anvin wrote:
> On 04/03/2013 06:17 PM, Simon Jeons wrote:
>> e820 also contain mmio, correct?
> No.
>
>> So cpu should not access address beyond
>> e820 map(RAM+MMIO).
> No.
>
> -hpa
>
>

One offline question, why can't check git log before 2005?

2013-04-04 02:15:22

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Because git didn't exist before then?

Simon Jeons <[email protected]> wrote:

>On 04/04/2013 09:32 AM, H. Peter Anvin wrote:
>> On 04/03/2013 06:17 PM, Simon Jeons wrote:
>>> e820 also contain mmio, correct?
>> No.
>>
>>> So cpu should not access address beyond
>>> e820 map(RAM+MMIO).
>> No.
>>
>> -hpa
>>
>>
>
>One offline question, why can't check git log before 2005?

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-04-04 02:17:37

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/04/2013 10:14 AM, H. Peter Anvin wrote:
> Because git didn't exist before then?

Oh, I see, thanks! :-)

>
> Simon Jeons <[email protected]> wrote:
>
>> On 04/04/2013 09:32 AM, H. Peter Anvin wrote:
>>> On 04/03/2013 06:17 PM, Simon Jeons wrote:
>>>> e820 also contain mmio, correct?
>>> No.
>>>
>>>> So cpu should not access address beyond
>>>> e820 map(RAM+MMIO).
>>> No.
>>>
>>> -hpa
>>>
>>>
>> One offline question, why can't check git log before 2005?

2013-04-04 05:21:04

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Hi H.Peter,
On 04/04/2013 09:32 AM, H. Peter Anvin wrote:
> On 04/03/2013 06:17 PM, Simon Jeons wrote:
>> e820 also contain mmio, correct?
> No.

How to check which address is used by mmio? /proc/iomem, correct?

>
>> So cpu should not access address beyond
>> e820 map(RAM+MMIO).
> No.
>
> -hpa
>
>

2013-04-11 02:40:37

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Hi H.Peter,
On 04/04/2013 09:13 AM, H. Peter Anvin wrote:
> On 04/03/2013 06:11 PM, Simon Jeons wrote:
>> Why we consider boot_cpu_data.x86_phys_bits instead of e820 map here?
>>
> Because x86_phys_bits is what controls how much address space the
> processor has. e820 tells us how much *RAM* the machine has, or
> specifically, how much RAM the machine had on boot.

I have 8GB memory in my machine, but when I accumulated every e820
ranges which dump in dmesg, there are 25MB memory less then 8GB(1024*8)
memory, why 25MB miss?

>
> -hpa
>

2013-04-11 02:49:14

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/10/2013 07:40 PM, Simon Jeons wrote:
> Hi H.Peter,
> On 04/04/2013 09:13 AM, H. Peter Anvin wrote:
>> On 04/03/2013 06:11 PM, Simon Jeons wrote:
>>> Why we consider boot_cpu_data.x86_phys_bits instead of e820 map here?
>>>
>> Because x86_phys_bits is what controls how much address space the
>> processor has. e820 tells us how much *RAM* the machine has, or
>> specifically, how much RAM the machine had on boot.
>
> I have 8GB memory in my machine, but when I accumulated every e820
> ranges which dump in dmesg, there are 25MB memory less then 8GB(1024*8)
> memory, why 25MB miss?
>

For whatever reason your BIOS is stealing some memory, possibly for video.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2013-04-11 02:58:54

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Hi H.Peter,
On 04/11/2013 10:48 AM, H. Peter Anvin wrote:
> On 04/10/2013 07:40 PM, Simon Jeons wrote:
>> Hi H.Peter,
>> On 04/04/2013 09:13 AM, H. Peter Anvin wrote:
>>> On 04/03/2013 06:11 PM, Simon Jeons wrote:
>>>> Why we consider boot_cpu_data.x86_phys_bits instead of e820 map here?
>>>>
>>> Because x86_phys_bits is what controls how much address space the
>>> processor has. e820 tells us how much *RAM* the machine has, or
>>> specifically, how much RAM the machine had on boot.
>> I have 8GB memory in my machine, but when I accumulated every e820
>> ranges which dump in dmesg, there are 25MB memory less then 8GB(1024*8)
>> memory, why 25MB miss?
>>
> For whatever reason your BIOS is stealing some memory, possibly for video.

Thanks for your quick response. ;-)
My machine is new which have i7 cpu. How much memory video need? 8MB?
Why I miss 25MB?

>
> -hpa
>
>

2013-04-24 11:37:43

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On Tue, Apr 02, 2013 at 01:29:12PM -0700, H. Peter Anvin wrote:
> On 04/02/2013 12:10 PM, Frantisek Hrbata wrote:
> >
> > Hi, this is exactly what the patch is doing imho. Note that the
> > valid_phys_addr_range(), which is using the high_memory, is the same as the
> > default one in drivers/char/mem.c(#ifndef ARCH_HAS_VALID_PHYS_ADDR_RANGE). I
> > just added x86 specific check for valid_mmap_phys_addr_range and moved both
> > functions to arch/x86/mm/mmap.c, rather then modifying the default generic ones.
> > This is how other archs(arm) are doing it.
> >
> > Also valid_phys_addr_range is used just in read|write_mem and
> > valid_mmap_phys_addr_range is checked in mmap_mem and it calls phys_addr_valid
> >
> > static inline int phys_addr_valid(resource_size_t addr)
> > {
> > #ifdef CONFIG_PHYS_ADDR_T_64BIT
> > return !(addr >> boot_cpu_data.x86_phys_bits);
> > #else
> > return 1;
> > #endif
> > }
> >
> > I for sure could overlooked something, but this seems right to me.
> >
>
> OK, this is really confusing ... which isn't a *huge* surprise (the
> entire /dev/mem code has some gigantic bugs in it.)
>
> I think I need to do more of an in-depth review. The other question is
> why we don't call phys_addr_valid() everywhere.
>
> -hpa

Hi Peter,

please, have you had a chance to look at this?

Many thanks

>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Frantisek Hrbata

2013-04-26 05:21:43

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Hi Peter,
On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
> When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
> X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
> PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
> page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
> for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
> the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
> with RSVD error code.
>
> <quote>
> RSVD flag (bit 3).
> This flag is 1 if there is no valid translation for the linear address because a
> reserved bit was set in one of the paging-structure entries used to translate
> that address. (Because reserved bits are not checked in a paging-structure entry
> whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
> set.)
> </quote>
>
> In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
> returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
> reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap

In this case, remap_pfn_range() setup the map and reserved bits for mmio
memory, so the mmio memory is already populated, why trigger #PF?

> on /dev/mem and cause system panic. It's probably not that serious, because
> access to /dev/mem is limited and the system has to have panic_on_oops set, but
> still I think we should check this and return error.
>
> This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
> as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
> requested phys addr is bigger then the supported phys addr width.
>
> [1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A
>
> Signed-off-by: Frantisek Hrbata <[email protected]>
> ---
> arch/x86/include/asm/io.h | 4 ++++
> arch/x86/mm/mmap.c | 13 +++++++++++++
> 2 files changed, 17 insertions(+)
>
> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> index d8e8eef..39607c6 100644
> --- a/arch/x86/include/asm/io.h
> +++ b/arch/x86/include/asm/io.h
> @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
> #endif
> }
>
> +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
> +extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
> +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
> +
> #endif /* __KERNEL__ */
>
> extern void native_io_delay(void);
> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> index 845df68..92ec31c 100644
> --- a/arch/x86/mm/mmap.c
> +++ b/arch/x86/mm/mmap.c
> @@ -31,6 +31,8 @@
> #include <linux/sched.h>
> #include <asm/elf.h>
>
> +#include "physaddr.h"
> +
> struct __read_mostly va_alignment va_align = {
> .flags = -1,
> };
> @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
> mm->unmap_area = arch_unmap_area_topdown;
> }
> }
> +
> +int valid_phys_addr_range(phys_addr_t addr, size_t count)
> +{
> + return addr + count <= __pa(high_memory);
> +}
> +
> +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
> +{
> + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
> + return phys_addr_valid(addr);
> +}

2013-04-26 15:35:29

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:
> Hi Peter,
> On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
> >When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
> >X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
> >PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
> >page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
> >for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
> >the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
> >with RSVD error code.
> >
> ><quote>
> >RSVD flag (bit 3).
> >This flag is 1 if there is no valid translation for the linear address because a
> >reserved bit was set in one of the paging-structure entries used to translate
> >that address. (Because reserved bits are not checked in a paging-structure entry
> >whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
> >set.)
> ></quote>
> >
> >In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
> >returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
> >reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap
>
> In this case, remap_pfn_range() setup the map and reserved bits for
> mmio memory, so the mmio memory is already populated, why trigger
> #PF?

Hi,

I think this is described in the quote above for the RSVD flag.

remap_pfn_range() => page present => touch page => tlb miss =>
walk through paging structures => reserved bit set => #pf with rsvd flag

I hope I didn't misunderstand your question.

Thanks

>
> >on /dev/mem and cause system panic. It's probably not that serious, because
> >access to /dev/mem is limited and the system has to have panic_on_oops set, but
> >still I think we should check this and return error.
> >
> >This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
> >as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
> >requested phys addr is bigger then the supported phys addr width.
> >
> >[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A
> >
> >Signed-off-by: Frantisek Hrbata <[email protected]>
> >---
> > arch/x86/include/asm/io.h | 4 ++++
> > arch/x86/mm/mmap.c | 13 +++++++++++++
> > 2 files changed, 17 insertions(+)
> >
> >diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> >index d8e8eef..39607c6 100644
> >--- a/arch/x86/include/asm/io.h
> >+++ b/arch/x86/include/asm/io.h
> >@@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
> > #endif
> > }
> >+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
> >+extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
> >+extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
> >+
> > #endif /* __KERNEL__ */
> > extern void native_io_delay(void);
> >diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> >index 845df68..92ec31c 100644
> >--- a/arch/x86/mm/mmap.c
> >+++ b/arch/x86/mm/mmap.c
> >@@ -31,6 +31,8 @@
> > #include <linux/sched.h>
> > #include <asm/elf.h>
> >+#include "physaddr.h"
> >+
> > struct __read_mostly va_alignment va_align = {
> > .flags = -1,
> > };
> >@@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
> > mm->unmap_area = arch_unmap_area_topdown;
> > }
> > }
> >+
> >+int valid_phys_addr_range(phys_addr_t addr, size_t count)
> >+{
> >+ return addr + count <= __pa(high_memory);
> >+}
> >+
> >+int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
> >+{
> >+ resource_size_t addr = (pfn << PAGE_SHIFT) + count;
> >+ return phys_addr_valid(addr);
> >+}
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Frantisek Hrbata

2013-04-27 07:00:21

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:
> On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:
>> Hi Peter,
>> On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
>>> When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
>>> X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
>>> PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
>>> page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
>>> for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
>>> the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
>>> with RSVD error code.
>>>
>>> <quote>
>>> RSVD flag (bit 3).
>>> This flag is 1 if there is no valid translation for the linear address because a
>>> reserved bit was set in one of the paging-structure entries used to translate
>>> that address. (Because reserved bits are not checked in a paging-structure entry
>>> whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
>>> set.)
>>> </quote>
>>>
>>> In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
>>> returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
>>> reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap
>> In this case, remap_pfn_range() setup the map and reserved bits for
>> mmio memory, so the mmio memory is already populated, why trigger
>> #PF?
> Hi,
>
> I think this is described in the quote above for the RSVD flag.
>
> remap_pfn_range() => page present => touch page => tlb miss =>
> walk through paging structures => reserved bit set => #pf with rsvd flag

Page present can also trigger #PF? why?

>
> I hope I didn't misunderstand your question.
>
> Thanks
>
>>> on /dev/mem and cause system panic. It's probably not that serious, because
>>> access to /dev/mem is limited and the system has to have panic_on_oops set, but
>>> still I think we should check this and return error.
>>>
>>> This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
>>> as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
>>> requested phys addr is bigger then the supported phys addr width.
>>>
>>> [1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A
>>>
>>> Signed-off-by: Frantisek Hrbata <[email protected]>
>>> ---
>>> arch/x86/include/asm/io.h | 4 ++++
>>> arch/x86/mm/mmap.c | 13 +++++++++++++
>>> 2 files changed, 17 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>>> index d8e8eef..39607c6 100644
>>> --- a/arch/x86/include/asm/io.h
>>> +++ b/arch/x86/include/asm/io.h
>>> @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
>>> #endif
>>> }
>>> +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
>>> +extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
>>> +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
>>> +
>>> #endif /* __KERNEL__ */
>>> extern void native_io_delay(void);
>>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>>> index 845df68..92ec31c 100644
>>> --- a/arch/x86/mm/mmap.c
>>> +++ b/arch/x86/mm/mmap.c
>>> @@ -31,6 +31,8 @@
>>> #include <linux/sched.h>
>>> #include <asm/elf.h>
>>> +#include "physaddr.h"
>>> +
>>> struct __read_mostly va_alignment va_align = {
>>> .flags = -1,
>>> };
>>> @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>>> mm->unmap_area = arch_unmap_area_topdown;
>>> }
>>> }
>>> +
>>> +int valid_phys_addr_range(phys_addr_t addr, size_t count)
>>> +{
>>> + return addr + count <= __pa(high_memory);
>>> +}
>>> +
>>> +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
>>> +{
>>> + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
>>> + return phys_addr_valid(addr);
>>> +}
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/

2013-04-27 19:14:11

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote:
> On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:
> >On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:
> >>Hi Peter,
> >>On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
> >>>When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
> >>>X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
> >>>PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
> >>>page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
> >>>for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
> >>>the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
> >>>with RSVD error code.
> >>>
> >>><quote>
> >>>RSVD flag (bit 3).
> >>>This flag is 1 if there is no valid translation for the linear address because a
> >>>reserved bit was set in one of the paging-structure entries used to translate
> >>>that address. (Because reserved bits are not checked in a paging-structure entry
> >>>whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
> >>>set.)
> >>></quote>
> >>>
> >>>In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
> >>>returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
> >>>reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap
> >>In this case, remap_pfn_range() setup the map and reserved bits for
> >>mmio memory, so the mmio memory is already populated, why trigger
> >>#PF?
> >Hi,
> >
> >I think this is described in the quote above for the RSVD flag.
> >
> >remap_pfn_range() => page present => touch page => tlb miss =>
> >walk through paging structures => reserved bit set => #pf with rsvd flag
>
> Page present can also trigger #PF? why?

Yes, please see
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A

4.7 PAGE-FAULT EXCEPTIONS
<quote>
? RSVD flag (bit 3).
This flag is 1 if there is no valid translation for the linear address because
a reserved bit was set in one of the paging-structure entries used to
translate that address. (Because reserved bits are not checked in a
paging-structure entry whose P flag is 0, bit 3 of the error code can be set
only if bit 0 is also set.) Bits reserved in the paging-structure entries are
reserved for future functionality. Software developers should be aware that
such bits may be used in the future and that a paging-structure entry that
causes a page-fault exception on one processor might not do so in the future.
</quote>

I cannot tell you why. I guess this is more a question for some Intel guys.

Anyway this patch is trying to fix the following problem and
the "Bad pagetable" oops.

---------------------------------8<--------------------------------------
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <err.h>
#include <stdlib.h>
#include <sys/mman.h>

#define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)

/*
1) Find some non system ram in case the CONFIG_STRICT_DEVMEM is defined
$ cat /proc/iomem | grep -v "$System RAM\|reserved$"

2) Find physical address width
$ cat /proc/cpuinfo | grep "address sizes"

PTE bits 51 - M are reserved, where M is physical address width found 2)
Note: step 2) is actually not needed, we can always set just the 51th bit
(0x8000000000000)

Set OFFSET macro to

(start of iomem range found in 1)) | (1 << 51)

for example
0x000a0000 | 0x8000000000000 = 0x80000000a0000

where 0x000a0000 is start of PCI BUS on my laptop

*/

#define OFFSET 0x80000000a0000LL

int main(int argc, char *argv[])
{
int fd;
long ps;
long pgoff;
char *map;
char c;

ps = sysconf(_SC_PAGE_SIZE);
if (ps == -1)
die("cannot get page size");

fd = open("/dev/mem", O_RDONLY);
if (fd == -1)
die("cannot open /dev/mem");

printf("%Lx\n", pgoff);
pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
printf("%Lx\n", pgoff);

map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
if (map == MAP_FAILED)
die("cannot mmap");

c = map[0];

if (munmap(map, ps) == -1)
die("cannot munmap");

if (close(fd) == -1)
die("cannot close");

return 0;
}
---------------------------------8<--------------------------------------

Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.814860] pfrsvd: Corrupted page table at address 7f34087c8000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.817356] PGD 12d0b3067 PUD 12d544067 PMD 12e29d067 PTE 80080000000a0225
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.820216] Bad pagetable: 000d [#1] SMP
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.822821] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep arc4 iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec uvcvideo snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt videobuf2_vmalloc videobuf2_memops videobuf2_core videodev btusb snd_page_alloc bluetooth snd_timer thinkpad_acpi iwlwifi media snd i2c_i801 cfg80211 iTCO_vendor_support intel_ips e1000e coretemp lpc_ich mfd_core soundcore rfkill mei microcode nfsd auth_rpcgss nfs_acl lockd sunrpc vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc uinput dm_crypt crc32c_intel i915 ghash_clmulni_intel firewire_ohci i2c_algo_bit drm_kms_helper firewire_core sdhci_pci crc_itu_t drm sdhci mmc_core i2c_core mxm_wmi video wmi
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845686] CPU 3
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845709] Pid: 8751, comm: pfrsvd Not tainted 3.8.1-201.fc18.x86_64 #1 LENOVO 4384AV1/4384AV1
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.852876] RIP: 0033:[<00000000004007db>] [<00000000004007db>] 0x4007da
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.856587] RSP: 002b:00007ffff5c12620 EFLAGS: 00010213
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.860296] RAX: 00007f34087c8000 RBX: 0000000000000000 RCX: 00000030fd4eed6a
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.864061] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.867878] RBP: 00007ffff5c12660 R08: 0000000000000003 R09: 00080000000a0000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.871706] R10: 0000000000000001 R11: 0000000000000206 R12: 00000000004005f0
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.875566] R13: 00007ffff5c12740 R14: 0000000000000000 R15: 0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.879490] FS: 00007f34087a0740(0000) GS:ffff880137d80000(0000) knlGS:0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.883447] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.887436] CR2: 00007f34087c8000 CR3: 0000000107509000 CR4: 00000000000007e0
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.891495] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.895603] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.899739] Process pfrsvd (pid: 8751, threadinfo ffff880104ea8000, task ffff88012d9e1760)
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.903944]
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.908169] RIP [<00000000004007db>] 0x4007da
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.912447] RSP <00007ffff5c12620>
Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.943802] ---[ end trace 1113d12a53145197 ]---

Please note the PTE value 80080000000a0225

HTH

Thank you
>
> >
> >I hope I didn't misunderstand your question.
> >
> >Thanks
> >
> >>>on /dev/mem and cause system panic. It's probably not that serious, because
> >>>access to /dev/mem is limited and the system has to have panic_on_oops set, but
> >>>still I think we should check this and return error.
> >>>
> >>>This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
> >>>as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
> >>>requested phys addr is bigger then the supported phys addr width.
> >>>
> >>>[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A
> >>>
> >>>Signed-off-by: Frantisek Hrbata <[email protected]>
> >>>---
> >>> arch/x86/include/asm/io.h | 4 ++++
> >>> arch/x86/mm/mmap.c | 13 +++++++++++++
> >>> 2 files changed, 17 insertions(+)
> >>>
> >>>diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
> >>>index d8e8eef..39607c6 100644
> >>>--- a/arch/x86/include/asm/io.h
> >>>+++ b/arch/x86/include/asm/io.h
> >>>@@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
> >>> #endif
> >>> }
> >>>+#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
> >>>+extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
> >>>+extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
> >>>+
> >>> #endif /* __KERNEL__ */
> >>> extern void native_io_delay(void);
> >>>diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
> >>>index 845df68..92ec31c 100644
> >>>--- a/arch/x86/mm/mmap.c
> >>>+++ b/arch/x86/mm/mmap.c
> >>>@@ -31,6 +31,8 @@
> >>> #include <linux/sched.h>
> >>> #include <asm/elf.h>
> >>>+#include "physaddr.h"
> >>>+
> >>> struct __read_mostly va_alignment va_align = {
> >>> .flags = -1,
> >>> };
> >>>@@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
> >>> mm->unmap_area = arch_unmap_area_topdown;
> >>> }
> >>> }
> >>>+
> >>>+int valid_phys_addr_range(phys_addr_t addr, size_t count)
> >>>+{
> >>>+ return addr + count <= __pa(high_memory);
> >>>+}
> >>>+
> >>>+int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
> >>>+{
> >>>+ resource_size_t addr = (pfn << PAGE_SHIFT) + count;
> >>>+ return phys_addr_valid(addr);
> >>>+}
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to [email protected]
> >>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at http://www.tux.org/lkml/
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Frantisek Hrbata

2013-04-28 03:17:58

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/28/2013 03:13 AM, Frantisek Hrbata wrote:
> On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote:
>> On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:
>>> On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:
>>>> Hi Peter,
>>>> On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
>>>>> When CR4.PAE is set, the 64b PTE's are used(ARCH_PHYS_ADDR_T_64BIT is set for
>>>>> X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some higher bits in 64b
>>>>> PTE are reserved and have to be set to zero. For example, for IA-32e and 4KB
>>>>> page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are reserved. So
>>>>> for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be zero. If one of
>>>>> the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF is generated
>>>>> with RSVD error code.
>>>>>
>>>>> <quote>
>>>>> RSVD flag (bit 3).
>>>>> This flag is 1 if there is no valid translation for the linear address because a
>>>>> reserved bit was set in one of the paging-structure entries used to translate
>>>>> that address. (Because reserved bits are not checked in a paging-structure entry
>>>>> whose P flag is 0, bit 3 of the error code can be set only if bit 0 is also
>>>>> set.)
>>>>> </quote>
>>>>>
>>>>> In mmap_mem() the first check is valid_mmap_phys_addr_range(), but it always
>>>>> returns 1 on x86. So it's possible to use any pgoff we want and to set the PTE's
>>>>> reserved bits in remap_pfn_range(). Meaning there is a possibility to use mmap
>>>> In this case, remap_pfn_range() setup the map and reserved bits for
>>>> mmio memory, so the mmio memory is already populated, why trigger
>>>> #PF?
>>> Hi,
>>>
>>> I think this is described in the quote above for the RSVD flag.
>>>
>>> remap_pfn_range() => page present => touch page => tlb miss =>
>>> walk through paging structures => reserved bit set => #pf with rsvd flag
>> Page present can also trigger #PF? why?
> Yes, please see
> Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A
>
> 4.7 PAGE-FAULT EXCEPTIONS
> <quote>
> ? RSVD flag (bit 3).
> This flag is 1 if there is no valid translation for the linear address because
> a reserved bit was set in one of the paging-structure entries used to
> translate that address. (Because reserved bits are not checked in a
> paging-structure entry whose P flag is 0, bit 3 of the error code can be set
> only if bit 0 is also set.) Bits reserved in the paging-structure entries are
> reserved for future functionality. Software developers should be aware that
> such bits may be used in the future and that a paging-structure entry that
> causes a page-fault exception on one processor might not do so in the future.
> </quote>
>
> I cannot tell you why. I guess this is more a question for some Intel guys.
>
> Anyway this patch is trying to fix the following problem and
> the "Bad pagetable" oops.
>
> ---------------------------------8<--------------------------------------
> #include <stdio.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <err.h>
> #include <stdlib.h>
> #include <sys/mman.h>
>
> #define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)
>
> /*
> 1) Find some non system ram in case the CONFIG_STRICT_DEVMEM is defined
> $ cat /proc/iomem | grep -v "$System RAM\|reserved$"
>
> 2) Find physical address width
> $ cat /proc/cpuinfo | grep "address sizes"
>
> PTE bits 51 - M are reserved, where M is physical address width found 2)
> Note: step 2) is actually not needed, we can always set just the 51th bit
> (0x8000000000000)

What's the meaning here? You trigger oops since the address is beyond
max address cpu supported or access to a reserved page? If the answer is
the latter, I'm think it's not right. For example, the kernel code/data
section is reserved in memory, kernel access it will trigger oops? I
don't think so.

>
> Set OFFSET macro to
>
> (start of iomem range found in 1)) | (1 << 51)
>
> for example
> 0x000a0000 | 0x8000000000000 = 0x80000000a0000
>
> where 0x000a0000 is start of PCI BUS on my laptop
>
> */
>
> #define OFFSET 0x80000000a0000LL
>
> int main(int argc, char *argv[])
> {
> int fd;
> long ps;
> long pgoff;
> char *map;
> char c;
>
> ps = sysconf(_SC_PAGE_SIZE);
> if (ps == -1)
> die("cannot get page size");
>
> fd = open("/dev/mem", O_RDONLY);
> if (fd == -1)
> die("cannot open /dev/mem");
>
> printf("%Lx\n", pgoff);
> pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
> printf("%Lx\n", pgoff);
>
> map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
> if (map == MAP_FAILED)
> die("cannot mmap");
>
> c = map[0];
>
> if (munmap(map, ps) == -1)
> die("cannot munmap");
>
> if (close(fd) == -1)
> die("cannot close");
>
> return 0;
> }
> ---------------------------------8<--------------------------------------
>
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.814860] pfrsvd: Corrupted page table at address 7f34087c8000
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.817356] PGD 12d0b3067 PUD 12d544067 PMD 12e29d067 PTE 80080000000a0225
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.820216] Bad pagetable: 000d [#1] SMP
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.822821] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep arc4 iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec uvcvideo snd_hwdep snd_seq snd_seq_device snd_pcm iTCO_wdt videobuf2_vmalloc videobuf2_memops videobuf2_core videodev btusb snd_page_alloc bluetooth snd_timer thinkpad_acpi iwlwifi media snd i2c_i801 cfg80211 iTCO_vendor_support intel_ips e1000e coretemp lpc_ich mfd_core soundcore rfkill mei microcode nfsd auth_rpcgss nfs_acl lockd sunrpc vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc uinput dm_crypt crc32c_intel i915 ghash_clmulni_intel firewire_ohci i2c_algo_bit drm_kms_helper firewire_core sdhci_pci crc_itu_t drm sdhci mmc_core i2c_core mxm_wmi video wmi
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845686] CPU 3
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845709] Pid: 8751, comm: pfrsvd Not tainted 3.8.1-201.fc18.x86_64 #1 LENOVO 4384AV1/4384AV1
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.852876] RIP: 0033:[<00000000004007db>] [<00000000004007db>] 0x4007da
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.856587] RSP: 002b:00007ffff5c12620 EFLAGS: 00010213
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.860296] RAX: 00007f34087c8000 RBX: 0000000000000000 RCX: 00000030fd4eed6a
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.864061] RDX: 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.867878] RBP: 00007ffff5c12660 R08: 0000000000000003 R09: 00080000000a0000
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.871706] R10: 0000000000000001 R11: 0000000000000206 R12: 00000000004005f0
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.875566] R13: 00007ffff5c12740 R14: 0000000000000000 R15: 0000000000000000
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.879490] FS: 00007f34087a0740(0000) GS:ffff880137d80000(0000) knlGS:0000000000000000
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.883447] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.887436] CR2: 00007f34087c8000 CR3: 0000000107509000 CR4: 00000000000007e0
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.891495] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.895603] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.899739] Process pfrsvd (pid: 8751, threadinfo ffff880104ea8000, task ffff88012d9e1760)
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.903944]
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.908169] RIP [<00000000004007db>] 0x4007da
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.912447] RSP <00007ffff5c12620>
> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.943802] ---[ end trace 1113d12a53145197 ]---
>
> Please note the PTE value 80080000000a0225
>
> HTH
>
> Thank you
>>> I hope I didn't misunderstand your question.
>>>
>>> Thanks
>>>
>>>>> on /dev/mem and cause system panic. It's probably not that serious, because
>>>>> access to /dev/mem is limited and the system has to have panic_on_oops set, but
>>>>> still I think we should check this and return error.
>>>>>
>>>>> This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set, the same way
>>>>> as it is already done in e.g. ioremap. With this fix mmap returns -EINVAL if the
>>>>> requested phys addr is bigger then the supported phys addr width.
>>>>>
>>>>> [1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A
>>>>>
>>>>> Signed-off-by: Frantisek Hrbata <[email protected]>
>>>>> ---
>>>>> arch/x86/include/asm/io.h | 4 ++++
>>>>> arch/x86/mm/mmap.c | 13 +++++++++++++
>>>>> 2 files changed, 17 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
>>>>> index d8e8eef..39607c6 100644
>>>>> --- a/arch/x86/include/asm/io.h
>>>>> +++ b/arch/x86/include/asm/io.h
>>>>> @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
>>>>> #endif
>>>>> }
>>>>> +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
>>>>> +extern int valid_phys_addr_range(phys_addr_t addr, size_t count);
>>>>> +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t count);
>>>>> +
>>>>> #endif /* __KERNEL__ */
>>>>> extern void native_io_delay(void);
>>>>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>>>>> index 845df68..92ec31c 100644
>>>>> --- a/arch/x86/mm/mmap.c
>>>>> +++ b/arch/x86/mm/mmap.c
>>>>> @@ -31,6 +31,8 @@
>>>>> #include <linux/sched.h>
>>>>> #include <asm/elf.h>
>>>>> +#include "physaddr.h"
>>>>> +
>>>>> struct __read_mostly va_alignment va_align = {
>>>>> .flags = -1,
>>>>> };
>>>>> @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>>>>> mm->unmap_area = arch_unmap_area_topdown;
>>>>> }
>>>>> }
>>>>> +
>>>>> +int valid_phys_addr_range(phys_addr_t addr, size_t count)
>>>>> +{
>>>>> + return addr + count <= __pa(high_memory);
>>>>> +}
>>>>> +
>>>>> +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
>>>>> +{
>>>>> + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
>>>>> + return phys_addr_valid(addr);
>>>>> +}
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at http://www.tux.org/lkml/
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/

2013-04-28 04:01:27

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Not reserved page, reserved bits in the page tables (which includes all bits beyond the maximum physical address.)

Will Huck <[email protected]> wrote:

>On 04/28/2013 03:13 AM, Frantisek Hrbata wrote:
>> On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote:
>>> On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:
>>>> On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:
>>>>> Hi Peter,
>>>>> On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
>>>>>> When CR4.PAE is set, the 64b PTE's are
>used(ARCH_PHYS_ADDR_T_64BIT is set for
>>>>>> X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some
>higher bits in 64b
>>>>>> PTE are reserved and have to be set to zero. For example, for
>IA-32e and 4KB
>>>>>> page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are
>reserved. So
>>>>>> for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be
>zero. If one of
>>>>>> the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF
>is generated
>>>>>> with RSVD error code.
>>>>>>
>>>>>> <quote>
>>>>>> RSVD flag (bit 3).
>>>>>> This flag is 1 if there is no valid translation for the linear
>address because a
>>>>>> reserved bit was set in one of the paging-structure entries used
>to translate
>>>>>> that address. (Because reserved bits are not checked in a
>paging-structure entry
>>>>>> whose P flag is 0, bit 3 of the error code can be set only if bit
>0 is also
>>>>>> set.)
>>>>>> </quote>
>>>>>>
>>>>>> In mmap_mem() the first check is valid_mmap_phys_addr_range(),
>but it always
>>>>>> returns 1 on x86. So it's possible to use any pgoff we want and
>to set the PTE's
>>>>>> reserved bits in remap_pfn_range(). Meaning there is a
>possibility to use mmap
>>>>> In this case, remap_pfn_range() setup the map and reserved bits
>for
>>>>> mmio memory, so the mmio memory is already populated, why trigger
>>>>> #PF?
>>>> Hi,
>>>>
>>>> I think this is described in the quote above for the RSVD flag.
>>>>
>>>> remap_pfn_range() => page present => touch page => tlb miss =>
>>>> walk through paging structures => reserved bit set => #pf with rsvd
>flag
>>> Page present can also trigger #PF? why?
>> Yes, please see
>> Intel 64 and IA-32 Architectures Software Developer's Manual, Volume
>3A
>>
>> 4.7 PAGE-FAULT EXCEPTIONS
>> <quote>
>> · RSVD flag (bit 3).
>> This flag is 1 if there is no valid translation for the linear
>address because
>> a reserved bit was set in one of the paging-structure entries used to
>> translate that address. (Because reserved bits are not checked in a
>> paging-structure entry whose P flag is 0, bit 3 of the error code can
>be set
>> only if bit 0 is also set.) Bits reserved in the paging-structure
>entries are
>> reserved for future functionality. Software developers should be
>aware that
>> such bits may be used in the future and that a paging-structure entry
>that
>> causes a page-fault exception on one processor might not do so in the
>future.
>> </quote>
>>
>> I cannot tell you why. I guess this is more a question for some Intel
>guys.
>>
>> Anyway this patch is trying to fix the following problem and
>> the "Bad pagetable" oops.
>>
>>
>---------------------------------8<--------------------------------------
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <err.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>>
>> #define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)
>>
>> /*
>> 1) Find some non system ram in case the CONFIG_STRICT_DEVMEM is
>defined
>> $ cat /proc/iomem | grep -v "$System RAM\|reserved$"
>>
>> 2) Find physical address width
>> $ cat /proc/cpuinfo | grep "address sizes"
>>
>> PTE bits 51 - M are reserved, where M is physical address width
>found 2)
>> Note: step 2) is actually not needed, we can always set just the
>51th bit
>> (0x8000000000000)
>
>What's the meaning here? You trigger oops since the address is beyond
>max address cpu supported or access to a reserved page? If the answer
>is
>the latter, I'm think it's not right. For example, the kernel code/data
>
>section is reserved in memory, kernel access it will trigger oops? I
>don't think so.
>
>>
>> Set OFFSET macro to
>>
>> (start of iomem range found in 1)) | (1 << 51)
>>
>> for example
>> 0x000a0000 | 0x8000000000000 = 0x80000000a0000
>>
>> where 0x000a0000 is start of PCI BUS on my laptop
>>
>> */
>>
>> #define OFFSET 0x80000000a0000LL
>>
>> int main(int argc, char *argv[])
>> {
>> int fd;
>> long ps;
>> long pgoff;
>> char *map;
>> char c;
>>
>> ps = sysconf(_SC_PAGE_SIZE);
>> if (ps == -1)
>> die("cannot get page size");
>>
>> fd = open("/dev/mem", O_RDONLY);
>> if (fd == -1)
>> die("cannot open /dev/mem");
>>
>> printf("%Lx\n", pgoff);
>> pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
>> printf("%Lx\n", pgoff);
>>
>> map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
>> if (map == MAP_FAILED)
>> die("cannot mmap");
>>
>> c = map[0];
>>
>> if (munmap(map, ps) == -1)
>> die("cannot munmap");
>>
>> if (close(fd) == -1)
>> die("cannot close");
>>
>> return 0;
>> }
>>
>---------------------------------8<--------------------------------------
>>
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.814860] pfrsvd: Corrupted
>page table at address 7f34087c8000
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.817356] PGD 12d0b3067 PUD
>12d544067 PMD 12e29d067 PTE 80080000000a0225
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.820216] Bad pagetable:
>000d [#1] SMP
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.822821] Modules linked in:
>fuse ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE
>nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle
>ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4
>nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
>nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables
>be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio
>libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core
>iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep arc4
>iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel
>snd_hda_codec uvcvideo snd_hwdep snd_seq snd_seq_device snd_pcm
>iTCO_wdt videobuf2_vmalloc videobuf2_memops videobuf2_core videodev
>btusb snd_page_alloc bluetooth snd_timer thinkpad_acpi iwlwifi media
>snd i2c_i801 cfg80211 iTCO_vendor_support intel_ips e1000e coretemp
>lpc_ich mfd_core soundcore rfkill mei microcode nfsd auth_rpcgss
>nfs_acl lockd sunrpc vhost_net tun macvtap macvlan kvm_intel kvm
>binfmt_misc uinput dm_crypt crc32c_intel i915 ghash_clmulni_intel
>firewire_ohci i2c_algo_bit drm_kms_helper firewire_core sdhci_pci
>crc_itu_t drm sdhci mmc_core i2c_core mxm_wmi video wmi
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845686] CPU 3
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845709] Pid: 8751, comm:
>pfrsvd Not tainted 3.8.1-201.fc18.x86_64 #1 LENOVO 4384AV1/4384AV1
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.852876] RIP:
>0033:[<00000000004007db>] [<00000000004007db>] 0x4007da
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.856587] RSP:
>002b:00007ffff5c12620 EFLAGS: 00010213
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.860296] RAX:
>00007f34087c8000 RBX: 0000000000000000 RCX: 00000030fd4eed6a
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.864061] RDX:
>0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.867878] RBP:
>00007ffff5c12660 R08: 0000000000000003 R09: 00080000000a0000
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.871706] R10:
>0000000000000001 R11: 0000000000000206 R12: 00000000004005f0
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.875566] R13:
>00007ffff5c12740 R14: 0000000000000000 R15: 0000000000000000
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.879490] FS:
>00007f34087a0740(0000) GS:ffff880137d80000(0000) knlGS:0000000000000000
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.883447] CS: 0010 DS: 0000
>ES: 0000 CR0: 0000000080050033
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.887436] CR2:
>00007f34087c8000 CR3: 0000000107509000 CR4: 00000000000007e0
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.891495] DR0:
>0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.895603] DR3:
>0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.899739] Process pfrsvd
>(pid: 8751, threadinfo ffff880104ea8000, task ffff88012d9e1760)
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.903944]
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.908169] RIP
>[<00000000004007db>] 0x4007da
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.912447] RSP
><00007ffff5c12620>
>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.943802] ---[ end trace
>1113d12a53145197 ]---
>>
>> Please note the PTE value 80080000000a0225
>>
>> HTH
>>
>> Thank you
>>>> I hope I didn't misunderstand your question.
>>>>
>>>> Thanks
>>>>
>>>>>> on /dev/mem and cause system panic. It's probably not that
>serious, because
>>>>>> access to /dev/mem is limited and the system has to have
>panic_on_oops set, but
>>>>>> still I think we should check this and return error.
>>>>>>
>>>>>> This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set,
>the same way
>>>>>> as it is already done in e.g. ioremap. With this fix mmap returns
>-EINVAL if the
>>>>>> requested phys addr is bigger then the supported phys addr width.
>>>>>>
>>>>>> [1] Intel 64 and IA-32 Architectures Software Developer's Manual,
>Volume 3A
>>>>>>
>>>>>> Signed-off-by: Frantisek Hrbata <[email protected]>
>>>>>> ---
>>>>>> arch/x86/include/asm/io.h | 4 ++++
>>>>>> arch/x86/mm/mmap.c | 13 +++++++++++++
>>>>>> 2 files changed, 17 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/io.h
>b/arch/x86/include/asm/io.h
>>>>>> index d8e8eef..39607c6 100644
>>>>>> --- a/arch/x86/include/asm/io.h
>>>>>> +++ b/arch/x86/include/asm/io.h
>>>>>> @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
>>>>>> #endif
>>>>>> }
>>>>>> +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
>>>>>> +extern int valid_phys_addr_range(phys_addr_t addr, size_t
>count);
>>>>>> +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t
>count);
>>>>>> +
>>>>>> #endif /* __KERNEL__ */
>>>>>> extern void native_io_delay(void);
>>>>>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>>>>>> index 845df68..92ec31c 100644
>>>>>> --- a/arch/x86/mm/mmap.c
>>>>>> +++ b/arch/x86/mm/mmap.c
>>>>>> @@ -31,6 +31,8 @@
>>>>>> #include <linux/sched.h>
>>>>>> #include <asm/elf.h>
>>>>>> +#include "physaddr.h"
>>>>>> +
>>>>>> struct __read_mostly va_alignment va_align = {
>>>>>> .flags = -1,
>>>>>> };
>>>>>> @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct
>*mm)
>>>>>> mm->unmap_area = arch_unmap_area_topdown;
>>>>>> }
>>>>>> }
>>>>>> +
>>>>>> +int valid_phys_addr_range(phys_addr_t addr, size_t count)
>>>>>> +{
>>>>>> + return addr + count <= __pa(high_memory);
>>>>>> +}
>>>>>> +
>>>>>> +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
>>>>>> +{
>>>>>> + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
>>>>>> + return phys_addr_valid(addr);
>>>>>> +}
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe
>linux-kernel" in
>>>>> the body of a message to [email protected]
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> Please read the FAQ at http://www.tux.org/lkml/
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>linux-kernel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-04-28 08:04:09

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

Hi Peter,
On 04/28/2013 12:00 PM, H. Peter Anvin wrote:
> Not reserved page, reserved bits in the page tables (which includes all bits beyond the maximum physical address.)

Thanks for your clarify. When these reserved bits are set?

Another question, if configure UMA to fake numa can get benefit?

>
> Will Huck <[email protected]> wrote:
>
>> On 04/28/2013 03:13 AM, Frantisek Hrbata wrote:
>>> On Sat, Apr 27, 2013 at 03:00:11PM +0800, Will Huck wrote:
>>>> On 04/26/2013 11:35 PM, Frantisek Hrbata wrote:
>>>>> On Fri, Apr 26, 2013 at 01:21:28PM +0800, Will Huck wrote:
>>>>>> Hi Peter,
>>>>>> On 04/02/2013 08:28 PM, Frantisek Hrbata wrote:
>>>>>>> When CR4.PAE is set, the 64b PTE's are
>> used(ARCH_PHYS_ADDR_T_64BIT is set for
>>>>>>> X86_64 || X86_PAE). According to [1] Chapter 4 Paging, some
>> higher bits in 64b
>>>>>>> PTE are reserved and have to be set to zero. For example, for
>> IA-32e and 4KB
>>>>>>> page [1] 4.5 IA-32e Paging: Table 4-19, bits 51-M(MAXPHYADDR) are
>> reserved. So
>>>>>>> for a CPU with e.g. 48bit phys addr width, bits 51-48 have to be
>> zero. If one of
>>>>>>> the reserved bits is set, [1] 4.7 Page-Fault Exceptions, the #PF
>> is generated
>>>>>>> with RSVD error code.
>>>>>>>
>>>>>>> <quote>
>>>>>>> RSVD flag (bit 3).
>>>>>>> This flag is 1 if there is no valid translation for the linear
>> address because a
>>>>>>> reserved bit was set in one of the paging-structure entries used
>> to translate
>>>>>>> that address. (Because reserved bits are not checked in a
>> paging-structure entry
>>>>>>> whose P flag is 0, bit 3 of the error code can be set only if bit
>> 0 is also
>>>>>>> set.)
>>>>>>> </quote>
>>>>>>>
>>>>>>> In mmap_mem() the first check is valid_mmap_phys_addr_range(),
>> but it always
>>>>>>> returns 1 on x86. So it's possible to use any pgoff we want and
>> to set the PTE's
>>>>>>> reserved bits in remap_pfn_range(). Meaning there is a
>> possibility to use mmap
>>>>>> In this case, remap_pfn_range() setup the map and reserved bits
>> for
>>>>>> mmio memory, so the mmio memory is already populated, why trigger
>>>>>> #PF?
>>>>> Hi,
>>>>>
>>>>> I think this is described in the quote above for the RSVD flag.
>>>>>
>>>>> remap_pfn_range() => page present => touch page => tlb miss =>
>>>>> walk through paging structures => reserved bit set => #pf with rsvd
>> flag
>>>> Page present can also trigger #PF? why?
>>> Yes, please see
>>> Intel 64 and IA-32 Architectures Software Developer's Manual, Volume
>> 3A
>>> 4.7 PAGE-FAULT EXCEPTIONS
>>> <quote>
>>> · RSVD flag (bit 3).
>>> This flag is 1 if there is no valid translation for the linear
>> address because
>>> a reserved bit was set in one of the paging-structure entries used to
>>> translate that address. (Because reserved bits are not checked in a
>>> paging-structure entry whose P flag is 0, bit 3 of the error code can
>> be set
>>> only if bit 0 is also set.) Bits reserved in the paging-structure
>> entries are
>>> reserved for future functionality. Software developers should be
>> aware that
>>> such bits may be used in the future and that a paging-structure entry
>> that
>>> causes a page-fault exception on one processor might not do so in the
>> future.
>>> </quote>
>>>
>>> I cannot tell you why. I guess this is more a question for some Intel
>> guys.
>>> Anyway this patch is trying to fix the following problem and
>>> the "Bad pagetable" oops.
>>>
>>>
>> ---------------------------------8<--------------------------------------
>>> #include <stdio.h>
>>> #include <unistd.h>
>>> #include <sys/types.h>
>>> #include <sys/stat.h>
>>> #include <fcntl.h>
>>> #include <err.h>
>>> #include <stdlib.h>
>>> #include <sys/mman.h>
>>>
>>> #define die(fmt, ...) err(1, fmt, ##__VA_ARGS__)
>>>
>>> /*
>>> 1) Find some non system ram in case the CONFIG_STRICT_DEVMEM is
>> defined
>>> $ cat /proc/iomem | grep -v "$System RAM\|reserved$"
>>>
>>> 2) Find physical address width
>>> $ cat /proc/cpuinfo | grep "address sizes"
>>>
>>> PTE bits 51 - M are reserved, where M is physical address width
>> found 2)
>>> Note: step 2) is actually not needed, we can always set just the
>> 51th bit
>>> (0x8000000000000)
>> What's the meaning here? You trigger oops since the address is beyond
>> max address cpu supported or access to a reserved page? If the answer
>> is
>> the latter, I'm think it's not right. For example, the kernel code/data
>>
>> section is reserved in memory, kernel access it will trigger oops? I
>> don't think so.
>>
>>> Set OFFSET macro to
>>>
>>> (start of iomem range found in 1)) | (1 << 51)
>>>
>>> for example
>>> 0x000a0000 | 0x8000000000000 = 0x80000000a0000
>>>
>>> where 0x000a0000 is start of PCI BUS on my laptop
>>>
>>> */
>>>
>>> #define OFFSET 0x80000000a0000LL
>>>
>>> int main(int argc, char *argv[])
>>> {
>>> int fd;
>>> long ps;
>>> long pgoff;
>>> char *map;
>>> char c;
>>>
>>> ps = sysconf(_SC_PAGE_SIZE);
>>> if (ps == -1)
>>> die("cannot get page size");
>>>
>>> fd = open("/dev/mem", O_RDONLY);
>>> if (fd == -1)
>>> die("cannot open /dev/mem");
>>>
>>> printf("%Lx\n", pgoff);
>>> pgoff = (OFFSET + (ps - 1)) & ~(ps - 1);
>>> printf("%Lx\n", pgoff);
>>>
>>> map = mmap(NULL, ps, PROT_READ, MAP_SHARED, fd, pgoff);
>>> if (map == MAP_FAILED)
>>> die("cannot mmap");
>>>
>>> c = map[0];
>>>
>>> if (munmap(map, ps) == -1)
>>> die("cannot munmap");
>>>
>>> if (close(fd) == -1)
>>> die("cannot close");
>>>
>>> return 0;
>>> }
>>>
>> ---------------------------------8<--------------------------------------
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.814860] pfrsvd: Corrupted
>> page table at address 7f34087c8000
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.817356] PGD 12d0b3067 PUD
>> 12d544067 PMD 12e29d067 PTE 80080000000a0225
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.820216] Bad pagetable:
>> 000d [#1] SMP
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.822821] Modules linked in:
>> fuse ebtable_nat xt_CHECKSUM bridge stp llc ipt_MASQUERADE
>> nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle
>> ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4
>> nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
>> nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables
>> be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio
>> libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core
>> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep arc4
>> iwldvm mac80211 snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel
>> snd_hda_codec uvcvideo snd_hwdep snd_seq snd_seq_device snd_pcm
>> iTCO_wdt videobuf2_vmalloc videobuf2_memops videobuf2_core videodev
>> btusb snd_page_alloc bluetooth snd_timer thinkpad_acpi iwlwifi media
>> snd i2c_i801 cfg80211 iTCO_vendor_support intel_ips e1000e coretemp
>> lpc_ich mfd_core soundcore rfkill mei microcode nfsd auth_rpcgss
>> nfs_acl lockd sunrpc vhost_net tun macvtap macvlan kvm_intel kvm
>> binfmt_misc uinput dm_crypt crc32c_intel i915 ghash_clmulni_intel
>> firewire_ohci i2c_algo_bit drm_kms_helper firewire_core sdhci_pci
>> crc_itu_t drm sdhci mmc_core i2c_core mxm_wmi video wmi
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845686] CPU 3
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.845709] Pid: 8751, comm:
>> pfrsvd Not tainted 3.8.1-201.fc18.x86_64 #1 LENOVO 4384AV1/4384AV1
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.852876] RIP:
>> 0033:[<00000000004007db>] [<00000000004007db>] 0x4007da
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.856587] RSP:
>> 002b:00007ffff5c12620 EFLAGS: 00010213
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.860296] RAX:
>> 00007f34087c8000 RBX: 0000000000000000 RCX: 00000030fd4eed6a
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.864061] RDX:
>> 0000000000000001 RSI: 0000000000001000 RDI: 0000000000000000
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.867878] RBP:
>> 00007ffff5c12660 R08: 0000000000000003 R09: 00080000000a0000
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.871706] R10:
>> 0000000000000001 R11: 0000000000000206 R12: 00000000004005f0
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.875566] R13:
>> 00007ffff5c12740 R14: 0000000000000000 R15: 0000000000000000
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.879490] FS:
>> 00007f34087a0740(0000) GS:ffff880137d80000(0000) knlGS:0000000000000000
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.883447] CS: 0010 DS: 0000
>> ES: 0000 CR0: 0000000080050033
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.887436] CR2:
>> 00007f34087c8000 CR3: 0000000107509000 CR4: 00000000000007e0
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.891495] DR0:
>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.895603] DR3:
>> 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.899739] Process pfrsvd
>> (pid: 8751, threadinfo ffff880104ea8000, task ffff88012d9e1760)
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.903944]
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.908169] RIP
>> [<00000000004007db>] 0x4007da
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.912447] RSP
>> <00007ffff5c12620>
>>> Apr 27 19:52:29 dhcp-26-164 kernel: [ 6464.943802] ---[ end trace
>> 1113d12a53145197 ]---
>>> Please note the PTE value 80080000000a0225
>>>
>>> HTH
>>>
>>> Thank you
>>>>> I hope I didn't misunderstand your question.
>>>>>
>>>>> Thanks
>>>>>
>>>>>>> on /dev/mem and cause system panic. It's probably not that
>> serious, because
>>>>>>> access to /dev/mem is limited and the system has to have
>> panic_on_oops set, but
>>>>>>> still I think we should check this and return error.
>>>>>>>
>>>>>>> This patch adds check for x86 when ARCH_PHYS_ADDR_T_64BIT is set,
>> the same way
>>>>>>> as it is already done in e.g. ioremap. With this fix mmap returns
>> -EINVAL if the
>>>>>>> requested phys addr is bigger then the supported phys addr width.
>>>>>>>
>>>>>>> [1] Intel 64 and IA-32 Architectures Software Developer's Manual,
>> Volume 3A
>>>>>>> Signed-off-by: Frantisek Hrbata <[email protected]>
>>>>>>> ---
>>>>>>> arch/x86/include/asm/io.h | 4 ++++
>>>>>>> arch/x86/mm/mmap.c | 13 +++++++++++++
>>>>>>> 2 files changed, 17 insertions(+)
>>>>>>>
>>>>>>> diff --git a/arch/x86/include/asm/io.h
>> b/arch/x86/include/asm/io.h
>>>>>>> index d8e8eef..39607c6 100644
>>>>>>> --- a/arch/x86/include/asm/io.h
>>>>>>> +++ b/arch/x86/include/asm/io.h
>>>>>>> @@ -242,6 +242,10 @@ static inline void flush_write_buffers(void)
>>>>>>> #endif
>>>>>>> }
>>>>>>> +#define ARCH_HAS_VALID_PHYS_ADDR_RANGE
>>>>>>> +extern int valid_phys_addr_range(phys_addr_t addr, size_t
>> count);
>>>>>>> +extern int valid_mmap_phys_addr_range(unsigned long pfn, size_t
>> count);
>>>>>>> +
>>>>>>> #endif /* __KERNEL__ */
>>>>>>> extern void native_io_delay(void);
>>>>>>> diff --git a/arch/x86/mm/mmap.c b/arch/x86/mm/mmap.c
>>>>>>> index 845df68..92ec31c 100644
>>>>>>> --- a/arch/x86/mm/mmap.c
>>>>>>> +++ b/arch/x86/mm/mmap.c
>>>>>>> @@ -31,6 +31,8 @@
>>>>>>> #include <linux/sched.h>
>>>>>>> #include <asm/elf.h>
>>>>>>> +#include "physaddr.h"
>>>>>>> +
>>>>>>> struct __read_mostly va_alignment va_align = {
>>>>>>> .flags = -1,
>>>>>>> };
>>>>>>> @@ -122,3 +124,14 @@ void arch_pick_mmap_layout(struct mm_struct
>> *mm)
>>>>>>> mm->unmap_area = arch_unmap_area_topdown;
>>>>>>> }
>>>>>>> }
>>>>>>> +
>>>>>>> +int valid_phys_addr_range(phys_addr_t addr, size_t count)
>>>>>>> +{
>>>>>>> + return addr + count <= __pa(high_memory);
>>>>>>> +}
>>>>>>> +
>>>>>>> +int valid_mmap_phys_addr_range(unsigned long pfn, size_t count)
>>>>>>> +{
>>>>>>> + resource_size_t addr = (pfn << PAGE_SHIFT) + count;
>>>>>>> + return phys_addr_valid(addr);
>>>>>>> +}
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe
>> linux-kernel" in
>>>>>> the body of a message to [email protected]
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>> Please read the FAQ at http://www.tux.org/lkml/
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>> linux-kernel" in
>>>> the body of a message to [email protected]
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at http://www.tux.org/lkml/

2013-05-01 18:19:26

by Dave Hansen

[permalink] [raw]

Subject: Re: [PATCH] x86: add phys addr validity check for /dev/mem mmap

On 04/27/2013 08:17 PM, Will Huck wrote:
>> PTE bits 51 - M are reserved, where M is physical address width
>> found 2)
>> Note: step 2) is actually not needed, we can always set just the
>> 51th bit
>> (0x8000000000000)
>
> What's the meaning here? You trigger oops since the address is beyond
> max address cpu supported or access to a reserved page? If the answer is
> the latter, I'm think it's not right. For example, the kernel code/data
> section is reserved in memory, kernel access it will trigger oops? I
> don't think so.

I think you're confusing the original problem here with how we would
implement the solution.

/dev/mem essentially lets you create ptes with as large of a value as
you like. You just mmap() it, and the kernel will build you a pte to
access whatever crazy offset you choose.

The problem is that on _some_ systems, you won't just get a bus error,
the kernel actually sets up some ptes which the hardware objects to (the
reserved bits in the pte), and we'll panic when the hardware sees the
ptes. We're trying to avoid these panics by ensuring that we never
create these nasty ptes.

Those "nasty" PTEs point to memory which can not even possibly be
*addressed* on the CPUs where they upset the hardware. In other words,
if we limit /dev/mem to *possible* memory on the system (which is sane
all by itself) we will also fix this particular problem.

2013-05-01 19:04:31