2017-12-28 11:33:57

by Hanjun Guo

[permalink] [raw]
Subject: [RFC patch] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero

From: Hanjun Guo <[email protected]>

When we using iounmap() to free the 4K mapping, it just clear the PTEs
but leave P4D/PUD/PMD unchanged, also will not free the memory of page
tables.

This will cause issues on ARM64 platform (not sure if other archs have
the same issue) for this case:

1. ioremap a 4K size, valid page table will build,
2. iounmap it, pte0 will set to 0;
3. ioremap the same address with 2M size, pgd/pmd is unchanged,
then set the a new value for pmd;
4. pte0 is leaked;
5. CPU may meet exception because the old pmd is still in TLB,
which will lead to kernel panic.

Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
zero.

Reported-by: Lei Li <[email protected]>
Signed-off-by: Hanjun Guo <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Xuefeng Wang <[email protected]>
---

Not sure if this is the right direction, this patch has a obvious
side effect that a mapped address with 4K will not back to 2M. I may
miss something and just wrong, so this is just a RFC version, comments
are welcomed.

lib/ioremap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ioremap.c b/lib/ioremap.c
index b808a39..4e6f19a 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -89,7 +89,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
do {
next = pmd_addr_end(addr, end);

- if (ioremap_pmd_enabled() &&
+ if (ioremap_pmd_enabled() && pmd_none(*pmd) &&
((next - addr) == PMD_SIZE) &&
IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
if (pmd_set_huge(pmd, phys_addr + addr, prot))
@@ -115,7 +115,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
do {
next = pud_addr_end(addr, end);

- if (ioremap_pud_enabled() &&
+ if (ioremap_pud_enabled() && pud_none(*pud) &&
((next - addr) == PUD_SIZE) &&
IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
if (pud_set_huge(pud, phys_addr + addr, prot))
@@ -141,7 +141,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
do {
next = p4d_addr_end(addr, end);

- if (ioremap_p4d_enabled() &&
+ if (ioremap_p4d_enabled() && p4d_none(*p4d) &&
((next - addr) == P4D_SIZE) &&
IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
if (p4d_set_huge(p4d, phys_addr + addr, prot))
--
1.7.12.4


2017-12-29 08:01:59

by Hanjun Guo

[permalink] [raw]
Subject: Re: [RFC patch] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero

oops, the title of this patch is wrong, should be:
ioremap: skip setting up huge I/O mappings when p4d/pud/pmd is zero

On 2017/12/28 19:24, Hanjun Guo wrote:
> From: Hanjun Guo <[email protected]>
>
> When we using iounmap() to free the 4K mapping, it just clear the PTEs
> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> tables.
>
> This will cause issues on ARM64 platform (not sure if other archs have
> the same issue) for this case:
>
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
> which will lead to kernel panic.
>
> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> zero.
>
> Reported-by: Lei Li <[email protected]>
> Signed-off-by: Hanjun Guo <[email protected]>
> Cc: Toshi Kani <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Xuefeng Wang <[email protected]>
> ---
>
> Not sure if this is the right direction, this patch has a obvious
> side effect that a mapped address with 4K will not back to 2M. I may
> miss something and just wrong, so this is just a RFC version, comments
> are welcomed.
>
> lib/ioremap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/lib/ioremap.c b/lib/ioremap.c
> index b808a39..4e6f19a 100644
> --- a/lib/ioremap.c
> +++ b/lib/ioremap.c
> @@ -89,7 +89,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
> do {
> next = pmd_addr_end(addr, end);
>
> - if (ioremap_pmd_enabled() &&
> + if (ioremap_pmd_enabled() && pmd_none(*pmd) &&
> ((next - addr) == PMD_SIZE) &&
> IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
> if (pmd_set_huge(pmd, phys_addr + addr, prot))
> @@ -115,7 +115,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
> do {
> next = pud_addr_end(addr, end);
>
> - if (ioremap_pud_enabled() &&
> + if (ioremap_pud_enabled() && pud_none(*pud) &&
> ((next - addr) == PUD_SIZE) &&
> IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
> if (pud_set_huge(pud, phys_addr + addr, prot))
> @@ -141,7 +141,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
> do {
> next = p4d_addr_end(addr, end);
>
> - if (ioremap_p4d_enabled() &&
> + if (ioremap_p4d_enabled() && p4d_none(*p4d) &&
> ((next - addr) == P4D_SIZE) &&
> IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
> if (p4d_set_huge(p4d, phys_addr + addr, prot))
>

2018-01-05 22:16:04

by Kani, Toshimitsu

[permalink] [raw]
Subject: Re: [RFC patch] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero

On Thu, 2017-12-28 at 19:24 +0800, Hanjun Guo wrote:
> From: Hanjun Guo <[email protected]>
>
> When we using iounmap() to free the 4K mapping, it just clear the PTEs
> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> tables.
>
> This will cause issues on ARM64 platform (not sure if other archs have
> the same issue) for this case:
>
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
> which will lead to kernel panic.
>
> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> zero.

Hi Hanjun,

I tested the above steps on my x86 box, but was not able to reproduce
your kernel panic. On x86, a 4K vaddr gets allocated from a small
fragmented free range, whereas a 2MB vaddr is from a larger free range.
Their addrs have different alignments (4KB & 2MB) as well. So, the
steps did not lead to use a same pmd entry.

However, I agree that zero'd pte entries will be leaked when a pmd map
is set if they are present under the pmd.

I also tested your patch on my x86 box. Unfortunately, it effectively
disabled 2MB mappings. While a 2MB vaddr gets allocated from a larger
free range, it sill comes from a free range covered by zero'd pte
entries. So, it ends up with 4KB mappings with your changes.

I think we need to come up with other approach.
Thanks,
-Toshi

2018-01-06 09:48:43

by Hanjun Guo

[permalink] [raw]
Subject: Re: [RFC patch] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero

On 2018/1/6 6:15, Kani, Toshi wrote:
> On Thu, 2017-12-28 at 19:24 +0800, Hanjun Guo wrote:
>> From: Hanjun Guo <[email protected]>
>>
>> When we using iounmap() to free the 4K mapping, it just clear the PTEs
>> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
>> tables.
>>
>> This will cause issues on ARM64 platform (not sure if other archs have
>> the same issue) for this case:
>>
>> 1. ioremap a 4K size, valid page table will build,
>> 2. iounmap it, pte0 will set to 0;
>> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
>> then set the a new value for pmd;
>> 4. pte0 is leaked;
>> 5. CPU may meet exception because the old pmd is still in TLB,
>> which will lead to kernel panic.
>>
>> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
>> zero.
>
> Hi Hanjun,
>
> I tested the above steps on my x86 box, but was not able to reproduce
> your kernel panic. On x86, a 4K vaddr gets allocated from a small
> fragmented free range, whereas a 2MB vaddr is from a larger free range.
> Their addrs have different alignments (4KB & 2MB) as well. So, the
> steps did not lead to use a same pmd entry.

Thanks for the testing, I can only reproduce this on my ARM64 platform
which the CPU will cache the PMD in TLB, from my knowledge, only Cortex-A75
will do this, so ARM64 platforms which are not A75 based can't be reproduced
either.

Catalin, Will, I can reproduce this issue in about 3 minutes with following
simplified test case [1], and can trigger panic as [2], could you take a look
as well?

>
> However, I agree that zero'd pte entries will be leaked when a pmd map
> is set if they are present under the pmd.

Thanks for the confirm.

>
> I also tested your patch on my x86 box. Unfortunately, it effectively
> disabled 2MB mappings. While a 2MB vaddr gets allocated from a larger
> free range, it sill comes from a free range covered by zero'd pte
> entries. So, it ends up with 4KB mappings with your changes.
>
> I think we need to come up with other approach.

Yes, As I said in my patch, this is just RFC, comments are welcomed :)

[1]:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/random.h>
#include <asm/io.h>


void pcie_io_remap_test(u32 times_ms)
{
u64 phy_addr = 0xd0000000;
unsigned long timeout = jiffies + msecs_to_jiffies(times_ms);
int rand_type;
u32 mem_size;
void *vir_addr;

do {
get_random_bytes(&rand_type, sizeof(u32));

rand_type %= 6;
switch (rand_type) {
case 0:
mem_size = 0x1000;
break;
case 1:
mem_size = 0x4000;
break;
case 2:
mem_size = 0x200000;
break;
case 3:
mem_size = 0x300000;
break;
case 4:
mem_size = 0x400;
break;
case 5:
mem_size = 0x400000;
break;
default:
mem_size = 0x400;
break;
}

vir_addr = ioremap(phy_addr, mem_size);
readl(vir_addr);

iounmap(vir_addr);
schedule();

} while (time_before(jiffies, timeout));
}

[2]:

[ 2062.300139] Unable to handle kernel paging request at virtual address ffff000018600000
[ 2062.300139] Unable to handle kernel paging request at virtual address ffff000018600000
[ 2062.327051] Mem abort info:
[ 2062.327051] Mem abort info:
[ 2062.337134] Exception class = DABT (current EL), IL = 32 bits
[ 2062.337134] Exception class = DABT (current EL), IL = 32 bits
[ 2062.354614] SET = 0, FnV = 0
[ 2062.354614] SET = 0, FnV = 0
[ 2062.363818] EA = 0, S1PTW = 0
[ 2062.363818] EA = 0, S1PTW = 0
[ 2062.373131] Data abort info:
[ 2062.373131] Data abort info:
[ 2062.381671] ISV = 0, ISS = 0x00000007
[ 2062.381671] ISV = 0, ISS = 0x00000007
[ 2062.393099] CM = 0, WnR = 0
[ 2062.393099] CM = 0, WnR = 0
[ 2062.402155] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff000009cf6000
[ 2062.402155] swapper pgtable: 4k pages, 48-bit VAs, pgd = ffff000009cf6000
[ 2062.421477] [ffff000018600000] *pgd=00000021ffffe003, *pud=00000021ffffd003, *pmd=00e80000d0000705
[ 2062.421477] [ffff000018600000] *pgd=00000021ffffe003, *pud=00000021ffffd003, *pmd=00e80000d0000705
[ 2062.447913] Internal error: Oops: 96000007 [#1] SMP
[ 2062.447913] Internal error: Oops: 96000007 [#1] SMP
[ 2062.540141] CPU: 1 PID: 2149 Comm: unibsp.out Tainted: P OEL 4.14.0 #1
[ 2062.540141] CPU: 1 PID: 2149 Comm: unibsp.out Tainted: P OEL 4.14.0 #1
[ 2062.560853] task: ffff8021e6f8a100 task.stack: ffff000010aa0000
[ 2062.560853] task: ffff8021e6f8a100 task.stack: ffff000010aa0000
[ 2062.580070] PC is at pcie_io_remap_test+0x9c/0x228
[ 2062.580070] PC is at pcie_io_remap_test+0x9c/0x228
[ 2062.597307] LR is at pcie_io_remap_test+0x9c/0x228
[ 2062.597307] LR is at pcie_io_remap_test+0x9c/0x228
[ 2062.613720] pc : [<ffff0000010a5e74>] lr : [<ffff0000010a5e74>] pstate: 60400149
[ 2062.613720] pc : [<ffff0000010a5e74>] lr : [<ffff0000010a5e74>] pstate: 60400149
[ 2062.634012] sp : ffff000010aa3bb0
[ 2062.634012] sp : ffff000010aa3bb0
[ 2062.643202] x29: ffff000010aa3bb0 x28: ffff8021e6f8a100
[ 2062.643202] x29: ffff000010aa3bb0 x28: ffff8021e6f8a100
[ 2062.658122] x27: ffff000008c91000 x26: 000000000000001d
[ 2062.658122] x27: ffff000008c91000 x26: 000000000000001d
[ 2062.672948] x25: 0000000000000124 x24: 0000000000000003
[ 2062.672948] x25: 0000000000000124 x24: 0000000000000003
[ 2062.687841] x23: ffff0000093a9c88 x22: ffff0000010a88f0
[ 2062.687841] x23: ffff0000093a9c88 x22: ffff0000010a88f0
[ 2062.702658] x21: 000000002aaaaaab x20: ffff0000093a6a80
[ 2062.702658] x21: 000000002aaaaaab x20: ffff0000093a6a80
[ 2062.717535] x19: 0000000100085de7 x18: 0000000000000040
[ 2062.717535] x19: 0000000100085de7 x18: 0000000000000040
[ 2062.732292] x17: 0000000098b401b6 x16: 00000000af415618
[ 2062.732292] x17: 0000000098b401b6 x16: 00000000af415618
[ 2062.747048] x15: 000000000eed33f7 x14: 0140000000000000
[ 2062.747048] x15: 000000000eed33f7 x14: 0140000000000000
[ 2062.761863] x13: ffff0000093bd000 x12: 0000000000000000
[ 2062.761863] x13: ffff0000093bd000 x12: 0000000000000000
[ 2062.776589] x11: 0400000000000001 x10: 0000000000000001
[ 2062.776589] x11: 0400000000000001 x10: 0000000000000001
[ 2062.791482] x9 : 0040000000000001 x8 : ffff008018600000
[ 2062.791482] x9 : 0040000000000001 x8 : ffff008018600000
[ 2062.806166] x7 : ffff8021ffffd620 x6 : ffff000058600000
[ 2062.806166] x7 : ffff8021ffffd620 x6 : ffff000058600000
[ 2062.820842] x5 : ffff000018800000 x4 : 0000000000000000
[ 2062.820842] x5 : ffff000018800000 x4 : 0000000000000000
[ 2062.835538] x3 : 00e8000000000707 x2 : 00e8000000000707
[ 2062.835538] x3 : 00e8000000000707 x2 : 00e8000000000707
[ 2062.850273] x1 : 00000000d0000000 x0 : ffff000018600000
[ 2062.850273] x1 : 00000000d0000000 x0 : ffff000018600000
[ 2062.865258] Process unibsp.out (pid: 2149, stack limit = 0xffff000010aa0000)
[ 2062.865258] Process unibsp.out (pid: 2149, stack limit = 0xffff000010aa0000)
[ 2062.884578] Call trace:
[ 2062.884578] Call trace:
[ 2062.891581] Exception stack(0xffff000010aa3a70 to 0xffff000010aa3bb0)
[ 2062.891581] Exception stack(0xffff000010aa3a70 to 0xffff000010aa3bb0)
[ 2062.909389] 3a60: ffff000018600000 00000000d0000000
[ 2062.909389] 3a60: ffff000018600000 00000000d0000000
[ 2062.930931] 3a80: 00e8000000000707 00e8000000000707 0000000000000000 ffff000018800000
[ 2062.930931] 3a80: 00e8000000000707 00e8000000000707 0000000000000000 ffff000018800000
[ 2062.952554] 3aa0: ffff000058600000 ffff8021ffffd620 ffff008018600000 0040000000000001
[ 2062.952554] 3aa0: ffff000058600000 ffff8021ffffd620 ffff008018600000 0040000000000001
[ 2062.974189] 3ac0: 0000000000000001 0400000000000001 0000000000000000 ffff0000093bd000
[ 2062.974189] 3ac0: 0000000000000001 0400000000000001 0000000000000000 ffff0000093bd000
[ 2062.995645] 3ae0: 0140000000000000 000000000eed33f7 00000000af415618 0000000098b401b6
[ 2062.995645] 3ae0: 0140000000000000 000000000eed33f7 00000000af415618 0000000098b401b6
[ 2063.017204] 3b00: 0000000000000040 0000000100085de7 ffff0000093a6a80 000000002aaaaaab
[ 2063.017204] 3b00: 0000000000000040 0000000100085de7 ffff0000093a6a80 000000002aaaaaab
[ 2063.038889] 3b20: ffff0000010a88f0 ffff0000093a9c88 0000000000000003 0000000000000124
[ 2063.038889] 3b20: ffff0000010a88f0 ffff0000093a9c88 0000000000000003 0000000000000124
[ 2063.060433] 3b40: 000000000000001d ffff000008c91000 ffff8021e6f8a100 ffff000010aa3bb0
[ 2063.060433] 3b40: 000000000000001d ffff000008c91000 ffff8021e6f8a100 ffff000010aa3bb0
[ 2063.081744] 3b60: ffff0000010a5e74 ffff000010aa3bb0 ffff0000010a5e74 0000000060400149
[ 2063.081744] 3b60: ffff0000010a5e74 ffff000010aa3bb0 ffff0000010a5e74 0000000060400149
[ 2063.102974] 3b80: ffff000010aa3bb0 ffff0000010a5e74 0001000000000000 ffff0000093a6a80
[ 2063.102974] 3b80: ffff000010aa3bb0 ffff0000010a5e74 0001000000000000 ffff0000093a6a80
[ 2063.124379] 3ba0: ffff000010aa3bb0 ffff0000010a5e74
[ 2063.124379] 3ba0: ffff000010aa3bb0 ffff0000010a5e74
[ 2063.138959] [<ffff0000010a5e74>] pcie_io_remap_test+0x9c/0x228
[ 2063.138959] [<ffff0000010a5e74>] pcie_io_remap_test+0x9c/0x228
[ 2063.159168] [<ffff000000d38700>] MCSS_ioctl+0x118/0x218 [unibsp]
[ 2063.159168] [<ffff000000d38700>] MCSS_ioctl+0x118/0x218 [unibsp]
[ 2063.176220] [<ffff0000082c16bc>] do_vfs_ioctl+0xc4/0x7a4
[ 2063.176220] [<ffff0000082c16bc>] do_vfs_ioctl+0xc4/0x7a4
[ 2063.190966] [<ffff0000082c1e2c>] SyS_ioctl+0x90/0xa4
[ 2063.190966] [<ffff0000082c1e2c>] SyS_ioctl+0x90/0xa4
[ 2063.204706] Exception stack(0xffff000010aa3ec0 to 0xffff000010aa4000)
[ 2063.204706] Exception stack(0xffff000010aa3ec0 to 0xffff000010aa4000)
[ 2063.222471] 3ec0: 0000000000000003 00000000000bc006 0000ffffd8c3d698 0000000000000010
[ 2063.222471] 3ec0: 0000000000000003 00000000000bc006 0000ffffd8c3d698 0000000000000010
[ 2063.243921] 3ee0: 6572ffffffffffff 0000000000000000 0000ffffd8c3d6aa 65725f6f695f6569
[ 2063.243921] 3ee0: 6572ffffffffffff 0000000000000000 0000ffffd8c3d6aa 65725f6f695f6569
[ 2063.265403] 3f00: 000000000000001d 2f2f2f2f2f34feff 0000000000000000 0000ffff8310dcb0
[ 2063.265403] 3f00: 000000000000001d 2f2f2f2f2f34feff 0000000000000000 0000ffff8310dcb0
[ 2063.286856] 3f20: 0000000000000000 0000000000000509 0000ffff82ea2e64 0000000000000000
[ 2063.286856] 3f20: 0000000000000000 0000000000000509 0000ffff82ea2e64 0000000000000000
[ 2063.308332] 3f40: 0000ffff82f60680 0000000000415540 0000000000000000 0000ffffd8c3d960
[ 2063.308332] 3f40: 0000ffff82f60680 0000000000415540 0000000000000000 0000ffffd8c3d960
[ 2063.329809] 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.329809] 3f60: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.351309] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.351309] 3f80: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.372763] 3fa0: 0000000000000000 0000ffffd8c3d640 0000000000401b98 0000ffffd8c3d640
[ 2063.372763] 3fa0: 0000000000000000 0000ffffd8c3d640 0000000000401b98 0000ffffd8c3d640
[ 2063.394311] 3fc0: 0000ffff82f6068c 0000000080000000 0000000000000003 000000000000001d
[ 2063.394311] 3fc0: 0000ffff82f6068c 0000000080000000 0000000000000003 000000000000001d
[ 2063.415922] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.415922] 3fe0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 2063.437650] [<ffff0000080837b0>] el0_svc_naked+0x24/0x28
[ 2063.437650] [<ffff0000080837b0>] el0_svc_naked+0x24/0x28
[ 2063.452308] Code: d280e0e2 d2ba0000 f2e01d02 95bfdf0d (b9400001)
[ 2063.452308] Code: d280e0e2 d2ba0000 f2e01d02 95bfdf0d (b9400001)
[ 2063.469741] ---[ end trace 36d530c5bf5fea7d ]---
[ 2063.469741] ---[ end trace 36d530c5bf5fea7d ]---
[ 2063.482519] Kernel panic - not syncing: Fatal exception
[ 2063.482519] Kernel panic - not syncing: Fatal exception
[ 2063.497131] SMP: stopping secondary CPUs
[ 2063.497131] SMP: stopping secondary CPUs
[ 2063.508593] Kernel Offset: disabled
[ 2063.508593] Kernel Offset: disabled
[ 2063.518572] CPU features: 0x000a18
[ 2063.518572] CPU features: 0x000a18
[ 2063.528209] Memory Limit: none
[ 2063.528209] Memory Limit: none
[ 2063.537067] ---[ end Kernel panic - not syncing: Fatal exception
[ 2063.537067] ---[ end Kernel panic - not syncing: Fatal exception


Thanks
Hanjun

2018-01-08 23:37:04

by Kani, Toshimitsu

[permalink] [raw]
Subject: Re: [RFC patch] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero

On Sat, 2018-01-06 at 17:46 +0800, Hanjun Guo wrote:
> On 2018/1/6 6:15, Kani, Toshi wrote:
> > On Thu, 2017-12-28 at 19:24 +0800, Hanjun Guo wrote:
> > > From: Hanjun Guo <[email protected]>
> > >
> > > When we using iounmap() to free the 4K mapping, it just clear the PTEs
> > > but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> > > tables.
> > >
> > > This will cause issues on ARM64 platform (not sure if other archs have
> > > the same issue) for this case:
> > >
> > > 1. ioremap a 4K size, valid page table will build,
> > > 2. iounmap it, pte0 will set to 0;
> > > 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> > > then set the a new value for pmd;
> > > 4. pte0 is leaked;
> > > 5. CPU may meet exception because the old pmd is still in TLB,
> > > which will lead to kernel panic.
> > >
> > > Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> > > zero.
> >
> > Hi Hanjun,
> >
> > I tested the above steps on my x86 box, but was not able to reproduce
> > your kernel panic. On x86, a 4K vaddr gets allocated from a small
> > fragmented free range, whereas a 2MB vaddr is from a larger free range.
> > Their addrs have different alignments (4KB & 2MB) as well. So, the
> > steps did not lead to use a same pmd entry.
>
> Thanks for the testing, I can only reproduce this on my ARM64 platform
> which the CPU will cache the PMD in TLB, from my knowledge, only Cortex-A75
> will do this, so ARM64 platforms which are not A75 based can't be reproduced
> either.
>
> Catalin, Will, I can reproduce this issue in about 3 minutes with following
> simplified test case [1], and can trigger panic as [2], could you take a look
> as well?

Yes, the test case looks good to me. (nit - it should check if vir_addr
is not NULL.)

> > However, I agree that zero'd pte entries will be leaked when a pmd map
> > is set if they are present under the pmd.
>
> Thanks for the confirm.
>
> >
> > I also tested your patch on my x86 box. Unfortunately, it effectively
> > disabled 2MB mappings. While a 2MB vaddr gets allocated from a larger
> > free range, it sill comes from a free range covered by zero'd pte
> > entries. So, it ends up with 4KB mappings with your changes.
> >
> > I think we need to come up with other approach.
>
> Yes, As I said in my patch, this is just RFC, comments are welcomed :)

I am wondering if we can follow the same approach in
arch/x86/mm/pageattr.c. Like the ioremap case, populate_pmd() does not
check if there is a pte table under the pmd. But its free function,
unmap_pte_range() calls try_to_free_pte_page() so that a pte table is
freed when all pte entries are zero'd. It then calls pmd_clear().
iounmap()'s free function, vunmap_pte_range() does not free up a pte
table even if all pte entries are zero'd.

Thanks,
-Toshi

2018-02-20 09:26:00

by Chintan Pandya

[permalink] [raw]
Subject: Re: [RFC patch] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero



On 12/28/2017 4:54 PM, Hanjun Guo wrote:
> From: Hanjun Guo <[email protected]>
>
> When we using iounmap() to free the 4K mapping, it just clear the PTEs
> but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> tables.
>
> This will cause issues on ARM64 platform (not sure if other archs have
> the same issue) for this case:
>
> 1. ioremap a 4K size, valid page table will build,
> 2. iounmap it, pte0 will set to 0;
> 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> then set the a new value for pmd;
> 4. pte0 is leaked;
> 5. CPU may meet exception because the old pmd is still in TLB,
> which will lead to kernel panic.
>
> Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> zero.
>
One obvious problem I see here is, once any 2nd level entry has 3rd
level mapping, this entry can't map 2M section ever in future. This way,
we will fragment entire virtual space over time.

The code you are changing is common between 32-bit systems as well (I
think). And running out of section mapping would be a reality in
practical terms.

So, if we can do the following as a fix up, we would be saved.
1) Invalidate 2nd level entry from TLB, and
2) Free the page which holds last level page table

BTW, is there any further discussion going on this topic which I am
missing ?

> Reported-by: Lei Li <[email protected]>
> Signed-off-by: Hanjun Guo <[email protected]>
> Cc: Toshi Kani <[email protected]>
> Cc: Mark Rutland <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Catalin Marinas <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Xuefeng Wang <[email protected]>
> ---
>
> Not sure if this is the right direction, this patch has a obvious
> side effect that a mapped address with 4K will not back to 2M. I may
> miss something and just wrong, so this is just a RFC version, comments
> are welcomed.
>
> lib/ioremap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/lib/ioremap.c b/lib/ioremap.c
> index b808a39..4e6f19a 100644
> --- a/lib/ioremap.c
> +++ b/lib/ioremap.c
> @@ -89,7 +89,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
> do {
> next = pmd_addr_end(addr, end);
>
> - if (ioremap_pmd_enabled() &&
> + if (ioremap_pmd_enabled() && pmd_none(*pmd) &&
> ((next - addr) == PMD_SIZE) &&
> IS_ALIGNED(phys_addr + addr, PMD_SIZE)) {
> if (pmd_set_huge(pmd, phys_addr + addr, prot))
> @@ -115,7 +115,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
> do {
> next = pud_addr_end(addr, end);
>
> - if (ioremap_pud_enabled() &&
> + if (ioremap_pud_enabled() && pud_none(*pud) &&
> ((next - addr) == PUD_SIZE) &&
> IS_ALIGNED(phys_addr + addr, PUD_SIZE)) {
> if (pud_set_huge(pud, phys_addr + addr, prot))
> @@ -141,7 +141,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
> do {
> next = p4d_addr_end(addr, end);
>
> - if (ioremap_p4d_enabled() &&
> + if (ioremap_p4d_enabled() && p4d_none(*p4d) &&
> ((next - addr) == P4D_SIZE) &&
> IS_ALIGNED(phys_addr + addr, P4D_SIZE)) {
> if (p4d_set_huge(p4d, phys_addr + addr, prot))
>

Chintan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc. is a member of the Code Aurora Forum, a Linux Foundation
Collaborative Project

2018-02-21 00:37:08

by Kani, Toshimitsu

[permalink] [raw]
Subject: Re: [RFC patch] ioremap: don't set up huge I/O mappings when p4d/pud/pmd is zero

On Tue, 2018-02-20 at 14:54 +0530, Chintan Pandya wrote:
>
> On 12/28/2017 4:54 PM, Hanjun Guo wrote:
> > From: Hanjun Guo <[email protected]>
> >
> > When we using iounmap() to free the 4K mapping, it just clear the PTEs
> > but leave P4D/PUD/PMD unchanged, also will not free the memory of page
> > tables.
> >
> > This will cause issues on ARM64 platform (not sure if other archs have
> > the same issue) for this case:
> >
> > 1. ioremap a 4K size, valid page table will build,
> > 2. iounmap it, pte0 will set to 0;
> > 3. ioremap the same address with 2M size, pgd/pmd is unchanged,
> > then set the a new value for pmd;
> > 4. pte0 is leaked;
> > 5. CPU may meet exception because the old pmd is still in TLB,
> > which will lead to kernel panic.
> >
> > Fix it by skip setting up the huge I/O mappings when p4d/pud/pmd is
> > zero.
> >
>
> One obvious problem I see here is, once any 2nd level entry has 3rd
> level mapping, this entry can't map 2M section ever in future. This way,
> we will fragment entire virtual space over time.
>
> The code you are changing is common between 32-bit systems as well (I
> think). And running out of section mapping would be a reality in
> practical terms.
>
> So, if we can do the following as a fix up, we would be saved.
> 1) Invalidate 2nd level entry from TLB, and
> 2) Free the page which holds last level page table
>
> BTW, is there any further discussion going on this topic which I am
> missing ?

Yes, I suggested to free up a pte table in my last reply.
https://patchwork.kernel.org/patch/10134581/

Thanks,
-Toshi