This small patchset intends to use PUD/PGDIR entries for the linear mapping
in order to better utilize the TLB.
At the moment, only PMD entries can be used since, on common platforms
(qemu/unleashed), the kernel is loaded at DRAM + 2MB, which misaligns virtual
and physical addresses and thus prevents the use of PUD/PGDIR entries.
So the kernel must be able to get those 2MB back for PAGE_OFFSET to map the
beginning of the DRAM: this is achieved in patch 1.
Furthermore, at the moment the firmware (opensbi) explicitly asks the
kernel not to map the region it occupies, which on those common
platforms sits at the very beginning of the DRAM and thus also misaligns
virtual and physical addresses. I proposed a patch here:
https://github.com/riscv/opensbi/pull/167
that removes this 'constraint', but *not* all the time, as it offers some
kind of protection in case PMP is not available. So sometimes we may
have a part of the memory below the kernel that is removed, creating a
misalignment between virtual and physical addresses. So for performance
reasons, we must at least make sure that PMD entries can be used: that
is guaranteed by patch 1 too.
Finally, the second patch simply improves best_map_size so that,
whenever possible, PUD/PGDIR entries are used.
Below is the kernel page table without this patch on a 6G platform:
---[ Linear mapping ]---
0xffffc00000000000-0xffffc00176e00000 0x0000000080200000 5998M PMD D A . . . W R V
And with this patchset + opensbi patch:
---[ Linear mapping ]---
0xffffc00000000000-0xffffc00140000000 0x0000000080000000 5G PUD D A . . . W R V
0xffffc00140000000-0xffffc00177000000 0x00000001c0000000 880M PMD D A . . . W R V
Alexandre Ghiti (2):
riscv: Get memory below load_pa while ensuring linear mapping is PMD
aligned
riscv: Use PUD/PGDIR entries for linear mapping when possible
arch/riscv/include/asm/page.h | 8 ++++
arch/riscv/mm/init.c | 69 +++++++++++++++++++++++++++++------
2 files changed, 65 insertions(+), 12 deletions(-)
--
2.20.1
Improve best_map_size so that PUD or PGDIR entries are used for the linear
mapping when possible, as this allows better TLB utilization.
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/mm/init.c | 45 +++++++++++++++++++++++++++++++++-----------
1 file changed, 34 insertions(+), 11 deletions(-)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 9a5c97e091c1..d275f9f834cf 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -424,13 +424,29 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
create_pgd_next_mapping(nextp, va, pa, sz, prot);
}
-static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
+static bool is_map_size_ok(uintptr_t map_size, phys_addr_t base,
+ uintptr_t base_virt, phys_addr_t size)
{
- /* Upgrade to PMD_SIZE mappings whenever possible */
- if ((base & (PMD_SIZE - 1)) || (size & (PMD_SIZE - 1)))
- return PAGE_SIZE;
+ return !((base & (map_size - 1)) || (base_virt & (map_size - 1)) ||
+ (size < map_size));
+}
+
+static uintptr_t __init best_map_size(phys_addr_t base, uintptr_t base_virt,
+ phys_addr_t size)
+{
+#ifndef __PAGETABLE_PMD_FOLDED
+ if (is_map_size_ok(PGDIR_SIZE, base, base_virt, size))
+ return PGDIR_SIZE;
+
+ if (pgtable_l4_enabled)
+ if (is_map_size_ok(PUD_SIZE, base, base_virt, size))
+ return PUD_SIZE;
+#endif
+
+ if (is_map_size_ok(PMD_SIZE, base, base_virt, size))
+ return PMD_SIZE;
- return PMD_SIZE;
+ return PAGE_SIZE;
}
/*
@@ -576,7 +592,7 @@ void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
asmlinkage void __init setup_vm(uintptr_t dtb_pa)
{
uintptr_t va, end_va;
- uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
+ uintptr_t map_size;
load_pa = (uintptr_t)(&_start);
load_sz = (uintptr_t)(&_end) - load_pa;
@@ -587,6 +603,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
kernel_virt_addr = KERNEL_VIRT_ADDR;
+ map_size = best_map_size(load_pa, PAGE_OFFSET, MAX_EARLY_MAPPING_SIZE);
va_pa_offset = PAGE_OFFSET - load_pa;
va_kernel_pa_offset = kernel_virt_addr - load_pa;
pfn_base = PFN_DOWN(load_pa);
@@ -700,6 +717,8 @@ static void __init setup_vm_final(void)
/* Map all memory banks */
for_each_memblock(memory, reg) {
+ uintptr_t remaining_size;
+
start = reg->base;
end = start + reg->size;
@@ -707,15 +726,19 @@ static void __init setup_vm_final(void)
break;
if (memblock_is_nomap(reg))
continue;
- if (start <= __pa(PAGE_OFFSET) &&
- __pa(PAGE_OFFSET) < end)
- start = __pa(PAGE_OFFSET);
- map_size = best_map_size(start, end - start);
- for (pa = start; pa < end; pa += map_size) {
+ pa = start;
+ remaining_size = reg->size;
+
+ while (remaining_size) {
va = (uintptr_t)__va(pa);
+ map_size = best_map_size(pa, va, remaining_size);
+
create_pgd_mapping(swapper_pg_dir, va, pa,
map_size, PAGE_KERNEL);
+
+ pa += map_size;
+ remaining_size -= map_size;
}
}
--
2.20.1
The early page table maps PAGE_OFFSET to the kernel load address: that makes
memblock remove any memory below the kernel, which results in
using only PMD entries for the linear mapping.
By setting MIN_MEMBLOCK_ADDR to 0, we allow this memory to be present
when creating the kernel page table: that potentially allows the use of
PUD/PGDIR entries for the linear mapping.
But as the firmware might ask the kernel to remove some part of this
memory, we need to ensure that the physical address targeted by
PAGE_OFFSET is at least aligned on PMD size, since otherwise the linear
mapping would use only PTE entries.
Signed-off-by: Alexandre Ghiti <[email protected]>
---
arch/riscv/include/asm/page.h | 8 ++++++++
arch/riscv/mm/init.c | 24 +++++++++++++++++++++++-
2 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
index 5e77fe7f0d6d..b416396fc357 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -100,6 +100,14 @@ typedef struct page *pgtable_t;
#define PTE_FMT "%08lx"
#endif
+/*
+ * The early page table maps PAGE_OFFSET to load_pa, which may not be the
+ * memory base address; by default MIN_MEMBLOCK_ADDR equals __pa(PAGE_OFFSET),
+ * so memblock ignores memory below load_pa: we want this memory to get mapped
+ * as it may allow the use of hugepages for the linear mapping.
+ */
+#define MIN_MEMBLOCK_ADDR 0
+
#ifdef CONFIG_MMU
extern unsigned long va_pa_offset;
extern unsigned long va_kernel_pa_offset;
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 4064639b24e4..9a5c97e091c1 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -664,7 +664,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
static void __init setup_vm_final(void)
{
uintptr_t va, map_size;
- phys_addr_t pa, start, end;
+ phys_addr_t pa, start, end, dram_start;
struct memblock_region *reg;
static struct vm_struct vm_kernel = { 0 };
@@ -676,6 +676,28 @@ static void __init setup_vm_final(void)
__pa_symbol(fixmap_pgd_next),
PGDIR_SIZE, PAGE_TABLE);
+ /*
+ * Make sure that virtual and physical addresses are at least aligned
+ * on PMD_SIZE, even if we have to lose some memory (< PMD_SIZE)
+ * otherwise the linear mapping would get mapped using PTE entries.
+ */
+ dram_start = memblock_start_of_DRAM();
+ if (dram_start & (PMD_SIZE - 1)) {
+ uintptr_t next_dram_start;
+
+ next_dram_start = (dram_start + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
+ memblock_remove(dram_start, next_dram_start - dram_start);
+ dram_start = next_dram_start;
+ }
+
+ /*
+ * We started considering PAGE_OFFSET would start at load_pa because
+ * it was the only piece of information we had, but now make PAGE_OFFSET
+ * point to the real beginning of the memory area.
+ */
+ va_pa_offset = PAGE_OFFSET - dram_start;
+ pfn_base = PFN_DOWN(dram_start);
+
/* Map all memory banks */
for_each_memblock(memory, reg) {
start = reg->base;
--
2.20.1
On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti <[email protected]> wrote:
>
> This small patchset intends to use PUD/PGDIR entries for linear mapping
> in order to better utilize TLB.
>
> At the moment, only PMD entries can be used since on common platforms
> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which dealigns virtual
> and physical addresses and then prevents the use of PUD/PGDIR entries.
> So the kernel must be able to get those 2MB for PAGE_OFFSET to map the
> beginning of the DRAM: this is achieved in patch 1.
>
I don't have in-depth knowledge of how the mm code works, so this question
may be a completely stupid one :). Just for my understanding:
as per my reading, the kernel will map those 2MB of memory but never use them.
How does the kernel ensure that it doesn't allocate any memory from those 2MB
if the region is not marked as reserved?
> But furthermore, at the moment, the firmware (opensbi) explicitly asks the
> kernel not to map the region it occupies, which is on those common
> platforms at the very beginning of the DRAM and then it also dealigns
> virtual and physical addresses. I proposed a patch here:
>
> https://github.com/riscv/opensbi/pull/167
>
> that removes this 'constraint' but *not* all the time as it offers some
> kind of protection in case PMP is not available. So sometimes, we may
> have a part of the memory below the kernel that is removed creating a
> misalignment between virtual and physical addresses. So for performance
> reasons, we must at least make sure that PMD entries can be used: that
> is guaranteed by patch 1 too.
>
> Finally the second patch simply improves best_map_size so that whenever
> possible, PUD/PGDIR entries are used.
>
> Below is the kernel page table without this patch on a 6G platform:
>
> ---[ Linear mapping ]---
> 0xffffc00000000000-0xffffc00176e00000 0x0000000080200000 5998M PMD D A . . . W R V
>
> And with this patchset + opensbi patch:
>
> ---[ Linear mapping ]---
> 0xffffc00000000000-0xffffc00140000000 0x0000000080000000 5G PUD D A . . . W R V
> 0xffffc00140000000-0xffffc00177000000 0x00000001c0000000 880M PMD D A . . . W R V
>
> Alexandre Ghiti (2):
> riscv: Get memory below load_pa while ensuring linear mapping is PMD
> aligned
> riscv: Use PUD/PGDIR entries for linear mapping when possible
>
> arch/riscv/include/asm/page.h | 8 ++++
> arch/riscv/mm/init.c | 69 +++++++++++++++++++++++++++++------
> 2 files changed, 65 insertions(+), 12 deletions(-)
>
> --
> 2.20.1
>
>
--
Regards,
Atish
Hi Atish,
Le 6/10/20 à 2:32 PM, Atish Patra a écrit :
> On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti <[email protected]> wrote:
>> This small patchset intends to use PUD/PGDIR entries for linear mapping
>> in order to better utilize TLB.
>>
>> At the moment, only PMD entries can be used since on common platforms
>> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which dealigns virtual
>> and physical addresses and then prevents the use of PUD/PGDIR entries.
>> So the kernel must be able to get those 2MB for PAGE_OFFSET to map the
>> beginning of the DRAM: this is achieved in patch 1.
>>
> I don't have in depth knowledge of how mm code works so this question
> may be a completely
> stupid one :). Just for my understanding,
> As per my understanding, kernel will map those 2MB of memory but never use it.
> How does the kernel ensure that it doesn't allocate any memory from those 2MB
> memory if it is not marked as reserved?
Yes, a 1GB hugepage will cover those 2MB: I rely on the previous boot
stage to mark this region
as reserved if there is something there (like opensbi). Otherwise, the
kernel will indeed try to
allocate memory from there :)
Alex
>> But furthermore, at the moment, the firmware (opensbi) explicitly asks the
>> kernel not to map the region it occupies, which is on those common
>> platforms at the very beginning of the DRAM and then it also dealigns
>> virtual and physical addresses. I proposed a patch here:
>>
>> https://github.com/riscv/opensbi/pull/167
>>
>> that removes this 'constraint' but *not* all the time as it offers some
>> kind of protection in case PMP is not available. So sometimes, we may
>> have a part of the memory below the kernel that is removed creating a
>> misalignment between virtual and physical addresses. So for performance
>> reasons, we must at least make sure that PMD entries can be used: that
>> is guaranteed by patch 1 too.
>>
>> Finally the second patch simply improves best_map_size so that whenever
>> possible, PUD/PGDIR entries are used.
>>
>> Below is the kernel page table without this patch on a 6G platform:
>>
>> ---[ Linear mapping ]---
>> 0xffffc00000000000-0xffffc00176e00000 0x0000000080200000 5998M PMD D A . . . W R V
>>
>> And with this patchset + opensbi patch:
>>
>> ---[ Linear mapping ]---
>> 0xffffc00000000000-0xffffc00140000000 0x0000000080000000 5G PUD D A . . . W R V
>> 0xffffc00140000000-0xffffc00177000000 0x00000001c0000000 880M PMD D A . . . W R V
>>
>> Alexandre Ghiti (2):
>> riscv: Get memory below load_pa while ensuring linear mapping is PMD
>> aligned
>> riscv: Use PUD/PGDIR entries for linear mapping when possible
>>
>> arch/riscv/include/asm/page.h | 8 ++++
>> arch/riscv/mm/init.c | 69 +++++++++++++++++++++++++++++------
>> 2 files changed, 65 insertions(+), 12 deletions(-)
>>
>> --
>> 2.20.1
>>
>>
>
On Wed, Jun 10, 2020 at 11:51 PM Alex Ghiti <[email protected]> wrote:
>
> Hi Atish,
>
> Le 6/10/20 à 2:32 PM, Atish Patra a écrit :
> > On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti <[email protected]> wrote:
> >> This small patchset intends to use PUD/PGDIR entries for linear mapping
> >> in order to better utilize TLB.
> >>
> >> At the moment, only PMD entries can be used since on common platforms
> >> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which dealigns virtual
> >> and physical addresses and then prevents the use of PUD/PGDIR entries.
> >> So the kernel must be able to get those 2MB for PAGE_OFFSET to map the
> >> beginning of the DRAM: this is achieved in patch 1.
> >>
> > I don't have in depth knowledge of how mm code works so this question
> > may be a completely
> > stupid one :). Just for my understanding,
> > As per my understanding, kernel will map those 2MB of memory but never use it.
> > How does the kernel ensure that it doesn't allocate any memory from those 2MB
> > memory if it is not marked as reserved?
>
> Yes, a 1GB hugepage will cover those 2MB: I rely on the previous boot
> stage to mark this region
> as reserved if there is something there (like opensbi). Otherwise, the
> kernel will indeed try to
> allocate memory from there :)
>
In that case, this patch mandates that the firmware region be marked
"reserved" in the device tree so that the Linux kernel doesn't try to
allocate memory from there.
OpenSBI has already been doing this since v0.7. Thus, any user running the
latest OpenSBI can leverage this patch for better TLB utilization.
However, legacy boot stages (BBL) do not reserve this area via the DT,
which may result in an unexpected crash. I am not sure how many developers
still use BBL though.
A few general suggestions to tackle this problem:
1. This mandatory requirement should be added to the booting document
so that any other SBI implementation is also aware of it.
2. You may have to put patch 1 behind a separate config option so that
any users of legacy boot stages can disable this feature.
> Alex
>
>
> >> But furthermore, at the moment, the firmware (opensbi) explicitly asks the
> >> kernel not to map the region it occupies, which is on those common
> >> platforms at the very beginning of the DRAM and then it also dealigns
> >> virtual and physical addresses. I proposed a patch here:
> >>
> >> https://github.com/riscv/opensbi/pull/167
> >>
> >> that removes this 'constraint' but *not* all the time as it offers some
> >> kind of protection in case PMP is not available. So sometimes, we may
> >> have a part of the memory below the kernel that is removed creating a
> >> misalignment between virtual and physical addresses. So for performance
> >> reasons, we must at least make sure that PMD entries can be used: that
> >> is guaranteed by patch 1 too.
> >>
> >> Finally the second patch simply improves best_map_size so that whenever
> >> possible, PUD/PGDIR entries are used.
> >>
> >> Below is the kernel page table without this patch on a 6G platform:
> >>
> >> ---[ Linear mapping ]---
> >> 0xffffc00000000000-0xffffc00176e00000 0x0000000080200000 5998M PMD D A . . . W R V
> >>
> >> And with this patchset + opensbi patch:
> >>
> >> ---[ Linear mapping ]---
> >> 0xffffc00000000000-0xffffc00140000000 0x0000000080000000 5G PUD D A . . . W R V
> >> 0xffffc00140000000-0xffffc00177000000 0x00000001c0000000 880M PMD D A . . . W R V
> >>
> >> Alexandre Ghiti (2):
> >> riscv: Get memory below load_pa while ensuring linear mapping is PMD
> >> aligned
> >> riscv: Use PUD/PGDIR entries for linear mapping when possible
> >>
> >> arch/riscv/include/asm/page.h | 8 ++++
> >> arch/riscv/mm/init.c | 69 +++++++++++++++++++++++++++++------
> >> 2 files changed, 65 insertions(+), 12 deletions(-)
> >>
> >> --
> >> 2.20.1
> >>
> >>
> >
--
Regards,
Atish
Hi Atish,
Le 6/11/20 à 1:29 PM, Atish Patra a écrit :
> On Wed, Jun 10, 2020 at 11:51 PM Alex Ghiti<[email protected]> wrote:
>> Hi Atish,
>>
>> Le 6/10/20 à 2:32 PM, Atish Patra a écrit :
>>> On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti<[email protected]> wrote:
>>>> This small patchset intends to use PUD/PGDIR entries for linear mapping
>>>> in order to better utilize TLB.
>>>>
>>>> At the moment, only PMD entries can be used since on common platforms
>>>> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which dealigns virtual
>>>> and physical addresses and then prevents the use of PUD/PGDIR entries.
>>>> So the kernel must be able to get those 2MB for PAGE_OFFSET to map the
>>>> beginning of the DRAM: this is achieved in patch 1.
>>>>
>>> I don't have in depth knowledge of how mm code works so this question
>>> may be a completely
>>> stupid one :). Just for my understanding,
>>> As per my understanding, kernel will map those 2MB of memory but never use it.
>>> How does the kernel ensure that it doesn't allocate any memory from those 2MB
>>> memory if it is not marked as reserved?
>> Yes, a 1GB hugepage will cover those 2MB: I rely on the previous boot
>> stage to mark this region
>> as reserved if there is something there (like opensbi). Otherwise, the
>> kernel will indeed try to
>> allocate memory from there :)
>>
> In that case, this patch mandates that the firmware region has to be
> mark "reserved"
> the device tree so that the Linux kernel doesn't try to allocate
> memory from there.
> OpenSBI is already doing it from v0.7. Thus, any user using latest
> OpenSBI can leverage
> this patch for a better TLB utilization.
Note that *currently* OpenSBI v0.7 still adds the "no-map" property
which prevents such optimization.
> However, legacy previous boot stages(BBL) do not reserve this area via
> DT which may
> result in an unexpected crash. I am not sure how many developers still
> use BBL though.
>
> Few general suggestions to tackle this problem:
> 1. This mandatory requirement should be added to the booting document
> so that any other
> SBI implementation is also aware of it.
> 2. You may have to move the patch1 to a separate config so that any
> users of legacy boot stages
> can disable this feature.
IMHO, the region occupied by runtime services should be marked as
reserved in the device tree, so it seems redundant to add this as a
requirement; I would rather consider its absence a bug.
Even if I understand that this might break some systems, I don't like the
idea of a new config option to support old "buggy" bootloaders: when will
we be able to remove it? We'll never know when people will stop using those
bootloaders, so it will stay here forever... Where can I find the boot
document you are talking about? Can we simply state here that this
kernel version will not be compatible with those bootloaders (we'll draw up
an exhaustive list here)?
Alex
>> Alex
>>
>>
>>>> But furthermore, at the moment, the firmware (opensbi) explicitly asks the
>>>> kernel not to map the region it occupies, which is on those common
>>>> platforms at the very beginning of the DRAM and then it also dealigns
>>>> virtual and physical addresses. I proposed a patch here:
>>>>
>>>> https://github.com/riscv/opensbi/pull/167
>>>>
>>>> that removes this 'constraint' but *not* all the time as it offers some
>>>> kind of protection in case PMP is not available. So sometimes, we may
>>>> have a part of the memory below the kernel that is removed creating a
>>>> misalignment between virtual and physical addresses. So for performance
>>>> reasons, we must at least make sure that PMD entries can be used: that
>>>> is guaranteed by patch 1 too.
>>>>
>>>> Finally the second patch simply improves best_map_size so that whenever
>>>> possible, PUD/PGDIR entries are used.
>>>>
>>>> Below is the kernel page table without this patch on a 6G platform:
>>>>
>>>> ---[ Linear mapping ]---
>>>> 0xffffc00000000000-0xffffc00176e00000 0x0000000080200000 5998M PMD D A . . . W R V
>>>>
>>>> And with this patchset + opensbi patch:
>>>>
>>>> ---[ Linear mapping ]---
>>>> 0xffffc00000000000-0xffffc00140000000 0x0000000080000000 5G PUD D A . . . W R V
>>>> 0xffffc00140000000-0xffffc00177000000 0x00000001c0000000 880M PMD D A . . . W R V
>>>>
>>>> Alexandre Ghiti (2):
>>>> riscv: Get memory below load_pa while ensuring linear mapping is PMD
>>>> aligned
>>>> riscv: Use PUD/PGDIR entries for linear mapping when possible
>>>>
>>>> arch/riscv/include/asm/page.h | 8 ++++
>>>> arch/riscv/mm/init.c | 69 +++++++++++++++++++++++++++++------
>>>> 2 files changed, 65 insertions(+), 12 deletions(-)
>>>>
>>>> --
>>>> 2.20.1
>>>>
>>>>
>
Le 6/12/20 à 8:59 AM, Alex Ghiti a écrit :
> Hi Atish,
>
> Le 6/11/20 à 1:29 PM, Atish Patra a écrit :
>> On Wed, Jun 10, 2020 at 11:51 PM Alex Ghiti<[email protected]> wrote:
>>> Hi Atish,
>>>
>>> Le 6/10/20 à 2:32 PM, Atish Patra a écrit :
>>>> On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti<[email protected]> wrote:
>>>>> This small patchset intends to use PUD/PGDIR entries for linear
>>>>> mapping
>>>>> in order to better utilize TLB.
>>>>>
>>>>> At the moment, only PMD entries can be used since on common platforms
>>>>> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which
>>>>> dealigns virtual
>>>>> and physical addresses and then prevents the use of PUD/PGDIR
>>>>> entries.
>>>>> So the kernel must be able to get those 2MB for PAGE_OFFSET to map
>>>>> the
>>>>> beginning of the DRAM: this is achieved in patch 1.
>>>>>
>>>> I don't have in depth knowledge of how mm code works so this question
>>>> may be a completely
>>>> stupid one :). Just for my understanding,
>>>> As per my understanding, kernel will map those 2MB of memory but
>>>> never use it.
>>>> How does the kernel ensure that it doesn't allocate any memory from
>>>> those 2MB
>>>> memory if it is not marked as reserved?
>>> Yes, a 1GB hugepage will cover those 2MB: I rely on the previous boot
>>> stage to mark this region
>>> as reserved if there is something there (like opensbi). Otherwise, the
>>> kernel will indeed try to
>>> allocate memory from there :)
>>>
>> In that case, this patch mandates that the firmware region has to be
>> mark "reserved"
>> the device tree so that the Linux kernel doesn't try to allocate
>> memory from there.
>> OpenSBI is already doing it from v0.7. Thus, any user using latest
>> OpenSBI can leverage
>> this patch for a better TLB utilization.
>
>
> Note that *currently* OpenSBI v0.7 still adds the "no-map" property
> which prevents such optimization.
>
>> However, legacy previous boot stages(BBL) do not reserve this area via
>> DT which may
>> result in an unexpected crash. I am not sure how many developers still
>> use BBL though.
>>
>> Few general suggestions to tackle this problem:
>> 1. This mandatory requirement should be added to the booting document
>> so that any other
>> SBI implementation is also aware of it.
>> 2. You may have to move the patch1 to a separate config so that any
>> users of legacy boot stages
>> can disable this feature.
>
>
> IMHO, the region occupied by runtime services should be marked as
> reserved in the device-tree. So it seems redundant to add this as a
> requirement, I would rather consider its absence as a bug.
>
> Even if I understand that this might break some system, I don't like
> the idea of a new config to support old "buggy" bootloaders: when will
> we be able to remove it ? We'll never know when people will stop using
> those bootloaders, so it will stay here forever...Where can I find the
> boot document you are talking about ? Can we simply state here that
> this kernel version will not be compatible with those bootloaders
> (we'll draw an exhaustive list here) ?
Ok, I have just found Documentation/riscv/boot-image-header.rst: could
we imagine doing something like incrementing the version and using that as
a hint for the kernel not to map the 2MB offset? That's still legacy,
but at least it does not require recompiling the kernel, as the check
would be done at runtime.
>
> Alex
>
>
>>> Alex
>>>
>>>
>>>>> But furthermore, at the moment, the firmware (opensbi) explicitly
>>>>> asks the
>>>>> kernel not to map the region it occupies, which is on those common
>>>>> platforms at the very beginning of the DRAM and then it also dealigns
>>>>> virtual and physical addresses. I proposed a patch here:
>>>>>
>>>>> https://github.com/riscv/opensbi/pull/167
>>>>>
>>>>> that removes this 'constraint' but *not* all the time as it offers
>>>>> some
>>>>> kind of protection in case PMP is not available. So sometimes, we may
>>>>> have a part of the memory below the kernel that is removed creating a
>>>>> misalignment between virtual and physical addresses. So for
>>>>> performance
>>>>> reasons, we must at least make sure that PMD entries can be used:
>>>>> that
>>>>> is guaranteed by patch 1 too.
>>>>>
>>>>> Finally the second patch simply improves best_map_size so that
>>>>> whenever
>>>>> possible, PUD/PGDIR entries are used.
>>>>>
>>>>> Below is the kernel page table without this patch on a 6G platform:
>>>>>
>>>>> ---[ Linear mapping ]---
>>>>> 0xffffc00000000000-0xffffc00176e00000 0x0000000080200000 5998M
>>>>> PMD D A . . . W R V
>>>>>
>>>>> And with this patchset + opensbi patch:
>>>>>
>>>>> ---[ Linear mapping ]---
>>>>> 0xffffc00000000000-0xffffc00140000000 0x0000000080000000
>>>>> 5G PUD D A . . . W R V
>>>>> 0xffffc00140000000-0xffffc00177000000 0x00000001c0000000 880M
>>>>> PMD D A . . . W R V
>>>>>
>>>>> Alexandre Ghiti (2):
>>>>> riscv: Get memory below load_pa while ensuring linear mapping
>>>>> is PMD
>>>>> aligned
>>>>> riscv: Use PUD/PGDIR entries for linear mapping when possible
>>>>>
>>>>> arch/riscv/include/asm/page.h | 8 ++++
>>>>> arch/riscv/mm/init.c | 69
>>>>> +++++++++++++++++++++++++++++------
>>>>> 2 files changed, 65 insertions(+), 12 deletions(-)
>>>>>
>>>>> --
>>>>> 2.20.1
>>>>>
>>>>>
>>
>
On Fri, Jun 12, 2020 at 6:17 AM Alex Ghiti <[email protected]> wrote:
>
> Le 6/12/20 à 8:59 AM, Alex Ghiti a écrit :
> > Hi Atish,
> >
> > Le 6/11/20 à 1:29 PM, Atish Patra a écrit :
> >> On Wed, Jun 10, 2020 at 11:51 PM Alex Ghiti<[email protected]> wrote:
> >>> Hi Atish,
> >>>
> >>> Le 6/10/20 à 2:32 PM, Atish Patra a écrit :
> >>>> On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti<[email protected]> wrote:
> >>>>> This small patchset intends to use PUD/PGDIR entries for linear
> >>>>> mapping
> >>>>> in order to better utilize TLB.
> >>>>>
> >>>>> At the moment, only PMD entries can be used since on common platforms
> >>>>> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which
> >>>>> dealigns virtual
> >>>>> and physical addresses and then prevents the use of PUD/PGDIR
> >>>>> entries.
> >>>>> So the kernel must be able to get those 2MB for PAGE_OFFSET to map
> >>>>> the
> >>>>> beginning of the DRAM: this is achieved in patch 1.
> >>>>>
> >>>> I don't have in depth knowledge of how mm code works so this question
> >>>> may be a completely
> >>>> stupid one :). Just for my understanding,
> >>>> As per my understanding, kernel will map those 2MB of memory but
> >>>> never use it.
> >>>> How does the kernel ensure that it doesn't allocate any memory from
> >>>> those 2MB
> >>>> memory if it is not marked as reserved?
> >>> Yes, a 1GB hugepage will cover those 2MB: I rely on the previous boot
> >>> stage to mark this region
> >>> as reserved if there is something there (like opensbi). Otherwise, the
> >>> kernel will indeed try to
> >>> allocate memory from there :)
> >>>
> >> In that case, this patch mandates that the firmware region has to be
> >> mark "reserved"
> >> the device tree so that the Linux kernel doesn't try to allocate
> >> memory from there.
> >> OpenSBI is already doing it from v0.7. Thus, any user using latest
> >> OpenSBI can leverage
> >> this patch for a better TLB utilization.
> >
> >
> > Note that *currently* OpenSBI v0.7 still adds the "no-map" property
> > which prevents such optimization.
> >
Thanks for the clarification. When I said latest, I meant including
your patch on the mailing list.
> >> However, legacy previous boot stages(BBL) do not reserve this area via
> >> DT which may
> >> result in an unexpected crash. I am not sure how many developers still
> >> use BBL though.
> >>
> >> Few general suggestions to tackle this problem:
> >> 1. This mandatory requirement should be added to the booting document
> >> so that any other
> >> SBI implementation is also aware of it.
> >> 2. You may have to move the patch1 to a separate config so that any
> >> users of legacy boot stages
> >> can disable this feature.
> >
> >
> > IMHO, the region occupied by runtime services should be marked as
> > reserved in the device-tree. So it seems redundant to add this as a
> > requirement, I would rather consider its absence as a bug.
> >
I agree. I was just suggesting that we document this bug :).
> > Even if I understand that this might break some system, I don't like
> > the idea of a new config to support old "buggy" bootloaders: when will
> > we be able to remove it ? We'll never know when people will stop using
> > those bootloaders, so it will stay here forever...Where can I find the
Personally, I am fine with that. However, there were a few concerns in the past.
I am leaving it to Palmer to decide.
@Palmer Dabbelt: Any thoughts?
> > boot document you are talking about ? Can we simply state here that
> > this kernel version will not be compatible with those bootloaders
> > (we'll draw an exhaustive list here) ?
>
Yes.
>
> Ok, I have just found Documentation/riscv/boot-image-header.rst: could
> we imagine doing something like incrementing the version and use that as
> a hint in the kernel not to map the 2MB offset ? That's still legacy,
> but at least it does not require to recompile a kernel as the check
> would be done at runtime.
>
I was suggesting adding a RISC-V specific booting document and
documenting this "bug".
Documentation/riscv/boot-image-header.rst can be linked from that document,
or the boot header content can be included in it. No code changes are necessary.
Eventually, this booting document will also include other additional
booting constraints for RISC-V, such as the minimum extensions required to
boot Linux, the CSR state upon entering S-mode, and the MMU state.
--
Regards,
Atish
Hi Atish,
Le 6/12/20 à 1:43 PM, Atish Patra a écrit :
> On Fri, Jun 12, 2020 at 6:17 AM Alex Ghiti <[email protected]> wrote:
>> Le 6/12/20 à 8:59 AM, Alex Ghiti a écrit :
>>> Hi Atish,
>>>
>>> Le 6/11/20 à 1:29 PM, Atish Patra a écrit :
>>>> On Wed, Jun 10, 2020 at 11:51 PM Alex Ghiti<[email protected]> wrote:
>>>>> Hi Atish,
>>>>>
>>>>> Le 6/10/20 à 2:32 PM, Atish Patra a écrit :
>>>>>> On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti<[email protected]> wrote:
>>>>>>> This small patchset intends to use PUD/PGDIR entries for linear
>>>>>>> mapping
>>>>>>> in order to better utilize TLB.
>>>>>>>
>>>>>>> At the moment, only PMD entries can be used since on common platforms
>>>>>>> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which
>>>>>>> dealigns virtual
>>>>>>> and physical addresses and then prevents the use of PUD/PGDIR
>>>>>>> entries.
>>>>>>> So the kernel must be able to get those 2MB for PAGE_OFFSET to map
>>>>>>> the
>>>>>>> beginning of the DRAM: this is achieved in patch 1.
>>>>>>>
>>>>>> I don't have in depth knowledge of how mm code works so this question
>>>>>> may be a completely
>>>>>> stupid one :). Just for my understanding,
>>>>>> As per my understanding, kernel will map those 2MB of memory but
>>>>>> never use it.
>>>>>> How does the kernel ensure that it doesn't allocate any memory from
>>>>>> those 2MB
>>>>>> memory if it is not marked as reserved?
>>>>> Yes, a 1GB hugepage will cover those 2MB: I rely on the previous boot
>>>>> stage to mark this region
>>>>> as reserved if there is something there (like opensbi). Otherwise, the
>>>>> kernel will indeed try to
>>>>> allocate memory from there :)
>>>>>
>>>> In that case, this patch mandates that the firmware region has to be
>>>> marked "reserved" in
>>>> the device tree so that the Linux kernel doesn't try to allocate
>>>> memory from there.
>>>> OpenSBI is already doing it from v0.7. Thus, any user using latest
>>>> OpenSBI can leverage
>>>> this patch for a better TLB utilization.
>>>
>>> Note that *currently* OpenSBI v0.7 still adds the "no-map" property
>>> which prevents such optimization.
>>>
> Thanks for the clarification. When I said latest, I meant including
> your patch in the mailing list.
>
>>>> However, legacy previous boot stages(BBL) do not reserve this area via
>>>> DT which may
>>>> result in an unexpected crash. I am not sure how many developers still
>>>> use BBL though.
>>>>
>>>> Few general suggestions to tackle this problem:
>>>> 1. This mandatory requirement should be added to the booting document
>>>> so that any other
>>>> SBI implementation is also aware of it.
>>>> 2. You may have to move the patch1 to a separate config so that any
>>>> users of legacy boot stages
>>>> can disable this feature.
>>>
>>> IMHO, the region occupied by runtime services should be marked as
>>> reserved in the device-tree. So it seems redundant to add this as a
>>> requirement, I would rather consider its absence as a bug.
>>>
> I agree. I was just suggesting to document this bug :).
Oh ok then, we meant the same thing :)
>
>>> Even if I understand that this might break some system, I don't like
>>> the idea of a new config to support old "buggy" bootloaders: when will
>>> we be able to remove it ? We'll never know when people will stop using
>>> those bootloaders, so it will stay here forever...Where can I find the
> Personally, I am fine with that. However, there were few concerns in the past.
> I am leaving it to Palmer to decide.
>
> @Palmer Dabbelt : Any thoughts ?
>
>>> boot document you are talking about ? Can we simply state here that
>>> this kernel version will not be compatible with those bootloaders
>>> (we'll draw an exhaustive list here) ?
> Yes.
>
>> Ok, I have just found Documentation/riscv/boot-image-header.rst: could
>> we imagine doing something like incrementing the version and use that as
>> a hint in the kernel not to map the 2MB offset ? That's still legacy,
>> but at least it does not require to recompile a kernel as the check
>> would be done at runtime.
>>
> I was suggesting to add a risc-v specific booting document and
> document this "bug".
> Documentation/riscv/boot-image-header.rst can be linked from that document or
> the boot header content can be included in that. No code changes are necessary.
>
> Eventually, this booting document will also include other additional
> booting constraints for RISC-V
> such as minimum extension required to boot Linux, csr state upon
> entering S-mode, mmu state.
Ok, I will prepare a boot document that links to the existing documents and
add all of that. I will need your input on the remaining constraints that I
don't know about.
Thanks Atish,
Alex
On Sun, Jun 14, 2020 at 10:35 PM Alex Ghiti <[email protected]> wrote:
>
> Hi Atish,
>
> Le 6/12/20 à 1:43 PM, Atish Patra a écrit :
> > On Fri, Jun 12, 2020 at 6:17 AM Alex Ghiti <[email protected]> wrote:
> >> Le 6/12/20 à 8:59 AM, Alex Ghiti a écrit :
> >>> Hi Atish,
> >>>
> >>> Le 6/11/20 à 1:29 PM, Atish Patra a écrit :
> >>>> On Wed, Jun 10, 2020 at 11:51 PM Alex Ghiti<[email protected]> wrote:
> >>>>> Hi Atish,
> >>>>>
> >>>>> Le 6/10/20 à 2:32 PM, Atish Patra a écrit :
> >>>>>> On Wed, Jun 3, 2020 at 8:36 AM Alexandre Ghiti<[email protected]> wrote:
> >>>>>>> This small patchset intends to use PUD/PGDIR entries for linear
> >>>>>>> mapping
> >>>>>>> in order to better utilize TLB.
> >>>>>>>
> >>>>>>> At the moment, only PMD entries can be used since on common platforms
> >>>>>>> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which
> >>>>>>> dealigns virtual
> >>>>>>> and physical addresses and then prevents the use of PUD/PGDIR
> >>>>>>> entries.
> >>>>>>> So the kernel must be able to get those 2MB for PAGE_OFFSET to map
> >>>>>>> the
> >>>>>>> beginning of the DRAM: this is achieved in patch 1.
> >>>>>>>
> >>>>>> I don't have in depth knowledge of how mm code works so this question
> >>>>>> may be a completely
> >>>>>> stupid one :). Just for my understanding,
> >>>>>> As per my understanding, kernel will map those 2MB of memory but
> >>>>>> never use it.
> >>>>>> How does the kernel ensure that it doesn't allocate any memory from
> >>>>>> those 2MB
> >>>>>> memory if it is not marked as reserved?
> >>>>> Yes, a 1GB hugepage will cover those 2MB: I rely on the previous boot
> >>>>> stage to mark this region
> >>>>> as reserved if there is something there (like opensbi). Otherwise, the
> >>>>> kernel will indeed try to
> >>>>> allocate memory from there :)
> >>>>>
> >>>> In that case, this patch mandates that the firmware region has to be
> >>>> marked "reserved" in
> >>>> the device tree so that the Linux kernel doesn't try to allocate
> >>>> memory from there.
> >>>> OpenSBI is already doing it from v0.7. Thus, any user using latest
> >>>> OpenSBI can leverage
> >>>> this patch for a better TLB utilization.
> >>>
> >>> Note that *currently* OpenSBI v0.7 still adds the "no-map" property
> >>> which prevents such optimization.
> >>>
> > Thanks for the clarification. When I said latest, I meant including
> > your patch in the mailing list.
> >
> >>>> However, legacy previous boot stages(BBL) do not reserve this area via
> >>>> DT which may
> >>>> result in an unexpected crash. I am not sure how many developers still
> >>>> use BBL though.
> >>>>
> >>>> Few general suggestions to tackle this problem:
> >>>> 1. This mandatory requirement should be added to the booting document
> >>>> so that any other
> >>>> SBI implementation is also aware of it.
> >>>> 2. You may have to move the patch1 to a separate config so that any
> >>>> users of legacy boot stages
> >>>> can disable this feature.
> >>>
> >>> IMHO, the region occupied by runtime services should be marked as
> >>> reserved in the device-tree. So it seems redundant to add this as a
> >>> requirement, I would rather consider its absence as a bug.
> >>>
> > I agree. I was just suggesting to document this bug :).
>
> Oh ok then, we meant the same thing :)
> >
> >>> Even if I understand that this might break some system, I don't like
> >>> the idea of a new config to support old "buggy" bootloaders: when will
> >>> we be able to remove it ? We'll never know when people will stop using
> >>> those bootloaders, so it will stay here forever...Where can I find the
> > Personally, I am fine with that. However, there were few concerns in the past.
> > I am leaving it to Palmer to decide.
> >
> > @Palmer Dabbelt : Any thoughts ?
> >
> >>> boot document you are talking about ? Can we simply state here that
> >>> this kernel version will not be compatible with those bootloaders
> >>> (we'll draw an exhaustive list here) ?
> > Yes.
> >
> >> Ok, I have just found Documentation/riscv/boot-image-header.rst: could
> >> we imagine doing something like incrementing the version and use that as
> >> a hint in the kernel not to map the 2MB offset ? That's still legacy,
> >> but at least it does not require to recompile a kernel as the check
> >> would be done at runtime.
> >>
> > I was suggesting to add a risc-v specific booting document and
> > document this "bug".
> > Documentation/riscv/boot-image-header.rst can be linked from that document or
> > the boot header content can be included in that. No code changes are necessary.
> >
> > Eventually, this booting document will also include other additional
> > booting constraints for RISC-V
> > such as minimum extension required to boot Linux, csr state upon
> > entering S-mode, mmu state.
>
>
> Ok I will prepare a boot document that links to the existing documents and
> add all of that, I will need you for the last constraints that I don't
> know about.
>
Thanks. I will send patches on top of your boot document patch.
This was long overdue. Thanks for coming up with the initial version :).
> Thanks Atish,
>
> Alex
>
--
Regards,
Atish
On Wed, Jun 3, 2020 at 8:38 AM Alexandre Ghiti <[email protected]> wrote:
>
> Improve best_map_size so that PUD or PGDIR entries are used for linear
> mapping when possible as it allows better TLB utilization.
>
> Signed-off-by: Alexandre Ghiti <[email protected]>
> ---
> arch/riscv/mm/init.c | 45 +++++++++++++++++++++++++++++++++-----------
> 1 file changed, 34 insertions(+), 11 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 9a5c97e091c1..d275f9f834cf 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -424,13 +424,29 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
> create_pgd_next_mapping(nextp, va, pa, sz, prot);
> }
>
> -static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
> +static bool is_map_size_ok(uintptr_t map_size, phys_addr_t base,
> + uintptr_t base_virt, phys_addr_t size)
> {
> - /* Upgrade to PMD_SIZE mappings whenever possible */
> - if ((base & (PMD_SIZE - 1)) || (size & (PMD_SIZE - 1)))
> - return PAGE_SIZE;
> + return !((base & (map_size - 1)) || (base_virt & (map_size - 1)) ||
> + (size < map_size));
> +}
> +
> +static uintptr_t __init best_map_size(phys_addr_t base, uintptr_t base_virt,
> + phys_addr_t size)
> +{
> +#ifndef __PAGETABLE_PMD_FOLDED
> + if (is_map_size_ok(PGDIR_SIZE, base, base_virt, size))
> + return PGDIR_SIZE;
> +
> + if (pgtable_l4_enabled)
> + if (is_map_size_ok(PUD_SIZE, base, base_virt, size))
> + return PUD_SIZE;
> +#endif
> +
> + if (is_map_size_ok(PMD_SIZE, base, base_virt, size))
> + return PMD_SIZE;
>
> - return PMD_SIZE;
> + return PAGE_SIZE;
> }
>
> /*
> @@ -576,7 +592,7 @@ void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> {
> uintptr_t va, end_va;
> - uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
> + uintptr_t map_size;
>
> load_pa = (uintptr_t)(&_start);
> load_sz = (uintptr_t)(&_end) - load_pa;
> @@ -587,6 +603,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>
> kernel_virt_addr = KERNEL_VIRT_ADDR;
>
> + map_size = best_map_size(load_pa, PAGE_OFFSET, MAX_EARLY_MAPPING_SIZE);
> va_pa_offset = PAGE_OFFSET - load_pa;
> va_kernel_pa_offset = kernel_virt_addr - load_pa;
> pfn_base = PFN_DOWN(load_pa);
> @@ -700,6 +717,8 @@ static void __init setup_vm_final(void)
>
> /* Map all memory banks */
> for_each_memblock(memory, reg) {
> + uintptr_t remaining_size;
> +
> start = reg->base;
> end = start + reg->size;
>
> @@ -707,15 +726,19 @@ static void __init setup_vm_final(void)
> break;
> if (memblock_is_nomap(reg))
> continue;
> - if (start <= __pa(PAGE_OFFSET) &&
> - __pa(PAGE_OFFSET) < end)
> - start = __pa(PAGE_OFFSET);
>
> - map_size = best_map_size(start, end - start);
> - for (pa = start; pa < end; pa += map_size) {
> + pa = start;
> + remaining_size = reg->size;
> +
> + while (remaining_size) {
> va = (uintptr_t)__va(pa);
> + map_size = best_map_size(pa, va, remaining_size);
> +
> create_pgd_mapping(swapper_pg_dir, va, pa,
> map_size, PAGE_KERNEL);
> +
> + pa += map_size;
> + remaining_size -= map_size;
> }
> }
>
This may not work on RV32 with 2G of memory if the map_size is
determined to be page size
for the last memblock. Both pa & remaining_size will overflow and the
loop will try to map memory from zero again.
> --
> 2.20.1
>
>
--
Regards,
Atish
Hi Atish,
Le 6/18/20 à 8:47 PM, Atish Patra a écrit :
> On Wed, Jun 3, 2020 at 8:38 AM Alexandre Ghiti <[email protected]> wrote:
>> Improve best_map_size so that PUD or PGDIR entries are used for linear
>> mapping when possible as it allows better TLB utilization.
>>
>> Signed-off-by: Alexandre Ghiti <[email protected]>
>> ---
>> arch/riscv/mm/init.c | 45 +++++++++++++++++++++++++++++++++-----------
>> 1 file changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 9a5c97e091c1..d275f9f834cf 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -424,13 +424,29 @@ static void __init create_pgd_mapping(pgd_t *pgdp,
>> create_pgd_next_mapping(nextp, va, pa, sz, prot);
>> }
>>
>> -static uintptr_t __init best_map_size(phys_addr_t base, phys_addr_t size)
>> +static bool is_map_size_ok(uintptr_t map_size, phys_addr_t base,
>> + uintptr_t base_virt, phys_addr_t size)
>> {
>> - /* Upgrade to PMD_SIZE mappings whenever possible */
>> - if ((base & (PMD_SIZE - 1)) || (size & (PMD_SIZE - 1)))
>> - return PAGE_SIZE;
>> + return !((base & (map_size - 1)) || (base_virt & (map_size - 1)) ||
>> + (size < map_size));
>> +}
>> +
>> +static uintptr_t __init best_map_size(phys_addr_t base, uintptr_t base_virt,
>> + phys_addr_t size)
>> +{
>> +#ifndef __PAGETABLE_PMD_FOLDED
>> + if (is_map_size_ok(PGDIR_SIZE, base, base_virt, size))
>> + return PGDIR_SIZE;
>> +
>> + if (pgtable_l4_enabled)
>> + if (is_map_size_ok(PUD_SIZE, base, base_virt, size))
>> + return PUD_SIZE;
>> +#endif
>> +
>> + if (is_map_size_ok(PMD_SIZE, base, base_virt, size))
>> + return PMD_SIZE;
>>
>> - return PMD_SIZE;
>> + return PAGE_SIZE;
>> }
>>
>> /*
>> @@ -576,7 +592,7 @@ void create_kernel_page_table(pgd_t *pgdir, uintptr_t map_size)
>> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>> {
>> uintptr_t va, end_va;
>> - uintptr_t map_size = best_map_size(load_pa, MAX_EARLY_MAPPING_SIZE);
>> + uintptr_t map_size;
>>
>> load_pa = (uintptr_t)(&_start);
>> load_sz = (uintptr_t)(&_end) - load_pa;
>> @@ -587,6 +603,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>
>> kernel_virt_addr = KERNEL_VIRT_ADDR;
>>
>> + map_size = best_map_size(load_pa, PAGE_OFFSET, MAX_EARLY_MAPPING_SIZE);
>> va_pa_offset = PAGE_OFFSET - load_pa;
>> va_kernel_pa_offset = kernel_virt_addr - load_pa;
>> pfn_base = PFN_DOWN(load_pa);
>> @@ -700,6 +717,8 @@ static void __init setup_vm_final(void)
>>
>> /* Map all memory banks */
>> for_each_memblock(memory, reg) {
>> + uintptr_t remaining_size;
>> +
>> start = reg->base;
>> end = start + reg->size;
>>
>> @@ -707,15 +726,19 @@ static void __init setup_vm_final(void)
>> break;
>> if (memblock_is_nomap(reg))
>> continue;
>> - if (start <= __pa(PAGE_OFFSET) &&
>> - __pa(PAGE_OFFSET) < end)
>> - start = __pa(PAGE_OFFSET);
>>
>> - map_size = best_map_size(start, end - start);
>> - for (pa = start; pa < end; pa += map_size) {
>> + pa = start;
>> + remaining_size = reg->size;
>> +
>> + while (remaining_size) {
>> va = (uintptr_t)__va(pa);
>> + map_size = best_map_size(pa, va, remaining_size);
>> +
>> create_pgd_mapping(swapper_pg_dir, va, pa,
>> map_size, PAGE_KERNEL);
>> +
>> + pa += map_size;
>> + remaining_size -= map_size;
>> }
>> }
>>
> This may not work in the RV32 with 2G memory and if the map_size is
> determined to be a page size
> for the last memblock. Both pa & remaining_size will overflow and the
> loop will try to map memory from zero again.
I'm not sure I understand: if pa starts at 0x8000_0000 and size is 2G,
then pa will overflow in the last iteration, but remaining_size will
then be equal to 0, right?
And by the way, I realize that this loop only handles sizes that are
aligned on map_size.
Thanks,
Alex
>
>> --
>> 2.20.1
>>
>>
>
On Thu, Jun 18, 2020 at 9:28 PM Alex Ghiti <[email protected]> wrote:
>
> Hi Atish,
>
> Le 6/18/20 à 8:47 PM, Atish Patra a écrit :
> > On Wed, Jun 3, 2020 at 8:38 AM Alexandre Ghiti <[email protected]> wrote:
> >> Improve best_map_size so that PUD or PGDIR entries are used for linear
> >> mapping when possible as it allows better TLB utilization.
> >>
> >> Signed-off-by: Alexandre Ghiti <[email protected]>
> >> ---
> >> arch/riscv/mm/init.c | 45 +++++++++++++++++++++++++++++++++-----------
> >> 1 file changed, 34 insertions(+), 11 deletions(-)
> > This may not work in the RV32 with 2G memory and if the map_size is
> > determined to be a page size
> > for the last memblock. Both pa & remaining_size will overflow and the
> > loop will try to map memory from zero again.
>
> I'm not sure I understand: if pa starts at 0x8000_0000 and size is 2G,
> then pa will overflow in the last iteration, but remaining_size will
> then be equal to 0 right ?
>
Not unless remaining_size is at least page-size aligned. The last
remaining size would be 0xfff.
It would underflow as well after subtracting map_size.
> And by the way, I realize that this loop only handles sizes that are
> aligned on map_size.
>
Yeah.
> Thanks,
>
> Alex
>
>
> >
> >> --
> >> 2.20.1
> >>
> >>
> >
--
Regards,
Atish
Hi Atish,
Le 6/19/20 à 2:16 PM, Atish Patra a écrit :
> On Thu, Jun 18, 2020 at 9:28 PM Alex Ghiti <[email protected]> wrote:
>> Hi Atish,
>>
>> Le 6/18/20 à 8:47 PM, Atish Patra a écrit :
>>> On Wed, Jun 3, 2020 at 8:38 AM Alexandre Ghiti <[email protected]> wrote:
>>>> Improve best_map_size so that PUD or PGDIR entries are used for linear
>>>> mapping when possible as it allows better TLB utilization.
>>>>
>>>> Signed-off-by: Alexandre Ghiti <[email protected]>
>>> This may not work in the RV32 with 2G memory and if the map_size is
>>> determined to be a page size
>>> for the last memblock. Both pa & remaining_size will overflow and the
>>> loop will try to map memory from zero again.
>> I'm not sure I understand: if pa starts at 0x8000_0000 and size is 2G,
>> then pa will overflow in the last iteration, but remaining_size will
>> then be equal to 0 right ?
>>
> Not unless remaining_size is at least page-size aligned. The last
> remaining size would be 0xfff.
> It would underflow as well after subtracting map_size.
>
>> And by the way, I realize that this loop only handles sizes that are
>> aligned on map_size.
>>
> Yeah.
Thanks for noticing; I will send a v2.
Alex
>
>> Thanks,
>>
>> Alex
>>
>>
>>>> --
>>>> 2.20.1
>>>>
>>>>
>
>
Hi Atish,
Le 6/20/20 à 5:04 AM, Alex Ghiti a écrit :
> Hi Atish,
>
> Le 6/19/20 à 2:16 PM, Atish Patra a écrit :
>> On Thu, Jun 18, 2020 at 9:28 PM Alex Ghiti <[email protected]> wrote:
>>> Hi Atish,
>>>
>>> Le 6/18/20 à 8:47 PM, Atish Patra a écrit :
>>>> On Wed, Jun 3, 2020 at 8:38 AM Alexandre Ghiti <[email protected]> wrote:
>>>>> Improve best_map_size so that PUD or PGDIR entries are used for
>>>>> linear
>>>>> mapping when possible as it allows better TLB utilization.
>>>>>
>>>>> Signed-off-by: Alexandre Ghiti <[email protected]>
>>>>> ---
>>>>> arch/riscv/mm/init.c | 45
>>>>> +++++++++++++++++++++++++++++++++-----------
>>>>> 1 file changed, 34 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>>> index 9a5c97e091c1..d275f9f834cf 100644
>>>>> --- a/arch/riscv/mm/init.c
>>>>> +++ b/arch/riscv/mm/init.c
>>>>> @@ -424,13 +424,29 @@ static void __init create_pgd_mapping(pgd_t
>>>>> *pgdp,
>>>>> create_pgd_next_mapping(nextp, va, pa, sz, prot);
>>>>> }
>>>>>
>>>>> -static uintptr_t __init best_map_size(phys_addr_t base,
>>>>> phys_addr_t size)
>>>>> +static bool is_map_size_ok(uintptr_t map_size, phys_addr_t base,
>>>>> + uintptr_t base_virt, phys_addr_t size)
>>>>> {
>>>>> - /* Upgrade to PMD_SIZE mappings whenever possible */
>>>>> - if ((base & (PMD_SIZE - 1)) || (size & (PMD_SIZE - 1)))
>>>>> - return PAGE_SIZE;
>>>>> + return !((base & (map_size - 1)) || (base_virt & (map_size
>>>>> - 1)) ||
>>>>> + (size < map_size));
>>>>> +}
>>>>> +
>>>>> +static uintptr_t __init best_map_size(phys_addr_t base, uintptr_t
>>>>> base_virt,
>>>>> + phys_addr_t size)
>>>>> +{
>>>>> +#ifndef __PAGETABLE_PMD_FOLDED
>>>>> + if (is_map_size_ok(PGDIR_SIZE, base, base_virt, size))
>>>>> + return PGDIR_SIZE;
>>>>> +
>>>>> + if (pgtable_l4_enabled)
>>>>> + if (is_map_size_ok(PUD_SIZE, base, base_virt, size))
>>>>> + return PUD_SIZE;
>>>>> +#endif
>>>>> +
>>>>> + if (is_map_size_ok(PMD_SIZE, base, base_virt, size))
>>>>> + return PMD_SIZE;
>>>>>
>>>>> - return PMD_SIZE;
>>>>> + return PAGE_SIZE;
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -576,7 +592,7 @@ void create_kernel_page_table(pgd_t *pgdir,
>>>>> uintptr_t map_size)
>>>>> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>> {
>>>>> uintptr_t va, end_va;
>>>>> - uintptr_t map_size = best_map_size(load_pa,
>>>>> MAX_EARLY_MAPPING_SIZE);
>>>>> + uintptr_t map_size;
>>>>>
>>>>> load_pa = (uintptr_t)(&_start);
>>>>> load_sz = (uintptr_t)(&_end) - load_pa;
>>>>> @@ -587,6 +603,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>
>>>>> kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>>>
>>>>> + map_size = best_map_size(load_pa, PAGE_OFFSET,
>>>>> MAX_EARLY_MAPPING_SIZE);
>>>>> va_pa_offset = PAGE_OFFSET - load_pa;
>>>>> va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>>> pfn_base = PFN_DOWN(load_pa);
>>>>> @@ -700,6 +717,8 @@ static void __init setup_vm_final(void)
>>>>>
>>>>> /* Map all memory banks */
>>>>> for_each_memblock(memory, reg) {
>>>>> + uintptr_t remaining_size;
>>>>> +
>>>>> start = reg->base;
>>>>> end = start + reg->size;
>>>>>
>>>>> @@ -707,15 +726,19 @@ static void __init setup_vm_final(void)
>>>>> break;
>>>>> if (memblock_is_nomap(reg))
>>>>> continue;
>>>>> - if (start <= __pa(PAGE_OFFSET) &&
>>>>> - __pa(PAGE_OFFSET) < end)
>>>>> - start = __pa(PAGE_OFFSET);
>>>>>
>>>>> - map_size = best_map_size(start, end - start);
>>>>> - for (pa = start; pa < end; pa += map_size) {
>>>>> + pa = start;
>>>>> + remaining_size = reg->size;
>>>>> +
>>>>> + while (remaining_size) {
>>>>> va = (uintptr_t)__va(pa);
>>>>> + map_size = best_map_size(pa, va,
>>>>> remaining_size);
>>>>> +
>>>>> create_pgd_mapping(swapper_pg_dir, va, pa,
>>>>> map_size, PAGE_KERNEL);
>>>>> +
>>>>> + pa += map_size;
>>>>> + remaining_size -= map_size;
>>>>> }
>>>>> }
>>>>>
>>>> This may not work on RV32 with 2G of memory if the map_size is
>>>> determined to be a page size for the last memblock. Both pa &
>>>> remaining_size will overflow and the loop will try to map memory
>>>> from zero again.
>>> I'm not sure I understand: if pa starts at 0x8000_0000 and size is 2G,
>>> then pa will overflow in the last iteration, but remaining_size will
>>> then be equal to 0, right?
>>>
>> Not unless the remaining_size is at least page size aligned. The last
>> remaining size would be "fff".
>> It will overflow as well after subtracting the map_size.
While fixing this issue, I noticed that if the size in the device tree
is not aligned on PAGE_SIZE, it is automatically aligned down to
PAGE_SIZE: see early_init_dt_add_memory_arch, where the size is AND-ed
with PAGE_MASK to remove the unaligned part.
So the issue does not need to be fixed :)
Thanks anyway,
Alex
>>
>>> And by the way, I realize that this loop only handles sizes that are
>>> aligned on map_size.
>>>
>> Yeah.
>
>
> > Thanks for noticing, I will send a v2.
>
> Alex
>
>
>>
>>> Thanks,
>>>
>>> Alex
>>>
>>>
>>>>> --
>>>>> 2.20.1
>>>>>
>>>>>
>>
>>
>
On Sun, Jun 21, 2020 at 2:39 AM Alex Ghiti <[email protected]> wrote:
>
> Hi Atish,
>
> On 6/20/20 at 5:04 AM, Alex Ghiti wrote:
> > Hi Atish,
> >
> >> On 6/19/20 at 2:16 PM, Atish Patra wrote:
> >> On Thu, Jun 18, 2020 at 9:28 PM Alex Ghiti <[email protected]> wrote:
> >>> Hi Atish,
> >>>
> >>> On 6/18/20 at 8:47 PM, Atish Patra wrote:
> >>>> On Wed, Jun 3, 2020 at 8:38 AM Alexandre Ghiti <[email protected]> wrote:
> >>>>> Improve best_map_size so that PUD or PGDIR entries are used for
> >>>>> linear
> >>>>> mapping when possible as it allows better TLB utilization.
> >>>>>
> >>>>> Signed-off-by: Alexandre Ghiti <[email protected]>
> >>>>> ---
> >>>>> arch/riscv/mm/init.c | 45
> >>>>> +++++++++++++++++++++++++++++++++-----------
> >>>>> 1 file changed, 34 insertions(+), 11 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> >>>>> index 9a5c97e091c1..d275f9f834cf 100644
> >>>>> --- a/arch/riscv/mm/init.c
> >>>>> +++ b/arch/riscv/mm/init.c
> >>>>> @@ -424,13 +424,29 @@ static void __init create_pgd_mapping(pgd_t
> >>>>> *pgdp,
> >>>>> create_pgd_next_mapping(nextp, va, pa, sz, prot);
> >>>>> }
> >>>>>
> >>>>> -static uintptr_t __init best_map_size(phys_addr_t base,
> >>>>> phys_addr_t size)
> >>>>> +static bool is_map_size_ok(uintptr_t map_size, phys_addr_t base,
> >>>>> + uintptr_t base_virt, phys_addr_t size)
> >>>>> {
> >>>>> - /* Upgrade to PMD_SIZE mappings whenever possible */
> >>>>> - if ((base & (PMD_SIZE - 1)) || (size & (PMD_SIZE - 1)))
> >>>>> - return PAGE_SIZE;
> >>>>> + return !((base & (map_size - 1)) || (base_virt & (map_size
> >>>>> - 1)) ||
> >>>>> + (size < map_size));
> >>>>> +}
> >>>>> +
> >>>>> +static uintptr_t __init best_map_size(phys_addr_t base, uintptr_t
> >>>>> base_virt,
> >>>>> + phys_addr_t size)
> >>>>> +{
> >>>>> +#ifndef __PAGETABLE_PMD_FOLDED
> >>>>> + if (is_map_size_ok(PGDIR_SIZE, base, base_virt, size))
> >>>>> + return PGDIR_SIZE;
> >>>>> +
> >>>>> + if (pgtable_l4_enabled)
> >>>>> + if (is_map_size_ok(PUD_SIZE, base, base_virt, size))
> >>>>> + return PUD_SIZE;
> >>>>> +#endif
> >>>>> +
> >>>>> + if (is_map_size_ok(PMD_SIZE, base, base_virt, size))
> >>>>> + return PMD_SIZE;
> >>>>>
> >>>>> - return PMD_SIZE;
> >>>>> + return PAGE_SIZE;
> >>>>> }
> >>>>>
> >>>>> /*
> >>>>> @@ -576,7 +592,7 @@ void create_kernel_page_table(pgd_t *pgdir,
> >>>>> uintptr_t map_size)
> >>>>> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>>>> {
> >>>>> uintptr_t va, end_va;
> >>>>> - uintptr_t map_size = best_map_size(load_pa,
> >>>>> MAX_EARLY_MAPPING_SIZE);
> >>>>> + uintptr_t map_size;
> >>>>>
> >>>>> load_pa = (uintptr_t)(&_start);
> >>>>> load_sz = (uintptr_t)(&_end) - load_pa;
> >>>>> @@ -587,6 +603,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
> >>>>>
> >>>>> kernel_virt_addr = KERNEL_VIRT_ADDR;
> >>>>>
> >>>>> + map_size = best_map_size(load_pa, PAGE_OFFSET,
> >>>>> MAX_EARLY_MAPPING_SIZE);
> >>>>> va_pa_offset = PAGE_OFFSET - load_pa;
> >>>>> va_kernel_pa_offset = kernel_virt_addr - load_pa;
> >>>>> pfn_base = PFN_DOWN(load_pa);
> >>>>> @@ -700,6 +717,8 @@ static void __init setup_vm_final(void)
> >>>>>
> >>>>> /* Map all memory banks */
> >>>>> for_each_memblock(memory, reg) {
> >>>>> + uintptr_t remaining_size;
> >>>>> +
> >>>>> start = reg->base;
> >>>>> end = start + reg->size;
> >>>>>
> >>>>> @@ -707,15 +726,19 @@ static void __init setup_vm_final(void)
> >>>>> break;
> >>>>> if (memblock_is_nomap(reg))
> >>>>> continue;
> >>>>> - if (start <= __pa(PAGE_OFFSET) &&
> >>>>> - __pa(PAGE_OFFSET) < end)
> >>>>> - start = __pa(PAGE_OFFSET);
> >>>>>
> >>>>> - map_size = best_map_size(start, end - start);
> >>>>> - for (pa = start; pa < end; pa += map_size) {
> >>>>> + pa = start;
> >>>>> + remaining_size = reg->size;
> >>>>> +
> >>>>> + while (remaining_size) {
> >>>>> va = (uintptr_t)__va(pa);
> >>>>> + map_size = best_map_size(pa, va,
> >>>>> remaining_size);
> >>>>> +
> >>>>> create_pgd_mapping(swapper_pg_dir, va, pa,
> >>>>> map_size, PAGE_KERNEL);
> >>>>> +
> >>>>> + pa += map_size;
> >>>>> + remaining_size -= map_size;
> >>>>> }
> >>>>> }
> >>>>>
> >>>> This may not work on RV32 with 2G of memory if the map_size is
> >>>> determined to be a page size for the last memblock. Both pa &
> >>>> remaining_size will overflow and the loop will try to map memory
> >>>> from zero again.
> >>> I'm not sure I understand: if pa starts at 0x8000_0000 and size is 2G,
> >>> then pa will overflow in the last iteration, but remaining_size will
> >>> then be equal to 0, right?
> >>>
> >> Not unless the remaining_size is at least page size aligned. The last
> >> remaining size would be "fff".
> >> It will overflow as well after subtracting the map_size.
>
>
> While fixing this issue, I noticed that if the size in the device tree
> is not aligned on PAGE_SIZE, it is automatically aligned down to
> PAGE_SIZE: see early_init_dt_add_memory_arch, where the size is AND-ed
> with PAGE_MASK to remove the unaligned part.
>
Yes. But the memblock size is not guaranteed to be PAGE_SIZE aligned.
The memblock size is updated in memblock_cap_size:

/* adjust *@size so that (@base + *@size) doesn't overflow, return new size */
static inline phys_addr_t memblock_cap_size(phys_addr_t base,
					    phys_addr_t *size)
{
	return *size = min(*size, PHYS_ADDR_MAX - base);
}
You will not see this issue right away even if you allocate 2GB of
memory while running 32-bit Linux in QEMU, because the kernel removes
anything beyond 0xc0400000 for 32-bit in bootmem setup.

[    0.000000][    T0] memblock_remove: [0xc0400000-0xfffffffe]
setup_bootmem+0x90/0x216

This also restricts the kernel to using only 1GB of memory even if the
maximum physical memory supported is 2GB.
> So the issue does not need to be fixed :)
>
> Thanks anyway,
>
> Alex
>
>
> >>
> >>> And by the way, I realize that this loop only handles sizes that are
> >>> aligned on map_size.
> >>>
> >> Yeah.
> >
> >
> > Thanks for noticing, I will send a v2.
> >
> > Alex
> >
> >
> >>
> >>> Thanks,
> >>>
> >>> Alex
> >>>
> >>>
> >>>>> --
> >>>>> 2.20.1
> >>>>>
> >>>>>
> >>
> >>
> >
--
Regards,
Atish
On 6/3/20 at 11:36 AM, Alexandre Ghiti wrote:
> This small patchset intends to use PUD/PGDIR entries for linear mapping
> in order to better utilize TLB.
>
> At the moment, only PMD entries can be used since on common platforms
> (qemu/unleashed), the kernel is loaded at DRAM + 2MB which dealigns virtual
> and physical addresses and then prevents the use of PUD/PGDIR entries.
> So the kernel must be able to get those 2MB for PAGE_OFFSET to map the
> beginning of the DRAM: this is achieved in patch 1.
>
> But furthermore, at the moment, the firmware (opensbi) explicitly asks the
> kernel not to map the region it occupies, which is on those common
> platforms at the very beginning of the DRAM and then it also dealigns
> virtual and physical addresses. I proposed a patch here:
>
> https://github.com/riscv/opensbi/pull/167
>
> that removes this 'constraint' but *not* all the time as it offers some
> kind of protection in case PMP is not available. So sometimes, we may
> have a part of the memory below the kernel that is removed creating a
> misalignment between virtual and physical addresses. So for performance
> reasons, we must at least make sure that PMD entries can be used: that
> is guaranteed by patch 1 too.
>
> Finally the second patch simply improves best_map_size so that whenever
> possible, PUD/PGDIR entries are used.
>
> Below is the kernel page table without this patch on a 6G platform:
>
> ---[ Linear mapping ]---
> 0xffffc00000000000-0xffffc00176e00000 0x0000000080200000 5998M PMD D A . . . W R V
>
> And with this patchset + opensbi patch:
>
> ---[ Linear mapping ]---
> 0xffffc00000000000-0xffffc00140000000 0x0000000080000000 5G PUD D A . . . W R V
> 0xffffc00140000000-0xffffc00177000000 0x00000001c0000000 880M PMD D A . . . W R V
>
> Alexandre Ghiti (2):
> riscv: Get memory below load_pa while ensuring linear mapping is PMD
> aligned
> riscv: Use PUD/PGDIR entries for linear mapping when possible
>
> arch/riscv/include/asm/page.h | 8 ++++
> arch/riscv/mm/init.c | 69 +++++++++++++++++++++++++++++------
> 2 files changed, 65 insertions(+), 12 deletions(-)
>
The way to handle the remapping of the first 2MB is incorrect: Atish
has issues when using an initrd because the initrd_start variable is
defined using __va between setup_vm and setup_vm_final, and its value
then becomes inconsistent after setup_vm_final since virtual addressing
was modified by the remapping of the first 2MB.
I will come up with another solution to this problem since the way I
handle it for now is not correct.
Thanks,
Alex
Hi Atish,
On 6/22/20 at 3:11 PM, Atish Patra wrote:
> On Sun, Jun 21, 2020 at 2:39 AM Alex Ghiti <[email protected]> wrote:
>>
>> Hi Atish,
>>
>> On 6/20/20 at 5:04 AM, Alex Ghiti wrote:
>>> Hi Atish,
>>>
>>> On 6/19/20 at 2:16 PM, Atish Patra wrote:
>>>> On Thu, Jun 18, 2020 at 9:28 PM Alex Ghiti <[email protected]> wrote:
>>>>> Hi Atish,
>>>>>
>>>>> On 6/18/20 at 8:47 PM, Atish Patra wrote:
>>>>>> On Wed, Jun 3, 2020 at 8:38 AM Alexandre Ghiti <[email protected]> wrote:
>>>>>>> Improve best_map_size so that PUD or PGDIR entries are used for
>>>>>>> linear
>>>>>>> mapping when possible as it allows better TLB utilization.
>>>>>>>
>>>>>>> Signed-off-by: Alexandre Ghiti <[email protected]>
>>>>>>> ---
>>>>>>> arch/riscv/mm/init.c | 45
>>>>>>> +++++++++++++++++++++++++++++++++-----------
>>>>>>> 1 file changed, 34 insertions(+), 11 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>>>>> index 9a5c97e091c1..d275f9f834cf 100644
>>>>>>> --- a/arch/riscv/mm/init.c
>>>>>>> +++ b/arch/riscv/mm/init.c
>>>>>>> @@ -424,13 +424,29 @@ static void __init create_pgd_mapping(pgd_t
>>>>>>> *pgdp,
>>>>>>> create_pgd_next_mapping(nextp, va, pa, sz, prot);
>>>>>>> }
>>>>>>>
>>>>>>> -static uintptr_t __init best_map_size(phys_addr_t base,
>>>>>>> phys_addr_t size)
>>>>>>> +static bool is_map_size_ok(uintptr_t map_size, phys_addr_t base,
>>>>>>> + uintptr_t base_virt, phys_addr_t size)
>>>>>>> {
>>>>>>> - /* Upgrade to PMD_SIZE mappings whenever possible */
>>>>>>> - if ((base & (PMD_SIZE - 1)) || (size & (PMD_SIZE - 1)))
>>>>>>> - return PAGE_SIZE;
>>>>>>> + return !((base & (map_size - 1)) || (base_virt & (map_size
>>>>>>> - 1)) ||
>>>>>>> + (size < map_size));
>>>>>>> +}
>>>>>>> +
>>>>>>> +static uintptr_t __init best_map_size(phys_addr_t base, uintptr_t
>>>>>>> base_virt,
>>>>>>> + phys_addr_t size)
>>>>>>> +{
>>>>>>> +#ifndef __PAGETABLE_PMD_FOLDED
>>>>>>> + if (is_map_size_ok(PGDIR_SIZE, base, base_virt, size))
>>>>>>> + return PGDIR_SIZE;
>>>>>>> +
>>>>>>> + if (pgtable_l4_enabled)
>>>>>>> + if (is_map_size_ok(PUD_SIZE, base, base_virt, size))
>>>>>>> + return PUD_SIZE;
>>>>>>> +#endif
>>>>>>> +
>>>>>>> + if (is_map_size_ok(PMD_SIZE, base, base_virt, size))
>>>>>>> + return PMD_SIZE;
>>>>>>>
>>>>>>> - return PMD_SIZE;
>>>>>>> + return PAGE_SIZE;
>>>>>>> }
>>>>>>>
>>>>>>> /*
>>>>>>> @@ -576,7 +592,7 @@ void create_kernel_page_table(pgd_t *pgdir,
>>>>>>> uintptr_t map_size)
>>>>>>> asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>>> {
>>>>>>> uintptr_t va, end_va;
>>>>>>> - uintptr_t map_size = best_map_size(load_pa,
>>>>>>> MAX_EARLY_MAPPING_SIZE);
>>>>>>> + uintptr_t map_size;
>>>>>>>
>>>>>>> load_pa = (uintptr_t)(&_start);
>>>>>>> load_sz = (uintptr_t)(&_end) - load_pa;
>>>>>>> @@ -587,6 +603,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>>>>>>>
>>>>>>> kernel_virt_addr = KERNEL_VIRT_ADDR;
>>>>>>>
>>>>>>> + map_size = best_map_size(load_pa, PAGE_OFFSET,
>>>>>>> MAX_EARLY_MAPPING_SIZE);
>>>>>>> va_pa_offset = PAGE_OFFSET - load_pa;
>>>>>>> va_kernel_pa_offset = kernel_virt_addr - load_pa;
>>>>>>> pfn_base = PFN_DOWN(load_pa);
>>>>>>> @@ -700,6 +717,8 @@ static void __init setup_vm_final(void)
>>>>>>>
>>>>>>> /* Map all memory banks */
>>>>>>> for_each_memblock(memory, reg) {
>>>>>>> + uintptr_t remaining_size;
>>>>>>> +
>>>>>>> start = reg->base;
>>>>>>> end = start + reg->size;
>>>>>>>
>>>>>>> @@ -707,15 +726,19 @@ static void __init setup_vm_final(void)
>>>>>>> break;
>>>>>>> if (memblock_is_nomap(reg))
>>>>>>> continue;
>>>>>>> - if (start <= __pa(PAGE_OFFSET) &&
>>>>>>> - __pa(PAGE_OFFSET) < end)
>>>>>>> - start = __pa(PAGE_OFFSET);
>>>>>>>
>>>>>>> - map_size = best_map_size(start, end - start);
>>>>>>> - for (pa = start; pa < end; pa += map_size) {
>>>>>>> + pa = start;
>>>>>>> + remaining_size = reg->size;
>>>>>>> +
>>>>>>> + while (remaining_size) {
>>>>>>> va = (uintptr_t)__va(pa);
>>>>>>> + map_size = best_map_size(pa, va,
>>>>>>> remaining_size);
>>>>>>> +
>>>>>>> create_pgd_mapping(swapper_pg_dir, va, pa,
>>>>>>> map_size, PAGE_KERNEL);
>>>>>>> +
>>>>>>> + pa += map_size;
>>>>>>> + remaining_size -= map_size;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>> This may not work on RV32 with 2G of memory if the map_size is
>>>>>> determined to be a page size for the last memblock. Both pa &
>>>>>> remaining_size will overflow and the loop will try to map memory
>>>>>> from zero again.
>>>>> I'm not sure I understand: if pa starts at 0x8000_0000 and size is 2G,
>>>>> then pa will overflow in the last iteration, but remaining_size will
>>>>> then be equal to 0, right?
>>>>>
>>>> Not unless the remaining_size is at least page size aligned. The last
>>>> remaining size would be "fff".
>>>> It will overflow as well after subtracting the map_size.
>>
>>
>> While fixing this issue, I noticed that if the size in the device tree
>> is not aligned on PAGE_SIZE, it is automatically aligned down to
>> PAGE_SIZE: see early_init_dt_add_memory_arch, where the size is AND-ed
>> with PAGE_MASK to remove the unaligned part.
>>
> Yes. But the memblock size is not guaranteed to be PAGE_SIZE aligned.
> The memblock size is updated in memblock_cap_size:
>
> /* adjust *@size so that (@base + *@size) doesn't overflow, return new size */
> static inline phys_addr_t memblock_cap_size(phys_addr_t base,
> 					    phys_addr_t *size)
> {
> 	return *size = min(*size, PHYS_ADDR_MAX - base);
> }
>
Yes, you're right, I will fix that in a v2.
Thanks,
Alex
> You will not see this issue right away even if you allocate 2GB of
> memory while running 32-bit Linux in QEMU, because the kernel removes
> anything beyond 0xc0400000 for 32-bit in bootmem setup.
>
> [    0.000000][    T0] memblock_remove: [0xc0400000-0xfffffffe]
> setup_bootmem+0x90/0x216
>
> This also restricts the kernel to using only 1GB of memory even if the
> maximum physical memory supported is 2GB.
>
>> So the issue does not need to be fixed :)
>>
>> Thanks anyway,
>>
>> Alex
>>
>>
>>>>
>>>>> And by the way, I realize that this loop only handles sizes that are
>>>>> aligned on map_size.
>>>>>
>>>> Yeah.
>>>
>>>
>>> Thanks for noticing, I will send a v2.
>>>
>>> Alex
>>>
>>>
>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>>
>>>>>>> --
>>>>>>> 2.20.1
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>
>
>
>
> --
> Regards,
> Atish
>