Hi,
This is v3 of the patchset which cleans up a number of minor issues
from the feedback of v2 and rebases onto v5.6-rc2. Additional feedback
is welcome.
Thanks,
Logan
--
Changes in v3:
* Rebased onto v5.6-rc2
* Rename mhp_modifiers to mhp_params per David with an updated kernel
doc per Dan
* Drop support for s390 per David seeing it does not support
ZONE_DEVICE yet and there was a potential problem with huge pages.
* Added WARN_ON_ONCE in cases where arches recieve non PAGE_KERNEL
parameters
* Collected David and Micheal's Reviewed-By and Acked-by Tags
Changes in v2:
* Rebased onto v5.5-rc5
* Renamed mhp_restrictions to mhp_modifiers and added the pgprot field
to that structure instead of using an argument for
arch_add_memory().
* Add patch to drop the unused flags field in mhp_restrictions
A git branch is available here:
https://github.com/sbates130272/linux-p2pmem remap_pages_cache_v3
--
Currently, the page tables created using memremap_pages() are always
created with the PAGE_KERNEL cacheing mode. However, the P2PDMA code
is creating pages for PCI BAR memory which should never be accessed
through the cache and instead use either WC or UC. This still works in
most cases, on x86, because the MTRR registers typically override the
caching settings in the page tables for all of the IO memory to be
UC-. However, this tends not to work so well on other arches or
some rare x86 machines that have firmware which does not setup the
MTRR registers in this way.
Instead of this, this series proposes a change to arch_add_memory()
to take the pgprot required by the mapping which allows us to
explicitly set pagetable entries for P2PDMA memory to WC.
This changes is pretty routine for most of the arches: x86_64, s390, arm64
and powerpc simply need to thread the pgprot through to where the page
tables are setup. x86_32 unfortunately sets up the page tables at boot so
must use _set_memory_prot() to change their caching mode. ia64 and sh
don't appear to have an easy way to change the page tables so, for now
at least, we just return -EINVAL on such mappings and thus they will
not support P2PDMA memory until the work for this is done.
--
Logan Gunthorpe (7):
mm/memory_hotplug: Drop the flags field from struct mhp_restrictions
mm/memory_hotplug: Rename mhp_restrictions to mhp_params
x86/mm: Thread pgprot_t through init_memory_mapping()
x86/mm: Introduce _set_memory_prot()
powerpc/mm: Thread pgprot_t through create_section_mapping()
mm/memory_hotplug: Add pgprot_t to mhp_params
mm/memremap: Set caching mode for PCI P2PDMA memory to WC
arch/arm64/mm/mmu.c | 7 ++--
arch/ia64/mm/init.c | 7 ++--
arch/powerpc/include/asm/book3s/64/hash.h | 3 +-
arch/powerpc/include/asm/book3s/64/radix.h | 3 +-
arch/powerpc/include/asm/sparsemem.h | 3 +-
arch/powerpc/mm/book3s64/hash_utils.c | 5 +--
arch/powerpc/mm/book3s64/pgtable.c | 7 ++--
arch/powerpc/mm/book3s64/radix_pgtable.c | 18 ++++++----
arch/powerpc/mm/mem.c | 10 +++---
arch/s390/mm/init.c | 9 +++--
arch/sh/mm/init.c | 7 ++--
arch/x86/include/asm/page_types.h | 3 --
arch/x86/include/asm/pgtable.h | 3 ++
arch/x86/include/asm/set_memory.h | 1 +
arch/x86/kernel/amd_gart_64.c | 3 +-
arch/x86/mm/init.c | 9 ++---
arch/x86/mm/init_32.c | 12 +++++--
arch/x86/mm/init_64.c | 40 ++++++++++++----------
arch/x86/mm/mm_internal.h | 3 +-
arch/x86/mm/pat/set_memory.c | 7 ++++
arch/x86/platform/uv/bios_uv.c | 3 +-
include/linux/memory_hotplug.h | 20 +++++------
mm/memory_hotplug.c | 11 +++---
mm/memremap.c | 17 +++++----
24 files changed, 130 insertions(+), 81 deletions(-)
base-commit: 11a48a5a18c63fd7621bb050228cebf13566e4d8
--
2.20.1
PCI BAR IO memory should never be mapped as WB, however prior to this
the PAT bits were set WB and it was typically overridden by MTRR
registers set by the firmware.
Set PCI P2PDMA memory to be WC (writecombining) as the only current
user (the NVMe CMB) was originally mapped WC before the P2PDMA code
replaced the mapping with devm_memremap_pages().
Future use-cases may need to generalize this by adding flags to
select the caching type, as some P2PDMA cases will not want WC.
However, those use-cases are not upstream yet and this can be changed
when they arrive.
Cc: Dan Williams <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
mm/memremap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/memremap.c b/mm/memremap.c
index 06742372a203..8d141c3e3364 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -190,7 +190,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
break;
case MEMORY_DEVICE_DEVDAX:
+ need_devmap_managed = false;
+ break;
case MEMORY_DEVICE_PCI_P2PDMA:
+ params.pgprot = pgprot_writecombine(params.pgprot);
need_devmap_managed = false;
break;
default:
--
2.20.1
devm_memremap_pages() is currently used by the PCI P2PDMA code to create
struct page mappings for IO memory. At present, these mappings are created
with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
x86, an mtrr register will typically override this and force the cache
type to be UC-. In the case firmware doesn't set this register it is
effectively WB and will typically result in a machine check exception
when it's accessed.
Other arches are not currently likely to function correctly seeing they
don't have any MTRR registers to fall back on.
To solve this, provide a way to specify the pgprot value explicitly to
arch_add_memory().
Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
change to pass the pgprot_t down to their respective functions which set
up the page tables. For x86_32, set the page tables explicitly using
_set_memory_prot() (seeing they are already mapped). For ia64, s390 and
sh, reject anything but PAGE_KERNEL settings -- this should be fine,
for now, seeing these architectures don't support ZONE_DEVICE.
A check in __add_pages() is also added to ensure the pgprot parameter was
set for all arches.
Cc: Dan Williams <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
arch/arm64/mm/mmu.c | 3 ++-
arch/ia64/mm/init.c | 3 +++
arch/powerpc/mm/mem.c | 3 ++-
arch/s390/mm/init.c | 3 +++
arch/sh/mm/init.c | 3 +++
arch/x86/mm/init_32.c | 5 +++++
arch/x86/mm/init_64.c | 2 +-
include/linux/memory_hotplug.h | 2 ++
mm/memory_hotplug.c | 5 ++++-
mm/memremap.c | 6 +++---
10 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index ee37bca8aba8..ea3fa844a8a2 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
- size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
+ size, params->pgprot, __pgd_pgtable_alloc,
+ flags);
memblock_clear_nomap(start, size);
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index 97bbc23ea1e3..d637b4ea3147 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
+ if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
+ return -EINVAL;
+
ret = __add_pages(nid, start_pfn, nr_pages, params);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 19b1da5d7eca..832412bc7fad 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
resize_hpt_for_hotplug(memblock_phys_mem_size());
start = (unsigned long)__va(start);
- rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
+ rc = create_section_mapping(start, start + size, nid,
+ params->pgprot);
if (rc) {
pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
start, start + size, rc);
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index e9e4a7abd0cc..87b2d024e75a 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -277,6 +277,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
if (WARN_ON_ONCE(params->altmap))
return -EINVAL;
+ if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
+ return -EINVAL;
+
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index e5114c053364..b9de2d4fa57e 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -412,6 +412,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
+ if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot)
+ return -EINVAL;
+
/* We only have ZONE_NORMAL, so this is easy.. */
ret = __add_pages(nid, start_pfn, nr_pages, params);
if (unlikely(ret))
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index e25a4218e6ff..96d8e4fb1cc8 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -858,6 +858,11 @@ int arch_add_memory(int nid, u64 start, u64 size,
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+ int ret;
+
+ ret = _set_memory_prot(start, nr_pages, params->pgprot);
+ if (ret)
+ return ret;
return __add_pages(nid, start_pfn, nr_pages, params);
}
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 9e7692080dda..230240af38b4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -868,7 +868,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- init_memory_mapping(start, start + size, PAGE_KERNEL);
+ init_memory_mapping(start, start + size, params->pgprot);
return add_pages(nid, start_pfn, nr_pages, params);
}
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index c5df1b3dada0..30d6c1b8847e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -56,9 +56,11 @@ enum {
/*
* Extended parameters for memory hotplug:
* altmap: alternative allocator for memmap array (optional)
+ * pgprot: page protection flags to apply to newly added page tables (required)
*/
struct mhp_params {
struct vmem_altmap *altmap;
+ pgprot_t pgprot;
};
/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c69469e1b40e..d7d4806ad81b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -305,6 +305,9 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
unsigned long nr, start_sec, end_sec;
struct vmem_altmap *altmap = params->altmap;
+ if (WARN_ON_ONCE(!params->pgprot.pgprot))
+ return -EINVAL;
+
err = check_hotplug_memory_addressable(pfn, nr_pages);
if (err)
return err;
@@ -993,7 +996,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
*/
int __ref add_memory_resource(int nid, struct resource *res)
{
- struct mhp_params params = {};
+ struct mhp_params params = { .pgprot = PAGE_KERNEL };
u64 start, size;
bool new_node = false;
int ret;
diff --git a/mm/memremap.c b/mm/memremap.c
index 6891a503a078..06742372a203 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -166,8 +166,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
* We do not want any optional features only our own memmap
*/
.altmap = pgmap_altmap(pgmap),
+ .pgprot = PAGE_KERNEL,
};
- pgprot_t pgprot = PAGE_KERNEL;
int error, is_ram;
bool need_devmap_managed = true;
@@ -255,8 +255,8 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
if (nid < 0)
nid = numa_mem_id();
- error = track_pfn_remap(NULL, &pgprot, PHYS_PFN(res->start), 0,
- resource_size(res));
+ error = track_pfn_remap(NULL, ¶ms.pgprot, PHYS_PFN(res->start),
+ 0, resource_size(res));
if (error)
goto err_pfn_remap;
--
2.20.1
In prepartion to support a pgprot_t argument for arch_add_memory().
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: Michael Ellerman <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
arch/powerpc/include/asm/book3s/64/hash.h | 3 ++-
arch/powerpc/include/asm/book3s/64/radix.h | 3 ++-
arch/powerpc/include/asm/sparsemem.h | 3 ++-
arch/powerpc/mm/book3s64/hash_utils.c | 5 +++--
arch/powerpc/mm/book3s64/pgtable.c | 7 ++++---
arch/powerpc/mm/book3s64/radix_pgtable.c | 18 +++++++++++-------
arch/powerpc/mm/mem.c | 5 +++--
7 files changed, 27 insertions(+), 17 deletions(-)
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index 2781ebf6add4..6fc4520092c7 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -251,7 +251,8 @@ extern int __meminit hash__vmemmap_create_mapping(unsigned long start,
extern void hash__vmemmap_remove_mapping(unsigned long start,
unsigned long page_size);
-int hash__create_section_mapping(unsigned long start, unsigned long end, int nid);
+int hash__create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot);
int hash__remove_section_mapping(unsigned long start, unsigned long end);
#endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index d97db3ad9aae..46799f3c3d1d 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -289,7 +289,8 @@ static inline unsigned long radix__get_tree_size(void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int radix__create_section_mapping(unsigned long start, unsigned long end, int nid);
+int radix__create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot);
int radix__remove_section_mapping(unsigned long start, unsigned long end);
#endif /* CONFIG_MEMORY_HOTPLUG */
#endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/sparsemem.h b/arch/powerpc/include/asm/sparsemem.h
index 3192d454a733..c89b32443cff 100644
--- a/arch/powerpc/include/asm/sparsemem.h
+++ b/arch/powerpc/include/asm/sparsemem.h
@@ -13,7 +13,8 @@
#endif /* CONFIG_SPARSEMEM */
#ifdef CONFIG_MEMORY_HOTPLUG
-extern int create_section_mapping(unsigned long start, unsigned long end, int nid);
+extern int create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot);
extern int remove_section_mapping(unsigned long start, unsigned long end);
#ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index 523d4d39d11e..201738e07a1d 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -809,7 +809,8 @@ int resize_hpt_for_hotplug(unsigned long new_mem_size)
return 0;
}
-int hash__create_section_mapping(unsigned long start, unsigned long end, int nid)
+int hash__create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot)
{
int rc;
@@ -819,7 +820,7 @@ int hash__create_section_mapping(unsigned long start, unsigned long end, int nid
}
rc = htab_bolt_mapping(start, end, __pa(start),
- pgprot_val(PAGE_KERNEL), mmu_linear_psize,
+ pgprot_val(prot), mmu_linear_psize,
mmu_kernel_ssize);
if (rc < 0) {
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 2bf7e1b4fd82..e0bb69c616e4 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -171,12 +171,13 @@ void mmu_cleanup_all(void)
}
#ifdef CONFIG_MEMORY_HOTPLUG
-int __meminit create_section_mapping(unsigned long start, unsigned long end, int nid)
+int __meminit create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot)
{
if (radix_enabled())
- return radix__create_section_mapping(start, end, nid);
+ return radix__create_section_mapping(start, end, nid, prot);
- return hash__create_section_mapping(start, end, nid);
+ return hash__create_section_mapping(start, end, nid, prot);
}
int __meminit remove_section_mapping(unsigned long start, unsigned long end)
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index dd1bea45325c..0ef10b5c26ba 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -253,7 +253,7 @@ static unsigned long next_boundary(unsigned long addr, unsigned long end)
static int __meminit create_physical_mapping(unsigned long start,
unsigned long end,
- int nid)
+ int nid, pgprot_t _prot)
{
unsigned long vaddr, addr, mapping_size = 0;
bool prev_exec, exec = false;
@@ -289,7 +289,7 @@ static int __meminit create_physical_mapping(unsigned long start,
prot = PAGE_KERNEL_X;
exec = true;
} else {
- prot = PAGE_KERNEL;
+ prot = _prot;
exec = false;
}
@@ -333,7 +333,7 @@ static void __init radix_init_pgtable(void)
WARN_ON(create_physical_mapping(reg->base,
reg->base + reg->size,
- -1));
+ -1, PAGE_KERNEL));
}
/* Find out how many PID bits are supported */
@@ -712,8 +712,10 @@ static int __meminit stop_machine_change_mapping(void *data)
spin_unlock(&init_mm.page_table_lock);
pte_clear(&init_mm, params->aligned_start, params->pte);
- create_physical_mapping(__pa(params->aligned_start), __pa(params->start), -1);
- create_physical_mapping(__pa(params->end), __pa(params->aligned_end), -1);
+ create_physical_mapping(__pa(params->aligned_start),
+ __pa(params->start), -1, PAGE_KERNEL);
+ create_physical_mapping(__pa(params->end), __pa(params->aligned_end),
+ -1, PAGE_KERNEL);
spin_lock(&init_mm.page_table_lock);
return 0;
}
@@ -870,14 +872,16 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
radix__flush_tlb_kernel_range(start, end);
}
-int __meminit radix__create_section_mapping(unsigned long start, unsigned long end, int nid)
+int __meminit radix__create_section_mapping(unsigned long start,
+ unsigned long end, int nid,
+ pgprot_t prot)
{
if (end >= RADIX_VMALLOC_START) {
pr_warn("Outside the supported range\n");
return -1;
}
- return create_physical_mapping(__pa(start), __pa(end), nid);
+ return create_physical_mapping(__pa(start), __pa(end), nid, prot);
}
int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index b4bece53bec0..19b1da5d7eca 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -96,7 +96,8 @@ int memory_add_physaddr_to_nid(u64 start)
}
#endif
-int __weak create_section_mapping(unsigned long start, unsigned long end, int nid)
+int __weak create_section_mapping(unsigned long start, unsigned long end,
+ int nid, pgprot_t prot)
{
return -ENODEV;
}
@@ -137,7 +138,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
resize_hpt_for_hotplug(memblock_phys_mem_size());
start = (unsigned long)__va(start);
- rc = create_section_mapping(start, start + size, nid);
+ rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
if (rc) {
pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
start, start + size, rc);
--
2.20.1
The mhp_restrictions struct really doesn't specify anything resembling
a restriction anymore so rename it to be mhp_params as it is a list
of extended parameters.
Signed-off-by: Logan Gunthorpe <[email protected]>
---
arch/arm64/mm/mmu.c | 4 ++--
arch/ia64/mm/init.c | 4 ++--
arch/powerpc/mm/mem.c | 4 ++--
arch/s390/mm/init.c | 6 +++---
arch/sh/mm/init.c | 4 ++--
arch/x86/mm/init_32.c | 4 ++--
arch/x86/mm/init_64.c | 8 ++++----
include/linux/memory_hotplug.h | 16 ++++++++--------
mm/memory_hotplug.c | 8 ++++----
mm/memremap.c | 8 ++++----
10 files changed, 33 insertions(+), 33 deletions(-)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 128f70852bf3..ee37bca8aba8 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1050,7 +1050,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
int flags = 0;
@@ -1063,7 +1063,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
memblock_clear_nomap(start, size);
return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
- restrictions);
+ params);
}
void arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap)
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index b01d68a2d5d9..97bbc23ea1e3 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -670,13 +670,13 @@ mem_init (void)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
- ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
+ ret = __add_pages(nid, start_pfn, nr_pages, params);
if (ret)
printk("%s: Problem encountered in __add_pages() as ret=%d\n",
__func__, ret);
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index ef7b1119b2e2..b4bece53bec0 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -128,7 +128,7 @@ static void flush_dcache_range_chunked(unsigned long start, unsigned long stop,
}
int __ref arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
@@ -144,7 +144,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
return -EFAULT;
}
- return __add_pages(nid, start_pfn, nr_pages, restrictions);
+ return __add_pages(nid, start_pfn, nr_pages, params);
}
void __ref arch_remove_memory(int nid, u64 start, u64 size,
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index ac44bd76db4b..e9e4a7abd0cc 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -268,20 +268,20 @@ device_initcall(s390_cma_mem_init);
#endif /* CONFIG_CMA */
int arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long size_pages = PFN_DOWN(size);
int rc;
- if (WARN_ON_ONCE(restrictions->altmap))
+ if (WARN_ON_ONCE(params->altmap))
return -EINVAL;
rc = vmem_add_mapping(start, size);
if (rc)
return rc;
- rc = __add_pages(nid, start_pfn, size_pages, restrictions);
+ rc = __add_pages(nid, start_pfn, size_pages, params);
if (rc)
vmem_remove_mapping(start, size);
return rc;
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index d1b1ff2be17a..e5114c053364 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -406,14 +406,14 @@ void __init mem_init(void)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
unsigned long start_pfn = PFN_DOWN(start);
unsigned long nr_pages = size >> PAGE_SHIFT;
int ret;
/* We only have ZONE_NORMAL, so this is easy.. */
- ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
+ ret = __add_pages(nid, start_pfn, nr_pages, params);
if (unlikely(ret))
printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 23df4885bbed..3ec3dac7c268 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -853,12 +853,12 @@ void __init mem_init(void)
#ifdef CONFIG_MEMORY_HOTPLUG
int arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- return __add_pages(nid, start_pfn, nr_pages, restrictions);
+ return __add_pages(nid, start_pfn, nr_pages, params);
}
void arch_remove_memory(int nid, u64 start, u64 size,
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index abbdecb75fad..87977a8bfbbf 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -844,11 +844,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
}
int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
int ret;
- ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
+ ret = __add_pages(nid, start_pfn, nr_pages, params);
WARN_ON_ONCE(ret);
/* update max_pfn, max_low_pfn and high_memory */
@@ -859,14 +859,14 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
}
int arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
init_memory_mapping(start, start + size);
- return add_pages(nid, start_pfn, nr_pages, restrictions);
+ return add_pages(nid, start_pfn, nr_pages, params);
}
#define PAGE_INUSE 0xFD
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 69ff3037528d..c5df1b3dada0 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -54,10 +54,10 @@ enum {
};
/*
- * Restrictions for the memory hotplug:
- * altmap: alternative allocator for memmap array
+ * Extended parameters for memory hotplug:
+ * altmap: alternative allocator for memmap array (optional)
*/
-struct mhp_restrictions {
+struct mhp_params {
struct vmem_altmap *altmap;
};
@@ -108,7 +108,7 @@ extern int restore_online_page_callback(online_page_callback_t callback);
extern int try_online_node(int nid);
extern int arch_add_memory(int nid, u64 start, u64 size,
- struct mhp_restrictions *restrictions);
+ struct mhp_params *params);
extern u64 max_mem_size;
extern bool memhp_auto_online;
@@ -126,17 +126,17 @@ extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
/* reasonably generic interface to expand the physical pages */
extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
- struct mhp_restrictions *restrictions);
+ struct mhp_params *params);
#ifndef CONFIG_ARCH_HAS_ADD_PAGES
static inline int add_pages(int nid, unsigned long start_pfn,
- unsigned long nr_pages, struct mhp_restrictions *restrictions)
+ unsigned long nr_pages, struct mhp_params *params)
{
- return __add_pages(nid, start_pfn, nr_pages, restrictions);
+ return __add_pages(nid, start_pfn, nr_pages, params);
}
#else /* ARCH_HAS_ADD_PAGES */
int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
- struct mhp_restrictions *restrictions);
+ struct mhp_params *params);
#endif /* ARCH_HAS_ADD_PAGES */
#ifdef CONFIG_NUMA
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a54ffac8c68..c69469e1b40e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -299,11 +299,11 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
* add the new pages.
*/
int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
- struct mhp_restrictions *restrictions)
+ struct mhp_params *params)
{
int err;
unsigned long nr, start_sec, end_sec;
- struct vmem_altmap *altmap = restrictions->altmap;
+ struct vmem_altmap *altmap = params->altmap;
err = check_hotplug_memory_addressable(pfn, nr_pages);
if (err)
@@ -993,7 +993,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
*/
int __ref add_memory_resource(int nid, struct resource *res)
{
- struct mhp_restrictions restrictions = {};
+ struct mhp_params params = {};
u64 start, size;
bool new_node = false;
int ret;
@@ -1021,7 +1021,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
new_node = ret;
/* call arch's memory hotadd */
- ret = arch_add_memory(nid, start, size, &restrictions);
+ ret = arch_add_memory(nid, start, size, ¶ms);
if (ret < 0)
goto error;
diff --git a/mm/memremap.c b/mm/memremap.c
index 09b5b7adc773..6891a503a078 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -161,7 +161,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
{
struct resource *res = &pgmap->res;
struct dev_pagemap *conflict_pgmap;
- struct mhp_restrictions restrictions = {
+ struct mhp_params params = {
/*
* We do not want any optional features only our own memmap
*/
@@ -275,7 +275,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
*/
if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
error = add_pages(nid, PHYS_PFN(res->start),
- PHYS_PFN(resource_size(res)), &restrictions);
+ PHYS_PFN(resource_size(res)), ¶ms);
} else {
error = kasan_add_zero_shadow(__va(res->start), resource_size(res));
if (error) {
@@ -284,7 +284,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
}
error = arch_add_memory(nid, res->start, resource_size(res),
- &restrictions);
+ ¶ms);
}
if (!error) {
@@ -292,7 +292,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
- PHYS_PFN(resource_size(res)), restrictions.altmap);
+ PHYS_PFN(resource_size(res)), params.altmap);
}
mem_hotplug_done();
--
2.20.1
In prepartion to support a pgprot_t argument for arch_add_memory().
It's required to move the prototype of init_memory_mapping() seeing
the original location came before the definition of pgprot_t.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
arch/x86/include/asm/page_types.h | 3 ---
arch/x86/include/asm/pgtable.h | 3 +++
arch/x86/kernel/amd_gart_64.c | 3 ++-
arch/x86/mm/init.c | 9 +++++----
arch/x86/mm/init_32.c | 3 ++-
arch/x86/mm/init_64.c | 32 +++++++++++++++++--------------
arch/x86/mm/mm_internal.h | 3 ++-
arch/x86/platform/uv/bios_uv.c | 3 ++-
8 files changed, 34 insertions(+), 25 deletions(-)
diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index c85e15010f48..bf7aa2e290ef 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -73,9 +73,6 @@ static inline phys_addr_t get_max_mapped(void)
bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
-extern unsigned long init_memory_mapping(unsigned long start,
- unsigned long end);
-
extern void initmem_init(void);
#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 7e118660bbd9..48d6a5960f28 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1046,6 +1046,9 @@ static inline void __meminit init_trampoline_default(void)
void __init poking_init(void);
+unsigned long init_memory_mapping(unsigned long start,
+ unsigned long end, pgprot_t prot);
+
# ifdef CONFIG_RANDOMIZE_MEMORY
void __meminit init_trampoline(void);
# else
diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 4e5f50236048..16133819415c 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -744,7 +744,8 @@ int __init gart_iommu_init(void)
start_pfn = PFN_DOWN(aper_base);
if (!pfn_range_is_mapped(start_pfn, end_pfn))
- init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT);
+ init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT,
+ PAGE_KERNEL);
pr_info("PCI-DMA: using GART IOMMU.\n");
iommu_size = check_iommu_size(info.aper_base, aper_size);
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index e7bb483557c9..1bba16c5742b 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -467,7 +467,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
* the physical memory. To access them they are temporarily mapped.
*/
unsigned long __ref init_memory_mapping(unsigned long start,
- unsigned long end)
+ unsigned long end, pgprot_t prot)
{
struct map_range mr[NR_RANGE_MR];
unsigned long ret = 0;
@@ -481,7 +481,8 @@ unsigned long __ref init_memory_mapping(unsigned long start,
for (i = 0; i < nr_range; i++)
ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
- mr[i].page_size_mask);
+ mr[i].page_size_mask,
+ prot);
add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
@@ -521,7 +522,7 @@ static unsigned long __init init_range_memory_mapping(
*/
can_use_brk_pgt = max(start, (u64)pgt_buf_end<<PAGE_SHIFT) >=
min(end, (u64)pgt_buf_top<<PAGE_SHIFT);
- init_memory_mapping(start, end);
+ init_memory_mapping(start, end, PAGE_KERNEL);
mapped_ram_size += end - start;
can_use_brk_pgt = true;
}
@@ -661,7 +662,7 @@ void __init init_mem_mapping(void)
#endif
/* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL);
/* Init the trampoline, possibly with KASLR memory offset */
init_trampoline();
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 3ec3dac7c268..e25a4218e6ff 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -253,7 +253,8 @@ static inline int is_kernel_text(unsigned long addr)
unsigned long __init
kernel_physical_mapping_init(unsigned long start,
unsigned long end,
- unsigned long page_size_mask)
+ unsigned long page_size_mask,
+ pgprot_t prot)
{
int use_pse = page_size_mask == (1<<PG_LEVEL_2M);
unsigned long last_map_addr = end;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 87977a8bfbbf..9e7692080dda 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -585,7 +585,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
*/
static unsigned long __meminit
phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
- unsigned long page_size_mask, bool init)
+ unsigned long page_size_mask, pgprot_t _prot, bool init)
{
unsigned long pages = 0, paddr_next;
unsigned long paddr_last = paddr_end;
@@ -595,7 +595,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
for (; i < PTRS_PER_PUD; i++, paddr = paddr_next) {
pud_t *pud;
pmd_t *pmd;
- pgprot_t prot = PAGE_KERNEL;
+ pgprot_t prot = _prot;
vaddr = (unsigned long)__va(paddr);
pud = pud_page + pud_index(vaddr);
@@ -644,9 +644,12 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
if (page_size_mask & (1<<PG_LEVEL_1G)) {
pages++;
spin_lock(&init_mm.page_table_lock);
+
+ prot = __pgprot(pgprot_val(prot) | __PAGE_KERNEL_LARGE);
+
set_pte_init((pte_t *)pud,
pfn_pte((paddr & PUD_MASK) >> PAGE_SHIFT,
- PAGE_KERNEL_LARGE),
+ prot),
init);
spin_unlock(&init_mm.page_table_lock);
paddr_last = paddr_next;
@@ -669,7 +672,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
static unsigned long __meminit
phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
- unsigned long page_size_mask, bool init)
+ unsigned long page_size_mask, pgprot_t prot, bool init)
{
unsigned long vaddr, vaddr_end, vaddr_next, paddr_next, paddr_last;
@@ -679,7 +682,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
if (!pgtable_l5_enabled())
return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end,
- page_size_mask, init);
+ page_size_mask, prot, init);
for (; vaddr < vaddr_end; vaddr = vaddr_next) {
p4d_t *p4d = p4d_page + p4d_index(vaddr);
@@ -702,13 +705,13 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
if (!p4d_none(*p4d)) {
pud = pud_offset(p4d, 0);
paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
- page_size_mask, init);
+ page_size_mask, prot, init);
continue;
}
pud = alloc_low_page();
paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
- page_size_mask, init);
+ page_size_mask, prot, init);
spin_lock(&init_mm.page_table_lock);
p4d_populate_init(&init_mm, p4d, pud, init);
@@ -722,7 +725,7 @@ static unsigned long __meminit
__kernel_physical_mapping_init(unsigned long paddr_start,
unsigned long paddr_end,
unsigned long page_size_mask,
- bool init)
+ pgprot_t prot, bool init)
{
bool pgd_changed = false;
unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
@@ -743,13 +746,13 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
paddr_last = phys_p4d_init(p4d, __pa(vaddr),
__pa(vaddr_end),
page_size_mask,
- init);
+ prot, init);
continue;
}
p4d = alloc_low_page();
paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
- page_size_mask, init);
+ page_size_mask, prot, init);
spin_lock(&init_mm.page_table_lock);
if (pgtable_l5_enabled())
@@ -778,10 +781,10 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
unsigned long __meminit
kernel_physical_mapping_init(unsigned long paddr_start,
unsigned long paddr_end,
- unsigned long page_size_mask)
+ unsigned long page_size_mask, pgprot_t prot)
{
return __kernel_physical_mapping_init(paddr_start, paddr_end,
- page_size_mask, true);
+ page_size_mask, prot, true);
}
/*
@@ -796,7 +799,8 @@ kernel_physical_mapping_change(unsigned long paddr_start,
unsigned long page_size_mask)
{
return __kernel_physical_mapping_init(paddr_start, paddr_end,
- page_size_mask, false);
+ page_size_mask, PAGE_KERNEL,
+ false);
}
#ifndef CONFIG_NUMA
@@ -864,7 +868,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- init_memory_mapping(start, start + size);
+ init_memory_mapping(start, start + size, PAGE_KERNEL);
return add_pages(nid, start_pfn, nr_pages, params);
}
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index eeae142062ed..3f37b5c80bb3 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -12,7 +12,8 @@ void early_ioremap_page_table_range_init(void);
unsigned long kernel_physical_mapping_init(unsigned long start,
unsigned long end,
- unsigned long page_size_mask);
+ unsigned long page_size_mask,
+ pgprot_t prot);
unsigned long kernel_physical_mapping_change(unsigned long start,
unsigned long end,
unsigned long page_size_mask);
diff --git a/arch/x86/platform/uv/bios_uv.c b/arch/x86/platform/uv/bios_uv.c
index 607f58147311..c60255da5a6c 100644
--- a/arch/x86/platform/uv/bios_uv.c
+++ b/arch/x86/platform/uv/bios_uv.c
@@ -352,7 +352,8 @@ void __iomem *__init efi_ioremap(unsigned long phys_addr, unsigned long size,
if (type == EFI_MEMORY_MAPPED_IO)
return ioremap(phys_addr, size);
- last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size);
+ last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size,
+ PAGE_KERNEL);
if ((last_map_pfn << PAGE_SHIFT) < phys_addr + size) {
unsigned long top = last_map_pfn << PAGE_SHIFT;
efi_ioremap(top, size - (top - phys_addr), type, attribute);
--
2.20.1
For use in the 32bit arch_add_memory() to set the pgprot type of the
memory to add.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Logan Gunthorpe <[email protected]>
---
arch/x86/include/asm/set_memory.h | 1 +
arch/x86/mm/pat/set_memory.c | 7 +++++++
2 files changed, 8 insertions(+)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 64c3dce374e5..0aca959cf9a4 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -34,6 +34,7 @@
* The caller is required to take care of these.
*/
+int _set_memory_prot(unsigned long addr, int numpages, pgprot_t prot);
int _set_memory_uc(unsigned long addr, int numpages);
int _set_memory_wc(unsigned long addr, int numpages);
int _set_memory_wt(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index c4aedd00c1ba..2ba83d53d835 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -1792,6 +1792,13 @@ static inline int cpa_clear_pages_array(struct page **pages, int numpages,
CPA_PAGES_ARRAY, pages);
}
+int _set_memory_prot(unsigned long addr, int numpages, pgprot_t prot)
+{
+ return change_page_attr_set_clr(&addr, numpages, prot,
+ __pgprot(~pgprot_val(prot)), 0, 0,
+ NULL);
+}
+
int _set_memory_uc(unsigned long addr, int numpages)
{
/*
--
2.20.1
On 21.02.20 19:24, Logan Gunthorpe wrote:
> The mhp_restrictions struct really doesn't specify anything resembling
> a restriction anymore so rename it to be mhp_params as it is a list
> of extended parameters.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
--
Thanks,
David / dhildenb
On 21.02.20 19:25, Logan Gunthorpe wrote:
> devm_memremap_pages() is currently used by the PCI P2PDMA code to create
> struct page mappings for IO memory. At present, these mappings are created
> with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
> x86, an mtrr register will typically override this and force the cache
> type to be UC-. In the case firmware doesn't set this register it is
> effectively WB and will typically result in a machine check exception
> when it's accessed.
>
> Other arches are not currently likely to function correctly seeing they
> don't have any MTRR registers to fall back on.
>
> To solve this, provide a way to specify the pgprot value explicitly to
> arch_add_memory().
>
> Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
> change to pass the pgprot_t down to their respective functions which set
> up the page tables. For x86_32, set the page tables explicitly using
> _set_memory_prot() (seeing they are already mapped). For ia64, s390 and
> sh, reject anything but PAGE_KERNEL settings -- this should be fine,
> for now, seeing these architectures don't support ZONE_DEVICE.
>
> A check in __add_pages() is also added to ensure the pgprot parameter was
> set for all arches.
>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Acked-by: David Hildenbrand <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
[...]
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index c5df1b3dada0..30d6c1b8847e 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -56,9 +56,11 @@ enum {
> /*
> * Extended parameters for memory hotplug:
> * altmap: alternative allocator for memmap array (optional)
> + * pgprot: page protection flags to apply to newly added page tables (required)
s/added/created/?
--
Thanks,
David / dhildenb
On Fri, Feb 21, 2020 at 11:24:56AM -0700, Logan Gunthorpe wrote:
> Hi,
>
> This is v3 of the patchset which cleans up a number of minor issues
> from the feedback of v2 and rebases onto v5.6-rc2. Additional feedback
> is welcome.
>
> Thanks,
>
> Logan
>
> --
>
> Changes in v3:
> * Rebased onto v5.6-rc2
> * Rename mhp_modifiers to mhp_params per David with an updated kernel
> doc per Dan
> * Drop support for s390 per David seeing it does not support
> ZONE_DEVICE yet and there was a potential problem with huge pages.
> * Added WARN_ON_ONCE in cases where arches recieve non PAGE_KERNEL
> parameters
> * Collected David and Micheal's Reviewed-By and Acked-by Tags
>
> Changes in v2:
> * Rebased onto v5.5-rc5
> * Renamed mhp_restrictions to mhp_modifiers and added the pgprot field
> to that structure instead of using an argument for
> arch_add_memory().
> * Add patch to drop the unused flags field in mhp_restrictions
>
> A git branch is available here:
>
> https://github.com/sbates130272/linux-p2pmem remap_pages_cache_v3
>
> --
>
> Currently, the page tables created using memremap_pages() are always
> created with the PAGE_KERNEL cacheing mode. However, the P2PDMA code
> is creating pages for PCI BAR memory which should never be accessed
> through the cache and instead use either WC or UC. This still works in
> most cases, on x86, because the MTRR registers typically override the
> caching settings in the page tables for all of the IO memory to be
> UC-. However, this tends not to work so well on other arches or
> some rare x86 machines that have firmware which does not setup the
> MTRR registers in this way.
>
> Instead of this, this series proposes a change to arch_add_memory()
> to take the pgprot required by the mapping which allows us to
> explicitly set pagetable entries for P2PDMA memory to WC.
Is there a particular reason why WC was selected here? I thought for
the p2pdma cases there was no kernel user that touched the memory?
I definitely forsee devices where we want UC instead.
Even so, the whole idea looks like the right direction to me.
Jason
On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
>> Instead of this, this series proposes a change to arch_add_memory()
>> to take the pgprot required by the mapping which allows us to
>> explicitly set pagetable entries for P2PDMA memory to WC.
>
> Is there a particular reason why WC was selected here? I thought for
> the p2pdma cases there was no kernel user that touched the memory?
Yes, that's correct. I choose WC here because the existing users are
registering memory blocks without side effects which fit the WC
semantics well.
> I definitely forsee devices where we want UC instead.
Yes. My expectation is that once we have a kernel user that needs this,
we'd wire the option through struct dev_pagemap so the caller can choose
the mapping that makes sense.
Logan
On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
>
>
> On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> >> Instead of this, this series proposes a change to arch_add_memory()
> >> to take the pgprot required by the mapping which allows us to
> >> explicitly set pagetable entries for P2PDMA memory to WC.
> >
> > Is there a particular reason why WC was selected here? I thought for
> > the p2pdma cases there was no kernel user that touched the memory?
>
> Yes, that's correct. I choose WC here because the existing users are
> registering memory blocks without side effects which fit the WC
> semantics well.
Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
Linux, so while it is true the memory has no side effects, there would
be surprising concurrency risks if anything in the kernel tried to
write to it.
Not compatible means the locks don't contain stores to WC memory the
way you would expect. AFAIK on many CPUs extra barriers are required
to keep WC stores ordered, the same way ARM already has extra barriers
to keep UC stores ordered with locking..
The spinlocks are defined to contain UC stores though.
If there is no actual need today for WC I would suggest using UC as
the default.
Jason
On 2020-02-27 10:43 a.m., Jason Gunthorpe wrote:
> Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
> Linux, so while it is true the memory has no side effects, there would
> be surprising concurrency risks if anything in the kernel tried to
> write to it.
>
> Not compatible means the locks don't contain stores to WC memory the
> way you would expect. AFAIK on many CPUs extra barriers are required
> to keep WC stores ordered, the same way ARM already has extra barriers
> to keep UC stores ordered with locking..
>
> The spinlocks are defined to contain UC stores though.
>
> If there is no actual need today for WC I would suggest using UC as
> the default.
Ok, that sounds sensible. I'll do that in the next revision.
Thanks,
Logan
On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> >
> >
> > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > >> Instead of this, this series proposes a change to arch_add_memory()
> > >> to take the pgprot required by the mapping which allows us to
> > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > >
> > > Is there a particular reason why WC was selected here? I thought for
> > > the p2pdma cases there was no kernel user that touched the memory?
> >
> > Yes, that's correct. I choose WC here because the existing users are
> > registering memory blocks without side effects which fit the WC
> > semantics well.
>
> Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
> Linux, so while it is true the memory has no side effects, there would
> be surprising concurrency risks if anything in the kernel tried to
> write to it.
>
> Not compatible means the locks don't contain stores to WC memory the
> way you would expect. AFAIK on many CPUs extra barriers are required
> to keep WC stores ordered, the same way ARM already has extra barriers
> to keep UC stores ordered with locking..
>
> The spinlocks are defined to contain UC stores though.
How are spinlocks and mutexes getting into p2pdma ranges in the first
instance? Even with UC, the system has bigger problems if it's trying
to send bus locks targeting PCI, see the flurry of activity of trying
to trigger faults on split locks [1].
This does raise a question about separating the cacheability of the
'struct page' memmap from the BAR range. You get this for free if the
memmap is dynamically allocated from "System RAM", but perhaps
memremap_pages() should explicitly prevent altmap configurations that
try to place the map in PCI space?
> If there is no actual need today for WC I would suggest using UC as
> the default.
That's reasonable, but it still seems to be making a broken
configuration marginally less broken. I'd be more interested in
safeguards that prevent p2pdma mappings from being used for any cpu
atomic cycles.
[1]: https://lwn.net/Articles/784864/
On Thu, Feb 27, 2020 at 09:55:04AM -0800, Dan Williams wrote:
> On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> > >
> > >
> > > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > > >> Instead of this, this series proposes a change to arch_add_memory()
> > > >> to take the pgprot required by the mapping which allows us to
> > > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > > >
> > > > Is there a particular reason why WC was selected here? I thought for
> > > > the p2pdma cases there was no kernel user that touched the memory?
> > >
> > > Yes, that's correct. I choose WC here because the existing users are
> > > registering memory blocks without side effects which fit the WC
> > > semantics well.
> >
> > Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
> > Linux, so while it is true the memory has no side effects, there would
> > be surprising concurrency risks if anything in the kernel tried to
> > write to it.
> >
> > Not compatible means the locks don't contain stores to WC memory the
> > way you would expect. AFAIK on many CPUs extra barriers are required
> > to keep WC stores ordered, the same way ARM already has extra barriers
> > to keep UC stores ordered with locking..
> >
> > The spinlocks are defined to contain UC stores though.
>
> How are spinlocks and mutexes getting into p2pdma ranges in the first
> instance? Even with UC, the system has bigger problems if it's trying
> to send bus locks targeting PCI, see the flurry of activity of trying
> to trigger faults on split locks [1].
This is not what I was trying to explain.
Consider
static spinlock lock; // CPU DRAM
static idx = 0;
u64 *wc_memory = [..];
spin_lock(&lock);
wc_memory[0] = idx++;
spin_unlock(&lock);
You'd expect that the PCI device will observe stores where idx is
strictly increasing, but this is not guarenteed. idx may decrease, idx
may skip. It just won't duplicate.
Or perhaps
wc_memory[0] = foo;
writel(doorbell)
foo is not guarenteed observable by the device before doorbell reaches
the device.
All of these are things that do not happen with UC or NC memory, and
are surprising violations of our programming model.
Generic kernel code should never touch WC memory unless the code is
specifically designed to handle it.
Jason
On Thu, Feb 27, 2020 at 10:03 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Thu, Feb 27, 2020 at 09:55:04AM -0800, Dan Williams wrote:
> > On Thu, Feb 27, 2020 at 9:43 AM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Thu, Feb 27, 2020 at 10:21:50AM -0700, Logan Gunthorpe wrote:
> > > >
> > > >
> > > > On 2020-02-27 10:17 a.m., Jason Gunthorpe wrote:
> > > > >> Instead of this, this series proposes a change to arch_add_memory()
> > > > >> to take the pgprot required by the mapping which allows us to
> > > > >> explicitly set pagetable entries for P2PDMA memory to WC.
> > > > >
> > > > > Is there a particular reason why WC was selected here? I thought for
> > > > > the p2pdma cases there was no kernel user that touched the memory?
> > > >
> > > > Yes, that's correct. I choose WC here because the existing users are
> > > > registering memory blocks without side effects which fit the WC
> > > > semantics well.
> > >
> > > Hm, AFAIK WC memory is not compatible with the spinlocks/mutexs/etc in
> > > Linux, so while it is true the memory has no side effects, there would
> > > be surprising concurrency risks if anything in the kernel tried to
> > > write to it.
> > >
> > > Not compatible means the locks don't contain stores to WC memory the
> > > way you would expect. AFAIK on many CPUs extra barriers are required
> > > to keep WC stores ordered, the same way ARM already has extra barriers
> > > to keep UC stores ordered with locking..
> > >
> > > The spinlocks are defined to contain UC stores though.
> >
> > How are spinlocks and mutexes getting into p2pdma ranges in the first
> > instance? Even with UC, the system has bigger problems if it's trying
> > to send bus locks targeting PCI, see the flurry of activity of trying
> > to trigger faults on split locks [1].
>
> This is not what I was trying to explain.
>
> Consider
>
> static spinlock lock; // CPU DRAM
> static idx = 0;
> u64 *wc_memory = [..];
>
> spin_lock(&lock);
> wc_memory[0] = idx++;
> spin_unlock(&lock);
>
> You'd expect that the PCI device will observe stores where idx is
> strictly increasing, but this is not guarenteed. idx may decrease, idx
> may skip. It just won't duplicate.
>
> Or perhaps
>
> wc_memory[0] = foo;
> writel(doorbell)
>
> foo is not guarenteed observable by the device before doorbell reaches
> the device.
>
> All of these are things that do not happen with UC or NC memory, and
> are surprising violations of our programming model.
>
> Generic kernel code should never touch WC memory unless the code is
> specifically designed to handle it.
Ah, yes, agree.
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>
> The mhp_restrictions struct really doesn't specify anything resembling
> a restriction anymore so rename it to be mhp_params as it is a list
> of extended parameters.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
Tests ok, and looks good to me:
Reviewed-by: Dan Williams <[email protected]>
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>
> For use in the 32bit arch_add_memory() to set the pgprot type of the
> memory to add.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> arch/x86/include/asm/set_memory.h | 1 +
> arch/x86/mm/pat/set_memory.c | 7 +++++++
> 2 files changed, 8 insertions(+)
>
> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
> index 64c3dce374e5..0aca959cf9a4 100644
> --- a/arch/x86/include/asm/set_memory.h
> +++ b/arch/x86/include/asm/set_memory.h
> @@ -34,6 +34,7 @@
> * The caller is required to take care of these.
> */
>
> +int _set_memory_prot(unsigned long addr, int numpages, pgprot_t prot);
I wonder if this should be separated from the naming convention of the
other routines because this is only an internal helper for code paths
where the prot was established by an upper layer. For example, I
expect that the kernel does not want new usages to make the mistake of
calling:
_set_memory_prot(..., pgprot_writecombine(pgprot))
...instead of
_set_memory_wc()
I'm thinking just a double underscore rename (__set_memory_prot) and a
kerneldoc comment for that pointing people to use the direct
_set_memory_<cachemode> helpers.
With that you can add:
Reviewed-by: Dan Williams <[email protected]>
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>
> In prepartion to support a pgprot_t argument for arch_add_memory().
>
> It's required to move the prototype of init_memory_mapping() seeing
> the original location came before the definition of pgprot_t.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
Looks good, checked for argument confusion, passes the nvdimm unit tests.
Reviewed-by: Dan Williams <[email protected]>
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>
> devm_memremap_pages() is currently used by the PCI P2PDMA code to create
> struct page mappings for IO memory. At present, these mappings are created
> with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
> x86, an mtrr register will typically override this and force the cache
> type to be UC-. In the case firmware doesn't set this register it is
> effectively WB and will typically result in a machine check exception
> when it's accessed.
>
> Other arches are not currently likely to function correctly seeing they
> don't have any MTRR registers to fall back on.
>
> To solve this, provide a way to specify the pgprot value explicitly to
> arch_add_memory().
>
> Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
> change to pass the pgprot_t down to their respective functions which set
> up the page tables. For x86_32, set the page tables explicitly using
> _set_memory_prot() (seeing they are already mapped). For ia64, s390 and
> sh, reject anything but PAGE_KERNEL settings -- this should be fine,
> for now, seeing these architectures don't support ZONE_DEVICE.
>
> A check in __add_pages() is also added to ensure the pgprot parameter was
> set for all arches.
>
> Cc: Dan Williams <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> Acked-by: David Hildenbrand <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 3 ++-
> arch/ia64/mm/init.c | 3 +++
> arch/powerpc/mm/mem.c | 3 ++-
> arch/s390/mm/init.c | 3 +++
> arch/sh/mm/init.c | 3 +++
> arch/x86/mm/init_32.c | 5 +++++
> arch/x86/mm/init_64.c | 2 +-
> include/linux/memory_hotplug.h | 2 ++
> mm/memory_hotplug.c | 5 ++++-
> mm/memremap.c | 6 +++---
> 10 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index ee37bca8aba8..ea3fa844a8a2 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>
> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
> - size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
> + size, params->pgprot, __pgd_pgtable_alloc,
> + flags);
>
> memblock_clear_nomap(start, size);
>
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index 97bbc23ea1e3..d637b4ea3147 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long nr_pages = size >> PAGE_SHIFT;
> int ret;
>
> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> + return -EINVAL;
> +
> ret = __add_pages(nid, start_pfn, nr_pages, params);
> if (ret)
> printk("%s: Problem encountered in __add_pages() as ret=%d\n",
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index 19b1da5d7eca..832412bc7fad 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> resize_hpt_for_hotplug(memblock_phys_mem_size());
>
> start = (unsigned long)__va(start);
> - rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
> + rc = create_section_mapping(start, start + size, nid,
> + params->pgprot);
> if (rc) {
> pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
> start, start + size, rc);
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index e9e4a7abd0cc..87b2d024e75a 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -277,6 +277,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> if (WARN_ON_ONCE(params->altmap))
> return -EINVAL;
>
> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> + return -EINVAL;
> +
> rc = vmem_add_mapping(start, size);
> if (rc)
> return rc;
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index e5114c053364..b9de2d4fa57e 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -412,6 +412,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long nr_pages = size >> PAGE_SHIFT;
> int ret;
>
> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot)
> + return -EINVAL;
> +
> /* We only have ZONE_NORMAL, so this is easy.. */
> ret = __add_pages(nid, start_pfn, nr_pages, params);
> if (unlikely(ret))
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index e25a4218e6ff..96d8e4fb1cc8 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -858,6 +858,11 @@ int arch_add_memory(int nid, u64 start, u64 size,
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
> + int ret;
> +
> + ret = _set_memory_prot(start, nr_pages, params->pgprot);
Perhaps a comment since it's not immediately obvious where the
PAGE_KERNEL prot was established, and perhaps add a conditional to
skip this call in the param->pgprot == PAGE_KERNEL case?
Other than that looks good to me, but only an ack since I'm only
testing the x86 changes.
Acked-by: Dan Williams <[email protected]>
On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>
> PCI BAR IO memory should never be mapped as WB, however prior to this
> the PAT bits were set WB and it was typically overridden by MTRR
> registers set by the firmware.
>
> Set PCI P2PDMA memory to be WC (writecombining) as the only current
> user (the NVMe CMB) was originally mapped WC before the P2PDMA code
> replaced the mapping with devm_memremap_pages().
Will the change to UC regress this existing use case?
>
> Future use-cases may need to generalize this by adding flags to
> select the caching type, as some P2PDMA cases will not want WC.
> However, those use-cases are not upstream yet and this can be changed
> when they arrive.
>
> Cc: Dan Williams <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
> ---
> mm/memremap.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 06742372a203..8d141c3e3364 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -190,7 +190,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
> }
> break;
> case MEMORY_DEVICE_DEVDAX:
> + need_devmap_managed = false;
> + break;
> case MEMORY_DEVICE_PCI_P2PDMA:
> + params.pgprot = pgprot_writecombine(params.pgprot);
Approach looks good to me, modulo Jason's comment that this should be
UC. Upcoming DAX changes will want to pass this via pgmap, but as you
say this can wait for this changes to arrive.
After change to UC:
Reviewed-by: Dan Williams <[email protected]>
On 2020-02-29 3:33 p.m., Dan Williams wrote:
> On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>>
>> For use in the 32bit arch_add_memory() to set the pgprot type of the
>> memory to add.
>>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: "H. Peter Anvin" <[email protected]>
>> Cc: [email protected]
>> Cc: Dave Hansen <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> ---
>> arch/x86/include/asm/set_memory.h | 1 +
>> arch/x86/mm/pat/set_memory.c | 7 +++++++
>> 2 files changed, 8 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
>> index 64c3dce374e5..0aca959cf9a4 100644
>> --- a/arch/x86/include/asm/set_memory.h
>> +++ b/arch/x86/include/asm/set_memory.h
>> @@ -34,6 +34,7 @@
>> * The caller is required to take care of these.
>> */
>>
>> +int _set_memory_prot(unsigned long addr, int numpages, pgprot_t prot);
>
> I wonder if this should be separated from the naming convention of the
> other routines because this is only an internal helper for code paths
> where the prot was established by an upper layer. For example, I
> expect that the kernel does not want new usages to make the mistake of
> calling:
>
> _set_memory_prot(..., pgprot_writecombine(pgprot))
>
> ...instead of
>
> _set_memory_wc()
>
> I'm thinking just a double underscore rename (__set_memory_prot) and a
> kerneldoc comment for that pointing people to use the direct
> _set_memory_<cachemode> helpers.
Thanks! Will do. Note, though, that even _set_memory_wc() is an internal
x86-specific function. But the extra comment and underscore still make
sense.
> With that you can add:
>
> Reviewed-by: Dan Williams <[email protected]>
>
On 2020-02-29 3:44 p.m., Dan Williams wrote:
> On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>>
>> devm_memremap_pages() is currently used by the PCI P2PDMA code to create
>> struct page mappings for IO memory. At present, these mappings are created
>> with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
>> x86, an mtrr register will typically override this and force the cache
>> type to be UC-. In the case firmware doesn't set this register it is
>> effectively WB and will typically result in a machine check exception
>> when it's accessed.
>>
>> Other arches are not currently likely to function correctly seeing they
>> don't have any MTRR registers to fall back on.
>>
>> To solve this, provide a way to specify the pgprot value explicitly to
>> arch_add_memory().
>>
>> Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
>> change to pass the pgprot_t down to their respective functions which set
>> up the page tables. For x86_32, set the page tables explicitly using
>> _set_memory_prot() (seeing they are already mapped). For ia64, s390 and
>> sh, reject anything but PAGE_KERNEL settings -- this should be fine,
>> for now, seeing these architectures don't support ZONE_DEVICE.
>>
>> A check in __add_pages() is also added to ensure the pgprot parameter was
>> set for all arches.
>>
>> Cc: Dan Williams <[email protected]>
>> Signed-off-by: Logan Gunthorpe <[email protected]>
>> Acked-by: David Hildenbrand <[email protected]>
>> Acked-by: Michal Hocko <[email protected]>
>> ---
>> arch/arm64/mm/mmu.c | 3 ++-
>> arch/ia64/mm/init.c | 3 +++
>> arch/powerpc/mm/mem.c | 3 ++-
>> arch/s390/mm/init.c | 3 +++
>> arch/sh/mm/init.c | 3 +++
>> arch/x86/mm/init_32.c | 5 +++++
>> arch/x86/mm/init_64.c | 2 +-
>> include/linux/memory_hotplug.h | 2 ++
>> mm/memory_hotplug.c | 5 ++++-
>> mm/memremap.c | 6 +++---
>> 10 files changed, 28 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index ee37bca8aba8..ea3fa844a8a2 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
>> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>
>> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
>> - size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
>> + size, params->pgprot, __pgd_pgtable_alloc,
>> + flags);
>>
>> memblock_clear_nomap(start, size);
>>
>> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
>> index 97bbc23ea1e3..d637b4ea3147 100644
>> --- a/arch/ia64/mm/init.c
>> +++ b/arch/ia64/mm/init.c
>> @@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
>> unsigned long nr_pages = size >> PAGE_SHIFT;
>> int ret;
>>
>> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
>> + return -EINVAL;
>> +
>> ret = __add_pages(nid, start_pfn, nr_pages, params);
>> if (ret)
>> printk("%s: Problem encountered in __add_pages() as ret=%d\n",
>> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
>> index 19b1da5d7eca..832412bc7fad 100644
>> --- a/arch/powerpc/mm/mem.c
>> +++ b/arch/powerpc/mm/mem.c
>> @@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
>> resize_hpt_for_hotplug(memblock_phys_mem_size());
>>
>> start = (unsigned long)__va(start);
>> - rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
>> + rc = create_section_mapping(start, start + size, nid,
>> + params->pgprot);
>> if (rc) {
>> pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
>> start, start + size, rc);
>> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
>> index e9e4a7abd0cc..87b2d024e75a 100644
>> --- a/arch/s390/mm/init.c
>> +++ b/arch/s390/mm/init.c
>> @@ -277,6 +277,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
>> if (WARN_ON_ONCE(params->altmap))
>> return -EINVAL;
>>
>> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
>> + return -EINVAL;
>> +
>> rc = vmem_add_mapping(start, size);
>> if (rc)
>> return rc;
>> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
>> index e5114c053364..b9de2d4fa57e 100644
>> --- a/arch/sh/mm/init.c
>> +++ b/arch/sh/mm/init.c
>> @@ -412,6 +412,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
>> unsigned long nr_pages = size >> PAGE_SHIFT;
>> int ret;
>>
>> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot)
>> + return -EINVAL;
>> +
>> /* We only have ZONE_NORMAL, so this is easy.. */
>> ret = __add_pages(nid, start_pfn, nr_pages, params);
>> if (unlikely(ret))
>> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
>> index e25a4218e6ff..96d8e4fb1cc8 100644
>> --- a/arch/x86/mm/init_32.c
>> +++ b/arch/x86/mm/init_32.c
>> @@ -858,6 +858,11 @@ int arch_add_memory(int nid, u64 start, u64 size,
>> {
>> unsigned long start_pfn = start >> PAGE_SHIFT;
>> unsigned long nr_pages = size >> PAGE_SHIFT;
>> + int ret;
>> +
>> + ret = _set_memory_prot(start, nr_pages, params->pgprot);
>
> Perhaps a comment since it's not immediately obvious where the
> PAGE_KERNEL prot was established, and perhaps add a conditional to
> skip this call in the param->pgprot == PAGE_KERNEL case?
Yes I can add the skip in the PAGE_KERNEL case. Though I'm not sure what
you are asking for with regards to the comment. Just that pgprot is set
by the caller usually to PAGE_KERNEL?
> Other than that looks good to me, but only an ack since I'm only
> testing the x86 changes.
>
> Acked-by: Dan Williams <[email protected]>
Thanks,
Logan
On Mon, Mar 2, 2020 at 10:55 AM Logan Gunthorpe <[email protected]> wrote:
>
>
>
> On 2020-02-29 3:44 p.m., Dan Williams wrote:
> > On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
> >>
> >> devm_memremap_pages() is currently used by the PCI P2PDMA code to create
> >> struct page mappings for IO memory. At present, these mappings are created
> >> with PAGE_KERNEL which implies setting the PAT bits to be WB. However, on
> >> x86, an mtrr register will typically override this and force the cache
> >> type to be UC-. In the case firmware doesn't set this register it is
> >> effectively WB and will typically result in a machine check exception
> >> when it's accessed.
> >>
> >> Other arches are not currently likely to function correctly seeing they
> >> don't have any MTRR registers to fall back on.
> >>
> >> To solve this, provide a way to specify the pgprot value explicitly to
> >> arch_add_memory().
> >>
> >> Of the arches that support MEMORY_HOTPLUG: x86_64, and arm64 need a simple
> >> change to pass the pgprot_t down to their respective functions which set
> >> up the page tables. For x86_32, set the page tables explicitly using
> >> _set_memory_prot() (seeing they are already mapped). For ia64, s390 and
> >> sh, reject anything but PAGE_KERNEL settings -- this should be fine,
> >> for now, seeing these architectures don't support ZONE_DEVICE.
> >>
> >> A check in __add_pages() is also added to ensure the pgprot parameter was
> >> set for all arches.
> >>
> >> Cc: Dan Williams <[email protected]>
> >> Signed-off-by: Logan Gunthorpe <[email protected]>
> >> Acked-by: David Hildenbrand <[email protected]>
> >> Acked-by: Michal Hocko <[email protected]>
> >> ---
> >> arch/arm64/mm/mmu.c | 3 ++-
> >> arch/ia64/mm/init.c | 3 +++
> >> arch/powerpc/mm/mem.c | 3 ++-
> >> arch/s390/mm/init.c | 3 +++
> >> arch/sh/mm/init.c | 3 +++
> >> arch/x86/mm/init_32.c | 5 +++++
> >> arch/x86/mm/init_64.c | 2 +-
> >> include/linux/memory_hotplug.h | 2 ++
> >> mm/memory_hotplug.c | 5 ++++-
> >> mm/memremap.c | 6 +++---
> >> 10 files changed, 28 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> >> index ee37bca8aba8..ea3fa844a8a2 100644
> >> --- a/arch/arm64/mm/mmu.c
> >> +++ b/arch/arm64/mm/mmu.c
> >> @@ -1058,7 +1058,8 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
> >>
> >> __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
> >> - size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
> >> + size, params->pgprot, __pgd_pgtable_alloc,
> >> + flags);
> >>
> >> memblock_clear_nomap(start, size);
> >>
> >> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> >> index 97bbc23ea1e3..d637b4ea3147 100644
> >> --- a/arch/ia64/mm/init.c
> >> +++ b/arch/ia64/mm/init.c
> >> @@ -676,6 +676,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> unsigned long nr_pages = size >> PAGE_SHIFT;
> >> int ret;
> >>
> >> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> >> + return -EINVAL;
> >> +
> >> ret = __add_pages(nid, start_pfn, nr_pages, params);
> >> if (ret)
> >> printk("%s: Problem encountered in __add_pages() as ret=%d\n",
> >> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> >> index 19b1da5d7eca..832412bc7fad 100644
> >> --- a/arch/powerpc/mm/mem.c
> >> +++ b/arch/powerpc/mm/mem.c
> >> @@ -138,7 +138,8 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> >> resize_hpt_for_hotplug(memblock_phys_mem_size());
> >>
> >> start = (unsigned long)__va(start);
> >> - rc = create_section_mapping(start, start + size, nid, PAGE_KERNEL);
> >> + rc = create_section_mapping(start, start + size, nid,
> >> + params->pgprot);
> >> if (rc) {
> >> pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx: %d\n",
> >> start, start + size, rc);
> >> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> >> index e9e4a7abd0cc..87b2d024e75a 100644
> >> --- a/arch/s390/mm/init.c
> >> +++ b/arch/s390/mm/init.c
> >> @@ -277,6 +277,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> if (WARN_ON_ONCE(params->altmap))
> >> return -EINVAL;
> >>
> >> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot))
> >> + return -EINVAL;
> >> +
> >> rc = vmem_add_mapping(start, size);
> >> if (rc)
> >> return rc;
> >> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> >> index e5114c053364..b9de2d4fa57e 100644
> >> --- a/arch/sh/mm/init.c
> >> +++ b/arch/sh/mm/init.c
> >> @@ -412,6 +412,9 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> unsigned long nr_pages = size >> PAGE_SHIFT;
> >> int ret;
> >>
> >> + if (WARN_ON_ONCE(params->pgprot.pgprot != PAGE_KERNEL.pgprot)
> >> + return -EINVAL;
> >> +
> >> /* We only have ZONE_NORMAL, so this is easy.. */
> >> ret = __add_pages(nid, start_pfn, nr_pages, params);
> >> if (unlikely(ret))
> >> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> >> index e25a4218e6ff..96d8e4fb1cc8 100644
> >> --- a/arch/x86/mm/init_32.c
> >> +++ b/arch/x86/mm/init_32.c
> >> @@ -858,6 +858,11 @@ int arch_add_memory(int nid, u64 start, u64 size,
> >> {
> >> unsigned long start_pfn = start >> PAGE_SHIFT;
> >> unsigned long nr_pages = size >> PAGE_SHIFT;
> >> + int ret;
> >> +
> >> + ret = _set_memory_prot(start, nr_pages, params->pgprot);
> >
> > Perhaps a comment since it's not immediately obvious where the
> > PAGE_KERNEL prot was established, and perhaps add a conditional to
> > skip this call in the param->pgprot == PAGE_KERNEL case?
>
> Yes I can add the skip in the PAGE_KERNEL case. Though I'm not sure what
> you are asking for with regards to the comment. Just that pgprot is set
> by the caller usually to PAGE_KERNEL?
No, I'm reacting to this comment in the changelog "For x86_32, set the
page tables explicitly using _set_memory_prot() (seeing they are
already mapped)". You've done some investigation that
x86_32::arch_add_memory() expects the page tables to be already
established. I think that's worth capturing inline in the code for
other people doing cross-arch arch_add_memory() changes.
On 2020-02-29 3:47 p.m., Dan Williams wrote:
> On Fri, Feb 21, 2020 at 10:25 AM Logan Gunthorpe <[email protected]> wrote:
>>
>> PCI BAR IO memory should never be mapped as WB, however prior to this
>> the PAT bits were set WB and it was typically overridden by MTRR
>> registers set by the firmware.
>>
>> Set PCI P2PDMA memory to be WC (writecombining) as the only current
>> user (the NVMe CMB) was originally mapped WC before the P2PDMA code
>> replaced the mapping with devm_memremap_pages().
>
> Will the change to UC regress this existing use case?
I don't think so. They've been essentially mapped UC for a long time now
(since the P2PDMA patch set was merged) and nobody has complained.
On Fri 21-02-20 11:24:58, Logan Gunthorpe wrote:
> The mhp_restrictions struct really doesn't specify anything resembling
> a restriction anymore so rename it to be mhp_params as it is a list
> of extended parameters.
>
> Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> arch/arm64/mm/mmu.c | 4 ++--
> arch/ia64/mm/init.c | 4 ++--
> arch/powerpc/mm/mem.c | 4 ++--
> arch/s390/mm/init.c | 6 +++---
> arch/sh/mm/init.c | 4 ++--
> arch/x86/mm/init_32.c | 4 ++--
> arch/x86/mm/init_64.c | 8 ++++----
> include/linux/memory_hotplug.h | 16 ++++++++--------
> mm/memory_hotplug.c | 8 ++++----
> mm/memremap.c | 8 ++++----
> 10 files changed, 33 insertions(+), 33 deletions(-)
>
> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
> index 128f70852bf3..ee37bca8aba8 100644
> --- a/arch/arm64/mm/mmu.c
> +++ b/arch/arm64/mm/mmu.c
> @@ -1050,7 +1050,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> int flags = 0;
>
> @@ -1063,7 +1063,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
> memblock_clear_nomap(start, size);
>
> return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
> - restrictions);
> + params);
> }
> void arch_remove_memory(int nid, u64 start, u64 size,
> struct vmem_altmap *altmap)
> diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
> index b01d68a2d5d9..97bbc23ea1e3 100644
> --- a/arch/ia64/mm/init.c
> +++ b/arch/ia64/mm/init.c
> @@ -670,13 +670,13 @@ mem_init (void)
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
> int ret;
>
> - ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
> + ret = __add_pages(nid, start_pfn, nr_pages, params);
> if (ret)
> printk("%s: Problem encountered in __add_pages() as ret=%d\n",
> __func__, ret);
> diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> index ef7b1119b2e2..b4bece53bec0 100644
> --- a/arch/powerpc/mm/mem.c
> +++ b/arch/powerpc/mm/mem.c
> @@ -128,7 +128,7 @@ static void flush_dcache_range_chunked(unsigned long start, unsigned long stop,
> }
>
> int __ref arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
> @@ -144,7 +144,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size,
> return -EFAULT;
> }
>
> - return __add_pages(nid, start_pfn, nr_pages, restrictions);
> + return __add_pages(nid, start_pfn, nr_pages, params);
> }
>
> void __ref arch_remove_memory(int nid, u64 start, u64 size,
> diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
> index ac44bd76db4b..e9e4a7abd0cc 100644
> --- a/arch/s390/mm/init.c
> +++ b/arch/s390/mm/init.c
> @@ -268,20 +268,20 @@ device_initcall(s390_cma_mem_init);
> #endif /* CONFIG_CMA */
>
> int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> unsigned long start_pfn = PFN_DOWN(start);
> unsigned long size_pages = PFN_DOWN(size);
> int rc;
>
> - if (WARN_ON_ONCE(restrictions->altmap))
> + if (WARN_ON_ONCE(params->altmap))
> return -EINVAL;
>
> rc = vmem_add_mapping(start, size);
> if (rc)
> return rc;
>
> - rc = __add_pages(nid, start_pfn, size_pages, restrictions);
> + rc = __add_pages(nid, start_pfn, size_pages, params);
> if (rc)
> vmem_remove_mapping(start, size);
> return rc;
> diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
> index d1b1ff2be17a..e5114c053364 100644
> --- a/arch/sh/mm/init.c
> +++ b/arch/sh/mm/init.c
> @@ -406,14 +406,14 @@ void __init mem_init(void)
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> unsigned long start_pfn = PFN_DOWN(start);
> unsigned long nr_pages = size >> PAGE_SHIFT;
> int ret;
>
> /* We only have ZONE_NORMAL, so this is easy.. */
> - ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
> + ret = __add_pages(nid, start_pfn, nr_pages, params);
> if (unlikely(ret))
> printk("%s: Failed, __add_pages() == %d\n", __func__, ret);
>
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 23df4885bbed..3ec3dac7c268 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -853,12 +853,12 @@ void __init mem_init(void)
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> - return __add_pages(nid, start_pfn, nr_pages, restrictions);
> + return __add_pages(nid, start_pfn, nr_pages, params);
> }
>
> void arch_remove_memory(int nid, u64 start, u64 size,
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index abbdecb75fad..87977a8bfbbf 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -844,11 +844,11 @@ static void update_end_of_memory_vars(u64 start, u64 size)
> }
>
> int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> int ret;
>
> - ret = __add_pages(nid, start_pfn, nr_pages, restrictions);
> + ret = __add_pages(nid, start_pfn, nr_pages, params);
> WARN_ON_ONCE(ret);
>
> /* update max_pfn, max_low_pfn and high_memory */
> @@ -859,14 +859,14 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> }
>
> int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> init_memory_mapping(start, start + size);
>
> - return add_pages(nid, start_pfn, nr_pages, restrictions);
> + return add_pages(nid, start_pfn, nr_pages, params);
> }
>
> #define PAGE_INUSE 0xFD
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 69ff3037528d..c5df1b3dada0 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -54,10 +54,10 @@ enum {
> };
>
> /*
> - * Restrictions for the memory hotplug:
> - * altmap: alternative allocator for memmap array
> + * Extended parameters for memory hotplug:
> + * altmap: alternative allocator for memmap array (optional)
> */
> -struct mhp_restrictions {
> +struct mhp_params {
> struct vmem_altmap *altmap;
> };
>
> @@ -108,7 +108,7 @@ extern int restore_online_page_callback(online_page_callback_t callback);
> extern int try_online_node(int nid);
>
> extern int arch_add_memory(int nid, u64 start, u64 size,
> - struct mhp_restrictions *restrictions);
> + struct mhp_params *params);
> extern u64 max_mem_size;
>
> extern bool memhp_auto_online;
> @@ -126,17 +126,17 @@ extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
>
> /* reasonably generic interface to expand the physical pages */
> extern int __add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> - struct mhp_restrictions *restrictions);
> + struct mhp_params *params);
>
> #ifndef CONFIG_ARCH_HAS_ADD_PAGES
> static inline int add_pages(int nid, unsigned long start_pfn,
> - unsigned long nr_pages, struct mhp_restrictions *restrictions)
> + unsigned long nr_pages, struct mhp_params *params)
> {
> - return __add_pages(nid, start_pfn, nr_pages, restrictions);
> + return __add_pages(nid, start_pfn, nr_pages, params);
> }
> #else /* ARCH_HAS_ADD_PAGES */
> int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> - struct mhp_restrictions *restrictions);
> + struct mhp_params *params);
> #endif /* ARCH_HAS_ADD_PAGES */
>
> #ifdef CONFIG_NUMA
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 0a54ffac8c68..c69469e1b40e 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -299,11 +299,11 @@ static int check_hotplug_memory_addressable(unsigned long pfn,
> * add the new pages.
> */
> int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages,
> - struct mhp_restrictions *restrictions)
> + struct mhp_params *params)
> {
> int err;
> unsigned long nr, start_sec, end_sec;
> - struct vmem_altmap *altmap = restrictions->altmap;
> + struct vmem_altmap *altmap = params->altmap;
>
> err = check_hotplug_memory_addressable(pfn, nr_pages);
> if (err)
> @@ -993,7 +993,7 @@ static int online_memory_block(struct memory_block *mem, void *arg)
> */
> int __ref add_memory_resource(int nid, struct resource *res)
> {
> - struct mhp_restrictions restrictions = {};
> + struct mhp_params params = {};
> u64 start, size;
> bool new_node = false;
> int ret;
> @@ -1021,7 +1021,7 @@ int __ref add_memory_resource(int nid, struct resource *res)
> new_node = ret;
>
> /* call arch's memory hotadd */
> - ret = arch_add_memory(nid, start, size, &restrictions);
> + ret = arch_add_memory(nid, start, size, ¶ms);
> if (ret < 0)
> goto error;
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 09b5b7adc773..6891a503a078 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -161,7 +161,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
> {
> struct resource *res = &pgmap->res;
> struct dev_pagemap *conflict_pgmap;
> - struct mhp_restrictions restrictions = {
> + struct mhp_params params = {
> /*
> * We do not want any optional features only our own memmap
> */
> @@ -275,7 +275,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
> */
> if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
> error = add_pages(nid, PHYS_PFN(res->start),
> - PHYS_PFN(resource_size(res)), &restrictions);
> + PHYS_PFN(resource_size(res)), ¶ms);
> } else {
> error = kasan_add_zero_shadow(__va(res->start), resource_size(res));
> if (error) {
> @@ -284,7 +284,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
> }
>
> error = arch_add_memory(nid, res->start, resource_size(res),
> - &restrictions);
> + ¶ms);
> }
>
> if (!error) {
> @@ -292,7 +292,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>
> zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
> move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
> - PHYS_PFN(resource_size(res)), restrictions.altmap);
> + PHYS_PFN(resource_size(res)), params.altmap);
> }
>
> mem_hotplug_done();
> --
> 2.20.1
--
Michal Hocko
SUSE Labs
On Fri 21-02-20 11:24:59, Logan Gunthorpe wrote:
> In prepartion to support a pgprot_t argument for arch_add_memory().
>
> It's required to move the prototype of init_memory_mapping() seeing
> the original location came before the definition of pgprot_t.
>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: Dave Hansen <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Logan Gunthorpe <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> arch/x86/include/asm/page_types.h | 3 ---
> arch/x86/include/asm/pgtable.h | 3 +++
> arch/x86/kernel/amd_gart_64.c | 3 ++-
> arch/x86/mm/init.c | 9 +++++----
> arch/x86/mm/init_32.c | 3 ++-
> arch/x86/mm/init_64.c | 32 +++++++++++++++++--------------
> arch/x86/mm/mm_internal.h | 3 ++-
> arch/x86/platform/uv/bios_uv.c | 3 ++-
> 8 files changed, 34 insertions(+), 25 deletions(-)
>
> diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
> index c85e15010f48..bf7aa2e290ef 100644
> --- a/arch/x86/include/asm/page_types.h
> +++ b/arch/x86/include/asm/page_types.h
> @@ -73,9 +73,6 @@ static inline phys_addr_t get_max_mapped(void)
>
> bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
>
> -extern unsigned long init_memory_mapping(unsigned long start,
> - unsigned long end);
> -
> extern void initmem_init(void);
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 7e118660bbd9..48d6a5960f28 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1046,6 +1046,9 @@ static inline void __meminit init_trampoline_default(void)
>
> void __init poking_init(void);
>
> +unsigned long init_memory_mapping(unsigned long start,
> + unsigned long end, pgprot_t prot);
> +
> # ifdef CONFIG_RANDOMIZE_MEMORY
> void __meminit init_trampoline(void);
> # else
> diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
> index 4e5f50236048..16133819415c 100644
> --- a/arch/x86/kernel/amd_gart_64.c
> +++ b/arch/x86/kernel/amd_gart_64.c
> @@ -744,7 +744,8 @@ int __init gart_iommu_init(void)
>
> start_pfn = PFN_DOWN(aper_base);
> if (!pfn_range_is_mapped(start_pfn, end_pfn))
> - init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT);
> + init_memory_mapping(start_pfn<<PAGE_SHIFT, end_pfn<<PAGE_SHIFT,
> + PAGE_KERNEL);
>
> pr_info("PCI-DMA: using GART IOMMU.\n");
> iommu_size = check_iommu_size(info.aper_base, aper_size);
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index e7bb483557c9..1bba16c5742b 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -467,7 +467,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
> * the physical memory. To access them they are temporarily mapped.
> */
> unsigned long __ref init_memory_mapping(unsigned long start,
> - unsigned long end)
> + unsigned long end, pgprot_t prot)
> {
> struct map_range mr[NR_RANGE_MR];
> unsigned long ret = 0;
> @@ -481,7 +481,8 @@ unsigned long __ref init_memory_mapping(unsigned long start,
>
> for (i = 0; i < nr_range; i++)
> ret = kernel_physical_mapping_init(mr[i].start, mr[i].end,
> - mr[i].page_size_mask);
> + mr[i].page_size_mask,
> + prot);
>
> add_pfn_range_mapped(start >> PAGE_SHIFT, ret >> PAGE_SHIFT);
>
> @@ -521,7 +522,7 @@ static unsigned long __init init_range_memory_mapping(
> */
> can_use_brk_pgt = max(start, (u64)pgt_buf_end<<PAGE_SHIFT) >=
> min(end, (u64)pgt_buf_top<<PAGE_SHIFT);
> - init_memory_mapping(start, end);
> + init_memory_mapping(start, end, PAGE_KERNEL);
> mapped_ram_size += end - start;
> can_use_brk_pgt = true;
> }
> @@ -661,7 +662,7 @@ void __init init_mem_mapping(void)
> #endif
>
> /* the ISA range is always mapped regardless of memory holes */
> - init_memory_mapping(0, ISA_END_ADDRESS);
> + init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL);
>
> /* Init the trampoline, possibly with KASLR memory offset */
> init_trampoline();
> diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
> index 3ec3dac7c268..e25a4218e6ff 100644
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -253,7 +253,8 @@ static inline int is_kernel_text(unsigned long addr)
> unsigned long __init
> kernel_physical_mapping_init(unsigned long start,
> unsigned long end,
> - unsigned long page_size_mask)
> + unsigned long page_size_mask,
> + pgprot_t prot)
> {
> int use_pse = page_size_mask == (1<<PG_LEVEL_2M);
> unsigned long last_map_addr = end;
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 87977a8bfbbf..9e7692080dda 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -585,7 +585,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned long paddr, unsigned long paddr_end,
> */
> static unsigned long __meminit
> phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
> - unsigned long page_size_mask, bool init)
> + unsigned long page_size_mask, pgprot_t _prot, bool init)
> {
> unsigned long pages = 0, paddr_next;
> unsigned long paddr_last = paddr_end;
> @@ -595,7 +595,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
> for (; i < PTRS_PER_PUD; i++, paddr = paddr_next) {
> pud_t *pud;
> pmd_t *pmd;
> - pgprot_t prot = PAGE_KERNEL;
> + pgprot_t prot = _prot;
>
> vaddr = (unsigned long)__va(paddr);
> pud = pud_page + pud_index(vaddr);
> @@ -644,9 +644,12 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
> if (page_size_mask & (1<<PG_LEVEL_1G)) {
> pages++;
> spin_lock(&init_mm.page_table_lock);
> +
> + prot = __pgprot(pgprot_val(prot) | __PAGE_KERNEL_LARGE);
> +
> set_pte_init((pte_t *)pud,
> pfn_pte((paddr & PUD_MASK) >> PAGE_SHIFT,
> - PAGE_KERNEL_LARGE),
> + prot),
> init);
> spin_unlock(&init_mm.page_table_lock);
> paddr_last = paddr_next;
> @@ -669,7 +672,7 @@ phys_pud_init(pud_t *pud_page, unsigned long paddr, unsigned long paddr_end,
>
> static unsigned long __meminit
> phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
> - unsigned long page_size_mask, bool init)
> + unsigned long page_size_mask, pgprot_t prot, bool init)
> {
> unsigned long vaddr, vaddr_end, vaddr_next, paddr_next, paddr_last;
>
> @@ -679,7 +682,7 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
>
> if (!pgtable_l5_enabled())
> return phys_pud_init((pud_t *) p4d_page, paddr, paddr_end,
> - page_size_mask, init);
> + page_size_mask, prot, init);
>
> for (; vaddr < vaddr_end; vaddr = vaddr_next) {
> p4d_t *p4d = p4d_page + p4d_index(vaddr);
> @@ -702,13 +705,13 @@ phys_p4d_init(p4d_t *p4d_page, unsigned long paddr, unsigned long paddr_end,
> if (!p4d_none(*p4d)) {
> pud = pud_offset(p4d, 0);
> paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
> - page_size_mask, init);
> + page_size_mask, prot, init);
> continue;
> }
>
> pud = alloc_low_page();
> paddr_last = phys_pud_init(pud, paddr, __pa(vaddr_end),
> - page_size_mask, init);
> + page_size_mask, prot, init);
>
> spin_lock(&init_mm.page_table_lock);
> p4d_populate_init(&init_mm, p4d, pud, init);
> @@ -722,7 +725,7 @@ static unsigned long __meminit
> __kernel_physical_mapping_init(unsigned long paddr_start,
> unsigned long paddr_end,
> unsigned long page_size_mask,
> - bool init)
> + pgprot_t prot, bool init)
> {
> bool pgd_changed = false;
> unsigned long vaddr, vaddr_start, vaddr_end, vaddr_next, paddr_last;
> @@ -743,13 +746,13 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
> paddr_last = phys_p4d_init(p4d, __pa(vaddr),
> __pa(vaddr_end),
> page_size_mask,
> - init);
> + prot, init);
> continue;
> }
>
> p4d = alloc_low_page();
> paddr_last = phys_p4d_init(p4d, __pa(vaddr), __pa(vaddr_end),
> - page_size_mask, init);
> + page_size_mask, prot, init);
>
> spin_lock(&init_mm.page_table_lock);
> if (pgtable_l5_enabled())
> @@ -778,10 +781,10 @@ __kernel_physical_mapping_init(unsigned long paddr_start,
> unsigned long __meminit
> kernel_physical_mapping_init(unsigned long paddr_start,
> unsigned long paddr_end,
> - unsigned long page_size_mask)
> + unsigned long page_size_mask, pgprot_t prot)
> {
> return __kernel_physical_mapping_init(paddr_start, paddr_end,
> - page_size_mask, true);
> + page_size_mask, prot, true);
> }
>
> /*
> @@ -796,7 +799,8 @@ kernel_physical_mapping_change(unsigned long paddr_start,
> unsigned long page_size_mask)
> {
> return __kernel_physical_mapping_init(paddr_start, paddr_end,
> - page_size_mask, false);
> + page_size_mask, PAGE_KERNEL,
> + false);
> }
>
> #ifndef CONFIG_NUMA
> @@ -864,7 +868,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
>
> - init_memory_mapping(start, start + size);
> + init_memory_mapping(start, start + size, PAGE_KERNEL);
>
> return add_pages(nid, start_pfn, nr_pages, params);
> }
> diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
> index eeae142062ed..3f37b5c80bb3 100644
> --- a/arch/x86/mm/mm_internal.h
> +++ b/arch/x86/mm/mm_internal.h
> @@ -12,7 +12,8 @@ void early_ioremap_page_table_range_init(void);
>
> unsigned long kernel_physical_mapping_init(unsigned long start,
> unsigned long end,
> - unsigned long page_size_mask);
> + unsigned long page_size_mask,
> + pgprot_t prot);
> unsigned long kernel_physical_mapping_change(unsigned long start,
> unsigned long end,
> unsigned long page_size_mask);
> diff --git a/arch/x86/platform/uv/bios_uv.c b/arch/x86/platform/uv/bios_uv.c
> index 607f58147311..c60255da5a6c 100644
> --- a/arch/x86/platform/uv/bios_uv.c
> +++ b/arch/x86/platform/uv/bios_uv.c
> @@ -352,7 +352,8 @@ void __iomem *__init efi_ioremap(unsigned long phys_addr, unsigned long size,
> if (type == EFI_MEMORY_MAPPED_IO)
> return ioremap(phys_addr, size);
>
> - last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size);
> + last_map_pfn = init_memory_mapping(phys_addr, phys_addr + size,
> + PAGE_KERNEL);
> if ((last_map_pfn << PAGE_SHIFT) < phys_addr + size) {
> unsigned long top = last_map_pfn << PAGE_SHIFT;
> efi_ioremap(top, size - (top - phys_addr), type, attribute);
> --
> 2.20.1
>
--
Michal Hocko
SUSE Labs