2022-08-08 15:01:46

by Aaron Lu

Subject: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible

This is an early RFC. While all reviews are welcome, reviewing this code
now will be a waste of time for the x86 subsystem maintainers. I would,
however, appreciate a preliminary review from the folks on the to and cc
list. I'm posting it to the list in case anyone else is interested in
seeing this early version.

Dave Hansen: I need your ack before this goes to the maintainers.

Here it goes:

On x86_64, Linux has a direct mapping of almost all physical memory. For
performance reasons, this mapping usually uses large pages, 2M or 1G
depending on hardware capability, with read, write and non-execute
protection.

There are cases where some pages have to change their protection to RO
and eXecutable, like pages that host module code or bpf programs. When
the protection of these pages is changed, the corresponding large
mapping covering them first has to be split into 4K mappings, and then
each individual 4K page's protection is changed accordingly, i.e.
unaffected pages keep their original RW and NX protection while the
affected pages' protection is changed to RO and X.

This split causes a problem: the large mapping remains split even after
the affected pages' protection is changed back to RW and NX, e.g. when
the module is unloaded or the bpf programs are freed. After the system
runs for a long time, more and more large mappings end up split, causing
more and more dTLB misses and hurting overall system performance[1].

For this reason, people have tried techniques to reduce the harm of
splitting large mappings, like bpf_prog_pack[2], which packs multiple
bpf programs into a single page instead of allocating and changing one
page's protection for each bpf program. This approach makes large
mapping splits happen much less often.

This patchset addresses the problem in another way: it merges split
mappings back into a large mapping once the protections of all entries
of the split small mapping page table become the same again, e.g. when
the page whose protection was changed to RO+X has its protection changed
back to RW+NX due to module unload, bpf program free, etc. and all
other entries' protection is also RW+NX.
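
To illustrate the core idea outside of the kernel, here is a minimal
user-space C model of the bookkeeping done per split page table: keep a
reference protection value plus a count of entries that match it, and
consider the table mergeable once every entry matches again. All names
here are invented for the sketch; the real implementation is in patch 2
and also rescans and updates the reference value, which this model omits.

/*
 * Minimal user-space model of the per-split-table bookkeeping.
 * Names and types are invented for illustration only.
 */
#include <stdio.h>
#include <stdbool.h>

#define ENTRIES 512              /* like PTRS_PER_PTE on x86_64 */

typedef unsigned long prot_t;    /* stand-in for pgprot_t */

struct split_table {
    prot_t entry[ENTRIES];       /* stand-in for the 512 PTEs */
    prot_t same_prot;            /* reference protection value */
    int nr_same_prot;            /* how many entries match same_prot */
};

/* Right after a split, every entry inherits the large page's protection. */
static void split_init(struct split_table *t, prot_t prot)
{
    for (int i = 0; i < ENTRIES; i++)
        t->entry[i] = prot;
    t->same_prot = prot;
    t->nr_same_prot = ENTRIES;
}

/* Change one entry's protection and adjust the matching-entry count. */
static void set_entry(struct split_table *t, int idx, prot_t new_prot)
{
    prot_t old = t->entry[idx];

    t->entry[idx] = new_prot;
    if (old == t->same_prot && new_prot != t->same_prot)
        t->nr_same_prot--;
    else if (old != t->same_prot && new_prot == t->same_prot)
        t->nr_same_prot++;
}

/* The table can fold back into a large mapping when all entries match. */
static bool can_merge(struct split_table *t)
{
    return t->nr_same_prot == ENTRIES;
}

int main(void)
{
    struct split_table t;

    split_init(&t, 0x1);          /* split with RW+NX, say */
    set_entry(&t, 7, 0x3);        /* one page goes RO+X */
    printf("mergeable after change: %d\n", can_merge(&t));
    set_entry(&t, 7, 0x1);        /* protection restored */
    printf("mergeable after restore: %d\n", can_merge(&t));
    return 0;
}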

One final note: with features like bpf_prog_pack, there should be far
fewer large mapping splits, IIUC; also, this patchset cannot help while
a page whose protection was changed is still in use. So my take on the
large mapping split problem is: to get the most value out of keeping
large mappings intact, features like bpf_prog_pack are important. This
patchset can further reduce large mapping splits when an in-use page
with special protection finally gets released.

[1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
[2]: https://lore.kernel.org/lkml/[email protected]/

Aaron Lu (4):
x86/mm/cpa: restore global bit when page is present
x86/mm/cpa: merge splitted direct mapping when possible
x86/mm/cpa: add merge event counter
x86/mm/cpa: add a test interface to split direct map

arch/x86/mm/pat/set_memory.c | 411 +++++++++++++++++++++++++++++++++-
include/linux/mm_types.h | 6 +
include/linux/page-flags.h | 6 +
include/linux/vm_event_item.h | 2 +
mm/vmstat.c | 2 +
5 files changed, 420 insertions(+), 7 deletions(-)

--
2.37.1


2022-08-08 15:02:07

by Aaron Lu

Subject: [RFC PATCH 3/4] x86/mm/cpa: add merge event counter

Like the existing split event counters, this patch adds counters for
merge events.
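
As a usage note, not part of the patch: once applied, the new counters
appear in /proc/vmstat next to the existing split counters, using the
names added to vmstat_text below. A small user-space sketch for reading
them back (illustrative only) could be:

/* Print the direct map split/merge counters from /proc/vmstat.
 * Assumes a kernel with this patch applied. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[128];
    FILE *f = fopen("/proc/vmstat", "r");

    if (!f) {
        perror("/proc/vmstat");
        return 1;
    }

    while (fgets(line, sizeof(line), f)) {
        if (strstr(line, "direct_map_level"))
            fputs(line, stdout);
    }

    fclose(f);
    return 0;
}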

Signed-off-by: Aaron Lu <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 19 +++++++++++++++++++
include/linux/vm_event_item.h | 2 ++
mm/vmstat.c | 2 ++
3 files changed, 23 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index fea2c70ff37f..1be9aab42c79 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -105,6 +105,23 @@ static void split_page_count(int level)
direct_pages_count[level - 1] += PTRS_PER_PTE;
}

+static void merge_page_count(int level)
+{
+ if (direct_pages_count[level] < PTRS_PER_PTE) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+
+ direct_pages_count[level] -= PTRS_PER_PTE;
+ if (system_state == SYSTEM_RUNNING) {
+ if (level == PG_LEVEL_4K)
+ count_vm_event(DIRECT_MAP_LEVEL1_MERGE);
+ else if (level == PG_LEVEL_2M)
+ count_vm_event(DIRECT_MAP_LEVEL2_MERGE);
+ }
+ direct_pages_count[level + 1]++;
+}
+
void arch_report_meminfo(struct seq_file *m)
{
seq_printf(m, "DirectMap4k: %8lu kB\n",
@@ -875,6 +892,8 @@ static void merge_splitted_mapping(struct page *pgt, int level)

__ClearPageSplitpgt(pgt);
__free_page(pgt);
+
+ merge_page_count(level);
}

static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 404024486fa5..00a9a435af49 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -143,6 +143,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_X86
DIRECT_MAP_LEVEL2_SPLIT,
DIRECT_MAP_LEVEL3_SPLIT,
+ DIRECT_MAP_LEVEL1_MERGE,
+ DIRECT_MAP_LEVEL2_MERGE,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 373d2730fcf2..1a4287a4d614 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1403,6 +1403,8 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_X86
"direct_map_level2_splits",
"direct_map_level3_splits",
+ "direct_map_level1_merges",
+ "direct_map_level2_merges",
#endif
#endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
};
--
2.37.1

2022-08-08 15:02:07

by Aaron Lu

Subject: [RFC PATCH 2/4] x86/mm/cpa: merge splitted direct mapping when possible

On x86_64, Linux has a direct mapping of almost all physical memory. For
performance reasons, this mapping usually uses large pages, 2M or 1G
depending on hardware capability, with read, write and non-execute
protection.

There are cases where some pages have to change their protection to RO
and eXecutable, like pages that host module code or bpf programs. When
the protection of these pages is changed, the corresponding large
mapping covering them first has to be split into 4K mappings, and then
each individual 4K page's protection is changed accordingly, i.e.
unaffected pages keep their original RW and NX protection while the
affected pages' protection is changed to RO and X.

This split causes a problem: the large mapping remains split even after
the affected pages' protection is changed back to RW and NX, e.g. when
the module is unloaded or the bpf programs are freed. After the system
runs for a long time, more and more large mappings end up split, causing
more and more dTLB misses and hurting overall system performance.

This patch tries to restore split large mappings by tracking how many
entries of the split small mapping page table have the same protection
bits; once that number reaches PTRS_PER_PTE, the small mapping page
table can be released, with its upper level page table entry pointing
directly to a large page.

Testing: see patch 4 for details.

Signed-off-by: Aaron Lu <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 184 +++++++++++++++++++++++++++++++++--
include/linux/mm_types.h | 6 ++
include/linux/page-flags.h | 6 ++
3 files changed, 189 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 33657a54670a..fea2c70ff37f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -718,13 +718,89 @@ phys_addr_t slow_virt_to_phys(void *__virt_addr)
}
EXPORT_SYMBOL_GPL(slow_virt_to_phys);

+static void merge_splitted_mapping(struct page *pgt, int level);
+static void set_pte_adjust_nr_same_prot(pte_t *kpte, int level, pte_t pte)
+{
+ struct page *pgt = virt_to_page(kpte);
+ pgprot_t old_prot, new_prot;
+ int i;
+
+ /* The purpose of tracking entries with same_prot is to hopefully
+ * merge split small mappings into large ones. Since only 2M and
+ * 1G mappings are supported, there is no need to track page
+ * tables at levels above 2M.
+ */
+ if (!PageSplitpgt(pgt) || level > PG_LEVEL_2M) {
+ set_pte(kpte, pte);
+ return;
+ }
+
+ /* get old protection before kpte is updated */
+ if (level == PG_LEVEL_4K) {
+ old_prot = pte_pgprot(*kpte);
+ new_prot = pte_pgprot(pte);
+ } else {
+ old_prot = pmd_pgprot(*(pmd_t *)kpte);
+ new_prot = pmd_pgprot(*(pmd_t *)&pte);
+ }
+
+ set_pte(kpte, pte);
+
+ if (pgprot_val(pgt->same_prot) != pgprot_val(old_prot) &&
+ pgprot_val(pgt->same_prot) == pgprot_val(new_prot))
+ pgt->nr_same_prot++;
+
+ if (pgprot_val(pgt->same_prot) == pgprot_val(old_prot) &&
+ pgprot_val(pgt->same_prot) != pgprot_val(new_prot))
+ pgt->nr_same_prot--;
+
+ if (unlikely(pgt->nr_same_prot == 0)) {
+ pte_t *entry = page_address(pgt);
+
+ /*
+ * Now no entry matches the recorded same_prot any more; check
+ * again to see if all entries share the same new prot.
+ * Use the 1st entry's prot as the new pgt->same_prot.
+ */
+ if (level == PG_LEVEL_4K)
+ pgt->same_prot = pte_pgprot(*entry);
+ else
+ pgt->same_prot = pmd_pgprot(*(pmd_t *)entry);
+
+ for (i = 0; i < PTRS_PER_PTE; i++, entry++) {
+ pgprot_t prot;
+
+ if (level == PG_LEVEL_4K)
+ prot = pte_pgprot(*entry);
+ else
+ prot = pmd_pgprot(*(pmd_t *)entry);
+
+ if (pgprot_val(prot) == pgprot_val(pgt->same_prot))
+ pgt->nr_same_prot++;
+ }
+ }
+
+ /*
+ * If this split page table's entries all have the same
+ * protection now, try to merge it. Note that for a PMD level
+ * page table, if all entries are pointing to PTE page tables,
+ * no merge can be done.
+ */
+ if (unlikely(pgt->nr_same_prot == PTRS_PER_PTE &&
+ (pgprot_val(pgt->same_prot) & _PAGE_PRESENT) &&
+ (level == PG_LEVEL_4K ||
+ pgprot_val(pgt->same_prot) & _PAGE_PSE)))
+ merge_splitted_mapping(pgt, level);
+
+}
+
/*
* Set the new pmd in all the pgds we know about:
*/
-static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+static void __set_pmd_pte(pte_t *kpte, int level, unsigned long address, pte_t pte)
{
/* change init_mm */
- set_pte_atomic(kpte, pte);
+ set_pte_adjust_nr_same_prot(kpte, level, pte);
#ifdef CONFIG_X86_32
if (!SHARED_KERNEL_PMD) {
struct page *page;
@@ -739,12 +815,68 @@ static void __set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
p4d = p4d_offset(pgd, address);
pud = pud_offset(p4d, address);
pmd = pmd_offset(pud, address);
- set_pte_atomic((pte_t *)pmd, pte);
+ set_pte_adjust_nr_same_prot((pte_t *)pmd, level, pte);
}
}
#endif
}

+static void merge_splitted_mapping(struct page *pgt, int level)
+{
+ pte_t *kpte = page_address(pgt);
+ pgprot_t pte_prot, pmd_prot;
+ unsigned long address;
+ unsigned long pfn;
+ pte_t pte;
+ pud_t pud;
+
+ switch (level) {
+ case PG_LEVEL_4K:
+ pte_prot = pte_pgprot(*kpte);
+ pmd_prot = pgprot_4k_2_large(pte_prot);
+ pgprot_val(pmd_prot) |= _PAGE_PSE;
+ pfn = pte_pfn(*kpte);
+ pte = pfn_pte(pfn, pmd_prot);
+
+ /*
+ * Update the upper level kpte.
+ * Note that a further merge can happen if all of the PMD table's
+ * entries have the same protection bits after this change.
+ */
+ address = (unsigned long)page_address(pfn_to_page(pfn));
+ __set_pmd_pte(pgt->upper_kpte, level + 1, address, pte);
+ break;
+ case PG_LEVEL_2M:
+ pfn = pmd_pfn(*(pmd_t *)kpte);
+ pmd_prot = pmd_pgprot(*(pmd_t *)kpte);
+ pud = pfn_pud(pfn, pmd_prot);
+ set_pud(pgt->upper_kpte, pud);
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ return;
+ }
+
+ /*
+ * The current kernel does flush_tlb_all() when splitting a large
+ * page inside pgd_lock because of:
+ * - an erratum on Atom AAH41; and
+ * - the need to avoid another cpu simultaneously changing the
+ * just-split large page's attributes.
+ * The first does not require a full tlb flush according to
+ * commit 211b3d03c7400("x86: work around Fedora-11 x86-32 kernel
+ * failures on Intel Atom CPUs") while the second is already
+ * achieved by cpa_lock. Commit c0a759abf5a68("x86/mm/cpa: Move
+ * flush_tlb_all()") simplified the code by doing a full tlb flush
+ * inside pgd_lock. For the same reason, a full tlb flush is also
+ * done inside pgd_lock after doing a merge.
+ */
+ flush_tlb_all();
+
+ __ClearPageSplitpgt(pgt);
+ __free_page(pgt);
+}
+
static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
{
/*
@@ -901,9 +1033,10 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,

/* All checks passed. Update the large page mapping. */
new_pte = pfn_pte(old_pfn, new_prot);
- __set_pmd_pte(kpte, address, new_pte);
+ __set_pmd_pte(kpte, level, address, new_pte);
cpa->flags |= CPA_FLUSHTLB;
cpa_inc_lp_preserved(level);
+
return 0;
}

@@ -1023,6 +1156,11 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
for (i = 0; i < PTRS_PER_PTE; i++, pfn += pfninc, lpaddr += lpinc)
split_set_pte(cpa, pbase + i, pfn, ref_prot, lpaddr, lpinc);

+ __SetPageSplitpgt(base);
+ base->upper_kpte = kpte;
+ base->same_prot = ref_prot;
+ base->nr_same_prot = PTRS_PER_PTE;
+
if (virt_addr_valid(address)) {
unsigned long pfn = PFN_DOWN(__pa(address));

@@ -1037,7 +1175,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
* pagetable protections, the actual ptes set above control the
* primary protection behavior:
*/
- __set_pmd_pte(kpte, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));
+ __set_pmd_pte(kpte, level, address, mk_pte(base, __pgprot(_KERNPG_TABLE)));

/*
* Do a global flush tlb after splitting the large page
@@ -1508,6 +1646,23 @@ static int __cpa_process_fault(struct cpa_data *cpa, unsigned long vaddr,
}
}

+/*
+ * When debug_pagealloc_enabled():
+ * - the direct map will not use large page mappings;
+ * - the kernel highmap can still use large mappings.
+ * When !debug_pagealloc_enabled(): both the direct map and the kernel
+ * highmap can use large page mappings.
+ *
+ * When a large page mapping is used, it can be split due to reasons
+ * like protection changes and thus, it is also possible that a merge
+ * can happen for that split small mapping page table page.
+ */
+static bool subject_to_merge(unsigned long addr)
+{
+ return !debug_pagealloc_enabled() ||
+ within(addr, (unsigned long)_text, _brk_end);
+}
+
static int __change_page_attr(struct cpa_data *cpa, int primary)
{
unsigned long address;
@@ -1526,10 +1681,23 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
return __cpa_process_fault(cpa, address, primary);

if (level == PG_LEVEL_4K) {
- pte_t new_pte;
+ pte_t new_pte, *tmp;
pgprot_t new_prot = pte_pgprot(old_pte);
unsigned long pfn = pte_pfn(old_pte);

+ if (subject_to_merge(address)) {
+ spin_lock(&pgd_lock);
+ /*
+ * Check for races, another CPU might have merged
+ * this page up already.
+ */
+ tmp = _lookup_address_cpa(cpa, address, &level);
+ if (tmp != kpte) {
+ spin_unlock(&pgd_lock);
+ goto repeat;
+ }
+ }
+
pgprot_val(new_prot) &= ~pgprot_val(cpa->mask_clr);
pgprot_val(new_prot) |= pgprot_val(cpa->mask_set);

@@ -1551,10 +1719,12 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
* Do we really change anything ?
*/
if (pte_val(old_pte) != pte_val(new_pte)) {
- set_pte_atomic(kpte, new_pte);
+ set_pte_adjust_nr_same_prot(kpte, level, new_pte);
cpa->flags |= CPA_FLUSHTLB;
}
cpa->numpages = 1;
+ if (subject_to_merge(address))
+ spin_unlock(&pgd_lock);
return 0;
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index c29ab4c0cd5c..6124c575fdad 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -160,6 +160,12 @@ struct page {
spinlock_t ptl;
#endif
};
+ struct { /* split page table pages */
+ void *upper_kpte; /* compound_head */
+ int nr_same_prot;
+ unsigned long _split_pt_pad; /* mapping */
+ pgprot_t same_prot;
+ };
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e66f7aa3191d..3fe395dd7dfc 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -942,6 +942,7 @@ static inline bool is_page_hwpoison(struct page *page)
#define PG_offline 0x00000100
#define PG_table 0x00000200
#define PG_guard 0x00000400
+#define PG_splitpgt 0x00000800

#define PageType(page, flag) \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
@@ -1012,6 +1013,11 @@ PAGE_TYPE_OPS(Table, table)
*/
PAGE_TYPE_OPS(Guard, guard)

+/*
+ * Marks pages in use as split page tables
+ */
+PAGE_TYPE_OPS(Splitpgt, splitpgt)
+
extern bool is_free_buddy_page(struct page *page);

PAGEFLAG(Isolated, isolated, PF_ANY);
--
2.37.1

2022-08-08 15:18:07

by Aaron Lu

Subject: [TEST NOT_FOR_MERGE 4/4] x86/mm/cpa: add a test interface to split direct map

To test this functionality, a debugfs interface is added:
/sys/kernel/debug/x86/split_mapping

There are three test modes.
mode 0: allocate $page_nr pages and set each page's protection first
to RO and X and then back to RW and NX. This is used to test multiple
CPUs dealing with different address ranges.
mode 1: allocate several pages and create $nr_cpu kthreads to
simultaneously change those pages' protection with a fixed pattern.
This is used to test multiple CPUs dealing with the same address range.
mode 2: same as mode 0 except using alloc_pages() instead of vmalloc()
because vmalloc space is too small on x86_32/pae.

On an x86_64 VM, I started mode0.sh and mode1.sh at the same time:

mode0.sh:
mode=0
page_nr=200000
nr_cpu=16

function test_one()
{
echo $mode $page_nr > /sys/kernel/debug/x86/split_mapping
}

while true; do
for i in `seq $nr_cpu`; do
test_one &
done
wait
done

mode1.sh:
mode=1
page_nr=1
echo $mode $page_nr > /sys/kernel/debug/x86/split_mapping

After 5 hours, no problem occurred, with several million splits and merges.

For x86_32 and x86_PAE, the mode 2 test was used and no problem was found either.

Signed-off-by: Aaron Lu <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 206 +++++++++++++++++++++++++++++++++++
1 file changed, 206 insertions(+)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 1be9aab42c79..4deea4de73e7 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -20,6 +20,9 @@
#include <linux/kernel.h>
#include <linux/cc_platform.h>
#include <linux/set_memory.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
+#include <linux/random.h>

#include <asm/e820/api.h>
#include <asm/processor.h>
@@ -2556,6 +2559,209 @@ int __init kernel_unmap_pages_in_pgd(pgd_t *pgd, unsigned long address,
return retval;
}

+static int split_mapping_mode0_test(int page_nr)
+{
+ void **addr_buff;
+ void *addr;
+ int i, j;
+
+ addr_buff = kvmalloc(sizeof(void *) * page_nr, GFP_KERNEL);
+ if (!addr_buff) {
+ pr_err("addr_buff: no memory\n");
+ return -ENOMEM;
+ }
+
+ for (i = 0; i < page_nr; i++) {
+ addr = vmalloc(PAGE_SIZE);
+ if (!addr) {
+ pr_err("no memory\n");
+ break;
+ }
+
+ set_memory_ro((unsigned long)addr, 1);
+ set_memory_x((unsigned long)addr, 1);
+
+ addr_buff[i] = addr;
+ }
+
+ for (j = 0; j < i; j++) {
+ set_memory_nx((unsigned long)addr_buff[j], 1);
+ set_memory_rw((unsigned long)addr_buff[j], 1);
+ vfree(addr_buff[j]);
+ }
+
+ kvfree(addr_buff);
+
+ return 0;
+}
+
+struct split_mapping_mode1_data {
+ unsigned long addr;
+ int page_nr;
+};
+
+static int split_mapping_set_prot(void *data)
+{
+ struct split_mapping_mode1_data *d = data;
+ unsigned long addr = d->addr;
+ int page_nr = d->page_nr;
+ int m;
+
+ m = get_random_int() % 100;
+ msleep(m);
+
+ while (!kthread_should_stop()) {
+ set_memory_ro(addr, page_nr);
+ set_memory_x(addr, page_nr);
+ set_memory_rw(addr, page_nr);
+ set_memory_nx(addr, page_nr);
+ cond_resched();
+ }
+
+ return 0;
+}
+
+static int split_mapping_mode1_test(int page_nr)
+{
+ int nr_kthreads = num_online_cpus();
+ struct split_mapping_mode1_data d;
+ struct task_struct **kthreads;
+ int i, j, ret;
+ void *addr;
+
+ addr = vmalloc(PAGE_SIZE * page_nr);
+ if (!addr)
+ return -ENOMEM;
+
+ kthreads = kmalloc(nr_kthreads * sizeof(struct task_struct *), GFP_KERNEL);
+ if (!kthreads) {
+ vfree(addr);
+ return -ENOMEM;
+ }
+
+ d.addr = (unsigned long)addr;
+ d.page_nr = page_nr;
+ for (i = 0; i < nr_kthreads; i++) {
+ kthreads[i] = kthread_run(split_mapping_set_prot, &d, "split_mappingd%d", i);
+ if (IS_ERR(kthreads[i])) {
+ for (j = 0; j < i; j++)
+ kthread_stop(kthreads[j]);
+ ret = PTR_ERR(kthreads[i]);
+ goto out;
+ }
+ }
+
+ while (1) {
+ if (signal_pending(current)) {
+ for (i = 0; i < nr_kthreads; i++)
+ kthread_stop(kthreads[i]);
+ ret = 0;
+ break;
+ }
+ msleep(1000);
+ }
+
+out:
+ kfree(kthreads);
+ vfree(addr);
+ return ret;
+}
+
+static int split_mapping_mode2_test(int page_nr)
+{
+ struct page *p, *t;
+ unsigned long addr;
+ int i;
+
+ LIST_HEAD(head);
+
+ for (i = 0; i < page_nr; i++) {
+ p = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);
+ if (!p) {
+ pr_err("no memory\n");
+ break;
+ }
+
+ addr = (unsigned long)page_address(p);
+ BUG_ON(!addr);
+
+ set_memory_ro(addr, 1);
+ set_memory_x(addr, 1);
+
+ list_add(&p->lru, &head);
+ }
+
+ list_for_each_entry_safe(p, t, &head, lru) {
+ addr = (unsigned long)page_address(p);
+ set_memory_nx(addr, 1);
+ set_memory_rw(addr, 1);
+
+ list_del(&p->lru);
+ __free_page(p);
+ }
+
+ return 0;
+}
+static ssize_t split_mapping_write_file(struct file *file, const char __user *buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned int mode = 0, page_nr = 0;
+ char buffer[64];
+ int ret;
+
+ if (count >= sizeof(buffer))
+ return -EINVAL;
+
+ if (copy_from_user(buffer, buf, count))
+ return -EFAULT;
+ buffer[count] = '\0'; /* NUL-terminate before parsing */
+ sscanf(buffer, "%u %u", &mode, &page_nr);
+
+ /*
+ * There are 3 test modes.
+ * mode 0: each thread allocates $page_nr pages and sets each page's
+ * protection first to RO and X and then back to RW and NX.
+ * This is used to test multiple CPUs dealing with different
+ * pages.
+ * mode 1: allocate several pages and create $nr_cpu kthreads to
+ * simultaneously change those pages' protection with a fixed
+ * pattern. This is used to test multiple CPUs dealing with
+ * the same pages' protection.
+ * mode 2: like mode 0 but directly uses alloc_pages() because vmalloc
+ * area on x86_32 is too small, only 128M.
+ */
+ if (mode > 2)
+ return -EINVAL;
+
+ if (page_nr == 0)
+ return -EINVAL;
+
+ if (mode == 0)
+ ret = split_mapping_mode0_test(page_nr);
+ else if (mode == 1)
+ ret = split_mapping_mode1_test(page_nr);
+ else
+ ret = split_mapping_mode2_test(page_nr);
+
+ return ret ? ret : count;
+}
+
+static const struct file_operations split_mapping_fops = {
+ .write = split_mapping_write_file,
+};
+
+static int __init split_mapping_init(void)
+{
+ struct dentry *d = debugfs_create_file("split_mapping", S_IWUSR, arch_debugfs_dir, NULL,
+ &split_mapping_fops);
+ if (IS_ERR(d)) {
+ pr_err("create split_mapping failed: %ld\n", PTR_ERR(d));
+ return PTR_ERR(d);
+ }
+
+ return 0;
+}
+late_initcall(split_mapping_init);
+
/*
* The testcases use internal knowledge of the implementation that shouldn't
* be exposed to the rest of the kernel. Include these directly here.
--
2.37.1

2022-08-09 15:22:33

by Aaron Lu

Subject: Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible

Hi Kirill,

Thanks a lot for the feedback.

On 8/9/2022 6:04 PM, Kirill A. Shutemov wrote:
> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
>> This is an early RFC. While all reviews are welcome, reviewing this code
>> now will be a waste of time for the x86 subsystem maintainers. I would,
>> however, appreciate a preliminary review from the folks on the to and cc
>> list. I'm posting it to the list in case anyone else is interested in
>> seeing this early version.
>
> Last time[1] I tried to merge pages back in direct mapping it lead to
> substantial performance regression for some workloads. I cannot find the
> report right now, but I remember it was something graphics related.
>

Do you happen to remember the workload name? I can try running it.

> Have you done any performance evaluation?
>

Not yet, I was mostly concentrating on correctness. In addition to the
graphics workload, do you have anything else in mind that may be
sensitive to such a change?

I think I can run patch 4's mode 0 test with and without the merge
functionality and see how performance changes, since mode 0 essentially
does various set_memory_X() calls on different CPUs simultaneously,
which can trigger a lot of splits and merges. Sounds good?

> My take away was that the merge has to be batched. Like log where changes
> to direct mapping happens and come back to then and merge when the number
> of changes cross some limit.
>

Appreciate your suggestion.

> Also I don't see you handling set_memory_4k(). Huh?
>

Ah right, I missed that. Currently set_memory_4k() is not specially
handled and can be mistakenly merged. Will fix this in later versions.

> [1] https://lore.kernel.org/lkml/[email protected]/
>

Thanks!

2022-08-09 18:50:13

by Kirill A. Shutemov

Subject: Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible

On Tue, Aug 09, 2022 at 10:58:18PM +0800, Aaron Lu wrote:
> Hi Kirill,
>
> Thanks a lot for the feedback.
>
> On 8/9/2022 6:04 PM, Kirill A. Shutemov wrote:
> > On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> >> This is an early RFC. While all reviews are welcome, reviewing this code
> >> now will be a waste of time for the x86 subsystem maintainers. I would,
> >> however, appreciate a preliminary review from the folks on the to and cc
> >> list. I'm posting it to the list in case anyone else is interested in
> >> seeing this early version.
> >
> > Last time[1] I tried to merge pages back in direct mapping it lead to
> > substantial performance regression for some workloads. I cannot find the
> > report right now, but I remember it was something graphics related.
> >
>
> Do you happen to remember the workload name? I can try running it.


No, sorry. As I said, I tried to find the report, but failed.

--
Kiryl Shutsemau / Kirill A. Shutemov

2022-08-11 05:08:41

by Hyeonggon Yoo

Subject: Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible

On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> This is an early RFC. While all reviews are welcome, reviewing this code
> now will be a waste of time for the x86 subsystem maintainers. I would,
> however, appreciate a preliminary review from the folks on the to and cc
> list. I'm posting it to the list in case anyone else is interested in
> seeing this early version.
>

Hello Aaron!

+Cc Mike Rapoport, who has been working on the same problem. [1]

There is also an LPC discussion (with a different approach to this problem)
[2], [4]

and a performance measurement when all pages are 4K/2M. [3]

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
[3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
[4] https://lwn.net/Articles/894557/

> Dave Hansen: I need your ack before this goes to the maintainers.
>
> Here it goes:
>
> On x86_64, Linux has direct mapping of almost all physical memory. For
> performance reasons, this mapping is usually set as large page like 2M
> or 1G per hardware's capability with read, write and non-execute
> protection.
>
> There are cases where some pages have to change their protection to RO
> and eXecutable, like pages that host module code or bpf prog. When these
> pages' protection are changed, the corresponding large mapping that
> cover these pages will have to be splitted into 4K first and then
> individual 4k page's protection changed accordingly, i.e. unaffected
> pages keep their original protection as RW and NX while affected pages'
> protection changed to RO and X.
>
> There is a problem due to this split: the large mapping will remain
> splitted even after the affected pages' protection are changed back to
> RW and NX, like when the module is unloaded or bpf progs are freed.
> After system runs a long time, there can be more and more large mapping
> being splitted, causing more and more dTLB misses and overall system
> performance getting hurt[1].
>
> For this reason, people tried some techniques to reduce the harm of
> large mapping beling splitted, like bpf_prog_pack[2] which packs
> multiple bpf progs into a single page instead of allocating and changing
> one page's protection for each bpf prog. This approach made large
> mapping split happen much fewer.
>
> This patchset addresses this problem in another way: it merges
> splitted mappings back to a large mapping when protections of all entries
> of the splitted small mapping page table become same again, e.g. when the
> page whose protection was changed to RO+X now has its protection changed
> back to RW+NX due to reasons like module unload, bpf prog free, etc. and
> all other entries' protection are also RW+NX.
>

I tried a very similar approach a few months ago (as a toy implementation) [5],
and the biggest obstacle to this approach was: you need to be extremely sure
that page->nr_same_prot is ALWAYS correct.

For example, arch/x86/include/asm/kfence.h [6] clears and sets
_PAGE_PRESENT without going through CPA, which can simply break the count.

[5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
[6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56

I think we may need to hook set_pte/set_pmd/etc. and use proper
synchronization primitives when changing init_mm's page tables to go
further with this approach.
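
Roughly, what I have in mind is something like the user-space sketch
below: a single helper owns both the entry update and the bookkeeping,
guarded by a lock, so updates that bypass CPA cannot desynchronize the
count. All names here are invented; it is only meant to show the shape
of the idea, not proposed kernel code.

/* User-space sketch: one choke point for entry updates + bookkeeping.
 * Build with: cc -pthread */
#include <pthread.h>
#include <stdio.h>

#define ENTRIES 512

struct tracked_table {
    unsigned long entry[ENTRIES];
    unsigned long same_prot;
    int nr_same_prot;
    pthread_mutex_t lock;
};

/* The only way to update an entry: write and bookkeeping together. */
static void tracked_set_entry(struct tracked_table *t, int idx,
                              unsigned long new_prot)
{
    pthread_mutex_lock(&t->lock);

    if (t->entry[idx] == t->same_prot && new_prot != t->same_prot)
        t->nr_same_prot--;
    else if (t->entry[idx] != t->same_prot && new_prot == t->same_prot)
        t->nr_same_prot++;
    t->entry[idx] = new_prot;

    pthread_mutex_unlock(&t->lock);
}

int main(void)
{
    struct tracked_table t = {
        .same_prot = 0x1,
        .nr_same_prot = ENTRIES,
        .lock = PTHREAD_MUTEX_INITIALIZER,
    };

    for (int i = 0; i < ENTRIES; i++)
        t.entry[i] = 0x1;

    /* An update that bypassed this helper would break nr_same_prot. */
    tracked_set_entry(&t, 3, 0x3);
    printf("nr_same_prot = %d\n", t.nr_same_prot);
    return 0;
}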

> One final note is, with features like bpf_prog_pack etc., there can be
> much fewer large mapping split IIUC; also, this patchset can not help
> when the page which has its protection changed keeps in use. So my take
> on this large mapping split problem is: to get the most value of keeping
> large mapping intact, features like bpf_prog_pack is important. This
> patchset can help to further reduce large mapping split when in use page
> that has special protection set finally gets released.
>
> [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
> [2]: https://lore.kernel.org/lkml/[email protected]/
>
> Aaron Lu (4):
> x86/mm/cpa: restore global bit when page is present
> x86/mm/cpa: merge splitted direct mapping when possible
> x86/mm/cpa: add merge event counter
> x86/mm/cpa: add a test interface to split direct map
>
> arch/x86/mm/pat/set_memory.c | 411 +++++++++++++++++++++++++++++++++-
> include/linux/mm_types.h | 6 +
> include/linux/page-flags.h | 6 +
> include/linux/vm_event_item.h | 2 +
> mm/vmstat.c | 2 +
> 5 files changed, 420 insertions(+), 7 deletions(-)
>
> --
> 2.37.1
>
>

2022-08-11 08:19:43

by Aaron Lu

Subject: Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible

On Thu, 2022-08-11 at 04:50 +0000, Hyeonggon Yoo wrote:
> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> > This is an early RFC. While all reviews are welcome, reviewing this code
> > now will be a waste of time for the x86 subsystem maintainers. I would,
> > however, appreciate a preliminary review from the folks on the to and cc
> > list. I'm posting it to the list in case anyone else is interested in
> > seeing this early version.
> >
>
> Hello Aaron!
>

Hi Hyeonggon,

> +Cc Mike Rapoport, who has been same problem. [1]
>
> There is also LPC discussion (with different approach on this problem)
> [2], [4]
>
> and performance measurement when all pages are 4K/2M. [3]
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
> [3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
> [4] https://lwn.net/Articles/894557/
>

Thanks a lot for these info.

> > Dave Hansen: I need your ack before this goes to the maintainers.
> >
> > Here it goes:
> >
> > On x86_64, Linux has direct mapping of almost all physical memory. For
> > performance reasons, this mapping is usually set as large page like 2M
> > or 1G per hardware's capability with read, write and non-execute
> > protection.
> >
> > There are cases where some pages have to change their protection to RO
> > and eXecutable, like pages that host module code or bpf prog. When these
> > pages' protection are changed, the corresponding large mapping that
> > cover these pages will have to be splitted into 4K first and then
> > individual 4k page's protection changed accordingly, i.e. unaffected
> > pages keep their original protection as RW and NX while affected pages'
> > protection changed to RO and X.
> >
> > There is a problem due to this split: the large mapping will remain
> > splitted even after the affected pages' protection are changed back to
> > RW and NX, like when the module is unloaded or bpf progs are freed.
> > After system runs a long time, there can be more and more large mapping
> > being splitted, causing more and more dTLB misses and overall system
> > performance getting hurt[1].
> >
> > For this reason, people tried some techniques to reduce the harm of
> > large mapping beling splitted, like bpf_prog_pack[2] which packs
> > multiple bpf progs into a single page instead of allocating and changing
> > one page's protection for each bpf prog. This approach made large
> > mapping split happen much fewer.
> >
> > This patchset addresses this problem in another way: it merges
> > splitted mappings back to a large mapping when protections of all entries
> > of the splitted small mapping page table become same again, e.g. when the
> > page whose protection was changed to RO+X now has its protection changed
> > back to RW+NX due to reasons like module unload, bpf prog free, etc. and
> > all other entries' protection are also RW+NX.
> >
>
> I tried very similar approach few months ago (for toy implementation) [5],

Cool, glad we have tried a similar approach :-)

> and the biggest obstacle to this approach was: you need to be extremely sure
> that the page->nr_same_prot is ALWAYS correct.
>

Yes indeed.

> For example, in arch/x86/include/asm/kfence.h [6], it clears and set
> _PAGE_PRESENT without going through CPA, which can simply break the count.
>
> [5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
> [6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56
>

For this specific case, it probably doesn't matter because kfence
intentionally uses set_memory_4k() for these pages and no merge shall
ever be done for them, according to commit 1dc0da6e9ec0("x86, kfence:
enable KFENCE for x86").
(Kirill pointed out that my current version has a problem dealing with
set_memory_4k(), but that is fixable.)

> I think we may need to hook set_pte/set_pmd/etc and use proper
> synchronization primitives when changing init_mm's page table to go
> further on this approach.

Thanks for the suggestion. I'll check how many callsites there are that
manipulate init_mm's page tables outside of cpa() and then decide if it
is possible to do the hook and synchronization for set_pte/etc.

>
> > One final note is, with features like bpf_prog_pack etc., there can be
> > much fewer large mapping split IIUC; also, this patchset can not help
> > when the page which has its protection changed keeps in use. So my take
> > on this large mapping split problem is: to get the most value of keeping
> > large mapping intact, features like bpf_prog_pack is important. This
> > patchset can help to further reduce large mapping split when in use page
> > that has special protection set finally gets released.
> >
> > [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
> > [2]: https://lore.kernel.org/lkml/[email protected]/
> >
> > Aaron Lu (4):
> > x86/mm/cpa: restore global bit when page is present
> > x86/mm/cpa: merge splitted direct mapping when possible
> > x86/mm/cpa: add merge event counter
> > x86/mm/cpa: add a test interface to split direct map
> >
> > arch/x86/mm/pat/set_memory.c | 411 +++++++++++++++++++++++++++++++++-
> > include/linux/mm_types.h | 6 +
> > include/linux/page-flags.h | 6 +
> > include/linux/vm_event_item.h | 2 +
> > mm/vmstat.c | 2 +
> > 5 files changed, 420 insertions(+), 7 deletions(-)
> >
> > --
> > 2.37.1
> >
> >

2022-08-13 16:10:00

by Mike Rapoport

Subject: Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible

Hi Aaron,

On Thu, Aug 11, 2022 at 04:50:44AM +0000, Hyeonggon Yoo wrote:
> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
> > This is an early RFC. While all reviews are welcome, reviewing this code
> > now will be a waste of time for the x86 subsystem maintainers. I would,
> > however, appreciate a preliminary review from the folks on the to and cc
> > list. I'm posting it to the list in case anyone else is interested in
> > seeing this early version.
> >
>
> Hello Aaron!
>
> +Cc Mike Rapoport, who has been same problem. [1]

Thanks Hyeonggon!

> There is also LPC discussion (with different approach on this problem)
> [2], [4]
>
> and performance measurement when all pages are 4K/2M. [3]
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
> [2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
> [3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
> [4] https://lwn.net/Articles/894557/
>
> > Dave Hansen: I need your ack before this goes to the maintainers.
> >
> > Here it goes:
> >
> > On x86_64, Linux has direct mapping of almost all physical memory. For
> > performance reasons, this mapping is usually set as large page like 2M
> > or 1G per hardware's capability with read, write and non-execute
> > protection.
> >
> > There are cases where some pages have to change their protection to RO
> > and eXecutable, like pages that host module code or bpf prog. When these
> > pages' protection are changed, the corresponding large mapping that
> > cover these pages will have to be splitted into 4K first and then
> > individual 4k page's protection changed accordingly, i.e. unaffected
> > pages keep their original protection as RW and NX while affected pages'
> > protection changed to RO and X.
> >
> > There is a problem due to this split: the large mapping will remain
> > splitted even after the affected pages' protection are changed back to
> > RW and NX, like when the module is unloaded or bpf progs are freed.
> > After system runs a long time, there can be more and more large mapping
> > being splitted, causing more and more dTLB misses and overall system
> > performance getting hurt[1].
> >
> > For this reason, people tried some techniques to reduce the harm of
> > large mapping beling splitted, like bpf_prog_pack[2] which packs
> > multiple bpf progs into a single page instead of allocating and changing
> > one page's protection for each bpf prog. This approach made large
> > mapping split happen much fewer.
> >
> > This patchset addresses this problem in another way: it merges
> > splitted mappings back to a large mapping when protections of all entries
> > of the splitted small mapping page table become same again, e.g. when the
> > page whose protection was changed to RO+X now has its protection changed
> > back to RW+NX due to reasons like module unload, bpf prog free, etc. and
> > all other entries' protection are also RW+NX.
> >
>
> I tried very similar approach few months ago (for toy implementation) [5],
> and the biggest obstacle to this approach was: you need to be extremely sure
> that the page->nr_same_prot is ALWAYS correct.
>
> For example, in arch/x86/include/asm/kfence.h [6], it clears and set
> _PAGE_PRESENT without going through CPA, which can simply break the count.
>
> [5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
> [6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56
>
> I think we may need to hook set_pte/set_pmd/etc and use proper
> synchronization primitives when changing init_mm's page table to go
> further on this approach.
>
> > One final note is, with features like bpf_prog_pack etc., there can be
> > much fewer large mapping split IIUC; also, this patchset can not help
> > when the page which has its protection changed keeps in use. So my take
> > on this large mapping split problem is: to get the most value of keeping
> > large mapping intact, features like bpf_prog_pack is important. This
> > patchset can help to further reduce large mapping split when in use page
> > that has special protection set finally gets released.

I'm not sure automatic collapse of large pages in the direct map will
actually trigger frequently.

Consider, for example, pages allocated for modules that have adjusted
protection bits. These pages could be scattered all over and, even when
they are freed, the chances that there will be a contiguous 2M chunk are
quite low...

I believe that to reduce the fragmentation of the direct map, the 4K
pages with changed protection should be allocated from a cache of large
pages, as I did in an older version of secretmem or as Rick implemented
in his vmalloc and PKS series.

Then CPA may provide a method for explicitly collapsing a large page,
so that such a cache can call this method when an entire large page
becomes free.
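
For what it's worth, a toy user-space model of that flow could look like
the sketch below: a 2M-sized cache hands out 4K chunks to users that
need changed protection, and only when the last chunk comes back does it
invoke an explicit collapse callback (which, in the kernel, would be the
CPA method described above). All names are hypothetical, just to
illustrate the flow.

/* Toy model: allocate 4K chunks from a 2M cache, collapse explicitly
 * when the whole block becomes free. Names are invented. */
#include <stdio.h>
#include <stdbool.h>

#define CHUNKS_PER_2M 512   /* 4K chunks in a 2M block */

struct lp_cache {
    bool in_use[CHUNKS_PER_2M];
    int nr_in_use;
    void (*collapse_cb)(void);  /* e.g. a CPA "collapse this 2M" hook */
};

static int lp_alloc(struct lp_cache *c)
{
    for (int i = 0; i < CHUNKS_PER_2M; i++) {
        if (!c->in_use[i]) {
            c->in_use[i] = true;
            c->nr_in_use++;
            return i;
        }
    }
    return -1;  /* cache exhausted */
}

static void lp_free(struct lp_cache *c, int i)
{
    c->in_use[i] = false;
    /* Only when the whole 2M block is free is a collapse requested. */
    if (--c->nr_in_use == 0 && c->collapse_cb)
        c->collapse_cb();
}

static void fake_collapse(void)
{
    printf("2M block fully free: ask CPA to collapse the mapping\n");
}

int main(void)
{
    struct lp_cache c = { .collapse_cb = fake_collapse };
    int a = lp_alloc(&c), b = lp_alloc(&c);

    lp_free(&c, a); /* block still partially used: no collapse */
    lp_free(&c, b); /* last user gone: collapse fires once */
    return 0;
}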

> > [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
> > [2]: https://lore.kernel.org/lkml/[email protected]/
> >
> > Aaron Lu (4):
> > x86/mm/cpa: restore global bit when page is present
> > x86/mm/cpa: merge splitted direct mapping when possible
> > x86/mm/cpa: add merge event counter
> > x86/mm/cpa: add a test interface to split direct map
> >
> > arch/x86/mm/pat/set_memory.c | 411 +++++++++++++++++++++++++++++++++-
> > include/linux/mm_types.h | 6 +
> > include/linux/page-flags.h | 6 +
> > include/linux/vm_event_item.h | 2 +
> > mm/vmstat.c | 2 +
> > 5 files changed, 420 insertions(+), 7 deletions(-)
> >
> > --
> > 2.37.1
> >
> >

--
Sincerely yours,
Mike.

2022-08-16 08:50:06

by Aaron Lu

Subject: Re: [RFC PATCH 0/4] x86/mm/cpa: merge small mappings whenever possible

Hi Mike,

Thanks for the feedback. See below for my comments.

On 8/14/2022 12:05 AM, Mike Rapoport wrote:
> Hi Aaron,
>
> On Thu, Aug 11, 2022 at 04:50:44AM +0000, Hyeonggon Yoo wrote:
>> On Mon, Aug 08, 2022 at 10:56:45PM +0800, Aaron Lu wrote:
>>> This is an early RFC. While all reviews are welcome, reviewing this code
>>> now will be a waste of time for the x86 subsystem maintainers. I would,
>>> however, appreciate a preliminary review from the folks on the to and cc
>>> list. I'm posting it to the list in case anyone else is interested in
>>> seeing this early version.
>>>
>>
>> Hello Aaron!
>>
>> +Cc Mike Rapoport, who has been same problem. [1]
>
> Thanks Hyeonggon!
>
>> There is also LPC discussion (with different approach on this problem)
>> [2], [4]
>>
>> and performance measurement when all pages are 4K/2M. [3]
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>> [2] https://www.youtube.com/watch?v=egC7ZK4pcnQ
>> [3] https://lpc.events/event/11/contributions/1127/attachments/922/1792/LPC21%20Direct%20map%20management%20.pdf
>> [4] https://lwn.net/Articles/894557/
>>
>>> Dave Hansen: I need your ack before this goes to the maintainers.
>>>
>>> Here it goes:
>>>
>>> On x86_64, Linux has direct mapping of almost all physical memory. For
>>> performance reasons, this mapping is usually set as large page like 2M
>>> or 1G per hardware's capability with read, write and non-execute
>>> protection.
>>>
>>> There are cases where some pages have to change their protection to RO
>>> and eXecutable, like pages that host module code or bpf prog. When these
>>> pages' protection are changed, the corresponding large mapping that
>>> cover these pages will have to be splitted into 4K first and then
>>> individual 4k page's protection changed accordingly, i.e. unaffected
>>> pages keep their original protection as RW and NX while affected pages'
>>> protection changed to RO and X.
>>>
>>> There is a problem due to this split: the large mapping will remain
>>> splitted even after the affected pages' protection are changed back to
>>> RW and NX, like when the module is unloaded or bpf progs are freed.
>>> After system runs a long time, there can be more and more large mapping
>>> being splitted, causing more and more dTLB misses and overall system
>>> performance getting hurt[1].
>>>
>>> For this reason, people tried some techniques to reduce the harm of
>>> large mapping beling splitted, like bpf_prog_pack[2] which packs
>>> multiple bpf progs into a single page instead of allocating and changing
>>> one page's protection for each bpf prog. This approach made large
>>> mapping split happen much fewer.
>>>
>>> This patchset addresses this problem in another way: it merges
>>> splitted mappings back to a large mapping when protections of all entries
>>> of the splitted small mapping page table become same again, e.g. when the
>>> page whose protection was changed to RO+X now has its protection changed
>>> back to RW+NX due to reasons like module unload, bpf prog free, etc. and
>>> all other entries' protection are also RW+NX.
>>>
>>
>> I tried very similar approach few months ago (for toy implementation) [5],
>> and the biggest obstacle to this approach was: you need to be extremely sure
>> that the page->nr_same_prot is ALWAYS correct.
>>
>> For example, in arch/x86/include/asm/kfence.h [6], it clears and set
>> _PAGE_PRESENT without going through CPA, which can simply break the count.
>>
>> [5] https://github.com/hygoni/linux/tree/merge-mapping-v1r3
>> [6] https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/kfence.h#L56
>>
>> I think we may need to hook set_pte/set_pmd/etc and use proper
>> synchronization primitives when changing init_mm's page table to go
>> further on this approach.
>>
>>> One final note is, with features like bpf_prog_pack etc., there can be
>>> much fewer large mapping split IIUC; also, this patchset can not help
>>> when the page which has its protection changed keeps in use. So my take
>>> on this large mapping split problem is: to get the most value of keeping
>>> large mapping intact, features like bpf_prog_pack is important. This
>>> patchset can help to further reduce large mapping split when in use page
>>> that has special protection set finally gets released.
>
> I'm not sure automatic collapse of large pages in the direct map will
> actually trigger frequently.
>
> Consider for example pages allocated for modules, that have adjusted
> protection bits. This pages could be scattered all over and even when they
> are freed, chances there will be a contiguous 2M chunk are quite low...
>

When the pages with special protection bits that are scattered across a
2M chunk are all freed, we can merge that chunk back. I suppose you mean
it's not easy to have all these special pages freed?

> I believe that to reduce the fragmentation of the direct map the 4K pages
> with changed protection should be allocated from a cache of large pages, as
> I did on older version of secretmem or as Rick implemented in his vmalloc
> and PKS series.

I agree that the allocation side is important for reducing direct map
fragmentation. The approach here doesn't help while these special pages
are in use, whereas the approaches you mentioned can.

>
> Then CPA may provide a method for explicitly collapsing a large page, so
> that such cache can call this method when an entire large page becomes
> free.

I think this is a good idea. With things like your Unmap migratetype
patchset, when an order-9 page becomes free, it can notify CPA and then
arch code like CPA can manipulate the direct mapping as it sees
appropriate, like merging lower level page tables into a higher level one.

This also saves the trouble of tracking pgt->same_prot and nr_same_prot
for the kernel page tables in this patchset. CPA would only need to get
notified and then do a page table scan to make sure such a merge is correct.

I suppose this should work as long as all pages that will have their
protection bits changed are allocated from the page allocator (so that
your approach can track such pages).

>
>>> [1]: http://lkml.kernel.org/r/CAPhsuW4eAm9QrAxhZMJu-bmvHnjWjuw86gFZzTHRaMEaeFhAxw@mail.gmail.com
>>> [2]: https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Aaron Lu (4):
>>> x86/mm/cpa: restore global bit when page is present
>>> x86/mm/cpa: merge splitted direct mapping when possible
>>> x86/mm/cpa: add merge event counter
>>> x86/mm/cpa: add a test interface to split direct map
>>>
>>> arch/x86/mm/pat/set_memory.c | 411 +++++++++++++++++++++++++++++++++-
>>> include/linux/mm_types.h | 6 +
>>> include/linux/page-flags.h | 6 +
>>> include/linux/vm_event_item.h | 2 +
>>> mm/vmstat.c | 2 +
>>> 5 files changed, 420 insertions(+), 7 deletions(-)
>>>
>>> --
>>> 2.37.1
>>>
>>>
>