Hi,
This patch series aims to free user PTE page table pages when no one is
using them.
The beginning of this story is that some malloc libraries (e.g. jemalloc or
tcmalloc) usually reserve a large amount of VAs with mmap() and do not unmap
those VAs. Instead, they use madvise(MADV_DONTNEED) to free physical memory
when needed. But madvise() does not free the page tables, so a process that
touches an enormous virtual address space can accumulate many page tables.
The following figures are a memory usage snapshot of one process, actually
observed on our server:
VIRT: 55t
RES: 590g
VmPTE: 110g
As we can see, the PTE page tables take 110g, while RES is 590g. In theory,
the process only needs about 1.2g of PTE page tables to map that physical
memory. The PTE page tables occupy so much memory because
madvise(MADV_DONTNEED) only clears the PTEs and frees physical memory; it
does not free the PTE page table pages. If we free those empty PTE page
tables, we can save about 108g in the above case (best case). The larger the
difference between VIRT and RES, the more memory we save.
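As a rough check of the 1.2g figure above, here is a minimal sketch of the
arithmetic. It assumes 4 KiB base pages, 8-byte PTEs (as on x86_64) and a
perfectly dense mapping; the helper name min_pte_table_bytes() is
hypothetical and not part of this series.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL	/* assumption: 4 KiB base pages */
#define PTE_SIZE_BYTES  8ULL	/* assumption: 8-byte PTEs (x86_64) */

/*
 * Minimum number of bytes of PTE page tables needed to map
 * `resident_bytes` of physical memory, assuming the mapped pages are
 * packed densely so that every PTE page is completely filled.
 */
static uint64_t min_pte_table_bytes(uint64_t resident_bytes)
{
	uint64_t ptes_per_table = PAGE_SIZE_BYTES / PTE_SIZE_BYTES; /* 512 */
	uint64_t pages, tables;

	/* Round up to whole pages, then to whole PTE pages. */
	pages = (resident_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
	tables = (pages + ptes_per_table - 1) / ptes_per_table;

	return tables * PAGE_SIZE_BYTES;
}
```

For RES = 590 GiB this gives 590 * 2^21 bytes, i.e. roughly 1.15 GiB, which
matches the "about 1.2g" estimate; the remaining ~108g of VmPTE is what the
best case could reclaim.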
In this patch series, we add a pte_ref field to the struct page of each user
PTE page table page to track how many users it has. Similar to the page
refcount mechanism, a user of a PTE page table must hold a reference to it
before accessing it. The user PTE page table page may be freed when the last
reference is dropped.
Unlike my previous patchset[1], the pte_ref is now a struct percpu_ref, and
we switch it to atomic mode only in cases such as MADV_DONTNEED and MADV_FREE
that may clear user PTE page table entries. The user PTE page table page is
then released once pte_ref is found to be 0. The advantage is that there is
basically no performance overhead in percpu mode, yet empty PTE pages can
still be freed. In addition, the implementation in this patchset is much
simpler and more portable than the previous patchset[1].
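To illustrate why the count has to be switched to atomic mode before it can
be compared against 0, here is a single-threaded userspace model. It is only
a sketch of the idea, not the kernel's percpu_ref implementation: the
struct, the model_ref_*() names and the fixed CPU count are all hypothetical,
and real code needs RCU and atomic operations that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_NR_CPUS 4	/* hypothetical CPU count for the model */

struct model_ref {
	bool percpu_mode;
	long percpu_count[MODEL_NR_CPUS];	/* cheap, CPU-local deltas */
	long atomic_count;			/* single shared counter */
};

static void model_ref_init(struct model_ref *ref)
{
	memset(ref, 0, sizeof(*ref));
	ref->percpu_mode = true;
}

static void model_ref_get(struct model_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (ref->percpu_mode)
		ref->percpu_count[cpu]++;
	else
		ref->atomic_count++;
}

/* Returns true only when the put is seen to drop the count to zero. */
static bool model_ref_put(struct model_ref *ref, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (ref->percpu_mode) {
		/* A single CPU's delta says nothing about the total. */
		ref->percpu_count[cpu]--;
		return false;
	}
	return --ref->atomic_count == 0;
}

/* Fold all per-CPU deltas into the shared counter. */
static void model_ref_switch_to_atomic(struct model_ref *ref)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		ref->atomic_count += ref->percpu_count[cpu];
		ref->percpu_count[cpu] = 0;
	}
	ref->percpu_mode = false;
}

static bool model_ref_is_zero(const struct model_ref *ref)
{
	return !ref->percpu_mode && ref->atomic_count == 0;
}
```

In this model a get on one CPU and a put on another leave per-CPU deltas of
+1 and -1, so zero can only be detected after the deltas are summed, which
is why only paths such as MADV_DONTNEED/MADV_FREE pay for the switch.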
Testing:
The following pseudocode shows the effect of the optimization:
	mmap 50G
	while (1) {
		for (i = 0; i < 1024 * 25; i++) {
			touch 2M memory
			madvise MADV_DONTNEED 2M
		}
	}
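For anyone who wants to reproduce this, a minimal userspace sketch of one
pass of the loop above could look like the following. The helper names
(map_region(), touch_and_discard(), read_vmpte_kb()) are hypothetical, the
region size is left to the caller, and VmPTE is read from
/proc/self/status, so this is Linux-only.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNK_SIZE (2UL << 20)	/* 2M: what one PTE page maps with 4K pages */

/* Reserve `total` bytes of anonymous memory without touching it. */
static char *map_region(size_t total)
{
	void *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Touch every 2M chunk and immediately drop it with MADV_DONTNEED.
 * The chunk is deliberately not read back afterwards, since that would
 * fault a page in again. Returns the number of chunks processed, or -1.
 */
static long touch_and_discard(char *base, size_t total)
{
	long done = 0;

	assert(total % CHUNK_SIZE == 0);
	for (size_t off = 0; off < total; off += CHUNK_SIZE) {
		memset(base + off, 1, CHUNK_SIZE);
		if (madvise(base + off, CHUNK_SIZE, MADV_DONTNEED) != 0)
			return -1;
		done++;
	}
	return done;
}

/* Parse VmPTE (in kB) from /proc/self/status; -1 if unavailable. */
static long read_vmpte_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}
```

Calling map_region(50UL << 30), then touch_and_discard() on the result, and
printing read_vmpte_kb() while the mapping is still in place corresponds to
the VmPTE row in the table below.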
As we can see, the memory usage of VmPTE is reduced:
                before          after
VIRT           50.0 GB        50.0 GB
RES             3.1 MB         3.1 MB
VmPTE        102640 kB          96 kB
I have also tested stability with LTP[2] for several weeks and have not seen
any crashes so far.
This series is based on v5.18-rc2.
Comments and suggestions are welcome.
Thanks,
Qi.
[1] https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/
[2] https://github.com/linux-test-project/ltp
Qi Zheng (18):
x86/mm/encrypt: add the missing pte_unmap() call
percpu_ref: make ref stable after percpu_ref_switch_to_atomic_sync()
returns
percpu_ref: make percpu_ref_switch_lock per percpu_ref
mm: convert to use ptep_clear() in pte_clear_not_present_full()
mm: split the related definitions of pte_offset_map_lock() into
pgtable.h
mm: introduce CONFIG_FREE_USER_PTE
mm: add pte_to_page() helper
mm: introduce percpu_ref for user PTE page table page
pte_ref: add pte_tryget() and {__,}pte_put() helper
mm: add pte_tryget_map{_lock}() helper
mm: convert to use pte_tryget_map_lock()
mm: convert to use pte_tryget_map()
mm: add try_to_free_user_pte() helper
mm: use try_to_free_user_pte() in MADV_DONTNEED case
mm: use try_to_free_user_pte() in MADV_FREE case
pte_ref: add track_pte_{set, clear}() helper
x86/mm: add x86_64 support for pte_ref
Documentation: add document for pte_ref
Documentation/vm/index.rst | 1 +
Documentation/vm/pte_ref.rst | 210 ++++++++++++++++++++++++++
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 7 +-
arch/x86/mm/mem_encrypt_identity.c | 10 +-
fs/proc/task_mmu.c | 16 +-
fs/userfaultfd.c | 10 +-
include/linux/mm.h | 162 ++------------------
include/linux/mm_types.h | 1 +
include/linux/percpu-refcount.h | 6 +-
include/linux/pgtable.h | 196 +++++++++++++++++++++++-
include/linux/pte_ref.h | 73 +++++++++
include/linux/rmap.h | 2 +
include/linux/swapops.h | 4 +-
kernel/events/core.c | 5 +-
lib/percpu-refcount.c | 86 +++++++----
mm/Kconfig | 10 ++
mm/Makefile | 2 +-
mm/damon/vaddr.c | 30 ++--
mm/debug_vm_pgtable.c | 2 +-
mm/filemap.c | 4 +-
mm/gup.c | 20 ++-
mm/hmm.c | 9 +-
mm/huge_memory.c | 4 +-
mm/internal.h | 3 +-
mm/khugepaged.c | 18 ++-
mm/ksm.c | 4 +-
mm/madvise.c | 35 +++--
mm/memcontrol.c | 8 +-
mm/memory-failure.c | 15 +-
mm/memory.c | 187 +++++++++++++++--------
mm/mempolicy.c | 4 +-
mm/migrate.c | 8 +-
mm/migrate_device.c | 22 ++-
mm/mincore.c | 5 +-
mm/mlock.c | 5 +-
mm/mprotect.c | 4 +-
mm/mremap.c | 10 +-
mm/oom_kill.c | 3 +-
mm/page_table_check.c | 2 +-
mm/page_vma_mapped.c | 59 +++++++-
mm/pagewalk.c | 6 +-
mm/pte_ref.c | 230 +++++++++++++++++++++++++++++
mm/rmap.c | 9 ++
mm/swap_state.c | 4 +-
mm/swapfile.c | 18 ++-
mm/userfaultfd.c | 11 +-
mm/vmalloc.c | 2 +-
48 files changed, 1203 insertions(+), 340 deletions(-)
create mode 100644 Documentation/vm/pte_ref.rst
create mode 100644 include/linux/pte_ref.h
create mode 100644 mm/pte_ref.c
--
2.20.1
Add pte_ref hooks into routines that modify user PTE page tables,
and select ARCH_SUPPORTS_FREE_USER_PTE, so that the pte_ref code
can be compiled and can work on this architecture.
Signed-off-by: Qi Zheng <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b0142e01002e..c1046fc15882 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -34,6 +34,7 @@ config X86_64
select SWIOTLB
select ARCH_HAS_ELFCORE_COMPAT
select ZONE_DMA32
+ select ARCH_SUPPORTS_FREE_USER_PTE
config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..08d0aa5ce8d4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -23,6 +23,7 @@
#include <asm/coco.h>
#include <asm-generic/pgtable_uffd.h>
#include <linux/page_table_check.h>
+#include <linux/pte_ref.h>
extern pgd_t early_top_pgt[PTRS_PER_PGD];
bool __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
@@ -1010,6 +1011,7 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte)
{
page_table_check_pte_set(mm, addr, ptep, pte);
+ track_pte_set(mm, addr, ptep, pte);
set_pte(ptep, pte);
}
@@ -1055,6 +1057,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
{
pte_t pte = native_ptep_get_and_clear(ptep);
page_table_check_pte_clear(mm, addr, pte);
+ track_pte_clear(mm, addr, ptep, pte);
return pte;
}
@@ -1071,6 +1074,7 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
*/
pte = native_local_ptep_get_and_clear(ptep);
page_table_check_pte_clear(mm, addr, pte);
+ track_pte_clear(mm, addr, ptep, pte);
} else {
pte = ptep_get_and_clear(mm, addr, ptep);
}
@@ -1081,7 +1085,8 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
static inline void ptep_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
{
- if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK))
+ if (IS_ENABLED(CONFIG_PAGE_TABLE_CHECK)
+ || IS_ENABLED(CONFIG_FREE_USER_PTE))
ptep_get_and_clear(mm, addr, ptep);
else
pte_clear(mm, addr, ptep);
--
2.20.1
This commit adds documentation for pte_ref under `Documentation/vm/`.
Signed-off-by: Qi Zheng <[email protected]>
---
Documentation/vm/index.rst | 1 +
Documentation/vm/pte_ref.rst | 210 +++++++++++++++++++++++++++++++++++
2 files changed, 211 insertions(+)
create mode 100644 Documentation/vm/pte_ref.rst
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 44365c4574a3..ee71baccc2e7 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -31,6 +31,7 @@ algorithms. If you are looking for advice on simply allocating memory, see the
page_frags
page_owner
page_table_check
+ pte_ref
remap_file_pages
slub
split_page_table_lock
diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
new file mode 100644
index 000000000000..0ac1e5a408d7
--- /dev/null
+++ b/Documentation/vm/pte_ref.rst
@@ -0,0 +1,210 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================================================
+pte_ref: Tracking about how many references to each user PTE page table page
+============================================================================
+
+Preface
+=======
+
+Now in order to pursue high performance, applications mostly use some
+high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
+These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
+physical memory for the following reasons::
+
+ First of all, we should hold as few write locks of mmap_lock as possible,
+ since the mmap_lock semaphore has long been a contention point in the
+ memory management subsystem. The mmap()/munmap() hold the write lock, and
+ the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
+ madvise() instead of munmap() to released physical memory can reduce the
+ competition of the mmap_lock.
+
+ Secondly, after using madvise() to release physical memory, there is no
+ need to build vma and allocate page tables again when accessing the same
+ virtual address again, which can also save some time.
+
+The following is the largest user PTE page table memory that can be
+allocated by a single user process in a 32-bit and a 64-bit system.
+
++---------------------------+--------+---------+
+| | 32-bit | 64-bit |
++===========================+========+=========+
+| user PTE page table pages | 3 MiB | 512 GiB |
++---------------------------+--------+---------+
+| user PMD page table pages | 3 KiB | 1 GiB |
++---------------------------+--------+---------+
+
+(for 32-bit, take 3G user address space, 4K page size as an example;
+ for 64-bit, take 48-bit address width, 4K page size as an example.)
+
+After using madvise(), everything looks good, but as can be seen from the
+above table, a single process can create a large number of PTE page tables
+on a 64-bit system, since both of the MADV_DONTNEED and MADV_FREE will not
+release page table memory. And before the process exits or calls munmap(),
+the kernel cannot reclaim these pages even if these PTE page tables do not
+map anything.
+
+To fix the situation, we introduce a reference count for each user PTE page
+table page. Then we can track whether users are using the user PTE page table
+page and reclaim the user PTE page table pages that do not map anything at
+the right time.
+
+Introduction
+============
+
+The ``pte_ref``, which is the reference count of user PTE page table page, is
+``percpu_ref`` type. It is used to track the usage of each user PTE page table
+page.
+
+Who will hold the pte_ref?
+--------------------------
+
+The following people will hold a pte_ref::
+
+ The !pte_none() entry, such as regular page table entry that map physical
+ pages, or swap entry, or migrate entry, etc.
+
+ Visitor to the PTE page table entries, such as page table walker.
+
+Any ``!pte_none()`` entry and visitor can be regarded as the user of the PTE
+page table page. When the pte_ref is reduced to 0, it means that no one is
+using the PTE page table page, then this free PTE page table page can be
+reclaimed at this time.
+
+About mode switching
+--------------------
+
+When user PTE page table page is allocated, its ``pte_ref`` will be initialized
+to percpu mode, which basically does not bring performance overhead. When we
+want to reclaim the PTE page, it will be switched to atomic mode. Then we can
+check if the ``pte_ref`` is zero::
+
+ - If it is zero, we can safely reclaim it immediately;
+ - If it is not zero but we expect that the PTE page can be reclaimed
+ automatically when no one is using it, we can keep its ``pte_ref`` in
+ atomic mode (e.g. MADV_FREE case);
+ - If it is not zero, and we will continue to try at the next opportunity,
+ then we can choose to switch back to percpu mode (e.g. MADV_DONTNEED case).
+
+Competitive relationship
+------------------------
+
+Now, the user page table will only be released by calling ``free_pgtables()``
+when the process exits or ``unmap_region()`` is called (e.g. ``munmap()`` path).
+So other threads only need to ensure mutual exclusion with these paths to ensure
+that the page table is not released. For example::
+
+ thread A thread B
+ page table walker munmap
+ ================= ======
+
+ mmap_read_lock()
+ if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+ pte_offset_map_lock()
+ *walk page table*
+ pte_unmap_unlock()
+ }
+ mmap_read_unlock()
+
+ mmap_write_lock_killable()
+ detach_vmas_to_be_unmapped()
+ unmap_region()
+ --> free_pgtables()
+
+But after we introduce the ``pte_ref`` for the user PTE page table page, these
+existing balances will be broken. The page can be released at any time when its
+``pte_ref`` is reduced to 0. Therefore, the following case may happen::
+
+ thread A thread B thread C
+ page table walker madvise(MADV_DONTNEED) page fault
+ ================= ====================== ==========
+
+ mmap_read_lock()
+ if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+
+ mmap_read_lock()
+ unmap_page_range()
+ --> zap_pte_range()
+ /* the pte_ref is reduced to 0 */
+ --> free PTE page table page
+
+ mmap_read_lock()
+ /* may allocate
+ * a new huge
+ * pmd or a new
+ * PTE page
+ */
+
+ /* broken!! */
+ pte_offset_map_lock()
+
+As we can see, all of the thread A, B and C hold the read lock of mmap_lock, so
+they can execute concurrently. When thread B releases the PTE page table page,
+the value in the corresponding pmd entry will become unstable, which may be
+none or huge pmd, or map a new PTE page table page again. This will cause system
+chaos and even panic.
+
+So as described in the section "Who will hold the pte_ref?", the page table
+walker (visitor) also needs to try to take a ``pte_ref`` to the user PTE page
+table page before walking page table (the helper ``pte_tryget_map{_lock}()``
+can help us to do this), then the system will become orderly again::
+
+ thread A thread B
+ page table walker madvise(MADV_DONTNEED)
+ ================= ======================
+
+ mmap_read_lock()
+ if (!pte_none() && pte_present() && !pmd_trans_unstable()) {
+ pte_tryget()
+ --> percpu_ref_tryget
+ *if successfully, then:*
+
+ mmap_read_lock()
+ unmap_page_range()
+ --> zap_pte_range()
+ /* the pte_ref is reduced to 1 */
+
+ pte_offset_map_lock()
+ *walk page table*
+ pte_unmap_unlock()
+
+There is also a lock-less scenario(such as fast GUP). Fortunately, we don't need
+to do any additional operations to ensure that the system is in order. Take fast
+GUP as an example::
+
+ thread A thread B
+ fast GUP madvise(MADV_DONTNEED)
+ ======== ======================
+
+ get_user_pages_fast_only()
+ --> local_irq_save();
+ call_rcu(pte_free_rcu)
+ gup_pgd_range();
+ local_irq_restore();
+ /* do pte_free_rcu() */
+
+Helpers
+=======
+
++----------------------+------------------------------------------------+
+| pte_ref_init | Initialize the pte_ref |
++----------------------+------------------------------------------------+
+| pte_ref_free | Free the pte_ref |
++----------------------+------------------------------------------------+
+| pte_tryget | Try to hold a pte_ref |
++----------------------+------------------------------------------------+
+| pte_put | Decrement a pte_ref |
++----------------------+------------------------------------------------+
+| pte_tryget_map | Do pte_tryget and pte_offset_map |
++----------------------+------------------------------------------------+
+| pte_tryget_map_lock | Do pte_tryget and pte_offset_map_lock |
++----------------------+------------------------------------------------+
+| free_user_pte | Free the user PTE page table page |
++----------------------+------------------------------------------------+
+| try_to_free_user_pte | Try to free the user PTE page table page |
++----------------------+------------------------------------------------+
+| track_pte_set | Track the setting of user PTE page table page |
++----------------------+------------------------------------------------+
+| track_pte_clear | Track the clearing of user PTE page table page |
++----------------------+------------------------------------------------+
+
--
2.20.1
Hi Qi,
On Fri, Apr 29, 2022 at 09:35:52PM +0800, Qi Zheng wrote:
> +Now in order to pursue high performance, applications mostly use some
> +high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
> +These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
> +physical memory for the following reasons::
> +
> + First of all, we should hold as few write locks of mmap_lock as possible,
> + since the mmap_lock semaphore has long been a contention point in the
> + memory management subsystem. The mmap()/munmap() hold the write lock, and
> + the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
> + madvise() instead of munmap() to released physical memory can reduce the
> + competition of the mmap_lock.
> +
> + Secondly, after using madvise() to release physical memory, there is no
> + need to build vma and allocate page tables again when accessing the same
> + virtual address again, which can also save some time.
> +
I think we can use an enumerated list, like below:
-- >8 --
diff --git a/Documentation/vm/pte_ref.rst b/Documentation/vm/pte_ref.rst
index 0ac1e5a408d7c6..67b18e74fcb367 100644
--- a/Documentation/vm/pte_ref.rst
+++ b/Documentation/vm/pte_ref.rst
@@ -10,18 +10,18 @@ Preface
Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or tcmalloc.
These memory allocators use madvise(MADV_DONTNEED or MADV_FREE) to release
-physical memory for the following reasons::
-
- First of all, we should hold as few write locks of mmap_lock as possible,
- since the mmap_lock semaphore has long been a contention point in the
- memory management subsystem. The mmap()/munmap() hold the write lock, and
- the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
- madvise() instead of munmap() to released physical memory can reduce the
- competition of the mmap_lock.
-
- Secondly, after using madvise() to release physical memory, there is no
- need to build vma and allocate page tables again when accessing the same
- virtual address again, which can also save some time.
+physical memory for the following reasons:
+
+1. We should hold as few write locks of mmap_lock as possible,
+ since the mmap_lock semaphore has long been a contention point in the
+ memory management subsystem. The mmap()/munmap() hold the write lock, and
+ the madvise(MADV_DONTNEED or MADV_FREE) hold the read lock, so using
+ madvise() instead of munmap() to released physical memory can reduce the
+ competition of the mmap_lock.
+
+2. After using madvise() to release physical memory, there is no
+ need to build vma and allocate page tables again when accessing the same
+ virtual address again, which can also save some time.
The following is the largest user PTE page table memory that can be
allocated by a single user process in a 32-bit and a 64-bit system.
> +The following is the largest user PTE page table memory that can be
> +allocated by a single user process in a 32-bit and a 64-bit system.
> +
We can say "assuming 4K page size" here,
> ++---------------------------+--------+---------+
> +| | 32-bit | 64-bit |
> ++===========================+========+=========+
> +| user PTE page table pages | 3 MiB | 512 GiB |
> ++---------------------------+--------+---------+
> +| user PMD page table pages | 3 KiB | 1 GiB |
> ++---------------------------+--------+---------+
> +
> +(for 32-bit, take 3G user address space, 4K page size as an example;
> + for 64-bit, take 48-bit address width, 4K page size as an example.)
> +
... instead of here.
> +There is also a lock-less scenario(such as fast GUP). Fortunately, we don't need
> +to do any additional operations to ensure that the system is in order. Take fast
> +GUP as an example::
> +
> + thread A thread B
> + fast GUP madvise(MADV_DONTNEED)
> + ======== ======================
> +
> + get_user_pages_fast_only()
> + --> local_irq_save();
> + call_rcu(pte_free_rcu)
> + gup_pgd_range();
> + local_irq_restore();
> + /* do pte_free_rcu() */
> +
I see a whitespace warning around the "do pte_free_rcu()" line above when
applying this series.
Thanks.
--
An old man doll... just what I always wanted! - Clara
On 2022/4/29 9:35 PM, Qi Zheng wrote:
> Hi,
>
> This patch series aims to try to free user PTE page table pages when no one is
> using it.
>
> The beginning of this story is that some malloc libraries(e.g. jemalloc or
> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those
> VAs. They will use madvise(MADV_DONTNEED) to free physical memory if they want.
> But the page tables do not be freed by madvise(), so it can produce many
> page tables when the process touches an enormous virtual address space.
>
> The following figures are a memory usage snapshot of one process which actually
> happened on our server:
>
> VIRT: 55t
> RES: 590g
> VmPTE: 110g
>
> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
> theory, the process only need 1.2g PTE page tables to map those physical
> memory. The reason why PTE page tables occupy a lot of memory is that
> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
> doesn't free the PTE page table pages. So we can free those empty PTE page
> tables to save memory. In the above cases, we can save memory about 108g(best
> case). And the larger the difference between the size of VIRT and RES, the
> more memory we save.
>
> In this patch series, we add a pte_ref field to the struct page of page table
> to track how many users of user PTE page table. Similar to the mechanism of page
> refcount, the user of PTE page table should hold a refcount to it before
> accessing. The user PTE page table page may be freed when the last refcount is
> dropped.
>
> Different from the idea of another patchset of mine before[1], the pte_ref
> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
> entryies, and then release the user PTE page table page when checking that
> pte_ref is 0. The advantage of this is that there is basically no performance
> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
> code implementation of this patchset is much simpler and more portable than the
> another patchset[1].
Hi David,
I learned from the LWN article[1] that you led a session at the LSFMM on
the problems posed by the lack of page-table reclaim (And thank you very
much for mentioning some of my work in this direction). So I want to
know, what are the further plans of the community for this problem?
For the approach of adding a pte_ref to each PTE page table page, I have
posted two versions so far: the atomic count version[2] and the percpu_ref
version (this patchset).
For the atomic count version:
- Advantage: PTE pages can be freed as soon as the reference count drops
  to 0.
- Disadvantage: Incrementing and decrementing pte_ref are atomic
  operations, which have a certain performance overhead,
  but this should not become a performance bottleneck until
  the mmap_lock contention problem is resolved.
For the percpu_ref version:
- Advantage: In percpu mode, incrementing and decrementing pte_ref are
  operations on local CPU variables, so there is basically
  no performance overhead.
- Disadvantage: The pte_ref must be explicitly switched to atomic mode so
  that the unused PTE pages can be freed.
There are still many places to optimize the code implementation of these
two versions. But before I do further work, I would like to hear your
and the community's views and suggestions on these two versions.
Thanks,
Qi
[1]: https://lwn.net/Articles/893726 (Ways to reclaim unused page-table
pages)
[2]:
https://lore.kernel.org/lkml/[email protected]/
>
--
Thanks,
Qi
On 17.05.22 10:30, Qi Zheng wrote:
>
>
> On 2022/4/29 9:35 PM, Qi Zheng wrote:
>> Hi,
>>
>> This patch series aims to try to free user PTE page table pages when no one is
>> using it.
>>
>> The beginning of this story is that some malloc libraries(e.g. jemalloc or
>> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those
>> VAs. They will use madvise(MADV_DONTNEED) to free physical memory if they want.
>> But the page tables do not be freed by madvise(), so it can produce many
>> page tables when the process touches an enormous virtual address space.
>>
>> The following figures are a memory usage snapshot of one process which actually
>> happened on our server:
>>
>> VIRT: 55t
>> RES: 590g
>> VmPTE: 110g
>>
>> As we can see, the PTE page tables size is 110g, while the RES is 590g. In
>> theory, the process only need 1.2g PTE page tables to map those physical
>> memory. The reason why PTE page tables occupy a lot of memory is that
>> madvise(MADV_DONTNEED) only empty the PTE and free physical memory but
>> doesn't free the PTE page table pages. So we can free those empty PTE page
>> tables to save memory. In the above cases, we can save memory about 108g(best
>> case). And the larger the difference between the size of VIRT and RES, the
>> more memory we save.
>>
>> In this patch series, we add a pte_ref field to the struct page of page table
>> to track how many users of user PTE page table. Similar to the mechanism of page
>> refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The user PTE page table page may be freed when the last refcount is
>> dropped.
>>
>> Different from the idea of another patchset of mine before[1], the pte_ref
>> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
>> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
>> entryies, and then release the user PTE page table page when checking that
>> pte_ref is 0. The advantage of this is that there is basically no performance
>> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
>> code implementation of this patchset is much simpler and more portable than the
>> another patchset[1].
>
> Hi David,
>
> I learned from the LWN article[1] that you led a session at the LSFMM on
> the problems posed by the lack of page-table reclaim (And thank you very
> much for mentioning some of my work in this direction). So I want to
> know, what are the further plans of the community for this problem?
Hi,
yes, I talked about the involved challenges, especially, how malicious
user space can trigger allocation of almost exclusively page tables and
essentially consume a lot of unmovable+unswappable memory and even store
secrets in the page table structure.
Empty PTE tables are one such case we care about, but there is more. Even
with your approach, we can still end up with many page tables that are
allocated on higher levels (e.g., PMD tables) or page tables that are
not empty (especially, filled with the shared zeropage).
Ideally, we'd have some mechanism that can reclaim also other
reclaimable page tables (e.g., filled with shared zeropage). One idea
was to add reclaimable page tables to the LRU list and to then
scan+reclaim them on demand. There are multiple challenges involved,
obviously. One is how to synchronize against concurrent page table
walkers, another one is how to invalidate MMU notifiers from reclaim
context. It would most probably involve storing required information in
the memmap to be able to lock+synchronize.
Having said that, adding infrastructure that might not be easy to extend
to the more general case of reclaiming other reclaimable page tables on
multiple levels (esp PMD tables) might not be what we want. OTOH, it
gets the job done for one case we care about.
It's really hard to tell what to do because reclaiming page tables and
eventually handling malicious user space correctly is far from trivial :)
I'll be on vacation until end of May, I'll come back to this mail once
I'm back.
--
Thanks,
David / dhildenb
On Wed, May 18, 2022 at 04:51:06PM +0200, David Hildenbrand wrote:
> yes, I talked about the involved challenges, especially, how malicious
> user space can trigger allocation of almost exclusively page tables and
> essentially consume a lot of unmovable+unswappable memory and even store
> secrets in the page table structure.
There are a lot of ways for userspace to consume a large amount of
kernel memory. For example, one can open a file and set file locks on
alternate bytes. We generally handle this by accounting the memory to
the process and letting the OOM killer, rlimits, memcg or other mechanism
take care of it. Just because page tables are (generally) reclaimable
doesn't mean we need to treat them specially.
On 2022/5/18 10:51 PM, David Hildenbrand wrote:
> On 17.05.22 10:30, Qi Zheng wrote:
>>
>>
>> On 2022/4/29 9:35 PM, Qi Zheng wrote:
>>> Hi,
>>>
>>> Different from the idea of another patchset of mine before[1], the pte_ref
>>> becomes a struct percpu_ref type, and we switch it to atomic mode only in cases
>>> such as MADV_DONTNEED and MADV_FREE that may clear the user PTE page table
>>> entryies, and then release the user PTE page table page when checking that
>>> pte_ref is 0. The advantage of this is that there is basically no performance
>>> overhead in percpu mode, but it can also free the empty PTEs. In addition, the
>>> code implementation of this patchset is much simpler and more portable than the
>>> another patchset[1].
>>
>> Hi David,
>>
>> I learned from the LWN article[1] that you led a session at the LSFMM on
>> the problems posed by the lack of page-table reclaim (And thank you very
>> much for mentioning some of my work in this direction). So I want to
>> know, what are the further plans of the community for this problem?
>
> Hi,
>
> yes, I talked about the involved challenges, especially, how malicious
> user space can trigger allocation of almost exclusively page tables and
> essentially consume a lot of unmovable+unswappable memory and even store
> secrets in the page table structure.
It is indeed difficult to deal with malicious user space programs,
because as long as one entry in a PTE page table page still maps a
physical page, the entire PTE page cannot be freed.
So maybe we should first solve the problems encountered in engineering
practice. We have hit the problem I mentioned in the cover letter
several times on our servers:
VIRT: 55t
RES: 590g
VmPTE: 110g
These are not malicious programs; they just use jemalloc/tcmalloc
normally (currently jemalloc/tcmalloc often use mmap+madvise instead
of mmap+munmap to improve performance). We checked and found that
most of these PTE page tables are empty.
Of course, normal operations may also lead to consequences similar to
those of malicious programs, but we have not found such examples
on our servers.
>
> Empty PTE tables is one such case we care about, but there is more. Even
> with your approach, we can still end up with many page tables that are
> allocated on higher levels (e.g., PMD tables) or page tables that are
Yes, currently my patch does not consider PMD tables. The reason is that
their maximum memory consumption is only 1G on a 64-bit system, so the
impact is much smaller than the 512G of PTE tables.
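The 1G and 512G upper bounds can be checked with a short sketch. It uses
the same assumptions as the table in the documentation patch (48-bit address
width, 4K pages) plus 8-byte entries and a 2M span per PMD entry; the helper
name max_table_bytes() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of page tables needed at one level to map a full address space
 * of `va_bits` bits, where each entry at that level covers
 * 2^span_shift bytes and occupies `entry_size` bytes.
 */
static uint64_t max_table_bytes(unsigned int va_bits, unsigned int span_shift,
				unsigned int entry_size)
{
	assert(va_bits > span_shift && va_bits < 64);

	/* entries = address space size / bytes covered per entry */
	return ((uint64_t)1 << (va_bits - span_shift)) * entry_size;
}
```

With va_bits = 48, PTE entries (span_shift = 12) add up to 2^39 bytes =
512 GiB, while PMD entries (span_shift = 21) add up to 2^30 bytes = 1 GiB.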
> not empty (especially, filled with the shared zeropage).
This case is indeed a problem, and more difficult. :(
>
> Ideally, we'd have some mechanism that can reclaim also other
> reclaimable page tables (e.g., filled with shared zeropage). One idea
> was to add reclaimable page tables to the LRU list and to then
> scan+reclaim them on demand. There are multiple challenges involved,
> obviously. One is how to synchronize against concurrent page table
Agreed. The current situation is that holding the read lock of mmap_lock
ensures that the PTE tables are stable. If the refcount method is not
used and the locking that protects the PTE tables is not changed, then
the write lock of mmap_lock has to be held to ensure synchronization
(which has a huge impact on performance).
> walkers, another one is how to invalidate MMU notifiers from reclaim
> context. It would most probably involve storing required information in
> the memmap to be able to lock+synchronize.
This may also be a way to explore.
>
> Having that said, adding infrastructure that might not be easy to extend
> to the more general case of reclaiming other reclaimable page tables on
> multiple levels (esp PMD tables) might not be what we want. OTOH, it
> gets the job done for one case we care about.
>
> It's really hard to tell what to do because reclaiming page tables and
> eventually handling malicious user space correctly is far from trivial :)
Yeah, agree :(
>
> I'll be on vacation until end of May, I'll come back to this mail once
> I'm back.
>
OK, thanks, and have a nice holiday.
--
Thanks,
Qi
On 2022/5/18 10:56 PM, Matthew Wilcox wrote:
> On Wed, May 18, 2022 at 04:51:06PM +0200, David Hildenbrand wrote:
>> yes, I talked about the involved challenges, especially, how malicious
>> user space can trigger allocation of almost exclusively page tables and
>> essentially consume a lot of unmovable+unswappable memory and even store
>> secrets in the page table structure.
>
> There are a lot of ways for userspace to consume a large amount of
> kernel memory. For example, one can open a file and set file locks on
Yes, malicious programs are really hard to avoid, maybe we should try to
solve some common cases first (such as empty PTE tables).
> alternate bytes. We generally handle this by accounting the memory to
> the process and let the OOM killer, rlimits, memcg or other mechanism
> take care of it. Just because page tables are (generally) reclaimable
> doesn't mean we need to treat them specially.
>
--
Thanks,
Qi