2022-08-24 18:01:20

by Mike Kravetz

Subject: [PATCH 0/8] hugetlb: Use new vma mutex for huge pmd sharing synchronization

hugetlb fault scalability regressions have recently been reported [1].
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7. At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and benefit of this patch series are not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running:
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

                        48 sample Avg
next-20220822           next-20220822                next-20220822
unmodified               revert i_mmap_sema locking   vma sema locking, this series
-----------------------------------------------------------------------------
494229 KB/s              495375 KB/s                  495573 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings. To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory. The shared
mapping size was 250GB. For comparison, a single instance of each program
was run. Then, multiple instances were run in parallel to introduce
lock contention. Changing the locking scheme results in a significant
performance benefit. A minimal sketch of the faulting program appears
after the results below.

test             instances   unmodified      revert        vma
--------------------------------------------------------------------------
faults per sec       1         397068        403411       394935
faults per sec      24          68322         83023        82436
forks per sec        1           2717          2862         2816
forks per sec       24            404           465          499
Combined faults     24           1528         69090        59544
Combined forks      24            337            66          140

The combined test runs the faulting program and the forking program
simultaneously.
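
A minimal sketch of the faulting program, assuming 2MB default huge pages
and a preallocated hugetlb pool (the actual test programs were not posted
with this series; names and sizes here are illustrative):

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <sys/mman.h>

#define MAP_SIZE	(250UL << 30)	/* 250GB shared mapping */
#define HPAGE_SIZE	(2UL << 20)	/* assumed 2MB huge pages */

int main(void)
{
	for (;;) {
		struct timespec t0, t1;
		unsigned long off;
		char *p;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED)
			return 1;
		for (off = 0; off < MAP_SIZE; off += HPAGE_SIZE)
			p[off] = 1;	/* write fault every huge page */
		munmap(p, MAP_SIZE);
		clock_gettime(CLOCK_MONOTONIC, &t1);
		printf("faults/sec: %.0f\n",
		       (MAP_SIZE / HPAGE_SIZE) /
		       ((t1.tv_sec - t0.tv_sec) +
			(t1.tv_nsec - t0.tv_nsec) / 1e9));
	}
	return 0;
}

The forking variant write faults only a few pages and then does
fork()/wait() in the loop instead of faulting the whole area.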

Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79. Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share. With c0d0381ade79 reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39bb5 exposes races in page fault/file truncation.
Patches 3 and 4 of this series address those races. This requires
using the hugetlb fault mutexes for more coordination between the fault
code and file page removal.

Patches 5 - 7 add infrastructure for a new vma based rw semaphore that
will be used for pmd sharing synchronization. The idea is that this
semaphore will be held in read mode for the duration of fault processing,
and held in write mode for unmap operations which may call huge_pmd_unshare.
Acquiring i_mmap_rwsem is also still required to synchronize huge pmd
sharing. However, it is only required in the fault path when setting up
sharing, and will be acquired in huge_pmd_share().
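
Condensed from patches 6 - 8, the intended usage pattern looks like this
(a sketch, not the exact kernel code):

	/* fault path: hold the vma lock in read mode while ptep is in use */
	hugetlb_vma_lock_read(vma);
	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
	/* ... fault processing using ptep ... */
	hugetlb_vma_unlock_read(vma);

	/* unmap paths: write mode is required before huge_pmd_unshare() */
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(vma->vm_file->f_mapping);
	/* ... walk page table, possibly calling huge_pmd_unshare() ... */
	i_mmap_unlock_write(vma->vm_file->f_mapping);
	hugetlb_vma_unlock_write(vma);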

Patch 8 makes use of this new vma lock. Unfortunately, the fault code
and truncate/hole punch code would naturally take locks in the opposite
order which could lead to deadlock. Since the performance of page faults
is more important, the truncation/hole punch code is modified to back
out and take locks in the correct order if necessary.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/[email protected]/

RFC -> v1
- Addressed many issues pointed out by Miaohe Lin. Thank you! Most
significant was not attempting to back out pages in fault code due to
races with truncation (patch 4).
- Rebased and retested on next-20220822

Mike Kravetz (8):
  hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: revert use i_mmap_rwsem for more pmd sharing
    synchronization
  hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache
  hugetlb: handle truncate racing with page faults
  hugetlb: rename vma_shareable() and refactor code
  hugetlb: add vma based lock for pmd sharing
  hugetlb: create hugetlb_unmap_file_folio to unmap single file folio
  hugetlb: use new vma_lock for pmd sharing synchronization

fs/hugetlbfs/inode.c | 364 ++++++++++++++++++++++++++++++----------
include/linux/hugetlb.h | 38 ++++-
kernel/fork.c | 6 +-
mm/hugetlb.c | 354 ++++++++++++++++++++++++++++----------
mm/memory.c | 2 +
mm/rmap.c | 114 ++++++++-----
mm/userfaultfd.c | 14 +-
7 files changed, 653 insertions(+), 239 deletions(-)


base-commit: cc2986f4dc67df7e6209e0cd74145fffbd30d693
--
2.37.1


2022-08-24 18:01:45

by Mike Kravetz

Subject: [PATCH 1/8] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. The use of i_mmap_rwsem to prevent
fault/truncate races depends on this. However, this has been shown to
cause performance/scaling issues. As a result, that code will be
reverted. Since the use of i_mmap_rwsem to address page fault/truncate races
depends on this, it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 30 +++++++++---------------------
mm/hugetlb.c | 23 ++++++++++++-----------
2 files changed, 21 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index f7a5b5124d8a..a32031e751d1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -419,9 +419,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
* maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() holds i_mmap_rwsem and prevents
- * page faults in the truncated range by checking i_size. i_size is
- * modified while holding i_mmap_rwsem.
+ * in this routine. hugetlb_no_page() prevents page faults in the
+ * truncated range. It checks i_size before allocation, and again after
+ * with the page table lock for the page held. The same lock must be
+ * acquired to unmap a page.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserve map
@@ -451,16 +452,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
u32 hash = 0;

index = folio->index;
- if (!truncate_op) {
- /*
- * Only need to hold the fault mutex in the
- * hole punch case. This prevents races with
- * page faults. Races are not possible in the
- * case of truncation.
- */
- hash = hugetlb_fault_mutex_hash(mapping, index);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
- }
+ hash = hugetlb_fault_mutex_hash(mapping, index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);

/*
* If folio is mapped, it was faulted in after being
@@ -504,8 +497,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
}

folio_unlock(folio);
- if (!truncate_op)
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
folio_batch_release(&fbatch);
cond_resched();
@@ -543,8 +535,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;

- i_mmap_lock_write(mapping);
i_size_write(inode, offset);
+ i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
ZAP_FLAG_DROP_MARKER);
@@ -703,11 +695,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
/* addr is the offset within the file (zero based) */
addr = index * hpage_size;

- /*
- * fault mutex taken here, protects against fault path
- * and hole punch. inode_lock previously taken protects
- * against truncation.
- */
+ /* mutex taken here, fault path and hole punch */
hash = hugetlb_fault_mutex_hash(mapping, index);
mutex_lock(&hugetlb_fault_mutex_table[hash]);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9a72499486c1..70bc7f867bc0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5575,18 +5575,17 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}

/*
- * We can not race with truncation due to holding i_mmap_rwsem.
- * i_size is modified when holding i_mmap_rwsem, so check here
- * once for faults beyond end of file.
+ * Use page lock to guard against racing truncation
+ * before we get page_table_lock.
*/
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto out;
-
retry:
new_page = false;
page = find_lock_page(mapping, idx);
if (!page) {
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto out;
+
/* Check for page in userfault range */
if (userfaultfd_missing(vma)) {
ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -5677,6 +5676,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}

ptl = huge_pte_lock(h, mm, ptep);
+ size = i_size_read(mapping->host) >> huge_page_shift(h);
+ if (idx >= size)
+ goto backout;
+
ret = 0;
/* If pte changed from under us, retry */
if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5785,10 +5788,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

/*
* Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This serves two purposes:
- * 1) It prevents huge_pmd_unshare from being called elsewhere
- * and making the ptep no longer valid.
- * 2) It synchronizes us with i_size modifications during truncation.
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
*
* ptep could have already be assigned via huge_pte_offset. That
* is OK, as huge_pte_alloc will return the same value unless
--
2.37.1

2022-08-24 18:01:47

by Mike Kravetz

Subject: [PATCH 4/8] hugetlb: handle truncate racing with page faults

When page fault code needs to allocate and instantiate a new hugetlb
page (hugetlb_no_page), it checks early to determine if the fault is
beyond i_size. When discovered early, it is easy to abort the fault and
return an error. However, it becomes much more difficult to handle when
discovered later after allocating the page and consuming reservations
and adding to the page cache. Backing out changes in such instances
becomes difficult and error prone.

Instead of trying to catch and backout all such races, use the hugetlb
fault mutex to handle truncate racing with page faults. The most
significant change is modification of the routine remove_inode_hugepages
such that it will take the fault mutex for EVERY index in the truncated
range (or hole in the case of hole punch). Since remove_inode_hugepages
is called in the truncate path after updating i_size, we can experience
races as follows.
- truncate code updates i_size and takes fault mutex before a racing
fault. After fault code takes mutex, it will notice fault beyond
i_size and abort early.
- fault code obtains mutex, and truncate updates i_size after early
checks in fault code. fault code will add page beyond i_size.
When truncate code takes mutex for page/index, it will remove the
page.
- truncate updates i_size, but fault code obtains mutex first. If
fault code sees updated i_size it will abort early. If fault code
does not see updated i_size, it will add page beyond i_size and
truncate code will remove page when it obtains fault mutex.

Note, for performance reasons remove_inode_hugepages will still use
filemap_get_folios for bulk folio lookups. For indices not returned in
the bulk lookup, it will need to look up individual folios to check for
races with page faults.
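
For reference, the fault side of this coordination looks roughly like the
following (a sketch of the hugetlb_fault/hugetlb_no_page flow after this
series; the 'out' label is illustrative):

	/* in hugetlb_fault(), before calling hugetlb_no_page() */
	hash = hugetlb_fault_mutex_hash(mapping, idx);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);

	/* in hugetlb_no_page() */
	page = find_lock_page(mapping, idx);
	if (!page) {
		size = i_size_read(mapping->host) >> huge_page_shift(h);
		if (idx >= size)	/* truncate already updated i_size */
			goto out;	/* abort; caller drops fault mutex */
		/*
		 * Allocate and add the page to the page cache. If truncate
		 * updated i_size after the check above, it will find and
		 * remove this page once it takes the fault mutex for idx.
		 */
	}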

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
mm/hugetlb.c | 41 +++++-----
2 files changed, 152 insertions(+), 73 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d98c6edbd1a4..e83fd31671b3 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -411,6 +411,95 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
}
}

+/*
+ * Called with hugetlb fault mutex held.
+ * Returns true if page was actually removed, false otherwise.
+ */
+static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
+ struct address_space *mapping,
+ struct folio *folio, pgoff_t index,
+ bool truncate_op)
+{
+ bool ret = false;
+
+ /*
+ * If folio is mapped, it was faulted in after being
+ * unmapped in caller. Unmap (again) while holding
+ * the fault mutex. The mutex will prevent faults
+ * until we finish removing the folio.
+ */
+ if (unlikely(folio_mapped(folio))) {
+ i_mmap_lock_write(mapping);
+ hugetlb_vmdelete_list(&mapping->i_mmap,
+ index * pages_per_huge_page(h),
+ (index + 1) * pages_per_huge_page(h),
+ ZAP_FLAG_DROP_MARKER);
+ i_mmap_unlock_write(mapping);
+ }
+
+ folio_lock(folio);
+ /*
+ * After locking page, make sure mapping is the same.
+ * We could have raced with page fault populate and
+ * backout code.
+ */
+ if (folio_mapping(folio) == mapping) {
+ /*
+ * We must remove the folio from page cache before removing
+ * the region/ reserve map (hugetlb_unreserve_pages). In
+ * rare out of memory conditions, removal of the region/reserve
+ * map could fail. Correspondingly, the subpool and global
+ * reserve usage count can need to be adjusted.
+ */
+ VM_BUG_ON(HPageRestoreReserve(&folio->page));
+ hugetlb_delete_from_page_cache(&folio->page);
+ ret = true;
+ if (!truncate_op) {
+ if (unlikely(hugetlb_unreserve_pages(inode, index,
+ index + 1, 1)))
+ hugetlb_fix_reserve_counts(inode);
+ }
+ }
+
+ folio_unlock(folio);
+ return ret;
+}
+
+/*
+ * Take hugetlb fault mutex for a set of inode indicies.
+ * Check for and remove any found folios. Return the number of
+ * any removed folios.
+ *
+ */
+static long fault_lock_inode_indicies(struct hstate *h,
+ struct inode *inode,
+ struct address_space *mapping,
+ pgoff_t start, pgoff_t end,
+ bool truncate_op)
+{
+ struct folio *folio;
+ long freed = 0;
+ pgoff_t index;
+ u32 hash;
+
+ for (index = start; index < end; index++) {
+ hash = hugetlb_fault_mutex_hash(mapping, index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+ folio = filemap_get_folio(mapping, index);
+ if (folio) {
+ if (remove_inode_single_folio(h, inode, mapping, folio,
+ index, truncate_op))
+ freed++;
+ folio_put(folio);
+ }
+
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ }
+
+ return freed;
+}
+
/*
* remove_inode_hugepages handles two distinct cases: truncation and hole
* punch. There are subtle differences in operation for each case.
@@ -418,11 +507,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* truncation is indicated by end of range being LLONG_MAX
* In this case, we first scan the range and release found pages.
* After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
- * maps and global counts. Page faults can not race with truncation
- * in this routine. hugetlb_no_page() prevents page faults in the
- * truncated range. It checks i_size before allocation, and again after
- * with the page table lock for the page held. The same lock must be
- * acquired to unmap a page.
+ * maps and global counts. Page faults can race with truncation.
+ * During faults, hugetlb_no_page() checks i_size before page allocation,
+ * and again after obtaining page table lock. It will 'back out'
+ * allocations in the truncated range.
* hole punch is indicated if end is not LLONG_MAX
* In the hole punch case we scan the range and release found pages.
* Only when releasing a page is the associated region/reserve map
@@ -431,75 +519,69 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* This is indicated if we find a mapped page.
* Note: If the passed end of range value is beyond the end of file, but
* not LLONG_MAX this routine still performs a hole punch operation.
+ *
+ * Since page faults can race with this routine, care must be taken as both
+ * modify huge page reservation data. To somewhat synchronize these operations
+ * the hugetlb fault mutex is taken for EVERY index in the range to be hole
+ * punched or truncated. In this way, we KNOW either:
+ * - fault code has added a page beyond i_size, and we will remove here
+ * - fault code will see updated i_size and not add a page beyond
+ * The parameter 'lm_end' indicates the offset of the end of hole or file
+ * before truncation. For hole punch lm_end == lend.
*/
static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
- loff_t lend)
+ loff_t lend, loff_t lm_end)
{
struct hstate *h = hstate_inode(inode);
struct address_space *mapping = &inode->i_data;
const pgoff_t start = lstart >> huge_page_shift(h);
const pgoff_t end = lend >> huge_page_shift(h);
+ pgoff_t m_end = lm_end >> huge_page_shift(h);
+ pgoff_t m_start, m_index;
struct folio_batch fbatch;
+ struct folio *folio;
pgoff_t next, index;
- int i, freed = 0;
+ unsigned int i;
+ long freed = 0;
+ u32 hash;
bool truncate_op = (lend == LLONG_MAX);

folio_batch_init(&fbatch);
- next = start;
+ next = m_start = start;
while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
for (i = 0; i < folio_batch_count(&fbatch); ++i) {
- struct folio *folio = fbatch.folios[i];
- u32 hash = 0;
+ folio = fbatch.folios[i];

index = folio->index;
- hash = hugetlb_fault_mutex_hash(mapping, index);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
/*
- * If folio is mapped, it was faulted in after being
- * unmapped in caller. Unmap (again) now after taking
- * the fault mutex. The mutex will prevent faults
- * until we finish removing the folio.
- *
- * This race can only happen in the hole punch case.
- * Getting here in a truncate operation is a bug.
+ * Take fault mutex for missing folios before index,
+ * while checking folios that might have been added
+ * due to a race with fault code.
*/
- if (unlikely(folio_mapped(folio))) {
- BUG_ON(truncate_op);
-
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h),
- ZAP_FLAG_DROP_MARKER);
- i_mmap_unlock_write(mapping);
- }
+ freed += fault_lock_inode_indicies(h, inode, mapping,
+ m_start, m_index, truncate_op);

- folio_lock(folio);
/*
- * We must free the huge page and remove from page
- * cache BEFORE removing the * region/reserve map
- * (hugetlb_unreserve_pages). In rare out of memory
- * conditions, removal of the region/reserve map could
- * fail. Correspondingly, the subpool and global
- * reserve usage count can need to be adjusted.
+ * Remove folio that was part of folio_batch.
*/
- VM_BUG_ON(HPageRestoreReserve(&folio->page));
- hugetlb_delete_from_page_cache(&folio->page);
- freed++;
- if (!truncate_op) {
- if (unlikely(hugetlb_unreserve_pages(inode,
- index, index + 1, 1)))
- hugetlb_fix_reserve_counts(inode);
- }
-
- folio_unlock(folio);
+ hash = hugetlb_fault_mutex_hash(mapping, index);
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ if (remove_inode_single_folio(h, inode, mapping, folio,
+ index, truncate_op))
+ freed++;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
}
folio_batch_release(&fbatch);
cond_resched();
}

+ /*
+ * Take fault mutex for missing folios at end of range while checking
+ * for folios that might have been added due to a race with fault code.
+ */
+ freed += fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
+ truncate_op);
+
if (truncate_op)
(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
}
@@ -507,8 +589,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
static void hugetlbfs_evict_inode(struct inode *inode)
{
struct resv_map *resv_map;
+ loff_t prev_size = i_size_read(inode);

- remove_inode_hugepages(inode, 0, LLONG_MAX);
+ remove_inode_hugepages(inode, 0, LLONG_MAX, prev_size);

/*
* Get the resv_map from the address space embedded in the inode.
@@ -528,6 +611,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
pgoff_t pgoff;
struct address_space *mapping = inode->i_mapping;
struct hstate *h = hstate_inode(inode);
+ loff_t prev_size = i_size_read(inode);

BUG_ON(offset & ~huge_page_mask(h));
pgoff = offset >> PAGE_SHIFT;
@@ -538,7 +622,7 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
ZAP_FLAG_DROP_MARKER);
i_mmap_unlock_write(mapping);
- remove_inode_hugepages(inode, offset, LLONG_MAX);
+ remove_inode_hugepages(inode, offset, LLONG_MAX, prev_size);
}

static void hugetlbfs_zero_partial_page(struct hstate *h,
@@ -610,7 +694,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)

/* Remove full pages from the file. */
if (hole_end > hole_start)
- remove_inode_hugepages(inode, hole_start, hole_end);
+ remove_inode_hugepages(inode, hole_start, hole_end, hole_end);

inode_unlock(inode);

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 11c02513588c..a6eb46c64baf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5527,6 +5527,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
spinlock_t *ptl;
unsigned long haddr = address & huge_page_mask(h);
bool new_page, new_pagecache_page = false;
+ bool reserve_alloc = false;

/*
* Currently, we are forced to kill the process in the event the
@@ -5584,9 +5585,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
clear_huge_page(page, address, pages_per_huge_page(h));
__SetPageUptodate(page);
new_page = true;
+ if (HPageRestoreReserve(page))
+ reserve_alloc = true;

if (vma->vm_flags & VM_MAYSHARE) {
- int err = hugetlb_add_to_page_cache(page, mapping, idx);
+ int err;
+
+ err = hugetlb_add_to_page_cache(page, mapping, idx);
if (err) {
restore_reserve_on_error(h, vma, haddr, page);
put_page(page);
@@ -5642,10 +5647,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
}

ptl = huge_pte_lock(h, mm, ptep);
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- if (idx >= size)
- goto backout;
-
ret = 0;
/* If pte changed from under us, retry */
if (!pte_same(huge_ptep_get(ptep), old_pte))
@@ -5689,10 +5690,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
backout:
spin_unlock(ptl);
backout_unlocked:
- unlock_page(page);
- /* restore reserve for newly allocated pages not in page cache */
- if (new_page && !new_pagecache_page)
+ if (new_page && !new_pagecache_page) {
+ /*
+ * If reserve was consumed, make sure flag is set so that it
+ * will be restored in free_huge_page().
+ */
+ if (reserve_alloc)
+ SetHPageRestoreReserve(page);
+
restore_reserve_on_error(h, vma, haddr, page);
+ }
+
+ unlock_page(page);
put_page(page);
goto out;
}
@@ -6006,26 +6015,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
spin_lock(ptl);

- /*
- * Recheck the i_size after holding PT lock to make sure not
- * to leave any page mapped (as page_mapped()) beyond the end
- * of the i_size (remove_inode_hugepages() is strict about
- * enforcing that). If we bail out here, we'll also leave a
- * page in the radix tree in the vm_shared case beyond the end
- * of the i_size, but remove_inode_hugepages() will take care
- * of it as soon as we drop the hugetlb_fault_mutex_table.
- */
- size = i_size_read(mapping->host) >> huge_page_shift(h);
- ret = -EFAULT;
- if (idx >= size)
- goto out_release_unlock;
-
- ret = -EEXIST;
/*
* We allow to overwrite a pte marker: consider when both MISSING|WP
* registered, we firstly wr-protect a none pte which has no page cache
* page backing it, then access the page.
*/
+ ret = -EEXIST;
if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
goto out_release_unlock;

--
2.37.1

2022-08-24 18:02:38

by Mike Kravetz

Subject: [PATCH 5/8] hugetlb: rename vma_shareable() and refactor code

Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
checking a specific address within the vma. Refactor code to check if
an aligned range is shareable as this will be needed in a subsequent
patch.

Signed-off-by: Mike Kravetz <[email protected]>
---
mm/hugetlb.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a6eb46c64baf..758b6844d566 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6648,26 +6648,33 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
return saddr;
}

-static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
+static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
{
- unsigned long base = addr & PUD_MASK;
- unsigned long end = base + PUD_SIZE;
-
/*
* check on proper vm_flags and page table alignment
*/
- if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
+ if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
return true;
return false;
}

+static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ unsigned long start = addr & PUD_MASK;
+ unsigned long end = start + PUD_SIZE;
+
+ return __vma_aligned_range_pmd_shareable(vma, start, end);
+}
+
bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
{
#ifdef CONFIG_USERFAULTFD
if (uffd_disable_huge_pmd_share(vma))
return false;
#endif
- return vma_shareable(vma, addr);
+ return vma_addr_pmd_shareable(vma, addr);
}

/*
--
2.37.1

2022-08-24 18:22:26

by Mike Kravetz

Subject: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

The new hugetlb vma lock (rw semaphore) is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   <------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing, the lock is acquired in read mode before
doing a page table lock and allocation (huge_pte_alloc). The lock is
held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
called.

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page. The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling unmap which will
call huge_pmd_unshare. This is done today in:
- try_to_migrate_one and try_to_unmap_one for page migration and memory
  error handling. In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful. Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch. This routine
  also tries to acquire the vma lock. If it fails, it skips the
  unmapping. However, we can not have file truncation or hole punch
  fail because of contention. After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages. remove_inode_hugepages
  checks for mapped pages and calls hugetlb_unmap_file_folio to unmap them.
  hugetlb_unmap_file_folio is designed to drop locks and reacquire them in
  the correct order to guarantee unmap success.
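
Condensed from the fs/hugetlbfs/inode.c change below, the back out in
hugetlb_unmap_file_folio looks roughly like this (a sketch only):

	if (!hugetlb_vma_trylock_write(vma)) {
		skipped_vm_start = vma->vm_start;
		skipped_mm = vma->vm_mm;
		mmgrab(skipped_mm);	/* keep mm alive across unlock */
		break;			/* leave the interval tree walk */
	}
	...
	i_mmap_unlock_write(mapping);	/* drop to fix lock ordering */
	mmap_read_lock(skipped_mm);
	vma = find_vma(skipped_mm, skipped_vm_start);
	/* revalidate vma, then take locks in the required order */
	hugetlb_vma_lock_write(vma);
	i_mmap_lock_write(mapping);
	/* unmap, then retry the walk in case other vmas were skipped */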

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
mm/memory.c | 2 +
mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
mm/userfaultfd.c | 9 +++-
5 files changed, 214 insertions(+), 45 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index b93d131b0cb5..52d9b390389b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
struct folio *folio, pgoff_t index)
{
struct rb_root_cached *root = &mapping->i_mmap;
+ unsigned long skipped_vm_start;
+ struct mm_struct *skipped_mm;
struct page *page = &folio->page;
struct vm_area_struct *vma;
unsigned long v_start;
@@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
end = ((index + 1) * pages_per_huge_page(h));

i_mmap_lock_write(mapping);
+retry:
+ skipped_mm = NULL;

vma_interval_tree_foreach(vma, root, start, end - 1) {
v_start = vma_offset_start(vma, start);
@@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
continue;

+ if (!hugetlb_vma_trylock_write(vma)) {
+ /*
+ * If we can not get vma lock, we need to drop
+ * i_mmap_sema and take locks in order.
+ */
+ skipped_vm_start = vma->vm_start;
+ skipped_mm = vma->vm_mm;
+ /* grab mm-struct as we will be dropping i_mmap_sema */
+ mmgrab(skipped_mm);
+ break;
+ }
+
unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
NULL, ZAP_FLAG_DROP_MARKER);
+ hugetlb_vma_unlock_write(vma);
}

i_mmap_unlock_write(mapping);
+
+ if (skipped_mm) {
+ mmap_read_lock(skipped_mm);
+ vma = find_vma(skipped_mm, skipped_vm_start);
+ if (!vma || !is_vm_hugetlb_page(vma) ||
+ vma->vm_file->f_mapping != mapping ||
+ vma->vm_start != skipped_vm_start) {
+ mmap_read_unlock(skipped_mm);
+ mmdrop(skipped_mm);
+ goto retry;
+ }
+
+ hugetlb_vma_lock_write(vma);
+ i_mmap_lock_write(mapping);
+ mmap_read_unlock(skipped_mm);
+ mmdrop(skipped_mm);
+
+ v_start = vma_offset_start(vma, start);
+ v_end = vma_offset_end(vma, end);
+ unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
+ NULL, ZAP_FLAG_DROP_MARKER);
+ hugetlb_vma_unlock_write(vma);
+
+ goto retry;
+ }
}

static void
@@ -474,11 +516,15 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
unsigned long v_start;
unsigned long v_end;

+ if (!hugetlb_vma_trylock_write(vma))
+ continue;
+
v_start = vma_offset_start(vma, start);
v_end = vma_offset_end(vma, end);

unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
NULL, zap_flags);
+ hugetlb_vma_unlock_write(vma);
}
}

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6fb0bff2c7ee..5912c2b97ddf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4801,6 +4801,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmu_notifier_invalidate_range_start(&range);
mmap_assert_write_locked(src);
raw_write_seqcount_begin(&src->write_protect_seq);
+ } else {
+ /*
+ * For shared mappings the vma lock must be held before
+ * calling huge_pte_offset in the src vma. Otherwise, the
+ * returned ptep could go away if part of a shared pmd and
+ * another thread calls huge_pmd_unshare.
+ */
+ hugetlb_vma_lock_read(src_vma);
}

last_addr_mask = hugetlb_mask_last_page(h);
@@ -4948,6 +4956,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow) {
raw_write_seqcount_end(&src->write_protect_seq);
mmu_notifier_invalidate_range_end(&range);
+ } else {
+ hugetlb_vma_unlock_read(src_vma);
}

return ret;
@@ -5006,6 +5016,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);
last_addr_mask = hugetlb_mask_last_page(h);
/* Prevent race with file truncation */
+ hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
src_pte = huge_pte_offset(mm, old_addr, sz);
@@ -5037,6 +5048,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
flush_tlb_range(vma, old_end - len, old_end);
mmu_notifier_invalidate_range_end(&range);
i_mmap_unlock_write(mapping);
+ hugetlb_vma_unlock_write(vma);

return len + old_addr - old_end;
}
@@ -5356,9 +5368,30 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* may get SIGKILLed if it later faults.
*/
if (outside_reserve) {
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ pgoff_t idx;
+ u32 hash;
+
put_page(old_page);
BUG_ON(huge_pte_none(pte));
+ /*
+ * Drop hugetlb_fault_mutex and vma_lock before
+ * unmapping. unmapping needs to hold vma_lock
+ * in write mode. Dropping vma_lock in read mode
+ * here is OK as COW mappings do not interact with
+ * PMD sharing.
+ *
+ * Reacquire both after unmap operation.
+ */
+ idx = vma_hugecache_offset(h, vma, haddr);
+ hash = hugetlb_fault_mutex_hash(mapping, idx);
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
unmap_ref_private(mm, vma, old_page, haddr);
+
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ hugetlb_vma_lock_read(vma);
spin_lock(ptl);
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep &&
@@ -5520,14 +5553,16 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
};

/*
- * hugetlb_fault_mutex and i_mmap_rwsem must be
+ * vma_lock and hugetlb_fault_mutex must be
* dropped before handling userfault. Reacquire
* after handling fault to make calling code simpler.
*/
+ hugetlb_vma_unlock_read(vma);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
ret = handle_userfault(&vmf, reason);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ hugetlb_vma_lock_read(vma);

return ret;
}
@@ -5767,6 +5802,11 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
+ /*
+ * Since we hold no locks, ptep could be stale. That is
+ * OK as we are only making decisions based on content and
+ * not actually modifying content here.
+ */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, ptep);
@@ -5774,23 +5814,35 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
- } else {
- ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
- if (!ptep)
- return VM_FAULT_OOM;
}

- mapping = vma->vm_file->f_mapping;
- idx = vma_hugecache_offset(h, vma, haddr);
-
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
+ mapping = vma->vm_file->f_mapping;
+ idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_lock(&hugetlb_fault_mutex_table[hash]);

+ /*
+ * Acquire vma lock before calling huge_pte_alloc and hold
+ * until finished with ptep. This prevents huge_pmd_unshare from
+ * being called elsewhere and making the ptep no longer valid.
+ *
+ * ptep could have already be assigned via huge_pte_offset. That
+ * is OK, as huge_pte_alloc will return the same value unless
+ * something has changed.
+ */
+ hugetlb_vma_lock_read(vma);
+ ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+ if (!ptep) {
+ hugetlb_vma_unlock_read(vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ return VM_FAULT_OOM;
+ }
+
entry = huge_ptep_get(ptep);
/* PTE markers should be handled the same way as none pte */
if (huge_pte_none_mostly(entry)) {
@@ -5851,6 +5903,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(pagecache_page);
put_page(pagecache_page);
}
+ hugetlb_vma_unlock_read(vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
return handle_userfault(&vmf, VM_UFFD_WP);
}
@@ -5894,6 +5947,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(pagecache_page);
}
out_mutex:
+ hugetlb_vma_unlock_read(vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
/*
* Generally it's safe to hold refcount during waiting page lock. But
@@ -6343,8 +6397,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_cache_range(vma, range.start, range.end);

mmu_notifier_invalidate_range_start(&range);
- last_addr_mask = hugetlb_mask_last_page(h);
+ hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
+ last_addr_mask = hugetlb_mask_last_page(h);
for (; address < end; address += psize) {
spinlock_t *ptl;
ptep = huge_pte_offset(mm, address, psize);
@@ -6443,6 +6498,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
* See Documentation/mm/mmu_notifier.rst
*/
i_mmap_unlock_write(vma->vm_file->f_mapping);
+ hugetlb_vma_unlock_write(vma);
mmu_notifier_invalidate_range_end(&range);

return pages << h->order;
@@ -6909,6 +6965,7 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
pud_t *pud = pud_offset(p4d, addr);

i_mmap_assert_write_locked(vma->vm_file->f_mapping);
+ hugetlb_vma_assert_locked(vma);
BUG_ON(page_count(virt_to_page(ptep)) == 0);
if (page_count(virt_to_page(ptep)) == 1)
return 0;
@@ -6920,6 +6977,31 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
}

#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+}
+
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+}
+
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+ return 1;
+}
+
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+}
+
static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
{
}
@@ -7298,6 +7380,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
start, end);
mmu_notifier_invalidate_range_start(&range);
+ hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (address = start; address < end; address += PUD_SIZE) {
ptep = huge_pte_offset(mm, address, sz);
@@ -7309,6 +7392,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
}
flush_hugetlb_tlb_range(vma, start, end);
i_mmap_unlock_write(vma->vm_file->f_mapping);
+ hugetlb_vma_unlock_write(vma);
/*
* No need to call mmu_notifier_invalidate_range(), see
* Documentation/mm/mmu_notifier.rst.
diff --git a/mm/memory.c b/mm/memory.c
index 2f3cc57a5a11..55166045ab55 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1675,10 +1675,12 @@ static void unmap_single_vma(struct mmu_gather *tlb,
if (vma->vm_file) {
zap_flags_t zap_flags = details ?
details->zap_flags : 0;
+ hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
__unmap_hugepage_range_final(tlb, vma, start, end,
NULL, zap_flags);
i_mmap_unlock_write(vma->vm_file->f_mapping);
+ hugetlb_vma_unlock_write(vma);
}
} else
unmap_page_range(tlb, vma, start, end, details);
diff --git a/mm/rmap.c b/mm/rmap.c
index 55209e029847..60d7db60428e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1558,24 +1558,39 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* To call huge_pmd_unshare, i_mmap_rwsem must be
* held in write mode. Caller needs to explicitly
* do this outside rmap routines.
+ *
+ * We also must hold hugetlb vma_lock in write mode.
+ * Lock order dictates acquiring vma_lock BEFORE
+ * i_mmap_rwsem. We can only try lock here and fail
+ * if unsuccessful.
*/
- VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
- if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
- flush_tlb_range(vma, range.start, range.end);
- mmu_notifier_invalidate_range(mm, range.start,
- range.end);
-
- /*
- * The ref count of the PMD page was dropped
- * which is part of the way map counting
- * is done for shared PMDs. Return 'true'
- * here. When there is no other sharing,
- * huge_pmd_unshare returns false and we will
- * unmap the actual page and drop map count
- * to zero.
- */
- page_vma_mapped_walk_done(&pvmw);
- break;
+ if (!anon) {
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (!hugetlb_vma_trylock_write(vma)) {
+ page_vma_mapped_walk_done(&pvmw);
+ ret = false;
+ break;
+ }
+ if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+ hugetlb_vma_unlock_write(vma);
+ flush_tlb_range(vma,
+ range.start, range.end);
+ mmu_notifier_invalidate_range(mm,
+ range.start, range.end);
+ /*
+ * The ref count of the PMD page was
+ * dropped which is part of the way map
+ * counting is done for shared PMDs.
+ * Return 'true' here. When there is
+ * no other sharing, huge_pmd_unshare
+ * returns false and we will unmap the
+ * actual page and drop map count
+ * to zero.
+ */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+ hugetlb_vma_unlock_write(vma);
}
pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
} else {
@@ -1934,26 +1949,41 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
* To call huge_pmd_unshare, i_mmap_rwsem must be
* held in write mode. Caller needs to explicitly
* do this outside rmap routines.
+ *
+ * We also must hold hugetlb vma_lock in write mode.
+ * Lock order dictates acquiring vma_lock BEFORE
+ * i_mmap_rwsem. We can only try lock here and
+ * fail if unsuccessful.
*/
- VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED));
- if (!anon && huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
- flush_tlb_range(vma, range.start, range.end);
- mmu_notifier_invalidate_range(mm, range.start,
- range.end);
-
- /*
- * The ref count of the PMD page was dropped
- * which is part of the way map counting
- * is done for shared PMDs. Return 'true'
- * here. When there is no other sharing,
- * huge_pmd_unshare returns false and we will
- * unmap the actual page and drop map count
- * to zero.
- */
- page_vma_mapped_walk_done(&pvmw);
- break;
+ if (!anon) {
+ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
+ if (!hugetlb_vma_trylock_write(vma)) {
+ page_vma_mapped_walk_done(&pvmw);
+ ret = false;
+ break;
+ }
+ if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
+ hugetlb_vma_unlock_write(vma);
+ flush_tlb_range(vma,
+ range.start, range.end);
+ mmu_notifier_invalidate_range(mm,
+ range.start, range.end);
+
+ /*
+ * The ref count of the PMD page was
+ * dropped which is part of the way map
+ * counting is done for shared PMDs.
+ * Return 'true' here. When there is
+ * no other sharing, huge_pmd_unshare
+ * returns false and we will unmap the
+ * actual page and drop map count
+ * to zero.
+ */
+ page_vma_mapped_walk_done(&pvmw);
+ break;
+ }
+ hugetlb_vma_unlock_write(vma);
}
-
/* Nuke the hugetlb page table entry */
pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
} else {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7707f2664adb..2b0502710ea1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -377,16 +377,21 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
BUG_ON(dst_addr >= dst_start + len);

/*
- * Serialize via hugetlb_fault_mutex.
+ * Serialize via vma_lock and hugetlb_fault_mutex.
+ * vma_lock ensures the dst_pte remains valid even
+ * in the case of shared pmds. fault mutex prevents
+ * races with other faulting threads.
*/
idx = linear_page_index(dst_vma, dst_addr);
mapping = dst_vma->vm_file->f_mapping;
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ hugetlb_vma_lock_read(dst_vma);

err = -ENOMEM;
dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
if (!dst_pte) {
+ hugetlb_vma_unlock_read(dst_vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
goto out_unlock;
}
@@ -394,6 +399,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
if (mode != MCOPY_ATOMIC_CONTINUE &&
!huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
err = -EEXIST;
+ hugetlb_vma_unlock_read(dst_vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
goto out_unlock;
}
@@ -402,6 +408,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_addr, src_addr, mode, &page,
wp_copy);

+ hugetlb_vma_unlock_read(dst_vma);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);

cond_resched();
--
2.37.1

2022-08-24 18:25:58

by Mike Kravetz

Subject: [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio. This is refactored code from hugetlb_vmdelete_list. It is
modified to do locking within the routine itself and check whether the
page is mapped within a specific vma before unmapping.

This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
1 file changed, 94 insertions(+), 29 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index e83fd31671b3..b93d131b0cb5 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
delete_from_page_cache(page);
}

+/*
+ * Called with i_mmap_rwsem held for inode based vma maps. This makes
+ * sure vma (and vm_mm) will not go away. We also hold the hugetlb fault
+ * mutex for the page in the mapping. So, we can not race with page being
+ * faulted into the vma.
+ */
+static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
+ unsigned long addr, struct page *page)
+{
+ pte_t *ptep, pte;
+
+ ptep = huge_pte_offset(vma->vm_mm, addr,
+ huge_page_size(hstate_vma(vma)));
+
+ if (!ptep)
+ return false;
+
+ pte = huge_ptep_get(ptep);
+ if (huge_pte_none(pte) || !pte_present(pte))
+ return false;
+
+ if (pte_page(pte) == page)
+ return true;
+
+ return false;
+}
+
+/*
+ * Can vma_offset_start/vma_offset_end overflow on 32-bit arches?
+ * No, because the interval tree returns us only those vmas
+ * which overlap the truncated area starting at pgoff,
+ * and no vma on a 32-bit arch can span beyond the 4GB.
+ */
+static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
+{
+ if (vma->vm_pgoff < start)
+ return (start - vma->vm_pgoff) << PAGE_SHIFT;
+ else
+ return 0;
+}
+
+static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
+{
+ unsigned long t_end;
+
+ if (!end)
+ return vma->vm_end;
+
+ t_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
+ if (t_end > vma->vm_end)
+ t_end = vma->vm_end;
+ return t_end;
+}
+
+/*
+ * Called with hugetlb fault mutex held. Therefore, no more mappings to
+ * this folio can be created while executing the routine.
+ */
+static void hugetlb_unmap_file_folio(struct hstate *h,
+ struct address_space *mapping,
+ struct folio *folio, pgoff_t index)
+{
+ struct rb_root_cached *root = &mapping->i_mmap;
+ struct page *page = &folio->page;
+ struct vm_area_struct *vma;
+ unsigned long v_start;
+ unsigned long v_end;
+ pgoff_t start, end;
+
+ start = index * pages_per_huge_page(h);
+ end = ((index + 1) * pages_per_huge_page(h));
+
+ i_mmap_lock_write(mapping);
+
+ vma_interval_tree_foreach(vma, root, start, end - 1) {
+ v_start = vma_offset_start(vma, start);
+ v_end = vma_offset_end(vma, end);
+
+ if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
+ continue;
+
+ unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
+ NULL, ZAP_FLAG_DROP_MARKER);
+ }
+
+ i_mmap_unlock_write(mapping);
+}
+
static void
hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
zap_flags_t zap_flags)
@@ -383,30 +471,13 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
* an inclusive "last".
*/
vma_interval_tree_foreach(vma, root, start, end ? end - 1 : ULONG_MAX) {
- unsigned long v_offset;
+ unsigned long v_start;
unsigned long v_end;

- /*
- * Can the expression below overflow on 32-bit arches?
- * No, because the interval tree returns us only those vmas
- * which overlap the truncated area starting at pgoff,
- * and no vma on a 32-bit arch can span beyond the 4GB.
- */
- if (vma->vm_pgoff < start)
- v_offset = (start - vma->vm_pgoff) << PAGE_SHIFT;
- else
- v_offset = 0;
-
- if (!end)
- v_end = vma->vm_end;
- else {
- v_end = ((end - vma->vm_pgoff) << PAGE_SHIFT)
- + vma->vm_start;
- if (v_end > vma->vm_end)
- v_end = vma->vm_end;
- }
+ v_start = vma_offset_start(vma, start);
+ v_end = vma_offset_end(vma, end);

- unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
+ unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
NULL, zap_flags);
}
}
@@ -428,14 +499,8 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
* the fault mutex. The mutex will prevent faults
* until we finish removing the folio.
*/
- if (unlikely(folio_mapped(folio))) {
- i_mmap_lock_write(mapping);
- hugetlb_vmdelete_list(&mapping->i_mmap,
- index * pages_per_huge_page(h),
- (index + 1) * pages_per_huge_page(h),
- ZAP_FLAG_DROP_MARKER);
- i_mmap_unlock_write(mapping);
- }
+ if (unlikely(folio_mapped(folio)))
+ hugetlb_unmap_file_folio(h, mapping, folio, index);

folio_lock(folio);
/*
--
2.37.1

2022-08-24 18:35:18

by Mike Kravetz

Subject: [PATCH 6/8] hugetlb: add vma based lock for pmd sharing

Allocate a rw semaphore and hang it off vm_private_data for
synchronization use by vmas that could be involved in pmd sharing. Only
add infrastructure for the new lock here. Actual use will be added in
a subsequent patch.
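
Condensed from the hugetlb_vma_lock_alloc/free helpers added below, the
setup amounts to (a sketch, with error handling trimmed):

	/* only for sharable vmas whose size/alignment allows pmd sharing */
	struct rw_semaphore *vma_sema = kmalloc(sizeof(*vma_sema), GFP_KERNEL);

	if (vma_sema) {
		init_rwsem(vma_sema);
		vma->vm_private_data = vma_sema;  /* freed in vm_op_close */
	}
	/* allocation failure simply means the vma will not share pmds */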

Signed-off-by: Mike Kravetz <[email protected]>
---
include/linux/hugetlb.h | 36 ++++++++-
kernel/fork.c | 6 +-
mm/hugetlb.c | 170 ++++++++++++++++++++++++++++++++++++----
mm/rmap.c | 8 +-
4 files changed, 197 insertions(+), 23 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index acace1a25226..852f911d676e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -126,7 +126,7 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
long min_hpages);
void hugepage_put_subpool(struct hugepage_subpool *spool);

-void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
+void hugetlb_dup_vma_private(struct vm_area_struct *vma);
void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
int hugetlb_overcommit_handler(struct ctl_table *, int, void *, size_t *,
@@ -214,6 +214,13 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
struct page *follow_huge_pgd(struct mm_struct *mm, unsigned long address,
pgd_t *pgd, int flags);

+void hugetlb_vma_lock_read(struct vm_area_struct *vma);
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma);
+void hugetlb_vma_lock_write(struct vm_area_struct *vma);
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
+
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pud);
unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
@@ -225,7 +232,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);

#else /* !CONFIG_HUGETLB_PAGE */

-static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
+static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma)
{
}

@@ -336,6 +343,31 @@ static inline int prepare_hugepage_range(struct file *file,
return -EINVAL;
}

+static inline void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+}
+
+static inline void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+}
+
+static inline int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+ return 1;
+}
+
+static inline void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+}
+
static inline int pmd_huge(pmd_t pmd)
{
return 0;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9470220e8f43..421c143286d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -675,12 +675,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
}

/*
- * Clear hugetlb-related page reserves for children. This only
- * affects MAP_PRIVATE mappings. Faults generated by the child
- * are not guaranteed to succeed, even if read-only
+ * Copy/update hugetlb private vma information.
*/
if (is_vm_hugetlb_page(tmp))
- reset_vma_resv_huge_pages(tmp);
+ hugetlb_dup_vma_private(tmp);

/*
* Link in the new vma and copy the page table entries.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 758b6844d566..6fb0bff2c7ee 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -91,6 +91,8 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;

/* Forward declaration */
static int hugetlb_acct_memory(struct hstate *h, long delta);
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);

static inline bool subpool_is_free(struct hugepage_subpool *spool)
{
@@ -1008,12 +1010,25 @@ static int is_vma_resv_set(struct vm_area_struct *vma, unsigned long flag)
return (get_vma_private_data(vma) & flag) != 0;
}

-/* Reset counters to 0 and clear all HPAGE_RESV_* flags */
-void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
+void hugetlb_dup_vma_private(struct vm_area_struct *vma)
{
VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma);
+ /*
+ * Clear vm_private_data
+ * - For MAP_PRIVATE mappings, this is the reserve map which does
+ * not apply to children. Faults generated by the children are
+ * not guaranteed to succeed, even if read-only.
+ * - For shared mappings this is a per-vma semaphore that may be
+ * allocated below.
+ */
+ vma->vm_private_data = (void *)0;
if (!(vma->vm_flags & VM_MAYSHARE))
- vma->vm_private_data = (void *)0;
+ return;
+
+ /*
+ * Allocate semaphore if pmd sharing is possible.
+ */
+ hugetlb_vma_lock_alloc(vma);
}

/*
@@ -1044,7 +1059,7 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
kref_put(&reservations->refs, resv_map_release);
}

- reset_vma_resv_huge_pages(vma);
+ hugetlb_dup_vma_private(vma);
}

/* Returns true if the VMA has associated reserve pages */
@@ -4623,16 +4638,21 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma)
resv_map_dup_hugetlb_cgroup_uncharge_info(resv);
kref_get(&resv->refs);
}
+
+ hugetlb_vma_lock_alloc(vma);
}

static void hugetlb_vm_op_close(struct vm_area_struct *vma)
{
struct hstate *h = hstate_vma(vma);
- struct resv_map *resv = vma_resv_map(vma);
+ struct resv_map *resv;
struct hugepage_subpool *spool = subpool_vma(vma);
unsigned long reserve, start, end;
long gbl_reserve;

+ hugetlb_vma_lock_free(vma);
+
+ resv = vma_resv_map(vma);
if (!resv || !is_vma_resv_set(vma, HPAGE_RESV_OWNER))
return;

@@ -6447,6 +6467,11 @@ bool hugetlb_reserve_pages(struct inode *inode,
return false;
}

+ /*
+ * vma specific semaphore used for pmd sharing synchronization
+ */
+ hugetlb_vma_lock_alloc(vma);
+
/*
* Only apply hugepage reservation if asked. At fault time, an
* attempt will be made for VM_NORESERVE to allocate a page
@@ -6470,12 +6495,11 @@ bool hugetlb_reserve_pages(struct inode *inode,
resv_map = inode_resv_map(inode);

chg = region_chg(resv_map, from, to, &regions_needed);
-
} else {
/* Private mapping. */
resv_map = resv_map_alloc();
if (!resv_map)
- return false;
+ goto out_err;

chg = to - from;

@@ -6570,6 +6594,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
chg * pages_per_huge_page(h), h_cg);
out_err:
+ hugetlb_vma_lock_free(vma);
if (!vma || vma->vm_flags & VM_MAYSHARE)
/* Only call region_abort if the region_chg succeeded but the
* region_add failed or didn't run.
@@ -6649,14 +6674,34 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
}

static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ bool check_vma_lock)
{
+#ifdef CONFIG_USERFAULTFD
+ if (uffd_disable_huge_pmd_share(vma))
+ return false;
+#endif
/*
* check on proper vm_flags and page table alignment
*/
- if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
- return true;
- return false;
+ if (!(vma->vm_flags & VM_MAYSHARE))
+ return false;
+ if (check_vma_lock && !vma->vm_private_data)
+ return false;
+ if (!range_in_vma(vma, start, end))
+ return false;
+ return true;
+}
+
+static bool vma_pmd_shareable(struct vm_area_struct *vma)
+{
+ unsigned long start = ALIGN(vma->vm_start, PUD_SIZE),
+ end = ALIGN_DOWN(vma->vm_end, PUD_SIZE);
+
+ if (start >= end)
+ return false;
+
+ return __vma_aligned_range_pmd_shareable(vma, start, end, false);
}

static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
@@ -6665,15 +6710,11 @@ static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
unsigned long start = addr & PUD_MASK;
unsigned long end = start + PUD_SIZE;

- return __vma_aligned_range_pmd_shareable(vma, start, end);
+ return __vma_aligned_range_pmd_shareable(vma, start, end, true);
}

bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
{
-#ifdef CONFIG_USERFAULTFD
- if (uffd_disable_huge_pmd_share(vma))
- return false;
-#endif
return vma_addr_pmd_shareable(vma, addr);
}

@@ -6704,6 +6745,95 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
*end = ALIGN(*end, PUD_SIZE);
}

+static bool __vma_shareable_flags_pmd(struct vm_area_struct *vma)
+{
+ return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
+ vma->vm_private_data;
+}
+
+void hugetlb_vma_lock_read(struct vm_area_struct *vma)
+{
+ if (__vma_shareable_flags_pmd(vma))
+ down_read((struct rw_semaphore *)vma->vm_private_data);
+}
+
+void hugetlb_vma_unlock_read(struct vm_area_struct *vma)
+{
+ if (__vma_shareable_flags_pmd(vma))
+ up_read((struct rw_semaphore *)vma->vm_private_data);
+}
+
+void hugetlb_vma_lock_write(struct vm_area_struct *vma)
+{
+ if (__vma_shareable_flags_pmd(vma))
+ down_write((struct rw_semaphore *)vma->vm_private_data);
+}
+
+void hugetlb_vma_unlock_write(struct vm_area_struct *vma)
+{
+ if (__vma_shareable_flags_pmd(vma))
+ up_write((struct rw_semaphore *)vma->vm_private_data);
+}
+
+int hugetlb_vma_trylock_write(struct vm_area_struct *vma)
+{
+ if (!__vma_shareable_flags_pmd(vma))
+ return 1;
+
+ return down_write_trylock((struct rw_semaphore *)vma->vm_private_data);
+}
+
+void hugetlb_vma_assert_locked(struct vm_area_struct *vma)
+{
+ if (__vma_shareable_flags_pmd(vma))
+ lockdep_assert_held((struct rw_semaphore *)
+ vma->vm_private_data);
+}
+
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+{
+ /*
+ * Only present in sharable vmas. See comment in
+ * __unmap_hugepage_range_final about the neeed to check both
+ * VM_SHARED and VM_MAYSHARE in free path
+ */
+ if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
+ return;
+
+ if (vma->vm_private_data) {
+ kfree(vma->vm_private_data);
+ vma->vm_private_data = NULL;
+ }
+}
+
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+ struct rw_semaphore *vma_sema;
+
+ /* Only establish in (flags) sharable vmas */
+ if (!vma || !(vma->vm_flags & VM_MAYSHARE))
+ return;
+
+ /* Should never get here with non-NULL vm_private_data */
+ if (vma->vm_private_data)
+ return;
+
+ /* Check size/alignment for pmd sharing possible */
+ if (!vma_pmd_shareable(vma))
+ return;
+
+ vma_sema = kmalloc(sizeof(*vma_sema), GFP_KERNEL);
+ if (!vma_sema)
+ /*
+ * If we can not allocate semaphore, then vma can not
+ * participate in pmd sharing.
+ */
+ return;
+
+ init_rwsem(vma_sema);
+ vma->vm_private_data = vma_sema;
+}
+
/*
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
@@ -6790,6 +6920,14 @@ int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
}

#else /* !CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
+static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
+{
+}
+
+static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
+{
+}
+
pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pud_t *pud)
{
diff --git a/mm/rmap.c b/mm/rmap.c
index ad9c97c6445c..55209e029847 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -24,7 +24,7 @@
* mm->mmap_lock
* mapping->invalidate_lock (in filemap_fault)
* page->flags PG_locked (lock_page)
- * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
+ * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
* mapping->i_mmap_rwsem
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
@@ -44,6 +44,12 @@
* anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
* ->tasklist_lock
* pte map lock
+ *
+ * hugetlbfs PageHuge() take locks in this order:
+ * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
+ * vma_lock (hugetlb specific lock for pmd_sharing)
+ * mapping->i_mmap_rwsem (also used for hugetlb pmd sharing)
+ * page->flags PG_locked (lock_page)
*/

#include <linux/mm.h>
--
2.37.1
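
The new lock primitives above boil down to one rule: fault processing
holds the per-vma rw semaphore (hung off vm_private_data) in read mode,
and any path that may call huge_pmd_unshare() holds it in write mode,
with i_mmap_rwsem still taken inside huge_pmd_share() when sharing is
set up. The following is a minimal, runnable userspace sketch of that
discipline using POSIX rwlocks; fake_vma, fault_path and unshare_path
are invented names for illustration only and do not appear in the
series.

/*
 * Userspace analogy (not kernel code) of the per-vma lock discipline.
 * Build with: cc -pthread sketch.c
 */
#include <pthread.h>
#include <stdio.h>

struct fake_vma {
	pthread_rwlock_t *vma_lock;	/* stands in for vm_private_data */
};

static void fault_path(struct fake_vma *vma)
{
	/* read mode: concurrent faults on the same vma may proceed */
	pthread_rwlock_rdlock(vma->vma_lock);
	printf("fault: vma lock held for read\n");
	/* ... walk or instantiate page table entries here ... */
	pthread_rwlock_unlock(vma->vma_lock);
}

static void unshare_path(struct fake_vma *vma)
{
	/* write mode: excludes every faulting thread before unsharing */
	pthread_rwlock_wrlock(vma->vma_lock);
	printf("unmap: vma lock held for write\n");
	/* ... the huge_pmd_unshare() equivalent would run here ... */
	pthread_rwlock_unlock(vma->vma_lock);
}

int main(void)
{
	pthread_rwlock_t lock;
	struct fake_vma vma = { .vma_lock = &lock };

	pthread_rwlock_init(&lock, NULL);
	fault_path(&vma);
	unshare_path(&vma);
	pthread_rwlock_destroy(&lock);
	return 0;
}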

2022-08-24 18:41:29

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH 2/8] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing. However, this has been shown to cause
performance/scaling issues. Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap as this is required if huge_pmd_unshare is called.

NOTE: Reverting this code does expose the following race condition.

Faulting thread Unsharing thread
... ...
ptep = huge_pte_offset()
or
ptep = huge_pte_alloc()
...
i_mmap_lock_write
lock page table
ptep invalid <------------------------ huge_pmd_unshare()
Could be in a previously unlock_page_table
sharing process or worse i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

It is unknown if the above race was ever experienced by a user. It was
discovered via code inspection when initially addressed.

In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 2 --
mm/hugetlb.c | 77 +++++++-------------------------------------
mm/rmap.c | 8 +----
mm/userfaultfd.c | 11 ++-----
4 files changed, 15 insertions(+), 83 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a32031e751d1..dfb735a91bbb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -467,9 +467,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
if (unlikely(folio_mapped(folio))) {
BUG_ON(truncate_op);

- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
i_mmap_lock_write(mapping);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vmdelete_list(&mapping->i_mmap,
index * pages_per_huge_page(h),
(index + 1) * pages_per_huge_page(h),
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 70bc7f867bc0..95c6f9a5bbf0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4770,7 +4770,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
struct hstate *h = hstate_vma(src_vma);
unsigned long sz = huge_page_size(h);
unsigned long npages = pages_per_huge_page(h);
- struct address_space *mapping = src_vma->vm_file->f_mapping;
struct mmu_notifier_range range;
unsigned long last_addr_mask;
int ret = 0;
@@ -4782,14 +4781,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
mmu_notifier_invalidate_range_start(&range);
mmap_assert_write_locked(src);
raw_write_seqcount_begin(&src->write_protect_seq);
- } else {
- /*
- * For shared mappings i_mmap_rwsem must be held to call
- * huge_pte_alloc, otherwise the returned ptep could go
- * away if part of a shared pmd and another thread calls
- * huge_pmd_unshare.
- */
- i_mmap_lock_read(mapping);
}

last_addr_mask = hugetlb_mask_last_page(h);
@@ -4937,8 +4928,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
if (cow) {
raw_write_seqcount_end(&src->write_protect_seq);
mmu_notifier_invalidate_range_end(&range);
- } else {
- i_mmap_unlock_read(mapping);
}

return ret;
@@ -5347,30 +5336,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* may get SIGKILLed if it later faults.
*/
if (outside_reserve) {
- struct address_space *mapping = vma->vm_file->f_mapping;
- pgoff_t idx;
- u32 hash;
-
put_page(old_page);
BUG_ON(huge_pte_none(pte));
- /*
- * Drop hugetlb_fault_mutex and i_mmap_rwsem before
- * unmapping. unmapping needs to hold i_mmap_rwsem
- * in write mode. Dropping i_mmap_rwsem in read mode
- * here is OK as COW mappings do not interact with
- * PMD sharing.
- *
- * Reacquire both after unmap operation.
- */
- idx = vma_hugecache_offset(h, vma, haddr);
- hash = hugetlb_fault_mutex_hash(mapping, idx);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
-
unmap_ref_private(mm, vma, old_page, haddr);
-
- i_mmap_lock_read(mapping);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
spin_lock(ptl);
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep &&
@@ -5538,9 +5506,7 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
*/
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
ret = handle_userfault(&vmf, reason);
- i_mmap_lock_read(mapping);
mutex_lock(&hugetlb_fault_mutex_table[hash]);

return ret;
@@ -5772,11 +5738,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,

ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (ptep) {
- /*
- * Since we hold no locks, ptep could be stale. That is
- * OK as we are only making decisions based on content and
- * not actually modifying content here.
- */
entry = huge_ptep_get(ptep);
if (unlikely(is_hugetlb_entry_migration(entry))) {
migration_entry_wait_huge(vma, ptep);
@@ -5784,31 +5745,20 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
return VM_FAULT_HWPOISON_LARGE |
VM_FAULT_SET_HINDEX(hstate_index(h));
+ } else {
+ ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
+ if (!ptep)
+ return VM_FAULT_OOM;
}

- /*
- * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
- * until finished with ptep. This prevents huge_pmd_unshare from
- * being called elsewhere and making the ptep no longer valid.
- *
- * ptep could have already be assigned via huge_pte_offset. That
- * is OK, as huge_pte_alloc will return the same value unless
- * something has changed.
- */
mapping = vma->vm_file->f_mapping;
- i_mmap_lock_read(mapping);
- ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
- if (!ptep) {
- i_mmap_unlock_read(mapping);
- return VM_FAULT_OOM;
- }
+ idx = vma_hugecache_offset(h, vma, haddr);

/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
* the same page in the page cache.
*/
- idx = vma_hugecache_offset(h, vma, haddr);
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_lock(&hugetlb_fault_mutex_table[hash]);

@@ -5873,7 +5823,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(pagecache_page);
}
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
return handle_userfault(&vmf, VM_UFFD_WP);
}

@@ -5917,7 +5866,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
out_mutex:
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
/*
* Generally it's safe to hold refcount during waiting page lock. But
* here we just wait to defer the next page fault to avoid busy loop and
@@ -6758,12 +6706,10 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
* Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
- * code much cleaner.
- *
- * This routine must be called with i_mmap_rwsem held in at least read mode if
- * sharing is possible. For hugetlbfs, this prevents removal of any page
- * table entries associated with the address space. This is important as we
- * are setting up sharing based on existing page table entries (mappings).
+ * code much cleaner. pmd allocation is essential for the shared case because
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
+ * racing tasks could either miss the sharing (see huge_pte_offset) or select a
+ * bad pmd for sharing.
*/
pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pud_t *pud)
@@ -6777,7 +6723,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t *pte;
spinlock_t *ptl;

- i_mmap_assert_locked(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -6807,6 +6753,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
+ i_mmap_unlock_read(mapping);
return pte;
}

@@ -6817,7 +6764,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
* indicated by page_count > 1, unmap is achieved by clearing pud and
* decrementing the ref count. If count == 1, the pte page is not shared.
*
- * Called with page table lock held and i_mmap_rwsem held in write mode.
+ * Called with page table lock held.
*
* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user
diff --git a/mm/rmap.c b/mm/rmap.c
index 7dc6d77ae865..ad9c97c6445c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,10 +23,9 @@
* inode->i_rwsem (while writing or truncating, not reading or faulting)
* mm->mmap_lock
* mapping->invalidate_lock (in filemap_fault)
- * page->flags PG_locked (lock_page) * (see hugetlbfs below)
+ * page->flags PG_locked (lock_page)
* hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
* mapping->i_mmap_rwsem
- * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* swap_lock (in swap_duplicate, swap_info_get)
@@ -45,11 +44,6 @@
* anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
* ->tasklist_lock
* pte map lock
- *
- * * hugetlbfs PageHuge() pages take locks in this order:
- * mapping->i_mmap_rwsem
- * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
- * page->flags PG_locked (lock_page)
*/

#include <linux/mm.h>
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7327b2573f7c..7707f2664adb 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -377,14 +377,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
BUG_ON(dst_addr >= dst_start + len);

/*
- * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
- * i_mmap_rwsem ensures the dst_pte remains valid even
- * in the case of shared pmds. fault mutex prevents
- * races with other faulting threads.
+ * Serialize via hugetlb_fault_mutex.
*/
- mapping = dst_vma->vm_file->f_mapping;
- i_mmap_lock_read(mapping);
idx = linear_page_index(dst_vma, dst_addr);
+ mapping = dst_vma->vm_file->f_mapping;
hash = hugetlb_fault_mutex_hash(mapping, idx);
mutex_lock(&hugetlb_fault_mutex_table[hash]);

@@ -392,7 +388,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
if (!dst_pte) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
goto out_unlock;
}

@@ -400,7 +395,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
!huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
err = -EEXIST;
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);
goto out_unlock;
}

@@ -409,7 +403,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
wp_copy);

mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- i_mmap_unlock_read(mapping);

cond_resched();

--
2.37.1

2022-08-24 18:57:18

by Mike Kravetz

[permalink] [raw]
Subject: [PATCH 3/8] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache

remove_huge_page removes a hugetlb page from the page cache. Change
to hugetlb_delete_from_page_cache as it is a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages. For consistency and clarity, rename to hugetlb_add_to_page_cache.

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 21 ++++++++++-----------
include/linux/hugetlb.h | 2 +-
mm/hugetlb.c | 8 ++++----
3 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index dfb735a91bbb..d98c6edbd1a4 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -364,7 +364,7 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
return -EINVAL;
}

-static void remove_huge_page(struct page *page)
+static void hugetlb_delete_from_page_cache(struct page *page)
{
ClearPageDirty(page);
ClearPageUptodate(page);
@@ -478,15 +478,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
folio_lock(folio);
/*
* We must free the huge page and remove from page
- * cache (remove_huge_page) BEFORE removing the
- * region/reserve map (hugetlb_unreserve_pages). In
- * rare out of memory conditions, removal of the
- * region/reserve map could fail. Correspondingly,
- * the subpool and global reserve usage count can need
- * to be adjusted.
+ * cache BEFORE removing the * region/reserve map
+ * (hugetlb_unreserve_pages). In rare out of memory
+ * conditions, removal of the region/reserve map could
+ * fail. Correspondingly, the subpool and global
+ * reserve usage count can need to be adjusted.
*/
VM_BUG_ON(HPageRestoreReserve(&folio->page));
- remove_huge_page(&folio->page);
+ hugetlb_delete_from_page_cache(&folio->page);
freed++;
if (!truncate_op) {
if (unlikely(hugetlb_unreserve_pages(inode,
@@ -723,7 +722,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
}
clear_huge_page(page, addr, pages_per_huge_page(h));
__SetPageUptodate(page);
- error = huge_add_to_page_cache(page, mapping, index);
+ error = hugetlb_add_to_page_cache(page, mapping, index);
if (unlikely(error)) {
restore_reserve_on_error(h, &pseudo_vma, addr, page);
put_page(page);
@@ -735,7 +734,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,

SetHPageMigratable(page);
/*
- * unlock_page because locked by huge_add_to_page_cache()
+ * unlock_page because locked by hugetlb_add_to_page_cache()
* put_page() due to reference from alloc_huge_page()
*/
unlock_page(page);
@@ -980,7 +979,7 @@ static int hugetlbfs_error_remove_page(struct address_space *mapping,
struct inode *inode = mapping->host;
pgoff_t index = page->index;

- remove_huge_page(page);
+ hugetlb_delete_from_page_cache(page);
if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
hugetlb_fix_reserve_counts(inode);

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3ec981a0d8b3..acace1a25226 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -665,7 +665,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask);
struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
unsigned long address);
-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx);
void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
unsigned long address, struct page *page);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95c6f9a5bbf0..11c02513588c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5445,7 +5445,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
return page != NULL;
}

-int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
+int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
pgoff_t idx)
{
struct folio *folio = page_folio(page);
@@ -5586,7 +5586,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
new_page = true;

if (vma->vm_flags & VM_MAYSHARE) {
- int err = huge_add_to_page_cache(page, mapping, idx);
+ int err = hugetlb_add_to_page_cache(page, mapping, idx);
if (err) {
restore_reserve_on_error(h, vma, haddr, page);
put_page(page);
@@ -5993,11 +5993,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,

/*
* Serialization between remove_inode_hugepages() and
- * huge_add_to_page_cache() below happens through the
+ * hugetlb_add_to_page_cache() below happens through the
* hugetlb_fault_mutex_table that here must be hold by
* the caller.
*/
- ret = huge_add_to_page_cache(page, mapping, idx);
+ ret = hugetlb_add_to_page_cache(page, mapping, idx);
if (ret)
goto out_release_nounlock;
page_in_pagecache = true;
--
2.37.1

2022-08-25 17:05:43

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 08/24/22 10:57, Mike Kravetz wrote:
> When page fault code needs to allocate and instantiate a new hugetlb
> page (hugetlb_no_page), it checks early to determine if the fault is
> beyond i_size. When discovered early, it is easy to abort the fault and
> return an error. However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache. Backing out changes in such instances
> becomes difficult and error prone.
>
> Instead of trying to catch and backout all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults. The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch). Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
> fault. After fault code takes mutex, it will notice fault beyond
> i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
> checks in fault code. fault code will add page beyond i_size.
> When truncate code takes mutex for page/index, it will remove the
> page.
> - truncate updates i_size, but fault code obtains mutex first. If
> fault code sees updated i_size it will abort early. If fault code
> does not see updated i_size, it will add page beyond i_size and
> truncate code will remove page when it obtains fault mutex.
>
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups. For indices not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
>
<snip>
> /*
> * remove_inode_hugepages handles two distinct cases: truncation and hole
> * punch. There are subtle differences in operation for each case.
> @@ -418,11 +507,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> * truncation is indicated by end of range being LLONG_MAX
> * In this case, we first scan the range and release found pages.
> * After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
> - * maps and global counts. Page faults can not race with truncation
> - * in this routine. hugetlb_no_page() prevents page faults in the
> - * truncated range. It checks i_size before allocation, and again after
> - * with the page table lock for the page held. The same lock must be
> - * acquired to unmap a page.
> + * maps and global counts. Page faults can race with truncation.
> + * During faults, hugetlb_no_page() checks i_size before page allocation,
> + * and again after obtaining page table lock. It will 'back out'
> + * allocations in the truncated range.
> * hole punch is indicated if end is not LLONG_MAX
> * In the hole punch case we scan the range and release found pages.
> * Only when releasing a page is the associated region/reserve map
> @@ -431,75 +519,69 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> * This is indicated if we find a mapped page.
> * Note: If the passed end of range value is beyond the end of file, but
> * not LLONG_MAX this routine still performs a hole punch operation.
> + *
> + * Since page faults can race with this routine, care must be taken as both
> + * modify huge page reservation data. To somewhat synchronize these operations
> + * the hugetlb fault mutex is taken for EVERY index in the range to be hole
> + * punched or truncated. In this way, we KNOW either:
> + * - fault code has added a page beyond i_size, and we will remove here
> + * - fault code will see updated i_size and not add a page beyond
> + * The parameter 'lm_end' indicates the offset of the end of hole or file
> + * before truncation. For hole punch lm_end == lend.
> */
> static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> - loff_t lend)
> + loff_t lend, loff_t lm_end)
> {
> struct hstate *h = hstate_inode(inode);
> struct address_space *mapping = &inode->i_data;
> const pgoff_t start = lstart >> huge_page_shift(h);
> const pgoff_t end = lend >> huge_page_shift(h);
> + pgoff_t m_end = lm_end >> huge_page_shift(h);
> + pgoff_t m_start, m_index;
> struct folio_batch fbatch;
> + struct folio *folio;
> pgoff_t next, index;
> - int i, freed = 0;
> + unsigned int i;
> + long freed = 0;
> + u32 hash;
> bool truncate_op = (lend == LLONG_MAX);
>
> folio_batch_init(&fbatch);
> - next = start;
> + next = m_start = start;
> while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
> for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> - struct folio *folio = fbatch.folios[i];
> - u32 hash = 0;
> + folio = fbatch.folios[i];
>
> index = folio->index;
> - hash = hugetlb_fault_mutex_hash(mapping, index);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> -
> /*
> - * If folio is mapped, it was faulted in after being
> - * unmapped in caller. Unmap (again) now after taking
> - * the fault mutex. The mutex will prevent faults
> - * until we finish removing the folio.
> - *
> - * This race can only happen in the hole punch case.
> - * Getting here in a truncate operation is a bug.
> + * Take fault mutex for missing folios before index,
> + * while checking folios that might have been added
> + * due to a race with fault code.
> */
> - if (unlikely(folio_mapped(folio))) {
> - BUG_ON(truncate_op);
> -
> - i_mmap_lock_write(mapping);
> - hugetlb_vmdelete_list(&mapping->i_mmap,
> - index * pages_per_huge_page(h),
> - (index + 1) * pages_per_huge_page(h),
> - ZAP_FLAG_DROP_MARKER);
> - i_mmap_unlock_write(mapping);
> - }
> + freed += fault_lock_inode_indicies(h, inode, mapping,
> + m_start, m_index, truncate_op);

This should be 'index' instead of 'm_index' as discovered here:
https://lore.kernel.org/linux-mm/CA+G9fYsHVdu0toduQqk6vsR8Z8mOVzZ9-_p3O5fjQ5mOpSxsDA@mail.gmail.com/

--
Mike Kravetz
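
The commit message quoted above hinges on remove_inode_hugepages taking
the hugetlb fault mutex for EVERY index in the truncated or punched
range. As a rough, runnable userspace analogy of that per-index
locking, a small table of pthread mutexes can stand in for
hugetlb_fault_mutex_table; fault_mutex_hash, remove_range and
FAULT_MUTEX_SLOTS below are invented names for this sketch only.

#include <pthread.h>
#include <stdio.h>

#define FAULT_MUTEX_SLOTS 8
static pthread_mutex_t fault_mutex_table[FAULT_MUTEX_SLOTS];

/* stand-in for hugetlb_fault_mutex_hash(): map an index to a mutex slot */
static unsigned int fault_mutex_hash(unsigned long index)
{
	return (unsigned int)(index % FAULT_MUTEX_SLOTS);
}

/*
 * Walk every index in [start, end) and take the fault mutex for each,
 * mirroring the "EVERY index" rule above.  Under the mutex, truncation
 * either removes a page a racing fault already added, or guarantees a
 * later fault will see the updated i_size and abort.  Here we only
 * print what would happen.
 */
static void remove_range(unsigned long start, unsigned long end)
{
	for (unsigned long index = start; index < end; index++) {
		unsigned int hash = fault_mutex_hash(index);

		pthread_mutex_lock(&fault_mutex_table[hash]);
		printf("index %lu: fault mutex slot %u held\n", index, hash);
		pthread_mutex_unlock(&fault_mutex_table[hash]);
	}
}

int main(void)
{
	for (int i = 0; i < FAULT_MUTEX_SLOTS; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);
	remove_range(0, 16);
	return 0;
}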

2022-08-27 03:12:17

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 2/8] hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization

On 2022/8/25 1:57, Mike Kravetz wrote:
> Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> synchronization") added code to take i_mmap_rwsem in read mode for the
> duration of fault processing. However, this has been shown to cause
> performance/scaling issues. Revert the code and go back to only taking
> the semaphore in huge_pmd_share during the fault path.
>
> Keep the code that takes i_mmap_rwsem in write mode before calling
> try_to_unmap as this is required if huge_pmd_unshare is called.
>
> NOTE: Reverting this code does expose the following race condition.
>
> Faulting thread Unsharing thread
> ... ...
> ptep = huge_pte_offset()
> or
> ptep = huge_pte_alloc()
> ...
> i_mmap_lock_write
> lock page table
> ptep invalid <------------------------ huge_pmd_unshare()
> Could be in a previously unlock_page_table
> sharing process or worse i_mmap_unlock_write
> ...
> ptl = huge_pte_lock(ptep)
> get/update pte
> set_pte_at(pte, ptep)
>
> It is unknown if the above race was ever experienced by a user. It was
> discovered via code inspection when initially addressed.
>
> In subsequent patches, a new synchronization mechanism will be added to
> coordinate pmd sharing and eliminate this race.
>
> Signed-off-by: Mike Kravetz <[email protected]>

LGTM. Thanks.

Reviewed-by: Miaohe Lin <[email protected]>

Thanks,
Miaohe Lin

> ---
> fs/hugetlbfs/inode.c | 2 --
> mm/hugetlb.c | 77 +++++++-------------------------------------
> mm/rmap.c | 8 +----
> mm/userfaultfd.c | 11 ++-----
> 4 files changed, 15 insertions(+), 83 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a32031e751d1..dfb735a91bbb 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -467,9 +467,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> if (unlikely(folio_mapped(folio))) {
> BUG_ON(truncate_op);
>
> - mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> i_mmap_lock_write(mapping);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> hugetlb_vmdelete_list(&mapping->i_mmap,
> index * pages_per_huge_page(h),
> (index + 1) * pages_per_huge_page(h),
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 70bc7f867bc0..95c6f9a5bbf0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4770,7 +4770,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> struct hstate *h = hstate_vma(src_vma);
> unsigned long sz = huge_page_size(h);
> unsigned long npages = pages_per_huge_page(h);
> - struct address_space *mapping = src_vma->vm_file->f_mapping;
> struct mmu_notifier_range range;
> unsigned long last_addr_mask;
> int ret = 0;
> @@ -4782,14 +4781,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> mmu_notifier_invalidate_range_start(&range);
> mmap_assert_write_locked(src);
> raw_write_seqcount_begin(&src->write_protect_seq);
> - } else {
> - /*
> - * For shared mappings i_mmap_rwsem must be held to call
> - * huge_pte_alloc, otherwise the returned ptep could go
> - * away if part of a shared pmd and another thread calls
> - * huge_pmd_unshare.
> - */
> - i_mmap_lock_read(mapping);
> }
>
> last_addr_mask = hugetlb_mask_last_page(h);
> @@ -4937,8 +4928,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> if (cow) {
> raw_write_seqcount_end(&src->write_protect_seq);
> mmu_notifier_invalidate_range_end(&range);
> - } else {
> - i_mmap_unlock_read(mapping);
> }
>
> return ret;
> @@ -5347,30 +5336,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
> * may get SIGKILLed if it later faults.
> */
> if (outside_reserve) {
> - struct address_space *mapping = vma->vm_file->f_mapping;
> - pgoff_t idx;
> - u32 hash;
> -
> put_page(old_page);
> BUG_ON(huge_pte_none(pte));
> - /*
> - * Drop hugetlb_fault_mutex and i_mmap_rwsem before
> - * unmapping. unmapping needs to hold i_mmap_rwsem
> - * in write mode. Dropping i_mmap_rwsem in read mode
> - * here is OK as COW mappings do not interact with
> - * PMD sharing.
> - *
> - * Reacquire both after unmap operation.
> - */
> - idx = vma_hugecache_offset(h, vma, haddr);
> - hash = hugetlb_fault_mutex_hash(mapping, idx);
> - mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
> -
> unmap_ref_private(mm, vma, old_page, haddr);
> -
> - i_mmap_lock_read(mapping);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> spin_lock(ptl);
> ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> if (likely(ptep &&
> @@ -5538,9 +5506,7 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
> */
> hash = hugetlb_fault_mutex_hash(mapping, idx);
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
> ret = handle_userfault(&vmf, reason);
> - i_mmap_lock_read(mapping);
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> return ret;
> @@ -5772,11 +5738,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
> ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
> if (ptep) {
> - /*
> - * Since we hold no locks, ptep could be stale. That is
> - * OK as we are only making decisions based on content and
> - * not actually modifying content here.
> - */
> entry = huge_ptep_get(ptep);
> if (unlikely(is_hugetlb_entry_migration(entry))) {
> migration_entry_wait_huge(vma, ptep);
> @@ -5784,31 +5745,20 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry)))
> return VM_FAULT_HWPOISON_LARGE |
> VM_FAULT_SET_HINDEX(hstate_index(h));
> + } else {
> + ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> + if (!ptep)
> + return VM_FAULT_OOM;
> }
>
> - /*
> - * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
> - * until finished with ptep. This prevents huge_pmd_unshare from
> - * being called elsewhere and making the ptep no longer valid.
> - *
> - * ptep could have already be assigned via huge_pte_offset. That
> - * is OK, as huge_pte_alloc will return the same value unless
> - * something has changed.
> - */
> mapping = vma->vm_file->f_mapping;
> - i_mmap_lock_read(mapping);
> - ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> - if (!ptep) {
> - i_mmap_unlock_read(mapping);
> - return VM_FAULT_OOM;
> - }
> + idx = vma_hugecache_offset(h, vma, haddr);
>
> /*
> * Serialize hugepage allocation and instantiation, so that we don't
> * get spurious allocation failures if two CPUs race to instantiate
> * the same page in the page cache.
> */
> - idx = vma_hugecache_offset(h, vma, haddr);
> hash = hugetlb_fault_mutex_hash(mapping, idx);
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> @@ -5873,7 +5823,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> put_page(pagecache_page);
> }
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
> return handle_userfault(&vmf, VM_UFFD_WP);
> }
>
> @@ -5917,7 +5866,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> }
> out_mutex:
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
> /*
> * Generally it's safe to hold refcount during waiting page lock. But
> * here we just wait to defer the next page fault to avoid busy loop and
> @@ -6758,12 +6706,10 @@ void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
> * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
> * and returns the corresponding pte. While this is not necessary for the
> * !shared pmd case because we can allocate the pmd later as well, it makes the
> - * code much cleaner.
> - *
> - * This routine must be called with i_mmap_rwsem held in at least read mode if
> - * sharing is possible. For hugetlbfs, this prevents removal of any page
> - * table entries associated with the address space. This is important as we
> - * are setting up sharing based on existing page table entries (mappings).
> + * code much cleaner. pmd allocation is essential for the shared case because
> + * pud has to be populated inside the same i_mmap_rwsem section - otherwise
> + * racing tasks could either miss the sharing (see huge_pte_offset) or select a
> + * bad pmd for sharing.
> */
> pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, pud_t *pud)
> @@ -6777,7 +6723,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> pte_t *pte;
> spinlock_t *ptl;
>
> - i_mmap_assert_locked(mapping);
> + i_mmap_lock_read(mapping);
> vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> if (svma == vma)
> continue;
> @@ -6807,6 +6753,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> spin_unlock(ptl);
> out:
> pte = (pte_t *)pmd_alloc(mm, pud, addr);
> + i_mmap_unlock_read(mapping);
> return pte;
> }
>
> @@ -6817,7 +6764,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
> * indicated by page_count > 1, unmap is achieved by clearing pud and
> * decrementing the ref count. If count == 1, the pte page is not shared.
> *
> - * Called with page table lock held and i_mmap_rwsem held in write mode.
> + * Called with page table lock held.
> *
> * returns: 1 successfully unmapped a shared pte page
> * 0 the underlying pte page is not shared, or it is the last user
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7dc6d77ae865..ad9c97c6445c 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -23,10 +23,9 @@
> * inode->i_rwsem (while writing or truncating, not reading or faulting)
> * mm->mmap_lock
> * mapping->invalidate_lock (in filemap_fault)
> - * page->flags PG_locked (lock_page) * (see hugetlbfs below)
> + * page->flags PG_locked (lock_page)
> * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share)
> * mapping->i_mmap_rwsem
> - * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> * anon_vma->rwsem
> * mm->page_table_lock or pte_lock
> * swap_lock (in swap_duplicate, swap_info_get)
> @@ -45,11 +44,6 @@
> * anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon)
> * ->tasklist_lock
> * pte map lock
> - *
> - * * hugetlbfs PageHuge() pages take locks in this order:
> - * mapping->i_mmap_rwsem
> - * hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
> - * page->flags PG_locked (lock_page)
> */
>
> #include <linux/mm.h>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 7327b2573f7c..7707f2664adb 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -377,14 +377,10 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> BUG_ON(dst_addr >= dst_start + len);
>
> /*
> - * Serialize via i_mmap_rwsem and hugetlb_fault_mutex.
> - * i_mmap_rwsem ensures the dst_pte remains valid even
> - * in the case of shared pmds. fault mutex prevents
> - * races with other faulting threads.
> + * Serialize via hugetlb_fault_mutex.
> */
> - mapping = dst_vma->vm_file->f_mapping;
> - i_mmap_lock_read(mapping);
> idx = linear_page_index(dst_vma, dst_addr);
> + mapping = dst_vma->vm_file->f_mapping;
> hash = hugetlb_fault_mutex_hash(mapping, idx);
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> @@ -392,7 +388,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> dst_pte = huge_pte_alloc(dst_mm, dst_vma, dst_addr, vma_hpagesize);
> if (!dst_pte) {
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
> goto out_unlock;
> }
>
> @@ -400,7 +395,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> !huge_pte_none_mostly(huge_ptep_get(dst_pte))) {
> err = -EEXIST;
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
> goto out_unlock;
> }
>
> @@ -409,7 +403,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
> wp_copy);
>
> mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> - i_mmap_unlock_read(mapping);
>
> cond_resched();
>
>

2022-08-27 03:12:52

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 1/8] hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race

On 2022/8/25 1:57, Mike Kravetz wrote:
> Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> synchronization") added code to take i_mmap_rwsem in read mode for the
> duration of fault processing. The use of i_mmap_rwsem to prevent
> fault/truncate races depends on this. However, this has been shown to
> cause performance/scaling issues. As a result, that code will be
> reverted. Since the use i_mmap_rwsem to address page fault/truncate races
> depends on this, it must also be reverted.
>
> In a subsequent patch, code will be added to detect the fault/truncate
> race and back out operations as required.

LGTM. Thanks.

Reviewed-by: Miaohe Lin <[email protected]>

Thanks,
Miaohe Lin

>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 30 +++++++++---------------------
> mm/hugetlb.c | 23 ++++++++++++-----------
> 2 files changed, 21 insertions(+), 32 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index f7a5b5124d8a..a32031e751d1 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -419,9 +419,10 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> * In this case, we first scan the range and release found pages.
> * After releasing pages, hugetlb_unreserve_pages cleans up region/reserve
> * maps and global counts. Page faults can not race with truncation
> - * in this routine. hugetlb_no_page() holds i_mmap_rwsem and prevents
> - * page faults in the truncated range by checking i_size. i_size is
> - * modified while holding i_mmap_rwsem.
> + * in this routine. hugetlb_no_page() prevents page faults in the
> + * truncated range. It checks i_size before allocation, and again after
> + * with the page table lock for the page held. The same lock must be
> + * acquired to unmap a page.
> * hole punch is indicated if end is not LLONG_MAX
> * In the hole punch case we scan the range and release found pages.
> * Only when releasing a page is the associated region/reserve map
> @@ -451,16 +452,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> u32 hash = 0;
>
> index = folio->index;
> - if (!truncate_op) {
> - /*
> - * Only need to hold the fault mutex in the
> - * hole punch case. This prevents races with
> - * page faults. Races are not possible in the
> - * case of truncation.
> - */
> - hash = hugetlb_fault_mutex_hash(mapping, index);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> - }
> + hash = hugetlb_fault_mutex_hash(mapping, index);
> + mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> /*
> * If folio is mapped, it was faulted in after being
> @@ -504,8 +497,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> }
>
> folio_unlock(folio);
> - if (!truncate_op)
> - mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> }
> folio_batch_release(&fbatch);
> cond_resched();
> @@ -543,8 +535,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
> BUG_ON(offset & ~huge_page_mask(h));
> pgoff = offset >> PAGE_SHIFT;
>
> - i_mmap_lock_write(mapping);
> i_size_write(inode, offset);
> + i_mmap_lock_write(mapping);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
> ZAP_FLAG_DROP_MARKER);
> @@ -703,11 +695,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> /* addr is the offset within the file (zero based) */
> addr = index * hpage_size;
>
> - /*
> - * fault mutex taken here, protects against fault path
> - * and hole punch. inode_lock previously taken protects
> - * against truncation.
> - */
> + /* mutex taken here, fault path and hole punch */
> hash = hugetlb_fault_mutex_hash(mapping, index);
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9a72499486c1..70bc7f867bc0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5575,18 +5575,17 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> }
>
> /*
> - * We can not race with truncation due to holding i_mmap_rwsem.
> - * i_size is modified when holding i_mmap_rwsem, so check here
> - * once for faults beyond end of file.
> + * Use page lock to guard against racing truncation
> + * before we get page_table_lock.
> */
> - size = i_size_read(mapping->host) >> huge_page_shift(h);
> - if (idx >= size)
> - goto out;
> -
> retry:
> new_page = false;
> page = find_lock_page(mapping, idx);
> if (!page) {
> + size = i_size_read(mapping->host) >> huge_page_shift(h);
> + if (idx >= size)
> + goto out;
> +
> /* Check for page in userfault range */
> if (userfaultfd_missing(vma)) {
> ret = hugetlb_handle_userfault(vma, mapping, idx,
> @@ -5677,6 +5676,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> }
>
> ptl = huge_pte_lock(h, mm, ptep);
> + size = i_size_read(mapping->host) >> huge_page_shift(h);
> + if (idx >= size)
> + goto backout;
> +
> ret = 0;
> /* If pte changed from under us, retry */
> if (!pte_same(huge_ptep_get(ptep), old_pte))
> @@ -5785,10 +5788,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>
> /*
> * Acquire i_mmap_rwsem before calling huge_pte_alloc and hold
> - * until finished with ptep. This serves two purposes:
> - * 1) It prevents huge_pmd_unshare from being called elsewhere
> - * and making the ptep no longer valid.
> - * 2) It synchronizes us with i_size modifications during truncation.
> + * until finished with ptep. This prevents huge_pmd_unshare from
> + * being called elsewhere and making the ptep no longer valid.
> *
> * ptep could have already be assigned via huge_pte_offset. That
> * is OK, as huge_pte_alloc will return the same value unless
>

2022-08-27 03:52:52

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 3/8] hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache

On 2022/8/25 1:57, Mike Kravetz wrote:
> remove_huge_page removes a hugetlb page from the page cache. Change
> to hugetlb_delete_from_page_cache as it is a more descriptive name.
> huge_add_to_page_cache is global in scope, but only deals with hugetlb
> pages. For consistency and clarity, rename to hugetlb_add_to_page_cache.
>
> Signed-off-by: Mike Kravetz <[email protected]>

LGTM with one nit below. Thanks.

> ---
> fs/hugetlbfs/inode.c | 21 ++++++++++-----------
> include/linux/hugetlb.h | 2 +-
> mm/hugetlb.c | 8 ++++----
> 3 files changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index dfb735a91bbb..d98c6edbd1a4 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -364,7 +364,7 @@ static int hugetlbfs_write_end(struct file *file, struct address_space *mapping,
> return -EINVAL;
> }
>
> -static void remove_huge_page(struct page *page)
> +static void hugetlb_delete_from_page_cache(struct page *page)
> {
> ClearPageDirty(page);
> ClearPageUptodate(page);
> @@ -478,15 +478,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> folio_lock(folio);
> /*
> * We must free the huge page and remove from page
> - * cache (remove_huge_page) BEFORE removing the
> - * region/reserve map (hugetlb_unreserve_pages). In
> - * rare out of memory conditions, removal of the
> - * region/reserve map could fail. Correspondingly,
> - * the subpool and global reserve usage count can need
> - * to be adjusted.
> + * cache BEFORE removing the * region/reserve map

s/the * region/the region/, i.e. remove extra "*".

Reviewed-by: Miaohe Lin <[email protected]>

Thanks,
Miaohe Lin

> + * (hugetlb_unreserve_pages). In rare out of memory
> + * conditions, removal of the region/reserve map could
> + * fail. Correspondingly, the subpool and global
> + * reserve usage count can need to be adjusted.
> */
> VM_BUG_ON(HPageRestoreReserve(&folio->page));
> - remove_huge_page(&folio->page);
> + hugetlb_delete_from_page_cache(&folio->page);
> freed++;
> if (!truncate_op) {
> if (unlikely(hugetlb_unreserve_pages(inode,
> @@ -723,7 +722,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> }
> clear_huge_page(page, addr, pages_per_huge_page(h));
> __SetPageUptodate(page);
> - error = huge_add_to_page_cache(page, mapping, index);
> + error = hugetlb_add_to_page_cache(page, mapping, index);
> if (unlikely(error)) {
> restore_reserve_on_error(h, &pseudo_vma, addr, page);
> put_page(page);
> @@ -735,7 +734,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>
> SetHPageMigratable(page);
> /*
> - * unlock_page because locked by huge_add_to_page_cache()
> + * unlock_page because locked by hugetlb_add_to_page_cache()
> * put_page() due to reference from alloc_huge_page()
> */
> unlock_page(page);
> @@ -980,7 +979,7 @@ static int hugetlbfs_error_remove_page(struct address_space *mapping,
> struct inode *inode = mapping->host;
> pgoff_t index = page->index;
>
> - remove_huge_page(page);
> + hugetlb_delete_from_page_cache(page);
> if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
> hugetlb_fix_reserve_counts(inode);
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 3ec981a0d8b3..acace1a25226 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -665,7 +665,7 @@ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid,
> nodemask_t *nmask, gfp_t gfp_mask);
> struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
> unsigned long address);
> -int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
> +int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
> pgoff_t idx);
> void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> unsigned long address, struct page *page);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 95c6f9a5bbf0..11c02513588c 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5445,7 +5445,7 @@ static bool hugetlbfs_pagecache_present(struct hstate *h,
> return page != NULL;
> }
>
> -int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
> +int hugetlb_add_to_page_cache(struct page *page, struct address_space *mapping,
> pgoff_t idx)
> {
> struct folio *folio = page_folio(page);
> @@ -5586,7 +5586,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> new_page = true;
>
> if (vma->vm_flags & VM_MAYSHARE) {
> - int err = huge_add_to_page_cache(page, mapping, idx);
> + int err = hugetlb_add_to_page_cache(page, mapping, idx);
> if (err) {
> restore_reserve_on_error(h, vma, haddr, page);
> put_page(page);
> @@ -5993,11 +5993,11 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>
> /*
> * Serialization between remove_inode_hugepages() and
> - * huge_add_to_page_cache() below happens through the
> + * hugetlb_add_to_page_cache() below happens through the
> * hugetlb_fault_mutex_table that here must be hold by
> * the caller.
> */
> - ret = huge_add_to_page_cache(page, mapping, idx);
> + ret = hugetlb_add_to_page_cache(page, mapping, idx);
> if (ret)
> goto out_release_nounlock;
> page_in_pagecache = true;
>

2022-08-27 08:37:46

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 5/8] hugetlb: rename vma_shareable() and refactor code

On 2022/8/25 1:57, Mike Kravetz wrote:
> Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
> checking a specific address within the vma. Refactor code to check if
> an aligned range is shareable as this will be needed in a subsequent
> patch.
>
> Signed-off-by: Mike Kravetz <[email protected]>

LGTM. Thanks.

Reviewed-by: Miaohe Lin <[email protected]>

Thanks,
Miaohe Lin

> ---
> mm/hugetlb.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a6eb46c64baf..758b6844d566 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6648,26 +6648,33 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
> return saddr;
> }
>
> -static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
> +static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end)
> {
> - unsigned long base = addr & PUD_MASK;
> - unsigned long end = base + PUD_SIZE;
> -
> /*
> * check on proper vm_flags and page table alignment
> */
> - if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, base, end))
> + if (vma->vm_flags & VM_MAYSHARE && range_in_vma(vma, start, end))
> return true;
> return false;
> }
>
> +static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
> + unsigned long addr)
> +{
> + unsigned long start = addr & PUD_MASK;
> + unsigned long end = start + PUD_SIZE;
> +
> + return __vma_aligned_range_pmd_shareable(vma, start, end);
> +}
> +
> bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
> {
> #ifdef CONFIG_USERFAULTFD
> if (uffd_disable_huge_pmd_share(vma))
> return false;
> #endif
> - return vma_shareable(vma, addr);
> + return vma_addr_pmd_shareable(vma, addr);
> }
>
> /*
>

2022-08-27 08:38:19

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 2022/8/25 1:57, Mike Kravetz wrote:
> When page fault code needs to allocate and instantiate a new hugetlb
> page (hugetlb_no_page), it checks early to determine if the fault is
> beyond i_size. When discovered early, it is easy to abort the fault and
> return an error. However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache. Backing out changes in such instances
> becomes difficult and error prone.
>
> Instead of trying to catch and backout all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults. The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch). Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
> fault. After fault code takes mutex, it will notice fault beyond
> i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
> checks in fault code. fault code will add page beyond i_size.
> When truncate code takes mutex for page/index, it will remove the
> page.
> - truncate updates i_size, but fault code obtains mutex first. If
> fault code sees updated i_size it will abort early. If fault code
> does not see updated i_size, it will add page beyond i_size and
> truncate code will remove page when it obtains fault mutex.
>
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups. For indices not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> mm/hugetlb.c | 41 +++++-----
> 2 files changed, 152 insertions(+), 73 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d98c6edbd1a4..e83fd31671b3 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -411,6 +411,95 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> }
> }
>
> +/*
> + * Called with hugetlb fault mutex held.
> + * Returns true if page was actually removed, false otherwise.
> + */
> +static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> + struct address_space *mapping,
> + struct folio *folio, pgoff_t index,
> + bool truncate_op)
> +{
> + bool ret = false;
> +
> + /*
> + * If folio is mapped, it was faulted in after being
> + * unmapped in caller. Unmap (again) while holding
> + * the fault mutex. The mutex will prevent faults
> + * until we finish removing the folio.
> + */
> + if (unlikely(folio_mapped(folio))) {
> + i_mmap_lock_write(mapping);
> + hugetlb_vmdelete_list(&mapping->i_mmap,
> + index * pages_per_huge_page(h),
> + (index + 1) * pages_per_huge_page(h),
> + ZAP_FLAG_DROP_MARKER);
> + i_mmap_unlock_write(mapping);
> + }
> +
> + folio_lock(folio);
> + /*
> + * After locking page, make sure mapping is the same.
> + * We could have raced with page fault populate and
> + * backout code.
> + */
> + if (folio_mapping(folio) == mapping) {

Could you explain this a bit more? IIUC, the page fault path won't remove the hugetlb
page from the page cache anymore, so is this check unneeded? Or should we keep the
check anyway in case the code changes in the future?

> + /*
> + * We must remove the folio from page cache before removing
> + * the region/ reserve map (hugetlb_unreserve_pages). In
> + * rare out of memory conditions, removal of the region/reserve
> + * map could fail. Correspondingly, the subpool and global
> + * reserve usage count can need to be adjusted.
> + */
> + VM_BUG_ON(HPageRestoreReserve(&folio->page));
> + hugetlb_delete_from_page_cache(&folio->page);
> + ret = true;
> + if (!truncate_op) {
> + if (unlikely(hugetlb_unreserve_pages(inode, index,
> + index + 1, 1)))
> + hugetlb_fix_reserve_counts(inode);
> + }
> + }
> +
> + folio_unlock(folio);
> + return ret;
> +}

<snip>
> @@ -5584,9 +5585,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> clear_huge_page(page, address, pages_per_huge_page(h));
> __SetPageUptodate(page);
> new_page = true;
> + if (HPageRestoreReserve(page))
> + reserve_alloc = true;
>
> if (vma->vm_flags & VM_MAYSHARE) {
> - int err = hugetlb_add_to_page_cache(page, mapping, idx);
> + int err;
> +
> + err = hugetlb_add_to_page_cache(page, mapping, idx);
> if (err) {
> restore_reserve_on_error(h, vma, haddr, page);
> put_page(page);
> @@ -5642,10 +5647,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> }
>
> ptl = huge_pte_lock(h, mm, ptep);
> - size = i_size_read(mapping->host) >> huge_page_shift(h);
> - if (idx >= size)
> - goto backout;
> -
> ret = 0;
> /* If pte changed from under us, retry */
> if (!pte_same(huge_ptep_get(ptep), old_pte))
> @@ -5689,10 +5690,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> backout:
> spin_unlock(ptl);
> backout_unlocked:
> - unlock_page(page);
> - /* restore reserve for newly allocated pages not in page cache */
> - if (new_page && !new_pagecache_page)
> + if (new_page && !new_pagecache_page) {
> + /*
> + * If reserve was consumed, make sure flag is set so that it
> + * will be restored in free_huge_page().
> + */
> + if (reserve_alloc)
> + SetHPageRestoreReserve(page);

If code reaches here, it should be a newly allocated page that was not added to the hugetlb page cache.
Note that failing to add the page to the hugetlb page cache would have returned already. So the page must be
anon? If so, HPageRestoreReserve isn't cleared yet, as it's cleared right before set_huge_pte_at. Thus the above
check can be removed?
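
For reference, the relevant ordering in hugetlb_no_page() is roughly the following
(paraphrased from memory of this era's mm/hugetlb.c, not an exact excerpt):

	if (anon_rmap) {
		/* anon pages keep HPageRestoreReserve until right here */
		ClearHPageRestoreReserve(page);
		hugepage_add_new_anon_rmap(page, vma, haddr);
	} else
		page_dup_file_rmap(page, true);
	new_pte = make_huge_pte(vma, page, ...);
	set_huge_pte_at(mm, haddr, ptep, new_pte);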

Anyway, the patch looks good to me.

Reviewed-by: Miaohe Lin <[email protected]>

Thanks,
Miaohe Lin

2022-08-27 09:37:32

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 6/8] hugetlb: add vma based lock for pmd sharing

On 2022/8/25 1:57, Mike Kravetz wrote:
> Allocate a rw semaphore and hang off vm_private_data for
> synchronization use by vmas that could be involved in pmd sharing. Only
> add infrastructure for the new lock here. Actual use will be added in
> subsequent patch.
>
> Signed-off-by: Mike Kravetz <[email protected]>

<snip>

> +static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
> +{
> + /*
> + * Only present in sharable vmas. See comment in
> + * __unmap_hugepage_range_final about the neeed to check both

s/neeed/need/

> + * VM_SHARED and VM_MAYSHARE in free path

I think there might be some wrong checks around this patch. As above comment said, we
need to check both flags, so we should do something like below instead?

if (!((vma->vm_flags & (VM_MAYSHARE | VM_SHARED)) == (VM_MAYSHARE | VM_SHARED)))

> + */
> + if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
> + return;
> +
> + if (vma->vm_private_data) {
> + kfree(vma->vm_private_data);
> + vma->vm_private_data = NULL;
> + }
> +}
> +
> +static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> +{
> + struct rw_semaphore *vma_sema;
> +
> + /* Only establish in (flags) sharable vmas */
> + if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> + return;
> +
> + /* Should never get here with non-NULL vm_private_data */

We can get here with non-NULL vm_private_data when called from hugetlb_vm_op_open during fork?

Also there's one missing change in a comment:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d0617d64d718..4bc844a1d312 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -863,7 +863,7 @@ __weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
* faults in a MAP_PRIVATE mapping. Only the process that called mmap()
* is guaranteed to have their future faults succeed.
*
- * With the exception of reset_vma_resv_huge_pages() which is called at fork(),
+ * With the exception of hugetlb_dup_vma_private() which is called at fork(),
* the reserve counters are updated with the hugetlb_lock held. It is safe
* to reset the VMA at fork() time as it is not in use yet and there is no
* chance of the global counters getting corrupted as a result of the values.


Otherwise this patch looks good to me. Thanks.

Thanks,
Miaohe Lin

2022-08-29 02:51:52

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

On 2022/8/25 1:57, Mike Kravetz wrote:
> Create the new routine hugetlb_unmap_file_folio that will unmap a single
> file folio. This is refactored code from hugetlb_vmdelete_list. It is
> modified to do locking within the routine itself and check whether the
> page is mapped within a specific vma before unmapping.
>
> This refactoring will be put to use and expanded upon in a subsequent
> patch adding vma specific locking.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
> 1 file changed, 94 insertions(+), 29 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index e83fd31671b3..b93d131b0cb5 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
> delete_from_page_cache(page);
> }
>
> +/*
> + * Called with i_mmap_rwsem held for inode based vma maps. This makes
> + * sure vma (and vm_mm) will not go away. We also hold the hugetlb fault
> + * mutex for the page in the mapping. So, we can not race with page being
> + * faulted into the vma.
> + */
> +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> + unsigned long addr, struct page *page)
> +{
> + pte_t *ptep, pte;
> +
> + ptep = huge_pte_offset(vma->vm_mm, addr,
> + huge_page_size(hstate_vma(vma)));
> +
> + if (!ptep)
> + return false;
> +
> + pte = huge_ptep_get(ptep);
> + if (huge_pte_none(pte) || !pte_present(pte))
> + return false;
> +
> + if (pte_page(pte) == page)
> + return true;

I'm wondering whether the pte entry could change after we check it, since huge_pte_lock is not held here.
But I think holding i_mmap_rwsem in write mode should give us such a guarantee, e.g. a migration
entry is changed back to a huge pte entry while holding i_mmap_rwsem in read mode.
Or am I missing something?

> +
> + return false;
> +}
> +
> +/*
> + * Can vma_offset_start/vma_offset_end overflow on 32-bit arches?
> + * No, because the interval tree returns us only those vmas
> + * which overlap the truncated area starting at pgoff,
> + * and no vma on a 32-bit arch can span beyond the 4GB.
> + */
> +static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
> +{
> + if (vma->vm_pgoff < start)
> + return (start - vma->vm_pgoff) << PAGE_SHIFT;
> + else
> + return 0;
> +}
> +
> +static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
> +{
> + unsigned long t_end;
> +
> + if (!end)
> + return vma->vm_end;
> +
> + t_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
> + if (t_end > vma->vm_end)
> + t_end = vma->vm_end;
> + return t_end;
> +}
> +
> +/*
> + * Called with hugetlb fault mutex held. Therefore, no more mappings to
> + * this folio can be created while executing the routine.
> + */
> +static void hugetlb_unmap_file_folio(struct hstate *h,
> + struct address_space *mapping,
> + struct folio *folio, pgoff_t index)
> +{
> + struct rb_root_cached *root = &mapping->i_mmap;
> + struct page *page = &folio->page;
> + struct vm_area_struct *vma;
> + unsigned long v_start;
> + unsigned long v_end;
> + pgoff_t start, end;
> +
> + start = index * pages_per_huge_page(h);
> + end = ((index + 1) * pages_per_huge_page(h));

It seems the outer parentheses are unneeded?

Reviewed-by: Miaohe Lin <[email protected]>

Thanks,
Miaohe Lin


2022-08-29 22:09:43

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 08/27/22 16:02, Miaohe Lin wrote:
> On 2022/8/25 1:57, Mike Kravetz wrote:
> > When page fault code needs to allocate and instantiate a new hugetlb
> > page (huegtlb_no_page), it checks early to determine if the fault is
> > beyond i_size. When discovered early, it is easy to abort the fault and
> > return an error. However, it becomes much more difficult to handle when
> > discovered later after allocating the page and consuming reservations
> > and adding to the page cache. Backing out changes in such instances
> > becomes difficult and error prone.
> >
> > Instead of trying to catch and backout all such races, use the hugetlb
> > fault mutex to handle truncate racing with page faults. The most
> > significant change is modification of the routine remove_inode_hugepages
> > such that it will take the fault mutex for EVERY index in the truncated
> > range (or hole in the case of hole punch). Since remove_inode_hugepages
> > is called in the truncate path after updating i_size, we can experience
> > races as follows.
> > - truncate code updates i_size and takes fault mutex before a racing
> > fault. After fault code takes mutex, it will notice fault beyond
> > i_size and abort early.
> > - fault code obtains mutex, and truncate updates i_size after early
> > checks in fault code. fault code will add page beyond i_size.
> > When truncate code takes mutex for page/index, it will remove the
> > page.
> > - truncate updates i_size, but fault code obtains mutex first. If
> > fault code sees updated i_size it will abort early. If fault code
> > does not see updated i_size, it will add page beyond i_size and
> > truncate code will remove page when it obtains fault mutex.
> >
> > Note, for performance reasons remove_inode_hugepages will still use
> > filemap_get_folios for bulk folio lookups. For indicies not returned in
> > the bulk lookup, it will need to lookup individual folios to check for
> > races with page fault.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> > mm/hugetlb.c | 41 +++++-----
> > 2 files changed, 152 insertions(+), 73 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index d98c6edbd1a4..e83fd31671b3 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -411,6 +411,95 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> > }
> > }
> >
> > +/*
> > + * Called with hugetlb fault mutex held.
> > + * Returns true if page was actually removed, false otherwise.
> > + */
> > +static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,
> > + struct address_space *mapping,
> > + struct folio *folio, pgoff_t index,
> > + bool truncate_op)
> > +{
> > + bool ret = false;
> > +
> > + /*
> > + * If folio is mapped, it was faulted in after being
> > + * unmapped in caller. Unmap (again) while holding
> > + * the fault mutex. The mutex will prevent faults
> > + * until we finish removing the folio.
> > + */
> > + if (unlikely(folio_mapped(folio))) {
> > + i_mmap_lock_write(mapping);
> > + hugetlb_vmdelete_list(&mapping->i_mmap,
> > + index * pages_per_huge_page(h),
> > + (index + 1) * pages_per_huge_page(h),
> > + ZAP_FLAG_DROP_MARKER);
> > + i_mmap_unlock_write(mapping);
> > + }
> > +
> > + folio_lock(folio);
> > + /*
> > + * After locking page, make sure mapping is the same.
> > + * We could have raced with page fault populate and
> > + * backout code.
> > + */
> > + if (folio_mapping(folio) == mapping) {
>
> Could you explain this more? IIUC, page fault won't remove the hugetlb page from page
> cache anymore. So this check is unneeded? Or we should always check this in case future
> code changing?
>

You are correct: with the updated code we should never hit this condition.
The faulting code will not remove pages from the page cache, so the
folio_mapping() check can be removed.
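
For illustration, the simplified body would then look something like this (a sketch
based on the hunk quoted above, not the final v2 code; with the check gone, the
function can simply return true after deleting from the page cache):

	folio_lock(folio);
	/*
	 * The fault path no longer removes folios from the page cache, so
	 * the mapping can not have changed under us; no folio_mapping()
	 * re-check is needed.
	 */
	VM_BUG_ON(HPageRestoreReserve(&folio->page));
	hugetlb_delete_from_page_cache(&folio->page);
	if (!truncate_op) {
		if (unlikely(hugetlb_unreserve_pages(inode, index,
							index + 1, 1)))
			hugetlb_fix_reserve_counts(inode);
	}
	folio_unlock(folio);
	return true;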

> > + /*
> > + * We must remove the folio from page cache before removing
> > + * the region/ reserve map (hugetlb_unreserve_pages). In
> > + * rare out of memory conditions, removal of the region/reserve
> > + * map could fail. Correspondingly, the subpool and global
> > + * reserve usage count can need to be adjusted.
> > + */
> > + VM_BUG_ON(HPageRestoreReserve(&folio->page));
> > + hugetlb_delete_from_page_cache(&folio->page);
> > + ret = true;
> > + if (!truncate_op) {
> > + if (unlikely(hugetlb_unreserve_pages(inode, index,
> > + index + 1, 1)))
> > + hugetlb_fix_reserve_counts(inode);
> > + }
> > + }
> > +
> > + folio_unlock(folio);
> > + return ret;
> > +}
>
> <snip>
> > @@ -5584,9 +5585,13 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > clear_huge_page(page, address, pages_per_huge_page(h));
> > __SetPageUptodate(page);
> > new_page = true;
> > + if (HPageRestoreReserve(page))
> > + reserve_alloc = true;
> >
> > if (vma->vm_flags & VM_MAYSHARE) {
> > - int err = hugetlb_add_to_page_cache(page, mapping, idx);
> > + int err;
> > +
> > + err = hugetlb_add_to_page_cache(page, mapping, idx);
> > if (err) {
> > restore_reserve_on_error(h, vma, haddr, page);
> > put_page(page);
> > @@ -5642,10 +5647,6 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > }
> >
> > ptl = huge_pte_lock(h, mm, ptep);
> > - size = i_size_read(mapping->host) >> huge_page_shift(h);
> > - if (idx >= size)
> > - goto backout;
> > -
> > ret = 0;
> > /* If pte changed from under us, retry */
> > if (!pte_same(huge_ptep_get(ptep), old_pte))
> > @@ -5689,10 +5690,18 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> > backout:
> > spin_unlock(ptl);
> > backout_unlocked:
> > - unlock_page(page);
> > - /* restore reserve for newly allocated pages not in page cache */
> > - if (new_page && !new_pagecache_page)
> > + if (new_page && !new_pagecache_page) {
> > + /*
> > + * If reserve was consumed, make sure flag is set so that it
> > + * will be restored in free_huge_page().
> > + */
> > + if (reserve_alloc)
> > + SetHPageRestoreReserve(page);
>
> If code reaches here, it should be a newly allocated page and it's not added to the hugetlb page cache.
> Note that failing to add the page to hugetlb page cache should have returned already. So the page must be
> anon? If so, HPageRestoreReserve isn't cleared yet as it's cleared right before set_huge_pte. Thus above
> check can be removed?

You are correct again. The above check can be removed.

Thanks!

I will remove them in a v2 series.
--
Mike Kravetz

>
> Anyway, the patch looks good to me.
>
> Reviewed-by: Miaohe Lin <[email protected]>
>
> Thanks,
> Miaohe Lin
>

2022-08-29 22:29:50

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 6/8] hugetlb: add vma based lock for pmd sharing

On 08/27/22 17:30, Miaohe Lin wrote:
> On 2022/8/25 1:57, Mike Kravetz wrote:
> > Allocate a rw semaphore and hang off vm_private_data for
> > synchronization use by vmas that could be involved in pmd sharing. Only
> > add infrastructure for the new lock here. Actual use will be added in
> > subsequent patch.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
>
> <snip>
>
> > +static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
> > +{
> > + /*
> > + * Only present in sharable vmas. See comment in
> > + * __unmap_hugepage_range_final about the neeed to check both
>
> s/neeed/need/
>
> > + * VM_SHARED and VM_MAYSHARE in free path
>
> I think there might be some wrong checks around this patch. As above comment said, we
> need to check both flags, so we should do something like below instead?
>
> if (!((vma->vm_flags & (VM_MAYSHARE | VM_SHARED)) == (VM_MAYSHARE | VM_SHARED)))
>
> > + */

Thanks. I will update.

> > + if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
> > + return;
> > +
> > + if (vma->vm_private_data) {
> > + kfree(vma->vm_private_data);
> > + vma->vm_private_data = NULL;
> > + }
> > +}
> > +
> > +static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
> > +{
> > + struct rw_semaphore *vma_sema;
> > +
> > + /* Only establish in (flags) sharable vmas */
> > + if (!vma || !(vma->vm_flags & VM_MAYSHARE))
> > + return;
> > +
> > + /* Should never get here with non-NULL vm_private_data */
>
> We can get here with non-NULL vm_private_data when called from hugetlb_vm_op_open during fork?

Right!

In fork, we allocate a new semaphore in hugetlb_dup_vma_private, and then
shortly after call hugetlb_vm_op_open.

It works as is, and I can update the comment. However, I wonder if we should
just clear vm_private_data in hugetlb_dup_vma_private and let hugetlb_vm_op_open
do the allocation.
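
Roughly, that alternative would look like this (sketch only, using the helper names
from this series; not a tested implementation):

static void hugetlb_dup_vma_private(struct vm_area_struct *vma)
{
	/* Never carry the parent's pointer into the child at fork time. */
	vma->vm_private_data = NULL;
}

/* ...and then, at the end of hugetlb_vm_op_open(): */

	/* Allocate a fresh per-vma lock for sharable mappings. */
	hugetlb_vma_lock_alloc(vma);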

>
> Also there's one missing change on comment:
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index d0617d64d718..4bc844a1d312 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -863,7 +863,7 @@ __weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
> * faults in a MAP_PRIVATE mapping. Only the process that called mmap()
> * is guaranteed to have their future faults succeed.
> *
> - * With the exception of reset_vma_resv_huge_pages() which is called at fork(),
> + * With the exception of hugetlb_dup_vma_private() which is called at fork(),
> * the reserve counters are updated with the hugetlb_lock held. It is safe
> * to reset the VMA at fork() time as it is not in use yet and there is no
> * chance of the global counters getting corrupted as a result of the values.
>
>
> Otherwise this patch looks good to me. Thanks.

Will update, Thank you!
--
Mike Kravetz

2022-08-29 23:02:24

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

On 08/29/22 10:44, Miaohe Lin wrote:
> On 2022/8/25 1:57, Mike Kravetz wrote:
> > Create the new routine hugetlb_unmap_file_folio that will unmap a single
> > file folio. This is refactored code from hugetlb_vmdelete_list. It is
> > modified to do locking within the routine itself and check whether the
> > page is mapped within a specific vma before unmapping.
> >
> > This refactoring will be put to use and expanded upon in a subsequent
> > patch adding vma specific locking.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
> > 1 file changed, 94 insertions(+), 29 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index e83fd31671b3..b93d131b0cb5 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
> > delete_from_page_cache(page);
> > }
> >
> > +/*
> > + * Called with i_mmap_rwsem held for inode based vma maps. This makes
> > + * sure vma (and vm_mm) will not go away. We also hold the hugetlb fault
> > + * mutex for the page in the mapping. So, we can not race with page being
> > + * faulted into the vma.
> > + */
> > +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> > + unsigned long addr, struct page *page)
> > +{
> > + pte_t *ptep, pte;
> > +
> > + ptep = huge_pte_offset(vma->vm_mm, addr,
> > + huge_page_size(hstate_vma(vma)));
> > +
> > + if (!ptep)
> > + return false;
> > +
> > + pte = huge_ptep_get(ptep);
> > + if (huge_pte_none(pte) || !pte_present(pte))
> > + return false;
> > +
> > + if (pte_page(pte) == page)
> > + return true;
>
> I'm thinking whether pte entry could change after we check it since huge_pte_lock is not held here.
> But I think holding i_mmap_rwsem in writelock mode should give us such a guarantee, e.g. migration
> entry is changed back to huge pte entry while holding i_mmap_rwsem in readlock mode.
> Or am I miss something?

Let me think about this. I do not think it is possible, but you ask good
questions.

Do note that this is the same locking sequence used at the beginning of the
page fault code where the decision to call hugetlb_no_page() is made.
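
For reference, that sequence at the top of hugetlb_fault() looks roughly like this
(simplified sketch, not an exact excerpt; helper names may differ slightly by tree):

	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
	...
	/* serialize with truncation and with other faults on this index */
	hash = hugetlb_fault_mutex_hash(mapping, idx);
	mutex_lock(&hugetlb_fault_mutex_table[hash]);

	entry = huge_ptep_get(ptep);
	if (huge_pte_none_mostly(entry)) {
		/* same lockless read that hugetlb_vma_maps_page() relies on */
		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
				      entry, flags);
		goto out_mutex;
	}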

>
> > +
> > + return false;
> > +}
> > +
> > +/*
> > + * Can vma_offset_start/vma_offset_end overflow on 32-bit arches?
> > + * No, because the interval tree returns us only those vmas
> > + * which overlap the truncated area starting at pgoff,
> > + * and no vma on a 32-bit arch can span beyond the 4GB.
> > + */
> > +static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
> > +{
> > + if (vma->vm_pgoff < start)
> > + return (start - vma->vm_pgoff) << PAGE_SHIFT;
> > + else
> > + return 0;
> > +}
> > +
> > +static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
> > +{
> > + unsigned long t_end;
> > +
> > + if (!end)
> > + return vma->vm_end;
> > +
> > + t_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
> > + if (t_end > vma->vm_end)
> > + t_end = vma->vm_end;
> > + return t_end;
> > +}
> > +
> > +/*
> > + * Called with hugetlb fault mutex held. Therefore, no more mappings to
> > + * this folio can be created while executing the routine.
> > + */
> > +static void hugetlb_unmap_file_folio(struct hstate *h,
> > + struct address_space *mapping,
> > + struct folio *folio, pgoff_t index)
> > +{
> > + struct rb_root_cached *root = &mapping->i_mmap;
> > + struct page *page = &folio->page;
> > + struct vm_area_struct *vma;
> > + unsigned long v_start;
> > + unsigned long v_end;
> > + pgoff_t start, end;
> > +
> > + start = index * pages_per_huge_page(h);
> > + end = ((index + 1) * pages_per_huge_page(h));
>
> It seems the outer parentheses is unneeded?

Correct. Thanks.
--
Mike Kravetz

>
> Reviewed-by: Miaohe Lin <[email protected]>
>
> Thanks,
> Miaohe Lin

2022-08-30 02:10:38

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 2022/8/25 1:57, Mike Kravetz wrote:
> The new hugetlb vma lock (rw semaphore) is used to address this race:
>
> Faulting thread                                 Unsharing thread
> ...                                                  ...
> ptep = huge_pte_offset()
>   or
> ptep = huge_pte_alloc()
> ...
>                                                 i_mmap_lock_write
>                                                 lock page table
> ptep invalid   <------------------------        huge_pmd_unshare()
> Could be in a previously                        unlock_page_table
> sharing process or worse                        i_mmap_unlock_write
> ...
>
> The vma_lock is used as follows:
> - During fault processing. the lock is acquired in read mode before
> doing a page table lock and allocation (huge_pte_alloc). The lock is
> held until code is finished with the page table entry (ptep).
> - The lock must be held in write mode whenever huge_pmd_unshare is
> called.
>
> Lock ordering issues come into play when unmapping a page from all
> vmas mapping the page. The i_mmap_rwsem must be held to search for the
> vmas, and the vma lock must be held before calling unmap which will
> call huge_pmd_unshare. This is done today in:
> - try_to_migrate_one and try_to_unmap_ for page migration and memory
> error handling. In these routines we 'try' to obtain the vma lock and
> fail to unmap if unsuccessful. Calling routines already deal with the
> failure of unmapping.
> - hugetlb_vmdelete_list for truncation and hole punch. This routine
> also tries to acquire the vma lock. If it fails, it skips the
> unmapping. However, we can not have file truncation or hole punch
> fail because of contention. After hugetlb_vmdelete_list, truncation
> and hole punch call remove_inode_hugepages. remove_inode_hugepages
> check for mapped pages and call hugetlb_unmap_file_page to unmap them.
> hugetlb_unmap_file_page is designed to drop locks and reacquire in the
> correct order to guarantee unmap success.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
> mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
> mm/memory.c | 2 +
> mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
> mm/userfaultfd.c | 9 +++-
> 5 files changed, 214 insertions(+), 45 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index b93d131b0cb5..52d9b390389b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> struct folio *folio, pgoff_t index)
> {
> struct rb_root_cached *root = &mapping->i_mmap;
> + unsigned long skipped_vm_start;
> + struct mm_struct *skipped_mm;
> struct page *page = &folio->page;
> struct vm_area_struct *vma;
> unsigned long v_start;
> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> end = ((index + 1) * pages_per_huge_page(h));
>
> i_mmap_lock_write(mapping);
> +retry:
> + skipped_mm = NULL;
>
> vma_interval_tree_foreach(vma, root, start, end - 1) {
> v_start = vma_offset_start(vma, start);
> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> continue;
>
> + if (!hugetlb_vma_trylock_write(vma)) {
> + /*
> + * If we can not get vma lock, we need to drop
> + * immap_sema and take locks in order.
> + */
> + skipped_vm_start = vma->vm_start;
> + skipped_mm = vma->vm_mm;
> + /* grab mm-struct as we will be dropping i_mmap_sema */
> + mmgrab(skipped_mm);
> + break;
> + }
> +
> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> NULL, ZAP_FLAG_DROP_MARKER);
> + hugetlb_vma_unlock_write(vma);
> }
>
> i_mmap_unlock_write(mapping);
> +
> + if (skipped_mm) {
> + mmap_read_lock(skipped_mm);
> + vma = find_vma(skipped_mm, skipped_vm_start);
> + if (!vma || !is_vm_hugetlb_page(vma) ||
> + vma->vm_file->f_mapping != mapping ||
> + vma->vm_start != skipped_vm_start) {

Is i_mmap_lock_write(mapping) missing here? The retry logic will do i_mmap_unlock_write(mapping) anyway.

> + mmap_read_unlock(skipped_mm);
> + mmdrop(skipped_mm);
> + goto retry;
> + }
> +

IMHO, the above check is not enough. Think about the scenario below:

CPU 1                                  CPU 2
hugetlb_unmap_file_folio               exit_mmap
  mmap_read_lock(skipped_mm);            mmap_read_lock(mm);
  check vma is wanted.
                                         unmap_vmas
  mmap_read_unlock(skipped_mm);          mmap_read_unlock
                                         mmap_write_lock(mm);
                                         free_pgtables
                                         remove_vma
                                           hugetlb_vma_lock_free
  vma, hugetlb_vma_lock is still *used after free*
                                         mmap_write_unlock(mm);
So we should check mm->mm_users == 0 to fix the above issue. Or am I missing something?

> + hugetlb_vma_lock_write(vma);
> + i_mmap_lock_write(mapping);
> + mmap_read_unlock(skipped_mm);
> + mmdrop(skipped_mm);
> +
> + v_start = vma_offset_start(vma, start);
> + v_end = vma_offset_end(vma, end);
> + unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> + NULL, ZAP_FLAG_DROP_MARKER);
> + hugetlb_vma_unlock_write(vma);
> +
> + goto retry;

Should there be a cond_resched() here in case this function takes a really long time?

> + }
> }
>
> static void
> @@ -474,11 +516,15 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> unsigned long v_start;
> unsigned long v_end;
>
> + if (!hugetlb_vma_trylock_write(vma))
> + continue;
> +
> v_start = vma_offset_start(vma, start);
> v_end = vma_offset_end(vma, end);
>
> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> NULL, zap_flags);
> + hugetlb_vma_unlock_write(vma);
> }

unmap_hugepage_range is not called under the hugetlb_vma_lock in unmap_ref_private since it's a private vma?
Could a comment be added to avoid future confusion?

> }
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6fb0bff2c7ee..5912c2b97ddf 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4801,6 +4801,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> mmu_notifier_invalidate_range_start(&range);
> mmap_assert_write_locked(src);
> raw_write_seqcount_begin(&src->write_protect_seq);
> + } else {
> + /*
> + * For shared mappings the vma lock must be held before
> + * calling huge_pte_offset in the src vma. Otherwise, the

s/huge_pte_offset/huge_pte_alloc/, i.e. it is huge_pte_alloc that could return a shared pmd, not huge_pte_offset, which
might lead to confusion. But this is really trivial...

Except for the above comments, this patch looks good to me.

Thanks,
Miaohe Lin

2022-08-30 03:10:47

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 6/8] hugetlb: add vma based lock for pmd sharing

On 2022/8/30 6:24, Mike Kravetz wrote:
> On 08/27/22 17:30, Miaohe Lin wrote:
>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>> Allocate a rw semaphore and hang off vm_private_data for
>>> synchronization use by vmas that could be involved in pmd sharing. Only
>>> add infrastructure for the new lock here. Actual use will be added in
>>> subsequent patch.
>>>
>>> Signed-off-by: Mike Kravetz <[email protected]>
>>
>> <snip>
>>
>>> +static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>>> +{
>>> + /*
>>> + * Only present in sharable vmas. See comment in
>>> + * __unmap_hugepage_range_final about the neeed to check both
>>
>> s/neeed/need/
>>
>>> + * VM_SHARED and VM_MAYSHARE in free path
>>
>> I think there might be some wrong checks around this patch. As above comment said, we
>> need to check both flags, so we should do something like below instead?
>>
>> if (!((vma->vm_flags & (VM_MAYSHARE | VM_SHARED)) == (VM_MAYSHARE | VM_SHARED)))
>>
>>> + */
>
> Thanks. I will update.
>
>>> + if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
>>> + return;
>>> +
>>> + if (vma->vm_private_data) {
>>> + kfree(vma->vm_private_data);
>>> + vma->vm_private_data = NULL;
>>> + }
>>> +}
>>> +
>>> +static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
>>> +{
>>> + struct rw_semaphore *vma_sema;
>>> +
>>> + /* Only establish in (flags) sharable vmas */
>>> + if (!vma || !(vma->vm_flags & VM_MAYSHARE))
>>> + return;
>>> +
>>> + /* Should never get here with non-NULL vm_private_data */
>>
>> We can get here with non-NULL vm_private_data when called from hugetlb_vm_op_open during fork?
>
> Right!
>
> In fork, We allocate a new semaphore in hugetlb_dup_vma_private, and then
> shortly after call hugetlb_vm_op_open.
>
> It works as is, and I can update the comment. However, I wonder if we should
> just clear vm_private_data in hugetlb_dup_vma_private and let hugetlb_vm_op_open
> do the allocation.

I think it's a good idea. We could also avoid allocating memory for the vma_lock (via clear_vma_resv_huge_pages())
only to then free the corresponding vma right away (via do_munmap()) in move_vma(). But maybe I'm missing something.

Thanks,
Miaohe Lin

>
>>
>> Also there's one missing change on comment:
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index d0617d64d718..4bc844a1d312 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -863,7 +863,7 @@ __weak unsigned long vma_mmu_pagesize(struct vm_area_struct *vma)
>> * faults in a MAP_PRIVATE mapping. Only the process that called mmap()
>> * is guaranteed to have their future faults succeed.
>> *
>> - * With the exception of reset_vma_resv_huge_pages() which is called at fork(),
>> + * With the exception of hugetlb_dup_vma_private() which is called at fork(),
>> * the reserve counters are updated with the hugetlb_lock held. It is safe
>> * to reset the VMA at fork() time as it is not in use yet and there is no
>> * chance of the global counters getting corrupted as a result of the values.
>>
>>
>> Otherwise this patch looks good to me. Thanks.
>
> Will update, Thank you!
>

2022-08-30 03:14:50

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

On 2022/8/30 6:37, Mike Kravetz wrote:
> On 08/29/22 10:44, Miaohe Lin wrote:
>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>> Create the new routine hugetlb_unmap_file_folio that will unmap a single
>>> file folio. This is refactored code from hugetlb_vmdelete_list. It is
>>> modified to do locking within the routine itself and check whether the
>>> page is mapped within a specific vma before unmapping.
>>>
>>> This refactoring will be put to use and expanded upon in a subsequent
>>> patch adding vma specific locking.
>>>
>>> Signed-off-by: Mike Kravetz <[email protected]>
>>> ---
>>> fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
>>> 1 file changed, 94 insertions(+), 29 deletions(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index e83fd31671b3..b93d131b0cb5 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
>>> delete_from_page_cache(page);
>>> }
>>>
>>> +/*
>>> + * Called with i_mmap_rwsem held for inode based vma maps. This makes
>>> + * sure vma (and vm_mm) will not go away. We also hold the hugetlb fault
>>> + * mutex for the page in the mapping. So, we can not race with page being
>>> + * faulted into the vma.
>>> + */
>>> +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
>>> + unsigned long addr, struct page *page)
>>> +{
>>> + pte_t *ptep, pte;
>>> +
>>> + ptep = huge_pte_offset(vma->vm_mm, addr,
>>> + huge_page_size(hstate_vma(vma)));
>>> +
>>> + if (!ptep)
>>> + return false;
>>> +
>>> + pte = huge_ptep_get(ptep);
>>> + if (huge_pte_none(pte) || !pte_present(pte))
>>> + return false;
>>> +
>>> + if (pte_page(pte) == page)
>>> + return true;
>>
>> I'm thinking whether pte entry could change after we check it since huge_pte_lock is not held here.
>> But I think holding i_mmap_rwsem in writelock mode should give us such a guarantee, e.g. migration
>> entry is changed back to huge pte entry while holding i_mmap_rwsem in readlock mode.
>> Or am I miss something?
>
> Let me think about this. I do not think it is possible, but you ask good
> questions.
>
> Do note that this is the same locking sequence used at the beginning of the
> page fault code where the decision to call hugetlb_no_page() is made.

Yes, hugetlb_fault() can tolerate a stale pte entry because the pte entry will be re-checked later under the page table lock.
However, if we see a stale pte entry here, the page might be left over after truncation and thus break truncation? But I'm not
sure whether this can occur. Maybe the i_mmap_rwsem write lock and hugetlb_fault_mutex can prevent this issue.

Thanks,
Miaohe Lin


>
>>
>>> +
>>> + return false;
>>> +}
>>> +
>>> +/*
>>> + * Can vma_offset_start/vma_offset_end overflow on 32-bit arches?
>>> + * No, because the interval tree returns us only those vmas
>>> + * which overlap the truncated area starting at pgoff,
>>> + * and no vma on a 32-bit arch can span beyond the 4GB.
>>> + */
>>> +static unsigned long vma_offset_start(struct vm_area_struct *vma, pgoff_t start)
>>> +{
>>> + if (vma->vm_pgoff < start)
>>> + return (start - vma->vm_pgoff) << PAGE_SHIFT;
>>> + else
>>> + return 0;
>>> +}
>>> +
>>> +static unsigned long vma_offset_end(struct vm_area_struct *vma, pgoff_t end)
>>> +{
>>> + unsigned long t_end;
>>> +
>>> + if (!end)
>>> + return vma->vm_end;
>>> +
>>> + t_end = ((end - vma->vm_pgoff) << PAGE_SHIFT) + vma->vm_start;
>>> + if (t_end > vma->vm_end)
>>> + t_end = vma->vm_end;
>>> + return t_end;
>>> +}
>>> +
>>> +/*
>>> + * Called with hugetlb fault mutex held. Therefore, no more mappings to
>>> + * this folio can be created while executing the routine.
>>> + */
>>> +static void hugetlb_unmap_file_folio(struct hstate *h,
>>> + struct address_space *mapping,
>>> + struct folio *folio, pgoff_t index)
>>> +{
>>> + struct rb_root_cached *root = &mapping->i_mmap;
>>> + struct page *page = &folio->page;
>>> + struct vm_area_struct *vma;
>>> + unsigned long v_start;
>>> + unsigned long v_end;
>>> + pgoff_t start, end;
>>> +
>>> + start = index * pages_per_huge_page(h);
>>> + end = ((index + 1) * pages_per_huge_page(h));
>>
>> It seems the outer parentheses is unneeded?
>
> Correct. Thanks.
>

2022-09-02 21:41:24

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

On 08/30/22 10:46, Miaohe Lin wrote:
> On 2022/8/30 6:37, Mike Kravetz wrote:
> > On 08/29/22 10:44, Miaohe Lin wrote:
> >> On 2022/8/25 1:57, Mike Kravetz wrote:
> >>> Create the new routine hugetlb_unmap_file_folio that will unmap a single
> >>> file folio. This is refactored code from hugetlb_vmdelete_list. It is
> >>> modified to do locking within the routine itself and check whether the
> >>> page is mapped within a specific vma before unmapping.
> >>>
> >>> This refactoring will be put to use and expanded upon in a subsequent
> >>> patch adding vma specific locking.
> >>>
> >>> Signed-off-by: Mike Kravetz <[email protected]>
> >>> ---
> >>> fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
> >>> 1 file changed, 94 insertions(+), 29 deletions(-)
> >>>
> >>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >>> index e83fd31671b3..b93d131b0cb5 100644
> >>> --- a/fs/hugetlbfs/inode.c
> >>> +++ b/fs/hugetlbfs/inode.c
> >>> @@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
> >>> delete_from_page_cache(page);
> >>> }
> >>>
> >>> +/*
> >>> + * Called with i_mmap_rwsem held for inode based vma maps. This makes
> >>> + * sure vma (and vm_mm) will not go away. We also hold the hugetlb fault
> >>> + * mutex for the page in the mapping. So, we can not race with page being
> >>> + * faulted into the vma.
> >>> + */
> >>> +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> >>> + unsigned long addr, struct page *page)
> >>> +{
> >>> + pte_t *ptep, pte;
> >>> +
> >>> + ptep = huge_pte_offset(vma->vm_mm, addr,
> >>> + huge_page_size(hstate_vma(vma)));
> >>> +
> >>> + if (!ptep)
> >>> + return false;
> >>> +
> >>> + pte = huge_ptep_get(ptep);
> >>> + if (huge_pte_none(pte) || !pte_present(pte))
> >>> + return false;
> >>> +
> >>> + if (pte_page(pte) == page)
> >>> + return true;
> >>
> >> I'm thinking whether pte entry could change after we check it since huge_pte_lock is not held here.
> >> But I think holding i_mmap_rwsem in writelock mode should give us such a guarantee, e.g. migration
> >> entry is changed back to huge pte entry while holding i_mmap_rwsem in readlock mode.
> >> Or am I miss something?
> >
> > Let me think about this. I do not think it is possible, but you ask good
> > questions.
> >
> > Do note that this is the same locking sequence used at the beginning of the
> > page fault code where the decision to call hugetlb_no_page() is made.
>
> Yes, hugetlb_fault() can tolerate the stale pte entry because pte entry will be re-checked later under the page table lock.
> However if we see a stale pte entry here, the page might be leftover after truncated and thus break truncation? But I'm not
> sure whether this will occur. Maybe the i_mmap_rwsem writelock and hugetlb_fault_mutex can prevent this issue.
>

I looked at this some more. Just to be clear, we only need to worry
about modifications of pte_page(). Racing with other pte modifications
such as accessed bits or protection changes, is acceptable.

Of course, the fault mutex prevents faults from happening. i_mmap_rwsem
protects against unmap and truncation operations as well as migration as
you noted above. I believe the only other place where we update pte_page()
is when copying page tables, such as during fork. However, with commit
bcd51a3c679d "Lazy page table copies in fork()" we are going to skip
copying for files and rely on page faults to populate the tables.

I believe we are safe from races with just the fault mutex and i_mmap_rwsem.
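
Put as a comment (an editorial summary of the argument above, not text from the patch):

	/*
	 * The pte_page() value read without the page table lock in
	 * hugetlb_vma_maps_page() can not change under us:
	 *  - hugetlb fault mutex:  no new page can be faulted in at this index
	 *  - i_mmap_rwsem (write): excludes unmap, truncation and migration
	 *                          restoring a migration entry
	 *  - lazy fork copies:     fork no longer copies shared file mappings,
	 *                          so no updates from copy_hugetlb_page_range()
	 */
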
--
Mike Kravetz

2022-09-02 23:27:59

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 08/30/22 10:02, Miaohe Lin wrote:
> On 2022/8/25 1:57, Mike Kravetz wrote:
> > The new hugetlb vma lock (rw semaphore) is used to address this race:
> >
> > Faulting thread                                 Unsharing thread
> > ...                                                  ...
> > ptep = huge_pte_offset()
> >   or
> > ptep = huge_pte_alloc()
> > ...
> >                                                 i_mmap_lock_write
> >                                                 lock page table
> > ptep invalid   <------------------------        huge_pmd_unshare()
> > Could be in a previously                        unlock_page_table
> > sharing process or worse                        i_mmap_unlock_write
> > ...
> >
> > The vma_lock is used as follows:
> > - During fault processing. the lock is acquired in read mode before
> > doing a page table lock and allocation (huge_pte_alloc). The lock is
> > held until code is finished with the page table entry (ptep).
> > - The lock must be held in write mode whenever huge_pmd_unshare is
> > called.
> >
> > Lock ordering issues come into play when unmapping a page from all
> > vmas mapping the page. The i_mmap_rwsem must be held to search for the
> > vmas, and the vma lock must be held before calling unmap which will
> > call huge_pmd_unshare. This is done today in:
> > - try_to_migrate_one and try_to_unmap_ for page migration and memory
> > error handling. In these routines we 'try' to obtain the vma lock and
> > fail to unmap if unsuccessful. Calling routines already deal with the
> > failure of unmapping.
> > - hugetlb_vmdelete_list for truncation and hole punch. This routine
> > also tries to acquire the vma lock. If it fails, it skips the
> > unmapping. However, we can not have file truncation or hole punch
> > fail because of contention. After hugetlb_vmdelete_list, truncation
> > and hole punch call remove_inode_hugepages. remove_inode_hugepages
> > check for mapped pages and call hugetlb_unmap_file_page to unmap them.
> > hugetlb_unmap_file_page is designed to drop locks and reacquire in the
> > correct order to guarantee unmap success.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
> > mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
> > mm/memory.c | 2 +
> > mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
> > mm/userfaultfd.c | 9 +++-
> > 5 files changed, 214 insertions(+), 45 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index b93d131b0cb5..52d9b390389b 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> > struct folio *folio, pgoff_t index)
> > {
> > struct rb_root_cached *root = &mapping->i_mmap;
> > + unsigned long skipped_vm_start;
> > + struct mm_struct *skipped_mm;
> > struct page *page = &folio->page;
> > struct vm_area_struct *vma;
> > unsigned long v_start;
> > @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> > end = ((index + 1) * pages_per_huge_page(h));
> >
> > i_mmap_lock_write(mapping);
> > +retry:
> > + skipped_mm = NULL;
> >
> > vma_interval_tree_foreach(vma, root, start, end - 1) {
> > v_start = vma_offset_start(vma, start);
> > @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> > if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> > continue;
> >
> > + if (!hugetlb_vma_trylock_write(vma)) {
> > + /*
> > + * If we can not get vma lock, we need to drop
> > + * immap_sema and take locks in order.
> > + */
> > + skipped_vm_start = vma->vm_start;
> > + skipped_mm = vma->vm_mm;
> > + /* grab mm-struct as we will be dropping i_mmap_sema */
> > + mmgrab(skipped_mm);
> > + break;
> > + }
> > +
> > unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> > NULL, ZAP_FLAG_DROP_MARKER);
> > + hugetlb_vma_unlock_write(vma);
> > }
> >
> > i_mmap_unlock_write(mapping);
> > +
> > + if (skipped_mm) {
> > + mmap_read_lock(skipped_mm);
> > + vma = find_vma(skipped_mm, skipped_vm_start);
> > + if (!vma || !is_vm_hugetlb_page(vma) ||
> > + vma->vm_file->f_mapping != mapping ||
> > + vma->vm_start != skipped_vm_start) {
>
> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
>

Yes, that is missing. I will add here.

> > + mmap_read_unlock(skipped_mm);
> > + mmdrop(skipped_mm);
> > + goto retry;
> > + }
> > +
>
> IMHO, above check is not enough. Think about the below scene:
>
> CPU 1                                  CPU 2
> hugetlb_unmap_file_folio               exit_mmap
>   mmap_read_lock(skipped_mm);            mmap_read_lock(mm);
>   check vma is wanted.
>                                          unmap_vmas
>   mmap_read_unlock(skipped_mm);          mmap_read_unlock
>                                          mmap_write_lock(mm);
>                                          free_pgtables
>                                          remove_vma
>                                            hugetlb_vma_lock_free
>   vma, hugetlb_vma_lock is still *used after free*
>                                          mmap_write_unlock(mm);
> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?

In the retry case, we are OK because we go back and look up the vma again. Right?

After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
Before that, we do the following:

> > + hugetlb_vma_lock_write(vma);
> > + i_mmap_lock_write(mapping);

IIUC, the vma can not go away while we hold i_mmap_lock_write. So, after this we
can:

> > + mmap_read_unlock(skipped_mm);
> > + mmdrop(skipped_mm);

We continue to hold i_mmap_lock_write as we goto retry.

I could be missing something as well. This was how I intended to keep
vma valid while dropping and acquiring locks.
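
In outline, the intended lock dance is (a sketch assembled from the hunks quoted
above, with the re-validation details elided):

retry:
	/* fast path: i_mmap_rwsem held, try-lock each vma's lock */
	if (!hugetlb_vma_trylock_write(vma)) {
		skipped_vm_start = vma->vm_start;
		skipped_mm = vma->vm_mm;
		mmgrab(skipped_mm);		/* keep the mm_struct alive */
		break;
	}
	...
	i_mmap_unlock_write(mapping);

	/* slow path: take the locks in the required order */
	mmap_read_lock(skipped_mm);		/* pins the vma for now */
	vma = find_vma(skipped_mm, skipped_vm_start);
	/* ... re-validate the vma as in the patch ... */
	hugetlb_vma_lock_write(vma);		/* 1. vma lock */
	i_mmap_lock_write(mapping);		/* 2. i_mmap_rwsem, now pins the vma */
	mmap_read_unlock(skipped_mm);		/* safe to drop the mmap lock */
	mmdrop(skipped_mm);
	unmap_hugepage_range(vma, ..., ZAP_FLAG_DROP_MARKER);
	hugetlb_vma_unlock_write(vma);
	goto retry;				/* i_mmap_rwsem still held */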

> > +
> > + v_start = vma_offset_start(vma, start);
> > + v_end = vma_offset_end(vma, end);
> > + unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> > + NULL, ZAP_FLAG_DROP_MARKER);
> > + hugetlb_vma_unlock_write(vma);
> > +
> > + goto retry;
>
> Should here be one cond_resched() here in case this function will take a really long time?
>

I think we will at most retry once.

> > + }
> > }
> >
> > static void
> > @@ -474,11 +516,15 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> > unsigned long v_start;
> > unsigned long v_end;
> >
> > + if (!hugetlb_vma_trylock_write(vma))
> > + continue;
> > +
> > v_start = vma_offset_start(vma, start);
> > v_end = vma_offset_end(vma, end);
> >
> > unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> > NULL, zap_flags);
> > + hugetlb_vma_unlock_write(vma);
> > }
>
> unmap_hugepage_range is not called under hugetlb_vma_lock in unmap_ref_private since it's private vma?
> Add a comment to avoid future confusion?
>
> > }

Sure, will add a comment before hugetlb_vma_lock.
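
For example, a comment along these lines above the unmap_hugepage_range() call in
unmap_ref_private() (wherever it ends up; wording is illustrative only):

	/*
	 * No hugetlb vma lock is taken here: unmap_ref_private() only runs
	 * for MAP_PRIVATE copy-on-write mappings, and private mappings never
	 * share PMDs, so there is no huge_pmd_unshare() to race with.
	 */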

> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 6fb0bff2c7ee..5912c2b97ddf 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -4801,6 +4801,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
> > mmu_notifier_invalidate_range_start(&range);
> > mmap_assert_write_locked(src);
> > raw_write_seqcount_begin(&src->write_protect_seq);
> > + } else {
> > + /*
> > + * For shared mappings the vma lock must be held before
> > + * calling huge_pte_offset in the src vma. Otherwise, the
>
> s/huge_pte_offset/huge_pte_alloc/, i.e. huge_pte_alloc could return shared pmd, not huge_pte_offset which
> might lead to confusion. But this is really trivial...

Actually, it is huge_pte_offset. While looking up ptes in the source vma, we
do not want to race with other threads in the source process which could
be doing a huge_pmd_unshare. Otherwise, the returned pte could be invalid.

FYI - Most of this code is now 'dead' because of bcd51a3c679d "Lazy page table
copies in fork()". We will not copy shared mappings at fork time.

>
> Except from above comments, this patch looks good to me.
>

Thank you! Thank you! Thank you! For looking at this series and all
your comments. I hope to send out v2 next week.
--
Mike Kravetz

2022-09-05 02:48:59

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 7/8] hugetlb: create hugetlb_unmap_file_folio to unmap single file folio

On 2022/9/3 5:35, Mike Kravetz wrote:
> On 08/30/22 10:46, Miaohe Lin wrote:
>> On 2022/8/30 6:37, Mike Kravetz wrote:
>>> On 08/29/22 10:44, Miaohe Lin wrote:
>>>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>>>> Create the new routine hugetlb_unmap_file_folio that will unmap a single
>>>>> file folio. This is refactored code from hugetlb_vmdelete_list. It is
>>>>> modified to do locking within the routine itself and check whether the
>>>>> page is mapped within a specific vma before unmapping.
>>>>>
>>>>> This refactoring will be put to use and expanded upon in a subsequent
>>>>> patch adding vma specific locking.
>>>>>
>>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>>> ---
>>>>> fs/hugetlbfs/inode.c | 123 +++++++++++++++++++++++++++++++++----------
>>>>> 1 file changed, 94 insertions(+), 29 deletions(-)
>>>>>
>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>> index e83fd31671b3..b93d131b0cb5 100644
>>>>> --- a/fs/hugetlbfs/inode.c
>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>> @@ -371,6 +371,94 @@ static void hugetlb_delete_from_page_cache(struct page *page)
>>>>> delete_from_page_cache(page);
>>>>> }
>>>>>
>>>>> +/*
>>>>> + * Called with i_mmap_rwsem held for inode based vma maps. This makes
>>>>> + * sure vma (and vm_mm) will not go away. We also hold the hugetlb fault
>>>>> + * mutex for the page in the mapping. So, we can not race with page being
>>>>> + * faulted into the vma.
>>>>> + */
>>>>> +static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
>>>>> + unsigned long addr, struct page *page)
>>>>> +{
>>>>> + pte_t *ptep, pte;
>>>>> +
>>>>> + ptep = huge_pte_offset(vma->vm_mm, addr,
>>>>> + huge_page_size(hstate_vma(vma)));
>>>>> +
>>>>> + if (!ptep)
>>>>> + return false;
>>>>> +
>>>>> + pte = huge_ptep_get(ptep);
>>>>> + if (huge_pte_none(pte) || !pte_present(pte))
>>>>> + return false;
>>>>> +
>>>>> + if (pte_page(pte) == page)
>>>>> + return true;
>>>>
>>>> I'm thinking whether pte entry could change after we check it since huge_pte_lock is not held here.
>>>> But I think holding i_mmap_rwsem in writelock mode should give us such a guarantee, e.g. migration
>>>> entry is changed back to huge pte entry while holding i_mmap_rwsem in readlock mode.
>>>> Or am I miss something?
>>>
>>> Let me think about this. I do not think it is possible, but you ask good
>>> questions.
>>>
>>> Do note that this is the same locking sequence used at the beginning of the
>>> page fault code where the decision to call hugetlb_no_page() is made.
>>
>> Yes, hugetlb_fault() can tolerate the stale pte entry because pte entry will be re-checked later under the page table lock.
>> However if we see a stale pte entry here, the page might be leftover after truncated and thus break truncation? But I'm not
>> sure whether this will occur. Maybe the i_mmap_rwsem writelock and hugetlb_fault_mutex can prevent this issue.
>>
>
> I looked at this some more. Just to be clear, we only need to worry
> about modifications of pte_page(). Racing with other pte modifications
> such as accessed, or protection changes is acceptable.
>
> Of course, the fault mutex prevents faults from happening. i_mmap_rwsem
> protects against unmap and truncation operations as well as migration as
> you noted above. I believe the only other place where we update pte_page()
> is when copying page table such as during fork. However, with commit
> bcd51a3c679d "Lazy page table copies in fork()" we are going to skip
> copying for files and rely on page faults to populate the tables.
>
> I believe we are safe from races with just the fault mutex and i_mmap_rwsem.

I believe your analysis is right. Thanks for clarifying.

Thanks,
Miaohe Lin


2022-09-05 03:32:09

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 2022/9/3 7:07, Mike Kravetz wrote:
> On 08/30/22 10:02, Miaohe Lin wrote:
>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
>>>
>>> Faulting thread                                 Unsharing thread
>>> ...                                                  ...
>>> ptep = huge_pte_offset()
>>>   or
>>> ptep = huge_pte_alloc()
>>> ...
>>>                                                 i_mmap_lock_write
>>>                                                 lock page table
>>> ptep invalid   <------------------------        huge_pmd_unshare()
>>> Could be in a previously                        unlock_page_table
>>> sharing process or worse                        i_mmap_unlock_write
>>> ...
>>>
>>> The vma_lock is used as follows:
>>> - During fault processing. the lock is acquired in read mode before
>>> doing a page table lock and allocation (huge_pte_alloc). The lock is
>>> held until code is finished with the page table entry (ptep).
>>> - The lock must be held in write mode whenever huge_pmd_unshare is
>>> called.
>>>
>>> Lock ordering issues come into play when unmapping a page from all
>>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
>>> vmas, and the vma lock must be held before calling unmap which will
>>> call huge_pmd_unshare. This is done today in:
>>> - try_to_migrate_one and try_to_unmap_ for page migration and memory
>>> error handling. In these routines we 'try' to obtain the vma lock and
>>> fail to unmap if unsuccessful. Calling routines already deal with the
>>> failure of unmapping.
>>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
>>> also tries to acquire the vma lock. If it fails, it skips the
>>> unmapping. However, we can not have file truncation or hole punch
>>> fail because of contention. After hugetlb_vmdelete_list, truncation
>>> and hole punch call remove_inode_hugepages. remove_inode_hugepages
>>> check for mapped pages and call hugetlb_unmap_file_page to unmap them.
>>> hugetlb_unmap_file_page is designed to drop locks and reacquire in the
>>> correct order to guarantee unmap success.
>>>
>>> Signed-off-by: Mike Kravetz <[email protected]>
>>> ---
>>> fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
>>> mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
>>> mm/memory.c | 2 +
>>> mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
>>> mm/userfaultfd.c | 9 +++-
>>> 5 files changed, 214 insertions(+), 45 deletions(-)
>>>
>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>> index b93d131b0cb5..52d9b390389b 100644
>>> --- a/fs/hugetlbfs/inode.c
>>> +++ b/fs/hugetlbfs/inode.c
>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>> struct folio *folio, pgoff_t index)
>>> {
>>> struct rb_root_cached *root = &mapping->i_mmap;
>>> + unsigned long skipped_vm_start;
>>> + struct mm_struct *skipped_mm;
>>> struct page *page = &folio->page;
>>> struct vm_area_struct *vma;
>>> unsigned long v_start;
>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>> end = ((index + 1) * pages_per_huge_page(h));
>>>
>>> i_mmap_lock_write(mapping);
>>> +retry:
>>> + skipped_mm = NULL;
>>>
>>> vma_interval_tree_foreach(vma, root, start, end - 1) {
>>> v_start = vma_offset_start(vma, start);
>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>> continue;
>>>
>>> + if (!hugetlb_vma_trylock_write(vma)) {
>>> + /*
>>> + * If we can not get vma lock, we need to drop
>>> + * immap_sema and take locks in order.
>>> + */
>>> + skipped_vm_start = vma->vm_start;
>>> + skipped_mm = vma->vm_mm;
>>> + /* grab mm-struct as we will be dropping i_mmap_sema */
>>> + mmgrab(skipped_mm);
>>> + break;
>>> + }
>>> +
>>> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>> NULL, ZAP_FLAG_DROP_MARKER);
>>> + hugetlb_vma_unlock_write(vma);
>>> }
>>>
>>> i_mmap_unlock_write(mapping);
>>> +
>>> + if (skipped_mm) {
>>> + mmap_read_lock(skipped_mm);
>>> + vma = find_vma(skipped_mm, skipped_vm_start);
>>> + if (!vma || !is_vm_hugetlb_page(vma) ||
>>> + vma->vm_file->f_mapping != mapping ||
>>> + vma->vm_start != skipped_vm_start) {
>>
>> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
>>
>
> Yes, that is missing. I will add here.
>
>>> + mmap_read_unlock(skipped_mm);
>>> + mmdrop(skipped_mm);
>>> + goto retry;
>>> + }
>>> +
>>
>> IMHO, above check is not enough. Think about the below scene:
>>
>> CPU 1                                  CPU 2
>> hugetlb_unmap_file_folio               exit_mmap
>>   mmap_read_lock(skipped_mm);            mmap_read_lock(mm);
>>   check vma is wanted.
>>                                          unmap_vmas
>>   mmap_read_unlock(skipped_mm);          mmap_read_unlock
>>                                          mmap_write_lock(mm);
>>                                          free_pgtables
>>                                          remove_vma
>>                                            hugetlb_vma_lock_free
>>   vma, hugetlb_vma_lock is still *used after free*
>>                                          mmap_write_unlock(mm);
>> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?
>
> In the retry case, we are OK because go back and look up the vma again. Right?
>
> After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
> Before that, we do the following:
>
>>> + hugetlb_vma_lock_write(vma);
>>> + i_mmap_lock_write(mapping);
>
> IIUC, vma can not go away while we hold i_mmap_lock_write. So, after this we

I think you're right. free_pgtables() can't complete its work as unlink_file_vma() will be
blocked on the i_mmap_rwsem of the mapping. Sorry for reporting such a nonexistent race.

> can.
>
>>> + mmap_read_unlock(skipped_mm);
>>> + mmdrop(skipped_mm);
>
> We continue to hold i_mmap_lock_write as we goto retry.
>
> I could be missing something as well. This was how I intended to keep
> vma valid while dropping and acquiring locks.

Thanks for clarifying.

>
>>> +
>>> + v_start = vma_offset_start(vma, start);
>>> + v_end = vma_offset_end(vma, end);
>>> + unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>> + NULL, ZAP_FLAG_DROP_MARKER);
>>> + hugetlb_vma_unlock_write(vma);
>>> +
>>> + goto retry;
>>
>> Should here be one cond_resched() here in case this function will take a really long time?
>>
>
> I think we will at most retry once.

I see. It should be acceptable.

>
>>> + }
>>> }
>>>
>>> static void
>>> @@ -474,11 +516,15 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
>>> unsigned long v_start;
>>> unsigned long v_end;
>>>
>>> + if (!hugetlb_vma_trylock_write(vma))
>>> + continue;
>>> +
>>> v_start = vma_offset_start(vma, start);
>>> v_end = vma_offset_end(vma, end);
>>>
>>> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>> NULL, zap_flags);
>>> + hugetlb_vma_unlock_write(vma);
>>> }
>>
>> unmap_hugepage_range is not called under hugetlb_vma_lock in unmap_ref_private since it's private vma?
>> Add a comment to avoid future confusion?
>>
>>> }
>
> Sure, will add a comment before hugetlb_vma_lock.
>
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 6fb0bff2c7ee..5912c2b97ddf 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4801,6 +4801,14 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>>> mmu_notifier_invalidate_range_start(&range);
>>> mmap_assert_write_locked(src);
>>> raw_write_seqcount_begin(&src->write_protect_seq);
>>> + } else {
>>> + /*
>>> + * For shared mappings the vma lock must be held before
>>> + * calling huge_pte_offset in the src vma. Otherwise, the
>>
>> s/huge_pte_offset/huge_pte_alloc/, i.e. it is huge_pte_alloc that could return a shared pmd, not
>> huge_pte_offset, which might lead to confusion. But this is really trivial...
>
> Actually, it is huge_pte_offset. While looking up ptes in the source vma, we
> do not want to race with other threads in the source process which could
> be doing a huge_pmd_unshare. Otherwise, the returned pte could be invalid.
>
> FYI - Most of this code is now 'dead' because of bcd51a3c679d "Lazy page table
> copies in fork()". We will not copy shared mappings at fork time.

Agree. Should this "dead" code be removed later?

Thanks,
Miaohe Lin


>
>>
>> Except for the above comments, this patch looks good to me.
>>
>
> Thank you! Thank you! Thank you! For looking at this series and all
> your comments. I hope to send out v2 next week.
>

2022-09-06 15:27:20

by Sven Schnelle

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

Hi Mike,

Mike Kravetz <[email protected]> writes:

> When page fault code needs to allocate and instantiate a new hugetlb
> page (huegtlb_no_page), it checks early to determine if the fault is
> beyond i_size. When discovered early, it is easy to abort the fault and
> return an error. However, it becomes much more difficult to handle when
> discovered later after allocating the page and consuming reservations
> and adding to the page cache. Backing out changes in such instances
> becomes difficult and error prone.
>
> Instead of trying to catch and backout all such races, use the hugetlb
> fault mutex to handle truncate racing with page faults. The most
> significant change is modification of the routine remove_inode_hugepages
> such that it will take the fault mutex for EVERY index in the truncated
> range (or hole in the case of hole punch). Since remove_inode_hugepages
> is called in the truncate path after updating i_size, we can experience
> races as follows.
> - truncate code updates i_size and takes fault mutex before a racing
> fault. After fault code takes mutex, it will notice fault beyond
> i_size and abort early.
> - fault code obtains mutex, and truncate updates i_size after early
> checks in fault code. fault code will add page beyond i_size.
> When truncate code takes mutex for page/index, it will remove the
> page.
> - truncate updates i_size, but fault code obtains mutex first. If
> fault code sees updated i_size it will abort early. If fault code
> does not see updated i_size, it will add page beyond i_size and
> truncate code will remove page when it obtains fault mutex.
>
> Note, for performance reasons remove_inode_hugepages will still use
> filemap_get_folios for bulk folio lookups. For indicies not returned in
> the bulk lookup, it will need to lookup individual folios to check for
> races with page fault.
>
> Signed-off-by: Mike Kravetz <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> mm/hugetlb.c | 41 +++++-----
> 2 files changed, 152 insertions(+), 73 deletions(-)

With linux-next starting from next-20220831 I see hangs with this
patch applied while running the glibc test suite. The patch doesn't
revert cleanly on top, so I checked out one commit before that one and
with that revision everything works.

It looks like the malloc test suite in glibc triggers this. I cannot
identify a single test causing it; it seems to be the combination of
multiple tests. Running the test suite on a single CPU works. Given the
subject of the patch, that's likely not a surprise.

This is on s390, and the warning I get from RCU is:

[ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
[ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
[ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
[ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
[ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
[ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
[ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
[ 1951.907095] Call Trace:
[ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
[ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
[ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
[ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
[ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
[ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
[ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
[ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
[ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
[ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
[ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
[ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
[ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
[ 1951.907145] Last Breaking-Event-Address:
[ 1951.907146] [<0000038001d839c0>] 0x38001d839c0

One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.

Any thoughts?

Thanks,
Sven

2022-09-06 18:23:25

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 09/06/22 15:57, Sven Schnelle wrote:
> Hi Mike,
>
> Mike Kravetz <[email protected]> writes:
>
> > When page fault code needs to allocate and instantiate a new hugetlb
> > page (huegtlb_no_page), it checks early to determine if the fault is
> > beyond i_size. When discovered early, it is easy to abort the fault and
> > return an error. However, it becomes much more difficult to handle when
> > discovered later after allocating the page and consuming reservations
> > and adding to the page cache. Backing out changes in such instances
> > becomes difficult and error prone.
> >
> > Instead of trying to catch and backout all such races, use the hugetlb
> > fault mutex to handle truncate racing with page faults. The most
> > significant change is modification of the routine remove_inode_hugepages
> > such that it will take the fault mutex for EVERY index in the truncated
> > range (or hole in the case of hole punch). Since remove_inode_hugepages
> > is called in the truncate path after updating i_size, we can experience
> > races as follows.
> > - truncate code updates i_size and takes fault mutex before a racing
> > fault. After fault code takes mutex, it will notice fault beyond
> > i_size and abort early.
> > - fault code obtains mutex, and truncate updates i_size after early
> > checks in fault code. fault code will add page beyond i_size.
> > When truncate code takes mutex for page/index, it will remove the
> > page.
> > - truncate updates i_size, but fault code obtains mutex first. If
> > fault code sees updated i_size it will abort early. If fault code
> > does not see updated i_size, it will add page beyond i_size and
> > truncate code will remove page when it obtains fault mutex.
> >
> > Note, for performance reasons remove_inode_hugepages will still use
> > filemap_get_folios for bulk folio lookups. For indicies not returned in
> > the bulk lookup, it will need to lookup individual folios to check for
> > races with page fault.
> >
> > Signed-off-by: Mike Kravetz <[email protected]>
> > ---
> > fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> > mm/hugetlb.c | 41 +++++-----
> > 2 files changed, 152 insertions(+), 73 deletions(-)
>
> With linux next starting from next-20220831 i see hangs with this
> patch applied while running the glibc test suite. The patch doesn't
> revert cleanly on top, so i checked out one commit before that one and
> with that revision everything works.
>
> It looks like the malloc test suite in glibc triggers this. I cannot
> identify a single test causing it, but instead the combination of
> multiple tests. Running the test suite on a single CPU works. Given the
> subject of the patch that's likely not a surprise.
>
> This is on s390, and the warning i get from RCU is:
>
> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> [ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> [ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> [ 1951.907095] Call Trace:
> [ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> [ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> [ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> [ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
> [ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
> [ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
> [ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
> [ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
> [ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
> [ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> [ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> [ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
> [ 1951.907145] Last Breaking-Event-Address:
> [ 1951.907146] [<0000038001d839c0>] 0x38001d839c0
>
> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
>
> Any thoughts?

Thanks for the report, I will take a look.

My first thought is that this fix may not be applied,
https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
However, I see that fix is already in next-20220831.

Hopefully, this will recreate on x86.
--
Mike Kravetz

2022-09-06 18:24:40

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 09/06/22 09:48, Mike Kravetz wrote:
> On 09/06/22 15:57, Sven Schnelle wrote:
> > Hi Mike,
> >
> > Mike Kravetz <[email protected]> writes:
> >
> > > When page fault code needs to allocate and instantiate a new hugetlb
> > > page (huegtlb_no_page), it checks early to determine if the fault is
> > > beyond i_size. When discovered early, it is easy to abort the fault and
> > > return an error. However, it becomes much more difficult to handle when
> > > discovered later after allocating the page and consuming reservations
> > > and adding to the page cache. Backing out changes in such instances
> > > becomes difficult and error prone.
> > >
> > > Instead of trying to catch and backout all such races, use the hugetlb
> > > fault mutex to handle truncate racing with page faults. The most
> > > significant change is modification of the routine remove_inode_hugepages
> > > such that it will take the fault mutex for EVERY index in the truncated
> > > range (or hole in the case of hole punch). Since remove_inode_hugepages
> > > is called in the truncate path after updating i_size, we can experience
> > > races as follows.
> > > - truncate code updates i_size and takes fault mutex before a racing
> > > fault. After fault code takes mutex, it will notice fault beyond
> > > i_size and abort early.
> > > - fault code obtains mutex, and truncate updates i_size after early
> > > checks in fault code. fault code will add page beyond i_size.
> > > When truncate code takes mutex for page/index, it will remove the
> > > page.
> > > - truncate updates i_size, but fault code obtains mutex first. If
> > > fault code sees updated i_size it will abort early. If fault code
> > > does not see updated i_size, it will add page beyond i_size and
> > > truncate code will remove page when it obtains fault mutex.
> > >
> > > Note, for performance reasons remove_inode_hugepages will still use
> > > filemap_get_folios for bulk folio lookups. For indicies not returned in
> > > the bulk lookup, it will need to lookup individual folios to check for
> > > races with page fault.
> > >
> > > Signed-off-by: Mike Kravetz <[email protected]>
> > > ---
> > > fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> > > mm/hugetlb.c | 41 +++++-----
> > > 2 files changed, 152 insertions(+), 73 deletions(-)
> >
> > With linux next starting from next-20220831 i see hangs with this
> > patch applied while running the glibc test suite. The patch doesn't
> > revert cleanly on top, so i checked out one commit before that one and
> > with that revision everything works.
> >
> > It looks like the malloc test suite in glibc triggers this. I cannot
> > identify a single test causing it, but instead the combination of
> > multiple tests. Running the test suite on a single CPU works. Given the
> > subject of the patch that's likely not a surprise.
> >
> > This is on s390, and the warning i get from RCU is:
> >
> > [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> > [ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> > [ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> > [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> > [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> > [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> > [ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> > [ 1951.907095] Call Trace:
> > [ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> > [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> > [ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> > [ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> > [ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
> > [ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
> > [ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
> > [ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
> > [ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
> > [ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
> > [ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> > [ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> > [ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
> > [ 1951.907145] Last Breaking-Event-Address:
> > [ 1951.907146] [<0000038001d839c0>] 0x38001d839c0
> >
> > One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> >
> > Any thoughts?
>
> Thanks for the report, I will take a look.
>
> My first thought is that this fix may not be applied,
> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> However, I see that that is in next-20220831.
>
> Hopefully, this will recreate on x86.

One additional thought ...

With this patch, we will take the hugetlb fault mutex for EVERY index in the
range being truncated or hole punched. In the case of a very large file, that
is no different than code today where we take the mutex when removing pages
from the file. What is different is taking the mutex for indices that are
part of holes in the file. Consider a very large file with only one page at
the very large offset. We would then take the mutex for each index in that
very large hole. Depending on the size of the hole, this could appear as a
hang.

For the above locking scheme to work, we need to take the mutex for indices
in holes in case there happens to be a racing page fault. However, there
are only a limited number of fault mutexes (it is a table). So, we only really
need to take at most num_fault_mutexes mutexes. We could keep track of
these with a bitmap.

I am not sure this is the issue you are seeing, but a test named
tst-malloc-too-large-malloc-hugetlb2 may be doing this.

In any case, I think this issue needs to be addressed before this series can
move forward.
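
If we go that route, the bookkeeping could look roughly like this (completely
untested, error handling omitted; this only shows the mutex side, the folio
lookups are a separate question):

	/* take each fault mutex at most once while walking a hole */
	unsigned long *taken = bitmap_zalloc(num_fault_mutexes, GFP_KERNEL);

	for (index = start; index < end; index++) {
		u32 hash = hugetlb_fault_mutex_hash(mapping, index);

		if (test_bit(hash, taken))
			continue;		/* this mutex was already cycled */

		mutex_lock(&hugetlb_fault_mutex_table[hash]);
		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
		set_bit(hash, taken);

		if (bitmap_full(taken, num_fault_mutexes))
			break;			/* every mutex has been taken once */
	}
	bitmap_free(taken);
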
--
Mike Kravetz

2022-09-06 23:36:43

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 09/06/22 11:05, Mike Kravetz wrote:
> On 09/06/22 09:48, Mike Kravetz wrote:
> > On 09/06/22 15:57, Sven Schnelle wrote:
> > > Hi Mike,
> > >
> > > Mike Kravetz <[email protected]> writes:
> > >
> > > > When page fault code needs to allocate and instantiate a new hugetlb
> > > > page (huegtlb_no_page), it checks early to determine if the fault is
> > > > beyond i_size. When discovered early, it is easy to abort the fault and
> > > > return an error. However, it becomes much more difficult to handle when
> > > > discovered later after allocating the page and consuming reservations
> > > > and adding to the page cache. Backing out changes in such instances
> > > > becomes difficult and error prone.
> > > >
> > > > Instead of trying to catch and backout all such races, use the hugetlb
> > > > fault mutex to handle truncate racing with page faults. The most
> > > > significant change is modification of the routine remove_inode_hugepages
> > > > such that it will take the fault mutex for EVERY index in the truncated
> > > > range (or hole in the case of hole punch). Since remove_inode_hugepages
> > > > is called in the truncate path after updating i_size, we can experience
> > > > races as follows.
> > > > - truncate code updates i_size and takes fault mutex before a racing
> > > > fault. After fault code takes mutex, it will notice fault beyond
> > > > i_size and abort early.
> > > > - fault code obtains mutex, and truncate updates i_size after early
> > > > checks in fault code. fault code will add page beyond i_size.
> > > > When truncate code takes mutex for page/index, it will remove the
> > > > page.
> > > > - truncate updates i_size, but fault code obtains mutex first. If
> > > > fault code sees updated i_size it will abort early. If fault code
> > > > does not see updated i_size, it will add page beyond i_size and
> > > > truncate code will remove page when it obtains fault mutex.
> > > >
> > > > Note, for performance reasons remove_inode_hugepages will still use
> > > > filemap_get_folios for bulk folio lookups. For indicies not returned in
> > > > the bulk lookup, it will need to lookup individual folios to check for
> > > > races with page fault.
> > > >
> > > > Signed-off-by: Mike Kravetz <[email protected]>
> > > > ---
> > > > fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> > > > mm/hugetlb.c | 41 +++++-----
> > > > 2 files changed, 152 insertions(+), 73 deletions(-)
> > >
> > > With linux next starting from next-20220831 i see hangs with this
> > > patch applied while running the glibc test suite. The patch doesn't
> > > revert cleanly on top, so i checked out one commit before that one and
> > > with that revision everything works.
> > >
> > > It looks like the malloc test suite in glibc triggers this. I cannot
> > > identify a single test causing it, but instead the combination of
> > > multiple tests. Running the test suite on a single CPU works. Given the
> > > subject of the patch that's likely not a surprise.
> > >
> > > This is on s390, and the warning i get from RCU is:
> > >
> > > [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> > > [ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> > > [ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> > > [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> > > [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> > > [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> > > [ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> > > [ 1951.907095] Call Trace:
> > > [ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> > > [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> > > [ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> > > [ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> > > [ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
> > > [ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
> > > [ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
> > > [ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
> > > [ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
> > > [ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
> > > [ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> > > [ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> > > [ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
> > > [ 1951.907145] Last Breaking-Event-Address:
> > > [ 1951.907146] [<0000038001d839c0>] 0x38001d839c0
> > >
> > > One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> > >
> > > Any thoughts?
> >
> > Thanks for the report, I will take a look.
> >
> > My first thought is that this fix may not be applied,
> > https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> > However, I see that that is in next-20220831.
> >
> > Hopefully, this will recreate on x86.
>
> One additional thought ...
>
> With this patch, we will take the hugetlb fault mutex for EVERY index in the
> range being truncated or hole punched. In the case of a very large file, that
> is no different than code today where we take the mutex when removing pages
> from the file. What is different is taking the mutex for indices that are
> part of holes in the file. Consider a very large file with only one page at
> the very large offset. We would then take the mutex for each index in that
> very large hole. Depending on the size of the hole, this could appear as a
> hang.
>
> For the above locking scheme to work, we need to take the mutex for indices
> in holes in case there would happen to be a racing page fault. However, there
> are only a limited number of fault mutexes (it is a table). So, we only really
> need to take at a maximum num_fault_mutexes mutexes. We could keep track of
> these with a bitmap.
>
> I am not sure this is the issue you are seeing, but a test named
> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
>
> In any case, I think this issue needs to be addressed before this series can
> move forward.

Well, even if we address the issue of taking the same mutex multiple times,
this new synchronization scheme requires a folio lookup for EVERY index in
the truncated or hole punched range. This can easily 'stall' a CPU if there
is a really big hole in a file. One can recreate this easily with fallocate
to add a single page to a file at a really big offset, and then remove the file.
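
For example, something like this (illustration only, no error checking; path
and offset are arbitrary, and it assumes a 2MB huge page size):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* one huge page at an 8TB offset on a hugetlbfs mount */
	int fd = open("/dev/hugepages/big-hole", O_CREAT | O_RDWR, 0600);

	fallocate(fd, 0, 1ULL << 43, 2UL << 20);
	unlink("/dev/hugepages/big-hole");
	close(fd);		/* eviction now walks every index in the huge hole */
	return 0;
}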

I am trying to come up with another algorithm to make this work.

Andrew, I wanted to give you a heads up that this series may need to be
pulled if I can not come up with something quickly.
--
Mike Kravetz

2022-09-07 02:30:01

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 2022/9/7 7:08, Mike Kravetz wrote:
> On 09/06/22 11:05, Mike Kravetz wrote:
>> On 09/06/22 09:48, Mike Kravetz wrote:
>>> On 09/06/22 15:57, Sven Schnelle wrote:
>>>> Hi Mike,
>>>>
>>>> Mike Kravetz <[email protected]> writes:
>>>>
>>>>> When page fault code needs to allocate and instantiate a new hugetlb
>>>>> page (huegtlb_no_page), it checks early to determine if the fault is
>>>>> beyond i_size. When discovered early, it is easy to abort the fault and
>>>>> return an error. However, it becomes much more difficult to handle when
>>>>> discovered later after allocating the page and consuming reservations
>>>>> and adding to the page cache. Backing out changes in such instances
>>>>> becomes difficult and error prone.
>>>>>
>>>>> Instead of trying to catch and backout all such races, use the hugetlb
>>>>> fault mutex to handle truncate racing with page faults. The most
>>>>> significant change is modification of the routine remove_inode_hugepages
>>>>> such that it will take the fault mutex for EVERY index in the truncated
>>>>> range (or hole in the case of hole punch). Since remove_inode_hugepages
>>>>> is called in the truncate path after updating i_size, we can experience
>>>>> races as follows.
>>>>> - truncate code updates i_size and takes fault mutex before a racing
>>>>> fault. After fault code takes mutex, it will notice fault beyond
>>>>> i_size and abort early.
>>>>> - fault code obtains mutex, and truncate updates i_size after early
>>>>> checks in fault code. fault code will add page beyond i_size.
>>>>> When truncate code takes mutex for page/index, it will remove the
>>>>> page.
>>>>> - truncate updates i_size, but fault code obtains mutex first. If
>>>>> fault code sees updated i_size it will abort early. If fault code
>>>>> does not see updated i_size, it will add page beyond i_size and
>>>>> truncate code will remove page when it obtains fault mutex.
>>>>>
>>>>> Note, for performance reasons remove_inode_hugepages will still use
>>>>> filemap_get_folios for bulk folio lookups. For indicies not returned in
>>>>> the bulk lookup, it will need to lookup individual folios to check for
>>>>> races with page fault.
>>>>>
>>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>>> ---
>>>>> fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
>>>>> mm/hugetlb.c | 41 +++++-----
>>>>> 2 files changed, 152 insertions(+), 73 deletions(-)
>>>>
>>>> With linux next starting from next-20220831 i see hangs with this
>>>> patch applied while running the glibc test suite. The patch doesn't
>>>> revert cleanly on top, so i checked out one commit before that one and
>>>> with that revision everything works.
>>>>
>>>> It looks like the malloc test suite in glibc triggers this. I cannot
>>>> identify a single test causing it, but instead the combination of
>>>> multiple tests. Running the test suite on a single CPU works. Given the
>>>> subject of the patch that's likely not a surprise.
>>>>
>>>> This is on s390, and the warning i get from RCU is:
>>>>
>>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
>>>> [ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
>>>> [ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
>>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
>>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
>>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
>>>> [ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>>>> [ 1951.907095] Call Trace:
>>>> [ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
>>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
>>>> [ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
>>>> [ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
>>>> [ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
>>>> [ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
>>>> [ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
>>>> [ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
>>>> [ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
>>>> [ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
>>>> [ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
>>>> [ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
>>>> [ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
>>>> [ 1951.907145] Last Breaking-Event-Address:
>>>> [ 1951.907146] [<0000038001d839c0>] 0x38001d839c0
>>>>
>>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
>>>>
>>>> Any thoughts?
>>>
>>> Thanks for the report, I will take a look.
>>>
>>> My first thought is that this fix may not be applied,
>>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
>>> However, I see that that is in next-20220831.
>>>
>>> Hopefully, this will recreate on x86.
>>
>> One additional thought ...
>>
>> With this patch, we will take the hugetlb fault mutex for EVERY index in the
>> range being truncated or hole punched. In the case of a very large file, that
>> is no different than code today where we take the mutex when removing pages
>> from the file. What is different is taking the mutex for indices that are
>> part of holes in the file. Consider a very large file with only one page at
>> the very large offset. We would then take the mutex for each index in that
>> very large hole. Depending on the size of the hole, this could appear as a
>> hang.
>>
>> For the above locking scheme to work, we need to take the mutex for indices
>> in holes in case there would happen to be a racing page fault. However, there
>> are only a limited number of fault mutexes (it is a table). So, we only really
>> need to take at a maximum num_fault_mutexes mutexes. We could keep track of
>> these with a bitmap.
>>
>> I am not sure this is the issue you are seeing, but a test named
>> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
>>
>> In any case, I think this issue needs to be addressed before this series can
>> move forward.
>
> Well, even if we address the issue of taking the same mutex multiple times,

Can we change to take all the hugetlb fault mutexes at the same time to ensure every possible
future hugetlb page fault will see the truncated i_size? Then we could just drop all the hugetlb
fault mutexes before doing any heavy stuff. It seems the hugetlb fault mutexes could be dropped
once the new i_size is guaranteed to be visible to any future hugetlb page fault user?
But I might be missing something...
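
Just to illustrate the idea (the helper name is made up, and it assumes this
lives next to num_fault_mutexes in mm/hugetlb.c):

	/*
	 * After the new i_size is published, cycle the whole fault mutex table
	 * once: any fault still inside its mutex-protected section finishes
	 * first, and any later fault will observe the new i_size.
	 */
	static void hugetlb_fault_mutex_sync_all(void)
	{
		u32 i;

		for (i = 0; i < num_fault_mutexes; i++)
			mutex_lock(&hugetlb_fault_mutex_table[i]);
		for (i = 0; i < num_fault_mutexes; i++)
			mutex_unlock(&hugetlb_fault_mutex_table[i]);
	}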

> this new synchronization scheme requires a folio lookup for EVERY index in
> the truncated or hole punched range. This can easily 'stall' a CPU if there

If the above thought holds, we could do batch folio lookups instead. Hope my thoughts help. ;)

Thanks,
Miaohe Lin


> is a really big hole in a file. One can recreate this easily with fallocate
> to add a single page to a file at a really big offset, and then remove the file.
>
> I am trying to come up with another algorithm to make this work.
>
> Andrew, I wanted to give you a heads up that this series may need to be
> pulled if I can not come up with something quickly.
>

2022-09-07 03:06:58

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 09/07/22 10:11, Miaohe Lin wrote:
> On 2022/9/7 7:08, Mike Kravetz wrote:
> > On 09/06/22 11:05, Mike Kravetz wrote:
> >> On 09/06/22 09:48, Mike Kravetz wrote:
> >>> On 09/06/22 15:57, Sven Schnelle wrote:
> >>>> Hi Mike,
> >>>>
> >>>> Mike Kravetz <[email protected]> writes:
> >>>>
> >>>>> When page fault code needs to allocate and instantiate a new hugetlb
> >>>>> page (huegtlb_no_page), it checks early to determine if the fault is
> >>>>> beyond i_size. When discovered early, it is easy to abort the fault and
> >>>>> return an error. However, it becomes much more difficult to handle when
> >>>>> discovered later after allocating the page and consuming reservations
> >>>>> and adding to the page cache. Backing out changes in such instances
> >>>>> becomes difficult and error prone.
> >>>>>
> >>>>> Instead of trying to catch and backout all such races, use the hugetlb
> >>>>> fault mutex to handle truncate racing with page faults. The most
> >>>>> significant change is modification of the routine remove_inode_hugepages
> >>>>> such that it will take the fault mutex for EVERY index in the truncated
> >>>>> range (or hole in the case of hole punch). Since remove_inode_hugepages
> >>>>> is called in the truncate path after updating i_size, we can experience
> >>>>> races as follows.
> >>>>> - truncate code updates i_size and takes fault mutex before a racing
> >>>>> fault. After fault code takes mutex, it will notice fault beyond
> >>>>> i_size and abort early.
> >>>>> - fault code obtains mutex, and truncate updates i_size after early
> >>>>> checks in fault code. fault code will add page beyond i_size.
> >>>>> When truncate code takes mutex for page/index, it will remove the
> >>>>> page.
> >>>>> - truncate updates i_size, but fault code obtains mutex first. If
> >>>>> fault code sees updated i_size it will abort early. If fault code
> >>>>> does not see updated i_size, it will add page beyond i_size and
> >>>>> truncate code will remove page when it obtains fault mutex.
> >>>>>
> >>>>> Note, for performance reasons remove_inode_hugepages will still use
> >>>>> filemap_get_folios for bulk folio lookups. For indicies not returned in
> >>>>> the bulk lookup, it will need to lookup individual folios to check for
> >>>>> races with page fault.
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <[email protected]>
> >>>>> ---
> >>>>> fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
> >>>>> mm/hugetlb.c | 41 +++++-----
> >>>>> 2 files changed, 152 insertions(+), 73 deletions(-)
> >>>>
> >>>> With linux next starting from next-20220831 i see hangs with this
> >>>> patch applied while running the glibc test suite. The patch doesn't
> >>>> revert cleanly on top, so i checked out one commit before that one and
> >>>> with that revision everything works.
> >>>>
> >>>> It looks like the malloc test suite in glibc triggers this. I cannot
> >>>> identify a single test causing it, but instead the combination of
> >>>> multiple tests. Running the test suite on a single CPU works. Given the
> >>>> subject of the patch that's likely not a surprise.
> >>>>
> >>>> This is on s390, and the warning i get from RCU is:
> >>>>
> >>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>> [ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> >>>> [ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> >>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> >>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> >>>> [ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >>>> [ 1951.907095] Call Trace:
> >>>> [ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> >>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> >>>> [ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> >>>> [ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> >>>> [ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
> >>>> [ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
> >>>> [ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
> >>>> [ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
> >>>> [ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
> >>>> [ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
> >>>> [ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> >>>> [ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> >>>> [ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
> >>>> [ 1951.907145] Last Breaking-Event-Address:
> >>>> [ 1951.907146] [<0000038001d839c0>] 0x38001d839c0
> >>>>
> >>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> >>>>
> >>>> Any thoughts?
> >>>
> >>> Thanks for the report, I will take a look.
> >>>
> >>> My first thought is that this fix may not be applied,
> >>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> >>> However, I see that that is in next-20220831.
> >>>
> >>> Hopefully, this will recreate on x86.
> >>
> >> One additional thought ...
> >>
> >> With this patch, we will take the hugetlb fault mutex for EVERY index in the
> >> range being truncated or hole punched. In the case of a very large file, that
> >> is no different than code today where we take the mutex when removing pages
> >> from the file. What is different is taking the mutex for indices that are
> >> part of holes in the file. Consider a very large file with only one page at
> >> the very large offset. We would then take the mutex for each index in that
> >> very large hole. Depending on the size of the hole, this could appear as a
> >> hang.
> >>
> >> For the above locking scheme to work, we need to take the mutex for indices
> >> in holes in case there would happen to be a racing page fault. However, there
> >> are only a limited number of fault mutexes (it is a table). So, we only really
> >> need to take at a maximum num_fault_mutexes mutexes. We could keep track of
> >> these with a bitmap.
> >>
> >> I am not sure this is the issue you are seeing, but a test named
> >> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
> >>
> >> In any case, I think this issue needs to be addressed before this series can
> >> move forward.
> >
> > Well, even if we address the issue of taking the same mutex multiple times,
>
> Can we change to take all the hugetlb fault mutex at the same time to ensure every possible
> future hugetlb page fault will see a truncated i_size? Then we could just drop all the hugetlb
> fault mutex before doing any heavy stuff? It seems hugetlb fault mutex could be dropped when
> new i_size is guaranteed to be visible for any future hugetlb page fault users?
> But I might miss something...

Yes, that is the general direction and would work well for truncation. However,
the same routine remove_inode_hugepages is used for hole punch, and I am pretty
sure we want to take the fault mutex there as it can race with page faults.

>
> > this new synchronization scheme requires a folio lookup for EVERY index in
> > the truncated or hole punched range. This can easily 'stall' a CPU if there
>
> If above thought holds, we could do batch folio lookup instead. Hopes my thought will help. ;)
>

Yes, I have some promising POC code with two batch lookups in case of holes.
Hope to send something soon.
--
Mike Kravetz

2022-09-07 03:43:32

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 2022/9/7 10:37, Mike Kravetz wrote:
> On 09/07/22 10:11, Miaohe Lin wrote:
>> On 2022/9/7 7:08, Mike Kravetz wrote:
>>> On 09/06/22 11:05, Mike Kravetz wrote:
>>>> On 09/06/22 09:48, Mike Kravetz wrote:
>>>>> On 09/06/22 15:57, Sven Schnelle wrote:
>>>>>> Hi Mike,
>>>>>>
>>>>>> Mike Kravetz <[email protected]> writes:
>>>>>>
>>>>>>> When page fault code needs to allocate and instantiate a new hugetlb
>>>>>>> page (huegtlb_no_page), it checks early to determine if the fault is
>>>>>>> beyond i_size. When discovered early, it is easy to abort the fault and
>>>>>>> return an error. However, it becomes much more difficult to handle when
>>>>>>> discovered later after allocating the page and consuming reservations
>>>>>>> and adding to the page cache. Backing out changes in such instances
>>>>>>> becomes difficult and error prone.
>>>>>>>
>>>>>>> Instead of trying to catch and backout all such races, use the hugetlb
>>>>>>> fault mutex to handle truncate racing with page faults. The most
>>>>>>> significant change is modification of the routine remove_inode_hugepages
>>>>>>> such that it will take the fault mutex for EVERY index in the truncated
>>>>>>> range (or hole in the case of hole punch). Since remove_inode_hugepages
>>>>>>> is called in the truncate path after updating i_size, we can experience
>>>>>>> races as follows.
>>>>>>> - truncate code updates i_size and takes fault mutex before a racing
>>>>>>> fault. After fault code takes mutex, it will notice fault beyond
>>>>>>> i_size and abort early.
>>>>>>> - fault code obtains mutex, and truncate updates i_size after early
>>>>>>> checks in fault code. fault code will add page beyond i_size.
>>>>>>> When truncate code takes mutex for page/index, it will remove the
>>>>>>> page.
>>>>>>> - truncate updates i_size, but fault code obtains mutex first. If
>>>>>>> fault code sees updated i_size it will abort early. If fault code
>>>>>>> does not see updated i_size, it will add page beyond i_size and
>>>>>>> truncate code will remove page when it obtains fault mutex.
>>>>>>>
>>>>>>> Note, for performance reasons remove_inode_hugepages will still use
>>>>>>> filemap_get_folios for bulk folio lookups. For indicies not returned in
>>>>>>> the bulk lookup, it will need to lookup individual folios to check for
>>>>>>> races with page fault.
>>>>>>>
>>>>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>>>>> ---
>>>>>>> fs/hugetlbfs/inode.c | 184 +++++++++++++++++++++++++++++++------------
>>>>>>> mm/hugetlb.c | 41 +++++-----
>>>>>>> 2 files changed, 152 insertions(+), 73 deletions(-)
>>>>>>
>>>>>> With linux next starting from next-20220831 i see hangs with this
>>>>>> patch applied while running the glibc test suite. The patch doesn't
>>>>>> revert cleanly on top, so i checked out one commit before that one and
>>>>>> with that revision everything works.
>>>>>>
>>>>>> It looks like the malloc test suite in glibc triggers this. I cannot
>>>>>> identify a single test causing it, but instead the combination of
>>>>>> multiple tests. Running the test suite on a single CPU works. Given the
>>>>>> subject of the patch that's likely not a surprise.
>>>>>>
>>>>>> This is on s390, and the warning i get from RCU is:
>>>>>>
>>>>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
>>>>>> [ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
>>>>>> [ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
>>>>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
>>>>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
>>>>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
>>>>>> [ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
>>>>>> [ 1951.907095] Call Trace:
>>>>>> [ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
>>>>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
>>>>>> [ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
>>>>>> [ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
>>>>>> [ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
>>>>>> [ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
>>>>>> [ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
>>>>>> [ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
>>>>>> [ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
>>>>>> [ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
>>>>>> [ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
>>>>>> [ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
>>>>>> [ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
>>>>>> [ 1951.907145] Last Breaking-Event-Address:
>>>>>> [ 1951.907146] [<0000038001d839c0>] 0x38001d839c0
>>>>>>
>>>>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
>>>>>>
>>>>>> Any thoughts?
>>>>>
>>>>> Thanks for the report, I will take a look.
>>>>>
>>>>> My first thought is that this fix may not be applied,
>>>>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
>>>>> However, I see that that is in next-20220831.
>>>>>
>>>>> Hopefully, this will recreate on x86.
>>>>
>>>> One additional thought ...
>>>>
>>>> With this patch, we will take the hugetlb fault mutex for EVERY index in the
>>>> range being truncated or hole punched. In the case of a very large file, that
>>>> is no different than code today where we take the mutex when removing pages
>>>> from the file. What is different is taking the mutex for indices that are
>>>> part of holes in the file. Consider a very large file with only one page at
>>>> the very large offset. We would then take the mutex for each index in that
>>>> very large hole. Depending on the size of the hole, this could appear as a
>>>> hang.
>>>>
>>>> For the above locking scheme to work, we need to take the mutex for indices
>>>> in holes in case there would happen to be a racing page fault. However, there
>>>> are only a limited number of fault mutexes (it is a table). So, we only really
>>>> need to take at a maximum num_fault_mutexes mutexes. We could keep track of
>>>> these with a bitmap.
>>>>
>>>> I am not sure this is the issue you are seeing, but a test named
>>>> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
>>>>
>>>> In any case, I think this issue needs to be addressed before this series can
>>>> move forward.
>>>
>>> Well, even if we address the issue of taking the same mutex multiple times,
>>
>> Can we change to take all the hugetlb fault mutex at the same time to ensure every possible
>> future hugetlb page fault will see a truncated i_size? Then we could just drop all the hugetlb
>> fault mutex before doing any heavy stuff? It seems hugetlb fault mutex could be dropped when
>> new i_size is guaranteed to be visible for any future hugetlb page fault users?
>> But I might miss something...
>
> Yes, that is the general direction and would work well for truncation. However,
> the same routine remove_inode_hugepages is used for hole punch, and I am pretty
> sure we want to take the fault mutex there as it can race with page faults.

Oh, sorry. I missed that case.

>
>>
>>> this new synchronization scheme requires a folio lookup for EVERY index in
>>> the truncated or hole punched range. This can easily 'stall' a CPU if there
>>
>> If above thought holds, we could do batch folio lookup instead. Hopes my thought will help. ;)
>>
>
> Yes, I have some promising POC code with two batch lookups in case of holes.
> Hope to send something soon.

That will be really nice. ;)

Thanks,
Miaohe Lin

2022-09-07 03:53:41

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 09/07/22 11:07, Miaohe Lin wrote:
> On 2022/9/7 10:37, Mike Kravetz wrote:
> > On 09/07/22 10:11, Miaohe Lin wrote:
> >> On 2022/9/7 7:08, Mike Kravetz wrote:
> >>> On 09/06/22 11:05, Mike Kravetz wrote:
> >>>> On 09/06/22 09:48, Mike Kravetz wrote:
> >>>>> On 09/06/22 15:57, Sven Schnelle wrote:
> >>>>>>
> >>>>>> With linux next starting from next-20220831 i see hangs with this
> >>>>>> patch applied while running the glibc test suite. The patch doesn't
> >>>>>> revert cleanly on top, so i checked out one commit before that one and
> >>>>>> with that revision everything works.
> >>>>>>
> >>>>>> It looks like the malloc test suite in glibc triggers this. I cannot
> >>>>>> identify a single test causing it, but instead the combination of
> >>>>>> multiple tests. Running the test suite on a single CPU works. Given the
> >>>>>> subject of the patch that's likely not a surprise.
> >>>>>>
> >>>>>> This is on s390, and the warning i get from RCU is:
> >>>>>>
> >>>>>> [ 1951.906997] rcu: INFO: rcu_sched self-detected stall on CPU
> >>>>>> [ 1951.907009] rcu: 60-....: (6000 ticks this GP) idle=968c/1/0x4000000000000000 softirq=43971/43972 fqs=2765
> >>>>>> [ 1951.907018] (t=6000 jiffies g=116125 q=1008072 ncpus=64)
> >>>>>> [ 1951.907024] CPU: 60 PID: 1236661 Comm: ld64.so.1 Not tainted 6.0.0-rc3-next-20220901 #340
> >>>>>> [ 1951.907027] Hardware name: IBM 3906 M04 704 (z/VM 7.1.0)
> >>>>>> [ 1951.907029] Krnl PSW : 0704e00180000000 00000000003d9042 (hugetlb_fault_mutex_hash+0x2a/0xd8)
> >>>>>> [ 1951.907044] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
> >>>>>> [ 1951.907095] Call Trace:
> >>>>>> [ 1951.907098] [<00000000003d9042>] hugetlb_fault_mutex_hash+0x2a/0xd8
> >>>>>> [ 1951.907101] ([<00000000005845a6>] fault_lock_inode_indicies+0x8e/0x128)
> >>>>>> [ 1951.907107] [<0000000000584876>] remove_inode_hugepages+0x236/0x280
> >>>>>> [ 1951.907109] [<0000000000584a7c>] hugetlbfs_evict_inode+0x3c/0x60
> >>>>>> [ 1951.907111] [<000000000044fe96>] evict+0xe6/0x1c0
> >>>>>> [ 1951.907116] [<000000000044a608>] __dentry_kill+0x108/0x1e0
> >>>>>> [ 1951.907119] [<000000000044ac64>] dentry_kill+0x6c/0x290
> >>>>>> [ 1951.907121] [<000000000044afec>] dput+0x164/0x1c0
> >>>>>> [ 1951.907123] [<000000000042a4d6>] __fput+0xee/0x290
> >>>>>> [ 1951.907127] [<00000000001794a8>] task_work_run+0x88/0xe0
> >>>>>> [ 1951.907133] [<00000000001f77a0>] exit_to_user_mode_prepare+0x1a0/0x1a8
> >>>>>> [ 1951.907137] [<0000000000d0e42e>] __do_syscall+0x11e/0x200
> >>>>>> [ 1951.907142] [<0000000000d1d392>] system_call+0x82/0xb0
> >>>>>> [ 1951.907145] Last Breaking-Event-Address:
> >>>>>> [ 1951.907146] [<0000038001d839c0>] 0x38001d839c0
> >>>>>>
> >>>>>> One of the hanging test cases is usually malloc/tst-malloc-too-large-malloc-hugetlb2.
> >>>>>>
> >>>>>> Any thoughts?
> >>>>>
> >>>>> Thanks for the report, I will take a look.
> >>>>>
> >>>>> My first thought is that this fix may not be applied,
> >>>>> https://lore.kernel.org/linux-mm/Ywepr7C2X20ZvLdn@monkey/
> >>>>> However, I see that that is in next-20220831.
> >>>>>
> >>>>> Hopefully, this will recreate on x86.
> >>>>
> >>>> One additional thought ...
> >>>>
> >>>> With this patch, we will take the hugetlb fault mutex for EVERY index in the
> >>>> range being truncated or hole punched. In the case of a very large file, that
> >>>> is no different than code today where we take the mutex when removing pages
> >>>> from the file. What is different is taking the mutex for indices that are
> >>>> part of holes in the file. Consider a very large file with only one page at
> >>>> the very large offset. We would then take the mutex for each index in that
> >>>> very large hole. Depending on the size of the hole, this could appear as a
> >>>> hang.
> >>>>
> >>>> For the above locking scheme to work, we need to take the mutex for indices
> >>>> in holes in case there would happen to be a racing page fault. However, there
> >>>> are only a limited number of fault mutexes (it is a table). So, we only really
> >>>> need to take at a maximum num_fault_mutexes mutexes. We could keep track of
> >>>> these with a bitmap.
> >>>>
> >>>> I am not sure this is the issue you are seeing, but a test named
> >>>> tst-malloc-too-large-malloc-hugetlb2 may be doing this.
> >>>>
> >>>> In any case, I think this issue needs to be addressed before this series can
> >>>> move forward.
> >>>
> >>> Well, even if we address the issue of taking the same mutex multiple times,
> >>
> >> Can we change to take all the hugetlb fault mutex at the same time to ensure every possible
> >> future hugetlb page fault will see a truncated i_size? Then we could just drop all the hugetlb
> >> fault mutex before doing any heavy stuff? It seems hugetlb fault mutex could be dropped when
> >> new i_size is guaranteed to be visible for any future hugetlb page fault users?
> >> But I might miss something...
> >
> > Yes, that is the general direction and would work well for truncation. However,
> > the same routine remove_inode_hugepages is used for hole punch, and I am pretty
> > sure we want to take the fault mutex there as it can race with page faults.
>
> Oh, sorry. I missed that case.
>
> >
> >>
> >>> this new synchronization scheme requires a folio lookup for EVERY index in
> >>> the truncated or hole punched range. This can easily 'stall' a CPU if there
> >>
> >> If above thought holds, we could do batch folio lookup instead. Hopes my thought will help. ;)
> >>
> >
> > Yes, I have some promising POC code with two batch lookups in case of holes.
> > Hope to send something soon.
>
> That will be really nice. ;)
>

Hi Sven,

Would you be willing to try the patch below in your environment?
It addresses the stall I can create with a file that has a VERY large hole.
In addition, it passes libhugetlbfs tests and has run for a while in my
truncate/page fault race stress test. However, it is very early code.
It would be nice to see if it also resolves the issue you are hitting.


From 10c58195db9ed8aeff84c63ea2baf6c007651e42 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <[email protected]>
Date: Tue, 6 Sep 2022 19:59:47 -0700
Subject: [PATCH] hugetlb: redo remove_inode_hugepages algorithm for racing
page faults

Previously, remove_inode_hugepages would take the fault mutex for EVERY
index in a file and look for a folio. This included holes in the file.
For very large sparse files, this could result in minutes (or more) of
kernel time. Taking the fault mutex for every index is a requirement if
we want to use it for synchronization with page faults.

This patch adjusts the algorithm slightly to take large holes into
account. It tracks which fault mutexes have been taken so that it will
not needlessly take the same mutex more than once. Also, we skip
looking for folios in holes. Instead, we make a second scan of the file
with a bulk lookup.

Signed-off-by: Mike Kravetz <[email protected]>
---
fs/hugetlbfs/inode.c | 70 ++++++++++++++++++++++-------------------
include/linux/hugetlb.h | 16 ++++++++++
mm/hugetlb.c | 48 ++++++++++++++++++++++++++++
3 files changed, 102 insertions(+), 32 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 2f1d6da1bafb..ce1eb6202179 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -578,37 +578,27 @@ static bool remove_inode_single_folio(struct hstate *h, struct inode *inode,

/*
* Take hugetlb fault mutex for a set of inode indicies.
- * Check for and remove any found folios. Return the number of
- * any removed folios.
- *
*/
-static long fault_lock_inode_indicies(struct hstate *h,
+static void fault_lock_inode_indicies(struct hstate *h,
struct inode *inode,
struct address_space *mapping,
pgoff_t start, pgoff_t end,
- bool truncate_op)
+ bool truncate_op,
+ struct hugetlb_fault_mutex_track *hfmt)
{
- struct folio *folio;
- long freed = 0;
+ long holes = 0;
pgoff_t index;
u32 hash;

- for (index = start; index < end; index++) {
+ for (index = start; index < end &&
+ !hugetlb_fault_mutex_track_all_set(hfmt);
+ index++) {
hash = hugetlb_fault_mutex_hash(mapping, index);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
-
- folio = filemap_get_folio(mapping, index);
- if (folio) {
- if (remove_inode_single_folio(h, inode, mapping, folio,
- index, truncate_op))
- freed++;
- folio_put(folio);
- }
-
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ hugetlb_fault_mutex_track_lock(hfmt, hash, false);
+ hugetlb_fault_mutex_track_unlock(hfmt, hash, false);
+ holes++;
+ cond_resched();
}
-
- return freed;
}

/*
@@ -656,7 +646,14 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
long freed = 0;
u32 hash;
bool truncate_op = (lend == LLONG_MAX);
+ struct hugetlb_fault_mutex_track *hfmt;
+ bool rescan = true;
+ unsigned long holes;

+ hfmt = hugetlb_fault_mutex_track_alloc();
+
+rescan:
+ holes = 0;
folio_batch_init(&fbatch);
next = m_start = start;
while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
@@ -665,36 +662,45 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,

index = folio->index;
/*
- * Take fault mutex for missing folios before index,
- * while checking folios that might have been added
- * due to a race with fault code.
+ * Take fault mutex for missing folios before index
*/
- freed += fault_lock_inode_indicies(h, inode, mapping,
- m_start, index, truncate_op);
+ holes += (index - m_start);
+ if (rescan) /* no need on second pass */
+ fault_lock_inode_indicies(h, inode, mapping,
+ m_start, index, truncate_op, hfmt);

/*
* Remove folio that was part of folio_batch.
+ * Force taking fault mutex.
*/
hash = hugetlb_fault_mutex_hash(mapping, index);
- mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ hugetlb_fault_mutex_track_lock(hfmt, hash, true);
if (remove_inode_single_folio(h, inode, mapping, folio,
index, truncate_op))
freed++;
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ hugetlb_fault_mutex_track_unlock(hfmt, hash, true);
}
folio_batch_release(&fbatch);
cond_resched();
}

/*
- * Take fault mutex for missing folios at end of range while checking
- * for folios that might have been added due to a race with fault code.
+ * Take fault mutex for missing folios at end of range
*/
- freed += fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
- truncate_op);
+ holes += (m_end - m_start);
+ if (rescan)
+ fault_lock_inode_indicies(h, inode, mapping, m_start, m_end,
+ truncate_op, hfmt);
+
+ if (holes && rescan) {
+ rescan = false;
+ goto rescan; /* can happen at most once */
+ }

if (truncate_op)
(void)hugetlb_unreserve_pages(inode, start, LONG_MAX, freed);
+
+ hugetlb_fault_mutex_track_free(hfmt);
}

static void hugetlbfs_evict_inode(struct inode *inode)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 852f911d676e..dc532d2e42d0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -180,8 +180,24 @@ void putback_active_hugepage(struct page *page);
void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
void free_huge_page(struct page *page);
void hugetlb_fix_reserve_counts(struct inode *inode);
+
extern struct mutex *hugetlb_fault_mutex_table;
u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx);
+struct hugetlb_fault_mutex_track {
+ bool all_set;
+ unsigned long *track_bits;
+};
+struct hugetlb_fault_mutex_track *hugetlb_fault_mutex_track_alloc(void);
+void hugetlb_fault_mutex_track_lock(struct hugetlb_fault_mutex_track *hfmt,
+ u32 hash, bool force);
+void hugetlb_fault_mutex_track_unlock(struct hugetlb_fault_mutex_track *hfmt,
+ u32 hash, bool force);
+static inline bool hugetlb_fault_mutex_track_all_set(
+ struct hugetlb_fault_mutex_track *hfmt)
+{
+ return hfmt->all_set;
+}
+void hugetlb_fault_mutex_track_free(struct hugetlb_fault_mutex_track *hfmt);

pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pud_t *pud);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d0617d64d718..d9dfc4736928 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5788,6 +5788,54 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
}
#endif

+struct hugetlb_fault_mutex_track *hugetlb_fault_mutex_track_alloc(void)
+{
+ struct hugetlb_fault_mutex_track *hfmt;
+
+ hfmt = kmalloc(ALIGN(sizeof(struct hugetlb_fault_mutex_track),
+ sizeof(unsigned long)) +
+ sizeof(unsigned long) *
+ BITS_TO_LONGS(num_fault_mutexes),
+ GFP_KERNEL);
+ if (hfmt) {
+ hfmt->track_bits = (void *)hfmt +
+ ALIGN(sizeof(struct hugetlb_fault_mutex_track),
+ sizeof(unsigned long));
+
+ hfmt->all_set = false;
+ bitmap_zero(hfmt->track_bits, num_fault_mutexes);
+ }
+
+ return hfmt;
+}
+
+void hugetlb_fault_mutex_track_lock(struct hugetlb_fault_mutex_track *hfmt,
+ u32 hash, bool force)
+{
+ if (!hfmt || !test_bit((int)hash, hfmt->track_bits) || force) {
+ mutex_lock(&hugetlb_fault_mutex_table[hash]);
+ /* set bit when unlocking */
+ }
+}
+
+void hugetlb_fault_mutex_track_unlock(struct hugetlb_fault_mutex_track *hfmt,
+ u32 hash, bool force)
+{
+ if (!hfmt || !test_bit((int)hash, hfmt->track_bits) || force) {
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ if (hfmt && !hfmt->all_set) {
+ set_bit((int)hash, hfmt->track_bits);
+ if (bitmap_full(hfmt->track_bits, num_fault_mutexes))
+ hfmt->all_set = true;
+ }
+ }
+}
+
+void hugetlb_fault_mutex_track_free(struct hugetlb_fault_mutex_track *hfmt)
+{
+ kfree(hfmt);
+}
+
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags)
{
--
2.37.2
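
To restate the mechanism in the patch above in isolation: each fault mutex hash is
cycled (locked and immediately unlocked) at most once per scan, and the walk over
holes stops as soon as every mutex has been cycled. Below is a rough userspace
sketch of just that idea; it uses a plain bool array instead of the kernel's
bitmap, and all names and sizes are illustrative rather than taken from the
kernel.

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_MUTEXES 64

static pthread_mutex_t table[NUM_MUTEXES];
static bool seen[NUM_MUTEXES];
static unsigned int nr_seen;

/* Cycle (lock, then unlock) mutex 'hash' unless it was already cycled
 * in this scan; return true once every mutex has been cycled so the
 * caller can stop walking holes early. */
static bool cycle_once(uint32_t hash)
{
        if (!seen[hash]) {
                pthread_mutex_lock(&table[hash]);
                pthread_mutex_unlock(&table[hash]);
                seen[hash] = true;
                nr_seen++;
        }
        return nr_seen == NUM_MUTEXES;
}

int main(void)
{
        int i;

        for (i = 0; i < NUM_MUTEXES; i++)
                pthread_mutex_init(&table[i], NULL);

        /* Walk a large, mostly-hole range; each mutex is taken at most
         * once, so the walk ends after NUM_MUTEXES distinct hashes. */
        for (uint64_t index = 0; index < (1u << 20); index++)
                if (cycle_once(index % NUM_MUTEXES))
                        break;

        printf("cycled %u mutexes\n", nr_seen);
        return 0;
}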

2022-09-07 08:36:51

by Sven Schnelle

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

Mike Kravetz <[email protected]> writes:

> Would you be willing to try the patch below in your environment?
> It addresses the stall I can create with a file that has a VERY large hole.
> In addition, it passes libhugetlbfs tests and has run for a while in my
> truncate/page fault race stress test. However, it is very early code.
> It would be nice to see if it addresses the issue in your environment.

Yes, that fixes the issue for me. I added some debugging yesterday
evening after sending the initial report, and the end value in the loop
was indeed quite large - I didn't record the exact number, but it was
something like 0xffffffffff800001. Feel free to add my Tested-by.

Thanks
Sven

2022-09-07 14:53:42

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 4/8] hugetlb: handle truncate racing with page faults

On 09/07/22 10:22, Sven Schnelle wrote:
> Mike Kravetz <[email protected]> writes:
>
> > Would you be willing to try the patch below in your environment?
> > It addresses the stall I can create with a file that has a VERY large hole.
> > In addition, it passes libhugetlbfs tests and has run for a while in my
> > truncate/page fault race stress test. However, it is very early code.
> > It would be nice to see if it addresses the issue in your environment.
>
> Yes, that fixes the issue for me. I added some debugging yesterday
> evening after sending the initial report, and the end value in the loop
> was indeed quite large - i didn't record the exact number, but it was
> something like 0xffffffffff800001. Feel free to add my Tested-by.
>

Thank you!

When thinking about this some more, the new vma_lock introduced by this series
may address truncation/fault races without involving the fault mutex.

How?

Before truncating or hole punching, we need to unmap all users of that range.
To unmap, we need to acquire the vma_lock for each vma mapping the file. This
same lock is acquired in the page fault path. As such, it provides the same
type of synchronization around i_size as provided by the fault mutex in this
patch. So, I think we can make the code much simpler (and faster) by removing
the code that takes the fault mutex for holes in files. Of course, this cannot
happen until the vma_lock is actually put into use, which is done in the last
patch of this series.
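
To make that ordering concrete, here is a rough userspace sketch of the idea:
a pthread rwlock stands in for the per-vma lock and a plain variable stands in
for i_size, so names and details are illustrative rather than actual kernel
code. The fault path holds the lock for read around its i_size check and page
setup, while truncation takes it for write on each mapping vma before
unmapping and shrinking the size.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct vma {
        pthread_rwlock_t vma_lock;      /* stands in for the hugetlb vma_lock */
};

static size_t i_size = 100;             /* simulated file size, in huge pages */

/* Fault path: hold the lock for read while checking i_size and
 * instantiating the page. */
static bool fault_in_page(struct vma *vma, size_t index)
{
        bool ok;

        pthread_rwlock_rdlock(&vma->vma_lock);
        ok = index < i_size;            /* fails if truncate already shrank the file */
        if (ok) {
                /* ... instantiate the page / page table entry here ... */
        }
        pthread_rwlock_unlock(&vma->vma_lock);
        return ok;
}

/* Truncate/hole punch path: take the lock for write on each vma mapping
 * the file before unmapping; concurrent faults either finish before the
 * unmap or observe the new size. */
static void truncate_to(struct vma *vma, size_t new_size)
{
        pthread_rwlock_wrlock(&vma->vma_lock);
        i_size = new_size;
        /* ... unmap pages beyond new_size here ... */
        pthread_rwlock_unlock(&vma->vma_lock);
}

int main(void)
{
        struct vma v;

        pthread_rwlock_init(&v.vma_lock, NULL);
        fault_in_page(&v, 10);          /* succeeds: 10 < 100 */
        truncate_to(&v, 5);
        fault_in_page(&v, 10);          /* now fails: beyond the new i_size */
        return 0;
}
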
--
Mike Kravetz

2022-09-07 21:33:14

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 6/8] hugetlb: add vma based lock for pmd sharing

On 08/29/22 15:24, Mike Kravetz wrote:
> On 08/27/22 17:30, Miaohe Lin wrote:
> > On 2022/8/25 1:57, Mike Kravetz wrote:
> > > Allocate a rw semaphore and hang off vm_private_data for
> > > synchronization use by vmas that could be involved in pmd sharing. Only
> > > add infrastructure for the new lock here. Actual use will be added in
> > > subsequent patch.
> > >
> > > Signed-off-by: Mike Kravetz <[email protected]>
> >
> > <snip>
> >
> > > +static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
> > > +{
> > > + /*
> > > + * Only present in sharable vmas. See comment in
> > > + * __unmap_hugepage_range_final about the neeed to check both
> >
> > s/neeed/need/
> >
> > > + * VM_SHARED and VM_MAYSHARE in free path
> >
> > I think there might be some wrong checks around this patch. As above comment said, we
> > need to check both flags, so we should do something like below instead?
> >
> > if (!(vma->vm_flags & (VM_MAYSHARE | VM_SHARED) == (VM_MAYSHARE | VM_SHARED)))
> >
> > > + */
>
> Thanks. I will update.
>
> > > + if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
> > > + return;

I think you misunderstood the comment, which I admit was not very clear. And,
I misunderstood your suggestion. I believe the code is correct as is. Here
is the proposed updated comment/code:

/*
* Only present in sharable vmas. See comment in
* __unmap_hugepage_range_final about how VM_SHARED could
* be set without VM_MAYSHARE. As a result, we need to
* check if either is set in the free path.
*/
if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
return;

Hopefully, that makes more sense.
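
For illustration, the difference between the check in the patch and the
"both flags set" check shows up exactly for a vma with VM_SHARED set but
VM_MAYSHARE cleared. A small standalone sketch follows; the bit values mirror
the kernel's VM_SHARED/VM_MAYSHARE definitions, but for this sketch only the
fact that they are distinct bits matters.

#include <stdio.h>

#define VM_SHARED       0x00000008UL
#define VM_MAYSHARE     0x00000080UL

int main(void)
{
        /* The interesting case: VM_SHARED still set, VM_MAYSHARE already
         * cleared (see the __unmap_hugepage_range_final discussion above). */
        unsigned long vm_flags = VM_SHARED;

        /* Check used in the patch: return early only if *neither* flag is
         * set, so this vma still gets its vma_lock freed. */
        if (!(vm_flags & (VM_MAYSHARE | VM_SHARED)))
                printf("patch check: would return early\n");
        else
                printf("patch check: proceeds to free the lock\n");

        /* The suggested "both flags set" check: returns early here and
         * would skip freeing the lock for this vma. */
        if ((vm_flags & (VM_MAYSHARE | VM_SHARED)) != (VM_MAYSHARE | VM_SHARED))
                printf("both-set check: would return early\n");

        return 0;
}
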
--
Mike Kravetz

2022-09-08 02:27:00

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 6/8] hugetlb: add vma based lock for pmd sharing

On 2022/9/8 4:50, Mike Kravetz wrote:
> On 08/29/22 15:24, Mike Kravetz wrote:
>> On 08/27/22 17:30, Miaohe Lin wrote:
>>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>>> Allocate a rw semaphore and hang off vm_private_data for
>>>> synchronization use by vmas that could be involved in pmd sharing. Only
>>>> add infrastructure for the new lock here. Actual use will be added in
>>>> subsequent patch.
>>>>
>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>
>>> <snip>
>>>
>>>> +static void hugetlb_vma_lock_free(struct vm_area_struct *vma)
>>>> +{
>>>> + /*
>>>> + * Only present in sharable vmas. See comment in
>>>> + * __unmap_hugepage_range_final about the neeed to check both
>>>
>>> s/neeed/need/
>>>
>>>> + * VM_SHARED and VM_MAYSHARE in free path
>>>
>>> I think there might be some wrong checks around this patch. As above comment said, we
>>> need to check both flags, so we should do something like below instead?
>>>
>>> if (!(vma->vm_flags & (VM_MAYSHARE | VM_SHARED) == (VM_MAYSHARE | VM_SHARED)))
>>>
>>>> + */
>>
>> Thanks. I will update.
>>
>>>> + if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
>>>> + return;
>
> I think you misunderstood the comment, which I admit was not very clear. And,
> I misunderstood your suggestion. I believe the code is correct as is. Here
> is the proposed updated comment/code:
>
> /*
> * Only present in sharable vmas. See comment in
> * __unmap_hugepage_range_final about how VM_SHARED could
> * be set without VM_MAYSHARE. As a result, we need to
> * check if either is set in the free path.
> */
> if (!vma || !(vma->vm_flags & (VM_MAYSHARE | VM_SHARED)))
> return;
>
> Hopefully, that makes more sense.

Somewhat confusing. Thanks for clarifying, Mike.

Thanks,
Miaohe Lin

2022-09-12 23:31:03

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 09/05/22 11:08, Miaohe Lin wrote:
> On 2022/9/3 7:07, Mike Kravetz wrote:
> > On 08/30/22 10:02, Miaohe Lin wrote:
> >> On 2022/8/25 1:57, Mike Kravetz wrote:
> >>> The new hugetlb vma lock (rw semaphore) is used to address this race:
> >>>
> >>> Faulting thread Unsharing thread
> >>> ... ...
> >>> ptep = huge_pte_offset()
> >>> or
> >>> ptep = huge_pte_alloc()
> >>> ...
> >>> i_mmap_lock_write
> >>> lock page table
> >>> ptep invalid <------------------------ huge_pmd_unshare()
> >>> Could be in a previously unlock_page_table
> >>> sharing process or worse i_mmap_unlock_write
> >>> ...
> >>>
> >>> The vma_lock is used as follows:
> >>> - During fault processing. the lock is acquired in read mode before
> >>> doing a page table lock and allocation (huge_pte_alloc). The lock is
> >>> held until code is finished with the page table entry (ptep).
> >>> - The lock must be held in write mode whenever huge_pmd_unshare is
> >>> called.
> >>>
> >>> Lock ordering issues come into play when unmapping a page from all
> >>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
> >>> vmas, and the vma lock must be held before calling unmap which will
> >>> call huge_pmd_unshare. This is done today in:
> >>> - try_to_migrate_one and try_to_unmap_ for page migration and memory
> >>> error handling. In these routines we 'try' to obtain the vma lock and
> >>> fail to unmap if unsuccessful. Calling routines already deal with the
> >>> failure of unmapping.
> >>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
> >>> also tries to acquire the vma lock. If it fails, it skips the
> >>> unmapping. However, we can not have file truncation or hole punch
> >>> fail because of contention. After hugetlb_vmdelete_list, truncation
> >>> and hole punch call remove_inode_hugepages. remove_inode_hugepages
> >>> check for mapped pages and call hugetlb_unmap_file_page to unmap them.
> >>> hugetlb_unmap_file_page is designed to drop locks and reacquire in the
> >>> correct order to guarantee unmap success.
> >>>
> >>> Signed-off-by: Mike Kravetz <[email protected]>
> >>> ---
> >>> fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
> >>> mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
> >>> mm/memory.c | 2 +
> >>> mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
> >>> mm/userfaultfd.c | 9 +++-
> >>> 5 files changed, 214 insertions(+), 45 deletions(-)
> >>>
> >>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >>> index b93d131b0cb5..52d9b390389b 100644
> >>> --- a/fs/hugetlbfs/inode.c
> >>> +++ b/fs/hugetlbfs/inode.c
> >>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>> struct folio *folio, pgoff_t index)
> >>> {
> >>> struct rb_root_cached *root = &mapping->i_mmap;
> >>> + unsigned long skipped_vm_start;
> >>> + struct mm_struct *skipped_mm;
> >>> struct page *page = &folio->page;
> >>> struct vm_area_struct *vma;
> >>> unsigned long v_start;
> >>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>> end = ((index + 1) * pages_per_huge_page(h));
> >>>
> >>> i_mmap_lock_write(mapping);
> >>> +retry:
> >>> + skipped_mm = NULL;
> >>>
> >>> vma_interval_tree_foreach(vma, root, start, end - 1) {
> >>> v_start = vma_offset_start(vma, start);
> >>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >>> continue;
> >>>
> >>> + if (!hugetlb_vma_trylock_write(vma)) {
> >>> + /*
> >>> + * If we can not get vma lock, we need to drop
> >>> + * immap_sema and take locks in order.
> >>> + */
> >>> + skipped_vm_start = vma->vm_start;
> >>> + skipped_mm = vma->vm_mm;
> >>> + /* grab mm-struct as we will be dropping i_mmap_sema */
> >>> + mmgrab(skipped_mm);
> >>> + break;
> >>> + }
> >>> +
> >>> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >>> NULL, ZAP_FLAG_DROP_MARKER);
> >>> + hugetlb_vma_unlock_write(vma);
> >>> }
> >>>
> >>> i_mmap_unlock_write(mapping);
> >>> +
> >>> + if (skipped_mm) {
> >>> + mmap_read_lock(skipped_mm);
> >>> + vma = find_vma(skipped_mm, skipped_vm_start);
> >>> + if (!vma || !is_vm_hugetlb_page(vma) ||
> >>> + vma->vm_file->f_mapping != mapping ||
> >>> + vma->vm_start != skipped_vm_start) {
> >>
> >> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
> >>
> >
> > Yes, that is missing. I will add here.
> >
> >>> + mmap_read_unlock(skipped_mm);
> >>> + mmdrop(skipped_mm);
> >>> + goto retry;
> >>> + }
> >>> +
> >>
> >> IMHO, above check is not enough. Think about the below scene:
> >>
> >> CPU 1 CPU 2
> >> hugetlb_unmap_file_folio exit_mmap
> >> mmap_read_lock(skipped_mm); mmap_read_lock(mm);
> >> check vma is wanted.
> >> unmap_vmas
> >> mmap_read_unlock(skipped_mm); mmap_read_unlock
> >> mmap_write_lock(mm);
> >> free_pgtables
> >> remove_vma
> >> hugetlb_vma_lock_free
> >> vma, hugetlb_vma_lock is still *used after free*
> >> mmap_write_unlock(mm);
> >> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?
> >
> > In the retry case, we are OK because go back and look up the vma again. Right?
> >
> > After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
> > Before that, we do the following:
> >
> >>> + hugetlb_vma_lock_write(vma);
> >>> + i_mmap_lock_write(mapping);
> >
> > IIUC, vma can not go away while we hold i_mmap_lock_write. So, after this we
>
> I think you're right. free_pgtables() can't complete its work as unlink_file_vma() will be
> blocked on i_mmap_rwsem of mapping. Sorry for reporting such nonexistent race.
>
> > can.
> >
> >>> + mmap_read_unlock(skipped_mm);
> >>> + mmdrop(skipped_mm);
> >
> > We continue to hold i_mmap_lock_write as we goto retry.
> >
> > I could be missing something as well. This was how I intended to keep
> > vma valid while dropping and acquiring locks.
>
> Thanks for your clarifying.
>

Well, that was all correct 'in theory' but not in practice. I did not take
into account the inode lock that is taken at the beginning of truncate (or
hole punch). In other code paths, we take inode lock after mmap_lock. So,
taking mmap_lock here is not allowed.

I came up with another way to make this work. As discussed above, we need to
drop the i_mmap lock before acquiring the vma_lock. However, once we drop
i_mmap, the vma could go away. My solution is to make the 'vma_lock' be a
ref counted structure that can live on after the vma is freed. Therefore,
this code can take a reference while under i_mmap then drop i_mmap and wait
on the vma_lock. Of course, once it acquires the vma_lock it needs to check
and make sure the vma still exists. It may sound complicated, but I think
it is a bit simpler than the code here. A new series will be out soon.
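
As a rough userspace sketch of that idea (all names, fields, and helpers here
are illustrative, not the eventual kernel implementation): the lock lives in a
separately allocated, reference-counted structure, so a waiter can pin it,
drop i_mmap, sleep on the lock, and then re-check whether the vma survived.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct vma;

/* Reference-counted lock that can outlive the vma it belongs to. */
struct vma_lock {
        atomic_int refs;
        pthread_rwlock_t rw;
        struct vma *vma;                /* NULL once the vma has been freed */
};

struct vma {
        struct vma_lock *lock;
};

static pthread_rwlock_t i_mmap_rwsem = PTHREAD_RWLOCK_INITIALIZER;

static struct vma_lock *vma_lock_get(struct vma_lock *vl)
{
        atomic_fetch_add(&vl->refs, 1);
        return vl;
}

static void vma_lock_put(struct vma_lock *vl)
{
        if (atomic_fetch_sub(&vl->refs, 1) == 1) {
                pthread_rwlock_destroy(&vl->rw);
                free(vl);
        }
}

/*
 * Called with i_mmap_rwsem held for write after a failed trylock on
 * vma->lock: pin the lock structure, drop i_mmap, sleep on the lock,
 * retake i_mmap, then check whether the vma survived.  Returns true
 * with vl->rw held for write if it did.
 */
static bool lock_vma_after_dropping_i_mmap(struct vma *vma)
{
        struct vma_lock *vl = vma_lock_get(vma->lock);
        bool alive;

        pthread_rwlock_unlock(&i_mmap_rwsem);
        pthread_rwlock_wrlock(&vl->rw);         /* may sleep; vma may go away */
        pthread_rwlock_wrlock(&i_mmap_rwsem);

        alive = (vl->vma != NULL);              /* was the vma freed meanwhile? */
        if (!alive)
                pthread_rwlock_unlock(&vl->rw);
        vma_lock_put(vl);                       /* a live vma holds its own reference */
        return alive;
}

int main(void)
{
        struct vma v;
        struct vma_lock *vl = malloc(sizeof(*vl));

        if (!vl)
                return 1;
        atomic_init(&vl->refs, 1);              /* reference owned by the vma */
        pthread_rwlock_init(&vl->rw, NULL);
        vl->vma = &v;
        v.lock = vl;

        /* Demo call; in the real flow this happens only after a failed trylock. */
        pthread_rwlock_wrlock(&i_mmap_rwsem);
        if (lock_vma_after_dropping_i_mmap(&v))
                pthread_rwlock_unlock(&vl->rw);
        pthread_rwlock_unlock(&i_mmap_rwsem);

        vl->vma = NULL;                         /* "free" the vma */
        vma_lock_put(vl);                       /* drop the vma's reference */
        return 0;
}
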
--
Mike Kravetz

2022-09-13 02:26:30

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 2022/9/13 7:02, Mike Kravetz wrote:
> On 09/05/22 11:08, Miaohe Lin wrote:
>> On 2022/9/3 7:07, Mike Kravetz wrote:
>>> On 08/30/22 10:02, Miaohe Lin wrote:
>>>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
>>>>>
>>>>> Faulting thread Unsharing thread
>>>>> ... ...
>>>>> ptep = huge_pte_offset()
>>>>> or
>>>>> ptep = huge_pte_alloc()
>>>>> ...
>>>>> i_mmap_lock_write
>>>>> lock page table
>>>>> ptep invalid <------------------------ huge_pmd_unshare()
>>>>> Could be in a previously unlock_page_table
>>>>> sharing process or worse i_mmap_unlock_write
>>>>> ...
>>>>>
>>>>> The vma_lock is used as follows:
>>>>> - During fault processing. the lock is acquired in read mode before
>>>>> doing a page table lock and allocation (huge_pte_alloc). The lock is
>>>>> held until code is finished with the page table entry (ptep).
>>>>> - The lock must be held in write mode whenever huge_pmd_unshare is
>>>>> called.
>>>>>
>>>>> Lock ordering issues come into play when unmapping a page from all
>>>>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
>>>>> vmas, and the vma lock must be held before calling unmap which will
>>>>> call huge_pmd_unshare. This is done today in:
>>>>> - try_to_migrate_one and try_to_unmap_ for page migration and memory
>>>>> error handling. In these routines we 'try' to obtain the vma lock and
>>>>> fail to unmap if unsuccessful. Calling routines already deal with the
>>>>> failure of unmapping.
>>>>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
>>>>> also tries to acquire the vma lock. If it fails, it skips the
>>>>> unmapping. However, we can not have file truncation or hole punch
>>>>> fail because of contention. After hugetlb_vmdelete_list, truncation
>>>>> and hole punch call remove_inode_hugepages. remove_inode_hugepages
>>>>> check for mapped pages and call hugetlb_unmap_file_page to unmap them.
>>>>> hugetlb_unmap_file_page is designed to drop locks and reacquire in the
>>>>> correct order to guarantee unmap success.
>>>>>
>>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>>> ---
>>>>> fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
>>>>> mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
>>>>> mm/memory.c | 2 +
>>>>> mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
>>>>> mm/userfaultfd.c | 9 +++-
>>>>> 5 files changed, 214 insertions(+), 45 deletions(-)
>>>>>
>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>> index b93d131b0cb5..52d9b390389b 100644
>>>>> --- a/fs/hugetlbfs/inode.c
>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>> struct folio *folio, pgoff_t index)
>>>>> {
>>>>> struct rb_root_cached *root = &mapping->i_mmap;
>>>>> + unsigned long skipped_vm_start;
>>>>> + struct mm_struct *skipped_mm;
>>>>> struct page *page = &folio->page;
>>>>> struct vm_area_struct *vma;
>>>>> unsigned long v_start;
>>>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>> end = ((index + 1) * pages_per_huge_page(h));
>>>>>
>>>>> i_mmap_lock_write(mapping);
>>>>> +retry:
>>>>> + skipped_mm = NULL;
>>>>>
>>>>> vma_interval_tree_foreach(vma, root, start, end - 1) {
>>>>> v_start = vma_offset_start(vma, start);
>>>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>>>> continue;
>>>>>
>>>>> + if (!hugetlb_vma_trylock_write(vma)) {
>>>>> + /*
>>>>> + * If we can not get vma lock, we need to drop
>>>>> + * immap_sema and take locks in order.
>>>>> + */
>>>>> + skipped_vm_start = vma->vm_start;
>>>>> + skipped_mm = vma->vm_mm;
>>>>> + /* grab mm-struct as we will be dropping i_mmap_sema */
>>>>> + mmgrab(skipped_mm);
>>>>> + break;
>>>>> + }
>>>>> +
>>>>> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>>>> NULL, ZAP_FLAG_DROP_MARKER);
>>>>> + hugetlb_vma_unlock_write(vma);
>>>>> }
>>>>>
>>>>> i_mmap_unlock_write(mapping);
>>>>> +
>>>>> + if (skipped_mm) {
>>>>> + mmap_read_lock(skipped_mm);
>>>>> + vma = find_vma(skipped_mm, skipped_vm_start);
>>>>> + if (!vma || !is_vm_hugetlb_page(vma) ||
>>>>> + vma->vm_file->f_mapping != mapping ||
>>>>> + vma->vm_start != skipped_vm_start) {
>>>>
>>>> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
>>>>
>>>
>>> Yes, that is missing. I will add here.
>>>
>>>>> + mmap_read_unlock(skipped_mm);
>>>>> + mmdrop(skipped_mm);
>>>>> + goto retry;
>>>>> + }
>>>>> +
>>>>
>>>> IMHO, above check is not enough. Think about the below scene:
>>>>
>>>> CPU 1 CPU 2
>>>> hugetlb_unmap_file_folio exit_mmap
>>>> mmap_read_lock(skipped_mm); mmap_read_lock(mm);
>>>> check vma is wanted.
>>>> unmap_vmas
>>>> mmap_read_unlock(skipped_mm); mmap_read_unlock
>>>> mmap_write_lock(mm);
>>>> free_pgtables
>>>> remove_vma
>>>> hugetlb_vma_lock_free
>>>> vma, hugetlb_vma_lock is still *used after free*
>>>> mmap_write_unlock(mm);
>>>> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?
>>>
>>> In the retry case, we are OK because go back and look up the vma again. Right?
>>>
>>> After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
>>> Before that, we do the following:
>>>
>>>>> + hugetlb_vma_lock_write(vma);
>>>>> + i_mmap_lock_write(mapping);
>>>
>>> IIUC, vma can not go away while we hold i_mmap_lock_write. So, after this we
>>
>> I think you're right. free_pgtables() can't complete its work as unlink_file_vma() will be
>> blocked on i_mmap_rwsem of mapping. Sorry for reporting such nonexistent race.
>>
>>> can.
>>>
>>>>> + mmap_read_unlock(skipped_mm);
>>>>> + mmdrop(skipped_mm);
>>>
>>> We continue to hold i_mmap_lock_write as we goto retry.
>>>
>>> I could be missing something as well. This was how I intended to keep
>>> vma valid while dropping and acquiring locks.
>>
>> Thanks for your clarifying.
>>
>
> Well, that was all correct 'in theory' but not in practice. I did not take
> into account the inode lock that is taken at the beginning of truncate (or
> hole punch). In other code paths, we take inode lock after mmap_lock. So,
> taking mmap_lock here is not allowed.

Considering the Lock ordering in mm/filemap.c:

* ->i_rwsem
*   ->invalidate_lock         (acquired by fs in truncate path)
*     ->i_mmap_rwsem          (truncate->unmap_mapping_range)

* ->i_rwsem                   (generic_perform_write)
*   ->mmap_lock               (fault_in_readable->do_page_fault)

It seems inode_lock is taken before the mmap_lock?

Thanks,
Miaohe Lin

>
> I came up with another way to make this work. As discussed above, we need to
> drop the i_mmap lock before acquiring the vma_lock. However, once we drop
> i_mmap, the vma could go away. My solution is to make the 'vma_lock' be a
> ref counted structure that can live on after the vma is freed. Therefore,
> this code can take a reference while under i_mmap then drop i_mmap and wait
> on the vma_lock. Of course, once it acquires the vma_lock it needs to check
> and make sure the vma still exists. It may sound complicated, but I think
> it is a bit simpler than the code here. A new series will be out soon.
>

2022-09-14 01:11:45

by Mike Kravetz

[permalink] [raw]
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 09/13/22 10:14, Miaohe Lin wrote:
> On 2022/9/13 7:02, Mike Kravetz wrote:
> > On 09/05/22 11:08, Miaohe Lin wrote:
> >> On 2022/9/3 7:07, Mike Kravetz wrote:
> >>> On 08/30/22 10:02, Miaohe Lin wrote:
> >>>> On 2022/8/25 1:57, Mike Kravetz wrote:
> >>>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
> >>>>>
> >>>>> Faulting thread Unsharing thread
> >>>>> ... ...
> >>>>> ptep = huge_pte_offset()
> >>>>> or
> >>>>> ptep = huge_pte_alloc()
> >>>>> ...
> >>>>> i_mmap_lock_write
> >>>>> lock page table
> >>>>> ptep invalid <------------------------ huge_pmd_unshare()
> >>>>> Could be in a previously unlock_page_table
> >>>>> sharing process or worse i_mmap_unlock_write
> >>>>> ...
> >>>>>
> >>>>> The vma_lock is used as follows:
> >>>>> - During fault processing. the lock is acquired in read mode before
> >>>>> doing a page table lock and allocation (huge_pte_alloc). The lock is
> >>>>> held until code is finished with the page table entry (ptep).
> >>>>> - The lock must be held in write mode whenever huge_pmd_unshare is
> >>>>> called.
> >>>>>
> >>>>> Lock ordering issues come into play when unmapping a page from all
> >>>>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
> >>>>> vmas, and the vma lock must be held before calling unmap which will
> >>>>> call huge_pmd_unshare. This is done today in:
> >>>>> - try_to_migrate_one and try_to_unmap_ for page migration and memory
> >>>>> error handling. In these routines we 'try' to obtain the vma lock and
> >>>>> fail to unmap if unsuccessful. Calling routines already deal with the
> >>>>> failure of unmapping.
> >>>>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
> >>>>> also tries to acquire the vma lock. If it fails, it skips the
> >>>>> unmapping. However, we can not have file truncation or hole punch
> >>>>> fail because of contention. After hugetlb_vmdelete_list, truncation
> >>>>> and hole punch call remove_inode_hugepages. remove_inode_hugepages
> >>>>> check for mapped pages and call hugetlb_unmap_file_page to unmap them.
> >>>>> hugetlb_unmap_file_page is designed to drop locks and reacquire in the
> >>>>> correct order to guarantee unmap success.
> >>>>>
> >>>>> Signed-off-by: Mike Kravetz <[email protected]>
> >>>>> ---
> >>>>> fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
> >>>>> mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
> >>>>> mm/memory.c | 2 +
> >>>>> mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
> >>>>> mm/userfaultfd.c | 9 +++-
> >>>>> 5 files changed, 214 insertions(+), 45 deletions(-)
> >>>>>
> >>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> >>>>> index b93d131b0cb5..52d9b390389b 100644
> >>>>> --- a/fs/hugetlbfs/inode.c
> >>>>> +++ b/fs/hugetlbfs/inode.c
> >>>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>> struct folio *folio, pgoff_t index)
> >>>>> {
> >>>>> struct rb_root_cached *root = &mapping->i_mmap;
> >>>>> + unsigned long skipped_vm_start;
> >>>>> + struct mm_struct *skipped_mm;
> >>>>> struct page *page = &folio->page;
> >>>>> struct vm_area_struct *vma;
> >>>>> unsigned long v_start;
> >>>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>> end = ((index + 1) * pages_per_huge_page(h));
> >>>>>
> >>>>> i_mmap_lock_write(mapping);
> >>>>> +retry:
> >>>>> + skipped_mm = NULL;
> >>>>>
> >>>>> vma_interval_tree_foreach(vma, root, start, end - 1) {
> >>>>> v_start = vma_offset_start(vma, start);
> >>>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
> >>>>> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
> >>>>> continue;
> >>>>>
> >>>>> + if (!hugetlb_vma_trylock_write(vma)) {
> >>>>> + /*
> >>>>> + * If we can not get vma lock, we need to drop
> >>>>> + * immap_sema and take locks in order.
> >>>>> + */
> >>>>> + skipped_vm_start = vma->vm_start;
> >>>>> + skipped_mm = vma->vm_mm;
> >>>>> + /* grab mm-struct as we will be dropping i_mmap_sema */
> >>>>> + mmgrab(skipped_mm);
> >>>>> + break;
> >>>>> + }
> >>>>> +
> >>>>> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
> >>>>> NULL, ZAP_FLAG_DROP_MARKER);
> >>>>> + hugetlb_vma_unlock_write(vma);
> >>>>> }
> >>>>>
> >>>>> i_mmap_unlock_write(mapping);
> >>>>> +
> >>>>> + if (skipped_mm) {
> >>>>> + mmap_read_lock(skipped_mm);
> >>>>> + vma = find_vma(skipped_mm, skipped_vm_start);
> >>>>> + if (!vma || !is_vm_hugetlb_page(vma) ||
> >>>>> + vma->vm_file->f_mapping != mapping ||
> >>>>> + vma->vm_start != skipped_vm_start) {
> >>>>
> >>>> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
> >>>>
> >>>
> >>> Yes, that is missing. I will add here.
> >>>
> >>>>> + mmap_read_unlock(skipped_mm);
> >>>>> + mmdrop(skipped_mm);
> >>>>> + goto retry;
> >>>>> + }
> >>>>> +
> >>>>
> >>>> IMHO, above check is not enough. Think about the below scene:
> >>>>
> >>>> CPU 1 CPU 2
> >>>> hugetlb_unmap_file_folio exit_mmap
> >>>> mmap_read_lock(skipped_mm); mmap_read_lock(mm);
> >>>> check vma is wanted.
> >>>> unmap_vmas
> >>>> mmap_read_unlock(skipped_mm); mmap_read_unlock
> >>>> mmap_write_lock(mm);
> >>>> free_pgtables
> >>>> remove_vma
> >>>> hugetlb_vma_lock_free
> >>>> vma, hugetlb_vma_lock is still *used after free*
> >>>> mmap_write_unlock(mm);
> >>>> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?
> >>>
> >>> In the retry case, we are OK because go back and look up the vma again. Right?
> >>>
> >>> After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
> >>> Before that, we do the following:
> >>>
> >>>>> + hugetlb_vma_lock_write(vma);
> >>>>> + i_mmap_lock_write(mapping);
> >>>
> >>> IIUC, vma can not go away while we hold i_mmap_lock_write. So, after this we
> >>
> >> I think you're right. free_pgtables() can't complete its work as unlink_file_vma() will be
> >> blocked on i_mmap_rwsem of mapping. Sorry for reporting such nonexistent race.
> >>
> >>> can.
> >>>
> >>>>> + mmap_read_unlock(skipped_mm);
> >>>>> + mmdrop(skipped_mm);
> >>>
> >>> We continue to hold i_mmap_lock_write as we goto retry.
> >>>
> >>> I could be missing something as well. This was how I intended to keep
> >>> vma valid while dropping and acquiring locks.
> >>
> >> Thanks for your clarifying.
> >>
> >
> > Well, that was all correct 'in theory' but not in practice. I did not take
> > into account the inode lock that is taken at the beginning of truncate (or
> > hole punch). In other code paths, we take inode lock after mmap_lock. So,
> > taking mmap_lock here is not allowed.
>
> Considering the Lock ordering in mm/filemap.c:
>
> * ->i_rwsem
> * ->invalidate_lock (acquired by fs in truncate path)
> * ->i_mmap_rwsem (truncate->unmap_mapping_range)
>
> * ->i_rwsem (generic_perform_write)
> * ->mmap_lock (fault_in_readable->do_page_fault)
>
> It seems inode_lock is taken before the mmap_lock?

Hmmmm? I can't find a sequence where inode_lock is taken after mmap_lock.
lockdep was complaining about taking mmap_lock after i_rwsem in the above code.
I assumed there was such a sequence somewhere. Might need to go back and get
another trace/warning.

In any case, I think the scheme below is much cleaner. Doing another round of
benchmarking before sending.

> > I came up with another way to make this work. As discussed above, we need to
> > drop the i_mmap lock before acquiring the vma_lock. However, once we drop
> > i_mmap, the vma could go away. My solution is to make the 'vma_lock' be a
> > ref counted structure that can live on after the vma is freed. Therefore,
> > this code can take a reference while under i_mmap then drop i_mmap and wait
> > on the vma_lock. Of course, once it acquires the vma_lock it needs to check
> > and make sure the vma still exists. It may sound complicated, but I think
> > it is a bit simpler than the code here. A new series will be out soon.
> >

--
Mike Kravetz

2022-09-14 02:24:28

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH 8/8] hugetlb: use new vma_lock for pmd sharing synchronization

On 2022/9/14 8:50, Mike Kravetz wrote:
> On 09/13/22 10:14, Miaohe Lin wrote:
>> On 2022/9/13 7:02, Mike Kravetz wrote:
>>> On 09/05/22 11:08, Miaohe Lin wrote:
>>>> On 2022/9/3 7:07, Mike Kravetz wrote:
>>>>> On 08/30/22 10:02, Miaohe Lin wrote:
>>>>>> On 2022/8/25 1:57, Mike Kravetz wrote:
>>>>>>> The new hugetlb vma lock (rw semaphore) is used to address this race:
>>>>>>>
>>>>>>> Faulting thread Unsharing thread
>>>>>>> ... ...
>>>>>>> ptep = huge_pte_offset()
>>>>>>> or
>>>>>>> ptep = huge_pte_alloc()
>>>>>>> ...
>>>>>>> i_mmap_lock_write
>>>>>>> lock page table
>>>>>>> ptep invalid <------------------------ huge_pmd_unshare()
>>>>>>> Could be in a previously unlock_page_table
>>>>>>> sharing process or worse i_mmap_unlock_write
>>>>>>> ...
>>>>>>>
>>>>>>> The vma_lock is used as follows:
>>>>>>> - During fault processing. the lock is acquired in read mode before
>>>>>>> doing a page table lock and allocation (huge_pte_alloc). The lock is
>>>>>>> held until code is finished with the page table entry (ptep).
>>>>>>> - The lock must be held in write mode whenever huge_pmd_unshare is
>>>>>>> called.
>>>>>>>
>>>>>>> Lock ordering issues come into play when unmapping a page from all
>>>>>>> vmas mapping the page. The i_mmap_rwsem must be held to search for the
>>>>>>> vmas, and the vma lock must be held before calling unmap which will
>>>>>>> call huge_pmd_unshare. This is done today in:
>>>>>>> - try_to_migrate_one and try_to_unmap_ for page migration and memory
>>>>>>> error handling. In these routines we 'try' to obtain the vma lock and
>>>>>>> fail to unmap if unsuccessful. Calling routines already deal with the
>>>>>>> failure of unmapping.
>>>>>>> - hugetlb_vmdelete_list for truncation and hole punch. This routine
>>>>>>> also tries to acquire the vma lock. If it fails, it skips the
>>>>>>> unmapping. However, we can not have file truncation or hole punch
>>>>>>> fail because of contention. After hugetlb_vmdelete_list, truncation
>>>>>>> and hole punch call remove_inode_hugepages. remove_inode_hugepages
>>>>>>> check for mapped pages and call hugetlb_unmap_file_page to unmap them.
>>>>>>> hugetlb_unmap_file_page is designed to drop locks and reacquire in the
>>>>>>> correct order to guarantee unmap success.
>>>>>>>
>>>>>>> Signed-off-by: Mike Kravetz <[email protected]>
>>>>>>> ---
>>>>>>> fs/hugetlbfs/inode.c | 46 +++++++++++++++++++
>>>>>>> mm/hugetlb.c | 102 +++++++++++++++++++++++++++++++++++++++----
>>>>>>> mm/memory.c | 2 +
>>>>>>> mm/rmap.c | 100 +++++++++++++++++++++++++++---------------
>>>>>>> mm/userfaultfd.c | 9 +++-
>>>>>>> 5 files changed, 214 insertions(+), 45 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>>>>>>> index b93d131b0cb5..52d9b390389b 100644
>>>>>>> --- a/fs/hugetlbfs/inode.c
>>>>>>> +++ b/fs/hugetlbfs/inode.c
>>>>>>> @@ -434,6 +434,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>> struct folio *folio, pgoff_t index)
>>>>>>> {
>>>>>>> struct rb_root_cached *root = &mapping->i_mmap;
>>>>>>> + unsigned long skipped_vm_start;
>>>>>>> + struct mm_struct *skipped_mm;
>>>>>>> struct page *page = &folio->page;
>>>>>>> struct vm_area_struct *vma;
>>>>>>> unsigned long v_start;
>>>>>>> @@ -444,6 +446,8 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>> end = ((index + 1) * pages_per_huge_page(h));
>>>>>>>
>>>>>>> i_mmap_lock_write(mapping);
>>>>>>> +retry:
>>>>>>> + skipped_mm = NULL;
>>>>>>>
>>>>>>> vma_interval_tree_foreach(vma, root, start, end - 1) {
>>>>>>> v_start = vma_offset_start(vma, start);
>>>>>>> @@ -452,11 +456,49 @@ static void hugetlb_unmap_file_folio(struct hstate *h,
>>>>>>> if (!hugetlb_vma_maps_page(vma, vma->vm_start + v_start, page))
>>>>>>> continue;
>>>>>>>
>>>>>>> + if (!hugetlb_vma_trylock_write(vma)) {
>>>>>>> + /*
>>>>>>> + * If we can not get vma lock, we need to drop
>>>>>>> + * immap_sema and take locks in order.
>>>>>>> + */
>>>>>>> + skipped_vm_start = vma->vm_start;
>>>>>>> + skipped_mm = vma->vm_mm;
>>>>>>> + /* grab mm-struct as we will be dropping i_mmap_sema */
>>>>>>> + mmgrab(skipped_mm);
>>>>>>> + break;
>>>>>>> + }
>>>>>>> +
>>>>>>> unmap_hugepage_range(vma, vma->vm_start + v_start, v_end,
>>>>>>> NULL, ZAP_FLAG_DROP_MARKER);
>>>>>>> + hugetlb_vma_unlock_write(vma);
>>>>>>> }
>>>>>>>
>>>>>>> i_mmap_unlock_write(mapping);
>>>>>>> +
>>>>>>> + if (skipped_mm) {
>>>>>>> + mmap_read_lock(skipped_mm);
>>>>>>> + vma = find_vma(skipped_mm, skipped_vm_start);
>>>>>>> + if (!vma || !is_vm_hugetlb_page(vma) ||
>>>>>>> + vma->vm_file->f_mapping != mapping ||
>>>>>>> + vma->vm_start != skipped_vm_start) {
>>>>>>
>>>>>> i_mmap_lock_write(mapping) is missing here? Retry logic will do i_mmap_unlock_write(mapping) anyway.
>>>>>>
>>>>>
>>>>> Yes, that is missing. I will add here.
>>>>>
>>>>>>> + mmap_read_unlock(skipped_mm);
>>>>>>> + mmdrop(skipped_mm);
>>>>>>> + goto retry;
>>>>>>> + }
>>>>>>> +
>>>>>>
>>>>>> IMHO, above check is not enough. Think about the below scene:
>>>>>>
>>>>>> CPU 1 CPU 2
>>>>>> hugetlb_unmap_file_folio exit_mmap
>>>>>> mmap_read_lock(skipped_mm); mmap_read_lock(mm);
>>>>>> check vma is wanted.
>>>>>> unmap_vmas
>>>>>> mmap_read_unlock(skipped_mm); mmap_read_unlock
>>>>>> mmap_write_lock(mm);
>>>>>> free_pgtables
>>>>>> remove_vma
>>>>>> hugetlb_vma_lock_free
>>>>>> vma, hugetlb_vma_lock is still *used after free*
>>>>>> mmap_write_unlock(mm);
>>>>>> So we should check mm->mm_users == 0 to fix the above issue. Or am I miss something?
>>>>>
>>>>> In the retry case, we are OK because go back and look up the vma again. Right?
>>>>>
>>>>> After taking mmap_read_lock, vma can not go away until we mmap_read_unlock.
>>>>> Before that, we do the following:
>>>>>
>>>>>>> + hugetlb_vma_lock_write(vma);
>>>>>>> + i_mmap_lock_write(mapping);
>>>>>
>>>>> IIUC, vma can not go away while we hold i_mmap_lock_write. So, after this we
>>>>
>>>> I think you're right. free_pgtables() can't complete its work as unlink_file_vma() will be
>>>> blocked on i_mmap_rwsem of mapping. Sorry for reporting such nonexistent race.
>>>>
>>>>> can.
>>>>>
>>>>>>> + mmap_read_unlock(skipped_mm);
>>>>>>> + mmdrop(skipped_mm);
>>>>>
>>>>> We continue to hold i_mmap_lock_write as we goto retry.
>>>>>
>>>>> I could be missing something as well. This was how I intended to keep
>>>>> vma valid while dropping and acquiring locks.
>>>>
>>>> Thanks for your clarifying.
>>>>
>>>
>>> Well, that was all correct 'in theory' but not in practice. I did not take
>>> into account the inode lock that is taken at the beginning of truncate (or
>>> hole punch). In other code paths, we take inode lock after mmap_lock. So,
>>> taking mmap_lock here is not allowed.
>>
>> Considering the Lock ordering in mm/filemap.c:
>>
>> * ->i_rwsem
>> * ->invalidate_lock (acquired by fs in truncate path)
>> * ->i_mmap_rwsem (truncate->unmap_mapping_range)
>>
>> * ->i_rwsem (generic_perform_write)
>> * ->mmap_lock (fault_in_readable->do_page_fault)
>>
>> It seems inode_lock is taken before the mmap_lock?
>
> Hmmmm? I can't find a sequence where inode_lock is taken after mmap_lock.
> lockdep was complaining about taking mmap_lock after i_rwsem in the above code.
> I assumed there was such a sequence somewhere. Might need to go back and get
> another trace/warning.

Sorry, I'm somewhat confused. Take generic_file_write_iter() as an example:

generic_file_write_iter
  inode_lock(inode);                 -- *inode lock is held here*
  __generic_file_write_iter
    generic_perform_write
      fault_in_iov_iter_readable     -- *may cause page fault and thus take mmap_lock*
  inode_unlock(inode);

This is the documented example in mm/filemap.c. So we should take inode_lock before
taking mmap_lock. Or is this out-dated? And does the above example need a fix?
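
As a toy userspace illustration of that ordering rule (pthread mutexes stand in
for i_rwsem and mmap_lock; this is not kernel code): as long as every path takes
the inode lock before the mmap lock there is no cycle, while a path that acquired
mmap_lock first and then waited on the inode lock would add the reverse edge and
allow deadlock, which is the kind of inversion lockdep reports.

#include <pthread.h>
#include <stdio.h>

/* Stand-ins for i_rwsem (the inode lock) and mmap_lock. */
static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mmap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Both threads follow the documented order: inode lock first, then
 * mmap_lock (as in generic_file_write_iter -> fault_in_iov_iter_readable
 * taking a page fault while i_rwsem is held). */
static void *write_path(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&inode_lock);
        pthread_mutex_lock(&mmap_lock);         /* page fault during copy-in */
        pthread_mutex_unlock(&mmap_lock);
        pthread_mutex_unlock(&inode_lock);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, write_path, NULL);
        pthread_create(&t2, NULL, write_path, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* A path that instead took mmap_lock first and then waited on the
         * inode lock would invert the order and risk deadlock. */
        puts("consistent ordering: no deadlock");
        return 0;
}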

>
> In any case, I think the scheme below is much cleaner. Doing another round of
> benchmarking before sending.

That should be a good alternative. Thanks for your work. :)

Thanks,
Miaohe Lin

>
>>> I came up with another way to make this work. As discussed above, we need to
>>> drop the i_mmap lock before acquiring the vma_lock. However, once we drop
>>> i_mmap, the vma could go away. My solution is to make the 'vma_lock' be a
>>> ref counted structure that can live on after the vma is freed. Therefore,
>>> this code can take a reference while under i_mmap then drop i_mmap and wait
>>> on the vma_lock. Of course, once it acquires the vma_lock it needs to check
>>> and make sure the vma still exists. It may sound complicated, but I think
>>> it is a bit simpler than the code here. A new series will be out soon.
>>>
>