2014-10-30 19:44:43

by Davidlohr Bueso

Subject: [PATCH v2 -next 00/10] mm: improve usage of the i_mmap lock

Changes from v1:
o Collected Acks from Kirill and Srikar.
o Updated to apply on top of linux-next.

Hello,

This series is a continuation of the conversion of the
i_mmap_mutex to rwsem, following what we have for the
anon memory counterpart. With Hugh's feedback from the
first iteration (sorry about letting this fall behind for
so long, but I've finally had time to re-look at this
-- see https://lkml.org/lkml/2014/5/22/797), several
additional opportunities for sharing the lock are proposed.

Ultimately, the most obvious paths that require exclusive
ownership of the lock are those that modify the VMA interval
tree, via the vma_interval_tree_insert() and vma_interval_tree_remove()
families. Cases such as unmapping, where the pte contents are
changed but the tree remains untouched, should make it safe
to share the i_mmap_rwsem.
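
To illustrate (this is just a sketch of the intended discipline,
using the helpers introduced over the course of this series, not
code lifted from any one caller):

	/* Modifying the interval tree: exclusive ownership required. */
	i_mmap_lock_write(mapping);
	vma_interval_tree_insert(vma, &mapping->i_mmap);
	i_mmap_unlock_write(mapping);

	/*
	 * Walk-only paths (e.g. unmapping ptes): the tree itself is
	 * untouched, so the lock can be shared and walkers -- page
	 * migration's rmap walk being the big winner -- may run
	 * concurrently.
	 */
	i_mmap_lock_read(mapping);
	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
		/* operate on the ptes of this vma only */
	}
	i_mmap_unlock_read(mapping);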

As such, the code itself is straightforward; however,
the devil is very much in the details. While it's been tested
on a number of workloads without anything exploding, I would
not be surprised if there are some less documented/known
assumptions about the lock that could suffer from these
changes. Or maybe I'm just missing something, but either way
I believe it's at the point where it could use more eyes and
hopefully some time in linux-next.

Because the lock type conversion is the heart of this patchset,
it's worth noting a few comparisons between mutex and rwsem (xadd):

(i) Same size, no extra footprint.

(ii) Both have CONFIG_XXX_SPIN_ON_OWNER capabilities for
exclusive lock ownership.

(iii) Both can be slightly unfair wrt exclusive ownership, with
writer lock stealing properties, not necessarily respecting
FIFO order for granting the lock when contended.

(iv) Mutexes can be slightly faster than rwsems when
the lock is non-contended.

(v) Both suck at performance for debug (slowpaths), which
shouldn't matter anyway.

Sharing the lock is obviously beneficial, and rwsem writer ownership
is close enough to mutexes in performance. The biggest winner of these
changes is migration.

As for concrete numbers, the following performance results are
for a 4-socket, 60-core IvyBridge-EX with 130GB of RAM.

Both the alltests and disk (xfs+ramdisk) workloads of the aim7 suite do
quite well with this set, with a steady ~60% throughput (jpm) increase
for alltests and up to ~30% for disk at high amounts of concurrency.
Lower workload user counts (< 100) do not show much difference
at all, so at least no regressions.

3.18-rc1 3.18-rc1-i_mmap_rwsem
alltests-100 17918.72 ( 0.00%) 28417.97 ( 58.59%)
alltests-200 16529.39 ( 0.00%) 26807.92 ( 62.18%)
alltests-300 16591.17 ( 0.00%) 26878.08 ( 62.00%)
alltests-400 16490.37 ( 0.00%) 26664.63 ( 61.70%)
alltests-500 16593.17 ( 0.00%) 26433.72 ( 59.30%)
alltests-600 16508.56 ( 0.00%) 26409.20 ( 59.97%)
alltests-700 16508.19 ( 0.00%) 26298.58 ( 59.31%)
alltests-800 16437.58 ( 0.00%) 26433.02 ( 60.81%)
alltests-900 16418.35 ( 0.00%) 26241.61 ( 59.83%)
alltests-1000 16369.00 ( 0.00%) 26195.76 ( 60.03%)
alltests-1100 16330.11 ( 0.00%) 26133.46 ( 60.03%)
alltests-1200 16341.30 ( 0.00%) 26084.03 ( 59.62%)
alltests-1300 16304.75 ( 0.00%) 26024.74 ( 59.61%)
alltests-1400 16231.08 ( 0.00%) 25952.35 ( 59.89%)
alltests-1500 16168.06 ( 0.00%) 25850.58 ( 59.89%)
alltests-1600 16142.56 ( 0.00%) 25767.42 ( 59.62%)
alltests-1700 16118.91 ( 0.00%) 25689.58 ( 59.38%)
alltests-1800 16068.06 ( 0.00%) 25599.71 ( 59.32%)
alltests-1900 16046.94 ( 0.00%) 25525.92 ( 59.07%)
alltests-2000 16007.26 ( 0.00%) 25513.07 ( 59.38%)

disk-100 7582.14 ( 0.00%) 7257.48 ( -4.28%)
disk-200 6962.44 ( 0.00%) 7109.15 ( 2.11%)
disk-300 6435.93 ( 0.00%) 6904.75 ( 7.28%)
disk-400 6370.84 ( 0.00%) 6861.26 ( 7.70%)
disk-500 6353.42 ( 0.00%) 6846.71 ( 7.76%)
disk-600 6368.82 ( 0.00%) 6806.75 ( 6.88%)
disk-700 6331.37 ( 0.00%) 6796.01 ( 7.34%)
disk-800 6324.22 ( 0.00%) 6788.00 ( 7.33%)
disk-900 6253.52 ( 0.00%) 6750.43 ( 7.95%)
disk-1000 6242.53 ( 0.00%) 6855.11 ( 9.81%)
disk-1100 6234.75 ( 0.00%) 6858.47 ( 10.00%)
disk-1200 6312.76 ( 0.00%) 6845.13 ( 8.43%)
disk-1300 6309.95 ( 0.00%) 6834.51 ( 8.31%)
disk-1400 6171.76 ( 0.00%) 6787.09 ( 9.97%)
disk-1500 6139.81 ( 0.00%) 6761.09 ( 10.12%)
disk-1600 4807.12 ( 0.00%) 6725.33 ( 39.90%)
disk-1700 4669.50 ( 0.00%) 5985.38 ( 28.18%)
disk-1800 4663.51 ( 0.00%) 5972.99 ( 28.08%)
disk-1900 4674.31 ( 0.00%) 5949.94 ( 27.29%)
disk-2000 4668.36 ( 0.00%) 5834.93 ( 24.99%)

In addition, there is a 67.5% increase in successfully migrated NUMA
pages, thus improving node locality.

The patch layout is simple but designed for bisection (in case
reversion is needed if the changes break upstream) and easier
review:

o Patches 1-4 convert the i_mmap lock from mutex to rwsem.
o Patches 5-10 share the lock in specific paths; each patch
details the rationale for why it should be safe.

This patchset has been tested with: postgres 9.4 (with brand new
hugetlb support), hugetlbfs test suite (all tests pass, in fact more
tests pass with these changes than with an upstream kernel), ltp, aim7
benchmarks, memcached and iozone with the -B option for mmap'ing.
*Untested* paths are nommu, memory-failure, uprobes and xip.

Applies on top of linux-next (20141030).

Thanks!

Davidlohr Bueso (10):
mm,fs: introduce helpers around the i_mmap_mutex
mm: use new helper functions around the i_mmap_mutex
mm: convert i_mmap_mutex to rwsem
mm/rmap: share the i_mmap_rwsem
uprobes: share the i_mmap_rwsem
mm/xip: share the i_mmap_rwsem
mm/memory-failure: share the i_mmap_rwsem
mm/mremap: share the i_mmap_rwsem
mm/nommu: share the i_mmap_rwsem
mm/hugetlb: share the i_mmap_rwsem

fs/hugetlbfs/inode.c | 14 +++++++-------
fs/inode.c | 2 +-
include/linux/fs.h | 23 ++++++++++++++++++++++-
include/linux/mmu_notifier.h | 2 +-
kernel/events/uprobes.c | 6 +++---
kernel/fork.c | 4 ++--
mm/filemap.c | 10 +++++-----
mm/filemap_xip.c | 23 +++++++++--------------
mm/hugetlb.c | 22 +++++++++++-----------
mm/memory-failure.c | 4 ++--
mm/memory.c | 8 ++++----
mm/mmap.c | 22 +++++++++++-----------
mm/mremap.c | 6 +++---
mm/nommu.c | 17 ++++++++---------
mm/rmap.c | 12 ++++++------
15 files changed, 95 insertions(+), 80 deletions(-)

--
1.8.4.5


2014-10-30 19:42:25

by Davidlohr Bueso

Subject: [PATCH 08/10] mm/mremap: share the i_mmap_rwsem

As per the comment in move_ptes(), we only require taking the
anon vma and i_mmap locks to ensure that rmap will always observe
either the old or the new ptes, in the case of need_rmap_locks=true.
No modifications to the tree itself, thus share the i_mmap_rwsem.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
mm/mremap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index c929324..09bd644 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -119,7 +119,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
if (need_rmap_locks) {
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
}
if (vma->anon_vma) {
anon_vma = vma->anon_vma;
@@ -156,7 +156,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
if (anon_vma)
anon_vma_unlock_read(anon_vma);
if (mapping)
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
}

#define LATENCY_LIMIT (64 * PAGE_SIZE)
--
1.8.4.5

2014-10-30 19:42:28

by Davidlohr Bueso

Subject: [PATCH 09/10] mm/nommu: share the i_mmap_rwsem

Shrinking/truncate logic can call nommu_shrink_inode_mappings()
to verify that any shared mappings of the inode in question aren't
broken (dead zone). AFAICT the only user is ramfs, handling the
size change attribute.

Pretty much a no-brainer to share the lock.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
mm/nommu.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index 4201a38..2266a34 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -2086,14 +2086,14 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;

down_write(&nommu_region_sem);
- i_mmap_lock_write(inode->i_mapping);
+ i_mmap_lock_read(inode->i_mapping);

/* search for VMAs that fall within the dead zone */
vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
/* found one - only interested if it's shared out of the page
* cache */
if (vma->vm_flags & VM_SHARED) {
- i_mmap_unlock_write(inode->i_mapping);
+ i_mmap_unlock_read(inode->i_mapping);
up_write(&nommu_region_sem);
return -ETXTBSY; /* not quite true, but near enough */
}
@@ -2105,8 +2105,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
* we don't check for any regions that start beyond the EOF as there
* shouldn't be any
*/
- vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap,
- 0, ULONG_MAX) {
+ vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, 0, ULONG_MAX) {
if (!(vma->vm_flags & VM_SHARED))
continue;

@@ -2121,7 +2120,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
}
}

- i_mmap_unlock_write(inode->i_mapping);
+ i_mmap_unlock_read(inode->i_mapping);
up_write(&nommu_region_sem);
return 0;
}
--
1.8.4.5

2014-10-30 19:42:26

by Davidlohr Bueso

Subject: [PATCH 05/10] uprobes: share the i_mmap_rwsem

Both register and unregister call build_map_info() in order
to create the list of mappings before installing or removing
breakpoints for every mm which maps file backed memory. As
such, there is no reason to hold the i_mmap_rwsem exclusively,
so share it and allow concurrent readers to build the mapping
data.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Srikar Dronamraju <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
kernel/events/uprobes.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index e1bb60d..6158a64b 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -724,7 +724,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
int more = 0;

again:
- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
@@ -755,7 +755,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
info->mm = vma->vm_mm;
info->vaddr = offset_to_vaddr(vma, offset);
}
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);

if (!more)
goto out;
--
1.8.4.5

2014-10-30 19:42:24

by Davidlohr Bueso

Subject: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

The i_mmap_rwsem protects shared pages against races
when doing the sharing and unsharing, ultimately
calling huge_pmd_share/unshare() for PMD pages --
it also needs it to avoid races when populating the pud
for pmd allocation when looking for a shareable pmd page
for hugetlb. Ultimately the interval tree remains intact.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
fs/hugetlbfs/inode.c | 4 ++--
mm/hugetlb.c | 12 ++++++------
mm/memory.c | 4 ++--
3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 5eba47f..0dca54d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -412,10 +412,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
pgoff = offset >> PAGE_SHIFT;

i_size_write(inode, offset);
- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
truncate_hugepages(inode, offset);
return 0;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2071cf4..80349f2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2775,7 +2775,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
* this mapping should be shared between all the VMAs,
* __unmap_hugepage_range() is called as the lock is already held
*/
- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
/* Do not unmap the current VMA */
if (iter_vma == vma)
@@ -2792,7 +2792,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
unmap_hugepage_range(iter_vma, address,
address + huge_page_size(h), page);
}
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
}

/*
@@ -3350,7 +3350,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_cache_range(vma, address, end);

mmu_notifier_invalidate_range_start(mm, start, end);
- i_mmap_lock_write(vma->vm_file->f_mapping);
+ i_mmap_lock_read(vma->vm_file->f_mapping);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
ptep = huge_pte_offset(mm, address);
@@ -3379,7 +3379,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
*/
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
mmu_notifier_invalidate_range_end(mm, start, end);

return pages << h->order;
@@ -3547,7 +3547,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);

- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -3575,7 +3575,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
return pte;
}

diff --git a/mm/memory.c b/mm/memory.c
index 22c3089..2ca3105 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1345,9 +1345,9 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
- i_mmap_lock_write(vma->vm_file->f_mapping);
+ i_mmap_lock_read(vma->vm_file->f_mapping);
__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
- i_mmap_unlock_write(vma->vm_file->f_mapping);
+ i_mmap_unlock_read(vma->vm_file->f_mapping);
}
} else
unmap_page_range(tlb, vma, start, end, details);
--
1.8.4.5

2014-10-30 19:43:20

by Davidlohr Bueso

Subject: [PATCH 07/10] mm/memory-failure: share the i_mmap_rwsem

No-brainer conversion: collect_procs_file() only schedules
processes for a later kill, so share the lock, similarly to
the anon vma variant.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
mm/memory-failure.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e1646fe..e619625 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -466,7 +466,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct task_struct *tsk;
struct address_space *mapping = page->mapping;

- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
pgoff_t pgoff = page_to_pgoff(page);
@@ -488,7 +488,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
}
}
read_unlock(&tasklist_lock);
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
}

/*
--
1.8.4.5

2014-10-30 19:43:18

by Davidlohr Bueso

Subject: [PATCH 01/10] mm,fs: introduce helpers around the i_mmap_mutex

Various parts of the kernel acquire and release this mutex,
so add i_mmap_lock_write() and i_mmap_unlock_write() helper
functions that will encapsulate this logic. The next patch
will make use of these.

Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/fs.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 09d4480..f3a2372 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -467,6 +467,16 @@ struct block_device {

int mapping_tagged(struct address_space *mapping, int tag);

+static inline void i_mmap_lock_write(struct address_space *mapping)
+{
+ mutex_lock(&mapping->i_mmap_mutex);
+}
+
+static inline void i_mmap_unlock_write(struct address_space *mapping)
+{
+ mutex_unlock(&mapping->i_mmap_mutex);
+}
+
/*
* Might pages of this file be mapped into userspace?
*/
--
1.8.4.5

2014-10-30 19:43:17

by Davidlohr Bueso

Subject: [PATCH 06/10] mm/xip: share the i_mmap_rwsem

__xip_unmap() will remove the xip sparse page from the cache
and tear down its pte mappings, without altering the interval tree,
thus share the i_mmap_rwsem when searching for the ptes to
unmap.

Additionally, tidy up the function a bit and make variables only
local to the interval tree walk loop.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap_xip.c | 23 +++++++++--------------
1 file changed, 9 insertions(+), 14 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index bad746b..0d105ae 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -155,22 +155,14 @@ xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
EXPORT_SYMBOL_GPL(xip_file_read);

/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
+ * __xip_unmap is invoked from xip_unmap and xip_write
*
* This function walks all vmas of the address_space and unmaps the
* __xip_sparse_page when found at pgoff.
*/
-static void
-__xip_unmap (struct address_space * mapping,
- unsigned long pgoff)
+static void __xip_unmap(struct address_space * mapping, unsigned long pgoff)
{
struct vm_area_struct *vma;
- struct mm_struct *mm;
- unsigned long address;
- pte_t *pte;
- pte_t pteval;
- spinlock_t *ptl;
struct page *page;
unsigned count;
int locked = 0;
@@ -182,11 +174,14 @@ __xip_unmap (struct address_space * mapping,
return;

retry:
- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- mm = vma->vm_mm;
- address = vma->vm_start +
+ pte_t *pte, pteval;
+ spinlock_t *ptl;
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address = vma->vm_start +
((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+
BUG_ON(address < vma->vm_start || address >= vma->vm_end);
pte = page_check_address(page, mm, address, &ptl, 1);
if (pte) {
@@ -202,7 +197,7 @@ retry:
page_cache_release(page);
}
}
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);

if (locked) {
mutex_unlock(&xip_sparse_mutex);
--
1.8.4.5

2014-10-30 19:43:16

by Davidlohr Bueso

Subject: [PATCH 03/10] mm: convert i_mmap_mutex to rwsem

The i_mmap_mutex is a close cousin of the anon vma lock,
both protecting similar data: one for file-backed pages
and the other for anon memory. As such, this lock can
also be an rwsem. In addition, there are some important
opportunities to share the lock when there are no tree
modifications.

This conversion is straightforward. For now, all users take
the write lock.

Signed-off-by: Davidlohr Bueso <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
fs/hugetlbfs/inode.c | 10 +++++-----
fs/inode.c | 2 +-
include/linux/fs.h | 7 ++++---
include/linux/mmu_notifier.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/filemap.c | 10 +++++-----
mm/hugetlb.c | 10 +++++-----
mm/mmap.c | 8 ++++----
mm/mremap.c | 2 +-
mm/rmap.c | 6 +++---
10 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a082709..5eba47f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -472,12 +472,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
}

/*
- * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
+ * Hugetlbfs is not reclaimable; therefore its i_mmap_rwsem will never
* be taken from reclaim -- unlike regular filesystems. This needs an
* annotation because huge_pmd_share() does an allocation under
- * i_mmap_mutex.
+ * i_mmap_rwsem.
*/
-static struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+static struct lock_class_key hugetlbfs_i_mmap_rwsem_key;

static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct inode *dir,
@@ -495,8 +495,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct hugetlbfs_inode_info *info;
inode->i_ino = get_next_ino();
inode_init_owner(inode, dir, mode);
- lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
- &hugetlbfs_i_mmap_mutex_key);
+ lockdep_set_class(&inode->i_mapping->i_mmap_rwsem,
+ &hugetlbfs_i_mmap_rwsem_key);
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba..1732d6b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -349,7 +349,7 @@ void address_space_init_once(struct address_space *mapping)
memset(mapping, 0, sizeof(*mapping));
INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
spin_lock_init(&mapping->tree_lock);
- mutex_init(&mapping->i_mmap_mutex);
+ init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
spin_lock_init(&mapping->private_lock);
mapping->i_mmap = RB_ROOT;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f3a2372..648a77e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -18,6 +18,7 @@
#include <linux/pid.h>
#include <linux/bug.h>
#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/capability.h>
#include <linux/semaphore.h>
#include <linux/fiemap.h>
@@ -401,7 +402,7 @@ struct address_space {
atomic_t i_mmap_writable;/* count VM_SHARED mappings */
struct rb_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
- struct mutex i_mmap_mutex; /* protect tree, count, list */
+ struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
unsigned long nrshadows; /* number of shadow entries */
@@ -469,12 +470,12 @@ int mapping_tagged(struct address_space *mapping, int tag);

static inline void i_mmap_lock_write(struct address_space *mapping)
{
- mutex_lock(&mapping->i_mmap_mutex);
+ down_write(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_unlock_write(struct address_space *mapping)
{
- mutex_unlock(&mapping->i_mmap_mutex);
+ up_write(&mapping->i_mmap_rwsem);
}

/*
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 94d19f6..95243d2 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -177,7 +177,7 @@ struct mmu_notifier_ops {
* Therefore notifier chains can only be traversed when either
*
* 1. mmap_sem is held.
- * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem).
+ * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem).
* 3. No other concurrent thread can access the list (release)
*/
struct mmu_notifier {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index a1d99e3..e1bb60d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -731,7 +731,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)

if (!prev && !more) {
/*
- * Needs GFP_NOWAIT to avoid i_mmap_mutex recursion through
+ * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
* reclaim. This is optimistic, no harm done if it fails.
*/
prev = kmalloc(sizeof(struct map_info),
diff --git a/mm/filemap.c b/mm/filemap.c
index 14b4642..e8905bc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -62,16 +62,16 @@
/*
* Lock ordering:
*
- * ->i_mmap_mutex (truncate_pagecache)
+ * ->i_mmap_rwsem (truncate_pagecache)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->swap_lock (exclusive_swap_page, others)
* ->mapping->tree_lock
*
* ->i_mutex
- * ->i_mmap_mutex (truncate->unmap_mapping_range)
+ * ->i_mmap_rwsem (truncate->unmap_mapping_range)
*
* ->mmap_sem
- * ->i_mmap_mutex
+ * ->i_mmap_rwsem
* ->page_table_lock or pte_lock (various, mainly in memory.c)
* ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
*
@@ -85,7 +85,7 @@
* sb_lock (fs/fs-writeback.c)
* ->mapping->tree_lock (__sync_single_inode)
*
- * ->i_mmap_mutex
+ * ->i_mmap_rwsem
* ->anon_vma.lock (vma_adjust)
*
* ->anon_vma.lock
@@ -105,7 +105,7 @@
* ->inode->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
*
- * ->i_mmap_mutex
+ * ->i_mmap_rwsem
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5d8758f..2071cf4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2727,9 +2727,9 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
* on its way out. We're lucky that the flag has such an appropriate
* name, and can in fact be safely cleared here. We could clear it
* before the __unmap_hugepage_range above, but all that's necessary
- * is to clear it before releasing the i_mmap_mutex. This works
+ * is to clear it before releasing the i_mmap_rwsem. This works
* because in the context this is called, the VMA is about to be
- * destroyed and the i_mmap_mutex is held.
+ * destroyed and the i_mmap_rwsem is held.
*/
vma->vm_flags &= ~VM_MAYSHARE;
}
@@ -3372,9 +3372,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
spin_unlock(ptl);
}
/*
- * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+ * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
* may have cleared our pud entry and done put_page on the page table:
- * once we release i_mmap_mutex, another task can do the final put_page
+ * once we release i_mmap_rwsem, another task can do the final put_page
* and that page table be reused and filled with junk.
*/
flush_tlb_range(vma, start, end);
@@ -3528,7 +3528,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
* code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_mutex section - otherwise
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
* racing tasks could either miss the sharing (see huge_pte_offset) or select a
* bad pmd for sharing.
*/
diff --git a/mm/mmap.c b/mm/mmap.c
index 136f4c8..db4ceac 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -232,7 +232,7 @@ error:
}

/*
- * Requires inode->i_mapping->i_mmap_mutex
+ * Requires inode->i_mapping->i_mmap_rwsem
*/
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
@@ -2854,7 +2854,7 @@ void exit_mmap(struct mm_struct *mm)

/* Insert vm structure into process list sorted by address
* and into the inode's i_mmap tree. If vm_file is non-NULL
- * then i_mmap_mutex is taken here.
+ * then i_mmap_rwsem is taken here.
*/
int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
@@ -3149,7 +3149,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
*/
if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
BUG();
- mutex_lock_nest_lock(&mapping->i_mmap_mutex, &mm->mmap_sem);
+ down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_sem);
}
}

@@ -3176,7 +3176,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
* vma in this mm is backed by the same anon_vma or address_space.
*
* We can take all the locks in random order because the VM code
- * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never
+ * taking i_mmap_rwsem or anon_vma->rwsem outside the mmap_sem never
* takes more than one of them in a row. Secondly we're protected
* against a concurrent mm_take_all_locks() by the mm_all_locks_mutex.
*
diff --git a/mm/mremap.c b/mm/mremap.c
index fa7c432..c929324 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -99,7 +99,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spinlock_t *old_ptl, *new_ptl;

/*
- * When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma
+ * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
* locks to ensure that rmap will always observe either the old or the
* new ptes. This is the easiest way to avoid races with
* truncate_pagecache(), page migration, etc...
diff --git a/mm/rmap.c b/mm/rmap.c
index ae72965..e0c0e90 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,7 +23,7 @@
* inode->i_mutex (while writing or truncating, not reading or faulting)
* mm->mmap_sem
* page->flags PG_locked (lock_page)
- * mapping->i_mmap_mutex
+ * mapping->i_mmap_rwsem
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -1258,7 +1258,7 @@ out_mlock:
/*
* We need mmap_sem locking, Otherwise VM_LOCKED check makes
* unstable result and race. Plus, We can't wait here because
- * we now hold anon_vma->rwsem or mapping->i_mmap_mutex.
+ * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
* if trylock failed, the page remain in evictable lru and later
* vmscan could retry to move the page to unevictable lru if the
* page is actually mlocked.
@@ -1682,7 +1682,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
* The page lock not only makes sure that page->mapping cannot
* suddenly be NULLified by truncation, it makes sure that the
* structure at mapping cannot be freed and reused yet,
- * so we can safely take mapping->i_mmap_mutex.
+ * so we can safely take mapping->i_mmap_rwsem.
*/
VM_BUG_ON_PAGE(!PageLocked(page), page);

--
1.8.4.5

2014-10-30 19:44:46

by Davidlohr Bueso

Subject: [PATCH 02/10] mm: use new helper functions around the i_mmap_mutex

Convert all open-coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
fs/hugetlbfs/inode.c | 4 ++--
kernel/events/uprobes.c | 4 ++--
kernel/fork.c | 4 ++--
mm/filemap_xip.c | 4 ++--
mm/hugetlb.c | 12 ++++++------
mm/memory-failure.c | 4 ++--
mm/memory.c | 8 ++++----
mm/mmap.c | 14 +++++++-------
mm/mremap.c | 4 ++--
mm/nommu.c | 14 +++++++-------
mm/rmap.c | 4 ++--
11 files changed, 38 insertions(+), 38 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 1e2872b..a082709 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -412,10 +412,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
pgoff = offset >> PAGE_SHIFT;

i_size_write(inode, offset);
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
truncate_hugepages(inode, offset);
return 0;
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bc143cf..a1d99e3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -724,7 +724,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
int more = 0;

again:
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
@@ -755,7 +755,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
info->mm = vma->vm_mm;
info->vaddr = offset_to_vaddr(vma, offset);
}
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

if (!more)
goto out;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9ca8418..4dc2dda 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -433,7 +433,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
get_file(file);
if (tmp->vm_flags & VM_DENYWRITE)
atomic_dec(&inode->i_writecount);
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (tmp->vm_flags & VM_SHARED)
atomic_inc(&mapping->i_mmap_writable);
flush_dcache_mmap_lock(mapping);
@@ -445,7 +445,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
vma_interval_tree_insert_after(tmp, mpnt,
&mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/*
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..bad746b 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapping,
return;

retry:
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;
address = vma->vm_start +
@@ -202,7 +202,7 @@ retry:
page_cache_release(page);
}
}
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

if (locked) {
mutex_unlock(&xip_sparse_mutex);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f0cca62..5d8758f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2775,7 +2775,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
* this mapping should be shared between all the VMAs,
* __unmap_hugepage_range() is called as the lock is already held
*/
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
/* Do not unmap the current VMA */
if (iter_vma == vma)
@@ -2792,7 +2792,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
unmap_hugepage_range(iter_vma, address,
address + huge_page_size(h), page);
}
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/*
@@ -3350,7 +3350,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
flush_cache_range(vma, address, end);

mmu_notifier_invalidate_range_start(mm, start, end);
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
for (; address < end; address += huge_page_size(h)) {
spinlock_t *ptl;
ptep = huge_pte_offset(mm, address);
@@ -3379,7 +3379,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
*/
flush_tlb_range(vma, start, end);
mmu_notifier_invalidate_range(mm, start, end);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
mmu_notifier_invalidate_range_end(mm, start, end);

return pages << h->order;
@@ -3547,7 +3547,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -3575,7 +3575,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spin_unlock(ptl);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
return pte;
}

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 84e7ded..e1646fe 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -466,7 +466,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct task_struct *tsk;
struct address_space *mapping = page->mapping;

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
pgoff_t pgoff = page_to_pgoff(page);
@@ -488,7 +488,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
}
}
read_unlock(&tasklist_lock);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/*
diff --git a/mm/memory.c b/mm/memory.c
index f61d341..22c3089 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1345,9 +1345,9 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
}
} else
unmap_page_range(tlb, vma, start, end, details);
@@ -2396,12 +2396,12 @@ void unmap_mapping_range(struct address_space *mapping,
details.last_index = ULONG_MAX;


- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}
EXPORT_SYMBOL(unmap_mapping_range);

diff --git a/mm/mmap.c b/mm/mmap.c
index ed11b43..136f4c8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -260,9 +260,9 @@ void unlink_file_vma(struct vm_area_struct *vma)

if (file) {
struct address_space *mapping = file->f_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
__remove_shared_vm_struct(vma, file, mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}
}

@@ -674,14 +674,14 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
}

__vma_link(mm, vma, prev, rb_link, rb_parent);
__vma_link_file(vma);

if (mapping)
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

mm->map_count++;
validate_mm(mm);
@@ -793,7 +793,7 @@ again: remove_next = 1 + (end > next->vm_end);
next->vm_end);
}

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (insert) {
/*
* Put into interval tree now, so instantiated pages
@@ -880,7 +880,7 @@ again: remove_next = 1 + (end > next->vm_end);
anon_vma_unlock_write(anon_vma);
}
if (mapping)
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

if (root) {
uprobe_mmap(vma);
@@ -3245,7 +3245,7 @@ static void vm_unlock_mapping(struct address_space *mapping)
* AS_MM_ALL_LOCKS can't change to 0 from under us
* because we hold the mm_all_locks_mutex.
*/
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
&mapping->flags))
BUG();
diff --git a/mm/mremap.c b/mm/mremap.c
index 1e35ba66..fa7c432 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -119,7 +119,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
if (need_rmap_locks) {
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
}
if (vma->anon_vma) {
anon_vma = vma->anon_vma;
@@ -156,7 +156,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
if (anon_vma)
anon_vma_unlock_read(anon_vma);
if (mapping)
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

#define LATENCY_LIMIT (64 * PAGE_SIZE)
diff --git a/mm/nommu.c b/mm/nommu.c
index bd10aa1..4201a38 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -722,11 +722,11 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
flush_dcache_mmap_lock(mapping);
vma_interval_tree_insert(vma, &mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/* add the VMA to the tree */
@@ -795,11 +795,11 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
flush_dcache_mmap_lock(mapping);
vma_interval_tree_remove(vma, &mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/* remove from the MM's tree and list */
@@ -2086,14 +2086,14 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;

down_write(&nommu_region_sem);
- mutex_lock(&inode->i_mapping->i_mmap_mutex);
+ i_mmap_lock_write(inode->i_mapping);

/* search for VMAs that fall within the dead zone */
vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
/* found one - only interested if it's shared out of the page
* cache */
if (vma->vm_flags & VM_SHARED) {
- mutex_unlock(&inode->i_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(inode->i_mapping);
up_write(&nommu_region_sem);
return -ETXTBSY; /* not quite true, but near enough */
}
@@ -2121,7 +2121,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
}
}

- mutex_unlock(&inode->i_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(inode->i_mapping);
up_write(&nommu_region_sem);
return 0;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index d3eb1e0..ae72965 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1688,7 +1688,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)

if (!mapping)
return ret;
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);

@@ -1711,7 +1711,7 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
ret = rwc->file_nonlinear(page, mapping, rwc->arg);

done:
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
return ret;
}

--
1.8.4.5

2014-10-30 19:44:45

by Davidlohr Bueso

Subject: [PATCH 04/10] mm/rmap: share the i_mmap_rwsem

Similarly to the anon memory counterpart, we can share
the mapping's lock ownership as the interval tree is
not modified when doing the walk; only the file page is.

Signed-off-by: Davidlohr Bueso <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/fs.h | 10 ++++++++++
mm/rmap.c | 6 +++---
2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 648a77e..552a9fc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -478,6 +478,16 @@ static inline void i_mmap_unlock_write(struct address_space *mapping)
up_write(&mapping->i_mmap_rwsem);
}

+static inline void i_mmap_lock_read(struct address_space *mapping)
+{
+ down_read(&mapping->i_mmap_rwsem);
+}
+
+static inline void i_mmap_unlock_read(struct address_space *mapping)
+{
+ up_read(&mapping->i_mmap_rwsem);
+}
+
/*
* Might pages of this file be mapped into userspace?
*/
diff --git a/mm/rmap.c b/mm/rmap.c
index e0c0e90..7ab830b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1688,7 +1688,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)

if (!mapping)
return ret;
- i_mmap_lock_write(mapping);
+
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);

@@ -1709,9 +1710,8 @@ static int rmap_walk_file(struct page *page, struct rmap_walk_control *rwc)
goto done;

ret = rwc->file_nonlinear(page, mapping, rwc->arg);
-
done:
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
return ret;
}

--
1.8.4.5

2014-11-03 08:17:45

by Peter Zijlstra

Subject: Re: [PATCH v2 -next 00/10] mm: improve usage of the i_mmap lock

On Thu, Oct 30, 2014 at 12:34:07PM -0700, Davidlohr Bueso wrote:
>
> Davidlohr Bueso (10):
> mm,fs: introduce helpers around the i_mmap_mutex
> mm: use new helper functions around the i_mmap_mutex
> mm: convert i_mmap_mutex to rwsem
> mm/rmap: share the i_mmap_rwsem
> uprobes: share the i_mmap_rwsem
> mm/xip: share the i_mmap_rwsem
> mm/memory-failure: share the i_mmap_rwsem
> mm/mremap: share the i_mmap_rwsem
> mm/nommu: share the i_mmap_rwsem
> mm/hugetlb: share the i_mmap_rwsem
>

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2014-11-04 06:04:37

by Hugh Dickins

Subject: Re: [PATCH 08/10] mm/mremap: share the i_mmap_rwsem

I'm glad to see this series back, and nicely presented: thank you.
Not worth respinning them, but consider 1,2,3,4,5,6,7 and 9 as
Acked-by: Hugh Dickins <[email protected]>

On Thu, 30 Oct 2014, Davidlohr Bueso wrote:

> As per the comment in move_ptes(), we only require taking the
> anon vma and i_mmap locks to ensure that rmap will always observe
> either the old or new ptes, in the case of need_rmap_lock=true.
> No modifications to the tree itself, thus share the i_mmap_rwsem.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>

But this one is Nacked by me. I don't understand how you and Kirill
could read Michel's painstaking comment on need_rmap_locks, then
go ahead and remove the exclusion of rmap_walk().

I agree the code here does not modify the interval tree, but the
comment explains how we're moving a pte from one place in the tree
to another, and in some cases there's a danger that the rmap walk
might miss the pte from both places (which doesn't matter much to
most of its uses, but is critical in page migration).

Or am I the one missing something?

Hugh

> ---
> mm/mremap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index c929324..09bd644 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -119,7 +119,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> if (need_rmap_locks) {
> if (vma->vm_file) {
> mapping = vma->vm_file->f_mapping;
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
> }
> if (vma->anon_vma) {
> anon_vma = vma->anon_vma;
> @@ -156,7 +156,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
> if (anon_vma)
> anon_vma_unlock_read(anon_vma);
> if (mapping)
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
> }
>
> #define LATENCY_LIMIT (64 * PAGE_SIZE)
> --
> 1.8.4.5

2014-11-04 06:35:11

by Hugh Dickins

Subject: Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

On Thu, 30 Oct 2014, Davidlohr Bueso wrote:

> The i_mmap_rwsem protects shared pages against races
> when doing the sharing and unsharing, ultimately
> calling huge_pmd_share/unshare() for PMD pages --
> it also needs it to avoid races when populating the pud
> for pmd allocation when looking for a shareable pmd page
> for hugetlb. Ultimately the interval tree remains intact.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>

I'm uncomfortable with this one: I'm certainly not prepared to Ack it;
but that could easily be that I'm just not thinking hard enough - I'd
rather leave the heavy thinking to someone else!

The fs/hugetlbfs/inode.c part of it should be okay, but the rest is
iffy. It gets into huge page table sharing territory, which is very
tricky and surprising territory indeed (take a look at my
__unmap_hugepage_range_final() comment, for one example).

You're right that the interval tree remains intact, but I've a feeling
we end up using i_mmap_mutex for more exclusion than just that (rather
like how huge_memory.c finds anon_vma lock useful for other exclusions).

I think Mel (already Cc'ed) and Michal (adding him) both have past
experience with the shared page table (as do I, but I'm in denial).

I wonder if the huge shared page table would be a good next target
for Kirill's removal of mm nastiness. (Removing it wouldn't hurt
Google for one: we have it "#if 0"ed out, though I forget why at
this moment.)

But, returning to the fs/hugetlbfs/inode.c part of it, that reminds
me: you're missing one patch from the series, aren't you? Why no
i_mmap_lock_read() in mm/memory.c unmap_mapping_range()? I doubt
it will add much useful parallelism, but it would be correct.

Hugh

> ---
> fs/hugetlbfs/inode.c | 4 ++--
> mm/hugetlb.c | 12 ++++++------
> mm/memory.c | 4 ++--
> 3 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 5eba47f..0dca54d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -412,10 +412,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
> pgoff = offset >> PAGE_SHIFT;
>
> i_size_write(inode, offset);
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap))
> hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
> truncate_hugepages(inode, offset);
> return 0;
> }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2071cf4..80349f2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2775,7 +2775,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> * this mapping should be shared between all the VMAs,
> * __unmap_hugepage_range() is called as the lock is already held
> */
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
> vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
> /* Do not unmap the current VMA */
> if (iter_vma == vma)
> @@ -2792,7 +2792,7 @@ static void unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
> unmap_hugepage_range(iter_vma, address,
> address + huge_page_size(h), page);
> }
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
> }
>
> /*
> @@ -3350,7 +3350,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> flush_cache_range(vma, address, end);
>
> mmu_notifier_invalidate_range_start(mm, start, end);
> - i_mmap_lock_write(vma->vm_file->f_mapping);
> + i_mmap_lock_read(vma->vm_file->f_mapping);
> for (; address < end; address += huge_page_size(h)) {
> spinlock_t *ptl;
> ptep = huge_pte_offset(mm, address);
> @@ -3379,7 +3379,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
> */
> flush_tlb_range(vma, start, end);
> mmu_notifier_invalidate_range(mm, start, end);
> - i_mmap_unlock_write(vma->vm_file->f_mapping);
> + i_mmap_unlock_read(vma->vm_file->f_mapping);
> mmu_notifier_invalidate_range_end(mm, start, end);
>
> return pages << h->order;
> @@ -3547,7 +3547,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> if (!vma_shareable(vma, addr))
> return (pte_t *)pmd_alloc(mm, pud, addr);
>
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
> vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
> if (svma == vma)
> continue;
> @@ -3575,7 +3575,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> spin_unlock(ptl);
> out:
> pte = (pte_t *)pmd_alloc(mm, pud, addr);
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
> return pte;
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 22c3089..2ca3105 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1345,9 +1345,9 @@ static void unmap_single_vma(struct mmu_gather *tlb,
> * safe to do nothing in this case.
> */
> if (vma->vm_file) {
> - i_mmap_lock_write(vma->vm_file->f_mapping);
> + i_mmap_lock_read(vma->vm_file->f_mapping);
> __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
> - i_mmap_unlock_write(vma->vm_file->f_mapping);
> + i_mmap_unlock_read(vma->vm_file->f_mapping);
> }
> } else
> unmap_page_range(tlb, vma, start, end, details);
> --
> 1.8.4.5

2014-11-04 12:31:51

by Kirill A. Shutemov

Subject: Re: [PATCH 08/10] mm/mremap: share the i_mmap_rwsem

On Mon, Nov 03, 2014 at 10:04:24PM -0800, Hugh Dickins wrote:
> I'm glad to see this series back, and nicely presented: thank you.
> Not worth respinning them, but consider 1,2,3,4,5,6,7 and 9 as
> Acked-by: Hugh Dickins <[email protected]>
>
> On Thu, 30 Oct 2014, Davidlohr Bueso wrote:
>
> > As per the comment in move_ptes(), we only require taking the
> > anon vma and i_mmap locks to ensure that rmap will always observe
> > either the old or new ptes, in the case of need_rmap_lock=true.
> > No modifications to the tree itself, thus share the i_mmap_rwsem.
> >
> > Signed-off-by: Davidlohr Bueso <[email protected]>
> > Acked-by: Kirill A. Shutemov <[email protected]>
>
> But this one is Nacked by me. I don't understand how you and Kirill
> could read Michel's painstaking comment on need_rmap_locks, then go
> go ahead and remove the exclusion of rmap_walk().
>
> I agree the code here does not modify the interval tree, but the
> comment explains how we're moving a pte from one place in the tree
> to another, and in some cases there's a danger that the rmap walk
> might miss the pte from both places (which doesn't matter much to
> most of its uses, but is critical in page migration).
>
> Or am I the one missing something?

You're completely right.

I've seen the comment (and I've added the missing need_rmap_locks case for
move_huge_pmd() before). What happened is I over-extrapolated my
experience of the rmap walk in the split_huge_page() case, which takes the
exclusive anon_vma lock, to the rest of the rmap use-cases. This of course
was hugely wrong.

I'm ashamed and feel really bad about the situation. Sorry.

--
Kirill A. Shutemov

2014-11-05 01:00:01

by Davidlohr Bueso

Subject: Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

On Mon, 2014-11-03 at 22:35 -0800, Hugh Dickins wrote:
> On Thu, 30 Oct 2014, Davidlohr Bueso wrote:
>
> > The i_mmap_rwsem protects shared pages against races
> > when doing the sharing and unsharing, ultimately
> > calling huge_pmd_share/unshare() for PMD pages --
> > it also needs it to avoid races when populating the pud
> > for pmd allocation when looking for a shareable pmd page
> > for hugetlb. Ultimately the interval tree remains intact.
> >
> > Signed-off-by: Davidlohr Bueso <[email protected]>
> > Acked-by: Kirill A. Shutemov <[email protected]>
>
> I'm uncomfortable with this one: I'm certainly not prepared to Ack it;
> but that could easily be that I'm just not thinking hard enough - I'd
> rather leave the heavy thinking to someone else!
>
> The fs/hugetlbfs/inode.c part of it should be okay, but the rest is
> iffy. It gets into huge page table sharing territory, which is very
> tricky and surprising territory indeed (take a look at my
> __unmap_hugepage_range_final() comment, for one example).
>
> You're right that the interval tree remains intact, but I've a feeling
> we end up using i_mmap_mutex for more exclusion than just that (rather
> like how huge_memory.c finds anon_vma lock useful for other exclusions).

Yeah, that certainly wouldn't surprise me, and this particular patch was
the one I was most unsure about for that exact same reason. Hopefully
others could confirm if this is truly doable and safe.

> I think Mel (already Cc'ed) and Michal (adding him) both have past
> experience with the shared page table (as do I, but I'm in denial).
>
> I wonder if the huge shared page table would be a good next target
> for Kirill's removal of mm nastiness. (Removing it wouldn't hurt
> Google for one: we have it "#if 0"ed out, though I forget why at
> this moment.)
>
> But, returning to the fs/hugetlbfs/inode.c part of it, that reminds
> me: you're missing one patch from the series, aren't you? Why no
> i_mmap_lock_read() in mm/memory.c unmap_mapping_range()? I doubt
> it will add much useful parallelism, but it would be correct.

Oh yes, not sure why I didn't update that function; I had marked it
safe to share the lock. Thanks for taking a close look at the series.

8<------------------------------------------------
From: Davidlohr Bueso <[email protected]>
Subject: [PATCH 11/10] mm/memory.c: share the i_mmap_rwsem

The unmap_mapping_range family of functions do the unmapping
of user pages (ultimately via zap_page_range_single) without
touching the actual interval tree, thus share the lock.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
mm/memory.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2ca3105..06f2458 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2396,12 +2396,12 @@ void unmap_mapping_range(struct address_space *mapping,
details.last_index = ULONG_MAX;


- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
}
EXPORT_SYMBOL(unmap_mapping_range);

--
1.8.4.5


2014-11-05 16:04:21

by Hugh Dickins

Subject: Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

On Tue, 4 Nov 2014, Davidlohr Bueso wrote:
> 8<------------------------------------------------
> From: Davidlohr Bueso <[email protected]>
> Subject: [PATCH 11/10] mm/memory.c: share the i_mmap_rwsem
>
> The unmap_mapping_range family of functions do the unmapping
> of user pages (ultimately via zap_page_range_single) without
> touching the actual interval tree, thus share the lock.
>
> Signed-off-by: Davidlohr Bueso <[email protected]>

Acked-by: Hugh Dickins <[email protected]>

Yes, thanks, let's get this 11/10 into mmotm along with the rest,
but put the hugetlb 10/10 on the shelf for now, until we've had
time to contemplate it more deeply.

> ---
> mm/memory.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2ca3105..06f2458 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2396,12 +2396,12 @@ void unmap_mapping_range(struct address_space *mapping,
> details.last_index = ULONG_MAX;
>
>
> - i_mmap_lock_write(mapping);
> + i_mmap_lock_read(mapping);
> if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> unmap_mapping_range_tree(&mapping->i_mmap, &details);
> if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
> unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
> - i_mmap_unlock_write(mapping);
> + i_mmap_unlock_read(mapping);
> }
> EXPORT_SYMBOL(unmap_mapping_range);
>
> --
> 1.8.4.5

2014-11-10 16:22:53

by Mel Gorman

Subject: Re: [PATCH 10/10] mm/hugetlb: share the i_mmap_rwsem

On Mon, Nov 03, 2014 at 10:35:04PM -0800, Hugh Dickins wrote:
> On Thu, 30 Oct 2014, Davidlohr Bueso wrote:
>
> > The i_mmap_rwsem protects shared pages against races
> > when doing the sharing and unsharing, ultimately
> > calling huge_pmd_share/unshare() for PMD pages --
> > it also needs it to avoid races when populating the pud
> > for pmd allocation when looking for a shareable pmd page
> > for hugetlb. Ultimately the interval tree remains intact.
> >
> > Signed-off-by: Davidlohr Bueso <[email protected]>
> > Acked-by: Kirill A. Shutemov <[email protected]>
>
> I'm uncomfortable with this one: I'm certainly not prepared to Ack it;
> but that could easily be that I'm just not thinking hard enough - I'd
> rather leave the heavy thinking to someone else!
>
> The fs/hugetlbfs/inode.c part of it should be okay, but the rest is
> iffy. It gets into huge page table sharing territory, which is very
> tricky and surprising territory indeed (take a look at my
> __unmap_hugepage_range_final() comment, for one example).
>
> You're right that the interval tree remains intact, but I've a feeling
> we end up using i_mmap_mutex for more exclusion than just that (rather
> like how huge_memory.c finds anon_vma lock useful for other exclusions).
>
> I think Mel (already Cc'ed) and Michal (adding him) both have past
> experience with the shared page table (as do I, but I'm in denial).
>

I dealt with it far in the past when it was still buried under arch/x86
and it was a whole pile of no fun. In this case I think there is little or
no value in trying to convert the lock for page table sharing. The benefit
is marginal (database initialisation maybe) while the potential for
surprises is high.

The __unmap_hugepage_range_final() concern is valid. If this is converted to
read then I am fairly sure that the bug fixed by commit d833352a4338 ("mm:
hugetlbfs: close race during teardown of hugetlbfs shared page tables")
gets reintroduced. We also potentially see races between huge_pmd_unshare
ref counting and huge_pmd_share as huge_pmd_unshare does a race-prone
check on the refcount if it's not serialised by i_mmap_lock_write. On a race,
it will leak pages, which will be hard to detect.

Considering the upside of this particular conversion, I don't think it's
worth the loss of hair or will to live to try fix it up.

> I wonder if the huge shared page table would be a good next target
> for Kirill's removal of mm nastiness. (Removing it wouldn't hurt
> Google for one: we have it "#if 0"ed out, though I forget why at
> this moment.)
>

I think the only benefit was reducing TLB pressure on databases with
very large shared memory before 1G pages existed and when 2M TLB entries
were a very limited resource. I doubt it's been quantified on anything
resembling recent hardware. If it did get killed though, it would need a
spin through a database test that used the particular database software
that benefitted.

--
Mel Gorman
SUSE Labs

2014-11-10 16:28:32

by Mel Gorman

Subject: Re: [PATCH v2 -next 00/10] mm: improve usage of the i_mmap lock

On Thu, Oct 30, 2014 at 12:34:07PM -0700, Davidlohr Bueso wrote:
> Changes from v1:
> o Collected Acks from Kirill and Srikar.
> o Updated to apply on top of linux-next.
>

FWIW, I looked at all the patches when looking at the hugetlbfs sharing
part, and while it was not a very detailed review, I also did not see
anything bad. So for patches 1-7, 9 and 11:

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs