2013-06-25 00:21:49

by Davidlohr Bueso

Subject: [PATCH 0/5] mm: i_mmap_mutex to rwsem

This patchset extends the work started by Ingo Molnar in late 2012,
optimizing the anon-vma mutex lock, converting it from an exclusive mutex
to an rwsem, and sharing the lock for read-only paths when walking the
vma interval tree. More specifically, commits 5a505085 and 4fc3f1d6.

The i_mmap mutex has similar responsibilities to the anon-vma lock,
protecting file-backed pages. Therefore we can use similar locking
techniques: convert the mutex to an rwsem and share the lock when possible.
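
Schematically, the resulting pattern looks like this (a minimal sketch
using the stock rwsem API, not code from the patches; example_rwsem
stands in for i_mmap_rwsem):

        #include <linux/rwsem.h>

        static DECLARE_RWSEM(example_rwsem);    /* stands in for i_mmap_rwsem */

        static void writer_side(void)
        {
                down_write(&example_rwsem);     /* tree/list modifications stay exclusive */
                /* ... insert/remove vmas in the interval tree ... */
                up_write(&example_rwsem);
        }

        static void reader_side(void)
        {
                down_read(&example_rwsem);      /* read-only walks may now run concurrently */
                /* ... e.g. a vma_interval_tree_foreach() walk ... */
                up_read(&example_rwsem);
        }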

With these changes, and the rwsem optimizations discussed in
http://lkml.org/lkml/2013/6/16/38, we see performance improvements.
For instance, on an 8-socket, 80-core DL980, compared to a vanilla 3.10-rc5,
aim7 gains throughput in the following workloads (beyond 500 users):

- alltests (+14.5%)
- custom (+17%)
- disk (+11%)
- high_systime (+5%)
- shared (+15%)
- short (+4%)

For lower numbers of users there are no significant differences, as all
results are within the 0-2% noise range.

Davidlohr Bueso (5):
mm,fs: introduce helpers around i_mmap_mutex
mm: use new helper functions around the i_mmap_mutex
mm: convert i_mmap_mutex to rwsem
mm/rmap: share the i_mmap_rwsem
mm: rename leftover i_mmap_mutex

Documentation/lockstat.txt | 2 +-
Documentation/vm/locking | 2 +-
arch/x86/mm/hugetlbpage.c | 6 +++---
fs/hugetlbfs/inode.c | 4 ++--
fs/inode.c | 2 +-
include/linux/fs.h | 22 +++++++++++++++++++++-
include/linux/mmu_notifier.h | 2 +-
kernel/events/uprobes.c | 6 +++---
kernel/fork.c | 4 ++--
mm/filemap.c | 10 +++++-----
mm/filemap_xip.c | 4 ++--
mm/fremap.c | 4 ++--
mm/hugetlb.c | 16 ++++++++--------
mm/memory-failure.c | 7 +++----
mm/memory.c | 8 ++++----
mm/mmap.c | 22 +++++++++++-----------
mm/mremap.c | 6 +++---
mm/nommu.c | 14 +++++++-------
mm/rmap.c | 24 ++++++++++++------------
19 files changed, 92 insertions(+), 73 deletions(-)

--
1.7.11.7


2013-06-25 00:22:21

by Davidlohr Bueso

Subject: [PATCH 5/5] mm: rename leftover i_mmap_mutex

Update the remaining references to the lock to i_mmap_rwsem throughout
the kernel. All changes are in comments and documentation.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
Documentation/lockstat.txt | 2 +-
Documentation/vm/locking | 2 +-
arch/x86/mm/hugetlbpage.c | 2 +-
include/linux/mmu_notifier.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/filemap.c | 10 +++++-----
mm/hugetlb.c | 8 ++++----
mm/mmap.c | 6 +++---
mm/mremap.c | 2 +-
mm/rmap.c | 8 ++++----
10 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/Documentation/lockstat.txt b/Documentation/lockstat.txt
index dd2f7b2..96b8233 100644
--- a/Documentation/lockstat.txt
+++ b/Documentation/lockstat.txt
@@ -168,7 +168,7 @@ View the top contending locks:
dcache_lock: 1037 1161 0.38 45.32 774.51 6611 243371 0.15 306.48 77387.24
&inode->i_mutex: 161 286 18446744073709 62882.54 1244614.55 3653 20598 18446744073709 62318.60 1693822.74
&zone->lru_lock: 94 94 0.53 7.33 92.10 4366 32690 0.29 59.81 16350.06
- &inode->i_data.i_mmap_mutex: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44
+ &inode->i_data.i_mmap_rwsem: 79 79 0.40 3.77 53.03 11779 87755 0.28 116.93 29898.44
&q->__queue_lock: 48 50 0.52 31.62 86.31 774 13131 0.17 113.08 12277.52
&rq->rq_lock_key: 43 47 0.74 68.50 170.63 3706 33929 0.22 107.99 17460.62
&rq->rq_lock_key#2: 39 46 0.75 6.68 49.03 2979 32292 0.17 125.17 17137.63
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
index f61228b..fb64028 100644
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is modified by
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.

-The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
+The page_table_lock nests with the inode i_mmap_rwsem and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 9c61a1e..df68d13 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -60,7 +60,7 @@ static int vma_shareable(struct vm_area_struct *vma, unsigned long addr)
* and returns the corresponding pte. While this is not necessary for the
* !shared pmd case because we can allocate the pmd later as well, it makes the
* code much cleaner. pmd allocation is essential for the shared case because
- * pud has to be populated inside the same i_mmap_mutex section - otherwise
+ * pud has to be populated inside the same i_mmap_rwsem section - otherwise
* racing tasks could either miss the sharing (see huge_pte_offset) or select a
* bad pmd for sharing.
*/
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index deca874..f9c11ab 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -151,7 +151,7 @@ struct mmu_notifier_ops {
* Therefore notifier chains can only be traversed when either
*
* 1. mmap_sem is held.
- * 2. One of the reverse map locks is held (i_mmap_mutex or anon_vma->rwsem).
+ * 2. One of the reverse map locks is held (i_mmap_rwsem or anon_vma->rwsem).
* 3. No other concurrent thread can access the list (release)
*/
struct mmu_notifier {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c7b9f45..4ca146e 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -700,7 +700,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)

if (!prev && !more) {
/*
- * Needs GFP_NOWAIT to avoid i_mmap_mutex recursion through
+ * Needs GFP_NOWAIT to avoid i_mmap_rwsem recursion through
* reclaim. This is optimistic, no harm done if it fails.
*/
prev = kmalloc(sizeof(struct map_info),
diff --git a/mm/filemap.c b/mm/filemap.c
index 7905fe7..5d3ae93 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -60,16 +60,16 @@
/*
* Lock ordering:
*
- * ->i_mmap_mutex (truncate_pagecache)
+ * ->i_mmap_rwsem (truncate_pagecache)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->swap_lock (exclusive_swap_page, others)
* ->mapping->tree_lock
*
* ->i_mutex
- * ->i_mmap_mutex (truncate->unmap_mapping_range)
+ * ->i_mmap_rwsem (truncate->unmap_mapping_range)
*
* ->mmap_sem
- * ->i_mmap_mutex
+ * ->i_mmap_rwsem
* ->page_table_lock or pte_lock (various, mainly in memory.c)
* ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
*
@@ -83,7 +83,7 @@
* sb_lock (fs/fs-writeback.c)
* ->mapping->tree_lock (__sync_single_inode)
*
- * ->i_mmap_mutex
+ * ->i_mmap_rwsem
* ->anon_vma.lock (vma_adjust)
*
* ->anon_vma.lock
@@ -103,7 +103,7 @@
* ->inode->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
*
- * ->i_mmap_mutex
+ * ->i_mmap_rwsem
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1becd0a..041f58f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2458,9 +2458,9 @@ void __unmap_hugepage_range_final(struct mmu_gather *tlb,
* on its way out. We're lucky that the flag has such an appropriate
* name, and can in fact be safely cleared here. We could clear it
* before the __unmap_hugepage_range above, but all that's necessary
- * is to clear it before releasing the i_mmap_mutex. This works
+ * is to clear it before releasing the i_mmap_rwsem. This works
* because in the context this is called, the VMA is about to be
- * destroyed and the i_mmap_mutex is held.
+ * destroyed and the i_mmap_rwsem is held.
*/
vma->vm_flags &= ~VM_MAYSHARE;
}
@@ -3067,9 +3067,9 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
}
spin_unlock(&mm->page_table_lock);
/*
- * Must flush TLB before releasing i_mmap_mutex: x86's huge_pmd_unshare
+ * Must flush TLB before releasing i_mmap_rwsem: x86's huge_pmd_unshare
* may have cleared our pud entry and done put_page on the page table:
- * once we release i_mmap_mutex, another task can do the final put_page
+ * once we release i_mmap_rwsem, another task can do the final put_page
* and that page table be reused and filled with junk.
*/
flush_tlb_range(vma, start, end);
diff --git a/mm/mmap.c b/mm/mmap.c
index b4e142a..b4ca52a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -205,7 +205,7 @@ error:
}

/*
- * Requires inode->i_mapping->i_mmap_mutex
+ * Requires inode->i_mapping->i_mmap_rwsem
*/
static void __remove_shared_vm_struct(struct vm_area_struct *vma,
struct file *file, struct address_space *mapping)
@@ -2759,7 +2759,7 @@ void exit_mmap(struct mm_struct *mm)

/* Insert vm structure into process list sorted by address
* and into the inode's i_mmap tree. If vm_file is non-NULL
- * then i_mmap_mutex is taken here.
+ * then i_mmap_rwsem is taken here.
*/
int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
@@ -3043,7 +3043,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
* vma in this mm is backed by the same anon_vma or address_space.
*
* We can take all the locks in random order because the VM code
- * taking i_mmap_mutex or anon_vma->rwsem outside the mmap_sem never
+ * taking i_mmap_rwsem or anon_vma->rwsem outside the mmap_sem never
* takes more than one of them in a row. Secondly we're protected
* against a concurrent mm_take_all_locks() by the mm_all_locks_mutex.
*
diff --git a/mm/mremap.c b/mm/mremap.c
index 02fc5df..742fe28 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -81,7 +81,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spinlock_t *old_ptl, *new_ptl;

/*
- * When need_rmap_locks is true, we take the i_mmap_mutex and anon_vma
+ * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
* locks to ensure that rmap will always observe either the old or the
* new ptes. This is the easiest way to avoid races with
* truncate_pagecache(), page migration, etc...
diff --git a/mm/rmap.c b/mm/rmap.c
index 98b986d..1bfde51 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -23,7 +23,7 @@
* inode->i_mutex (while writing or truncating, not reading or faulting)
* mm->mmap_sem
* page->flags PG_locked (lock_page)
- * mapping->i_mmap_mutex
+ * mapping->i_mmap_rwsem
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
* zone->lru_lock (in mark_page_accessed, isolate_lru_page)
@@ -804,14 +804,14 @@ static int page_referenced_file(struct page *page,
* The page lock not only makes sure that page->mapping cannot
* suddenly be NULLified by truncation, it makes sure that the
* structure at mapping cannot be freed and reused yet,
- * so we can safely take mapping->i_mmap_mutex.
+ * so we can safely take mapping->i_mmap_rwsem.
*/
BUG_ON(!PageLocked(page));

i_mmap_lock_read(mapping);

/*
- * i_mmap_mutex does not stabilize mapcount at all, but mapcount
+ * i_mmap_rwsem does not stabilize mapcount at all, but mapcount
* is more likely to be accurate if we note it after spinning.
*/
mapcount = page_mapcount(page);
@@ -1291,7 +1291,7 @@ out_mlock:
/*
* We need mmap_sem locking, Otherwise VM_LOCKED check makes
* unstable result and race. Plus, We can't wait here because
- * we now hold anon_vma->rwsem or mapping->i_mmap_mutex.
+ * we now hold anon_vma->rwsem or mapping->i_mmap_rwsem.
* if trylock failed, the page remain in evictable lru and later
* vmscan could retry to move the page to unevictable lru if the
* page is actually mlocked.
--
1.7.11.7

2013-06-25 00:22:19

by Davidlohr Bueso

Subject: [PATCH 4/5] mm/rmap: share the i_mmap_rwsem

Similar to commit 4fc3f1d6, which optimized the anon-vma rwsem, we can share
the i_mmap_rwsem among multiple readers for rmap_walk_file(),
try_to_unmap_file() and collect_procs_file().
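
The functions converted here only traverse the interval tree and never
modify it, so the shared (read) variant of the lock suffices.
Schematically, the reader side becomes (a sketch mirroring the hunks
below):

        i_mmap_lock_read(mapping);      /* several walkers may hold this at once */
        vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
                /* ... per-vma work; no tree or list modification ... */
        }
        i_mmap_unlock_read(mapping);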

With this change, and the rwsem optimizations discussed in
http://lkml.org/lkml/2013/6/16/38, we see performance improvements.
On an 8-socket, 80-core DL980, compared to a vanilla 3.10-rc5, aim7
gains throughput in the following workloads (beyond 500 users):

- alltests (+14.5%)
- custom (+17%)
- disk (+11%)
- high_systime (+5%)
- shared (+15%)
- short (+4%)

For lower numbers of users there are no significant differences, as all
results are within the 0-2% noise range.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/fs.h | 10 ++++++++++
mm/memory-failure.c | 7 +++----
mm/rmap.c | 12 ++++++------
3 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 79b8548..5646641 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -485,6 +485,16 @@ static inline void i_mmap_unlock_write(struct address_space *mapping)
up_write(&mapping->i_mmap_rwsem);
}

+static inline void i_mmap_lock_read(struct address_space *mapping)
+{
+ down_read(&mapping->i_mmap_rwsem);
+}
+
+static inline void i_mmap_unlock_read(struct address_space *mapping)
+{
+ up_read(&mapping->i_mmap_rwsem);
+}
+
/*
* Might pages of this file be mapped into userspace?
*/
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e7e0f90..6db44eb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -436,7 +436,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct task_struct *tsk;
struct address_space *mapping = page->mapping;

- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -444,8 +444,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
if (!task_early_kill(tsk))
continue;

- vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff,
- pgoff) {
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
/*
* Send early kill signal to tasks where a vma covers
* the page but the corrupted page is not necessarily
@@ -458,7 +457,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
}
}
read_unlock(&tasklist_lock);
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
}

/*
diff --git a/mm/rmap.c b/mm/rmap.c
index bc8eeb5..98b986d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -808,7 +808,7 @@ static int page_referenced_file(struct page *page,
*/
BUG_ON(!PageLocked(page));

- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);

/*
* i_mmap_mutex does not stabilize mapcount at all, but mapcount
@@ -831,7 +831,7 @@ static int page_referenced_file(struct page *page,
break;
}

- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
return referenced;
}

@@ -1516,7 +1516,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
if (PageHuge(page))
pgoff = page->index << compound_order(page);

- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
ret = try_to_unmap_one(page, vma, address, flags);
@@ -1594,7 +1594,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.nonlinear)
vma->vm_private_data = NULL;
out:
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
return ret;
}

@@ -1711,7 +1711,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *,

if (!mapping)
return ret;
- i_mmap_lock_write(mapping);
+ i_mmap_lock_read(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
ret = rmap_one(page, vma, address, arg);
@@ -1723,7 +1723,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *,
* never contain migration ptes. Decide what to do about this
* limitation to linear when we need rmap_walk() on nonlinear.
*/
- i_mmap_unlock_write(mapping);
+ i_mmap_unlock_read(mapping);
return ret;
}

--
1.7.11.7

2013-06-25 00:22:18

by Davidlohr Bueso

Subject: [PATCH 3/5] mm: convert i_mmap_mutex to rwsem

This conversion is straightforward. All users take the write
lock, so there is really not much difference from the previous
mutex.
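
For readers following the diff, the API correspondence is one-to-one
(a summary for reference, not part of the patch):

        /*
         * mutex API                   rwsem equivalent (all writers)
         * mutex_init()            ->  init_rwsem()
         * mutex_lock()            ->  down_write()
         * mutex_unlock()          ->  up_write()
         * mutex_lock_nest_lock()  ->  down_write_nest_lock()
         */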

Signed-off-by: Davidlohr Bueso <[email protected]>
---
fs/inode.c | 2 +-
include/linux/fs.h | 6 +++---
mm/mmap.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 00d5fc3..af5f0ea 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -345,7 +345,7 @@ void address_space_init_once(struct address_space *mapping)
memset(mapping, 0, sizeof(*mapping));
INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC);
spin_lock_init(&mapping->tree_lock);
- mutex_init(&mapping->i_mmap_mutex);
+ init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
spin_lock_init(&mapping->private_lock);
mapping->i_mmap = RB_ROOT;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1ea6c68..79b8548 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -410,7 +410,7 @@ struct address_space {
unsigned int i_mmap_writable;/* count VM_SHARED mappings */
struct rb_root i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
- struct mutex i_mmap_mutex; /* protect tree, count, list */
+ struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
@@ -477,12 +477,12 @@ int mapping_tagged(struct address_space *mapping, int tag);

static inline void i_mmap_lock_write(struct address_space *mapping)
{
- mutex_lock(&mapping->i_mmap_mutex);
+ down_write(&mapping->i_mmap_rwsem);
}

static inline void i_mmap_unlock_write(struct address_space *mapping)
{
- mutex_unlock(&mapping->i_mmap_mutex);
+ up_write(&mapping->i_mmap_rwsem);
}

/*
diff --git a/mm/mmap.c b/mm/mmap.c
index 01a9876..b4e142a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3016,7 +3016,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping)
*/
if (test_and_set_bit(AS_MM_ALL_LOCKS, &mapping->flags))
BUG();
- mutex_lock_nest_lock(&mapping->i_mmap_mutex, &mm->mmap_sem);
+ down_write_nest_lock(&mapping->i_mmap_rwsem, &mm->mmap_sem);
}
}

--
1.7.11.7

2013-06-25 00:22:16

by Davidlohr Bueso

Subject: [PATCH 2/5] mm: use new helper functions around the i_mmap_mutex

Convert all open-coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.

Signed-off-by: Davidlohr Bueso <[email protected]>
---
arch/x86/mm/hugetlbpage.c | 4 ++--
fs/hugetlbfs/inode.c | 4 ++--
kernel/events/uprobes.c | 4 ++--
kernel/fork.c | 4 ++--
mm/filemap_xip.c | 4 ++--
mm/fremap.c | 4 ++--
mm/hugetlb.c | 8 ++++----
mm/memory-failure.c | 4 ++--
mm/memory.c | 8 ++++----
mm/mmap.c | 14 +++++++-------
mm/mremap.c | 4 ++--
mm/nommu.c | 14 +++++++-------
mm/rmap.c | 16 ++++++++--------
13 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index ae1aa71..9c61a1e 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -79,7 +79,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
if (!vma_shareable(vma, addr))
return (pte_t *)pmd_alloc(mm, pud, addr);

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -105,7 +105,7 @@ huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
spin_unlock(&mm->page_table_lock);
out:
pte = (pte_t *)pmd_alloc(mm, pud, addr);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
return pte;
}

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..35eebfc 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -404,10 +404,10 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
pgoff = offset >> PAGE_SHIFT;

i_size_write(inode, offset);
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (!RB_EMPTY_ROOT(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
truncate_hugepages(inode, offset);
return 0;
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f356974..c7b9f45 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -693,7 +693,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
int more = 0;

again:
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
if (!valid_vma(vma, is_register))
continue;
@@ -724,7 +724,7 @@ build_map_info(struct address_space *mapping, loff_t offset, bool is_register)
info->mm = vma->vm_mm;
info->vaddr = offset_to_vaddr(vma, offset);
}
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

if (!more)
goto out;
diff --git a/kernel/fork.c b/kernel/fork.c
index 987b28a..13226f1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -420,7 +420,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
get_file(file);
if (tmp->vm_flags & VM_DENYWRITE)
atomic_dec(&inode->i_writecount);
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (tmp->vm_flags & VM_SHARED)
mapping->i_mmap_writable++;
flush_dcache_mmap_lock(mapping);
@@ -432,7 +432,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
vma_interval_tree_insert_after(tmp, mpnt,
&mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/*
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 28fe26b..c851586 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -182,7 +182,7 @@ __xip_unmap (struct address_space * mapping,
return;

retry:
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
mm = vma->vm_mm;
address = vma->vm_start +
@@ -202,7 +202,7 @@ retry:
page_cache_release(page);
}
}
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

if (locked) {
mutex_unlock(&xip_sparse_mutex);
diff --git a/mm/fremap.c b/mm/fremap.c
index 87da359..fa49f3d 100644
--- a/mm/fremap.c
+++ b/mm/fremap.c
@@ -215,13 +215,13 @@ get_write_lock:
}
goto out;
}
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
flush_dcache_mmap_lock(mapping);
vma->vm_flags |= VM_NONLINEAR;
vma_interval_tree_remove(vma, &mapping->i_mmap);
vma_nonlinear_insert(vma, &mapping->i_mmap_nonlinear);
flush_dcache_mmap_unlock(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

if (vma->vm_flags & VM_LOCKED) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e2bfbf7..1becd0a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2506,7 +2506,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
* this mapping should be shared between all the VMAs,
* __unmap_hugepage_range() is called as the lock is already held
*/
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(iter_vma, &mapping->i_mmap, pgoff, pgoff) {
/* Do not unmap the current VMA */
if (iter_vma == vma)
@@ -2523,7 +2523,7 @@ static int unmap_ref_private(struct mm_struct *mm, struct vm_area_struct *vma,
unmap_hugepage_range(iter_vma, address,
address + huge_page_size(h), page);
}
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

return 1;
}
@@ -3047,7 +3047,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
BUG_ON(address >= end);
flush_cache_range(vma, address, end);

- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
spin_lock(&mm->page_table_lock);
for (; address < end; address += huge_page_size(h)) {
ptep = huge_pte_offset(mm, address);
@@ -3073,7 +3073,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
* and that page table be reused and filled with junk.
*/
flush_tlb_range(vma, start, end);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);

return pages << h->order;
}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ceb0c7f..e7e0f90 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -436,7 +436,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
struct task_struct *tsk;
struct address_space *mapping = page->mapping;

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
@@ -458,7 +458,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
}
}
read_unlock(&tasklist_lock);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/*
diff --git a/mm/memory.c b/mm/memory.c
index 61a262b..344b4cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1340,9 +1340,9 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* safe to do nothing in this case.
*/
if (vma->vm_file) {
- mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_lock_write(vma->vm_file->f_mapping);
__unmap_hugepage_range_final(tlb, vma, start, end, NULL);
- mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
}
} else
unmap_page_range(tlb, vma, start, end, details);
@@ -2974,12 +2974,12 @@ void unmap_mapping_range(struct address_space *mapping,
details.last_index = ULONG_MAX;


- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
unmap_mapping_range_tree(&mapping->i_mmap, &details);
if (unlikely(!list_empty(&mapping->i_mmap_nonlinear)))
unmap_mapping_range_list(&mapping->i_mmap_nonlinear, &details);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}
EXPORT_SYMBOL(unmap_mapping_range);

diff --git a/mm/mmap.c b/mm/mmap.c
index f681e18..01a9876 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -233,9 +233,9 @@ void unlink_file_vma(struct vm_area_struct *vma)

if (file) {
struct address_space *mapping = file->f_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
__remove_shared_vm_struct(vma, file, mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}
}

@@ -644,13 +644,13 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
mapping = vma->vm_file->f_mapping;

if (mapping)
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);

__vma_link(mm, vma, prev, rb_link, rb_parent);
__vma_link_file(vma);

if (mapping)
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

mm->map_count++;
validate_mm(mm);
@@ -761,7 +761,7 @@ again: remove_next = 1 + (end > next->vm_end);
next->vm_end);
}

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
if (insert) {
/*
* Put into interval tree now, so instantiated pages
@@ -848,7 +848,7 @@ again: remove_next = 1 + (end > next->vm_end);
anon_vma_unlock_write(anon_vma);
}
if (mapping)
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);

if (root) {
uprobe_mmap(vma);
@@ -3112,7 +3112,7 @@ static void vm_unlock_mapping(struct address_space *mapping)
* AS_MM_ALL_LOCKS can't change to 0 from under us
* because we hold the mm_all_locks_mutex.
*/
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
if (!test_and_clear_bit(AS_MM_ALL_LOCKS,
&mapping->flags))
BUG();
diff --git a/mm/mremap.c b/mm/mremap.c
index 463a257..02fc5df 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -101,7 +101,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
if (need_rmap_locks) {
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
}
if (vma->anon_vma) {
anon_vma = vma->anon_vma;
@@ -137,7 +137,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
if (anon_vma)
anon_vma_unlock_write(anon_vma);
if (mapping)
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

#define LATENCY_LIMIT (64 * PAGE_SIZE)
diff --git a/mm/nommu.c b/mm/nommu.c
index 298884d..5a11e45 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -714,11 +714,11 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
flush_dcache_mmap_lock(mapping);
vma_interval_tree_insert(vma, &mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/* add the VMA to the tree */
@@ -780,11 +780,11 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
flush_dcache_mmap_lock(mapping);
vma_interval_tree_remove(vma, &mapping->i_mmap);
flush_dcache_mmap_unlock(mapping);
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
}

/* remove from the MM's tree and list */
@@ -2087,14 +2087,14 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
high = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;

down_write(&nommu_region_sem);
- mutex_lock(&inode->i_mapping->i_mmap_mutex);
+ i_mmap_lock_write(inode->i_mapping);

/* search for VMAs that fall within the dead zone */
vma_interval_tree_foreach(vma, &inode->i_mapping->i_mmap, low, high) {
/* found one - only interested if it's shared out of the page
* cache */
if (vma->vm_flags & VM_SHARED) {
- mutex_unlock(&inode->i_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(inode->i_mapping);
up_write(&nommu_region_sem);
return -ETXTBSY; /* not quite true, but near enough */
}
@@ -2122,7 +2122,7 @@ int nommu_shrink_inode_mappings(struct inode *inode, size_t size,
}
}

- mutex_unlock(&inode->i_mapping->i_mmap_mutex);
+ i_mmap_unlock_write(inode->i_mapping);
up_write(&nommu_region_sem);
return 0;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 6280da8..bc8eeb5 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -808,7 +808,7 @@ static int page_referenced_file(struct page *page,
*/
BUG_ON(!PageLocked(page));

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);

/*
* i_mmap_mutex does not stabilize mapcount at all, but mapcount
@@ -831,7 +831,7 @@ static int page_referenced_file(struct page *page,
break;
}

- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
return referenced;
}

@@ -920,14 +920,14 @@ static int page_mkclean_file(struct address_space *mapping, struct page *page)

BUG_ON(PageAnon(page));

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
if (vma->vm_flags & VM_SHARED) {
unsigned long address = vma_address(page, vma);
ret += page_mkclean_one(page, vma, address);
}
}
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
return ret;
}

@@ -1516,7 +1516,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
if (PageHuge(page))
pgoff = page->index << compound_order(page);

- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
ret = try_to_unmap_one(page, vma, address, flags);
@@ -1594,7 +1594,7 @@ static int try_to_unmap_file(struct page *page, enum ttu_flags flags)
list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.nonlinear)
vma->vm_private_data = NULL;
out:
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
return ret;
}

@@ -1711,7 +1711,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *,

if (!mapping)
return ret;
- mutex_lock(&mapping->i_mmap_mutex);
+ i_mmap_lock_write(mapping);
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
unsigned long address = vma_address(page, vma);
ret = rmap_one(page, vma, address, arg);
@@ -1723,7 +1723,7 @@ static int rmap_walk_file(struct page *page, int (*rmap_one)(struct page *,
* never contain migration ptes. Decide what to do about this
* limitation to linear when we need rmap_walk() on nonlinear.
*/
- mutex_unlock(&mapping->i_mmap_mutex);
+ i_mmap_unlock_write(mapping);
return ret;
}

--
1.7.11.7

2013-06-25 00:22:14

by Davidlohr Bueso

Subject: [PATCH 1/5] mm,fs: introduce helpers around i_mmap_mutex

Various parts of the kernel acquire and release this mutex,
so add i_mmap_lock_write() and i_mmap_unlock_write() helper
functions to encapsulate this logic. The next patch
will make use of these.
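
As a usage sketch (a hypothetical caller shown for illustration; the
next patch performs exactly this substitution throughout the tree):

        /* before */
        mutex_lock(&mapping->i_mmap_mutex);
        vma_interval_tree_insert(vma, &mapping->i_mmap);
        mutex_unlock(&mapping->i_mmap_mutex);

        /* after */
        i_mmap_lock_write(mapping);
        vma_interval_tree_insert(vma, &mapping->i_mmap);
        i_mmap_unlock_write(mapping);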

Signed-off-by: Davidlohr Bueso <[email protected]>
---
include/linux/fs.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 65c2be2..1ea6c68 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -475,6 +475,16 @@ struct block_device {

int mapping_tagged(struct address_space *mapping, int tag);

+static inline void i_mmap_lock_write(struct address_space *mapping)
+{
+ mutex_lock(&mapping->i_mmap_mutex);
+}
+
+static inline void i_mmap_unlock_write(struct address_space *mapping)
+{
+ mutex_unlock(&mapping->i_mmap_mutex);
+}
+
/*
* Might pages of this file be mapped into userspace?
*/
--
1.7.11.7

2014-06-02 20:08:36

by Andrew Morton

Subject: Re: [PATCH 0/5] mm: i_mmap_mutex to rwsem

On Thu, 29 May 2014 19:20:15 -0700 Davidlohr Bueso <[email protected]> wrote:

> On Thu, 2014-05-22 at 20:33 -0700, Davidlohr Bueso wrote:
> > This patchset extends the work started by Ingo Molnar in late 2012,
> > optimizing the anon-vma mutex lock, converting it from an exclusive mutex
> > to an rwsem, and sharing the lock for read-only paths when walking the
> > vma interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
> >
> > The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
> > file-backed pages. Therefore we can use similar locking techniques: convert
> > the mutex to an rwsem and share the lock when possible.
> >
> > With the new optimistic spinning property we have in rwsems, we no longer
> > take a hit in performance when using this lock, and we can therefore
> > safely do the conversion. Tests show no throughput regressions in aim7 or
> > pgbench runs, and we see gains from sharing the lock: ~+15% in disk
> > workloads for over 1000 users on an 8-socket Westmere system.
> >
> > This patchset applies on linux-next-20140522.
>
> ping? Andrew any chance of getting this in -next?

(top-posting repaired)

It was a bit late for 3.16 back on May 26, when you said "I will dig
deeper (probably for 3.17 now)". So, please take another look at the
patch factoring and let's get this underway for -rc1.

2014-06-02 20:31:10

by Davidlohr Bueso

Subject: Re: [PATCH 0/5] mm: i_mmap_mutex to rwsem

On Mon, 2014-06-02 at 13:08 -0700, Andrew Morton wrote:
> On Thu, 29 May 2014 19:20:15 -0700 Davidlohr Bueso <[email protected]> wrote:
>
> > On Thu, 2014-05-22 at 20:33 -0700, Davidlohr Bueso wrote:
> > > This patchset extends the work started by Ingo Molnar in late 2012,
> > > optimizing the anon-vma mutex lock, converting it from an exclusive mutex
> > > to an rwsem, and sharing the lock for read-only paths when walking the
> > > vma interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
> > >
> > > The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
> > > file-backed pages. Therefore we can use similar locking techniques: convert
> > > the mutex to an rwsem and share the lock when possible.
> > >
> > > With the new optimistic spinning property we have in rwsems, we no longer
> > > take a hit in performance when using this lock, and we can therefore
> > > safely do the conversion. Tests show no throughput regressions in aim7 or
> > > pgbench runs, and we see gains from sharing the lock: ~+15% in disk
> > > workloads for over 1000 users on an 8-socket Westmere system.
> > >
> > > This patchset applies on linux-next-20140522.
> >
> > ping? Andrew any chance of getting this in -next?
>
> (top-posting repaired)
>
> It was a bit late for 3.16 back on May 26, when you said "I will dig
> deeper (probably for 3.17 now)". So, please take another look at the
> patch factoring and let's get this underway for -rc1.

Ok, so I meant that I'd dig deeper for the additional sharing
opportunities (of which I've found a few, as Hugh correctly suggested).
So those eventual patches could come later.

But I see no reason for *this* patchset to be delayed, as even if it
gets to be 3.17 material, I'd still very much want to have the same
patch factoring I have now. I think it's the correct way to handle the
lock transition, for both correctness and bisectability.

Thanks,
Davidlohr

2014-06-02 23:56:15

by Hugh Dickins

Subject: Re: [PATCH 0/5] mm: i_mmap_mutex to rwsem

On Mon, 2 Jun 2014, Davidlohr Bueso wrote:
> On Mon, 2014-06-02 at 13:08 -0700, Andrew Morton wrote:
> > On Thu, 29 May 2014 19:20:15 -0700 Davidlohr Bueso <[email protected]> wrote:
> >
> > > On Thu, 2014-05-22 at 20:33 -0700, Davidlohr Bueso wrote:
> > > > This patchset extends the work started by Ingo Molnar in late 2012,
> > > > optimizing the anon-vma mutex lock, converting it from an exclusive mutex
> > > > to an rwsem, and sharing the lock for read-only paths when walking the
> > > > vma interval tree. More specifically, commits 5a505085 and 4fc3f1d6.
> > > >
> > > > The i_mmap_mutex has similar responsibilities to the anon-vma lock, protecting
> > > > file-backed pages. Therefore we can use similar locking techniques: convert
> > > > the mutex to an rwsem and share the lock when possible.
> > > >
> > > > With the new optimistic spinning property we have in rwsems, we no longer
> > > > take a hit in performance when using this lock, and we can therefore
> > > > safely do the conversion. Tests show no throughput regressions in aim7 or
> > > > pgbench runs, and we see gains from sharing the lock: ~+15% in disk
> > > > workloads for over 1000 users on an 8-socket Westmere system.
> > > >
> > > > This patchset applies on linux-next-20140522.
> > >
> > > ping? Andrew any chance of getting this in -next?
> >
> > (top-posting repaired)
> >
> > It was a bit late for 3.16 back on May 26, when you said "I will dig
> > deeper (probably for 3.17 now)". So, please take another look at the
> > patch factoring and let's get this underway for -rc1.
>
> Ok, so I meant that I'd dig deeper for the additional sharing
> opportunities (which I've found a few as Hugh correctly suggested). So
> those eventual patches could come later.
>
> But I see no reason for *this* patchset to be delayed, as even if it
> gets to be 3.17 material, I'd still very much want to have the same
> patch factoring I have now. I think its the correct way to handle lock
> transitioning for both correctness and bisectability.

I'd be glad to see it go into 3.16 if it works as well as advertised.
And if you're attached to your current 2/5, fine, do stick with that.
But please do a proper job on your 3/5, instead of just aping how the
anon case worked out.

Hugh