2013-07-15 10:45:06

by Kirill A. Shutemov

Subject: [PATCH, REBASED 0/8] Transparent huge page cache: phase 0, prep work

From: "Kirill A. Shutemov" <[email protected]>

[ no changes since last post, only rebased to v3.11-rc1 ]

My patchset which introduces transparent huge page cache is pretty big and
hardly reviewable. Dave Hansen suggested splitting it into a few parts.

This is the first part: preparation work. I think it's useful even without
the rest of the patches.

There's one fix for a bug in lru_add_page_tail(). I doubt it's possible to
trigger it with the current code, but it's nice to have upstream anyway.
The rest are cleanups.

Patch 8 depends on patch 7. Other patches are independent and can be
applied separately.

Please consider applying.

Kirill A. Shutemov (8):
mm: drop actor argument of do_generic_file_read()
thp, mm: avoid PageUnevictable on active/inactive lru lists
thp: account anon transparent huge pages into NR_ANON_PAGES
mm: cleanup add_to_page_cache_locked()
thp, mm: locking tail page is a bug
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
thp: do_huge_pmd_anonymous_page() cleanup
thp: consolidate code between handle_mm_fault() and
do_huge_pmd_anonymous_page()

drivers/base/node.c | 6 ---
fs/proc/meminfo.c | 6 ---
include/linux/huge_mm.h | 3 --
include/linux/mm.h | 3 +-
mm/filemap.c | 60 ++++++++++++-----------
mm/huge_memory.c | 125 ++++++++++++++++++++----------------------------
mm/memory.c | 9 ++--
mm/rmap.c | 18 +++----
mm/swap.c | 20 +-------
9 files changed, 104 insertions(+), 146 deletions(-)

--
1.8.3.2


2013-07-15 10:45:16

by Kirill A. Shutemov

Subject: [PATCH 6/8] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

From: "Kirill A. Shutemov" <[email protected]>

It's confusing that mk_huge_pmd() has semantics different from mk_pte()
or mk_pmd(). I spent some time debugging an issue caused by this
inconsistency.

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
prototype to match mk_pte().

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
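[ A minimal sketch of the resulting calling convention, paraphrasing the
  hunks below -- mk_huge_pmd() now only builds a huge pmd from the page and
  protection, and callers apply the dirty/write bits explicitly, just as
  mk_pte() users do: ]

        pmd_t entry;

        entry = mk_huge_pmd(page, vma->vm_page_prot);   /* new prototype */
        entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
        set_pmd_at(mm, haddr, pmd, entry);
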
mm/huge_memory.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 04f0749..ec735a9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -690,11 +690,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
{
pmd_t entry;
- entry = mk_pmd(page, vma->vm_page_prot);
- entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = mk_pmd(page, prot);
entry = pmd_mkhuge(entry);
return entry;
}
@@ -727,7 +726,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
- entry = mk_huge_pmd(page, vma);
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, haddr, pmd, entry);
@@ -1210,7 +1210,8 @@ alloc:
goto out_mn;
} else {
pmd_t entry;
- entry = mk_huge_pmd(new_page, vma);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_clear_flush(vma, haddr, pmd);
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
@@ -2356,7 +2357,8 @@ static void collapse_huge_page(struct mm_struct *mm,
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);

- _pmd = mk_huge_pmd(new_page, vma);
+ _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);

/*
* spin_lock() below is not the equivalent of smp_wmb(), so
--
1.8.3.2

2013-07-15 10:45:27

by Kirill A. Shutemov

Subject: [PATCH 2/8] thp, mm: avoid PageUnevictable on active/inactive lru lists

From: "Kirill A. Shutemov" <[email protected]>

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_[in]active_list
goes crazy:

kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!

1090 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
1091 struct lruvec *lruvec, struct list_head *dst,
1092 unsigned long *nr_scanned, struct scan_control *sc,
1093 isolate_mode_t mode, enum lru_list lru)
1094 {
...
1108 switch (__isolate_lru_page(page, mode)) {
1109 case 0:
...
1116 case -EBUSY:
...
1121 default:
1122 BUG();
1123 }
1124 }
...
1130 }

__isolate_lru_page() returns -EINVAL for PageUnevictable(page).

For lru_add_page_tail(), this means we should not set PageUnevictable on
tail pages unless we're sure the page will go to LRU_UNEVICTABLE. Let's
just copy PG_active and PG_unevictable from the head page in
__split_huge_page_refcount(); it simplifies lru_add_page_tail().

This also fixes one more bug in lru_add_page_tail(): if
page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
goes to the same lru as page, but nobody bothers to sync page_tail's
active/inactive state with page. So we can end up with an inactive page on
the active lru. The patch fixes that as well, since we copy PG_active from
the head page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
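[ For context, a simplified sketch of how page_lru() (roughly, from
  include/linux/mm_inline.h) derives the target list from the flags we now
  copy onto the tail page; the real helper also distinguishes anon vs file
  lru: ]

        enum lru_list lru;

        if (PageUnevictable(page_tail))
                lru = LRU_UNEVICTABLE;
        else {
                lru = LRU_INACTIVE_ANON;        /* anon case shown */
                if (PageActive(page_tail))
                        lru += LRU_ACTIVE;
        }
        add_page_to_lru_list(page_tail, lruvec, lru);
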
mm/huge_memory.c | 4 +++-
mm/swap.c | 20 ++------------------
2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 243e710..a92012a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1620,7 +1620,9 @@ static void __split_huge_page_refcount(struct page *page,
((1L << PG_referenced) |
(1L << PG_swapbacked) |
(1L << PG_mlocked) |
- (1L << PG_uptodate)));
+ (1L << PG_uptodate) |
+ (1L << PG_active) |
+ (1L << PG_unevictable)));
page_tail->flags |= (1L << PG_dirty);

/* clear PageTail before overwriting first_page */
diff --git a/mm/swap.c b/mm/swap.c
index 4a1d0d2..7bd8910 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -774,8 +774,6 @@ EXPORT_SYMBOL(__pagevec_release);
void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- int uninitialized_var(active);
- enum lru_list lru;
const int file = 0;

VM_BUG_ON(!PageHead(page));
@@ -787,20 +785,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
if (!list)
SetPageLRU(page_tail);

- if (page_evictable(page_tail)) {
- if (PageActive(page)) {
- SetPageActive(page_tail);
- active = 1;
- lru = LRU_ACTIVE_ANON;
- } else {
- active = 0;
- lru = LRU_INACTIVE_ANON;
- }
- } else {
- SetPageUnevictable(page_tail);
- lru = LRU_UNEVICTABLE;
- }
-
if (likely(PageLRU(page)))
list_add_tail(&page_tail->lru, &page->lru);
else if (list) {
@@ -816,13 +800,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
* Use the standard add function to put page_tail on the list,
* but then correct its position so they all end up in order.
*/
- add_page_to_lru_list(page_tail, lruvec, lru);
+ add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
list_head = page_tail->lru.prev;
list_move_tail(&page_tail->lru, list_head);
}

if (!PageUnevictable(page))
- update_page_reclaim_stat(lruvec, file, active);
+ update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

--
1.8.3.2

2013-07-15 10:45:51

by Kirill A. Shutemov

Subject: [PATCH 5/8] thp, mm: locking tail page is a bug

From: "Kirill A. Shutemov" <[email protected]>

Locking the head page means locking the entire compound page.
If we try to lock a tail page, something has gone wrong.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
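[ For reference, VM_BUG_ON() only expands to a real check on debug builds,
  so the PageTail() test costs nothing unless CONFIG_DEBUG_VM is set;
  roughly, paraphrasing include/linux/mmdebug.h of this era: ]

        #ifdef CONFIG_DEBUG_VM
        #define VM_BUG_ON(cond)         BUG_ON(cond)
        #else
        #define VM_BUG_ON(cond)         BUILD_BUG_ON_INVALID(cond)
        #endif
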
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index f7c4ed5..e59049c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -639,6 +639,7 @@ void __lock_page(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
TASK_UNINTERRUPTIBLE);
}
@@ -648,6 +649,7 @@ int __lock_page_killable(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
return __wait_on_bit_lock(page_waitqueue(page), &wait,
sleep_on_page_killable, TASK_KILLABLE);
}
--
1.8.3.2

2013-07-15 10:45:14

by Kirill A. Shutemov

Subject: [PATCH 8/8] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()

From: "Kirill A. Shutemov" <[email protected]>

do_huge_pmd_anonymous_page() has a copy-pasted piece of handle_mm_fault()
to handle the fallback path.

Let's consolidate the code by introducing a VM_FAULT_FALLBACK return
code.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
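[ The contract in short: a huge-page fault handler that cannot (or decides
  not to) install a huge pmd returns VM_FAULT_FALLBACK instead of faulting
  in small pages itself, and handle_mm_fault() then falls through to the
  normal pte path. A minimal sketch of the caller side, mirroring the
  mm/memory.c hunk below: ]

        if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
                int ret = VM_FAULT_FALLBACK;
                if (!vma->vm_ops)
                        ret = do_huge_pmd_anonymous_page(mm, vma, address,
                                                         pmd, flags);
                if (!(ret & VM_FAULT_FALLBACK))
                        return ret;
                /* otherwise fall through to the small-page (pte) path */
        }
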
include/linux/huge_mm.h | 3 ---
include/linux/mm.h | 3 ++-
mm/huge_memory.c | 31 +++++--------------------------
mm/memory.c | 9 ++++++---
4 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b60de92..3935428 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -96,9 +96,6 @@ extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end);
-extern int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page_to_list(struct page *page, struct list_head *list);
static inline int split_huge_page(struct page *page)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..d5c82dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -884,11 +884,12 @@ static inline int page_mapped(struct page *page)
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
+#define VM_FAULT_FALLBACK 0x0800 /* huge page fault failed, fall back to small */

#define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */

#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
- VM_FAULT_HWPOISON_LARGE)
+ VM_FAULT_FALLBACK | VM_FAULT_HWPOISON_LARGE)

/* Encode hstate index for a hwpoisoned large page */
#define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 82f0697..c3b8c9c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -783,10 +783,9 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page;
unsigned long haddr = address & HPAGE_PMD_MASK;
- pte_t *pte;

if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
- goto out;
+ return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
@@ -803,7 +802,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!zero_page)) {
pte_free(mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
- goto out;
+ return VM_FAULT_FALLBACK;
}
spin_lock(&mm->page_table_lock);
set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
@@ -819,40 +818,20 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
vma, haddr, numa_node_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
- goto out;
+ return VM_FAULT_FALLBACK;
}
count_vm_event(THP_FAULT_ALLOC);
if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
put_page(page);
- goto out;
+ return VM_FAULT_FALLBACK;
}
if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
mem_cgroup_uncharge_page(page);
put_page(page);
- goto out;
+ return VM_FAULT_FALLBACK;
}

return 0;
-out:
- /*
- * Use __pte_alloc instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pmd_none(*pmd)) &&
- unlikely(__pte_alloc(mm, vma, pmd, address)))
- return VM_FAULT_OOM;
- /* if an huge pmd materialized from under us just retry later */
- if (unlikely(pmd_trans_huge(*pmd)))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge pmd
- * from under us anymore at this point because we hold the mmap_sem
- * read mode and khugepaged takes it in write mode. So now it's
- * safe to run pte_offset_map().
- */
- pte = pte_offset_map(pmd, address);
- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index 1ce2e2a..f2ab2a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3693,7 +3693,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-int handle_pte_fault(struct mm_struct *mm,
+static int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pmd_t *pmd, unsigned int flags)
{
@@ -3780,9 +3780,12 @@ retry:
if (!pmd)
return VM_FAULT_OOM;
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+ int ret = VM_FAULT_FALLBACK;
if (!vma->vm_ops)
- return do_huge_pmd_anonymous_page(mm, vma, address,
- pmd, flags);
+ ret = do_huge_pmd_anonymous_page(mm, vma, address,
+ pmd, flags);
+ if (!(ret & VM_FAULT_FALLBACK))
+ return ret;
} else {
pmd_t orig_pmd = *pmd;
int ret;
--
1.8.3.2

2013-07-15 10:46:23

by Kirill A. Shutemov

Subject: [PATCH 3/8] thp: account anon transparent huge pages into NR_ANON_PAGES

From: "Kirill A. Shutemov" <[email protected]>

We use NR_ANON_PAGES as the basis for reporting AnonPages to userspace.
There's not much sense in excluding transparent huge pages from that
counter only to add them back in when printing to userspace.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
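[ The common accounting lines below rely on hpage_nr_pages(), which is
  roughly (paraphrasing include/linux/huge_mm.h): ]

        static inline int hpage_nr_pages(struct page *page)
        {
                if (unlikely(PageTransHuge(page)))
                        return HPAGE_PMD_NR;    /* e.g. 512 with 4k pages */
                return 1;
        }
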
drivers/base/node.c | 6 ------
fs/proc/meminfo.c | 6 ------
mm/huge_memory.c | 1 -
mm/rmap.c | 18 +++++++++---------
4 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7616a77..bc9f43b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -125,13 +125,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_WRITEBACK)),
nid, K(node_page_state(nid, NR_FILE_PAGES)),
nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_ANON_PAGES)
- + node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR),
-#else
nid, K(node_page_state(nid, NR_ANON_PAGES)),
-#endif
nid, K(node_page_state(nid, NR_SHMEM)),
nid, node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 5aa847a..59d85d6 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -132,13 +132,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(i.freeswap),
K(global_page_state(NR_FILE_DIRTY)),
K(global_page_state(NR_WRITEBACK)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- K(global_page_state(NR_ANON_PAGES)
- + global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR),
-#else
K(global_page_state(NR_ANON_PAGES)),
-#endif
K(global_page_state(NR_FILE_MAPPED)),
K(global_page_state(NR_SHMEM)),
K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a92012a..04f0749 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1661,7 +1661,6 @@ static void __split_huge_page_refcount(struct page *page,
BUG_ON(atomic_read(&page->_count) <= 0);

__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
- __mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);

ClearPageCompound(page);
compound_unlock(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index cd356df..7066470 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1055,11 +1055,11 @@ void do_page_add_anon_rmap(struct page *page,
{
int first = atomic_inc_and_test(&page->_mapcount);
if (first) {
- if (!PageTransHuge(page))
- __inc_zone_page_state(page, NR_ANON_PAGES);
- else
+ if (PageTransHuge(page))
__inc_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ hpage_nr_pages(page));
}
if (unlikely(PageKsm(page)))
return;
@@ -1088,10 +1088,10 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
- if (!PageTransHuge(page))
- __inc_zone_page_state(page, NR_ANON_PAGES);
- else
+ if (PageTransHuge(page))
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ hpage_nr_pages(page));
__page_set_anon_rmap(page, vma, address, 1);
if (!mlocked_vma_newpage(vma, page)) {
SetPageActive(page);
@@ -1151,11 +1151,11 @@ void page_remove_rmap(struct page *page)
goto out;
if (anon) {
mem_cgroup_uncharge_page(page);
- if (!PageTransHuge(page))
- __dec_zone_page_state(page, NR_ANON_PAGES);
- else
+ if (PageTransHuge(page))
__dec_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ -hpage_nr_pages(page));
} else {
__dec_zone_page_state(page, NR_FILE_MAPPED);
mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
--
1.8.3.2

2013-07-15 10:45:12

by Kirill A. Shutemov

Subject: [PATCH 7/8] thp: do_huge_pmd_anonymous_page() cleanup

From: "Kirill A. Shutemov" <[email protected]>

Minor cleanup: unindent most of the function's code by inverting one
condition. This is preparation for the next patch.

No functional changes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
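[ The shape of the change, sketched -- the bail-out condition is inverted
  so the main allocation path sheds one indentation level (no behaviour
  change, as the hunk below shows): ]

        if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
                goto out;
        if (unlikely(anon_vma_prepare(vma)))
                return VM_FAULT_OOM;
        /* ... the rest of the function continues un-nested ... */
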
mm/huge_memory.c | 83 ++++++++++++++++++++++++++++----------------------------
1 file changed, 41 insertions(+), 42 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec735a9..82f0697 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -785,55 +785,54 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long haddr = address & HPAGE_PMD_MASK;
pte_t *pte;

- if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
- if (unlikely(anon_vma_prepare(vma)))
- return VM_FAULT_OOM;
- if (unlikely(khugepaged_enter(vma)))
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ goto out;
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE) &&
+ transparent_hugepage_use_zero_page()) {
+ pgtable_t pgtable;
+ struct page *zero_page;
+ bool set;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
- transparent_hugepage_use_zero_page()) {
- pgtable_t pgtable;
- struct page *zero_page;
- bool set;
- pgtable = pte_alloc_one(mm, haddr);
- if (unlikely(!pgtable))
- return VM_FAULT_OOM;
- zero_page = get_huge_zero_page();
- if (unlikely(!zero_page)) {
- pte_free(mm, pgtable);
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
- spin_lock(&mm->page_table_lock);
- set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
- zero_page);
- spin_unlock(&mm->page_table_lock);
- if (!set) {
- pte_free(mm, pgtable);
- put_huge_zero_page();
- }
- return 0;
- }
- page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
- if (unlikely(!page)) {
+ zero_page = get_huge_zero_page();
+ if (unlikely(!zero_page)) {
+ pte_free(mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
goto out;
}
- count_vm_event(THP_FAULT_ALLOC);
- if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
- put_page(page);
- goto out;
- }
- if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
- mem_cgroup_uncharge_page(page);
- put_page(page);
- goto out;
+ spin_lock(&mm->page_table_lock);
+ set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_page);
+ spin_unlock(&mm->page_table_lock);
+ if (!set) {
+ pte_free(mm, pgtable);
+ put_huge_zero_page();
}
-
return 0;
}
+ page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (unlikely(!page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+ put_page(page);
+ goto out;
+ }
+ if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
+ mem_cgroup_uncharge_page(page);
+ put_page(page);
+ goto out;
+ }
+
+ return 0;
out:
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
--
1.8.3.2

2013-07-15 10:47:24

by Kirill A. Shutemov

Subject: [PATCH 1/8] mm: drop actor argument of do_generic_file_read()

From: "Kirill A. Shutemov" <[email protected]>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
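[ For context, the single remaining actor is file_read_actor() in
  mm/filemap.c; roughly its shape -- the read_descriptor_t carries the user
  buffer, the residual count, the bytes written so far and any error: ]

        int file_read_actor(read_descriptor_t *desc, struct page *page,
                            unsigned long offset, unsigned long size);
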
mm/filemap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4b51ac1..f6fe61f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1088,7 +1088,6 @@ static void shrink_readahead_size_eio(struct file *filp,
* @filp: the file to read
* @ppos: current file position
* @desc: read_descriptor
- * @actor: read method
*
* This is a generic file read routine, and uses the
* mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1097,7 +1096,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* of the logic when it comes to error handling etc.
*/
static void do_generic_file_read(struct file *filp, loff_t *ppos,
- read_descriptor_t *desc, read_actor_t actor)
+ read_descriptor_t *desc)
{
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
@@ -1198,13 +1197,14 @@ page_ok:
* Ok, we have the page, and it's up-to-date, so
* now we can copy it to user space...
*
- * The actor routine returns how many bytes were actually used..
+ * The file_read_actor routine returns how many bytes were
+ * actually used..
* NOTE! This may not be the same as how much of a user buffer
* we filled up (we may be padding etc), so we can only update
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = file_read_actor(desc, page, offset, nr);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
@@ -1477,7 +1477,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (desc.count == 0)
continue;
desc.error = 0;
- do_generic_file_read(filp, ppos, &desc, file_read_actor);
+ do_generic_file_read(filp, ppos, &desc);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;
--
1.8.3.2

2013-07-15 10:47:40

by Kirill A. Shutemov

Subject: [PATCH 4/8] mm: cleanup add_to_page_cache_locked()

From: "Kirill A. Shutemov" <[email protected]>

The patch makes add_to_page_cache_locked() cleaner:
- unindent most of the function's code by inverting one condition;
- streamline the no-error code path;
- move the insert error path out of the normal code path;
- call radix_tree_preload_end() earlier;

No functional changes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
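[ Why radix_tree_preload_end() can move up: the preloaded per-CPU nodes are
  only needed until radix_tree_insert() has either consumed them or failed,
  so the preload can be released immediately after the insert, still under
  tree_lock. Roughly, the resulting ordering (mirroring the hunk below): ]

        error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
        if (error)
                return error;
        ...
        spin_lock_irq(&mapping->tree_lock);
        error = radix_tree_insert(&mapping->page_tree, offset, page);
        radix_tree_preload_end();       /* preloaded nodes no longer needed */
        /* success bookkeeping or error unwind follows, under tree_lock */
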
mm/filemap.c | 48 +++++++++++++++++++++++++-----------------------
1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f6fe61f..f7c4ed5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -467,32 +467,34 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
- goto out;
+ return error;

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
- if (error == 0) {
- page_cache_get(page);
- page->mapping = mapping;
- page->index = offset;
-
- spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (likely(!error)) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- spin_unlock_irq(&mapping->tree_lock);
- trace_mm_filemap_add_to_page_cache(page);
- } else {
- page->mapping = NULL;
- /* Leave page->index set: truncation relies upon it */
- spin_unlock_irq(&mapping->tree_lock);
- mem_cgroup_uncharge_cache_page(page);
- page_cache_release(page);
- }
- radix_tree_preload_end();
- } else
+ if (error) {
mem_cgroup_uncharge_cache_page(page);
-out:
+ return error;
+ }
+
+ page_cache_get(page);
+ page->mapping = mapping;
+ page->index = offset;
+
+ spin_lock_irq(&mapping->tree_lock);
+ error = radix_tree_insert(&mapping->page_tree, offset, page);
+ radix_tree_preload_end();
+ if (unlikely(!error))
+ goto err_insert;
+ mapping->nrpages++;
+ __inc_zone_page_state(page, NR_FILE_PAGES);
+ spin_unlock_irq(&mapping->tree_lock);
+ trace_mm_filemap_add_to_page_cache(page);
+ return 0;
+err_insert:
+ page->mapping = NULL;
+ /* Leave page->index set: truncation relies upon it */
+ spin_unlock_irq(&mapping->tree_lock);
+ mem_cgroup_uncharge_cache_page(page);
+ page_cache_release(page);
return error;
}
EXPORT_SYMBOL(add_to_page_cache_locked);
--
1.8.3.2

2013-07-15 13:52:47

by Kirill A. Shutemov

Subject: RE: [PATCH 4/8] mm: cleanup add_to_page_cache_locked()

Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The patch makes add_to_page_cache_locked() cleaner:
> - unindent most code of the function by inverting one condition;
> - streamline code no-error path;
> - move insert error path outside normal code path;
> - call radix_tree_preload_end() earlier;
>
> No functional changes.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Dave Hansen <[email protected]>
> ---

...

> + spin_lock_irq(&mapping->tree_lock);
> + error = radix_tree_insert(&mapping->page_tree, offset, page);
> + radix_tree_preload_end();
> + if (unlikely(!error))
> + goto err_insert;

Nadav Shemer noticed a mistake here. It should be 'if (unlikely(error))'.

I missed this because it was fixed by another patch in my thp-pagecache
series.

The fixed patch is below. Retested with only this patchset applied.
Looks okay now.

Sorry for this.

---

From 0c74a39660f41a6c9504e3b4ec7a1ba3433af2a3 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 23 May 2013 16:08:12 +0300
Subject: [PATCH] mm: cleanup add_to_page_cache_locked()

The patch makes add_to_page_cache_locked() cleaner:
- unindent most of the function's code by inverting one condition;
- streamline the no-error code path;
- move the insert error path out of the normal code path;
- call radix_tree_preload_end() earlier;

No functional changes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 48 +++++++++++++++++++++++++-----------------------
1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f6fe61f..f1ca99c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -467,32 +467,34 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
- goto out;
+ return error;

error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
- if (error == 0) {
- page_cache_get(page);
- page->mapping = mapping;
- page->index = offset;
-
- spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (likely(!error)) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- spin_unlock_irq(&mapping->tree_lock);
- trace_mm_filemap_add_to_page_cache(page);
- } else {
- page->mapping = NULL;
- /* Leave page->index set: truncation relies upon it */
- spin_unlock_irq(&mapping->tree_lock);
- mem_cgroup_uncharge_cache_page(page);
- page_cache_release(page);
- }
- radix_tree_preload_end();
- } else
+ if (error) {
mem_cgroup_uncharge_cache_page(page);
-out:
+ return error;
+ }
+
+ page_cache_get(page);
+ page->mapping = mapping;
+ page->index = offset;
+
+ spin_lock_irq(&mapping->tree_lock);
+ error = radix_tree_insert(&mapping->page_tree, offset, page);
+ radix_tree_preload_end();
+ if (unlikely(error))
+ goto err_insert;
+ mapping->nrpages++;
+ __inc_zone_page_state(page, NR_FILE_PAGES);
+ spin_unlock_irq(&mapping->tree_lock);
+ trace_mm_filemap_add_to_page_cache(page);
+ return 0;
+err_insert:
+ page->mapping = NULL;
+ /* Leave page->index set: truncation relies upon it */
+ spin_unlock_irq(&mapping->tree_lock);
+ mem_cgroup_uncharge_cache_page(page);
+ page_cache_release(page);
return error;
}
EXPORT_SYMBOL(add_to_page_cache_locked);
--
Kirill A. Shutemov

2013-07-16 19:10:18

by Matthew Wilcox

Subject: Re: [PATCH 1/8] mm: drop actor argument of do_generic_file_read()

On Mon, Jul 15, 2013 at 01:47:47PM +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> There's only one caller of do_generic_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Dave Hansen <[email protected]>

Would it make sense to do the same thing to do_shmem_file_read()?

From: Matthew Wilcox <[email protected]>

There's only one caller of do_shmem_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Matthew Wilcox <[email protected]>

diff --git a/mm/shmem.c b/mm/shmem.c
index 5e6a842..6a9c325 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1464,7 +1464,7 @@ shmem_write_end(struct file *file, struct address_space *mapping,
return copied;
}

-static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc, read_actor_t actor)
+static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_t *desc)
{
struct inode *inode = file_inode(filp);
struct address_space *mapping = inode->i_mapping;
@@ -1546,13 +1546,14 @@ static void do_shmem_file_read(struct file *filp, loff_t *ppos, read_descriptor_
* Ok, we have the page, and it's up-to-date, so
* now we can copy it to user space...
*
- * The actor routine returns how many bytes were actually used..
+ * The file_read_actor routine returns how many bytes were actually
+ * used..
* NOTE! This may not be the same as how much of a user buffer
* we filled up (we may be padding etc), so we can only update
- * "pos" here (the actor routine has to update the user buffer
+ * "pos" here (file_read_actor has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = file_read_actor(desc, page, offset, nr);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
@@ -1590,7 +1591,7 @@ static ssize_t shmem_file_aio_read(struct kiocb *iocb,
if (desc.count == 0)
continue;
desc.error = 0;
- do_shmem_file_read(filp, ppos, &desc, file_read_actor);
+ do_shmem_file_read(filp, ppos, &desc);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;

2013-07-16 23:10:35

by Kirill A. Shutemov

Subject: Re: [PATCH 1/8] mm: drop actor argument of do_generic_file_read()

Matthew Wilcox wrote:
> On Mon, Jul 15, 2013 at 01:47:47PM +0300, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > There's only one caller of do_generic_file_read() and the only actor is
> > file_read_actor(). No reason to have a callback parameter.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Acked-by: Dave Hansen <[email protected]>
>
> Would it make sense to do the same thing to do_shmem_file_read()?
>
> From: Matthew Wilcox <[email protected]>
>
> There's only one caller of do_shmem_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.
>
> Signed-off-by: Matthew Wilcox <[email protected]>

Looks good to me:

Reviewed-by: Kirill A. Shutemov <[email protected]>

v3.11-rc1 brings one more user of read_actor_t -- lustre. But it seems to
be an artifact of the days when f_op had ->sendfile, and it's not in use
anymore.

--
Kirill A. Shutemov

2013-07-17 21:09:57

by Andrew Morton

Subject: Re: [PATCH 5/8] thp, mm: locking tail page is a bug

On Mon, 15 Jul 2013 13:47:51 +0300 "Kirill A. Shutemov" <[email protected]> wrote:

> From: "Kirill A. Shutemov" <[email protected]>
>
> Locking head page means locking entire compound page.
> If we try to lock tail page, something went wrong.
>
> ..
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -639,6 +639,7 @@ void __lock_page(struct page *page)
> {
> DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> + VM_BUG_ON(PageTail(page));
> __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
> TASK_UNINTERRUPTIBLE);
> }
> @@ -648,6 +649,7 @@ int __lock_page_killable(struct page *page)
> {
> DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> + VM_BUG_ON(PageTail(page));
> return __wait_on_bit_lock(page_waitqueue(page), &wait,
> sleep_on_page_killable, TASK_KILLABLE);
> }

lock_page() is a pretty commonly called function, and I assume quite a
lot of people run with CONFIG_DEBUG_VM=y.

Is the overhead added by this patch really worthwhile?

I'm thinking I might leave it in -mm indefinitely but not send it
upstream.

2013-07-17 22:42:18

by Kirill A. Shutemov

Subject: Re: [PATCH 5/8] thp, mm: locking tail page is a bug

Andrew Morton wrote:
> On Mon, 15 Jul 2013 13:47:51 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Locking head page means locking entire compound page.
> > If we try to lock tail page, something went wrong.
> >
> > ..
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -639,6 +639,7 @@ void __lock_page(struct page *page)
> > {
> > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> >
> > + VM_BUG_ON(PageTail(page));
> > __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
> > TASK_UNINTERRUPTIBLE);
> > }
> > @@ -648,6 +649,7 @@ int __lock_page_killable(struct page *page)
> > {
> > DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
> >
> > + VM_BUG_ON(PageTail(page));
> > return __wait_on_bit_lock(page_waitqueue(page), &wait,
> > sleep_on_page_killable, TASK_KILLABLE);
> > }
>
> lock_page() is a pretty commonly called function, and I assume quite a
> lot of people run with CONFIG_DEBUG_VM=y.
>
> Is the overhead added by this patch really worthwhile?

I found it useful, especially when I was starting experiments with THP
for pagecache. But feel free to drop it if you think it adds too much
overhead.

> I'm thinking I might leave it in -mm indefinitely but not send it
> upstream.

Works for me too.

--
Kirill A. Shutemov

2013-07-17 22:44:04

by Dave Hansen

Subject: Re: [PATCH 5/8] thp, mm: locking tail page is a bug

On 07/17/2013 02:09 PM, Andrew Morton wrote:
> lock_page() is a pretty commonly called function, and I assume quite a
> lot of people run with CONFIG_DEBUG_VM=y.
>
> Is the overhead added by this patch really worthwhile?

I always thought of it as a developer-only thing. I don't think any of
the big distros turn it on by default.

2013-07-18 00:58:07

by Hugh Dickins

Subject: Re: [PATCH 5/8] thp, mm: locking tail page is a bug

On Wed, 17 Jul 2013, Dave Hansen wrote:
> On 07/17/2013 02:09 PM, Andrew Morton wrote:
> > lock_page() is a pretty commonly called function, and I assume quite a
> > lot of people run with CONFIG_DEBUG_VM=y.
> >
> > Is the overhead added by this patch really worthwhile?
>
> I always thought of it as a developer-only thing. I don't think any of
> the big distros turn it on by default.

That's how I think of it too (and the problem is often that too few mm
developers turn it on); but Dave Jones did confirm last November that
Fedora turns it on.

I believe Fedora turns it on to help us all, and wouldn't mind a mere
VM_BUG_ON(PageTail(page)) in __lock_page() if it's helpful to Kirill.

But if VM_BUG_ONs become expensive, I do think it's for Fedora to
turn off CONFIG_DEBUG_VM, rather than for mm developers to avoid it.

Hugh

2013-07-18 02:54:35

by Dave Jones

Subject: Re: [PATCH 5/8] thp, mm: locking tail page is a bug

On Wed, Jul 17, 2013 at 05:58:13PM -0700, Hugh Dickins wrote:
> On Wed, 17 Jul 2013, Dave Hansen wrote:
> > On 07/17/2013 02:09 PM, Andrew Morton wrote:
> > > lock_page() is a pretty commonly called function, and I assume quite a
> > > lot of people run with CONFIG_DEBUG_VM=y.
> > >
> > > Is the overhead added by this patch really worthwhile?
> >
> > I always thought of it as a developer-only thing. I don't think any of
> > the big distros turn it on by default.
>
> That's how I think of it too (and the problem is often that too few mm
> developers turn it on); but Dave Jones did confirm last November that
> Fedora turns it on.
>
> I believe Fedora turns it on to help us all, and wouldn't mind a mere
> VM_BUG_ON(PageTail(page)) in __lock_page() if it's helpful to Kirill.
>
> But if VM_BUG_ONs become expensive, I do think it's for Fedora to
> turn off CONFIG_DEBUG_VM, rather than for mm developers to avoid it.

I'm ambivalent about whether we keep it on or off; we have no shortage
of bugs to fix already. Though, as mentioned above, I think very few people
actually enable it, so turning it off would lose a lot of testing.

Another idea, perhaps, is an extra config option for the more expensive
debug checks?

Dave

2013-07-18 19:59:06

by Hugh Dickins

Subject: Re: [PATCH 5/8] thp, mm: locking tail page is a bug

On Wed, 17 Jul 2013, Dave Jones wrote:
> On Wed, Jul 17, 2013 at 05:58:13PM -0700, Hugh Dickins wrote:
> > On Wed, 17 Jul 2013, Dave Hansen wrote:
> > > On 07/17/2013 02:09 PM, Andrew Morton wrote:
> > > > lock_page() is a pretty commonly called function, and I assume quite a
> > > > lot of people run with CONFIG_DEBUG_VM=y.
> > > >
> > > > Is the overhead added by this patch really worthwhile?
> > >
> > > I always thought of it as a developer-only thing. I don't think any of
> > > the big distros turn it on by default.
> >
> > That's how I think of it too (and the problem is often that too few mm
> > developers turn it on); but Dave Jones did confirm last November that
> > Fedora turns it on.
> >
> > I believe Fedora turns it on to help us all, and wouldn't mind a mere
> > VM_BUG_ON(PageTail(page)) in __lock_page() if it's helpful to Kirill.
> >
> > But if VM_BUG_ONs become expensive, I do think it's for Fedora to
> > turn off CONFIG_DEBUG_VM, rather than for mm developers to avoid it.
>
> I'm ambivalent about whether we keep it on or off, we have no shortage
> of bugs to fix already, though I think as mentioned above, very few people
> actually enable it, so we're going to lose a lot of testing.
>
> Another idea, perhaps is an extra config option for more expensive debug options ?

I can't think of any expensive debug under CONFIG_DEBUG_VM at present,
just a little overhead as you'd expect. I don't think checking a page
flag or two in __lock_page() will change that much.

We do tend to make specific debug options for the more expensive ones,
rather than a generic heavy debug option. For example CONFIG_DEBUG_VM_RB,
which checks an mm's entire vma tree whenever a change is made there.

We could group the heavy debug options, like that and CONFIG_PROVE_LOCKING,
into a menu of their own; but I've no appetite for such a change myself!

Hugh