2013-04-05 12:07:59

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 00/34] Transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

Here's the third RFC. Thanks everybody for the feedback.

The patchset is pretty big already and I want to stop generating new
features to keep it reviewable. Next I'll concentrate on benchmarking and
tuning.

Therefore some features will be left out of the initial transparent huge
page cache implementation:
- page collapsing;
- migration;
- tmpfs/shmem;

There are a few features which are not implemented yet and which could
potentially block upstreaming:

1. Currently we allocate a 2M page even if we create only a 1-byte file on
ramfs. I don't think it's a problem by itself. With anon THP we also try to
allocate huge pages whenever possible.
The problem is that ramfs pages are unevictable and we can't just split them
and push them to swap as we do with anon THP. At some point we have to have
a mechanism to split the last page of the file under memory pressure to
reclaim some memory.

2. We don't have knobs for disabling transparent huge page cache per-mount
or per-file. Should we have a mount option and fadvise flags as part of the
initial implementation?

Any thoughts?

The patchset is also on git:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/pagecache

v3:
- set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
- rewrite lru_add_page_tail() to address a few bugs;
- memcg accounting;
- represent file thp pages in meminfo and friends;
- dump page order in filemap trace;
- add missed flush_dcache_page() in zero_huge_user_segment;
- random cleanups based on feedback.
v2:
- mmap();
- fix add_to_page_cache_locked() and delete_from_page_cache();
- introduce mapping_can_have_hugepages();
- call split_huge_page() only for head page in filemap_fault();
- wait_split_huge_page(): serialize over i_mmap_mutex too;
- lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
- fix off-by-one in zero_huge_user_segment();
- THP_WRITE_ALLOC/THP_WRITE_FAILED counters;

Kirill A. Shutemov (34):
mm: drop actor argument of do_generic_file_read()
block: implement add_bdi_stat()
mm: implement zero_huge_user_segment and friends
radix-tree: implement preload for multiple contiguous elements
memcg, thp: charge huge cache pages
thp, mm: avoid PageUnevictable on active/inactive lru lists
thp, mm: basic defines for transparent huge page cache
thp, mm: introduce mapping_can_have_hugepages() predicate
thp: represent file thp pages in meminfo and friends
thp, mm: rewrite add_to_page_cache_locked() to support huge pages
mm: trace filemap: dump page order
thp, mm: rewrite delete_from_page_cache() to support huge pages
thp, mm: trigger bug in replace_page_cache_page() on THP
thp, mm: locking tail page is a bug
thp, mm: handle tail pages in page_cache_get_speculative()
thp, mm: add event counters for huge page alloc on write to a file
thp, mm: implement grab_thp_write_begin()
thp, mm: naive support of thp in generic read/write routines
thp, libfs: initial support of thp in simple_read/write_begin/write_end
thp: handle file pages in split_huge_page()
thp: wait_split_huge_page(): serialize over i_mmap_mutex too
thp, mm: truncate support for transparent huge page cache
thp, mm: split huge page on mmap file page
ramfs: enable transparent huge page cache
x86-64, mm: proper alignment mappings with hugepages
mm: add huge_fault() callback to vm_operations_struct
thp: prepare zap_huge_pmd() to uncharge file pages
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
thp, mm: basic huge_fault implementation for generic_file_vm_ops
thp: extract fallback path from do_huge_pmd_anonymous_page() to a function
thp: initial implementation of do_huge_linear_fault()
thp: handle write-protect exception to file-backed huge pages
thp: call __vma_adjust_trans_huge() for file-backed VMA
thp: map file-backed huge pages on fault

arch/x86/kernel/sys_x86_64.c | 12 +-
drivers/base/node.c | 10 +
fs/libfs.c | 48 +++-
fs/proc/meminfo.c | 6 +
fs/ramfs/inode.c | 6 +-
include/linux/backing-dev.h | 10 +
include/linux/huge_mm.h | 36 ++-
include/linux/mm.h | 8 +
include/linux/mmzone.h | 1 +
include/linux/pagemap.h | 33 ++-
include/linux/radix-tree.h | 11 +
include/linux/vm_event_item.h | 2 +
include/trace/events/filemap.h | 7 +-
lib/radix-tree.c | 33 ++-
mm/filemap.c | 298 ++++++++++++++++++++-----
mm/huge_memory.c | 474 +++++++++++++++++++++++++++++++++-------
mm/memcontrol.c | 2 -
mm/memory.c | 41 +++-
mm/mmap.c | 3 +
mm/page_alloc.c | 7 +-
mm/swap.c | 20 +-
mm/truncate.c | 13 ++
mm/vmstat.c | 2 +
23 files changed, 902 insertions(+), 181 deletions(-)

--
1.7.10.4


2013-04-05 11:58:19

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 03/34] mm: implement zero_huge_user_segment and friends

From: "Kirill A. Shutemov" <[email protected]>

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment() and zero_user(), but for huge pages.
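
As an illustration of how these helpers are meant to be used (this mirrors
what simple_write_begin() does later in the series; zero_around_write() is
just a name made up for the sketch), a caller writing [pos, pos + len) into
a not-yet-uptodate huge page can clear everything around the written range:

	static void zero_around_write(struct page *hpage, loff_t pos,
			unsigned len)
	{
		/* offset of the write inside the huge page */
		unsigned from = pos & ~HPAGE_PMD_MASK;

		/* clear [0, from) and [from + len, HPAGE_PMD_SIZE) */
		zero_huge_user_segment(hpage, 0, from);
		zero_huge_user_segment(hpage, from + len, HPAGE_PMD_SIZE);
	}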

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5b7fd4e..09530c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1731,6 +1731,13 @@ extern void dump_page(struct page *page);
extern void clear_huge_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+ unsigned start, unsigned end);
+static inline void zero_huge_user(struct page *page,
+ unsigned start, unsigned len)
+{
+ zero_huge_user_segment(page, start, start + len);
+}
extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..9da540f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4213,6 +4213,42 @@ void clear_huge_page(struct page *page,
}
}

+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+ int i;
+ unsigned start_idx, end_idx;
+ unsigned start_off, end_off;
+
+ BUG_ON(end < start);
+
+ might_sleep();
+
+ if (start == end)
+ return;
+
+ start_idx = start >> PAGE_SHIFT;
+ start_off = start & ~PAGE_MASK;
+ end_idx = (end - 1) >> PAGE_SHIFT;
+ end_off = ((end - 1) & ~PAGE_MASK) + 1;
+
+ /*
+ * if start and end are on the same small page we can call
+ * zero_user_segment() once and save one kmap_atomic().
+ */
+ if (start_idx == end_idx)
+ return zero_user_segment(page + start_idx, start_off, end_off);
+
+ /* zero the first (possibly partial) page */
+ zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
+ for (i = start_idx + 1; i < end_idx; i++) {
+ cond_resched();
+ clear_highpage(page + i);
+ flush_dcache_page(page + i);
+ }
+ /* zero the last (possibly partial) page */
+ zero_user_segment(page + end_idx, 0, end_off);
+}
+
static void copy_user_gigantic_page(struct page *dst, struct page *src,
unsigned long addr,
struct vm_area_struct *vma,
--
1.7.10.4

2013-04-05 11:58:20

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 02/34] block: implement add_bdi_stat()

From: "Kirill A. Shutemov" <[email protected]>

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts bdi stats by an arbitrary
amount. It's required for batched page cache manipulations.
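
Purely as an illustration (account_huge_page_dirtied() and the choice of
BDI_RECLAIMABLE are made up for this sketch), the batched form replaces
HPAGE_CACHE_NR separate inc_bdi_stat() calls with a single one:

	static void account_huge_page_dirtied(struct address_space *mapping)
	{
		/* account all HPAGE_CACHE_NR subpages of a huge page at once */
		add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE,
				HPAGE_CACHE_NR);
	}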

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/backing-dev.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..b05d961 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
__add_bdi_stat(bdi, item, -1);
}

+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, s64 amount)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __add_bdi_stat(bdi, item, amount);
+ local_irq_restore(flags);
+}
+
static inline void dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
{
--
1.7.10.4

2013-04-05 11:58:31

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 24/34] ramfs: enable transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

ramfs is the simplest fs from the page cache point of view. Let's start
enabling transparent huge page cache here.

For now we allocate only non-movable huge pages: ramfs pages cannot be
moved yet.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/ramfs/inode.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index c24f1e1..54d69c7 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ /*
+ * TODO: make ramfs pages movable
+ */
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_TRANSHUGE & ~__GFP_MOVABLE);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
--
1.7.10.4

2013-04-05 11:58:30

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 17/34] thp, mm: implement grab_thp_write_begin()

From: "Kirill A. Shutemov" <[email protected]>

The function is a twin of grab_cache_page_write_begin(), but it tries to
allocate a huge page at the given position, aligned to HPAGE_CACHE_NR.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. The caller should take care of falling back to
small pages.
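
The expected caller pattern is roughly the following sketch (it matches
what simple_write_begin() does in the libfs patch later in the series):

	page = NULL;
	if (mapping_can_have_hugepages(mapping))
		page = grab_thp_write_begin(mapping,
				index & ~HPAGE_CACHE_INDEX_MASK, flags);
	/* fall back to a small page */
	if (!page)
		page = grab_cache_page_write_begin(mapping, index, flags);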

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 10 ++++++
mm/filemap.c | 89 +++++++++++++++++++++++++++++++++++++++--------
2 files changed, 85 insertions(+), 14 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index bd07fc1..5a7dda9 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -271,6 +271,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,

struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct page *grab_thp_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags);
+#else
+static inline struct page *grab_thp_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ return NULL;
+}
+#endif

/*
* Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b4736c..bcb679c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2290,16 +2290,17 @@ out:
EXPORT_SYMBOL(generic_file_direct_write);

/*
- * Find or create a page at the given pagecache position. Return the locked
- * page. This function is specifically for buffered writes.
+ * Returns true if the page was found in page cache and
+ * false if it had to allocate a new page.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+static bool __grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags, unsigned int order,
+ struct page **page)
{
int status;
gfp_t gfp_mask;
- struct page *page;
gfp_t gfp_notmask = 0;
+ int found = true;

gfp_mask = mapping_gfp_mask(mapping);
if (mapping_cap_account_dirty(mapping))
@@ -2307,27 +2308,87 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
if (flags & AOP_FLAG_NOFS)
gfp_notmask = __GFP_FS;
repeat:
- page = find_lock_page(mapping, index);
- if (page)
+ *page = find_lock_page(mapping, index);
+ if (*page)
goto found;

- page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
- if (!page)
- return NULL;
- status = add_to_page_cache_lru(page, mapping, index,
+ found = false;
+ if (order)
+ *page = alloc_pages(gfp_mask & ~gfp_notmask, order);
+ else
+ *page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+ if (!*page)
+ return false;
+ status = add_to_page_cache_lru(*page, mapping, index,
GFP_KERNEL & ~gfp_notmask);
if (unlikely(status)) {
- page_cache_release(page);
+ page_cache_release(*page);
if (status == -EEXIST)
goto repeat;
- return NULL;
+ *page = NULL;
+ return false;
}
found:
- wait_for_stable_page(page);
+ wait_for_stable_page(*page);
+ return found;
+}
+
+/*
+ * Find or create a page at the given pagecache position. Return the locked
+ * page. This function is specifically for buffered writes.
+ */
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *page;
+ __grab_cache_page_write_begin(mapping, index, flags, 0, &page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Find or create a huge page at the given pagecache position, aligned to
+ * HPAGE_CACHE_NR. Return the locked huge page.
+ *
+ * If, for some reason, it's not possible to allocate a huge page at this
+ * position, it returns NULL. Caller should take care of fallback to small
+ * pages.
+ *
+ * This function is specifically for buffered writes.
+ */
+struct page *grab_thp_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ gfp_t gfp_mask;
+ struct page *page;
+ bool found;
+
+ BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+ gfp_mask = mapping_gfp_mask(mapping);
+ BUG_ON(!(gfp_mask & __GFP_COMP));
+
+ found = __grab_cache_page_write_begin(mapping, index, flags,
+ HPAGE_PMD_ORDER, &page);
+ if (!page) {
+ if (!found)
+ count_vm_event(THP_WRITE_ALLOC_FAILED);
+ return NULL;
+ }
+
+ if (!found)
+ count_vm_event(THP_WRITE_ALLOC);
+
+ if (!PageTransHuge(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ return NULL;
+ }
+
+ return page;
+}
+#endif
+
static ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
{
--
1.7.10.4

2013-04-05 11:58:50

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 25/34] x86-64, mm: proper alignment mappings with hugepages

From: "Kirill A. Shutemov" <[email protected]>

Make arch_get_unmapped_area() return an unmapped area aligned to a huge
page boundary (HPAGE_SIZE) if the file mapping can have huge pages.
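
With 4K base pages and 2M huge pages, PAGE_MASK & ~HPAGE_MASK == 0x1ff000,
i.e. the alignment mask covers the address bits between PAGE_SHIFT and
HPAGE_SHIFT. vm_unmapped_area() then picks an address that is congruent to
the file offset modulo HPAGE_SIZE, which is exactly what is needed to map a
page cache huge page with a single PMD.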

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/sys_x86_64.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index dbded5a..d97ab40 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -15,6 +15,7 @@
#include <linux/random.h>
#include <linux/uaccess.h>
#include <linux/elf.h>
+#include <linux/pagemap.h>

#include <asm/ia32.h>
#include <asm/syscalls.h>
@@ -34,6 +35,13 @@ static unsigned long get_align_mask(void)
return va_align.mask;
}

+static inline unsigned long mapping_align_mask(struct address_space *mapping)
+{
+ if (mapping_can_have_hugepages(mapping))
+ return PAGE_MASK & ~HPAGE_MASK;
+ return get_align_mask();
+}
+
unsigned long align_vdso_addr(unsigned long addr)
{
unsigned long align_mask = get_align_mask();
@@ -135,7 +143,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.length = len;
info.low_limit = begin;
info.high_limit = end;
- info.align_mask = filp ? get_align_mask() : 0;
+ info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
info.align_offset = pgoff << PAGE_SHIFT;
return vm_unmapped_area(&info);
}
@@ -174,7 +182,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = mm->mmap_base;
- info.align_mask = filp ? get_align_mask() : 0;
+ info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
info.align_offset = pgoff << PAGE_SHIFT;
addr = vm_unmapped_area(&info);
if (!(addr & ~PAGE_MASK))
--
1.7.10.4

2013-04-05 11:59:18

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 16/34] thp, mm: add event counters for huge page alloc on write to a file

From: "Kirill A. Shutemov" <[email protected]>

Existing stats specify the source of a THP page: fault or collapse. We're
going to allocate a new huge page on write(2). It's neither a fault nor a
collapse.

Let's introduce new events for that.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
mm/vmstat.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d4b7a18..584c71c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_WRITE_ALLOC,
+ THP_WRITE_ALLOC_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 292b1cf..dd8323a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -818,6 +818,8 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_write_alloc",
+ "thp_write_alloc_failed",
"thp_split",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
--
1.7.10.4

2013-04-05 11:59:16

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 20/34] thp: handle file pages in split_huge_page()

From: "Kirill A. Shutemov" <[email protected]>

The basic scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

__split_huge_page_refcount() has been tuned a bit: we need to transfer
PG_swapbacked to tail pages.

Splitting of mapped pages hasn't been tested at all, since we cannot mmap()
file-backed huge pages yet.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 71 +++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 59 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 46a44ac..ac0dc80 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1637,24 +1637,25 @@ static void __split_huge_page_refcount(struct page *page)
*/
page_tail->_mapcount = page->_mapcount;

- BUG_ON(page_tail->mapping);
page_tail->mapping = page->mapping;
-
page_tail->index = page->index + i;
page_nid_xchg_last(page_tail, page_nid_last(page));

- BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
BUG_ON(!PageDirty(page_tail));
- BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec);
}
atomic_sub(tail_count, &page->_count);
BUG_ON(atomic_read(&page->_count) <= 0);

- __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
- __mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+ if (PageAnon(page)) {
+ __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+ __mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);
+ } else {
+ __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);
+ __mod_zone_page_state(zone, NR_FILE_PAGES, HPAGE_PMD_NR);
+ }

ClearPageCompound(page);
compound_unlock(page);
@@ -1754,7 +1755,7 @@ static int __split_huge_page_map(struct page *page,
}

/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
struct anon_vma *anon_vma)
{
int mapcount, mapcount2;
@@ -1801,14 +1802,11 @@ static void __split_huge_page(struct page *page,
BUG_ON(mapcount != mapcount2);
}

-int split_huge_page(struct page *page)
+static int split_anon_huge_page(struct page *page)
{
struct anon_vma *anon_vma;
int ret = 1;

- BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
- BUG_ON(!PageAnon(page));
-
/*
* The caller does not necessarily hold an mmap_sem that would prevent
* the anon_vma disappearing so we first we take a reference to it
@@ -1826,7 +1824,7 @@ int split_huge_page(struct page *page)
goto out_unlock;

BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma);
+ __split_anon_huge_page(page, anon_vma);
count_vm_event(THP_SPLIT);

BUG_ON(PageCompound(page));
@@ -1837,6 +1835,55 @@ out:
return ret;
}

+static int split_file_huge_page(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ int mapcount, mapcount2;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ mapcount = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+
+ if (mapcount != page_mapcount(page))
+ printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+ mapcount, page_mapcount(page));
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page);
+
+ mapcount2 = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+
+ if (mapcount != mapcount2)
+ printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+ mapcount, mapcount2, page_mapcount(page));
+ BUG_ON(mapcount != mapcount2);
+ count_vm_event(THP_SPLIT);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return 0;
+}
+
+int split_huge_page(struct page *page)
+{
+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
+
+ if (PageAnon(page))
+ return split_anon_huge_page(page);
+ else
+ return split_file_huge_page(page);
+}
+
#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
--
1.7.10.4

2013-04-05 11:59:14

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 21/34] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

From: "Kirill A. Shutemov" <[email protected]>

Since we're going to have huge pages backed by files,
wait_split_huge_page() has to serialize not only over anon_vma_lock,
but over i_mmap_mutex too.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 15 ++++++++++++---
mm/huge_memory.c | 4 ++--
mm/memory.c | 4 ++--
3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a54939c..b53e295 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
__split_huge_page_pmd(__vma, __address, \
____pmd); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
+#define wait_split_huge_page(__vma, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
+ struct address_space *__mapping = \
+ vma->vm_file->f_mapping; \
+ struct anon_vma *__anon_vma = (__vma)->anon_vma; \
+ if (__mapping) \
+ mutex_lock(&__mapping->i_mmap_mutex); \
+ if (__anon_vma) { \
+ anon_vma_lock_write(__anon_vma); \
+ anon_vma_unlock_write(__anon_vma); \
+ } \
+ if (__mapping) \
+ mutex_unlock(&__mapping->i_mmap_mutex); \
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ac0dc80..7c48f58 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -907,7 +907,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
spin_unlock(&dst_mm->page_table_lock);
pte_free(dst_mm, pgtable);

- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ wait_split_huge_page(vma, src_pmd); /* src_vma */
goto out;
}
src_page = pmd_page(pmd);
@@ -1480,7 +1480,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&vma->vm_mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return -1;
} else {
/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 9da540f..2895f0e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -619,7 +619,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return 0;
}

@@ -1529,7 +1529,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
pmd, flags);
--
1.7.10.4

2013-04-05 11:58:29

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 07/34] thp, mm: basic defines for transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee1c244..a54939c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
#define HPAGE_PMD_MASK HPAGE_MASK
#define HPAGE_PMD_SIZE HPAGE_SIZE

+#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

#define transparent_hugepage_enabled(__vma) \
@@ -181,6 +185,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

+#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
#define hpage_nr_pages(x) 1

#define transparent_hugepage_enabled(__vma) 0
--
1.7.10.4

2013-04-05 12:00:10

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

From: "Kirill A. Shutemov" <[email protected]>

The function tries to create a new page mapping using huge pages. It is
only called for not-yet-mapped pages.

As usual in THP, we fall back to small pages if we fail to allocate a huge
page.
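
In rough outline (see the code below):
- fall back to do_fallback() if vm_start/vm_pgoff are not suitably aligned
  or the PMD-sized range doesn't fit into the VMA;
- for a private write fault, allocate and charge the COW huge page before
  taking the page lock, to reduce lock hold time;
- call ->huge_fault() to get the file's huge page, lock it and, for shared
  writable mappings, run ->page_mkwrite();
- install the huge PMD under page_table_lock and add the page to the anon
  or file rmap, or back out if somebody else already populated the pmd.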

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 3 +
mm/huge_memory.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 199 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b53e295..aa52c48 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,9 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags);
+extern int do_huge_linear_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address, pmd_t *pmd,
+ unsigned int flags);
extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c1d5f2b..ed4389b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -21,6 +21,7 @@
#include <linux/pagemap.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
+#include <linux/writeback.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -864,6 +865,201 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, unsigned int flags)
+{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *cow_page, *page, *dirty_page = NULL;
+ bool anon = false, fallback = false, page_mkwrite = false;
+ pgtable_t pgtable = NULL;
+ struct vm_fault vmf;
+ int ret;
+
+ /* Fallback if vm_pgoff and vm_start are not suitable */
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+ return do_fallback(mm, vma, address, pmd, flags);
+
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return do_fallback(mm, vma, address, pmd, flags);
+
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+
+ /*
+ * If we do COW later, allocate page before taking lock_page()
+ * on the file cache page. This will reduce lock holding time.
+ */
+ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (!cow_page) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
+ page_cache_release(cow_page);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ } else
+ cow_page = NULL;
+
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable)) {
+ ret = VM_FAULT_OOM;
+ goto uncharge_out;
+ }
+
+ vmf.virtual_address = (void __user *)haddr;
+ vmf.pgoff = ((haddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ vmf.flags = flags;
+ vmf.page = NULL;
+
+ ret = vma->vm_ops->huge_fault(vma, &vmf);
+ if (unlikely(ret & VM_FAULT_OOM))
+ goto uncharge_out_fallback;
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ goto uncharge_out;
+
+ if (unlikely(PageHWPoison(vmf.page))) {
+ if (ret & VM_FAULT_LOCKED)
+ unlock_page(vmf.page);
+ ret = VM_FAULT_HWPOISON;
+ goto uncharge_out;
+ }
+
+ /*
+ * For consistency in subsequent calls, make the faulted page always
+ * locked.
+ */
+ if (unlikely(!(ret & VM_FAULT_LOCKED)))
+ lock_page(vmf.page);
+ else
+ VM_BUG_ON(!PageLocked(vmf.page));
+
+ /*
+ * Should we do an early C-O-W break?
+ */
+ page = vmf.page;
+ if (flags & FAULT_FLAG_WRITE) {
+ if (!(vma->vm_flags & VM_SHARED)) {
+ page = cow_page;
+ anon = true;
+ copy_user_huge_page(page, vmf.page, haddr, vma,
+ HPAGE_PMD_NR);
+ __SetPageUptodate(page);
+ } else if (vma->vm_ops->page_mkwrite) {
+ int tmp;
+
+ unlock_page(page);
+ vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+ tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+ if (unlikely(tmp &
+ (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+ ret = tmp;
+ goto unwritable_page;
+ }
+ if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+ lock_page(page);
+ if (!page->mapping) {
+ ret = 0; /* retry the fault */
+ unlock_page(page);
+ goto unwritable_page;
+ }
+ } else
+ VM_BUG_ON(!PageLocked(page));
+ page_mkwrite = true;
+ }
+ }
+
+ VM_BUG_ON(!PageCompound(page));
+
+ spin_lock(&mm->page_table_lock);
+ if (likely(pmd_none(*pmd))) {
+ pmd_t entry;
+
+ flush_icache_page(vma, page);
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ if (flags & FAULT_FLAG_WRITE)
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (anon) {
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ page_add_new_anon_rmap(page, vma, haddr);
+ } else {
+ add_mm_counter(mm, MM_FILEPAGES, HPAGE_PMD_NR);
+ page_add_file_rmap(page);
+ if (flags & FAULT_FLAG_WRITE) {
+ dirty_page = page;
+ get_page(dirty_page);
+ }
+ }
+ set_pmd_at(mm, haddr, pmd, entry);
+ pgtable_trans_huge_deposit(mm, pgtable);
+ mm->nr_ptes++;
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache_pmd(vma, address, pmd);
+ } else {
+ if (cow_page)
+ mem_cgroup_uncharge_page(cow_page);
+ if (anon)
+ page_cache_release(page);
+ else
+ anon = true; /* no anon but release faulted_page */
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ if (dirty_page) {
+ struct address_space *mapping = page->mapping;
+ bool dirtied = false;
+
+ if (set_page_dirty(dirty_page))
+ dirtied = true;
+ unlock_page(dirty_page);
+ put_page(dirty_page);
+ if ((dirtied || page_mkwrite) && mapping) {
+ /*
+ * Some device drivers do not set page.mapping but still
+ * dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+
+ /* file_update_time outside page_lock */
+ if (vma->vm_file && !page_mkwrite)
+ file_update_time(vma->vm_file);
+ } else {
+ unlock_page(vmf.page);
+ if (anon)
+ page_cache_release(vmf.page);
+ }
+
+ return ret;
+
+unwritable_page:
+ pte_free(mm, pgtable);
+ page_cache_release(page);
+ return ret;
+uncharge_out_fallback:
+ fallback = true;
+uncharge_out:
+ if (pgtable)
+ pte_free(mm, pgtable);
+ if (cow_page) {
+ mem_cgroup_uncharge_page(cow_page);
+ page_cache_release(cow_page);
+ }
+
+ if (fallback)
+ return do_fallback(mm, vma, address, pmd, flags);
+ else
+ return ret;
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
--
1.7.10.4

2013-04-05 12:00:09

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 15/34] thp, mm: handle tail pages in page_cache_get_speculative()

From: "Kirill A. Shutemov" <[email protected]>

For a tail page we call __get_page_tail(). It has the same semantics, but
for a tail page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 56debde..bd07fc1 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -160,6 +160,9 @@ static inline int page_cache_get_speculative(struct page *page)
{
VM_BUG_ON(in_interrupt());

+ if (unlikely(PageTail(page)))
+ return __get_page_tail(page);
+
#ifdef CONFIG_TINY_RCU
# ifdef CONFIG_PREEMPT_COUNT
VM_BUG_ON(!in_atomic());
@@ -186,7 +189,6 @@ static inline int page_cache_get_speculative(struct page *page)
return 0;
}
#endif
- VM_BUG_ON(PageTail(page));

return 1;
}
--
1.7.10.4

2013-04-05 12:00:57

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 32/34] thp: handle write-protect exception to file-backed huge pages

From: "Kirill A. Shutemov" <[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ed4389b..6dde87f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1339,7 +1339,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

- VM_BUG_ON(!vma->anon_vma);
haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
@@ -1349,7 +1348,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- if (page_mapcount(page) == 1) {
+ if (PageAnon(page) && page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1357,10 +1356,72 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
update_mmu_cache_pmd(vma, address, pmd);
ret |= VM_FAULT_WRITE;
goto out_unlock;
+ } else if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+ (VM_WRITE|VM_SHARED)) {
+ struct vm_fault vmf;
+ pmd_t entry;
+ struct address_space *mapping;
+
+ /* not yet implemented */
+ VM_BUG_ON(!vma->vm_ops || !vma->vm_ops->page_mkwrite);
+
+ vmf.virtual_address = (void __user *)haddr;
+ vmf.pgoff = page->index;
+ vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+ vmf.page = page;
+
+ page_cache_get(page);
+ spin_unlock(&mm->page_table_lock);
+
+ ret = vma->vm_ops->page_mkwrite(vma, &vmf);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+ page_cache_release(page);
+ goto out;
+ }
+ if (unlikely(!(ret & VM_FAULT_LOCKED))) {
+ lock_page(page);
+ if (!page->mapping) {
+ ret = 0; /* retry */
+ unlock_page(page);
+ page_cache_release(page);
+ goto out;
+ }
+ } else
+ VM_BUG_ON(!PageLocked(page));
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto out_unlock;
+ }
+
+ flush_cache_page(vma, address, pmd_pfn(orig_pmd));
+ entry = pmd_mkyoung(orig_pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache_pmd(vma, address, pmd);
+ ret = VM_FAULT_WRITE;
+ spin_unlock(&mm->page_table_lock);
+
+ mapping = page->mapping;
+ set_page_dirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+ if (mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+ return ret;
}
get_page(page);
spin_unlock(&mm->page_table_lock);
alloc:
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1424,6 +1485,10 @@ alloc:
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
put_huge_zero_page();
} else {
+ if (!PageAnon(page)) {
+ add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ }
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
put_page(page);
--
1.7.10.4

2013-04-05 11:58:26

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 08/34] thp, mm: introduce mapping_can_have_hugepages() predicate

From: "Kirill A. Shutemov" <[email protected]>

Returns true if the mapping can have huge pages. For now we just check for
__GFP_COMP in the mapping's gfp mask.
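
A filesystem opts in by setting a mapping gfp mask that keeps __GFP_COMP;
for example, the ramfs patch later in the series does:

	/* ramfs pages are not movable yet, hence ~__GFP_MOVABLE */
	mapping_set_gfp_mask(inode->i_mapping,
			GFP_TRANSHUGE & ~__GFP_MOVABLE);

After that, mapping_can_have_hugepages(inode->i_mapping) returns true.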

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..56debde 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,17 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
(__force unsigned long)mask;
}

+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ gfp_t gfp_mask = mapping_gfp_mask(m);
+ /* __GFP_COMP is key part of GFP_TRANSHUGE */
+ return !!(gfp_mask & __GFP_COMP);
+ }
+
+ return false;
+}
+
/*
* The page cache can done in larger chunks than
* one page, because it allows for more efficient
--
1.7.10.4

2013-04-05 12:01:24

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 34/34] thp: map file-backed huge pages on fault

From: "Kirill A. Shutemov" <[email protected]>

Looks like all the pieces are in place: we can map file-backed huge pages
now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 4 +++-
mm/memory.c | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c6e3aef..c175c78 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,7 +80,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
((__vma)->vm_flags & VM_HUGEPAGE))) && \
!((__vma)->vm_flags & VM_NOHUGEPAGE) && \
- !is_vma_temporary_stack(__vma))
+ !is_vma_temporary_stack(__vma) && \
+ (!(__vma)->vm_ops || \
+ mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
#define transparent_hugepage_defrag(__vma) \
((transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
diff --git a/mm/memory.c b/mm/memory.c
index 2895f0e..e40965f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3738,6 +3738,7 @@ retry:
if (!vma->vm_ops)
return do_huge_pmd_anonymous_page(mm, vma, address,
pmd, flags);
+ return do_huge_linear_fault(mm, vma, address, pmd, flags);
} else {
pmd_t orig_pmd = *pmd;
int ret;
--
1.7.10.4

2013-04-05 12:01:22

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 33/34] thp: call __vma_adjust_trans_huge() for file-backed VMA

From: "Kirill A. Shutemov" <[email protected]>

Since we're going to have huge pages in the page cache, we need to call
__vma_adjust_trans_huge() for file-backed VMAs, which can potentially
contain huge pages.

For now we call it for all VMAs with vm_ops->huge_fault defined.

We will probably need to introduce a flag later to indicate that the VMA
has huge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aa52c48..c6e3aef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -161,9 +161,9 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long end,
long adjust_next)
{
- if (!vma->anon_vma || vma->vm_ops)
- return;
- __vma_adjust_trans_huge(vma, start, end, adjust_next);
+ if ((vma->anon_vma && !vma->vm_ops) ||
+ (vma->vm_ops && vma->vm_ops->huge_fault))
+ __vma_adjust_trans_huge(vma, start, end, adjust_next);
}
static inline int hpage_nr_pages(struct page *page)
{
--
1.7.10.4

2013-04-05 12:01:20

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 28/34] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

From: "Kirill A. Shutemov" <[email protected]>

It's confusing that mk_huge_pmd() has semantics different from mk_pte()
or mk_pmd().

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
prototype to match mk_pte().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4a1d8d7..0cf2e79 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -691,11 +691,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
{
pmd_t entry;
- entry = mk_pmd(page, vma->vm_page_prot);
- entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = mk_pmd(page, prot);
entry = pmd_mkhuge(entry);
return entry;
}
@@ -723,7 +722,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
- entry = mk_huge_pmd(page, vma);
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
/*
* The spinlocking to take the lru_lock inside
* page_add_new_anon_rmap() acts as a full memory
@@ -1212,7 +1212,8 @@ alloc:
goto out_mn;
} else {
pmd_t entry;
- entry = mk_huge_pmd(new_page, vma);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_clear_flush(vma, haddr, pmd);
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
@@ -2386,7 +2387,8 @@ static void collapse_huge_page(struct mm_struct *mm,
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);

- _pmd = mk_huge_pmd(new_page, vma);
+ _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);

/*
* spin_lock() below is not the equivalent of smp_wmb(), so
--
1.7.10.4

2013-04-05 12:01:18

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 30/34] thp: extract fallback path from do_huge_pmd_anonymous_page() to a function

From: "Kirill A. Shutemov" <[email protected]>

The same fallback path will be reused by non-anonymous pages, so let's
extract it into a separate function.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 112 ++++++++++++++++++++++++++++--------------------------
1 file changed, 59 insertions(+), 53 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0cf2e79..c1d5f2b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -779,64 +779,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
return true;
}

-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_fallback(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
{
- struct page *page;
- unsigned long haddr = address & HPAGE_PMD_MASK;
pte_t *pte;

- if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
- if (unlikely(anon_vma_prepare(vma)))
- return VM_FAULT_OOM;
- if (unlikely(khugepaged_enter(vma)))
- return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
- transparent_hugepage_use_zero_page()) {
- pgtable_t pgtable;
- unsigned long zero_pfn;
- bool set;
- pgtable = pte_alloc_one(mm, haddr);
- if (unlikely(!pgtable))
- return VM_FAULT_OOM;
- zero_pfn = get_huge_zero_page();
- if (unlikely(!zero_pfn)) {
- pte_free(mm, pgtable);
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
- spin_lock(&mm->page_table_lock);
- set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
- zero_pfn);
- spin_unlock(&mm->page_table_lock);
- if (!set) {
- pte_free(mm, pgtable);
- put_huge_zero_page();
- }
- return 0;
- }
- page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
- if (unlikely(!page)) {
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
- count_vm_event(THP_FAULT_ALLOC);
- if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
- put_page(page);
- goto out;
- }
- if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
- mem_cgroup_uncharge_page(page);
- put_page(page);
- goto out;
- }
-
- return 0;
- }
-out:
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
@@ -858,6 +806,64 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return do_fallback(mm, vma, address, pmd, flags);
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE) &&
+ transparent_hugepage_use_zero_page()) {
+ pgtable_t pgtable;
+ unsigned long zero_pfn;
+ bool set;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
+ return VM_FAULT_OOM;
+ zero_pfn = get_huge_zero_page();
+ if (unlikely(!zero_pfn)) {
+ pte_free(mm, pgtable);
+ count_vm_event(THP_FAULT_FALLBACK);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ spin_lock(&mm->page_table_lock);
+ set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_pfn);
+ spin_unlock(&mm->page_table_lock);
+ if (!set) {
+ pte_free(mm, pgtable);
+ put_huge_zero_page();
+ }
+ return 0;
+ }
+ page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (unlikely(!page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+ put_page(page);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
+ page))) {
+ mem_cgroup_uncharge_page(page);
+ put_page(page);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+
+ return 0;
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
--
1.7.10.4

2013-04-05 12:02:55

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 29/34] thp, mm: basic huge_fault implementation for generic_file_vm_ops

From: "Kirill A. Shutemov" <[email protected]>

It provides enough functionality for simple cases like ramfs. It needs to
be extended later.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 76 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6f0e3be..a170a40 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1768,6 +1768,81 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ pgoff_t size, offset = vmf->pgoff;
+ unsigned long address = (unsigned long) vmf->virtual_address;
+ struct page *page;
+ int ret = 0;
+
+ BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (offset & HPAGE_CACHE_INDEX_MASK));
+
+retry:
+ page = find_get_page(mapping, offset);
+ if (!page) {
+ gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
+ page = alloc_pages_vma(gfp_mask, HPAGE_PMD_ORDER,
+ vma, address, 0);
+ if (!page) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ return VM_FAULT_OOM;
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
+ if (ret == 0)
+ ret = mapping->a_ops->readpage(file, page);
+ else if (ret == -EEXIST)
+ ret = 0; /* losing race to add is OK */
+ page_cache_release(page);
+ if (!ret || ret == AOP_TRUNCATED_PAGE)
+ goto retry;
+ return ret;
+ }
+
+ if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+ page_cache_release(page);
+ return ret | VM_FAULT_RETRY;
+ }
+
+ /* Did it get truncated? */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ put_page(page);
+ goto retry;
+ }
+ VM_BUG_ON(page->index != offset);
+ VM_BUG_ON(!PageUptodate(page));
+
+ if (!PageTransHuge(page)) {
+ unlock_page(page);
+ put_page(page);
+ /* Ask fallback to small pages */
+ return VM_FAULT_OOM;
+ }
+
+ /*
+ * Found the page and have a reference on it.
+ * We must recheck i_size under page lock.
+ */
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ if (unlikely(offset >= size)) {
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = page;
+ return ret | VM_FAULT_LOCKED;
+}
+#else
+#define filemap_huge_fault NULL
+#endif
+
int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct page *page = vmf->page;
@@ -1797,6 +1872,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);

const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .huge_fault = filemap_huge_fault,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
--
1.7.10.4

2013-04-05 12:03:21

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 27/34] thp: prepare zap_huge_pmd() to uncharge file pages

From: "Kirill A. Shutemov" <[email protected]>

Uncharge pages from the correct counter.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7c48f58..4a1d8d7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(&tlb->mm->page_table_lock);
put_huge_zero_page();
} else {
+ int member;
page = pmd_page(orig_pmd);
page_remove_rmap(page);
VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
+ add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);
VM_BUG_ON(!PageHead(page));
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
--
1.7.10.4

2013-04-05 12:03:19

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 26/34] mm: add huge_fault() callback to vm_operations_struct

From: "Kirill A. Shutemov" <[email protected]>

huge_fault() should try to set up a huge page for the pgoff, if possible.
A VM_FAULT_OOM return code means we need to fall back to small pages.
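
Wiring the callback up looks like this sketch (the generic_file_vm_ops
change later in the series does exactly this; my_fs_vm_ops is only an
example name):

	static const struct vm_operations_struct my_fs_vm_ops = {
		.fault		= filemap_fault,
		.huge_fault	= filemap_huge_fault,
		.page_mkwrite	= filemap_page_mkwrite,
	};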

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 09530c7..d978de8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -195,6 +195,7 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ int (*huge_fault)(struct vm_area_struct *vma, struct vm_fault *vmf);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
--
1.7.10.4

2013-04-05 12:03:52

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 22/34] thp, mm: truncate support for transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

If the starting position of the truncation is in a tail page, we have to
split the huge page first.

We also have to split if the end is within the huge page. Otherwise we can
truncate the whole huge page at once.
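
For example, with 2M huge pages, truncating a 6M file down to 3M: the new
EOF falls inside the huge page covering [2M, 4M), so that page is split and
its subpages beyond 3M are dropped, while the huge page covering [4M, 6M)
is truncated as a whole without splitting.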

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/truncate.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..0152feb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index > end)
break;

+ /* split page if we start from tail page */
+ if (PageTransTail(page))
+ split_huge_page(compound_trans_head(page));
+ if (PageTransHuge(page)) {
+ /* split if end is within huge page */
+ if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+ split_huge_page(page);
+ else
+ /* skip tail pages */
+ i += HPAGE_CACHE_NR - 1;
+ }
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -280,6 +291,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index > end)
break;

+ if (PageTransHuge(page))
+ split_huge_page(page);
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
--
1.7.10.4

2013-04-05 12:03:50

by Kirill A. Shutemov

Subject: [PATCHv3, RFC 19/34] thp, libfs: initial support of thp in simple_read/write_begin/write_end

From: "Kirill A. Shutemov" <[email protected]>

For now we try to grab a huge cache page if the gfp_mask has __GFP_COMP.
That's probably too weak a condition and needs to be reworked later.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/libfs.c | 48 ++++++++++++++++++++++++++++++++++++-----------
include/linux/pagemap.h | 8 ++++++++
2 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 916da8c..6e5286d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -383,7 +383,7 @@ EXPORT_SYMBOL(simple_setattr);

int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
+ clear_pagecache_page(page);
flush_dcache_page(page);
SetPageUptodate(page);
unlock_page(page);
@@ -394,21 +394,42 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
- struct page *page;
+ struct page *page = NULL;
pgoff_t index;

index = pos >> PAGE_CACHE_SHIFT;

- page = grab_cache_page_write_begin(mapping, index, flags);
+ /* XXX: too weak condition. Good enough for initial testing */
+ if (mapping_can_have_hugepages(mapping)) {
+ page = grab_thp_write_begin(mapping,
+ index & ~HPAGE_CACHE_INDEX_MASK, flags);
+ /* fallback to small page */
+ if (!page || !PageTransHuge(page)) {
+ unsigned long offset;
+ offset = pos & ~PAGE_CACHE_MASK;
+ len = min_t(unsigned long,
+ len, PAGE_CACHE_SIZE - offset);
+ }
+ }
+ if (!page)
+ page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
-
*pagep = page;

- if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
+ if (!PageUptodate(page)) {
+ unsigned from;
+
+ if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user_segment(page, 0, from);
+ zero_huge_user_segment(page,
+ from + len, HPAGE_PMD_SIZE);
+ } else if (len != PAGE_CACHE_SIZE) {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user_segments(page, 0, from,
+ from + len, PAGE_CACHE_SIZE);
+ }
}
return 0;
}
@@ -443,9 +464,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,

/* zero the stale part of the page if we did a short copy */
if (copied < len) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user(page, from + copied, len - copied);
+ unsigned from;
+ if (PageTransHuge(page)) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user(page, from + copied, len - copied);
+ } else {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user(page, from + copied, len - copied);
+ }
}

if (!PageUptodate(page))
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 5a7dda9..c64d19c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -581,4 +581,12 @@ static inline int add_to_page_cache(struct page *page,
return error;
}

+static inline void clear_pagecache_page(struct page *page)
+{
+ if (PageTransHuge(page))
+ zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+ else
+ clear_highpage(page);
+}
+
#endif /* _LINUX_PAGEMAP_H */
--
1.7.10.4

2013-04-05 12:04:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 23/34] thp, mm: split huge page on mmap file page

From: "Kirill A. Shutemov" <[email protected]>

We are not ready to mmap file-backed transparent huge pages. Let's split
them on a fault attempt.

Later in the patchset we'll implement mmap() properly and this code path
will be used for fallback cases.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3296f5c..6f0e3be 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1683,6 +1683,8 @@ retry_find:
goto no_cached_page;
}

+ if (PageTransCompound(page))
+ split_huge_page(compound_trans_head(page));
if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
page_cache_release(page);
return ret | VM_FAULT_RETRY;
--
1.7.10.4

2013-04-05 12:04:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 18/34] thp, mm: naive support of thp in generic read/write routines

From: "Kirill A. Shutemov" <[email protected]>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with a backing store.
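
To illustrate the bookkeeping in the write path below (an explanatory
sketch only, not part of the patch): with a compound page, the offset
relative to the huge page selects both the 4k subpage and the offset
within it:

	unsigned long offset = pos & ~HPAGE_PMD_MASK;	/* offset in the 2M page */
	struct page *subpage = page + (offset >> PAGE_CACHE_SHIFT); /* 4k subpage */
	unsigned long off = offset & ~PAGE_CACHE_MASK;	/* offset in that subpage */

which is what the iov_iter_copy_from_user_atomic() call in the hunk uses.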

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bcb679c..3296f5c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1161,6 +1161,16 @@ find_page:
if (unlikely(page == NULL))
goto no_cached_page;
}
+ if (PageTransCompound(page)) {
+ struct page *head = compound_trans_head(page);
+ /*
+ * We don't yet support huge pages in page cache
+ * for filesystems with backing device, so pages
+ * should always be up-to-date.
+ */
+ BUG_ON(PageReadahead(head) || !PageUptodate(head));
+ goto page_ok;
+ }
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
@@ -2439,8 +2449,13 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);

+ if (PageTransHuge(page))
+ offset = pos & ~HPAGE_PMD_MASK;
+
pagefault_disable();
- copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+ copied = iov_iter_copy_from_user_atomic(
+ page + (offset >> PAGE_CACHE_SHIFT),
+ i, offset & ~PAGE_CACHE_MASK, bytes);
pagefault_enable();
flush_dcache_page(page);

@@ -2463,6 +2478,7 @@ again:
* because not all segments in the iov can be copied at
* once without a pagefault.
*/
+ offset = pos & ~PAGE_CACHE_MASK;
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_single_seg_count(i));
goto again;
--
1.7.10.4

2013-04-05 11:58:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 14/34] thp, mm: locking tail page is a bug

From: "Kirill A. Shutemov" <[email protected]>

Locking the head page means locking the entire compound page.
If we try to lock a tail page, something has gone wrong.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1defa83..7b4736c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -665,6 +665,7 @@ void __lock_page(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
TASK_UNINTERRUPTIBLE);
}
@@ -674,6 +675,7 @@ int __lock_page_killable(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
return __wait_on_bit_lock(page_waitqueue(page), &wait,
sleep_on_page_killable, TASK_KILLABLE);
}
--
1.7.10.4

2013-04-05 12:05:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends

From: "Kirill A. Shutemov" <[email protected]>

The patch adds a new zone stat to count file transparent huge pages and
adjusts the related places.

For now we don't count mapped or dirty file thp pages separately.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/base/node.c | 10 ++++++++++
fs/proc/meminfo.c | 6 ++++++
include/linux/mmzone.h | 1 +
mm/mmap.c | 3 +++
mm/page_alloc.c | 7 ++++++-
5 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index fac124a..eed3763 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -118,11 +118,18 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d FileHugePages: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
nid, K(node_page_state(nid, NR_WRITEBACK)),
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ nid, K(node_page_state(nid, NR_FILE_PAGES)
+ + node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR),
+#else
nid, K(node_page_state(nid, NR_FILE_PAGES)),
+#endif
nid, K(node_page_state(nid, NR_FILE_MAPPED)),
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
nid, K(node_page_state(nid, NR_ANON_PAGES)
@@ -145,6 +152,9 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
, nid,
K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
+ , nid,
+ K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 1efaaa1..747ec70 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -41,6 +41,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)

cached = global_page_state(NR_FILE_PAGES) -
total_swapcache_pages() - i.bufferram;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR;
if (cached < 0)
cached = 0;

@@ -103,6 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "FileHugePages: %8lu kB\n"
#endif
,
K(i.totalram),
@@ -163,6 +167,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR)
+ ,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
#endif
);

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ab20a60..91fadd6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
NUMA_OTHER, /* allocation from other node */
#endif
NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_FILE_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

diff --git a/mm/mmap.c b/mm/mmap.c
index 49dc7d5..afb9088 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -135,6 +135,9 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
free = global_page_state(NR_FREE_PAGES);
free += global_page_state(NR_FILE_PAGES);
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ free += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES)
+ * HPAGE_PMD_NR;

/*
* shmem pages shouldn't be counted as free in this
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ca7b01e..7a26038 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2963,6 +2963,7 @@ void show_free_areas(unsigned int filter)
{
int cpu;
struct zone *zone;
+ long cached;

for_each_populated_zone(zone) {
if (skip_free_areas_node(filter, zone_to_nid(zone)))
@@ -3112,7 +3113,11 @@ void show_free_areas(unsigned int filter)
printk("= %lukB\n", K(total));
}

- printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
+ cached = global_page_state(NR_FILE_PAGES);
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR;
+ printk("%ld total pagecache pages\n", cached);

show_swap_cache_info();
}
--
1.7.10.4

2013-04-05 12:05:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 06/34] thp, mm: avoid PageUnevictable on active/inactive lru lists

From: "Kirill A. Shutemov" <[email protected]>

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
invalid opcode: 0000 [#1] SMP
CPU 0
Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
RIP: 0010:[<ffffffff81110478>] [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP: 0000:ffff8800796d9b28 EFLAGS: 00010082
RAX: 00000000ffffffea RBX: 0000000000000012 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea0001de8040
RBP: ffff8800796d9b88 R08: ffff8800796d9df0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
R13: ffffea0001de8060 R14: ffffffff818818e8 R15: ffff8800796d9bf8
FS: 0000000000000000(0000) GS:ffff88007a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1bfc108000 CR3: 000000000180b000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 293, threadinfo ffff8800796d8000, task ffff880079e0a6e0)
Stack:
ffff8800796d9b48 ffffffff81881880 ffff8800796d9df0 ffff8800796d9be0
0000000000000002 000000000000001f ffff8800796d9b88 ffffffff818818c8
ffffffff81881480 ffff8800796d9dc0 0000000000000002 000000000000001f
Call Trace:
[<ffffffff81111e98>] shrink_inactive_list+0x108/0x4a0
[<ffffffff8109ce3d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8107b8bf>] ? local_clock+0x4f/0x60
[<ffffffff8110ff5d>] ? shrink_slab+0x1fd/0x4c0
[<ffffffff811125a1>] shrink_zone+0x371/0x610
[<ffffffff8110ff75>] ? shrink_slab+0x215/0x4c0
[<ffffffff81112dfc>] kswapd+0x5bc/0xb60
[<ffffffff81112840>] ? shrink_zone+0x610/0x610
[<ffffffff81066676>] kthread+0xd6/0xe0
[<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
[<ffffffff814fed6c>] ret_from_fork+0x7c/0xb0
[<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
Code: 1f 40 00 49 8b 45 08 49 8b 75 00 48 89 46 08 48 89 30 49 8b 06 4c 89 68 08 49 89 45 00 4d 89 75 08 4d 89 2e eb 9c 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 31 db 45 31 e4 eb 9b 0f 0b 0f 0b 65 48
RIP [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP <ffff8800796d9b28>

For lru_add_page_tail(), this means we should not set PageUnevictable()
on tail pages unless we're sure the page will go to LRU_UNEVICTABLE.
Let's just copy PG_active and PG_unevictable from the head page in
__split_huge_page_refcount(); it will simplify lru_add_page_tail().

This also fixes one more bug in lru_add_page_tail():
if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nothing syncs page_tail's
active/inactive state with page's, so we can end up with an inactive page
on the active lru.
The patch fixes this as well, since we copy PG_active from the head page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 4 +++-
mm/swap.c | 20 ++------------------
2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..46a44ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1613,7 +1613,9 @@ static void __split_huge_page_refcount(struct page *page)
((1L << PG_referenced) |
(1L << PG_swapbacked) |
(1L << PG_mlocked) |
- (1L << PG_uptodate)));
+ (1L << PG_uptodate) |
+ (1L << PG_active) |
+ (1L << PG_unevictable)));
page_tail->flags |= (1L << PG_dirty);

/* clear PageTail before overwriting first_page */
diff --git a/mm/swap.c b/mm/swap.c
index 92a9be5..20d78b6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -740,8 +740,6 @@ EXPORT_SYMBOL(__pagevec_release);
void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec)
{
- int uninitialized_var(active);
- enum lru_list lru;
const int file = 0;

VM_BUG_ON(!PageHead(page));
@@ -752,20 +750,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,

SetPageLRU(page_tail);

- if (page_evictable(page_tail)) {
- if (PageActive(page)) {
- SetPageActive(page_tail);
- active = 1;
- lru = LRU_ACTIVE_ANON;
- } else {
- active = 0;
- lru = LRU_INACTIVE_ANON;
- }
- } else {
- SetPageUnevictable(page_tail);
- lru = LRU_UNEVICTABLE;
- }
-
if (likely(PageLRU(page)))
list_add_tail(&page_tail->lru, &page->lru);
else {
@@ -777,13 +761,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
* Use the standard add function to put page_tail on the list,
* but then correct its position so they all end up in order.
*/
- add_page_to_lru_list(page_tail, lruvec, lru);
+ add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
list_head = page_tail->lru.prev;
list_move_tail(&page_tail->lru, list_head);
}

if (!PageUnevictable(page))
- update_page_reclaim_stat(lruvec, file, active);
+ update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

--
1.7.10.4

2013-04-05 12:05:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 11/34] mm: trace filemap: dump page order

From: "Kirill A. Shutemov" <[email protected]>

Dump the page order in the filemap trace events so we can distinguish
between small pages and huge pages in the page cache.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/trace/events/filemap.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__field(struct page *, page)
__field(unsigned long, i_ino)
__field(unsigned long, index)
+ __field(int, order)
__field(dev_t, s_dev)
),

@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__entry->page = page;
__entry->i_ino = page->mapping->host->i_ino;
__entry->index = page->index;
+ __entry->order = compound_order(page);
if (page->mapping->host->i_sb)
__entry->s_dev = page->mapping->host->i_sb->s_dev;
else
__entry->s_dev = page->mapping->host->i_rdev;
),

- TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+ TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
__entry->i_ino,
__entry->page,
page_to_pfn(__entry->page),
- __entry->index << PAGE_SHIFT)
+ __entry->index << PAGE_SHIFT,
+ __entry->order)
);

DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
--
1.7.10.4

2013-04-05 12:05:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 12/34] thp, mm: rewrite delete_from_page_cache() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ce1ded8..56a81e3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
void __delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ int nr;

trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -127,13 +128,28 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);

- radix_tree_delete(&mapping->page_tree, page->index);
+ if (PageTransHuge(page)) {
+ int i;
+
+ radix_tree_delete(&mapping->page_tree, page->index);
+ for (i = 1; i < HPAGE_CACHE_NR; i++) {
+ radix_tree_delete(&mapping->page_tree, page->index + i);
+ page[i].mapping = NULL;
+ page_cache_release(page + i);
+ }
+ __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+ nr = HPAGE_CACHE_NR;
+ } else {
+ radix_tree_delete(&mapping->page_tree, page->index);
+ __dec_zone_page_state(page, NR_FILE_PAGES);
+ nr = 1;
+ }
+
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
- mapping->nrpages--;
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ mapping->nrpages -= nr;
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
BUG_ON(page_mapped(page));

/*
@@ -144,8 +160,8 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+ add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}
}

--
1.7.10.4

2013-04-05 12:05:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 10/34] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once:
the head page for the specified index and HPAGE_CACHE_NR-1 tail pages for
the following indexes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 71 ++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2d99191..ce1ded8 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,39 +447,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
int error;
+ enum zone_stat_item item;
+ int i, nr;

VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

+ /* memory cgroup controller handles thp pages on its side */
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
- goto out;
-
- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
- if (error == 0) {
- page_cache_get(page);
- page->mapping = mapping;
- page->index = offset;
+ return error;

- spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (likely(!error)) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- spin_unlock_irq(&mapping->tree_lock);
- trace_mm_filemap_add_to_page_cache(page);
- } else {
- page->mapping = NULL;
- /* Leave page->index set: truncation relies upon it */
- spin_unlock_irq(&mapping->tree_lock);
- mem_cgroup_uncharge_cache_page(page);
- page_cache_release(page);
- }
- radix_tree_preload_end();
- } else
+ if (PageTransHuge(page)) {
+ BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+ nr = HPAGE_CACHE_NR;
+ item = NR_FILE_TRANSPARENT_HUGEPAGES;
+ } else {
+ nr = 1;
+ item = NR_FILE_PAGES;
+ }
+ error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
+ if (error) {
mem_cgroup_uncharge_cache_page(page);
-out:
+ return error;
+ }
+
+ spin_lock_irq(&mapping->tree_lock);
+ for (i = 0; i < nr; i++) {
+ page_cache_get(page + i);
+ page[i].index = offset + i;
+ page[i].mapping = mapping;
+ error = radix_tree_insert(&mapping->page_tree,
+ offset + i, page + i);
+ if (error)
+ goto err;
+ }
+ __inc_zone_page_state(page, item);
+ mapping->nrpages += nr;
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ trace_mm_filemap_add_to_page_cache(page);
+ return 0;
+err:
+ if (i != 0)
+ error = -ENOSPC; /* no space for a huge page */
+ page_cache_release(page + i);
+ page[i].mapping = NULL;
+ for (i--; i >= 0; i--) {
+ /* Leave page->index set: truncation relies upon it */
+ page[i].mapping = NULL;
+ radix_tree_delete(&mapping->page_tree, offset + i);
+ page_cache_release(page + i);
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ mem_cgroup_uncharge_cache_page(page);
return error;
}
EXPORT_SYMBOL(add_to_page_cache_locked);
--
1.7.10.4

2013-04-05 12:05:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 13/34] thp, mm: trigger bug in replace_page_cache_page() on THP

From: "Kirill A. Shutemov" <[email protected]>

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache anytime soon.

Let's postpone the implementation of THP handling in
replace_page_cache_page() until anyone actually needs it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 56a81e3..1defa83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -412,6 +412,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
{
int error;

+ VM_BUG_ON(PageTransHuge(old));
+ VM_BUG_ON(PageTransHuge(new));
VM_BUG_ON(!PageLocked(old));
VM_BUG_ON(!PageLocked(new));
VM_BUG_ON(new->mapping);
--
1.7.10.4

2013-04-05 12:07:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 05/34] memcg, thp: charge huge cache pages

From: "Kirill A. Shutemov" <[email protected]>

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. It looks like it's just
legacy (introduced in 52d4b9a "memcg: allocate all page_cgroup at boot").

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memcontrol.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 690fa8c..0e7f7e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3975,8 +3975,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,

if (mem_cgroup_disabled())
return 0;
- if (PageCompound(page))
- return 0;

if (!PageSwapCache(page))
ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
--
1.7.10.4

2013-04-05 11:58:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 01/34] mm: drop actor argument of do_generic_file_read()

From: "Kirill A. Shutemov" <[email protected]>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4ebaf95..2d99191 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1075,7 +1075,6 @@ static void shrink_readahead_size_eio(struct file *filp,
* @filp: the file to read
* @ppos: current file position
* @desc: read_descriptor
- * @actor: read method
*
* This is a generic file read routine, and uses the
* mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1084,7 +1083,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* of the logic when it comes to error handling etc.
*/
static void do_generic_file_read(struct file *filp, loff_t *ppos,
- read_descriptor_t *desc, read_actor_t actor)
+ read_descriptor_t *desc)
{
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
@@ -1185,13 +1184,14 @@ page_ok:
* Ok, we have the page, and it's up-to-date, so
* now we can copy it to user space...
*
- * The actor routine returns how many bytes were actually used..
+ * The file_read_actor routine returns how many bytes were
+ * actually used..
* NOTE! This may not be the same as how much of a user buffer
* we filled up (we may be padding etc), so we can only update
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = file_read_actor(desc, page, offset, nr);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
@@ -1464,7 +1464,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (desc.count == 0)
continue;
desc.error = 0;
- do_generic_file_read(filp, ppos, &desc, file_read_actor);
+ do_generic_file_read(filp, ppos, &desc);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;
--
1.7.10.4

2013-04-05 12:08:34

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3, RFC 04/34] radix-tree: implement preload for multiple contiguous elements

From: "Kirill A. Shutemov" <[email protected]>

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spin lock, which means we want to
pre-allocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For the transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries into an address_space at once.

This patch introduces radix_tree_preload_count(), which allows
preallocating enough nodes to insert a number of *contiguous* elements.

Worst case for adding N contiguous items is adding entries at indexes
(ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
item plus extra nodes if you cross the boundary from one node to the next.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase the array size
to be able to handle 512 entries at once.

The size of the array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6.

Since only THP uses batched preload at the moment, we disable it (set max
preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE. This can be changed in the
future.
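
A minimal usage sketch (this mirrors how add_to_page_cache_locked() uses
the new interface later in the series; error unwinding is omitted):

	/* reserve enough nodes for HPAGE_CACHE_NR contiguous insertions */
	err = radix_tree_preload_count(HPAGE_CACHE_NR, GFP_KERNEL);
	if (err)
		return err;
	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_CACHE_NR; i++)
		radix_tree_insert(&mapping->page_tree, index + i, page + i);
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();	/* re-enables preemption */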

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/radix-tree.h | 11 +++++++++++
lib/radix-tree.c | 33 ++++++++++++++++++++++++++-------
2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..0d98fd6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do { \
(root)->rnode = NULL; \
} while (0)

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR 512
+#else
+#define RADIX_TREE_PRELOAD_NR 1
+#endif
+
/**
* Radix-tree synchronization
*
@@ -231,6 +241,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..1bc352f 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
* The worst case is a zero height tree with just a single item at index 0,
* and then inserting an item at index ULONG_MAX. This requires 2 new branches
* of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
* Hence:
*/
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+ (RADIX_TREE_PRELOAD_MIN + \
+ DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))

/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
int nr;
- struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+ struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };

@@ -257,29 +265,35 @@ radix_tree_node_free(struct radix_tree_node *node)

/*
* Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail. On
- * success, return zero, with preemption disabled. On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
* __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ int preload_target = RADIX_TREE_PRELOAD_MIN +
+ DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+ if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+ "too large preload requested"))
+ return -ENOMEM;

preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+ while (rtp->nr < preload_target) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < preload_target)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +302,11 @@ int radix_tree_preload(gfp_t gfp_mask)
out:
return ret;
}
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+ return radix_tree_preload_count(1, gfp_mask);
+}
EXPORT_SYMBOL(radix_tree_preload);

/*
--
1.7.10.4

2013-04-07 00:40:58

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 00/34] Transparent huge page cache

Hi Kirill,
On 04/05/2013 07:59 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Here's third RFC. Thanks everybody for feedback.

Could you answer my questions in your version two?

>
> The patchset is pretty big already and I want to stop generate new
> features to keep it reviewable. Next I'll concentrate on benchmarking and
> tuning.
>
> Therefore some features will be outside initial transparent huge page
> cache implementation:
> - page collapsing;
> - migration;
> - tmpfs/shmem;
>
> There are few features which are not implemented and potentially can block
> upstreaming:
>
> 1. Currently we allocate 2M page even if we create only 1 byte file on
> ramfs. I don't think it's a problem by itself. With anon thp pages we also
> try to allocate huge pages whenever possible.
> The problem is that ramfs pages are unevictable and we can't just split
> and pushed them in swap as with anon thp. We (at some point) have to have
> mechanism to split last page of the file under memory pressure to reclaim
> some memory.
>
> 2. We don't have knobs for disabling transparent huge page cache per-mount
> or per-file. Should we have mount option and fadivse flags as part of
> initial implementation?
>
> Any thoughts?
>
> The patchset is also on git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git thp/pagecache
>
> v3:
> - set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
> - rewrite lru_add_page_tail() to address few bags;
> - memcg accounting;
> - represent file thp pages in meminfo and friends;
> - dump page order in filemap trace;
> - add missed flush_dcache_page() in zero_huge_user_segment;
> - random cleanups based on feedback.
> v2:
> - mmap();
> - fix add_to_page_cache_locked() and delete_from_page_cache();
> - introduce mapping_can_have_hugepages();
> - call split_huge_page() only for head page in filemap_fault();
> - wait_split_huge_page(): serialize over i_mmap_mutex too;
> - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
> - fix off-by-one in zero_huge_user_segment();
> - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;
>
> Kirill A. Shutemov (34):
> mm: drop actor argument of do_generic_file_read()
> block: implement add_bdi_stat()
> mm: implement zero_huge_user_segment and friends
> radix-tree: implement preload for multiple contiguous elements
> memcg, thp: charge huge cache pages
> thp, mm: avoid PageUnevictable on active/inactive lru lists
> thp, mm: basic defines for transparent huge page cache
> thp, mm: introduce mapping_can_have_hugepages() predicate
> thp: represent file thp pages in meminfo and friends
> thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> mm: trace filemap: dump page order
> thp, mm: rewrite delete_from_page_cache() to support huge pages
> thp, mm: trigger bug in replace_page_cache_page() on THP
> thp, mm: locking tail page is a bug
> thp, mm: handle tail pages in page_cache_get_speculative()
> thp, mm: add event counters for huge page alloc on write to a file
> thp, mm: implement grab_thp_write_begin()
> thp, mm: naive support of thp in generic read/write routines
> thp, libfs: initial support of thp in
> simple_read/write_begin/write_end
> thp: handle file pages in split_huge_page()
> thp: wait_split_huge_page(): serialize over i_mmap_mutex too
> thp, mm: truncate support for transparent huge page cache
> thp, mm: split huge page on mmap file page
> ramfs: enable transparent huge page cache
> x86-64, mm: proper alignment mappings with hugepages
> mm: add huge_fault() callback to vm_operations_struct
> thp: prepare zap_huge_pmd() to uncharge file pages
> thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
> thp, mm: basic huge_fault implementation for generic_file_vm_ops
> thp: extract fallback path from do_huge_pmd_anonymous_page() to a
> function
> thp: initial implementation of do_huge_linear_fault()
> thp: handle write-protect exception to file-backed huge pages
> thp: call __vma_adjust_trans_huge() for file-backed VMA
> thp: map file-backed huge pages on fault
>
> arch/x86/kernel/sys_x86_64.c | 12 +-
> drivers/base/node.c | 10 +
> fs/libfs.c | 48 +++-
> fs/proc/meminfo.c | 6 +
> fs/ramfs/inode.c | 6 +-
> include/linux/backing-dev.h | 10 +
> include/linux/huge_mm.h | 36 ++-
> include/linux/mm.h | 8 +
> include/linux/mmzone.h | 1 +
> include/linux/pagemap.h | 33 ++-
> include/linux/radix-tree.h | 11 +
> include/linux/vm_event_item.h | 2 +
> include/trace/events/filemap.h | 7 +-
> lib/radix-tree.c | 33 ++-
> mm/filemap.c | 298 ++++++++++++++++++++-----
> mm/huge_memory.c | 474 +++++++++++++++++++++++++++++++++-------
> mm/memcontrol.c | 2 -
> mm/memory.c | 41 +++-
> mm/mmap.c | 3 +
> mm/page_alloc.c | 7 +-
> mm/swap.c | 20 +-
> mm/truncate.c | 13 ++
> mm/vmstat.c | 2 +
> 23 files changed, 902 insertions(+), 181 deletions(-)
>

2013-04-08 18:46:44

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> + if (unlikely(khugepaged_enter(vma)))
> + return VM_FAULT_OOM;
...
> + ret = vma->vm_ops->huge_fault(vma, &vmf);
> + if (unlikely(ret & VM_FAULT_OOM))
> + goto uncharge_out_fallback;
> + if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> + goto uncharge_out;
> +
> + if (unlikely(PageHWPoison(vmf.page))) {
> + if (ret & VM_FAULT_LOCKED)
> + unlock_page(vmf.page);
> + ret = VM_FAULT_HWPOISON;
> + goto uncharge_out;
> + }

One note on all these patches, but especially this one, is that I think
they're way too liberal with unlikely()s. You really don't need to do
this for every single error case. Please reserve them for places where
you _know_ there is a benefit, or that the compiler is doing things that
you _know_ are blatantly wrong.
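
For instance (just to illustrate the distinction, not code taken from the
patches): something like

	if (unlikely(pmd_trans_splitting(*pmd)))	/* hot path, rarely taken */
		goto retry;

can be worth annotating, while a plain

	if (!new_page)
		return VM_FAULT_OOM;

on an allocation-failure path gains nothing from unlikely().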

2013-04-08 18:52:55

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> +int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd, unsigned int flags)
> +{
> + unsigned long haddr = address & HPAGE_PMD_MASK;
> + struct page *cow_page, *page, *dirty_page = NULL;
> + bool anon = false, fallback = false, page_mkwrite = false;
> + pgtable_t pgtable = NULL;
> + struct vm_fault vmf;
> + int ret;
> +
> + /* Fallback if vm_pgoff and vm_start are not suitable */
> + if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> + (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
> + return do_fallback(mm, vma, address, pmd, flags);
> +
> + if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> + return do_fallback(mm, vma, address, pmd, flags);
> +
> + if (unlikely(khugepaged_enter(vma)))
> + return VM_FAULT_OOM;
> +
> + /*
> + * If we do COW later, allocate page before taking lock_page()
> + * on the file cache page. This will reduce lock holding time.
> + */
> + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> + if (unlikely(anon_vma_prepare(vma)))
> + return VM_FAULT_OOM;
> +
> + cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
> + vma, haddr, numa_node_id(), 0);
> + if (!cow_page) {
> + count_vm_event(THP_FAULT_FALLBACK);
> + return do_fallback(mm, vma, address, pmd, flags);
> + }
> + count_vm_event(THP_FAULT_ALLOC);
> + if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
> + page_cache_release(cow_page);
> + return do_fallback(mm, vma, address, pmd, flags);
> + }

Ugh. This is essentially a copy-n-paste of code in __do_fault(),
including the comments. Is there no way to consolidate the code so that
there's less duplication here?

Part of the reason we have so many bugs in hugetlbfs is that it's really
a forked set of code that does things its own way. I really hope we're
not going down the road of creating another feature in the same way.

When you copy *this* much code (or any, really), you really need to talk
about it in the patch description. I was looking at other COW code, and
just happened to stumble on to the __do_fault() code.

2013-04-08 19:07:11

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 32/34] thp: handle write-protect exception to file-backed huge pages

For all the do_huge_pmd_wp_page(), I think we need a better description
of where the code came from. There are some more obviously
copy-n-pasted comments in there.

For the entire series, in the patch description, we need to know:
1. What was originally written and what was copied from elsewhere
2. For the stuff that was copied, was an attempt made to consolidate
instead of copy? Why was consolidation impossible or infeasible?

> + if (!PageAnon(page)) {
> + add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> + }

This seems like a bit of a hack. Shouldn't we have just been accounting
to MM_FILEPAGES in the first place?

2013-04-08 19:38:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends

On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> The patch adds new zone stat to count file transparent huge pages and
> adjust related places.
>
> For now we don't count mapped or dirty file thp pages separately.

I can understand tracking NR_FILE_TRANSPARENT_HUGEPAGES itself. But,
why not also account for them in NR_FILE_PAGES? That way, you don't
have to special-case each of the cases below:

> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -41,6 +41,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>
> cached = global_page_state(NR_FILE_PAGES) -
> total_swapcache_pages() - i.bufferram;
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
> + HPAGE_PMD_NR;
> if (cached < 0)
> cached = 0;
....
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -135,6 +135,9 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
> if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
> free = global_page_state(NR_FREE_PAGES);
> free += global_page_state(NR_FILE_PAGES);
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + free += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES)
> + * HPAGE_PMD_NR;
...
> - printk("%ld total pagecache pages\n", global_page_state(NR_FILE_PAGES));
> + cached = global_page_state(NR_FILE_PAGES);
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> + cached += global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
> + HPAGE_PMD_NR;
> + printk("%ld total pagecache pages\n", cached);

2013-04-16 14:47:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends

Dave Hansen wrote:
> On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> > The patch adds new zone stat to count file transparent huge pages and
> > adjust related places.
> >
> > For now we don't count mapped or dirty file thp pages separately.
>
> I can understand tracking NR_FILE_TRANSPARENT_HUGEPAGES itself. But,
> why not also account for them in NR_FILE_PAGES? That way, you don't
> have to special-case each of the cases below:

Good point.
To be consistent I'll also convert NR_ANON_TRANSPARENT_HUGEPAGES to be
accounted in NR_ANON_PAGES.

--
Kirill A. Shutemov

2013-04-16 15:11:30

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 09/34] thp: represent file thp pages in meminfo and friends

On 04/16/2013 07:49 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
>>> The patch adds new zone stat to count file transparent huge pages and
>>> adjust related places.
>>>
>>> For now we don't count mapped or dirty file thp pages separately.
>>
>> I can understand tracking NR_FILE_TRANSPARENT_HUGEPAGES itself. But,
>> why not also account for them in NR_FILE_PAGES? That way, you don't
>> have to special-case each of the cases below:
>
> Good point.
> To be consistent I'll also convert NR_ANON_TRANSPARENT_HUGEPAGES to be
> accounted in NR_ANON_PAGES.

Hmm... I didn't realize we did that for the anonymous version. But,
looking at the meminfo code:

> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> K(global_page_state(NR_ANON_PAGES)
> + global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
> HPAGE_PMD_NR),
> #else
> K(global_page_state(NR_ANON_PAGES)),
> #endif

That #ifdef and a couple others like it would just go away if we did
this. It would be a nice cleanup.
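
Roughly, each of those spots would collapse to the plain

	K(global_page_state(NR_ANON_PAGES)),

line once the huge pages are folded into NR_ANON_PAGES itself.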

2013-04-17 14:36:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

Dave Hansen wrote:
> On 04/05/2013 04:59 AM, Kirill A. Shutemov wrote:
> > +int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long address, pmd_t *pmd, unsigned int flags)
> > +{
> > + unsigned long haddr = address & HPAGE_PMD_MASK;
> > + struct page *cow_page, *page, *dirty_page = NULL;
> > + bool anon = false, fallback = false, page_mkwrite = false;
> > + pgtable_t pgtable = NULL;
> > + struct vm_fault vmf;
> > + int ret;
> > +
> > + /* Fallback if vm_pgoff and vm_start are not suitable */
> > + if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> > + (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
> > + return do_fallback(mm, vma, address, pmd, flags);
> > +
> > + if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
> > + return do_fallback(mm, vma, address, pmd, flags);
> > +
> > + if (unlikely(khugepaged_enter(vma)))
> > + return VM_FAULT_OOM;
> > +
> > + /*
> > + * If we do COW later, allocate page before taking lock_page()
> > + * on the file cache page. This will reduce lock holding time.
> > + */
> > + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
> > + if (unlikely(anon_vma_prepare(vma)))
> > + return VM_FAULT_OOM;
> > +
> > + cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
> > + vma, haddr, numa_node_id(), 0);
> > + if (!cow_page) {
> > + count_vm_event(THP_FAULT_FALLBACK);
> > + return do_fallback(mm, vma, address, pmd, flags);
> > + }
> > + count_vm_event(THP_FAULT_ALLOC);
> > + if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
> > + page_cache_release(cow_page);
> > + return do_fallback(mm, vma, address, pmd, flags);
> > + }
>
> Ugh. This is essentially a copy-n-paste of code in __do_fault(),
> including the comments. Is there no way to consolidate the code so that
> there's less duplication here?

I've looked into it once again and it seems there's not much room for
consolidation. The code structure looks very similar, but there are many
special cases for thp: fallback path, pte vs. pmd, etc. I don't see how we
can consolidate them in a sane way.
I think a copy is more maintainable :(

> Part of the reason we have so many bugs in hugetlbfs is that it's really
> a forked set of code that does things its own way. I really hope we're
> not going down the road of creating another feature in the same way.
>
> When you copy *this* much code (or any, really), you really need to talk
> about it in the patch description. I was looking at other COW code, and
> just happened to stumble on to the __do_fault() code.

I will document it in commit message and in comments for both functions.

--
Kirill A. Shutemov

2013-04-17 22:08:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
>> > Ugh. This is essentially a copy-n-paste of code in __do_fault(),
>> > including the comments. Is there no way to consolidate the code so that
>> > there's less duplication here?
> I've looked into it once again and it seems there's not much space for
> consolidation. Code structure looks very similar, but there are many
> special cases for thp: fallback path, pte vs. pmd, etc. I don't see how we
> can consolidate them in them in sane way.
> I think copy is more maintainable :(

I took the two copies, put them each in a file, changed some of the
_very_ trivial stuff to match (foo=1 vs foo=true) and diffed them.
They're very similar lengths (in lines):

185 __do_fault
197 do_huge_linear_fault

If you diff them:

1 file changed, 68 insertions(+), 56 deletions(-)

That means that of 185 lines in __do_fault(), 129 (70%) of them were
copied *VERBATIM*. Not similar in structure or appearance. Bit-for-bit
the same.

I took a stab at consolidating them. I think we could add a
VM_FAULT_FALLBACK flag to explicitly indicate that we need to do a
huge->small fallback, as well as a FAULT_FLAG_TRANSHUGE to indicate that
a given fault has not attempted to be handled by a huge page. If we
call __do_fault() with FAULT_FLAG_TRANSHUGE and we get back
VM_FAULT_FALLBACK or VM_FAULT_OOM, then we clear FAULT_FLAG_TRANSHUGE
and retry in handle_mm_fault().
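
Roughly (a sketch of the idea only; the attached patch may differ):

	/* in handle_mm_fault(), for a VMA that may use huge pages */
	flags |= FAULT_FLAG_TRANSHUGE;
	ret = __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
	if (ret & (VM_FAULT_FALLBACK | VM_FAULT_OOM)) {
		flags &= ~FAULT_FLAG_TRANSHUGE;
		ret = __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
	}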

I only went about 1/4 of the way into __do_fault(). If I went and spent
another hour or two, I'm pretty convinced I could push this even further.

Are you still sure you can't do _any_ better than a verbatim copy of 129
lines?



Attachments:
extend-__do_fault.patch (3.85 kB)

2013-04-18 16:08:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

Dave Hansen wrote:
> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
> Are you still sure you can't do _any_ better than a verbatim copy of 129
> lines?

It seems I was too lazy. Shame on me. :(
Here's a consolidated version. Only build-tested. Does it look better?

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1c25b90..47651d4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -186,6 +186,28 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

+static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+{
+ return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+ struct vm_area_struct *vma,
+ unsigned long haddr, int nd,
+ gfp_t extra_gfp)
+{
+ return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
+ HPAGE_PMD_ORDER, vma, haddr, nd);
+}
+
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+ pmd_t entry;
+ entry = mk_pmd(page, prot);
+ entry = pmd_mkhuge(entry);
+ return entry;
+}
+
extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c8a8626..4669c19 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -165,6 +165,11 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#ifdef CONFIG_CONFIG_TRANSPARENT_HUGEPAGE
+#define FAULT_FLAG_TRANSHUGE 0x80 /* Try to allocate transhuge page */
+#else
+#define FAULT_FLAG_TRANSHUGE 0 /* Optimize out THP code if disabled*/
+#endif

/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -880,6 +885,7 @@ static inline int page_mapped(struct page *page)
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
+#define VM_FAULT_FALLBACK 0x0800 /* large page fault failed, fall back to small */

#define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73691a3..e14fa81 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -692,14 +692,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
- pmd_t entry;
- entry = mk_pmd(page, prot);
- entry = pmd_mkhuge(entry);
- return entry;
-}
-
static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
@@ -742,20 +734,6 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
return 0;
}

-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
-{
- return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
-}
-
-static inline struct page *alloc_hugepage_vma(int defrag,
- struct vm_area_struct *vma,
- unsigned long haddr, int nd,
- gfp_t extra_gfp)
-{
- return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
- HPAGE_PMD_ORDER, vma, haddr, nd);
-}
-
#ifndef CONFIG_NUMA
static inline struct page *alloc_hugepage(int defrag)
{
diff --git a/mm/memory.c b/mm/memory.c
index 5f782d6..e6efd8c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/khugepaged.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3229,6 +3230,53 @@ oom:
return VM_FAULT_OOM;
}

+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK)) {
+ return false;
+ }
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) {
+ return false;
+ }
+ return true;
+}
+
+static struct page *alloc_fault_page_vma(struct vm_area_struct *vma,
+ unsigned long addr, unsigned int flags)
+{
+
+ if (flags & FAULT_FLAG_TRANSHUGE) {
+ struct page *page;
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+ page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (page)
+ count_vm_event(THP_FAULT_ALLOC);
+ else
+ count_vm_event(THP_FAULT_FALLBACK);
+ return page;
+ }
+ return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+}
+
+static inline bool ptl_lock_and_check_entry(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address, spinlock_t **ptl, pte_t **page_table,
+ pte_t orig_pte, unsigned int flags)
+{
+ if (flags & FAULT_FLAG_TRANSHUGE) {
+ spin_lock(&mm->page_table_lock);
+ return !pmd_none(*pmd);
+ } else {
+ *page_table = pte_offset_map_lock(mm, pmd, address, ptl);
+ return !pte_same(**page_table, orig_pte);
+ }
+}
+
/*
* __do_fault() tries to create a new page mapping. It aggressively
* tries to share with existing pages, but makes a separate copy if
@@ -3246,45 +3294,61 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
+ unsigned long haddr = address & PAGE_MASK;
pte_t *page_table;
spinlock_t *ptl;
- struct page *page;
- struct page *cow_page;
- pte_t entry;
- int anon = 0;
- struct page *dirty_page = NULL;
+ struct page *page, *cow_page, *dirty_page = NULL;
+ bool anon = false, page_mkwrite = false;
+ bool try_huge_pages = !!(flags & FAULT_FLAG_TRANSHUGE);
+ pgtable_t pgtable = NULL;
struct vm_fault vmf;
- int ret;
- int page_mkwrite = 0;
+ int nr = 1, ret;
+
+ if (try_huge_pages) {
+ if (!transhuge_vma_suitable(vma, haddr))
+ return VM_FAULT_FALLBACK;
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+ nr = HPAGE_PMD_NR;
+ haddr = address & HPAGE_PMD_MASK;
+ pgoff = linear_page_index(vma, haddr);
+ }

/*
 * If we do COW later, allocate page before taking lock_page()
* on the file cache page. This will reduce lock holding time.
*/
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
-
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;

- cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ cow_page = alloc_fault_page_vma(vma, address, flags);
if (!cow_page)
- return VM_FAULT_OOM;
+ return VM_FAULT_OOM | VM_FAULT_FALLBACK;

if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
page_cache_release(cow_page);
- return VM_FAULT_OOM;
+ return VM_FAULT_OOM | VM_FAULT_FALLBACK;
}
} else
cow_page = NULL;

- vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.virtual_address = (void __user *)haddr;
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.page = NULL;

- ret = vma->vm_ops->fault(vma, &vmf);
+ if (try_huge_pages) {
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable)) {
+ ret = VM_FAULT_OOM;
+ goto uncharge_out;
+ }
+ ret = vma->vm_ops->huge_fault(vma, &vmf);
+ } else
+ ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
- VM_FAULT_RETRY)))
+ VM_FAULT_RETRY | VM_FAULT_FALLBACK)))
goto uncharge_out;

if (unlikely(PageHWPoison(vmf.page))) {
@@ -3310,42 +3374,69 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_WRITE) {
if (!(vma->vm_flags & VM_SHARED)) {
page = cow_page;
- anon = 1;
- copy_user_highpage(page, vmf.page, address, vma);
+ anon = true;
+ if (try_huge_pages)
+ copy_user_highpage(page, vmf.page,
+ address, vma);
+ else
+ copy_user_huge_page(page, vmf.page, haddr, vma,
+ HPAGE_PMD_NR);
__SetPageUptodate(page);
- } else {
+ } else if (vma->vm_ops->page_mkwrite) {
/*
* If the page will be shareable, see if the backing
* address space wants to know that the page is about
* to become writable
*/
- if (vma->vm_ops->page_mkwrite) {
- int tmp;
-
- unlock_page(page);
- vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
- tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
- if (unlikely(tmp &
- (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
- ret = tmp;
+ int tmp;
+
+ unlock_page(page);
+ vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+ tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+ if (unlikely(tmp &
+ (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+ ret = tmp;
+ goto unwritable_page;
+ }
+ if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+ lock_page(page);
+ if (!page->mapping) {
+ ret = 0; /* retry the fault */
+ unlock_page(page);
goto unwritable_page;
}
- if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
- lock_page(page);
- if (!page->mapping) {
- ret = 0; /* retry the fault */
- unlock_page(page);
- goto unwritable_page;
- }
- } else
- VM_BUG_ON(!PageLocked(page));
- page_mkwrite = 1;
- }
+ } else
+ VM_BUG_ON(!PageLocked(page));
+ page_mkwrite = true;
}
+ }

+ if (unlikely(ptl_lock_and_check_entry(mm, pmd, address,
+ &ptl, &page_table, orig_pte, flags))) {
+ /* pte/pmd has changed. do not touch it */
+ if (pgtable)
+ pte_free(mm, pgtable);
+ if (cow_page)
+ mem_cgroup_uncharge_page(cow_page);
+ if (anon)
+ page_cache_release(page);
+ unlock_page(vmf.page);
+ page_cache_release(vmf.page);
+ return ret;
}

- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ flush_icache_page(vma, page);
+ if (anon) {
+ add_mm_counter_fast(mm, MM_ANONPAGES, nr);
+ page_add_new_anon_rmap(page, vma, address);
+ } else {
+ add_mm_counter_fast(mm, MM_FILEPAGES, nr);
+ page_add_file_rmap(page);
+ if (flags & FAULT_FLAG_WRITE) {
+ dirty_page = page;
+ get_page(dirty_page);
+ }
+ }

/*
* This silly early PAGE_DIRTY setting removes a race
@@ -3358,43 +3449,28 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
* handle that later.
*/
/* Only go through if we didn't race with anybody else... */
- if (likely(pte_same(*page_table, orig_pte))) {
- flush_icache_page(vma, page);
- entry = mk_pte(page, vma->vm_page_prot);
+ if (try_huge_pages) {
+ pmd_t entry = mk_huge_pmd(page, vma->vm_page_prot);
if (flags & FAULT_FLAG_WRITE)
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (anon) {
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address);
- } else {
- inc_mm_counter_fast(mm, MM_FILEPAGES);
- page_add_file_rmap(page);
- if (flags & FAULT_FLAG_WRITE) {
- dirty_page = page;
- get_page(dirty_page);
- }
- }
- set_pte_at(mm, address, page_table, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ set_pmd_at(mm, address, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
+ spin_unlock(&mm->page_table_lock);
} else {
- if (cow_page)
- mem_cgroup_uncharge_page(cow_page);
- if (anon)
- page_cache_release(page);
- else
- anon = 1; /* no anon but release faulted_page */
+ pte_t entry = mk_pte(page, vma->vm_page_prot);
+ if (flags & FAULT_FLAG_WRITE)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ set_pte_at(mm, address, page_table, entry);
+ update_mmu_cache(vma, address, page_table);
+ pte_unmap_unlock(page_table, ptl);
}

- pte_unmap_unlock(page_table, ptl);
-
if (dirty_page) {
struct address_space *mapping = page->mapping;
- int dirtied = 0;
+ bool dirtied = false;

if (set_page_dirty(dirty_page))
- dirtied = 1;
+ dirtied = true;
unlock_page(dirty_page);
put_page(dirty_page);
if ((dirtied || page_mkwrite) && mapping) {
@@ -3413,13 +3489,16 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (anon)
page_cache_release(vmf.page);
}
-
return ret;

unwritable_page:
+ if (pgtable)
+ pte_free(mm, pgtable);
page_cache_release(page);
return ret;
uncharge_out:
+ if (pgtable)
+ pte_free(mm, pgtable);
/* fs's fault handler returned an error */
if (cow_page) {
mem_cgroup_uncharge_page(cow_page);
--
Kirill A. Shutemov

2013-04-18 16:17:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

Kirill A. Shutemov wrote:
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c8a8626..4669c19 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -165,6 +165,11 @@ extern pgprot_t protection_map[16];
> #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
> #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
> #define FAULT_FLAG_TRIED 0x40 /* second try */
> +#ifdef CONFIG_CONFIG_TRANSPARENT_HUGEPAGE

Oops, s/CONFIG_CONFIG/CONFIG/.
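
For clarity, the fixed hunk would read something like the lines below. The
0x80 value and the comment are only my guess at the obvious next free bit
after FAULT_FLAG_TRIED, since the quoted context stops at the #ifdef line:

 #define FAULT_FLAG_TRIED	0x40	/* second try */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define FAULT_FLAG_TRANSHUGE	0x80	/* try to fault in a huge page */
+#endif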

--
Kirill A. Shutemov

2013-04-18 16:20:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

On 04/18/2013 09:09 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
>> Are you still sure you can't do _any_ better than a verbatim copy of 129
>> lines?
>
> It seems I was too lazy. Shame on me. :(
> Here's consolidated version. Only build tested. Does it look better?

Yeah, it's definitely a step in the right direction. There are
definitely some bugs in there, like:

+ unsigned long haddr = address & PAGE_MASK;

I do think some of this refactoring stuff

> - unlock_page(page);
> - vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> - tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> - if (unlikely(tmp &
> - (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> - ret = tmp;
> + unlock_page(page);
> + vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
> + tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> + if (unlikely(tmp &
> + (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> + ret = tmp;
> + goto unwritable_page;
> + }

could probably be a separate patch, which would make what's going on more
clear, but it's passable the way it is. When it's done this way, it's
sometimes hard, reading the diff, to tell whether you are adding code or
just moving it around.

This stuff:

> if (set_page_dirty(dirty_page))
> - dirtied = 1;
> + dirtied = true;

needs to go in another patch for sure.

One thing I *REALLY* like about doing patches this way is that things
like this start to pop out:

> - ret = vma->vm_ops->fault(vma, &vmf);
> + if (try_huge_pages) {
> + pgtable = pte_alloc_one(mm, haddr);
> + if (unlikely(!pgtable)) {
> + ret = VM_FAULT_OOM;
> + goto uncharge_out;
> + }
> + ret = vma->vm_ops->huge_fault(vma, &vmf);
> + } else
> + ret = vma->vm_ops->fault(vma, &vmf);

The ->fault is (or can be) essentially per filesystem, and we're going
to be adding support per-filesystem. Any reason we can't just handle
this inside the ->fault code and avoid adding huge_fault altogether?
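
To make it concrete, roughly what I have in mind is something like the
sketch below, with filemap_fault() standing in for the filesystem's normal
handler; ramfs_fault() and do_ramfs_huge_fault() are made-up names, not
anything from the posted series:

static int ramfs_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	if (vmf->flags & FAULT_FLAG_TRANSHUGE) {
		/* hypothetical helper doing the huge-page lookup/alloc */
		int ret = do_ramfs_huge_fault(vma, vmf);

		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
		/* otherwise fall back to the regular 4k path below */
	}
	return filemap_fault(vma, vmf);
}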

2013-04-18 16:36:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

Dave Hansen wrote:
> On 04/18/2013 09:09 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
> >> Are you still sure you can't do _any_ better than a verbatim copy of 129
> >> lines?
> >
> > It seems I was too lazy. Shame on me. :(
> > Here's consolidated version. Only build tested. Does it look better?
>
> Yeah, it's definitely a step in the right direction. There are
> definitely some bugs in there, like:
>
> + unsigned long haddr = address & PAGE_MASK;

It's not a bug. It's a bad name for the variable.
See the first 'if (try_huge_pages)': I update it there for the huge page case.

Would addr_aligned be better?

>
> I do think some of this refactoring stuff
>
> > - unlock_page(page);
> > - vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
> > - tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> > - if (unlikely(tmp &
> > - (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> > - ret = tmp;
> > + unlock_page(page);
> > + vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
> > + tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
> > + if (unlikely(tmp &
> > + (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
> > + ret = tmp;
> > + goto unwritable_page;
> > + }
>
> could probably be a separate patch, which would make what's going on more
> clear, but it's passable the way it is. When it's done this way, it's
> sometimes hard, reading the diff, to tell whether you are adding code or
> just moving it around.

Will do.

>
> This stuff:
>
> > if (set_page_dirty(dirty_page))
> > - dirtied = 1;
> > + dirtied = true;
>
> needs to go in another patch for sure.

Ditto.

> One thing I *REALLY* like about doing patches this way is that things
> like this start to pop out:
>
> > - ret = vma->vm_ops->fault(vma, &vmf);
> > + if (try_huge_pages) {
> > + pgtable = pte_alloc_one(mm, haddr);
> > + if (unlikely(!pgtable)) {
> > + ret = VM_FAULT_OOM;
> > + goto uncharge_out;
> > + }
> > + ret = vma->vm_ops->huge_fault(vma, &vmf);
> > + } else
> > + ret = vma->vm_ops->fault(vma, &vmf);
>
> The ->fault is (or can be) essentially per filesystem, and we're going
> to be adding support per-filesystem. Any reason we can't just handle
> this inside the ->fault code and avoid adding huge_fault altogether?

Will check. It's on my todo list already.

--
Kirill A. Shutemov

2013-04-18 16:42:31

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 31/34] thp: initial implementation of do_huge_linear_fault()

On 04/18/2013 09:38 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 04/18/2013 09:09 AM, Kirill A. Shutemov wrote:
>>> Dave Hansen wrote:
>>>> On 04/17/2013 07:38 AM, Kirill A. Shutemov wrote:
>>>> Are you still sure you can't do _any_ better than a verbatim copy of 129
>>>> lines?
>>>
>>> It seems I was too lazy. Shame on me. :(
>>> Here's consolidated version. Only build tested. Does it look better?
>>
>> Yeah, it's definitely a step in the right direction. There are
>> definitely some bugs in there, like:
>>
>> + unsigned long haddr = address & PAGE_MASK;
>
> It's not bug. It's bad name for the variable.
> See, first 'if (try_huge_pages)'. I update it there for huge page case.
>
> addr_aligned better?

That's a criminally bad name. :)

addr_aligned is better, and also please initialize the two cases
together. It's mean to separate them.
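
Something like this is what I mean -- an untested sketch, just to show the
shape of it, with addr_aligned as the proposed name:

	bool try_huge_pages = !!(flags & FAULT_FLAG_TRANSHUGE);
	/* compute the aligned address once, for both cases */
	unsigned long addr_aligned = try_huge_pages ?
			address & HPAGE_PMD_MASK : address & PAGE_MASK;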

2013-04-26 15:29:08

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3, RFC 32/34] thp: handle write-protect exception to file-backed huge pages

Dave Hansen wrote:
> > + if (!PageAnon(page)) {
> > + add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
> > + add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > + }
>
> This seems like a bit of a hack. Shouldn't we have just been accounting
> to MM_FILEPAGES in the first place?

No, it's not.

It handles MAP_PRIVATE file mappings: the page is first read and accounted
to MM_FILEPAGES, and is then COW'ed into an anon page here, so we have to
adjust the counters. do_wp_page() has similar code.
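
For comparison, the small-page counterpart in do_wp_page() looks roughly
like this (quoting from memory, so the exact lines may differ slightly):

	if (old_page) {
		if (!PageAnon(old_page)) {
			/* file page replaced by an anon COW copy */
			dec_mm_counter_fast(mm, MM_FILEPAGES);
			inc_mm_counter_fast(mm, MM_ANONPAGES);
		}
	} else
		inc_mm_counter_fast(mm, MM_ANONPAGES);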

--
Kirill A. Shutemov