2013-09-23 12:08:07

by Kirill A. Shutemov

Subject: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

This patchset brings thp support for ramfs, but without mmap() -- mmap()
support will be posted separately.

Please review and consider applying.

Intro
-----

The goal of the project is to prepare kernel infrastructure for handling
huge pages in the page cache.

To prove that the proposed changes are functional, we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries. All entries point to the head page -- refcounting
tail pages would be pretty expensive.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under a single tree_lock. To make this possible,
we extended the radix-tree interface to pre-allocate enough memory to
insert a number of *contiguous* elements (kudos to Matthew Wilcox).
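
Roughly, the batched insert pattern (simplified from the radix-tree and
add_to_page_cache_locked() patches below, error handling omitted) looks
like:

	error = radix_tree_maybe_preload_contig(HPAGE_CACHE_NR,
						gfp_mask & ~__GFP_HIGHMEM);
	if (error)
		return error;

	spin_lock_irq(&mapping->tree_lock);
	/* one slot per small-page index, all pointing to the head page */
	for (i = 0; i < HPAGE_CACHE_NR; i++)
		radix_tree_insert(&mapping->page_tree, offset + i, page);
	radix_tree_preload_end();
	spin_unlock_irq(&mapping->tree_lock);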

Huge pages can be added to the page cache in three ways:
- write(2) to a file or page;
- read(2) from a sparse file;
- fault on a sparse file.

Potentially, one more way is collapsing small pages, but that is outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speedup later.

Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.

To minimize memory overhead for small files, we avoid write-allocation in
the first huge page area (2M on x86-64) of the file.
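
The check itself lives in the generic_perform_write() patch (not quoted in
this cover letter); conceptually it amounts to something like the sketch
below. The exact condition and flag handling are those of patch 15 -- this
is only an illustration of the idea:

	/*
	 * Sketch: only ask for a huge page for offsets past the first
	 * huge-page-sized area of the file, so small files stay small.
	 */
	if (mapping_can_have_hugepages(mapping) && pos >= HPAGE_PMD_SIZE)
		flags |= AOP_FLAG_TRANSHUGE;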

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range, we zero out
that part, exactly like we do for partial small pages.

split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

inode->i_split_sem taken for read protects huge pages in the inode's page
cache against splitting. We take it for write during splitting.
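
The resulting locking pattern, as used later in the series (e.g. in
do_generic_file_read() and split_file_huge_page()), is roughly:

	/* readers/writers of huge pages in the page cache */
	i_split_down_read(inode);
	/* ... look up and use the huge page; it cannot be split under us ... */
	i_split_up_read(inode);

	/* the splitting side */
	down_write(&inode->i_split_sem);
	/* ... split the compound page and fix up the radix tree ... */
	up_write(&inode->i_split_sem);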

Changes since v5
----------------
- change how huge pages are stored in the page cache: head page for all
relevant indexes;
- introduce i_split_sem;
- do not create huge pages on write(2) into the first hugepage area;
- compile-disabled by default;
- fix transparent_hugepage_pagecache();

Benchmarks
----------

Since the patchset doesn't include mmap() support, we shouldn't expect much
change in performance. We just need to check that we don't introduce any
major regression.

On average, read/write on ramfs with thp is a bit slower, but I don't think
it's a stopper -- ramfs is a toy anyway, and on real-world filesystems I
expect the difference to be smaller.

postmark
========

workload1:
chmod +x postmark
mount -t ramfs none /mnt
cat >/root/workload1 <<EOF
set transactions 250000
set size 5120 524288
set number 500
run
quit
EOF

workload2:
set transactions 10000
set size 2097152 10485760
set number 100
run
quit

throughput (transactions/sec)
                workload1    workload2
baseline             8333          416
patched              8333          454

FS-Mark
=======

throughput (files/sec)

                2000 files by 1M    200 files by 10M
baseline                  5326.1               548.1
patched                   5192.8               528.4

tiobench
========

baseline:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2048 MBs | 0.2 s | 8667.792 MB/s | 445.2 % | 5535.9 % |
| Random Write 62 MBs | 0.0 s | 8341.118 MB/s | 0.0 % | 2615.8 % |
| Read 2048 MBs | 0.2 s | 11680.431 MB/s | 339.9 % | 5470.6 % |
| Random Read 62 MBs | 0.0 s | 9451.081 MB/s | 786.3 % | 1451.7 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 0.006 ms | 28.019 ms | 0.00000 | 0.00000 |
| Random Write | 0.002 ms | 5.574 ms | 0.00000 | 0.00000 |
| Read | 0.005 ms | 28.018 ms | 0.00000 | 0.00000 |
| Random Read | 0.002 ms | 4.852 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 0.005 ms | 28.019 ms | 0.00000 | 0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

patched:
Tiotest results for 16 concurrent io threads:
,----------------------------------------------------------------------.
| Item | Time | Rate | Usr CPU | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write 2048 MBs | 0.3 s | 7942.818 MB/s | 442.1 % | 5533.6 % |
| Random Write 62 MBs | 0.0 s | 9425.426 MB/s | 723.9 % | 965.2 % |
| Read 2048 MBs | 0.2 s | 11998.008 MB/s | 374.9 % | 5485.8 % |
| Random Read 62 MBs | 0.0 s | 9823.955 MB/s | 251.5 % | 2011.9 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write | 0.007 ms | 28.020 ms | 0.00000 | 0.00000 |
| Random Write | 0.001 ms | 0.022 ms | 0.00000 | 0.00000 |
| Read | 0.004 ms | 24.011 ms | 0.00000 | 0.00000 |
| Random Read | 0.001 ms | 0.019 ms | 0.00000 | 0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total | 0.005 ms | 28.020 ms | 0.00000 | 0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

IOZone
======

Syscalls, not mmap.

** Initial writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 4741691 7986408 9149064 9898695 9868597 9629383 9469202 11605064 9507802 10641869 11360701 11040376
patched: 4682864 7275535 8691034 8872887 8712492 8771912 8397216 7701346 7366853 8839736 8299893 10788439
speed-up(times): 0.99 0.91 0.95 0.90 0.88 0.91 0.89 0.66 0.77 0.83 0.73 0.98

** Rewriters **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 5807891 9554869 12101083 13113533 12989751 14359910 16998236 16833861 24735659 17502634 17396706 20448655
patched: 6161690 9981294 12285789 13428846 13610058 13669153 20060182 17328347 24109999 19247934 24225103 34686574
speed-up(times): 1.06 1.04 1.02 1.02 1.05 0.95 1.18 1.03 0.97 1.10 1.39 1.70

** Readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 7978066 11825735 13808941 14049598 14765175 14422642 17322681 23209831 21386483 20060744 22032935 31166663
patched: 7723293 11481500 13796383 14363808 14353966 14979865 17648225 18701258 29192810 23973723 22163317 23104638
speed-up(times): 0.97 0.97 1.00 1.02 0.97 1.04 1.02 0.81 1.37 1.20 1.01 0.74

** Re-readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 7966269 11878323 14000782 14678206 14154235 14271991 15170829 20924052 27393344 19114990 12509316 18495597
patched: 7719350 11410937 13710233 13232756 14040928 15895021 16279330 17256068 26023572 18364678 27834483 23288680
speed-up(times): 0.97 0.96 0.98 0.90 0.99 1.11 1.07 0.82 0.95 0.96 2.23 1.26

** Reverse readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 6630795 10331013 12839501 13157433 12783323 13580283 15753068 15434572 21928982 17636994 14737489 19470679
patched: 6502341 9887711 12639278 12979232 13212825 12928255 13961195 14695786 21370667 19873807 20902582 21892899
speed-up(times): 0.98 0.96 0.98 0.99 1.03 0.95 0.89 0.95 0.97 1.13 1.42 1.12

** Random_readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 5152935 9043813 11752615 11996078 12283579 12484039 14588004 15781507 23847538 15748906 13698335 27195847
patched: 5009089 8438137 11266015 11631218 12093650 12779308 17768691 13640378 30468890 19269033 23444358 22775908
speed-up(times): 0.97 0.93 0.96 0.97 0.98 1.02 1.22 0.86 1.28 1.22 1.71 0.84

** Random_writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80
baseline: 3886268 7405345 10531192 10858984 10994693 12758450 10729531 9656825 10370144 13139452 4528331 12615812
patched: 4335323 7916132 10978892 11423247 11790932 11424525 11798171 11413452 12230616 13075887 11165314 16925679
speed-up(times): 1.12 1.07 1.04 1.05 1.07 0.90 1.10 1.18 1.18 1.00 2.47 1.34

Kirill A. Shutemov (22):
mm: implement zero_huge_user_segment and friends
radix-tree: implement preload for multiple contiguous elements
memcg, thp: charge huge cache pages
thp: compile-time and sysfs knob for thp pagecache
thp, mm: introduce mapping_can_have_hugepages() predicate
thp: represent file thp pages in meminfo and friends
thp, mm: rewrite add_to_page_cache_locked() to support huge pages
mm: trace filemap: dump page order
block: implement add_bdi_stat()
thp, mm: rewrite delete_from_page_cache() to support huge pages
thp, mm: warn if we try to use replace_page_cache_page() with THP
thp, mm: add event counters for huge page alloc on file write or read
mm, vfs: introduce i_split_sem
thp, mm: allocate huge pages in grab_cache_page_write_begin()
thp, mm: naive support of thp in generic_perform_write
thp, mm: handle transhuge pages in do_generic_file_read()
thp, libfs: initial thp support
truncate: support huge pages
thp: handle file pages in split_huge_page()
thp: wait_split_huge_page(): serialize over i_mmap_mutex too
thp, mm: split huge page on mmap file page
ramfs: enable transparent huge page cache

Documentation/vm/transhuge.txt | 16 ++++
drivers/base/node.c | 4 +
fs/inode.c | 3 +
fs/libfs.c | 58 +++++++++++-
fs/proc/meminfo.c | 3 +
fs/ramfs/file-mmu.c | 2 +-
fs/ramfs/inode.c | 6 +-
include/linux/backing-dev.h | 10 +++
include/linux/fs.h | 11 +++
include/linux/huge_mm.h | 68 +++++++++++++-
include/linux/mm.h | 18 ++++
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 13 +++
include/linux/pagemap.h | 31 +++++++
include/linux/radix-tree.h | 11 +++
include/linux/vm_event_item.h | 4 +
include/trace/events/filemap.h | 7 +-
lib/radix-tree.c | 94 ++++++++++++++++++--
mm/Kconfig | 11 +++
mm/filemap.c | 196 ++++++++++++++++++++++++++++++++---------
mm/huge_memory.c | 147 +++++++++++++++++++++++++++----
mm/memcontrol.c | 3 +-
mm/memory.c | 40 ++++++++-
mm/truncate.c | 125 ++++++++++++++++++++------
mm/vmstat.c | 5 ++
25 files changed, 779 insertions(+), 108 deletions(-)

--
1.8.4.rc3


2013-09-23 12:06:06

by Kirill A. Shutemov

Subject: [PATCHv6 03/22] memcg, thp: charge huge cache pages

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. It looks like it's just
legacy (introduced in 52d4b9a "memcg: allocate all page_cgroup at boot").

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5ff3ce130..0b87a1bd25 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3963,8 +3963,7 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,

if (mem_cgroup_disabled())
return 0;
- if (PageCompound(page))
- return 0;
+ VM_BUG_ON(PageCompound(page) && !PageTransHuge(page));

if (!PageSwapCache(page))
ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
--
1.8.4.rc3

2013-09-23 12:06:19

by Kirill A. Shutemov

Subject: [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index d2d6c0ebe9..60478ebeda 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
void __delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ int i, nr;

trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -127,13 +128,20 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);

- radix_tree_delete(&mapping->page_tree, page->index);
+ page->mapping = NULL;
+ nr = hpagecache_nr_pages(page);
+ for (i = 0; i < nr; i++)
+ radix_tree_delete(&mapping->page_tree, page->index + i);
+ /* thp */
+ if (nr > 1)
+ __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
- mapping->nrpages--;
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ mapping->nrpages -= nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
BUG_ON(page_mapped(page));

/*
@@ -144,8 +152,8 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+ add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}
}

--
1.8.4.rc3

2013-09-23 12:06:27

by Kirill A. Shutemov

Subject: [PATCHv6 01/22] mm: implement zero_huge_user_segment and friends

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment and zero_user, but for huge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 18 ++++++++++++++++++
mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 54 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8b6e55ee88..a7b7e62930 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1809,9 +1809,27 @@ extern void dump_page(struct page *page);
extern void clear_huge_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+ unsigned start, unsigned end);
+static inline void zero_huge_user(struct page *page,
+ unsigned start, unsigned len)
+{
+ zero_huge_user_segment(page, start, start + len);
+}
extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
unsigned int pages_per_huge_page);
+#else
+static inline void zero_huge_user_segment(struct page *page,
+ unsigned start, unsigned end)
+{
+ BUILD_BUG();
+}
+static inline void zero_huge_user(struct page *page,
+ unsigned start, unsigned len)
+{
+ BUILD_BUG();
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HUGETLBFS */

#ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/mm/memory.c b/mm/memory.c
index ca00039471..e5f74cd634 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4291,6 +4291,42 @@ void clear_huge_page(struct page *page,
}
}

+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+ int i;
+ unsigned start_idx, end_idx;
+ unsigned start_off, end_off;
+
+ BUG_ON(end < start);
+
+ might_sleep();
+
+ if (start == end)
+ return;
+
+ start_idx = start >> PAGE_SHIFT;
+ start_off = start & ~PAGE_MASK;
+ end_idx = (end - 1) >> PAGE_SHIFT;
+ end_off = ((end - 1) & ~PAGE_MASK) + 1;
+
+ /*
+ * if start and end are on the same small page we can call
+ * zero_user_segment() once and save one kmap_atomic().
+ */
+ if (start_idx == end_idx)
+ return zero_user_segment(page + start_idx, start_off, end_off);
+
+ /* zero the first (possibly partial) page */
+ zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
+ for (i = start_idx + 1; i < end_idx; i++) {
+ cond_resched();
+ clear_highpage(page + i);
+ flush_dcache_page(page + i);
+ }
+ /* zero the last (possibly partial) page */
+ zero_user_segment(page + end_idx, 0, end_off);
+}
+
static void copy_user_gigantic_page(struct page *dst, struct page *src,
unsigned long addr,
struct vm_area_struct *vma,
--
1.8.4.rc3

2013-09-23 12:06:45

by Kirill A. Shutemov

Subject: [PATCHv6 16/22] thp, mm: handle transhuge pages in do_generic_file_read()

If a transhuge page is already in the page cache (up to date and not under
readahead), we go the usual path: read from the relevant subpage (head or
tail).

If the page is not cached (a sparse file in the ramfs case) and the mapping
can have huge pages, we try to allocate a new one and read it.

If a page is not up to date or is under readahead, we have to move 'page'
to the head page of the compound page, since it represents the state of the
whole transhuge page. We switch back to the relevant subpage once the page
is ready to be read (the 'page_ok' label).

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 91 +++++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 66 insertions(+), 25 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 38d6856737..9bbc024e4c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1122,6 +1122,27 @@ static void shrink_readahead_size_eio(struct file *filp,
ra->ra_pages /= 4;
}

+static unsigned long page_cache_mask(struct page *page)
+{
+ if (PageTransHugeCache(page))
+ return HPAGE_PMD_MASK;
+ else
+ return PAGE_CACHE_MASK;
+}
+
+static unsigned long pos_to_off(struct page *page, loff_t pos)
+{
+ return pos & ~page_cache_mask(page);
+}
+
+static unsigned long pos_to_index(struct page *page, loff_t pos)
+{
+ if (PageTransHugeCache(page))
+ return pos >> HPAGE_PMD_SHIFT;
+ else
+ return pos >> PAGE_CACHE_SHIFT;
+}
+
/**
* do_generic_file_read - generic file read routine
* @filp: the file to read
@@ -1143,17 +1164,12 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
struct file_ra_state *ra = &filp->f_ra;
pgoff_t index;
pgoff_t last_index;
- pgoff_t prev_index;
- unsigned long offset; /* offset into pagecache page */
- unsigned int prev_offset;
int error;

index = *ppos >> PAGE_CACHE_SHIFT;
- prev_index = ra->prev_pos >> PAGE_CACHE_SHIFT;
- prev_offset = ra->prev_pos & (PAGE_CACHE_SIZE-1);
last_index = (*ppos + desc->count + PAGE_CACHE_SIZE-1) >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;

+ i_split_down_read(inode);
for (;;) {
struct page *page;
pgoff_t end_index;
@@ -1172,8 +1188,12 @@ find_page:
ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
- if (unlikely(page == NULL))
- goto no_cached_page;
+ if (unlikely(page == NULL)) {
+ if (mapping_can_have_hugepages(mapping))
+ goto no_cached_page_thp;
+ else
+ goto no_cached_page;
+ }
}
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
@@ -1190,7 +1210,7 @@ find_page:
if (!page->mapping)
goto page_not_up_to_date_locked;
if (!mapping->a_ops->is_partially_uptodate(page,
- desc, offset))
+ desc, pos_to_off(page, *ppos)))
goto page_not_up_to_date_locked;
unlock_page(page);
}
@@ -1206,21 +1226,25 @@ page_ok:

isize = i_size_read(inode);
end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
+ if (PageTransHugeCache(page)) {
+ index &= ~HPAGE_CACHE_INDEX_MASK;
+ end_index &= ~HPAGE_CACHE_INDEX_MASK;
+ }
if (unlikely(!isize || index > end_index)) {
page_cache_release(page);
goto out;
}

/* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
+ nr = PAGE_CACHE_SIZE << compound_order(page);
if (index == end_index) {
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
- if (nr <= offset) {
+ nr = ((isize - 1) & ~page_cache_mask(page)) + 1;
+ if (nr <= pos_to_off(page, *ppos)) {
page_cache_release(page);
goto out;
}
}
- nr = nr - offset;
+ nr = nr - pos_to_off(page, *ppos);

/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
@@ -1233,9 +1257,10 @@ page_ok:
* When a sequential read accesses a page several times,
* only mark it as accessed the first time.
*/
- if (prev_index != index || offset != prev_offset)
+ if (pos_to_index(page, ra->prev_pos) != index ||
+ pos_to_off(page, *ppos) !=
+ pos_to_off(page, ra->prev_pos))
mark_page_accessed(page);
- prev_index = index;

/*
* Ok, we have the page, and it's up-to-date, so
@@ -1247,11 +1272,10 @@ page_ok:
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
- offset += ret;
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
- prev_offset = offset;
+ ret = actor(desc, page, pos_to_off(page, *ppos), nr);
+ ra->prev_pos = *ppos;
+ *ppos += ret;
+ index = *ppos >> PAGE_CACHE_SHIFT;

page_cache_release(page);
if (ret == nr && desc->count)
@@ -1325,6 +1349,27 @@ readpage_error:
page_cache_release(page);
goto out;

+no_cached_page_thp:
+ page = alloc_pages(mapping_gfp_mask(mapping) | __GFP_COLD,
+ HPAGE_PMD_ORDER);
+ if (!page) {
+ count_vm_event(THP_READ_ALLOC_FAILED);
+ goto no_cached_page;
+ }
+ count_vm_event(THP_READ_ALLOC);
+
+ error = add_to_page_cache_lru(page, mapping,
+ pos_to_index(page, *ppos), GFP_KERNEL);
+ if (!error)
+ goto readpage;
+
+ page_cache_release(page);
+ if (error != -EEXIST && error != -ENOSPC) {
+ desc->error = error;
+ goto out;
+ }
+
+ /* Fallback to small page */
no_cached_page:
/*
* Ok, it wasn't cached, so we need to create a new
@@ -1348,11 +1393,7 @@ no_cached_page:
}

out:
- ra->prev_pos = prev_index;
- ra->prev_pos <<= PAGE_CACHE_SHIFT;
- ra->prev_pos |= prev_offset;
-
- *ppos = ((loff_t)index << PAGE_CACHE_SHIFT) + offset;
+ i_split_up_read(inode);
file_accessed(filp);
}

--
1.8.4.rc3

2013-09-23 12:06:43

by Kirill A. Shutemov

Subject: [PATCHv6 02/22] radix-tree: implement preload for multiple contiguous elements

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spinlock, which means we want to
pre-allocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries into an address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
The feature costs about 9.5KiB per CPU on x86_64; details below.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void *) * NR_CPUS. We want to increase the array size
to be able to handle 512 entries at once.

The size of the array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On a 64-bit system the old array size is 21, the new one is 38. The per-CPU
overhead of the feature is
for the preload array:
(38 - 21) * sizeof(void*) = 136 bytes
plus, if the preload array is full
(38 - 21) * sizeof(struct radix_tree_node) = 17 * 560 = 9520 bytes
total: 9656 bytes

On a 32-bit system the old array size is 11, the new one is 23. The per-CPU
overhead of the feature is
for the preload array:
(23 - 11) * sizeof(void*) = 48 bytes
plus, if the preload array is full
(23 - 11) * sizeof(struct radix_tree_node) = 12 * 296 = 3552 bytes
total: 3600 bytes

Since only THP uses batched preload at the moment, we disable it (set the
max preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/radix-tree.h | 11 ++++++
lib/radix-tree.c | 94 +++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 403940787b..3bf0b3e594 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do { \
(root)->rnode = NULL; \
} while (0)

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more then on item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR 512
+#else
+#define RADIX_TREE_PRELOAD_NR 1
+#endif
+
/**
* Radix-tree synchronization
*
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3b4e..544a00a93b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,51 @@ static struct kmem_cache *radix_tree_node_cachep;
* of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
* Hence:
*/
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ * - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
+ * - N <= number of items 2-level tree can contain:
+ * 1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is tree with only one element at index 0 and we add N
+ * items which cross boundary between items in root node.
+ *
+ * Basically, at least one index is less then
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT + 1)
+ *
+ * and one is equal to.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
+ * - +1 is for misalinged case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
+ * - -2, because root node and last level nodes are already accounted).
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+ ( RADIX_TREE_MAX_PATH + \
+ RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+ 2 * (RADIX_TREE_MAX_PATH - 2))
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif

/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
int nr;
- struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+ struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };

@@ -263,29 +300,43 @@ radix_tree_node_free(struct radix_tree_node *node)

/*
* Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail. On
- * success, return zero, with preemption disabled. On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
* __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ int preload_target = RADIX_TREE_PRELOAD_MIN;

+ if (size > 1) {
+ size = round_up(size, RADIX_TREE_MAP_SIZE);
+ if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+ "too large preload requested"))
+ return -ENOMEM;
+
+ /* The same math as with RADIX_TREE_PRELOAD_MAX */
+ preload_target = RADIX_TREE_MAX_PATH +
+ size / RADIX_TREE_MAP_SIZE + 1 +
+ 2 * (RADIX_TREE_MAX_PATH - 2);
+ }
+
+ BUG_ON(preload_target > RADIX_TREE_PRELOAD_MAX);
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+ while (rtp->nr < preload_target) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < preload_target)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +359,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
}
EXPORT_SYMBOL(radix_tree_preload);

@@ -320,13 +371,22 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfp_mask & __GFP_WAIT)
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
}
EXPORT_SYMBOL(radix_tree_maybe_preload);

+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_WAIT)
+ return __radix_tree_preload_contig(size, gfp_mask);
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ preempt_disable();
+ return 0;
+}
+
/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
@@ -1483,6 +1543,22 @@ static int radix_tree_callback(struct notifier_block *nfb,

void __init radix_tree_init(void)
{
+ /*
+ * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
+ * calculation, it's already complex enough:
+ * - it must be multiplier of RADIX_TREE_MAP_SIZE, otherwise we will
+ * have to round it up to next RADIX_TREE_MAP_SIZE multiplier and we
+ * don't win anything;
+ * - must be less then number of items 2-level tree can contain.
+ * It's easier to calculate number of internal nodes required
+ * this way.
+ */
+ if (RADIX_TREE_PRELOAD_NR != 1) {
+ BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+ BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+ 1UL << (2 * RADIX_TREE_MAP_SHIFT));
+ }
+
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
--
1.8.4.rc3

2013-09-23 12:06:42

by Kirill A. Shutemov

Subject: [PATCHv6 07/22] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

For a huge page we add HPAGE_CACHE_NR entries to the radix tree at once:
one for the specified index and HPAGE_CACHE_NR-1 for the following indexes,
all pointing to the head page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 24 ++++++++++++++++++++++++
include/linux/page-flags.h | 13 +++++++++++++
mm/filemap.c | 45 +++++++++++++++++++++++++++++++++++----------
3 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index fb0847572c..9747af1117 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -230,6 +230,20 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str

#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+
+#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
+#else
+
+#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
+#endif
+
static inline bool transparent_hugepage_pagecache(void)
{
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
@@ -243,4 +257,14 @@ static inline bool transparent_hugepage_pagecache(void)
return false;
return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
}
+
+static inline int hpagecache_nr_pages(struct page *page)
+{
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+ return hpage_nr_pages(page);
+
+ BUG_ON(PageTransHuge(page));
+ return 1;
+}
+
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 6d53675c2b..6d2d7ce3e1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -452,6 +452,19 @@ static inline int PageTransTail(struct page *page)
}
#endif

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static inline int PageTransHugeCache(struct page *page)
+{
+ return PageTransHuge(page);
+}
+#else
+
+static inline int PageTransHugeCache(struct page *page)
+{
+ return 0;
+}
+#endif
+
/*
* If network-based swap is enabled, sl*b must keep track of whether pages
* were allocated from pfmemalloc reserves.
diff --git a/mm/filemap.c b/mm/filemap.c
index c7e42aee5c..d2d6c0ebe9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -460,38 +460,63 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
int error;
+ int i, nr;

VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

+ /* memory cgroup controller handles thp pages on its side */
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
return error;

- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+ if (PageTransHugeCache(page))
+ BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+
+ nr = hpagecache_nr_pages(page);
+
+ error = radix_tree_maybe_preload_contig(nr, gfp_mask & ~__GFP_HIGHMEM);
if (error) {
mem_cgroup_uncharge_cache_page(page);
return error;
}

+ spin_lock_irq(&mapping->tree_lock);
page_cache_get(page);
- page->mapping = mapping;
page->index = offset;
-
- spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
+ page->mapping = mapping;
+ for (i = 0; i < nr; i++) {
+ error = radix_tree_insert(&mapping->page_tree,
+ offset + i, page);
+ /*
+ * In the midle of THP we can collide with small page which was
+ * established before THP page cache is enabled or by other VMA
+ * with bad alignement (most likely MAP_FIXED).
+ */
+ if (error) {
+ i--; /* failed to insert anything at offset + i */
+ goto err_insert;
+ }
+ }
radix_tree_preload_end();
- if (unlikely(error))
- goto err_insert;
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
+ mapping->nrpages += nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ if (PageTransHuge(page))
+ __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
spin_unlock_irq(&mapping->tree_lock);
trace_mm_filemap_add_to_page_cache(page);
return 0;
err_insert:
- page->mapping = NULL;
+ radix_tree_preload_end();
+ if (i != 0)
+ error = -ENOSPC; /* no space for a huge page */
+
/* Leave page->index set: truncation relies upon it */
+ page->mapping = NULL;
+ for (; i >= 0; i--)
+ radix_tree_delete(&mapping->page_tree, offset + i);
+
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
page_cache_release(page);
--
1.8.4.rc3

2013-09-23 12:06:24

by Kirill A. Shutemov

Subject: [PATCHv6 14/22] thp, mm: allocate huge pages in grab_cache_page_write_begin()

Try to allocate a huge page if flags has AOP_FLAG_TRANSHUGE set.

If, for some reason, it's not possible to allocate a huge page at this
position, the function returns NULL. The caller should take care of falling
back to small pages.
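
A minimal sketch of the expected calling pattern (the real caller is the
write_begin path in the libfs/ramfs patches; the fallback flow below is
illustrative only, and the index passed with AOP_FLAG_TRANSHUGE must be
huge-page aligned):

	/* try a huge page first, fall back to a small one on failure */
	page = grab_cache_page_write_begin(mapping, index,
					   flags | AOP_FLAG_TRANSHUGE);
	if (!page)
		page = grab_cache_page_write_begin(mapping, index,
						   flags & ~AOP_FLAG_TRANSHUGE);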

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/fs.h | 1 +
mm/filemap.c | 23 +++++++++++++++++++++--
2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26801f0bb1..42ccdeddd9 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -282,6 +282,7 @@ enum positive_aop_returns {
#define AOP_FLAG_NOFS 0x0004 /* used by filesystem to direct
* helper code (eg buffer layer)
* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE 0x0008 /* allocate transhuge page */

/*
* oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 3421bcaed4..410879a801 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2322,18 +2322,37 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
gfp_t gfp_mask;
struct page *page;
gfp_t gfp_notmask = 0;
+ bool must_use_thp = (flags & AOP_FLAG_TRANSHUGE) &&
+ IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);

gfp_mask = mapping_gfp_mask(mapping);
+ if (must_use_thp) {
+ BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+ BUG_ON(!(gfp_mask & __GFP_COMP));
+ }
if (mapping_cap_account_dirty(mapping))
gfp_mask |= __GFP_WRITE;
if (flags & AOP_FLAG_NOFS)
gfp_notmask = __GFP_FS;
repeat:
page = find_lock_page(mapping, index);
- if (page)
+ if (page) {
+ if (must_use_thp && !PageTransHuge(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ return NULL;
+ }
goto found;
+ }

- page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+ if (must_use_thp) {
+ page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+ if (page)
+ count_vm_event(THP_WRITE_ALLOC);
+ else
+ count_vm_event(THP_WRITE_ALLOC_FAILED);
+ } else
+ page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index,
--
1.8.4.rc3

2013-09-23 12:07:43

by Kirill A. Shutemov

Subject: [PATCHv6 04/22] thp: compile-time and sysfs knob for thp pagecache

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.
It's disabled by default.

Radix tree preload overhead can be significant on !BASE_FULL systems, so
let's add a dependency.

/sys/kernel/mm/transparent_hugepage/page_cache is the runtime knob for the
feature.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 9 +++++++++
include/linux/huge_mm.h | 14 ++++++++++++++
mm/Kconfig | 11 +++++++++++
mm/huge_memory.c | 23 +++++++++++++++++++++++
4 files changed, 57 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4a63953a41..4cc15c40f4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -103,6 +103,15 @@ echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

+If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled kernel will use huge pages in
+page cache if possible. It can be disable and re-enabled via sysfs:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
+echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
+
+If it's disabled kernel will not add new huge pages to page cache and
+split them on mapping, but already mapped pages will stay intakt.
+
It's also possible to limit defrag efforts in the VM to generate
hugepages in case they're not immediately free to madvise regions or
to never try to defrag memory and simply fallback to regular pages
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428c57..fb0847572c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+ TRANSPARENT_HUGEPAGE_PAGECACHE,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
#ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -229,4 +230,17 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str

#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+static inline bool transparent_hugepage_pagecache(void)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+ return false;
+ if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG)))
+ return false;
+
+ if (!(transparent_hugepage_flags &
+ ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
+ (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
+ return false;
+ return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 026771a9b0..562f12fd89 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -420,6 +420,17 @@ choice
benefit.
endchoice

+config TRANSPARENT_HUGEPAGE_PAGECACHE
+ bool "Transparent Hugepage Support for page cache"
+ depends on X86_64 && TRANSPARENT_HUGEPAGE
+ # avoid radix tree preload overhead
+ depends on BASE_FULL
+ help
+ Enabling the option adds support hugepages for file-backed
+ mappings. It requires transparent hugepage support from
+ filesystem side. For now, the only filesystem which supports
+ hugepages is ramfs.
+
config CROSS_MEMORY_ATTACH
bool "Cross Memory Support"
depends on MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7489884682..59f099b93f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -42,6 +42,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
#endif
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+ (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)|
+#endif
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);

/* default scan 8*512 pte (or vmas) every 30 second */
@@ -362,6 +365,23 @@ static ssize_t defrag_store(struct kobject *kobj,
static struct kobj_attribute defrag_attr =
__ATTR(defrag, 0644, defrag_show, defrag_store);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static ssize_t page_cache_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static ssize_t page_cache_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ return single_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static struct kobj_attribute page_cache_attr =
+ __ATTR(page_cache, 0644, page_cache_show, page_cache_store);
+#endif
+
static ssize_t use_zero_page_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -397,6 +417,9 @@ static struct kobj_attribute debug_cow_attr =
static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+ &page_cache_attr.attr,
+#endif
&use_zero_page_attr.attr,
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
--
1.8.4.rc3

2013-09-23 12:08:06

by Kirill A. Shutemov

Subject: [PATCHv6 20/22] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

We're going to have huge pages backed by files, so we need to modify
wait_split_huge_page() to support that.

We have two options:
- check whether the page is anon or not and serialize only over the
required lock;
- always serialize over both locks.

The current implementation, in fact, guarantees that *all* pages on the vma
are not splitting, not only the page the pmd is pointing to.

For now I prefer the second option since it's the safest: we provide the
same level of guarantees.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 15 ++++++++++++---
mm/huge_memory.c | 4 ++--
mm/memory.c | 4 ++--
3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ce9fcae8ef..9bc9937498 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,11 +111,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
__split_huge_page_pmd(__vma, __address, \
____pmd); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
+#define wait_split_huge_page(__vma, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
+ struct address_space *__mapping = (__vma)->vm_file ? \
+ (__vma)->vm_file->f_mapping : NULL; \
+ struct anon_vma *__anon_vma = (__vma)->anon_vma; \
+ if (__mapping) \
+ mutex_lock(&__mapping->i_mmap_mutex); \
+ if (__anon_vma) { \
+ anon_vma_lock_write(__anon_vma); \
+ anon_vma_unlock_write(__anon_vma); \
+ } \
+ if (__mapping) \
+ mutex_unlock(&__mapping->i_mmap_mutex); \
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3c45c62cde..d0798e5122 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -913,7 +913,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
spin_unlock(&dst_mm->page_table_lock);
pte_free(dst_mm, pgtable);

- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ wait_split_huge_page(vma, src_pmd); /* src_vma */
goto out;
}
src_page = pmd_page(pmd);
@@ -1497,7 +1497,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&vma->vm_mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return -1;
} else {
/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index e5f74cd634..dc5a56cab7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -584,7 +584,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return 0;
}

@@ -1520,7 +1520,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
pmd, flags);
--
1.8.4.rc3

2013-09-23 12:06:18

by Kirill A. Shutemov

Subject: [PATCHv6 06/22] thp: represent file thp pages in meminfo and friends

The patch adds a new zone stat to count file transparent huge pages and
adjusts the related places.

For now we don't count mapped or dirty file thp pages separately.

The patch depends on the patch
"thp: account anon transparent huge pages into NR_ANON_PAGES".

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
drivers/base/node.c | 4 ++++
fs/proc/meminfo.c | 3 +++
include/linux/mmzone.h | 1 +
mm/vmstat.c | 1 +
4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43bf7e..de261f5722 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d FileHugePages: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
, nid,
K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
+ , nid,
+ K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6088..a62952cd4f 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "FileHugePages: %8lu kB\n"
#endif
,
K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR)
+ ,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
#endif
);

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bd791e452a..8b4525bd4f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -143,6 +143,7 @@ enum zone_stat_item {
NUMA_OTHER, /* allocation from other node */
#endif
NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_FILE_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9bb3145779..9af0d8536b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -771,6 +771,7 @@ const char * const vmstat_text[] = {
"numa_other",
#endif
"nr_anon_transparent_hugepages",
+ "nr_file_transparent_hugepages",
"nr_free_cma",
"nr_dirty_threshold",
"nr_dirty_background_threshold",
--
1.8.4.rc3

2013-09-23 12:06:15

by Kirill A. Shutemov

Subject: [PATCHv6 05/22] thp, mm: introduce mapping_can_have_hugepages() predicate

Returns true if the mapping can have huge pages. For now, just check for
__GFP_COMP in the mapping's gfp mask.
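
A filesystem opts in by including __GFP_COMP in the mapping's gfp mask.
The ramfs patch later in the series does essentially this (sketch; the
exact gfp flags are the filesystem's choice, GFP_TRANSHUGE includes
__GFP_COMP):

	mapping_set_gfp_mask(inode->i_mapping, GFP_TRANSHUGE);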

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75a07..ad60dcc50e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
(__force unsigned long)mask;
}

+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+ gfp_t gfp_mask = mapping_gfp_mask(m);
+
+ if (!transparent_hugepage_pagecache())
+ return false;
+
+ /*
+ * It's up to filesystem what gfp mask to use.
+ * The only part of GFP_TRANSHUGE which matters for us is __GFP_COMP.
+ */
+ return !!(gfp_mask & __GFP_COMP);
+}
+
/*
* The page cache can done in larger chunks than
* one page, because it allows for more efficient
--
1.8.4.rc3

2013-09-23 12:09:09

by Kirill A. Shutemov

Subject: [PATCHv6 19/22] thp: handle file pages in split_huge_page()

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

When we add a huge page to the page cache we take a reference only on the
head page, but on split we need to take an additional reference on all tail
pages, since they are still in the page cache after splitting.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 120 +++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 107 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 59f099b93f..3c45c62cde 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1584,6 +1584,7 @@ static void __split_huge_page_refcount(struct page *page,
struct zone *zone = page_zone(page);
struct lruvec *lruvec;
int tail_count = 0;
+ int initial_tail_refcount;

/* prevent PageLRU to go away from under us, and freeze lru stats */
spin_lock_irq(&zone->lru_lock);
@@ -1593,6 +1594,13 @@ static void __split_huge_page_refcount(struct page *page,
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(page);

+ /*
+ * When we add a huge page to page cache we take only reference to head
+ * page, but on split we need to take addition reference to all tail
+ * pages since they are still in page cache after splitting.
+ */
+ initial_tail_refcount = PageAnon(page) ? 0 : 1;
+
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
struct page *page_tail = page + i;

@@ -1615,8 +1623,9 @@ static void __split_huge_page_refcount(struct page *page,
* atomic_set() here would be safe on all archs (and
* not only on x86), it's safer to use atomic_add().
*/
- atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
- &page_tail->_count);
+ atomic_add(initial_tail_refcount + page_mapcount(page) +
+ page_mapcount(page_tail) + 1,
+ &page_tail->_count);

/* after clearing PageTail the gup refcount can be released */
smp_mb();
@@ -1655,23 +1664,23 @@ static void __split_huge_page_refcount(struct page *page,
*/
page_tail->_mapcount = page->_mapcount;

- BUG_ON(page_tail->mapping);
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
page_nid_xchg_last(page_tail, page_nid_last(page));

- BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
BUG_ON(!PageDirty(page_tail));
- BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec, list);
}
atomic_sub(tail_count, &page->_count);
BUG_ON(atomic_read(&page->_count) <= 0);

- __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+ if (PageAnon(page))
+ __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+ else
+ __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);

ClearPageCompound(page);
compound_unlock(page);
@@ -1771,7 +1780,7 @@ static int __split_huge_page_map(struct page *page,
}

/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
struct anon_vma *anon_vma,
struct list_head *list)
{
@@ -1795,7 +1804,7 @@ static void __split_huge_page(struct page *page,
* and establishes a child pmd before
* __split_huge_page_splitting() freezes the parent pmd (so if
* we fail to prevent copy_huge_pmd() from running until the
- * whole __split_huge_page() is complete), we will still see
+ * whole __split_anon_huge_page() is complete), we will still see
* the newly established pmd of the child later during the
* walk, to be able to set it as pmd_trans_splitting too.
*/
@@ -1826,14 +1835,11 @@ static void __split_huge_page(struct page *page,
* from the hugepage.
* Return 0 if the hugepage is split successfully otherwise return 1.
*/
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int split_anon_huge_page(struct page *page, struct list_head *list)
{
struct anon_vma *anon_vma;
int ret = 1;

- BUG_ON(is_huge_zero_page(page));
- BUG_ON(!PageAnon(page));
-
/*
* The caller does not necessarily hold an mmap_sem that would prevent
* the anon_vma disappearing so we first we take a reference to it
@@ -1851,7 +1857,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
goto out_unlock;

BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma, list);
+ __split_anon_huge_page(page, anon_vma, list);
count_vm_event(THP_SPLIT);

BUG_ON(PageCompound(page));
@@ -1862,6 +1868,94 @@ out:
return ret;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static void __split_file_mapping(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct radix_tree_iter iter;
+ void **slot;
+ int count = 1;
+
+ spin_lock(&mapping->tree_lock);
+ radix_tree_for_each_slot(slot, &mapping->page_tree,
+ &iter, page->index + 1) {
+ struct page *slot_page;
+
+ slot_page = radix_tree_deref_slot_protected(slot,
+ &mapping->tree_lock);
+ BUG_ON(slot_page != page);
+ radix_tree_replace_slot(slot, page + count);
+ if (++count == HPAGE_CACHE_NR)
+ break;
+ }
+ BUG_ON(count != HPAGE_CACHE_NR);
+ spin_unlock(&mapping->tree_lock);
+}
+
+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+ struct address_space *mapping = page->mapping;
+ struct inode *inode = mapping->host;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ int mapcount, mapcount2;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ down_write(&inode->i_split_sem);
+ mutex_lock(&mapping->i_mmap_mutex);
+ mapcount = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+
+ if (mapcount != page_mapcount(page))
+ printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+ mapcount, page_mapcount(page));
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page, list);
+ __split_file_mapping(page);
+
+ mapcount2 = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+
+ if (mapcount != mapcount2)
+ printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+ mapcount, mapcount2, page_mapcount(page));
+ BUG_ON(mapcount != mapcount2);
+ count_vm_event(THP_SPLIT);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ up_write(&inode->i_split_sem);
+
+ /*
+ * Drop small pages beyond i_size if any.
+ */
+ truncate_inode_pages(mapping, i_size_read(inode));
+ return 0;
+}
+#else
+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+ BUG();
+}
+#endif
+
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+ BUG_ON(is_huge_zero_page(page));
+
+ if (PageAnon(page))
+ return split_anon_huge_page(page, list);
+ else
+ return split_file_huge_page(page, list);
+}
+
#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
--
1.8.4.rc3

2013-09-23 12:09:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 18/22] truncate: support huge pages

truncate_inode_pages_range() drops whole huge page at once if it's fully
inside the range.

If a huge page is only partly in the range we zero out the part,
exactly like we do for partial small pages.

In some cases it is worth splitting the huge page instead, if we need to
truncate it partially and free some memory. But split_huge_page() now
truncates the file itself, so we need to break the truncate<->split
interdependency at some point.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 9 ++++
mm/truncate.c | 125 ++++++++++++++++++++++++++++++++++++++----------
2 files changed, 109 insertions(+), 25 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 967aadbc5e..8ce130fe56 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -580,4 +580,13 @@ static inline void clear_pagecache_page(struct page *page)
clear_highpage(page);
}

+static inline void zero_pagecache_segment(struct page *page,
+ unsigned start, unsigned len)
+{
+ if (PageTransHugeCache(page))
+ zero_huge_user_segment(page, start, len);
+ else
+ zero_user_segment(page, start, len);
+}
+
#endif /* _LINUX_PAGEMAP_H */
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683afd..ba62ab2168 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -203,10 +203,10 @@ int invalidate_inode_page(struct page *page)
void truncate_inode_pages_range(struct address_space *mapping,
loff_t lstart, loff_t lend)
{
+ struct inode *inode = mapping->host;
pgoff_t start; /* inclusive */
pgoff_t end; /* exclusive */
- unsigned int partial_start; /* inclusive */
- unsigned int partial_end; /* exclusive */
+ bool partial_start, partial_end;
struct pagevec pvec;
pgoff_t index;
int i;
@@ -215,15 +215,13 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (mapping->nrpages == 0)
return;

- /* Offsets within partial pages */
+ /* Whether we have to do partial truncate */
partial_start = lstart & (PAGE_CACHE_SIZE - 1);
partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);

/*
* 'start' and 'end' always covers the range of pages to be fully
- * truncated. Partial pages are covered with 'partial_start' at the
- * start of the range and 'partial_end' at the end of the range.
- * Note that 'end' is exclusive while 'lend' is inclusive.
+ * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
*/
start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (lend == -1)
@@ -236,10 +234,12 @@ void truncate_inode_pages_range(struct address_space *mapping,
else
end = (lend + 1) >> PAGE_CACHE_SHIFT;

+ i_split_down_read(inode);
pagevec_init(&pvec, 0);
index = start;
while (index < end && pagevec_lookup(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+ bool thp = false;
mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -249,6 +249,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index >= end)
break;

+ thp = PageTransHugeCache(page);
+ if (thp) {
+ /* the range starts in middle of huge page */
+ if (index < start) {
+ partial_start = true;
+ start = index + HPAGE_CACHE_NR;
+ break;
+ }
+
+ /* the range ends on huge page */
+ if (index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
+ partial_end = true;
+ end = index;
+ break;
+ }
+ }
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -258,54 +275,88 @@ void truncate_inode_pages_range(struct address_space *mapping,
}
truncate_inode_page(mapping, page);
unlock_page(page);
+ if (thp)
+ break;
}
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
cond_resched();
- index++;
+ if (thp)
+ index += HPAGE_CACHE_NR;
+ else
+ index++;
}

if (partial_start) {
- struct page *page = find_lock_page(mapping, start - 1);
+ struct page *page;
+
+ page = find_get_page(mapping, start - 1);
if (page) {
- unsigned int top = PAGE_CACHE_SIZE;
- if (start > end) {
- /* Truncation within a single page */
- top = partial_end;
- partial_end = 0;
+ pgoff_t index_mask;
+ loff_t page_cache_mask;
+ unsigned pstart, pend;
+
+ if (PageTransHugeCache(page)) {
+ index_mask = HPAGE_CACHE_INDEX_MASK;
+ page_cache_mask = HPAGE_PMD_MASK;
+ } else {
+ index_mask = 0UL;
+ page_cache_mask = PAGE_CACHE_MASK;
}
+
+ pstart = lstart & ~page_cache_mask;
+ if ((end & ~index_mask) == page->index) {
+ pend = (lend + 1) & ~page_cache_mask;
+ end = page->index;
+ partial_end = false; /* handled here */
+ } else
+ pend = PAGE_CACHE_SIZE << compound_order(page);
+
+ lock_page(page);
wait_on_page_writeback(page);
- zero_user_segment(page, partial_start, top);
+ zero_pagecache_segment(page, pstart, pend);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
- do_invalidatepage(page, partial_start,
- top - partial_start);
+ do_invalidatepage(page, pstart,
+ pend - pstart);
unlock_page(page);
page_cache_release(page);
}
}
if (partial_end) {
- struct page *page = find_lock_page(mapping, end);
+ struct page *page;
+
+ page = find_lock_page(mapping, end);
if (page) {
+ loff_t page_cache_mask;
+ unsigned pend;
+
+ if (PageTransHugeCache(page))
+ page_cache_mask = HPAGE_PMD_MASK;
+ else
+ page_cache_mask = PAGE_CACHE_MASK;
+ pend = (lend + 1) & ~page_cache_mask;
+ end = page->index;
wait_on_page_writeback(page);
- zero_user_segment(page, 0, partial_end);
+ zero_pagecache_segment(page, 0, pend);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
- do_invalidatepage(page, 0,
- partial_end);
+ do_invalidatepage(page, 0, pend);
unlock_page(page);
page_cache_release(page);
}
}
/*
- * If the truncation happened within a single page no pages
- * will be released, just zeroed, so we can bail out now.
+ * If the truncation happened within a single page no
+ * pages will be released, just zeroed, so we can bail
+ * out now.
*/
if (start >= end)
- return;
+ goto out;

index = start;
for ( ; ; ) {
+ bool thp = false;
cond_resched();
if (!pagevec_lookup(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
@@ -327,16 +378,24 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index >= end)
break;

+ thp = PageTransHugeCache(page);
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
truncate_inode_page(mapping, page);
unlock_page(page);
+ if (thp)
+ break;
}
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
- index++;
+ if (thp)
+ index += HPAGE_CACHE_NR;
+ else
+ index++;
}
+out:
+ i_split_up_read(inode);
cleancache_invalidate_inode(mapping);
}
EXPORT_SYMBOL(truncate_inode_pages_range);
@@ -375,6 +434,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
+ struct inode *inode = mapping->host;
struct pagevec pvec;
pgoff_t index = start;
unsigned long ret;
@@ -389,9 +449,11 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
* (most pages are dirty), and already skips over any difficulties.
*/

+ i_split_down_read(inode);
pagevec_init(&pvec, 0);
while (index <= end && pagevec_lookup(&pvec, mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ bool thp = false;
mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];
@@ -401,6 +463,15 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (index > end)
break;

+ /* skip huge page if it's not fully in the range */
+ thp = PageTransHugeCache(page);
+ if (thp) {
+ if (index < start)
+ break;
+ if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+ break;
+ }
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -417,8 +488,12 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
cond_resched();
- index++;
+ if (thp)
+ index += HPAGE_CACHE_NR;
+ else
+ index++;
}
+ i_split_up_read(inode);
return count;
}
EXPORT_SYMBOL(invalidate_mapping_pages);
--
1.8.4.rc3

2013-09-23 12:09:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 21/22] thp, mm: split huge page on mmap file page

We are not ready to mmap file-backed transparent huge pages. Let's split
them on a fault attempt.

Later we'll implement mmap() properly and this code path will be used for
fallback cases.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9bbc024e4c..01a8f9945a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1736,6 +1736,8 @@ retry_find:
goto no_cached_page;
}

+ if (PageTransCompound(page))
+ split_huge_page(compound_trans_head(page));
if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
page_cache_release(page);
return ret | VM_FAULT_RETRY;
--
1.8.4.rc3

2013-09-23 12:09:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 22/22] ramfs: enable transparent huge page cache

ramfs is the simplest fs from the page cache point of view. Let's start
enabling transparent huge page cache here.

ramfs pages are not movable[1] and switching to transhuge pages doesn't
affect that. We need to fix this eventually.

[1] http://lkml.org/lkml/2013/4/2/720

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/ramfs/file-mmu.c | 2 +-
fs/ramfs/inode.c | 6 +++++-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 4884ac5ae9..ae787bf9ba 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -32,7 +32,7 @@

const struct address_space_operations ramfs_aops = {
.readpage = simple_readpage,
- .write_begin = simple_write_begin,
+ .write_begin = simple_thp_write_begin,
.write_end = simple_write_end,
.set_page_dirty = __set_page_dirty_no_writeback,
};
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 39d14659a8..5dafdfcd86 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ /*
+ * TODO: make ramfs pages movable
+ */
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_TRANSHUGE & ~__GFP_MOVABLE);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
--
1.8.4.rc3

2013-09-23 12:10:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 12/22] thp, mm: add event counters for huge page alloc on file write or read

Existing stats specify the source of a thp page: fault or collapse. We're
going to allocate new huge pages on write(2) and read(2). That's neither a
fault nor a collapse.

Let's introduce new events for that.
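
For illustration, an allocation site on the write(2) path might bump the
new counters roughly like this (a sketch, not a hunk from this series):

        page = alloc_pages(gfp_mask | __GFP_COMP, HPAGE_PMD_ORDER);
        if (page)
                count_vm_event(THP_WRITE_ALLOC);
        else
                count_vm_event(THP_WRITE_ALLOC_FAILED);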

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 7 +++++++
include/linux/huge_mm.h | 5 +++++
include/linux/vm_event_item.h | 4 ++++
mm/vmstat.c | 4 ++++
4 files changed, 20 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4cc15c40f4..a78f738403 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -202,6 +202,10 @@ thp_collapse_alloc is incremented by khugepaged when it has found
a range of pages to collapse into one huge page and has
successfully allocated a new huge page to store the data.

+thp_write_alloc and thp_read_alloc are incremented every time a huge
+ page is successfully allocated to handle write(2) to a file or
+ read(2) from file.
+
thp_fault_fallback is incremented if a page fault fails to allocate
a huge page and instead falls back to using small pages.

@@ -209,6 +213,9 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.

+thp_write_alloc_failed and thp_read_alloc_failed are incremented if
+ huge page allocation failed when tried on write(2) or read(2).
+
thp_split is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9747af1117..3700ada4d2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -183,6 +183,11 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

+#define THP_WRITE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED ({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC_FAILED ({ BUILD_BUG(); 0; })
+
#define hpage_nr_pages(x) 1

#define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a22a..8e071bbaa0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_WRITE_ALLOC,
+ THP_WRITE_ALLOC_FAILED,
+ THP_READ_ALLOC,
+ THP_READ_ALLOC_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 9af0d8536b..5d1eb7dbf1 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -847,6 +847,10 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_write_alloc",
+ "thp_write_alloc_failed",
+ "thp_read_alloc",
+ "thp_read_alloc_failed",
"thp_split",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
--
1.8.4.rc3

2013-09-23 12:10:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 17/22] thp, libfs: initial thp support

simple_readpage() and simple_write_end() are modified to handle huge
pages.

simple_thp_write_begin() is introduced to allocate huge pages on write.
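
A filesystem opts in by pointing its write_begin hook at the new helper;
the ramfs patch in this series does exactly that:

        const struct address_space_operations ramfs_aops = {
                .readpage       = simple_readpage,
                .write_begin    = simple_thp_write_begin,
                .write_end      = simple_write_end,
                .set_page_dirty = __set_page_dirty_no_writeback,
        };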

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/libfs.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----
include/linux/fs.h | 7 ++++++
include/linux/pagemap.h | 8 +++++++
3 files changed, 69 insertions(+), 4 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3a3a9b53bf..807f66098e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -364,7 +364,7 @@ EXPORT_SYMBOL(simple_setattr);

int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
+ clear_pagecache_page(page);
flush_dcache_page(page);
SetPageUptodate(page);
unlock_page(page);
@@ -424,9 +424,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,

/* zero the stale part of the page if we did a short copy */
if (copied < len) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user(page, from + copied, len - copied);
+ unsigned from;
+ if (PageTransHugeCache(page)) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user(page, from + copied, len - copied);
+ } else {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user(page, from + copied, len - copied);
+ }
}

if (!PageUptodate(page))
@@ -445,6 +450,51 @@ int simple_write_end(struct file *file, struct address_space *mapping,
return copied;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+int simple_thp_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ struct page *page = NULL;
+ pgoff_t index;
+
+ index = pos >> PAGE_CACHE_SHIFT;
+
+ /*
+ * Do not allocate a huge page in the first huge page range in page
+ * cache. This way we can avoid most small files overhead.
+ */
+ if (mapping_can_have_hugepages(mapping) &&
+ pos >= HPAGE_PMD_SIZE) {
+ page = grab_cache_page_write_begin(mapping,
+ index & ~HPAGE_CACHE_INDEX_MASK,
+ flags | AOP_FLAG_TRANSHUGE);
+ /* fallback to small page */
+ if (!page) {
+ unsigned long offset;
+ offset = pos & ~PAGE_CACHE_MASK;
+ /* adjust the len to not cross small page boundary */
+ len = min_t(unsigned long,
+ len, PAGE_CACHE_SIZE - offset);
+ }
+ BUG_ON(page && !PageTransHuge(page));
+ }
+ if (!page)
+ return simple_write_begin(file, mapping, pos, len, flags,
+ pagep, fsdata);
+
+ *pagep = page;
+
+ if (!PageUptodate(page) && len != HPAGE_PMD_SIZE) {
+ unsigned from = pos & ~HPAGE_PMD_MASK;
+
+ zero_huge_user_segment(page, 0, from);
+ zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE);
+ }
+ return 0;
+}
+#endif
+
/*
* the inodes created here are not hashed. If you use iunique to generate
* unique inode values later for this filesystem, then you must take care
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 42ccdeddd9..71a5ce4472 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2566,6 +2566,13 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping,
extern int simple_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+extern int simple_thp_write_begin(struct file *file,
+ struct address_space *mapping, loff_t pos, unsigned len,
+ unsigned flags, struct page **pagep, void **fsdata);
+#else
+#define simple_thp_write_begin simple_write_begin
+#endif

extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ad60dcc50e..967aadbc5e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -572,4 +572,12 @@ static inline int add_to_page_cache(struct page *page,
return error;
}

+static inline void clear_pagecache_page(struct page *page)
+{
+ if (PageTransHuge(page))
+ zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+ else
+ clear_highpage(page);
+}
+
#endif /* _LINUX_PAGEMAP_H */
--
1.8.4.rc3

2013-09-23 12:11:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 15/22] thp, mm: naive support of thp in generic_perform_write

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing storage.
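
As a worked example of the subpage adjustment in the hunk below
(illustrative numbers, assuming 4K pages and 2M huge pages):

        /* pos = 0x203000, inside a huge page that starts at 0x200000 */
        off_t huge_offset = pos & ~HPAGE_PMD_MASK;        /* 0x3000 */
        int subpage_nr = huge_offset >> PAGE_CACHE_SHIFT; /* 3 */
        /* the copy then targets page + 3, i.e. the fourth small page */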

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 410879a801..38d6856737 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2384,12 +2384,14 @@ static ssize_t generic_perform_write(struct file *file,
if (segment_eq(get_fs(), KERNEL_DS))
flags |= AOP_FLAG_UNINTERRUPTIBLE;

+ i_split_down_read(mapping->host);
do {
struct page *page;
unsigned long offset; /* Offset into pagecache page */
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
void *fsdata;
+ int subpage_nr = 0;

offset = (pos & (PAGE_CACHE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2419,8 +2421,14 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);

+ if (PageTransHuge(page)) {
+ off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+ subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+ }
+
pagefault_disable();
- copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+ copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+ offset, bytes);
pagefault_enable();
flush_dcache_page(page);

@@ -2457,6 +2465,7 @@ again:
}
} while (iov_iter_count(i));

+ i_split_up_read(mapping->host);
return written ? written : status;
}

--
1.8.4.rc3

2013-09-23 12:11:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 09/22] block: implement add_bdi_stat()

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts bdi stats by an arbitrary
amount. It's required for batched page cache manipulations.
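
A typical batched use, as in the delete_from_page_cache() rework in this
series:

        /* nr pages of a huge page leave the dirty accounting at once */
        mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
        add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);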

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/backing-dev.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5f66d519a7..39acfa974b 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
__add_bdi_stat(bdi, item, -1);
}

+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, s64 amount)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __add_bdi_stat(bdi, item, amount);
+ local_irq_restore(flags);
+}
+
static inline void dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
{
--
1.8.4.rc3

2013-09-23 12:06:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 11/22] thp, mm: warn if we try to use replace_page_cache_page() with THP

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache anytime soon.

Let's postpone implementation of THP handling in replace_page_cache_page()
until anyone actually needs it. Return -EINVAL and WARN_ONCE() for now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 60478ebeda..3421bcaed4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -417,6 +417,10 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
{
int error;

+ if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
+ "unexpected transhuge page\n"))
+ return -EINVAL;
+
VM_BUG_ON(!PageLocked(old));
VM_BUG_ON(!PageLocked(new));
VM_BUG_ON(new->mapping);
--
1.8.4.rc3

2013-09-23 12:12:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 08/22] mm: trace filemap: dump page order

Dump the page order to the trace output to be able to distinguish between
small and huge pages in the page cache.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
include/trace/events/filemap.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49a20..7e14b13470 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__field(struct page *, page)
__field(unsigned long, i_ino)
__field(unsigned long, index)
+ __field(int, order)
__field(dev_t, s_dev)
),

@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__entry->page = page;
__entry->i_ino = page->mapping->host->i_ino;
__entry->index = page->index;
+ __entry->order = compound_order(page);
if (page->mapping->host->i_sb)
__entry->s_dev = page->mapping->host->i_sb->s_dev;
else
__entry->s_dev = page->mapping->host->i_rdev;
),

- TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+ TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
__entry->i_ino,
__entry->page,
page_to_pfn(__entry->page),
- __entry->index << PAGE_SHIFT)
+ __entry->index << PAGE_SHIFT,
+ __entry->order)
);

DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
--
1.8.4.rc3

2013-09-23 12:12:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv6 13/22] mm, vfs: introduce i_split_sem

i_split_sem taken for read protects huge pages in an inode's page cache
against splitting.

i_split_sem is taken for write during splitting.
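
The usage pattern, as seen in the callers added elsewhere in this series
(sketched here for reference):

        /* reader side: any path that relies on a huge page staying huge */
        i_split_down_read(inode);
        /* ... work on huge pages in inode->i_mapping ... */
        i_split_up_read(inode);

        /* writer side: split_file_huge_page() */
        down_write(&inode->i_split_sem);
        /* ... split the page, update the radix tree ... */
        up_write(&inode->i_split_sem);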

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/inode.c | 3 +++
include/linux/fs.h | 3 +++
include/linux/huge_mm.h | 10 ++++++++++
3 files changed, 16 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index b33ba8e021..ea06e378c6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -162,6 +162,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)

atomic_set(&inode->i_dio_count, 0);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+ init_rwsem(&inode->i_split_sem);
+#endif
mapping->a_ops = &empty_aops;
mapping->host = inode;
mapping->flags = 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3f40547ba1..26801f0bb1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -610,6 +610,9 @@ struct inode {
atomic_t i_readcount; /* struct files open RO */
#endif
void *i_private; /* fs or device private pointer */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+ struct rw_semaphore i_split_sem;
+#endif
};

static inline int inode_unhashed(struct inode *inode)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3700ada4d2..ce9fcae8ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -241,12 +241,22 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str
#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)

+#define i_split_down_read(inode) down_read(&inode->i_split_sem)
+#define i_split_up_read(inode) up_read(&inode->i_split_sem)
+
#else

#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })

+static inline void i_split_down_read(struct inode *inode)
+{
+}
+
+static inline void i_split_up_read(struct inode *inode)
+{
+}
#endif

static inline bool transparent_hugepage_pagecache(void)
--
1.8.4.rc3

2013-09-24 23:37:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:

> It brings thp support for ramfs, but without mmap() -- it will be posted
> separately.

We were never going to do this :(

Has anyone reviewed these patches much yet?

> Please review and consider applying.

It appears rather too immature at this stage.

> Intro
> -----
>
> The goal of the project is preparing kernel infrastructure to handle huge
> pages in page cache.
>
> To proof that the proposed changes are functional we enable the feature
> for the most simple file system -- ramfs. ramfs is not that useful by
> itself, but it's good pilot project.

At the very least we should get this done for a real filesystem to see
how intrusive the changes are and to evaluate the performance changes.


Sigh. A pox on whoever thought up huge pages. Words cannot express
how much of a godawful mess they have made of Linux MM. And it hasn't
ended yet :( My take is that we'd need to see some very attractive and
convincing real-world performance numbers before even thinking of
taking this on.


2013-09-24 23:49:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?

There already was a lot of review by various people.

This is not the first post, just the latest refactoring.

> > Intro
> > -----
> >
> > The goal of the project is preparing kernel infrastructure to handle huge
> > pages in page cache.
> >
> > To proof that the proposed changes are functional we enable the feature
> > for the most simple file system -- ramfs. ramfs is not that useful by
> > itself, but it's good pilot project.
>
> At the very least we should get this done for a real filesystem to see
> how intrusive the changes are and to evaluate the performance changes.

That would give even larger patches, and people already complain
the patchkit is too large.

The only good way to handle this is baby steps, and you
have to start somewhere.

> Sigh. A pox on whoever thought up huge pages.

managing 1TB+ of memory in 4K chunks is just insane.
The question of larger pages is not "if", but only "when".

-Andi

--
[email protected] -- Speaking for myself only

2013-09-24 23:58:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <[email protected]> wrote:

> > At the very least we should get this done for a real filesystem to see
> > how intrusive the changes are and to evaluate the performance changes.
>
> That would give even larger patches, and people already complain
> the patchkit is too large.

The thing is that merging an implementation for ramfs commits us to
doing it for the major real filesystems. Before making that commitment
we should at least have a pretty good understanding of what those
changes will look like.

Plus I don't see how we can realistically performance-test it without
having real physical backing store in the picture?

2013-09-25 09:24:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

Ning Qu wrote:
> Hi, Kirill,
>
> Seems you dropped one patch in v5, is that intentional? Just wondering ...
>
> thp, mm: handle tail pages in page_cache_get_speculative()

It's not needed anymore, since we don't have tail pages in the radix tree.

--
Kirill A. Shutemov

2013-09-25 09:51:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?

Dave did a very good review. A few other people looked at individual patches.
See the Reviewed-by/Acked-by tags in the patches.

It looks like most mm experts are busy with numa balancing nowadays, so
it's hard to get more review.

The patchset was mostly ignored for a few rounds and Dave suggested splitting
it to have a less scary patch count.

> > Please review and consider applying.
>
> It appears rather too immature at this stage.

More review is always welcome and I'm committed to addressing issues.

--
Kirill A. Shutemov

2013-09-25 11:15:47

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

Andrew Morton wrote:
> On Tue, 24 Sep 2013 16:49:50 -0700 Andi Kleen <[email protected]> wrote:
>
> > > At the very least we should get this done for a real filesystem to see
> > > how intrusive the changes are and to evaluate the performance changes.
> >
> > That would give even larger patches, and people already complain
> > the patchkit is too large.
>
> The thing is that merging an implementation for ramfs commits us to
> doing it for the major real filesystems. Before making that commitment
> we should at least have a pretty good understanding of what those
> changes will look like.
>
> Plus I don't see how we can realistically performance-test it without
> having real physical backing store in the picture?

My plan for real filesystems is to make it beneficial first for read-mostly
files:
- allocate huge pages on read (or collapse small pages) only if nobody
has the inode opened for write;
- split huge pages on write, to avoid dealing with the writeback path at
first and to dirty only 4k pages.

This will get most ELF executables and libraries mapped with huge
pages (it may require dynamic linker change to align length to huge page
boundary), which is not bad for a start.
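
A rough sketch of that read-path policy (my illustration, not code from the
series; alloc_huge_cache_page() is a hypothetical helper and the i_writecount
test stands in for "nobody has it open for write"):

        /*
         * Read path (page not yet in cache): only read-mostly files
         * get huge pages.
         */
        if (mapping_can_have_hugepages(mapping) &&
            atomic_read(&inode->i_writecount) == 0)
                page = alloc_huge_cache_page(mapping, index); /* hypothetical */
        if (!page)
                page = page_cache_alloc_cold(mapping);

        /*
         * Write path (page already in cache): split before dirtying,
         * so only 4k pages get dirty and writeback stays unchanged.
         */
        if (PageTransHugeCache(page))
                split_huge_page(compound_trans_head(page));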

--
Kirill A. Shutemov

2013-09-25 15:05:53

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

> (it may require dynamic linker change to align length to huge page
> boundary)

x86-64 binaries should already be padded for this.

-Andi

2013-09-25 18:11:52

by Ning Qu

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

Got you. Thanks!

Best wishes,
--
Ning Qu (曲宁) | Software Engineer | [email protected] | +1-408-418-6066


On Wed, Sep 25, 2013 at 2:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> Ning Qu wrote:
>> Hi, Kirill,
>>
>> Seems you dropped one patch in v5, is that intentional? Just wondering ...
>>
>> thp, mm: handle tail pages in page_cache_get_speculative()
>
> It's not needed anymore, since we don't have tail pages in the radix tree.
>
> --
> Kirill A. Shutemov
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

2013-09-25 20:03:00

by Ning Qu

[permalink] [raw]
Subject: Re: [PATCHv6 10/22] thp, mm: rewrite delete_from_page_cache() to support huge pages

Best wishes,
--
Ning Qu (曲宁) | Software Engineer | [email protected] | +1-408-418-6066


On Mon, Sep 23, 2013 at 5:05 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> As with add_to_page_cache_locked() we handle HPAGE_CACHE_NR pages at a
> time.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/filemap.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index d2d6c0ebe9..60478ebeda 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -115,6 +115,7 @@
> void __delete_from_page_cache(struct page *page)
> {
> struct address_space *mapping = page->mapping;
> + int i, nr;
>
> trace_mm_filemap_delete_from_page_cache(page);
> /*
> @@ -127,13 +128,20 @@ void __delete_from_page_cache(struct page *page)
> else
> cleancache_invalidate_page(mapping, page);
>
> - radix_tree_delete(&mapping->page_tree, page->index);
> + page->mapping = NULL;
It seems that with this line added, we clear page->mapping twice: once
here and once again after radix_tree_delete(). Is this one necessary?

>
> + nr = hpagecache_nr_pages(page);
> + for (i = 0; i < nr; i++)
> + radix_tree_delete(&mapping->page_tree, page->index + i);
> + /* thp */
> + if (nr > 1)
> + __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> +
> page->mapping = NULL;
> /* Leave page->index set: truncation lookup relies upon it */
> - mapping->nrpages--;
> - __dec_zone_page_state(page, NR_FILE_PAGES);
> + mapping->nrpages -= nr;
> + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
> if (PageSwapBacked(page))
> - __dec_zone_page_state(page, NR_SHMEM);
> + __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
> BUG_ON(page_mapped(page));
>
> /*
> @@ -144,8 +152,8 @@ void __delete_from_page_cache(struct page *page)
> * having removed the page entirely.
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> - dec_zone_page_state(page, NR_FILE_DIRTY);
> - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> + mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
> + add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
> }
> }
>
> --
> 1.8.4.rc3
>

2013-09-25 23:29:31

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> >
> > We were never going to do this :(
> >
> > Has anyone reviewed these patches much yet?
>
> Dave did a very good review. A few other people looked at individual patches.
> See the Reviewed-by/Acked-by tags in the patches.
>
> It looks like most mm experts are busy with numa balancing nowadays, so
> it's hard to get more review.

Nobody has reviewed it from the filesystem side, though.

The changes that require special code paths for huge pages in the
write_begin/write_end paths are nasty. You're adding conditional
code that depends on the page size and then having to add checks to
ensure that large page operations don't step over small page
boundaries and other such corner cases. It's an extremely fragile
design, IMO.

In general, I don't like all the if (thp) {} else {}; code that this
series introduces - they are code paths that simply won't get tested
with any sort of regularity and make the code more complex for those
that aren't using THP to understand and debug...

Then there is a new per-inode lock that is used in
generic_perform_write() which is held across page faults and calls
to filesystem block mapping callbacks. This inserts into the middle
of an existing locking chain that needs to be strictly ordered, and
as such will lead to the same type of lock inversion problems that
the mmap_sem had. We do not want to introduce a new lock that has
this same problem just as we are getting rid of that long standing
nastiness from the page fault path...

I also note that you didn't convert invalidate_inode_pages2_range()
to support huge pages which is needed by real filesystems that
support direct IO. There are other truncate/invalidate interfaces
that you didn't convert, either, and some of them will present you
with interesting locking challenges as a result of adding that new
lock...

> The patchset was mostly ignored for a few rounds and Dave suggested splitting
> it to have a less scary patch count.

It's still being ignored by filesystem people because you haven't
actually tried to implement support into a real filesystem.....

> > > Please review and consider applying.
> >
> > It appears rather too immature at this stage.
>
> More review is always welcome and I'm committed to address issues.

IMO, supporting a real block based filesystem like ext4 or XFS and
demonstrating that everything works is necessary before we go any
further...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-09-26 18:32:24

by Zach Brown

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

> > Sigh. A pox on whoever thought up huge pages.
>
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".

And "how"!

Sprinkling a bunch of magical if (thp) {} else {} throughout the code
looks like a stunningly bad idea to me. It'd take real work to
restructure the code such that the current paths are a degenerate case
of the larger thp page case, but that's the work that needs doing in my
estimation.

- z

2013-09-26 19:06:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Thu, Sep 26, 2013 at 11:30:22AM -0700, Zach Brown wrote:
> > > Sigh. A pox on whoever thought up huge pages.
> >
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
>
> And "how"!
>
> Sprinkling a bunch of magical if (thp) {} else {} throughout the code
> looks like a stunningly bad idea to me. It'd take real work to
> restructure the code such that the current paths are a degenerate case
> of the larger thp page case, but that's the work that needs doing in my
> estimation.

Sorry, but that is how all large pages in the Linux VM work
(both THP and hugetlbfs).

Yes it would be nice if small pages and large pages all ran
in a unified VM. But that's not how Linux is designed today.

Yes having a Pony would be nice too.

Back when huge pages were originally proposed, Linus came
up with the "separate hugetlbfs VM" design and that is what we're
stuck with today.

Asking for a wholesale VM redesign is just not realistic.

The VM is always changing in baby steps. And the only
known way to do that is to have if (thp) and if (hugetlbfs).

-Andi

--
[email protected] -- Speaking for myself only

2013-09-26 21:13:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On 09/23/2013 05:05 AM, Kirill A. Shutemov wrote:
> To proof that the proposed changes are functional we enable the feature
> for the most simple file system -- ramfs. ramfs is not that useful by
> itself, but it's good pilot project.

This does, at the least, give us a shared memory mechanism that can move
between large and small pages. We don't have anything which can do that
today.

Tony Luck was just mentioning that if we have a small (say 1-bit) memory
failure in a hugetlbfs page, then we end up tossing out the entire 2MB.
The app gets a chance to recover the contents, but it has to do it for
the entire 2MB. Ideally, we'd like to break the 2M down into 4k pages,
which lets us continue using the remaining 2M-4k, and leaves the app to
rebuild 4k of its data instead of 2M.

If you look at the diffstat, it's also pretty obvious that virtually
none of this code is actually specific to ramfs. It'll all get used as
the foundation for the "real" filesystems too. I'm very interested in
how those end up looking, too, but I think Kirill is selling his patches
a bit short calling this a toy.

2013-09-30 10:03:01

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > It brings thp support for ramfs, but without mmap() -- it will be posted
> > separately.
>
> We were never going to do this :(
>
> Has anyone reviewed these patches much yet?
>

I am afraid I never looked too closely once I learned that the primary
motivation for this was relieving iTLB pressure in a very specific
case. AFAIK, this is not a problem in the vast majority of modern CPUs
and I found it very hard to be motivated to review the series as a result.
I suspected that in many cases that the cost of IO would continue to dominate
performance instead of TLB pressure. I also found it unlikely that there
was a workload that was tmpfs based that used enough memory to be hurt
by TLB pressure. My feedback was that a much more compelling case for the
series was needed but this discussion all happened on IRC unfortunately.

--
Mel Gorman
SUSE Labs

2013-09-30 10:10:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > separately.
> >
> > We were never going to do this :(
> >
> > Has anyone reviewed these patches much yet?
> >
>
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.
>

Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
benefit I would expect that sysV shared memory workloads would potentially
benefit from this. hugetlbfs is still required for shared memory areas
but it is not a problem that is addressed by this series.

--
Mel Gorman
SUSE Labs

2013-09-30 10:13:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > Sigh. A pox on whoever thought up huge pages.
>
> managing 1TB+ of memory in 4K chunks is just insane.
> The question of larger pages is not "if", but only "when".
>

Remember that there are at least two separate issues there. One is the
handling data in larger granularities than a 4K page and the second is
the TLB, pagetable etc handling. They are not necessarily the same problem.

--
Mel Gorman
SUSE Labs

2013-09-30 15:28:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On 09/30/2013 03:02 AM, Mel Gorman wrote:
> I am afraid I never looked too closely once I learned that the primary
> motivation for this was relieving iTLB pressure in a very specific
> case. AFAIK, this is not a problem in the vast majority of modern CPUs
> and I found it very hard to be motivated to review the series as a result.
> I suspected that in many cases that the cost of IO would continue to dominate
> performance instead of TLB pressure. I also found it unlikely that there
> was a workload that was tmpfs based that used enough memory to be hurt
> by TLB pressure. My feedback was that a much more compelling case for the
> series was needed but this discussion all happened on IRC unfortunately.

FWIW, I'm mostly intrigued by the possibilities of how this can speed up
_software_, and I'm rather uninterested in what it can do for the TLB.
Page cache is particularly painful today, precisely because hugetlbfs
and anonymous-thp aren't available there. If you have an app with
hundreds of GB of files that it wants to mmap(), even if it's in the
page cache, it takes _minutes_ to just fault in. One example:

https://lkml.org/lkml/2013/6/27/698

2013-09-30 16:05:37

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Mon, Sep 30, 2013 at 11:13:00AM +0100, Mel Gorman wrote:
> On Tue, Sep 24, 2013 at 04:49:50PM -0700, Andi Kleen wrote:
> > > Sigh. A pox on whoever thought up huge pages.
> >
> > managing 1TB+ of memory in 4K chunks is just insane.
> > The question of larger pages is not "if", but only "when".
> >
>
> Remember that there are at least two separate issues there. One is the
> handling data in larger granularities than a 4K page and the second is
> the TLB, pagetable etc handling. They are not necessarily the same problem.

It's the same problem in the end.

The hardware is struggling with 4K pages too (both i and d)

I expect longer term TLB/page optimization to have far more important
than all this NUMA placement work that people spend so much
time on.


-Andi
--
[email protected] -- Speaking for myself only

2013-09-30 18:06:20

by Ning Qu

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

Yes, I agree. In our case, we have tens of GB of files and thp with page
cache does improve the numbers as expected.

And compared to hugetlbfs (static huge pages), it's more flexible and
beneficial system-wide.


Best wishes,
--
Ning Qu (曲宁) | Software Engineer | [email protected] | +1-408-418-6066


On Mon, Sep 30, 2013 at 8:27 AM, Dave Hansen <[email protected]> wrote:
> On 09/30/2013 03:02 AM, Mel Gorman wrote:
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>
> FWIW, I'm mostly intrigued by the possibilities of how this can speed up
> _software_, and I'm rather uninterested in what it can do for the TLB.
> Page cache is particularly painful today, precisely because hugetlbfs
> and anonymous-thp aren't available there. If you have an app with
> hundreds of GB of files that it wants to mmap(), even if it's in the
> page cache, it takes _minutes_ to just fault in. One example:
>
> https://lkml.org/lkml/2013/6/27/698

2013-09-30 18:08:20

by Ning Qu

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

I suppose sysv shm and tmpfs share the same code base now, so both of
them will benefit from thp page cache?

And Kirill's previous patchset (up to v4) contained mmap support
as well. I suppose the patchset got split into smaller groups so
it's easier to review.

Best wishes,
--
Ning Qu (曲宁) | Software Engineer | [email protected] | +1-408-418-6066


On Mon, Sep 30, 2013 at 3:10 AM, Mel Gorman <[email protected]> wrote:
> On Mon, Sep 30, 2013 at 11:02:49AM +0100, Mel Gorman wrote:
>> On Tue, Sep 24, 2013 at 04:37:40PM -0700, Andrew Morton wrote:
>> > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>> >
>> > > It brings thp support for ramfs, but without mmap() -- it will be posted
>> > > separately.
>> >
>> > We were never going to do this :(
>> >
>> > Has anyone reviewed these patches much yet?
>> >
>>
>> I am afraid I never looked too closely once I learned that the primary
>> motivation for this was relieving iTLB pressure in a very specific
>> case. AFAIK, this is not a problem in the vast majority of modern CPUs
>> and I found it very hard to be motivated to review the series as a result.
>> I suspected that in many cases that the cost of IO would continue to dominate
>> performance instead of TLB pressure. I also found it unlikely that there
>> was a workload that was tmpfs based that used enough memory to be hurt
>> by TLB pressure. My feedback was that a much more compelling case for the
>> series was needed but this discussion all happened on IRC unfortunately.
>>
>
> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this. hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.
>
> --
> Mel Gorman
> SUSE Labs

2013-09-30 18:51:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

> AFAIK, this is not a problem in the vast majority of modern CPUs

Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
That's around 2MB. There's more and more code whose footprint exceeds
that.

Besides iTLB is not the only target. It is also useful for
data of course.

> > and I found it very hard to be motivated to review the series as a result.
> > I suspected that in many cases that the cost of IO would continue to dominate
> > performance instead of TLB pressure

The trend is to larger and larger memories, keeping things in memory.

In fact there's a good argument that memory sizes are growing faster
than TLB capacities. And without large TLBs we're even further off
the curve.

> Oh, one last thing I forgot. While tmpfs-based workloads were not likely to
> benefit I would expect that sysV shared memory workloads would potentially
> benefit from this. hugetlbfs is still required for shared memory areas
> but it is not a problem that is addressed by this series.

Of course it's only the first step. But if no one does the baby steps
then the other usages will never materialize.

I expect that once ramfs works, extending it to tmpfs etc. should be
straightforward.

-Andi

--
[email protected] -- Speaking for myself only

2013-10-01 08:38:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
> > AFAIK, this is not a problem in the vast majority of modern CPUs
>
> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
> That's around 2MB. There's more and more code whose footprint exceeds
> that.
>

With an expectation that it is read-mostly data, replicated between the
caches accessing it and TLB refills taking very little time. This is not
universally true and there are exceptions but even recent papers on TLB
behaviour have tended to dismiss the iTLB refill overhead as a negligible
portion of the overall workload of interest.

> Besides iTLB is not the only target. It is also useful for
> data of course.
>

True, but how useful? I have not seen an example of a workload showing that
dTLB pressure on file-backed data was a major component of the workload. I
would expect that sysV shared memory is an exception but does that require
generic support for all filesystems or can tmpfs be special cased when
it's used for shared memory?

For normal data, if it's read-only data then there would be some benefit to
using huge pages once the data is in page cache. How common are workloads
that mmap() large amounts of read-only data? Possibly some databases
depending on the workload although there I would expect that the data is
placed in shared memory.

If the mmap()s data is being written then the cost of IO is likely to
dominate, not TLB pressure. For write-mostly workloads there are greater
concerns because dirty tracking can only be done at the huge page boundary
potentially leading to greater amounts of IO and degraded performance
overall.

I could be completely wrong here but these were the concerns I had when
I first glanced through the patches. The changelogs had no information
to convince me otherwise so I never dedicated the time to reviewing the
patches in detail. I raised my concerns and then dropped it.

> > > and I found it very hard to be motivated to review the series as a result.
> > > I suspected that in many cases that the cost of IO would continue to dominate
> > > performance instead of TLB pressure
>
> The trend is to larger and larger memories, keeping things in memory.
>

Yes, but using huge pages is not *necessarily* the answer. For fault
scalability it probably would be a lot easier to batch handle faults if
readahead indicates accesses are sequential. Background zeroing of pages
could be revisited for fault intensive workloads. A potential alternative
is that a contiguous page is allocated, zeroed as one lump, split into small pages
and put onto a local per-task list although the details get messy. Reclaim
scanning could be heavily modified to use collections of pages instead of
single pages (although I'm not aware of the proper design of such a thing).

Again, this could be completely off the mark but if it was me that was
working on this problem, I would have some profile data from some workloads
to make sure the part I'm optimising was a noticeable percentage of the
workload and included that in the patch leader. I would hope that the data
was compelling enough to convince reviewers to pay close attention to the
series as the complexity would then be justified. Based on how complex THP
was for anonymous pages, I would be tempted to treat THP for file-backed
data as a last resort.

> In fact there's a good argument that memory sizes are growing faster
> than TLB capacities. And without large TLBs we're even further off
> the curve.
>

I'll admit this is also true. It was considered to be true in the 90's
when huge pages were first being thrown around as a possible solution to
the problem. One paper recently suggested using segmentation for large
memory segments but the workloads they examined looked like they would
be dominated by anonymous access, not file-backed data with one exception
where the workload frequently accessed compile-time constants.

--
Mel Gorman
SUSE Labs

2013-10-01 17:11:26

by Ning Qu

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

I can throw in some numbers for one of the test cases I am working on.

One of the workloads uses sysv shm to load GB-scale files into
memory, which is shared with other worker processes for the long term. We
load as many files as fit into the available physical memory.
The heap is also pretty big (GB-scale) to handle that data.

For the workload I just mentioned, thp gives us about an 8%
performance improvement: 5% from thp anonymous memory and 3% from thp
page cache. It might not look like much, but it's a solid gain without
changing a single line of code in the application, which is the beauty
of thp.

Before that, we had been using hugetlbfs, which means reserving a
huge amount of memory at boot time whether that memory will be
used or not. It works, but no other major services could ever
share the server resources anymore.
Best wishes,
--
Ning Qu (曲宁) | Software Engineer | [email protected] | +1-408-418-6066


On Tue, Oct 1, 2013 at 1:38 AM, Mel Gorman <[email protected]> wrote:
> On Mon, Sep 30, 2013 at 11:51:06AM -0700, Andi Kleen wrote:
>> > AFAIK, this is not a problem in the vast majority of modern CPUs
>>
>> Let's do some simple math: e.g. a Sandy Bridge system has 512 4K iTLB L2 entries.
>> That's around 2MB. There's more and more code whose footprint exceeds
>> that.
>>
>
> With an expectation that it is read-mostly data, replicated between the
> caches accessing it and TLB refills taking very little time. This is not
> universally true and there are exceptions but even recent papers on TLB
> behaviour have tended to dismiss the iTLB refill overhead as a negligible
> portion of the overall workload of interest.
>
>> Besides iTLB is not the only target. It is also useful for
>> data of course.
>>
>
> True, but how useful? I have not seen an example of a workload showing that
> dTLB pressure on file-backed data was a major component of the workload. I
> would expect that sysV shared memory is an exception but does that require
> generic support for all filesystems or can tmpfs be special cased when
> it's used for shared memory?
>
> For normal data, if it's read-only data then there would be some benefit to
> using huge pages once the data is in page cache. How common are workloads
> that mmap() large amounts of read-only data? Possibly some databases
> depending on the workload although there I would expect that the data is
> placed in shared memory.
>
> If the mmap()ed data is being written then the cost of IO is likely to
> dominate, not TLB pressure. For write-mostly workloads there are greater
> concerns because dirty tracking can only be done at the huge page
> boundary, potentially leading to greater amounts of IO and degraded
> performance overall.
>
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.
>
>> > > and I found it very hard to be motivated to review the series as a result.
>> > > I suspected that in many cases that the cost of IO would continue to dominate
>> > > performance instead of TLB pressure
>>
>> The trend is to larger and larger memories, keeping things in memory.
>>
>
> Yes, but using huge pages is not *necessarily* the answer. For fault
> scalability it probably would be a lot easier to batch handle faults if
> readahead indicates accesses are sequential. Background zeroing of pages
> could be revisited for fault-intensive workloads. A potential alternative
> is to allocate a contiguous page, zero it as one lump, split it and put
> the pages onto a local per-task list, although the details get messy.
> Reclaim scanning could be heavily modified to use collections of pages
> instead of single pages (although I'm not aware of the proper design of
> such a thing).
>
> Again, this could be completely off the mark but if it were me working
> on this problem, I would have some profile data from some workloads to
> make sure the part I'm optimising was a noticeable percentage of the
> workload and would have included that in the patch leader. I would hope
> that the data was compelling enough to convince reviewers to pay close
> attention to the series, as the complexity would then be justified.
> Based on how complex THP was for anonymous pages, I would be tempted to
> treat THP for file-backed data as a last resort.
>
>> In fact there's a good argument that memory sizes are growing faster
>> than TLB capacities. And without large TLBs we're even further off
>> the curve.
>>
>
> I'll admit this is also true. It was considered to be true in the 90s,
> when huge pages were first being thrown around as a possible solution
> to the problem. One paper recently suggested using segmentation for
> large memory segments, but the workloads they examined looked like they
> would be dominated by anonymous accesses, not file-backed data, with
> one exception where the workload frequently accessed compile-time
> constants.
>
> --
> Mel Gorman
> SUSE Labs

2013-10-14 13:56:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

Dave Chinner wrote:
> On Wed, Sep 25, 2013 at 12:51:04PM +0300, Kirill A. Shutemov wrote:
> > Andrew Morton wrote:
> > > On Mon, 23 Sep 2013 15:05:28 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
> > >
> > > > It brings thp support for ramfs, but without mmap() -- it will be posted
> > > > separately.
> > >
> > > We were never going to do this :(
> > >
> > > Has anyone reviewed these patches much yet?
> >
> > Dave did very good review. Few other people looked to separate patches.
> > See Reviewed-by/Acked-by tags in patches.
> >
> > It looks like most mm experts are busy with numa balancing nowadays, so
> > it's hard to get more review.
>
> Nobody has reviewed it from the filesystem side, though.
>
> The changes that require special code paths for huge pages in the
> write_begin/write_end paths are nasty. You're adding conditional
> code that depends on the page size and then having to add checks to
> ensure that large page operations don't step over small page
> boundaries and other such corner cases. It's an extremely fragile
> design, IMO.
>
> In general, I don't like all the if (thp) {} else {}; code that this
> series introduces - they are code paths that simply won't get tested
> with any sort of regularity and make the code more complex for those
> that aren't using THP to understand and debug...

Okay, I'll try to get rid of the special cases where possible.

> Then there is a new per-inode lock that is used in
> generic_perform_write() which is held across page faults and calls
> to filesystem block mapping callbacks. This inserts into the middle
> of an existing locking chain that needs to be strictly ordered, and
> as such will lead to the same type of lock inversion problems that
> the mmap_sem had. We do not want to introduce a new lock that has
> this same problem just as we are getting rid of that long standing
> nastiness from the page fault path...

I don't see how we can protect against splitting with the existing
locks, but I'll try to find a way.
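
(The inversion being described is the classic ABBA pattern. A user-space
sketch of the shape of the problem -- the lock names and the exact
ordering are mine, not a claim about where the kernel inversion would
actually occur:)

#include <pthread.h>

/* stand-ins for the new per-inode lock and the fault-path lock */
static pthread_mutex_t split_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t fault_lock = PTHREAD_MUTEX_INITIALIZER;

static void *write_path(void *arg)
{
        /* write(2): takes the per-inode lock, then faults in the user
         * buffer, which needs the fault-path lock */
        (void)arg;
        pthread_mutex_lock(&split_lock);
        pthread_mutex_lock(&fault_lock);
        pthread_mutex_unlock(&fault_lock);
        pthread_mutex_unlock(&split_lock);
        return NULL;
}

static void *fault_path(void *arg)
{
        /* fault path: holds the fault-path lock, then ends up needing
         * the per-inode lock -- the opposite order, so the two threads
         * can deadlock depending on timing */
        (void)arg;
        pthread_mutex_lock(&fault_lock);
        pthread_mutex_lock(&split_lock);
        pthread_mutex_unlock(&split_lock);
        pthread_mutex_unlock(&fault_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, write_path, NULL);
        pthread_create(&b, NULL, fault_path, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}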

> I also note that you didn't convert invalidate_inode_pages2_range()
> to support huge pages which is needed by real filesystems that
> support direct IO. There are other truncate/invalidate interfaces
> that you didn't convert, either, and some of them will present you
> with interesting locking challenges as a result of adding that new
> lock...

Thanks. I'll take a look at these code paths.

> > The patchset was mostly ignored for few rounds and Dave suggested to split
> > to have less scary patch number.
>
> It's still being ignored by filesystem people because you haven't
> actually tried to implement support into a real filesystem.....

If it supported a real filesystem, wouldn't it be ignored due to the
patch count? ;)

> > > > Please review and consider applying.
> > >
> > > It appears rather too immature at this stage.
> >
> > More review is always welcome and I'm committed to address issues.
>
> IMO, supporting a real block based filesystem like ext4 or XFS and
> demonstrating that everything works is necessary before we go any
> further...

I'll see what numbers I can bring in the next iterations.

Thanks for your feedback. And sorry for the late answer.

--
Kirill A. Shutemov

2013-10-14 14:27:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv6 00/22] Transparent huge page cache: phase 1, everything but mmap()

Mel Gorman wrote:
> I could be completely wrong here but these were the concerns I had when
> I first glanced through the patches. The changelogs had no information
> to convince me otherwise so I never dedicated the time to reviewing the
> patches in detail. I raised my concerns and then dropped it.

Okay. I got your point: more data from real-world workloads. I'll try to
bring some in the next iteration.

--
Kirill A. Shutemov