2013-05-12 01:32:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 00/39] Transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

It's version 4. You can also use git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git

branch thp/pagecache.

If you want to check changes since v3, you can look at the diff between the
tags thp/pagecache/v3 and thp/pagecache/v4-prerebase.
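
For example:

  git diff thp/pagecache/v3 thp/pagecache/v4-prerebase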

Intro
-----

The goal of the project is to prepare kernel infrastructure to handle huge
pages in the page cache.

To prove that the proposed changes are functional we enable the feature for
the simplest file system -- ramfs. ramfs is not that useful by itself, but
it's a good pilot project: it provides information on what performance boost
we should expect on other file systems.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries: one entry for the head page and HPAGE_PMD_NR-1
entries for the tail pages.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under one tree_lock. To make this possible, we
extended the radix-tree interface to pre-allocate enough memory to insert a
number of *contiguous* elements (kudos to Matthew Wilcox).
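
The usage pattern looks roughly like this (a sketch only, with error handling
and page refcounting omitted; the real code is in add_to_page_cache_locked()
later in the series):

static int example_add_huge_page(struct address_space *mapping,
				 struct page *page, pgoff_t index)
{
	int i, error;

	/* Pre-allocate nodes for HPAGE_PMD_NR contiguous slots. */
	error = radix_tree_preload_count(HPAGE_PMD_NR, GFP_KERNEL);
	if (error)
		return error;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_PMD_NR; i++)
		radix_tree_insert(&mapping->page_tree, index + i, page + i);
	spin_unlock_irq(&mapping->tree_lock);

	radix_tree_preload_end();	/* re-enables preemption */
	return 0;
}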

Huge pages can be added to the page cache in two ways: write(2) to a file or
a page fault on a sparse file. Potentially, a third way is collapsing small
pages, but it's outside the initial implementation.

[ While preparing the patchset I've found one more place where we could
allocate a huge page: read(2) on a sparse file. With the current code we will
get 4k pages. That's okay, but not optimal; it will be fixed later. ]

File systems are the decision makers on allocating huge or small pages: they
have better visibility into whether it's useful in each particular case.

For write(2) the decision point is mapping->a_ops->write_begin(). For ramfs
it's simple_write_begin().

For page faults, it's vm_ops->fault(): the mm core will call ->fault() with
FAULT_FLAG_TRANSHUGE if a huge page is appropriate. ->fault() can return
VM_FAULT_FALLBACK if it wants a small page instead. For ramfs ->fault() is
filemap_fault().
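
As an illustration of that contract (a hypothetical handler, not part of this
series): a file system that never wants huge pages would simply report
fallback when asked for one and otherwise behave as before:

static int example_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	/* We only serve small pages: let the core retry without THP. */
	if (vmf->flags & FAULT_FLAG_TRANSHUGE)
		return VM_FAULT_FALLBACK;

	return filemap_fault(vma, vmf);
}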

Performance
-----------

The numbers I posted with v3 were too good to be true: I forgot to
disable debug options in the kernel config :-P

The test machine is a 4-socket Westmere: 4x10 cores + HT.

I've used IOzone for benchmarking. The base command is:

iozone -s 8g/$threads -t $threads -r 4 -i 0 -i 1 -i 2 -i 3

Units are KB/s. I've used the "Children see throughput" field from the
iozone report.

Using mmap (-B option):

** Initial writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1444052 3010882 6055090 11746060 14404889 25109004 28310733 29044218 29619191 29618651 29514987 29348440 29315639 29326998 29410809
patched: 2207350 4707001 9642674 18356751 21399813 27011674 26775610 24088924 18549342 15453297 13876530 13358992 13166737 13095453 13111227
speed-up(times): 1.53 1.56 1.59 1.56 1.49 1.08 0.95 0.83 0.63 0.52 0.47 0.46 0.45 0.45 0.45

** Rewriters **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 2012192 3941325 7179208 13093224 13978721 19120624 14938912 16672082 16430882 14384357 12311291 16421748 13485785 10642142 11461610
patched: 3106380 5822011 11657398 17109111 15498272 18507004 16960717 14877209 17498172 15317104 15470030 19190455 14758974 9242583 10548081
speed-up(times): 1.54 1.48 1.62 1.31 1.11 0.97 1.14 0.89 1.06 1.06 1.26 1.17 1.09 0.87 0.92

** Readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1541551 3301643 5624206 11672717 16145085 27885416 38730976 42438132 47526802 48077097 47126201 45950491 45108567 45011088 46310317
patched: 1800898 3582243 8062851 14418948 17587027 34938636 46653133 46561002 50396044 49525385 47731629 46594399 46424568 45357496 45258561
speed-up(times): 1.17 1.08 1.43 1.24 1.09 1.25 1.20 1.10 1.06 1.03 1.01 1.01 1.03 1.01 0.98

** Re-readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1407462 3022304 5944814 12290200 15700871 27452022 38785250 45720460 47958008 48616065 47805237 45933767 45139644 44752527 45324330
patched: 1880030 4265188 7406094 15220592 19781387 33994635 43689297 47557123 51175499 50607686 48695647 46799726 46250685 46108964 45180965
speed-up(times): 1.34 1.41 1.25 1.24 1.26 1.24 1.13 1.04 1.07 1.04 1.02 1.02 1.02 1.03 1.00

** Reverse readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1790475 3547606 6639853 14323339 17029576 30420579 39954056 44082873 45397731 45956797 46861276 46149824 44356709 43789684 44961204
patched: 1848356 3470499 7270728 15685450 19329038 33186403 43574373 48972628 47398951 48588366 48233477 46959725 46383543 43998385 45272745
speed-up(times): 1.03 0.98 1.10 1.10 1.14 1.09 1.09 1.11 1.04 1.06 1.03 1.02 1.05 1.00 1.01

** Random_readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1098140 2549558 4625359 9248630 11764863 22648276 32809857 37617500 39028665 41283083 41886214 44448720 43535904 43481063 44041363
patched: 1893732 4034810 8218138 15051324 24400039 35208044 41339655 48233519 51046118 47613022 46427129 45893974 45190367 45158010 45944107
speed-up(times): 1.72 1.58 1.78 1.63 2.07 1.55 1.26 1.28 1.31 1.15 1.11 1.03 1.04 1.04 1.04

** Random_writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1366232 2863721 5714268 10615938 12711800 18768227 19430964 19895410 19108420 19666818 19189895 19666578 18953431 18712664 18676119
patched: 3308906 6093588 11885456 21035728 21744093 21940402 20155000 20800063 21107088 20821950 21369886 21324576 21019851 20418478 20547713
speed-up(times): 2.42 2.13 2.08 1.98 1.71 1.17 1.04 1.05 1.10 1.06 1.11 1.08 1.11 1.09 1.10

****************************

Using syscall (no -B option):

** Initial writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1786744 3693529 7600563 14594702 17645248 26197482 28938801 29700591 29858369 29831816 29730708 29606829 29621126 29538778 29589533
patched: 1817240 3732281 7598178 14578689 17824204 27186214 29552434 26634121 22304410 18631185 16485981 15801835 15590995 15514384 15483872
speed-up(times): 1.02 1.01 1.00 1.00 1.01 1.04 1.02 0.90 0.75 0.62 0.55 0.53 0.53 0.53 0.52

** Rewriters **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 2025119 3891368 8662423 14477011 17815278 20618509 18330301 14184305 14421901 12488145 12329534 12285723 12049399 12101321 12017546
patched: 2071648 4106464 8915170 15475594 18461212 23360704 25107019 26244308 26634094 27680123 27342845 27006682 26239505 25881556 26030227
speed-up(times): 1.02 1.06 1.03 1.07 1.04 1.13 1.37 1.85 1.85 2.22 2.22 2.20 2.18 2.14 2.17

** Readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 2414037 5609352 9326943 20594508 22135032 37437276 35593047 41574568 45919334 45903379 45680066 45703659 42766312 42265067 44491712
patched: 2388758 4573606 9867239 18485205 22269461 36172618 46830113 45828302 45974984 48244870 45334303 45395237 44213071 44418922 44881804
speed-up(times): 0.99 0.82 1.06 0.90 1.01 0.97 1.32 1.10 1.00 1.05 0.99 0.99 1.03 1.05 1.01

** Re-readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 2410474 5006316 9620458 19420701 24929010 37301471 37897701 48067032 46620958 44619322 45474645 45627080 38448032 44844358 44529239
patched: 2210495 4588974 9330074 18237863 23200139 36691762 43412170 48349035 46607100 47318490 45429944 45285141 44631543 44601157 44913130
speed-up(times): 0.92 0.92 0.97 0.94 0.93 0.98 1.15 1.01 1.00 1.06 1.00 0.99 1.16 0.99 1.01

** Reverse readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 2383446 4633256 9572545 18500373 21489130 36958118 31747157 39855519 31440942 32131944 37714689 42428280 17402480 14893057 16207342
patched: 2240576 4847211 8373112 17181179 20205163 35186361 42922118 45388409 46244837 47153867 45257508 45476325 43479030 43613958 43296206
speed-up(times): 0.94 1.05 0.87 0.93 0.94 0.95 1.35 1.14 1.47 1.47 1.20 1.07 2.50 2.93 2.67

** Random_readers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1821175 3575869 8742168 13764493 20136443 30901949 37823254 43994032 41037782 43925224 41853227 42095250 39393426 33851319 41424361
patched: 1458968 3169634 6244046 12271864 15474602 29337377 35430875 39734695 41587609 42676631 42077827 41473062 40933033 40944148 41846858
speed-up(times): 0.80 0.89 0.71 0.89 0.77 0.95 0.94 0.90 1.01 0.97 1.01 0.99 1.04 1.21 1.01

** Random_writers **
threads: 1 2 4 8 10 20 30 40 50 60 70 80 120 160 200
baseline: 1556393 3063377 6014016 12199163 16187258 24737005 27293400 27678633 26549637 26963066 26202907 26090764 26159003 25842459 26009927
patched: 1642937 3461512 6405111 12425923 16990495 25404113 27340882 27467380 27057498 27297246 26627644 26733315 26624258 26787503 26603172
speed-up(times): 1.06 1.13 1.07 1.02 1.05 1.03 1.00 0.99 1.02 1.01 1.02 1.02 1.02 1.04 1.02

I haven't yet analyzed why it behaves poorly with a high number of processes,
but I will.

Changelog
---------

v4:
- Drop RFC tag;
- Consolidate thp and non-thp code (net diff to v3 is -177 lines);
- Compile time and sysfs knob for the feature;
- Rework zone_stat for huge pages;
- x86-64 only for now;
- ...
v3:
- set RADIX_TREE_PRELOAD_NR to 512 only if we build with THP;
- rewrite lru_add_page_tail() to address a few bugs;
- memcg accounting;
- represent file thp pages in meminfo and friends;
- dump page order in filemap trace;
- add missed flush_dcache_page() in zero_huge_user_segment;
- random cleanups based on feedback.
v2:
- mmap();
- fix add_to_page_cache_locked() and delete_from_page_cache();
- introduce mapping_can_have_hugepages();
- call split_huge_page() only for head page in filemap_fault();
- wait_split_huge_page(): serialize over i_mmap_mutex too;
- lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
- fix off-by-one in zero_huge_user_segment();
- THP_WRITE_ALLOC/THP_WRITE_FAILED counters;

Kirill A. Shutemov (39):
mm: drop actor argument of do_generic_file_read()
block: implement add_bdi_stat()
mm: implement zero_huge_user_segment and friends
radix-tree: implement preload for multiple contiguous elements
memcg, thp: charge huge cache pages
thp, mm: avoid PageUnevictable on active/inactive lru lists
thp, mm: basic defines for transparent huge page cache
thp: compile-time and sysfs knob for thp pagecache
thp, mm: introduce mapping_can_have_hugepages() predicate
thp: account anon transparent huge pages into NR_ANON_PAGES
thp: represent file thp pages in meminfo and friends
thp, mm: rewrite add_to_page_cache_locked() to support huge pages
mm: trace filemap: dump page order
thp, mm: rewrite delete_from_page_cache() to support huge pages
thp, mm: trigger bug in replace_page_cache_page() on THP
thp, mm: locking tail page is a bug
thp, mm: handle tail pages in page_cache_get_speculative()
thp, mm: add event counters for huge page alloc on write to a file
thp, mm: allocate huge pages in grab_cache_page_write_begin()
thp, mm: naive support of thp in generic read/write routines
thp, libfs: initial support of thp in
simple_read/write_begin/write_end
thp: handle file pages in split_huge_page()
thp: wait_split_huge_page(): serialize over i_mmap_mutex too
thp, mm: truncate support for transparent huge page cache
thp, mm: split huge page on mmap file page
ramfs: enable transparent huge page cache
x86-64, mm: proper alignment mappings with hugepages
thp: prepare zap_huge_pmd() to uncharge file pages
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
thp: do_huge_pmd_anonymous_page() cleanup
thp: consolidate code between handle_mm_fault() and
do_huge_pmd_anonymous_page()
mm: cleanup __do_fault() implementation
thp, mm: implement do_huge_linear_fault()
thp, mm: handle huge pages in filemap_fault()
mm: decomposite do_wp_page() and get rid of some 'goto' logic
mm: do_wp_page(): extract VM_WRITE|VM_SHARED case to separate
function
thp: handle write-protect exception to file-backed huge pages
thp: vma_adjust_trans_huge(): adjust file-backed VMA too
thp: map file-backed huge pages on fault

arch/x86/kernel/sys_x86_64.c | 12 +-
drivers/base/node.c | 10 +-
fs/libfs.c | 50 +++-
fs/proc/meminfo.c | 9 +-
fs/ramfs/inode.c | 6 +-
include/linux/backing-dev.h | 10 +
include/linux/fs.h | 1 +
include/linux/huge_mm.h | 92 +++++--
include/linux/mm.h | 19 +-
include/linux/mmzone.h | 1 +
include/linux/pagemap.h | 33 ++-
include/linux/radix-tree.h | 11 +
include/linux/vm_event_item.h | 2 +
include/trace/events/filemap.h | 7 +-
lib/radix-tree.c | 33 ++-
mm/Kconfig | 10 +
mm/filemap.c | 216 +++++++++++----
mm/huge_memory.c | 257 +++++++++--------
mm/memcontrol.c | 2 -
mm/memory.c | 597 ++++++++++++++++++++++++++--------------
mm/rmap.c | 18 +-
mm/swap.c | 20 +-
mm/truncate.c | 13 +
mm/vmstat.c | 3 +
24 files changed, 988 insertions(+), 444 deletions(-)

--
1.7.10.4


2013-05-12 01:21:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements

From: "Kirill A. Shutemov" <[email protected]>

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spin lock, which means we want to
pre-allocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For the transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries into an address_space at once.

This patch introduces radix_tree_preload_count(). It allows preallocating
enough nodes to insert a number of *contiguous* elements.

The worst case for adding N contiguous items is adding entries at indexes
(ULONG_MAX - N) to ULONG_MAX: it requires the nodes needed to insert a single
worst-case item plus extra nodes if you cross the boundary from one node to
the next.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase the array size
to be able to handle 512 entries at once.

The size of the array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible values of RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6.
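
For example, with 8-byte pointers and the new array size of 30, that's
30 * 8 = 240 bytes per CPU -- roughly 120KB in total if NR_CPUS is 512.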

Since only THP uses batched preload at the moment, we disable it (set the
maximum preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/radix-tree.h | 11 +++++++++++
lib/radix-tree.c | 33 ++++++++++++++++++++++++++-------
2 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..a859195 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do { \
(root)->rnode = NULL; \
} while (0)

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR 512
+#else
+#define RADIX_TREE_PRELOAD_NR 1
+#endif
+
/**
* Radix-tree synchronization
*
@@ -231,6 +241,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..1bc352f 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
* The worst case is a zero height tree with just a single item at index 0,
* and then inserting an item at index ULONG_MAX. This requires 2 new branches
* of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
* Hence:
*/
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+ (RADIX_TREE_PRELOAD_MIN + \
+ DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))

/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
int nr;
- struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+ struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };

@@ -257,29 +265,35 @@ radix_tree_node_free(struct radix_tree_node *node)

/*
* Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail. On
- * success, return zero, with preemption disabled. On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
* __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ int preload_target = RADIX_TREE_PRELOAD_MIN +
+ DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+ if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+ "too large preload requested"))
+ return -ENOMEM;

preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+ while (rtp->nr < preload_target) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < preload_target)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +302,11 @@ int radix_tree_preload(gfp_t gfp_mask)
out:
return ret;
}
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+ return radix_tree_preload_count(1, gfp_mask);
+}
EXPORT_SYMBOL(radix_tree_preload);

/*
--
1.7.10.4

2013-05-12 01:22:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 39/39] thp: map file-backed huge pages on fault

From: "Kirill A. Shutemov" <[email protected]>

It looks like all the pieces are in place: we can map file-backed huge pages
now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 4 +++-
mm/memory.c | 5 ++++-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f4d6626..903f097 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -78,7 +78,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
((__vma)->vm_flags & VM_HUGEPAGE))) && \
!((__vma)->vm_flags & VM_NOHUGEPAGE) && \
- !is_vma_temporary_stack(__vma))
+ !is_vma_temporary_stack(__vma) && \
+ (!(__vma)->vm_ops || \
+ mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
#define transparent_hugepage_defrag(__vma) \
((transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
diff --git a/mm/memory.c b/mm/memory.c
index ebff552..7fe9752 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3939,10 +3939,13 @@ retry:
if (!pmd)
return VM_FAULT_OOM;
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = 0;
+ int ret;
if (!vma->vm_ops)
ret = do_huge_pmd_anonymous_page(mm, vma, address,
pmd, flags);
+ else
+ ret = do_huge_linear_fault(mm, vma, address,
+ pmd, flags);
if ((ret & VM_FAULT_FALLBACK) == 0)
return ret;
} else {
--
1.7.10.4

2013-05-12 01:22:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 01/39] mm: drop actor argument of do_generic_file_read()

From: "Kirill A. Shutemov" <[email protected]>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e989fb1..61158ac 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1088,7 +1088,6 @@ static void shrink_readahead_size_eio(struct file *filp,
* @filp: the file to read
* @ppos: current file position
* @desc: read_descriptor
- * @actor: read method
*
* This is a generic file read routine, and uses the
* mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1097,7 +1096,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* of the logic when it comes to error handling etc.
*/
static void do_generic_file_read(struct file *filp, loff_t *ppos,
- read_descriptor_t *desc, read_actor_t actor)
+ read_descriptor_t *desc)
{
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
@@ -1198,13 +1197,14 @@ page_ok:
* Ok, we have the page, and it's up-to-date, so
* now we can copy it to user space...
*
- * The actor routine returns how many bytes were actually used..
+ * The file_read_actor routine returns how many bytes were
+ * actually used..
* NOTE! This may not be the same as how much of a user buffer
* we filled up (we may be padding etc), so we can only update
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = file_read_actor(desc, page, offset, nr);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
@@ -1477,7 +1477,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (desc.count == 0)
continue;
desc.error = 0;
- do_generic_file_read(filp, ppos, &desc, file_read_actor);
+ do_generic_file_read(filp, ppos, &desc);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;
--
1.7.10.4

2013-05-12 01:22:02

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()

From: "Kirill A. Shutemov" <[email protected]>

Let's modify __do_fault() to handle transhuge pages. To indicate that a huge
page is required, the caller passes flags with FAULT_FLAG_TRANSHUGE set.

__do_fault() now returns VM_FAULT_FALLBACK to indicate that a fallback to
small pages is required.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 41 +++++++++++++
include/linux/mm.h | 5 ++
mm/huge_memory.c | 22 -------
mm/memory.c | 148 ++++++++++++++++++++++++++++++++++++++++-------
4 files changed, 172 insertions(+), 44 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d688271..b20334a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -188,6 +188,28 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

+static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+{
+ return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+ struct vm_area_struct *vma,
+ unsigned long haddr, int nd,
+ gfp_t extra_gfp)
+{
+ return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
+ HPAGE_PMD_ORDER, vma, haddr, nd);
+}
+
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+ pmd_t entry;
+ entry = mk_pmd(page, prot);
+ entry = pmd_mkhuge(entry);
+ return entry;
+}
+
extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp);

@@ -200,12 +222,15 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })

+#define THP_FAULT_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_FAULT_FALLBACK ({ BUILD_BUG(); 0; })
#define THP_WRITE_ALLOC ({ BUILD_BUG(); 0; })
#define THP_WRITE_ALLOC_FAILED ({ BUILD_BUG(); 0; })

#define hpage_nr_pages(x) 1

#define transparent_hugepage_enabled(__vma) 0
+#define transparent_hugepage_defrag(__vma) 0

#define transparent_hugepage_flags 0UL
static inline int
@@ -242,6 +267,22 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
return 0;
}

+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
+{
+ pmd_t entry;
+ BUILD_BUG();
+ return entry;
+}
+
+static inline struct page *alloc_hugepage_vma(int defrag,
+ struct vm_area_struct *vma,
+ unsigned long haddr, int nd,
+ gfp_t extra_gfp)
+{
+ BUILD_BUG();
+ return NULL;
+}
+
static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 280b414..563c8b7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -167,6 +167,11 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+#define FAULT_FLAG_TRANSHUGE 0x80 /* Try to allocate transhuge page */
+#else
+#define FAULT_FLAG_TRANSHUGE 0 /* Optimize out THP code if disabled */
+#endif

/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index facfdac..893cc69 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -709,14 +709,6 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
-{
- pmd_t entry;
- entry = mk_pmd(page, prot);
- entry = pmd_mkhuge(entry);
- return entry;
-}
-
static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
@@ -758,20 +750,6 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
return 0;
}

-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
-{
- return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
-}
-
-static inline struct page *alloc_hugepage_vma(int defrag,
- struct vm_area_struct *vma,
- unsigned long haddr, int nd,
- gfp_t extra_gfp)
-{
- return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
- HPAGE_PMD_ORDER, vma, haddr, nd);
-}
-
#ifndef CONFIG_NUMA
static inline struct page *alloc_hugepage(int defrag)
{
diff --git a/mm/memory.c b/mm/memory.c
index 97b22c7..8997cd8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -59,6 +59,7 @@
#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/string.h>
+#include <linux/khugepaged.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -167,6 +168,7 @@ static void check_sync_rss_stat(struct task_struct *task)
}
#else /* SPLIT_RSS_COUNTING */

+#define add_mm_counter_fast(mm, member, val) add_mm_counter(mm, member, val)
#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)

@@ -3282,6 +3284,38 @@ oom:
return VM_FAULT_OOM;
}

+static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
+ unsigned long addr)
+{
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+ return false;
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return false;
+ return true;
+}
+
+static struct page *alloc_fault_page_vma(struct vm_area_struct *vma,
+ unsigned long addr, unsigned int flags)
+{
+
+ if (flags & FAULT_FLAG_TRANSHUGE) {
+ struct page *page;
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+ page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (page)
+ count_vm_event(THP_FAULT_ALLOC);
+ else
+ count_vm_event(THP_FAULT_FALLBACK);
+ return page;
+ }
+ return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+}
+
/*
* __do_fault() tries to create a new page mapping. It aggressively
* tries to share with existing pages, but makes a separate copy if
@@ -3301,12 +3335,23 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
{
pte_t *page_table;
spinlock_t *ptl;
+ pgtable_t pgtable = NULL;
struct page *page, *cow_page, *dirty_page = NULL;
- pte_t entry;
bool anon = false, page_mkwrite = false;
bool write = flags & FAULT_FLAG_WRITE;
+ bool thp = flags & FAULT_FLAG_TRANSHUGE;
+ unsigned long addr_aligned;
struct vm_fault vmf;
- int ret;
+ int nr, ret;
+
+ if (thp) {
+ if (!transhuge_vma_suitable(vma, address))
+ return VM_FAULT_FALLBACK;
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+ addr_aligned = address & HPAGE_PMD_MASK;
+ } else
+ addr_aligned = address & PAGE_MASK;

/*
* If we do COW later, allocate page befor taking lock_page()
@@ -3316,17 +3361,25 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;

- cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ cow_page = alloc_fault_page_vma(vma, address, flags);
if (!cow_page)
- return VM_FAULT_OOM;
+ return VM_FAULT_OOM | VM_FAULT_FALLBACK;

if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
page_cache_release(cow_page);
- return VM_FAULT_OOM;
+ return VM_FAULT_OOM | VM_FAULT_FALLBACK;
}
} else
cow_page = NULL;

+ if (thp) {
+ pgtable = pte_alloc_one(mm, address);
+ if (unlikely(!pgtable)) {
+ ret = VM_FAULT_OOM;
+ goto uncharge_out;
+ }
+ }
+
vmf.virtual_address = (void __user *)(address & PAGE_MASK);
vmf.pgoff = pgoff;
vmf.flags = flags;
@@ -3353,6 +3406,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
VM_BUG_ON(!PageLocked(vmf.page));

page = vmf.page;
+
+ /*
+ * If we asked for huge page we expect to get it or VM_FAULT_FALLBACK.
+ * If we don't ask for huge page it must be split in ->fault().
+ */
+ BUG_ON(PageTransHuge(page) != thp);
+
if (!write)
goto update_pgtable;

@@ -3362,7 +3422,11 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!(vma->vm_flags & VM_SHARED)) {
page = cow_page;
anon = true;
- copy_user_highpage(page, vmf.page, address, vma);
+ if (thp)
+ copy_user_huge_page(page, vmf.page, addr_aligned, vma,
+ HPAGE_PMD_NR);
+ else
+ copy_user_highpage(page, vmf.page, address, vma);
__SetPageUptodate(page);
} else if (vma->vm_ops->page_mkwrite) {
/*
@@ -3373,6 +3437,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,

unlock_page(page);
vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+ if (thp)
+ vmf.flags |= FAULT_FLAG_TRANSHUGE;
tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
ret = tmp;
@@ -3391,19 +3457,30 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}

update_pgtable:
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
/* Only go through if we didn't race with anybody else... */
- if (unlikely(!pte_same(*page_table, orig_pte))) {
- pte_unmap_unlock(page_table, ptl);
- goto race_out;
+ if (thp) {
+ spin_lock(&mm->page_table_lock);
+ if (!pmd_none(*pmd)) {
+ spin_unlock(&mm->page_table_lock);
+ goto race_out;
+ }
+ /* make GCC happy */
+ ptl = NULL; page_table = NULL;
+ } else {
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (unlikely(!pte_same(*page_table, orig_pte))) {
+ pte_unmap_unlock(page_table, ptl);
+ goto race_out;
+ }
}

flush_icache_page(vma, page);
+ nr = thp ? HPAGE_PMD_NR : 1;
if (anon) {
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address);
+ add_mm_counter_fast(mm, MM_ANONPAGES, nr);
+ page_add_new_anon_rmap(page, vma, addr_aligned);
} else {
- inc_mm_counter_fast(mm, MM_FILEPAGES);
+ add_mm_counter_fast(mm, MM_FILEPAGES, nr);
page_add_file_rmap(page);
if (write) {
dirty_page = page;
@@ -3419,15 +3496,23 @@ update_pgtable:
* exclusive copy of the page, or this is a shared mapping, so we can
* make it writable and dirty to avoid having to handle that later.
*/
- entry = mk_pte(page, vma->vm_page_prot);
- if (write)
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- set_pte_at(mm, address, page_table, entry);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, address, page_table);
-
- pte_unmap_unlock(page_table, ptl);
+ if (thp) {
+ pmd_t entry = mk_huge_pmd(page, vma->vm_page_prot);
+ if (flags & FAULT_FLAG_WRITE)
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ set_pmd_at(mm, address, pmd, entry);
+ pgtable_trans_huge_deposit(mm, pgtable);
+ mm->nr_ptes++;
+ update_mmu_cache_pmd(vma, address, pmd);
+ spin_unlock(&mm->page_table_lock);
+ } else {
+ pte_t entry = mk_pte(page, vma->vm_page_prot);
+ if (write)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ set_pte_at(mm, address, page_table, entry);
+ update_mmu_cache(vma, address, page_table);
+ pte_unmap_unlock(page_table, ptl);
+ }

if (dirty_page) {
struct address_space *mapping = page->mapping;
@@ -3457,9 +3542,13 @@ update_pgtable:
return ret;

unwritable_page:
+ if (pgtable)
+ pte_free(mm, pgtable);
page_cache_release(page);
return ret;
uncharge_out:
+ if (pgtable)
+ pte_free(mm, pgtable);
/* fs's fault handler get error */
if (cow_page) {
mem_cgroup_uncharge_page(cow_page);
@@ -3467,6 +3556,8 @@ uncharge_out:
}
return ret;
race_out:
+ if (pgtable)
+ pte_free(mm, pgtable);
if (cow_page)
mem_cgroup_uncharge_page(cow_page);
if (anon)
@@ -3519,6 +3610,19 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

+static int do_huge_linear_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address, pmd_t *pmd,
+ unsigned int flags)
+{
+ pgoff_t pgoff = (((address & PAGE_MASK)
+ - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ pte_t __unused; /* unused with FAULT_FLAG_TRANSHUGE */
+
+ flags |= FAULT_FLAG_TRANSHUGE;
+
+ return __do_fault(mm, vma, address, pmd, pgoff, flags, __unused);
+}
+
int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
unsigned long addr, int current_nid)
{
--
1.7.10.4

2013-05-12 01:22:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 22/39] thp: handle file pages in split_huge_page()

From: "Kirill A. Shutemov" <[email protected]>

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 68 +++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 57 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ed31e90..73974e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1655,23 +1655,23 @@ static void __split_huge_page_refcount(struct page *page,
*/
page_tail->_mapcount = page->_mapcount;

- BUG_ON(page_tail->mapping);
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
page_nid_xchg_last(page_tail, page_nid_last(page));

- BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
BUG_ON(!PageDirty(page_tail));
- BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec, list);
}
atomic_sub(tail_count, &page->_count);
BUG_ON(atomic_read(&page->_count) <= 0);

- __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+ if (PageAnon(page))
+ __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+ else
+ __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);

ClearPageCompound(page);
compound_unlock(page);
@@ -1771,7 +1771,7 @@ static int __split_huge_page_map(struct page *page,
}

/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
struct anon_vma *anon_vma,
struct list_head *list)
{
@@ -1795,7 +1795,7 @@ static void __split_huge_page(struct page *page,
* and establishes a child pmd before
* __split_huge_page_splitting() freezes the parent pmd (so if
* we fail to prevent copy_huge_pmd() from running until the
- * whole __split_huge_page() is complete), we will still see
+ * whole __split_anon_huge_page() is complete), we will still see
* the newly established pmd of the child later during the
* walk, to be able to set it as pmd_trans_splitting too.
*/
@@ -1826,14 +1826,11 @@ static void __split_huge_page(struct page *page,
* from the hugepage.
* Return 0 if the hugepage is split successfully otherwise return 1.
*/
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int split_anon_huge_page(struct page *page, struct list_head *list)
{
struct anon_vma *anon_vma;
int ret = 1;

- BUG_ON(is_huge_zero_page(page));
- BUG_ON(!PageAnon(page));
-
/*
* The caller does not necessarily hold an mmap_sem that would prevent
* the anon_vma disappearing so we first we take a reference to it
@@ -1851,7 +1848,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
goto out_unlock;

BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma, list);
+ __split_anon_huge_page(page, anon_vma, list);
count_vm_event(THP_SPLIT);

BUG_ON(PageCompound(page));
@@ -1862,6 +1859,55 @@ out:
return ret;
}

+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ int mapcount, mapcount2;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ mapcount = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+
+ if (mapcount != page_mapcount(page))
+ printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+ mapcount, page_mapcount(page));
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page, list);
+
+ mapcount2 = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+
+ if (mapcount != mapcount2)
+ printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+ mapcount, mapcount2, page_mapcount(page));
+ BUG_ON(mapcount != mapcount2);
+ count_vm_event(THP_SPLIT);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return 0;
+}
+
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+ BUG_ON(is_huge_zero_page(page));
+
+ if (PageAnon(page))
+ return split_anon_huge_page(page, list);
+ else
+ return split_file_huge_page(page, list);
+}
+
#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
--
1.7.10.4

2013-05-12 01:21:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 34/39] thp, mm: handle huge pages in filemap_fault()

From: "Kirill A. Shutemov" <[email protected]>

If the caller asks for a huge page (flags & FAULT_FLAG_TRANSHUGE),
filemap_fault() returns it if there's already a huge page at the offset.

If the area of the page cache required to create a huge page is empty, we
create a new huge page and return it.

Otherwise we return VM_FAULT_FALLBACK to indicate that a fallback to small
pages is required.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 52 +++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 43 insertions(+), 9 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9877347..1deedd6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1557,14 +1557,23 @@ EXPORT_SYMBOL(generic_file_aio_read);
* This adds the requested page to the page cache if it isn't already there,
* and schedules an I/O to read in its contents from disk.
*/
-static int page_cache_read(struct file *file, pgoff_t offset)
+static int page_cache_read(struct file *file, pgoff_t offset, bool thp)
{
struct address_space *mapping = file->f_mapping;
- struct page *page;
+ struct page *page;
int ret;

do {
- page = page_cache_alloc_cold(mapping);
+ if (thp) {
+ gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
+ BUG_ON(offset & HPAGE_CACHE_INDEX_MASK);
+ page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
+ if (page)
+ count_vm_event(THP_FAULT_ALLOC);
+ else
+ count_vm_event(THP_FAULT_FALLBACK);
+ } else
+ page = page_cache_alloc_cold(mapping);
if (!page)
return -ENOMEM;

@@ -1573,11 +1582,18 @@ static int page_cache_read(struct file *file, pgoff_t offset)
ret = mapping->a_ops->readpage(file, page);
else if (ret == -EEXIST)
ret = 0; /* losing race to add is OK */
+ else if (ret == -ENOSPC)
+ /*
+ * No space in page cache to add huge page.
+ * For the caller it's the same as -ENOMEM: falling back
+ * to small pages is required.
+ */
+ ret = -ENOMEM;

page_cache_release(page);

} while (ret == AOP_TRUNCATED_PAGE);
-
+
return ret;
}

@@ -1669,13 +1685,20 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct address_space *mapping = file->f_mapping;
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
+ bool thp = vmf->flags & FAULT_FLAG_TRANSHUGE;
pgoff_t offset = vmf->pgoff;
+ unsigned long address = (unsigned long)vmf->virtual_address;
struct page *page;
pgoff_t size;
int ret = 0;

+ if (thp) {
+ BUG_ON(ra->ra_pages);
+ offset = linear_page_index(vma, address & HPAGE_PMD_MASK);
+ }
+
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (offset >= size)
+ if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;

/*
@@ -1700,7 +1723,8 @@ retry_find:
goto no_cached_page;
}

- if (PageTransCompound(page))
+ /* Split huge page if we don't want huge page to be here */
+ if (!thp && PageTransCompound(page))
split_huge_page(compound_trans_head(page));
if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
page_cache_release(page);
@@ -1722,12 +1746,22 @@ retry_find:
if (unlikely(!PageUptodate(page)))
goto page_not_uptodate;

+ if (thp && !PageTransHuge(page)) {
+ /*
+ * Caller asked for huge page, but we have small page
+ * by this offset. Fallback to small pages.
+ */
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_FALLBACK;
+ }
+
/*
* Found the page and have a reference on it.
* We must recheck i_size under page lock.
*/
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (unlikely(offset >= size)) {
+ if (unlikely(vmf->pgoff >= size)) {
unlock_page(page);
page_cache_release(page);
return VM_FAULT_SIGBUS;
@@ -1741,7 +1775,7 @@ no_cached_page:
* We're only likely to ever get here if MADV_RANDOM is in
* effect.
*/
- error = page_cache_read(file, offset);
+ error = page_cache_read(file, offset, thp);

/*
* The page we want has now been added to the page cache.
@@ -1757,7 +1791,7 @@ no_cached_page:
* to schedule I/O.
*/
if (error == -ENOMEM)
- return VM_FAULT_OOM;
+ return VM_FAULT_OOM | VM_FAULT_FALLBACK;
return VM_FAULT_SIGBUS;

page_not_uptodate:
--
1.7.10.4

2013-05-12 01:21:58

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 28/39] thp: prepare zap_huge_pmd() to uncharge file pages

From: "Kirill A. Shutemov" <[email protected]>

Uncharge pages from the correct counter.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ad458d..a88f9b2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1385,10 +1385,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(&tlb->mm->page_table_lock);
put_huge_zero_page();
} else {
+ int member;
page = pmd_page(orig_pmd);
page_remove_rmap(page);
VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
+ add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);
VM_BUG_ON(!PageHead(page));
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
--
1.7.10.4

2013-05-12 01:21:57

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 19/39] thp, mm: allocate huge pages in grab_cache_page_write_begin()

From: "Kirill A. Shutemov" <[email protected]>

Try to allocate a huge page if flags has AOP_FLAG_TRANSHUGE set.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. The caller should take care of falling back to
small pages.
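
A minimal usage sketch (mirroring simple_write_begin() from a later patch in
the series; error handling omitted): try a huge page at the huge-page-aligned
index first, then fall back to a small page:

static struct page *example_grab_page(struct address_space *mapping,
				      pgoff_t index, unsigned flags)
{
	struct page *page = NULL;

	if (mapping_can_have_hugepages(mapping))
		page = grab_cache_page_write_begin(mapping,
				index & ~HPAGE_CACHE_INDEX_MASK,
				flags | AOP_FLAG_TRANSHUGE);
	if (!page)	/* fall back to a small page */
		page = grab_cache_page_write_begin(mapping, index, flags);
	return page;
}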

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/fs.h | 1 +
include/linux/huge_mm.h | 3 +++
include/linux/pagemap.h | 9 ++++++++-
mm/filemap.c | 29 ++++++++++++++++++++++++-----
4 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..a70b0ac 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -280,6 +280,7 @@ enum positive_aop_returns {
#define AOP_FLAG_NOFS 0x0004 /* used by filesystem to direct
* helper code (eg buffer layer)
* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE 0x0008 /* allocate transhuge page */

/*
* oh the beauties of C type declarations.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 88b44e2..74494a2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -194,6 +194,9 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })

+#define THP_WRITE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED ({ BUILD_BUG(); 0; })
+
#define hpage_nr_pages(x) 1

#define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2e86251..8feeecc 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -270,8 +270,15 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages);

-struct page *grab_cache_page_write_begin(struct address_space *mapping,
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags);
+static inline struct page *grab_cache_page_write_begin(
+ struct address_space *mapping, pgoff_t index, unsigned flags)
+{
+ if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
+ return NULL;
+ return __grab_cache_page_write_begin(mapping, index, flags);
+}

/*
* Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 9ea46a4..e086ef0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2309,25 +2309,44 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
{
int status;
gfp_t gfp_mask;
struct page *page;
gfp_t gfp_notmask = 0;
+ bool thp = (flags & AOP_FLAG_TRANSHUGE) &&
+ IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);

gfp_mask = mapping_gfp_mask(mapping);
if (mapping_cap_account_dirty(mapping))
gfp_mask |= __GFP_WRITE;
if (flags & AOP_FLAG_NOFS)
gfp_notmask = __GFP_FS;
+ if (thp) {
+ BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+ BUG_ON(!(gfp_mask & __GFP_COMP));
+ }
repeat:
page = find_lock_page(mapping, index);
- if (page)
+ if (page) {
+ if (thp && !PageTransHuge(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ return NULL;
+ }
goto found;
+ }

- page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+ if (thp) {
+ page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+ if (page)
+ count_vm_event(THP_WRITE_ALLOC);
+ else
+ count_vm_event(THP_WRITE_ALLOC_FAILED);
+ } else
+ page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index,
@@ -2342,7 +2361,7 @@ found:
wait_for_stable_page(page);
return page;
}
-EXPORT_SYMBOL(grab_cache_page_write_begin);
+EXPORT_SYMBOL(__grab_cache_page_write_begin);

static ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
--
1.7.10.4

2013-05-12 01:21:56

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 03/39] mm: implement zero_huge_user_segment and friends

From: "Kirill A. Shutemov" <[email protected]>

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment and zero_user, but for huge pages.
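
Usage sketch (this is how simple_write_begin() uses them later in the
series): zero everything in a huge page around the 'len' bytes about to be
written at offset 'from':

	zero_huge_user_segment(page, 0, from);
	zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE);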

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 7 +++++++
mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c05d7cf..5e156fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1797,6 +1797,13 @@ extern void dump_page(struct page *page);
extern void clear_huge_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+ unsigned start, unsigned end);
+static inline void zero_huge_user(struct page *page,
+ unsigned start, unsigned len)
+{
+ zero_huge_user_segment(page, start, start + len);
+}
extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index f7a1fba..f02a8be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4266,6 +4266,42 @@ void clear_huge_page(struct page *page,
}
}

+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+ int i;
+ unsigned start_idx, end_idx;
+ unsigned start_off, end_off;
+
+ BUG_ON(end < start);
+
+ might_sleep();
+
+ if (start == end)
+ return;
+
+ start_idx = start >> PAGE_SHIFT;
+ start_off = start & ~PAGE_MASK;
+ end_idx = (end - 1) >> PAGE_SHIFT;
+ end_off = ((end - 1) & ~PAGE_MASK) + 1;
+
+ /*
+ * if start and end are on the same small page we can call
+ * zero_user_segment() once and save one kmap_atomic().
+ */
+ if (start_idx == end_idx)
+ return zero_user_segment(page + start_idx, start_off, end_off);
+
+ /* zero the first (possibly partial) page */
+ zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
+ for (i = start_idx + 1; i < end_idx; i++) {
+ cond_resched();
+ clear_highpage(page + i);
+ flush_dcache_page(page + i);
+ }
+ /* zero the last (possibly partial) page */
+ zero_user_segment(page + end_idx, 0, end_off);
+}
+
static void copy_user_gigantic_page(struct page *dst, struct page *src,
unsigned long addr,
struct vm_area_struct *vma,
--
1.7.10.4

2013-05-12 01:21:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 30/39] thp: do_huge_pmd_anonymous_page() cleanup

From: "Kirill A. Shutemov" <[email protected]>

Minor cleanup: unindent most of the function's code by inverting one
condition. It's preparation for the next patch.

No functional changes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 83 +++++++++++++++++++++++++++---------------------------
1 file changed, 41 insertions(+), 42 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 575f29b..ab07f5d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -804,55 +804,54 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long haddr = address & HPAGE_PMD_MASK;
pte_t *pte;

- if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
- if (unlikely(anon_vma_prepare(vma)))
- return VM_FAULT_OOM;
- if (unlikely(khugepaged_enter(vma)))
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ goto out;
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE) &&
+ transparent_hugepage_use_zero_page()) {
+ pgtable_t pgtable;
+ struct page *zero_page;
+ bool set;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
- transparent_hugepage_use_zero_page()) {
- pgtable_t pgtable;
- struct page *zero_page;
- bool set;
- pgtable = pte_alloc_one(mm, haddr);
- if (unlikely(!pgtable))
- return VM_FAULT_OOM;
- zero_page = get_huge_zero_page();
- if (unlikely(!zero_page)) {
- pte_free(mm, pgtable);
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
- spin_lock(&mm->page_table_lock);
- set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
- zero_page);
- spin_unlock(&mm->page_table_lock);
- if (!set) {
- pte_free(mm, pgtable);
- put_huge_zero_page();
- }
- return 0;
- }
- page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
- if (unlikely(!page)) {
+ zero_page = get_huge_zero_page();
+ if (unlikely(!zero_page)) {
+ pte_free(mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
goto out;
}
- count_vm_event(THP_FAULT_ALLOC);
- if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
- put_page(page);
- goto out;
- }
- if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
- mem_cgroup_uncharge_page(page);
- put_page(page);
- goto out;
+ spin_lock(&mm->page_table_lock);
+ set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_page);
+ spin_unlock(&mm->page_table_lock);
+ if (!set) {
+ pte_free(mm, pgtable);
+ put_huge_zero_page();
}
-
return 0;
}
+ page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (unlikely(!page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+ put_page(page);
+ goto out;
+ }
+ if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
+ mem_cgroup_uncharge_page(page);
+ put_page(page);
+ goto out;
+ }
+
+ return 0;
out:
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
--
1.7.10.4

2013-05-12 01:21:53

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages

From: "Kirill A. Shutemov" <[email protected]>

Make arch_get_unmapped_area() return an unmapped area aligned to a huge page
boundary (HPAGE_SIZE) if the file mapping can have huge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/sys_x86_64.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index dbded5a..d97ab40 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -15,6 +15,7 @@
#include <linux/random.h>
#include <linux/uaccess.h>
#include <linux/elf.h>
+#include <linux/pagemap.h>

#include <asm/ia32.h>
#include <asm/syscalls.h>
@@ -34,6 +35,13 @@ static unsigned long get_align_mask(void)
return va_align.mask;
}

+static inline unsigned long mapping_align_mask(struct address_space *mapping)
+{
+ if (mapping_can_have_hugepages(mapping))
+ return PAGE_MASK & ~HPAGE_MASK;
+ return get_align_mask();
+}
+
unsigned long align_vdso_addr(unsigned long addr)
{
unsigned long align_mask = get_align_mask();
@@ -135,7 +143,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.length = len;
info.low_limit = begin;
info.high_limit = end;
- info.align_mask = filp ? get_align_mask() : 0;
+ info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
info.align_offset = pgoff << PAGE_SHIFT;
return vm_unmapped_area(&info);
}
@@ -174,7 +182,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = mm->mmap_base;
- info.align_mask = filp ? get_align_mask() : 0;
+ info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;
info.align_offset = pgoff << PAGE_SHIFT;
addr = vm_unmapped_area(&info);
if (!(addr & ~PAGE_MASK))
--
1.7.10.4

2013-05-12 01:21:52

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 21/39] thp, libfs: initial support of thp in simple_read/write_begin/write_end

From: "Kirill A. Shutemov" <[email protected]>

For now we try to grab a huge cache page if gfp_mask has __GFP_COMP. It's
probably too weak a condition and will need to be reworked later.
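
The zeroing arithmetic can be illustrated with a standalone sketch (example
numbers only, not part of the patch): for a write that does not cover the
whole huge page, the regions before and after the written range are zeroed
against HPAGE_PMD_SIZE rather than PAGE_CACHE_SIZE:

#include <stdio.h>

#define HPAGE_PMD_SIZE	(1UL << 21)
#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))

int main(void)
{
	unsigned long pos = (5UL << 21) + 12345;	/* arbitrary file position */
	unsigned long len = 4000;			/* short write */
	unsigned long from = pos & ~HPAGE_PMD_MASK;	/* offset inside the huge page */

	/*
	 * simple_write_begin() zeroes [0, from) and [from + len,
	 * HPAGE_PMD_SIZE) so the not-yet-written parts of a fresh huge page
	 * don't carry stale data.
	 */
	printf("zero [0, %lu) and [%lu, %lu)\n",
	       from, from + len, HPAGE_PMD_SIZE);
	return 0;
}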

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/libfs.c | 50 ++++++++++++++++++++++++++++++++++++-----------
include/linux/pagemap.h | 8 ++++++++
2 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 916da8c..ce807fe 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -383,7 +383,7 @@ EXPORT_SYMBOL(simple_setattr);

int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
+ clear_pagecache_page(page);
flush_dcache_page(page);
SetPageUptodate(page);
unlock_page(page);
@@ -394,21 +394,44 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
- struct page *page;
+ struct page *page = NULL;
pgoff_t index;

index = pos >> PAGE_CACHE_SHIFT;

- page = grab_cache_page_write_begin(mapping, index, flags);
+ /* XXX: too weak condition? */
+ if (mapping_can_have_hugepages(mapping)) {
+ page = grab_cache_page_write_begin(mapping,
+ index & ~HPAGE_CACHE_INDEX_MASK,
+ flags | AOP_FLAG_TRANSHUGE);
+ /* fallback to small page */
+ if (!page) {
+ unsigned long offset;
+ offset = pos & ~PAGE_CACHE_MASK;
+ len = min_t(unsigned long,
+ len, PAGE_CACHE_SIZE - offset);
+ }
+ BUG_ON(page && !PageTransHuge(page));
+ }
+ if (!page)
+ page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
-
*pagep = page;

- if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
+ if (!PageUptodate(page)) {
+ unsigned from;
+
+ if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user_segment(page, 0, from);
+ zero_huge_user_segment(page,
+ from + len, HPAGE_PMD_SIZE);
+ } else if (len != PAGE_CACHE_SIZE) {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user_segments(page, 0, from,
+ from + len, PAGE_CACHE_SIZE);
+ }
}
return 0;
}
@@ -443,9 +466,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,

/* zero the stale part of the page if we did a short copy */
if (copied < len) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user(page, from + copied, len - copied);
+ unsigned from;
+ if (PageTransHuge(page)) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user(page, from + copied, len - copied);
+ } else {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user(page, from + copied, len - copied);
+ }
}

if (!PageUptodate(page))
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 8feeecc..462fcca 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -579,4 +579,12 @@ static inline int add_to_page_cache(struct page *page,
return error;
}

+static inline void clear_pagecache_page(struct page *page)
+{
+ if (PageTransHuge(page))
+ zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+ else
+ clear_highpage(page);
+}
+
#endif /* _LINUX_PAGEMAP_H */
--
1.7.10.4

2013-05-12 01:21:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 24/39] thp, mm: truncate support for transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

If the starting position of the truncation is in a tail page, we have to
split the huge page first.

We also have to split if the end is within the huge page. Otherwise we can
truncate the whole huge page at once.
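
A standalone sketch of the index check (illustrative only; HPAGE_CACHE_NR is
assumed to be 512, as on x86-64): the huge page needs splitting only when the
truncation end point lands inside it, otherwise the whole compound page is
dropped and the tail indexes are skipped:

#include <stdio.h>

#define HPAGE_CACHE_NR		512UL
#define HPAGE_CACHE_INDEX_MASK	(HPAGE_CACHE_NR - 1)

int main(void)
{
	unsigned long head_index = 3 * HPAGE_CACHE_NR;	/* head page index */
	unsigned long end = head_index + 100;		/* truncation end index */

	/*
	 * end rounded down to a huge page boundary equals the head index,
	 * so the end falls inside this huge page and it must be split.
	 */
	if (head_index == (end & ~HPAGE_CACHE_INDEX_MASK))
		printf("end %lu is inside the huge page at %lu: split\n",
		       end, head_index);
	else
		printf("huge page at %lu can be truncated whole\n", head_index);
	return 0;
}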

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/truncate.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..0152feb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index > end)
break;

+ /* split page if we start from tail page */
+ if (PageTransTail(page))
+ split_huge_page(compound_trans_head(page));
+ if (PageTransHuge(page)) {
+ /* split if end is within huge page */
+ if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+ split_huge_page(page);
+ else
+ /* skip tail pages */
+ i += HPAGE_CACHE_NR - 1;
+ }
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -280,6 +291,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index > end)
break;

+ if (PageTransHuge(page))
+ split_huge_page(page);
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
--
1.7.10.4

2013-05-12 01:21:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 26/39] ramfs: enable transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

ramfs is the simplest fs from the page cache point of view, so let's start
enabling transparent huge page cache here.

For now we allocate only non-movable huge pages, since ramfs pages cannot be
moved yet.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/ramfs/inode.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index c24f1e1..54d69c7 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ /*
+ * TODO: make ramfs pages movable
+ */
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_TRANSHUGE & ~__GFP_MOVABLE);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
--
1.7.10.4

2013-05-12 01:21:46

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

From: "Kirill A. Shutemov" <[email protected]>

It's confusing that mk_huge_pmd() has semantics different from mk_pte()
or mk_pmd().

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust its
prototype to match mk_pte().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a88f9b2..575f29b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -709,11 +709,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
{
pmd_t entry;
- entry = mk_pmd(page, vma->vm_page_prot);
- entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = mk_pmd(page, prot);
entry = pmd_mkhuge(entry);
return entry;
}
@@ -746,7 +745,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
- entry = mk_huge_pmd(page, vma);
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
pgtable_trans_huge_deposit(mm, pgtable);
@@ -1229,7 +1229,8 @@ alloc:
goto out_mn;
} else {
pmd_t entry;
- entry = mk_huge_pmd(new_page, vma);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_clear_flush(vma, haddr, pmd);
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
@@ -2410,7 +2411,8 @@ static void collapse_huge_page(struct mm_struct *mm,
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);

- _pmd = mk_huge_pmd(new_page, vma);
+ _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);

/*
* spin_lock() below is not the equivalent of smp_wmb(), so
--
1.7.10.4

2013-05-12 01:21:44

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines

From: "Kirill A. Shutemov" <[email protected]>

For now we still read/write at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with a backing store.
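
The subpage arithmetic can be illustrated with a standalone sketch (example
values only): the offset is first taken relative to the huge page, then split
into a subpage index and an offset within that subpage for
iov_iter_copy_from_user_atomic():

#include <stdio.h>

#define PAGE_CACHE_SHIFT	12
#define PAGE_CACHE_SIZE		(1UL << PAGE_CACHE_SHIFT)
#define PAGE_CACHE_MASK		(~(PAGE_CACHE_SIZE - 1))
#define HPAGE_PMD_MASK		(~((1UL << 21) - 1))

int main(void)
{
	unsigned long pos = (7UL << 21) + 0x5678;	/* position inside a huge page */
	unsigned long offset = pos & ~HPAGE_PMD_MASK;	/* offset within the 2M page */

	/*
	 * The copy still targets one 4K subpage at a time:
	 * head page + (offset >> PAGE_CACHE_SHIFT) selects the subpage,
	 * offset & ~PAGE_CACHE_MASK is the offset inside it.
	 */
	printf("subpage %lu, in-page offset %#lx\n",
	       offset >> PAGE_CACHE_SHIFT, offset & ~PAGE_CACHE_MASK);
	return 0;
}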

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 19 ++++++++++++++++++-
1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e086ef0..ebd361a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1177,6 +1177,17 @@ find_page:
if (unlikely(page == NULL))
goto no_cached_page;
}
+ if (PageTransCompound(page)) {
+ struct page *head = compound_trans_head(page);
+ /*
+ * We don't yet support huge pages in page cache
+ * for filesystems with backing device, so pages
+ * should always be up-to-date.
+ */
+ BUG_ON(ra->ra_pages);
+ BUG_ON(!PageUptodate(head));
+ goto page_ok;
+ }
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
ra, filp, page,
@@ -2413,8 +2424,13 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);

+ if (PageTransHuge(page))
+ offset = pos & ~HPAGE_PMD_MASK;
+
pagefault_disable();
- copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+ copied = iov_iter_copy_from_user_atomic(
+ page + (offset >> PAGE_CACHE_SHIFT),
+ i, offset & ~PAGE_CACHE_MASK, bytes);
pagefault_enable();
flush_dcache_page(page);

@@ -2437,6 +2453,7 @@ again:
* because not all segments in the iov can be copied at
* once without a pagefault.
*/
+ offset = pos & ~PAGE_CACHE_MASK;
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_single_seg_count(i));
goto again;
--
1.7.10.4

2013-05-12 01:25:47

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages

From: "Kirill A. Shutemov" <[email protected]>

The VM_WRITE|VM_SHARED case is already mostly covered by do_wp_page_shared().
We only need to handle locking differently and set up a pmd instead of a pte.

do_huge_pmd_wp_page() itself needs only a few minor changes:

- we may now need to allocate an anon_vma on a write-protect fault: having a
  huge page to COW doesn't mean we have an anon_vma, since the huge page can
  be file-backed;
- we need to adjust the mm counters when COWing file pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 4 +++
mm/huge_memory.c | 17 +++++++++++--
mm/memory.c | 70 +++++++++++++++++++++++++++++++++++++++++-----------
3 files changed, 74 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 563c8b7..7f3bc24 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1001,6 +1001,10 @@ extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags);
+extern int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, struct page *page,
+ pte_t *page_table, spinlock_t *ptl,
+ pte_t orig_pte, pmd_t orig_pmd);
#else
static inline int handle_mm_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 893cc69..d7c9df5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1110,7 +1110,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

- VM_BUG_ON(!vma->anon_vma);
haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
@@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- if (page_mapcount(page) == 1) {
+ if (PageAnon(page) && page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1129,9 +1128,18 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
ret |= VM_FAULT_WRITE;
goto out_unlock;
}
+
+ if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED)) {
+ pte_t __unused;
+ return do_wp_page_shared(mm, vma, address, pmd, page,
+ NULL, NULL, __unused, orig_pmd);
+ }
get_page(page);
spin_unlock(&mm->page_table_lock);
alloc:
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1195,6 +1203,11 @@ alloc:
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
put_huge_zero_page();
} else {
+ if (!PageAnon(page)) {
+ /* File page COWed with anon page */
+ add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ }
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
put_page(page);
diff --git a/mm/memory.c b/mm/memory.c
index 4685dd1..ebff552 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2640,16 +2640,33 @@ static void mkwrite_pte(struct vm_area_struct *vma, unsigned long address,
update_mmu_cache(vma, address, page_table);
}

+static void mkwrite_pmd(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, pmd_t orig_pmd)
+{
+ pmd_t entry;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+
+ flush_cache_page(vma, address, pmd_pfn(orig_pmd));
+ entry = pmd_mkyoung(orig_pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache_pmd(vma, address, pmd);
+}
+
/*
* Only catch write-faults on shared writable pages, read-only shared pages can
* get COWed by get_user_pages(.write=1, .force=1).
*/
-static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- spinlock_t *ptl, pte_t orig_pte, struct page *page)
+int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, struct page *page,
+ pte_t *page_table, spinlock_t *ptl,
+ pte_t orig_pte, pmd_t orig_pmd)
{
struct vm_fault vmf;
bool page_mkwrite = false;
+ /* no page_table means caller asks for THP */
+ bool thp = (page_table == NULL) &&
+ IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
int tmp, ret = 0;

if (vma->vm_ops && vma->vm_ops->page_mkwrite)
@@ -2660,6 +2677,9 @@ static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
vmf.page = page;

+ if (thp)
+ vmf.flags |= FAULT_FLAG_TRANSHUGE;
+
/*
* Notify the address space that the page is about to
* become writable so that it can prohibit this or wait
@@ -2669,7 +2689,10 @@ static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* sleep if it needs to.
*/
page_cache_get(page);
- pte_unmap_unlock(page_table, ptl);
+ if (thp)
+ spin_unlock(&mm->page_table_lock);
+ else
+ pte_unmap_unlock(page_table, ptl);

tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
@@ -2693,19 +2716,34 @@ static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
* they did, we just return, as we can count on the
* MMU to tell us if they didn't also make it writable.
*/
- page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_same(*page_table, orig_pte)) {
- unlock_page(page);
- pte_unmap_unlock(page_table, ptl);
- page_cache_release(page);
- return ret;
+ if (thp) {
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+ unlock_page(page);
+ spin_unlock(&mm->page_table_lock);
+ page_cache_release(page);
+ return ret;
+ }
+ } else {
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_same(*page_table, orig_pte)) {
+ unlock_page(page);
+ pte_unmap_unlock(page_table, ptl);
+ page_cache_release(page);
+ return ret;
+ }
}

page_mkwrite = true;
mkwrite_done:
get_page(page);
- mkwrite_pte(vma, address, page_table, orig_pte);
- pte_unmap_unlock(page_table, ptl);
+ if (thp) {
+ mkwrite_pmd(vma, address, pmd, orig_pmd);
+ spin_unlock(&mm->page_table_lock);
+ } else {
+ mkwrite_pte(vma, address, page_table, orig_pte);
+ pte_unmap_unlock(page_table, ptl);
+ }
dirty_page(vma, page, page_mkwrite);
return ret | VM_FAULT_WRITE;
}
@@ -2787,9 +2825,11 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
- (VM_WRITE|VM_SHARED)))
- return do_wp_page_shared(mm, vma, address, page_table, pmd, ptl,
- orig_pte, old_page);
+ (VM_WRITE|VM_SHARED))) {
+ pmd_t __unused;
+ return do_wp_page_shared(mm, vma, address, pmd, old_page,
+ page_table, ptl, orig_pte, __unused);
+ }

/*
* Ok, we need to copy. Oh, well..
--
1.7.10.4

2013-05-12 01:26:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 38/39] thp: vma_adjust_trans_huge(): adjust file-backed VMA too

From: "Kirill A. Shutemov" <[email protected]>

Since we're going to have huge pages in the page cache, vma_adjust_trans_huge()
has to be called for file-backed VMAs too, as they can potentially contain
huge pages.

For now we call it for all VMAs.

Later we will probably need to introduce a flag to indicate that the VMA has
huge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 11 +----------
mm/huge_memory.c | 2 +-
2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b20334a..f4d6626 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -139,7 +139,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
#endif
extern int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice);
-extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next);
@@ -155,15 +155,6 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
else
return 0;
}
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
- unsigned long start,
- unsigned long end,
- long adjust_next)
-{
- if (!vma->anon_vma || vma->vm_ops)
- return;
- __vma_adjust_trans_huge(vma, start, end, adjust_next);
-}
static inline int hpage_nr_pages(struct page *page)
{
if (unlikely(PageTransHuge(page)))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7c9df5..9c3815b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2783,7 +2783,7 @@ static void split_huge_page_address(struct mm_struct *mm,
split_huge_page_pmd_mm(mm, address, pmd);
}

-void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next)
--
1.7.10.4

2013-05-12 01:26:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 35/39] mm: decomposite do_wp_page() and get rid of some 'goto' logic

From: "Kirill A. Shutemov" <[email protected]>

Let's extract the 'reuse' path into a separate function and use it instead
of the ugly goto.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 110 ++++++++++++++++++++++++++++++++---------------------------
1 file changed, 59 insertions(+), 51 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 8997cd8..eb99ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2594,6 +2594,52 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
copy_user_highpage(dst, src, va, vma);
}

+static void dirty_page(struct vm_area_struct *vma, struct page *page,
+ bool page_mkwrite)
+{
+ /*
+ * Yes, Virginia, this is actually required to prevent a race
+ * with clear_page_dirty_for_io() from clearing the page dirty
+ * bit after it clear all dirty ptes, but before a racing
+ * do_wp_page installs a dirty pte.
+ *
+ * __do_fault is protected similarly.
+ */
+ if (!page_mkwrite) {
+ wait_on_page_locked(page);
+ set_page_dirty_balance(page, page_mkwrite);
+ /* file_update_time outside page_lock */
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+ }
+ put_page(page);
+ if (page_mkwrite) {
+ struct address_space *mapping = page->mapping;
+
+ set_page_dirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+ if (mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+ }
+}
+
+static void mkwrite_pte(struct vm_area_struct *vma, unsigned long address,
+ pte_t *page_table, pte_t orig_pte)
+{
+ pte_t entry;
+ flush_cache_page(vma, address, pte_pfn(orig_pte));
+ entry = pte_mkyoung(orig_pte);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (ptep_set_access_flags(vma, address, page_table, entry, 1))
+ update_mmu_cache(vma, address, page_table);
+}
+
/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
@@ -2618,10 +2664,8 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
__releases(ptl)
{
struct page *old_page, *new_page = NULL;
- pte_t entry;
int ret = 0;
int page_mkwrite = 0;
- struct page *dirty_page = NULL;
unsigned long mmun_start = 0; /* For mmu_notifiers */
unsigned long mmun_end = 0; /* For mmu_notifiers */

@@ -2635,8 +2679,11 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
* accounting on raw pfn maps.
*/
if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
- (VM_WRITE|VM_SHARED))
- goto reuse;
+ (VM_WRITE|VM_SHARED)) {
+ mkwrite_pte(vma, address, page_table, orig_pte);
+ pte_unmap_unlock(page_table, ptl);
+ return VM_FAULT_WRITE;
+ }
goto gotten;
}

@@ -2665,7 +2712,9 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
page_move_anon_rmap(old_page, vma, address);
unlock_page(old_page);
- goto reuse;
+ mkwrite_pte(vma, address, page_table, orig_pte);
+ pte_unmap_unlock(page_table, ptl);
+ return VM_FAULT_WRITE;
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
@@ -2727,53 +2776,11 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

page_mkwrite = 1;
}
- dirty_page = old_page;
- get_page(dirty_page);
-
-reuse:
- flush_cache_page(vma, address, pte_pfn(orig_pte));
- entry = pte_mkyoung(orig_pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (ptep_set_access_flags(vma, address, page_table, entry,1))
- update_mmu_cache(vma, address, page_table);
+ get_page(old_page);
+ mkwrite_pte(vma, address, page_table, orig_pte);
pte_unmap_unlock(page_table, ptl);
- ret |= VM_FAULT_WRITE;
-
- if (!dirty_page)
- return ret;
-
- /*
- * Yes, Virginia, this is actually required to prevent a race
- * with clear_page_dirty_for_io() from clearing the page dirty
- * bit after it clear all dirty ptes, but before a racing
- * do_wp_page installs a dirty pte.
- *
- * __do_fault is protected similarly.
- */
- if (!page_mkwrite) {
- wait_on_page_locked(dirty_page);
- set_page_dirty_balance(dirty_page, page_mkwrite);
- /* file_update_time outside page_lock */
- if (vma->vm_file)
- file_update_time(vma->vm_file);
- }
- put_page(dirty_page);
- if (page_mkwrite) {
- struct address_space *mapping = dirty_page->mapping;
-
- set_page_dirty(dirty_page);
- unlock_page(dirty_page);
- page_cache_release(dirty_page);
- if (mapping) {
- /*
- * Some device drivers do not set page.mapping
- * but still dirty their pages
- */
- balance_dirty_pages_ratelimited(mapping);
- }
- }
-
- return ret;
+ dirty_page(vma, old_page, page_mkwrite);
+ return ret | VM_FAULT_WRITE;
}

/*
@@ -2810,6 +2817,7 @@ gotten:
*/
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
+ pte_t entry;
if (old_page) {
if (!PageAnon(old_page)) {
dec_mm_counter_fast(mm, MM_FILEPAGES);
--
1.7.10.4

2013-05-12 01:26:44

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()

From: "Kirill A. Shutemov" <[email protected]>

do_huge_pmd_anonymous_page() has a copy-pasted piece of handle_mm_fault()
to handle the fallback path.

Let's consolidate the code by introducing a VM_FAULT_FALLBACK return
code.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 3 ---
include/linux/mm.h | 3 ++-
mm/huge_memory.c | 31 +++++--------------------------
mm/memory.c | 9 ++++++---
4 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9e6425f..d688271 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,9 +101,6 @@ extern int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end);
-extern int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page_to_list(struct page *page, struct list_head *list);
static inline int split_huge_page(struct page *page)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5e156fb..280b414 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -881,11 +881,12 @@ static inline int page_mapped(struct page *page)
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
+#define VM_FAULT_FALLBACK 0x0800 /* huge page fault failed, fall back to small */

#define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */

#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON | \
- VM_FAULT_HWPOISON_LARGE)
+ VM_FAULT_FALLBACK | VM_FAULT_HWPOISON_LARGE)

/* Encode hstate index for a hwpoisoned large page */
#define VM_FAULT_SET_HINDEX(x) ((x) << 12)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ab07f5d..facfdac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -802,10 +802,9 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page;
unsigned long haddr = address & HPAGE_PMD_MASK;
- pte_t *pte;

if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
- goto out;
+ return VM_FAULT_FALLBACK;
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
@@ -822,7 +821,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!zero_page)) {
pte_free(mm, pgtable);
count_vm_event(THP_FAULT_FALLBACK);
- goto out;
+ return VM_FAULT_FALLBACK;
}
spin_lock(&mm->page_table_lock);
set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
@@ -838,40 +837,20 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
vma, haddr, numa_node_id(), 0);
if (unlikely(!page)) {
count_vm_event(THP_FAULT_FALLBACK);
- goto out;
+ return VM_FAULT_FALLBACK;
}
count_vm_event(THP_FAULT_ALLOC);
if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
put_page(page);
- goto out;
+ return VM_FAULT_FALLBACK;
}
if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd, page))) {
mem_cgroup_uncharge_page(page);
put_page(page);
- goto out;
+ return VM_FAULT_FALLBACK;
}

return 0;
-out:
- /*
- * Use __pte_alloc instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pmd_none(*pmd)) &&
- unlikely(__pte_alloc(mm, vma, pmd, address)))
- return VM_FAULT_OOM;
- /* if an huge pmd materialized from under us just retry later */
- if (unlikely(pmd_trans_huge(*pmd)))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge pmd
- * from under us anymore at this point because we hold the mmap_sem
- * read mode and khugepaged takes it in write mode. So now it's
- * safe to run pte_offset_map().
- */
- pte = pte_offset_map(pmd, address);
- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index c845cf2..4008d93 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3701,7 +3701,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
*/
-int handle_pte_fault(struct mm_struct *mm,
+static int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pmd_t *pmd, unsigned int flags)
{
@@ -3788,9 +3788,12 @@ retry:
if (!pmd)
return VM_FAULT_OOM;
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
+ int ret = 0;
if (!vma->vm_ops)
- return do_huge_pmd_anonymous_page(mm, vma, address,
- pmd, flags);
+ ret = do_huge_pmd_anonymous_page(mm, vma, address,
+ pmd, flags);
+ if ((ret & VM_FAULT_FALLBACK) == 0)
+ return ret;
} else {
pmd_t orig_pmd = *pmd;
int ret;
--
1.7.10.4

2013-05-12 01:26:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 36/39] mm: do_wp_page(): extract VM_WRITE|VM_SHARED case to separate function

From: "Kirill A. Shutemov" <[email protected]>

The code will be shared with transhuge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 142 ++++++++++++++++++++++++++++++-----------------------------
1 file changed, 73 insertions(+), 69 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index eb99ab1..4685dd1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2641,6 +2641,76 @@ static void mkwrite_pte(struct vm_area_struct *vma, unsigned long address,
}

/*
+ * Only catch write-faults on shared writable pages, read-only shared pages can
+ * get COWed by get_user_pages(.write=1, .force=1).
+ */
+static int do_wp_page_shared(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ spinlock_t *ptl, pte_t orig_pte, struct page *page)
+{
+ struct vm_fault vmf;
+ bool page_mkwrite = false;
+ int tmp, ret = 0;
+
+ if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+ goto mkwrite_done;
+
+ vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ vmf.pgoff = page->index;
+ vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+ vmf.page = page;
+
+ /*
+ * Notify the address space that the page is about to
+ * become writable so that it can prohibit this or wait
+ * for the page to get into an appropriate state.
+ *
+ * We do this without the lock held, so that it can
+ * sleep if it needs to.
+ */
+ page_cache_get(page);
+ pte_unmap_unlock(page_table, ptl);
+
+ tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+ if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+ ret = tmp;
+ page_cache_release(page);
+ return ret;
+ }
+ if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+ lock_page(page);
+ if (!page->mapping) {
+ unlock_page(page);
+ page_cache_release(page);
+ return ret;
+ }
+ } else
+ VM_BUG_ON(!PageLocked(page));
+
+ /*
+ * Since we dropped the lock we need to revalidate
+ * the PTE as someone else may have changed it. If
+ * they did, we just return, as we can count on the
+ * MMU to tell us if they didn't also make it writable.
+ */
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_same(*page_table, orig_pte)) {
+ unlock_page(page);
+ pte_unmap_unlock(page_table, ptl);
+ page_cache_release(page);
+ return ret;
+ }
+
+ page_mkwrite = true;
+mkwrite_done:
+ get_page(page);
+ mkwrite_pte(vma, address, page_table, orig_pte);
+ pte_unmap_unlock(page_table, ptl);
+ dirty_page(vma, page, page_mkwrite);
+ return ret | VM_FAULT_WRITE;
+}
+
+/*
* This routine handles present pages, when users try to write
* to a shared page. It is done by copying the page to a new address
* and decrementing the shared-page counter for the old page.
@@ -2665,7 +2735,6 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *old_page, *new_page = NULL;
int ret = 0;
- int page_mkwrite = 0;
unsigned long mmun_start = 0; /* For mmu_notifiers */
unsigned long mmun_end = 0; /* For mmu_notifiers */

@@ -2718,70 +2787,9 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
- (VM_WRITE|VM_SHARED))) {
- /*
- * Only catch write-faults on shared writable pages,
- * read-only shared pages can get COWed by
- * get_user_pages(.write=1, .force=1).
- */
- if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
- struct vm_fault vmf;
- int tmp;
-
- vmf.virtual_address = (void __user *)(address &
- PAGE_MASK);
- vmf.pgoff = old_page->index;
- vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
- vmf.page = old_page;
-
- /*
- * Notify the address space that the page is about to
- * become writable so that it can prohibit this or wait
- * for the page to get into an appropriate state.
- *
- * We do this without the lock held, so that it can
- * sleep if it needs to.
- */
- page_cache_get(old_page);
- pte_unmap_unlock(page_table, ptl);
-
- tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
- if (unlikely(tmp &
- (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
- ret = tmp;
- goto unwritable_page;
- }
- if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
- lock_page(old_page);
- if (!old_page->mapping) {
- ret = 0; /* retry the fault */
- unlock_page(old_page);
- goto unwritable_page;
- }
- } else
- VM_BUG_ON(!PageLocked(old_page));
-
- /*
- * Since we dropped the lock we need to revalidate
- * the PTE as someone else may have changed it. If
- * they did, we just return, as we can count on the
- * MMU to tell us if they didn't also make it writable.
- */
- page_table = pte_offset_map_lock(mm, pmd, address,
- &ptl);
- if (!pte_same(*page_table, orig_pte)) {
- unlock_page(old_page);
- goto unlock;
- }
-
- page_mkwrite = 1;
- }
- get_page(old_page);
- mkwrite_pte(vma, address, page_table, orig_pte);
- pte_unmap_unlock(page_table, ptl);
- dirty_page(vma, old_page, page_mkwrite);
- return ret | VM_FAULT_WRITE;
- }
+ (VM_WRITE|VM_SHARED)))
+ return do_wp_page_shared(mm, vma, address, page_table, pmd, ptl,
+ orig_pte, old_page);

/*
* Ok, we need to copy. Oh, well..
@@ -2900,10 +2908,6 @@ oom:
if (old_page)
page_cache_release(old_page);
return VM_FAULT_OOM;
-
-unwritable_page:
- page_cache_release(old_page);
- return ret;
}

static void unmap_mapping_range_vma(struct vm_area_struct *vma,
--
1.7.10.4

2013-05-12 01:21:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 02/39] block: implement add_bdi_stat()

From: "Kirill A. Shutemov" <[email protected]>

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts bdi stats by an arbitrary
amount. It's required for batched page cache manipulations.
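
A hedged usage sketch (kernel-style, illustrative only; the helper name and
the choice of BDI_RECLAIMABLE are just examples, not part of the patch): a
batched removal of nr page cache entries could adjust the accounting in a
single call instead of nr dec_bdi_stat() calls:

#include <linux/backing-dev.h>

/* example only: drop nr entries' worth of reclaimable state at once */
static inline void example_batched_unaccount(struct backing_dev_info *bdi,
					     long nr)
{
	add_bdi_stat(bdi, BDI_RECLAIMABLE, -nr);
}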

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/backing-dev.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..b05d961 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
__add_bdi_stat(bdi, item, -1);
}

+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, s64 amount)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __add_bdi_stat(bdi, item, amount);
+ local_irq_restore(flags);
+}
+
static inline void dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
{
--
1.7.10.4

2013-05-12 01:27:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 17/39] thp, mm: handle tail pages in page_cache_get_speculative()

From: "Kirill A. Shutemov" <[email protected]>

For a tail page we call __get_page_tail(), which has the same semantics,
but for tail pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 28597ec..2e86251 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,9 @@ static inline int page_cache_get_speculative(struct page *page)
{
VM_BUG_ON(in_interrupt());

+ if (unlikely(PageTail(page)))
+ return __get_page_tail(page);
+
#ifdef CONFIG_TINY_RCU
# ifdef CONFIG_PREEMPT_COUNT
VM_BUG_ON(!in_atomic());
@@ -187,7 +190,6 @@ static inline int page_cache_get_speculative(struct page *page)
return 0;
}
#endif
- VM_BUG_ON(PageTail(page));

return 1;
}
--
1.7.10.4

2013-05-12 01:27:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP

From: "Kirill A. Shutemov" <[email protected]>

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache any time soon.

Let's postpone the implementation of THP handling in
replace_page_cache_page() until anyone actually needs it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 657ce82..3a03426 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -428,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
{
int error;

+ VM_BUG_ON(PageTransHuge(old));
+ VM_BUG_ON(PageTransHuge(new));
VM_BUG_ON(!PageLocked(old));
VM_BUG_ON(!PageLocked(new));
VM_BUG_ON(new->mapping);
--
1.7.10.4

2013-05-12 01:27:53

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 528454c..6b4c9b2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
#define HPAGE_PMD_MASK HPAGE_MASK
#define HPAGE_PMD_SIZE HPAGE_SIZE

+#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

#define transparent_hugepage_enabled(__vma) \
@@ -185,6 +189,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

+#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
#define hpage_nr_pages(x) 1

#define transparent_hugepage_enabled(__vma) 0
--
1.7.10.4

2013-05-12 01:28:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 18/39] thp, mm: add event counters for huge page alloc on write to a file

From: "Kirill A. Shutemov" <[email protected]>

The existing stats specify the source of a thp page: fault or collapse.
We're going to allocate a new huge page on write(2), which is neither a
fault nor a collapse.

Let's introduce new events for that.
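
A hedged sketch of how the new counters might be used (kernel-style,
illustrative only; the helper below is hypothetical, the real call sites are
added later in the series):

#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/vmstat.h>

/* hypothetical helper: account a huge page allocation done for write(2) */
static struct page *example_alloc_huge_for_write(void)
{
	struct page *page = alloc_pages(GFP_TRANSHUGE, HPAGE_PMD_ORDER);

	if (page)
		count_vm_event(THP_WRITE_ALLOC);
	else
		count_vm_event(THP_WRITE_ALLOC_FAILED);
	return page;
}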

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
mm/vmstat.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d4b7a18..584c71c 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_WRITE_ALLOC,
+ THP_WRITE_ALLOC_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7945285..df8dcda 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -821,6 +821,8 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_write_alloc",
+ "thp_write_alloc_failed",
"thp_split",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
--
1.7.10.4

2013-05-12 01:28:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 16/39] thp, mm: locking tail page is a bug

From: "Kirill A. Shutemov" <[email protected]>

Locking the head page means locking the entire compound page.
If we try to lock a tail page, something has gone wrong.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 3a03426..9ea46a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -681,6 +681,7 @@ void __lock_page(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
TASK_UNINTERRUPTIBLE);
}
@@ -690,6 +691,7 @@ int __lock_page_killable(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
return __wait_on_bit_lock(page_waitqueue(page), &wait,
sleep_on_page_killable, TASK_KILLABLE);
}
--
1.7.10.4

2013-05-12 01:28:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

From: "Kirill A. Shutemov" <[email protected]>

Since we're going to have huge pages backed by files,
wait_split_huge_page() has to serialize not only over anon_vma_lock,
but over i_mmap_mutex too.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 15 ++++++++++++---
mm/huge_memory.c | 4 ++--
mm/memory.c | 4 ++--
3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 74494a2..9e6425f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -118,11 +118,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
__split_huge_page_pmd(__vma, __address, \
____pmd); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
+#define wait_split_huge_page(__vma, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
+ struct address_space *__mapping = \
+ vma->vm_file->f_mapping; \
+ struct anon_vma *__anon_vma = (__vma)->anon_vma; \
+ if (__mapping) \
+ mutex_lock(&__mapping->i_mmap_mutex); \
+ if (__anon_vma) { \
+ anon_vma_lock_write(__anon_vma); \
+ anon_vma_unlock_write(__anon_vma); \
+ } \
+ if (__mapping) \
+ mutex_unlock(&__mapping->i_mmap_mutex); \
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73974e8..7ad458d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -924,7 +924,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
spin_unlock(&dst_mm->page_table_lock);
pte_free(dst_mm, pgtable);

- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ wait_split_huge_page(vma, src_pmd); /* src_vma */
goto out;
}
src_page = pmd_page(pmd);
@@ -1497,7 +1497,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&vma->vm_mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return -1;
} else {
/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index f02a8be..c845cf2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -620,7 +620,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return 0;
}

@@ -1530,7 +1530,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
pmd, flags);
--
1.7.10.4

2013-05-12 01:28:22

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 25/39] thp, mm: split huge page on mmap file page

From: "Kirill A. Shutemov" <[email protected]>

We are not ready to mmap file-backed transparent huge pages yet. Let's
split them on a fault attempt.

Later in the patchset we'll implement mmap() properly and this code path
will only be used for fallback cases.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index ebd361a..9877347 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1700,6 +1700,8 @@ retry_find:
goto no_cached_page;
}

+ if (PageTransCompound(page))
+ split_huge_page(compound_trans_head(page));
if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
page_cache_release(page);
return ret | VM_FAULT_RETRY;
--
1.7.10.4

2013-05-12 01:28:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 32/39] mm: cleanup __do_fault() implementation

From: "Kirill A. Shutemov" <[email protected]>

Let's clean up __do_fault() to prepare it for the injection of transparent
huge page support.

Cleanups:
- int -> bool where appropriate;
- unindent some code by inverting the 'if' condition;
- extract the !pte_same() path to make it clearer;
- separate the pte update from the mm stats update;
- reformat some comments.

Functionality is not changed.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 157 +++++++++++++++++++++++++++++------------------------------
1 file changed, 76 insertions(+), 81 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4008d93..97b22c7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3301,21 +3301,18 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
{
pte_t *page_table;
spinlock_t *ptl;
- struct page *page;
- struct page *cow_page;
+ struct page *page, *cow_page, *dirty_page = NULL;
pte_t entry;
- int anon = 0;
- struct page *dirty_page = NULL;
+ bool anon = false, page_mkwrite = false;
+ bool write = flags & FAULT_FLAG_WRITE;
struct vm_fault vmf;
int ret;
- int page_mkwrite = 0;

/*
* If we do COW later, allocate page befor taking lock_page()
* on the file cache page. This will reduce lock holding time.
*/
- if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
-
+ if (write && !(vma->vm_flags & VM_SHARED)) {
if (unlikely(anon_vma_prepare(vma)))
return VM_FAULT_OOM;

@@ -3336,8 +3333,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
vmf.page = NULL;

ret = vma->vm_ops->fault(vma, &vmf);
- if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
- VM_FAULT_RETRY)))
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

if (unlikely(PageHWPoison(vmf.page))) {
@@ -3356,98 +3352,89 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
else
VM_BUG_ON(!PageLocked(vmf.page));

+ page = vmf.page;
+ if (!write)
+ goto update_pgtable;
+
/*
* Should we do an early C-O-W break?
*/
- page = vmf.page;
- if (flags & FAULT_FLAG_WRITE) {
- if (!(vma->vm_flags & VM_SHARED)) {
- page = cow_page;
- anon = 1;
- copy_user_highpage(page, vmf.page, address, vma);
- __SetPageUptodate(page);
- } else {
- /*
- * If the page will be shareable, see if the backing
- * address space wants to know that the page is about
- * to become writable
- */
- if (vma->vm_ops->page_mkwrite) {
- int tmp;
-
+ if (!(vma->vm_flags & VM_SHARED)) {
+ page = cow_page;
+ anon = true;
+ copy_user_highpage(page, vmf.page, address, vma);
+ __SetPageUptodate(page);
+ } else if (vma->vm_ops->page_mkwrite) {
+ /*
+ * If the page will be shareable, see if the backing address
+ * space wants to know that the page is about to become writable
+ */
+ int tmp;
+
+ unlock_page(page);
+ vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+ tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+ if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+ ret = tmp;
+ goto unwritable_page;
+ }
+ if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+ lock_page(page);
+ if (!page->mapping) {
+ ret = 0; /* retry the fault */
unlock_page(page);
- vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
- tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
- if (unlikely(tmp &
- (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
- ret = tmp;
- goto unwritable_page;
- }
- if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
- lock_page(page);
- if (!page->mapping) {
- ret = 0; /* retry the fault */
- unlock_page(page);
- goto unwritable_page;
- }
- } else
- VM_BUG_ON(!PageLocked(page));
- page_mkwrite = 1;
+ goto unwritable_page;
}
- }
-
+ } else
+ VM_BUG_ON(!PageLocked(page));
+ page_mkwrite = true;
}

+update_pgtable:
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+ /* Only go through if we didn't race with anybody else... */
+ if (unlikely(!pte_same(*page_table, orig_pte))) {
+ pte_unmap_unlock(page_table, ptl);
+ goto race_out;
+ }
+
+ flush_icache_page(vma, page);
+ if (anon) {
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
+ page_add_new_anon_rmap(page, vma, address);
+ } else {
+ inc_mm_counter_fast(mm, MM_FILEPAGES);
+ page_add_file_rmap(page);
+ if (write) {
+ dirty_page = page;
+ get_page(dirty_page);
+ }
+ }

/*
- * This silly early PAGE_DIRTY setting removes a race
- * due to the bad i386 page protection. But it's valid
- * for other architectures too.
+ * This silly early PAGE_DIRTY setting removes a race due to the bad
+ * i386 page protection. But it's valid for other architectures too.
*
- * Note that if FAULT_FLAG_WRITE is set, we either now have
- * an exclusive copy of the page, or this is a shared mapping,
- * so we can make it writable and dirty to avoid having to
- * handle that later.
+ * Note that if FAULT_FLAG_WRITE is set, we either now have an
+ * exclusive copy of the page, or this is a shared mapping, so we can
+ * make it writable and dirty to avoid having to handle that later.
*/
- /* Only go through if we didn't race with anybody else... */
- if (likely(pte_same(*page_table, orig_pte))) {
- flush_icache_page(vma, page);
- entry = mk_pte(page, vma->vm_page_prot);
- if (flags & FAULT_FLAG_WRITE)
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (anon) {
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, address);
- } else {
- inc_mm_counter_fast(mm, MM_FILEPAGES);
- page_add_file_rmap(page);
- if (flags & FAULT_FLAG_WRITE) {
- dirty_page = page;
- get_page(dirty_page);
- }
- }
- set_pte_at(mm, address, page_table, entry);
+ entry = mk_pte(page, vma->vm_page_prot);
+ if (write)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ set_pte_at(mm, address, page_table, entry);

- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, address, page_table);
- } else {
- if (cow_page)
- mem_cgroup_uncharge_page(cow_page);
- if (anon)
- page_cache_release(page);
- else
- anon = 1; /* no anon but release faulted_page */
- }
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache(vma, address, page_table);

pte_unmap_unlock(page_table, ptl);

if (dirty_page) {
struct address_space *mapping = page->mapping;
- int dirtied = 0;
+ bool dirtied = false;

if (set_page_dirty(dirty_page))
- dirtied = 1;
+ dirtied = true;
unlock_page(dirty_page);
put_page(dirty_page);
if ((dirtied || page_mkwrite) && mapping) {
@@ -3479,6 +3466,14 @@ uncharge_out:
page_cache_release(cow_page);
}
return ret;
+race_out:
+ if (cow_page)
+ mem_cgroup_uncharge_page(cow_page);
+ if (anon)
+ page_cache_release(page);
+ unlock_page(vmf.page);
+ page_cache_release(vmf.page);
+ return ret;
}

static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
--
1.7.10.4

2013-05-12 01:28:17

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 05/39] memcg, thp: charge huge cache pages

From: "Kirill A. Shutemov" <[email protected]>

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. It looks like it's just
legacy (introduced in 52d4b9a "memcg: allocate all page_cgroup at boot").

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe4f123..a7de6a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4080,8 +4080,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,

if (mem_cgroup_disabled())
return 0;
- if (PageCompound(page))
- return 0;

if (!PageSwapCache(page))
ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
--
1.7.10.4

2013-05-12 01:21:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate

From: "Kirill A. Shutemov" <[email protected]>

Returns true if the mapping can have huge pages. For now, just check for
__GFP_COMP in the mapping's gfp mask.
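
A hedged usage sketch (kernel-style, illustrative only; it mirrors how
simple_write_begin() uses the predicate elsewhere in the series): a filesystem
asks for a huge page only when the mapping's gfp mask allows it and falls back
to a small page otherwise:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/huge_mm.h>

/* example only: prefer a huge page when the mapping permits it */
static struct page *example_grab_page(struct address_space *mapping,
				      pgoff_t index, unsigned flags)
{
	struct page *page = NULL;

	if (mapping_can_have_hugepages(mapping))
		page = grab_cache_page_write_begin(mapping,
				index & ~HPAGE_CACHE_INDEX_MASK,
				flags | AOP_FLAG_TRANSHUGE);
	if (!page)	/* fall back to a small page */
		page = grab_cache_page_write_begin(mapping, index, flags);
	return page;
}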

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..28597ec 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
(__force unsigned long)mask;
}

+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
+ gfp_t gfp_mask = mapping_gfp_mask(m);
+ /* __GFP_COMP is key part of GFP_TRANSHUGE */
+ return !!(gfp_mask & __GFP_COMP) &&
+ transparent_hugepage_pagecache();
+ }
+
+ return false;
+}
+
/*
* The page cache can done in larger chunks than
* one page, because it allows for more efficient
--
1.7.10.4

2013-05-12 01:29:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
head page for the specified index and HPAGE_CACHE_NR-1 tail pages for the
following indexes.
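
A standalone sketch of the resulting radix-tree layout (illustrative only;
HPAGE_CACHE_NR is assumed to be 512): the head page occupies the given offset
and each tail page occupies the next slot, so a huge page spans one aligned
block of HPAGE_CACHE_NR indexes:

#include <stdio.h>

#define HPAGE_CACHE_NR	512UL

int main(void)
{
	unsigned long offset = 2 * HPAGE_CACHE_NR;	/* huge-page-aligned index */
	unsigned long i;

	/* head page at 'offset', tail page i at 'offset + i' */
	for (i = 0; i < HPAGE_CACHE_NR; i++)
		if (i < 2 || i == HPAGE_CACHE_NR - 1)
			printf("slot %lu -> %s page %lu\n", offset + i,
			       i ? "tail" : "head", i);
	return 0;
}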

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 71 ++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 47 insertions(+), 24 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 61158ac..b0c7c8c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -460,39 +460,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
int error;
+ int i, nr;

VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

+ /* memory cgroup controller handles thp pages on its side */
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
- goto out;
-
- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
- if (error == 0) {
- page_cache_get(page);
- page->mapping = mapping;
- page->index = offset;
+ return error;

- spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (likely(!error)) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- spin_unlock_irq(&mapping->tree_lock);
- trace_mm_filemap_add_to_page_cache(page);
- } else {
- page->mapping = NULL;
- /* Leave page->index set: truncation relies upon it */
- spin_unlock_irq(&mapping->tree_lock);
- mem_cgroup_uncharge_cache_page(page);
- page_cache_release(page);
- }
- radix_tree_preload_end();
- } else
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
+ BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+ nr = hpage_nr_pages(page);
+ } else {
+ BUG_ON(PageTransHuge(page));
+ nr = 1;
+ }
+ error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
+ if (error) {
mem_cgroup_uncharge_cache_page(page);
-out:
+ return error;
+ }
+
+ spin_lock_irq(&mapping->tree_lock);
+ for (i = 0; i < nr; i++) {
+ page_cache_get(page + i);
+ page[i].index = offset + i;
+ page[i].mapping = mapping;
+ error = radix_tree_insert(&mapping->page_tree,
+ offset + i, page + i);
+ if (error)
+ goto err;
+ }
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ if (PageTransHuge(page))
+ __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+ mapping->nrpages += nr;
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ trace_mm_filemap_add_to_page_cache(page);
+ return 0;
+err:
+ if (i != 0)
+ error = -ENOSPC; /* no space for a huge page */
+ page_cache_release(page + i);
+ page[i].mapping = NULL;
+ for (i--; i >= 0; i--) {
+ /* Leave page->index set: truncation relies upon it */
+ page[i].mapping = NULL;
+ radix_tree_delete(&mapping->page_tree, offset + i);
+ page_cache_release(page + i);
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ mem_cgroup_uncharge_cache_page(page);
return error;
}
EXPORT_SYMBOL(add_to_page_cache_locked);
--
1.7.10.4

2013-05-12 01:30:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 08/39] thp: compile-time and sysfs knob for thp pagecache

From: "Kirill A. Shutemov" <[email protected]>

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for X86_64.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 7 +++++++
mm/Kconfig | 10 ++++++++++
mm/huge_memory.c | 19 +++++++++++++++++++
3 files changed, 36 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6b4c9b2..88b44e2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+ TRANSPARENT_HUGEPAGE_PAGECACHE,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
#ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -240,4 +241,10 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str

#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+static inline bool transparent_hugepage_pagecache(void)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+ return 0;
+ return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index e742d06..3a271b7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -420,6 +420,16 @@ choice
benefit.
endchoice

+config TRANSPARENT_HUGEPAGE_PAGECACHE
+ bool "Transparent Hugepage Support for page cache"
+ depends on X86_64 && TRANSPARENT_HUGEPAGE
+ default y
+ help
+ Enabling the option adds support for huge pages in file-backed
+ mappings. It requires transparent hugepage support on the
+ filesystem side. For now, the only filesystem which supports
+ huge pages is ramfs.
+
config CROSS_MEMORY_ATTACH
bool "Cross Memory Support"
depends on MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b39fa01..bd8ef7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -42,6 +42,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
#endif
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+ (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)|
+#endif
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);

/* default scan 8*512 pte (or vmas) every 30 second */
@@ -357,6 +360,21 @@ static ssize_t defrag_store(struct kobject *kobj,
static struct kobj_attribute defrag_attr =
__ATTR(defrag, 0644, defrag_show, defrag_store);

+static ssize_t page_cache_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static ssize_t page_cache_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ return single_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static struct kobj_attribute page_cache_attr =
+ __ATTR(page_cache, 0644, page_cache_show, page_cache_store);
+
static ssize_t use_zero_page_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -392,6 +410,7 @@ static struct kobj_attribute debug_cow_attr =
static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
+ &page_cache_attr.attr,
&use_zero_page_attr.attr,
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
--
1.7.10.4

2013-05-12 01:30:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 10/39] thp: account anon transparent huge pages into NR_ANON_PAGES

From: "Kirill A. Shutemov" <[email protected]>

We use NR_ANON_PAGES as the basis for reporting AnonPages to userspace.
There's not much sense in excluding transparent huge pages from that count,
only to add them back in when printing to the user.

Let's account transparent huge pages in NR_ANON_PAGES in the first place.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/base/node.c | 6 ------
fs/proc/meminfo.c | 6 ------
mm/huge_memory.c | 1 -
mm/rmap.c | 18 +++++++++---------
4 files changed, 9 insertions(+), 22 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 7616a77..bc9f43b 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -125,13 +125,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_WRITEBACK)),
nid, K(node_page_state(nid, NR_FILE_PAGES)),
nid, K(node_page_state(nid, NR_FILE_MAPPED)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_ANON_PAGES)
- + node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR),
-#else
nid, K(node_page_state(nid, NR_ANON_PAGES)),
-#endif
nid, K(node_page_state(nid, NR_SHMEM)),
nid, node_page_state(nid, NR_KERNEL_STACK) *
THREAD_SIZE / 1024,
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 5aa847a..59d85d6 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -132,13 +132,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(i.freeswap),
K(global_page_state(NR_FILE_DIRTY)),
K(global_page_state(NR_WRITEBACK)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- K(global_page_state(NR_ANON_PAGES)
- + global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR),
-#else
K(global_page_state(NR_ANON_PAGES)),
-#endif
K(global_page_state(NR_FILE_MAPPED)),
K(global_page_state(NR_SHMEM)),
K(global_page_state(NR_SLAB_RECLAIMABLE) +
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bd8ef7f..ed31e90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1672,7 +1672,6 @@ static void __split_huge_page_refcount(struct page *page,
BUG_ON(atomic_read(&page->_count) <= 0);

__mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
- __mod_zone_page_state(zone, NR_ANON_PAGES, HPAGE_PMD_NR);

ClearPageCompound(page);
compound_unlock(page);
diff --git a/mm/rmap.c b/mm/rmap.c
index 6280da8..6abf387 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1055,11 +1055,11 @@ void do_page_add_anon_rmap(struct page *page,
{
int first = atomic_inc_and_test(&page->_mapcount);
if (first) {
- if (!PageTransHuge(page))
- __inc_zone_page_state(page, NR_ANON_PAGES);
- else
+ if (PageTransHuge(page))
__inc_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ hpage_nr_pages(page));
}
if (unlikely(PageKsm(page)))
return;
@@ -1088,10 +1088,10 @@ void page_add_new_anon_rmap(struct page *page,
VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end);
SetPageSwapBacked(page);
atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */
- if (!PageTransHuge(page))
- __inc_zone_page_state(page, NR_ANON_PAGES);
- else
+ if (PageTransHuge(page))
__inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ hpage_nr_pages(page));
__page_set_anon_rmap(page, vma, address, 1);
if (!mlocked_vma_newpage(vma, page))
lru_cache_add_lru(page, LRU_ACTIVE_ANON);
@@ -1150,11 +1150,11 @@ void page_remove_rmap(struct page *page)
goto out;
if (anon) {
mem_cgroup_uncharge_page(page);
- if (!PageTransHuge(page))
- __dec_zone_page_state(page, NR_ANON_PAGES);
- else
+ if (PageTransHuge(page))
__dec_zone_page_state(page,
NR_ANON_TRANSPARENT_HUGEPAGES);
+ __mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
+ hpage_nr_pages(page));
} else {
__dec_zone_page_state(page, NR_FILE_MAPPED);
mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);
--
1.7.10.4

2013-05-12 01:30:53

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 11/39] thp: represent file thp pages in meminfo and friends

From: "Kirill A. Shutemov" <[email protected]>

The patch adds a new zone stat to count file-backed transparent huge pages
and adjusts the related places.

For now we don't count mapped or dirty file thp pages separately.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
drivers/base/node.c | 4 ++++
fs/proc/meminfo.c | 3 +++
include/linux/mmzone.h | 1 +
mm/vmstat.c | 1 +
4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43b..de261f5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d FileHugePages: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
, nid,
K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
+ , nid,
+ K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6..a62952c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "FileHugePages: %8lu kB\n"
#endif
,
K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR)
+ ,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
#endif
);

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 72e1cb5..33fd258 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
NUMA_OTHER, /* allocation from other node */
#endif
NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_FILE_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7a35116..7945285 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -738,6 +738,7 @@ const char * const vmstat_text[] = {
"numa_other",
#endif
"nr_anon_transparent_hugepages",
+ "nr_file_transparent_hugepages",
"nr_free_cma",
"nr_dirty_threshold",
"nr_dirty_background_threshold",
--
1.7.10.4

2013-05-12 01:30:52

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 06/39] thp, mm: avoid PageUnevictable on active/inactive lru lists

From: "Kirill A. Shutemov" <[email protected]>

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
invalid opcode: 0000 [#1] SMP
CPU 0
Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
RIP: 0010:[<ffffffff81110478>] [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP: 0000:ffff8800796d9b28 EFLAGS: 00010082
RAX: 00000000ffffffea RBX: 0000000000000012 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea0001de8040
RBP: ffff8800796d9b88 R08: ffff8800796d9df0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
R13: ffffea0001de8060 R14: ffffffff818818e8 R15: ffff8800796d9bf8
FS: 0000000000000000(0000) GS:ffff88007a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1bfc108000 CR3: 000000000180b000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 293, threadinfo ffff8800796d8000, task ffff880079e0a6e0)
Stack:
ffff8800796d9b48 ffffffff81881880 ffff8800796d9df0 ffff8800796d9be0
0000000000000002 000000000000001f ffff8800796d9b88 ffffffff818818c8
ffffffff81881480 ffff8800796d9dc0 0000000000000002 000000000000001f
Call Trace:
[<ffffffff81111e98>] shrink_inactive_list+0x108/0x4a0
[<ffffffff8109ce3d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8107b8bf>] ? local_clock+0x4f/0x60
[<ffffffff8110ff5d>] ? shrink_slab+0x1fd/0x4c0
[<ffffffff811125a1>] shrink_zone+0x371/0x610
[<ffffffff8110ff75>] ? shrink_slab+0x215/0x4c0
[<ffffffff81112dfc>] kswapd+0x5bc/0xb60
[<ffffffff81112840>] ? shrink_zone+0x610/0x610
[<ffffffff81066676>] kthread+0xd6/0xe0
[<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
[<ffffffff814fed6c>] ret_from_fork+0x7c/0xb0
[<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
Code: 1f 40 00 49 8b 45 08 49 8b 75 00 48 89 46 08 48 89 30 49 8b 06 4c 89 68 08 49 89 45 00 4d 89 75 08 4d 89 2e eb 9c 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 31 db 45 31 e4 eb 9b 0f 0b 0f 0b 65 48
RIP [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP <ffff8800796d9b28>

For lru_add_page_tail(), it means we should not set PageUnevictable()
on tail pages unless we're sure the page will go to LRU_UNEVICTABLE.
Let's just copy PG_active and PG_unevictable from the head page in
__split_huge_page_refcount(); it will simplify lru_add_page_tail().

This also fixes one more bug in lru_add_page_tail():
if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nobody bothers to sync the
active/inactive state of page_tail with page. So we can end up with an
inactive page on the active lru.
The patch fixes that as well, since we copy PG_active from the head page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 4 +++-
mm/swap.c | 20 ++------------------
2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 03a89a2..b39fa01 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1612,7 +1612,9 @@ static void __split_huge_page_refcount(struct page *page,
((1L << PG_referenced) |
(1L << PG_swapbacked) |
(1L << PG_mlocked) |
- (1L << PG_uptodate)));
+ (1L << PG_uptodate) |
+ (1L << PG_active) |
+ (1L << PG_unevictable)));
page_tail->flags |= (1L << PG_dirty);

/* clear PageTail before overwriting first_page */
diff --git a/mm/swap.c b/mm/swap.c
index acd40bf..9b0a64b 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -739,8 +739,6 @@ EXPORT_SYMBOL(__pagevec_release);
void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- int uninitialized_var(active);
- enum lru_list lru;
const int file = 0;

VM_BUG_ON(!PageHead(page));
@@ -752,20 +750,6 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
if (!list)
SetPageLRU(page_tail);

- if (page_evictable(page_tail)) {
- if (PageActive(page)) {
- SetPageActive(page_tail);
- active = 1;
- lru = LRU_ACTIVE_ANON;
- } else {
- active = 0;
- lru = LRU_INACTIVE_ANON;
- }
- } else {
- SetPageUnevictable(page_tail);
- lru = LRU_UNEVICTABLE;
- }
-
if (likely(PageLRU(page)))
list_add_tail(&page_tail->lru, &page->lru);
else if (list) {
@@ -781,13 +765,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
* Use the standard add function to put page_tail on the list,
* but then correct its position so they all end up in order.
*/
- add_page_to_lru_list(page_tail, lruvec, lru);
+ add_page_to_lru_list(page_tail, lruvec, page_lru(page_tail));
list_head = page_tail->lru.prev;
list_move_tail(&page_tail->lru, list_head);
}

if (!PageUnevictable(page))
- update_page_reclaim_stat(lruvec, file, active);
+ update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

--
1.7.10.4

2013-05-12 01:30:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b0c7c8c..657ce82 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,9 @@
void __delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ bool thp = PageTransHuge(page) &&
+ IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
+ int nr;

trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -127,13 +130,29 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);

- radix_tree_delete(&mapping->page_tree, page->index);
+ if (thp) {
+ int i;
+
+ nr = HPAGE_CACHE_NR;
+ radix_tree_delete(&mapping->page_tree, page->index);
+ for (i = 1; i < HPAGE_CACHE_NR; i++) {
+ radix_tree_delete(&mapping->page_tree, page->index + i);
+ page[i].mapping = NULL;
+ page_cache_release(page + i);
+ }
+ __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+ } else {
+ BUG_ON(PageTransHuge(page));
+ nr = 1;
+ radix_tree_delete(&mapping->page_tree, page->index);
+ }
+
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
- mapping->nrpages--;
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ mapping->nrpages -= nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
BUG_ON(page_mapped(page));

/*
@@ -144,8 +163,8 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+ add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}
}

--
1.7.10.4

2013-05-12 01:30:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv4 13/39] mm: trace filemap: dump page order

From: "Kirill A. Shutemov" <[email protected]>

Dump the page order to the trace output to be able to distinguish between
small and huge pages in the page cache.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/trace/events/filemap.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__field(struct page *, page)
__field(unsigned long, i_ino)
__field(unsigned long, index)
+ __field(int, order)
__field(dev_t, s_dev)
),

@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__entry->page = page;
__entry->i_ino = page->mapping->host->i_ino;
__entry->index = page->index;
+ __entry->order = compound_order(page);
if (page->mapping->host->i_sb)
__entry->s_dev = page->mapping->host->i_sb->s_dev;
else
__entry->s_dev = page->mapping->host->i_rdev;
),

- TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+ TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
__entry->i_ino,
__entry->page,
page_to_pfn(__entry->page),
- __entry->index << PAGE_SHIFT)
+ __entry->index << PAGE_SHIFT,
+ __entry->order)
);

DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
--
1.7.10.4

2013-05-21 18:22:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 01/39] mm: drop actor argument of do_generic_file_read()

On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> There's only one caller of do_generic_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.

Looks sane. This can and should go up separately from the rest of the set.

Acked-by: Dave Hansen <[email protected]>

2013-05-21 18:25:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 02/39] block: implement add_bdi_stat()

On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> We're going to add/remove a number of page cache entries at once. This
> patch implements add_bdi_stat() which adjusts bdi stats by arbitrary
> amount. It's required for batched page cache manipulations.

Add, but no dec?

I'd also move this closer to where it gets used in the series.
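(For reference, such a helper would presumably mirror the existing
__add_bdi_stat()/inc_bdi_stat() wrappers in include/linux/backing-dev.h;
a rough sketch, not the actual patch:

	static inline void add_bdi_stat(struct backing_dev_info *bdi,
			enum bdi_stat_item item, s64 amount)
	{
		unsigned long flags;

		local_irq_save(flags);
		__add_bdi_stat(bdi, item, amount);
		local_irq_restore(flags);
	}

The "dec" case is then just add_bdi_stat(bdi, item, -1), which is probably
why no separate dec variant is added.)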

2013-05-21 18:37:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 00/39] Transparent huge page cache

On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> It's version 4. You can also use git tree:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git
>
> branch thp/pagecache.
>
> If you want to check changes since v3 you can look at diff between tags
> thp/pagecache/v3 and thp/pagecache/v4-prerebase.

What's the purpose of posting these patches? Do you want them merged?
Or are they useful as they stand, or are they just here so folks can
play with them as you improve them?

> The goal of the project is preparing kernel infrastructure to handle huge
> pages in page cache.
>
> To proof that the proposed changes are functional we enable the feature
> for the most simple file system -- ramfs. ramfs is not that useful by
> itself, but it's good pilot project. It provides information on what
> performance boost we should expect on other files systems.

Do you think folks would use ramfs in practice? Or is this just a toy?
Could this replace some (or all) existing hugetlbfs use, for instance?

2013-05-21 18:58:10

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entires to address_space at once.

^^entries

> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.

Would radix_tree_preload_contig() be a better name, then?

...
> On 64-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
>
> On 32-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
>
> On most machines we will have RADIX_TREE_MAP_SHIFT=6.

Thanks for adding that to the description. The array you're talking
about is just pointers, right?

107-43 = 64. So, we have 64 extra pointers * NR_CPUS, plus 64 extra
radix tree nodes that we will keep around most of the time. On x86_64,
that's 512 bytes plus 64*560 bytes of nodes which is ~35k of memory per CPU.

That's not bad I guess, but I do bet it's something that some folks want
to configure out. Please make sure to call out the actual size cost in
bytes per CPU in future patch postings, at least for the common case
(64-bit non-CONFIG_BASE_SMALL).
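(Spelled out, assuming 8-byte pointers and a ~560-byte struct radix_tree_node:
64 extra pointers * 8 bytes = 512 bytes of per-CPU array, plus up to
64 * 560 = 35840 bytes of preloaded nodes, i.e. roughly 35.5k per CPU.)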

> Since only THP uses batched preload at the , we disable (set max preload
> to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be changed
> in the future.

"at the..." Is there something missing in that sentence?

No major nits, so:

Acked-by: Dave Hansen <[email protected]>

2013-05-21 19:04:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 05/39] memcg, thp: charge huge cache pages

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> mem_cgroup_cache_charge() has check for PageCompound(). The check
> prevents charging huge cache pages.
>
> I don't see a reason why the check is present. Looks like it's just
> legacy (introduced in 52d4b9a memcg: allocate all page_cgroup at boot).

FWIW, that commit introduced two PageCompound() checks. The other one
went away inexplicably in 01b1ae63c22.

Acked-by: Dave Hansen <[email protected]>

2013-05-21 19:18:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 06/39] thp, mm: avoid PageUnevictable on active/inactive lru lists

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
> that have been placed on the LRU lists when first allocated), but these
> pages must not have PageUnevictable set - otherwise shrink_active_list
> goes crazy:
>
> kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
> invalid opcode: 0000 [#1] SMP
> CPU 0
> Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
> RIP: 0010:[<ffffffff81110478>] [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
> RSP: 0000:ffff8800796d9b28 EFLAGS: 00010082'
...

I'd much rather see a code snippet and description of the BUG_ON() than a
register and stack dump. That line number is wrong already. ;)

> For lru_add_page_tail(), it means we should not set PageUnevictable()
> on tail pages unless we're sure the page will go to LRU_UNEVICTABLE.
> Let's just copy PG_active and PG_unevictable from the head page in
> __split_huge_page_refcount(); it will simplify lru_add_page_tail().
>
> This also fixes one more bug in lru_add_page_tail():
> if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
> will go to the same lru as page, but nobody bothers to sync the
> active/inactive state of page_tail with page. So we can end up with an
> inactive page on the active lru.
> The patch fixes that as well, since we copy PG_active from the head page.

This all seems good, and if it fixes a bug, it should really get merged
as it stands. Have you been actually able to trigger that bug in any
way in practice?

Acked-by: Dave Hansen <[email protected]>

2013-05-21 19:28:19

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Returns true if mapping can have huge pages. Just check for __GFP_COMP
> in gfp mask of the mapping for now.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/pagemap.h | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75..28597ec 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -84,6 +84,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> (__force unsigned long)mask;
> }
>
> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> +{
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> + gfp_t gfp_mask = mapping_gfp_mask(m);
> + /* __GFP_COMP is key part of GFP_TRANSHUGE */
> + return !!(gfp_mask & __GFP_COMP) &&
> + transparent_hugepage_pagecache();
> + }
> +
> + return false;
> +}

transparent_hugepage_pagecache() already has the same IS_ENABLED()
check. Is it really necessary to do it again here?

IOW, can you do this?

> +static inline bool mapping_can_have_hugepages(struct address_space
> +{
> + gfp_t gfp_mask = mapping_gfp_mask(m);
if (!transparent_hugepage_pagecache())
return false;
> + /* __GFP_COMP is key part of GFP_TRANSHUGE */
> + return !!(gfp_mask & __GFP_COMP);
> +}
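Written out in full, that suggestion would look roughly like this (a sketch
only, untested):

	static inline bool mapping_can_have_hugepages(struct address_space *m)
	{
		gfp_t gfp_mask = mapping_gfp_mask(m);

		if (!transparent_hugepage_pagecache())
			return false;
		/* __GFP_COMP is key part of GFP_TRANSHUGE */
		return !!(gfp_mask & __GFP_COMP);
	}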

I know we talked about this in the past, but I've forgotten already.
Why is this checking for __GFP_COMP instead of GFP_TRANSHUGE?

Please flesh out the comment.

Also, what happens if "transparent_hugepage_flags &
(1<<TRANSPARENT_HUGEPAGE_PAGECACHE)" becomes false at runtime and you
have some already-instantiated huge page cache mappings around? Will
things like mapping_align_mask() break?

2013-05-21 19:32:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 10/39] thp: account anon transparent huge pages into NR_ANON_PAGES

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> We use NR_ANON_PAGES as the basis for reporting AnonPages to userspace.
> There's not much sense in excluding transparent huge pages from that count,
> only to add them back in when printing to the user.
>
> Let's account transparent huge pages in NR_ANON_PAGES in the first place.

This is another one that needs to be pretty carefully considered
_independently_ of the rest of this set. It also has potential
user-visible changes, so it would be nice to have a blurb in the patch
description if you've thought about this, any why you think it's OK.

But, it still makes solid sense to me, and simplifies the code.

Acked-by: Dave Hansen <[email protected]>

2013-05-21 19:34:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 11/39] thp: represent file thp pages in meminfo and friends

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The patch adds a new zone stat to count file-backed transparent huge pages
> and adjusts the related places.
>
> For now we don't count mapped or dirty file thp pages separately.

You need to call out that this depends on the previous "NR_ANON_PAGES"
behaviour change to make sense. Otherwise,

Acked-by: Dave Hansen <[email protected]>

2013-05-21 19:35:55

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 13/39] mm: trace filemap: dump page order

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Dump the page order to the trace output to be able to distinguish between
> small and huge pages in the page cache.

Acked-by: Dave Hansen <[email protected]>

2013-05-21 19:59:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
> head page for the specified index and HPAGE_CACHE_NR-1 tail pages for the
> following indexes.

The really nice way to do these patches is to refactor them first, with no
behavior change, in one patch, then introduce the new support in the
second one.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 61158ac..b0c7c8c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -460,39 +460,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> pgoff_t offset, gfp_t gfp_mask)
> {
> int error;
> + int i, nr;
>
> VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(PageSwapBacked(page));
>
> + /* memory cgroup controller handles thp pages on its side */
> error = mem_cgroup_cache_charge(page, current->mm,
> gfp_mask & GFP_RECLAIM_MASK);
> if (error)
> - goto out;
> -
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> - if (error == 0) {
> - page_cache_get(page);
> - page->mapping = mapping;
> - page->index = offset;
> + return error;
>
> - spin_lock_irq(&mapping->tree_lock);
> - error = radix_tree_insert(&mapping->page_tree, offset, page);
> - if (likely(!error)) {
> - mapping->nrpages++;
> - __inc_zone_page_state(page, NR_FILE_PAGES);
> - spin_unlock_irq(&mapping->tree_lock);
> - trace_mm_filemap_add_to_page_cache(page);
> - } else {
> - page->mapping = NULL;
> - /* Leave page->index set: truncation relies upon it */
> - spin_unlock_irq(&mapping->tree_lock);
> - mem_cgroup_uncharge_cache_page(page);
> - page_cache_release(page);
> - }
> - radix_tree_preload_end();
> - } else
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> + BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> + nr = hpage_nr_pages(page);
> + } else {
> + BUG_ON(PageTransHuge(page));
> + nr = 1;
> + }

Why can't this just be

nr = hpage_nr_pages(page);

Are you trying to optimize for the THP=y, but THP-pagecache=n case?

> + error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
> + if (error) {
> mem_cgroup_uncharge_cache_page(page);
> -out:
> + return error;
> + }
> +
> + spin_lock_irq(&mapping->tree_lock);
> + for (i = 0; i < nr; i++) {
> + page_cache_get(page + i);
> + page[i].index = offset + i;
> + page[i].mapping = mapping;
> + error = radix_tree_insert(&mapping->page_tree,
> + offset + i, page + i);
> + if (error)
> + goto err;

I know it's not a super-common thing in the kernel, but could you call
this "insert_err" or something?

> + }
> + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> + if (PageTransHuge(page))
> + __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> + mapping->nrpages += nr;
> + spin_unlock_irq(&mapping->tree_lock);
> + radix_tree_preload_end();
> + trace_mm_filemap_add_to_page_cache(page);
> + return 0;
> +err:
> + if (i != 0)
> + error = -ENOSPC; /* no space for a huge page */
> + page_cache_release(page + i);
> + page[i].mapping = NULL;

I guess it's a slight behaviour change (I think it's harmless) but if
you delay doing the page_cache_get() and page[i].mapping= until after
the radix tree insertion, you can avoid these two lines.
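Roughly, that reordering would look like this (illustration only, not the
actual patch):

	for (i = 0; i < nr; i++) {
		error = radix_tree_insert(&mapping->page_tree,
				offset + i, page + i);
		if (error)
			goto err;
		/* only take the reference and link the subpage once
		 * it is actually in the tree */
		page_cache_get(page + i);
		page[i].index = offset + i;
		page[i].mapping = mapping;
	}

Then the error path only has to unwind entries that were fully set up.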

> + for (i--; i >= 0; i--) {

I kinda glossed over that initial "i--". It might be worth a quick
comment to call it out.

> + /* Leave page->index set: truncation relies upon it */
> + page[i].mapping = NULL;
> + radix_tree_delete(&mapping->page_tree, offset + i);
> + page_cache_release(page + i);
> + }
> + spin_unlock_irq(&mapping->tree_lock);
> + radix_tree_preload_end();
> + mem_cgroup_uncharge_cache_page(page);
> return error;
> }

FWIW, I think you can move the radix_tree_preload_end() up a bit. I
guess it won't make any practical difference since you're holding a
spinlock, but it at least makes the point that you're not depending on
it any more.

I'm also trying to figure out how and when you'd actually have to unroll
a partial-huge-page worth of radix_tree_insert(). In the small-page
case, you can collide with another guy inserting in to the page cache.
But, can that happen in the _middle_ of a THP?

Despite my nits, the code still looks correct here, so:

Acked-by: Dave Hansen <[email protected]>

2013-05-21 20:14:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
> time.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/filemap.c | 31 +++++++++++++++++++++++++------
> 1 file changed, 25 insertions(+), 6 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b0c7c8c..657ce82 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -115,6 +115,9 @@
> void __delete_from_page_cache(struct page *page)
> {
> struct address_space *mapping = page->mapping;
> + bool thp = PageTransHuge(page) &&
> + IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
> + int nr;

Is that check for the config option really necessary? How would we get
a page with PageTransHuge() set without it being enabled?

> trace_mm_filemap_delete_from_page_cache(page);
> /*
> @@ -127,13 +130,29 @@ void __delete_from_page_cache(struct page *page)
> else
> cleancache_invalidate_page(mapping, page);
>
> - radix_tree_delete(&mapping->page_tree, page->index);
> + if (thp) {
> + int i;
> +
> + nr = HPAGE_CACHE_NR;
> + radix_tree_delete(&mapping->page_tree, page->index);
> + for (i = 1; i < HPAGE_CACHE_NR; i++) {
> + radix_tree_delete(&mapping->page_tree, page->index + i);
> + page[i].mapping = NULL;
> + page_cache_release(page + i);
> + }
> + __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> + } else {
> + BUG_ON(PageTransHuge(page));
> + nr = 1;
> + radix_tree_delete(&mapping->page_tree, page->index);
> + }
> page->mapping = NULL;

I like to rewrite your code. :)

nr = hpage_nr_pages(page);
for (i = 0; i < nr; i++) {
page[i].mapping = NULL;
radix_tree_delete(&mapping->page_tree, page->index + i);
/* tail pages: */
if (i)
page_cache_release(page + i);
}
if (thp)
__dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);

I like this because it explicitly calls out the logic that tail pages
are different from head pages. We handle their reference counts
differently.

Which reminds me... Why do we handle their reference counts differently? :)

It seems like we could easily put a for loop in delete_from_page_cache()
that will release their reference counts along with the head page.
Wouldn't that make the code less special-cased for tail pages?

> /* Leave page->index set: truncation lookup relies upon it */
> - mapping->nrpages--;
> - __dec_zone_page_state(page, NR_FILE_PAGES);
> + mapping->nrpages -= nr;
> + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
> if (PageSwapBacked(page))
> - __dec_zone_page_state(page, NR_SHMEM);
> + __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
> BUG_ON(page_mapped(page));

Man, we suck:

__dec_zone_page_state()
and
__mod_zone_page_state()

take a differently-typed first argument. <sigh>

Would there be any good to making __dec_zone_page_state() check to see
if the page we passed in _is_ a compound page, and adjusting its
behaviour accordingly?
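One way to express that idea, as a hypothetical helper rather than a change
to __dec_zone_page_state() itself:

	static inline void __dec_zone_page_state_nr(struct page *page,
			enum zone_stat_item item)
	{
		/* adjust by the number of base pages this (possibly
		 * compound) page covers */
		__mod_zone_page_state(page_zone(page), item,
				-hpage_nr_pages(page));
	}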

> /*
> @@ -144,8 +163,8 @@ void __delete_from_page_cache(struct page *page)
> * having removed the page entirely.
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> - dec_zone_page_state(page, NR_FILE_DIRTY);
> - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> + mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
> + add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
> }
> }

Ahh, I see now why you didn't need a dec_bdi_stat(). Oh well...

2013-05-21 20:17:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> replace_page_cache_page() is only used by FUSE. It's unlikely that we
> will support THP in FUSE page cache anytime soon.
>
> Let's postpone implementation of THP handling in replace_page_cache_page()
> until anyone actually uses it.
...
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 657ce82..3a03426 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -428,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> {
> int error;
>
> + VM_BUG_ON(PageTransHuge(old));
> + VM_BUG_ON(PageTransHuge(new));
> VM_BUG_ON(!PageLocked(old));
> VM_BUG_ON(!PageLocked(new));
> VM_BUG_ON(new->mapping);

The code calling replace_page_cache_page() has a bunch of fallback and
error returning code. It seems a little bit silly to bring the whole
machine down when you could just WARN_ONCE() and return an error code
like fuse already does:

> /*
> * This is a new and locked page, it shouldn't be mapped or
> * have any special flags on it
> */
> if (WARN_ON(page_mapped(oldpage)))
> goto out_fallback_unlock;
> if (WARN_ON(page_has_private(oldpage)))
> goto out_fallback_unlock;
> if (WARN_ON(PageDirty(oldpage) || PageWriteback(oldpage)))
> goto out_fallback_unlock;
> if (WARN_ON(PageMlocked(oldpage)))
> goto out_fallback_unlock;
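Roughly, the check in replace_page_cache_page() could then become something
like this (a sketch; error code picked arbitrarily):

	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
		      "%s does not support huge pages\n", __func__))
		return -EINVAL;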

2013-05-21 20:18:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 16/39] thp, mm: locking tail page is a bug

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Locking head page means locking entire compound page.
> If we try to lock tail page, something went wrong.

Have you actually triggered this in your development?

This is another one that can theoretically get merged separately.

Acked-by: Dave Hansen <[email protected]>

2013-05-21 20:49:55

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 17/39] thp, mm: handle tail pages in page_cache_get_speculative()

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> For tail page we call __get_page_tail(). It has the same semantics, but
> for tail page.

page_cache_get_speculative() has a ~50-line comment above it with lots
of scariness about grace periods and RCU. A two line comment saying
that the semantics are the same doesn't make me feel great that you've
done your homework here.

Are there any performance implications here? __get_page_tail() says:
"It implements the slow path of get_page().".
page_cache_get_speculative() seems awfully speculative which would make
me think that it is part of a _fast_ path.

> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 28597ec..2e86251 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -161,6 +161,9 @@ static inline int page_cache_get_speculative(struct page *page)
> {
> VM_BUG_ON(in_interrupt());
>
> + if (unlikely(PageTail(page)))
> + return __get_page_tail(page);
> +
> #ifdef CONFIG_TINY_RCU
> # ifdef CONFIG_PREEMPT_COUNT
> VM_BUG_ON(!in_atomic());
> @@ -187,7 +190,6 @@ static inline int page_cache_get_speculative(struct page *page)
> return 0;
> }
> #endif
> - VM_BUG_ON(PageTail(page));
>
> return 1;
> }

FWIW, that VM_BUG_ON() should theoretically be able to stay there since
it's unreachable now that you've short-circuited the function for
PageTail() pages.

2013-05-21 20:54:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 18/39] thp, mm: add event counters for huge page alloc on write to a file

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index d4b7a18..584c71c 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> THP_FAULT_FALLBACK,
> THP_COLLAPSE_ALLOC,
> THP_COLLAPSE_ALLOC_FAILED,
> + THP_WRITE_ALLOC,
> + THP_WRITE_ALLOC_FAILED,
> THP_SPLIT,
> THP_ZERO_PAGE_ALLOC,
> THP_ZERO_PAGE_ALLOC_FAILED,
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 7945285..df8dcda 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -821,6 +821,8 @@ const char * const vmstat_text[] = {
> "thp_fault_fallback",
> "thp_collapse_alloc",
> "thp_collapse_alloc_failed",
> + "thp_write_alloc",
> + "thp_write_alloc_failed",
> "thp_split",
> "thp_zero_page_alloc",
> "thp_zero_page_alloc_failed",

I guess these new counters are _consistent_ with all the others. But,
why do we need a separate "_failed" for each one of these? While I'm
nitpicking, does "thp_write_alloc" mean allocs or _successful_ allocs?
I had to look at the code to tell.

I think it's probably safe to combine this patch with the next one.
Breaking them apart just makes it harder to review. If _anything_,
this, plus the use of the counters should go in to a different patch
from the true code changes in "mm: allocate huge pages in
grab_cache_page_write_begin()".

2013-05-21 21:14:34

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 19/39] thp, mm: allocate huge pages in grab_cache_page_write_begin()

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Try to allocate huge page if flags has AOP_FLAG_TRANSHUGE.

Why do we need this flag? When might we set it, and when would we not
set it? What kinds of callers need to check for and act on it?

Some of this, at least, needs to make it in to the comment by the #define.

> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -194,6 +194,9 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
> #define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
> #define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
>
> +#define THP_WRITE_ALLOC ({ BUILD_BUG(); 0; })
> +#define THP_WRITE_ALLOC_FAILED ({ BUILD_BUG(); 0; })

Doesn't this belong in the previous patch?

> #define hpage_nr_pages(x) 1
>
> #define transparent_hugepage_enabled(__vma) 0
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2e86251..8feeecc 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -270,8 +270,15 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
> unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
> int tag, unsigned int nr_pages, struct page **pages);
>
> -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> pgoff_t index, unsigned flags);
> +static inline struct page *grab_cache_page_write_begin(
> + struct address_space *mapping, pgoff_t index, unsigned flags)
> +{
> + if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
> + return NULL;
> + return __grab_cache_page_write_begin(mapping, index, flags);
> +}

OK, so there's some of the behavior.

Could you also call out why you refactored this code? It seems like
you're trying to optimize for the case where AOP_FLAG_TRANSHUGE isn't
set and where the compiler knows that it isn't set.

Could you talk a little bit about the cases that you're thinking of here?

> /*
> * Returns locked page at given index in given cache, creating it if needed.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 9ea46a4..e086ef0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2309,25 +2309,44 @@ EXPORT_SYMBOL(generic_file_direct_write);
> * Find or create a page at the given pagecache position. Return the locked
> * page. This function is specifically for buffered writes.
> */
> -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> - pgoff_t index, unsigned flags)
> +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> + pgoff_t index, unsigned flags)
> {
> int status;
> gfp_t gfp_mask;
> struct page *page;
> gfp_t gfp_notmask = 0;
> + bool thp = (flags & AOP_FLAG_TRANSHUGE) &&
> + IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);

Instead of 'thp', how about 'must_use_thp'? The flag seems to be a
pretty strong edict rather than a hint, so it should be reflected in the
variables derived from it.

"IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)" has also popped up
enough times in the code that it's probably time to start thinking about
shortening it up. It's a wee bit verbose.
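Something as trivial as a named wrapper would already help readability (name
made up here):

	static inline bool thp_pagecache_enabled(void)
	{
		return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
	}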

> gfp_mask = mapping_gfp_mask(mapping);
> if (mapping_cap_account_dirty(mapping))
> gfp_mask |= __GFP_WRITE;
> if (flags & AOP_FLAG_NOFS)
> gfp_notmask = __GFP_FS;
> + if (thp) {
> + BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
> + BUG_ON(!(gfp_mask & __GFP_COMP));
> + }
> repeat:
> page = find_lock_page(mapping, index);
> - if (page)
> + if (page) {
> + if (thp && !PageTransHuge(page)) {
> + unlock_page(page);
> + page_cache_release(page);
> + return NULL;
> + }
> goto found;
> + }
>
> - page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
> + if (thp) {
> + page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
> + if (page)
> + count_vm_event(THP_WRITE_ALLOC);
> + else
> + count_vm_event(THP_WRITE_ALLOC_FAILED);
> + } else
> + page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
> if (!page)
> return NULL;
> status = add_to_page_cache_lru(page, mapping, index,
> @@ -2342,7 +2361,7 @@ found:
> wait_for_stable_page(page);
> return page;
> }
> -EXPORT_SYMBOL(grab_cache_page_write_begin);
> +EXPORT_SYMBOL(__grab_cache_page_write_begin);
>
> static ssize_t generic_perform_write(struct file *file,
> struct iov_iter *i, loff_t pos)
>

2013-05-21 21:28:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> + if (PageTransHuge(page))
> + offset = pos & ~HPAGE_PMD_MASK;
> +
> pagefault_disable();
> - copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> + copied = iov_iter_copy_from_user_atomic(
> + page + (offset >> PAGE_CACHE_SHIFT),
> + i, offset & ~PAGE_CACHE_MASK, bytes);
> pagefault_enable();
> flush_dcache_page(page);

I think there's enough voodoo in there to warrant a comment or adding
some temporary variables. There are three things going on that you want
to convey:

1. Offset is normally <PAGE_SIZE, but you make it <HPAGE_PMD_SIZE if
you are dealing with a huge page
2. (offset >> PAGE_CACHE_SHIFT) is always 0 for small pages since
offset < PAGE_SIZE
3. "offset & ~PAGE_CACHE_MASK" does nothing for small-page offsets, but
it turns a large-page offset back in to a small-page-offset.

I think you can do it with something like this:

int subpage_nr = 0;
off_t smallpage_offset = offset;
if (PageTransHuge(page)) {
// we transform 'offset' to be offset in to the huge
// page instead of inside the PAGE_SIZE page
offset = pos & ~HPAGE_PMD_MASK;
subpage_nr = (offset >> PAGE_CACHE_SHIFT);
}

> + copied = iov_iter_copy_from_user_atomic(
> + page + subpage_nr,
> + i, smallpage_offset, bytes);


> @@ -2437,6 +2453,7 @@ again:
> * because not all segments in the iov can be copied at
> * once without a pagefault.
> */
> + offset = pos & ~PAGE_CACHE_MASK;

Urg, and now it's *BACK* in to a small-page offset?

This means that 'offset' has two _different_ meanings and it morphs
between them during the function a couple of times. That seems very
error-prone to me.

> bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
> iov_iter_single_seg_count(i));
> goto again;
>

2013-05-21 21:49:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 21/39] thp, libfs: initial support of thp in simple_read/write_begin/write_end

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> For now we try to grab a huge cache page if gfp_mask has __GFP_COMP.
> It's probably too weak a condition and needs to be reworked later.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/libfs.c | 50 ++++++++++++++++++++++++++++++++++++-----------
> include/linux/pagemap.h | 8 ++++++++
> 2 files changed, 47 insertions(+), 11 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 916da8c..ce807fe 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -383,7 +383,7 @@ EXPORT_SYMBOL(simple_setattr);
>
> int simple_readpage(struct file *file, struct page *page)
> {
> - clear_highpage(page);
> + clear_pagecache_page(page);
> flush_dcache_page(page);
> SetPageUptodate(page);
> unlock_page(page);
> @@ -394,21 +394,44 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> - struct page *page;
> + struct page *page = NULL;
> pgoff_t index;

I know ramfs uses simple_write_begin(), but it's not the only one. I
think you probably want to create a new ->write_begin() function just
for ramfs rather than modifying this one.

The optimization that you just put in a few patches ago:

>> +static inline struct page *grab_cache_page_write_begin(
>> +{
>> + if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
>> + return NULL;
>> + return __grab_cache_page_write_begin(mapping, index, flags);


is now worthless for any user of simple_readpage().

> index = pos >> PAGE_CACHE_SHIFT;
>
> - page = grab_cache_page_write_begin(mapping, index, flags);
> + /* XXX: too weak condition? */

Why would it be too weak?

> + if (mapping_can_have_hugepages(mapping)) {
> + page = grab_cache_page_write_begin(mapping,
> + index & ~HPAGE_CACHE_INDEX_MASK,
> + flags | AOP_FLAG_TRANSHUGE);
> + /* fallback to small page */
> + if (!page) {
> + unsigned long offset;
> + offset = pos & ~PAGE_CACHE_MASK;
> + len = min_t(unsigned long,
> + len, PAGE_CACHE_SIZE - offset);
> + }

Why does this have to muck with 'len'? It doesn't appear to be undoing
anything from earlier in the function. What is it fixing up?

> + BUG_ON(page && !PageTransHuge(page));
> + }

So, those semantics for AOP_FLAG_TRANSHUGE are actually pretty strong.
They mean that you can only return a transparent pagecache page, but you
better not return a small page.

Would it have been possible for a huge page to get returned from
grab_cache_page_write_begin(), but had it split up between there and the
BUG_ON()?

Which reminds me... under what circumstances _do_ we split these huge
pages? How are those circumstances different from the anonymous ones?

> + if (!page)
> + page = grab_cache_page_write_begin(mapping, index, flags);
> if (!page)
> return -ENOMEM;
> -
> *pagep = page;
>
> - if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
> - unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> - zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
> + if (!PageUptodate(page)) {
> + unsigned from;
> +
> + if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
> + from = pos & ~HPAGE_PMD_MASK;
> + zero_huge_user_segment(page, 0, from);
> + zero_huge_user_segment(page,
> + from + len, HPAGE_PMD_SIZE);
> + } else if (len != PAGE_CACHE_SIZE) {
> + from = pos & ~PAGE_CACHE_MASK;
> + zero_user_segments(page, 0, from,
> + from + len, PAGE_CACHE_SIZE);
> + }
> }
> return 0;
> }
> @@ -443,9 +466,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,
>
> /* zero the stale part of the page if we did a short copy */
> if (copied < len) {
> - unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> - zero_user(page, from + copied, len - copied);
> + unsigned from;
> + if (PageTransHuge(page)) {
> + from = pos & ~HPAGE_PMD_MASK;
> + zero_huge_user(page, from + copied, len - copied);
> + } else {
> + from = pos & ~PAGE_CACHE_MASK;
> + zero_user(page, from + copied, len - copied);
> + }
> }

When I see stuff going in to the simple_* functions, I fear that this
code will end up getting copied in to each and every one of the
filesystems that implement these on their own.

I guess this works for now, but I'm worried that the next fs is just
going to copy-and-paste these. Guess I'll yell at them when they do it. :)

2013-05-21 22:05:37

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Since we're going to have huge pages backed by files,
> wait_split_huge_page() has to serialize not only over anon_vma_lock,
> but over i_mmap_mutex too.
...
> -#define wait_split_huge_page(__anon_vma, __pmd) \
> +#define wait_split_huge_page(__vma, __pmd) \
> do { \
> pmd_t *____pmd = (__pmd); \
> - anon_vma_lock_write(__anon_vma); \
> - anon_vma_unlock_write(__anon_vma); \
> + struct address_space *__mapping = \
> + vma->vm_file->f_mapping; \
> + struct anon_vma *__anon_vma = (__vma)->anon_vma; \
> + if (__mapping) \
> + mutex_lock(&__mapping->i_mmap_mutex); \
> + if (__anon_vma) { \
> + anon_vma_lock_write(__anon_vma); \
> + anon_vma_unlock_write(__anon_vma); \
> + } \
> + if (__mapping) \
> + mutex_unlock(&__mapping->i_mmap_mutex); \
> BUG_ON(pmd_trans_splitting(*____pmd) || \
> pmd_trans_huge(*____pmd)); \
> } while (0)

Kirill, I asked about this patch in the previous series, and you wrote
some very nice, detailed answers to my stupid questions. But, you
didn't add any comments or update the patch description. So, if a
reviewer or anybody looking at the changelog in the future has my same
stupid questions, they're unlikely to find the very nice description
that you wrote up.

I'd highly suggest that you go back through the comments you've received
before and make sure that you both answered the questions, *and* made
sure to cover those questions either in the code or in the patch
descriptions.

Could you also describe the lengths to which you've gone to try and keep
this macro from growing into an even larger abomination. Is it truly
_impossible_ to turn this into a normal function? Or would it simply be
a larger amount of work than you can do right now? What would it take?
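For illustration, written as a function it might look roughly like this,
assuming the header dependencies can be untangled (untested sketch):

	static inline void wait_split_huge_page(struct vm_area_struct *vma,
			pmd_t *pmd)
	{
		struct address_space *mapping =
			vma->vm_file ? vma->vm_file->f_mapping : NULL;
		struct anon_vma *anon_vma = vma->anon_vma;

		if (mapping)
			mutex_lock(&mapping->i_mmap_mutex);
		if (anon_vma) {
			anon_vma_lock_write(anon_vma);
			anon_vma_unlock_write(anon_vma);
		}
		if (mapping)
			mutex_unlock(&mapping->i_mmap_mutex);
		BUG_ON(pmd_trans_splitting(*pmd) || pmd_trans_huge(*pmd));
	}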

2013-05-21 22:39:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 24/39] thp, mm: truncate support for transparent huge page cache

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> If we starting position of truncation is in tail page we have to spilit
> the huge page page first.

That's a very interesting sentence sentence. :)

> We also have to split if end is within the huge page. Otherwise we can
> truncate whole huge page at once.

How about something more like this as a description?

Splitting a huge page is relatively expensive. If at all possible, we
would like to do truncation without first splitting a page. However, if
the truncation request starts or ends in the middle of a huge page, we
have no choice and must split it.

> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/truncate.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index c75b736..0152feb 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (index > end)
> break;
>
> + /* split page if we start from tail page */
> + if (PageTransTail(page))
> + split_huge_page(compound_trans_head(page));

I know it makes no logical difference, but should this be an "else if"?
It would make it clearer to me that PageTransTail() and
PageTransHuge() are mutually exclusive.

> + if (PageTransHuge(page)) {
> + /* split if end is within huge page */
> + if (index == (end & ~HPAGE_CACHE_INDEX_MASK))

How about:

if ((end - index) > HPAGE_CACHE_NR)

That seems a bit more straightforward, to me at least.
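
For reference, my reading of the original check, with HPAGE_CACHE_NR = 512
and HPAGE_CACHE_INDEX_MASK = 511 (my arithmetic, not from the patch):

	end & ~HPAGE_CACHE_INDEX_MASK

is the page cache index of the head page of the huge page that contains
'end', so the original test fires exactly when truncation ends inside
this huge page. E.g. index = 1024, end = 1200: 1200 & ~511 = 1024 ==
index, so we split.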

> + split_huge_page(page);
> + else
> + /* skip tail pages */
> + i += HPAGE_CACHE_NR - 1;
> + }


Hmm.. This is all inside a loop, right?

for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

PAGEVEC_SIZE is only 14 here, so it seems a bit odd to be incrementing i
by 512-1. We'll break out of the pagevec loop, but won't 'index' be set
to the wrong thing on the next iteration of the loop? Did you want to
be incrementing 'index' instead of 'i'?

This is also another case where I wonder about racing split_huge_page()
operations.

> if (!trylock_page(page))
> continue;
> WARN_ON(page->index != index);
> @@ -280,6 +291,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (index > end)
> break;
>
> + if (PageTransHuge(page))
> + split_huge_page(page);
> lock_page(page);
> WARN_ON(page->index != index);
> wait_on_page_writeback(page);

This seems to imply that we would have taken care of the case where we
encountered a tail page in the first pass. Should we put a comment in
to explain that assumption?

2013-05-21 22:43:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 26/39] ramfs: enable transparent huge page cache

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> ramfs is the most simple fs from page cache point of view. Let's start
> transparent huge page cache enabling here.
>
> For now we allocate only non-movable huge page. ramfs pages cannot be
> moved yet.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/ramfs/inode.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> index c24f1e1..54d69c7 100644
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> inode_init_owner(inode, dir, mode);
> inode->i_mapping->a_ops = &ramfs_aops;
> inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + /*
> + * TODO: make ramfs pages movable
> + */
> + mapping_set_gfp_mask(inode->i_mapping,
> + GFP_TRANSHUGE & ~__GFP_MOVABLE);

So, before these patches, ramfs was movable. Now, even on architectures
or configurations that have no chance of using THP-pagecache, ramfs
pages are no longer movable. Right?

That seems unfortunate, and probably not something we want to
intentionally merge in this state.

Worst-case, we should at least make sure the pages remain movable in
configurations where THP-pagecache is unavailable.

2013-05-21 22:56:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> +static inline unsigned long mapping_align_mask(struct address_space *mapping)
> +{
> + if (mapping_can_have_hugepages(mapping))
> + return PAGE_MASK & ~HPAGE_MASK;
> + return get_align_mask();
> +}

get_align_mask() appears to be a bit more complicated to me than just a
plain old mask. Are you sure you don't need to pick up any of its
behavior for the mapping_can_have_hugepages() case?
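
For my own reference, assuming 2MB huge pages on x86-64 the constants
work out to:

	PAGE_MASK               0xfffffffffffff000
	~HPAGE_MASK             0x00000000001fffff
	PAGE_MASK & ~HPAGE_MASK 0x00000000001ff000

i.e. the unmapped-area search is asked for an address with bits 12-20
clear, which gives 2MB alignment but none of the extra logic in
get_align_mask(). Whether any of that extra logic still matters here is
exactly my question.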

2013-05-21 23:20:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Make arch_get_unmapped_area() return unmapped area aligned to HPAGE_MASK
> if the file mapping can have huge pages.

OK, so there are at least four phases of this patch set which are
distinct to me.

1. Prep work that can go upstream now
2. Making the page cache able to hold compound pages
3. Making thp-cache work with ramfs
4. Making mmap() work with thp-cache

(1) needs to go upstream now.

(2) and (3) are related and should go upstream together. There should
be enough performance benefits from this alone to let them get merged.

(4) has a lot of the code complexity, and is certainly required...
eventually. I think you should stop for the _moment_ posting things in
this category and wait until you get the other stuff merged. Go ahead
and keep it in your git tree for toying around with, but don't try to
get it merged until parts 1-3 are in.

2013-05-21 23:23:09

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> It's confusing that mk_huge_pmd() has sematics different from mk_pte()
> or mk_pmd().
>
> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> prototype to match mk_pte().

Was there a motivation to do this beyond adding consistency? Do you use
this later or something?

> @@ -746,7 +745,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> pte_free(mm, pgtable);
> } else {
> pmd_t entry;
> - entry = mk_huge_pmd(page, vma);
> + entry = mk_huge_pmd(page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> page_add_new_anon_rmap(page, vma, haddr);
> set_pmd_at(mm, haddr, pmd, entry);

I'm not the biggest fan since this does add lines of code, but I do
appreciate the consistency it adds, so:

Acked-by: Dave Hansen <[email protected]>

2013-05-21 23:23:28

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> It's confusing that mk_huge_pmd() has sematics different from mk_pte()
> or mk_pmd().
>
> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> prototype to match mk_pte().

Oh, and please stick this in your queue of stuff to go upstream, first.

2013-05-21 23:38:34

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
> to handle fallback path.
>
> Let's consolidate code back by introducing VM_FAULT_FALLBACK return
> code.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/huge_mm.h | 3 ---
> include/linux/mm.h | 3 ++-
> mm/huge_memory.c | 31 +++++--------------------------
> mm/memory.c | 9 ++++++---
> 4 files changed, 13 insertions(+), 33 deletions

Wow, nice diffstat!

This and the previous patch can go in the cleanups pile, no?

> @@ -3788,9 +3788,12 @@ retry:
> if (!pmd)
> return VM_FAULT_OOM;
> if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> + int ret = 0;
> if (!vma->vm_ops)
> - return do_huge_pmd_anonymous_page(mm, vma, address,
> - pmd, flags);
> + ret = do_huge_pmd_anonymous_page(mm, vma, address,
> + pmd, flags);
> + if ((ret & VM_FAULT_FALLBACK) == 0)
> + return ret;

This could use a small comment about where the code flow is going, when
and why. FWIW, I vastly prefer the '!' form in these:

if (!(ret & VM_FAULT_FALLBACK))

2013-05-21 23:57:26

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 32/39] mm: cleanup __do_fault() implementation

On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Let's cleanup __do_fault() to prepare it for transparent huge pages
> support injection.
>
> Cleanups:
> - int -> bool where appropriate;
> - unindent some code by reverting 'if' condition;
> - extract !pte_same() path to get it clear;
> - separate pte update from mm stats update;
> - some comments reformated;

I've scanned through the rest of these patches. They look OK, and I
don't have _too_ much to say. They definitely need some closer review,
but I think you should concentrate your attention on the stuff _before_
this point in the series.

2013-05-22 06:51:43

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 31/39] thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page()

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> do_huge_pmd_anonymous_page() has copy-pasted piece of handle_mm_fault()
> to handle fallback path.
>
> Let's consolidate code back by introducing VM_FAULT_FALLBACK return
> code.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---

Acked-by: Hillf Danton <[email protected]>

2013-05-22 07:26:42

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 28/39] thp: prepare zap_huge_pmd() to uncharge file pages

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Uncharge pages from correct counter.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/huge_memory.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 7ad458d..a88f9b2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1385,10 +1385,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> spin_unlock(&tlb->mm->page_table_lock);
> put_huge_zero_page();
> } else {
> + int member;
> page = pmd_page(orig_pmd);

Better _if_ member is determined before we touch rmap, conceptually?
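
Something like this, just to illustrate:

	page = pmd_page(orig_pmd);
	/* pick the counter before the rmap is torn down */
	member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
	page_remove_rmap(page);
	VM_BUG_ON(page_mapcount(page) < 0);
	add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);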

> page_remove_rmap(page);
> VM_BUG_ON(page_mapcount(page) < 0);
> - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + member = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
> + add_mm_counter(tlb->mm, member, -HPAGE_PMD_NR);
> VM_BUG_ON(!PageHead(page));
> tlb->mm->nr_ptes--;
> spin_unlock(&tlb->mm->page_table_lock);
> --
> 1.7.10.4
>

2013-05-22 11:03:41

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 02/39] block: implement add_bdi_stat()

Dave Hansen wrote:
> On 05/11/2013 06:22 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > We're going to add/remove a number of page cache entries at once. This
> > patch implements add_bdi_stat() which adjusts bdi stats by arbitrary
> > amount. It's required for batched page cache manipulations.
>
> Add, but no dec?

'sub', I guess, not 'dec'. For that we use add_bdi_stat(m, item, -nr).
It's consistent with __add_bdi_stat() usage.
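
A removal path would look something like this (illustrative only, not a
real call site from the series):

	/* e.g. accounting for nr dirty pages going away at once */
	add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);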

> I'd also move this closer to where it gets used in the series.

Okay.

--
Kirill A. Shutemov

2013-05-22 11:19:15

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 08/39] thp: compile-time and sysfs knob for thp pagecache

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for X86_64.
>
How about THPC, TRANSPARENT_HUGEPAGE_CACHE?

2013-05-22 11:37:27

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 34/39] thp, mm: handle huge pages in filemap_fault()

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> If caller asks for huge page (flags & FAULT_FLAG_TRANSHUGE),
> filemap_fault() return it if there's a huge page already by the offset.
>
> If the area of page cache required to create huge is empty, we create a
> new huge page and return it.
>
> Otherwise we return VM_FAULT_FALLBACK to indicate that fallback to small
> pages is required.
>
s/small/regular/g ?

2013-05-22 11:45:45

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 30/39] thp: do_huge_pmd_anonymous_page() cleanup

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Minor cleanup: unindent most code of the fucntion by inverting one
> condition. It's preparation for the next patch.
>
> No functional changes.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
Acked-by: Hillf Danton <[email protected]>

2013-05-22 12:01:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
>
> Would radix_tree_preload_contig() be a better name, then?

Yes. Will rename.

> ...
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> >
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> >
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6.
>
> Thanks for adding that to the description. The array you're talking
> about is just pointers, right?
>
> 107-43 = 64. So, we have 64 extra pointers * NR_CPUS, plus 64 extra
> radix tree nodes that we will keep around most of the time. On x86_64,
> that's 512 bytes plus 64*560 bytes of nodes which is ~35k of memory per CPU.
>
> That's not bad I guess, but I do bet it's something that some folks want
> to configure out. Please make sure to call out the actual size cost in
> bytes per CPU in future patch postings, at least for the common case
> (64-bit non-CONFIG_BASE_SMALL).

I will add this to the commit message:

On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
on 64-bit system the per-CPU feature overhead is
for preload array:
(30 - 21) * sizeof(void*) = 72 bytes
plus, if the preload array is full
(30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
total: 5112 bytes

on 32-bit system the per-CPU feature overhead is
for preload array:
(19 - 11) * sizeof(void*) = 32 bytes
plus, if the preload array is full
(19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
total: 2400 bytes
---

Is it good enough?

I will probably add a !BASE_SMALL dependency to the
TRANSPARENT_HUGEPAGE_PAGECACHE config option.

>
> > Since only THP uses batched preload at the , we disable (set max preload
> > to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be changed
> > in the future.
>
> "at the..." Is there something missing in that sentence?

at the moment :)

> No major nits, so:
>
> Acked-by: Dave Hansen <[email protected]>

Thanks!

--
Kirill A. Shutemov

2013-05-22 12:32:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 06/39] thp, mm: avoid PageUnevictable on active/inactive lru lists

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
> > that have been placed on the LRU lists when first allocated), but these
> > pages must not have PageUnevictable set - otherwise shrink_active_list
> > goes crazy:
> >
> > kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
> > invalid opcode: 0000 [#1] SMP
> > CPU 0
> > Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
> > RIP: 0010:[<ffffffff81110478>] [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
> > RSP: 0000:ffff8800796d9b28 EFLAGS: 00010082'
> ...
>
> I'd much rather see a code snippet and description the BUG_ON() than a
> register and stack dump. That line number is wrong already. ;)

Good point.

> > For lru_add_page_tail(), it means we should not set PageUnevictable()
> > for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
> > Let's just copy PG_active and PG_unevictable from head page in
> > __split_huge_page_refcount(), it will simplify lru_add_page_tail().
> >
> > This will fix one more bug in lru_add_page_tail():
> > if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
> > will go to the same lru as page, but nobody cares to sync page_tail
> > active/inactive state with page. So we can end up with inactive page on
> > active lru.
> > The patch will fix it as well since we copy PG_active from head page.
>
> This all seems good, and if it fixes a bug, it should really get merged
> as it stands. Have you been actually able to trigger that bug in any
> way in practice?

I was only able to trigger it on ramfs transhuge pages split.
I doubt there's a way to reproduce it on current upstream code.

> Acked-by: Dave Hansen <[email protected]>

Thanks!

--
Kirill A. Shutemov

2013-05-22 12:47:06

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> @@ -3301,12 +3335,23 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> {
> pte_t *page_table;
> spinlock_t *ptl;
> + pgtable_t pgtable = NULL;
> struct page *page, *cow_page, *dirty_page = NULL;
> - pte_t entry;
> bool anon = false, page_mkwrite = false;
> bool write = flags & FAULT_FLAG_WRITE;
> + bool thp = flags & FAULT_FLAG_TRANSHUGE;
> + unsigned long addr_aligned;
> struct vm_fault vmf;
> - int ret;
> + int nr, ret;
> +
> + if (thp) {
> + if (!transhuge_vma_suitable(vma, address))
> + return VM_FAULT_FALLBACK;
> + if (unlikely(khugepaged_enter(vma)))

vma->vm_mm now is under the care of khugepaged, why?

> + return VM_FAULT_OOM;
> + addr_aligned = address & HPAGE_PMD_MASK;
> + } else
> + addr_aligned = address & PAGE_MASK;
>

2013-05-22 12:56:26

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> @@ -3316,17 +3361,25 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> if (unlikely(anon_vma_prepare(vma)))
> return VM_FAULT_OOM;
>
> - cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> + cow_page = alloc_fault_page_vma(vma, address, flags);
> if (!cow_page)
> - return VM_FAULT_OOM;
> + return VM_FAULT_OOM | VM_FAULT_FALLBACK;
>

Fallback makes sense with !thp ?

> if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
> page_cache_release(cow_page);
> - return VM_FAULT_OOM;
> + return VM_FAULT_OOM | VM_FAULT_FALLBACK;
> }
> } else
> cow_page = NULL;

2013-05-22 13:24:52

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> page = vmf.page;
> +
> + /*
> + * If we asked for huge page we expect to get it or VM_FAULT_FALLBACK.
> + * If we don't ask for huge page it must be splitted in ->fault().
> + */
> + BUG_ON(PageTransHuge(page) != thp);
> +
Based on the log message in 34/39 ("If the area of page cache required
to create huge is empty, we create a new huge page and return it."), the
above trap looks bogus.

if (thp)
BUG_ON(!PageTransHuge(page));

2013-05-22 13:48:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Returns true if mapping can have huge pages. Just check for __GFP_COMP
> > in gfp mask of the mapping for now.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > include/linux/pagemap.h | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index e3dea75..28597ec 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -84,6 +84,18 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> > (__force unsigned long)mask;
> > }
> >
> > +static inline bool mapping_can_have_hugepages(struct address_space *m)
> > +{
> > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> > + gfp_t gfp_mask = mapping_gfp_mask(m);
> > + /* __GFP_COMP is key part of GFP_TRANSHUGE */
> > + return !!(gfp_mask & __GFP_COMP) &&
> > + transparent_hugepage_pagecache();
> > + }
> > +
> > + return false;
> > +}
>
> transparent_hugepage_pagecache() already has the same IS_ENABLED()
> check, Is it really necessary to do it again here?
>
> IOW, can you do this?
>
> > +static inline bool mapping_can_have_hugepages(struct address_space
> > +{
> > + gfp_t gfp_mask = mapping_gfp_mask(m);
> if (!transparent_hugepage_pagecache())
> return false;
> > + /* __GFP_COMP is key part of GFP_TRANSHUGE */
> > + return !!(gfp_mask & __GFP_COMP);
> > +}

Yeah, it's better.

> I know we talked about this in the past, but I've forgotten already.
> Why is this checking for __GFP_COMP instead of GFP_TRANSHUGE?

It's up to the filesystem which gfp mask to use. For example, ramfs pages
are not movable currently, so we check only the part which matters.

> Please flesh out the comment.

I'll make the comment in code a bit more descriptive.
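
Something along these lines, perhaps (just a sketch, the final wording
may differ):

static inline bool mapping_can_have_hugepages(struct address_space *m)
{
	gfp_t gfp_mask = mapping_gfp_mask(m);

	if (!transparent_hugepage_pagecache())
		return false;

	/*
	 * It's up to the filesystem which gfp mask to use for its
	 * mapping. __GFP_COMP is the part of GFP_TRANSHUGE that
	 * matters here: it marks the mapping as one that may contain
	 * compound (huge) pages. Other GFP_TRANSHUGE bits, like
	 * __GFP_MOVABLE, may legitimately differ (ramfs clears it for
	 * now, for instance).
	 */
	return !!(gfp_mask & __GFP_COMP);
}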

> Also, what happens if "transparent_hugepage_flags &
> (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)" becomes false at runtime and you
> have some already-instantiated huge page cache mappings around? Will
> things like mapping_align_mask() break?

We will not touch existing huge pages in existing VMAs. Userspace can
use them until they are unmapped or split. It's consistent with anon
THP pages.

If anybody mmap()s the file after disabling the feature, we will not
set up huge pages anymore: the transparent_hugepage_enabled() check in
handle_mm_fault() will fail and the page will be split.

mapping_align_mask() is part of the mmap() call path, so the only
consequence is that we may get a VMA aligned more strictly than needed.
Nothing to worry about.

--
Kirill A. Shutemov

2013-05-22 14:10:58

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 16/39] thp, mm: locking tail page is a bug

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Locking head page means locking entire compound page.
> > If we try to lock tail page, something went wrong.
>
> Have you actually triggered this in your development?

Yes, on early prototypes.

--
Kirill A. Shutemov

2013-05-22 14:20:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 26/39] ramfs: enable transparent huge page cache

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > ramfs is the most simple fs from page cache point of view. Let's start
> > transparent huge page cache enabling here.
> >
> > For now we allocate only non-movable huge page. ramfs pages cannot be
> > moved yet.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > fs/ramfs/inode.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > index c24f1e1..54d69c7 100644
> > --- a/fs/ramfs/inode.c
> > +++ b/fs/ramfs/inode.c
> > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> > inode_init_owner(inode, dir, mode);
> > inode->i_mapping->a_ops = &ramfs_aops;
> > inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > + /*
> > + * TODO: make ramfs pages movable
> > + */
> > + mapping_set_gfp_mask(inode->i_mapping,
> > + GFP_TRANSHUGE & ~__GFP_MOVABLE);
>
> So, before these patches, ramfs was movable. Now, even on architectures
> or configurations that have no chance of using THP-pagecache, ramfs
> pages are no longer movable. Right?

No, it wasn't movable. GFP_HIGHUSER is not GFP_HIGHUSER_MOVABLE (yeah,
names of gfp constants could be more consistent).

ramfs should be fixed to use movable pages, but it's outside the scope of the
patchset.

See more details: http://lkml.org/lkml/2013/4/2/720

--
Kirill A. Shutemov

2013-05-22 14:20:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 04/39] radix-tree: implement preload for multiple contiguous elements

On 05/22/2013 05:03 AM, Kirill A. Shutemov wrote:
> On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> on 64-bit system the per-CPU feature overhead is
> for preload array:
> (30 - 21) * sizeof(void*) = 72 bytes
> plus, if the preload array is full
> (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> total: 5112 bytes
>
> on 32-bit system the per-CPU feature overhead is
> for preload array:
> (19 - 11) * sizeof(void*) = 32 bytes
> plus, if the preload array is full
> (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> total: 2400 bytes
> ---
>
> Is it good enough?

Yup, just stick the calculations way down in the commit message. You
can put the description that it "eats about 5k more memory per-cpu than
existing code" up in the very beginning.

2013-05-22 14:36:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > It's confusing that mk_huge_pmd() has sematics different from mk_pte()
> > or mk_pmd().
> >
> > Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> > prototype to match mk_pte().
>
> Was there a motivation to do this beyond adding consistency? Do you use
> this later or something?

I spent some time on debugging problem caused by this inconsistency, so at
that point I was motivated to fix it. :)

--
Kirill A. Shutemov

2013-05-22 14:53:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 16/39] thp, mm: locking tail page is a bug

On 05/22/2013 07:12 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> Locking head page means locking entire compound page.
>>> If we try to lock tail page, something went wrong.
>>
>> Have you actually triggered this in your development?
>
> Yes, on early prototypes.

I'd mention this in the description, and think about how necessary this
is with your _current_ code.

2013-05-22 14:55:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 26/39] ramfs: enable transparent huge page cache

On 05/22/2013 07:22 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>>> + /*
>>> + * TODO: make ramfs pages movable
>>> + */
>>> + mapping_set_gfp_mask(inode->i_mapping,
>>> + GFP_TRANSHUGE & ~__GFP_MOVABLE);
>>
>> So, before these patches, ramfs was movable. Now, even on architectures
>> or configurations that have no chance of using THP-pagecache, ramfs
>> pages are no longer movable. Right?
>
> No, it wasn't movable. GFP_HIGHUSER is not GFP_HIGHUSER_MOVABLE (yeah,
> names of gfp constants could be more consistent).
>
> ramfs should be fixed to use movable pages, but it's outside the scope of the
> patchset.
>
> See more details: http://lkml.org/lkml/2013/4/2/720

Please make sure this is clear from the patch description.

Personally, I wouldn't be adding TODO's to the code that I'm not
planning to go fix, lest I would get tagged with _doing_ it. :)

2013-05-22 14:57:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 29/39] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

On 05/22/2013 07:37 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> It's confusing that mk_huge_pmd() has sematics different from mk_pte()
>>> or mk_pmd().
>>>
>>> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
>>> prototype to match mk_pte().
>>
>> Was there a motivation to do this beyond adding consistency? Do you use
>> this later or something?
>
> I spent some time on debugging problem caused by this inconsistency, so at
> that point I was motivated to fix it. :)

A little anecdote that this bit you in practice to help indicate this
isn't just random code churn would be nice.

2013-05-22 15:10:49

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > @@ -3301,12 +3335,23 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > {
> > pte_t *page_table;
> > spinlock_t *ptl;
> > + pgtable_t pgtable = NULL;
> > struct page *page, *cow_page, *dirty_page = NULL;
> > - pte_t entry;
> > bool anon = false, page_mkwrite = false;
> > bool write = flags & FAULT_FLAG_WRITE;
> > + bool thp = flags & FAULT_FLAG_TRANSHUGE;
> > + unsigned long addr_aligned;
> > struct vm_fault vmf;
> > - int ret;
> > + int nr, ret;
> > +
> > + if (thp) {
> > + if (!transhuge_vma_suitable(vma, address))
> > + return VM_FAULT_FALLBACK;
> > + if (unlikely(khugepaged_enter(vma)))
>
> vma->vm_mm now is under the care of khugepaged, why?

Because it has at least one VMA suitable for huge pages.

Yes, we can't collapse pages in file-backed VMAs yet, but it's better to
be consistent to avoid issues when collapsing is implemented.

--
Kirill A. Shutemov

2013-05-22 15:11:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > @@ -3316,17 +3361,25 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > if (unlikely(anon_vma_prepare(vma)))
> > return VM_FAULT_OOM;
> >
> > - cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> > + cow_page = alloc_fault_page_vma(vma, address, flags);
> > if (!cow_page)
> > - return VM_FAULT_OOM;
> > + return VM_FAULT_OOM | VM_FAULT_FALLBACK;
> >
>
> Fallback makes sense with !thp ?

No, it's a nop. handle_pte_fault() will notice only VM_FAULT_OOM, which
is what we need.

--
Kirill A. Shutemov

2013-05-22 15:23:52

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 33/39] thp, mm: implement do_huge_linear_fault()

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > page = vmf.page;
> > +
> > + /*
> > + * If we asked for huge page we expect to get it or VM_FAULT_FALLBACK.
> > + * If we don't ask for huge page it must be splitted in ->fault().
> > + */
> > + BUG_ON(PageTransHuge(page) != thp);
> > +
> Based on the log message in 34/39(
> If the area of page cache required to create huge is empty, we create a
> new huge page and return it.), the above trap looks bogus.

The statement in 34/39 is true for (flags & FAULT_FLAG_TRANSHUGE).
For !(flags & FAULT_FLAG_TRANSHUGE) huge page must be split in ->fault.

The BUG_ON() above is shortcut for two checks:

if (thp)
BUG_ON(!PageTransHuge(page));
if (!thp)
BUG_ON(PageTransHuge(page));

--
Kirill A. Shutemov

2013-05-22 15:31:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 09/39] thp, mm: introduce mapping_can_have_hugepages() predicate

On 05/22/2013 06:51 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> Also, what happens if "transparent_hugepage_flags &
>> (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)" becomes false at runtime and you
>> have some already-instantiated huge page cache mappings around? Will
>> things like mapping_align_mask() break?
>
> We will not touch existing huge pages in existing VMAs. The userspace can
> use them until they will be unmapped or split. It's consistent with anon
> THP pages.
>
> If anybody mmap() the file after disabling the feature, we will not
> setup huge pages anymore: transparent_hugepage_enabled() check in
> handle_mm_fault will fail and the page fill be split.
>
> mapping_align_mask() is part of mmap() call path, so there's only chance
> that we will get VMA aligned more strictly then needed. Nothing to worry
> about.

Could we get a little blurb along those lines somewhere? Maybe even in
your docs that you've added to Documentation/. Oh, wait, you don't have
any documentation? :)

You did add a sysfs knob, so you do owe us some docs for it.

"If the THP-cache sysfs tunable is disabled, huge pages will no longer
be mapped with new mmap()s, but they will remain in place in the page
cache. You might still see some benefits from read/write operations,
etc..."

2013-05-22 15:32:25

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 34/39] thp, mm: handle huge pages in filemap_fault()

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > If caller asks for huge page (flags & FAULT_FLAG_TRANSHUGE),
> > filemap_fault() return it if there's a huge page already by the offset.
> >
> > If the area of page cache required to create huge is empty, we create a
> > new huge page and return it.
> >
> > Otherwise we return VM_FAULT_FALLBACK to indicate that fallback to small
> > pages is required.
> >
> s/small/regular/g ?

% git log --oneline -p -i --grep 'small.\?page' | wc -l
5962
% git log --oneline -p -i --grep 'regular.\?page' | wc -l
3623

--
Kirill A. Shutemov

2013-05-23 10:32:30

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 03/39] mm: implement zero_huge_user_segment and friends

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Let's add helpers to clear huge page segment(s). They provide the same
> functionallity as zero_user_segment and zero_user, but for huge pages.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/mm.h | 7 +++++++
> mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 43 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index c05d7cf..5e156fb 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1797,6 +1797,13 @@ extern void dump_page(struct page *page);
> extern void clear_huge_page(struct page *page,
> unsigned long addr,
> unsigned int pages_per_huge_page);
> +extern void zero_huge_user_segment(struct page *page,
> + unsigned start, unsigned end);
> +static inline void zero_huge_user(struct page *page,
> + unsigned start, unsigned len)
> +{
> + zero_huge_user_segment(page, start, start + len);
> +}
> extern void copy_user_huge_page(struct page *dst, struct page *src,
> unsigned long addr, struct vm_area_struct *vma,
> unsigned int pages_per_huge_page);
> diff --git a/mm/memory.c b/mm/memory.c
> index f7a1fba..f02a8be 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4266,6 +4266,42 @@ void clear_huge_page(struct page *page,
> }
> }
>
> +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> +{
> + int i;
> + unsigned start_idx, end_idx;
> + unsigned start_off, end_off;
> +
> + BUG_ON(end < start);
> +
> + might_sleep();
> +
> + if (start == end)
> + return;
> +
> + start_idx = start >> PAGE_SHIFT;
> + start_off = start & ~PAGE_MASK;
> + end_idx = (end - 1) >> PAGE_SHIFT;
> + end_off = ((end - 1) & ~PAGE_MASK) + 1;
> +
> + /*
> + * if start and end are on the same small page we can call
> + * zero_user_segment() once and save one kmap_atomic().
> + */
> + if (start_idx == end_idx)
> + return zero_user_segment(page + start_idx, start_off, end_off);
> +
> + /* zero the first (possibly partial) page */
> + zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
> + for (i = start_idx + 1; i < end_idx; i++) {
> + cond_resched();
> + clear_highpage(page + i);
> + flush_dcache_page(page + i);

Can we use the function again?
zero_user_segment(page + i, 0, PAGE_SIZE);

> + }
> + /* zero the last (possibly partial) page */
> + zero_user_segment(page + end_idx, 0, end_off);
> +}
> +
> static void copy_user_gigantic_page(struct page *dst, struct page *src,
> unsigned long addr,
> struct vm_area_struct *vma,
> --
> 1.7.10.4
>

2013-05-23 10:36:42

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>

Better if one or two sentences are prepared to show that the following
defines are necessary.

> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/huge_mm.h | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 528454c..6b4c9b2 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
> #define HPAGE_PMD_MASK HPAGE_MASK
> #define HPAGE_PMD_SIZE HPAGE_SIZE
>
> +#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
> +#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
> +#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
> +
> extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>
> #define transparent_hugepage_enabled(__vma) \
> @@ -185,6 +189,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
> #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
>
> +#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
> +#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
> +#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
> +
> #define hpage_nr_pages(x) 1
>
> #define transparent_hugepage_enabled(__vma) 0
> --
> 1.7.10.4
>

2013-05-23 11:01:08

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 38/39] thp: vma_adjust_trans_huge(): adjust file-backed VMA too

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Since we're going to have huge pages in page cache, we need to call
> adjust file-backed VMA, which potentially can contain huge pages.
>
> For now we call it for all VMAs.
>
> Probably later we will need to introduce a flag to indicate that the VMA
> has huge pages.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---

Acked-by: Hillf Danton <[email protected]>

> include/linux/huge_mm.h | 11 +----------
> mm/huge_memory.c | 2 +-
> 2 files changed, 2 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index b20334a..f4d6626 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -139,7 +139,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
> #endif
> extern int hugepage_madvise(struct vm_area_struct *vma,
> unsigned long *vm_flags, int advice);
> -extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
> +extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
> unsigned long start,
> unsigned long end,
> long adjust_next);
> @@ -155,15 +155,6 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
> else
> return 0;
> }
> -static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
> - unsigned long start,
> - unsigned long end,
> - long adjust_next)
> -{
> - if (!vma->anon_vma || vma->vm_ops)
> - return;
> - __vma_adjust_trans_huge(vma, start, end, adjust_next);
> -}
> static inline int hpage_nr_pages(struct page *page)
> {
> if (unlikely(PageTransHuge(page)))
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d7c9df5..9c3815b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2783,7 +2783,7 @@ static void split_huge_page_address(struct mm_struct *mm,
> split_huge_page_pmd_mm(mm, address, pmd);
> }
>
> -void __vma_adjust_trans_huge(struct vm_area_struct *vma,
> +void vma_adjust_trans_huge(struct vm_area_struct *vma,
> unsigned long start,
> unsigned long end,
> long adjust_next)
> --
> 1.7.10.4
>

2013-05-23 11:30:34

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 03/39] mm: implement zero_huge_user_segment and friends

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Let's add helpers to clear huge page segment(s). They provide the same
> > functionallity as zero_user_segment and zero_user, but for huge pages.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > include/linux/mm.h | 7 +++++++
> > mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 43 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index c05d7cf..5e156fb 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1797,6 +1797,13 @@ extern void dump_page(struct page *page);
> > extern void clear_huge_page(struct page *page,
> > unsigned long addr,
> > unsigned int pages_per_huge_page);
> > +extern void zero_huge_user_segment(struct page *page,
> > + unsigned start, unsigned end);
> > +static inline void zero_huge_user(struct page *page,
> > + unsigned start, unsigned len)
> > +{
> > + zero_huge_user_segment(page, start, start + len);
> > +}
> > extern void copy_user_huge_page(struct page *dst, struct page *src,
> > unsigned long addr, struct vm_area_struct *vma,
> > unsigned int pages_per_huge_page);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f7a1fba..f02a8be 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4266,6 +4266,42 @@ void clear_huge_page(struct page *page,
> > }
> > }
> >
> > +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> > +{
> > + int i;
> > + unsigned start_idx, end_idx;
> > + unsigned start_off, end_off;
> > +
> > + BUG_ON(end < start);
> > +
> > + might_sleep();
> > +
> > + if (start == end)
> > + return;
> > +
> > + start_idx = start >> PAGE_SHIFT;
> > + start_off = start & ~PAGE_MASK;
> > + end_idx = (end - 1) >> PAGE_SHIFT;
> > + end_off = ((end - 1) & ~PAGE_MASK) + 1;
> > +
> > + /*
> > + * if start and end are on the same small page we can call
> > + * zero_user_segment() once and save one kmap_atomic().
> > + */
> > + if (start_idx == end_idx)
> > + return zero_user_segment(page + start_idx, start_off, end_off);
> > +
> > + /* zero the first (possibly partial) page */
> > + zero_user_segment(page + start_idx, start_off, PAGE_SIZE);
> > + for (i = start_idx + 1; i < end_idx; i++) {
> > + cond_resched();
> > + clear_highpage(page + i);
> > + flush_dcache_page(page + i);
>
> Can we use the function again?
> zero_user_segment(page + i, 0, PAGE_SIZE);

No. zero_user_segment() is memset()-based, while clear_highpage() is
highly optimized for page clearing on many architectures.

--
Kirill A. Shutemov

2013-05-23 11:36:45

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 39/39] thp: map file-backed huge pages on fault

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Look like all pieces are in place, we can map file-backed huge-pages
> now.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/huge_mm.h | 4 +++-
> mm/memory.c | 5 ++++-
> 2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f4d6626..903f097 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -78,7 +78,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
> (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> ((__vma)->vm_flags & VM_HUGEPAGE))) && \
> !((__vma)->vm_flags & VM_NOHUGEPAGE) && \
> - !is_vma_temporary_stack(__vma))
> + !is_vma_temporary_stack(__vma) && \
> + (!(__vma)->vm_ops || \
> + mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))

Redefine, why?

> #define transparent_hugepage_defrag(__vma) \
> ((transparent_hugepage_flags & \
> (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
> diff --git a/mm/memory.c b/mm/memory.c
> index ebff552..7fe9752 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3939,10 +3939,13 @@ retry:
> if (!pmd)
> return VM_FAULT_OOM;
> if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> - int ret = 0;
> + int ret;
> if (!vma->vm_ops)
> ret = do_huge_pmd_anonymous_page(mm, vma, address,
> pmd, flags);

Ah vma->vm_ops is checked here, so
else if (mapping_can_have_hugepages())

> + else
> + ret = do_huge_linear_fault(mm, vma, address,
> + pmd, flags);
> if ((ret & VM_FAULT_FALLBACK) == 0)
> return ret;
> } else {
> --
> 1.7.10.4
>

2013-05-23 11:46:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 39/39] thp: map file-backed huge pages on fault

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Look like all pieces are in place, we can map file-backed huge-pages
> > now.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > include/linux/huge_mm.h | 4 +++-
> > mm/memory.c | 5 ++++-
> > 2 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index f4d6626..903f097 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -78,7 +78,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
> > (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
> > ((__vma)->vm_flags & VM_HUGEPAGE))) && \
> > !((__vma)->vm_flags & VM_NOHUGEPAGE) && \
> > - !is_vma_temporary_stack(__vma))
> > + !is_vma_temporary_stack(__vma) && \
> > + (!(__vma)->vm_ops || \
> > + mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
>
> Redefine, why?
>
> > #define transparent_hugepage_defrag(__vma) \
> > ((transparent_hugepage_flags & \
> > (1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
> > diff --git a/mm/memory.c b/mm/memory.c
> > index ebff552..7fe9752 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3939,10 +3939,13 @@ retry:
> > if (!pmd)
> > return VM_FAULT_OOM;
> > if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> > - int ret = 0;
> > + int ret;
> > if (!vma->vm_ops)
> > ret = do_huge_pmd_anonymous_page(mm, vma, address,
> > pmd, flags);
>
> Ah vma->vm_ops is checked here, so
> else if (mapping_can_have_hugepages())

Okay, it's cleaner.
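
I.e. roughly this (untested sketch, also picking up Dave's preference
for the '!' form and defaulting ret to VM_FAULT_FALLBACK so we fall
through when neither path applies):

	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
		int ret = VM_FAULT_FALLBACK;

		if (!vma->vm_ops)
			ret = do_huge_pmd_anonymous_page(mm, vma, address,
					pmd, flags);
		/* ->vm_ops alone does not guarantee ->vm_file is set */
		else if (vma->vm_file &&
			 mapping_can_have_hugepages(vma->vm_file->f_mapping))
			ret = do_huge_linear_fault(mm, vma, address,
					pmd, flags);
		if (!(ret & VM_FAULT_FALLBACK))
			return ret;
	}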

--
Kirill A. Shutemov

2013-05-23 11:57:26

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages

On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
<[email protected]> wrote:
> @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> page = pmd_page(orig_pmd);
> VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> - if (page_mapcount(page) == 1) {
> + if (PageAnon(page) && page_mapcount(page) == 1) {

Could we avoid copying huge page if
no-one else is using it, no matter anon?

> pmd_t entry;
> entry = pmd_mkyoung(orig_pmd);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

2013-05-23 12:06:33

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages

Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> > page = pmd_page(orig_pmd);
> > VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> > - if (page_mapcount(page) == 1) {
> > + if (PageAnon(page) && page_mapcount(page) == 1) {
>
> Could we avoid copying huge page if
> no-one else is using it, no matter anon?

No. The page is still in the page cache and can be accessed again later.
We could isolate the page from the page cache, but I'm not sure whether
that's a good idea.

do_wp_page() does exactly the same for small pages, so let's keep it
consistent.

--
Kirill A. Shutemov

2013-05-23 12:12:10

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages

On Thu, May 23, 2013 at 8:08 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Hillf Danton wrote:
>> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>> >
>> > page = pmd_page(orig_pmd);
>> > VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>> > - if (page_mapcount(page) == 1) {
>> > + if (PageAnon(page) && page_mapcount(page) == 1) {
>>
>> Could we avoid copying huge page if
>> no-one else is using it, no matter anon?
>
> No. The page is still in page cache and can be later accessed later.
> We could isolate the page from page cache, but I'm not sure whether it's
> good idea.
>
Hugetlb tries to avoid copying the page.

/* If no-one else is actually using this page, avoid the copy
* and just make the page writable */
avoidcopy = (page_mapcount(old_page) == 1);

2013-05-23 12:31:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 37/39] thp: handle write-protect exception to file-backed huge pages

Hillf Danton wrote:
> On Thu, May 23, 2013 at 8:08 PM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Hillf Danton wrote:
> >> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> >> <[email protected]> wrote:
> >> > @@ -1120,7 +1119,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >> >
> >> > page = pmd_page(orig_pmd);
> >> > VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> >> > - if (page_mapcount(page) == 1) {
> >> > + if (PageAnon(page) && page_mapcount(page) == 1) {
> >>
> >> Could we avoid copying huge page if
> >> no-one else is using it, no matter anon?
> >
> > No. The page is still in page cache and can be later accessed later.
> > We could isolate the page from page cache, but I'm not sure whether it's
> > good idea.
> >
> Hugetlb tries to avoid copying pahe.
>
> /* If no-one else is actually using this page, avoid the copy
> * and just make the page writable */
> avoidcopy = (page_mapcount(old_page) == 1);

It makes sense for hugetlb, since it is RAM-backed only.

Currently, the project supports only ramfs, but I hope we will bring in
storage-backed filesystems later. For them it would be much cheaper to
copy the page than to bring it back later from storage.

And one more point: we must never reuse dirty pages, since that would
lead to data loss. And ramfs pages are always dirty.

--
Kirill A. Shutemov

2013-05-23 14:34:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> > page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> > following indexes.
>
> The really nice way to do these patches is refactor them, first, with no
> behavior change, in one patch, the introduce the new support in the
> second one.

I've split it into two patches.

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 61158ac..b0c7c8c 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -460,39 +460,62 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > pgoff_t offset, gfp_t gfp_mask)
> > {
> > int error;
> > + int i, nr;
> >
> > VM_BUG_ON(!PageLocked(page));
> > VM_BUG_ON(PageSwapBacked(page));
> >
> > + /* memory cgroup controller handles thp pages on its side */
> > error = mem_cgroup_cache_charge(page, current->mm,
> > gfp_mask & GFP_RECLAIM_MASK);
> > if (error)
> > - goto out;
> > -
> > - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > - if (error == 0) {
> > - page_cache_get(page);
> > - page->mapping = mapping;
> > - page->index = offset;
> > + return error;
> >
> > - spin_lock_irq(&mapping->tree_lock);
> > - error = radix_tree_insert(&mapping->page_tree, offset, page);
> > - if (likely(!error)) {
> > - mapping->nrpages++;
> > - __inc_zone_page_state(page, NR_FILE_PAGES);
> > - spin_unlock_irq(&mapping->tree_lock);
> > - trace_mm_filemap_add_to_page_cache(page);
> > - } else {
> > - page->mapping = NULL;
> > - /* Leave page->index set: truncation relies upon it */
> > - spin_unlock_irq(&mapping->tree_lock);
> > - mem_cgroup_uncharge_cache_page(page);
> > - page_cache_release(page);
> > - }
> > - radix_tree_preload_end();
> > - } else
> > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
> > + BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> > + nr = hpage_nr_pages(page);
> > + } else {
> > + BUG_ON(PageTransHuge(page));
> > + nr = 1;
> > + }
>
> Why can't this just be
>
> nr = hpage_nr_pages(page);
>
> Are you trying to optimize for the THP=y, but THP-pagecache=n case?

Yes, I try to optimize for the case.

> > + if (error)
> > + goto err;
>
> I know it's not a super-common thing in the kernel, but could you call
> this "insert_err" or something?

I've changed it to err_insert.

> > + }
> > + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> > + if (PageTransHuge(page))
> > + __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> > + mapping->nrpages += nr;
> > + spin_unlock_irq(&mapping->tree_lock);
> > + radix_tree_preload_end();
> > + trace_mm_filemap_add_to_page_cache(page);
> > + return 0;
> > +err:
> > + if (i != 0)
> > + error = -ENOSPC; /* no space for a huge page */
> > + page_cache_release(page + i);
> > + page[i].mapping = NULL;
>
> I guess it's a slight behaviour change (I think it's harmless) but if
> you delay doing the page_cache_get() and page[i].mapping= until after
> the radix tree insertion, you can avoid these two lines.

Hm. I don't think it's safe. The spinlock protects the radix tree against
modification, but find_get_page() can see the page just after
radix_tree_insert().

The page is locked and IIUC never uptodate at this point, so nobody will
be able to do much with it, but leaving it without a valid ->mapping is a
bad idea.

> > + for (i--; i >= 0; i--) {
>
> I kinda glossed over that initial "i--". It might be worth a quick
> comment to call it out.

Okay.

> > + /* Leave page->index set: truncation relies upon it */
> > + page[i].mapping = NULL;
> > + radix_tree_delete(&mapping->page_tree, offset + i);
> > + page_cache_release(page + i);
> > + }
> > + spin_unlock_irq(&mapping->tree_lock);
> > + radix_tree_preload_end();
> > + mem_cgroup_uncharge_cache_page(page);
> > return error;
> > }
>
> FWIW, I think you can move the radix_tree_preload_end() up a bit. I
> guess it won't make any practical difference since you're holding a
> spinlock, but it at least makes the point that you're not depending on
> it any more.

Good point.

> I'm also trying to figure out how and when you'd actually have to unroll
> a partial-huge-page worth of radix_tree_insert(). In the small-page
> case, you can collide with another guy inserting in to the page cache.
> But, can that happen in the _middle_ of a THP?

E.g. if you enable THP after some uptime, the mapping can already
contain small pages. Or if a process maps the file with bad alignment
(MAP_FIXED) and touches the area, it will get small pages.

> Despite my nits, the code still looks correct here, so:
>
> Acked-by: Dave Hansen <[email protected]>

The incremental diff for the patch is below. I guess it's still valid to
use your ack, right?

diff --git a/mm/filemap.c b/mm/filemap.c
index f643062..d004331 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -492,29 +492,33 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
error = radix_tree_insert(&mapping->page_tree,
offset + i, page + i);
if (error)
- goto err;
+ goto err_insert;
}
+ radix_tree_preload_end();
__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
if (PageTransHuge(page))
__inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
mapping->nrpages += nr;
spin_unlock_irq(&mapping->tree_lock);
- radix_tree_preload_end();
trace_mm_filemap_add_to_page_cache(page);
return 0;
-err:
+err_insert:
+ radix_tree_preload_end();
if (i != 0)
error = -ENOSPC; /* no space for a huge page */
+
+ /* page[i] was not inserted to tree, handle separately */
page_cache_release(page + i);
page[i].mapping = NULL;
- for (i--; i >= 0; i--) {
+ i--;
+
+ for (; i >= 0; i--) {
/* Leave page->index set: truncation relies upon it */
page[i].mapping = NULL;
radix_tree_delete(&mapping->page_tree, offset + i);
page_cache_release(page + i);
}
spin_unlock_irq(&mapping->tree_lock);
- radix_tree_preload_end();
mem_cgroup_uncharge_cache_page(page);
return error;
}
--
Kirill A. Shutemov

2013-05-23 15:49:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 07/39] thp, mm: basic defines for transparent huge page cache

On 05/23/2013 03:36 AM, Hillf Danton wrote:
> On Sun, May 12, 2013 at 9:23 AM, Kirill A. Shutemov
> <[email protected]> wrote:
>> > From: "Kirill A. Shutemov" <[email protected]>
> Better if one or two sentences are prepared to show that the following
> defines are necessary.
...
>> >
>> > +#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
>> > +#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
>> > +#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)

Yeah, or just stick them in the patch that uses them first. These
aren't exactly rocket science.
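
For reference, on x86-64 (2MiB huge pages over 4KiB page-cache pages) they
work out to:

	/*
	 * HPAGE_SHIFT = 21, PAGE_CACHE_SHIFT = 12, so:
	 *
	 *	HPAGE_CACHE_ORDER      = 21 - 12 = 9
	 *	HPAGE_CACHE_NR         = 1L << 9  = 512 subpages per huge page
	 *	HPAGE_CACHE_INDEX_MASK = 512 - 1  = 511 (0x1ff), the subpage
	 *	                         part of a page cache index
	 */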

2013-05-23 16:00:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

On 05/23/2013 07:36 AM, Kirill A. Shutemov wrote:
>>> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)) {
>>> + BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
>>> + nr = hpage_nr_pages(page);
>>> + } else {
>>> + BUG_ON(PageTransHuge(page));
>>> + nr = 1;
>>> + }
>>
>> Why can't this just be
>>
>> nr = hpage_nr_pages(page);
>>
>> Are you trying to optimize for the THP=y, but THP-pagecache=n case?
>
> Yes, I try to optimize for the case.

I'd suggest either optimizing in _common_ code, or not optimizing it at
all. Once in production, and all the config options are on, the
optimization goes away anyway.

You could create a hpagecache_nr_pages() helper or something I guess.

>>> + }
>>> + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
>>> + if (PageTransHuge(page))
>>> + __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
>>> + mapping->nrpages += nr;
>>> + spin_unlock_irq(&mapping->tree_lock);
>>> + radix_tree_preload_end();
>>> + trace_mm_filemap_add_to_page_cache(page);
>>> + return 0;
>>> +err:
>>> + if (i != 0)
>>> + error = -ENOSPC; /* no space for a huge page */
>>> + page_cache_release(page + i);
>>> + page[i].mapping = NULL;
>>
>> I guess it's a slight behaviour change (I think it's harmless) but if
>> you delay doing the page_cache_get() and page[i].mapping= until after
>> the radix tree insertion, you can avoid these two lines.
>
> Hm. I don't think it's safe. The spinlock protects radix-tree against
> modification, but find_get_page() can see it just after
> radix_tree_insert().

Except that the mapping->tree_lock is still held. I don't think
find_get_page() can find it in the radix tree without taking the lock.

> The page is locked and, IIUC, never uptodate at this point, so nobody
> will be able to do much with it, but leaving it without a valid
> ->mapping is a bad idea.

->mapping changes are protected by lock_page(). You can't keep
->mapping stable without holding it. If you unlock_page(), you have to
recheck ->mapping after you reacquire the lock.

In other words, I think the code is fine.
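
As a minimal sketch of the lock_page()/->mapping rule above (illustrative,
not taken from the patchset): any caller that re-takes the page lock has to
re-validate ->mapping before trusting it.

	static bool page_still_in_mapping(struct page *page,
					  struct address_space *mapping)
	{
		lock_page(page);
		/* truncation may have cleared ->mapping while it was unlocked */
		if (page->mapping != mapping) {
			unlock_page(page);
			return false;
		}
		return true;	/* caller unlocks the page when done */
	}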

>> I'm also trying to figure out how and when you'd actually have to unroll
>> a partial-huge-page worth of radix_tree_insert(). In the small-page
>> case, you can collide with another guy inserting in to the page cache.
>> But, can that happen in the _middle_ of a THP?
>
> E.g. if you enable THP after some uptime, the mapping can already contain
> small pages.
> Or if a process maps the file with bad alignment (MAP_FIXED) and touches
> the area, it will get small pages.

Could you put a comment in explaining this case a bit? It's a bit subtle.

2013-05-28 11:57:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 12/39] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

Dave Hansen wrote:
> You could create a hpagecache_nr_pages() helper or something I guess.

Makes sense.
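
Something like this, perhaps (just a sketch; hpagecache_nr_pages() is the
name suggested above and does not exist yet):

	static inline int hpagecache_nr_pages(struct page *page)
	{
		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
			return hpage_nr_pages(page);
		BUG_ON(PageTransHuge(page));
		return 1;
	}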
>
> >>> + }
> >>> + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> >>> + if (PageTransHuge(page))
> >>> + __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
> >>> + mapping->nrpages += nr;
> >>> + spin_unlock_irq(&mapping->tree_lock);
> >>> + radix_tree_preload_end();
> >>> + trace_mm_filemap_add_to_page_cache(page);
> >>> + return 0;
> >>> +err:
> >>> + if (i != 0)
> >>> + error = -ENOSPC; /* no space for a huge page */
> >>> + page_cache_release(page + i);
> >>> + page[i].mapping = NULL;
> >>
> >> I guess it's a slight behaviour change (I think it's harmless) but if
> >> you delay doing the page_cache_get() and page[i].mapping= until after
> >> the radix tree insertion, you can avoid these two lines.
> >
> > Hm. I don't think it's safe. The spinlock protects radix-tree against
> > modification, but find_get_page() can see it just after
> > radix_tree_insert().
>
> Except that the mapping->tree_lock is still held. I don't think
> find_get_page() can find it in the radix tree without taking the lock.

It can. Lookup is rcu-protected. ->tree_lock is only for add/delete/replace.
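
A simplified sketch of the lookup side (not the exact find_get_page() code)
shows why: it relies only on RCU plus a speculative reference, never on
->tree_lock, so a page becomes visible as soon as radix_tree_insert()
publishes it.

	static struct page *lookup_sketch(struct address_space *mapping,
					  pgoff_t offset)
	{
		struct page *page;

		rcu_read_lock();
		page = radix_tree_lookup(&mapping->page_tree, offset);
		if (page && !page_cache_get_speculative(page))
			page = NULL;	/* raced with the page being freed */
		rcu_read_unlock();
		return page;
	}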

>
> > The page is locked and, IIUC, never uptodate at this point, so nobody
> > will be able to do much with it, but leaving it without a valid
> > ->mapping is a bad idea.
>
> ->mapping changes are protected by lock_page(). You can't keep
> ->mapping stable without holding it. If you unlock_page(), you have to
> recheck ->mapping after you reacquire the lock.
>
> In other words, I think the code is fine.

You are right.

>
> >> I'm also trying to figure out how and when you'd actually have to unroll
> >> a partial-huge-page worth of radix_tree_insert(). In the small-page
> >> case, you can collide with another guy inserting in to the page cache.
> >> But, can that happen in the _middle_ of a THP?
> >
> > E.g. if you enable THP after some uptime, the mapping can already contain
> > small pages.
> > Or if a process maps the file with bad alignment (MAP_FIXED) and touches
> > the area, it will get small pages.
>
> Could you put a comment in explaining this case a bit? It's a bit subtle.

okay.

--
Kirill A. Shutemov

2013-05-28 12:25:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
> > time.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > mm/filemap.c | 31 +++++++++++++++++++++++++------
> > 1 file changed, 25 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index b0c7c8c..657ce82 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -115,6 +115,9 @@
> > void __delete_from_page_cache(struct page *page)
> > {
> > struct address_space *mapping = page->mapping;
> > + bool thp = PageTransHuge(page) &&
> > + IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
> > + int nr;
>
> Is that check for the config option really necessary? How would we get
> a page with PageTransHuge() set without it being enabled?

I'll drop it and use hpagecache_nr_pages() instead.

> I like to rewrite your code. :)

It's nice. Thanks.

> Which reminds me... Why do we handle their reference counts differently? :)
>
> It seems like we could easily put a for loop in delete_from_page_cache()
> that will release their reference counts along with the head page.
> Wouldn't that make the code less special-cased for tail pages?

delete_from_page_cache() is not the only user of
__delete_from_page_cache()...

It seems I did it wrong in add_to_page_cache_locked(). We shouldn't take
references on the tail pages there, only one on the head page. On split
the refcount will be distributed properly.

> > /* Leave page->index set: truncation lookup relies upon it */
> > - mapping->nrpages--;
> > - __dec_zone_page_state(page, NR_FILE_PAGES);
> > + mapping->nrpages -= nr;
> > + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
> > if (PageSwapBacked(page))
> > - __dec_zone_page_state(page, NR_SHMEM);
> > + __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
> > BUG_ON(page_mapped(page));
>
> Man, we suck:
>
> __dec_zone_page_state()
> and
> __mod_zone_page_state()
>
> take a differently-typed first argument. <sigh>
>
> Would there be any good to making __dec_zone_page_state() check to see
> if the page we passed in _is_ a compound page, and adjusting its
> behaviour accordingly?

Yeah, it would be better, but I think it's outside the scope of this
patchset. Probably later.
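
If somebody picks it up later, a minimal sketch (purely illustrative, not
part of this patchset) could be:

	static inline void __dec_zone_page_state_all(struct page *page,
						     enum zone_stat_item item)
	{
		/* works for both small and compound pages */
		__mod_zone_page_state(page_zone(page), item,
				      -hpage_nr_pages(page));
	}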

--
Kirill A. Shutemov

2013-05-28 12:51:01

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > replace_page_cache_page() is only used by FUSE. It's unlikely that we
> > will support THP in the FUSE page cache any time soon.
> >
> > Let's postpone the implementation of THP handling in
> > replace_page_cache_page() until anybody needs it.
> ...
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 657ce82..3a03426 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -428,6 +428,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > {
> > int error;
> >
> > + VM_BUG_ON(PageTransHuge(old));
> > + VM_BUG_ON(PageTransHuge(new));
> > VM_BUG_ON(!PageLocked(old));
> > VM_BUG_ON(!PageLocked(new));
> > VM_BUG_ON(new->mapping);
>
> The code calling replace_page_cache_page() has a bunch of fallback and
> error returning code. It seems a little bit silly to bring the whole
> machine down when you could just WARN_ONCE() and return an error code
> like fuse already does:

What about:

	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
				"%s: unexpected huge page\n", __func__))
		return -EINVAL;

?

--
Kirill A. Shutemov

2013-05-28 16:33:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 15/39] thp, mm: trigger bug in replace_page_cache_page() on THP

On 05/28/2013 05:53 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> + VM_BUG_ON(PageTransHuge(old));
>>> + VM_BUG_ON(PageTransHuge(new));
>>> VM_BUG_ON(!PageLocked(old));
>>> VM_BUG_ON(!PageLocked(new));
>>> VM_BUG_ON(new->mapping);
>>
>> The code calling replace_page_cache_page() has a bunch of fallback and
>> error returning code. It seems a little bit silly to bring the whole
>> machine down when you could just WARN_ONCE() and return an error code
>> like fuse already does:
>
> What about:
>
> 	if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
> 				"%s: unexpected huge page\n", __func__))
> 		return -EINVAL;

That looks sane to me. But, please do make sure to differentiate in the
error message between thp and hugetlbfs (if you have the room).

BTW, I'm also not sure you need to print the function name. The
WARN_ON() register dump usually has the function name.

2013-05-30 13:18:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 19/39] thp, mm: allocate huge pages in grab_cache_page_write_begin()

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Try to allocate huge page if flags has AOP_FLAG_TRANSHUGE.
>
> Why do we need this flag?

I don't see another way to indicate to grab_cache_page_write_begin() that
we want a THP here.

> When might we set it, and when would we not set it? What kinds of
> callers need to check for and act on it?

The decision whether or not to allocate a huge page is up to the
filesystem. In the ramfs case we just use mapping_can_have_hugepages();
on other filesystems the check might be more complicated.
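
For ramfs a minimal sketch of that decision point could look like the
following (illustrative only; the real ->write_begin() in the patchset also
handles fallback and error paths):

	static int ramfs_write_begin_sketch(struct file *file,
			struct address_space *mapping, loff_t pos,
			unsigned len, unsigned flags,
			struct page **pagep, void **fsdata)
	{
		pgoff_t index = pos >> PAGE_CACHE_SHIFT;

		/* ask for a huge page only if the mapping can hold one */
		if (mapping_can_have_hugepages(mapping))
			flags |= AOP_FLAG_TRANSHUGE;

		*pagep = grab_cache_page_write_begin(mapping, index, flags);
		if (!*pagep)
			return -ENOMEM;
		return 0;
	}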

> Some of this, at least, needs to make it in to the comment by the #define.

Sorry, I fail to see what kind of comment you want me to add there.

> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -194,6 +194,9 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
> > #define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
> > #define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
> >
> > +#define THP_WRITE_ALLOC ({ BUILD_BUG(); 0; })
> > +#define THP_WRITE_ALLOC_FAILED ({ BUILD_BUG(); 0; })
>
> Doesn't this belong in the previous patch?

Yes. Fixed.

> > #define hpage_nr_pages(x) 1
> >
> > #define transparent_hugepage_enabled(__vma) 0
> > diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> > index 2e86251..8feeecc 100644
> > --- a/include/linux/pagemap.h
> > +++ b/include/linux/pagemap.h
> > @@ -270,8 +270,15 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
> > unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
> > int tag, unsigned int nr_pages, struct page **pages);
> >
> > -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> > pgoff_t index, unsigned flags);
> > +static inline struct page *grab_cache_page_write_begin(
> > + struct address_space *mapping, pgoff_t index, unsigned flags)
> > +{
> > + if (!transparent_hugepage_pagecache() && (flags & AOP_FLAG_TRANSHUGE))
> > + return NULL;
> > + return __grab_cache_page_write_begin(mapping, index, flags);
> > +}
>
> OK, so there's some of the behavior.
>
> Could you also call out why you refactored this code? It seems like
> you're trying to optimize for the case where AOP_FLAG_TRANSHUGE isn't
> set and where the compiler knows that it isn't set.
>
> Could you talk a little bit about the cases that you're thinking of here?

I just tried to make it cheaper for the
!CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE case, but it seems not worth it:
the only user calls it from under 'if (mapping_can_have_hugepages())',
so I'll drop this.

> > /*
> > * Returns locked page at given index in given cache, creating it if needed.
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 9ea46a4..e086ef0 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2309,25 +2309,44 @@ EXPORT_SYMBOL(generic_file_direct_write);
> > * Find or create a page at the given pagecache position. Return the locked
> > * page. This function is specifically for buffered writes.
> > */
> > -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > - pgoff_t index, unsigned flags)
> > +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> > + pgoff_t index, unsigned flags)
> > {
> > int status;
> > gfp_t gfp_mask;
> > struct page *page;
> > gfp_t gfp_notmask = 0;
> > + bool thp = (flags & AOP_FLAG_TRANSHUGE) &&
> > + IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
>
> Instead of 'thp', how about 'must_use_thp'? The flag seems to be a
> pretty strong edict rather than a hint, so it should be reflected in the
> variables derived from it.

Ok.

> "IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE)" has also popped up
> enough times in the code that it's probably time to start thinking about
> shortening it up. It's a wee bit verbose.

I'll leave it as is for now. Probably come back later.


--
Kirill A. Shutemov

2013-06-03 14:59:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Since we're going to have huge pages backed by files,
> > wait_split_huge_page() has to serialize not only over anon_vma_lock,
> > but over i_mmap_mutex too.
> ...
> > -#define wait_split_huge_page(__anon_vma, __pmd) \
> > +#define wait_split_huge_page(__vma, __pmd) \
> > do { \
> > pmd_t *____pmd = (__pmd); \
> > - anon_vma_lock_write(__anon_vma); \
> > - anon_vma_unlock_write(__anon_vma); \
> > + struct address_space *__mapping = \
> > + vma->vm_file->f_mapping; \
> > + struct anon_vma *__anon_vma = (__vma)->anon_vma; \
> > + if (__mapping) \
> > + mutex_lock(&__mapping->i_mmap_mutex); \
> > + if (__anon_vma) { \
> > + anon_vma_lock_write(__anon_vma); \
> > + anon_vma_unlock_write(__anon_vma); \
> > + } \
> > + if (__mapping) \
> > + mutex_unlock(&__mapping->i_mmap_mutex); \
> > BUG_ON(pmd_trans_splitting(*____pmd) || \
> > pmd_trans_huge(*____pmd)); \
> > } while (0)
>
> Kirill, I asked about this patch in the previous series, and you wrote
> some very nice, detailed answers to my stupid questions. But, you
> didn't add any comments or update the patch description. So, if a
> reviewer or anybody looking at the changelog in the future has my same
> stupid questions, they're unlikely to find the very nice description
> that you wrote up.
>
> I'd highly suggest that you go back through the comments you've received
> before and make sure that you both answered the questions, *and* made
> sure to cover those questions either in the code or in the patch
> descriptions.

Will do.

> Could you also describe the lengths to which you've gone to try and keep
> this macro from growing in to any larger of an abomination. Is it truly
> _impossible_ to turn this in to a normal function? Or will it simply be
> a larger amount of work that you can do right now? What would it take?

Okay, I've tried once again. The patch is below. It looks too invasive for
me. What do you think?

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 19c8c14..7ed4412 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H

+#include <linux/fs.h>
+
extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
@@ -114,23 +116,22 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
__split_huge_page_pmd(__vma, __address, \
____pmd); \
} while (0)
-#define wait_split_huge_page(__vma, __pmd) \
- do { \
- pmd_t *____pmd = (__pmd); \
- struct address_space *__mapping = \
- vma->vm_file->f_mapping; \
- struct anon_vma *__anon_vma = (__vma)->anon_vma; \
- if (__mapping) \
- mutex_lock(&__mapping->i_mmap_mutex); \
- if (__anon_vma) { \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
- } \
- if (__mapping) \
- mutex_unlock(&__mapping->i_mmap_mutex); \
- BUG_ON(pmd_trans_splitting(*____pmd) || \
- pmd_trans_huge(*____pmd)); \
- } while (0)
+static inline void wait_split_huge_page(struct vm_area_struct *vma,
+ pmd_t *pmd)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+
+ if (mapping)
+ mutex_lock(&mapping->i_mmap_mutex);
+ if (vma->anon_vma) {
+ anon_vma_lock_write(vma->anon_vma);
+ anon_vma_unlock_write(vma->anon_vma);
+ }
+ if (mapping)
+ mutex_unlock(&mapping->i_mmap_mutex);
+ BUG_ON(pmd_trans_splitting(*pmd));
+ BUG_ON(pmd_trans_huge(*pmd));
+}
extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
pmd_t *pmd);
#if HPAGE_PMD_ORDER > MAX_ORDER
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a60f28..9fc126e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -19,7 +19,6 @@
#include <linux/shrinker.h>

struct mempolicy;
-struct anon_vma;
struct anon_vma_chain;
struct file_ra_state;
struct user_struct;
@@ -260,7 +259,6 @@ static inline int get_freepage_migratetype(struct page *page)
* files which need it (119 of them)
*/
#include <linux/page-flags.h>
-#include <linux/huge_mm.h>

/*
* Methods to modify the page usage count.
@@ -1475,6 +1473,28 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
for (avc = anon_vma_interval_tree_iter_first(root, start, last); \
avc; avc = anon_vma_interval_tree_iter_next(avc, start, last))

+static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
+{
+ down_write(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
+{
+ up_write(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
+{
+ down_read(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
+{
+ up_read(&anon_vma->root->rwsem);
+}
+
+#include <linux/huge_mm.h>
+
/* mmap.c */
extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
extern int vma_adjust(struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fb425aa..9805e55 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -453,4 +453,41 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return mm->cpu_vm_mask_var;
}

+/*
+ * The anon_vma heads a list of private "related" vmas, to scan if
+ * an anonymous page pointing to this anon_vma needs to be unmapped:
+ * the vmas on the list will be related by forking, or by splitting.
+ *
+ * Since vmas come and go as they are split and merged (particularly
+ * in mprotect), the mapping field of an anonymous page cannot point
+ * directly to a vma: instead it points to an anon_vma, on whose list
+ * the related vmas can be easily linked or unlinked.
+ *
+ * After unlinking the last vma on the list, we must garbage collect
+ * the anon_vma object itself: we're guaranteed no page can be
+ * pointing to this anon_vma once its vma list is empty.
+ */
+struct anon_vma {
+ struct anon_vma *root; /* Root of this anon_vma tree */
+ struct rw_semaphore rwsem; /* W: modification, R: walking the list */
+ /*
+ * The refcount is taken on an anon_vma when there is no
+ * guarantee that the vma of page tables will exist for
+ * the duration of the operation. A caller that takes
+ * the reference is responsible for clearing up the
+ * anon_vma if they are the last user on release
+ */
+ atomic_t refcount;
+
+ /*
+ * NOTE: the LSB of the rb_root.rb_node is set by
+ * mm_take_all_locks() _after_ taking the above lock. So the
+ * rb_root must only be read/written after taking the above lock
+ * to be sure to see a valid next pointer. The LSB bit itself
+ * is serialized by a system wide lock only visible to
+ * mm_take_all_locks() (mm_all_locks_mutex).
+ */
+ struct rb_root rb_root; /* Interval tree of private "related" vmas */
+};
+
#endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6dacb93..22c7278 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -11,43 +11,6 @@
#include <linux/memcontrol.h>

/*
- * The anon_vma heads a list of private "related" vmas, to scan if
- * an anonymous page pointing to this anon_vma needs to be unmapped:
- * the vmas on the list will be related by forking, or by splitting.
- *
- * Since vmas come and go as they are split and merged (particularly
- * in mprotect), the mapping field of an anonymous page cannot point
- * directly to a vma: instead it points to an anon_vma, on whose list
- * the related vmas can be easily linked or unlinked.
- *
- * After unlinking the last vma on the list, we must garbage collect
- * the anon_vma object itself: we're guaranteed no page can be
- * pointing to this anon_vma once its vma list is empty.
- */
-struct anon_vma {
- struct anon_vma *root; /* Root of this anon_vma tree */
- struct rw_semaphore rwsem; /* W: modification, R: walking the list */
- /*
- * The refcount is taken on an anon_vma when there is no
- * guarantee that the vma of page tables will exist for
- * the duration of the operation. A caller that takes
- * the reference is responsible for clearing up the
- * anon_vma if they are the last user on release
- */
- atomic_t refcount;
-
- /*
- * NOTE: the LSB of the rb_root.rb_node is set by
- * mm_take_all_locks() _after_ taking the above lock. So the
- * rb_root must only be read/written after taking the above lock
- * to be sure to see a valid next pointer. The LSB bit itself
- * is serialized by a system wide lock only visible to
- * mm_take_all_locks() (mm_all_locks_mutex).
- */
- struct rb_root rb_root; /* Interval tree of private "related" vmas */
-};
-
-/*
* The copy-on-write semantics of fork mean that an anon_vma
* can become associated with multiple processes. Furthermore,
* each child process will have its own anon_vma, where new
@@ -118,27 +81,6 @@ static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
up_write(&anon_vma->root->rwsem);
}

-static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
-{
- down_write(&anon_vma->root->rwsem);
-}
-
-static inline void anon_vma_unlock_write(struct anon_vma *anon_vma)
-{
- up_write(&anon_vma->root->rwsem);
-}
-
-static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
-{
- down_read(&anon_vma->root->rwsem);
-}
-
-static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
-{
- up_read(&anon_vma->root->rwsem);
-}
-
-
/*
* anon_vma helper functions.
*/
diff --git a/mm/memory.c b/mm/memory.c
index c845cf2..2f4fb39 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -589,7 +589,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long address)
{
pgtable_t new = pte_alloc_one(mm, address);
- int wait_split_huge_page;
+ int wait_split;
if (!new)
return -ENOMEM;

@@ -609,17 +609,17 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */

spin_lock(&mm->page_table_lock);
- wait_split_huge_page = 0;
+ wait_split = 0;
if (likely(pmd_none(*pmd))) { /* Has another populated it ? */
mm->nr_ptes++;
pmd_populate(mm, pmd, new);
new = NULL;
} else if (unlikely(pmd_trans_splitting(*pmd)))
- wait_split_huge_page = 1;
+ wait_split = 1;
spin_unlock(&mm->page_table_lock);
if (new)
pte_free(mm, new);
- if (wait_split_huge_page)
+ if (wait_split)
wait_split_huge_page(vma, pmd);
return 0;
}
--
Kirill A. Shutemov

2013-06-03 15:53:20

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too


On 06/03/2013 08:02 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <[email protected]
>>> -#define wait_split_huge_page(__anon_vma, __pmd) \
>>> +#define wait_split_huge_page(__vma, __pmd) \
>>> do { \
>>> pmd_t *____pmd = (__pmd); \
>>> - anon_vma_lock_write(__anon_vma); \
>>> - anon_vma_unlock_write(__anon_vma); \
>>> + struct address_space *__mapping = \
>>> + vma->vm_file->f_mapping; \
>>> + struct anon_vma *__anon_vma = (__vma)->anon_vma; \
>>> + if (__mapping) \
>>> + mutex_lock(&__mapping->i_mmap_mutex); \
>>> + if (__anon_vma) { \
>>> + anon_vma_lock_write(__anon_vma); \
>>> + anon_vma_unlock_write(__anon_vma); \
>>> + } \
>>> + if (__mapping) \
>>> + mutex_unlock(&__mapping->i_mmap_mutex); \
>>> BUG_ON(pmd_trans_splitting(*____pmd) || \
>>> pmd_trans_huge(*____pmd)); \
>>> } while (0)
...
>> Could you also describe the lengths to which you've gone to try and keep
>> this macro from growing in to any larger of an abomination. Is it truly
>> _impossible_ to turn this in to a normal function? Or will it simply be
>> a larger amount of work that you can do right now? What would it take?
>
> Okay, I've tried once again. The patch is below. It looks too invasive for
> me. What do you think?

That patch looks great to me, actually. It really looks to just be
superficially moving code around. The diffstat is even, too:

> include/linux/huge_mm.h | 35 ++++++++++++++--------------
> include/linux/mm.h | 24 +++++++++++++++++--
> include/linux/mm_types.h | 37 +++++++++++++++++++++++++++++
> include/linux/rmap.h | 58 -----------------------------------------------
> mm/memory.c | 8 +++---
> 5 files changed, 81 insertions(+), 81 deletions(-)

2013-06-03 16:08:47

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 23/39] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

Dave Hansen wrote:
>
> On 06/03/2013 08:02 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> >>> From: "Kirill A. Shutemov" <[email protected]
> >>> -#define wait_split_huge_page(__anon_vma, __pmd) \
> >>> +#define wait_split_huge_page(__vma, __pmd) \
> >>> do { \
> >>> pmd_t *____pmd = (__pmd); \
> >>> - anon_vma_lock_write(__anon_vma); \
> >>> - anon_vma_unlock_write(__anon_vma); \
> >>> + struct address_space *__mapping = \
> >>> + vma->vm_file->f_mapping; \
> >>> + struct anon_vma *__anon_vma = (__vma)->anon_vma; \
> >>> + if (__mapping) \
> >>> + mutex_lock(&__mapping->i_mmap_mutex); \
> >>> + if (__anon_vma) { \
> >>> + anon_vma_lock_write(__anon_vma); \
> >>> + anon_vma_unlock_write(__anon_vma); \
> >>> + } \
> >>> + if (__mapping) \
> >>> + mutex_unlock(&__mapping->i_mmap_mutex); \
> >>> BUG_ON(pmd_trans_splitting(*____pmd) || \
> >>> pmd_trans_huge(*____pmd)); \
> >>> } while (0)
> ...
> >> Could you also describe the lengths to which you've gone to try and keep
> >> this macro from growing in to any larger of an abomination. Is it truly
> >> _impossible_ to turn this in to a normal function? Or will it simply be
> >> a larger amount of work that you can do right now? What would it take?
> >
> > Okay, I've tried once again. The patch is below. It looks too invasive for
> > me. What do you think?
>
> That patch looks great to me, actually. It really looks to just be
> superficially moving code around. The diffstat is even, too:

One blocker I see is the new dependency <linux/mm.h> -> <linux/fs.h>.
It makes the header file nightmare worse.

--
Kirill A. Shutemov

2013-06-07 15:08:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages

Kirill A. Shutemov wrote:
> Dave Hansen wrote:
> > Which reminds me... Why do we handle their reference counts differently? :)
> >
> > It seems like we could easily put a for loop in delete_from_page_cache()
> > that will release their reference counts along with the head page.
> > Wouldn't that make the code less special-cased for tail pages?
>
> delete_from_page_cache() is not the only user of
> __delete_from_page_cache()...
>
> It seems I did it wrong in add_to_page_cache_locked(). We shouldn't take
> references on tail pages there, only one on head. On split it will be
> distributed properly.

This way:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b267859..c2c0df2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1556,6 +1556,7 @@ static void __split_huge_page_refcount(struct page *page,
struct zone *zone = page_zone(page);
struct lruvec *lruvec;
int tail_count = 0;
+ int init_tail_refcount;

/* prevent PageLRU to go away from under us, and freeze lru stats */
spin_lock_irq(&zone->lru_lock);
@@ -1565,6 +1566,13 @@ static void __split_huge_page_refcount(struct page *page,
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(page);

+ /*
+ * When we add a huge page to the page cache we take only a reference to
+ * the head page, but on split we need to take an additional reference to
+ * all tail pages since they are still in the page cache after splitting.
+ */
+ init_tail_refcount = PageAnon(page) ? 0 : 1;
+
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
struct page *page_tail = page + i;

@@ -1587,8 +1595,9 @@ static void __split_huge_page_refcount(struct page *page,
* atomic_set() here would be safe on all archs (and
* not only on x86), it's safer to use atomic_add().
*/
- atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
- &page_tail->_count);
+ atomic_add(init_tail_refcount + page_mapcount(page) +
+ page_mapcount(page_tail) + 1,
+ &page_tail->_count);

/* after clearing PageTail the gup refcount can be released */
smp_mb();
--
Kirill A. Shutemov

2013-06-07 15:14:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > + if (PageTransHuge(page))
> > + offset = pos & ~HPAGE_PMD_MASK;
> > +
> > pagefault_disable();
> > - copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> > + copied = iov_iter_copy_from_user_atomic(
> > + page + (offset >> PAGE_CACHE_SHIFT),
> > + i, offset & ~PAGE_CACHE_MASK, bytes);
> > pagefault_enable();
> > flush_dcache_page(page);
>
> I think there's enough voodoo in there to warrant a comment or adding
> some temporary variables. There are three things going on that you want
> to convey:
>
> 1. Offset is normally <PAGE_SIZE, but you make it <HPAGE_PMD_SIZE if
> you are dealing with a huge page
> 2. (offset >> PAGE_CACHE_SHIFT) is always 0 for small pages since
> offset < PAGE_SIZE
> 3. "offset & ~PAGE_CACHE_MASK" does nothing for small-page offsets, but
> it turns a large-page offset back into a small-page offset.
>
> I think you can do it with something like this:
>
> 	int subpage_nr = 0;
> 	off_t smallpage_offset = offset;
> 	if (PageTransHuge(page)) {
> 		// we transform 'offset' to be an offset into the huge
> 		// page instead of inside the PAGE_SIZE page
> 		offset = pos & ~HPAGE_PMD_MASK;
> 		subpage_nr = (offset >> PAGE_CACHE_SHIFT);
> 	}
>
> > + copied = iov_iter_copy_from_user_atomic(
> > + page + subpage_nr,
> > + i, smallpage_offset, bytes);
>
>
> > @@ -2437,6 +2453,7 @@ again:
> > * because not all segments in the iov can be copied at
> > * once without a pagefault.
> > */
> > + offset = pos & ~PAGE_CACHE_MASK;
>
> Urg, and now it's *BACK* into a small-page offset?
>
> This means that 'offset' has two _different_ meanings and it morphs
> between them during the function a couple of times. That seems very
> error-prone to me.

I guess this way is better, right?

@@ -2382,6 +2393,7 @@ static ssize_t generic_perform_write(struct file *file,
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
void *fsdata;
+ int subpage_nr = 0;

offset = (pos & (PAGE_CACHE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2411,8 +2423,14 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);

+ if (PageTransHuge(page)) {
+ off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+ subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+ }
+
pagefault_disable();
- copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+ copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+ offset, bytes);
pagefault_enable();
flush_dcache_page(page);

--
Kirill A. Shutemov

2013-06-07 15:29:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 20/39] thp, mm: naive support of thp in generic read/write routines

On 06/07/2013 08:17 AM, Kirill A. Shutemov wrote:
<snip>
> I guess this way is better, right?
>
> @@ -2382,6 +2393,7 @@ static ssize_t generic_perform_write(struct file *file,
> unsigned long bytes; /* Bytes to write to page */
> size_t copied; /* Bytes copied from user */
> void *fsdata;
> + int subpage_nr = 0;
>
> offset = (pos & (PAGE_CACHE_SIZE - 1));
> bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
> @@ -2411,8 +2423,14 @@ again:
> if (mapping_writably_mapped(mapping))
> flush_dcache_page(page);
>
> + if (PageTransHuge(page)) {
> + off_t huge_offset = pos & ~HPAGE_PMD_MASK;
> + subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
> + }
> +
> pagefault_disable();
> - copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> + copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
> + offset, bytes);
> pagefault_enable();
> flush_dcache_page(page);

That looks substantially easier to understand to me. Nice.

2013-06-10 15:38:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages

On 06/07/2013 08:10 AM, Kirill A. Shutemov wrote:
> + /*
> + * When we add a huge page to the page cache we take only a reference to
> + * the head page, but on split we need to take an additional reference to
> + * all tail pages since they are still in the page cache after splitting.
> + */
> + init_tail_refcount = PageAnon(page) ? 0 : 1;

What's the "init" for in the name?

In add_to_page_cache_locked() in patch 12/39, you do
> + spin_lock_irq(&mapping->tree_lock);
> + for (i = 0; i < nr; i++) {
> + page_cache_get(page + i);

That looks to me to be taking references to the tail pages. What gives? :)

> for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
> struct page *page_tail = page + i;
>
> @@ -1587,8 +1595,9 @@ static void __split_huge_page_refcount(struct page *page,
> * atomic_set() here would be safe on all archs (and
> * not only on x86), it's safer to use atomic_add().
> */
> - atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
> - &page_tail->_count);
> + atomic_add(init_tail_refcount + page_mapcount(page) +
> + page_mapcount(page_tail) + 1,
> + &page_tail->_count);
>
> /* after clearing PageTail the gup refcount can be released */
> smp_mb();

This does look much better in general, though.

2013-06-10 17:39:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 14/39] thp, mm: rewrite delete_from_page_cache() to support huge pages

Dave Hansen wrote:
> On 06/07/2013 08:10 AM, Kirill A. Shutemov wrote:
> > + /*
> > + * When we add a huge page to the page cache we take only a reference to
> > + * the head page, but on split we need to take an additional reference to
> > + * all tail pages since they are still in the page cache after splitting.
> > + */
> > + init_tail_refcount = PageAnon(page) ? 0 : 1;
>
> What's the "init" for in the name?

initial_tail_refcount?

> In add_to_page_cache_locked() in patch 12/39, you do
> > + spin_lock_irq(&mapping->tree_lock);
> > + for (i = 0; i < nr; i++) {
> > + page_cache_get(page + i);
>
> That looks to me to be taking references to the tail pages. What gives? :)

The point is to drop this from add_to_page_cache_locked() and do the
refcount distribution on split instead.

--
Kirill A. Shutemov

2013-06-25 14:54:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > +static inline unsigned long mapping_align_mask(struct address_space *mapping)
> > +{
> > + if (mapping_can_have_hugepages(mapping))
> > + return PAGE_MASK & ~HPAGE_MASK;
> > + return get_align_mask();
> > +}
>
> get_align_mask() appears to be a bit more complicated to me than just a
> plain old mask. Are you sure you don't need to pick up any of its
> behavior for the mapping_can_have_hugepages() case?

get_align_mask() never returns a stricter mask than the one we use in
the mapping_can_have_hugepages() case.

I can modify it this way:

	unsigned long mask = get_align_mask();

	if (mapping_can_have_hugepages(mapping))
		mask &= PAGE_MASK & ~HPAGE_MASK;
	return mask;

But it looks more confusing to me. What do you think?
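
For reference, assuming x86-64 values (4KiB pages, 2MiB huge pages), the
mask in question works out as:

	/*
	 * PAGE_MASK               = ~0xfffUL
	 * HPAGE_MASK              = ~0x1fffffUL
	 * PAGE_MASK & ~HPAGE_MASK = 0x1ff000
	 *
	 * i.e. exactly the bits that select a 4KiB subpage inside a 2MiB
	 * region, so using it as the alignment mask makes
	 * get_unmapped_area() hand out 2MiB-aligned addresses for such
	 * mappings.
	 */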

--
Kirill A. Shutemov

2013-06-25 16:46:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv4 27/39] x86-64, mm: proper alignment mappings with hugepages

On 06/25/2013 07:56 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
>>> +static inline unsigned long mapping_align_mask(struct address_space *mapping)
>>> +{
>>> + if (mapping_can_have_hugepages(mapping))
>>> + return PAGE_MASK & ~HPAGE_MASK;
>>> + return get_align_mask();
>>> +}
>>
>> get_align_mask() appears to be a bit more complicated to me than just a
>> plain old mask. Are you sure you don't need to pick up any of its
>> behavior for the mapping_can_have_hugepages() case?
>
> get_align_mask() never returns a stricter mask than the one we use in
> the mapping_can_have_hugepages() case.
>
> I can modify it this way:
>
> 	unsigned long mask = get_align_mask();
>
> 	if (mapping_can_have_hugepages(mapping))
> 		mask &= PAGE_MASK & ~HPAGE_MASK;
> 	return mask;
>
> But it looks more confusing to me. What do you think?

Personally, I find that a *LOT* more clear. The &= pretty much spells
out what you said in your explanation: get_align_mask()'s mask can only
be made more strict when we encounter a huge page.

The relationship between the two masks is not apparent at all in your
original code. This is all nitpicking though, I just wanted to make
sure you'd considered if you were accidentally changing behavior.

2013-06-27 12:37:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv4 17/39] thp, mm: handle tail pages in page_cache_get_speculative()

Dave Hansen wrote:
> On 05/11/2013 06:23 PM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > For tail page we call __get_page_tail(). It has the same semantics, but
> > for tail page.
>
> page_cache_get_speculative() has a ~50-line comment above it with lots
> of scariness about grace periods and RCU. A two line comment saying
> that the semantics are the same doesn't make me feel great that you've
> done your homework here.

Okay. Will fix commit message and the comment.

> Are there any performance implications here? __get_page_tail() says:
> "It implements the slow path of get_page().".
> page_cache_get_speculative() seems awfully speculative which would make
> me think that it is part of a _fast_ path.

It's a slow path in the sense that we have to do more for a tail page
than for a non-compound or head page.

Probably we can make it a bit faster by unrolling the function calls and
doing only what is relevant for our case. Like this:

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ad60dcc..57ad1ae 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,8 @@ void release_pages(struct page **pages, int nr, int cold);
*/
static inline int page_cache_get_speculative(struct page *page)
{
+ struct page *page_head = compound_trans_head(page);
+
VM_BUG_ON(in_interrupt());

#ifdef CONFIG_TINY_RCU
@@ -176,11 +178,11 @@ static inline int page_cache_get_speculative(struct page *page)
* disabling preempt, and hence no need for the "speculative get" that
* SMP requires.
*/
- VM_BUG_ON(page_count(page) == 0);
+ VM_BUG_ON(page_count(page_head) == 0);
atomic_inc(&page->_count);

#else
- if (unlikely(!get_page_unless_zero(page))) {
+ if (unlikely(!get_page_unless_zero(page_head))) {
/*
* Either the page has been freed, or will be freed.
* In either case, retry here and the caller should
@@ -189,7 +191,23 @@ static inline int page_cache_get_speculative(struct page *page)
return 0;
}
#endif
- VM_BUG_ON(PageTail(page));
+
+ if (unlikely(PageTransTail(page))) {
+ unsigned long flags;
+ int got = 0;
+
+ flags = compound_lock_irqsave(page_head);
+ if (likely(PageTransTail(page))) {
+ atomic_inc(&page->_mapcount);
+ got = 1;
+ }
+ compound_unlock_irqrestore(page_head, flags);
+
+ if (unlikely(!got))
+ put_page(page_head);
+
+ return got;
+ }

return 1;
}

What do you think? Is it better?

--
Kirill A. Shutemov