2013-08-04 02:15:51

by Kirill A. Shutemov

Subject: [PATCHv5 00/23] Transparent huge page cache: phase 1, everything but mmap()

From: "Kirill A. Shutemov" <[email protected]>

This is the second part of my transparent huge page cache work.
It brings thp support for ramfs, but without mmap() -- mmap() support will
be posted separately.

Intro
-----

The goal of the project is to prepare the kernel infrastructure to handle
huge pages in the page cache.

To prove that the proposed changes are functional, we enable the feature
for the simplest file system -- ramfs. ramfs is not that useful by
itself, but it's a good pilot project.

Design overview
---------------

Every huge page is represented in the page cache radix tree by HPAGE_PMD_NR
(512 on x86-64) entries: one entry for the head page and HPAGE_PMD_NR-1
entries for the tail pages.

Radix tree manipulations are implemented in a batched way: we add and remove
a whole huge page at once, under a single tree_lock. To make this possible,
we extended the radix-tree interface to pre-allocate enough memory to insert
a number of *contiguous* elements (kudos to Matthew Wilcox).
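
To illustrate the intended call pattern (the details are in the
add_to_page_cache_locked() rework later in the series; 'page', 'offset',
'gfp_mask' and friends are the usual locals of that function), a batched
insert looks roughly like this:

	/* reserve enough radix-tree nodes for one huge page */
	error = radix_tree_maybe_preload_contig(HPAGE_CACHE_NR,
						gfp_mask & ~__GFP_HIGHMEM);
	if (error)
		return error;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_CACHE_NR; i++) {
		/* head page at 'offset', tail pages at the following indexes */
		error = radix_tree_insert(&mapping->page_tree,
					  offset + i, page + i);
		if (error)
			break;	/* unwind under the same tree_lock */
	}
	radix_tree_preload_end();
	spin_unlock_irq(&mapping->tree_lock);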

Huge pages can be added to the page cache in three ways:
- write(2) to a file;
- read(2) from a sparse file;
- a page fault on a sparse file.

Potentially, one more way is collapsing small pages, but that's outside the
initial implementation.

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time. There's
some room for speedup later.

Since mmap() isn't targeted by this patchset, we just split the huge page on
page fault.

To minimize memory overhead for small files, we set up an fops->release
helper -- simple_thp_release() -- which splits the last page in the file
when the last writer goes away.

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range. If a huge page is only partly in the range, we zero out
that part, exactly as we do for partial small pages.

split_huge_page() for file pages works similarly to anon pages, but we
walk mapping->i_mmap rather than anon_vma->rb_root. At the end we call
truncate_inode_pages() to drop small pages beyond i_size, if any.

The locking model around split_huge_page() is rather complicated and I still
don't feel confident about it. It looks like we need to serialize over
i_mutex in split_huge_page(), but that breaks the i_mutex->mmap_sem lock
ordering. I don't see how it can be fixed easily. Any ideas are welcome.

Performance indicators will be posted separately.

Please, review.

Kirill A. Shutemov (23):
radix-tree: implement preload for multiple contiguous elements
memcg, thp: charge huge cache pages
thp: compile-time and sysfs knob for thp pagecache
thp, mm: introduce mapping_can_have_hugepages() predicate
thp: represent file thp pages in meminfo and friends
thp, mm: rewrite add_to_page_cache_locked() to support huge pages
mm: trace filemap: dump page order
block: implement add_bdi_stat()
thp, mm: rewrite delete_from_page_cache() to support huge pages
thp, mm: warn if we try to use replace_page_cache_page() with THP
thp, mm: handle tail pages in page_cache_get_speculative()
thp, mm: add event counters for huge page alloc on file write or read
thp, mm: allocate huge pages in grab_cache_page_write_begin()
thp, mm: naive support of thp in generic_perform_write
mm, fs: avoid page allocation beyond i_size on read
thp, mm: handle transhuge pages in do_generic_file_read()
thp, libfs: initial thp support
thp: libfs: introduce simple_thp_release()
truncate: support huge pages
thp: handle file pages in split_huge_page()
thp: wait_split_huge_page(): serialize over i_mmap_mutex too
thp, mm: split huge page on mmap file page
ramfs: enable transparent huge page cache

Documentation/vm/transhuge.txt | 16 ++++
drivers/base/node.c | 4 +
fs/libfs.c | 80 ++++++++++++++++++-
fs/proc/meminfo.c | 3 +
fs/ramfs/file-mmu.c | 3 +-
fs/ramfs/inode.c | 6 +-
include/linux/backing-dev.h | 10 +++
include/linux/fs.h | 10 +++
include/linux/huge_mm.h | 53 ++++++++++++-
include/linux/mmzone.h | 1 +
include/linux/page-flags.h | 33 ++++++++
include/linux/pagemap.h | 48 +++++++++++-
include/linux/radix-tree.h | 11 +++
include/linux/vm_event_item.h | 4 +
include/trace/events/filemap.h | 7 +-
lib/radix-tree.c | 41 +++++++---
mm/Kconfig | 12 +++
mm/filemap.c | 171 +++++++++++++++++++++++++++++++++++------
mm/huge_memory.c | 116 ++++++++++++++++++++++++----
mm/memcontrol.c | 2 -
mm/memory.c | 4 +-
mm/truncate.c | 108 ++++++++++++++++++++------
mm/vmstat.c | 5 ++
23 files changed, 658 insertions(+), 90 deletions(-)

--
1.8.3.2


2013-08-04 02:14:28

by Kirill A. Shutemov

Subject: [PATCH 04/23] thp, mm: introduce mapping_can_have_hugepages() predicate

From: "Kirill A. Shutemov" <[email protected]>

mapping_can_have_hugepages() returns true if the mapping can have huge
pages. For now, just check for __GFP_COMP in the gfp mask of the mapping.
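
For illustration, callers later in the series use the predicate to decide
whether it is worth attempting a huge page allocation at all; a sketch of
such a caller (not part of this patch) looks roughly like:

	struct page *page = NULL;

	if (mapping_can_have_hugepages(mapping))
		page = alloc_pages(mapping_gfp_mask(mapping) | __GFP_COLD,
				   HPAGE_PMD_ORDER);
	if (!page)	/* fall back to a small page */
		page = page_cache_alloc_cold(mapping);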

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e8ca8cf..47b5082 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,20 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
(__force unsigned long)mask;
}

+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+ gfp_t gfp_mask = mapping_gfp_mask(m);
+
+ if (!transparent_hugepage_pagecache())
+ return false;
+
+ /*
+ * It's up to filesystem what gfp mask to use.
+ * The only part of GFP_TRANSHUGE which matters for us is __GFP_COMP.
+ */
+ return !!(gfp_mask & __GFP_COMP);
+}
+
/*
* The page cache can done in larger chunks than
* one page, because it allows for more efficient
--
1.8.3.2

2013-08-04 02:14:30

by Kirill A. Shutemov

Subject: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

From: "Kirill A. Shutemov" <[email protected]>

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

The radix tree is usually protected by a spinlock, which means we want to
pre-allocate the required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. That's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries into an address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
The feature costs about 5KiB per CPU; details below.

The worst case for adding N contiguous items is adding entries at indexes
(ULONG_MAX - N) to ULONG_MAX. It requires the nodes needed to insert a single
worst-case item, plus extra nodes if you cross the boundary from one node to
the next.

Preload uses a per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase the array size
to be able to handle 512 entries at once.

The size of the array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible values of RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
on 64-bit system the per-CPU feature overhead is
for preload array:
(30 - 21) * sizeof(void*) = 72 bytes
plus, if the preload array is full
(30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
total: 5112 bytes

on 32-bit system the per-CPU feature overhead is
for preload array:
(19 - 11) * sizeof(void*) = 32 bytes
plus, if the preload array is full
(19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
total: 2400 bytes

Since only THP uses batched preload at the moment, we disable it (set the
max preload to 1) if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
include/linux/radix-tree.h | 11 +++++++++++
lib/radix-tree.c | 41 ++++++++++++++++++++++++++++++++---------
2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do { \
(root)->rnode = NULL; \
} while (0)

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR 512
+#else
+#define RADIX_TREE_PRELOAD_NR 1
+#endif
+
/**
* Radix-tree synchronization
*
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..99ab73c 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
* The worst case is a zero height tree with just a single item at index 0,
* and then inserting an item at index ULONG_MAX. This requires 2 new branches
* of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
* Hence:
*/
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+ (RADIX_TREE_PRELOAD_MIN + \
+ DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))

/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
int nr;
- struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+ struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };

@@ -263,29 +271,35 @@ radix_tree_node_free(struct radix_tree_node *node)

/*
* Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail. On
- * success, return zero, with preemption disabled. On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
* __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ int preload_target = RADIX_TREE_PRELOAD_MIN +
+ DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+ if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+ "too large preload requested"))
+ return -ENOMEM;

preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+ while (rtp->nr < preload_target) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < preload_target)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +322,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
}
EXPORT_SYMBOL(radix_tree_preload);

@@ -320,13 +334,22 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfp_mask & __GFP_WAIT)
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
}
EXPORT_SYMBOL(radix_tree_maybe_preload);

+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_WAIT)
+ return __radix_tree_preload_contig(size, gfp_mask);
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ preempt_disable();
+ return 0;
+}
+
/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
--
1.8.3.2

2013-08-04 02:14:29

by Kirill A. Shutemov

Subject: [PATCH 02/23] memcg, thp: charge huge cache pages

From: "Kirill A. Shutemov" <[email protected]>

mem_cgroup_cache_charge() has a check for PageCompound(). The check
prevents charging huge cache pages.

I don't see a reason why the check is present. It looks like it's just
legacy (introduced in 52d4b9a "memcg: allocate all page_cgroup at boot").

Let's just drop it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
mm/memcontrol.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b6cd870..dc50c1a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3921,8 +3921,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,

if (mem_cgroup_disabled())
return 0;
- if (PageCompound(page))
- return 0;

if (!PageSwapCache(page))
ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
--
1.8.3.2

2013-08-04 02:15:59

by Kirill A. Shutemov

Subject: [PATCH 15/23] mm, fs: avoid page allocation beyond i_size on read

From: "Kirill A. Shutemov" <[email protected]>

I've noticed that we allocate an unneeded cache page on read beyond
i_size. Simple test case (I checked it on ramfs):

$ touch testfile
$ cat testfile

It triggers the 'no_cached_page' code path in do_generic_file_read().

It looks like a regression since commit a32ea1e. Let's fix it.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: NeilBrown <[email protected]>
---
mm/filemap.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 066bbff..c31d296 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1163,6 +1163,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
loff_t isize;
unsigned long nr, ret;

+ isize = i_size_read(inode);
+ if (!isize || index > (isize - 1) >> PAGE_CACHE_SHIFT)
+ goto out;
+
cond_resched();
find_page:
page = find_get_page(mapping, index);
--
1.8.3.2

2013-08-04 02:16:13

by Kirill A. Shutemov

Subject: [PATCH 16/23] thp, mm: handle transhuge pages in do_generic_file_read()

From: "Kirill A. Shutemov" <[email protected]>

If a transhuge page is already in the page cache (up to date and not under
readahead), we go the usual path: read from the relevant subpage (head or
tail).

If the page is not cached (a sparse file in the ramfs case) and the mapping
can have huge pages, we try to allocate a new one and read it.

If a page is not up to date or is under readahead, we have to move 'page' to
the head page of the compound page, since it represents the state of the
whole transhuge page. We will switch back to the relevant subpage when the
page is ready to be read (the 'page_ok' label).

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c31d296..ed65af5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1175,8 +1175,28 @@ find_page:
ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
- if (unlikely(page == NULL))
- goto no_cached_page;
+ if (unlikely(page == NULL)) {
+ if (mapping_can_have_hugepages(mapping))
+ goto no_cached_page_thp;
+ else
+ goto no_cached_page;
+ }
+ }
+ if (PageTransCompound(page)) {
+ struct page *head = compound_trans_head(page);
+
+ if (!PageReadahead(head) && PageUptodate(page))
+ goto page_ok;
+
+ /*
+ * Switch 'page' to head page. That's needed to handle
+ * readahead or make page uptodate.
+ * It will be switched back to the right tail page at
+ * the beginning of 'page_ok'.
+ */
+ page_cache_get(head);
+ page_cache_release(page);
+ page = head;
}
if (PageReadahead(page)) {
page_cache_async_readahead(mapping,
@@ -1198,6 +1218,18 @@ find_page:
unlock_page(page);
}
page_ok:
+ /* Switch back to relevant tail page, if needed */
+ if (PageTransCompoundCache(page) && !PageTransTail(page)) {
+ int off = index & HPAGE_CACHE_INDEX_MASK;
+ if (off) {
+ page_cache_get(page + off);
+ page_cache_release(page);
+ page += off;
+ }
+ }
+
+ VM_BUG_ON(page->index != index);
+
/*
* i_size must be checked after we know the page is Uptodate.
*
@@ -1329,6 +1361,27 @@ readpage_error:
page_cache_release(page);
goto out;

+no_cached_page_thp:
+ page = alloc_pages(mapping_gfp_mask(mapping) | __GFP_COLD,
+ HPAGE_PMD_ORDER);
+ if (!page) {
+ count_vm_event(THP_READ_ALLOC_FAILED);
+ goto no_cached_page;
+ }
+ count_vm_event(THP_READ_ALLOC);
+
+ error = add_to_page_cache_lru(page, mapping,
+ index & ~HPAGE_CACHE_INDEX_MASK, GFP_KERNEL);
+ if (!error)
+ goto readpage;
+
+ page_cache_release(page);
+ if (error != -EEXIST && error != -ENOSPC) {
+ desc->error = error;
+ goto out;
+ }
+
+ /* Fallback to small page */
no_cached_page:
/*
* Ok, it wasn't cached, so we need to create a new
--
1.8.3.2

2013-08-04 02:16:12

by Kirill A. Shutemov

Subject: [PATCH 23/23] ramfs: enable transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

ramfs is the simplest fs from the page cache point of view. Let's start
enabling the transparent huge page cache here.

ramfs pages are not movable[1] and switching to transhuge pages doesn't
affect that. We need to fix this eventually.

[1] http://lkml.org/lkml/2013/4/2/720

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/ramfs/file-mmu.c | 3 ++-
fs/ramfs/inode.c | 6 +++++-
2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ramfs/file-mmu.c b/fs/ramfs/file-mmu.c
index 4884ac5..3236e41 100644
--- a/fs/ramfs/file-mmu.c
+++ b/fs/ramfs/file-mmu.c
@@ -32,7 +32,7 @@

const struct address_space_operations ramfs_aops = {
.readpage = simple_readpage,
- .write_begin = simple_write_begin,
+ .write_begin = simple_thp_write_begin,
.write_end = simple_write_end,
.set_page_dirty = __set_page_dirty_no_writeback,
};
@@ -47,6 +47,7 @@ const struct file_operations ramfs_file_operations = {
.splice_read = generic_file_splice_read,
.splice_write = generic_file_splice_write,
.llseek = generic_file_llseek,
+ .release = simple_thp_release,
};

const struct inode_operations ramfs_file_inode_operations = {
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 39d1465..5dafdfc 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ /*
+ * TODO: make ramfs pages movable
+ */
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_TRANSHUGE & ~__GFP_MOVABLE);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
--
1.8.3.2

2013-08-04 02:16:10

by Kirill A. Shutemov

Subject: [PATCH 22/23] thp, mm: split huge page on mmap file page

From: "Kirill A. Shutemov" <[email protected]>

We are not ready to mmap file-backed transparent huge pages yet. Let's
split them on fault attempt.

Later we'll implement mmap() properly and this code path will be used for
fallback cases.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index ed65af5..f7857ef 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1743,6 +1743,8 @@ retry_find:
goto no_cached_page;
}

+ if (PageTransCompound(page))
+ split_huge_page(compound_trans_head(page));
if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
page_cache_release(page);
return ret | VM_FAULT_RETRY;
--
1.8.3.2

2013-08-04 02:16:08

by Kirill A. Shutemov

Subject: [PATCH 21/23] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

From: "Kirill A. Shutemov" <[email protected]>

We're going to have huge pages backed by files, so we need to modify
wait_split_huge_page() to support that.

We have two options:
- check whether the page is anon or not and serialize only over the required
  lock;
- always serialize over both locks.

The current implementation, in fact, guarantees that *all* pages on the vma
are not splitting, not only the page the pmd is pointing to.

For now I prefer the second option, since it's the safest: we provide the
same level of guarantees.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 15 ++++++++++++---
mm/huge_memory.c | 4 ++--
mm/memory.c | 4 ++--
3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 9a0a114..186f4f2 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -111,11 +111,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
__split_huge_page_pmd(__vma, __address, \
____pmd); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
+#define wait_split_huge_page(__vma, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
+ struct address_space *__mapping = (__vma)->vm_file ? \
+ (__vma)->vm_file->f_mapping : NULL; \
+ struct anon_vma *__anon_vma = (__vma)->anon_vma; \
+ if (__mapping) \
+ mutex_lock(&__mapping->i_mmap_mutex); \
+ if (__anon_vma) { \
+ anon_vma_lock_write(__anon_vma); \
+ anon_vma_unlock_write(__anon_vma); \
+ } \
+ if (__mapping) \
+ mutex_unlock(&__mapping->i_mmap_mutex); \
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d7c6830..9af643d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -911,7 +911,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
spin_unlock(&dst_mm->page_table_lock);
pte_free(dst_mm, pgtable);

- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ wait_split_huge_page(vma, src_pmd); /* src_vma */
goto out;
}
src_page = pmd_page(pmd);
@@ -1493,7 +1493,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&vma->vm_mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return -1;
} else {
/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 7d35f90..ea74ab1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -609,7 +609,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return 0;
}

@@ -1522,7 +1522,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
pmd, flags);
--
1.8.3.2

2013-08-04 02:16:07

by Kirill A. Shutemov

Subject: [PATCH 17/23] thp, libfs: initial thp support

From: "Kirill A. Shutemov" <[email protected]>

simple_readpage() and simple_write_end() are modified to handle huge
pages.

simple_thp_write_begin() is introduced to allocate huge pages on write.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/libfs.c | 53 +++++++++++++++++++++++++++++++++++++++++++++----
include/linux/fs.h | 7 +++++++
include/linux/pagemap.h | 8 ++++++++
3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3a3a9b5..934778b 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -364,7 +364,7 @@ EXPORT_SYMBOL(simple_setattr);

int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
+ clear_pagecache_page(page);
flush_dcache_page(page);
SetPageUptodate(page);
unlock_page(page);
@@ -424,9 +424,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,

/* zero the stale part of the page if we did a short copy */
if (copied < len) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user(page, from + copied, len - copied);
+ unsigned from;
+ if (PageTransHuge(page)) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user(page, from + copied, len - copied);
+ } else {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user(page, from + copied, len - copied);
+ }
}

if (!PageUptodate(page))
@@ -445,6 +450,46 @@ int simple_write_end(struct file *file, struct address_space *mapping,
return copied;
}

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+int simple_thp_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ struct page *page = NULL;
+ pgoff_t index;
+
+ index = pos >> PAGE_CACHE_SHIFT;
+
+ if (mapping_can_have_hugepages(mapping)) {
+ page = grab_cache_page_write_begin(mapping,
+ index & ~HPAGE_CACHE_INDEX_MASK,
+ flags | AOP_FLAG_TRANSHUGE);
+ /* fallback to small page */
+ if (!page) {
+ unsigned long offset;
+ offset = pos & ~PAGE_CACHE_MASK;
+ /* adjust the len to not cross small page boundary */
+ len = min_t(unsigned long,
+ len, PAGE_CACHE_SIZE - offset);
+ }
+ BUG_ON(page && !PageTransHuge(page));
+ }
+ if (!page)
+ return simple_write_begin(file, mapping, pos, len, flags,
+ pagep, fsdata);
+
+ *pagep = page;
+
+ if (!PageUptodate(page) && len != HPAGE_PMD_SIZE) {
+ unsigned from = pos & ~HPAGE_PMD_MASK;
+
+ zero_huge_user_segment(page, 0, from);
+ zero_huge_user_segment(page, from + len, HPAGE_PMD_SIZE);
+ }
+ return 0;
+}
+#endif
+
/*
* the inodes created here are not hashed. If you use iunique to generate
* unique inode values later for this filesystem, then you must take care
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d5f58b3..c1dbf43 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2553,6 +2553,13 @@ extern int simple_write_begin(struct file *file, struct address_space *mapping,
extern int simple_write_end(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+extern int simple_thp_write_begin(struct file *file,
+ struct address_space *mapping, loff_t pos, unsigned len,
+ unsigned flags, struct page **pagep, void **fsdata);
+#else
+#define simple_thp_write_begin simple_write_begin
+#endif

extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
extern ssize_t generic_read_dir(struct file *, char __user *, size_t, loff_t *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index d459b38..eb484f2 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -591,4 +591,12 @@ static inline int add_to_page_cache(struct page *page,
return error;
}

+static inline void clear_pagecache_page(struct page *page)
+{
+ if (PageTransHuge(page))
+ zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+ else
+ clear_highpage(page);
+}
+
#endif /* _LINUX_PAGEMAP_H */
--
1.8.3.2

2013-08-04 02:16:05

by Kirill A. Shutemov

Subject: [PATCH 19/23] truncate: support huge pages

From: "Kirill A. Shutemov" <[email protected]>

truncate_inode_pages_range() drops a whole huge page at once if it's fully
inside the range.

If a huge page is only partly in the range, we zero out that part,
exactly as we do for partial small pages.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/truncate.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 84 insertions(+), 24 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683..fcef7cb 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
{
pgoff_t start; /* inclusive */
pgoff_t end; /* exclusive */
- unsigned int partial_start; /* inclusive */
- unsigned int partial_end; /* exclusive */
+ bool partial_thp_start = false, partial_thp_end = false;
struct pagevec pvec;
pgoff_t index;
int i;
@@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (mapping->nrpages == 0)
return;

- /* Offsets within partial pages */
- partial_start = lstart & (PAGE_CACHE_SIZE - 1);
- partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
-
/*
* 'start' and 'end' always covers the range of pages to be fully
- * truncated. Partial pages are covered with 'partial_start' at the
- * start of the range and 'partial_end' at the end of the range.
- * Note that 'end' is exclusive while 'lend' is inclusive.
+ * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
*/
start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (lend == -1)
@@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index >= end)
break;

+ if (PageTransTailCache(page)) {
+ /* part of already handled huge page */
+ if (!page->mapping)
+ continue;
+ /* the range starts in middle of huge page */
+ partial_thp_start = true;
+ start = index & ~HPAGE_CACHE_INDEX_MASK;
+ continue;
+ }
+ /* the range ends on huge page */
+ if (PageTransHugeCache(page) &&
+ index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
+ partial_thp_end = true;
+ end = index;
+ break;
+ }
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -265,34 +275,74 @@ void truncate_inode_pages_range(struct address_space *mapping,
index++;
}

- if (partial_start) {
- struct page *page = find_lock_page(mapping, start - 1);
+ if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
+ pgoff_t off;
+ struct page *page;
+ unsigned pstart, pend;
+ void (*zero_segment)(struct page *page,
+ unsigned start, unsigned len);
+retry_partial_start:
+ if (partial_thp_start) {
+ zero_segment = zero_huge_user_segment;
+ off = (start - 1) & ~HPAGE_CACHE_INDEX_MASK;
+ pstart = lstart & ~HPAGE_PMD_MASK;
+ if ((end & ~HPAGE_CACHE_INDEX_MASK) == off)
+ pend = (lend - 1) & ~HPAGE_PMD_MASK;
+ else
+ pend = HPAGE_PMD_SIZE;
+ } else {
+ zero_segment = zero_user_segment;
+ off = start - 1;
+ pstart = lstart & ~PAGE_CACHE_MASK;
+ if (start > end)
+ pend = (lend - 1) & ~PAGE_CACHE_MASK;
+ else
+ pend = PAGE_CACHE_SIZE;
+ }
+
+ page = find_get_page(mapping, off);
if (page) {
- unsigned int top = PAGE_CACHE_SIZE;
- if (start > end) {
- /* Truncation within a single page */
- top = partial_end;
- partial_end = 0;
+ /* the last tail page*/
+ if (PageTransTailCache(page)) {
+ partial_thp_start = true;
+ page_cache_release(page);
+ goto retry_partial_start;
}
+
+ lock_page(page);
wait_on_page_writeback(page);
- zero_user_segment(page, partial_start, top);
+ zero_segment(page, pstart, pend);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
- do_invalidatepage(page, partial_start,
- top - partial_start);
+ do_invalidatepage(page, pstart,
+ pend - pstart);
unlock_page(page);
page_cache_release(page);
}
}
- if (partial_end) {
- struct page *page = find_lock_page(mapping, end);
+ if (partial_thp_end || (lend + 1) & ~PAGE_CACHE_MASK) {
+ pgoff_t off;
+ struct page *page;
+ unsigned pend;
+ void (*zero_segment)(struct page *page,
+ unsigned start, unsigned len);
+ if (partial_thp_end) {
+ zero_segment = zero_huge_user_segment;
+ off = end & ~HPAGE_CACHE_INDEX_MASK;
+ pend = (lend - 1) & ~HPAGE_PMD_MASK;
+ } else {
+ zero_segment = zero_user_segment;
+ off = end;
+ pend = (lend - 1) & ~PAGE_CACHE_MASK;
+ }
+
+ page = find_lock_page(mapping, end);
if (page) {
wait_on_page_writeback(page);
- zero_user_segment(page, 0, partial_end);
+ zero_segment(page, 0, pend);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
- do_invalidatepage(page, 0,
- partial_end);
+ do_invalidatepage(page, 0, pend);
unlock_page(page);
page_cache_release(page);
}
@@ -327,6 +377,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index >= end)
break;

+ if (PageTransTailCache(page))
+ continue;
+
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
@@ -401,6 +454,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (index > end)
break;

+ /* skip huge page if it's not fully in the range */
+ if (PageTransHugeCache(page) &&
+ index + HPAGE_CACHE_NR - 1 > end)
+ continue;
+ if (PageTransTailCache(page))
+ continue;
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
--
1.8.3.2

2013-08-04 02:16:03

by Kirill A. Shutemov

Subject: [PATCH 18/23] thp: libfs: introduce simple_thp_release()

From: "Kirill A. Shutemov" <[email protected]>

simple_thp_release() is a dummy implementation of fops->release with
transparent huge page support. It's required to minimize memory overhead
of huge pages for small files.

It checks whether we should split the last page in the file to give
memory back to the system.

We split the page if it meets the following criteria:
- nobody has the file open for write;
- splitting will actually free some memory (at least one small page);
- it's a huge page ;)

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/libfs.c | 27 +++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 29 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 934778b..c43b055 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -488,6 +488,33 @@ int simple_thp_write_begin(struct file *file, struct address_space *mapping,
}
return 0;
}
+
+int simple_thp_release(struct inode *inode, struct file *file)
+{
+ pgoff_t last_index;
+ struct page *page;
+
+ /* check if anybody still writes to file */
+ if (atomic_read(&inode->i_writecount) != !!(file->f_mode & FMODE_WRITE))
+ return 0;
+
+ last_index = i_size_read(inode) >> PAGE_CACHE_SHIFT;
+
+ /* check if splitting the page will free any memory */
+ if ((last_index & HPAGE_CACHE_INDEX_MASK) + 1 == HPAGE_CACHE_NR)
+ return 0;
+
+ page = find_get_page(file->f_mapping,
+ last_index & ~HPAGE_CACHE_INDEX_MASK);
+ if (!page)
+ return 0;
+
+ if (PageTransHuge(page))
+ split_huge_page(page);
+
+ page_cache_release(page);
+ return 0;
+}
#endif

/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c1dbf43..b594f10 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2557,8 +2557,10 @@ extern int simple_write_end(struct file *file, struct address_space *mapping,
extern int simple_thp_write_begin(struct file *file,
struct address_space *mapping, loff_t pos, unsigned len,
unsigned flags, struct page **pagep, void **fsdata);
+extern int simple_thp_release(struct inode *inode, struct file *file);
#else
#define simple_thp_write_begin simple_write_begin
+#define simple_thp_release NULL
#endif

extern struct dentry *simple_lookup(struct inode *, struct dentry *, unsigned int flags);
--
1.8.3.2

2013-08-04 02:15:55

by Kirill A. Shutemov

Subject: [PATCH 10/23] thp, mm: warn if we try to use replace_page_cache_page() with THP

From: "Kirill A. Shutemov" <[email protected]>

replace_page_cache_page() is only used by FUSE. It's unlikely that we
will support THP in the FUSE page cache any time soon.

Let's postpone the implementation of THP handling in replace_page_cache_page()
until somebody actually needs it. Return -EINVAL and WARN_ONCE() for now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index b75bdf5..28f4927 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -418,6 +418,10 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
{
int error;

+ if (WARN_ONCE(PageTransHuge(old) || PageTransHuge(new),
+ "unexpected transhuge page\n"))
+ return -EINVAL;
+
VM_BUG_ON(!PageLocked(old));
VM_BUG_ON(!PageLocked(new));
VM_BUG_ON(new->mapping);
--
1.8.3.2

2013-08-04 02:15:53

by Kirill A. Shutemov

Subject: [PATCH 20/23] thp: handle file pages in split_huge_page()

From: "Kirill A. Shutemov" <[email protected]>

The base scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

When we add a huge page to the page cache we take a reference only on the
head page, but on split we need to take an additional reference on all tail
pages, since they are still in the page cache after splitting.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 76 insertions(+), 13 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 523946c..d7c6830 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1580,6 +1580,7 @@ static void __split_huge_page_refcount(struct page *page,
struct zone *zone = page_zone(page);
struct lruvec *lruvec;
int tail_count = 0;
+ int initial_tail_refcount;

/* prevent PageLRU to go away from under us, and freeze lru stats */
spin_lock_irq(&zone->lru_lock);
@@ -1589,6 +1590,13 @@ static void __split_huge_page_refcount(struct page *page,
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(page);

+ /*
+ * When we add a huge page to page cache we take only reference to head
+ * page, but on split we need to take an additional reference to all tail
+ * pages since they are still in page cache after splitting.
+ */
+ initial_tail_refcount = PageAnon(page) ? 0 : 1;
+
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
struct page *page_tail = page + i;

@@ -1611,8 +1619,9 @@ static void __split_huge_page_refcount(struct page *page,
* atomic_set() here would be safe on all archs (and
* not only on x86), it's safer to use atomic_add().
*/
- atomic_add(page_mapcount(page) + page_mapcount(page_tail) + 1,
- &page_tail->_count);
+ atomic_add(initial_tail_refcount + page_mapcount(page) +
+ page_mapcount(page_tail) + 1,
+ &page_tail->_count);

/* after clearing PageTail the gup refcount can be released */
smp_mb();
@@ -1651,23 +1660,23 @@ static void __split_huge_page_refcount(struct page *page,
*/
page_tail->_mapcount = page->_mapcount;

- BUG_ON(page_tail->mapping);
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
page_nid_xchg_last(page_tail, page_nid_last(page));

- BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
BUG_ON(!PageDirty(page_tail));
- BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec, list);
}
atomic_sub(tail_count, &page->_count);
BUG_ON(atomic_read(&page->_count) <= 0);

- __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+ if (PageAnon(page))
+ __mod_zone_page_state(zone, NR_ANON_TRANSPARENT_HUGEPAGES, -1);
+ else
+ __mod_zone_page_state(zone, NR_FILE_TRANSPARENT_HUGEPAGES, -1);

ClearPageCompound(page);
compound_unlock(page);
@@ -1767,7 +1776,7 @@ static int __split_huge_page_map(struct page *page,
}

/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
struct anon_vma *anon_vma,
struct list_head *list)
{
@@ -1791,7 +1800,7 @@ static void __split_huge_page(struct page *page,
* and establishes a child pmd before
* __split_huge_page_splitting() freezes the parent pmd (so if
* we fail to prevent copy_huge_pmd() from running until the
- * whole __split_huge_page() is complete), we will still see
+ * whole __split_anon_huge_page() is complete), we will still see
* the newly established pmd of the child later during the
* walk, to be able to set it as pmd_trans_splitting too.
*/
@@ -1822,14 +1831,11 @@ static void __split_huge_page(struct page *page,
* from the hugepage.
* Return 0 if the hugepage is split successfully otherwise return 1.
*/
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int split_anon_huge_page(struct page *page, struct list_head *list)
{
struct anon_vma *anon_vma;
int ret = 1;

- BUG_ON(is_huge_zero_page(page));
- BUG_ON(!PageAnon(page));
-
/*
* The caller does not necessarily hold an mmap_sem that would prevent
* the anon_vma disappearing so we first we take a reference to it
@@ -1847,7 +1853,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
goto out_unlock;

BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma, list);
+ __split_anon_huge_page(page, anon_vma, list);
count_vm_event(THP_SPLIT);

BUG_ON(PageCompound(page));
@@ -1858,6 +1864,63 @@ out:
return ret;
}

+static int split_file_huge_page(struct page *page, struct list_head *list)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ int mapcount, mapcount2;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ mapcount = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+
+ if (mapcount != page_mapcount(page))
+ printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+ mapcount, page_mapcount(page));
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page, list);
+
+ mapcount2 = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+
+ if (mapcount != mapcount2)
+ printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+ mapcount, mapcount2, page_mapcount(page));
+ BUG_ON(mapcount != mapcount2);
+ count_vm_event(THP_SPLIT);
+ mutex_unlock(&mapping->i_mmap_mutex);
+
+ /*
+ * Drop small pages beyond i_size if any.
+ *
+ * XXX: do we need to serialize over i_mutex here?
+ * If yes, how to get mmap_sem vs. i_mutex ordering fixed?
+ */
+ truncate_inode_pages(mapping, i_size_read(mapping->host));
+ return 0;
+}
+
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+ BUG_ON(is_huge_zero_page(page));
+
+ if (PageAnon(page))
+ return split_anon_huge_page(page, list);
+ else
+ return split_file_huge_page(page, list);
+}
+
#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
--
1.8.3.2

2013-08-04 02:15:49

by Kirill A. Shutemov

Subject: [PATCH 12/23] thp, mm: add event counters for huge page alloc on file write or read

From: "Kirill A. Shutemov" <[email protected]>

Existing stats specify the source of a THP page: fault or collapse. We're
going to allocate new huge pages on write(2) and read(2). That's neither
fault nor collapse.

Let's introduce new events for that.
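
The new counters are bumped right at the allocation sites; for example, the
write path added later in the series (grab_cache_page_write_begin()) does
roughly:

	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
	if (page)
		count_vm_event(THP_WRITE_ALLOC);
	else
		count_vm_event(THP_WRITE_ALLOC_FAILED);

The read path does the same with THP_READ_ALLOC/THP_READ_ALLOC_FAILED.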

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 7 +++++++
include/linux/huge_mm.h | 5 +++++
include/linux/vm_event_item.h | 4 ++++
mm/vmstat.c | 4 ++++
4 files changed, 20 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4cc15c4..a78f738 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -202,6 +202,10 @@ thp_collapse_alloc is incremented by khugepaged when it has found
a range of pages to collapse into one huge page and has
successfully allocated a new huge page to store the data.

+thp_write_alloc and thp_read_alloc are incremented every time a huge
+ page is successfully allocated to handle write(2) to a file or
+ read(2) from file.
+
thp_fault_fallback is incremented if a page fault fails to allocate
a huge page and instead falls back to using small pages.

@@ -209,6 +213,9 @@ thp_collapse_alloc_failed is incremented if khugepaged found a range
of pages that should be collapsed into one huge page but failed
the allocation.

+thp_write_alloc_failed and thp_read_alloc_failed are incremented if
+ huge page allocation failed when tried on write(2) or read(2).
+
thp_split is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4dc66c9..9a0a114 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -183,6 +183,11 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

+#define THP_WRITE_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_WRITE_ALLOC_FAILED ({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC ({ BUILD_BUG(); 0; })
+#define THP_READ_ALLOC_FAILED ({ BUILD_BUG(); 0; })
+
#define hpage_nr_pages(x) 1

#define transparent_hugepage_enabled(__vma) 0
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 1855f0a..8e071bb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -66,6 +66,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_WRITE_ALLOC,
+ THP_WRITE_ALLOC_FAILED,
+ THP_READ_ALLOC,
+ THP_READ_ALLOC_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ffe3fbd..a80ea59 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -815,6 +815,10 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_write_alloc",
+ "thp_write_alloc_failed",
+ "thp_read_alloc",
+ "thp_read_alloc_failed",
"thp_split",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
--
1.8.3.2

2013-08-04 02:15:46

by Kirill A. Shutemov

Subject: [PATCH 08/23] block: implement add_bdi_stat()

From: "Kirill A. Shutemov" <[email protected]>

We're going to add/remove a number of page cache entries at once. This
patch implements add_bdi_stat(), which adjusts bdi stats by an arbitrary
amount. It's required for batched page cache manipulations.
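
No caller is added by this patch. A hypothetical user removing 'nr' pages
from a dirty-accounting mapping in one batch would adjust the stat in a
single step, along these lines (illustration only, not part of the patch):

	/* drop accounting for 'nr' page cache pages at once */
	if (mapping_cap_account_dirty(mapping))
		add_bdi_stat(mapping->backing_dev_info,
			     BDI_RECLAIMABLE, -(s64)nr);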

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/backing-dev.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..7060180 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
__add_bdi_stat(bdi, item, -1);
}

+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, s64 amount)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __add_bdi_stat(bdi, item, amount);
+ local_irq_restore(flags);
+}
+
static inline void dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
{
--
1.8.3.2

2013-08-04 02:15:44

by Kirill A. Shutemov

Subject: [PATCH 13/23] thp, mm: allocate huge pages in grab_cache_page_write_begin()

From: "Kirill A. Shutemov" <[email protected]>

Try to allocate a huge page if flags has AOP_FLAG_TRANSHUGE set.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. The caller should take care of falling back to
small pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/fs.h | 1 +
mm/filemap.c | 24 ++++++++++++++++++++++--
2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index b09ddc0..d5f58b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -282,6 +282,7 @@ enum positive_aop_returns {
#define AOP_FLAG_NOFS 0x0004 /* used by filesystem to direct
* helper code (eg buffer layer)
* to clear GFP_FS from alloc */
+#define AOP_FLAG_TRANSHUGE 0x0008 /* allocate transhuge page */

/*
* oh the beauties of C type declarations.
diff --git a/mm/filemap.c b/mm/filemap.c
index 28f4927..b17ebb9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2313,18 +2313,38 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
gfp_t gfp_mask;
struct page *page;
gfp_t gfp_notmask = 0;
+ bool must_use_thp = (flags & AOP_FLAG_TRANSHUGE) &&
+ IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE);
+

gfp_mask = mapping_gfp_mask(mapping);
+ if (must_use_thp) {
+ BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+ BUG_ON(!(gfp_mask & __GFP_COMP));
+ }
if (mapping_cap_account_dirty(mapping))
gfp_mask |= __GFP_WRITE;
if (flags & AOP_FLAG_NOFS)
gfp_notmask = __GFP_FS;
repeat:
page = find_lock_page(mapping, index);
- if (page)
+ if (page) {
+ if (must_use_thp && !PageTransHuge(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ return NULL;
+ }
goto found;
+ }

- page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+ if (must_use_thp) {
+ page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+ if (page)
+ count_vm_event(THP_WRITE_ALLOC);
+ else
+ count_vm_event(THP_WRITE_ALLOC_FAILED);
+ } else
+ page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
if (!page)
return NULL;
status = add_to_page_cache_lru(page, mapping, index,
--
1.8.3.2

2013-08-04 02:15:42

by Kirill A. Shutemov

Subject: [PATCH 06/23] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
head page for the specified index and HPAGE_CACHE_NR-1 tail pages for the
following indexes.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
include/linux/huge_mm.h | 24 ++++++++++++++++++++++
include/linux/page-flags.h | 33 ++++++++++++++++++++++++++++++
mm/filemap.c | 50 +++++++++++++++++++++++++++++++++++-----------
3 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1534e1e..4dc66c9 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -230,6 +230,20 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str

#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+
+#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
+#else
+
+#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
+#endif
+
static inline bool transparent_hugepage_pagecache(void)
{
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
@@ -238,4 +252,14 @@ static inline bool transparent_hugepage_pagecache(void)
return false;
return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
}
+
+static inline int hpagecache_nr_pages(struct page *page)
+{
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+ return hpage_nr_pages(page);
+
+ BUG_ON(PageTransHuge(page));
+ return 1;
+}
+
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f1a5b59..7657de0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -452,6 +452,39 @@ static inline int PageTransTail(struct page *page)
}
#endif

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static inline int PageTransHugeCache(struct page *page)
+{
+ return PageTransHuge(page);
+}
+
+static inline int PageTransCompoundCache(struct page *page)
+{
+ return PageTransCompound(page);
+}
+
+static inline int PageTransTailCache(struct page *page)
+{
+ return PageTransTail(page);
+}
+#else
+
+static inline int PageTransHugeCache(struct page *page)
+{
+ return 0;
+}
+
+static inline int PageTransCompoundCache(struct page *page)
+{
+ return 0;
+}
+
+static inline int PageTransTailCache(struct page *page)
+{
+ return 0;
+}
+#endif
+
/*
* If network-based swap is enabled, sl*b must keep track of whether pages
* were allocated from pfmemalloc reserves.
diff --git a/mm/filemap.c b/mm/filemap.c
index ae5cc01..619e6cb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -460,38 +460,64 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
int error;
+ int i, nr;

VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));

+ /* memory cgroup controller handles thp pages on its side */
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
return error;

- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+ if (PageTransHugeCache(page))
+ BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+
+ nr = hpagecache_nr_pages(page);
+
+ error = radix_tree_maybe_preload_contig(nr, gfp_mask & ~__GFP_HIGHMEM);
if (error) {
mem_cgroup_uncharge_cache_page(page);
return error;
}

- page_cache_get(page);
- page->mapping = mapping;
- page->index = offset;
-
spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
+ page_cache_get(page);
+ for (i = 0; i < nr; i++) {
+ error = radix_tree_insert(&mapping->page_tree,
+ offset + i, page + i);
+ /*
+ * In the middle of a THP we can collide with a small page which was
+ * established before THP page cache is enabled or by another VMA
+ * with bad alignment (most likely MAP_FIXED).
+ */
+ if (error)
+ goto err_insert;
+ page[i].index = offset + i;
+ page[i].mapping = mapping;
+ }
radix_tree_preload_end();
- if (unlikely(error))
- goto err_insert;
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
+ mapping->nrpages += nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ if (PageTransHuge(page))
+ __inc_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
spin_unlock_irq(&mapping->tree_lock);
trace_mm_filemap_add_to_page_cache(page);
return 0;
err_insert:
- page->mapping = NULL;
- /* Leave page->index set: truncation relies upon it */
+ radix_tree_preload_end();
+ if (i != 0)
+ error = -ENOSPC; /* no space for a huge page */
+
+ /* page[i] was not inserted to tree, skip it */
+ i--;
+
+ for (; i >= 0; i--) {
+ /* Leave page->index set: truncation relies upon it */
+ page[i].mapping = NULL;
+ radix_tree_delete(&mapping->page_tree, offset + i);
+ }
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);
page_cache_release(page);
--
1.8.3.2

2013-08-04 02:15:39

by Kirill A. Shutemov

Subject: [PATCH 11/23] thp, mm: handle tail pages in page_cache_get_speculative()

From: "Kirill A. Shutemov" <[email protected]>

For tail pages we need to take two reference counts:
- ->_count on the head page;
- ->_mapcount on the tail page.

To protect against splitting, we take the compound lock and re-check that we
still have a tail page before taking the ->_mapcount reference.
If the page was split, we drop the ->_count reference on the head page and
return 0 to indicate to the caller that it must retry.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 47b5082..d459b38 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -161,6 +161,8 @@ void release_pages(struct page **pages, int nr, int cold);
*/
static inline int page_cache_get_speculative(struct page *page)
{
+ struct page *page_head = compound_trans_head(page);
+
VM_BUG_ON(in_interrupt());

#ifdef CONFIG_TINY_RCU
@@ -176,11 +178,11 @@ static inline int page_cache_get_speculative(struct page *page)
* disabling preempt, and hence no need for the "speculative get" that
* SMP requires.
*/
- VM_BUG_ON(page_count(page) == 0);
- atomic_inc(&page->_count);
+ VM_BUG_ON(page_count(page_head) == 0);
+ atomic_inc(&page_head->_count);

#else
- if (unlikely(!get_page_unless_zero(page))) {
+ if (unlikely(!get_page_unless_zero(page_head))) {
/*
* Either the page has been freed, or will be freed.
* In either case, retry here and the caller should
@@ -189,7 +191,23 @@ static inline int page_cache_get_speculative(struct page *page)
return 0;
}
#endif
- VM_BUG_ON(PageTail(page));
+
+ if (unlikely(PageTransTail(page))) {
+ unsigned long flags;
+ int got = 0;
+
+ flags = compound_lock_irqsave(page_head);
+ if (likely(PageTransTail(page))) {
+ atomic_inc(&page->_mapcount);
+ got = 1;
+ }
+ compound_unlock_irqrestore(page_head, flags);
+
+ if (unlikely(!got))
+ atomic_dec(&page_head->_count);
+
+ return got;
+ }

return 1;
}
--
1.8.3.2

2013-08-04 02:15:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 07/23] mm: trace filemap: dump page order

From: "Kirill A. Shutemov" <[email protected]>

Dump page order to trace to be able to distinguish between small page
and huge page in page cache.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
include/trace/events/filemap.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/filemap.h b/include/trace/events/filemap.h
index 0421f49..7e14b13 100644
--- a/include/trace/events/filemap.h
+++ b/include/trace/events/filemap.h
@@ -21,6 +21,7 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__field(struct page *, page)
__field(unsigned long, i_ino)
__field(unsigned long, index)
+ __field(int, order)
__field(dev_t, s_dev)
),

@@ -28,18 +29,20 @@ DECLARE_EVENT_CLASS(mm_filemap_op_page_cache,
__entry->page = page;
__entry->i_ino = page->mapping->host->i_ino;
__entry->index = page->index;
+ __entry->order = compound_order(page);
if (page->mapping->host->i_sb)
__entry->s_dev = page->mapping->host->i_sb->s_dev;
else
__entry->s_dev = page->mapping->host->i_rdev;
),

- TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu",
+ TP_printk("dev %d:%d ino %lx page=%p pfn=%lu ofs=%lu order=%d",
MAJOR(__entry->s_dev), MINOR(__entry->s_dev),
__entry->i_ino,
__entry->page,
page_to_pfn(__entry->page),
- __entry->index << PAGE_SHIFT)
+ __entry->index << PAGE_SHIFT,
+ __entry->order)
);

DEFINE_EVENT(mm_filemap_op_page_cache, mm_filemap_delete_from_page_cache,
--
1.8.3.2

2013-08-04 02:15:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 09/23] thp, mm: rewrite delete_from_page_cache() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 619e6cb..b75bdf5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
void __delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ int i, nr;

trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -127,13 +128,21 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);

- radix_tree_delete(&mapping->page_tree, page->index);
+ nr = hpagecache_nr_pages(page);
+ for (i = 0; i < nr; i++) {
+ page[i].mapping = NULL;
+ radix_tree_delete(&mapping->page_tree, page->index + i);
+ }
+ /* thp */
+ if (nr > 1)
+ __dec_zone_page_state(page, NR_FILE_TRANSPARENT_HUGEPAGES);
+
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
- mapping->nrpages--;
- __dec_zone_page_state(page, NR_FILE_PAGES);
+ mapping->nrpages -= nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
BUG_ON(page_mapped(page));

/*
@@ -144,8 +153,8 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+ add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}
}

--
1.8.3.2

2013-08-04 02:15:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 14/23] thp, mm: naive support of thp in generic_perform_write

From: "Kirill A. Shutemov" <[email protected]>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing storage.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index b17ebb9..066bbff 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2382,6 +2382,7 @@ static ssize_t generic_perform_write(struct file *file,
unsigned long bytes; /* Bytes to write to page */
size_t copied; /* Bytes copied from user */
void *fsdata;
+ int subpage_nr = 0;

offset = (pos & (PAGE_CACHE_SIZE - 1));
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
@@ -2411,8 +2412,14 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);

+ if (PageTransHuge(page)) {
+ off_t huge_offset = pos & ~HPAGE_PMD_MASK;
+ subpage_nr = huge_offset >> PAGE_CACHE_SHIFT;
+ }
+
pagefault_disable();
- copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+ copied = iov_iter_copy_from_user_atomic(page + subpage_nr, i,
+ offset, bytes);
pagefault_enable();
flush_dcache_page(page);

--
1.8.3.2

2013-08-04 02:15:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 05/23] thp: represent file thp pages in meminfo and friends

From: "Kirill A. Shutemov" <[email protected]>

The patch adds a new zone stat to count file transparent huge pages and
adjusts the related places.

For now we don't count mapped or dirty file thp pages separately.

The patch depends on patch
thp: account anon transparent huge pages into NR_ANON_PAGES

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
drivers/base/node.c | 4 ++++
fs/proc/meminfo.c | 3 +++
include/linux/mmzone.h | 1 +
mm/vmstat.c | 1 +
4 files changed, 9 insertions(+)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index bc9f43b..de261f5 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -119,6 +119,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d SUnreclaim: %8lu kB\n"
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"Node %d AnonHugePages: %8lu kB\n"
+ "Node %d FileHugePages: %8lu kB\n"
#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
@@ -140,6 +141,9 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
, nid,
K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
+ , nid,
+ K(node_page_state(nid, NR_FILE_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR));
#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 59d85d6..a62952c 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -104,6 +104,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "FileHugePages: %8lu kB\n"
#endif
,
K(i.totalram),
@@ -158,6 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
,K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
HPAGE_PMD_NR)
+ ,K(global_page_state(NR_FILE_TRANSPARENT_HUGEPAGES) *
+ HPAGE_PMD_NR)
#endif
);

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0c41d59..ba81833 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,7 @@ enum zone_stat_item {
NUMA_OTHER, /* allocation from other node */
#endif
NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_FILE_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 87228c5..ffe3fbd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -739,6 +739,7 @@ const char * const vmstat_text[] = {
"numa_other",
#endif
"nr_anon_transparent_hugepages",
+ "nr_file_transparent_hugepages",
"nr_free_cma",
"nr_dirty_threshold",
"nr_dirty_background_threshold",
--
1.8.3.2

2013-08-04 02:21:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache

From: "Kirill A. Shutemov" <[email protected]>

For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.

Radix tree preload overhead can be significant on BASE_SMALL systems, so
let's add a dependency on !BASE_SMALL.

/sys/kernel/mm/transparent_hugepage/page_cache is a runtime knob for the
feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
enabled.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 9 +++++++++
include/linux/huge_mm.h | 9 +++++++++
mm/Kconfig | 12 ++++++++++++
mm/huge_memory.c | 23 +++++++++++++++++++++++
4 files changed, 53 insertions(+)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 4a63953..4cc15c4 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -103,6 +103,15 @@ echo always >/sys/kernel/mm/transparent_hugepage/enabled
echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
echo never >/sys/kernel/mm/transparent_hugepage/enabled

+If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled, the kernel will use huge pages
+in the page cache if possible. It can be disabled and re-enabled via sysfs:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
+echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
+
+If it's disabled, the kernel will not add new huge pages to the page cache and
+will split them on mapping, but already mapped pages will stay intact.
+
It's also possible to limit defrag efforts in the VM to generate
hugepages in case they're not immediately free to madvise regions or
to never try to defrag memory and simply fallback to regular pages
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3935428..1534e1e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+ TRANSPARENT_HUGEPAGE_PAGECACHE,
TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
#ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
@@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_str

#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+static inline bool transparent_hugepage_pagecache(void)
+{
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
+ return false;
+ if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG)))
+ return false;
+ return transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 256bfd0..1e30ee8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -420,6 +420,18 @@ choice
benefit.
endchoice

+config TRANSPARENT_HUGEPAGE_PAGECACHE
+ bool "Transparent Hugepage Support for page cache"
+ depends on X86_64 && TRANSPARENT_HUGEPAGE
+ # avoid radix tree preload overhead
+ depends on !BASE_SMALL
+ default y
+ help
+ Enabling the option adds support hugepages for file-backed
+ mappings. It requires transparent hugepage support from
+ filesystem side. For now, the only filesystem which supports
+ hugepages is ramfs.
+
config CROSS_MEMORY_ATTACH
bool "Cross Memory Support"
depends on MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d96d921..523946c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -42,6 +42,9 @@ unsigned long transparent_hugepage_flags __read_mostly =
#endif
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+ (1<<TRANSPARENT_HUGEPAGE_PAGECACHE)|
+#endif
(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);

/* default scan 8*512 pte (or vmas) every 30 second */
@@ -362,6 +365,23 @@ static ssize_t defrag_store(struct kobject *kobj,
static struct kobj_attribute defrag_attr =
__ATTR(defrag, 0644, defrag_show, defrag_store);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+static ssize_t page_cache_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static ssize_t page_cache_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ return single_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_PAGECACHE);
+}
+static struct kobj_attribute page_cache_attr =
+ __ATTR(page_cache, 0644, page_cache_show, page_cache_store);
+#endif
+
static ssize_t use_zero_page_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
{
@@ -397,6 +417,9 @@ static struct kobj_attribute debug_cow_attr =
static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+ &page_cache_attr.attr,
+#endif
&use_zero_page_attr.attr,
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
--
1.8.3.2

2013-08-04 08:25:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 02/23] memcg, thp: charge huge cache pages

On Sun 04-08-13 05:17:04, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> mem_cgroup_cache_charge() has check for PageCompound(). The check
> prevents charging huge cache pages.
>
> I don't see a reason why the check is present. Looks like it's just
> legacy (introduced in 52d4b9a memcg: allocate all page_cgroup at boot).
>
> Let's just drop it.

If the page cache charging path only sees THP as compound pages then OK.
Can we keep at least VM_BUG_ON(PageCompound(page) && !PageTransHuge(page))?

Otherwise mem_cgroup_charge_common would be confused and charge such a
page as order-0.
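
A minimal sketch of how that could look, assuming the check sits where the
dropped PageCompound() bail-out used to be in mem_cgroup_cache_charge()
(fragment only, not a complete function):

	if (mem_cgroup_disabled())
		return 0;
	/*
	 * The page cache charging path should only see compound pages
	 * that are THP; anything else would be charged as order-0.
	 */
	VM_BUG_ON(PageCompound(page) && !PageTransHuge(page));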

> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Dave Hansen <[email protected]>

Other than that, looks good to me.
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/memcontrol.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index b6cd870..dc50c1a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3921,8 +3921,6 @@ int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
>
> if (mem_cgroup_disabled())
> return 0;
> - if (PageCompound(page))
> - return 0;
>
> if (!PageSwapCache(page))
> ret = mem_cgroup_charge_common(page, mm, gfp_mask, type);
> --
> 1.8.3.2
>

--
Michal Hocko
SUSE Labs

2013-08-05 00:29:45

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 15/23] mm, fs: avoid page allocation beyond i_size on read

On Sun, 4 Aug 2013 05:17:17 +0300 "Kirill A. Shutemov"
<[email protected]> wrote:

> From: "Kirill A. Shutemov" <[email protected]>
>
> I've noticed that we allocated unneeded page for cache on read beyond
> i_size. Simple test case (I checked it on ramfs):
>
> $ touch testfile
> $ cat testfile
>
> It triggers 'no_cached_page' code path in do_generic_file_read().
>
> Looks like it's regression since commit a32ea1e. Let's fix it.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: NeilBrown <[email protected]>
> ---
> mm/filemap.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 066bbff..c31d296 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1163,6 +1163,10 @@ static void do_generic_file_read(struct file *filp, loff_t *ppos,
> loff_t isize;
> unsigned long nr, ret;
>
> + isize = i_size_read(inode);
> + if (!isize || index > (isize - 1) >> PAGE_CACHE_SHIFT)
> + goto out;
> +
> cond_resched();
> find_page:
> page = find_get_page(mapping, index);

Looks good to me.

Acked-by: NeilBrown <[email protected]>

Thanks,
NeilBrown



2013-08-05 11:17:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The radix tree is variable-height, so an insert operation not only has
> to build the branch to its corresponding item, it also has to build the
> branch to existing items if the size has to be increased (by
> radix_tree_extend).
>
> The worst case is a zero height tree with just a single item at index 0,
> and then inserting an item at index ULONG_MAX. This requires 2 new branches
> of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
>
> Radix tree is usually protected by spin lock. It means we want to
> pre-allocate required memory before taking the lock.
>
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
>
> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.
> The feature costs about 5KiB per-CPU, details below.
>
> Worst case for adding N contiguous items is adding entries at indexes
> (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> item plus extra nodes if you cross the boundary from one node to the next.
>
> Preload uses per-CPU array to store nodes. The total cost of preload is
> "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> to be able to handle 512 entries at once.
>
> Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
>
> We have three possible RADIX_TREE_MAP_SHIFT:
>
> #ifdef __KERNEL__
> #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> #else
> #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> #endif
>
> On 64-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
>
> On 32-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
>
> On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> on 64-bit system the per-CPU feature overhead is
> for preload array:
> (30 - 21) * sizeof(void*) = 72 bytes
> plus, if the preload array is full
> (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> total: 5112 bytes
>
> on 32-bit system the per-CPU feature overhead is
> for preload array:
> (19 - 11) * sizeof(void*) = 32 bytes
> plus, if the preload array is full
> (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> total: 2400 bytes
>
> Since only THP uses batched preload at the moment, we disable (set max
> preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> changed in the future.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Dave Hansen <[email protected]>
> ---
> include/linux/radix-tree.h | 11 +++++++++++
> lib/radix-tree.c | 41 ++++++++++++++++++++++++++++++++---------
> 2 files changed, 43 insertions(+), 9 deletions(-)
...
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 7811ed3..99ab73c 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> * The worst case is a zero height tree with just a single item at index 0,
> * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *
> * Hence:
> */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> + (RADIX_TREE_PRELOAD_MIN + \
> + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
Umm, is this really correct? I see two problems:
1) You may need internal tree nodes at various levels but you seem to
account only for the level 1.
2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
require 3 nodes at level 1 if the indexes are like:
   i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
       ^                                ^
 node boundary                    node boundary

Otherwise the patch looks good.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-05 11:21:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 08/23] block: implement add_bdi_stat()

On Sun 04-08-13 05:17:10, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> We're going to add/remove a number of page cache entries at once. This
> patch implements add_bdi_stat() which adjusts bdi stats by arbitrary
> amount. It's required for batched page cache manipulations.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> include/linux/backing-dev.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..7060180 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -166,6 +166,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
> __add_bdi_stat(bdi, item, -1);
> }
>
> +static inline void add_bdi_stat(struct backing_dev_info *bdi,
> + enum bdi_stat_item item, s64 amount)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + __add_bdi_stat(bdi, item, amount);
> + local_irq_restore(flags);
> +}
> +
> static inline void dec_bdi_stat(struct backing_dev_info *bdi,
> enum bdi_stat_item item)
> {
> --
> 1.8.3.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-05 13:29:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 19/23] truncate: support huge pages

On Sun 04-08-13 05:17:21, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> truncate_inode_pages_range() drops whole huge page at once if it's fully
> inside the range.
>
> If a huge page is only partly in the range we zero out the part,
> exactly like we do for partial small pages.
>
> invalidate_mapping_pages() just skips huge pages if they are not fully
> in the range.
Umm, this is not a new problem but with THP pagecache it will become more
visible: When we punch holes within a file like <0..2MB>, <2MB-4MB>
(presuming 4 MB hugepages), then we won't free the underlying huge page for
the range 0..4MB. Maybe for the initial implementation it doesn't matter, but we
should at least note it in truncate_inode_pages_range() so that people are
aware of this.
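
As a sketch of the kind of note that could go into truncate_inode_pages_range()
(wording is only a suggestion):

	/*
	 * Note: a huge page that is only partially inside the range is
	 * zeroed rather than split, so punching two adjacent holes that
	 * each cover only part of a huge page never frees the page itself.
	 */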

Otherwise the patch looks OK to me. So you can add:
Reviewed-by: Jan Kara <[email protected]>

Honza

> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/truncate.c | 108 +++++++++++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 84 insertions(+), 24 deletions(-)
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 353b683..fcef7cb 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> {
> pgoff_t start; /* inclusive */
> pgoff_t end; /* exclusive */
> - unsigned int partial_start; /* inclusive */
> - unsigned int partial_end; /* exclusive */
> + bool partial_thp_start = false, partial_thp_end = false;
> struct pagevec pvec;
> pgoff_t index;
> int i;
> @@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (mapping->nrpages == 0)
> return;
>
> - /* Offsets within partial pages */
> - partial_start = lstart & (PAGE_CACHE_SIZE - 1);
> - partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
> -
> /*
> * 'start' and 'end' always covers the range of pages to be fully
> - * truncated. Partial pages are covered with 'partial_start' at the
> - * start of the range and 'partial_end' at the end of the range.
> - * Note that 'end' is exclusive while 'lend' is inclusive.
> + * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
> */
> start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
> if (lend == -1)
> @@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (index >= end)
> break;
>
> + if (PageTransTailCache(page)) {
> + /* part of already handled huge page */
> + if (!page->mapping)
> + continue;
> + /* the range starts in middle of huge page */
> + partial_thp_start = true;
> + start = index & ~HPAGE_CACHE_INDEX_MASK;
> + continue;
> + }
> + /* the range ends on huge page */
> + if (PageTransHugeCache(page) &&
> + index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
> + partial_thp_end = true;
> + end = index;
> + break;
> + }
> +
> if (!trylock_page(page))
> continue;
> WARN_ON(page->index != index);
> @@ -265,34 +275,74 @@ void truncate_inode_pages_range(struct address_space *mapping,
> index++;
> }
>
> - if (partial_start) {
> - struct page *page = find_lock_page(mapping, start - 1);
> + if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
> + pgoff_t off;
> + struct page *page;
> + unsigned pstart, pend;
> + void (*zero_segment)(struct page *page,
> + unsigned start, unsigned len);
> +retry_partial_start:
> + if (partial_thp_start) {
> + zero_segment = zero_huge_user_segment;
> + off = (start - 1) & ~HPAGE_CACHE_INDEX_MASK;
> + pstart = lstart & ~HPAGE_PMD_MASK;
> + if ((end & ~HPAGE_CACHE_INDEX_MASK) == off)
> + pend = (lend - 1) & ~HPAGE_PMD_MASK;
> + else
> + pend = HPAGE_PMD_SIZE;
> + } else {
> + zero_segment = zero_user_segment;
> + off = start - 1;
> + pstart = lstart & ~PAGE_CACHE_MASK;
> + if (start > end)
> + pend = (lend - 1) & ~PAGE_CACHE_MASK;
> + else
> + pend = PAGE_CACHE_SIZE;
> + }
> +
> + page = find_get_page(mapping, off);
> if (page) {
> - unsigned int top = PAGE_CACHE_SIZE;
> - if (start > end) {
> - /* Truncation within a single page */
> - top = partial_end;
> - partial_end = 0;
> + /* the last tail page*/
> + if (PageTransTailCache(page)) {
> + partial_thp_start = true;
> + page_cache_release(page);
> + goto retry_partial_start;
> }
> +
> + lock_page(page);
> wait_on_page_writeback(page);
> - zero_user_segment(page, partial_start, top);
> + zero_segment(page, pstart, pend);
> cleancache_invalidate_page(mapping, page);
> if (page_has_private(page))
> - do_invalidatepage(page, partial_start,
> - top - partial_start);
> + do_invalidatepage(page, pstart,
> + pend - pstart);
> unlock_page(page);
> page_cache_release(page);
> }
> }
> - if (partial_end) {
> - struct page *page = find_lock_page(mapping, end);
> + if (partial_thp_end || (lend + 1) & ~PAGE_CACHE_MASK) {
> + pgoff_t off;
> + struct page *page;
> + unsigned pend;
> + void (*zero_segment)(struct page *page,
> + unsigned start, unsigned len);
> + if (partial_thp_end) {
> + zero_segment = zero_huge_user_segment;
> + off = end & ~HPAGE_CACHE_INDEX_MASK;
> + pend = (lend - 1) & ~HPAGE_PMD_MASK;
> + } else {
> + zero_segment = zero_user_segment;
> + off = end;
> + pend = (lend - 1) & ~PAGE_CACHE_MASK;
> + }
> +
> + page = find_lock_page(mapping, end);
> if (page) {
> wait_on_page_writeback(page);
> - zero_user_segment(page, 0, partial_end);
> + zero_segment(page, 0, pend);
> cleancache_invalidate_page(mapping, page);
> if (page_has_private(page))
> - do_invalidatepage(page, 0,
> - partial_end);
> + do_invalidatepage(page, 0, pend);
> unlock_page(page);
> page_cache_release(page);
> }
> @@ -327,6 +377,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (index >= end)
> break;
>
> + if (PageTransTailCache(page))
> + continue;
> +
> lock_page(page);
> WARN_ON(page->index != index);
> wait_on_page_writeback(page);
> @@ -401,6 +454,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (index > end)
> break;
>
> + /* skip huge page if it's not fully in the range */
> + if (PageTransHugeCache(page) &&
> + index + HPAGE_CACHE_NR - 1 > end)
> + continue;
> + if (PageTransTailCache(page))
> + continue;
> +
> if (!trylock_page(page))
> continue;
> WARN_ON(page->index != index);
> --
> 1.8.3.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-06 16:34:21

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

On Mon, Aug 05, 2013 at 01:17:39PM +0200, Jan Kara wrote:
> On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > * The worst case is a zero height tree with just a single item at index 0,
> > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> > * Hence:
> > */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > + (RADIX_TREE_PRELOAD_MIN + \
> > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>
> Umm, is this really correct? I see two problems:
> 1) You may need internal tree nodes at various levels but you seem to
> account only for the level 1.
> 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> require 3 nodes at level 1 if the indexes are like:
> i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> ^ ^
> node boundary node boundary
>
> Otherwise the patch looks good.

You are correct that in the fully general case, these things are needed,
and the patch undercounts the number of nodes needed. However, in the
specific case of THP pagecache, insertions are naturally aligned, and
we end up needing very few internal nodes (so few that we've never hit
the end of this array in some fairly heavy testing).

There are two penalties for getting the general case correct. One is
that the calculation becomes harder to understand, and the other is
that we consume more per-CPU memory. I think we should document that
the current code requires "natural alignment", and include a note about
what things will need to change in order to support arbitrary alignment
in case anybody needs to do it in the future, but not include support
for arbitrary alignment in this patchset.

What do you think?

2013-08-06 20:17:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

On Tue 06-08-13 12:34:14, Matthew Wilcox wrote:
> On Mon, Aug 05, 2013 at 01:17:39PM +0200, Jan Kara wrote:
> > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <[email protected]>
> > > The radix tree is variable-height, so an insert operation not only has
> > > to build the branch to its corresponding item, it also has to build the
> > > branch to existing items if the size has to be increased (by
> > > radix_tree_extend).
> > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > * The worst case is a zero height tree with just a single item at index 0,
> > > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > + *
> > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > + *
> > > * Hence:
> > > */
> > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MAX \
> > > + (RADIX_TREE_PRELOAD_MIN + \
> > > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> >
> > Umm, is this really correct? I see two problems:
> > 1) You may need internal tree nodes at various levels but you seem to
> > account only for the level 1.
> > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > require 3 nodes at level 1 if the indexes are like:
> > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > ^ ^
> > node boundary node boundary
> >
> > Otherwise the patch looks good.
>
> You are correct that in the fully general case, these things are needed,
> and the patch undercounts the number of nodes needed. However, in the
> specific case of THP pagecache, insertions are naturally aligned, and
> we end up needing very few internal nodes (so few that we've never hit
> the end of this array in some fairly heavy testing).
Have you checked that you really didn't hit the end of the array? Because if
we run out of preloaded nodes, we just try an atomic kmalloc to get new
nodes. And that has a high chance of success...

> There are two penalties for getting the general case correct. One is
> that the calculation becomes harder to understand, and the other is
> that we consume more per-CPU memory. I think we should document that
> the current code requires "natural alignment", and include a note about
> what things will need to change in order to support arbitrary alignment
> in case anybody needs to do it in the future, but not include support
> for arbitrary alignment in this patchset.
>
> What do you think?
I'm OK with expecting things to be naturally aligned if that's documented
and there's a WARN_ON_ONCE which checks that someone isn't misusing the code.
It is true that you actually implemented only a 'maybe_preload' variant for
contiguous extent preloading, so leaving out higher-level nodes might be OK.
But it would be good to at least document this at the place where you
estimate how many nodes to preload.
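
One possible shape for such a check, sketched at the insertion site in
add_to_page_cache_locked() where the offset is known (the preload interface
itself only sees a count); it assumes huge pages are always naturally aligned
in the mapping:

	/* The preload estimate relies on naturally aligned huge page ranges */
	if (PageTransHugeCache(page))
		WARN_ON_ONCE(offset & HPAGE_CACHE_INDEX_MASK);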

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-06 20:24:09

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 19/23] truncate: support huge pages

On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> + if (PageTransTailCache(page)) {
> + /* part of already handled huge page */
> + if (!page->mapping)
> + continue;
> + /* the range starts in middle of huge page */
> + partial_thp_start = true;
> + start = index & ~HPAGE_CACHE_INDEX_MASK;
> + continue;
> + }
> + /* the range ends on huge page */
> + if (PageTransHugeCache(page) &&
> + index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
> + partial_thp_end = true;
> + end = index;
> + break;
> + }

Still reading through the code, but that "index ==" line's indentation
is screwed up. It makes it look like "index == " is a line of code
instead of part of the if() condition.

2013-08-06 20:56:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 19/23] truncate: support huge pages

On Tue, Aug 06, 2013 at 01:23:48PM -0700, Dave Hansen wrote:
> On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> > + if (PageTransTailCache(page)) {
> > + /* part of already handled huge page */
> > + if (!page->mapping)
> > + continue;
> > + /* the range starts in middle of huge page */
> > + partial_thp_start = true;
> > + start = index & ~HPAGE_CACHE_INDEX_MASK;
> > + continue;
> > + }
> > + /* the range ends on huge page */
> > + if (PageTransHugeCache(page) &&
> > + index == (end & ~HPAGE_CACHE_INDEX_MASK)) {
> > + partial_thp_end = true;
> > + end = index;
> > + break;
> > + }
>
> Still reading through the code, but that "index ==" line's indentation
> is screwed up. It makes it look like "index == " is a line of code
> instead of part of the if() condition.

I screwed it up myself. Otherwise, the line is too long. :-/

Probably, I should move the partial page logic into a separate function.

--
Kirill A. Shutemov

2013-08-06 21:55:42

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 19/23] truncate: support huge pages

On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> If a huge page is only partly in the range we zero out the part,
> exactly like we do for partial small pages.

What's the logic behind this behaviour? Seems like the kind of place
that we would really want to be splitting pages.

> + if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
> + pgoff_t off;
> + struct page *page;
> + unsigned pstart, pend;
> + void (*zero_segment)(struct page *page,
> + unsigned start, unsigned len);
> +retry_partial_start:
> + if (partial_thp_start) {
> + zero_segment = zero_huge_user_segment;

That's a pretty hackish way to conditionally call a function, especially
since it's done twice in one function. :)

I seem to recall zero_user_segment() vs. zero_huge_user_segment() being
something that caused some ugliness in the previous versions too.
What's the barrier to just having a smart zero_..._user_segment()
function that can conditionally perform huge or base page-zeroing?

> + if (partial_thp_end) {
> + zero_segment = zero_huge_user_segment;
> + off = end & ~HPAGE_CACHE_INDEX_MASK;
> + pend = (lend - 1) & ~HPAGE_PMD_MASK;
> + } else {
> + zero_segment = zero_user_segment;
> + off = end;
> + pend = (lend - 1) & ~PAGE_CACHE_MASK;
> + }

We went through a similar exercise for the fault code (I think), but I
really think you need to refactor this. Way too much of the code is in
the style:

if (thp) {
	// new behavior
} else {
	// old behavior
}

To me, that's just a recipe that makes it hard to review, and I also bet
it'll make the thp code much more prone to bitrot. Maybe something like this?

size_t page_cache_mask = PAGE_CACHE_MASK;
unsigned long end_mask = 0UL;

if (partial_thp_end) {
	page_cache_mask = HPAGE_PMD_MASK;
	end_mask = HPAGE_CACHE_INDEX_MASK;
}
...
magic_zero_user_segment(...);
off = end & ~end_mask;
pend = (lend - 1) & ~page_cache_mask;

Like I said before, I somehow like to rewrite your code. :)

2013-08-07 16:29:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

Jan Kara wrote:
> On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> >
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> >
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> >
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> >
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 5KiB per-CPU, details below.
> >
> > Worst case for adding N contiguous items is adding entries at indexes
> > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > item plus extra nodes if you cross the boundary from one node to the next.
> >
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> >
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> >
> > We have three possible RADIX_TREE_MAP_SHIFT:
> >
> > #ifdef __KERNEL__
> > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > #else
> > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > #endif
> >
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> >
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> >
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > on 64-bit system the per-CPU feature overhead is
> > for preload array:
> > (30 - 21) * sizeof(void*) = 72 bytes
> > plus, if the preload array is full
> > (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > total: 5112 bytes
> >
> > on 32-bit system the per-CPU feature overhead is
> > for preload array:
> > (19 - 11) * sizeof(void*) = 32 bytes
> > plus, if the preload array is full
> > (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > total: 2400 bytes
> >
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> >
> > Signed-off-by: Matthew Wilcox <[email protected]>
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Acked-by: Dave Hansen <[email protected]>
> > ---
> > include/linux/radix-tree.h | 11 +++++++++++
> > lib/radix-tree.c | 41 ++++++++++++++++++++++++++++++++---------
> > 2 files changed, 43 insertions(+), 9 deletions(-)
> ...
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..99ab73c 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > * The worst case is a zero height tree with just a single item at index 0,
> > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> > * Hence:
> > */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > + (RADIX_TREE_PRELOAD_MIN + \
> > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> Umm, is this really correct? I see two problems:
> 1) You may need internal tree nodes at various levels but you seem to
> account only for the level 1.
> 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> require 3 nodes at level 1 if the indexes are like:
> i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> ^ ^
> node boundary node boundary

My bad. Let's try to calculate once again.

We want to insert N contiguous items without restriction on alignment.

Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.

The worst case scenario I can imagine is a tree with only one element at index
0, and we add N items where at least one index requires the max tree height and
we cross a boundary between items in the root node.

Basically, at least one index is less than

1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)

and at least one is equal or greater.

In this case we need:

- RADIX_TREE_MAX_PATH nodes to build a new path to the item at index 0;
- DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for the last-level nodes for the
  new items;
- 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build the paths to the last-level
  nodes (-2, because the root node and the last-level nodes are already
  accounted for).

The macro:

#define RADIX_TREE_PRELOAD_MAX \
( RADIX_TREE_MAX_PATH + \
DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
2 * (RADIX_TREE_MAX_PATH - 2) )

For a 64-bit system and N=512, RADIX_TREE_PRELOAD_MAX is 37.
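
As a quick cross-check of that number, here is a small user-space sketch that
plugs in the 64-bit, RADIX_TREE_MAP_SHIFT=6 values quoted in this thread
(RADIX_TREE_MAX_PATH = 11, RADIX_TREE_MAP_SIZE = 64, N = 512); the constants
are hard-coded here, not taken from kernel headers:

	#include <stdio.h>

	#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

	int main(void)
	{
		int max_path = 11;	/* RADIX_TREE_MAX_PATH: ceil(64 / 6) */
		int map_size = 64;	/* RADIX_TREE_MAP_SIZE: 1 << 6 */
		int nr = 512;		/* RADIX_TREE_PRELOAD_NR (HPAGE_PMD_NR) */
		int max;

		max = max_path +			/* path to index 0 */
		      DIV_ROUND_UP(nr, map_size) +	/* last-level nodes */
		      2 * (max_path - 2);		/* two partial paths */

		printf("RADIX_TREE_PRELOAD_MAX = %d\n", max);	/* prints 37 */
		return 0;
	}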

Does it sound correct? Any thoughts?

--
Kirill A. Shutemov

2013-08-07 20:00:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> Jan Kara wrote:
> > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <[email protected]>
> > >
> > > The radix tree is variable-height, so an insert operation not only has
> > > to build the branch to its corresponding item, it also has to build the
> > > branch to existing items if the size has to be increased (by
> > > radix_tree_extend).
> > >
> > > The worst case is a zero height tree with just a single item at index 0,
> > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > >
> > > Radix tree is usually protected by spin lock. It means we want to
> > > pre-allocate required memory before taking the lock.
> > >
> > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > one element. It's a hard limit. For transparent huge page cache we want
> > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > >
> > > This patch introduces radix_tree_preload_count(). It allows to
> > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > The feature costs about 5KiB per-CPU, details below.
> > >
> > > Worst case for adding N contiguous items is adding entries at indexes
> > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > item plus extra nodes if you cross the boundary from one node to the next.
> > >
> > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > to be able to handle 512 entries at once.
> > >
> > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > >
> > > We have three possible RADIX_TREE_MAP_SHIFT:
> > >
> > > #ifdef __KERNEL__
> > > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > > #else
> > > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > > #endif
> > >
> > > On 64-bit system:
> > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > >
> > > On 32-bit system:
> > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > >
> > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > on 64-bit system the per-CPU feature overhead is
> > > for preload array:
> > > (30 - 21) * sizeof(void*) = 72 bytes
> > > plus, if the preload array is full
> > > (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > > total: 5112 bytes
> > >
> > > on 32-bit system the per-CPU feature overhead is
> > > for preload array:
> > > (19 - 11) * sizeof(void*) = 32 bytes
> > > plus, if the preload array is full
> > > (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > > total: 2400 bytes
> > >
> > > Since only THP uses batched preload at the moment, we disable (set max
> > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > changed in the future.
> > >
> > > Signed-off-by: Matthew Wilcox <[email protected]>
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > Acked-by: Dave Hansen <[email protected]>
> > > ---
> > > include/linux/radix-tree.h | 11 +++++++++++
> > > lib/radix-tree.c | 41 ++++++++++++++++++++++++++++++++---------
> > > 2 files changed, 43 insertions(+), 9 deletions(-)
> > ...
> > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > index 7811ed3..99ab73c 100644
> > > --- a/lib/radix-tree.c
> > > +++ b/lib/radix-tree.c
> > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > * The worst case is a zero height tree with just a single item at index 0,
> > > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > + *
> > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > + *
> > > * Hence:
> > > */
> > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MAX \
> > > + (RADIX_TREE_PRELOAD_MIN + \
> > > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> > Umm, is this really correct? I see two problems:
> > 1) You may need internal tree nodes at various levels but you seem to
> > account only for the level 1.
> > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > require 3 nodes at level 1 if the indexes are like:
> > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > ^ ^
> > node boundary node boundary
>
> My bad. Let's try to calculate once again.
>
> We want to insert N contiguous items without restriction on alignment.
>
> Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
>
> Worst case scenario, I can imagine, is tree with only one element at index
> 0 and we add N items where at least one index requires max tree high and
> we cross boundary between items in root node.
>
> Basically, at least one index is less then
>
> 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
>
> and one equal or more.
>
> In this case we need:
>
> - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
> items;
Here, I think you need to count with
DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to properly
account for the situation 2) I described.

> - 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build path to last level nodes.
> (-2, because root node and last level nodes are already accounted).
>
> The macro:
>
> #define RADIX_TREE_PRELOAD_MAX \
> ( RADIX_TREE_MAX_PATH + \
> DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
> 2 * (RADIX_TREE_MAX_PATH - 2) )
>
> For 64-bit system and N=512, RADIX_TREE_PRELOAD_MAX is 37.
>
> Does it sound correct? Any thoughts?
Otherwise I agree with your math (given the limitation on N). But please
add a comment explaining how we arrived at the math in
RADIX_TREE_PRELOAD_MAX. Thanks!

Another option I could live with is the one Matthew described - rely on
the inserted range being aligned, but then document it and add an
assertion checking it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-07 20:20:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

Jan Kara wrote:
> On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> > Jan Kara wrote:
> > > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > > From: "Kirill A. Shutemov" <[email protected]>
> > > >
> > > > The radix tree is variable-height, so an insert operation not only has
> > > > to build the branch to its corresponding item, it also has to build the
> > > > branch to existing items if the size has to be increased (by
> > > > radix_tree_extend).
> > > >
> > > > The worst case is a zero height tree with just a single item at index 0,
> > > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > >
> > > > Radix tree is usually protected by spin lock. It means we want to
> > > > pre-allocate required memory before taking the lock.
> > > >
> > > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > > one element. It's a hard limit. For transparent huge page cache we want
> > > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > >
> > > > This patch introduces radix_tree_preload_count(). It allows to
> > > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > > The feature costs about 5KiB per-CPU, details below.
> > > >
> > > > Worst case for adding N contiguous items is adding entries at indexes
> > > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > item plus extra nodes if you cross the boundary from one node to the next.
> > > >
> > > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > > to be able to handle 512 entries at once.
> > > >
> > > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > >
> > > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > >
> > > > #ifdef __KERNEL__
> > > > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > > > #else
> > > > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > > > #endif
> > > >
> > > > On 64-bit system:
> > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > > >
> > > > On 32-bit system:
> > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > > >
> > > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > > on 64-bit system the per-CPU feature overhead is
> > > > for preload array:
> > > > (30 - 21) * sizeof(void*) = 72 bytes
> > > > plus, if the preload array is full
> > > > (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > > > total: 5112 bytes
> > > >
> > > > on 32-bit system the per-CPU feature overhead is
> > > > for preload array:
> > > > (19 - 11) * sizeof(void*) = 32 bytes
> > > > plus, if the preload array is full
> > > > (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > > > total: 2400 bytes
> > > >
> > > > Since only THP uses batched preload at the moment, we disable (set max
> > > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > > changed in the future.
> > > >
> > > > Signed-off-by: Matthew Wilcox <[email protected]>
> > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > Acked-by: Dave Hansen <[email protected]>
> > > > ---
> > > > include/linux/radix-tree.h | 11 +++++++++++
> > > > lib/radix-tree.c | 41 ++++++++++++++++++++++++++++++++---------
> > > > 2 files changed, 43 insertions(+), 9 deletions(-)
> > > ...
> > > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > > index 7811ed3..99ab73c 100644
> > > > --- a/lib/radix-tree.c
> > > > +++ b/lib/radix-tree.c
> > > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > > * The worst case is a zero height tree with just a single item at index 0,
> > > > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > + *
> > > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > > + *
> > > > * Hence:
> > > > */
> > > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > +#define RADIX_TREE_PRELOAD_MAX \
> > > > + (RADIX_TREE_PRELOAD_MIN + \
> > > > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> > > Umm, is this really correct? I see two problems:
> > > 1) You may need internal tree nodes at various levels but you seem to
> > > account only for the level 1.
> > > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > > require 3 nodes at level 1 if the indexes are like:
> > > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > > ^ ^
> > > node boundary node boundary
> >
> > My bad. Let's try to calculate once again.
> >
> > We want to insert N contiguous items without restriction on alignment.
> >
> > Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> > CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
> >
> > Worst case scenario, I can imagine, is tree with only one element at index
> > 0 and we add N items where at least one index requires max tree high and
> > we cross boundary between items in root node.
> >
> > Basically, at least one index is less then
> >
> > 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> >
> > and one equal or more.
> >
> > In this case we need:
> >
> > - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> > - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
> > items;
> Here, I think you need to count with
> DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to propely
> account for the situation b) I described.

It's not obvious to me why that's needed. Since we limit N, we can cross a
second (or higher) level node boundary only once.

I've tried to confirm the math in my kvm (see the *ugly* patch below) and
I was not able to find a case that is not covered.

Could you demonstrate the case you are talking about?

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 3bf0b3e..67768b6 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -63,12 +63,14 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
struct radix_tree_root {
unsigned int height;
+ unsigned int nr_nodes;
gfp_t gfp_mask;
struct radix_tree_node __rcu *rnode;
};

#define RADIX_TREE_INIT(mask) { \
.height = 0, \
+ .nr_nodes = 0, \
.gfp_mask = (mask), \
.rnode = NULL, \
}
@@ -79,6 +81,7 @@ struct radix_tree_root {
#define INIT_RADIX_TREE(root, mask) \
do { \
(root)->height = 0; \
+ (root)->nr_nodes = 0; \
(root)->gfp_mask = (mask); \
(root)->rnode = NULL; \
} while (0)
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 99ab73c..a4d267e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -33,7 +33,7 @@
#include <linux/bitops.h>
#include <linux/rcupdate.h>
#include <linux/hardirq.h> /* in_interrupt() */
-
+#include <linux/reboot.h>

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
@@ -44,6 +44,9 @@
#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)

+#undef RADIX_TREE_PRELOAD_NR
+#define RADIX_TREE_PRELOAD_NR (1UL << 9)
+
#define RADIX_TREE_TAG_LONGS \
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)

@@ -91,8 +94,9 @@ static struct kmem_cache *radix_tree_node_cachep;
*/
#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
#define RADIX_TREE_PRELOAD_MAX \
- (RADIX_TREE_PRELOAD_MIN + \
- DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
+ ( RADIX_TREE_MAX_PATH + \
+ DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR, RADIX_TREE_MAP_SIZE) + \
+ 2 * (RADIX_TREE_MAX_PATH - 2) )

/*
* Per-cpu pool of preloaded nodes
@@ -240,6 +244,9 @@ radix_tree_node_alloc(struct radix_tree_root *root)
ret = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);

BUG_ON(radix_tree_is_indirect_ptr(ret));
+ if (ret)
+ root->nr_nodes++;
+
return ret;
}

@@ -264,8 +271,9 @@ static void radix_tree_node_rcu_free(struct rcu_head *head)
}

static inline void
-radix_tree_node_free(struct radix_tree_node *node)
+radix_tree_node_free(struct radix_tree_root *root, struct radix_tree_node *node)
{
+ root->nr_nodes--;
call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
}

@@ -1353,7 +1361,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
*((unsigned long *)&to_free->slots[0]) |=
RADIX_TREE_INDIRECT_PTR;

- radix_tree_node_free(to_free);
+ radix_tree_node_free(root, to_free);
}
}

@@ -1420,7 +1428,7 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
* last reference to it disappears (set NULL, above).
*/
if (to_free)
- radix_tree_node_free(to_free);
+ radix_tree_node_free(root, to_free);

if (node->count) {
if (node == indirect_to_ptr(root->rnode))
@@ -1440,7 +1448,7 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
root->height = 0;
root->rnode = NULL;
if (to_free)
- radix_tree_node_free(to_free);
+ radix_tree_node_free(root, to_free);

out:
return slot;
@@ -1504,6 +1512,33 @@ static int radix_tree_callback(struct notifier_block *nfb,
return NOTIFY_OK;
}

+static void self_test(void)
+{
+ struct radix_tree_root test_tree = RADIX_TREE_INIT(GFP_ATOMIC);
+ int i;
+ unsigned long off = (1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)) - 1;
+
+ printk("radix tree self-test\n");
+ printk("RADIX_TREE_PRELOAD_MAX: %lu\n", RADIX_TREE_PRELOAD_MAX);
+ printk("RADIX_TREE_PRELOAD_MIN: %lu\n", RADIX_TREE_PRELOAD_MIN);
+ printk("RADIX_TREE_PRELOAD_NR: %lu\n", RADIX_TREE_PRELOAD_NR);
+
+ radix_tree_insert(&test_tree, 0, (void*)0xdead0000);
+
+ if (test_tree.nr_nodes != 0) {
+ printk("\toff: %lu, nr_nodes: %d\n", off, test_tree.nr_nodes);
+ BUG();
+ }
+
+ for (i = 0; i < RADIX_TREE_PRELOAD_NR; i++) {
+ radix_tree_insert(&test_tree, off + i, (void*)0xdead0000);
+ }
+ printk("off: %lu, nr_nodes: %d, height: %d\n",
+ off, test_tree.nr_nodes, test_tree.height);
+
+ emergency_restart();
+}
+
void __init radix_tree_init(void)
{
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
@@ -1511,5 +1546,6 @@ void __init radix_tree_init(void)
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
radix_tree_node_ctor);
radix_tree_init_maxindex();
+ self_test();
hotcpu_notifier(radix_tree_callback, 0);
}
--
Kirill A. Shutemov

2013-08-07 20:36:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

On Wed 07-08-13 23:24:03, Kirill A. Shutemov wrote:
> Jan Kara wrote:
> > On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> > > Jan Kara wrote:
> > > > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > > > From: "Kirill A. Shutemov" <[email protected]>
> > > > >
> > > > > The radix tree is variable-height, so an insert operation not only has
> > > > > to build the branch to its corresponding item, it also has to build the
> > > > > branch to existing items if the size has to be increased (by
> > > > > radix_tree_extend).
> > > > >
> > > > > The worst case is a zero height tree with just a single item at index 0,
> > > > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > >
> > > > > Radix tree is usually protected by spin lock. It means we want to
> > > > > pre-allocate required memory before taking the lock.
> > > > >
> > > > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > > > one element. It's a hard limit. For transparent huge page cache we want
> > > > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > > >
> > > > > This patch introduces radix_tree_preload_count(). It allows to
> > > > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > > > The feature costs about 5KiB per-CPU, details below.
> > > > >
> > > > > Worst case for adding N contiguous items is adding entries at indexes
> > > > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > item plus extra nodes if you cross the boundary from one node to the next.
> > > > >
> > > > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > > > to be able to handle 512 entries at once.
> > > > >
> > > > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > > >
> > > > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > > >
> > > > > #ifdef __KERNEL__
> > > > > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > > > > #else
> > > > > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > > > > #endif
> > > > >
> > > > > On 64-bit system:
> > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > > > >
> > > > > On 32-bit system:
> > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > > > >
> > > > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > > > on 64-bit system the per-CPU feature overhead is
> > > > > for preload array:
> > > > > (30 - 21) * sizeof(void*) = 72 bytes
> > > > > plus, if the preload array is full
> > > > > (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > > > > total: 5112 bytes
> > > > >
> > > > > on 32-bit system the per-CPU feature overhead is
> > > > > for preload array:
> > > > > (19 - 11) * sizeof(void*) = 32 bytes
> > > > > plus, if the preload array is full
> > > > > (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > > > > total: 2400 bytes
> > > > >
> > > > > Since only THP uses batched preload at the moment, we disable (set max
> > > > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > > > changed in the future.
> > > > >
> > > > > Signed-off-by: Matthew Wilcox <[email protected]>
> > > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > > Acked-by: Dave Hansen <[email protected]>
> > > > > ---
> > > > > include/linux/radix-tree.h | 11 +++++++++++
> > > > > lib/radix-tree.c | 41 ++++++++++++++++++++++++++++++++---------
> > > > > 2 files changed, 43 insertions(+), 9 deletions(-)
> > > > ...
> > > > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > > > index 7811ed3..99ab73c 100644
> > > > > --- a/lib/radix-tree.c
> > > > > +++ b/lib/radix-tree.c
> > > > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > > > * The worst case is a zero height tree with just a single item at index 0,
> > > > > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > > + *
> > > > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > > > + *
> > > > > * Hence:
> > > > > */
> > > > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > +#define RADIX_TREE_PRELOAD_MAX \
> > > > > + (RADIX_TREE_PRELOAD_MIN + \
> > > > > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> > > > Umm, is this really correct? I see two problems:
> > > > 1) You may need internal tree nodes at various levels but you seem to
> > > > account only for the level 1.
> > > > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > > > require 3 nodes at level 1 if the indexes are like:
> > > > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > > > ^ ^
> > > > node boundary node boundary
> > >
> > > My bad. Let's try to calculate once again.
> > >
> > > We want to insert N contiguous items without restriction on alignment.
> > >
> > > Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> > > CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
> > >
> > > Worst case scenario, I can imagine, is tree with only one element at index
> > > 0 and we add N items where at least one index requires max tree high and
> > > we cross boundary between items in root node.
> > >
> > > Basically, at least one index is less then
> > >
> > > 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> > >
> > > and one equal or more.
> > >
> > > In this case we need:
> > >
> > > - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> > > - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
> > > items;
> > Here, I think you need to count with
> > DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to propely
> > account for the situation b) I described.
>
> It's not obvious for me why it's needed. Since we limit N, we can cross
> second (or above) level node boundary only once.
>
> I've tried to confirm the math with my kvm (see *ugly* patch below) and
> I was not able to find anything that is not covered.
>
> Could you demonstrate the case you are talking about.
Sure. So let RADIX_TREE_MAP_SHIFT be 6 (i.e. RADIX_TREE_MAP_SIZE == 64).
Let's have a radix tree with a single element at index 0. We insert 66 elements
starting from index 127. For the nodes at the last level we then need a node
for index 127, a node for indexes 128 .. 191, and a node for index 192. That
is three nodes in total, but DIV_ROUND_UP(66, 64) = 2. The problem happens
because the starting index 127 isn't a multiple of RADIX_TREE_MAP_SIZE, so we
can have partially used nodes both at the beginning and at the end of the
range.
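
A tiny user-space sketch makes the difference visible (illustrative only;
MAP_SHIFT and leaf_nodes() are just local names here, not kernel code):

#include <stdio.h>

#define MAP_SHIFT 6
#define MAP_SIZE (1UL << MAP_SHIFT)
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* last-level nodes actually touched by the range [start, start + n) */
static unsigned long leaf_nodes(unsigned long start, unsigned long n)
{
	unsigned long first = start >> MAP_SHIFT;
	unsigned long last = (start + n - 1) >> MAP_SHIFT;

	return last - first + 1;
}

int main(void)
{
	/* 66 elements starting at index 127, as in the example above */
	printf("actual: %lu, DIV_ROUND_UP: %lu\n",
	       leaf_nodes(127, 66), DIV_ROUND_UP(66UL, MAP_SIZE));
	/* prints: actual: 3, DIV_ROUND_UP: 2 */
	return 0;
}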

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-07 21:34:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

Jan Kara wrote:
> On Wed 07-08-13 23:24:03, Kirill A. Shutemov wrote:
> > Jan Kara wrote:
> > > On Wed 07-08-13 19:32:36, Kirill A. Shutemov wrote:
> > > > Jan Kara wrote:
> > > > > On Sun 04-08-13 05:17:03, Kirill A. Shutemov wrote:
> > > > > > From: "Kirill A. Shutemov" <[email protected]>
> > > > > >
> > > > > > The radix tree is variable-height, so an insert operation not only has
> > > > > > to build the branch to its corresponding item, it also has to build the
> > > > > > branch to existing items if the size has to be increased (by
> > > > > > radix_tree_extend).
> > > > > >
> > > > > > The worst case is a zero height tree with just a single item at index 0,
> > > > > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > > >
> > > > > > Radix tree is usually protected by spin lock. It means we want to
> > > > > > pre-allocate required memory before taking the lock.
> > > > > >
> > > > > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > > > > one element. It's a hard limit. For transparent huge page cache we want
> > > > > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > > > > >
> > > > > > This patch introduces radix_tree_preload_count(). It allows to
> > > > > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > > > > The feature costs about 5KiB per-CPU, details below.
> > > > > >
> > > > > > Worst case for adding N contiguous items is adding entries at indexes
> > > > > > (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > > item plus extra nodes if you cross the boundary from one node to the next.
> > > > > >
> > > > > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > > > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > > > > to be able to handle 512 entries at once.
> > > > > >
> > > > > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > > > > >
> > > > > > We have three possible RADIX_TREE_MAP_SHIFT:
> > > > > >
> > > > > > #ifdef __KERNEL__
> > > > > > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > > > > > #else
> > > > > > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > > > > > #endif
> > > > > >
> > > > > > On 64-bit system:
> > > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> > > > > >
> > > > > > On 32-bit system:
> > > > > > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > > > > > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > > > > > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> > > > > >
> > > > > > On most machines we will have RADIX_TREE_MAP_SHIFT=6. In this case,
> > > > > > on 64-bit system the per-CPU feature overhead is
> > > > > > for preload array:
> > > > > > (30 - 21) * sizeof(void*) = 72 bytes
> > > > > > plus, if the preload array is full
> > > > > > (30 - 21) * sizeof(struct radix_tree_node) = 9 * 560 = 5040 bytes
> > > > > > total: 5112 bytes
> > > > > >
> > > > > > on 32-bit system the per-CPU feature overhead is
> > > > > > for preload array:
> > > > > > (19 - 11) * sizeof(void*) = 32 bytes
> > > > > > plus, if the preload array is full
> > > > > > (19 - 11) * sizeof(struct radix_tree_node) = 8 * 296 = 2368 bytes
> > > > > > total: 2400 bytes
> > > > > >
> > > > > > Since only THP uses batched preload at the moment, we disable (set max
> > > > > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > > > > changed in the future.
> > > > > >
> > > > > > Signed-off-by: Matthew Wilcox <[email protected]>
> > > > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > > > Acked-by: Dave Hansen <[email protected]>
> > > > > > ---
> > > > > > include/linux/radix-tree.h | 11 +++++++++++
> > > > > > lib/radix-tree.c | 41 ++++++++++++++++++++++++++++++++---------
> > > > > > 2 files changed, 43 insertions(+), 9 deletions(-)
> > > > > ...
> > > > > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > > > > index 7811ed3..99ab73c 100644
> > > > > > --- a/lib/radix-tree.c
> > > > > > +++ b/lib/radix-tree.c
> > > > > > @@ -82,16 +82,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > > > > * The worst case is a zero height tree with just a single item at index 0,
> > > > > > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > > > > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > > > > + *
> > > > > > + * Worst case for adding N contiguous items is adding entries at indexes
> > > > > > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > > > > > + * item plus extra nodes if you cross the boundary from one node to the next.
> > > > > > + *
> > > > > > * Hence:
> > > > > > */
> > > > > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > > > > +#define RADIX_TREE_PRELOAD_MAX \
> > > > > > + (RADIX_TREE_PRELOAD_MIN + \
> > > > > > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> > > > > Umm, is this really correct? I see two problems:
> > > > > 1) You may need internal tree nodes at various levels but you seem to
> > > > > account only for the level 1.
> > > > > 2) The rounding doesn't seem right because RADIX_TREE_MAP_SIZE+2 nodes may
> > > > > require 3 nodes at level 1 if the indexes are like:
> > > > > i_0 | i_1 .. i_{RADIX_TREE_MAP_SIZE} | i_{RADIX_TREE_MAP_SIZE+1}
> > > > > ^ ^
> > > > > node boundary node boundary
> > > >
> > > > My bad. Let's try to calculate once again.
> > > >
> > > > We want to insert N contiguous items without restriction on alignment.
> > > >
> > > > Let's limit N <= 1UL << (2 * RADIX_TREE_MAP_SHIFT), without
> > > > CONFIG_BASE_SMALL it's 4096. It will simplify calculation a bit.
> > > >
> > > > Worst case scenario, I can imagine, is tree with only one element at index
> > > > 0 and we add N items where at least one index requires max tree high and
> > > > we cross boundary between items in root node.
> > > >
> > > > Basically, at least one index is less then
> > > >
> > > > 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> > > >
> > > > and one equal or more.
> > > >
> > > > In this case we need:
> > > >
> > > > - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> > > > - DIV_ROUND_UP(N, RADIX_TREE_MAP_SIZE) nodes for last level nodes for new
> > > > items;
> > > Here, I think you need to count with
> > > DIV_ROUND_UP(N + RADIX_TREE_MAP_SIZE - 1, RADIX_TREE_MAP_SIZE) to propely
> > > account for the situation b) I described.
> >
> > It's not obvious for me why it's needed. Since we limit N, we can cross
> > second (or above) level node boundary only once.
> >
> > I've tried to confirm the math with my kvm (see *ugly* patch below) and
> > I was not able to find anything that is not covered.
> >
> > Could you demonstrate the case you are talking about.
> Sure. So let RADIX_TREE_MAP_SHIFT be 6 (i.e. RADIX_TREE_MAP_SIZE == 64).
> Let's have radix tree with single element from index 0. We insert 66 elements
> starting from index 127. So for nodes at the last level we need - node for
> index 127, node for indexes 128 .. 191, node for index 192. That is
> together three nodes. But DIV_ROUND_UP(66, 64) = 2. The problem happens
> because starting index 127 isn't multiple of RADIX_TREE_MAP_SIZE so we can
> have partially used nodes both at the beginning and at the end of the
> range.

I see. Looks like you are correct.

So with the patch in the previous mail, it should be testable if we set
RADIX_TREE_PRELOAD_NR to 514 and off to
(1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)) - RADIX_TREE_MAP_SIZE - 1

In this case it should use 39 nodes, but it uses only 38. I can't understand why. :(

--
Kirill A. Shutemov

2013-08-08 08:42:56

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

Kirill A. Shutemov wrote:
> In this case it should use 39 nodes, but it uses only 38. I can't understand why. :(

Okay, I've got it. We share one 2nd level node.
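
If I spell the count out for that test case (assuming RADIX_TREE_MAX_PATH is 11
with shift 6 on 64-bit), it does seem to add up to 38:

 - 11 nodes for the path to the item at index 0;
 - the 65 items below the boundary need 2 last-level nodes plus 8 intermediate
   nodes, since the top two levels are shared with the path to index 0;
 - the 449 items above the boundary need 8 last-level nodes plus 9 intermediate
   nodes, sharing only the root.

11 + 10 + 17 = 38, one less than the expected 39, exactly because of that
shared second-level node.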

Patch is below. Please review.

From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Wed, 9 Jan 2013 13:25:47 +0200
Subject: [PATCH] radix-tree: implement preload for multiple contiguous
elements

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

Radix tree is usually protected by spin lock. It means we want to
pre-allocate required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows one to
preallocate enough nodes to insert a number of *contiguous* elements.
The feature costs about 9KiB per-CPU on x86_64, details below.

Preload uses per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase array size
to be able to handle 512 entries at once.

Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On 64-bit system old array size is 21, new is 37. Per-CPU feature
overhead is
for preload array:
(37 - 21) * sizeof(void*) = 128 bytes
plus, if the preload array is full
(37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
total: 9088 bytes

On 32-bit system old array size is 11, new is 22. Per-CPU feature
overhead is
for preload array:
(22 - 11) * sizeof(void*) = 44 bytes
plus, if the preload array is full
(22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
total: 3300 bytes

Since only THP uses batched preload at the moment, we disable (set max
preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
include/linux/radix-tree.h | 11 ++++++
lib/radix-tree.c | 89 +++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 91 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do { \
(root)->rnode = NULL; \
} while (0)

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item, for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR 512
+#else
+#define RADIX_TREE_PRELOAD_NR 1
+#endif
+
/**
* Radix-tree synchronization
*
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..980e4c4 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
* of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
* Hence:
*/
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ * - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
+ * - N <= the number of items a 2-level tree can contain:
+ * 1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is a tree with only one element at index 0, to which we
+ * add N items where at least one index requires the maximum tree height and we
+ * cross the boundary between items in the root node.
+ *
+ * Basically, at least one index is less than
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
+ *
+ * and at least one is equal to or above it.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
+ * - +1 is for the misaligned case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
+ * - -2, because the root node and the last level nodes are already accounted for;
+ * - -1, because one second level node is shared between the item with index 0
+ * and the items below the boundary;
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+ ( RADIX_TREE_MAX_PATH + \
+ RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+ 2 * (RADIX_TREE_MAX_PATH - 2) - 1)
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif

/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
int nr;
- struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+ struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };

@@ -263,29 +303,35 @@ radix_tree_node_free(struct radix_tree_node *node)

/*
* Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail. On
- * success, return zero, with preemption disabled. On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
* __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ int preload_target = RADIX_TREE_PRELOAD_MIN +
+ DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+ if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+ "too large preload requested"))
+ return -ENOMEM;

preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+ while (rtp->nr < preload_target) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < preload_target)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +354,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
}
EXPORT_SYMBOL(radix_tree_preload);

@@ -320,13 +366,22 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfp_mask & __GFP_WAIT)
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
}
EXPORT_SYMBOL(radix_tree_maybe_preload);

+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_WAIT)
+ return __radix_tree_preload_contig(size, gfp_mask);
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ preempt_disable();
+ return 0;
+}
+
/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
@@ -1483,6 +1538,22 @@ static int radix_tree_callback(struct notifier_block *nfb,

void __init radix_tree_init(void)
{
+ /*
+ * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
+ * calculation, it's already complex enough:
+ * - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we will
+ * have to round it up to the next RADIX_TREE_MAP_SIZE multiple and we
+ * don't win anything;
+ * - it must be no more than the number of items a 2-level tree can contain.
+ * It's easier to calculate the number of internal nodes required
+ * this way.
+ */
+ if (RADIX_TREE_PRELOAD_NR != 1) {
+ BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+ BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+ 1UL << (2 * RADIX_TREE_MAP_SHIFT));
+ }
+
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
--
1.8.3.2
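
For context, here is a rough sketch of how a batched page-cache insert would be
expected to use the new helper. It is hypothetical: add_huge_page_sketch() and
its (missing) error unwinding are made up for illustration; only
radix_tree_maybe_preload_contig() and radix_tree_preload_end() come from the
radix-tree API.

#include <linux/radix-tree.h>
#include <linux/pagemap.h>
#include <linux/huge_mm.h>

static int add_huge_page_sketch(struct address_space *mapping,
				struct page *page, pgoff_t index)
{
	int i, err;

	/* reserve enough nodes for HPAGE_PMD_NR contiguous slots */
	err = radix_tree_maybe_preload_contig(HPAGE_PMD_NR, GFP_KERNEL);
	if (err)
		return err;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_PMD_NR; i++) {
		err = radix_tree_insert(&mapping->page_tree, index + i,
					page + i);
		if (err)
			break;	/* real code would unwind the inserts above */
	}
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();	/* drops the preemption disable */
	return err;
}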

--
Kirill A. Shutemov

2013-08-08 10:04:12

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Wed, 9 Jan 2013 13:25:47 +0200
> Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> elements
>
> The radix tree is variable-height, so an insert operation not only has
> to build the branch to its corresponding item, it also has to build the
> branch to existing items if the size has to be increased (by
> radix_tree_extend).
>
> The worst case is a zero height tree with just a single item at index 0,
> and then inserting an item at index ULONG_MAX. This requires 2 new branches
> of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
>
> Radix tree is usually protected by spin lock. It means we want to
> pre-allocate required memory before taking the lock.
>
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
>
> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.
> The feature costs about 9KiB per-CPU on x86_64, details below.
>
> Preload uses per-CPU array to store nodes. The total cost of preload is
> "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> to be able to handle 512 entries at once.
>
> Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
>
> We have three possible RADIX_TREE_MAP_SHIFT:
>
> #ifdef __KERNEL__
> #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> #else
> #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> #endif
>
> We are not going to use transparent huge page cache on small machines or
> in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
>
> On 64-bit system old array size is 21, new is 37. Per-CPU feature
> overhead is
> for preload array:
> (37 - 21) * sizeof(void*) = 128 bytes
> plus, if the preload array is full
> (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
> total: 9088 bytes
>
> On 32-bit system old array size is 11, new is 22. Per-CPU feature
> overhead is
> for preload array:
> (22 - 11) * sizeof(void*) = 44 bytes
> plus, if the preload array is full
> (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
> total: 3300 bytes
>
> Since only THP uses batched preload at the moment, we disable (set max
> preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> changed in the future.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Dave Hansen <[email protected]>
> ---
> include/linux/radix-tree.h | 11 ++++++
> lib/radix-tree.c | 89 +++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 91 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 4039407..3bf0b3e 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,16 @@ do { \
> (root)->rnode = NULL; \
> } while (0)
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> +/*
> + * At the moment only THP uses preload for more then on item for batched
> + * pagecache manipulations.
> + */
> +#define RADIX_TREE_PRELOAD_NR 512
> +#else
> +#define RADIX_TREE_PRELOAD_NR 1
> +#endif
> +
> /**
> * Radix-tree synchronization
> *
> @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> unsigned long index, unsigned long max_scan);
> int radix_tree_preload(gfp_t gfp_mask);
> int radix_tree_maybe_preload(gfp_t gfp_mask);
> +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> void radix_tree_init(void);
> void *radix_tree_tag_set(struct radix_tree_root *root,
> unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 7811ed3..980e4c4 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
> * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> * Hence:
> */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +
> +/*
> + * Inserting N contiguous items is more complex. To simplify calculation, let's
> + * limit N (validated in radix_tree_init()):
> + * - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
Is this limitation really worth the one reserved item you save?


> + * - N <= number of items 2-level tree can contain:
> + * 1UL << (2 * RADIX_TREE_MAP_SHIFT).
> + *
> + * No limitation on insert index alignment.
> + *
> + * Then the worst case is tree with only one element at index 0 and we add N
> + * items where at least one index requires max tree high and we cross boundary
> + * between items in root node.
> + *
> + * Basically, at least one index is less then
> + *
> + * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> + *
> + * and one is equal to.
> + *
> + * In this case we need:
> + *
> + * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
> + * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
> + * - +1 is for misalinged case;
> + * - 2 * (RADIX_TREE_MAX_PATH - 2) - 1 nodes to build path to last level nodes:
> + * - -2, because root node and last level nodes are already accounted);
> + * - -1, because one second level node is shared between item with index 0
> + * and item below the boundary;
> + *
> + * Hence:
> + */
> +#if RADIX_TREE_PRELOAD_NR > 1
> +#define RADIX_TREE_PRELOAD_MAX \
> + ( RADIX_TREE_MAX_PATH + \
> + RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
> + 2 * (RADIX_TREE_MAX_PATH - 2) - 1)
> +#else
> +#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
> +#endif
>
> /*
> * Per-cpu pool of preloaded nodes
> */
> struct radix_tree_preload {
> int nr;
> - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
> };
> static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>
> @@ -263,29 +303,35 @@ radix_tree_node_free(struct radix_tree_node *node)
>
> /*
> * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail. On
> - * success, return zero, with preemption disabled. On error, return -ENOMEM
> + * ensure that the addition of *contiguous* items in the tree cannot fail.
> + * On success, return zero, with preemption disabled. On error, return -ENOMEM
> * with preemption not disabled.
> *
> * To make use of this facility, the radix tree must be initialised without
> * __GFP_WAIT being passed to INIT_RADIX_TREE().
> */
> -static int __radix_tree_preload(gfp_t gfp_mask)
> +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> {
> struct radix_tree_preload *rtp;
> struct radix_tree_node *node;
> int ret = -ENOMEM;
> + int preload_target = RADIX_TREE_PRELOAD_MIN +
> + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?
Otherwise the patch looks fine.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-08-08 20:51:09

by Khalid Aziz

[permalink] [raw]
Subject: Re: [PATCH 22/23] thp, mm: split huge page on mmap file page

On Sun, 2013-08-04 at 05:17 +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> We are not ready to mmap file-backed tranparent huge pages. Let's split
> them on fault attempt.
>
> Later we'll implement mmap() properly and this code path be used for
> fallback cases.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/filemap.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ed65af5..f7857ef 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1743,6 +1743,8 @@ retry_find:
> goto no_cached_page;
> }
>
> + if (PageTransCompound(page))
> + split_huge_page(compound_trans_head(page));

Since PageTransCompound(page) returns true for transparent huge pages as
well as hugetlbfs pages, could this code end up splitting hugetlbfs pages
on an mmap() of a hugetlbfs file? hugetlbfs pages are not supposed to be
split, right?

> if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
> page_cache_release(page);
> return ret | VM_FAULT_RETRY;
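
Just to illustrate the concern, one way to keep hugetlbfs pages out of the
split would be to filter with PageHuge(), which is true only for hugetlbfs
pages (a sketch only, not something the patch does, and whether hugetlbfs
pages can even reach this path is exactly the question):

	if (PageTransCompound(page) && !PageHuge(page))
		split_huge_page(compound_trans_head(page));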

--
Khalid

2013-08-09 11:10:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

Jan Kara wrote:
> On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> > From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <[email protected]>
> > Date: Wed, 9 Jan 2013 13:25:47 +0200
> > Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> > elements
> >
> > The radix tree is variable-height, so an insert operation not only has
> > to build the branch to its corresponding item, it also has to build the
> > branch to existing items if the size has to be increased (by
> > radix_tree_extend).
> >
> > The worst case is a zero height tree with just a single item at index 0,
> > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> >
> > Radix tree is usually protected by spin lock. It means we want to
> > pre-allocate required memory before taking the lock.
> >
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. For transparent huge page cache we want
> > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> >
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
> > The feature costs about 9KiB per-CPU on x86_64, details below.
> >
> > Preload uses per-CPU array to store nodes. The total cost of preload is
> > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > to be able to handle 512 entries at once.
> >
> > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> >
> > We have three possible RADIX_TREE_MAP_SHIFT:
> >
> > #ifdef __KERNEL__
> > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > #else
> > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > #endif
> >
> > We are not going to use transparent huge page cache on small machines or
> > in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> >
> > On 64-bit system old array size is 21, new is 37. Per-CPU feature
> > overhead is
> > for preload array:
> > (37 - 21) * sizeof(void*) = 128 bytes
> > plus, if the preload array is full
> > (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
> > total: 9088 bytes
> >
> > On 32-bit system old array size is 11, new is 22. Per-CPU feature
> > overhead is
> > for preload array:
> > (22 - 11) * sizeof(void*) = 44 bytes
> > plus, if the preload array is full
> > (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
> > total: 3300 bytes
> >
> > Since only THP uses batched preload at the moment, we disable (set max
> > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > changed in the future.
> >
> > Signed-off-by: Matthew Wilcox <[email protected]>
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Acked-by: Dave Hansen <[email protected]>
> > ---
> > include/linux/radix-tree.h | 11 ++++++
> > lib/radix-tree.c | 89 +++++++++++++++++++++++++++++++++++++++++-----
> > 2 files changed, 91 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index 4039407..3bf0b3e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -83,6 +83,16 @@ do { \
> > (root)->rnode = NULL; \
> > } while (0)
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> > +/*
> > + * At the moment only THP uses preload for more then on item for batched
> > + * pagecache manipulations.
> > + */
> > +#define RADIX_TREE_PRELOAD_NR 512
> > +#else
> > +#define RADIX_TREE_PRELOAD_NR 1
> > +#endif
> > +
> > /**
> > * Radix-tree synchronization
> > *
> > @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> > unsigned long index, unsigned long max_scan);
> > int radix_tree_preload(gfp_t gfp_mask);
> > int radix_tree_maybe_preload(gfp_t gfp_mask);
> > +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> > void radix_tree_init(void);
> > void *radix_tree_tag_set(struct radix_tree_root *root,
> > unsigned long index, unsigned int tag);
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 7811ed3..980e4c4 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
> > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > * Hence:
> > */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +
> > +/*
> > + * Inserting N contiguous items is more complex. To simplify calculation, let's
> > + * limit N (validated in radix_tree_init()):
> > + * - N is multiplier of RADIX_TREE_MAP_SIZE (or 1);
> Is this limitation really worth the one reserved item you save?

It's not really a limitation. It just doesn't make any sense to set
RADIX_TREE_PRELOAD_NR to something that isn't on a RADIX_TREE_MAP_SIZE
boundary, since we would have to round it up anyway.

I made one more mistake: I modeled a situation other than the one I described,
so the last -1 is wrong. Fixed in the patch below.

> > -static int __radix_tree_preload(gfp_t gfp_mask)
> > +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> > {
> > struct radix_tree_preload *rtp;
> > struct radix_tree_node *node;
> > int ret = -ENOMEM;
> > + int preload_target = RADIX_TREE_PRELOAD_MIN +
> > + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?

I've reworked the logic here to match the RADIX_TREE_PRELOAD_MAX calculation.

> Otherwise the patch looks fine.

Ack?

From 0e7d481093a4ded0dfb58e87d4edda003f3b2fd6 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Wed, 9 Jan 2013 13:25:47 +0200
Subject: [PATCH] radix-tree: implement preload for multiple contiguous
elements

The radix tree is variable-height, so an insert operation not only has
to build the branch to its corresponding item, it also has to build the
branch to existing items if the size has to be increased (by
radix_tree_extend).

The worst case is a zero height tree with just a single item at index 0,
and then inserting an item at index ULONG_MAX. This requires 2 new branches
of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.

Radix tree is usually protected by spin lock. It means we want to
pre-allocate required memory before taking the lock.

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit. For transparent huge page cache we want
to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.

This patch introduces radix_tree_maybe_preload_contig(). It allows one to
preallocate enough nodes to insert a number of *contiguous* elements.
The feature costs about 9.5KiB per-CPU on x86_64, details below.

Preload uses per-CPU array to store nodes. The total cost of preload is
"array size" * sizeof(void*) * NR_CPUS. We want to increase array size
to be able to handle 512 entries at once.

Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.

We have three possible RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

We are not going to use transparent huge page cache on small machines or
in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.

On 64-bit system old array size is 21, new is 38. Per-CPU feature
overhead is
for preload array:
(38 - 21) * sizeof(void*) = 136 bytes
plus, if the preload array is full
(38 - 21) * sizeof(struct radix_tree_node) = 17 * 560 = 9520 bytes
total: 9656 bytes

On 32-bit system old array size is 11, new is 23. Per-CPU feature
overhead is
for preload array:
(23 - 11) * sizeof(void*) = 48 bytes
plus, if the preload array is full
(23 - 11) * sizeof(struct radix_tree_node) = 12 * 296 = 3552 bytes
total: 3600 bytes

Since only THP uses batched preload at the moment, we disable (set max
preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
changed in the future.

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
include/linux/radix-tree.h | 11 ++++++
lib/radix-tree.c | 95 +++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 97 insertions(+), 9 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 4039407..3bf0b3e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,16 @@ do { \
(root)->rnode = NULL; \
} while (0)

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
+/*
+ * At the moment only THP uses preload for more than one item, for batched
+ * pagecache manipulations.
+ */
+#define RADIX_TREE_PRELOAD_NR 512
+#else
+#define RADIX_TREE_PRELOAD_NR 1
+#endif
+
/**
* Radix-tree synchronization
*
@@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3..1c19595 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -84,14 +84,52 @@ static struct kmem_cache *radix_tree_node_cachep;
* of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
* Hence:
*/
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+
+/*
+ * Inserting N contiguous items is more complex. To simplify calculation, let's
+ * limit N (validated in radix_tree_init()):
+ * - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
+ * - N <= the number of items a 2-level tree can contain:
+ * 1UL << (2 * RADIX_TREE_MAP_SHIFT).
+ *
+ * No limitation on insert index alignment.
+ *
+ * Then the worst case is a tree with only one element at index 0, to which we
+ * add N items where at least one index requires the maximum tree height and we
+ * cross the boundary between items in the root node.
+ *
+ * Basically, at least one index is less than
+ *
+ * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
+ *
+ * and at least one is equal to or above it.
+ *
+ * In this case we need:
+ *
+ * - RADIX_TREE_MAX_PATH nodes to build new path to item with index 0;
+ * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
+ * - +1 is for the misaligned case;
+ * - 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build path to last level nodes:
+ * - -2, because the root node and the last level nodes are already accounted for.
+ *
+ * Hence:
+ */
+#if RADIX_TREE_PRELOAD_NR > 1
+#define RADIX_TREE_PRELOAD_MAX \
+ ( RADIX_TREE_MAX_PATH + \
+ RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
+ 2 * (RADIX_TREE_MAX_PATH - 2))
+#else
+#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
+#endif

/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
int nr;
- struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+ struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };

@@ -263,29 +301,43 @@ radix_tree_node_free(struct radix_tree_node *node)

/*
* Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail. On
- * success, return zero, with preemption disabled. On error, return -ENOMEM
+ * ensure that the addition of *contiguous* items in the tree cannot fail.
+ * On success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
* __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
-static int __radix_tree_preload(gfp_t gfp_mask)
+static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ int preload_target = RADIX_TREE_PRELOAD_MIN;

+ if (size > 1) {
+ size = round_up(size, RADIX_TREE_MAP_SIZE);
+ if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
+ "too large preload requested"))
+ return -ENOMEM;
+
+ /* The same math as with RADIX_TREE_PRELOAD_MAX */
+ preload_target = RADIX_TREE_MAX_PATH +
+ size / RADIX_TREE_MAP_SIZE + 1 +
+ 2 * (RADIX_TREE_MAX_PATH - 2);
+ }
+
+ BUG_ON(preload_target > RADIX_TREE_PRELOAD_MAX);
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+ while (rtp->nr < preload_target) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < preload_target)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -308,7 +360,7 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
}
EXPORT_SYMBOL(radix_tree_preload);

@@ -320,13 +372,22 @@ EXPORT_SYMBOL(radix_tree_preload);
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
if (gfp_mask & __GFP_WAIT)
- return __radix_tree_preload(gfp_mask);
+ return __radix_tree_preload_contig(1, gfp_mask);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
return 0;
}
EXPORT_SYMBOL(radix_tree_maybe_preload);

+int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_WAIT)
+ return __radix_tree_preload_contig(size, gfp_mask);
+ /* Preloading doesn't help anything with this gfp mask, skip it */
+ preempt_disable();
+ return 0;
+}
+
/*
* Return the maximum key which can be store into a
* radix tree with height HEIGHT.
@@ -1483,6 +1544,22 @@ static int radix_tree_callback(struct notifier_block *nfb,

void __init radix_tree_init(void)
{
+ /*
+ * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
+ * calculation, it's already complex enough:
+ * - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we will
+ * have to round it up to the next RADIX_TREE_MAP_SIZE multiple and we
+ * don't win anything;
+ * - it must be no more than the number of items a 2-level tree can contain.
+ * It's easier to calculate the number of internal nodes required
+ * this way.
+ */
+ if (RADIX_TREE_PRELOAD_NR != 1) {
+ BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
+ BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
+ 1UL << (2 * RADIX_TREE_MAP_SHIFT));
+ }
+
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
--
Kirill A. Shutemov

2013-08-09 11:36:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/23] radix-tree: implement preload for multiple contiguous elements

On Fri 09-08-13 14:13:51, Kirill A. Shutemov wrote:
> Jan Kara wrote:
> > On Thu 08-08-13 11:45:05, Kirill A. Shutemov wrote:
> > > From 35ba5687ea7aea98645da34ddd0be01a9de8b32d Mon Sep 17 00:00:00 2001
> > > From: "Kirill A. Shutemov" <[email protected]>
> > > Date: Wed, 9 Jan 2013 13:25:47 +0200
> > > Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> > > elements
> > >
> > > The radix tree is variable-height, so an insert operation not only has
> > > to build the branch to its corresponding item, it also has to build the
> > > branch to existing items if the size has to be increased (by
> > > radix_tree_extend).
> > >
> > > The worst case is a zero height tree with just a single item at index 0,
> > > and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > > of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > >
> > > Radix tree is usually protected by spin lock. It means we want to
> > > pre-allocate required memory before taking the lock.
> > >
> > > Currently radix_tree_preload() only guarantees enough nodes to insert
> > > one element. It's a hard limit. For transparent huge page cache we want
> > > to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
> > >
> > > This patch introduces radix_tree_preload_count(). It allows to
> > > preallocate nodes enough to insert a number of *contiguous* elements.
> > > The feature costs about 9KiB per-CPU on x86_64, details below.
> > >
> > > Preload uses per-CPU array to store nodes. The total cost of preload is
> > > "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> > > to be able to handle 512 entries at once.
> > >
> > > Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
> > >
> > > We have three possible RADIX_TREE_MAP_SHIFT:
> > >
> > > #ifdef __KERNEL__
> > > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > > #else
> > > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > > #endif
> > >
> > > We are not going to use transparent huge page cache on small machines or
> > > in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
> > >
> > > On 64-bit system old array size is 21, new is 37. Per-CPU feature
> > > overhead is
> > > for preload array:
> > > (37 - 21) * sizeof(void*) = 128 bytes
> > > plus, if the preload array is full
> > > (37 - 21) * sizeof(struct radix_tree_node) = 16 * 560 = 8960 bytes
> > > total: 9088 bytes
> > >
> > > On 32-bit system old array size is 11, new is 22. Per-CPU feature
> > > overhead is
> > > for preload array:
> > > (22 - 11) * sizeof(void*) = 44 bytes
> > > plus, if the preload array is full
> > > (22 - 11) * sizeof(struct radix_tree_node) = 11 * 296 = 3256 bytes
> > > total: 3300 bytes
> > >
> > > Since only THP uses batched preload at the moment, we disable (set max
> > > preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> > > changed in the future.
> > >
> > > Signed-off-by: Matthew Wilcox <[email protected]>
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > Acked-by: Dave Hansen <[email protected]>
> > > ---
> > > include/linux/radix-tree.h | 11 ++++++
> > > lib/radix-tree.c | 89 +++++++++++++++++++++++++++++++++++++++++-----
> > > 2 files changed, 91 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > > index 4039407..3bf0b3e 100644
> > > --- a/include/linux/radix-tree.h
> > > +++ b/include/linux/radix-tree.h
> > > @@ -83,6 +83,16 @@ do { \
> > > (root)->rnode = NULL; \
> > > } while (0)
> > >
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> > > +/*
> > > + * At the moment only THP uses preload for more than one item for batched
> > > + * pagecache manipulations.
> > > + */
> > > +#define RADIX_TREE_PRELOAD_NR 512
> > > +#else
> > > +#define RADIX_TREE_PRELOAD_NR 1
> > > +#endif
> > > +
> > > /**
> > > * Radix-tree synchronization
> > > *
> > > @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> > > unsigned long index, unsigned long max_scan);
> > > int radix_tree_preload(gfp_t gfp_mask);
> > > int radix_tree_maybe_preload(gfp_t gfp_mask);
> > > +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> > > void radix_tree_init(void);
> > > void *radix_tree_tag_set(struct radix_tree_root *root,
> > > unsigned long index, unsigned int tag);
> > > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > > index 7811ed3..980e4c4 100644
> > > --- a/lib/radix-tree.c
> > > +++ b/lib/radix-tree.c
> > > @@ -84,14 +84,54 @@ static struct kmem_cache *radix_tree_node_cachep;
> > > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > > * Hence:
> > > */
> > > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > > +
> > > +/*
> > > + * Inserting N contiguous items is more complex. To simplify calculation, let's
> > > + * limit N (validated in radix_tree_init()):
> > > + * - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
> > Is this limitation really worth the one reserved item you save?
>
> It's not really a limitation. It just doesn't make any sense to set
> RADIX_TREE_PRELOAD_NR to something that is not a multiple of
> RADIX_TREE_MAP_SIZE, since we will have to round it up anyway.
>
> I made one more mistake: I modeled a different situation from the one I
> described, so the last -1 is wrong. Fixed in the patch below.
>
> > > -static int __radix_tree_preload(gfp_t gfp_mask)
> > > +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> > > {
> > > struct radix_tree_preload *rtp;
> > > struct radix_tree_node *node;
> > > int ret = -ENOMEM;
> > > + int preload_target = RADIX_TREE_PRELOAD_MIN +
> > > + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> > So you should also warn if size % RADIX_TREE_MAP_SIZE != 0, right?
>
> I've reworked the logic here to match the RADIX_TREE_PRELOAD_MAX calculation.
>
> > Otherwise the patch looks fine.
>
> Ack?
Yes. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza
>
> From 0e7d481093a4ded0dfb58e87d4edda003f3b2fd6 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Wed, 9 Jan 2013 13:25:47 +0200
> Subject: [PATCH] radix-tree: implement preload for multiple contiguous
> elements
>
> The radix tree is variable-height, so an insert operation not only has
> to build the branch to its corresponding item, it also has to build the
> branch to existing items if the size has to be increased (by
> radix_tree_extend).
>
> The worst case is a zero height tree with just a single item at index 0,
> and then inserting an item at index ULONG_MAX. This requires 2 new branches
> of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
>
> Radix tree is usually protected by spin lock. It means we want to
> pre-allocate required memory before taking the lock.
>
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. For transparent huge page cache we want
> to insert HPAGE_PMD_NR (512 on x86-64) entries to address_space at once.
>
> This patch introduces radix_tree_maybe_preload_contig(). It allows
> preallocating enough nodes to insert a number of *contiguous* elements.
> The feature costs about 9.5KiB per-CPU on x86_64, details below.
>
> Preload uses per-CPU array to store nodes. The total cost of preload is
> "array size" * sizeof(void*) * NR_CPUS. We want to increase array size
> to be able to handle 512 entries at once.
>
> Size of array depends on system bitness and on RADIX_TREE_MAP_SHIFT.
>
> We have three possible RADIX_TREE_MAP_SHIFT:
>
> #ifdef __KERNEL__
> #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> #else
> #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> #endif
>
> We are not going to use transparent huge page cache on small machines or
> in userspace, so we are interested in RADIX_TREE_MAP_SHIFT=6.
>
> On 64-bit system old array size is 21, new is 38. Per-CPU feature
> overhead is
> for preload array:
> (38 - 21) * sizeof(void*) = 136 bytes
> plus, if the preload array is full
> (38 - 21) * sizeof(struct radix_tree_node) = 17 * 560 = 9520 bytes
> total: 9656 bytes
>
> On 32-bit system old array size is 11, new is 23. Per-CPU feature
> overhead is
> for preload array:
> (23 - 11) * sizeof(void*) = 48 bytes
> plus, if the preload array is full
> (23 - 11) * sizeof(struct radix_tree_node) = 12 * 296 = 3552 bytes
> total: 3600 bytes
>
> Since only THP uses batched preload at the moment, we disable (set max
> preload to 1) it if !CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE. This can be
> changed in the future.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Dave Hansen <[email protected]>
> ---
> include/linux/radix-tree.h | 11 ++++++
> lib/radix-tree.c | 95 +++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 97 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 4039407..3bf0b3e 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,16 @@ do { \
> (root)->rnode = NULL; \
> } while (0)
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE
> +/*
> + * At the moment only THP uses preload for more than one item for batched
> + * pagecache manipulations.
> + */
> +#define RADIX_TREE_PRELOAD_NR 512
> +#else
> +#define RADIX_TREE_PRELOAD_NR 1
> +#endif
> +
> /**
> * Radix-tree synchronization
> *
> @@ -232,6 +242,7 @@ unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> unsigned long index, unsigned long max_scan);
> int radix_tree_preload(gfp_t gfp_mask);
> int radix_tree_maybe_preload(gfp_t gfp_mask);
> +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask);
> void radix_tree_init(void);
> void *radix_tree_tag_set(struct radix_tree_root *root,
> unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 7811ed3..1c19595 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -84,14 +84,52 @@ static struct kmem_cache *radix_tree_node_cachep;
> * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> * Hence:
> */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +
> +/*
> + * Inserting N contiguous items is more complex. To simplify calculation, let's
> + * limit N (validated in radix_tree_init()):
> + * - N is a multiple of RADIX_TREE_MAP_SIZE (or 1);
> + * - N <= number of items 2-level tree can contain:
> + * 1UL << (2 * RADIX_TREE_MAP_SHIFT).
> + *
> + * No limitation on insert index alignment.
> + *
> + * Then the worst case is a tree with only one element at index 0 to which we
> + * add N items, where at least one index requires max tree height and we cross
> + * a boundary between items in the root node.
> + *
> + * Basically, at least one index is less than
> + *
> + * 1UL << ((RADIX_TREE_MAX_PATH - 1) * RADIX_TREE_MAP_SHIFT)
> + *
> + * and one is equal to it.
> + *
> + * In this case we need:
> + *
> + * - RADIX_TREE_MAX_PATH nodes to build a new path to the item with index 0;
> + * - N / RADIX_TREE_MAP_SIZE + 1 nodes for last level nodes for new items:
> + * - +1 is for the misaligned case;
> + * - 2 * (RADIX_TREE_MAX_PATH - 2) nodes to build paths to last level nodes:
> + * - -2, because the root node and last level nodes are already accounted.
> + *
> + * Hence:
> + */
> +#if RADIX_TREE_PRELOAD_NR > 1
> +#define RADIX_TREE_PRELOAD_MAX \
> + ( RADIX_TREE_MAX_PATH + \
> + RADIX_TREE_PRELOAD_NR / RADIX_TREE_MAP_SIZE + 1 + \
> + 2 * (RADIX_TREE_MAX_PATH - 2))
> +#else
> +#define RADIX_TREE_PRELOAD_MAX RADIX_TREE_PRELOAD_MIN
> +#endif
>
> /*
> * Per-cpu pool of preloaded nodes
> */
> struct radix_tree_preload {
> int nr;
> - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
> };
> static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>
> @@ -263,29 +301,43 @@ radix_tree_node_free(struct radix_tree_node *node)
>
> /*
> * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail. On
> - * success, return zero, with preemption disabled. On error, return -ENOMEM
> + * ensure that the addition of *contiguous* items in the tree cannot fail.
> + * On success, return zero, with preemption disabled. On error, return -ENOMEM
> * with preemption not disabled.
> *
> * To make use of this facility, the radix tree must be initialised without
> * __GFP_WAIT being passed to INIT_RADIX_TREE().
> */
> -static int __radix_tree_preload(gfp_t gfp_mask)
> +static int __radix_tree_preload_contig(unsigned size, gfp_t gfp_mask)
> {
> struct radix_tree_preload *rtp;
> struct radix_tree_node *node;
> int ret = -ENOMEM;
> + int preload_target = RADIX_TREE_PRELOAD_MIN;
>
> + if (size > 1) {
> + size = round_up(size, RADIX_TREE_MAP_SIZE);
> + if (WARN_ONCE(size > RADIX_TREE_PRELOAD_NR,
> + "too large preload requested"))
> + return -ENOMEM;
> +
> + /* The same math as with RADIX_TREE_PRELOAD_MAX */
> + preload_target = RADIX_TREE_MAX_PATH +
> + size / RADIX_TREE_MAP_SIZE + 1 +
> + 2 * (RADIX_TREE_MAX_PATH - 2);
> + }
> +
> + BUG_ON(preload_target > RADIX_TREE_PRELOAD_MAX);
> preempt_disable();
> rtp = &__get_cpu_var(radix_tree_preloads);
> - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> + while (rtp->nr < preload_target) {
> preempt_enable();
> node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
> if (node == NULL)
> goto out;
> preempt_disable();
> rtp = &__get_cpu_var(radix_tree_preloads);
> - if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> + if (rtp->nr < preload_target)
> rtp->nodes[rtp->nr++] = node;
> else
> kmem_cache_free(radix_tree_node_cachep, node);
> @@ -308,7 +360,7 @@ int radix_tree_preload(gfp_t gfp_mask)
> {
> /* Warn on non-sensical use... */
> WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
> - return __radix_tree_preload(gfp_mask);
> + return __radix_tree_preload_contig(1, gfp_mask);
> }
> EXPORT_SYMBOL(radix_tree_preload);
>
> @@ -320,13 +372,22 @@ EXPORT_SYMBOL(radix_tree_preload);
> int radix_tree_maybe_preload(gfp_t gfp_mask)
> {
> if (gfp_mask & __GFP_WAIT)
> - return __radix_tree_preload(gfp_mask);
> + return __radix_tree_preload_contig(1, gfp_mask);
> /* Preloading doesn't help anything with this gfp mask, skip it */
> preempt_disable();
> return 0;
> }
> EXPORT_SYMBOL(radix_tree_maybe_preload);
>
> +int radix_tree_maybe_preload_contig(unsigned size, gfp_t gfp_mask)
> +{
> + if (gfp_mask & __GFP_WAIT)
> + return __radix_tree_preload_contig(size, gfp_mask);
> + /* Preloading doesn't help anything with this gfp mask, skip it */
> + preempt_disable();
> + return 0;
> +}
> +
> /*
> * Return the maximum key which can be store into a
> * radix tree with height HEIGHT.
> @@ -1483,6 +1544,22 @@ static int radix_tree_callback(struct notifier_block *nfb,
>
> void __init radix_tree_init(void)
> {
> + /*
> + * Restrictions on RADIX_TREE_PRELOAD_NR simplify RADIX_TREE_PRELOAD_MAX
> + * calculation; it's already complex enough:
> + * - it must be a multiple of RADIX_TREE_MAP_SIZE, otherwise we will
> + * have to round it up to the next RADIX_TREE_MAP_SIZE multiple and we
> + * don't win anything;
> + * - it must be less than the number of items a 2-level tree can contain.
> + * It's easier to calculate the number of internal nodes required
> + * this way.
> + */
> + if (RADIX_TREE_PRELOAD_NR != 1) {
> + BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR % RADIX_TREE_MAP_SIZE != 0);
> + BUILD_BUG_ON(RADIX_TREE_PRELOAD_NR >
> + 1UL << (2 * RADIX_TREE_MAP_SHIFT));
> + }
> +
> radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
> sizeof(struct radix_tree_node), 0,
> SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> --
> Kirill A. Shutemov
--
Jan Kara <[email protected]>
SUSE Labs, CR
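
As a cross-check of the numbers above (a back-of-the-envelope recomputation,
not text from the patch): with RADIX_TREE_MAP_SHIFT = 6 we have
RADIX_TREE_MAP_SIZE = 64 and RADIX_TREE_MAX_PATH = DIV_ROUND_UP(BITS_PER_LONG, 6),
i.e. 11 on 64-bit and 6 on 32-bit, so

  old array size = RADIX_TREE_PRELOAD_MIN = 2 * RADIX_TREE_MAX_PATH - 1
                 = 21 (64-bit), 11 (32-bit)

  new array size = RADIX_TREE_PRELOAD_MAX
                 = RADIX_TREE_MAX_PATH + 512/64 + 1 + 2 * (RADIX_TREE_MAX_PATH - 2)
                 = 11 + 8 + 1 + 18 = 38 (64-bit)
                 =  6 + 8 + 1 +  8 = 23 (32-bit)

which matches the per-CPU overhead arithmetic in the updated commit message.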

2013-08-09 14:36:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 19/23] truncate: support huge pages

Dave Hansen wrote:
> On 08/03/2013 07:17 PM, Kirill A. Shutemov wrote:
> > If a huge page is only partly in the range we zero out the part,
> > exactly like we do for partial small pages.
>
> What's the logic behind this behaviour? Seems like the kind of place
> that we would really want to be splitting pages.

split_huge_page() now truncates the file, so we need to break
truncate<->split interdependency at some point.

> Like I said before, I somehow like to rewrite your code. :)

Makes sense. Please take a look at the patch below.

>From 631ad747933acbaa3284fae6e24ff1ae870a8f8f Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Fri, 2 Aug 2013 12:57:08 +0300
Subject: [PATCH] truncate: support huge pages

truncate_inode_pages_range() drops whole huge page at once if it's fully
inside the range.

If a huge page is only partly in the range we zero out the part,
exactly like we do for partial small pages.

In some cases it would be worth splitting the huge page instead, if we need
to truncate it partially and free some memory. But split_huge_page() now
truncates the file, so we need to break truncate<->split interdependency
at some point.

invalidate_mapping_pages() just skips huge pages if they are not fully
in the range.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/pagemap.h | 9 +++++
mm/truncate.c | 98 +++++++++++++++++++++++++++++++++++++------------
2 files changed, 83 insertions(+), 24 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index eb484f2..418be14 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -599,4 +599,13 @@ static inline void clear_pagecache_page(struct page *page)
clear_highpage(page);
}

+static inline void zero_pagecache_segment(struct page *page,
+ unsigned start, unsigned len)
+{
+ if (PageTransHugeCache(page))
+ zero_huge_user_segment(page, start, len);
+ else
+ zero_user_segment(page, start, len);
+}
+
#endif /* _LINUX_PAGEMAP_H */
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683..bc4f8d6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -205,8 +205,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
{
pgoff_t start; /* inclusive */
pgoff_t end; /* exclusive */
- unsigned int partial_start; /* inclusive */
- unsigned int partial_end; /* exclusive */
+ bool partial_thp_start = false, partial_thp_end = false;
struct pagevec pvec;
pgoff_t index;
int i;
@@ -215,15 +214,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (mapping->nrpages == 0)
return;

- /* Offsets within partial pages */
- partial_start = lstart & (PAGE_CACHE_SIZE - 1);
- partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
-
/*
* 'start' and 'end' always covers the range of pages to be fully
- * truncated. Partial pages are covered with 'partial_start' at the
- * start of the range and 'partial_end' at the end of the range.
- * Note that 'end' is exclusive while 'lend' is inclusive.
+ * truncated. Note that 'end' is exclusive while 'lend' is inclusive.
*/
start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (lend == -1)
@@ -249,6 +242,23 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index >= end)
break;

+ if (PageTransTailCache(page)) {
+ /* part of already handled huge page */
+ if (!page->mapping)
+ continue;
+ /* the range starts in middle of huge page */
+ partial_thp_start = true;
+ start = index & ~HPAGE_CACHE_INDEX_MASK;
+ continue;
+ }
+ /* the range ends on huge page */
+ if (PageTransHugeCache(page) && index ==
+ (end & ~HPAGE_CACHE_INDEX_MASK)) {
+ partial_thp_end = true;
+ end = index;
+ break;
+ }
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -265,34 +275,64 @@ void truncate_inode_pages_range(struct address_space *mapping,
index++;
}

- if (partial_start) {
- struct page *page = find_lock_page(mapping, start - 1);
+ if (partial_thp_start || lstart & ~PAGE_CACHE_MASK) {
+ pgoff_t off;
+ struct page *page;
+ pgoff_t index_mask = 0UL;
+ loff_t page_cache_mask = PAGE_CACHE_MASK;
+retry_partial_start:
+ if (partial_thp_start) {
+ index_mask = HPAGE_CACHE_INDEX_MASK;
+ page_cache_mask = HPAGE_PMD_MASK;
+ }
+
+ off = (start - 1) & ~index_mask;
+ page = find_get_page(mapping, off);
if (page) {
- unsigned int top = PAGE_CACHE_SIZE;
- if (start > end) {
- /* Truncation within a single page */
- top = partial_end;
- partial_end = 0;
+ unsigned pstart, pend;
+
+ /* the last tail page */
+ if (PageTransTailCache(page)) {
+ partial_thp_start = true;
+ page_cache_release(page);
+ goto retry_partial_start;
}
+
+ pstart = lstart & ~page_cache_mask;
+ if ((end & ~index_mask) == off)
+ pend = (lend - 1) & ~PAGE_CACHE_MASK;
+ else
+ pend = PAGE_CACHE_SIZE;
+
+ lock_page(page);
wait_on_page_writeback(page);
- zero_user_segment(page, partial_start, top);
+ zero_pagecache_segment(page, pstart, pend);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
- do_invalidatepage(page, partial_start,
- top - partial_start);
+ do_invalidatepage(page, pstart,
+ pend - pstart);
unlock_page(page);
page_cache_release(page);
}
}
- if (partial_end) {
- struct page *page = find_lock_page(mapping, end);
+ if (partial_thp_end || (lend + 1) & ~PAGE_CACHE_MASK) {
+ struct page *page;
+ pgoff_t index_mask = 0UL;
+ loff_t page_cache_mask = PAGE_CACHE_MASK;
+
+ if (partial_thp_end) {
+ index_mask = HPAGE_CACHE_INDEX_MASK;
+ page_cache_mask = HPAGE_PMD_MASK;
+ }
+
+ page = find_lock_page(mapping, end & ~index_mask);
if (page) {
+ unsigned pend = (lend - 1) & ~page_cache_mask;
wait_on_page_writeback(page);
- zero_user_segment(page, 0, partial_end);
+ zero_pagecache_segment(page, 0, pend);
cleancache_invalidate_page(mapping, page);
if (page_has_private(page))
- do_invalidatepage(page, 0,
- partial_end);
+ do_invalidatepage(page, 0, pend);
unlock_page(page);
page_cache_release(page);
}
@@ -327,6 +367,9 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index >= end)
break;

+ if (PageTransTailCache(page))
+ continue;
+
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
@@ -401,6 +444,13 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (index > end)
break;

+ /* skip huge page if it's not fully in the range */
+ if (PageTransHugeCache(page) &&
+ index + HPAGE_CACHE_NR - 1 > end)
+ continue;
+ if (PageTransTailCache(page))
+ continue;
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
--
Kirill A. Shutemov

2013-08-09 14:42:46

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 20/23] thp: handle file pages in split_huge_page()

Ning Qu wrote:
> I just tried, and it seems to work fine now, without the deadlock. I can run
> some big internal tests with about 40GB of files in sysv shm. Just move the
> line before the locking happens in vma_adjust(), something like the change
> below; the line numbers are not accurate because my patch is based on another
> tree right now.

Looks okay to me. Could you prepare a real patch (description, etc.)? I'll
add it to my patchset.

>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -581,6 +581,8 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> + vma_adjust_trans_huge(vma, start, end, adjust_next);
> +
> if (file) {
> mapping = file->f_mapping;
> if (!(vma->vm_flags & VM_NONLINEAR))
> @@ -597,8 +599,6 @@ again: remove_next = 1 + (end > next->vm_end);
> }
> }
>
> - vma_adjust_trans_huge(vma, start, end, adjust_next);
> -
> anon_vma = vma->anon_vma;
> if (!anon_vma && adjust_next)
> anon_vma = next->anon_vma;
>
>
> Best wishes,
> --
> Ning Qu (曲宁) | Software Engineer | [email protected] | +1-408-418-6066

--
Kirill A. Shutemov

2013-08-09 14:46:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 22/23] thp, mm: split huge page on mmap file page

Khalid Aziz wrote:
> On Sun, 2013-08-04 at 05:17 +0300, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > We are not ready to mmap file-backed transparent huge pages. Let's split
> > them on fault attempt.
> >
> > Later we'll implement mmap() properly and this code path will be used for
> > fallback cases.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > mm/filemap.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index ed65af5..f7857ef 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1743,6 +1743,8 @@ retry_find:
> > goto no_cached_page;
> > }
> >
> > + if (PageTransCompound(page))
> > + split_huge_page(compound_trans_head(page));
>
> Since PageTransCompound(page) returns true for transparent huge pages as
> well as hugetlbfs pages, could this code split hugetlbfs pages on an
> mmap() of a hugetlbfs file? hugetlbfs pages are not supposed to be
> split, right?

A hugetlbfs page cannot (should not) end up here. hugetlb uses a different
vm_ops->fault handler.
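
For reference, hugetlbfs wires up its own vm_operations_struct, so
filemap_fault() (and therefore this split path) never runs for hugetlbfs
pages; hugetlb faults are handled by hugetlb_fault() from the core fault
path instead. Roughly, paraphrased from mm/hugetlb.c of that era (treat the
exact field list as approximate):

	const struct vm_operations_struct hugetlb_vm_ops = {
		.fault = hugetlb_vm_op_fault,	/* only BUG()s if ever called */
		.open  = hugetlb_vm_op_open,
		.close = hugetlb_vm_op_close,
	};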

--
Kirill A. Shutemov

2013-09-02 11:36:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 05/23] thp: represent file thp pages in meminfo and friends

Ning Qu wrote:
> Hi, Kirill
>
> I believe there is a typo in one of your previous commits, but you didn't
> include it in this patch series. Below is the link for the commit. I think
> you are trying to decrease NR_ANON_PAGES in page_remove_rmap(), but it is
> currently adding to the value instead when using __mod_zone_page_state().
> Let me know if you would like to fix it in your commit or if you want
> another patch from me. Thanks!

The patch is already in Andrew's tree. I'll send a fix for that. Thanks.
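
A minimal sketch of the expected fix, assuming the problem is the sign of the
NR_ANON_PAGES adjustment in page_remove_rmap() as described above (the fix
actually sent may look different):

	/* mm/rmap.c, page_remove_rmap() -- decrement, not increment */
	if (anon)
		__mod_zone_page_state(page_zone(page), NR_ANON_PAGES,
				      -hpage_nr_pages(page));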

--
Kirill A. Shutemov

2013-09-06 11:34:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 03/23] thp: compile-time and sysfs knob for thp pagecache

Ning Qu wrote:
> One minor question inline.
>
> Best wishes,
> --
> Ning Qu (曲宁) | Software Engineer | [email protected] | +1-408-418-6066
>
>
> On Sat, Aug 3, 2013 at 7:17 PM, Kirill A. Shutemov <
> [email protected]> wrote:
>
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > For now, TRANSPARENT_HUGEPAGE_PAGECACHE is only implemented for x86_64.
> >
> > Radix tree preload overhead can be significant on BASE_SMALL systems, so
> > let's add a dependency on !BASE_SMALL.
> >
> > /sys/kernel/mm/transparent_hugepage/page_cache is a runtime knob for the
> > feature. It's enabled by default if TRANSPARENT_HUGEPAGE_PAGECACHE is
> > enabled.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > Documentation/vm/transhuge.txt | 9 +++++++++
> > include/linux/huge_mm.h | 9 +++++++++
> > mm/Kconfig | 12 ++++++++++++
> > mm/huge_memory.c | 23 +++++++++++++++++++++++
> > 4 files changed, 53 insertions(+)
> >
> > diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
> > index 4a63953..4cc15c4 100644
> > --- a/Documentation/vm/transhuge.txt
> > +++ b/Documentation/vm/transhuge.txt
> > @@ -103,6 +103,15 @@ echo always >/sys/kernel/mm/transparent_hugepage/enabled
> > echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
> > echo never >/sys/kernel/mm/transparent_hugepage/enabled
> >
> > +If TRANSPARENT_HUGEPAGE_PAGECACHE is enabled the kernel will use huge pages in
> > +the page cache if possible. It can be disabled and re-enabled via sysfs:
> > +
> > +echo 0 >/sys/kernel/mm/transparent_hugepage/page_cache
> > +echo 1 >/sys/kernel/mm/transparent_hugepage/page_cache
> > +
> > +If it's disabled the kernel will not add new huge pages to the page cache and
> > +will split them on mapping, but already mapped pages will stay intact.
> > +
> > It's also possible to limit defrag efforts in the VM to generate
> > hugepages in case they're not immediately free to madvise regions or
> > to never try to defrag memory and simply fallback to regular pages
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 3935428..1534e1e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -40,6 +40,7 @@ enum transparent_hugepage_flag {
> > TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
> > TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> > TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
> > + TRANSPARENT_HUGEPAGE_PAGECACHE,
> > TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
> > #ifdef CONFIG_DEBUG_VM
> > TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
> > @@ -229,4 +230,12 @@ static inline int do_huge_pmd_numa_page(struct
> > mm_struct *mm, struct vm_area_str
> >
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +static inline bool transparent_hugepage_pagecache(void)
> > +{
> > + if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
> > + return false;
> > + if (!(transparent_hugepage_flags & (1<<TRANSPARENT_HUGEPAGE_FLAG)))
> >
>
> Here, I suppose we should test the TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG as
> well? E.g.
> if (!(transparent_hugepage_flags &
> ((1<<TRANSPARENT_HUGEPAGE_FLAG) |
> (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
>
> + return false;

You're right. Fixed.
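
For completeness, the corrected helper would then look roughly like this; the
end of the function is cut off in the quote above, so the final return is an
assumption:

	static inline bool transparent_hugepage_pagecache(void)
	{
		if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE_PAGECACHE))
			return false;
		/* honour both "always" and "madvise" global modes */
		if (!(transparent_hugepage_flags &
					((1<<TRANSPARENT_HUGEPAGE_FLAG) |
					 (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG))))
			return false;
		/* assumed: also check the runtime page_cache knob */
		return transparent_hugepage_flags &
			(1 << TRANSPARENT_HUGEPAGE_PAGECACHE);
	}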

--
Kirill A. Shutemov