2013-03-14 17:56:49

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 00/30] Transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

Here's the second version of the patchset.

The intent of the work is to get the code ready to enable transparent huge
page cache for the simplest fs -- ramfs.

We have read()/write()/mmap() functionality now. There is still plenty of work ahead.

Any feedback is welcome.

Changes since v1:
- mmap();
- fix add_to_page_cache_locked() and delete_from_page_cache();
- introduce mapping_can_have_hugepages();
- call split_huge_page() only for head page in filemap_fault();
- wait_split_huge_page(): serialize over i_mmap_mutex too;
- lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
- fix off-by-one in zero_huge_user_segment();
- THP_WRITE_ALLOC/THP_WRITE_FAILED counters;

TODO:
- memcg accounting has not been evaluated yet;
- collapse;
- migration (?);
- stats, knobs, etc.;
- tmpfs/shmem enabling;


Kirill A. Shutemov (30):
block: implement add_bdi_stat()
mm: implement zero_huge_user_segment and friends
mm: drop actor argument of do_generic_file_read()
radix-tree: implement preload for multiple contiguous elements
thp, mm: avoid PageUnevictable on active/inactive lru lists
thp, mm: basic defines for transparent huge page cache
thp, mm: introduce mapping_can_have_hugepages() predicate
thp, mm: rewrite add_to_page_cache_locked() to support huge pages
thp, mm: rewrite delete_from_page_cache() to support huge pages
thp, mm: locking tail page is a bug
thp, mm: handle tail pages in page_cache_get_speculative()
thp, mm: add event counters for huge page alloc on write to a file
thp, mm: implement grab_cache_huge_page_write_begin()
thp, mm: naive support of thp in generic read/write routines
thp, libfs: initial support of thp in
simple_read/write_begin/write_end
thp: handle file pages in split_huge_page()
thp: wait_split_huge_page(): serialize over i_mmap_mutex too
thp, mm: truncate support for transparent huge page cache
thp, mm: split huge page on mmap file page
ramfs: enable transparent huge page cache
x86-64, mm: proper alignment mappings with hugepages
mm: add huge_fault() callback to vm_operations_struct
thp: prepare zap_huge_pmd() to uncharge file pages
thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
thp, mm: basic huge_fault implementation for generic_file_vm_ops
thp: extract fallback path from do_huge_pmd_anonymous_page() to a
function
thp: initial implementation of do_huge_linear_fault()
thp: handle write-protect exception to file-backed huge pages
thp: call __vma_adjust_trans_huge() for file-backed VMA
thp: map file-backed huge pages on fault

arch/x86/kernel/sys_x86_64.c | 13 +-
fs/libfs.c | 50 ++++-
fs/ramfs/inode.c | 6 +-
include/linux/backing-dev.h | 10 +
include/linux/huge_mm.h | 36 +++-
include/linux/mm.h | 16 ++
include/linux/pagemap.h | 24 ++-
include/linux/radix-tree.h | 3 +
include/linux/vm_event_item.h | 2 +
lib/radix-tree.c | 32 ++-
mm/filemap.c | 283 +++++++++++++++++++++----
mm/huge_memory.c | 462 ++++++++++++++++++++++++++++++++++-------
mm/memory.c | 31 ++-
mm/swap.c | 3 +-
mm/truncate.c | 12 ++
mm/vmstat.c | 2 +
16 files changed, 842 insertions(+), 143 deletions(-)

--
1.7.10.4


2013-03-14 17:49:10

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 01/30] block: implement add_bdi_stat()

From: "Kirill A. Shutemov" <[email protected]>

It's required for batched stats updates.
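
For example, a later patch in this series uses it to uncharge a whole huge
page worth of dirty accounting in __delete_from_page_cache(); roughly (an
illustrative sketch, not part of this patch):

        /* one batched update instead of HPAGE_CACHE_NR dec_bdi_stat() calls */
        add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE,
                     -HPAGE_CACHE_NR);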

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/backing-dev.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3504599..b05d961 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,6 +167,16 @@ static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
__add_bdi_stat(bdi, item, -1);
}

+static inline void add_bdi_stat(struct backing_dev_info *bdi,
+ enum bdi_stat_item item, s64 amount)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __add_bdi_stat(bdi, item, amount);
+ local_irq_restore(flags);
+}
+
static inline void dec_bdi_stat(struct backing_dev_info *bdi,
enum bdi_stat_item item)
{
--
1.7.10.4

2013-03-14 17:49:25

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 22/30] mm: add huge_fault() callback to vm_operations_struct

From: "Kirill A. Shutemov" <[email protected]>

huge_fault() should try to set up a huge page for the pgoff, if possible.
A VM_FAULT_OOM return code means we need to fall back to small pages.
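
As an illustration of how a filesystem is expected to hook it up (patch 25
does this for the generic file vm_ops; "foo" here is just a placeholder
name):

        static const struct vm_operations_struct foo_file_vm_ops = {
                .fault          = filemap_fault,
                .huge_fault     = filemap_huge_fault,   /* new callback */
                .page_mkwrite   = filemap_page_mkwrite,
        };

The fault path tries ->huge_fault() first and transparently retries with
->fault() and small pages when it gets VM_FAULT_OOM back.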

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index df83ab9..5456294 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -195,6 +195,7 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ int (*huge_fault)(struct vm_area_struct *vma, struct vm_fault *vmf);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
--
1.7.10.4

2013-03-14 17:49:31

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 27/30] thp: initial implementation of do_huge_linear_fault()

From: "Kirill A. Shutemov" <[email protected]>

The function tries to create a new page mapping using huge pages. It is
only called for pages that are not yet mapped.

As usual in THP, we fall back to small pages if we fail to allocate a huge
page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 3 +
mm/huge_memory.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 199 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b53e295..aa52c48 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,9 @@ extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags);
+extern int do_huge_linear_fault(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address, pmd_t *pmd,
+ unsigned int flags);
extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0df1309..d1adaea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -21,6 +21,7 @@
#include <linux/pagemap.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
+#include <linux/writeback.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -864,6 +865,201 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+int do_huge_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, unsigned int flags)
+{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *cow_page, *page, *dirty_page = NULL;
+ bool anon = false, fallback = false, page_mkwrite = false;
+ pgtable_t pgtable = NULL;
+ struct vm_fault vmf;
+ int ret;
+
+ /* Fallback if vm_pgoff and vm_start are not suitable */
+ if (((vma->vm_start >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (vma->vm_pgoff & HPAGE_CACHE_INDEX_MASK))
+ return do_fallback(mm, vma, address, pmd, flags);
+
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return do_fallback(mm, vma, address, pmd, flags);
+
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+
+ /*
+ * If we do COW later, allocate page before taking lock_page()
+ * on the file cache page. This will reduce lock holding time.
+ */
+ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
+ cow_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (!cow_page) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
+ page_cache_release(cow_page);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ } else
+ cow_page = NULL;
+
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable)) {
+ ret = VM_FAULT_OOM;
+ goto uncharge_out;
+ }
+
+ vmf.virtual_address = (void __user *)haddr;
+ vmf.pgoff = ((haddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ vmf.flags = flags;
+ vmf.page = NULL;
+
+ ret = vma->vm_ops->huge_fault(vma, &vmf);
+ if (unlikely(ret & VM_FAULT_OOM))
+ goto uncharge_out_fallback;
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
+ goto uncharge_out;
+
+ if (unlikely(PageHWPoison(vmf.page))) {
+ if (ret & VM_FAULT_LOCKED)
+ unlock_page(vmf.page);
+ ret = VM_FAULT_HWPOISON;
+ goto uncharge_out;
+ }
+
+ /*
+ * For consistency in subsequent calls, make the faulted page always
+ * locked.
+ */
+ if (unlikely(!(ret & VM_FAULT_LOCKED)))
+ lock_page(vmf.page);
+ else
+ VM_BUG_ON(!PageLocked(vmf.page));
+
+ /*
+ * Should we do an early C-O-W break?
+ */
+ page = vmf.page;
+ if (flags & FAULT_FLAG_WRITE) {
+ if (!(vma->vm_flags & VM_SHARED)) {
+ page = cow_page;
+ anon = true;
+ copy_user_huge_page(page, vmf.page, haddr, vma,
+ HPAGE_PMD_NR);
+ __SetPageUptodate(page);
+ } else if (vma->vm_ops->page_mkwrite) {
+ int tmp;
+
+ unlock_page(page);
+ vmf.flags = FAULT_FLAG_WRITE | FAULT_FLAG_MKWRITE;
+ tmp = vma->vm_ops->page_mkwrite(vma, &vmf);
+ if (unlikely(tmp &
+ (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+ ret = tmp;
+ goto unwritable_page;
+ }
+ if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
+ lock_page(page);
+ if (!page->mapping) {
+ ret = 0; /* retry the fault */
+ unlock_page(page);
+ goto unwritable_page;
+ }
+ } else
+ VM_BUG_ON(!PageLocked(page));
+ page_mkwrite = true;
+ }
+ }
+
+ VM_BUG_ON(!PageCompound(page));
+
+ spin_lock(&mm->page_table_lock);
+ if (likely(pmd_none(*pmd))) {
+ pmd_t entry;
+
+ flush_icache_page(vma, page);
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ if (flags & FAULT_FLAG_WRITE)
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (anon) {
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ page_add_new_anon_rmap(page, vma, haddr);
+ } else {
+ add_mm_counter(mm, MM_FILEPAGES, HPAGE_PMD_NR);
+ page_add_file_rmap(page);
+ if (flags & FAULT_FLAG_WRITE) {
+ dirty_page = page;
+ get_page(dirty_page);
+ }
+ }
+ set_pmd_at(mm, haddr, pmd, entry);
+ pgtable_trans_huge_deposit(mm, pgtable);
+ mm->nr_ptes++;
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache_pmd(vma, address, pmd);
+ } else {
+ if (cow_page)
+ mem_cgroup_uncharge_page(cow_page);
+ if (anon)
+ page_cache_release(page);
+ else
+ anon = true; /* no anon but release faulted_page */
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ if (dirty_page) {
+ struct address_space *mapping = page->mapping;
+ bool dirtied = false;
+
+ if (set_page_dirty(dirty_page))
+ dirtied = true;
+ unlock_page(dirty_page);
+ put_page(dirty_page);
+ if ((dirtied || page_mkwrite) && mapping) {
+ /*
+ * Some device drivers do not set page.mapping but still
+ * dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+
+ /* file_update_time outside page_lock */
+ if (vma->vm_file && !page_mkwrite)
+ file_update_time(vma->vm_file);
+ } else {
+ unlock_page(vmf.page);
+ if (anon)
+ page_cache_release(vmf.page);
+ }
+
+ return ret;
+
+unwritable_page:
+ pte_free(mm, pgtable);
+ page_cache_release(page);
+ return ret;
+uncharge_out_fallback:
+ fallback = true;
+uncharge_out:
+ if (pgtable)
+ pte_free(mm, pgtable);
+ if (cow_page) {
+ mem_cgroup_uncharge_page(cow_page);
+ page_cache_release(cow_page);
+ }
+
+ if (fallback)
+ return do_fallback(mm, vma, address, pmd, flags);
+ else
+ return ret;
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
--
1.7.10.4

2013-03-14 17:49:29

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 25/30] thp, mm: basic huge_fault implementation for generic_file_vm_ops

From: "Kirill A. Shutemov" <[email protected]>

It provides enough functionality for simple cases like ramfs. It needs to
be extended later.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 75 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 57611be..032fec39 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1781,6 +1781,80 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ pgoff_t size, offset = vmf->pgoff;
+ unsigned long address = (unsigned long) vmf->virtual_address;
+ struct page *page;
+ int ret = 0;
+
+ BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
+ (offset & HPAGE_CACHE_INDEX_MASK));
+
+retry:
+ page = find_get_page(mapping, offset);
+ if (!page) {
+ gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
+ page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
+ if (!page) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ return VM_FAULT_OOM;
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
+ if (ret == 0)
+ ret = mapping->a_ops->readpage(file, page);
+ else if (ret == -EEXIST)
+ ret = 0; /* losing race to add is OK */
+ page_cache_release(page);
+ if (!ret || ret == AOP_TRUNCATED_PAGE)
+ goto retry;
+ return ret;
+ }
+
+ if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+ page_cache_release(page);
+ return ret | VM_FAULT_RETRY;
+ }
+
+ /* Did it get truncated? */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ put_page(page);
+ goto retry;
+ }
+ VM_BUG_ON(page->index != offset);
+ VM_BUG_ON(!PageUptodate(page));
+
+ if (!PageTransHuge(page)) {
+ unlock_page(page);
+ put_page(page);
+ /* Ask fallback to small pages */
+ return VM_FAULT_OOM;
+ }
+
+ /*
+ * Found the page and have a reference on it.
+ * We must recheck i_size under page lock.
+ */
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ if (unlikely(offset >= size)) {
+ unlock_page(page);
+ page_cache_release(page);
+ return VM_FAULT_SIGBUS;
+ }
+
+ vmf->page = page;
+ return ret | VM_FAULT_LOCKED;
+}
+#else
+#define filemap_huge_fault NULL
+#endif
+
int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct page *page = vmf->page;
@@ -1810,6 +1884,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);

const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .huge_fault = filemap_huge_fault,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
--
1.7.10.4

2013-03-14 17:49:46

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 30/30] thp: map file-backed huge pages on fault

From: "Kirill A. Shutemov" <[email protected]>

It looks like all the pieces are in place: we can map file-backed huge
pages now.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 4 +++-
mm/memory.c | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c6e3aef..c175c78 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,7 +80,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG) && \
((__vma)->vm_flags & VM_HUGEPAGE))) && \
!((__vma)->vm_flags & VM_NOHUGEPAGE) && \
- !is_vma_temporary_stack(__vma))
+ !is_vma_temporary_stack(__vma) && \
+ (!(__vma)->vm_ops || \
+ mapping_can_have_hugepages((__vma)->vm_file->f_mapping)))
#define transparent_hugepage_defrag(__vma) \
((transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)) || \
diff --git a/mm/memory.c b/mm/memory.c
index 52bd6cf..f3b5a11 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3738,6 +3738,7 @@ retry:
if (!vma->vm_ops)
return do_huge_pmd_anonymous_page(mm, vma, address,
pmd, flags);
+ return do_huge_linear_fault(mm, vma, address, pmd, flags);
} else {
pmd_t orig_pmd = *pmd;
int ret;
--
1.7.10.4

2013-03-14 17:49:27

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page

From: "Kirill A. Shutemov" <[email protected]>

We are not ready to mmap file-backed transparent huge pages yet. Let's
split them on mmap() attempt.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 79ba9cd..57611be 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1696,6 +1696,8 @@ retry_find:
goto no_cached_page;
}

+ if (PageTransCompound(page))
+ split_huge_page(compound_trans_head(page));
if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
page_cache_release(page);
return ret | VM_FAULT_RETRY;
--
1.7.10.4

2013-03-14 17:50:39

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 28/30] thp: handle write-protect exception to file-backed huge pages

From: "Kirill A. Shutemov" <[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 67 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d1adaea..a416a77 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1339,7 +1339,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

- VM_BUG_ON(!vma->anon_vma);
haddr = address & HPAGE_PMD_MASK;
if (is_huge_zero_pmd(orig_pmd))
goto alloc;
@@ -1349,7 +1348,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- if (page_mapcount(page) == 1) {
+ if (PageAnon(page) && page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1357,10 +1356,72 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
update_mmu_cache_pmd(vma, address, pmd);
ret |= VM_FAULT_WRITE;
goto out_unlock;
+ } else if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
+ (VM_WRITE|VM_SHARED)) {
+ struct vm_fault vmf;
+ pmd_t entry;
+ struct address_space *mapping;
+
+ /* not yet implemented */
+ VM_BUG_ON(!vma->vm_ops || !vma->vm_ops->page_mkwrite);
+
+ vmf.virtual_address = (void __user *)haddr;
+ vmf.pgoff = page->index;
+ vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE;
+ vmf.page = page;
+
+ page_cache_get(page);
+ spin_unlock(&mm->page_table_lock);
+
+ ret = vma->vm_ops->page_mkwrite(vma, &vmf);
+ if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
+ page_cache_release(page);
+ goto out;
+ }
+ if (unlikely(!(ret & VM_FAULT_LOCKED))) {
+ lock_page(page);
+ if (!page->mapping) {
+ ret = 0; /* retry */
+ unlock_page(page);
+ page_cache_release(page);
+ goto out;
+ }
+ } else
+ VM_BUG_ON(!PageLocked(page));
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, orig_pmd))) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto out_unlock;
+ }
+
+ flush_cache_page(vma, address, pmd_pfn(orig_pmd));
+ entry = pmd_mkyoung(orig_pmd);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache_pmd(vma, address, pmd);
+ ret = VM_FAULT_WRITE;
+ spin_unlock(&mm->page_table_lock);
+
+ mapping = page->mapping;
+ set_page_dirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+ if (mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+ return ret;
}
get_page(page);
spin_unlock(&mm->page_table_lock);
alloc:
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -1424,6 +1485,10 @@ alloc:
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
put_huge_zero_page();
} else {
+ if (!PageAnon(page)) {
+ add_mm_counter(mm, MM_FILEPAGES, -HPAGE_PMD_NR);
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ }
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
put_page(page);
--
1.7.10.4

2013-03-14 17:50:55

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 21/30] x86-64, mm: proper alignment mappings with hugepages

From: "Kirill A. Shutemov" <[email protected]>

Make arch_get_unmapped_area() return an unmapped area aligned to
HPAGE_MASK if the file mapping can have huge pages.
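
A quick worked example of the new align_mask, assuming the usual x86-64
values of 4KiB base pages and 2MiB huge pages (illustration only, not part
of the patch):

        /*
         * PAGE_MASK               == ~0xfffUL
         * HPAGE_MASK              == ~0x1fffffUL
         * PAGE_MASK & ~HPAGE_MASK == 0x1ff000
         *
         * i.e. bits 12..20 of the returned address have to match
         * align_offset (pgoff << PAGE_SHIFT), so the mapping starts at
         * the same offset within a 2MiB region as the file offset and
         * can be backed by a single PMD.
         */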

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/sys_x86_64.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index dbded5a..4e20cd5 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -15,6 +15,7 @@
#include <linux/random.h>
#include <linux/uaccess.h>
#include <linux/elf.h>
+#include <linux/pagemap.h>

#include <asm/ia32.h>
#include <asm/syscalls.h>
@@ -135,7 +136,11 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.length = len;
info.low_limit = begin;
info.high_limit = end;
- info.align_mask = filp ? get_align_mask() : 0;
+ if (filp)
+ info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
+ PAGE_MASK & ~HPAGE_MASK : get_align_mask();
+ else
+ info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
return vm_unmapped_area(&info);
}
@@ -174,7 +179,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.length = len;
info.low_limit = PAGE_SIZE;
info.high_limit = mm->mmap_base;
- info.align_mask = filp ? get_align_mask() : 0;
+ if (filp)
+ info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
+ PAGE_MASK & ~HPAGE_MASK : get_align_mask();
+ else
+ info.align_mask = 0;
info.align_offset = pgoff << PAGE_SHIFT;
addr = vm_unmapped_area(&info);
if (!(addr & ~PAGE_MASK))
--
1.7.10.4

2013-03-14 17:50:54

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 24/30] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

From: "Kirill A. Shutemov" <[email protected]>

It's confusing that mk_huge_pmd() has semantics different from mk_pte()
or mk_pmd().

Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust the
prototype to match mk_pte().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 34e0385..be7b7e1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -691,11 +691,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
return pmd;
}

-static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
+static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
{
pmd_t entry;
- entry = mk_pmd(page, vma->vm_page_prot);
- entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = mk_pmd(page, prot);
entry = pmd_mkhuge(entry);
return entry;
}
@@ -723,7 +722,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
pte_free(mm, pgtable);
} else {
pmd_t entry;
- entry = mk_huge_pmd(page, vma);
+ entry = mk_huge_pmd(page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
/*
* The spinlocking to take the lru_lock inside
* page_add_new_anon_rmap() acts as a full memory
@@ -1212,7 +1212,8 @@ alloc:
goto out_mn;
} else {
pmd_t entry;
- entry = mk_huge_pmd(new_page, vma);
+ entry = mk_huge_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_clear_flush(vma, haddr, pmd);
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
@@ -2382,7 +2383,8 @@ static void collapse_huge_page(struct mm_struct *mm,
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);

- _pmd = mk_huge_pmd(new_page, vma);
+ _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);

/*
* spin_lock() below is not the equivalent of smp_wmb(), so
--
1.7.10.4

2013-03-14 17:51:46

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 23/30] thp: prepare zap_huge_pmd() to uncharge file pages

From: "Kirill A. Shutemov" <[email protected]>

Uncharge pages from the correct counter.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a23da8b..34e0385 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(&tlb->mm->page_table_lock);
put_huge_zero_page();
} else {
+ int counter;
page = pmd_page(orig_pmd);
page_remove_rmap(page);
VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ counter = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
+ add_mm_counter(tlb->mm, counter, -HPAGE_PMD_NR);
VM_BUG_ON(!PageHead(page));
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
--
1.7.10.4

2013-03-14 17:52:08

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file

From: "Kirill A. Shutemov" <[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
mm/vmstat.c | 2 ++
2 files changed, 4 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index d4b7a18..04587c4 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_FAULT_FALLBACK,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
+ THP_WRITE_ALLOC,
+ THP_WRITE_FAILED,
THP_SPLIT,
THP_ZERO_PAGE_ALLOC,
THP_ZERO_PAGE_ALLOC_FAILED,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 292b1cf..bd59782 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -818,6 +818,8 @@ const char * const vmstat_text[] = {
"thp_fault_fallback",
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
+ "thp_write_alloc",
+ "thp_write_failed",
"thp_split",
"thp_zero_page_alloc",
"thp_zero_page_alloc_failed",
--
1.7.10.4

2013-03-14 17:52:06

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 17/30] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

From: "Kirill A. Shutemov" <[email protected]>

Since we're going to have huge pages backed by files,
wait_split_huge_page() has to serialize not only over anon_vma_lock,
but over i_mmap_mutex too.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 15 ++++++++++++---
mm/huge_memory.c | 4 ++--
mm/memory.c | 4 ++--
3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a54939c..b53e295 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
__split_huge_page_pmd(__vma, __address, \
____pmd); \
} while (0)
-#define wait_split_huge_page(__anon_vma, __pmd) \
+#define wait_split_huge_page(__vma, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
- anon_vma_lock_write(__anon_vma); \
- anon_vma_unlock_write(__anon_vma); \
+ struct address_space *__mapping = (__vma)->vm_file ? \
+ (__vma)->vm_file->f_mapping : NULL; \
+ struct anon_vma *__anon_vma = (__vma)->anon_vma; \
+ if (__mapping) \
+ mutex_lock(&__mapping->i_mmap_mutex); \
+ if (__anon_vma) { \
+ anon_vma_lock_write(__anon_vma); \
+ anon_vma_unlock_write(__anon_vma); \
+ } \
+ if (__mapping) \
+ mutex_unlock(&__mapping->i_mmap_mutex); \
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb777d3..a23da8b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -907,7 +907,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
spin_unlock(&dst_mm->page_table_lock);
pte_free(dst_mm, pgtable);

- wait_split_huge_page(vma->anon_vma, src_pmd); /* src_vma */
+ wait_split_huge_page(vma, src_pmd); /* src_vma */
goto out;
}
src_page = pmd_page(pmd);
@@ -1480,7 +1480,7 @@ int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&vma->vm_mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return -1;
} else {
/* Thp mapped by 'pmd' is stable, so we can
diff --git a/mm/memory.c b/mm/memory.c
index 98c25dd..52bd6cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -619,7 +619,7 @@ int __pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (new)
pte_free(mm, new);
if (wait_split_huge_page)
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
return 0;
}

@@ -1529,7 +1529,7 @@ struct page *follow_page_mask(struct vm_area_struct *vma,
if (likely(pmd_trans_huge(*pmd))) {
if (unlikely(pmd_trans_splitting(*pmd))) {
spin_unlock(&mm->page_table_lock);
- wait_split_huge_page(vma->anon_vma, pmd);
+ wait_split_huge_page(vma, pmd);
} else {
page = follow_trans_huge_pmd(vma, address,
pmd, flags);
--
1.7.10.4

2013-03-14 17:52:05

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

ramfs is the simplest fs from the page cache point of view. Let's start
enabling transparent huge page cache here.

For now we allocate only non-movable huge pages. It's not yet clear whether
movable pages are safe here and what needs to be done to make them safe.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/ramfs/inode.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index c24f1e1..da30b4f 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
inode_init_owner(inode, dir, mode);
inode->i_mapping->a_ops = &ramfs_aops;
inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+ /*
+ * TODO: what should be done to make movable safe?
+ */
+ mapping_set_gfp_mask(inode->i_mapping,
+ GFP_TRANSHUGE & ~__GFP_MOVABLE);
mapping_set_unevictable(inode->i_mapping);
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
switch (mode & S_IFMT) {
--
1.7.10.4

2013-03-14 17:53:24

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists

From: "Kirill A. Shutemov" <[email protected]>

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages
that have been placed on the LRU lists when first allocated), but these
pages must not have PageUnevictable set - otherwise shrink_active_list
goes crazy:

kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!
invalid opcode: 0000 [#1] SMP
CPU 0
Pid: 293, comm: kswapd0 Not tainted 3.8.0-rc6-next-20130202+ #531
RIP: 0010:[<ffffffff81110478>] [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP: 0000:ffff8800796d9b28 EFLAGS: 00010082
RAX: 00000000ffffffea RBX: 0000000000000012 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffea0001de8040
RBP: ffff8800796d9b88 R08: ffff8800796d9df0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
R13: ffffea0001de8060 R14: ffffffff818818e8 R15: ffff8800796d9bf8
FS: 0000000000000000(0000) GS:ffff88007a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1bfc108000 CR3: 000000000180b000 CR4: 00000000000406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kswapd0 (pid: 293, threadinfo ffff8800796d8000, task ffff880079e0a6e0)
Stack:
ffff8800796d9b48 ffffffff81881880 ffff8800796d9df0 ffff8800796d9be0
0000000000000002 000000000000001f ffff8800796d9b88 ffffffff818818c8
ffffffff81881480 ffff8800796d9dc0 0000000000000002 000000000000001f
Call Trace:
[<ffffffff81111e98>] shrink_inactive_list+0x108/0x4a0
[<ffffffff8109ce3d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8107b8bf>] ? local_clock+0x4f/0x60
[<ffffffff8110ff5d>] ? shrink_slab+0x1fd/0x4c0
[<ffffffff811125a1>] shrink_zone+0x371/0x610
[<ffffffff8110ff75>] ? shrink_slab+0x215/0x4c0
[<ffffffff81112dfc>] kswapd+0x5bc/0xb60
[<ffffffff81112840>] ? shrink_zone+0x610/0x610
[<ffffffff81066676>] kthread+0xd6/0xe0
[<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
[<ffffffff814fed6c>] ret_from_fork+0x7c/0xb0
[<ffffffff810665a0>] ? __kthread_bind+0x40/0x40
Code: 1f 40 00 49 8b 45 08 49 8b 75 00 48 89 46 08 48 89 30 49 8b 06 4c 89 68 08 49 89 45 00 4d 89 75 08 4d 89 2e eb 9c 0f 1f 44 00 00 <0f> 0b 66 0f 1f 44 00 00 31 db 45 31 e4 eb 9b 0f 0b 0f 0b 65 48
RIP [<ffffffff81110478>] isolate_lru_pages.isra.61+0x138/0x260
RSP <ffff8800796d9b28>

For lru_add_page_tail(), it means we should not set PageUnevictable()
on tail pages unless we're sure that they will go to LRU_UNEVICTABLE.
A tail page will go to LRU_UNEVICTABLE if the head page is not on the LRU
or if it's marked PageUnevictable() too.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/swap.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 92a9be5..31584d0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
lru = LRU_INACTIVE_ANON;
}
} else {
- SetPageUnevictable(page_tail);
+ if (!PageLRU(page) || PageUnevictable(page))
+ SetPageUnevictable(page_tail);
lru = LRU_UNEVICTABLE;
}

--
1.7.10.4

2013-03-14 17:53:22

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 10/30] thp, mm: locking tail page is a bug

From: "Kirill A. Shutemov" <[email protected]>

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 0ff3403..38fdc92 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -669,6 +669,7 @@ void __lock_page(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
__wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
TASK_UNINTERRUPTIBLE);
}
@@ -678,6 +679,7 @@ int __lock_page_killable(struct page *page)
{
DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);

+ VM_BUG_ON(PageTail(page));
return __wait_on_bit_lock(page_waitqueue(page), &wait,
sleep_on_page_killable, TASK_KILLABLE);
}
--
1.7.10.4

2013-03-14 17:49:23

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 18/30] thp, mm: truncate support for transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>

If the starting position of the truncation is within a tail page, we have
to split the huge page first.

We also have to split if the end is within the huge page. Otherwise we can
truncate the whole huge page at once.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/truncate.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..87c247d 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -231,6 +231,17 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index > end)
break;

+ /* split page if we start from tail page */
+ if (PageTransTail(page))
+ split_huge_page(compound_trans_head(page));
+ if (PageTransHuge(page)) {
+ /* split if end is within huge page */
+ if (index == (end & ~HPAGE_CACHE_INDEX_MASK))
+ split_huge_page(page);
+ else
+ /* skip tail pages */
+ i += HPAGE_CACHE_NR - 1;
+ }
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index > end)
break;

+ VM_BUG_ON(PageTransHuge(page));
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
--
1.7.10.4

2013-03-14 17:53:57

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 06/30] thp, mm: basic defines for transparent huge page cache

From: "Kirill A. Shutemov" <[email protected]>
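
For reference, with the usual x86-64 configuration (4KiB base pages, 2MiB
huge pages) the new defines work out to the following (illustrative
values, not part of the patch itself):

        HPAGE_CACHE_ORDER      == HPAGE_SHIFT - PAGE_CACHE_SHIFT == 21 - 12 == 9
        HPAGE_CACHE_NR         == 1L << 9 == 512
        HPAGE_CACHE_INDEX_MASK == 512 - 1 == 511

so a huge page covers 512 consecutive page cache indexes and the head page
index is always a multiple of 512.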

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ee1c244..a54939c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -64,6 +64,10 @@ extern pmd_t *page_check_address_pmd(struct page *page,
#define HPAGE_PMD_MASK HPAGE_MASK
#define HPAGE_PMD_SIZE HPAGE_SIZE

+#define HPAGE_CACHE_ORDER (HPAGE_SHIFT - PAGE_CACHE_SHIFT)
+#define HPAGE_CACHE_NR (1L << HPAGE_CACHE_ORDER)
+#define HPAGE_CACHE_INDEX_MASK (HPAGE_CACHE_NR - 1)
+
extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

#define transparent_hugepage_enabled(__vma) \
@@ -181,6 +185,10 @@ extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vm
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })

+#define HPAGE_CACHE_ORDER ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_NR ({ BUILD_BUG(); 0; })
+#define HPAGE_CACHE_INDEX_MASK ({ BUILD_BUG(); 0; })
+
#define hpage_nr_pages(x) 1

#define transparent_hugepage_enabled(__vma) 0
--
1.7.10.4

2013-03-14 17:49:21

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

For a huge page we add HPAGE_CACHE_NR pages to the radix tree at once: the
head page for the specified index and HPAGE_CACHE_NR-1 tail pages for the
following indexes.
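
As a concrete sketch of the resulting radix tree layout (using the x86-64
value HPAGE_CACHE_NR == 512 purely for illustration), adding a huge page
at index 1024 fills the slots like this:

        index:  1024   1025   1026   ...   1535
        slot:   head   tail1  tail2  ...   tail511

Each slot takes its own page cache reference, so a lookup of any sub-index
returns the corresponding 4k subpage.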

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 76 ++++++++++++++++++++++++++++++++++++++++------------------
1 file changed, 53 insertions(+), 23 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2d99191..6bac9e2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -447,6 +447,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
int error;
+ int nr = 1;

VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageSwapBacked(page));
@@ -454,32 +455,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
error = mem_cgroup_cache_charge(page, current->mm,
gfp_mask & GFP_RECLAIM_MASK);
if (error)
- goto out;
+ return error;

- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
- if (error == 0) {
- page_cache_get(page);
- page->mapping = mapping;
- page->index = offset;
+ if (PageTransHuge(page)) {
+ BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
+ nr = HPAGE_CACHE_NR;
+ }
+ error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
+ if (error) {
+ mem_cgroup_uncharge_cache_page(page);
+ return error;
+ }

- spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
- if (likely(!error)) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- spin_unlock_irq(&mapping->tree_lock);
- trace_mm_filemap_add_to_page_cache(page);
- } else {
- page->mapping = NULL;
- /* Leave page->index set: truncation relies upon it */
- spin_unlock_irq(&mapping->tree_lock);
- mem_cgroup_uncharge_cache_page(page);
- page_cache_release(page);
+ page_cache_get(page);
+ spin_lock_irq(&mapping->tree_lock);
+ page->mapping = mapping;
+ page->index = offset;
+ error = radix_tree_insert(&mapping->page_tree, offset, page);
+ if (unlikely(error))
+ goto err;
+ if (PageTransHuge(page)) {
+ int i;
+ for (i = 1; i < HPAGE_CACHE_NR; i++) {
+ page_cache_get(page + i);
+ page[i].index = offset + i;
+ error = radix_tree_insert(&mapping->page_tree,
+ offset + i, page + i);
+ if (error) {
+ page_cache_release(page + i);
+ break;
+ }
}
- radix_tree_preload_end();
- } else
- mem_cgroup_uncharge_cache_page(page);
-out:
+ if (error) {
+ error = -ENOSPC; /* no space for a huge page */
+ for (i--; i > 0; i--) {
+ radix_tree_delete(&mapping->page_tree,
+ offset + i);
+ page_cache_release(page + i);
+ }
+ radix_tree_delete(&mapping->page_tree, offset);
+ goto err;
+ }
+ }
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
+ mapping->nrpages += nr;
+ spin_unlock_irq(&mapping->tree_lock);
+ trace_mm_filemap_add_to_page_cache(page);
+ radix_tree_preload_end();
+ return 0;
+err:
+ page->mapping = NULL;
+ /* Leave page->index set: truncation relies upon it */
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ mem_cgroup_uncharge_cache_page(page);
+ page_cache_release(page);
return error;
}
EXPORT_SYMBOL(add_to_page_cache_locked);
--
1.7.10.4

2013-03-14 17:54:44

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()

From: "Kirill A. Shutemov" <[email protected]>

There's only one caller of do_generic_file_read() and the only actor is
file_read_actor(). No reason to have a callback parameter.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4ebaf95..2d99191 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1075,7 +1075,6 @@ static void shrink_readahead_size_eio(struct file *filp,
* @filp: the file to read
* @ppos: current file position
* @desc: read_descriptor
- * @actor: read method
*
* This is a generic file read routine, and uses the
* mapping->a_ops->readpage() function for the actual low-level stuff.
@@ -1084,7 +1083,7 @@ static void shrink_readahead_size_eio(struct file *filp,
* of the logic when it comes to error handling etc.
*/
static void do_generic_file_read(struct file *filp, loff_t *ppos,
- read_descriptor_t *desc, read_actor_t actor)
+ read_descriptor_t *desc)
{
struct address_space *mapping = filp->f_mapping;
struct inode *inode = mapping->host;
@@ -1185,13 +1184,14 @@ page_ok:
* Ok, we have the page, and it's up-to-date, so
* now we can copy it to user space...
*
- * The actor routine returns how many bytes were actually used..
+ * The file_read_actor routine returns how many bytes were
+ * actually used..
* NOTE! This may not be the same as how much of a user buffer
* we filled up (we may be padding etc), so we can only update
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = actor(desc, page, offset, nr);
+ ret = file_read_actor(desc, page, offset, nr);
offset += ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
@@ -1464,7 +1464,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (desc.count == 0)
continue;
desc.error = 0;
- do_generic_file_read(filp, ppos, &desc, file_read_actor);
+ do_generic_file_read(filp, ppos, &desc);
retval += desc.written;
if (desc.error) {
retval = retval ?: desc.error;
--
1.7.10.4

2013-03-14 17:54:43

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 29/30] thp: call __vma_adjust_trans_huge() for file-backed VMA

From: "Kirill A. Shutemov" <[email protected]>

Since we're going to have huge pages in the page cache, we need to call
__vma_adjust_trans_huge() for file-backed VMAs, which can potentially
contain huge pages.

For now we call it for all VMAs with vm_ops->huge_fault defined.

Later we will probably need to introduce a flag to indicate that the VMA
has huge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/huge_mm.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index aa52c48..c6e3aef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -161,9 +161,9 @@ static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long end,
long adjust_next)
{
- if (!vma->anon_vma || vma->vm_ops)
- return;
- __vma_adjust_trans_huge(vma, start, end, adjust_next);
+ if ((vma->anon_vma && !vma->vm_ops) ||
+ (vma->vm_ops && vma->vm_ops->huge_fault))
+ __vma_adjust_trans_huge(vma, start, end, adjust_next);
}
static inline int hpage_nr_pages(struct page *page)
{
--
1.7.10.4

2013-03-14 17:55:30

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 26/30] thp: extract fallback path from do_huge_pmd_anonymous_page() to a function

From: "Kirill A. Shutemov" <[email protected]>

The same fallback path will be reused by non-anonymous pages, so let's
extract it into a separate function.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 112 ++++++++++++++++++++++++++++--------------------------
1 file changed, 59 insertions(+), 53 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index be7b7e1..0df1309 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -779,64 +779,12 @@ static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
return true;
}

-int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int do_fallback(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
{
- struct page *page;
- unsigned long haddr = address & HPAGE_PMD_MASK;
pte_t *pte;

- if (haddr >= vma->vm_start && haddr + HPAGE_PMD_SIZE <= vma->vm_end) {
- if (unlikely(anon_vma_prepare(vma)))
- return VM_FAULT_OOM;
- if (unlikely(khugepaged_enter(vma)))
- return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
- transparent_hugepage_use_zero_page()) {
- pgtable_t pgtable;
- unsigned long zero_pfn;
- bool set;
- pgtable = pte_alloc_one(mm, haddr);
- if (unlikely(!pgtable))
- return VM_FAULT_OOM;
- zero_pfn = get_huge_zero_page();
- if (unlikely(!zero_pfn)) {
- pte_free(mm, pgtable);
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
- spin_lock(&mm->page_table_lock);
- set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
- zero_pfn);
- spin_unlock(&mm->page_table_lock);
- if (!set) {
- pte_free(mm, pgtable);
- put_huge_zero_page();
- }
- return 0;
- }
- page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id(), 0);
- if (unlikely(!page)) {
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
- count_vm_event(THP_FAULT_ALLOC);
- if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
- put_page(page);
- goto out;
- }
- if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
- page))) {
- mem_cgroup_uncharge_page(page);
- put_page(page);
- goto out;
- }
-
- return 0;
- }
-out:
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
@@ -858,6 +806,64 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags)
+{
+ struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+
+ if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end)
+ return do_fallback(mm, vma, address, pmd, flags);
+ if (unlikely(anon_vma_prepare(vma)))
+ return VM_FAULT_OOM;
+ if (unlikely(khugepaged_enter(vma)))
+ return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE) &&
+ transparent_hugepage_use_zero_page()) {
+ pgtable_t pgtable;
+ unsigned long zero_pfn;
+ bool set;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
+ return VM_FAULT_OOM;
+ zero_pfn = get_huge_zero_page();
+ if (unlikely(!zero_pfn)) {
+ pte_free(mm, pgtable);
+ count_vm_event(THP_FAULT_FALLBACK);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ spin_lock(&mm->page_table_lock);
+ set = set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_pfn);
+ spin_unlock(&mm->page_table_lock);
+ if (!set) {
+ pte_free(mm, pgtable);
+ put_huge_zero_page();
+ }
+ return 0;
+ }
+ page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
+ vma, haddr, numa_node_id(), 0);
+ if (unlikely(!page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ count_vm_event(THP_FAULT_ALLOC);
+ if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
+ put_page(page);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+ if (unlikely(__do_huge_pmd_anonymous_page(mm, vma, haddr, pmd,
+ page))) {
+ mem_cgroup_uncharge_page(page);
+ put_page(page);
+ return do_fallback(mm, vma, address, pmd, flags);
+ }
+
+ return 0;
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
--
1.7.10.4

2013-03-14 17:55:51

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()

From: "Kirill A. Shutemov" <[email protected]>

The basic scheme is the same as for anonymous pages, but we walk
mapping->i_mmap rather than anon_vma->rb_root.

__split_huge_page_refcount() has been tuned a bit: we need to transfer
PG_swapbacked to tail pages.

Splitting of mapped pages hasn't been tested at all, since we cannot
mmap() file-backed huge pages yet.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 54 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..eb777d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1613,7 +1613,8 @@ static void __split_huge_page_refcount(struct page *page)
((1L << PG_referenced) |
(1L << PG_swapbacked) |
(1L << PG_mlocked) |
- (1L << PG_uptodate)));
+ (1L << PG_uptodate) |
+ (1L << PG_swapbacked)));
page_tail->flags |= (1L << PG_dirty);

/* clear PageTail before overwriting first_page */
@@ -1641,10 +1642,8 @@ static void __split_huge_page_refcount(struct page *page)
page_tail->index = page->index + i;
page_nid_xchg_last(page_tail, page_nid_last(page));

- BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
BUG_ON(!PageDirty(page_tail));
- BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec);
}
@@ -1752,7 +1751,7 @@ static int __split_huge_page_map(struct page *page,
}

/* must be called with anon_vma->root->rwsem held */
-static void __split_huge_page(struct page *page,
+static void __split_anon_huge_page(struct page *page,
struct anon_vma *anon_vma)
{
int mapcount, mapcount2;
@@ -1799,14 +1798,11 @@ static void __split_huge_page(struct page *page,
BUG_ON(mapcount != mapcount2);
}

-int split_huge_page(struct page *page)
+static int split_anon_huge_page(struct page *page)
{
struct anon_vma *anon_vma;
int ret = 1;

- BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
- BUG_ON(!PageAnon(page));
-
/*
* The caller does not necessarily hold an mmap_sem that would prevent
* the anon_vma disappearing so we first we take a reference to it
@@ -1824,7 +1820,7 @@ int split_huge_page(struct page *page)
goto out_unlock;

BUG_ON(!PageSwapBacked(page));
- __split_huge_page(page, anon_vma);
+ __split_anon_huge_page(page, anon_vma);
count_vm_event(THP_SPLIT);

BUG_ON(PageCompound(page));
@@ -1835,6 +1831,55 @@ out:
return ret;
}

+static int split_file_huge_page(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ int mapcount, mapcount2;
+
+ BUG_ON(!PageHead(page));
+ BUG_ON(PageTail(page));
+
+ mutex_lock(&mapping->i_mmap_mutex);
+ mapcount = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount += __split_huge_page_splitting(page, vma, addr);
+ }
+
+ if (mapcount != page_mapcount(page))
+ printk(KERN_ERR "mapcount %d page_mapcount %d\n",
+ mapcount, page_mapcount(page));
+ BUG_ON(mapcount != page_mapcount(page));
+
+ __split_huge_page_refcount(page);
+
+ mapcount2 = 0;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ unsigned long addr = vma_address(page, vma);
+ mapcount2 += __split_huge_page_map(page, vma, addr);
+ }
+
+ if (mapcount != mapcount2)
+ printk(KERN_ERR "mapcount %d mapcount2 %d page_mapcount %d\n",
+ mapcount, mapcount2, page_mapcount(page));
+ BUG_ON(mapcount != mapcount2);
+ count_vm_event(THP_SPLIT);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return 0;
+}
+
+int split_huge_page(struct page *page)
+{
+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
+
+ if (PageAnon(page))
+ return split_anon_huge_page(page);
+ else
+ return split_file_huge_page(page);
+}
+
#define VM_NO_THP (VM_SPECIAL|VM_MIXEDMAP|VM_HUGETLB|VM_SHARED|VM_MAYSHARE)

int hugepage_madvise(struct vm_area_struct *vma,
--
1.7.10.4

2013-03-14 17:49:18

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 09/30] thp, mm: rewrite delete_from_page_cache() to support huge pages

From: "Kirill A. Shutemov" <[email protected]>

As with add_to_page_cache_locked(), we handle HPAGE_CACHE_NR pages at a
time.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6bac9e2..0ff3403 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,6 +115,7 @@
void __delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ int nr = 1;

trace_mm_filemap_delete_from_page_cache(page);
/*
@@ -127,13 +128,23 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);

- radix_tree_delete(&mapping->page_tree, page->index);
+ if (PageTransHuge(page)) {
+ int i;
+
+ for (i = 0; i < HPAGE_CACHE_NR; i++)
+ radix_tree_delete(&mapping->page_tree, page->index + i);
+ nr = HPAGE_CACHE_NR;
+ } else {
+ radix_tree_delete(&mapping->page_tree, page->index);
+ }
+
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
- mapping->nrpages--;
- __dec_zone_page_state(page, NR_FILE_PAGES);
+
+ mapping->nrpages -= nr;
+ __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, -nr);
if (PageSwapBacked(page))
- __dec_zone_page_state(page, NR_SHMEM);
+ __mod_zone_page_state(page_zone(page), NR_SHMEM, -nr);
BUG_ON(page_mapped(page));

/*
@@ -144,8 +155,8 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
- dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ mod_zone_page_state(page_zone(page), NR_FILE_DIRTY, -nr);
+ add_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE, -nr);
}
}

@@ -161,6 +172,7 @@ void delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
void (*freepage)(struct page *);
+ int i;

BUG_ON(!PageLocked(page));

@@ -172,6 +184,9 @@ void delete_from_page_cache(struct page *page)

if (freepage)
freepage(page);
+ if (PageTransHuge(page))
+ for (i = 1; i < HPAGE_CACHE_NR; i++)
+ page_cache_release(page + i);
page_cache_release(page);
}
EXPORT_SYMBOL(delete_from_page_cache);
--
1.7.10.4

2013-03-14 17:56:12

by Kirill A. Shutemov

Subject: [PATCHv2, RFC 15/30] thp, libfs: initial support of thp in simple_read/write_begin/write_end

From: "Kirill A. Shutemov" <[email protected]>

For now we try to grab a huge cache page if gfp_mask has __GFP_COMP set.
That's probably too weak a condition and needs to be reworked later.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/libfs.c | 50 +++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 39 insertions(+), 11 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 916da8c..6edfa9d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr);

int simple_readpage(struct file *file, struct page *page)
{
- clear_highpage(page);
+ if (PageTransHuge(page))
+ zero_huge_user(page, 0, HPAGE_PMD_SIZE);
+ else
+ clear_highpage(page);
flush_dcache_page(page);
SetPageUptodate(page);
unlock_page(page);
@@ -394,21 +397,41 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
{
- struct page *page;
+ struct page *page = NULL;
pgoff_t index;

index = pos >> PAGE_CACHE_SHIFT;

- page = grab_cache_page_write_begin(mapping, index, flags);
+ /* XXX: too weak condition. Good enough for initial testing */
+ if (mapping_can_have_hugepages(mapping)) {
+ page = grab_cache_huge_page_write_begin(mapping,
+ index & ~HPAGE_CACHE_INDEX_MASK, flags);
+ /* fallback to small page */
+ if (!page || !PageTransHuge(page)) {
+ unsigned long offset;
+ offset = pos & ~PAGE_CACHE_MASK;
+ len = min_t(unsigned long,
+ len, PAGE_CACHE_SIZE - offset);
+ }
+ }
+ if (!page)
+ page = grab_cache_page_write_begin(mapping, index, flags);
if (!page)
return -ENOMEM;
-
*pagep = page;

- if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
+ if (!PageUptodate(page)) {
+ unsigned from;
+
+ if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user_segments(page, 0, from,
+ from + len, HPAGE_PMD_SIZE);
+ } else if (len != PAGE_CACHE_SIZE) {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user_segments(page, 0, from,
+ from + len, PAGE_CACHE_SIZE);
+ }
}
return 0;
}
@@ -443,9 +466,14 @@ int simple_write_end(struct file *file, struct address_space *mapping,

/* zero the stale part of the page if we did a short copy */
if (copied < len) {
- unsigned from = pos & (PAGE_CACHE_SIZE - 1);
-
- zero_user(page, from + copied, len - copied);
+ unsigned from;
+ if (PageTransHuge(page)) {
+ from = pos & ~HPAGE_PMD_MASK;
+ zero_huge_user(page, from + copied, len - copied);
+ } else {
+ from = pos & ~PAGE_CACHE_MASK;
+ zero_user(page, from + copied, len - copied);
+ }
}

if (!PageUptodate(page))
--
1.7.10.4

2013-03-14 17:56:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines

From: "Kirill A. Shutemov" <[email protected]>

For now we still write/read at most PAGE_CACHE_SIZE bytes at a time.

This implementation doesn't cover address spaces with backing store.
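
For reference, the offset arithmetic in the hunks below boils down to the
following sketch (illustrative only; the helper is made up): the offset
within the huge page selects a small page of the compound page plus an
offset within that small page.

	/* Illustrative only: map a file position inside a huge page to the
	 * small page within the compound page and the offset in that page. */
	static struct page *thp_subpage(struct page *head, loff_t pos,
					unsigned long *poffset)
	{
		unsigned long offset = pos & ~HPAGE_PMD_MASK;

		*poffset = offset & ~PAGE_CACHE_MASK;
		return head + (offset >> PAGE_CACHE_SHIFT);
	}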

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/filemap.c | 35 ++++++++++++++++++++++++++++++-----
1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index bdedb1b..79ba9cd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1165,12 +1165,23 @@ find_page:
if (unlikely(page == NULL))
goto no_cached_page;
}
+ if (PageTransTail(page)) {
+ page_cache_release(page);
+ page = find_get_page(mapping,
+ index & ~HPAGE_CACHE_INDEX_MASK);
+ if (!PageTransHuge(page)) {
+ page_cache_release(page);
+ goto find_page;
+ }
+ }
if (PageReadahead(page)) {
+ BUG_ON(PageTransHuge(page));
page_cache_async_readahead(mapping,
ra, filp, page,
index, last_index - index);
}
if (!PageUptodate(page)) {
+ BUG_ON(PageTransHuge(page));
if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
!mapping->a_ops->is_partially_uptodate)
goto page_not_up_to_date;
@@ -1212,18 +1223,25 @@ page_ok:
}
nr = nr - offset;

+ /* Recalculate offset in page if we've got a huge page */
+ if (PageTransHuge(page)) {
+ offset = (((loff_t)index << PAGE_CACHE_SHIFT) + offset);
+ offset &= ~HPAGE_PMD_MASK;
+ }
+
/* If users can be writing to this page using arbitrary
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
if (mapping_writably_mapped(mapping))
- flush_dcache_page(page);
+ flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT));

/*
* When a sequential read accesses a page several times,
* only mark it as accessed the first time.
*/
- if (prev_index != index || offset != prev_offset)
+ if (prev_index != index ||
+ (offset & ~PAGE_CACHE_MASK) != prev_offset)
mark_page_accessed(page);
prev_index = index;

@@ -1238,8 +1256,9 @@ page_ok:
* "pos" here (the actor routine has to update the user buffer
* pointers and the remaining count).
*/
- ret = file_read_actor(desc, page, offset, nr);
- offset += ret;
+ ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT),
+ offset & ~PAGE_CACHE_MASK, nr);
+ offset = (offset & ~PAGE_CACHE_MASK) + ret;
index += offset >> PAGE_CACHE_SHIFT;
offset &= ~PAGE_CACHE_MASK;
prev_offset = offset;
@@ -2440,8 +2459,13 @@ again:
if (mapping_writably_mapped(mapping))
flush_dcache_page(page);

+ if (PageTransHuge(page))
+ offset = pos & ~HPAGE_PMD_MASK;
+
pagefault_disable();
- copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
+ copied = iov_iter_copy_from_user_atomic(
+ page + (offset >> PAGE_CACHE_SHIFT),
+ i, offset & ~PAGE_CACHE_MASK, bytes);
pagefault_enable();
flush_dcache_page(page);

@@ -2464,6 +2488,7 @@ again:
* because not all segments in the iov can be copied at
* once without a pagefault.
*/
+ offset = pos & ~PAGE_CACHE_MASK;
bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
iov_iter_single_seg_count(i));
goto again;
--
1.7.10.4

2013-03-14 17:57:06

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

From: "Kirill A. Shutemov" <[email protected]>

The function is a twin of grab_cache_page_write_begin(), but it tries to
allocate a huge page at the given position, aligned to HPAGE_CACHE_NR.

If, for some reason, it's not possible to allocate a huge page at this
position, it returns NULL. The caller should take care of falling back to
small pages.
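
The intended calling pattern is roughly the following sketch (it mirrors
the libfs patch later in the series; error handling is trimmed):

	page = NULL;
	if (mapping_can_have_hugepages(mapping))
		page = grab_cache_huge_page_write_begin(mapping,
				index & ~HPAGE_CACHE_INDEX_MASK, flags);
	if (!page)	/* no huge page -- fall back to a small page */
		page = grab_cache_page_write_begin(mapping, index, flags);
	if (!page)
		return -ENOMEM;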

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 10 ++++++++
mm/filemap.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 68 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 408c4e3..c87ed7c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -270,6 +270,16 @@ unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,

struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags);
+#else
+static inline struct page *grab_cache_huge_page_write_begin(
+ struct address_space *mapping, pgoff_t index, unsigned flags)
+{
+ return NULL;
+}
+#endif

/*
* Returns locked page at given index in given cache, creating it if needed.
diff --git a/mm/filemap.c b/mm/filemap.c
index 38fdc92..bdedb1b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2332,6 +2332,64 @@ found:
}
EXPORT_SYMBOL(grab_cache_page_write_begin);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Find or create a huge page at the given pagecache position, aligned to
+ * HPAGE_CACHE_NR. Return the locked huge page.
+ *
+ * If, for some reason, it's not possible to allocate a huge page at this
+ * position, it returns NULL. Caller should take care of fallback to small
+ * pages.
+ *
+ * This function is specifically for buffered writes.
+ */
+struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ int status;
+ gfp_t gfp_mask;
+ struct page *page;
+ gfp_t gfp_notmask = 0;
+
+ BUG_ON(index & HPAGE_CACHE_INDEX_MASK);
+ gfp_mask = mapping_gfp_mask(mapping);
+ BUG_ON(!(gfp_mask & __GFP_COMP));
+ if (mapping_cap_account_dirty(mapping))
+ gfp_mask |= __GFP_WRITE;
+ if (flags & AOP_FLAG_NOFS)
+ gfp_notmask = __GFP_FS;
+repeat:
+ page = find_lock_page(mapping, index);
+ if (page) {
+ if (!PageTransHuge(page)) {
+ unlock_page(page);
+ page_cache_release(page);
+ return NULL;
+ }
+ goto found;
+ }
+
+ page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
+ if (!page) {
+ count_vm_event(THP_WRITE_FAILED);
+ return NULL;
+ }
+
+ count_vm_event(THP_WRITE_ALLOC);
+ status = add_to_page_cache_lru(page, mapping, index,
+ GFP_KERNEL & ~gfp_notmask);
+ if (unlikely(status)) {
+ page_cache_release(page);
+ if (status == -EEXIST)
+ goto repeat;
+ return NULL;
+ }
+found:
+ wait_on_page_writeback(page);
+ return page;
+}
+#endif
+
static ssize_t generic_perform_write(struct file *file,
struct iov_iter *i, loff_t pos)
{
--
1.7.10.4

2013-03-14 17:49:15

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2, RFC 02/30] mm: implement zero_huge_user_segment and friends

From: "Kirill A. Shutemov" <[email protected]>

Let's add helpers to clear huge page segment(s). They provide the same
functionality as zero_user_segment{,s} and zero_user, but for huge
pages.
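
For example, zeroing everything around a region [from, from + len) that is
about to be written to a huge page looks roughly like this (a sketch
mirroring the libfs patch later in the series):

	unsigned from = pos & ~HPAGE_PMD_MASK;

	/* Clear [0, from) and [from + len, HPAGE_PMD_SIZE) of the huge page. */
	zero_huge_user_segments(page, 0, from,
				from + len, HPAGE_PMD_SIZE);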

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 15 +++++++++++++++
mm/memory.c | 26 ++++++++++++++++++++++++++
2 files changed, 41 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5b7fd4e..df83ab9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1731,6 +1731,21 @@ extern void dump_page(struct page *page);
extern void clear_huge_page(struct page *page,
unsigned long addr,
unsigned int pages_per_huge_page);
+extern void zero_huge_user_segment(struct page *page,
+ unsigned start, unsigned end);
+static inline void zero_huge_user_segments(struct page *page,
+ unsigned start1, unsigned end1,
+ unsigned start2, unsigned end2)
+{
+ zero_huge_user_segment(page, start1, end1);
+ zero_huge_user_segment(page, start2, end2);
+}
+static inline void zero_huge_user(struct page *page,
+ unsigned start, unsigned len)
+{
+ zero_huge_user_segment(page, start, start+len);
+}
+
extern void copy_user_huge_page(struct page *dst, struct page *src,
unsigned long addr, struct vm_area_struct *vma,
unsigned int pages_per_huge_page);
diff --git a/mm/memory.c b/mm/memory.c
index 494526a..98c25dd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4213,6 +4213,32 @@ void clear_huge_page(struct page *page,
}
}

+void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
+{
+ int i;
+
+ BUG_ON(end < start);
+
+ might_sleep();
+
+ if (start == end)
+ return;
+
+ /* start and end are on the same small page */
+ if ((start & PAGE_MASK) == ((end - 1) & PAGE_MASK))
+ return zero_user_segment(page + (start >> PAGE_SHIFT),
+ start & ~PAGE_MASK,
+ ((end - 1) & ~PAGE_MASK) + 1);
+
+ zero_user_segment(page + (start >> PAGE_SHIFT),
+ start & ~PAGE_MASK, PAGE_SIZE);
+ for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) {
+ cond_resched();
+ clear_highpage(page + i);
+ }
+ zero_user_segment(page + i, 0, ((end - 1) & ~PAGE_MASK) + 1);
+}
+
static void copy_user_gigantic_page(struct page *dst, struct page *src,
unsigned long addr,
struct vm_area_struct *vma,
--
1.7.10.4

2013-03-14 17:57:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()

From: "Kirill A. Shutemov" <[email protected]>

For a tail page we call __get_page_tail(). It has the same semantics, but
operates on the tail page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 3521b0d..408c4e3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page *page)
{
VM_BUG_ON(in_interrupt());

+ if (unlikely(PageTail(page)))
+ return __get_page_tail(page);
+
#ifdef CONFIG_TINY_RCU
# ifdef CONFIG_PREEMPT_COUNT
VM_BUG_ON(!in_atomic());
@@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page *page)
return 0;
}
#endif
- VM_BUG_ON(PageTail(page));

return 1;
}
--
1.7.10.4

2013-03-14 17:57:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

From: "Kirill A. Shutemov" <[email protected]>

Returns true if the mapping can have huge pages. For now, just check for
__GFP_COMP in the gfp mask of the mapping.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/pagemap.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..3521b0d 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
(__force unsigned long)mask;
}

+static inline bool mapping_can_have_hugepages(struct address_space *m)
+{
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
+ gfp_t gfp_mask = mapping_gfp_mask(m);
+ return !!(gfp_mask & __GFP_COMP);
+ }
+
+ return false;
+}
+
/*
* The page cache can done in larger chunks than
* one page, because it allows for more efficient
--
1.7.10.4

2013-03-14 17:58:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

From: "Kirill A. Shutemov" <[email protected]>

Currently radix_tree_preload() only guarantees enough nodes to insert
one element. It's a hard limit: you cannot batch a number of inserts
under one tree_lock.

This patch introduces radix_tree_preload_count(). It allows
preallocating enough nodes to insert a number of *contiguous* elements.
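
A sketch of the intended use: preload for one huge page's worth of
contiguous indexes, then insert all of them under a single tree_lock
(roughly what the add_to_page_cache_locked() rework in this series is
expected to do; details here are illustrative):

	/* Guarantee nodes for HPAGE_CACHE_NR contiguous insertions. */
	error = radix_tree_preload_count(HPAGE_CACHE_NR, GFP_KERNEL);
	if (error)
		return error;

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < HPAGE_CACHE_NR; i++) {
		error = radix_tree_insert(&mapping->page_tree,
					  offset + i, page + i);
		if (error)
			break;
	}
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();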

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/radix-tree.h | 3 +++
lib/radix-tree.c | 32 +++++++++++++++++++++++++-------
2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ffc444c..81318cb 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -83,6 +83,8 @@ do { \
(root)->rnode = NULL; \
} while (0)

+#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */
+
/**
* Radix-tree synchronization
*
@@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
void radix_tree_init(void);
void *radix_tree_tag_set(struct radix_tree_root *root,
unsigned long index, unsigned int tag);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e796429..9bef0ac 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
* The worst case is a zero height tree with just a single item at index 0,
* and then inserting an item at index ULONG_MAX. This requires 2 new branches
* of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
+ *
+ * Worst case for adding N contiguous items is adding entries at indexes
+ * (ULONG_MAX - N) to ULONG_MAX. It requires nodes for a single worst-case
+ * item plus extra nodes if you cross the boundary from one node to the next.
+ *
* Hence:
*/
-#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
+#define RADIX_TREE_PRELOAD_MAX \
+ (RADIX_TREE_PRELOAD_MIN + \
+ DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))

/*
* Per-cpu pool of preloaded nodes
*/
struct radix_tree_preload {
int nr;
- struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
+ struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
};
static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };

@@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)

/*
* Load up this CPU's radix_tree_node buffer with sufficient objects to
- * ensure that the addition of a single element in the tree cannot fail. On
- * success, return zero, with preemption disabled. On error, return -ENOMEM
+ * ensure that the addition of *contiguous* elements in the tree cannot fail.
+ * On success, return zero, with preemption disabled. On error, return -ENOMEM
* with preemption not disabled.
*
* To make use of this facility, the radix tree must be initialised without
* __GFP_WAIT being passed to INIT_RADIX_TREE().
*/
-int radix_tree_preload(gfp_t gfp_mask)
+int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
{
struct radix_tree_preload *rtp;
struct radix_tree_node *node;
int ret = -ENOMEM;
+ int alloc = RADIX_TREE_PRELOAD_MIN +
+ DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
+
+ if (size > RADIX_TREE_PRELOAD_NR)
+ return -ENOMEM;

preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
+ while (rtp->nr < alloc) {
preempt_enable();
node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
if (node == NULL)
goto out;
preempt_disable();
rtp = &__get_cpu_var(radix_tree_preloads);
- if (rtp->nr < ARRAY_SIZE(rtp->nodes))
+ if (rtp->nr < alloc)
rtp->nodes[rtp->nr++] = node;
else
kmem_cache_free(radix_tree_node_cachep, node);
@@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
out:
return ret;
}
+
+int radix_tree_preload(gfp_t gfp_mask)
+{
+ return radix_tree_preload_count(1, gfp_mask);
+}
EXPORT_SYMBOL(radix_tree_preload);

/*
--
1.7.10.4

2013-03-15 00:21:47

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> There's only one caller of do_generic_file_read() and the only actor is
> file_read_actor(). No reason to have a callback parameter.
>
This cleanup is not urgent if it nukes no barrier for THP cache.

Hillf

2013-03-15 00:33:05

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> Here's the second version of the patchset.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
Where is your git tree including THP cache?

Hillf

2013-03-15 01:30:37

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> + page_cache_get(page);
> + spin_lock_irq(&mapping->tree_lock);
> + page->mapping = mapping;
> + page->index = offset;
> + error = radix_tree_insert(&mapping->page_tree, offset, page);
> + if (unlikely(error))
> + goto err;
> + if (PageTransHuge(page)) {
> + int i;
> + for (i = 1; i < HPAGE_CACHE_NR; i++) {
struct page *tail = page + i; to easy reader

> + page_cache_get(page + i);
s/page_cache_get/get_page_foll/ ?

> + page[i].index = offset + i;
> + error = radix_tree_insert(&mapping->page_tree,
> + offset + i, page + i);
> + if (error) {
> + page_cache_release(page + i);
> + break;
> + }
> }
> - radix_tree_preload_end();
> - } else
> - mem_cgroup_uncharge_cache_page(page);
> -out:
> + if (error) {
> + error = ENOSPC; /* no space for a huge page */
s/E/-E/

> + for (i--; i > 0; i--) {
> + radix_tree_delete(&mapping->page_tree,
> + offset + i);
> + page_cache_release(page + i);
> + }
> + radix_tree_delete(&mapping->page_tree, offset);
> + goto err;
> + }
> + }

2013-03-15 02:25:09

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 09/30] thp, mm: rewrite delete_from_page_cache() to support huge pages

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> + if (PageTransHuge(page)) {
> + int i;
> +
> + for (i = 0; i < HPAGE_CACHE_NR; i++)
> + radix_tree_delete(&mapping->page_tree, page->index + i);

Move the below page_cache_release for tail page here, please.

> + nr = HPAGE_CACHE_NR;
[...]
> + if (PageTransHuge(page))
> + for (i = 1; i < HPAGE_CACHE_NR; i++)
> + page_cache_release(page + i);
> page_cache_release(page);

2013-03-15 02:34:49

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> + pgoff_t index, unsigned flags);
> +#else
> +static inline struct page *grab_cache_huge_page_write_begin(
> + struct address_space *mapping, pgoff_t index, unsigned flags)
> +{
build bug?

> + return NULL;
> +}
> +#endif
>
btw, how about grab_thp_write_begin?

2013-03-15 03:11:34

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> + if (PageTransTail(page)) {
> + page_cache_release(page);
> + page = find_get_page(mapping,
> + index & ~HPAGE_CACHE_INDEX_MASK);
check page is valid, please.

> + if (!PageTransHuge(page)) {
> + page_cache_release(page);
> + goto find_page;
> + }
> + }

2013-03-15 06:15:14

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> -int split_huge_page(struct page *page)
> +static int split_anon_huge_page(struct page *page)
> {
> struct anon_vma *anon_vma;
> int ret = 1;
>
> - BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> - BUG_ON(!PageAnon(page));
> -
deleted, why?

2013-03-15 06:59:03

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> We are not ready to mmap file-backed tranparent huge pages.
>
It is not on todo list either.

2013-03-15 07:09:53

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 23/30] thp: prepare zap_huge_pmd() to uncharge file pages

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Uncharge pages from correct counter.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/huge_memory.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index a23da8b..34e0385 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> spin_unlock(&tlb->mm->page_table_lock);
> put_huge_zero_page();
> } else {
> + int counter;
s/counter/item/ ?

> page = pmd_page(orig_pmd);
> page_remove_rmap(page);
> VM_BUG_ON(page_mapcount(page) < 0);
> - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> + counter = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
> + add_mm_counter(tlb->mm, counter, -HPAGE_PMD_NR);
> VM_BUG_ON(!PageHead(page));
> tlb->mm->nr_ptes--;
> spin_unlock(&tlb->mm->page_table_lock);
> --
> 1.7.10.4
>

2013-03-15 07:31:30

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 24/30] thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> It's confusing that mk_huge_pmd() has sematics different from mk_pte()
> or mk_pmd().
>
> Let's move maybe_pmd_mkwrite() out of mk_huge_pmd() and adjust
> prototype to match mk_pte().
>
Not urgent if not used subsequently.

> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/huge_memory.c | 14 ++++++++------
> 1 file changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 34e0385..be7b7e1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -691,11 +691,10 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
> return pmd;
> }
>
> -static inline pmd_t mk_huge_pmd(struct page *page, struct vm_area_struct *vma)
> +static inline pmd_t mk_huge_pmd(struct page *page, pgprot_t prot)
> {
> pmd_t entry;
> - entry = mk_pmd(page, vma->vm_page_prot);
> - entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + entry = mk_pmd(page, prot);
> entry = pmd_mkhuge(entry);
> return entry;
> }
> @@ -723,7 +722,8 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
> pte_free(mm, pgtable);
> } else {
> pmd_t entry;
> - entry = mk_huge_pmd(page, vma);
> + entry = mk_huge_pmd(page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> /*
> * The spinlocking to take the lru_lock inside
> * page_add_new_anon_rmap() acts as a full memory
> @@ -1212,7 +1212,8 @@ alloc:
> goto out_mn;
> } else {
> pmd_t entry;
> - entry = mk_huge_pmd(new_page, vma);
> + entry = mk_huge_pmd(new_page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> pmdp_clear_flush(vma, haddr, pmd);
> page_add_new_anon_rmap(new_page, vma, haddr);
> set_pmd_at(mm, haddr, pmd, entry);
> @@ -2382,7 +2383,8 @@ static void collapse_huge_page(struct mm_struct *mm,
> __SetPageUptodate(new_page);
> pgtable = pmd_pgtable(_pmd);
>
> - _pmd = mk_huge_pmd(new_page, vma);
> + _pmd = mk_huge_pmd(new_page, vma->vm_page_prot);
> + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
>
> /*
> * spin_lock() below is not the equivalent of smp_wmb(), so
> --
> 1.7.10.4
>

2013-03-15 07:44:53

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 25/30] thp, mm: basic huge_fault implementation for generic_file_vm_ops

On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
<[email protected]> wrote:
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + struct inode *inode = mapping->host;
> + pgoff_t size, offset = vmf->pgoff;
> + unsigned long address = (unsigned long) vmf->virtual_address;
> + struct page *page;
> + int ret = 0;
> +
> + BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> + (offset & HPAGE_CACHE_INDEX_MASK));
> +
> +retry:
> + page = find_get_page(mapping, offset);
> + if (!page) {
> + gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
> + page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
s/pages/pages_vma/ ?

> + if (!page) {
> + count_vm_event(THP_FAULT_FALLBACK);
> + return VM_FAULT_OOM;
> + }

2013-03-15 13:21:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > There's only one caller of do_generic_file_read() and the only actor is
> > file_read_actor(). No reason to have a callback parameter.
> >
> This cleanup is not urgent if it nukes no barrier for THP cache.

Yes, it's not urgent. On the other hand, it can be applied upstream right now ;)

--
Kirill A. Shutemov

2013-03-15 13:21:56

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > + page_cache_get(page);
> > + spin_lock_irq(&mapping->tree_lock);
> > + page->mapping = mapping;
> > + page->index = offset;
> > + error = radix_tree_insert(&mapping->page_tree, offset, page);
> > + if (unlikely(error))
> > + goto err;
> > + if (PageTransHuge(page)) {
> > + int i;
> > + for (i = 1; i < HPAGE_CACHE_NR; i++) {
> struct page *tail = page + i; to easy reader
>
> > + page_cache_get(page + i);
> s/page_cache_get/get_page_foll/ ?

Why?

> > + page[i].index = offset + i;
> > + error = radix_tree_insert(&mapping->page_tree,
> > + offset + i, page + i);
> > + if (error) {
> > + page_cache_release(page + i);
> > + break;
> > + }
> > }
> > - radix_tree_preload_end();
> > - } else
> > - mem_cgroup_uncharge_cache_page(page);
> > -out:
> > + if (error) {
> > + error = ENOSPC; /* no space for a huge page */
> s/E/-E/

Good catch! Thanks.

> > + for (i--; i > 0; i--) {
> > + radix_tree_delete(&mapping->page_tree,
> > + offset + i);
> > + page_cache_release(page + i);
> > + }
> > + radix_tree_delete(&mapping->page_tree, offset);
> > + goto err;
> > + }
> > + }

--
Kirill A. Shutemov

2013-03-15 13:22:24

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 09/30] thp, mm: rewrite delete_from_page_cache() to support huge pages

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > + if (PageTransHuge(page)) {
> > + int i;
> > +
> > + for (i = 0; i < HPAGE_CACHE_NR; i++)
> > + radix_tree_delete(&mapping->page_tree, page->index + i);
>
> Move the below page_cache_release for tail page here, please.

Okay. Thanks.

> > + nr = HPAGE_CACHE_NR;
> [...]
> > + if (PageTransHuge(page))
> > + for (i = 1; i < HPAGE_CACHE_NR; i++)
> > + page_cache_release(page + i);
> > page_cache_release(page);

--
Kirill A. Shutemov

2013-03-15 13:23:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> > + pgoff_t index, unsigned flags);
> > +#else
> > +static inline struct page *grab_cache_huge_page_write_begin(
> > + struct address_space *mapping, pgoff_t index, unsigned flags)
> > +{
> build bug?

Hm?. No. Why?

> > + return NULL;
> > +}
> > +#endif
> >
> btw, how about grab_thp_write_begin?

Sounds better, thanks.

--
Kirill A. Shutemov

2013-03-15 13:25:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > -int split_huge_page(struct page *page)
> > +static int split_anon_huge_page(struct page *page)
> > {
> > struct anon_vma *anon_vma;
> > int ret = 1;
> >
> > - BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> > - BUG_ON(!PageAnon(page));
> > -
> deleted, why?

split_anon_huge_page() should only be called from split_huge_page().
Probably I could bring it back, but it's kinda redundant.

--
Kirill A. Shutemov

2013-03-15 13:25:23

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>> > + page_cache_get(page);
>> > + spin_lock_irq(&mapping->tree_lock);
>> > + page->mapping = mapping;
>> > + page->index = offset;
>> > + error = radix_tree_insert(&mapping->page_tree, offset, page);
>> > + if (unlikely(error))
>> > + goto err;
>> > + if (PageTransHuge(page)) {
>> > + int i;
>> > + for (i = 1; i < HPAGE_CACHE_NR; i++) {
>> struct page *tail = page + i; to easy reader
>>
>> > + page_cache_get(page + i);
>> s/page_cache_get/get_page_foll/ ?
>
> Why?
>
see follow_trans_huge_pmd() please.

2013-03-15 13:25:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > + if (PageTransTail(page)) {
> > + page_cache_release(page);
> > + page = find_get_page(mapping,
> > + index & ~HPAGE_CACHE_INDEX_MASK);
> check page is valid, please.

Good catch. Fixed.

> > + if (!PageTransHuge(page)) {
> > + page_cache_release(page);
> > + goto find_page;
> > + }
> > + }

--
Kirill A. Shutemov

2013-03-15 13:27:34

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > We are not ready to mmap file-backed tranparent huge pages.
> >
> It is not on todo list either.

Actually, following patches implement mmap for file-backed thp and this
split_huge_page() will catch only fallback cases.

--
Kirill A. Shutemov

2013-03-15 13:28:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 23/30] thp: prepare zap_huge_pmd() to uncharge file pages

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Uncharge pages from correct counter.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > mm/huge_memory.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index a23da8b..34e0385 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1368,10 +1368,12 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > spin_unlock(&tlb->mm->page_table_lock);
> > put_huge_zero_page();
> > } else {
> > + int counter;
> s/counter/item/ ?

I saw 'member' in another place, so I'll rename it to 'member'.

> > page = pmd_page(orig_pmd);
> > page_remove_rmap(page);
> > VM_BUG_ON(page_mapcount(page) < 0);
> > - add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > + counter = PageAnon(page) ? MM_ANONPAGES : MM_FILEPAGES;
> > + add_mm_counter(tlb->mm, counter, -HPAGE_PMD_NR);
> > VM_BUG_ON(!PageHead(page));
> > tlb->mm->nr_ptes--;
> > spin_unlock(&tlb->mm->page_table_lock);
> > --
> > 1.7.10.4
> >

--
Kirill A. Shutemov

2013-03-15 13:29:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 25/30] thp, mm: basic huge_fault implementation for generic_file_vm_ops

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +static int filemap_huge_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + struct file *file = vma->vm_file;
> > + struct address_space *mapping = file->f_mapping;
> > + struct inode *inode = mapping->host;
> > + pgoff_t size, offset = vmf->pgoff;
> > + unsigned long address = (unsigned long) vmf->virtual_address;
> > + struct page *page;
> > + int ret = 0;
> > +
> > + BUG_ON(((address >> PAGE_SHIFT) & HPAGE_CACHE_INDEX_MASK) !=
> > + (offset & HPAGE_CACHE_INDEX_MASK));
> > +
> > +retry:
> > + page = find_get_page(mapping, offset);
> > + if (!page) {
> > + gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_COLD;
> > + page = alloc_pages(gfp_mask, HPAGE_PMD_ORDER);
> s/pages/pages_vma/ ?

Fixed. Thanks.

> > + if (!page) {
> > + count_vm_event(THP_FAULT_FALLBACK);
> > + return VM_FAULT_OOM;
> > + }

--
Kirill A. Shutemov

2013-03-15 13:30:37

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

On Fri, Mar 15, 2013 at 9:24 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
>> > + pgoff_t index, unsigned flags);
>> > +#else
>> > +static inline struct page *grab_cache_huge_page_write_begin(
>> > + struct address_space *mapping, pgoff_t index, unsigned flags)
>> > +{
>> build bug?
>
> Hm?. No. Why?
>
Stop build if THP not configured?

2013-03-15 13:32:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > Here's the second version of the patchset.
> >
> > The intend of the work is get code ready to enable transparent huge page
> > cache for the most simple fs -- ramfs.
> >
> Where is your git tree including THP cache?

Here: git://git.kernel.org/pub/scm/linux/kernel/git/kas/linux.git

Tag thp/pagecache/v2 represents the patchset.
Branch thp/pagecache has fixes according to your feedback.

--
Kirill A. Shutemov

2013-03-15 13:33:37

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()

On Fri, Mar 15, 2013 at 9:26 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>> > -int split_huge_page(struct page *page)
>> > +static int split_anon_huge_page(struct page *page)
>> > {
>> > struct anon_vma *anon_vma;
>> > int ret = 1;
>> >
>> > - BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
>> > - BUG_ON(!PageAnon(page));
>> > -
>> deleted, why?
>
> split_anon_huge_page() should only be called from split_huge_page().
> Probably I could bring it back, but it's kinda redundant.
>
Ok, no more question.

2013-03-15 13:34:33

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:24 PM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> <[email protected]> wrote:
> >> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> >> > + pgoff_t index, unsigned flags);
> >> > +#else
> >> > +static inline struct page *grab_cache_huge_page_write_begin(
> >> > + struct address_space *mapping, pgoff_t index, unsigned flags)
> >> > +{
> >> build bug?
> >
> > Hm?. No. Why?
> >
> Stop build if THP not configured?

No. I've tested it without CONFIG_TRANSPARENT_HUGEPAGE.

--
Kirill A. Shutemov

2013-03-15 13:35:52

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page

On Fri, Mar 15, 2013 at 9:29 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>> >
>> > We are not ready to mmap file-backed tranparent huge pages.
>> >
>> It is not on todo list either.
>
> Actually, following patches implement mmap for file-backed thp and this
> split_huge_page() will catch only fallback cases.
>
I wonder if the effort we pay for THP cache is nuked by mmap.

2013-03-15 13:37:48

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

On Fri, Mar 15, 2013 at 9:35 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 9:24 PM, Kirill A. Shutemov
>> <[email protected]> wrote:
>> > Hillf Danton wrote:
>> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> >> <[email protected]> wrote:
>> >> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> >> > +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
>> >> > + pgoff_t index, unsigned flags);
>> >> > +#else
>> >> > +static inline struct page *grab_cache_huge_page_write_begin(
>> >> > + struct address_space *mapping, pgoff_t index, unsigned flags)
>> >> > +{
>> >> build bug?
>> >
>> > Hm?. No. Why?
>> >
>> Stop build if THP not configured?
>
> No. I've tested it without CONFIG_TRANSPARENT_HUGEPAGE.
>
OK, I see.

thanks
Hillf

2013-03-15 13:43:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 19/30] thp, mm: split huge page on mmap file page

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:29 PM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> <[email protected]> wrote:
> >> >
> >> > We are not ready to mmap file-backed tranparent huge pages.
> >> >
> >> It is not on todo list either.
> >
> > Actually, following patches implement mmap for file-backed thp and this
> > split_huge_page() will catch only fallback cases.
> >
> I wonder if the effort we pay for THP cache is nuked by mmap.

It will not be nuked after patch 30/30.

Actually, it will be split only if a process tries to map a hugepage with
unsuitable alignment.

--
Kirill A. Shutemov

2013-03-15 13:48:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> <[email protected]> wrote:
> >> > + page_cache_get(page);
> >> > + spin_lock_irq(&mapping->tree_lock);
> >> > + page->mapping = mapping;
> >> > + page->index = offset;
> >> > + error = radix_tree_insert(&mapping->page_tree, offset, page);
> >> > + if (unlikely(error))
> >> > + goto err;
> >> > + if (PageTransHuge(page)) {
> >> > + int i;
> >> > + for (i = 1; i < HPAGE_CACHE_NR; i++) {
> >> struct page *tail = page + i; to easy reader
> >>
> >> > + page_cache_get(page + i);
> >> s/page_cache_get/get_page_foll/ ?
> >
> > Why?
> >
> see follow_trans_huge_pmd() please.

Sorry, I fail to see how follow_trans_huge_pmd() is relevant here.
Could you elaborate?

--
Kirill A. Shutemov

2013-03-15 13:55:32

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

On Fri, Mar 15, 2013 at 9:50 PM, Kirill A. Shutemov
<[email protected]> wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
>> <[email protected]> wrote:
>> > Hillf Danton wrote:
>> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> >> <[email protected]> wrote:
>> >> > + page_cache_get(page);
>> >> > + spin_lock_irq(&mapping->tree_lock);
>> >> > + page->mapping = mapping;
>> >> > + page->index = offset;
>> >> > + error = radix_tree_insert(&mapping->page_tree, offset, page);
>> >> > + if (unlikely(error))
>> >> > + goto err;
>> >> > + if (PageTransHuge(page)) {
>> >> > + int i;
>> >> > + for (i = 1; i < HPAGE_CACHE_NR; i++) {
>> >> struct page *tail = page + i; to easy reader
>> >>
>> >> > + page_cache_get(page + i);
>> >> s/page_cache_get/get_page_foll/ ?
>> >
>> > Why?
>> >
>> see follow_trans_huge_pmd() please.
>
> Sorry, I fail to see how follow_trans_huge_pmd() is relevant here.
> Could you elaborate?
>
Lets see the code

page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
//page is tail now
VM_BUG_ON(!PageCompound(page));
if (flags & FOLL_GET)
get_page_foll(page);
//raise page count with the foll function

thus I raised question.

2013-03-15 15:03:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

Hillf Danton wrote:
> On Fri, Mar 15, 2013 at 9:50 PM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Hillf Danton wrote:
> >> On Fri, Mar 15, 2013 at 9:23 PM, Kirill A. Shutemov
> >> <[email protected]> wrote:
> >> > Hillf Danton wrote:
> >> >> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
> >> >> <[email protected]> wrote:
> >> >> > + page_cache_get(page);
> >> >> > + spin_lock_irq(&mapping->tree_lock);
> >> >> > + page->mapping = mapping;
> >> >> > + page->index = offset;
> >> >> > + error = radix_tree_insert(&mapping->page_tree, offset, page);
> >> >> > + if (unlikely(error))
> >> >> > + goto err;
> >> >> > + if (PageTransHuge(page)) {
> >> >> > + int i;
> >> >> > + for (i = 1; i < HPAGE_CACHE_NR; i++) {
> >> >> struct page *tail = page + i; to easy reader
> >> >>
> >> >> > + page_cache_get(page + i);
> >> >> s/page_cache_get/get_page_foll/ ?
> >> >
> >> > Why?
> >> >
> >> see follow_trans_huge_pmd() please.
> >
> > Sorry, I fail to see how follow_trans_huge_pmd() is relevant here.
> > Could you elaborate?
> >
> Lets see the code
>
> page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
> //page is tail now
> VM_BUG_ON(!PageCompound(page));
> if (flags & FOLL_GET)
> get_page_foll(page);
> //raise page count with the foll function
>
> thus I raised question.

get_page_foll() is designed to be part of the follow_page*() call chain.
get_page() can handle compound pages properly.

--
Kirill A. Shutemov

2013-03-18 04:03:16

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Here's the second version of the patchset.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> We have read()/write()/mmap() functionality now. Still plenty work ahead.

One offline question.

Why set PG_mlocked to page_tail which be splited in function
__split_huge_page_refcount?

>
> Any feedback is welcome.
>
> Changes since v1:
> - mmap();
> - fix add_to_page_cache_locked() and delete_from_page_cache();
> - introduce mapping_can_have_hugepages();
> - call split_huge_page() only for head page in filemap_fault();
> - wait_split_huge_page(): serialize over i_mmap_mutex too;
> - lru_add_page_tail: avoid PageUnevictable on active/inactive lru lists;
> - fix off-by-one in zero_huge_user_segment();
> - THP_WRITE_ALLOC/THP_WRITE_FAILED counters;
>
> TODO:
> - memcg accounting has not yet evaluated;
> - collapse;
> - migration (?);
> - stats, knobs, etc.;
> - tmpfs/shmem enabling;
>
>
> Kirill A. Shutemov (30):
> block: implement add_bdi_stat()
> mm: implement zero_huge_user_segment and friends
> mm: drop actor argument of do_generic_file_read()
> radix-tree: implement preload for multiple contiguous elements
> thp, mm: avoid PageUnevictable on active/inactive lru lists
> thp, mm: basic defines for transparent huge page cache
> thp, mm: introduce mapping_can_have_hugepages() predicate
> thp, mm: rewrite add_to_page_cache_locked() to support huge pages
> thp, mm: rewrite delete_from_page_cache() to support huge pages
> thp, mm: locking tail page is a bug
> thp, mm: handle tail pages in page_cache_get_speculative()
> thp, mm: add event counters for huge page alloc on write to a file
> thp, mm: implement grab_cache_huge_page_write_begin()
> thp, mm: naive support of thp in generic read/write routines
> thp, libfs: initial support of thp in
> simple_read/write_begin/write_end
> thp: handle file pages in split_huge_page()
> thp: wait_split_huge_page(): serialize over i_mmap_mutex too
> thp, mm: truncate support for transparent huge page cache
> thp, mm: split huge page on mmap file page
> ramfs: enable transparent huge page cache
> x86-64, mm: proper alignment mappings with hugepages
> mm: add huge_fault() callback to vm_operations_struct
> thp: prepare zap_huge_pmd() to uncharge file pages
> thp: move maybe_pmd_mkwrite() out of mk_huge_pmd()
> thp, mm: basic huge_fault implementation for generic_file_vm_ops
> thp: extract fallback path from do_huge_pmd_anonymous_page() to a
> function
> thp: initial implementation of do_huge_linear_fault()
> thp: handle write-protect exception to file-backed huge pages
> thp: call __vma_adjust_trans_huge() for file-backed VMA
> thp: map file-backed huge pages on fault
>
> arch/x86/kernel/sys_x86_64.c | 13 +-
> fs/libfs.c | 50 ++++-
> fs/ramfs/inode.c | 6 +-
> include/linux/backing-dev.h | 10 +
> include/linux/huge_mm.h | 36 +++-
> include/linux/mm.h | 16 ++
> include/linux/pagemap.h | 24 ++-
> include/linux/radix-tree.h | 3 +
> include/linux/vm_event_item.h | 2 +
> lib/radix-tree.c | 32 ++-
> mm/filemap.c | 283 +++++++++++++++++++++----
> mm/huge_memory.c | 462 ++++++++++++++++++++++++++++++++++-------
> mm/memory.c | 31 ++-
> mm/swap.c | 3 +-
> mm/truncate.c | 12 ++
> mm/vmstat.c | 2 +
> 16 files changed, 842 insertions(+), 143 deletions(-)
>

2013-03-18 11:18:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

Simon Jeons wrote:
> On 03/18/2013 12:03 PM, Simon Jeons wrote:
> > Hi Kirill,
> > On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> >> From: "Kirill A. Shutemov" <[email protected]>
> >>
> >> Here's the second version of the patchset.
> >>
> >> The intend of the work is get code ready to enable transparent huge page
> >> cache for the most simple fs -- ramfs.
> >>
> >> We have read()/write()/mmap() functionality now. Still plenty work
> >> ahead.
> >
> > One offline question.
> >
> > Why set PG_mlocked to page_tail which be splited in function
> > __split_huge_page_refcount?

Not set, but copied from head page. Head page represents up-to-date state
of compound page, we need to copy it to all tail pages on split.

> Also why can't find where _PAGE_SPLITTING and _PAGE_PSE flags are
> cleared in split_huge_page path?

The pmd is invalidated and replaced with reference to page table at the end
of __split_huge_page_map.

> Another offline question:
> Why don't clear tail page PG_tail flag in function
> __split_huge_page_refcount?

We do:

page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;

--
Kirill A. Shutemov

2013-03-18 11:29:35

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

Hi Kirill,
On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote:
> Simon Jeons wrote:
>> On 03/18/2013 12:03 PM, Simon Jeons wrote:
>>> Hi Kirill,
>>> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>>
>>>> Here's the second version of the patchset.
>>>>
>>>> The intend of the work is get code ready to enable transparent huge page
>>>> cache for the most simple fs -- ramfs.
>>>>
>>>> We have read()/write()/mmap() functionality now. Still plenty work
>>>> ahead.
>>> One offline question.
>>>
>>> Why set PG_mlocked to page_tail which be splited in function
>>> __split_huge_page_refcount?
> Not set, but copied from head page. Head page represents up-to-date sate
> of compound page, we need to copy it to all tail pages on split.

I always see up-to-date state, could you conclude to me which state can
be treated as up-to-date? :-)

>
>> Also why can't find where _PAGE_SPLITTING and _PAGE_PSE flags are
>> cleared in split_huge_page path?
>
> The pmd is invalidated and replaced with reference to page table at the end
> of __split_huge_page_map.

Since pmd is populated by page table and new flag why need
invalidated(clear present flag) before it?

>
>> Another offline question:
>> Why don't clear tail page PG_tail flag in function
>> __split_huge_page_refcount?
> We do:
>
> page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP | __PG_HWPOISON;
>

Got it.

2013-03-18 11:40:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

Simon Jeons wrote:
> Hi Kirill,
> On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote:
> > Simon Jeons wrote:
> >> On 03/18/2013 12:03 PM, Simon Jeons wrote:
> >>> Hi Kirill,
> >>> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> >>>> From: "Kirill A. Shutemov" <[email protected]>
> >>>>
> >>>> Here's the second version of the patchset.
> >>>>
> >>>> The intend of the work is get code ready to enable transparent huge page
> >>>> cache for the most simple fs -- ramfs.
> >>>>
> >>>> We have read()/write()/mmap() functionality now. Still plenty work
> >>>> ahead.
> >>> One offline question.
> >>>
> >>> Why set PG_mlocked to page_tail which be splited in function
> >>> __split_huge_page_refcount?
> > Not set, but copied from head page. Head page represents up-to-date sate
> > of compound page, we need to copy it to all tail pages on split.
>
> I always see up-to-date state, could you conclude to me which state can
> be treated as up-to-date? :-)

While we work with huge page we only alter flags (like mlocked or
uptodate) of head page, but not tail, so we have to copy flags to all tail
pages on split. We also need to distribute _count and _mapcount properly.
Just read the code.

>
> >
> >> Also why can't find where _PAGE_SPLITTING and _PAGE_PSE flags are
> >> cleared in split_huge_page path?
> >
> > The pmd is invalidated and replaced with reference to page table at the end
> > of __split_huge_page_map.
>
> Since pmd is populated by page table and new flag why need
> invalidated(clear present flag) before it?

Comment just before pmdp_invalidate() in __split_huge_page_map() is fairly
informative.

--
Kirill A. Shutemov

2013-03-18 11:42:45

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

On 03/18/2013 07:42 PM, Kirill A. Shutemov wrote:
> Simon Jeons wrote:
>> Hi Kirill,
>> On 03/18/2013 07:19 PM, Kirill A. Shutemov wrote:
>>> Simon Jeons wrote:
>>>> On 03/18/2013 12:03 PM, Simon Jeons wrote:
>>>>> Hi Kirill,
>>>>> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>>>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>>>>
>>>>>> Here's the second version of the patchset.
>>>>>>
>>>>>> The intend of the work is get code ready to enable transparent huge page
>>>>>> cache for the most simple fs -- ramfs.
>>>>>>
>>>>>> We have read()/write()/mmap() functionality now. Still plenty work
>>>>>> ahead.
>>>>> One offline question.
>>>>>
>>>>> Why set PG_mlocked to page_tail which be splited in function
>>>>> __split_huge_page_refcount?
>>> Not set, but copied from head page. Head page represents up-to-date sate
>>> of compound page, we need to copy it to all tail pages on split.
>> I always see up-to-date state, could you conclude to me which state can
>> be treated as up-to-date? :-)
> While we work with huge page we only alter flags (like mlocked or
> uptodate) of head page, but not tail, so we have to copy flags to all tail
> pages on split. We also need to distribute _count and _mapcount properly.
> Just read the code.

Sorry, you can treat this question as an offline one, unrelated to THP.
Which state of a page can be treated as up-to-date?

>
>>>
>>>> Also why can't find where _PAGE_SPLITTING and _PAGE_PSE flags are
>>>> cleared in split_huge_page path?
>>>
>>> The pmd is invalidated and replaced with reference to page table at the end
>>> of __split_huge_page_map.
>> Since pmd is populated by page table and new flag why need
>> invalidated(clear present flag) before it?
> Comment just before pmdp_invalidate() in __split_huge_page_map() is fairly
> informative.
>

2013-03-20 01:09:59

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 00/30] Transparent huge page cache

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Here's the second version of the patchset.
>
> The intend of the work is get code ready to enable transparent huge page
> cache for the most simple fs -- ramfs.
>
> We have read()/write()/mmap() functionality now. Still plenty work ahead.

It's great work. But could you give more explanation of how you designed
the THP page cache support, so that more developers find it easy to
review your big patchset?


2013-03-21 14:44:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 01/30] block: implement add_bdi_stat()

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> It's required for batched stats update.

The description here is a little terse. Could you at least also
describe how and where it's going to get used in later patches? Just a
sentence or two more would be really helpful.

You might also mention that it's essentially a clone of dec_bdi_stat().

2013-03-21 15:22:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 02/30] mm: implement zero_huge_user_segment and friends

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> Let's add helpers to clear huge page segment(s). They provide the same
> functionallity as zero_user_segment{,s} and zero_user, but for huge
> pages
...
> +static inline void zero_huge_user_segments(struct page *page,
> + unsigned start1, unsigned end1,
> + unsigned start2, unsigned end2)
> +{
> + zero_huge_user_segment(page, start1, end1);
> + zero_huge_user_segment(page, start2, end2);
> +}

I'm not sure that this helper saves very much code. The one call later
in these patches:

+ zero_huge_user_segments(page, 0, from,
+ from + len, HPAGE_PMD_SIZE);

really only saves one line over this:

zero_huge_user_segment(page, 0, from);
zero_huge_user_segment(page, from + len,
HPAGE_PMD_SIZE);

and I think the second one is much more clear to read.

I do see that there's a small-page variant of this, but I think that one
was done to save doing two kmap_atomic() operations when you wanted to
zero two separate operations. This variant doesn't have that kind of
optimization, so it makes much less sense.

> +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> +{
> + int i;
> +
> + BUG_ON(end < start);
> +
> + might_sleep();
> +
> + if (start == end)
> + return;

I've really got to wonder how much of an optimization this is in
practice. Was there a specific reason this was added?

> + /* start and end are on the same small page */
> + if ((start & PAGE_MASK) == ((end - 1) & PAGE_MASK))
> + return zero_user_segment(page + (start >> PAGE_SHIFT),
> + start & ~PAGE_MASK,
> + ((end - 1) & ~PAGE_MASK) + 1);

It wasn't immediately obvious to me why we need to optimize the "on the
same page" case. I _think_ it's because using zero_user_segments()
saves us a kmap_atomic() over the code below. Is that right? It might
be worth a comment.

> + zero_user_segment(page + (start >> PAGE_SHIFT),
> + start & ~PAGE_MASK, PAGE_SIZE);
> + for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) {
> + cond_resched();
> + clear_highpage(page + i);

zero_user_segments() does a flush_dcache_page(), which wouldn't get done
on these middle pages. Is that a problem?

> + }
> + zero_user_segment(page + i, 0, ((end - 1) & ~PAGE_MASK) + 1);
> +}

This code is dying for some local variables. It could really use a
'start_pfn_offset' and 'end_pfn_offset' or something similar. All of
the shifting and masking is a bit hard to read and it would be nice to
think of some real names for what it is doing.

It also desperately needs some comments about how it works. Some
one-liners like:

/* zero the first (possibly partial) page */
for()..
/* zero the full pages in the middle */
/* zero the last (possibly partial) page */

would be pretty sweet.
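
For what it's worth, a minimal sketch of that restructuring (the local names,
the (end - 1) math used to find the last small page, and the flush of the
middle pages are assumptions made here for illustration, not code from the
patchset):

void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
{
	unsigned first = start >> PAGE_SHIFT;		/* small page holding 'start' */
	unsigned last = (end - 1) >> PAGE_SHIFT;	/* small page holding 'end - 1' */
	unsigned start_off = start & ~PAGE_MASK;
	unsigned end_off = ((end - 1) & ~PAGE_MASK) + 1;
	unsigned i;

	BUG_ON(end < start);
	might_sleep();

	if (start == end)
		return;

	/* start and end fall within the same small page: one call, one kmap */
	if (first == last) {
		zero_user_segment(page + first, start_off, end_off);
		return;
	}

	/* zero the first (possibly partial) page */
	zero_user_segment(page + first, start_off, PAGE_SIZE);

	/* zero the full pages in the middle */
	for (i = first + 1; i < last; i++) {
		cond_resched();
		clear_highpage(page + i);
		flush_dcache_page(page + i);	/* zero_user_segment() does this too */
	}

	/* zero the last (possibly partial) page */
	zero_user_segment(page + last, 0, end_off);
}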

2013-03-21 15:25:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 03/30] mm: drop actor argument of do_generic_file_read()

On 03/15/2013 06:22 AM, Kirill A. Shutemov wrote:
> Hillf Danton wrote:
>> On Fri, Mar 15, 2013 at 1:50 AM, Kirill A. Shutemov
>> <[email protected]> wrote:
>>>
>>> There's only one caller of do_generic_file_read() and the only actor is
>>> file_read_actor(). No reason to have a callback parameter.
>>>
>> This cleanup is not urgent if it nukes no barrier for THP cache.
>
> Yes, it's not urgent. On other hand it can be applied upstream right now ;)

If someone might pick this up and merge it right away, it's probably
best to put it very first in the series and call it out as such to the
person you expect to pick it up, or even send it up to them separately.

2013-03-21 15:54:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. You cannot batch a number insert under
> one tree_lock.
>
> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.

You don't need to write a paper on how radix trees work, but it might be
nice to include a wee bit of text in here about how the existing preload
works, and how this new guarantee works.

> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ffc444c..81318cb 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,8 @@ do { \
> (root)->rnode = NULL; \
> } while (0)
>
> +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */

This eventually boils down to making the radix_tree_preload array
larger. Do we really want to do this unconditionally if it's only for
THP's benefit?

> /**
> * Radix-tree synchronization
> *
> @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
> unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> unsigned long index, unsigned long max_scan);
> int radix_tree_preload(gfp_t gfp_mask);
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
> void radix_tree_init(void);
> void *radix_tree_tag_set(struct radix_tree_root *root,
> unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index e796429..9bef0ac 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> * The worst case is a zero height tree with just a single item at index 0,
> * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *
> * Hence:
> */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> + (RADIX_TREE_PRELOAD_MIN + \
> + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>
> /*
> * Per-cpu pool of preloaded nodes
> */
> struct radix_tree_preload {
> int nr;
> - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
> };

For those of us too lazy to go compile a kernel and figure this out in
practice, how much bigger does this make the nodes[] array?

> static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>
> @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
>
> /*
> * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail. On
> - * success, return zero, with preemption disabled. On error, return -ENOMEM
> + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> + * On success, return zero, with preemption disabled. On error, return -ENOMEM
> * with preemption not disabled.
> *
> * To make use of this facility, the radix tree must be initialised without
> * __GFP_WAIT being passed to INIT_RADIX_TREE().
> */
> -int radix_tree_preload(gfp_t gfp_mask)
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
> {
> struct radix_tree_preload *rtp;
> struct radix_tree_node *node;
> int ret = -ENOMEM;
> + int alloc = RADIX_TREE_PRELOAD_MIN +
> + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);

Any chance I could talk you into giving 'alloc' a better name? Maybe
"preload_target" or "preload_fill_to".

> + if (size > RADIX_TREE_PRELOAD_NR)
> + return -ENOMEM;

I always wonder if these deep, logical -ENOMEMs deserve a WARN_ONCE().
We really don't expect to hit this unless something really funky is
going on, right?

> preempt_disable();
> rtp = &__get_cpu_var(radix_tree_preloads);
> - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> + while (rtp->nr < alloc) {
> preempt_enable();
> node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
> if (node == NULL)
> goto out;
> preempt_disable();
> rtp = &__get_cpu_var(radix_tree_preloads);
> - if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> + if (rtp->nr < alloc)
> rtp->nodes[rtp->nr++] = node;
> else
> kmem_cache_free(radix_tree_node_cachep, node);
> @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
> out:
> return ret;
> }

2013-03-21 16:13:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
> that have been placed on the LRU lists when first allocated), but these
> pages must not have PageUnevictable set - otherwise shrink_active_list
> goes crazy:
...
> For lru_add_page_tail(), it means we should not set PageUnevictable()
> for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
> The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
> it's marked PageUnevictable() too.

This is only an issue once you're using lru_add_page_tail() for
non-anonymous pages, right?

> diff --git a/mm/swap.c b/mm/swap.c
> index 92a9be5..31584d0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
> lru = LRU_INACTIVE_ANON;
> }
> } else {
> - SetPageUnevictable(page_tail);
> + if (!PageLRU(page) || PageUnevictable(page))
> + SetPageUnevictable(page_tail);
> lru = LRU_UNEVICTABLE;
> }

You were saying above that ramfs pages can get on the normal
active/inactive lists. But, this will end up getting them on the
unevictable list, right? So, we have normal ramfs pages on the
active/inactive lists, but ramfs pages after a huge-page-split on the
unevictable list. That seems a bit inconsistent.

2013-03-21 16:19:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> +{
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + gfp_t gfp_mask = mapping_gfp_mask(m);
> + return !!(gfp_mask & __GFP_COMP);
> + }
> +
> + return false;
> +}

I did a quick search in all your patches and don't see __GFP_COMP
getting _set_ anywhere. Am I missing something?

2013-03-21 17:09:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> following indexes.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/filemap.c | 76 ++++++++++++++++++++++++++++++++++++++++------------------
> 1 file changed, 53 insertions(+), 23 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 2d99191..6bac9e2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -447,6 +447,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> pgoff_t offset, gfp_t gfp_mask)
> {
> int error;
> + int nr = 1;
>
> VM_BUG_ON(!PageLocked(page));
> VM_BUG_ON(PageSwapBacked(page));
> @@ -454,32 +455,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> error = mem_cgroup_cache_charge(page, current->mm,
> gfp_mask & GFP_RECLAIM_MASK);
> if (error)
> - goto out;
> + return error;
>
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> - if (error == 0) {
> - page_cache_get(page);
> - page->mapping = mapping;
> - page->index = offset;
> + if (PageTransHuge(page)) {
> + BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> + nr = HPAGE_CACHE_NR;
> + }

That seems like a slightly odd place to put a BUILD_BUG_ON(). I guess
it doesn't matter to some degree, but does putting it inside the if()
imply anything?

> + error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
> + if (error) {
> + mem_cgroup_uncharge_cache_page(page);
> + return error;
> + }
>
> - spin_lock_irq(&mapping->tree_lock);
> - error = radix_tree_insert(&mapping->page_tree, offset, page);
> - if (likely(!error)) {
> - mapping->nrpages++;
> - __inc_zone_page_state(page, NR_FILE_PAGES);
> - spin_unlock_irq(&mapping->tree_lock);
> - trace_mm_filemap_add_to_page_cache(page);
> - } else {
> - page->mapping = NULL;
> - /* Leave page->index set: truncation relies upon it */
> - spin_unlock_irq(&mapping->tree_lock);
> - mem_cgroup_uncharge_cache_page(page);
> - page_cache_release(page);

I do really like how this rewrite de-indents this code. :)

> + page_cache_get(page);
> + spin_lock_irq(&mapping->tree_lock);
> + page->mapping = mapping;
> + page->index = offset;
> + error = radix_tree_insert(&mapping->page_tree, offset, page);
> + if (unlikely(error))
> + goto err;
> + if (PageTransHuge(page)) {
> + int i;
> + for (i = 1; i < HPAGE_CACHE_NR; i++) {
> + page_cache_get(page + i);
> + page[i].index = offset + i;

Is it OK to leave page->mapping unset for these?

> + error = radix_tree_insert(&mapping->page_tree,
> + offset + i, page + i);
> + if (error) {
> + page_cache_release(page + i);
> + break;
> + }
> }

Throughout all this new code, I'd really challenge you to try as much as
possible to minimize the code stuck under "if (PageTransHuge(page))".

For instance, could you change the for() loop a bit and have it shared
between both cases, like:

> + for (i = 0; i < nr; i++) {
> + page_cache_get(page + i);
> + page[i].index = offset + i;
> + error = radix_tree_insert(&mapping->page_tree,
> + offset + i, page + i);
> + if (error) {
> + page_cache_release(page + i);
> + break;
> + }
> }

> - radix_tree_preload_end();
> - } else
> - mem_cgroup_uncharge_cache_page(page);
> -out:
> + if (error) {
> + error = ENOSPC; /* no space for a huge page */
> + for (i--; i > 0; i--) {
> + radix_tree_delete(&mapping->page_tree,
> + offset + i);
> + page_cache_release(page + i);
> + }
> + radix_tree_delete(&mapping->page_tree, offset);

I wonder if this would look any nicer if you just did all the
page_cache_get()s for the entire huge page along with the head page, and
then released them all in one place. I think it might shrink the error
handling paths here.

> + goto err;
> + }
> + }
> + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> + mapping->nrpages += nr;
> + spin_unlock_irq(&mapping->tree_lock);
> + trace_mm_filemap_add_to_page_cache(page);

Do we need to change the tracing to make sure it notes that these were
or weren't huge pages?

> + radix_tree_preload_end();
> + return 0;
> +err:
> + page->mapping = NULL;
> + /* Leave page->index set: truncation relies upon it */
> + spin_unlock_irq(&mapping->tree_lock);
> + radix_tree_preload_end();
> + mem_cgroup_uncharge_cache_page(page);
> + page_cache_release(page);
> return error;
> }
> EXPORT_SYMBOL(add_to_page_cache_locked);

Does the cgroup code know how to handle these large pages internally
somehow? It looks like the charge/uncharge is only being done for the
head page.
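
A rough sketch of that consolidation, reusing the names from the quoted hunk
(the single rollback label and the -ENOSPC sign are assumptions for
illustration, not the patch itself):

	int i, error;

	/* take references for the whole (possibly huge) page in one place */
	for (i = 0; i < nr; i++)
		page_cache_get(page + i);

	spin_lock_irq(&mapping->tree_lock);
	for (i = 0; i < nr; i++) {
		page[i].mapping = mapping;
		page[i].index = offset + i;
		error = radix_tree_insert(&mapping->page_tree,
					  offset + i, page + i);
		if (error)
			goto rollback;
	}
	__mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
	mapping->nrpages += nr;
	spin_unlock_irq(&mapping->tree_lock);
	trace_mm_filemap_add_to_page_cache(page);
	radix_tree_preload_end();
	return 0;

rollback:
	if (i)				/* a tail insertion failed */
		error = -ENOSPC;	/* no room for the whole huge page */
	while (i--)			/* undo the successful insertions */
		radix_tree_delete(&mapping->page_tree, offset + i);
	for (i = 0; i < nr; i++)
		page[i].mapping = NULL;
	/* leave page->index set: truncation relies upon it */
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();
	mem_cgroup_uncharge_cache_page(page);
	for (i = 0; i < nr; i++)	/* drop all the references in one place */
		page_cache_release(page + i);
	return error;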

2013-03-21 17:17:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 01/30] block: implement add_bdi_stat()

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > It's required for batched stats update.
>
> The description here is a little terse. Could you at least also
> describe how and where it's going to get used in later patches? Just a
> sentence or two more would be really helpful.
>
> You might also mention that it's essentially a clone of dec_bdi_stat().

Okay. Will update the description. Thanks.

--
Kirill A. Shutemov

2013-03-21 17:18:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 10/30] thp, mm: locking tail page is a bug

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> index 0ff3403..38fdc92 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -669,6 +669,7 @@ void __lock_page(struct page *page)
> {
> DEFINE_WAIT_BIT(wait, &page->flags, PG_locked);
>
> + VM_BUG_ON(PageTail(page));
> __wait_on_bit_lock(page_waitqueue(page), &wait, sleep_on_page,
> TASK_UNINTERRUPTIBLE);
> }

Could we get a bit more of a description here in a comment or in the
patch summary about this? It makes some sense to me that code shouldn't
be mucking with the tail pages, but I'm curious what your logic is here,
too. I'm sure you've thought about it a lot more than I have.

2013-03-21 17:58:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> THP_FAULT_FALLBACK,
> THP_COLLAPSE_ALLOC,
> THP_COLLAPSE_ALLOC_FAILED,
> + THP_WRITE_ALLOC,
> + THP_WRITE_FAILED,
> THP_SPLIT,
> THP_ZERO_PAGE_ALLOC,
> THP_ZERO_PAGE_ALLOC_FAILED,

I think these names are a bit terse. It's certainly not _writes_ that
are failing and "THP_WRITE_FAILED" makes it sound that way. Also, why
do we need to differentiate these from the existing anon-hugepage vm
stats? The alloc_pages() call seems to be doing the exact same thing in
the end. Is one more likely to succeed than the other?

2013-03-21 18:13:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> The function is grab_cache_page_write_begin() twin but it tries to
> allocate huge page at given position aligned to HPAGE_CACHE_NR.

The obvious question, then, is whether we should just replace
grab_cache_page_write_begin() with this code and pass in HPAGE_CACHE_NR
or 1 based on whether we're doing a huge or normal page.

> diff --git a/mm/filemap.c b/mm/filemap.c
> index 38fdc92..bdedb1b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2332,6 +2332,64 @@ found:
> }
> EXPORT_SYMBOL(grab_cache_page_write_begin);
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * Find or create a huge page at the given pagecache position, aligned to
> + * HPAGE_CACHE_NR. Return the locked huge page.
> + *
> + * If, for some reason, it's not possible allocate a huge page at this
> + * possition, it returns NULL. Caller should take care of fallback to small
> + * pages.
> + *
> + * This function is specifically for buffered writes.
> + */
> +struct page *grab_cache_huge_page_write_begin(struct address_space *mapping,
> + pgoff_t index, unsigned flags)
> +{
> + int status;
> + gfp_t gfp_mask;
> + struct page *page;
> + gfp_t gfp_notmask = 0;
> +
> + BUG_ON(index & HPAGE_CACHE_INDEX_MASK);

--
> + gfp_mask = mapping_gfp_mask(mapping);
> + BUG_ON(!(gfp_mask & __GFP_COMP));
> + if (mapping_cap_account_dirty(mapping))
> + gfp_mask |= __GFP_WRITE;
> + if (flags & AOP_FLAG_NOFS)
> + gfp_notmask = __GFP_FS;

This whole hunk is both non-obvious and copy-n-pasted from
grab_cache_page_write_begin(). That makes me worry that bugs/features
will get added/removed in one and not the other. I really think they
need to get consolidated somehow.

> +repeat:
> + page = find_lock_page(mapping, index);
> + if (page) {
> + if (!PageTransHuge(page)) {
> + unlock_page(page);
> + page_cache_release(page);
> + return NULL;
> + }
> + goto found;
> + }
> +
> + page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);

I alluded to this a second ago, but what's wrong with alloc_hugepage()?

> + if (!page) {
> + count_vm_event(THP_WRITE_FAILED);
> + return NULL;
> + }
> +
> + count_vm_event(THP_WRITE_ALLOC);
> + status = add_to_page_cache_lru(page, mapping, index,
> + GFP_KERNEL & ~gfp_notmask);
> + if (unlikely(status)) {
> + page_cache_release(page);
> + if (status == -EEXIST)
> + goto repeat;
> + return NULL;
> + }

I'm rather un-fond of sprinking likely/unlikelies around. But, I guess
this is really just copied from the existing one. <sigh>

> +found:
> + wait_on_page_writeback(page);
> + return page;
> +}
> +#endif

So, I diffed :

-struct page *grab_cache_page_write_begin(struct address_space
vs.
+struct page *grab_cache_huge_page_write_begin(struct address_space

They're just to similar to ignore. Please consolidate them somehow.

2013-03-21 18:15:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> +found:
> + wait_on_page_writeback(page);
> + return page;
> +}
> +#endif

In grab_cache_page_write_begin(), this "wait" is:

wait_for_stable_page(page);

Why is it different here?

2013-03-22 09:20:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 02/30] mm: implement zero_huge_user_segment and friends

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > Let's add helpers to clear huge page segment(s). They provide the same
> > functionallity as zero_user_segment{,s} and zero_user, but for huge
> > pages
> ...
> > +static inline void zero_huge_user_segments(struct page *page,
> > + unsigned start1, unsigned end1,
> > + unsigned start2, unsigned end2)
> > +{
> > + zero_huge_user_segment(page, start1, end1);
> > + zero_huge_user_segment(page, start2, end2);
> > +}
>
> I'm not sure that this helper saves very much code. The one call later
> in these patches:
>
> + zero_huge_user_segments(page, 0, from,
> + from + len, HPAGE_PMD_SIZE);
>
> really only saves one line over this:
>
> zero_huge_user_segment(page, 0, from);
> zero_huge_user_segment(page, from + len,
> HPAGE_PMD_SIZE);
>
> and I think the second one is much more clear to read.

I've tried to mimic non-huge zero_user*, but, yeah, this is silly.
Will drop.

> I do see that there's a small-page variant of this, but I think that one
> was done to save doing two kmap_atomic() operations when you wanted to
> zero two separate operations. This variant doesn't have that kind of
> optimization, so it makes much less sense.
>
> > +void zero_huge_user_segment(struct page *page, unsigned start, unsigned end)
> > +{
> > + int i;
> > +
> > + BUG_ON(end < start);
> > +
> > + might_sleep();
> > +
> > + if (start == end)
> > + return;
>
> I've really got to wonder how much of an optimization this is in
> practice. Was there a specific reason this was added?

It's likely for simple_write_begin() to call zero[_huge]_user_segments()
with one of the two segments having start == end.

But, honestly, it was just easier to handle the corner case up front and not
worry about it in the code that follows. ;)

> > + /* start and end are on the same small page */
> > + if ((start & PAGE_MASK) == ((end - 1) & PAGE_MASK))
> > + return zero_user_segment(page + (start >> PAGE_SHIFT),
> > + start & ~PAGE_MASK,
> > + ((end - 1) & ~PAGE_MASK) + 1);
>
> It wasn't immediately obvious to me why we need to optimize the "on the
> same page" case. I _think_ it's because using zero_user_segments()
> saves us a kmap_atomic() over the code below. Is that right? It might
> be worth a comment.

The code below would call zero_user_segment() twice on the same small page;
here we can do it with a single call.

I'll document it.

> > + zero_user_segment(page + (start >> PAGE_SHIFT),
> > + start & ~PAGE_MASK, PAGE_SIZE);
> > + for (i = (start >> PAGE_SHIFT) + 1; i < (end >> PAGE_SHIFT) - 1; i++) {
> > + cond_resched();
> > + clear_highpage(page + i);
>
> zero_user_segments() does a flush_dcache_page(), which wouldn't get done
> on these middle pages. Is that a problem?

I think, it is. Will fix.

> > + }
> > + zero_user_segment(page + i, 0, ((end - 1) & ~PAGE_MASK) + 1);
> > +}
>
> This code is dying for some local variables. It could really use a
> 'start_pfn_offset' and 'end_pfn_offset' or something similar. All of
> the shifting and masking is a bit hard to read and it would be nice to
> think of some real names for what it is doing.
>
> It also desperately needs some comments about how it works. Some
> one-liners like:
>
> /* zero the first (possibly partial) page */
> for()..
> /* zero the full pages in the middle */
> /* zero the last (possibly partial) page */
>
> would be pretty sweet.

Okay, will rework it.

--
Kirill A. Shutemov

2013-03-22 09:46:42

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > Currently radix_tree_preload() only guarantees enough nodes to insert
> > one element. It's a hard limit. You cannot batch a number insert under
> > one tree_lock.
> >
> > This patch introduces radix_tree_preload_count(). It allows to
> > preallocate nodes enough to insert a number of *contiguous* elements.
>
> You don't need to write a paper on how radix trees work, but it might be
> nice to include a wee bit of text in here about how the existing preload
> works, and how this new guarantee works.

Reasonable, will do.

> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index ffc444c..81318cb 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -83,6 +83,8 @@ do { \
> > (root)->rnode = NULL; \
> > } while (0)
> >
> > +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */
>
> This eventually boils down to making the radix_tree_preload array
> larger. Do we really want to do this unconditionally if it's only for
> THP's benefit?

It will be useful not only for THP. Batching can be useful to solve
scalability issues.

> > /**
> > * Radix-tree synchronization
> > *
> > @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
> > unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> > unsigned long index, unsigned long max_scan);
> > int radix_tree_preload(gfp_t gfp_mask);
> > +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
> > void radix_tree_init(void);
> > void *radix_tree_tag_set(struct radix_tree_root *root,
> > unsigned long index, unsigned int tag);
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index e796429..9bef0ac 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> > * The worst case is a zero height tree with just a single item at index 0,
> > * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> > * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> > + *
> > + * Worst case for adding N contiguous items is adding entries at indexes
> > + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> > + * item plus extra nodes if you cross the boundary from one node to the next.
> > + *
> > * Hence:
> > */
> > -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> > +#define RADIX_TREE_PRELOAD_MAX \
> > + (RADIX_TREE_PRELOAD_MIN + \
> > + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
> >
> > /*
> > * Per-cpu pool of preloaded nodes
> > */
> > struct radix_tree_preload {
> > int nr;
> > - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> > + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
> > };
>
> For those of us too lazy to go compile a kernel and figure this out in
> practice, how much bigger does this make the nodes[] array?

We have three possible RADIX_TREE_MAP_SHIFT:

#ifdef __KERNEL__
#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
#else
#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
#endif

On 64-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.

On 32-bit system:
For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.

On most machines we will have RADIX_TREE_MAP_SHIFT=6.
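(In other words, for the common RADIX_TREE_MAP_SHIFT=6 case on 64-bit the
per-CPU nodes[] array grows from 21 to 30 pointers, i.e. from 168 to 240
bytes, roughly 72 extra bytes per CPU, times NR_CPUS overall.)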

>
> > static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
> >
> > @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
> >
> > /*
> > * Load up this CPU's radix_tree_node buffer with sufficient objects to
> > - * ensure that the addition of a single element in the tree cannot fail. On
> > - * success, return zero, with preemption disabled. On error, return -ENOMEM
> > + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> > + * On success, return zero, with preemption disabled. On error, return -ENOMEM
> > * with preemption not disabled.
> > *
> > * To make use of this facility, the radix tree must be initialised without
> > * __GFP_WAIT being passed to INIT_RADIX_TREE().
> > */
> > -int radix_tree_preload(gfp_t gfp_mask)
> > +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
> > {
> > struct radix_tree_preload *rtp;
> > struct radix_tree_node *node;
> > int ret = -ENOMEM;
> > + int alloc = RADIX_TREE_PRELOAD_MIN +
> > + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
>
> Any chance I could talk you in to giving 'alloc' a better name? Maybe
> "preload_target" or "preload_fill_to".

Ok.

> > + if (size > RADIX_TREE_PRELOAD_NR)
> > + return -ENOMEM;
>
> I always wonder if these deep, logical -ENOMEMs deserve a WARN_ONCE().
> We really don't expect to hit this unless something really funky is
> going on, right?

Correct. Will add WARN.

> > preempt_disable();
> > rtp = &__get_cpu_var(radix_tree_preloads);
> > - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> > + while (rtp->nr < alloc) {
> > preempt_enable();
> > node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
> > if (node == NULL)
> > goto out;
> > preempt_disable();
> > rtp = &__get_cpu_var(radix_tree_preloads);
> > - if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> > + if (rtp->nr < alloc)
> > rtp->nodes[rtp->nr++] = node;
> > else
> > kmem_cache_free(radix_tree_node_cachep, node);
> > @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
> > out:
> > return ret;
> > }

--
Kirill A. Shutemov

2013-03-22 10:09:20

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
> > that have been placed on the LRU lists when first allocated), but these
> > pages must not have PageUnevictable set - otherwise shrink_active_list
> > goes crazy:
> ...
> > For lru_add_page_tail(), it means we should not set PageUnevictable()
> > for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
> > The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
> > it's marked PageUnevictable() too.
>
> This is only an issue once you're using lru_add_page_tail() for
> non-anonymous pages, right?

I'm not sure about that. Documentation/vm/unevictable-lru.txt:

Some examples of these unevictable pages on the LRU lists are:

(1) ramfs pages that have been placed on the LRU lists when first allocated.

(2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
allocate or fault in the pages in the shared memory region. This happens
when an application accesses the page the first time after SHM_LOCK'ing
the segment.

(3) mlocked pages that could not be isolated from the LRU and moved to the
unevictable list in mlock_vma_page().

(4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
munlock_vma_page() was forced to let the page back on to the normal LRU
list for vmscan to handle.

> > diff --git a/mm/swap.c b/mm/swap.c
> > index 92a9be5..31584d0 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
> > lru = LRU_INACTIVE_ANON;
> > }
> > } else {
> > - SetPageUnevictable(page_tail);
> > + if (!PageLRU(page) || PageUnevictable(page))
> > + SetPageUnevictable(page_tail);
> > lru = LRU_UNEVICTABLE;
> > }
>
> You were saying above that ramfs pages can get on the normal
> active/inactive lists. But, this will end up getting them on the
> unevictable list, right? So, we have normal ramfs pages on the
> active/inactive lists, but ramfs pages after a huge-page-split on the
> unevictable list. That seems a bit inconsistent.

Yeah, it's confusing.

I was able to trigger another bug in this code:
if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
will go to the same lru as page, but nothing syncs page_tail's
active/inactive state with page. So we can end up with an inactive page on
the active lru...

I've updated the patch for the next iteration. You can check it in git.
It should be cleaner. The description needs to be updated.

--
Kirill A. Shutemov

2013-03-22 10:10:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > +static inline bool mapping_can_have_hugepages(struct address_space *m)
> > +{
> > + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> > + gfp_t gfp_mask = mapping_gfp_mask(m);
> > + return !!(gfp_mask & __GFP_COMP);
> > + }
> > +
> > + return false;
> > +}
>
> I did a quick search in all your patches and don't see __GFP_COMP
> getting _set_ anywhere. Am I missing something?

__GFP_COMP is part of GFP_TRANSHUGE. We set it for ramfs in patch 20/30.

--
Kirill A. Shutemov

2013-03-22 10:32:58

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > For huge page we add to radix tree HPAGE_CACHE_NR pages at once: head
> > page for the specified index and HPAGE_CACHE_NR-1 tail pages for
> > following indexes.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > mm/filemap.c | 76 ++++++++++++++++++++++++++++++++++++++++------------------
> > 1 file changed, 53 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 2d99191..6bac9e2 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -447,6 +447,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > pgoff_t offset, gfp_t gfp_mask)
> > {
> > int error;
> > + int nr = 1;
> >
> > VM_BUG_ON(!PageLocked(page));
> > VM_BUG_ON(PageSwapBacked(page));
> > @@ -454,32 +455,61 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
> > error = mem_cgroup_cache_charge(page, current->mm,
> > gfp_mask & GFP_RECLAIM_MASK);
> > if (error)
> > - goto out;
> > + return error;
> >
> > - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > - if (error == 0) {
> > - page_cache_get(page);
> > - page->mapping = mapping;
> > - page->index = offset;
> > + if (PageTransHuge(page)) {
> > + BUILD_BUG_ON(HPAGE_CACHE_NR > RADIX_TREE_PRELOAD_NR);
> > + nr = HPAGE_CACHE_NR;
> > + }
>
> That seems like a slightly odd place to put a BUILD_BUG_ON(). I guess
> it doesn't matter to some degree, but does putting it inside the if()
> imply anything?

It actually matters.

HPAGE_CACHE_NR is BUILD_BUG() if !CONFIG_TRANSPARENT_HUGEPAGE, so we need
to hide it inside 'if (PageTransHuge(page))'. PageTransHuge(page) is 0 at
compile time if !CONFIG_TRANSPARENT_HUGEPAGE, so the compiler can be smart
and optimize the whole check out.
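
For reference, the !CONFIG_TRANSPARENT_HUGEPAGE stubs behind this look
roughly like the following (simplified sketch; the exact HPAGE_CACHE_NR
definition is assumed from the description above):

static inline int PageTransHuge(struct page *page)
{
	return 0;	/* compile-time constant: the THP branch is dead code */
}

#define HPAGE_CACHE_NR	({ BUILD_BUG(); 0; })	/* must never be emitted */

Because the branch is provably dead, the compiler drops it before the
BUILD_BUG() can fire.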

> > + error = radix_tree_preload_count(nr, gfp_mask & ~__GFP_HIGHMEM);
> > + if (error) {
> > + mem_cgroup_uncharge_cache_page(page);
> > + return error;
> > + }
> >
> > - spin_lock_irq(&mapping->tree_lock);
> > - error = radix_tree_insert(&mapping->page_tree, offset, page);
> > - if (likely(!error)) {
> > - mapping->nrpages++;
> > - __inc_zone_page_state(page, NR_FILE_PAGES);
> > - spin_unlock_irq(&mapping->tree_lock);
> > - trace_mm_filemap_add_to_page_cache(page);
> > - } else {
> > - page->mapping = NULL;
> > - /* Leave page->index set: truncation relies upon it */
> > - spin_unlock_irq(&mapping->tree_lock);
> > - mem_cgroup_uncharge_cache_page(page);
> > - page_cache_release(page);
>
> I do really like how this rewrite de-indents this code. :)

:)

> > + page_cache_get(page);
> > + spin_lock_irq(&mapping->tree_lock);
> > + page->mapping = mapping;
> > + page->index = offset;
> > + error = radix_tree_insert(&mapping->page_tree, offset, page);
> > + if (unlikely(error))
> > + goto err;
> > + if (PageTransHuge(page)) {
> > + int i;
> > + for (i = 1; i < HPAGE_CACHE_NR; i++) {
> > + page_cache_get(page + i);
> > + page[i].index = offset + i;
>
> Is it OK to leave page->mapping unset for these?

Good catch, thanks.
Seems nobody really uses it, since I haven't hit any oops, but we need to
set it anyway.

> > + error = radix_tree_insert(&mapping->page_tree,
> > + offset + i, page + i);
> > + if (error) {
> > + page_cache_release(page + i);
> > + break;
> > + }
> > }
>
> Throughout all this new code, I'd really challenge you to try as much as
> possible to minimize the code stuck under "if (PageTransHuge(page))".

I put thp-related code under the 'if' intentionally to be able to optimize
it out if !CONFIG_TRANSPARENT_HUGEPAGE. The config option is disabled by
default.

> For instance, could you change the for() loop a bit and have it shared
> between both cases, like:
>
> > + for (i = 0; i < nr; i++) {
> > + page_cache_get(page + i);
> > + page[i].index = offset + i;
> > + error = radix_tree_insert(&mapping->page_tree,
> > + offset + i, page + i);
> > + if (error) {
> > + page_cache_release(page + i);
> > + break;
> > + }
> > }
>
> > - radix_tree_preload_end();
> > - } else
> > - mem_cgroup_uncharge_cache_page(page);
> > -out:
> > + if (error) {
> > + error = ENOSPC; /* no space for a huge page */
> > + for (i--; i > 0; i--) {
> > + radix_tree_delete(&mapping->page_tree,
> > + offset + i);
> > + page_cache_release(page + i);
> > + }
> > + radix_tree_delete(&mapping->page_tree, offset);
>
> I wonder if this would look any nicer if you just did all the
> page_cache_get()s for the entire huge page along with the head page, and
> then released them all in one place. I think it might shrink the error
> handling paths here.
>
> > + goto err;
> > + }
> > + }
> > + __mod_zone_page_state(page_zone(page), NR_FILE_PAGES, nr);
> > + mapping->nrpages += nr;
> > + spin_unlock_irq(&mapping->tree_lock);
> > + trace_mm_filemap_add_to_page_cache(page);
>
> Do we need to change the tracing to make sure it notes that these were
> or weren't huge pages?

Hm... I guess we just need to add the page order to the trace.

> > + radix_tree_preload_end();
> > + return 0;
> > +err:
> > + page->mapping = NULL;
> > + /* Leave page->index set: truncation relies upon it */
> > + spin_unlock_irq(&mapping->tree_lock);
> > + radix_tree_preload_end();
> > + mem_cgroup_uncharge_cache_page(page);
> > + page_cache_release(page);
> > return error;
> > }
> > EXPORT_SYMBOL(add_to_page_cache_locked);
>
> Does the cgroup code know how to handle these large pages internally
> somehow? It looks like the charge/uncharge is only being done for the
> head page.

It can. We only need to remove the PageCompound() check there. The patch is
in git.

--
Kirill A. Shutemov

2013-03-22 14:37:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

On 03/22/2013 02:47 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */
>>
>> This eventually boils down to making the radix_tree_preload array
>> larger. Do we really want to do this unconditionally if it's only for
>> THP's benefit?
>
> It will be useful not only for THP. Batching can be useful to solve
> scalability issues.

Still, it seems like something that little machines with no THP support
probably don't want to pay the cost for. Perhaps you could enable it
for THP||NR_CPUS>$FOO.

>> For those of us too lazy to go compile a kernel and figure this out in
>> practice, how much bigger does this make the nodes[] array?
>
> We have three possible RADIX_TREE_MAP_SHIFT:
>
> #ifdef __KERNEL__
> #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> #else
> #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> #endif
>
> On 64-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
>
> On 32-bit system:
> For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
>
> On most machines we will have RADIX_TREE_MAP_SHIFT=6.

Could you stick that in your patch description? The total cost is
"array size" * sizeof(void*) * NR_CPUS, right?

-- Dave Hansen, Intel OTC Scalability Team

2013-03-22 14:43:38

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

On 03/22/2013 03:12 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> +static inline bool mapping_can_have_hugepages(struct address_space *m)
>>> +{
>>> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>>> + gfp_t gfp_mask = mapping_gfp_mask(m);
>>> + return !!(gfp_mask & __GFP_COMP);
>>> + }
>>> +
>>> + return false;
>>> +}
>>
>> I did a quick search in all your patches and don't see __GFP_COMP
>> getting _set_ anywhere. Am I missing something?
>
> __GFP_COMP is part of GFP_TRANSHUGE. We set it for ramfs in patch 20/30.

That's a bit non-obvious. For a casual observer, it _seems_ like you
should just be setting and checking GFP_TRANSHUGE directly. It looks
like you were having some problems with __GFP_MOVABLE and masked it out
of GFP_TRANSHUGE and that has cascaded over to _this_ check.

I _think_ the right thing to do is add a comment up there in
mapping_can_have_hugepages() that does (GFP_TRANSHUGE & ~__GFP_MOVABLE),
and adds a TODO in the code and patch comments to clean it up once
ramfs_get_inode() gets fixed up too.
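
One possible reading of that suggestion, as a sketch (the exact form of the
check is an assumption here, not Dave's or Kirill's code):

static inline bool mapping_can_have_hugepages(struct address_space *m)
{
	/*
	 * What we would really like to test against is GFP_TRANSHUGE, but
	 * ramfs_get_inode() currently masks __GFP_MOVABLE out of the
	 * mapping's gfp mask, so test the remaining bits for now.
	 * TODO: check GFP_TRANSHUGE directly once ramfs_get_inode() is fixed.
	 */
	const gfp_t required = GFP_TRANSHUGE & ~__GFP_MOVABLE;

	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return false;

	return (mapping_gfp_mask(m) & required) == required;
}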

2013-03-22 14:50:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 08/30] thp, mm: rewrite add_to_page_cache_locked() to support huge pages

On 03/22/2013 03:34 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> + error = radix_tree_insert(&mapping->page_tree,
>>> + offset + i, page + i);
>>> + if (error) {
>>> + page_cache_release(page + i);
>>> + break;
>>> + }
>>> }
>>
>> Throughout all this new code, I'd really challenge you to try as much as
>> possible to minimize the code stuck under "if (PageTransHuge(page))".
>
> I put thp-related code under the 'if' intentionally to be able to optimize
> it out if !CONFIG_TRANSPARENT_HUGEPAGE. The config option is disabled by
> default.

You've gotta give the compiler some credit! :) In this function's case,
the compiler should be able to realize that nr=1 is constant if
!TRANSPARENT_HUGEPAGE. It'll realize that all your for loops are:

for (i = 0; i < 1; i++) {

and will end up generating _very_ similar code to what you get with the
explicit #ifdefs. You already _have_ #ifdefs, but they're just up in
the headers around PageTransHuge()'s definition.

The advantages for you are *huge* since it means that folks will have a
harder time inadvertently breaking your CONFIG_TRANSPARENT_HUGEPAGE code.

>> Does the cgroup code know how to handle these large pages internally
>> somehow? It looks like the charge/uncharge is only being done for the
>> head page.
>
> It can. We only need to remove PageCompound() check there. Patch is in
> git.

OK, cool. This _might_ deserve a comment or something here. Again, it
looks asymmetric and fishy to someone just reading the code.

2013-03-22 15:21:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> For now we still write/read at most PAGE_CACHE_SIZE bytes a time.
>
> This implementation doesn't cover address spaces with backing store.
...
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1165,12 +1165,23 @@ find_page:
> if (unlikely(page == NULL))
> goto no_cached_page;
> }
> + if (PageTransTail(page)) {
> + page_cache_release(page);
> + page = find_get_page(mapping,
> + index & ~HPAGE_CACHE_INDEX_MASK);
> + if (!PageTransHuge(page)) {
> + page_cache_release(page);
> + goto find_page;
> + }
> + }

So, we're going to do a read of a file, and we pulled a tail page out of
the page cache. Why can't we just deal with the tail page directly?
What prevents this?

Is there something special about THP pages that keeps the head page in
the page cache after the tail has been released? I'd normally be
worried that the find_get_page() here might fail.

It's probably also worth a quick comment like:

/* can't deal with tail pages directly, move to head page */

otherwise the reassignment of "page" starts to seem a bit odd.

> if (PageReadahead(page)) {
> + BUG_ON(PageTransHuge(page));
> page_cache_async_readahead(mapping,
> ra, filp, page,
> index, last_index - index);
> }

Is this because we only do readahead for fs's with backing stores?
Could we have a comment to this effect?

> if (!PageUptodate(page)) {
> + BUG_ON(PageTransHuge(page));
> if (inode->i_blkbits == PAGE_CACHE_SHIFT ||
> !mapping->a_ops->is_partially_uptodate)
> goto page_not_up_to_date;

Same question. :)

Since your two-line description covers two topics, it's not immediately
obvious which one this BUG_ON() applies to.

> @@ -1212,18 +1223,25 @@ page_ok:
> }
> nr = nr - offset;
>
> + /* Recalculate offset in page if we've got a huge page */
> + if (PageTransHuge(page)) {
> + offset = (((loff_t)index << PAGE_CACHE_SHIFT) + offset);
> + offset &= ~HPAGE_PMD_MASK;
> + }

Does this need to be done in cases other than the path that goes through
"if(PageTransTail(page))" above? If not, I'd probably stick this code
up with the other part.

> /* If users can be writing to this page using arbitrary
> * virtual addresses, take care about potential aliasing
> * before reading the page on the kernel side.
> */
> if (mapping_writably_mapped(mapping))
> - flush_dcache_page(page);
> + flush_dcache_page(page + (offset >> PAGE_CACHE_SHIFT));

This is another case where I think adding another local variable would
essentially help the code self-document. The way it stands, it's fairly
subtle how (offset>>PAGE_CACHE_SHIFT) works and that it's conditional on
THP being enabled.

int tail_page_index = (offset >> PAGE_CACHE_SHIFT)
...
> + flush_dcache_page(page + tail_page_index);

This makes it obvious that we're indexing off something, *and* that it's
only going to be relevant when dealing with tail pages.

> /*
> * When a sequential read accesses a page several times,
> * only mark it as accessed the first time.
> */
> - if (prev_index != index || offset != prev_offset)
> + if (prev_index != index ||
> + (offset & ~PAGE_CACHE_MASK) != prev_offset)
> mark_page_accessed(page);
> prev_index = index;
>
> @@ -1238,8 +1256,9 @@ page_ok:
> * "pos" here (the actor routine has to update the user buffer
> * pointers and the remaining count).
> */
> - ret = file_read_actor(desc, page, offset, nr);
> - offset += ret;
> + ret = file_read_actor(desc, page + (offset >> PAGE_CACHE_SHIFT),
> + offset & ~PAGE_CACHE_MASK, nr);
> + offset = (offset & ~PAGE_CACHE_MASK) + ret;

^^ There's an extra space in that last line.

> index += offset >> PAGE_CACHE_SHIFT;
> offset &= ~PAGE_CACHE_MASK;
> prev_offset = offset;
> @@ -2440,8 +2459,13 @@ again:
> if (mapping_writably_mapped(mapping))
> flush_dcache_page(page);
>
> + if (PageTransHuge(page))
> + offset = pos & ~HPAGE_PMD_MASK;
> +
> pagefault_disable();
> - copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
> + copied = iov_iter_copy_from_user_atomic(
> + page + (offset >> PAGE_CACHE_SHIFT),
> + i, offset & ~PAGE_CACHE_MASK, bytes);
> pagefault_enable();
> flush_dcache_page(page);
>
> @@ -2464,6 +2488,7 @@ again:
> * because not all segments in the iov can be copied at
> * once without a pagefault.
> */
> + offset = pos & ~PAGE_CACHE_MASK;
> bytes = min_t(unsigned long, PAGE_CACHE_SIZE - offset,
> iov_iter_single_seg_count(i));
> goto again;
>

I think the difficulty in this function is that you're now dealing with
two 'struct page's, two offsets, and two indexes. It isn't blindingly
obvious which one should be used in a given situation.

The way you've done it here might just be the best way. I'd *really*
encourage you to make sure that this is tested exhaustively, and make
sure you hit all the different paths in that function. I'd suspect
there is still a bug or two in there outside the diff context.

Would it be sane to have a set of variables like:

struct page *thp_tail_page = page + (offset >> PAGE_CACHE_SHIFT);

instead of just open-coding the masks and shifts every time?

2013-03-22 17:52:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 15/30] thp, libfs: initial support of thp in simple_read/write_begin/write_end

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> @@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr);
>
> int simple_readpage(struct file *file, struct page *page)
> {
> - clear_highpage(page);
> + if (PageTransHuge(page))
> + zero_huge_user(page, 0, HPAGE_PMD_SIZE);
> + else
> + clear_highpage(page);

This makes me really wonder on which level we want to be hooking in this
code. The fact that we're doing it in simple_readpage() seems to mean
that we'll have to go explicitly and widely modify every fs that wants
to support this.

It seems to me that we either want to hide this behind clear_highpage()
itself, _or_ make a clear_pagecache_page() function that does it.
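
A minimal sketch of such a helper, reusing the calls from the quoted hunk
(clear_pagecache_page() is just the name suggested above):

static inline void clear_pagecache_page(struct page *page)
{
	if (PageTransHuge(page))
		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
	else
		clear_highpage(page);
}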

BTW, didn't you have a HUGE_PAGE_CACHE_SIZE macro somewhere? Shouldn't
that get used here?

> flush_dcache_page(page);
> SetPageUptodate(page);
> unlock_page(page);
> @@ -394,21 +397,41 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> {
> - struct page *page;
> + struct page *page = NULL;
> pgoff_t index;
>
> index = pos >> PAGE_CACHE_SHIFT;
>
> - page = grab_cache_page_write_begin(mapping, index, flags);
> + /* XXX: too weak condition. Good enough for initial testing */
> + if (mapping_can_have_hugepages(mapping)) {
> + page = grab_cache_huge_page_write_begin(mapping,
> + index & ~HPAGE_CACHE_INDEX_MASK, flags);
> + /* fallback to small page */
> + if (!page || !PageTransHuge(page)) {
> + unsigned long offset;
> + offset = pos & ~PAGE_CACHE_MASK;
> + len = min_t(unsigned long,
> + len, PAGE_CACHE_SIZE - offset);
> + }
> + }
> + if (!page)
> + page = grab_cache_page_write_begin(mapping, index, flags);

Same thing goes here. Can/should we hide the
grab_cache_huge_page_write_begin() call inside
grab_cache_page_write_begin()?

> if (!page)
> return -ENOMEM;
> -
> *pagep = page;
>
> - if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
> - unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> -
> - zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
> + if (!PageUptodate(page)) {
> + unsigned from;
> +
> + if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
> + from = pos & ~HPAGE_PMD_MASK;
> + zero_huge_user_segments(page, 0, from,
> + from + len, HPAGE_PMD_SIZE);
> + } else if (len != PAGE_CACHE_SIZE) {
> + from = pos & ~PAGE_CACHE_MASK;
> + zero_user_segments(page, 0, from,
> + from + len, PAGE_CACHE_SIZE);
> + }
> }

Let's say you introduced two new functions: page_cache_size(page) and
page_cache_mask(page), and hid the zero_huge_user_segments() inside
zero_user_segments(). This code would end up looking like this:

if (len != page_cache_size(page)) {
from = pos & ~page_cache_mask(page)
zero_user_segments(page, 0, from,
from + len, page_cache_size(page));
}

It would also compile down to exactly what was there before without
having to _explicitly_ put a case in for THP.
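
One way those two helpers might look, as a sketch (the names are the ones
suggested above; compound_order() is 0 for a small page, so this collapses
back to the existing constants when THP is off):

static inline unsigned long page_cache_size(struct page *page)
{
	return PAGE_CACHE_SIZE << compound_order(page);
}

static inline unsigned long page_cache_mask(struct page *page)
{
	return ~(page_cache_size(page) - 1);
}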

2013-03-22 18:08:51

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> +static int split_anon_huge_page(struct page *page)
> {
> struct anon_vma *anon_vma;
> int ret = 1;
>
> - BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> - BUG_ON(!PageAnon(page));

Did you really mean to kill these BUG_ON()s? They still look relevant
to me.

2013-03-22 18:13:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 17/30] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
> __split_huge_page_pmd(__vma, __address, \
> ____pmd); \
> } while (0)
> -#define wait_split_huge_page(__anon_vma, __pmd) \
> +#define wait_split_huge_page(__vma, __pmd) \
> do { \
> pmd_t *____pmd = (__pmd); \
> - anon_vma_lock_write(__anon_vma); \
> - anon_vma_unlock_write(__anon_vma); \
> + struct address_space *__mapping = \
> + vma->vm_file->f_mapping; \
> + struct anon_vma *__anon_vma = (__vma)->anon_vma; \
> + if (__mapping) \
> + mutex_lock(&__mapping->i_mmap_mutex); \
> + if (__anon_vma) { \
> + anon_vma_lock_write(__anon_vma); \
> + anon_vma_unlock_write(__anon_vma); \
> + } \
> + if (__mapping) \
> + mutex_unlock(&__mapping->i_mmap_mutex); \
> BUG_ON(pmd_trans_splitting(*____pmd) || \
> pmd_trans_huge(*____pmd)); \
> } while (0)

That thing was pretty ugly _before_. :) Any chance this can get turned
into a function?

What's the deal with the i_mmap_mutex operation getting split up? I'm
blanking on what kind of pages would have both anon_vmas and a valid
mapping.
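
As a sketch, a function version might look like this (the NULL check on
vma->vm_file is an assumption added here; the macro as quoted dereferences
it unconditionally):

static inline void wait_split_huge_page(struct vm_area_struct *vma, pmd_t *pmd)
{
	struct address_space *mapping = vma->vm_file ?
					vma->vm_file->f_mapping : NULL;
	struct anon_vma *anon_vma = vma->anon_vma;

	if (mapping)
		mutex_lock(&mapping->i_mmap_mutex);
	if (anon_vma) {
		anon_vma_lock_write(anon_vma);
		anon_vma_unlock_write(anon_vma);
	}
	if (mapping)
		mutex_unlock(&mapping->i_mmap_mutex);
	BUG_ON(pmd_trans_splitting(*pmd) || pmd_trans_huge(*pmd));
}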

2013-03-22 18:20:43

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 18/30] thp, mm: truncate support for transparent huge page cache

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> @@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (index > end)
> break;
>
> + VM_BUG_ON(PageTransHuge(page));
> lock_page(page);
> WARN_ON(page->index != index);
> wait_on_page_writeback(page);

This looks to be during the second truncate pass where things are
allowed to block. What's the logic behind it not being possible to
encounter TransHugePage()s here?

2013-03-22 18:28:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 21/30] x86-64, mm: proper alignment mappings with hugepages

On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> + if (filp)
> + info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
> + PAGE_MASK & ~HPAGE_MASK : get_align_mask();
> + else
> + info.align_mask = 0;
> info.align_offset = pgoff << PAGE_SHIFT;
> return vm_unmapped_area(&info);
> }
> @@ -174,7 +179,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> info.length = len;
> info.low_limit = PAGE_SIZE;
> info.high_limit = mm->mmap_base;
> - info.align_mask = filp ? get_align_mask() : 0;
> + if (filp)
> + info.align_mask = mapping_can_have_hugepages(filp->f_mapping) ?
> + PAGE_MASK & ~HPAGE_MASK : get_align_mask();
> + else
> + info.align_mask = 0;
> info.align_offset = pgoff << PAGE_SHIFT;
> addr = vm_unmapped_area(&info);
> if (!(addr & ~PAGE_MASK))

how about

static inline
unsigned long mapping_align_mask(struct address_space *mapping)
{
if (mapping_can_have_hugepages(mapping))
return PAGE_MASK & ~HPAGE_MASK;
return get_align_mask();
}

to replace these two open-coded versions?
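
With such a helper both call sites in the quoted diff would collapse to
something like:

	info.align_mask = filp ? mapping_align_mask(filp->f_mapping) : 0;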

2013-03-25 13:02:01

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

Dave Hansen wrote:
> On 03/22/2013 02:47 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> >>> +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */
> >>
> >> This eventually boils down to making the radix_tree_preload array
> >> larger. Do we really want to do this unconditionally if it's only for
> >> THP's benefit?
> >
> > It will be useful not only for THP. Batching can be useful to solve
> > scalability issues.
>
> Still, it seems like something that little machines with no THP support
> probably don't want to pay the cost for. Perhaps you could enable it
> for THP||NR_CPUS>$FOO.

Okay, I'll disable it for !THP. We can always change it if we find a good
candidate for batching.

> >> For those of us too lazy to go compile a kernel and figure this out in
> >> practice, how much bigger does this make the nodes[] array?
> >
> > We have three possible RADIX_TREE_MAP_SHIFT:
> >
> > #ifdef __KERNEL__
> > #define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
> > #else
> > #define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
> > #endif
> >
> > On 64-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 43, new is 107.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 31, new is 63.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 21, new is 30.
> >
> > On 32-bit system:
> > For RADIX_TREE_MAP_SHIFT=3, old array size is 21, new is 84.
> > For RADIX_TREE_MAP_SHIFT=4, old array size is 15, new is 46.
> > For RADIX_TREE_MAP_SHIFT=6, old array size is 11, new is 19.
> >
> > On most machines we will have RADIX_TREE_MAP_SHIFT=6.
>
> Could you stick that in your patch description?

Will do.

> The total cost is "array size" * sizeof(void*) * NR_CPUS, right?

Correct.
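
To put numbers on that, using the RADIX_TREE_MAP_SHIFT=6 figures above and
an assumed NR_CPUS=256 purely for illustration:

	old: 21 pointers * 8 bytes * 256 CPUs = 43,008 bytes (~42 KB)
	new: 30 pointers * 8 bytes * 256 CPUs = 61,440 bytes (~60 KB)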

--
Kirill A. Shutemov

2013-03-26 08:38:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > THP_FAULT_FALLBACK,
> > THP_COLLAPSE_ALLOC,
> > THP_COLLAPSE_ALLOC_FAILED,
> > + THP_WRITE_ALLOC,
> > + THP_WRITE_FAILED,
> > THP_SPLIT,
> > THP_ZERO_PAGE_ALLOC,
> > THP_ZERO_PAGE_ALLOC_FAILED,
>
> I think these names are a bit terse. It's certainly not _writes_ that
> are failing and "THP_WRITE_FAILED" makes it sound that way.

Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/

> Also, why do we need to differentiate these from the existing anon-hugepage
> vm stats? The alloc_pages() call seems to be doing the exact same thing in
> the end. Is one more likely to succeed than the other?

Existing stats specify the source of a thp page: fault or collapse. When we
allocate a new huge page with write(2) it's neither a fault nor a collapse.
I think it's reasonable to introduce a new type of event for that.
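
For illustration, the counters would presumably be bumped right around the
huge page allocation in the write path; the placement below is a sketch,
with count_vm_event() being the standard helper:

	page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
	if (page)
		count_vm_event(THP_WRITE_ALLOC);
	else
		count_vm_event(THP_WRITE_ALLOC_FAILED);	/* renamed as above */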

--
Kirill A. Shutemov

2013-03-26 10:46:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

Dave Hansen wrote:
> > +repeat:
> > + page = find_lock_page(mapping, index);
> > + if (page) {
> > + if (!PageTransHuge(page)) {
> > + unlock_page(page);
> > + page_cache_release(page);
> > + return NULL;
> > + }
> > + goto found;
> > + }
> > +
> > + page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
>
> I alluded to this a second ago, but what's wrong with alloc_hugepage()?

It's defined only for !CONFIG_NUMA and only inside mm/huge_memory.c.

> > +found:
> > + wait_on_page_writeback(page);
> > + return page;
> > +}
> > +#endif
>
> So, I diffed :
>
> -struct page *grab_cache_page_write_begin(struct address_space
> vs.
> +struct page *grab_cache_huge_page_write_begin(struct address_space
>
> They're just to similar to ignore. Please consolidate them somehow.

Will do.

> > +found:
> > + wait_on_page_writeback(page);
> > + return page;
> > +}
> > +#endif
>
> In grab_cache_page_write_begin(), this "wait" is:
>
> wait_for_stable_page(page);
>
> Why is it different here?

It was wait_on_page_writeback() in grab_cache_page_write_begin() when I forked
it :(

See 1d1d1a7 mm: only enforce stable page writes if the backing device requires it

Consolidation will fix this.

--
Kirill A. Shutemov

2013-03-26 15:40:33

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 13/30] thp, mm: implement grab_cache_huge_page_write_begin()

On 03/26/2013 03:48 AM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>>> +repeat:
>>> + page = find_lock_page(mapping, index);
>>> + if (page) {
>>> + if (!PageTransHuge(page)) {
>>> + unlock_page(page);
>>> + page_cache_release(page);
>>> + return NULL;
>>> + }
>>> + goto found;
>>> + }
>>> +
>>> + page = alloc_pages(gfp_mask & ~gfp_notmask, HPAGE_PMD_ORDER);
>>
>> I alluded to this a second ago, but what's wrong with alloc_hugepage()?
>
> It's defined only for !CONFIG_NUMA and only inside mm/huge_memory.c.

It's a short function, but you could easily pull it out from under the
#ifdef and export it. I kinda like the idea of these things being
allocated in as few code paths as possible. But it's not a big deal.

>>> +found:
>>> + wait_on_page_writeback(page);
>>> + return page;
>>> +}
>>> +#endif
>>
>> So, I diffed :
>>
>> -struct page *grab_cache_page_write_begin(struct address_space
>> vs.
>> +struct page *grab_cache_huge_page_write_begin(struct address_space
>>
>> They're just to similar to ignore. Please consolidate them somehow.
>
> Will do.
>
>>> +found:
>>> + wait_on_page_writeback(page);
>>> + return page;
>>> +}
>>> +#endif
>>
>> In grab_cache_page_write_begin(), this "wait" is:
>>
>> wait_for_stable_page(page);
>>
>> Why is it different here?
>
> It was wait_on_page_writeback() in grab_cache_page_write_begin() when I forked
> it :(
>
> See 1d1d1a7 mm: only enforce stable page writes if the backing device requires it
>
> Consolidation will fix this.

Excellent.

2013-03-28 12:24:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 14/30] thp, mm: naive support of thp in generic read/write routines

Dave Hansen wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > For now we still write/read at most PAGE_CACHE_SIZE bytes a time.
> >
> > This implementation doesn't cover address spaces with backing store.
> ...
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1165,12 +1165,23 @@ find_page:
> > if (unlikely(page == NULL))
> > goto no_cached_page;
> > }
> > + if (PageTransTail(page)) {
> > + page_cache_release(page);
> > + page = find_get_page(mapping,
> > + index & ~HPAGE_CACHE_INDEX_MASK);
> > + if (!PageTransHuge(page)) {
> > + page_cache_release(page);
> > + goto find_page;
> > + }
> > + }
>
> So, we're going to do a read of a file, and we pulled a tail page out of
> the page cache. Why can't we just deal with the tail page directly?

Good point. I'll redo it once again.

At first I thought to make it possible to read/write more than PAGE_SIZE at
once. If we don't take that option into account for now, the code can be
made much cleaner.

--
Kirill A. Shutemov

2013-03-28 14:27:50

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 15/30] thp, libfs: initial support of thp in simple_read/write_begin/write_end

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > @@ -383,7 +383,10 @@ EXPORT_SYMBOL(simple_setattr);
> >
> > int simple_readpage(struct file *file, struct page *page)
> > {
> > - clear_highpage(page);
> > + if (PageTransHuge(page))
> > + zero_huge_user(page, 0, HPAGE_PMD_SIZE);
> > + else
> > + clear_highpage(page);
>
> This makes me really wonder on which level we want to be hooking in this
> code. The fact that we're doing it in simple_readpage() seems to mean
> that we'll have to go explicitly and widely modify every fs that wants
> to support this.
>
> It seems to me that we either want to hide this behind clear_highpage()
> itself, _or_ make a clear_pagecache_page() function that does it.

clear_pagecache_page() is a good idea.
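
Such a helper would presumably just fold in the branch from the hunk quoted
above (the name is hypothetical, per the suggestion):

static inline void clear_pagecache_page(struct page *page)
{
	if (PageTransHuge(page))
		zero_huge_user(page, 0, HPAGE_PMD_SIZE);
	else
		clear_highpage(page);
}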

> BTW, didn't you have a HUGE_PAGE_CACHE_SIZE macro somewhere? Shouldn't
> that get used here?

No. If we have a huge page in the page cache it's always HPAGE_PMD_SIZE.

All these PAGE_CACHE_* macros are really annoying. The page cache page size
is always equal to the small page size and the macros only confuse,
especially on the border between fs/page cache and the rest of mm.

I want to get rid of them eventually.

>
> > flush_dcache_page(page);
> > SetPageUptodate(page);
> > unlock_page(page);
> > @@ -394,21 +397,41 @@ int simple_write_begin(struct file *file, struct address_space *mapping,
> > loff_t pos, unsigned len, unsigned flags,
> > struct page **pagep, void **fsdata)
> > {
> > - struct page *page;
> > + struct page *page = NULL;
> > pgoff_t index;
> >
> > index = pos >> PAGE_CACHE_SHIFT;
> >
> > - page = grab_cache_page_write_begin(mapping, index, flags);
> > + /* XXX: too weak condition. Good enough for initial testing */
> > + if (mapping_can_have_hugepages(mapping)) {
> > + page = grab_cache_huge_page_write_begin(mapping,
> > + index & ~HPAGE_CACHE_INDEX_MASK, flags);
> > + /* fallback to small page */
> > + if (!page || !PageTransHuge(page)) {
> > + unsigned long offset;
> > + offset = pos & ~PAGE_CACHE_MASK;
> > + len = min_t(unsigned long,
> > + len, PAGE_CACHE_SIZE - offset);
> > + }
> > + }
> > + if (!page)
> > + page = grab_cache_page_write_begin(mapping, index, flags);
>
> Same thing goes here. Can/should we hide the
> grab_cache_huge_page_write_begin() call inside
> grab_cache_page_write_begin()?

No. I want to keep it open-coded. The fs, not the page cache, should decide
whether it wants a huge page or not.

> > if (!page)
> > return -ENOMEM;
> > -
> > *pagep = page;
> >
> > - if (!PageUptodate(page) && (len != PAGE_CACHE_SIZE)) {
> > - unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> > -
> > - zero_user_segments(page, 0, from, from + len, PAGE_CACHE_SIZE);
> > + if (!PageUptodate(page)) {
> > + unsigned from;
> > +
> > + if (PageTransHuge(page) && len != HPAGE_PMD_SIZE) {
> > + from = pos & ~HPAGE_PMD_MASK;
> > + zero_huge_user_segments(page, 0, from,
> > + from + len, HPAGE_PMD_SIZE);
> > + } else if (len != PAGE_CACHE_SIZE) {
> > + from = pos & ~PAGE_CACHE_MASK;
> > + zero_user_segments(page, 0, from,
> > + from + len, PAGE_CACHE_SIZE);
> > + }
> > }
>
> Let's say you introduced two new functions: page_cache_size(page) and
> page_cache_mask(page), and hid the zero_huge_user_segments() inside
> zero_user_segments(). This code would end up looking like this:
>
> if (len != page_cache_size(page)) {
> from = pos & ~page_cache_mask(page)
> zero_user_segments(page, 0, from,
> from + len, page_cache_size(page));
> }
>
> It would also compile down to exactly what was there before without
> having to _explicitly_ put a case in for THP.

I would keep it as is for now. There are not that many places where we
have to check for THP. It can change when (if) we implement it for other
fs'es. We can generalize it later if needed.

--
Kirill A. Shutemov

2013-03-28 14:30:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 16/30] thp: handle file pages in split_huge_page()

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > +static int split_anon_huge_page(struct page *page)
> > {
> > struct anon_vma *anon_vma;
> > int ret = 1;
> >
> > - BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> > - BUG_ON(!PageAnon(page));
>
> Did you really mean to kill these BUG_ON()s? They still look relevant
> to me.

The zero page BUG_ON() is moved to the new split_huge_page().
We can now handle !PageAnon(page).

Note: nobody should call split_anon_huge_page() directly, only
split_huge_page().

--
Kirill A. Shutemov

2013-03-28 15:06:43

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 17/30] thp: wait_split_huge_page(): serialize over i_mmap_mutex too

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -113,11 +113,20 @@ extern void __split_huge_page_pmd(struct vm_area_struct *vma,
> > __split_huge_page_pmd(__vma, __address, \
> > ____pmd); \
> > } while (0)
> > -#define wait_split_huge_page(__anon_vma, __pmd) \
> > +#define wait_split_huge_page(__vma, __pmd) \
> > do { \
> > pmd_t *____pmd = (__pmd); \
> > - anon_vma_lock_write(__anon_vma); \
> > - anon_vma_unlock_write(__anon_vma); \
> > + struct address_space *__mapping = \
> > + vma->vm_file->f_mapping; \
> > + struct anon_vma *__anon_vma = (__vma)->anon_vma; \
> > + if (__mapping) \
> > + mutex_lock(&__mapping->i_mmap_mutex); \
> > + if (__anon_vma) { \
> > + anon_vma_lock_write(__anon_vma); \
> > + anon_vma_unlock_write(__anon_vma); \
> > + } \
> > + if (__mapping) \
> > + mutex_unlock(&__mapping->i_mmap_mutex); \
> > BUG_ON(pmd_trans_splitting(*____pmd) || \
> > pmd_trans_huge(*____pmd)); \
> > } while (0)
>
> That thing was pretty ugly _before_. :) Any chance this can get turned
> in to a function?

Cyclic header dependencies... :(

> What's the deal with the i_mmap_mutex operation getting split up? I'm
> blanking on what kind of pages would have both anon_vmas and a valid
> mapping.

anon_vma_lock protects all anon pages on the vma against splitting.
i_mmap_mutex protects shared pages. No page can be on both sides, but a
MAP_PRIVATE file vma can have both anon and shared pages.

As an option we can look up the struct page with pmd_page(), check
PageAnon() and serialize only on the relevant lock, but...

The original macro, in fact, guarantees that *all* pages on the vma are not
splitting, not only the pages the pmd is pointing to. A PageAnon() check
would change the semantics a bit. It shouldn't be a problem, but who knows.

Do you want me to add the check?
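
A rough sketch of that pmd_page()/PageAnon() option, assuming the pmd still
maps the compound page while it is being split:

	struct page *page = pmd_page(*pmd);

	if (PageAnon(page)) {
		/* serialize against an anon splitter */
		anon_vma_lock_write(vma->anon_vma);
		anon_vma_unlock_write(vma->anon_vma);
	} else {
		/* serialize against a file-side splitter */
		mutex_lock(&vma->vm_file->f_mapping->i_mmap_mutex);
		mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
	}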

--
Kirill A. Shutemov

2013-03-28 15:30:07

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 18/30] thp, mm: truncate support for transparent huge page cache

Dave wrote:
> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> > @@ -280,6 +291,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
> > if (index > end)
> > break;
> >
> > + VM_BUG_ON(PageTransHuge(page));
> > lock_page(page);
> > WARN_ON(page->index != index);
> > wait_on_page_writeback(page);
>
> This looks to be during the second truncate pass where things are
> allowed to block. What's the logic behind it not being possible to
> encounter TransHugePage()s here?

Good question.

The only way the page can be created from under us is collapsing, but
that's not implemented for file pages and I'm not sure yet how to
implement it...

Probably, I'll replace the BUG with

if (PageTransHuge(page))
split_huge_page(page);

It should be good enough.

--
Kirill A. Shutemov

2013-04-02 14:44:59

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

Dave Hansen wrote:
> On 03/22/2013 03:12 AM, Kirill A. Shutemov wrote:
> > Dave Hansen wrote:
> >> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
> >>> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> >>> +{
> >>> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> >>> + gfp_t gfp_mask = mapping_gfp_mask(m);
> >>> + return !!(gfp_mask & __GFP_COMP);
> >>> + }
> >>> +
> >>> + return false;
> >>> +}
> >>
> >> I did a quick search in all your patches and don't see __GFP_COMP
> >> getting _set_ anywhere. Am I missing something?
> >
> > __GFP_COMP is part of GFP_TRANSHUGE. We set it for ramfs in patch 20/30.
>
> That's a bit non-obvious. For a casual observer, it _seems_ like you
> should just be setting and checking GFP_TRANSHUGE directly. It looks
> like you were having some problems with __GFP_MOVABLE and masked it out
> of GFP_TRANSHUGE and that has cascaded over to _this_ check.

Checking GFP_TRANSHUGE directly is not the right way. File systems can
clear GFP bits or set additional ones for their own reasons. We should not
limit file systems here.

So the only robust way is to check __GFP_COMP. I'll add a comment.
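
The predicate with that comment added might read (a sketch):

static inline bool mapping_can_have_hugepages(struct address_space *m)
{
	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
		gfp_t gfp_mask = mapping_gfp_mask(m);
		/*
		 * __GFP_COMP is the one bit a filesystem has to keep set
		 * in its mapping gfp mask to get huge pages; checking
		 * GFP_TRANSHUGE as a whole would break once an fs masks
		 * out other bits (e.g. __GFP_MOVABLE) for its own reasons.
		 */
		return !!(gfp_mask & __GFP_COMP);
	}

	return false;
}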

--
Kirill A. Shutemov

2013-04-02 16:26:29

by Kirill A. Shutemov

[permalink] [raw]
Subject: RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> ramfs is the most simple fs from page cache point of view. Let's start
> transparent huge page cache enabling here.
>
> For now we allocate only non-movable huge page. It's not yet clear if
> movable page is safe here and what need to be done to make it safe.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/ramfs/inode.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> index c24f1e1..da30b4f 100644
> --- a/fs/ramfs/inode.c
> +++ b/fs/ramfs/inode.c
> @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> inode_init_owner(inode, dir, mode);
> inode->i_mapping->a_ops = &ramfs_aops;
> inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> + /*
> + * TODO: what should be done to make movable safe?
> + */
> + mapping_set_gfp_mask(inode->i_mapping,
> + GFP_TRANSHUGE & ~__GFP_MOVABLE);

Hugh, I've found an old thread with the reason why we have GFP_HIGHUSER
here, not GFP_HIGHUSER_MOVABLE:

http://lkml.org/lkml/2006/11/27/156

It seems the original reason is no longer valid, correct?

--
Kirill A. Shutemov

2013-04-02 22:15:52

by Hugh Dickins

[permalink] [raw]
Subject: RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > ramfs is the most simple fs from page cache point of view. Let's start
> > transparent huge page cache enabling here.
> >
> > For now we allocate only non-movable huge page. It's not yet clear if
> > movable page is safe here and what need to be done to make it safe.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > ---
> > fs/ramfs/inode.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > index c24f1e1..da30b4f 100644
> > --- a/fs/ramfs/inode.c
> > +++ b/fs/ramfs/inode.c
> > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> > inode_init_owner(inode, dir, mode);
> > inode->i_mapping->a_ops = &ramfs_aops;
> > inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > + /*
> > + * TODO: what should be done to make movable safe?
> > + */
> > + mapping_set_gfp_mask(inode->i_mapping,
> > + GFP_TRANSHUGE & ~__GFP_MOVABLE);
>
> Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
> GFP_HIGHUSER_MOVABLE:
>
> http://lkml.org/lkml/2006/11/27/156
>
> It seems the origin reason is not longer valid, correct?

Incorrect, I believe: so far as I know, the original reason remains
valid - though it would only require a couple of good small changes
to reverse that - or perhaps you have already made these changes?

The original reason is that ramfs pages are not migratable,
therefore they should be allocated from an unmovable area.

As I understand it (and I would have preferred to run a test to check
my understanding before replying, but don't have time for that), ramfs
pages cannot be migrated for two reasons, neither of them a good reason.

One reason (okay, it wouldn't have been quite this way in 2006) is that
ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
the page_evictable() test, so they will be marked PageUnevictable, so
__isolate_lru_page() will refuse to isolate them for migration (except
for CMA).

I am strongly in favour of removing that limitation from
__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
and Christoph were both in favour too); and note that there is no such
restriction in the confusingly similar but different isolate_lru_page().

Some people do worry that migrating Mlocked pages would introduce the
occasional possibility of a minor fault (with migration_entry_wait())
on an Mlocked region which never faulted before. I tend to dismiss
that worry, but maybe I'm wrong to do so: maybe there should be a
tunable for realtimey people to set, to prohibit page migration from
mlocked areas; but the default should be to allow it.

(Of course, we could separate ramfs's mapping_unevictable case from
the Mlocked case; but I'd prefer to continue to treat them the same.)

The other reason it looks as if ramfs pages cannot be migrated, is
that it does not set a suitable ->migratepage method, so would be
handled by fallback_migrate_page(), whose PageDirty test will end
up failing the migration with -EBUSY or -EINVAL - if I read it
correctly.
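
(For illustration only: the minimal change being described would presumably
be wiring the generic helper into ramfs_aops, roughly as in the sketch
below; whether that alone would suffice is exactly the open question.)

const struct address_space_operations ramfs_aops = {
	.readpage	= simple_readpage,
	.write_begin	= simple_write_begin,
	.write_end	= simple_write_end,
	.set_page_dirty	= __set_page_dirty_no_writeback,
	/* sketch: generic migration copies the page and transfers the
	   dirty bit, so fallback_migrate_page()'s PageDirty bailout no
	   longer applies */
	.migratepage	= migrate_page,
};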

Perhaps other such reasons would surface once those are fixed.
But until ramfs pages can be migrated, they should not be allocated
with __GFP_MOVABLE. (I've been writing about the migratability of
small pages: I expect you have the migratability of THPages in flux.)

Hugh

2013-04-03 01:11:11

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
> On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> > Kirill A. Shutemov wrote:
> > > From: "Kirill A. Shutemov" <[email protected]>
> > >
> > > ramfs is the most simple fs from page cache point of view. Let's start
> > > transparent huge page cache enabling here.
> > >
> > > For now we allocate only non-movable huge page. It's not yet clear if
> > > movable page is safe here and what need to be done to make it safe.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > ---
> > > fs/ramfs/inode.c | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> > > index c24f1e1..da30b4f 100644
> > > --- a/fs/ramfs/inode.c
> > > +++ b/fs/ramfs/inode.c
> > > @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> > > inode_init_owner(inode, dir, mode);
> > > inode->i_mapping->a_ops = &ramfs_aops;
> > > inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> > > - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > > + /*
> > > + * TODO: what should be done to make movable safe?
> > > + */
> > > + mapping_set_gfp_mask(inode->i_mapping,
> > > + GFP_TRANSHUGE & ~__GFP_MOVABLE);
> >
> > Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
> > GFP_HIGHUSER_MOVABLE:
> >
> > http://lkml.org/lkml/2006/11/27/156
> >
> > It seems the origin reason is not longer valid, correct?
>
> Incorrect, I believe: so far as I know, the original reason remains
> valid - though it would only require a couple of good small changes
> to reverse that - or perhaps you have already made these changes?
>
> The original reason is that ramfs pages are not migratable,
> therefore they should be allocated from an unmovable area.
>
> As I understand it (and I would have preferred to run a test to check
> my understanding before replying, but don't have time for that), ramfs
> pages cannot be migrated for two reasons, neither of them a good reason.
>
> One reason (okay, it wouldn't have been quite this way in 2006) is that
> ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
> the page_evictable() test, so they will be marked PageUnevictable, so
> __isolate_lru_page() will refuse to isolate them for migration (except
> for CMA).

True.

>
> I am strongly in favour of removing that limitation from
> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> and Christoph were both in favour too); and note that there is no such
> restriction in the confusingly similar but different isolate_lru_page().
>
> Some people do worry that migrating Mlocked pages would introduce the
> occasional possibility of a minor fault (with migration_entry_wait())
> on an Mlocked region which never faulted before. I tend to dismiss
> that worry, but maybe I'm wrong to do so: maybe there should be a
> tunable for realtimey people to set, to prohibit page migration from
> mlocked areas; but the default should be to allow it.

I agree.
Just FYI for mlocked page migration

I tried migration of mlocked pages and Johannes and Mel had a concern
about that.
http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html

But later, Peter already acked it and I guess by reading the thread that
Hugh was in favour when page migration was merged first time.

http://marc.info/?l=linux-mm&m=133697873414205&w=2
http://marc.info/?l=linux-mm&m=133700341823358&w=2

Many people said mlock means memory-resident, NOT pinning, so it could
allow minor faults, while Mel still had a concern except for CMA.
http://marc.info/?l=linux-mm&m=133674219714419&w=2

>
> (Of course, we could separate ramfs's mapping_unevictable case from
> the Mlocked case; but I'd prefer to continue to treat them the same.)

Fair enough.

>
> The other reason it looks as if ramfs pages cannot be migrated, is
> that it does not set a suitable ->migratepage method, so would be
> handled by fallback_migrate_page(), whose PageDirty test will end
> up failing the migration with -EBUSY or -EINVAL - if I read it
> correctly.

True.

>
> Perhaps other such reasons would surface once those are fixed.
> But until ramfs pages can be migrated, they should not be allocated
> with __GFP_MOVABLE. (I've been writing about the migratability of
> small pages: I expect you have the migratability of THPages in flux.)

Agreed.

>
> Hugh
>

--
Kind regards,
Minchan Kim

Subject: RE: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

On Tue, 2 Apr 2013, Hugh Dickins wrote:

> I am strongly in favour of removing that limitation from
> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> and Christoph were both in favour too); and note that there is no such
> restriction in the confusingly similar but different isolate_lru_page().

Well the naming could be cleaned up. The fundamental issue with migrating
pages is that all references have to be tracked and updated in a way that
no references can be followed to invalid or stale page contents. If ramfs
does not maintain separate pointers but only relies on pointers already
handled by the migration logic then migration is fine.

> Some people do worry that migrating Mlocked pages would introduce the
> occasional possibility of a minor fault (with migration_entry_wait())
> on an Mlocked region which never faulted before. I tend to dismiss
> that worry, but maybe I'm wrong to do so: maybe there should be a
> tunable for realtimey people to set, to prohibit page migration from
> mlocked areas; but the default should be to allow it.

Could we have a different way of marking pages "pinned"? This is useful
for various subsystems (like RDMA and various graphics drivers etc) which
need to ensure that virtual address to physical address mappings stay the
same for a prolonged period of time. I think this use case is becoming
more frequent given that offload techniques have to be used these days to
overcome the limits on processor performance.

> The other reason it looks as if ramfs pages cannot be migrated, is
> that it does not set a suitable ->migratepage method, so would be
> handled by fallback_migrate_page(), whose PageDirty test will end
> up failing the migration with -EBUSY or -EINVAL - if I read it
> correctly.

These could be handled the same way that anonymous pages are handled.

> But until ramfs pages can be migrated, they should not be allocated
> with __GFP_MOVABLE. (I've been writing about the migratability of
> small pages: I expect you have the migratability of THPages in flux.)

I agree.

2013-04-05 03:37:34

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 04/30] radix-tree: implement preload for multiple contiguous elements

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Currently radix_tree_preload() only guarantees enough nodes to insert
> one element. It's a hard limit. You cannot batch a number insert under
> one tree_lock.
>
> This patch introduces radix_tree_preload_count(). It allows to
> preallocate nodes enough to insert a number of *contiguous* elements.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/radix-tree.h | 3 +++
> lib/radix-tree.c | 32 +++++++++++++++++++++++++-------
> 2 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ffc444c..81318cb 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -83,6 +83,8 @@ do { \
> (root)->rnode = NULL; \
> } while (0)
>
> +#define RADIX_TREE_PRELOAD_NR 512 /* For THP's benefit */
> +
> /**
> * Radix-tree synchronization
> *
> @@ -231,6 +233,7 @@ unsigned long radix_tree_next_hole(struct radix_tree_root *root,
> unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
> unsigned long index, unsigned long max_scan);
> int radix_tree_preload(gfp_t gfp_mask);
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask);
> void radix_tree_init(void);
> void *radix_tree_tag_set(struct radix_tree_root *root,
> unsigned long index, unsigned int tag);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index e796429..9bef0ac 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -81,16 +81,24 @@ static struct kmem_cache *radix_tree_node_cachep;
> * The worst case is a zero height tree with just a single item at index 0,
> * and then inserting an item at index ULONG_MAX. This requires 2 new branches
> * of RADIX_TREE_MAX_PATH size to be created, with only the root node shared.
> + *
> + * Worst case for adding N contiguous items is adding entries at indexes
> + * (ULONG_MAX - N) to ULONG_MAX. It requires nodes to insert single worst-case
> + * item plus extra nodes if you cross the boundary from one node to the next.
> + *

What's the meaning of this comment? Could you explain in detail? I
also don't understand #define RADIX_TREE_PRELOAD_SIZE
(RADIX_TREE_MAX_PATH * 2 - 1): why RADIX_TREE_MAX_PATH * 2 - 1? I fail
to understand the comment above it.

> * Hence:
> */
> -#define RADIX_TREE_PRELOAD_SIZE (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MIN (RADIX_TREE_MAX_PATH * 2 - 1)
> +#define RADIX_TREE_PRELOAD_MAX \
> + (RADIX_TREE_PRELOAD_MIN + \
> + DIV_ROUND_UP(RADIX_TREE_PRELOAD_NR - 1, RADIX_TREE_MAP_SIZE))
>
> /*
> * Per-cpu pool of preloaded nodes
> */
> struct radix_tree_preload {
> int nr;
> - struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_SIZE];
> + struct radix_tree_node *nodes[RADIX_TREE_PRELOAD_MAX];
> };
> static DEFINE_PER_CPU(struct radix_tree_preload, radix_tree_preloads) = { 0, };
>
> @@ -257,29 +265,34 @@ radix_tree_node_free(struct radix_tree_node *node)
>
> /*
> * Load up this CPU's radix_tree_node buffer with sufficient objects to
> - * ensure that the addition of a single element in the tree cannot fail. On
> - * success, return zero, with preemption disabled. On error, return -ENOMEM
> + * ensure that the addition of *contiguous* elements in the tree cannot fail.
> + * On success, return zero, with preemption disabled. On error, return -ENOMEM
> * with preemption not disabled.
> *
> * To make use of this facility, the radix tree must be initialised without
> * __GFP_WAIT being passed to INIT_RADIX_TREE().
> */
> -int radix_tree_preload(gfp_t gfp_mask)
> +int radix_tree_preload_count(unsigned size, gfp_t gfp_mask)
> {
> struct radix_tree_preload *rtp;
> struct radix_tree_node *node;
> int ret = -ENOMEM;
> + int alloc = RADIX_TREE_PRELOAD_MIN +
> + DIV_ROUND_UP(size - 1, RADIX_TREE_MAP_SIZE);
> +
> + if (size > RADIX_TREE_PRELOAD_NR)
> + return -ENOMEM;
>
> preempt_disable();
> rtp = &__get_cpu_var(radix_tree_preloads);
> - while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
> + while (rtp->nr < alloc) {
> preempt_enable();
> node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
> if (node == NULL)
> goto out;
> preempt_disable();
> rtp = &__get_cpu_var(radix_tree_preloads);
> - if (rtp->nr < ARRAY_SIZE(rtp->nodes))
> + if (rtp->nr < alloc)
> rtp->nodes[rtp->nr++] = node;
> else
> kmem_cache_free(radix_tree_node_cachep, node);
> @@ -288,6 +301,11 @@ int radix_tree_preload(gfp_t gfp_mask)
> out:
> return ret;
> }
> +
> +int radix_tree_preload(gfp_t gfp_mask)
> +{
> + return radix_tree_preload_count(1, gfp_mask);
> +}
> EXPORT_SYMBOL(radix_tree_preload);
>
> /*

2013-04-05 03:42:33

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 05/30] thp, mm: avoid PageUnevictable on active/inactive lru lists

Hi Kirill,
On 03/22/2013 06:11 PM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> active/inactive lru lists can contain unevicable pages (i.e. ramfs pages
>>> that have been placed on the LRU lists when first allocated), but these
>>> pages must not have PageUnevictable set - otherwise shrink_active_list
>>> goes crazy:
>> ...
>>> For lru_add_page_tail(), it means we should not set PageUnevictable()
>>> for tail pages unless we're sure that it will go to LRU_UNEVICTABLE.
>>> The tail page will go LRU_UNEVICTABLE if head page is not on LRU or if
>>> it's marked PageUnevictable() too.
>> This is only an issue once you're using lru_add_page_tail() for
>> non-anonymous pages, right?
> I'm not sure about that. Documentation/vm/unevictable-lru.txt:
>
> Some examples of these unevictable pages on the LRU lists are:
>
> (1) ramfs pages that have been placed on the LRU lists when first allocated.
>
> (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
> allocate or fault in the pages in the shared memory region. This happens
> when an application accesses the page the first time after SHM_LOCK'ing
> the segment.
>
> (3) mlocked pages that could not be isolated from the LRU and moved to the
> unevictable list in mlock_vma_page().
>
> (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
> acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
> munlock_vma_page() was forced to let the page back on to the normal LRU
> list for vmscan to handle.
>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index 92a9be5..31584d0 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -762,7 +762,8 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
>>> lru = LRU_INACTIVE_ANON;
>>> }
>>> } else {
>>> - SetPageUnevictable(page_tail);
>>> + if (!PageLRU(page) || PageUnevictable(page))
>>> + SetPageUnevictable(page_tail);
>>> lru = LRU_UNEVICTABLE;
>>> }
>> You were saying above that ramfs pages can get on the normal
>> active/inactive lists. But, this will end up getting them on the
>> unevictable list, right? So, we have normal ramfs pages on the
>> active/inactive lists, but ramfs pages after a huge-page-split on the
>> unevictable list. That seems a bit inconsistent.
> Yeah, it's confusing.
>
> I was able to trigger another bug on this code:
> if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail
> will go to the same lru as page, but nobody cares to sync page_tail
> active/inactive state with page. So we can end up with inactive page on
> active lru...
>
> I've updated the patch for the next interation. You can check it in git.
> It should be cleaner. Description need to be updated.

Hope you can send it out soon. ;-)

>

2013-04-05 03:45:31

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Returns true if mapping can have huge pages. Just check for __GFP_COMP
> in gfp mask of the mapping for now.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/pagemap.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75..3521b0d 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
> (__force unsigned long)mask;
> }
>
> +static inline bool mapping_can_have_hugepages(struct address_space *m)
> +{
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
> + gfp_t gfp_mask = mapping_gfp_mask(m);
> + return !!(gfp_mask & __GFP_COMP);

I always see !! in the kernel, but why check directly instead of having the !! prefix?

> + }
> +
> + return false;
> +}
> +
> /*
> * The page cache can done in larger chunks than
> * one page, because it allows for more efficient

2013-04-05 03:48:17

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 07/30] thp, mm: introduce mapping_can_have_hugepages() predicate

On 04/05/2013 11:45 AM, Ric Mason wrote:
> Hi Kirill,
> On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
>> From: "Kirill A. Shutemov" <[email protected]>
>>
>> Returns true if mapping can have huge pages. Just check for __GFP_COMP
>> in gfp mask of the mapping for now.
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> ---
>> include/linux/pagemap.h | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e3dea75..3521b0d 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -84,6 +84,16 @@ static inline void mapping_set_gfp_mask(struct
>> address_space *m, gfp_t mask)
>> (__force unsigned long)mask;
>> }
>> +static inline bool mapping_can_have_hugepages(struct address_space
>> *m)
>> +{
>> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) {
>> + gfp_t gfp_mask = mapping_gfp_mask(m);
>> + return !!(gfp_mask & __GFP_COMP);
>
> I always see !! in kernel, but why check directly instead of have !!
> prefix?

s/why/why not

>
>> + }
>> +
>> + return false;
>> +}
>> +
>> /*
>> * The page cache can done in larger chunks than
>> * one page, because it allows for more efficient
>

2013-04-05 04:03:27

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 11/30] thp, mm: handle tail pages in page_cache_get_speculative()

Hi Kirill,
On 03/15/2013 01:50 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> For tail page we call __get_page_tail(). It has the same semantics, but
> for tail page.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> include/linux/pagemap.h | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 3521b0d..408c4e3 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -159,6 +159,9 @@ static inline int page_cache_get_speculative(struct page *page)

What's the difference between page_cache_get_speculative and page_cache_get?

> {
> VM_BUG_ON(in_interrupt());
>
> + if (unlikely(PageTail(page)))
> + return __get_page_tail(page);
> +
> #ifdef CONFIG_TINY_RCU
> # ifdef CONFIG_PREEMPT_COUNT
> VM_BUG_ON(!in_atomic());
> @@ -185,7 +188,6 @@ static inline int page_cache_get_speculative(struct page *page)
> return 0;
> }
> #endif
> - VM_BUG_ON(PageTail(page));
>
> return 1;
> }

2013-04-05 04:05:57

by Ric Mason

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 12/30] thp, mm: add event counters for huge page alloc on write to a file

Hi Kirill,
On 03/26/2013 04:40 PM, Kirill A. Shutemov wrote:
> Dave Hansen wrote:
>> On 03/14/2013 10:50 AM, Kirill A. Shutemov wrote:
>>> --- a/include/linux/vm_event_item.h
>>> +++ b/include/linux/vm_event_item.h
>>> @@ -71,6 +71,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>> THP_FAULT_FALLBACK,
>>> THP_COLLAPSE_ALLOC,
>>> THP_COLLAPSE_ALLOC_FAILED,
>>> + THP_WRITE_ALLOC,
>>> + THP_WRITE_FAILED,
>>> THP_SPLIT,
>>> THP_ZERO_PAGE_ALLOC,
>>> THP_ZERO_PAGE_ALLOC_FAILED,
>> I think these names are a bit terse. It's certainly not _writes_ that
>> are failing and "THP_WRITE_FAILED" makes it sound that way.
> Right. s/THP_WRITE_FAILED/THP_WRITE_ALLOC_FAILED/
>
>> Also, why do we need to differentiate these from the existing anon-hugepage
>> vm stats? The alloc_pages() call seems to be doing the exact same thing in
>> the end. Is one more likely to succeed than the other?
> Existing stats specify the source of a thp page: fault or collapse. When we
> allocate a new huge page with write(2) it's neither a fault nor a collapse.
> I think it's reasonable to introduce a new type of event for that.

Why is allocating a new huge page with write(2) not a write fault?

>

2013-04-05 06:47:37

by Simon Jeons

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

Hi Minchan,
On 04/03/2013 09:11 AM, Minchan Kim wrote:
> On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
>> On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
>>> Kirill A. Shutemov wrote:
>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>>
>>>> ramfs is the most simple fs from page cache point of view. Let's start
>>>> transparent huge page cache enabling here.
>>>>
>>>> For now we allocate only non-movable huge page. It's not yet clear if
>>>> movable page is safe here and what need to be done to make it safe.
>>>>
>>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>>> ---
>>>> fs/ramfs/inode.c | 6 +++++-
>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
>>>> index c24f1e1..da30b4f 100644
>>>> --- a/fs/ramfs/inode.c
>>>> +++ b/fs/ramfs/inode.c
>>>> @@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
>>>> inode_init_owner(inode, dir, mode);
>>>> inode->i_mapping->a_ops = &ramfs_aops;
>>>> inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
>>>> - mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
>>>> + /*
>>>> + * TODO: what should be done to make movable safe?
>>>> + */
>>>> + mapping_set_gfp_mask(inode->i_mapping,
>>>> + GFP_TRANSHUGE & ~__GFP_MOVABLE);
>>> Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
>>> GFP_HIGHUSER_MOVABLE:
>>>
>>> http://lkml.org/lkml/2006/11/27/156
>>>
>>> It seems the origin reason is not longer valid, correct?
>> Incorrect, I believe: so far as I know, the original reason remains
>> valid - though it would only require a couple of good small changes
>> to reverse that - or perhaps you have already made these changes?
>>
>> The original reason is that ramfs pages are not migratable,
>> therefore they should be allocated from an unmovable area.
>>
>> As I understand it (and I would have preferred to run a test to check
>> my understanding before replying, but don't have time for that), ramfs
>> pages cannot be migrated for two reasons, neither of them a good reason.
>>
>> One reason (okay, it wouldn't have been quite this way in 2006) is that
>> ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
>> the page_evictable() test, so they will be marked PageUnevictable, so
>> __isolate_lru_page() will refuse to isolate them for migration (except
>> for CMA).
> True.
>
>> I am strongly in favour of removing that limitation from
>> __isolate_lru_page() (and the thread you pointed - thank you - shows Mel
>> and Christoph were both in favour too); and note that there is no such
>> restriction in the confusingly similar but different isolate_lru_page().
>>
>> Some people do worry that migrating Mlocked pages would introduce the
>> occasional possibility of a minor fault (with migration_entry_wait())
>> on an Mlocked region which never faulted before. I tend to dismiss
>> that worry, but maybe I'm wrong to do so: maybe there should be a
>> tunable for realtimey people to set, to prohibit page migration from
>> mlocked areas; but the default should be to allow it.
> I agree.
> Just FYI for mlocked page migration
>
> I tried migratioin of mlocked page and Johannes and Mel had a concern
> about that.
> http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
>
> But later, Peter already acked it and I guess by reading the thread that
> Hugh was in favour when page migration was merged first time.
>
> http://marc.info/?l=linux-mm&m=133697873414205&w=2
> http://marc.info/?l=linux-mm&m=133700341823358&w=2
>
> Many people said mlock means memory-resident, NOT pinning so it could
> allow minor fault while Mel still had a concern except CMA.
> http://marc.info/?l=linux-mm&m=133674219714419&w=2

How about adding a knob?

>> (Of course, we could separate ramfs's mapping_unevictable case from
>> the Mlocked case; but I'd prefer to continue to treat them the same.)
> Fair enough.
>
>> The other reason it looks as if ramfs pages cannot be migrated, is
>> that it does not set a suitable ->migratepage method, so would be
>> handled by fallback_migrate_page(), whose PageDirty test will end
>> up failing the migration with -EBUSY or -EINVAL - if I read it
>> correctly.
> True.
>
>> Perhaps other such reasons would surface once those are fixed.
>> But until ramfs pages can be migrated, they should not be allocated
>> with __GFP_MOVABLE. (I've been writing about the migratability of
>> small pages: I expect you have the migratability of THPages in flux.)
> Agreed.
>
>> Hugh
>>

2013-04-05 08:01:13

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

On Fri, Apr 05, 2013 at 02:47:25PM +0800, Simon Jeons wrote:
> Hi Minchan,
> On 04/03/2013 09:11 AM, Minchan Kim wrote:
> >On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
> >>On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> >>>Kirill A. Shutemov wrote:
> >>>>From: "Kirill A. Shutemov" <[email protected]>
> >>>>
> >>>>ramfs is the most simple fs from page cache point of view. Let's start
> >>>>transparent huge page cache enabling here.
> >>>>
> >>>>For now we allocate only non-movable huge page. It's not yet clear if
> >>>>movable page is safe here and what need to be done to make it safe.
> >>>>
> >>>>Signed-off-by: Kirill A. Shutemov <[email protected]>
> >>>>---
> >>>> fs/ramfs/inode.c | 6 +++++-
> >>>> 1 file changed, 5 insertions(+), 1 deletion(-)
> >>>>
> >>>>diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> >>>>index c24f1e1..da30b4f 100644
> >>>>--- a/fs/ramfs/inode.c
> >>>>+++ b/fs/ramfs/inode.c
> >>>>@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> >>>> inode_init_owner(inode, dir, mode);
> >>>> inode->i_mapping->a_ops = &ramfs_aops;
> >>>> inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> >>>>- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> >>>>+ /*
> >>>>+ * TODO: what should be done to make movable safe?
> >>>>+ */
> >>>>+ mapping_set_gfp_mask(inode->i_mapping,
> >>>>+ GFP_TRANSHUGE & ~__GFP_MOVABLE);
> >>>Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
> >>>GFP_HIGHUSER_MOVABLE:
> >>>
> >>>http://lkml.org/lkml/2006/11/27/156
> >>>
> >>>It seems the origin reason is not longer valid, correct?
> >>Incorrect, I believe: so far as I know, the original reason remains
> >>valid - though it would only require a couple of good small changes
> >>to reverse that - or perhaps you have already made these changes?
> >>
> >>The original reason is that ramfs pages are not migratable,
> >>therefore they should be allocated from an unmovable area.
> >>
> >>As I understand it (and I would have preferred to run a test to check
> >>my understanding before replying, but don't have time for that), ramfs
> >>pages cannot be migrated for two reasons, neither of them a good reason.
> >>
> >>One reason (okay, it wouldn't have been quite this way in 2006) is that
> >>ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
> >>the page_evictable() test, so they will be marked PageUnevictable, so
> >>__isolate_lru_page() will refuse to isolate them for migration (except
> >>for CMA).
> >True.
> >
> >>I am strongly in favour of removing that limitation from
> >>__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> >>and Christoph were both in favour too); and note that there is no such
> >>restriction in the confusingly similar but different isolate_lru_page().
> >>
> >>Some people do worry that migrating Mlocked pages would introduce the
> >>occasional possibility of a minor fault (with migration_entry_wait())
> >>on an Mlocked region which never faulted before. I tend to dismiss
> >>that worry, but maybe I'm wrong to do so: maybe there should be a
> >>tunable for realtimey people to set, to prohibit page migration from
> >>mlocked areas; but the default should be to allow it.
> >I agree.
> >Just FYI for mlocked page migration
> >
> >I tried migratioin of mlocked page and Johannes and Mel had a concern
> >about that.
> >http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
> >
> >But later, Peter already acked it and I guess by reading the thread that
> >Hugh was in favour when page migration was merged first time.
> >
> >http://marc.info/?l=linux-mm&m=133697873414205&w=2
> >http://marc.info/?l=linux-mm&m=133700341823358&w=2
> >
> >Many people said mlock means memory-resident, NOT pinning so it could
> >allow minor fault while Mel still had a concern except CMA.
> >http://marc.info/?l=linux-mm&m=133674219714419&w=2
>
> How about add a knob?

Maybe, volunteering?

--
Kind regards,
Minchan Kim

2013-04-05 08:31:20

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

Hi Wanpeng,

On Fri, Apr 05, 2013 at 04:22:17PM +0800, Wanpeng Li wrote:
> On Fri, Apr 05, 2013 at 05:01:06PM +0900, Minchan Kim wrote:
> >On Fri, Apr 05, 2013 at 02:47:25PM +0800, Simon Jeons wrote:
> >> Hi Minchan,
> >> On 04/03/2013 09:11 AM, Minchan Kim wrote:
> >> >On Tue, Apr 02, 2013 at 03:15:23PM -0700, Hugh Dickins wrote:
> >> >>On Tue, 2 Apr 2013, Kirill A. Shutemov wrote:
> >> >>>Kirill A. Shutemov wrote:
> >> >>>>From: "Kirill A. Shutemov" <[email protected]>
> >> >>>>
> >> >>>>ramfs is the most simple fs from page cache point of view. Let's start
> >> >>>>transparent huge page cache enabling here.
> >> >>>>
> >> >>>>For now we allocate only non-movable huge page. It's not yet clear if
> >> >>>>movable page is safe here and what need to be done to make it safe.
> >> >>>>
> >> >>>>Signed-off-by: Kirill A. Shutemov <[email protected]>
> >> >>>>---
> >> >>>> fs/ramfs/inode.c | 6 +++++-
> >> >>>> 1 file changed, 5 insertions(+), 1 deletion(-)
> >> >>>>
> >> >>>>diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
> >> >>>>index c24f1e1..da30b4f 100644
> >> >>>>--- a/fs/ramfs/inode.c
> >> >>>>+++ b/fs/ramfs/inode.c
> >> >>>>@@ -61,7 +61,11 @@ struct inode *ramfs_get_inode(struct super_block *sb,
> >> >>>> inode_init_owner(inode, dir, mode);
> >> >>>> inode->i_mapping->a_ops = &ramfs_aops;
> >> >>>> inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
> >> >>>>- mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> >> >>>>+ /*
> >> >>>>+ * TODO: what should be done to make movable safe?
> >> >>>>+ */
> >> >>>>+ mapping_set_gfp_mask(inode->i_mapping,
> >> >>>>+ GFP_TRANSHUGE & ~__GFP_MOVABLE);
> >> >>>Hugh, I've found old thread with the reason why we have GFP_HIGHUSER here, not
> >> >>>GFP_HIGHUSER_MOVABLE:
> >> >>>
> >> >>>http://lkml.org/lkml/2006/11/27/156
> >> >>>
> >> >>>It seems the origin reason is not longer valid, correct?
> >> >>Incorrect, I believe: so far as I know, the original reason remains
> >> >>valid - though it would only require a couple of good small changes
> >> >>to reverse that - or perhaps you have already made these changes?
> >> >>
> >> >>The original reason is that ramfs pages are not migratable,
> >> >>therefore they should be allocated from an unmovable area.
> >> >>
> >> >>As I understand it (and I would have preferred to run a test to check
> >> >>my understanding before replying, but don't have time for that), ramfs
> >> >>pages cannot be migrated for two reasons, neither of them a good reason.
> >> >>
> >> >>One reason (okay, it wouldn't have been quite this way in 2006) is that
> >> >>ramfs (rightly) calls mapping_set_unevictable(), so its pages will fail
> >> >>the page_evictable() test, so they will be marked PageUnevictable, so
> >> >>__isolate_lru_page() will refuse to isolate them for migration (except
> >> >>for CMA).
> >> >True.
> >> >
> >> >>I am strongly in favour of removing that limitation from
> >> >>__isolate_lru_page() (and the thread you pointed - thank you - shows Mel
> >> >>and Christoph were both in favour too); and note that there is no such
> >> >>restriction in the confusingly similar but different isolate_lru_page().
> >> >>
> >> >>Some people do worry that migrating Mlocked pages would introduce the
> >> >>occasional possibility of a minor fault (with migration_entry_wait())
> >> >>on an Mlocked region which never faulted before. I tend to dismiss
> >> >>that worry, but maybe I'm wrong to do so: maybe there should be a
> >> >>tunable for realtimey people to set, to prohibit page migration from
> >> >>mlocked areas; but the default should be to allow it.
> >> >I agree.
> >> >Just FYI for mlocked page migration
> >> >
> >> >I tried migratioin of mlocked page and Johannes and Mel had a concern
> >> >about that.
> >> >http://lkml.indiana.edu/hypermail/linux/kernel/1109.0/00175.html
> >> >
> >> >But later, Peter already acked it and I guess by reading the thread that
> >> >Hugh was in favour when page migration was merged first time.
> >> >
> >> >http://marc.info/?l=linux-mm&m=133697873414205&w=2
> >> >http://marc.info/?l=linux-mm&m=133700341823358&w=2
> >> >
> >> >Many people said mlock means memory-resident, NOT pinning so it could
> >> >allow minor fault while Mel still had a concern except CMA.
> >> >http://marc.info/?l=linux-mm&m=133674219714419&w=2
> >>
> >> How about add a knob?
> >
> >Maybe, volunteering?
>
> Hi Minchan,
>
> I can be the volunteer, what I care is if add a knob make sense?

Frankly speaking, I'd like to avoid a new knob, but there might be
some workloads that suffer from mlocked page migration, so we couldn't
dismiss it. In such a case, introducing the knob would be a solution,
enabled by default. If we don't get any reports for a long time,
we can remove the knob someday, IMHO.

Thanks.

--
Kind regards,
Minchan Kim

Subject: Re: [PATCHv2, RFC 20/30] ramfs: enable transparent huge page cache

On Fri, 5 Apr 2013, Minchan Kim wrote:

> > >> How about add a knob?
> > >
> > >Maybe, volunteering?
> >
> > Hi Minchan,
> >
> > I can be the volunteer, what I care is if add a knob make sense?
>
> Frankly speaking, I'd like to avoid a new knob, but there might be
> some workloads that suffer from mlocked page migration, so we couldn't
> dismiss it. In such a case, introducing the knob would be a solution,
> enabled by default. If we don't get any reports for a long time,
> we can remove the knob someday, IMHO.

No Knob please. A new implementation for page pinning that avoids the
mlock crap.

1. It should be available for device drivers to pin their memory (they are
now elevating the ref counter which means page migration will have to see
if it can account for all references before giving up and it does that
quite frequently). So there needs to be an in kernel API, a syscall API as
well as a command line one. Preferably as similar as possible.

2. A sane API for marking pages as mlocked. Maybe part of MMAP? I hate the
command line tools and the APIs for doing that right now.

3. The reservation scheme for mlock via ulimit is broken. We have per
process constraints only it seems. If you start enough processes you can
still make the kernel go OOM.

4. mlock semantics are prescribed by posix which states that the page
stays in memory. I think we should stay with that narrow definition for
mlock.

5. Pinning could also mean that page faults on the page are to be avoided.
COW could occur on fork and page table entries could be instantiated at
mmap/fork time. Pinning could mean that minor/major faults will not occur
on a page.