2015-07-10 20:29:42

by Matthew Wilcox

Subject: [PATCH 00/10] Huge page support for DAX files

From: Matthew Wilcox <[email protected]>

This series of patches adds support for using PMD page table entries
to map DAX files. We expect NV-DIMMs many gigabytes in size to start
showing up, and the memory consumption of mapping them with 4kB PTEs
will be astronomical.

The patch series leverages much of the Transparent Huge Pages
infrastructure, going so far as to borrow one of Kirill's patches from
his THP page cache series.

The ext2 and XFS patches are merely compile-tested. The ext4 code has
survived the NVML test suite, some Trinity testing, and an xfstests run.

Kirill A. Shutemov (1):
thp: vma_adjust_trans_huge(): adjust file-backed VMA too

Matthew Wilcox (9):
dax: Move DAX-related functions to a new header
thp: Prepare for DAX huge pages
mm: Add a pmd_fault handler
mm: Export various functions for the benefit of DAX
mm: Add vmf_insert_pfn_pmd()
dax: Add huge page fault support
ext2: Huge page fault support
ext4: Huge page fault support
xfs: Huge page fault support

Documentation/filesystems/dax.txt | 7 +-
fs/block_dev.c | 1 +
fs/dax.c | 152 ++++++++++++++++++++++++++++++++++++++
fs/ext2/file.c | 10 ++-
fs/ext2/inode.c | 1 +
fs/ext4/file.c | 11 ++-
fs/ext4/indirect.c | 1 +
fs/ext4/inode.c | 1 +
fs/xfs/xfs_buf.h | 1 +
fs/xfs/xfs_file.c | 30 +++++++-
fs/xfs/xfs_trace.h | 1 +
include/linux/dax.h | 39 ++++++++++
include/linux/fs.h | 14 ----
include/linux/huge_mm.h | 23 +++---
include/linux/mm.h | 2 +
mm/huge_memory.c | 100 ++++++++++++++++++-------
mm/memory.c | 30 ++++++--
17 files changed, 362 insertions(+), 62 deletions(-)
create mode 100644 include/linux/dax.h

--
2.1.4


2015-07-10 20:30:23

by Matthew Wilcox

Subject: [PATCH 01/10] thp: vma_adjust_trans_huge(): adjust file-backed VMA too

From: "Kirill A. Shutemov" <[email protected]>

Since we're going to have huge pages in the page cache, we need to call
vma_adjust_trans_huge() for file-backed VMAs too, since they can
potentially contain huge pages.

For now we call it for all VMAs.

Probably later we will need to introduce a flag to indicate that the VMA
has huge pages.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Hillf Danton <[email protected]>
---
include/linux/huge_mm.h | 11 +----------
mm/huge_memory.c | 2 +-
2 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f10b20f..1c53c7d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -122,7 +122,7 @@ extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
#endif
extern int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice);
-extern void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+extern void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next);
@@ -138,15 +138,6 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
else
return 0;
}
-static inline void vma_adjust_trans_huge(struct vm_area_struct *vma,
- unsigned long start,
- unsigned long end,
- long adjust_next)
-{
- if (!vma->anon_vma || vma->vm_ops)
- return;
- __vma_adjust_trans_huge(vma, start, end, adjust_next);
-}
static inline int hpage_nr_pages(struct page *page)
{
if (unlikely(PageTransHuge(page)))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c107094..911071b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2967,7 +2967,7 @@ static void split_huge_page_address(struct mm_struct *mm,
split_huge_page_pmd_mm(mm, address, pmd);
}

-void __vma_adjust_trans_huge(struct vm_area_struct *vma,
+void vma_adjust_trans_huge(struct vm_area_struct *vma,
unsigned long start,
unsigned long end,
long adjust_next)
--
2.1.4

2015-07-10 20:30:57

by Matthew Wilcox

Subject: [PATCH 02/10] dax: Move DAX-related functions to a new header

From: Matthew Wilcox <[email protected]>

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGE case, we need the
inlined dax_pmd_fault() stub to return VM_FAULT_FALLBACK, which is
defined in <linux/mm.h>. Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related functions
to a new header, <linux/dax.h>. We could also have moved the VM_FAULT_*
definitions to a new header, or to a different header that isn't quite
such a boil-the-ocean header as <linux/mm.h>, but this felt like the
best option.
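
For reference, the inline stub in question, as added to <linux/dax.h>
later in this series, is:

#ifndef CONFIG_TRANSPARENT_HUGEPAGE
/* VM_FAULT_FALLBACK comes from <linux/mm.h>, which fs.h must not pull in */
static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
		pmd_t *pmd, unsigned int flags, get_block_t gb,
		dax_iodone_t di)
{
	return VM_FAULT_FALLBACK;
}
#endif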

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/block_dev.c | 1 +
fs/ext2/file.c | 1 +
fs/ext2/inode.c | 1 +
fs/ext4/file.c | 1 +
fs/ext4/indirect.c | 1 +
fs/ext4/inode.c | 1 +
fs/xfs/xfs_buf.h | 1 +
include/linux/dax.h | 21 +++++++++++++++++++++
include/linux/fs.h | 14 --------------
9 files changed, 28 insertions(+), 14 deletions(-)
create mode 100644 include/linux/dax.h

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1982437..9be2d7e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -28,6 +28,7 @@
#include <linux/namei.h>
#include <linux/log2.h>
#include <linux/cleancache.h>
+#include <linux/dax.h>
#include <asm/uaccess.h>
#include "internal.h"

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 3b57c9f..db4c299 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -20,6 +20,7 @@

#include <linux/time.h>
#include <linux/pagemap.h>
+#include <linux/dax.h>
#include <linux/quotaops.h>
#include "ext2.h"
#include "xattr.h"
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 27c85ea..1488248 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -25,6 +25,7 @@
#include <linux/time.h>
#include <linux/highuid.h>
#include <linux/pagemap.h>
+#include <linux/dax.h>
#include <linux/quotaops.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index e5bdcb7..34d814f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -22,6 +22,7 @@
#include <linux/fs.h>
#include <linux/mount.h>
#include <linux/path.h>
+#include <linux/dax.h>
#include <linux/quotaops.h>
#include <linux/pagevec.h>
#include <linux/uio.h>
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 4f6ac49..2468261 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -22,6 +22,7 @@

#include "ext4_jbd2.h"
#include "truncate.h"
+#include <linux/dax.h>
#include <linux/uio.h>

#include <trace/events/ext4.h>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ae54de4..26a32da 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -22,6 +22,7 @@
#include <linux/time.h>
#include <linux/highuid.h>
#include <linux/pagemap.h>
+#include <linux/dax.h>
#include <linux/quotaops.h>
#include <linux/string.h>
#include <linux/buffer_head.h>
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 331c1cc..c79b717 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -23,6 +23,7 @@
#include <linux/spinlock.h>
#include <linux/mm.h>
#include <linux/fs.h>
+#include <linux/dax.h>
#include <linux/buffer_head.h>
#include <linux/uio.h>
#include <linux/list_lru.h>
diff --git a/include/linux/dax.h b/include/linux/dax.h
new file mode 100644
index 0000000..4f27d3d
--- /dev/null
+++ b/include/linux/dax.h
@@ -0,0 +1,21 @@
+#ifndef _LINUX_DAX_H
+#define _LINUX_DAX_H
+
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <asm/pgtable.h>
+
+ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
+ get_block_t, dio_iodone_t, int flags);
+int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
+int dax_truncate_page(struct inode *, loff_t from, get_block_t);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
+ dax_iodone_t);
+int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
+ dax_iodone_t);
+int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
+#define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
+#define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod)
+
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 47053b3..824102a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -51,7 +51,6 @@ struct swap_info_struct;
struct seq_file;
struct workqueue_struct;
struct iov_iter;
-struct vm_fault;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2656,19 +2655,6 @@ extern loff_t fixed_size_llseek(struct file *file, loff_t offset,
extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

-ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
- get_block_t, dio_iodone_t, int flags);
-int dax_clear_blocks(struct inode *, sector_t block, long size);
-int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
-int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
- dax_iodone_t);
-int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
- dax_iodone_t);
-int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
-#define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
-#define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod)
-
#ifdef CONFIG_BLOCK
typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
loff_t file_offset);
--
2.1.4

2015-07-10 20:30:07

by Matthew Wilcox

Subject: [PATCH 03/10] thp: Prepare for DAX huge pages

From: Matthew Wilcox <[email protected]>

Add a vma_is_dax() helper to test whether the VMA is DAX, and use
it in zap_huge_pmd() and __split_huge_page_pmd().

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/dax.h | 4 ++++
mm/huge_memory.c | 46 ++++++++++++++++++++++++++++------------------
2 files changed, 32 insertions(+), 18 deletions(-)

diff --git a/include/linux/dax.h b/include/linux/dax.h
index 4f27d3d..9b51f9d 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -18,4 +18,8 @@ int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
#define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
#define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod)

+static inline bool vma_is_dax(struct vm_area_struct *vma)
+{
+ return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
+}
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 911071b..b7bd855 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -23,6 +23,7 @@
#include <linux/pagemap.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
+#include <linux/dax.h>

#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -1391,7 +1392,6 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
int ret = 0;

if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- struct page *page;
pgtable_t pgtable;
pmd_t orig_pmd;
/*
@@ -1403,13 +1403,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
tlb->fullmm);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+ if (vma_is_dax(vma)) {
+ if (is_huge_zero_pmd(orig_pmd)) {
+ pgtable = NULL;
+ } else {
+ spin_unlock(ptl);
+ return 1;
+ }
+ } else {
+ pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
+ }
if (is_huge_zero_pmd(orig_pmd)) {
atomic_long_dec(&tlb->mm->nr_ptes);
spin_unlock(ptl);
put_huge_zero_page();
} else {
- page = pmd_page(orig_pmd);
+ struct page *page = pmd_page(orig_pmd);
page_remove_rmap(page);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
@@ -1418,7 +1427,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
spin_unlock(ptl);
tlb_remove_page(tlb, page);
}
- pte_free(tlb->mm, pgtable);
+ if (pgtable)
+ pte_free(tlb->mm, pgtable);
ret = 1;
}
return ret;
@@ -2887,7 +2897,7 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
{
spinlock_t *ptl;
- struct page *page;
+ struct page *page = NULL;
struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;
unsigned long mmun_start; /* For mmu_notifiers */
@@ -2900,25 +2910,25 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
again:
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
ptl = pmd_lock(mm, pmd);
- if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- return;
- }
- if (is_huge_zero_pmd(*pmd)) {
+ if (unlikely(!pmd_trans_huge(*pmd)))
+ goto unlock;
+ if (vma_is_dax(vma)) {
+ pmdp_huge_clear_flush(vma, haddr, pmd);
+ } else if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
- spin_unlock(ptl);
- mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
- return;
+ } else {
+ page = pmd_page(*pmd);
+ VM_BUG_ON_PAGE(!page_count(page), page);
+ get_page(page);
}
- page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!page_count(page), page);
- get_page(page);
+ unlock:
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

- split_huge_page(page);
+ if (!page)
+ return;
+
+ split_huge_page(page);
put_page(page);

/*
--
2.1.4

2015-07-10 20:30:01

by Matthew Wilcox

Subject: [PATCH 04/10] mm: Add a pmd_fault handler

From: Matthew Wilcox <[email protected]>

Allow non-anonymous VMAs to provide huge pages in response to a page fault.
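
For illustration, this is the shape of a handler as wired up by the
filesystem patches later in this series (the ext2 variant shown):

static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
		pmd_t *pmd, unsigned int flags)
{
	/* DAX does the real work; NULL means no unwritten-extent callback */
	return dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
}

static const struct vm_operations_struct ext2_dax_vm_ops = {
	.fault		= ext2_dax_fault,
	.pmd_fault	= ext2_dax_pmd_fault,	/* the new handler */
	.page_mkwrite	= ext2_dax_mkwrite,
	.pfn_mkwrite	= dax_pfn_mkwrite,
};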

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/mm.h | 2 ++
mm/memory.c | 30 ++++++++++++++++++++++++------
2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..00473e4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -246,6 +246,8 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
+ pmd_t *, unsigned int flags);
void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);

/* notification that a previously read-only page is about to become
diff --git a/mm/memory.c b/mm/memory.c
index a84fbb7..32007d6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3209,6 +3209,27 @@ out:
return 0;
}

+static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, unsigned int flags)
+{
+ if (!vma->vm_ops)
+ return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+ if (vma->vm_ops->pmd_fault)
+ return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ return VM_FAULT_FALLBACK;
+}
+
+static int wp_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, pmd_t orig_pmd,
+ unsigned int flags)
+{
+ if (!vma->vm_ops)
+ return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
+ if (vma->vm_ops->pmd_fault)
+ return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ return VM_FAULT_FALLBACK;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3312,10 +3333,7 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (!pmd)
return VM_FAULT_OOM;
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = VM_FAULT_FALLBACK;
- if (!vma->vm_ops)
- ret = do_huge_pmd_anonymous_page(mm, vma, address,
- pmd, flags);
+ int ret = create_huge_pmd(mm, vma, address, pmd, flags);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
@@ -3339,8 +3357,8 @@ static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
orig_pmd, pmd);

if (dirty && !pmd_write(orig_pmd)) {
- ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
- orig_pmd);
+ ret = wp_huge_pmd(mm, vma, address, pmd,
+ orig_pmd, flags);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
--
2.1.4

2015-07-10 20:30:15

by Matthew Wilcox

Subject: [PATCH 05/10] mm: Export various functions for the benefit of DAX

From: Matthew Wilcox <[email protected]>

To use the huge zero page in DAX, we need these functions exported.
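
The DAX huge fault path later in this series uses them along these lines
(a read fault over a hole, where the huge zero page is mapped instead of
allocating storage):

	struct page *zero_page = get_huge_zero_page();

	ptl = pmd_lock(mm, pmd);
	set_huge_zero_page(NULL, mm, vma, pmd_addr, pmd, zero_page);
	spin_unlock(ptl);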

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/huge_mm.h | 10 ++++++++++
mm/huge_memory.c | 9 ++-------
2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1c53c7d..70587ea 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -155,6 +155,16 @@ static inline bool is_huge_zero_page(struct page *page)
return ACCESS_ONCE(huge_zero_page) == page;
}

+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_page(pmd_page(pmd));
+}
+
+struct page *get_huge_zero_page(void);
+bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long haddr,
+ pmd_t *pmd, struct page *zero_page);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b7bd855..db3180f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -172,12 +172,7 @@ fail:
static atomic_t huge_zero_refcount;
struct page *huge_zero_page __read_mostly;

-static inline bool is_huge_zero_pmd(pmd_t pmd)
-{
- return is_huge_zero_page(pmd_page(pmd));
-}
-
-static struct page *get_huge_zero_page(void)
+struct page *get_huge_zero_page(void)
{
struct page *zero_page;
retry:
@@ -772,7 +767,7 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
}

/* Caller must hold page table lock. */
-static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
{
--
2.1.4

2015-07-10 20:30:38

by Matthew Wilcox

Subject: [PATCH 06/10] mm: Add vmf_insert_pfn_pmd()

From: Matthew Wilcox <[email protected]>

Similar to vm_insert_pfn(), but for PMDs rather than PTEs. The 'vmf_'
prefix instead of 'vm_' prefix is intended to indicate that it returns a
VM_FAULT_ value rather than an errno (which would only have to be
converted into a VM_FAULT_ value anyway).
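
The intended call pattern, taken from the DAX fault handler later in
this series: once the fault has been resolved to a PMD-sized,
PMD-aligned pfn, the handler ORs the result into its return value:

	/* Fall back to PTEs unless the extent is PMD-sized and aligned */
	if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
		goto fallback;
	result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);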

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 70587ea..f9b612f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -33,6 +33,8 @@ extern int move_huge_pmd(struct vm_area_struct *vma,
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot,
int prot_numa);
+int vmf_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *,
+ unsigned long pfn, bool write);

enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_FLAG,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index db3180f..26d0fc1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -837,6 +837,49 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pmd_t entry;
+ spinlock_t *ptl;
+
+ ptl = pmd_lock(mm, pmd);
+ if (pmd_none(*pmd)) {
+ entry = pmd_mkhuge(pfn_pmd(pfn, prot));
+ if (write) {
+ entry = pmd_mkyoung(pmd_mkdirty(entry));
+ entry = maybe_pmd_mkwrite(entry, vma);
+ }
+ set_pmd_at(mm, addr, pmd, entry);
+ update_mmu_cache_pmd(vma, addr, pmd);
+ }
+ spin_unlock(ptl);
+ return VM_FAULT_NOPAGE;
+}
+
+int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned long pfn, bool write)
+{
+ pgprot_t pgprot = vma->vm_page_prot;
+ /*
+ * If we had pmd_special, we could avoid all these restrictions,
+ * but we need to be consistent with PTEs and architectures that
+ * can't support a 'special' bit.
+ */
+ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)));
+ BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
+ (VM_PFNMAP|VM_MIXEDMAP));
+ BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
+ BUG_ON((vma->vm_flags & VM_MIXEDMAP) && pfn_valid(pfn));
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+ if (track_pfn_insert(vma, &pgprot, pfn))
+ return VM_FAULT_SIGBUS;
+ return insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
--
2.1.4

2015-07-10 20:31:06

by Matthew Wilcox

Subject: [PATCH 07/10] dax: Add huge page fault support

From: Matthew Wilcox <[email protected]>

This is the support code for DAX-enabled filesystems to allow them to
provide huge pages in response to faults.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/dax.txt | 7 +-
fs/dax.c | 152 ++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 14 ++++
3 files changed, 170 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 7af2851..7bde640 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -60,9 +60,10 @@ Filesystem support consists of
- implementing the direct_IO address space operation, and calling
dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
- implementing an mmap file operation for DAX files which sets the
- VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
- for fault and page_mkwrite (which should probably call dax_fault() and
- dax_mkwrite(), passing the appropriate get_block() callback)
+ VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
+ include handlers for fault, pmd_fault and page_mkwrite (which should
+ probably call dax_fault(), dax_pmd_fault() and dax_mkwrite(), passing the
+ appropriate get_block() callback)
- calling dax_truncate_page() instead of block_truncate_page() for DAX files
- calling dax_zero_page_range() instead of zero_user() for DAX files
- ensuring that there is sufficient locking between reads, writes,
diff --git a/fs/dax.c b/fs/dax.c
index c3e21cc..20cf3b0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -484,6 +484,158 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
EXPORT_SYMBOL_GPL(dax_fault);

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
+ * more often than one might expect in the below function.
+ */
+#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned int flags, get_block_t get_block,
+ dax_iodone_t complete_unwritten)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ struct inode *inode = mapping->host;
+ struct buffer_head bh;
+ unsigned blkbits = inode->i_blkbits;
+ unsigned long pmd_addr = address & PMD_MASK;
+ bool write = flags & FAULT_FLAG_WRITE;
+ long length;
+ void *kaddr;
+ pgoff_t size, pgoff;
+ sector_t block, sector;
+ unsigned long pfn;
+ int result = 0;
+
+ /* Fall back to PTEs if we're going to COW */
+ if (write && !(vma->vm_flags & VM_SHARED))
+ return VM_FAULT_FALLBACK;
+ /* If the PMD would extend outside the VMA */
+ if (pmd_addr < vma->vm_start)
+ return VM_FAULT_FALLBACK;
+ if ((pmd_addr + PMD_SIZE) > vma->vm_end)
+ return VM_FAULT_FALLBACK;
+
+ pgoff = ((pmd_addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (pgoff >= size)
+ return VM_FAULT_SIGBUS;
+ /* If the PMD would cover blocks out of the file */
+ if ((pgoff | PG_PMD_COLOUR) >= size)
+ return VM_FAULT_FALLBACK;
+
+ memset(&bh, 0, sizeof(bh));
+ block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
+
+ bh.b_size = PMD_SIZE;
+ length = get_block(inode, block, &bh, write);
+ if (length)
+ return VM_FAULT_SIGBUS;
+ i_mmap_lock_read(mapping);
+
+ /*
+ * If the filesystem isn't willing to tell us the length of a hole,
+ * just fall back to PTEs. Calling get_block 512 times in a loop
+ * would be silly.
+ */
+ if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
+ goto fallback;
+
+ /* Guard against a race with truncate */
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (pgoff >= size) {
+ result = VM_FAULT_SIGBUS;
+ goto out;
+ }
+ if ((pgoff | PG_PMD_COLOUR) >= size)
+ goto fallback;
+
+ if (is_huge_zero_pmd(*pmd))
+ unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
+
+ if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
+ bool set;
+ spinlock_t *ptl;
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *zero_page = get_huge_zero_page();
+ if (unlikely(!zero_page))
+ goto fallback;
+
+ ptl = pmd_lock(mm, pmd);
+ set = set_huge_zero_page(NULL, mm, vma, pmd_addr, pmd,
+ zero_page);
+ spin_unlock(ptl);
+ result = VM_FAULT_NOPAGE;
+ } else {
+ sector = bh.b_blocknr << (blkbits - 9);
+ length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
+ bh.b_size);
+ if (length < 0) {
+ result = VM_FAULT_SIGBUS;
+ goto out;
+ }
+ if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
+ goto fallback;
+
+ if (buffer_unwritten(&bh) || buffer_new(&bh)) {
+ int i;
+ for (i = 0; i < PTRS_PER_PMD; i++)
+ clear_page(kaddr + i * PAGE_SIZE);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ result |= VM_FAULT_MAJOR;
+ }
+
+ result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+ }
+
+ out:
+ i_mmap_unlock_read(mapping);
+
+ if (buffer_unwritten(&bh))
+ complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
+
+ return result;
+
+ fallback:
+ count_vm_event(THP_FAULT_FALLBACK);
+ result = VM_FAULT_FALLBACK;
+ goto out;
+}
+EXPORT_SYMBOL_GPL(__dax_pmd_fault);
+
+/**
+ * dax_pmd_fault - handle a PMD fault on a DAX file
+ * @vma: The virtual memory area where the fault occurred
+ * @address: The faulting address
+ * @pmd: The PMD entry within the page table to fill in
+ * @flags: The fault flags (e.g. FAULT_FLAG_WRITE)
+ * @get_block: The filesystem method used to translate file offsets to blocks
+ * @complete_unwritten: The filesystem method used to convert unwritten blocks
+ *
+ * When a page fault occurs, filesystems may call this helper in their
+ * pmd_fault handler for DAX files.
+ */
+int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned int flags, get_block_t get_block,
+ dax_iodone_t complete_unwritten)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ if (flags & FAULT_FLAG_WRITE) {
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ }
+ result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
+ complete_unwritten);
+ if (flags & FAULT_FLAG_WRITE)
+ sb_end_pagefault(sb);
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(dax_pmd_fault);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/**
* dax_pfn_mkwrite - handle first write to DAX page
* @vma: The virtual memory area where the fault occurred
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9b51f9d..b415e52 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -14,6 +14,20 @@ int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
dax_iodone_t);
int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
dax_iodone_t);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
+ unsigned int flags, get_block_t, dax_iodone_t);
+int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
+ unsigned int flags, get_block_t, dax_iodone_t);
+#else
+static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned int flags, get_block_t gb,
+ dax_iodone_t di)
+{
+ return VM_FAULT_FALLBACK;
+}
+#define __dax_pmd_fault dax_pmd_fault
+#endif
int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
#define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
#define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod)
--
2.1.4

2015-07-10 20:30:32

by Matthew Wilcox

Subject: [PATCH 08/10] ext2: Huge page fault support

From: Matthew Wilcox <[email protected]>

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/file.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index db4c299..1982c3f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -32,6 +32,12 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return dax_fault(vma, vmf, ext2_get_block, NULL);
}

+static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned int flags)
+{
+ return dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
+}
+
static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return dax_mkwrite(vma, vmf, ext2_get_block, NULL);
@@ -39,6 +45,7 @@ static int ext2_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static const struct vm_operations_struct ext2_dax_vm_ops = {
.fault = ext2_dax_fault,
+ .pmd_fault = ext2_dax_pmd_fault,
.page_mkwrite = ext2_dax_mkwrite,
.pfn_mkwrite = dax_pfn_mkwrite,
};
@@ -50,7 +57,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)

file_accessed(file);
vma->vm_ops = &ext2_dax_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
return 0;
}
#else
--
2.1.4

2015-07-10 20:30:44

by Matthew Wilcox

Subject: [PATCH 09/10] ext4: Huge page fault support

From: Matthew Wilcox <[email protected]>

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/file.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 34d814f..ca5302a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -210,6 +210,13 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return dax_fault(vma, vmf, ext4_get_block_write, ext4_end_io_unwritten);
}

+static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned int flags)
+{
+ return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block_write,
+ ext4_end_io_unwritten);
+}
+
static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return dax_mkwrite(vma, vmf, ext4_get_block_write,
@@ -218,6 +225,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
+ .pmd_fault = ext4_dax_pmd_fault,
.page_mkwrite = ext4_dax_mkwrite,
.pfn_mkwrite = dax_pfn_mkwrite,
};
@@ -245,7 +253,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
file_accessed(file);
if (IS_DAX(file_inode(file))) {
vma->vm_ops = &ext4_dax_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
} else {
vma->vm_ops = &ext4_file_vm_ops;
}
--
2.1.4

2015-07-10 20:30:53

by Matthew Wilcox

Subject: [PATCH 10/10] xfs: Huge page fault support

From: Matthew Wilcox <[email protected]>

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/xfs/xfs_file.c | 30 +++++++++++++++++++++++++++++-
fs/xfs/xfs_trace.h | 1 +
2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a212b7b..d3ea50c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1534,8 +1534,36 @@ xfs_filemap_fault(
return ret;
}

+STATIC int
+xfs_filemap_pmd_fault(
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ pmd_t *pmd,
+ unsigned int flags)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct xfs_inode *ip = XFS_I(inode);
+ int ret;
+
+ if (!IS_DAX(inode))
+ return VM_FAULT_FALLBACK;
+
+ trace_xfs_filemap_pmd_fault(ip);
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vma->vm_file);
+ xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+ ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_direct,
+ xfs_end_io_dax_write);
+ xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
+ sb_end_pagefault(inode->i_sb);
+
+ return ret;
+}
+
static const struct vm_operations_struct xfs_file_vm_ops = {
.fault = xfs_filemap_fault,
+ .pmd_fault = xfs_filemap_pmd_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = xfs_filemap_page_mkwrite,
};
@@ -1548,7 +1576,7 @@ xfs_file_mmap(
file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
if (IS_DAX(file_inode(filp)))
- vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
return 0;
}

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8d916d3..8229cae 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -687,6 +687,7 @@ DEFINE_INODE_EVENT(xfs_inode_clear_eofblocks_tag);
DEFINE_INODE_EVENT(xfs_inode_free_eofblocks_invalid);

DEFINE_INODE_EVENT(xfs_filemap_fault);
+DEFINE_INODE_EVENT(xfs_filemap_pmd_fault);
DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite);

DECLARE_EVENT_CLASS(xfs_iref_class,
--
2.1.4

2015-07-13 13:23:45

by Jeff Moyer

Subject: Re: [PATCH 06/10] mm: Add vmf_insert_pfn_pmd()

Matthew Wilcox <[email protected]> writes:

> +static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + pmd_t entry;
> + spinlock_t *ptl;
> +
> + ptl = pmd_lock(mm, pmd);
> + if (pmd_none(*pmd)) {
> + entry = pmd_mkhuge(pfn_pmd(pfn, prot));
> + if (write) {
> + entry = pmd_mkyoung(pmd_mkdirty(entry));
> + entry = maybe_pmd_mkwrite(entry, vma);
> + }
> + set_pmd_at(mm, addr, pmd, entry);
> + update_mmu_cache_pmd(vma, addr, pmd);
> + }
> + spin_unlock(ptl);
> + return VM_FAULT_NOPAGE;
> +}

What's the point of the return value?

Cheers,
Jeff

2015-07-13 15:02:41

by Matthew Wilcox

Subject: Re: [PATCH 06/10] mm: Add vmf_insert_pfn_pmd()

On Mon, Jul 13, 2015 at 09:23:41AM -0400, Jeff Moyer wrote:
> Matthew Wilcox <[email protected]> writes:
>
> > +static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> > + pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
> > +{
> > + return VM_FAULT_NOPAGE;
> > +}
>
> What's the point of the return value?

Good point. Originally, it paralleled insert_pfn() in mm/memory.c, but it
became apparent that the return code of 0 or -Exxx was useless, and in
converting insert_pfn_pmd() over to VM_FAULT_ codes, all possible return
codes were going to be VM_FAULT_NOPAGE. It didn't occur to me to take it
one step further and make the function return void.

It doesn't make much difference either way:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26d0fc1..5ffdcaa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -837,7 +837,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

-static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
{
struct mm_struct *mm = vma->vm_mm;
@@ -855,7 +855,6 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}
spin_unlock(ptl);
- return VM_FAULT_NOPAGE;
}

int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -877,7 +876,8 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
return VM_FAULT_SIGBUS;
if (track_pfn_insert(vma, &pgprot, pfn))
return VM_FAULT_SIGBUS;
- return insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+ insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+ return VM_FAULT_NOPAGE;
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,


I suppose it's slightly cleaner. I'll integrate this for the next release.

2015-07-13 15:05:06

by Jan Kara

Subject: Re: [PATCH 07/10] dax: Add huge page fault support

On Fri 10-07-15 16:29:22, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> This is the support code for DAX-enabled filesystems to allow them to
> provide huge pages in response to faults.
>
> Signed-off-by: Matthew Wilcox <[email protected]>

...

> +int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmd, unsigned int flags, get_block_t get_block,
> + dax_iodone_t complete_unwritten)
> +{
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + struct inode *inode = mapping->host;
> + struct buffer_head bh;
> + unsigned blkbits = inode->i_blkbits;
> + unsigned long pmd_addr = address & PMD_MASK;
> + bool write = flags & FAULT_FLAG_WRITE;
> + long length;
> + void *kaddr;
> + pgoff_t size, pgoff;
> + sector_t block, sector;
> + unsigned long pfn;
> + int result = 0;
> +
> + /* Fall back to PTEs if we're going to COW */
> + if (write && !(vma->vm_flags & VM_SHARED))
> + return VM_FAULT_FALLBACK;
> + /* If the PMD would extend outside the VMA */
> + if (pmd_addr < vma->vm_start)
> + return VM_FAULT_FALLBACK;
> + if ((pmd_addr + PMD_SIZE) > vma->vm_end)
> + return VM_FAULT_FALLBACK;
> +
> + pgoff = ((pmd_addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (pgoff >= size)
> + return VM_FAULT_SIGBUS;
> + /* If the PMD would cover blocks out of the file */
> + if ((pgoff | PG_PMD_COLOUR) >= size)
> + return VM_FAULT_FALLBACK;
> +
> + memset(&bh, 0, sizeof(bh));
> + block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);
> +
> + bh.b_size = PMD_SIZE;
> + length = get_block(inode, block, &bh, write);
> + if (length)
> + return VM_FAULT_SIGBUS;
> + i_mmap_lock_read(mapping);
> +
> + /*
> + * If the filesystem isn't willing to tell us the length of a hole,
> + * just fall back to PTEs. Calling get_block 512 times in a loop
> + * would be silly.
> + */
> + if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
> + goto fallback;
> +
> + /* Guard against a race with truncate */
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (pgoff >= size) {
> + result = VM_FAULT_SIGBUS;
> + goto out;
> + }

So if this is a writeable fault and we race with truncate, we can leave
stale blocks beyond i_size, can't we? Ah, looking at dax_insert_mapping()
this seems to be a documented quirk of DAX mmap code. Would be worth
mentioning here as well so that people don't wonder...

Otherwise the patch looks good to me.

Honza

> + if ((pgoff | PG_PMD_COLOUR) >= size)
> + goto fallback;
> +
> + if (is_huge_zero_pmd(*pmd))
> + unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);
> +
> + if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> + bool set;
> + spinlock_t *ptl;
> + struct mm_struct *mm = vma->vm_mm;
> + struct page *zero_page = get_huge_zero_page();
> + if (unlikely(!zero_page))
> + goto fallback;
> +
> + ptl = pmd_lock(mm, pmd);
> + set = set_huge_zero_page(NULL, mm, vma, pmd_addr, pmd,
> + zero_page);
> + spin_unlock(ptl);
> + result = VM_FAULT_NOPAGE;
> + } else {
> + sector = bh.b_blocknr << (blkbits - 9);
> + length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
> + bh.b_size);
> + if (length < 0) {
> + result = VM_FAULT_SIGBUS;
> + goto out;
> + }
> + if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
> + goto fallback;
> +
> + if (buffer_unwritten(&bh) || buffer_new(&bh)) {
> + int i;
> + for (i = 0; i < PTRS_PER_PMD; i++)
> + clear_page(kaddr + i * PAGE_SIZE);
> + count_vm_event(PGMAJFAULT);
> + mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> + result |= VM_FAULT_MAJOR;
> + }
> +
> + result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
> + }
> +
> + out:
> + i_mmap_unlock_read(mapping);
> +
> + if (buffer_unwritten(&bh))
> + complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
> +
> + return result;
> +
> + fallback:
> + count_vm_event(THP_FAULT_FALLBACK);
> + result = VM_FAULT_FALLBACK;
> + goto out;
> +}
> +EXPORT_SYMBOL_GPL(__dax_pmd_fault);
> +
> +/**
> + * dax_pmd_fault - handle a PMD fault on a DAX file
> + * @vma: The virtual memory area where the fault occurred
> + * @address: The faulting address
> + * @pmd: The PMD entry within the page table to fill in
> + * @flags: The fault flags (e.g. FAULT_FLAG_WRITE)
> + * @get_block: The filesystem method used to translate file offsets to blocks
> + * @complete_unwritten: The filesystem method used to convert unwritten blocks
> + *
> + * When a page fault occurs, filesystems may call this helper in their
> + * pmd_fault handler for DAX files.
> + */
> +int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmd, unsigned int flags, get_block_t get_block,
> + dax_iodone_t complete_unwritten)
> +{
> + int result;
> + struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +
> + if (flags & FAULT_FLAG_WRITE) {
> + sb_start_pagefault(sb);
> + file_update_time(vma->vm_file);
> + }
> + result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
> + complete_unwritten);
> + if (flags & FAULT_FLAG_WRITE)
> + sb_end_pagefault(sb);
> +
> + return result;
> +}
> +EXPORT_SYMBOL_GPL(dax_pmd_fault);
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> /**
> * dax_pfn_mkwrite - handle first write to DAX page
> * @vma: The virtual memory area where the fault occurred
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9b51f9d..b415e52 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -14,6 +14,20 @@ int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> dax_iodone_t);
> int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
> dax_iodone_t);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> + unsigned int flags, get_block_t, dax_iodone_t);
> +int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> + unsigned int flags, get_block_t, dax_iodone_t);
> +#else
> +static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmd, unsigned int flags, get_block_t gb,
> + dax_iodone_t di)
> +{
> + return VM_FAULT_FALLBACK;
> +}
> +#define __dax_pmd_fault dax_pmd_fault
> +#endif
> int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
> #define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
> #define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod)
> --
> 2.1.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-13 15:33:49

by Matthew Wilcox

Subject: Re: [PATCH 07/10] dax: Add huge page fault support

On Mon, Jul 13, 2015 at 05:05:00PM +0200, Jan Kara wrote:
> So if this is a writeable fault and we race with truncate, we can leave
> stale blocks beyond i_size, can't we? Ah, looking at dax_insert_mapping()
> this seems to be a documented quirk of DAX mmap code. Would be worth
> mentioning here as well so that people don't wonder...

Thanks!

- /* Guard against a race with truncate */
+ /*
+ * If a truncate happened while we were allocating blocks, we may
+ * leave blocks allocated to the file that are beyond EOF. We can't
+ * take i_mutex here, so just leave them hanging; they'll be freed
+ * when the file is deleted.
+ */

is what I'll commit.

2015-07-19 11:03:32

by Kirill A. Shutemov

Subject: Re: [PATCH 03/10] thp: Prepare for DAX huge pages

On Fri, Jul 10, 2015 at 04:29:18PM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> Add a vma_is_dax() helper to test whether the VMA is DAX, and use
> it in zap_huge_pmd() and __split_huge_page_pmd().
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> ---
> include/linux/dax.h | 4 ++++
> mm/huge_memory.c | 46 ++++++++++++++++++++++++++++------------------
> 2 files changed, 32 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 4f27d3d..9b51f9d 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -18,4 +18,8 @@ int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
> #define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
> #define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod)
>
> +static inline bool vma_is_dax(struct vm_area_struct *vma)
> +{
> + return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
> +}
> #endif
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 911071b..b7bd855 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -23,6 +23,7 @@
> #include <linux/pagemap.h>
> #include <linux/migrate.h>
> #include <linux/hashtable.h>
> +#include <linux/dax.h>
>
> #include <asm/tlb.h>
> #include <asm/pgalloc.h>
> @@ -1391,7 +1392,6 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> int ret = 0;
>
> if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
> - struct page *page;
> pgtable_t pgtable;
> pmd_t orig_pmd;
> /*
> @@ -1403,13 +1403,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> tlb->fullmm);
> tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> - pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> + if (vma_is_dax(vma)) {
> + if (is_huge_zero_pmd(orig_pmd)) {
> + pgtable = NULL;

pgtable_t is not always a pointer. See arch/arc.

> + } else {
> + spin_unlock(ptl);
> + return 1;
> + }
> + } else {
> + pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
> + }
> if (is_huge_zero_pmd(orig_pmd)) {
> atomic_long_dec(&tlb->mm->nr_ptes);
> spin_unlock(ptl);
> put_huge_zero_page();
> } else {
> - page = pmd_page(orig_pmd);
> + struct page *page = pmd_page(orig_pmd);
> page_remove_rmap(page);
> VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
> add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> @@ -1418,7 +1427,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> spin_unlock(ptl);
> tlb_remove_page(tlb, page);
> }
> - pte_free(tlb->mm, pgtable);
> + if (pgtable)
> + pte_free(tlb->mm, pgtable);

It's better to drop "pgtable = NULL;" above and use "if (vma_is_dax(vma))"
here.

> ret = 1;
> }
> return ret;
> @@ -2887,7 +2897,7 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmd)
> {
> spinlock_t *ptl;
> - struct page *page;
> + struct page *page = NULL;
> struct mm_struct *mm = vma->vm_mm;
> unsigned long haddr = address & HPAGE_PMD_MASK;
> unsigned long mmun_start; /* For mmu_notifiers */
> @@ -2900,25 +2910,25 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> again:
> mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
> ptl = pmd_lock(mm, pmd);
> - if (unlikely(!pmd_trans_huge(*pmd))) {
> - spin_unlock(ptl);
> - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> - return;
> - }
> - if (is_huge_zero_pmd(*pmd)) {
> + if (unlikely(!pmd_trans_huge(*pmd)))
> + goto unlock;
> + if (vma_is_dax(vma)) {
> + pmdp_huge_clear_flush(vma, haddr, pmd);

pmdp_huge_clear_flush_notify()

> + } else if (is_huge_zero_pmd(*pmd)) {
> __split_huge_zero_page_pmd(vma, haddr, pmd);
> - spin_unlock(ptl);
> - mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> - return;
> + } else {
> + page = pmd_page(*pmd);
> + VM_BUG_ON_PAGE(!page_count(page), page);
> + get_page(page);
> }
> - page = pmd_page(*pmd);
> - VM_BUG_ON_PAGE(!page_count(page), page);
> - get_page(page);
> + unlock:
> spin_unlock(ptl);
> mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
>
> - split_huge_page(page);
> + if (!page)
> + return;
> +
> + split_huge_page(page);
> put_page(page);
>
> /*
> --
> 2.1.4
>

--
Kirill A. Shutemov