2015-08-04 19:58:15

by Matthew Wilcox

Subject: [PATCH 00/11] DAX fixes for 4.3

From: Matthew Wilcox <[email protected]>

Hi Andrew,

I believe this entire patch series should apply cleanly to your tree.

I have already sent the first patch to Ted; it should be applied for 4.2.

Patch 4 depends on patch 1 having been applied first, patch 5 depends
on patch 4, and patch 6 depends on patch 5, so I can't wait for those
patches to go in via the ext4 tree.

Most of these patches aren't needed for 4.2 and earlier, because they
pertain to the PMD support which is only in the -mm tree.

Kirill A. Shutemov (3):
thp: Decrement refcount on huge zero page if it is split
thp: Fix zap_huge_pmd() for DAX
dax: Don't use set_huge_zero_page()

Matthew Wilcox (8):
ext4: Use ext4_get_block_write() for DAX
thp: Change insert_pfn's return type to void
dax: Improve comment about truncate race
ext4: Add ext4_get_block_dax()
ext4: Start transaction before calling into DAX
dax: Fix race between simultaneous faults
dax: Ensure that zero pages are removed from other processes
dax: Use linear_page_index()

fs/dax.c | 66 ++++++++++++++++++++++++---------------
fs/ext4/ext4.h | 2 ++
fs/ext4/file.c | 61 ++++++++++++++++++++++++++++++++----
fs/ext4/inode.c | 11 +++++++
include/linux/huge_mm.h | 3 --
mm/huge_memory.c | 83 ++++++++++++++++++++++---------------------------
mm/memory.c | 11 +++++--
7 files changed, 155 insertions(+), 82 deletions(-)

--
2.1.4


2015-08-04 20:01:58

by Matthew Wilcox

Subject: [PATCH 01/11] ext4: Use ext4_get_block_write() for DAX

From: Matthew Wilcox <[email protected]>

DAX relies on the get_block function either zeroing newly allocated blocks
before they're findable by subsequent calls to get_block, or marking newly
allocated blocks as unwritten. ext4_get_block() cannot create unwritten
extents, but ext4_get_block_write() can.

Reported-by: Andy Rudoff <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/file.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 953d519..ca5302a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -196,7 +196,7 @@ out:
static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
{
struct inode *inode = bh->b_assoc_map->host;
- /* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+ /* XXX: breaks on 32-bit > 16TB. Is that even supported? */
loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
int err;
if (!uptodate)
@@ -207,8 +207,7 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)

static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
- /* Is this the right get_block? */
+ return dax_fault(vma, vmf, ext4_get_block_write, ext4_end_io_unwritten);
}

static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
@@ -220,7 +219,8 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
+ return dax_mkwrite(vma, vmf, ext4_get_block_write,
+ ext4_end_io_unwritten);
}

static const struct vm_operations_struct ext4_dax_vm_ops = {
--
2.1.4
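
To make the requirement above concrete: DAX maps storage directly into
userspace, so a newly allocated block that is neither pre-zeroed nor
marked unwritten would expose stale media contents to the faulting
process. A condensed sketch of the relevant part of dax_insert_mapping()
(simplified from the fs/dax.c of this era; the truncate checks and the
locking visible in patch 6 are omitted, and the helper signatures are
from memory, so treat them as approximate):

static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
		struct vm_area_struct *vma, struct vm_fault *vmf)
{
	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
	unsigned long vaddr = (unsigned long)vmf->virtual_address;
	void *addr;
	unsigned long pfn;
	long length;

	length = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn,
				    bh->b_size);
	if (length < 0)
		return length;

	/*
	 * Only blocks the filesystem reported as unwritten or new are
	 * zeroed before being mapped.  A get_block callback that
	 * allocates blocks without setting either flag would hand
	 * stale data to userspace, which is why ext4_get_block() is
	 * unsafe for DAX.
	 */
	if (buffer_unwritten(bh) || buffer_new(bh))
		clear_page(addr);

	return vm_insert_mixed(vma, vaddr, pfn);
}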

2015-08-04 19:58:18

by Matthew Wilcox

Subject: [PATCH 02/11] thp: Change insert_pfn's return type to void

From: Matthew Wilcox <[email protected]>

It would make more sense to have all the return values from
vmf_insert_pfn_pmd() encoded in one place instead of having to follow
the convention down into insert_pfn_pmd(). Suggested by Jeff Moyer.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/huge_memory.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26d0fc1..5ffdcaa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -837,7 +837,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

-static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned long pfn, pgprot_t prot, bool write)
{
struct mm_struct *mm = vma->vm_mm;
@@ -855,7 +855,6 @@ static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}
spin_unlock(ptl);
- return VM_FAULT_NOPAGE;
}

int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -877,7 +876,8 @@ int vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
return VM_FAULT_SIGBUS;
if (track_pfn_insert(vma, &pgprot, pfn))
return VM_FAULT_SIGBUS;
- return insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+ insert_pfn_pmd(vma, addr, pmd, pfn, pgprot, write);
+ return VM_FAULT_NOPAGE;
}

int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
--
2.1.4

2015-08-04 19:58:16

by Matthew Wilcox

Subject: [PATCH 03/11] dax: Improve comment about truncate race

From: Matthew Wilcox <[email protected]>

Jan Kara pointed out I should be more explicit here about the perils of
racing against truncate. The comment is mostly the same as for the PTE
case.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 15f8ffc..0a13118 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -553,7 +553,12 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
goto fallback;

- /* Guard against a race with truncate */
+ /*
+ * If a truncate happened while we were allocating blocks, we may
+ * leave blocks allocated to the file that are beyond EOF. We can't
+ * take i_mutex here, so just leave them hanging; they'll be freed
+ * when the file is deleted.
+ */
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (pgoff >= size) {
result = VM_FAULT_SIGBUS;
--
2.1.4

2015-08-04 20:01:19

by Matthew Wilcox

Subject: [PATCH 04/11] ext4: Add ext4_get_block_dax()

From: Matthew Wilcox <[email protected]>

DAX wants different semantics from any currently-existing ext4
get_block callback. Unlike ext4_get_block_write(), it needs to honour
the 'create' flag, and unlike ext4_get_block(), it needs to be able
to return unwritten extents. So introduce a new ext4_get_block_dax()
which has those semantics. We could also change ext4_get_block_write()
to honour the 'create' flag, but that might have consequences on other
users that I do not currently understand.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/file.c | 6 +++---
fs/ext4/inode.c | 11 +++++++++++
3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d743d93..51c5008 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2274,6 +2274,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
+int ext4_get_block_dax(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ca5302a..d5219e4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,19 +207,19 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)

static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_fault(vma, vmf, ext4_get_block_write, ext4_end_io_unwritten);
+ return dax_fault(vma, vmf, ext4_get_block_dax, ext4_end_io_unwritten);
}

static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
{
- return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block_write,
+ return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block_dax,
ext4_end_io_unwritten);
}

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_mkwrite(vma, vmf, ext4_get_block_write,
+ return dax_mkwrite(vma, vmf, ext4_get_block_dax,
ext4_end_io_unwritten);
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 40e6c66..75146d1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3025,6 +3025,17 @@ static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
EXT4_GET_BLOCKS_NO_LOCK);
}

+int ext4_get_block_dax(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+ if (create)
+ flags |= EXT4_GET_BLOCKS_CREATE;
+ ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
+ inode->i_ino, create);
+ return _ext4_get_block(inode, iblock, bh_result, flags);
+}
+
static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
{
--
2.1.4
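
For comparison with the new helper above, the existing callbacks differ
only in which flags they pass down to _ext4_get_block(). A rough sketch,
from memory of the ext4 code of this period (the exact flag names, in
particular EXT4_GET_BLOCKS_IO_CREATE_EXT, should be treated as
approximate):

/* Honours 'create', but cannot return unwritten extents. */
int ext4_get_block(struct inode *inode, sector_t iblock,
		   struct buffer_head *bh_result, int create)
{
	return _ext4_get_block(inode, iblock, bh_result,
			       create ? EXT4_GET_BLOCKS_CREATE : 0);
}

/* Returns unwritten extents, but ignores 'create'. */
int ext4_get_block_write(struct inode *inode, sector_t iblock,
			 struct buffer_head *bh_result, int create)
{
	return _ext4_get_block(inode, iblock, bh_result,
			       EXT4_GET_BLOCKS_IO_CREATE_EXT);
}

ext4_get_block_dax() combines the two behaviours: it produces unwritten
extents and only allocates when 'create' is set.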

2015-08-04 20:00:38

by Matthew Wilcox

Subject: [PATCH 05/11] ext4: Start transaction before calling into DAX

From: Matthew Wilcox <[email protected]>

Jan Kara pointed out that in the case where we are writing to a hole,
we can end up with a lock inversion between the page lock and the
journal lock. We can avoid this by starting the transaction in ext4
before calling into DAX. The journal lock nests inside the superblock
pagefault lock, so we also have to duplicate the sb_start_pagefault()
and file_update_time() code from dax_fault(), as XFS does.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/file.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 52 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d5219e4..113837e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,14 +207,63 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)

static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_fault(vma, vmf, ext4_get_block_dax, ext4_end_io_unwritten);
+ int result;
+ handle_t *handle = NULL;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
+
+ if (write) {
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+ EXT4_DATA_TRANS_BLOCKS(sb));
+ }
+
+ if (IS_ERR(handle))
+ result = VM_FAULT_SIGBUS;
+ else
+ result = __dax_fault(vma, vmf, ext4_get_block_dax,
+ ext4_end_io_unwritten);
+
+ if (write) {
+ if (!IS_ERR(handle))
+ ext4_journal_stop(handle);
+ sb_end_pagefault(sb);
+ }
+
+ return result;
}

static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
{
- return dax_pmd_fault(vma, addr, pmd, flags, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ int result;
+ handle_t *handle = NULL;
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ bool write = flags & FAULT_FLAG_WRITE;
+
+ if (write) {
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
+ ext4_chunk_trans_blocks(inode,
+ PMD_SIZE / PAGE_SIZE));
+ }
+
+ if (IS_ERR(handle))
+ result = VM_FAULT_SIGBUS;
+ else
+ result = __dax_pmd_fault(vma, addr, pmd, flags,
+ ext4_get_block_dax, ext4_end_io_unwritten);
+
+ if (write) {
+ if (!IS_ERR(handle))
+ ext4_journal_stop(handle);
+ sb_end_pagefault(sb);
+ }
+
+ return result;
}

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
--
2.1.4
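
The ordering the patch establishes for a write fault, summarised below
(my reading of the diff above, not text from the patch itself):

/*
 * Outermost first:
 *
 *   sb_start_pagefault(sb);
 *   file_update_time(vma->vm_file);
 *   handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE, ...);
 *   __dax_fault(vma, vmf, ext4_get_block_dax, ext4_end_io_unwritten);
 *   ext4_journal_stop(handle);
 *   sb_end_pagefault(sb);
 *
 * Starting the journal handle before calling into DAX means
 * get_block() never has to open a transaction while DAX is holding
 * its own locks, which is what produced the inversion described
 * above.
 */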

2015-08-04 20:01:41

by Matthew Wilcox

Subject: [PATCH 06/11] dax: Fix race between simultaneous faults

From: Matthew Wilcox <[email protected]>

If two threads write-fault on the same hole at the same time, the winner
of the race will return to userspace and complete its store, only to
have the loser overwrite that store with zeroes. Fix this for now by
taking the i_mmap_sem for write instead of read, and by doing so outside
the call to get_block(). Now the loser of the race will see that the
block has already been zeroed, and will not zero it again.

This severely limits our scalability. I have ideas for improving it,
but those can wait for a later patch.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 33 +++++++++++++++++----------------
mm/memory.c | 11 ++++++++---
2 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0a13118..9daf005 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -272,7 +272,6 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
- struct address_space *mapping = inode->i_mapping;
sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
unsigned long vaddr = (unsigned long)vmf->virtual_address;
void *addr;
@@ -280,8 +279,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
pgoff_t size;
int error;

- i_mmap_lock_read(mapping);
-
/*
* Check truncate didn't happen while we were allocating a block.
* If it did, this block may or may not be still allocated to the
@@ -309,8 +306,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
error = vm_insert_mixed(vma, vaddr, pfn);

out:
- i_mmap_unlock_read(mapping);
-
return error;
}

@@ -372,15 +367,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
* from a read fault and we've raced with a truncate
*/
error = -EIO;
- goto unlock_page;
+ goto unlock;
}
+ } else {
+ i_mmap_lock_write(mapping);
}

error = get_block(inode, block, &bh, 0);
if (!error && (bh.b_size < PAGE_SIZE))
error = -EIO; /* fs corruption? */
if (error)
- goto unlock_page;
+ goto unlock;

if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
if (vmf->flags & FAULT_FLAG_WRITE) {
@@ -391,8 +388,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (!error && (bh.b_size < PAGE_SIZE))
error = -EIO;
if (error)
- goto unlock_page;
+ goto unlock;
} else {
+ i_mmap_unlock_write(mapping);
return dax_load_hole(mapping, page, vmf);
}
}
@@ -404,17 +402,15 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
else
clear_user_highpage(new_page, vaddr);
if (error)
- goto unlock_page;
+ goto unlock;
vmf->page = page;
if (!page) {
- i_mmap_lock_read(mapping);
/* Check we didn't race with truncate */
size = (i_size_read(inode) + PAGE_SIZE - 1) >>
PAGE_SHIFT;
if (vmf->pgoff >= size) {
- i_mmap_unlock_read(mapping);
error = -EIO;
- goto out;
+ goto unlock;
}
}
return VM_FAULT_LOCKED;
@@ -450,6 +446,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
}

+ if (!page)
+ i_mmap_unlock_write(mapping);
out:
if (error == -ENOMEM)
return VM_FAULT_OOM | major;
@@ -458,11 +456,14 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
return VM_FAULT_SIGBUS | major;
return VM_FAULT_NOPAGE | major;

- unlock_page:
+ unlock:
if (page) {
unlock_page(page);
page_cache_release(page);
+ } else {
+ i_mmap_unlock_write(mapping);
}
+
goto out;
}
EXPORT_SYMBOL(__dax_fault);
@@ -540,10 +541,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
block = (sector_t)pgoff << (PAGE_SHIFT - blkbits);

bh.b_size = PMD_SIZE;
+ i_mmap_lock_write(mapping);
length = get_block(inode, block, &bh, write);
if (length)
return VM_FAULT_SIGBUS;
- i_mmap_lock_read(mapping);

/*
* If the filesystem isn't willing to tell us the length of a hole,
@@ -607,11 +608,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
}

out:
- i_mmap_unlock_read(mapping);
-
if (buffer_unwritten(&bh))
complete_unwritten(&bh, !(result & VM_FAULT_ERROR));

+ i_mmap_unlock_write(mapping);
+
return result;

fallback:
diff --git a/mm/memory.c b/mm/memory.c
index b94b587..5f46350 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2426,11 +2426,16 @@ void unmap_mapping_range(struct address_space *mapping,
details.last_index = ULONG_MAX;


- /* DAX uses i_mmap_lock to serialise file truncate vs page fault */
- i_mmap_lock_write(mapping);
+ /*
+ * DAX already holds i_mmap_lock to serialise file truncate vs
+ * page fault and page fault vs page fault.
+ */
+ if (!IS_DAX(mapping->host))
+ i_mmap_lock_write(mapping);
if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
unmap_mapping_range_tree(&mapping->i_mmap, &details);
- i_mmap_unlock_write(mapping);
+ if (!IS_DAX(mapping->host))
+ i_mmap_unlock_write(mapping);
}
EXPORT_SYMBOL(unmap_mapping_range);

--
2.1.4
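
The race being closed, spelled out as a timeline (my rendering of the
description above):

/*
 * Both threads write-fault on the same hole:
 *
 *   Thread A                           Thread B
 *   --------                           --------
 *   fault: allocates the block,
 *     zeroes it, maps the PFN,
 *     returns to userspace
 *   userspace stores data
 *                                      fault: still sees the block as
 *                                      freshly allocated, zeroes it
 *                                      -> thread A's store is lost
 *
 * Holding i_mmap_sem for write across the call to get_block()
 * serialises the two faults, so the loser sees a block that has
 * already been zeroed and does not zero it again.
 */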

2015-08-04 19:58:41

by Matthew Wilcox

Subject: [PATCH 07/11] thp: Decrement refcount on huge zero page if it is split

From: "Kirill A. Shutemov" <[email protected]>

The DAX code neglected to put the refcount on the huge zero page when
splitting a huge zero PMD. We must also call the MMU notifiers on splits.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/huge_memory.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5ffdcaa..326d17e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2951,7 +2951,9 @@ again:
if (unlikely(!pmd_trans_huge(*pmd)))
goto unlock;
if (vma_is_dax(vma)) {
- pmdp_huge_clear_flush(vma, haddr, pmd);
+ pmd_t _pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
+ if (is_huge_zero_pmd(_pmd))
+ put_huge_zero_page();
} else if (is_huge_zero_pmd(*pmd)) {
__split_huge_zero_page_pmd(vma, haddr, pmd);
} else {
--
2.1.4

2015-08-04 19:58:59

by Matthew Wilcox

Subject: [PATCH 08/11] thp: Fix zap_huge_pmd() for DAX

From: "Kirill A. Shutemov" <[email protected]>

The original DAX code assumed that pgtable_t was a pointer, which isn't
true on all architectures. Restructure the code to not rely on that
assumption.

Signed-off-by: Kirill A. Shutemov <[email protected]>
[further fixes from Matthew integrated into this patch]
Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/huge_memory.c | 71 +++++++++++++++++++++++++-------------------------------
1 file changed, 31 insertions(+), 40 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 326d17e..b51bed1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1426,50 +1426,41 @@ out:
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
+ pmd_t orig_pmd;
spinlock_t *ptl;
- int ret = 0;

- if (__pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- pgtable_t pgtable;
- pmd_t orig_pmd;
- /*
- * For architectures like ppc64 we look at deposited pgtable
- * when calling pmdp_huge_get_and_clear. So do the
- * pgtable_trans_huge_withdraw after finishing pmdp related
- * operations.
- */
- orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
- tlb->fullmm);
- tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- if (vma_is_dax(vma)) {
- if (is_huge_zero_pmd(orig_pmd)) {
- pgtable = NULL;
- } else {
- spin_unlock(ptl);
- return 1;
- }
- } else {
- pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);
- }
- if (is_huge_zero_pmd(orig_pmd)) {
- atomic_long_dec(&tlb->mm->nr_ptes);
- spin_unlock(ptl);
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) != 1)
+ return 0;
+ /*
+ * For architectures like ppc64 we look at deposited pgtable
+ * when calling pmdp_huge_get_and_clear. So do the
+ * pgtable_trans_huge_withdraw after finishing pmdp related
+ * operations.
+ */
+ orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
+ tlb->fullmm);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ if (vma_is_dax(vma)) {
+ spin_unlock(ptl);
+ if (is_huge_zero_pmd(orig_pmd))
put_huge_zero_page();
- } else {
- struct page *page = pmd_page(orig_pmd);
- page_remove_rmap(page);
- VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- VM_BUG_ON_PAGE(!PageHead(page), page);
- atomic_long_dec(&tlb->mm->nr_ptes);
- spin_unlock(ptl);
- tlb_remove_page(tlb, page);
- }
- if (pgtable)
- pte_free(tlb->mm, pgtable);
- ret = 1;
+ } else if (is_huge_zero_pmd(orig_pmd)) {
+ pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
+ atomic_long_dec(&tlb->mm->nr_ptes);
+ spin_unlock(ptl);
+ put_huge_zero_page();
+ } else {
+ struct page *page = pmd_page(orig_pmd);
+ page_remove_rmap(page);
+ VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
+ atomic_long_dec(&tlb->mm->nr_ptes);
+ spin_unlock(ptl);
+ tlb_remove_page(tlb, page);
}
- return ret;
+ return 1;
}

int move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
--
2.1.4

2015-08-04 19:58:58

by Matthew Wilcox

Subject: [PATCH 09/11] dax: Don't use set_huge_zero_page()

From: "Kirill A. Shutemov" <[email protected]>

This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 18 ++++++++++++------
include/linux/huge_mm.h | 3 ---
mm/huge_memory.c | 2 +-
3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9daf005..bf9a22b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -572,18 +572,24 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);

if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
- bool set;
spinlock_t *ptl;
- struct mm_struct *mm = vma->vm_mm;
+ pmd_t entry;
struct page *zero_page = get_huge_zero_page();
+
if (unlikely(!zero_page))
goto fallback;

- ptl = pmd_lock(mm, pmd);
- set = set_huge_zero_page(NULL, mm, vma, pmd_addr, pmd,
- zero_page);
- spin_unlock(ptl);
+ ptl = pmd_lock(vma->vm_mm, pmd);
+ if (!pmd_none(*pmd)) {
+ spin_unlock(ptl);
+ goto fallback;
+ }
+
+ entry = mk_pmd(zero_page, vma->vm_page_prot);
+ entry = pmd_mkhuge(entry);
+ set_pmd_at(vma->vm_mm, pmd_addr, pmd, entry);
result = VM_FAULT_NOPAGE;
+ spin_unlock(ptl);
} else {
sector = bh.b_blocknr << (blkbits - 9);
length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f9b612f..ecb080d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -163,9 +163,6 @@ static inline bool is_huge_zero_pmd(pmd_t pmd)
}

struct page *get_huge_zero_page(void);
-bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long haddr,
- pmd_t *pmd, struct page *zero_page);

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b51bed1..63c8fe5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -767,7 +767,7 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
}

/* Caller must hold page table lock. */
-bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
struct page *zero_page)
{
--
2.1.4

2015-08-04 19:58:56

by Matthew Wilcox

Subject: [PATCH 10/11] dax: Ensure that zero pages are removed from other processes

From: Matthew Wilcox <[email protected]>

If the first access to a huge page was a store, there would be no existing
zero pmd in this process's page tables. There could be a zero pmd in
another process's page tables, if it had done a load. We can detect this
case by noticing that the buffer_head returned from the filesystem is
new (buffer_new() is true), and then ensure that other processes mapping
this huge page have their page tables flushed.

Reported-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index bf9a22b..233e61e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -568,7 +568,11 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if ((pgoff | PG_PMD_COLOUR) >= size)
goto fallback;

- if (is_huge_zero_pmd(*pmd))
+ /*
+ * If we allocated new storage, make sure no process has any
+ * zero pages covering this hole
+ */
+ if (buffer_new(&bh))
unmap_mapping_range(mapping, pgoff << PAGE_SHIFT, PMD_SIZE, 0);

if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
--
2.1.4

2015-08-04 20:00:56

by Matthew Wilcox

Subject: [PATCH 11/11] dax: Use linear_page_index()

From: Matthew Wilcox <[email protected]>

I was basically open-coding it (thanks to copying code from do_fault()
which probably also needs to be fixed).

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 233e61e..b6769ce 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -529,7 +529,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if ((pmd_addr + PMD_SIZE) > vma->vm_end)
return VM_FAULT_FALLBACK;

- pgoff = ((pmd_addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ pgoff = linear_page_index(vma, pmd_addr);
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (pgoff >= size)
return VM_FAULT_SIGBUS;
--
2.1.4
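
For reference, linear_page_index() computes the same value as the
open-coded expression being removed; roughly (hugetlb special-casing
omitted):

static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
					unsigned long address)
{
	pgoff_t pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
	return pgoff + vma->vm_pgoff;
}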

2015-08-05 02:04:20

by Dave Chinner

Subject: Re: [PATCH 04/11] ext4: Add ext4_get_block_dax()

On Tue, Aug 04, 2015 at 03:57:58PM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> DAX wants different semantics from any currently-existing ext4
> get_block callback. Unlike ext4_get_block_write(), it needs to honour
> the 'create' flag, and unlike ext4_get_block(), it needs to be able
> to return unwritten extents. So introduce a new ext4_get_block_dax()
> which has those semantics. We could also change ext4_get_block_write()
> to honour the 'create' flag, but that might have consequences on other
> users that I do not currently understand.
>
> Signed-off-by: Matthew Wilcox <[email protected]>

Doesn't this make the first patch in the series redundant?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-08-05 11:43:13

by Kirill A. Shutemov

Subject: Re: [PATCH 06/11] dax: Fix race between simultaneous faults

On Tue, Aug 04, 2015 at 03:58:00PM -0400, Matthew Wilcox wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index b94b587..5f46350 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2426,11 +2426,16 @@ void unmap_mapping_range(struct address_space *mapping,
> details.last_index = ULONG_MAX;
>
>
> - /* DAX uses i_mmap_lock to serialise file truncate vs page fault */
> - i_mmap_lock_write(mapping);
> + /*
> + * DAX already holds i_mmap_lock to serialise file truncate vs
> + * page fault and page fault vs page fault.
> + */
> + if (!IS_DAX(mapping->host))
> + i_mmap_lock_write(mapping);
> if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap)))
> unmap_mapping_range_tree(&mapping->i_mmap, &details);
> - i_mmap_unlock_write(mapping);
> + if (!IS_DAX(mapping->host))
> + i_mmap_unlock_write(mapping);
> }
> EXPORT_SYMBOL(unmap_mapping_range);

Huh? What protects mapping->i_mmap here? I don't see anything up the
stack taking the lock.

--
Kirill A. Shutemov

2015-08-05 15:19:10

by Matthew Wilcox

Subject: Re: [PATCH 04/11] ext4: Add ext4_get_block_dax()

On Wed, Aug 05, 2015 at 12:03:57PM +1000, Dave Chinner wrote:
> On Tue, Aug 04, 2015 at 03:57:58PM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <[email protected]>
> >
> > DAX wants different semantics from any currently-existing ext4
> > get_block callback. Unlike ext4_get_block_write(), it needs to honour
> > the 'create' flag, and unlike ext4_get_block(), it needs to be able
> > to return unwritten extents. So introduce a new ext4_get_block_dax()
> > which has those semantics. We could also change ext4_get_block_write()
> > to honour the 'create' flag, but that might have consequences on other
> > users that I do not currently understand.
> >
> > Signed-off-by: Matthew Wilcox <[email protected]>
>
> Doesn't this make the first patch in the series redundant?

As I explained in the cover letter, patch 1 already went to Ted. It might
be on its way in, and it might not. Rather than sending a patch that
applies to current mainline and forcing someone to fix up a conflict
later, it's better to resend the patch as part of this series; our tools
will do the right thing no matter which order the patches go into Linus'
tree.