2014-02-27 19:54:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

Here's new version of faultaround patchset. It took a while to tune it and
collect performance data.

First patch adds new callback ->map_pages to vm_operations_struct.

->map_pages() is called when VM asks to map easy accessible pages.
Filesystem should find and map pages associated with offsets from "pgoff"
till "max_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking,
filesystem should skip it. Filesystem should use do_set_pte() to setup
page table entry. Pointer to entry associated with offset "pgoff" is
passed in "pte" field in vm_fault structure. Pointers to entries for other
offsets should be calculated relative to "pte".

Currently VM use ->map_pages only on read page fault path. We try to map
FAULT_AROUND_PAGES a time. FAULT_AROUND_PAGES is 16 for now. Performance
data for different FAULT_AROUND_ORDER is below.

TODO:
- implement ->map_pages() for shmem/tmpfs;
- modify get_user_pages() to be able to use ->map_pages() and implement
mmap(MAP_POPULATE|MAP_NONBLOCK) on top.

Please consider applying.

=========================================================================
Tested on 4-socket machine (120 threads) with 128GiB of RAM.

Few real-world workloads. The sweet spot for FAULT_AROUND_ORDER here is
somewhere between 3 and 5. Let's say 4 :)

Linux build (make -j60)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 283,301,572 247,151,987 212,215,789 204,772,882 199,568,944 194,703,779 193,381,485
time, seconds 151.227629483 153.920996480 151.356125472 150.863792049 150.879207877 151.150764954 151.450962358
Linux rebuild (make -j60)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 5,396,854 4,148,444 2,855,286 2,577,282 2,361,957 2,169,573 2,112,643
time, seconds 27.404543757 27.559725591 27.030057426 26.855045126 26.678618635 26.974523490 26.761320095
Git test suite (make -j60 test)
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
minor-faults 129,591,823 99,200,751 66,106,718 57,606,410 51,510,808 45,776,813 44,085,515
time, seconds 66.087215026 64.784546905 64.401156567 65.282708668 66.034016829 66.793780811 67.237810413

Two synthetic tests: access every word in file in sequential/random order.
It doesn't improve much after FAULT_AROUND_ORDER == 4.

Sequential access 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
1 thread
minor-faults 4,195,437 2,098,275 525,068 262,251 131,170 32,856 8,282
time, seconds 7.250461742 6.461711074 5.493859139 5.488488147 5.707213983 5.898510832 5.109232856
8 threads
minor-faults 33,557,540 16,892,728 4,515,848 2,366,999 1,423,382 442,732 142,339
time, seconds 16.649304881 9.312555263 6.612490639 6.394316732 6.669827501 6.75078944 6.371900528
32 threads
minor-faults 134,228,222 67,526,810 17,725,386 9,716,537 4,763,731 1,668,921 537,200
time, seconds 49.164430543 29.712060103 12.938649729 10.175151004 11.840094583 9.594081325 9.928461797
60 threads
minor-faults 251,687,988 126,146,952 32,919,406 18,208,804 10,458,947 2,733,907 928,217
time, seconds 86.260656897 49.626551828 22.335007632 17.608243696 16.523119035 16.339489186 16.326390902
120 threads
minor-faults 503,352,863 252,939,677 67,039,168 35,191,827 19,170,091 4,688,357 1,471,862
time, seconds 124.589206333 79.757867787 39.508707872 32.167281632 29.972989292 28.729834575 28.042251622
Random access 1GiB file
1 thread
minor-faults 262,636 132,743 34,369 17,299 8,527 3,451 1,222
time, seconds 15.351890914 16.613802482 16.569227308 15.179220992 16.557356122 16.578247824 15.365266994
8 threads
minor-faults 2,098,948 1,061,871 273,690 154,501 87,110 25,663 7,384
time, seconds 15.040026343 15.096933500 14.474757288 14.289129964 14.411537468 14.296316837 14.395635804
32 threads
minor-faults 8,390,734 4,231,023 1,054,432 528,847 269,242 97,746 26,881
time, seconds 20.430433109 21.585235358 22.115062928 14.872878951 14.880856305 14.883370649 14.821261690
60 threads
minor-faults 15,733,258 7,892,809 1,973,393 988,266 594,789 164,994 51,691
time, seconds 26.577302548 25.692397770 18.728863715 20.153026398 21.619101933 17.745086260 17.613215273
120 threads
minor-faults 31,471,111 15,816,616 3,959,209 1,978,685 1,008,299 264,635 96,010
time, seconds 41.835322703 40.459786095 36.085306105 35.313894834 35.814445675 36.552633793 34.289210594

Worst case scenario: we touch one page every 2M, so faultaround is useless
here. Just to demonstrate how much overhead we add.

Touch only one page in page table in 16GiB file
FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
1 thread
minor-faults 8,372 8,324 8,270 8,260 8,249 8,239 8,237
time, seconds 0.039892712 0.045369149 0.051846126 0.063681685 0.079095975 0.17652406 0.541213386
8 threads
minor-faults 65,731 65,681 65,628 65,620 65,608 65,599 65,596
time, seconds 0.124159196 0.488600638 0.156854426 0.191901957 0.242631486 0.543569456 1.677303984
32 threads
minor-faults 262,388 262,341 262,285 262,276 262,266 262,257 263,183
time, seconds 0.452421421 0.488600638 0.565020946 0.648229739 0.789850823 1.651584361 5.000361559
60 threads
minor-faults 491,822 491,792 491,723 491,711 491,701 491,691 491,825
time, seconds 0.763288616 0.869620515 0.980727360 1.161732354 1.466915814 3.04041448 9.308612938
120 threads
minor-faults 983,466 983,655 983,366 983,372 983,363 984,083 984,164
time, seconds 1.595846553 1.667902182 2.008959376 2.425380942 2.941368804 5.977807890 18.401846125

Kirill A. Shutemov (2):
mm: introduce vm_ops->map_pages()
mm: implement ->map_pages for page cache

Documentation/filesystems/Locking | 10 ++++++
fs/9p/vfs_file.c | 2 ++
fs/btrfs/file.c | 1 +
fs/cifs/file.c | 1 +
fs/ext4/file.c | 1 +
fs/f2fs/file.c | 1 +
fs/fuse/file.c | 1 +
fs/gfs2/file.c | 1 +
fs/nfs/file.c | 1 +
fs/nilfs2/file.c | 1 +
fs/ubifs/file.c | 1 +
fs/xfs/xfs_file.c | 1 +
include/linux/mm.h | 9 +++++
mm/filemap.c | 72 +++++++++++++++++++++++++++++++++++++++
mm/memory.c | 67 ++++++++++++++++++++++++++++++++++--
mm/nommu.c | 6 ++++
16 files changed, 173 insertions(+), 3 deletions(-)

--
1.9.0


2014-02-27 19:53:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

The patch introduces new vm_ops callback ->map_pages() and uses it for
mapping easy accessible pages around fault address.

On read page fault, if filesystem provides ->map_pages(), we try to map
up to FAULT_AROUND_PAGES pages around page fault address in hope to
reduce number of minor page faults.

We call ->map_pages first and use ->fault() as fallback if page by the
offset is not ready to be mapped (cold page cache or something).

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/filesystems/Locking | 10 ++++++
include/linux/mm.h | 8 +++++
mm/memory.c | 67 +++++++++++++++++++++++++++++++++++++--
3 files changed, 82 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 5b0c083d7c0e..767930f04a12 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -525,6 +525,7 @@ locking rules:
open: yes
close: yes
fault: yes can return with page locked
+map_pages: yes
page_mkwrite: yes can return with page locked
access: yes

@@ -536,6 +537,15 @@ the page, then ensure it is not already truncated (the page lock will block
subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
locked. The VM will unlock the page.

+ ->map_pages() is called when VM asks to map easy accessible pages.
+Filesystem should find and map pages associated with offsets from "pgoff"
+till "max_pgoff". ->map_pages() is called with page table locked and must
+not block. If it's not possible to reach a page without blocking,
+filesystem should skip it. Filesystem should use do_set_pte() to setup
+page table entry. Pointer to entry associated with offset "pgoff" is
+passed in "pte" field in vm_fault structure. Pointers to entries for other
+offsets should be calculated relative to "pte".
+
->page_mkwrite() is called when a previously read-only pte is
about to become writeable. The filesystem again must ensure that there are
no truncate/invalidate races, and then return with the page locked. If
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..aed92cb17127 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -210,6 +210,10 @@ struct vm_fault {
* is set (which is also implied by
* VM_FAULT_ERROR).
*/
+ /* for ->map_pages() only */
+ pgoff_t max_pgoff; /* map pages for offset from pgoff till
+ * max_pgoff inclusive */
+ pte_t *pte; /* pte entry associated with ->pgoff */
};

/*
@@ -221,6 +225,7 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ void (*map_pages)(struct vm_area_struct *vma, struct vm_fault *vmf);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -571,6 +576,9 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
pte = pte_mkwrite(pte);
return pte;
}
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+ struct page *page, pte_t *pte, bool write, bool anon);
#endif

/*
diff --git a/mm/memory.c b/mm/memory.c
index 7f52c46ef1e1..3f17a60e817f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3318,7 +3318,8 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
return ret;
}

-static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
+
+void do_set_pte(struct vm_area_struct *vma, unsigned long address,
struct page *page, pte_t *pte, bool write, bool anon)
{
pte_t entry;
@@ -3342,6 +3343,52 @@ static void do_set_pte(struct vm_area_struct *vma, unsigned long address,
update_mmu_cache(vma, address, pte);
}

+#define FAULT_AROUND_ORDER 4
+#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
+#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
+
+static void do_fault_around(struct vm_area_struct *vma, unsigned long address,
+ pte_t *pte, pgoff_t pgoff, unsigned int flags)
+{
+ unsigned long start_addr;
+ pgoff_t max_pgoff;
+ struct vm_fault vmf;
+ int off;
+
+ BUILD_BUG_ON(FAULT_AROUND_PAGES > PTRS_PER_PTE);
+
+ start_addr = max(address & FAULT_AROUND_MASK, vma->vm_start);
+ off = ((address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+ pte -= off;
+ pgoff -= off;
+
+ /*
+ * max_pgoff is either end of page table or end of vma
+ * or FAULT_AROUND_PAGES from pgoff, depending what is neast.
+ */
+ max_pgoff = pgoff - ((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+ PTRS_PER_PTE - 1;
+ max_pgoff = min3(max_pgoff, vma_pages(vma) + vma->vm_pgoff - 1,
+ pgoff + FAULT_AROUND_PAGES - 1);
+
+ /* Check if it makes any sense to call ->map_pages */
+ while (!pte_none(*pte)) {
+ if (++pgoff > max_pgoff)
+ return;
+ start_addr += PAGE_SIZE;
+ if (start_addr >= vma->vm_end)
+ return;
+ pte++;
+ }
+
+ vmf.virtual_address = (void __user *) start_addr;
+ vmf.pte = pte;
+ vmf.pgoff = pgoff;
+ vmf.max_pgoff = max_pgoff;
+ vmf.flags = flags;
+ vma->vm_ops->map_pages(vma, &vmf);
+}
+
static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
@@ -3349,7 +3396,20 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *fault_page;
spinlock_t *ptl;
pte_t *pte;
- int ret;
+ int ret = 0;
+
+ /*
+ * Let's call ->map_pages() first and use ->fault() as fallback
+ * if page by the offset is not ready to be mapped (cold cache or
+ * something).
+ */
+ if (vma->vm_ops->map_pages) {
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ do_fault_around(vma, address, pte, pgoff, flags);
+ if (!pte_same(*pte, orig_pte))
+ goto unlock_out;
+ pte_unmap_unlock(pte, ptl);
+ }

ret = __do_fault(vma, address, pgoff, flags, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -3363,8 +3423,9 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return ret;
}
do_set_pte(vma, address, fault_page, pte, false, false);
- pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
+unlock_out:
+ pte_unmap_unlock(pte, ptl);
return ret;
}

--
1.9.0

2014-02-27 19:54:16

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCHv3 2/2] mm: implement ->map_pages for page cache

filemap_map_pages() is generic implementation of ->map_pages() for
filesystems who uses page cache.

It should be safe to use filemap_map_pages() for ->map_pages() if
filesystem use filemap_fault() for ->fault().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/9p/vfs_file.c | 2 ++
fs/btrfs/file.c | 1 +
fs/cifs/file.c | 1 +
fs/ext4/file.c | 1 +
fs/f2fs/file.c | 1 +
fs/fuse/file.c | 1 +
fs/gfs2/file.c | 1 +
fs/nfs/file.c | 1 +
fs/nilfs2/file.c | 1 +
fs/ubifs/file.c | 1 +
fs/xfs/xfs_file.c | 1 +
include/linux/mm.h | 1 +
mm/filemap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/nommu.c | 6 +++++
14 files changed, 91 insertions(+)

diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index a16b0ff497ca..d8223209d4b1 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)

static const struct vm_operations_struct v9fs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = v9fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
@@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
.close = v9fs_mmap_vm_close,
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = v9fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b8672f09..d1f0415bbfb1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1993,6 +1993,7 @@ out:

static const struct vm_operations_struct btrfs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = btrfs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 755584684f6c..6d081de57fdb 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static struct vm_operations_struct cifs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = cifs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1a5073959f32..46e78f8a133f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,

static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 0dfcef53a6ed..129a3bdb05ca 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -84,6 +84,7 @@ out:

static const struct vm_operations_struct f2fs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = f2fs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 77bcc303c3ae..da99a76668f8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
static const struct vm_operations_struct fuse_file_vm_ops = {
.close = fuse_vma_close,
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = fuse_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index efc078f0ee4e..2739f3c3bb8f 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -494,6 +494,7 @@ out:

static const struct vm_operations_struct gfs2_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = gfs2_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index 5bb790a69c71..284ca901fe16 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -617,6 +617,7 @@ out:

static const struct vm_operations_struct nfs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = nfs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 08fdb77852ac..f3a82fbcae02 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

static const struct vm_operations_struct nilfs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = nilfs_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 123c79b7261e..4f34dbae823d 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1538,6 +1538,7 @@ out_unlock:

static const struct vm_operations_struct ubifs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = ubifs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 64b48eade91d..b2be204e16ca 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {

static const struct vm_operations_struct xfs_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = xfs_vm_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/include/linux/mm.h b/include/linux/mm.h
index aed92cb17127..7c4bfb725e63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);

/* mm/page-writeback.c */
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a13f6ac5421..1bc12a96060d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -33,6 +33,7 @@
#include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
#include <linux/memcontrol.h>
#include <linux/cleancache.h>
+#include <linux/rmap.h>
#include "internal.h"

#define CREATE_TRACE_POINTS
@@ -1726,6 +1727,76 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);

+void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct radix_tree_iter iter;
+ void **slot;
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ loff_t size;
+ struct page *page;
+ unsigned long address = (unsigned long) vmf->virtual_address;
+ unsigned long addr;
+ pte_t *pte;
+
+ rcu_read_lock();
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
+ if (iter.index > vmf->max_pgoff)
+ break;
+repeat:
+ page = radix_tree_deref_slot(slot);
+ if (radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page))
+ break;
+ else
+ goto next;
+ }
+
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *slot)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+
+ if (!PageUptodate(page) ||
+ PageReadahead(page) ||
+ PageHWPoison(page))
+ goto skip;
+ if (!trylock_page(page))
+ goto skip;
+
+ if (page->mapping != mapping || !PageUptodate(page))
+ goto unlock;
+
+ size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
+ if (page->index >= size >> PAGE_CACHE_SHIFT)
+ goto unlock;
+
+ pte = vmf->pte + page->index - vmf->pgoff;
+ if (!pte_none(*pte))
+ goto unlock;
+
+ if (file->f_ra.mmap_miss > 0)
+ file->f_ra.mmap_miss--;
+ addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
+ do_set_pte(vma, addr, page, pte, false, false);
+ unlock_page(page);
+ goto next;
+unlock:
+ unlock_page(page);
+skip:
+ page_cache_release(page);
+next:
+ if (page->index == vmf->max_pgoff)
+ break;
+ }
+ rcu_read_unlock();
+}
+EXPORT_SYMBOL(filemap_map_pages);
+
int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct page *page = vmf->page;
@@ -1755,6 +1826,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);

const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .map_pages = filemap_map_pages,
.page_mkwrite = filemap_page_mkwrite,
.remap_pages = generic_file_remap_pages,
};
diff --git a/mm/nommu.c b/mm/nommu.c
index 8740213b1647..ce401e37e29f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1985,6 +1985,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
EXPORT_SYMBOL(filemap_fault);

+void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ BUG();
+}
+EXPORT_SYMBOL(filemap_map_pages);
+
int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
unsigned long size, pgoff_t pgoff)
{
--
1.9.0

2014-02-27 21:28:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
<[email protected]> wrote:
> Here's new version of faultaround patchset. It took a while to tune it and
> collect performance data.

Andrew, mind taking this into -mm with my acks? It's based on top of
Kirill's cleanup patches that I think are also in your tree.

Kirill - no complaints from me. I do have two minor issues that you
might satisfy, but I think the patch is fine as-is.

The issues/questions are:

(a) could you test this on a couple of different architectures? Even
if you just have access to intel machines, testing it across a couple
of generations of microarchitectures would be good. The reason I say
that is that from my profiles, it *looks* like the page fault costs
are relatively higher on Ivybridge/Haswell than on some earlier
uarchs.

Now, I may well be wrong about the uarch issue, and maybe I just
didn't notice it as much before. I've stared at a lot of profiles over
the years, though, and the page fault cost seems to stand out much
more than it used to. And don't get me wrong - it might not be because
Ivy/Haswell is any worse, it might just be that exception performance
hasn't improved together with some other improvements.

(b) I suspect we should try to strongly discourage filesystems from
actually using map_pages unless they use the standard
filemap_map_pages function as-is. Even with the fairly clean
interface, and forcing people to use "do_set_pte()", I think the docs
might want to try to more explicitly discourage people from using this
to do their own hacks..

Hmm? Either way, even without those questions answered, I'm happy with
how your patches look.

Linus

2014-02-27 21:47:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache

On Thu, 27 Feb 2014 21:53:47 +0200 "Kirill A. Shutemov" <[email protected]> wrote:

> filemap_map_pages() is generic implementation of ->map_pages() for
> filesystems who uses page cache.
>
> It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault().
>
> ...
>
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
>
> /* generic vm_area_ops exported for stackable file systems */
> extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
> extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>
> /* mm/page-writeback.c */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7a13f6ac5421..1bc12a96060d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
> #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> #include <linux/memcontrol.h>
> #include <linux/cleancache.h>
> +#include <linux/rmap.h>
> #include "internal.h"
>
> #define CREATE_TRACE_POINTS
> @@ -1726,6 +1727,76 @@ page_not_uptodate:
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct radix_tree_iter iter;
> + void **slot;
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + loff_t size;
> + struct page *page;
> + unsigned long address = (unsigned long) vmf->virtual_address;
> + unsigned long addr;
> + pte_t *pte;
> +
> + rcu_read_lock();
> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> + if (iter.index > vmf->max_pgoff)
> + break;
> +repeat:
> + page = radix_tree_deref_slot(slot);
> + if (radix_tree_exception(page)) {
> + if (radix_tree_deref_retry(page))
> + break;
> + else
> + goto next;
> + }
> +
> + if (!page_cache_get_speculative(page))
> + goto repeat;
> +
> + /* Has the page moved? */
> + if (unlikely(page != *slot)) {
> + page_cache_release(page);
> + goto repeat;
> + }
> +
> + if (!PageUptodate(page) ||
> + PageReadahead(page) ||
> + PageHWPoison(page))
> + goto skip;
> + if (!trylock_page(page))
> + goto skip;
> +
> + if (page->mapping != mapping || !PageUptodate(page))
> + goto unlock;
> +
> + size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;

Could perhaps use round_up here.

> + if (page->index >= size >> PAGE_CACHE_SHIFT)
> + goto unlock;
> + pte = vmf->pte + page->index - vmf->pgoff;
> + if (!pte_none(*pte))
> + goto unlock;
> +
> + if (file->f_ra.mmap_miss > 0)
> + file->f_ra.mmap_miss--;

I'm wondering about this. We treat every speculative faultahead as a
hit, whether or not userspace will actually touch that page.

What's the effect of this? To cause the amount of physical readahead
to increase? But if userspace is in fact touching the file in a sparse
random fashion, that is exactly the wrong thing to do?

> + addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> + do_set_pte(vma, addr, page, pte, false, false);
> + unlock_page(page);
> + goto next;
> +unlock:
> + unlock_page(page);
> +skip:
> + page_cache_release(page);
> +next:
> + if (page->index == vmf->max_pgoff)
> + break;
> + }
> + rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +

2014-02-27 22:00:03

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On 02/27/2014 11:53 AM, Kirill A. Shutemov wrote:
> +#define FAULT_AROUND_ORDER 4
> +#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
> +#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)

Looking at the performance data made me think of this: do we really want
this to be static? It seems like the kind of thing that will cause a
regression _somewhere_.

Also, the folks with larger base bage sizes probably don't want a
FAULT_AROUND_ORDER=4. That's 1MB of fault-around for ppc64, for example.

2014-02-27 22:06:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On Thu, Feb 27, 2014 at 1:59 PM, Dave Hansen
<[email protected]> wrote:
>
> Also, the folks with larger base bage sizes probably don't want a
> FAULT_AROUND_ORDER=4. That's 1MB of fault-around for ppc64, for example.

Actually, I'd expect that they won't mind, because there's no real
extra cost (the costs are indepenent of page size).

For small mappings the mapping size itself will avoid the
fault-around, and for big mappings they'll get the reduced page
faults.

They chose 64kB pages for a reason (although arguably that reason is
"our TLB fills are horrible crap"), they'll be fine with that "let's
try to map a few pages around us".

That said, making it runtime configurable for testing is likely a good
thing anyway, with some hardcoded maximum fault-around size for
sanity.

Linus

2014-02-27 22:08:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On Thu, 27 Feb 2014 13:59:59 -0800 Dave Hansen <[email protected]> wrote:

> On 02/27/2014 11:53 AM, Kirill A. Shutemov wrote:
> > +#define FAULT_AROUND_ORDER 4
> > +#define FAULT_AROUND_PAGES (1UL << FAULT_AROUND_ORDER)
> > +#define FAULT_AROUND_MASK ~((1UL << (PAGE_SHIFT + FAULT_AROUND_ORDER)) - 1)
>
> Looking at the performance data made me think of this: do we really want
> this to be static? It seems like the kind of thing that will cause a
> regression _somewhere_.

Yes, allowing people to tweak it at runtime would improve testability a
lot.

I don't think we want to let yet another tunable out into the wild
unless we really need to - perhaps a not-for-mainline add-on patch, or
something in debugfs so we have the option of taking it away later.

> Also, the folks with larger base bage sizes probably don't want a
> FAULT_AROUND_ORDER=4. That's 1MB of fault-around for ppc64, for example.

Yup, we don't want the same app to trigger dramatically different
kernel behaviour when it is moved from x86 to ppc.

2014-02-27 22:34:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On 02/27/2014 02:06 PM, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 1:59 PM, Dave Hansen
> <[email protected]> wrote:
>>
>> Also, the folks with larger base bage sizes probably don't want a
>> FAULT_AROUND_ORDER=4. That's 1MB of fault-around for ppc64, for example.
>
> Actually, I'd expect that they won't mind, because there's no real
> extra cost (the costs are indepenent of page size).

The question is really whether or not we ever access the mapping that we
faulted around, though. If we never access it, then the cost (however
small it was) is a loss. That's the mechanism that I'd expect causes
Kirill's numbers to go up after they hit their minimum at ~order-4.

> For small mappings the mapping size itself will avoid the
> fault-around, and for big mappings they'll get the reduced page
> faults.

Kirill's git test suite runs did show that it _can_ hurt in some cases.
Orders 1 to 4 were improvements. But, Order-5 was even with the
baseline, and orders 7 and 9 got a bit worse:

> Git test suite (make -j60 test)
> FAULT_AROUND_ORDER Baseline 1 3 4 5 7 9
> minor-faults 129,591,823 99,200,751 66,106,718 57,606,410 51,510,808 45,776,813 44,085,515
> time, seconds 66.087215026 64.784546905 64.401156567 65.282708668 66.034016829 66.793780811 67.237810413

2014-02-28 00:11:00

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Here's new version of faultaround patchset. It took a while to tune it and
> > collect performance data.
>
> Andrew, mind taking this into -mm with my acks? It's based on top of
> Kirill's cleanup patches that I think are also in your tree.
>
> Kirill - no complaints from me. I do have two minor issues that you
> might satisfy, but I think the patch is fine as-is.
>
> The issues/questions are:
>
> (a) could you test this on a couple of different architectures? Even
> if you just have access to intel machines, testing it across a couple
> of generations of microarchitectures would be good. The reason I say
> that is that from my profiles, it *looks* like the page fault costs
> are relatively higher on Ivybridge/Haswell than on some earlier
> uarchs.

These numbers were from Ivy Bridge.
I'll bring some numbers for Westmere and Haswell.

> (b) I suspect we should try to strongly discourage filesystems from
> actually using map_pages unless they use the standard
> filemap_map_pages function as-is. Even with the fairly clean
> interface, and forcing people to use "do_set_pte()", I think the docs
> might want to try to more explicitly discourage people from using this
> to do their own hacks..

We would need ->map_pages() at least for shmem/tmpfs. It should be
benefitial there.

Also Matthew noticed that some drivers do ugly hacks like fault in whole
VMA on first page fault. IIUC, it's for performance reasons. See
psbfb_vm_fault() or ttm_bo_vm_fault().

I thought it could be reasonable to have ->map_pages() there and do VMA
population get_user_pages() on mmap() instead.

What do you think?

--
Kirill A. Shutemov

2014-02-28 00:20:04

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On Thu, Feb 27, 2014 at 02:34:55PM -0800, Dave Hansen wrote:
> Kirill's git test suite runs did show that it _can_ hurt in some cases.

And see last use-case for how much it can hurt. :)

It shouldn't differs much for the same *number* of pages between [u]archs
unless setup of the pte is significantly more expensive or page fault is
faster.

But of course, I can move FAULT_AROUND_PAGES to arch/x86/ and let
architecture mantainers to decide if they want the feature. ;)

--
Kirill A. Shutemov

2014-02-28 00:31:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache

On Thu, Feb 27, 2014 at 01:47:11PM -0800, Andrew Morton wrote:
> On Thu, 27 Feb 2014 21:53:47 +0200 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > filemap_map_pages() is generic implementation of ->map_pages() for
> > filesystems who uses page cache.
> >
> > It should be safe to use filemap_map_pages() for ->map_pages() if
> > filesystem use filemap_fault() for ->fault().
> >
> > ...
> >
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
> >
> > /* generic vm_area_ops exported for stackable file systems */
> > extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> > +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
> > extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
> >
> > /* mm/page-writeback.c */
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 7a13f6ac5421..1bc12a96060d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -33,6 +33,7 @@
> > #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> > #include <linux/memcontrol.h>
> > #include <linux/cleancache.h>
> > +#include <linux/rmap.h>
> > #include "internal.h"
> >
> > #define CREATE_TRACE_POINTS
> > @@ -1726,6 +1727,76 @@ page_not_uptodate:
> > }
> > EXPORT_SYMBOL(filemap_fault);
> >
> > +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + struct radix_tree_iter iter;
> > + void **slot;
> > + struct file *file = vma->vm_file;
> > + struct address_space *mapping = file->f_mapping;
> > + loff_t size;
> > + struct page *page;
> > + unsigned long address = (unsigned long) vmf->virtual_address;
> > + unsigned long addr;
> > + pte_t *pte;
> > +
> > + rcu_read_lock();
> > + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> > + if (iter.index > vmf->max_pgoff)
> > + break;
> > +repeat:
> > + page = radix_tree_deref_slot(slot);
> > + if (radix_tree_exception(page)) {
> > + if (radix_tree_deref_retry(page))
> > + break;
> > + else
> > + goto next;
> > + }
> > +
> > + if (!page_cache_get_speculative(page))
> > + goto repeat;
> > +
> > + /* Has the page moved? */
> > + if (unlikely(page != *slot)) {
> > + page_cache_release(page);
> > + goto repeat;
> > + }
> > +
> > + if (!PageUptodate(page) ||
> > + PageReadahead(page) ||
> > + PageHWPoison(page))
> > + goto skip;
> > + if (!trylock_page(page))
> > + goto skip;
> > +
> > + if (page->mapping != mapping || !PageUptodate(page))
> > + goto unlock;
> > +
> > + size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
>
> Could perhaps use round_up here.

Okay, I'll update tomorrow.

> > + if (page->index >= size >> PAGE_CACHE_SHIFT)
> > + goto unlock;
> > + pte = vmf->pte + page->index - vmf->pgoff;
> > + if (!pte_none(*pte))
> > + goto unlock;
> > +
> > + if (file->f_ra.mmap_miss > 0)
> > + file->f_ra.mmap_miss--;
>
> I'm wondering about this. We treat every speculative faultahead as a
> hit, whether or not userspace will actually touch that page.
>
> What's the effect of this? To cause the amount of physical readahead
> to increase? But if userspace is in fact touching the file in a sparse
> random fashion, that is exactly the wrong thing to do?

IIUC, it will not increase readahead window: readahead recalculate the
window when it hits PageReadahead() and we don't touch these pages.

It can increase number of readahead retry before give up.

I'm not sure what we should do here if anything. We could decrease
->mmap_miss by half of mapped pages or something. But I think
MMAP_LOTSAMISS is pretty arbitrary anyway.

>
> > + addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> > + do_set_pte(vma, addr, page, pte, false, false);
> > + unlock_page(page);
> > + goto next;
> > +unlock:
> > + unlock_page(page);
> > +skip:
> > + page_cache_release(page);
> > +next:
> > + if (page->index == vmf->max_pgoff)
> > + break;
> > + }
> > + rcu_read_unlock();
> > +}
> > +EXPORT_SYMBOL(filemap_map_pages);
> > +
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Kirill A. Shutemov

2014-02-28 03:53:19

by Matthew Wilcox

[permalink] [raw]
Subject: RE: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

I think the psbfb case is just horribly broken; they probably want to populate the entire VMA at mmap time rather than fault time. It'll be less code for them.

ttm is more nuanced, and there're one or two other graphics drivers that have similar requirements of "faulting around". But all of the ones that try it have the nasty feature of potentially faulting in some pages that weren't the one that actually faulted on, and then failing before faulting in the requested one, then returning to userspace.

Granted, this is a pretty rare case. You'd have to be incredibly low on memory to fail to allocate a page table page. But it can happen, and shouldn't.

So I was thinking about a helper that these drivers could use to "fault around" in ->fault, then Kirill let me in on ->map_pages, and I think this way could work too.

________________________________________
From: Kirill A. Shutemov [[email protected]]
Sent: February 27, 2014 4:10 PM
To: Linus Torvalds
Cc: Kirill A. Shutemov; Andrew Morton; Mel Gorman; Rik van Riel; Andi Kleen; Wilcox, Matthew R; Dave Hansen; Alexander Viro; Dave Chinner; Ning Qu; linux-mm; linux-fsdevel; Linux Kernel Mailing List
Subject: Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

On Thu, Feb 27, 2014 at 01:28:22PM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:53 AM, Kirill A. Shutemov
> <[email protected]> wrote:
> > Here's new version of faultaround patchset. It took a while to tune it and
> > collect performance data.
>
> Andrew, mind taking this into -mm with my acks? It's based on top of
> Kirill's cleanup patches that I think are also in your tree.
>
> Kirill - no complaints from me. I do have two minor issues that you
> might satisfy, but I think the patch is fine as-is.
>
> The issues/questions are:
>
> (a) could you test this on a couple of different architectures? Even
> if you just have access to intel machines, testing it across a couple
> of generations of microarchitectures would be good. The reason I say
> that is that from my profiles, it *looks* like the page fault costs
> are relatively higher on Ivybridge/Haswell than on some earlier
> uarchs.

These numbers were from Ivy Bridge.
I'll bring some numbers for Westmere and Haswell.

> (b) I suspect we should try to strongly discourage filesystems from
> actually using map_pages unless they use the standard
> filemap_map_pages function as-is. Even with the fairly clean
> interface, and forcing people to use "do_set_pte()", I think the docs
> might want to try to more explicitly discourage people from using this
> to do their own hacks..

We would need ->map_pages() at least for shmem/tmpfs. It should be
benefitial there.

Also Matthew noticed that some drivers do ugly hacks like fault in whole
VMA on first page fault. IIUC, it's for performance reasons. See
psbfb_vm_fault() or ttm_bo_vm_fault().

I thought it could be reasonable to have ->map_pages() there and do VMA
population get_user_pages() on mmap() instead.

What do you think?

--
Kirill A. Shutemov

2014-02-28 11:51:40

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On Thu, 2014-02-27 at 14:34 -0800, Dave Hansen wrote:
>
> The question is really whether or not we ever access the mapping that we
> faulted around, though. If we never access it, then the cost (however
> small it was) is a loss. That's the mechanism that I'd expect causes
> Kirill's numbers to go up after they hit their minimum at ~order-4.

On the other hand, the cost of our faults on ppc64 is higher. The two hash
lookups by the MMU (generally L2 misses) before it even decides to take the
fault, followed by a generally longer code path before we get to Linux
fault handler.

So there might be a bigger win for us, especially if the "around" pages
get pre-hashed (ie, via update_mmu_cache)

I don't have the bandwidth to play around with that myself at the moment
but I'll try to find somebody who can.

Cheers,
Ben.

2014-02-28 23:08:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCHv3 0/2] mm: map few pages around fault address if they are in page cache

On Thu, Feb 27, 2014 at 6:10 PM, Kirill A. Shutemov
<[email protected]> wrote:
>
> Also Matthew noticed that some drivers do ugly hacks like fault in whole
> VMA on first page fault. IIUC, it's for performance reasons. See
> psbfb_vm_fault() or ttm_bo_vm_fault().

I guarantee it's not for performance reasons, it's probably some other breakage.

And if anything really does want to populate things fully, doing so at
fault time is wrong anyway, you should just use MAP_POPULATE at mmap
time.

So I'll believe the shm/tmpfs issue, and that's "mm internal" anyway.
But random crappy drivers? No way in hell.

Linus

2014-04-02 18:03:27

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache

On Thu, Feb 27, 2014 at 11:53 PM, Kirill A. Shutemov
<[email protected]> wrote:
> filemap_map_pages() is generic implementation of ->map_pages() for
> filesystems who uses page cache.
>
> It should be safe to use filemap_map_pages() for ->map_pages() if
> filesystem use filemap_fault() for ->fault().
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> fs/9p/vfs_file.c | 2 ++
> fs/btrfs/file.c | 1 +
> fs/cifs/file.c | 1 +
> fs/ext4/file.c | 1 +
> fs/f2fs/file.c | 1 +
> fs/fuse/file.c | 1 +
> fs/gfs2/file.c | 1 +
> fs/nfs/file.c | 1 +
> fs/nilfs2/file.c | 1 +
> fs/ubifs/file.c | 1 +
> fs/xfs/xfs_file.c | 1 +
> include/linux/mm.h | 1 +
> mm/filemap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> mm/nommu.c | 6 +++++
> 14 files changed, 91 insertions(+)
>
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index a16b0ff497ca..d8223209d4b1 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -832,6 +832,7 @@ static void v9fs_mmap_vm_close(struct vm_area_struct *vma)
>
> static const struct vm_operations_struct v9fs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = v9fs_vm_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> @@ -839,6 +840,7 @@ static const struct vm_operations_struct v9fs_file_vm_ops = {
> static const struct vm_operations_struct v9fs_mmap_file_vm_ops = {
> .close = v9fs_mmap_vm_close,
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = v9fs_vm_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0165b8672f09..d1f0415bbfb1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -1993,6 +1993,7 @@ out:
>
> static const struct vm_operations_struct btrfs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = btrfs_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 755584684f6c..6d081de57fdb 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -3094,6 +3094,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
> static struct vm_operations_struct cifs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = cifs_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 1a5073959f32..46e78f8a133f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -200,6 +200,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
>
> static const struct vm_operations_struct ext4_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = ext4_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
> index 0dfcef53a6ed..129a3bdb05ca 100644
> --- a/fs/f2fs/file.c
> +++ b/fs/f2fs/file.c
> @@ -84,6 +84,7 @@ out:
>
> static const struct vm_operations_struct f2fs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = f2fs_vm_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 77bcc303c3ae..da99a76668f8 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1940,6 +1940,7 @@ static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> static const struct vm_operations_struct fuse_file_vm_ops = {
> .close = fuse_vma_close,
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = fuse_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> index efc078f0ee4e..2739f3c3bb8f 100644
> --- a/fs/gfs2/file.c
> +++ b/fs/gfs2/file.c
> @@ -494,6 +494,7 @@ out:
>
> static const struct vm_operations_struct gfs2_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = gfs2_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index 5bb790a69c71..284ca901fe16 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -617,6 +617,7 @@ out:
>
> static const struct vm_operations_struct nfs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = nfs_vm_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
> index 08fdb77852ac..f3a82fbcae02 100644
> --- a/fs/nilfs2/file.c
> +++ b/fs/nilfs2/file.c
> @@ -134,6 +134,7 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>
> static const struct vm_operations_struct nilfs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = nilfs_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index 123c79b7261e..4f34dbae823d 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1538,6 +1538,7 @@ out_unlock:
>
> static const struct vm_operations_struct ubifs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = ubifs_vm_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 64b48eade91d..b2be204e16ca 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1465,6 +1465,7 @@ const struct file_operations xfs_dir_file_operations = {
>
> static const struct vm_operations_struct xfs_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = xfs_vm_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index aed92cb17127..7c4bfb725e63 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1818,6 +1818,7 @@ extern void truncate_inode_pages_range(struct address_space *,
>
> /* generic vm_area_ops exported for stackable file systems */
> extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
> +extern void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf);
> extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
>
> /* mm/page-writeback.c */
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7a13f6ac5421..1bc12a96060d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -33,6 +33,7 @@
> #include <linux/hardirq.h> /* for BUG_ON(!in_atomic()) only */
> #include <linux/memcontrol.h>
> #include <linux/cleancache.h>
> +#include <linux/rmap.h>
> #include "internal.h"
>
> #define CREATE_TRACE_POINTS
> @@ -1726,6 +1727,76 @@ page_not_uptodate:
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct radix_tree_iter iter;
> + void **slot;
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + loff_t size;
> + struct page *page;
> + unsigned long address = (unsigned long) vmf->virtual_address;
> + unsigned long addr;
> + pte_t *pte;
> +
> + rcu_read_lock();
> + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> + if (iter.index > vmf->max_pgoff)
> + break;
> +repeat:
> + page = radix_tree_deref_slot(slot);

Here is obvious race with memory reclaimer/truncate. Pointer to page
might become NULL.

> + if (radix_tree_exception(page)) {
> + if (radix_tree_deref_retry(page))
> + break;
> + else
> + goto next;
> + }
> +
> + if (!page_cache_get_speculative(page))
> + goto repeat;
> +
> + /* Has the page moved? */
> + if (unlikely(page != *slot)) {
> + page_cache_release(page);
> + goto repeat;
> + }
> +
> + if (!PageUptodate(page) ||
> + PageReadahead(page) ||
> + PageHWPoison(page))
> + goto skip;
> + if (!trylock_page(page))
> + goto skip;
> +
> + if (page->mapping != mapping || !PageUptodate(page))
> + goto unlock;
> +
> + size = i_size_read(mapping->host) + PAGE_CACHE_SIZE - 1;
> + if (page->index >= size >> PAGE_CACHE_SHIFT)
> + goto unlock;
> +
> + pte = vmf->pte + page->index - vmf->pgoff;
> + if (!pte_none(*pte))
> + goto unlock;
> +
> + if (file->f_ra.mmap_miss > 0)
> + file->f_ra.mmap_miss--;
> + addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
> + do_set_pte(vma, addr, page, pte, false, false);
> + unlock_page(page);
> + goto next;
> +unlock:
> + unlock_page(page);
> +skip:
> + page_cache_release(page);
> +next:
> + if (page->index == vmf->max_pgoff)
> + break;
> + }
> + rcu_read_unlock();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +
> int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> struct page *page = vmf->page;
> @@ -1755,6 +1826,7 @@ EXPORT_SYMBOL(filemap_page_mkwrite);
>
> const struct vm_operations_struct generic_file_vm_ops = {
> .fault = filemap_fault,
> + .map_pages = filemap_map_pages,
> .page_mkwrite = filemap_page_mkwrite,
> .remap_pages = generic_file_remap_pages,
> };
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 8740213b1647..ce401e37e29f 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1985,6 +1985,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + BUG();
> +}
> +EXPORT_SYMBOL(filemap_map_pages);
> +
> int generic_file_remap_pages(struct vm_area_struct *vma, unsigned long addr,
> unsigned long size, pgoff_t pgoff)
> {
> --
> 1.9.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2014-04-02 19:07:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCHv3 2/2] mm: implement ->map_pages for page cache

On Wed, Apr 02, 2014 at 10:03:24PM +0400, Konstantin Khlebnikov wrote:
> > +void filemap_map_pages(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + struct radix_tree_iter iter;
> > + void **slot;
> > + struct file *file = vma->vm_file;
> > + struct address_space *mapping = file->f_mapping;
> > + loff_t size;
> > + struct page *page;
> > + unsigned long address = (unsigned long) vmf->virtual_address;
> > + unsigned long addr;
> > + pte_t *pte;
> > +
> > + rcu_read_lock();
> > + radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
> > + if (iter.index > vmf->max_pgoff)
> > + break;
> > +repeat:
> > + page = radix_tree_deref_slot(slot);
>
> Here is obvious race with memory reclaimer/truncate. Pointer to page
> might become NULL.

Thanks for noticing that. It has been fixed in -mm already.

--
Kirill A. Shutemov

2014-07-24 03:34:30

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
> The patch introduces new vm_ops callback ->map_pages() and uses it for
> mapping easy accessible pages around fault address.
>
> On read page fault, if filesystem provides ->map_pages(), we try to map
> up to FAULT_AROUND_PAGES pages around page fault address in hope to
> reduce number of minor page faults.
>
> We call ->map_pages first and use ->fault() as fallback if page by the
> offset is not ready to be mapped (cold page cache or something).
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---

Hi all,

This patch triggers use-after-free when fuzzing using trinity and the KASAN
patchset.

KASAN's report is:

[ 663.269187] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
[ 663.275260] page:ffffea001229ccc0 count:0 mapcount:0 mapping: (null) index:0x0
[ 663.277061] page flags: 0xafffff80008000(tail)
[ 663.278759] page dumped because: kasan error
[ 663.280645] CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[ 663.282898] 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
[ 663.288759] ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
[ 663.291496] 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
[ 663.294379] Call Trace:
[ 663.294806] dump_stack (lib/dump_stack.c:52)
[ 663.300665] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[ 663.301659] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[ 663.304645] ? preempt_count_sub (kernel/sched/core.c:2606)
[ 663.305800] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[ 663.306839] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
[ 663.307515] __asan_load8 (mm/kasan/kasan.c:364)
[ 663.308038] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[ 663.309158] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[ 663.310311] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
[ 663.311282] ? __pte_alloc (mm/memory.c:598)
[ 663.312331] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[ 663.313895] ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
[ 663.314793] __get_user_pages (mm/gup.c:286 mm/gup.c:478)
[ 663.315775] __mlock_vma_pages_range (mm/mlock.c:262)
[ 663.316879] __mm_populate (mm/mlock.c:710)
[ 663.317813] SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
[ 663.318848] tracesys (arch/x86/kernel/entry_64.S:541)
[ 663.319683] Read of size 8 by thread T9262:
[ 663.320580] Memory state around the buggy address:
[ 663.321392] ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.322573] ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.323802] ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.325080] ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.326327] ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.327572] >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.328840] ^
[ 663.329487] ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.330762] ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.331994] ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.333262] ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 663.334488] ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

This also proceeds with the traditional:

[ 663.474532] BUG: unable to handle kernel paging request at ffff88048a635de8
[ 663.474548] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)

But the rest of it got scrambled between the KASAN prints and the other BUG info + trace (who
broke printk?!).


Thanks,
Sasha

2014-07-24 06:59:36

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On 07/24/14 07:33, Sasha Levin wrote:
> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>> The patch introduces new vm_ops callback ->map_pages() and uses it for
>> mapping easy accessible pages around fault address.
>>
>> On read page fault, if filesystem provides ->map_pages(), we try to map
>> up to FAULT_AROUND_PAGES pages around page fault address in hope to
>> reduce number of minor page faults.
>>
>> We call ->map_pages first and use ->fault() as fallback if page by the
>> offset is not ready to be mapped (cold page cache or something).
>>
>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>> ---
>
> Hi all,
>
> This patch triggers use-after-free when fuzzing using trinity and the KASAN
> patchset.
>

I think this should be fixed already by following patch:

From: Konstantin Khlebnikov <[email protected]>
Subject: mm: do not call do_fault_around for non-linear fault

Ingo Korb reported that "repeated mapping of the same file on tmpfs using
remap_file_pages sometimes triggers a BUG at mm/filemap.c:202 when the
process exits". He bisected the bug to d7c1755179b82d ("mm: implement
->map_pages for shmem/tmpfs"), although the bug was actually added by
8c6e50b0290c4 ("mm: introduce vm_ops->map_pages()").

Problem is caused by calling do_fault_around for _non-linear_ faiult. In
this case pgoff is shifted and might become negative during calculation.

Faulting around non-linear page-fault has no sense and breaks logic in
do_fault_around because pgoff is shifted.

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Reported-by: Ingo Korb <[email protected]>
Tested-by: Ingo Korb <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Ning Qu <[email protected]>
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: <[email protected]> [3.15.x]
Signed-off-by: Andrew Morton <[email protected]>
---

mm/memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN mm/memory.c~mm-do-not-call-do_fault_around-for-non-linear-fault mm/memory.c
--- a/mm/memory.c~mm-do-not-call-do_fault_around-for-non-linear-fault
+++ a/mm/memory.c
@@ -2882,7 +2882,8 @@ static int do_read_fault(struct mm_struc
* if page by the offset is not ready to be mapped (cold cache or
* something).
*/
- if (vma->vm_ops->map_pages && fault_around_pages() > 1) {
+ if (vma->vm_ops->map_pages && !(flags & FAULT_FLAG_NONLINEAR) &&
+ fault_around_pages() > 1) {
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
do_fault_around(vma, address, pte, pgoff, flags);
if (!pte_same(*pte, orig_pte))


> KASAN's report is:
>
> [ 663.269187] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> [ 663.275260] page:ffffea001229ccc0 count:0 mapcount:0 mapping: (null) index:0x0
> [ 663.277061] page flags: 0xafffff80008000(tail)
> [ 663.278759] page dumped because: kasan error
> [ 663.280645] CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [ 663.282898] 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> [ 663.288759] ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> [ 663.291496] 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> [ 663.294379] Call Trace:
> [ 663.294806] dump_stack (lib/dump_stack.c:52)
> [ 663.300665] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> [ 663.301659] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [ 663.304645] ? preempt_count_sub (kernel/sched/core.c:2606)
> [ 663.305800] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [ 663.306839] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> [ 663.307515] __asan_load8 (mm/kasan/kasan.c:364)
> [ 663.308038] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [ 663.309158] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [ 663.310311] ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> [ 663.311282] ? __pte_alloc (mm/memory.c:598)
> [ 663.312331] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [ 663.313895] ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> [ 663.314793] __get_user_pages (mm/gup.c:286 mm/gup.c:478)
> [ 663.315775] __mlock_vma_pages_range (mm/mlock.c:262)
> [ 663.316879] __mm_populate (mm/mlock.c:710)
> [ 663.317813] SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> [ 663.318848] tracesys (arch/x86/kernel/entry_64.S:541)
> [ 663.319683] Read of size 8 by thread T9262:
> [ 663.320580] Memory state around the buggy address:
> [ 663.321392] ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.322573] ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.323802] ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.325080] ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.326327] ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.327572] >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.328840] ^
> [ 663.329487] ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.330762] ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.331994] ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.333262] ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> [ 663.334488] ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
> This also proceeds with the traditional:
>
> [ 663.474532] BUG: unable to handle kernel paging request at ffff88048a635de8
> [ 663.474548] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>
> But the rest of it got scrambled between the KASAN prints and the other BUG info + trace (who
> broke printk?!).
>
>
> Thanks,
> Sasha
>

2014-07-24 12:51:14

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On 07/24/2014 02:53 AM, Andrey Ryabinin wrote:
> On 07/24/14 07:33, Sasha Levin wrote:
>> > On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>>> >> The patch introduces new vm_ops callback ->map_pages() and uses it for
>>> >> mapping easy accessible pages around fault address.
>>> >>
>>> >> On read page fault, if filesystem provides ->map_pages(), we try to map
>>> >> up to FAULT_AROUND_PAGES pages around page fault address in hope to
>>> >> reduce number of minor page faults.
>>> >>
>>> >> We call ->map_pages first and use ->fault() as fallback if page by the
>>> >> offset is not ready to be mapped (cold page cache or something).
>>> >>
>>> >> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> >> ---
>> >
>> > Hi all,
>> >
>> > This patch triggers use-after-free when fuzzing using trinity and the KASAN
>> > patchset.
>> >
> I think this should be fixed already by following patch:
>
> From: Konstantin Khlebnikov <[email protected]>
> Subject: mm: do not call do_fault_around for non-linear fault

I don't think so. It's supposed to deal with a different issue, and it was already
in my -next tree which triggered the issue I've reported.


Thanks,
Sasha

2014-07-24 13:06:08

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On 07/23/2014 11:33 PM, Sasha Levin wrote:
> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>> > The patch introduces new vm_ops callback ->map_pages() and uses it for
>> > mapping easy accessible pages around fault address.
>> >
>> > On read page fault, if filesystem provides ->map_pages(), we try to map
>> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
>> > reduce number of minor page faults.
>> >
>> > We call ->map_pages first and use ->fault() as fallback if page by the
>> > offset is not ready to be mapped (cold page cache or something).
>> >
>> > Signed-off-by: Kirill A. Shutemov <[email protected]>
>> > ---
> Hi all,
>
> This patch triggers use-after-free when fuzzing using trinity and the KASAN
> patchset.

FWIW, if it helps, here's another KASAN report with the conventional BUG following it:

[ 360.498001] ==================================================================
[ 360.500896] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff880581ee3fd0
[ 360.504474] page:ffffea001607b8c0 count:0 mapcount:0 mapping: (null) index:0x0
[ 360.507264] page flags: 0xefffff80000000()
[ 360.508655] page dumped because: kasan error
[ 360.509489] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[ 360.511717] 00000000000000ff 0000000000000000 ffffea001607b8c0 ffff8801ba8bbb98
[ 360.513272] ffffffff8fe40903 ffff8801ba8bbc68 ffff8801ba8bbc58 ffffffff8b42acfc
[ 360.514729] 0000000000000001 ffff880592deaa48 ffff8801ba84b038 ffff8801ba8bbbd0
[ 360.516156] Call Trace:
[ 360.516622] dump_stack (lib/dump_stack.c:52)
[ 360.517566] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
[ 360.518745] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[ 360.519923] ? preempt_count_sub (kernel/sched/core.c:2606)
[ 360.521124] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
[ 360.522431] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
[ 360.523651] __asan_load8 (mm/kasan/kasan.c:364)
[ 360.524625] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[ 360.525887] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[ 360.527156] ? __rcu_read_unlock (kernel/rcu/update.c:101)
[ 360.528251] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[ 360.529308] ? vmacache_update (mm/vmacache.c:61)
[ 360.530505] ? find_vma (mm/mmap.c:2027)
[ 360.531453] __do_page_fault (arch/x86/mm/fault.c:1231)
[ 360.532503] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 360.533744] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 360.535001] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[ 360.536262] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
[ 360.537360] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[ 360.538493] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[ 360.539567] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[ 360.540668] Read of size 8 by thread T9251:
[ 360.541485] Memory state around the buggy address:
[ 360.542463] ffff880581ee3d00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 360.543816] ffff880581ee3d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 360.545116] ffff880581ee3e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 360.546476] ffff880581ee3e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 360.547806] ffff880581ee3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 360.549112] >ffff880581ee3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
[ 360.550507] ^
[ 360.551574] ffff880581ee4000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 360.552910] ffff880581ee4080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 360.554207] ffff880581ee4100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 360.555621] ffff880581ee4180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 360.557035] ffff880581ee4200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 360.558489] ==================================================================
[ 360.559804] BUG: unable to handle kernel paging request at ffff880581ee3fd0
[ 360.559817] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[ 360.559827] PGD 147b9067 PUD 70353d067 PMD 70352d067 PTE 8000000581ee3060
[ 360.559836] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 360.559892] Dumping ftrace buffer:
[ 360.560117] (ftrace buffer empty)
[ 360.560132] Modules linked in:
[ 360.560145] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
[ 360.560154] task: ffff8801ba84b000 ti: ffff8801ba8b8000 task.ti: ffff8801ba8b8000
[ 360.560172] RIP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[ 360.560180] RSP: 0000:ffff8801ba8bbc98 EFLAGS: 00010296
[ 360.560186] RAX: 0000000000000000 RBX: ffff8801ba8d5a00 RCX: 0000000000000006
[ 360.560192] RDX: 0000000000000006 RSI: ffffffff912f5fd5 RDI: 0000000000000282
[ 360.560199] RBP: ffff8801ba8bbd58 R08: 0000000000000001 R09: 0000000000000000
[ 360.560206] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ba8ad470
[ 360.560213] R13: 00007f2411bfa000 R14: 0000000000000000 R15: ffff880581ee3fd0
[ 360.560222] FS: 00007f2412de6700(0000) GS:ffff8801de000000(0000) knlGS:0000000000000000
[ 360.560228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 360.560234] CR2: ffff880581ee3fd0 CR3: 00000001ba89b000 CR4: 00000000000006a0
[ 360.560255] DR0: 00000000006ec000 DR1: 0000000000000000 DR2: 0000000000000000
[ 360.560262] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
[ 360.560264] Stack:
[ 360.560278] 0000000000000001 0000000000000000 ffff880581ee4030 ffff880592deaa30
[ 360.560290] 0000000000000005 00007f2411c07000 ffff8801ba8d5a90 ffff880592deaa30
[ 360.560297] 0000000000000000 000000a8ba84b038 000000000000000c 00007f2411c06220
[ 360.560299] Call Trace:
[ 360.560317] ? __rcu_read_unlock (kernel/rcu/update.c:101)
[ 360.560332] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
[ 360.560345] ? vmacache_update (mm/vmacache.c:61)
[ 360.560358] ? find_vma (mm/mmap.c:2027)
[ 360.560370] __do_page_fault (arch/x86/mm/fault.c:1231)
[ 360.560386] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 360.560400] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 360.560413] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
[ 360.560425] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
[ 360.560432] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
[ 360.560440] do_async_page_fault (arch/x86/kernel/kvm.c:279)
[ 360.560447] async_page_fault (arch/x86/kernel/entry_64.S:1321)
[ 360.560458] Code: 47 f9 48 8b 8d 68 ff ff ff 48 29 f1 48 c1 e9 0c 49 8d 44 08 ff 48 39 c7 48 0f 46 c7 4c 89 ff 48 89 85 60 ff ff ff e8 3e 19 07 00 <49> 83 3f 00 0f 84 8f 00 00 00 49 83 c6 01 4c 39 b5 60 ff ff ff
All code
========
0: 47 f9 rex.RXB stc
2: 48 8b 8d 68 ff ff ff mov -0x98(%rbp),%rcx
9: 48 29 f1 sub %rsi,%rcx
c: 48 c1 e9 0c shr $0xc,%rcx
10: 49 8d 44 08 ff lea -0x1(%r8,%rcx,1),%rax
15: 48 39 c7 cmp %rax,%rdi
18: 48 0f 46 c7 cmovbe %rdi,%rax
1c: 4c 89 ff mov %r15,%rdi
1f: 48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
26: e8 3e 19 07 00 callq 0x71969
2b:* 49 83 3f 00 cmpq $0x0,(%r15) <-- trapping instruction
2f: 0f 84 8f 00 00 00 je 0xc4
35: 49 83 c6 01 add $0x1,%r14
39: 4c 39 b5 60 ff ff ff cmp %r14,-0xa0(%rbp)
...

Code starting with the faulting instruction
===========================================
0: 49 83 3f 00 cmpq $0x0,(%r15)
4: 0f 84 8f 00 00 00 je 0x99
a: 49 83 c6 01 add $0x1,%r14
e: 4c 39 b5 60 ff ff ff cmp %r14,-0xa0(%rbp)
...
[ 360.560458] RIP do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
[ 360.560458] RSP <ffff8801ba8bbc98>
[ 360.560458] CR2: ffff880581ee3fd0
[ 360.560458] ---[ end trace ccd7cee352be7945 ]---

Thanks,
Sasha

2014-07-24 13:30:22

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCHv3 1/2] mm: introduce vm_ops->map_pages()

On Thu, Jul 24, 2014 at 5:05 PM, Sasha Levin <[email protected]> wrote:
> On 07/23/2014 11:33 PM, Sasha Levin wrote:
>> On 02/27/2014 02:53 PM, Kirill A. Shutemov wrote:
>>> > The patch introduces new vm_ops callback ->map_pages() and uses it for
>>> > mapping easy accessible pages around fault address.
>>> >
>>> > On read page fault, if filesystem provides ->map_pages(), we try to map
>>> > up to FAULT_AROUND_PAGES pages around page fault address in hope to
>>> > reduce number of minor page faults.
>>> >
>>> > We call ->map_pages first and use ->fault() as fallback if page by the
>>> > offset is not ready to be mapped (cold page cache or something).
>>> >
>>> > Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> > ---
>> Hi all,
>>
>> This patch triggers use-after-free when fuzzing using trinity and the KASAN
>> patchset.
>
> FWIW, if it helps, here's another KASAN report with the conventional BUG following it:

I haven't looked into the code yet, but this patch has a potential to
fix this bug.
https://lkml.org/lkml/2014/7/15/159

This pgoff-mess still looks suspicious for me.

>
> [ 360.498001] ==================================================================
> [ 360.500896] AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff880581ee3fd0
> [ 360.504474] page:ffffea001607b8c0 count:0 mapcount:0 mapping: (null) index:0x0
> [ 360.507264] page flags: 0xefffff80000000()
> [ 360.508655] page dumped because: kasan error
> [ 360.509489] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [ 360.511717] 00000000000000ff 0000000000000000 ffffea001607b8c0 ffff8801ba8bbb98
> [ 360.513272] ffffffff8fe40903 ffff8801ba8bbc68 ffff8801ba8bbc58 ffffffff8b42acfc
> [ 360.514729] 0000000000000001 ffff880592deaa48 ffff8801ba84b038 ffff8801ba8bbbd0
> [ 360.516156] Call Trace:
> [ 360.516622] dump_stack (lib/dump_stack.c:52)
> [ 360.517566] kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> [ 360.518745] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [ 360.519923] ? preempt_count_sub (kernel/sched/core.c:2606)
> [ 360.521124] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> [ 360.522431] ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> [ 360.523651] __asan_load8 (mm/kasan/kasan.c:364)
> [ 360.524625] ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [ 360.525887] do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [ 360.527156] ? __rcu_read_unlock (kernel/rcu/update.c:101)
> [ 360.528251] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [ 360.529308] ? vmacache_update (mm/vmacache.c:61)
> [ 360.530505] ? find_vma (mm/mmap.c:2027)
> [ 360.531453] __do_page_fault (arch/x86/mm/fault.c:1231)
> [ 360.532503] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [ 360.533744] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 360.535001] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
> [ 360.536262] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
> [ 360.537360] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
> [ 360.538493] do_async_page_fault (arch/x86/kernel/kvm.c:279)
> [ 360.539567] async_page_fault (arch/x86/kernel/entry_64.S:1321)
> [ 360.540668] Read of size 8 by thread T9251:
> [ 360.541485] Memory state around the buggy address:
> [ 360.542463] ffff880581ee3d00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [ 360.543816] ffff880581ee3d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [ 360.545116] ffff880581ee3e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [ 360.546476] ffff880581ee3e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [ 360.547806] ffff880581ee3f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [ 360.549112] >ffff880581ee3f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> [ 360.550507] ^
> [ 360.551574] ffff880581ee4000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 360.552910] ffff880581ee4080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 360.554207] ffff880581ee4100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 360.555621] ffff880581ee4180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 360.557035] ffff880581ee4200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 360.558489] ==================================================================
> [ 360.559804] BUG: unable to handle kernel paging request at ffff880581ee3fd0
> [ 360.559817] IP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [ 360.559827] PGD 147b9067 PUD 70353d067 PMD 70352d067 PTE 8000000581ee3060
> [ 360.559836] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 360.559892] Dumping ftrace buffer:
> [ 360.560117] (ftrace buffer empty)
> [ 360.560132] Modules linked in:
> [ 360.560145] CPU: 8 PID: 9251 Comm: trinity-c159 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> [ 360.560154] task: ffff8801ba84b000 ti: ffff8801ba8b8000 task.ti: ffff8801ba8b8000
> [ 360.560172] RIP: do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [ 360.560180] RSP: 0000:ffff8801ba8bbc98 EFLAGS: 00010296
> [ 360.560186] RAX: 0000000000000000 RBX: ffff8801ba8d5a00 RCX: 0000000000000006
> [ 360.560192] RDX: 0000000000000006 RSI: ffffffff912f5fd5 RDI: 0000000000000282
> [ 360.560199] RBP: ffff8801ba8bbd58 R08: 0000000000000001 R09: 0000000000000000
> [ 360.560206] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801ba8ad470
> [ 360.560213] R13: 00007f2411bfa000 R14: 0000000000000000 R15: ffff880581ee3fd0
> [ 360.560222] FS: 00007f2412de6700(0000) GS:ffff8801de000000(0000) knlGS:0000000000000000
> [ 360.560228] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 360.560234] CR2: ffff880581ee3fd0 CR3: 00000001ba89b000 CR4: 00000000000006a0
> [ 360.560255] DR0: 00000000006ec000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 360.560262] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 00000000000d0602
> [ 360.560264] Stack:
> [ 360.560278] 0000000000000001 0000000000000000 ffff880581ee4030 ffff880592deaa30
> [ 360.560290] 0000000000000005 00007f2411c07000 ffff8801ba8d5a90 ffff880592deaa30
> [ 360.560297] 0000000000000000 000000a8ba84b038 000000000000000c 00007f2411c06220
> [ 360.560299] Call Trace:
> [ 360.560317] ? __rcu_read_unlock (kernel/rcu/update.c:101)
> [ 360.560332] handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> [ 360.560345] ? vmacache_update (mm/vmacache.c:61)
> [ 360.560358] ? find_vma (mm/mmap.c:2027)
> [ 360.560370] __do_page_fault (arch/x86/mm/fault.c:1231)
> [ 360.560386] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [ 360.560400] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 360.560413] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2639 (discriminator 8))
> [ 360.560425] ? trace_hardirqs_off (kernel/locking/lockdep.c:2645)
> [ 360.560432] trace_do_page_fault (arch/x86/mm/fault.c:1314 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1315)
> [ 360.560440] do_async_page_fault (arch/x86/kernel/kvm.c:279)
> [ 360.560447] async_page_fault (arch/x86/kernel/entry_64.S:1321)
> [ 360.560458] Code: 47 f9 48 8b 8d 68 ff ff ff 48 29 f1 48 c1 e9 0c 49 8d 44 08 ff 48 39 c7 48 0f 46 c7 4c 89 ff 48 89 85 60 ff ff ff e8 3e 19 07 00 <49> 83 3f 00 0f 84 8f 00 00 00 49 83 c6 01 4c 39 b5 60 ff ff ff
> All code
> ========
> 0: 47 f9 rex.RXB stc
> 2: 48 8b 8d 68 ff ff ff mov -0x98(%rbp),%rcx
> 9: 48 29 f1 sub %rsi,%rcx
> c: 48 c1 e9 0c shr $0xc,%rcx
> 10: 49 8d 44 08 ff lea -0x1(%r8,%rcx,1),%rax
> 15: 48 39 c7 cmp %rax,%rdi
> 18: 48 0f 46 c7 cmovbe %rdi,%rax
> 1c: 4c 89 ff mov %r15,%rdi
> 1f: 48 89 85 60 ff ff ff mov %rax,-0xa0(%rbp)
> 26: e8 3e 19 07 00 callq 0x71969
> 2b:* 49 83 3f 00 cmpq $0x0,(%r15) <-- trapping instruction
> 2f: 0f 84 8f 00 00 00 je 0xc4
> 35: 49 83 c6 01 add $0x1,%r14
> 39: 4c 39 b5 60 ff ff ff cmp %r14,-0xa0(%rbp)
> ...
>
> Code starting with the faulting instruction
> ===========================================
> 0: 49 83 3f 00 cmpq $0x0,(%r15)
> 4: 0f 84 8f 00 00 00 je 0x99
> a: 49 83 c6 01 add $0x1,%r14
> e: 4c 39 b5 60 ff ff ff cmp %r14,-0xa0(%rbp)
> ...
> [ 360.560458] RIP do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> [ 360.560458] RSP <ffff8801ba8bbc98>
> [ 360.560458] CR2: ffff880581ee3fd0
> [ 360.560458] ---[ end trace ccd7cee352be7945 ]---
>
> Thanks,
> Sasha
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2014-07-28 07:49:19

by Andrey Ryabinin

[permalink] [raw]
Subject: [PATCH] mm: don't allow fault_around_bytes to be 0

Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
patchset:

AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
page:ffffea001229ccc0 count:0 mapcount:0 mapping: (null) index:0x0
page flags: 0xafffff80008000(tail)
page dumped because: kasan error
CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
Call Trace:
dump_stack (lib/dump_stack.c:52)
kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
? debug_smp_processor_id (lib/smp_processor_id.c:57)
? preempt_count_sub (kernel/sched/core.c:2606)
? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
__asan_load8 (mm/kasan/kasan.c:364)
? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
? __pte_alloc (mm/memory.c:598)
handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
__get_user_pages (mm/gup.c:286 mm/gup.c:478)
__mlock_vma_pages_range (mm/mlock.c:262)
__mm_populate (mm/mlock.c:710)
SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
tracesys (arch/x86/kernel/entry_64.S:541)
Read of size 8 by thread T9262:
Memory state around the buggy address:
ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb


It looks like that pte pointer is invalid in do_fault_around().
This could happen if fault_around_bytes is set to 0.
fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
The result of rounddown_pow_of_to is undefined if parameter == 0
(in my environment it returns 0x8000000000000000).

One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
however this would add extra code on fault path.

So let's just forbid to set fault_around_bytes to zero.
Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
instead of 0.

Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Cc: <[email protected]> # 3.15.x
Cc: "Kirill A. Shutemov" <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d820..5927c42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2784,7 +2784,7 @@ static int fault_around_bytes_get(void *data, u64 *val)

static int fault_around_bytes_set(void *data, u64 val)
{
- if (val / PAGE_SIZE > PTRS_PER_PTE)
+ if (!val || val / PAGE_SIZE > PTRS_PER_PTE)
return -EINVAL;
fault_around_bytes = val;
return 0;
--
1.8.5.5

2014-07-28 07:53:33

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [PATCH] mm: don't allow fault_around_bytes to be 0

On 07/28/14 11:43, Andrey Ryabinin wrote:
> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> patchset:
>
> AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> page:ffffea001229ccc0 count:0 mapcount:0 mapping: (null) index:0x0
> page flags: 0xafffff80008000(tail)
> page dumped because: kasan error
> CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> Call Trace:
> dump_stack (lib/dump_stack.c:52)
> kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> ? preempt_count_sub (kernel/sched/core.c:2606)
> ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> __asan_load8 (mm/kasan/kasan.c:364)
> ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> ? __pte_alloc (mm/memory.c:598)
> handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> __get_user_pages (mm/gup.c:286 mm/gup.c:478)
> __mlock_vma_pages_range (mm/mlock.c:262)
> __mm_populate (mm/mlock.c:710)
> SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> tracesys (arch/x86/kernel/entry_64.S:541)
> Read of size 8 by thread T9262:
> Memory state around the buggy address:
> ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ^
> ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
>
> It looks like that pte pointer is invalid in do_fault_around().
> This could happen if fault_around_bytes is set to 0.
> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> The result of rounddown_pow_of_to is undefined if parameter == 0
> (in my environment it returns 0x8000000000000000).
>
> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> however this would add extra code on fault path.
>
> So let's just forbid to set fault_around_bytes to zero.
> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> instead of 0.
>
> Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
> Signed-off-by: Andrey Ryabinin <[email protected]>
> Reported-by: Sasha Levin <[email protected]>
> Cc: <[email protected]> # 3.15.x
> Cc: "Kirill A. Shutemov" <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Dave Jones <[email protected]>
> Cc: Konstantin Khlebnikov <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> ---
> mm/memory.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e8d820..5927c42 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2784,7 +2784,7 @@ static int fault_around_bytes_get(void *data, u64 *val)
>
> static int fault_around_bytes_set(void *data, u64 val)
> {
> - if (val / PAGE_SIZE > PTRS_PER_PTE)
> + if (!val || val / PAGE_SIZE > PTRS_PER_PTE)
> return -EINVAL;
> fault_around_bytes = val;
> return 0;
>

Adding Sasha to cc.

2014-07-28 09:36:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH] mm: don't allow fault_around_bytes to be 0

On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> patchset:
>
> AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> page:ffffea001229ccc0 count:0 mapcount:0 mapping: (null) index:0x0
> page flags: 0xafffff80008000(tail)
> page dumped because: kasan error
> CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> Call Trace:
> dump_stack (lib/dump_stack.c:52)
> kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> ? preempt_count_sub (kernel/sched/core.c:2606)
> ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> __asan_load8 (mm/kasan/kasan.c:364)
> ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> ? __pte_alloc (mm/memory.c:598)
> handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> __get_user_pages (mm/gup.c:286 mm/gup.c:478)
> __mlock_vma_pages_range (mm/mlock.c:262)
> __mm_populate (mm/mlock.c:710)
> SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> tracesys (arch/x86/kernel/entry_64.S:541)
> Read of size 8 by thread T9262:
> Memory state around the buggy address:
> ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ^
> ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
>
> It looks like that pte pointer is invalid in do_fault_around().
> This could happen if fault_around_bytes is set to 0.
> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> The result of rounddown_pow_of_to is undefined if parameter == 0
> (in my environment it returns 0x8000000000000000).

Ouch. Good catch!

Although, I'm not convinced that it caused the issue. Sasha, did you touch the
debugfs handle?

> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> however this would add extra code on fault path.
>
> So let's just forbid to set fault_around_bytes to zero.
> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> instead of 0.

>From user point of view, 0 is perfectly fine. What about untested patch
below?

Other option: get rid of debugfs interface, so fault_around_pages() and
fault_around_mask() will always be known compile time.

There's other problem with the debugfs handle: we don't have serialization
between fault_around_bytes_set() and do_fault_around(). It can end up
badly if fault_around_bytes will be changed under do_fault_around()...

I don't think it worth adding the serialization to hot path to protect
against debug interface.
Any thoughts?

>From 2932fbcefe4ec21c046348e21981149ecce5d161 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Mon, 28 Jul 2014 12:16:49 +0300
Subject: [PATCH] mm, debugfs: workaround undefined behaviour of
rounddown_pow_of_two(0)

Result of rounddown_pow_of_two(0) is not defined. It can cause a bug if
user will set fault_around_bytes to 0 via debugfs interface.

Let's set fault_around_bytes to PAGE_SIZE if user tries to set it to
something below PAGE_SIZE.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d8205b610..2d8fa7a7b0ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2786,7 +2786,8 @@ static int fault_around_bytes_set(void *data, u64 val)
{
if (val / PAGE_SIZE > PTRS_PER_PTE)
return -EINVAL;
- fault_around_bytes = val;
+ /* rounddown_pow_of_two(0) is not defined */
+ fault_around_bytes = max(val, PAGE_SIZE);
return 0;
}
DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
--
Kirill A. Shutemov

2014-07-28 10:33:28

by Andrey Ryabinin

[permalink] [raw]
Subject: Re: [PATCH] mm: don't allow fault_around_bytes to be 0

On 07/28/14 13:36, Kirill A. Shutemov wrote:
> On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
>> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
>> patchset:
>>
>> AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
>> page:ffffea001229ccc0 count:0 mapcount:0 mapping: (null) index:0x0
>> page flags: 0xafffff80008000(tail)
>> page dumped because: kasan error
>> CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
>> 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
>> ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
>> 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
>> Call Trace:
>> dump_stack (lib/dump_stack.c:52)
>> kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
>> ? debug_smp_processor_id (lib/smp_processor_id.c:57)
>> ? preempt_count_sub (kernel/sched/core.c:2606)
>> ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
>> ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
>> __asan_load8 (mm/kasan/kasan.c:364)
>> ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
>> ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
>> ? __pte_alloc (mm/memory.c:598)
>> handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
>> ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
>> __get_user_pages (mm/gup.c:286 mm/gup.c:478)
>> __mlock_vma_pages_range (mm/mlock.c:262)
>> __mm_populate (mm/mlock.c:710)
>> SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
>> tracesys (arch/x86/kernel/entry_64.S:541)
>> Read of size 8 by thread T9262:
>> Memory state around the buggy address:
>> ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ^
>> ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>
>>
>> It looks like that pte pointer is invalid in do_fault_around().
>> This could happen if fault_around_bytes is set to 0.
>> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
>> The result of rounddown_pow_of_to is undefined if parameter == 0
>> (in my environment it returns 0x8000000000000000).
>
> Ouch. Good catch!
>
> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> debugfs handle?
>

I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

>> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
>> however this would add extra code on fault path.
>>
>> So let's just forbid to set fault_around_bytes to zero.
>> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
>> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
>> instead of 0.
>
>>From user point of view, 0 is perfectly fine. What about untested patch
> below?
>

In case if we are not going to get rid of debugfs interface I would better keep
faul_around_bytes always roundded down, like in following patch:


>From f41b7777b29f06dc62f80526e5617cae82a38709 Mon Sep 17 00:00:00 2001
From: Andrey Ryabinin <[email protected]>
Date: Mon, 28 Jul 2014 13:46:10 +0400
Subject: [PATCH] mm: debugfs: move rounddown_pow_of_two() out from do_fault
path

do_fault_around expects fault_around_bytes rounded down to nearest
page order. Instead of calling rounddown_pow_of_two every time
in fault_around_pages()/fault_around_mask() we could do round down
when user changes fault_around_bytes via debugfs interface.

This also fixes bug when user set fault_around_bytes to 0.
Result of rounddown_pow_of_two(0) is not defined, therefore
fault_around_bytes == 0 doesn't work without this patch.

Let's set fault_around_bytes to PAGE_SIZE if user sets to something
less than PAGE_SIZE

Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
Signed-off-by: Andrey Ryabinin <[email protected]>
Cc: <[email protected]> # 3.15.x
---
mm/memory.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e8d820..e0c6fd6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2758,20 +2758,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
update_mmu_cache(vma, address, pte);
}

-static unsigned long fault_around_bytes = 65536;
+static unsigned long fault_around_bytes = rounddown_pow_of_two(65536);

-/*
- * fault_around_pages() and fault_around_mask() round down fault_around_bytes
- * to nearest page order. It's what do_fault_around() expects to see.
- */
static inline unsigned long fault_around_pages(void)
{
- return rounddown_pow_of_two(fault_around_bytes) / PAGE_SIZE;
+ return fault_around_bytes >> PAGE_SHIFT;
}

static inline unsigned long fault_around_mask(void)
{
- return ~(rounddown_pow_of_two(fault_around_bytes) - 1) & PAGE_MASK;
+ return ~(fault_around_bytes - 1) & PAGE_MASK;
}


@@ -2782,11 +2778,18 @@ static int fault_around_bytes_get(void *data, u64 *val)
return 0;
}

+/*
+ * fault_around_pages() and fault_around_mask() expects fault_around_bytes
+ * rounded down to nearest page order. It's what do_fault_around() expects to see.
+ */
static int fault_around_bytes_set(void *data, u64 val)
{
if (val / PAGE_SIZE > PTRS_PER_PTE)
return -EINVAL;
- fault_around_bytes = val;
+ if (val > PAGE_SIZE)
+ fault_around_bytes = rounddown_pow_of_two(val);
+ else
+ fault_around_bytes = PAGE_SIZE; /* rounddown_pow_of_two(0) is undefined */
return 0;
}
DEFINE_SIMPLE_ATTRIBUTE(fault_around_bytes_fops,
--
1.8.5.5






2014-07-28 10:53:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH] mm: don't allow fault_around_bytes to be 0

On Mon, Jul 28, 2014 at 02:27:37PM +0400, Andrey Ryabinin wrote:
> On 07/28/14 13:36, Kirill A. Shutemov wrote:
> > On Mon, Jul 28, 2014 at 11:43:20AM +0400, Andrey Ryabinin wrote:
> >> Sasha Levin triggered use-after-free when fuzzing using trinity and the KASAN
> >> patchset:
> >>
> >> AddressSanitizer: use after free in do_read_fault.isra.40+0x3c2/0x510 at addr ffff88048a733110
> >> page:ffffea001229ccc0 count:0 mapcount:0 mapping: (null) index:0x0
> >> page flags: 0xafffff80008000(tail)
> >> page dumped because: kasan error
> >> CPU: 6 PID: 9262 Comm: trinity-c104 Not tainted 3.16.0-rc6-next-20140723-sasha-00047-g289342b-dirty #929
> >> 00000000000000fb 0000000000000000 ffffea001229ccc0 ffff88038ac0fb78
> >> ffffffffa5e40903 ffff88038ac0fc48 ffff88038ac0fc38 ffffffffa142acfc
> >> 0000000000000001 ffff880509ff5aa8 ffff88038ac10038 ffff88038ac0fbb0
> >> Call Trace:
> >> dump_stack (lib/dump_stack.c:52)
> >> kasan_report_error (mm/kasan/report.c:98 mm/kasan/report.c:166)
> >> ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> >> ? preempt_count_sub (kernel/sched/core.c:2606)
> >> ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
> >> ? do_read_fault.isra.40 (mm/memory.c:2784 mm/memory.c:2849 mm/memory.c:2898)
> >> __asan_load8 (mm/kasan/kasan.c:364)
> >> ? do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> >> do_read_fault.isra.40 (mm/memory.c:2864 mm/memory.c:2898)
> >> ? _raw_spin_unlock (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:183)
> >> ? __pte_alloc (mm/memory.c:598)
> >> handle_mm_fault (mm/memory.c:3092 mm/memory.c:3225 mm/memory.c:3345 mm/memory.c:3374)
> >> ? pud_huge (./arch/x86/include/asm/paravirt.h:611 arch/x86/mm/hugetlbpage.c:76)
> >> __get_user_pages (mm/gup.c:286 mm/gup.c:478)
> >> __mlock_vma_pages_range (mm/mlock.c:262)
> >> __mm_populate (mm/mlock.c:710)
> >> SyS_remap_file_pages (mm/mmap.c:2653 mm/mmap.c:2593)
> >> tracesys (arch/x86/kernel/entry_64.S:541)
> >> Read of size 8 by thread T9262:
> >> Memory state around the buggy address:
> >> ffff88048a732e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a732f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a732f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a733000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a733080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> >ffff88048a733100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ^
> >> ffff88048a733180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a733200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a733280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a733300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >> ffff88048a733380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >>
> >>
> >> It looks like that pte pointer is invalid in do_fault_around().
> >> This could happen if fault_around_bytes is set to 0.
> >> fault_around_pages() and fault_around_mask() calls rounddown_pow_of_to(fault_around_bytes)
> >> The result of rounddown_pow_of_to is undefined if parameter == 0
> >> (in my environment it returns 0x8000000000000000).
> >
> > Ouch. Good catch!
> >
> > Although, I'm not convinced that it caused the issue. Sasha, did you touch the
> > debugfs handle?
> >
>
> I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.
>
> >> One way to fix this would be to return 0 from fault_around_pages() if fault_around_bytes == 0,
> >> however this would add extra code on fault path.
> >>
> >> So let's just forbid to set fault_around_bytes to zero.
> >> Fault around is not used if fault_around_pages() <= 1, so if anyone doesn't want to use
> >> it, fault_around_bytes could be set to any value in range [1, 2*PAGE_SIZE - 1]
> >> instead of 0.
> >
> >>From user point of view, 0 is perfectly fine. What about untested patch
> > below?
> >
>
> In case if we are not going to get rid of debugfs interface I would better keep
> faul_around_bytes always roundded down, like in following patch:
>
>
> From f41b7777b29f06dc62f80526e5617cae82a38709 Mon Sep 17 00:00:00 2001
> From: Andrey Ryabinin <[email protected]>
> Date: Mon, 28 Jul 2014 13:46:10 +0400
> Subject: [PATCH] mm: debugfs: move rounddown_pow_of_two() out from do_fault
> path
>
> do_fault_around expects fault_around_bytes rounded down to nearest
> page order. Instead of calling rounddown_pow_of_two every time
> in fault_around_pages()/fault_around_mask() we could do round down
> when user changes fault_around_bytes via debugfs interface.
>
> This also fixes bug when user set fault_around_bytes to 0.
> Result of rounddown_pow_of_two(0) is not defined, therefore
> fault_around_bytes == 0 doesn't work without this patch.
>
> Let's set fault_around_bytes to PAGE_SIZE if user sets to something
> less than PAGE_SIZE
>
> Fixes: a9b0f861("mm: nominate faultaround area in bytes rather than page order")
> Signed-off-by: Andrey Ryabinin <[email protected]>
> Cc: <[email protected]> # 3.15.x
> ---
> mm/memory.c | 19 +++++++++++--------
> 1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e8d820..e0c6fd6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2758,20 +2758,16 @@ void do_set_pte(struct vm_area_struct *vma, unsigned long address,
> update_mmu_cache(vma, address, pte);
> }
>
> -static unsigned long fault_around_bytes = 65536;
> +static unsigned long fault_around_bytes = rounddown_pow_of_two(65536);

This looks weird, but okay...

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-07-28 12:33:02

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH] mm: don't allow fault_around_bytes to be 0

On 07/28/2014 06:27 AM, Andrey Ryabinin wrote:
>> Although, I'm not convinced that it caused the issue. Sasha, did you touch the
>> > debugfs handle?
>> >
> I suppose trinity could change it, no? I've got the very same spew after setting fault_around_bytes to 0.

Not on purpose, but as Andrey said - it's very possible that trinity did.


Thanks,
Sasha

2014-07-28 15:26:49

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] mm: don't allow fault_around_bytes to be 0

On 07/28/2014 02:36 AM, Kirill A. Shutemov wrote:
> +++ b/mm/memory.c
> @@ -2786,7 +2786,8 @@ static int fault_around_bytes_set(void *data, u64 val)
> {
> if (val / PAGE_SIZE > PTRS_PER_PTE)
> return -EINVAL;
> - fault_around_bytes = val;
> + /* rounddown_pow_of_two(0) is not defined */
> + fault_around_bytes = max(val, PAGE_SIZE);
> return 0;
> }

It's also possible to race and have fault_around_bytes change between
when fault_around_mask() and fault_around_pages() are called so that
they don't match any more. The min()/max() in do_fault_around() should
keep this from doing anything _too_ nasty, but it's worth thinking about
at least.

The safest thing to do might be to use an ACCESS_ONCE() at the beginning
of do_fault_around() for fault_around_bytes and generate
fault_around_mask() from the ACCESS_ONCE() result.

2014-07-28 22:43:35

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] mm: don't allow fault_around_bytes to be 0

On Mon, 28 Jul 2014, Andrey Ryabinin wrote:

> do_fault_around expects fault_around_bytes rounded down to nearest
> page order. Instead of calling rounddown_pow_of_two every time
> in fault_around_pages()/fault_around_mask() we could do round down
> when user changes fault_around_bytes via debugfs interface.
>

If you're going to optimize this, it seems like fault_around_bytes would
benefit from being __read_mostly.