2013-12-22 21:49:27

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 00/22] Add XIP support to ext4

... or, as it should probably be called now "Complete rewrite of everything
XIP related". It's only compile-tested; before I get too much further down
this path, I wanted to get some opinions on the approach.

Matthew Wilcox (21):
Fix XIP fault vs truncate race
Simplify COW of XIP mappings
axonram: Fix bug in direct_access
Change direct_access calling convention
Introduce IS_XIP(inode)
Treat XIP like O_DIRECT
Rewrite XIP page fault handling
Change xip_truncate_page to take a get_block parameter
Remove mm/filemap_xip.c
Remove get_xip_mem
Replace ext2_clear_xip_target with xip_clear_blocks
ext2: Remove ext2_xip_verify_sb()
ext2: Remove ext2_use_xip
ext2: Remove xip.c and xip.h
Remove CONFIG_EXT2_FS_XIP
ext2: Remove ext2_aops_xip
xip: Add xip_zero_page_range
ext4: Make ext4_block_zero_page_range static
ext4: Fix typos
Add support for pmd_faults
xip: Add reporting of major faults

Ross Zwisler (1):
ext4: Add XIP functionality

Documentation/filesystems/Locking | 3 -
Documentation/filesystems/ext4.txt | 2 +
Documentation/filesystems/xip.txt | 119 ++++-----
arch/powerpc/sysdev/axonram.c | 8 +-
drivers/block/brd.c | 8 +-
drivers/s390/block/dcssblk.c | 19 +-
fs/Kconfig | 21 +-
fs/Makefile | 1 +
fs/exofs/inode.c | 1 -
fs/ext2/Kconfig | 11 -
fs/ext2/Makefile | 1 -
fs/ext2/ext2.h | 1 -
fs/ext2/file.c | 44 +++-
fs/ext2/inode.c | 35 +--
fs/ext2/namei.c | 9 +-
fs/ext2/super.c | 38 ++-
fs/ext2/xip.c | 91 -------
fs/ext2/xip.h | 26 --
fs/ext4/ext4.h | 4 +-
fs/ext4/file.c | 54 ++++-
fs/ext4/indirect.c | 19 +-
fs/ext4/inode.c | 106 +++++---
fs/ext4/namei.c | 10 +-
fs/ext4/super.c | 39 ++-
fs/open.c | 5 +-
fs/xip.c | 432 +++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 4 +-
include/linux/fs.h | 48 +++-
include/linux/mm.h | 6 +
mm/Makefile | 1 -
mm/fadvise.c | 6 +-
mm/filemap.c | 6 +-
mm/filemap_xip.c | 483 -------------------------------------
mm/madvise.c | 2 +-
mm/memory.c | 69 +++++-
35 files changed, 901 insertions(+), 831 deletions(-)
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h
create mode 100644 fs/xip.c
delete mode 100644 mm/filemap_xip.c

--
1.8.4.rc3



2013-12-22 21:50:04

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 02/22] Simplify COW of XIP mappings

Currently COW of an XIP file is done by first bringing in a read-only
mapping, then retrying the fault and copying the page. It is much more
efficient to tell the fault handler that a COW is being attempted (by
passing in the pre-allocated page in the vm_fault structure), and allow
the handler to perform the COW operation itself.

The only minor complexity is that we need to lock against truncation until
the PTE has been established. We extend the scope of the i_mmap_mutex
in this case until the PTE has been established. This is parallel with
the holding of the page lock for the situation with pagecache files.

This allows us to get rid of the xip_sparse_page and __xip_unmap, which
is a nice reduction in code.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/mm.h | 2 +
mm/filemap_xip.c | 185 ++++++++++-------------------------------------------
mm/memory.c | 19 +++++-
3 files changed, 52 insertions(+), 154 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1cedd00..e07c57c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -195,6 +195,7 @@ struct vm_fault {
pgoff_t pgoff; /* Logical page offset based on vma */
void __user *virtual_address; /* Faulting virtual address */

+ struct page *cow_page; /* Handler may choose to COW */
struct page *page; /* ->fault handlers should return a
* page here, unless VM_FAULT_NOPAGE
* is set (which is also implied by
@@ -958,6 +959,7 @@ static inline int page_mapped(struct page *page)
#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned small page */
#define VM_FAULT_HWPOISON_LARGE 0x0020 /* Hit poisoned large page. Index encoded in upper bits */

+#define VM_FAULT_COWED 0x0080 /* ->fault COWed the page instead */
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index c8d23e9..cb088a6 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,24 +22,10 @@
#include <asm/io.h>

/*
- * We do use our own empty page to avoid interference with other users
- * of ZERO_PAGE(), such as /dev/zero
+ * Only one caller is allowed to try to create mappings at a time.
+ * Should move down into filesystem code
*/
static DEFINE_MUTEX(xip_sparse_mutex);
-static seqcount_t xip_sparse_seq = SEQCNT_ZERO(xip_sparse_seq);
-static struct page *__xip_sparse_page;
-
-/* called under xip_sparse_mutex */
-static struct page *xip_sparse_page(void)
-{
- if (!__xip_sparse_page) {
- struct page *page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-
- if (page)
- __xip_sparse_page = page;
- }
- return __xip_sparse_page;
-}

/*
* This is a file read routine for execute in place files, and uses
@@ -154,63 +140,12 @@ xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
}
EXPORT_SYMBOL_GPL(xip_file_read);

-/*
- * __xip_unmap is invoked from xip_unmap and
- * xip_write
- *
- * This function walks all vmas of the address_space and unmaps the
- * __xip_sparse_page when found at pgoff.
- */
-static void
-__xip_unmap (struct address_space * mapping,
- unsigned long pgoff)
+static inline void copy_user_highdest(struct page *to, void *vfrom,
+ unsigned long vaddr)
{
- struct vm_area_struct *vma;
- struct mm_struct *mm;
- unsigned long address;
- pte_t *pte;
- pte_t pteval;
- spinlock_t *ptl;
- struct page *page;
- unsigned count;
- int locked = 0;
-
- count = read_seqcount_begin(&xip_sparse_seq);
-
- page = __xip_sparse_page;
- if (!page)
- return;
-
-retry:
- mutex_lock(&mapping->i_mmap_mutex);
- vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
- mm = vma->vm_mm;
- address = vma->vm_start +
- ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
- BUG_ON(address < vma->vm_start || address >= vma->vm_end);
- pte = page_check_address(page, mm, address, &ptl, 1);
- if (pte) {
- /* Nuke the page table entry. */
- flush_cache_page(vma, address, pte_pfn(*pte));
- pteval = ptep_clear_flush(vma, address, pte);
- page_remove_rmap(page);
- dec_mm_counter(mm, MM_FILEPAGES);
- BUG_ON(pte_dirty(pteval));
- pte_unmap_unlock(pte, ptl);
- /* must invalidate_page _before_ freeing the page */
- mmu_notifier_invalidate_page(mm, address);
- page_cache_release(page);
- }
- }
- mutex_unlock(&mapping->i_mmap_mutex);
-
- if (locked) {
- mutex_unlock(&xip_sparse_mutex);
- } else if (read_seqcount_retry(&xip_sparse_seq, count)) {
- mutex_lock(&xip_sparse_mutex);
- locked = 1;
- goto retry;
- }
+ char *vto = kmap_atomic(to);
+ copy_user_page(vto, vfrom, vaddr, to);
+ kunmap_atomic(vto);
}

/*
@@ -224,14 +159,12 @@ static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
pgoff_t size;
void *xip_mem;
unsigned long xip_pfn;
- struct page *page;
int error;

- /* XXX: are VM_FAULT_ codes OK? */
-again:
size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
@@ -240,87 +173,40 @@ again:
&xip_mem, &xip_pfn);
if (likely(!error))
goto found;
- if (error != -ENODATA)
- return VM_FAULT_OOM;
-
- /* sparse block */
- if ((vma->vm_flags & (VM_WRITE | VM_MAYWRITE)) &&
- (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) &&
- (!(mapping->host->i_sb->s_flags & MS_RDONLY))) {
- int err;
-
- /* maybe shared writable, allocate new block */
+ /* Don't allocate backing store if we're going to COW a hole */
+ if (error == -ENODATA && !vmf->cow_page) {
mutex_lock(&xip_sparse_mutex);
error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
&xip_mem, &xip_pfn);
mutex_unlock(&xip_sparse_mutex);
- if (error)
- return VM_FAULT_SIGBUS;
- /* unmap sparse mappings at pgoff from all other vmas */
- __xip_unmap(mapping, vmf->pgoff);
+ }
+ if (error != -ENODATA)
+ return VM_FAULT_SIGBUS;

found:
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- return VM_FAULT_SIGBUS;
- }
- err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
- xip_pfn);
- mutex_unlock(&mapping->i_mmap_mutex);
- if (err == -ENOMEM)
- return VM_FAULT_OOM;
- /*
- * err == -EBUSY is fine, we've raced against another thread
- * that faulted-in the same page
- */
- if (err != -EBUSY)
- BUG_ON(err);
- return VM_FAULT_NOPAGE;
- } else {
- int err, ret = VM_FAULT_OOM;
-
- mutex_lock(&xip_sparse_mutex);
- write_seqcount_begin(&xip_sparse_seq);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(!error)) {
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
- goto again;
- }
- if (error != -ENODATA)
- goto out;
-
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
- PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- ret = VM_FAULT_SIGBUS;
- goto unlock;
- }
- /* not shared and writable, use xip_sparse_page() */
- page = xip_sparse_page();
- if (!page)
- goto unlock;
- err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
- page);
- if (err == -ENOMEM)
- goto unlock;
-
- ret = VM_FAULT_NOPAGE;
-unlock:
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
mutex_unlock(&mapping->i_mmap_mutex);
-out:
- write_seqcount_end(&xip_sparse_seq);
- mutex_unlock(&xip_sparse_mutex);
-
- return ret;
+ return VM_FAULT_SIGBUS;
}
+ if (vmf->cow_page) {
+ if (error == -ENODATA)
+ clear_user_highpage(vmf->cow_page, vaddr);
+ else
+ copy_user_highdest(vmf->cow_page, xip_mem, vaddr);
+ return VM_FAULT_COWED;
+ }
+
+ error = vm_insert_mixed(vma, vaddr, xip_pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM;
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if (error != -EBUSY)
+ BUG_ON(error);
+ return VM_FAULT_NOPAGE;
}

static const struct vm_operations_struct xip_file_vm_ops = {
@@ -374,9 +260,6 @@ __xip_file_write(struct file *filp, const char __user *buf,
status = a_ops->get_xip_mem(mapping, index, 1,
&xip_mem, &xip_pfn);
mutex_unlock(&xip_sparse_mutex);
- if (!status)
- /* unmap page at pgoff from all other vmas */
- __xip_unmap(mapping, index);
}

if (status)
diff --git a/mm/memory.c b/mm/memory.c
index 5d9025f..ecd63fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3335,11 +3335,18 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.page = NULL;
+ vmf.cow_page = cow_page;

ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
VM_FAULT_RETRY)))
goto uncharge_out;
+ if (unlikely(ret & VM_FAULT_COWED)) {
+ page = cow_page;
+ anon = 1;
+ __SetPageUptodate(page);
+ goto cowed;
+ }

if (unlikely(PageHWPoison(vmf.page))) {
if (ret & VM_FAULT_LOCKED)
@@ -3399,6 +3406,7 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,

}

+ cowed:
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);

/*
@@ -3465,9 +3473,14 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_file && !page_mkwrite)
file_update_time(vma->vm_file);
} else {
- unlock_page(vmf.page);
- if (anon)
- page_cache_release(vmf.page);
+ if ((ret & VM_FAULT_COWED)) {
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ mutex_unlock(&mapping->i_mmap_mutex);
+ } else {
+ unlock_page(vmf.page);
+ if (anon)
+ page_cache_release(vmf.page);
+ }
}

return ret;
--
1.8.4.rc3


2013-12-22 21:49:28

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 01/22] Fix XIP fault vs truncate race

Pagecache faults recheck i_size after taking the page lock to ensure that
the fault didn't race against a truncate. We don't have a page to lock
in the XIP case, so use the i_mmap_mutex instead. It is locked in the
truncate path in unmap_mapping_range() after updating i_size. So while
we hold it in the fault path, we are guaranteed that either i_size has
already been updated in the truncate path, or that the truncate will
subsequently call zap_page_range_single() and so remove the mapping we
have just inserted.

There is a window of time in which i_size has been reduced and the
thread has a mapping to a page which will be removed from the file,
but this is harmless as the page will not be allocated to a different
purpose before the thread's access to it is revoked.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/filemap_xip.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index d8d9fe3..c8d23e9 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -260,8 +260,17 @@ again:
__xip_unmap(mapping, vmf->pgoff);

found:
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address,
xip_pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
if (err == -ENOMEM)
return VM_FAULT_OOM;
/*
@@ -285,16 +294,27 @@ found:
}
if (error != -ENODATA)
goto out;
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >>
+ PAGE_CACHE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ ret = VM_FAULT_SIGBUS;
+ goto unlock;
+ }
/* not shared and writable, use xip_sparse_page() */
page = xip_sparse_page();
if (!page)
- goto out;
+ goto unlock;
err = vm_insert_page(vma, (unsigned long)vmf->virtual_address,
page);
if (err == -ENOMEM)
- goto out;
+ goto unlock;

ret = VM_FAULT_NOPAGE;
+unlock:
+ mutex_unlock(&mapping->i_mmap_mutex);
out:
write_seqcount_end(&xip_sparse_seq);
mutex_unlock(&xip_sparse_mutex);
--
1.8.4.rc3


2013-12-22 21:49:36

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 09/22] Remove mm/filemap_xip.c

It is now empty as all of its contents have been replaced by fs/xip.c

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/Makefile | 1 -
mm/filemap_xip.c | 23 -----------------------
2 files changed, 24 deletions(-)
delete mode 100644 mm/filemap_xip.c

diff --git a/mm/Makefile b/mm/Makefile
index 305d10a..c019f5e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -47,7 +47,6 @@ obj-$(CONFIG_SLUB) += slub.o
obj-$(CONFIG_KMEMCHECK) += kmemcheck.o
obj-$(CONFIG_FAILSLAB) += failslab.o
obj-$(CONFIG_MEMORY_HOTPLUG) += memory_hotplug.o
-obj-$(CONFIG_FS_XIP) += filemap_xip.o
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
deleted file mode 100644
index 6316578..0000000
--- a/mm/filemap_xip.c
+++ /dev/null
@@ -1,23 +0,0 @@
-/*
- * linux/mm/filemap_xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte <[email protected]>
- *
- * derived from linux/mm/filemap.c - Copyright (C) Linus Torvalds
- *
- */
-
-#include <linux/fs.h>
-#include <linux/pagemap.h>
-#include <linux/export.h>
-#include <linux/uio.h>
-#include <linux/rmap.h>
-#include <linux/mmu_notifier.h>
-#include <linux/sched.h>
-#include <linux/seqlock.h>
-#include <linux/mutex.h>
-#include <linux/gfp.h>
-#include <asm/tlbflush.h>
-#include <asm/io.h>
-
--
1.8.4.rc3


2013-12-22 21:49:37

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 10/22] Remove get_xip_mem

All callers of get_xip_mem() are now gone. Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.

Add documentation for the new way of writing an XIP filesystem.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/Locking | 3 --
Documentation/filesystems/xip.txt | 110 ++++++++++++++++++++------------------
fs/exofs/inode.c | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/xip.c | 37 -------------
fs/ext2/xip.h | 3 --
fs/open.c | 5 +-
include/linux/fs.h | 2 -
mm/fadvise.c | 6 ++-
mm/madvise.c | 2 +-
10 files changed, 64 insertions(+), 106 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index fe7afe2..75da4b6 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -194,8 +194,6 @@ prototypes:
void (*freepage)(struct page *);
int (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **,
- unsigned long *);
int (*migratepage)(struct address_space *, struct page *, struct page *);
int (*launder_page)(struct page *);
int (*is_partially_uptodate)(struct page *, read_descriptor_t *, unsigned long);
@@ -220,7 +218,6 @@ invalidatepage: yes
releasepage: yes
freepage: yes
direct_IO:
-get_xip_mem: maybe
migratepage: yes (both)
launder_page: yes
is_partially_uptodate: yes
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index b62eabf..f22a63f 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -3,69 +3,75 @@ Execute-in-place for file mappings

Motivation
----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read&write type file operations also transfer data from/to the page
-cache.
-
-For memory backed storage devices that use the block device interface, the page
-cache pages are in fact copies of the original storage. Various approaches
-exist to work around the need for an extra copy. The ramdisk driver for example
-does read the data into the page cache, keeps a reference, and discards the
-original data behind later on.
-
-Execute-in-place solves this issue the other way around: instead of keeping
-data in the page cache, the need to have a page cache copy is eliminated
-completely. With execute-in-place, read&write type operations are performed
-directly from/to the memory backed storage device. For file mappings, the
-storage device itself is mapped directly into userspace.
-
-This implementation was initially written for shared memory segments between
-different virtual machines on s390 hardware to allow multiple machines to
-share the same binaries and libraries.
-
-Implementation
---------------
-Execute-in-place is implemented in three steps: block device operation,
-address space operation, and file operations.
-
-A block device operation named direct_access is used to translate the
-block device sector number to a page frame number (pfn) that identifies
-the physical page for the memory. It also returns a kernel virtual
-address that can be used to access the memory.
+
+File mappings are usually performed by mapping page cache pages to
+userspace. In addition, read & write file operations also transfer data
+between the page cache and storage.
+
+For memory backed storage devices that use the block device interface,
+the page cache pages are just copies of the original storage. The
+execute-in-place code removes the extra copy by performing reads and
+writes directly on the memory backed storage device. For file mappings,
+the storage device itself is mapped directly into userspace.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support XIP in your block driver, implement the 'direct_access'
+block device operation. It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory. It also returns a
+kernel virtual address that can be used to access the memory.

The direct_access method takes a 'size' parameter that indicates the
number of bytes being requested. The function should return the number
of bytes that it can provide, although it must not exceed the number of
bytes requested. It may also return a negative errno if an error occurs.

-The block device operation is optional, these block devices support it as of
-today:
+In order to support this method, the storage must be byte-accessable by
+the CPU at all times. If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access. Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
- dcssblk: s390 dcss block device driver

-An address space operation named get_xip_mem is used to retrieve references
-to a page frame number and a kernel address. To obtain these values a reference
-to an address_space is provided. This function assigns values to the kmem and
-pfn parameters. The third argument indicates whether the function should allocate
-blocks if needed.

-This address space operation is mutually exclusive with readpage&writepage that
-do page cache read/write operations.
-The following filesystems support it as of today:
-- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being XIP by setting the S_XIP flag in
+ i_flags
+- implementing the direct_IO address space operation, and calling xip_do_io()
+ instead of blockdev_direct_IO() if S_XIP is set
+- implementing an mmap file operation, and if the S_XIP flag is set, setting
+ the VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include a fault
+ handler that calls xip_fault, passing an appropriate get_block callback
+- ensuring that the get_block callback used by xip_fault() never exposes
+ allocated, uninitialised blocks
+- ensuring that there is sufficient locking between reads, writes, truncates
+ and page faults

-A set of file operations that do utilize get_xip_page can be found in
-mm/filemap_xip.c . The following file operation implementations are provided:
-- aio_read/aio_write
-- readv/writev
-- sendfile
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt

-The generic file operations do_sync_read/do_sync_write can be used to implement
-classic synchronous IO calls.

Shortcomings
------------
-This implementation is limited to storage devices that are cpu addressable at
-all times (no highmem or such). It works well on rom/ram, but enhancements are
-needed to make it work with flash in read+write mode.
-Putting the Linux kernel and/or its modules on a xip filesystem does not mean
-they are not copied.
+
+Even if the kernel or its modules are stored on an filesystem that supports
+XIP on a block device that supports XIP, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from an XIP file will fail as there are no 'struct page' to describe
+those pages. This problem is being worked on. That means that O_DIRECT
+reads/writes to those memory ranges from a non-XIP file will fail (note
+that O_DIRECT reads/writes _of an XIP file_ do work, it is the memory
+that is being accessed that is key here). Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index a52a5d2..a96ce89 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -977,7 +977,6 @@ const struct address_space_operations exofs_aops = {
.direct_IO = NULL, /* TODO: Should be trivial to do */

/* With these NULL has special meaning or default is not exported */
- .get_xip_mem = NULL,
.migratepage = NULL,
.launder_page = NULL,
.is_partially_uptodate = NULL,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5c1fb09..cc60c23 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -891,7 +891,6 @@ const struct address_space_operations ext2_aops = {

const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
- .get_xip_mem = ext2_get_xip_mem,
.direct_IO = ext2_direct_IO,
};

diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index fa40091..ca745ff 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -22,27 +22,6 @@ static inline long __inode_direct_access(struct inode *inode, sector_t block,
return ops->direct_access(bdev, sector, kaddr, pfn, size);
}

-static inline int
-__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
- sector_t *result)
-{
- struct buffer_head tmp;
- int rc;
-
- memset(&tmp, 0, sizeof(struct buffer_head));
- tmp.b_size = 1 << inode->i_blkbits;
- rc = ext2_get_block(inode, pgoff, &tmp, create);
- *result = tmp.b_blocknr;
-
- /* did we get a sparse block (hole in the file)? */
- if (!tmp.b_blocknr && !rc) {
- BUG_ON(create);
- rc = -ENODATA;
- }
-
- return rc;
-}
-
int
ext2_clear_xip_target(struct inode *inode, sector_t block)
{
@@ -69,19 +48,3 @@ void ext2_xip_verify_sb(struct super_block *sb)
"not supported by bdev");
}
}
-
-int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
- void **kmem, unsigned long *pfn)
-{
- long rc;
- sector_t block;
-
- /* first, retrieve the sector number */
- rc = __ext2_get_block(mapping->host, pgoff, create, &block);
- if (rc)
- return rc;
-
- /* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
- return (rc < 0) ? rc : 0;
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..0fa8b7f 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -14,11 +14,8 @@ static inline int ext2_use_xip (struct super_block *sb)
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
-int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
-#define ext2_get_xip_mem NULL
#endif
diff --git a/fs/open.c b/fs/open.c
index 4b3e1ed..4b16abe 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -665,11 +665,8 @@ int open_check_o_direct(struct file *f)
{
/* NB: we're sure to have correct a_ops only after f_op->open */
if (f->f_flags & O_DIRECT) {
- if (!f->f_mapping->a_ops ||
- ((!f->f_mapping->a_ops->direct_IO) &&
- (!f->f_mapping->a_ops->get_xip_mem))) {
+ if (!f->f_mapping->a_ops || !f->f_mapping->a_ops->direct_IO)
return -EINVAL;
- }
}
return 0;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d59a714..197c3c6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -372,8 +372,6 @@ struct address_space_operations {
void (*freepage)(struct page *);
ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
loff_t offset, unsigned long nr_segs);
- int (*get_xip_mem)(struct address_space *, pgoff_t, int,
- void **, unsigned long *);
/*
* migrate the contents of a page to the specified target. If
* migrate_mode is MIGRATE_ASYNC, it must not block.
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..30d03bf 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -28,6 +28,7 @@
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
struct fd f = fdget(fd);
+ struct inode *inode;
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
@@ -39,7 +40,8 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
if (!f.file)
return -EBADF;

- if (S_ISFIFO(file_inode(f.file)->i_mode)) {
+ inode = file_inode(f.file);
+ if (S_ISFIFO(inode->i_mode)) {
ret = -ESPIPE;
goto out;
}
@@ -50,7 +52,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
goto out;
}

- if (mapping->a_ops->get_xip_mem) {
+ if (IS_XIP(inode)) {
switch (advice) {
case POSIX_FADV_NORMAL:
case POSIX_FADV_RANDOM:
diff --git a/mm/madvise.c b/mm/madvise.c
index 539eeb9..9b487ca 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -236,7 +236,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
if (!file)
return -EBADF;

- if (file->f_mapping->a_ops->get_xip_mem) {
+ if (IS_XIP(file_inode(file))) {
/* no bad return value, but ignore advice */
return 0;
}
--
1.8.4.rc3


2013-12-22 21:49:42

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 15/22] Remove CONFIG_EXT2_FS_XIP

The fewer Kconfig options we have the better. Use the generic CONFIG_FS_XIP
to enable XIP support in ext2 as well as in the core.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/Kconfig | 21 ++++++++++++++-------
fs/ext2/Kconfig | 11 -----------
fs/ext2/file.c | 6 +++++-
fs/ext2/super.c | 4 ++--
4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/Kconfig b/fs/Kconfig
index c229f82..1e01dda 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -13,13 +13,6 @@ if BLOCK
source "fs/ext2/Kconfig"
source "fs/ext3/Kconfig"
source "fs/ext4/Kconfig"
-
-config FS_XIP
-# execute in place
- bool
- depends on EXT2_FS_XIP
- default y
-
source "fs/jbd/Kconfig"
source "fs/jbd2/Kconfig"

@@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig"
source "fs/btrfs/Kconfig"
source "fs/nilfs2/Kconfig"

+config FS_XIP
+ bool "Execute in place support"
+ depends on MMU
+ help
+ Execute in place (XIP) can be used on memory-backed block devices.
+ If the block device supports XIP and the filesystem supports XIP,
+ then you can avoid using the pagecache to buffer I/Os. Turning
+ on this option will compile in support for XIP; you will need to
+ mount the filesystem using the -o xip option.
+
+ If you do not have a block device that is capable of using this,
+ or if unsure, say N. Saying Y will increase the size of the kernel
+ by about 2kB.
+
endif # BLOCK

# Posix ACL utility routines
diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig
index 14a6780..c634874e 100644
--- a/fs/ext2/Kconfig
+++ b/fs/ext2/Kconfig
@@ -42,14 +42,3 @@ config EXT2_FS_SECURITY

If you are not using a security module that requires using
extended attributes for file security labels, say N.
-
-config EXT2_FS_XIP
- bool "Ext2 execute in place support"
- depends on EXT2_FS && MMU
- help
- Execute in place can be used on memory-backed block devices. If you
- enable this option, you can select to mount block devices which are
- capable of this feature without using the page cache.
-
- If you do not use a block device that is capable of using this,
- or if unsure, say N.
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 47a0b1d..6e6e803 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,7 @@
#include "xattr.h"
#include "acl.h"

+#ifdef CONFIG_FS_XIP
static int ext2_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
return xip_fault(vma, vmf, ext2_get_block);
@@ -45,6 +46,9 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
vma->vm_flags |= VM_MIXEDMAP;
return 0;
}
+#else
+#define ext2_file_mmap generic_file_mmap
+#endif

/*
* Called when filp is released. This happens when all file descriptors
@@ -99,7 +103,7 @@ const struct file_operations ext2_file_operations = {
.splice_write = generic_file_splice_write,
};

-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
.read = do_sync_read,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index fd62082..9a203a9 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root)
seq_puts(seq, ",grpquota");
#endif

-#if defined(CONFIG_EXT2_FS_XIP)
+#if defined(CONFIG_FS_XIP)
if (sbi->s_mount_opt & EXT2_MOUNT_XIP)
seq_puts(seq, ",xip");
#endif
@@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb)
break;
#endif
case Opt_xip:
-#ifdef CONFIG_EXT2_FS_XIP
+#ifdef CONFIG_FS_XIP
set_opt (sbi->s_mount_opt, XIP);
#else
ext2_msg(sb, KERN_INFO, "xip option not supported");
--
1.8.4.rc3


2013-12-22 21:49:48

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 21/22] Add support for pmd_faults

Introduce the vm_ops ->pmd_fault handler, add a vm_insert_pfn_pmd function
and create an xip_pmd_fault handler.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/file.c | 9 ++++-
fs/ext4/file.c | 9 ++++-
fs/xip.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 8 ++---
include/linux/mm.h | 4 +++
mm/memory.c | 50 +++++++++++++++++++++++---
6 files changed, 169 insertions(+), 12 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 6e6e803..7d6e492 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -31,8 +31,15 @@ static int ext2_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return xip_fault(vma, vmf, ext2_get_block);
}

+static int ext2_xip_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned int flags)
+{
+ return xip_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
+}
+
static const struct vm_operations_struct ext2_xip_vm_ops = {
.fault = ext2_xip_fault,
+ .pmd_fault = ext2_xip_pmd_fault,
.remap_pages = generic_file_remap_pages,
};

@@ -43,7 +50,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)

file_accessed(file);
vma->vm_ops = &ext2_xip_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
return 0;
}
#else
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d6ae6be..6211f56 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -205,8 +205,15 @@ static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
/* Is this the right get_block? */
}

+static int ext4_xip_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned int flags)
+{
+ return xip_pmd_fault(vma, addr, pmd, flags, ext4_get_block);
+}
+
static const struct vm_operations_struct ext4_xip_vm_ops = {
.fault = ext4_xip_fault,
+ .pmd_fault = ext4_xip_pmd_fault,
.remap_pages = generic_file_remap_pages,
};
#else
@@ -224,7 +231,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
file_accessed(file);
if (IS_XIP(file_inode(file))) {
vma->vm_ops = &ext4_xip_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
} else {
vma->vm_ops = &ext4_file_vm_ops;
}
diff --git a/fs/xip.c b/fs/xip.c
index e6e52ee..d032838 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -273,6 +273,107 @@ int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
EXPORT_SYMBOL_GPL(xip_fault);

+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
+ * more often than one might expect in the below function.
+ */
+#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
+
+/*
+ * We are willing to use a PMD mapping to cover the end of a file if it
+ * could be mapped by a complete PMD's worth of PTEs. That is, the last
+ * part of the file might be slightly smaller than PMD_SIZE, but as long
+ * as it's at least (PMD_SIZE - PAGE_SIZE + 1) bytes long, we allow the
+ * PMD mapping.
+ */
+static int do_xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned int flags, get_block_t get_block)
+{
+ struct file *file = vma->vm_file;
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = file->f_mapping;
+ struct buffer_head bh;
+ long length;
+ pgoff_t size, pgoff;
+ sector_t block;
+ unsigned long pfn;
+
+ /* Fall back to PTEs if we're going to COW */
+ if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
+ return VM_FAULT_FALLBACK;
+
+ pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (pgoff >= size)
+ return VM_FAULT_SIGBUS;
+ if ((pgoff | PG_PMD_COLOUR) >= size)
+ return VM_FAULT_FALLBACK;
+
+ memset(&bh, 0, sizeof(bh));
+ block = ((sector_t)pgoff & ~PG_PMD_COLOUR) <<
+ (PAGE_SHIFT - inode->i_blkbits);
+
+ /* Start by seeing if we already have an allocated block */
+ bh.b_size = PMD_SIZE;
+ length = get_block(inode, block, &bh, 0);
+ if (length)
+ return VM_FAULT_SIGBUS;
+ if (buffer_mapped(&bh) && bh.b_size == PMD_SIZE)
+ goto insert;
+
+ /* Next, try to allocate the whole thing */
+ bh.b_size = PMD_SIZE;
+ length = get_block(inode, block, &bh, 1);
+ if (length)
+ return VM_FAULT_SIGBUS;
+ if (bh.b_size == PMD_SIZE)
+ goto insert;
+
+ return VM_FAULT_FALLBACK;
+
+ insert:
+ length = xip_get_pfn(inode, &bh, &pfn);
+ if (length < 0)
+ return VM_FAULT_SIGBUS;
+ if (length < PMD_SIZE)
+ return VM_FAULT_FALLBACK;
+ if (pfn & PG_PMD_COLOUR)
+ return VM_FAULT_FALLBACK; /* not aligned */
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if ((pgoff | PG_PMD_COLOUR) < size)
+ length = vm_insert_pfn_pmd(vma, address, pmd, pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
+
+ if (pgoff >= size)
+ return VM_FAULT_SIGBUS;
+ if ((pgoff | PG_PMD_COLOUR) >= size)
+ return VM_FAULT_FALLBACK;
+ if (length == -ENOMEM)
+ return VM_FAULT_OOM;
+ /* -EBUSY is fine, somebody else faulted on the same PMD */
+ if (length != -EBUSY)
+ BUG_ON(length);
+ return VM_FAULT_NOPAGE;
+}
+
+int xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned int flags, get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ result = do_xip_pmd_fault(vma, address, pmd, flags, get_block);
+ sb_end_pagefault(sb);
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(xip_pmd_fault);
+
/**
* xip_zero_page_range - zero a range within a page of an XIP file
* @inode: The file being truncated
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3a4a217..e789218 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2513,6 +2513,8 @@ int xip_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int xip_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
+ unsigned int flags, get_block_t);
#else
static inline int xip_clear_blocks(struct inode *i, sector_t blk, long sz)
{
@@ -2531,12 +2533,6 @@ static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
{
return -ENOTTY;
}
-
-static inline int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
- get_block_t gb)
-{
- return 0;
-}
#endif

/* PAGE_CACHE_ALIGN is defined in pagemap.h */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e07c57c..d48913d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -212,6 +212,8 @@ struct vm_operations_struct {
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
+ int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
+ pmd_t *, unsigned int flags);

/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
@@ -1857,6 +1859,8 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
+int vm_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *,
+ unsigned long pfn);
int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);


diff --git a/mm/memory.c b/mm/memory.c
index ecd63fe..0d332cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2249,6 +2249,41 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
}
EXPORT_SYMBOL(vm_insert_mixed);

+static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned long pfn, pgprot_t prot)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int retval;
+ pmd_t entry;
+ spinlock_t *ptl;
+
+ ptl = pmd_lock(mm, pmd);
+ retval = -EBUSY;
+ if (!pmd_none(*pmd))
+ goto out_unlock;
+
+ /* Ok, finally just insert the thing.. */
+ entry = pfn_pmd(pfn, prot); /* XXX: pmd_mkspecial? */
+ set_pmd_at(mm, addr, pmd, entry);
+ update_mmu_cache_pmd(vma, addr, pmd);
+
+ retval = 0;
+ out_unlock:
+ spin_unlock(ptl);
+ return retval;
+}
+
+int vm_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, unsigned long pfn)
+{
+ BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return -EFAULT;
+ return insert_pfn_pmd(vma, addr, pmd, pfn, vma->vm_page_prot);
+}
+EXPORT_SYMBOL(vm_insert_pfn_pmd);
+
/*
* maps a range of physical memory into the requested pages. the old
* mappings are removed. any references to nonexistent pages results
@@ -3630,6 +3665,16 @@ out:
return 0;
}

+static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, unsigned int flags)
+{
+ if (!vma->vm_ops)
+ return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
+ if (vma->vm_ops->pmd_fault)
+ return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ return VM_FAULT_FALLBACK;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3722,10 +3767,7 @@ retry:
if (!pmd)
return VM_FAULT_OOM;
if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
- int ret = VM_FAULT_FALLBACK;
- if (!vma->vm_ops)
- ret = do_huge_pmd_anonymous_page(mm, vma, address,
- pmd, flags);
+ int ret = create_huge_pmd(mm, vma, address, pmd, flags);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
--
1.8.4.rc3


2013-12-22 21:49:47

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 20/22] ext4: Fix typos

Comment fix only

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 8b73d77..cae40af 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3720,7 +3720,7 @@ void ext4_truncate(struct inode *inode)

/*
* There is a possibility that we're either freeing the inode
- * or it completely new indode. In those cases we might not
+ * or it's a completely new inode. In those cases we might not
* have i_mutex locked because it's not necessary.
*/
if (!(inode->i_state & (I_NEW|I_FREEING)))
--
1.8.4.rc3


2013-12-22 21:49:45

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 18/22] ext4: Make ext4_block_zero_page_range static

It's only called within inode.c, so make it static, remove its prototype
from ext4.h and move it above all of its callers so it doesn't need a
prototype within inode.c.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/ext4.h | 2 --
fs/ext4/inode.c | 42 +++++++++++++++++++++---------------------
2 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e618503..a48d367 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2121,8 +2121,6 @@ extern int ext4_writepage_trans_blocks(struct inode *);
extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_block_truncate_page(handle_t *handle,
struct address_space *mapping, loff_t from);
-extern int ext4_block_zero_page_range(handle_t *handle,
- struct address_space *mapping, loff_t from, loff_t length);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0757634..c767666 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3324,33 +3324,13 @@ void ext4_set_aops(struct inode *inode)
}

/*
- * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
- * up to the end of the block which corresponds to `from'.
- * This required during truncate. We need to physically zero the tail end
- * of that block so it doesn't yield old data if the file is later grown.
- */
-int ext4_block_truncate_page(handle_t *handle,
- struct address_space *mapping, loff_t from)
-{
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length;
- unsigned blocksize;
- struct inode *inode = mapping->host;
-
- blocksize = inode->i_sb->s_blocksize;
- length = blocksize - (offset & (blocksize - 1));
-
- return ext4_block_zero_page_range(handle, mapping, from, length);
-}
-
-/*
* ext4_block_zero_page_range() zeros out a mapping of length 'length'
* starting from file offset 'from'. The range to be zero'd must
* be contained with in one block. If the specified range exceeds
* the end of the block it will be shortened to end of the block
* that cooresponds to 'from'
*/
-int ext4_block_zero_page_range(handle_t *handle,
+static int ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3440,6 +3420,26 @@ unlock:
return err;
}

+/*
+ * ext4_block_truncate_page() zeroes out a mapping from file offset `from'
+ * up to the end of the block which corresponds to `from'.
+ * This required during truncate. We need to physically zero the tail end
+ * of that block so it doesn't yield old data if the file is later grown.
+ */
+int ext4_block_truncate_page(handle_t *handle,
+ struct address_space *mapping, loff_t from)
+{
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length;
+ unsigned blocksize;
+ struct inode *inode = mapping->host;
+
+ blocksize = inode->i_sb->s_blocksize;
+ length = blocksize - (offset & (blocksize - 1));
+
+ return ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t length)
{
--
1.8.4.rc3


2013-12-22 21:49:43

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 16/22] ext2: Remove ext2_aops_xip

We shouldn't need a special address_space_operations any more

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/ext2.h | 1 -
fs/ext2/inode.c | 7 +------
fs/ext2/namei.c | 4 ++--
3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..f39599e 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -789,7 +789,6 @@ extern const struct file_operations ext2_xip_file_operations;

/* inode.c */
extern const struct address_space_operations ext2_aops;
-extern const struct address_space_operations ext2_aops_xip;
extern const struct address_space_operations ext2_nobh_aops;

/* namei.c */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index af0fa94..9eadd78 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -890,11 +890,6 @@ const struct address_space_operations ext2_aops = {
.error_remove_page = generic_error_remove_page,
};

-const struct address_space_operations ext2_aops_xip = {
- .bmap = ext2_bmap,
- .direct_IO = ext2_direct_IO,
-};
-
const struct address_space_operations ext2_nobh_aops = {
.readpage = ext2_readpage,
.readpages = ext2_readpages,
@@ -1393,7 +1388,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 76d33b9..2fc20ca 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)

inode->i_op = &ext2_file_inode_operations;
if (test_opt(inode->i_sb, XIP)) {
- inode->i_mapping->a_ops = &ext2_aops_xip;
+ inode->i_mapping->a_ops = &ext2_aops;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
inode->i_mapping->a_ops = &ext2_nobh_aops;
--
1.8.4.rc3


2013-12-22 21:49:39

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 12/22] ext2: Remove ext2_xip_verify_sb()

Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed. It also doesn't make sense to re-check whether blocksize is
supported since it can't change between mounts.

Replace the call to ext2_xip_verify_sb() in ext2_fill_super() with the
equivalent check and delete the definition.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/super.c | 33 ++++++++++++---------------------
fs/ext2/xip.c | 12 ------------
fs/ext2/xip.h | 2 --
3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 2885349..caf3f62 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
(EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

- if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
- if (!silent)
+ if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+ if (blocksize != PAGE_SIZE) {
ext2_msg(sb, KERN_ERR,
- "error: unsupported blocksize for xip");
- goto failed_mount;
+ "error: unsupported blocksize for xip");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext2_msg(sb, KERN_ERR,
+ "error: device does not support xip");
+ goto failed_mount;
+ }
}

/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
{
struct ext2_sb_info * sbi = EXT2_SB(sb);
struct ext2_super_block * es;
- unsigned long old_mount_opt = sbi->s_mount_opt;
struct ext2_mount_options old_opts;
unsigned long old_sb_flags;
int err;
@@ -1273,22 +1275,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

- ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
- EXT2_MOUNT_XIP if not */
-
- if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
- ext2_msg(sb, KERN_WARNING,
- "warning: unsupported blocksize for xip");
- err = -EINVAL;
- goto restore_opts;
- }
-
es = sbi->s_es;
- if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
"xip flag with busy inodes while remounting");
- sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
- sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+ sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
}
if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
#include "ext2.h"
#include "xip.h"

-void ext2_xip_verify_sb(struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
-
- if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
- !sb->s_bdev->bd_disk->fops->direct_access) {
- sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
- ext2_msg(sb, KERN_WARNING,
- "warning: ignoring xip option - "
- "not supported by bdev");
- }
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
*/

#ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
}
#else
-#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#endif
--
1.8.4.rc3


2013-12-22 21:49:41

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 14/22] ext2: Remove xip.c and xip.h

These files are now empty, so delete them

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/Makefile | 1 -
fs/ext2/inode.c | 1 -
fs/ext2/namei.c | 1 -
fs/ext2/super.c | 1 -
fs/ext2/xip.c | 15 ---------------
fs/ext2/xip.h | 16 ----------------
6 files changed, 35 deletions(-)
delete mode 100644 fs/ext2/xip.c
delete mode 100644 fs/ext2/xip.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index f42af45..445b0e9 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o
-ext2-$(CONFIG_EXT2_FS_XIP) += xip.o
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9d6f2e1..af0fa94 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -34,7 +34,6 @@
#include <linux/aio.h>
#include "ext2.h"
#include "acl.h"
-#include "xip.h"
#include "xattr.h"

static int __ext2_write_inode(struct inode *inode, int do_sync);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 0a7697a..76d33b9 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode)
{
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index caf3f62..fd62082 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -35,7 +35,6 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
-#include "xip.h"

static void ext2_sync_super(struct super_block *sb,
struct ext2_super_block *es, int wait);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
deleted file mode 100644
index 66ca113..0000000
--- a/fs/ext2/xip.c
+++ /dev/null
@@ -1,15 +0,0 @@
-/*
- * linux/fs/ext2/xip.c
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte ([email protected])
- */
-
-#include <linux/mm.h>
-#include <linux/fs.h>
-#include <linux/genhd.h>
-#include <linux/buffer_head.h>
-#include <linux/blkdev.h>
-#include "ext2.h"
-#include "xip.h"
-
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
deleted file mode 100644
index 87eeb04..0000000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- * linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte ([email protected])
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
- struct ext2_sb_info *sbi = EXT2_SB(sb);
- return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb) 0
-#endif
--
1.8.4.rc3


2013-12-22 21:50:45

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 22/22] xip: Add reporting of major faults

If we have to call get_block with the create argument set to 1, then
the filesystem almost certainly had to zero the block. which is an I/O,
which should be reported as a major fault.

Note that major faults on XIP files happen for different reasons than
major faults on non-XIP files. XIP files behave as if everything except
file holes is already cached. That's all the more reason to report
major faults when we do have to do I/O; it may be a valuable resource
for sysadmins trying to diagnose performance problems.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/xip.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/xip.c b/fs/xip.c
index d032838..4eb4596 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -19,10 +19,12 @@
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/highmem.h>
+#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/uio.h>
+#include <linux/vmstat.h>

int xip_clear_blocks(struct inode *inode, sector_t block, long size)
{
@@ -213,6 +215,7 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
pgoff_t size;
unsigned long pfn;
int error;
+ int major = 0;

size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (vmf->pgoff >= size)
@@ -226,6 +229,9 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
/* Don't allocate backing store if we're going to COW a hole */
if (!error && !buffer_mapped(&bh) && !vmf->cow_page) {
error = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
}

if (error)
@@ -255,7 +261,7 @@ static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error != -EBUSY)
BUG_ON(error);
- return VM_FAULT_NOPAGE;
+ return VM_FAULT_NOPAGE | major;
}

int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
@@ -297,6 +303,7 @@ static int do_xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
pgoff_t size, pgoff;
sector_t block;
unsigned long pfn;
+ int major = 0;

/* Fall back to PTEs if we're going to COW */
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
@@ -324,6 +331,9 @@ static int do_xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
/* Next, try to allocate the whole thing */
bh.b_size = PMD_SIZE;
length = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
if (length)
return VM_FAULT_SIGBUS;
if (bh.b_size == PMD_SIZE)
@@ -356,7 +366,7 @@ static int do_xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
/* -EBUSY is fine, somebody else faulted on the same PMD */
if (length != -EBUSY)
BUG_ON(length);
- return VM_FAULT_NOPAGE;
+ return VM_FAULT_NOPAGE | major;
}

int xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
--
1.8.4.rc3


2013-12-22 21:49:38

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 11/22] Replace ext2_clear_xip_target with xip_clear_blocks

This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.

Make it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page. Thanks to Dave
Hansen for suggesting that I need to call cond_resched() if zeroing more
than one page.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/inode.c | 8 +++++---
fs/ext2/xip.c | 23 -----------------------
fs/ext2/xip.h | 3 ---
fs/xip.c | 34 ++++++++++++++++++++++++++++++++++
include/linux/fs.h | 6 ++++++
5 files changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index cc60c23..5a2cb24 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,

if (IS_XIP(inode)) {
/*
- * we need to clear the block
+ * block must be initialised before we put it in the tree
+ * so that it's not found by another thread before it's
+ * initialised
*/
- err = ext2_clear_xip_target (inode,
- le32_to_cpu(chain[depth-1].key));
+ err = xip_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+ count << inode->i_blkbits);
if (err) {
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index ca745ff..132d4da 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,29 +13,6 @@
#include "ext2.h"
#include "xip.h"

-static inline long __inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn, long size)
-{
- struct block_device *bdev = inode->i_sb->s_bdev;
- const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector = block * (PAGE_SIZE / 512);
- return ops->direct_access(bdev, sector, kaddr, pfn, size);
-}
-
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
- void *kaddr;
- unsigned long pfn;
- long size;
-
- size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
- if (size < 0)
- return size;
- clear_page(kaddr);
- return 0;
-}
-
void ext2_xip_verify_sb(struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 0fa8b7f..e7b9f0a 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@

#ifdef CONFIG_EXT2_FS_XIP
extern void ext2_xip_verify_sb (struct super_block *);
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
static inline int ext2_use_xip (struct super_block *sb)
{
struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -17,5 +15,4 @@ static inline int ext2_use_xip (struct super_block *sb)
#else
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
-#define ext2_clear_xip_target(inode, chain) 0
#endif
diff --git a/fs/xip.c b/fs/xip.c
index 0c5026f..991a705 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -21,8 +21,42 @@
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/mutex.h>
+#include <linux/sched.h>
#include <linux/uio.h>

+int xip_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+ struct block_device *bdev = inode->i_sb->s_bdev;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ sector_t sector = block << (inode->i_blkbits - 9);
+ unsigned long pfn;
+
+ might_sleep();
+ do {
+ void *addr;
+ long count = ops->direct_access(bdev, sector, &addr, &pfn,
+ size);
+ if (count < 0)
+ return count;
+ while (count >= PAGE_SIZE) {
+ clear_page(addr);
+ addr += PAGE_SIZE;
+ size -= PAGE_SIZE;
+ count -= PAGE_SIZE;
+ sector += PAGE_SIZE / 512;
+ cond_resched();
+ }
+ if (count > 0) {
+ memset(addr, 0, count);
+ sector += count / 512;
+ size -= count;
+ }
+ } while (size);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(xip_clear_blocks);
+
static long xip_get_addr(struct inode *inode, struct buffer_head *bh,
void **addr)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 197c3c6..a0c6812 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2508,11 +2508,17 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
+int xip_clear_blocks(struct inode *, sector_t block, long size);
int xip_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
#else
+static inline int xip_clear_blocks(struct inode *i, sector_t blk, long sz)
+{
+ return 0;
+}
+
static inline int xip_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
{
return 0;
--
1.8.4.rc3


2013-12-22 21:49:40

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 13/22] ext2: Remove ext2_use_xip

Replace ext2_use_xip() with test_opt(XIP) which expands to the same code

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/inode.c | 4 ++--
fs/ext2/namei.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 5a2cb24..9d6f2e1 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1291,7 +1291,7 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
- if (ext2_use_xip(inode->i_sb))
+ if (test_opt(inode->i_sb, XIP))
inode->i_flags |= S_XIP;
}

@@ -1393,7 +1393,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 256dd5f..0a7697a 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
return PTR_ERR(inode);

inode->i_op = &ext2_file_inode_operations;
- if (ext2_use_xip(inode->i_sb)) {
+ if (test_opt(inode->i_sb, XIP)) {
inode->i_mapping->a_ops = &ext2_aops_xip;
inode->i_fop = &ext2_xip_file_operations;
} else if (test_opt(inode->i_sb, NOBH)) {
--
1.8.4.rc3


2013-12-22 21:50:43

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 17/22] xip: Add xip_zero_page_range

This new function allows us to support hole-punch for XIP files by zeroing
a partial page, as opposed to the xip_truncate_page() function which can
only truncate to the end of the page. Reimplement xip_truncate_page() as
a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox <[email protected]>
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xip.c | 15 +++++++++------
include/linux/fs.h | 11 +++++++++--
2 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/xip.c b/fs/xip.c
index 991a705..e6e52ee 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -274,13 +274,16 @@ int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
EXPORT_SYMBOL_GPL(xip_fault);

/**
- * xip_truncate_page - handle a partial page being truncated in an XIP file
+ * xip_zero_page_range - zero a range within a page of an XIP file
* @inode: The file being truncated
* @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
* @get_block: The filesystem method used to get blocks
*
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an XIP file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in an XIP file. This is intended for hole-punch operations. If
+ * you are truncating a file, the helper function xip_truncate_page() may be
+ * more convenient.
*
* We work in terms of PAGE_CACHE_SIZE here for commonality with
* block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -288,12 +291,12 @@ EXPORT_SYMBOL_GPL(xip_fault);
* block size is smaller than PAGE_SIZE, we have to zero the rest of the page
* since the file might be mmaped.
*/
-int xip_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int xip_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+ get_block_t get_block)
{
struct buffer_head bh;
pgoff_t index = from >> PAGE_CACHE_SHIFT;
unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned length = PAGE_CACHE_ALIGN(from) - from;
int err;

/* Block boundary? Nothing to do */
@@ -315,4 +318,4 @@ int xip_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)

return 0;
}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
+EXPORT_SYMBOL_GPL(xip_zero_page_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a0c6812..3a4a217 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2509,7 +2509,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
int xip_clear_blocks(struct inode *, sector_t block, long size);
-int xip_truncate_page(struct inode *, loff_t from, get_block_t);
+int xip_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
@@ -2519,7 +2519,8 @@ static inline int xip_clear_blocks(struct inode *i, sector_t blk, long sz)
return 0;
}

-static inline int xip_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int xip_zero_page_range(struct inode *inode, loff_t from,
+ unsigned len, get_block_t gb)
{
return 0;
}
@@ -2538,6 +2539,12 @@ static inline int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
#endif

+/* PAGE_CACHE_ALIGN is defined in pagemap.h */
+#define xip_truncate_page(inode, from, get_block) \
+ xip_zero_page_range(inode, from, PAGE_CACHE_ALIGN(from) - from, \
+ get_block)
+
+
#ifdef CONFIG_BLOCK
typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode *inode,
loff_t file_offset);
--
1.8.4.rc3


2013-12-22 21:50:42

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 19/22] ext4: Add XIP functionality

From: Ross Zwisler <[email protected]>

This is a port of the XIP functionality found in the current version of
ext2.

Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
[heavily tweaked]
Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/ext4.txt | 2 ++
fs/ext4/ext4.h | 2 ++
fs/ext4/file.c | 47 ++++++++++++++++++++++++----
fs/ext4/indirect.c | 19 +++++++----
fs/ext4/inode.c | 64 +++++++++++++++++++++++++++-----------
fs/ext4/namei.c | 10 ++++--
fs/ext4/super.c | 39 ++++++++++++++++++++++-
7 files changed, 150 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 919a329..06dab5e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any
i_version Enable 64-bit inode version support. This option is
off by default.

+xip Use execute in place (no caching) if possible
+
Data Mode
=========
There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a48d367..5b160da 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -954,6 +954,7 @@ struct ext4_inode_info {
#define EXT4_MOUNT_ERRORS_MASK 0x00070
#define EXT4_MOUNT_MINIX_DF 0x00080 /* Mimics the Minix statfs */
#define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/
+#define EXT4_MOUNT_XIP 0x00200 /* Execute in place */
#define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */
#define EXT4_MOUNT_JOURNAL_DATA 0x00400 /* Write data to journal */
#define EXT4_MOUNT_ORDERED_DATA 0x00800 /* Flush data before commit */
@@ -2569,6 +2570,7 @@ extern const struct file_operations ext4_dir_operations;
/* file.c */
extern const struct inode_operations ext4_file_inode_operations;
extern const struct file_operations ext4_file_operations;
+extern const struct file_operations ext4_xip_file_operations;
extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
extern void ext4_unwritten_wait(struct inode *inode);

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3da2194..d6ae6be 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -190,7 +190,7 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
}
}

- if (unlikely(iocb->ki_filp->f_flags & O_DIRECT))
+ if (io_is_direct(iocb->ki_filp))
ret = ext4_file_dio_write(iocb, iov, nr_segs, pos);
else
ret = generic_file_aio_write(iocb, iov, nr_segs, pos);
@@ -198,6 +198,21 @@ ext4_file_write(struct kiocb *iocb, const struct iovec *iov,
return ret;
}

+#ifdef CONFIG_FS_XIP
+static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return xip_fault(vma, vmf, ext4_get_block);
+ /* Is this the right get_block? */
+}
+
+static const struct vm_operations_struct ext4_xip_vm_ops = {
+ .fault = ext4_xip_fault,
+ .remap_pages = generic_file_remap_pages,
+};
+#else
+#define ext4_xip_vm_ops ext4_file_vm_ops
+#endif
+
static const struct vm_operations_struct ext4_file_vm_ops = {
.fault = filemap_fault,
.page_mkwrite = ext4_page_mkwrite,
@@ -206,12 +221,13 @@ static const struct vm_operations_struct ext4_file_vm_ops = {

static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
{
- struct address_space *mapping = file->f_mapping;
-
- if (!mapping->a_ops->readpage)
- return -ENOEXEC;
file_accessed(file);
- vma->vm_ops = &ext4_file_vm_ops;
+ if (IS_XIP(file_inode(file))) {
+ vma->vm_ops = &ext4_xip_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ } else {
+ vma->vm_ops = &ext4_file_vm_ops;
+ }
return 0;
}

@@ -609,6 +625,25 @@ const struct file_operations ext4_file_operations = {
.fallocate = ext4_fallocate,
};

+#ifdef CONFIG_FS_XIP
+const struct file_operations ext4_xip_file_operations = {
+ .llseek = ext4_llseek,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = ext4_file_write,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .mmap = ext4_file_mmap,
+ .open = ext4_file_open,
+ .release = ext4_release_file,
+ .fsync = ext4_sync_file,
+ .fallocate = ext4_fallocate,
+};
+#endif
+
const struct inode_operations ext4_file_inode_operations = {
.setattr = ext4_setattr,
.getattr = ext4_getattr,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 594009f..ef2bdef 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -686,15 +686,22 @@ retry:
inode_dio_done(inode);
goto locked;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- ext4_get_block, NULL, NULL, 0);
+ if (IS_XIP(inode))
+ ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext4_get_block, NULL, 0);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov, offset,
+ nr_segs, ext4_get_block, NULL, NULL, 0);
inode_dio_done(inode);
} else {
locked:
- ret = blockdev_direct_IO(rw, iocb, inode, iov,
- offset, nr_segs, ext4_get_block);
+ if (IS_XIP(inode))
+ ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext4_get_block, NULL, 0);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iov,
+ offset, nr_segs, ext4_get_block);

if (unlikely((rw & WRITE) && ret < 0)) {
loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c767666..8b73d77 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -663,6 +663,18 @@ found:
WARN_ON(1);
}

+ /* this is probably wrong for ext4. unlike ext2, ext4 supports
+ * uninitialised extents, so we should probably be hooking
+ * into the "make it initialised" code instead. */
+ if (IS_XIP(inode)) {
+ ret = xip_clear_blocks(inode, map->m_pblk,
+ map->m_len << inode->i_blkbits);
+ if (ret) {
+ retval = ret;
+ goto has_zeroout;
+ }
+ }
+
/*
* If the extent has been zeroed out, we don't need to update
* extent status tree.
@@ -3152,13 +3164,14 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
}
- ret = __blockdev_direct_IO(rw, iocb, inode,
- inode->i_sb->s_bdev, iov,
- offset, nr_segs,
- get_block_func,
- ext4_end_io_dio,
- NULL,
- dio_flags);
+ if (IS_XIP(inode))
+ ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ get_block_func, ext4_end_io_dio, dio_flags);
+ else
+ ret = __blockdev_direct_IO(rw, iocb, inode,
+ inode->i_sb->s_bdev, iov, offset,
+ nr_segs, get_block_func,
+ ext4_end_io_dio, NULL, dio_flags);

/*
* Put our reference to io_end. This can free the io_end structure e.g.
@@ -3323,14 +3336,7 @@ void ext4_set_aops(struct inode *inode)
inode->i_mapping->a_ops = &ext4_aops;
}

-/*
- * ext4_block_zero_page_range() zeros out a mapping of length 'length'
- * starting from file offset 'from'. The range to be zero'd must
- * be contained with in one block. If the specified range exceeds
- * the end of the block it will be shortened to end of the block
- * that cooresponds to 'from'
- */
-static int ext4_block_zero_page_range(handle_t *handle,
+static int __ext4_block_zero_page_range(handle_t *handle,
struct address_space *mapping, loff_t from, loff_t length)
{
ext4_fsblk_t index = from >> PAGE_CACHE_SHIFT;
@@ -3421,6 +3427,22 @@ unlock:
}

/*
+ * ext4_block_zero_page_range() zeros out a mapping of length 'length'
+ * starting from file offset 'from'. The range to be zero'd must
+ * be contained with in one block. If the specified range exceeds
+ * the end of the block it will be shortened to end of the block
+ * that cooresponds to 'from'
+ */
+static int ext4_block_zero_page_range(handle_t *handle,
+ struct address_space *mapping, loff_t from, loff_t length)
+{
+ struct inode *inode = mapping->host;
+ if (IS_XIP(inode))
+ return xip_zero_page_range(inode, from, length, ext4_get_block);
+ return __ext4_block_zero_page_range(handle, mapping, from, length);
+}
+
+/*
* ext4_block_truncate_page() zeroes out a mapping from file offset `from'
* up to the end of the block which corresponds to `from'.
* This required during truncate. We need to physically zero the tail end
@@ -3939,7 +3961,8 @@ void ext4_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT4_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+ S_DIRSYNC | S_XIP);
if (flags & EXT4_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT4_APPEND_FL)
@@ -3950,6 +3973,8 @@ void ext4_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT4_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (test_opt(inode->i_sb, XIP))
+ inode->i_flags |= S_XIP;
}

/* Propagate flags from i_flags to EXT4_I(inode)->i_flags */
@@ -4201,7 +4226,10 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, XIP))
+ inode->i_fop = &ext4_xip_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext4_dir_inode_operations;
@@ -4653,7 +4681,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
*/
- truncate_pagecache(inode, inode->i_size);
+ truncate_pagecache(inode, inode->i_size);
}
/*
* We want to call ext4_truncate() even if attr->ia_size ==
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 5a0408d..ac68129 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2250,7 +2250,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, XIP))
+ inode->i_fop = &ext4_xip_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
err = ext4_add_nondir(handle, dentry, inode);
if (!err && IS_DIRSYNC(dir))
@@ -2314,7 +2317,10 @@ retry:
err = PTR_ERR(inode);
if (!IS_ERR(inode)) {
inode->i_op = &ext4_file_inode_operations;
- inode->i_fop = &ext4_file_operations;
+ if (test_opt(inode->i_sb, XIP))
+ inode->i_fop = &ext4_xip_file_operations;
+ else
+ inode->i_fop = &ext4_file_operations;
ext4_set_aops(inode);
d_tmpfile(dentry, inode);
err = ext4_orphan_add(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c977f4e..309a1a3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1156,7 +1156,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
Opt_noquota, Opt_barrier, Opt_nobarrier, Opt_err,
- Opt_usrquota, Opt_grpquota, Opt_i_version,
+ Opt_usrquota, Opt_grpquota, Opt_i_version, Opt_xip,
Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
@@ -1218,6 +1218,7 @@ static const match_table_t tokens = {
{Opt_barrier, "barrier"},
{Opt_nobarrier, "nobarrier"},
{Opt_i_version, "i_version"},
+ {Opt_xip, "xip"},
{Opt_stripe, "stripe=%u"},
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
@@ -1400,6 +1401,7 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
+ {Opt_xip, EXT4_MOUNT_XIP, MOPT_SET},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
@@ -1638,6 +1640,11 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
}
sbi->s_jquota_fmt = m->mount_opt;
#endif
+#ifndef CONFIG_FS_XIP
+ } else if (token == Opt_xip) {
+ ext4_msg(sb, KERN_INFO, "xip option not supported");
+ return -1;
+#endif
} else {
if (!args->from)
arg = 1;
@@ -3551,6 +3558,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
"both data=journal and dioread_nolock");
goto failed_mount;
}
+ if (test_opt(sb, XIP)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and xip");
+ goto failed_mount;
+ }
if (test_opt(sb, DELALLOC))
clear_opt(sb, DELALLOC);
}
@@ -3604,6 +3616,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount;
}

+ if (sbi->s_mount_opt & EXT4_MOUNT_XIP) {
+ if (blocksize != PAGE_SIZE) {
+ ext4_msg(sb, KERN_ERR,
+ "error: unsupported blocksize for xip");
+ goto failed_mount;
+ }
+ if (!sb->s_bdev->bd_disk->fops->direct_access) {
+ ext4_msg(sb, KERN_ERR,
+ "error: device does not support xip");
+ goto failed_mount;
+ }
+ }
+
if (sb->s_blocksize != blocksize) {
/* Validate the filesystem blocksize */
if (!sb_set_blocksize(sb, blocksize)) {
@@ -4798,6 +4823,18 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
err = -EINVAL;
goto restore_opts;
}
+ if (test_opt(sb, XIP)) {
+ ext4_msg(sb, KERN_ERR, "can't mount with "
+ "both data=journal and xip");
+ err = -EINVAL;
+ goto restore_opts;
+ }
+ }
+
+ if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT4_MOUNT_XIP) {
+ ext4_msg(sb, KERN_WARNING, "warning: refusing change of "
+ "xip flag with busy inodes while remounting");
+ sbi->s_mount_opt ^= EXT4_MOUNT_XIP;
}

if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
--
1.8.4.rc3


2013-12-22 21:49:35

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 08/22] Change xip_truncate_page to take a get_block parameter

Just like nobh_truncate_page() and block_truncate_page()

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/inode.c | 2 +-
fs/xip.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 5 +++--
mm/filemap_xip.c | 40 ----------------------------------------
4 files changed, 48 insertions(+), 43 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index a3f708f..5c1fb09 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1207,7 +1207,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);

if (IS_XIP(inode))
- error = xip_truncate_page(inode->i_mapping, newsize);
+ error = xip_truncate_page(inode, newsize, ext2_get_block);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/fs/xip.c b/fs/xip.c
index 5740aed..0c5026f 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -238,3 +238,47 @@ int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
return result;
}
EXPORT_SYMBOL_GPL(xip_fault);
+
+/**
+ * xip_truncate_page - handle a partial page being truncated in an XIP file
+ * @inode: The file being truncated
+ * @from: The file offset that is being truncated to
+ * @get_block: The filesystem method used to get blocks
+ *
+ * Similar to block_truncate_page(), this function can be called by a
+ * filesystem when it is truncating an XIP file to handle the partial page.
+ *
+ * We work in terms of PAGE_CACHE_SIZE here for commonality with
+ * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
+ * took care of disposing of the unnecessary blocks. Even if the filesystem
+ * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
+ * since the file might be mmaped.
+ */
+int xip_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+{
+ struct buffer_head bh;
+ pgoff_t index = from >> PAGE_CACHE_SHIFT;
+ unsigned offset = from & (PAGE_CACHE_SIZE-1);
+ unsigned length = PAGE_CACHE_ALIGN(from) - from;
+ int err;
+
+ /* Block boundary? Nothing to do */
+ if (!length)
+ return 0;
+
+ memset(&bh, 0, sizeof(bh));
+ bh.b_size = PAGE_CACHE_SIZE;
+ err = get_block(inode, index, &bh, 0);
+ if (err < 0)
+ return err;
+ if (buffer_mapped(&bh)) {
+ void *addr;
+ err = xip_get_addr(inode, &bh, &addr);
+ if (err)
+ return err;
+ memset(addr + offset, 0, length);
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(xip_truncate_page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5c6e175..d59a714 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2510,15 +2510,16 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
-extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+int xip_truncate_page(struct inode *, loff_t from, get_block_t);
ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
#else
-static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
+static inline int xip_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
{
return 0;
}
+
static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
const struct iovec *iov, loff_t offset, unsigned nr_segs,
get_block_t get_block, dio_iodone_t end_io, int flags)
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 9dd45f3..6316578 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -21,43 +21,3 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

-/*
- * truncate a page used for execute in place
- * functionality is analog to block_truncate_page but does use get_xip_mem
- * to get the page instead of page cache
- */
-int
-xip_truncate_page(struct address_space *mapping, loff_t from)
-{
- pgoff_t index = from >> PAGE_CACHE_SHIFT;
- unsigned offset = from & (PAGE_CACHE_SIZE-1);
- unsigned blocksize;
- unsigned length;
- void *xip_mem;
- unsigned long xip_pfn;
- int err;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- blocksize = 1 << mapping->host->i_blkbits;
- length = offset & (blocksize - 1);
-
- /* Block boundary? Nothing to do */
- if (!length)
- return 0;
-
- length = blocksize - length;
-
- err = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(err)) {
- if (err == -ENODATA)
- /* Hole? No need to truncate */
- return 0;
- else
- return err;
- }
- memset(xip_mem + offset, 0, length);
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_truncate_page);
--
1.8.4.rc3


2013-12-22 21:50:11

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 07/22] Rewrite XIP page fault handling

Instead of calling aops->get_xip_mem from the fault handler, allow the
filesystem to pass in a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/file.c | 25 ++++++++++++--
fs/xip.c | 99 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 13 ++++---
mm/filemap_xip.c | 92 --------------------------------------------------
4 files changed, 130 insertions(+), 99 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index b0eb1d4..47a0b1d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,27 @@
#include "xattr.h"
#include "acl.h"

+static int ext2_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return xip_fault(vma, vmf, ext2_get_block);
+}
+
+static const struct vm_operations_struct ext2_xip_vm_ops = {
+ .fault = ext2_xip_fault,
+ .remap_pages = generic_file_remap_pages,
+};
+
+static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ if (!IS_XIP(file_inode(file)))
+ return generic_file_mmap(file, vma);
+
+ file_accessed(file);
+ vma->vm_ops = &ext2_xip_vm_ops;
+ vma->vm_flags |= VM_MIXEDMAP;
+ return 0;
+}
+
/*
* Called when filp is released. This happens when all file descriptors
* for a single struct file are closed. Note that different open() calls
@@ -70,7 +91,7 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = generic_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
@@ -89,7 +110,7 @@ const struct file_operations ext2_xip_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
#endif
- .mmap = xip_file_mmap,
+ .mmap = ext2_file_mmap,
.open = dquot_file_open,
.release = ext2_release_file,
.fsync = ext2_fsync,
diff --git a/fs/xip.c b/fs/xip.c
index fcb6ffd..5740aed 100644
--- a/fs/xip.c
+++ b/fs/xip.c
@@ -18,6 +18,8 @@
#include <linux/buffer_head.h>
#include <linux/fs.h>
#include <linux/genhd.h>
+#include <linux/highmem.h>
+#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/uio.h>

@@ -31,6 +33,16 @@ static long xip_get_addr(struct inode *inode, struct buffer_head *bh,
return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
}

+static long xip_get_pfn(struct inode *inode, struct buffer_head *bh,
+ unsigned long *pfn)
+{
+ struct block_device *bdev = bh->b_bdev;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ void *addr;
+ sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+ return ops->direct_access(bdev, sector, &addr, pfn, bh->b_size);
+}
+
static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
loff_t start, loff_t end, unsigned nr_segs,
get_block_t get_block, struct buffer_head *bh)
@@ -139,3 +151,90 @@ ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
}
EXPORT_SYMBOL_GPL(xip_do_io);

+static inline void copy_user_bh(struct page *to, struct inode *inode,
+ struct buffer_head *bh, unsigned long vaddr)
+{
+ void *vfrom, *vto;
+ xip_get_addr(inode, bh, &vfrom); /* XXX: error handling */
+ vto = kmap_atomic(to);
+ copy_user_page(vto, vfrom, vaddr, to);
+ kunmap_atomic(vto);
+}
+
+/*
+ * xip_fault() is invoked via the vma operations vector for a
+ * mapped memory region to read in file data during a page fault.
+ *
+ * This function is derived from filemap_fault, but used for execute in place
+ */
+static int do_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ struct file *file = vma->vm_file;
+ struct inode *inode = file_inode(file);
+ struct address_space *mapping = file->f_mapping;
+ struct buffer_head bh;
+ unsigned long vaddr = (unsigned long)vmf->virtual_address;
+ sector_t block;
+ pgoff_t size;
+ unsigned long pfn;
+ int error;
+
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ return VM_FAULT_SIGBUS;
+
+ memset(&bh, 0, sizeof(bh));
+ block = (sector_t)vmf->pgoff << (PAGE_SHIFT - inode->i_blkbits);
+ bh.b_size = PAGE_SIZE;
+ error = get_block(inode, block, &bh, 0);
+
+ /* Don't allocate backing store if we're going to COW a hole */
+ if (!error && !buffer_mapped(&bh) && !vmf->cow_page) {
+ error = get_block(inode, block, &bh, 1);
+ }
+
+ if (error)
+ return VM_FAULT_SIGBUS;
+
+ /* We must recheck i_size under i_mmap_mutex */
+ mutex_lock(&mapping->i_mmap_mutex);
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (unlikely(vmf->pgoff >= size)) {
+ mutex_unlock(&mapping->i_mmap_mutex);
+ return VM_FAULT_SIGBUS;
+ }
+ if (vmf->cow_page) {
+ if (buffer_mapped(&bh))
+ copy_user_bh(vmf->cow_page, inode, &bh, vaddr);
+ else
+ clear_user_highpage(vmf->cow_page, vaddr);
+ return VM_FAULT_COWED;
+ }
+
+ error = xip_get_pfn(inode, &bh, &pfn);
+ if (error > 0)
+ error = vm_insert_mixed(vma, vaddr, pfn);
+ mutex_unlock(&mapping->i_mmap_mutex);
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM;
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if (error != -EBUSY)
+ BUG_ON(error);
+ return VM_FAULT_NOPAGE;
+}
+
+int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t get_block)
+{
+ int result;
+ struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ result = do_xip_fault(vma, vmf, get_block);
+ sb_end_pagefault(sb);
+
+ return result;
+}
+EXPORT_SYMBOL_GPL(xip_fault);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 26e9095..5c6e175 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -48,6 +48,7 @@ struct cred;
struct swap_info_struct;
struct seq_file;
struct workqueue_struct;
+struct vm_fault;

extern void __init inode_init(void);
extern void __init inode_init_early(void);
@@ -2509,14 +2510,10 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
extern int nonseekable_open(struct inode * inode, struct file * filp);

#ifdef CONFIG_FS_XIP
-extern ssize_t xip_file_read(struct file *filp, char __user *buf, size_t len,
- loff_t *ppos);
-extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
-extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
- size_t len, loff_t *ppos);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
+int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
@@ -2528,6 +2525,12 @@ static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
{
return -ENOTTY;
}
+
+static inline int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+ get_block_t gb)
+{
+ return 0;
+}
#endif

#ifdef CONFIG_BLOCK
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index bea0980..9dd45f3 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -22,98 +22,6 @@
#include <asm/io.h>

/*
- * Only one caller is allowed to try to create mappings at a time.
- * Should move down into filesystem code
- */
-static DEFINE_MUTEX(xip_sparse_mutex);
-
-static inline void copy_user_highdest(struct page *to, void *vfrom,
- unsigned long vaddr)
-{
- char *vto = kmap_atomic(to);
- copy_user_page(vto, vfrom, vaddr, to);
- kunmap_atomic(vto);
-}
-
-/*
- * xip_fault() is invoked via the vma operations vector for a
- * mapped memory region to read in file data during a page fault.
- *
- * This function is derived from filemap_fault, but used for execute in place
- */
-static int xip_file_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
-{
- struct file *file = vma->vm_file;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- unsigned long vaddr = (unsigned long)vmf->virtual_address;
- pgoff_t size;
- void *xip_mem;
- unsigned long xip_pfn;
- int error;
-
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (vmf->pgoff >= size)
- return VM_FAULT_SIGBUS;
-
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 0,
- &xip_mem, &xip_pfn);
- if (likely(!error))
- goto found;
- /* Don't allocate backing store if we're going to COW a hole */
- if (error == -ENODATA && !vmf->cow_page) {
- mutex_lock(&xip_sparse_mutex);
- error = mapping->a_ops->get_xip_mem(mapping, vmf->pgoff, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- }
- if (error != -ENODATA)
- return VM_FAULT_SIGBUS;
-
-found:
- /* We must recheck i_size under i_mmap_mutex */
- mutex_lock(&mapping->i_mmap_mutex);
- size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- mutex_unlock(&mapping->i_mmap_mutex);
- return VM_FAULT_SIGBUS;
- }
- if (vmf->cow_page) {
- if (error == -ENODATA)
- clear_user_highpage(vmf->cow_page, vaddr);
- else
- copy_user_highdest(vmf->cow_page, xip_mem, vaddr);
- return VM_FAULT_COWED;
- }
-
- error = vm_insert_mixed(vma, vaddr, xip_pfn);
- mutex_unlock(&mapping->i_mmap_mutex);
- if (error == -ENOMEM)
- return VM_FAULT_OOM;
- /* -EBUSY is fine, somebody else faulted on the same PTE */
- if (error != -EBUSY)
- BUG_ON(error);
- return VM_FAULT_NOPAGE;
-}
-
-static const struct vm_operations_struct xip_file_vm_ops = {
- .fault = xip_file_fault,
- .page_mkwrite = filemap_page_mkwrite,
- .remap_pages = generic_file_remap_pages,
-};
-
-int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-{
- BUG_ON(!file->f_mapping->a_ops->get_xip_mem);
-
- file_accessed(file);
- vma->vm_ops = &xip_file_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP;
- return 0;
-}
-EXPORT_SYMBOL_GPL(xip_file_mmap);

2013-12-22 21:49:33

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 06/22] Treat XIP like O_DIRECT

Instead of separate read and write methods, use the generic AIO
infrastructure. In addition to giving us support for AIO, this adds
the mising locking between read() and truncate() that was missing.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/Makefile | 1 +
fs/ext2/file.c | 6 +-
fs/ext2/inode.c | 7 +-
fs/xip.c | 141 ++++++++++++++++++++++++++++++++
include/linux/fs.h | 13 +++
mm/filemap.c | 6 +-
mm/filemap_xip.c | 231 -----------------------------------------------------
7 files changed, 167 insertions(+), 238 deletions(-)
create mode 100644 fs/xip.c

diff --git a/fs/Makefile b/fs/Makefile
index 4fe6df3..96061f2 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
obj-$(CONFIG_AIO) += aio.o
+obj-$(CONFIG_FS_XIP) += xip.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index a5b3a5d..b0eb1d4 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -81,8 +81,10 @@ const struct file_operations ext2_file_operations = {
#ifdef CONFIG_EXT2_FS_XIP
const struct file_operations ext2_xip_file_operations = {
.llseek = generic_file_llseek,
- .read = xip_file_read,
- .write = xip_file_write,
+ .read = do_sync_read,
+ .write = do_sync_write,
+ .aio_read = generic_file_aio_read,
+ .aio_write = generic_file_aio_write,
.unlocked_ioctl = ext2_ioctl,
#ifdef CONFIG_COMPAT
.compat_ioctl = ext2_compat_ioctl,
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0e18d1e..a3f708f 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -858,7 +858,11 @@ ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
struct inode *inode = mapping->host;
ssize_t ret;

- ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
+ if (IS_XIP(inode))
+ ret = xip_do_io(rw, iocb, inode, iov, offset, nr_segs,
+ ext2_get_block, NULL, DIO_LOCKING);
+ else
+ ret = blockdev_direct_IO(rw, iocb, inode, iov, offset, nr_segs,
ext2_get_block);
if (ret < 0 && (rw & WRITE))
ext2_write_failed(mapping, offset + iov_length(iov, nr_segs));
@@ -888,6 +892,7 @@ const struct address_space_operations ext2_aops = {
const struct address_space_operations ext2_aops_xip = {
.bmap = ext2_bmap,
.get_xip_mem = ext2_get_xip_mem,
+ .direct_IO = ext2_direct_IO,
};

const struct address_space_operations ext2_nobh_aops = {
diff --git a/fs/xip.c b/fs/xip.c
new file mode 100644
index 0000000..fcb6ffd
--- /dev/null
+++ b/fs/xip.c
@@ -0,0 +1,141 @@
+/*
+ * fs/xip.c - Execute In Place filesystem code
+ * Copyright (c) 2013 Intel Corporation
+ * Author: Matthew Wilcox <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/atomic.h>
+#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+#include <linux/mutex.h>
+#include <linux/uio.h>
+
+static long xip_get_addr(struct inode *inode, struct buffer_head *bh,
+ void **addr)
+{
+ struct block_device *bdev = bh->b_bdev;
+ const struct block_device_operations *ops = bdev->bd_disk->fops;
+ unsigned long pfn;
+ sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
+ return ops->direct_access(bdev, sector, addr, &pfn, bh->b_size);
+}
+
+static ssize_t xip_io(int rw, struct inode *inode, const struct iovec *iov,
+ loff_t start, loff_t end, unsigned nr_segs,
+ get_block_t get_block, struct buffer_head *bh)
+{
+ ssize_t retval;
+ unsigned seg = 0;
+ unsigned len;
+ unsigned copied = 0;
+ loff_t offset = start;
+ loff_t max = start;
+ void *addr;
+
+ while (offset < end) {
+ void __user *buf = iov[seg].iov_base + copied;
+ bool hole;
+
+ if (max == offset) {
+ sector_t block = offset >> inode->i_blkbits;
+ long size;
+ memset(bh, 0, sizeof(*bh));
+ bh->b_size = ALIGN(end - offset, PAGE_SIZE);
+ retval = get_block(inode, block, bh, rw == WRITE);
+ if (retval)
+ break;
+ if (buffer_mapped(bh)) {
+ hole = false;
+ retval = xip_get_addr(inode, bh, &addr);
+ if (retval < 0)
+ break;
+ size = retval;
+ } else {
+ if (rw == WRITE) {
+ retval = -EIO;
+ break;
+ }
+ hole = true;
+ size = bh->b_size;
+ addr = NULL;
+ }
+ max = offset + size;
+ }
+
+ len = min_t(unsigned, iov[seg].iov_len - copied, max - offset);
+
+ if (rw == WRITE)
+ len -= __copy_from_user_nocache(addr, buf, len);
+ else if (!hole)
+ len -= __copy_to_user(buf, addr, len);
+ else
+ len -= __clear_user(buf, len);
+
+ if (!len)
+ break;
+
+ offset += len;
+ copied += len;
+ if (copied == iov[seg].iov_len) {
+ seg++;
+ copied = 0;
+ }
+ }
+
+ return (offset == start) ? retval : offset - start;
+}
+
+/*
+ * Perform I/O to an XIP file. We follow the same rules as
+ * __blockdev_direct_IO with respect to locking
+ */
+ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned nr_segs,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ struct buffer_head bh;
+ unsigned seg;
+ ssize_t retval = -EINVAL;
+ loff_t end = offset;
+
+ for (seg = 0; seg < nr_segs; seg++)
+ end += iov[seg].iov_len;
+
+ if ((flags & DIO_LOCKING) && (rw == READ)) {
+ struct address_space *mapping = inode->i_mapping;
+ mutex_lock(&inode->i_mutex);
+ retval = filemap_write_and_wait_range(mapping, offset, end - 1);
+ if (retval) {
+ mutex_unlock(&inode->i_mutex);
+ goto out;
+ }
+ }
+
+ /* Protects against truncate */
+ atomic_inc(&inode->i_dio_count);
+
+ retval = xip_io(rw, inode, iov, offset, end, nr_segs, get_block, &bh);
+
+ if ((flags & DIO_LOCKING) && (rw == READ))
+ mutex_unlock(&inode->i_mutex);
+
+ inode_dio_done(inode);
+
+ if ((retval > 0) && end_io)
+ end_io(iocb, offset, retval, bh.b_private);
+ out:
+ return retval;
+}
+EXPORT_SYMBOL_GPL(xip_do_io);
+
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 80cfb42..26e9095 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2515,11 +2515,19 @@ extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
extern ssize_t xip_file_write(struct file *filp, const char __user *buf,
size_t len, loff_t *ppos);
extern int xip_truncate_page(struct address_space *mapping, loff_t from);
+ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
+ loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
#else
static inline int xip_truncate_page(struct address_space *mapping, loff_t from)
{
return 0;
}
+static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
+ const struct iovec *iov, loff_t offset, unsigned nr_segs,
+ get_block_t get_block, dio_iodone_t end_io, int flags)
+{
+ return -ENOTTY;
+}
#endif

#ifdef CONFIG_BLOCK
@@ -2669,6 +2677,11 @@ extern int generic_show_options(struct seq_file *m, struct dentry *root);
extern void save_mount_options(struct super_block *sb, char *options);
extern void replace_mount_options(struct super_block *sb, char *options);

+static inline bool io_is_direct(struct file *filp)
+{
+ return (filp->f_flags & O_DIRECT) || IS_XIP(file_inode(filp));
+}
+
static inline ino_t parent_ino(struct dentry *dentry)
{
ino_t res;
diff --git a/mm/filemap.c b/mm/filemap.c
index b7749a9..61a31f0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1417,8 +1417,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
if (retval)
return retval;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (filp->f_flags & O_DIRECT) {
+ if (io_is_direct(filp)) {
loff_t size;
struct address_space *mapping;
struct inode *inode;
@@ -2470,8 +2469,7 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
if (err)
goto out;

- /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
- if (unlikely(file->f_flags & O_DIRECT)) {
+ if (io_is_direct(file)) {
loff_t endbyte;
ssize_t written_buffered;

diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index cb088a6..bea0980 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -27,119 +27,6 @@
*/
static DEFINE_MUTEX(xip_sparse_mutex);

-/*
- * This is a file read routine for execute in place files, and uses
- * the mapping->a_ops->get_xip_mem() function for the actual low-level
- * stuff.
- *
- * Note the struct file* is not used at all. It may be NULL.
- */
-static ssize_t
-do_xip_mapping_read(struct address_space *mapping,
- struct file_ra_state *_ra,
- struct file *filp,
- char __user *buf,
- size_t len,
- loff_t *ppos)
-{
- struct inode *inode = mapping->host;
- pgoff_t index, end_index;
- unsigned long offset;
- loff_t isize, pos;
- size_t copied = 0, error = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- pos = *ppos;
- index = pos >> PAGE_CACHE_SHIFT;
- offset = pos & ~PAGE_CACHE_MASK;
-
- isize = i_size_read(inode);
- if (!isize)
- goto out;
-
- end_index = (isize - 1) >> PAGE_CACHE_SHIFT;
- do {
- unsigned long nr, left;
- void *xip_mem;
- unsigned long xip_pfn;
- int zero = 0;
-
- /* nr is the maximum number of bytes to copy from this page */
- nr = PAGE_CACHE_SIZE;
- if (index >= end_index) {
- if (index > end_index)
- goto out;
- nr = ((isize - 1) & ~PAGE_CACHE_MASK) + 1;
- if (nr <= offset) {
- goto out;
- }
- }
- nr = nr - offset;
- if (nr > len - copied)
- nr = len - copied;
-
- error = mapping->a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (unlikely(error)) {
- if (error == -ENODATA) {
- /* sparse */
- zero = 1;
- } else
- goto out;
- }
-
- /* If users can be writing to this page using arbitrary
- * virtual addresses, take care about potential aliasing
- * before reading the page on the kernel side.
- */
- if (mapping_writably_mapped(mapping))
- /* address based flush */ ;
-
- /*
- * Ok, we have the mem, so now we can copy it to user space...
- *
- * The actor routine returns how many bytes were actually used..
- * NOTE! This may not be the same as how much of a user buffer
- * we filled up (we may be padding etc), so we can only update
- * "pos" here (the actor routine has to update the user buffer
- * pointers and the remaining count).
- */
- if (!zero)
- left = __copy_to_user(buf+copied, xip_mem+offset, nr);
- else
- left = __clear_user(buf + copied, nr);
-
- if (left) {
- error = -EFAULT;
- goto out;
- }
-
- copied += (nr - left);
- offset += (nr - left);
- index += offset >> PAGE_CACHE_SHIFT;
- offset &= ~PAGE_CACHE_MASK;
- } while (copied < len);
-
-out:
- *ppos = pos + copied;
- if (filp)
- file_accessed(filp);
-
- return (copied ? copied : error);
-}
-
-ssize_t
-xip_file_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
-{
- if (!access_ok(VERIFY_WRITE, buf, len))
- return -EFAULT;
-
- return do_xip_mapping_read(filp->f_mapping, &filp->f_ra, filp,
- buf, len, ppos);
-}
-EXPORT_SYMBOL_GPL(xip_file_read);
-
static inline void copy_user_highdest(struct page *to, void *vfrom,
unsigned long vaddr)
{
@@ -226,124 +113,6 @@ int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
}
EXPORT_SYMBOL_GPL(xip_file_mmap);

-static ssize_t
-__xip_file_write(struct file *filp, const char __user *buf,
- size_t count, loff_t pos, loff_t *ppos)
-{
- struct address_space * mapping = filp->f_mapping;
- const struct address_space_operations *a_ops = mapping->a_ops;
- struct inode *inode = mapping->host;
- long status = 0;
- size_t bytes;
- ssize_t written = 0;
-
- BUG_ON(!mapping->a_ops->get_xip_mem);
-
- do {
- unsigned long index;
- unsigned long offset;
- size_t copied;
- void *xip_mem;
- unsigned long xip_pfn;
-
- offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
- index = pos >> PAGE_CACHE_SHIFT;
- bytes = PAGE_CACHE_SIZE - offset;
- if (bytes > count)
- bytes = count;
-
- status = a_ops->get_xip_mem(mapping, index, 0,
- &xip_mem, &xip_pfn);
- if (status == -ENODATA) {
- /* we allocate a new page unmap it */
- mutex_lock(&xip_sparse_mutex);
- status = a_ops->get_xip_mem(mapping, index, 1,
- &xip_mem, &xip_pfn);
- mutex_unlock(&xip_sparse_mutex);
- }
-
- if (status)
- break;
-
- copied = bytes -
- __copy_from_user_nocache(xip_mem + offset, buf, bytes);
-
- if (likely(copied > 0)) {
- status = copied;
-
- if (status >= 0) {
- written += status;
- count -= status;
- pos += status;
- buf += status;
- }
- }
- if (unlikely(copied != bytes))
- if (status >= 0)
- status = -EFAULT;
- if (status < 0)
- break;
- } while (count);
- *ppos = pos;
- /*
- * No need to use i_size_read() here, the i_size
- * cannot change under us because we hold i_mutex.
- */
- if (pos > inode->i_size) {
- i_size_write(inode, pos);
- mark_inode_dirty(inode);
- }
-
- return written ? written : status;
-}
-
-ssize_t
-xip_file_write(struct file *filp, const char __user *buf, size_t len,
- loff_t *ppos)
-{
- struct address_space *mapping = filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count;
- loff_t pos;
- ssize_t ret;
-
- mutex_lock(&inode->i_mutex);
-
- if (!access_ok(VERIFY_READ, buf, len)) {
- ret=-EFAULT;
- goto out_up;
- }
-
- pos = *ppos;
- count = len;
-
- /* We can write back this queue in page reclaim */
- current->backing_dev_info = mapping->backing_dev_info;
-
- ret = generic_write_checks(filp, &pos, &count, S_ISBLK(inode->i_mode));
- if (ret)
- goto out_backing;
- if (count == 0)
- goto out_backing;
-
- ret = file_remove_suid(filp);
- if (ret)
- goto out_backing;
-
- ret = file_update_time(filp);
- if (ret)
- goto out_backing;
-
- ret = __xip_file_write (filp, buf, count, pos, ppos);
-
- out_backing:
- current->backing_dev_info = NULL;
- out_up:
- mutex_unlock(&inode->i_mutex);
- return ret;
-}
-EXPORT_SYMBOL_GPL(xip_file_write);
-
/*
* truncate a page used for execute in place
* functionality is analog to block_truncate_page but does use get_xip_mem
--
1.8.4.rc3


2013-12-22 21:49:32

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 05/22] Introduce IS_XIP(inode)

Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip or (when appropriate)
ext2_use_xip.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext2/inode.c | 9 ++++++---
fs/ext2/xip.h | 2 --
include/linux/fs.h | 6 ++++++
3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 8a33764..0e18d1e 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
goto cleanup;
}

- if (ext2_use_xip(inode->i_sb)) {
+ if (IS_XIP(inode)) {
/*
* we need to clear the block
*/
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)

inode_dio_wait(inode);

- if (mapping_is_xip(inode->i_mapping))
+ if (IS_XIP(inode))
error = xip_truncate_page(inode->i_mapping, newsize);
else if (test_opt(inode->i_sb, NOBH))
error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
{
unsigned int flags = EXT2_I(inode)->i_flags;

- inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+ inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+ S_DIRSYNC | S_XIP);
if (flags & EXT2_SYNC_FL)
inode->i_flags |= S_SYNC;
if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_NOATIME;
if (flags & EXT2_DIRSYNC_FL)
inode->i_flags |= S_DIRSYNC;
+ if (ext2_use_xip(inode->i_sb))
+ inode->i_flags |= S_XIP;
}

/* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
}
int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
void **, unsigned long *);
-#define mapping_is_xip(map) unlikely(map->a_ops->get_xip_mem)
#else
-#define mapping_is_xip(map) 0
#define ext2_xip_verify_sb(sb) do { } while (0)
#define ext2_use_xip(sb) 0
#define ext2_clear_xip_target(inode, chain) 0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 121f11f..80cfb42 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1639,6 +1639,7 @@ struct super_operations {
#define S_IMA 1024 /* Inode has an associated IMA struct */
#define S_AUTOMOUNT 2048 /* Automount/referral quasi-directory */
#define S_NOSEC 4096 /* no suid or xattr security attributes */
+#define S_XIP 8192 /* Avoid using the page cache */

/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1676,6 +1677,11 @@ struct super_operations {
#define IS_IMA(inode) ((inode)->i_flags & S_IMA)
#define IS_AUTOMOUNT(inode) ((inode)->i_flags & S_AUTOMOUNT)
#define IS_NOSEC(inode) ((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_XIP(inode) ((inode)->i_flags & S_XIP)
+#else
+#define IS_XIP(inode) 0
+#endif

/*
* Inode state bits. Protected by inode->i_lock
--
1.8.4.rc3


2013-12-22 21:49:31

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 04/22] Change direct_access calling convention

In order to support accesses to larger chunks of memory, pass in a
'size' parameter (counted in bytes), and return the amount available at
that address.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/xip.txt | 15 +++++++++------
arch/powerpc/sysdev/axonram.c | 6 +++---
drivers/block/brd.c | 8 +++++---
drivers/s390/block/dcssblk.c | 19 ++++++++++---------
fs/ext2/xip.c | 30 +++++++++++++-----------------
include/linux/blkdev.h | 4 ++--
6 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
index 0466ee5..b62eabf 100644
--- a/Documentation/filesystems/xip.txt
+++ b/Documentation/filesystems/xip.txt
@@ -28,12 +28,15 @@ Implementation
Execute-in-place is implemented in three steps: block device operation,
address space operation, and file operations.

-A block device operation named direct_access is used to retrieve a
-reference (pointer) to a block on-disk. The reference is supposed to be
-cpu-addressable, physical address and remain valid until the release operation
-is performed. A struct block_device reference is used to address the device,
-and a sector_t argument is used to identify the individual block. As an
-alternative, memory technology devices can be used for this.
+A block device operation named direct_access is used to translate the
+block device sector number to a page frame number (pfn) that identifies
+the physical page for the memory. It also returns a kernel virtual
+address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested. The function should return the number
+of bytes that it can provide, although it must not exceed the number of
+bytes requested. It may also return a negative errno if an error occurs.

The block device operation is optional, these block devices support it as of
today:
diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 1fea249..28e69b0 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -138,9 +138,9 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio)
* axon_ram_direct_access - direct_access() method for block device
* @device, @sector, @data: see block_device_operations method
*/
-static int
+static long
axon_ram_direct_access(struct block_device *device, sector_t sector,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct axon_ram_bank *bank = device->bd_disk->private_data;
loff_t offset;
@@ -157,7 +157,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
*kaddr = (void *)(bank->ph_addr + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return min_t(unsigned long, size, bank->size - offset);
}

static const struct block_device_operations axon_ram_devops = {
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index d91f1a5..fa809cb 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -361,8 +361,8 @@ out:
}

#ifdef CONFIG_BLK_DEV_XIP
-static int brd_direct_access(struct block_device *bdev, sector_t sector,
- void **kaddr, unsigned long *pfn)
+static long brd_direct_access(struct block_device *bdev, sector_t sector,
+ void **kaddr, unsigned long *pfn, long size)
{
struct brd_device *brd = bdev->bd_disk->private_data;
struct page *page;
@@ -379,7 +379,9 @@ static int brd_direct_access(struct block_device *bdev, sector_t sector,
*kaddr = page_address(page);
*pfn = page_to_pfn(page);

- return 0;
+ /* Could optimistically check to see if the next page in the
+ * file is mapped to the next page of physical RAM */
+ return PAGE_SIZE;
}
#endif

diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 6eca019..c09a2b0 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -28,8 +28,8 @@
static int dcssblk_open(struct block_device *bdev, fmode_t mode);
static void dcssblk_release(struct gendisk *disk, fmode_t mode);
static void dcssblk_make_request(struct request_queue *q, struct bio *bio);
-static int dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn);
+static long dcssblk_direct_access(struct block_device *bdev, sector_t secnum,
+ void **kaddr, unsigned long *pfn, long size);

static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0";

@@ -865,25 +865,26 @@ fail:
bio_io_error(bio);
}

-static int
+static long
dcssblk_direct_access (struct block_device *bdev, sector_t secnum,
- void **kaddr, unsigned long *pfn)
+ void **kaddr, unsigned long *pfn, long size)
{
struct dcssblk_dev_info *dev_info;
- unsigned long pgoff;
+ unsigned long offset, dev_sz;

dev_info = bdev->bd_disk->private_data;
if (!dev_info)
return -ENODEV;
+ dev_sz = dev_info->end - dev_info->start;
if (secnum % (PAGE_SIZE/512))
return -EINVAL;
- pgoff = secnum / (PAGE_SIZE / 512);
- if ((pgoff+1)*PAGE_SIZE-1 > dev_info->end - dev_info->start)
+ offset = secnum * 512;
+ if (offset > dev_sz)
return -ERANGE;
- *kaddr = (void *) (dev_info->start+pgoff*PAGE_SIZE);
+ *kaddr = (void *) (dev_info->start + offset);
*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

- return 0;
+ return min_t(unsigned long, size, dev_sz - offset);
}

static void
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index e98171a..fa40091 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,18 +13,13 @@
#include "ext2.h"
#include "xip.h"

-static inline int
-__inode_direct_access(struct inode *inode, sector_t block,
- void **kaddr, unsigned long *pfn)
+static inline long __inode_direct_access(struct inode *inode, sector_t block,
+ void **kaddr, unsigned long *pfn, long size)
{
struct block_device *bdev = inode->i_sb->s_bdev;
const struct block_device_operations *ops = bdev->bd_disk->fops;
- sector_t sector;
-
- sector = block * (PAGE_SIZE / 512); /* ext2 block to bdev sector */
-
- BUG_ON(!ops->direct_access);
- return ops->direct_access(bdev, sector, kaddr, pfn);
+ sector_t sector = block * (PAGE_SIZE / 512);
+ return ops->direct_access(bdev, sector, kaddr, pfn, size);
}

static inline int
@@ -53,12 +48,13 @@ ext2_clear_xip_target(struct inode *inode, sector_t block)
{
void *kaddr;
unsigned long pfn;
- int rc;
+ long size;

- rc = __inode_direct_access(inode, block, &kaddr, &pfn);
- if (!rc)
- clear_page(kaddr);
- return rc;
+ size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
+ if (size < 0)
+ return size;
+ clear_page(kaddr);
+ return 0;
}

void ext2_xip_verify_sb(struct super_block *sb)
@@ -77,7 +73,7 @@ void ext2_xip_verify_sb(struct super_block *sb)
int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
void **kmem, unsigned long *pfn)
{
- int rc;
+ long rc;
sector_t block;

/* first, retrieve the sector number */
@@ -86,6 +82,6 @@ int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create,
return rc;

/* retrieve address of the target data */
- rc = __inode_direct_access(mapping->host, block, kmem, pfn);
- return rc;
+ rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE);
+ return (rc < 0) ? rc : 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1b135d4..00c8e1a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1566,8 +1566,8 @@ struct block_device_operations {
void (*release) (struct gendisk *, fmode_t);
int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long);
- int (*direct_access) (struct block_device *, sector_t,
- void **, unsigned long *);
+ long (*direct_access) (struct block_device *, sector_t,
+ void **, unsigned long *pfn, long size);
unsigned int (*check_events) (struct gendisk *disk,
unsigned int clearing);
/* ->media_changed() is DEPRECATED, use ->check_events() instead */
--
1.8.4.rc3


2013-12-22 21:49:30

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 03/22] axonram: Fix bug in direct_access

The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox <[email protected]>
---
arch/powerpc/sysdev/axonram.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 1c16141..1fea249 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -155,7 +155,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
}

*kaddr = (void *)(bank->ph_addr + offset);
- *pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+ *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;

return 0;
}
--
1.8.4.rc3


2013-12-23 13:41:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 21/22] Add support for pmd_faults

On Sun, Dec 22, 2013 at 04:49:48PM -0500, Matthew Wilcox wrote:
> Introduce the vm_ops ->pmd_fault handler, add a vm_insert_pfn_pmd function
> and create an xip_pmd_fault handler.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> ---
> fs/ext2/file.c | 9 ++++-
> fs/ext4/file.c | 9 ++++-
> fs/xip.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 8 ++---
> include/linux/mm.h | 4 +++
> mm/memory.c | 50 +++++++++++++++++++++++---
> 6 files changed, 169 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 6e6e803..7d6e492 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -31,8 +31,15 @@ static int ext2_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> return xip_fault(vma, vmf, ext2_get_block);
> }
>
> +static int ext2_xip_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmd, unsigned int flags)
> +{
> + return xip_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
> +}
> +
> static const struct vm_operations_struct ext2_xip_vm_ops = {
> .fault = ext2_xip_fault,
> + .pmd_fault = ext2_xip_pmd_fault,
> .remap_pages = generic_file_remap_pages,
> };
>
> @@ -43,7 +50,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)
>
> file_accessed(file);
> vma->vm_ops = &ext2_xip_vm_ops;
> - vma->vm_flags |= VM_MIXEDMAP;
> + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> return 0;
> }
> #else
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index d6ae6be..6211f56 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -205,8 +205,15 @@ static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> /* Is this the right get_block? */
> }
>
> +static int ext4_xip_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmd, unsigned int flags)
> +{
> + return xip_pmd_fault(vma, addr, pmd, flags, ext4_get_block);
> +}
> +
> static const struct vm_operations_struct ext4_xip_vm_ops = {
> .fault = ext4_xip_fault,
> + .pmd_fault = ext4_xip_pmd_fault,
> .remap_pages = generic_file_remap_pages,
> };
> #else
> @@ -224,7 +231,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
> file_accessed(file);
> if (IS_XIP(file_inode(file))) {
> vma->vm_ops = &ext4_xip_vm_ops;
> - vma->vm_flags |= VM_MIXEDMAP;
> + vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
> } else {
> vma->vm_ops = &ext4_file_vm_ops;
> }
> diff --git a/fs/xip.c b/fs/xip.c
> index e6e52ee..d032838 100644
> --- a/fs/xip.c
> +++ b/fs/xip.c
> @@ -273,6 +273,107 @@ int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> }
> EXPORT_SYMBOL_GPL(xip_fault);
>
> +/*
> + * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
> + * more often than one might expect in the below function.
> + */
> +#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
> +
> +/*
> + * We are willing to use a PMD mapping to cover the end of a file if it
> + * could be mapped by a complete PMD's worth of PTEs. That is, the last
> + * part of the file might be slightly smaller than PMD_SIZE, but as long
> + * as it's at least (PMD_SIZE - PAGE_SIZE + 1) bytes long, we allow the
> + * PMD mapping.
> + */
> +static int do_xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmd, unsigned int flags, get_block_t get_block)
> +{
> + struct file *file = vma->vm_file;
> + struct inode *inode = file_inode(file);
> + struct address_space *mapping = file->f_mapping;
> + struct buffer_head bh;
> + long length;
> + pgoff_t size, pgoff;
> + sector_t block;
> + unsigned long pfn;
> +
> + /* Fall back to PTEs if we're going to COW */
> + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> + return VM_FAULT_FALLBACK;

Why?

> + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if (pgoff >= size)
> + return VM_FAULT_SIGBUS;
> + if ((pgoff | PG_PMD_COLOUR) >= size)
> + return VM_FAULT_FALLBACK;

I don't think it's necessary to fallback in this case.
Do you care about SIGBUS behaviour or what?

> + memset(&bh, 0, sizeof(bh));
> + block = ((sector_t)pgoff & ~PG_PMD_COLOUR) <<
> + (PAGE_SHIFT - inode->i_blkbits);
> +
> + /* Start by seeing if we already have an allocated block */
> + bh.b_size = PMD_SIZE;
> + length = get_block(inode, block, &bh, 0);
> + if (length)
> + return VM_FAULT_SIGBUS;
> + if (buffer_mapped(&bh) && bh.b_size == PMD_SIZE)
> + goto insert;
> +
> + /* Next, try to allocate the whole thing */
> + bh.b_size = PMD_SIZE;
> + length = get_block(inode, block, &bh, 1);
> + if (length)
> + return VM_FAULT_SIGBUS;
> + if (bh.b_size == PMD_SIZE)
> + goto insert;
> +
> + return VM_FAULT_FALLBACK;
> +
> + insert:
> + length = xip_get_pfn(inode, &bh, &pfn);
> + if (length < 0)
> + return VM_FAULT_SIGBUS;
> + if (length < PMD_SIZE)
> + return VM_FAULT_FALLBACK;
> + if (pfn & PG_PMD_COLOUR)
> + return VM_FAULT_FALLBACK; /* not aligned */

Without assistance from get_unmapped_area() you will hit this all the time
(511 of 512 on x86_64).
And the check should be moved before get_block(), I think.

> +
> + /* We must recheck i_size under i_mmap_mutex */
> + mutex_lock(&mapping->i_mmap_mutex);
> + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> + if ((pgoff | PG_PMD_COLOUR) < size)
> + length = vm_insert_pfn_pmd(vma, address, pmd, pfn);
> + mutex_unlock(&mapping->i_mmap_mutex);
> +
> + if (pgoff >= size)
> + return VM_FAULT_SIGBUS;
> + if ((pgoff | PG_PMD_COLOUR) >= size)
> + return VM_FAULT_FALLBACK;
> + if (length == -ENOMEM)
> + return VM_FAULT_OOM;
> + /* -EBUSY is fine, somebody else faulted on the same PMD */
> + if (length != -EBUSY)
> + BUG_ON(length);
> + return VM_FAULT_NOPAGE;
> +}
> +
> +int xip_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmd, unsigned int flags, get_block_t get_block)
> +{
> + int result;
> + struct super_block *sb = file_inode(vma->vm_file)->i_sb;
> +
> + sb_start_pagefault(sb);
> + file_update_time(vma->vm_file);
> + result = do_xip_pmd_fault(vma, address, pmd, flags, get_block);
> + sb_end_pagefault(sb);
> +
> + return result;
> +}
> +EXPORT_SYMBOL_GPL(xip_pmd_fault);
> +
> /**
> * xip_zero_page_range - zero a range within a page of an XIP file
> * @inode: The file being truncated
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3a4a217..e789218 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2513,6 +2513,8 @@ int xip_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
> ssize_t xip_do_io(int rw, struct kiocb *, struct inode *, const struct iovec *,
> loff_t, unsigned segs, get_block_t, dio_iodone_t, int flags);
> int xip_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
> +int xip_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> + unsigned int flags, get_block_t);
> #else
> static inline int xip_clear_blocks(struct inode *i, sector_t blk, long sz)
> {
> @@ -2531,12 +2533,6 @@ static inline ssize_t xip_do_io(int rw, struct kiocb *iocb, struct inode *inode,
> {
> return -ENOTTY;
> }
> -
> -static inline int xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> - get_block_t gb)
> -{
> - return 0;
> -}
> #endif
>
> /* PAGE_CACHE_ALIGN is defined in pagemap.h */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e07c57c..d48913d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -212,6 +212,8 @@ struct vm_operations_struct {
> void (*open)(struct vm_area_struct * area);
> void (*close)(struct vm_area_struct * area);
> int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
> + int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
> + pmd_t *, unsigned int flags);
>
> /* notification that a previously read-only page is about to become
> * writable, if an error is returned it will cause a SIGBUS */
> @@ -1857,6 +1859,8 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn);
> int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn);
> +int vm_insert_pfn_pmd(struct vm_area_struct *, unsigned long addr, pmd_t *,
> + unsigned long pfn);
> int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len);
>
>
> diff --git a/mm/memory.c b/mm/memory.c
> index ecd63fe..0d332cf 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2249,6 +2249,41 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
> }
> EXPORT_SYMBOL(vm_insert_mixed);
>
> +static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmd, unsigned long pfn, pgprot_t prot)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + int retval;
> + pmd_t entry;
> + spinlock_t *ptl;
> +
> + ptl = pmd_lock(mm, pmd);
> + retval = -EBUSY;
> + if (!pmd_none(*pmd))
> + goto out_unlock;
> +
> + /* Ok, finally just insert the thing.. */
> + entry = pfn_pmd(pfn, prot); /* XXX: pmd_mkspecial? */
> + set_pmd_at(mm, addr, pmd, entry);
> + update_mmu_cache_pmd(vma, addr, pmd);

Here you need to allocate pgtable and deposit it to be able to split the page.

> + retval = 0;
> + out_unlock:
> + spin_unlock(ptl);
> + return retval;
> +}
> +
> +int vm_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> + pmd_t *pmd, unsigned long pfn)
> +{
> + BUG_ON(!(vma->vm_flags & VM_MIXEDMAP));
> +
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return -EFAULT;
> + return insert_pfn_pmd(vma, addr, pmd, pfn, vma->vm_page_prot);
> +}
> +EXPORT_SYMBOL(vm_insert_pfn_pmd);
> +
> /*
> * maps a range of physical memory into the requested pages. the old
> * mappings are removed. any references to nonexistent pages results
> @@ -3630,6 +3665,16 @@ out:
> return 0;
> }
>
> +static int create_huge_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmd, unsigned int flags)
> +{
> + if (!vma->vm_ops)
> + return do_huge_pmd_anonymous_page(mm, vma, address, pmd, flags);
> + if (vma->vm_ops->pmd_fault)
> + return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
> + return VM_FAULT_FALLBACK;
> +}
> +
> /*
> * These routines also need to handle stuff like marking pages dirty
> * and/or accessed for architectures that don't do it in hardware (most
> @@ -3722,10 +3767,7 @@ retry:
> if (!pmd)
> return VM_FAULT_OOM;
> if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
> - int ret = VM_FAULT_FALLBACK;
> - if (!vma->vm_ops)
> - ret = do_huge_pmd_anonymous_page(mm, vma, address,
> - pmd, flags);
> + int ret = create_huge_pmd(mm, vma, address, pmd, flags);
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> } else {
> --
> 1.8.4.rc3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Kirill A. Shutemov

2013-12-23 14:50:33

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 21/22] Add support for pmd_faults

On Mon, Dec 23, 2013 at 03:41:13PM +0200, Kirill A. Shutemov wrote:
> > + /* Fall back to PTEs if we're going to COW */
> > + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> > + return VM_FAULT_FALLBACK;
>
> Why?

If somebody mmaps a file with MAP_PRIVATE and changes a single byte, I
think we should allocate a single page to hold that change, not a PMD's
worth of pages.

> > + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > + if (pgoff >= size)
> > + return VM_FAULT_SIGBUS;
> > + if ((pgoff | PG_PMD_COLOUR) >= size)
> > + return VM_FAULT_FALLBACK;
>
> I don't think it's necessary to fallback in this case.
> Do you care about SIGBUS behaviour or what?

I'm looking to preserve the same behaviour we see with PTE mappings. I mean,
it's supposed to be _transparent_ huge pages, right?

> > + insert:
> > + length = xip_get_pfn(inode, &bh, &pfn);
> > + if (length < 0)
> > + return VM_FAULT_SIGBUS;
> > + if (length < PMD_SIZE)
> > + return VM_FAULT_FALLBACK;
> > + if (pfn & PG_PMD_COLOUR)
> > + return VM_FAULT_FALLBACK; /* not aligned */
>
> Without assistance from get_unmapped_area() you will hit this all the time
> (511 of 512 on x86_64).

Yes ... I thought you were working on that part for your transparent huge
page cache patchset?

> And the check should be moved before get_block(), I think.

Can't. The PFN we're checking is the PFN of the storage. We have to
call get_block() to find out where it's going to be.

> > +static int insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> > + pmd_t *pmd, unsigned long pfn, pgprot_t prot)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + int retval;
> > + pmd_t entry;
> > + spinlock_t *ptl;
> > +
> > + ptl = pmd_lock(mm, pmd);
> > + retval = -EBUSY;
> > + if (!pmd_none(*pmd))
> > + goto out_unlock;
> > +
> > + /* Ok, finally just insert the thing.. */
> > + entry = pfn_pmd(pfn, prot); /* XXX: pmd_mkspecial? */
> > + set_pmd_at(mm, addr, pmd, entry);
> > + update_mmu_cache_pmd(vma, addr, pmd);
>
> Here you need to allocate pgtable and deposit it to be able to split the page.

You've mentioned that in the past, and looking at it further is on my
todo list.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-23 15:04:11

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 21/22] Add support for pmd_faults

On Mon, Dec 23, 2013 at 07:50:31AM -0700, Matthew Wilcox wrote:
> > > + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > + if (pgoff >= size)
> > > + return VM_FAULT_SIGBUS;
> > > + if ((pgoff | PG_PMD_COLOUR) >= size)
> > > + return VM_FAULT_FALLBACK;
> >
> > I don't think it's necessary to fallback in this case.
> > Do you care about SIGBUS behaviour or what?
>
> I'm looking to preserve the same behaviour we see with PTE mappings. I mean,
> it's supposed to be _transparent_ huge pages, right?

Speaking of which ... we also need to check if the PMD is entirely within
the VMA. So, this is needed:

@@ -308,6 +308,11 @@ static int do_xip_pmd_fault(struct vm_area_struct *vma, uns
/* Fall back to PTEs if we're going to COW */
if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
return VM_FAULT_FALLBACK;
+ /* Fall back to PTEs if the mapping would extend outside the VMA */
+ if ((address & ~PMD_MASK) < vma->vm_start)
+ return VM_FAULT_FALLBACK;
+ if (ALIGN(address, PMD_SIZE) >= vma->vm_end)
+ return VM_FAULT_FALLBACK;

pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;

I'll fold that into the next round of patches.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-23 15:10:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 21/22] Add support for pmd_faults

On Mon, Dec 23, 2013 at 07:50:31AM -0700, Matthew Wilcox wrote:
> On Mon, Dec 23, 2013 at 03:41:13PM +0200, Kirill A. Shutemov wrote:
> > > + /* Fall back to PTEs if we're going to COW */
> > > + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> > > + return VM_FAULT_FALLBACK;
> >
> > Why?
>
> If somebody mmaps a file with MAP_PRIVATE and changes a single byte, I
> think we should allocate a single page to hold that change, not a PMD's
> worth of pages.

We try allocate new huge page in the same situation for AnonTHP. I don't
see a reason why not to do the same here. It would be much harder (if
possible) to collapse small page into a huge one later.

> > > + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > + if (pgoff >= size)
> > > + return VM_FAULT_SIGBUS;
> > > + if ((pgoff | PG_PMD_COLOUR) >= size)
> > > + return VM_FAULT_FALLBACK;
> >
> > I don't think it's necessary to fallback in this case.
> > Do you care about SIGBUS behaviour or what?
>
> I'm looking to preserve the same behaviour we see with PTE mappings. I mean,
> it's supposed to be _transparent_ huge pages, right?

We can't be totally transparent. At least from performance point of view.

The question is whether it's critical to preserve SIGBUS beheviour. I
would prefer to map last page in mapping with huge pages too, if it's
possible.

Do you know anyone who relay on SIGBUS for correctness?

>
> > > + insert:
> > > + length = xip_get_pfn(inode, &bh, &pfn);
> > > + if (length < 0)
> > > + return VM_FAULT_SIGBUS;
> > > + if (length < PMD_SIZE)
> > > + return VM_FAULT_FALLBACK;
> > > + if (pfn & PG_PMD_COLOUR)
> > > + return VM_FAULT_FALLBACK; /* not aligned */
> >
> > Without assistance from get_unmapped_area() you will hit this all the time
> > (511 of 512 on x86_64).
>
> Yes ... I thought you were working on that part for your transparent huge
> page cache patchset?

Yeah, I have patch for x86-64. Just a side note.

>
> > And the check should be moved before get_block(), I think.
>
> Can't. The PFN we're checking is the PFN of the storage. We have to
> call get_block() to find out where it's going to be.

I see.

--
Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2013-12-23 15:11:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 21/22] Add support for pmd_faults

On Mon, Dec 23, 2013 at 08:04:09AM -0700, Matthew Wilcox wrote:
> On Mon, Dec 23, 2013 at 07:50:31AM -0700, Matthew Wilcox wrote:
> > > > + pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > > > + size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > > + if (pgoff >= size)
> > > > + return VM_FAULT_SIGBUS;
> > > > + if ((pgoff | PG_PMD_COLOUR) >= size)
> > > > + return VM_FAULT_FALLBACK;
> > >
> > > I don't think it's necessary to fallback in this case.
> > > Do you care about SIGBUS behaviour or what?
> >
> > I'm looking to preserve the same behaviour we see with PTE mappings. I mean,
> > it's supposed to be _transparent_ huge pages, right?
>
> Speaking of which ... we also need to check if the PMD is entirely within
> the VMA. So, this is needed:
>
> @@ -308,6 +308,11 @@ static int do_xip_pmd_fault(struct vm_area_struct *vma, uns
> /* Fall back to PTEs if we're going to COW */
> if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> return VM_FAULT_FALLBACK;
> + /* Fall back to PTEs if the mapping would extend outside the VMA */
> + if ((address & ~PMD_MASK) < vma->vm_start)
> + return VM_FAULT_FALLBACK;
> + if (ALIGN(address, PMD_SIZE) >= vma->vm_end)
> + return VM_FAULT_FALLBACK;
>
> pgoff = ((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
>
> I'll fold that into the next round of patches.

Agree with this part.

--
Kirill A. Shutemov

2013-12-23 18:42:24

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 21/22] Add support for pmd_faults

On Mon, Dec 23, 2013 at 05:10:03PM +0200, Kirill A. Shutemov wrote:
> On Mon, Dec 23, 2013 at 07:50:31AM -0700, Matthew Wilcox wrote:
> > On Mon, Dec 23, 2013 at 03:41:13PM +0200, Kirill A. Shutemov wrote:
> > > > + /* Fall back to PTEs if we're going to COW */
> > > > + if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED))
> > > > + return VM_FAULT_FALLBACK;
> > >
> > > Why?
> >
> > If somebody mmaps a file with MAP_PRIVATE and changes a single byte, I
> > think we should allocate a single page to hold that change, not a PMD's
> > worth of pages.
>
> We try allocate new huge page in the same situation for AnonTHP. I don't
> see a reason why not to do the same here. It would be much harder (if
> possible) to collapse small page into a huge one later.

OK, I'll look at what AnonTHP does here. There may be good reasons to
do it differently, but in the absence of data, we should probably handle
the two cases the same.

> > > > + if ((pgoff | PG_PMD_COLOUR) >= size)
> > > > + return VM_FAULT_FALLBACK;
> > >
> > > I don't think it's necessary to fallback in this case.
> > > Do you care about SIGBUS behaviour or what?
> >
> > I'm looking to preserve the same behaviour we see with PTE mappings. I mean,
> > it's supposed to be _transparent_ huge pages, right?
>
> We can't be totally transparent. At least from performance point of view.
>
> The question is whether it's critical to preserve SIGBUS beheviour. I
> would prefer to map last page in mapping with huge pages too, if it's
> possible.
>
> Do you know anyone who relay on SIGBUS for correctness?

Oh, I remember the real reason now. If we install a PMD that hangs off
the end of the file then by reading past i_size, we can read the blocks of
whatever happens to be in storage after the end of the file, which could
be another file's data. This doesn't happen for the PTE case because the
existing code only works for filesystems with a block size == PAGE_SIZE.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2013-12-23 18:54:45

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 21/22] Add support for pmd_faults

On Mon, Dec 23, 2013 at 11:42:22AM -0700, Matthew Wilcox wrote:
> > Do you know anyone who relay on SIGBUS for correctness?
>
> Oh, I remember the real reason now. If we install a PMD that hangs off
> the end of the file then by reading past i_size, we can read the blocks of
> whatever happens to be in storage after the end of the file, which could
> be another file's data. This doesn't happen for the PTE case because the
> existing code only works for filesystems with a block size == PAGE_SIZE.

I see. It's valid reason. Probably, it's better to add comment there.

--
Kirill A. Shutemov