v3->v4:
- dropped the ->page_mkwrite portion of these patches, we don't actually see
issues with mkwrite in production, and I kept running into corner cases where
I missed something important. I want to wait on that part until I have a real
reason to do the work so I can have a solid test in place.
- completely reworked how we drop the mmap_sem in filemap_fault and cleaned it
up a bit. Once I started actually testing this with our horrifying reproducer
I saw a bunch of places where we still ended up doing IO under the mmap_sem
because I had missed a few corner cases. Fixed this by reworking
filemap_fault to only return RETRY once it has a completely uptodate page
ready to be used.
- lots more testing, including production testing.
v2->v3:
- dropped the RFC, ready for a real review.
- fixed a kbuild error for !MMU configs.
- dropped the swapcache patches since Johannes is still working on those parts.
v1->v2:
- reworked so it only affects x86, since its the only arch I can build and test.
- fixed the fact that do_page_mkwrite wasn't actually sending ALLOW_RETRY down
to ->page_mkwrite.
- fixed error handling in do_page_mkwrite/callers to explicitly catch
VM_FAULT_RETRY.
- fixed btrfs to set ->cached_page properly.
-- Original message --
Now that we have proper isolation in place with cgroups2 we have started going
through and fixing the various priority inversions. Most are all gone now, but
this one is sort of weird since it's not necessarily a priority inversion that
happens within the kernel, but rather because of something userspace does.
We have giant applications that we want to protect, and parts of these giant
applications do things like watch the system state to determine how healthy the
box is for load balancing and such. This involves running 'ps' or other such
utilities. These utilities will often walk /proc/<pid>/whatever, and these
files can sometimes need to down_read(&task->mmap_sem). Not usually a big deal,
but we noticed when we are stress testing that sometimes our protected
application has latency spikes trying to get the mmap_sem for tasks that are in
lower priority cgroups.
This is because any down_write() on a semaphore essentially turns it into a
mutex, so even if we currently have it held for reading, any new readers will
not be allowed on to keep from starving the writer. This is fine, except a
lower priority task could be stuck doing IO because it has been throttled to the
point that its IO is taking much longer than normal. But because a higher
priority group depends on this completing it is now stuck behind lower priority
work.
In order to avoid this particular priority inversion we want to use the existing
retry mechanism to stop from holding the mmap_sem at all if we are going to do
IO. This already exists in the read case sort of, but needed to be extended for
more than just grabbing the page lock. With io.latency we throttle at
submit_bio() time, so the readahead stuff can block and even page_cache_read can
block, so all these paths need to have the mmap_sem dropped.
The other big thing is ->page_mkwrite. btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation. We use the same retry method as the read path, and simply
cache the page and verify the page is still setup properly the next pass through
->page_mkwrite().
I've tested these patches with xfstests and there are no regressions. Let me
know what you think. Thanks,
Josef
Currently we only drop the mmap_sem if there is contention on the page
lock. The idea is that we issue readahead and then go to lock the page
while it is under IO and we want to not hold the mmap_sem during the IO.
The problem with this is the assumption that the readahead does
anything. In the case that the box is under extreme memory or IO
pressure we may end up not reading anything at all for readahead, which
means we will end up reading in the page under the mmap_sem.
Instead rework filemap fault path to drop the mmap sem at any point that
we may do IO or block for an extended period of time. This includes
while issuing readahead, locking the page, or needing to call ->readpage
because readahead did not occur. Then once we have a fully uptodate
page we can return with VM_FAULT_RETRY and come back again to find our
nicely in-cache page that was gotten outside of the mmap_sem.
Signed-off-by: Josef Bacik <[email protected]>
---
mm/filemap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 93 insertions(+), 20 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index f068712c2525..5e76b24b2a0f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2304,28 +2304,44 @@ EXPORT_SYMBOL(generic_file_read_iter);
#ifdef CONFIG_MMU
#define MMAP_LOTSAMISS (100)
+static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
+ struct vm_area_struct *vma,
+ int flags)
+{
+ if (fpin)
+ return fpin;
+ if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
+ FAULT_FLAG_ALLOW_RETRY) {
+ fpin = get_file(vma->vm_file);
+ up_read(&vma->vm_mm->mmap_sem);
+ }
+ return fpin;
+}
/*
* Synchronous readahead happens when we don't even find
* a page in the page cache at all.
*/
-static void do_sync_mmap_readahead(struct vm_area_struct *vma,
- struct file_ra_state *ra,
- struct file *file,
- pgoff_t offset)
+static struct file *do_sync_mmap_readahead(struct vm_area_struct *vma,
+ struct file_ra_state *ra,
+ struct file *file,
+ pgoff_t offset,
+ int flags)
{
struct address_space *mapping = file->f_mapping;
+ struct file *fpin = NULL;
/* If we don't want any read-ahead, don't bother */
if (vma->vm_flags & VM_RAND_READ)
- return;
+ return fpin;
if (!ra->ra_pages)
- return;
+ return fpin;
if (vma->vm_flags & VM_SEQ_READ) {
+ fpin = maybe_unlock_mmap_for_io(fpin, vma, flags);
page_cache_sync_readahead(mapping, ra, file, offset,
ra->ra_pages);
- return;
+ return fpin;
}
/* Avoid banging the cache line if not needed */
@@ -2337,37 +2353,43 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
* stop bothering with read-ahead. It will only hurt.
*/
if (ra->mmap_miss > MMAP_LOTSAMISS)
- return;
+ return fpin;
/*
* mmap read-around
*/
+ fpin = maybe_unlock_mmap_for_io(fpin, vma, flags);
ra->start = max_t(long, 0, offset - ra->ra_pages / 2);
ra->size = ra->ra_pages;
ra->async_size = ra->ra_pages / 4;
ra_submit(ra, mapping, file);
+ return fpin;
}
/*
* Asynchronous readahead happens when we find the page and PG_readahead,
* so we want to possibly extend the readahead further..
*/
-static void do_async_mmap_readahead(struct vm_area_struct *vma,
- struct file_ra_state *ra,
- struct file *file,
- struct page *page,
- pgoff_t offset)
+static struct file *do_async_mmap_readahead(struct vm_area_struct *vma,
+ struct file_ra_state *ra,
+ struct file *file,
+ struct page *page,
+ pgoff_t offset, int flags)
{
struct address_space *mapping = file->f_mapping;
+ struct file *fpin = NULL;
/* If we don't want any read-ahead, don't bother */
if (vma->vm_flags & VM_RAND_READ)
- return;
+ return fpin;
if (ra->mmap_miss > 0)
ra->mmap_miss--;
- if (PageReadahead(page))
+ if (PageReadahead(page)) {
+ fpin = maybe_unlock_mmap_for_io(fpin, vma, flags);
page_cache_async_readahead(mapping, ra, file,
page, offset, ra->ra_pages);
+ }
+ return fpin;
}
/**
@@ -2397,6 +2419,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
{
int error;
struct file *file = vmf->vma->vm_file;
+ struct file *fpin = NULL;
struct address_space *mapping = file->f_mapping;
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
@@ -2418,10 +2441,12 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
- do_async_mmap_readahead(vmf->vma, ra, file, page, offset);
+ fpin = do_async_mmap_readahead(vmf->vma, ra, file, page, offset,
+ vmf->flags);
} else if (!page) {
/* No page in the page cache at all */
- do_sync_mmap_readahead(vmf->vma, ra, file, offset);
+ fpin = do_sync_mmap_readahead(vmf->vma, ra, file, offset,
+ vmf->flags);
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
@@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
return vmf_error(-ENOMEM);
}
- if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
- put_page(page);
- return ret | VM_FAULT_RETRY;
+ /*
+ * We are open-coding lock_page_or_retry here because we want to do the
+ * readpage if necessary while the mmap_sem is dropped. If there
+ * happens to be a lock on the page but it wasn't being faulted in we'd
+ * come back around without ALLOW_RETRY set and then have to do the IO
+ * under the mmap_sem, which would be a bummer.
+ */
+ if (!trylock_page(page)) {
+ fpin = maybe_unlock_mmap_for_io(fpin, vmf->vma, vmf->flags);
+ if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
+ goto out_retry;
+ if (vmf->flags & FAULT_FLAG_KILLABLE) {
+ if (__lock_page_killable(page)) {
+ /*
+ * If we don't have the right flags for
+ * maybe_unlock_mmap_for_io to do it's thing we
+ * still need to drop the sem and return
+ * VM_FAULT_RETRY so the upper layer checks the
+ * signal and takes the appropriate action.
+ */
+ if (!fpin)
+ up_read(&vmf->vma->vm_mm->mmap_sem);
+ goto out_retry;
+ }
+ } else
+ __lock_page(page);
}
/* Did it get truncated? */
@@ -2453,6 +2501,16 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (unlikely(!PageUptodate(page)))
goto page_not_uptodate;
+ /*
+ * We've made it this far and we had to drop our mmap_sem, now is the
+ * time to return to the upper layer and have it re-find the vma and
+ * redo the fault.
+ */
+ if (fpin) {
+ unlock_page(page);
+ goto out_retry;
+ }
+
/*
* Found the page and have a reference on it.
* We must recheck i_size under page lock.
@@ -2475,12 +2533,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* and we need to check for errors.
*/
ClearPageError(page);
+ fpin = maybe_unlock_mmap_for_io(fpin, vmf->vma, vmf->flags);
error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);
if (!PageUptodate(page))
error = -EIO;
}
+ if (fpin)
+ goto out_retry;
put_page(page);
if (!error || error == AOP_TRUNCATED_PAGE)
@@ -2489,6 +2550,18 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
/* Things didn't work out. Return zero to tell the mm layer so. */
shrink_readahead_size_eio(file, ra);
return VM_FAULT_SIGBUS;
+
+out_retry:
+ /*
+ * We dropped the mmap_sem, we need to return to the fault handler to
+ * re-find the vma and come back and find our hopefully still populated
+ * page.
+ */
+ if (page)
+ put_page(page);
+ if (fpin)
+ fput(fpin);
+ return ret | VM_FAULT_RETRY;
}
EXPORT_SYMBOL(filemap_fault);
--
2.14.3
If we drop the mmap_sem we have to redo the vma lookup which requires
redoing the fault handler. Chances are we will just come back to the
same page, so save this page in our vmf->cached_page and reuse it in the
next loop through the fault handler.
Signed-off-by: Josef Bacik <[email protected]>
---
mm/filemap.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 43 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 5e76b24b2a0f..d4385b704e04 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2392,6 +2392,35 @@ static struct file *do_async_mmap_readahead(struct vm_area_struct *vma,
return fpin;
}
+static int vmf_has_cached_page(struct vm_fault *vmf, struct page **page)
+{
+ struct page *cached_page = vmf->cached_page;
+ struct mm_struct *mm = vmf->vma->vm_mm;
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ pgoff_t offset = vmf->pgoff;
+
+ if (!cached_page)
+ return 0;
+
+ if (vmf->flags & FAULT_FLAG_KILLABLE) {
+ int ret = lock_page_killable(cached_page);
+ if (ret) {
+ up_read(&mm->mmap_sem);
+ return ret;
+ }
+ } else
+ lock_page(cached_page);
+ vmf->cached_page = NULL;
+ if (cached_page->mapping == mapping &&
+ cached_page->index == offset) {
+ *page = cached_page;
+ } else {
+ unlock_page(cached_page);
+ put_page(cached_page);
+ }
+ return 0;
+}
+
/**
* filemap_fault - read in file data for page fault handling
* @vmf: struct vm_fault containing details of the fault
@@ -2425,13 +2454,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
pgoff_t max_off;
- struct page *page;
+ struct page *page = NULL;
vm_fault_t ret = 0;
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(offset >= max_off))
return VM_FAULT_SIGBUS;
+ /*
+ * We may have read in the page already and have a page from an earlier
+ * loop. If so we need to see if this page is still valid, and if not
+ * do the whole dance over again.
+ */
+ error = vmf_has_cached_page(vmf, &page);
+ if (error)
+ goto out_retry;
+ if (page)
+ goto have_cached_page;
+
/*
* Do we have something in the page cache already?
*/
@@ -2492,6 +2532,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
put_page(page);
goto retry_find;
}
+have_cached_page:
VM_BUG_ON_PAGE(page->index != offset, page);
/*
@@ -2558,7 +2599,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
* page.
*/
if (page)
- put_page(page);
+ vmf->cached_page = page;
if (fpin)
fput(fpin);
return ret | VM_FAULT_RETRY;
--
2.14.3
If we do not have a page at filemap_fault time we'll do this weird
forced page_cache_read thing to populate the page, and then drop it
again and loop around and find it. This makes for 2 ways we can read a
page in filemap_fault, and it's not really needed. Instead add a
FGP_FOR_MMAP flag so that pagecache_get_page() will return a unlocked
page that's in pagecache. Then use the normal page locking and readpage
logic already in filemap_fault. This simplifies the no page in page
cache case significantly.
Signed-off-by: Josef Bacik <[email protected]>
---
include/linux/pagemap.h | 1 +
mm/filemap.c | 73 ++++++++++---------------------------------------
2 files changed, 16 insertions(+), 58 deletions(-)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 226f96f0dee0..b13c2442281f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -252,6 +252,7 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping,
#define FGP_WRITE 0x00000008
#define FGP_NOFS 0x00000010
#define FGP_NOWAIT 0x00000020
+#define FGP_FOR_MMAP 0x00000040
struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
int fgp_flags, gfp_t cache_gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index 81adec8ee02c..f068712c2525 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1503,6 +1503,9 @@ EXPORT_SYMBOL(find_lock_entry);
* @gfp_mask and added to the page cache and the VM's LRU
* list. The page is returned locked and with an increased
* refcount. Otherwise, NULL is returned.
+ * - FGP_FOR_MMAP: Similar to FGP_CREAT, only it unlocks the page after it has
+ * added it to pagecache, as the mmap code expects to do it's own special
+ * locking dance.
*
* If FGP_LOCK or FGP_CREAT are specified then the function may sleep even
* if the GFP flags specified for FGP_CREAT are atomic.
@@ -1555,7 +1558,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
if (!page)
return NULL;
- if (WARN_ON_ONCE(!(fgp_flags & FGP_LOCK)))
+ if (WARN_ON_ONCE(!(fgp_flags & (FGP_LOCK | FGP_FOR_MMAP))))
fgp_flags |= FGP_LOCK;
/* Init accessed so avoid atomic mark_page_accessed later */
@@ -1569,6 +1572,13 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
if (err == -EEXIST)
goto repeat;
}
+
+ /*
+ * add_to_page_cache_lru lock's the page, and for mmap we expect
+ * a unlocked page.
+ */
+ if (fgp_flags & FGP_FOR_MMAP)
+ unlock_page(page);
}
return page;
@@ -2293,39 +2303,6 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
EXPORT_SYMBOL(generic_file_read_iter);
#ifdef CONFIG_MMU
-/**
- * page_cache_read - adds requested page to the page cache if not already there
- * @file: file to read
- * @offset: page index
- * @gfp_mask: memory allocation flags
- *
- * This adds the requested page to the page cache if it isn't already there,
- * and schedules an I/O to read in its contents from disk.
- */
-static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
-{
- struct address_space *mapping = file->f_mapping;
- struct page *page;
- int ret;
-
- do {
- page = __page_cache_alloc(gfp_mask);
- if (!page)
- return -ENOMEM;
-
- ret = add_to_page_cache_lru(page, mapping, offset, gfp_mask);
- if (ret == 0)
- ret = mapping->a_ops->readpage(file, page);
- else if (ret == -EEXIST)
- ret = 0; /* losing race to add is OK */
-
- put_page(page);
-
- } while (ret == AOP_TRUNCATED_PAGE);
-
- return ret;
-}
-
#define MMAP_LOTSAMISS (100)
/*
@@ -2449,9 +2426,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
retry_find:
- page = find_get_page(mapping, offset);
+ page = pagecache_get_page(mapping, offset,
+ FGP_CREAT|FGP_FOR_MMAP,
+ vmf->gfp_mask);
if (!page)
- goto no_cached_page;
+ return vmf_error(-ENOMEM);
}
if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
@@ -2488,28 +2467,6 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
vmf->page = page;
return ret | VM_FAULT_LOCKED;
-no_cached_page:
- /*
- * We're only likely to ever get here if MADV_RANDOM is in
- * effect.
- */
- error = page_cache_read(file, offset, vmf->gfp_mask);
-
- /*
- * The page we want has now been added to the page cache.
- * In the unlikely event that someone removed it in the
- * meantime, we'll just come back here and read it again.
- */
- if (error >= 0)
- goto retry_find;
-
- /*
- * An error return from page_cache_read can result if the
- * system is low on memory, or a problem occurs while trying
- * to schedule I/O.
- */
- return vmf_error(error);
-
page_not_uptodate:
/*
* Umm, take care of errors if the page isn't up-to-date.
--
2.14.3
We want to be able to cache the result of a previous loop of a page
fault in the case that we use VM_FAULT_RETRY, so introduce
handle_mm_fault_cacheable that will take a struct vm_fault directly, add
a ->cached_page field to vm_fault, and add helpers to init/cleanup the
struct vm_fault.
I've converted x86, other arch's can follow suit if they so wish, it's
relatively straightforward.
Signed-off-by: Josef Bacik <[email protected]>
---
arch/x86/mm/fault.c | 6 +++-
include/linux/mm.h | 31 +++++++++++++++++++++
mm/memory.c | 79 ++++++++++++++++++++++++++++++++---------------------
3 files changed, 84 insertions(+), 32 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 71d4b9d4d43f..8060ad6a34da 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1230,6 +1230,7 @@ void do_user_addr_fault(struct pt_regs *regs,
unsigned long hw_error_code,
unsigned long address)
{
+ struct vm_fault vmf = {};
unsigned long sw_error_code;
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -1420,7 +1421,8 @@ void do_user_addr_fault(struct pt_regs *regs,
* userland). The return to userland is identified whenever
* FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
*/
- fault = handle_mm_fault(vma, address, flags);
+ vm_fault_init(&vmf, vma, address, flags);
+ fault = handle_mm_fault_cacheable(&vmf);
major |= fault & VM_FAULT_MAJOR;
/*
@@ -1436,6 +1438,7 @@ void do_user_addr_fault(struct pt_regs *regs,
if (!fatal_signal_pending(tsk))
goto retry;
}
+ vm_fault_cleanup(&vmf);
/* User mode? Just return to handle the fatal exception */
if (flags & FAULT_FLAG_USER)
@@ -1446,6 +1449,7 @@ void do_user_addr_fault(struct pt_regs *regs,
return;
}
+ vm_fault_cleanup(&vmf);
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, sw_error_code, address, fault);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..3f1dda389aa7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -360,6 +360,12 @@ struct vm_fault {
* is set (which is also implied by
* VM_FAULT_ERROR).
*/
+ struct page *cached_page; /* ->fault handlers that return
+ * VM_FAULT_RETRY can store their
+ * previous page here to be reused the
+ * next time we loop through the fault
+ * handler for faster lookup.
+ */
/* These three entries are valid only while holding ptl lock */
pte_t *pte; /* Pointer to pte entry matching
* the 'address'. NULL if the page
@@ -378,6 +384,16 @@ struct vm_fault {
*/
};
+static inline void vm_fault_init(struct vm_fault *vmf,
+ struct vm_area_struct *vma,
+ unsigned long address,
+ unsigned int flags)
+{
+ vmf->vma = vma;
+ vmf->address = address;
+ vmf->flags = flags;
+}
+
/* page entry size for vm->huge_fault() */
enum page_entry_size {
PE_SIZE_PTE = 0,
@@ -963,6 +979,14 @@ static inline void put_page(struct page *page)
__put_page(page);
}
+static inline void vm_fault_cleanup(struct vm_fault *vmf)
+{
+ if (vmf->cached_page) {
+ put_page(vmf->cached_page);
+ vmf->cached_page = NULL;
+ }
+}
+
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
#define SECTION_IN_PAGE_FLAGS
#endif
@@ -1425,6 +1449,7 @@ int invalidate_inode_page(struct page *page);
#ifdef CONFIG_MMU
extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
unsigned long address, unsigned int flags);
+extern vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf);
extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
unsigned long address, unsigned int fault_flags,
bool *unlocked);
@@ -1440,6 +1465,12 @@ static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
BUG();
return VM_FAULT_SIGBUS;
}
+static inline vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf)
+{
+ /* should never happen if there's no MMU */
+ BUG();
+ return VM_FAULT_SIGBUS;
+}
static inline int fixup_user_fault(struct task_struct *tsk,
struct mm_struct *mm, unsigned long address,
unsigned int fault_flags, bool *unlocked)
diff --git a/mm/memory.c b/mm/memory.c
index 4ad2d293ddc2..d16bb4816f9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3806,36 +3806,34 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static vm_fault_t __handle_mm_fault(struct vm_fault *vmf)
{
- struct vm_fault vmf = {
- .vma = vma,
- .address = address & PAGE_MASK,
- .flags = flags,
- .pgoff = linear_page_index(vma, address),
- .gfp_mask = __get_fault_gfp_mask(vma),
- };
- unsigned int dirty = flags & FAULT_FLAG_WRITE;
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long address = vmf->address;
+ unsigned int dirty = vmf->flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
vm_fault_t ret;
+ vmf->address = address & PAGE_MASK;
+ vmf->pgoff = linear_page_index(vma, address);
+ vmf->gfp_mask = __get_fault_gfp_mask(vma);
+
pgd = pgd_offset(mm, address);
p4d = p4d_alloc(mm, pgd, address);
if (!p4d)
return VM_FAULT_OOM;
- vmf.pud = pud_alloc(mm, p4d, address);
- if (!vmf.pud)
+ vmf->pud = pud_alloc(mm, p4d, address);
+ if (!vmf->pud)
return VM_FAULT_OOM;
- if (pud_none(*vmf.pud) && transparent_hugepage_enabled(vma)) {
- ret = create_huge_pud(&vmf);
+ if (pud_none(*vmf->pud) && transparent_hugepage_enabled(vma)) {
+ ret = create_huge_pud(vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- pud_t orig_pud = *vmf.pud;
+ pud_t orig_pud = *vmf->pud;
barrier();
if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
@@ -3843,50 +3841,50 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
/* NUMA case for anonymous PUDs would go here */
if (dirty && !pud_write(orig_pud)) {
- ret = wp_huge_pud(&vmf, orig_pud);
+ ret = wp_huge_pud(vmf, orig_pud);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- huge_pud_set_accessed(&vmf, orig_pud);
+ huge_pud_set_accessed(vmf, orig_pud);
return 0;
}
}
}
- vmf.pmd = pmd_alloc(mm, vmf.pud, address);
- if (!vmf.pmd)
+ vmf->pmd = pmd_alloc(mm, vmf->pud, address);
+ if (!vmf->pmd)
return VM_FAULT_OOM;
- if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
- ret = create_huge_pmd(&vmf);
+ if (pmd_none(*vmf->pmd) && transparent_hugepage_enabled(vma)) {
+ ret = create_huge_pmd(vmf);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- pmd_t orig_pmd = *vmf.pmd;
+ pmd_t orig_pmd = *vmf->pmd;
barrier();
if (unlikely(is_swap_pmd(orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
!is_pmd_migration_entry(orig_pmd));
if (is_pmd_migration_entry(orig_pmd))
- pmd_migration_entry_wait(mm, vmf.pmd);
+ pmd_migration_entry_wait(mm, vmf->pmd);
return 0;
}
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
- return do_huge_pmd_numa_page(&vmf, orig_pmd);
+ return do_huge_pmd_numa_page(vmf, orig_pmd);
if (dirty && !pmd_write(orig_pmd)) {
- ret = wp_huge_pmd(&vmf, orig_pmd);
+ ret = wp_huge_pmd(vmf, orig_pmd);
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- huge_pmd_set_accessed(&vmf, orig_pmd);
+ huge_pmd_set_accessed(vmf, orig_pmd);
return 0;
}
}
}
- return handle_pte_fault(&vmf);
+ return handle_pte_fault(vmf);
}
/*
@@ -3895,9 +3893,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
- unsigned int flags)
+static vm_fault_t do_handle_mm_fault(struct vm_fault *vmf)
{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned int flags = vmf->flags;
vm_fault_t ret;
__set_current_state(TASK_RUNNING);
@@ -3921,9 +3920,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
mem_cgroup_enter_user_fault();
if (unlikely(is_vm_hugetlb_page(vma)))
- ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+ ret = hugetlb_fault(vma->vm_mm, vma, vmf->address, flags);
else
- ret = __handle_mm_fault(vma, address, flags);
+ ret = __handle_mm_fault(vmf);
if (flags & FAULT_FLAG_USER) {
mem_cgroup_exit_user_fault();
@@ -3939,8 +3938,26 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
return ret;
}
+
+vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
+ unsigned int flags)
+{
+ struct vm_fault vmf = {};
+ vm_fault_t ret;
+
+ vm_fault_init(&vmf, vma, address, flags);
+ ret = do_handle_mm_fault(&vmf);
+ vm_fault_cleanup(&vmf);
+ return ret;
+}
EXPORT_SYMBOL_GPL(handle_mm_fault);
+vm_fault_t handle_mm_fault_cacheable(struct vm_fault *vmf)
+{
+ return do_handle_mm_fault(vmf);
+}
+EXPORT_SYMBOL_GPL(handle_mm_fault_cacheable);
+
#ifndef __PAGETABLE_P4D_FOLDED
/*
* Allocate p4d page table.
--
2.14.3
On Fri, 30 Nov 2018 14:58:08 -0500 Josef Bacik <[email protected]> wrote:
> Now that we have proper isolation in place with cgroups2 we have started going
> through and fixing the various priority inversions. Most are all gone now, but
> this one is sort of weird since it's not necessarily a priority inversion that
> happens within the kernel, but rather because of something userspace does.
>
> We have giant applications that we want to protect, and parts of these giant
> applications do things like watch the system state to determine how healthy the
> box is for load balancing and such. This involves running 'ps' or other such
> utilities. These utilities will often walk /proc/<pid>/whatever, and these
> files can sometimes need to down_read(&task->mmap_sem). Not usually a big deal,
> but we noticed when we are stress testing that sometimes our protected
> application has latency spikes trying to get the mmap_sem for tasks that are in
> lower priority cgroups.
>
> This is because any down_write() on a semaphore essentially turns it into a
> mutex, so even if we currently have it held for reading, any new readers will
> not be allowed on to keep from starving the writer. This is fine, except a
> lower priority task could be stuck doing IO because it has been throttled to the
> point that its IO is taking much longer than normal. But because a higher
> priority group depends on this completing it is now stuck behind lower priority
> work.
>
> In order to avoid this particular priority inversion we want to use the existing
> retry mechanism to stop from holding the mmap_sem at all if we are going to do
> IO. This already exists in the read case sort of, but needed to be extended for
> more than just grabbing the page lock. With io.latency we throttle at
> submit_bio() time, so the readahead stuff can block and even page_cache_read can
> block, so all these paths need to have the mmap_sem dropped.
>
> The other big thing is ->page_mkwrite. btrfs is particularly shitty here
> because we have to reserve space for the dirty page, which can be a very
> expensive operation. We use the same retry method as the read path, and simply
> cache the page and verify the page is still setup properly the next pass through
> ->page_mkwrite().
Seems reasonable. I have a few minorish changeloggish comments.
We're at v4 and no acks have been gathered?
On Fri, 30 Nov 2018 14:58:11 -0500 Josef Bacik <[email protected]> wrote:
> Currently we only drop the mmap_sem if there is contention on the page
> lock. The idea is that we issue readahead and then go to lock the page
> while it is under IO and we want to not hold the mmap_sem during the IO.
>
> The problem with this is the assumption that the readahead does
> anything. In the case that the box is under extreme memory or IO
> pressure we may end up not reading anything at all for readahead, which
> means we will end up reading in the page under the mmap_sem.
Please describe here why this is considered to be a problem.
Application stalling, I assume? Some description of in-the-field
observations would be appropriate. ie, how serious is the problem
whcih is being addressed.
> Instead rework filemap fault path to drop the mmap sem at any point that
> we may do IO or block for an extended period of time. This includes
> while issuing readahead, locking the page, or needing to call ->readpage
> because readahead did not occur. Then once we have a fully uptodate
> page we can return with VM_FAULT_RETRY and come back again to find our
> nicely in-cache page that was gotten outside of the mmap_sem.
>
> ...
>
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2304,28 +2304,44 @@ EXPORT_SYMBOL(generic_file_read_iter);
>
> #ifdef CONFIG_MMU
> #define MMAP_LOTSAMISS (100)
> +static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
> + struct vm_area_struct *vma,
> + int flags)
> +{
> + if (fpin)
> + return fpin;
> + if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
> + FAULT_FLAG_ALLOW_RETRY) {
> + fpin = get_file(vma->vm_file);
> + up_read(&vma->vm_mm->mmap_sem);
> + }
> + return fpin;
> +}
A code comment would be nice. What it does and, especially, why it does it.
> - if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> - put_page(page);
> - return ret | VM_FAULT_RETRY;
> + /*
> + * We are open-coding lock_page_or_retry here because we want to do the
> + * readpage if necessary while the mmap_sem is dropped. If there
> + * happens to be a lock on the page but it wasn't being faulted in we'd
> + * come back around without ALLOW_RETRY set and then have to do the IO
> + * under the mmap_sem, which would be a bummer.
Expanding on "a bummer" would help here ;)
> + */
> + if (!trylock_page(page)) {
> + fpin = maybe_unlock_mmap_for_io(fpin, vmf->vma, vmf->flags);
> + if (vmf->flags & FAULT_FLAG_RETRY_NOWAIT)
> + goto out_retry;
> + if (vmf->flags & FAULT_FLAG_KILLABLE) {
> + if (__lock_page_killable(page)) {
> + /*
> + * If we don't have the right flags for
> + * maybe_unlock_mmap_for_io to do it's thing we
"its"
> + * still need to drop the sem and return
> + * VM_FAULT_RETRY so the upper layer checks the
> + * signal and takes the appropriate action.
> + */
> + if (!fpin)
> + up_read(&vmf->vma->vm_mm->mmap_sem);
> + goto out_retry;
> + }
> + } else
> + __lock_page(page);
> }
>
> /* Did it get truncated? */
On Fri, 30 Nov 2018 14:58:12 -0500 Josef Bacik <[email protected]> wrote:
> If we drop the mmap_sem we have to redo the vma lookup which requires
> redoing the fault handler. Chances are we will just come back to the
> same page, so save this page in our vmf->cached_page and reuse it in the
> next loop through the fault handler.
>
Is this really worthwhile? Rerunning the fault handler is rare (we
hope) and a single pagecache lookup is fast.
Some performance testing results would be helpful here. It's
practically obligatory when claiming a performance improvement.
On Fri, 30 Nov 2018 14:58:09 -0500 Josef Bacik <[email protected]> wrote:
> We want to be able to cache the result of a previous loop of a page
> fault in the case that we use VM_FAULT_RETRY,
Please explain here why we want to do that.
> so introduce
> handle_mm_fault_cacheable that will take a struct vm_fault directly, add
> a ->cached_page field to vm_fault, and add helpers to init/cleanup the
> struct vm_fault.
>
> I've converted x86, other arch's can follow suit if they so wish, it's
> relatively straightforward.
>
On Tue, Dec 04, 2018 at 02:50:34PM -0800, Andrew Morton wrote:
> On Fri, 30 Nov 2018 14:58:12 -0500 Josef Bacik <[email protected]> wrote:
>
> > If we drop the mmap_sem we have to redo the vma lookup which requires
> > redoing the fault handler. Chances are we will just come back to the
> > same page, so save this page in our vmf->cached_page and reuse it in the
> > next loop through the fault handler.
> >
>
> Is this really worthwhile? Rerunning the fault handler is rare (we
> hope) and a single pagecache lookup is fast.
>
> Some performance testing results would be helpful here. It's
> practically obligatory when claiming a performance improvement.
>
>
Honestly the big thing is just not doing IO under the mmap_sem. I had this
infrastructure originally for the mkwrite portion of these patches that I
dropped, because I was worried about the page being messed with after we did all
the mkwrite work. However since I'm not doing that anymore there's less of a
need for it. I have no performance numbers for this, just seemed like a good
idea since we are likely to just have the page again, and this keeps us from
evicting the page right away and causing more thrashing.
I'll try and set something up to see if there's a difference. If there's no
difference do you want me to drop this? Thanks,
Josef
On Fri, Nov 30, 2018 at 02:58:10PM -0500, Josef Bacik wrote:
> If we do not have a page at filemap_fault time we'll do this weird
> forced page_cache_read thing to populate the page, and then drop it
> again and loop around and find it. This makes for 2 ways we can read a
> page in filemap_fault, and it's not really needed. Instead add a
> FGP_FOR_MMAP flag so that pagecache_get_page() will return a unlocked
> page that's in pagecache. Then use the normal page locking and readpage
> logic already in filemap_fault. This simplifies the no page in page
> cache case significantly.
>
> Signed-off-by: Josef Bacik <[email protected]>
That's a great simplification. Looks correct to me.
Acked-by: Johannes Weiner <[email protected]>
On Fri, Nov 30, 2018 at 02:58:11PM -0500, Josef Bacik wrote:
> Currently we only drop the mmap_sem if there is contention on the page
> lock. The idea is that we issue readahead and then go to lock the page
> while it is under IO and we want to not hold the mmap_sem during the IO.
>
> The problem with this is the assumption that the readahead does
> anything. In the case that the box is under extreme memory or IO
> pressure we may end up not reading anything at all for readahead, which
> means we will end up reading in the page under the mmap_sem.
I'd also add that even if readahead did something, the block request
queues could be contended enough that merely submitting the io could
become IO bound if it has to wait for in-flight requests.
Not really a concern with cgroup IO control, but this has always
somewhat defeated the original purpose of the mmap_sem dropping
(avoiding serializing page faults when there is a writer queued).
> Instead rework filemap fault path to drop the mmap sem at any point that
> we may do IO or block for an extended period of time. This includes
> while issuing readahead, locking the page, or needing to call ->readpage
> because readahead did not occur. Then once we have a fully uptodate
> page we can return with VM_FAULT_RETRY and come back again to find our
> nicely in-cache page that was gotten outside of the mmap_sem.
>
> Signed-off-by: Josef Bacik <[email protected]>
Keeping the fpin throughout the fault handler makes things a lot
simpler than the -EAGAIN and wait_on_page_locked dance from earlier
versions. Nice.
Acked-by: Johannes Weiner <[email protected]>
On Tue, Dec 04, 2018 at 02:49:31PM -0800, Andrew Morton wrote:
> On Fri, 30 Nov 2018 14:58:08 -0500 Josef Bacik <[email protected]> wrote:
>
> > Now that we have proper isolation in place with cgroups2 we have started going
> > through and fixing the various priority inversions. Most are all gone now, but
> > this one is sort of weird since it's not necessarily a priority inversion that
> > happens within the kernel, but rather because of something userspace does.
> >
> > We have giant applications that we want to protect, and parts of these giant
> > applications do things like watch the system state to determine how healthy the
> > box is for load balancing and such. This involves running 'ps' or other such
> > utilities. These utilities will often walk /proc/<pid>/whatever, and these
> > files can sometimes need to down_read(&task->mmap_sem). Not usually a big deal,
> > but we noticed when we are stress testing that sometimes our protected
> > application has latency spikes trying to get the mmap_sem for tasks that are in
> > lower priority cgroups.
> >
> > This is because any down_write() on a semaphore essentially turns it into a
> > mutex, so even if we currently have it held for reading, any new readers will
> > not be allowed on to keep from starving the writer. This is fine, except a
> > lower priority task could be stuck doing IO because it has been throttled to the
> > point that its IO is taking much longer than normal. But because a higher
> > priority group depends on this completing it is now stuck behind lower priority
> > work.
> >
> > In order to avoid this particular priority inversion we want to use the existing
> > retry mechanism to stop from holding the mmap_sem at all if we are going to do
> > IO. This already exists in the read case sort of, but needed to be extended for
> > more than just grabbing the page lock. With io.latency we throttle at
> > submit_bio() time, so the readahead stuff can block and even page_cache_read can
> > block, so all these paths need to have the mmap_sem dropped.
> >
> > The other big thing is ->page_mkwrite. btrfs is particularly shitty here
> > because we have to reserve space for the dirty page, which can be a very
> > expensive operation. We use the same retry method as the read path, and simply
> > cache the page and verify the page is still setup properly the next pass through
> > ->page_mkwrite().
>
> Seems reasonable. I have a few minorish changeloggish comments.
>
> We're at v4 and no acks have been gathered?
I looked at previous versions and had a bunch of questions and
change requests. I haven't had time to look at this version yet,
but seeing as the page_mkwrite() stuff has been dropped from this
version it isn't useful anymore for solving the problem I had in
mind when reviewing it originally...
What I really want is unconditionally retriable page faults so the
filesystem can cause the page fault to be restarted from scratch. We
have a requirement for DAX and shared data extents (reflink) to
work, and that requires changing the faulted page location during
page_mkwrite. i.e. we get a fault on a read mapped shared page, then
we have to do a copy-on-write operation to break physical data
sharing and so the page with the file data in it physically changes
during ->page_mkwrite (because DAX). Hence we need to restart the
page fault to map the new page correctly because the file no longer
points at the page that was originally faulted.
With this stashed-page-and-retry mechanism implemented for
->page_mkwrite, we could stash the new page in the vmf and tell the
fault to retry, and everything would just work. Without
->page_mkwrite support, it's just not that interesting and I have
higher priority things to deal with right now....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri 30-11-18 14:58:10, Josef Bacik wrote:
> If we do not have a page at filemap_fault time we'll do this weird
> forced page_cache_read thing to populate the page, and then drop it
> again and loop around and find it. This makes for 2 ways we can read a
> page in filemap_fault, and it's not really needed. Instead add a
> FGP_FOR_MMAP flag so that pagecache_get_page() will return a unlocked
> page that's in pagecache. Then use the normal page locking and readpage
> logic already in filemap_fault. This simplifies the no page in page
> cache case significantly.
>
> Signed-off-by: Josef Bacik <[email protected]>
Thanks for the patch. I like the simplification but I think it could be
even improved... see below.
> @@ -2449,9 +2426,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> ret = VM_FAULT_MAJOR;
> retry_find:
> - page = find_get_page(mapping, offset);
> + page = pagecache_get_page(mapping, offset,
> + FGP_CREAT|FGP_FOR_MMAP,
> + vmf->gfp_mask);
> if (!page)
> - goto no_cached_page;
> + return vmf_error(-ENOMEM);
So why don't you just do:
page = pagecache_get_page(mapping, offset,
FGP_CREAT | FGP_LOCK, vmf->gfp_mask);
if (!page)
return vmf_error(-ENOMEM);
goto check_uptodate;
where check_uptodate would be a label before 'PageUptodate' check?
Then you don't have to introduce new flag for pagecache_get_page() and you
also don't have to unlock and then lock again the page... And you can still
delete all the code you've deleted.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 07-12-18 10:57:50, Jan Kara wrote:
> On Fri 30-11-18 14:58:10, Josef Bacik wrote:
> > If we do not have a page at filemap_fault time we'll do this weird
> > forced page_cache_read thing to populate the page, and then drop it
> > again and loop around and find it. This makes for 2 ways we can read a
> > page in filemap_fault, and it's not really needed. Instead add a
> > FGP_FOR_MMAP flag so that pagecache_get_page() will return a unlocked
> > page that's in pagecache. Then use the normal page locking and readpage
> > logic already in filemap_fault. This simplifies the no page in page
> > cache case significantly.
> >
> > Signed-off-by: Josef Bacik <[email protected]>
>
> Thanks for the patch. I like the simplification but I think it could be
> even improved... see below.
>
> > @@ -2449,9 +2426,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
> > ret = VM_FAULT_MAJOR;
> > retry_find:
> > - page = find_get_page(mapping, offset);
> > + page = pagecache_get_page(mapping, offset,
> > + FGP_CREAT|FGP_FOR_MMAP,
> > + vmf->gfp_mask);
> > if (!page)
> > - goto no_cached_page;
> > + return vmf_error(-ENOMEM);
>
> So why don't you just do:
>
> page = pagecache_get_page(mapping, offset,
> FGP_CREAT | FGP_LOCK, vmf->gfp_mask);
> if (!page)
> return vmf_error(-ENOMEM);
> goto check_uptodate;
>
> where check_uptodate would be a label before 'PageUptodate' check?
>
> Then you don't have to introduce new flag for pagecache_get_page() and you
> also don't have to unlock and then lock again the page... And you can still
> delete all the code you've deleted.
Ah, you don't want lock_page() to block in case someone raced with you and
instantiated the page so that you can drop mmap_sem. OK, the patch looks
good to me then. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> Currently we only drop the mmap_sem if there is contention on the page
> lock. The idea is that we issue readahead and then go to lock the page
> while it is under IO and we want to not hold the mmap_sem during the IO.
>
> The problem with this is the assumption that the readahead does
> anything. In the case that the box is under extreme memory or IO
> pressure we may end up not reading anything at all for readahead, which
> means we will end up reading in the page under the mmap_sem.
>
> Instead rework filemap fault path to drop the mmap sem at any point that
> we may do IO or block for an extended period of time. This includes
> while issuing readahead, locking the page, or needing to call ->readpage
> because readahead did not occur. Then once we have a fully uptodate
> page we can return with VM_FAULT_RETRY and come back again to find our
> nicely in-cache page that was gotten outside of the mmap_sem.
>
> Signed-off-by: Josef Bacik <[email protected]>
> ---
> mm/filemap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 93 insertions(+), 20 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index f068712c2525..5e76b24b2a0f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2304,28 +2304,44 @@ EXPORT_SYMBOL(generic_file_read_iter);
>
> #ifdef CONFIG_MMU
> #define MMAP_LOTSAMISS (100)
> +static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
> + struct vm_area_struct *vma,
> + int flags)
> +{
> + if (fpin)
> + return fpin;
> + if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
> + FAULT_FLAG_ALLOW_RETRY) {
> + fpin = get_file(vma->vm_file);
> + up_read(&vma->vm_mm->mmap_sem);
> + }
> + return fpin;
> +}
>
> /*
> * Synchronous readahead happens when we don't even find
> * a page in the page cache at all.
> */
> -static void do_sync_mmap_readahead(struct vm_area_struct *vma,
> - struct file_ra_state *ra,
> - struct file *file,
> - pgoff_t offset)
> +static struct file *do_sync_mmap_readahead(struct vm_area_struct *vma,
> + struct file_ra_state *ra,
> + struct file *file,
> + pgoff_t offset,
> + int flags)
> {
IMO it would be nicer to pass vmf here at this point. Everything this
function needs is there and the number of arguments is already quite big.
But I don't insist.
> /*
> * Asynchronous readahead happens when we find the page and PG_readahead,
> * so we want to possibly extend the readahead further..
> */
> -static void do_async_mmap_readahead(struct vm_area_struct *vma,
> - struct file_ra_state *ra,
> - struct file *file,
> - struct page *page,
> - pgoff_t offset)
> +static struct file *do_async_mmap_readahead(struct vm_area_struct *vma,
> + struct file_ra_state *ra,
> + struct file *file,
> + struct page *page,
> + pgoff_t offset, int flags)
> {
The same here (except for 'page' which needs to be kept).
> @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> return vmf_error(-ENOMEM);
> }
>
> - if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> - put_page(page);
> - return ret | VM_FAULT_RETRY;
> + /*
> + * We are open-coding lock_page_or_retry here because we want to do the
> + * readpage if necessary while the mmap_sem is dropped. If there
> + * happens to be a lock on the page but it wasn't being faulted in we'd
> + * come back around without ALLOW_RETRY set and then have to do the IO
> + * under the mmap_sem, which would be a bummer.
> + */
Hum, lock_page_or_retry() has two callers and you've just killed one. I
think it would be better to modify the function to suit both callers rather
than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
would unconditionally acquire the lock and return whether it has dropped
mmap sem or not? Callers can then decide what to do.
BTW I'm not sure this complication is really worth it. The "drop mmap_sem
for IO" is never going to be 100% thing if nothing else because only one
retry is allowed in do_user_addr_fault(). So the second time we get to
filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
blocking locking. So I think your code needs to catch common cases you
observe in practice but not those super-rare corner cases...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 05-12-18 09:58:10, Josef Bacik wrote:
> On Tue, Dec 04, 2018 at 02:50:34PM -0800, Andrew Morton wrote:
> > On Fri, 30 Nov 2018 14:58:12 -0500 Josef Bacik <[email protected]> wrote:
> >
> > > If we drop the mmap_sem we have to redo the vma lookup which requires
> > > redoing the fault handler. Chances are we will just come back to the
> > > same page, so save this page in our vmf->cached_page and reuse it in the
> > > next loop through the fault handler.
> > >
> >
> > Is this really worthwhile? Rerunning the fault handler is rare (we
> > hope) and a single pagecache lookup is fast.
> >
> > Some performance testing results would be helpful here. It's
> > practically obligatory when claiming a performance improvement.
> >
> >
>
> Honestly the big thing is just not doing IO under the mmap_sem. I had this
> infrastructure originally for the mkwrite portion of these patches that I
> dropped, because I was worried about the page being messed with after we did all
> the mkwrite work. However since I'm not doing that anymore there's less of a
> need for it. I have no performance numbers for this, just seemed like a good
> idea since we are likely to just have the page again, and this keeps us from
> evicting the page right away and causing more thrashing.
>
> I'll try and set something up to see if there's a difference. If there's no
> difference do you want me to drop this? Thanks,
If there's no difference, I'd like to drop this as well. It just
complicates the fault state handling which is already complex enough.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Dec 07, 2018 at 12:01:38PM +0100, Jan Kara wrote:
> On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> > Currently we only drop the mmap_sem if there is contention on the page
> > lock. The idea is that we issue readahead and then go to lock the page
> > while it is under IO and we want to not hold the mmap_sem during the IO.
> >
> > The problem with this is the assumption that the readahead does
> > anything. In the case that the box is under extreme memory or IO
> > pressure we may end up not reading anything at all for readahead, which
> > means we will end up reading in the page under the mmap_sem.
> >
> > Instead rework filemap fault path to drop the mmap sem at any point that
> > we may do IO or block for an extended period of time. This includes
> > while issuing readahead, locking the page, or needing to call ->readpage
> > because readahead did not occur. Then once we have a fully uptodate
> > page we can return with VM_FAULT_RETRY and come back again to find our
> > nicely in-cache page that was gotten outside of the mmap_sem.
> >
> > Signed-off-by: Josef Bacik <[email protected]>
> > ---
> > mm/filemap.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 93 insertions(+), 20 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index f068712c2525..5e76b24b2a0f 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2304,28 +2304,44 @@ EXPORT_SYMBOL(generic_file_read_iter);
> >
> > #ifdef CONFIG_MMU
> > #define MMAP_LOTSAMISS (100)
> > +static struct file *maybe_unlock_mmap_for_io(struct file *fpin,
> > + struct vm_area_struct *vma,
> > + int flags)
> > +{
> > + if (fpin)
> > + return fpin;
> > + if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
> > + FAULT_FLAG_ALLOW_RETRY) {
> > + fpin = get_file(vma->vm_file);
> > + up_read(&vma->vm_mm->mmap_sem);
> > + }
> > + return fpin;
> > +}
> >
> > /*
> > * Synchronous readahead happens when we don't even find
> > * a page in the page cache at all.
> > */
> > -static void do_sync_mmap_readahead(struct vm_area_struct *vma,
> > - struct file_ra_state *ra,
> > - struct file *file,
> > - pgoff_t offset)
> > +static struct file *do_sync_mmap_readahead(struct vm_area_struct *vma,
> > + struct file_ra_state *ra,
> > + struct file *file,
> > + pgoff_t offset,
> > + int flags)
> > {
>
> IMO it would be nicer to pass vmf here at this point. Everything this
> function needs is there and the number of arguments is already quite big.
> But I don't insist.
>
> > /*
> > * Asynchronous readahead happens when we find the page and PG_readahead,
> > * so we want to possibly extend the readahead further..
> > */
> > -static void do_async_mmap_readahead(struct vm_area_struct *vma,
> > - struct file_ra_state *ra,
> > - struct file *file,
> > - struct page *page,
> > - pgoff_t offset)
> > +static struct file *do_async_mmap_readahead(struct vm_area_struct *vma,
> > + struct file_ra_state *ra,
> > + struct file *file,
> > + struct page *page,
> > + pgoff_t offset, int flags)
> > {
>
> The same here (except for 'page' which needs to be kept).
>
> > @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > return vmf_error(-ENOMEM);
> > }
> >
> > - if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > - put_page(page);
> > - return ret | VM_FAULT_RETRY;
> > + /*
> > + * We are open-coding lock_page_or_retry here because we want to do the
> > + * readpage if necessary while the mmap_sem is dropped. If there
> > + * happens to be a lock on the page but it wasn't being faulted in we'd
> > + * come back around without ALLOW_RETRY set and then have to do the IO
> > + * under the mmap_sem, which would be a bummer.
> > + */
>
> Hum, lock_page_or_retry() has two callers and you've just killed one. I
> think it would be better to modify the function to suit both callers rather
> than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
> would unconditionally acquire the lock and return whether it has dropped
> mmap sem or not? Callers can then decide what to do.
>
I tried this, but it ends up being convoluted, since swap doesn't have a file to
pin we have to add extra cases for that, and then change the return value to
indicate wether we locked the page _and_ dropped the mmap sem, or just locked
the page, etc. It didn't seem the extra complication, so I just broke the open
coding out into its own helper.
> BTW I'm not sure this complication is really worth it. The "drop mmap_sem
> for IO" is never going to be 100% thing if nothing else because only one
> retry is allowed in do_user_addr_fault(). So the second time we get to
> filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
> blocking locking. So I think your code needs to catch common cases you
> observe in practice but not those super-rare corner cases...
I had counters in all of these paths because I was sure some things weren't
getting hit at all, but it turns out each of these cases gets hit with
surprisingly high regularity. The lock_page_or_retry() case in particular gets
hit a lot with multi-threaded applications that got paged out because of heavy
memory pressure. By no means is it as high as just the normal readpage or
readahead cases, but it's not 0, so I'd rather have the extra helper here to
make sure we're never getting screwed. Thanks,
Josef
On Mon 10-12-18 13:44:39, Josef Bacik wrote:
> On Fri, Dec 07, 2018 at 12:01:38PM +0100, Jan Kara wrote:
> > On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> > > @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > > return vmf_error(-ENOMEM);
> > > }
> > >
> > > - if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > > - put_page(page);
> > > - return ret | VM_FAULT_RETRY;
> > > + /*
> > > + * We are open-coding lock_page_or_retry here because we want to do the
> > > + * readpage if necessary while the mmap_sem is dropped. If there
> > > + * happens to be a lock on the page but it wasn't being faulted in we'd
> > > + * come back around without ALLOW_RETRY set and then have to do the IO
> > > + * under the mmap_sem, which would be a bummer.
> > > + */
> >
> > Hum, lock_page_or_retry() has two callers and you've just killed one. I
> > think it would be better to modify the function to suit both callers rather
> > than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
> > would unconditionally acquire the lock and return whether it has dropped
> > mmap sem or not? Callers can then decide what to do.
> >
>
> I tried this, but it ends up being convoluted, since swap doesn't have a file to
> pin we have to add extra cases for that, and then change the return value to
> indicate wether we locked the page _and_ dropped the mmap sem, or just locked
> the page, etc. It didn't seem the extra complication, so I just broke the open
> coding out into its own helper.
OK. Thanks for looking into this!
> > BTW I'm not sure this complication is really worth it. The "drop mmap_sem
> > for IO" is never going to be 100% thing if nothing else because only one
> > retry is allowed in do_user_addr_fault(). So the second time we get to
> > filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
> > blocking locking. So I think your code needs to catch common cases you
> > observe in practice but not those super-rare corner cases...
>
> I had counters in all of these paths because I was sure some things weren't
> getting hit at all, but it turns out each of these cases gets hit with
> surprisingly high regularity.
Cool! Could you share these counters? I'd be curious and they'd be actually
nice as a motivation in the changelog of this patch to show the benefit.
> The lock_page_or_retry() case in particular gets hit a lot with
> multi-threaded applications that got paged out because of heavy memory
> pressure. By no means is it as high as just the normal readpage or
> readahead cases, but it's not 0, so I'd rather have the extra helper here
> to make sure we're never getting screwed.
Do you mean the case where we the page is locked in filemap_fault() (so
that lock_page_or_retry() bails after waiting) and when the page becomes
unlocked it is not uptodate? Because that is the reason why you opencode
lock_page_or_retry(), right? I'm not aware of any normal code path that
would create page in page cache and not try to fill it with data before
unlocking it so that's why I'm really trying to make sure we understand
each other.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Dec 11, 2018 at 10:40:34AM +0100, Jan Kara wrote:
> On Mon 10-12-18 13:44:39, Josef Bacik wrote:
> > On Fri, Dec 07, 2018 at 12:01:38PM +0100, Jan Kara wrote:
> > > On Fri 30-11-18 14:58:11, Josef Bacik wrote:
> > > > @@ -2433,9 +2458,32 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
> > > > return vmf_error(-ENOMEM);
> > > > }
> > > >
> > > > - if (!lock_page_or_retry(page, vmf->vma->vm_mm, vmf->flags)) {
> > > > - put_page(page);
> > > > - return ret | VM_FAULT_RETRY;
> > > > + /*
> > > > + * We are open-coding lock_page_or_retry here because we want to do the
> > > > + * readpage if necessary while the mmap_sem is dropped. If there
> > > > + * happens to be a lock on the page but it wasn't being faulted in we'd
> > > > + * come back around without ALLOW_RETRY set and then have to do the IO
> > > > + * under the mmap_sem, which would be a bummer.
> > > > + */
> > >
> > > Hum, lock_page_or_retry() has two callers and you've just killed one. I
> > > think it would be better to modify the function to suit both callers rather
> > > than opencoding? Maybe something like lock_page_maybe_drop_mmap() which
> > > would unconditionally acquire the lock and return whether it has dropped
> > > mmap sem or not? Callers can then decide what to do.
> > >
> >
> > I tried this, but it ends up being convoluted, since swap doesn't have a file to
> > pin we have to add extra cases for that, and then change the return value to
> > indicate wether we locked the page _and_ dropped the mmap sem, or just locked
> > the page, etc. It didn't seem the extra complication, so I just broke the open
> > coding out into its own helper.
>
> OK. Thanks for looking into this!
>
> > > BTW I'm not sure this complication is really worth it. The "drop mmap_sem
> > > for IO" is never going to be 100% thing if nothing else because only one
> > > retry is allowed in do_user_addr_fault(). So the second time we get to
> > > filemap_fault(), we will not have FAULT_FLAG_ALLOW_RETRY set and thus do
> > > blocking locking. So I think your code needs to catch common cases you
> > > observe in practice but not those super-rare corner cases...
> >
> > I had counters in all of these paths because I was sure some things weren't
> > getting hit at all, but it turns out each of these cases gets hit with
> > surprisingly high regularity.
>
> Cool! Could you share these counters? I'd be curious and they'd be actually
> nice as a motivation in the changelog of this patch to show the benefit.
>
Hmm I can't seem to find anything other than the scratch txt file I had to keep
track of stuff, but the un-labeled numbers are 18953 and 879. I assume the
largest number is the times we went through the readpage path where we weren't
doing any mmap_sem dropping and the 879 was for the lock_page path, but I have
no way of knowing at this point.
> > The lock_page_or_retry() case in particular gets hit a lot with
> > multi-threaded applications that got paged out because of heavy memory
> > pressure. By no means is it as high as just the normal readpage or
> > readahead cases, but it's not 0, so I'd rather have the extra helper here
> > to make sure we're never getting screwed.
>
> Do you mean the case where we the page is locked in filemap_fault() (so
> that lock_page_or_retry() bails after waiting) and when the page becomes
> unlocked it is not uptodate? Because that is the reason why you opencode
> lock_page_or_retry(), right? I'm not aware of any normal code path that
> would create page in page cache and not try to fill it with data before
> unlocking it so that's why I'm really trying to make sure we understand
> each other.
Uhh so that's embarressing. We have an internal patchset that I thought was
upstream but hasn't come along yet. Basically before this patchset the way we
dealt with this problem was to short-circuit readahead IO's by checking to see
if the blkcg was congested (or if there was a fatal signal pending) and doing
bio_wouldblock_error on the bio. So this very case came up a lot, readahead
would go through because it got in before we were congested, but would then get
throttled, and then once the throttling was over would get aborted. Other
threads would run into these pages that had been locked, but they are never read
in which means they waited for the lock to be dropped, did the VM_FAULT_RETRY,
came back unable to drop the mmap_sem and did the actual readpage() and would
get throttled.
This means this particular part of the patch isn't helpful for upstream at the
moment, but now that I know these patches aren't upstream yet that'll be my next
project, so I'd like to keep this bit here as it is. Thanks,
Josef
On Tue 11-12-18 11:08:53, Josef Bacik wrote:
> On Tue, Dec 11, 2018 at 10:40:34AM +0100, Jan Kara wrote:
> > > The lock_page_or_retry() case in particular gets hit a lot with
> > > multi-threaded applications that got paged out because of heavy memory
> > > pressure. By no means is it as high as just the normal readpage or
> > > readahead cases, but it's not 0, so I'd rather have the extra helper here
> > > to make sure we're never getting screwed.
> >
> > Do you mean the case where we the page is locked in filemap_fault() (so
> > that lock_page_or_retry() bails after waiting) and when the page becomes
> > unlocked it is not uptodate? Because that is the reason why you opencode
> > lock_page_or_retry(), right? I'm not aware of any normal code path that
> > would create page in page cache and not try to fill it with data before
> > unlocking it so that's why I'm really trying to make sure we understand
> > each other.
>
> Uhh so that's embarressing. We have an internal patchset that I thought
> was upstream but hasn't come along yet. Basically before this patchset
> the way we dealt with this problem was to short-circuit readahead IO's by
> checking to see if the blkcg was congested (or if there was a fatal
> signal pending) and doing bio_wouldblock_error on the bio. So this very
> case came up a lot, readahead would go through because it got in before
> we were congested, but would then get throttled, and then once the
> throttling was over would get aborted. Other threads would run into
> these pages that had been locked, but they are never read in which means
> they waited for the lock to be dropped, did the VM_FAULT_RETRY, came back
> unable to drop the mmap_sem and did the actual readpage() and would get
> throttled.
OK, I'm somewhat unsure why we throttle on bios that actually get aborted
but that's a separate discussion over a different patches. Overall it makes
sense that some submitted readahead may actually get aborted on congestion
and thus unlocked pages will not be uptodate. So I agree that this case is
actually reasonably likely to happen. Just please mention case like aborted
readahead in the comment so that we don't wonder about the reason in a few
years again.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR