2012-05-12 11:53:16

by Hugh Dickins

Subject: [PATCH 0/10] shmem/tmpfs: misc and fallocate

Here's a bunch of shmem/tmpfs updates: mostly completed in January,
then put aside while I attended to other stuff. But the more recent
1/10 has some urgency, so I'm expediting the descriptions, and shipping
them off to you now for v3.5.

They're diffed against 3.4.0-rc5-next-20120504, but should apply,
build and work on most v3.4-rc releases and recent -next trees.

The fallocate ones were prompted by posts from Cong Wang in November:
I've attributed four of those with Based-on-patch-by, but could not
put From or Signed-off-by, since the originals were somewhat flawed,
and I needed to start again and reorder it all.

Whether 10/10 should go any further than exposure in -next
is in doubt: we shall have to see if it's useful to anyone.

1/10 shmem: replace page if mapping excludes its zone
2/10 tmpfs: enable NOSEC optimization
3/10 tmpfs: optimize clearing when writing
4/10 tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE
5/10 mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE
6/10 mm/fs: remove truncate_range
7/10 tmpfs: support fallocate preallocation
8/10 tmpfs: undo fallocation on failure
9/10 tmpfs: quit when fallocate fills memory
10/10 tmpfs: support SEEK_DATA and SEEK_HOLE

Documentation/filesystems/Locking | 2
Documentation/filesystems/vfs.txt | 13
drivers/staging/android/ashmem.c | 8
fs/bad_inode.c | 1
include/linux/fs.h | 1
include/linux/mm.h | 4
include/linux/swap.h | 6
mm/madvise.c | 15
mm/memcontrol.c | 17
mm/shmem.c | 513 +++++++++++++++++++++++++---
mm/swapfile.c | 2
mm/truncate.c | 25 -
12 files changed, 500 insertions(+), 107 deletions(-)

Hugh


2012-05-12 12:00:16

by Hugh Dickins

Subject: [PATCH 1/10] shmem: replace page if mapping excludes its zone

The GMA500 GPU driver uses GEM shmem objects, but with a new twist:
the backing RAM has to be below 4GB. Not a problem while the boards
supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
their GMA3600 is managed by the GMA500 driver.

shmem/tmpfs has never pretended to support hardware restrictions on
the backing memory, but it might have appeared to do so before v3.1,
and even now it works fine until a page is swapped out then back in.
When read_cache_page_gfp() supplied a freshly allocated page for copy,
that compensated for whatever choice might have been made by earlier
swapin readahead; but swapoff was likely to destroy the illusion.

We'd like to continue to support GMA500, so now add a new
shmem_should_replace_page() check on the zone when about to move
a page from swapcache to filecache (in swapin and swapoff cases),
with shmem_replace_page() to allocate and substitute a suitable page
(given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
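
For illustration only - a hedged sketch of my own, not part of this patch
nor the actual gma500 code: the driver-side constraint being honoured looks
roughly like this, with the GEM object's shmem mapping told to allocate
only below 4GB, a promise which this patch now keeps across swapin/swapoff.

#include <linux/err.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/shmem_fs.h>

/* hypothetical helper, for illustration */
static struct file *gem_shmem_object_create(loff_t size)
{
	struct file *filp = shmem_file_setup("gem object", size, VM_NORESERVE);

	if (!IS_ERR(filp))
		/* future shmem allocations for this object: ZONE_DMA32 only */
		mapping_set_gfp_mask(filp->f_mapping,
				     GFP_KERNEL | __GFP_DMA32);
	return filp;
}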

This does involve a minor extension to mem_cgroup_replace_page_cache()
(the page may or may not have already been charged); and I've removed
a comment and call to mem_cgroup_uncharge_cache_page(), which in fact
is always a no-op while PageSwapCache.

Also removed optimization of an unlikely path in shmem_getpage_gfp(),
now that we need to check PageSwapCache more carefully (a racing caller
might already have made the copy). And at one point shmem_unuse_inode()
needs to use the hitherto private page_swapcount(), to guard against
racing with inode eviction.

It would make sense to extend shmem_should_replace_page(), to cover
cpuset and NUMA mempolicy restrictions too, but set that aside for
now: needs a cleanup of shmem mempolicy handling, and more testing,
and ought to handle swap faults in do_swap_page() as well as shmem.

Signed-off-by: Hugh Dickins <[email protected]>
---
I've Cc'ed Stephane, Andi, Dave, Daniel and Rob because of their interest
in the i915 Sandybridge issue; but reiterate that this patch does nothing
for that case.

include/linux/swap.h | 6 +
mm/memcontrol.c | 17 +++-
mm/shmem.c | 141 +++++++++++++++++++++++++++++++++++------
mm/swapfile.c | 2
4 files changed, 142 insertions(+), 24 deletions(-)

--- 3045N.orig/include/linux/swap.h 2012-05-05 10:42:33.572056784 -0700
+++ 3045N/include/linux/swap.h 2012-05-05 10:45:17.884060895 -0700
@@ -355,6 +355,7 @@ extern int swap_type_of(dev_t, sector_t,
extern unsigned int count_swap_pages(int, int);
extern sector_t map_swap_page(struct page *, struct block_device **);
extern sector_t swapdev_block(int, pgoff_t);
+extern int page_swapcount(struct page *);
extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
struct backing_dev_info;
@@ -448,6 +449,11 @@ static inline void delete_from_swap_cach
{
}

+static inline int page_swapcount(struct page *page)
+{
+ return 0;
+}
+
#define reuse_swap_page(page) (page_mapcount(page) == 1)

static inline int try_to_free_swap(struct page *page)
--- 3045N.orig/mm/memcontrol.c 2012-05-05 10:42:33.576056912 -0700
+++ 3045N/mm/memcontrol.c 2012-05-05 10:45:17.888060878 -0700
@@ -3548,7 +3548,7 @@ void mem_cgroup_end_migration(struct mem
void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage)
{
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
struct page_cgroup *pc;
enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;

@@ -3558,11 +3558,20 @@ void mem_cgroup_replace_page_cache(struc
pc = lookup_page_cgroup(oldpage);
/* fix accounting on old pages */
lock_page_cgroup(pc);
- memcg = pc->mem_cgroup;
- mem_cgroup_charge_statistics(memcg, false, -1);
- ClearPageCgroupUsed(pc);
+ if (PageCgroupUsed(pc)) {
+ memcg = pc->mem_cgroup;
+ mem_cgroup_charge_statistics(memcg, false, -1);
+ ClearPageCgroupUsed(pc);
+ }
unlock_page_cgroup(pc);

+ /*
+ * When called from shmem_replace_page(), in some cases the
+ * oldpage has already been charged, and in some cases not.
+ */
+ if (!memcg)
+ return;
+
if (PageSwapBacked(oldpage))
type = MEM_CGROUP_CHARGE_TYPE_SHMEM;

--- 3045N.orig/mm/shmem.c 2012-05-05 10:42:33.576056912 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700
@@ -103,6 +103,9 @@ static unsigned long shmem_default_max_i
}
#endif

+static bool shmem_should_replace_page(struct page *page, gfp_t gfp);
+static int shmem_replace_page(struct page **pagep, gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index);
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type);

@@ -604,12 +607,13 @@ static void shmem_evict_inode(struct ino
* If swap found in inode, free it and move page from swapcache to filecache.
*/
static int shmem_unuse_inode(struct shmem_inode_info *info,
- swp_entry_t swap, struct page *page)
+ swp_entry_t swap, struct page **pagep)
{
struct address_space *mapping = info->vfs_inode.i_mapping;
void *radswap;
pgoff_t index;
- int error;
+ gfp_t gfp;
+ int error = 0;

radswap = swp_to_radix_entry(swap);
index = radix_tree_locate_item(&mapping->page_tree, radswap);
@@ -625,22 +629,37 @@ static int shmem_unuse_inode(struct shme
if (shmem_swaplist.next != &info->swaplist)
list_move_tail(&shmem_swaplist, &info->swaplist);

+ gfp = mapping_gfp_mask(mapping);
+ if (shmem_should_replace_page(*pagep, gfp)) {
+ mutex_unlock(&shmem_swaplist_mutex);
+ error = shmem_replace_page(pagep, gfp, info, index);
+ mutex_lock(&shmem_swaplist_mutex);
+ /*
+ * We needed to drop mutex to make that restrictive page
+ * allocation; but the inode might already be freed by now,
+ * and we cannot refer to inode or mapping or info to check.
+ * However, we do hold page lock on the PageSwapCache page,
+ * so can check if that still has our reference remaining.
+ */
+ if (!page_swapcount(*pagep))
+ error = -ENOENT;
+ }
+
/*
* We rely on shmem_swaplist_mutex, not only to protect the swaplist,
* but also to hold up shmem_evict_inode(): so inode cannot be freed
* beneath us (pagelock doesn't help until the page is in pagecache).
*/
- error = shmem_add_to_page_cache(page, mapping, index,
+ if (!error)
+ error = shmem_add_to_page_cache(*pagep, mapping, index,
GFP_NOWAIT, radswap);
- /* which does mem_cgroup_uncharge_cache_page on error */
-
if (error != -ENOMEM) {
/*
* Truncation and eviction use free_swap_and_cache(), which
* only does trylock page: if we raced, best clean up here.
*/
- delete_from_swap_cache(page);
- set_page_dirty(page);
+ delete_from_swap_cache(*pagep);
+ set_page_dirty(*pagep);
if (!error) {
spin_lock(&info->lock);
info->swapped--;
@@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct
struct list_head *this, *next;
struct shmem_inode_info *info;
int found = 0;
- int error;
+ int error = 0;
+
+ /*
+ * There's a faint possibility that swap page was replaced before
+ * caller locked it: it will come back later with the right page.
+ */
+ if (unlikely(!PageSwapCache(page)))
+ goto out;

/*
* Charge page using GFP_KERNEL while we can wait, before taking
@@ -676,7 +702,7 @@ int shmem_unuse(swp_entry_t swap, struct
list_for_each_safe(this, next, &shmem_swaplist) {
info = list_entry(this, struct shmem_inode_info, swaplist);
if (info->swapped)
- found = shmem_unuse_inode(info, swap, page);
+ found = shmem_unuse_inode(info, swap, &page);
else
list_del_init(&info->swaplist);
cond_resched();
@@ -685,8 +711,6 @@ int shmem_unuse(swp_entry_t swap, struct
}
mutex_unlock(&shmem_swaplist_mutex);

- if (!found)
- mem_cgroup_uncharge_cache_page(page);
if (found < 0)
error = found;
out:
@@ -856,6 +880,84 @@ static inline struct mempolicy *shmem_ge
#endif

/*
+ * When a page is moved from swapcache to shmem filecache (either by the
+ * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of
+ * shmem_unuse_inode()), it may have been read in earlier from swap, in
+ * ignorance of the mapping it belongs to. If that mapping has special
+ * constraints (like the gma500 GEM driver, which requires RAM below 4GB),
+ * we may need to copy to a suitable page before moving to filecache.
+ *
+ * In a future release, this may well be extended to respect cpuset and
+ * NUMA mempolicy, and applied also to anonymous pages in do_swap_page();
+ * but for now it is a simple matter of zone.
+ */
+static bool shmem_should_replace_page(struct page *page, gfp_t gfp)
+{
+ return page_zonenum(page) > gfp_zone(gfp);
+}
+
+static int shmem_replace_page(struct page **pagep, gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ struct page *oldpage, *newpage;
+ struct address_space *swap_mapping;
+ pgoff_t swap_index;
+ int error;
+
+ oldpage = *pagep;
+ swap_index = page_private(oldpage);
+ swap_mapping = page_mapping(oldpage);
+
+ /*
+ * We have arrived here because our zones are constrained, so don't
+ * limit chance of success by further cpuset and node constraints.
+ */
+ gfp &= ~GFP_CONSTRAINT_MASK;
+ newpage = shmem_alloc_page(gfp, info, index);
+ if (!newpage)
+ return -ENOMEM;
+ VM_BUG_ON(shmem_should_replace_page(newpage, gfp));
+
+ *pagep = newpage;
+ page_cache_get(newpage);
+ copy_highpage(newpage, oldpage);
+
+ VM_BUG_ON(!PageLocked(oldpage));
+ __set_page_locked(newpage);
+ VM_BUG_ON(!PageUptodate(oldpage));
+ SetPageUptodate(newpage);
+ VM_BUG_ON(!PageSwapBacked(oldpage));
+ SetPageSwapBacked(newpage);
+ VM_BUG_ON(!swap_index);
+ set_page_private(newpage, swap_index);
+ VM_BUG_ON(!PageSwapCache(oldpage));
+ SetPageSwapCache(newpage);
+
+ /*
+ * Our caller will very soon move newpage out of swapcache, but it's
+ * a nice clean interface for us to replace oldpage by newpage there.
+ */
+ spin_lock_irq(&swap_mapping->tree_lock);
+ error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
+ newpage);
+ __inc_zone_page_state(newpage, NR_FILE_PAGES);
+ __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ spin_unlock_irq(&swap_mapping->tree_lock);
+ BUG_ON(error);
+
+ mem_cgroup_replace_page_cache(oldpage, newpage);
+ lru_cache_add_anon(newpage);
+
+ ClearPageSwapCache(oldpage);
+ set_page_private(oldpage, 0);
+
+ unlock_page(oldpage);
+ page_cache_release(oldpage);
+ page_cache_release(oldpage);
+ return 0;
+}
+
+/*
* shmem_getpage_gfp - find page in cache, or get from swap, or allocate
*
* If we allocate a new one we do not mark it dirty. That's up to the
@@ -923,19 +1025,20 @@ repeat:

/* We have to do this with page locked to prevent races */
lock_page(page);
+ if (!PageSwapCache(page) || page->mapping) {
+ error = -EEXIST; /* try again */
+ goto failed;
+ }
if (!PageUptodate(page)) {
error = -EIO;
goto failed;
}
wait_on_page_writeback(page);

- /* Someone may have already done it for us */
- if (page->mapping) {
- if (page->mapping == mapping &&
- page->index == index)
- goto done;
- error = -EEXIST;
- goto failed;
+ if (shmem_should_replace_page(page, gfp)) {
+ error = shmem_replace_page(&page, gfp, info, index);
+ if (error)
+ goto failed;
}

error = mem_cgroup_cache_charge(page, current->mm,
@@ -998,7 +1101,7 @@ repeat:
if (sgp == SGP_DIRTY)
set_page_dirty(page);
}
-done:
+
/* Perhaps the file has been truncated since we checked */
if (sgp != SGP_WRITE &&
((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
--- 3045N.orig/mm/swapfile.c 2012-05-05 10:42:33.576056912 -0700
+++ 3045N/mm/swapfile.c 2012-05-05 10:45:17.888060878 -0700
@@ -604,7 +604,7 @@ void swapcache_free(swp_entry_t entry, s
* This does not give an exact answer when swap count is continued,
* but does include the high COUNT_CONTINUED flag to allow for that.
*/
-static inline int page_swapcount(struct page *page)
+int page_swapcount(struct page *page)
{
int count = 0;
struct swap_info_struct *p;

2012-05-12 12:02:47

by Hugh Dickins

Subject: [PATCH 2/10] tmpfs: enable NOSEC optimization

Let tmpfs into the NOSEC optimization (avoiding file_remove_suid()
overhead on most common writes): set MS_NOSEC on its superblocks.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 1 +
1 file changed, 1 insertion(+)

--- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700
@@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block
}
}
sb->s_export_op = &shmem_export_ops;
+ sb->s_flags |= MS_NOSEC;
#else
sb->s_flags |= MS_NOUSER;
#endif

2012-05-12 12:04:30

by Hugh Dickins

Subject: [PATCH 3/10] tmpfs: optimize clearing when writing

Nick proposed years ago that tmpfs should avoid clearing its pages where
write will overwrite them with new data, as ramfs has long done. But I
messed it up and just got bad data. Tried again recently, it works fine.

Here's time output for writing 4GiB 16 times on this Core i5 laptop:

before: real 0m21.169s user 0m0.028s sys 0m21.057s
real 0m21.382s user 0m0.016s sys 0m21.289s
real 0m21.311s user 0m0.020s sys 0m21.217s

after: real 0m18.273s user 0m0.032s sys 0m18.165s
real 0m18.354s user 0m0.020s sys 0m18.265s
real 0m18.440s user 0m0.032s sys 0m18.337s

ramfs: real 0m16.860s user 0m0.028s sys 0m16.765s
real 0m17.382s user 0m0.040s sys 0m17.273s
real 0m17.133s user 0m0.044s sys 0m17.021s
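
(For reference, a rough sketch of the kind of test timed above - my own
reconstruction, not the actual script; filename and chunk size arbitrary -
writing 4GiB in 1MiB chunks to a file on the mounted tmpfs or ramfs:)

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];	/* 1MiB, zero-filled */
	int fd = open("/dev/shm/testfile", O_CREAT | O_TRUNC | O_WRONLY, 0600);
	int i;

	if (fd < 0)
		return 1;
	for (i = 0; i < 4096; i++)	/* 4096 x 1MiB = 4GiB */
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))
			return 1;
	return close(fd);
}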

Yes, I have done perf reports, but they need more explanation than they
deserve: in summary, clear_page vanishes, its cache loading shifts into
copy_user_generic_unrolled; shmem_getpage_gfp goes down, and surprisingly
mark_page_accessed goes way up - I think because they are respectively
where the cache gets to be reloaded after being purged by clear or copy.

Suggested-by: Nick Piggin <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)

--- 3045N.orig/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:46:12.316062172 -0700
@@ -1095,9 +1095,14 @@ repeat:
shmem_recalc_inode(inode);
spin_unlock(&info->lock);

- clear_highpage(page);
- flush_dcache_page(page);
- SetPageUptodate(page);
+ /*
+ * Let SGP_WRITE caller clear ends if write does not fill page
+ */
+ if (sgp != SGP_WRITE) {
+ clear_highpage(page);
+ flush_dcache_page(page);
+ SetPageUptodate(page);
+ }
if (sgp == SGP_DIRTY)
set_page_dirty(page);
}
@@ -1307,6 +1312,14 @@ shmem_write_end(struct file *file, struc
if (pos + copied > inode->i_size)
i_size_write(inode, pos + copied);

+ if (!PageUptodate(page)) {
+ if (copied < PAGE_CACHE_SIZE) {
+ unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+ zero_user_segments(page, 0, from,
+ from + copied, PAGE_CACHE_SIZE);
+ }
+ SetPageUptodate(page);
+ }
set_page_dirty(page);
unlock_page(page);
page_cache_release(page);
@@ -1768,6 +1781,7 @@ static int shmem_symlink(struct inode *d
kaddr = kmap_atomic(page);
memcpy(kaddr, symname, len);
kunmap_atomic(kaddr);
+ SetPageUptodate(page);
set_page_dirty(page);
unlock_page(page);
page_cache_release(page);

2012-05-12 12:06:10

by Hugh Dickins

Subject: [PATCH 4/10] tmpfs: support fallocate FALLOC_FL_PUNCH_HOLE

tmpfs has supported hole-punching since 2.6.16, via madvise(,,MADV_REMOVE).
But nowadays fallocate(,FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,,) is the
agreed way to punch holes.

So add shmem_fallocate() to support that, and tweak shmem_truncate_range()
to support partial pages at both the beginning and end of range (never
needed for madvise, which demands rounded addr and rounds up length).
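
To illustrate the interface (a user-space sketch of my own, not part of the
patch; path and offsets arbitrary): punch out the middle of a tmpfs file
without changing its size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/shm/holes", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, 16 << 20) != 0)	/* 16MiB file */
		return 1;
	/* deallocate 4MiB..8MiB; reads there now return zeroes */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      4 << 20, 4 << 20) != 0)
		return 1;
	return close(fd);
}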

Based-on-patch-by: Cong Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 68 ++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 57 insertions(+), 11 deletions(-)

--- 3045N.orig/mm/shmem.c 2012-05-05 10:46:12.316062172 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:46:18.768062321 -0700
@@ -53,6 +53,7 @@ static struct vfsmount *shm_mnt;
#include <linux/blkdev.h>
#include <linux/pagevec.h>
#include <linux/percpu_counter.h>
+#include <linux/falloc.h>
#include <linux/splice.h>
#include <linux/security.h>
#include <linux/swapops.h>
@@ -432,21 +433,23 @@ void shmem_truncate_range(struct inode *
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
pgoff_t start = (lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
- unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
- pgoff_t end = (lend >> PAGE_CACHE_SHIFT);
+ pgoff_t end = (lend + 1) >> PAGE_CACHE_SHIFT;
+ unsigned int partial_start = lstart & (PAGE_CACHE_SIZE - 1);
+ unsigned int partial_end = (lend + 1) & (PAGE_CACHE_SIZE - 1);
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
long nr_swaps_freed = 0;
pgoff_t index;
int i;

- BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
+ if (lend == -1)
+ end = -1; /* unsigned, so actually very big */

pagevec_init(&pvec, 0);
index = start;
- while (index <= end) {
+ while (index < end) {
pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE),
pvec.pages, indices);
if (!pvec.nr)
break;
@@ -455,7 +458,7 @@ void shmem_truncate_range(struct inode *
struct page *page = pvec.pages[i];

index = indices[i];
- if (index > end)
+ if (index >= end)
break;

if (radix_tree_exceptional_entry(page)) {
@@ -479,22 +482,39 @@ void shmem_truncate_range(struct inode *
index++;
}

- if (partial) {
+ if (partial_start) {
struct page *page = NULL;
shmem_getpage(inode, start - 1, &page, SGP_READ, NULL);
if (page) {
- zero_user_segment(page, partial, PAGE_CACHE_SIZE);
+ unsigned int top = PAGE_CACHE_SIZE;
+ if (start > end) {
+ top = partial_end;
+ partial_end = 0;
+ }
+ zero_user_segment(page, partial_start, top);
set_page_dirty(page);
unlock_page(page);
page_cache_release(page);
}
}
+ if (partial_end) {
+ struct page *page = NULL;
+ shmem_getpage(inode, end, &page, SGP_READ, NULL);
+ if (page) {
+ zero_user_segment(page, 0, partial_end);
+ set_page_dirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+ }
+ }
+ if (start >= end)
+ return;

index = start;
for ( ; ; ) {
cond_resched();
pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE),
pvec.pages, indices);
if (!pvec.nr) {
if (index == start)
@@ -502,7 +522,7 @@ void shmem_truncate_range(struct inode *
index = start;
continue;
}
- if (index == start && indices[0] > end) {
+ if (index == start && indices[0] >= end) {
shmem_deswap_pagevec(&pvec);
pagevec_release(&pvec);
break;
@@ -512,7 +532,7 @@ void shmem_truncate_range(struct inode *
struct page *page = pvec.pages[i];

index = indices[i];
- if (index > end)
+ if (index >= end)
break;

if (radix_tree_exceptional_entry(page)) {
@@ -1578,6 +1598,31 @@ static ssize_t shmem_file_splice_read(st
return error;
}

+static long shmem_fallocate(struct file *file, int mode, loff_t offset,
+ loff_t len)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ int error = -EOPNOTSUPP;
+
+ mutex_lock(&inode->i_mutex);
+
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ struct address_space *mapping = file->f_mapping;
+ loff_t unmap_start = round_up(offset, PAGE_SIZE);
+ loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
+
+ if ((u64)unmap_end > (u64)unmap_start)
+ unmap_mapping_range(mapping, unmap_start,
+ 1 + unmap_end - unmap_start, 0);
+ shmem_truncate_range(inode, offset, offset + len - 1);
+ /* No need to unmap again: hole-punching leaves COWed pages */
+ error = 0;
+ }
+
+ mutex_unlock(&inode->i_mutex);
+ return error;
+}
+
static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
{
struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
@@ -2478,6 +2523,7 @@ static const struct file_operations shme
.fsync = noop_fsync,
.splice_read = shmem_file_splice_read,
.splice_write = generic_file_splice_write,
+ .fallocate = shmem_fallocate,
#endif
};

2012-05-12 12:13:37

by Hugh Dickins

Subject: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE

Now tmpfs supports hole-punching via fallocate(), switch madvise_remove()
to use do_fallocate() instead of vmtruncate_range(): which extends
madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs.

There is one more user of vmtruncate_range() in our tree, staging/android's
ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned
areas are already unmapped - I don't know - then it would do better to use
shmem_truncate_range() directly).
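
For illustration (a user-space sketch of my own, not part of the patch):
the madvise path being rerouted - punching a hole in a shared tmpfs
mapping, which after this change reaches the filesystem via do_fallocate().

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 8 << 20;
	int fd = open("/dev/shm/data", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0 || ftruncate(fd, len) != 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	/* free the backing store of the first 1MiB; it reads back as zeroes */
	madvise(p, 1 << 20, MADV_REMOVE);
	munmap(p, len);
	return close(fd);
}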

Based-on-patch-by: Cong Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
drivers/staging/android/ashmem.c | 8 +++++---
mm/madvise.c | 15 +++++++--------
2 files changed, 12 insertions(+), 11 deletions(-)

--- 3045N.orig/drivers/staging/android/ashmem.c 2012-05-05 10:42:33.564056626 -0700
+++ 3045N/drivers/staging/android/ashmem.c 2012-05-05 10:46:25.692062478 -0700
@@ -19,6 +19,7 @@
#include <linux/module.h>
#include <linux/file.h>
#include <linux/fs.h>
+#include <linux/falloc.h>
#include <linux/miscdevice.h>
#include <linux/security.h>
#include <linux/mm.h>
@@ -363,11 +364,12 @@ static int ashmem_shrink(struct shrinker

mutex_lock(&ashmem_mutex);
list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
- struct inode *inode = range->asma->file->f_dentry->d_inode;
loff_t start = range->pgstart * PAGE_SIZE;
- loff_t end = (range->pgend + 1) * PAGE_SIZE - 1;
+ loff_t end = (range->pgend + 1) * PAGE_SIZE;

- vmtruncate_range(inode, start, end);
+ do_fallocate(range->asma->file,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ start, end - start);
range->purged = ASHMEM_WAS_PURGED;
lru_del(range);

--- 3045N.orig/mm/madvise.c 2012-05-05 10:42:33.572056784 -0700
+++ 3045N/mm/madvise.c 2012-05-05 10:46:25.692062478 -0700
@@ -11,8 +11,10 @@
#include <linux/mempolicy.h>
#include <linux/page-isolation.h>
#include <linux/hugetlb.h>
+#include <linux/falloc.h>
#include <linux/sched.h>
#include <linux/ksm.h>
+#include <linux/fs.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -200,8 +202,7 @@ static long madvise_remove(struct vm_are
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
- struct address_space *mapping;
- loff_t offset, endoff;
+ loff_t offset;
int error;

*prev = NULL; /* tell sys_madvise we drop mmap_sem */
@@ -217,16 +218,14 @@ static long madvise_remove(struct vm_are
if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
return -EACCES;

- mapping = vma->vm_file->f_mapping;
-
offset = (loff_t)(start - vma->vm_start)
+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
- endoff = (loff_t)(end - vma->vm_start - 1)
- + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);

- /* vmtruncate_range needs to take i_mutex */
+ /* filesystem's fallocate may need to take i_mutex */
up_read(&current->mm->mmap_sem);
- error = vmtruncate_range(mapping->host, offset, endoff);
+ error = do_fallocate(vma->vm_file,
+ FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ offset, end - start);
down_read(&current->mm->mmap_sem);
return error;
}

2012-05-12 12:15:24

by Hugh Dickins

Subject: [PATCH 6/10] mm/fs: remove truncate_range

Remove vmtruncate_range(), and remove the truncate_range method from
struct inode_operations: only tmpfs ever supported it, and tmpfs has
now converted over to using the fallocate method of file_operations.

Update Documentation accordingly, adding (setlease and) fallocate lines.
And while we're in mm.h, remove duplicate declarations of shmem_lock()
and shmem_file_setup(): everyone is now using the ones in shmem_fs.h.

Based-on-patch-by: Cong Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/Locking | 2 --
Documentation/filesystems/vfs.txt | 13 ++++++++-----
fs/bad_inode.c | 1 -
include/linux/fs.h | 1 -
include/linux/mm.h | 4 ----
mm/shmem.c | 1 -
mm/truncate.c | 25 -------------------------
7 files changed, 8 insertions(+), 39 deletions(-)

--- 3045N.orig/Documentation/filesystems/Locking 2012-05-05 10:42:33.560056589 -0700
+++ 3045N/Documentation/filesystems/Locking 2012-05-05 10:46:35.220062708 -0700
@@ -60,7 +60,6 @@ ata *);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
- void (*truncate_range)(struct inode *, loff_t, loff_t);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len);

locking rules:
@@ -87,7 +86,6 @@ setxattr: yes
getxattr: no
listxattr: no
removexattr: yes
-truncate_range: yes
fiemap: no
Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on
victim.
--- 3045N.orig/Documentation/filesystems/vfs.txt 2012-05-05 10:42:33.560056589 -0700
+++ 3045N/Documentation/filesystems/vfs.txt 2012-05-05 10:46:35.220062708 -0700
@@ -363,7 +363,6 @@ struct inode_operations {
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
- void (*truncate_range)(struct inode *, loff_t, loff_t);
};

Again, all methods are called without any locks being held, unless
@@ -472,9 +471,6 @@ otherwise noted.
removexattr: called by the VFS to remove an extended attribute from
a file. This method is called by removexattr(2) system call.

- truncate_range: a method provided by the underlying filesystem to truncate a
- range of blocks , i.e. punch a hole somewhere in a file.
-

The Address Space Object
========================
@@ -760,7 +756,7 @@ struct file_operations
----------------------

This describes how the VFS can manipulate an open file. As of kernel
-2.6.22, the following members are defined:
+3.5, the following members are defined:

struct file_operations {
struct module *owner;
@@ -790,6 +786,8 @@ struct file_operations {
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int);
+ int (*setlease)(struct file *, long arg, struct file_lock **);
+ long (*fallocate)(struct file *, int mode, loff_t offset, loff_t len);
};

Again, all methods are called without any locks being held, unless
@@ -858,6 +856,11 @@ otherwise noted.
splice_read: called by the VFS to splice data from file to a pipe. This
method is used by the splice(2) system call

+ setlease: called by the VFS to set or release a file lock lease.
+ setlease has the file_lock_lock held and must not sleep.
+
+ fallocate: called by the VFS to preallocate blocks or punch a hole.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
--- 3045N.orig/fs/bad_inode.c 2012-05-05 10:42:33.564056626 -0700
+++ 3045N/fs/bad_inode.c 2012-05-05 10:46:35.220062708 -0700
@@ -292,7 +292,6 @@ static const struct inode_operations bad
.getxattr = bad_inode_getxattr,
.listxattr = bad_inode_listxattr,
.removexattr = bad_inode_removexattr,
- /* truncate_range returns void */
};


--- 3045N.orig/include/linux/fs.h 2012-05-05 10:42:33.572056784 -0700
+++ 3045N/include/linux/fs.h 2012-05-05 10:46:35.220062708 -0700
@@ -1673,7 +1673,6 @@ struct inode_operations {
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
- void (*truncate_range)(struct inode *, loff_t, loff_t);
int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
u64 len);
} ____cacheline_aligned;
--- 3045N.orig/include/linux/mm.h 2012-05-05 10:42:33.572056784 -0700
+++ 3045N/include/linux/mm.h 2012-05-05 10:46:35.220062708 -0700
@@ -871,8 +871,6 @@ extern void pagefault_out_of_memory(void
extern void show_free_areas(unsigned int flags);
extern bool skip_free_areas_node(unsigned int flags, int nid);

-int shmem_lock(struct file *file, int lock, struct user_struct *user);
-struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags);
int shmem_zero_setup(struct vm_area_struct *);

extern int can_do_mlock(void);
@@ -953,11 +951,9 @@ static inline void unmap_shared_mapping_
extern void truncate_pagecache(struct inode *inode, loff_t old, loff_t new);
extern void truncate_setsize(struct inode *inode, loff_t newsize);
extern int vmtruncate(struct inode *inode, loff_t offset);
-extern int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end);
void truncate_pagecache_range(struct inode *inode, loff_t offset, loff_t end);
int truncate_inode_page(struct address_space *mapping, struct page *page);
int generic_error_remove_page(struct address_space *mapping, struct page *page);
-
int invalidate_inode_page(struct page *page);

#ifdef CONFIG_MMU
--- 3045N.orig/mm/shmem.c 2012-05-05 10:46:18.768062321 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:46:35.224062700 -0700
@@ -2529,7 +2529,6 @@ static const struct file_operations shme

static const struct inode_operations shmem_inode_operations = {
.setattr = shmem_setattr,
- .truncate_range = shmem_truncate_range,
#ifdef CONFIG_TMPFS_XATTR
.setxattr = shmem_setxattr,
.getxattr = shmem_getxattr,
--- 3045N.orig/mm/truncate.c 2012-05-05 10:42:33.576056912 -0700
+++ 3045N/mm/truncate.c 2012-05-05 10:46:35.224062700 -0700
@@ -602,31 +602,6 @@ int vmtruncate(struct inode *inode, loff
}
EXPORT_SYMBOL(vmtruncate);

-int vmtruncate_range(struct inode *inode, loff_t lstart, loff_t lend)
-{
- struct address_space *mapping = inode->i_mapping;
- loff_t holebegin = round_up(lstart, PAGE_SIZE);
- loff_t holelen = 1 + lend - holebegin;
-
- /*
- * If the underlying filesystem is not going to provide
- * a way to truncate a range of blocks (punch a hole) -
- * we should return failure right now.
- */
- if (!inode->i_op->truncate_range)
- return -ENOSYS;
-
- mutex_lock(&inode->i_mutex);
- inode_dio_wait(inode);
- unmap_mapping_range(mapping, holebegin, holelen, 1);
- inode->i_op->truncate_range(inode, lstart, lend);
- /* unmap again to remove racily COWed private pages */
- unmap_mapping_range(mapping, holebegin, holelen, 1);
- mutex_unlock(&inode->i_mutex);
-
- return 0;
-}
-
/**
* truncate_pagecache_range - unmap and remove pagecache that is hole-punched
* @inode: inode

2012-05-12 12:17:43

by Hugh Dickins

Subject: [PATCH 7/10] tmpfs: support fallocate preallocation

The systemd plumbers expressed a wish that tmpfs support preallocation.
Cong Wang wrote a patch, but several kernel guys expressed scepticism:
https://lkml.org/lkml/2011/11/18/137

Christoph Hellwig: What for exactly? Please explain why preallocating
on tmpfs would make any sense.
Kay Sievers: To be able to safely use mmap(), regarding SIGBUS, on files
on the /dev/shm filesystem. The glibc fallback loop for -ENOSYS [or
-EOPNOTSUPP] on fallocate is just ugly.
Hugh Dickins: If tmpfs is going to support fallocate(FALLOC_FL_PUNCH_HOLE),
it would seem perverse to permit the deallocation but fail the allocation.
Christoph Hellwig: Agreed.

Now that we do have shmem_fallocate() for hole-punching, plumb in basic
support for preallocation mode too. It's fairly straightforward (though
quite a few details needed attention), except for when it fails part way
through. What a pity that fallocate(2) was not specified to return the
length allocated, permitting short fallocations!

As it is, when it fails part way through, we ought to free what has just
been allocated by this system call; but must be very sure not to free any
allocated earlier, or any allocated by racing accesses (not all excluded
by i_mutex).

But we cannot distinguish them: so in this patch simply leak allocations
on partial failure (they will be freed later if the file is removed).

An attractive alternative approach would have been for fallocate() not to
allocate pages at all, but note reservations by entries in the radix-tree.
But that would give less assurance, and, critically, would be hard to fit
with mem cgroups (who owns the reservations?): allocating pages lets
fallocate() behave in just the same way as write().
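
For illustration (a user-space sketch of my own, not part of the patch):
the use case argued for above - preallocate a tmpfs file up front, so that
later MAP_SHARED write faults cannot hit SIGBUS when memory or the mount
runs short; any ENOSPC is reported by fallocate() itself.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64 << 20;
	int fd = open("/dev/shm/prealloc", O_CREAT | O_RDWR, 0600);
	char *p;

	if (fd < 0)
		return 1;
	/* reserve the whole file now: failure shows up here, not as SIGBUS */
	if (fallocate(fd, 0, 0, len) != 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	memset(p, 0, len);	/* safe: backing pages were allocated above */
	munmap(p, len);
	return close(fd);
}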

Based-on-patch-by: Cong Wang <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 60 insertions(+), 1 deletion(-)

--- 3045N.orig/mm/shmem.c 2012-05-05 10:46:35.224062700 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:46:45.312062979 -0700
@@ -1602,7 +1602,9 @@ static long shmem_fallocate(struct file
loff_t len)
{
struct inode *inode = file->f_path.dentry->d_inode;
- int error = -EOPNOTSUPP;
+ struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ pgoff_t start, index, end;
+ int error;

mutex_lock(&inode->i_mutex);

@@ -1617,8 +1619,65 @@ static long shmem_fallocate(struct file
shmem_truncate_range(inode, offset, offset + len - 1);
/* No need to unmap again: hole-punching leaves COWed pages */
error = 0;
+ goto out;
}

+ /* We need to check rlimit even when FALLOC_FL_KEEP_SIZE */
+ error = inode_newsize_ok(inode, offset + len);
+ if (error)
+ goto out;
+
+ start = offset >> PAGE_CACHE_SHIFT;
+ end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ /* Try to avoid a swapstorm if len is impossible to satisfy */
+ if (sbinfo->max_blocks && end - start > sbinfo->max_blocks) {
+ error = -ENOSPC;
+ goto out;
+ }
+
+ for (index = start; index < end; index++) {
+ struct page *page;
+
+ /*
+ * Good, the fallocate(2) manpage permits EINTR: we may have
+ * been interrupted because we are using up too much memory.
+ */
+ if (signal_pending(current))
+ error = -EINTR;
+ else
+ error = shmem_getpage(inode, index, &page, SGP_WRITE,
+ NULL);
+ if (error) {
+ /*
+ * We really ought to free what we allocated so far,
+ * but it would be wrong to free pages allocated
+ * earlier, or already now in use: i_mutex does not
+ * exclude all cases. We do not know what to free.
+ */
+ goto ctime;
+ }
+
+ if (!PageUptodate(page)) {
+ clear_highpage(page);
+ flush_dcache_page(page);
+ SetPageUptodate(page);
+ }
+ /*
+ * set_page_dirty so that memory pressure will swap rather
+ * than free the pages we are allocating (and SGP_CACHE pages
+ * might still be clean: we now need to mark those dirty too).
+ */
+ set_page_dirty(page);
+ unlock_page(page);
+ page_cache_release(page);
+ cond_resched();
+ }
+
+ if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
+ i_size_write(inode, offset + len);
+ctime:
+ inode->i_ctime = CURRENT_TIME;
+out:
mutex_unlock(&inode->i_mutex);
return error;
}

2012-05-12 12:19:36

by Hugh Dickins

Subject: [PATCH 8/10] tmpfs: undo fallocation on failure

In the previous episode, we left the already-fallocated pages attached
to the file when shmem_fallocate() fails part way through.

Now try to do better, by extending the earlier optimization of !Uptodate
pages (then always under page lock) to !Uptodate pages (outside of page
lock), representing fallocated pages. And don't waste time clearing
them at the time of fallocate(), leave that until later if necessary.

Adapt shmem_truncate_range() to shmem_undo_range(), so that a failing
fallocate can recognize and remove precisely those !Uptodate allocations
which it added (and were not independently allocated by racing tasks).

But unless we start playing with swapfile.c and memcontrol.c too, once
one of our fallocated pages reaches shmem_writepage(), we do then have
to instantiate it as an ordinarily allocated page, before swapping out.
This is unsatisfactory, but improved in the next episode.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 105 ++++++++++++++++++++++++++++++++++-----------------
1 file changed, 72 insertions(+), 33 deletions(-)

--- 3045N.orig/mm/shmem.c 2012-05-05 10:46:45.312062979 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:46:53.860063201 -0700
@@ -89,7 +89,8 @@ enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
- SGP_WRITE, /* may exceed i_size, may allocate page */
+ SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
+ SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
};

#ifdef CONFIG_TMPFS
@@ -427,8 +428,10 @@ void shmem_unlock_mapping(struct address

/*
* Remove range of pages and swap entries from radix tree, and free them.
+ * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
*/
-void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
+static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
+ bool unfalloc)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -462,6 +465,8 @@ void shmem_truncate_range(struct inode *
break;

if (radix_tree_exceptional_entry(page)) {
+ if (unfalloc)
+ continue;
nr_swaps_freed += !shmem_free_swap(mapping,
index, page);
continue;
@@ -469,9 +474,11 @@ void shmem_truncate_range(struct inode *

if (!trylock_page(page))
continue;
- if (page->mapping == mapping) {
- VM_BUG_ON(PageWriteback(page));
- truncate_inode_page(mapping, page);
+ if (!unfalloc || !PageUptodate(page)) {
+ if (page->mapping == mapping) {
+ VM_BUG_ON(PageWriteback(page));
+ truncate_inode_page(mapping, page);
+ }
}
unlock_page(page);
}
@@ -517,12 +524,12 @@ void shmem_truncate_range(struct inode *
min(end - index, (pgoff_t)PAGEVEC_SIZE),
pvec.pages, indices);
if (!pvec.nr) {
- if (index == start)
+ if (index == start || unfalloc)
break;
index = start;
continue;
}
- if (index == start && indices[0] >= end) {
+ if ((index == start || unfalloc) && indices[0] >= end) {
shmem_deswap_pagevec(&pvec);
pagevec_release(&pvec);
break;
@@ -536,15 +543,19 @@ void shmem_truncate_range(struct inode *
break;

if (radix_tree_exceptional_entry(page)) {
+ if (unfalloc)
+ continue;
nr_swaps_freed += !shmem_free_swap(mapping,
index, page);
continue;
}

lock_page(page);
- if (page->mapping == mapping) {
- VM_BUG_ON(PageWriteback(page));
- truncate_inode_page(mapping, page);
+ if (!unfalloc || !PageUptodate(page)) {
+ if (page->mapping == mapping) {
+ VM_BUG_ON(PageWriteback(page));
+ truncate_inode_page(mapping, page);
+ }
}
unlock_page(page);
}
@@ -558,7 +569,11 @@ void shmem_truncate_range(struct inode *
info->swapped -= nr_swaps_freed;
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
+}

+void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
+{
+ shmem_undo_range(inode, lstart, lend, false);
inode->i_ctime = inode->i_mtime = CURRENT_TIME;
}
EXPORT_SYMBOL_GPL(shmem_truncate_range);
@@ -771,6 +786,18 @@ static int shmem_writepage(struct page *
WARN_ON_ONCE(1); /* Still happens? Tell us about it! */
goto redirty;
}
+
+ /*
+ * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
+ * value into swapfile.c, the only way we can correctly account for a
+ * fallocated page arriving here is now to initialize it and write it.
+ */
+ if (!PageUptodate(page)) {
+ clear_highpage(page);
+ flush_dcache_page(page);
+ SetPageUptodate(page);
+ }
+
swap = get_swap_page();
if (!swap.val)
goto redirty;
@@ -994,6 +1021,7 @@ static int shmem_getpage_gfp(struct inod
swp_entry_t swap;
int error;
int once = 0;
+ int alloced = 0;

if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
return -EFBIG;
@@ -1005,19 +1033,21 @@ repeat:
page = NULL;
}

- if (sgp != SGP_WRITE &&
+ if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
error = -EINVAL;
goto failed;
}

+ /* fallocated page? */
+ if (page && !PageUptodate(page)) {
+ if (sgp != SGP_READ)
+ goto clear;
+ unlock_page(page);
+ page_cache_release(page);
+ page = NULL;
+ }
if (page || (sgp == SGP_READ && !swap.val)) {
- /*
- * Once we can get the page lock, it must be uptodate:
- * if there were an error in reading back from swap,
- * the page would not be inserted into the filecache.
- */
- BUG_ON(page && !PageUptodate(page));
*pagep = page;
return 0;
}
@@ -1114,9 +1144,18 @@ repeat:
inode->i_blocks += BLOCKS_PER_PAGE;
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
+ alloced = true;

/*
- * Let SGP_WRITE caller clear ends if write does not fill page
+ * Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
+ */
+ if (sgp == SGP_FALLOC)
+ sgp = SGP_WRITE;
+clear:
+ /*
+ * Let SGP_WRITE caller clear ends if write does not fill page;
+ * but SGP_FALLOC on a page fallocated earlier must initialize
+ * it now, lest undo on failure cancel our earlier guarantee.
*/
if (sgp != SGP_WRITE) {
clear_highpage(page);
@@ -1128,10 +1167,13 @@ repeat:
}

/* Perhaps the file has been truncated since we checked */
- if (sgp != SGP_WRITE &&
+ if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
error = -EINVAL;
- goto trunc;
+ if (alloced)
+ goto trunc;
+ else
+ goto failed;
}
*pagep = page;
return 0;
@@ -1140,6 +1182,7 @@ repeat:
* Error recovery.
*/
trunc:
+ info = SHMEM_I(inode);
ClearPageDirty(page);
delete_from_page_cache(page);
spin_lock(&info->lock);
@@ -1147,6 +1190,7 @@ trunc:
inode->i_blocks -= BLOCKS_PER_PAGE;
spin_unlock(&info->lock);
decused:
+ sbinfo = SHMEM_SB(inode->i_sb);
if (sbinfo->max_blocks)
percpu_counter_add(&sbinfo->used_blocks, -1);
unacct:
@@ -1645,25 +1689,20 @@ static long shmem_fallocate(struct file
if (signal_pending(current))
error = -EINTR;
else
- error = shmem_getpage(inode, index, &page, SGP_WRITE,
+ error = shmem_getpage(inode, index, &page, SGP_FALLOC,
NULL);
if (error) {
- /*
- * We really ought to free what we allocated so far,
- * but it would be wrong to free pages allocated
- * earlier, or already now in use: i_mutex does not
- * exclude all cases. We do not know what to free.
- */
+ /* Remove the !PageUptodate pages we added */
+ shmem_undo_range(inode,
+ (loff_t)start << PAGE_CACHE_SHIFT,
+ (loff_t)index << PAGE_CACHE_SHIFT, true);
goto ctime;
}

- if (!PageUptodate(page)) {
- clear_highpage(page);
- flush_dcache_page(page);
- SetPageUptodate(page);
- }
/*
- * set_page_dirty so that memory pressure will swap rather
+ * If !PageUptodate, leave it that way so that freeable pages
+ * can be recognized if we need to rollback on error later.
+ * But set_page_dirty so that memory pressure will swap rather
* than free the pages we are allocating (and SGP_CACHE pages
* might still be clean: we now need to mark those dirty too).
*/

2012-05-12 12:21:41

by Hugh Dickins

Subject: [PATCH 9/10] tmpfs: quit when fallocate fills memory

As it stands, a large fallocate() on tmpfs is liable to fill memory with
pages, freed on failure except when they run into swap, at which point
they become fixed into the file despite the failure. That feels quite
wrong, to be consuming resources precisely when they're in short supply.

Go the other way instead: have shmem_fallocate() indicate the range it has
fallocated to shmem_writepage(), keeping count of pages it's allocating;
have shmem_writepage() reactivate, instead of swapping out, pages fallocated
by this syscall (but happily swap out those from earlier occasions), keeping
count; and have shmem_fallocate() compare counts and give up once the
reactivated pages have started coming back to writepage (approximately:
some zones would in fact recycle faster than others).

This is a little unusual, but works well: although we could consider the
failure to swap as a bug, and fix it later with SWAP_MAP_FALLOC handling
added in swapfile.c and memcontrol.c, I doubt that we shall ever want to.

(If there's no swap, an over-large fallocate() on tmpfs is limited in the
same way as writing: stopped by rlimit, or by tmpfs mount size if that was
set sensibly, or by __vm_enough_memory() heuristics if OVERCOMMIT_GUESS or
OVERCOMMIT_NEVER. If OVERCOMMIT_ALWAYS, then it is liable to OOM-kill
others as writing would, but stops and frees if interrupted.)

Now that everything is freed on failure, we can then skip updating ctime.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 56 insertions(+), 2 deletions(-)

--- 3045N.orig/mm/shmem.c 2012-05-05 10:46:53.860063201 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:47:02.216063339 -0700
@@ -84,6 +84,18 @@ struct shmem_xattr {
char value[0];
};

+/*
+ * shmem_fallocate and shmem_writepage communicate via inode->i_private
+ * (with i_mutex making sure that it has only one user at a time):
+ * we would prefer not to enlarge the shmem inode just for that.
+ */
+struct shmem_falloc {
+ pgoff_t start; /* start of range currently being fallocated */
+ pgoff_t next; /* the next page offset to be fallocated */
+ pgoff_t nr_falloced; /* how many new pages have been fallocated */
+ pgoff_t nr_unswapped; /* how often writepage refused to swap out */
+};
+
/* Flag allocation requirements to shmem_getpage */
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
@@ -791,8 +803,28 @@ static int shmem_writepage(struct page *
* This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
* value into swapfile.c, the only way we can correctly account for a
* fallocated page arriving here is now to initialize it and write it.
+ *
+ * That's okay for a page already fallocated earlier, but if we have
+ * not yet completed the fallocation, then (a) we want to keep track
+ * of this page in case we have to undo it, and (b) it may not be a
+ * good idea to continue anyway, once we're pushing into swap. So
+ * reactivate the page, and let shmem_fallocate() quit when too many.
*/
if (!PageUptodate(page)) {
+ if (inode->i_private) {
+ struct shmem_falloc *shmem_falloc;
+ spin_lock(&inode->i_lock);
+ shmem_falloc = inode->i_private;
+ if (shmem_falloc &&
+ index >= shmem_falloc->start &&
+ index < shmem_falloc->next)
+ shmem_falloc->nr_unswapped++;
+ else
+ shmem_falloc = NULL;
+ spin_unlock(&inode->i_lock);
+ if (shmem_falloc)
+ goto redirty;
+ }
clear_highpage(page);
flush_dcache_page(page);
SetPageUptodate(page);
@@ -1647,6 +1679,7 @@ static long shmem_fallocate(struct file
{
struct inode *inode = file->f_path.dentry->d_inode;
struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+ struct shmem_falloc shmem_falloc;
pgoff_t start, index, end;
int error;

@@ -1679,6 +1712,14 @@ static long shmem_fallocate(struct file
goto out;
}

+ shmem_falloc.start = start;
+ shmem_falloc.next = start;
+ shmem_falloc.nr_falloced = 0;
+ shmem_falloc.nr_unswapped = 0;
+ spin_lock(&inode->i_lock);
+ inode->i_private = &shmem_falloc;
+ spin_unlock(&inode->i_lock);
+
for (index = start; index < end; index++) {
struct page *page;

@@ -1688,6 +1729,8 @@ static long shmem_fallocate(struct file
*/
if (signal_pending(current))
error = -EINTR;
+ else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
+ error = -ENOMEM;
else
error = shmem_getpage(inode, index, &page, SGP_FALLOC,
NULL);
@@ -1696,10 +1739,18 @@ static long shmem_fallocate(struct file
shmem_undo_range(inode,
(loff_t)start << PAGE_CACHE_SHIFT,
(loff_t)index << PAGE_CACHE_SHIFT, true);
- goto ctime;
+ goto undone;
}

/*
+ * Inform shmem_writepage() how far we have reached.
+ * No need for lock or barrier: we have the page lock.
+ */
+ shmem_falloc.next++;
+ if (!PageUptodate(page))
+ shmem_falloc.nr_falloced++;
+
+ /*
* If !PageUptodate, leave it that way so that freeable pages
* can be recognized if we need to rollback on error later.
* But set_page_dirty so that memory pressure will swap rather
@@ -1714,8 +1765,11 @@ static long shmem_fallocate(struct file

if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size)
i_size_write(inode, offset + len);
-ctime:
inode->i_ctime = CURRENT_TIME;
+undone:
+ spin_lock(&inode->i_lock);
+ inode->i_private = NULL;
+ spin_unlock(&inode->i_lock);
out:
mutex_unlock(&inode->i_mutex);
return error;

2012-05-12 12:27:51

by Hugh Dickins

Subject: [PATCH 10/10] tmpfs: support SEEK_DATA and SEEK_HOLE

It's quite easy for tmpfs to scan the radix_tree to support llseek's
new SEEK_DATA and SEEK_HOLE options: so add them while the minutiae
are still on my mind (in particular, the !PageUptodate-ness of pages
fallocated but still unwritten).

But I don't know who actually uses SEEK_DATA or SEEK_HOLE, and whether
it would be of any use to them on tmpfs. This code adds 92 lines and
752 bytes on x86_64 - is that bloat or worthwhile?
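
For illustration (a user-space sketch of my own, not part of the patch):
walking the allocated extents of a sparse tmpfs file; note that pages
fallocated but still unwritten (!Uptodate) are reported as holes.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef SEEK_DATA		/* values from linux/fs.h, if libc lacks them */
#define SEEK_DATA	3
#define SEEK_HOLE	4
#endif

int main(void)
{
	int fd = open("/dev/shm/sparse", O_RDONLY);
	off_t data = 0, hole;

	if (fd < 0)
		return 1;
	/* each iteration reports one run of written (uptodate) pages */
	while ((data = lseek(fd, data, SEEK_DATA)) >= 0) {
		hole = lseek(fd, data, SEEK_HOLE);
		printf("data %lld..%lld\n", (long long)data, (long long)hole);
		data = hole;
	}
	return close(fd);
}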

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 93 insertions(+), 1 deletion(-)

--- 3045N.orig/mm/shmem.c 2012-05-05 10:47:02.216063339 -0700
+++ 3045N/mm/shmem.c 2012-05-05 10:47:09.724063528 -0700
@@ -439,6 +439,56 @@ void shmem_unlock_mapping(struct address
}

/*
+ * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
+ */
+static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
+ pgoff_t index, pgoff_t end, int origin)
+{
+ struct page *page;
+ struct pagevec pvec;
+ pgoff_t indices[PAGEVEC_SIZE];
+ bool done = false;
+ int i;
+
+ pagevec_init(&pvec, 0);
+ pvec.nr = 1; /* start small: we may be there already */
+ while (!done) {
+ pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+ pvec.nr, pvec.pages, indices);
+ if (!pvec.nr) {
+ if (origin == SEEK_DATA)
+ index = end;
+ break;
+ }
+ for (i = 0; i < pvec.nr; i++, index++) {
+ if (index < indices[i]) {
+ if (origin == SEEK_HOLE) {
+ done = true;
+ break;
+ }
+ index = indices[i];
+ }
+ page = pvec.pages[i];
+ if (page && !radix_tree_exceptional_entry(page)) {
+ if (!PageUptodate(page))
+ page = NULL;
+ }
+ if (index >= end ||
+ (page && origin == SEEK_DATA) ||
+ (!page && origin == SEEK_HOLE)) {
+ done = true;
+ break;
+ }
+ }
+ shmem_deswap_pagevec(&pvec);
+ pagevec_release(&pvec);
+ pvec.nr = PAGEVEC_SIZE;
+ cond_resched();
+ }
+ return index;
+}
+
+/*
* Remove range of pages and swap entries from radix tree, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
*/
@@ -1674,6 +1724,48 @@ static ssize_t shmem_file_splice_read(st
return error;
}

+static loff_t shmem_file_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct address_space *mapping;
+ struct inode *inode;
+ pgoff_t start, end;
+ loff_t new_offset;
+
+ if (origin != SEEK_DATA && origin != SEEK_HOLE)
+ return generic_file_llseek_size(file, offset, origin,
+ MAX_LFS_FILESIZE);
+ mapping = file->f_mapping;
+ inode = mapping->host;
+ mutex_lock(&inode->i_mutex);
+ /* We're holding i_mutex so we can access i_size directly */
+
+ if (offset < 0)
+ offset = -EINVAL;
+ else if (offset >= inode->i_size)
+ offset = -ENXIO;
+ else {
+ start = offset >> PAGE_CACHE_SHIFT;
+ end = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
+ new_offset = shmem_seek_hole_data(mapping, start, end, origin);
+ new_offset <<= PAGE_CACHE_SHIFT;
+ if (new_offset > offset) {
+ if (new_offset < inode->i_size)
+ offset = new_offset;
+ else if (origin == SEEK_DATA)
+ offset = -ENXIO;
+ else
+ offset = inode->i_size;
+ }
+ }
+
+ if (offset >= 0 && offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+ mutex_unlock(&inode->i_mutex);
+ return offset;
+}
+
static long shmem_fallocate(struct file *file, int mode, loff_t offset,
loff_t len)
{
@@ -2667,7 +2759,7 @@ static const struct address_space_operat
static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
#ifdef CONFIG_TMPFS
- .llseek = generic_file_llseek,
+ .llseek = shmem_file_llseek,
.read = do_sync_read,
.write = do_sync_write,
.aio_read = shmem_file_aio_read,

2012-05-14 08:55:50

by Cong Wang

Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

On 05/12/2012 07:59 PM, Hugh Dickins wrote:
> + VM_BUG_ON(!PageLocked(oldpage));
> + __set_page_locked(newpage);
> + VM_BUG_ON(!PageUptodate(oldpage));
> + SetPageUptodate(newpage);
> + VM_BUG_ON(!PageSwapBacked(oldpage));
> + SetPageSwapBacked(newpage);
> + VM_BUG_ON(!swap_index);
> + set_page_private(newpage, swap_index);
> + VM_BUG_ON(!PageSwapCache(oldpage));
> + SetPageSwapCache(newpage);
> +

Are all of these VM_BUG_ON's necessary?

2012-05-14 09:08:15

by Kamezawa Hiroyuki

Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

(2012/05/12 20:59), Hugh Dickins wrote:

> The GMA500 GPU driver uses GEM shmem objects, but with a new twist:
> the backing RAM has to be below 4GB. Not a problem while the boards
> supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
> their GMA3600 is managed by the GMA500 driver.
>
> shmem/tmpfs has never pretended to support hardware restrictions on
> the backing memory, but it might have appeared to do so before v3.1,
> and even now it works fine until a page is swapped out then back in.
> When read_cache_page_gfp() supplied a freshly allocated page for copy,
> that compensated for whatever choice might have been made by earlier
> swapin readahead; but swapoff was likely to destroy the illusion.
>
> We'd like to continue to support GMA500, so now add a new
> shmem_should_replace_page() check on the zone when about to move
> a page from swapcache to filecache (in swapin and swapoff cases),
> with shmem_replace_page() to allocate and substitute a suitable page
> (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
>
> This does involve a minor extension to mem_cgroup_replace_page_cache()
> (the page may or may not have already been charged); and I've removed
> a comment and call to mem_cgroup_uncharge_cache_page(), which in fact
> is always a no-op while PageSwapCache.
>
> Also removed optimization of an unlikely path in shmem_getpage_gfp(),
> now that we need to check PageSwapCache more carefully (a racing caller
> might already have made the copy). And at one point shmem_unuse_inode()
> needs to use the hitherto private page_swapcount(), to guard against
> racing with inode eviction.
>
> It would make sense to extend shmem_should_replace_page(), to cover
> cpuset and NUMA mempolicy restrictions too, but set that aside for
> now: needs a cleanup of shmem mempolicy handling, and more testing,
> and ought to handle swap faults in do_swap_page() as well as shmem.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---



Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2012-05-14 09:35:10

by Cong Wang

Subject: Re: [PATCH 2/10] tmpfs: enable NOSEC optimization

On 05/12/2012 08:02 PM, Hugh Dickins wrote:
> Let tmpfs into the NOSEC optimization (avoiding file_remove_suid()
> overhead on most common writes): set MS_NOSEC on its superblocks.
>
> Signed-off-by: Hugh Dickins<[email protected]>
> ---
> mm/shmem.c | 1 +
> 1 file changed, 1 insertion(+)
>
> --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700
> +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700
> @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block
> }
> }
> sb->s_export_op =&shmem_export_ops;
> + sb->s_flags |= MS_NOSEC;

Isn't setting the flag on inode better? Something like:

diff --git a/mm/shmem.c b/mm/shmem.c
index f99ff3e..7d98fb5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2325,6 +2325,7 @@ static void shmem_init_inode(void *foo)
{
struct shmem_inode_info *info = foo;
inode_init_once(&info->vfs_inode);
+ info->vfs_inode.i_flags |= S_NOSEC;
}

static int shmem_init_inodecache(void)

2012-05-14 19:42:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

On Mon, 14 May 2012, Cong Wang wrote:
> On 05/12/2012 07:59 PM, Hugh Dickins wrote:
> > + VM_BUG_ON(!PageLocked(oldpage));
> > + __set_page_locked(newpage);
> > + VM_BUG_ON(!PageUptodate(oldpage));
> > + SetPageUptodate(newpage);
> > + VM_BUG_ON(!PageSwapBacked(oldpage));
> > + SetPageSwapBacked(newpage);
> > + VM_BUG_ON(!swap_index);
> > + set_page_private(newpage, swap_index);
> > + VM_BUG_ON(!PageSwapCache(oldpage));
> > + SetPageSwapCache(newpage);
> > +
>
> Are all of these VM_BUG_ON's necessary?

I'm really glad you asked that - thank you.

At first I was just going to brush you off with a standard reply of
something like "well, no BUG_ON should ever be necessary, but we do
find them helpful in practice".

But (a) these ones have probably outlived their usefulness: they were
certainly reassuring to me when I was testing, but perhaps now are
just cluttering up the flow. I did make them "VM_" BUG_ONs in the
hope that distros wouldn't waste space and time switching them on, but
now I'm inclined to agree with you that they should just be removed.
Most of them are doing no more than confirm what's been checked before
calling the function (and confirming that status cannot racily change).

And (b) whereas they didn't actually catch anything for me, they have
been giving false assurance: because (I believe) there really is a bug
lurking there that they have not yet met and caught. And I would have
missed it if you hadn't directed me back to think about these.

It is an exceedingly unlikely bug (and need not delay use of the patch),
but what I'm re-remembering is just how slippery swap is: the problem is
that a swapcache page can get freed and reused before getting the page
lock on it; and it might even get reused for swapcache. Perhaps I need
also to be checking page->private, or perhaps I need to check for error
instead of BUG_ON(error) just before the lines you picked out, or both.

I'm not going to rush the incremental patch to fix this: need to think
about it quietly first.

If you're wondering what I'm talking about (sorry, I don't have time
to explain more right now), take a look at comment and git history of
line 2956 (in 3.4-rc7) of mm/memory.c:
if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
I don't suppose anyone ever actually hit the bug in the years before
we added that protection, but we still ought to guard against it,
there and here in shmem_replace_page().
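
To make the pattern concrete, here's a minimal sketch of the kind of
re-check-under-lock I mean (not the actual patch; lookup_swap_cache()
here just stands for however the swapcache page was found):

	page = lookup_swap_cache(entry);	/* unlocked lookup, takes a reference */
	if (page) {
		lock_page(page);
		if (!PageSwapCache(page) || page_private(page) != entry.val) {
			/* raced: page was freed and reused, perhaps even
			 * for different swap - it is not ours any more */
			unlock_page(page);
			page_cache_release(page);
			return -EEXIST;		/* caller retries */
		}
		/* now it really is the swapcache page for entry */
	}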

Hugh

2012-05-14 19:48:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/10] tmpfs: enable NOSEC optimization

On Mon, 14 May 2012, Cong Wang wrote:
> On 05/12/2012 08:02 PM, Hugh Dickins wrote:
> > Let tmpfs into the NOSEC optimization (avoiding file_remove_suid()
> > overhead on most common writes): set MS_NOSEC on its superblocks.
> >
> > Signed-off-by: Hugh Dickins<[email protected]>
> > ---
> > mm/shmem.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > --- 3045N.orig/mm/shmem.c 2012-05-05 10:45:17.888060878 -0700
> > +++ 3045N/mm/shmem.c 2012-05-05 10:46:05.732062006 -0700
> > @@ -2361,6 +2361,7 @@ int shmem_fill_super(struct super_block
> > }
> > }
> > sb->s_export_op =&shmem_export_ops;
> > + sb->s_flags |= MS_NOSEC;
>
> Isn't setting the flag on inode better? Something like:

I don't think so. The MS_NOSEC/S_NOSEC business is fairly subtle,
and easy to miss if it's gone wrong, so I would much rather follow
the established pattern in local block filesystems: which is to set
MS_NOSEC in superblock flags, and leave S_NOSEC to file_remove_suid().
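
Roughly, from memory (a sketch of the VFS side, not something this
patch touches), the interplay is:

	int file_remove_suid(struct file *file)
	{
		struct inode *inode = file->f_path.dentry->d_inode;
		int error = 0;

		if (IS_NOSEC(inode))	/* S_NOSEC already cached: cheap exit */
			return 0;

		/* ... strip setuid/setgid/security caps here if needed ... */

		if (!error && (inode->i_sb->s_flags & MS_NOSEC))
			inode->i_flags |= S_NOSEC;  /* cache "nothing to strip" */
		return error;
	}

So all the filesystem has to do is set MS_NOSEC on the superblock,
and the per-inode fast path looks after itself on the first write.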

Hugh

>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index f99ff3e..7d98fb5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2325,6 +2325,7 @@ static void shmem_init_inode(void *foo)
> {
> struct shmem_inode_info *info = foo;
> inode_init_once(&info->vfs_inode);
> + info->vfs_inode.i_flags |= S_NOSEC;
> }
>
> static int shmem_init_inodecache(void)

2012-05-14 23:13:34

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

On Sat, 12 May 2012 04:59:56 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> The GMA500 GPU driver uses GEM shmem objects, but with a new twist:
> the backing RAM has to be below 4GB. Not a problem while the boards
> supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
> their GMA3600 is managed by the GMA500 driver.
>
> shmem/tmpfs has never pretended to support hardware restrictions on
> the backing memory, but it might have appeared to do so before v3.1,
> and even now it works fine until a page is swapped out then back in.
> When read_cache_page_gfp() supplied a freshly allocated page for copy,
> that compensated for whatever choice might have been made by earlier
> swapin readahead; but swapoff was likely to destroy the illusion.
>
> We'd like to continue to support GMA500, so now add a new
> shmem_should_replace_page() check on the zone when about to move
> a page from swapcache to filecache (in swapin and swapoff cases),
> with shmem_replace_page() to allocate and substitute a suitable page
> (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
>
> This does involve a minor extension to mem_cgroup_replace_page_cache()
> (the page may or may not have already been charged); and I've removed
> a comment and call to mem_cgroup_uncharge_cache_page(), which in fact
> is always a no-op while PageSwapCache.
>
> Also removed optimization of an unlikely path in shmem_getpage_gfp(),
> now that we need to check PageSwapCache more carefully (a racing caller
> might already have made the copy). And at one point shmem_unuse_inode()
> needs to use the hitherto private page_swapcount(), to guard against
> racing with inode eviction.
>
> It would make sense to extend shmem_should_replace_page(), to cover
> cpuset and NUMA mempolicy restrictions too, but set that aside for
> now: needs a cleanup of shmem mempolicy handling, and more testing,
> and ought to handle swap faults in do_swap_page() as well as shmem.
>
> ...
>
> static int shmem_unuse_inode(struct shmem_inode_info *info,
> - swp_entry_t swap, struct page *page)
> + swp_entry_t swap, struct page **pagep)
> {
> struct address_space *mapping = info->vfs_inode.i_mapping;
> void *radswap;
> pgoff_t index;
> - int error;
> + gfp_t gfp;
> + int error = 0;
>
> radswap = swp_to_radix_entry(swap);
> index = radix_tree_locate_item(&mapping->page_tree, radswap);
> @@ -625,22 +629,37 @@ static int shmem_unuse_inode(struct shme
> if (shmem_swaplist.next != &info->swaplist)
> list_move_tail(&shmem_swaplist, &info->swaplist);
>
> + gfp = mapping_gfp_mask(mapping);
> + if (shmem_should_replace_page(*pagep, gfp)) {
> + mutex_unlock(&shmem_swaplist_mutex);
> + error = shmem_replace_page(pagep, gfp, info, index);
> + mutex_lock(&shmem_swaplist_mutex);
> + /*
> + * We needed to drop mutex to make that restrictive page
> + * allocation; but the inode might already be freed by now,
> + * and we cannot refer to inode or mapping or info to check.
> + * However, we do hold page lock on the PageSwapCache page,
> + * so can check if that still has our reference remaining.
> + */
> + if (!page_swapcount(*pagep))
> + error = -ENOENT;

This has my head spinning a bit. What is "our reference"? I'd expect
that to mean a temporary reference which was taken by this thread of
control. But such a thing has no relevance when trying to determine
the state of the page and/or data structures which refer to it.

Also, what are we trying to determine here with this test? Whether the
page was removed from swapcache under our feet? Presumably not, as it
is locked.

So perhaps you could spell out in more detail what we're trying to do
here, and what contributes to page_swapcount() here?


> + }
> +
> /*
> * We rely on shmem_swaplist_mutex, not only to protect the swaplist,
> * but also to hold up shmem_evict_inode(): so inode cannot be freed
> * beneath us (pagelock doesn't help until the page is in pagecache).
> */
> - error = shmem_add_to_page_cache(page, mapping, index,
> + if (!error)
> + error = shmem_add_to_page_cache(*pagep, mapping, index,
> GFP_NOWAIT, radswap);
> - /* which does mem_cgroup_uncharge_cache_page on error */
> -
> if (error != -ENOMEM) {
> /*
> * Truncation and eviction use free_swap_and_cache(), which
> * only does trylock page: if we raced, best clean up here.
> */
> - delete_from_swap_cache(page);
> - set_page_dirty(page);
> + delete_from_swap_cache(*pagep);
> + set_page_dirty(*pagep);
> if (!error) {
> spin_lock(&info->lock);
> info->swapped--;
> @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct
> struct list_head *this, *next;
> struct shmem_inode_info *info;
> int found = 0;
> - int error;
> + int error = 0;
> +
> + /*
> + * There's a faint possibility that swap page was replaced before
> + * caller locked it: it will come back later with the right page.

So a caller locked the page then failed to check that it's still the
right sort of page? Shouldn't the caller locally clean up its own mess
rather than requiring a callee to know about the caller's intricate
shortcomings?

> + */
> + if (unlikely(!PageSwapCache(page)))
> + goto out;
>
> /*
> * Charge page using GFP_KERNEL while we can wait, before taking
>
> ...
>
> @@ -856,6 +880,84 @@ static inline struct mempolicy *shmem_ge
> #endif
>
> /*
> + * When a page is moved from swapcache to shmem filecache (either by the
> + * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of
> + * shmem_unuse_inode()), it may have been read in earlier from swap, in
> + * ignorance of the mapping it belongs to. If that mapping has special
> + * constraints (like the gma500 GEM driver, which requires RAM below 4GB),
> + * we may need to copy to a suitable page before moving to filecache.
> + *
> + * In a future release, this may well be extended to respect cpuset and
> + * NUMA mempolicy, and applied also to anonymous pages in do_swap_page();
> + * but for now it is a simple matter of zone.
> + */
> +static bool shmem_should_replace_page(struct page *page, gfp_t gfp)
> +{
> + return page_zonenum(page) > gfp_zone(gfp);
> +}
> +
> +static int shmem_replace_page(struct page **pagep, gfp_t gfp,
> + struct shmem_inode_info *info, pgoff_t index)
> +{
> + struct page *oldpage, *newpage;
> + struct address_space *swap_mapping;
> + pgoff_t swap_index;
> + int error;
> +
> + oldpage = *pagep;
> + swap_index = page_private(oldpage);
> + swap_mapping = page_mapping(oldpage);
> +
> + /*
> + * We have arrived here because our zones are constrained, so don't
> + * limit chance of success by further cpuset and node constraints.
> + */
> + gfp &= ~GFP_CONSTRAINT_MASK;
> + newpage = shmem_alloc_page(gfp, info, index);
> + if (!newpage)
> + return -ENOMEM;
> + VM_BUG_ON(shmem_should_replace_page(newpage, gfp));
> +
> + *pagep = newpage;
> + page_cache_get(newpage);
> + copy_highpage(newpage, oldpage);

copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()?

> + VM_BUG_ON(!PageLocked(oldpage));
> + __set_page_locked(newpage);
> + VM_BUG_ON(!PageUptodate(oldpage));
> + SetPageUptodate(newpage);
> + VM_BUG_ON(!PageSwapBacked(oldpage));
> + SetPageSwapBacked(newpage);
> + VM_BUG_ON(!swap_index);
> + set_page_private(newpage, swap_index);
> + VM_BUG_ON(!PageSwapCache(oldpage));
> + SetPageSwapCache(newpage);
> +
> + /*
> + * Our caller will very soon move newpage out of swapcache, but it's
> + * a nice clean interface for us to replace oldpage by newpage there.
> + */
> + spin_lock_irq(&swap_mapping->tree_lock);
> + error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
> + newpage);
> + __inc_zone_page_state(newpage, NR_FILE_PAGES);
> + __dec_zone_page_state(oldpage, NR_FILE_PAGES);
> + spin_unlock_irq(&swap_mapping->tree_lock);
> + BUG_ON(error);
> +
> + mem_cgroup_replace_page_cache(oldpage, newpage);
> + lru_cache_add_anon(newpage);
> +
> + ClearPageSwapCache(oldpage);
> + set_page_private(oldpage, 0);
> +
> + unlock_page(oldpage);
> + page_cache_release(oldpage);
> + page_cache_release(oldpage);
> + return 0;
> +}

shmem_replace_page() is a fairly generic and unexceptional sounding
thing. Methinks shmem_substitute_page() would be a better name.

> +/*
> * shmem_getpage_gfp - find page in cache, or get from swap, or allocate
> *
> * If we allocate a new one we do not mark it dirty. That's up to the
>
> ...
>

2012-05-15 04:08:11

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

On Mon, 14 May 2012, Andrew Morton wrote:
> On Sat, 12 May 2012 04:59:56 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
> >
> > We'd like to continue to support GMA500, so now add a new
> > shmem_should_replace_page() check on the zone when about to move
> > a page from swapcache to filecache (in swapin and swapoff cases),
> > with shmem_replace_page() to allocate and substitute a suitable page
> > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
> >
...
> > + gfp = mapping_gfp_mask(mapping);
> > + if (shmem_should_replace_page(*pagep, gfp)) {
> > + mutex_unlock(&shmem_swaplist_mutex);
> > + error = shmem_replace_page(pagep, gfp, info, index);
> > + mutex_lock(&shmem_swaplist_mutex);
> > + /*
> > + * We needed to drop mutex to make that restrictive page
> > + * allocation; but the inode might already be freed by now,
> > + * and we cannot refer to inode or mapping or info to check.
> > + * However, we do hold page lock on the PageSwapCache page,
> > + * so can check if that still has our reference remaining.
> > + */
> > + if (!page_swapcount(*pagep))
> > + error = -ENOENT;
>
> This has my head spinning a bit. What is "our reference"? I'd expect
> that to mean a temporary reference which was taken by this thread of
> control.

(I'm sure you'll prefer a reworking of that comment in an incremental
fixes patch, but let me try to explain better here too.)

No, I didn't mean a temporary reference taken by this (swapoff) thread,
but the reference (swap entry) which has just been located in the inode's
radix_tree, just before this hunk: that reference shows up as a page_swapcount
of 1 (there's also a swapcache bit in the swap_map along with the count,
corresponding to the reference from the swapcache page itself, but that's
not included in page_swapcount).
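
For reference, page_swapcount() looks roughly like this (a sketch of
mm/swapfile.c from memory, eliding the locking details):

	int page_swapcount(struct page *page)
	{
		swp_entry_t entry = { .val = page_private(page) };
		struct swap_info_struct *si = swap_info_get(entry);
		int count = 0;

		if (si) {
			/* swap_count() masks off the swapcache bit, so the
			 * hold from the swapcache page itself is not counted */
			count = swap_count(si->swap_map[swp_offset(entry)]);
			spin_unlock(&swap_lock);
		}
		return count;
	}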

> But such a thing has no relevance when trying to determine
> the state of the page and/or data structures which refer to it.

I don't understand you there, but maybe it won't matter.

>
> Also, what are we trying to determine here with this test? Whether the
> page was removed from swapcache under our feet? Presumably not, as it
> is locked.
>
> So perhaps you could spell out in more detail what we're trying to do
> here, and what contributes to page_swapcount() here?

The danger here is that the inode we're dealing with has gone through
shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was
certainly in use before, and shmem_swaplist_mutex (together with inode
being on shmem_swaplist) holds it up from being evicted and freed; but
once we drop the mutex, it could go away at any moment. We cannot
determine that by looking at struct inode or struct address_space or
struct shmem_inode_info, they're all part of what would be freed;
but we cannot proceed to shmem_add_to_page_cache() once they're freed.
How to tell whether it's been freed?

Once upon a time I "solved" it with igrab() and iput(), but Konstantin
demonstrated how that gives no safety against unmounting, and I remain
reluctant to go back to relying upon filesystem semantics to solve this.

It occurred to me that the inode cannot be freed until that radix_tree
entry has been removed (by shmem_evict_inode's shmem_truncate_range),
and the act of removing that entry (free_swap_and_cache) brings
page_swapcount down from 1 to 0.

You're thinking that the page cannot be removed from swapcache while
we hold page lock: correct, but... free_swap_and_cache() only does a
trylock_page(), and happily leaves the swapcache page to be garbage
collected later if it cannot get the page lock. (And I certainly
would not want to change it to wait for page lock.) So, the inode
can get evicted while the page is still in swapcache: the page lock
gives no protection against that, until the page itself gets into
the radix_tree.

I doubt that writing this essay into a comment there will be the
right thing to do (and I may still be losing you); but I shall try
to rewrite it, and if there's one missing fact that needs highlighting,
it probably is that last, that free_swap_and_cache() only does a trylock,
so our page lock does not protect the inode from eviction.

(At this moment, I can't think what is the relevance of my comment
"we do hold page lock on the PageSwapCache page": in other contexts it
would be important, but here in swapoff we know that that swap cannot
get reused, or not before we're done.)
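
If it helps, the invariant I'm relying on boils down to this (a sketch
of the reasoning, not new code beyond what's in the patch):

	/*
	 * swap entry still counted in the swap_map
	 *	=> the inode's radix_tree entry for it not yet freed
	 *	=> shmem_evict_inode() has not completed
	 *	=> inode, mapping and info are still safe to use
	 */
	if (!page_swapcount(*pagep))
		error = -ENOENT;	/* inode may be gone: bail out safely */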

> > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct
> > struct list_head *this, *next;
> > struct shmem_inode_info *info;
> > int found = 0;
> > - int error;
> > + int error = 0;
> > +
> > + /*
> > + * There's a faint possibility that swap page was replaced before
> > + * caller locked it: it will come back later with the right page.
>
> So a caller locked the page then failed to check that it's still the
> right sort of page? Shouldn't the caller locally clean up its own mess
> rather than requiring a callee to know about the caller's intricate
> shortcomings?

The caller being try_to_unuse(). You're certainly not the first to argue
that way. Perhaps I'm a bit perverse, in letting code which works even
in the surprising cases remain as it is, without weeding out those
surprising cases. And on this occasion I didn't want to add an additional
dependence on a slightly subtle change in mm/swapfile.c functionality.

Hmm, yes, I do still prefer to have the check here in shmem.c:
particularly since it is this "shmem_replace_page" business which is
increasing the likelihood of such a race, and making further demands
on it (if we're going to make the copied page PageSwapCache, then we
need to be sure that the page it's replacing was PageSwapCache - though
that's something I need to think through again in the light of the race
which I thought of in responding to Cong).

> > + newpage = shmem_alloc_page(gfp, info, index);
> > + if (!newpage)
> > + return -ENOMEM;
> > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp));
> > +
> > + *pagep = newpage;
> > + page_cache_get(newpage);
> > + copy_highpage(newpage, oldpage);
>
> copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()?

Ooh, I'm pretty sure you're right that we do need flush_dcache_page()
there: good catch, thank you. We can't use copy_user_highpage() because
in general we don't know any address and vma; but should be following the
shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate.
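
For the copy case here, that would be (sketch of the fix I intend):

	copy_highpage(newpage, oldpage);
	flush_dcache_page(newpage);	/* matters on D-cache aliasing arches */
	...
	SetPageUptodate(newpage);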

>
> shmem_replace_page() is a fairly generic and unexceptional sounding
> thing. Methinks shmem_substitute_page() would be a better name.

Okay, shmem_replace_page() seemed appropriate to me (especially thinking
of it as "re-place"), but I don't mind changing to shmem_substitute_page().

The flush_dcache_page() addition is important, but until people are
using GMA500 on ARM or something (I doubt that combination) with more
than 4GB, this code is not coming into play - so I'm not breaking anyone's
system if it sneaks into linux-next before I fix that.

The main thing I need to think through quietly is the slippery swap race:
I'll send you an incremental patch to fix all these up once I'm satisfied
on that.

Thanks,
Hugh

2012-05-15 08:51:23

by Nicholas Piggin

[permalink] [raw]
Subject: Re: [PATCH 3/10] tmpfs: optimize clearing when writing

On 12 May 2012 22:04, Hugh Dickins <[email protected]> wrote:
> Nick proposed years ago that tmpfs should avoid clearing its pages where
> write will overwrite them with new data, as ramfs has long done.  But I
> messed it up and just got bad data.  Tried again recently, it works fine.
>
> Here's time output for writing 4GiB 16 times on this Core i5 laptop:
>
> before: real    0m21.169s user  0m0.028s sys    0m21.057s
>        real    0m21.382s user  0m0.016s sys    0m21.289s
>        real    0m21.311s user  0m0.020s sys    0m21.217s
>
> after:  real    0m18.273s user  0m0.032s sys    0m18.165s
>        real    0m18.354s user  0m0.020s sys    0m18.265s
>        real    0m18.440s user  0m0.032s sys    0m18.337s
>
> ramfs:  real    0m16.860s user  0m0.028s sys    0m16.765s
>        real    0m17.382s user  0m0.040s sys    0m17.273s
>        real    0m17.133s user  0m0.044s sys    0m17.021s

Cool, thanks Hugh! Very big speedup.


>
> Yes, I have done perf reports, but they need more explanation than they
> deserve: in summary, clear_page vanishes, its cache loading shifts into
> copy_user_generic_unrolled; shmem_getpage_gfp goes down, and surprisingly
> mark_page_accessed goes way up - I think because they are respectively
> where the cache gets to be reloaded after being purged by clear or copy.
>
> Suggested-by: Nick Piggin <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>  mm/shmem.c |   20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
>
> --- 3045N.orig/mm/shmem.c       2012-05-05 10:46:05.732062006 -0700
> +++ 3045N/mm/shmem.c    2012-05-05 10:46:12.316062172 -0700
> @@ -1095,9 +1095,14 @@ repeat:
>                shmem_recalc_inode(inode);
>                spin_unlock(&info->lock);
>
> -               clear_highpage(page);
> -               flush_dcache_page(page);
> -               SetPageUptodate(page);
> +               /*
> +                * Let SGP_WRITE caller clear ends if write does not fill page
> +                */
> +               if (sgp != SGP_WRITE) {
> +                       clear_highpage(page);
> +                       flush_dcache_page(page);
> +                       SetPageUptodate(page);
> +               }
>                if (sgp == SGP_DIRTY)
>                        set_page_dirty(page);
>        }
> @@ -1307,6 +1312,14 @@ shmem_write_end(struct file *file, struc
>        if (pos + copied > inode->i_size)
>                i_size_write(inode, pos + copied);
>
> +       if (!PageUptodate(page)) {
> +               if (copied < PAGE_CACHE_SIZE) {
> +                       unsigned from = pos & (PAGE_CACHE_SIZE - 1);
> +                       zero_user_segments(page, 0, from,
> +                                       from + copied, PAGE_CACHE_SIZE);
> +               }
> +               SetPageUptodate(page);
> +       }
>        set_page_dirty(page);
>        unlock_page(page);
>        page_cache_release(page);
> @@ -1768,6 +1781,7 @@ static int shmem_symlink(struct inode *d
>                kaddr = kmap_atomic(page);
>                memcpy(kaddr, symname, len);
>                kunmap_atomic(kaddr);
> +               SetPageUptodate(page);
>                set_page_dirty(page);
>                unlock_page(page);
>                page_cache_release(page);

2012-05-15 09:44:53

by Cong Wang

[permalink] [raw]
Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

On 05/15/2012 03:42 AM, Hugh Dickins wrote:
> I'm not going to rush the incremental patch to fix this: need to think
> about it quietly first.
>
> If you're wondering what I'm talking about (sorry, I don't have time
> to explain more right now), take a look at comment and git history of
> line 2956 (in 3.4-rc7) of mm/memory.c:
> if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
> I don't suppose anyone ever actually hit the bug in the years before
> we added that protection, but we still ought to guard against it,
> there and here in shmem_replace_page().
>

Ok, I have no objections.

Thanks for your patches anyway!

2012-05-19 01:42:05

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 1/10] shmem: replace page if mapping excludes its zone

On Mon, 14 May 2012, Hugh Dickins wrote:
> On Mon, 14 May 2012, Andrew Morton wrote:
> > On Sat, 12 May 2012 04:59:56 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
> > >
> > > We'd like to continue to support GMA500, so now add a new
> > > shmem_should_replace_page() check on the zone when about to move
> > > a page from swapcache to filecache (in swapin and swapoff cases),
> > > with shmem_replace_page() to allocate and substitute a suitable page
> > > (given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).
> > >
> ...
> > > + gfp = mapping_gfp_mask(mapping);
> > > + if (shmem_should_replace_page(*pagep, gfp)) {
> > > + mutex_unlock(&shmem_swaplist_mutex);
> > > + error = shmem_replace_page(pagep, gfp, info, index);
> > > + mutex_lock(&shmem_swaplist_mutex);
> > > + /*
> > > + * We needed to drop mutex to make that restrictive page
> > > + * allocation; but the inode might already be freed by now,
> > > + * and we cannot refer to inode or mapping or info to check.
> > > + * However, we do hold page lock on the PageSwapCache page,
> > > + * so can check if that still has our reference remaining.
> > > + */
> > > + if (!page_swapcount(*pagep))
> > > + error = -ENOENT;
> >
> > This has my head spinning a bit. What is "our reference"? I'd expect
> > that to mean a temporary reference which was taken by this thread of
> > control.
>
> (I'm sure you'll prefer a reworking of that comment in an incremental
> fixes patch, but let me try to explain better here too.)
>
> No, I didn't mean a temporary reference taken by this (swapoff) thread,
> but the reference (swap entry) which has just been located in the inode's
> radix_tree, just before this hunk: which would be tracked by page_swapcount
> 1 (there's also a page swapcache bit in the swap_map along with the count,
> corresponding to the reference from the swapcache page itself, but that's
> not included in page_swapcount).
>
> > But such a thing has no relevance when trying to determine
> > the state of the page and/or data structures which refer to it.
>
> I don't understand you there, but maybe it won't matter.
>
> >
> > Also, what are we trying to determine here with this test? Whether the
> > page was removed from swapcache under our feet? Presumably not, as it
> > is locked.
> >
> > So perhaps you could spell out in more detail what we're trying to do
> > here, and what contributes to page_swapcount() here?
>
> The danger here is that the inode we're dealing with has gone through
> shmem_evict_inode() while we dropped shmem_swaplist_mutex: inode was
> certainly in use before, and shmem_swaplist_mutex (together with inode
> being on shmem_swaplist) holds it up from being evicted and freed; but
> once we drop the mutex, it could go away at any moment. We cannot
> determine that by looking at struct inode or struct address_space or
> struct shmem_inode_info, they're all part of what would be freed;
> but we cannot proceed to shmem_add_to_page_cache() once they're freed.
> How to tell whether it's been freed?
>
> Once upon a time I "solved" it with igrab() and iput(), but Konstantin
> demonstrated how that gives no safety against unmounting, and I remain
> reluctant to go back to relying upon filesystem semantics to solve this.
>
> It occurred to me that the inode cannot be freed until that radix_tree
> entry has been removed (by shmem_evict_inode's shmem_truncate_range),
> and the act or removing that entry (free_swap_and_cache) brings
> page_swapcount down from 1 to 0.
>
> You're thinking that the page cannot be removed from swapcache while
> we hold page lock: correct, but... free_swap_and_cache() only does a
> trylock_page(), and happily leaves the swapcache page to be garbage
> collected later if it cannot get the page lock. (And I certainly
> would not want to change it to wait for page lock.) So, the inode
> can get evicted while the page is still in swapcache: the page lock
> gives no protection against that, until the page itself gets into
> the radix_tree.
>
> I doubt that writing this essay into a comment there will be the
> right thing to do (and I may still be losing you); but I shall try
> to rewrite it, and if there's one missing fact that needs highlighting,
> it probably is that last, that free_swap_and_cache() only does a trylock,
> so our page lock does not protect the inode from eviction.
>
> (At this moment, I can't think what is the relevance of my comment
> "we do hold page lock on the PageSwapCache page": in other contexts it
> would be important, but here in swapoff we know that that swap cannot
> get reused, or not before we're done.)
>
> > > @@ -660,7 +679,14 @@ int shmem_unuse(swp_entry_t swap, struct
> > > struct list_head *this, *next;
> > > struct shmem_inode_info *info;
> > > int found = 0;
> > > - int error;
> > > + int error = 0;
> > > +
> > > + /*
> > > + * There's a faint possibility that swap page was replaced before
> > > + * caller locked it: it will come back later with the right page.
> >
> > So a caller locked the page then failed to check that it's still the
> > right sort of page? Shouldn't the caller locally clean up its own mess
> > rather than requiring a callee to know about the caller's intricate
> > shortcomings?
>
> The caller being try_to_unuse(). You're certainly not the first to argue
> that way. Perhaps I'm a bit perverse, in letting code which works even
> in the surprising cases, remain as it is without weeding out those
> surprising cases. And on this occasion didn't want to add an additional
> dependence on a slight subtle change in mm/swapfile.c functionality.
>
> Hmm, yes, I do still prefer to have the check here in shmem.c:
> particularly since it is this "shmem_replace_page" business which is
> increasing the likelihood of such a race, and making further demands
> on it (if we're going to make the copied page PageSwapCache, then we
> need to be sure that the page it's replacing was PageSwapCache - though
> that's something I need to think through again in the light of the race
> which I thought of in responding to Cong).
>
> > > + newpage = shmem_alloc_page(gfp, info, index);
> > > + if (!newpage)
> > > + return -ENOMEM;
> > > + VM_BUG_ON(shmem_should_replace_page(newpage, gfp));
> > > +
> > > + *pagep = newpage;
> > > + page_cache_get(newpage);
> > > + copy_highpage(newpage, oldpage);
> >
> > copy_highpage() doesn't do flush_dcache_page() - did we need copy_user_highpage()?
>
> Ooh, I'm pretty sure you're right that we do need flush_dcache_page()
> there: good catch, thank you. We can't use copy_user_highpage() because
> in general we don't know any address and vma; but should be following the
> shmem_getpage_gfp() pattern of clear_highpage+flush_dcache_page+SetUptodate.
>
> >
> > shmem_replace_page() is a fairly generic and unexceptional sounding
> > thing. Methinks shmem_substitute_page() would be a better name.
>
> Okay, shmem_replace_page() seemed appropriate to me (especially thinking
> of it as "re-place"), but I don't mind changing to shmem_substitute_page().
>
> The flush_dcache_page() addition is important, but until people are
> using GMA500 on ARM or something (I doubt that combination) with more
> than 4GB, this code is not coming into play - so I'm not breaking anyone's
> system if it sneaks into linux-next before I fix that.
>
> The main thing I need to think through quietly is the slippery swap race:
> I'll send you an incremental patch to fix all these up once I'm satisfied
> on that.

I promised you an incremental, but that's not really possible because of
the name changes from "replace" to "substitute". So I'll be sending you
a v2 patch in a moment, to replace (or substitute for) the original 1/10.

It responds to feedback comment:

1. "substitute" instead of "replace" [akpm]
2. more explanation of page_swapcount test [akpm]
3. flush_dcache_page after copy_highpage [akpm]
4. removal of excessive VM_BUG_ONs [wangcong]
5. page_private checks before, and an error path within, substitute_page [hughd]

See below for a diff from v1 for review, omitting replace->substitute mods.

Please don't be disappointed if I send you a further patch to
shmem_substitute_page() in the weeks ahead: although the page_private
checks I've added in this one make it very very very unlikely, and its
consequence very probably benign, there is still a surprising (never
observed) race by which shmem_getpage_gfp() could get hold of someone
else's swap.

It's correctly resolved by shmem_add_to_page_cache(), but by that time
we have already done a mem_cgroup charge, and now also this substitution.
It would be better to rearrange a little here, to eliminate all chance of
that surprise: I hoped to complete that earlier, but now think I'd better
get the safer intermediate version to you first.

Thanks,
Hugh

---
mm/shmem.c | 57 +++++++++++++++++++++++++++++++++------------------
1 file changed, 37 insertions(+), 20 deletions(-)

--- 3045N.orig/mm/shmem.c 2012-05-17 16:28:43.278076430 -0700
+++ 3045N/mm/shmem.c 2012-05-18 16:28:33.642198028 -0700
@@ -636,10 +636,21 @@ static int shmem_unuse_inode(struct shme
mutex_lock(&shmem_swaplist_mutex);
/*
* We needed to drop mutex to make that restrictive page
- * allocation; but the inode might already be freed by now,
- * and we cannot refer to inode or mapping or info to check.
- * However, we do hold page lock on the PageSwapCache page,
- * so can check if that still has our reference remaining.
+ * allocation, but the inode might have been freed while we
+ * dropped it: although a racing shmem_evict_inode() cannot
+ * complete without emptying the radix_tree, our page lock
+ * on this swapcache page is not enough to prevent that -
+ * free_swap_and_cache() of our swap entry will only
+ * trylock_page(), removing swap from radix_tree whatever.
+ *
+ * We must not proceed to shmem_add_to_page_cache() if the
+ * inode has been freed, but of course we cannot rely on
+ * inode or mapping or info to check that. However, we can
+ * safely check if our swap entry is still in use (and here
+ * it can't have got reused for another page): if it's still
+ * in use, then the inode cannot have been freed yet, and we
+ * can safely proceed (if it's no longer in use, that tells
+ * nothing about the inode, but we don't need to unuse swap).
*/
if (!page_swapcount(*pagep))
error = -ENOENT;
@@ -683,9 +694,9 @@ int shmem_unuse(swp_entry_t swap, struct

/*
* There's a faint possibility that swap page was substituted before
- * caller locked it: it will come back later with the right page.
+ * caller locked it: caller will come back later with the right page.
*/
- if (unlikely(!PageSwapCache(page)))
+ if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
goto out;

/*
@@ -916,21 +927,15 @@ static int shmem_substitute_page(struct
newpage = shmem_alloc_page(gfp, info, index);
if (!newpage)
return -ENOMEM;
- VM_BUG_ON(shmem_should_substitute_page(newpage, gfp));

- *pagep = newpage;
page_cache_get(newpage);
copy_highpage(newpage, oldpage);
+ flush_dcache_page(newpage);

- VM_BUG_ON(!PageLocked(oldpage));
__set_page_locked(newpage);
- VM_BUG_ON(!PageUptodate(oldpage));
SetPageUptodate(newpage);
- VM_BUG_ON(!PageSwapBacked(oldpage));
SetPageSwapBacked(newpage);
- VM_BUG_ON(!swap_index);
set_page_private(newpage, swap_index);
- VM_BUG_ON(!PageSwapCache(oldpage));
SetPageSwapCache(newpage);

/*
@@ -940,13 +945,24 @@ static int shmem_substitute_page(struct
spin_lock_irq(&swap_mapping->tree_lock);
error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
newpage);
- __inc_zone_page_state(newpage, NR_FILE_PAGES);
- __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ if (!error) {
+ __inc_zone_page_state(newpage, NR_FILE_PAGES);
+ __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ }
spin_unlock_irq(&swap_mapping->tree_lock);
- BUG_ON(error);

- mem_cgroup_replace_page_cache(oldpage, newpage);
- lru_cache_add_anon(newpage);
+ if (unlikely(error)) {
+ /*
+ * Is this possible? I think not, now that our callers check
+ * both PageSwapCache and page_private after getting page lock;
+ * but be defensive. Reverse old to newpage for clear and free.
+ */
+ oldpage = newpage;
+ } else {
+ mem_cgroup_replace_page_cache(oldpage, newpage);
+ lru_cache_add_anon(newpage);
+ *pagep = newpage;
+ }

ClearPageSwapCache(oldpage);
set_page_private(oldpage, 0);
@@ -954,7 +970,7 @@ static int shmem_substitute_page(struct
unlock_page(oldpage);
page_cache_release(oldpage);
page_cache_release(oldpage);
- return 0;
+ return error;
}

/*
@@ -1025,7 +1041,8 @@ repeat:

/* We have to do this with page locked to prevent races */
lock_page(page);
- if (!PageSwapCache(page) || page->mapping) {
+ if (!PageSwapCache(page) || page_private(page) != swap.val ||
+ page->mapping) {
error = -EEXIST; /* try again */
goto failed;
}

2012-05-19 01:44:40

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v2 1/10] shmem: substitute page if mapping excludes its zone

The GMA500 GPU driver uses GEM shmem objects, but with a new twist:
the backing RAM has to be below 4GB. Not a problem while the boards
supported only 4GB: but now Intel's D2700MUD boards support 8GB, and
their GMA3600 is managed by the GMA500 driver.

shmem/tmpfs has never pretended to support hardware restrictions on
the backing memory, but it might have appeared to do so before v3.1,
and even now it works fine until a page is swapped out then back in.
When read_cache_page_gfp() supplied a freshly allocated page for copy,
that compensated for whatever choice might have been made by earlier
swapin readahead; but swapoff was likely to destroy the illusion.

We would like to continue to support GMA500, so now add a new
shmem_should_substitute_page() check on the zone when about to move a
page from swapcache to filecache (in swapin and swapoff cases), with
shmem_substitute_page() to allocate and substitute a suitable page
(given gma500/gem.c's mapping_set_gfp_mask GFP_KERNEL | __GFP_DMA32).

This does involve a minor extension to mem_cgroup_replace_page_cache()
(the page may or may not have already been charged); and I've removed
a comment and call to mem_cgroup_uncharge_cache_page(), which in fact
is always a no-op while PageSwapCache.

Also removed optimization of an unlikely path in shmem_getpage_gfp(),
now that we need to check PageSwapCache more carefully (a racing caller
might already have made the copy). And at one point shmem_unuse_inode()
needs to use the hitherto private page_swapcount(), to guard against
racing with inode eviction.

It would make sense to extend shmem_should_substitute_page(), to cover
cpuset and NUMA mempolicy restrictions too, but set that aside for
now: needs a cleanup of shmem mempolicy handling, and more testing,
and ought to handle swap faults in do_swap_page() as well as shmem.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
I've Cc'ed Stephane, Andi, Dave, Daniel and Rob because of their interest
in the i915 Sandybridge issue; but reiterate that this patch does nothing
for that case.

include/linux/swap.h | 6 +
mm/memcontrol.c | 17 +++-
mm/shmem.c | 158 ++++++++++++++++++++++++++++++++++++-----
mm/swapfile.c | 2
4 files changed, 159 insertions(+), 24 deletions(-)

--- 3045N.orig/include/linux/swap.h 2012-05-17 16:30:20.222078994 -0700
+++ 3045N/include/linux/swap.h 2012-05-17 16:30:29.786078930 -0700
@@ -355,6 +355,7 @@ extern int swap_type_of(dev_t, sector_t,
extern unsigned int count_swap_pages(int, int);
extern sector_t map_swap_page(struct page *, struct block_device **);
extern sector_t swapdev_block(int, pgoff_t);
+extern int page_swapcount(struct page *);
extern int reuse_swap_page(struct page *);
extern int try_to_free_swap(struct page *);
struct backing_dev_info;
@@ -448,6 +449,11 @@ static inline void delete_from_swap_cach
{
}

+static inline int page_swapcount(struct page *page)
+{
+ return 0;
+}
+
#define reuse_swap_page(page) (page_mapcount(page) == 1)

static inline int try_to_free_swap(struct page *page)
--- 3045N.orig/mm/memcontrol.c 2012-05-17 16:30:20.226078835 -0700
+++ 3045N/mm/memcontrol.c 2012-05-17 16:30:29.786078930 -0700
@@ -3548,7 +3548,7 @@ void mem_cgroup_end_migration(struct mem
void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage)
{
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
struct page_cgroup *pc;
enum charge_type type = MEM_CGROUP_CHARGE_TYPE_CACHE;

@@ -3558,11 +3558,20 @@ void mem_cgroup_replace_page_cache(struc
pc = lookup_page_cgroup(oldpage);
/* fix accounting on old pages */
lock_page_cgroup(pc);
- memcg = pc->mem_cgroup;
- mem_cgroup_charge_statistics(memcg, false, -1);
- ClearPageCgroupUsed(pc);
+ if (PageCgroupUsed(pc)) {
+ memcg = pc->mem_cgroup;
+ mem_cgroup_charge_statistics(memcg, false, -1);
+ ClearPageCgroupUsed(pc);
+ }
unlock_page_cgroup(pc);

+ /*
+ * When called from shmem_substitute_page(), in some cases the
+ * oldpage has already been charged, and in some cases not.
+ */
+ if (!memcg)
+ return;
+
if (PageSwapBacked(oldpage))
type = MEM_CGROUP_CHARGE_TYPE_SHMEM;

--- 3045N.orig/mm/shmem.c 2012-05-17 16:30:20.226078835 -0700
+++ 3045N/mm/shmem.c 2012-05-18 16:28:33.642198028 -0700
@@ -103,6 +103,9 @@ static unsigned long shmem_default_max_i
}
#endif

+static bool shmem_should_substitute_page(struct page *page, gfp_t gfp);
+static int shmem_substitute_page(struct page **pagep, gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index);
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type);

@@ -604,12 +607,13 @@ static void shmem_evict_inode(struct ino
* If swap found in inode, free it and move page from swapcache to filecache.
*/
static int shmem_unuse_inode(struct shmem_inode_info *info,
- swp_entry_t swap, struct page *page)
+ swp_entry_t swap, struct page **pagep)
{
struct address_space *mapping = info->vfs_inode.i_mapping;
void *radswap;
pgoff_t index;
- int error;
+ gfp_t gfp;
+ int error = 0;

radswap = swp_to_radix_entry(swap);
index = radix_tree_locate_item(&mapping->page_tree, radswap);
@@ -625,22 +629,48 @@ static int shmem_unuse_inode(struct shme
if (shmem_swaplist.next != &info->swaplist)
list_move_tail(&shmem_swaplist, &info->swaplist);

+ gfp = mapping_gfp_mask(mapping);
+ if (shmem_should_substitute_page(*pagep, gfp)) {
+ mutex_unlock(&shmem_swaplist_mutex);
+ error = shmem_substitute_page(pagep, gfp, info, index);
+ mutex_lock(&shmem_swaplist_mutex);
+ /*
+ * We needed to drop mutex to make that restrictive page
+ * allocation, but the inode might have been freed while we
+ * dropped it: although a racing shmem_evict_inode() cannot
+ * complete without emptying the radix_tree, our page lock
+ * on this swapcache page is not enough to prevent that -
+ * free_swap_and_cache() of our swap entry will only
+ * trylock_page(), removing swap from radix_tree whatever.
+ *
+ * We must not proceed to shmem_add_to_page_cache() if the
+ * inode has been freed, but of course we cannot rely on
+ * inode or mapping or info to check that. However, we can
+ * safely check if our swap entry is still in use (and here
+ * it can't have got reused for another page): if it's still
+ * in use, then the inode cannot have been freed yet, and we
+ * can safely proceed (if it's no longer in use, that tells
+ * nothing about the inode, but we don't need to unuse swap).
+ */
+ if (!page_swapcount(*pagep))
+ error = -ENOENT;
+ }
+
/*
* We rely on shmem_swaplist_mutex, not only to protect the swaplist,
* but also to hold up shmem_evict_inode(): so inode cannot be freed
* beneath us (pagelock doesn't help until the page is in pagecache).
*/
- error = shmem_add_to_page_cache(page, mapping, index,
+ if (!error)
+ error = shmem_add_to_page_cache(*pagep, mapping, index,
GFP_NOWAIT, radswap);
- /* which does mem_cgroup_uncharge_cache_page on error */
-
if (error != -ENOMEM) {
/*
* Truncation and eviction use free_swap_and_cache(), which
* only does trylock page: if we raced, best clean up here.
*/
- delete_from_swap_cache(page);
- set_page_dirty(page);
+ delete_from_swap_cache(*pagep);
+ set_page_dirty(*pagep);
if (!error) {
spin_lock(&info->lock);
info->swapped--;
@@ -660,7 +690,14 @@ int shmem_unuse(swp_entry_t swap, struct
struct list_head *this, *next;
struct shmem_inode_info *info;
int found = 0;
- int error;
+ int error = 0;
+
+ /*
+ * There's a faint possibility that swap page was substituted before
+ * caller locked it: caller will come back later with the right page.
+ */
+ if (unlikely(!PageSwapCache(page) || page_private(page) != swap.val))
+ goto out;

/*
* Charge page using GFP_KERNEL while we can wait, before taking
@@ -676,7 +713,7 @@ int shmem_unuse(swp_entry_t swap, struct
list_for_each_safe(this, next, &shmem_swaplist) {
info = list_entry(this, struct shmem_inode_info, swaplist);
if (info->swapped)
- found = shmem_unuse_inode(info, swap, page);
+ found = shmem_unuse_inode(info, swap, &page);
else
list_del_init(&info->swaplist);
cond_resched();
@@ -685,8 +722,6 @@ int shmem_unuse(swp_entry_t swap, struct
}
mutex_unlock(&shmem_swaplist_mutex);

- if (!found)
- mem_cgroup_uncharge_cache_page(page);
if (found < 0)
error = found;
out:
@@ -856,6 +891,89 @@ static inline struct mempolicy *shmem_ge
#endif

/*
+ * When a page is moved from swapcache to shmem filecache (either by the
+ * usual swapin of shmem_getpage_gfp(), or by the less common swapoff of
+ * shmem_unuse_inode()), it may have been read in earlier from swap, in
+ * ignorance of the mapping it belongs to. If that mapping has special
+ * constraints (like the gma500 GEM driver, which requires RAM below 4GB),
+ * we may need to copy to a suitable page before moving to filecache.
+ *
+ * In a future release, this may well be extended to respect cpuset and
+ * NUMA mempolicy, and applied also to anonymous pages in do_swap_page();
+ * but for now it is a simple matter of zone.
+ */
+static bool shmem_should_substitute_page(struct page *page, gfp_t gfp)
+{
+ return page_zonenum(page) > gfp_zone(gfp);
+}
+
+static int shmem_substitute_page(struct page **pagep, gfp_t gfp,
+ struct shmem_inode_info *info, pgoff_t index)
+{
+ struct page *oldpage, *newpage;
+ struct address_space *swap_mapping;
+ pgoff_t swap_index;
+ int error;
+
+ oldpage = *pagep;
+ swap_index = page_private(oldpage);
+ swap_mapping = page_mapping(oldpage);
+
+ /*
+ * We have arrived here because our zones are constrained, so don't
+ * limit chance of success by further cpuset and node constraints.
+ */
+ gfp &= ~GFP_CONSTRAINT_MASK;
+ newpage = shmem_alloc_page(gfp, info, index);
+ if (!newpage)
+ return -ENOMEM;
+
+ page_cache_get(newpage);
+ copy_highpage(newpage, oldpage);
+ flush_dcache_page(newpage);
+
+ __set_page_locked(newpage);
+ SetPageUptodate(newpage);
+ SetPageSwapBacked(newpage);
+ set_page_private(newpage, swap_index);
+ SetPageSwapCache(newpage);
+
+ /*
+ * Our caller will very soon move newpage out of swapcache, but it's a
+ * nice clean interface for us to substitute newpage for oldpage there.
+ */
+ spin_lock_irq(&swap_mapping->tree_lock);
+ error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
+ newpage);
+ if (!error) {
+ __inc_zone_page_state(newpage, NR_FILE_PAGES);
+ __dec_zone_page_state(oldpage, NR_FILE_PAGES);
+ }
+ spin_unlock_irq(&swap_mapping->tree_lock);
+
+ if (unlikely(error)) {
+ /*
+ * Is this possible? I think not, now that our callers check
+ * both PageSwapCache and page_private after getting page lock;
+ * but be defensive. Reverse old to newpage for clear and free.
+ */
+ oldpage = newpage;
+ } else {
+ mem_cgroup_replace_page_cache(oldpage, newpage);
+ lru_cache_add_anon(newpage);
+ *pagep = newpage;
+ }
+
+ ClearPageSwapCache(oldpage);
+ set_page_private(oldpage, 0);
+
+ unlock_page(oldpage);
+ page_cache_release(oldpage);
+ page_cache_release(oldpage);
+ return error;
+}
+
+/*
* shmem_getpage_gfp - find page in cache, or get from swap, or allocate
*
* If we allocate a new one we do not mark it dirty. That's up to the
@@ -923,19 +1041,21 @@ repeat:

/* We have to do this with page locked to prevent races */
lock_page(page);
+ if (!PageSwapCache(page) || page_private(page) != swap.val ||
+ page->mapping) {
+ error = -EEXIST; /* try again */
+ goto failed;
+ }
if (!PageUptodate(page)) {
error = -EIO;
goto failed;
}
wait_on_page_writeback(page);

- /* Someone may have already done it for us */
- if (page->mapping) {
- if (page->mapping == mapping &&
- page->index == index)
- goto done;
- error = -EEXIST;
- goto failed;
+ if (shmem_should_substitute_page(page, gfp)) {
+ error = shmem_substitute_page(&page, gfp, info, index);
+ if (error)
+ goto failed;
}

error = mem_cgroup_cache_charge(page, current->mm,
@@ -998,7 +1118,7 @@ repeat:
if (sgp == SGP_DIRTY)
set_page_dirty(page);
}
-done:
+
/* Perhaps the file has been truncated since we checked */
if (sgp != SGP_WRITE &&
((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
--- 3045N.orig/mm/swapfile.c 2012-05-17 16:30:20.226078835 -0700
+++ 3045N/mm/swapfile.c 2012-05-17 16:30:29.790078925 -0700
@@ -604,7 +604,7 @@ void swapcache_free(swp_entry_t entry, s
* This does not give an exact answer when swap count is continued,
* but does include the high COUNT_CONTINUED flag to allow for that.
*/
-static inline int page_swapcount(struct page *page)
+int page_swapcount(struct page *page)
{
int count = 0;
struct swap_info_struct *p;

2012-05-22 00:08:34

by john stultz

[permalink] [raw]
Subject: Re: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE

On Sat, May 12, 2012 at 5:13 AM, Hugh Dickins <[email protected]> wrote:
> Now tmpfs supports hole-punching via fallocate(), switch madvise_remove()
> to use do_fallocate() instead of vmtruncate_range(): which extends
> madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs.
>
> There is one more user of vmtruncate_range() in our tree, staging/android's
> ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned
> areas are already unmapped - I don't know - then it would do better to use
> shmem_truncate_range() directly).
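
For anyone following along: from userspace, the two calls below now end
up on the same hole-punching path (a sketch, not taken from the patch):

	#define _GNU_SOURCE		/* for the fallocate() declaration */
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <linux/falloc.h>	/* FALLOC_FL_* flags */

	/* punch the same range of a shared tmpfs mapping / file */
	madvise(addr + off, len, MADV_REMOVE);
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, off, len);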

I suspect shmem_truncate_range directly would be the right approach,
but am not totally sure.
Arve: Any thoughts?

Hugh: Do you have a git tree with this set available somewhere? I was
working on my own tmpfs support for FALLOC_FL_PUNCH_HOLE, along with
my volatile range work, so I'd like to rebase on top of your work
here.

thanks
-john


>
> Based-on-patch-by: Cong Wang <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>  drivers/staging/android/ashmem.c |    8 +++++---
>  mm/madvise.c                     |   15 +++++++--------
>  2 files changed, 12 insertions(+), 11 deletions(-)
>
> --- 3045N.orig/drivers/staging/android/ashmem.c 2012-05-05 10:42:33.564056626 -0700
> +++ 3045N/drivers/staging/android/ashmem.c      2012-05-05 10:46:25.692062478 -0700
> @@ -19,6 +19,7 @@
>  #include <linux/module.h>
>  #include <linux/file.h>
>  #include <linux/fs.h>
> +#include <linux/falloc.h>
>  #include <linux/miscdevice.h>
>  #include <linux/security.h>
>  #include <linux/mm.h>
> @@ -363,11 +364,12 @@ static int ashmem_shrink(struct shrinker
>
>         mutex_lock(&ashmem_mutex);
>         list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
> -               struct inode *inode = range->asma->file->f_dentry->d_inode;
>                 loff_t start = range->pgstart * PAGE_SIZE;
> -               loff_t end = (range->pgend + 1) * PAGE_SIZE - 1;
> +               loff_t end = (range->pgend + 1) * PAGE_SIZE;
>
> -               vmtruncate_range(inode, start, end);
> +               do_fallocate(range->asma->file,
> +                               FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                               start, end - start);
>                 range->purged = ASHMEM_WAS_PURGED;
>                 lru_del(range);
>
> --- 3045N.orig/mm/madvise.c     2012-05-05 10:42:33.572056784 -0700
> +++ 3045N/mm/madvise.c  2012-05-05 10:46:25.692062478 -0700
> @@ -11,8 +11,10 @@
>  #include <linux/mempolicy.h>
>  #include <linux/page-isolation.h>
>  #include <linux/hugetlb.h>
> +#include <linux/falloc.h>
>  #include <linux/sched.h>
>  #include <linux/ksm.h>
> +#include <linux/fs.h>
>
>  /*
>   * Any behaviour which results in changes to the vma->vm_flags needs to
> @@ -200,8 +202,7 @@ static long madvise_remove(struct vm_are
>                                struct vm_area_struct **prev,
>                                unsigned long start, unsigned long end)
>  {
> -       struct address_space *mapping;
> -       loff_t offset, endoff;
> +       loff_t offset;
>        int error;
>
>        *prev = NULL;   /* tell sys_madvise we drop mmap_sem */
> @@ -217,16 +218,14 @@ static long madvise_remove(struct vm_are
>        if ((vma->vm_flags & (VM_SHARED|VM_WRITE)) != (VM_SHARED|VM_WRITE))
>                return -EACCES;
>
> -       mapping = vma->vm_file->f_mapping;
> -
>        offset = (loff_t)(start - vma->vm_start)
>                        + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> -       endoff = (loff_t)(end - vma->vm_start - 1)
> -                       + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
>
> -       /* vmtruncate_range needs to take i_mutex */
> +       /* filesystem's fallocate may need to take i_mutex */
>        up_read(&current->mm->mmap_sem);
> -       error = vmtruncate_range(mapping->host, offset, endoff);
> +       error = do_fallocate(vma->vm_file,
> +                               FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                               offset, end - start);
>        down_read(&current->mm->mmap_sem);
>        return error;
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

2012-05-22 15:12:18

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 5/10] mm/fs: route MADV_REMOVE to FALLOC_FL_PUNCH_HOLE

On Mon, 21 May 2012, john stultz wrote:
> On Sat, May 12, 2012 at 5:13 AM, Hugh Dickins <[email protected]> wrote:
> > Now tmpfs supports hole-punching via fallocate(), switch madvise_remove()
> > to use do_fallocate() instead of vmtruncate_range(): which extends
> > madvise(,,MADV_REMOVE) support from tmpfs to ext4, ocfs2 and xfs.
> >
> > There is one more user of vmtruncate_range() in our tree, staging/android's
> > ashmem_shrink(): convert it to use do_fallocate() too (but if its unpinned
> > areas are already unmapped - I don't know - then it would do better to use
> > shmem_truncate_range() directly).
>
> I suspect shmem_truncate_range directly would be the right approach,
> but am not totally sure.
> Arve: Any thoughts?
>
> Hugh: Do you have a git tree with this set available somewhere? I was
> working on my own tmpfs support for FALLOC_FL_PUNCH_HOLE, along with
> my volatile range work, so I'd like to rebase on top of your work
> here.

I don't, no, just the patch series posted.

I had hoped by now to say that it's in linux-next (though it would be
at the daily rebased end, which probably doesn't help you), but not yet.

If shmem_truncate_range() is all you need, then that doesn't depend on
these patches at all - but I expect you are aiming to be more general.

Hugh
