2007-12-18 21:58:19

by Hugh Dickins

Subject: [PATCH 0/9] tmpfs: towards unionfs and memcgroups

Neither unionfs nor memcgroups are working correctly with tmpfs and shmem.
If either or both are to go forward to 2.6.25, we shall need these patches
(against 2.6.24-rc5-mm1, but not dependent on unionfs or memcgroups) to go
in ahead. A small set for unionfs and a smaller set for memcgroups follow.

include/linux/swap.h | 15 --
mm/shmem.c | 213 ++++++++++++++++++++++++++++-------------
mm/swap_state.c | 100 ++++---------------
mm/swapfile.c | 23 +---
4 files changed, 184 insertions(+), 167 deletions(-)

In your -mm series file, please shift
shmem-factor-out-sbi-free_inodes-manipulations.patch
shmem-factor-out-sbi-free_inodes-manipulations-fix.patch
tmpfs-fix-mounts-when-size-is-less-than-the-page-size.patch
down after my
swapoff-scan-ptes-preemptibly.patch
and then append the nine tmpfs patches which follow.

Thanks!
Hugh


2007-12-18 21:58:55

by Hugh Dickins

Subject: [PATCH 1/9] tmpfs: move swap_state stats update

Both unionfs and memcgroups pose challenges to tmpfs and shmem. To help
fix, it's best to move the swap swizzling functions from swap_state.c
to shmem.c. As a preliminary to that, move swap stats updating down
into __add_to_swap_cache, which will remain internal to swap_state.c.

Well, actually, just move down the incrementation of add_total: remove
noent_race and exist_race completely; they are relics of my 2.4.11 testing.
Alt-SysRq-m users will be thrilled if 2.6.25 is at last free of "race M+N"s.

Signed-off-by: Hugh Dickins <[email protected]>
---
checkpatch.pl would like me to add a KERN_level to the printk, but that's
not something I want to get into here: this is one of a sequence of printks,
some of which are per-arch, with different KERN_levels in different archs.

Perhaps the remaining add/delete/find swap stats should be junked too;
but they are of more general significance, so leave them for now. Nor do I
want to get into removing that irritatingly repeated "Free swap" right now.

mm/swap_state.c | 22 ++++++----------------
1 file changed, 6 insertions(+), 16 deletions(-)

--- 2.6.24-rc5-mm1/mm/swap_state.c 2007-12-05 10:38:45.000000000 +0000
+++ tmpfs1/mm/swap_state.c 2007-12-05 16:42:16.000000000 +0000
@@ -52,16 +52,13 @@ static struct {
unsigned long del_total;
unsigned long find_success;
unsigned long find_total;
- unsigned long noent_race;
- unsigned long exist_race;
} swap_cache_info;

void show_swap_cache_info(void)
{
- printk("Swap cache: add %lu, delete %lu, find %lu/%lu, race %lu+%lu\n",
+ printk("Swap cache: add %lu, delete %lu, find %lu/%lu\n",
swap_cache_info.add_total, swap_cache_info.del_total,
- swap_cache_info.find_success, swap_cache_info.find_total,
- swap_cache_info.noent_race, swap_cache_info.exist_race);
+ swap_cache_info.find_success, swap_cache_info.find_total);
printk("Free swap = %lukB\n", nr_swap_pages << (PAGE_SHIFT - 10));
printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10));
}
@@ -89,6 +86,7 @@ static int __add_to_swap_cache(struct pa
set_page_private(page, entry.val);
total_swapcache_pages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
+ INC_CACHE_INFO(add_total);
}
write_unlock_irq(&swapper_space.tree_lock);
radix_tree_preload_end();
@@ -102,10 +100,9 @@ static int add_to_swap_cache(struct page
int error;

BUG_ON(PageLocked(page));
- if (!swap_duplicate(entry)) {
- INC_CACHE_INFO(noent_race);
+ if (!swap_duplicate(entry))
return -ENOENT;
- }
+
SetPageLocked(page);
error = __add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL);
/*
@@ -114,11 +111,8 @@ static int add_to_swap_cache(struct page
if (error) {
ClearPageLocked(page);
swap_free(entry);
- if (error == -EEXIST)
- INC_CACHE_INFO(exist_race);
return error;
}
- INC_CACHE_INFO(add_total);
return 0;
}

@@ -178,11 +172,9 @@ int add_to_swap(struct page * page, gfp_
case 0: /* Success */
SetPageUptodate(page);
SetPageDirty(page);
- INC_CACHE_INFO(add_total);
return 1;
case -EEXIST:
/* Raced with "speculative" read_swap_cache_async */
- INC_CACHE_INFO(exist_race);
swap_free(entry);
continue;
default:
@@ -225,9 +217,7 @@ int move_to_swap_cache(struct page *page
if (!swap_duplicate(entry))
BUG();
SetPageDirty(page);
- INC_CACHE_INFO(add_total);
- } else if (err == -EEXIST)
- INC_CACHE_INFO(exist_race);
+ }
return err;
}

2007-12-18 22:00:31

by Hugh Dickins

Subject: [PATCH 2/9] tmpfs: shuffle add_to_swap_caches

add_to_swap_cache doesn't amount to much: merge it into its sole caller
read_swap_cache_async. But we'll be needing to call __add_to_swap_cache
from shmem.c, so promote it to the new add_to_swap_cache. Both were
static, so there's no interface confusion to worry about.

And lose that inappropriate "Anon pages are already on the LRU" comment
in the merging: they're not already on the LRU, as Nick Piggin noticed.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/swap_state.c | 53 ++++++++++++++++------------------------------
1 file changed, 19 insertions(+), 34 deletions(-)

--- tmpfs1/mm/swap_state.c 2007-12-05 16:42:16.000000000 +0000
+++ tmpfs2/mm/swap_state.c 2007-12-05 16:42:16.000000000 +0000
@@ -64,10 +64,10 @@ void show_swap_cache_info(void)
}

/*
- * __add_to_swap_cache resembles add_to_page_cache on swapper_space,
+ * add_to_swap_cache resembles add_to_page_cache on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-static int __add_to_swap_cache(struct page *page, swp_entry_t entry,
+static int add_to_swap_cache(struct page *page, swp_entry_t entry,
gfp_t gfp_mask)
{
int error;
@@ -94,28 +94,6 @@ static int __add_to_swap_cache(struct pa
return error;
}

-static int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp_mask)
-{
- int error;
-
- BUG_ON(PageLocked(page));
- if (!swap_duplicate(entry))
- return -ENOENT;
-
- SetPageLocked(page);
- error = __add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL);
- /*
- * Anon pages are already on the LRU, we don't run lru_cache_add here.
- */
- if (error) {
- ClearPageLocked(page);
- swap_free(entry);
- return error;
- }
- return 0;
-}
-
/*
* This must be called only on pages that have
* been verified to be in the swap cache.
@@ -165,7 +143,7 @@ int add_to_swap(struct page * page, gfp_
/*
* Add it to the swap cache and mark it dirty
*/
- err = __add_to_swap_cache(page, entry,
+ err = add_to_swap_cache(page, entry,
gfp_mask|__GFP_NOMEMALLOC|__GFP_NOWARN);

switch (err) {
@@ -210,7 +188,7 @@ void delete_from_swap_cache(struct page
*/
int move_to_swap_cache(struct page *page, swp_entry_t entry)
{
- int err = __add_to_swap_cache(page, entry, GFP_ATOMIC);
+ int err = add_to_swap_cache(page, entry, GFP_ATOMIC);
if (!err) {
remove_from_page_cache(page);
page_cache_release(page); /* pagecache ref */
@@ -335,16 +313,21 @@ struct page *read_swap_cache_async(swp_e
}

/*
+ * Swap entry may have been freed since our caller observed it.
+ */
+ if (!swap_duplicate(entry))
+ break;
+
+ /*
* Associate the page with swap entry in the swap cache.
- * May fail (-ENOENT) if swap entry has been freed since
- * our caller observed it. May fail (-EEXIST) if there
- * is already a page associated with this entry in the
- * swap cache: added by a racing read_swap_cache_async,
- * or by try_to_swap_out (or shmem_writepage) re-using
- * the just freed swap entry for an existing page.
+ * May fail (-EEXIST) if there is already a page associated
+ * with this entry in the swap cache: added by a racing
+ * read_swap_cache_async, or add_to_swap or shmem_writepage
+ * re-using the just freed swap entry for an existing page.
* May fail (-ENOMEM) if radix-tree node allocation failed.
*/
- err = add_to_swap_cache(new_page, entry, gfp_mask);
+ SetPageLocked(new_page);
+ err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (!err) {
/*
* Initiate read into locked page and return.
@@ -353,7 +336,9 @@ struct page *read_swap_cache_async(swp_e
swap_readpage(NULL, new_page);
return new_page;
}
- } while (err != -ENOENT && err != -ENOMEM);
+ ClearPageLocked(new_page);
+ swap_free(entry);
+ } while (err != -ENOMEM);

if (new_page)
page_cache_release(new_page);

2007-12-18 22:01:42

by Hugh Dickins

Subject: [PATCH 3/9] tmpfs: move swap swizzling into shmem

move_to_swap_cache and move_from_swap_cache functions (which swizzle a
page between tmpfs page cache and swap cache, to avoid page copying) are
only used by shmem.c; and our subsequent fix for unionfs needs different
treatments in the two instances of move_from_swap_cache. Move them from
swap_state.c into their callsites shmem_writepage, shmem_unuse_inode and
shmem_getpage, making add_to_swap_cache externally visible.

shmem.c likes to say set_page_dirty where swap_state.c liked to say
SetPageDirty: respect that diversity, which __set_page_dirty_no_writeback
makes moot (and implies we should lose that "shift page from clean_pages
to dirty_pages list" comment: it's on neither).

Signed-off-by: Hugh Dickins <[email protected]>
---
checkpatch.pl objects to the embedded assignment in
- } else if (!(error = move_from_swap_cache(
+ } else if (!(error = add_to_page_cache(
but please let me not mess with that here.

include/linux/swap.h | 15 ++++-----------
mm/shmem.c | 16 ++++++++++++----
mm/swap_state.c | 35 +----------------------------------
3 files changed, 17 insertions(+), 49 deletions(-)

--- tmpfs2/include/linux/swap.h 2007-12-05 10:38:43.000000000 +0000
+++ tmpfs3/include/linux/swap.h 2007-12-05 16:42:18.000000000 +0000
@@ -224,11 +224,9 @@ extern struct address_space swapper_spac
#define total_swapcache_pages swapper_space.nrpages
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *, gfp_t);
+extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
extern void __delete_from_swap_cache(struct page *);
extern void delete_from_swap_cache(struct page *);
-extern int move_to_swap_cache(struct page *, swp_entry_t);
-extern int move_from_swap_cache(struct page *, unsigned long,
- struct address_space *);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
extern struct page *lookup_swap_cache(swp_entry_t);
@@ -323,15 +321,10 @@ static inline struct page *lookup_swap_c

#define can_share_swap_page(p) (page_mapcount(p) == 1)

-static inline int move_to_swap_cache(struct page *page, swp_entry_t entry)
+static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
+ gfp_t gfp_mask)
{
- return 1;
-}
-
-static inline int move_from_swap_cache(struct page *page, unsigned long index,
- struct address_space *mapping)
-{
- return 1;
+ return -1;
}

static inline void __delete_from_swap_cache(struct page *page)
--- tmpfs2/mm/shmem.c 2007-12-05 10:38:45.000000000 +0000
+++ tmpfs3/mm/shmem.c 2007-12-05 16:42:18.000000000 +0000
@@ -884,7 +884,9 @@ lost2:
found:
idx += offset;
inode = &info->vfs_inode;
- if (move_from_swap_cache(page, idx, inode->i_mapping) == 0) {
+ if (add_to_page_cache(page, inode->i_mapping, idx, GFP_ATOMIC) == 0) {
+ delete_from_swap_cache(page);
+ set_page_dirty(page);
info->flags |= SHMEM_PAGEIN;
shmem_swp_set(info, ptr + offset, 0);
}
@@ -972,7 +974,8 @@ static int shmem_writepage(struct page *
BUG_ON(!entry);
BUG_ON(entry->val);

- if (move_to_swap_cache(page, swap) == 0) {
+ if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
+ remove_from_page_cache(page);
shmem_swp_set(info, entry, swap.val);
shmem_swp_unmap(entry);
spin_unlock(&info->lock);
@@ -982,6 +985,9 @@ static int shmem_writepage(struct page *
list_move_tail(&info->swaplist, &shmem_swaplist);
spin_unlock(&shmem_swaplist_lock);
}
+ swap_duplicate(swap);
+ page_cache_release(page); /* pagecache ref */
+ set_page_dirty(page);
unlock_page(page);
return 0;
}
@@ -1217,13 +1223,15 @@ repeat:
SetPageUptodate(filepage);
set_page_dirty(filepage);
swap_free(swap);
- } else if (!(error = move_from_swap_cache(
- swappage, idx, mapping))) {
+ } else if (!(error = add_to_page_cache(
+ swappage, mapping, idx, GFP_ATOMIC))) {
info->flags |= SHMEM_PAGEIN;
shmem_swp_set(info, entry, 0);
shmem_swp_unmap(entry);
+ delete_from_swap_cache(swappage);
spin_unlock(&info->lock);
filepage = swappage;
+ set_page_dirty(filepage);
swap_free(swap);
} else {
shmem_swp_unmap(entry);
--- tmpfs2/mm/swap_state.c 2007-12-05 16:42:16.000000000 +0000
+++ tmpfs3/mm/swap_state.c 2007-12-05 16:42:18.000000000 +0000
@@ -67,8 +67,7 @@ void show_swap_cache_info(void)
* add_to_swap_cache resembles add_to_page_cache on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-static int add_to_swap_cache(struct page *page, swp_entry_t entry,
- gfp_t gfp_mask)
+int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
{
int error;

@@ -183,38 +182,6 @@ void delete_from_swap_cache(struct page
page_cache_release(page);
}

-/*
- * Strange swizzling function only for use by shmem_writepage
- */
-int move_to_swap_cache(struct page *page, swp_entry_t entry)
-{
- int err = add_to_swap_cache(page, entry, GFP_ATOMIC);
- if (!err) {
- remove_from_page_cache(page);
- page_cache_release(page); /* pagecache ref */
- if (!swap_duplicate(entry))
- BUG();
- SetPageDirty(page);
- }
- return err;
-}
-
-/*
- * Strange swizzling function for shmem_getpage (and shmem_unuse)
- */
-int move_from_swap_cache(struct page *page, unsigned long index,
- struct address_space *mapping)
-{
- int err = add_to_page_cache(page, mapping, index, GFP_ATOMIC);
- if (!err) {
- delete_from_swap_cache(page);
- /* shift page from clean_pages to dirty_pages list */
- ClearPageDirty(page);
- set_page_dirty(page);
- }
- return err;
-}
-
/*
* If we are the only user, then try to free up the swap cache.
*

2007-12-18 22:02:31

by Hugh Dickins

Subject: [PATCH 4/9] tmpfs: allow filepage alongside swappage

tmpfs has long allowed for a fresh filepage to be created in pagecache,
just before shmem_getpage gets the chance to match it up with the swappage
which already belongs to that offset. But unionfs_writepage now does a
find_or_create_page, divorced from shmem_getpage, which leaves conflicting
filepage and swappage outstanding indefinitely, when unionfs is over tmpfs.

Therefore shmem_writepage (where a page is swizzled from file to swap) must
now be on the lookout for existing swap, ready to free it in favour of the
more uptodate filepage, instead of BUGging on that clash. And when the
add_to_page_cache fails in shmem_unuse_inode, it must defer to an uptodate
filepage, otherwise swapoff would hang. Whereas when add_to_page_cache
fails in shmem_getpage, it should retry in the same way it already does.

Signed-off-by: Hugh Dickins <[email protected]>
---
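For context, the stacked ->writepage which provokes this clash is doing
something like the following. This is only an illustrative sketch, not
unionfs's actual code: lower_inode_of() is a made-up helper and the copy
details are assumed; the one point taken from the description above is the
find_or_create_page on the lower mapping, divorced from shmem_getpage.

/*
 * Sketch of a stacking filesystem's ->writepage: it grabs or creates the
 * lower page directly in the lower mapping, copies the data across and
 * dirties it.  If tmpfs already holds that offset in swap, a fresh
 * filepage now coexists with the swappage - the clash which
 * shmem_writepage and shmem_unuse_inode must now resolve.
 */
static int stackfs_writepage(struct page *page, struct writeback_control *wbc)
{
	struct inode *lower_inode = lower_inode_of(page->mapping->host);
	struct page *lower_page;

	lower_page = find_or_create_page(lower_inode->i_mapping,
					 page->index, GFP_KERNEL);
	if (!lower_page)
		return -ENOMEM;

	copy_highpage(lower_page, page);	/* bring lower page uptodate */
	SetPageUptodate(lower_page);
	set_page_dirty(lower_page);	/* shmem_writepage meets it later */
	unlock_page(lower_page);
	page_cache_release(lower_page);

	unlock_page(page);
	return 0;
}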

mm/shmem.c | 69 ++++++++++++++++++++++++++++++++-------------------
1 file changed, 44 insertions(+), 25 deletions(-)

--- tmpfs3/mm/shmem.c 2007-12-05 16:42:18.000000000 +0000
+++ tmpfs4/mm/shmem.c 2007-12-05 16:42:19.000000000 +0000
@@ -827,6 +827,7 @@ static int shmem_unuse_inode(struct shme
struct page *subdir;
swp_entry_t *ptr;
int offset;
+ int error;

idx = 0;
ptr = info->i_direct;
@@ -884,7 +885,20 @@ lost2:
found:
idx += offset;
inode = &info->vfs_inode;
- if (add_to_page_cache(page, inode->i_mapping, idx, GFP_ATOMIC) == 0) {
+ error = add_to_page_cache(page, inode->i_mapping, idx, GFP_ATOMIC);
+ if (error == -EEXIST) {
+ struct page *filepage = find_get_page(inode->i_mapping, idx);
+ if (filepage) {
+ /*
+ * There might be a more uptodate page coming down
+ * from a stacked writepage: forget our swappage if so.
+ */
+ if (PageUptodate(filepage))
+ error = 0;
+ page_cache_release(filepage);
+ }
+ }
+ if (!error) {
delete_from_swap_cache(page);
set_page_dirty(page);
info->flags |= SHMEM_PAGEIN;
@@ -937,44 +951,45 @@ static int shmem_writepage(struct page *
struct inode *inode;

BUG_ON(!PageLocked(page));
- /*
- * shmem_backing_dev_info's capabilities prevent regular writeback or
- * sync from ever calling shmem_writepage; but a stacking filesystem
- * may use the ->writepage of its underlying filesystem, in which case
- * we want to do nothing when that underlying filesystem is tmpfs
- * (writing out to swap is useful as a response to memory pressure, but
- * of no use to stabilize the data) - just redirty the page, unlock it
- * and claim success in this case. AOP_WRITEPAGE_ACTIVATE, and the
- * page_mapped check below, must be avoided unless we're in reclaim.
- */
- if (!wbc->for_reclaim) {
- set_page_dirty(page);
- unlock_page(page);
- return 0;
- }
- BUG_ON(page_mapped(page));
-
mapping = page->mapping;
index = page->index;
inode = mapping->host;
info = SHMEM_I(inode);
if (info->flags & VM_LOCKED)
goto redirty;
- swap = get_swap_page();
- if (!swap.val)
+ if (!total_swap_pages)
goto redirty;

+ /*
+ * shmem_backing_dev_info's capabilities prevent regular writeback or
+ * sync from ever calling shmem_writepage; but a stacking filesystem
+ * may use the ->writepage of its underlying filesystem, in which case
+ * tmpfs should write out to swap only in response to memory pressure,
+ * and not for pdflush or sync. However, in those cases, we do still
+ * want to check if there's a redundant swappage to be discarded.
+ */
+ if (wbc->for_reclaim)
+ swap = get_swap_page();
+ else
+ swap.val = 0;
+
spin_lock(&info->lock);
- shmem_recalc_inode(inode);
if (index >= info->next_index) {
BUG_ON(!(info->flags & SHMEM_TRUNCATE));
goto unlock;
}
entry = shmem_swp_entry(info, index, NULL);
- BUG_ON(!entry);
- BUG_ON(entry->val);
+ if (entry->val) {
+ /*
+ * The more uptodate page coming down from a stacked
+ * writepage should replace our old swappage.
+ */
+ free_swap_and_cache(*entry);
+ shmem_swp_set(info, entry, 0);
+ }
+ shmem_recalc_inode(inode);

- if (add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
+ if (swap.val && add_to_swap_cache(page, swap, GFP_ATOMIC) == 0) {
remove_from_page_cache(page);
shmem_swp_set(info, entry, swap.val);
shmem_swp_unmap(entry);
@@ -986,6 +1001,7 @@ static int shmem_writepage(struct page *
spin_unlock(&shmem_swaplist_lock);
}
swap_duplicate(swap);
+ BUG_ON(page_mapped(page));
page_cache_release(page); /* pagecache ref */
set_page_dirty(page);
unlock_page(page);
@@ -998,7 +1014,10 @@ unlock:
swap_free(swap);
redirty:
set_page_dirty(page);
- return AOP_WRITEPAGE_ACTIVATE; /* Return with the page locked */
+ if (wbc->for_reclaim)
+ return AOP_WRITEPAGE_ACTIVATE; /* Return with page locked */
+ unlock_page(page);
+ return 0;
}

#ifdef CONFIG_NUMA

2007-12-18 22:03:31

by Hugh Dickins

Subject: [PATCH 5/9] tmpfs: allocate on read when stacked

tmpfs is expected to limit the memory used (unless mounted with nr_blocks=0
or size=0). But if a stacked filesystem such as unionfs gets pages from a
sparse tmpfs file by reading holes, and then writes to them, it can easily
exceed any such limit at present.

So suppress the SGP_READ "don't allocate page" ZERO_PAGE optimization when
reading for the kernel (a KERNEL_DS check, ugh, sorry about that). Indeed,
pessimistically mark such pages as dirty, so they cannot get reclaimed and
unaccounted by mistake. The venerable shmem_recalc_inode code (originally
to account for the reclaim of clean pages) suffices to get the accounting
right when swappages are dropped in favour of more uptodate filepages.

This also fixes the NULL shmem_swp_entry BUG or oops in shmem_writepage,
caused by unionfs writing to a very sparse tmpfs file: to minimize memory
allocation in swapout, tmpfs requires the swap vector be allocated upfront,
which wasn't always happening in this stacked case.

Signed-off-by: Hugh Dickins <[email protected]>
---
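For reference, the KERNEL_DS test catches reads of roughly the following
shape, which is how a stacking filesystem typically pulls data out of its
lower tmpfs file.  An illustrative sketch only, not unionfs's actual code:
stackfs_read_lower and its arguments are stand-ins.

/*
 * A stacked filesystem reading its lower file switches to KERNEL_DS so
 * that vfs_read accepts a kernel buffer.  That is what
 * segment_eq(get_fs(), KERNEL_DS) detects above: such reads use SGP_DIRTY
 * rather than SGP_READ, so holes read this way get allocated, marked
 * dirty and accounted, instead of being mapped to ZERO_PAGE.
 */
static ssize_t stackfs_read_lower(struct file *lower_file, void *kernel_buf,
				  size_t count, loff_t *pos)
{
	mm_segment_t old_fs = get_fs();
	ssize_t nread;

	set_fs(KERNEL_DS);
	nread = vfs_read(lower_file, (char __user *)kernel_buf, count, pos);
	set_fs(old_fs);
	return nread;
}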

mm/shmem.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

--- tmpfs4/mm/shmem.c 2007-12-05 16:42:19.000000000 +0000
+++ tmpfs5/mm/shmem.c 2007-12-05 16:42:19.000000000 +0000
@@ -80,6 +80,7 @@
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
SGP_WRITE, /* may exceed i_size, may allocate page */
};

@@ -1333,6 +1334,8 @@ repeat:
clear_highpage(filepage);
flush_dcache_page(filepage);
SetPageUptodate(filepage);
+ if (sgp == SGP_DIRTY)
+ set_page_dirty(filepage);
}
done:
*pagep = filepage;
@@ -1518,6 +1521,15 @@ static void do_shmem_file_read(struct fi
struct inode *inode = filp->f_path.dentry->d_inode;
struct address_space *mapping = inode->i_mapping;
unsigned long index, offset;
+ enum sgp_type sgp = SGP_READ;
+
+ /*
+ * Might this read be for a stacking filesystem? Then when reading
+ * holes of a sparse file, we actually need to allocate those pages,
+ * and even mark them dirty, so it cannot exceed the max_blocks limit.
+ */
+ if (segment_eq(get_fs(), KERNEL_DS))
+ sgp = SGP_DIRTY;

index = *ppos >> PAGE_CACHE_SHIFT;
offset = *ppos & ~PAGE_CACHE_MASK;
@@ -1536,7 +1548,7 @@ static void do_shmem_file_read(struct fi
break;
}

- desc->error = shmem_getpage(inode, index, &page, SGP_READ, NULL);
+ desc->error = shmem_getpage(inode, index, &page, sgp, NULL);
if (desc->error) {
if (desc->error == -EINVAL)
desc->error = 0;

2007-12-18 22:05:21

by Hugh Dickins

Subject: [PATCH 6/9] tmpfs: make shmem_unuse more preemptible

shmem_unuse is at present an unbroken search through every swap vector page
of every tmpfs file which might be swapped, all under shmem_swaplist_lock.
This dates from long ago, when the caller held mmlist_lock over it all too:
long gone, but there's never been much pressure for preemptible swapoff.

Make it a little more preemptible, replacing shmem_swaplist_lock by
shmem_swaplist_mutex, inserting a cond_resched in the main loop, and
a cond_resched_lock (on info->lock) at one convenient point in the
shmem_unuse_inode loop, where it has no outstanding kmap_atomic.

If we're serious about preemptible swapoff, there's much further to go
e.g. I'm stupid to let the kmap_atomics of the decreasingly significant
HIGHMEM case dictate preemptibility for other configs. But as in the
earlier patch to make swapoff scan ptes preemptibly, my hidden agenda is
really towards making memcgroups work, hardly about preemptibility at all.

Signed-off-by: Hugh Dickins <[email protected]>
---
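The main-loop change relies on cond_resched_lock's return value saying
whether info->lock was dropped.  Here's a paraphrase of that contract as I
read it (lock-break detail omitted; not a quote of kernel/sched.c):

static inline int cond_resched_lock_sketch(spinlock_t *lock)
{
	if (need_resched()) {
		spin_unlock(lock);
		cond_resched();
		spin_lock(lock);
		return 1;	/* lock was dropped: caller must revalidate */
	}
	return 0;
}

Hence the recheck of limit against info->next_index in the hunk below:
truncation may have run while info->lock was released.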

mm/shmem.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

--- tmpfs5/mm/shmem.c 2007-12-05 16:42:19.000000000 +0000
+++ tmpfs6/mm/shmem.c 2007-12-06 16:34:56.000000000 +0000
@@ -193,7 +193,7 @@ static struct backing_dev_info shmem_bac
};

static LIST_HEAD(shmem_swaplist);
-static DEFINE_SPINLOCK(shmem_swaplist_lock);
+static DEFINE_MUTEX(shmem_swaplist_mutex);

static void shmem_free_blocks(struct inode *inode, long pages)
{
@@ -796,9 +796,9 @@ static void shmem_delete_inode(struct in
inode->i_size = 0;
shmem_truncate(inode);
if (!list_empty(&info->swaplist)) {
- spin_lock(&shmem_swaplist_lock);
+ mutex_lock(&shmem_swaplist_mutex);
list_del_init(&info->swaplist);
- spin_unlock(&shmem_swaplist_lock);
+ mutex_unlock(&shmem_swaplist_mutex);
}
}
BUG_ON(inode->i_blocks);
@@ -851,6 +851,14 @@ static int shmem_unuse_inode(struct shme
for (idx = SHMEM_NR_DIRECT; idx < limit; idx += ENTRIES_PER_PAGE, dir++) {
if (unlikely(idx == stage)) {
shmem_dir_unmap(dir-1);
+ if (cond_resched_lock(&info->lock)) {
+ /* check it has not been truncated */
+ if (limit > info->next_index) {
+ limit = info->next_index;
+ if (idx >= limit)
+ goto lost2;
+ }
+ }
dir = shmem_dir_map(info->i_indirect) +
ENTRIES_PER_PAGE/2 + idx/ENTRIES_PER_PAGEPAGE;
while (!*dir) {
@@ -924,7 +932,7 @@ int shmem_unuse(swp_entry_t entry, struc
struct shmem_inode_info *info;
int found = 0;

- spin_lock(&shmem_swaplist_lock);
+ mutex_lock(&shmem_swaplist_mutex);
list_for_each_safe(p, next, &shmem_swaplist) {
info = list_entry(p, struct shmem_inode_info, swaplist);
if (!info->swapped)
@@ -935,8 +943,9 @@ int shmem_unuse(swp_entry_t entry, struc
found = 1;
break;
}
+ cond_resched();
}
- spin_unlock(&shmem_swaplist_lock);
+ mutex_unlock(&shmem_swaplist_mutex);
return found;
}

@@ -996,10 +1005,10 @@ static int shmem_writepage(struct page *
shmem_swp_unmap(entry);
spin_unlock(&info->lock);
if (list_empty(&info->swaplist)) {
- spin_lock(&shmem_swaplist_lock);
+ mutex_lock(&shmem_swaplist_mutex);
/* move instead of add in case we're racing */
list_move_tail(&info->swaplist, &shmem_swaplist);
- spin_unlock(&shmem_swaplist_lock);
+ mutex_unlock(&shmem_swaplist_mutex);
}
swap_duplicate(swap);
BUG_ON(page_mapped(page));

2007-12-18 22:05:44

by Hugh Dickins

Subject: [PATCH 7/9] tmpfs: open a window in shmem_unuse_inode

There are a couple of reasons (patches follow) why it would be good to open
a window for sleep in shmem_unuse_inode, between its search for a matching
swap entry, and its handling of the entry found.

shmem_unuse_inode must then use igrab to hold the inode against deletion in
that window, and its corresponding iput might result in deletion: so it had
better unlock_page before the iput, and might as well release the page too.

Nor is there any need to hold on to shmem_swaplist_mutex once we know we'll
leave the loop. So this unwinding moves from try_to_unuse and shmem_unuse
into shmem_unuse_inode, in the case when it finds a match.

Let try_to_unuse break on error in the shmem_unuse case, as it does in the
unuse_mm case: though at this point in the series, no error to break on.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/shmem.c | 57 +++++++++++++++++++++++++++++-------------------
mm/swapfile.c | 23 ++++++++-----------
2 files changed, 45 insertions(+), 35 deletions(-)

--- tmpfs6/mm/shmem.c 2007-12-06 16:34:56.000000000 +0000
+++ tmpfs7/mm/shmem.c 2007-12-06 18:11:26.000000000 +0000
@@ -838,10 +838,8 @@ static int shmem_unuse_inode(struct shme
if (size > SHMEM_NR_DIRECT)
size = SHMEM_NR_DIRECT;
offset = shmem_find_swp(entry, ptr, ptr+size);
- if (offset >= 0) {
- shmem_swp_balance_unmap();
+ if (offset >= 0)
goto found;
- }
if (!info->i_indirect)
goto lost2;

@@ -879,11 +877,11 @@ static int shmem_unuse_inode(struct shme
if (size > ENTRIES_PER_PAGE)
size = ENTRIES_PER_PAGE;
offset = shmem_find_swp(entry, ptr, ptr+size);
+ shmem_swp_unmap(ptr);
if (offset >= 0) {
shmem_dir_unmap(dir);
goto found;
}
- shmem_swp_unmap(ptr);
}
}
lost1:
@@ -893,10 +891,25 @@ lost2:
return 0;
found:
idx += offset;
- inode = &info->vfs_inode;
- error = add_to_page_cache(page, inode->i_mapping, idx, GFP_ATOMIC);
+ inode = igrab(&info->vfs_inode);
+ spin_unlock(&info->lock);
+
+ /* move head to start search for next from here */
+ list_move_tail(&shmem_swaplist, &info->swaplist);
+ mutex_unlock(&shmem_swaplist_mutex);
+
+ error = 1;
+ if (!inode)
+ goto out;
+
+ spin_lock(&info->lock);
+ ptr = shmem_swp_entry(info, idx, NULL);
+ if (ptr && ptr->val == entry.val)
+ error = add_to_page_cache(page, inode->i_mapping,
+ idx, GFP_ATOMIC);
if (error == -EEXIST) {
struct page *filepage = find_get_page(inode->i_mapping, idx);
+ error = 1;
if (filepage) {
/*
* There might be a more uptodate page coming down
@@ -911,16 +924,18 @@ found:
delete_from_swap_cache(page);
set_page_dirty(page);
info->flags |= SHMEM_PAGEIN;
- shmem_swp_set(info, ptr + offset, 0);
+ shmem_swp_set(info, ptr, 0);
+ swap_free(entry);
+ error = 1; /* not an error, but entry was found */
}
- shmem_swp_unmap(ptr);
+ if (ptr)
+ shmem_swp_unmap(ptr);
spin_unlock(&info->lock);
- /*
- * Decrement swap count even when the entry is left behind:
- * try_to_unuse will skip over mms, then reincrement count.
- */
- swap_free(entry);
- return 1;
+out:
+ unlock_page(page);
+ page_cache_release(page);
+ iput(inode); /* allows for NULL */
+ return error;
}

/*
@@ -935,18 +950,16 @@ int shmem_unuse(swp_entry_t entry, struc
mutex_lock(&shmem_swaplist_mutex);
list_for_each_safe(p, next, &shmem_swaplist) {
info = list_entry(p, struct shmem_inode_info, swaplist);
- if (!info->swapped)
+ if (info->swapped)
+ found = shmem_unuse_inode(info, entry, page);
+ else
list_del_init(&info->swaplist);
- else if (shmem_unuse_inode(info, entry, page)) {
- /* move head to start search for next from here */
- list_move_tail(&shmem_swaplist, &info->swaplist);
- found = 1;
- break;
- }
cond_resched();
+ if (found)
+ goto out;
}
mutex_unlock(&shmem_swaplist_mutex);
- return found;
+out: return found; /* 0 or 1 or -ENOMEM */
}

/*
--- tmpfs6/mm/swapfile.c 2007-12-05 10:38:45.000000000 +0000
+++ tmpfs7/mm/swapfile.c 2007-12-06 17:00:18.000000000 +0000
@@ -823,7 +823,7 @@ static int try_to_unuse(unsigned int typ
atomic_inc(&new_start_mm->mm_users);
atomic_inc(&prev_mm->mm_users);
spin_lock(&mmlist_lock);
- while (*swap_map > 1 && !retval &&
+ while (*swap_map > 1 && !retval && !shmem &&
(p = p->next) != &start_mm->mmlist) {
mm = list_entry(p, struct mm_struct, mmlist);
if (!atomic_inc_not_zero(&mm->mm_users))
@@ -855,6 +855,13 @@ static int try_to_unuse(unsigned int typ
mmput(start_mm);
start_mm = new_start_mm;
}
+ if (shmem) {
+ /* page has already been unlocked and released */
+ if (shmem > 0)
+ continue;
+ retval = shmem;
+ break;
+ }
if (retval) {
unlock_page(page);
page_cache_release(page);
@@ -893,12 +900,6 @@ static int try_to_unuse(unsigned int typ
* read from disk into another page. Splitting into two
* pages would be incorrect if swap supported "shared
* private" pages, but they are handled by tmpfs files.
- *
- * Note shmem_unuse already deleted a swappage from
- * the swap cache, unless the move to filepage failed:
- * in which case it left swappage in cache, lowered its
- * swap count to pass quickly through the loops above,
- * and now we must reincrement count to try again later.
*/
if ((*swap_map > 1) && PageDirty(page) && PageSwapCache(page)) {
struct writeback_control wbc = {
@@ -909,12 +910,8 @@ static int try_to_unuse(unsigned int typ
lock_page(page);
wait_on_page_writeback(page);
}
- if (PageSwapCache(page)) {
- if (shmem)
- swap_duplicate(entry);
- else
- delete_from_swap_cache(page);
- }
+ if (PageSwapCache(page))
+ delete_from_swap_cache(page);

/*
* So we could skip searching mms once swap count went

2007-12-18 22:06:46

by Hugh Dickins

Subject: [PATCH 8/9] tmpfs: radix_tree_preloading

Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page
cache or swap cache, without any radix tree preload: so tending to deplete
emergency reserves of memory.

GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache:
it's being called under memory pressure, so must not wait for more memory
to become available. But shmem_unuse_inode now has a window in which it
can and should preload with GFP_KERNEL, and say GFP_NOWAIT instead of
GFP_ATOMIC in its add_to_page_cache.

shmem_getpage is not so straightforward: its filepage/swappage integrity
relies upon exchanging between caches under spinlock, and it would need a
lot of restructuring to place the preloads correctly. Instead, follow
its pattern of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
radix_tree_preload, followed immediately by radix_tree_preload_end - that
won't guarantee success in the next add_to_page_cache, but doesn't need to.

And we can then remove that bothersome congestion_wait: when needed,
it'll automatically get done in the course of the radix_tree_preload.

Signed-off-by: Hugh Dickins <[email protected]>
---
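For reference, the preload pairing relied on above, as I understand the
radix-tree API of this era (a sketch of the contract, not a quote of
lib/radix-tree.c):

/*
 * radix_tree_preload(gfp) may sleep to top up a per-cpu pool of
 * radix-tree nodes, and on success returns with preemption disabled so
 * the pool cannot be stolen; radix_tree_preload_end() just re-enables
 * preemption.  Ending the preload immediately, as shmem_getpage does,
 * gives no guarantee that the later GFP_NOWAIT insertion still finds
 * nodes available - but it usually will, and the repeat loop absorbs the
 * occasional failure by retrying.
 */
error = radix_tree_preload(gfp & ~__GFP_HIGHMEM);	/* may sleep */
if (error)
	goto failed;
radix_tree_preload_end();	/* don't stay pinned to this cpu */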

mm/shmem.c | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)

--- tmpfs7/mm/shmem.c 2007-12-06 18:11:26.000000000 +0000
+++ tmpfs8/mm/shmem.c 2007-12-06 19:20:29.000000000 +0000
@@ -901,12 +901,16 @@ found:
error = 1;
if (!inode)
goto out;
+ error = radix_tree_preload(GFP_KERNEL);
+ if (error)
+ goto out;
+ error = 1;

spin_lock(&info->lock);
ptr = shmem_swp_entry(info, idx, NULL);
if (ptr && ptr->val == entry.val)
error = add_to_page_cache(page, inode->i_mapping,
- idx, GFP_ATOMIC);
+ idx, GFP_NOWAIT);
if (error == -EEXIST) {
struct page *filepage = find_get_page(inode->i_mapping, idx);
error = 1;
@@ -931,6 +935,7 @@ found:
if (ptr)
shmem_swp_unmap(ptr);
spin_unlock(&info->lock);
+ radix_tree_preload_end();
out:
unlock_page(page);
page_cache_release(page);
@@ -1185,6 +1190,16 @@ repeat:
goto done;
error = 0;
gfp = mapping_gfp_mask(mapping);
+ if (!filepage) {
+ /*
+ * Try to preload while we can wait, to not make a habit of
+ * draining atomic reserves; but don't latch on to this cpu.
+ */
+ error = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
+ if (error)
+ goto failed;
+ radix_tree_preload_end();
+ }

spin_lock(&info->lock);
shmem_recalc_inode(inode);
@@ -1266,7 +1281,7 @@ repeat:
set_page_dirty(filepage);
swap_free(swap);
} else if (!(error = add_to_page_cache(
- swappage, mapping, idx, GFP_ATOMIC))) {
+ swappage, mapping, idx, GFP_NOWAIT))) {
info->flags |= SHMEM_PAGEIN;
shmem_swp_set(info, entry, 0);
shmem_swp_unmap(entry);
@@ -1280,10 +1295,6 @@ repeat:
spin_unlock(&info->lock);
unlock_page(swappage);
page_cache_release(swappage);
- if (error == -ENOMEM) {
- /* let kswapd refresh zone for GFP_ATOMICs */
- congestion_wait(WRITE, HZ/50);
- }
goto repeat;
}
} else if (sgp == SGP_READ && !filepage) {
@@ -1338,7 +1349,7 @@ repeat:
shmem_swp_unmap(entry);
}
if (error || swap.val || 0 != add_to_page_cache_lru(
- filepage, mapping, idx, GFP_ATOMIC)) {
+ filepage, mapping, idx, GFP_NOWAIT)) {
spin_unlock(&info->lock);
page_cache_release(filepage);
shmem_unacct_blocks(info->flags, 1);

2007-12-18 22:08:18

by Hugh Dickins

Subject: [PATCH 9/9] tmpfs: fix shmem_swaplist races

Intensive swapoff testing shows shmem_unuse spinning on an entry in
shmem_swaplist pointing to itself: how does that come about? Days pass...

First guess is this: shmem_delete_inode tests list_empty without taking the
global mutex (so the swapping case doesn't slow down the common case); but
there's an instant in shmem_unuse_inode's list_move_tail when the list entry
may appear empty (a rare case, because it's actually moving the head not the
list member). So there's a danger of leaving the inode on the swaplist
when it's freed, then reinitialized to point to itself when reused. Fix that
by skipping the list_move_tail when it's a no-op, which happens to plug this.

But this same spinning then surfaces on another machine. Ah, I'd never
suspected it, but shmem_writepage's swaplist manipulation is unsafe: though
we still hold page lock, which would hold off inode deletion if the page were
in pagecache, it doesn't hold off once it's in swapcache (free_swap_and_cache
doesn't wait on locked pages). Hmm: we could put the inode on swaplist
earlier, but then shmem_unuse_inode could never prune unswapped inodes.

Fix this with an igrab before dropping info->lock, as in shmem_unuse_inode;
though I am a little uneasy about the iput which has to follow - it works,
and I see nothing wrong with it, but it is surprising that shmem inode
deletion may now occur below shmem_writepage. Revisit this fix later?

And while we're looking at these races: the way shmem_unuse tests swapped
without holding info->lock looks unsafe, if we've more than one swap area:
a racing shmem_writepage on another page of the same inode could be putting
it in swapcache, just as we're deciding to remove the inode from swaplist -
there's a danger of going on swap without being listed, so a later swapoff
would hang, being unable to locate the entry. Move that test and removal
down into shmem_unuse_inode, once info->lock is held.

Signed-off-by: Hugh Dickins <[email protected]>
---
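The "instant when the list entry may appear empty" is worth spelling out.
Expanding list_move_tail(&shmem_swaplist, &info->swaplist) into the list.h
primitives (a sketch of the reasoning only; the guarded call is from the
hunk below):

/*
 * list_move_tail(&shmem_swaplist, &info->swaplist) is
 * list_del(&shmem_swaplist) followed by
 * list_add_tail(&shmem_swaplist, &info->swaplist).
 *
 * If info->swaplist is the only member, the head's prev and next both
 * point at info->swaplist, so the list_del step links the head's
 * neighbours to each other:
 *
 *	info->swaplist.next = &info->swaplist;
 *	info->swaplist.prev = &info->swaplist;
 *
 * and for that instant list_empty(&info->swaplist) is true, though the
 * inode has not really left the swaplist.  A lockless list_empty test in
 * shmem_delete_inode can hit that window - hence skip the list_move_tail
 * when it would be a no-op.
 */
if (shmem_swaplist.next != &info->swaplist)
	list_move_tail(&shmem_swaplist, &info->swaplist);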

mm/shmem.c | 37 +++++++++++++++++++++++++------------
1 file changed, 25 insertions(+), 12 deletions(-)

--- tmpfs8/mm/shmem.c 2007-12-06 19:20:29.000000000 +0000
+++ tmpfs9/mm/shmem.c 2007-12-17 16:12:29.000000000 +0000
@@ -833,6 +833,10 @@ static int shmem_unuse_inode(struct shme
idx = 0;
ptr = info->i_direct;
spin_lock(&info->lock);
+ if (!info->swapped) {
+ list_del_init(&info->swaplist);
+ goto lost2;
+ }
limit = info->next_index;
size = limit;
if (size > SHMEM_NR_DIRECT)
@@ -894,8 +898,15 @@ found:
inode = igrab(&info->vfs_inode);
spin_unlock(&info->lock);

- /* move head to start search for next from here */
- list_move_tail(&shmem_swaplist, &info->swaplist);
+ /*
+ * Move _head_ to start search for next from here.
+ * But be careful: shmem_delete_inode checks list_empty without taking
+ * mutex, and there's an instant in list_move_tail when info->swaplist
+ * would appear empty, if it were the only one on shmem_swaplist. We
+ * could avoid doing it if inode NULL; or use this minor optimization.
+ */
+ if (shmem_swaplist.next != &info->swaplist)
+ list_move_tail(&shmem_swaplist, &info->swaplist);
mutex_unlock(&shmem_swaplist_mutex);

error = 1;
@@ -955,10 +966,7 @@ int shmem_unuse(swp_entry_t entry, struc
mutex_lock(&shmem_swaplist_mutex);
list_for_each_safe(p, next, &shmem_swaplist) {
info = list_entry(p, struct shmem_inode_info, swaplist);
- if (info->swapped)
- found = shmem_unuse_inode(info, entry, page);
- else
- list_del_init(&info->swaplist);
+ found = shmem_unuse_inode(info, entry, page);
cond_resched();
if (found)
goto out;
@@ -1021,18 +1029,23 @@ static int shmem_writepage(struct page *
remove_from_page_cache(page);
shmem_swp_set(info, entry, swap.val);
shmem_swp_unmap(entry);
+ if (list_empty(&info->swaplist))
+ inode = igrab(inode);
+ else
+ inode = NULL;
spin_unlock(&info->lock);
- if (list_empty(&info->swaplist)) {
- mutex_lock(&shmem_swaplist_mutex);
- /* move instead of add in case we're racing */
- list_move_tail(&info->swaplist, &shmem_swaplist);
- mutex_unlock(&shmem_swaplist_mutex);
- }
swap_duplicate(swap);
BUG_ON(page_mapped(page));
page_cache_release(page); /* pagecache ref */
set_page_dirty(page);
unlock_page(page);
+ if (inode) {
+ mutex_lock(&shmem_swaplist_mutex);
+ /* move instead of add in case we're racing */
+ list_move_tail(&info->swaplist, &shmem_swaplist);
+ mutex_unlock(&shmem_swaplist_mutex);
+ iput(inode);
+ }
return 0;
}

2007-12-18 22:52:22

by Erez Zadok

[permalink] [raw]
Subject: Re: [PATCH 4/9] tmpfs: allow filepage alongside swappage

In message <[email protected]>, Hugh Dickins writes:

> redirty:
> set_page_dirty(page);
> - return AOP_WRITEPAGE_ACTIVATE; /* Return with the page locked */
> + if (wbc->for_reclaim)
> + return AOP_WRITEPAGE_ACTIVATE; /* Return with page locked */
> + unlock_page(page);
> + return 0;

Just curious, but didn't you (or someone else) mention that they wanted to
do away with AOP_WRITEPAGE_ACTIVATE entirely? If these patches are for
2.6.25, would that be the right time?

Erez.

2007-12-19 00:42:25

by Hugh Dickins

Subject: Re: [PATCH 4/9] tmpfs: allow filepage alongside swappage

On Tue, 18 Dec 2007, Erez Zadok wrote:
>
> Just curious, but didn't you (or someone else) mention that they wanted to
> do away with AOP_WRITEPAGE_ACTIVATE entirely? If these patches are for
> 2.6.25, would that be the right time?

That's right, it was me. Not forgotten, but I was anxious to complete
this set of fixes, before going back to that cleanup. When I've a moment.

Hugh

2007-12-19 01:56:00

by Nick Piggin

Subject: Re: [PATCH 2/9] tmpfs: shuffle add_to_swap_caches

On Tue, Dec 18, 2007 at 09:59:22PM +0000, Hugh Dickins wrote:
> add_to_swap_cache doesn't amount to much: merge it into its sole caller
> read_swap_cache_async. But we'll be needing to call __add_to_swap_cache
> from shmem.c, so promote it to the new add_to_swap_cache. Both were
> static, so there's no interface confusion to worry about.
>
> And lose that inappropriate "Anon pages are already on the LRU" comment
> in the merging: they're not already on the LRU, as Nick Piggin noticed.
>
> Signed-off-by: Hugh Dickins <[email protected]>

No problems with me. Actually it is nice to make add_to_swap_cache
more closely match add_to_page_cache.

2007-12-19 06:50:59

by Nick Piggin

Subject: Re: [PATCH 8/9] tmpfs: radix_tree_preloading

On Tue, Dec 18, 2007 at 10:05:19PM +0000, Hugh Dickins wrote:
> Nick has observed that shmem.c still uses GFP_ATOMIC when adding to page
> cache or swap cache, without any radix tree preload: so tending to deplete
> emergency reserves of memory.
>
> GFP_ATOMIC remains appropriate in shmem_writepage's add_to_swap_cache:
> it's being called under memory pressure, so must not wait for more memory
> to become available. But shmem_unuse_inode now has a window in which it
> can and should preload with GFP_KERNEL, and say GFP_NOWAIT instead of
> GFP_ATOMIC in its add_to_page_cache.
>
> shmem_getpage is not so straightforward: its filepage/swappage integrity
> relies upon exchanging between caches under spinlock, and it would need a
> lot of restructuring to place the preloads correctly. Instead, follow
> its pattern of retrying on races: use GFP_NOWAIT instead of GFP_ATOMIC in
> add_to_page_cache, and begin each circuit of the repeat loop with a sleeping
> radix_tree_preload, followed immediately by radix_tree_preload_end - that
> won't guarantee success in the next add_to_page_cache, but doesn't need to.
>
> And we can then remove that bothersome congestion_wait: when needed,
> it'll automatically get done in the course of the radix_tree_preload.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Looks good to me. Thanks for this!