2014-02-04 01:00:48

by Johannes Weiner

Subject: [patch 00/10] mm: thrash detection-based file cache sizing v9

Changes in this revision

o Fix vmstat build problems on UP (Fengguang Wu's build bot)

o Clarify why optimistic radix_tree_node->private_list link checking
is safe without holding the list_lru lock (Dave Chinner)

o Assert locking balance when the list_lru isolator says it dropped
the list lock (Dave Chinner)

o Remove remnant of a manual reclaim counter in the shadow isolator,
the list_lru-provided accounting is accurate now that we added
LRU_REMOVED_RETRY (Dave Chinner)

o Set an object limit for the shadow shrinker instead of messing with
its seeks setting. The configured seeks define how pressure applied
to pages translates to pressure on the object pool; by itself that is
not enough to replace proper object valuation for classifying expired
and in-use objects. Shadow nodes contain up to 64 shadow entries
from different/alternating zones that have their own atomic age
counter, so determining whether a node is expired overall is crazy
expensive. Instead, use an object limit above which nodes are very
likely to be expired.

o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)

o radix_tree_node->count accessors for pages and shadows (Minchan Kim)

o Rebase to v3.14-rc1 and add review tags

Summary

The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past. We call the recently
used list the "inactive list" and the frequently used list the
"active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages. This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

This happens on file servers and media streaming servers, where the
popular files and file sections change over time. Even though the
individual files might be smaller than half of memory, concurrent
access to many of them may still result in their inter-reference
distance being greater than half of memory. It's also been reported
as a problem on database workloads that switch back and forth between
tables that are bigger than half of memory. In these cases the VM
never recognizes the new working set and, for the remainder of the
workload, thrashes disk data that could easily live in memory.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list. This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This series solves the problem by maintaining a history of pages
evicted from the inactive list, enabling the VM to detect frequently
used pages regardless of inactive list size and facilitate working set
transitions.
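
In code terms, the mechanism boils down to the following simplified
sketch of the hooks added by this series (patch 08); locking, cookie
packing and statistics are omitted here and live in the patches below:

	/* eviction: snapshot the zone's aging counter into the tree slot */
	shadow = workingset_eviction(mapping, page);
	__delete_from_page_cache(page, shadow);

	/* refault: compare that snapshot against the current counter */
	if (shadow && workingset_refault(shadow)) {
		SetPageActive(page);	/* refault distance fits the active list */
		workingset_activation(page);
	}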

Tests

The reported database workload is easily demonstrated on an 8G machine
with two 6G filesets. This fio workload operates on one set first,
then switches to the other. The VM should obviously always cache the
set that the workload is currently using.

This test is based on a problem encountered by Citus Data customers:
http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data

unpatched:
db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%

real 27m15.541s
user 0m19.059s
sys 0m51.459s

patched:
db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%

real 6m8.630s
user 0m14.714s
sys 0m31.233s

As can be seen, the unpatched kernel simply never adapts to the
workingset change and db2 is stuck indefinitely with secondary storage
speed. The patched kernel needs 2-3 iterations over db2 before it
replaces db1 and reaches full memory speed. Given the unbounded
negative effect of the existing VM behavior, these patches should be
considered correctness fixes rather than performance optimizations.

Another test resembles a fileserver or streaming server workload,
where data in excess of memory size is accessed at different
frequencies. There is very hot data accessed at a high frequency.
Machines should be fitted so that the hot set of such a workload can
be fully cached or all bets are off. Then there is a very big
(compared to available memory) set of data that is used-once or at a
very low frequency; this is what drives the inactive list and does not
really benefit from caching. Lastly, there is a big set of warm data
in between that is accessed at medium frequencies and benefits from
caching the pages between the first and last streamer of each burst.

unpatched:
hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%

patched:
hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%

In both kernels, the hot set is propagated to the active list and then
served from cache.

In both kernels, the beginning of the warm set is propagated to the
active list as well, but in the unpatched case the active list
eventually takes up half of memory and no new pages from the warm set
get activated, despite repeated access, and despite most of the active
list soon being stale. The patched kernel on the other hand detects
the thrashing and manages to keep this cache window rolling through
the data set. This frees up enough IO bandwidth that the cold set is
served at full speed as well and disk utilization even drops by 20%.

For reference, this same test was performed with the traditional
demotion mechanism, where deactivation is coupled to inactive list
reclaim. However, this had the same outcome as the unpatched kernel:
while the warm set does indeed get activated continuously, it is
forced out of the active list by inactive list pressure, which is
dictated primarily by the unrelated cold set. The warm set is evicted
before subsequent streamers can benefit from it, even though there
would be enough space available to cache the pages of interest.

Costs

Page reclaim used to shrink the radix trees, but now the tree nodes
are reused for shadow entries, so the memory cost depends heavily on
the page cache access patterns. However, with workloads that maintain spatial
or temporal locality, the shadow entries are either refaulted quickly
or reclaimed along with the inode object itself. Workloads that will
experience a memory cost increase are those that don't really benefit
from caching in the first place.

A more predictable alternative would be a fixed-cost separate pool of
shadow entries, but this would incur a relatively higher memory cost
for well-behaved workloads for the benefit of corner cases. It would also
make the shadow entry lookup more costly compared to storing them
directly in the cache structure.

Future

To simplify the merging process, this patch set implements thrash
detection on a global per-zone level only for now, but the design is
such that it can be extended to memory cgroups as well. All we need
to do is store the unique cgroup ID alongside the node and zone
identifiers inside the eviction cookie to identify the lruvec.
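
A minimal sketch of what that could look like, modeled on pack_shadow()
from patch 08 - MEMCG_ID_SHIFT and the memcg_id parameter are
hypothetical placeholders, not something this series defines:

	static void *pack_shadow(unsigned short memcg_id, unsigned long eviction,
				 struct zone *zone)
	{
		/* hypothetical: fold the cgroup ID into the cookie as well */
		eviction = (eviction << MEMCG_ID_SHIFT) | memcg_id;
		eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
		eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
		eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

		return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
	}

unpack_shadow() would then recover the ID and resolve the lruvec
instead of the bare zone.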

Right now we have a fixed ratio (50:50) between the inactive and
active lists, but we already have complaints about working sets
exceeding half of memory being pushed out of the cache by simple
streaming in the background. Ultimately, we want to adjust this ratio
and allow for a much smaller inactive list. These patches are an
essential step in this direction because they decouple the VM's
ability to detect working set changes from the inactive list size.
This would allow us to base the inactive list size on the combined
readahead window size, for example, and potentially protect a much
bigger working set.

It's also a big step towards activating pages with a reuse distance
larger than memory, as long as they are the most frequently used pages
in the workload. This will require knowing more about the access
frequency of active pages than what we measure right now, so it's also
deferred in this series.

Another possibility opened up by having thrashing information would be
to revisit the idea of local reclaim in the form of zero-config memory
control groups. Instead of having allocating tasks go straight to
global reclaim, they could first try to reclaim the pages in the memcg
they are part of, as long as that group is not thrashing. This would
allow a user to drop e.g. a backup job into an otherwise unconfigured
memcg, and it would only inflate (and possibly do global reclaim)
until it has enough memory to do proper readahead. But once it
reaches that point and stops thrashing, it would just recycle its own
used-once pages without kicking out the cache of any other tasks in
the system more than necessary.

Documentation/filesystems/porting | 6 +-
drivers/staging/lustre/lustre/llite/llite_lib.c | 2 +-
fs/9p/vfs_inode.c | 2 +-
fs/affs/inode.c | 2 +-
fs/afs/inode.c | 2 +-
fs/bfs/inode.c | 2 +-
fs/block_dev.c | 4 +-
fs/btrfs/compression.c | 2 +-
fs/btrfs/inode.c | 2 +-
fs/cachefiles/rdwr.c | 33 +-
fs/cifs/cifsfs.c | 2 +-
fs/coda/inode.c | 2 +-
fs/ecryptfs/super.c | 2 +-
fs/exofs/inode.c | 2 +-
fs/ext2/inode.c | 2 +-
fs/ext3/inode.c | 2 +-
fs/ext4/inode.c | 4 +-
fs/f2fs/inode.c | 2 +-
fs/fat/inode.c | 2 +-
fs/freevxfs/vxfs_inode.c | 2 +-
fs/fuse/inode.c | 2 +-
fs/gfs2/super.c | 2 +-
fs/hfs/inode.c | 2 +-
fs/hfsplus/super.c | 2 +-
fs/hostfs/hostfs_kern.c | 2 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 4 +-
fs/jffs2/fs.c | 2 +-
fs/jfs/inode.c | 4 +-
fs/kernfs/inode.c | 2 +-
fs/logfs/readwrite.c | 2 +-
fs/minix/inode.c | 2 +-
fs/ncpfs/inode.c | 2 +-
fs/nfs/blocklayout/blocklayout.c | 2 +-
fs/nfs/inode.c | 2 +-
fs/nfs/nfs4super.c | 2 +-
fs/nilfs2/inode.c | 6 +-
fs/ntfs/inode.c | 2 +-
fs/ocfs2/inode.c | 4 +-
fs/omfs/inode.c | 2 +-
fs/proc/inode.c | 2 +-
fs/reiserfs/inode.c | 2 +-
fs/sysv/inode.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/udf/inode.c | 4 +-
fs/ufs/inode.c | 2 +-
fs/xfs/xfs_super.c | 2 +-
include/linux/fs.h | 1 +
include/linux/list_lru.h | 2 +
include/linux/mm.h | 9 +
include/linux/mmzone.h | 6 +
include/linux/pagemap.h | 33 +-
include/linux/pagevec.h | 3 +
include/linux/radix-tree.h | 55 ++-
include/linux/shmem_fs.h | 1 +
include/linux/swap.h | 36 ++
include/linux/vmstat.h | 29 +-
lib/radix-tree.c | 383 ++++++++++----------
mm/Makefile | 2 +-
mm/filemap.c | 417 +++++++++++++++++++---
mm/list_lru.c | 10 +
mm/mincore.c | 20 +-
mm/readahead.c | 6 +-
mm/shmem.c | 122 ++-----
mm/swap.c | 50 +++
mm/truncate.c | 147 +++++++-
mm/vmscan.c | 24 +-
mm/vmstat.c | 3 +
mm/workingset.c | 396 ++++++++++++++++++++
69 files changed, 1438 insertions(+), 462 deletions(-)


2014-02-04 00:58:38

by Johannes Weiner

Subject: [patch 03/10] lib: radix-tree: radix_tree_delete_item()

Provide a function that does not just delete an entry at a given
index, but also allows passing in an expected item. Delete only if
that item is still located at the specified index.

This is handy when lockless tree traversals want to delete entries as
well, because they don't have to do a second, locked lookup to verify
that the slot has not changed under them before deleting the entry.
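
For illustration, a hedged sketch of the intended calling pattern,
using the page cache radix tree as the example; the bookkeeping helper
is hypothetical and not part of this patch:

	void *entry;

	rcu_read_lock();
	entry = radix_tree_lookup(&mapping->page_tree, index);
	rcu_read_unlock();

	/* decide, without the tree lock, that @entry should go away */

	spin_lock_irq(&mapping->tree_lock);
	/* deletes only if the slot still contains @entry */
	if (radix_tree_delete_item(&mapping->page_tree, index, entry))
		account_deleted_entry(mapping);	/* hypothetical bookkeeping */
	spin_unlock_irq(&mapping->tree_lock);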

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
include/linux/radix-tree.h | 1 +
lib/radix-tree.c | 31 +++++++++++++++++++++++++++----
2 files changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 403940787be1..1bf0a9c388d9 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -219,6 +219,7 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
unsigned int
radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 7811ed3b4e70..f442e3243607 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1335,15 +1335,18 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
}

/**
- * radix_tree_delete - delete an item from a radix tree
+ * radix_tree_delete_item - delete an item from a radix tree
* @root: radix tree root
* @index: index key
+ * @item: expected item
*
- * Remove the item at @index from the radix tree rooted at @root.
+ * Remove @item at @index from the radix tree rooted at @root.
*
- * Returns the address of the deleted item, or NULL if it was not present.
+ * Returns the address of the deleted item, or NULL if it was not present
+ * or the entry at the given @index was not @item.
*/
-void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+void *radix_tree_delete_item(struct radix_tree_root *root,
+ unsigned long index, void *item)
{
struct radix_tree_node *node = NULL;
struct radix_tree_node *slot = NULL;
@@ -1378,6 +1381,11 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
if (slot == NULL)
goto out;

+ if (item && slot != item) {
+ slot = NULL;
+ goto out;
+ }
+
/*
* Clear all tags associated with the item to be deleted.
* This way of doing it would be inefficient, but seldom is any set.
@@ -1422,6 +1430,21 @@ void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
out:
return slot;
}
+EXPORT_SYMBOL(radix_tree_delete_item);
+
+/**
+ * radix_tree_delete - delete an item from a radix tree
+ * @root: radix tree root
+ * @index: index key
+ *
+ * Remove the item at @index from the radix tree rooted at @root.
+ *
+ * Returns the address of the deleted item, or NULL if it was not present.
+ */
+void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
+{
+ return radix_tree_delete_item(root, index, NULL);
+}
EXPORT_SYMBOL(radix_tree_delete);

/**
--
1.8.5.3

2014-02-04 00:58:50

by Johannes Weiner

Subject: [patch 05/10] mm: filemap: move radix tree hole searching here

The radix tree hole searching code is only used for the page cache,
for example by the readahead code trying to get a picture of the area
surrounding a fault.

It sufficed to rely on the radix tree definition of holes, which is
"empty tree slot". But this is about to change, as shadow page
descriptors will be stored in the page cache after the actual pages
get evicted from memory.

Move the functions over to mm/filemap.c and make them native page
cache operations, where they can later be adapted to handle the new
definition of "page cache hole".

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +-
include/linux/pagemap.h | 5 +++
include/linux/radix-tree.h | 4 ---
lib/radix-tree.c | 75 ---------------------------------------
mm/filemap.c | 76 ++++++++++++++++++++++++++++++++++++++++
mm/readahead.c | 4 +--
6 files changed, 84 insertions(+), 82 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 56ff823ca82e..65d849bdf77a 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -1213,7 +1213,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
end = DIV_ROUND_UP(i_size_read(inode), PAGE_CACHE_SIZE);
if (end != NFS_I(inode)->npages) {
rcu_read_lock();
- end = radix_tree_next_hole(&mapping->page_tree, idx + 1, ULONG_MAX);
+ end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
rcu_read_unlock();
}

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 1710d1b060ba..52d56872fe26 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -243,6 +243,11 @@ static inline struct page *page_cache_alloc_readahead(struct address_space *x)

typedef int filler_t(void *, struct page *);

+pgoff_t page_cache_next_hole(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan);
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan);
+
extern struct page * find_get_page(struct address_space *mapping,
pgoff_t index);
extern struct page * find_lock_page(struct address_space *mapping,
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 1bf0a9c388d9..e8be53ecfc45 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -227,10 +227,6 @@ radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
unsigned int radix_tree_gang_lookup_slot(struct radix_tree_root *root,
void ***results, unsigned long *indices,
unsigned long first_index, unsigned int max_items);
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
- unsigned long index, unsigned long max_scan);
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
- unsigned long index, unsigned long max_scan);
int radix_tree_preload(gfp_t gfp_mask);
int radix_tree_maybe_preload(gfp_t gfp_mask);
void radix_tree_init(void);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f442e3243607..e8adb5d8a184 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -946,81 +946,6 @@ next:
}
EXPORT_SYMBOL(radix_tree_range_tag_if_tagged);

-
-/**
- * radix_tree_next_hole - find the next hole (not-present entry)
- * @root: tree root
- * @index: index key
- * @max_scan: maximum range to search
- *
- * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the lowest
- * indexed hole.
- *
- * Returns: the index of the hole if found, otherwise returns an index
- * outside of the set specified (in which case 'return - index >= max_scan'
- * will be true). In rare cases of index wrap-around, 0 will be returned.
- *
- * radix_tree_next_hole may be called under rcu_read_lock. However, like
- * radix_tree_gang_lookup, this will not atomically search a snapshot of
- * the tree at a single point in time. For example, if a hole is created
- * at index 5, then subsequently a hole is created at index 10,
- * radix_tree_next_hole covering both indexes may return 10 if called
- * under rcu_read_lock.
- */
-unsigned long radix_tree_next_hole(struct radix_tree_root *root,
- unsigned long index, unsigned long max_scan)
-{
- unsigned long i;
-
- for (i = 0; i < max_scan; i++) {
- if (!radix_tree_lookup(root, index))
- break;
- index++;
- if (index == 0)
- break;
- }
-
- return index;
-}
-EXPORT_SYMBOL(radix_tree_next_hole);
-
-/**
- * radix_tree_prev_hole - find the prev hole (not-present entry)
- * @root: tree root
- * @index: index key
- * @max_scan: maximum range to search
- *
- * Search backwards in the range [max(index-max_scan+1, 0), index]
- * for the first hole.
- *
- * Returns: the index of the hole if found, otherwise returns an index
- * outside of the set specified (in which case 'index - return >= max_scan'
- * will be true). In rare cases of wrap-around, ULONG_MAX will be returned.
- *
- * radix_tree_next_hole may be called under rcu_read_lock. However, like
- * radix_tree_gang_lookup, this will not atomically search a snapshot of
- * the tree at a single point in time. For example, if a hole is created
- * at index 10, then subsequently a hole is created at index 5,
- * radix_tree_prev_hole covering both indexes may return 5 if called under
- * rcu_read_lock.
- */
-unsigned long radix_tree_prev_hole(struct radix_tree_root *root,
- unsigned long index, unsigned long max_scan)
-{
- unsigned long i;
-
- for (i = 0; i < max_scan; i++) {
- if (!radix_tree_lookup(root, index))
- break;
- index--;
- if (index == ULONG_MAX)
- break;
- }
-
- return index;
-}
-EXPORT_SYMBOL(radix_tree_prev_hole);
-
/**
* radix_tree_gang_lookup - perform multiple lookup on a radix tree
* @root: radix tree root
diff --git a/mm/filemap.c b/mm/filemap.c
index d56d3c145b9f..1f7e89d1c1ac 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -686,6 +686,82 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
}

/**
+ * page_cache_next_hole - find the next hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
+ * lowest indexed hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'return - index >=
+ * max_scan' will be true). In rare cases of index wrap-around, 0 will
+ * be returned.
+ *
+ * page_cache_next_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 5, then subsequently a hole is created at
+ * index 10, page_cache_next_hole covering both indexes may return 10
+ * if called under rcu_read_lock.
+ */
+pgoff_t page_cache_next_hole(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ unsigned long i;
+
+ for (i = 0; i < max_scan; i++) {
+ if (!radix_tree_lookup(&mapping->page_tree, index))
+ break;
+ index++;
+ if (index == 0)
+ break;
+ }
+
+ return index;
+}
+EXPORT_SYMBOL(page_cache_next_hole);
+
+/**
+ * page_cache_prev_hole - find the prev hole (not-present entry)
+ * @mapping: mapping
+ * @index: index
+ * @max_scan: maximum range to search
+ *
+ * Search backwards in the range [max(index-max_scan+1, 0), index] for
+ * the first hole.
+ *
+ * Returns: the index of the hole if found, otherwise returns an index
+ * outside of the set specified (in which case 'index - return >=
+ * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
+ * will be returned.
+ *
+ * page_cache_prev_hole may be called under rcu_read_lock. However,
+ * like radix_tree_gang_lookup, this will not atomically search a
+ * snapshot of the tree at a single point in time. For example, if a
+ * hole is created at index 10, then subsequently a hole is created at
+ * index 5, page_cache_prev_hole covering both indexes may return 5 if
+ * called under rcu_read_lock.
+ */
+pgoff_t page_cache_prev_hole(struct address_space *mapping,
+ pgoff_t index, unsigned long max_scan)
+{
+ unsigned long i;
+
+ for (i = 0; i < max_scan; i++) {
+ if (!radix_tree_lookup(&mapping->page_tree, index))
+ break;
+ index--;
+ if (index == ULONG_MAX)
+ break;
+ }
+
+ return index;
+}
+EXPORT_SYMBOL(page_cache_prev_hole);
+
+/**
* find_get_page - find and get a page reference
* @mapping: the address_space to search
* @offset: the page index
diff --git a/mm/readahead.c b/mm/readahead.c
index 0de2360d65f3..c62d85ace0cc 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -347,7 +347,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
pgoff_t head;

rcu_read_lock();
- head = radix_tree_prev_hole(&mapping->page_tree, offset - 1, max);
+ head = page_cache_prev_hole(mapping, offset - 1, max);
rcu_read_unlock();

return offset - 1 - head;
@@ -427,7 +427,7 @@ ondemand_readahead(struct address_space *mapping,
pgoff_t start;

rcu_read_lock();
- start = radix_tree_next_hole(&mapping->page_tree, offset+1,max);
+ start = page_cache_next_hole(mapping, offset + 1, max);
rcu_read_unlock();

if (!start || start - offset > max)
--
1.8.5.3

2014-02-04 00:58:45

by Johannes Weiner

Subject: [patch 01/10] mm: vmstat: fix UP zone state accounting

Fengguang Wu's build testing spotted problems with inc_zone_state()
and dec_zone_state() on UP configurations in out-of-tree patches.

inc_zone_state() is declared but not defined, and dec_zone_state() is
missing entirely.

Just like with *_zone_page_state(), they can be defined like their
preemption-unsafe counterparts on UP.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/vmstat.h | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index a67b38415768..a32dbd2c2155 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -179,8 +179,6 @@ extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);
#define add_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, __d)
#define sub_zone_page_state(__z, __i, __d) mod_zone_page_state(__z, __i, -(__d))

-extern void inc_zone_state(struct zone *, enum zone_stat_item);
-
#ifdef CONFIG_SMP
void __mod_zone_page_state(struct zone *, enum zone_stat_item item, int);
void __inc_zone_page_state(struct page *, enum zone_stat_item);
@@ -216,24 +214,12 @@ static inline void __mod_zone_page_state(struct zone *zone,
zone_page_state_add(delta, zone, item);
}

-static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
- atomic_long_inc(&zone->vm_stat[item]);
- atomic_long_inc(&vm_stat[item]);
-}
-
static inline void __inc_zone_page_state(struct page *page,
enum zone_stat_item item)
{
__inc_zone_state(page_zone(page), item);
}

-static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
-{
- atomic_long_dec(&zone->vm_stat[item]);
- atomic_long_dec(&vm_stat[item]);
-}
-
static inline void __dec_zone_page_state(struct page *page,
enum zone_stat_item item)
{
@@ -248,6 +234,21 @@ static inline void __dec_zone_page_state(struct page *page,
#define dec_zone_page_state __dec_zone_page_state
#define mod_zone_page_state __mod_zone_page_state

+static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+ atomic_long_inc(&zone->vm_stat[item]);
+ atomic_long_inc(&vm_stat[item]);
+}
+
+static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+ atomic_long_dec(&zone->vm_stat[item]);
+ atomic_long_dec(&vm_stat[item]);
+}
+
+#define inc_zone_state __inc_zone_state
+#define dec_zone_state __dec_zone_state
+
#define set_pgdat_percpu_threshold(pgdat, callback) { }

static inline void refresh_cpu_vm_stats(int cpu) { }
--
1.8.5.3

2014-02-04 00:59:00

by Johannes Weiner

Subject: [patch 08/10] mm: thrash detection-based file cache sizing

The VM maintains cached filesystem pages on two types of lists. One
list holds the pages recently faulted into the cache, the other list
holds pages that have been referenced repeatedly on that first list.
The idea is to prefer reclaiming young pages over those that have
been shown to benefit from caching in the past. We call the recently
used list the "inactive list" and the frequently used list the
"active list".

Currently, the VM aims for a 1:1 ratio between the lists, which is the
"perfect" trade-off between the ability to *protect* frequently used
pages and the ability to *detect* frequently used pages. This means
that working set changes bigger than half of cache memory go
undetected and thrash indefinitely, whereas working sets bigger than
half of cache memory are unprotected against used-once streams that
don't even need caching.

Historically, every reclaim scan of the inactive list also took a
smaller number of pages from the tail of the active list and moved
them to the head of the inactive list. This model gave established
working sets more gracetime in the face of temporary use-once streams,
but ultimately was not significantly better than a FIFO policy and
still thrashed cache based on eviction speed, rather than actual
demand for cache.

This patch solves one half of the problem by decoupling the ability to
detect working set changes from the inactive list size. By
maintaining a history of recently evicted file pages it can detect
frequently used pages with an arbitrarily small inactive list size,
and subsequently apply pressure on the active list based on actual
demand for cache, not just overall eviction speed.

Every zone maintains a counter that tracks inactive list aging speed.
When a page is evicted, a snapshot of this counter is stored in the
now-empty page cache radix tree slot. On refault, the minimum access
distance of the page can be assessed, to evaluate whether the page
should be part of the active list or not.
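
A worked example with made-up numbers:

	counter at eviction  = 1000
	counter at refault   = 1300  -> refault distance = 300
	active file pages    =  500  -> 300 <= 500: activate the page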

This fixes the VM's blindness towards working set changes in excess of
the inactive list. It is also the foundation for further improving
the protection ability and reducing the current 50% minimum inactive
list size.

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Bob Liu <[email protected]>
---
include/linux/mmzone.h | 5 +
include/linux/swap.h | 5 +
mm/Makefile | 2 +-
mm/filemap.c | 61 ++++++++----
mm/swap.c | 2 +
mm/vmscan.c | 24 ++++-
mm/vmstat.c | 2 +
mm/workingset.c | 253 +++++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 331 insertions(+), 23 deletions(-)
create mode 100644 mm/workingset.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f2052c83154..b4bdeb411a4d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -142,6 +142,8 @@ enum zone_stat_item {
NUMA_LOCAL, /* allocation from local node */
NUMA_OTHER, /* allocation from other node */
#endif
+ WORKINGSET_REFAULT,
+ WORKINGSET_ACTIVATE,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
@@ -392,6 +394,9 @@ struct zone {
spinlock_t lru_lock;
struct lruvec lruvec;

+ /* Evictions & activations on the inactive file list */
+ atomic_long_t inactive_age;
+
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 46ba0c6c219f..b83cf61403ed 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,11 @@ struct swap_list_t {
int next; /* swapfile to be used next */
};

+/* linux/mm/workingset.c */
+void *workingset_eviction(struct address_space *mapping, struct page *page);
+bool workingset_refault(void *shadow);
+void workingset_activation(struct page *page);
+
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalreserve_pages;
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a09264..cdd741519ee0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -17,7 +17,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
util.o mmzone.o vmstat.o backing-dev.o \
mm_init.o mmu_context.o percpu.o slab_common.o \
compaction.o balloon_compaction.o \
- interval_tree.o list_lru.o $(mmu-y)
+ interval_tree.o list_lru.o workingset.o $(mmu-y)

obj-y += init-mm.o

diff --git a/mm/filemap.c b/mm/filemap.c
index 18f80d418f83..33ceebf4d577 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -469,7 +469,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
EXPORT_SYMBOL_GPL(replace_page_cache_page);

static int page_cache_tree_insert(struct address_space *mapping,
- struct page *page)
+ struct page *page, void **shadowp)
{
void **slot;
int error;
@@ -484,6 +484,8 @@ static int page_cache_tree_insert(struct address_space *mapping,
radix_tree_replace_slot(slot, page);
mapping->nrshadows--;
mapping->nrpages++;
+ if (shadowp)
+ *shadowp = p;
return 0;
}
error = radix_tree_insert(&mapping->page_tree, page->index, page);
@@ -492,18 +494,10 @@ static int page_cache_tree_insert(struct address_space *mapping,
return error;
}

-/**
- * add_to_page_cache_locked - add a locked page to the pagecache
- * @page: page to add
- * @mapping: the page's address_space
- * @offset: page index
- * @gfp_mask: page allocation mode
- *
- * This function is used to add a page to the pagecache. It must be locked.
- * This function does not add the page to the LRU. The caller must do that.
- */
-int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
- pgoff_t offset, gfp_t gfp_mask)
+static int __add_to_page_cache_locked(struct page *page,
+ struct address_space *mapping,
+ pgoff_t offset, gfp_t gfp_mask,
+ void **shadowp)
{
int error;

@@ -526,7 +520,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
page->index = offset;

spin_lock_irq(&mapping->tree_lock);
- error = page_cache_tree_insert(mapping, page);
+ error = page_cache_tree_insert(mapping, page, shadowp);
radix_tree_preload_end();
if (unlikely(error))
goto err_insert;
@@ -542,16 +536,49 @@ err_insert:
page_cache_release(page);
return error;
}
+
+/**
+ * add_to_page_cache_locked - add a locked page to the pagecache
+ * @page: page to add
+ * @mapping: the page's address_space
+ * @offset: page index
+ * @gfp_mask: page allocation mode
+ *
+ * This function is used to add a page to the pagecache. It must be locked.
+ * This function does not add the page to the LRU. The caller must do that.
+ */
+int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
+ pgoff_t offset, gfp_t gfp_mask)
+{
+ return __add_to_page_cache_locked(page, mapping, offset,
+ gfp_mask, NULL);
+}
EXPORT_SYMBOL(add_to_page_cache_locked);

int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t offset, gfp_t gfp_mask)
{
+ void *shadow = NULL;
int ret;

- ret = add_to_page_cache(page, mapping, offset, gfp_mask);
- if (ret == 0)
- lru_cache_add_file(page);
+ __set_page_locked(page);
+ ret = __add_to_page_cache_locked(page, mapping, offset,
+ gfp_mask, &shadow);
+ if (unlikely(ret))
+ __clear_page_locked(page);
+ else {
+ /*
+ * The page might have been evicted from cache only
+ * recently, in which case it should be activated like
+ * any other repeatedly accessed page.
+ */
+ if (shadow && workingset_refault(shadow)) {
+ SetPageActive(page);
+ workingset_activation(page);
+ } else
+ ClearPageActive(page);
+ lru_cache_add(page);
+ }
return ret;
}
EXPORT_SYMBOL_GPL(add_to_page_cache_lru);
diff --git a/mm/swap.c b/mm/swap.c
index 20c267b52914..8573c710d261 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -574,6 +574,8 @@ void mark_page_accessed(struct page *page)
else
__lru_cache_activate_page(page);
ClearPageReferenced(page);
+ if (page_is_file_cache(page))
+ workingset_activation(page);
} else if (!PageReferenced(page)) {
SetPageReferenced(page);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 63712938169b..a3fa590ad32a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -523,7 +523,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
* Same as remove_mapping, but if the page is removed from the mapping, it
* gets returned with a refcount of 0.
*/
-static int __remove_mapping(struct address_space *mapping, struct page *page)
+static int __remove_mapping(struct address_space *mapping, struct page *page,
+ bool reclaimed)
{
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
@@ -569,10 +570,23 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)
swapcache_free(swap, page);
} else {
void (*freepage)(struct page *);
+ void *shadow = NULL;

freepage = mapping->a_ops->freepage;
-
- __delete_from_page_cache(page, NULL);
+ /*
+ * Remember a shadow entry for reclaimed file cache in
+ * order to detect refaults, thus thrashing, later on.
+ *
+ * But don't store shadows in an address space that is
+ * already exiting. This is not just an optimization,
+ * inode reclaim needs to empty out the radix tree or
+ * the nodes are lost. Don't plant shadows behind its
+ * back.
+ */
+ if (reclaimed && page_is_file_cache(page) &&
+ !mapping_exiting(mapping))
+ shadow = workingset_eviction(mapping, page);
+ __delete_from_page_cache(page, shadow);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);

@@ -595,7 +609,7 @@ cannot_free:
*/
int remove_mapping(struct address_space *mapping, struct page *page)
{
- if (__remove_mapping(mapping, page)) {
+ if (__remove_mapping(mapping, page, false)) {
/*
* Unfreezing the refcount with 1 rather than 2 effectively
* drops the pagecache ref for us without requiring another
@@ -1065,7 +1079,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (!mapping || !__remove_mapping(mapping, page))
+ if (!mapping || !__remove_mapping(mapping, page, true))
goto keep_locked;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 72496140ac08..c95634e0c098 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -770,6 +770,8 @@ const char * const vmstat_text[] = {
"numa_local",
"numa_other",
#endif
+ "workingset_refault",
+ "workingset_activate",
"nr_anon_transparent_hugepages",
"nr_free_cma",
"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
new file mode 100644
index 000000000000..8a6c7cff4923
--- /dev/null
+++ b/mm/workingset.c
@@ -0,0 +1,253 @@
+/*
+ * Workingset detection
+ *
+ * Copyright (C) 2013 Red Hat, Inc., Johannes Weiner
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/writeback.h>
+#include <linux/pagemap.h>
+#include <linux/atomic.h>
+#include <linux/module.h>
+#include <linux/swap.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+
+/*
+ * Double CLOCK lists
+ *
+ * Per zone, two clock lists are maintained for file pages: the
+ * inactive and the active list. Freshly faulted pages start out at
+ * the head of the inactive list and page reclaim scans pages from the
+ * tail. Pages that are accessed multiple times on the inactive list
+ * are promoted to the active list, to protect them from reclaim,
+ * whereas active pages are demoted to the inactive list when the
+ * active list grows too big.
+ *
+ * fault ------------------------+
+ * |
+ * +--------------+ | +-------------+
+ * reclaim <- | inactive | <-+-- demotion | active | <--+
+ * +--------------+ +-------------+ |
+ * | |
+ * +-------------- promotion ------------------+
+ *
+ *
+ * Access frequency and refault distance
+ *
+ * A workload is thrashing when its pages are frequently used but they
+ * are evicted from the inactive list every time before another access
+ * would have promoted them to the active list.
+ *
+ * In cases where the average access distance between thrashing pages
+ * is bigger than the size of memory there is nothing that can be
+ * done - the thrashing set could never fit into memory under any
+ * circumstance.
+ *
+ * However, the average access distance could be bigger than the
+ * inactive list, yet smaller than the size of memory. In this case,
+ * the set could fit into memory if it weren't for the currently
+ * active pages - which may be used more, hopefully less frequently:
+ *
+ * +-memory available to cache-+
+ * | |
+ * +-inactive------+-active----+
+ * a b | c d e f g h i | J K L M N |
+ * +---------------+-----------+
+ *
+ * It is prohibitively expensive to accurately track access frequency
+ * of pages. But a reasonable approximation can be made to measure
+ * thrashing on the inactive list, after which refaulting pages can be
+ * activated optimistically to compete with the existing active pages.
+ *
+ * Approximating inactive page access frequency - Observations:
+ *
+ * 1. When a page is accessed for the first time, it is added to the
+ * head of the inactive list, slides every existing inactive page
+ * towards the tail by one slot, and pushes the current tail page
+ * out of memory.
+ *
+ * 2. When a page is accessed for the second time, it is promoted to
+ * the active list, shrinking the inactive list by one slot. This
+ * also slides all inactive pages that were faulted into the cache
+ * more recently than the activated page towards the tail of the
+ * inactive list.
+ *
+ * Thus:
+ *
+ * 1. The sum of evictions and activations between any two points in
+ * time indicate the minimum number of inactive pages accessed in
+ * between.
+ *
+ * 2. Moving one inactive page N page slots towards the tail of the
+ * list requires at least N inactive page accesses.
+ *
+ * Combining these:
+ *
+ * 1. When a page is finally evicted from memory, the number of
+ * inactive pages accessed while the page was in cache is at least
+ * the number of page slots on the inactive list.
+ *
+ * 2. In addition, measuring the sum of evictions and activations (E)
+ * at the time of a page's eviction, and comparing it to another
+ * reading (R) at the time the page faults back into memory tells
+ * the minimum number of accesses while the page was not cached.
+ * This is called the refault distance.
+ *
+ * Because the first access of the page was the fault and the second
+ * access the refault, we combine the in-cache distance with the
+ * out-of-cache distance to get the complete minimum access distance
+ * of this page:
+ *
+ * NR_inactive + (R - E)
+ *
+ * And knowing the minimum access distance of a page, we can easily
+ * tell if the page would be able to stay in cache assuming all page
+ * slots in the cache were available:
+ *
+ * NR_inactive + (R - E) <= NR_inactive + NR_active
+ *
+ * which can be further simplified to
+ *
+ * (R - E) <= NR_active
+ *
+ * Put into words, the refault distance (out-of-cache) can be seen as
+ * a deficit in inactive list space (in-cache). If the inactive list
+ * had (R - E) more page slots, the page would not have been evicted
+ * in between accesses, but activated instead. And on a full system,
+ * the only thing eating into inactive list space is active pages.
+ *
+ *
+ * Activating refaulting pages
+ *
+ * All that is known about the active list is that the pages have been
+ * accessed more than once in the past. This means that at any given
+ * time there is actually a good chance that pages on the active list
+ * are no longer in active use.
+ *
+ * So when a refault distance of (R - E) is observed and there are at
+ * least (R - E) active pages, the refaulting page is activated
+ * optimistically in the hope that (R - E) active pages are actually
+ * used less frequently than the refaulting page - or even not used at
+ * all anymore.
+ *
+ * If this is wrong and demotion kicks in, the pages which are truly
+ * used more frequently will be reactivated while the less frequently
+ * used once will be evicted from memory.
+ *
+ * But if this is right, the stale pages will be pushed out of memory
+ * and the used pages get to stay in cache.
+ *
+ *
+ * Implementation
+ *
+ * For each zone's file LRU lists, a counter for inactive evictions
+ * and activations is maintained (zone->inactive_age).
+ *
+ * On eviction, a snapshot of this counter (along with some bits to
+ * identify the zone) is stored in the now empty page cache radix tree
+ * slot of the evicted page. This is called a shadow entry.
+ *
+ * On cache misses for which there are shadow entries, an eligible
+ * refault distance will immediately activate the refaulting page.
+ */
+
+static void *pack_shadow(unsigned long eviction, struct zone *zone)
+{
+ eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
+ eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);
+ eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);
+
+ return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
+}
+
+static void unpack_shadow(void *shadow,
+ struct zone **zone,
+ unsigned long *distance)
+{
+ unsigned long entry = (unsigned long)shadow;
+ unsigned long eviction;
+ unsigned long refault;
+ unsigned long mask;
+ int zid, nid;
+
+ entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
+ zid = entry & ((1UL << ZONES_SHIFT) - 1);
+ entry >>= ZONES_SHIFT;
+ nid = entry & ((1UL << NODES_SHIFT) - 1);
+ entry >>= NODES_SHIFT;
+ eviction = entry;
+
+ *zone = NODE_DATA(nid)->node_zones + zid;
+
+ refault = atomic_long_read(&(*zone)->inactive_age);
+ mask = ~0UL >> (NODES_SHIFT + ZONES_SHIFT +
+ RADIX_TREE_EXCEPTIONAL_SHIFT);
+ /*
+ * The unsigned subtraction here gives an accurate distance
+ * across inactive_age overflows in most cases.
+ *
+ * There is a special case: usually, shadow entries have a
+ * short lifetime and are either refaulted or reclaimed along
+ * with the inode before they get too old. But it is not
+ * impossible for the inactive_age to lap a shadow entry in
+ * the field, which can then can result in a false small
+ * refault distance, leading to a false activation should this
+ * old entry actually refault again. However, earlier kernels
+ * used to deactivate unconditionally with *every* reclaim
+ * invocation for the longest time, so the occasional
+ * inappropriate activation leading to pressure on the active
+ * list is not a problem.
+ */
+ *distance = (refault - eviction) & mask;
+}
+
+/**
+ * workingset_eviction - note the eviction of a page from memory
+ * @mapping: address space the page was backing
+ * @page: the page being evicted
+ *
+ * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * of the evicted @page so that a later refault can be detected.
+ */
+void *workingset_eviction(struct address_space *mapping, struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ unsigned long eviction;
+
+ eviction = atomic_long_inc_return(&zone->inactive_age);
+ return pack_shadow(eviction, zone);
+}
+
+/**
+ * workingset_refault - evaluate the refault of a previously evicted page
+ * @shadow: shadow entry of the evicted page
+ *
+ * Calculates and evaluates the refault distance of the previously
+ * evicted page in the context of the zone it was allocated in.
+ *
+ * Returns %true if the page should be activated, %false otherwise.
+ */
+bool workingset_refault(void *shadow)
+{
+ unsigned long refault_distance;
+ struct zone *zone;
+
+ unpack_shadow(shadow, &zone, &refault_distance);
+ inc_zone_state(zone, WORKINGSET_REFAULT);
+
+ if (refault_distance <= zone_page_state(zone, NR_ACTIVE_FILE)) {
+ inc_zone_state(zone, WORKINGSET_ACTIVATE);
+ return true;
+ }
+ return false;
+}
+
+/**
+ * workingset_activation - note a page activation
+ * @page: page that is being activated
+ */
+void workingset_activation(struct page *page)
+{
+ atomic_long_inc(&page_zone(page)->inactive_age);
+}
--
1.8.5.3

2014-02-04 00:59:26

by Johannes Weiner

Subject: [patch 06/10] mm + fs: prepare for non-page entries in page cache radix trees

shmem mappings already contain exceptional entries where swap slot
information is remembered.

To be able to store eviction information for the regular page cache,
prepare every site that deals directly with the radix tree to handle
entries other than pages.

The common lookup functions will filter out non-page entries and
return NULL for page cache holes, just as before. But provide a raw
version of the API which returns non-page entries as well, and switch
shmem over to use it.
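
A hedged sketch of the resulting caller-side pattern, for illustration
only; the swap-entry conversion shown is what shmem already uses for
its exceptional entries:

	struct page *page;

	page = __find_get_page(mapping, index);
	if (radix_tree_exceptional_entry(page)) {
		/* not a struct page: a shmem swap entry (or, later, a shadow) */
		swp_entry_t swap = radix_to_swp_entry(page);
		/* ... handle the non-page entry ... */
	} else if (page) {
		/* a real, referenced page cache page */
	}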

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
fs/btrfs/compression.c | 2 +-
include/linux/mm.h | 8 ++
include/linux/pagemap.h | 15 ++--
include/linux/pagevec.h | 3 +
include/linux/shmem_fs.h | 1 +
mm/filemap.c | 197 +++++++++++++++++++++++++++++++++++++++++------
mm/mincore.c | 20 +++--
mm/readahead.c | 2 +-
mm/shmem.c | 97 +++++------------------
mm/swap.c | 48 ++++++++++++
mm/truncate.c | 73 ++++++++++++++----
11 files changed, 338 insertions(+), 128 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e2600cdb6c25..1b8d21b681f2 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -472,7 +472,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, pg_index);
rcu_read_unlock();
- if (page) {
+ if (page && !radix_tree_exceptional_entry(page)) {
misses++;
if (misses > 4)
break;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f28f46eade6a..d684ac125482 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1031,6 +1031,14 @@ extern void show_free_areas(unsigned int flags);
extern bool skip_free_areas_node(unsigned int flags, int nid);

int shmem_zero_setup(struct vm_area_struct *);
+#ifdef CONFIG_SHMEM
+bool shmem_mapping(struct address_space *mapping);
+#else
+static inline bool shmem_mapping(struct address_space *mapping)
+{
+ return false;
+}
+#endif

extern int can_do_mlock(void);
extern int user_shm_lock(size_t, struct user_struct *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 52d56872fe26..2eeca3c83b0f 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -248,12 +248,15 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
pgoff_t page_cache_prev_hole(struct address_space *mapping,
pgoff_t index, unsigned long max_scan);

-extern struct page * find_get_page(struct address_space *mapping,
- pgoff_t index);
-extern struct page * find_lock_page(struct address_space *mapping,
- pgoff_t index);
-extern struct page * find_or_create_page(struct address_space *mapping,
- pgoff_t index, gfp_t gfp_mask);
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset);
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset);
+struct page *find_or_create_page(struct address_space *mapping, pgoff_t index,
+ gfp_t gfp_mask);
+unsigned __find_get_pages(struct address_space *mapping, pgoff_t start,
+ unsigned int nr_pages, struct page **pages,
+ pgoff_t *indices);
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index e4dbfab37729..3c6b8b1e945b 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -22,6 +22,9 @@ struct pagevec {

void __pagevec_release(struct pagevec *pvec);
void __pagevec_lru_add(struct pagevec *pvec);
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+ pgoff_t start, unsigned nr_pages, pgoff_t *indices);
+void pagevec_remove_exceptionals(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
unsigned pagevec_lookup_tag(struct pagevec *pvec,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 9d55438bc4ad..4d1771c2d29f 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -51,6 +51,7 @@ extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern int shmem_zero_setup(struct vm_area_struct *);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
+extern bool shmem_mapping(struct address_space *mapping);
extern void shmem_unlock_mapping(struct address_space *mapping);
extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1f7e89d1c1ac..a194179303e5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -446,6 +446,29 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
}
EXPORT_SYMBOL_GPL(replace_page_cache_page);

+static int page_cache_tree_insert(struct address_space *mapping,
+ struct page *page)
+{
+ void **slot;
+ int error;
+
+ slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+ if (slot) {
+ void *p;
+
+ p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+ if (!radix_tree_exceptional_entry(p))
+ return -EEXIST;
+ radix_tree_replace_slot(slot, page);
+ mapping->nrpages++;
+ return 0;
+ }
+ error = radix_tree_insert(&mapping->page_tree, page->index, page);
+ if (!error)
+ mapping->nrpages++;
+ return error;
+}
+
/**
* add_to_page_cache_locked - add a locked page to the pagecache
* @page: page to add
@@ -480,11 +503,10 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
page->index = offset;

spin_lock_irq(&mapping->tree_lock);
- error = radix_tree_insert(&mapping->page_tree, offset, page);
+ error = page_cache_tree_insert(mapping, page);
radix_tree_preload_end();
if (unlikely(error))
goto err_insert;
- mapping->nrpages++;
__inc_zone_page_state(page, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
trace_mm_filemap_add_to_page_cache(page);
@@ -712,7 +734,10 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
unsigned long i;

for (i = 0; i < max_scan; i++) {
- if (!radix_tree_lookup(&mapping->page_tree, index))
+ struct page *page;
+
+ page = radix_tree_lookup(&mapping->page_tree, index);
+ if (!page || radix_tree_exceptional_entry(page))
break;
index++;
if (index == 0)
@@ -750,7 +775,10 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
unsigned long i;

for (i = 0; i < max_scan; i++) {
- if (!radix_tree_lookup(&mapping->page_tree, index))
+ struct page *page;
+
+ page = radix_tree_lookup(&mapping->page_tree, index);
+ if (!page || radix_tree_exceptional_entry(page))
break;
index--;
if (index == ULONG_MAX)
@@ -762,14 +790,19 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
EXPORT_SYMBOL(page_cache_prev_hole);

/**
- * find_get_page - find and get a page reference
+ * __find_get_page - find and get a page reference
* @mapping: the address_space to search
* @offset: the page index
*
- * Is there a pagecache struct page at the given (mapping, offset) tuple?
- * If yes, increment its refcount and return it; if no, return NULL.
+ * Looks up the page cache slot at @mapping & @offset. If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * If the slot holds a shadow entry of a previously evicted page, it
+ * is returned.
+ *
+ * Otherwise, %NULL is returned.
*/
-struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_get_page(struct address_space *mapping, pgoff_t offset)
{
void **pagep;
struct page *page;
@@ -810,24 +843,49 @@ out:

return page;
}
+EXPORT_SYMBOL(__find_get_page);
+
+/**
+ * find_get_page - find and get a page reference
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset. If there is a
+ * page cache page, it is returned with an increased refcount.
+ *
+ * Otherwise, %NULL is returned.
+ */
+struct page *find_get_page(struct address_space *mapping, pgoff_t offset)
+{
+ struct page *page = __find_get_page(mapping, offset);
+
+ if (radix_tree_exceptional_entry(page))
+ page = NULL;
+ return page;
+}
EXPORT_SYMBOL(find_get_page);

/**
- * find_lock_page - locate, pin and lock a pagecache page
+ * __find_lock_page - locate, pin and lock a pagecache page
* @mapping: the address_space to search
* @offset: the page index
*
- * Locates the desired pagecache page, locks it, increments its reference
- * count and returns its address.
+ * Looks up the page cache slot at @mapping & @offset. If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * If the slot holds a shadow entry of a previously evicted page, it
+ * is returned.
+ *
+ * Otherwise, %NULL is returned.
*
- * Returns zero if the page was not present. find_lock_page() may sleep.
+ * __find_lock_page() may sleep.
*/
-struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+struct page *__find_lock_page(struct address_space *mapping, pgoff_t offset)
{
struct page *page;
-
repeat:
- page = find_get_page(mapping, offset);
+ page = __find_get_page(mapping, offset);
if (page && !radix_tree_exception(page)) {
lock_page(page);
/* Has the page been truncated? */
@@ -840,6 +898,29 @@ repeat:
}
return page;
}
+EXPORT_SYMBOL(__find_lock_page);
+
+/**
+ * find_lock_page - locate, pin and lock a pagecache page
+ * @mapping: the address_space to search
+ * @offset: the page index
+ *
+ * Looks up the page cache slot at @mapping & @offset. If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
+ *
+ * Otherwise, %NULL is returned.
+ *
+ * find_lock_page() may sleep.
+ */
+struct page *find_lock_page(struct address_space *mapping, pgoff_t offset)
+{
+ struct page *page = __find_lock_page(mapping, offset);
+
+ if (radix_tree_exceptional_entry(page))
+ page = NULL;
+ return page;
+}
EXPORT_SYMBOL(find_lock_page);

/**
@@ -848,16 +929,18 @@ EXPORT_SYMBOL(find_lock_page);
* @index: the page's index into the mapping
* @gfp_mask: page allocation mode
*
- * Locates a page in the pagecache. If the page is not present, a new page
- * is allocated using @gfp_mask and is added to the pagecache and to the VM's
- * LRU list. The returned page is locked and has its reference count
- * incremented.
+ * Looks up the page cache slot at @mapping & @offset. If there is a
+ * page cache page, it is returned locked and with an increased
+ * refcount.
*
- * find_or_create_page() may sleep, even if @gfp_flags specifies an atomic
- * allocation!
+ * If the page is not present, a new page is allocated using @gfp_mask
+ * and added to the page cache and the VM's LRU list. The page is
+ * returned locked and with an increased refcount.
*
- * find_or_create_page() returns the desired page's address, or zero on
- * memory exhaustion.
+ * On memory exhaustion, %NULL is returned.
+ *
+ * find_or_create_page() may sleep, even if @gfp_flags specifies an
+ * atomic allocation!
*/
struct page *find_or_create_page(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask)
@@ -890,6 +973,74 @@ repeat:
EXPORT_SYMBOL(find_or_create_page);

/**
+ * __find_get_pages - gang pagecache lookup
+ * @mapping: The address_space to search
+ * @start: The starting page index
+ * @nr_pages: The maximum number of pages
+ * @pages: Where the resulting entries are placed
+ * @indices: The cache indices corresponding to the entries in @pages
+ *
+ * __find_get_pages() will search for and return a group of up to
+ * @nr_pages pages in the mapping. The pages are placed at @pages.
+ * __find_get_pages() takes a reference against the returned pages.
+ *
+ * The search returns a group of mapping-contiguous pages with ascending
+ * indexes. There may be holes in the indices due to not-present pages.
+ *
+ * Any shadow entries of evicted pages are included in the returned
+ * array.
+ *
+ * __find_get_pages() returns the number of pages and shadow entries
+ * which were found.
+ */
+unsigned __find_get_pages(struct address_space *mapping,
+ pgoff_t start, unsigned int nr_pages,
+ struct page **pages, pgoff_t *indices)
+{
+ void **slot;
+ unsigned int ret = 0;
+ struct radix_tree_iter iter;
+
+ if (!nr_pages)
+ return 0;
+
+ rcu_read_lock();
+restart:
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot(slot);
+ if (unlikely(!page))
+ continue;
+ if (radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page))
+ goto restart;
+ /*
+ * Otherwise, we must be storing a swap entry
+ * here as an exceptional entry: so return it
+ * without attempting to raise page count.
+ */
+ goto export;
+ }
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *slot)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+export:
+ indices[ret] = iter.index;
+ pages[ret] = page;
+ if (++ret == nr_pages)
+ break;
+ }
+ rcu_read_unlock();
+ return ret;
+}
+
+/**
* find_get_pages - gang pagecache lookup
* @mapping: The address_space to search
* @start: The starting page index
diff --git a/mm/mincore.c b/mm/mincore.c
index 101623378fbf..df52b572e8b4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -70,13 +70,21 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
* any other file mapping (ie. marked !present and faulted in with
* tmpfs's .fault). So swapped out tmpfs mappings are tested here.
*/
- page = find_get_page(mapping, pgoff);
#ifdef CONFIG_SWAP
- /* shmem/tmpfs may return swap: account for swapcache page too. */
- if (radix_tree_exceptional_entry(page)) {
- swp_entry_t swap = radix_to_swp_entry(page);
- page = find_get_page(swap_address_space(swap), swap.val);
- }
+ if (shmem_mapping(mapping)) {
+ page = __find_get_page(mapping, pgoff);
+ /*
+ * shmem/tmpfs may return swap: account for swapcache
+ * page too.
+ */
+ if (radix_tree_exceptional_entry(page)) {
+ swp_entry_t swp = radix_to_swp_entry(page);
+ page = find_get_page(swap_address_space(swp), swp.val);
+ }
+ } else
+ page = find_get_page(mapping, pgoff);
+#else
+ page = find_get_page(mapping, pgoff);
#endif
if (page) {
present = PageUptodate(page);
diff --git a/mm/readahead.c b/mm/readahead.c
index c62d85ace0cc..62c500a088a7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -179,7 +179,7 @@ __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
rcu_read_lock();
page = radix_tree_lookup(&mapping->page_tree, page_offset);
rcu_read_unlock();
- if (page)
+ if (page && !radix_tree_exceptional_entry(page))
continue;

page = page_cache_alloc_readahead(mapping);
diff --git a/mm/shmem.c b/mm/shmem.c
index e470997010cd..e5fe262bb834 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -329,56 +329,6 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
}

/*
- * Like find_get_pages, but collecting swap entries as well as pages.
- */
-static unsigned shmem_find_get_pages_and_swap(struct address_space *mapping,
- pgoff_t start, unsigned int nr_pages,
- struct page **pages, pgoff_t *indices)
-{
- void **slot;
- unsigned int ret = 0;
- struct radix_tree_iter iter;
-
- if (!nr_pages)
- return 0;
-
- rcu_read_lock();
-restart:
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
- struct page *page;
-repeat:
- page = radix_tree_deref_slot(slot);
- if (unlikely(!page))
- continue;
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page))
- goto restart;
- /*
- * Otherwise, we must be storing a swap entry
- * here as an exceptional entry: so return it
- * without attempting to raise page count.
- */
- goto export;
- }
- if (!page_cache_get_speculative(page))
- goto repeat;
-
- /* Has the page moved? */
- if (unlikely(page != *slot)) {
- page_cache_release(page);
- goto repeat;
- }
-export:
- indices[ret] = iter.index;
- pages[ret] = page;
- if (++ret == nr_pages)
- break;
- }
- rcu_read_unlock();
- return ret;
-}
-
-/*
* Remove swap entry from radix tree, free the swap and its page cache.
*/
static int shmem_free_swap(struct address_space *mapping,
@@ -396,21 +346,6 @@ static int shmem_free_swap(struct address_space *mapping,
}

/*
- * Pagevec may contain swap entries, so shuffle up pages before releasing.
- */
-static void shmem_deswap_pagevec(struct pagevec *pvec)
-{
- int i, j;
-
- for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- if (!radix_tree_exceptional_entry(page))
- pvec->pages[j++] = page;
- }
- pvec->nr = j;
-}
-
-/*
* SysV IPC SHM_UNLOCK restore Unevictable pages to their evictable lists.
*/
void shmem_unlock_mapping(struct address_space *mapping)
@@ -428,12 +363,12 @@ void shmem_unlock_mapping(struct address_space *mapping)
* Avoid pagevec_lookup(): find_get_pages() returns 0 as if it
* has finished, if it hits a row of PAGEVEC_SIZE swap entries.
*/
- pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+ pvec.nr = __find_get_pages(mapping, index,
PAGEVEC_SIZE, pvec.pages, indices);
if (!pvec.nr)
break;
index = indices[pvec.nr - 1] + 1;
- shmem_deswap_pagevec(&pvec);
+ pagevec_remove_exceptionals(&pvec);
check_move_unevictable_pages(pvec.pages, pvec.nr);
pagevec_release(&pvec);
cond_resched();
@@ -465,9 +400,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
pagevec_init(&pvec, 0);
index = start;
while (index < end) {
- pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE),
- pvec.pages, indices);
+ pvec.nr = __find_get_pages(mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE),
+ pvec.pages, indices);
if (!pvec.nr)
break;
mem_cgroup_uncharge_start();
@@ -496,7 +431,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
unlock_page(page);
}
- shmem_deswap_pagevec(&pvec);
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
cond_resched();
@@ -534,9 +469,10 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
index = start;
for ( ; ; ) {
cond_resched();
- pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+
+ pvec.nr = __find_get_pages(mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
- pvec.pages, indices);
+ pvec.pages, indices);
if (!pvec.nr) {
if (index == start || unfalloc)
break;
@@ -544,7 +480,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
continue;
}
if ((index == start || unfalloc) && indices[0] >= end) {
- shmem_deswap_pagevec(&pvec);
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
break;
}
@@ -573,7 +509,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
}
unlock_page(page);
}
- shmem_deswap_pagevec(&pvec);
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
index++;
@@ -1079,7 +1015,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
return -EFBIG;
repeat:
swap.val = 0;
- page = find_lock_page(mapping, index);
+ page = __find_lock_page(mapping, index);
if (radix_tree_exceptional_entry(page)) {
swap = radix_to_swp_entry(page);
page = NULL;
@@ -1416,6 +1352,11 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
return inode;
}

+bool shmem_mapping(struct address_space *mapping)
+{
+ return mapping->backing_dev_info == &shmem_backing_dev_info;
+}
+
#ifdef CONFIG_TMPFS
static const struct inode_operations shmem_symlink_inode_operations;
static const struct inode_operations shmem_short_symlink_operations;
@@ -1728,7 +1669,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
pagevec_init(&pvec, 0);
pvec.nr = 1; /* start small: we may be there already */
while (!done) {
- pvec.nr = shmem_find_get_pages_and_swap(mapping, index,
+ pvec.nr = __find_get_pages(mapping, index,
pvec.nr, pvec.pages, indices);
if (!pvec.nr) {
if (whence == SEEK_DATA)
@@ -1755,7 +1696,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
break;
}
}
- shmem_deswap_pagevec(&pvec);
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
pvec.nr = PAGEVEC_SIZE;
cond_resched();
diff --git a/mm/swap.c b/mm/swap.c
index b31ba67d440a..20c267b52914 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -948,6 +948,54 @@ void __pagevec_lru_add(struct pagevec *pvec)
EXPORT_SYMBOL(__pagevec_lru_add);

/**
+ * __pagevec_lookup - gang pagecache lookup
+ * @pvec: Where the resulting entries are placed
+ * @mapping: The address_space to search
+ * @start: The starting entry index
+ * @nr_pages: The maximum number of entries
+ * @indices: The cache indices corresponding to the entries in @pvec
+ *
+ * __pagevec_lookup() will search for and return a group of up to
+ * @nr_pages pages and shadow entries in the mapping. All entries are
+ * placed in @pvec. __pagevec_lookup() takes a reference against
+ * actual pages in @pvec.
+ *
+ * The search returns a group of mapping-contiguous entries with
+ * ascending indexes. There may be holes in the indices due to
+ * not-present entries.
+ *
+ * __pagevec_lookup() returns the number of entries which were found.
+ */
+unsigned __pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
+ pgoff_t start, unsigned nr_pages, pgoff_t *indices)
+{
+ pvec->nr = __find_get_pages(mapping, start, nr_pages,
+ pvec->pages, indices);
+ return pagevec_count(pvec);
+}
+
+/**
+ * pagevec_remove_exceptionals - pagevec exceptionals pruning
+ * @pvec: The pagevec to prune
+ *
+ * __pagevec_lookup() fills both pages and exceptional radix tree
+ * entries into the pagevec. This function prunes all exceptionals
+ * from @pvec without leaving holes, so that it can be passed on to
+ * page-only pagevec operations.
+ */
+void pagevec_remove_exceptionals(struct pagevec *pvec)
+{
+ int i, j;
+
+ for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ if (!radix_tree_exceptional_entry(page))
+ pvec->pages[j++] = page;
+ }
+ pvec->nr = j;
+}
+
+/**
* pagevec_lookup - gang pagecache lookup
* @pvec: Where the resulting pages are placed
* @mapping: The address_space to search
diff --git a/mm/truncate.c b/mm/truncate.c
index 353b683afd6e..b0f4d4bee8ab 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -22,6 +22,22 @@
#include <linux/cleancache.h>
#include "internal.h"

+static void clear_exceptional_entry(struct address_space *mapping,
+ pgoff_t index, void *entry)
+{
+ /* Handled by shmem itself */
+ if (shmem_mapping(mapping))
+ return;
+
+ spin_lock_irq(&mapping->tree_lock);
+ /*
+ * Regular page slots are stabilized by the page lock even
+ * without the tree itself locked. These unlocked entries
+ * need verification under the tree lock.
+ */
+ radix_tree_delete_item(&mapping->page_tree, index, entry);
+ spin_unlock_irq(&mapping->tree_lock);
+}

/**
* do_invalidatepage - invalidate part or all of a page
@@ -208,6 +224,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
unsigned int partial_start; /* inclusive */
unsigned int partial_end; /* exclusive */
struct pagevec pvec;
+ pgoff_t indices[PAGEVEC_SIZE];
pgoff_t index;
int i;

@@ -238,17 +255,23 @@ void truncate_inode_pages_range(struct address_space *mapping,

pagevec_init(&pvec, 0);
index = start;
- while (index < end && pagevec_lookup(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+ while (index < end && __pagevec_lookup(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE),
+ indices)) {
mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

/* We rely upon deletion not changing page->index */
- index = page->index;
+ index = indices[i];
if (index >= end)
break;

+ if (radix_tree_exceptional_entry(page)) {
+ clear_exceptional_entry(mapping, index, page);
+ continue;
+ }
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -259,6 +282,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
truncate_inode_page(mapping, page);
unlock_page(page);
}
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
cond_resched();
@@ -307,14 +331,15 @@ void truncate_inode_pages_range(struct address_space *mapping,
index = start;
for ( ; ; ) {
cond_resched();
- if (!pagevec_lookup(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE))) {
+ if (!__pagevec_lookup(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE),
+ indices)) {
if (index == start)
break;
index = start;
continue;
}
- if (index == start && pvec.pages[0]->index >= end) {
+ if (index == start && indices[0] >= end) {
pagevec_release(&pvec);
break;
}
@@ -323,16 +348,22 @@ void truncate_inode_pages_range(struct address_space *mapping,
struct page *page = pvec.pages[i];

/* We rely upon deletion not changing page->index */
- index = page->index;
+ index = indices[i];
if (index >= end)
break;

+ if (radix_tree_exceptional_entry(page)) {
+ clear_exceptional_entry(mapping, index, page);
+ continue;
+ }
+
lock_page(page);
WARN_ON(page->index != index);
wait_on_page_writeback(page);
truncate_inode_page(mapping, page);
unlock_page(page);
}
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
index++;
@@ -375,6 +406,7 @@ EXPORT_SYMBOL(truncate_inode_pages);
unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
+ pgoff_t indices[PAGEVEC_SIZE];
struct pagevec pvec;
pgoff_t index = start;
unsigned long ret;
@@ -390,17 +422,23 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
*/

pagevec_init(&pvec, 0);
- while (index <= end && pagevec_lookup(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+ indices)) {
mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

/* We rely upon deletion not changing page->index */
- index = page->index;
+ index = indices[i];
if (index > end)
break;

+ if (radix_tree_exceptional_entry(page)) {
+ clear_exceptional_entry(mapping, index, page);
+ continue;
+ }
+
if (!trylock_page(page))
continue;
WARN_ON(page->index != index);
@@ -414,6 +452,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
deactivate_page(page);
count += ret;
}
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
cond_resched();
@@ -481,6 +520,7 @@ static int do_launder_page(struct address_space *mapping, struct page *page)
int invalidate_inode_pages2_range(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
+ pgoff_t indices[PAGEVEC_SIZE];
struct pagevec pvec;
pgoff_t index;
int i;
@@ -491,17 +531,23 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
cleancache_invalidate_inode(mapping);
pagevec_init(&pvec, 0);
index = start;
- while (index <= end && pagevec_lookup(&pvec, mapping, index,
- min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1)) {
+ while (index <= end && __pagevec_lookup(&pvec, mapping, index,
+ min(end - index, (pgoff_t)PAGEVEC_SIZE - 1) + 1,
+ indices)) {
mem_cgroup_uncharge_start();
for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i];

/* We rely upon deletion not changing page->index */
- index = page->index;
+ index = indices[i];
if (index > end)
break;

+ if (radix_tree_exceptional_entry(page)) {
+ clear_exceptional_entry(mapping, index, page);
+ continue;
+ }
+
lock_page(page);
WARN_ON(page->index != index);
if (page->mapping != mapping) {
@@ -539,6 +585,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
ret = ret2;
unlock_page(page);
}
+ pagevec_remove_exceptionals(&pvec);
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
cond_resched();
--
1.8.5.3

2014-02-04 01:00:01

by Johannes Weiner

Subject: [patch 07/10] mm + fs: store shadow entries in page cache

Reclaim will be leaving shadow entries in the page cache radix tree
upon evicting the real page. Because reclaim finds those pages
through the LRU rather than through their inode, a concurrent iput()
can free the inode while this happens. At that point, reclaim must
no longer install shadow entries, because the inode freeing code
needs to ensure the page tree is really empty.

Add an address_space flag, AS_EXITING, that the inode freeing code
sets under the tree lock before doing the final truncate. Reclaim
checks this flag before installing a shadow entry.
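
To illustrate the intended interaction, here is a minimal sketch. It
is not part of the patch: only mapping_exiting(), workingset_eviction()
and the two-argument __delete_from_page_cache() are interfaces from
this series; the surrounding function is a hypothetical stand-in for
the reclaim-side caller, and where exactly the check lands in the
series is not shown in this patch.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/spinlock.h>

/* Hypothetical stand-in for the reclaim-side page removal path */
static void sketch_evict_page(struct address_space *mapping,
                              struct page *page)
{
        void *shadow = NULL;

        spin_lock_irq(&mapping->tree_lock);
        /* Do not install eviction information once AS_EXITING is set */
        if (!mapping_exiting(mapping))
                shadow = workingset_eviction(mapping, page);
        __delete_from_page_cache(page, shadow);
        spin_unlock_irq(&mapping->tree_lock);
}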

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
Documentation/filesystems/porting | 6 +--
drivers/staging/lustre/lustre/llite/llite_lib.c | 2 +-
fs/9p/vfs_inode.c | 2 +-
fs/affs/inode.c | 2 +-
fs/afs/inode.c | 2 +-
fs/bfs/inode.c | 2 +-
fs/block_dev.c | 4 +-
fs/btrfs/inode.c | 2 +-
fs/cifs/cifsfs.c | 2 +-
fs/coda/inode.c | 2 +-
fs/ecryptfs/super.c | 2 +-
fs/exofs/inode.c | 2 +-
fs/ext2/inode.c | 2 +-
fs/ext3/inode.c | 2 +-
fs/ext4/inode.c | 4 +-
fs/f2fs/inode.c | 2 +-
fs/fat/inode.c | 2 +-
fs/freevxfs/vxfs_inode.c | 2 +-
fs/fuse/inode.c | 2 +-
fs/gfs2/super.c | 2 +-
fs/hfs/inode.c | 2 +-
fs/hfsplus/super.c | 2 +-
fs/hostfs/hostfs_kern.c | 2 +-
fs/hpfs/inode.c | 2 +-
fs/inode.c | 4 +-
fs/jffs2/fs.c | 2 +-
fs/jfs/inode.c | 4 +-
fs/kernfs/inode.c | 2 +-
fs/logfs/readwrite.c | 2 +-
fs/minix/inode.c | 2 +-
fs/ncpfs/inode.c | 2 +-
fs/nfs/inode.c | 2 +-
fs/nfs/nfs4super.c | 2 +-
fs/nilfs2/inode.c | 6 +--
fs/ntfs/inode.c | 2 +-
fs/ocfs2/inode.c | 4 +-
fs/omfs/inode.c | 2 +-
fs/proc/inode.c | 2 +-
fs/reiserfs/inode.c | 2 +-
fs/sysv/inode.c | 2 +-
fs/ubifs/super.c | 2 +-
fs/udf/inode.c | 4 +-
fs/ufs/inode.c | 2 +-
fs/xfs/xfs_super.c | 2 +-
include/linux/fs.h | 1 +
include/linux/mm.h | 1 +
include/linux/pagemap.h | 13 +++++-
mm/filemap.c | 33 ++++++++++++---
mm/truncate.c | 54 +++++++++++++++++++++++--
mm/vmscan.c | 2 +-
50 files changed, 147 insertions(+), 65 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index fe2b7ae6f962..0f3a1390bf00 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -295,9 +295,9 @@ in the beginning of ->setattr unconditionally.
->clear_inode() and ->delete_inode() are gone; ->evict_inode() should
be used instead. It gets called whenever the inode is evicted, whether it has
remaining links or not. Caller does *not* evict the pagecache or inode-associated
-metadata buffers; getting rid of those is responsibility of method, as it had
-been for ->delete_inode(). Caller makes sure async writeback cannot be running
-for the inode while (or after) ->evict_inode() is called.
+metadata buffers; the method has to use truncate_inode_pages_final() to get rid
+of those. Caller makes sure async writeback cannot be running for the inode while
+(or after) ->evict_inode() is called.

->drop_inode() returns int now; it's called on final iput() with
inode->i_lock held and it returns true if filesystems wants the inode to be
diff --git a/drivers/staging/lustre/lustre/llite/llite_lib.c b/drivers/staging/lustre/lustre/llite/llite_lib.c
index 6cfdb9e4b74b..fc6aac3cfe00 100644
--- a/drivers/staging/lustre/lustre/llite/llite_lib.c
+++ b/drivers/staging/lustre/lustre/llite/llite_lib.c
@@ -1877,7 +1877,7 @@ void ll_delete_inode(struct inode *inode)
cl_sync_file_range(inode, 0, OBD_OBJECT_EOF,
CL_FSYNC_DISCARD, 1);

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

/* Workaround for LU-118 */
if (inode->i_data.nrpages) {
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index bb7991c7e5c7..53161ec058a7 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -451,7 +451,7 @@ void v9fs_evict_inode(struct inode *inode)
{
struct v9fs_inode *v9inode = V9FS_I(inode);

- truncate_inode_pages(inode->i_mapping, 0);
+ truncate_inode_pages_final(inode->i_mapping);
clear_inode(inode);
filemap_fdatawrite(inode->i_mapping);

diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 0e092d08680e..96df91e8c334 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -259,7 +259,7 @@ affs_evict_inode(struct inode *inode)
{
unsigned long cache_page;
pr_debug("AFFS: evict_inode(ino=%lu, nlink=%u)\n", inode->i_ino, inode->i_nlink);
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

if (!inode->i_nlink) {
inode->i_size = 0;
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index ce25d755b7aa..294671288449 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -422,7 +422,7 @@ void afs_evict_inode(struct inode *inode)

ASSERTCMP(inode->i_ino, ==, vnode->fid.vnode);

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);

afs_give_up_callback(vnode);
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index 8defc6b3f9a2..29aa5cf6639b 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -172,7 +172,7 @@ static void bfs_evict_inode(struct inode *inode)

dprintf("ino=%08lx\n", ino);

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
invalidate_inode_buffers(inode);
clear_inode(inode);

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1e86823a9cbd..c7a7def27b07 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -83,7 +83,7 @@ void kill_bdev(struct block_device *bdev)
{
struct address_space *mapping = bdev->bd_inode->i_mapping;

- if (mapping->nrpages == 0)
+ if (mapping->nrpages == 0 && mapping->nrshadows == 0)
return;

invalidate_bh_lrus();
@@ -419,7 +419,7 @@ static void bdev_evict_inode(struct inode *inode)
{
struct block_device *bdev = &BDEV_I(inode)->bdev;
struct list_head *p;
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
invalidate_inode_buffers(inode); /* is it needed here? */
clear_inode(inode);
spin_lock(&bdev_lock);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5c4ab9c18940..c73f67c8dbc7 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4593,7 +4593,7 @@ static void evict_inode_truncate_pages(struct inode *inode)
struct rb_node *node;

ASSERT(inode->i_state & I_FREEING);
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

write_lock(&map_tree->lock);
while (!RB_EMPTY_ROOT(&map_tree->map)) {
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 849f6132b327..d23ae08f7bca 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -286,7 +286,7 @@ cifs_destroy_inode(struct inode *inode)
static void
cifs_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
cifs_fscache_release_inode_cookie(inode);
}
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 506de34a4ef3..62618ec9356c 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -250,7 +250,7 @@ static void coda_put_super(struct super_block *sb)

static void coda_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
coda_cache_clear_inode(inode);
}
diff --git a/fs/ecryptfs/super.c b/fs/ecryptfs/super.c
index e879cf8ff0b1..afa1b81c3418 100644
--- a/fs/ecryptfs/super.c
+++ b/fs/ecryptfs/super.c
@@ -132,7 +132,7 @@ static int ecryptfs_statfs(struct dentry *dentry, struct kstatfs *buf)
*/
static void ecryptfs_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
iput(ecryptfs_inode_to_lower(inode));
}
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index ee4317faccb1..d1c244d67667 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1486,7 +1486,7 @@ void exofs_evict_inode(struct inode *inode)
struct ore_io_state *ios;
int ret;

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

/* TODO: should do better here */
if (inode->i_nlink || is_bad_inode(inode))
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 94ed36849b71..b1d2a4675d42 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -78,7 +78,7 @@ void ext2_evict_inode(struct inode * inode)
dquot_drop(inode);
}

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

if (want_delete) {
sb_start_intwrite(inode->i_sb);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 384b6ebb655f..efce2bbfb5e5 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -228,7 +228,7 @@ void ext3_evict_inode (struct inode *inode)
log_wait_commit(journal, commit_tid);
filemap_write_and_wait(&inode->i_data);
}
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

ext3_discard_reservation(inode);
rsv = ei->i_block_alloc_info;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e39895a91b8..7e83b4a1ae00 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -214,7 +214,7 @@ void ext4_evict_inode(struct inode *inode)
jbd2_complete_transaction(journal, commit_tid);
filemap_write_and_wait(&inode->i_data);
}
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
goto no_delete;
@@ -225,7 +225,7 @@ void ext4_evict_inode(struct inode *inode)

if (ext4_should_order_data(inode))
ext4_begin_ordered_truncate(inode, 0);
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));
if (is_bad_inode(inode))
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 4d67ed736dca..28cea76d78c6 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -260,7 +260,7 @@ void f2fs_evict_inode(struct inode *inode)
struct f2fs_sb_info *sbi = F2FS_SB(inode->i_sb);

trace_f2fs_evict_inode(inode);
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

if (inode->i_ino == F2FS_NODE_INO(sbi) ||
inode->i_ino == F2FS_META_INO(sbi))
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 854b578f6695..c68d9f27135e 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -490,7 +490,7 @@ EXPORT_SYMBOL_GPL(fat_build_inode);

static void fat_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
if (!inode->i_nlink) {
inode->i_size = 0;
fat_truncate_blocks(inode, 0);
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index f47df72cef17..363e3ae25f6b 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -354,7 +354,7 @@ static void vxfs_i_callback(struct rcu_head *head)
void
vxfs_evict_inode(struct inode *ip)
{
- truncate_inode_pages(&ip->i_data, 0);
+ truncate_inode_pages_final(&ip->i_data);
clear_inode(ip);
call_rcu(&ip->i_rcu, vxfs_i_callback);
}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index d468643a68b2..9c761b611c54 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -123,7 +123,7 @@ static void fuse_destroy_inode(struct inode *inode)

static void fuse_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
if (inode->i_sb->s_flags & MS_ACTIVE) {
struct fuse_conn *fc = get_fuse_conn(inode);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 60f60f6181f3..24410cd9a82a 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1558,7 +1558,7 @@ out_unlock:
fs_warn(sdp, "gfs2_evict_inode: %d\n", error);
out:
/* Case 3 starts here */
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
gfs2_rs_delete(ip, NULL);
gfs2_ordered_del_inode(ip);
clear_inode(inode);
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 380ab31b5e0f..9e2fecd62f62 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -547,7 +547,7 @@ out:

void hfs_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
if (HFS_IS_RSRC(inode) && HFS_I(inode)->rsrc_inode) {
HFS_I(HFS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 80875aa640ef..a6abf87d79d0 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -161,7 +161,7 @@ static int hfsplus_write_inode(struct inode *inode,
static void hfsplus_evict_inode(struct inode *inode)
{
hfs_dbg(INODE, "hfsplus_evict_inode: %lu\n", inode->i_ino);
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
if (HFSPLUS_IS_RSRC(inode)) {
HFSPLUS_I(HFSPLUS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index fe649d325b1f..9c470fde9878 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -230,7 +230,7 @@ static struct inode *hostfs_alloc_inode(struct super_block *sb)

static void hostfs_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
if (HOSTFS_I(inode)->fd != -1) {
close_file(&HOSTFS_I(inode)->fd);
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 9edeeb0ea97e..50a427313835 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -304,7 +304,7 @@ void hpfs_write_if_changed(struct inode *inode)

void hpfs_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
if (!inode->i_nlink) {
hpfs_lock(inode->i_sb);
diff --git a/fs/inode.c b/fs/inode.c
index 4bcdad3c9361..e6905152c39f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -503,6 +503,7 @@ void clear_inode(struct inode *inode)
*/
spin_lock_irq(&inode->i_data.tree_lock);
BUG_ON(inode->i_data.nrpages);
+ BUG_ON(inode->i_data.nrshadows);
spin_unlock_irq(&inode->i_data.tree_lock);
BUG_ON(!list_empty(&inode->i_data.private_list));
BUG_ON(!(inode->i_state & I_FREEING));
@@ -548,8 +549,7 @@ static void evict(struct inode *inode)
if (op->evict_inode) {
op->evict_inode(inode);
} else {
- if (inode->i_data.nrpages)
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
}
if (S_ISBLK(inode->i_mode) && inode->i_bdev)
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index a69e426435dd..a012e16a8bb3 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -242,7 +242,7 @@ void jffs2_evict_inode (struct inode *inode)

jffs2_dbg(1, "%s(): ino #%lu mode %o\n",
__func__, inode->i_ino, inode->i_mode);
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
jffs2_do_clear_inode(c, f);
}
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index f4aab719add5..6f8fe72c2a7a 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -154,7 +154,7 @@ void jfs_evict_inode(struct inode *inode)
dquot_initialize(inode);

if (JFS_IP(inode)->fileset == FILESYSTEM_I) {
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

if (test_cflag(COMMIT_Freewmap, inode))
jfs_free_zero_link(inode);
@@ -168,7 +168,7 @@ void jfs_evict_inode(struct inode *inode)
dquot_free_inode(inode);
}
} else {
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
}
clear_inode(inode);
dquot_drop(inode);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index e55126f85bd2..abb0f1f53d93 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -355,7 +355,7 @@ void kernfs_evict_inode(struct inode *inode)
{
struct kernfs_node *kn = inode->i_private;

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
kernfs_put(kn);
}
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index 9a59cbade2fb..48140315f627 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -2180,7 +2180,7 @@ void logfs_evict_inode(struct inode *inode)
do_delete_inode(inode);
}
}
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);

/* Cheaper version of write_inode. All changes are concealed in
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index 0332109162a5..03aaeb1a694a 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -26,7 +26,7 @@ static int minix_remount (struct super_block * sb, int * flags, char * data);

static void minix_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
if (!inode->i_nlink) {
inode->i_size = 0;
minix_truncate(inode);
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index 2cf2ebecb55f..ee59d35ff069 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -296,7 +296,7 @@ ncp_iget(struct super_block *sb, struct ncp_entry_info *info)
static void
ncp_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);

if (S_ISDIR(inode->i_mode)) {
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 28a0a3cbd3b7..a2494b616951 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -128,7 +128,7 @@ EXPORT_SYMBOL_GPL(nfs_clear_inode);

void nfs_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
nfs_clear_inode(inode);
}
diff --git a/fs/nfs/nfs4super.c b/fs/nfs/nfs4super.c
index 808f29574412..6f340f02f2ba 100644
--- a/fs/nfs/nfs4super.c
+++ b/fs/nfs/nfs4super.c
@@ -90,7 +90,7 @@ static int nfs4_write_inode(struct inode *inode, struct writeback_control *wbc)
*/
static void nfs4_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
pnfs_return_layout(inode);
pnfs_destroy_layout(NFS_I(inode));
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 7e350c562e0e..b9c5726120e3 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -783,16 +783,14 @@ void nilfs_evict_inode(struct inode *inode)
int ret;

if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
- if (inode->i_data.nrpages)
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
nilfs_clear_inode(inode);
return;
}
nilfs_transaction_begin(sb, &ti, 0); /* never fails */

- if (inode->i_data.nrpages)
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

/* TODO: some of the following operations may fail. */
nilfs_truncate_bmap(ii, 0);
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index ffb9b3675736..9d8153ebacfb 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -2259,7 +2259,7 @@ void ntfs_evict_big_inode(struct inode *vi)
{
ntfs_inode *ni = NTFS_I(vi);

- truncate_inode_pages(&vi->i_data, 0);
+ truncate_inode_pages_final(&vi->i_data);
clear_inode(vi);

#ifdef NTFS_RW
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index f29a90fde619..a11bfffbbc29 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -941,7 +941,7 @@ static void ocfs2_cleanup_delete_inode(struct inode *inode,
(unsigned long long)OCFS2_I(inode)->ip_blkno, sync_data);
if (sync_data)
filemap_write_and_wait(inode->i_mapping);
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
}

static void ocfs2_delete_inode(struct inode *inode)
@@ -1157,7 +1157,7 @@ void ocfs2_evict_inode(struct inode *inode)
(OCFS2_I(inode)->ip_flags & OCFS2_INODE_MAYBE_ORPHANED)) {
ocfs2_delete_inode(inode);
} else {
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
}
ocfs2_clear_inode(inode);
}
diff --git a/fs/omfs/inode.c b/fs/omfs/inode.c
index d8b0afde2179..ec58c7659183 100644
--- a/fs/omfs/inode.c
+++ b/fs/omfs/inode.c
@@ -183,7 +183,7 @@ int omfs_sync_inode(struct inode *inode)
*/
static void omfs_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);

if (inode->i_nlink)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 124fc43c7090..8f20e3404fd2 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -35,7 +35,7 @@ static void proc_evict_inode(struct inode *inode)
const struct proc_ns_operations *ns_ops;
void *ns;

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);

/* Stop tracking associated processes */
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index ad62bdbb451e..bc8b8009897d 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -35,7 +35,7 @@ void reiserfs_evict_inode(struct inode *inode)
if (!inode->i_nlink && !is_bad_inode(inode))
dquot_initialize(inode);

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
if (inode->i_nlink)
goto no_delete;

diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index c327d4ee1235..5625ca920f5e 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -295,7 +295,7 @@ int sysv_sync_inode(struct inode *inode)

static void sysv_evict_inode(struct inode *inode)
{
- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
if (!inode->i_nlink) {
inode->i_size = 0;
sysv_truncate(inode);
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 5ded8490c0c6..48f943f7f5d5 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -351,7 +351,7 @@ static void ubifs_evict_inode(struct inode *inode)
dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode);
ubifs_assert(!atomic_read(&inode->i_count));

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);

if (inode->i_nlink)
goto done;
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 062b7925bca0..af6f4c38d91a 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -146,8 +146,8 @@ void udf_evict_inode(struct inode *inode)
want_delete = 1;
udf_setsize(inode, 0);
udf_update_inode(inode, IS_SYNC(inode));
- } else
- truncate_inode_pages(&inode->i_data, 0);
+ }
+ truncate_inode_pages_final(&inode->i_data);
invalidate_inode_buffers(inode);
clear_inode(inode);
if (iinfo->i_alloc_type != ICBTAG_FLAG_AD_IN_ICB &&
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index c8ca96086784..61e8a9b021dd 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -885,7 +885,7 @@ void ufs_evict_inode(struct inode * inode)
if (!inode->i_nlink && !is_bad_inode(inode))
want_delete = 1;

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
if (want_delete) {
loff_t old_i_size;
/*UFS_I(inode)->i_dtime = CURRENT_TIME;*/
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f317488263dd..01ee44444885 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -996,7 +996,7 @@ xfs_fs_evict_inode(

trace_xfs_evict_inode(ip);

- truncate_inode_pages(&inode->i_data, 0);
+ truncate_inode_pages_final(&inode->i_data);
clear_inode(inode);
XFS_STATS_INC(vn_rele);
XFS_STATS_INC(vn_remove);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 09f553c59813..6491bd6c313f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -416,6 +416,7 @@ struct address_space {
struct mutex i_mmap_mutex; /* protect tree, count, list */
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
+ unsigned long nrshadows; /* number of shadow entries */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d684ac125482..ad26f7b49b1a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1815,6 +1815,7 @@ vm_unmapped_area(struct vm_unmapped_area_info *info)
extern void truncate_inode_pages(struct address_space *, loff_t);
extern void truncate_inode_pages_range(struct address_space *,
loff_t lstart, loff_t lend);
+extern void truncate_inode_pages_final(struct address_space *);

/* generic vm_area_ops exported for stackable file systems */
extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2eeca3c83b0f..e7729734156e 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+ AS_EXITING = __GFP_BITS_SHIFT + 5, /* final truncate in progress */
};

static inline void mapping_set_error(struct address_space *mapping, int error)
@@ -69,6 +70,16 @@ static inline int mapping_balloon(struct address_space *mapping)
return mapping && test_bit(AS_BALLOON_MAP, &mapping->flags);
}

+static inline void mapping_set_exiting(struct address_space *mapping)
+{
+ set_bit(AS_EXITING, &mapping->flags);
+}
+
+static inline int mapping_exiting(struct address_space *mapping)
+{
+ return test_bit(AS_EXITING, &mapping->flags);
+}
+
static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
{
return (__force gfp_t)mapping->flags & __GFP_BITS_MASK;
@@ -547,7 +558,7 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page);
+extern void __delete_from_page_cache(struct page *page, void *shadow);
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index a194179303e5..18f80d418f83 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -107,12 +107,33 @@
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/

+static void page_cache_tree_delete(struct address_space *mapping,
+ struct page *page, void *shadow)
+{
+ if (shadow) {
+ void **slot;
+
+ slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
+ radix_tree_replace_slot(slot, shadow);
+ mapping->nrshadows++;
+ /*
+ * Make sure the nrshadows update is committed before
+ * the nrpages update so that final truncate racing
+ * with reclaim does not see both counters 0 at the
+ * same time and miss a shadow entry.
+ */
+ smp_wmb();
+ } else
+ radix_tree_delete(&mapping->page_tree, page->index);
+ mapping->nrpages--;
+}
+
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
* is safe. The caller must hold the mapping's tree_lock.
*/
-void __delete_from_page_cache(struct page *page)
+void __delete_from_page_cache(struct page *page, void *shadow)
{
struct address_space *mapping = page->mapping;

@@ -127,10 +148,11 @@ void __delete_from_page_cache(struct page *page)
else
cleancache_invalidate_page(mapping, page);

- radix_tree_delete(&mapping->page_tree, page->index);
+ page_cache_tree_delete(mapping, page, shadow);
+
page->mapping = NULL;
/* Leave page->index set: truncation lookup relies upon it */
- mapping->nrpages--;
+
__dec_zone_page_state(page, NR_FILE_PAGES);
if (PageSwapBacked(page))
__dec_zone_page_state(page, NR_SHMEM);
@@ -166,7 +188,7 @@ void delete_from_page_cache(struct page *page)

freepage = mapping->a_ops->freepage;
spin_lock_irq(&mapping->tree_lock);
- __delete_from_page_cache(page);
+ __delete_from_page_cache(page, NULL);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);

@@ -426,7 +448,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
new->index = offset;

spin_lock_irq(&mapping->tree_lock);
- __delete_from_page_cache(old);
+ __delete_from_page_cache(old, NULL);
error = radix_tree_insert(&mapping->page_tree, offset, new);
BUG_ON(error);
mapping->nrpages++;
@@ -460,6 +482,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
if (!radix_tree_exceptional_entry(p))
return -EEXIST;
radix_tree_replace_slot(slot, page);
+ mapping->nrshadows--;
mapping->nrpages++;
return 0;
}
diff --git a/mm/truncate.c b/mm/truncate.c
index b0f4d4bee8ab..4ca425e5db58 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -35,7 +35,8 @@ static void clear_exceptional_entry(struct address_space *mapping,
* without the tree itself locked. These unlocked entries
* need verification under the tree lock.
*/
- radix_tree_delete_item(&mapping->page_tree, index, entry);
+ if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
+ mapping->nrshadows--;
spin_unlock_irq(&mapping->tree_lock);
}

@@ -229,7 +230,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
int i;

cleancache_invalidate_inode(mapping);
- if (mapping->nrpages == 0)
+ if (mapping->nrpages == 0 && mapping->nrshadows == 0)
return;

/* Offsets within partial pages */
@@ -391,6 +392,53 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
EXPORT_SYMBOL(truncate_inode_pages);

/**
+ * truncate_inode_pages_final - truncate *all* pages before inode dies
+ * @mapping: mapping to truncate
+ *
+ * Called under (and serialized by) inode->i_mutex.
+ *
+ * Filesystems have to use this in the .evict_inode path to inform the
+ * VM that this is the final truncate and the inode is going away.
+ */
+void truncate_inode_pages_final(struct address_space *mapping)
+{
+ unsigned long nrshadows;
+ unsigned long nrpages;
+
+ /*
+ * Page reclaim can not participate in regular inode lifetime
+ * management (can't call iput()) and thus can race with the
+ * inode teardown. Tell it when the address space is exiting,
+ * so that it does not install eviction information after the
+ * final truncate has begun.
+ */
+ mapping_set_exiting(mapping);
+
+ /*
+ * When reclaim installs eviction entries, it increases
+ * nrshadows first, then decreases nrpages. Make sure we see
+ * this in the right order or we might miss an entry.
+ */
+ nrpages = mapping->nrpages;
+ smp_rmb();
+ nrshadows = mapping->nrshadows;
+
+ if (nrpages || nrshadows) {
+ /*
+ * As truncation uses a lockless tree lookup, cycle
+ * the tree lock to make sure any ongoing tree
+ * modification that does not see AS_EXITING is
+ * completed before starting the final truncate.
+ */
+ spin_lock_irq(&mapping->tree_lock);
+ spin_unlock_irq(&mapping->tree_lock);
+
+ truncate_inode_pages(mapping, 0);
+ }
+}
+EXPORT_SYMBOL(truncate_inode_pages_final);
+
+/**
* invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
* @mapping: the address_space which holds the pages to invalidate
* @start: the offset 'from' which to invalidate
@@ -483,7 +531,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
goto failed;

BUG_ON(page_has_private(page));
- __delete_from_page_cache(page);
+ __delete_from_page_cache(page, NULL);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..63712938169b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -572,7 +572,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page)

freepage = mapping->a_ops->freepage;

- __delete_from_page_cache(page);
+ __delete_from_page_cache(page, NULL);
spin_unlock_irq(&mapping->tree_lock);
mem_cgroup_uncharge_cache_page(page);

--
1.8.5.3

2014-02-04 00:59:56

by Johannes Weiner

Subject: [patch 10/10] mm: keep page cache radix tree nodes in check

Previously, page cache radix tree nodes were freed after reclaim
emptied out their page pointers. But now reclaim stores shadow
entries in their place, which are only reclaimed when the inodes
themselves are reclaimed. This is problematic for bigger files that
are still in use after a significant amount of their cache has been
reclaimed, without any of those pages actually refaulting. The shadow
entries will just sit there and waste memory. In the worst case, the
shadow entries will accumulate until the machine runs out of memory.

To get this under control, the VM will track radix tree nodes
exclusively containing shadow entries on a per-NUMA node list.
Per-NUMA rather than global because we expect the radix tree nodes
themselves to be allocated node-locally and we want to reduce
cross-node references of otherwise independent cache workloads. A
simple shrinker will then reclaim these nodes on memory pressure.
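
As a rough illustration of the shrinker hookup (the actual
implementation lives in mm/workingset.c, which is not shown in this
excerpt; workingset_shadow_nodes is the list_lru declared below in
swap.h, while the callback names and their simplified bodies are
hypothetical):

#include <linux/list_lru.h>
#include <linux/shrinker.h>
#include <linux/swap.h>

static unsigned long sketch_count_shadow_nodes(struct shrinker *shrinker,
                                               struct shrink_control *sc)
{
        /* Per-node count, matching the per-NUMA-node LRU lists */
        return list_lru_count_node(&workingset_shadow_nodes, sc->nid);
}

static unsigned long sketch_scan_shadow_nodes(struct shrinker *shrinker,
                                              struct shrink_control *sc)
{
        /*
         * The real scan callback walks the list_lru, takes each
         * node's tree lock and deletes shadow-only nodes from their
         * trees; this simplified stub just declines to reclaim.
         */
        return SHRINK_STOP;
}

static struct shrinker sketch_shadow_shrinker = {
        .count_objects  = sketch_count_shadow_nodes,
        .scan_objects   = sketch_scan_shadow_nodes,
        .seeks          = DEFAULT_SEEKS,
        .flags          = SHRINKER_NUMA_AWARE,
};

Registering this with register_shrinker() would make the per-node
shadow node counts visible to reclaim.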

A few things need to be stored in the radix tree node to implement the
shadow node LRU and allow tree deletions coming from the list:

1. There is no index available that would describe the reverse path
from the node up to the tree root, which is needed to perform a
deletion. To solve this, encode in each node its offset inside the
parent. This can be stored in the unused upper bits of the same
member that stores the node's height at no extra space cost.

2. The number of shadow entries needs to be counted in addition to the
regular entries, to quickly detect when the node is ready to go to
the shadow node LRU list. The current entry count is an unsigned
int but the maximum number of entries is 64, so a shadow counter
can easily be stored in the unused upper bits; both packed
encodings are illustrated in the sketch after this list.

3. Tree modification needs tree lock and tree root, which are located
in the address space, so store an address_space backpointer in the
node. The parent pointer of the node is in a union with the 2-word
rcu_head, so the backpointer comes at no extra cost as well.

4. The node needs to be linked to an LRU list, which requires a list
head inside the node. This does increase the size of the node, but
it does not change the number of objects that fit into a slab page.
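
A small worked example of the two packed fields from points 1 and 2
(illustration only; sketch_node_encoding() is hypothetical, but the
macros and helpers are the ones introduced by this patch):

#include <linux/radix-tree.h>
#include <linux/swap.h>
#include <linux/bug.h>

static void sketch_node_encoding(struct radix_tree_node *node,
                                 unsigned int height, unsigned int offset)
{
        /* 1. node->path: height in the low bits, offset in parent above */
        node->path = height | (offset << RADIX_TREE_HEIGHT_SHIFT);
        WARN_ON((node->path & RADIX_TREE_HEIGHT_MASK) != height);
        WARN_ON((node->path >> RADIX_TREE_HEIGHT_SHIFT) != offset);

        /* 2. node->count: page entries low, shadow entries in upper bits */
        node->count = 0;
        workingset_node_pages_inc(node);    /* node->count++ */
        workingset_node_shadows_inc(node);  /* += 1U << RADIX_TREE_COUNT_SHIFT */
        WARN_ON(workingset_node_pages(node) != 1);
        WARN_ON(workingset_node_shadows(node) != 1);
}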

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
include/linux/list_lru.h | 2 +
include/linux/mmzone.h | 1 +
include/linux/radix-tree.h | 32 +++++++---
include/linux/swap.h | 31 ++++++++++
lib/radix-tree.c | 36 +++++++-----
mm/filemap.c | 90 +++++++++++++++++++++++-----
mm/list_lru.c | 10 ++++
mm/truncate.c | 26 ++++++++-
mm/vmstat.c | 1 +
mm/workingset.c | 143 +++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 332 insertions(+), 40 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3ce541753c88..b02fc233eadd 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -13,6 +13,8 @@
/* list_lru_walk_cb has to always return one of those */
enum lru_status {
LRU_REMOVED, /* item removed from list */
+ LRU_REMOVED_RETRY, /* item removed, but lock has been
+ dropped and reacquired */
LRU_ROTATE, /* item referenced, give another pass */
LRU_SKIP, /* item cannot be locked, skip */
LRU_RETRY, /* item not freeable. May drop the lock
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4bdeb411a4d..934820b3249c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -144,6 +144,7 @@ enum zone_stat_item {
#endif
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
+ WORKINGSET_NODERECLAIM,
NR_ANON_TRANSPARENT_HUGEPAGES,
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 13636c40bc42..33170dbd9db4 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -72,21 +72,37 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)
#define RADIX_TREE_TAG_LONGS \
((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)

+#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+ RADIX_TREE_MAP_SHIFT))
+
+/* Height component in node->path */
+#define RADIX_TREE_HEIGHT_SHIFT (RADIX_TREE_MAX_PATH + 1)
+#define RADIX_TREE_HEIGHT_MASK ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
+
+/* Internally used bits of node->count */
+#define RADIX_TREE_COUNT_SHIFT (RADIX_TREE_MAP_SHIFT + 1)
+#define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
+
struct radix_tree_node {
- unsigned int height; /* Height from the bottom */
+ unsigned int path; /* Offset in parent & height from the bottom */
unsigned int count;
union {
- struct radix_tree_node *parent; /* Used when ascending tree */
- struct rcu_head rcu_head; /* Used when freeing node */
+ struct {
+ /* Used when ascending tree */
+ struct radix_tree_node *parent;
+ /* For tree user */
+ void *private_data;
+ };
+ /* Used when freeing node */
+ struct rcu_head rcu_head;
};
+ /* For tree user */
+ struct list_head private_list;
void __rcu *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};

-#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
-#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
- RADIX_TREE_MAP_SHIFT))
-
/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
struct radix_tree_root {
unsigned int height;
@@ -251,7 +267,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
struct radix_tree_node **nodep, void ***slotp);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
struct radix_tree_node *node);
void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b83cf61403ed..350711560753 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,6 +264,37 @@ struct swap_list_t {
void *workingset_eviction(struct address_space *mapping, struct page *page);
bool workingset_refault(void *shadow);
void workingset_activation(struct page *page);
+extern struct list_lru workingset_shadow_nodes;
+
+static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
+{
+ return node->count & RADIX_TREE_COUNT_MASK;
+}
+
+static inline void workingset_node_pages_inc(struct radix_tree_node *node)
+{
+ node->count++;
+}
+
+static inline void workingset_node_pages_dec(struct radix_tree_node *node)
+{
+ node->count--;
+}
+
+static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
+{
+ return node->count >> RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
+{
+ node->count += 1U << RADIX_TREE_COUNT_SHIFT;
+}
+
+static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
+{
+ node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+}

/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e601c56a43d0..0a0895371447 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -342,7 +342,8 @@ static int radix_tree_extend(struct radix_tree_root *root, unsigned long index)

/* Increase the height. */
newheight = root->height+1;
- node->height = newheight;
+ BUG_ON(newheight & ~RADIX_TREE_HEIGHT_MASK);
+ node->path = newheight;
node->count = 1;
node->parent = NULL;
slot = root->rnode;
@@ -400,11 +401,12 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
/* Have to add a child node. */
if (!(slot = radix_tree_node_alloc(root)))
return -ENOMEM;
- slot->height = height;
+ slot->path = height;
slot->parent = node;
if (node) {
rcu_assign_pointer(node->slots[offset], slot);
node->count++;
+ slot->path |= offset << RADIX_TREE_HEIGHT_SHIFT;
} else
rcu_assign_pointer(root->rnode, ptr_to_indirect(slot));
}
@@ -496,7 +498,7 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
}
node = indirect_to_ptr(node);

- height = node->height;
+ height = node->path & RADIX_TREE_HEIGHT_MASK;
if (index > radix_tree_maxindex(height))
return NULL;

@@ -702,7 +704,7 @@ int radix_tree_tag_get(struct radix_tree_root *root,
return (index == 0);
node = indirect_to_ptr(node);

- height = node->height;
+ height = node->path & RADIX_TREE_HEIGHT_MASK;
if (index > radix_tree_maxindex(height))
return 0;

@@ -739,7 +741,7 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
{
unsigned shift, tag = flags & RADIX_TREE_ITER_TAG_MASK;
struct radix_tree_node *rnode, *node;
- unsigned long index, offset;
+ unsigned long index, offset, height;

if ((flags & RADIX_TREE_ITER_TAGGED) && !root_tag_get(root, tag))
return NULL;
@@ -770,7 +772,8 @@ void **radix_tree_next_chunk(struct radix_tree_root *root,
return NULL;

restart:
- shift = (rnode->height - 1) * RADIX_TREE_MAP_SHIFT;
+ height = rnode->path & RADIX_TREE_HEIGHT_MASK;
+ shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
offset = index >> shift;

/* Index outside of the tree */
@@ -1140,7 +1143,7 @@ static unsigned long __locate(struct radix_tree_node *slot, void *item,
unsigned int shift, height;
unsigned long i;

- height = slot->height;
+ height = slot->path & RADIX_TREE_HEIGHT_MASK;
shift = (height-1) * RADIX_TREE_MAP_SHIFT;

for ( ; height > 1; height--) {
@@ -1203,7 +1206,8 @@ unsigned long radix_tree_locate_item(struct radix_tree_root *root, void *item)
}

node = indirect_to_ptr(node);
- max_index = radix_tree_maxindex(node->height);
+ max_index = radix_tree_maxindex(node->path &
+ RADIX_TREE_HEIGHT_MASK);
if (cur_index > max_index)
break;

@@ -1297,7 +1301,7 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
*
* Returns %true if @node was freed, %false otherwise.
*/
-bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+bool __radix_tree_delete_node(struct radix_tree_root *root,
struct radix_tree_node *node)
{
bool deleted = false;
@@ -1316,9 +1320,10 @@ bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,

parent = node->parent;
if (parent) {
- index >>= RADIX_TREE_MAP_SHIFT;
+ unsigned int offset;

- parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
+ offset = node->path >> RADIX_TREE_HEIGHT_SHIFT;
+ parent->slots[offset] = NULL;
parent->count--;
} else {
root_tag_clear_all(root);
@@ -1382,7 +1387,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
node->slots[offset] = NULL;
node->count--;

- __radix_tree_delete_node(root, index, node);
+ __radix_tree_delete_node(root, node);

return entry;
}
@@ -1415,9 +1420,12 @@ int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag)
EXPORT_SYMBOL(radix_tree_tagged);

static void
-radix_tree_node_ctor(void *node)
+radix_tree_node_ctor(void *arg)
{
- memset(node, 0, sizeof(struct radix_tree_node));
+ struct radix_tree_node *node = arg;
+
+ memset(node, 0, sizeof(*node));
+ INIT_LIST_HEAD(&node->private_list);
}

static __init unsigned long __maxindex(unsigned int height)
diff --git a/mm/filemap.c b/mm/filemap.c
index 33ceebf4d577..ae4d37c8e0ad 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,11 +110,17 @@
static void page_cache_tree_delete(struct address_space *mapping,
struct page *page, void *shadow)
{
- if (shadow) {
- void **slot;
+ struct radix_tree_node *node;
+ unsigned long index;
+ unsigned int offset;
+ unsigned int tag;
+ void **slot;

- slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
- radix_tree_replace_slot(slot, shadow);
+ VM_BUG_ON(!PageLocked(page));
+
+ __radix_tree_lookup(&mapping->page_tree, page->index, &node, &slot);
+
+ if (shadow) {
mapping->nrshadows++;
/*
* Make sure the nrshadows update is committed before
@@ -123,9 +129,45 @@ static void page_cache_tree_delete(struct address_space *mapping,
* same time and miss a shadow entry.
*/
smp_wmb();
- } else
- radix_tree_delete(&mapping->page_tree, page->index);
+ }
mapping->nrpages--;
+
+ if (!node) {
+ /* Clear direct pointer tags in root node */
+ mapping->page_tree.gfp_mask &= __GFP_BITS_MASK;
+ radix_tree_replace_slot(slot, shadow);
+ return;
+ }
+
+ /* Clear tree tags for the removed page */
+ index = page->index;
+ offset = index & RADIX_TREE_MAP_MASK;
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+ if (test_bit(offset, node->tags[tag]))
+ radix_tree_tag_clear(&mapping->page_tree, index, tag);
+ }
+
+ /* Delete page, swap shadow entry */
+ radix_tree_replace_slot(slot, shadow);
+ workingset_node_pages_dec(node);
+ if (shadow)
+ workingset_node_shadows_inc(node);
+ else
+ if (__radix_tree_delete_node(&mapping->page_tree, node))
+ return;
+
+ /*
+ * Track node that only contains shadow entries.
+ *
+ * Avoid acquiring the list_lru lock if already tracked. The
+ * list_empty() test is safe as node->private_list is
+ * protected by mapping->tree_lock.
+ */
+ if (!workingset_node_pages(node) &&
+ list_empty(&node->private_list)) {
+ node->private_data = mapping;
+ list_lru_add(&workingset_shadow_nodes, &node->private_list);
+ }
}

/*
@@ -471,27 +513,43 @@ EXPORT_SYMBOL_GPL(replace_page_cache_page);
static int page_cache_tree_insert(struct address_space *mapping,
struct page *page, void **shadowp)
{
+ struct radix_tree_node *node;
void **slot;
int error;

- slot = radix_tree_lookup_slot(&mapping->page_tree, page->index);
- if (slot) {
+ error = __radix_tree_create(&mapping->page_tree, page->index,
+ &node, &slot);
+ if (error)
+ return error;
+ if (*slot) {
void *p;

p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
if (!radix_tree_exceptional_entry(p))
return -EEXIST;
- radix_tree_replace_slot(slot, page);
- mapping->nrshadows--;
- mapping->nrpages++;
if (shadowp)
*shadowp = p;
- return 0;
+ mapping->nrshadows--;
+ if (node)
+ workingset_node_shadows_dec(node);
}
- error = radix_tree_insert(&mapping->page_tree, page->index, page);
- if (!error)
- mapping->nrpages++;
- return error;
+ radix_tree_replace_slot(slot, page);
+ mapping->nrpages++;
+ if (node) {
+ workingset_node_pages_inc(node);
+ /*
+ * Don't track node that contains actual pages.
+ *
+ * Avoid acquiring the list_lru lock if already
+ * untracked. The list_empty() test is safe as
+ * node->private_list is protected by
+ * mapping->tree_lock.
+ */
+ if (!list_empty(&node->private_list))
+ list_lru_del(&workingset_shadow_nodes,
+ &node->private_list);
+ }
+ return 0;
}

static int __add_to_page_cache_locked(struct page *page,
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 72f9decb0104..7f5b73e2513b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -87,11 +87,20 @@ restart:

ret = isolate(item, &nlru->lock, cb_arg);
switch (ret) {
+ case LRU_REMOVED_RETRY:
+ assert_spin_locked(&nlru->lock);
case LRU_REMOVED:
if (--nlru->nr_items == 0)
node_clear(nid, lru->active_nodes);
WARN_ON_ONCE(nlru->nr_items < 0);
isolated++;
+ /*
+ * If the lru lock has been dropped, our list
+ * traversal is now invalid and so we have to
+ * restart from scratch.
+ */
+ if (ret == LRU_REMOVED_RETRY)
+ goto restart;
break;
case LRU_ROTATE:
list_move_tail(item, &nlru->list);
@@ -103,6 +112,7 @@ restart:
* The lru lock has been dropped, our list traversal is
* now invalid and so we have to restart from scratch.
*/
+ assert_spin_locked(&nlru->lock);
goto restart;
default:
BUG();
diff --git a/mm/truncate.c b/mm/truncate.c
index 4ca425e5db58..9cb54b7525dc 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -25,6 +25,9 @@
static void clear_exceptional_entry(struct address_space *mapping,
pgoff_t index, void *entry)
{
+ struct radix_tree_node *node;
+ void **slot;
+
/* Handled by shmem itself */
if (shmem_mapping(mapping))
return;
@@ -35,8 +38,27 @@ static void clear_exceptional_entry(struct address_space *mapping,
* without the tree itself locked. These unlocked entries
* need verification under the tree lock.
*/
- if (radix_tree_delete_item(&mapping->page_tree, index, entry) == entry)
- mapping->nrshadows--;
+ if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+ goto unlock;
+ if (*slot != entry)
+ goto unlock;
+ radix_tree_replace_slot(slot, NULL);
+ mapping->nrshadows--;
+ if (!node)
+ goto unlock;
+ workingset_node_shadows_dec(node);
+ /*
+ * Don't track node without shadow entries.
+ *
+ * Avoid acquiring the list_lru lock if already untracked.
+ * The list_empty() test is safe as node->private_list is
+ * protected by mapping->tree_lock.
+ */
+ if (!workingset_node_shadows(node) &&
+ !list_empty(&node->private_list))
+ list_lru_del(&workingset_shadow_nodes, &node->private_list);
+ __radix_tree_delete_node(&mapping->page_tree, node);
+unlock:
spin_unlock_irq(&mapping->tree_lock);
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c95634e0c098..927ef3898c4e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -772,6 +772,7 @@ const char * const vmstat_text[] = {
#endif
"workingset_refault",
"workingset_activate",
+ "workingset_nodereclaim",
"nr_anon_transparent_hugepages",
"nr_free_cma",
"nr_dirty_threshold",
diff --git a/mm/workingset.c b/mm/workingset.c
index 8a6c7cff4923..33429c7ddec5 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -251,3 +251,146 @@ void workingset_activation(struct page *page)
{
atomic_long_inc(&page_zone(page)->inactive_age);
}
+
+/*
+ * Shadow entries reflect the share of the working set that does not
+ * fit into memory, so their number depends on the access pattern of
+ * the workload. In most cases, they will refault or get reclaimed
+ * along with the inode, but a (malicious) workload that streams
+ * through files with a total size several times that of available
+ * memory, while preventing the inodes from being reclaimed, can
+ * create excessive amounts of shadow nodes. To keep a lid on this,
+ * track shadow nodes and reclaim them when they grow way past the
+ * point where they would still be useful.
+ */
+
+struct list_lru workingset_shadow_nodes;
+
+static unsigned long count_shadow_nodes(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ unsigned long shadow_nodes;
+ unsigned long max_nodes;
+ unsigned long pages;
+
+ shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+ pages = node_present_pages(sc->nid);
+ /*
+ * Active cache pages are limited to 50% of memory, and shadow
+ * entries that represent a refault distance bigger than that
+ * do not have any effect. Limit the number of shadow nodes
+ * such that shadow entries do not exceed the number of active
+ * cache pages, assuming a worst-case node population density
+ * of 1/8th on average.
+ *
+ * On 64-bit with 7 radix_tree_nodes per page and 64 slots
+ * each, this will reclaim shadow entries when they consume
+ * ~2% of available memory:
+ *
+ * PAGE_SIZE / radix_tree_nodes / node_entries / PAGE_SIZE
+ */
+ max_nodes = pages >> (1 + RADIX_TREE_MAP_SHIFT - 3);
+
+ if (shadow_nodes <= max_nodes)
+ return 0;
+
+ return shadow_nodes - max_nodes;
+}
+
+static enum lru_status shadow_lru_isolate(struct list_head *item,
+ spinlock_t *lru_lock,
+ void *arg)
+{
+ struct address_space *mapping;
+ struct radix_tree_node *node;
+ unsigned int i;
+ int ret;
+
+ /*
+ * Page cache insertions and deletions synchronously maintain
+ * the shadow node LRU under the mapping->tree_lock and the
+ * lru_lock. Because the page cache tree is emptied before
+ * the inode can be destroyed, holding the lru_lock pins any
+ * address_space that has radix tree nodes on the LRU.
+ *
+ * We can then safely transition to the mapping->tree_lock to
+ * pin only the address_space of the particular node we want
+ * to reclaim, take the node off-LRU, and drop the lru_lock.
+ */
+
+ node = container_of(item, struct radix_tree_node, private_list);
+ mapping = node->private_data;
+
+ /* Coming from the list, invert the lock order */
+ if (!spin_trylock_irq(&mapping->tree_lock)) {
+ spin_unlock(lru_lock);
+ ret = LRU_RETRY;
+ goto out;
+ }
+
+ list_del_init(item);
+ spin_unlock(lru_lock);
+
+ /*
+ * The nodes should only contain one or more shadow entries,
+ * no pages, so we expect to be able to remove them all and
+ * delete and free the empty node afterwards.
+ */
+
+ BUG_ON(!node->count);
+ BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
+
+ for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
+ if (node->slots[i]) {
+ BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+ node->slots[i] = NULL;
+ BUG_ON(node->count < (1U << RADIX_TREE_COUNT_SHIFT));
+ node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
+ BUG_ON(!mapping->nrshadows);
+ mapping->nrshadows--;
+ }
+ }
+ BUG_ON(node->count);
+ inc_zone_state(page_zone(virt_to_page(node)), WORKINGSET_NODERECLAIM);
+ if (!__radix_tree_delete_node(&mapping->page_tree, node))
+ BUG();
+
+ spin_unlock_irq(&mapping->tree_lock);
+ ret = LRU_REMOVED_RETRY;
+out:
+ cond_resched();
+ spin_lock(lru_lock);
+ return ret;
+}
+
+static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
+ struct shrink_control *sc)
+{
+ return list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
+ shadow_lru_isolate, NULL, &sc->nr_to_scan);
+}
+
+static struct shrinker workingset_shadow_shrinker = {
+ .count_objects = count_shadow_nodes,
+ .scan_objects = scan_shadow_nodes,
+ .seeks = DEFAULT_SEEKS,
+ .flags = SHRINKER_NUMA_AWARE,
+};
+
+static int __init workingset_init(void)
+{
+ int ret;
+
+ ret = list_lru_init(&workingset_shadow_nodes);
+ if (ret)
+ goto err;
+ ret = register_shrinker(&workingset_shadow_shrinker);
+ if (ret)
+ goto err_list_lru;
+ return 0;
+err_list_lru:
+ list_lru_destroy(&workingset_shadow_nodes);
+err:
+ return ret;
+}
+module_init(workingset_init);
--
1.8.5.3
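
The packed node->path and node->count fields used throughout the hunks above
can be exercised with a small stand-alone sketch. This is an illustration
only, not kernel code: it assumes a 64-bit build with RADIX_TREE_MAP_SHIFT
of 6 and simply re-derives the shift/mask constants from the header hunk to
show how parent offset and height share node->path, and how page and shadow
counts share node->count.

#include <stdio.h>

#define RADIX_TREE_MAP_SHIFT    6                       /* 64-slot nodes */
#define RADIX_TREE_INDEX_BITS   (8 * sizeof(unsigned long))
#define RADIX_TREE_MAX_PATH     ((RADIX_TREE_INDEX_BITS + RADIX_TREE_MAP_SHIFT - 1) / \
                                 RADIX_TREE_MAP_SHIFT)
#define RADIX_TREE_HEIGHT_SHIFT (RADIX_TREE_MAX_PATH + 1)
#define RADIX_TREE_HEIGHT_MASK  ((1UL << RADIX_TREE_HEIGHT_SHIFT) - 1)
#define RADIX_TREE_COUNT_SHIFT  (RADIX_TREE_MAP_SHIFT + 1)
#define RADIX_TREE_COUNT_MASK   ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)

int main(void)
{
        /* node at height 3 whose slot offset in its parent is 42 */
        unsigned int path = 3 | (42u << RADIX_TREE_HEIGHT_SHIFT);
        /* node holding 5 resident pages and 2 shadow entries */
        unsigned int count = 5 | (2u << RADIX_TREE_COUNT_SHIFT);

        printf("height=%lu offset=%u\n",
               path & RADIX_TREE_HEIGHT_MASK,           /* lower bits */
               path >> RADIX_TREE_HEIGHT_SHIFT);        /* upper bits */
        printf("pages=%lu shadows=%u\n",
               count & RADIX_TREE_COUNT_MASK,           /* lower bits */
               count >> RADIX_TREE_COUNT_SHIFT);        /* upper bits */
        return 0;
}

It prints height=3 offset=42 and pages=5 shadows=2, which is the decoding
the workingset_node_* helpers and the node->path users above rely on.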

2014-02-04 01:00:51

by Johannes Weiner

[permalink] [raw]
Subject: [patch 02/10] fs: cachefiles: use add_to_page_cache_lru()

This code used to have its own lru cache pagevec up until a0b8cab3
("mm: remove lru parameter from __pagevec_lru_add and remove parts of
pagevec API"). Now it's just add_to_page_cache() followed by
lru_cache_add(), might as well use add_to_page_cache_lru() directly.

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
fs/cachefiles/rdwr.c | 33 +++++++++++++--------------------
1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
index ebaff368120d..4b1fb5ca65b8 100644
--- a/fs/cachefiles/rdwr.c
+++ b/fs/cachefiles/rdwr.c
@@ -265,24 +265,22 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
goto nomem_monitor;
}

- ret = add_to_page_cache(newpage, bmapping,
- netpage->index, cachefiles_gfp);
+ ret = add_to_page_cache_lru(newpage, bmapping,
+ netpage->index, cachefiles_gfp);
if (ret == 0)
goto installed_new_backing_page;
if (ret != -EEXIST)
goto nomem_page;
}

- /* we've installed a new backing page, so now we need to add it
- * to the LRU list and start it reading */
+ /* we've installed a new backing page, so now we need to start
+ * it reading */
installed_new_backing_page:
_debug("- new %p", newpage);

backpage = newpage;
newpage = NULL;

- lru_cache_add_file(backpage);
-
read_backing_page:
ret = bmapping->a_ops->readpage(NULL, backpage);
if (ret < 0)
@@ -510,24 +508,23 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
goto nomem;
}

- ret = add_to_page_cache(newpage, bmapping,
- netpage->index, cachefiles_gfp);
+ ret = add_to_page_cache_lru(newpage, bmapping,
+ netpage->index,
+ cachefiles_gfp);
if (ret == 0)
goto installed_new_backing_page;
if (ret != -EEXIST)
goto nomem;
}

- /* we've installed a new backing page, so now we need to add it
- * to the LRU list and start it reading */
+ /* we've installed a new backing page, so now we need
+ * to start it reading */
installed_new_backing_page:
_debug("- new %p", newpage);

backpage = newpage;
newpage = NULL;

- lru_cache_add_file(backpage);
-
reread_backing_page:
ret = bmapping->a_ops->readpage(NULL, backpage);
if (ret < 0)
@@ -538,8 +535,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
monitor_backing_page:
_debug("- monitor add");

- ret = add_to_page_cache(netpage, op->mapping, netpage->index,
- cachefiles_gfp);
+ ret = add_to_page_cache_lru(netpage, op->mapping,
+ netpage->index, cachefiles_gfp);
if (ret < 0) {
if (ret == -EEXIST) {
page_cache_release(netpage);
@@ -549,8 +546,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
goto nomem;
}

- lru_cache_add_file(netpage);
-
/* install a monitor */
page_cache_get(netpage);
monitor->netfs_page = netpage;
@@ -613,8 +608,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
backing_page_already_uptodate:
_debug("- uptodate");

- ret = add_to_page_cache(netpage, op->mapping, netpage->index,
- cachefiles_gfp);
+ ret = add_to_page_cache_lru(netpage, op->mapping,
+ netpage->index, cachefiles_gfp);
if (ret < 0) {
if (ret == -EEXIST) {
page_cache_release(netpage);
@@ -631,8 +626,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,

fscache_mark_page_cached(op, netpage);

- lru_cache_add_file(netpage);
-
/* the netpage is unlocked and marked up to date here */
fscache_end_io(op, netpage, 0);
page_cache_release(netpage);
--
1.8.5.3

2014-02-04 01:00:46

by Johannes Weiner

[permalink] [raw]
Subject: [patch 04/10] mm: shmem: save one radix tree lookup when truncating swapped pages

Page cache radix tree slots are usually stabilized by the page lock,
but shmem's swap cookies have no such thing. Because the overall
truncation loop is lockless, the swap entry is currently confirmed by
a tree lookup and then deleted by another tree lookup under the same
tree lock region.

Use radix_tree_delete_item() instead, which does the verification and
deletion with only one lookup. This also allows removing the
delete-only special case from shmem_radix_tree_replace().
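
As an illustration of the pattern (condensed, not a literal excerpt of the
hunks below; both variants run under mapping->tree_lock):

/* before: confirm the swap cookie with one lookup ... */
pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
if (!pslot || radix_tree_deref_slot_protected(pslot,
                        &mapping->tree_lock) != radswap)
        return -ENOENT;
/* ... and delete it with a second */
radix_tree_delete(&mapping->page_tree, index);

/* after: verify and delete in a single descent */
if (radix_tree_delete_item(&mapping->page_tree, index, radswap) != radswap)
        return -ENOENT;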

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
mm/shmem.c | 25 ++++++++++++-------------
1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 1f18c9d0d93e..e470997010cd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -242,19 +242,17 @@ static int shmem_radix_tree_replace(struct address_space *mapping,
pgoff_t index, void *expected, void *replacement)
{
void **pslot;
- void *item = NULL;
+ void *item;

VM_BUG_ON(!expected);
+ VM_BUG_ON(!replacement);
pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
- if (pslot)
- item = radix_tree_deref_slot_protected(pslot,
- &mapping->tree_lock);
+ if (!pslot)
+ return -ENOENT;
+ item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
if (item != expected)
return -ENOENT;
- if (replacement)
- radix_tree_replace_slot(pslot, replacement);
- else
- radix_tree_delete(&mapping->page_tree, index);
+ radix_tree_replace_slot(pslot, replacement);
return 0;
}

@@ -386,14 +384,15 @@ export:
static int shmem_free_swap(struct address_space *mapping,
pgoff_t index, void *radswap)
{
- int error;
+ void *old;

spin_lock_irq(&mapping->tree_lock);
- error = shmem_radix_tree_replace(mapping, index, radswap, NULL);
+ old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
spin_unlock_irq(&mapping->tree_lock);
- if (!error)
- free_swap_and_cache(radix_to_swp_entry(radswap));
- return error;
+ if (old != radswap)
+ return -ENOENT;
+ free_swap_and_cache(radix_to_swp_entry(radswap));
+ return 0;
}

/*
--
1.8.5.3

2014-02-04 01:00:40

by Johannes Weiner

[permalink] [raw]
Subject: [patch 09/10] lib: radix_tree: tree node interface

Make struct radix_tree_node part of the public interface and provide
API functions to create, look up, and delete whole nodes. Refactor
the existing insert, look up, delete functions on top of these new
node primitives.

This will allow the VM to track and garbage collect page cache radix
tree nodes.
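
As orientation for the diff below, here is a condensed sketch (not a literal
excerpt) of how the new primitives compose inside lib/radix-tree.c; the
refactored radix_tree_delete_item() does the same, plus tag clearing and the
expected-item check:

static void *sketch_delete(struct radix_tree_root *root, unsigned long index)
{
        struct radix_tree_node *node;
        void **slot;
        void *entry;

        /* one descent returns the entry, its slot, and the leaf node */
        entry = __radix_tree_lookup(root, index, &node, &slot);
        if (!entry)
                return NULL;
        if (!node) {
                /* single-entry tree: the item lives in root->rnode */
                root_tag_clear_all(root);
                root->rnode = NULL;
                return entry;
        }
        node->slots[index & RADIX_TREE_MAP_MASK] = NULL;
        node->count--;
        /* free the node and shrink the tree if it became empty */
        __radix_tree_delete_node(root, index, node);
        return entry;
}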

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
include/linux/radix-tree.h | 34 ++++++
lib/radix-tree.c | 261 +++++++++++++++++++++++++--------------------
2 files changed, 180 insertions(+), 115 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index e8be53ecfc45..13636c40bc42 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -60,6 +60,33 @@ static inline int radix_tree_is_indirect_ptr(void *ptr)

#define RADIX_TREE_MAX_TAGS 3

+#ifdef __KERNEL__
+#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
+#else
+#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
+#endif
+
+#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
+#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
+
+#define RADIX_TREE_TAG_LONGS \
+ ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
+
+struct radix_tree_node {
+ unsigned int height; /* Height from the bottom */
+ unsigned int count;
+ union {
+ struct radix_tree_node *parent; /* Used when ascending tree */
+ struct rcu_head rcu_head; /* Used when freeing node */
+ };
+ void __rcu *slots[RADIX_TREE_MAP_SIZE];
+ unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+};
+
+#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
+ RADIX_TREE_MAP_SHIFT))
+
/* root tags are stored in gfp_mask, shifted by __GFP_BITS_SHIFT */
struct radix_tree_root {
unsigned int height;
@@ -101,6 +128,7 @@ do { \
* concurrently with other readers.
*
* The notable exceptions to this rule are the following functions:
+ * __radix_tree_lookup
* radix_tree_lookup
* radix_tree_lookup_slot
* radix_tree_tag_get
@@ -216,9 +244,15 @@ static inline void radix_tree_replace_slot(void **pslot, void *item)
rcu_assign_pointer(*pslot, item);
}

+int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
+ struct radix_tree_node **nodep, void ***slotp);
int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
+void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
+ struct radix_tree_node **nodep, void ***slotp);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+ struct radix_tree_node *node);
void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
unsigned int
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index e8adb5d8a184..e601c56a43d0 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -35,33 +35,6 @@
#include <linux/hardirq.h> /* in_interrupt() */


-#ifdef __KERNEL__
-#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
-#else
-#define RADIX_TREE_MAP_SHIFT 3 /* For more stressful testing */
-#endif
-
-#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
-#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)
-
-#define RADIX_TREE_TAG_LONGS \
- ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
-
-struct radix_tree_node {
- unsigned int height; /* Height from the bottom */
- unsigned int count;
- union {
- struct radix_tree_node *parent; /* Used when ascending tree */
- struct rcu_head rcu_head; /* Used when freeing node */
- };
- void __rcu *slots[RADIX_TREE_MAP_SIZE];
- unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
-};
-
-#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
-#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
- RADIX_TREE_MAP_SHIFT))
-
/*
* The height_to_maxindex array needs to be one deeper than the maximum
* path as height 0 holds only 1 entry.
@@ -387,23 +360,28 @@ out:
}

/**
- * radix_tree_insert - insert into a radix tree
+ * __radix_tree_create - create a slot in a radix tree
* @root: radix tree root
* @index: index key
- * @item: item to insert
+ * @nodep: returns node
+ * @slotp: returns slot
*
- * Insert an item into the radix tree at position @index.
+ * Create, if necessary, and return the node and slot for an item
+ * at position @index in the radix tree @root.
+ *
+ * Until there is more than one item in the tree, no nodes are
+ * allocated and @root->rnode is used as a direct slot instead of
+ * pointing to a node, in which case *@nodep will be NULL.
+ *
+ * Returns -ENOMEM, or 0 for success.
*/
-int radix_tree_insert(struct radix_tree_root *root,
- unsigned long index, void *item)
+int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
+ struct radix_tree_node **nodep, void ***slotp)
{
struct radix_tree_node *node = NULL, *slot;
- unsigned int height, shift;
- int offset;
+ unsigned int height, shift, offset;
int error;

- BUG_ON(radix_tree_is_indirect_ptr(item));
-
/* Make sure the tree is high enough. */
if (index > radix_tree_maxindex(root->height)) {
error = radix_tree_extend(root, index);
@@ -439,16 +417,40 @@ int radix_tree_insert(struct radix_tree_root *root,
height--;
}

- if (slot != NULL)
+ if (nodep)
+ *nodep = node;
+ if (slotp)
+ *slotp = node ? node->slots + offset : (void **)&root->rnode;
+ return 0;
+}
+
+/**
+ * radix_tree_insert - insert into a radix tree
+ * @root: radix tree root
+ * @index: index key
+ * @item: item to insert
+ *
+ * Insert an item into the radix tree at position @index.
+ */
+int radix_tree_insert(struct radix_tree_root *root,
+ unsigned long index, void *item)
+{
+ struct radix_tree_node *node;
+ void **slot;
+ int error;
+
+ BUG_ON(radix_tree_is_indirect_ptr(item));
+
+ error = __radix_tree_create(root, index, &node, &slot);
+ if (error)
+ return error;
+ if (*slot != NULL)
return -EEXIST;
+ rcu_assign_pointer(*slot, item);

if (node) {
node->count++;
- rcu_assign_pointer(node->slots[offset], item);
- BUG_ON(tag_get(node, 0, offset));
- BUG_ON(tag_get(node, 1, offset));
+ BUG_ON(tag_get(node, 0, index & RADIX_TREE_MAP_MASK));
+ BUG_ON(tag_get(node, 1, index & RADIX_TREE_MAP_MASK));
} else {
- rcu_assign_pointer(root->rnode, item);
BUG_ON(root_tag_get(root, 0));
BUG_ON(root_tag_get(root, 1));
}
@@ -457,15 +459,26 @@ int radix_tree_insert(struct radix_tree_root *root,
}
EXPORT_SYMBOL(radix_tree_insert);

-/*
- * is_slot == 1 : search for the slot.
- * is_slot == 0 : search for the node.
+/**
+ * __radix_tree_lookup - lookup an item in a radix tree
+ * @root: radix tree root
+ * @index: index key
+ * @nodep: returns node
+ * @slotp: returns slot
+ *
+ * Lookup and return the item at position @index in the radix
+ * tree @root.
+ *
+ * Until there is more than one item in the tree, no nodes are
+ * allocated and @root->rnode is used as a direct slot instead of
+ * pointing to a node, in which case *@nodep will be NULL.
*/
-static void *radix_tree_lookup_element(struct radix_tree_root *root,
- unsigned long index, int is_slot)
+void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
+ struct radix_tree_node **nodep, void ***slotp)
{
+ struct radix_tree_node *node, *parent;
unsigned int height, shift;
- struct radix_tree_node *node, **slot;
+ void **slot;

node = rcu_dereference_raw(root->rnode);
if (node == NULL)
@@ -474,7 +487,12 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
if (!radix_tree_is_indirect_ptr(node)) {
if (index > 0)
return NULL;
- return is_slot ? (void *)&root->rnode : node;
+
+ if (nodep)
+ *nodep = NULL;
+ if (slotp)
+ *slotp = (void **)&root->rnode;
+ return node;
}
node = indirect_to_ptr(node);

@@ -485,8 +503,8 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
shift = (height-1) * RADIX_TREE_MAP_SHIFT;

do {
- slot = (struct radix_tree_node **)
- (node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
+ parent = node;
+ slot = node->slots + ((index >> shift) & RADIX_TREE_MAP_MASK);
node = rcu_dereference_raw(*slot);
if (node == NULL)
return NULL;
@@ -495,7 +513,11 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
height--;
} while (height > 0);

- return is_slot ? (void *)slot : indirect_to_ptr(node);
+ if (nodep)
+ *nodep = parent;
+ if (slotp)
+ *slotp = slot;
+ return node;
}

/**
@@ -513,7 +535,11 @@ static void *radix_tree_lookup_element(struct radix_tree_root *root,
*/
void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
{
- return (void **)radix_tree_lookup_element(root, index, 1);
+ void **slot;
+
+ if (!__radix_tree_lookup(root, index, NULL, &slot))
+ return NULL;
+ return slot;
}
EXPORT_SYMBOL(radix_tree_lookup_slot);

@@ -531,7 +557,7 @@ EXPORT_SYMBOL(radix_tree_lookup_slot);
*/
void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
{
- return radix_tree_lookup_element(root, index, 0);
+ return __radix_tree_lookup(root, index, NULL, NULL);
}
EXPORT_SYMBOL(radix_tree_lookup);

@@ -1260,6 +1286,56 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
}

/**
+ * __radix_tree_delete_node - try to free node after clearing a slot
+ * @root: radix tree root
+ * @index: index key
+ * @node: node containing @index
+ *
+ * After clearing the slot at @index in @node from radix tree
+ * rooted at @root, call this function to attempt freeing the
+ * node and shrinking the tree.
+ *
+ * Returns %true if @node was freed, %false otherwise.
+ */
+bool __radix_tree_delete_node(struct radix_tree_root *root, unsigned long index,
+ struct radix_tree_node *node)
+{
+ bool deleted = false;
+
+ do {
+ struct radix_tree_node *parent;
+
+ if (node->count) {
+ if (node == indirect_to_ptr(root->rnode)) {
+ radix_tree_shrink(root);
+ if (root->height == 0)
+ deleted = true;
+ }
+ return deleted;
+ }
+
+ parent = node->parent;
+ if (parent) {
+ index >>= RADIX_TREE_MAP_SHIFT;
+
+ parent->slots[index & RADIX_TREE_MAP_MASK] = NULL;
+ parent->count--;
+ } else {
+ root_tag_clear_all(root);
+ root->height = 0;
+ root->rnode = NULL;
+ }
+
+ radix_tree_node_free(node);
+ deleted = true;
+
+ node = parent;
+ } while (node);
+
+ return deleted;
+}
+
+/**
* radix_tree_delete_item - delete an item from a radix tree
* @root: radix tree root
* @index: index key
@@ -1273,43 +1349,26 @@ static inline void radix_tree_shrink(struct radix_tree_root *root)
void *radix_tree_delete_item(struct radix_tree_root *root,
unsigned long index, void *item)
{
- struct radix_tree_node *node = NULL;
- struct radix_tree_node *slot = NULL;
- struct radix_tree_node *to_free;
- unsigned int height, shift;
+ struct radix_tree_node *node;
+ unsigned int offset;
+ void **slot;
+ void *entry;
int tag;
- int uninitialized_var(offset);

- height = root->height;
- if (index > radix_tree_maxindex(height))
- goto out;
+ entry = __radix_tree_lookup(root, index, &node, &slot);
+ if (!entry)
+ return NULL;

- slot = root->rnode;
- if (height == 0) {
+ if (item && entry != item)
+ return NULL;
+
+ if (!node) {
root_tag_clear_all(root);
root->rnode = NULL;
- goto out;
+ return entry;
}
- slot = indirect_to_ptr(slot);
- shift = height * RADIX_TREE_MAP_SHIFT;
-
- do {
- if (slot == NULL)
- goto out;
-
- shift -= RADIX_TREE_MAP_SHIFT;
- offset = (index >> shift) & RADIX_TREE_MAP_MASK;
- node = slot;
- slot = slot->slots[offset];
- } while (shift);
-
- if (slot == NULL)
- goto out;

- if (item && slot != item) {
- slot = NULL;
- goto out;
- }
+ offset = index & RADIX_TREE_MAP_MASK;

/*
* Clear all tags associated with the item to be deleted.
@@ -1320,40 +1379,12 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
radix_tree_tag_clear(root, index, tag);
}

- to_free = NULL;
- /* Now free the nodes we do not need anymore */
- while (node) {
- node->slots[offset] = NULL;
- node->count--;
- /*
- * Queue the node for deferred freeing after the
- * last reference to it disappears (set NULL, above).
- */
- if (to_free)
- radix_tree_node_free(to_free);
-
- if (node->count) {
- if (node == indirect_to_ptr(root->rnode))
- radix_tree_shrink(root);
- goto out;
- }
-
- /* Node with zero slots in use so free it */
- to_free = node;
-
- index >>= RADIX_TREE_MAP_SHIFT;
- offset = index & RADIX_TREE_MAP_MASK;
- node = node->parent;
- }
+ node->slots[offset] = NULL;
+ node->count--;

- root_tag_clear_all(root);
- root->height = 0;
- root->rnode = NULL;
- if (to_free)
- radix_tree_node_free(to_free);
+ __radix_tree_delete_node(root, index, node);

-out:
- return slot;
+ return entry;
}
EXPORT_SYMBOL(radix_tree_delete_item);

--
1.8.5.3

2014-02-04 23:08:07

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 10/10] mm: keep page cache radix tree nodes in check

On Mon, 3 Feb 2014 19:53:42 -0500 Johannes Weiner <[email protected]> wrote:

> Previously, page cache radix tree nodes were freed after reclaim
> emptied out their page pointers. But now reclaim stores shadow
> entries in their place, which are only reclaimed when the inodes
> themselves are reclaimed. This is problematic for bigger files that
> are still in use after they have a significant amount of their cache
> reclaimed, without any of those pages actually refaulting. The shadow
> entries will just sit there and waste memory. In the worst case, the
> shadow entries will accumulate until the machine runs out of memory.
>
> To get this under control, the VM will track radix tree nodes
> exclusively containing shadow entries on a per-NUMA node list.
> Per-NUMA rather than global because we expect the radix tree nodes
> themselves to be allocated node-locally and we want to reduce
> cross-node references of otherwise independent cache workloads. A
> simple shrinker will then reclaim these nodes on memory pressure.
>
> A few things need to be stored in the radix tree node to implement the
> shadow node LRU and allow tree deletions coming from the list:
>
> 1. There is no index available that would describe the reverse path
> from the node up to the tree root, which is needed to perform a
> deletion. To solve this, encode in each node its offset inside the
> parent. This can be stored in the unused upper bits of the same
> member that stores the node's height at no extra space cost.
>
> 2. The number of shadow entries needs to be counted in addition to the
> regular entries, to quickly detect when the node is ready to go to
> the shadow node LRU list. The current entry count is an unsigned
> int but the maximum number of entries is 64, so a shadow counter
> can easily be stored in the unused upper bits.
>
> 3. Tree modification needs tree lock and tree root, which are located
> in the address space, so store an address_space backpointer in the
> node. The parent pointer of the node is in a union with the 2-word
> rcu_head, so the backpointer comes at no extra cost as well.
>
> 4. The node needs to be linked to an LRU list, which requires a list
> head inside the node. This does increase the size of the node, but
> it does not change the number of objects that fit into a slab page.

changelog forgot to mention that this reclaim is performed via a
shrinker...

How expensive is that list walk in scan_shadow_nodes()? I assume in
the best case it will bale out after nr_to_scan iterations?

2014-02-04 23:14:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

On Mon, 3 Feb 2014 19:53:32 -0500 Johannes Weiner <[email protected]> wrote:

> o Fix vmstat build problems on UP (Fengguang Wu's build bot)
>
> o Clarify why optimistic radix_tree_node->private_list link checking
> is safe without holding the list_lru lock (Dave Chinner)
>
> o Assert locking balance when the list_lru isolator says it dropped
> the list lock (Dave Chinner)
>
> o Remove remnant of a manual reclaim counter in the shadow isolator,
> the list_lru-provided accounting is accurate now that we added
> LRU_REMOVED_RETRY (Dave Chinner)
>
> o Set an object limit for the shadow shrinker instead of messing with
> its seeks setting. The configured seeks define how pressure applied
> to pages translates to pressure on the object pool, in itself it is
> not enough to replace proper object valuation to classify expired
> and in-use objects. Shadow nodes contain up to 64 shadow entries
> from different/alternating zones that have their own atomic age
> counter, so determining if a node is overall expired is crazy
> expensive. Instead, use an object limit above which nodes are very
> likely to be expired.
>
> o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)
>
> o radix_tree_node->count accessors for pages and shadows (Minchan Kim)
>
> o Rebase to v3.14-rc1 and add review tags

An earlier version caused a 24-byte inode bloatage. That appears to
have been reduced to 8 bytes, yes? What was done there?

> 69 files changed, 1438 insertions(+), 462 deletions(-)

omigod

2014-02-05 01:54:44

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 10/10] mm: keep page cache radix tree nodes in check

On Tue, Feb 04, 2014 at 03:07:56PM -0800, Andrew Morton wrote:
> On Mon, 3 Feb 2014 19:53:42 -0500 Johannes Weiner <[email protected]> wrote:
>
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers. But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed. This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting. The shadow
> > entries will just sit there and waste memory. In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> >
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads. A
> > simple shrinker will then reclaim these nodes on memory pressure.

^^^^^^^^^^^^^^^
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> >
> > 1. There is no index available that would describe the reverse path
> > from the node up to the tree root, which is needed to perform a
> > deletion. To solve this, encode in each node its offset inside the
> > parent. This can be stored in the unused upper bits of the same
> > member that stores the node's height at no extra space cost.
> >
> > 2. The number of shadow entries needs to be counted in addition to the
> > regular entries, to quickly detect when the node is ready to go to
> > the shadow node LRU list. The current entry count is an unsigned
> > int but the maximum number of entries is 64, so a shadow counter
> > can easily be stored in the unused upper bits.
> >
> > 3. Tree modification needs tree lock and tree root, which are located
> > in the address space, so store an address_space backpointer in the
> > node. The parent pointer of the node is in a union with the 2-word
> > rcu_head, so the backpointer comes at no extra cost as well.
> >
> > 4. The node needs to be linked to an LRU list, which requires a list
> > head inside the node. This does increase the size of the node, but
> > it does not change the number of objects that fit into a slab page.
>
> changelog forgot to mention that this reclaim is performed via a
> shrinker...

Uhm... see above? :)

> How expensive is that list walk in scan_shadow_nodes()? I assume in
> the best case it will bale out after nr_to_scan iterations?

Yes, it scans sc->nr_to_scan radix tree nodes, cleans their pointers,
and frees them.

I ran a worst-case scenario on an 8G machine that creates one 8T
sparse file and faults one page per 64-page radix tree node, i.e. one
node per sparse file fault at CPU speed. The profile:

1 9.21% radixblow [kernel.kallsyms] [k] memset
2 7.23% radixblow [kernel.kallsyms] [k] do_mpage_readpage
3 4.76% radixblow [kernel.kallsyms] [k] copy_user_generic_string
4 3.85% radixblow [kernel.kallsyms] [k] __radix_tree_lookup
5 3.32% kswapd0 [kernel.kallsyms] [k] shadow_lru_isolate
6 2.92% radixblow [kernel.kallsyms] [k] get_page_from_freelist
7 2.81% kswapd0 [kernel.kallsyms] [k] __delete_from_page_cache
8 2.50% radixblow [kernel.kallsyms] [k] radix_tree_node_ctor
9 1.79% radixblow [kernel.kallsyms] [k] _raw_spin_lock_irq
10 1.70% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common

Same scenario with 4 pages per 64-page radix tree node:

13 1.39% kswapd0 [kernel.kallsyms] [k] shadow_lru_isolate

16 pages per 64-page node:

75 0.20% kswapd0 [kernel.kallsyms] [k] shadow_lru_isolate

So I doubt this will bother anyone, especially since most use-once
streamers should have a better population density and populate cache
at disk speed, not CPU speed.

2014-02-05 02:03:00

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

On Tue, Feb 04, 2014 at 03:14:24PM -0800, Andrew Morton wrote:
> On Mon, 3 Feb 2014 19:53:32 -0500 Johannes Weiner <[email protected]> wrote:
>
> > o Fix vmstat build problems on UP (Fengguang Wu's build bot)
> >
> > o Clarify why optimistic radix_tree_node->private_list link checking
> > is safe without holding the list_lru lock (Dave Chinner)
> >
> > o Assert locking balance when the list_lru isolator says it dropped
> > the list lock (Dave Chinner)
> >
> > o Remove remnant of a manual reclaim counter in the shadow isolator,
> > the list_lru-provided accounting is accurate now that we added
> > LRU_REMOVED_RETRY (Dave Chinner)
> >
> > o Set an object limit for the shadow shrinker instead of messing with
> > its seeks setting. The configured seeks define how pressure applied
> > to pages translates to pressure on the object pool, in itself it is
> > not enough to replace proper object valuation to classify expired
> > and in-use objects. Shadow nodes contain up to 64 shadow entries
> > from different/alternating zones that have their own atomic age
> > counter, so determining if a node is overall expired is crazy
> > expensive. Instead, use an object limit above which nodes are very
> > likely to be expired.
> >
> > o __pagevec_lookup and __find_get_pages kerneldoc fixes (Minchan Kim)
> >
> > o radix_tree_node->count accessors for pages and shadows (Minchan Kim)
> >
> > o Rebase to v3.14-rc1 and add review tags
>
> An earlier version caused a 24-byte inode bloatage. That appears to
> have been reduced to 8 bytes, yes? What was done there?

Instead of inodes, the shrinker now directly tracks radix tree nodes
that contain only shadow entries. So the 16 bytes for the list_head
are now in struct radix_tree_node, but due to different slab packing
it didn't increase memory consumption.
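
One way to sanity-check that claim is a small user-space sketch of the two
layouts. Assumptions: 64-bit, RADIX_TREE_MAP_SHIFT of 6 (64 slots), three
tags, a 4096-byte slab page, no slab debugging, and rcu_head/list_head
modelled as two-pointer structs of the same size:

#include <stdio.h>

#define MAP_SIZE  64    /* RADIX_TREE_MAP_SHIFT = 6 */
#define MAX_TAGS  3
#define TAG_LONGS 1     /* 64 slots / 64 bits per long */

struct two_ptrs { void *a, *b; };       /* stand-in for rcu_head/list_head */

struct old_node {                       /* before this series */
        unsigned int height;
        unsigned int count;
        union {
                struct old_node *parent;
                struct two_ptrs rcu_head;
        };
        void *slots[MAP_SIZE];
        unsigned long tags[MAX_TAGS][TAG_LONGS];
};

struct new_node {                       /* with private_data and private_list */
        unsigned int path;
        unsigned int count;
        union {
                struct {
                        struct new_node *parent;
                        void *private_data;
                };
                struct two_ptrs rcu_head;
        };
        struct two_ptrs private_list;
        void *slots[MAP_SIZE];
        unsigned long tags[MAX_TAGS][TAG_LONGS];
};

int main(void)
{
        printf("old: %zu bytes, %zu per 4K page\n",
               sizeof(struct old_node), (size_t)4096 / sizeof(struct old_node));
        printf("new: %zu bytes, %zu per 4K page\n",
               sizeof(struct new_node), (size_t)4096 / sizeof(struct new_node));
        return 0;
}

It prints 560 and 576 bytes, i.e. 4096/560 = 4096/576 = 7 nodes per slab
page either way, which also matches the "7 radix_tree_nodes per page"
figure used in the shadow node limit comment.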

> > 69 files changed, 1438 insertions(+), 462 deletions(-)
>
> omigod

Most of it is comments and Minchan's accessor functions.

2014-02-05 22:17:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 01/10] mm: vmstat: fix UP zone state accounting

On Mon, 3 Feb 2014 19:53:33 -0500 Johannes Weiner <[email protected]> wrote:

> Fengguang Wu's build testing spotted problems with inc_zone_state()
> and dec_zone_state() on UP configurations in out-of-tree patches.
>
> inc_zone_state() is declared but not defined, dec_zone_state() is
> missing entirely.
>
> Just like with *_zone_page_state(), they can be defined like their
> preemption-unsafe counterparts on UP.

um,

In file included from include/linux/mm.h:876,
from include/linux/suspend.h:8,
from arch/x86/kernel/asm-offsets.c:12:
include/linux/vmstat.h: In function '__inc_zone_page_state':
include/linux/vmstat.h:228: error: implicit declaration of function '__inc_zone_state'
include/linux/vmstat.h: In function '__dec_zone_page_state':
include/linux/vmstat.h:234: error: implicit declaration of function '__dec_zone_state'
include/linux/vmstat.h: At top level:
include/linux/vmstat.h:245: warning: conflicting types for '__inc_zone_state'
include/linux/vmstat.h:245: error: static declaration of '__inc_zone_state' follows non-static declaration
include/linux/vmstat.h:228: note: previous implicit declaration of '__inc_zone_state' was here
include/linux/vmstat.h:251: warning: conflicting types for '__dec_zone_state'
include/linux/vmstat.h:251: error: static declaration of '__dec_zone_state' follows non-static declaration
include/linux/vmstat.h:234: note: previous implicit declaration of '__dec_zone_state' was here

I shuffled them around:

--- a/include/linux/vmstat.h~mm-vmstat-fix-up-zone-state-accounting-fix
+++ a/include/linux/vmstat.h
@@ -214,6 +214,18 @@ static inline void __mod_zone_page_state
zone_page_state_add(delta, zone, item);
}

+static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+ atomic_long_inc(&zone->vm_stat[item]);
+ atomic_long_inc(&vm_stat[item]);
+}
+
+static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+ atomic_long_dec(&zone->vm_stat[item]);
+ atomic_long_dec(&vm_stat[item]);
+}
+
static inline void __inc_zone_page_state(struct page *page,
enum zone_stat_item item)
{
@@ -234,18 +246,6 @@ static inline void __dec_zone_page_state
#define dec_zone_page_state __dec_zone_page_state
#define mod_zone_page_state __mod_zone_page_state

-static inline void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
-{
- atomic_long_inc(&zone->vm_stat[item]);
- atomic_long_inc(&vm_stat[item]);
-}
-
-static inline void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
-{
- atomic_long_dec(&zone->vm_stat[item]);
- atomic_long_dec(&vm_stat[item]);
-}
-
#define inc_zone_state __inc_zone_state
#define dec_zone_state __dec_zone_state

_

2014-02-08 11:44:48

by Rafael Aquini

[permalink] [raw]
Subject: Re: [patch 02/10] fs: cachefiles: use add_to_page_cache_lru()

On Mon, Feb 03, 2014 at 07:53:34PM -0500, Johannes Weiner wrote:
> This code used to have its own lru cache pagevec up until a0b8cab3
> ("mm: remove lru parameter from __pagevec_lru_add and remove parts of
> pagevec API"). Now it's just add_to_page_cache() followed by
> lru_cache_add(), might as well use add_to_page_cache_lru() directly.
>

Just a heads-up here: take a look at https://lkml.org/lkml/2014/2/7/587

I'm not saying that the hunks below will cause the same leak issue as the one
depicted in the thread I pointed to, but it surely doesn't hurt to double-check them.

Regards,
-- Rafael

> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
> ---
> fs/cachefiles/rdwr.c | 33 +++++++++++++--------------------
> 1 file changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/fs/cachefiles/rdwr.c b/fs/cachefiles/rdwr.c
> index ebaff368120d..4b1fb5ca65b8 100644
> --- a/fs/cachefiles/rdwr.c
> +++ b/fs/cachefiles/rdwr.c
> @@ -265,24 +265,22 @@ static int cachefiles_read_backing_file_one(struct cachefiles_object *object,
> goto nomem_monitor;
> }
>
> - ret = add_to_page_cache(newpage, bmapping,
> - netpage->index, cachefiles_gfp);
> + ret = add_to_page_cache_lru(newpage, bmapping,
> + netpage->index, cachefiles_gfp);
> if (ret == 0)
> goto installed_new_backing_page;
> if (ret != -EEXIST)
> goto nomem_page;
> }
>
> - /* we've installed a new backing page, so now we need to add it
> - * to the LRU list and start it reading */
> + /* we've installed a new backing page, so now we need to start
> + * it reading */
> installed_new_backing_page:
> _debug("- new %p", newpage);
>
> backpage = newpage;
> newpage = NULL;
>
> - lru_cache_add_file(backpage);
> -
> read_backing_page:
> ret = bmapping->a_ops->readpage(NULL, backpage);
> if (ret < 0)
> @@ -510,24 +508,23 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
> goto nomem;
> }
>
> - ret = add_to_page_cache(newpage, bmapping,
> - netpage->index, cachefiles_gfp);
> + ret = add_to_page_cache_lru(newpage, bmapping,
> + netpage->index,
> + cachefiles_gfp);
> if (ret == 0)
> goto installed_new_backing_page;
> if (ret != -EEXIST)
> goto nomem;
> }
>
> - /* we've installed a new backing page, so now we need to add it
> - * to the LRU list and start it reading */
> + /* we've installed a new backing page, so now we need
> + * to start it reading */
> installed_new_backing_page:
> _debug("- new %p", newpage);
>
> backpage = newpage;
> newpage = NULL;
>
> - lru_cache_add_file(backpage);
> -
> reread_backing_page:
> ret = bmapping->a_ops->readpage(NULL, backpage);
> if (ret < 0)
> @@ -538,8 +535,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
> monitor_backing_page:
> _debug("- monitor add");
>
> - ret = add_to_page_cache(netpage, op->mapping, netpage->index,
> - cachefiles_gfp);
> + ret = add_to_page_cache_lru(netpage, op->mapping,
> + netpage->index, cachefiles_gfp);
> if (ret < 0) {
> if (ret == -EEXIST) {
> page_cache_release(netpage);
> @@ -549,8 +546,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
> goto nomem;
> }
>
> - lru_cache_add_file(netpage);
> -
> /* install a monitor */
> page_cache_get(netpage);
> monitor->netfs_page = netpage;
> @@ -613,8 +608,8 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
> backing_page_already_uptodate:
> _debug("- uptodate");
>
> - ret = add_to_page_cache(netpage, op->mapping, netpage->index,
> - cachefiles_gfp);
> + ret = add_to_page_cache_lru(netpage, op->mapping,
> + netpage->index, cachefiles_gfp);
> if (ret < 0) {
> if (ret == -EEXIST) {
> page_cache_release(netpage);
> @@ -631,8 +626,6 @@ static int cachefiles_read_backing_file(struct cachefiles_object *object,
>
> fscache_mark_page_cached(op, netpage);
>
> - lru_cache_add_file(netpage);
> -
> /* the netpage is unlocked and marked up to date here */
> fscache_end_io(op, netpage, 0);
> page_cache_release(netpage);
> --
> 1.8.5.3
>
> --

2014-02-09 17:35:18

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 02/10] fs: cachefiles: use add_to_page_cache_lru()

On Sat, Feb 08, 2014 at 09:43:35AM -0200, Rafael Aquini wrote:
> On Mon, Feb 03, 2014 at 07:53:34PM -0500, Johannes Weiner wrote:
> > This code used to have its own lru cache pagevec up until a0b8cab3
> > ("mm: remove lru parameter from __pagevec_lru_add and remove parts of
> > pagevec API"). Now it's just add_to_page_cache() followed by
> > lru_cache_add(), might as well use add_to_page_cache_lru() directly.
> >
>
> Just a heads-up, here: take a look at https://lkml.org/lkml/2014/2/7/587

Ah, yes. That patch replaced a private pagevec, which consumes the
references you pass in, with add_to_page_cache_lru(), which gets its
own references.

My patch changes

    add_to_page_cache()
    lru_cache_add()

to

    add_to_page_cache_lru()
      add_to_page_cache()
      lru_cache_add()

so the refcounting does not change for the caller.

Thanks for pointing it out, though; it never hurts to double-check
stuff like that.

2014-02-12 10:58:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 02/10] fs: cachefiles: use add_to_page_cache_lru()

On Mon, Feb 03, 2014 at 07:53:34PM -0500, Johannes Weiner wrote:
> This code used to have its own lru cache pagevec up until a0b8cab3
> ("mm: remove lru parameter from __pagevec_lru_add and remove parts of
> pagevec API"). Now it's just add_to_page_cache() followed by
> lru_cache_add(), might as well use add_to_page_cache_lru() directly.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2014-02-12 11:00:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 03/10] lib: radix-tree: radix_tree_delete_item()

On Mon, Feb 03, 2014 at 07:53:35PM -0500, Johannes Weiner wrote:
> Provide a function that does not just delete an entry at a given
> index, but also allows passing in an expected item. Delete only if
> that item is still located at the specified index.
>
> This is handy when lockless tree traversals want to delete entries as
> well because they don't have to do a second, locked lookup to verify
> the slot has not changed under them before deleting the entry.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2014-02-12 11:11:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 04/10] mm: shmem: save one radix tree lookup when truncating swapped pages

On Mon, Feb 03, 2014 at 07:53:36PM -0500, Johannes Weiner wrote:
> Page cache radix tree slots are usually stabilized by the page lock,
> but shmem's swap cookies have no such thing. Because the overall
> truncation loop is lockless, the swap entry is currently confirmed by
> a tree lookup and then deleted by another tree lookup under the same
> tree lock region.
>
> Use radix_tree_delete_item() instead, which does the verification and
> deletion with only one lookup. This also allows removing the
> delete-only special case from shmem_radix_tree_replace().
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2014-02-12 11:16:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch 05/10] mm: filemap: move radix tree hole searching here

On Mon, Feb 03, 2014 at 07:53:37PM -0500, Johannes Weiner wrote:
> The radix tree hole searching code is only used for page cache, for
> example the readahead code trying to get a picture of the area
> surrounding a fault.
>
> It sufficed to rely on the radix tree definition of holes, which is
> "empty tree slot". But this is about to change, though, as shadow
> page descriptors will be stored in the page cache after the actual
> pages get evicted from memory.
>
> Move the functions over to mm/filemap.c and make them native page
> cache operations, where they can later be adapted to handle the new
> definition of "page cache hole".
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2014-02-13 03:21:31

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

Hello.

I got a lockdep warning shown below, and the bad commit seems to be de055616
"mm: keep page cache radix tree nodes in check" as of next-20140212
on linux-next.git.

Regards.

=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.14.0-rc1-00099-gde05561 #126 Tainted: GF
---------------------------------------------------------
swapper/0/0 just changed the state of lock:
(&(&mapping->tree_lock)->rlock){..-.-.}, at: [<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
but this lock took another, SOFTIRQ-unsafe lock in the past:
(&(&lru->node[i].lock)->rlock){+.+.-.}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&(&lru->node[i].lock)->rlock);
                               local_irq_disable();
                               lock(&(&mapping->tree_lock)->rlock);
                               lock(&(&lru->node[i].lock)->rlock);
  <Interrupt>
    lock(&(&mapping->tree_lock)->rlock);

*** DEADLOCK ***

no locks held by swapper/0/0.

the shortest dependencies between 2nd lock and 1st lock:
-> (&(&lru->node[i].lock)->rlock){+.+.-.} ops: 445715 {
HARDIRQ-ON-W at:
[<ffffffff810b6490>] mark_irqflags+0x130/0x190
[<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
[<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
[<ffffffff811e5c1c>] dput+0xbc/0x120
[<ffffffff811cebc2>] __fput+0x1d2/0x310
[<ffffffff811cedae>] ____fput+0xe/0x10
[<ffffffff8107cc2d>] task_work_run+0xad/0xe0
[<ffffffff81003be5>] do_notify_resume+0x75/0x80
[<ffffffff8163b01a>] int_signal+0x12/0x17
SOFTIRQ-ON-W at:
[<ffffffff810b64b4>] mark_irqflags+0x154/0x190
[<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
[<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
[<ffffffff811e5c1c>] dput+0xbc/0x120
[<ffffffff811cebc2>] __fput+0x1d2/0x310
[<ffffffff811cedae>] ____fput+0xe/0x10
[<ffffffff8107cc2d>] task_work_run+0xad/0xe0
[<ffffffff81003be5>] do_notify_resume+0x75/0x80
[<ffffffff8163b01a>] int_signal+0x12/0x17
IN-RECLAIM_FS-W at:
[<ffffffff810b6426>] mark_irqflags+0xc6/0x190
[<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
[<ffffffff8118c698>] list_lru_count_node+0x28/0x70
[<ffffffff811d05b3>] super_cache_count+0x83/0x120
[<ffffffff81176647>] shrink_slab_node+0x47/0x350
[<ffffffff811769dd>] shrink_slab+0x8d/0x160
[<ffffffff81179480>] kswapd_shrink_zone+0x130/0x1c0
[<ffffffff81179fe9>] balance_pgdat+0x389/0x520
[<ffffffff8117b92f>] kswapd+0x1bf/0x380
[<ffffffff81080abe>] kthread+0xee/0x110
[<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0
INITIAL USE at:
[<ffffffff810b7d34>] __lock_acquire+0x214/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
[<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
[<ffffffff811e5c1c>] dput+0xbc/0x120
[<ffffffff811cebc2>] __fput+0x1d2/0x310
[<ffffffff811cedae>] ____fput+0xe/0x10
[<ffffffff8107cc2d>] task_work_run+0xad/0xe0
[<ffffffff81003be5>] do_notify_resume+0x75/0x80
[<ffffffff8163b01a>] int_signal+0x12/0x17
}
... key at: [<ffffffff82b59f34>] __key.23573+0x0/0xc
... acquired at:
[<ffffffff810b79c1>] validate_chain+0x6e1/0x840
[<ffffffff810b7e87>] __lock_acquire+0x367/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
[<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
[<ffffffff81161180>] page_cache_tree_delete+0x140/0x1a0
[<ffffffff81161230>] __delete_from_page_cache+0x50/0x1c0
[<ffffffff8117571d>] __remove_mapping+0x9d/0x170
[<ffffffff81177347>] shrink_page_list+0x617/0x7f0
[<ffffffff811783aa>] shrink_inactive_list+0x26a/0x520
[<ffffffff81178cc6>] shrink_lruvec+0x336/0x420
[<ffffffff81178e0c>] shrink_zone+0x5c/0x120
[<ffffffff8117944b>] kswapd_shrink_zone+0xfb/0x1c0
[<ffffffff81179fe9>] balance_pgdat+0x389/0x520
[<ffffffff8117b92f>] kswapd+0x1bf/0x380
[<ffffffff81080abe>] kthread+0xee/0x110
[<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0

-> (&(&mapping->tree_lock)->rlock){..-.-.} ops: 11597 {
IN-SOFTIRQ-W at:
[<ffffffff810b6469>] mark_irqflags+0x109/0x190
[<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff81630760>] _raw_spin_lock_irqsave+0x50/0x90
[<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
[<ffffffff811604f0>] end_page_writeback+0x20/0x60
[<ffffffffa00c17b8>] ext4_finish_bio+0x168/0x220 [ext4]
[<ffffffffa00c1c37>] ext4_end_bio+0x97/0xe0 [ext4]
[<ffffffff8120b9e3>] bio_endio+0x53/0xa0
[<ffffffff812da333>] blk_update_request+0x213/0x430
[<ffffffff812da577>] blk_update_bidi_request+0x27/0xb0
[<ffffffff812db18f>] blk_end_bidi_request+0x2f/0x80
[<ffffffff812db230>] blk_end_request+0x10/0x20
[<ffffffff8143d770>] scsi_end_request+0x40/0xb0
[<ffffffff8143daff>] scsi_io_completion+0x9f/0x6c0
[<ffffffff81433204>] scsi_finish_command+0xd4/0x140
[<ffffffff8143e28f>] scsi_softirq_done+0x14f/0x170
[<ffffffff812e3144>] blk_done_softirq+0x84/0xa0
[<ffffffff8105b29d>] __do_softirq+0x12d/0x430
[<ffffffff8105b6d5>] irq_exit+0xc5/0xd0
[<ffffffff8163cc47>] do_IRQ+0x67/0x110
[<ffffffff8163156f>] ret_from_intr+0x0/0x13
[<ffffffff8100dd26>] arch_cpu_idle+0x26/0x30
[<ffffffff810cac39>] cpu_idle_loop+0xa9/0x3c0
[<ffffffff810caf73>] cpu_startup_entry+0x23/0x30
[<ffffffff81623434>] rest_init+0xf4/0x170
[<ffffffff81f491dc>] start_kernel+0x346/0x34d
[<ffffffff81f483a8>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff81f484d8>] x86_64_start_kernel+0xf5/0xfc
IN-RECLAIM_FS-W at:
[<ffffffff810b6426>] mark_irqflags+0xc6/0x190
[<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff816306d4>] _raw_spin_lock_irq+0x44/0x80
[<ffffffff811756d5>] __remove_mapping+0x55/0x170
[<ffffffff81177347>] shrink_page_list+0x617/0x7f0
[<ffffffff811783aa>] shrink_inactive_list+0x26a/0x520
[<ffffffff81178cc6>] shrink_lruvec+0x336/0x420
[<ffffffff81178e0c>] shrink_zone+0x5c/0x120
[<ffffffff8117944b>] kswapd_shrink_zone+0xfb/0x1c0
[<ffffffff81179fe9>] balance_pgdat+0x389/0x520
[<ffffffff8117b92f>] kswapd+0x1bf/0x380
[<ffffffff81080abe>] kthread+0xee/0x110
[<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0
INITIAL USE at:
[<ffffffff810b7d34>] __lock_acquire+0x214/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff816306d4>] _raw_spin_lock_irq+0x44/0x80
[<ffffffff81161810>] __add_to_page_cache_locked+0xa0/0x1d0
[<ffffffff81161968>] add_to_page_cache_lru+0x28/0x80
[<ffffffff81162c28>] grab_cache_page_write_begin+0x98/0xe0
[<ffffffff811f95e4>] simple_write_begin+0x34/0x100
[<ffffffff8115fc9a>] generic_perform_write+0xca/0x210
[<ffffffff8115fe43>] generic_file_buffered_write+0x63/0xa0
[<ffffffff81163aea>] __generic_file_aio_write+0x1ca/0x3c0
[<ffffffff81163d46>] generic_file_aio_write+0x66/0xb0
[<ffffffff811cc20f>] do_sync_write+0x5f/0xa0
[<ffffffff811ce127>] vfs_write+0xc7/0x1f0
[<ffffffff811ce362>] SyS_write+0x62/0xb0
[<ffffffff81f4b0c6>] do_copy+0x2b/0xb0
[<ffffffff81f4ad46>] flush_buffer+0x7d/0xa3
[<ffffffff81f7e9ab>] gunzip+0x287/0x330
[<ffffffff81f4b967>] unpack_to_rootfs+0x167/0x293
[<ffffffff81f4bb97>] populate_rootfs+0x62/0xdf
[<ffffffff810002a2>] do_one_initcall+0xd2/0x180
[<ffffffff81f48933>] do_basic_setup+0x9d/0xc0
[<ffffffff81f48bd6>] kernel_init_freeable+0x280/0x303
[<ffffffff816234be>] kernel_init+0xe/0x130
[<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0
}
... key at: [<ffffffff82bb7f98>] __key.41448+0x0/0x8
... acquired at:
[<ffffffff810b51f0>] check_usage_forwards+0x90/0x110
[<ffffffff810b5f4f>] mark_lock_irq+0x9f/0x2c0
[<ffffffff810b628c>] mark_lock+0x11c/0x1f0
[<ffffffff810b6469>] mark_irqflags+0x109/0x190
[<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff81630760>] _raw_spin_lock_irqsave+0x50/0x90
[<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
[<ffffffff811604f0>] end_page_writeback+0x20/0x60
[<ffffffffa00c17b8>] ext4_finish_bio+0x168/0x220 [ext4]
[<ffffffffa00c1c37>] ext4_end_bio+0x97/0xe0 [ext4]
[<ffffffff8120b9e3>] bio_endio+0x53/0xa0
[<ffffffff812da333>] blk_update_request+0x213/0x430
[<ffffffff812da577>] blk_update_bidi_request+0x27/0xb0
[<ffffffff812db18f>] blk_end_bidi_request+0x2f/0x80
[<ffffffff812db230>] blk_end_request+0x10/0x20
[<ffffffff8143d770>] scsi_end_request+0x40/0xb0
[<ffffffff8143daff>] scsi_io_completion+0x9f/0x6c0
[<ffffffff81433204>] scsi_finish_command+0xd4/0x140
[<ffffffff8143e28f>] scsi_softirq_done+0x14f/0x170
[<ffffffff812e3144>] blk_done_softirq+0x84/0xa0
[<ffffffff8105b29d>] __do_softirq+0x12d/0x430
[<ffffffff8105b6d5>] irq_exit+0xc5/0xd0
[<ffffffff8163cc47>] do_IRQ+0x67/0x110
[<ffffffff8163156f>] ret_from_intr+0x0/0x13
[<ffffffff8100dd26>] arch_cpu_idle+0x26/0x30
[<ffffffff810cac39>] cpu_idle_loop+0xa9/0x3c0
[<ffffffff810caf73>] cpu_startup_entry+0x23/0x30
[<ffffffff81623434>] rest_init+0xf4/0x170
[<ffffffff81f491dc>] start_kernel+0x346/0x34d
[<ffffffff81f483a8>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff81f484d8>] x86_64_start_kernel+0xf5/0xfc


stack backtrace:
CPU: 0 PID: 0 Comm: swapper/0 Tainted: GF 3.14.0-rc1-00099-gde05561 #126
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/20/2012
ffffffff8233b140 ffff8800792038d8 ffffffff8162aa99 0000000000000002
ffffffff8233b140 ffff880079203928 ffffffff810b5118 ffffffff81a08d4a
ffffffff81a08d4a ffff880079203928 ffffffff81c110a8 ffff880079203938
Call Trace:
<IRQ> [<ffffffff8162aa99>] dump_stack+0x51/0x70
[<ffffffff810b5118>] print_irq_inversion_bug+0x1c8/0x210
[<ffffffff810b51f0>] check_usage_forwards+0x90/0x110
[<ffffffff8107cec8>] ? __kernel_text_address+0x58/0x80
[<ffffffff810b5160>] ? print_irq_inversion_bug+0x210/0x210
[<ffffffff810b5f4f>] mark_lock_irq+0x9f/0x2c0
[<ffffffff810b628c>] mark_lock+0x11c/0x1f0
[<ffffffff810b6469>] mark_irqflags+0x109/0x190
[<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
[<ffffffff810b819e>] lock_acquire+0x9e/0x170
[<ffffffff8116f3e8>] ? test_clear_page_writeback+0x48/0x190
[<ffffffff81048aed>] ? __change_page_attr_set_clr+0x4d/0xb0
[<ffffffff81630760>] _raw_spin_lock_irqsave+0x50/0x90
[<ffffffff8116f3e8>] ? test_clear_page_writeback+0x48/0x190
[<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
[<ffffffffa00c1825>] ? ext4_finish_bio+0x1d5/0x220 [ext4]
[<ffffffff811604f0>] end_page_writeback+0x20/0x60
[<ffffffffa00c17b8>] ext4_finish_bio+0x168/0x220 [ext4]
[<ffffffffa00c18ec>] ? ext4_release_io_end+0x7c/0x100 [ext4]
[<ffffffff812da079>] ? blk_account_io_completion+0x119/0x1c0
[<ffffffffa00c1c37>] ext4_end_bio+0x97/0xe0 [ext4]
[<ffffffff8120b9e3>] bio_endio+0x53/0xa0
[<ffffffff812da333>] blk_update_request+0x213/0x430
[<ffffffff812da577>] blk_update_bidi_request+0x27/0xb0
[<ffffffff812db18f>] blk_end_bidi_request+0x2f/0x80
[<ffffffff812db230>] blk_end_request+0x10/0x20
[<ffffffff8143d770>] scsi_end_request+0x40/0xb0
[<ffffffff816310c0>] ? _raw_spin_unlock_irqrestore+0x40/0x70
[<ffffffff8143daff>] scsi_io_completion+0x9f/0x6c0
[<ffffffff810b6add>] ? trace_hardirqs_on+0xd/0x10
[<ffffffff81433204>] scsi_finish_command+0xd4/0x140
[<ffffffff8143e28f>] scsi_softirq_done+0x14f/0x170
[<ffffffff812e3144>] blk_done_softirq+0x84/0xa0
[<ffffffff8105b29d>] __do_softirq+0x12d/0x430
[<ffffffff8105b6d5>] irq_exit+0xc5/0xd0
[<ffffffff8163cc47>] do_IRQ+0x67/0x110
[<ffffffff8163156f>] common_interrupt+0x6f/0x6f
<EOI> [<ffffffff8100e426>] ? default_idle+0x26/0x210
[<ffffffff8100e424>] ? default_idle+0x24/0x210
[<ffffffff8100dd26>] arch_cpu_idle+0x26/0x30
[<ffffffff810cac39>] cpu_idle_loop+0xa9/0x3c0
[<ffffffff810caf73>] cpu_startup_entry+0x23/0x30
[<ffffffff81623434>] rest_init+0xf4/0x170
[<ffffffff81623340>] ? csum_partial_copy_generic+0x170/0x170
[<ffffffff81f491dc>] start_kernel+0x346/0x34d
[<ffffffff81f48cb4>] ? repair_env_string+0x5b/0x5b
[<ffffffff816293ea>] ? memblock_reserve+0x49/0x4e
[<ffffffff81f483a8>] x86_64_start_reservations+0x2a/0x2c
[<ffffffff81f484d8>] x86_64_start_kernel+0xf5/0xfc

2014-02-13 22:11:35

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

Hi Tetsuo,

On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> Hello.
>
> I got a lockdep warning shown below, and the bad commit seems to be de055616
> \"mm: keep page cache radix tree nodes in check\" as of next-20140212
> on linux-next.git.

Thanks for the report. There is already a fix for this in -mm:
http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2

It was merged on the 7th, so it should show up in -next... any day
now?
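
(Sketch for readers following along, not the actual -mm fix: the inversion
above comes from taking the IRQ-unsafe list_lru node lock while nested inside
the IRQ-disabled mapping->tree_lock. Assuming, as the follow-up build error
later in this thread suggests, that the fix gives the shadow-node lru its own
lock class via list_lru_init_key() and only touches it with interrupts off,
the arrangement would look roughly like this.)

#include <linux/list_lru.h>
#include <linux/irqflags.h>
#include <linux/init.h>

static struct list_lru shadow_nodes;
static struct lock_class_key shadow_nodes_key;	/* separate lockdep class */

static int __init shadow_lru_init(void)
{
	/* list_lru_init_key() as introduced by the -mm fix */
	return list_lru_init_key(&shadow_nodes, &shadow_nodes_key);
}

static void shadow_node_track(struct list_head *item)
{
	/*
	 * Callers that already hold mapping->tree_lock have IRQs off;
	 * standalone callers disable them explicitly, so this lru lock
	 * is only ever taken with interrupts disabled.
	 */
	local_irq_disable();
	list_lru_add(&shadow_nodes, item);
	local_irq_enable();
}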

> Regards.
>
> =========================================================
> [ INFO: possible irq lock inversion dependency detected ]
> 3.14.0-rc1-00099-gde05561 #126 Tainted: GF
> ---------------------------------------------------------
> swapper/0/0 just changed the state of lock:
> (&(&mapping->tree_lock)->rlock){..-.-.}, at: [<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
> but this lock took another, SOFTIRQ-unsafe lock in the past:
> (&(&lru->node[i].lock)->rlock){+.+.-.}
>
> and interrupts could create inverse lock ordering between them.
>
>
> other info that might help us debug this:
> Possible interrupt unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&(&lru->node[i].lock)->rlock);
>                                local_irq_disable();
>                                lock(&(&mapping->tree_lock)->rlock);
>                                lock(&(&lru->node[i].lock)->rlock);
>   <Interrupt>
>     lock(&(&mapping->tree_lock)->rlock);
>
> *** DEADLOCK ***
>
> no locks held by swapper/0/0.
>
> the shortest dependencies between 2nd lock and 1st lock:
> -> (&(&lru->node[i].lock)->rlock){+.+.-.} ops: 445715 {
> HARDIRQ-ON-W at:
> [<ffffffff810b6490>] mark_irqflags+0x130/0x190
> [<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
> [<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
> [<ffffffff811e5c1c>] dput+0xbc/0x120
> [<ffffffff811cebc2>] __fput+0x1d2/0x310
> [<ffffffff811cedae>] ____fput+0xe/0x10
> [<ffffffff8107cc2d>] task_work_run+0xad/0xe0
> [<ffffffff81003be5>] do_notify_resume+0x75/0x80
> [<ffffffff8163b01a>] int_signal+0x12/0x17
> SOFTIRQ-ON-W at:
> [<ffffffff810b64b4>] mark_irqflags+0x154/0x190
> [<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
> [<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
> [<ffffffff811e5c1c>] dput+0xbc/0x120
> [<ffffffff811cebc2>] __fput+0x1d2/0x310
> [<ffffffff811cedae>] ____fput+0xe/0x10
> [<ffffffff8107cc2d>] task_work_run+0xad/0xe0
> [<ffffffff81003be5>] do_notify_resume+0x75/0x80
> [<ffffffff8163b01a>] int_signal+0x12/0x17
> IN-RECLAIM_FS-W at:
> [<ffffffff810b6426>] mark_irqflags+0xc6/0x190
> [<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
> [<ffffffff8118c698>] list_lru_count_node+0x28/0x70
> [<ffffffff811d05b3>] super_cache_count+0x83/0x120
> [<ffffffff81176647>] shrink_slab_node+0x47/0x350
> [<ffffffff811769dd>] shrink_slab+0x8d/0x160
> [<ffffffff81179480>] kswapd_shrink_zone+0x130/0x1c0
> [<ffffffff81179fe9>] balance_pgdat+0x389/0x520
> [<ffffffff8117b92f>] kswapd+0x1bf/0x380
> [<ffffffff81080abe>] kthread+0xee/0x110
> [<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0
> INITIAL USE at:
> [<ffffffff810b7d34>] __lock_acquire+0x214/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
> [<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
> [<ffffffff811e5c1c>] dput+0xbc/0x120
> [<ffffffff811cebc2>] __fput+0x1d2/0x310
> [<ffffffff811cedae>] ____fput+0xe/0x10
> [<ffffffff8107cc2d>] task_work_run+0xad/0xe0
> [<ffffffff81003be5>] do_notify_resume+0x75/0x80
> [<ffffffff8163b01a>] int_signal+0x12/0x17
> }
> ... key at: [<ffffffff82b59f34>] __key.23573+0x0/0xc
> ... acquired at:
> [<ffffffff810b79c1>] validate_chain+0x6e1/0x840
> [<ffffffff810b7e87>] __lock_acquire+0x367/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff816305ae>] _raw_spin_lock+0x3e/0x80
> [<ffffffff8118c82b>] list_lru_add+0x5b/0xf0
> [<ffffffff81161180>] page_cache_tree_delete+0x140/0x1a0
> [<ffffffff81161230>] __delete_from_page_cache+0x50/0x1c0
> [<ffffffff8117571d>] __remove_mapping+0x9d/0x170
> [<ffffffff81177347>] shrink_page_list+0x617/0x7f0
> [<ffffffff811783aa>] shrink_inactive_list+0x26a/0x520
> [<ffffffff81178cc6>] shrink_lruvec+0x336/0x420
> [<ffffffff81178e0c>] shrink_zone+0x5c/0x120
> [<ffffffff8117944b>] kswapd_shrink_zone+0xfb/0x1c0
> [<ffffffff81179fe9>] balance_pgdat+0x389/0x520
> [<ffffffff8117b92f>] kswapd+0x1bf/0x380
> [<ffffffff81080abe>] kthread+0xee/0x110
> [<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0
>
> -> (&(&mapping->tree_lock)->rlock){..-.-.} ops: 11597 {
> IN-SOFTIRQ-W at:
> [<ffffffff810b6469>] mark_irqflags+0x109/0x190
> [<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff81630760>] _raw_spin_lock_irqsave+0x50/0x90
> [<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
> [<ffffffff811604f0>] end_page_writeback+0x20/0x60
> [<ffffffffa00c17b8>] ext4_finish_bio+0x168/0x220 [ext4]
> [<ffffffffa00c1c37>] ext4_end_bio+0x97/0xe0 [ext4]
> [<ffffffff8120b9e3>] bio_endio+0x53/0xa0
> [<ffffffff812da333>] blk_update_request+0x213/0x430
> [<ffffffff812da577>] blk_update_bidi_request+0x27/0xb0
> [<ffffffff812db18f>] blk_end_bidi_request+0x2f/0x80
> [<ffffffff812db230>] blk_end_request+0x10/0x20
> [<ffffffff8143d770>] scsi_end_request+0x40/0xb0
> [<ffffffff8143daff>] scsi_io_completion+0x9f/0x6c0
> [<ffffffff81433204>] scsi_finish_command+0xd4/0x140
> [<ffffffff8143e28f>] scsi_softirq_done+0x14f/0x170
> [<ffffffff812e3144>] blk_done_softirq+0x84/0xa0
> [<ffffffff8105b29d>] __do_softirq+0x12d/0x430
> [<ffffffff8105b6d5>] irq_exit+0xc5/0xd0
> [<ffffffff8163cc47>] do_IRQ+0x67/0x110
> [<ffffffff8163156f>] ret_from_intr+0x0/0x13
> [<ffffffff8100dd26>] arch_cpu_idle+0x26/0x30
> [<ffffffff810cac39>] cpu_idle_loop+0xa9/0x3c0
> [<ffffffff810caf73>] cpu_startup_entry+0x23/0x30
> [<ffffffff81623434>] rest_init+0xf4/0x170
> [<ffffffff81f491dc>] start_kernel+0x346/0x34d
> [<ffffffff81f483a8>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81f484d8>] x86_64_start_kernel+0xf5/0xfc
> IN-RECLAIM_FS-W at:
> [<ffffffff810b6426>] mark_irqflags+0xc6/0x190
> [<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff816306d4>] _raw_spin_lock_irq+0x44/0x80
> [<ffffffff811756d5>] __remove_mapping+0x55/0x170
> [<ffffffff81177347>] shrink_page_list+0x617/0x7f0
> [<ffffffff811783aa>] shrink_inactive_list+0x26a/0x520
> [<ffffffff81178cc6>] shrink_lruvec+0x336/0x420
> [<ffffffff81178e0c>] shrink_zone+0x5c/0x120
> [<ffffffff8117944b>] kswapd_shrink_zone+0xfb/0x1c0
> [<ffffffff81179fe9>] balance_pgdat+0x389/0x520
> [<ffffffff8117b92f>] kswapd+0x1bf/0x380
> [<ffffffff81080abe>] kthread+0xee/0x110
> [<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0
> INITIAL USE at:
> [<ffffffff810b7d34>] __lock_acquire+0x214/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff816306d4>] _raw_spin_lock_irq+0x44/0x80
> [<ffffffff81161810>] __add_to_page_cache_locked+0xa0/0x1d0
> [<ffffffff81161968>] add_to_page_cache_lru+0x28/0x80
> [<ffffffff81162c28>] grab_cache_page_write_begin+0x98/0xe0
> [<ffffffff811f95e4>] simple_write_begin+0x34/0x100
> [<ffffffff8115fc9a>] generic_perform_write+0xca/0x210
> [<ffffffff8115fe43>] generic_file_buffered_write+0x63/0xa0
> [<ffffffff81163aea>] __generic_file_aio_write+0x1ca/0x3c0
> [<ffffffff81163d46>] generic_file_aio_write+0x66/0xb0
> [<ffffffff811cc20f>] do_sync_write+0x5f/0xa0
> [<ffffffff811ce127>] vfs_write+0xc7/0x1f0
> [<ffffffff811ce362>] SyS_write+0x62/0xb0
> [<ffffffff81f4b0c6>] do_copy+0x2b/0xb0
> [<ffffffff81f4ad46>] flush_buffer+0x7d/0xa3
> [<ffffffff81f7e9ab>] gunzip+0x287/0x330
> [<ffffffff81f4b967>] unpack_to_rootfs+0x167/0x293
> [<ffffffff81f4bb97>] populate_rootfs+0x62/0xdf
> [<ffffffff810002a2>] do_one_initcall+0xd2/0x180
> [<ffffffff81f48933>] do_basic_setup+0x9d/0xc0
> [<ffffffff81f48bd6>] kernel_init_freeable+0x280/0x303
> [<ffffffff816234be>] kernel_init+0xe/0x130
> [<ffffffff8163ac6c>] ret_from_fork+0x7c/0xb0
> }
> ... key at: [<ffffffff82bb7f98>] __key.41448+0x0/0x8
> ... acquired at:
> [<ffffffff810b51f0>] check_usage_forwards+0x90/0x110
> [<ffffffff810b5f4f>] mark_lock_irq+0x9f/0x2c0
> [<ffffffff810b628c>] mark_lock+0x11c/0x1f0
> [<ffffffff810b6469>] mark_irqflags+0x109/0x190
> [<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff81630760>] _raw_spin_lock_irqsave+0x50/0x90
> [<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
> [<ffffffff811604f0>] end_page_writeback+0x20/0x60
> [<ffffffffa00c17b8>] ext4_finish_bio+0x168/0x220 [ext4]
> [<ffffffffa00c1c37>] ext4_end_bio+0x97/0xe0 [ext4]
> [<ffffffff8120b9e3>] bio_endio+0x53/0xa0
> [<ffffffff812da333>] blk_update_request+0x213/0x430
> [<ffffffff812da577>] blk_update_bidi_request+0x27/0xb0
> [<ffffffff812db18f>] blk_end_bidi_request+0x2f/0x80
> [<ffffffff812db230>] blk_end_request+0x10/0x20
> [<ffffffff8143d770>] scsi_end_request+0x40/0xb0
> [<ffffffff8143daff>] scsi_io_completion+0x9f/0x6c0
> [<ffffffff81433204>] scsi_finish_command+0xd4/0x140
> [<ffffffff8143e28f>] scsi_softirq_done+0x14f/0x170
> [<ffffffff812e3144>] blk_done_softirq+0x84/0xa0
> [<ffffffff8105b29d>] __do_softirq+0x12d/0x430
> [<ffffffff8105b6d5>] irq_exit+0xc5/0xd0
> [<ffffffff8163cc47>] do_IRQ+0x67/0x110
> [<ffffffff8163156f>] ret_from_intr+0x0/0x13
> [<ffffffff8100dd26>] arch_cpu_idle+0x26/0x30
> [<ffffffff810cac39>] cpu_idle_loop+0xa9/0x3c0
> [<ffffffff810caf73>] cpu_startup_entry+0x23/0x30
> [<ffffffff81623434>] rest_init+0xf4/0x170
> [<ffffffff81f491dc>] start_kernel+0x346/0x34d
> [<ffffffff81f483a8>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81f484d8>] x86_64_start_kernel+0xf5/0xfc
>
>
> stack backtrace:
> CPU: 0 PID: 0 Comm: swapper/0 Tainted: GF 3.14.0-rc1-00099-gde05561 #126
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/20/2012
> ffffffff8233b140 ffff8800792038d8 ffffffff8162aa99 0000000000000002
> ffffffff8233b140 ffff880079203928 ffffffff810b5118 ffffffff81a08d4a
> ffffffff81a08d4a ffff880079203928 ffffffff81c110a8 ffff880079203938
> Call Trace:
> <IRQ> [<ffffffff8162aa99>] dump_stack+0x51/0x70
> [<ffffffff810b5118>] print_irq_inversion_bug+0x1c8/0x210
> [<ffffffff810b51f0>] check_usage_forwards+0x90/0x110
> [<ffffffff8107cec8>] ? __kernel_text_address+0x58/0x80
> [<ffffffff810b5160>] ? print_irq_inversion_bug+0x210/0x210
> [<ffffffff810b5f4f>] mark_lock_irq+0x9f/0x2c0
> [<ffffffff810b628c>] mark_lock+0x11c/0x1f0
> [<ffffffff810b6469>] mark_irqflags+0x109/0x190
> [<ffffffff810b7edc>] __lock_acquire+0x3bc/0x5e0
> [<ffffffff810b819e>] lock_acquire+0x9e/0x170
> [<ffffffff8116f3e8>] ? test_clear_page_writeback+0x48/0x190
> [<ffffffff81048aed>] ? __change_page_attr_set_clr+0x4d/0xb0
> [<ffffffff81630760>] _raw_spin_lock_irqsave+0x50/0x90
> [<ffffffff8116f3e8>] ? test_clear_page_writeback+0x48/0x190
> [<ffffffff8116f3e8>] test_clear_page_writeback+0x48/0x190
> [<ffffffffa00c1825>] ? ext4_finish_bio+0x1d5/0x220 [ext4]
> [<ffffffff811604f0>] end_page_writeback+0x20/0x60
> [<ffffffffa00c17b8>] ext4_finish_bio+0x168/0x220 [ext4]
> [<ffffffffa00c18ec>] ? ext4_release_io_end+0x7c/0x100 [ext4]
> [<ffffffff812da079>] ? blk_account_io_completion+0x119/0x1c0
> [<ffffffffa00c1c37>] ext4_end_bio+0x97/0xe0 [ext4]
> [<ffffffff8120b9e3>] bio_endio+0x53/0xa0
> [<ffffffff812da333>] blk_update_request+0x213/0x430
> [<ffffffff812da577>] blk_update_bidi_request+0x27/0xb0
> [<ffffffff812db18f>] blk_end_bidi_request+0x2f/0x80
> [<ffffffff812db230>] blk_end_request+0x10/0x20
> [<ffffffff8143d770>] scsi_end_request+0x40/0xb0
> [<ffffffff816310c0>] ? _raw_spin_unlock_irqrestore+0x40/0x70
> [<ffffffff8143daff>] scsi_io_completion+0x9f/0x6c0
> [<ffffffff810b6add>] ? trace_hardirqs_on+0xd/0x10
> [<ffffffff81433204>] scsi_finish_command+0xd4/0x140
> [<ffffffff8143e28f>] scsi_softirq_done+0x14f/0x170
> [<ffffffff812e3144>] blk_done_softirq+0x84/0xa0
> [<ffffffff8105b29d>] __do_softirq+0x12d/0x430
> [<ffffffff8105b6d5>] irq_exit+0xc5/0xd0
> [<ffffffff8163cc47>] do_IRQ+0x67/0x110
> [<ffffffff8163156f>] common_interrupt+0x6f/0x6f
> <EOI> [<ffffffff8100e426>] ? default_idle+0x26/0x210
> [<ffffffff8100e424>] ? default_idle+0x24/0x210
> [<ffffffff8100dd26>] arch_cpu_idle+0x26/0x30
> [<ffffffff810cac39>] cpu_idle_loop+0xa9/0x3c0
> [<ffffffff810caf73>] cpu_startup_entry+0x23/0x30
> [<ffffffff81623434>] rest_init+0xf4/0x170
> [<ffffffff81623340>] ? csum_partial_copy_generic+0x170/0x170
> [<ffffffff81f491dc>] start_kernel+0x346/0x34d
> [<ffffffff81f48cb4>] ? repair_env_string+0x5b/0x5b
> [<ffffffff816293ea>] ? memblock_reserve+0x49/0x4e
> [<ffffffff81f483a8>] x86_64_start_reservations+0x2a/0x2c
> [<ffffffff81f484d8>] x86_64_start_kernel+0xf5/0xfc

2014-02-13 22:24:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

On Thu, 13 Feb 2014 17:11:26 -0500 Johannes Weiner <[email protected]> wrote:

> Hi Tetsuo,
>
> On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> > Hello.
> >
> > I got a lockdep warning shown below, and the bad commit seems to be de055616
> > \"mm: keep page cache radix tree nodes in check\" as of next-20140212
> > on linux-next.git.
>
> Thanks for the report. There is already a fix for this in -mm:
> http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
>
> It was merged on the 7th, so it should show up in -next... any day
> now?

Tomorrow, if Stephen works weekends, which I expect he doesn't ;)

http://ozlabs.org/~akpm/mmotm/broken-out/mm-keep-page-cache-radix-tree-nodes-in-check-fix.patch

2014-02-14 00:29:43

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

Hi Andrew,

On Thu, 13 Feb 2014 14:24:07 -0800 Andrew Morton <[email protected]> wrote:
>
> On Thu, 13 Feb 2014 17:11:26 -0500 Johannes Weiner <[email protected]> wrote:
>
> > On Thu, Feb 13, 2014 at 12:21:17PM +0900, Tetsuo Handa wrote:
> > > Hello.
> > >
> > > I got a lockdep warning shown below, and the bad commit seems to be de055616
> > > \"mm: keep page cache radix tree nodes in check\" as of next-20140212
> > > on linux-next.git.
> >
> > Thanks for the report. There is already a fix for this in -mm:
> > http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> >
> > It was merged on the 7th, so it should show up in -next... any day
> > now?
>
> Tomorrow, if Stephen works weekends, which I expect he doesn't ;)

Actually later today (my time), since Friday is not a weekend :-(

--
Cheers,
Stephen Rothwell [email protected]



2014-02-14 06:05:55

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

Johannes Weiner wrote:
> Thanks for the report. There is already a fix for this in -mm:
> http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
>
> It was merged on the 7th, so it should show up in -next... any day
> now?

That patch solved this problem but breaks the build instead.

ERROR: \"list_lru_init_key\" [fs/xfs/xfs.ko] undefined!
ERROR: \"list_lru_init_key\" [fs/gfs2/gfs2.ko] undefined!
make[1]: *** [__modpost] Error 1

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 2a5b8fd..f1a0db1 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -143,7 +143,7 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
}
return 0;
}
-EXPORT_SYMBOL_GPL(list_lru_init);
+EXPORT_SYMBOL_GPL(list_lru_init_key);

void list_lru_destroy(struct list_lru *lru)
{
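
(The errors come from modpost: the function was renamed to list_lru_init_key()
but the EXPORT_SYMBOL_GPL() still named the old symbol, so modular users such
as xfs and gfs2 cannot resolve it at load time. A hypothetical modular caller,
sketched with made-up names, shows why the exported symbol has to match.)

#include <linux/module.h>
#include <linux/list_lru.h>

static struct list_lru example_lru;
static struct lock_class_key example_lru_key;

static int __init example_init(void)
{
	/* resolved against the exported list_lru_init_key at load time */
	return list_lru_init_key(&example_lru, &example_lru_key);
}

static void __exit example_exit(void)
{
	list_lru_destroy(&example_lru);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");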

2014-02-14 15:30:48

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 00/10] mm: thrash detection-based file cache sizing v9

On Fri, Feb 14, 2014 at 03:05:41PM +0900, Tetsuo Handa wrote:
> Johannes Weiner wrote:
> > Thanks for the report. There is already a fix for this in -mm:
> > http://marc.info/?l=linux-mm-commits&m=139180637114625&w=2
> >
> > It was merged on the 7th, so it should show up in -next... any day
> > now?
>
> That patch solved this problem but breaks the build instead.
>
> ERROR: \"list_lru_init_key\" [fs/xfs/xfs.ko] undefined!
> ERROR: \"list_lru_init_key\" [fs/gfs2/gfs2.ko] undefined!
> make[1]: *** [__modpost] Error 1

There is a follow-up fix in -mm:

http://marc.info/?l=linux-mm-commits&m=139180636814624&w=2