This is another revision of the radix tree / workingset patches based
on feedback from Linus and Jan. Thanks for your input.
This is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow
entries without radix tree node"). That patch fixed an issue that was
caused mainly by the page cache sneaking special shadow page entries
into the radix tree and relying on subtleties in the radix tree code
to make that work. The fix also had to stop tracking refaults for
single-page files because shadow pages stored as direct pointers in
radix_tree_root->rnode weren't properly handled during tree extension.
These patches make the radix tree code explicitly support and track
such special entries, to eliminate the subtleties and to restore the
thrash detection for single-page files.
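For reference, the special entries in question are the radix tree's
exceptional entries: value pointers with bit 1 set, as tested by
radix_tree_exceptional_entry(). A simplified excerpt of the existing
definitions (for orientation only, not part of this series):

	/* include/linux/radix-tree.h */
	#define RADIX_TREE_EXCEPTIONAL_ENTRY	2

	static inline int radix_tree_exceptional_entry(void *arg)
	{
		/* non-zero if the pointer encodes a value, not a page */
		return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
	}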
arch/s390/mm/gmap.c | 2 +-
drivers/sh/intc/virq.c | 2 +-
fs/dax.c | 9 ++--
include/linux/radix-tree.h | 30 ++++--------
include/linux/swap.h | 32 -------------
lib/radix-tree.c | 84 +++++++++++++++++++++++++++++++--
mm/filemap.c | 41 +++++-----------
mm/khugepaged.c | 8 ++--
mm/migrate.c | 4 +-
mm/shmem.c | 8 ++--
mm/truncate.c | 6 +--
mm/workingset.c | 25 ++++++----
tools/testing/radix-tree/multiorder.c | 2 +-
13 files changed, 140 insertions(+), 113 deletions(-)
The radix tree counts valid entries in each tree node. Entries stored
in the tree cannot be removed by simply storing NULL in the slot, or
the internal counters will be off and the node will never get freed.
When collapsing a shmem page fails, restore the holes that were filled
with radix_tree_insert() with a proper radix tree deletion.
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Reported-by: Jan Kara <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/khugepaged.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 728d7790dc2d..eac6f0580e26 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
if (!nr_none)
break;
/* Put holes back where they were */
- radix_tree_replace_slot(slot, NULL);
+ radix_tree_delete(&mapping->page_tree,
+ iter.index);
nr_none--;
continue;
}
--
2.10.1
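For illustration, a minimal sketch of the failure mode the fix above
avoids (hypothetical caller, using the radix_tree_replace_slot()
signature as it exists before this series):

	/*
	 * Wrong: hides the entry, but node->count still counts it, so
	 * the node can never become empty and be freed:
	 */
	radix_tree_replace_slot(slot, NULL);

	/*
	 * Right: a proper deletion adjusts the counters and frees the
	 * node once it is empty:
	 */
	radix_tree_delete(&mapping->page_tree, iter.index);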
When the shadow page shrinker tries to reclaim a radix tree node but
finds it in an unexpected state - it should contain no pages and a
non-zero number of shadow entries - there is no need to kill the executing task
or even the entire system. Warn about the invalid state, then leave
that tree node be. Simply don't put it back on the shadow LRU for
future reclaim and move on.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/workingset.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 617475f529f4..3cfc61d84a52 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,23 +418,27 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
* no pages, so we expect to be able to remove them all and
* delete and free the empty node afterwards.
*/
- BUG_ON(!workingset_node_shadows(node));
- BUG_ON(workingset_node_pages(node));
-
+ if (WARN_ON_ONCE(!workingset_node_shadows(node)))
+ goto out_invalid;
+ if (WARN_ON_ONCE(workingset_node_pages(node)))
+ goto out_invalid;
for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
if (node->slots[i]) {
- BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
+ if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
+ goto out_invalid;
+ if (WARN_ON_ONCE(!mapping->nrexceptional))
+ goto out_invalid;
node->slots[i] = NULL;
workingset_node_shadows_dec(node);
- BUG_ON(!mapping->nrexceptional);
mapping->nrexceptional--;
}
}
- BUG_ON(workingset_node_shadows(node));
+ if (WARN_ON_ONCE(workingset_node_shadows(node)))
+ goto out_invalid;
inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
- if (!__radix_tree_delete_node(&mapping->page_tree, node))
- BUG();
+ __radix_tree_delete_node(&mapping->page_tree, node);
+out_invalid:
spin_unlock(&mapping->tree_lock);
ret = LRU_REMOVED_RETRY;
out:
--
2.10.1
Shadow entries in the page cache used to be accounted behind the radix
tree implementation's back in the upper bits of node->count, and the
radix tree code extending a single-entry tree with a shadow entry in
root->rnode would corrupt that counter. As a result, we could not put
shadow entries at index 0 if the tree didn't have any other entries,
and that means no refault detection for any single-page file.
Now that the shadow entries are tracked natively in the radix tree's
exceptional counter, this is no longer necessary. Extending and
shrinking the tree from and to single entries in root->rnode now does
the right thing when the entry is exceptional, so remove that limitation.
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/filemap.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 438f0b54f8fd..55a3b136a527 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -178,19 +178,12 @@ static void page_cache_tree_delete(struct address_space *mapping,
radix_tree_clear_tags(&mapping->page_tree, node, slot);
- if (!node) {
- VM_BUG_ON_PAGE(nr != 1, page);
- /*
- * We need a node to properly account shadow
- * entries. Don't plant any without. XXX
- */
- shadow = NULL;
- }
-
__radix_tree_replace(&mapping->page_tree, node, slot, shadow);
- if (!node)
+ if (!node) {
+ VM_BUG_ON_PAGE(nr != 1, page);
break;
+ }
if (!shadow &&
__radix_tree_delete_node(&mapping->page_tree, node))
--
2.10.1
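To make the restored case concrete, a sketch of the index-0 path
(simplified, assuming mapping->tree_lock is held): evicting the only
page of a file can now plant its shadow entry directly in root->rnode,
and the extension code accounts for it if the tree later grows:

	/* shadow is an exceptional entry packed by the workingset code */
	__radix_tree_replace(&mapping->page_tree, node, slot, shadow);

	/*
	 * If another entry is inserted later, radix_tree_extend() moves
	 * the exceptional root->rnode entry into node->slots[0] and
	 * sets node->exceptional to 1, keeping the accounting intact.
	 */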
The bug in khugepaged (fixed in the first patch of this series) shows
that radix tree slot replacement is fragile; and it will become more
so when not only NULL<->!NULL transitions need to be accounted but
transitions from and to exceptional entries as well. We need checks.
Re-implement radix_tree_replace_slot() on top of the sanity-checked
__radix_tree_replace(). This requires existing callers to also pass
the radix tree root, but it'll warn us when somebody replaces slots
with contents that need proper accounting (transitions between NULL
entries, real entries, exceptional entries) and where a replacement
through the slot pointer would corrupt the radix tree node counts.
Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
arch/s390/mm/gmap.c | 2 +-
drivers/sh/intc/virq.c | 2 +-
fs/dax.c | 4 +--
include/linux/radix-tree.h | 16 ++-------
lib/radix-tree.c | 64 +++++++++++++++++++++++++++--------
mm/filemap.c | 4 +--
mm/khugepaged.c | 5 +--
mm/migrate.c | 4 +--
mm/truncate.c | 2 +-
tools/testing/radix-tree/multiorder.c | 2 +-
10 files changed, 64 insertions(+), 41 deletions(-)
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3ba622702ce4..ec1f0dedb948 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -1015,7 +1015,7 @@ static inline void gmap_insert_rmap(struct gmap *sg, unsigned long vmaddr,
if (slot) {
rmap->next = radix_tree_deref_slot_protected(slot,
&sg->guest_table_lock);
- radix_tree_replace_slot(slot, rmap);
+ radix_tree_replace_slot(&sg->host_to_rmap, slot, rmap);
} else {
rmap->next = NULL;
radix_tree_insert(&sg->host_to_rmap, vmaddr >> PAGE_SHIFT,
diff --git a/drivers/sh/intc/virq.c b/drivers/sh/intc/virq.c
index e7899624aa0b..35bbe288ddb4 100644
--- a/drivers/sh/intc/virq.c
+++ b/drivers/sh/intc/virq.c
@@ -254,7 +254,7 @@ static void __init intc_subgroup_map(struct intc_desc_int *d)
radix_tree_tag_clear(&d->tree, entry->enum_id,
INTC_TAG_VIRQ_NEEDS_ALLOC);
- radix_tree_replace_slot((void **)entries[i],
+ radix_tree_replace_slot(&d->tree, (void **)entries[i],
&intc_irq_xlate[irq]);
}
diff --git a/fs/dax.c b/fs/dax.c
index db78bae0dc0f..85930c2a2749 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -342,7 +342,7 @@ static inline void *lock_slot(struct address_space *mapping, void **slot)
radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
entry |= RADIX_DAX_ENTRY_LOCK;
- radix_tree_replace_slot(slot, (void *)entry);
+ radix_tree_replace_slot(&mapping->page_tree, slot, (void *)entry);
return (void *)entry;
}
@@ -356,7 +356,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
entry &= ~(unsigned long)RADIX_DAX_ENTRY_LOCK;
- radix_tree_replace_slot(slot, (void *)entry);
+ radix_tree_replace_slot(&mapping->page_tree, slot, (void *)entry);
return (void *)entry;
}
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 7ced8a70cc8b..2d1b9b8be983 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -249,20 +249,6 @@ static inline int radix_tree_exception(void *arg)
return unlikely((unsigned long)arg & RADIX_TREE_ENTRY_MASK);
}
-/**
- * radix_tree_replace_slot - replace item in a slot
- * @pslot: pointer to slot, returned by radix_tree_lookup_slot
- * @item: new item to store in the slot.
- *
- * For use with radix_tree_lookup_slot(). Caller must hold tree write locked
- * across slot lookup and replacement.
- */
-static inline void radix_tree_replace_slot(void **pslot, void *item)
-{
- BUG_ON(radix_tree_is_internal_node(item));
- rcu_assign_pointer(*pslot, item);
-}
-
int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
unsigned order, struct radix_tree_node **nodep,
void ***slotp);
@@ -280,6 +266,8 @@ void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
void __radix_tree_replace(struct radix_tree_root *root,
struct radix_tree_node *node,
void **slot, void *item);
+void radix_tree_replace_slot(struct radix_tree_root *root,
+ void **slot, void *item);
bool __radix_tree_delete_node(struct radix_tree_root *root,
struct radix_tree_node *node);
void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index ddc6facbf4da..5cbdd911931e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -753,19 +753,10 @@ void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
}
EXPORT_SYMBOL(radix_tree_lookup);
-/**
- * __radix_tree_replace - replace item in a slot
- * @root: radix tree root
- * @node: pointer to tree node
- * @slot: pointer to slot in @node
- * @item: new item to store in the slot.
- *
- * For use with __radix_tree_lookup(). Caller must hold tree write locked
- * across slot lookup and replacement.
- */
-void __radix_tree_replace(struct radix_tree_root *root,
- struct radix_tree_node *node,
- void **slot, void *item)
+static void replace_slot(struct radix_tree_root *root,
+ struct radix_tree_node *node,
+ void **slot, void *item,
+ bool warn_typeswitch)
{
void *old = rcu_dereference_raw(*slot);
int count, exceptional;
@@ -776,8 +767,7 @@ void __radix_tree_replace(struct radix_tree_root *root,
exceptional = !!radix_tree_exceptional_entry(item) -
!!radix_tree_exceptional_entry(old);
- WARN_ONCE(!node && slot != (void **)&root->rnode &&
- (count || exceptional),
+ WARN_ONCE(warn_typeswitch && (count || exceptional),
"Unaccounted slot replacement: old=%p item=%p count=%d exceptional=%d\n",
old, item, count, exceptional);
@@ -789,6 +779,50 @@ void __radix_tree_replace(struct radix_tree_root *root,
}
/**
+ * __radix_tree_replace - replace item in a slot
+ * @root: radix tree root
+ * @node: pointer to tree node
+ * @slot: pointer to slot in @node
+ * @item: new item to store in the slot.
+ *
+ * For use with __radix_tree_lookup(). Caller must hold tree write locked
+ * across slot lookup and replacement.
+ */
+void __radix_tree_replace(struct radix_tree_root *root,
+ struct radix_tree_node *node,
+ void **slot, void *item)
+{
+ /*
+ * This function supports replacing entries with different types
+ * (NULL, regular entries, exceptional entries), but that needs
+ * accounting against the node - unless the slot is root->rnode.
+ */
+ replace_slot(root, node, slot, item,
+ !node && slot != (void **)&root->rnode);
+}
+
+/**
+ * radix_tree_replace_slot - replace item in a slot
+ * @root: radix tree root
+ * @slot: pointer to slot
+ * @item: new item to store in the slot.
+ *
+ * For use with radix_tree_lookup_slot(), radix_tree_gang_lookup_slot(),
+ * radix_tree_gang_lookup_tag_slot(). Caller must hold tree write locked
+ * across slot lookup and replacement.
+ *
+ * NOTE: This cannot be used to switch between non-entries (empty slots),
+ * regular entries, and exceptional entries, as that requires accounting
+ * inside the radix tree node. When switching from one type of entry to
+ * another, use __radix_tree_lookup() and __radix_tree_replace().
+ */
+void radix_tree_replace_slot(struct radix_tree_root *root,
+ void **slot, void *item)
+{
+ replace_slot(root, NULL, slot, item, true);
+}
+
+/**
* radix_tree_tag_set - set a tag on a radix tree node
* @root: radix tree root
* @index: index key
diff --git a/mm/filemap.c b/mm/filemap.c
index c7fe2f16503f..eb463156f29a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -147,7 +147,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
false);
}
}
- radix_tree_replace_slot(slot, page);
+ radix_tree_replace_slot(&mapping->page_tree, slot, page);
mapping->nrpages++;
if (node) {
workingset_node_pages_inc(node);
@@ -193,7 +193,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
shadow = NULL;
}
- radix_tree_replace_slot(slot, shadow);
+ radix_tree_replace_slot(&mapping->page_tree, slot, shadow);
if (!node)
break;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eac6f0580e26..fed8d5e96978 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1421,7 +1421,7 @@ static void collapse_shmem(struct mm_struct *mm,
list_add_tail(&page->lru, &pagelist);
/* Finally, replace with the new page. */
- radix_tree_replace_slot(slot,
+ radix_tree_replace_slot(&mapping->page_tree, slot,
new_page + (index % HPAGE_PMD_NR));
index++;
@@ -1531,7 +1531,8 @@ static void collapse_shmem(struct mm_struct *mm,
/* Unfreeze the page. */
list_del(&page->lru);
page_ref_unfreeze(page, 2);
- radix_tree_replace_slot(slot, page);
+ radix_tree_replace_slot(&mapping->page_tree,
+ slot, page);
spin_unlock_irq(&mapping->tree_lock);
putback_lru_page(page);
unlock_page(page);
diff --git a/mm/migrate.c b/mm/migrate.c
index 99250aee1ac1..9b88e4e37d0a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -482,7 +482,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
SetPageDirty(newpage);
}
- radix_tree_replace_slot(pslot, newpage);
+ radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
/*
* Drop cache reference from old page by unfreezing
@@ -556,7 +556,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
get_page(newpage);
- radix_tree_replace_slot(pslot, newpage);
+ radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
page_ref_unfreeze(page, expected_count - 1);
diff --git a/mm/truncate.c b/mm/truncate.c
index a01cce450a26..6ae44571d4c7 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -49,7 +49,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
goto unlock;
if (*slot != entry)
goto unlock;
- radix_tree_replace_slot(slot, NULL);
+ radix_tree_replace_slot(&mapping->page_tree, slot, NULL);
mapping->nrexceptional--;
if (!node)
goto unlock;
diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
index 05d7bc488971..d1be94667a30 100644
--- a/tools/testing/radix-tree/multiorder.c
+++ b/tools/testing/radix-tree/multiorder.c
@@ -146,7 +146,7 @@ static void multiorder_check(unsigned long index, int order)
slot = radix_tree_lookup_slot(&tree, index);
free(*slot);
- radix_tree_replace_slot(slot, item2);
+ radix_tree_replace_slot(&tree, slot, item2);
for (i = min; i < max; i++) {
struct item *item = item_lookup(&tree, i);
assert(item != 0);
--
2.10.1
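In sketch form, what call sites look like after this change
(hypothetical callers): same-type replacements keep the slot interface,
now with the root passed along, while type-changing replacements must
use the node-aware variant:

	struct radix_tree_node *node;
	void **slot;

	/* page-for-page replacement: no accounting change needed */
	slot = radix_tree_lookup_slot(&mapping->page_tree, index);
	radix_tree_replace_slot(&mapping->page_tree, slot, newpage);

	/* page-to-shadow transition: must be accounted in the node */
	__radix_tree_lookup(&mapping->page_tree, index, &node, &slot);
	__radix_tree_replace(&mapping->page_tree, node, slot, shadow);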
Currently, we track the shadow entries in the page cache in the upper
bits of the radix_tree_node->count, behind the back of the radix tree
implementation. Because the radix tree code has no awareness of them,
we rely on random subtleties throughout the implementation (such as
the node->count != 1 check in the shrinking code which is meant to
exclude multi-entry nodes, but also happens to skip nodes with only
one shadow entry since it's accounted in the upper bits). This is
error prone and has, in fact, caused the bug fixed in d3798ae8c6f3
("mm: filemap: don't plant shadow entries without radix tree node").
To remove these subtleties, this patch moves shadow entry tracking
from the upper bits of node->count to the existing counter for
exceptional entries. node->count goes back to being a simple counter
of valid entries in the tree node and can be shrunk to a single byte.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/radix-tree.h | 6 +-----
include/linux/swap.h | 32 --------------------------------
mm/filemap.c | 30 +++++++++++-------------------
mm/truncate.c | 4 +---
mm/workingset.c | 11 +++++++----
5 files changed, 20 insertions(+), 63 deletions(-)
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 2d1b9b8be983..56619703ae7a 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -80,14 +80,10 @@ static inline bool radix_tree_is_internal_node(void *ptr)
#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
RADIX_TREE_MAP_SHIFT))
-/* Internally used bits of node->count */
-#define RADIX_TREE_COUNT_SHIFT (RADIX_TREE_MAP_SHIFT + 1)
-#define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
-
struct radix_tree_node {
unsigned char shift; /* Bits remaining in each slot */
unsigned char offset; /* Slot offset in parent */
- unsigned int count; /* Total entry count */
+ unsigned char count; /* Total entry count */
unsigned char exceptional; /* Exceptional entry count */
union {
struct {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a56523cefb9b..660a11de0186 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -248,38 +248,6 @@ bool workingset_refault(void *shadow);
void workingset_activation(struct page *page);
extern struct list_lru workingset_shadow_nodes;
-static inline unsigned int workingset_node_pages(struct radix_tree_node *node)
-{
- return node->count & RADIX_TREE_COUNT_MASK;
-}
-
-static inline void workingset_node_pages_inc(struct radix_tree_node *node)
-{
- node->count++;
-}
-
-static inline void workingset_node_pages_dec(struct radix_tree_node *node)
-{
- VM_WARN_ON_ONCE(!workingset_node_pages(node));
- node->count--;
-}
-
-static inline unsigned int workingset_node_shadows(struct radix_tree_node *node)
-{
- return node->count >> RADIX_TREE_COUNT_SHIFT;
-}
-
-static inline void workingset_node_shadows_inc(struct radix_tree_node *node)
-{
- node->count += 1U << RADIX_TREE_COUNT_SHIFT;
-}
-
-static inline void workingset_node_shadows_dec(struct radix_tree_node *node)
-{
- VM_WARN_ON_ONCE(!workingset_node_shadows(node));
- node->count -= 1U << RADIX_TREE_COUNT_SHIFT;
-}
-
/* linux/mm/page_alloc.c */
extern unsigned long totalram_pages;
extern unsigned long totalreserve_pages;
diff --git a/mm/filemap.c b/mm/filemap.c
index eb463156f29a..438f0b54f8fd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -132,25 +132,19 @@ static int page_cache_tree_insert(struct address_space *mapping,
if (!dax_mapping(mapping)) {
if (shadowp)
*shadowp = p;
- if (node)
- workingset_node_shadows_dec(node);
} else {
/* DAX can replace empty locked entry with a hole */
WARN_ON_ONCE(p !=
(void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
RADIX_DAX_ENTRY_LOCK));
- /* DAX accounts exceptional entries as normal pages */
- if (node)
- workingset_node_pages_dec(node);
/* Wakeup waiters for exceptional entry lock */
dax_wake_mapping_entry_waiter(mapping, page->index,
false);
}
}
- radix_tree_replace_slot(&mapping->page_tree, slot, page);
+ __radix_tree_replace(&mapping->page_tree, node, slot, page);
mapping->nrpages++;
if (node) {
- workingset_node_pages_inc(node);
/*
* Don't track node that contains actual pages.
*
@@ -193,29 +187,27 @@ static void page_cache_tree_delete(struct address_space *mapping,
shadow = NULL;
}
- radix_tree_replace_slot(&mapping->page_tree, slot, shadow);
+ __radix_tree_replace(&mapping->page_tree, node, slot, shadow);
if (!node)
break;
- workingset_node_pages_dec(node);
- if (shadow)
- workingset_node_shadows_inc(node);
- else
- if (__radix_tree_delete_node(&mapping->page_tree, node))
- continue;
+ if (!shadow &&
+ __radix_tree_delete_node(&mapping->page_tree, node))
+ continue;
/*
- * Track node that only contains shadow entries. DAX mappings
- * contain no shadow entries and may contain other exceptional
- * entries so skip those.
+ * Track node that only contains shadow entries. DAX and SHMEM
+ * mappings contain no shadow entries and may contain other
+ * exceptional entries so skip those.
*
* Avoid acquiring the list_lru lock if already tracked.
* The list_empty() test is safe as node->private_list is
* protected by mapping->tree_lock.
*/
- if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
- list_empty(&node->private_list)) {
+ if (!dax_mapping(mapping) && !shmem_mapping(mapping) &&
+ node->count == node->exceptional &&
+ list_empty(&node->private_list)) {
node->private_data = mapping;
list_lru_add(&workingset_shadow_nodes,
&node->private_list);
diff --git a/mm/truncate.c b/mm/truncate.c
index 6ae44571d4c7..d3ce5f261f47 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -53,7 +53,6 @@ static void clear_exceptional_entry(struct address_space *mapping,
mapping->nrexceptional--;
if (!node)
goto unlock;
- workingset_node_shadows_dec(node);
/*
* Don't track node without shadow entries.
*
@@ -61,8 +60,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
* The list_empty() test is safe as node->private_list is
* protected by mapping->tree_lock.
*/
- if (!workingset_node_shadows(node) &&
- !list_empty(&node->private_list))
+ if (!node->exceptional && !list_empty(&node->private_list))
list_lru_del(&workingset_shadow_nodes,
&node->private_list);
__radix_tree_delete_node(&mapping->page_tree, node);
diff --git a/mm/workingset.c b/mm/workingset.c
index 3cfc61d84a52..ca92d0f70d9a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -418,22 +418,25 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
* no pages, so we expect to be able to remove them all and
* delete and free the empty node afterwards.
*/
- if (WARN_ON_ONCE(!workingset_node_shadows(node)))
+ if (WARN_ON_ONCE(!node->exceptional))
goto out_invalid;
- if (WARN_ON_ONCE(workingset_node_pages(node)))
+ if (WARN_ON_ONCE(node->count != node->exceptional))
goto out_invalid;
for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
if (node->slots[i]) {
if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
goto out_invalid;
+ if (WARN_ON_ONCE(!node->exceptional))
+ goto out_invalid;
if (WARN_ON_ONCE(!mapping->nrexceptional))
goto out_invalid;
node->slots[i] = NULL;
- workingset_node_shadows_dec(node);
+ node->exceptional--;
+ node->count--;
mapping->nrexceptional--;
}
}
- if (WARN_ON_ONCE(workingset_node_shadows(node)))
+ if (WARN_ON_ONCE(node->exceptional))
goto out_invalid;
inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
__radix_tree_delete_node(&mapping->page_tree, node);
--
2.10.1
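In sketch form, the test this enables (an illustrative helper, not code
from the series): with both counters native to the node, "contains only
shadow entries" becomes an explicit comparison instead of a side effect
of upper-bits accounting:

	/* illustrative only: node holds nothing but exceptional entries */
	static inline bool node_only_shadows(struct radix_tree_node *node)
	{
		return node->count && node->count == node->exceptional;
	}

A node with a single shadow entry used to have node->count ==
1U << RADIX_TREE_COUNT_SHIFT, which passed the shrinking code's
"node->count != 1" test only by accident; now it is plainly
node->count == 1 && node->exceptional == 1.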
The way the page cache is sneaking shadow entries of evicted pages
into the radix tree past the node entry accounting and tracking them
manually in the upper bits of node->count is fraught with problems.
These shadow entries are marked in the tree as exceptional entries,
a concept native to the radix tree. Maintain an explicit
counter of exceptional entries in the radix tree node. Subsequent
patches will switch shadow entry tracking over to that counter.
DAX and shmem are the other users of exceptional entries. Since slot
replacements that change the entry type from regular to exceptional
must now be accounted, introduce a __radix_tree_replace() function
that does replacement and accounting, and switch DAX and shmem over.
The increase in radix tree node size is temporary. The last patch will
transition from the shadow count in the upper bits of node->count to
the new exceptional entry counter and shrink node->count to one byte.
Signed-off-by: Johannes Weiner <[email protected]>
---
fs/dax.c | 5 +++--
include/linux/radix-tree.h | 10 +++++++---
lib/radix-tree.c | 50 +++++++++++++++++++++++++++++++++++++++++++---
mm/shmem.c | 8 ++++----
4 files changed, 61 insertions(+), 12 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 014defd2e744..db78bae0dc0f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -643,12 +643,13 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
}
mapping->nrexceptional++;
} else {
+ struct radix_tree_node *node;
void **slot;
void *ret;
- ret = __radix_tree_lookup(page_tree, index, NULL, &slot);
+ ret = __radix_tree_lookup(page_tree, index, &node, &slot);
WARN_ON_ONCE(ret != entry);
- radix_tree_replace_slot(slot, new_entry);
+ __radix_tree_replace(page_tree, node, slot, new_entry);
}
if (vmf->flags & FAULT_FLAG_WRITE)
radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index af3581b8a451..7ced8a70cc8b 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -85,9 +85,10 @@ static inline bool radix_tree_is_internal_node(void *ptr)
#define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
struct radix_tree_node {
- unsigned char shift; /* Bits remaining in each slot */
- unsigned char offset; /* Slot offset in parent */
- unsigned int count;
+ unsigned char shift; /* Bits remaining in each slot */
+ unsigned char offset; /* Slot offset in parent */
+ unsigned int count; /* Total entry count */
+ unsigned char exceptional; /* Exceptional entry count */
union {
struct {
/* Used when ascending tree */
@@ -276,6 +277,9 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
struct radix_tree_node **nodep, void ***slotp);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
+void __radix_tree_replace(struct radix_tree_root *root,
+ struct radix_tree_node *node,
+ void **slot, void *item);
bool __radix_tree_delete_node(struct radix_tree_root *root,
struct radix_tree_node *node);
void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8e6d552c40dd..ddc6facbf4da 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -220,10 +220,10 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
{
unsigned long i;
- pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n",
+ pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d exceptional %d parent %p\n",
node, node->offset,
node->tags[0][0], node->tags[1][0], node->tags[2][0],
- node->shift, node->count, node->parent);
+ node->shift, node->count, node->exceptional, node->parent);
for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
unsigned long first = index | (i << node->shift);
@@ -522,8 +522,13 @@ static int radix_tree_extend(struct radix_tree_root *root,
node->offset = 0;
node->count = 1;
node->parent = NULL;
- if (radix_tree_is_internal_node(slot))
+ if (radix_tree_is_internal_node(slot)) {
entry_to_node(slot)->parent = node;
+ } else {
+ /* Moving an exceptional root->rnode to a node */
+ if (radix_tree_exceptional_entry(slot))
+ node->exceptional = 1;
+ }
node->slots[0] = slot;
slot = node_to_entry(node);
rcu_assign_pointer(root->rnode, slot);
@@ -649,6 +654,8 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
if (node) {
unsigned offset = get_slot_offset(node, slot);
node->count++;
+ if (radix_tree_exceptional_entry(item))
+ node->exceptional++;
BUG_ON(tag_get(node, 0, offset));
BUG_ON(tag_get(node, 1, offset));
BUG_ON(tag_get(node, 2, offset));
@@ -747,6 +754,41 @@ void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
EXPORT_SYMBOL(radix_tree_lookup);
/**
+ * __radix_tree_replace - replace item in a slot
+ * @root: radix tree root
+ * @node: pointer to tree node
+ * @slot: pointer to slot in @node
+ * @item: new item to store in the slot.
+ *
+ * For use with __radix_tree_lookup(). Caller must hold tree write locked
+ * across slot lookup and replacement.
+ */
+void __radix_tree_replace(struct radix_tree_root *root,
+ struct radix_tree_node *node,
+ void **slot, void *item)
+{
+ void *old = rcu_dereference_raw(*slot);
+ int count, exceptional;
+
+ WARN_ON_ONCE(radix_tree_is_internal_node(item));
+
+ count = !!item - !!old;
+ exceptional = !!radix_tree_exceptional_entry(item) -
+ !!radix_tree_exceptional_entry(old);
+
+ WARN_ONCE(!node && slot != (void **)&root->rnode &&
+ (count || exceptional),
+ "Unaccounted slot replacement: old=%p item=%p count=%d exceptional=%d\n",
+ old, item, count, exceptional);
+
+ if (node) {
+ node->count += count;
+ node->exceptional += exceptional;
+ }
+ rcu_assign_pointer(*slot, item);
+}
+
+/**
* radix_tree_tag_set - set a tag on a radix tree node
* @root: radix tree root
* @index: index key
@@ -1561,6 +1603,8 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
delete_sibling_entries(node, node_to_entry(slot), offset);
node->slots[offset] = NULL;
node->count--;
+ if (radix_tree_exceptional_entry(entry))
+ node->exceptional--;
__radix_tree_delete_node(root, node);
diff --git a/mm/shmem.c b/mm/shmem.c
index ad7813d73ea7..7f3a08df25c9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -300,18 +300,18 @@ void shmem_uncharge(struct inode *inode, long pages)
static int shmem_radix_tree_replace(struct address_space *mapping,
pgoff_t index, void *expected, void *replacement)
{
+ struct radix_tree_node *node;
void **pslot;
void *item;
VM_BUG_ON(!expected);
VM_BUG_ON(!replacement);
- pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
- if (!pslot)
+ item = __radix_tree_lookup(&mapping->page_tree, index, &node, &pslot);
+ if (!item)
return -ENOENT;
- item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
if (item != expected)
return -ENOENT;
- radix_tree_replace_slot(pslot, replacement);
+ __radix_tree_replace(&mapping->page_tree, node, pslot, replacement);
return 0;
}
--
2.10.1
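A worked example of the accounting deltas computed in
__radix_tree_replace() above (illustrative values):

	/*
	 * count = !!item - !!old, exceptional likewise:
	 *   old = NULL,   item = page:    count = +1, exceptional =  0
	 *   old = page,   item = shadow:  count =  0, exceptional = +1
	 *   old = shadow, item = NULL:    count = -1, exceptional = -1
	 */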
On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> The radix tree counts valid entries in each tree node. Entries stored
> in the tree cannot be removed by simply storing NULL in the slot, or
> the internal counters will be off and the node will never get freed.
>
> When collapsing a shmem page fails, restore the holes that were filled
> with radix_tree_insert() with a proper radix tree deletion.
>
> Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> Reported-by: Jan Kara <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/khugepaged.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 728d7790dc2d..eac6f0580e26 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> if (!nr_none)
> break;
> /* Put holes back where they were */
> - radix_tree_replace_slot(slot, NULL);
> + radix_tree_delete(&mapping->page_tree,
> + iter.index);
Hum, but this is inside radix_tree_for_each_slot() iteration. And
radix_tree_delete() may end up freeing nodes resulting in invalidating
current slot pointer and the iteration code will do use-after-free.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 07-11-16 14:07:37, Johannes Weiner wrote:
> When the shadow page shrinker tries to reclaim a radix tree node but
> finds it in an unexpected state - it should contain no pages and a
> non-zero number of shadow entries - there is no need to kill the executing task
> or even the entire system. Warn about the invalid state, then leave
> that tree node be. Simply don't put it back on the shadow LRU for
> future reclaim and move on.
>
> Signed-off-by: Johannes Weiner <[email protected]>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> mm/workingset.c | 20 ++++++++++++--------
> 1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 617475f529f4..3cfc61d84a52 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -418,23 +418,27 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
> * no pages, so we expect to be able to remove them all and
> * delete and free the empty node afterwards.
> */
> - BUG_ON(!workingset_node_shadows(node));
> - BUG_ON(workingset_node_pages(node));
> -
> + if (WARN_ON_ONCE(!workingset_node_shadows(node)))
> + goto out_invalid;
> + if (WARN_ON_ONCE(workingset_node_pages(node)))
> + goto out_invalid;
> for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> if (node->slots[i]) {
> - BUG_ON(!radix_tree_exceptional_entry(node->slots[i]));
> + if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
> + goto out_invalid;
> + if (WARN_ON_ONCE(!mapping->nrexceptional))
> + goto out_invalid;
> node->slots[i] = NULL;
> workingset_node_shadows_dec(node);
> - BUG_ON(!mapping->nrexceptional);
> mapping->nrexceptional--;
> }
> }
> - BUG_ON(workingset_node_shadows(node));
> + if (WARN_ON_ONCE(workingset_node_shadows(node)))
> + goto out_invalid;
> inc_node_state(page_pgdat(virt_to_page(node)), WORKINGSET_NODERECLAIM);
> - if (!__radix_tree_delete_node(&mapping->page_tree, node))
> - BUG();
> + __radix_tree_delete_node(&mapping->page_tree, node);
>
> +out_invalid:
> spin_unlock(&mapping->tree_lock);
> ret = LRU_REMOVED_RETRY;
> out:
> --
> 2.10.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 07-11-16 14:07:38, Johannes Weiner wrote:
> The way the page cache is sneaking shadow entries of evicted pages
> into the radix tree past the node entry accounting and tracking them
> manually in the upper bits of node->count is fraught with problems.
>
> These shadow entries are marked in the tree as exceptional entries,
> a concept native to the radix tree. Maintain an explicit
> counter of exceptional entries in the radix tree node. Subsequent
> patches will switch shadow entry tracking over to that counter.
>
> DAX and shmem are the other users of exceptional entries. Since slot
> replacements that change the entry type from regular to exceptional
> must now be accounted, introduce a __radix_tree_replace() function
> that does replacement and accounting, and switch DAX and shmem over.
>
> The increase in radix tree node size is temporary. The last patch will
> transition from the shadow count in the upper bits of node->count to
> the new exceptional entry counter and shrink node->count to one byte.
>
> Signed-off-by: Johannes Weiner <[email protected]>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/dax.c | 5 +++--
> include/linux/radix-tree.h | 10 +++++++---
> lib/radix-tree.c | 50 +++++++++++++++++++++++++++++++++++++++++++---
> mm/shmem.c | 8 ++++----
> 4 files changed, 61 insertions(+), 12 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 014defd2e744..db78bae0dc0f 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -643,12 +643,13 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
> }
> mapping->nrexceptional++;
> } else {
> + struct radix_tree_node *node;
> void **slot;
> void *ret;
>
> - ret = __radix_tree_lookup(page_tree, index, NULL, &slot);
> + ret = __radix_tree_lookup(page_tree, index, &node, &slot);
> WARN_ON_ONCE(ret != entry);
> - radix_tree_replace_slot(slot, new_entry);
> + __radix_tree_replace(page_tree, node, slot, new_entry);
> }
> if (vmf->flags & FAULT_FLAG_WRITE)
> radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index af3581b8a451..7ced8a70cc8b 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -85,9 +85,10 @@ static inline bool radix_tree_is_internal_node(void *ptr)
> #define RADIX_TREE_COUNT_MASK ((1UL << RADIX_TREE_COUNT_SHIFT) - 1)
>
> struct radix_tree_node {
> - unsigned char shift; /* Bits remaining in each slot */
> - unsigned char offset; /* Slot offset in parent */
> - unsigned int count;
> + unsigned char shift; /* Bits remaining in each slot */
> + unsigned char offset; /* Slot offset in parent */
> + unsigned int count; /* Total entry count */
> + unsigned char exceptional; /* Exceptional entry count */
> union {
> struct {
> /* Used when ascending tree */
> @@ -276,6 +277,9 @@ void *__radix_tree_lookup(struct radix_tree_root *root, unsigned long index,
> struct radix_tree_node **nodep, void ***slotp);
> void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
> void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> +void __radix_tree_replace(struct radix_tree_root *root,
> + struct radix_tree_node *node,
> + void **slot, void *item);
> bool __radix_tree_delete_node(struct radix_tree_root *root,
> struct radix_tree_node *node);
> void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 8e6d552c40dd..ddc6facbf4da 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -220,10 +220,10 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
> {
> unsigned long i;
>
> - pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d parent %p\n",
> + pr_debug("radix node: %p offset %d tags %lx %lx %lx shift %d count %d exceptional %d parent %p\n",
> node, node->offset,
> node->tags[0][0], node->tags[1][0], node->tags[2][0],
> - node->shift, node->count, node->parent);
> + node->shift, node->count, node->exceptional, node->parent);
>
> for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
> unsigned long first = index | (i << node->shift);
> @@ -522,8 +522,13 @@ static int radix_tree_extend(struct radix_tree_root *root,
> node->offset = 0;
> node->count = 1;
> node->parent = NULL;
> - if (radix_tree_is_internal_node(slot))
> + if (radix_tree_is_internal_node(slot)) {
> entry_to_node(slot)->parent = node;
> + } else {
> + /* Moving an exceptional root->rnode to a node */
> + if (radix_tree_exceptional_entry(slot))
> + node->exceptional = 1;
> + }
> node->slots[0] = slot;
> slot = node_to_entry(node);
> rcu_assign_pointer(root->rnode, slot);
> @@ -649,6 +654,8 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,
> if (node) {
> unsigned offset = get_slot_offset(node, slot);
> node->count++;
> + if (radix_tree_exceptional_entry(item))
> + node->exceptional++;
> BUG_ON(tag_get(node, 0, offset));
> BUG_ON(tag_get(node, 1, offset));
> BUG_ON(tag_get(node, 2, offset));
> @@ -747,6 +754,41 @@ void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
> EXPORT_SYMBOL(radix_tree_lookup);
>
> /**
> + * __radix_tree_replace - replace item in a slot
> + * @root: radix tree root
> + * @node: pointer to tree node
> + * @slot: pointer to slot in @node
> + * @item: new item to store in the slot.
> + *
> + * For use with __radix_tree_lookup(). Caller must hold tree write locked
> + * across slot lookup and replacement.
> + */
> +void __radix_tree_replace(struct radix_tree_root *root,
> + struct radix_tree_node *node,
> + void **slot, void *item)
> +{
> + void *old = rcu_dereference_raw(*slot);
> + int count, exceptional;
> +
> + WARN_ON_ONCE(radix_tree_is_internal_node(item));
> +
> + count = !!item - !!old;
> + exceptional = !!radix_tree_exceptional_entry(item) -
> + !!radix_tree_exceptional_entry(old);
> +
> + WARN_ONCE(!node && slot != (void **)&root->rnode &&
> + (count || exceptional),
> + "Unaccounted slot replacement: old=%p item=%p count=%d exceptional=%d\n",
> + old, item, count, exceptional);
> +
> + if (node) {
> + node->count += count;
> + node->exceptional += exceptional;
> + }
> + rcu_assign_pointer(*slot, item);
> +}
> +
> +/**
> * radix_tree_tag_set - set a tag on a radix tree node
> * @root: radix tree root
> * @index: index key
> @@ -1561,6 +1603,8 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
> delete_sibling_entries(node, node_to_entry(slot), offset);
> node->slots[offset] = NULL;
> node->count--;
> + if (radix_tree_exceptional_entry(entry))
> + node->exceptional--;
>
> __radix_tree_delete_node(root, node);
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ad7813d73ea7..7f3a08df25c9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -300,18 +300,18 @@ void shmem_uncharge(struct inode *inode, long pages)
> static int shmem_radix_tree_replace(struct address_space *mapping,
> pgoff_t index, void *expected, void *replacement)
> {
> + struct radix_tree_node *node;
> void **pslot;
> void *item;
>
> VM_BUG_ON(!expected);
> VM_BUG_ON(!replacement);
> - pslot = radix_tree_lookup_slot(&mapping->page_tree, index);
> - if (!pslot)
> + item = __radix_tree_lookup(&mapping->page_tree, index, &node, &pslot);
> + if (!item)
> return -ENOENT;
> - item = radix_tree_deref_slot_protected(pslot, &mapping->tree_lock);
> if (item != expected)
> return -ENOENT;
> - radix_tree_replace_slot(pslot, replacement);
> + __radix_tree_replace(&mapping->page_tree, node, pslot, replacement);
> return 0;
> }
>
> --
> 2.10.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 07-11-16 14:07:39, Johannes Weiner wrote:
> The bug in khugepaged (fixed in the first patch of this series) shows
> that radix tree slot replacement is fragile; and it will become more
> so when not only NULL<->!NULL transitions need to be accounted but
> transitions from and to exceptional entries as well. We need checks.
>
> Re-implement radix_tree_replace_slot() on top of the sanity-checked
> __radix_tree_replace(). This requires existing callers to also pass
> the radix tree root, but it'll warn us when somebody replaces slots
> with contents that need proper accounting (transitions between NULL
> entries, real entries, exceptional entries) and where a replacement
> through the slot pointer would corrupt the radix tree node counts.
>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
The patch looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> arch/s390/mm/gmap.c | 2 +-
> drivers/sh/intc/virq.c | 2 +-
> fs/dax.c | 4 +--
> include/linux/radix-tree.h | 16 ++-------
> lib/radix-tree.c | 64 +++++++++++++++++++++++++++--------
> mm/filemap.c | 4 +--
> mm/khugepaged.c | 5 +--
> mm/migrate.c | 4 +--
> mm/truncate.c | 2 +-
> tools/testing/radix-tree/multiorder.c | 2 +-
> 10 files changed, 64 insertions(+), 41 deletions(-)
>
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index 3ba622702ce4..ec1f0dedb948 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -1015,7 +1015,7 @@ static inline void gmap_insert_rmap(struct gmap *sg, unsigned long vmaddr,
> if (slot) {
> rmap->next = radix_tree_deref_slot_protected(slot,
> &sg->guest_table_lock);
> - radix_tree_replace_slot(slot, rmap);
> + radix_tree_replace_slot(&sg->host_to_rmap, slot, rmap);
> } else {
> rmap->next = NULL;
> radix_tree_insert(&sg->host_to_rmap, vmaddr >> PAGE_SHIFT,
> diff --git a/drivers/sh/intc/virq.c b/drivers/sh/intc/virq.c
> index e7899624aa0b..35bbe288ddb4 100644
> --- a/drivers/sh/intc/virq.c
> +++ b/drivers/sh/intc/virq.c
> @@ -254,7 +254,7 @@ static void __init intc_subgroup_map(struct intc_desc_int *d)
>
> radix_tree_tag_clear(&d->tree, entry->enum_id,
> INTC_TAG_VIRQ_NEEDS_ALLOC);
> - radix_tree_replace_slot((void **)entries[i],
> + radix_tree_replace_slot(&d->tree, (void **)entries[i],
> &intc_irq_xlate[irq]);
> }
>
> diff --git a/fs/dax.c b/fs/dax.c
> index db78bae0dc0f..85930c2a2749 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -342,7 +342,7 @@ static inline void *lock_slot(struct address_space *mapping, void **slot)
> radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
>
> entry |= RADIX_DAX_ENTRY_LOCK;
> - radix_tree_replace_slot(slot, (void *)entry);
> + radix_tree_replace_slot(&mapping->page_tree, slot, (void *)entry);
> return (void *)entry;
> }
>
> @@ -356,7 +356,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
> radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
>
> entry &= ~(unsigned long)RADIX_DAX_ENTRY_LOCK;
> - radix_tree_replace_slot(slot, (void *)entry);
> + radix_tree_replace_slot(&mapping->page_tree, slot, (void *)entry);
> return (void *)entry;
> }
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index 7ced8a70cc8b..2d1b9b8be983 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -249,20 +249,6 @@ static inline int radix_tree_exception(void *arg)
> return unlikely((unsigned long)arg & RADIX_TREE_ENTRY_MASK);
> }
>
> -/**
> - * radix_tree_replace_slot - replace item in a slot
> - * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> - * @item: new item to store in the slot.
> - *
> - * For use with radix_tree_lookup_slot(). Caller must hold tree write locked
> - * across slot lookup and replacement.
> - */
> -static inline void radix_tree_replace_slot(void **pslot, void *item)
> -{
> - BUG_ON(radix_tree_is_internal_node(item));
> - rcu_assign_pointer(*pslot, item);
> -}
> -
> int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
> unsigned order, struct radix_tree_node **nodep,
> void ***slotp);
> @@ -280,6 +266,8 @@ void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
> void __radix_tree_replace(struct radix_tree_root *root,
> struct radix_tree_node *node,
> void **slot, void *item);
> +void radix_tree_replace_slot(struct radix_tree_root *root,
> + void **slot, void *item);
> bool __radix_tree_delete_node(struct radix_tree_root *root,
> struct radix_tree_node *node);
> void *radix_tree_delete_item(struct radix_tree_root *, unsigned long, void *);
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index ddc6facbf4da..5cbdd911931e 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -753,19 +753,10 @@ void *radix_tree_lookup(struct radix_tree_root *root, unsigned long index)
> }
> EXPORT_SYMBOL(radix_tree_lookup);
>
> -/**
> - * __radix_tree_replace - replace item in a slot
> - * @root: radix tree root
> - * @node: pointer to tree node
> - * @slot: pointer to slot in @node
> - * @item: new item to store in the slot.
> - *
> - * For use with __radix_tree_lookup(). Caller must hold tree write locked
> - * across slot lookup and replacement.
> - */
> -void __radix_tree_replace(struct radix_tree_root *root,
> - struct radix_tree_node *node,
> - void **slot, void *item)
> +static void replace_slot(struct radix_tree_root *root,
> + struct radix_tree_node *node,
> + void **slot, void *item,
> + bool warn_typeswitch)
> {
> void *old = rcu_dereference_raw(*slot);
> int count, exceptional;
> @@ -776,8 +767,7 @@ void __radix_tree_replace(struct radix_tree_root *root,
> exceptional = !!radix_tree_exceptional_entry(item) -
> !!radix_tree_exceptional_entry(old);
>
> - WARN_ONCE(!node && slot != (void **)&root->rnode &&
> - (count || exceptional),
> + WARN_ONCE(warn_typeswitch && (count || exceptional),
> "Unaccounted slot replacement: old=%p item=%p count=%d exceptional=%d\n",
> old, item, count, exceptional);
>
> @@ -789,6 +779,50 @@ void __radix_tree_replace(struct radix_tree_root *root,
> }
>
> /**
> + * __radix_tree_replace - replace item in a slot
> + * @root: radix tree root
> + * @node: pointer to tree node
> + * @slot: pointer to slot in @node
> + * @item: new item to store in the slot.
> + *
> + * For use with __radix_tree_lookup(). Caller must hold tree write locked
> + * across slot lookup and replacement.
> + */
> +void __radix_tree_replace(struct radix_tree_root *root,
> + struct radix_tree_node *node,
> + void **slot, void *item)
> +{
> + /*
> + * This function supports replacing entries with different types
> + * (NULL, regular entries, exceptional entries), but that needs
> + * accounting against the node - unless the slot is root->rnode.
> + */
> + replace_slot(root, node, slot, item,
> + !node && slot != (void **)&root->rnode);
> +}
> +
> +/**
> + * radix_tree_replace_slot - replace item in a slot
> + * @root: radix tree root
> + * @slot: pointer to slot
> + * @item: new item to store in the slot.
> + *
> + * For use with radix_tree_lookup_slot(), radix_tree_gang_lookup_slot(),
> + * radix_tree_gang_lookup_tag_slot(). Caller must hold tree write locked
> + * across slot lookup and replacement.
> + *
> + * NOTE: This cannot be used to switch between non-entries (empty slots),
> + * regular entries, and exceptional entries, as that requires accounting
> + * inside the radix tree node. When switching from one type of entry to
> + * another, use __radix_tree_lookup() and __radix_tree_replace().
> + */
> +void radix_tree_replace_slot(struct radix_tree_root *root,
> + void **slot, void *item)
> +{
> + replace_slot(root, NULL, slot, item, true);
> +}
> +
> +/**
> * radix_tree_tag_set - set a tag on a radix tree node
> * @root: radix tree root
> * @index: index key
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c7fe2f16503f..eb463156f29a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -147,7 +147,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
> false);
> }
> }
> - radix_tree_replace_slot(slot, page);
> + radix_tree_replace_slot(&mapping->page_tree, slot, page);
> mapping->nrpages++;
> if (node) {
> workingset_node_pages_inc(node);
> @@ -193,7 +193,7 @@ static void page_cache_tree_delete(struct address_space *mapping,
> shadow = NULL;
> }
>
> - radix_tree_replace_slot(slot, shadow);
> + radix_tree_replace_slot(&mapping->page_tree, slot, shadow);
>
> if (!node)
> break;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index eac6f0580e26..fed8d5e96978 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1421,7 +1421,7 @@ static void collapse_shmem(struct mm_struct *mm,
> list_add_tail(&page->lru, &pagelist);
>
> /* Finally, replace with the new page. */
> - radix_tree_replace_slot(slot,
> + radix_tree_replace_slot(&mapping->page_tree, slot,
> new_page + (index % HPAGE_PMD_NR));
>
> index++;
> @@ -1531,7 +1531,8 @@ static void collapse_shmem(struct mm_struct *mm,
> /* Unfreeze the page. */
> list_del(&page->lru);
> page_ref_unfreeze(page, 2);
> - radix_tree_replace_slot(slot, page);
> + radix_tree_replace_slot(&mapping->page_tree,
> + slot, page);
> spin_unlock_irq(&mapping->tree_lock);
> putback_lru_page(page);
> unlock_page(page);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 99250aee1ac1..9b88e4e37d0a 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -482,7 +482,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
> SetPageDirty(newpage);
> }
>
> - radix_tree_replace_slot(pslot, newpage);
> + radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
>
> /*
> * Drop cache reference from old page by unfreezing
> @@ -556,7 +556,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
>
> get_page(newpage);
>
> - radix_tree_replace_slot(pslot, newpage);
> + radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
>
> page_ref_unfreeze(page, expected_count - 1);
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index a01cce450a26..6ae44571d4c7 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -49,7 +49,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
> goto unlock;
> if (*slot != entry)
> goto unlock;
> - radix_tree_replace_slot(slot, NULL);
> + radix_tree_replace_slot(&mapping->page_tree, slot, NULL);
> mapping->nrexceptional--;
> if (!node)
> goto unlock;
> diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
> index 05d7bc488971..d1be94667a30 100644
> --- a/tools/testing/radix-tree/multiorder.c
> +++ b/tools/testing/radix-tree/multiorder.c
> @@ -146,7 +146,7 @@ static void multiorder_check(unsigned long index, int order)
>
> slot = radix_tree_lookup_slot(&tree, index);
> free(*slot);
> - radix_tree_replace_slot(slot, item2);
> + radix_tree_replace_slot(&tree, slot, item2);
> for (i = min; i < max; i++) {
> struct item *item = item_lookup(&tree, i);
> assert(item != 0);
> --
> 2.10.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 07-11-16 14:07:40, Johannes Weiner wrote:
> Currently, we track the shadow entries in the page cache in the upper
> bits of the radix_tree_node->count, behind the back of the radix tree
> implementation. Because the radix tree code has no awareness of them,
> we rely on random subtleties throughout the implementation (such as
> the node->count != 1 check in the shrinking code which is meant to
> exclude multi-entry nodes, but also happens to skip nodes with only
> one shadow entry since it's accounted in the upper bits). This is
> error prone and has, in fact, caused the bug fixed in d3798ae8c6f3
> ("mm: filemap: don't plant shadow entries without radix tree node").
>
> To remove these subtleties, this patch moves shadow entry tracking
> from the upper bits of node->count to the existing counter for
> exceptional entries. node->count goes back to being a simple counter
> of valid entries in the tree node and can be shrunk to a single byte.
...
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 6ae44571d4c7..d3ce5f261f47 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -53,7 +53,6 @@ static void clear_exceptional_entry(struct address_space *mapping,
> mapping->nrexceptional--;
> if (!node)
> goto unlock;
> - workingset_node_shadows_dec(node);
> /*
> * Don't track node without shadow entries.
> *
> @@ -61,8 +60,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
> * The list_empty() test is safe as node->private_list is
> * protected by mapping->tree_lock.
> */
> - if (!workingset_node_shadows(node) &&
> - !list_empty(&node->private_list))
> + if (!node->exceptional && !list_empty(&node->private_list))
> list_lru_del(&workingset_shadow_nodes,
> &node->private_list);
> __radix_tree_delete_node(&mapping->page_tree, node);
Is this really correct now? The radix tree implementation can move a single
exceptional entry at index 0 from a node into a direct pointer and free
the node while it is still in the LRU list. Or am I missing something?
To fix this I'd prefer to just have a callback from radix tree code when it
is freeing a node, rather than trying to second-guess its implementation in
the page-cache code...
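To make the suggested callback concrete, a purely hypothetical sketch
(name and signature invented here, not an existing API): the tree would
notify its user before freeing a node, letting the page cache unlink it
from the shadow LRU:

	/* hypothetical: called by the radix tree before freeing @node */
	static void page_cache_node_freed(struct radix_tree_node *node)
	{
		/* tree_lock is held by whoever is modifying the tree */
		if (!list_empty(&node->private_list))
			list_lru_del(&workingset_shadow_nodes,
				     &node->private_list);
	}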
Otherwise the patch looks good to me and I really like the simplification!
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 07-11-16 14:07:41, Johannes Weiner wrote:
> Shadow entries in the page cache used to be accounted behind the radix
> tree implementation's back in the upper bits of node->count, and the
> radix tree code extending a single-entry tree with a shadow entry in
> root->rnode would corrupt that counter. As a result, we could not put
> shadow entries at index 0 if the tree didn't have any other entries,
> and that means no refault detection for any single-page file.
>
> Now that the shadow entries are tracked natively in the radix tree's
> exceptional counter, this is no longer necessary. Extending and
> shrinking the tree from and to single entries in root->rnode now does
> the right thing when the entry is exceptional, so remove that limitation.
>
> Signed-off-by: Johannes Weiner <[email protected]>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> mm/filemap.c | 13 +++----------
> 1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 438f0b54f8fd..55a3b136a527 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -178,19 +178,12 @@ static void page_cache_tree_delete(struct address_space *mapping,
>
> radix_tree_clear_tags(&mapping->page_tree, node, slot);
>
> - if (!node) {
> - VM_BUG_ON_PAGE(nr != 1, page);
> - /*
> - * We need a node to properly account shadow
> - * entries. Don't plant any without. XXX
> - */
> - shadow = NULL;
> - }
> -
> __radix_tree_replace(&mapping->page_tree, node, slot, shadow);
>
> - if (!node)
> + if (!node) {
> + VM_BUG_ON_PAGE(nr != 1, page);
> break;
> + }
>
> if (!shadow &&
> __radix_tree_delete_node(&mapping->page_tree, node))
> --
> 2.10.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > The radix tree counts valid entries in each tree node. Entries stored
> > in the tree cannot be removed by simply storing NULL in the slot, or
> > the internal counters will be off and the node will never get freed.
> >
> > When collapsing a shmem page fails, restore the holes that were filled
> > with radix_tree_insert() with a proper radix tree deletion.
> >
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Reported-by: Jan Kara <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/khugepaged.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 728d7790dc2d..eac6f0580e26 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > if (!nr_none)
> > break;
> > /* Put holes back where they were */
> > - radix_tree_replace_slot(slot, NULL);
> > + radix_tree_delete(&mapping->page_tree,
> > + iter.index);
>
> Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> radix_tree_delete() may end up freeing nodes, invalidating the current
> slot pointer, and the iteration code will do a use-after-free.
Good point, we need to do another tree lookup after the deletion.
But there are other instances in the code, where we drop the lock
temporarily and somebody else could delete the node from under us.
In the main collapse path, I *think* this is prevented by the fact
that when we drop the tree lock we still hold the page lock of the
regular page that's in the tree while we isolate and unmap it, thus
pinning the node. Even so, it would seem a little hairy to rely on that.
Kirill?
I'll update this patch and prepend another fix to the series that
addresses the other two lock dropping issues.
Thanks Jan.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index fed8d5e96978..1e43e77a98da 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1424,6 +1424,7 @@ static void collapse_shmem(struct mm_struct *mm,
radix_tree_replace_slot(&mapping->page_tree, slot,
new_page + (index % HPAGE_PMD_NR));
+ slot = radix_tree_iter_next(&iter);
index++;
continue;
out_lru:
@@ -1522,6 +1523,7 @@ static void collapse_shmem(struct mm_struct *mm,
/* Put holes back where they were */
radix_tree_delete(&mapping->page_tree,
iter.index);
+ slot = radix_tree_iter_next(&iter);
nr_none--;
continue;
}
@@ -1537,6 +1539,7 @@ static void collapse_shmem(struct mm_struct *mm,
putback_lru_page(page);
unlock_page(page);
spin_lock_irq(&mapping->tree_lock);
+ slot = radix_tree_iter_next(&iter);
}
VM_BUG_ON(nr_none);
spin_unlock_irq(&mapping->tree_lock);
On Tue, Nov 08, 2016 at 11:27:16AM +0100, Jan Kara wrote:
> On Mon 07-11-16 14:07:40, Johannes Weiner wrote:
> > Currently, we track the shadow entries in the page cache in the upper
> > bits of the radix_tree_node->count, behind the back of the radix tree
> > implementation. Because the radix tree code has no awareness of them,
> > we rely on random subtleties throughout the implementation (such as
> > the node->count != 1 check in the shrinking code which is meant to
> > exclude multi-entry nodes, but also happens to skip nodes with only
> > one shadow entry since it's accounted in the upper bits). This is
> > error prone and has, in fact, caused the bug fixed in d3798ae8c6f3
> > ("mm: filemap: don't plant shadow entries without radix tree node").
> >
> > To remove these subtleties, this patch moves shadow entry tracking
> > from the upper bits of node->count to the existing counter for
> > exceptional entries. node->count goes back to being a simple counter
> > of valid entries in the tree node and can be shrunk to a single byte.
>
> ...
>
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index 6ae44571d4c7..d3ce5f261f47 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -53,7 +53,6 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > mapping->nrexceptional--;
> > if (!node)
> > goto unlock;
> > - workingset_node_shadows_dec(node);
> > /*
> > * Don't track node without shadow entries.
> > *
> > @@ -61,8 +60,7 @@ static void clear_exceptional_entry(struct address_space *mapping,
> > * The list_empty() test is safe as node->private_list is
> > * protected by mapping->tree_lock.
> > */
> > - if (!workingset_node_shadows(node) &&
> > - !list_empty(&node->private_list))
> > + if (!node->exceptional && !list_empty(&node->private_list))
> > list_lru_del(&workingset_shadow_nodes,
> > &node->private_list);
> > __radix_tree_delete_node(&mapping->page_tree, node);
>
> Is this really correct now? The radix tree implementation can move a single
> exceptional entry at index 0 from a node into a direct pointer and free
> the node while it is still in the LRU list. Or am I missing something?
You're right. I missed that scenario.
> To fix this I'd prefer to just have a callback from radix tree code when it
> is freeing a node, rather than trying to second-guess its implementation in
> the page-cache code...
>
> Otherwise the patch looks good to me and I really like the simplification!
That's a good idea. I'll do away with __radix_tree_delete_node()
altogether and move not just the slot accounting but also the tree
shrinking and the maintenance callback into __radix_tree_replace().
The page cache can then simply do:
	__radix_tree_replace(&mapping->page_tree, node, slot, new,
			     workingset_node_update, mapping);
And workingset_node_update() gets called on every node that changes,
where it can track and untrack it depending on count & exceptional.
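A rough sketch of that callback, assuming the node keeps the new count and
exceptional fields and the shadow LRU machinery stays as it is today (the
final code may well differ):
	/* Called under mapping->tree_lock for every node that changes. */
	static void workingset_node_update(struct radix_tree_node *node,
					   void *private)
	{
		struct address_space *mapping = private;
		/* Only the regular page cache carries shadow entries. */
		if (dax_mapping(mapping) || shmem_mapping(mapping))
			return;
		if (node->count && node->count == node->exceptional) {
			/* Only shadow entries left: track for reclaim. */
			if (list_empty(&node->private_list))
				list_lru_add(&workingset_shadow_nodes,
					     &node->private_list);
		} else {
			/* Pages present, or node empty: stop tracking it. */
			if (!list_empty(&node->private_list))
				list_lru_del(&workingset_shadow_nodes,
					     &node->private_list);
		}
	}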
I'll give it some testing before posting it, but currently it's:
include/linux/radix-tree.h | 4 +-
include/linux/swap.h | 1 -
lib/radix-tree.c | 212 ++++++++++++++++++++-----------------------
mm/filemap.c | 48 +---------
mm/truncate.c | 16 +---
mm/workingset.c | 31 +++++--
6 files changed, 134 insertions(+), 178 deletions(-)
on top of the simplifications of this patch 5/6.
Thanks for your input, Jan!
On Tue 08-11-16 11:12:45, Johannes Weiner wrote:
> On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > The radix tree counts valid entries in each tree node. Entries stored
> > > in the tree cannot be removed by simply storing NULL in the slot or
> > > the internal counters will be off and the node never gets freed again.
> > >
> > > When collapsing a shmem page fails, restore the holes that were filled
> > > with radix_tree_insert() with a proper radix tree deletion.
> > >
> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > Reported-by: Jan Kara <[email protected]>
> > > Signed-off-by: Johannes Weiner <[email protected]>
> > > ---
> > > mm/khugepaged.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 728d7790dc2d..eac6f0580e26 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > if (!nr_none)
> > > break;
> > > /* Put holes back where they were */
> > > - radix_tree_replace_slot(slot, NULL);
> > > + radix_tree_delete(&mapping->page_tree,
> > > + iter.index);
> >
> > Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> > radix_tree_delete() may end up freeing nodes, invalidating the current
> > slot pointer, and the iteration code will do a use-after-free.
>
> Good point, we need to do another tree lookup after the deletion.
>
> But there are other instances in the code, where we drop the lock
> temporarily and somebody else could delete the node from under us.
>
> In the main collapse path, I *think* this is prevented by the fact
> that when we drop the tree lock we still hold the page lock of the
> regular page that's in the tree while we isolate and unmap it, thus
> pinning the node. Even so, it would seem a little hairy to rely on that.
Yeah, I think that is mostly right, but I'm not sure whether shrinking of
the radix tree into a direct pointer can't bite us here as well. Generally
that relies on the internal implementation of the radix tree and its
iterator, so what you did makes sense to me.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > The radix tree counts valid entries in each tree node. Entries stored
> > > in the tree cannot be removed by simply storing NULL in the slot or
> > > the internal counters will be off and the node never gets freed again.
> > >
> > > When collapsing a shmem page fails, restore the holes that were filled
> > > with radix_tree_insert() with a proper radix tree deletion.
> > >
> > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > Reported-by: Jan Kara <[email protected]>
> > > Signed-off-by: Johannes Weiner <[email protected]>
> > > ---
> > > mm/khugepaged.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 728d7790dc2d..eac6f0580e26 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > if (!nr_none)
> > > break;
> > > /* Put holes back where they were */
> > > - radix_tree_replace_slot(slot, NULL);
> > > + radix_tree_delete(&mapping->page_tree,
> > > + iter.index);
> >
> > Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> > radix_tree_delete() may end up freeing nodes, invalidating the current
> > slot pointer, and the iteration code will do a use-after-free.
>
> Good point, we need to do another tree lookup after the deletion.
>
> But there are other instances in the code, where we drop the lock
> temporarily and somebody else could delete the node from under us.
>
> In the main collapse path, I *think* this is prevented by the fact
> that when we drop the tree lock we still hold the page lock of the
> regular page that's in the tree while we isolate and unmap it, thus
> pinning the node. Even so, it would seem a little hairy to rely on that.
>
> Kirill?
[ sorry for delay ]
Yes, we make sure that the locked page still belongs to the radix tree and
back off if it doesn't. A locked page cannot be removed from the radix tree,
so we should be fine.
> I'll update this patch and prepend another fix to the series that
> addresses the other two lock dropping issues.
Feel free add my Acked-by.
--
Kirill A. Shutemov
On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > the internal counters will be off and the node never gets freed again.
> > > >
> > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > with radix_tree_insert() with a proper radix tree deletion.
> > > >
> > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > Reported-by: Jan Kara <[email protected]>
> > > > Signed-off-by: Johannes Weiner <[email protected]>
> > > > ---
> > > > mm/khugepaged.c | 3 ++-
> > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > if (!nr_none)
> > > > break;
> > > > /* Put holes back where they were */
> > > > - radix_tree_replace_slot(slot, NULL);
> > > > + radix_tree_delete(&mapping->page_tree,
> > > > + iter.index);
> > >
> > > Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> > > radix_tree_delete() may end up freeing nodes, invalidating the current
> > > slot pointer, and the iteration code will do a use-after-free.
> >
> > Good point, we need to do another tree lookup after the deletion.
> >
> > But there are other instances in the code, where we drop the lock
> > temporarily and somebody else could delete the node from under us.
> >
> > In the main collapse path, I *think* this is prevented by the fact
> > that when we drop the tree lock we still hold the page lock of the
> > regular page that's in the tree while we isolate and unmap it, thus
> > pinning the node. Even so, it would seem a little hairy to rely on that.
> >
> > Kirill?
>
> [ sorry for delay ]
>
> Yes, we make sure that the locked page still belongs to the radix tree and
> back off if it doesn't. A locked page cannot be removed from the radix tree,
> so we should be fine.
Well, it cannot be removed from the radix tree, but the radix tree code is
still free to collapse / expand the tree nodes as it sees fit (currently the
only real case is when changing a direct page pointer in the tree root into
a node pointer or vice versa, but still...). So code should not really assume
that the node a page is referenced from does not change once tree_lock is
dropped. That leads to subtle bugs...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > the internal counters will be off and the node never gets freed again.
> > > > >
> > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > >
> > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > Reported-by: Jan Kara <[email protected]>
> > > > > Signed-off-by: Johannes Weiner <[email protected]>
> > > > > ---
> > > > > mm/khugepaged.c | 3 ++-
> > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > --- a/mm/khugepaged.c
> > > > > +++ b/mm/khugepaged.c
> > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > > if (!nr_none)
> > > > > break;
> > > > > /* Put holes back where they were */
> > > > > - radix_tree_replace_slot(slot, NULL);
> > > > > + radix_tree_delete(&mapping->page_tree,
> > > > > + iter.index);
> > > >
> > > > Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> > > > radix_tree_delete() may end up freeing nodes, invalidating the current
> > > > slot pointer, and the iteration code will do a use-after-free.
> > >
> > > Good point, we need to do another tree lookup after the deletion.
> > >
> > > But there are other instances in the code, where we drop the lock
> > > temporarily and somebody else could delete the node from under us.
> > >
> > > In the main collapse path, I *think* this is prevented by the fact
> > > that when we drop the tree lock we still hold the page lock of the
> > > regular page that's in the tree while we isolate and unmap it, thus
> > > pinning the node. Even so, it would seem a little hairy to rely on that.
> > >
> > > Kirill?
> >
> > [ sorry for delay ]
> >
> > Yes, we make sure that the locked page still belongs to the radix tree and
> > back off if it doesn't. A locked page cannot be removed from the radix tree,
> > so we should be fine.
>
> Well, it cannot be removed from the radix tree, but the radix tree code is
> still free to collapse / expand the tree nodes as it sees fit (currently the
> only real case is when changing a direct page pointer in the tree root into
> a node pointer or vice versa, but still...). So code should not really assume
> that the node a page is referenced from does not change once tree_lock is
> dropped. That leads to subtle bugs...
Hm. Okay.
What is the right way to re-validate that the slot is still valid? Do I need
a full lookup again? Can I pin the node explicitly?
--
Kirill A. Shutemov
On Fri 11-11-16 19:37:53, Kirill A. Shutemov wrote:
> On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> > On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > > the internal counters will be off and the node never gets freed again.
> > > > > >
> > > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > > >
> > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > > Reported-by: Jan Kara <[email protected]>
> > > > > > Signed-off-by: Johannes Weiner <[email protected]>
> > > > > > ---
> > > > > > mm/khugepaged.c | 3 ++-
> > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > > --- a/mm/khugepaged.c
> > > > > > +++ b/mm/khugepaged.c
> > > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > > > if (!nr_none)
> > > > > > break;
> > > > > > /* Put holes back where they were */
> > > > > > - radix_tree_replace_slot(slot, NULL);
> > > > > > + radix_tree_delete(&mapping->page_tree,
> > > > > > + iter.index);
> > > > >
> > > > > Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> > > > > radix_tree_delete() may end up freeing nodes, invalidating the current
> > > > > slot pointer, and the iteration code will do a use-after-free.
> > > >
> > > > Good point, we need to do another tree lookup after the deletion.
> > > >
> > > > But there are other instances in the code, where we drop the lock
> > > > temporarily and somebody else could delete the node from under us.
> > > >
> > > > In the main collapse path, I *think* this is prevented by the fact
> > > > that when we drop the tree lock we still hold the page lock of the
> > > > regular page that's in the tree while we isolate and unmap it, thus
> > > > pinning the node. Even so, it would seem a little hairy to rely on that.
> > > >
> > > > Kirill?
> > >
> > > [ sorry for delay ]
> > >
> > > Yes, we make sure that the locked page still belongs to the radix tree and
> > > back off if it doesn't. A locked page cannot be removed from the radix tree,
> > > so we should be fine.
> >
> > Well, it cannot be removed from the radix tree, but the radix tree code is
> > still free to collapse / expand the tree nodes as it sees fit (currently the
> > only real case is when changing a direct page pointer in the tree root into
> > a node pointer or vice versa, but still...). So code should not really assume
> > that the node a page is referenced from does not change once tree_lock is
> > dropped. That leads to subtle bugs...
>
> Hm. Okay.
>
> What is the right way to re-validate that the slot is still valid? Do I need
> a full lookup again? Can I pin the node explicitly?
Full lookup is the only way to re-validate the slot. There is no way to pin
a radix tree node.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Nov 14, 2016 at 09:07:44AM +0100, Jan Kara wrote:
> On Fri 11-11-16 19:37:53, Kirill A. Shutemov wrote:
> > On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> > > On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > > > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > > > the internal counters will be off and the node never gets freed again.
> > > > > > >
> > > > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > > > >
> > > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > > > Reported-by: Jan Kara <[email protected]>
> > > > > > > Signed-off-by: Johannes Weiner <[email protected]>
> > > > > > > ---
> > > > > > > mm/khugepaged.c | 3 ++-
> > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > > > --- a/mm/khugepaged.c
> > > > > > > +++ b/mm/khugepaged.c
> > > > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > > > > if (!nr_none)
> > > > > > > break;
> > > > > > > /* Put holes back where they were */
> > > > > > > - radix_tree_replace_slot(slot, NULL);
> > > > > > > + radix_tree_delete(&mapping->page_tree,
> > > > > > > + iter.index);
> > > > > >
> > > > > > Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> > > > > > radix_tree_delete() may end up freeing nodes, invalidating the current
> > > > > > slot pointer, and the iteration code will do a use-after-free.
> > > > >
> > > > > Good point, we need to do another tree lookup after the deletion.
> > > > >
> > > > > But there are other instances in the code, where we drop the lock
> > > > > temporarily and somebody else could delete the node from under us.
> > > > >
> > > > > In the main collapse path, I *think* this is prevented by the fact
> > > > > that when we drop the tree lock we still hold the page lock of the
> > > > > regular page that's in the tree while we isolate and unmap it, thus
> > > > > pinning the node. Even so, it would seem a little hairy to rely on that.
> > > > >
> > > > > Kirill?
> > > >
> > > > [ sorry for delay ]
> > > >
> > > > Yes, we make sure that the locked page still belongs to the radix tree and
> > > > back off if it doesn't. A locked page cannot be removed from the radix tree,
> > > > so we should be fine.
> > >
> > > Well, it cannot be removed from the radix tree, but the radix tree code is
> > > still free to collapse / expand the tree nodes as it sees fit (currently the
> > > only real case is when changing a direct page pointer in the tree root into
> > > a node pointer or vice versa, but still...). So code should not really assume
> > > that the node a page is referenced from does not change once tree_lock is
> > > dropped. That leads to subtle bugs...
> >
> > Hm. Okay.
> >
> > What is the right way to re-validate that the slot is still valid? Do I need
> > a full lookup again? Can I pin the node explicitly?
>
> Full lookup is the only way to re-validate the slot. There is no way to pin
> a radix tree node.
I guess this should be enough:
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 728d7790dc2d..c5ef73588676 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
PAGE_SIZE, 0);
spin_lock_irq(&mapping->tree_lock);
-
+ slot = radix_tree_lookup_slot(&mapping->page_tree, index);
+ VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
+ &mapping->tree_lock), page);
VM_BUG_ON_PAGE(page_mapped(page), page);
/*
--
Kirill A. Shutemov
On Mon, Nov 14, 2016 at 05:29:02PM +0300, Kirill A. Shutemov wrote:
> On Mon, Nov 14, 2016 at 09:07:44AM +0100, Jan Kara wrote:
> > On Fri 11-11-16 19:37:53, Kirill A. Shutemov wrote:
> > > On Fri, Nov 11, 2016 at 01:22:24PM +0100, Jan Kara wrote:
> > > > On Fri 11-11-16 13:59:21, Kirill A. Shutemov wrote:
> > > > > On Tue, Nov 08, 2016 at 11:12:45AM -0500, Johannes Weiner wrote:
> > > > > > On Tue, Nov 08, 2016 at 10:53:52AM +0100, Jan Kara wrote:
> > > > > > > On Mon 07-11-16 14:07:36, Johannes Weiner wrote:
> > > > > > > > The radix tree counts valid entries in each tree node. Entries stored
> > > > > > > > in the tree cannot be removed by simply storing NULL in the slot or
> > > > > > > > the internal counters will be off and the node never gets freed again.
> > > > > > > >
> > > > > > > > When collapsing a shmem page fails, restore the holes that were filled
> > > > > > > > with radix_tree_insert() with a proper radix tree deletion.
> > > > > > > >
> > > > > > > > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > > > > > > > Reported-by: Jan Kara <[email protected]>
> > > > > > > > Signed-off-by: Johannes Weiner <[email protected]>
> > > > > > > > ---
> > > > > > > > mm/khugepaged.c | 3 ++-
> > > > > > > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > > > > > index 728d7790dc2d..eac6f0580e26 100644
> > > > > > > > --- a/mm/khugepaged.c
> > > > > > > > +++ b/mm/khugepaged.c
> > > > > > > > @@ -1520,7 +1520,8 @@ static void collapse_shmem(struct mm_struct *mm,
> > > > > > > > if (!nr_none)
> > > > > > > > break;
> > > > > > > > /* Put holes back where they were */
> > > > > > > > - radix_tree_replace_slot(slot, NULL);
> > > > > > > > + radix_tree_delete(&mapping->page_tree,
> > > > > > > > + iter.index);
> > > > > > >
> > > > > > > Hum, but this is inside a radix_tree_for_each_slot() iteration. And
> > > > > > > radix_tree_delete() may end up freeing nodes, invalidating the current
> > > > > > > slot pointer, and the iteration code will do a use-after-free.
> > > > > >
> > > > > > Good point, we need to do another tree lookup after the deletion.
> > > > > >
> > > > > > But there are other instances in the code, where we drop the lock
> > > > > > temporarily and somebody else could delete the node from under us.
> > > > > >
> > > > > > In the main collapse path, I *think* this is prevented by the fact
> > > > > > that when we drop the tree lock we still hold the page lock of the
> > > > > > regular page that's in the tree while we isolate and unmap it, thus
> > > > > > pinning the node. Even so, it would seem a little hairy to rely on that.
> > > > > >
> > > > > > Kirill?
> > > > >
> > > > > [ sorry for delay ]
> > > > >
> > > > > Yes, we make sure that the locked page still belongs to the radix tree and
> > > > > back off if it doesn't. A locked page cannot be removed from the radix tree,
> > > > > so we should be fine.
> > > >
> > > > Well, it cannot be removed from the radix tree, but the radix tree code is
> > > > still free to collapse / expand the tree nodes as it sees fit (currently the
> > > > only real case is when changing a direct page pointer in the tree root into
> > > > a node pointer or vice versa, but still...). So code should not really assume
> > > > that the node a page is referenced from does not change once tree_lock is
> > > > dropped. That leads to subtle bugs...
> > >
> > > Hm. Okay.
> > >
> > > What is the right way to re-validate that the slot is still valid? Do I need
> > > a full lookup again? Can I pin the node explicitly?
> >
> > Full lookup is the only way to re-validate the slot. There is no way to pin
> > a radix tree node.
>
> I guess this should be enough:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 728d7790dc2d..c5ef73588676 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
> PAGE_SIZE, 0);
>
> spin_lock_irq(&mapping->tree_lock);
> -
> + slot = radix_tree_lookup_slot(&mapping->page_tree, index);
> + VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
> + &mapping->tree_lock), page);
> VM_BUG_ON_PAGE(page_mapped(page), page);
That looks good to me. The slot may get relocated, but the content
shouldn't change with the page locked.
Are you going to send a full patch with changelog and sign-off? If so,
please add:
Acked-by: Johannes Weiner <[email protected]>
On Mon, Nov 14, 2016 at 10:52:50AM -0500, Johannes Weiner wrote:
> On Mon, Nov 14, 2016 at 05:29:02PM +0300, Kirill A. Shutemov wrote:
> > @@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
> > PAGE_SIZE, 0);
> >
> > spin_lock_irq(&mapping->tree_lock);
> > -
> > + slot = radix_tree_lookup_slot(&mapping->page_tree, index);
> > + VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
> > + &mapping->tree_lock), page);
> > VM_BUG_ON_PAGE(page_mapped(page), page);
>
> That looks good to me. The slot may get relocated, but the content
> shouldn't change with the page locked.
>
> Are you going to send a full patch with changelog and sign-off? If so,
> please add:
>
> Acked-by: Johannes Weiner <[email protected]>
Just to clarify, this is in addition to my radix_tree_iter_next()
change. The iterator still needs to be reloaded because the number of
valid slots that come after the current one can change as well.
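In the abstract, the safe shape of a deleting walk looks something like
this, with should_drop() standing in for whatever test applies (a sketch,
not the exact khugepaged code):
	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
		void *entry = radix_tree_deref_slot_protected(slot,
						&mapping->tree_lock);
		if (should_drop(entry)) {
			/* May free tree nodes and invalidate 'slot'... */
			radix_tree_delete(&mapping->page_tree, iter.index);
			/* ...so reload the iterator before continuing. */
			slot = radix_tree_iter_next(&iter);
			continue;
		}
	}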
On Mon, Nov 14, 2016 at 11:48:22AM -0500, Johannes Weiner wrote:
> On Mon, Nov 14, 2016 at 10:52:50AM -0500, Johannes Weiner wrote:
> > On Mon, Nov 14, 2016 at 05:29:02PM +0300, Kirill A. Shutemov wrote:
> > > @@ -1400,7 +1400,9 @@ static void collapse_shmem(struct mm_struct *mm,
> > > PAGE_SIZE, 0);
> > >
> > > spin_lock_irq(&mapping->tree_lock);
> > > -
> > > + slot = radix_tree_lookup_slot(&mapping->page_tree, index);
> > > + VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
> > > + &mapping->tree_lock), page);
> > > VM_BUG_ON_PAGE(page_mapped(page), page);
> >
> > That looks good to me. The slot may get relocated, but the content
> > shouldn't change with the page locked.
> >
> > Are you going to send a full patch with changelog and sign-off? If so,
> > please add:
> >
> > Acked-by: Johannes Weiner <[email protected]>
>
> Just to clarify, this is in addition to my radix_tree_iter_next()
> change. The iterator still needs to be reloaded because the number of
> valid slots that come after the current one can change as well.
Could you just amend all these fixups into your patch?
--
Kirill A. Shutemov
On Mon, Nov 14, 2016 at 10:40:54PM +0300, Kirill A. Shutemov wrote:
> Could you just amend all these fixups into your patch?
Will do.