2017-12-06 01:08:08

by Matthew Wilcox

Subject: [PATCH v4 00/73] XArray version 4

From: Matthew Wilcox <[email protected]>

I looked through some notes and decided this was version 4 of the XArray.
Last posted two weeks ago, this version includes a *lot* of changes.
I'd like to thank Dave Chinner for his feedback, encouragement and
distracting ideas for improvement, which I'll get to once this is merged.

Highlights:
- Over 2000 words of documentation in patch 8! And lots more kernel-doc.
- The page cache is now fully converted to the XArray.
- Many more tests in the test-suite.

This patch set is not for applying. 0day is still reporting problems,
and I'd feel bad for eating someone's data. These patches apply on top
of a set of preparatory patches which just aren't interesting. If you
want to see the patches applied to a tree, I suggest pulling my git tree:
http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
I also left out the idr_preload removals. They're still in the git tree,
but I'm not looking for feedback on them.

Changes since v3:

XArray API differences:
- Store a pointer to the struct xarray in the xa_state (changes almost
every prototype in the advanced API).
- Added xas_lock() etc to operate on the XArray stored in the xa_state.
- Added xa_erase() as a synonym for xa_store(..., NULL, 0).
- Added __xa_erase() which is an exact replacement for radix_tree_delete();
it assumes you are holding the xa_lock.
- Renamed xa_next() to xa_find_after().
- Renamed xas_next() to xas_next_entry().
- Renamed xas_prev_any() and xas_next_any() to xas_prev() and xas_next().
- Changed the semantics of xas_prev() and xas_next() substantially
(see their kernel-doc).
- Renamed skip entry to zero entry.
- Introduced a new XAS_BOUNDS state to distinguish between an xa_state
that has not been walked and an xa_state that has walked off the
current end of the array.
- Changed xas_set_err() to take a negative errno, not a positive one.
XAS_ERROR still takes a positive errno, but this is an undocumented
internal part of the implementation, not an API.
- Changed behaviour when returning a multi-index entry; xas.xa_index
is now always set to the first (canonical) index of this entry.
Before, it was never rewound. Eg if you have an entry occupying
indices 4-7, and called xas_load() with xas.xa_index set to 6, it
will now set xas.xa_index to 4.
- Changed xas_nomem() to release any allocated memory if there is no
ENOMEM error. This means that (unless the user explicitly bypasses
calling xas_nomem() on some path), there's no need to call xas_destroy()
and it is removed from the API.
- Added xas_create_range() for the benefit of our current hugepage users.
I hope to be able to remove it again once they are converted to use
multi-index entries.
- Added xa_get_maybe_tag() which will call xa_get_entries() if you specify
XA_NO_TAG and xa_get_tagged() otherwise.
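
To illustrate the multi-index change in the list above (an entry occupying
indices 4-7, looked up at 6, now leaves xas.xa_index at 4), here is a small
userspace sketch. It assumes entries are naturally aligned and cover a
power-of-two number of indices; canonical_index() is a hypothetical helper
made up for this example and is not part of the XArray API.

```c
#include <assert.h>

/*
 * Hypothetical helper, not an XArray function: return the first index
 * covered by a naturally aligned entry spanning (1 << order) indices.
 */
static unsigned long canonical_index(unsigned long index, unsigned int order)
{
	/* Shifting by the full width of the type would be undefined. */
	assert(order < 8 * sizeof(unsigned long));

	/* Clear the low 'order' bits to rewind to the start of the entry. */
	return index & ~((1UL << order) - 1);
}
```

An entry covering indices 4-7 has order 2, so canonical_index(6, 2) gives 4,
matching the example in the changelog. An order-0 entry is its own canonical
index.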

IDR API differences:
- Removed the IDR cyclic API change (decided not to do it after all).
- Made idr_alloc_ul() and idr_alloc_u32() assign the ID before inserting
the pointer into the IDR, so a lookup cannot find an uninitialised object.

Bug fixes:
- Made INIT_RADIX_TREE() initialise the xa_lock correctly so lockdep
doesn't whine about it.
- Fixed a locking bug in the IPC IDR conversion.
- If we call xas_store(&xas, NULL) and that causes the XArray to shrink,
set the xas to the XAS_BOUNDS state so we don't dereference a pointer
to a node which has been passed to RCU free. This is only a problem
on !SMP machines.
- Fixed bug when shrinking the XArray to a single entry at index 0.
- Fixed bug where we could scan off the end of the slot array when storing
a NULL.
- Made xas_pause() not do anything if we're in an error state. Before, it
would have dereferenced a NULL pointer.
- Fixed a bug in xa_find_after(). It just plain didn't work. Now there is
a test-case for it.

Conversions:
- Converted backing dev cgroup code from radix tree to XArray.
- Converted the USB XHCI driver from radix tree to XArray.
- Moved btrfs_page_exists_in_range() guts to page cache code.
- Renamed page_cache_{next,prev}_hole() to ..._gap(). The page cache
doesn't cache holes.
- Finished the page cache conversion.

Miscellaneous:
- Documentation. Lots and lots of documentation. xarray.rst, more XArray
kernel-doc and also IDR kernel-doc which has been missing for years.
- Added MAINTAINERS entry for XArray/IDR.
- Deleted the now-unused parts of the radix tree API (see git tree).
- Added XA_DEBUG code and enabled it in the test-suite.
- Improved code generation for initialising xa_state by explicitly
initialising the struct padding (stupid gcc).
- Stubbed out more code if CONFIG_RADIX_TREE_MULTIORDER isn't enabled.
- Added more tests to the test-suite.
- Removed the IDR preload conversions from this patch set (see git tree).

Matthew Wilcox (73):
xfs: Rename xa_ elements to ail_
xarray: Add the xa_lock to the radix_tree_root
page cache: Use xa_lock
xarray: Replace exceptional entries
xarray: Change definition of sibling entries
xarray: Add definition of struct xarray
xarray: Define struct xa_node
xarray: Add documentation
xarray: Add xa_load
xarray: Add xa_get_tag, xa_set_tag and xa_clear_tag
xarray: Add xa_store
xarray: Add xa_cmpxchg
xarray: Add xa_for_each
xarray: Add xas_for_each_tag
xarray: Add xa_get_entries, xa_get_tagged and xa_get_maybe_tag
xarray: Add xa_destroy
xarray: Add xas_next and xas_prev
xarray: Add xas_create_range
xarray: Add MAINTAINERS entry
idr: Convert to XArray
ida: Convert to XArray
page cache: Convert hole search to XArray
page cache: Add page_cache_range_empty function
page cache: Add and replace pages using the XArray
page cache: Convert page deletion to XArray
page cache: Convert page cache lookups to XArray
page cache: Convert delete_batch to XArray
page cache: Remove stray radix comment
mm: Convert page-writeback to XArray
mm: Convert workingset to XArray
mm: Convert truncate to XArray
mm: Convert add_to_swap_cache to XArray
mm: Convert delete_from_swap_cache to XArray
mm: Convert cgroup writeback to XArray
mm: Convert __do_page_cache_readahead to XArray
mm: Convert page migration to XArray
mm: Convert huge_memory to XArray
mm: Convert collapse_shmem to XArray
mm: Convert khugepaged_scan_shmem to XArray
pagevec: Use xa_tag_t
shmem: Convert replace to XArray
shmem: Convert shmem_confirm_swap to XArray
shmem: Convert find_swap_entry to XArray
shmem: Convert shmem_tag_pins to XArray
shmem: Convert shmem_wait_for_pins to XArray
shmem: Convert shmem_add_to_page_cache to XArray
shmem: Convert shmem_alloc_hugepage to XArray
shmem: Convert shmem_free_swap to XArray
shmem: Convert shmem_partial_swap_usage to XArray
shmem: Comment fixups
btrfs: Convert page cache to XArray
fs: Convert buffer to XArray
fs: Convert writeback to XArray
nilfs2: Convert to XArray
f2fs: Convert to XArray
lustre: Convert to XArray
dax: Convert dax_unlock_mapping_entry to XArray
dax: Convert lock_slot to XArray
dax: More XArray conversion
dax: Convert __dax_invalidate_mapping_entry to XArray
dax: Convert dax_writeback_one to XArray
dax: Convert dax_insert_pfn_mkwrite to XArray
dax: Convert dax_insert_mapping_entry to XArray
dax: Convert grab_mapping_entry to XArray
dax: Fix sparse warning
page cache: Finish XArray conversion
vmalloc: Convert to XArray
brd: Convert to XArray
xfs: Convert m_perag_tree to XArray
xfs: Convert pag_ici_root to XArray
xfs: Convert xfs dquot to XArray
xfs: Convert mru cache to XArray
usb: Convert xhci-mem to XArray

Documentation/cgroup-v1/memory.txt | 2 +-
Documentation/core-api/index.rst | 1 +
Documentation/core-api/xarray.rst | 287 +++++
Documentation/vm/page_migration | 14 +-
MAINTAINERS | 12 +
arch/arm/include/asm/cacheflush.h | 6 +-
arch/nios2/include/asm/cacheflush.h | 6 +-
arch/parisc/include/asm/cacheflush.h | 6 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +-
arch/powerpc/include/asm/nohash/64/pgtable.h | 4 +-
drivers/block/brd.c | 87 +-
drivers/gpu/drm/i915/i915_gem.c | 17 +-
drivers/staging/lustre/lustre/llite/glimpse.c | 12 +-
drivers/staging/lustre/lustre/mdc/mdc_request.c | 16 +-
drivers/usb/host/xhci-mem.c | 70 +-
drivers/usb/host/xhci.h | 6 +-
fs/afs/write.c | 2 +-
fs/btrfs/btrfs_inode.h | 7 +-
fs/btrfs/compression.c | 6 +-
fs/btrfs/extent_io.c | 16 +-
fs/btrfs/inode.c | 70 --
fs/buffer.c | 22 +-
fs/cifs/file.c | 2 +-
fs/dax.c | 382 +++---
fs/ext4/inode.c | 2 +-
fs/f2fs/data.c | 9 +-
fs/f2fs/dir.c | 5 +-
fs/f2fs/gc.c | 2 +-
fs/f2fs/inline.c | 6 +-
fs/f2fs/node.c | 10 +-
fs/fs-writeback.c | 37 +-
fs/gfs2/aops.c | 2 +-
fs/inode.c | 11 +-
fs/nfs/blocklayout/blocklayout.c | 2 +-
fs/nilfs2/btnode.c | 41 +-
fs/nilfs2/page.c | 78 +-
fs/proc/task_mmu.c | 2 +-
fs/xfs/libxfs/xfs_sb.c | 11 +-
fs/xfs/libxfs/xfs_sb.h | 2 +-
fs/xfs/xfs_buf_item.c | 10 +-
fs/xfs/xfs_dquot.c | 37 +-
fs/xfs/xfs_dquot_item.c | 11 +-
fs/xfs/xfs_icache.c | 142 +--
fs/xfs/xfs_icache.h | 10 +-
fs/xfs/xfs_inode.c | 24 +-
fs/xfs/xfs_inode_item.c | 22 +-
fs/xfs/xfs_log.c | 6 +-
fs/xfs/xfs_log_recover.c | 80 +-
fs/xfs/xfs_mount.c | 22 +-
fs/xfs/xfs_mount.h | 6 +-
fs/xfs/xfs_mru_cache.c | 72 +-
fs/xfs/xfs_qm.c | 32 +-
fs/xfs/xfs_qm.h | 18 +-
fs/xfs/xfs_trans.c | 18 +-
fs/xfs/xfs_trans_ail.c | 152 +--
fs/xfs/xfs_trans_buf.c | 4 +-
fs/xfs/xfs_trans_priv.h | 42 +-
include/linux/backing-dev-defs.h | 2 +-
include/linux/backing-dev.h | 14 +-
include/linux/fs.h | 68 +-
include/linux/idr.h | 173 ++-
include/linux/mm.h | 2 +-
include/linux/pagemap.h | 16 +-
include/linux/pagevec.h | 8 +-
include/linux/radix-tree.h | 95 +-
include/linux/swap.h | 5 +-
include/linux/swapops.h | 19 +-
include/linux/xarray.h | 887 +++++++++++++
kernel/pid.c | 2 +-
lib/Makefile | 2 +-
lib/idr.c | 617 +++++----
lib/radix-tree.c | 413 ++----
lib/xarray.c | 1520 +++++++++++++++++++++++
mm/backing-dev.c | 28 +-
mm/filemap.c | 746 ++++-------
mm/huge_memory.c | 23 +-
mm/khugepaged.c | 182 ++-
mm/madvise.c | 2 +-
mm/memcontrol.c | 4 +-
mm/migrate.c | 41 +-
mm/mincore.c | 2 +-
mm/page-writeback.c | 78 +-
mm/readahead.c | 10 +-
mm/rmap.c | 4 +-
mm/shmem.c | 311 ++---
mm/swap.c | 6 +-
mm/swap_state.c | 124 +-
mm/truncate.c | 45 +-
mm/vmalloc.c | 39 +-
mm/vmscan.c | 14 +-
mm/workingset.c | 78 +-
tools/include/linux/spinlock.h | 2 +
tools/testing/radix-tree/.gitignore | 2 +
tools/testing/radix-tree/Makefile | 15 +-
tools/testing/radix-tree/idr-test.c | 29 +-
tools/testing/radix-tree/linux/bug.h | 1 +
tools/testing/radix-tree/linux/kconfig.h | 1 +
tools/testing/radix-tree/linux/rcupdate.h | 2 +
tools/testing/radix-tree/linux/xarray.h | 3 +
tools/testing/radix-tree/multiorder.c | 53 +-
tools/testing/radix-tree/regression1.c | 68 +-
tools/testing/radix-tree/test.c | 8 +-
tools/testing/radix-tree/xarray-test.c | 468 +++++++
103 files changed, 5309 insertions(+), 2908 deletions(-)
create mode 100644 Documentation/core-api/xarray.rst
create mode 100644 include/linux/xarray.h
create mode 100644 lib/xarray.c
create mode 100644 tools/testing/radix-tree/linux/kconfig.h
create mode 100644 tools/testing/radix-tree/linux/xarray.h
create mode 100644 tools/testing/radix-tree/xarray-test.c

--
2.15.0


2017-12-06 00:42:22

by Matthew Wilcox

Subject: [PATCH v4 18/73] xarray: Add xas_create_range

From: Matthew Wilcox <[email protected]>

This hopefully temporary function is useful for users who have not yet
been converted to multi-index entries.
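
As a rough userspace model of the loop this patch adds, the sketch below
records the indices at which xas_create() would be called. Everything named
toy_* is invented for illustration, and TOY_CHUNK_SIZE of 64 is an assumed
stand-in for XA_CHUNK_SIZE; no real XArray is involved.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_CHUNK_SIZE 64UL	/* assumed value, stands in for XA_CHUNK_SIZE */

/*
 * Model of the do/while loop in xas_create_range(): note each index at
 * which a slot would be created, stepping one chunk at a time.
 * Returns the number of steps; at most 'cap' indices are stored.
 */
static size_t toy_create_range(unsigned long start, unsigned long max,
			       unsigned long *visited, size_t cap)
{
	unsigned long index = start;
	size_t n = 0;

	assert(visited != NULL && cap > 0);
	/* Keep the model simple: refuse ranges where the index could wrap. */
	assert(start <= ULONG_MAX - TOY_CHUNK_SIZE);
	assert(max <= ULONG_MAX - TOY_CHUNK_SIZE);

	do {
		if (n < cap)
			visited[n] = index;
		n++;
		index += TOY_CHUNK_SIZE;
	} while (index < max);

	return n;
}
```

Starting at index 0 with max set to 200, the model visits 0, 64, 128 and 192.
Because it is a do/while loop, at least one slot is always visited, even when
the starting index already equals max.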

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 2 ++
lib/xarray.c | 22 ++++++++++++++++++++++
2 files changed, 24 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 416708ace115..afa3374f20bd 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -594,6 +594,8 @@ void xas_init_tags(const struct xa_state *);
bool xas_nomem(struct xa_state *, gfp_t);
void xas_pause(struct xa_state *);

+void xas_create_range(struct xa_state *, unsigned long max);
+
/**
* xas_reload() - Refetch an entry from the xarray.
* @xas: XArray operation state.
diff --git a/lib/xarray.c b/lib/xarray.c
index 8c6e83d10554..cc88df7bd6df 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -570,6 +570,28 @@ void *xas_create(struct xa_state *xas)
}
EXPORT_SYMBOL_GPL(xas_create);

+/**
+ * xas_create_range() - Ensure that stores to this range will succeed
+ * @xas: XArray operation state.
+ * @max: The highest index to create a slot for.
+ *
+ * Creates all of the slots in the range between the current position of
+ * @xas and @max. This is for the benefit of users who have not yet been
+ * converted to multi-index entries.
+ *
+ * The implementation is naive.
+ */
+void xas_create_range(struct xa_state *xas, unsigned long max)
+{
+ XA_STATE(tmp, xas->xa, xas->xa_index);
+
+ do {
+ xas_create(&tmp);
+ xas_set(&tmp, tmp.xa_index + XA_CHUNK_SIZE);
+ } while (tmp.xa_index < max);
+}
+EXPORT_SYMBOL_GPL(xas_create_range);
+
static void store_siblings(struct xa_state *xas,
void *entry, int *countp, int *valuesp)
{
--
2.15.0

2017-12-06 00:42:25

by Matthew Wilcox

Subject: [PATCH v4 22/73] page cache: Convert hole search to XArray

From: Matthew Wilcox <[email protected]>

The page cache offers the ability to search for a miss in the previous or
next N locations. Rather than teach the XArray about the page cache's
definition of a miss, use xas_prev() and xas_next() to search the page
array. This should be more efficient as it does not have to start the
lookup from the top for each index.
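
For readers unfamiliar with the search being converted, here is a userspace
sketch of the documented contract: scan forward from an index for at most
max_scan locations and stop at the first miss. The page cache is modelled as
a plain array of booleans, and the toy_* names are made up for this example.
It follows the step-by-step behaviour of the loop being replaced rather than
the xa_state walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page cache: present[i] is true if index i holds a page. */
static bool toy_present(const bool *present, size_t n, unsigned long index)
{
	/* Anything beyond the end of the toy array counts as a miss. */
	return index < n && present[index];
}

/*
 * Return the index of the first gap at or after 'index', examining at
 * most 'max_scan' locations. If no gap is found in that window, the
 * result lies outside it (result - index >= max_scan).
 */
static unsigned long toy_next_gap(const bool *present, size_t n,
				  unsigned long index, unsigned long max_scan)
{
	assert(present != NULL);

	while (max_scan--) {
		if (!toy_present(present, n, index))
			break;
		index++;
		if (index == 0)		/* wrapped past ULONG_MAX */
			break;
	}

	return index;
}
```

With pages present at indices 0, 1, 2 and 4, a search from 0 with a window of
10 returns 3. The same search with a window of 2 returns 2, which the caller
recognises as "no gap found" because 2 - 0 >= 2.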

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/nfs/blocklayout/blocklayout.c | 2 +-
include/linux/pagemap.h | 4 +-
mm/filemap.c | 110 ++++++++++++++++++---------------------
mm/readahead.c | 4 +-
4 files changed, 55 insertions(+), 65 deletions(-)

diff --git a/fs/nfs/blocklayout/blocklayout.c b/fs/nfs/blocklayout/blocklayout.c
index 995d707537da..7bd643538cff 100644
--- a/fs/nfs/blocklayout/blocklayout.c
+++ b/fs/nfs/blocklayout/blocklayout.c
@@ -826,7 +826,7 @@ static u64 pnfs_num_cont_bytes(struct inode *inode, pgoff_t idx)
end = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (end != inode->i_mapping->nrpages) {
rcu_read_lock();
- end = page_cache_next_hole(mapping, idx + 1, ULONG_MAX);
+ end = page_cache_next_gap(mapping, idx + 1, ULONG_MAX);
rcu_read_unlock();
}

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 80a6149152d4..0db127c3ccac 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -241,9 +241,9 @@ static inline gfp_t readahead_gfp_mask(struct address_space *x)

typedef int filler_t(void *, struct page *);

-pgoff_t page_cache_next_hole(struct address_space *mapping,
+pgoff_t page_cache_next_gap(struct address_space *mapping,
pgoff_t index, unsigned long max_scan);
-pgoff_t page_cache_prev_hole(struct address_space *mapping,
+pgoff_t page_cache_prev_gap(struct address_space *mapping,
pgoff_t index, unsigned long max_scan);

#define FGP_ACCESSED 0x00000001
diff --git a/mm/filemap.c b/mm/filemap.c
index 1d012dd3629e..650624f7b79d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1326,86 +1326,76 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
}

/**
- * page_cache_next_hole - find the next hole (not-present entry)
- * @mapping: mapping
- * @index: index
- * @max_scan: maximum range to search
- *
- * Search the set [index, min(index+max_scan-1, MAX_INDEX)] for the
- * lowest indexed hole.
- *
- * Returns: the index of the hole if found, otherwise returns an index
- * outside of the set specified (in which case 'return - index >=
- * max_scan' will be true). In rare cases of index wrap-around, 0 will
- * be returned.
- *
- * page_cache_next_hole may be called under rcu_read_lock. However,
- * like radix_tree_gang_lookup, this will not atomically search a
- * snapshot of the tree at a single point in time. For example, if a
- * hole is created at index 5, then subsequently a hole is created at
- * index 10, page_cache_next_hole covering both indexes may return 10
- * if called under rcu_read_lock.
+ * page_cache_next_gap() - Find the next gap in the page cache.
+ * @mapping: Mapping.
+ * @index: Index.
+ * @max_scan: Maximum range to search.
+ *
+ * Search the range [index, min(index + max_scan - 1, ULONG_MAX)] for the
+ * gap with the lowest index.
+ *
+ * This function may be called under the rcu_read_lock. However, this will
+ * not atomically search a snapshot of the cache at a single point in time.
+ * For example, if a gap is created at index 5, then subsequently a gap is
+ * created at index 10, page_cache_next_gap() covering both indices may
+ * return 10 if called under the rcu_read_lock.
+ *
+ * Return: The index of the gap if found, otherwise an index outside the
+ * range specified (in which case 'return - index >= max_scan' will be true).
+ * In the rare case of index wrap-around, 0 will be returned.
*/
-pgoff_t page_cache_next_hole(struct address_space *mapping,
+pgoff_t page_cache_next_gap(struct address_space *mapping,
pgoff_t index, unsigned long max_scan)
{
- unsigned long i;
+ XA_STATE(xas, &mapping->pages, index);

- for (i = 0; i < max_scan; i++) {
- struct page *page;
-
- page = radix_tree_lookup(&mapping->pages, index);
- if (!page || xa_is_value(page))
+ while (max_scan--) {
+ void *entry = xas_next(&xas);
+ if (!entry || xa_is_value(entry))
break;
- index++;
- if (index == 0)
+ if (xas.xa_index == 0)
break;
}

- return index;
+ return xas.xa_index;
}
-EXPORT_SYMBOL(page_cache_next_hole);
+EXPORT_SYMBOL(page_cache_next_gap);

/**
- * page_cache_prev_hole - find the prev hole (not-present entry)
- * @mapping: mapping
- * @index: index
- * @max_scan: maximum range to search
- *
- * Search backwards in the range [max(index-max_scan+1, 0), index] for
- * the first hole.
- *
- * Returns: the index of the hole if found, otherwise returns an index
- * outside of the set specified (in which case 'index - return >=
- * max_scan' will be true). In rare cases of wrap-around, ULONG_MAX
- * will be returned.
- *
- * page_cache_prev_hole may be called under rcu_read_lock. However,
- * like radix_tree_gang_lookup, this will not atomically search a
- * snapshot of the tree at a single point in time. For example, if a
- * hole is created at index 10, then subsequently a hole is created at
- * index 5, page_cache_prev_hole covering both indexes may return 5 if
- * called under rcu_read_lock.
+ * page_cache_prev_gap() - Find the previous gap in the page cache.
+ * @mapping: Mapping.
+ * @index: Index.
+ * @max_scan: Maximum range to search.
+ *
+ * Search the range [max(index - max_scan + 1, 0), index] for the
+ * gap with the highest index.
+ *
+ * This function may be called under the rcu_read_lock. However, this will
+ * not atomically search a snapshot of the cache at a single point in time.
+ * For example, if a gap is created at index 10, then subsequently a gap is
+ * created at index 5, page_cache_prev_gap() covering both indices may
+ * return 5 if called under the rcu_read_lock.
+ *
+ * Return: The index of the gap if found, otherwise an index outside the
+ * range specified (in which case 'index - return >= max_scan' will be true).
+ * In the rare case of wrap-around, ULONG_MAX will be returned.
*/
-pgoff_t page_cache_prev_hole(struct address_space *mapping,
+pgoff_t page_cache_prev_gap(struct address_space *mapping,
pgoff_t index, unsigned long max_scan)
{
- unsigned long i;
-
- for (i = 0; i < max_scan; i++) {
- struct page *page;
+ XA_STATE(xas, &mapping->pages, index);

- page = radix_tree_lookup(&mapping->pages, index);
- if (!page || xa_is_value(page))
+ while (max_scan--) {
+ void *entry = xas_prev(&xas);
+ if (!entry || xa_is_value(entry))
break;
- index--;
- if (index == ULONG_MAX)
+ if (xas.xa_index == ULONG_MAX)
break;
}

- return index;
+ return xas.xa_index;
}
-EXPORT_SYMBOL(page_cache_prev_hole);
+EXPORT_SYMBOL(page_cache_prev_gap);

/**
* find_get_entry - find and get a page cache entry
diff --git a/mm/readahead.c b/mm/readahead.c
index 4851f002605f..f64b31b3a84a 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -329,7 +329,7 @@ static pgoff_t count_history_pages(struct address_space *mapping,
pgoff_t head;

rcu_read_lock();
- head = page_cache_prev_hole(mapping, offset - 1, max);
+ head = page_cache_prev_gap(mapping, offset - 1, max);
rcu_read_unlock();

return offset - 1 - head;
@@ -417,7 +417,7 @@ ondemand_readahead(struct address_space *mapping,
pgoff_t start;

rcu_read_lock();
- start = page_cache_next_hole(mapping, offset + 1, max_pages);
+ start = page_cache_next_gap(mapping, offset + 1, max_pages);
rcu_read_unlock();

if (!start || start - offset > max_pages)
--
2.15.0

2017-12-06 00:42:18

by Matthew Wilcox

Subject: [PATCH v4 12/73] xarray: Add xa_cmpxchg

From: Matthew Wilcox <[email protected]>

This works like doing cmpxchg() on an array entry. Code which wants
the radix_tree_insert() semantic of not overwriting an existing entry
can cmpxchg() with NULL and get the action it wants. Plus, instead of
having an error returned, they get the value currently stored in the
array, which often saves them a subsequent lookup.
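
The insert-without-overwrite idiom described above can be sketched in
userspace as follows. This is only a model of the semantics: the toy_* names
and the fixed-size array are invented for illustration, and there is no
locking, memory allocation or error reporting as there is in xa_cmpxchg().

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 16

/* Toy array of pointers; a NULL slot means "no entry". */
struct toy_array {
	void *slot[TOY_SLOTS];
};

/*
 * Store 'entry' at 'index' only if the slot currently holds 'old'.
 * Always returns the value that was in the slot before the call, so
 * the exchange succeeded if and only if the return value equals 'old'.
 */
static void *toy_cmpxchg(struct toy_array *a, size_t index,
			 void *old, void *entry)
{
	void *curr;

	assert(a != NULL && index < TOY_SLOTS);

	curr = a->slot[index];
	if (curr == old)
		a->slot[index] = entry;
	return curr;
}

/*
 * Insert without overwriting: compare against NULL. Returns NULL on
 * success, or the entry already present, saving a second lookup.
 */
static void *toy_insert(struct toy_array *a, size_t index, void *entry)
{
	return toy_cmpxchg(a, index, NULL, entry);
}
```

A first toy_insert() at an empty index returns NULL and stores the entry. A
second toy_insert() at the same index leaves the array unchanged and hands
back the entry that is already there.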

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 2 ++
lib/xarray.c | 37 +++++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 6f1f55d9fc94..a570d7d9a252 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -72,6 +72,8 @@ static inline void xa_init(struct xarray *xa)

void *xa_load(struct xarray *, unsigned long index);
void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
+void *xa_cmpxchg(struct xarray *, unsigned long index,
+ void *old, void *entry, gfp_t);

/**
* xa_erase() - Erase this entry from the XArray.
diff --git a/lib/xarray.c b/lib/xarray.c
index fbbb02c25b6d..6625b1763123 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -852,6 +852,43 @@ void *xa_store(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp)
}
EXPORT_SYMBOL(xa_store);

+/**
+ * xa_cmpxchg() - Conditionally replace an entry in the XArray.
+ * @xa: XArray.
+ * @index: Index into array.
+ * @old: Old value to test against.
+ * @entry: New value to place in array.
+ * @gfp: Allocation flags.
+ *
+ * If the entry at @index is the same as @old, replace it with @entry.
+ * If the return value is equal to @old, then the exchange was successful.
+ *
+ * Return: The old value at this index or ERR_PTR() if an error happened.
+ */
+void *xa_cmpxchg(struct xarray *xa, unsigned long index,
+ void *old, void *entry, gfp_t gfp)
+{
+ XA_STATE(xas, xa, index);
+ unsigned long flags;
+ void *curr;
+
+ if (WARN_ON_ONCE(xa_is_internal(entry)))
+ return ERR_PTR(-EINVAL);
+
+ do {
+ xa_lock_irqsave(xa, flags);
+ curr = xas_create(&xas);
+ if (curr == old)
+ xas_store(&xas, entry);
+ xa_unlock_irqrestore(xa, flags);
+ } while (xas_nomem(&xas, gfp));
+
+ if (xas_error(&xas))
+ curr = ERR_PTR(xas_error(&xas));
+ return curr;
+}
+EXPORT_SYMBOL(xa_cmpxchg);
+
/**
* __xa_set_tag() - Set this tag on this entry while locked.
* @xa: XArray.
--
2.15.0

2017-12-06 00:43:29

by Matthew Wilcox

Subject: [PATCH v4 60/73] dax: Convert __dax_invalidate_mapping_entry to XArray

From: Matthew Wilcox <[email protected]>

Simple now that we already have an xa_state!

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index ad984dece12e..66f6c4ea18f7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -413,24 +413,24 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
XA_STATE(xas, &mapping->pages, index);
int ret = 0;
void *entry;
- struct radix_tree_root *pages = &mapping->pages;

xa_lock_irq(&mapping->pages);
entry = get_unlocked_mapping_entry(&xas);
if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
goto out;
if (!trunc &&
- (radix_tree_tag_get(pages, index, PAGECACHE_TAG_DIRTY) ||
- radix_tree_tag_get(pages, index, PAGECACHE_TAG_TOWRITE)))
+ (xas_get_tag(&xas, PAGECACHE_TAG_DIRTY) ||
+ xas_get_tag(&xas, PAGECACHE_TAG_TOWRITE)))
goto out;
- radix_tree_delete(pages, index);
+ xas_store(&xas, NULL);
mapping->nrexceptional--;
ret = 1;
out:
put_unlocked_mapping_entry(&xas, entry);
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
return ret;
}
+
/*
* Delete DAX data value entry at @index from @mapping. Wait for radix tree
* entry to get unlocked before deleting it.
--
2.15.0

2017-12-06 00:43:37

by Matthew Wilcox

Subject: [PATCH v4 64/73] dax: Convert grab_mapping_entry to XArray

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 98 +++++++++++++++++-----------------------------------------------
1 file changed, 26 insertions(+), 72 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index de85ce87d333..c663d82e8ba3 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -44,6 +44,7 @@

/* The 'colour' (ie low bits) within a PMD of a page offset. */
#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
+#define PMD_ORDER (PMD_SHIFT - PAGE_SHIFT)

static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];

@@ -89,10 +90,10 @@ static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
DAX_ENTRY_LOCK);
}

-static unsigned int dax_radix_order(void *entry)
+static unsigned int dax_entry_order(void *entry)
{
if (xa_to_value(entry) & DAX_PMD)
- return PMD_SHIFT - PAGE_SHIFT;
+ return PMD_ORDER;
return 0;
}

@@ -299,10 +300,11 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
{
XA_STATE(xas, &mapping->pages, index);
bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
- void *entry, **slot;
+ void *entry;

+ xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
restart:
- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
entry = get_unlocked_mapping_entry(&xas);

if (WARN_ON_ONCE(entry && !xa_is_value(entry))) {
@@ -326,84 +328,36 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
}
}

- /* No entry for given index? Make sure radix tree is big enough. */
- if (!entry || pmd_downgrade) {
- int err;
-
- if (pmd_downgrade) {
- /*
- * Make sure 'entry' remains valid while we drop
- * mapping xa_lock.
- */
- entry = lock_slot(&xas);
- }
-
- xa_unlock_irq(&mapping->pages);
+ if (pmd_downgrade) {
+ entry = lock_slot(&xas);
/*
* Besides huge zero pages the only other thing that gets
* downgraded are empty entries which don't need to be
* unmapped.
*/
- if (pmd_downgrade && dax_is_zero_entry(entry))
+ if (dax_is_zero_entry(entry)) {
+ xas_pause(&xas);
+ xas_unlock_irq(&xas);
unmap_mapping_range(mapping,
(index << PAGE_SHIFT) & PMD_MASK, PMD_SIZE, 0);
-
- err = radix_tree_preload(
- mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
- if (err) {
- if (pmd_downgrade)
- put_locked_mapping_entry(mapping, index);
- return ERR_PTR(err);
+ xas_lock_irq(&xas);
}
- xa_lock_irq(&mapping->pages);
-
- if (!entry) {
- /*
- * We needed to drop the pages lock while calling
- * radix_tree_preload() and we didn't have an entry to
- * lock. See if another thread inserted an entry at
- * our index during this time.
- */
- entry = __radix_tree_lookup(&mapping->pages, index,
- NULL, &slot);
- if (entry) {
- radix_tree_preload_end();
- xa_unlock_irq(&mapping->pages);
- goto restart;
- }
- }
-
- if (pmd_downgrade) {
- radix_tree_delete(&mapping->pages, index);
- mapping->nrexceptional--;
- dax_wake_entry(&xas, entry, true);
- }
-
+ xas_store(&xas, NULL);
+ mapping->nrexceptional--;
+ dax_wake_entry(&xas, entry, true);
+ }
+ if (!entry || pmd_downgrade) {
entry = dax_radix_locked_entry(0, size_flag | DAX_EMPTY);
-
- err = __radix_tree_insert(&mapping->pages, index,
- dax_radix_order(entry), entry);
- radix_tree_preload_end();
- if (err) {
- xa_unlock_irq(&mapping->pages);
- /*
- * Our insertion of a DAX entry failed, most likely
- * because we were inserting a PMD entry and it
- * collided with a PTE sized entry at a different
- * index in the PMD range. We haven't inserted
- * anything into the radix tree and have no waiters to
- * wake.
- */
- return ERR_PTR(err);
- }
- /* Good, we have inserted empty locked entry into the tree. */
- mapping->nrexceptional++;
- xa_unlock_irq(&mapping->pages);
- return entry;
+ xas_store(&xas, entry);
+ if (!xas_error(&xas))
+ mapping->nrexceptional++;
+ } else {
+ entry = lock_slot(&xas);
}
- entry = lock_slot(&xas);
out_unlock:
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
+ if (xas_nomem(&xas, GFP_NOIO))
+ goto restart;
return entry;
}

@@ -683,7 +637,7 @@ static int dax_writeback_one(struct block_device *bdev,
* worry about partial PMD writebacks.
*/
sector = dax_radix_sector(entry);
- size = PAGE_SIZE << dax_radix_order(entry);
+ size = PAGE_SIZE << dax_entry_order(entry);

id = dax_read_lock();
ret = bdev_dax_pgoff(bdev, sector, size, &pgoff);
--
2.15.0

2017-12-06 00:43:33

by Matthew Wilcox

Subject: [PATCH v4 59/73] dax: More XArray conversion

From: Matthew Wilcox <[email protected]>

This time, we want to convert get_unlocked_mapping_entry() to use the
XArray. That has a ripple effect, causing us to change the waitqueues
to hash on the address of the xarray rather than the address of the
mapping (functionally equivalent), and create a lot of on-the-stack
xa_state which are only used as a container for passing the xarray and
the index down to deeper function calls.

Also rename dax_wake_mapping_entry_waiter() to dax_wake_entry().

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 80 +++++++++++++++++++++++++++++-----------------------------------
1 file changed, 36 insertions(+), 44 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index d2007a17d257..ad984dece12e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -120,7 +120,7 @@ static int dax_is_empty_entry(void *entry)
* DAX radix tree locking
*/
struct exceptional_entry_key {
- struct address_space *mapping;
+ struct xarray *xa;
pgoff_t entry_start;
};

@@ -129,9 +129,10 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
};

-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
- pgoff_t index, void *entry, struct exceptional_entry_key *key)
+static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
+ void *entry, struct exceptional_entry_key *key)
{
+ unsigned long index = xas->xa_index;
unsigned long hash;

/*
@@ -142,10 +143,10 @@ static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
if (dax_is_pmd_entry(entry))
index &= ~PG_PMD_COLOUR;

- key->mapping = mapping;
+ key->xa = xas->xa;
key->entry_start = index;

- hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+ hash = hash_long((unsigned long)xas->xa ^ index, DAX_WAIT_TABLE_BITS);
return wait_table + hash;
}

@@ -156,26 +157,23 @@ static int wake_exceptional_entry_func(wait_queue_entry_t *wait, unsigned int mo
struct wait_exceptional_entry_queue *ewait =
container_of(wait, struct wait_exceptional_entry_queue, wait);

- if (key->mapping != ewait->key.mapping ||
+ if (key->xa != ewait->key.xa ||
key->entry_start != ewait->key.entry_start)
return 0;
return autoremove_wake_function(wait, mode, sync, NULL);
}

/*
- * We do not necessarily hold the mapping xa_lock when we call this
- * function so it is possible that 'entry' is no longer a valid item in the
- * radix tree. This is okay because all we really need to do is to find the
- * correct waitqueue where tasks might be waiting for that old 'entry' and
- * wake them.
+ * @entry may no longer be the entry at the index in the array. The
+ * important information it's conveying is whether the entry at this
+ * index *used* to be a PMD entry.
*/
-static void dax_wake_mapping_entry_waiter(struct address_space *mapping,
- pgoff_t index, void *entry, bool wake_all)
+static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
{
struct exceptional_entry_key key;
wait_queue_head_t *wq;

- wq = dax_entry_waitqueue(mapping, index, entry, &key);
+ wq = dax_entry_waitqueue(xas, entry, &key);

/*
* Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -207,10 +205,9 @@ static inline void *lock_slot(struct xa_state *xas)
*
* The function must be called with mapping xa_lock held.
*/
-static void *get_unlocked_mapping_entry(struct address_space *mapping,
- pgoff_t index, void ***slotp)
+static void *get_unlocked_mapping_entry(struct xa_state *xas)
{
- void *entry, **slot;
+ void *entry;
struct wait_exceptional_entry_queue ewait;
wait_queue_head_t *wq;

@@ -218,22 +215,19 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
ewait.wait.func = wake_exceptional_entry_func;

for (;;) {
- entry = __radix_tree_lookup(&mapping->pages, index, NULL,
- &slot);
- if (!entry ||
- WARN_ON_ONCE(!xa_is_value(entry)) || !dax_locked(entry)) {
- if (slotp)
- *slotp = slot;
+ entry = xas_load(xas);
+ if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
+ !dax_locked(entry))
return entry;
- }

- wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
+ wq = dax_entry_waitqueue(xas, entry, &ewait.key);
prepare_to_wait_exclusive(wq, &ewait.wait,
TASK_UNINTERRUPTIBLE);
- xa_unlock_irq(&mapping->pages);
+ xas_pause(xas);
+ xas_unlock_irq(xas);
schedule();
finish_wait(wq, &ewait.wait);
- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(xas);
}
}

@@ -253,7 +247,7 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,
xas_store(&xas, entry);
/* Safe to not call xas_pause here -- we don't touch the array after */
xas_unlock_irq(&xas);
- dax_wake_mapping_entry_waiter(mapping, index, entry, false);
+ dax_wake_entry(&xas, entry, false);
}

static void put_locked_mapping_entry(struct address_space *mapping,
@@ -266,14 +260,13 @@ static void put_locked_mapping_entry(struct address_space *mapping,
* Called when we are done with radix tree entry we looked up via
* get_unlocked_mapping_entry() and which we didn't lock in the end.
*/
-static void put_unlocked_mapping_entry(struct address_space *mapping,
- pgoff_t index, void *entry)
+static void put_unlocked_mapping_entry(struct xa_state *xas, void *entry)
{
if (!entry)
return;

/* We have to wake up next waiter for the radix tree entry lock */
- dax_wake_mapping_entry_waiter(mapping, index, entry, false);
+ dax_wake_entry(xas, entry, false);
}

/*
@@ -310,7 +303,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,

restart:
xa_lock_irq(&mapping->pages);
- entry = get_unlocked_mapping_entry(mapping, index, &slot);
+ entry = get_unlocked_mapping_entry(&xas);

if (WARN_ON_ONCE(entry && !xa_is_value(entry))) {
entry = ERR_PTR(-EIO);
@@ -320,8 +313,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
if (entry) {
if (size_flag & DAX_PMD) {
if (dax_is_pte_entry(entry)) {
- put_unlocked_mapping_entry(mapping, index,
- entry);
+ put_unlocked_mapping_entry(&xas, entry);
entry = ERR_PTR(-EEXIST);
goto out_unlock;
}
@@ -384,8 +376,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
if (pmd_downgrade) {
radix_tree_delete(&mapping->pages, index);
mapping->nrexceptional--;
- dax_wake_mapping_entry_waiter(mapping, index, entry,
- true);
+ dax_wake_entry(&xas, entry, true);
}

entry = dax_radix_locked_entry(0, size_flag | DAX_EMPTY);
@@ -419,12 +410,13 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
static int __dax_invalidate_mapping_entry(struct address_space *mapping,
pgoff_t index, bool trunc)
{
+ XA_STATE(xas, &mapping->pages, index);
int ret = 0;
void *entry;
struct radix_tree_root *pages = &mapping->pages;

xa_lock_irq(&mapping->pages);
- entry = get_unlocked_mapping_entry(mapping, index, NULL);
+ entry = get_unlocked_mapping_entry(&xas);
if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
goto out;
if (!trunc &&
@@ -435,7 +427,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
mapping->nrexceptional--;
ret = 1;
out:
- put_unlocked_mapping_entry(mapping, index, entry);
+ put_unlocked_mapping_entry(&xas, entry);
xa_unlock_irq(&mapping->pages);
return ret;
}
@@ -643,7 +635,7 @@ static int dax_writeback_one(struct block_device *bdev,
{
struct radix_tree_root *pages = &mapping->pages;
XA_STATE(xas, pages, index);
- void *entry2, **slot, *kaddr;
+ void *entry2, *kaddr;
long ret = 0, id;
sector_t sector;
pgoff_t pgoff;
@@ -658,7 +650,7 @@ static int dax_writeback_one(struct block_device *bdev,
return -EIO;

xa_lock_irq(&mapping->pages);
- entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
+ entry2 = get_unlocked_mapping_entry(&xas);
/* Entry got punched out / reallocated? */
if (!entry2 || WARN_ON_ONCE(!xa_is_value(entry2)))
goto put_unlocked;
@@ -736,7 +728,7 @@ static int dax_writeback_one(struct block_device *bdev,
return ret;

put_unlocked:
- put_unlocked_mapping_entry(mapping, index, entry2);
+ put_unlocked_mapping_entry(&xas, entry2);
xa_unlock_irq(&mapping->pages);
return ret;
}
@@ -1506,16 +1498,16 @@ static int dax_insert_pfn_mkwrite(struct vm_fault *vmf,
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
pgoff_t index = vmf->pgoff;
XA_STATE(xas, &mapping->pages, index);
- void *entry, **slot;
+ void *entry;
int vmf_ret, error;

xa_lock_irq(&mapping->pages);
- entry = get_unlocked_mapping_entry(mapping, index, &slot);
+ entry = get_unlocked_mapping_entry(&xas);
/* Did we race with someone splitting entry or so? */
if (!entry ||
(pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
(pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
- put_unlocked_mapping_entry(mapping, index, entry);
+ put_unlocked_mapping_entry(&xas, entry);
xa_unlock_irq(&mapping->pages);
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
VM_FAULT_NOPAGE);
--
2.15.0
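
A note on the dax_wake_entry() comment in this patch: the waker only needs to know whether the entry used to be a PMD entry because waiters sleep on a queue chosen from the PMD-aligned index. Below is a rough userspace sketch of that bucket selection, not the kernel code. The toy_* names, the table size, the colour mask (which assumes 4KiB pages and 2MiB PMDs) and the multiplicative hash standing in for hash_long() are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WAIT_TABLE_BITS 12		/* assumed table size: 4096 queues */
#define TOY_PMD_COLOUR 511ULL		/* assumes 4KiB pages and 2MiB PMDs */

/* Index used as the wait key: PMD entries collapse to their first index. */
static uint64_t toy_wait_index(uint64_t index, int is_pmd_entry)
{
	return is_pmd_entry ? (index & ~TOY_PMD_COLOUR) : index;
}

/* Stand-in for hash_long(): multiplicative hash keeping the top bits. */
static unsigned int toy_hash(uint64_t val, unsigned int bits)
{
	assert(bits > 0 && bits < 64);
	return (unsigned int)((val * 0x9E3779B97F4A7C15ULL) >> (64 - bits));
}

/* Pick a wait queue bucket from the array's address and the entry's index. */
static unsigned int toy_wait_bucket(const void *xa, uint64_t index,
				    int is_pmd_entry)
{
	uint64_t key = toy_wait_index(index, is_pmd_entry);

	return toy_hash((uint64_t)(uintptr_t)xa ^ key, TOY_WAIT_TABLE_BITS);
}
```

With this model, any two indices inside the same PMD range land in the same bucket when the entry is a PMD entry, so a waker that was passed a stale entry can still find the sleepers as long as the PMD/PTE distinction is right.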

2017-12-06 00:43:26

by Matthew Wilcox

Subject: [PATCH v4 70/73] xfs: Convert pag_ici_root to XArray

From: Matthew Wilcox <[email protected]>

Rename pag_ici_root to pag_ici_xa and use XArray APIs instead of radix
tree APIs. This gives shorter code, typechecking on tag numbers and
better error checking in xfs_reclaim_inode(), and it eliminates a call
to radix_tree_preload().

Signed-off-by: Matthew Wilcox <[email protected]>
---
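
Reviewer aid for the xfs_iflush_cluster() hunk: the patch flips the sense of mask and derives an inclusive [first_index, last_index] range, which is what lets the lookup stop at the end of the cluster without the old "fell off the end" check. Here is a small userspace sketch of that arithmetic, assuming the number of inodes per cluster is a power of two. The struct and function names are hypothetical and do not appear in the patch.

```c
#include <assert.h>

/* Hypothetical helper type: inclusive index range of one inode cluster. */
struct toy_range {
	unsigned long first;
	unsigned long last;
};

static struct toy_range toy_cluster_range(unsigned long agino,
					  unsigned long inodes_per_cluster)
{
	struct toy_range r;
	unsigned long mask;

	/* The mask trick only works for a power-of-two cluster size. */
	assert(inodes_per_cluster != 0 &&
	       (inodes_per_cluster & (inodes_per_cluster - 1)) == 0);

	mask = inodes_per_cluster - 1;	/* low bits: position within cluster */
	r.first = agino & ~mask;	/* round down to the cluster start */
	r.last = r.first | mask;	/* last index in the same cluster */
	return r;
}
```

For example, with 16 inodes per cluster, inode index 37 falls in the range 32 to 47 inclusive.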
fs/xfs/libxfs/xfs_sb.c | 2 +-
fs/xfs/libxfs/xfs_sb.h | 2 +-
fs/xfs/xfs_icache.c | 107 +++++++++++++++++++------------------------------
fs/xfs/xfs_icache.h | 4 +-
fs/xfs/xfs_inode.c | 24 ++++-------
fs/xfs/xfs_mount.c | 3 +-
fs/xfs/xfs_mount.h | 3 +-
7 files changed, 54 insertions(+), 91 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 3b0b65eb8224..8fb7c216c761 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -76,7 +76,7 @@ struct xfs_perag *
xfs_perag_get_tag(
struct xfs_mount *mp,
xfs_agnumber_t first,
- int tag)
+ xa_tag_t tag)
{
XA_STATE(xas, &mp->m_perag_xa, first);
struct xfs_perag *pag;
diff --git a/fs/xfs/libxfs/xfs_sb.h b/fs/xfs/libxfs/xfs_sb.h
index 961e6475a309..d2de90b8f39c 100644
--- a/fs/xfs/libxfs/xfs_sb.h
+++ b/fs/xfs/libxfs/xfs_sb.h
@@ -23,7 +23,7 @@
*/
extern struct xfs_perag *xfs_perag_get(struct xfs_mount *, xfs_agnumber_t);
extern struct xfs_perag *xfs_perag_get_tag(struct xfs_mount *, xfs_agnumber_t,
- int tag);
+ xa_tag_t tag);
extern void xfs_perag_put(struct xfs_perag *pag);
extern int xfs_initialize_perag_data(struct xfs_mount *, xfs_agnumber_t);

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index f56e500d89e2..edd44e190f3e 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -186,7 +186,7 @@ xfs_perag_set_reclaim_tag(
{
struct xfs_mount *mp = pag->pag_mount;

- lockdep_assert_held(&pag->pag_ici_lock);
+ lockdep_assert_held(&pag->pag_ici_xa.xa_lock);
if (pag->pag_ici_reclaimable++)
return;

@@ -205,7 +205,7 @@ xfs_perag_clear_reclaim_tag(
{
struct xfs_mount *mp = pag->pag_mount;

- lockdep_assert_held(&pag->pag_ici_lock);
+ lockdep_assert_held(&pag->pag_ici_xa.xa_lock);
if (--pag->pag_ici_reclaimable)
return;

@@ -228,16 +228,16 @@ xfs_inode_set_reclaim_tag(
struct xfs_perag *pag;

pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
- spin_lock(&pag->pag_ici_lock);
+ xa_lock(&pag->pag_ici_xa);
spin_lock(&ip->i_flags_lock);

- radix_tree_tag_set(&pag->pag_ici_root, XFS_INO_TO_AGINO(mp, ip->i_ino),
+ __xa_set_tag(&pag->pag_ici_xa, XFS_INO_TO_AGINO(mp, ip->i_ino),
XFS_ICI_RECLAIM_TAG);
xfs_perag_set_reclaim_tag(pag);
__xfs_iflags_set(ip, XFS_IRECLAIMABLE);

spin_unlock(&ip->i_flags_lock);
- spin_unlock(&pag->pag_ici_lock);
+ xa_unlock(&pag->pag_ici_xa);
xfs_perag_put(pag);
}

@@ -246,7 +246,7 @@ xfs_inode_clear_reclaim_tag(
struct xfs_perag *pag,
xfs_ino_t ino)
{
- radix_tree_tag_clear(&pag->pag_ici_root,
+ __xa_clear_tag(&pag->pag_ici_xa,
XFS_INO_TO_AGINO(pag->pag_mount, ino),
XFS_ICI_RECLAIM_TAG);
xfs_perag_clear_reclaim_tag(pag);
@@ -367,8 +367,8 @@ xfs_iget_cache_hit(
/*
* We need to set XFS_IRECLAIM to prevent xfs_reclaim_inode
* from stomping over us while we recycle the inode. We can't
- * clear the radix tree reclaimable tag yet as it requires
- * pag_ici_lock to be held exclusive.
+ * clear the xarray reclaimable tag yet as it requires
+ * pag_ici_xa.xa_lock to be held exclusive.
*/
ip->i_flags |= XFS_IRECLAIM;

@@ -393,7 +393,7 @@ xfs_iget_cache_hit(
goto out_error;
}

- spin_lock(&pag->pag_ici_lock);
+ xa_lock(&pag->pag_ici_xa);
spin_lock(&ip->i_flags_lock);

/*
@@ -410,7 +410,7 @@ xfs_iget_cache_hit(
init_rwsem(&inode->i_rwsem);

spin_unlock(&ip->i_flags_lock);
- spin_unlock(&pag->pag_ici_lock);
+ xa_unlock(&pag->pag_ici_xa);
} else {
/* If the VFS inode is being torn down, pause and try again. */
if (!igrab(inode)) {
@@ -451,7 +451,7 @@ xfs_iget_cache_miss(
int flags,
int lock_flags)
{
- struct xfs_inode *ip;
+ struct xfs_inode *ip, *curr;
int error;
xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino);
int iflags;
@@ -471,17 +471,6 @@ xfs_iget_cache_miss(
goto out_destroy;
}

- /*
- * Preload the radix tree so we can insert safely under the
- * write spinlock. Note that we cannot sleep inside the preload
- * region. Since we can be called from transaction context, don't
- * recurse into the file system.
- */
- if (radix_tree_preload(GFP_NOFS)) {
- error = -EAGAIN;
- goto out_destroy;
- }
-
/*
* Because the inode hasn't been added to the radix-tree yet it can't
* be found by another thread, so we can do the non-sleeping lock here.
@@ -509,23 +498,18 @@ xfs_iget_cache_miss(
xfs_iflags_set(ip, iflags);

/* insert the new inode */
- spin_lock(&pag->pag_ici_lock);
- error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
- if (unlikely(error)) {
- WARN_ON(error != -EEXIST);
+ curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
+ if (unlikely(curr)) {
+ WARN_ON(IS_ERR(curr));
XFS_STATS_INC(mp, xs_ig_dup);
error = -EAGAIN;
- goto out_preload_end;
+ goto out_unlock;
}
- spin_unlock(&pag->pag_ici_lock);
- radix_tree_preload_end();

*ipp = ip;
return 0;

-out_preload_end:
- spin_unlock(&pag->pag_ici_lock);
- radix_tree_preload_end();
+out_unlock:
if (lock_flags)
xfs_iunlock(ip, lock_flags);
out_destroy:
@@ -592,7 +576,7 @@ xfs_iget(
again:
error = 0;
rcu_read_lock();
- ip = radix_tree_lookup(&pag->pag_ici_root, agino);
+ ip = xa_load(&pag->pag_ici_xa, agino);

if (ip) {
error = xfs_iget_cache_hit(pag, ip, ino, flags, lock_flags);
@@ -731,7 +715,7 @@ xfs_inode_ag_walk(
void *args),
int flags,
void *args,
- int tag,
+ xa_tag_t tag,
int iter_flags)
{
uint32_t first_index;
@@ -752,15 +736,8 @@ xfs_inode_ag_walk(

rcu_read_lock();

- if (tag == -1)
- nr_found = radix_tree_gang_lookup(&pag->pag_ici_root,
- (void **)batch, first_index,
- XFS_LOOKUP_BATCH);
- else
- nr_found = radix_tree_gang_lookup_tag(
- &pag->pag_ici_root,
- (void **) batch, first_index,
- XFS_LOOKUP_BATCH, tag);
+ nr_found = xa_get_maybe_tag(&pag->pag_ici_xa, (void **)batch,
+ first_index, ULONG_MAX, XFS_LOOKUP_BATCH, tag);

if (!nr_found) {
rcu_read_unlock();
@@ -896,8 +873,8 @@ xfs_inode_ag_iterator_flags(
ag = 0;
while ((pag = xfs_perag_get(mp, ag))) {
ag = pag->pag_agno + 1;
- error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1,
- iter_flags);
+ error = xfs_inode_ag_walk(mp, pag, execute, flags, args,
+ XFS_ICI_NO_TAG, iter_flags);
xfs_perag_put(pag);
if (error) {
last_error = error;
@@ -926,7 +903,7 @@ xfs_inode_ag_iterator_tag(
void *args),
int flags,
void *args,
- int tag)
+ xa_tag_t tag)
{
struct xfs_perag *pag;
int error = 0;
@@ -1040,7 +1017,7 @@ xfs_reclaim_inode(
int sync_mode)
{
struct xfs_buf *bp = NULL;
- xfs_ino_t ino = ip->i_ino; /* for radix_tree_delete */
+ xfs_ino_t ino = ip->i_ino;
int error;

restart:
@@ -1128,16 +1105,15 @@ xfs_reclaim_inode(
/*
* Remove the inode from the per-AG radix tree.
*
- * Because radix_tree_delete won't complain even if the item was never
- * added to the tree assert that it's been there before to catch
- * problems with the inode life time early on.
+ * Check that it was there before to catch problems with the
+ * inode life time early on.
*/
- spin_lock(&pag->pag_ici_lock);
- if (!radix_tree_delete(&pag->pag_ici_root,
- XFS_INO_TO_AGINO(ip->i_mount, ino)))
+ xa_lock(&pag->pag_ici_xa);
+ if (__xa_erase(&pag->pag_ici_xa,
+ XFS_INO_TO_AGINO(ip->i_mount, ino)) != ip)
ASSERT(0);
xfs_perag_clear_reclaim_tag(pag);
- spin_unlock(&pag->pag_ici_lock);
+ xa_unlock(&pag->pag_ici_xa);

/*
* Here we do an (almost) spurious inode lock in order to coordinate
@@ -1213,9 +1189,8 @@ xfs_reclaim_inodes_ag(
int i;

rcu_read_lock();
- nr_found = radix_tree_gang_lookup_tag(
- &pag->pag_ici_root,
- (void **)batch, first_index,
+ nr_found = xa_get_tagged(&pag->pag_ici_xa,
+ (void **)batch, first_index, ULONG_MAX,
XFS_LOOKUP_BATCH,
XFS_ICI_RECLAIM_TAG);
if (!nr_found) {
@@ -1450,7 +1425,7 @@ __xfs_icache_free_eofblocks(
struct xfs_eofblocks *eofb,
int (*execute)(struct xfs_inode *ip, int flags,
void *args),
- int tag)
+ xa_tag_t tag)
{
int flags = SYNC_TRYLOCK;

@@ -1546,10 +1521,10 @@ __xfs_inode_set_eofblocks_tag(
spin_unlock(&ip->i_flags_lock);

pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
- spin_lock(&pag->pag_ici_lock);
+ xa_lock(&pag->pag_ici_xa);

- tagged = radix_tree_tagged(&pag->pag_ici_root, tag);
- radix_tree_tag_set(&pag->pag_ici_root,
+ tagged = xa_tagged(&pag->pag_ici_xa, tag);
+ __xa_set_tag(&pag->pag_ici_xa,
XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
if (!tagged) {
/* propagate the eofblocks tag up into the perag radix tree */
@@ -1563,7 +1538,7 @@ __xfs_inode_set_eofblocks_tag(
set_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
}

- spin_unlock(&pag->pag_ici_lock);
+ xa_unlock(&pag->pag_ici_xa);
xfs_perag_put(pag);
}

@@ -1592,11 +1567,11 @@ __xfs_inode_clear_eofblocks_tag(
spin_unlock(&ip->i_flags_lock);

pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
- spin_lock(&pag->pag_ici_lock);
+ xa_lock(&pag->pag_ici_xa);

- radix_tree_tag_clear(&pag->pag_ici_root,
+ __xa_clear_tag(&pag->pag_ici_xa,
XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
- if (!radix_tree_tagged(&pag->pag_ici_root, tag)) {
+ if (!xa_tagged(&pag->pag_ici_xa, tag)) {
/* clear the eofblocks tag from the perag radix tree */
xa_clear_tag(&ip->i_mount->m_perag_xa,
XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
@@ -1604,7 +1579,7 @@ __xfs_inode_clear_eofblocks_tag(
clear_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
}

- spin_unlock(&pag->pag_ici_lock);
+ xa_unlock(&pag->pag_ici_xa);
xfs_perag_put(pag);
}

diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index bd04d5adadfe..436e7f0b1ecc 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -35,7 +35,7 @@ struct xfs_eofblocks {
/*
* tags for inode radix tree
*/
-#define XFS_ICI_NO_TAG (-1) /* special flag for an untagged lookup
+#define XFS_ICI_NO_TAG XA_NO_TAG /* special flag for an untagged lookup
in xfs_inode_ag_iterator */
#define XFS_ICI_RECLAIM_TAG XA_TAG_0 /* inode is to be reclaimed */
#define XFS_ICI_EOFBLOCKS_TAG XA_TAG_1 /* inode has blocks beyond EOF */
@@ -90,7 +90,7 @@ int xfs_inode_ag_iterator_flags(struct xfs_mount *mp,
int flags, void *args, int iter_flags);
int xfs_inode_ag_iterator_tag(struct xfs_mount *mp,
int (*execute)(struct xfs_inode *ip, int flags, void *args),
- int flags, void *args, int tag);
+ int flags, void *args, xa_tag_t tag);

static inline int
xfs_fs_eofblocks_from_user(
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 801274126648..605ac6c11056 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2309,7 +2309,7 @@ xfs_ifree_cluster(
for (i = 0; i < inodes_per_cluster; i++) {
retry:
rcu_read_lock();
- ip = radix_tree_lookup(&pag->pag_ici_root,
+ ip = xa_load(&pag->pag_ici_xa,
XFS_INO_TO_AGINO(mp, (inum + i)));

/* Inode not in memory, nothing to do */
@@ -3207,7 +3207,7 @@ xfs_iflush_cluster(
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_perag *pag;
- unsigned long first_index, mask;
+ unsigned long first_index, last_index, mask;
unsigned long inodes_per_cluster;
int cilist_size;
struct xfs_inode **cilist;
@@ -3225,12 +3225,12 @@ xfs_iflush_cluster(
if (!cilist)
goto out_put;

- mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
- first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
+ mask = (((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
+ first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & ~mask;
+ last_index = first_index | mask;
rcu_read_lock();
- /* really need a gang lookup range call here */
- nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, (void**)cilist,
- first_index, inodes_per_cluster);
+ nr_found = xa_get_entries(&pag->pag_ici_xa, (void**)cilist, first_index,
+ last_index, inodes_per_cluster);
if (nr_found == 0)
goto out_free;

@@ -3251,16 +3251,6 @@ xfs_iflush_cluster(
spin_unlock(&cip->i_flags_lock);
continue;
}
-
- /*
- * Once we fall off the end of the cluster, no point checking
- * any more inodes in the list because they will also all be
- * outside the cluster.
- */
- if ((XFS_INO_TO_AGINO(mp, cip->i_ino) & mask) != first_index) {
- spin_unlock(&cip->i_flags_lock);
- break;
- }
spin_unlock(&cip->i_flags_lock);

/*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 0541aeb8449c..fc517e424fae 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -210,9 +210,8 @@ xfs_initialize_perag(
goto out_unwind_new_pags;
pag->pag_agno = index;
pag->pag_mount = mp;
- spin_lock_init(&pag->pag_ici_lock);
mutex_init(&pag->pag_ici_reclaim_lock);
- INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
+ xa_init(&pag->pag_ici_xa);
if (xfs_buf_hash_init(pag))
goto out_free_pag;
init_waitqueue_head(&pag->pagb_wait);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 6e5ad7b26f46..ab0f706d2fd7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -374,8 +374,7 @@ typedef struct xfs_perag {

atomic_t pagf_fstrms; /* # of filestreams active in this AG */

- spinlock_t pag_ici_lock; /* incore inode cache lock */
- struct radix_tree_root pag_ici_root; /* incore inode cache root */
+ struct xarray pag_ici_xa; /* incore inode cache */
int pag_ici_reclaimable; /* reclaimable inodes */
struct mutex pag_ici_reclaim_lock; /* serialisation point */
unsigned long pag_ici_reclaim_cursor; /* reclaim restart point */
--
2.15.0

2017-12-06 00:43:21

by Matthew Wilcox

Subject: [PATCH v4 61/73] dax: Convert dax_writeback_one to XArray

From: Matthew Wilcox <[email protected]>

Likewise easy.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 66f6c4ea18f7..7bd94f1b61d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -633,8 +633,7 @@ static int dax_writeback_one(struct block_device *bdev,
struct dax_device *dax_dev, struct address_space *mapping,
pgoff_t index, void *entry)
{
- struct radix_tree_root *pages = &mapping->pages;
- XA_STATE(xas, pages, index);
+ XA_STATE(xas, &mapping->pages, index);
void *entry2, *kaddr;
long ret = 0, id;
sector_t sector;
@@ -649,7 +648,7 @@ static int dax_writeback_one(struct block_device *bdev,
if (WARN_ON(!xa_is_value(entry)))
return -EIO;

- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
entry2 = get_unlocked_mapping_entry(&xas);
/* Entry got punched out / reallocated? */
if (!entry2 || WARN_ON_ONCE(!xa_is_value(entry2)))
@@ -668,7 +667,7 @@ static int dax_writeback_one(struct block_device *bdev,
}

/* Another fsync thread may have already written back this entry */
- if (!radix_tree_tag_get(pages, index, PAGECACHE_TAG_TOWRITE))
+ if (!xas_get_tag(&xas, PAGECACHE_TAG_TOWRITE))
goto put_unlocked;
/* Lock the entry to serialize with page faults */
entry = lock_slot(&xas);
@@ -679,8 +678,8 @@ static int dax_writeback_one(struct block_device *bdev,
* at the entry only under xa_lock and once they do that they will
* see the entry locked and wait for it to unlock.
*/
- radix_tree_tag_clear(pages, index, PAGECACHE_TAG_TOWRITE);
- xa_unlock_irq(&mapping->pages);
+ xas_clear_tag(&xas, PAGECACHE_TAG_TOWRITE);
+ xas_unlock_irq(&xas);

/*
* Even if dax_writeback_mapping_range() was given a wbc->range_start
@@ -718,9 +717,7 @@ static int dax_writeback_one(struct block_device *bdev,
* the pfn mappings are writeprotected and fault waits for mapping
* entry lock.
*/
- xa_lock_irq(&mapping->pages);
- radix_tree_tag_clear(pages, index, PAGECACHE_TAG_DIRTY);
- xa_unlock_irq(&mapping->pages);
+ xa_clear_tag(&mapping->pages, index, PAGECACHE_TAG_DIRTY);
trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
dax_unlock:
dax_read_unlock(id);
@@ -729,7 +726,7 @@ static int dax_writeback_one(struct block_device *bdev,

put_unlocked:
put_unlocked_mapping_entry(&xas, entry2);
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
return ret;
}

--
2.15.0

2017-12-06 00:43:17

by Matthew Wilcox

Subject: [PATCH v4 62/73] dax: Convert dax_insert_pfn_mkwrite to XArray

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7bd94f1b61d0..619aff70583f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1498,21 +1498,21 @@ static int dax_insert_pfn_mkwrite(struct vm_fault *vmf,
void *entry;
int vmf_ret, error;

- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
entry = get_unlocked_mapping_entry(&xas);
/* Did we race with someone splitting entry or so? */
if (!entry ||
(pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
(pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
put_unlocked_mapping_entry(&xas, entry);
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
VM_FAULT_NOPAGE);
return VM_FAULT_NOPAGE;
}
- radix_tree_tag_set(&mapping->pages, index, PAGECACHE_TAG_DIRTY);
+ xas_set_tag(&xas, PAGECACHE_TAG_DIRTY);
entry = lock_slot(&xas);
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
switch (pe_size) {
case PE_SIZE_PTE:
error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
--
2.15.0

2017-12-06 00:43:11

by Matthew Wilcox

Subject: [PATCH v4 68/73] brd: Convert to XArray

From: Matthew Wilcox <[email protected]>

Convert brd_pages from a radix tree to an XArray. Simpler and smaller
code; in particular another user of radix_tree_preload is eliminated.

Signed-off-by: Matthew Wilcox <[email protected]>
---
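
For readers comparing the old preload + lock + insert sequence with the xa_cmpxchg() call in brd_insert_page(): the idea is "store my page only if the slot is empty, otherwise tell me what is already there". The userspace model below uses C11 atomics to show that shape. It is only a sketch: struct toy_array and the toy_* functions are invented for illustration, and the real xa_cmpxchg() also handles locking and memory allocation, which this model ignores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_NSLOTS 64	/* arbitrary size for the sketch */

/* Toy stand-in for the array: one atomic pointer per index. */
struct toy_array {
	_Atomic(void *) slot[TOY_NSLOTS];
};

/*
 * Store @entry at @idx only if the slot currently holds @old.
 * Returns what was in the slot before the call, so the store happened
 * exactly when the return value equals @old.
 */
static void *toy_cmpxchg(struct toy_array *a, size_t idx, void *old,
			 void *entry)
{
	void *expected = old;

	assert(idx < TOY_NSLOTS);
	/* On failure, 'expected' is overwritten with the current contents. */
	atomic_compare_exchange_strong(&a->slot[idx], &expected, entry);
	return expected;
}

/*
 * Insert @item at @idx unless something is already there.
 * Returns whichever pointer ends up occupying the slot.
 */
static void *toy_insert_or_get(struct toy_array *a, size_t idx, void *item)
{
	void *curr = toy_cmpxchg(a, idx, NULL, item);

	return curr ? curr : item;
}
```

A caller that loses the race gets the winner's pointer back and can free its own allocation, which mirrors the "if (curr) { __free_page(page); page = curr; }" branch in the hunk.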
drivers/block/brd.c | 87 ++++++++++++++---------------------------------------
1 file changed, 23 insertions(+), 64 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 8028a3a7e7fd..4d8ae1b399e6 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -17,7 +17,7 @@
#include <linux/bio.h>
#include <linux/highmem.h>
#include <linux/mutex.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>
@@ -29,9 +29,9 @@
#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT)

/*
- * Each block ramdisk device has a radix_tree brd_pages of pages that stores
- * the pages containing the block device's contents. A brd page's ->index is
- * its offset in PAGE_SIZE units. This is similar to, but in no way connected
+ * Each block ramdisk device has an xarray brd_pages that stores the pages
+ * containing the block device's contents. A brd page's ->index is its
+ * offset in PAGE_SIZE units. This is similar to, but in no way connected
* with, the kernel's pagecache or buffer cache (which sit above our block
* device).
*/
@@ -41,13 +41,7 @@ struct brd_device {
struct request_queue *brd_queue;
struct gendisk *brd_disk;
struct list_head brd_list;
-
- /*
- * Backing store of pages and lock to protect it. This is the contents
- * of the block device.
- */
- spinlock_t brd_lock;
- struct radix_tree_root brd_pages;
+ struct xarray brd_pages;
};

/*
@@ -62,17 +56,9 @@ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
* The page lifetime is protected by the fact that we have opened the
* device node -- brd pages will never be deleted under us, so we
* don't need any further locking or refcounting.
- *
- * This is strictly true for the radix-tree nodes as well (ie. we
- * don't actually need the rcu_read_lock()), however that is not a
- * documented feature of the radix-tree API so it is better to be
- * safe here (we don't have total exclusion from radix tree updates
- * here, only deletes).
*/
- rcu_read_lock();
idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
- page = radix_tree_lookup(&brd->brd_pages, idx);
- rcu_read_unlock();
+ page = xa_load(&brd->brd_pages, idx);

BUG_ON(page && page->index != idx);

@@ -87,7 +73,7 @@ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
{
pgoff_t idx;
- struct page *page;
+ struct page *curr, *page;
gfp_t gfp_flags;

page = brd_lookup_page(brd, sector);
@@ -108,62 +94,36 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
if (!page)
return NULL;

- if (radix_tree_preload(GFP_NOIO)) {
- __free_page(page);
- return NULL;
- }
-
- spin_lock(&brd->brd_lock);
idx = sector >> PAGE_SECTORS_SHIFT;
page->index = idx;
- if (radix_tree_insert(&brd->brd_pages, idx, page)) {
+ curr = xa_cmpxchg(&brd->brd_pages, idx, NULL, page, GFP_NOIO);
+ if (curr) {
__free_page(page);
- page = radix_tree_lookup(&brd->brd_pages, idx);
+ page = curr;
BUG_ON(!page);
BUG_ON(page->index != idx);
}
- spin_unlock(&brd->brd_lock);
-
- radix_tree_preload_end();

return page;
}

/*
- * Free all backing store pages and radix tree. This must only be called when
+ * Free all backing store pages and xarray. This must only be called when
* there are no other users of the device.
*/
-#define FREE_BATCH 16
static void brd_free_pages(struct brd_device *brd)
{
- unsigned long pos = 0;
- struct page *pages[FREE_BATCH];
- int nr_pages;
-
- do {
- int i;
-
- nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
- (void **)pages, pos, FREE_BATCH);
-
- for (i = 0; i < nr_pages; i++) {
- void *ret;
-
- BUG_ON(pages[i]->index < pos);
- pos = pages[i]->index;
- ret = radix_tree_delete(&brd->brd_pages, pos);
- BUG_ON(!ret || ret != pages[i]);
- __free_page(pages[i]);
- }
-
- pos++;
-
- /*
- * This assumes radix_tree_gang_lookup always returns as
- * many pages as possible. If the radix-tree code changes,
- * so will this have to.
- */
- } while (nr_pages == FREE_BATCH);
+ XA_STATE(xas, &brd->brd_pages, 0);
+ struct page *page;
+
+ /* lockdep can't know there are no other users */
+ xas_lock(&xas);
+ xas_for_each(&xas, page, ULONG_MAX) {
+ BUG_ON(page->index != xas.xa_index);
+ __free_page(page);
+ xas_store(&xas, NULL);
+ }
+ xas_unlock(&xas);
}

/*
@@ -373,8 +333,7 @@ static struct brd_device *brd_alloc(int i)
if (!brd)
goto out;
brd->brd_number = i;
- spin_lock_init(&brd->brd_lock);
- INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
+ xa_init(&brd->brd_pages);

brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
if (!brd->brd_queue)
--
2.15.0

2017-12-06 00:43:04

by Matthew Wilcox

Subject: [PATCH v4 54/73] nilfs2: Convert to XArray

From: Matthew Wilcox <[email protected]>

I'm not 100% convinced that the rewrite of nilfs_copy_back_pages is
correct, but it will at least have different bugs from the current
version.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/nilfs2/btnode.c | 37 +++++++++++-----------------
fs/nilfs2/page.c | 72 +++++++++++++++++++++++++++++++-----------------------
2 files changed, 56 insertions(+), 53 deletions(-)

diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index 9e2a00207436..92cf58e244f9 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -177,42 +177,36 @@ int nilfs_btnode_prepare_change_key(struct address_space *btnc,
ctxt->newbh = NULL;

if (inode->i_blkbits == PAGE_SHIFT) {
- lock_page(obh->b_page);
- /*
- * We cannot call radix_tree_preload for the kernels older
- * than 2.6.23, because it is not exported for modules.
- */
+ void *entry;
+ struct page *opage = obh->b_page;
+ lock_page(opage);
retry:
- err = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM);
- if (err)
- goto failed_unlock;
/* BUG_ON(oldkey != obh->b_page->index); */
- if (unlikely(oldkey != obh->b_page->index))
- NILFS_PAGE_BUG(obh->b_page,
+ if (unlikely(oldkey != opage->index))
+ NILFS_PAGE_BUG(opage,
"invalid oldkey %lld (newkey=%lld)",
(unsigned long long)oldkey,
(unsigned long long)newkey);

- xa_lock_irq(&btnc->pages);
- err = radix_tree_insert(&btnc->pages, newkey, obh->b_page);
- xa_unlock_irq(&btnc->pages);
+ entry = xa_cmpxchg(&btnc->pages, newkey, NULL, opage, GFP_NOFS);
/*
* Note: page->index will not change to newkey until
* nilfs_btnode_commit_change_key() will be called.
* To protect the page in intermediate state, the page lock
* is held.
*/
- radix_tree_preload_end();
- if (!err)
+ if (!entry)
return 0;
- else if (err != -EEXIST)
+ if (IS_ERR(entry)) {
+ err = PTR_ERR(entry);
goto failed_unlock;
+ }

err = invalidate_inode_pages2_range(btnc, newkey, newkey);
if (!err)
goto retry;
/* fallback to copy mode */
- unlock_page(obh->b_page);
+ unlock_page(opage);
}

nbh = nilfs_btnode_create_block(btnc, newkey);
@@ -252,9 +246,8 @@ void nilfs_btnode_commit_change_key(struct address_space *btnc,
mark_buffer_dirty(obh);

xa_lock_irq(&btnc->pages);
- radix_tree_delete(&btnc->pages, oldkey);
- radix_tree_tag_set(&btnc->pages, newkey,
- PAGECACHE_TAG_DIRTY);
+ __xa_erase(&btnc->pages, oldkey);
+ __xa_set_tag(&btnc->pages, newkey, PAGECACHE_TAG_DIRTY);
xa_unlock_irq(&btnc->pages);

opage->index = obh->b_blocknr = newkey;
@@ -283,9 +276,7 @@ void nilfs_btnode_abort_change_key(struct address_space *btnc,
return;

if (nbh == NULL) { /* blocksize == pagesize */
- xa_lock_irq(&btnc->pages);
- radix_tree_delete(&btnc->pages, newkey);
- xa_unlock_irq(&btnc->pages);
+ xa_erase(&btnc->pages, newkey);
unlock_page(ctxt->bh->b_page);
} else
brelse(nbh);
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 1c6703efde9e..31d20f624971 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -304,10 +304,10 @@ int nilfs_copy_dirty_pages(struct address_space *dmap,
void nilfs_copy_back_pages(struct address_space *dmap,
struct address_space *smap)
{
+ XA_STATE(xas, &dmap->pages, 0);
struct pagevec pvec;
unsigned int i, n;
pgoff_t index = 0;
- int err;

pagevec_init(&pvec);
repeat:
@@ -317,43 +317,56 @@ void nilfs_copy_back_pages(struct address_space *dmap,

for (i = 0; i < pagevec_count(&pvec); i++) {
struct page *page = pvec.pages[i], *dpage;
- pgoff_t offset = page->index;
+ xas_set(&xas, page->index);

lock_page(page);
- dpage = find_lock_page(dmap, offset);
+ do {
+ xas_lock_irq(&xas);
+ dpage = xas_create(&xas);
+ if (!xas_error(&xas))
+ break;
+ xas_unlock_irq(&xas);
+ if (!xas_nomem(&xas, GFP_NOFS)) {
+ unlock_page(page);
+ /*
+ * Callers have a touching faith that this
+ * function cannot fail. Just leak the page.
+ * Other pages may be salvageable if the
+ * xarray doesn't need to allocate memory
+ * to store them.
+ */
+ WARN_ON(1);
+ page->mapping = NULL;
+ put_page(page);
+ goto shadow_remove;
+ }
+ } while (1);
+
if (dpage) {
- /* override existing page on the destination cache */
+ get_page(dpage);
+ xas_unlock_irq(&xas);
+ lock_page(dpage);
+ /* override existing page in the destination cache */
WARN_ON(PageDirty(dpage));
nilfs_copy_page(dpage, page, 0);
unlock_page(dpage);
put_page(dpage);
} else {
- struct page *page2;
-
- /* move the page to the destination cache */
- xa_lock_irq(&smap->pages);
- page2 = radix_tree_delete(&smap->pages, offset);
- WARN_ON(page2 != page);
-
- smap->nrpages--;
- xa_unlock_irq(&smap->pages);
-
- xa_lock_irq(&dmap->pages);
- err = radix_tree_insert(&dmap->pages, offset, page);
- if (unlikely(err < 0)) {
- WARN_ON(err == -EEXIST);
- page->mapping = NULL;
- put_page(page); /* for cache */
- } else {
- page->mapping = dmap;
- dmap->nrpages++;
- if (PageDirty(page))
- radix_tree_tag_set(&dmap->pages,
- offset,
- PAGECACHE_TAG_DIRTY);
- }
+ xas_store(&xas, page);
+ page->mapping = dmap;
+ dmap->nrpages++;
+ if (PageDirty(page))
+ xas_set_tag(&xas, PAGECACHE_TAG_DIRTY);
xa_unlock_irq(&dmap->pages);
}
+
+shadow_remove:
+ /* remove the page from the shadow cache */
+ xa_lock_irq(&smap->pages);
+ WARN_ON(__xa_erase(&smap->pages, xas.xa_index) != page);
+ smap->nrpages--;
+ xa_unlock_irq(&smap->pages);
+
unlock_page(page);
}
pagevec_release(&pvec);
@@ -476,8 +489,7 @@ int __nilfs_clear_page_dirty(struct page *page)
if (mapping) {
xa_lock_irq(&mapping->pages);
if (test_bit(PG_dirty, &page->flags)) {
- radix_tree_tag_clear(&mapping->pages,
- page_index(page),
+ __xa_clear_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
xa_unlock_irq(&mapping->pages);
return clear_page_dirty_for_io(page);
--
2.15.0

2017-12-06 00:43:00

by Matthew Wilcox

Subject: [PATCH v4 67/73] vmalloc: Convert to XArray

From: Matthew Wilcox <[email protected]>

The radix tree of vmap blocks is simpler to express as an XArray.
Saves a couple of hundred bytes of text and eliminates a user of the
radix tree preload API.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/vmalloc.c | 39 +++++++++++++--------------------------
1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 673942094328..3a46efc27525 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -23,7 +23,7 @@
#include <linux/list.h>
#include <linux/notifier.h>
#include <linux/rbtree.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
#include <linux/rcupdate.h>
#include <linux/pfn.h>
#include <linux/kmemleak.h>
@@ -821,12 +821,11 @@ struct vmap_block {
static DEFINE_PER_CPU(struct vmap_block_queue, vmap_block_queue);

/*
- * Radix tree of vmap blocks, indexed by address, to quickly find a vmap block
+ * XArray of vmap blocks, indexed by address, to quickly find a vmap block
* in the free path. Could get rid of this if we change the API to return a
* "cookie" from alloc, to be passed to free. But no big deal yet.
*/
-static DEFINE_SPINLOCK(vmap_block_tree_lock);
-static RADIX_TREE(vmap_block_tree, GFP_ATOMIC);
+static DEFINE_XARRAY(vmap_block_tree);

/*
* We should probably have a fallback mechanism to allocate virtual memory
@@ -865,8 +864,8 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
struct vmap_block *vb;
struct vmap_area *va;
unsigned long vb_idx;
- int node, err;
- void *vaddr;
+ int node;
+ void *ret, *vaddr;

node = numa_node_id();

@@ -883,13 +882,6 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
return ERR_CAST(va);
}

- err = radix_tree_preload(gfp_mask);
- if (unlikely(err)) {
- kfree(vb);
- free_vmap_area(va);
- return ERR_PTR(err);
- }
-
vaddr = vmap_block_vaddr(va->va_start, 0);
spin_lock_init(&vb->lock);
vb->va = va;
@@ -902,11 +894,12 @@ static void *new_vmap_block(unsigned int order, gfp_t gfp_mask)
INIT_LIST_HEAD(&vb->free_list);

vb_idx = addr_to_vb_idx(va->va_start);
- spin_lock(&vmap_block_tree_lock);
- err = radix_tree_insert(&vmap_block_tree, vb_idx, vb);
- spin_unlock(&vmap_block_tree_lock);
- BUG_ON(err);
- radix_tree_preload_end();
+ ret = xa_store(&vmap_block_tree, vb_idx, vb, gfp_mask);
+ if (IS_ERR(ret)) {
+ kfree(vb);
+ free_vmap_area(va);
+ return ret;
+ }

vbq = &get_cpu_var(vmap_block_queue);
spin_lock(&vbq->lock);
@@ -923,9 +916,7 @@ static void free_vmap_block(struct vmap_block *vb)
unsigned long vb_idx;

vb_idx = addr_to_vb_idx(vb->va->va_start);
- spin_lock(&vmap_block_tree_lock);
- tmp = radix_tree_delete(&vmap_block_tree, vb_idx);
- spin_unlock(&vmap_block_tree_lock);
+ tmp = xa_erase(&vmap_block_tree, vb_idx);
BUG_ON(tmp != vb);

free_vmap_area_noflush(vb->va);
@@ -1031,7 +1022,6 @@ static void *vb_alloc(unsigned long size, gfp_t gfp_mask)
static void vb_free(const void *addr, unsigned long size)
{
unsigned long offset;
- unsigned long vb_idx;
unsigned int order;
struct vmap_block *vb;

@@ -1045,10 +1035,7 @@ static void vb_free(const void *addr, unsigned long size)
offset = (unsigned long)addr & (VMAP_BLOCK_SIZE - 1);
offset >>= PAGE_SHIFT;

- vb_idx = addr_to_vb_idx((unsigned long)addr);
- rcu_read_lock();
- vb = radix_tree_lookup(&vmap_block_tree, vb_idx);
- rcu_read_unlock();
+ vb = xa_load(&vmap_block_tree, addr_to_vb_idx((unsigned long)addr));
BUG_ON(!vb);

vunmap_page_range((unsigned long)addr, (unsigned long)addr + size);
--
2.15.0
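
In the new_vmap_block() hunk above, this version of the series has xa_store() report failure through its return value, which the caller tests with IS_ERR(). A minimal user-space sketch of that pointer-encoded errno convention follows. ERR_PTR(), IS_ERR() and PTR_ERR() are simplified re-implementations of the kernel helpers, and toy_store() and struct toy_array are hypothetical stand-ins for illustration, not the XArray implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified user-space copies of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

#define TOY_SLOTS 8	/* hypothetical capacity */

/* Hypothetical fixed-size array standing in for an XArray. */
struct toy_array {
	void *slot[TOY_SLOTS];
};

/*
 * Hypothetical stand-in for xa_store(): returns the previous entry on
 * success, or an encoded -ENOMEM when the index cannot be stored.
 */
void *toy_store(struct toy_array *xa, unsigned long index, void *entry)
{
	void *old;

	assert(xa != NULL);
	if (index >= TOY_SLOTS)
		return ERR_PTR(-ENOMEM);
	old = xa->slot[index];
	xa->slot[index] = entry;
	return old;
}
```

Because the error travels in the returned pointer, the caller needs no separate int err variable, which is why the hunk drops it in favour of void *ret.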

2017-12-06 00:46:42

by Matthew Wilcox

Subject: [PATCH v4 69/73] xfs: Convert m_perag_tree to XArray

From: Matthew Wilcox <[email protected]>

Getting rid of the m_perag_lock lets us also get rid of the call to
radix_tree_preload(). This is a relatively naive conversion; we could
improve performance over the radix tree implementation by passing around
xa_state pointers instead of indices, possibly at the expense of extending
rcu_read_lock() periods.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/xfs/libxfs/xfs_sb.c | 9 ++++-----
fs/xfs/xfs_icache.c | 35 +++++++++--------------------------
fs/xfs/xfs_icache.h | 6 +++---
fs/xfs/xfs_mount.c | 19 ++++---------------
fs/xfs/xfs_mount.h | 3 +--
5 files changed, 21 insertions(+), 51 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 9b5aae2bcc0b..3b0b65eb8224 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -59,7 +59,7 @@ xfs_perag_get(
int ref = 0;

rcu_read_lock();
- pag = radix_tree_lookup(&mp->m_perag_tree, agno);
+ pag = xa_load(&mp->m_perag_xa, agno);
if (pag) {
ASSERT(atomic_read(&pag->pag_ref) >= 0);
ref = atomic_inc_return(&pag->pag_ref);
@@ -78,14 +78,13 @@ xfs_perag_get_tag(
xfs_agnumber_t first,
int tag)
{
+ XA_STATE(xas, &mp->m_perag_xa, first);
struct xfs_perag *pag;
- int found;
int ref;

rcu_read_lock();
- found = radix_tree_gang_lookup_tag(&mp->m_perag_tree,
- (void **)&pag, first, 1, tag);
- if (found <= 0) {
+ pag = xas_find_tag(&xas, ULONG_MAX, tag);
+ if (!pag) {
rcu_read_unlock();
return NULL;
}
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 43005fbe8b1e..f56e500d89e2 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -156,13 +156,10 @@ static void
xfs_reclaim_work_queue(
struct xfs_mount *mp)
{
-
- rcu_read_lock();
- if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG)) {
+ if (xa_tagged(&mp->m_perag_xa, XFS_ICI_RECLAIM_TAG)) {
queue_delayed_work(mp->m_reclaim_workqueue, &mp->m_reclaim_work,
msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
}
- rcu_read_unlock();
}

/*
@@ -194,10 +191,7 @@ xfs_perag_set_reclaim_tag(
return;

/* propagate the reclaim tag up into the perag radix tree */
- spin_lock(&mp->m_perag_lock);
- radix_tree_tag_set(&mp->m_perag_tree, pag->pag_agno,
- XFS_ICI_RECLAIM_TAG);
- spin_unlock(&mp->m_perag_lock);
+ xa_set_tag(&mp->m_perag_xa, pag->pag_agno, XFS_ICI_RECLAIM_TAG);

/* schedule periodic background inode reclaim */
xfs_reclaim_work_queue(mp);
@@ -216,10 +210,7 @@ xfs_perag_clear_reclaim_tag(
return;

/* clear the reclaim tag from the perag radix tree */
- spin_lock(&mp->m_perag_lock);
- radix_tree_tag_clear(&mp->m_perag_tree, pag->pag_agno,
- XFS_ICI_RECLAIM_TAG);
- spin_unlock(&mp->m_perag_lock);
+ xa_clear_tag(&mp->m_perag_xa, pag->pag_agno, XFS_ICI_RECLAIM_TAG);
trace_xfs_perag_clear_reclaim(mp, pag->pag_agno, -1, _RET_IP_);
}

@@ -847,12 +838,10 @@ void
xfs_queue_eofblocks(
struct xfs_mount *mp)
{
- rcu_read_lock();
- if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
+ if (xa_tagged(&mp->m_perag_xa, XFS_ICI_EOFBLOCKS_TAG))
queue_delayed_work(mp->m_eofblocks_workqueue,
&mp->m_eofblocks_work,
msecs_to_jiffies(xfs_eofb_secs * 1000));
- rcu_read_unlock();
}

void
@@ -874,12 +863,10 @@ STATIC void
xfs_queue_cowblocks(
struct xfs_mount *mp)
{
- rcu_read_lock();
- if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_COWBLOCKS_TAG))
+ if (xa_tagged(&mp->m_perag_xa, XFS_ICI_COWBLOCKS_TAG))
queue_delayed_work(mp->m_eofblocks_workqueue,
&mp->m_cowblocks_work,
msecs_to_jiffies(xfs_cowb_secs * 1000));
- rcu_read_unlock();
}

void
@@ -1542,7 +1529,7 @@ __xfs_inode_set_eofblocks_tag(
void (*execute)(struct xfs_mount *mp),
void (*set_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
int error, unsigned long caller_ip),
- int tag)
+ xa_tag_t tag)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_perag *pag;
@@ -1566,11 +1553,9 @@ __xfs_inode_set_eofblocks_tag(
XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
if (!tagged) {
/* propagate the eofblocks tag up into the perag radix tree */
- spin_lock(&ip->i_mount->m_perag_lock);
- radix_tree_tag_set(&ip->i_mount->m_perag_tree,
+ xa_set_tag(&ip->i_mount->m_perag_xa,
XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
tag);
- spin_unlock(&ip->i_mount->m_perag_lock);

/* kick off background trimming */
execute(ip->i_mount);
@@ -1597,7 +1582,7 @@ __xfs_inode_clear_eofblocks_tag(
xfs_inode_t *ip,
void (*clear_tp)(struct xfs_mount *mp, xfs_agnumber_t agno,
int error, unsigned long caller_ip),
- int tag)
+ xa_tag_t tag)
{
struct xfs_mount *mp = ip->i_mount;
struct xfs_perag *pag;
@@ -1613,11 +1598,9 @@ __xfs_inode_clear_eofblocks_tag(
XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino), tag);
if (!radix_tree_tagged(&pag->pag_ici_root, tag)) {
/* clear the eofblocks tag from the perag radix tree */
- spin_lock(&ip->i_mount->m_perag_lock);
- radix_tree_tag_clear(&ip->i_mount->m_perag_tree,
+ xa_clear_tag(&ip->i_mount->m_perag_xa,
XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino),
tag);
- spin_unlock(&ip->i_mount->m_perag_lock);
clear_tp(ip->i_mount, pag->pag_agno, -1, _RET_IP_);
}

diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index bff4d85e5498..bd04d5adadfe 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -37,9 +37,9 @@ struct xfs_eofblocks {
*/
#define XFS_ICI_NO_TAG (-1) /* special flag for an untagged lookup
in xfs_inode_ag_iterator */
-#define XFS_ICI_RECLAIM_TAG 0 /* inode is to be reclaimed */
-#define XFS_ICI_EOFBLOCKS_TAG 1 /* inode has blocks beyond EOF */
-#define XFS_ICI_COWBLOCKS_TAG 2 /* inode can have cow blocks to gc */
+#define XFS_ICI_RECLAIM_TAG XA_TAG_0 /* inode is to be reclaimed */
+#define XFS_ICI_EOFBLOCKS_TAG XA_TAG_1 /* inode has blocks beyond EOF */
+#define XFS_ICI_COWBLOCKS_TAG XA_TAG_2 /* inode can have cow blocks to gc */

/*
* Flags for xfs_iget()
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index c879b517cc94..0541aeb8449c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -156,9 +156,7 @@ xfs_free_perag(
struct xfs_perag *pag;

for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
- spin_lock(&mp->m_perag_lock);
- pag = radix_tree_delete(&mp->m_perag_tree, agno);
- spin_unlock(&mp->m_perag_lock);
+ pag = xa_erase(&mp->m_perag_xa, agno);
ASSERT(pag);
ASSERT(atomic_read(&pag->pag_ref) == 0);
xfs_buf_hash_destroy(pag);
@@ -219,19 +217,11 @@ xfs_initialize_perag(
goto out_free_pag;
init_waitqueue_head(&pag->pagb_wait);

- if (radix_tree_preload(GFP_NOFS))
- goto out_hash_destroy;
-
- spin_lock(&mp->m_perag_lock);
- if (radix_tree_insert(&mp->m_perag_tree, index, pag)) {
+ if (xa_store(&mp->m_perag_xa, index, pag, GFP_NOFS)) {
BUG();
- spin_unlock(&mp->m_perag_lock);
- radix_tree_preload_end();
error = -EEXIST;
goto out_hash_destroy;
}
- spin_unlock(&mp->m_perag_lock);
- radix_tree_preload_end();
/* first new pag is fully initialized */
if (first_initialised == NULLAGNUMBER)
first_initialised = index;
@@ -252,7 +242,7 @@ xfs_initialize_perag(
out_unwind_new_pags:
/* unwind any prior newly initialized pags */
for (index = first_initialised; index < agcount; index++) {
- pag = radix_tree_delete(&mp->m_perag_tree, index);
+ pag = xa_erase(&mp->m_perag_xa, index);
if (!pag)
break;
xfs_buf_hash_destroy(pag);
@@ -816,8 +806,7 @@ xfs_mountfs(
/*
* Allocate and initialize the per-ag data.
*/
- spin_lock_init(&mp->m_perag_lock);
- INIT_RADIX_TREE(&mp->m_perag_tree, GFP_ATOMIC);
+ xa_init(&mp->m_perag_xa);
error = xfs_initialize_perag(mp, sbp->sb_agcount, &mp->m_maxagi);
if (error) {
xfs_warn(mp, "Failed per-ag init: %d", error);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e0792d036be2..6e5ad7b26f46 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -134,8 +134,7 @@ typedef struct xfs_mount {
xfs_extlen_t m_ag_prealloc_blocks; /* reserved ag blocks */
uint m_alloc_set_aside; /* space we can't use */
uint m_ag_max_usable; /* max space per AG */
- struct radix_tree_root m_perag_tree; /* per-ag accounting info */
- spinlock_t m_perag_lock; /* lock for m_perag_tree */
+ struct xarray m_perag_xa; /* per-ag accounting info */
struct mutex m_growlock; /* growfs mutex */
int m_fixedfsid[2]; /* unchanged for life of FS */
uint m_dmevmask; /* DMI events for this FS */
--
2.15.0
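
The conversion above replaces radix_tree_tag_set(), radix_tree_tag_clear() and radix_tree_tagged() under m_perag_lock with xa_set_tag(), xa_clear_tag() and xa_tagged(). As a rough user-space model of what a tag query answers (is any entry marked with this tag?), here is a sketch using one bitmap per tag. The toy_* names and the fixed 64-entry size are assumptions for illustration only; the real XArray keeps tag bits in its internal nodes and handles its own locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ENTRIES 64	/* hypothetical capacity: one bit per index */
#define TOY_TAGS 3	/* mirrors the three tags used by the patch */

/* Hypothetical model: one presence bitmap plus one bitmap per tag. */
struct toy_tagged_array {
	uint64_t present;
	uint64_t tags[TOY_TAGS];
};

/* Mark an index as holding an entry. */
void toy_insert(struct toy_tagged_array *xa, unsigned int index)
{
	assert(index < TOY_ENTRIES);
	xa->present |= UINT64_C(1) << index;
}

/* Set a tag on an index; in this model, tagging an empty index is a no-op. */
void toy_set_tag(struct toy_tagged_array *xa, unsigned int index,
		 unsigned int tag)
{
	assert(index < TOY_ENTRIES && tag < TOY_TAGS);
	if (xa->present & (UINT64_C(1) << index))
		xa->tags[tag] |= UINT64_C(1) << index;
}

/* Clear a tag on an index. */
void toy_clear_tag(struct toy_tagged_array *xa, unsigned int index,
		   unsigned int tag)
{
	assert(index < TOY_ENTRIES && tag < TOY_TAGS);
	xa->tags[tag] &= ~(UINT64_C(1) << index);
}

/* Does any entry in the array carry this tag? */
bool toy_tagged(const struct toy_tagged_array *xa, unsigned int tag)
{
	assert(tag < TOY_TAGS);
	return xa->tags[tag] != 0;
}
```

In this model the "any entry tagged?" query is a single comparison per tag, which is the question xfs_reclaim_work_queue() and its siblings ask before queueing work.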

2017-12-06 00:46:47

by Matthew Wilcox

Subject: [PATCH v4 66/73] page cache: Finish XArray conversion

From: Matthew Wilcox <[email protected]>

With no more radix tree API users left, we can drop the GFP flags
and use xa_init() instead of INIT_RADIX_TREE().

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/inode.c | 2 +-
include/linux/fs.h | 2 +-
mm/swap_state.c | 2 +-
3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index c7b00573c10d..2046ff6dd1b3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -348,7 +348,7 @@ EXPORT_SYMBOL(inc_nlink);
void address_space_init_once(struct address_space *mapping)
{
memset(mapping, 0, sizeof(*mapping));
- INIT_RADIX_TREE(&mapping->pages, GFP_ATOMIC | __GFP_ACCOUNT);
+ xa_init(&mapping->pages);
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
spin_lock_init(&mapping->private_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c58bc3c619bf..b459bf4ddb62 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -410,7 +410,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
*/
struct address_space {
struct inode *host;
- struct radix_tree_root pages;
+ struct xarray pages;
gfp_t gfp_mask;
atomic_t i_mmap_writable;
struct rb_root_cached i_mmap;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7c862258af66..101e952e01e6 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -573,7 +573,7 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
return -ENOMEM;
for (i = 0; i < nr; i++) {
space = spaces + i;
- INIT_RADIX_TREE(&space->pages, GFP_ATOMIC|__GFP_NOWARN);
+ xa_init(&space->pages);
atomic_set(&space->i_mmap_writable, 0);
space->a_ops = &swap_aops;
/* swap cache doesn't use writeback related tags */
--
2.15.0

2017-12-06 00:46:52

by Matthew Wilcox

Subject: [PATCH v4 73/73] usb: Convert xhci-mem to XArray

From: Matthew Wilcox <[email protected]>

The XArray API is a slightly better fit for xhci_insert_segment_mapping()
than the radix tree API was.

Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/usb/host/xhci-mem.c | 70 +++++++++++++++++++++------------------------
drivers/usb/host/xhci.h | 6 ++--
2 files changed, 35 insertions(+), 41 deletions(-)

diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index e1fba4688509..533d813bdc52 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -149,70 +149,64 @@ static void xhci_link_rings(struct xhci_hcd *xhci, struct xhci_ring *ring,
}

/*
- * We need a radix tree for mapping physical addresses of TRBs to which stream
- * ID they belong to. We need to do this because the host controller won't tell
+ * We need to map physical addresses of TRBs to the stream ID they belong to.
+ * We need to do this because the host controller won't tell
* us which stream ring the TRB came from. We could store the stream ID in an
* event data TRB, but that doesn't help us for the cancellation case, since the
* endpoint may stop before it reaches that event data TRB.
*
- * The radix tree maps the upper portion of the TRB DMA address to a ring
+ * The xarray maps the upper portion of the TRB DMA address to a ring
* segment that has the same upper portion of DMA addresses. For example, say I
* have segments of size 1KB, that are always 1KB aligned. A segment may
* start at 0x10c91000 and end at 0x10c913f0. If I use the upper 10 bits, the
- * key to the stream ID is 0x43244. I can use the DMA address of the TRB to
- * pass the radix tree a key to get the right stream ID:
+ * index of the stream ID is 0x43244. I can use the DMA address of the TRB as
+ * the xarray index to get the right stream ID:
*
* 0x10c90fff >> 10 = 0x43243
* 0x10c912c0 >> 10 = 0x43244
* 0x10c91400 >> 10 = 0x43245
*
* Obviously, only those TRBs with DMA addresses that are within the segment
- * will make the radix tree return the stream ID for that ring.
+ * will make the xarray return the stream ID for that ring.
*
- * Caveats for the radix tree:
+ * Caveats for the xarray:
*
- * The radix tree uses an unsigned long as a key pair. On 32-bit systems, an
+ * The xarray uses an unsigned long for the index. On 32-bit systems, an
* unsigned long will be 32-bits; on a 64-bit system an unsigned long will be
* 64-bits. Since we only request 32-bit DMA addresses, we can use that as the
- * key on 32-bit or 64-bit systems (it would also be fine if we asked for 64-bit
- * PCI DMA addresses on a 64-bit system). There might be a problem on 32-bit
- * extended systems (where the DMA address can be bigger than 32-bits),
+ * index on 32-bit or 64-bit systems (it would also be fine if we asked for
+ * 64-bit PCI DMA addresses on a 64-bit system). There might be a problem on
+ * 32-bit extended systems (where the DMA address can be bigger than 32-bits),
* if we allow the PCI dma mask to be bigger than 32-bits. So don't do that.
*/
-static int xhci_insert_segment_mapping(struct radix_tree_root *trb_address_map,
+
+static unsigned long trb_index(dma_addr_t dma)
+{
+ return (unsigned long)(dma >> TRB_SEGMENT_SHIFT);
+}
+
+static int xhci_insert_segment_mapping(struct xarray *trb_address_map,
struct xhci_ring *ring,
struct xhci_segment *seg,
- gfp_t mem_flags)
+ gfp_t gfp)
{
- unsigned long key;
- int ret;
-
- key = (unsigned long)(seg->dma >> TRB_SEGMENT_SHIFT);
/* Skip any segments that were already added. */
- if (radix_tree_lookup(trb_address_map, key))
- return 0;
+ void *entry = xa_cmpxchg(trb_address_map, trb_index(seg->dma), NULL,
+ ring, gfp);

- ret = radix_tree_maybe_preload(mem_flags);
- if (ret)
- return ret;
- ret = radix_tree_insert(trb_address_map,
- key, ring);
- radix_tree_preload_end();
- return ret;
+ if (IS_ERR(entry))
+ return PTR_ERR(entry);
+ return 0;
}

-static void xhci_remove_segment_mapping(struct radix_tree_root *trb_address_map,
+static void xhci_remove_segment_mapping(struct xarray *trb_address_map,
struct xhci_segment *seg)
{
- unsigned long key;
-
- key = (unsigned long)(seg->dma >> TRB_SEGMENT_SHIFT);
- if (radix_tree_lookup(trb_address_map, key))
- radix_tree_delete(trb_address_map, key);
+ xa_erase(trb_address_map, trb_index(seg->dma));
}

static int xhci_update_stream_segment_mapping(
- struct radix_tree_root *trb_address_map,
+ struct xarray *trb_address_map,
struct xhci_ring *ring,
struct xhci_segment *first_seg,
struct xhci_segment *last_seg,
@@ -574,8 +568,8 @@ struct xhci_ring *xhci_dma_to_transfer_ring(
u64 address)
{
if (ep->ep_state & EP_HAS_STREAMS)
- return radix_tree_lookup(&ep->stream_info->trb_address_map,
- address >> TRB_SEGMENT_SHIFT);
+ return xa_load(&ep->stream_info->trb_address_map,
+ trb_index(address));
return ep->ring;
}

@@ -654,10 +648,10 @@ struct xhci_stream_info *xhci_alloc_stream_info(struct xhci_hcd *xhci,
if (!stream_info->free_streams_command)
goto cleanup_ctx;

- INIT_RADIX_TREE(&stream_info->trb_address_map, GFP_ATOMIC);
+ xa_init(&stream_info->trb_address_map);

/* Allocate rings for all the streams that the driver will use,
- * and add their segment DMA addresses to the radix tree.
+ * and add their segment DMA addresses to the map.
* Stream 0 is reserved.
*/

@@ -2362,7 +2356,7 @@ int xhci_mem_init(struct xhci_hcd *xhci, gfp_t flags)
* Initialize the ring segment pool. The ring must be a contiguous
* structure comprised of TRBs. The TRBs must be 16 byte aligned,
* however, the command ring segment needs 64-byte aligned segments
- * and our use of dma addresses in the trb_address_map radix tree needs
+ * and our use of dma addresses in the trb_address_map xarray needs
* TRB_SEGMENT_SIZE alignment, so we pick the greater alignment need.
*/
xhci->segment_pool = dma_pool_create("xHCI ring segments", dev,
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 054ce74524af..e8208a3eee3c 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -15,7 +15,7 @@
#include <linux/usb.h>
#include <linux/timer.h>
#include <linux/kernel.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
#include <linux/usb/hcd.h>
#include <linux/io-64-nonatomic-lo-hi.h>

@@ -837,7 +837,7 @@ struct xhci_stream_info {
unsigned int num_stream_ctxs;
dma_addr_t ctx_array_dma;
/* For mapping physical TRB addresses to segments in stream rings */
- struct radix_tree_root trb_address_map;
+ struct xarray trb_address_map;
struct xhci_command *free_streams_command;
};

@@ -1584,7 +1584,7 @@ struct xhci_ring {
unsigned int bounce_buf_len;
enum xhci_ring_type type;
bool last_td_was_short;
- struct radix_tree_root *trb_address_map;
+ struct xarray *trb_address_map;
};

struct xhci_erst_entry {
--
2.15.0
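
The comment block in xhci-mem.c works through how a TRB DMA address becomes an array index. The sketch below replays that arithmetic in user space. EXAMPLE_SEGMENT_SHIFT is set to 10 to match the 1KB segments of the comment's worked example; it is an assumption for illustration and not the driver's actual TRB_SEGMENT_SHIFT. The example_* names are likewise hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shift used by the worked example in the comment (1KB segments).
 * This is an assumption for illustration, not the driver's real
 * TRB_SEGMENT_SHIFT value.
 */
#define EXAMPLE_SEGMENT_SHIFT 10

/* Same computation as trb_index() in the patch, with a fixed-width type. */
unsigned long example_trb_index(uint64_t dma)
{
	return (unsigned long)(dma >> EXAMPLE_SEGMENT_SHIFT);
}

/* True when two DMA addresses fall in the same 1KB-aligned segment. */
int example_same_segment(uint64_t a, uint64_t b)
{
	return example_trb_index(a) == example_trb_index(b);
}

/* Replays the three shifts listed in the driver comment. */
void example_check_comment_values(void)
{
	assert(example_trb_index(0x10c90fff) == 0x43243);
	assert(example_trb_index(0x10c912c0) == 0x43244);
	assert(example_trb_index(0x10c91400) == 0x43245);
}
```

Only addresses between the segment's start (0x10c91000) and its last TRB (0x10c913f0) map to index 0x43244; the addresses one byte below and one TRB above land on neighbouring indices, so a lookup with them would not find this ring.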

2017-12-06 00:46:39

by Matthew Wilcox

Subject: [PATCH v4 71/73] xfs: Convert xfs dquot to XArray

From: Matthew Wilcox <[email protected]>

This is a pretty straightforward conversion.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/xfs/xfs_dquot.c | 33 +++++++++++++++++----------------
fs/xfs/xfs_qm.c | 32 ++++++++++++++++----------------
fs/xfs/xfs_qm.h | 18 +++++++++---------
3 files changed, 42 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index e2a466df5dd1..a35fcc37770b 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -44,7 +44,7 @@
* Lock order:
*
* ip->i_lock
- * qi->qi_tree_lock
+ * qi->qi_xa_lock
* dquot->q_qlock (xfs_dqlock() and friends)
* dquot->q_flush (xfs_dqflock() and friends)
* qi->qi_lru_lock
@@ -752,8 +752,8 @@ xfs_qm_dqget(
xfs_dquot_t **O_dqpp) /* OUT : locked incore dquot */
{
struct xfs_quotainfo *qi = mp->m_quotainfo;
- struct radix_tree_root *tree = xfs_dquot_tree(qi, type);
- struct xfs_dquot *dqp;
+ struct xarray *xa = xfs_dquot_xa(qi, type);
+ struct xfs_dquot *dqp, *curr;
int error;

ASSERT(XFS_IS_QUOTA_RUNNING(mp));
@@ -772,13 +772,14 @@ xfs_qm_dqget(
}

restart:
- mutex_lock(&qi->qi_tree_lock);
- dqp = radix_tree_lookup(tree, id);
+ mutex_lock(&qi->qi_xa_lock);
+ dqp = xa_load(xa, id);
+found:
if (dqp) {
xfs_dqlock(dqp);
if (dqp->dq_flags & XFS_DQ_FREEING) {
xfs_dqunlock(dqp);
- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);
trace_xfs_dqget_freeing(dqp);
delay(1);
goto restart;
@@ -788,7 +789,7 @@ xfs_qm_dqget(
if (flags & XFS_QMOPT_DQNEXT) {
if (XFS_IS_DQUOT_UNINITIALIZED(dqp)) {
xfs_dqunlock(dqp);
- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);
error = xfs_dq_get_next_id(mp, type, &id);
if (error)
return error;
@@ -797,14 +798,14 @@ xfs_qm_dqget(
}

dqp->q_nrefs++;
- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);

trace_xfs_dqget_hit(dqp);
XFS_STATS_INC(mp, xs_qm_dqcachehits);
*O_dqpp = dqp;
return 0;
}
- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);
XFS_STATS_INC(mp, xs_qm_dqcachemisses);

/*
@@ -854,20 +855,20 @@ xfs_qm_dqget(
}
}

- mutex_lock(&qi->qi_tree_lock);
- error = radix_tree_insert(tree, id, dqp);
- if (unlikely(error)) {
- WARN_ON(error != -EEXIST);
+ mutex_lock(&qi->qi_xa_lock);
+ curr = xa_cmpxchg(xa, id, NULL, dqp, GFP_NOFS);
+ if (unlikely(curr)) {
+ WARN_ON(IS_ERR(curr));

/*
* Duplicate found. Just throw away the new dquot and start
* over.
*/
- mutex_unlock(&qi->qi_tree_lock);
trace_xfs_dqget_dup(dqp);
xfs_qm_dqdestroy(dqp);
XFS_STATS_INC(mp, xs_qm_dquot_dups);
- goto restart;
+ dqp = curr;
+ goto found;
}

/*
@@ -877,7 +878,7 @@ xfs_qm_dqget(
dqp->q_nrefs = 1;

qi->qi_dquots++;
- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);

/* If we are asked to find next active id, keep looking */
if (flags & XFS_QMOPT_DQNEXT) {
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index 010a13a201aa..5a75836faf92 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -67,7 +67,7 @@ xfs_qm_dquot_walk(
void *data)
{
struct xfs_quotainfo *qi = mp->m_quotainfo;
- struct radix_tree_root *tree = xfs_dquot_tree(qi, type);
+ struct xarray *xa = xfs_dquot_xa(qi, type);
uint32_t next_index;
int last_error = 0;
int skipped;
@@ -83,11 +83,11 @@ xfs_qm_dquot_walk(
int error = 0;
int i;

- mutex_lock(&qi->qi_tree_lock);
- nr_found = radix_tree_gang_lookup(tree, (void **)batch,
- next_index, XFS_DQ_LOOKUP_BATCH);
+ mutex_lock(&qi->qi_xa_lock);
+ nr_found = xa_get_entries(xa, (void **)batch, next_index,
+ ULONG_MAX, XFS_DQ_LOOKUP_BATCH);
if (!nr_found) {
- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);
break;
}

@@ -105,7 +105,7 @@ xfs_qm_dquot_walk(
last_error = error;
}

- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);

/* bail out if the filesystem is corrupted. */
if (last_error == -EFSCORRUPTED) {
@@ -178,8 +178,8 @@ xfs_qm_dqpurge(
xfs_dqfunlock(dqp);
xfs_dqunlock(dqp);

- radix_tree_delete(xfs_dquot_tree(qi, dqp->q_core.d_flags),
- be32_to_cpu(dqp->q_core.d_id));
+ xa_store(xfs_dquot_xa(qi, dqp->q_core.d_flags),
+ be32_to_cpu(dqp->q_core.d_id), NULL, GFP_NOWAIT);
qi->qi_dquots--;

/*
@@ -623,10 +623,10 @@ xfs_qm_init_quotainfo(
if (error)
goto out_free_lru;

- INIT_RADIX_TREE(&qinf->qi_uquota_tree, GFP_NOFS);
- INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
- INIT_RADIX_TREE(&qinf->qi_pquota_tree, GFP_NOFS);
- mutex_init(&qinf->qi_tree_lock);
+ xa_init(&qinf->qi_uquota_xa);
+ xa_init(&qinf->qi_gquota_xa);
+ xa_init(&qinf->qi_pquota_xa);
+ mutex_init(&qinf->qi_xa_lock);

/* mutex used to serialize quotaoffs */
mutex_init(&qinf->qi_quotaofflock);
@@ -1606,12 +1606,12 @@ xfs_qm_dqfree_one(
struct xfs_mount *mp = dqp->q_mount;
struct xfs_quotainfo *qi = mp->m_quotainfo;

- mutex_lock(&qi->qi_tree_lock);
- radix_tree_delete(xfs_dquot_tree(qi, dqp->q_core.d_flags),
- be32_to_cpu(dqp->q_core.d_id));
+ mutex_lock(&qi->qi_xa_lock);
+ xa_store(xfs_dquot_xa(qi, dqp->q_core.d_flags),
+ be32_to_cpu(dqp->q_core.d_id), NULL, GFP_NOWAIT);

qi->qi_dquots--;
- mutex_unlock(&qi->qi_tree_lock);
+ mutex_unlock(&qi->qi_xa_lock);

xfs_qm_dqdestroy(dqp);
}
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 2975a822e9f0..946f929f7bfb 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -67,10 +67,10 @@ struct xfs_def_quota {
* The mount structure keeps a pointer to this.
*/
typedef struct xfs_quotainfo {
- struct radix_tree_root qi_uquota_tree;
- struct radix_tree_root qi_gquota_tree;
- struct radix_tree_root qi_pquota_tree;
- struct mutex qi_tree_lock;
+ struct xarray qi_uquota_xa;
+ struct xarray qi_gquota_xa;
+ struct xarray qi_pquota_xa;
+ struct mutex qi_xa_lock;
struct xfs_inode *qi_uquotaip; /* user quota inode */
struct xfs_inode *qi_gquotaip; /* group quota inode */
struct xfs_inode *qi_pquotaip; /* project quota inode */
@@ -91,18 +91,18 @@ typedef struct xfs_quotainfo {
struct shrinker qi_shrinker;
} xfs_quotainfo_t;

-static inline struct radix_tree_root *
-xfs_dquot_tree(
+static inline struct xarray *
+xfs_dquot_xa(
struct xfs_quotainfo *qi,
int type)
{
switch (type) {
case XFS_DQ_USER:
- return &qi->qi_uquota_tree;
+ return &qi->qi_uquota_xa;
case XFS_DQ_GROUP:
- return &qi->qi_gquota_tree;
+ return &qi->qi_gquota_xa;
case XFS_DQ_PROJ:
- return &qi->qi_pquota_tree;
+ return &qi->qi_pquota_xa;
default:
ASSERT(0);
}
--
2.15.0
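
In xfs_qm_dqget() the patch swaps radix_tree_insert() for xa_cmpxchg(xa, id, NULL, dqp, GFP_NOFS): the new dquot is stored only if the slot is still empty, and a non-NULL return is the dquot that was already there. A minimal single-threaded sketch of that insert-if-absent idea follows. toy_cmpxchg(), toy_get_or_insert() and struct toy_array are hypothetical names; the model does no locking or memory allocation, so it cannot return an error as the real call can.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 16	/* hypothetical capacity */

/* Hypothetical fixed-size array standing in for an XArray. */
struct toy_array {
	void *slot[TOY_SLOTS];
};

/*
 * Hypothetical stand-in for xa_cmpxchg(): store @entry at @index only if
 * the slot currently holds @old.  Always returns what was there before.
 */
void *toy_cmpxchg(struct toy_array *xa, unsigned long index,
		  void *old, void *entry)
{
	void *curr;

	assert(index < TOY_SLOTS);
	curr = xa->slot[index];
	if (curr == old)
		xa->slot[index] = entry;
	return curr;
}

/*
 * Insert-if-absent in the style of the patch: a NULL return from the
 * compare-and-exchange means our entry was stored; anything else is the
 * entry that was already present, which the caller should use instead.
 */
void *toy_get_or_insert(struct toy_array *xa, unsigned long index,
			void *fresh)
{
	void *curr = toy_cmpxchg(xa, index, NULL, fresh);

	return curr ? curr : fresh;
}
```

Getting the existing entry back from the same call is what lets the patch jump to the found label with dqp = curr instead of restarting the lookup.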

2017-12-06 00:46:33

by Matthew Wilcox

Subject: [PATCH v4 72/73] xfs: Convert mru cache to XArray

From: Matthew Wilcox <[email protected]>

This eliminates a call to radix_tree_preload().

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/xfs/xfs_mru_cache.c | 72 +++++++++++++++++++++++---------------------------
1 file changed, 33 insertions(+), 39 deletions(-)

diff --git a/fs/xfs/xfs_mru_cache.c b/fs/xfs/xfs_mru_cache.c
index f8a674d7f092..2179bede5396 100644
--- a/fs/xfs/xfs_mru_cache.c
+++ b/fs/xfs/xfs_mru_cache.c
@@ -101,10 +101,9 @@
* an infinite loop in the code.
*/
struct xfs_mru_cache {
- struct radix_tree_root store; /* Core storage data structure. */
+ struct xarray store; /* Core storage data structure. */
struct list_head *lists; /* Array of lists, one per grp. */
struct list_head reap_list; /* Elements overdue for reaping. */
- spinlock_t lock; /* Lock to protect this struct. */
unsigned int grp_count; /* Number of discrete groups. */
unsigned int grp_time; /* Time period spanned by grps. */
unsigned int lru_grp; /* Group containing time zero. */
@@ -232,22 +231,21 @@ _xfs_mru_cache_list_insert(
* data store, removing it from the reap list, calling the client's free
* function and deleting the element from the element zone.
*
- * We get called holding the mru->lock, which we drop and then reacquire.
- * Sparse need special help with this to tell it we know what we are doing.
+ * We get called holding the mru->store lock, which we drop and then reacquire.
+ * Sparse needs special help with this to tell it we know what we are doing.
*/
STATIC void
_xfs_mru_cache_clear_reap_list(
struct xfs_mru_cache *mru)
- __releases(mru->lock) __acquires(mru->lock)
+ __releases(mru->store) __acquires(mru->store)
{
struct xfs_mru_cache_elem *elem, *next;
struct list_head tmp;

INIT_LIST_HEAD(&tmp);
list_for_each_entry_safe(elem, next, &mru->reap_list, list_node) {
-
/* Remove the element from the data store. */
- radix_tree_delete(&mru->store, elem->key);
+ __xa_erase(&mru->store, elem->key);

/*
* remove to temp list so it can be freed without
@@ -255,14 +253,14 @@ _xfs_mru_cache_clear_reap_list(
*/
list_move(&elem->list_node, &tmp);
}
- spin_unlock(&mru->lock);
+ xa_unlock(&mru->store);

list_for_each_entry_safe(elem, next, &tmp, list_node) {
list_del_init(&elem->list_node);
mru->free_func(elem);
}

- spin_lock(&mru->lock);
+ xa_lock(&mru->store);
}

/*
@@ -284,7 +282,7 @@ _xfs_mru_cache_reap(
if (!mru || !mru->lists)
return;

- spin_lock(&mru->lock);
+ xa_lock(&mru->store);
next = _xfs_mru_cache_migrate(mru, jiffies);
_xfs_mru_cache_clear_reap_list(mru);

@@ -298,7 +296,7 @@ _xfs_mru_cache_reap(
queue_delayed_work(xfs_mru_reap_wq, &mru->work, next);
}

- spin_unlock(&mru->lock);
+ xa_unlock(&mru->store);
}

int
@@ -358,13 +356,8 @@ xfs_mru_cache_create(
for (grp = 0; grp < mru->grp_count; grp++)
INIT_LIST_HEAD(mru->lists + grp);

- /*
- * We use GFP_KERNEL radix tree preload and do inserts under a
- * spinlock so GFP_ATOMIC is appropriate for the radix tree itself.
- */
- INIT_RADIX_TREE(&mru->store, GFP_ATOMIC);
+ xa_init(&mru->store);
INIT_LIST_HEAD(&mru->reap_list);
- spin_lock_init(&mru->lock);
INIT_DELAYED_WORK(&mru->work, _xfs_mru_cache_reap);

mru->grp_time = grp_time;
@@ -394,17 +387,17 @@ xfs_mru_cache_flush(
if (!mru || !mru->lists)
return;

- spin_lock(&mru->lock);
+ xa_lock(&mru->store);
if (mru->queued) {
- spin_unlock(&mru->lock);
+ xa_unlock(&mru->store);
cancel_delayed_work_sync(&mru->work);
- spin_lock(&mru->lock);
+ xa_lock(&mru->store);
}

_xfs_mru_cache_migrate(mru, jiffies + mru->grp_count * mru->grp_time);
_xfs_mru_cache_clear_reap_list(mru);

- spin_unlock(&mru->lock);
+ xa_unlock(&mru->store);
}

void
@@ -431,24 +424,24 @@ xfs_mru_cache_insert(
unsigned long key,
struct xfs_mru_cache_elem *elem)
{
+ XA_STATE(xas, &mru->store, key);
int error;

ASSERT(mru && mru->lists);
if (!mru || !mru->lists)
return -EINVAL;

- if (radix_tree_preload(GFP_NOFS))
- return -ENOMEM;
-
INIT_LIST_HEAD(&elem->list_node);
elem->key = key;

- spin_lock(&mru->lock);
- error = radix_tree_insert(&mru->store, key, elem);
- radix_tree_preload_end();
- if (!error)
- _xfs_mru_cache_list_insert(mru, elem);
- spin_unlock(&mru->lock);
+ do {
+ xas_lock(&xas);
+ xas_store(&xas, elem);
+ error = xas_error(&xas);
+ if (!error)
+ _xfs_mru_cache_list_insert(mru, elem);
+ xas_unlock(&xas);
+ } while (xas_nomem(&xas, GFP_NOFS));

return error;
}
@@ -470,11 +463,11 @@ xfs_mru_cache_remove(
if (!mru || !mru->lists)
return NULL;

- spin_lock(&mru->lock);
- elem = radix_tree_delete(&mru->store, key);
+ xa_lock(&mru->store);
+ elem = __xa_erase(&mru->store, key);
if (elem)
list_del(&elem->list_node);
- spin_unlock(&mru->lock);
+ xa_unlock(&mru->store);

return elem;
}
@@ -520,20 +513,21 @@ xfs_mru_cache_lookup(
struct xfs_mru_cache *mru,
unsigned long key)
{
+ XA_STATE(xas, &mru->store, key);
struct xfs_mru_cache_elem *elem;

ASSERT(mru && mru->lists);
if (!mru || !mru->lists)
return NULL;

- spin_lock(&mru->lock);
- elem = radix_tree_lookup(&mru->store, key);
+ xas_lock(&xas);
+ elem = xas_load(&xas);
if (elem) {
list_del(&elem->list_node);
_xfs_mru_cache_list_insert(mru, elem);
- __release(mru_lock); /* help sparse not be stupid */
+ __release(&xas); /* help sparse not be stupid */
} else
- spin_unlock(&mru->lock);
+ xas_unlock(&xas);

return elem;
}
@@ -546,7 +540,7 @@ xfs_mru_cache_lookup(
void
xfs_mru_cache_done(
struct xfs_mru_cache *mru)
- __releases(mru->lock)
+ __releases(mru->store)
{
- spin_unlock(&mru->lock);
+ xa_unlock(&mru->store);
}
--
2.15.0
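
The retry loop in xfs_mru_cache_insert() above replaces the old
preload-then-insert sequence. A minimal user-space sketch of that shape
follows, under loud assumptions: toy_store(), toy_nomem(), struct
toy_state and struct toy_tree are hypothetical stand-ins, not the XArray
API, and malloc() stands in for a sleeping allocation. In this model the
store step may only consume memory that was obtained while the lock was
not held; the kernel's own allocation strategy under the lock is not
modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical operation state: a spare node plus a sticky error. */
struct toy_state {
	void *spare;	/* node allocated outside the lock, if any */
	int err;	/* 0 or a negative errno */
};

/* Hypothetical one-node "tree": needs a node before it can hold an entry. */
struct toy_tree {
	void *node;
	void *entry;
};

/* Runs "under the lock": must not sleep, so it may only use the spare. */
static void toy_store(struct toy_state *s, struct toy_tree *t, void *entry)
{
	if (!t->node) {
		if (!s->spare) {
			s->err = -ENOMEM;
			return;
		}
		t->node = s->spare;
		s->spare = NULL;
	}
	t->entry = entry;
}

/* Runs with the lock dropped: true means "memory obtained, try again". */
static bool toy_nomem(struct toy_state *s)
{
	if (s->err != -ENOMEM) {
		/* Nothing to retry; release any unused spare. */
		free(s->spare);
		s->spare = NULL;
		return false;
	}
	s->spare = malloc(64);
	if (!s->spare)
		return false;	/* error stays set for the caller */
	s->err = 0;
	return true;
}

static int toy_insert(struct toy_tree *t, void *entry)
{
	struct toy_state s = { NULL, 0 };
	int error;

	do {
		/* the lock would be taken here */
		toy_store(&s, t, entry);
		error = s.err;
		/* the lock would be dropped here */
	} while (toy_nomem(&s));

	return error;
}
```

In this sketch the first pass fails with -ENOMEM, the allocation happens
with the lock dropped, and the second pass succeeds, so the caller never
has to reserve memory up front.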

2017-12-06 00:46:27

by Matthew Wilcox

Subject: [PATCH v4 65/73] dax: Fix sparse warning

From: Matthew Wilcox <[email protected]>

sparse doesn't know that follow_pte_pmd() acquires the ptl only when it
returns 0, so add an annotation to let it know what's going on.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/dax.c b/fs/dax.c
index c663d82e8ba3..7a86ff1153dd 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -531,6 +531,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
*/
if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl))
continue;
+ __acquire(ptl); /* Conditionally acquired above */

/*
* No need to call mmu_notifier_invalidate_range() as we are
--
2.15.0
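
The annotation added by the patch above can be illustrated outside the
kernel. This is a minimal sketch, assuming the usual kernel convention
that __acquire() and __release() expand to sparse __context__
statements when __CHECKER__ is defined and to nothing otherwise.
try_get_lock(), toy_unlock(), process() and struct toy_lock are
hypothetical names standing in for follow_pte_pmd() and the ptl.

```c
#include <assert.h>
#include <stdbool.h>

/* Sparse-only annotations; under a normal compiler they vanish. */
#ifdef __CHECKER__
# define __acquire(x)	__context__(x, 1)
# define __release(x)	__context__(x, -1)
#else
# define __acquire(x)	(void)0
# define __release(x)	(void)0
#endif

struct toy_lock {
	bool held;
};

/* Hypothetical: returns 0 with the lock held, nonzero without taking it. */
static int try_get_lock(struct toy_lock *l, bool available)
{
	if (!available)
		return -1;
	l->held = true;
	return 0;
}

static void toy_unlock(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
	__release(l);
}

/* Returns how many items were processed with the lock held. */
static int process(struct toy_lock *l, const bool *available, int n)
{
	int done = 0;

	for (int i = 0; i < n; i++) {
		if (try_get_lock(l, available[i]))
			continue;
		__acquire(l);	/* Conditionally acquired above */
		done++;
		toy_unlock(l);
	}
	return done;
}
```

The annotation changes nothing at run time; it only tells the checker
that the lock is held on the path that falls through the if statement.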

2017-12-06 00:48:43

by Matthew Wilcox

Subject: [PATCH v4 63/73] dax: Convert dax_insert_mapping_entry to XArray

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 18 ++++++------------
1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 619aff70583f..de85ce87d333 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -498,9 +498,9 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
void *entry, sector_t sector,
unsigned long flags, bool dirty)
{
- struct radix_tree_root *pages = &mapping->pages;
void *new_entry;
pgoff_t index = vmf->pgoff;
+ XA_STATE(xas, &mapping->pages, index);

if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -516,7 +516,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
PAGE_SIZE, 0);
}

- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
new_entry = dax_radix_locked_entry(sector, flags);

if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
@@ -528,21 +528,15 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
* existing entry is a PMD, we will just leave the PMD in the
* tree and dirty it if necessary.
*/
- struct radix_tree_node *node;
- void **slot;
- void *ret;
-
- ret = __radix_tree_lookup(pages, index, &node, &slot);
- WARN_ON_ONCE(ret != entry);
- __radix_tree_replace(pages, node, slot,
- new_entry, NULL);
+ void *prev = xas_store(&xas, new_entry);
+ WARN_ON_ONCE(prev != entry);
entry = new_entry;
}

if (dirty)
- radix_tree_tag_set(pages, index, PAGECACHE_TAG_DIRTY);
+ xas_set_tag(&xas, PAGECACHE_TAG_DIRTY);

- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
return entry;
}

--
2.15.0

2017-12-06 00:49:18

by Matthew Wilcox

Subject: [PATCH v4 55/73] f2fs: Convert to XArray

From: Matthew Wilcox <[email protected]>

This is a straightforward conversion.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/f2fs/data.c | 3 +--
fs/f2fs/dir.c | 5 +----
fs/f2fs/inline.c | 6 +-----
fs/f2fs/node.c | 10 ++--------
4 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index c8f6d9806896..1f3f192f152f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2175,8 +2175,7 @@ void f2fs_set_page_dirty_nobuffers(struct page *page)
xa_lock_irqsave(&mapping->pages, flags);
WARN_ON_ONCE(!PageUptodate(page));
account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->pages,
- page_index(page), PAGECACHE_TAG_DIRTY);
+ __xa_set_tag(&mapping->pages, page_index(page), PAGECACHE_TAG_DIRTY);
xa_unlock_irqrestore(&mapping->pages, flags);
unlock_page_memcg(page);

diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index b5515ea6bb2f..296070016ec9 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -708,7 +708,6 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,
unsigned int bit_pos;
int slots = GET_DENTRY_SLOTS(le16_to_cpu(dentry->name_len));
struct address_space *mapping = page_mapping(page);
- unsigned long flags;
int i;

f2fs_update_time(F2FS_I_SB(dir), REQ_TIME);
@@ -739,10 +738,8 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,

if (bit_pos == NR_DENTRY_IN_BLOCK &&
!truncate_hole(dir, page->index, page->index + 1)) {
- xa_lock_irqsave(&mapping->pages, flags);
- radix_tree_tag_clear(&mapping->pages, page_index(page),
+ xa_clear_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
- xa_unlock_irqrestore(&mapping->pages, flags);

clear_page_dirty_for_io(page);
ClearPagePrivate(page);
diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
index 7858b8e15f33..d3c3f84beca9 100644
--- a/fs/f2fs/inline.c
+++ b/fs/f2fs/inline.c
@@ -204,7 +204,6 @@ int f2fs_write_inline_data(struct inode *inode, struct page *page)
void *src_addr, *dst_addr;
struct dnode_of_data dn;
struct address_space *mapping = page_mapping(page);
- unsigned long flags;
int err;

set_new_dnode(&dn, inode, NULL, NULL, 0);
@@ -226,10 +225,7 @@ int f2fs_write_inline_data(struct inode *inode, struct page *page)
kunmap_atomic(src_addr);
set_page_dirty(dn.inode_page);

- xa_lock_irqsave(&mapping->pages, flags);
- radix_tree_tag_clear(&mapping->pages, page_index(page),
- PAGECACHE_TAG_DIRTY);
- xa_unlock_irqrestore(&mapping->pages, flags);
+ xa_clear_tag(&mapping->pages, page_index(page), PAGECACHE_TAG_DIRTY);

set_inode_flag(inode, FI_APPEND_WRITE);
set_inode_flag(inode, FI_DATA_EXIST);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 6b64a3009d55..0a6d5c2f996e 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -88,14 +88,10 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
static void clear_node_page_dirty(struct page *page)
{
struct address_space *mapping = page->mapping;
- unsigned int long flags;

if (PageDirty(page)) {
- xa_lock_irqsave(&mapping->pages, flags);
- radix_tree_tag_clear(&mapping->pages,
- page_index(page),
+ xa_clear_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
- xa_unlock_irqrestore(&mapping->pages, flags);

clear_page_dirty_for_io(page);
dec_page_count(F2FS_M_SB(mapping), F2FS_DIRTY_NODES);
@@ -1142,9 +1138,7 @@ void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid)
return;
f2fs_bug_on(sbi, check_nid_range(sbi, nid));

- rcu_read_lock();
- apage = radix_tree_lookup(&NODE_MAPPING(sbi)->pages, nid);
- rcu_read_unlock();
+ apage = xa_load(&NODE_MAPPING(sbi)->pages, nid);
if (apage)
return;

--
2.15.0

2017-12-06 00:49:14

by Matthew Wilcox

Subject: [PATCH v4 56/73] lustre: Convert to XArray

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/staging/lustre/lustre/llite/glimpse.c | 12 +++++-------
drivers/staging/lustre/lustre/mdc/mdc_request.c | 16 ++++++++--------
2 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/drivers/staging/lustre/lustre/llite/glimpse.c b/drivers/staging/lustre/lustre/llite/glimpse.c
index 5f2843da911c..25232fdf5797 100644
--- a/drivers/staging/lustre/lustre/llite/glimpse.c
+++ b/drivers/staging/lustre/lustre/llite/glimpse.c
@@ -57,7 +57,7 @@ static const struct cl_lock_descr whole_file = {
};

/*
- * Check whether file has possible unwriten pages.
+ * Check whether file has possible unwritten pages.
*
* \retval 1 file is mmap-ed or has dirty pages
* 0 otherwise
@@ -66,16 +66,14 @@ blkcnt_t dirty_cnt(struct inode *inode)
{
blkcnt_t cnt = 0;
struct vvp_object *vob = cl_inode2vvp(inode);
- void *results[1];

- if (inode->i_mapping)
- cnt += radix_tree_gang_lookup_tag(&inode->i_mapping->pages,
- results, 0, 1,
- PAGECACHE_TAG_DIRTY);
+ if (inode->i_mapping && xa_tagged(&inode->i_mapping->pages,
+ PAGECACHE_TAG_DIRTY))
+ cnt = 1;
if (cnt == 0 && atomic_read(&vob->vob_mmap_cnt) > 0)
cnt = 1;

- return (cnt > 0) ? 1 : 0;
+ return cnt;
}

int cl_glimpse_lock(const struct lu_env *env, struct cl_io *io,
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 2ec79a6b17da..ea23247e9e02 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -934,17 +934,18 @@ static struct page *mdc_page_locate(struct address_space *mapping, __u64 *hash,
* hash _smaller_ than one we are looking for.
*/
unsigned long offset = hash_x_index(*hash, hash64);
+ XA_STATE(xas, &mapping->pages, offset);
struct page *page;
- int found;

- xa_lock_irq(&mapping->pages);
- found = radix_tree_gang_lookup(&mapping->pages,
- (void **)&page, offset, 1);
- if (found > 0 && !xa_is_value(page)) {
+ xas_lock_irq(&xas);
+ page = xas_find(&xas, ULONG_MAX);
+ if (xa_is_value(page))
+ page = NULL;
+ if (page) {
struct lu_dirpage *dp;

get_page(page);
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
/*
* In contrast to find_lock_page() we are sure that directory
* page cannot be truncated (while DLM lock is held) and,
@@ -992,8 +993,7 @@ static struct page *mdc_page_locate(struct address_space *mapping, __u64 *hash,
page = ERR_PTR(-EIO);
}
} else {
- xa_unlock_irq(&mapping->pages);
- page = NULL;
+ xas_unlock_irq(&xas);
}
return page;
}
--
2.15.0

2017-12-06 00:49:07

by Matthew Wilcox

Subject: [PATCH v4 58/73] dax: Convert lock_slot to XArray

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 03bfa599f75c..d2007a17d257 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -188,15 +188,13 @@ static void dax_wake_mapping_entry_waiter(struct address_space *mapping,
}

/*
- * Mark the given slot is locked. The function must be called with
- * mapping xa_lock held
+ * Mark the given slot as locked. Must be called with xa_lock held.
*/
-static inline void *lock_slot(struct address_space *mapping, void **slot)
+static inline void *lock_slot(struct xa_state *xas)
{
- unsigned long v = xa_to_value(
- radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock));
+ unsigned long v = xa_to_value(xas_load(xas));
void *entry = xa_mk_value(v | DAX_ENTRY_LOCK);
- radix_tree_replace_slot(&mapping->pages, slot, entry);
+ xas_store(xas, entry);
return entry;
}

@@ -247,7 +245,7 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,

xas_lock_irq(&xas);
entry = xas_load(&xas);
- if (WARN_ON_ONCE(!entry || !xa_is_value(entry) || !dax_locked(entry))) {
+ if (WARN_ON_ONCE(!xa_is_value(entry) || !dax_locked(entry))) {
xas_unlock_irq(&xas);
return;
}
@@ -306,6 +304,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
unsigned long size_flag)
{
+ XA_STATE(xas, &mapping->pages, index);
bool pmd_downgrade = false; /* splitting 2MiB entry into 4k entries? */
void *entry, **slot;

@@ -344,7 +343,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
* Make sure 'entry' remains valid while we drop
* mapping xa_lock.
*/
- entry = lock_slot(mapping, slot);
+ entry = lock_slot(&xas);
}

xa_unlock_irq(&mapping->pages);
@@ -411,7 +410,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
xa_unlock_irq(&mapping->pages);
return entry;
}
- entry = lock_slot(mapping, slot);
+ entry = lock_slot(&xas);
out_unlock:
xa_unlock_irq(&mapping->pages);
return entry;
@@ -643,6 +642,7 @@ static int dax_writeback_one(struct block_device *bdev,
pgoff_t index, void *entry)
{
struct radix_tree_root *pages = &mapping->pages;
+ XA_STATE(xas, pages, index);
void *entry2, **slot, *kaddr;
long ret = 0, id;
sector_t sector;
@@ -679,7 +679,7 @@ static int dax_writeback_one(struct block_device *bdev,
if (!radix_tree_tag_get(pages, index, PAGECACHE_TAG_TOWRITE))
goto put_unlocked;
/* Lock the entry to serialize with page faults */
- entry = lock_slot(mapping, slot);
+ entry = lock_slot(&xas);
/*
* We can clear the tag now but we have to be careful so that concurrent
* dax_writeback_one() calls for the same index cannot finish before we
@@ -1504,8 +1504,9 @@ static int dax_insert_pfn_mkwrite(struct vm_fault *vmf,
pfn_t pfn)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
- void *entry, **slot;
pgoff_t index = vmf->pgoff;
+ XA_STATE(xas, &mapping->pages, index);
+ void *entry, **slot;
int vmf_ret, error;

xa_lock_irq(&mapping->pages);
@@ -1521,7 +1522,7 @@ static int dax_insert_pfn_mkwrite(struct vm_fault *vmf,
return VM_FAULT_NOPAGE;
}
radix_tree_tag_set(&mapping->pages, index, PAGECACHE_TAG_DIRTY);
- entry = lock_slot(mapping, slot);
+ entry = lock_slot(&xas);
xa_unlock_irq(&mapping->pages);
switch (pe_size) {
case PE_SIZE_PTE:
--
2.15.0

2017-12-06 00:50:33

by Matthew Wilcox

Subject: [PATCH v4 53/73] fs: Convert writeback to XArray

From: Matthew Wilcox <[email protected]>

A couple of short loops.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/fs-writeback.c | 27 ++++++++++-----------------
1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a3c2352507f6..18ad86ccba96 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -339,9 +339,9 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
struct address_space *mapping = inode->i_mapping;
struct bdi_writeback *old_wb = inode->i_wb;
struct bdi_writeback *new_wb = isw->new_wb;
- struct radix_tree_iter iter;
+ XA_STATE(xas, &mapping->pages, 0);
+ struct page *page;
bool switched = false;
- void **slot;

/*
* By the time control reaches here, RCU grace period has passed
@@ -373,27 +373,20 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
/*
* Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points
* to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to
- * pages actually under underwriteback.
+ * pages actually under writeback.
*/
- radix_tree_for_each_tagged(slot, &mapping->pages, &iter, 0,
- PAGECACHE_TAG_DIRTY) {
- struct page *page = radix_tree_deref_slot_protected(slot,
- &mapping->pages.xa_lock);
- if (likely(page) && PageDirty(page)) {
+ xas_for_each_tag(&xas, page, ULONG_MAX, PAGECACHE_TAG_DIRTY) {
+ if (PageDirty(page)) {
dec_wb_stat(old_wb, WB_RECLAIMABLE);
inc_wb_stat(new_wb, WB_RECLAIMABLE);
}
}

- radix_tree_for_each_tagged(slot, &mapping->pages, &iter, 0,
- PAGECACHE_TAG_WRITEBACK) {
- struct page *page = radix_tree_deref_slot_protected(slot,
- &mapping->pages.xa_lock);
- if (likely(page)) {
- WARN_ON_ONCE(!PageWriteback(page));
- dec_wb_stat(old_wb, WB_WRITEBACK);
- inc_wb_stat(new_wb, WB_WRITEBACK);
- }
+ xas_set(&xas, 0);
+ xas_for_each_tag(&xas, page, ULONG_MAX, PAGECACHE_TAG_WRITEBACK) {
+ WARN_ON_ONCE(!PageWriteback(page));
+ dec_wb_stat(old_wb, WB_WRITEBACK);
+ inc_wb_stat(new_wb, WB_WRITEBACK);
}

wb_get(new_wb);
--
2.15.0

2017-12-06 00:50:30

by Matthew Wilcox

Subject: [PATCH v4 57/73] dax: Convert dax_unlock_mapping_entry to XArray

From: Matthew Wilcox <[email protected]>

Replace slot_locked() with dax_locked() and inline unlock_slot() into
its only caller.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 50 ++++++++++++++++----------------------------------
1 file changed, 16 insertions(+), 34 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 86bacca51eed..03bfa599f75c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -73,6 +73,11 @@ fs_initcall(init_dax_wait_table);
#define DAX_ZERO_PAGE (1UL << 2)
#define DAX_EMPTY (1UL << 3)

+static bool dax_locked(void *entry)
+{
+ return xa_to_value(entry) & DAX_ENTRY_LOCK;
+}
+
static unsigned long dax_radix_sector(void *entry)
{
return xa_to_value(entry) >> DAX_SHIFT;
@@ -182,17 +187,6 @@ static void dax_wake_mapping_entry_waiter(struct address_space *mapping,
__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
}

-/*
- * Check whether the given slot is locked. The function must be called with
- * mapping xa_lock held
- */
-static inline int slot_locked(struct address_space *mapping, void **slot)
-{
- unsigned long entry = xa_to_value(
- radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock));
- return entry & DAX_ENTRY_LOCK;
-}
-
/*
* Mark the given slot is locked. The function must be called with
* mapping xa_lock held
@@ -206,19 +200,6 @@ static inline void *lock_slot(struct address_space *mapping, void **slot)
return entry;
}

-/*
- * Mark the given slot is unlocked. The function must be called with
- * mapping xa_lock held
- */
-static inline void *unlock_slot(struct address_space *mapping, void **slot)
-{
- unsigned long v = xa_to_value(
- radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock));
- void *entry = xa_mk_value(v & ~DAX_ENTRY_LOCK);
- radix_tree_replace_slot(&mapping->pages, slot, entry);
- return entry;
-}
-
/*
* Lookup entry in radix tree, wait for it to become unlocked if it is
* a data value entry and return it. The caller must call
@@ -242,8 +223,7 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
entry = __radix_tree_lookup(&mapping->pages, index, NULL,
&slot);
if (!entry ||
- WARN_ON_ONCE(!xa_is_value(entry)) ||
- !slot_locked(mapping, slot)) {
+ WARN_ON_ONCE(!xa_is_value(entry)) || !dax_locked(entry)) {
if (slotp)
*slotp = slot;
return entry;
@@ -262,17 +242,19 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
static void dax_unlock_mapping_entry(struct address_space *mapping,
pgoff_t index)
{
- void *entry, **slot;
+ XA_STATE(xas, &mapping->pages, index);
+ void *entry;

- xa_lock_irq(&mapping->pages);
- entry = __radix_tree_lookup(&mapping->pages, index, NULL, &slot);
- if (WARN_ON_ONCE(!entry || !xa_is_value(entry) ||
- !slot_locked(mapping, slot))) {
- xa_unlock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
+ entry = xas_load(&xas);
+ if (WARN_ON_ONCE(!entry || !xa_is_value(entry) || !dax_locked(entry))) {
+ xas_unlock_irq(&xas);
return;
}
- unlock_slot(mapping, slot);
- xa_unlock_irq(&mapping->pages);
+ entry = xa_mk_value(xa_to_value(entry) & ~DAX_ENTRY_LOCK);
+ xas_store(&xas, entry);
+ /* Safe to not call xas_pause here -- we don't touch the array after */
+ xas_unlock_irq(&xas);
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
}

--
2.15.0
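
The dax_locked() helper and the inlined unlock in the patch above are
plain bit operations on a value entry. A minimal sketch follows,
assuming the encoding used by the mainline XArray, where a value entry
is the integer shifted left by one with the low bit set. The toy_*
names are hypothetical, and the lock bit being bit 0 of the value is an
assumption, since the definition of DAX_ENTRY_LOCK is not shown in this
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: integer shifted left by one, low bit set. */
static inline void *toy_mk_value(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

static inline unsigned long toy_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

static inline bool toy_is_value(const void *entry)
{
	return (uintptr_t)entry & 1;
}

#define TOY_ENTRY_LOCK	(1UL << 0)	/* assumed bit position */

/* Test the lock bit without needing the slot or the array. */
static bool toy_locked(const void *entry)
{
	return toy_to_value(entry) & TOY_ENTRY_LOCK;
}

static void *toy_lock_entry(const void *entry)
{
	return toy_mk_value(toy_to_value(entry) | TOY_ENTRY_LOCK);
}

static void *toy_unlock_entry(const void *entry)
{
	return toy_mk_value(toy_to_value(entry) & ~TOY_ENTRY_LOCK);
}
```

Because the check only needs the entry, not the slot, the helper can be
called on whatever a lookup returned, which is what lets slot_locked()
go away.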

2017-12-06 00:51:26

by Matthew Wilcox

Subject: [PATCH v4 15/73] xarray: Add xa_get_entries, xa_get_tagged and xa_get_maybe_tag

From: Matthew Wilcox <[email protected]>

These functions allow a range of xarray entries to be extracted into a
compact normal array.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 27 ++++++++++++++++
lib/xarray.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 115 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 4e61ebd406f5..c3efcc3432f7 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -135,6 +135,33 @@ void *xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);

void *xa_find(struct xarray *xa, unsigned long *index, unsigned long max);
void *xa_find_after(struct xarray *xa, unsigned long *index, unsigned long max);
+int xa_get_entries(struct xarray *, void **dst, unsigned long start,
+ unsigned long max, unsigned int n);
+int xa_get_tagged(struct xarray *, void **dst, unsigned long start,
+ unsigned long max, unsigned int n, xa_tag_t);
+
+/**
+ * xa_get_maybe_tag() - Copy entries from the XArray into a normal array.
+ * @xa: The source XArray to copy from.
+ * @dst: The buffer to copy pointers into.
+ * @start: The first index in the XArray eligible to be copied from.
+ * @max: The last index in the XArray eligible to be copied from.
+ * @n: The maximum number of entries to copy.
+ * @tag: Tag number.
+ *
+ * If you specify %XA_NO_TAG as the tag number, this is the same as
+ * xa_get_entries(). Otherwise, it is the same as xa_get_tagged().
+ *
+ * Return: The number of entries copied.
+ */
+static inline int xa_get_maybe_tag(struct xarray *xa, void **dst,
+ unsigned long start, unsigned long max,
+ unsigned int n, xa_tag_t tag)
+{
+ if (tag == XA_NO_TAG)
+ return xa_get_entries(xa, dst, start, max, n);
+ return xa_get_tagged(xa, dst, start, max, n, tag);
+}

/**
* xa_for_each() - Iterate over a portion of an XArray.
diff --git a/lib/xarray.c b/lib/xarray.c
index f9eaac2d85f9..251724f62b11 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1253,6 +1253,94 @@ void *xa_find_after(struct xarray *xa, unsigned long *indexp, unsigned long max)
}
EXPORT_SYMBOL(xa_find_after);

+/**
+ * xa_get_entries() - Copy entries from the XArray into a normal array.
+ * @xa: The source XArray to copy from.
+ * @dst: The buffer to copy pointers into.
+ * @start: The first index in the XArray eligible to be copied from.
+ * @max: The last index in the XArray eligible to be copied from.
+ * @n: The maximum number of entries to copy.
+ *
+ * Copies up to @n non-NULL entries from the XArray. The copied entries will
+ * have indices between @start and @max, inclusive.
+ *
+ * This function uses the RCU lock to protect itself. That means that the
+ * entries returned may not represent a snapshot of the XArray at a moment
+ * in time. For example, if index 5 is stored to, then index 10 is stored to,
+ * calling xa_get_entries() may return the old contents of index 5 and the
+ * new contents of index 10. Indices not modified while this function is
+ * running will not be skipped.
+ *
+ * If you need stronger guarantees, holding the xa_lock across calls to this
+ * function will prevent concurrent modification.
+ *
+ * Return: The number of entries copied.
+ */
+int xa_get_entries(struct xarray *xa, void **dst, unsigned long start,
+ unsigned long max, unsigned int n)
+{
+ XA_STATE(xas, xa, start);
+ void *entry;
+ unsigned int i = 0;
+
+ if (!n)
+ return 0;
+
+ rcu_read_lock();
+ xas_for_each(&xas, entry, max) {
+ if (xas_retry(&xas, entry))
+ continue;
+ dst[i++] = entry;
+ if (i == n)
+ break;
+ }
+ rcu_read_unlock();
+
+ return i;
+}
+EXPORT_SYMBOL(xa_get_entries);
+
+/**
+ * xa_get_tagged() - Copy tagged entries from the XArray into a normal array.
+ * @xa: The source XArray to copy from.
+ * @dst: The buffer to copy pointers into.
+ * @start: The first index in the XArray eligible to be copied from.
+ * @max: The last index in the XArray eligible to be copied from.
+ * @n: The maximum number of entries to copy.
+ * @tag: Tag number.
+ *
+ * Copies up to @n non-NULL entries that have @tag set from the XArray. The
+ * copied entries will have indices between @start and @max, inclusive.
+ *
+ * See the xa_get_entries() documentation for the consistency guarantees
+ * provided.
+ *
+ * Return: The number of entries copied.
+ */
+int xa_get_tagged(struct xarray *xa, void **dst, unsigned long start,
+ unsigned long max, unsigned int n, xa_tag_t tag)
+{
+ XA_STATE(xas, xa, start);
+ void *entry;
+ unsigned int i = 0;
+
+ if (!n)
+ return 0;
+
+ rcu_read_lock();
+ xas_for_each_tag(&xas, entry, max, tag) {
+ if (xas_retry(&xas, entry))
+ continue;
+ dst[i++] = entry;
+ if (i == n)
+ break;
+ }
+ rcu_read_unlock();
+
+ return i;
+}
+EXPORT_SYMBOL(xa_get_tagged);
+
#ifdef XA_DEBUG
void xa_dump_entry(void *entry, unsigned long index)
{
--
2.15.0
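
The copying semantics documented in the patch above (up to @n non-NULL
entries, indices between @start and @max inclusive, optionally filtered
by a tag) can be modelled in user space. This is a sketch only: struct
toy_array, toy_get() and TOY_NO_TAG are hypothetical, the array is a
flat fixed-size table rather than a tree, there is a single tag, and no
RCU or locking is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS	16
#define TOY_NO_TAG	(-1)

/* Hypothetical flat stand-in for the XArray, with one tag bit per slot. */
struct toy_array {
	void *slot[TOY_SLOTS];
	bool tagged[TOY_SLOTS];
};

/*
 * Copy up to n non-NULL entries with indices in [start, max] into dst.
 * With tag == TOY_NO_TAG every present entry qualifies; otherwise only
 * tagged entries do. Returns the number of entries copied.
 */
static unsigned int toy_get(const struct toy_array *a, void **dst,
			    unsigned long start, unsigned long max,
			    unsigned int n, int tag)
{
	unsigned int i = 0;

	assert(dst != NULL);
	if (!n)
		return 0;
	if (max >= TOY_SLOTS)
		max = TOY_SLOTS - 1;

	for (unsigned long index = start; index <= max; index++) {
		if (!a->slot[index])
			continue;
		if (tag != TOY_NO_TAG && !a->tagged[index])
			continue;
		dst[i++] = a->slot[index];
		if (i == n)
			break;
	}
	return i;
}
```

The result is compact: holes in the source never appear in dst, so the
caller learns which entries exist but not their indices.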

2017-12-06 00:51:36

by Matthew Wilcox

Subject: [PATCH v4 48/73] shmem: Convert shmem_free_swap to XArray

From: Matthew Wilcox <[email protected]>

This is a perfect use for xa_cmpxchg(). Note the use of 0 for GFP
flags; we won't be allocating memory.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 768d470a03da..ca45ff493587 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -635,16 +635,13 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
}

/*
- * Remove swap entry from radix tree, free the swap and its page cache.
+ * Remove swap entry from page cache, free the swap and its page cache.
*/
static int shmem_free_swap(struct address_space *mapping,
pgoff_t index, void *radswap)
{
- void *old;
+ void *old = xa_cmpxchg(&mapping->pages, index, radswap, NULL, 0);

- xa_lock_irq(&mapping->pages);
- old = radix_tree_delete_item(&mapping->pages, index, radswap);
- xa_unlock_irq(&mapping->pages);
if (old != radswap)
return -ENOENT;
free_swap_and_cache(radix_to_swp_entry(radswap));
--
2.15.0
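
The "delete only if the slot still holds what I expect" idiom in the
patch above can be sketched in user space. Note the swapped-in
technique: this model uses a C11 atomic compare-and-exchange on a single
slot, which is only an analogy for xa_cmpxchg() and says nothing about
how the XArray serialises the operation internally. toy_cmpxchg() and
toy_free_swap() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/*
 * Store entry in *slot only if it currently holds old.
 * Returns the previous contents of the slot either way.
 */
static void *toy_cmpxchg(_Atomic(void *) *slot, void *old, void *entry)
{
	void *expected = old;

	/* On failure, expected is overwritten with the current contents. */
	atomic_compare_exchange_strong(slot, &expected, entry);
	return expected;
}

/* Clear the slot if it still holds radswap; otherwise report -ENOENT. */
static int toy_free_swap(_Atomic(void *) *slot, void *radswap)
{
	void *old = toy_cmpxchg(slot, radswap, NULL);

	if (old != radswap)
		return -ENOENT;
	return 0;
}
```

Replacing an entry with NULL never needs a new node, which matches the
commit message's remark that no memory will be allocated.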

2017-12-06 00:51:32

by Matthew Wilcox

Subject: [PATCH v4 49/73] shmem: Convert shmem_partial_swap_usage to XArray

From: Matthew Wilcox <[email protected]>

Simpler code because the xarray takes care of things like the limit and
dereferencing the slot.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 18 +++---------------
1 file changed, 3 insertions(+), 15 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ca45ff493587..01102e2e0ef3 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -658,29 +658,17 @@ static int shmem_free_swap(struct address_space *mapping,
unsigned long shmem_partial_swap_usage(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, start);
struct page *page;
unsigned long swapped = 0;

rcu_read_lock();
-
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
- if (iter.index >= end)
- break;
-
- page = radix_tree_deref_slot(slot);
-
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
-
+ xas_for_each(&xas, page, end - 1) {
if (xa_is_value(page))
swapped++;

if (need_resched()) {
- slot = radix_tree_iter_resume(slot, &iter);
+ xas_pause(&xas);
cond_resched_rcu();
}
}
--
2.15.0
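
The patch above turns an exclusive bound (stop when index >= end) into
an inclusive one (iterate up to end - 1). A minimal user-space sketch of
that conversion follows, with hypothetical names (toy_count_values(),
toy_partial_usage()) and a flat table in place of the page cache. The
low-bit test for value entries is an assumption borrowed from the
mainline XArray encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS	16

/* Assumed encoding: value entries have the low bit set. */
static int toy_is_value(const void *entry)
{
	return (uintptr_t)entry & 1;
}

/* Count value entries with indices in [start, max]; max is inclusive. */
static unsigned long toy_count_values(void *const *slot,
				      unsigned long start, unsigned long max)
{
	unsigned long n = 0;

	if (max >= TOY_SLOTS)
		max = TOY_SLOTS - 1;
	for (unsigned long i = start; i <= max; i++)
		if (slot[i] && toy_is_value(slot[i]))
			n++;
	return n;
}

/* Same count for the half-open range [start, end). */
static unsigned long toy_partial_usage(void *const *slot,
				       unsigned long start, unsigned long end)
{
	/* In this model, end - 1 would wrap to ULONG_MAX when end is 0. */
	if (end == 0)
		return 0;
	return toy_count_values(slot, start, end - 1);
}
```

The guard for end == 0 belongs to this sketch only; whether callers in
the kernel can pass such a range is not something this thread states.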

2017-12-06 00:51:22

by Matthew Wilcox

Subject: [PATCH v4 52/73] fs: Convert buffer to XArray

From: Matthew Wilcox <[email protected]>

Mostly comment fixes, but one use of __xa_set_tag.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/buffer.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 33c08624d45b..986b50b0fd50 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -593,7 +593,7 @@ void mark_buffer_dirty_inode(struct buffer_head *bh, struct inode *inode)
EXPORT_SYMBOL(mark_buffer_dirty_inode);

/*
- * Mark the page dirty, and set it dirty in the radix tree, and mark the inode
+ * Mark the page dirty, and set it dirty in the page cache, and mark the inode
* dirty.
*
* If warn is true, then emit a warning if the page is not uptodate and has
@@ -610,8 +610,8 @@ void __set_page_dirty(struct page *page, struct address_space *mapping,
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->pages,
- page_index(page), PAGECACHE_TAG_DIRTY);
+ __xa_set_tag(&mapping->pages, page_index(page),
+ PAGECACHE_TAG_DIRTY);
}
xa_unlock_irqrestore(&mapping->pages, flags);
}
@@ -1073,7 +1073,7 @@ __getblk_slow(struct block_device *bdev, sector_t block,
* The relationship between dirty buffers and dirty pages:
*
* Whenever a page has any dirty buffers, the page's dirty bit is set, and
- * the page is tagged dirty in its radix tree.
+ * the page is tagged dirty in the page cache.
*
* At all times, the dirtiness of the buffers represents the dirtiness of
* subsections of the page. If the page has buffers, the page dirty bit is
@@ -1096,9 +1096,9 @@ __getblk_slow(struct block_device *bdev, sector_t block,
* mark_buffer_dirty - mark a buffer_head as needing writeout
* @bh: the buffer_head to mark dirty
*
- * mark_buffer_dirty() will set the dirty bit against the buffer, then set its
- * backing page dirty, then tag the page as dirty in its address_space's radix
- * tree and then attach the address_space's inode to its superblock's dirty
+ * mark_buffer_dirty() will set the dirty bit against the buffer, then set
+ * its backing page dirty, then tag the page as dirty in the page cache
+ * and then attach the address_space's inode to its superblock's dirty
* inode list.
*
* mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
--
2.15.0

2017-12-06 00:52:51

by Matthew Wilcox

Subject: [PATCH v4 50/73] shmem: Comment fixups

From: Matthew Wilcox <[email protected]>

Remove the last mentions of radix tree from various comments.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 01102e2e0ef3..090937922c1d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -743,7 +743,7 @@ void shmem_unlock_mapping(struct address_space *mapping)
}

/*
- * Remove range of pages and swap entries from radix tree, and free them.
+ * Remove range of pages and swap entries from page cache, and free them.
* If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
*/
static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
@@ -1118,10 +1118,10 @@ static int shmem_unuse_inode(struct shmem_inode_info *info,
* We needed to drop mutex to make that restrictive page
* allocation, but the inode might have been freed while we
* dropped it: although a racing shmem_evict_inode() cannot
- * complete without emptying the radix_tree, our page lock
+ * complete without emptying the page cache, our page lock
* on this swapcache page is not enough to prevent that -
* free_swap_and_cache() of our swap entry will only
- * trylock_page(), removing swap from radix_tree whatever.
+ * trylock_page(), removing swap from page cache whatever.
*
* We must not proceed to shmem_add_to_page_cache() if the
* inode has been freed, but of course we cannot rely on
@@ -1187,7 +1187,7 @@ int shmem_unuse(swp_entry_t swap, struct page *page)
false);
if (error)
goto out;
- /* No radix_tree_preload: swap entry keeps a place for page in tree */
+ /* No memory allocation: swap entry occupies the slot for the page */
error = -EAGAIN;

mutex_lock(&shmem_swaplist_mutex);
@@ -1862,7 +1862,7 @@ alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, inode,
spin_unlock_irq(&info->lock);
goto repeat;
}
- if (error == -EEXIST) /* from above or from radix_tree_insert */
+ if (error == -EEXIST)
goto repeat;
return error;
}
@@ -2474,7 +2474,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
}

/*
- * llseek SEEK_DATA or SEEK_HOLE through the radix_tree.
+ * llseek SEEK_DATA or SEEK_HOLE through the page cache.
*/
static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
pgoff_t index, pgoff_t end, int whence)
@@ -2562,7 +2562,7 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
}

/*
- * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
+ * We need a tag: a new tag would expand every xa_node by 8 bytes,
* so reuse a tag which we firmly believe is never set or cleared on shmem.
*/
#define SHMEM_TAG_PINNED PAGECACHE_TAG_TOWRITE
--
2.15.0

2017-12-06 00:53:29

by Matthew Wilcox

Subject: [PATCH v4 46/73] shmem: Convert shmem_add_to_page_cache to XArray

From: Matthew Wilcox <[email protected]>

This removes the last caller of radix_tree_maybe_preload_order().
Simpler code, unless we run out of memory for new xa_nodes partway through
inserting entries into the xarray. Hopefully we can support multi-index
entries in the page cache soon and all the awful code goes away.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 87 ++++++++++++++++++++++++++++----------------------------------
1 file changed, 39 insertions(+), 48 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index e4a2eb1336be..54fbfc2c6c09 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -558,9 +558,10 @@ static unsigned long shmem_unused_huge_shrink(struct shmem_sb_info *sbinfo,
*/
static int shmem_add_to_page_cache(struct page *page,
struct address_space *mapping,
- pgoff_t index, void *expected)
+ pgoff_t index, void *expected, gfp_t gfp)
{
- int error, nr = hpage_nr_pages(page);
+ XA_STATE(xas, &mapping->pages, index);
+ unsigned int i, nr = 1U << compound_order(page);

VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(index != round_down(index, nr), page);
@@ -569,49 +570,47 @@ static int shmem_add_to_page_cache(struct page *page,
VM_BUG_ON(expected && PageTransHuge(page));

page_ref_add(page, nr);
- page->mapping = mapping;
page->index = index;
+ page->mapping = mapping;

- xa_lock_irq(&mapping->pages);
- if (PageTransHuge(page)) {
- void __rcu **results;
- pgoff_t idx;
- int i;
-
- error = 0;
- if (radix_tree_gang_lookup_slot(&mapping->pages,
- &results, &idx, index, 1) &&
- idx < index + HPAGE_PMD_NR) {
- error = -EEXIST;
+ do {
+ xas_lock_irq(&xas);
+ xas_create_range(&xas, index + nr - 1);
+ if (xas_error(&xas))
+ goto unlock;
+ for (i = 0; i < nr; i++) {
+ void *entry = xas_load(&xas);
+ if (entry != expected)
+ xas_set_err(&xas, -ENOENT);
+ if (xas_error(&xas))
+ goto undo;
+ xas_store(&xas, page + i);
+ xas_next(&xas);
}
-
- if (!error) {
- for (i = 0; i < HPAGE_PMD_NR; i++) {
- error = radix_tree_insert(&mapping->pages,
- index + i, page + i);
- VM_BUG_ON(error);
- }
+ if (PageTransHuge(page)) {
count_vm_event(THP_FILE_ALLOC);
+ __inc_node_page_state(page, NR_SHMEM_THPS);
}
- } else if (!expected) {
- error = radix_tree_insert(&mapping->pages, index, page);
- } else {
- error = shmem_xa_replace(mapping, index, expected, page);
- }
-
- if (!error) {
mapping->nrpages += nr;
- if (PageTransHuge(page))
- __inc_node_page_state(page, NR_SHMEM_THPS);
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
__mod_node_page_state(page_pgdat(page), NR_SHMEM, nr);
- xa_unlock_irq(&mapping->pages);
- } else {
+ goto unlock;
+undo:
+ while (i-- > 0) {
+ xas_store(&xas, NULL);
+ xas_prev(&xas);
+ }
+unlock:
+ xas_unlock_irq(&xas);
+ } while (xas_nomem(&xas, gfp));
+
+ if (xas_error(&xas)) {
page->mapping = NULL;
- xa_unlock_irq(&mapping->pages);
page_ref_sub(page, nr);
+ return xas_error(&xas);
}
- return error;
+
+ return 0;
}

/*
@@ -1159,7 +1158,7 @@ static int shmem_unuse_inode(struct shmem_inode_info *info,
*/
if (!error)
error = shmem_add_to_page_cache(*pagep, mapping, index,
- radswap);
+ radswap, gfp);
if (error != -ENOMEM) {
/*
* Truncation and eviction use free_swap_and_cache(), which
@@ -1677,7 +1676,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
false);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
- swp_to_radix_entry(swap));
+ swp_to_radix_entry(swap), gfp);
/*
* We already confirmed swap under page lock, and make
* no memory allocation here, so usually no possibility
@@ -1783,13 +1782,8 @@ alloc_nohuge: page = shmem_alloc_and_acct_page(gfp, inode,
PageTransHuge(page));
if (error)
goto unacct;
- error = radix_tree_maybe_preload_order(gfp & GFP_RECLAIM_MASK,
- compound_order(page));
- if (!error) {
- error = shmem_add_to_page_cache(page, mapping, hindex,
- NULL);
- radix_tree_preload_end();
- }
+ error = shmem_add_to_page_cache(page, mapping, hindex,
+ NULL, gfp & GFP_RECLAIM_MASK);
if (error) {
mem_cgroup_cancel_charge(page, memcg,
PageTransHuge(page));
@@ -2256,11 +2250,8 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
if (ret)
goto out_release;

- ret = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
- if (!ret) {
- ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL);
- radix_tree_preload_end();
- }
+ ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
+ gfp & GFP_RECLAIM_MASK);
if (ret)
goto out_release_uncharge;

--
2.15.0
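
The do { ... } while (xas_nomem(&xas, gfp)) loop in the patch above follows a retry-on-allocation-failure shape: attempt the store under the lock, and if it failed for lack of memory, allocate outside the lock and go round again. The following is only a userspace sketch of that control flow. Every toy_* name is hypothetical and nothing here is the kernel API; the alloc_ok flag stands in for whether an allocation with the given gfp flags would succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an xa_state: an error plus a spare node. */
struct toy_state {
	int err;        /* 0 or a negative errno */
	bool have_node; /* true once a node was allocated outside the lock */
};

/* The locked step: fails with -ENOMEM unless a node is already available. */
static void toy_store_locked(struct toy_state *st, int *slot, int value)
{
	if (!st->have_node) {
		st->err = -ENOMEM;
		return;
	}
	st->have_node = false;
	*slot = value;
}

/*
 * Models the role xas_nomem() plays in the patch: if the last attempt
 * failed with -ENOMEM and memory can be allocated, clear the error and
 * ask the caller to retry. Otherwise leave the error in place.
 */
static bool toy_nomem(struct toy_state *st, bool alloc_ok)
{
	if (st->err != -ENOMEM)
		return false;
	if (!alloc_ok)
		return false;
	st->have_node = true;
	st->err = 0;
	return true;
}

/* Returns 0 or a negative errno; *attempts counts passes through the loop. */
static int toy_insert(int *slot, int value, bool alloc_ok,
		      unsigned int *attempts)
{
	struct toy_state st = { 0, false };

	assert(slot != NULL && attempts != NULL);
	*attempts = 0;
	do {
		/* the lock would be taken here */
		(*attempts)++;
		toy_store_locked(&st, slot, value);
		/* the lock would be dropped here */
	} while (toy_nomem(&st, alloc_ok));

	return st.err;
}
```

In this model a successful insert takes two passes (one failing, one after the allocation), and a failed allocation leaves the slot untouched and returns -ENOMEM after a single pass.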

2017-12-06 00:53:22

by Matthew Wilcox

Subject: [PATCH v4 45/73] shmem: Convert shmem_wait_for_pins to XArray

From: Matthew Wilcox <[email protected]>

As with shmem_tag_pins(), hold the lock around the entire loop instead
of acquiring & dropping it for each entry we're going to untag.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 59 ++++++++++++++++++++++++-----------------------------------
1 file changed, 24 insertions(+), 35 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2f41c7ceea18..e4a2eb1336be 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2636,9 +2636,7 @@ static void shmem_tag_pins(struct address_space *mapping)
*/
static int shmem_wait_for_pins(struct address_space *mapping)
{
- struct radix_tree_iter iter;
- void **slot;
- pgoff_t start;
+ XA_STATE(xas, &mapping->pages, 0);
struct page *page;
int error, scan;

@@ -2646,7 +2644,9 @@ static int shmem_wait_for_pins(struct address_space *mapping)

error = 0;
for (scan = 0; scan <= LAST_SCAN; scan++) {
- if (!radix_tree_tagged(&mapping->pages, SHMEM_TAG_PINNED))
+ unsigned int tagged = 0;
+
+ if (!xas_tagged(&xas, SHMEM_TAG_PINNED))
break;

if (!scan)
@@ -2654,45 +2654,34 @@ static int shmem_wait_for_pins(struct address_space *mapping)
else if (schedule_timeout_killable((HZ << scan) / 200))
scan = LAST_SCAN;

- start = 0;
- rcu_read_lock();
- radix_tree_for_each_tagged(slot, &mapping->pages, &iter,
- start, SHMEM_TAG_PINNED) {
-
- page = radix_tree_deref_slot(slot);
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
-
- page = NULL;
- }
-
- if (page &&
- page_count(page) - page_mapcount(page) != 1) {
- if (scan < LAST_SCAN)
- goto continue_resched;
-
+ xas_set(&xas, 0);
+ xas_lock_irq(&xas);
+ xas_for_each_tag(&xas, page, ULONG_MAX, SHMEM_TAG_PINNED) {
+ bool clear = true;
+ if (xa_is_value(page))
+ continue;
+ if (page_count(page) - page_mapcount(page) != 1) {
/*
* On the last scan, we clean up all those tags
* we inserted; but make a note that we still
* found pages pinned.
*/
- error = -EBUSY;
+ if (scan == LAST_SCAN)
+ error = -EBUSY;
+ else
+ clear = false;
}
+ if (clear)
+ xas_clear_tag(&xas, SHMEM_TAG_PINNED);
+ if (++tagged % XA_CHECK_SCHED)
+ continue;

- xa_lock_irq(&mapping->pages);
- radix_tree_tag_clear(&mapping->pages,
- iter.index, SHMEM_TAG_PINNED);
- xa_unlock_irq(&mapping->pages);
-continue_resched:
- if (need_resched()) {
- slot = radix_tree_iter_resume(slot, &iter);
- cond_resched_rcu();
- }
+ xas_pause(&xas);
+ xas_unlock_irq(&xas);
+ cond_resched();
+ xas_lock_irq(&xas);
}
- rcu_read_unlock();
+ xas_unlock_irq(&xas);
}

return error;
--
2.15.0

2017-12-06 00:53:42

by Matthew Wilcox

Subject: [PATCH v4 28/73] page cache: Remove stray radix comment

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/filemap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 9e6158cfbaeb..79d0731b8762 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2601,7 +2601,7 @@ static struct page *do_read_cache_page(struct address_space *mapping,
put_page(page);
if (err == -EEXIST)
goto repeat;
- /* Presumably ENOMEM for radix tree node */
+ /* Presumably ENOMEM for xarray node */
return ERR_PTR(err);
}

--
2.15.0

2017-12-06 00:53:50

by Matthew Wilcox

Subject: [PATCH v4 40/73] pagevec: Use xa_tag_t

From: Matthew Wilcox <[email protected]>

Removes sparse warnings.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/btrfs/extent_io.c | 4 ++--
fs/ext4/inode.c | 2 +-
fs/f2fs/data.c | 2 +-
fs/gfs2/aops.c | 2 +-
include/linux/pagevec.h | 8 +++++---
mm/swap.c | 4 ++--
6 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 94f734e7e66f..b8b5b4562d50 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3795,7 +3795,7 @@ int btree_write_cache_pages(struct address_space *mapping,
pgoff_t index;
pgoff_t end; /* Inclusive */
int scanned = 0;
- int tag;
+ xa_tag_t tag;

pagevec_init(&pvec);
if (wbc->range_cyclic) {
@@ -3922,7 +3922,7 @@ static int extent_write_cache_pages(struct address_space *mapping,
pgoff_t done_index;
int range_whole = 0;
int scanned = 0;
- int tag;
+ xa_tag_t tag;

/*
* We have to hold onto the inode so that ordered extents can do their
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7df2c5644e59..2534304daec3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2605,7 +2605,7 @@ static int mpage_prepare_extent_to_map(struct mpage_da_data *mpd)
long left = mpd->wbc->nr_to_write;
pgoff_t index = mpd->first_page;
pgoff_t end = mpd->last_page;
- int tag;
+ xa_tag_t tag;
int i, err = 0;
int blkbits = mpd->inode->i_blkbits;
ext4_lblk_t lblk;
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 8f51ac47b77f..c8f6d9806896 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1640,7 +1640,7 @@ static int f2fs_write_cache_pages(struct address_space *mapping,
pgoff_t last_idx = ULONG_MAX;
int cycled;
int range_whole = 0;
- int tag;
+ xa_tag_t tag;

pagevec_init(&pvec);

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 1daf15a1f00c..c78ecd008191 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -369,7 +369,7 @@ static int gfs2_write_cache_jdata(struct address_space *mapping,
pgoff_t done_index;
int cycled;
int range_whole = 0;
- int tag;
+ xa_tag_t tag;

pagevec_init(&pvec);
if (wbc->range_cyclic) {
diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h
index 5fb6580f7f23..5168901bf06d 100644
--- a/include/linux/pagevec.h
+++ b/include/linux/pagevec.h
@@ -9,6 +9,8 @@
#ifndef _LINUX_PAGEVEC_H
#define _LINUX_PAGEVEC_H

+#include <linux/xarray.h>
+
/* 14 pointers + two long's align the pagevec structure to a power of two */
#define PAGEVEC_SIZE 14

@@ -40,12 +42,12 @@ static inline unsigned pagevec_lookup(struct pagevec *pvec,

unsigned pagevec_lookup_range_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
- int tag);
+ xa_tag_t tag);
unsigned pagevec_lookup_range_nr_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
- int tag, unsigned max_pages);
+ xa_tag_t tag, unsigned max_pages);
static inline unsigned pagevec_lookup_tag(struct pagevec *pvec,
- struct address_space *mapping, pgoff_t *index, int tag)
+ struct address_space *mapping, pgoff_t *index, xa_tag_t tag)
{
return pagevec_lookup_range_tag(pvec, mapping, index, (pgoff_t)-1, tag);
}
diff --git a/mm/swap.c b/mm/swap.c
index 8d7773cb2c3f..31d79479dacf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -991,7 +991,7 @@ EXPORT_SYMBOL(pagevec_lookup_range);

unsigned pagevec_lookup_range_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
- int tag)
+ xa_tag_t tag)
{
pvec->nr = find_get_pages_range_tag(mapping, index, end, tag,
PAGEVEC_SIZE, pvec->pages);
@@ -1001,7 +1001,7 @@ EXPORT_SYMBOL(pagevec_lookup_range_tag);

unsigned pagevec_lookup_range_nr_tag(struct pagevec *pvec,
struct address_space *mapping, pgoff_t *index, pgoff_t end,
- int tag, unsigned max_pages)
+ xa_tag_t tag, unsigned max_pages)
{
pvec->nr = find_get_pages_range_tag(mapping, index, end, tag,
min_t(unsigned int, max_pages, PAGEVEC_SIZE), pvec->pages);
--
2.15.0

2017-12-06 00:54:01

by Matthew Wilcox

Subject: [PATCH v4 41/73] shmem: Convert replace to XArray

From: Matthew Wilcox <[email protected]>

shmem_radix_tree_replace() is renamed to shmem_xa_replace() and
converted to use the XArray API.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 22 ++++++++--------------
1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c5731bb954a1..fad6c9e7402e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -321,24 +321,20 @@ void shmem_uncharge(struct inode *inode, long pages)
}

/*
- * Replace item expected in radix tree by a new item, while holding tree lock.
+ * Replace item expected in xarray by a new item, while holding xa_lock.
*/
-static int shmem_radix_tree_replace(struct address_space *mapping,
+static int shmem_xa_replace(struct address_space *mapping,
pgoff_t index, void *expected, void *replacement)
{
- struct radix_tree_node *node;
- void **pslot;
+ XA_STATE(xas, &mapping->pages, index);
void *item;

VM_BUG_ON(!expected);
VM_BUG_ON(!replacement);
- item = __radix_tree_lookup(&mapping->pages, index, &node, &pslot);
- if (!item)
- return -ENOENT;
+ item = xas_load(&xas);
if (item != expected)
return -ENOENT;
- __radix_tree_replace(&mapping->pages, node, pslot,
- replacement, NULL);
+ xas_store(&xas, replacement);
return 0;
}

@@ -605,8 +601,7 @@ static int shmem_add_to_page_cache(struct page *page,
} else if (!expected) {
error = radix_tree_insert(&mapping->pages, index, page);
} else {
- error = shmem_radix_tree_replace(mapping, index, expected,
- page);
+ error = shmem_xa_replace(mapping, index, expected, page);
}

if (!error) {
@@ -635,7 +630,7 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
VM_BUG_ON_PAGE(PageCompound(page), page);

xa_lock_irq(&mapping->pages);
- error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
+ error = shmem_xa_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
mapping->nrpages--;
__dec_node_page_state(page, NR_FILE_PAGES);
@@ -1550,8 +1545,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
* a nice clean interface for us to replace oldpage by newpage there.
*/
xa_lock_irq(&swap_mapping->pages);
- error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
- newpage);
+ error = shmem_xa_replace(swap_mapping, swap_index, oldpage, newpage);
if (!error) {
__inc_node_page_state(newpage, NR_FILE_PAGES);
__dec_node_page_state(oldpage, NR_FILE_PAGES);
--
2.15.0

2017-12-06 00:54:14

by Matthew Wilcox

Subject: [PATCH v4 36/73] mm: Convert page migration to XArray

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/migrate.c | 40 ++++++++++++++++------------------------
1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 59f18c571120..7122fec9b075 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -322,7 +322,7 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
page = migration_entry_to_page(entry);

/*
- * Once radix-tree replacement of page migration started, page_count
+ * Once page cache replacement of page migration started, page_count
* *must* be zero. And, we don't want to call wait_on_page_locked()
* against a page without get_page().
* So, we use get_page_unless_zero(), here. Even failed, page fault
@@ -437,10 +437,10 @@ int migrate_page_move_mapping(struct address_space *mapping,
struct buffer_head *head, enum migrate_mode mode,
int extra_count)
{
+ XA_STATE(xas, &mapping->pages, page_index(page));
struct zone *oldzone, *newzone;
int dirty;
int expected_count = 1 + extra_count;
- void **pslot;

/*
* Device public or private pages have an extra refcount as they are
@@ -466,20 +466,16 @@ int migrate_page_move_mapping(struct address_space *mapping,
oldzone = page_zone(page);
newzone = page_zone(newpage);

- xa_lock_irq(&mapping->pages);
-
- pslot = radix_tree_lookup_slot(&mapping->pages,
- page_index(page));
+ xas_lock_irq(&xas);

expected_count += 1 + page_has_private(page);
- if (page_count(page) != expected_count ||
- radix_tree_deref_slot_protected(pslot, &mapping->pages.xa_lock) != page) {
- xa_unlock_irq(&mapping->pages);
+ if (page_count(page) != expected_count || xas_load(&xas) != page) {
+ xas_unlock_irq(&xas);
return -EAGAIN;
}

if (!page_ref_freeze(page, expected_count)) {
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
return -EAGAIN;
}

@@ -493,7 +489,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
if (mode == MIGRATE_ASYNC && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_ref_unfreeze(page, expected_count);
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
return -EAGAIN;
}

@@ -521,7 +517,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
SetPageDirty(newpage);
}

- radix_tree_replace_slot(&mapping->pages, pslot, newpage);
+ xas_store(&xas, newpage);

/*
* Drop cache reference from old page by unfreezing
@@ -530,7 +526,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
*/
page_ref_unfreeze(page, expected_count - 1);

- xa_unlock(&mapping->pages);
+ xas_unlock(&xas);
/* Leave irq disabled to prevent preemption while updating stats */

/*
@@ -570,22 +566,18 @@ EXPORT_SYMBOL(migrate_page_move_mapping);
int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page)
{
+ XA_STATE(xas, &mapping->pages, page_index(page));
int expected_count;
- void **pslot;
-
- xa_lock_irq(&mapping->pages);
-
- pslot = radix_tree_lookup_slot(&mapping->pages, page_index(page));

+ xas_lock_irq(&xas);
expected_count = 2 + page_has_private(page);
- if (page_count(page) != expected_count ||
- radix_tree_deref_slot_protected(pslot, &mapping->pages.xa_lock) != page) {
- xa_unlock_irq(&mapping->pages);
+ if (page_count(page) != expected_count || xas_load(&xas) != page) {
+ xas_unlock_irq(&xas);
return -EAGAIN;
}

if (!page_ref_freeze(page, expected_count)) {
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
return -EAGAIN;
}

@@ -594,11 +586,11 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,

get_page(newpage);

- radix_tree_replace_slot(&mapping->pages, pslot, newpage);
+ xas_store(&xas, newpage);

page_ref_unfreeze(page, expected_count - 1);

- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);

return MIGRATEPAGE_SUCCESS;
}
--
2.15.0

2017-12-06 00:54:12

by Matthew Wilcox

Subject: [PATCH v4 39/73] mm: Convert khugepaged_scan_shmem to XArray

From: Matthew Wilcox <[email protected]>

Slightly shorter and easier to read code.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/khugepaged.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9f49d0cd61c2..15f1b2d81a69 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1534,8 +1534,7 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
pgoff_t start, struct page **hpage)
{
struct page *page = NULL;
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, start);
int present, swap;
int node = NUMA_NO_NODE;
int result = SCAN_SUCCEED;
@@ -1544,17 +1543,11 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
swap = 0;
memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
- if (iter.index >= start + HPAGE_PMD_NR)
- break;
-
- page = radix_tree_deref_slot(slot);
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
+ xas_for_each(&xas, page, start + HPAGE_PMD_NR - 1) {
+ if (xas_retry(&xas, page))
continue;
- }

- if (radix_tree_exception(page)) {
+ if (xa_is_value(page)) {
if (++swap > khugepaged_max_ptes_swap) {
result = SCAN_EXCEED_SWAP_PTE;
break;
@@ -1593,7 +1586,7 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
present++;

if (need_resched()) {
- slot = radix_tree_iter_resume(slot, &iter);
+ xas_pause(&xas);
cond_resched_rcu();
}
}
--
2.15.0

2017-12-06 00:54:08

by Matthew Wilcox

Subject: [PATCH v4 37/73] mm: Convert huge_memory to XArray

From: Matthew Wilcox <[email protected]>

Quite a straightforward conversion.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/huge_memory.c | 19 ++++++++-----------
1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 28909c475ee5..5a41b00d86bd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2379,7 +2379,7 @@ static void __split_huge_page_tail(struct page *head, int tail,
if (PageAnon(head) && !PageSwapCache(head)) {
page_ref_inc(page_tail);
} else {
- /* Additional pin to radix tree */
+ /* Additional pin to page cache */
page_ref_add(page_tail, 2);
}

@@ -2450,13 +2450,13 @@ static void __split_huge_page(struct page *page, struct list_head *list,
ClearPageCompound(head);
/* See comment in __split_huge_page_tail() */
if (PageAnon(head)) {
- /* Additional pin to radix tree of swap cache */
+ /* Additional pin to swap cache */
if (PageSwapCache(head))
page_ref_add(head, 2);
else
page_ref_inc(head);
} else {
- /* Additional pin to radix tree */
+ /* Additional pin to page cache */
page_ref_add(head, 2);
xa_unlock(&head->mapping->pages);
}
@@ -2568,7 +2568,7 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
{
int extra_pins;

- /* Additional pins from radix tree */
+ /* Additional pins from page cache */
if (PageAnon(page))
extra_pins = PageSwapCache(page) ? HPAGE_PMD_NR : 0;
else
@@ -2664,17 +2664,14 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_lock_irqsave(zone_lru_lock(page_zone(head)), flags);

if (mapping) {
- void **pslot;
+ XA_STATE(xas, &mapping->pages, page_index(head));

- xa_lock(&mapping->pages);
- pslot = radix_tree_lookup_slot(&mapping->pages,
- page_index(head));
/*
- * Check if the head page is present in radix tree.
+ * Check if the head page is present in page cache.
* We assume all tail are present too, if head is there.
*/
- if (radix_tree_deref_slot_protected(pslot,
- &mapping->pages.xa_lock) != head)
+ xa_lock(&mapping->pages);
+ if (xas_load(&xas) != head)
goto fail;
}

--
2.15.0

2017-12-06 00:53:57

by Matthew Wilcox

Subject: [PATCH v4 43/73] shmem: Convert find_swap_entry to XArray

From: Matthew Wilcox <[email protected]>

This is a 1:1 conversion.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 654f367aca90..ce285ae635ea 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1076,28 +1076,27 @@ static void shmem_evict_inode(struct inode *inode)
clear_inode(inode);
}

-static unsigned long find_swap_entry(struct radix_tree_root *root, void *item)
+static unsigned long find_swap_entry(struct xarray *xa, void *item)
{
- struct radix_tree_iter iter;
- void **slot;
- unsigned long found = -1;
+ XA_STATE(xas, xa, 0);
unsigned int checked = 0;
+ void *entry;

rcu_read_lock();
- radix_tree_for_each_slot(slot, root, &iter, 0) {
- if (*slot == item) {
- found = iter.index;
+ xas_for_each(&xas, entry, ULONG_MAX) {
+ if (xas_retry(&xas, entry))
+ continue;
+ if (entry == item)
break;
- }
checked++;
- if ((checked % 4096) != 0)
+ if ((checked % XA_CHECK_SCHED) != 0)
continue;
- slot = radix_tree_iter_resume(slot, &iter);
+ xas_pause(&xas);
cond_resched_rcu();
}
-
rcu_read_unlock();
- return found;
+
+ return xas_invalid(&xas) ? -1 : xas.xa_index;
}

/*
--
2.15.0

2017-12-06 00:53:46

by Matthew Wilcox

Subject: [PATCH v4 44/73] shmem: Convert shmem_tag_pins to XArray

From: Matthew Wilcox <[email protected]>

Simplify the locking by taking the spinlock while we walk the tree on
the assumption that many acquires and releases of the lock will be
worse than holding the lock for a (potentially) long time.

We could replicate the same locking behaviour with the xarray, but would
have to be careful that the xa_node wasn't RCU-freed under us before we
took the lock.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 39 ++++++++++++++++-----------------------
1 file changed, 16 insertions(+), 23 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index ce285ae635ea..2f41c7ceea18 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2601,35 +2601,28 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)

static void shmem_tag_pins(struct address_space *mapping)
{
- struct radix_tree_iter iter;
- void **slot;
- pgoff_t start;
+ XA_STATE(xas, &mapping->pages, 0);
struct page *page;
+ unsigned int tagged = 0;

lru_add_drain();
- start = 0;
- rcu_read_lock();

- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
- page = radix_tree_deref_slot(slot);
- if (!page || radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
- } else if (page_count(page) - page_mapcount(page) > 1) {
- xa_lock_irq(&mapping->pages);
- radix_tree_tag_set(&mapping->pages, iter.index,
- SHMEM_TAG_PINNED);
- xa_unlock_irq(&mapping->pages);
- }
+ xas_lock_irq(&xas);
+ xas_for_each(&xas, page, ULONG_MAX) {
+ if (xa_is_value(page))
+ continue;
+ if (page_count(page) - page_mapcount(page) > 1)
+ xas_set_tag(&xas, SHMEM_TAG_PINNED);

- if (need_resched()) {
- slot = radix_tree_iter_resume(slot, &iter);
- cond_resched_rcu();
- }
+ if (++tagged % XA_CHECK_SCHED)
+ continue;
+
+ xas_pause(&xas);
+ xas_unlock_irq(&xas);
+ cond_resched();
+ xas_lock_irq(&xas);
}
- rcu_read_unlock();
+ xas_unlock_irq(&xas);
}

/*
--
2.15.0
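
The locking trade-off described in the patch above (hold the lock across the walk, but drop it every so often so the CPU can be rescheduled) can be illustrated in userspace. This is a sketch under stated assumptions, not the kernel code: the toy_* names and TOY_CHECK_SCHED are hypothetical, a plain array stands in for the page cache, and the lock is only a counter so that the number of acquisitions can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical batch size; the patch uses XA_CHECK_SCHED for this purpose. */
#define TOY_CHECK_SCHED 4

/* A counting stand-in for a spinlock, so the sketch can be checked. */
struct toy_lock {
	unsigned int acquisitions;
	int held;
};

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/*
 * Tag every element whose count exceeds 1. The lock is held for the
 * whole walk except for a brief release after each TOY_CHECK_SCHED
 * visited elements. Returns the number of elements tagged.
 */
static unsigned int toy_tag_pins(const int *counts, unsigned char *tags,
				 size_t n, struct toy_lock *lock)
{
	unsigned int visited = 0, tagged = 0;
	size_t i;

	toy_lock_acquire(lock);
	for (i = 0; i < n; i++) {
		if (counts[i] > 1) {
			tags[i] = 1;
			tagged++;
		}
		if (++visited % TOY_CHECK_SCHED)
			continue;

		toy_lock_release(lock);
		/* a real walker would give up the CPU here */
		toy_lock_acquire(lock);
	}
	toy_lock_release(lock);

	return tagged;
}
```

With ten elements and a batch size of four, the lock is taken three times in total rather than once per tagged element, which is the behaviour the commit message argues for.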

2017-12-06 00:53:36

by Matthew Wilcox

Subject: [PATCH v4 42/73] shmem: Convert shmem_confirm_swap to XArray

From: Matthew Wilcox <[email protected]>

xa_load has its own RCU locking, so we can eliminate it here.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index fad6c9e7402e..654f367aca90 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -348,12 +348,7 @@ static int shmem_xa_replace(struct address_space *mapping,
static bool shmem_confirm_swap(struct address_space *mapping,
pgoff_t index, swp_entry_t swap)
{
- void *item;
-
- rcu_read_lock();
- item = radix_tree_lookup(&mapping->pages, index);
- rcu_read_unlock();
- return item == swp_to_radix_entry(swap);
+ return xa_load(&mapping->pages, index) == swp_to_radix_entry(swap);
}

/*
--
2.15.0

2017-12-06 00:53:25

by Matthew Wilcox

Subject: [PATCH v4 47/73] shmem: Convert shmem_alloc_hugepage to XArray

From: Matthew Wilcox <[email protected]>

xa_find() is a slightly easier API to use than
radix_tree_gang_lookup_slot() because it contains its own RCU locking.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/shmem.c | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 54fbfc2c6c09..768d470a03da 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1413,23 +1413,16 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
{
struct vm_area_struct pvma;
- struct inode *inode = &info->vfs_inode;
- struct address_space *mapping = inode->i_mapping;
- pgoff_t idx, hindex;
- void __rcu **results;
+ struct address_space *mapping = info->vfs_inode.i_mapping;
+ pgoff_t hindex;
struct page *page;

if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE))
return NULL;

hindex = round_down(index, HPAGE_PMD_NR);
- rcu_read_lock();
- if (radix_tree_gang_lookup_slot(&mapping->pages, &results, &idx,
- hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
- rcu_read_unlock();
+ if (xa_find(&mapping->pages, &hindex, hindex + HPAGE_PMD_NR - 1))
return NULL;
- }
- rcu_read_unlock();

shmem_pseudo_vma_init(&pvma, info, hindex);
page = alloc_pages_vma(gfp | __GFP_COMP | __GFP_NORETRY | __GFP_NOWARN,
--
2.15.0
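
The check in the patch above asks one question: is any entry present in the aligned range that the huge page would cover? A minimal userspace sketch of that question follows. It assumes a flat array of pointers in place of the XArray, and toy_find() only imitates the three-argument xa_find() call as it is used in this patch (start index passed by pointer, inclusive maximum); both toy_* functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the first non-NULL entry in [*indexp, max] and update *indexp
 * to its position, or return NULL if the range holds no entry.
 */
static void *toy_find(void *const *slots, size_t nslots,
		      size_t *indexp, size_t max)
{
	size_t i;

	assert(slots != NULL && indexp != NULL);
	for (i = *indexp; i <= max && i < nslots; i++) {
		if (slots[i]) {
			*indexp = i;
			return slots[i];
		}
	}
	return NULL;
}

/*
 * True if the nr-aligned range containing index is empty, which is the
 * condition under which the patch goes on to allocate a huge page.
 */
static bool toy_range_is_empty(void *const *slots, size_t nslots,
			       size_t index, size_t nr)
{
	size_t start = index - (index % nr); /* round down to a multiple of nr */

	return toy_find(slots, nslots, &start, start + nr - 1) == NULL;
}
```

For example, with a single entry at index 9 and nr = 8, the range containing index 5 (0 to 7) is empty while the range containing index 10 (8 to 15) is not.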

2017-12-06 00:53:17

by Matthew Wilcox

Subject: [PATCH v4 51/73] btrfs: Convert page cache to XArray

From: Matthew Wilcox <[email protected]>

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/btrfs/compression.c | 4 +---
fs/btrfs/extent_io.c | 6 ++----
2 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e687d06cd97c..4174b166e235 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -449,9 +449,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
if (pg_index > end_index)
break;

- rcu_read_lock();
- page = radix_tree_lookup(&mapping->pages, pg_index);
- rcu_read_unlock();
+ page = xa_load(&mapping->pages, pg_index);
if (page && !xa_is_value(page)) {
misses++;
if (misses > 4)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b8b5b4562d50..96328c3a548e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5197,11 +5197,9 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)

clear_page_dirty_for_io(page);
xa_lock_irq(&page->mapping->pages);
- if (!PageDirty(page)) {
- radix_tree_tag_clear(&page->mapping->pages,
- page_index(page),
+ if (!PageDirty(page))
+ __xa_clear_tag(&page->mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
- }
xa_unlock_irq(&page->mapping->pages);
ClearPageError(page);
unlock_page(page);
--
2.15.0

2017-12-06 00:59:04

by Matthew Wilcox

Subject: [PATCH v4 34/73] mm: Convert cgroup writeback to XArray

From: Matthew Wilcox <[email protected]>

This is a fairly naive conversion, leaving in place the GFP_ATOMIC
allocation. By switching the locking around, we could use GFP_KERNEL
and probably simplify the error handling.
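
The error handling in question has three outcomes for the compare-and-exchange in cgwb_create(): the slot was empty and the store succeeded, the store failed with an error, or another entry was already present. The userspace sketch below models only that three-way decision. All toy_* names are hypothetical, the error-pointer helpers merely imitate the kernel's ERR_PTR convention, and alloc_ok stands in for whether a GFP_ATOMIC allocation would succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Userspace imitations of the error-pointer convention. */
static void *toy_err_ptr(int err)
{
	return (void *)(intptr_t)err;
}

static bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

static int toy_ptr_err(const void *p)
{
	return (int)(intptr_t)p;
}

/*
 * Store new_entry only if the slot currently holds old. Returns the
 * previous content of the slot, or an error pointer if the store would
 * have needed memory that could not be allocated.
 */
static void *toy_cmpxchg(void **slot, void *old, void *new_entry,
			 bool alloc_ok)
{
	void *curr;

	assert(slot != NULL);
	curr = *slot;
	if (curr != old)
		return curr;
	if (!alloc_ok)
		return toy_err_ptr(-ENOMEM);
	*slot = new_entry;
	return curr;
}

/* Map the three outcomes to 0, a negative errno, or -EEXIST. */
static int toy_register(void **slot, void *wb, bool alloc_ok)
{
	void *curr = toy_cmpxchg(slot, NULL, wb, alloc_ok);

	if (!curr)
		return 0;
	if (toy_is_err(curr))
		return toy_ptr_err(curr);
	return -EEXIST;
}
```

A second registration into an occupied slot returns -EEXIST and leaves the first entry in place, which matches the "we might have raced another instance" case the diff handles.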

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/backing-dev-defs.h | 2 +-
include/linux/backing-dev.h | 2 +-
mm/backing-dev.c | 28 ++++++++++++++++------------
3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index bfe86b54f6c1..074a54aad33c 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -187,7 +187,7 @@ struct backing_dev_info {
struct bdi_writeback wb; /* the root writeback info for this bdi */
struct list_head wb_list; /* list of all wbs */
#ifdef CONFIG_CGROUP_WRITEBACK
- struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
+ struct xarray cgwb_xa; /* xarray of active cgroup wbs */
struct rb_root cgwb_congested_tree; /* their congested states */
#else
struct bdi_writeback_congested *wb_congested;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 9038f6c1eeda..50f666d23527 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -271,7 +271,7 @@ static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi
if (!memcg_css->parent)
return &bdi->wb;

- wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+ wb = xa_load(&bdi->cgwb_xa, memcg_css->id);

/*
* %current's blkcg equals the effective blkcg of its memcg. No
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 84b2dc76f140..7aa2d893f929 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -417,8 +417,8 @@ static void wb_exit(struct bdi_writeback *wb)
#include <linux/memcontrol.h>

/*
- * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
- * blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU
+ * cgwb_lock protects bdi->cgwb_xa, bdi->cgwb_congested_tree,
+ * blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_xa is also RCU
* protected.
*/
static DEFINE_SPINLOCK(cgwb_lock);
@@ -539,7 +539,7 @@ static void cgwb_kill(struct bdi_writeback *wb)
{
lockdep_assert_held(&cgwb_lock);

- WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
+ WARN_ON(xa_erase(&wb->bdi->cgwb_xa, wb->memcg_css->id) != wb);
list_del(&wb->memcg_node);
list_del(&wb->blkcg_node);
percpu_ref_kill(&wb->refcnt);
@@ -571,7 +571,7 @@ static int cgwb_create(struct backing_dev_info *bdi,

/* look up again under lock and discard on blkcg mismatch */
spin_lock_irqsave(&cgwb_lock, flags);
- wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+ wb = xa_load(&bdi->cgwb_xa, memcg_css->id);
if (wb && wb->blkcg_css != blkcg_css) {
cgwb_kill(wb);
wb = NULL;
@@ -615,13 +615,18 @@ static int cgwb_create(struct backing_dev_info *bdi,
if (test_bit(WB_registered, &bdi->wb.state) &&
blkcg_cgwb_list->next && memcg_cgwb_list->next) {
/* we might have raced another instance of this function */
- ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
- if (!ret) {
+ void *curr = xa_cmpxchg(&bdi->cgwb_xa, memcg_css->id, NULL,
+ wb, GFP_ATOMIC);
+ if (!curr) {
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
list_add(&wb->memcg_node, memcg_cgwb_list);
list_add(&wb->blkcg_node, blkcg_cgwb_list);
css_get(memcg_css);
css_get(blkcg_css);
+ } else if (IS_ERR(curr)) {
+ ret = PTR_ERR(curr);
+ } else {
+ ret = -EEXIST;
}
}
spin_unlock_irqrestore(&cgwb_lock, flags);
@@ -682,7 +687,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,

do {
rcu_read_lock();
- wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+ wb = xa_load(&bdi->cgwb_xa, memcg_css->id);
if (wb) {
struct cgroup_subsys_state *blkcg_css;

@@ -704,7 +709,7 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
{
int ret;

- INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
+ xa_init(&bdi->cgwb_xa);
bdi->cgwb_congested_tree = RB_ROOT;

ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
@@ -717,15 +722,14 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)

static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
{
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &bdi->cgwb_xa, 0);
struct bdi_writeback *wb;

WARN_ON(test_bit(WB_registered, &bdi->wb.state));

spin_lock_irq(&cgwb_lock);
- radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
- cgwb_kill(*slot);
+ xas_for_each(&xas, wb, ULONG_MAX)
+ cgwb_kill(wb);

while (!list_empty(&bdi->wb_list)) {
wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
--
2.15.0

2017-12-06 00:59:01

by Matthew Wilcox

Subject: [PATCH v4 33/73] mm: Convert delete_from_swap_cache to XArray

From: Matthew Wilcox <[email protected]>

Both callers of __delete_from_swap_cache have the swp_entry_t already,
so pass that in to make constructing the XA_STATE easier.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/swap.h | 5 +++--
mm/swap_state.c | 24 ++++++++++--------------
mm/vmscan.c | 2 +-
3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e4a8afcb214c..569a8ac4fe3f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -413,7 +413,7 @@ extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *page);
extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t);
extern int __add_to_swap_cache(struct page *page, swp_entry_t entry);
-extern void __delete_from_swap_cache(struct page *);
+extern void __delete_from_swap_cache(struct page *, swp_entry_t entry);
extern void delete_from_swap_cache(struct page *);
extern void free_page_and_swap_cache(struct page *);
extern void free_pages_and_swap_cache(struct page **, int);
@@ -588,7 +588,8 @@ static inline int add_to_swap_cache(struct page *page, swp_entry_t entry,
return -1;
}

-static inline void __delete_from_swap_cache(struct page *page)
+static inline void __delete_from_swap_cache(struct page *page,
+ swp_entry_t entry)
{
}

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 117b5da9dc01..7c862258af66 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -154,23 +154,22 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
* This must be called only on pages that have
* been verified to be in the swap cache.
*/
-void __delete_from_swap_cache(struct page *page)
+void __delete_from_swap_cache(struct page *page, swp_entry_t entry)
{
- struct address_space *address_space;
+ struct address_space *address_space = swap_address_space(entry);
int i, nr = hpage_nr_pages(page);
- swp_entry_t entry;
- pgoff_t idx;
+ pgoff_t idx = swp_offset(entry);
+ XA_STATE(xas, &address_space->pages, idx);

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapCache(page), page);
VM_BUG_ON_PAGE(PageWriteback(page), page);

- entry.val = page_private(page);
- address_space = swap_address_space(entry);
- idx = swp_offset(entry);
for (i = 0; i < nr; i++) {
- radix_tree_delete(&address_space->pages, idx + i);
+ void *entry = xas_store(&xas, NULL);
+ VM_BUG_ON_PAGE(entry != page + i, entry);
set_page_private(page + i, 0);
+ xas_next(&xas);
}
ClearPageSwapCache(page);
address_space->nrpages -= nr;
@@ -246,14 +245,11 @@ int add_to_swap(struct page *page)
*/
void delete_from_swap_cache(struct page *page)
{
- swp_entry_t entry;
- struct address_space *address_space;
-
- entry.val = page_private(page);
+ swp_entry_t entry = { .val = page_private(page) };
+ struct address_space *address_space = swap_address_space(entry);

- address_space = swap_address_space(entry);
xa_lock_irq(&address_space->pages);
- __delete_from_swap_cache(page);
+ __delete_from_swap_cache(page, entry);
xa_unlock_irq(&address_space->pages);

put_swap_page(page, entry);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 96316bd91f91..51df3f9ba0bc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -715,7 +715,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
- __delete_from_swap_cache(page);
+ __delete_from_swap_cache(page, swap);
xa_unlock_irqrestore(&mapping->pages, flags);
put_swap_page(page, swap);
} else {
--
2.15.0

2017-12-06 00:58:55

by Matthew Wilcox

Subject: [PATCH v4 35/73] mm: Convert __do_page_cache_readahead to XArray

From: Matthew Wilcox <[email protected]>

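This one is trivial: xa_load() takes the RCU read lock itself, so the
explicit rcu_read_lock()/rcu_read_unlock() pair around the lookup goes away.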

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/readahead.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index f64b31b3a84a..66bcaffd47f0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -174,9 +174,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
if (page_offset > end_index)
break;

- rcu_read_lock();
- page = radix_tree_lookup(&mapping->pages, page_offset);
- rcu_read_unlock();
+ page = xa_load(&mapping->pages, page_offset);
if (page && !xa_is_value(page))
continue;

--
2.15.0

2017-12-06 00:58:51

by Matthew Wilcox

Subject: [PATCH v4 38/73] mm: Convert collapse_shmem to XArray

From: Matthew Wilcox <[email protected]>

I found another victim of the radix tree being hard to use. Because
there was no call to radix_tree_preload(), khugepaged was allocating
radix_tree_nodes using GFP_ATOMIC.

I also converted a local_irq_save()/local_irq_restore() pair to
local_irq_disable()/local_irq_enable().

Signed-off-by: Matthew Wilcox <[email protected]>
---
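The first loop in the mm/khugepaged.c hunk follows a retry pattern: attempt
the operation with the lock held and, if it fails for lack of memory, drop
the lock, allocate where sleeping is allowed, and try again. A rough
userspace sketch of that shape, assuming a toy cursor that carries one
preallocated node (all toy_* names are hypothetical, and locking is reduced
to a counter so the example stays single-threaded):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_state {
	void *spare;	 /* node allocated outside the lock, if any */
	int err;	 /* 0 or a negative errno from the last attempt */
	int lock_depth;	 /* stands in for the spinlock */
	int attempts;	 /* how many times the locked section ran */
	void *installed; /* node consumed by the successful attempt */
};

static void toy_lock(struct toy_state *s)
{
	s->lock_depth++;
}

static void toy_unlock(struct toy_state *s)
{
	assert(s->lock_depth > 0);
	s->lock_depth--;
}

/* Locked section: needs a node, but must not allocate here. */
static void toy_create(struct toy_state *s)
{
	s->attempts++;
	if (!s->spare) {
		s->err = -ENOMEM;
		return;
	}
	s->installed = s->spare;
	s->spare = NULL;
	s->err = 0;
}

/*
 * Called with the lock dropped. If the last attempt ran out of memory,
 * allocate a spare node and report that a retry is worthwhile.
 */
static bool toy_nomem(struct toy_state *s)
{
	if (s->err != -ENOMEM)
		return false;
	s->spare = malloc(64);
	if (!s->spare)
		return false;
	s->err = 0;
	return true;
}

/* Returns 0 with the lock held on success, or a negative errno unlocked. */
static int toy_create_locked(struct toy_state *s)
{
	for (;;) {
		toy_lock(s);
		toy_create(s);
		if (!s->err)
			return 0;
		toy_unlock(s);
		if (!toy_nomem(s))
			return s->err;
	}
}
```

Starting from a zeroed toy_state, the locked section runs twice: the first
attempt records -ENOMEM, the spare node is allocated with the lock dropped,
and the second attempt consumes it.
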
include/linux/swap.h | 4 +-
mm/khugepaged.c | 158 +++++++++++++++++++++------------------------------
2 files changed, 67 insertions(+), 95 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 569a8ac4fe3f..9774f43d3e4f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -300,12 +300,12 @@ bool workingset_refault(void *shadow);
void workingset_activation(struct page *page);

/* Do not use directly, use workingset_lookup_update */
-void workingset_update_node(struct xa_node *node);
+void workingset_update_node(struct radix_tree_node *node);

/* Returns workingset_update_node() if the mapping has shadow entries. */
#define workingset_lookup_update(mapping) \
({ \
- xa_update_node_t __helper = workingset_update_node; \
+ radix_tree_update_node_t __helper = workingset_update_node; \
if (dax_mapping(mapping) || shmem_mapping(mapping)) \
__helper = NULL; \
__helper; \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 55ade70c33bb..9f49d0cd61c2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1282,17 +1282,17 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
*
* Basic scheme is simple, details are more complex:
* - allocate and freeze a new huge page;
- * - scan over radix tree replacing old pages the new one
+ * - scan page cache replacing old pages with the new one
* + swap in pages if necessary;
* + fill in gaps;
- * + keep old pages around in case if rollback is required;
- * - if replacing succeed:
+ * + keep old pages around in case rollback is required;
+ * - if replacing succeeds:
* + copy data over;
* + free old pages;
* + unfreeze huge page;
* - if replacing failed;
* + put all pages back and unfreeze them;
- * + restore gaps in the radix-tree;
+ * + restore gaps in the page cache;
* + free huge page;
*/
static void collapse_shmem(struct mm_struct *mm,
@@ -1300,12 +1300,11 @@ static void collapse_shmem(struct mm_struct *mm,
struct page **hpage, int node)
{
gfp_t gfp;
- struct page *page, *new_page, *tmp;
+ struct page *new_page;
struct mem_cgroup *memcg;
pgoff_t index, end = start + HPAGE_PMD_NR;
LIST_HEAD(pagelist);
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, start);
int nr_none = 0, result = SCAN_SUCCEED;

VM_BUG_ON(start & (HPAGE_PMD_NR - 1));
@@ -1330,48 +1329,48 @@ static void collapse_shmem(struct mm_struct *mm,
__SetPageLocked(new_page);
BUG_ON(!page_ref_freeze(new_page, 1));

-
/*
- * At this point the new_page is 'frozen' (page_count() is zero), locked
- * and not up-to-date. It's safe to insert it into radix tree, because
- * nobody would be able to map it or use it in other way until we
- * unfreeze it.
+ * At this point the new_page is 'frozen' (page_count() is zero),
+ * locked and not up-to-date. It's safe to insert it into the page
+ * cache, because nobody would be able to map it or use it in other
+ * way until we unfreeze it.
*/

- index = start;
- xa_lock_irq(&mapping->pages);
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
- int n = min(iter.index, end) - index;
-
- /*
- * Handle holes in the radix tree: charge it from shmem and
- * insert relevant subpage of new_page into the radix-tree.
- */
- if (n && !shmem_charge(mapping->host, n)) {
- result = SCAN_FAIL;
+ /* This will be less messy when we use multi-index entries */
+ do {
+ xas_lock_irq(&xas);
+ xas_create_range(&xas, end - 1);
+ if (!xas_error(&xas))
break;
- }
- nr_none += n;
- for (; index < min(iter.index, end); index++) {
- radix_tree_insert(&mapping->pages, index,
- new_page + (index % HPAGE_PMD_NR));
- }
+ xas_unlock_irq(&xas);
+ if (!xas_nomem(&xas, GFP_KERNEL))
+ goto out;
+ } while (1);

- /* We are done. */
- if (index >= end)
- break;
+ for (index = start; index < end; index++) {
+ struct page *page = xas_next(&xas);
+
+ VM_BUG_ON(index != xas.xa_index);
+ if (!page) {
+ if (!shmem_charge(mapping->host, 1)) {
+ result = SCAN_FAIL;
+ break;
+ }
+ xas_store(&xas, new_page + (index % HPAGE_PMD_NR));
+ nr_none++;
+ continue;
+ }

- page = radix_tree_deref_slot_protected(slot,
- &mapping->pages.xa_lock);
if (xa_is_value(page) || !PageUptodate(page)) {
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
/* swap in or instantiate fallocated page */
if (shmem_getpage(mapping->host, index, &page,
SGP_NOHUGE)) {
result = SCAN_FAIL;
- goto tree_unlocked;
+ goto xa_unlocked;
}
- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
+ xas_set(&xas, index);
} else if (trylock_page(page)) {
get_page(page);
} else {
@@ -1391,7 +1390,7 @@ static void collapse_shmem(struct mm_struct *mm,
result = SCAN_TRUNCATED;
goto out_unlock;
}
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);

if (isolate_lru_page(page)) {
result = SCAN_DEL_PAGE_LRU;
@@ -1402,17 +1401,16 @@ static void collapse_shmem(struct mm_struct *mm,
unmap_mapping_range(mapping, index << PAGE_SHIFT,
PAGE_SIZE, 0);

- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
+ xas_set(&xas, index);

- slot = radix_tree_lookup_slot(&mapping->pages, index);
- VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
- &mapping->pages.xa_lock), page);
+ VM_BUG_ON_PAGE(page != xas_load(&xas), page);
VM_BUG_ON_PAGE(page_mapped(page), page);

/*
* The page is expected to have page_count() == 3:
* - we hold a pin on it;
- * - one reference from radix tree;
+ * - one reference from page cache;
* - one from isolate_lru_page;
*/
if (!page_ref_freeze(page, 3)) {
@@ -1427,56 +1425,30 @@ static void collapse_shmem(struct mm_struct *mm,
list_add_tail(&page->lru, &pagelist);

/* Finally, replace with the new page. */
- radix_tree_replace_slot(&mapping->pages, slot,
- new_page + (index % HPAGE_PMD_NR));
-
- slot = radix_tree_iter_resume(slot, &iter);
- index++;
+ xas_store(&xas, new_page + (index % HPAGE_PMD_NR));
continue;
out_lru:
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
putback_lru_page(page);
out_isolate_failed:
unlock_page(page);
put_page(page);
- goto tree_unlocked;
+ goto xa_unlocked;
out_unlock:
unlock_page(page);
put_page(page);
break;
}
+ xas_unlock_irq(&xas);

- /*
- * Handle hole in radix tree at the end of the range.
- * This code only triggers if there's nothing in radix tree
- * beyond 'end'.
- */
- if (result == SCAN_SUCCEED && index < end) {
- int n = end - index;
-
- if (!shmem_charge(mapping->host, n)) {
- result = SCAN_FAIL;
- goto tree_locked;
- }
-
- for (; index < end; index++) {
- radix_tree_insert(&mapping->pages, index,
- new_page + (index % HPAGE_PMD_NR));
- }
- nr_none += n;
- }
-
-tree_locked:
- xa_unlock_irq(&mapping->pages);
-tree_unlocked:
-
+xa_unlocked:
if (result == SCAN_SUCCEED) {
- unsigned long flags;
+ struct page *page, *tmp;
struct zone *zone = page_zone(new_page);

/*
- * Replacing old pages with new one has succeed, now we need to
- * copy the content and free old pages.
+ * Replacing old pages with new one has succeeded, now we
+ * need to copy the content and free the old pages.
*/
list_for_each_entry_safe(page, tmp, &pagelist, lru) {
copy_highpage(new_page + (page->index % HPAGE_PMD_NR),
@@ -1490,16 +1462,16 @@ static void collapse_shmem(struct mm_struct *mm,
put_page(page);
}

- local_irq_save(flags);
+ local_irq_disable();
__inc_node_page_state(new_page, NR_SHMEM_THPS);
if (nr_none) {
__mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
__mod_node_page_state(zone->zone_pgdat, NR_SHMEM, nr_none);
}
- local_irq_restore(flags);
+ local_irq_enable();

/*
- * Remove pte page tables, so we can re-faulti
+ * Remove pte page tables, so we can re-fault
* the page as huge.
*/
retract_page_tables(mapping, start);
@@ -1514,37 +1486,37 @@ static void collapse_shmem(struct mm_struct *mm,

*hpage = NULL;
} else {
- /* Something went wrong: rollback changes to the radix-tree */
+ struct page *page;
+ /* Something went wrong: roll back page cache changes */
shmem_uncharge(mapping->host, nr_none);
- xa_lock_irq(&mapping->pages);
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
- if (iter.index >= end)
- break;
+ xas_lock_irq(&xas);
+ xas_set(&xas, start);
+ xas_for_each(&xas, page, end - 1) {
page = list_first_entry_or_null(&pagelist,
struct page, lru);
- if (!page || iter.index < page->index) {
+ if (!page || xas.xa_index < page->index) {
if (!nr_none)
break;
nr_none--;
/* Put holes back where they were */
- radix_tree_delete(&mapping->pages, iter.index);
+ xas_store(&xas, NULL);
continue;
}

- VM_BUG_ON_PAGE(page->index != iter.index, page);
+ VM_BUG_ON_PAGE(page->index != xas.xa_index, page);

/* Unfreeze the page. */
list_del(&page->lru);
page_ref_unfreeze(page, 2);
- radix_tree_replace_slot(&mapping->pages, slot, page);
- slot = radix_tree_iter_resume(slot, &iter);
- xa_unlock_irq(&mapping->pages);
+ xas_store(&xas, page);
+ xas_pause(&xas);
+ xas_unlock_irq(&xas);
putback_lru_page(page);
unlock_page(page);
- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
}
VM_BUG_ON(nr_none);
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);

/* Unfreeze new_page, caller would take care about freeing it */
page_ref_unfreeze(new_page, 1);
--
2.15.0

2017-12-06 01:00:56

by Matthew Wilcox

Subject: [PATCH v4 23/73] page cache: Add page_cache_range_empty function

From: Matthew Wilcox <[email protected]>

btrfs has its own custom function for determining whether the page cache
has any pages in a particular range. Move this functionality to the
page cache, and call it from btrfs.

Signed-off-by: Matthew Wilcox <[email protected]>
---
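One C detail is worth keeping in mind when reading the new helper in
mm/filemap.c: a "continue" inside "do { ... } while (0)" jumps to the loop
condition, which is false, so the body runs at most once and whatever was
last found is what the caller sees. The userspace sketch below contrasts that
with a loop that keeps scanning. The toy array, the TOY_VALUE marker and both
function names are hypothetical and only model the control flow, not the
XArray API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_VALUE ((const char *)1)	/* stands in for a value entry */

/* Return the first non-NULL slot in [*pos, max], advancing *pos past it. */
static const char *toy_find(const char *const *slots, size_t *pos, size_t max)
{
	assert(slots != NULL);
	while (*pos <= max) {
		const char *entry = slots[(*pos)++];

		if (entry)
			return entry;
	}
	return NULL;
}

/* Single pass: "continue" leaves the loop because the condition is 0. */
static bool toy_found_once(const char *const *slots, size_t start, size_t max)
{
	const char *entry;
	size_t pos = start;

	do {
		entry = toy_find(slots, &pos, max);
		if (entry == TOY_VALUE)
			continue;	/* evaluates while (0) and exits */
	} while (0);

	return entry != NULL;
}

/* Scanning variant: value entries are skipped until a real one turns up. */
static bool toy_found_scan(const char *const *slots, size_t start, size_t max)
{
	const char *entry;
	size_t pos = start;

	do {
		entry = toy_find(slots, &pos, max);
	} while (entry == TOY_VALUE);

	return entry != NULL;
}
```

For a range that holds only a value entry, toy_found_once() reports true
while toy_found_scan() reports false; the two agree once a real entry is
present in the range.
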
fs/btrfs/btrfs_inode.h | 7 ++++-
fs/btrfs/inode.c | 70 -------------------------------------------------
include/linux/pagemap.h | 2 ++
mm/filemap.c | 26 ++++++++++++++++++
4 files changed, 34 insertions(+), 71 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 63f0ccc92a71..a48bd6e0a0bb 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -365,6 +365,11 @@ static inline void btrfs_print_data_csum_error(struct btrfs_inode *inode,
logical_start, csum, csum_expected, mirror_num);
}

-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end);
+static inline bool btrfs_page_exists_in_range(struct inode *inode,
+ loff_t start, loff_t end)
+{
+ return page_cache_range_empty(inode->i_mapping, start >> PAGE_SHIFT,
+ end >> PAGE_SHIFT);
+}

#endif
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 72f763c56127..a2692bceaa98 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7539,76 +7539,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
return ret;
}

-bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end)
-{
- struct radix_tree_root *root = &inode->i_mapping->pages;
- bool found = false;
- void **pagep = NULL;
- struct page *page = NULL;
- unsigned long start_idx;
- unsigned long end_idx;
-
- start_idx = start >> PAGE_SHIFT;
-
- /*
- * end is the last byte in the last page. end == start is legal
- */
- end_idx = end >> PAGE_SHIFT;
-
- rcu_read_lock();
-
- /* Most of the code in this while loop is lifted from
- * find_get_page. It's been modified to begin searching from a
- * page and return just the first page found in that range. If the
- * found idx is less than or equal to the end idx then we know that
- * a page exists. If no pages are found or if those pages are
- * outside of the range then we're fine (yay!) */
- while (page == NULL &&
- radix_tree_gang_lookup_slot(root, &pagep, NULL, start_idx, 1)) {
- page = radix_tree_deref_slot(pagep);
- if (unlikely(!page))
- break;
-
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- page = NULL;
- continue;
- }
- /*
- * Otherwise, shmem/tmpfs must be storing a swap entry
- * here so return it without attempting to raise page
- * count.
- */
- page = NULL;
- break; /* TODO: Is this relevant for this use case? */
- }
-
- if (!page_cache_get_speculative(page)) {
- page = NULL;
- continue;
- }
-
- /*
- * Has the page moved?
- * This is part of the lockless pagecache protocol. See
- * include/linux/pagemap.h for details.
- */
- if (unlikely(page != *pagep)) {
- put_page(page);
- page = NULL;
- }
- }
-
- if (page) {
- if (page->index <= end_idx)
- found = true;
- put_page(page);
- }
-
- rcu_read_unlock();
- return found;
-}
-
static int lock_extent_direct(struct inode *inode, u64 lockstart, u64 lockend,
struct extent_state **cached_state, int writing)
{
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 0db127c3ccac..34d4fa3ad1c5 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -245,6 +245,8 @@ pgoff_t page_cache_next_gap(struct address_space *mapping,
pgoff_t index, unsigned long max_scan);
pgoff_t page_cache_prev_gap(struct address_space *mapping,
pgoff_t index, unsigned long max_scan);
+bool page_cache_range_empty(struct address_space *mapping,
+ pgoff_t index, pgoff_t max);

#define FGP_ACCESSED 0x00000001
#define FGP_LOCK 0x00000002
diff --git a/mm/filemap.c b/mm/filemap.c
index 650624f7b79d..51f88ffc5319 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1397,6 +1397,32 @@ pgoff_t page_cache_prev_gap(struct address_space *mapping,
}
EXPORT_SYMBOL(page_cache_prev_gap);

+bool page_cache_range_empty(struct address_space *mapping, pgoff_t index,
+ pgoff_t max)
+{
+ struct page *page;
+ XA_STATE(xas, &mapping->pages, index);
+
+ rcu_read_lock();
+ do {
+ page = xas_find(&xas, max);
+ if (xas_retry(&xas, page))
+ continue;
+ /* Shadow entries don't count */
+ if (xa_is_value(page))
+ continue;
+ /*
+ * We don't need to try to pin this page; we're about to
+ * release the RCU lock anyway. It is enough to know that
+ * there was a page here recently.
+ */
+ } while (0);
+ rcu_read_unlock();
+
+ return page != NULL;
+}
+EXPORT_SYMBOL_GPL(page_cache_range_empty);
+
/**
* find_get_entry - find and get a page cache entry
* @mapping: the address_space to search
--
2.15.0

2017-12-06 01:01:03

by Matthew Wilcox

Subject: [PATCH v4 30/73] mm: Convert workingset to XArray

From: Matthew Wilcox <[email protected]>

We construct a fake XA_STATE and use it to delete the node with xas_store()
rather than adding a special function for this unique use case.

Signed-off-by: Matthew Wilcox <[email protected]>
---
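The idea in shadow_lru_isolate() below is to aim a hand-built cursor at the
parent slot that points to the node and store NULL there, so the generic
store path does the removal. A small userspace sketch of that idea, assuming
a toy two-level tree in which the node always has a parent (the toy_* types
and functions are hypothetical, not the XArray implementation):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FANOUT 4

struct toy_node {
	struct toy_node *parent;	/* NULL for the top node */
	unsigned int offset;		/* which parent slot points at us */
	int count;			/* number of occupied slots */
	void *slots[TOY_FANOUT];
};

/* A cursor names one slot: slot @offset of @node. */
struct toy_cursor {
	struct toy_node *node;
	unsigned int offset;
};

/* Generic store: replace the slot under the cursor, keep count in sync. */
static void *toy_store(struct toy_cursor *c, void *entry)
{
	void *old;

	assert(c->node && c->offset < TOY_FANOUT);
	old = c->node->slots[c->offset];
	c->node->slots[c->offset] = entry;
	c->node->count += (entry != NULL) - (old != NULL);
	return old;
}

/* Detach @child by aiming a hand-built cursor at its slot in the parent. */
static void *toy_detach(struct toy_node *child)
{
	struct toy_cursor c = {
		.node = child->parent,
		.offset = child->offset,
	};

	return toy_store(&c, NULL);
}
```

Because the removal goes through the same store routine as every other
update, the bookkeeping (here just the slot count) lives in one place.
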
include/linux/swap.h | 4 ++--
mm/workingset.c | 48 ++++++++++++++++++++----------------------------
2 files changed, 22 insertions(+), 30 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2b8128799c1..e4a8afcb214c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -300,12 +300,12 @@ bool workingset_refault(void *shadow);
void workingset_activation(struct page *page);

/* Do not use directly, use workingset_lookup_update */
-void workingset_update_node(struct radix_tree_node *node);
+void workingset_update_node(struct xa_node *node);

/* Returns workingset_update_node() if the mapping has shadow entries. */
#define workingset_lookup_update(mapping) \
({ \
- radix_tree_update_node_t __helper = workingset_update_node; \
+ xa_update_node_t __helper = workingset_update_node; \
if (dax_mapping(mapping) || shmem_mapping(mapping)) \
__helper = NULL; \
__helper; \
diff --git a/mm/workingset.c b/mm/workingset.c
index 0a3465700d5f..e51deb274d2f 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -148,7 +148,7 @@
* and activations is maintained (node->inactive_age).
*
* On eviction, a snapshot of this counter (along with some bits to
- * identify the node) is stored in the now empty page cache radix tree
+ * identify the node) is stored in the now empty page cache
* slot of the evicted page. This is called a shadow entry.
*
* On cache misses for which there are shadow entries, an eligible
@@ -162,7 +162,7 @@

/*
* Eviction timestamps need to be able to cover the full range of
- * actionable refaults. However, bits are tight in the radix tree
+ * actionable refaults. However, bits are tight in the xarray
* entry, and after storing the identifier for the lruvec there might
* not be enough left to represent every single actionable refault. In
* that case, we have to sacrifice granularity for distance, and group
@@ -338,7 +338,7 @@ void workingset_activation(struct page *page)

static struct list_lru shadow_nodes;

-void workingset_update_node(struct radix_tree_node *node)
+void workingset_update_node(struct xa_node *node)
{
/*
* Track non-empty nodes that contain only shadow entries;
@@ -370,7 +370,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
local_irq_enable();

/*
- * Approximate a reasonable limit for the radix tree nodes
+ * Approximate a reasonable limit for the nodes
* containing shadow entries. We don't need to keep more
* shadow entries than possible pages on the active list,
* since refault distances bigger than that are dismissed.
@@ -385,11 +385,11 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
* worst-case density of 1/8th. Below that, not all eligible
* refaults can be detected anymore.
*
- * On 64-bit with 7 radix_tree_nodes per page and 64 slots
+ * On 64-bit with 7 xa_nodes per page and 64 slots
* each, this will reclaim shadow entries when they consume
* ~1.8% of available memory:
*
- * PAGE_SIZE / radix_tree_nodes / node_entries * 8 / PAGE_SIZE
+ * PAGE_SIZE / xa_nodes / node_entries * 8 / PAGE_SIZE
*/
if (sc->memcg) {
cache = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
@@ -410,9 +410,9 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
spinlock_t *lru_lock,
void *arg)
{
+ XA_STATE(xas, NULL, 0);
struct address_space *mapping;
- struct radix_tree_node *node;
- unsigned int i;
+ struct xa_node *node;
int ret;

/*
@@ -420,14 +420,14 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
* the shadow node LRU under the mapping->pages.xa_lock and the
* lru_lock. Because the page cache tree is emptied before
* the inode can be destroyed, holding the lru_lock pins any
- * address_space that has radix tree nodes on the LRU.
+ * address_space that has nodes on the LRU.
*
* We can then safely transition to the mapping->pages.xa_lock to
* pin only the address_space of the particular node we want
* to reclaim, take the node off-LRU, and drop the lru_lock.
*/

- node = container_of(item, struct radix_tree_node, private_list);
+ node = container_of(item, struct xa_node, private_list);
mapping = container_of(node->root, struct address_space, pages);

/* Coming from the list, invert the lock order */
@@ -449,25 +449,17 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
goto out_invalid;
if (WARN_ON_ONCE(node->count != node->exceptional))
goto out_invalid;
- for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
- if (node->slots[i]) {
- if (WARN_ON_ONCE(!xa_is_value(node->slots[i])))
- goto out_invalid;
- if (WARN_ON_ONCE(!node->exceptional))
- goto out_invalid;
- if (WARN_ON_ONCE(!mapping->nrexceptional))
- goto out_invalid;
- node->slots[i] = NULL;
- node->exceptional--;
- node->count--;
- mapping->nrexceptional--;
- }
- }
- if (WARN_ON_ONCE(node->exceptional))
- goto out_invalid;
+ mapping->nrexceptional -= node->exceptional;
+ xas.xa = node->root;
+ xas.xa_node = node->parent;
+ xas.xa_offset = node->offset;
+ xas.xa_update = workingset_update_node;
+ /*
+ * We could store a shadow entry here which was the minimum of the
+ * shadow entries we were tracking ...
+ */
+ xas_store(&xas, NULL);
inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
- __radix_tree_delete_node(&mapping->pages, node,
- workingset_lookup_update(mapping));

out_invalid:
xa_unlock(&mapping->pages);
--
2.15.0

2017-12-06 01:01:21

by Matthew Wilcox

Subject: [PATCH v4 03/73] page cache: Use xa_lock

From: Matthew Wilcox <[email protected]>

Remove the address_space ->tree_lock and use the xa_lock newly added to
the radix_tree_root. Rename the address_space ->page_tree to ->pages,
since we don't really care that it's a tree. Take the opportunity to
rearrange the elements of address_space to pack them better on 64-bit,
and make the comments more useful.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/cgroup-v1/memory.txt | 2 +-
Documentation/vm/page_migration | 14 ++--
arch/arm/include/asm/cacheflush.h | 6 +-
arch/nios2/include/asm/cacheflush.h | 6 +-
arch/parisc/include/asm/cacheflush.h | 6 +-
drivers/staging/lustre/lustre/llite/glimpse.c | 2 +-
drivers/staging/lustre/lustre/mdc/mdc_request.c | 8 +-
fs/afs/write.c | 2 +-
fs/btrfs/compression.c | 2 +-
fs/btrfs/extent_io.c | 8 +-
fs/btrfs/inode.c | 2 +-
fs/buffer.c | 10 +--
fs/cifs/file.c | 2 +-
fs/dax.c | 106 ++++++++++++------------
fs/f2fs/data.c | 6 +-
fs/f2fs/dir.c | 6 +-
fs/f2fs/inline.c | 6 +-
fs/f2fs/node.c | 8 +-
fs/fs-writeback.c | 18 ++--
fs/inode.c | 11 ++-
fs/nilfs2/btnode.c | 20 ++---
fs/nilfs2/page.c | 22 ++---
include/linux/backing-dev.h | 12 +--
include/linux/fs.h | 17 ++--
include/linux/mm.h | 2 +-
include/linux/pagemap.h | 4 +-
mm/filemap.c | 83 +++++++++----------
mm/huge_memory.c | 10 +--
mm/khugepaged.c | 49 +++++------
mm/memcontrol.c | 2 +-
mm/migrate.c | 31 ++++---
mm/page-writeback.c | 42 +++++-----
mm/readahead.c | 2 +-
mm/rmap.c | 4 +-
mm/shmem.c | 60 +++++++-------
mm/swap_state.c | 17 ++--
mm/truncate.c | 22 ++---
mm/vmscan.c | 12 +--
mm/workingset.c | 22 ++---
39 files changed, 322 insertions(+), 342 deletions(-)

diff --git a/Documentation/cgroup-v1/memory.txt b/Documentation/cgroup-v1/memory.txt
index cefb63639070..1d17fb0405ef 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -262,7 +262,7 @@ When oom event notifier is registered, event will be delivered.
2.6 Locking

lock_page_cgroup()/unlock_page_cgroup() should not be called under
- mapping->tree_lock.
+ mapping xa_lock.

Other lock order is following:
PG_locked.
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index 0478ae2ad44a..faf849596a85 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -90,7 +90,7 @@ Steps:

1. Lock the page to be migrated

-2. Insure that writeback is complete.
+2. Ensure that writeback is complete.

3. Lock the new page that we want to move to. It is locked so that accesses to
this (not yet uptodate) page immediately lock while the move is in progress.
@@ -100,8 +100,8 @@ Steps:
mapcount is not zero then we do not migrate the page. All user space
processes that attempt to access the page will now wait on the page lock.

-5. The radix tree lock is taken. This will cause all processes trying
- to access the page via the mapping to block on the radix tree spinlock.
+5. The address space xa_lock is taken. This will cause all processes trying
+ to access the page via the mapping to block on the spinlock.

6. The refcount of the page is examined and we back out if references remain
otherwise we know that we are the only one referencing this page.
@@ -114,12 +114,12 @@ Steps:

9. The radix tree is changed to point to the new page.

-10. The reference count of the old page is dropped because the radix tree
+10. The reference count of the old page is dropped because the address space
reference is gone. A reference to the new page is established because
- the new page is referenced to by the radix tree.
+ the new page is referenced by the address space.

-11. The radix tree lock is dropped. With that lookups in the mapping
- become possible again. Processes will move from spinning on the tree_lock
+11. The address space xa_lock is dropped. With that lookups in the mapping
+ become possible again. Processes will move from spinning on the xa_lock
to sleeping on the locked new page.

12. The page contents are copied to the new page.
diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index 74504b154256..f4ead9a74b7d 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -318,10 +318,8 @@ static inline void flush_anon_page(struct vm_area_struct *vma,
#define ARCH_HAS_FLUSH_KERNEL_DCACHE_PAGE
extern void flush_kernel_dcache_page(struct page *);

-#define flush_dcache_mmap_lock(mapping) \
- spin_lock_irq(&(mapping)->tree_lock)
-#define flush_dcache_mmap_unlock(mapping) \
- spin_unlock_irq(&(mapping)->tree_lock)
+#define flush_dcache_mmap_lock(mapping) xa_lock_irq(&(mapping)->pages)
+#define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&(mapping)->pages)

#define flush_icache_user_range(vma,page,addr,len) \
flush_dcache_page(page)
diff --git a/arch/nios2/include/asm/cacheflush.h b/arch/nios2/include/asm/cacheflush.h
index 55e383c173f7..7a6eda381964 100644
--- a/arch/nios2/include/asm/cacheflush.h
+++ b/arch/nios2/include/asm/cacheflush.h
@@ -46,9 +46,7 @@ extern void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
extern void flush_dcache_range(unsigned long start, unsigned long end);
extern void invalidate_dcache_range(unsigned long start, unsigned long end);

-#define flush_dcache_mmap_lock(mapping) \
- spin_lock_irq(&(mapping)->tree_lock)
-#define flush_dcache_mmap_unlock(mapping) \
- spin_unlock_irq(&(mapping)->tree_lock)
+#define flush_dcache_mmap_lock(mapping) xa_lock_irq(&(mapping)->pages)
+#define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&(mapping)->pages)

#endif /* _ASM_NIOS2_CACHEFLUSH_H */
diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index 3742508cc534..b772dd320118 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -54,10 +54,8 @@ void invalidate_kernel_vmap_range(void *vaddr, int size);
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
extern void flush_dcache_page(struct page *page);

-#define flush_dcache_mmap_lock(mapping) \
- spin_lock_irq(&(mapping)->tree_lock)
-#define flush_dcache_mmap_unlock(mapping) \
- spin_unlock_irq(&(mapping)->tree_lock)
+#define flush_dcache_mmap_lock(mapping) xa_lock_irq(&(mapping)->pages)
+#define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&(mapping)->pages)

#define flush_icache_page(vma,page) do { \
flush_kernel_dcache_page(page); \
diff --git a/drivers/staging/lustre/lustre/llite/glimpse.c b/drivers/staging/lustre/lustre/llite/glimpse.c
index c43ac574274c..5f2843da911c 100644
--- a/drivers/staging/lustre/lustre/llite/glimpse.c
+++ b/drivers/staging/lustre/lustre/llite/glimpse.c
@@ -69,7 +69,7 @@ blkcnt_t dirty_cnt(struct inode *inode)
void *results[1];

if (inode->i_mapping)
- cnt += radix_tree_gang_lookup_tag(&inode->i_mapping->page_tree,
+ cnt += radix_tree_gang_lookup_tag(&inode->i_mapping->pages,
results, 0, 1,
PAGECACHE_TAG_DIRTY);
if (cnt == 0 && atomic_read(&vob->vob_mmap_cnt) > 0)
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 03e55bca4ada..45dcf9f958d4 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -937,14 +937,14 @@ static struct page *mdc_page_locate(struct address_space *mapping, __u64 *hash,
struct page *page;
int found;

- spin_lock_irq(&mapping->tree_lock);
- found = radix_tree_gang_lookup(&mapping->page_tree,
+ xa_lock_irq(&mapping->pages);
+ found = radix_tree_gang_lookup(&mapping->pages,
(void **)&page, offset, 1);
if (found > 0 && !radix_tree_exceptional_entry(page)) {
struct lu_dirpage *dp;

get_page(page);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
/*
* In contrast to find_lock_page() we are sure that directory
* page cannot be truncated (while DLM lock is held) and,
@@ -992,7 +992,7 @@ static struct page *mdc_page_locate(struct address_space *mapping, __u64 *hash,
page = ERR_PTR(-EIO);
}
} else {
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
page = NULL;
}
return page;
diff --git a/fs/afs/write.c b/fs/afs/write.c
index cb5f8a3df577..9a10c08e8cbd 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -568,7 +568,7 @@ static int afs_writepages_region(struct address_space *mapping,

_debug("wback %lx", page->index);

- /* at this point we hold neither mapping->tree_lock nor lock on
+ /* at this point we hold neither mapping xa_lock nor lock on
* the page itself: the page may be truncated or invalidated
* (changing page->mapping to NULL), or even swizzled back from
* swapper_space to tmpfs file mapping
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 5982c8a71f02..280717b26224 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -450,7 +450,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
break;

rcu_read_lock();
- page = radix_tree_lookup(&mapping->page_tree, pg_index);
+ page = radix_tree_lookup(&mapping->pages, pg_index);
rcu_read_unlock();
if (page && !radix_tree_exceptional_entry(page)) {
misses++;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 012d63870b99..94f734e7e66f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3966,7 +3966,7 @@ static int extent_write_cache_pages(struct address_space *mapping,

done_index = page->index;
/*
- * At this point we hold neither mapping->tree_lock nor
+ * At this point we hold neither mapping xa_lock nor
* lock on the page itself: the page may be truncated or
* invalidated (changing page->mapping to NULL), or even
* swizzled back from swapper_space to tmpfs file
@@ -5196,13 +5196,13 @@ void clear_extent_buffer_dirty(struct extent_buffer *eb)
WARN_ON(!PagePrivate(page));

clear_page_dirty_for_io(page);
- spin_lock_irq(&page->mapping->tree_lock);
+ xa_lock_irq(&page->mapping->pages);
if (!PageDirty(page)) {
- radix_tree_tag_clear(&page->mapping->page_tree,
+ radix_tree_tag_clear(&page->mapping->pages,
page_index(page),
PAGECACHE_TAG_DIRTY);
}
- spin_unlock_irq(&page->mapping->tree_lock);
+ xa_unlock_irq(&page->mapping->pages);
ClearPageError(page);
unlock_page(page);
}
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 993061f83067..4da872bafcf8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7541,7 +7541,7 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,

bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end)
{
- struct radix_tree_root *root = &inode->i_mapping->page_tree;
+ struct radix_tree_root *root = &inode->i_mapping->pages;
bool found = false;
void **pagep = NULL;
struct page *page = NULL;
diff --git a/fs/buffer.c b/fs/buffer.c
index fd894c2ae284..33c08624d45b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -195,7 +195,7 @@ EXPORT_SYMBOL(end_buffer_write_sync);
* Hack idea: for the blockdev mapping, i_bufferlist_lock contention
* may be quite high. This code could TryLock the page, and if that
* succeeds, there is no need to take private_lock. (But if
- * private_lock is contended then so is mapping->tree_lock).
+ * private_lock is contended then so is mapping xa_lock).
*/
static struct buffer_head *
__find_get_block_slow(struct block_device *bdev, sector_t block)
@@ -606,14 +606,14 @@ void __set_page_dirty(struct page *page, struct address_space *mapping,
{
unsigned long flags;

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->page_tree,
+ radix_tree_tag_set(&mapping->pages,
page_index(page), PAGECACHE_TAG_DIRTY);
}
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
}
EXPORT_SYMBOL_GPL(__set_page_dirty);

@@ -1102,7 +1102,7 @@ __getblk_slow(struct block_device *bdev, sector_t block,
* inode list.
*
* mark_buffer_dirty() is atomic. It takes bh->b_page->mapping->private_lock,
- * mapping->tree_lock and mapping->host->i_lock.
+ * mapping xa_lock and mapping->host->i_lock.
*/
void mark_buffer_dirty(struct buffer_head *bh)
{
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 3a85df2a9baf..ca7084825622 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -1987,7 +1987,7 @@ wdata_prepare_pages(struct cifs_writedata *wdata, unsigned int found_pages,
for (i = 0; i < found_pages; i++) {
page = wdata->pages[i];
/*
- * At this point we hold neither mapping->tree_lock nor
+ * At this point we hold neither mapping xa_lock nor
* lock on the page itself: the page may be truncated or
* invalidated (changing page->mapping to NULL), or even
* swizzled back from swapper_space to tmpfs file
diff --git a/fs/dax.c b/fs/dax.c
index 78b72c48374e..e743ff1f6240 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -158,7 +158,7 @@ static int wake_exceptional_entry_func(wait_queue_entry_t *wait, unsigned int mo
}

/*
- * We do not necessarily hold the mapping->tree_lock when we call this
+ * We do not necessarily hold the mapping xa_lock when we call this
* function so it is possible that 'entry' is no longer a valid item in the
* radix tree. This is okay because all we really need to do is to find the
* correct waitqueue where tasks might be waiting for that old 'entry' and
@@ -174,7 +174,7 @@ static void dax_wake_mapping_entry_waiter(struct address_space *mapping,

/*
* Checking for locked entry and prepare_to_wait_exclusive() happens
- * under mapping->tree_lock, ditto for entry handling in our callers.
+ * under mapping xa_lock, ditto for entry handling in our callers.
* So at this point all tasks that could have seen our entry locked
* must be in the waitqueue and the following check will see them.
*/
@@ -184,40 +184,40 @@ static void dax_wake_mapping_entry_waiter(struct address_space *mapping,

/*
* Check whether the given slot is locked. The function must be called with
- * mapping->tree_lock held
+ * mapping xa_lock held
*/
static inline int slot_locked(struct address_space *mapping, void **slot)
{
unsigned long entry = (unsigned long)
- radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+ radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);
return entry & RADIX_DAX_ENTRY_LOCK;
}

/*
* Mark the given slot is locked. The function must be called with
- * mapping->tree_lock held
+ * mapping xa_lock held
*/
static inline void *lock_slot(struct address_space *mapping, void **slot)
{
unsigned long entry = (unsigned long)
- radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+ radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);

entry |= RADIX_DAX_ENTRY_LOCK;
- radix_tree_replace_slot(&mapping->page_tree, slot, (void *)entry);
+ radix_tree_replace_slot(&mapping->pages, slot, (void *)entry);
return (void *)entry;
}

/*
* Mark the given slot is unlocked. The function must be called with
- * mapping->tree_lock held
+ * mapping xa_lock held
*/
static inline void *unlock_slot(struct address_space *mapping, void **slot)
{
unsigned long entry = (unsigned long)
- radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+ radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);

entry &= ~(unsigned long)RADIX_DAX_ENTRY_LOCK;
- radix_tree_replace_slot(&mapping->page_tree, slot, (void *)entry);
+ radix_tree_replace_slot(&mapping->pages, slot, (void *)entry);
return (void *)entry;
}

@@ -228,7 +228,7 @@ static inline void *unlock_slot(struct address_space *mapping, void **slot)
* put_locked_mapping_entry() when he locked the entry and now wants to
* unlock it.
*
- * The function must be called with mapping->tree_lock held.
+ * The function must be called with mapping xa_lock held.
*/
static void *get_unlocked_mapping_entry(struct address_space *mapping,
pgoff_t index, void ***slotp)
@@ -241,7 +241,7 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
ewait.wait.func = wake_exceptional_entry_func;

for (;;) {
- entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+ entry = __radix_tree_lookup(&mapping->pages, index, NULL,
&slot);
if (!entry ||
WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)) ||
@@ -254,10 +254,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
prepare_to_wait_exclusive(wq, &ewait.wait,
TASK_UNINTERRUPTIBLE);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
schedule();
finish_wait(wq, &ewait.wait);
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
}
}

@@ -266,15 +266,15 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,
{
void *entry, **slot;

- spin_lock_irq(&mapping->tree_lock);
- entry = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+ xa_lock_irq(&mapping->pages);
+ entry = __radix_tree_lookup(&mapping->pages, index, NULL, &slot);
if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
!slot_locked(mapping, slot))) {
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return;
}
unlock_slot(mapping, slot);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
}

@@ -331,7 +331,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
void *entry, **slot;

restart:
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
entry = get_unlocked_mapping_entry(mapping, index, &slot);

if (WARN_ON_ONCE(entry && !radix_tree_exceptional_entry(entry))) {
@@ -363,12 +363,12 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
if (pmd_downgrade) {
/*
* Make sure 'entry' remains valid while we drop
- * mapping->tree_lock.
+ * mapping xa_lock.
*/
entry = lock_slot(mapping, slot);
}

- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
/*
* Besides huge zero pages the only other thing that gets
* downgraded are empty entries which don't need to be
@@ -385,26 +385,26 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
put_locked_mapping_entry(mapping, index);
return ERR_PTR(err);
}
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);

if (!entry) {
/*
- * We needed to drop the page_tree lock while calling
+ * We needed to drop the xa_lock while calling
* radix_tree_preload() and we didn't have an entry to
* lock. See if another thread inserted an entry at
* our index during this time.
*/
- entry = __radix_tree_lookup(&mapping->page_tree, index,
+ entry = __radix_tree_lookup(&mapping->pages, index,
NULL, &slot);
if (entry) {
radix_tree_preload_end();
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
goto restart;
}
}

if (pmd_downgrade) {
- radix_tree_delete(&mapping->page_tree, index);
+ radix_tree_delete(&mapping->pages, index);
mapping->nrexceptional--;
dax_wake_mapping_entry_waiter(mapping, index, entry,
true);
@@ -412,11 +412,11 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,

entry = dax_radix_locked_entry(0, size_flag | RADIX_DAX_EMPTY);

- err = __radix_tree_insert(&mapping->page_tree, index,
+ err = __radix_tree_insert(&mapping->pages, index,
dax_radix_order(entry), entry);
radix_tree_preload_end();
if (err) {
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
/*
* Our insertion of a DAX entry failed, most likely
* because we were inserting a PMD entry and it
@@ -429,12 +429,12 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
}
/* Good, we have inserted empty locked entry into the tree. */
mapping->nrexceptional++;
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return entry;
}
entry = lock_slot(mapping, slot);
out_unlock:
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return entry;
}

@@ -443,22 +443,22 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
{
int ret = 0;
void *entry;
- struct radix_tree_root *page_tree = &mapping->page_tree;
+ struct radix_tree_root *pages = &mapping->pages;

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
entry = get_unlocked_mapping_entry(mapping, index, NULL);
if (!entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)))
goto out;
if (!trunc &&
- (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
- radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
+ (radix_tree_tag_get(pages, index, PAGECACHE_TAG_DIRTY) ||
+ radix_tree_tag_get(pages, index, PAGECACHE_TAG_TOWRITE)))
goto out;
- radix_tree_delete(page_tree, index);
+ radix_tree_delete(pages, index);
mapping->nrexceptional--;
ret = 1;
out:
put_unlocked_mapping_entry(mapping, index, entry);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return ret;
}
/*
@@ -528,7 +528,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
void *entry, sector_t sector,
unsigned long flags, bool dirty)
{
- struct radix_tree_root *page_tree = &mapping->page_tree;
+ struct radix_tree_root *pages = &mapping->pages;
void *new_entry;
pgoff_t index = vmf->pgoff;

@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
PAGE_SIZE, 0);
}

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
new_entry = dax_radix_locked_entry(sector, flags);

if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
@@ -562,17 +562,17 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
void **slot;
void *ret;

- ret = __radix_tree_lookup(page_tree, index, &node, &slot);
+ ret = __radix_tree_lookup(pages, index, &node, &slot);
WARN_ON_ONCE(ret != entry);
- __radix_tree_replace(page_tree, node, slot,
+ __radix_tree_replace(pages, node, slot,
new_entry, NULL);
entry = new_entry;
}

if (dirty)
- radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ radix_tree_tag_set(pages, index, PAGECACHE_TAG_DIRTY);

- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return entry;
}

@@ -663,7 +663,7 @@ static int dax_writeback_one(struct block_device *bdev,
struct dax_device *dax_dev, struct address_space *mapping,
pgoff_t index, void *entry)
{
- struct radix_tree_root *page_tree = &mapping->page_tree;
+ struct radix_tree_root *pages = &mapping->pages;
void *entry2, **slot, *kaddr;
long ret = 0, id;
sector_t sector;
@@ -678,7 +678,7 @@ static int dax_writeback_one(struct block_device *bdev,
if (WARN_ON(!radix_tree_exceptional_entry(entry)))
return -EIO;

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
/* Entry got punched out / reallocated? */
if (!entry2 || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry2)))
@@ -697,7 +697,7 @@ static int dax_writeback_one(struct block_device *bdev,
}

/* Another fsync thread may have already written back this entry */
- if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+ if (!radix_tree_tag_get(pages, index, PAGECACHE_TAG_TOWRITE))
goto put_unlocked;
/* Lock the entry to serialize with page faults */
entry = lock_slot(mapping, slot);
@@ -705,11 +705,11 @@ static int dax_writeback_one(struct block_device *bdev,
* We can clear the tag now but we have to be careful so that concurrent
* dax_writeback_one() calls for the same index cannot finish before we
* actually flush the caches. This is achieved as the calls will look
- * at the entry only under tree_lock and once they do that they will
+ * at the entry only under xa_lock and once they do that they will
* see the entry locked and wait for it to unlock.
*/
- radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
- spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_tag_clear(pages, index, PAGECACHE_TAG_TOWRITE);
+ xa_unlock_irq(&mapping->pages);

/*
* Even if dax_writeback_mapping_range() was given a wbc->range_start
@@ -727,7 +727,7 @@ static int dax_writeback_one(struct block_device *bdev,
goto dax_unlock;

/*
- * dax_direct_access() may sleep, so cannot hold tree_lock over
+ * dax_direct_access() may sleep, so cannot hold xa_lock over
* its invocation.
*/
ret = dax_direct_access(dax_dev, pgoff, size / PAGE_SIZE, &kaddr, &pfn);
@@ -747,9 +747,9 @@ static int dax_writeback_one(struct block_device *bdev,
* the pfn mappings are writeprotected and fault waits for mapping
* entry lock.
*/
- spin_lock_irq(&mapping->tree_lock);
- radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_DIRTY);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
+ radix_tree_tag_clear(pages, index, PAGECACHE_TAG_DIRTY);
+ xa_unlock_irq(&mapping->pages);
trace_dax_writeback_one(mapping->host, index, size >> PAGE_SHIFT);
dax_unlock:
dax_read_unlock(id);
@@ -758,7 +758,7 @@ static int dax_writeback_one(struct block_device *bdev,

put_unlocked:
put_unlocked_mapping_entry(mapping, index, entry2);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return ret;
}

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 516fa0d3ff9c..8f51ac47b77f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -2172,12 +2172,12 @@ void f2fs_set_page_dirty_nobuffers(struct page *page)
SetPageDirty(page);
spin_unlock(&mapping->private_lock);

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
WARN_ON_ONCE(!PageUptodate(page));
account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->page_tree,
+ radix_tree_tag_set(&mapping->pages,
page_index(page), PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
unlock_page_memcg(page);

__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
diff --git a/fs/f2fs/dir.c b/fs/f2fs/dir.c
index 2d98d877c09d..b5515ea6bb2f 100644
--- a/fs/f2fs/dir.c
+++ b/fs/f2fs/dir.c
@@ -739,10 +739,10 @@ void f2fs_delete_entry(struct f2fs_dir_entry *dentry, struct page *page,

if (bit_pos == NR_DENTRY_IN_BLOCK &&
!truncate_hole(dir, page->index, page->index + 1)) {
- spin_lock_irqsave(&mapping->tree_lock, flags);
- radix_tree_tag_clear(&mapping->page_tree, page_index(page),
+ xa_lock_irqsave(&mapping->pages, flags);
+ radix_tree_tag_clear(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);

clear_page_dirty_for_io(page);
ClearPagePrivate(page);
diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c
index 90e38d8ea688..7858b8e15f33 100644
--- a/fs/f2fs/inline.c
+++ b/fs/f2fs/inline.c
@@ -226,10 +226,10 @@ int f2fs_write_inline_data(struct inode *inode, struct page *page)
kunmap_atomic(src_addr);
set_page_dirty(dn.inode_page);

- spin_lock_irqsave(&mapping->tree_lock, flags);
- radix_tree_tag_clear(&mapping->page_tree, page_index(page),
+ xa_lock_irqsave(&mapping->pages, flags);
+ radix_tree_tag_clear(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);

set_inode_flag(inode, FI_APPEND_WRITE);
set_inode_flag(inode, FI_DATA_EXIST);
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index d3322752426f..6b64a3009d55 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -91,11 +91,11 @@ static void clear_node_page_dirty(struct page *page)
unsigned int long flags;

if (PageDirty(page)) {
- spin_lock_irqsave(&mapping->tree_lock, flags);
- radix_tree_tag_clear(&mapping->page_tree,
+ xa_lock_irqsave(&mapping->pages, flags);
+ radix_tree_tag_clear(&mapping->pages,
page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);

clear_page_dirty_for_io(page);
dec_page_count(F2FS_M_SB(mapping), F2FS_DIRTY_NODES);
@@ -1143,7 +1143,7 @@ void ra_node_page(struct f2fs_sb_info *sbi, nid_t nid)
f2fs_bug_on(sbi, check_nid_range(sbi, nid));

rcu_read_lock();
- apage = radix_tree_lookup(&NODE_MAPPING(sbi)->page_tree, nid);
+ apage = radix_tree_lookup(&NODE_MAPPING(sbi)->pages, nid);
rcu_read_unlock();
if (apage)
return;
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cea4836385b7..a3c2352507f6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -347,9 +347,9 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
* By the time control reaches here, RCU grace period has passed
* since I_WB_SWITCH assertion and all wb stat update transactions
* between unlocked_inode_to_wb_begin/end() are guaranteed to be
- * synchronizing against mapping->tree_lock.
+ * synchronizing against mapping xa_lock.
*
- * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock
+ * Grabbing old_wb->list_lock, inode->i_lock and mapping xa_lock
* gives us exclusion against all wb related operations on @inode
* including IO list manipulations and stat updates.
*/
@@ -361,7 +361,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING);
}
spin_lock(&inode->i_lock);
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);

/*
* Once I_FREEING is visible under i_lock, the eviction path owns
@@ -375,20 +375,20 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
* to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to
* pages actually under underwriteback.
*/
- radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+ radix_tree_for_each_tagged(slot, &mapping->pages, &iter, 0,
PAGECACHE_TAG_DIRTY) {
struct page *page = radix_tree_deref_slot_protected(slot,
- &mapping->tree_lock);
+ &mapping->pages.xa_lock);
if (likely(page) && PageDirty(page)) {
dec_wb_stat(old_wb, WB_RECLAIMABLE);
inc_wb_stat(new_wb, WB_RECLAIMABLE);
}
}

- radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0,
+ radix_tree_for_each_tagged(slot, &mapping->pages, &iter, 0,
PAGECACHE_TAG_WRITEBACK) {
struct page *page = radix_tree_deref_slot_protected(slot,
- &mapping->tree_lock);
+ &mapping->pages.xa_lock);
if (likely(page)) {
WARN_ON_ONCE(!PageWriteback(page));
dec_wb_stat(old_wb, WB_WRITEBACK);
@@ -430,7 +430,7 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
*/
smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH);

- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
spin_unlock(&inode->i_lock);
spin_unlock(&new_wb->list_lock);
spin_unlock(&old_wb->list_lock);
@@ -507,7 +507,7 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
/*
* In addition to synchronizing among switchers, I_WB_SWITCH tells
* the RCU protected stat update paths to grab the mapping's
- * tree_lock so that stat transfer can synchronize against them.
+ * xa_lock so that stat transfer can synchronize against them.
* Let's continue after I_WB_SWITCH is guaranteed to be visible.
*/
call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn);
diff --git a/fs/inode.c b/fs/inode.c
index 03102d6ef044..c7b00573c10d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -348,8 +348,7 @@ EXPORT_SYMBOL(inc_nlink);
void address_space_init_once(struct address_space *mapping)
{
memset(mapping, 0, sizeof(*mapping));
- INIT_RADIX_TREE(&mapping->page_tree, GFP_ATOMIC | __GFP_ACCOUNT);
- spin_lock_init(&mapping->tree_lock);
+ INIT_RADIX_TREE(&mapping->pages, GFP_ATOMIC | __GFP_ACCOUNT);
init_rwsem(&mapping->i_mmap_rwsem);
INIT_LIST_HEAD(&mapping->private_list);
spin_lock_init(&mapping->private_lock);
@@ -499,14 +498,14 @@ void clear_inode(struct inode *inode)
{
might_sleep();
/*
- * We have to cycle tree_lock here because reclaim can be still in the
+ * We have to cycle the xa_lock here because reclaim can be in the
* process of removing the last page (in __delete_from_page_cache())
- * and we must not free mapping under it.
+ * and we must not free the mapping under it.
*/
- spin_lock_irq(&inode->i_data.tree_lock);
+ xa_lock_irq(&inode->i_data.pages);
BUG_ON(inode->i_data.nrpages);
BUG_ON(inode->i_data.nrexceptional);
- spin_unlock_irq(&inode->i_data.tree_lock);
+ xa_unlock_irq(&inode->i_data.pages);
BUG_ON(!list_empty(&inode->i_data.private_list));
BUG_ON(!(inode->i_state & I_FREEING));
BUG_ON(inode->i_state & I_CLEAR);
diff --git a/fs/nilfs2/btnode.c b/fs/nilfs2/btnode.c
index c21e0b4454a6..9e2a00207436 100644
--- a/fs/nilfs2/btnode.c
+++ b/fs/nilfs2/btnode.c
@@ -193,9 +193,9 @@ int nilfs_btnode_prepare_change_key(struct address_space *btnc,
(unsigned long long)oldkey,
(unsigned long long)newkey);

- spin_lock_irq(&btnc->tree_lock);
- err = radix_tree_insert(&btnc->page_tree, newkey, obh->b_page);
- spin_unlock_irq(&btnc->tree_lock);
+ xa_lock_irq(&btnc->pages);
+ err = radix_tree_insert(&btnc->pages, newkey, obh->b_page);
+ xa_unlock_irq(&btnc->pages);
/*
* Note: page->index will not change to newkey until
* nilfs_btnode_commit_change_key() will be called.
@@ -251,11 +251,11 @@ void nilfs_btnode_commit_change_key(struct address_space *btnc,
(unsigned long long)newkey);
mark_buffer_dirty(obh);

- spin_lock_irq(&btnc->tree_lock);
- radix_tree_delete(&btnc->page_tree, oldkey);
- radix_tree_tag_set(&btnc->page_tree, newkey,
+ xa_lock_irq(&btnc->pages);
+ radix_tree_delete(&btnc->pages, oldkey);
+ radix_tree_tag_set(&btnc->pages, newkey,
PAGECACHE_TAG_DIRTY);
- spin_unlock_irq(&btnc->tree_lock);
+ xa_unlock_irq(&btnc->pages);

opage->index = obh->b_blocknr = newkey;
unlock_page(opage);
@@ -283,9 +283,9 @@ void nilfs_btnode_abort_change_key(struct address_space *btnc,
return;

if (nbh == NULL) { /* blocksize == pagesize */
- spin_lock_irq(&btnc->tree_lock);
- radix_tree_delete(&btnc->page_tree, newkey);
- spin_unlock_irq(&btnc->tree_lock);
+ xa_lock_irq(&btnc->pages);
+ radix_tree_delete(&btnc->pages, newkey);
+ xa_unlock_irq(&btnc->pages);
unlock_page(ctxt->bh->b_page);
} else
brelse(nbh);
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 68241512d7c1..1c6703efde9e 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -331,15 +331,15 @@ void nilfs_copy_back_pages(struct address_space *dmap,
struct page *page2;

/* move the page to the destination cache */
- spin_lock_irq(&smap->tree_lock);
- page2 = radix_tree_delete(&smap->page_tree, offset);
+ xa_lock_irq(&smap->pages);
+ page2 = radix_tree_delete(&smap->pages, offset);
WARN_ON(page2 != page);

smap->nrpages--;
- spin_unlock_irq(&smap->tree_lock);
+ xa_unlock_irq(&smap->pages);

- spin_lock_irq(&dmap->tree_lock);
- err = radix_tree_insert(&dmap->page_tree, offset, page);
+ xa_lock_irq(&dmap->pages);
+ err = radix_tree_insert(&dmap->pages, offset, page);
if (unlikely(err < 0)) {
WARN_ON(err == -EEXIST);
page->mapping = NULL;
@@ -348,11 +348,11 @@ void nilfs_copy_back_pages(struct address_space *dmap,
page->mapping = dmap;
dmap->nrpages++;
if (PageDirty(page))
- radix_tree_tag_set(&dmap->page_tree,
+ radix_tree_tag_set(&dmap->pages,
offset,
PAGECACHE_TAG_DIRTY);
}
- spin_unlock_irq(&dmap->tree_lock);
+ xa_unlock_irq(&dmap->pages);
}
unlock_page(page);
}
@@ -474,15 +474,15 @@ int __nilfs_clear_page_dirty(struct page *page)
struct address_space *mapping = page->mapping;

if (mapping) {
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
if (test_bit(PG_dirty, &page->flags)) {
- radix_tree_tag_clear(&mapping->page_tree,
+ radix_tree_tag_clear(&mapping->pages,
page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return clear_page_dirty_for_io(page);
}
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return 0;
}
return TestClearPageDirty(page);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index e54e7e0033eb..9038f6c1eeda 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -329,7 +329,7 @@ static inline bool inode_to_wb_is_valid(struct inode *inode)
* @inode: inode of interest
*
* Returns the wb @inode is currently associated with. The caller must be
- * holding either @inode->i_lock, @inode->i_mapping->tree_lock, or the
+ * holding either @inode->i_lock, @inode->i_mapping->pages.xa_lock, or the
* associated wb's list_lock.
*/
static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
@@ -337,7 +337,7 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
#ifdef CONFIG_LOCKDEP
WARN_ON_ONCE(debug_locks &&
(!lockdep_is_held(&inode->i_lock) &&
- !lockdep_is_held(&inode->i_mapping->tree_lock) &&
+ !xa_lock_held(&inode->i_mapping->pages) &&
!lockdep_is_held(&inode->i_wb->list_lock)));
#endif
return inode->i_wb;
@@ -349,7 +349,7 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
* @lockedp: temp bool output param, to be passed to the end function
*
* The caller wants to access the wb associated with @inode but isn't
- * holding inode->i_lock, mapping->tree_lock or wb->list_lock. This
+ * holding inode->i_lock, mapping->pages.xa_lock or wb->list_lock. This
* function determines the wb associated with @inode and ensures that the
* association doesn't change until the transaction is finished with
* unlocked_inode_to_wb_end().
@@ -370,10 +370,10 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
*lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH;

if (unlikely(*lockedp))
- spin_lock_irq(&inode->i_mapping->tree_lock);
+ xa_lock_irq(&inode->i_mapping->pages);

/*
- * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock.
+ * Protected by either !I_WB_SWITCH + rcu_read_lock() or xa_lock.
* inode_to_wb() will bark. Deref directly.
*/
return inode->i_wb;
@@ -387,7 +387,7 @@ unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp)
static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked)
{
if (unlikely(locked))
- spin_unlock_irq(&inode->i_mapping->tree_lock);
+ xa_unlock_irq(&inode->i_mapping->pages);

rcu_read_unlock();
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 511fbaabf624..c07169cfb44a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -13,6 +13,7 @@
#include <linux/list_lru.h>
#include <linux/llist.h>
#include <linux/radix-tree.h>
+#include <linux/xarray.h>
#include <linux/rbtree.h>
#include <linux/init.h>
#include <linux/pid.h>
@@ -390,23 +391,21 @@ int pagecache_write_end(struct file *, struct address_space *mapping,

struct address_space {
struct inode *host; /* owner: inode, block_device */
- struct radix_tree_root page_tree; /* radix tree of all pages */
- spinlock_t tree_lock; /* and lock protecting it */
+ struct radix_tree_root pages; /* cached pages */
+ gfp_t gfp_mask; /* for allocating pages */
atomic_t i_mmap_writable;/* count VM_SHARED mappings */
struct rb_root_cached i_mmap; /* tree of private and shared mappings */
struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */
- /* Protected by tree_lock together with the radix tree */
+ /* Protected by pages.xa_lock */
unsigned long nrpages; /* number of total pages */
- /* number of shadow or DAX exceptional entries */
- unsigned long nrexceptional;
+ unsigned long nrexceptional; /* shadow or DAX entries */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits */
+ errseq_t wb_err;
spinlock_t private_lock; /* for use by the address_space */
- gfp_t gfp_mask; /* implicit gfp mask for allocations */
- struct list_head private_list; /* for use by the address_space */
+ struct list_head private_list; /* ditto */
void *private_data; /* ditto */
- errseq_t wb_err;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
/*
* On most architectures that alignment is already the case; but
@@ -1977,7 +1976,7 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp)
*
* I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to
* synchronize competing switching instances and to tell
- * wb stat updates to grab mapping->tree_lock. See
+ * wb stat updates to grab mapping->pages.xa_lock. See
* inode_switch_wb_work_fn() for details.
*
* I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 96b1932380e9..fe1ee4313add 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -738,7 +738,7 @@ int finish_mkwrite_fault(struct vm_fault *vmf);
* refcount. The each user mapping also has a reference to the page.
*
* The pagecache pages are stored in a per-mapping radix tree, which is
- * rooted at mapping->page_tree, and indexed by offset.
+ * rooted at mapping->pages, and indexed by offset.
* Where 2.4 and early 2.6 kernels kept dirty/clean pages in per-address_space
* lists, we instead now tag pages as dirty/writeback in the radix tree.
*
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 34ce3ebf97d5..80a6149152d4 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -144,7 +144,7 @@ void release_pages(struct page **pages, int nr);
* 3. check the page is still in pagecache (if no, goto 1)
*
* Remove-side that cares about stability of _refcount (eg. reclaim) has the
- * following (with tree_lock held for write):
+ * following (with pages.xa_lock held):
* A. atomically check refcount is correct and set it to 0 (atomic_cmpxchg)
* B. remove page from pagecache
* C. free the page
@@ -157,7 +157,7 @@ void release_pages(struct page **pages, int nr);
*
* It is possible that between 1 and 2, the page is removed then the exact same
* page is inserted into the same position in pagecache. That's OK: the
- * old find_get_page using tree_lock could equally have run before or after
+ * old find_get_page using a lock could equally have run before or after
* such a re-insertion, depending on order that locks are granted.
*
* Lookups racing against pagecache insertion isn't a big problem: either 1
diff --git a/mm/filemap.c b/mm/filemap.c
index ee83baaf855d..5c8f22fe4e62 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -67,7 +67,7 @@
* ->i_mmap_rwsem (truncate_pagecache)
* ->private_lock (__free_pte->__set_page_dirty_buffers)
* ->swap_lock (exclusive_swap_page, others)
- * ->mapping->tree_lock
+ * ->mapping->pages.xa_lock
*
* ->i_mutex
* ->i_mmap_rwsem (truncate->unmap_mapping_range)
@@ -75,7 +75,7 @@
* ->mmap_sem
* ->i_mmap_rwsem
* ->page_table_lock or pte_lock (various, mainly in memory.c)
- * ->mapping->tree_lock (arch-dependent flush_dcache_mmap_lock)
+ * ->mapping->pages.xa_lock (arch-dependent flush_dcache_mmap_lock)
*
* ->mmap_sem
* ->lock_page (access_process_vm)
@@ -85,7 +85,7 @@
*
* bdi->wb.list_lock
* sb_lock (fs/fs-writeback.c)
- * ->mapping->tree_lock (__sync_single_inode)
+ * ->mapping->pages.xa_lock (__sync_single_inode)
*
* ->i_mmap_rwsem
* ->anon_vma.lock (vma_adjust)
@@ -96,11 +96,11 @@
* ->page_table_lock or pte_lock
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
- * ->tree_lock (try_to_unmap_one)
+ * ->pages.xa_lock (try_to_unmap_one)
* ->zone_lru_lock(zone) (follow_page->mark_page_accessed)
* ->zone_lru_lock(zone) (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
- * ->tree_lock (page_remove_rmap->set_page_dirty)
+ * ->pages.xa_lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
* ->inode->i_lock (page_remove_rmap->set_page_dirty)
* ->memcg->move_lock (page_remove_rmap->lock_page_memcg)
@@ -119,14 +119,14 @@ static int page_cache_tree_insert(struct address_space *mapping,
void **slot;
int error;

- error = __radix_tree_create(&mapping->page_tree, page->index, 0,
+ error = __radix_tree_create(&mapping->pages, page->index, 0,
&node, &slot);
if (error)
return error;
if (*slot) {
void *p;

- p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+ p = radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);
if (!radix_tree_exceptional_entry(p))
return -EEXIST;

@@ -134,7 +134,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
if (shadowp)
*shadowp = p;
}
- __radix_tree_replace(&mapping->page_tree, node, slot, page,
+ __radix_tree_replace(&mapping->pages, node, slot, page,
workingset_lookup_update(mapping));
mapping->nrpages++;
return 0;
@@ -156,13 +156,13 @@ static void page_cache_tree_delete(struct address_space *mapping,
struct radix_tree_node *node;
void **slot;

- __radix_tree_lookup(&mapping->page_tree, page->index + i,
+ __radix_tree_lookup(&mapping->pages, page->index + i,
&node, &slot);

VM_BUG_ON_PAGE(!node && nr != 1, page);

- radix_tree_clear_tags(&mapping->page_tree, node, slot);
- __radix_tree_replace(&mapping->page_tree, node, slot, shadow,
+ radix_tree_clear_tags(&mapping->pages, node, slot);
+ __radix_tree_replace(&mapping->pages, node, slot, shadow,
workingset_lookup_update(mapping));
}

@@ -254,7 +254,7 @@ static void unaccount_page_cache_page(struct address_space *mapping,
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's xa_lock.
*/
void __delete_from_page_cache(struct page *page, void *shadow)
{
@@ -297,9 +297,9 @@ void delete_from_page_cache(struct page *page)
unsigned long flags;

BUG_ON(!PageLocked(page));
- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
__delete_from_page_cache(page, NULL);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);

page_cache_free_page(mapping, page);
}
@@ -310,14 +310,14 @@ EXPORT_SYMBOL(delete_from_page_cache);
* @mapping: the mapping to which pages belong
* @pvec: pagevec with pages to delete
*
- * The function walks over mapping->page_tree and removes pages passed in @pvec
- * from the radix tree. The function expects @pvec to be sorted by page index.
- * It tolerates holes in @pvec (radix tree entries at those indices are not
+ * The function walks over mapping->pages and removes pages passed in @pvec
+ * from the mapping. The function expects @pvec to be sorted by page index.
+ * It tolerates holes in @pvec (mapping entries at those indices are not
* modified). The function expects only THP head pages to be present in the
- * @pvec and takes care to delete all corresponding tail pages from the radix
- * tree as well.
+ * @pvec and takes care to delete all corresponding tail pages from the
+ * mapping as well.
*
- * The function expects mapping->tree_lock to be held.
+ * The function expects xa_lock to be held.
*/
static void
page_cache_tree_delete_batch(struct address_space *mapping,
@@ -331,11 +331,11 @@ page_cache_tree_delete_batch(struct address_space *mapping,
pgoff_t start;

start = pvec->pages[0]->index;
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
if (i >= pagevec_count(pvec) && !tail_pages)
break;
page = radix_tree_deref_slot_protected(slot,
- &mapping->tree_lock);
+ &mapping->pages.xa_lock);
if (radix_tree_exceptional_entry(page))
continue;
if (!tail_pages) {
@@ -358,8 +358,8 @@ page_cache_tree_delete_batch(struct address_space *mapping,
} else {
tail_pages--;
}
- radix_tree_clear_tags(&mapping->page_tree, iter.node, slot);
- __radix_tree_replace(&mapping->page_tree, iter.node, slot, NULL,
+ radix_tree_clear_tags(&mapping->pages, iter.node, slot);
+ __radix_tree_replace(&mapping->pages, iter.node, slot, NULL,
workingset_lookup_update(mapping));
total_pages++;
}
@@ -375,14 +375,14 @@ void delete_from_page_cache_batch(struct address_space *mapping,
if (!pagevec_count(pvec))
return;

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
for (i = 0; i < pagevec_count(pvec); i++) {
trace_mm_filemap_delete_from_page_cache(pvec->pages[i]);

unaccount_page_cache_page(mapping, pvec->pages[i]);
}
page_cache_tree_delete_batch(mapping, pvec);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);

for (i = 0; i < pagevec_count(pvec); i++)
page_cache_free_page(mapping, pvec->pages[i]);
@@ -799,7 +799,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
new->mapping = mapping;
new->index = offset;

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
__delete_from_page_cache(old, NULL);
error = page_cache_tree_insert(mapping, new, NULL);
BUG_ON(error);
@@ -811,7 +811,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
__inc_node_page_state(new, NR_FILE_PAGES);
if (PageSwapBacked(new))
__inc_node_page_state(new, NR_SHMEM);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
mem_cgroup_migrate(old, new);
radix_tree_preload_end();
if (freepage)
@@ -853,7 +853,7 @@ static int __add_to_page_cache_locked(struct page *page,
page->mapping = mapping;
page->index = offset;

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
error = page_cache_tree_insert(mapping, page, shadowp);
radix_tree_preload_end();
if (unlikely(error))
@@ -862,7 +862,7 @@ static int __add_to_page_cache_locked(struct page *page,
/* hugetlb pages do not participate in page cache accounting. */
if (!huge)
__inc_node_page_state(page, NR_FILE_PAGES);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
if (!huge)
mem_cgroup_commit_charge(page, memcg, false, false);
trace_mm_filemap_add_to_page_cache(page);
@@ -870,7 +870,7 @@ static int __add_to_page_cache_locked(struct page *page,
err_insert:
page->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
@@ -1354,7 +1354,7 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
for (i = 0; i < max_scan; i++) {
struct page *page;

- page = radix_tree_lookup(&mapping->page_tree, index);
+ page = radix_tree_lookup(&mapping->pages, index);
if (!page || radix_tree_exceptional_entry(page))
break;
index++;
@@ -1395,7 +1395,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
for (i = 0; i < max_scan; i++) {
struct page *page;

- page = radix_tree_lookup(&mapping->page_tree, index);
+ page = radix_tree_lookup(&mapping->pages, index);
if (!page || radix_tree_exceptional_entry(page))
break;
index--;
@@ -1428,7 +1428,7 @@ struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
rcu_read_lock();
repeat:
page = NULL;
- pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+ pagep = radix_tree_lookup_slot(&mapping->pages, offset);
if (pagep) {
page = radix_tree_deref_slot(pagep);
if (unlikely(!page))
@@ -1634,7 +1634,7 @@ unsigned find_get_entries(struct address_space *mapping,
return 0;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
@@ -1711,7 +1711,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
return 0;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, *start) {
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, *start) {
struct page *head, *page;

if (iter.index > end)
@@ -1796,7 +1796,7 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
return 0;

rcu_read_lock();
- radix_tree_for_each_contig(slot, &mapping->page_tree, &iter, index) {
+ radix_tree_for_each_contig(slot, &mapping->pages, &iter, index) {
struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
@@ -1876,8 +1876,7 @@ unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
return 0;

rcu_read_lock();
- radix_tree_for_each_tagged(slot, &mapping->page_tree,
- &iter, *index, tag) {
+ radix_tree_for_each_tagged(slot, &mapping->pages, &iter, *index, tag) {
struct page *head, *page;

if (iter.index > end)
@@ -1970,8 +1969,7 @@ unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
return 0;

rcu_read_lock();
- radix_tree_for_each_tagged(slot, &mapping->page_tree,
- &iter, start, tag) {
+ radix_tree_for_each_tagged(slot, &mapping->pages, &iter, start, tag) {
struct page *head, *page;
repeat:
page = radix_tree_deref_slot(slot);
@@ -2625,8 +2623,7 @@ void filemap_map_pages(struct vm_fault *vmf,
struct page *head, *page;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
- start_pgoff) {
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start_pgoff) {
if (iter.index > end_pgoff)
break;
repeat:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f2f5e774902..28909c475ee5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2458,7 +2458,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
} else {
/* Additional pin to radix tree */
page_ref_add(head, 2);
- spin_unlock(&head->mapping->tree_lock);
+ xa_unlock(&head->mapping->pages);
}

spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
@@ -2666,15 +2666,15 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
if (mapping) {
void **pslot;

- spin_lock(&mapping->tree_lock);
- pslot = radix_tree_lookup_slot(&mapping->page_tree,
+ xa_lock(&mapping->pages);
+ pslot = radix_tree_lookup_slot(&mapping->pages,
page_index(head));
/*
* Check if the head page is present in radix tree.
* We assume all tail are present too, if head is there.
*/
if (radix_tree_deref_slot_protected(pslot,
- &mapping->tree_lock) != head)
+ &mapping->pages.xa_lock) != head)
goto fail;
}

@@ -2708,7 +2708,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
}
spin_unlock(&pgdata->split_queue_lock);
fail: if (mapping)
- spin_unlock(&mapping->tree_lock);
+ xa_unlock(&mapping->pages);
spin_unlock_irqrestore(zone_lru_lock(page_zone(head)), flags);
unfreeze_page(head);
ret = -EBUSY;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ea4ff259b671..cb4d199bf328 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1339,8 +1339,8 @@ static void collapse_shmem(struct mm_struct *mm,
*/

index = start;
- spin_lock_irq(&mapping->tree_lock);
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ xa_lock_irq(&mapping->pages);
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
int n = min(iter.index, end) - index;

/*
@@ -1353,7 +1353,7 @@ static void collapse_shmem(struct mm_struct *mm,
}
nr_none += n;
for (; index < min(iter.index, end); index++) {
- radix_tree_insert(&mapping->page_tree, index,
+ radix_tree_insert(&mapping->pages, index,
new_page + (index % HPAGE_PMD_NR));
}

@@ -1362,16 +1362,16 @@ static void collapse_shmem(struct mm_struct *mm,
break;

page = radix_tree_deref_slot_protected(slot,
- &mapping->tree_lock);
+ &mapping->pages.xa_lock);
if (radix_tree_exceptional_entry(page) || !PageUptodate(page)) {
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
/* swap in or instantiate fallocated page */
if (shmem_getpage(mapping->host, index, &page,
SGP_NOHUGE)) {
result = SCAN_FAIL;
goto tree_unlocked;
}
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
} else if (trylock_page(page)) {
get_page(page);
} else {
@@ -1380,7 +1380,7 @@ static void collapse_shmem(struct mm_struct *mm,
}

/*
- * The page must be locked, so we can drop the tree_lock
+ * The page must be locked, so we can drop the xa_lock
* without racing with truncate.
*/
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1391,7 +1391,7 @@ static void collapse_shmem(struct mm_struct *mm,
result = SCAN_TRUNCATED;
goto out_unlock;
}
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);

if (isolate_lru_page(page)) {
result = SCAN_DEL_PAGE_LRU;
@@ -1402,11 +1402,11 @@ static void collapse_shmem(struct mm_struct *mm,
unmap_mapping_range(mapping, index << PAGE_SHIFT,
PAGE_SIZE, 0);

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);

- slot = radix_tree_lookup_slot(&mapping->page_tree, index);
+ slot = radix_tree_lookup_slot(&mapping->pages, index);
VM_BUG_ON_PAGE(page != radix_tree_deref_slot_protected(slot,
- &mapping->tree_lock), page);
+ &mapping->pages.xa_lock), page);
VM_BUG_ON_PAGE(page_mapped(page), page);

/*
@@ -1427,14 +1427,14 @@ static void collapse_shmem(struct mm_struct *mm,
list_add_tail(&page->lru, &pagelist);

/* Finally, replace with the new page. */
- radix_tree_replace_slot(&mapping->page_tree, slot,
+ radix_tree_replace_slot(&mapping->pages, slot,
new_page + (index % HPAGE_PMD_NR));

slot = radix_tree_iter_resume(slot, &iter);
index++;
continue;
out_lru:
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
putback_lru_page(page);
out_isolate_failed:
unlock_page(page);
@@ -1460,14 +1460,14 @@ static void collapse_shmem(struct mm_struct *mm,
}

for (; index < end; index++) {
- radix_tree_insert(&mapping->page_tree, index,
+ radix_tree_insert(&mapping->pages, index,
new_page + (index % HPAGE_PMD_NR));
}
nr_none += n;
}

tree_locked:
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
tree_unlocked:

if (result == SCAN_SUCCEED) {
@@ -1516,9 +1516,8 @@ static void collapse_shmem(struct mm_struct *mm,
} else {
/* Something went wrong: rollback changes to the radix-tree */
shmem_uncharge(mapping->host, nr_none);
- spin_lock_irq(&mapping->tree_lock);
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter,
- start) {
+ xa_lock_irq(&mapping->pages);
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
if (iter.index >= end)
break;
page = list_first_entry_or_null(&pagelist,
@@ -1528,8 +1527,7 @@ static void collapse_shmem(struct mm_struct *mm,
break;
nr_none--;
/* Put holes back where they were */
- radix_tree_delete(&mapping->page_tree,
- iter.index);
+ radix_tree_delete(&mapping->pages, iter.index);
continue;
}

@@ -1538,16 +1536,15 @@ static void collapse_shmem(struct mm_struct *mm,
/* Unfreeze the page. */
list_del(&page->lru);
page_ref_unfreeze(page, 2);
- radix_tree_replace_slot(&mapping->page_tree,
- slot, page);
+ radix_tree_replace_slot(&mapping->pages, slot, page);
slot = radix_tree_iter_resume(slot, &iter);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
putback_lru_page(page);
unlock_page(page);
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
}
VM_BUG_ON(nr_none);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);

/* Unfreeze new_page, caller would take care about freeing it */
page_ref_unfreeze(new_page, 1);
@@ -1575,7 +1572,7 @@ static void khugepaged_scan_shmem(struct mm_struct *mm,
swap = 0;
memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
if (iter.index >= start + HPAGE_PMD_NR)
break;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac2ffd5e02b9..16af05a3ec6e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6034,7 +6034,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)

/*
* Interrupts should be disabled here because the caller holds the
- * mapping->tree_lock lock which is taken with interrupts-off. It is
+ * mapping->pages xa_lock which is taken with interrupts-off. It is
* important here to have the interrupts disabled because it is the
* only synchronisation we have for udpating the per-CPU variables.
*/
diff --git a/mm/migrate.c b/mm/migrate.c
index 4d0be47a322a..59f18c571120 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -466,20 +466,20 @@ int migrate_page_move_mapping(struct address_space *mapping,
oldzone = page_zone(page);
newzone = page_zone(newpage);

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);

- pslot = radix_tree_lookup_slot(&mapping->page_tree,
+ pslot = radix_tree_lookup_slot(&mapping->pages,
page_index(page));

expected_count += 1 + page_has_private(page);
if (page_count(page) != expected_count ||
- radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
- spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_deref_slot_protected(pslot, &mapping->pages.xa_lock) != page) {
+ xa_unlock_irq(&mapping->pages);
return -EAGAIN;
}

if (!page_ref_freeze(page, expected_count)) {
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return -EAGAIN;
}

@@ -493,7 +493,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
if (mode == MIGRATE_ASYNC && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_ref_unfreeze(page, expected_count);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return -EAGAIN;
}

@@ -521,7 +521,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
SetPageDirty(newpage);
}

- radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
+ radix_tree_replace_slot(&mapping->pages, pslot, newpage);

/*
* Drop cache reference from old page by unfreezing
@@ -530,7 +530,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
*/
page_ref_unfreeze(page, expected_count - 1);

- spin_unlock(&mapping->tree_lock);
+ xa_unlock(&mapping->pages);
/* Leave irq disabled to prevent preemption while updating stats */

/*
@@ -573,20 +573,19 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
int expected_count;
void **pslot;

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);

- pslot = radix_tree_lookup_slot(&mapping->page_tree,
- page_index(page));
+ pslot = radix_tree_lookup_slot(&mapping->pages, page_index(page));

expected_count = 2 + page_has_private(page);
if (page_count(page) != expected_count ||
- radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
- spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_deref_slot_protected(pslot, &mapping->pages.xa_lock) != page) {
+ xa_unlock_irq(&mapping->pages);
return -EAGAIN;
}

if (!page_ref_freeze(page, expected_count)) {
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
return -EAGAIN;
}

@@ -595,11 +594,11 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,

get_page(newpage);

- radix_tree_replace_slot(&mapping->page_tree, pslot, newpage);
+ radix_tree_replace_slot(&mapping->pages, pslot, newpage);

page_ref_unfreeze(page, expected_count - 1);

- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);

return MIGRATEPAGE_SUCCESS;
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 586f31261c83..588ce729d199 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2099,7 +2099,7 @@ void __init page_writeback_init(void)
* so that it can tag pages faster than a dirtying process can create them).
*/
/*
- * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce tree_lock latency.
+ * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce xa_lock latency.
*/
void tag_pages_for_writeback(struct address_space *mapping,
pgoff_t start, pgoff_t end)
@@ -2109,22 +2109,22 @@ void tag_pages_for_writeback(struct address_space *mapping,
struct radix_tree_iter iter;
void **slot;

- spin_lock_irq(&mapping->tree_lock);
- radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, start,
+ xa_lock_irq(&mapping->pages);
+ radix_tree_for_each_tagged(slot, &mapping->pages, &iter, start,
PAGECACHE_TAG_DIRTY) {
if (iter.index > end)
break;
- radix_tree_iter_tag_set(&mapping->page_tree, &iter,
+ radix_tree_iter_tag_set(&mapping->pages, &iter,
PAGECACHE_TAG_TOWRITE);
tagged++;
if ((tagged % WRITEBACK_TAG_BATCH) != 0)
continue;
slot = radix_tree_iter_resume(slot, &iter);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
cond_resched();
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
}
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
}
EXPORT_SYMBOL(tag_pages_for_writeback);

@@ -2467,13 +2467,13 @@ int __set_page_dirty_nobuffers(struct page *page)
return 1;
}

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->page_tree, page_index(page),
+ radix_tree_tag_set(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
unlock_page_memcg(page);

if (mapping->host) {
@@ -2718,11 +2718,10 @@ int test_clear_page_writeback(struct page *page)
struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
ret = TestClearPageWriteback(page);
if (ret) {
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page),
+ radix_tree_tag_clear(&mapping->pages, page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
struct bdi_writeback *wb = inode_to_wb(inode);
@@ -2736,7 +2735,7 @@ int test_clear_page_writeback(struct page *page)
PAGECACHE_TAG_WRITEBACK))
sb_clear_inode_writeback(mapping->host);

- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
} else {
ret = TestClearPageWriteback(page);
}
@@ -2766,7 +2765,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
ret = TestSetPageWriteback(page);
if (!ret) {
bool on_wblist;
@@ -2774,8 +2773,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
on_wblist = mapping_tagged(mapping,
PAGECACHE_TAG_WRITEBACK);

- radix_tree_tag_set(&mapping->page_tree,
- page_index(page),
+ radix_tree_tag_set(&mapping->pages, page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
@@ -2789,14 +2787,12 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
sb_mark_inode_writeback(mapping->host);
}
if (!PageDirty(page))
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page),
+ radix_tree_tag_clear(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
if (!keep_write)
- radix_tree_tag_clear(&mapping->page_tree,
- page_index(page),
+ radix_tree_tag_clear(&mapping->pages, page_index(page),
PAGECACHE_TAG_TOWRITE);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
} else {
ret = TestSetPageWriteback(page);
}
@@ -2816,7 +2812,7 @@ EXPORT_SYMBOL(__test_set_page_writeback);
*/
int mapping_tagged(struct address_space *mapping, int tag)
{
- return radix_tree_tagged(&mapping->page_tree, tag);
+ return radix_tree_tagged(&mapping->pages, tag);
}
EXPORT_SYMBOL(mapping_tagged);

diff --git a/mm/readahead.c b/mm/readahead.c
index c4ca70239233..514188fd2489 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -175,7 +175,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
break;

rcu_read_lock();
- page = radix_tree_lookup(&mapping->page_tree, page_offset);
+ page = radix_tree_lookup(&mapping->pages, page_offset);
rcu_read_unlock();
if (page && !radix_tree_exceptional_entry(page))
continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index 47db27f8049e..87c1ca0cf1a3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -32,11 +32,11 @@
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
* mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
- * mapping->tree_lock (widely used)
+ * mapping->pages.xa_lock (widely used)
* inode->i_lock (in set_page_dirty's __mark_inode_dirty)
* bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
* sb_lock (within inode_lock in fs/fs-writeback.c)
- * mapping->tree_lock (widely used, in set_page_dirty,
+ * mapping->pages.xa_lock (widely used, in set_page_dirty,
* in arch-dependent flush_dcache_mmap_lock,
* within bdi.wb->list_lock in __sync_single_inode)
*
diff --git a/mm/shmem.c b/mm/shmem.c
index 7fbe67be86fa..9b1766e7c8cf 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -332,12 +332,12 @@ static int shmem_radix_tree_replace(struct address_space *mapping,

VM_BUG_ON(!expected);
VM_BUG_ON(!replacement);
- item = __radix_tree_lookup(&mapping->page_tree, index, &node, &pslot);
+ item = __radix_tree_lookup(&mapping->pages, index, &node, &pslot);
if (!item)
return -ENOENT;
if (item != expected)
return -ENOENT;
- __radix_tree_replace(&mapping->page_tree, node, pslot,
+ __radix_tree_replace(&mapping->pages, node, pslot,
replacement, NULL);
return 0;
}
@@ -355,7 +355,7 @@ static bool shmem_confirm_swap(struct address_space *mapping,
void *item;

rcu_read_lock();
- item = radix_tree_lookup(&mapping->page_tree, index);
+ item = radix_tree_lookup(&mapping->pages, index);
rcu_read_unlock();
return item == swp_to_radix_entry(swap);
}
@@ -581,14 +581,14 @@ static int shmem_add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = index;

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
if (PageTransHuge(page)) {
void __rcu **results;
pgoff_t idx;
int i;

error = 0;
- if (radix_tree_gang_lookup_slot(&mapping->page_tree,
+ if (radix_tree_gang_lookup_slot(&mapping->pages,
&results, &idx, index, 1) &&
idx < index + HPAGE_PMD_NR) {
error = -EEXIST;
@@ -596,14 +596,14 @@ static int shmem_add_to_page_cache(struct page *page,

if (!error) {
for (i = 0; i < HPAGE_PMD_NR; i++) {
- error = radix_tree_insert(&mapping->page_tree,
+ error = radix_tree_insert(&mapping->pages,
index + i, page + i);
VM_BUG_ON(error);
}
count_vm_event(THP_FILE_ALLOC);
}
} else if (!expected) {
- error = radix_tree_insert(&mapping->page_tree, index, page);
+ error = radix_tree_insert(&mapping->pages, index, page);
} else {
error = shmem_radix_tree_replace(mapping, index, expected,
page);
@@ -615,10 +615,10 @@ static int shmem_add_to_page_cache(struct page *page,
__inc_node_page_state(page, NR_SHMEM_THPS);
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
__mod_node_page_state(page_pgdat(page), NR_SHMEM, nr);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
} else {
page->mapping = NULL;
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
page_ref_sub(page, nr);
}
return error;
@@ -634,13 +634,13 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)

VM_BUG_ON_PAGE(PageCompound(page), page);

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
error = shmem_radix_tree_replace(mapping, page->index, page, radswap);
page->mapping = NULL;
mapping->nrpages--;
__dec_node_page_state(page, NR_FILE_PAGES);
__dec_node_page_state(page, NR_SHMEM);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
put_page(page);
BUG_ON(error);
}
@@ -653,9 +653,9 @@ static int shmem_free_swap(struct address_space *mapping,
{
void *old;

- spin_lock_irq(&mapping->tree_lock);
- old = radix_tree_delete_item(&mapping->page_tree, index, radswap);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
+ old = radix_tree_delete_item(&mapping->pages, index, radswap);
+ xa_unlock_irq(&mapping->pages);
if (old != radswap)
return -ENOENT;
free_swap_and_cache(radix_to_swp_entry(radswap));
@@ -666,7 +666,7 @@ static int shmem_free_swap(struct address_space *mapping,
* Determine (in bytes) how many of the shmem object's pages mapped by the
* given offsets are swapped out.
*
- * This is safe to call without i_mutex or mapping->tree_lock thanks to RCU,
+ * This is safe to call without i_mutex or mapping->pages.xa_lock thanks to RCU,
* as long as the inode doesn't go away and racy results are not a problem.
*/
unsigned long shmem_partial_swap_usage(struct address_space *mapping,
@@ -679,7 +679,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,

rcu_read_lock();

- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
if (iter.index >= end)
break;

@@ -708,7 +708,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
* Determine (in bytes) how many of the shmem object's pages mapped by the
* given vma is swapped out.
*
- * This is safe to call without i_mutex or mapping->tree_lock thanks to RCU,
+ * This is safe to call without i_mutex or mapping->pages.xa_lock thanks to RCU,
* as long as the inode doesn't go away and racy results are not a problem.
*/
unsigned long shmem_swap_usage(struct vm_area_struct *vma)
@@ -1123,7 +1123,7 @@ static int shmem_unuse_inode(struct shmem_inode_info *info,
int error = 0;

radswap = swp_to_radix_entry(swap);
- index = find_swap_entry(&mapping->page_tree, radswap);
+ index = find_swap_entry(&mapping->pages, radswap);
if (index == -1)
return -EAGAIN; /* tell shmem_unuse we found nothing */

@@ -1436,7 +1436,7 @@ static struct page *shmem_alloc_hugepage(gfp_t gfp,

hindex = round_down(index, HPAGE_PMD_NR);
rcu_read_lock();
- if (radix_tree_gang_lookup_slot(&mapping->page_tree, &results, &idx,
+ if (radix_tree_gang_lookup_slot(&mapping->pages, &results, &idx,
hindex, 1) && idx < hindex + HPAGE_PMD_NR) {
rcu_read_unlock();
return NULL;
@@ -1549,14 +1549,14 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
* Our caller will very soon move newpage out of swapcache, but it's
* a nice clean interface for us to replace oldpage by newpage there.
*/
- spin_lock_irq(&swap_mapping->tree_lock);
+ xa_lock_irq(&swap_mapping->pages);
error = shmem_radix_tree_replace(swap_mapping, swap_index, oldpage,
newpage);
if (!error) {
__inc_node_page_state(newpage, NR_FILE_PAGES);
__dec_node_page_state(oldpage, NR_FILE_PAGES);
}
- spin_unlock_irq(&swap_mapping->tree_lock);
+ xa_unlock_irq(&swap_mapping->pages);

if (unlikely(error)) {
/*
@@ -2622,7 +2622,7 @@ static void shmem_tag_pins(struct address_space *mapping)
start = 0;
rcu_read_lock();

- radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
page = radix_tree_deref_slot(slot);
if (!page || radix_tree_exception(page)) {
if (radix_tree_deref_retry(page)) {
@@ -2630,10 +2630,10 @@ static void shmem_tag_pins(struct address_space *mapping)
continue;
}
} else if (page_count(page) - page_mapcount(page) > 1) {
- spin_lock_irq(&mapping->tree_lock);
- radix_tree_tag_set(&mapping->page_tree, iter.index,
+ xa_lock_irq(&mapping->pages);
+ radix_tree_tag_set(&mapping->pages, iter.index,
SHMEM_TAG_PINNED);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
}

if (need_resched()) {
@@ -2665,7 +2665,7 @@ static int shmem_wait_for_pins(struct address_space *mapping)

error = 0;
for (scan = 0; scan <= LAST_SCAN; scan++) {
- if (!radix_tree_tagged(&mapping->page_tree, SHMEM_TAG_PINNED))
+ if (!radix_tree_tagged(&mapping->pages, SHMEM_TAG_PINNED))
break;

if (!scan)
@@ -2675,7 +2675,7 @@ static int shmem_wait_for_pins(struct address_space *mapping)

start = 0;
rcu_read_lock();
- radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
+ radix_tree_for_each_tagged(slot, &mapping->pages, &iter,
start, SHMEM_TAG_PINNED) {

page = radix_tree_deref_slot(slot);
@@ -2701,10 +2701,10 @@ static int shmem_wait_for_pins(struct address_space *mapping)
error = -EBUSY;
}

- spin_lock_irq(&mapping->tree_lock);
- radix_tree_tag_clear(&mapping->page_tree,
+ xa_lock_irq(&mapping->pages);
+ radix_tree_tag_clear(&mapping->pages,
iter.index, SHMEM_TAG_PINNED);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
continue_resched:
if (need_resched()) {
slot = radix_tree_iter_resume(slot, &iter);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 39ae7cfad90f..3f95e8fc4cb2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -124,10 +124,10 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
SetPageSwapCache(page);

address_space = swap_address_space(entry);
- spin_lock_irq(&address_space->tree_lock);
+ xa_lock_irq(&address_space->pages);
for (i = 0; i < nr; i++) {
set_page_private(page + i, entry.val + i);
- error = radix_tree_insert(&address_space->page_tree,
+ error = radix_tree_insert(&address_space->pages,
idx + i, page + i);
if (unlikely(error))
break;
@@ -145,13 +145,13 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
VM_BUG_ON(error == -EEXIST);
set_page_private(page + i, 0UL);
while (i--) {
- radix_tree_delete(&address_space->page_tree, idx + i);
+ radix_tree_delete(&address_space->pages, idx + i);
set_page_private(page + i, 0UL);
}
ClearPageSwapCache(page);
page_ref_sub(page, nr);
}
- spin_unlock_irq(&address_space->tree_lock);
+ xa_unlock_irq(&address_space->pages);

return error;
}
@@ -188,7 +188,7 @@ void __delete_from_swap_cache(struct page *page)
address_space = swap_address_space(entry);
idx = swp_offset(entry);
for (i = 0; i < nr; i++) {
- radix_tree_delete(&address_space->page_tree, idx + i);
+ radix_tree_delete(&address_space->pages, idx + i);
set_page_private(page + i, 0);
}
ClearPageSwapCache(page);
@@ -272,9 +272,9 @@ void delete_from_swap_cache(struct page *page)
entry.val = page_private(page);

address_space = swap_address_space(entry);
- spin_lock_irq(&address_space->tree_lock);
+ xa_lock_irq(&address_space->pages);
__delete_from_swap_cache(page);
- spin_unlock_irq(&address_space->tree_lock);
+ xa_unlock_irq(&address_space->pages);

put_swap_page(page, entry);
page_ref_sub(page, hpage_nr_pages(page));
@@ -612,12 +612,11 @@ int init_swap_address_space(unsigned int type, unsigned long nr_pages)
return -ENOMEM;
for (i = 0; i < nr; i++) {
space = spaces + i;
- INIT_RADIX_TREE(&space->page_tree, GFP_ATOMIC|__GFP_NOWARN);
+ INIT_RADIX_TREE(&space->pages, GFP_ATOMIC|__GFP_NOWARN);
atomic_set(&space->i_mmap_writable, 0);
space->a_ops = &swap_aops;
/* swap cache doesn't use writeback related tags */
mapping_set_no_writeback_tags(space);
- spin_lock_init(&space->tree_lock);
}
nr_swapper_spaces[type] = nr;
rcu_assign_pointer(swapper_spaces[type], spaces);
diff --git a/mm/truncate.c b/mm/truncate.c
index e4b4cf0f4070..094158f2e447 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -36,11 +36,11 @@ static inline void __clear_shadow_entry(struct address_space *mapping,
struct radix_tree_node *node;
void **slot;

- if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
+ if (!__radix_tree_lookup(&mapping->pages, index, &node, &slot))
return;
if (*slot != entry)
return;
- __radix_tree_replace(&mapping->page_tree, node, slot, NULL,
+ __radix_tree_replace(&mapping->pages, node, slot, NULL,
workingset_update_node);
mapping->nrexceptional--;
}
@@ -48,9 +48,9 @@ static inline void __clear_shadow_entry(struct address_space *mapping,
static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
void *entry)
{
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
__clear_shadow_entry(mapping, index, entry);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
}

/*
@@ -79,7 +79,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
dax = dax_mapping(mapping);
lock = !dax && indices[j] < end;
if (lock)
- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);

for (i = j; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
@@ -102,7 +102,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
}

if (lock)
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
pvec->nr = j;
}

@@ -522,8 +522,8 @@ void truncate_inode_pages_final(struct address_space *mapping)
* modification that does not see AS_EXITING is
* completed before starting the final truncate.
*/
- spin_lock_irq(&mapping->tree_lock);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
+ xa_unlock_irq(&mapping->pages);

truncate_inode_pages(mapping, 0);
}
@@ -631,13 +631,13 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
if (PageDirty(page))
goto failed;

BUG_ON(page_has_private(page));
__delete_from_page_cache(page, NULL);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);

if (mapping->a_ops->freepage)
mapping->a_ops->freepage(page);
@@ -645,7 +645,7 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
put_page(page); /* pagecache ref */
return 1;
failed:
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
return 0;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c02c850ea349..96316bd91f91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -674,7 +674,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));

- spin_lock_irqsave(&mapping->tree_lock, flags);
+ xa_lock_irqsave(&mapping->pages, flags);
/*
* The non racy check for a busy page.
*
@@ -698,7 +698,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
* load is not satisfied before that of page->_refcount.
*
* Note that if SetPageDirty is always performed via set_page_dirty,
- * and thus under tree_lock, then this ordering is not required.
+ * and thus under xa_lock, then this ordering is not required.
*/
if (unlikely(PageTransHuge(page)) && PageSwapCache(page))
refcount = 1 + HPAGE_PMD_NR;
@@ -716,7 +716,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
put_swap_page(page, swap);
} else {
void (*freepage)(struct page *);
@@ -737,13 +737,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
* only page cache pages found in these are zero pages
* covering holes, and because we don't want to mix DAX
* exceptional entries and shadow exceptional entries in the
- * same page_tree.
+ * same address_space.
*/
if (reclaimed && page_is_file_cache(page) &&
!mapping_exiting(mapping) && !dax_mapping(mapping))
shadow = workingset_eviction(mapping, page);
__delete_from_page_cache(page, shadow);
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);

if (freepage != NULL)
freepage(page);
@@ -752,7 +752,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
return 1;

cannot_free:
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ xa_unlock_irqrestore(&mapping->pages, flags);
return 0;
}

diff --git a/mm/workingset.c b/mm/workingset.c
index b7d616a3bbbe..2d071f0df3af 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -202,7 +202,7 @@ static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
* @mapping: address space the page was backing
* @page: the page being evicted
*
- * Returns a shadow entry to be stored in @mapping->page_tree in place
+ * Returns a shadow entry to be stored in @mapping->pages in place
* of the evicted @page so that a later refault can be detected.
*/
void *workingset_eviction(struct address_space *mapping, struct page *page)
@@ -348,7 +348,7 @@ void workingset_update_node(struct radix_tree_node *node)
*
* Avoid acquiring the list_lru lock when the nodes are
* already where they should be. The list_empty() test is safe
- * as node->private_list is protected by &mapping->tree_lock.
+ * as node->private_list is protected by &mapping->pages.xa_lock.
*/
if (node->count && node->count == node->exceptional) {
if (list_empty(&node->private_list))
@@ -366,7 +366,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
unsigned long nodes;
unsigned long cache;

- /* list_lru lock nests inside IRQ-safe mapping->tree_lock */
+ /* list_lru lock nests inside IRQ-safe mapping->pages.xa_lock */
local_irq_disable();
nodes = list_lru_shrink_count(&shadow_nodes, sc);
local_irq_enable();
@@ -419,21 +419,21 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,

/*
* Page cache insertions and deletions synchroneously maintain
- * the shadow node LRU under the mapping->tree_lock and the
+ * the shadow node LRU under the mapping->pages.xa_lock and the
* lru_lock. Because the page cache tree is emptied before
* the inode can be destroyed, holding the lru_lock pins any
* address_space that has radix tree nodes on the LRU.
*
- * We can then safely transition to the mapping->tree_lock to
+ * We can then safely transition to the mapping->pages.xa_lock to
* pin only the address_space of the particular node we want
* to reclaim, take the node off-LRU, and drop the lru_lock.
*/

node = container_of(item, struct radix_tree_node, private_list);
- mapping = container_of(node->root, struct address_space, page_tree);
+ mapping = container_of(node->root, struct address_space, pages);

/* Coming from the list, invert the lock order */
- if (!spin_trylock(&mapping->tree_lock)) {
+ if (!xa_trylock(&mapping->pages)) {
spin_unlock(lru_lock);
ret = LRU_RETRY;
goto out;
@@ -468,11 +468,11 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
if (WARN_ON_ONCE(node->exceptional))
goto out_invalid;
inc_lruvec_page_state(virt_to_page(node), WORKINGSET_NODERECLAIM);
- __radix_tree_delete_node(&mapping->page_tree, node,
+ __radix_tree_delete_node(&mapping->pages, node,
workingset_lookup_update(mapping));

out_invalid:
- spin_unlock(&mapping->tree_lock);
+ xa_unlock(&mapping->pages);
ret = LRU_REMOVED_RETRY;
out:
local_irq_enable();
@@ -487,7 +487,7 @@ static unsigned long scan_shadow_nodes(struct shrinker *shrinker,
{
unsigned long ret;

- /* list_lru lock nests inside IRQ-safe mapping->tree_lock */
+ /* list_lru lock nests inside IRQ-safe mapping->pages.xa_lock */
local_irq_disable();
ret = list_lru_shrink_walk(&shadow_nodes, sc, shadow_lru_isolate, NULL);
local_irq_enable();
@@ -503,7 +503,7 @@ static struct shrinker workingset_shadow_shrinker = {

/*
* Our list_lru->lock is IRQ-safe as it nests inside the IRQ-safe
- * mapping->tree_lock.
+ * mapping->pages.xa_lock.
*/
static struct lock_class_key shadow_nodes_key;

--
2.15.0

2017-12-06 01:01:27

by Matthew Wilcox

Subject: [PATCH v4 21/73] ida: Convert to XArray

From: Matthew Wilcox <[email protected]>

Use the XArray infrastructure in the same way we used the radix tree infrastructure.
This lets us get rid of idr_get_free() from the radix tree code.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/idr.h | 8 +-
include/linux/radix-tree.h | 4 -
lib/idr.c | 320 ++++++++++++++++++++++++++-------------------
lib/radix-tree.c | 126 ------------------
4 files changed, 187 insertions(+), 271 deletions(-)
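
The developer's notes added to lib/idr.c in this patch describe how an ID
maps onto the XArray: the index is the ID divided by the number of bits in
a leaf bitmap, and a leaf with only a few low bits set is stored directly
in the entry as a value. A minimal user-space sketch of that arithmetic
follows. It is not part of the patch. The sketch_* names are hypothetical,
the 1024-bit leaf size is inferred from the 128-byte bitmap mentioned in the
notes, and treating a value entry as holding one bit fewer than an unsigned
long is an assumption about BITS_PER_XA_VALUE.

```c
#include <assert.h>
#include <limits.h>

/* Assumed for this sketch: a 128-byte leaf bitmap holds 1024 bits. */
#define SKETCH_BITMAP_BITS 1024U
/* Assumed for this sketch: a value entry holds BITS_PER_LONG - 1 bits. */
#define SKETCH_BITS_PER_VALUE \
	((unsigned int)(sizeof(unsigned long) * CHAR_BIT - 1))

/* Position of an ID: which array slot it lives in, and which bit of it. */
struct sketch_pos {
	unsigned long index;
	unsigned int bit;
};

/* Split an ID the same way the patch does: divide and take the remainder. */
static struct sketch_pos sketch_split(unsigned int id)
{
	struct sketch_pos pos = {
		.index = id / SKETCH_BITMAP_BITS,
		.bit = id % SKETCH_BITMAP_BITS,
	};
	return pos;
}

/*
 * Find the first clear bit at or above @bit in a value entry.
 * Bits below @bit are masked as "in use" so the search skips them.
 * Returns SKETCH_BITS_PER_VALUE when no such bit fits in a value entry,
 * which is the case where a full bitmap would have to be allocated.
 */
static unsigned int sketch_first_free(unsigned long value, unsigned int bit)
{
	unsigned long tmp;

	if (bit >= SKETCH_BITS_PER_VALUE)
		return SKETCH_BITS_PER_VALUE;
	tmp = value | ((1UL << bit) - 1);
	for (; bit < SKETCH_BITS_PER_VALUE; bit++)
		if (!(tmp & (1UL << bit)))
			return bit;
	return SKETCH_BITS_PER_VALUE;
}
```

Under these assumptions, sketch_split(2050) yields index 2 and bit 2, and
sketch_first_free(0x7, 0) yields 3. The kernel code uses ffz() rather than
a loop; the loop here only keeps the sketch free of kernel helpers.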

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 06412fbaa65f..d6b3dbe483b8 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -245,11 +245,11 @@ struct ida_bitmap {
DECLARE_PER_CPU(struct ida_bitmap *, ida_bitmap);

struct ida {
- struct radix_tree_root ida_rt;
+ struct xarray ida_xa;
};

#define IDA_INIT(name) { \
- .ida_rt = RADIX_TREE_INIT(name, IDR_INIT_FLAGS | GFP_NOWAIT), \
+ .ida_xa = __XARRAY_INIT(name.ida_xa, IDR_INIT_FLAGS) \
}
#define DEFINE_IDA(name) struct ida name = IDA_INIT(name)

@@ -264,7 +264,7 @@ void ida_simple_remove(struct ida *ida, unsigned int id);

static inline void ida_init(struct ida *ida)
{
- INIT_RADIX_TREE(&ida->ida_rt, IDR_INIT_FLAGS | GFP_NOWAIT);
+ __xa_init(&ida->ida_xa, IDR_INIT_FLAGS);
}

/**
@@ -281,6 +281,6 @@ static inline int ida_get_new(struct ida *ida, int *p_id)

static inline bool ida_is_empty(const struct ida *ida)
{
- return radix_tree_empty(&ida->ida_rt);
+ return xa_empty(&ida->ida_xa);
}
#endif /* _LINUX_IDR_H */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index f46e3de57115..5d67939b8cd0 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -301,10 +301,6 @@ int radix_tree_split(struct radix_tree_root *, unsigned long index,
int radix_tree_join(struct radix_tree_root *, unsigned long index,
unsigned new_order, void *);

-void __rcu **idr_get_free(struct radix_tree_root *root,
- struct radix_tree_iter *iter, gfp_t gfp,
- unsigned long max);
-
enum {
RADIX_TREE_ITER_TAG_MASK = 0x0f, /* tag index in lower nybble */
RADIX_TREE_ITER_TAGGED = 0x10, /* lookup tagged slots */
diff --git a/lib/idr.c b/lib/idr.c
index e677d1869ead..eb145da485f2 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -6,7 +6,6 @@
#include <linux/xarray.h>

DEFINE_PER_CPU(struct ida_bitmap *, ida_bitmap);
-static DEFINE_SPINLOCK(simple_ida_lock);

/* In radix-tree.c temporarily */
extern bool idr_nomem(struct xa_state *, gfp_t);
@@ -307,26 +306,23 @@ EXPORT_SYMBOL_GPL(idr_replace);
/*
* Developer's notes:
*
- * The IDA uses the functionality provided by the IDR & radix tree to store
- * bitmaps in each entry. The XA_FREE_TAG tag means there is at least one bit
- * free, unlike the IDR where it means at least one entry is free.
- *
- * I considered telling the radix tree that each slot is an order-10 node
- * and storing the bit numbers in the radix tree, but the radix tree can't
- * allow a single multiorder entry at index 0, which would significantly
- * increase memory consumption for the IDA. So instead we divide the index
- * by the number of bits in the leaf bitmap before doing a radix tree lookup.
- *
- * As an optimisation, if there are only a few low bits set in any given
- * leaf, instead of allocating a 128-byte bitmap, we store the bits
+ * The IDA uses the functionality provided by the IDR & XArray to store
+ * bitmaps in each entry. The XA_FREE_TAG tag is used to mean that there
+ * is at least one bit free, unlike the IDR where it means at least one
+ * array entry is free.
+ *
+ * The XArray supports multi-index entries, so I considered teaching the
+ * XArray that each slot is an order-10 node and indexing the XArray by the
+ * ID. The XArray has the significant optimisation of storing the first
+ * entry in the struct xarray and avoiding allocating an xa_node.
+ * Unfortunately, it can't do that for multi-index entries.
+ * So instead the XArray index is the ID divided by the number of bits in
+ * the bitmap.
+ *
+ * As a further optimisation, if there are only a few low bits set in any
+ * given leaf, instead of allocating a 128-byte bitmap, we store the bits
* directly in the entry.
*
- * We allow the radix tree 'exceptional' count to get out of date. Nothing
- * in the IDA nor the radix tree code checks it. If it becomes important
- * to maintain an accurate exceptional count, switch the rcu_assign_pointer()
- * calls to radix_tree_iter_replace() which will correct the exceptional
- * count.
- *
* The IDA always requires a lock to alloc/free. If we add a 'test_bit'
* equivalent, it will still need locking. Going to RCU lookup would require
* using RCU to free bitmaps, and that's not trivial without embedding an
@@ -336,104 +332,114 @@ EXPORT_SYMBOL_GPL(idr_replace);

#define IDA_MAX (0x80000000U / IDA_BITMAP_BITS - 1)

+static struct ida_bitmap *alloc_ida_bitmap(void)
+{
+ struct ida_bitmap *bitmap = this_cpu_xchg(ida_bitmap, NULL);
+ if (bitmap)
+ memset(bitmap, 0, sizeof(*bitmap));
+ return bitmap;
+}
+
+static void free_ida_bitmap(struct ida_bitmap *bitmap)
+{
+ if (this_cpu_cmpxchg(ida_bitmap, NULL, bitmap))
+ kfree(bitmap);
+}
+
/**
* ida_get_new_above - allocate new ID above or equal to a start id
* @ida: ida handle
* @start: id to start search at
* @id: pointer to the allocated handle
*
- * Allocate new ID above or equal to @start. It should be called
- * with any required locks to ensure that concurrent calls to
- * ida_get_new_above() / ida_get_new() / ida_remove() are not allowed.
- * Consider using ida_simple_get() if you do not have complex locking
- * requirements.
+ * Allocate new ID above or equal to @start. The ida has its own lock,
+ * although you may wish to provide your own locking around it.
*
* If memory is required, it will return %-EAGAIN, you should unlock
* and go back to the ida_pre_get() call. If the ida is full, it will
* return %-ENOSPC. On success, it will return 0.
*
- * @id returns a value in the range @start ... %0x7fffffff.
+ * @id returns a value in the range @start ... %INT_MAX.
*/
int ida_get_new_above(struct ida *ida, int start, int *id)
{
- struct radix_tree_root *root = &ida->ida_rt;
- void __rcu **slot;
- struct radix_tree_iter iter;
+ unsigned long flags;
+ unsigned long index = start / IDA_BITMAP_BITS;
+ unsigned int bit = start % IDA_BITMAP_BITS;
+ XA_STATE(xas, &ida->ida_xa, index);
struct ida_bitmap *bitmap;
- unsigned long index;
- unsigned bit;
- int new;
-
- index = start / IDA_BITMAP_BITS;
- bit = start % IDA_BITMAP_BITS;
-
- slot = radix_tree_iter_init(&iter, index);
- for (;;) {
- if (slot)
- slot = radix_tree_next_slot(slot, &iter,
- RADIX_TREE_ITER_TAGGED);
- if (!slot) {
- slot = idr_get_free(root, &iter, GFP_NOWAIT, IDA_MAX);
- if (IS_ERR(slot)) {
- if (slot == ERR_PTR(-ENOMEM))
- return -EAGAIN;
- return PTR_ERR(slot);
- }
- }
- if (iter.index > index)
- bit = 0;
- new = iter.index * IDA_BITMAP_BITS;
- bitmap = rcu_dereference_raw(*slot);
- if (xa_is_value(bitmap)) {
- unsigned long tmp = xa_to_value(bitmap);
- int vbit = find_next_zero_bit(&tmp, BITS_PER_XA_VALUE,
- bit);
- if (vbit < BITS_PER_XA_VALUE) {
- tmp |= 1UL << vbit;
- rcu_assign_pointer(*slot, xa_mk_value(tmp));
- *id = new + vbit;
- return 0;
- }
- bitmap = this_cpu_xchg(ida_bitmap, NULL);
- if (!bitmap)
- return -EAGAIN;
- memset(bitmap, 0, sizeof(*bitmap));
- bitmap->bitmap[0] = tmp;
- rcu_assign_pointer(*slot, bitmap);
- }
+ unsigned int new;
+
+ xas_lock_irqsave(&xas, flags);
+retry:
+ bitmap = xas_find_tag(&xas, IDA_MAX, XA_FREE_TAG);
+ if (xas.xa_index > IDA_MAX)
+ goto nospc;
+ if (xas.xa_index > index)
+ bit = 0;
+ new = xas.xa_index * IDA_BITMAP_BITS;
+ if (xa_is_value(bitmap)) {
+ unsigned long value = xa_to_value(bitmap);
+ if (bit < BITS_PER_XA_VALUE) {
+ unsigned long tmp = value | ((1UL << bit) - 1);
+ bit = ffz(tmp);

- if (bitmap) {
- bit = find_next_zero_bit(bitmap->bitmap,
- IDA_BITMAP_BITS, bit);
- new += bit;
- if (new < 0)
- return -ENOSPC;
- if (bit == IDA_BITMAP_BITS)
- continue;
-
- __set_bit(bit, bitmap->bitmap);
- if (bitmap_full(bitmap->bitmap, IDA_BITMAP_BITS))
- radix_tree_iter_tag_clear(root, &iter,
- XA_FREE_TAG);
- } else {
- new += bit;
- if (new < 0)
- return -ENOSPC;
if (bit < BITS_PER_XA_VALUE) {
- bitmap = xa_mk_value(1UL << bit);
- } else {
- bitmap = this_cpu_xchg(ida_bitmap, NULL);
- if (!bitmap)
- return -EAGAIN;
- memset(bitmap, 0, sizeof(*bitmap));
- __set_bit(bit, bitmap->bitmap);
+ value |= (1UL << bit);
+ xas_store(&xas, xa_mk_value(value));
+ new += bit;
+ goto unlock;
}
- radix_tree_iter_replace(root, &iter, slot, bitmap);
}

- *id = new;
- return 0;
+ bitmap = alloc_ida_bitmap();
+ if (!bitmap)
+ goto nomem;
+ bitmap->bitmap[0] = value;
+ new += bit;
+ __set_bit(bit, bitmap->bitmap);
+ xas_store(&xas, bitmap);
+ if (xas_error(&xas))
+ free_ida_bitmap(bitmap);
+ } else if (bitmap) {
+ bit = find_next_zero_bit(bitmap->bitmap, IDA_BITMAP_BITS, bit);
+ if (bit == IDA_BITMAP_BITS)
+ goto retry;
+ new += bit;
+ if (new > INT_MAX)
+ goto nospc;
+ __set_bit(bit, bitmap->bitmap);
+ if (bitmap_full(bitmap->bitmap, IDA_BITMAP_BITS))
+ xas_clear_tag(&xas, XA_FREE_TAG);
+ } else if (bit < BITS_PER_XA_VALUE) {
+ new += bit;
+ bitmap = xa_mk_value(1UL << bit);
+ xas_store(&xas, bitmap);
+ } else {
+ bitmap = alloc_ida_bitmap();
+ if (!bitmap)
+ goto nomem;
+ new += bit;
+ __set_bit(bit, bitmap->bitmap);
+ xas_store(&xas, bitmap);
+ if (xas_error(&xas))
+ free_ida_bitmap(bitmap);
}
+
+ if (idr_nomem(&xas, GFP_NOWAIT))
+ goto retry;
+unlock:
+ xas_unlock_irqrestore(&xas, flags);
+ if (xas_error(&xas) == -ENOMEM)
+ return -EAGAIN;
+ *id = new;
+ return 0;
+nospc:
+ xas_unlock_irqrestore(&xas, flags);
+ return -ENOSPC;
+nomem:
+ xas_unlock_irqrestore(&xas, flags);
+ return -EAGAIN;
}
EXPORT_SYMBOL(ida_get_new_above);

@@ -441,45 +447,44 @@ EXPORT_SYMBOL(ida_get_new_above);
* ida_remove - Free the given ID
* @ida: ida handle
* @id: ID to free
- *
- * This function should not be called at the same time as ida_get_new_above().
*/
void ida_remove(struct ida *ida, int id)
{
+ unsigned long flags;
unsigned long index = id / IDA_BITMAP_BITS;
- unsigned offset = id % IDA_BITMAP_BITS;
+ unsigned bit = id % IDA_BITMAP_BITS;
+ XA_STATE(xas, &ida->ida_xa, index);
struct ida_bitmap *bitmap;
- unsigned long *btmp;
- struct radix_tree_iter iter;
- void __rcu **slot;

- slot = radix_tree_iter_lookup(&ida->ida_rt, &iter, index);
- if (!slot)
+ xas_lock_irqsave(&xas, flags);
+ bitmap = xas_load(&xas);
+ if (!bitmap)
goto err;
-
- bitmap = rcu_dereference_raw(*slot);
if (xa_is_value(bitmap)) {
- btmp = (unsigned long *)slot;
- offset += 1; /* Intimate knowledge of the xa_data encoding */
- if (offset >= BITS_PER_LONG)
+ unsigned long v = xa_to_value(bitmap);
+ if (bit >= BITS_PER_XA_VALUE)
goto err;
+ if (!(v & (1UL << bit)))
+ goto err;
+ v &= ~(1UL << bit);
+ if (v)
+ bitmap = xa_mk_value(v);
+ else
+ bitmap = NULL;
+ xas_store(&xas, bitmap);
} else {
- btmp = bitmap->bitmap;
- }
- if (!test_bit(offset, btmp))
- goto err;
-
- __clear_bit(offset, btmp);
- radix_tree_iter_tag_set(&ida->ida_rt, &iter, XA_FREE_TAG);
- if (xa_is_value(bitmap)) {
- if (xa_to_value(rcu_dereference_raw(*slot)) == 0)
- radix_tree_iter_delete(&ida->ida_rt, &iter, slot);
- } else if (bitmap_empty(btmp, IDA_BITMAP_BITS)) {
- kfree(bitmap);
- radix_tree_iter_delete(&ida->ida_rt, &iter, slot);
+ if (!__test_and_clear_bit(bit, bitmap->bitmap))
+ goto err;
+ if (bitmap_empty(bitmap->bitmap, IDA_BITMAP_BITS)) {
+ kfree(bitmap);
+ xas_store(&xas, NULL);
+ }
}
+ xas_set_tag(&xas, XA_FREE_TAG);
+ xas_unlock_irqrestore(&xas, flags);
return;
err:
+ xas_unlock_irqrestore(&xas, flags);
WARN(1, "ida_remove called for id=%d which is not allocated.\n", id);
}
EXPORT_SYMBOL(ida_remove);
@@ -489,21 +494,21 @@ EXPORT_SYMBOL(ida_remove);
* @ida: ida handle
*
* Calling this function releases all resources associated with an IDA. When
- * this call returns, the IDA is empty and can be reused or freed. The caller
- * should not allow ida_remove() or ida_get_new_above() to be called at the
- * same time.
+ * this call returns, the IDA is empty and can be reused or freed.
*/
void ida_destroy(struct ida *ida)
{
- struct radix_tree_iter iter;
- void __rcu **slot;
+ XA_STATE(xas, &ida->ida_xa, 0);
+ unsigned long flags;
+ struct ida_bitmap *bitmap;

- radix_tree_for_each_slot(slot, &ida->ida_rt, &iter, 0) {
- struct ida_bitmap *bitmap = rcu_dereference_raw(*slot);
+ xas_lock_irqsave(&xas, flags);
+ xas_for_each(&xas, bitmap, ULONG_MAX) {
if (!xa_is_value(bitmap))
kfree(bitmap);
- radix_tree_iter_delete(&ida->ida_rt, &iter, slot);
+ xas_store(&xas, NULL);
}
+ xas_unlock_irqrestore(&xas, flags);
}
EXPORT_SYMBOL(ida_destroy);

@@ -527,7 +532,6 @@ int ida_simple_get(struct ida *ida, unsigned int start, unsigned int end,
{
int ret, id;
unsigned int max;
- unsigned long flags;

BUG_ON((int)start < 0);
BUG_ON((int)end < 0);
@@ -543,7 +547,6 @@ int ida_simple_get(struct ida *ida, unsigned int start, unsigned int end,
if (!ida_pre_get(ida, gfp_mask))
return -ENOMEM;

- spin_lock_irqsave(&simple_ida_lock, flags);
ret = ida_get_new_above(ida, start, &id);
if (!ret) {
if (id > max) {
@@ -553,7 +556,6 @@ int ida_simple_get(struct ida *ida, unsigned int start, unsigned int end,
ret = id;
}
}
- spin_unlock_irqrestore(&simple_ida_lock, flags);

if (unlikely(ret == -EAGAIN))
goto again;
@@ -574,11 +576,55 @@ EXPORT_SYMBOL(ida_simple_get);
*/
void ida_simple_remove(struct ida *ida, unsigned int id)
{
- unsigned long flags;
-
BUG_ON((int)id < 0);
- spin_lock_irqsave(&simple_ida_lock, flags);
ida_remove(ida, id);
- spin_unlock_irqrestore(&simple_ida_lock, flags);
}
EXPORT_SYMBOL(ida_simple_remove);
+
+#ifdef XA_DEBUG
+static void dump_ida_node(void *entry, unsigned long index)
+{
+ unsigned long i;
+
+ if (!entry)
+ return;
+
+ if (xa_is_node(entry)) {
+ struct xa_node *node = xa_to_node(entry);
+ unsigned long first = index * IDA_BITMAP_BITS;
+ unsigned long last = first | ((((unsigned long)XA_CHUNK_SIZE *
+ IDA_BITMAP_BITS) << node->shift) - 1);
+
+ pr_debug("ida node: %p offset %d indices %lu-%lu parent %p free %lx shift %d count %d\n",
+ node, node->offset, first, last, node->parent,
+ node->tags[0][0], node->shift, node->count);
+ for (i = 0; i < XA_CHUNK_SIZE; i++)
+ dump_ida_node(node->slots[i],
+ index | (i << node->shift));
+ } else if (xa_is_value(entry)) {
+ pr_debug("ida excp: %p offset %d indices %lu-%lu data %lx\n",
+ entry, (int)(index & XA_CHUNK_MASK),
+ index * IDA_BITMAP_BITS,
+ index * IDA_BITMAP_BITS + BITS_PER_XA_VALUE,
+ xa_to_value(entry));
+ } else {
+ struct ida_bitmap *bitmap = entry;
+
+ pr_debug("ida btmp: %p offset %d indices %lu-%lu data", bitmap,
+ (int)(index & XA_CHUNK_MASK),
+ index * IDA_BITMAP_BITS,
+ (index + 1) * IDA_BITMAP_BITS - 1);
+ for (i = 0; i < IDA_BITMAP_LONGS; i++)
+ pr_cont(" %lx", bitmap->bitmap[i]);
+ pr_cont("\n");
+ }
+}
+
+void ida_dump(struct ida *ida)
+{
+ struct xarray *xa = &ida->ida_xa;
+ pr_debug("ida: %p node %p free %d\n", ida, xa->xa_head,
+ xa->xa_flags >> ROOT_TAG_SHIFT);
+ dump_ida_node(xa->xa_head, 0);
+}
+#endif
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 3b63d1ce7fda..ee135e369a72 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -247,61 +247,6 @@ static inline unsigned long node_maxindex(const struct radix_tree_node *node)
return shift_maxindex(node->shift);
}

-static unsigned long rnext_index(unsigned long index,
- const struct radix_tree_node *node,
- unsigned long offset)
-{
- return (index & ~node_maxindex(node)) + (offset << node->shift);
-}
-
-#ifndef __KERNEL__
-static void dump_ida_node(void *entry, unsigned long index)
-{
- unsigned long i;
-
- if (!entry)
- return;
-
- if (radix_tree_is_internal_node(entry)) {
- struct radix_tree_node *node = entry_to_node(entry);
-
- pr_debug("ida node: %p offset %d indices %lu-%lu parent %p free %lx shift %d count %d\n",
- node, node->offset, index * IDA_BITMAP_BITS,
- ((index | node_maxindex(node)) + 1) *
- IDA_BITMAP_BITS - 1,
- node->parent, node->tags[0][0], node->shift,
- node->count);
- for (i = 0; i < RADIX_TREE_MAP_SIZE; i++)
- dump_ida_node(node->slots[i],
- index | (i << node->shift));
- } else if (xa_is_value(entry)) {
- pr_debug("ida excp: %p offset %d indices %lu-%lu data %lx\n",
- entry, (int)(index & RADIX_TREE_MAP_MASK),
- index * IDA_BITMAP_BITS,
- index * IDA_BITMAP_BITS + BITS_PER_XA_VALUE,
- xa_to_value(entry));
- } else {
- struct ida_bitmap *bitmap = entry;
-
- pr_debug("ida btmp: %p offset %d indices %lu-%lu data", bitmap,
- (int)(index & RADIX_TREE_MAP_MASK),
- index * IDA_BITMAP_BITS,
- (index + 1) * IDA_BITMAP_BITS - 1);
- for (i = 0; i < IDA_BITMAP_LONGS; i++)
- pr_cont(" %lx", bitmap->bitmap[i]);
- pr_cont("\n");
- }
-}
-
-static void ida_dump(struct ida *ida)
-{
- struct radix_tree_root *root = &ida->ida_rt;
- pr_debug("ida: %p node %p free %d\n", ida, root->xa_head,
- root->xa_flags >> ROOT_TAG_SHIFT);
- dump_ida_node(root->xa_head, 0);
-}
-#endif
-
/*
* This assumes that the caller has performed appropriate preallocation, and
* that the caller has pinned this thread of control to the current CPU.
@@ -2083,77 +2028,6 @@ int ida_pre_get(struct ida *ida, gfp_t gfp)
}
EXPORT_SYMBOL(ida_pre_get);

-void __rcu **idr_get_free(struct radix_tree_root *root,
- struct radix_tree_iter *iter, gfp_t gfp,
- unsigned long max)
-{
- struct radix_tree_node *node = NULL, *child;
- void __rcu **slot = (void __rcu **)&root->xa_head;
- unsigned long maxindex, start = iter->next_index;
- unsigned int shift, offset = 0;
-
- grow:
- shift = radix_tree_load_root(root, &child, &maxindex);
- if (!radix_tree_tagged(root, XA_FREE_TAG))
- start = max(start, maxindex + 1);
- if (start > max)
- return ERR_PTR(-ENOSPC);
-
- if (start > maxindex) {
- int error = radix_tree_extend(root, gfp, start, shift);
- if (error < 0)
- return ERR_PTR(error);
- shift = error;
- child = rcu_dereference_raw(root->xa_head);
- }
-
- while (shift) {
- shift -= RADIX_TREE_MAP_SHIFT;
- if (child == NULL) {
- /* Have to add a child node. */
- child = radix_tree_node_alloc(gfp, node, root, shift,
- offset, 0, 0);
- if (!child)
- return ERR_PTR(-ENOMEM);
- all_tag_set(child, XA_FREE_TAG);
- rcu_assign_pointer(*slot, node_to_entry(child));
- if (node)
- node->count++;
- } else if (!radix_tree_is_internal_node(child))
- break;
-
- node = entry_to_node(child);
- offset = radix_tree_descend(node, &child, start);
- if (!rtag_get(node, XA_FREE_TAG, offset)) {
- offset = radix_tree_find_next_bit(node, XA_FREE_TAG,
- offset + 1);
- start = rnext_index(start, node, offset);
- if (start > max)
- return ERR_PTR(-ENOSPC);
- while (offset == RADIX_TREE_MAP_SIZE) {
- offset = node->offset + 1;
- node = node->parent;
- if (!node)
- goto grow;
- shift = node->shift;
- }
- child = rcu_dereference_raw(node->slots[offset]);
- }
- slot = &node->slots[offset];
- }
-
- iter->index = start;
- if (node)
- iter->next_index = 1 + min(max, (start | node_maxindex(node)));
- else
- iter->next_index = 1;
- iter->node = node;
- __set_iter_shift(iter, shift);
- set_iter_tags(iter, node, offset, XA_FREE_TAG);
-
- return slot;
-}
-
static void
radix_tree_node_ctor(void *arg)
{
--
2.15.0

2017-12-06 01:01:47

by Matthew Wilcox

Subject: [PATCH v4 01/73] xfs: Rename xa_ elements to ail_

From: Matthew Wilcox <[email protected]>

Rename the xa_ members of struct xfs_ail to use an ail_ prefix. This is
a mechanical rename, except that xa_ail becomes ail_head rather than
ail_ail.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/xfs/xfs_buf_item.c | 10 ++--
fs/xfs/xfs_dquot.c | 4 +-
fs/xfs/xfs_dquot_item.c | 11 ++--
fs/xfs/xfs_inode_item.c | 22 +++----
fs/xfs/xfs_log.c | 6 +-
fs/xfs/xfs_log_recover.c | 80 ++++++++++++-------------
fs/xfs/xfs_trans.c | 18 +++---
fs/xfs/xfs_trans_ail.c | 152 +++++++++++++++++++++++------------------------
fs/xfs/xfs_trans_buf.c | 4 +-
fs/xfs/xfs_trans_priv.h | 42 ++++++-------
10 files changed, 175 insertions(+), 174 deletions(-)
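
As a rough illustration of what the rename means for callers (this is not
part of the patch), the sketch below is a userspace stand-in for struct
xfs_ail. Only the field names ail_head, ail_lock and ail_log_flush come from
the patch. Everything else is an assumption made for illustration: struct
ail_sketch, struct list_node, the ail_sketch_*() helpers, and the C11
atomic_flag that plays the role of the kernel's spinlock_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct list_head. */
struct list_node {
	struct list_node *next, *prev;
};

/*
 * Userspace sketch of the renamed structure.  Field names follow the
 * patch (xa_ail -> ail_head, xa_lock -> ail_lock, xa_log_flush ->
 * ail_log_flush); the types are simplified stand-ins, not the kernel
 * definitions.
 */
struct ail_sketch {
	struct list_node ail_head;	/* was xa_ail */
	atomic_flag ail_lock;		/* was xa_lock; spinlock_t in XFS */
	int ail_log_flush;		/* was xa_log_flush */
};

/* Toy spinlock helpers standing in for spin_lock()/spin_unlock(). */
static void ail_sketch_lock(struct ail_sketch *ailp)
{
	while (atomic_flag_test_and_set(&ailp->ail_lock))
		;	/* spin until the flag was previously clear */
}

static void ail_sketch_unlock(struct ail_sketch *ailp)
{
	atomic_flag_clear(&ailp->ail_lock);
}

/* Make the list head point at itself (empty) and release the lock. */
static void ail_sketch_init(struct ail_sketch *ailp)
{
	ailp->ail_head.next = &ailp->ail_head;
	ailp->ail_head.prev = &ailp->ail_head;
	atomic_flag_clear(&ailp->ail_lock);
	ailp->ail_log_flush = 0;
}

/* Append an item at the tail of the list while holding ail_lock. */
static void ail_sketch_insert(struct ail_sketch *ailp, struct list_node *item)
{
	ail_sketch_lock(ailp);
	item->prev = ailp->ail_head.prev;
	item->next = &ailp->ail_head;
	ailp->ail_head.prev->next = item;
	ailp->ail_head.prev = item;
	ail_sketch_unlock(ailp);
}

/* The list is empty when the head points back at itself. */
static bool ail_sketch_is_empty(struct ail_sketch *ailp)
{
	bool empty;

	ail_sketch_lock(ailp);
	empty = (ailp->ail_head.next == &ailp->ail_head);
	ail_sketch_unlock(ailp);
	return empty;
}
```

The list head is the one member whose new name is not a plain prefix swap:
code that previously tested list_empty(&ailp->xa_ail) now tests
list_empty(&ailp->ail_head), which is what ail_sketch_is_empty() mimics.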

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index e0a0af0946f2..6c5035544a93 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -459,7 +459,7 @@ xfs_buf_item_unpin(
bp->b_fspriv = NULL;
bp->b_iodone = NULL;
} else {
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
xfs_trans_ail_delete(ailp, lip, SHUTDOWN_LOG_IO_ERROR);
xfs_buf_item_relse(bp);
ASSERT(bp->b_fspriv == NULL);
@@ -1056,13 +1056,13 @@ xfs_buf_do_callbacks_fail(
struct xfs_log_item *lip = bp->b_fspriv;
struct xfs_ail *ailp = lip->li_ailp;

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
for (; lip; lip = next) {
next = lip->li_bio_list;
if (lip->li_ops->iop_error)
lip->li_ops->iop_error(lip, bp);
}
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
}

static bool
@@ -1215,7 +1215,7 @@ xfs_buf_iodone(
*
* Either way, AIL is useless if we're forcing a shutdown.
*/
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
xfs_buf_item_free(BUF_ITEM(lip));
}
@@ -1236,7 +1236,7 @@ xfs_buf_resubmit_failed_buffers(
/*
* Clear XFS_LI_FAILED flag from all items before resubmit
*
- * XFS_LI_FAILED set/clear is protected by xa_lock, caller this
+ * XFS_LI_FAILED set/clear is protected by ail_lock, caller this
* function already have it acquired
*/
for (; lip; lip = next) {
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index f248708c10ff..e2a466df5dd1 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -974,7 +974,7 @@ xfs_qm_dqflush_done(
(lip->li_flags & XFS_LI_FAILED))) {

/* xfs_trans_ail_delete() drops the AIL lock. */
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
if (lip->li_lsn == qip->qli_flush_lsn) {
xfs_trans_ail_delete(ailp, lip, SHUTDOWN_CORRUPT_INCORE);
} else {
@@ -984,7 +984,7 @@ xfs_qm_dqflush_done(
*/
if (lip->li_flags & XFS_LI_FAILED)
xfs_clear_li_failed(lip);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
}
}

diff --git a/fs/xfs/xfs_dquot_item.c b/fs/xfs/xfs_dquot_item.c
index 664dea105e76..62637a226601 100644
--- a/fs/xfs/xfs_dquot_item.c
+++ b/fs/xfs/xfs_dquot_item.c
@@ -160,8 +160,9 @@ xfs_dquot_item_error(
STATIC uint
xfs_qm_dquot_logitem_push(
struct xfs_log_item *lip,
- struct list_head *buffer_list) __releases(&lip->li_ailp->xa_lock)
- __acquires(&lip->li_ailp->xa_lock)
+ struct list_head *buffer_list)
+ __releases(&lip->li_ailp->ail_lock)
+ __acquires(&lip->li_ailp->ail_lock)
{
struct xfs_dquot *dqp = DQUOT_ITEM(lip)->qli_dquot;
struct xfs_buf *bp = lip->li_buf;
@@ -208,7 +209,7 @@ xfs_qm_dquot_logitem_push(
goto out_unlock;
}

- spin_unlock(&lip->li_ailp->xa_lock);
+ spin_unlock(&lip->li_ailp->ail_lock);

error = xfs_qm_dqflush(dqp, &bp);
if (error) {
@@ -220,7 +221,7 @@ xfs_qm_dquot_logitem_push(
xfs_buf_relse(bp);
}

- spin_lock(&lip->li_ailp->xa_lock);
+ spin_lock(&lip->li_ailp->ail_lock);
out_unlock:
xfs_dqunlock(dqp);
return rval;
@@ -403,7 +404,7 @@ xfs_qm_qoffend_logitem_committed(
* Delete the qoff-start logitem from the AIL.
* xfs_trans_ail_delete() drops the AIL lock.
*/
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
xfs_trans_ail_delete(ailp, &qfs->qql_item, SHUTDOWN_LOG_IO_ERROR);

kmem_free(qfs->qql_item.li_lv_shadow);
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6ee5c3bf19ad..071acd4249a0 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -501,8 +501,8 @@ STATIC uint
xfs_inode_item_push(
struct xfs_log_item *lip,
struct list_head *buffer_list)
- __releases(&lip->li_ailp->xa_lock)
- __acquires(&lip->li_ailp->xa_lock)
+ __releases(&lip->li_ailp->ail_lock)
+ __acquires(&lip->li_ailp->ail_lock)
{
struct xfs_inode_log_item *iip = INODE_ITEM(lip);
struct xfs_inode *ip = iip->ili_inode;
@@ -561,7 +561,7 @@ xfs_inode_item_push(
ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));

- spin_unlock(&lip->li_ailp->xa_lock);
+ spin_unlock(&lip->li_ailp->ail_lock);

error = xfs_iflush(ip, &bp);
if (!error) {
@@ -570,7 +570,7 @@ xfs_inode_item_push(
xfs_buf_relse(bp);
}

- spin_lock(&lip->li_ailp->xa_lock);
+ spin_lock(&lip->li_ailp->ail_lock);
out_unlock:
xfs_iunlock(ip, XFS_ILOCK_SHARED);
return rval;
@@ -774,7 +774,7 @@ xfs_iflush_done(
bool mlip_changed = false;

/* this is an opencoded batch version of xfs_trans_ail_delete */
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
for (blip = lip; blip; blip = blip->li_bio_list) {
if (INODE_ITEM(blip)->ili_logged &&
blip->li_lsn == INODE_ITEM(blip)->ili_flush_lsn)
@@ -785,15 +785,15 @@ xfs_iflush_done(
}

if (mlip_changed) {
- if (!XFS_FORCED_SHUTDOWN(ailp->xa_mount))
- xlog_assign_tail_lsn_locked(ailp->xa_mount);
- if (list_empty(&ailp->xa_ail))
- wake_up_all(&ailp->xa_empty);
+ if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
+ xlog_assign_tail_lsn_locked(ailp->ail_mount);
+ if (list_empty(&ailp->ail_head))
+ wake_up_all(&ailp->ail_empty);
}
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

if (mlip_changed)
- xfs_log_space_wake(ailp->xa_mount);
+ xfs_log_space_wake(ailp->ail_mount);
}

/*
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index a503af96d780..7148625eebf2 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1148,7 +1148,7 @@ xlog_assign_tail_lsn_locked(
struct xfs_log_item *lip;
xfs_lsn_t tail_lsn;

- assert_spin_locked(&mp->m_ail->xa_lock);
+ assert_spin_locked(&mp->m_ail->ail_lock);

/*
* To make sure we always have a valid LSN for the log tail we keep
@@ -1171,9 +1171,9 @@ xlog_assign_tail_lsn(
{
xfs_lsn_t tail_lsn;

- spin_lock(&mp->m_ail->xa_lock);
+ spin_lock(&mp->m_ail->ail_lock);
tail_lsn = xlog_assign_tail_lsn_locked(mp);
- spin_unlock(&mp->m_ail->xa_lock);
+ spin_unlock(&mp->m_ail->ail_lock);

return tail_lsn;
}
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 28d1abfe835e..d871761626fb 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3424,7 +3424,7 @@ xlog_recover_efi_pass2(
}
atomic_set(&efip->efi_next_extent, efi_formatp->efi_nextents);

- spin_lock(&log->l_ailp->xa_lock);
+ spin_lock(&log->l_ailp->ail_lock);
/*
* The EFI has two references. One for the EFD and one for EFI to ensure
* it makes it into the AIL. Insert the EFI into the AIL directly and
@@ -3467,7 +3467,7 @@ xlog_recover_efd_pass2(
* Search for the EFI with the id in the EFD format structure in the
* AIL.
*/
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
while (lip != NULL) {
if (lip->li_type == XFS_LI_EFI) {
@@ -3477,9 +3477,9 @@ xlog_recover_efd_pass2(
* Drop the EFD reference to the EFI. This
* removes the EFI from the AIL and frees it.
*/
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_efi_release(efip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
break;
}
}
@@ -3487,7 +3487,7 @@ xlog_recover_efd_pass2(
}

xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

return 0;
}
@@ -3520,7 +3520,7 @@ xlog_recover_rui_pass2(
}
atomic_set(&ruip->rui_next_extent, rui_formatp->rui_nextents);

- spin_lock(&log->l_ailp->xa_lock);
+ spin_lock(&log->l_ailp->ail_lock);
/*
* The RUI has two references. One for the RUD and one for RUI to ensure
* it makes it into the AIL. Insert the RUI into the AIL directly and
@@ -3560,7 +3560,7 @@ xlog_recover_rud_pass2(
* Search for the RUI with the id in the RUD format structure in the
* AIL.
*/
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
while (lip != NULL) {
if (lip->li_type == XFS_LI_RUI) {
@@ -3570,9 +3570,9 @@ xlog_recover_rud_pass2(
* Drop the RUD reference to the RUI. This
* removes the RUI from the AIL and frees it.
*/
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_rui_release(ruip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
break;
}
}
@@ -3580,7 +3580,7 @@ xlog_recover_rud_pass2(
}

xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

return 0;
}
@@ -3636,7 +3636,7 @@ xlog_recover_cui_pass2(
}
atomic_set(&cuip->cui_next_extent, cui_formatp->cui_nextents);

- spin_lock(&log->l_ailp->xa_lock);
+ spin_lock(&log->l_ailp->ail_lock);
/*
* The CUI has two references. One for the CUD and one for CUI to ensure
* it makes it into the AIL. Insert the CUI into the AIL directly and
@@ -3677,7 +3677,7 @@ xlog_recover_cud_pass2(
* Search for the CUI with the id in the CUD format structure in the
* AIL.
*/
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
while (lip != NULL) {
if (lip->li_type == XFS_LI_CUI) {
@@ -3687,9 +3687,9 @@ xlog_recover_cud_pass2(
* Drop the CUD reference to the CUI. This
* removes the CUI from the AIL and frees it.
*/
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_cui_release(cuip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
break;
}
}
@@ -3697,7 +3697,7 @@ xlog_recover_cud_pass2(
}

xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

return 0;
}
@@ -3755,7 +3755,7 @@ xlog_recover_bui_pass2(
}
atomic_set(&buip->bui_next_extent, bui_formatp->bui_nextents);

- spin_lock(&log->l_ailp->xa_lock);
+ spin_lock(&log->l_ailp->ail_lock);
/*
* The RUI has two references. One for the RUD and one for RUI to ensure
* it makes it into the AIL. Insert the RUI into the AIL directly and
@@ -3796,7 +3796,7 @@ xlog_recover_bud_pass2(
* Search for the BUI with the id in the BUD format structure in the
* AIL.
*/
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
while (lip != NULL) {
if (lip->li_type == XFS_LI_BUI) {
@@ -3806,9 +3806,9 @@ xlog_recover_bud_pass2(
* Drop the BUD reference to the BUI. This
* removes the BUI from the AIL and frees it.
*/
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_bui_release(buip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
break;
}
}
@@ -3816,7 +3816,7 @@ xlog_recover_bud_pass2(
}

xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

return 0;
}
@@ -4649,9 +4649,9 @@ xlog_recover_process_efi(
if (test_bit(XFS_EFI_RECOVERED, &efip->efi_flags))
return 0;

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
error = xfs_efi_recover(mp, efip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);

return error;
}
@@ -4667,9 +4667,9 @@ xlog_recover_cancel_efi(

efip = container_of(lip, struct xfs_efi_log_item, efi_item);

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_efi_release(efip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
}

/* Recover the RUI if necessary. */
@@ -4689,9 +4689,9 @@ xlog_recover_process_rui(
if (test_bit(XFS_RUI_RECOVERED, &ruip->rui_flags))
return 0;

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
error = xfs_rui_recover(mp, ruip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);

return error;
}
@@ -4707,9 +4707,9 @@ xlog_recover_cancel_rui(

ruip = container_of(lip, struct xfs_rui_log_item, rui_item);

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_rui_release(ruip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
}

/* Recover the CUI if necessary. */
@@ -4730,9 +4730,9 @@ xlog_recover_process_cui(
if (test_bit(XFS_CUI_RECOVERED, &cuip->cui_flags))
return 0;

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
error = xfs_cui_recover(mp, cuip, dfops);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);

return error;
}
@@ -4748,9 +4748,9 @@ xlog_recover_cancel_cui(

cuip = container_of(lip, struct xfs_cui_log_item, cui_item);

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_cui_release(cuip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
}

/* Recover the BUI if necessary. */
@@ -4771,9 +4771,9 @@ xlog_recover_process_bui(
if (test_bit(XFS_BUI_RECOVERED, &buip->bui_flags))
return 0;

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
error = xfs_bui_recover(mp, buip, dfops);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);

return error;
}
@@ -4789,9 +4789,9 @@ xlog_recover_cancel_bui(

buip = container_of(lip, struct xfs_bui_log_item, bui_item);

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
xfs_bui_release(buip);
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
}

/* Is this log item a deferred action intent? */
@@ -4879,7 +4879,7 @@ xlog_recover_process_intents(
#endif

ailp = log->l_ailp;
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
#if defined(DEBUG) || defined(XFS_WARN)
last_lsn = xlog_assign_lsn(log->l_curr_cycle, log->l_curr_block);
@@ -4933,7 +4933,7 @@ xlog_recover_process_intents(
}
out:
xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
if (error)
xfs_defer_cancel(&dfops);
else
@@ -4956,7 +4956,7 @@ xlog_recover_cancel_intents(
struct xfs_ail *ailp;

ailp = log->l_ailp;
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_trans_ail_cursor_first(ailp, &cur, 0);
while (lip != NULL) {
/*
@@ -4990,7 +4990,7 @@ xlog_recover_cancel_intents(
}

xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
return error;
}

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index a87f657f59c9..756e01999c24 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -781,8 +781,8 @@ xfs_log_item_batch_insert(
{
int i;

- spin_lock(&ailp->xa_lock);
- /* xfs_trans_ail_update_bulk drops ailp->xa_lock */
+ spin_lock(&ailp->ail_lock);
+ /* xfs_trans_ail_update_bulk drops ailp->ail_lock */
xfs_trans_ail_update_bulk(ailp, cur, log_items, nr_items, commit_lsn);

for (i = 0; i < nr_items; i++) {
@@ -825,9 +825,9 @@ xfs_trans_committed_bulk(
struct xfs_ail_cursor cur;
int i = 0;

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
xfs_trans_ail_cursor_last(ailp, &cur, commit_lsn);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

/* unpin all the log items */
for (lv = log_vector; lv; lv = lv->lv_next ) {
@@ -847,7 +847,7 @@ xfs_trans_committed_bulk(
* object into the AIL as we are in a shutdown situation.
*/
if (aborted) {
- ASSERT(XFS_FORCED_SHUTDOWN(ailp->xa_mount));
+ ASSERT(XFS_FORCED_SHUTDOWN(ailp->ail_mount));
lip->li_ops->iop_unpin(lip, 1);
continue;
}
@@ -861,11 +861,11 @@ xfs_trans_committed_bulk(
* not affect the AIL cursor the bulk insert path is
* using.
*/
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
if (XFS_LSN_CMP(item_lsn, lip->li_lsn) > 0)
xfs_trans_ail_update(ailp, lip, item_lsn);
else
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
lip->li_ops->iop_unpin(lip, 0);
continue;
}
@@ -883,9 +883,9 @@ xfs_trans_committed_bulk(
if (i)
xfs_log_item_batch_insert(ailp, &cur, log_items, i, commit_lsn);

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
}

/*
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index cef89f7127d3..d4a2445215e6 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -40,7 +40,7 @@ xfs_ail_check(
{
xfs_log_item_t *prev_lip;

- if (list_empty(&ailp->xa_ail))
+ if (list_empty(&ailp->ail_head))
return;

/*
@@ -48,11 +48,11 @@ xfs_ail_check(
*/
ASSERT((lip->li_flags & XFS_LI_IN_AIL) != 0);
prev_lip = list_entry(lip->li_ail.prev, xfs_log_item_t, li_ail);
- if (&prev_lip->li_ail != &ailp->xa_ail)
+ if (&prev_lip->li_ail != &ailp->ail_head)
ASSERT(XFS_LSN_CMP(prev_lip->li_lsn, lip->li_lsn) <= 0);

prev_lip = list_entry(lip->li_ail.next, xfs_log_item_t, li_ail);
- if (&prev_lip->li_ail != &ailp->xa_ail)
+ if (&prev_lip->li_ail != &ailp->ail_head)
ASSERT(XFS_LSN_CMP(prev_lip->li_lsn, lip->li_lsn) >= 0);


@@ -69,10 +69,10 @@ static xfs_log_item_t *
xfs_ail_max(
struct xfs_ail *ailp)
{
- if (list_empty(&ailp->xa_ail))
+ if (list_empty(&ailp->ail_head))
return NULL;

- return list_entry(ailp->xa_ail.prev, xfs_log_item_t, li_ail);
+ return list_entry(ailp->ail_head.prev, xfs_log_item_t, li_ail);
}

/*
@@ -84,7 +84,7 @@ xfs_ail_next(
struct xfs_ail *ailp,
xfs_log_item_t *lip)
{
- if (lip->li_ail.next == &ailp->xa_ail)
+ if (lip->li_ail.next == &ailp->ail_head)
return NULL;

return list_first_entry(&lip->li_ail, xfs_log_item_t, li_ail);
@@ -105,11 +105,11 @@ xfs_ail_min_lsn(
xfs_lsn_t lsn = 0;
xfs_log_item_t *lip;

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_ail_min(ailp);
if (lip)
lsn = lip->li_lsn;
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

return lsn;
}
@@ -124,11 +124,11 @@ xfs_ail_max_lsn(
xfs_lsn_t lsn = 0;
xfs_log_item_t *lip;

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
lip = xfs_ail_max(ailp);
if (lip)
lsn = lip->li_lsn;
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

return lsn;
}
@@ -146,7 +146,7 @@ xfs_trans_ail_cursor_init(
struct xfs_ail_cursor *cur)
{
cur->item = NULL;
- list_add_tail(&cur->list, &ailp->xa_cursors);
+ list_add_tail(&cur->list, &ailp->ail_cursors);
}

/*
@@ -194,7 +194,7 @@ xfs_trans_ail_cursor_clear(
{
struct xfs_ail_cursor *cur;

- list_for_each_entry(cur, &ailp->xa_cursors, list) {
+ list_for_each_entry(cur, &ailp->ail_cursors, list) {
if (cur->item == lip)
cur->item = (struct xfs_log_item *)
((uintptr_t)cur->item | 1);
@@ -222,7 +222,7 @@ xfs_trans_ail_cursor_first(
goto out;
}

- list_for_each_entry(lip, &ailp->xa_ail, li_ail) {
+ list_for_each_entry(lip, &ailp->ail_head, li_ail) {
if (XFS_LSN_CMP(lip->li_lsn, lsn) >= 0)
goto out;
}
@@ -241,7 +241,7 @@ __xfs_trans_ail_cursor_last(
{
xfs_log_item_t *lip;

- list_for_each_entry_reverse(lip, &ailp->xa_ail, li_ail) {
+ list_for_each_entry_reverse(lip, &ailp->ail_head, li_ail) {
if (XFS_LSN_CMP(lip->li_lsn, lsn) <= 0)
return lip;
}
@@ -310,7 +310,7 @@ xfs_ail_splice(
if (lip)
list_splice(list, &lip->li_ail);
else
- list_splice(list, &ailp->xa_ail);
+ list_splice(list, &ailp->ail_head);
}

/*
@@ -335,17 +335,17 @@ xfsaild_push_item(
* If log item pinning is enabled, skip the push and track the item as
* pinned. This can help induce head-behind-tail conditions.
*/
- if (XFS_TEST_ERROR(false, ailp->xa_mount, XFS_ERRTAG_LOG_ITEM_PIN))
+ if (XFS_TEST_ERROR(false, ailp->ail_mount, XFS_ERRTAG_LOG_ITEM_PIN))
return XFS_ITEM_PINNED;

- return lip->li_ops->iop_push(lip, &ailp->xa_buf_list);
+ return lip->li_ops->iop_push(lip, &ailp->ail_buf_list);
}

static long
xfsaild_push(
struct xfs_ail *ailp)
{
- xfs_mount_t *mp = ailp->xa_mount;
+ xfs_mount_t *mp = ailp->ail_mount;
struct xfs_ail_cursor cur;
xfs_log_item_t *lip;
xfs_lsn_t lsn;
@@ -360,30 +360,30 @@ xfsaild_push(
* buffers the last time we ran, force the log first and wait for it
* before pushing again.
*/
- if (ailp->xa_log_flush && ailp->xa_last_pushed_lsn == 0 &&
- (!list_empty_careful(&ailp->xa_buf_list) ||
+ if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
+ (!list_empty_careful(&ailp->ail_buf_list) ||
xfs_ail_min_lsn(ailp))) {
- ailp->xa_log_flush = 0;
+ ailp->ail_log_flush = 0;

XFS_STATS_INC(mp, xs_push_ail_flush);
xfs_log_force(mp, XFS_LOG_SYNC);
}

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);

- /* barrier matches the xa_target update in xfs_ail_push() */
+ /* barrier matches the ail_target update in xfs_ail_push() */
smp_rmb();
- target = ailp->xa_target;
- ailp->xa_target_prev = target;
+ target = ailp->ail_target;
+ ailp->ail_target_prev = target;

- lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->xa_last_pushed_lsn);
+ lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->ail_last_pushed_lsn);
if (!lip) {
/*
* If the AIL is empty or our push has reached the end we are
* done now.
*/
xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
goto out_done;
}

@@ -404,7 +404,7 @@ xfsaild_push(
XFS_STATS_INC(mp, xs_push_ail_success);
trace_xfs_ail_push(lip);

- ailp->xa_last_pushed_lsn = lsn;
+ ailp->ail_last_pushed_lsn = lsn;
break;

case XFS_ITEM_FLUSHING:
@@ -423,7 +423,7 @@ xfsaild_push(
trace_xfs_ail_flushing(lip);

flushing++;
- ailp->xa_last_pushed_lsn = lsn;
+ ailp->ail_last_pushed_lsn = lsn;
break;

case XFS_ITEM_PINNED:
@@ -431,7 +431,7 @@ xfsaild_push(
trace_xfs_ail_pinned(lip);

stuck++;
- ailp->xa_log_flush++;
+ ailp->ail_log_flush++;
break;
case XFS_ITEM_LOCKED:
XFS_STATS_INC(mp, xs_push_ail_locked);
@@ -468,10 +468,10 @@ xfsaild_push(
lsn = lip->li_lsn;
}
xfs_trans_ail_cursor_done(&cur);
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

- if (xfs_buf_delwri_submit_nowait(&ailp->xa_buf_list))
- ailp->xa_log_flush++;
+ if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
+ ailp->ail_log_flush++;

if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
out_done:
@@ -481,7 +481,7 @@ xfsaild_push(
* AIL before we start the next scan from the start of the AIL.
*/
tout = 50;
- ailp->xa_last_pushed_lsn = 0;
+ ailp->ail_last_pushed_lsn = 0;
} else if (((stuck + flushing) * 100) / count > 90) {
/*
* Either there is a lot of contention on the AIL or we are
@@ -494,7 +494,7 @@ xfsaild_push(
* the restart to issue a log force to unpin the stuck items.
*/
tout = 20;
- ailp->xa_last_pushed_lsn = 0;
+ ailp->ail_last_pushed_lsn = 0;
} else {
/*
* Assume we have more work to do in a short while.
@@ -536,26 +536,26 @@ xfsaild(
break;
}

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);

/*
* Idle if the AIL is empty and we are not racing with a target
* update. We check the AIL after we set the task to a sleep
- * state to guarantee that we either catch an xa_target update
+ * state to guarantee that we either catch an ail_target update
* or that a wake_up resets the state to TASK_RUNNING.
* Otherwise, we run the risk of sleeping indefinitely.
*
- * The barrier matches the xa_target update in xfs_ail_push().
+ * The barrier matches the ail_target update in xfs_ail_push().
*/
smp_rmb();
if (!xfs_ail_min(ailp) &&
- ailp->xa_target == ailp->xa_target_prev) {
- spin_unlock(&ailp->xa_lock);
+ ailp->ail_target == ailp->ail_target_prev) {
+ spin_unlock(&ailp->ail_lock);
freezable_schedule();
tout = 0;
continue;
}
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

if (tout)
freezable_schedule_timeout(msecs_to_jiffies(tout));
@@ -592,8 +592,8 @@ xfs_ail_push(
xfs_log_item_t *lip;

lip = xfs_ail_min(ailp);
- if (!lip || XFS_FORCED_SHUTDOWN(ailp->xa_mount) ||
- XFS_LSN_CMP(threshold_lsn, ailp->xa_target) <= 0)
+ if (!lip || XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
+ XFS_LSN_CMP(threshold_lsn, ailp->ail_target) <= 0)
return;

/*
@@ -601,10 +601,10 @@ xfs_ail_push(
* the XFS_AIL_PUSHING_BIT.
*/
smp_wmb();
- xfs_trans_ail_copy_lsn(ailp, &ailp->xa_target, &threshold_lsn);
+ xfs_trans_ail_copy_lsn(ailp, &ailp->ail_target, &threshold_lsn);
smp_wmb();

- wake_up_process(ailp->xa_task);
+ wake_up_process(ailp->ail_task);
}

/*
@@ -630,18 +630,18 @@ xfs_ail_push_all_sync(
struct xfs_log_item *lip;
DEFINE_WAIT(wait);

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
while ((lip = xfs_ail_max(ailp)) != NULL) {
- prepare_to_wait(&ailp->xa_empty, &wait, TASK_UNINTERRUPTIBLE);
- ailp->xa_target = lip->li_lsn;
- wake_up_process(ailp->xa_task);
- spin_unlock(&ailp->xa_lock);
+ prepare_to_wait(&ailp->ail_empty, &wait, TASK_UNINTERRUPTIBLE);
+ ailp->ail_target = lip->li_lsn;
+ wake_up_process(ailp->ail_task);
+ spin_unlock(&ailp->ail_lock);
schedule();
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
}
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);

- finish_wait(&ailp->xa_empty, &wait);
+ finish_wait(&ailp->ail_empty, &wait);
}

/*
@@ -672,7 +672,7 @@ xfs_trans_ail_update_bulk(
struct xfs_ail_cursor *cur,
struct xfs_log_item **log_items,
int nr_items,
- xfs_lsn_t lsn) __releases(ailp->xa_lock)
+ xfs_lsn_t lsn) __releases(ailp->ail_lock)
{
xfs_log_item_t *mlip;
int mlip_changed = 0;
@@ -705,13 +705,13 @@ xfs_trans_ail_update_bulk(
xfs_ail_splice(ailp, cur, &tmp, lsn);

if (mlip_changed) {
- if (!XFS_FORCED_SHUTDOWN(ailp->xa_mount))
- xlog_assign_tail_lsn_locked(ailp->xa_mount);
- spin_unlock(&ailp->xa_lock);
+ if (!XFS_FORCED_SHUTDOWN(ailp->ail_mount))
+ xlog_assign_tail_lsn_locked(ailp->ail_mount);
+ spin_unlock(&ailp->ail_lock);

- xfs_log_space_wake(ailp->xa_mount);
+ xfs_log_space_wake(ailp->ail_mount);
} else {
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
}
}

@@ -756,13 +756,13 @@ void
xfs_trans_ail_delete(
struct xfs_ail *ailp,
struct xfs_log_item *lip,
- int shutdown_type) __releases(ailp->xa_lock)
+ int shutdown_type) __releases(ailp->ail_lock)
{
- struct xfs_mount *mp = ailp->xa_mount;
+ struct xfs_mount *mp = ailp->ail_mount;
bool mlip_changed;

if (!(lip->li_flags & XFS_LI_IN_AIL)) {
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
if (!XFS_FORCED_SHUTDOWN(mp)) {
xfs_alert_tag(mp, XFS_PTAG_AILDELETE,
"%s: attempting to delete a log item that is not in the AIL",
@@ -776,13 +776,13 @@ xfs_trans_ail_delete(
if (mlip_changed) {
if (!XFS_FORCED_SHUTDOWN(mp))
xlog_assign_tail_lsn_locked(mp);
- if (list_empty(&ailp->xa_ail))
- wake_up_all(&ailp->xa_empty);
+ if (list_empty(&ailp->ail_head))
+ wake_up_all(&ailp->ail_empty);
}

- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
if (mlip_changed)
- xfs_log_space_wake(ailp->xa_mount);
+ xfs_log_space_wake(ailp->ail_mount);
}

int
@@ -795,16 +795,16 @@ xfs_trans_ail_init(
if (!ailp)
return -ENOMEM;

- ailp->xa_mount = mp;
- INIT_LIST_HEAD(&ailp->xa_ail);
- INIT_LIST_HEAD(&ailp->xa_cursors);
- spin_lock_init(&ailp->xa_lock);
- INIT_LIST_HEAD(&ailp->xa_buf_list);
- init_waitqueue_head(&ailp->xa_empty);
+ ailp->ail_mount = mp;
+ INIT_LIST_HEAD(&ailp->ail_head);
+ INIT_LIST_HEAD(&ailp->ail_cursors);
+ spin_lock_init(&ailp->ail_lock);
+ INIT_LIST_HEAD(&ailp->ail_buf_list);
+ init_waitqueue_head(&ailp->ail_empty);

- ailp->xa_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
- ailp->xa_mount->m_fsname);
- if (IS_ERR(ailp->xa_task))
+ ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
+ ailp->ail_mount->m_fsname);
+ if (IS_ERR(ailp->ail_task))
goto out_free_ailp;

mp->m_ail = ailp;
@@ -821,6 +821,6 @@ xfs_trans_ail_destroy(
{
struct xfs_ail *ailp = mp->m_ail;

- kthread_stop(ailp->xa_task);
+ kthread_stop(ailp->ail_task);
kmem_free(ailp);
}
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 3ba7a96a8abd..b8871bcfe00b 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -429,8 +429,8 @@ xfs_trans_brelse(xfs_trans_t *tp,
* If the fs has shutdown and we dropped the last reference, it may fall
* on us to release a (possibly dirty) bli if it never made it to the
* AIL (e.g., the aborted unpin already happened and didn't release it
- * due to our reference). Since we're already shutdown and need xa_lock,
- * just force remove from the AIL and release the bli here.
+ * due to our reference). Since we're already shutdown and need
+ * ail_lock, just force remove from the AIL and release the bli here.
*/
if (XFS_FORCED_SHUTDOWN(tp->t_mountp) && freed) {
xfs_trans_ail_remove(&bip->bli_item, SHUTDOWN_LOG_IO_ERROR);
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index b317a3644c00..be24b0c8a332 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -65,17 +65,17 @@ struct xfs_ail_cursor {
* Eventually we need to drive the locking in here as well.
*/
struct xfs_ail {
- struct xfs_mount *xa_mount;
- struct task_struct *xa_task;
- struct list_head xa_ail;
- xfs_lsn_t xa_target;
- xfs_lsn_t xa_target_prev;
- struct list_head xa_cursors;
- spinlock_t xa_lock;
- xfs_lsn_t xa_last_pushed_lsn;
- int xa_log_flush;
- struct list_head xa_buf_list;
- wait_queue_head_t xa_empty;
+ struct xfs_mount *ail_mount;
+ struct task_struct *ail_task;
+ struct list_head ail_head;
+ xfs_lsn_t ail_target;
+ xfs_lsn_t ail_target_prev;
+ struct list_head ail_cursors;
+ spinlock_t ail_lock;
+ xfs_lsn_t ail_last_pushed_lsn;
+ int ail_log_flush;
+ struct list_head ail_buf_list;
+ wait_queue_head_t ail_empty;
};

/*
@@ -84,7 +84,7 @@ struct xfs_ail {
void xfs_trans_ail_update_bulk(struct xfs_ail *ailp,
struct xfs_ail_cursor *cur,
struct xfs_log_item **log_items, int nr_items,
- xfs_lsn_t lsn) __releases(ailp->xa_lock);
+ xfs_lsn_t lsn) __releases(ailp->ail_lock);
/*
* Return a pointer to the first item in the AIL. If the AIL is empty, then
* return NULL.
@@ -93,7 +93,7 @@ static inline struct xfs_log_item *
xfs_ail_min(
struct xfs_ail *ailp)
{
- return list_first_entry_or_null(&ailp->xa_ail, struct xfs_log_item,
+ return list_first_entry_or_null(&ailp->ail_head, struct xfs_log_item,
li_ail);
}

@@ -101,14 +101,14 @@ static inline void
xfs_trans_ail_update(
struct xfs_ail *ailp,
struct xfs_log_item *lip,
- xfs_lsn_t lsn) __releases(ailp->xa_lock)
+ xfs_lsn_t lsn) __releases(ailp->ail_lock)
{
xfs_trans_ail_update_bulk(ailp, NULL, &lip, 1, lsn);
}

bool xfs_ail_delete_one(struct xfs_ail *ailp, struct xfs_log_item *lip);
void xfs_trans_ail_delete(struct xfs_ail *ailp, struct xfs_log_item *lip,
- int shutdown_type) __releases(ailp->xa_lock);
+ int shutdown_type) __releases(ailp->ail_lock);

static inline void
xfs_trans_ail_remove(
@@ -117,12 +117,12 @@ xfs_trans_ail_remove(
{
struct xfs_ail *ailp = lip->li_ailp;

- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
/* xfs_trans_ail_delete() drops the AIL lock */
if (lip->li_flags & XFS_LI_IN_AIL)
xfs_trans_ail_delete(ailp, lip, shutdown_type);
else
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
}

void xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
@@ -149,9 +149,9 @@ xfs_trans_ail_copy_lsn(
xfs_lsn_t *src)
{
ASSERT(sizeof(xfs_lsn_t) == 8); /* don't lock if it shrinks */
- spin_lock(&ailp->xa_lock);
+ spin_lock(&ailp->ail_lock);
*dst = *src;
- spin_unlock(&ailp->xa_lock);
+ spin_unlock(&ailp->ail_lock);
}
#else
static inline void
@@ -172,7 +172,7 @@ xfs_clear_li_failed(
struct xfs_buf *bp = lip->li_buf;

ASSERT(lip->li_flags & XFS_LI_IN_AIL);
- lockdep_assert_held(&lip->li_ailp->xa_lock);
+ lockdep_assert_held(&lip->li_ailp->ail_lock);

if (lip->li_flags & XFS_LI_FAILED) {
lip->li_flags &= ~XFS_LI_FAILED;
@@ -186,7 +186,7 @@ xfs_set_li_failed(
struct xfs_log_item *lip,
struct xfs_buf *bp)
{
- lockdep_assert_held(&lip->li_ailp->xa_lock);
+ lockdep_assert_held(&lip->li_ailp->ail_lock);

if (!(lip->li_flags & XFS_LI_FAILED)) {
xfs_buf_hold(bp);
--
2.15.0

2017-12-06 01:01:50

by Matthew Wilcox

Subject: [PATCH v4 14/73] xarray: Add xas_for_each_tag

From: Matthew Wilcox <[email protected]>

This iterator operates across each tagged entry in the specified range.
We do not yet have a user for an xa_for_each_tag iterator, but it would
be straightforward to add one if needed. This commit also includes
xas_find_tag() and xas_next_tag().

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 68 +++++++++++++++++++++++++++++++++++++++++++
lib/xarray.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 146 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index a62baf6f1a28..4e61ebd406f5 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -554,6 +554,7 @@ void *xas_find(struct xa_state *, unsigned long max);
bool xas_get_tag(const struct xa_state *, xa_tag_t);
void xas_set_tag(const struct xa_state *, xa_tag_t);
void xas_clear_tag(const struct xa_state *, xa_tag_t);
+void *xas_find_tag(struct xa_state *, unsigned long max, xa_tag_t);
void xas_init_tags(const struct xa_state *);

bool xas_nomem(struct xa_state *, gfp_t);
@@ -676,6 +677,55 @@ static inline void *xas_next_entry(struct xa_state *xas, unsigned long max)
return entry;
}

+/* Private */
+static inline unsigned int xas_find_chunk(struct xa_state *xas, bool advance,
+ xa_tag_t tag)
+{
+ unsigned long *addr = xas->xa_node->tags[(__force unsigned)tag];
+ unsigned int offset = xas->xa_offset;
+
+ if (advance)
+ offset++;
+ if (XA_CHUNK_SIZE == BITS_PER_LONG) {
+ unsigned long data = *addr & (~0UL << offset);
+ if (data)
+ return __ffs(data);
+ return XA_CHUNK_SIZE;
+ }
+
+ return find_next_bit(addr, XA_CHUNK_SIZE, offset);
+}
+
+/**
+ * xas_next_tag() - Advance iterator to next tagged entry.
+ * @xas: XArray operation state.
+ * @max: Highest index to return.
+ * @tag: Tag to search for.
+ *
+ * xas_next_tag() is an inline function to optimise xarray traversal for
+ * speed. It is equivalent to calling xas_find_tag(), and will call
+ * xas_find_tag() for all the hard cases.
+ *
+ * Return: The next tagged entry after the one currently referred to by @xas.
+ */
+static inline void *xas_next_tag(struct xa_state *xas, unsigned long max,
+ xa_tag_t tag)
+{
+ struct xa_node *node = xas->xa_node;
+ unsigned int offset;
+
+ if (unlikely(xas_not_node(node) || xa_node_shift(node)))
+ return xas_find_tag(xas, max, tag);
+ offset = xas_find_chunk(xas, true, tag);
+ xas->xa_offset = offset;
+ xas->xa_index = (xas->xa_index & ~XA_CHUNK_MASK) + offset;
+ if (xas->xa_index > max)
+ return NULL;
+ if (offset == XA_CHUNK_SIZE)
+ return xas_find_tag(xas, max, tag);
+ return xa_entry(xas->xa, node, offset);
+}
+
/*
* If iterating while holding a lock, drop the lock and reschedule
* every %XA_CHECK_SCHED loops.
@@ -701,6 +751,24 @@ enum {
for (entry = xas_find(xas, max); entry; \
entry = xas_next_entry(xas, max))

+/**
+ * xas_for_each_tag() - Iterate over a range of an XArray
+ * @xas: XArray operation state.
+ * @entry: Entry retrieved from array.
+ * @max: Maximum index to retrieve from array.
+ * @tag: Tag to search for.
+ *
+ * The loop body will be executed for each tagged entry in the xarray
+ * between the current xas position and @max. @entry will be set to
+ * the entry retrieved from the xarray. It is safe to delete entries
+ * from the array in the loop body. You should hold either the RCU lock
+ * or the xa_lock while iterating. If you need to drop the lock, call
+ * xas_pause() first.
+ */
+#define xas_for_each_tag(xas, entry, max, tag) \
+ for (entry = xas_find_tag(xas, max, tag); entry; \
+ entry = xas_next_tag(xas, max, tag))
+
/* Internal functions, mostly shared between radix-tree.c, xarray.c and idr.c */
void xas_destroy(struct xa_state *);

diff --git a/lib/xarray.c b/lib/xarray.c
index ac4ff3daf476..f9eaac2d85f9 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -858,6 +858,84 @@ void *xas_find(struct xa_state *xas, unsigned long max)
}
EXPORT_SYMBOL_GPL(xas_find);

+/**
+ * xas_find_tag() - Find the next tagged entry in the XArray.
+ * @xas: XArray operation state.
+ * @max: Highest index to return.
+ * @tag: Tag number to search for.
+ *
+ * If the xas has not yet been walked to an entry, return the tagged entry
+ * which has an index >= xas.xa_index. If it has been walked, the entry
+ * currently being pointed at has been processed, and so we move to the
+ * next tagged entry.
+ *
+ * If no tagged entry is found and the array is smaller than @max, @xas is
+ * set to the bounds state and xas->xa_index is set to the smallest index
+ * not yet in the array. This allows @xas to be immediately passed to
+ * xas_create().
+ *
+ * Return: The entry, if found, otherwise NULL.
+ */
+void *xas_find_tag(struct xa_state *xas, unsigned long max, xa_tag_t tag)
+{
+ bool advance = true;
+ unsigned int offset;
+ void *entry;
+
+ if (xas_error(xas))
+ return NULL;
+
+ if (!xas->xa_node) {
+ xas->xa_index = 1;
+ goto out;
+ } else if (xas_top(xas->xa_node)) {
+ advance = false;
+ entry = xa_head(xas->xa);
+ if (xas->xa_index > max_index(entry))
+ goto out;
+ if (!xa_is_node(entry)) {
+ if (xa_tagged(xas->xa, tag)) {
+ xas->xa_node = NULL;
+ return entry;
+ }
+ xas->xa_index = 1;
+ goto out;
+ }
+ xas->xa_node = xa_to_node(entry);
+ xas->xa_offset = xas->xa_index >> xas->xa_node->shift;
+ }
+
+ while (xas->xa_index <= max) {
+ if (unlikely(xas->xa_offset == XA_CHUNK_SIZE)) {
+ xas->xa_offset = xas->xa_node->offset + 1;
+ xas->xa_node = xa_parent(xas->xa, xas->xa_node);
+ if (!xas->xa_node)
+ break;
+ advance = false;
+ continue;
+ }
+
+ offset = xas_find_chunk(xas, advance, tag);
+ xas_add(xas, offset - xas->xa_offset);
+ if (offset == XA_CHUNK_SIZE) {
+ advance = false;
+ continue;
+ }
+
+ entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
+ if (!xa_is_node(entry))
+ return entry;
+ xas->xa_node = xa_to_node(entry);
+ xas->xa_offset = get_offset(xas->xa_index, xas->xa_node);
+ }
+
+ out:
+ if (!xas->xa_node)
+ xas->xa_node = XAS_BOUNDS;
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(xas_find_tag);
+
/**
* __xa_init() - Initialise an empty XArray.
* @xa: XArray.
--
2.15.0
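
The single-word fast path of xas_find_chunk() in the patch above can be tried out in userspace. What follows is only a sketch under stated assumptions, not the kernel code: find_chunk() and CHUNK_SIZE are hypothetical names, the tag bitmap is passed as a plain 64-bit integer, and __builtin_ctzll() (a GCC/Clang builtin) stands in for __ffs(). The sketch also adds an explicit bound check before shifting, because shifting a 64-bit value by 64 is undefined in C and that shift count can arise when advancing from offset 63.

```c
#include <assert.h>

/* Assumption: one 64-bit tag word per node (XA_CHUNK_SIZE == BITS_PER_LONG == 64). */
#define CHUNK_SIZE 64u

/*
 * Return the offset of the first set bit in @tags at or after @offset
 * (strictly after @offset when @advance is non-zero), or CHUNK_SIZE if
 * no such bit exists. Hypothetical helper, not a kernel function.
 */
static unsigned int find_chunk(unsigned long long tags, unsigned int offset,
			       int advance)
{
	unsigned long long data;

	assert(offset < CHUNK_SIZE);
	if (advance)
		offset++;
	/* Guard added in this sketch: a shift by 64 would be undefined. */
	if (offset >= CHUNK_SIZE)
		return CHUNK_SIZE;
	/* Mask off every bit below @offset, then find the lowest survivor. */
	data = tags & (~0ULL << offset);
	if (!data)
		return CHUNK_SIZE;
	return (unsigned int)__builtin_ctzll(data);
}
```

A return value of CHUNK_SIZE plays the same role as in the patch: it tells the caller that this node holds no further tagged slots and the search has to move up to the parent.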

2017-12-06 01:01:56

by Matthew Wilcox

Subject: [PATCH v4 05/73] xarray: Change definition of sibling entries

From: Matthew Wilcox <[email protected]>

Instead of storing a pointer to the slot containing the canonical entry,
store the offset of the slot. Produces slightly more efficient code
(~300 bytes) and simplifies the implementation.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++
lib/radix-tree.c | 65 +++++++++++----------------------------
2 files changed, 100 insertions(+), 47 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index e55f5cfd14ed..2c45d87a3476 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -58,6 +58,8 @@ static inline bool xa_is_value(void *entry)
return (unsigned long)entry & 1;
}

+/* Everything below here is the Advanced API. Proceed with caution. */
+
#define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
#define xa_lock(xa) spin_lock(&(xa)->xa_lock)
#define xa_unlock(xa) spin_unlock(&(xa)->xa_lock)
@@ -71,4 +73,84 @@ static inline bool xa_is_value(void *entry)
spin_unlock_irqrestore(&(xa)->xa_lock, flags)
#define xa_lock_held(xa) lockdep_is_held(&(xa)->xa_lock)

+/*
+ * The xarray is constructed out of a set of 'chunks' of pointers. Choosing
+ * the best chunk size requires some tradeoffs. A power of two recommends
+ * itself so that we can walk the tree based purely on shifts and masks.
+ * Generally, the larger the better; as the number of slots per level of the
+ * tree increases, the less tall the tree needs to be. But that needs to be
+ * balanced against the memory consumption of each node. On a 64-bit system,
+ * xa_node is currently 576 bytes, and we get 7 of them per 4kB page. If we
+ * doubled the number of slots per node, we'd get only 3 nodes per 4kB page.
+ */
+#ifndef XA_CHUNK_SHIFT
+#define XA_CHUNK_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
+#endif
+#define XA_CHUNK_SIZE (1UL << XA_CHUNK_SHIFT)
+#define XA_CHUNK_MASK (XA_CHUNK_SIZE - 1)
+
+/*
+ * Internal entries have the bottom two bits set to the value 10b. Most
+ * internal entries are pointers to the next node in the tree. Since the
+ * kernel unmaps page 0 to trap NULL pointer dereferences, we can use values
+ * 0-1023 for special purposes. Values 0-62 are used for sibling
+ * entries. Value 256 is used for the retry entry.
+ */
+
+/* Private */
+static inline void *xa_mk_internal(unsigned long v)
+{
+ return (void *)((v << 2) | 2);
+}
+
+/* Private */
+static inline unsigned long xa_to_internal(void *entry)
+{
+ return (unsigned long)entry >> 2;
+}
+
+/**
+ * xa_is_internal() - Is the entry an internal entry?
+ * @entry: Entry retrieved from the XArray
+ *
+ * Return: %true if the entry is an internal entry.
+ */
+static inline bool xa_is_internal(void *entry)
+{
+ return ((unsigned long)entry & 3) == 2;
+}
+
+/* Private */
+static inline bool xa_is_node(void *entry)
+{
+ return xa_is_internal(entry) && (unsigned long)entry > 4096;
+}
+
+/* Private */
+static inline void *xa_mk_sibling(unsigned int offset)
+{
+ return xa_mk_internal(offset);
+}
+
+/* Private */
+static inline unsigned long xa_to_sibling(void *entry)
+{
+ return xa_to_internal(entry);
+}
+
+/**
+ * xa_is_sibling() - Is the entry a sibling entry?
+ * @entry: Entry retrieved from the XArray
+ *
+ * Return: %true if the entry is a sibling entry.
+ */
+static inline bool xa_is_sibling(void *entry)
+{
+ return IS_ENABLED(CONFIG_RADIX_TREE_MULTIORDER) &&
+ xa_is_internal(entry) &&
+ (entry < xa_mk_sibling(XA_CHUNK_SIZE - 1));
+}
+
+#define XA_RETRY_ENTRY xa_mk_internal(256)
+
#endif /* _LINUX_XARRAY_H */
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index cda7a730e591..0a7a21dd9398 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -37,6 +37,7 @@
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/string.h>
+#include <linux/xarray.h>


/* Number of nodes in fully populated tree of given height */
@@ -97,24 +98,7 @@ static inline void *node_to_entry(void *ptr)
return (void *)((unsigned long)ptr | RADIX_TREE_INTERNAL_NODE);
}

-#define RADIX_TREE_RETRY node_to_entry(NULL)
-
-#ifdef CONFIG_RADIX_TREE_MULTIORDER
-/* Sibling slots point directly to another slot in the same node */
-static inline
-bool is_sibling_entry(const struct radix_tree_node *parent, void *node)
-{
- void __rcu **ptr = node;
- return (parent->slots <= ptr) &&
- (ptr < parent->slots + RADIX_TREE_MAP_SIZE);
-}
-#else
-static inline
-bool is_sibling_entry(const struct radix_tree_node *parent, void *node)
-{
- return false;
-}
-#endif
+#define RADIX_TREE_RETRY XA_RETRY_ENTRY

static inline unsigned long
get_slot_offset(const struct radix_tree_node *parent, void __rcu **slot)
@@ -128,16 +112,10 @@ static unsigned int radix_tree_descend(const struct radix_tree_node *parent,
unsigned int offset = (index >> parent->shift) & RADIX_TREE_MAP_MASK;
void __rcu **entry = rcu_dereference_raw(parent->slots[offset]);

-#ifdef CONFIG_RADIX_TREE_MULTIORDER
- if (radix_tree_is_internal_node(entry)) {
- if (is_sibling_entry(parent, entry)) {
- void __rcu **sibentry;
- sibentry = (void __rcu **) entry_to_node(entry);
- offset = get_slot_offset(parent, sibentry);
- entry = rcu_dereference_raw(*sibentry);
- }
+ if (xa_is_sibling(entry)) {
+ offset = xa_to_sibling(entry);
+ entry = rcu_dereference_raw(parent->slots[offset]);
}
-#endif

*nodep = (void *)entry;
return offset;
@@ -299,10 +277,10 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
} else if (!radix_tree_is_internal_node(entry)) {
pr_debug("radix entry %p offset %ld indices %lu-%lu parent %p\n",
entry, i, first, last, node);
- } else if (is_sibling_entry(node, entry)) {
+ } else if (xa_is_sibling(entry)) {
pr_debug("radix sblng %p offset %ld indices %lu-%lu parent %p val %p\n",
entry, i, first, last, node,
- *(void **)entry_to_node(entry));
+ node->slots[xa_to_sibling(entry)]);
} else {
dump_node(entry_to_node(entry), first);
}
@@ -872,8 +850,7 @@ static void radix_tree_free_nodes(struct radix_tree_node *node)

for (;;) {
void *entry = rcu_dereference_raw(child->slots[offset]);
- if (radix_tree_is_internal_node(entry) &&
- !is_sibling_entry(child, entry)) {
+ if (xa_is_node(entry)) {
child = entry_to_node(entry);
offset = 0;
continue;
@@ -895,7 +872,7 @@ static void radix_tree_free_nodes(struct radix_tree_node *node)
static inline int insert_entries(struct radix_tree_node *node,
void __rcu **slot, void *item, unsigned order, bool replace)
{
- struct radix_tree_node *child;
+ void *sibling;
unsigned i, n, tag, offset, tags = 0;

if (node) {
@@ -913,7 +890,7 @@ static inline int insert_entries(struct radix_tree_node *node,
offset = offset & ~(n - 1);
slot = &node->slots[offset];
}
- child = node_to_entry(slot);
+ sibling = xa_mk_sibling(offset);

for (i = 0; i < n; i++) {
if (slot[i]) {
@@ -930,7 +907,7 @@ static inline int insert_entries(struct radix_tree_node *node,
for (i = 0; i < n; i++) {
struct radix_tree_node *old = rcu_dereference_raw(slot[i]);
if (i) {
- rcu_assign_pointer(slot[i], child);
+ rcu_assign_pointer(slot[i], sibling);
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
if (tags & (1 << tag))
tag_clear(node, tag, offset + i);
@@ -940,9 +917,7 @@ static inline int insert_entries(struct radix_tree_node *node,
if (tags & (1 << tag))
tag_set(node, tag, offset);
}
- if (radix_tree_is_internal_node(old) &&
- !is_sibling_entry(node, old) &&
- (old != RADIX_TREE_RETRY))
+ if (xa_is_node(old))
radix_tree_free_nodes(old);
if (xa_is_value(old))
node->exceptional--;
@@ -1101,10 +1076,10 @@ static inline void replace_sibling_entries(struct radix_tree_node *node,
void __rcu **slot, int count, int exceptional)
{
#ifdef CONFIG_RADIX_TREE_MULTIORDER
- void *ptr = node_to_entry(slot);
- unsigned offset = get_slot_offset(node, slot) + 1;
+ unsigned offset = get_slot_offset(node, slot);
+ void *ptr = xa_mk_sibling(offset);

- while (offset < RADIX_TREE_MAP_SIZE) {
+ while (++offset < RADIX_TREE_MAP_SIZE) {
if (rcu_dereference_raw(node->slots[offset]) != ptr)
break;
if (count < 0) {
@@ -1112,7 +1087,6 @@ static inline void replace_sibling_entries(struct radix_tree_node *node,
node->count--;
}
node->exceptional += exceptional;
- offset++;
}
#endif
}
@@ -1311,8 +1285,7 @@ int radix_tree_split(struct radix_tree_root *root, unsigned long index,
tags |= 1 << tag;

for (end = offset + 1; end < RADIX_TREE_MAP_SIZE; end++) {
- if (!is_sibling_entry(parent,
- rcu_dereference_raw(parent->slots[end])))
+ if (!xa_is_sibling(rcu_dereference_raw(parent->slots[end])))
break;
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
if (tags & (1 << tag))
@@ -1608,11 +1581,9 @@ static void set_iter_tags(struct radix_tree_iter *iter,
static void __rcu **skip_siblings(struct radix_tree_node **nodep,
void __rcu **slot, struct radix_tree_iter *iter)
{
- void *sib = node_to_entry(slot - 1);
-
while (iter->index < iter->next_index) {
*nodep = rcu_dereference_raw(*slot);
- if (*nodep && *nodep != sib)
+ if (*nodep && !xa_is_sibling(*nodep))
return slot;
slot++;
iter->index = __radix_tree_iter_add(iter, 1);
@@ -1763,7 +1734,7 @@ void __rcu **radix_tree_next_chunk(const struct radix_tree_root *root,
while (++offset < RADIX_TREE_MAP_SIZE) {
void *slot = rcu_dereference_raw(
node->slots[offset]);
- if (is_sibling_entry(node, slot))
+ if (xa_is_sibling(slot))
continue;
if (slot)
break;
--
2.15.0
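
The internal-entry encoding introduced by this patch can be checked in userspace. This is a minimal sketch that mirrors the helpers added to include/linux/xarray.h, under these assumptions: the names (mk_internal() and friends) are hypothetical, XA_CHUNK_SHIFT is taken to be 6, uintptr_t replaces the kernel's unsigned long casts, and the CONFIG_RADIX_TREE_MULTIORDER check in xa_is_sibling() is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the non-BASE_SMALL configuration, i.e. 64 slots per node. */
#define CHUNK_SHIFT 6
#define CHUNK_SIZE (1UL << CHUNK_SHIFT)

/* Internal entries carry the value in the upper bits and 10b in the low two. */
static inline void *mk_internal(unsigned long v)
{
	return (void *)(((uintptr_t)v << 2) | 2);
}

static inline unsigned long to_internal(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 2);
}

static inline bool is_internal(const void *entry)
{
	return ((uintptr_t)entry & 3) == 2;
}

/* Sibling entries encode the offset of the canonical slot: 0 to 62. */
static inline bool is_sibling(const void *entry)
{
	return is_internal(entry) &&
	       (uintptr_t)entry < (uintptr_t)mk_internal(CHUNK_SIZE - 1);
}

/* Anything internal and above the first page is a pointer to a node. */
static inline bool is_node(const void *entry)
{
	return is_internal(entry) && (uintptr_t)entry > 4096;
}

/* The retry entry is internal, but neither a sibling nor a node. */
#define RETRY_ENTRY mk_internal(256)
```

Offset 63 is excluded from the sibling range because a sibling always refers to a canonical slot at a lower offset in the same node, so the largest offset it can name is 62.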

2017-12-06 01:02:07

by Matthew Wilcox

Subject: [PATCH v4 25/73] page cache: Convert page deletion to XArray

From: Matthew Wilcox <[email protected]>

The code is slightly shorter and simpler.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/filemap.c | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 2439747a0a17..6e2808fd3c06 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -115,27 +115,25 @@
static void page_cache_tree_delete(struct address_space *mapping,
struct page *page, void *shadow)
{
- int i, nr;
+ XA_STATE(xas, &mapping->pages, page->index);
+ unsigned int i, nr;

- /* hugetlb pages are represented by one entry in the radix tree */
+ xas_set_update(&xas, workingset_lookup_update(mapping));
+
+ /* hugetlb pages are represented by a single entry in the xarray */
nr = PageHuge(page) ? 1 : hpage_nr_pages(page);

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(nr != 1 && shadow, page);

- for (i = 0; i < nr; i++) {
- struct radix_tree_node *node;
- void **slot;
-
- __radix_tree_lookup(&mapping->pages, page->index + i,
- &node, &slot);
-
- VM_BUG_ON_PAGE(!node && nr != 1, page);
-
- radix_tree_clear_tags(&mapping->pages, node, slot);
- __radix_tree_replace(&mapping->pages, node, slot, shadow,
- workingset_lookup_update(mapping));
+ i = nr;
+repeat:
+ xas_store(&xas, shadow);
+ xas_init_tags(&xas);
+ if (--i) {
+ xas_next(&xas);
+ goto repeat;
}

page->mapping = NULL;
--
2.15.0

2017-12-06 01:02:02

by Matthew Wilcox

Subject: [PATCH v4 16/73] xarray: Add xa_destroy

From: Matthew Wilcox <[email protected]>

This function frees all the internal memory allocated to the xarray
and reinitialises it to be empty.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 1 +
lib/xarray.c | 26 ++++++++++++++++++++++++++
2 files changed, 27 insertions(+)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index c3efcc3432f7..b648c1b93d9f 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -74,6 +74,7 @@ void *xa_load(struct xarray *, unsigned long index);
void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
void *xa_cmpxchg(struct xarray *, unsigned long index,
void *old, void *entry, gfp_t);
+void xa_destroy(struct xarray *);

/**
* xa_erase() - Erase this entry from the XArray.
diff --git a/lib/xarray.c b/lib/xarray.c
index 251724f62b11..f3875b251b41 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1341,6 +1341,32 @@ int xa_get_tagged(struct xarray *xa, void **dst, unsigned long start,
}
EXPORT_SYMBOL(xa_get_tagged);

+/**
+ * xa_destroy() - Free all internal data structures.
+ * @xa: XArray.
+ *
+ * After calling this function, the XArray is empty and has freed all memory
+ * allocated for its internal data structures. You are responsible for
+ * freeing the objects referenced by the XArray.
+ */
+void xa_destroy(struct xarray *xa)
+{
+ XA_STATE(xas, xa, 0);
+ unsigned long flags;
+ void *entry;
+
+ xas.xa_node = NULL;
+ xa_lock_irqsave(xa, flags);
+ entry = xa_head_locked(xa);
+ RCU_INIT_POINTER(xa->xa_head, NULL);
+ xas_init_tags(&xas);
+ /* lockdep checks we're still holding the lock in xas_free_nodes() */
+ if (xa_is_node(entry))
+ xas_free_nodes(&xas, xa_to_node(entry));
+ xa_unlock_irqrestore(xa, flags);
+}
+EXPORT_SYMBOL(xa_destroy);
+
#ifdef XA_DEBUG
void xa_dump_entry(void *entry, unsigned long index)
{
--
2.15.0

2017-12-06 01:01:42

by Matthew Wilcox

Subject: [PATCH v4 04/73] xarray: Replace exceptional entries

From: Matthew Wilcox <[email protected]>

Introduce xarray value entries to replace the radix tree exceptional
entry code. This is a slight change in encoding to allow the use of an
extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry).
It is also a change in emphasis; exceptional entries are intimidating
and different. As the comment explains, you can choose to store values
or pointers in the xarray and they are both first-class citizens.
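
As a rough illustration of the value-entry encoding, here is a userspace sketch. It assumes the payload is shifted left by one bit with the low bit set, which is consistent with xa_is_value() testing bit 0 and with the BITS_PER_LONG - 1 payload bits mentioned above; the definitions of xa_mk_value() and xa_to_value() are not shown in this excerpt, so treat that as an assumption. All names in the sketch are hypothetical. The packed-entry helpers imitate the way the fs/dax.c hunk of this patch keeps flag bits in the low four bits of the value and the sector above them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: payload in the upper bits, low bit set to 1. */
static inline void *mk_value(unsigned long v)
{
	return (void *)(((uintptr_t)v << 1) | 1);
}

static inline unsigned long to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

static inline bool is_value(const void *entry)
{
	return (uintptr_t)entry & 1;
}

/* Hypothetical flag layout, modelled on DAX_SHIFT and friends. */
#define SKETCH_SHIFT 4
#define SKETCH_LOCK (1UL << 0)
#define SKETCH_PMD (1UL << 1)

/* Pack a sector and flags into one value entry with the lock bit set. */
static inline void *mk_locked(unsigned long sector, unsigned long flags)
{
	return mk_value(flags | (sector << SKETCH_SHIFT) | SKETCH_LOCK);
}

/* Recover the sector by discarding the flag bits. */
static inline unsigned long entry_sector(const void *entry)
{
	return to_value(entry) >> SKETCH_SHIFT;
}
```

Because the flags are applied to the decoded value rather than to the raw pointer bits, the flag definitions no longer need to know which bit marks an entry as a value.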

Signed-off-by: Matthew Wilcox <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +-
arch/powerpc/include/asm/nohash/64/pgtable.h | 4 +-
drivers/gpu/drm/i915/i915_gem.c | 17 ++--
drivers/staging/lustre/lustre/mdc/mdc_request.c | 2 +-
fs/btrfs/compression.c | 2 +-
fs/btrfs/inode.c | 4 +-
fs/dax.c | 113 ++++++++++++------------
fs/proc/task_mmu.c | 2 +-
include/linux/fs.h | 48 ++++++----
include/linux/radix-tree.h | 36 ++------
include/linux/swapops.h | 19 ++--
include/linux/xarray.h | 40 +++++++++
lib/idr.c | 63 ++++++-------
lib/radix-tree.c | 21 ++---
mm/filemap.c | 10 +--
mm/khugepaged.c | 2 +-
mm/madvise.c | 2 +-
mm/memcontrol.c | 2 +-
mm/mincore.c | 2 +-
mm/readahead.c | 2 +-
mm/shmem.c | 10 +--
mm/swap.c | 2 +-
mm/truncate.c | 12 +--
mm/workingset.c | 12 ++-
tools/testing/radix-tree/idr-test.c | 6 +-
tools/testing/radix-tree/linux/radix-tree.h | 1 +
tools/testing/radix-tree/multiorder.c | 47 +++++-----
tools/testing/radix-tree/test.c | 2 +-
28 files changed, 248 insertions(+), 239 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 44697817ccc6..5025c26f1acd 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -649,9 +649,7 @@ static inline bool pte_user(pte_t pte)
BUILD_BUG_ON(_PAGE_HPTEFLAGS & (0x1f << _PAGE_BIT_SWAP_TYPE)); \
BUILD_BUG_ON(_PAGE_HPTEFLAGS & _PAGE_SWP_SOFT_DIRTY); \
} while (0)
-/*
- * on pte we don't need handle RADIX_TREE_EXCEPTIONAL_SHIFT;
- */
+
#define SWP_TYPE_BITS 5
#define __swp_type(x) (((x).val >> _PAGE_BIT_SWAP_TYPE) \
& ((1UL << SWP_TYPE_BITS) - 1))
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h b/arch/powerpc/include/asm/nohash/64/pgtable.h
index abddf5830ad5..f711773568d7 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -329,9 +329,7 @@ static inline void __ptep_set_access_flags(struct mm_struct *mm,
*/ \
BUILD_BUG_ON(_PAGE_HPTEFLAGS & (0x1f << _PAGE_BIT_SWAP_TYPE)); \
} while (0)
-/*
- * on pte we don't need handle RADIX_TREE_EXCEPTIONAL_SHIFT;
- */
+
#define SWP_TYPE_BITS 5
#define __swp_type(x) (((x).val >> _PAGE_BIT_SWAP_TYPE) \
& ((1UL << SWP_TYPE_BITS) - 1))
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 3a140eedfc83..0446ed973f75 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -5375,7 +5375,8 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
count = __sg_page_count(sg);

while (idx + count <= n) {
- unsigned long exception, i;
+ void *entry;
+ unsigned long i;
int ret;

/* If we cannot allocate and insert this entry, or the
@@ -5390,12 +5391,9 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
if (ret && ret != -EEXIST)
goto scan;

- exception =
- RADIX_TREE_EXCEPTIONAL_ENTRY |
- idx << RADIX_TREE_EXCEPTIONAL_SHIFT;
+ entry = xa_mk_value(idx);
for (i = 1; i < count; i++) {
- ret = radix_tree_insert(&iter->radix, idx + i,
- (void *)exception);
+ ret = radix_tree_insert(&iter->radix, idx + i, entry);
if (ret && ret != -EEXIST)
goto scan;
}
@@ -5433,15 +5431,14 @@ i915_gem_object_get_sg(struct drm_i915_gem_object *obj,
GEM_BUG_ON(!sg);

/* If this index is in the middle of multi-page sg entry,
- * the radixtree will contain an exceptional entry that points
+ * the radix tree will contain a data value entry that points
* to the start of that range. We will return the pointer to
* the base page and the offset of this page within the
* sg entry's range.
*/
*offset = 0;
- if (unlikely(radix_tree_exception(sg))) {
- unsigned long base =
- (unsigned long)sg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+ if (unlikely(xa_is_value(sg))) {
+ unsigned long base = xa_to_value(sg);

sg = radix_tree_lookup(&iter->radix, base);
GEM_BUG_ON(!sg);
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index 45dcf9f958d4..2ec79a6b17da 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -940,7 +940,7 @@ static struct page *mdc_page_locate(struct address_space *mapping, __u64 *hash,
xa_lock_irq(&mapping->pages);
found = radix_tree_gang_lookup(&mapping->pages,
(void **)&page, offset, 1);
- if (found > 0 && !radix_tree_exceptional_entry(page)) {
+ if (found > 0 && !xa_is_value(page)) {
struct lu_dirpage *dp;

get_page(page);
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 280717b26224..e687d06cd97c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -452,7 +452,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
rcu_read_lock();
page = radix_tree_lookup(&mapping->pages, pg_index);
rcu_read_unlock();
- if (page && !radix_tree_exceptional_entry(page)) {
+ if (page && !xa_is_value(page)) {
misses++;
if (misses > 4)
break;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4da872bafcf8..72f763c56127 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7576,8 +7576,8 @@ bool btrfs_page_exists_in_range(struct inode *inode, loff_t start, loff_t end)
}
/*
* Otherwise, shmem/tmpfs must be storing a swap entry
- * here as an exceptional entry: so return it without
- * attempting to raise page count.
+ * here so return it without attempting to raise page
+ * count.
*/
page = NULL;
break; /* TODO: Is this relevant for this use case? */
diff --git a/fs/dax.c b/fs/dax.c
index e743ff1f6240..86bacca51eed 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -58,57 +58,57 @@ static int __init init_dax_wait_table(void)
fs_initcall(init_dax_wait_table);

/*
- * We use lowest available bit in exceptional entry for locking, one bit for
- * the entry size (PMD) and two more to tell us if the entry is a zero page or
- * an empty entry that is just used for locking. In total four special bits.
+ * We use the lowest available bit in a data value entry for locking, one bit
+ * for the entry size (PMD) and two more to tell us if the entry is a zero
+ * page or an empty entry that is just used for locking. In total four
+ * special bits.
*
* If the PMD bit isn't set the entry has size PAGE_SIZE, and if the ZERO_PAGE
* and EMPTY bits aren't set the entry is a normal DAX entry with a filesystem
* block allocation.
*/
-#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 4)
-#define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_ZERO_PAGE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_EMPTY (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 3))
+#define DAX_SHIFT (4)
+#define DAX_ENTRY_LOCK (1UL << 0)
+#define DAX_PMD (1UL << 1)
+#define DAX_ZERO_PAGE (1UL << 2)
+#define DAX_EMPTY (1UL << 3)

static unsigned long dax_radix_sector(void *entry)
{
- return (unsigned long)entry >> RADIX_DAX_SHIFT;
+ return xa_to_value(entry) >> DAX_SHIFT;
}

static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
{
- return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
- ((unsigned long)sector << RADIX_DAX_SHIFT) |
- RADIX_DAX_ENTRY_LOCK);
+ return xa_mk_value(flags | ((unsigned long)sector << DAX_SHIFT) |
+ DAX_ENTRY_LOCK);
}

static unsigned int dax_radix_order(void *entry)
{
- if ((unsigned long)entry & RADIX_DAX_PMD)
+ if (xa_to_value(entry) & DAX_PMD)
return PMD_SHIFT - PAGE_SHIFT;
return 0;
}

static int dax_is_pmd_entry(void *entry)
{
- return (unsigned long)entry & RADIX_DAX_PMD;
+ return xa_to_value(entry) & DAX_PMD;
}

static int dax_is_pte_entry(void *entry)
{
- return !((unsigned long)entry & RADIX_DAX_PMD);
+ return !(xa_to_value(entry) & DAX_PMD);
}

static int dax_is_zero_entry(void *entry)
{
- return (unsigned long)entry & RADIX_DAX_ZERO_PAGE;
+ return xa_to_value(entry) & DAX_ZERO_PAGE;
}

static int dax_is_empty_entry(void *entry)
{
- return (unsigned long)entry & RADIX_DAX_EMPTY;
+ return xa_to_value(entry) & DAX_EMPTY;
}

/*
@@ -188,9 +188,9 @@ static void dax_wake_mapping_entry_waiter(struct address_space *mapping,
*/
static inline int slot_locked(struct address_space *mapping, void **slot)
{
- unsigned long entry = (unsigned long)
- radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);
- return entry & RADIX_DAX_ENTRY_LOCK;
+ unsigned long entry = xa_to_value(
+ radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock));
+ return entry & DAX_ENTRY_LOCK;
}

/*
@@ -199,12 +199,11 @@ static inline int slot_locked(struct address_space *mapping, void **slot)
*/
static inline void *lock_slot(struct address_space *mapping, void **slot)
{
- unsigned long entry = (unsigned long)
- radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);
-
- entry |= RADIX_DAX_ENTRY_LOCK;
- radix_tree_replace_slot(&mapping->pages, slot, (void *)entry);
- return (void *)entry;
+ unsigned long v = xa_to_value(
+ radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock));
+ void *entry = xa_mk_value(v | DAX_ENTRY_LOCK);
+ radix_tree_replace_slot(&mapping->pages, slot, entry);
+ return entry;
}

/*
@@ -213,17 +212,16 @@ static inline void *lock_slot(struct address_space *mapping, void **slot)
*/
static inline void *unlock_slot(struct address_space *mapping, void **slot)
{
- unsigned long entry = (unsigned long)
- radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);
-
- entry &= ~(unsigned long)RADIX_DAX_ENTRY_LOCK;
- radix_tree_replace_slot(&mapping->pages, slot, (void *)entry);
- return (void *)entry;
+ unsigned long v = xa_to_value(
+ radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock));
+ void *entry = xa_mk_value(v & ~DAX_ENTRY_LOCK);
+ radix_tree_replace_slot(&mapping->pages, slot, entry);
+ return entry;
}

/*
* Lookup entry in radix tree, wait for it to become unlocked if it is
- * exceptional entry and return it. The caller must call
+ * a data value entry and return it. The caller must call
* put_unlocked_mapping_entry() when he decided not to lock the entry or
* put_locked_mapping_entry() when he locked the entry and now wants to
* unlock it.
@@ -244,7 +242,7 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
entry = __radix_tree_lookup(&mapping->pages, index, NULL,
&slot);
if (!entry ||
- WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)) ||
+ WARN_ON_ONCE(!xa_is_value(entry)) ||
!slot_locked(mapping, slot)) {
if (slotp)
*slotp = slot;
@@ -268,7 +266,7 @@ static void dax_unlock_mapping_entry(struct address_space *mapping,

xa_lock_irq(&mapping->pages);
entry = __radix_tree_lookup(&mapping->pages, index, NULL, &slot);
- if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry) ||
+ if (WARN_ON_ONCE(!entry || !xa_is_value(entry) ||
!slot_locked(mapping, slot))) {
xa_unlock_irq(&mapping->pages);
return;
@@ -299,12 +297,11 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
}

/*
- * Find radix tree entry at given index. If it points to an exceptional entry,
- * return it with the radix tree entry locked. If the radix tree doesn't
- * contain given index, create an empty exceptional entry for the index and
- * return with it locked.
+ * Find radix tree entry at given index. If it is a data value, return it
+ * with the radix tree entry locked. If the radix tree doesn't contain the
+ * given index, create an empty value for the index and return with it locked.
*
- * When requesting an entry with size RADIX_DAX_PMD, grab_mapping_entry() will
+ * When requesting an entry with size DAX_PMD, grab_mapping_entry() will
* either return that locked entry or will return an error. This error will
* happen if there are any 4k entries within the 2MiB range that we are
* requesting.
@@ -334,13 +331,13 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
xa_lock_irq(&mapping->pages);
entry = get_unlocked_mapping_entry(mapping, index, &slot);

- if (WARN_ON_ONCE(entry && !radix_tree_exceptional_entry(entry))) {
+ if (WARN_ON_ONCE(entry && !xa_is_value(entry))) {
entry = ERR_PTR(-EIO);
goto out_unlock;
}

if (entry) {
- if (size_flag & RADIX_DAX_PMD) {
+ if (size_flag & DAX_PMD) {
if (dax_is_pte_entry(entry)) {
put_unlocked_mapping_entry(mapping, index,
entry);
@@ -410,7 +407,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
true);
}

- entry = dax_radix_locked_entry(0, size_flag | RADIX_DAX_EMPTY);
+ entry = dax_radix_locked_entry(0, size_flag | DAX_EMPTY);

err = __radix_tree_insert(&mapping->pages, index,
dax_radix_order(entry), entry);
@@ -447,7 +444,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,

xa_lock_irq(&mapping->pages);
entry = get_unlocked_mapping_entry(mapping, index, NULL);
- if (!entry || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry)))
+ if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
goto out;
if (!trunc &&
(radix_tree_tag_get(pages, index, PAGECACHE_TAG_DIRTY) ||
@@ -462,7 +459,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
return ret;
}
/*
- * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
+ * Delete DAX data value entry at @index from @mapping. Wait for radix tree
* entry to get unlocked before deleting it.
*/
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -473,7 +470,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
* This gets called from truncate / punch_hole path. As such, the caller
* must hold locks protecting against concurrent modifications of the
* radix tree (usually fs-private i_mmap_sem for writing). Since the
- * caller has seen exceptional entry for this index, we better find it
+ * caller has seen a data value entry for this index, we better find it
* at that index as well...
*/
WARN_ON_ONCE(!ret);
@@ -481,7 +478,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
}

/*
- * Invalidate exceptional DAX entry if it is clean.
+ * Invalidate DAX data value entry if it is clean.
*/
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index)
@@ -535,7 +532,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

- if (dax_is_zero_entry(entry) && !(flags & RADIX_DAX_ZERO_PAGE)) {
+ if (dax_is_zero_entry(entry) && !(flags & DAX_ZERO_PAGE)) {
/* we are replacing a zero page with block mapping */
if (dax_is_pmd_entry(entry))
unmap_mapping_range(mapping,
@@ -675,13 +672,13 @@ static int dax_writeback_one(struct block_device *bdev,
* A page got tagged dirty in DAX mapping? Something is seriously
* wrong.
*/
- if (WARN_ON(!radix_tree_exceptional_entry(entry)))
+ if (WARN_ON(!xa_is_value(entry)))
return -EIO;

xa_lock_irq(&mapping->pages);
entry2 = get_unlocked_mapping_entry(mapping, index, &slot);
/* Entry got punched out / reallocated? */
- if (!entry2 || WARN_ON_ONCE(!radix_tree_exceptional_entry(entry2)))
+ if (!entry2 || WARN_ON_ONCE(!xa_is_value(entry2)))
goto put_unlocked;
/*
* Entry got reallocated elsewhere? No need to writeback. We have to
@@ -887,7 +884,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
}

entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
- RADIX_DAX_ZERO_PAGE, false);
+ DAX_ZERO_PAGE, false);
if (IS_ERR(entry2)) {
ret = VM_FAULT_SIGBUS;
goto out;
@@ -1293,7 +1290,7 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
goto fallback;

ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
- RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
+ DAX_PMD | DAX_ZERO_PAGE, false);
if (IS_ERR(ret))
goto fallback;

@@ -1378,7 +1375,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* is already in the tree, for instance), it will return -EEXIST and
* we just fall back to 4k entries.
*/
- entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+ entry = grab_mapping_entry(mapping, pgoff, DAX_PMD);
if (IS_ERR(entry))
goto fallback;

@@ -1417,7 +1414,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,

entry = dax_insert_mapping_entry(mapping, vmf, entry,
dax_iomap_sector(&iomap, pos),
- RADIX_DAX_PMD, write && !sync);
+ DAX_PMD, write && !sync);
if (IS_ERR(entry))
goto finish_iomap;

@@ -1529,21 +1526,21 @@ static int dax_insert_pfn_mkwrite(struct vm_fault *vmf,
pgoff_t index = vmf->pgoff;
int vmf_ret, error;

- spin_lock_irq(&mapping->tree_lock);
+ xa_lock_irq(&mapping->pages);
entry = get_unlocked_mapping_entry(mapping, index, &slot);
/* Did we race with someone splitting entry or so? */
if (!entry ||
(pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
(pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
put_unlocked_mapping_entry(mapping, index, entry);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
VM_FAULT_NOPAGE);
return VM_FAULT_NOPAGE;
}
- radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+ radix_tree_tag_set(&mapping->pages, index, PAGECACHE_TAG_DIRTY);
entry = lock_slot(mapping, slot);
- spin_unlock_irq(&mapping->tree_lock);
+ xa_unlock_irq(&mapping->pages);
switch (pe_size) {
case PE_SIZE_PTE:
error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
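
Reviewer note on the fs/dax.c conversion above: the lock bit and size flags now live inside the value rather than in the raw pointer bits, so every read goes through xa_to_value() and every write through xa_mk_value(). The following is a minimal userspace sketch of that round trip, not the kernel code. The DAX_* bit positions are assumptions (the real definitions sit earlier in fs/dax.c, outside this hunk), and the dax_entry_*() helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed flag layout; the real definitions are earlier in fs/dax.c. */
#define DAX_SHIFT	4
#define DAX_ENTRY_LOCK	(1UL << 0)
#define DAX_PMD		(1UL << 1)
#define DAX_ZERO_PAGE	(1UL << 2)
#define DAX_EMPTY	(1UL << 3)

/* Userspace copies of the helpers this patch adds to xarray.h. */
static inline void *xa_mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

static inline unsigned long xa_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}

/* Build a locked entry from a sector number and DAX_* flags. */
static inline void *dax_locked_entry(unsigned long sector, unsigned long flags)
{
	return xa_mk_value(flags | (sector << DAX_SHIFT) | DAX_ENTRY_LOCK);
}

/* Hypothetical accessors: each one decodes the value before testing bits. */
static inline bool dax_entry_locked(const void *entry)
{
	return xa_to_value(entry) & DAX_ENTRY_LOCK;
}

static inline bool dax_entry_is_pmd(const void *entry)
{
	return xa_to_value(entry) & DAX_PMD;
}

static inline unsigned long dax_entry_sector(const void *entry)
{
	return xa_to_value(entry) >> DAX_SHIFT;
}

/* Clear the lock bit and re-encode; sector and size flags are preserved. */
static inline void *dax_entry_unlock(const void *entry)
{
	return xa_mk_value(xa_to_value(entry) & ~DAX_ENTRY_LOCK);
}
```

The point of the sketch is that unlocking decodes, masks and re-encodes, which is what the new lock_slot() and unlock_slot() do instead of OR-ing bits into the pointer directly.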
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 339e4c1c044d..fadc6dbe17d6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -553,7 +553,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
if (!page)
return;

- if (radix_tree_exceptional_entry(page))
+ if (xa_is_value(page))
mss->swap += PAGE_SIZE;
else
put_page(page);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c07169cfb44a..e4345c13e237 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -389,23 +389,41 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
struct page *page, void *fsdata);

+/**
+ * struct address_space - Contents of a cachable, mappable object
+ *
+ * @host: Owner, either the inode or the block_device
+ * @pages: Cached pages
+ * @gfp_mask: Memory allocation flags to use for allocating pages
+ * @i_mmap_writable: count VM_SHARED mappings
+ * @i_mmap: tree of private and shared mappings
+ * @i_mmap_rwsem: Protects @i_mmap and @i_mmap_writable
+ * @nrpages: Number of total pages, protected by pages.xa_lock
+ * @nrexceptional: Shadow or DAX entries, protected by pages.xa_lock
+ * @writeback_index: writeback starts here
+ * @a_ops: methods
+ * @flags: Error bits and flags (AS_*)
+ * @wb_err: The most recent error which has occurred
+ * @private_lock: For use by the owner of the address_space
+ * @private_list: For use by the owner of the address_space
+ * @private_data: For use by the owner of the address_space
+ */
struct address_space {
- struct inode *host; /* owner: inode, block_device */
- struct radix_tree_root pages; /* cached pages */
- gfp_t gfp_mask; /* for allocating pages */
- atomic_t i_mmap_writable;/* count VM_SHARED mappings */
- struct rb_root_cached i_mmap; /* tree of private and shared mappings */
- struct rw_semaphore i_mmap_rwsem; /* protect tree, count, list */
- /* Protected by pages.xa_lock */
- unsigned long nrpages; /* number of total pages */
- unsigned long nrexceptional; /* shadow or DAX entries */
- pgoff_t writeback_index;/* writeback starts here */
- const struct address_space_operations *a_ops; /* methods */
- unsigned long flags; /* error bits */
+ struct inode *host;
+ struct radix_tree_root pages;
+ gfp_t gfp_mask;
+ atomic_t i_mmap_writable;
+ struct rb_root_cached i_mmap;
+ struct rw_semaphore i_mmap_rwsem;
+ unsigned long nrpages;
+ unsigned long nrexceptional;
+ pgoff_t writeback_index;
+ const struct address_space_operations *a_ops;
+ unsigned long flags;
errseq_t wb_err;
- spinlock_t private_lock; /* for use by the address_space */
- struct list_head private_list; /* ditto */
- void *private_data; /* ditto */
+ spinlock_t private_lock;
+ struct list_head private_list;
+ void *private_data;
} __attribute__((aligned(sizeof(long)))) __randomize_layout;
/*
* On most architectures that alignment is already the case; but
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index d2253b540cd7..5130f44d9f93 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -28,34 +28,26 @@
#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/types.h>
+#include <linux/xarray.h>

/*
* The bottom two bits of the slot determine how the remaining bits in the
* slot are interpreted:
*
* 00 - data pointer
- * 01 - internal entry
- * 10 - exceptional entry
- * 11 - this bit combination is currently unused/reserved
+ * 10 - internal entry
+ * x1 - data value entry
*
* The internal entry may be a pointer to the next level in the tree, a
* sibling entry, or an indicator that the entry in this slot has been moved
* to another location in the tree and the lookup should be restarted. While
* NULL fits the 'data pointer' pattern, it means that there is no entry in
* the tree for this index (no matter what level of the tree it is found at).
- * This means that you cannot store NULL in the tree as a value for the index.
+ * This means that storing a NULL entry in the tree is the same as deleting
+ * the entry from the tree.
*/
#define RADIX_TREE_ENTRY_MASK 3UL
-#define RADIX_TREE_INTERNAL_NODE 1UL
-
-/*
- * Most users of the radix tree store pointers but shmem/tmpfs stores swap
- * entries in the same tree. They are marked as exceptional entries to
- * distinguish them from pointers to struct page.
- * EXCEPTIONAL_ENTRY tests the bit, EXCEPTIONAL_SHIFT shifts content past it.
- */
-#define RADIX_TREE_EXCEPTIONAL_ENTRY 2
-#define RADIX_TREE_EXCEPTIONAL_SHIFT 2
+#define RADIX_TREE_INTERNAL_NODE 2UL

static inline bool radix_tree_is_internal_node(void *ptr)
{
@@ -83,11 +75,10 @@ static inline bool radix_tree_is_internal_node(void *ptr)

/*
* @count is the count of every non-NULL element in the ->slots array
- * whether that is an exceptional entry, a retry entry, a user pointer,
+ * whether that is a data value entry, a retry entry, a user pointer,
* a sibling entry or a pointer to the next level of the tree.
* @exceptional is the count of every element in ->slots which is
- * either radix_tree_exceptional_entry() or is a sibling entry for an
- * exceptional entry.
+ * either a data value entry or a sibling entry for a data value entry.
*/
struct radix_tree_node {
unsigned char shift; /* Bits remaining in each slot */
@@ -267,17 +258,6 @@ static inline int radix_tree_deref_retry(void *arg)
return unlikely(radix_tree_is_internal_node(arg));
}

-/**
- * radix_tree_exceptional_entry - radix_tree_deref_slot gave exceptional entry?
- * @arg: value returned by radix_tree_deref_slot
- * Returns: 0 if well-aligned pointer, non-0 if exceptional entry.
- */
-static inline int radix_tree_exceptional_entry(void *arg)
-{
- /* Not unlikely because radix_tree_exception often tested first */
- return (unsigned long)arg & RADIX_TREE_EXCEPTIONAL_ENTRY;
-}
-
/**
* radix_tree_exception - radix_tree_deref_slot returned either exception?
* @arg: value returned by radix_tree_deref_slot
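
Reviewer note on the new slot encoding in radix-tree.h: with the bottom bit reserved for data values and 10 marking internal entries, a slot can be classified as sketched below. This is an illustrative userspace sketch only; enum slot_kind and classify_slot() are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define RADIX_TREE_ENTRY_MASK    3UL
#define RADIX_TREE_INTERNAL_NODE 2UL

/* Hypothetical classification of a slot's contents. */
enum slot_kind { SLOT_EMPTY, SLOT_POINTER, SLOT_INTERNAL, SLOT_VALUE };

static inline enum slot_kind classify_slot(const void *entry)
{
	unsigned long bits = (unsigned long)entry;

	if (!entry)
		return SLOT_EMPTY;	/* NULL: no entry at this index */
	if (bits & 1)
		return SLOT_VALUE;	/* x1: data value entry */
	if ((bits & RADIX_TREE_ENTRY_MASK) == RADIX_TREE_INTERNAL_NODE)
		return SLOT_INTERNAL;	/* 10: node, sibling or retry entry */
	return SLOT_POINTER;		/* 00: ordinary data pointer */
}
```

Under this encoding 0x12 classifies as an internal entry while 0x10 is a plain pointer, which appears to be why DUMMY_PTR changes from 0x12 to 0x10 in idr-test.c later in this patch.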
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 9c5a2628d6ce..5e93c7b500da 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -17,9 +17,8 @@
*
* swp_entry_t's are *never* stored anywhere in their arch-dependent format.
*/
-#define SWP_TYPE_SHIFT(e) ((sizeof(e.val) * 8) - \
- (MAX_SWAPFILES_SHIFT + RADIX_TREE_EXCEPTIONAL_SHIFT))
-#define SWP_OFFSET_MASK(e) ((1UL << SWP_TYPE_SHIFT(e)) - 1)
+#define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
+#define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)

/*
* Store a type+offset into a swp_entry_t in an arch-independent format
@@ -28,8 +27,7 @@ static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
{
swp_entry_t ret;

- ret.val = (type << SWP_TYPE_SHIFT(ret)) |
- (offset & SWP_OFFSET_MASK(ret));
+ ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK);
return ret;
}

@@ -39,7 +37,7 @@ static inline swp_entry_t swp_entry(unsigned long type, pgoff_t offset)
*/
static inline unsigned swp_type(swp_entry_t entry)
{
- return (entry.val >> SWP_TYPE_SHIFT(entry));
+ return (entry.val >> SWP_TYPE_SHIFT);
}

/*
@@ -48,7 +46,7 @@ static inline unsigned swp_type(swp_entry_t entry)
*/
static inline pgoff_t swp_offset(swp_entry_t entry)
{
- return entry.val & SWP_OFFSET_MASK(entry);
+ return entry.val & SWP_OFFSET_MASK;
}

#ifdef CONFIG_MMU
@@ -89,16 +87,13 @@ static inline swp_entry_t radix_to_swp_entry(void *arg)
{
swp_entry_t entry;

- entry.val = (unsigned long)arg >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+ entry.val = xa_to_value(arg);
return entry;
}

static inline void *swp_to_radix_entry(swp_entry_t entry)
{
- unsigned long value;
-
- value = entry.val << RADIX_TREE_EXCEPTIONAL_SHIFT;
- return (void *)(value | RADIX_TREE_EXCEPTIONAL_ENTRY);
+ return xa_mk_value(entry.val);
}

#if IS_ENABLED(CONFIG_DEVICE_PRIVATE)
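
Reviewer note on the swapops.h change: because a value entry now carries BITS_PER_LONG - 1 bits, the swap type can sit one bit higher than before and the macros no longer need the entry argument. A rough userspace sketch of the packing follows. MAX_SWAPFILES_SHIFT is assumed to be 5 here, and pgoff_t is replaced by unsigned long; treat both as assumptions rather than as the kernel's definitions.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_XA_VALUE   (sizeof(unsigned long) * CHAR_BIT - 1)
#define MAX_SWAPFILES_SHIFT 5	/* assumption for this sketch */
#define SWP_TYPE_SHIFT      (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK     ((1UL << SWP_TYPE_SHIFT) - 1)

typedef struct { unsigned long val; } swp_entry_t;

/* Pack type into the high bits of the value and offset into the low bits. */
static inline swp_entry_t swp_entry(unsigned long type, unsigned long offset)
{
	swp_entry_t ret;

	ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK);
	return ret;
}

static inline unsigned int swp_type(swp_entry_t entry)
{
	return entry.val >> SWP_TYPE_SHIFT;
}

static inline unsigned long swp_offset(swp_entry_t entry)
{
	return entry.val & SWP_OFFSET_MASK;
}

/* Same encoding as xa_mk_value(): shift left by one, set bit 0. */
static inline void *swp_to_radix_entry(swp_entry_t entry)
{
	return (void *)((entry.val << 1) | 1);
}

/* Same decoding as xa_to_value(): drop the tag bit. */
static inline swp_entry_t radix_to_swp_entry(const void *arg)
{
	swp_entry_t entry;

	entry.val = (unsigned long)arg >> 1;
	return entry;
}
```

With the largest type and offset the packed value still has its top bit clear, so it survives the one-bit shift in swp_to_radix_entry() without loss.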
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index a5a933925b85..e55f5cfd14ed 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -14,9 +14,49 @@
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
+ *
+ * See Documentation/core-api/xarray.rst for an overview of the XArray.
*/

+#include <linux/bug.h>
#include <linux/spinlock.h>
+#include <linux/types.h>
+
+#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)
+
+/**
+ * xa_mk_value() - Create an XArray entry from a data value.
+ * @v: Value to store in XArray.
+ *
+ * Return: An entry suitable for storing in the XArray.
+ */
+static inline void *xa_mk_value(unsigned long v)
+{
+ WARN_ON((long)v < 0);
+ return (void *)((v << 1) | 1);
+}
+
+/**
+ * xa_to_value() - Get value stored in an XArray entry.
+ * @entry: XArray entry.
+ *
+ * Return: The value stored in the XArray entry.
+ */
+static inline unsigned long xa_to_value(void *entry)
+{
+ return (unsigned long)entry >> 1;
+}
+
+/**
+ * xa_is_value() - Determine if an entry is a data value.
+ * @entry: XArray entry.
+ *
+ * Return: True if the entry is a data value, false if it is a pointer.
+ */
+static inline bool xa_is_value(void *entry)
+{
+ return (unsigned long)entry & 1;
+}

#define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
#define xa_lock(xa) spin_lock(&(xa)->xa_lock)
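
Reviewer note on the three helpers added to xarray.h: the encoding is a one-bit tag in the least significant bit, which leaves BITS_PER_LONG - 1 bits for the payload. The userspace sketch below mirrors the helpers so the round trip can be checked outside the kernel; it swaps WARN_ON() for assert() and assumes that unsigned long is as wide as a pointer.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define BITS_PER_LONG     (sizeof(unsigned long) * CHAR_BIT)
#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)

/* Tag an integer as a value entry: shift left by one and set bit 0. */
static inline void *xa_mk_value(unsigned long v)
{
	assert((long)v >= 0);	/* the top bit would be lost by the shift */
	return (void *)((v << 1) | 1);
}

/* Recover the integer by dropping the tag bit. */
static inline unsigned long xa_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}

/* Aligned pointers have bit 0 clear, value entries have it set. */
static inline bool xa_is_value(const void *entry)
{
	return (unsigned long)entry & 1;
}
```

One consequence worth noting is that xa_mk_value(0) is not NULL, so a stored zero is distinguishable from an empty slot.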
diff --git a/lib/idr.c b/lib/idr.c
index 2cd9429e11e3..48c53890adc0 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -3,6 +3,7 @@
#include <linux/idr.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
+#include <linux/xarray.h>

DEFINE_PER_CPU(struct ida_bitmap *, ida_bitmap);
static DEFINE_SPINLOCK(simple_ida_lock);
@@ -273,11 +274,8 @@ EXPORT_SYMBOL(idr_replace);
* by the number of bits in the leaf bitmap before doing a radix tree lookup.
*
* As an optimisation, if there are only a few low bits set in any given
- * leaf, instead of allocating a 128-byte bitmap, we use the 'exceptional
- * entry' functionality of the radix tree to store BITS_PER_LONG - 2 bits
- * directly in the entry. By being really tricksy, we could store
- * BITS_PER_LONG - 1 bits, but there're diminishing returns after optimising
- * for 0-3 allocated IDs.
+ * leaf, instead of allocating a 128-byte bitmap, we store the bits
+ * directly in the entry.
*
* We allow the radix tree 'exceptional' count to get out of date. Nothing
* in the IDA nor the radix tree code checks it. If it becomes important
@@ -319,12 +317,11 @@ int ida_get_new_above(struct ida *ida, int start, int *id)
struct radix_tree_iter iter;
struct ida_bitmap *bitmap;
unsigned long index;
- unsigned bit, ebit;
+ unsigned bit;
int new;

index = start / IDA_BITMAP_BITS;
bit = start % IDA_BITMAP_BITS;
- ebit = bit + RADIX_TREE_EXCEPTIONAL_SHIFT;

slot = radix_tree_iter_init(&iter, index);
for (;;) {
@@ -339,26 +336,25 @@ int ida_get_new_above(struct ida *ida, int start, int *id)
return PTR_ERR(slot);
}
}
- if (iter.index > index) {
+ if (iter.index > index)
bit = 0;
- ebit = RADIX_TREE_EXCEPTIONAL_SHIFT;
- }
new = iter.index * IDA_BITMAP_BITS;
bitmap = rcu_dereference_raw(*slot);
- if (radix_tree_exception(bitmap)) {
- unsigned long tmp = (unsigned long)bitmap;
- ebit = find_next_zero_bit(&tmp, BITS_PER_LONG, ebit);
- if (ebit < BITS_PER_LONG) {
- tmp |= 1UL << ebit;
- rcu_assign_pointer(*slot, (void *)tmp);
- *id = new + ebit - RADIX_TREE_EXCEPTIONAL_SHIFT;
+ if (xa_is_value(bitmap)) {
+ unsigned long tmp = xa_to_value(bitmap);
+ int vbit = find_next_zero_bit(&tmp, BITS_PER_XA_VALUE,
+ bit);
+ if (vbit < BITS_PER_XA_VALUE) {
+ tmp |= 1UL << vbit;
+ rcu_assign_pointer(*slot, xa_mk_value(tmp));
+ *id = new + vbit;
return 0;
}
bitmap = this_cpu_xchg(ida_bitmap, NULL);
if (!bitmap)
return -EAGAIN;
memset(bitmap, 0, sizeof(*bitmap));
- bitmap->bitmap[0] = tmp >> RADIX_TREE_EXCEPTIONAL_SHIFT;
+ bitmap->bitmap[0] = tmp;
rcu_assign_pointer(*slot, bitmap);
}

@@ -379,19 +375,15 @@ int ida_get_new_above(struct ida *ida, int start, int *id)
new += bit;
if (new < 0)
return -ENOSPC;
- if (ebit < BITS_PER_LONG) {
- bitmap = (void *)((1UL << ebit) |
- RADIX_TREE_EXCEPTIONAL_ENTRY);
- radix_tree_iter_replace(root, &iter, slot,
- bitmap);
- *id = new;
- return 0;
+ if (bit < BITS_PER_XA_VALUE) {
+ bitmap = xa_mk_value(1UL << bit);
+ } else {
+ bitmap = this_cpu_xchg(ida_bitmap, NULL);
+ if (!bitmap)
+ return -EAGAIN;
+ memset(bitmap, 0, sizeof(*bitmap));
+ __set_bit(bit, bitmap->bitmap);
}
- bitmap = this_cpu_xchg(ida_bitmap, NULL);
- if (!bitmap)
- return -EAGAIN;
- memset(bitmap, 0, sizeof(*bitmap));
- __set_bit(bit, bitmap->bitmap);
radix_tree_iter_replace(root, &iter, slot, bitmap);
}

@@ -422,9 +414,9 @@ void ida_remove(struct ida *ida, int id)
goto err;

bitmap = rcu_dereference_raw(*slot);
- if (radix_tree_exception(bitmap)) {
+ if (xa_is_value(bitmap)) {
btmp = (unsigned long *)slot;
- offset += RADIX_TREE_EXCEPTIONAL_SHIFT;
+ offset += 1; /* Intimate knowledge of the value encoding */
if (offset >= BITS_PER_LONG)
goto err;
} else {
@@ -435,9 +427,8 @@ void ida_remove(struct ida *ida, int id)

__clear_bit(offset, btmp);
radix_tree_iter_tag_set(&ida->ida_rt, &iter, IDR_FREE);
- if (radix_tree_exception(bitmap)) {
- if (rcu_dereference_raw(*slot) ==
- (void *)RADIX_TREE_EXCEPTIONAL_ENTRY)
+ if (xa_is_value(bitmap)) {
+ if (xa_to_value(rcu_dereference_raw(*slot)) == 0)
radix_tree_iter_delete(&ida->ida_rt, &iter, slot);
} else if (bitmap_empty(btmp, IDA_BITMAP_BITS)) {
kfree(bitmap);
@@ -465,7 +456,7 @@ void ida_destroy(struct ida *ida)

radix_tree_for_each_slot(slot, &ida->ida_rt, &iter, 0) {
struct ida_bitmap *bitmap = rcu_dereference_raw(*slot);
- if (!radix_tree_exception(bitmap))
+ if (!xa_is_value(bitmap))
kfree(bitmap);
radix_tree_iter_delete(&ida->ida_rt, &iter, slot);
}
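
Reviewer note on the lib/idr.c change: the IDA now keeps up to BITS_PER_XA_VALUE IDs per leaf as bits inside a value entry before falling back to a full bitmap. A simplified userspace sketch of just that small-bitmap path follows. ida_value_alloc() and ida_value_free() are hypothetical names, the slot is a plain pointer rather than a radix tree slot, and the fallback to struct ida_bitmap is reduced to returning -1.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_XA_VALUE (sizeof(unsigned long) * CHAR_BIT - 1)

static inline void *xa_mk_value(unsigned long v)
{
	return (void *)((v << 1) | 1);
}

static inline unsigned long xa_to_value(const void *entry)
{
	return (unsigned long)entry >> 1;
}

/*
 * Set the first clear bit at or above @start and return its number,
 * or -1 if the value entry is full (the kernel would then convert
 * the slot to a full bitmap).
 */
static inline int ida_value_alloc(void **slot, unsigned int start)
{
	unsigned long bits = *slot ? xa_to_value(*slot) : 0;
	unsigned int bit;

	for (bit = start; bit < BITS_PER_XA_VALUE; bit++) {
		if (!(bits & (1UL << bit))) {
			*slot = xa_mk_value(bits | (1UL << bit));
			return (int)bit;
		}
	}
	return -1;
}

/* Clear @bit; an entry with no bits left is removed from the slot. */
static inline void ida_value_free(void **slot, unsigned int bit)
{
	unsigned long bits;

	assert(*slot && bit < BITS_PER_XA_VALUE);
	bits = xa_to_value(*slot) & ~(1UL << bit);
	*slot = bits ? xa_mk_value(bits) : NULL;
}
```

The sketch works on the decoded value throughout, whereas ida_remove() in the patch clears the bit in place at offset + 1, which is the one spot that depends on the encoding directly.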
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 5334ec9918c8..cda7a730e591 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -339,14 +339,12 @@ static void dump_ida_node(void *entry, unsigned long index)
for (i = 0; i < RADIX_TREE_MAP_SIZE; i++)
dump_ida_node(node->slots[i],
index | (i << node->shift));
- } else if (radix_tree_exceptional_entry(entry)) {
+ } else if (xa_is_value(entry)) {
pr_debug("ida excp: %p offset %d indices %lu-%lu data %lx\n",
entry, (int)(index & RADIX_TREE_MAP_MASK),
index * IDA_BITMAP_BITS,
- index * IDA_BITMAP_BITS + BITS_PER_LONG -
- RADIX_TREE_EXCEPTIONAL_SHIFT,
- (unsigned long)entry >>
- RADIX_TREE_EXCEPTIONAL_SHIFT);
+ index * IDA_BITMAP_BITS + BITS_PER_XA_VALUE,
+ xa_to_value(entry));
} else {
struct ida_bitmap *bitmap = entry;

@@ -655,7 +653,7 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
BUG_ON(shift > BITS_PER_LONG);
if (radix_tree_is_internal_node(entry)) {
entry_to_node(entry)->parent = node;
- } else if (radix_tree_exceptional_entry(entry)) {
+ } else if (xa_is_value(entry)) {
/* Moving an exceptional root->rnode to a node */
node->exceptional = 1;
}
@@ -946,12 +944,12 @@ static inline int insert_entries(struct radix_tree_node *node,
!is_sibling_entry(node, old) &&
(old != RADIX_TREE_RETRY))
radix_tree_free_nodes(old);
- if (radix_tree_exceptional_entry(old))
+ if (xa_is_value(old))
node->exceptional--;
}
if (node) {
node->count += n;
- if (radix_tree_exceptional_entry(item))
+ if (xa_is_value(item))
node->exceptional += n;
}
return n;
@@ -965,7 +963,7 @@ static inline int insert_entries(struct radix_tree_node *node,
rcu_assign_pointer(*slot, item);
if (node) {
node->count++;
- if (radix_tree_exceptional_entry(item))
+ if (xa_is_value(item))
node->exceptional++;
}
return 1;
@@ -1182,8 +1180,7 @@ void __radix_tree_replace(struct radix_tree_root *root,
radix_tree_update_node_t update_node)
{
void *old = rcu_dereference_raw(*slot);
- int exceptional = !!radix_tree_exceptional_entry(item) -
- !!radix_tree_exceptional_entry(old);
+ int exceptional = !!xa_is_value(item) - !!xa_is_value(old);
int count = calculate_count(root, node, slot, item, old);

/*
@@ -1986,7 +1983,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
struct radix_tree_node *node, void __rcu **slot)
{
void *old = rcu_dereference_raw(*slot);
- int exceptional = radix_tree_exceptional_entry(old) ? -1 : 0;
+ int exceptional = xa_is_value(old) ? -1 : 0;
unsigned offset = get_slot_offset(node, slot);
int tag;

diff --git a/mm/filemap.c b/mm/filemap.c
index 5c8f22fe4e62..1d012dd3629e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -127,7 +127,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
void *p;

p = radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);
- if (!radix_tree_exceptional_entry(p))
+ if (!xa_is_value(p))
return -EEXIST;

mapping->nrexceptional--;
@@ -336,7 +336,7 @@ page_cache_tree_delete_batch(struct address_space *mapping,
break;
page = radix_tree_deref_slot_protected(slot,
&mapping->pages.xa_lock);
- if (radix_tree_exceptional_entry(page))
+ if (xa_is_value(page))
continue;
if (!tail_pages) {
/*
@@ -1355,7 +1355,7 @@ pgoff_t page_cache_next_hole(struct address_space *mapping,
struct page *page;

page = radix_tree_lookup(&mapping->pages, index);
- if (!page || radix_tree_exceptional_entry(page))
+ if (!page || xa_is_value(page))
break;
index++;
if (index == 0)
@@ -1396,7 +1396,7 @@ pgoff_t page_cache_prev_hole(struct address_space *mapping,
struct page *page;

page = radix_tree_lookup(&mapping->pages, index);
- if (!page || radix_tree_exceptional_entry(page))
+ if (!page || xa_is_value(page))
break;
index--;
if (index == ULONG_MAX)
@@ -1539,7 +1539,7 @@ struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,

repeat:
page = find_get_entry(mapping, offset);
- if (radix_tree_exceptional_entry(page))
+ if (xa_is_value(page))
page = NULL;
if (!page)
goto no_page;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index cb4d199bf328..55ade70c33bb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1363,7 +1363,7 @@ static void collapse_shmem(struct mm_struct *mm,

page = radix_tree_deref_slot_protected(slot,
&mapping->pages.xa_lock);
- if (radix_tree_exceptional_entry(page) || !PageUptodate(page)) {
+ if (xa_is_value(page) || !PageUptodate(page)) {
xa_unlock_irq(&mapping->pages);
/* swap in or instantiate fallocated page */
if (shmem_getpage(mapping->host, index, &page,
diff --git a/mm/madvise.c b/mm/madvise.c
index 751e97aa2210..83f8a1a8e6b5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -251,7 +251,7 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
index = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

page = find_get_entry(mapping, index);
- if (!radix_tree_exceptional_entry(page)) {
+ if (!xa_is_value(page)) {
if (page)
put_page(page);
continue;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 16af05a3ec6e..d9cc1bfe6a48 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4526,7 +4526,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
/* shmem/tmpfs may report page out on swap: account for that too. */
if (shmem_mapping(mapping)) {
page = find_get_entry(mapping, pgoff);
- if (radix_tree_exceptional_entry(page)) {
+ if (xa_is_value(page)) {
swp_entry_t swp = radix_to_swp_entry(page);
if (do_memsw_account())
*entry = swp;
diff --git a/mm/mincore.c b/mm/mincore.c
index fc37afe226e6..4985965aa20a 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -66,7 +66,7 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
* shmem/tmpfs may return swap: account for swapcache
* page too.
*/
- if (radix_tree_exceptional_entry(page)) {
+ if (xa_is_value(page)) {
swp_entry_t swp = radix_to_swp_entry(page);
page = find_get_page(swap_address_space(swp),
swp_offset(swp));
diff --git a/mm/readahead.c b/mm/readahead.c
index 514188fd2489..4851f002605f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -177,7 +177,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
rcu_read_lock();
page = radix_tree_lookup(&mapping->pages, page_offset);
rcu_read_unlock();
- if (page && !radix_tree_exceptional_entry(page))
+ if (page && !xa_is_value(page))
continue;

page = __page_cache_alloc(gfp_mask);
diff --git a/mm/shmem.c b/mm/shmem.c
index 9b1766e7c8cf..c5731bb954a1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -690,7 +690,7 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
continue;
}

- if (radix_tree_exceptional_entry(page))
+ if (xa_is_value(page))
swapped++;

if (need_resched()) {
@@ -805,7 +805,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
if (index >= end)
break;

- if (radix_tree_exceptional_entry(page)) {
+ if (xa_is_value(page)) {
if (unfalloc)
continue;
nr_swaps_freed += !shmem_free_swap(mapping,
@@ -902,7 +902,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
if (index >= end)
break;

- if (radix_tree_exceptional_entry(page)) {
+ if (xa_is_value(page)) {
if (unfalloc)
continue;
if (shmem_free_swap(mapping, index, page)) {
@@ -1614,7 +1614,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
repeat:
swap.val = 0;
page = find_lock_entry(mapping, index);
- if (radix_tree_exceptional_entry(page)) {
+ if (xa_is_value(page)) {
swap = radix_to_swp_entry(page);
page = NULL;
}
@@ -2547,7 +2547,7 @@ static pgoff_t shmem_seek_hole_data(struct address_space *mapping,
index = indices[i];
}
page = pvec.pages[i];
- if (page && !radix_tree_exceptional_entry(page)) {
+ if (page && !xa_is_value(page)) {
if (!PageUptodate(page))
page = NULL;
}
diff --git a/mm/swap.c b/mm/swap.c
index 38e1b6374a97..8d7773cb2c3f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -953,7 +953,7 @@ void pagevec_remove_exceptionals(struct pagevec *pvec)

for (i = 0, j = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- if (!radix_tree_exceptional_entry(page))
+ if (!xa_is_value(page))
pvec->pages[j++] = page;
}
pvec->nr = j;
diff --git a/mm/truncate.c b/mm/truncate.c
index 094158f2e447..69bb743dd7e5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -70,7 +70,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
return;

for (j = 0; j < pagevec_count(pvec); j++)
- if (radix_tree_exceptional_entry(pvec->pages[j]))
+ if (xa_is_value(pvec->pages[j]))
break;

if (j == pagevec_count(pvec))
@@ -85,7 +85,7 @@ static void truncate_exceptional_pvec_entries(struct address_space *mapping,
struct page *page = pvec->pages[i];
pgoff_t index = indices[i];

- if (!radix_tree_exceptional_entry(page)) {
+ if (!xa_is_value(page)) {
pvec->pages[j++] = page;
continue;
}
@@ -351,7 +351,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (index >= end)
break;

- if (radix_tree_exceptional_entry(page))
+ if (xa_is_value(page))
continue;

if (!trylock_page(page))
@@ -446,7 +446,7 @@ void truncate_inode_pages_range(struct address_space *mapping,
break;
}

- if (radix_tree_exceptional_entry(page))
+ if (xa_is_value(page))
continue;

lock_page(page);
@@ -565,7 +565,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (index > end)
break;

- if (radix_tree_exceptional_entry(page)) {
+ if (xa_is_value(page)) {
invalidate_exceptional_entry(mapping, index,
page);
continue;
@@ -696,7 +696,7 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
if (index > end)
break;

- if (radix_tree_exceptional_entry(page)) {
+ if (xa_is_value(page)) {
if (!invalidate_exceptional_entry2(mapping,
index, page))
ret = -EBUSY;
diff --git a/mm/workingset.c b/mm/workingset.c
index 2d071f0df3af..0a3465700d5f 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -155,8 +155,8 @@
* refault distance will immediately activate the refaulting page.
*/

-#define EVICTION_SHIFT (RADIX_TREE_EXCEPTIONAL_ENTRY + \
- NODES_SHIFT + \
+#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
+ NODES_SHIFT + \
MEM_CGROUP_ID_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)

@@ -175,18 +175,16 @@ static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction)
eviction >>= bucket_order;
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
- eviction = (eviction << RADIX_TREE_EXCEPTIONAL_SHIFT);

- return (void *)(eviction | RADIX_TREE_EXCEPTIONAL_ENTRY);
+ return xa_mk_value(eviction);
}

static void unpack_shadow(void *shadow, int *memcgidp, pg_data_t **pgdat,
unsigned long *evictionp)
{
- unsigned long entry = (unsigned long)shadow;
+ unsigned long entry = xa_to_value(shadow);
int memcgid, nid;

- entry >>= RADIX_TREE_EXCEPTIONAL_SHIFT;
nid = entry & ((1UL << NODES_SHIFT) - 1);
entry >>= NODES_SHIFT;
memcgid = entry & ((1UL << MEM_CGROUP_ID_SHIFT) - 1);
@@ -453,7 +451,7 @@ static enum lru_status shadow_lru_isolate(struct list_head *item,
goto out_invalid;
for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
if (node->slots[i]) {
- if (WARN_ON_ONCE(!radix_tree_exceptional_entry(node->slots[i])))
+ if (WARN_ON_ONCE(!xa_is_value(node->slots[i])))
goto out_invalid;
if (WARN_ON_ONCE(!node->exceptional))
goto out_invalid;
diff --git a/tools/testing/radix-tree/idr-test.c b/tools/testing/radix-tree/idr-test.c
index 193450b29bf0..7499319e85f8 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -19,7 +19,7 @@

#include "test.h"

-#define DUMMY_PTR ((void *)0x12)
+#define DUMMY_PTR ((void *)0x10)

int item_idr_free(int id, void *p, void *data)
{
@@ -320,11 +320,11 @@ void ida_check_conv(void)
for (i = 0; i < 1000000; i++) {
int err = ida_get_new(&ida, &id);
if (err == -EAGAIN) {
- assert((i % IDA_BITMAP_BITS) == (BITS_PER_LONG - 2));
+ assert((i % IDA_BITMAP_BITS) == (BITS_PER_LONG - 1));
assert(ida_pre_get(&ida, GFP_KERNEL));
err = ida_get_new(&ida, &id);
} else {
- assert((i % IDA_BITMAP_BITS) != (BITS_PER_LONG - 2));
+ assert((i % IDA_BITMAP_BITS) != (BITS_PER_LONG - 1));
}
assert(!err);
assert(id == i);
diff --git a/tools/testing/radix-tree/linux/radix-tree.h b/tools/testing/radix-tree/linux/radix-tree.h
index 36fb716d5557..40c9671ee365 100644
--- a/tools/testing/radix-tree/linux/radix-tree.h
+++ b/tools/testing/radix-tree/linux/radix-tree.h
@@ -5,6 +5,7 @@
#include "generated/map-shift.h"
#include "linux/bug.h"
#include "../../../../include/linux/radix-tree.h"
+#include <linux/xarray.h>

extern int kmalloc_verbose;
extern int test_verbose;
diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
index 59245b3d587c..684e76f79f4a 100644
--- a/tools/testing/radix-tree/multiorder.c
+++ b/tools/testing/radix-tree/multiorder.c
@@ -38,12 +38,11 @@ static void __multiorder_tag_test(int index, int order)

/*
* Verify we get collisions for covered indices. We try and fail to
- * insert an exceptional entry so we don't leak memory via
+ * insert a data entry so we don't leak memory via
* item_insert_order().
*/
for_each_index(i, base, order) {
- err = __radix_tree_insert(&tree, i, order,
- (void *)(0xA0 | RADIX_TREE_EXCEPTIONAL_ENTRY));
+ err = __radix_tree_insert(&tree, i, order, xa_mk_value(0xA0));
assert(err == -EEXIST);
}

@@ -379,8 +378,8 @@ static void multiorder_join1(unsigned long index,
}

/*
- * Check that the accounting of exceptional entries is handled correctly
- * by joining an exceptional entry to a normal pointer.
+ * Check that the accounting of inline data entries is handled correctly
+ * by joining a data entry to a normal pointer.
*/
static void multiorder_join2(unsigned order1, unsigned order2)
{
@@ -390,9 +389,9 @@ static void multiorder_join2(unsigned order1, unsigned order2)
void *item2;

item_insert_order(&tree, 0, order2);
- radix_tree_insert(&tree, 1 << order2, (void *)0x12UL);
+ radix_tree_insert(&tree, 1 << order2, xa_mk_value(5));
item2 = __radix_tree_lookup(&tree, 1 << order2, &node, NULL);
- assert(item2 == (void *)0x12UL);
+ assert(item2 == xa_mk_value(5));
assert(node->exceptional == 1);

item2 = radix_tree_lookup(&tree, 0);
@@ -406,7 +405,7 @@ static void multiorder_join2(unsigned order1, unsigned order2)
}

/*
- * This test revealed an accounting bug for exceptional entries at one point.
+ * This test revealed an accounting bug for inline data entries at one point.
* Nodes were being freed back into the pool with an elevated exception count
* by radix_tree_join() and then radix_tree_split() was failing to zero the
* count of exceptional entries.
@@ -420,16 +419,16 @@ static void multiorder_join3(unsigned int order)
unsigned long i;

for (i = 0; i < (1 << order); i++) {
- radix_tree_insert(&tree, i, (void *)0x12UL);
+ radix_tree_insert(&tree, i, xa_mk_value(5));
}

- radix_tree_join(&tree, 0, order, (void *)0x16UL);
+ radix_tree_join(&tree, 0, order, xa_mk_value(7));
rcu_barrier();

radix_tree_split(&tree, 0, 0);

radix_tree_for_each_slot(slot, &tree, &iter, 0) {
- radix_tree_iter_replace(&tree, &iter, slot, (void *)0x12UL);
+ radix_tree_iter_replace(&tree, &iter, slot, xa_mk_value(5));
}

__radix_tree_lookup(&tree, 0, &node, NULL);
@@ -516,10 +515,10 @@ static void __multiorder_split2(int old_order, int new_order)
struct radix_tree_node *node;
void *item;

- __radix_tree_insert(&tree, 0, old_order, (void *)0x12);
+ __radix_tree_insert(&tree, 0, old_order, xa_mk_value(5));

item = __radix_tree_lookup(&tree, 0, &node, NULL);
- assert(item == (void *)0x12);
+ assert(item == xa_mk_value(5));
assert(node->exceptional > 0);

radix_tree_split(&tree, 0, new_order);
@@ -529,7 +528,7 @@ static void __multiorder_split2(int old_order, int new_order)
}

item = __radix_tree_lookup(&tree, 0, &node, NULL);
- assert(item != (void *)0x12);
+ assert(item != xa_mk_value(5));
assert(node->exceptional == 0);

item_kill_tree(&tree);
@@ -543,40 +542,40 @@ static void __multiorder_split3(int old_order, int new_order)
struct radix_tree_node *node;
void *item;

- __radix_tree_insert(&tree, 0, old_order, (void *)0x12);
+ __radix_tree_insert(&tree, 0, old_order, xa_mk_value(5));

item = __radix_tree_lookup(&tree, 0, &node, NULL);
- assert(item == (void *)0x12);
+ assert(item == xa_mk_value(5));
assert(node->exceptional > 0);

radix_tree_split(&tree, 0, new_order);
radix_tree_for_each_slot(slot, &tree, &iter, 0) {
- radix_tree_iter_replace(&tree, &iter, slot, (void *)0x16);
+ radix_tree_iter_replace(&tree, &iter, slot, xa_mk_value(7));
}

item = __radix_tree_lookup(&tree, 0, &node, NULL);
- assert(item == (void *)0x16);
+ assert(item == xa_mk_value(7));
assert(node->exceptional > 0);

item_kill_tree(&tree);

- __radix_tree_insert(&tree, 0, old_order, (void *)0x12);
+ __radix_tree_insert(&tree, 0, old_order, xa_mk_value(5));

item = __radix_tree_lookup(&tree, 0, &node, NULL);
- assert(item == (void *)0x12);
+ assert(item == xa_mk_value(5));
assert(node->exceptional > 0);

radix_tree_split(&tree, 0, new_order);
radix_tree_for_each_slot(slot, &tree, &iter, 0) {
if (iter.index == (1 << new_order))
radix_tree_iter_replace(&tree, &iter, slot,
- (void *)0x16);
+ xa_mk_value(7));
else
radix_tree_iter_replace(&tree, &iter, slot, NULL);
}

item = __radix_tree_lookup(&tree, 1 << new_order, &node, NULL);
- assert(item == (void *)0x16);
+ assert(item == xa_mk_value(7));
assert(node->count == node->exceptional);
do {
node = node->parent;
@@ -609,13 +608,13 @@ static void multiorder_account(void)

item_insert_order(&tree, 0, 5);

- __radix_tree_insert(&tree, 1 << 5, 5, (void *)0x12);
+ __radix_tree_insert(&tree, 1 << 5, 5, xa_mk_value(5));
__radix_tree_lookup(&tree, 0, &node, NULL);
assert(node->count == node->exceptional * 2);
radix_tree_delete(&tree, 1 << 5);
assert(node->exceptional == 0);

- __radix_tree_insert(&tree, 1 << 5, 5, (void *)0x12);
+ __radix_tree_insert(&tree, 1 << 5, 5, xa_mk_value(5));
__radix_tree_lookup(&tree, 1 << 5, &node, &slot);
assert(node->count == node->exceptional * 2);
__radix_tree_replace(&tree, node, slot, NULL, NULL);
diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c
index 5978ab1f403d..0d69c49177c6 100644
--- a/tools/testing/radix-tree/test.c
+++ b/tools/testing/radix-tree/test.c
@@ -276,7 +276,7 @@ void item_kill_tree(struct radix_tree_root *root)
int nfound;

radix_tree_for_each_slot(slot, root, &iter, 0) {
- if (radix_tree_exceptional_entry(*slot))
+ if (xa_is_value(*slot))
radix_tree_delete(root, iter.index);
}

--
2.15.0
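
For readers following the shadow-entry conversion in mm/workingset.c above, here is a minimal user-space sketch of how a tagged value entry can carry a packed eviction record. It assumes the value tag is the low bit of the pointer-sized word, which is consistent with the EVICTION_SHIFT change in this patch but is not copied from the series. Every sketch_* name and both field widths are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical field widths; the kernel derives these from its config. */
#define SKETCH_NODES_SHIFT     6
#define SKETCH_MEMCG_ID_SHIFT 16

/* Tag an integer as a value entry: shift left by one, set bit 0. */
void *sketch_mk_value(unsigned long v)
{
	assert((long)v >= 0);	/* the top bit would be lost by the shift */
	return (void *)(((uintptr_t)v << 1) | 1);
}

/* Recover the integer stored in a value entry. */
unsigned long sketch_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* Aligned pointers have bit 0 clear, value entries have it set. */
bool sketch_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

/* Pack eviction counter, memcg id and node id into one value entry. */
void *sketch_pack_shadow(unsigned int memcgid, unsigned int nid,
			 unsigned long eviction)
{
	eviction = (eviction << SKETCH_MEMCG_ID_SHIFT) | memcgid;
	eviction = (eviction << SKETCH_NODES_SHIFT) | nid;
	return sketch_mk_value(eviction);
}

/* Reverse of sketch_pack_shadow(): peel the fields off low to high. */
void sketch_unpack_shadow(const void *shadow, unsigned int *memcgid,
			  unsigned int *nid, unsigned long *eviction)
{
	unsigned long entry = sketch_to_value(shadow);

	*nid = entry & ((1UL << SKETCH_NODES_SHIFT) - 1);
	entry >>= SKETCH_NODES_SHIFT;
	*memcgid = entry & ((1UL << SKETCH_MEMCG_ID_SHIFT) - 1);
	entry >>= SKETCH_MEMCG_ID_SHIFT;
	*eviction = entry;
}
```

Because the tag handling lives entirely in the mk/to helpers, the pack and unpack routines no longer need an explicit shift for the tag, which is the simplification the workingset hunk makes.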

2017-12-06 01:01:37

by Matthew Wilcox

Subject: [PATCH v4 29/73] mm: Convert page-writeback to XArray

From: Matthew Wilcox <[email protected]>

Includes moving mapping_tagged() to fs.h as a static inline, and
changing it to return bool.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/fs.h | 17 +++++++++------
mm/page-writeback.c | 62 +++++++++++++++++++----------------------------------
2 files changed, 32 insertions(+), 47 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e4345c13e237..c58bc3c619bf 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -470,15 +470,18 @@ struct block_device {
struct mutex bd_fsfreeze_mutex;
} __randomize_layout;

+/* XArray tags, for tagging dirty and writeback pages in the pagecache. */
+#define PAGECACHE_TAG_DIRTY XA_TAG_0
+#define PAGECACHE_TAG_WRITEBACK XA_TAG_1
+#define PAGECACHE_TAG_TOWRITE XA_TAG_2
+
/*
- * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
- * radix trees
+ * Returns true if any of the pages in the mapping are marked with the tag.
*/
-#define PAGECACHE_TAG_DIRTY 0
-#define PAGECACHE_TAG_WRITEBACK 1
-#define PAGECACHE_TAG_TOWRITE 2
-
-int mapping_tagged(struct address_space *mapping, int tag);
+static inline bool mapping_tagged(struct address_space *mapping, xa_tag_t tag)
+{
+ return xa_tagged(&mapping->pages, tag);
+}

static inline void i_mmap_lock_write(struct address_space *mapping)
{
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 588ce729d199..0407436a8305 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2098,33 +2098,25 @@ void __init page_writeback_init(void)
* dirty pages in the file (thus it is important for this function to be quick
* so that it can tag pages faster than a dirtying process can create them).
*/
-/*
- * We tag pages in batches of WRITEBACK_TAG_BATCH to reduce xa_lock latency.
- */
void tag_pages_for_writeback(struct address_space *mapping,
pgoff_t start, pgoff_t end)
{
-#define WRITEBACK_TAG_BATCH 4096
- unsigned long tagged = 0;
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, start);
+ unsigned int tagged = 0;
+ void *page;

- xa_lock_irq(&mapping->pages);
- radix_tree_for_each_tagged(slot, &mapping->pages, &iter, start,
- PAGECACHE_TAG_DIRTY) {
- if (iter.index > end)
- break;
- radix_tree_iter_tag_set(&mapping->pages, &iter,
- PAGECACHE_TAG_TOWRITE);
- tagged++;
- if ((tagged % WRITEBACK_TAG_BATCH) != 0)
+ xas_lock_irq(&xas);
+ xas_for_each_tag(&xas, page, end, PAGECACHE_TAG_DIRTY) {
+ xas_set_tag(&xas, PAGECACHE_TAG_TOWRITE);
+ if (++tagged % XA_CHECK_SCHED)
continue;
- slot = radix_tree_iter_resume(slot, &iter);
- xa_unlock_irq(&mapping->pages);
+
+ xas_pause(&xas);
+ xas_unlock_irq(&xas);
cond_resched();
- xa_lock_irq(&mapping->pages);
+ xas_lock_irq(&xas);
}
- xa_unlock_irq(&mapping->pages);
+ xas_unlock_irq(&xas);
}
EXPORT_SYMBOL(tag_pages_for_writeback);

@@ -2164,7 +2156,7 @@ int write_cache_pages(struct address_space *mapping,
pgoff_t done_index;
int cycled;
int range_whole = 0;
- int tag;
+ xa_tag_t tag;

pagevec_init(&pvec);
if (wbc->range_cyclic) {
@@ -2445,7 +2437,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,

/*
* For address_spaces which do not use buffers. Just tag the page as dirty in
- * its radix tree.
+ * the xarray.
*
* This is also used when a single buffer is being dirtied: we want to set the
* page dirty in that case, but not all the buffers. This is a "bottom-up"
@@ -2471,7 +2463,7 @@ int __set_page_dirty_nobuffers(struct page *page)
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->pages, page_index(page),
+ __xa_set_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
xa_unlock_irqrestore(&mapping->pages, flags);
unlock_page_memcg(page);
@@ -2634,13 +2626,13 @@ EXPORT_SYMBOL(__cancel_dirty_page);
* Returns true if the page was previously dirty.
*
* This is for preparing to put the page under writeout. We leave the page
- * tagged as dirty in the radix tree so that a concurrent write-for-sync
+ * tagged as dirty in the xarray so that a concurrent write-for-sync
* can discover it via a PAGECACHE_TAG_DIRTY walk. The ->writepage
* implementation will run either set_page_writeback() or set_page_dirty(),
- * at which stage we bring the page's dirty flag and radix-tree dirty tag
+ * at which stage we bring the page's dirty flag and xarray dirty tag
* back into sync.
*
- * This incoherency between the page's dirty flag and radix-tree tag is
+ * This incoherency between the page's dirty flag and xarray tag is
* unfortunate, but it only exists while the page is locked.
*/
int clear_page_dirty_for_io(struct page *page)
@@ -2721,7 +2713,7 @@ int test_clear_page_writeback(struct page *page)
xa_lock_irqsave(&mapping->pages, flags);
ret = TestClearPageWriteback(page);
if (ret) {
- radix_tree_tag_clear(&mapping->pages, page_index(page),
+ __xa_clear_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
struct bdi_writeback *wb = inode_to_wb(inode);
@@ -2773,7 +2765,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
on_wblist = mapping_tagged(mapping,
PAGECACHE_TAG_WRITEBACK);

- radix_tree_tag_set(&mapping->pages, page_index(page),
+ __xa_set_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
@@ -2787,10 +2779,10 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
sb_mark_inode_writeback(mapping->host);
}
if (!PageDirty(page))
- radix_tree_tag_clear(&mapping->pages, page_index(page),
+ __xa_clear_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_DIRTY);
if (!keep_write)
- radix_tree_tag_clear(&mapping->pages, page_index(page),
+ __xa_clear_tag(&mapping->pages, page_index(page),
PAGECACHE_TAG_TOWRITE);
xa_unlock_irqrestore(&mapping->pages, flags);
} else {
@@ -2806,16 +2798,6 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
}
EXPORT_SYMBOL(__test_set_page_writeback);

-/*
- * Return true if any of the pages in the mapping are marked with the
- * passed tag.
- */
-int mapping_tagged(struct address_space *mapping, int tag)
-{
- return radix_tree_tagged(&mapping->pages, tag);
-}
-EXPORT_SYMBOL(mapping_tagged);
-
/**
* wait_for_stable_page() - wait for writeback to finish, if necessary.
* @page: The page to wait on.
--
2.15.0

2017-12-06 01:01:10

by Matthew Wilcox

Subject: [PATCH v4 32/73] mm: Convert add_to_swap_cache to XArray

From: Matthew Wilcox <[email protected]>

Combine __add_to_swap_cache and add_to_swap_cache into one function
since there is no more need to preload.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/swap_state.c | 93 ++++++++++++++++++---------------------------------------
1 file changed, 29 insertions(+), 64 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3f95e8fc4cb2..117b5da9dc01 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -107,14 +107,15 @@ void show_swap_cache_info(void)
}

/*
- * __add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
+ * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space,
* but sets SwapCache flag and private instead of mapping and index.
*/
-int __add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp)
{
- int error, i, nr = hpage_nr_pages(page);
- struct address_space *address_space;
+ struct address_space *address_space = swap_address_space(entry);
pgoff_t idx = swp_offset(entry);
+ XA_STATE(xas, &address_space->pages, idx);
+ unsigned int i, nr = 1U << compound_order(page);

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapCache(page), page);
@@ -123,50 +124,30 @@ int __add_to_swap_cache(struct page *page, swp_entry_t entry)
page_ref_add(page, nr);
SetPageSwapCache(page);

- address_space = swap_address_space(entry);
- xa_lock_irq(&address_space->pages);
- for (i = 0; i < nr; i++) {
- set_page_private(page + i, entry.val + i);
- error = radix_tree_insert(&address_space->pages,
- idx + i, page + i);
- if (unlikely(error))
- break;
- }
- if (likely(!error)) {
+ do {
+ xas_lock_irq(&xas);
+ xas_create_range(&xas, idx + nr - 1);
+ if (xas_error(&xas))
+ goto unlock;
+ for (i = 0; i < nr; i++) {
+ VM_BUG_ON_PAGE(xas.xa_index != idx + i, page);
+ set_page_private(page + i, entry.val + i);
+ xas_store(&xas, page + i);
+ xas_next(&xas);
+ }
address_space->nrpages += nr;
__mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
ADD_CACHE_INFO(add_total, nr);
- } else {
- /*
- * Only the context which have set SWAP_HAS_CACHE flag
- * would call add_to_swap_cache().
- * So add_to_swap_cache() doesn't returns -EEXIST.
- */
- VM_BUG_ON(error == -EEXIST);
- set_page_private(page + i, 0UL);
- while (i--) {
- radix_tree_delete(&address_space->pages, idx + i);
- set_page_private(page + i, 0UL);
- }
- ClearPageSwapCache(page);
- page_ref_sub(page, nr);
- }
- xa_unlock_irq(&address_space->pages);
+unlock:
+ xas_unlock_irq(&xas);
+ } while (xas_nomem(&xas, gfp));

- return error;
-}
-
-
-int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask)
-{
- int error;
+ if (!xas_error(&xas))
+ return 0;

- error = radix_tree_maybe_preload_order(gfp_mask, compound_order(page));
- if (!error) {
- error = __add_to_swap_cache(page, entry);
- radix_tree_preload_end();
- }
- return error;
+ ClearPageSwapCache(page);
+ page_ref_sub(page, nr);
+ return xas_error(&xas);
}

/*
@@ -220,7 +201,7 @@ int add_to_swap(struct page *page)
goto fail;

/*
- * Radix-tree node allocations from PF_MEMALLOC contexts could
+ * XArray node allocations from PF_MEMALLOC contexts could
* completely exhaust the page allocator. __GFP_NOMEMALLOC
* stops emergency reserves from being allocated.
*
@@ -232,7 +213,6 @@ int add_to_swap(struct page *page)
*/
err = add_to_swap_cache(page, entry,
__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);
- /* -ENOMEM radix-tree allocation failure */
if (err)
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
@@ -400,19 +380,11 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
break; /* Out of memory */
}

- /*
- * call radix_tree_preload() while we can wait.
- */
- err = radix_tree_maybe_preload(gfp_mask & GFP_KERNEL);
- if (err)
- break;
-
/*
* Swap entry may have been freed since our caller observed it.
*/
err = swapcache_prepare(entry);
if (err == -EEXIST) {
- radix_tree_preload_end();
/*
* We might race against get_swap_page() and stumble
* across a SWAP_HAS_CACHE swap_map entry whose page
@@ -420,26 +392,19 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
cond_resched();
continue;
- }
- if (err) { /* swp entry is obsolete ? */
- radix_tree_preload_end();
+ } else if (err) /* swp entry is obsolete ? */
break;
- }

- /* May fail (-ENOMEM) if radix-tree node allocation failed. */
+ /* May fail (-ENOMEM) if XArray node allocation failed. */
__SetPageLocked(new_page);
__SetPageSwapBacked(new_page);
- err = __add_to_swap_cache(new_page, entry);
+ err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
if (likely(!err)) {
- radix_tree_preload_end();
- /*
- * Initiate read into locked page and return.
- */
+ /* Initiate read into locked page */
lru_cache_add_anon(new_page);
*new_page_allocated = true;
return new_page;
}
- radix_tree_preload_end();
__ClearPageLocked(new_page);
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely
--
2.15.0
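
The do/while loop in add_to_swap_cache() above replaces preloading with a retry pattern: attempt the store under the lock, and if it fails for lack of memory, drop the lock, allocate, and go around again. Below is a minimal user-space sketch of that control flow. It is not the XArray implementation: all sketch_* names are hypothetical, a boolean stands in for the spinlock, and malloc() stands in for the node allocator.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the array: one backing node, one entry. */
struct sketch_array {
	bool locked;	/* stands in for the spinlock */
	void *node;	/* backing node, absent until the first store */
	void *entry;
	int attempts;	/* how many times the locked section ran */
};

/* Hypothetical stand-in for the operation state. */
struct sketch_state {
	int err;	/* 0 or a negative errno */
	void *spare;	/* node allocated outside the lock */
};

/* Runs with the lock held, so it may only consume the spare node. */
void sketch_store(struct sketch_array *xa, struct sketch_state *st,
		  void *entry)
{
	assert(xa->locked);
	xa->attempts++;
	if (!xa->node) {
		if (!st->spare) {
			st->err = -ENOMEM;
			return;
		}
		xa->node = st->spare;
		st->spare = NULL;
	}
	xa->entry = entry;
}

/* Runs without the lock. Returns true if the caller should retry. */
bool sketch_nomem(struct sketch_state *st)
{
	if (st->err != -ENOMEM) {
		free(st->spare);	/* unused spare, if any */
		st->spare = NULL;
		return false;
	}
	st->spare = malloc(64);
	if (!st->spare)
		return false;		/* give up, error stays set */
	st->err = 0;
	return true;
}

/* The retry loop: lock, try, unlock, allocate if needed, repeat. */
int sketch_insert(struct sketch_array *xa, void *entry)
{
	struct sketch_state st = { 0, NULL };

	do {
		xa->locked = true;
		sketch_store(xa, &st, entry);
		xa->locked = false;
	} while (sketch_nomem(&st));

	return st.err;
}
```

The first store into an empty sketch_array takes two passes through the locked section (one failing, one succeeding); later stores take one. The caller never has to preload or remember to release a preload.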

2017-12-06 01:00:48

by Matthew Wilcox

Subject: [PATCH v4 31/73] mm: Convert truncate to XArray

From: Matthew Wilcox <[email protected]>

This is essentially xa_cmpxchg() with the locking handled above us,
and it doesn't have to handle replacing a NULL entry.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/truncate.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/mm/truncate.c b/mm/truncate.c
index 69bb743dd7e5..70323c347298 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -33,15 +33,12 @@
static inline void __clear_shadow_entry(struct address_space *mapping,
pgoff_t index, void *entry)
{
- struct radix_tree_node *node;
- void **slot;
+ XA_STATE(xas, &mapping->pages, index);

- if (!__radix_tree_lookup(&mapping->pages, index, &node, &slot))
+ xas_set_update(&xas, workingset_update_node);
+ if (xas_load(&xas) != entry)
return;
- if (*slot != entry)
- return;
- __radix_tree_replace(&mapping->pages, node, slot, NULL,
- workingset_update_node);
+ xas_store(&xas, NULL);
mapping->nrexceptional--;
}

@@ -746,10 +743,10 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
index++;
}
/*
- * For DAX we invalidate page tables after invalidating radix tree. We
+ * For DAX we invalidate page tables after invalidating page cache. We
* could invalidate page tables while invalidating each entry however
* that would be expensive. And doing range unmapping before doesn't
- * work as we have no cheap way to find whether radix tree entry didn't
+ * work as we have no cheap way to find whether page cache entry didn't
* get remapped later.
*/
if (dax_mapping(mapping)) {
--
2.15.0
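
The commit message above describes __clear_shadow_entry() as a compare-and-exchange with the locking done by the caller. A minimal user-space sketch of that idea follows, assuming a flat slot array in place of the real tree. The sketch_* names and the slot count are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SLOTS 8

/* Hypothetical stand-in for the mapping: flat slots plus a counter. */
struct sketch_mapping {
	void *slots[SKETCH_SLOTS];
	unsigned long nrexceptional;
};

/*
 * Clear the slot only if it still holds the entry the caller saw.
 * The caller is assumed to hold the lock protecting the mapping, and
 * never passes NULL as the expected entry.
 */
bool sketch_clear_shadow(struct sketch_mapping *m, size_t index, void *entry)
{
	assert(index < SKETCH_SLOTS);
	assert(entry != NULL);

	if (m->slots[index] != entry)
		return false;	/* someone replaced it; leave it alone */
	m->slots[index] = NULL;
	m->nrexceptional--;
	return true;
}
```

The counter is decremented only on the path that actually removed the entry, mirroring the early return in the patch.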

2017-12-06 01:00:45

by Matthew Wilcox

Subject: [PATCH v4 26/73] page cache: Convert page cache lookups to XArray

From: Matthew Wilcox <[email protected]>

Introduce page_cache_pin() to factor out the common logic between the
various lookup routines:

find_get_entry
find_get_entries
find_get_pages_range
find_get_pages_contig
find_get_pages_range_tag
find_get_entries_tag
filemap_map_pages

By using the xa_state to control the iteration, we can remove most of
the gotos and just use the normal break/continue loop control flow.

Also convert the regression1 read-side to XArray since that simulates
the functions being modified here.
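
The logic that page_cache_pin() factors out is a speculative reference followed by a recheck of the slot. As a rough user-space illustration only, the sketch below uses C11 atomics. The sketch_* names are hypothetical, compound pages are ignored, and the real protocol is the one documented in include/linux/pagemap.h.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: just a reference count. */
struct sketch_page {
	atomic_int refcount;	/* 0 means the page is being freed */
};

/* Take a reference unless the count has already dropped to zero. */
bool sketch_get_speculative(struct sketch_page *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure, old is reloaded and the loop rechecks it. */
		if (atomic_compare_exchange_weak(&page->refcount, &old,
						 old + 1))
			return true;
	}
	return false;
}

/* Drop a reference taken by sketch_get_speculative(). */
void sketch_put(struct sketch_page *page)
{
	int old = atomic_fetch_sub(&page->refcount, 1);

	assert(old > 0);
}

/*
 * Pin the page previously found in *slot. Returns true, with the
 * refcount elevated, only if the slot still points at the same page
 * after the reference was taken. Otherwise any reference is dropped
 * and the caller is expected to look the slot up again.
 */
bool sketch_pin(_Atomic(struct sketch_page *) *slot, struct sketch_page *page)
{
	bool got = sketch_get_speculative(page);

	if (got && atomic_load(slot) == page)
		return true;
	if (got)
		sketch_put(page);
	return false;
}
```

Returning a plain boolean is what lets each lookup routine use break/continue in an ordinary loop instead of jumping back to a repeat label.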

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/pagemap.h | 6 +-
mm/filemap.c | 380 +++++++++------------------------
tools/testing/radix-tree/regression1.c | 68 +++---
3 files changed, 129 insertions(+), 325 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 34d4fa3ad1c5..1a59f4a5424a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -365,17 +365,17 @@ static inline unsigned find_get_pages(struct address_space *mapping,
unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
- pgoff_t end, int tag, unsigned int nr_pages,
+ pgoff_t end, xa_tag_t tag, unsigned int nr_pages,
struct page **pages);
static inline unsigned find_get_pages_tag(struct address_space *mapping,
- pgoff_t *index, int tag, unsigned int nr_pages,
+ pgoff_t *index, xa_tag_t tag, unsigned int nr_pages,
struct page **pages)
{
return find_get_pages_range_tag(mapping, index, (pgoff_t)-1, tag,
nr_pages, pages);
}
unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
- int tag, unsigned int nr_entries,
+ xa_tag_t tag, unsigned int nr_entries,
struct page **entries, pgoff_t *indices);

struct page *grab_cache_page_write_begin(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 6e2808fd3c06..6c9cad248e7f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1401,6 +1401,32 @@ bool page_cache_range_empty(struct address_space *mapping, pgoff_t index,
}
EXPORT_SYMBOL_GPL(page_cache_range_empty);

+/*
+ * page_cache_pin() - Try to pin a page in the page cache.
+ * @xas: The XArray operation state.
+ * @page: The page which has been previously found at this location.
+ *
+ * On success, the page has an elevated refcount, but is not locked.
+ * This implements the lockless pagecache protocol as described in
+ * include/linux/pagemap.h; see page_cache_get_speculative().
+ *
+ * Return: True if the page is still in the cache.
+ */
+static bool page_cache_pin(struct xa_state *xas, struct page *page)
+{
+ struct page *head = compound_head(page);
+ bool got = page_cache_get_speculative(head);
+
+ if (likely(got && (xas_reload(xas) == page) &&
+ (compound_head(page) == head)))
+ return true;
+
+ if (got)
+ put_page(head);
+ xas_retry(xas, XA_RETRY_ENTRY);
+ return false;
+}
+
/**
* find_get_entry - find and get a page cache entry
* @mapping: the address_space to search
@@ -1416,51 +1442,21 @@ EXPORT_SYMBOL_GPL(page_cache_range_empty);
*/
struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
{
- void **pagep;
- struct page *head, *page;
+ XA_STATE(xas, &mapping->pages, offset);
+ struct page *page;

rcu_read_lock();
-repeat:
- page = NULL;
- pagep = radix_tree_lookup_slot(&mapping->pages, offset);
- if (pagep) {
- page = radix_tree_deref_slot(pagep);
- if (unlikely(!page))
- goto out;
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page))
- goto repeat;
- /*
- * A shadow entry of a recently evicted page,
- * or a swap entry from shmem/tmpfs. Return
- * it without attempting to raise page count.
- */
- goto out;
- }
-
- head = compound_head(page);
- if (!page_cache_get_speculative(head))
- goto repeat;
-
- /* The page was split under us? */
- if (compound_head(page) != head) {
- put_page(head);
- goto repeat;
- }
+ do {
+ page = xas_load(&xas);
+ if (xas_retry(&xas, page))
+ continue;
+ if (!page || xa_is_value(page))
+ break;
+ if (!page_cache_pin(&xas, page))
+ continue;
+ } while (0);

- /*
- * Has the page moved?
- * This is part of the lockless pagecache protocol. See
- * include/linux/pagemap.h for details.
- */
- if (unlikely(page != *pagep)) {
- put_page(head);
- goto repeat;
- }
- }
-out:
rcu_read_unlock();
-
return page;
}
EXPORT_SYMBOL(find_get_entry);
@@ -1487,7 +1483,7 @@ struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)

repeat:
page = find_get_entry(mapping, offset);
- if (page && !radix_tree_exception(page)) {
+ if (page && !xa_is_value(page)) {
lock_page(page);
/* Has the page been truncated? */
if (unlikely(page_mapping(page) != mapping)) {
@@ -1620,50 +1616,21 @@ unsigned find_get_entries(struct address_space *mapping,
pgoff_t start, unsigned int nr_entries,
struct page **entries, pgoff_t *indices)
{
- void **slot;
+ XA_STATE(xas, &mapping->pages, start);
+ struct page *page;
unsigned int ret = 0;
- struct radix_tree_iter iter;

if (!nr_entries)
return 0;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
- struct page *head, *page;
-repeat:
- page = radix_tree_deref_slot(slot);
- if (unlikely(!page))
+ xas_for_each(&xas, page, ULONG_MAX) {
+ if (xas_retry(&xas, page))
+ continue;
+ if (!xa_is_value(page) && !page_cache_pin(&xas, page))
continue;
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
- /*
- * A shadow entry of a recently evicted page, a swap
- * entry from shmem/tmpfs or a DAX entry. Return it
- * without attempting to raise page count.
- */
- goto export;
- }
-
- head = compound_head(page);
- if (!page_cache_get_speculative(head))
- goto repeat;
-
- /* The page was split under us? */
- if (compound_head(page) != head) {
- put_page(head);
- goto repeat;
- }

- /* Has the page moved? */
- if (unlikely(page != *slot)) {
- put_page(head);
- goto repeat;
- }
-export:
- indices[ret] = iter.index;
+ indices[ret] = xas.xa_index;
entries[ret] = page;
if (++ret == nr_entries)
break;
@@ -1697,56 +1664,26 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
pgoff_t end, unsigned int nr_pages,
struct page **pages)
{
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, *start);
+ struct page *page;
unsigned ret = 0;

if (unlikely(!nr_pages))
return 0;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, *start) {
- struct page *head, *page;
-
- if (iter.index > end)
- break;
-repeat:
- page = radix_tree_deref_slot(slot);
- if (unlikely(!page))
+ xas_for_each(&xas, page, end) {
+ if (xas_retry(&xas, page))
continue;
-
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
- /*
- * A shadow entry of a recently evicted page,
- * or a swap entry from shmem/tmpfs. Skip
- * over it.
- */
+ /* Skip over shadow or swap entries */
+ if (xa_is_value(page))
+ continue;
+ if (!page_cache_pin(&xas, page))
continue;
- }
-
- head = compound_head(page);
- if (!page_cache_get_speculative(head))
- goto repeat;
-
- /* The page was split under us? */
- if (compound_head(page) != head) {
- put_page(head);
- goto repeat;
- }
-
- /* Has the page moved? */
- if (unlikely(page != *slot)) {
- put_page(head);
- goto repeat;
- }

pages[ret] = page;
if (++ret == nr_pages) {
- *start = pages[ret - 1]->index + 1;
+ *start = page->index + 1;
goto out;
}
}
@@ -1754,7 +1691,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
/*
* We come here when there is no page beyond @end. We take care to not
* overflow the index @start as it confuses some of the callers. This
- * breaks the iteration when there is page at index -1 but that is
+ * breaks the iteration when there is a page at index -1 but that is
* already broken anyway.
*/
if (end == (pgoff_t)-1)
@@ -1782,57 +1719,28 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t index,
unsigned int nr_pages, struct page **pages)
{
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, index);
+ struct page *page;
unsigned int ret = 0;

if (unlikely(!nr_pages))
return 0;

rcu_read_lock();
- radix_tree_for_each_contig(slot, &mapping->pages, &iter, index) {
- struct page *head, *page;
-repeat:
- page = radix_tree_deref_slot(slot);
- /* The hole, there no reason to continue */
- if (unlikely(!page))
- break;
-
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
- /*
- * A shadow entry of a recently evicted page,
- * or a swap entry from shmem/tmpfs. Stop
- * looking for contiguous pages.
- */
+ for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+ if (xas_retry(&xas, page))
+ continue;
+ if (xa_is_value(page))
break;
- }
-
- head = compound_head(page);
- if (!page_cache_get_speculative(head))
- goto repeat;
-
- /* The page was split under us? */
- if (compound_head(page) != head) {
- put_page(head);
- goto repeat;
- }
-
- /* Has the page moved? */
- if (unlikely(page != *slot)) {
- put_page(head);
- goto repeat;
- }
+ if (!page_cache_pin(&xas, page))
+ continue;

/*
* must check mapping and index after taking the ref.
* otherwise we can get both false positives and false
* negatives, which is just confusing to the caller.
*/
- if (page->mapping == NULL || page_to_pgoff(page) != iter.index) {
+ if (!page->mapping || page_to_pgoff(page) != xas.xa_index) {
put_page(page);
break;
}
@@ -1859,74 +1767,42 @@ EXPORT_SYMBOL(find_get_pages_contig);
* @tag. We update @index to index the next page for the traversal.
*/
unsigned find_get_pages_range_tag(struct address_space *mapping, pgoff_t *index,
- pgoff_t end, int tag, unsigned int nr_pages,
+ pgoff_t end, xa_tag_t tag, unsigned int nr_pages,
struct page **pages)
{
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, *index);
+ struct page *page;
unsigned ret = 0;

if (unlikely(!nr_pages))
return 0;

rcu_read_lock();
- radix_tree_for_each_tagged(slot, &mapping->pages, &iter, *index, tag) {
- struct page *head, *page;
-
- if (iter.index > end)
- break;
-repeat:
- page = radix_tree_deref_slot(slot);
- if (unlikely(!page))
+ xas_for_each_tag(&xas, page, end, tag) {
+ if (xas_retry(&xas, page))
continue;
-
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
- /*
- * A shadow entry of a recently evicted page.
- *
- * Those entries should never be tagged, but
- * this tree walk is lockless and the tags are
- * looked up in bulk, one radix tree node at a
- * time, so there is a sizable window for page
- * reclaim to evict a page we saw tagged.
- *
- * Skip over it.
- */
+ /*
+ * Shadow entries should never be tagged, but this tree walk
+ * is lockless so there is a window for page reclaim to evict
+ * a page we saw tagged. Skip over it.
+ */
+ if (xa_is_value(page))
+ continue;
+ if (!page_cache_pin(&xas, page))
continue;
- }
-
- head = compound_head(page);
- if (!page_cache_get_speculative(head))
- goto repeat;
-
- /* The page was split under us? */
- if (compound_head(page) != head) {
- put_page(head);
- goto repeat;
- }
-
- /* Has the page moved? */
- if (unlikely(page != *slot)) {
- put_page(head);
- goto repeat;
- }

pages[ret] = page;
if (++ret == nr_pages) {
- *index = pages[ret - 1]->index + 1;
+ *index = page->index + 1;
goto out;
}
}

/*
- * We come here when we got at @end. We take care to not overflow the
+ * We come here when we got to @end. We take care to not overflow the
* index @index as it confuses some of the callers. This breaks the
- * iteration when there is page at index -1 but that is already broken
- * anyway.
+ * iteration when there is a page at index -1 but that is already
+ * broken anyway.
*/
if (end == (pgoff_t)-1)
*index = (pgoff_t)-1;
@@ -1952,54 +1828,24 @@ EXPORT_SYMBOL(find_get_pages_range_tag);
* @tag.
*/
unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
- int tag, unsigned int nr_entries,
+ xa_tag_t tag, unsigned int nr_entries,
struct page **entries, pgoff_t *indices)
{
- void **slot;
+ XA_STATE(xas, &mapping->pages, start);
+ struct page *page;
unsigned int ret = 0;
- struct radix_tree_iter iter;

if (!nr_entries)
return 0;

rcu_read_lock();
- radix_tree_for_each_tagged(slot, &mapping->pages, &iter, start, tag) {
- struct page *head, *page;
-repeat:
- page = radix_tree_deref_slot(slot);
- if (unlikely(!page))
+ xas_for_each_tag(&xas, page, ULONG_MAX, tag) {
+ if (xas_retry(&xas, page))
+ continue;
+ if (!xa_is_value(page) && !page_cache_pin(&xas, page))
continue;
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
-
- /*
- * A shadow entry of a recently evicted page, a swap
- * entry from shmem/tmpfs or a DAX entry. Return it
- * without attempting to raise page count.
- */
- goto export;
- }
-
- head = compound_head(page);
- if (!page_cache_get_speculative(head))
- goto repeat;
-
- /* The page was split under us? */
- if (compound_head(page) != head) {
- put_page(head);
- goto repeat;
- }

- /* Has the page moved? */
- if (unlikely(page != *slot)) {
- put_page(head);
- goto repeat;
- }
-export:
- indices[ret] = iter.index;
+ indices[ret] = xas.xa_index;
entries[ret] = page;
if (++ret == nr_entries)
break;
@@ -2608,45 +2454,21 @@ EXPORT_SYMBOL(filemap_fault);
void filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
- struct radix_tree_iter iter;
- void **slot;
struct file *file = vmf->vma->vm_file;
struct address_space *mapping = file->f_mapping;
pgoff_t last_pgoff = start_pgoff;
unsigned long max_idx;
- struct page *head, *page;
+ XA_STATE(xas, &mapping->pages, start_pgoff);
+ struct page *page;

rcu_read_lock();
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start_pgoff) {
- if (iter.index > end_pgoff)
- break;
-repeat:
- page = radix_tree_deref_slot(slot);
- if (unlikely(!page))
- goto next;
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- slot = radix_tree_iter_retry(&iter);
- continue;
- }
+ xas_for_each(&xas, page, end_pgoff) {
+ if (xas_retry(&xas, page))
+ continue;
+ if (xa_is_value(page))
goto next;
- }
-
- head = compound_head(page);
- if (!page_cache_get_speculative(head))
- goto repeat;
-
- /* The page was split under us? */
- if (compound_head(page) != head) {
- put_page(head);
- goto repeat;
- }
-
- /* Has the page moved? */
- if (unlikely(page != *slot)) {
- put_page(head);
- goto repeat;
- }
+ if (!page_cache_pin(&xas, page))
+ continue;

if (!PageUptodate(page) ||
PageReadahead(page) ||
@@ -2665,10 +2487,10 @@ void filemap_map_pages(struct vm_fault *vmf,
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;

- vmf->address += (iter.index - last_pgoff) << PAGE_SHIFT;
+ vmf->address += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
if (vmf->pte)
- vmf->pte += iter.index - last_pgoff;
- last_pgoff = iter.index;
+ vmf->pte += xas.xa_index - last_pgoff;
+ last_pgoff = xas.xa_index;
if (alloc_set_pte(vmf, NULL, page))
goto unlock;
unlock_page(page);
@@ -2681,8 +2503,6 @@ void filemap_map_pages(struct vm_fault *vmf,
/* Huge page is mapped? No need to proceed. */
if (pmd_trans_huge(*vmf->pmd))
break;
- if (iter.index == end_pgoff)
- break;
}
rcu_read_unlock();
}
diff --git a/tools/testing/radix-tree/regression1.c b/tools/testing/radix-tree/regression1.c
index 0aece092f40e..008393906be5 100644
--- a/tools/testing/radix-tree/regression1.c
+++ b/tools/testing/radix-tree/regression1.c
@@ -58,7 +58,7 @@ static struct page *page_alloc(void)
struct page *p;
p = malloc(sizeof(struct page));
p->count = 1;
- p->index = 1;
+ p->index = (unsigned long)p;
pthread_mutex_init(&p->lock, NULL);

return p;
@@ -77,53 +77,37 @@ static void page_free(struct page *p)
call_rcu(&p->rcu, page_rcu_free);
}

+static bool page_cache_pin(struct xa_state *xas, struct page *page)
+{
+ pthread_mutex_lock(&page->lock);
+ if (!page->count) {
+ pthread_mutex_unlock(&page->lock);
+ goto fail;
+ }
+ /* don't actually update page refcount */
+ pthread_mutex_unlock(&page->lock);
+
+ /* Has the page moved? */
+ if (xas_reload(xas) == page)
+ return true;
+fail:
+ xas_retry(xas, XA_RETRY_ENTRY);
+ return false;
+}
+
static unsigned find_get_pages(unsigned long start,
unsigned int nr_pages, struct page **pages)
{
- unsigned int i;
- unsigned int ret;
- unsigned int nr_found;
+ XA_STATE(xas, &mt_tree, start);
+ struct page *page;
+ unsigned int ret = 0;

rcu_read_lock();
-restart:
- nr_found = radix_tree_gang_lookup_slot(&mt_tree,
- (void ***)pages, NULL, start, nr_pages);
- ret = 0;
- for (i = 0; i < nr_found; i++) {
- struct page *page;
-repeat:
- page = radix_tree_deref_slot((void **)pages[i]);
- if (unlikely(!page))
+ xas_for_each(&xas, page, ULONG_MAX) {
+ if (xas_retry(&xas, page))
+ continue;
+ if (!page_cache_pin(&xas, page))
continue;
-
- if (radix_tree_exception(page)) {
- if (radix_tree_deref_retry(page)) {
- /*
- * Transient condition which can only trigger
- * when entry at index 0 moves out of or back
- * to root: none yet gotten, safe to restart.
- */
- assert((start | i) == 0);
- goto restart;
- }
- /*
- * No exceptional entries are inserted in this test.
- */
- assert(0);
- }
-
- pthread_mutex_lock(&page->lock);
- if (!page->count) {
- pthread_mutex_unlock(&page->lock);
- goto repeat;
- }
- /* don't actually update page refcount */
- pthread_mutex_unlock(&page->lock);
-
- /* Has the page moved? */
- if (unlikely(page != *((void **)pages[i]))) {
- goto repeat;
- }

pages[ret] = page;
ret++;
--
2.15.0

2017-12-06 01:07:30

by Matthew Wilcox

Subject: [PATCH v4 20/73] idr: Convert to XArray

From: Matthew Wilcox <[email protected]>

The IDR distinguishes between unallocated entries (read as NULL) and
entries where the user has chosen to store NULL. The radix tree was
modified to consider NULL entries which had tag 0 _clear_ as being
allocated, but it added a lot of complexity.

Instead, the XArray has a 'zero entry', which the normal API will treat
as NULL, but is distinct from NULL when using the advanced API. The IDR
code converts between NULL and zero entries.
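
As a rough userspace illustration of that conversion (a sketch only, not the
kernel code; every toy_* name and TOY_ZERO_ENTRY below is made up for this
example), a sentinel address can stand in for the zero entry so that
"allocated, but holds NULL" stays distinct from "not allocated":

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model; the real IDR sits on top of the XArray. */
#define TOY_SLOTS 8

/* Plays the role of XA_ZERO_ENTRY: a unique, non-NULL marker address. */
static char toy_zero_marker;
#define TOY_ZERO_ENTRY ((void *)&toy_zero_marker)

struct toy_idr {
	void *slot[TOY_SLOTS];	/* NULL here means "ID is free" */
};

/* Allocate the lowest free ID >= start. Returns the ID, or -1 if full. */
static int toy_idr_alloc(struct toy_idr *idr, void *ptr, int start)
{
	assert(start >= 0);
	for (int id = start; id < TOY_SLOTS; id++) {
		if (idr->slot[id] == NULL) {
			/* A stored NULL becomes the zero entry. */
			idr->slot[id] = ptr ? ptr : TOY_ZERO_ENTRY;
			return id;
		}
	}
	return -1;
}

/* Look up an ID. The zero entry is converted back to NULL for the caller. */
static void *toy_idr_find(const struct toy_idr *idr, int id)
{
	void *entry;

	if (id < 0 || id >= TOY_SLOTS)
		return NULL;
	entry = idr->slot[id];
	return entry == TOY_ZERO_ENTRY ? NULL : entry;
}

/* Free an ID and return the pointer the caller originally stored. */
static void *toy_idr_remove(struct toy_idr *idr, int id)
{
	void *entry = toy_idr_find(idr, id);

	if (id >= 0 && id < TOY_SLOTS)
		idr->slot[id] = NULL;
	return entry;
}
```

In this model, allocating with a NULL pointer still consumes an ID: a later
toy_idr_alloc() skips it, while toy_idr_find() reports NULL for it, which is
the behaviour the IDR needs from the zero entry.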

The idr_for_each_entry_ul() iterator becomes an alias for xa_for_each(),
so we drop the idr_get_next_ul() function as it has no users.

The exported IDR API was a weird mix of GPL-only and general symbols;
I converted them all to GPL as there was no way to use the IDR API
without being GPL.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/core-api/xarray.rst | 6 +
include/linux/idr.h | 161 +++++++++++++-------
include/linux/xarray.h | 27 +++-
lib/idr.c | 282 +++++++++++++++++++++---------------
lib/radix-tree.c | 77 +++++-----
lib/xarray.c | 6 +
tools/testing/radix-tree/idr-test.c | 23 +++
7 files changed, 367 insertions(+), 215 deletions(-)

diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
index 871161539242..b252bf3dc23f 100644
--- a/Documentation/core-api/xarray.rst
+++ b/Documentation/core-api/xarray.rst
@@ -200,6 +200,12 @@ to :c:func:`xas_retry`, and retry the operation if it returns ``true``.
this RCU period. You should restart the lookup from the head of the
array.

+ * - Zero
+ - :c:func:`xa_is_zero`
+ - Zero entries appear as ``NULL`` through the Normal API, but occupy an
+ entry in the XArray which can be tagged or otherwise used to reserve
+ the index.
+
Other internal entries may be added in the future. As far as possible, they
will be handled by :c:func:`xas_retry`.

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 4ffdb7058121..06412fbaa65f 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -9,33 +9,34 @@
* tables.
*/

-#ifndef __IDR_H__
-#define __IDR_H__
+#ifndef _LINUX_IDR_H
+#define _LINUX_IDR_H

#include <linux/radix-tree.h>
#include <linux/gfp.h>
#include <linux/percpu.h>
+#include <linux/xarray.h>

struct idr {
- struct radix_tree_root idr_rt;
- unsigned int idr_next;
+ struct xarray idr_xa;
+ unsigned int idr_next;
};

-/*
- * The IDR API does not expose the tagging functionality of the radix tree
- * to users. Use tag 0 to track whether a node has free space below it.
- */
-#define IDR_FREE 0
-
-/* Set the IDR flag and the IDR_FREE tag */
-#define IDR_RT_MARKER ((__force gfp_t)(3 << __GFP_BITS_SHIFT))
+#define IDR_INIT_FLAGS (XA_FLAGS_TRACK_FREE | XA_FLAGS_TAG(0))

#define IDR_INIT(name) \
{ \
- .idr_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER) \
+ .idr_xa = __XARRAY_INIT(name.idr_xa, IDR_INIT_FLAGS), \
+ .idr_next = 0, \
}
#define DEFINE_IDR(name) struct idr name = IDR_INIT(name)

+static inline void idr_init(struct idr *idr)
+{
+ __xa_init(&idr->idr_xa, IDR_INIT_FLAGS);
+ idr->idr_next = 0;
+}
+
/**
* idr_get_cursor - Return the current position of the cyclic allocator
* @idr: idr handle
@@ -64,62 +65,97 @@ static inline void idr_set_cursor(struct idr *idr, unsigned int val)

/**
* DOC: idr sync
- * idr synchronization (stolen from radix-tree.h)
- *
- * idr_find() is able to be called locklessly, using RCU. The caller must
- * ensure calls to this function are made within rcu_read_lock() regions.
- * Other readers (lock-free or otherwise) and modifications may be running
- * concurrently.
- *
- * It is still required that the caller manage the synchronization and
- * lifetimes of the items. So if RCU lock-free lookups are used, typically
- * this would mean that the items have their own locks, or are amenable to
- * lock-free access; and that the items are freed by RCU (or only freed after
- * having been deleted from the idr tree *and* a synchronize_rcu() grace
- * period).
+ * idr synchronization
+ *
+ * The IDR manages its own locking, using irqsafe spinlocks for operations
+ * which modify the IDR and RCU for operations which do not. The user of
+ * the IDR may choose to wrap accesses to it in a lock if it needs to
+ * guarantee the IDR does not change during a read access. The easiest way
+ * to do this is to grab the same lock the IDR uses for write accesses
+ * using one of the idr_lock() wrappers.
+ *
+ * The caller must still manage the synchronization and lifetimes of the
+ * items. So if RCU lock-free lookups are used, typically this would mean
+ * that the items have their own locks, or are amenable to lock-free access;
+ * and that the items are freed by RCU (or only freed after having been
+ * deleted from the IDR *and* a synchronize_rcu() grace period has elapsed).
*/

-void idr_preload(gfp_t gfp_mask);
+#define idr_lock(idr) xa_lock(&(idr)->idr_xa)
+#define idr_unlock(idr) xa_unlock(&(idr)->idr_xa)
+#define idr_lock_bh(idr) xa_lock_bh(&(idr)->idr_xa)
+#define idr_unlock_bh(idr) xa_unlock_bh(&(idr)->idr_xa)
+#define idr_lock_irq(idr) xa_lock_irq(&(idr)->idr_xa)
+#define idr_unlock_irq(idr) xa_unlock_irq(&(idr)->idr_xa)
+#define idr_lock_irqsave(idr, flags) \
+ xa_lock_irqsave(&(idr)->idr_xa, flags)
+#define idr_unlock_irqrestore(idr, flags) \
+ xa_unlock_irqrestore(&(idr)->idr_xa, flags)
+
+void idr_preload(gfp_t);

int idr_alloc(struct idr *, void *, int start, int end, gfp_t);
int __must_check idr_alloc_ul(struct idr *, void *, unsigned long *nextid,
unsigned long max, gfp_t);
int idr_alloc_cyclic(struct idr *, void *entry, int start, int end, gfp_t);
-int idr_for_each(const struct idr *,
+int idr_for_each(struct idr *,
int (*fn)(int id, void *p, void *data), void *data);
void *idr_get_next(struct idr *, int *nextid);
-void *idr_get_next_ul(struct idr *, unsigned long *nextid);
void *idr_replace(struct idr *, void *, unsigned long id);
-void idr_destroy(struct idr *);

+#ifdef CONFIG_64BIT
+int __must_check idr_alloc_u32(struct idr *, void *, unsigned int *nextid,
+ unsigned int max, gfp_t);
+#else /* !CONFIG_64BIT */
static inline int __must_check idr_alloc_u32(struct idr *idr, void *ptr,
- u32 *nextid, unsigned long max, gfp_t gfp)
+ unsigned int *nextid, unsigned int max, gfp_t gfp)
{
- unsigned long tmp = *nextid;
- int ret = idr_alloc_ul(idr, ptr, &tmp, max, gfp);
- *nextid = tmp;
- return ret;
+ return idr_alloc_ul(idr, ptr, (unsigned long *)nextid, max, gfp);
}
+#endif

+/**
+ * idr_remove() - Remove an item from the IDR.
+ * @idr: IDR handle.
+ * @id: Object ID.
+ *
+ * Once this function returns, the ID is available for allocation again.
+ * This function protects itself with the IDR lock.
+ *
+ * Return: The pointer associated with this ID.
+ */
static inline void *idr_remove(struct idr *idr, unsigned long id)
{
- return radix_tree_delete_item(&idr->idr_rt, id, NULL);
+ return xa_erase(&idr->idr_xa, id);
}

-static inline void idr_init(struct idr *idr)
+/**
+ * idr_is_empty() - Determine if there are no entries in the IDR
+ * @idr: IDR handle.
+ *
+ * Return: %true if there are no entries in the IDR.
+ */
+static inline bool idr_is_empty(const struct idr *idr)
{
- INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
- idr->idr_next = 0;
+ return xa_empty(&idr->idr_xa);
}

-static inline bool idr_is_empty(const struct idr *idr)
+/**
+ * idr_destroy() - Free all internal memory used by an IDR.
+ * @idr: IDR handle.
+ *
+ * When you have finished using an IDR, you can free all the memory used
+ * for the IDR data structure by calling this function. If you also
+ * wish to free the objects referenced by the IDR, you can use idr_for_each()
+ * or idr_for_each_entry() to do that first.
+ */
+static inline void idr_destroy(struct idr *idr)
{
- return radix_tree_empty(&idr->idr_rt) &&
- radix_tree_tagged(&idr->idr_rt, IDR_FREE);
+ xa_destroy(&idr->idr_xa);
}

/**
- * idr_preload_end - end preload section started with idr_preload()
+ * idr_preload_end() - end preload section started with idr_preload()
*
* Each idr_preload() should be matched with an invocation of this
* function. See idr_preload() for details.
@@ -130,7 +166,7 @@ static inline void idr_preload_end(void)
}

/**
- * idr_find - return pointer for given id
+ * idr_find() - return pointer for given id
* @idr: idr handle
* @id: lookup key
*
@@ -138,14 +174,35 @@ static inline void idr_preload_end(void)
* return indicates that @id is not valid or you passed %NULL in
* idr_get_new().
*
- * This function can be called under rcu_read_lock(), given that the leaf
- * pointers lifetimes are correctly managed.
+ * This function is protected by the RCU read lock. If you want to ensure
+ * that it does not race with a call to idr_remove(), perhaps because you
+ * need to establish a refcount on the object, you can use idr_lock() and
+ * idr_unlock() to prevent simultaneous modification.
*/
-static inline void *idr_find(const struct idr *idr, unsigned long id)
+static inline void *idr_find(struct idr *idr, unsigned long id)
{
- return radix_tree_lookup(&idr->idr_rt, id);
+ return xa_load(&idr->idr_xa, id);
}

+/**
+ * idr_for_each_entry_ul() - Iterate over the entries in an IDR.
+ * @idr: IDR handle.
+ * @entry: Pointer to each entry in turn.
+ * @id: ID of each entry.
+ *
+ * Initialise @id to the lowest ID before using this iterator.
+ * In the body of the loop, @entry will point to the object stored in the
+ * IDR. After the loop has finished normally, @entry will be %NULL, which
+ * is a convenient way to distinguish between a 'break' exit from the loop
+ * and normal termination.
+ *
+ * The control elements of this loop protect themselves with the RCU read
+ * lock, which is dropped before invoking the body. You may sleep unless
+ * your own locking prevents that.
+ */
+#define idr_for_each_entry_ul(idr, entry, id) \
+ xa_for_each(&(idr)->idr_xa, entry, id, ULONG_MAX)
+
/**
* idr_for_each_entry - iterate over an idr's elements of a given type
* @idr: idr handle
@@ -158,8 +215,6 @@ static inline void *idr_find(const struct idr *idr, unsigned long id)
*/
#define idr_for_each_entry(idr, entry, id) \
for (id = 0; ((entry) = idr_get_next(idr, &(id))) != NULL; ++id)
-#define idr_for_each_entry_ul(idr, entry, id) \
- for (id = 0; ((entry) = idr_get_next_ul(idr, &(id))) != NULL; ++id)

/**
* idr_for_each_entry_continue - continue iteration over an idr's elements of a given type
@@ -194,7 +249,7 @@ struct ida {
};

#define IDA_INIT(name) { \
- .ida_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER | GFP_NOWAIT), \
+ .ida_rt = RADIX_TREE_INIT(name, IDR_INIT_FLAGS | GFP_NOWAIT), \
}
#define DEFINE_IDA(name) struct ida name = IDA_INIT(name)

@@ -209,7 +264,7 @@ void ida_simple_remove(struct ida *ida, unsigned int id);

static inline void ida_init(struct ida *ida)
{
- INIT_RADIX_TREE(&ida->ida_rt, IDR_RT_MARKER | GFP_NOWAIT);
+ INIT_RADIX_TREE(&ida->ida_rt, IDR_INIT_FLAGS | GFP_NOWAIT);
}

/**
@@ -228,4 +283,4 @@ static inline bool ida_is_empty(const struct ida *ida)
{
return radix_tree_empty(&ida->ida_rt);
}
-#endif /* __IDR_H__ */
+#endif /* _LINUX_IDR_H */
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index afa3374f20bd..7017153d89e8 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -331,7 +331,8 @@ static inline void *xa_entry_locked(struct xarray *xa,
* internal entries are pointers to the next node in the tree. Since the
* kernel unmaps page 0 to trap NULL pointer dereferences, we can use values
* 0-1023 for special purposes. Values 0-62 are used for sibling
- * entries. Value 256 is used for the retry entry.
+ * entries. Value 256 is used for zero entries. Value 257 is used for the
+ * retry entry.
*/

/* Private */
@@ -400,7 +401,19 @@ static inline bool xa_is_sibling(void *entry)
(entry < xa_mk_sibling(XA_CHUNK_SIZE - 1));
}

-#define XA_RETRY_ENTRY xa_mk_internal(256)
+#define XA_ZERO_ENTRY xa_mk_internal(256)
+#define XA_RETRY_ENTRY xa_mk_internal(257)
+
+/**
+ * xa_is_zero() - Is the entry a zero entry?
+ * @entry: Entry retrieved from the XArray
+ *
+ * Return: %true if the entry is a zero entry.
+ */
+static inline bool xa_is_zero(void *entry)
+{
+ return unlikely(entry == XA_ZERO_ENTRY);
+}

/**
* xa_is_retry() - Is the entry a retry entry?
@@ -562,18 +575,20 @@ static inline bool xas_top(struct xa_node *node)
}

/**
- * xas_retry() - Handle a retry entry.
+ * xas_retry() - Retry the operation if appropriate.
* @xas: XArray operation state.
* @entry: Entry from xarray.
*
- * An RCU-protected read may see a retry entry as a side-effect of a
- * simultaneous modification. This function sets up the @xas to retry
- * the walk from the head of the array.
+ * The advanced functions may sometimes return an internal entry, such as
+ * a retry entry or a zero entry. This function sets up the @xas to restart
+ * the walk from the head of the array if needed.
*
* Return: true if the operation needs to be retried.
*/
static inline bool xas_retry(struct xa_state *xas, void *entry)
{
+ if (xa_is_zero(entry))
+ return true;
if (!xa_is_retry(entry))
return false;
xas->xa_node = XAS_RESTART;
diff --git a/lib/idr.c b/lib/idr.c
index b9aa08e198a2..e677d1869ead 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -8,67 +8,121 @@
DEFINE_PER_CPU(struct ida_bitmap *, ida_bitmap);
static DEFINE_SPINLOCK(simple_ida_lock);

+/* In radix-tree.c temporarily */
+extern bool idr_nomem(struct xa_state *, gfp_t);
+
/**
- * idr_alloc_ul() - allocate a large ID
- * @idr: idr handle
- * @ptr: pointer to be associated with the new ID
- * @nextid: Pointer to minimum ID to allocate
- * @max: the maximum ID (inclusive)
- * @gfp: memory allocation flags
+ * idr_alloc_ul() - Allocate a large ID.
+ * @idr: IDR handle.
+ * @ptr: Pointer to be associated with the new ID.
+ * @nextid: Pointer to minimum ID to allocate.
+ * @max: The maximum ID (inclusive).
+ * @gfp: Memory allocation flags.
*
* Allocates an unused ID in the range [*nextid, end] and stores it in
* @nextid. Note that @max differs from the @end parameter to idr_alloc().
*
- * Simultaneous modifications to the @idr are not allowed and should be
- * prevented by the user, usually with a lock. idr_alloc_ul() may be called
- * concurrently with read-only accesses to the @idr, such as idr_find() and
- * idr_for_each_entry().
+ * The IDR uses its own spinlock to protect against simultaneous
+ * modification. @nextid is assigned to before @ptr is stored in the IDR;
+ * if @nextid points into the object referenced by @ptr, it will not be
+ * possible for a simultaneous lookup to see the wrong value in @nextid.
*
- * Return: 0 on success or a negative errno on failure (ENOMEM or ENOSPC)
+ * Return: 0 on success or a negative errno on failure (ENOMEM or ENOSPC).
*/
int idr_alloc_ul(struct idr *idr, void *ptr, unsigned long *nextid,
unsigned long max, gfp_t gfp)
{
- struct radix_tree_iter iter;
- void __rcu **slot;
+ XA_STATE(xas, &idr->idr_xa, *nextid);
+ unsigned long flags;

- if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr)))
+ if (WARN_ON_ONCE(xa_is_internal(ptr)))
return -EINVAL;
+ if (!ptr)
+ ptr = XA_ZERO_ENTRY;
+
+ do {
+ xas_lock_irqsave(&xas, flags);
+ xas_find_tag(&xas, max, XA_FREE_TAG);
+ if (xas.xa_index > max)
+ xas_set_err(&xas, -ENOSPC);
+ else
+ *nextid = xas.xa_index;
+ xas_store(&xas, ptr);
+ xas_clear_tag(&xas, XA_FREE_TAG);
+ xas_unlock_irqrestore(&xas, flags);
+ } while (idr_nomem(&xas, gfp));
+
+ return xas_error(&xas);
+}
+EXPORT_SYMBOL_GPL(idr_alloc_ul);

- if (WARN_ON_ONCE(!(idr->idr_rt.xa_flags & ROOT_IS_IDR)))
- idr->idr_rt.xa_flags |= IDR_RT_MARKER;
-
- radix_tree_iter_init(&iter, *nextid);
- slot = idr_get_free(&idr->idr_rt, &iter, gfp, max);
- if (IS_ERR(slot))
- return PTR_ERR(slot);
-
- radix_tree_iter_replace(&idr->idr_rt, &iter, slot, ptr);
- radix_tree_iter_tag_clear(&idr->idr_rt, &iter, IDR_FREE);
+/**
+ * idr_alloc_u32() - Allocate an ID.
+ * @idr: IDR handle.
+ * @ptr: Pointer to be associated with the new ID.
+ * @nextid: Pointer to minimum ID to allocate.
+ * @max: The maximum ID (inclusive).
+ * @gfp: Memory allocation flags.
+ *
+ * Allocates an unused ID in the range [*nextid, @max] and stores it in
+ * @nextid. Note that @max differs from the @end parameter to idr_alloc().
+ *
+ * The IDR uses its own spinlock to protect against simultaneous
+ * modification. @nextid is assigned to before @ptr is stored in the IDR;
+ * if @nextid points into the object referenced by @ptr, it will not be
+ * possible for a simultaneous lookup to see the wrong value in @nextid.
+ *
+ * Return: 0 on success or a negative errno on failure (ENOMEM or ENOSPC).
+ */
+#ifdef CONFIG_64BIT
+int idr_alloc_u32(struct idr *idr, void *ptr, unsigned int *nextid,
+ unsigned int max, gfp_t gfp)
+{
+ XA_STATE(xas, &idr->idr_xa, *nextid);
+ unsigned long flags;

- *nextid = iter.index;
- return 0;
+ if (WARN_ON_ONCE(xa_is_internal(ptr)))
+ return -EINVAL;
+ if (!ptr)
+ ptr = XA_ZERO_ENTRY;
+
+ do {
+ xas_lock_irqsave(&xas, flags);
+ xas_find_tag(&xas, max, XA_FREE_TAG);
+ if (xas.xa_index > max)
+ xas_set_err(&xas, -ENOSPC);
+ else
+ *nextid = xas.xa_index;
+ xas_store(&xas, ptr);
+ xas_clear_tag(&xas, XA_FREE_TAG);
+ xas_unlock_irqrestore(&xas, flags);
+ } while (idr_nomem(&xas, gfp));
+
+ return xas_error(&xas);
}
-EXPORT_SYMBOL_GPL(idr_alloc_ul);
+EXPORT_SYMBOL_GPL(idr_alloc_u32);
+#endif

/**
- * idr_alloc - allocate an id
- * @idr: idr handle
- * @ptr: pointer to be associated with the new id
- * @start: the minimum id (inclusive)
- * @end: the maximum id (exclusive)
- * @gfp: memory allocation flags
+ * idr_alloc() - Allocate an ID.
+ * @idr: IDR handle.
+ * @ptr: Pointer to be associated with the new ID.
+ * @start: The minimum id (inclusive).
+ * @end: The maximum id (exclusive).
+ * @gfp: Memory allocation flags.
+ *
+ * Allocates an unused ID >= start and < end.
*
- * Allocates an unused ID in the range [start, end). Returns -ENOSPC
- * if there are no unused IDs in that range.
+ * If @end is <= 0, it is treated as %INT_MAX + 1. This is to always
+ * allow using @start + N as @end as long as N is <= %INT_MAX. This
+ * differs from the @max parameter to idr_alloc_ul() and idr_alloc_u32().
*
- * Note that @end is treated as max when <= 0. This is to always allow
- * using @start + N as @end as long as N is inside integer range.
+ * The IDR uses its own spinlock to protect against simultaneous
+ * modification. The @ptr is visible to other simultaneous readers
+ * like idr_find() before this function returns.
*
- * Simultaneous modifications to the @idr are not allowed and should be
- * prevented by the user, usually with a lock. idr_alloc() may be called
- * concurrently with read-only accesses to the @idr, such as idr_find() and
- * idr_for_each_entry().
+ * Return: The newly allocated ID on success. -ENOMEM for a memory
+ * allocation failure. -ENOSPC if there are no free IDs in the range.
*/
int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp)
{
@@ -88,16 +142,22 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp)
EXPORT_SYMBOL_GPL(idr_alloc);

/**
- * idr_alloc_cyclic - allocate new idr entry in a cyclical fashion
- * @idr: idr handle
- * @ptr: pointer to be associated with the new id
- * @start: the minimum id (inclusive)
- * @end: the maximum id (exclusive)
- * @gfp: memory allocation flags
- *
- * Allocates an ID larger than the last ID allocated if one is available.
- * If not, it will attempt to allocate the smallest ID that is larger or
- * equal to @start.
+ * idr_alloc_cyclic() - Allocate an ID cyclically.
+ * @idr: IDR handle.
+ * @ptr: Pointer to be associated with the new ID.
+ * @start: The minimum id (inclusive).
+ * @end: The maximum id (exclusive).
+ * @gfp: Memory allocation flags.
+ *
+ * Allocates an unused ID >= @start and < @end. It will start searching
+ * after the last ID allocated and wrap back around to @start.
+ *
+ * The IDR uses its own spinlock to protect against simultaneous
+ * modification. The @ptr is visible to other simultaneous readers
+ * like idr_find() before this function returns.
+ *
+ * Return: The newly allocated ID on success. -ENOMEM for a memory
+ * allocation failure. -ENOSPC if there are no free IDs in the range.
*/
int idr_alloc_cyclic(struct idr *idr, void *ptr, int start, int end, gfp_t gfp)
{
@@ -119,88 +179,68 @@ int idr_alloc_cyclic(struct idr *idr, void *ptr, int start, int end, gfp_t gfp)
idr->idr_next = id + 1U;
return id;
}
-EXPORT_SYMBOL(idr_alloc_cyclic);
+EXPORT_SYMBOL_GPL(idr_alloc_cyclic);

/**
- * idr_for_each - iterate through all stored pointers
+ * idr_for_each() - iterate through all stored pointers
* @idr: idr handle
* @fn: function to be called for each pointer
* @data: data passed to callback function
*
- * The callback function will be called for each entry in @idr, passing
- * the id, the pointer and the data pointer passed to this function.
+ * The callback function will be called for each non-NULL pointer in
+ * @idr, passing the id, the pointer and @data. No internal locks are
+ * held while @fn is called, so @fn may sleep unless otherwise prevented
+ * by your own locking.
*
* If @fn returns anything other than %0, the iteration stops and that
* value is returned from this function.
*
- * idr_for_each() can be called concurrently with idr_alloc() and
- * idr_remove() if protected by RCU. Newly added entries may not be
- * seen and deleted entries may be seen, but adding and removing entries
- * will not cause other entries to be skipped, nor spurious ones to be seen.
+ * idr_for_each() protects itself with the RCU read lock. Newly added
+ * entries may not be seen and deleted entries may be seen, but adding
+ * and removing entries will not cause other entries to be skipped, nor
+ * spurious ones to be seen.
+ *
+ * Return: 0, or the first non-zero value returned by @fn.
*/
-int idr_for_each(const struct idr *idr,
+int idr_for_each(struct idr *idr,
int (*fn)(int id, void *p, void *data), void *data)
{
- struct radix_tree_iter iter;
- void __rcu **slot;
+ unsigned long i = 0;
+ void *p;

- radix_tree_for_each_slot(slot, &idr->idr_rt, &iter, 0) {
- int ret = fn(iter.index, rcu_dereference_raw(*slot), data);
+ xa_for_each(&idr->idr_xa, p, i, INT_MAX) {
+ int ret = fn(i, p, data);
if (ret)
return ret;
}

return 0;
}
-EXPORT_SYMBOL(idr_for_each);
+EXPORT_SYMBOL_GPL(idr_for_each);

/**
- * idr_get_next - Find next populated entry
+ * idr_get_next() - Find next populated entry
* @idr: idr handle
- * @nextid: Pointer to lowest possible ID to return
+ * @id: Pointer to lowest possible ID to return
*
* Returns the next populated entry in the tree with an ID greater than
* or equal to the value pointed to by @nextid. On exit, @nextid is updated
* to the ID of the found value. To use in a loop, the value pointed to by
* nextid must be incremented by the user.
- */
-void *idr_get_next(struct idr *idr, int *nextid)
-{
- struct radix_tree_iter iter;
- void __rcu **slot;
-
- slot = radix_tree_iter_find(&idr->idr_rt, &iter, *nextid);
- if (!slot)
- return NULL;
-
- *nextid = iter.index;
- return rcu_dereference_raw(*slot);
-}
-EXPORT_SYMBOL(idr_get_next);
-
-/**
- * idr_get_next_ul - Find next populated entry
- * @idr: idr handle
- * @nextid: Pointer to lowest possible ID to return
*
- * Returns the next populated entry in the tree with an ID greater than
- * or equal to the value pointed to by @nextid. On exit, @nextid is updated
- * to the ID of the found value. To use in a loop, the value pointed to by
- * nextid must be incremented by the user.
+ * This function protects itself with the RCU read lock, so may return a
+ * stale entry or may skip a newly added entry unless synchronised with
+ * a lock.
*/
-void *idr_get_next_ul(struct idr *idr, unsigned long *nextid)
+void *idr_get_next(struct idr *idr, int *id)
{
- struct radix_tree_iter iter;
- void __rcu **slot;
+ unsigned long index = *id;
+ void *entry = xa_find(&idr->idr_xa, &index, INT_MAX);

- slot = radix_tree_iter_find(&idr->idr_rt, &iter, *nextid);
- if (!slot)
- return NULL;
-
- *nextid = iter.index;
- return rcu_dereference_raw(*slot);
+ *id = index;
+ return entry;
}
-EXPORT_SYMBOL(idr_get_next_ul);
+EXPORT_SYMBOL_GPL(idr_get_next);

/**
* idr_replace - replace pointer for given id
@@ -209,31 +249,35 @@ EXPORT_SYMBOL(idr_get_next_ul);
* @id: Lookup key
*
* Replace the pointer registered with an ID and return the old value.
- * This function can be called under the RCU read lock concurrently with
- * idr_alloc() and idr_remove() (as long as the ID being removed is not
- * the one being replaced!).
+ * This function protects itself with a spinlock.
*
* Returns: the old value on success. %-ENOENT indicates that @id was not
* found. %-EINVAL indicates that @id or @ptr were not valid.
*/
void *idr_replace(struct idr *idr, void *ptr, unsigned long id)
{
- struct radix_tree_node *node;
- void __rcu **slot = NULL;
- void *entry;
+ XA_STATE(xas, &idr->idr_xa, id);
+ unsigned long flags;
+ void *curr;

- if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr)))
+ if (WARN_ON_ONCE(xa_is_internal(ptr)))
return ERR_PTR(-EINVAL);
-
- entry = __radix_tree_lookup(&idr->idr_rt, id, &node, &slot);
- if (!slot || radix_tree_tag_get(&idr->idr_rt, id, IDR_FREE))
- return ERR_PTR(-ENOENT);
-
- __radix_tree_replace(&idr->idr_rt, node, slot, ptr, NULL);
-
- return entry;
+ if (!ptr)
+ ptr = XA_ZERO_ENTRY;
+
+ xas_lock_irqsave(&xas, flags);
+ curr = xas_load(&xas);
+ if (curr)
+ xas_store(&xas, ptr);
+ else
+ curr = ERR_PTR(-ENOENT);
+ xas_unlock_irqrestore(&xas, flags);
+
+ if (xa_is_zero(curr))
+ return NULL;
+ return curr;
}
-EXPORT_SYMBOL(idr_replace);
+EXPORT_SYMBOL_GPL(idr_replace);

/**
* DOC: IDA description
@@ -264,7 +308,7 @@ EXPORT_SYMBOL(idr_replace);
* Developer's notes:
*
* The IDA uses the functionality provided by the IDR & radix tree to store
- * bitmaps in each entry. The IDR_FREE tag means there is at least one bit
+ * bitmaps in each entry. The XA_FREE_TAG tag means there is at least one bit
* free, unlike the IDR where it means at least one entry is free.
*
* I considered telling the radix tree that each slot is an order-10 node
@@ -370,7 +414,7 @@ int ida_get_new_above(struct ida *ida, int start, int *id)
__set_bit(bit, bitmap->bitmap);
if (bitmap_full(bitmap->bitmap, IDA_BITMAP_BITS))
radix_tree_iter_tag_clear(root, &iter,
- IDR_FREE);
+ XA_FREE_TAG);
} else {
new += bit;
if (new < 0)
@@ -426,7 +470,7 @@ void ida_remove(struct ida *ida, int id)
goto err;

__clear_bit(offset, btmp);
- radix_tree_iter_tag_set(&ida->ida_rt, &iter, IDR_FREE);
+ radix_tree_iter_tag_set(&ida->ida_rt, &iter, XA_FREE_TAG);
if (xa_is_value(bitmap)) {
if (xa_to_value(rcu_dereference_raw(*slot)) == 0)
radix_tree_iter_delete(&ida->ida_rt, &iter, slot);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index cb7cb9e96a8b..3b63d1ce7fda 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -529,6 +529,30 @@ int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
return __radix_tree_preload(gfp_mask, nr_nodes);
}

+/* Once the IDR users abandon the preload API, we can use xas_nomem */
+bool idr_nomem(struct xa_state *xas, gfp_t gfp)
+{
+ if (xas->xa_node != XAS_ERROR(ENOMEM)) {
+ xas_destroy(xas);
+ return false;
+ }
+ xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep,
+ gfp | __GFP_NOWARN);
+ if (!xas->xa_alloc) {
+ struct radix_tree_preload *rtp;
+
+ rtp = this_cpu_ptr(&radix_tree_preloads);
+ if (!rtp->nr)
+ return false;
+ xas->xa_alloc = rtp->nodes;
+ rtp->nodes = xas->xa_alloc->parent;
+ rtp->nr--;
+ }
+
+ xas->xa_node = XAS_RESTART;
+ return true;
+}
+
static unsigned radix_tree_load_root(const struct radix_tree_root *root,
struct radix_tree_node **nodep, unsigned long *maxindex)
{
@@ -562,7 +586,7 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
maxshift += RADIX_TREE_MAP_SHIFT;

entry = rcu_dereference_raw(root->xa_head);
- if (!entry && (!is_idr(root) || root_tag_get(root, IDR_FREE)))
+ if (!entry && (!is_idr(root) || root_tag_get(root, XA_FREE_TAG)))
goto out;

do {
@@ -572,10 +596,10 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
return -ENOMEM;

if (is_idr(root)) {
- all_tag_set(node, IDR_FREE);
- if (!root_tag_get(root, IDR_FREE)) {
- rtag_clear(node, IDR_FREE, 0);
- root_tag_set(root, IDR_FREE);
+ all_tag_set(node, XA_FREE_TAG);
+ if (!root_tag_get(root, XA_FREE_TAG)) {
+ rtag_clear(node, XA_FREE_TAG, 0);
+ root_tag_set(root, XA_FREE_TAG);
}
} else {
/* Propagate the aggregated tag info to the new child */
@@ -646,8 +670,8 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root,
* one (root->xa_head) as far as dependent read barriers go.
*/
root->xa_head = (void __rcu *)child;
- if (is_idr(root) && !rtag_get(node, IDR_FREE, 0))
- root_tag_clear(root, IDR_FREE);
+ if (is_idr(root) && !rtag_get(node, XA_FREE_TAG, 0))
+ root_tag_clear(root, XA_FREE_TAG);

/*
* We have a dilemma here. The node's slot[0] must not be
@@ -1074,7 +1098,7 @@ static bool node_tag_get(const struct radix_tree_root *root,
/*
* IDR users want to be able to store NULL in the tree, so if the slot isn't
* free, don't adjust the count, even if it's transitioning between NULL and
- * non-NULL. For the IDA, we mark slots as being IDR_FREE while they still
+ * non-NULL. For the IDA, we mark slots as being XA_FREE_TAG while they still
* have empty bits, but it only stores NULL in slots when they're being
* deleted.
*/
@@ -1084,7 +1108,7 @@ static int calculate_count(struct radix_tree_root *root,
{
if (is_idr(root)) {
unsigned offset = get_slot_offset(node, slot);
- bool free = node_tag_get(root, node, IDR_FREE, offset);
+ bool free = node_tag_get(root, node, XA_FREE_TAG, offset);
if (!free)
return 0;
if (!old)
@@ -1915,7 +1939,7 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
int tag;

if (is_idr(root))
- node_tag_set(root, node, IDR_FREE, offset);
+ node_tag_set(root, node, XA_FREE_TAG, offset);
else
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
node_tag_clear(root, node, tag, offset);
@@ -1963,7 +1987,7 @@ void *radix_tree_delete_item(struct radix_tree_root *root,
void *entry;

entry = __radix_tree_lookup(root, index, &node, &slot);
- if (!entry && (!is_idr(root) || node_tag_get(root, node, IDR_FREE,
+ if (!entry && (!is_idr(root) || node_tag_get(root, node, XA_FREE_TAG,
get_slot_offset(node, slot))))
return NULL;

@@ -2070,7 +2094,7 @@ void __rcu **idr_get_free(struct radix_tree_root *root,

grow:
shift = radix_tree_load_root(root, &child, &maxindex);
- if (!radix_tree_tagged(root, IDR_FREE))
+ if (!radix_tree_tagged(root, XA_FREE_TAG))
start = max(start, maxindex + 1);
if (start > max)
return ERR_PTR(-ENOSPC);
@@ -2091,7 +2115,7 @@ void __rcu **idr_get_free(struct radix_tree_root *root,
offset, 0, 0);
if (!child)
return ERR_PTR(-ENOMEM);
- all_tag_set(child, IDR_FREE);
+ all_tag_set(child, XA_FREE_TAG);
rcu_assign_pointer(*slot, node_to_entry(child));
if (node)
node->count++;
@@ -2100,8 +2124,8 @@ void __rcu **idr_get_free(struct radix_tree_root *root,

node = entry_to_node(child);
offset = radix_tree_descend(node, &child, start);
- if (!rtag_get(node, IDR_FREE, offset)) {
- offset = radix_tree_find_next_bit(node, IDR_FREE,
+ if (!rtag_get(node, XA_FREE_TAG, offset)) {
+ offset = radix_tree_find_next_bit(node, XA_FREE_TAG,
offset + 1);
start = rnext_index(start, node, offset);
if (start > max)
@@ -2125,32 +2149,11 @@ void __rcu **idr_get_free(struct radix_tree_root *root,
iter->next_index = 1;
iter->node = node;
__set_iter_shift(iter, shift);
- set_iter_tags(iter, node, offset, IDR_FREE);
+ set_iter_tags(iter, node, offset, XA_FREE_TAG);

return slot;
}

-/**
- * idr_destroy - release all internal memory from an IDR
- * @idr: idr handle
- *
- * After this function is called, the IDR is empty, and may be reused or
- * the data structure containing it may be freed.
- *
- * A typical clean-up sequence for objects stored in an idr tree will use
- * idr_for_each() to free all objects, if necessary, then idr_destroy() to
- * free the memory used to keep track of those objects.
- */
-void idr_destroy(struct idr *idr)
-{
- struct radix_tree_node *node = rcu_dereference_raw(idr->idr_rt.xa_head);
- if (radix_tree_is_internal_node(node))
- radix_tree_free_nodes(node);
- idr->idr_rt.xa_head = NULL;
- root_tag_set(&idr->idr_rt, IDR_FREE);
-}
-EXPORT_SYMBOL(idr_destroy);
-
static void
radix_tree_node_ctor(void *arg)
{
diff --git a/lib/xarray.c b/lib/xarray.c
index cc88df7bd6df..baa425ba3ee1 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1062,6 +1062,8 @@ void *xa_load(struct xarray *xa, unsigned long index)
rcu_read_lock();
do {
entry = xas_load(&xas);
+ if (xa_is_zero(entry))
+ entry = NULL;
} while (xas_retry(&xas, entry));
rcu_read_unlock();

@@ -1119,6 +1121,8 @@ void *xa_store(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp)
xa_lock_irqsave(xa, flags);
curr = xas_store(&xas, entry);
xa_unlock_irqrestore(xa, flags);
+ if (xa_is_zero(curr))
+ curr = NULL;
} while (xas_nomem(&xas, gfp));

if (xas_error(&xas))
@@ -1491,6 +1495,8 @@ void xa_dump_entry(void *entry, unsigned long index)
printk("%lu: retry (%ld)\n", index, xa_to_internal(entry));
else if (xa_is_sibling(entry))
printk("%lu: sibling (%ld)\n", index, xa_to_sibling(entry));
+ else if (xa_is_zero(entry))
+ printk("%lu: zero (%ld)\n", index, xa_to_internal(entry));
else
printk("%lu: UNKNOWN ENTRY (%p)\n", index, entry);
}
diff --git a/tools/testing/radix-tree/idr-test.c b/tools/testing/radix-tree/idr-test.c
index 7499319e85f8..7b710145d2ae 100644
--- a/tools/testing/radix-tree/idr-test.c
+++ b/tools/testing/radix-tree/idr-test.c
@@ -177,6 +177,22 @@ void idr_get_next_test(void)
idr_destroy(&idr);
}

+/*
+ * Check that growing the IDR works properly.
+ */
+void idr_alloc_far(struct idr *idr, unsigned long end)
+{
+ int i;
+
+ for (i = 1; i < end; i++)
+ assert(idr_alloc(idr, idr, i, i + 1, GFP_KERNEL) == i);
+
+ for (i = 1; i <= end; i++) {
+ assert(idr_alloc(idr, idr, 1, 0, GFP_KERNEL) == end);
+ idr_remove(idr, end);
+ }
+}
+
void idr_checks(void)
{
unsigned long i;
@@ -227,6 +243,11 @@ void idr_checks(void)
idr_null_test();
idr_nowait_test();
idr_get_next_test();
+
+ for (i = 2; i < 18; i++) {
+ idr_alloc_far(&idr, 1UL << i);
+ idr_destroy(&idr);
+ }
}

/*
@@ -505,7 +526,9 @@ void ida_thread_tests(void)
int __weak main(void)
{
radix_tree_init();
+ printv(0, "starting IDR checks\n");
idr_checks();
+ printv(0, "starting IDA checks\n");
ida_checks();
ida_thread_tests();
radix_tree_cpu_dead(1);
--
2.15.0

2017-12-06 01:07:58

by Matthew Wilcox

Subject: [PATCH v4 11/73] xarray: Add xa_store

From: Matthew Wilcox <[email protected]>

xa_store() differs from radix_tree_insert() in that it will overwrite an
existing element in the array rather than returning an error. This is
the behaviour which most users want, and those that want more complex
behaviour generally want to use the xas family of routines anyway.
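
To make the difference concrete, here is a minimal userspace sketch.
It is not kernel code: struct toy_array, toy_insert() and toy_store()
are hypothetical names invented for this illustration, and the fixed
slot count stands in for a tree that really grows on demand. The
insert-style call refuses to touch an occupied slot, while the
store-style call overwrites it and hands back the previous entry.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SLOTS 16	/* hypothetical fixed capacity, for illustration only */

struct toy_array {
	void *slots[TOY_SLOTS];
};

/* Insert-style semantics: fail with -EEXIST if the slot is occupied. */
static inline int toy_insert(struct toy_array *a, unsigned long index,
			     void *entry)
{
	assert(a != NULL);
	if (index >= TOY_SLOTS)
		return -EINVAL;
	if (a->slots[index])
		return -EEXIST;
	a->slots[index] = entry;
	return 0;
}

/* Store-style semantics: always overwrite, return the old entry. */
static inline void *toy_store(struct toy_array *a, unsigned long index,
			      void *entry)
{
	void *old;

	assert(a != NULL);
	if (index >= TOY_SLOTS)
		return NULL;	/* this sketch has no error pointers */
	old = a->slots[index];
	a->slots[index] = entry;
	return old;
}
```

A caller that wants "insert only if absent" can still build that on
top of the store-style call by inspecting the returned old entry.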

For memory allocation, xa_store() will first attempt to request memory
from the slab allocator; if memory is not immediately available, it will
drop the xa_lock and allocate memory, keeping a pointer in the xa_state.
It does not use the per-CPU preload caches, although those will continue
to exist until all radix tree users are converted to the xarray.
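
The retry pattern described above can be sketched in userspace as
follows. Everything here is an assumption made for illustration:
struct toy_state, toy_alloc_nowait(), toy_try_get_node(), toy_nomem()
and the fail_fast switch are hypothetical stand-ins, malloc() plays
the role of the slab allocator, and the lock is only indicated by
comments. The shape of the loop is the point: try a non-blocking
allocation under the lock, and if that fails, drop the lock, allocate
one node into the state, and go round again.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct toy_state {
	void *prealloc;	/* node set aside between attempts */
	int err;	/* 0 or a negative errno */
};

/* Set to true to simulate a failing non-blocking allocation. */
static bool fail_fast;

static inline void *toy_alloc_nowait(size_t size)
{
	return fail_fast ? NULL : malloc(size);
}

/* One attempt at the operation; imagine the lock is held here. */
static inline void *toy_try_get_node(struct toy_state *st, size_t size)
{
	void *node = st->prealloc;

	if (node) {
		st->prealloc = NULL;	/* consume the node saved earlier */
		return node;
	}
	node = toy_alloc_nowait(size);
	if (!node)
		st->err = -ENOMEM;
	return node;
}

/* Called without the lock; returns true if the caller should retry. */
static inline bool toy_nomem(struct toy_state *st, size_t size)
{
	if (st->err != -ENOMEM) {
		free(st->prealloc);	/* done; release any unused node */
		st->prealloc = NULL;
		return false;
	}
	st->prealloc = malloc(size);	/* the step that is allowed to sleep */
	if (!st->prealloc)
		return false;		/* error stays recorded in st->err */
	st->err = 0;
	return true;
}

static inline void *toy_get_node(struct toy_state *st, size_t size)
{
	void *node;

	assert(st != NULL);
	do {
		/* take the lock */
		node = toy_try_get_node(st, size);
		/* drop the lock */
	} while (toy_nomem(st, size));
	return node;
}
```

Because exactly one node is set aside per retry, each pass through the
loop can make progress without tying up more memory than it needs.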

This patch also includes xa_erase() and __xa_erase() for a streamlined
way to store NULL. Since there is no need to allocate memory in order
to store a NULL in the XArray, we do not need to trouble the user with
deciding what memory allocation flags to use.
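
As a rough illustration of why erasing needs no allocation flags,
consider this userspace sketch. struct toy_sparse, toy_sparse_store()
and toy_sparse_erase() are hypothetical names, the single lazily
allocated chunk is a simplification of the real tree, and the
may_alloc argument is only loosely analogous to a gfp mask. Storing
NULL either clears an existing slot or finds nothing to clear, so it
never reaches the allocation path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_CHUNK 64	/* hypothetical chunk size for this sketch */

struct toy_sparse {
	void **chunk;	/* allocated lazily on the first non-NULL store */
};

/* Store @entry at @index and return the old entry. */
static inline void *toy_sparse_store(struct toy_sparse *s, unsigned long index,
				     void *entry, int may_alloc)
{
	void *old;

	assert(s != NULL && index < TOY_CHUNK);
	if (!s->chunk) {
		if (!entry)
			return NULL;	/* nothing stored, nothing to erase */
		if (!may_alloc)
			return NULL;	/* a real API would report an error */
		s->chunk = calloc(TOY_CHUNK, sizeof(*s->chunk));
		if (!s->chunk)
			return NULL;	/* likewise: sketch only */
	}
	old = s->chunk[index];
	s->chunk[index] = entry;
	return old;
}

/* Erase is a store of NULL, so no allocation policy is needed. */
static inline void *toy_sparse_erase(struct toy_sparse *s, unsigned long index)
{
	return toy_sparse_store(s, index, NULL, 0);
}
```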

Signed-off-by: Matthew Wilcox <[email protected]>

squash xa_store

Add __xa_erase
---
include/linux/xarray.h | 98 +++++
lib/radix-tree.c | 4 +-
lib/xarray.c | 569 ++++++++++++++++++++++++++++++
tools/testing/radix-tree/linux/rcupdate.h | 1 +
tools/testing/radix-tree/xarray-test.c | 111 +++++-
5 files changed, 779 insertions(+), 4 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index ed95ebe91169..6f1f55d9fc94 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -71,6 +71,32 @@ static inline void xa_init(struct xarray *xa)
}

void *xa_load(struct xarray *, unsigned long index);
+void *xa_store(struct xarray *, unsigned long index, void *entry, gfp_t);
+
+/**
+ * xa_erase() - Erase this entry from the XArray.
+ * @xa: XArray.
+ * @index: Index of entry.
+ *
+ * This function is the equivalent of calling xa_store() with %NULL as
+ * the third argument. The XArray does not need to allocate memory, so
+ * the user does not need to provide GFP flags.
+ */
+static inline void *xa_erase(struct xarray *xa, unsigned long index)
+{
+ return xa_store(xa, index, NULL, 0);
+}
+
+/**
+ * xa_empty() - Determine if an array has any present entries.
+ * @xa: XArray.
+ *
+ * Return: %true if the array contains only NULL pointers.
+ */
+static inline bool xa_empty(const struct xarray *xa)
+{
+ return xa->xa_head == NULL;
+}

typedef unsigned __bitwise xa_tag_t;
#define XA_TAG_0 ((__force xa_tag_t)0U)
@@ -80,9 +106,15 @@ typedef unsigned __bitwise xa_tag_t;

#define XA_TAG_MAX XA_TAG_2
#define XA_FREE_TAG XA_TAG_0
+#define XA_FLAGS_TRACK_FREE ((__force gfp_t)(1U << __GFP_BITS_SHIFT))
#define XA_FLAGS_TAG(tag) ((__force gfp_t)((2U << __GFP_BITS_SHIFT) << \
(__force unsigned)(tag)))

+static inline bool xa_track_free(const struct xarray *xa)
+{
+ return xa->xa_flags & XA_FLAGS_TRACK_FREE;
+}
+
/**
* xa_tagged() - Inquire whether any entry in this array has a tag set
* @xa: Array
@@ -151,6 +183,7 @@ static inline bool xa_is_value(void *entry)
#define xa_lock_held(xa) lockdep_is_held(&(xa)->xa_lock)

/* Versions of the normal API which require the caller to hold the xa_lock */
+void *__xa_erase(struct xarray *, unsigned long index);
void *__xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
void *__xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);

@@ -265,6 +298,12 @@ static inline bool xa_is_internal(void *entry)
return ((unsigned long)entry & 3) == 2;
}

+/* Private */
+static inline void *xa_mk_node(struct xa_node *node)
+{
+ return (void *)((unsigned long)node | 2);
+}
+
/* Private */
static inline struct xa_node *xa_to_node(void *entry)
{
@@ -445,6 +484,12 @@ static inline bool xas_valid(const struct xa_state *xas)
return !xas_invalid(xas);
}

+/* True if the node represents head-of-tree, RESTART or BOUNDS */
+static inline bool xas_top(struct xa_node *node)
+{
+ return node <= XAS_BOUNDS;
+}
+
/**
* xas_retry() - Handle a retry entry.
* @xas: XArray operation state.
@@ -465,10 +510,15 @@ static inline bool xas_retry(struct xa_state *xas, void *entry)
}

void *xas_load(struct xa_state *);
+void *xas_store(struct xa_state *, void *entry);
+void *xas_create(struct xa_state *);

bool xas_get_tag(const struct xa_state *, xa_tag_t);
void xas_set_tag(const struct xa_state *, xa_tag_t);
void xas_clear_tag(const struct xa_state *, xa_tag_t);
+void xas_init_tags(const struct xa_state *);
+
+bool xas_nomem(struct xa_state *, gfp_t);

/**
* xas_reload() - Refetch an entry from the xarray.
@@ -493,4 +543,52 @@ static inline void *xas_reload(struct xa_state *xas)
return xa_head(xas->xa);
}

+/**
+ * xas_set() - Set up XArray operation state for a different index.
+ * @xas: XArray operation state.
+ * @index: New index into the XArray.
+ *
+ * Move the operation state to refer to a different index. This will
+ * have the effect of starting a walk from the top; see xas_next()
+ * to move to an adjacent index.
+ */
+static inline void xas_set(struct xa_state *xas, unsigned long index)
+{
+ xas->xa_index = index;
+ xas->xa_node = XAS_RESTART;
+}
+
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+/**
+ * xas_set_order() - Set up XArray operation state for a multislot entry.
+ * @xas: XArray operation state.
+ * @index: Target of the operation.
+ * @order: Entry occupies 2^@order indices.
+ */
+static inline void xas_set_order(struct xa_state *xas, unsigned long index,
+ unsigned int order)
+{
+ xas->xa_index = (index >> order) << order;
+ xas->xa_shift = order - (order % XA_CHUNK_SHIFT);
+ xas->xa_sibs = (1 << (order % XA_CHUNK_SHIFT)) - 1;
+ xas->xa_node = XAS_RESTART;
+}
+#endif
+
+/**
+ * xas_set_update() - Set up XArray operation state for a callback.
+ * @xas: XArray operation state.
+ * @update: Function to call when updating a node.
+ *
+ * The XArray can notify a caller after it has updated an xa_node.
+ * This is advanced functionality and is only needed by the page cache.
+ */
+static inline void xas_set_update(struct xa_state *xas, xa_update_node_t update)
+{
+ xas->xa_update = update;
+}
+
+/* Internal functions, mostly shared between radix-tree.c, xarray.c and idr.c */
+void xas_destroy(struct xa_state *);
+
#endif /* _LINUX_XARRAY_H */
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8a8485749433..b24361a6e517 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -46,7 +46,7 @@ static unsigned long height_to_maxnodes[RADIX_TREE_MAX_PATH + 1] __read_mostly;
/*
* Radix tree node cache.
*/
-static struct kmem_cache *radix_tree_node_cachep;
+struct kmem_cache *radix_tree_node_cachep;

/*
* The radix tree is variable-height, so an insert operation not only has
@@ -364,7 +364,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
return ret;
}

-static void radix_tree_node_rcu_free(struct rcu_head *head)
+void radix_tree_node_rcu_free(struct rcu_head *head)
{
struct radix_tree_node *node =
container_of(head, struct radix_tree_node, rcu_head);
diff --git a/lib/xarray.c b/lib/xarray.c
index 8a289c89d3bb..fbbb02c25b6d 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -14,6 +14,8 @@

#include <linux/bitmap.h>
#include <linux/export.h>
+#include <linux/list.h>
+#include <linux/slab.h>
#include <linux/xarray.h>

/*
@@ -74,11 +76,20 @@ static inline void tag_clear(struct xa_node *node, unsigned int offset,
__clear_bit(offset, node->tags[(__force unsigned)tag]);
}

+static inline void tag_set_all(struct xa_node *node, xa_tag_t tag)
+{
+ bitmap_fill(node->tags[(__force unsigned)tag], XA_CHUNK_SIZE);
+}
+
static inline bool tag_any_set(struct xa_node *node, xa_tag_t tag)
{
return !bitmap_empty(node->tags[(__force unsigned)tag], XA_CHUNK_SIZE);
}

+#define tag_inc(tag) do { \
+ tag = (__force xa_tag_t)((__force unsigned)(tag) + 1); \
+} while (0)
+
/* extracts the offset within this node from the index */
static unsigned int get_offset(unsigned long index, struct xa_node *node)
{
@@ -167,6 +178,478 @@ void *xas_load(struct xa_state *xas)
}
EXPORT_SYMBOL_GPL(xas_load);

+/* Move the radix tree node cache here */
+extern struct kmem_cache *radix_tree_node_cachep;
+extern void radix_tree_node_rcu_free(struct rcu_head *head);
+
+static void xa_node_free(struct xa_node *node)
+{
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+}
+
+/*
+ * xas_destroy() - Free any resources allocated during the XArray operation.
+ * @xas: XArray operation state.
+ *
+ * This function is now internal-only (and will be made static once
+ * idr_preload() is removed).
+ */
+void xas_destroy(struct xa_state *xas)
+{
+ struct xa_node *node = xas->xa_alloc;
+
+ if (!node)
+ return;
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ kmem_cache_free(radix_tree_node_cachep, node);
+ xas->xa_alloc = NULL;
+}
+
+/**
+ * xas_nomem() - Allocate memory if needed.
+ * @xas: XArray operation state.
+ * @gfp: Memory allocation flags.
+ *
+ * If we need to add new nodes to the XArray, we try to allocate memory
+ * with GFP_NOWAIT while holding the lock, which will usually succeed.
+ * If it fails, @xas is flagged as needing memory to continue. The caller
+ * should drop the lock and call xas_nomem(). If xas_nomem() succeeds,
+ * the caller should retry the operation.
+ *
+ * Forward progress is guaranteed as one node is allocated here and
+ * stored in the xa_state where it will be found by xas_alloc(). More
+ * nodes will likely be found in the slab allocator, but we do not tie
+ * them up here.
+ *
+ * Return: true if memory was needed, and was successfully allocated.
+ */
+bool xas_nomem(struct xa_state *xas, gfp_t gfp)
+{
+ if (xas->xa_node != XAS_ERROR(ENOMEM)) {
+ xas_destroy(xas);
+ return false;
+ }
+ xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+ if (!xas->xa_alloc)
+ return false;
+ XA_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
+ xas->xa_node = XAS_RESTART;
+ return true;
+}
+EXPORT_SYMBOL_GPL(xas_nomem);
+
+static void *xas_alloc(struct xa_state *xas, unsigned int shift)
+{
+ struct xa_node *parent = xas->xa_node;
+ struct xa_node *node = xas->xa_alloc;
+
+ if (xas_invalid(xas))
+ return NULL;
+
+ if (node) {
+ xas->xa_alloc = NULL;
+ } else {
+ node = kmem_cache_alloc(radix_tree_node_cachep,
+ GFP_NOWAIT | __GFP_NOWARN);
+ if (!node) {
+ xas_set_err(xas, -ENOMEM);
+ return NULL;
+ }
+ }
+
+ if (xas->xa_node) {
+ node->offset = xas->xa_offset;
+ parent->count++;
+ XA_BUG_ON(node, parent->count > XA_CHUNK_SIZE);
+ }
+ XA_BUG_ON(node, shift > BITS_PER_LONG);
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ node->shift = shift;
+ node->count = 0;
+ node->exceptional = 0;
+ RCU_INIT_POINTER(node->parent, xas->xa_node);
+ node->root = xas->xa;
+
+ return node;
+}
+
+/*
+ * Use this to calculate the maximum index that will need to be created
+ * in order to add the entry described by @xas. Because we cannot store a
+ * multiple-index entry at index 0, the calculation is a little more complex
+ * than you might expect.
+ */
+static unsigned long xas_max(struct xa_state *xas)
+{
+ unsigned long mask, max = xas->xa_index;
+
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+ if (xas->xa_shift || xas->xa_sibs) {
+ mask = (((xas->xa_sibs + 1UL) << xas->xa_shift) - 1);
+ max |= mask;
+ if (mask == max)
+ max++;
+ }
+#endif
+
+ return max;
+}
+
+/* The maximum index that can be contained in the array without expanding it */
+static unsigned long max_index(void *entry)
+{
+ if (!xa_is_node(entry))
+ return 0;
+ return (XA_CHUNK_SIZE << xa_to_node(entry)->shift) - 1;
+}
+
+static void xas_shrink(struct xa_state *xas)
+{
+ struct xarray *xa = xas->xa;
+ struct xa_node *node = xas->xa_node;
+
+ for (;;) {
+ void *entry;
+
+ XA_BUG_ON(node, node->count > XA_CHUNK_SIZE);
+ if (node->count != 1)
+ break;
+ entry = xa_entry_locked(xa, node, 0);
+ if (!entry)
+ break;
+ if (!xa_is_node(entry) && node->shift)
+ break;
+ xas->xa_node = XAS_BOUNDS;
+
+ RCU_INIT_POINTER(xa->xa_head, entry);
+ if (xa_track_free(xa) && !tag_get(node, 0, XA_FREE_TAG))
+ xa_tag_clear(xa, XA_FREE_TAG);
+
+ node->count = 0;
+ node->exceptional = 0;
+ if (!xa_is_node(entry))
+ RCU_INIT_POINTER(node->slots[0], XA_RETRY_ENTRY);
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ xa_node_free(node);
+ if (!xa_is_node(entry))
+ break;
+ node = xa_to_node(entry);
+ if (xas->xa_update)
+ xas->xa_update(node);
+ else
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ }
+}
+
+/*
+ * xas_delete_node() - Attempt to delete an xa_node
+ * @xas: Array operation state.
+ *
+ * Attempts to delete the @xas->xa_node. This will fail if @xas->xa_node
+ * has a non-zero reference count.
+ */
+static void xas_delete_node(struct xa_state *xas)
+{
+ struct xa_node *node = xas->xa_node;
+
+ for (;;) {
+ struct xa_node *parent;
+
+ XA_BUG_ON(node, node->count > XA_CHUNK_SIZE);
+ if (node->count)
+ break;
+
+ parent = xa_parent_locked(xas->xa, node);
+ xas->xa_node = parent;
+ xas->xa_offset = node->offset;
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ xa_node_free(node);
+
+ if (!parent) {
+ xas->xa->xa_head = NULL;
+ xas->xa_node = XAS_BOUNDS;
+ return;
+ }
+
+ parent->slots[xas->xa_offset] = NULL;
+ parent->count--;
+ XA_BUG_ON(parent, parent->count > XA_CHUNK_SIZE);
+ node = parent;
+ if (xas->xa_update)
+ xas->xa_update(node);
+ else
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ }
+
+ if (!node->parent)
+ xas_shrink(xas);
+}
+
+/**
+ * xas_free_nodes() - Free this node and all nodes that it references
+ * @xas: Array operation state.
+ * @top: Node to free
+ *
+ * This node has been removed from the tree. We must now free it and all
+ * of its subnodes. There may be RCU walkers with references into the tree,
+ * so we must replace all entries with retry markers.
+ */
+static void xas_free_nodes(struct xa_state *xas, struct xa_node *top)
+{
+ unsigned int offset = 0;
+ struct xa_node *node = top;
+
+ for (;;) {
+ void *entry = xa_entry_locked(xas->xa, node, offset);
+
+ if (xa_is_node(entry)) {
+ node = xa_to_node(entry);
+ offset = 0;
+ continue;
+ }
+ if (entry)
+ RCU_INIT_POINTER(node->slots[offset], XA_RETRY_ENTRY);
+ offset++;
+ while (offset == XA_CHUNK_SIZE) {
+ struct xa_node *parent = xa_parent_locked(xas->xa, node);
+
+ offset = node->offset + 1;
+ node->count = 0;
+ node->exceptional = 0;
+ if (xas->xa_update)
+ xas->xa_update(node);
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ xa_node_free(node);
+ if (node == top)
+ return;
+ node = parent;
+ }
+ }
+}
+
+/*
+ * xas_expand adds nodes to the head of the tree until it has reached
+ * sufficient height to be able to contain @xas->xa_index
+ */
+static int xas_expand(struct xa_state *xas, void *head)
+{
+ struct xarray *xa = xas->xa;
+ struct xa_node *node = NULL;
+ unsigned int shift = 0;
+ unsigned long max = xas_max(xas);
+
+ if (!head) {
+ if (max == 0)
+ return 0;
+ while ((max >> shift) >= XA_CHUNK_SIZE)
+ shift += XA_CHUNK_SHIFT;
+ return shift + XA_CHUNK_SHIFT;
+ } else if (xa_is_node(head)) {
+ node = xa_to_node(head);
+ shift = node->shift + XA_CHUNK_SHIFT;
+ }
+ xas->xa_node = NULL;
+
+ while (max > max_index(head)) {
+ xa_tag_t tag = 0;
+
+ XA_BUG_ON(node, shift > BITS_PER_LONG);
+ node = xas_alloc(xas, shift);
+ if (!node)
+ return -ENOMEM;
+
+ node->count = 1;
+ if (xa_is_value(head))
+ node->exceptional = 1;
+ RCU_INIT_POINTER(node->slots[0], head);
+
+ /* Propagate the aggregated tag info to the new child */
+ if (xa_track_free(xa)) {
+ tag_set_all(node, XA_FREE_TAG);
+ if (!xa_tagged(xa, XA_FREE_TAG)) {
+ tag_clear(node, 0, XA_FREE_TAG);
+ xa_tag_set(xa, XA_FREE_TAG);
+ }
+ tag_inc(tag);
+ }
+ for (;;) {
+ if (xa_tagged(xa, tag))
+ tag_set(node, 0, tag);
+ if (tag == XA_TAG_MAX)
+ break;
+ tag_inc(tag);
+ }
+
+ /*
+ * Now that the new node is fully initialised, we can add
+ * it to the tree
+ */
+ if (xa_is_node(head)) {
+ xa_to_node(head)->offset = 0;
+ rcu_assign_pointer(xa_to_node(head)->parent, node);
+ }
+ head = xa_mk_node(node);
+ rcu_assign_pointer(xa->xa_head, head);
+
+ shift += XA_CHUNK_SHIFT;
+ }
+
+ xas->xa_node = node;
+ return shift;
+}
+
+/**
+ * xas_create() - Create a slot to store an entry in.
+ * @xas: XArray operation state.
+ *
+ * Most users will not need to call this function directly, as it is called
+ * by xas_store(). It is useful for doing conditional store operations
+ * (see the xa_cmpxchg() implementation for an example).
+ *
+ * Return: If the slot already existed, returns the contents of this slot.
+ * If the slot was newly created, returns NULL. If it failed to create the
+ * slot, returns NULL and indicates the error in @xas.
+ */
+void *xas_create(struct xa_state *xas)
+{
+ struct xarray *xa = xas->xa;
+ void *entry;
+ void __rcu **slot;
+ struct xa_node *node = xas->xa_node;
+ int shift;
+ unsigned int order = xas->xa_shift;
+
+ if (xas_top(node)) {
+ entry = xa_head_locked(xa);
+ xas->xa_node = NULL;
+ shift = xas_expand(xas, entry);
+ if (shift < 0)
+ return NULL;
+ entry = xa_head_locked(xa);
+ slot = &xa->xa_head;
+ } else if (xas_error(xas)) {
+ return NULL;
+ } else if (node) {
+ unsigned int offset = xas->xa_offset;
+
+ shift = node->shift;
+ entry = xa_entry_locked(xa, node, offset);
+ slot = &node->slots[offset];
+ } else {
+ shift = 0;
+ entry = xa_head_locked(xa);
+ slot = &xa->xa_head;
+ }
+
+ while (shift > order) {
+ shift -= XA_CHUNK_SHIFT;
+ if (!entry) {
+ node = xas_alloc(xas, shift);
+ if (!node)
+ break;
+ if (xa_track_free(xa))
+ tag_set_all(node, XA_FREE_TAG);
+ rcu_assign_pointer(*slot, xa_mk_node(node));
+ } else if (xa_is_node(entry)) {
+ node = xa_to_node(entry);
+ } else {
+ break;
+ }
+ entry = xas_descend(xas, node);
+ slot = &node->slots[xas->xa_offset];
+ }
+
+ return entry;
+}
+EXPORT_SYMBOL_GPL(xas_create);
+
+static void store_siblings(struct xa_state *xas,
+ void *entry, int *countp, int *valuesp)
+{
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+ struct xa_node *node = xas->xa_node;
+ unsigned int sibs, offset = xas->xa_offset;
+ void *sibling = entry ? xa_mk_sibling(offset) : NULL;
+ void *real = entry;
+
+ if (!entry)
+ sibs = XA_CHUNK_MASK - offset;
+ else if (xas->xa_shift < node->shift)
+ sibs = 0;
+ else
+ sibs = xas->xa_sibs;
+
+ while (sibs--) {
+ void *next = xa_entry(xas->xa, node, ++offset);
+
+ if (!xa_is_sibling(next)) {
+ if (!entry)
+ break;
+ real = next;
+ }
+ RCU_INIT_POINTER(node->slots[offset], sibling);
+ if (xa_is_node(next))
+ xas_free_nodes(xas, xa_to_node(next));
+ *countp += !next - !entry;
+ *valuesp += !xa_is_value(real) - !xa_is_value(entry);
+ }
+#endif
+}
+
+/**
+ * xas_store() - Store this entry in the XArray.
+ * @xas: XArray operation state.
+ * @entry: New entry.
+ *
+ * Return: The old entry at this index.
+ */
+void *xas_store(struct xa_state *xas, void *entry)
+{
+ struct xa_node *node;
+ int count, values;
+ void *curr;
+
+ if (entry)
+ curr = xas_create(xas);
+ else
+ curr = xas_load(xas);
+ if (xas_invalid(xas))
+ return curr;
+ if ((curr == entry) && !xas->xa_sibs)
+ return curr;
+
+ node = xas->xa_node;
+ if (node)
+ rcu_assign_pointer(node->slots[xas->xa_offset], entry);
+ else
+ rcu_assign_pointer(xas->xa->xa_head, entry);
+ if (!entry)
+ xas_init_tags(xas);
+
+ values = !xa_is_value(curr) - !xa_is_value(entry);
+ count = !curr - !entry;
+ if (xa_is_node(curr))
+ xas_free_nodes(xas, xa_to_node(curr));
+
+ if (node) {
+ store_siblings(xas, entry, &count, &values);
+ node->count += count;
+ XA_BUG_ON(node, node->count > XA_CHUNK_SIZE);
+ node->exceptional += values;
+ XA_BUG_ON(node, node->exceptional > XA_CHUNK_SIZE);
+ if ((count || values) && xas->xa_update)
+ xas->xa_update(node);
+ else
+ XA_BUG_ON(node, !list_empty(&node->private_list));
+ if (count < 0)
+ xas_delete_node(xas);
+ }
+
+ return curr;
+}
+EXPORT_SYMBOL_GPL(xas_store);
+
/**
* xas_get_tag() - Returns the state of this tag.
* @xas: XArray operation state.
@@ -246,6 +729,34 @@ void xas_clear_tag(const struct xa_state *xas, xa_tag_t tag)
}
EXPORT_SYMBOL_GPL(xas_clear_tag);

+/**
+ * xas_init_tags() - Initialise all tags for the entry
+ * @xas: Array operations state.
+ *
+ * Initialise all tags for the entry specified by @xas. If we're tracking
+ * free entries with a tag, we need to set it on all entries. All other
+ * tags are cleared.
+ *
+ * This implementation is not as efficient as it could be; we may walk
+ * up the tree multiple times.
+ */
+void xas_init_tags(const struct xa_state *xas)
+{
+ xa_tag_t tag = 0;
+
+ if (xa_track_free(xas->xa)) {
+ xas_set_tag(xas, XA_FREE_TAG);
+ tag_inc(tag);
+ }
+ for (;;) {
+ xas_clear_tag(xas, tag);
+ if (tag == XA_TAG_MAX)
+ break;
+ tag_inc(tag);
+ }
+}
+EXPORT_SYMBOL_GPL(xas_init_tags);
+
/**
* __xa_init() - Initialise an empty XArray.
* @xa: XArray.
@@ -283,6 +794,64 @@ void *xa_load(struct xarray *xa, unsigned long index)
}
EXPORT_SYMBOL(xa_load);

+/**
+ * __xa_erase() - Erase this entry from the XArray while locked.
+ * @xa: XArray.
+ * @index: Index into array.
+ *
+ * If the entry at this index is a multi-index entry then all indices will
+ * be erased, and the entry will no longer be a multi-index entry.
+ * This function expects the xa_lock to be held on entry.
+ *
+ * Return: The old entry at this index.
+ */
+void *__xa_erase(struct xarray *xa, unsigned long index)
+{
+ XA_STATE(xas, xa, index);
+ void *curr = xas_store(&xas, NULL);
+
+ if (xa_is_zero(curr))
+ curr = NULL;
+ return curr;
+}
+EXPORT_SYMBOL_GPL(__xa_erase);
+
+/**
+ * xa_store() - Store this entry in the XArray.
+ * @xa: XArray.
+ * @index: Index into array.
+ * @entry: New entry.
+ * @gfp: Allocation flags.
+ *
+ * Stores almost always succeed. The notable exceptions:
+ * - Attempted to store a reserved pointer entry (-EINVAL)
+ * - Ran out of memory trying to allocate new nodes (-ENOMEM)
+ *
+ * Storing into an existing multislot entry updates the entry of every index.
+ *
+ * Return: The old entry at this index or ERR_PTR() if an error happened.
+ */
+void *xa_store(struct xarray *xa, unsigned long index, void *entry, gfp_t gfp)
+{
+ XA_STATE(xas, xa, index);
+ unsigned long flags;
+ void *curr;
+
+ if (WARN_ON_ONCE(xa_is_internal(entry)))
+ return ERR_PTR(-EINVAL);
+
+ do {
+ xa_lock_irqsave(xa, flags);
+ curr = xas_store(&xas, entry);
+ xa_unlock_irqrestore(xa, flags);
+ } while (xas_nomem(&xas, gfp));
+
+ if (xas_error(&xas))
+ curr = ERR_PTR(xas_error(&xas));
+ return curr;
+}
+EXPORT_SYMBOL(xa_store);
+
/**
* __xa_set_tag() - Set this tag on this entry while locked.
* @xa: XArray.
diff --git a/tools/testing/radix-tree/linux/rcupdate.h b/tools/testing/radix-tree/linux/rcupdate.h
index 25010bf86c1d..fd280b070fdb 100644
--- a/tools/testing/radix-tree/linux/rcupdate.h
+++ b/tools/testing/radix-tree/linux/rcupdate.h
@@ -7,5 +7,6 @@
#define rcu_dereference_raw(p) rcu_dereference(p)
#define rcu_dereference_protected(p, cond) rcu_dereference(p)
#define rcu_dereference_check(p, cond) rcu_dereference(p)
+#define RCU_INIT_POINTER(p, v) (p) = (v)

#endif
diff --git a/tools/testing/radix-tree/xarray-test.c b/tools/testing/radix-tree/xarray-test.c
index 3f8f19cb3739..a9a2b6042177 100644
--- a/tools/testing/radix-tree/xarray-test.c
+++ b/tools/testing/radix-tree/xarray-test.c
@@ -19,6 +19,23 @@

#include "test.h"

+void check_xa_tag(struct xarray *xa)
+{
+ assert(xa_get_tag(xa, 0, XA_TAG_0) == false);
+ assert(xa_set_tag(xa, 0, XA_TAG_0) == NULL);
+ assert(xa_get_tag(xa, 0, XA_TAG_0) == false);
+ assert(xa_store(xa, 0, xa, GFP_KERNEL) == NULL);
+ assert(xa_get_tag(xa, 0, XA_TAG_0) == false);
+ assert(xa_set_tag(xa, 0, XA_TAG_0) == xa);
+ assert(xa_get_tag(xa, 0, XA_TAG_0) == true);
+ assert(xa_get_tag(xa, 1, XA_TAG_0) == false);
+ assert(xa_store(xa, 0, NULL, GFP_KERNEL) == xa);
+ assert(xa_empty(xa));
+ assert(xa_get_tag(xa, 0, XA_TAG_0) == false);
+ assert(xa_set_tag(xa, 0, XA_TAG_0) == NULL);
+ assert(xa_get_tag(xa, 0, XA_TAG_0) == false);
+}
+
void check_xa_load(struct xarray *xa)
{
unsigned long i, j;
@@ -31,16 +48,106 @@ void check_xa_load(struct xarray *xa)
else
assert(!entry);
}
- radix_tree_insert(xa, i, xa_mk_value(i));
+ xa_store(xa, i, xa_mk_value(i), GFP_KERNEL);
+ }
+}
+
+void check_xa_shrink(struct xarray *xa)
+{
+ XA_STATE(xas, xa, 1);
+ struct xa_node *node;
+
+ xa_store(xa, 0, xa_mk_value(0), GFP_KERNEL);
+ xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL);
+
+ assert(xas_load(&xas) == xa_mk_value(1));
+ node = xas.xa_node;
+ assert(node->slots[0] == xa_mk_value(0));
+ rcu_read_lock();
+ xas_store(&xas, NULL);
+ assert(xas.xa_node == XAS_BOUNDS);
+ assert(node->slots[0] == XA_RETRY_ENTRY);
+ rcu_read_unlock();
+ assert(xa_load(xa, 0) == xa_mk_value(0));
+}
+
+static void *xa_store_order(struct xarray *xa, unsigned long index,
+ unsigned order, void *entry)
+{
+ XA_STATE(xas, xa, 0);
+ void *curr;
+
+ xas_set_order(&xas, index, order);
+ do {
+ curr = xas_store(&xas, entry);
+ } while (xas_nomem(&xas, GFP_KERNEL));
+
+ return curr;
+}
+
+void check_multi_store(struct xarray *xa)
+{
+ unsigned long i, j, k;
+
+ xa_store_order(xa, 0, 1, xa_mk_value(0));
+ assert(xa_load(xa, 0) == xa_mk_value(0));
+ assert(xa_load(xa, 1) == xa_mk_value(0));
+ assert(xa_load(xa, 2) == NULL);
+ assert(xa_to_node(xa_head(xa))->count == 2);
+ assert(xa_to_node(xa_head(xa))->exceptional == 2);
+
+ xa_store(xa, 3, xa, GFP_KERNEL);
+ assert(xa_load(xa, 0) == xa_mk_value(0));
+ assert(xa_load(xa, 1) == xa_mk_value(0));
+ assert(xa_load(xa, 2) == NULL);
+ assert(xa_to_node(xa_head(xa))->count == 3);
+ assert(xa_to_node(xa_head(xa))->exceptional == 2);
+
+ xa_store_order(xa, 0, 2, xa_mk_value(1));
+ assert(xa_load(xa, 0) == xa_mk_value(1));
+ assert(xa_load(xa, 1) == xa_mk_value(1));
+ assert(xa_load(xa, 2) == xa_mk_value(1));
+ assert(xa_load(xa, 3) == xa_mk_value(1));
+ assert(xa_load(xa, 4) == NULL);
+ assert(xa_to_node(xa_head(xa))->count == 4);
+ assert(xa_to_node(xa_head(xa))->exceptional == 4);
+
+ xa_store_order(xa, 0, 64, NULL);
+ assert(xa_empty(xa));
+
+ for (i = 0; i < 60; i++) {
+ for (j = 0; j < 60; j++) {
+ xa_store_order(xa, 0, i, xa_mk_value(i));
+ xa_store_order(xa, 0, j, xa_mk_value(j));
+
+ for (k = 0; k < 60; k++) {
+ void *entry = xa_load(xa, (1UL << k) - 1);
+ if ((i < k) && (j < k))
+ assert(entry == NULL);
+ else
+ assert(entry == xa_mk_value(j));
+ }
+
+ xa_erase(xa, 0);
+ assert(xa_empty(xa));
+ }
}
}

void xarray_checks(void)
{
- RADIX_TREE(array, GFP_KERNEL);
+ DEFINE_XARRAY(array);
+
+ check_xa_tag(&array);
+ item_kill_tree(&array);

check_xa_load(&array);
+ item_kill_tree(&array);
+
+ check_xa_shrink(&array);
+ item_kill_tree(&array);

+ check_multi_store(&array);
item_kill_tree(&array);
}

--
2.15.0
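
A note on the nested loop in check_multi_store() above: the assertion
pattern follows from which indices a multi-index entry covers. The sketch
below is a userspace model, not XArray code. sketch_covers() and
sketch_expect_empty() are hypothetical helpers, and the model assumes that
an order-n entry is naturally aligned, so that it spans the 2^n indices
sharing the same upper bits.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not part of the XArray API: report whether a
 * multi-index entry of the given order, stored at @start, covers @index.
 * Assumes natural alignment of the entry (see the note above).
 */
static bool sketch_covers(unsigned long start, unsigned int order,
			  unsigned long index)
{
	/*
	 * Shifting by the full word width is undefined in C, so treat
	 * any order that large as covering the whole index space.
	 */
	if (order >= sizeof(unsigned long) * 8)
		return true;
	return (start >> order) == (index >> order);
}

/*
 * Model of the inner loop of check_multi_store(): after storing an
 * order-@i entry and then an order-@j entry at index 0, is the probe
 * index 2^k - 1 expected to be empty?
 */
static bool sketch_expect_empty(unsigned int i, unsigned int j,
				unsigned int k)
{
	unsigned long probe = (1UL << k) - 1;

	return !sketch_covers(0, i, probe) && !sketch_covers(0, j, probe);
}
```

Under these assumptions the probe index 2^k - 1 lies outside an order-n
entry at index 0 exactly when k > n, which is why the test expects NULL
only when both i < k and j < k.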

2017-12-06 01:08:12

by Matthew Wilcox


From: Matthew Wilcox <[email protected]>

Add myself as XArray and IDR maintainer.

Signed-off-by: Matthew Wilcox <[email protected]>
---
MAINTAINERS | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index d4fdcb12616c..b2f8d606756b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14874,6 +14874,18 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/vdso
S: Maintained
F: arch/x86/entry/vdso/

+XARRAY
+M: Matthew Wilcox <[email protected]>
+M: Matthew Wilcox <[email protected]>
+L: [email protected]
+S: Supported
+F: Documentation/core-api/xarray.rst
+F: lib/idr.c
+F: lib/xarray.c
+F: include/linux/idr.h
+F: include/linux/xarray.h
+F: tools/testing/radix-tree
+
XC2028/3028 TUNER DRIVER
M: Mauro Carvalho Chehab <[email protected]>
M: Mauro Carvalho Chehab <[email protected]>
--
2.15.0

2017-12-06 01:08:24

by Matthew Wilcox

Subject: [PATCH v4 06/73] xarray: Add definition of struct xarray

From: Matthew Wilcox <[email protected]>

This is a direct replacement for struct radix_tree_root. Some of the
struct members have changed name; convert those, and use a #define so
that radix_tree users continue to work without change.

Signed-off-by: Matthew Wilcox <[email protected]>
---
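To illustrate the "#define radix_tree_root xarray" approach described in
the commit message, here is a minimal userspace sketch. Every sketch_ name
is invented for the example and the struct layout is illustrative only; it
is not the kernel definition.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the new structure; member names are illustrative. */
struct sketch_xarray {
	unsigned int flags;
	void *head;
};

/* Keep unconverted code working: the old tag now names the new struct. */
#define sketch_radix_tree_root sketch_xarray

/* "Legacy" code written against the old name compiles unchanged... */
static int sketch_legacy_empty(const struct sketch_radix_tree_root *root)
{
	return root->head == NULL;
}

/* ...and can be handed an object declared with the new name. */
static int sketch_demo(void)
{
	struct sketch_xarray xa = { .flags = 0, .head = NULL };

	return sketch_legacy_empty(&xa);
}
```

The preprocessor rewrites every occurrence of the identifier, not only the
struct tag, which is worth keeping in mind when reading compiler
diagnostics for unconverted code.
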
include/linux/radix-tree.h | 31 ++++---------
include/linux/xarray.h | 45 +++++++++++++++++++
lib/Makefile | 2 +-
lib/idr.c | 4 +-
lib/radix-tree.c | 75 ++++++++++++++++----------------
lib/xarray.c | 47 ++++++++++++++++++++
tools/include/linux/spinlock.h | 1 +
tools/testing/radix-tree/.gitignore | 1 +
tools/testing/radix-tree/Makefile | 8 +++-
tools/testing/radix-tree/linux/bug.h | 1 +
tools/testing/radix-tree/linux/kconfig.h | 1 +
tools/testing/radix-tree/linux/xarray.h | 2 +
tools/testing/radix-tree/multiorder.c | 6 +--
tools/testing/radix-tree/test.c | 6 +--
14 files changed, 158 insertions(+), 72 deletions(-)
create mode 100644 lib/xarray.c
create mode 100644 tools/testing/radix-tree/linux/kconfig.h
create mode 100644 tools/testing/radix-tree/linux/xarray.h

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 5130f44d9f93..f31a278de8eb 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -30,6 +30,9 @@
#include <linux/types.h>
#include <linux/xarray.h>

+/* Keep unconverted code working */
+#define radix_tree_root xarray
+
/*
* The bottom two bits of the slot determine how the remaining bits in the
* slot are interpreted:
@@ -59,10 +62,7 @@ static inline bool radix_tree_is_internal_node(void *ptr)

#define RADIX_TREE_MAX_TAGS 3

-#ifndef RADIX_TREE_MAP_SHIFT
-#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6)
-#endif
-
+#define RADIX_TREE_MAP_SHIFT XA_CHUNK_SHIFT
#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)

@@ -95,35 +95,20 @@ struct radix_tree_node {
unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};

-/* The top bits of gfp_mask are used to store the root tags and the IDR flag */
+/* The top bits of xa_flags are used to store the root tags and the IDR flag */
#define ROOT_IS_IDR ((__force gfp_t)(1 << __GFP_BITS_SHIFT))
#define ROOT_TAG_SHIFT (__GFP_BITS_SHIFT + 1)

-struct radix_tree_root {
- spinlock_t xa_lock;
- gfp_t gfp_mask;
- struct radix_tree_node __rcu *rnode;
-};
-
-#define RADIX_TREE_INIT(name, mask) { \
- .xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock), \
- .gfp_mask = (mask), \
- .rnode = NULL, \
-}
+#define RADIX_TREE_INIT(name, mask) __XARRAY_INIT(name, mask)

#define RADIX_TREE(name, mask) \
struct radix_tree_root name = RADIX_TREE_INIT(name, mask)

-#define INIT_RADIX_TREE(root, mask) \
-do { \
- spin_lock_init(&(root)->xa_lock); \
- (root)->gfp_mask = (mask); \
- (root)->rnode = NULL; \
-} while (0)
+#define INIT_RADIX_TREE(root, mask) __xa_init(root, mask)

static inline bool radix_tree_empty(const struct radix_tree_root *root)
{
- return root->rnode == NULL;
+ return root->xa_head == NULL;
}

/**
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 2c45d87a3476..dcdac2053ea6 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -19,9 +19,54 @@
*/

#include <linux/bug.h>
+#include <linux/compiler.h>
+#include <linux/kconfig.h>
#include <linux/spinlock.h>
#include <linux/types.h>

+/**
+ * struct xarray - The anchor of the XArray.
+ *
+ * To use the xarray, define it statically or embed it in your data structure.
+ * It is a very small data structure, so it does not usually make sense to
+ * allocate it separately and keep a pointer to it in your data structure.
+ */
+/*
+ * If all of the entries in the array are NULL, @xa_head is a NULL pointer.
+ * If the only non-NULL entry in the array is at index 0, @xa_head is that
+ * entry. If any other entry in the array is non-NULL, @xa_head points
+ * to an @xa_node.
+ */
+struct xarray {
+/* private: The entire xarray is opaque */
+ spinlock_t xa_lock;
+ gfp_t xa_flags;
+ void __rcu * xa_head;
+};
+
+#define __XARRAY_INIT(name, flags) { \
+ .xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock), \
+ .xa_flags = flags, \
+ .xa_head = NULL, \
+}
+
+#define XARRAY_INIT(name) __XARRAY_INIT(name, 0)
+
+#define DEFINE_XARRAY(name) struct xarray name = XARRAY_INIT(name)
+
+void __xa_init(struct xarray *, gfp_t flags);
+
+/**
+ * xa_init() - Initialise an empty XArray.
+ * @xa: XArray.
+ *
+ * An empty XArray is full of NULL entries.
+ */
+static inline void xa_init(struct xarray *xa)
+{
+ __xa_init(xa, 0);
+}
+
#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)

/**
diff --git a/lib/Makefile b/lib/Makefile
index d11c48ec8ffd..6aa523acc7c1 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -18,7 +18,7 @@ KCOV_INSTRUMENT_debugobjects.o := n
KCOV_INSTRUMENT_dynamic_debug.o := n

lib-y := ctype.o string.o vsprintf.o cmdline.o \
- rbtree.o radix-tree.o dump_stack.o timerqueue.o\
+ rbtree.o radix-tree.o dump_stack.o timerqueue.o xarray.o \
idr.o int_sqrt.o extable.o \
sha1.o chacha20.o irq_regs.o argv_split.o \
flex_proportions.o ratelimit.o show_mem.o \
diff --git a/lib/idr.c b/lib/idr.c
index 48c53890adc0..b9aa08e198a2 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -35,8 +35,8 @@ int idr_alloc_ul(struct idr *idr, void *ptr, unsigned long *nextid,
if (WARN_ON_ONCE(radix_tree_is_internal_node(ptr)))
return -EINVAL;

- if (WARN_ON_ONCE(!(idr->idr_rt.gfp_mask & ROOT_IS_IDR)))
- idr->idr_rt.gfp_mask |= IDR_RT_MARKER;
+ if (WARN_ON_ONCE(!(idr->idr_rt.xa_flags & ROOT_IS_IDR)))
+ idr->idr_rt.xa_flags |= IDR_RT_MARKER;

radix_tree_iter_init(&iter, *nextid);
slot = idr_get_free(&idr->idr_rt, &iter, gfp, max);
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 0a7a21dd9398..930eb7d298d7 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -123,7 +123,7 @@ static unsigned int radix_tree_descend(const struct radix_tree_node *parent,

static inline gfp_t root_gfp_mask(const struct radix_tree_root *root)
{
- return root->gfp_mask & __GFP_BITS_MASK;
+ return root->xa_flags & __GFP_BITS_MASK;
}

static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
@@ -146,32 +146,32 @@ static inline int tag_get(const struct radix_tree_node *node, unsigned int tag,

static inline void root_tag_set(struct radix_tree_root *root, unsigned tag)
{
- root->gfp_mask |= (__force gfp_t)(1 << (tag + ROOT_TAG_SHIFT));
+ root->xa_flags |= (__force gfp_t)(1 << (tag + ROOT_TAG_SHIFT));
}

static inline void root_tag_clear(struct radix_tree_root *root, unsigned tag)
{
- root->gfp_mask &= (__force gfp_t)~(1 << (tag + ROOT_TAG_SHIFT));
+ root->xa_flags &= (__force gfp_t)~(1 << (tag + ROOT_TAG_SHIFT));
}

static inline void root_tag_clear_all(struct radix_tree_root *root)
{
- root->gfp_mask &= (1 << ROOT_TAG_SHIFT) - 1;
+ root->xa_flags &= (__force gfp_t)((1 << ROOT_TAG_SHIFT) - 1);
}

static inline int root_tag_get(const struct radix_tree_root *root, unsigned tag)
{
- return (__force int)root->gfp_mask & (1 << (tag + ROOT_TAG_SHIFT));
+ return (__force int)root->xa_flags & (1 << (tag + ROOT_TAG_SHIFT));
}

static inline unsigned root_tags_get(const struct radix_tree_root *root)
{
- return (__force unsigned)root->gfp_mask >> ROOT_TAG_SHIFT;
+ return (__force unsigned)root->xa_flags >> ROOT_TAG_SHIFT;
}

static inline bool is_idr(const struct radix_tree_root *root)
{
- return !!(root->gfp_mask & ROOT_IS_IDR);
+ return !!(root->xa_flags & ROOT_IS_IDR);
}

/*
@@ -290,12 +290,12 @@ static void dump_node(struct radix_tree_node *node, unsigned long index)
/* For debug */
static void radix_tree_dump(struct radix_tree_root *root)
{
- pr_debug("radix root: %p rnode %p tags %x\n",
- root, root->rnode,
- root->gfp_mask >> ROOT_TAG_SHIFT);
- if (!radix_tree_is_internal_node(root->rnode))
+ pr_debug("radix root: %p xa_head %p tags %x\n",
+ root, root->xa_head,
+ root->xa_flags >> ROOT_TAG_SHIFT);
+ if (!radix_tree_is_internal_node(root->xa_head))
return;
- dump_node(entry_to_node(root->rnode), 0);
+ dump_node(entry_to_node(root->xa_head), 0);
}

static void dump_ida_node(void *entry, unsigned long index)
@@ -339,9 +339,9 @@ static void dump_ida_node(void *entry, unsigned long index)
static void ida_dump(struct ida *ida)
{
struct radix_tree_root *root = &ida->ida_rt;
- pr_debug("ida: %p node %p free %d\n", ida, root->rnode,
- root->gfp_mask >> ROOT_TAG_SHIFT);
- dump_ida_node(root->rnode, 0);
+ pr_debug("ida: %p node %p free %d\n", ida, root->xa_head,
+ root->xa_flags >> ROOT_TAG_SHIFT);
+ dump_ida_node(root->xa_head, 0);
}
#endif

@@ -575,7 +575,7 @@ int radix_tree_maybe_preload_order(gfp_t gfp_mask, int order)
static unsigned radix_tree_load_root(const struct radix_tree_root *root,
struct radix_tree_node **nodep, unsigned long *maxindex)
{
- struct radix_tree_node *node = rcu_dereference_raw(root->rnode);
+ struct radix_tree_node *node = rcu_dereference_raw(root->xa_head);

*nodep = node;

@@ -604,7 +604,7 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
while (index > shift_maxindex(maxshift))
maxshift += RADIX_TREE_MAP_SHIFT;

- entry = rcu_dereference_raw(root->rnode);
+ entry = rcu_dereference_raw(root->xa_head);
if (!entry && (!is_idr(root) || root_tag_get(root, IDR_FREE)))
goto out;

@@ -632,7 +632,7 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
if (radix_tree_is_internal_node(entry)) {
entry_to_node(entry)->parent = node;
} else if (xa_is_value(entry)) {
- /* Moving an exceptional root->rnode to a node */
+ /* Moving an exceptional root->xa_head to a node */
node->exceptional = 1;
}
/*
@@ -641,7 +641,7 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
*/
node->slots[0] = (void __rcu *)entry;
entry = node_to_entry(node);
- rcu_assign_pointer(root->rnode, entry);
+ rcu_assign_pointer(root->xa_head, entry);
shift += RADIX_TREE_MAP_SHIFT;
} while (shift <= maxshift);
out:
@@ -658,7 +658,7 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root,
bool shrunk = false;

for (;;) {
- struct radix_tree_node *node = rcu_dereference_raw(root->rnode);
+ struct radix_tree_node *node = rcu_dereference_raw(root->xa_head);
struct radix_tree_node *child;

if (!radix_tree_is_internal_node(node))
@@ -686,9 +686,9 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root,
* moving the node from one part of the tree to another: if it
* was safe to dereference the old pointer to it
* (node->slots[0]), it will be safe to dereference the new
- * one (root->rnode) as far as dependent read barriers go.
+ * one (root->xa_head) as far as dependent read barriers go.
*/
- root->rnode = (void __rcu *)child;
+ root->xa_head = (void __rcu *)child;
if (is_idr(root) && !tag_get(node, IDR_FREE, 0))
root_tag_clear(root, IDR_FREE);

@@ -736,9 +736,8 @@ static bool delete_node(struct radix_tree_root *root,

if (node->count) {
if (node_to_entry(node) ==
- rcu_dereference_raw(root->rnode))
- deleted |= radix_tree_shrink(root,
- update_node);
+ rcu_dereference_raw(root->xa_head))
+ deleted |= radix_tree_shrink(root, update_node);
return deleted;
}

@@ -753,7 +752,7 @@ static bool delete_node(struct radix_tree_root *root,
*/
if (!is_idr(root))
root_tag_clear_all(root);
- root->rnode = NULL;
+ root->xa_head = NULL;
}

WARN_ON_ONCE(!list_empty(&node->private_list));
@@ -778,7 +777,7 @@ static bool delete_node(struct radix_tree_root *root,
* at position @index in the radix tree @root.
*
* Until there is more than one item in the tree, no nodes are
- * allocated and @root->rnode is used as a direct slot instead of
+ * allocated and @root->xa_head is used as a direct slot instead of
* pointing to a node, in which case *@nodep will be NULL.
*
* Returns -ENOMEM, or 0 for success.
@@ -788,7 +787,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
void __rcu ***slotp)
{
struct radix_tree_node *node = NULL, *child;
- void __rcu **slot = (void __rcu **)&root->rnode;
+ void __rcu **slot = (void __rcu **)&root->xa_head;
unsigned long maxindex;
unsigned int shift, offset = 0;
unsigned long max = index | ((1UL << order) - 1);
@@ -804,7 +803,7 @@ int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
if (error < 0)
return error;
shift = error;
- child = rcu_dereference_raw(root->rnode);
+ child = rcu_dereference_raw(root->xa_head);
}

while (shift > order) {
@@ -995,7 +994,7 @@ EXPORT_SYMBOL(__radix_tree_insert);
* tree @root.
*
* Until there is more than one item in the tree, no nodes are
- * allocated and @root->rnode is used as a direct slot instead of
+ * allocated and @root->xa_head is used as a direct slot instead of
* pointing to a node, in which case *@nodep will be NULL.
*/
void *__radix_tree_lookup(const struct radix_tree_root *root,
@@ -1008,7 +1007,7 @@ void *__radix_tree_lookup(const struct radix_tree_root *root,

restart:
parent = NULL;
- slot = (void __rcu **)&root->rnode;
+ slot = (void __rcu **)&root->xa_head;
radix_tree_load_root(root, &node, &maxindex);
if (index > maxindex)
return NULL;
@@ -1160,9 +1159,9 @@ void __radix_tree_replace(struct radix_tree_root *root,
/*
* This function supports replacing exceptional entries and
* deleting entries, but that needs accounting against the
- * node unless the slot is root->rnode.
+ * node unless the slot is root->xa_head.
*/
- WARN_ON_ONCE(!node && (slot != (void __rcu **)&root->rnode) &&
+ WARN_ON_ONCE(!node && (slot != (void __rcu **)&root->xa_head) &&
(count || exceptional));
replace_slot(slot, item, node, count, exceptional);

@@ -1714,7 +1713,7 @@ void __rcu **radix_tree_next_chunk(const struct radix_tree_root *root,
iter->tags = 1;
iter->node = NULL;
__set_iter_shift(iter, 0);
- return (void __rcu **)&root->rnode;
+ return (void __rcu **)&root->xa_head;
}

do {
@@ -2108,7 +2107,7 @@ void __rcu **idr_get_free(struct radix_tree_root *root,
unsigned long max)
{
struct radix_tree_node *node = NULL, *child;
- void __rcu **slot = (void __rcu **)&root->rnode;
+ void __rcu **slot = (void __rcu **)&root->xa_head;
unsigned long maxindex, start = iter->next_index;
unsigned int shift, offset = 0;

@@ -2124,7 +2123,7 @@ void __rcu **idr_get_free(struct radix_tree_root *root,
if (error < 0)
return ERR_PTR(error);
shift = error;
- child = rcu_dereference_raw(root->rnode);
+ child = rcu_dereference_raw(root->xa_head);
}

while (shift) {
@@ -2187,10 +2186,10 @@ void __rcu **idr_get_free(struct radix_tree_root *root,
*/
void idr_destroy(struct idr *idr)
{
- struct radix_tree_node *node = rcu_dereference_raw(idr->idr_rt.rnode);
+ struct radix_tree_node *node = rcu_dereference_raw(idr->idr_rt.xa_head);
if (radix_tree_is_internal_node(node))
radix_tree_free_nodes(node);
- idr->idr_rt.rnode = NULL;
+ idr->idr_rt.xa_head = NULL;
root_tag_set(&idr->idr_rt, IDR_FREE);
}
EXPORT_SYMBOL(idr_destroy);
diff --git a/lib/xarray.c b/lib/xarray.c
new file mode 100644
index 000000000000..67ddcb3e630c
--- /dev/null
+++ b/lib/xarray.c
@@ -0,0 +1,47 @@
+/*
+ * XArray implementation
+ * Copyright (c) 2017 Microsoft Corporation
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/export.h>
+#include <linux/xarray.h>
+
+/*
+ * Coding conventions in this file:
+ *
+ * @xa is used to refer to the entire xarray.
+ * @xas is the 'xarray operation state'. It may be either a pointer to
+ * an xa_state, or an xa_state stored on the stack. This is an unfortunate
+ * ambiguity.
+ * @index is the index of the entry being operated on.
+ * @tag is an xa_tag_t; a small number indicating one of the tag bits.
+ * @node refers to an xa_node; usually the primary one being operated on by
+ * this function.
+ * @offset is the index into the slots array inside an xa_node.
+ * @parent refers to the @xa_node closer to the head than @node.
+ * @entry refers to something stored in a slot in the xarray.
+ */
+
+/**
+ * __xa_init() - Initialise an empty XArray.
+ * @xa: XArray.
+ * @flags: XA_FLAG values.
+ *
+ * An empty XArray is full of NULL pointers.
+ */
+void __xa_init(struct xarray *xa, gfp_t flags)
+{
+ spin_lock_init(&xa->xa_lock);
+ xa->xa_flags = flags;
+ xa->xa_head = NULL;
+}
+EXPORT_SYMBOL(__xa_init);
diff --git a/tools/include/linux/spinlock.h b/tools/include/linux/spinlock.h
index b21b586b9854..34fed5c38da2 100644
--- a/tools/include/linux/spinlock.h
+++ b/tools/include/linux/spinlock.h
@@ -8,6 +8,7 @@
#define spinlock_t pthread_mutex_t
#define DEFINE_SPINLOCK(x) pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
#define __SPIN_LOCK_UNLOCKED(x) (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER
+#define spin_lock_init(x) pthread_mutex_init(x, NULL)

#define spin_lock_irqsave(x, f) (void)f, pthread_mutex_lock(x)
#define spin_unlock_irqrestore(x, f) (void)f, pthread_mutex_unlock(x)
diff --git a/tools/testing/radix-tree/.gitignore b/tools/testing/radix-tree/.gitignore
index d4706c0ffceb..8d4df7a72a8e 100644
--- a/tools/testing/radix-tree/.gitignore
+++ b/tools/testing/radix-tree/.gitignore
@@ -4,3 +4,4 @@ idr-test
main
multiorder
radix-tree.c
+xarray.c
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index fa7ee369b3c9..3868bc189199 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -4,7 +4,7 @@ CFLAGS += -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address
LDFLAGS += -fsanitize=address
LDLIBS+= -lpthread -lurcu
TARGETS = main idr-test multiorder
-CORE_OFILES := radix-tree.o idr.o linux.o test.o find_bit.o
+CORE_OFILES := xarray.o radix-tree.o idr.o linux.o test.o find_bit.o
OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o

@@ -33,9 +33,13 @@ vpath %.c ../../lib
$(OFILES): Makefile *.h */*.h generated/map-shift.h \
../../include/linux/*.h \
../../include/asm/*.h \
+ ../../../include/linux/xarray.h \
../../../include/linux/radix-tree.h \
../../../include/linux/idr.h

+xarray.c: ../../../lib/xarray.c
+ sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@
+
radix-tree.c: ../../../lib/radix-tree.c
sed -e 's/^static //' -e 's/__always_inline //' -e 's/inline //' < $< > $@

@@ -46,6 +50,6 @@ idr.c: ../../../lib/idr.c

mapshift:
@if ! grep -qws $(SHIFT) generated/map-shift.h; then \
- echo "#define RADIX_TREE_MAP_SHIFT $(SHIFT)" > \
+ echo "#define XA_CHUNK_SHIFT $(SHIFT)" > \
generated/map-shift.h; \
fi
diff --git a/tools/testing/radix-tree/linux/bug.h b/tools/testing/radix-tree/linux/bug.h
index 23b8ed52f8c8..03dc8a57eb99 100644
--- a/tools/testing/radix-tree/linux/bug.h
+++ b/tools/testing/radix-tree/linux/bug.h
@@ -1 +1,2 @@
+#include <stdio.h>
#include "asm/bug.h"
diff --git a/tools/testing/radix-tree/linux/kconfig.h b/tools/testing/radix-tree/linux/kconfig.h
new file mode 100644
index 000000000000..6c8675859913
--- /dev/null
+++ b/tools/testing/radix-tree/linux/kconfig.h
@@ -0,0 +1 @@
+#include "../../../../include/linux/kconfig.h"
diff --git a/tools/testing/radix-tree/linux/xarray.h b/tools/testing/radix-tree/linux/xarray.h
new file mode 100644
index 000000000000..df3812cda376
--- /dev/null
+++ b/tools/testing/radix-tree/linux/xarray.h
@@ -0,0 +1,2 @@
+#include "generated/map-shift.h"
+#include "../../../../include/linux/xarray.h"
diff --git a/tools/testing/radix-tree/multiorder.c b/tools/testing/radix-tree/multiorder.c
index 684e76f79f4a..24293a2fd82d 100644
--- a/tools/testing/radix-tree/multiorder.c
+++ b/tools/testing/radix-tree/multiorder.c
@@ -191,13 +191,13 @@ static void multiorder_shrink(unsigned long index, int order)

assert(item_insert_order(&tree, 0, order) == 0);

- node = tree.rnode;
+ node = tree.xa_head;

assert(item_insert(&tree, index) == 0);
- assert(node != tree.rnode);
+ assert(node != tree.xa_head);

assert(item_delete(&tree, index) != 0);
- assert(node == tree.rnode);
+ assert(node == tree.xa_head);

for (i = 0; i < max; i++) {
struct item *item = item_lookup(&tree, i);
diff --git a/tools/testing/radix-tree/test.c b/tools/testing/radix-tree/test.c
index 0d69c49177c6..6e1cc2040817 100644
--- a/tools/testing/radix-tree/test.c
+++ b/tools/testing/radix-tree/test.c
@@ -262,7 +262,7 @@ static int verify_node(struct radix_tree_node *slot, unsigned int tag,

void verify_tag_consistency(struct radix_tree_root *root, unsigned int tag)
{
- struct radix_tree_node *node = root->rnode;
+ struct radix_tree_node *node = root->xa_head;
if (!radix_tree_is_internal_node(node))
return;
verify_node(node, tag, !!root_tag_get(root, tag));
@@ -292,13 +292,13 @@ void item_kill_tree(struct radix_tree_root *root)
}
}
assert(radix_tree_gang_lookup(root, (void **)items, 0, 32) == 0);
- assert(root->rnode == NULL);
+ assert(root->xa_head == NULL);
}

void tree_verify_min_height(struct radix_tree_root *root, int maxindex)
{
unsigned shift;
- struct radix_tree_node *node = root->rnode;
+ struct radix_tree_node *node = root->xa_head;
if (!radix_tree_is_internal_node(node)) {
assert(maxindex == 0);
return;
--
2.15.0
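
For readers comparing the two initialisation paths added above
(DEFINE_XARRAY() for static objects, xa_init() for embedded or dynamically
allocated ones), the following is a rough userspace model. The sketch_
names are hypothetical and a plain int stands in for the spinlock; the only
point is that both paths leave the array in the same empty state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_xarray {
	int lock;		/* stand-in for spinlock_t; 0 means unlocked */
	unsigned int flags;
	void *head;		/* NULL while the array is empty */
};

#define SKETCH_XARRAY_INIT(flags_) \
	{ .lock = 0, .flags = (flags_), .head = NULL }
#define SKETCH_DEFINE_XARRAY(name) \
	struct sketch_xarray name = SKETCH_XARRAY_INIT(0)

/* Runtime initialisation, for arrays that cannot be set up statically. */
static void sketch_xa_init(struct sketch_xarray *xa, unsigned int flags)
{
	xa->lock = 0;
	xa->flags = flags;
	xa->head = NULL;
}

static int sketch_xa_empty(const struct sketch_xarray *xa)
{
	return xa->head == NULL;
}

/* Path 1: statically initialised at file scope. */
static SKETCH_DEFINE_XARRAY(sketch_static_array);

/* Path 2: embedded in a larger object and initialised at run time. */
struct sketch_owner {
	int id;
	struct sketch_xarray pages;	/* embedded, not pointed to */
};

static int sketch_demo(void)
{
	struct sketch_owner owner;

	/* Simulate uninitialised memory before calling the init function. */
	memset(&owner, 0xff, sizeof(owner));
	sketch_xa_init(&owner.pages, 0);

	return sketch_xa_empty(&sketch_static_array) &&
	       sketch_xa_empty(&owner.pages);
}
```

Embedding the structure, as sketch_owner does, matches the advice in the
struct xarray kernel-doc that a separate allocation is rarely worthwhile.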

2017-12-06 01:08:21

by Matthew Wilcox

Subject: [PATCH v4 08/73] xarray: Add documentation

From: Matthew Wilcox <[email protected]>

This is documentation on how to use the XArray, not details about its
internal implementation.

Signed-off-by: Matthew Wilcox <[email protected]>
---
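The documentation below describes storing integers between 0 and LONG_MAX
as value entries. As a reading aid, here is a userspace sketch of one way
such an encoding can work. It assumes that a value entry is the integer
shifted left by one with the low bit set, which is consistent with
BITS_PER_XA_VALUE being BITS_PER_LONG - 1 but is an assumption here, not a
statement of the implementation. All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Only integers up to LONG_MAX survive the one-bit shift. */
static bool sketch_value_fits(unsigned long v)
{
	return v <= (unsigned long)LONG_MAX;
}

/* Assumed encoding: shift left by one and set the low bit. */
static void *sketch_mk_value(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* An entry is a value entry if its low bit is set. */
static bool sketch_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

/* Undo the encoding to recover the integer. */
static unsigned long sketch_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* A pointer with both low bits clear is 4-byte aligned. */
static bool sketch_is_aligned_pointer(const void *p)
{
	return ((uintptr_t)p & 3) == 0;
}

/* Round-trip an integer through the encoding. */
static unsigned long sketch_round_trip(unsigned long v)
{
	return sketch_to_value(sketch_mk_value(v));
}
```

With this encoding a 4-byte-aligned pointer can never be mistaken for a
value entry, because its low bit is always clear.
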
Documentation/core-api/index.rst | 1 +
Documentation/core-api/xarray.rst | 281 ++++++++++++++++++++++++++++++++++++++
2 files changed, 282 insertions(+)
create mode 100644 Documentation/core-api/xarray.rst
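
The xa_cmpxchg() paragraph in the new document says the call succeeded if
it returns the same entry that was passed as 'old'. A toy, single-threaded
model of that contract is sketched below. The fixed-size slot array,
sketch_cmpxchg() and sketch_insert_if_absent() are all invented for
illustration, with no locking and no memory allocation, and the argument
list is not meant to match the real function.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SLOTS 8

struct sketch_array {
	void *slot[SKETCH_SLOTS];
};

/*
 * Replace the entry at @index with @entry only if it currently equals
 * @old. Always returns the entry that was found there, so the caller
 * can tell success (return == @old) from failure (return != @old).
 * The caller must keep @index below SKETCH_SLOTS.
 */
static void *sketch_cmpxchg(struct sketch_array *a, unsigned long index,
			    void *old, void *entry)
{
	void *curr = a->slot[index];

	if (curr == old)
		a->slot[index] = entry;
	return curr;
}

/* Insert-if-absent idiom: returns 1 if @entry was stored, 0 otherwise. */
static int sketch_insert_if_absent(struct sketch_array *a,
				   unsigned long index, void *entry)
{
	return sketch_cmpxchg(a, index, NULL, entry) == NULL;
}
```

Returning the previous entry rather than a boolean lets a caller that lost
the race see what is stored there without a second lookup.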

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index d5bbe035316d..eb16ba30aeb6 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -18,6 +18,7 @@ Core utilities
local_ops
workqueue
genericirq
+ xarray
flexible-arrays
librs
genalloc
diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
new file mode 100644
index 000000000000..871161539242
--- /dev/null
+++ b/Documentation/core-api/xarray.rst
@@ -0,0 +1,281 @@
+======
+XArray
+======
+
+Overview
+========
+
+The XArray is an abstract data type which behaves like a very large array
+of pointers. It meets many of the same needs as a hash or a conventional
+resizable array. Unlike a hash, it allows you to sensibly go to the
+next or previous entry in a cache-efficient manner. In contrast to
+a resizable array, there is no need for copying data or changing MMU
+mappings in order to grow the array. It is more memory-efficient,
+parallelisable and cache friendly than a doubly-linked list. It takes
+advantage of RCU to perform lookups without locking.
+
+The XArray implementation is efficient when the indices used are
+densely clustered; hashing the object and using the hash as the index
+will not perform well. The XArray is optimised for small indices,
+but still has good performance with large indices. If your index can be
+larger than ``ULONG_MAX`` then the XArray is not the data type for you.
+The most important user of the XArray is the page cache.
+
+A freshly-initialised XArray contains a ``NULL`` pointer at every index.
+Each non-``NULL`` entry in the array has three bits associated with
+it called tags. Each tag may be flipped on or off independently of
+the others. You can search for entries with a given tag set.
+
+Normal pointers may be stored in the XArray directly. They must be 4-byte
+aligned, which is true for any pointer returned from :c:func:`kmalloc` and
+:c:func:`alloc_page`. It isn't true for arbitrary user-space pointers,
+nor for function pointers. You can store pointers to statically allocated
+objects, as long as those objects have an alignment of at least 4.
+
+The XArray does not support storing :c:func:`IS_ERR` pointers; some
+conflict with data values and others conflict with entries the XArray
+uses for its own purposes. If you need to store special values which
+cannot be confused with real kernel pointers, the values 4, 8, ... 4092
+are available.
+
+You can also store integers between 0 and ``LONG_MAX`` in the XArray.
+You must first convert the integer into an entry using :c:func:`xa_mk_value`.
+When you retrieve an entry from the XArray, you can check whether it is
+a data value by calling :c:func:`xa_is_value`, and convert it back to
+an integer by calling :c:func:`xa_to_value`.
+
+An unusual feature of the XArray is the ability to create entries which
+occupy a range of indices. Once such an entry has been stored, looking
+up any index in the range returns the same entry as looking up any other
+index in the range. Setting a tag on one index sets it on all of them.
+Storing to any index stores to all of them. Multi-index entries can be
+explicitly split into smaller entries; alternatively, storing ``NULL``
+into any index in the range makes the XArray forget about the range.
+
+Normal API
+==========
+
+Start by initialising an XArray, either with :c:func:`DEFINE_XARRAY`
+for statically allocated XArrays or :c:func:`xa_init` for dynamically
+allocated ones.
+
+You can then set entries using :c:func:`xa_store` and get entries
+using :c:func:`xa_load`. :c:func:`xa_store` overwrites any entry with
+the new entry and returns the previous entry stored at that index.
+Storing ``NULL`` never needs to allocate memory, so you can call
+:c:func:`xa_erase` instead to avoid inventing a GFP flags value. There
+is no difference between an entry that has never been stored to and one
+that has most recently had ``NULL`` stored to it.
+
+You can conditionally replace an entry at an index by using
+:c:func:`xa_cmpxchg`. Like :c:func:`cmpxchg`, it will only succeed if
+the entry at that index has the 'old' value. It also returns the entry
+which was at that index; if it returns the same entry which was passed as
+'old', then :c:func:`xa_cmpxchg` succeeded.
+
+You can enquire whether a tag is set on an entry by using
+:c:func:`xa_get_tag`. If the entry is not ``NULL``, you can set a tag
+on it by using :c:func:`xa_set_tag` and remove the tag from an entry by
+calling :c:func:`xa_clear_tag`. You can ask whether any entry in the
+XArray has a particular tag set by calling :c:func:`xa_tagged`.
+
+You can copy entries out of the XArray into a plain array by
+calling :c:func:`xa_get_entries` and copy tagged entries by calling
+:c:func:`xa_get_tagged`. Or you can iterate over the non-``NULL``
+entries in place in the XArray by calling :c:func:`xa_for_each`.
+You may prefer to use :c:func:`xa_find` or :c:func:`xa_find_after`
+to move to the next present entry in the XArray.
+
+Finally, you can remove all entries from an XArray by calling
+:c:func:`xa_destroy`. If the XArray entries are pointers, you may
+wish to free the entries first. You can do this by iterating over
+all non-``NULL`` entries in the XArray using the :c:func:`xa_for_each`
+iterator.
+
+When using the Normal API, you do not have to worry about locking.
+The XArray uses RCU and an irq-safe spinlock to synchronise access to
+the XArray:
+
+No lock needed:
+ * :c:func:`xa_empty`
+ * :c:func:`xa_tagged`
+
+Takes RCU read lock:
+ * :c:func:`xa_load`
+ * :c:func:`xa_for_each`
+ * :c:func:`xa_find`
+ * :c:func:`xa_find_after`
+ * :c:func:`xa_get_entries`
+ * :c:func:`xa_get_tagged`
+ * :c:func:`xa_get_tag`
+
+Takes xa_lock internally:
+ * :c:func:`xa_store`
+ * :c:func:`xa_cmpxchg`
+ * :c:func:`xa_destroy`
+ * :c:func:`xa_set_tag`
+ * :c:func:`xa_clear_tag`
+
+The :c:func:`xa_store` and :c:func:`xa_cmpxchg` functions take a gfp_t
+parameter in case the XArray needs to allocate memory to store this entry.
+If the entry being stored is ``NULL``, no memory allocation needs to be
+performed, and the GFP flags specified here will be ignored.
+
+Advanced API
+============
+
+The advanced API offers more flexibility and better performance at the
+cost of an interface which can be harder to use and has fewer safeguards.
+No locking is done for you by the advanced API, and you are required
+to use the xa_lock while modifying the array. You can choose whether
+to use the xa_lock or the RCU lock while doing read-only operations on
+the array. You can mix advanced and normal operations on the same array;
+indeed the normal API is implemented in terms of the advanced API. The
+advanced API is only available to modules with a GPL-compatible license.
+
+The advanced API is based around the xa_state. This is an opaque data
+structure which you declare on the stack using the :c:func:`XA_STATE`
+macro. This macro initialises the xa_state ready to start walking
+around the XArray. It is used as a cursor to maintain the position
+in the XArray and let you compose various operations together without
+having to restart from the top every time.
+
+The xa_state is also used to store errors. You can call
+:c:func:`xas_error` to retrieve the error. All operations check whether
+the xa_state is in an error state before proceeding, so there's no need
+for you to check for an error after each call; you can make multiple
+calls in succession and only check at a convenient point. The only error
+currently generated by the xarray code itself is ``ENOMEM``, but it supports
+arbitrary errors in case you want to call :c:func:`xas_set_err` yourself.
+
+If the xa_state is holding an ``ENOMEM`` error, calling :c:func:`xas_nomem`
+will attempt to allocate more memory using the specified gfp flags and
+cache it in the xa_state for the next attempt. The idea is that you take
+the xa_lock, attempt the operation and drop the lock. The operation
+attempts to allocate memory while holding the lock, but that allocation
+is more likely to fail. Once you have dropped the lock, :c:func:`xas_nomem`
+can try harder to allocate more memory. It will return ``true`` if it
+is worth retrying the operation (ie that there was a memory error *and*
+more memory was allocated). If it has previously allocated memory, and
+that memory wasn't used, and there is no error (or some error that isn't
+``ENOMEM``), then it will free the memory previously allocated.
+
+Some users wish to hold the xa_lock for their own purpose while performing
+one simple XArray operation, and then some other operation of their
+own, protected by the same lock. While they could declare an xa_state
+and use it to call one of the usual advanced API functions, it is a
+little cumbersome. These interfaces are added on demand; at the moment,
+:c:func:`__xa_erase`, :c:func:`__xa_set_tag` and :c:func:`__xa_clear_tag`
+are available.
+
+Internal Entries
+----------------
+
+The XArray reserves some entries for its own purposes. These are never
+exposed through the normal API, but when using the advanced API, it's
+possible to see them. Usually the best way to handle them is to pass them
+to :c:func:`xas_retry`, and retry the operation if it returns ``true``.
+
+.. flat-table::
+ :widths: 1 1 6
+
+ * - Name
+ - Test
+ - Usage
+
+ * - Node
+ - :c:func:`xa_is_node`
+ - An XArray node. Should never be visible; all functions should recurse
+ into an XArray node.
+
+ * - Sibling
+ - :c:func:`xa_is_sibling`
+ - A non-canonical entry for a multi-index entry. The value indicates
+ which slot in this node has the canonical entry.
+
+ * - Retry
+ - :c:func:`xa_is_retry`
+ - This entry is currently being modified by a thread which has the
+ xa_lock. The node containing this entry may be freed at the end of
+ this RCU period. You should restart the lookup from the head of the
+ array.
+
+Other internal entries may be added in the future. As far as possible, they
+will be handled by :c:func:`xas_retry`.
+
+Additional functionality
+------------------------
+
+The :c:func:`xas_create` function ensures that there is somewhere in the
+XArray to store an entry. It will store ``ENOMEM`` in the xa_state if it
+cannot allocate memory. You do not normally need to call this function
+yourself as it is called by :c:func:`xas_store`.
+
+You can use :c:func:`xas_init_tags` to reset the tags on an entry
+to their default state. This is usually all tags clear, unless the
+XArray is marked with ``XA_FLAGS_TRACK_FREE``, in which case tag 0 is set
+and all other tags are clear. Replacing one entry with another using
+:c:func:`xas_store` will not reset the tags on that entry; if you want
+the tags reset, you should do that explicitly.
+
+The :c:func:`xas_load` function will walk the xa_state as close to the entry
+as it can. If you know the xa_state has already been walked to the
+entry and need to check that the entry hasn't changed, you can use
+:c:func:`xas_reload` to save a function call.
+
+If you need to move to a different index in the XArray, call
+:c:func:`xas_set`. This reinitialises the cursor which will generally
+have the effect of making the next operation walk the cursor to the
+desired spot in the tree. If you want to move to the next or previous
+index, call :c:func:`xas_next` or :c:func:`xas_prev`. Setting the index
+does not walk the cursor around the array so does not require a lock to
+be held, while moving to the next or previous index does.
+
+You can create a multi-index entry by using :c:func:`xas_set_order`.
+If a load or find operation finds a multi-index entry, the index in the
+xa_state will be the one searched for, and not necessarily the
+lowest or highest index used by the entry.
+Currently the only supported multi-index entries are those covering a
+power-of-two number of indices, but there are two potential users of
+arbitrary ranges, so that functionality may be added soon.
+
+You can search for the next present entry using :c:func:`xas_find`. This
+is the equivalent of both :c:func:`xa_find` and :c:func:`xa_find_after`;
+if the cursor has been walked to an entry, then it will find the next
+entry after the one currently referenced. If not, it will return the
+entry at the index of the xa_state. Using :c:func:`xas_next_entry` to
+move to the next present entry instead of :c:func:`xas_find` will save
+a function call in the majority of cases at the expense of emitting more
+inline code.
+
+The :c:func:`xas_find_tag` function is similar, returning the first tagged
+entry after the entry referenced by the xa_state if it has already been
+walked, and returning the entry at the index of the xa_state if it is
+tagged, and the xa_state has not been walked. The :c:func:`xas_next_tag`
+function is the equivalent of :c:func:`xas_next_entry`.
+
+When iterating over a range of the XArray using :c:func:`xas_for_each`
+or :c:func:`xas_for_each_tag`, it may be necessary to temporarily stop
+the iteration. The :c:func:`xas_pause` function exists for this purpose.
+After you have done the necessary work and wish to resume, the xa_state
+is in an appropriate state to continue the iteration after the entry
+you last processed. If you have interrupts disabled while iterating,
+then it is good manners to pause the iteration and reenable interrupts
+every ``XA_CHECK_SCHED`` entries.
+
+The :c:func:`xas_get_tag`, :c:func:`xas_set_tag` and
+:c:func:`xas_clear_tag` functions require the xa_state cursor to have
+been moved to the appropriate location in the xarray; they will do
+nothing if you have called :c:func:`xas_pause` or :c:func:`xas_set`
+immediately before.
+
+You can call :c:func:`xas_set_update` to have a callback function
+called each time the XArray updates a node. This is used by the page
+cache workingset code to maintain its list of nodes which contain only
+shadow entries.
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/xarray.h
+.. kernel-doc:: lib/xarray.c
--
2.15.0
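
As a rough illustration of the xa_cmpxchg() return-value convention
described in the documentation above, here is a minimal userspace sketch.
It is not the XArray implementation: toy_cmpxchg() is a hypothetical name,
it works on a single slot rather than an array, and it is not atomic
(the documentation says the real function takes the xa_lock internally).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical single-slot model of the compare-and-exchange convention:
 * store @entry only if the slot currently holds @old, and always return
 * what was in the slot beforehand.  NOT atomic; illustration only.
 */
static void *toy_cmpxchg(void **slot, void *old, void *entry)
{
	void *curr;

	assert(slot != NULL);
	curr = *slot;
	if (curr == old)
		*slot = entry;
	return curr;
}

/* The caller detects success by comparing the return value with @old. */
static int toy_cmpxchg_succeeded(void **slot, void *old, void *entry)
{
	return toy_cmpxchg(slot, old, entry) == old;
}
```

A caller that wants "insert only if empty" would pass NULL as the old
value and treat any non-NULL return as "somebody else got there first".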

2017-12-06 01:08:05

by Matthew Wilcox

Subject: [PATCH v4 13/73] xarray: Add xa_for_each

From: Matthew Wilcox <[email protected]>

This iterator allows the user to efficiently walk a range of the array,
executing the loop body once for each non-NULL entry in that range.
This commit also includes xa_find() and xa_find_after() which are helper
functions for xa_for_each() but may also be useful in their own right.

In the xas family of functions, we also have xas_for_each(),
xas_find(), xas_next_entry(), xas_pause() and xas_jump(). xas_pause() allows
a xas_for_each() iteration to be resumed later from the next element
and xas_jump() allows an iteration to be resumed from a specified index.

Signed-off-by: Matthew Wilcox <[email protected]>
---
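The way xa_for_each() is composed from xa_find() and xa_find_after() can
be modelled in userspace with a flat array. The sketch below is only
that: every toy_* name and TOY_SIZE are hypothetical, there is no RCU,
no retry entry and no tree, and the wrap-around check in toy_find_after()
is an addition of the sketch rather than something taken from this patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SIZE 16UL	/* hypothetical fixed capacity of the toy array */

/* Lowest non-NULL entry with *indexp <= index <= max; updates *indexp. */
static void *toy_find(void *const *arr, unsigned long *indexp,
		      unsigned long max)
{
	unsigned long i;

	assert(arr != NULL && indexp != NULL);
	for (i = *indexp; i < TOY_SIZE && i <= max; i++) {
		if (arr[i]) {
			*indexp = i;
			return arr[i];
		}
	}
	return NULL;	/* *indexp is left untouched when nothing is found */
}

/* Lowest non-NULL entry with *indexp < index <= max; updates *indexp. */
static void *toy_find_after(void *const *arr, unsigned long *indexp,
			    unsigned long max)
{
	unsigned long next = *indexp + 1;
	void *entry;

	if (next == 0)	/* *indexp was ULONG_MAX: nothing can follow it */
		return NULL;
	entry = toy_find(arr, &next, max);
	if (entry)
		*indexp = next;
	return entry;
}

/* Same shape as the xa_for_each() macro in the diff below. */
#define toy_for_each(arr, entry, index, max)			\
	for (entry = toy_find(arr, &index, max); entry;		\
	     entry = toy_find_after(arr, &index, max))

/* Count the present entries between @start and @max, inclusive. */
static unsigned long toy_count(void *const *arr, unsigned long start,
			       unsigned long max)
{
	unsigned long index = start, n = 0;
	void *entry;

	toy_for_each(arr, entry, index, max)
		n++;
	return n;
}
```

Because each step restarts the search from the saved index, the loop body
may change the array or the index between steps. That is the property
the kernel-doc for xa_for_each() relies on, at the cost of a fresh lookup
per step.
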
include/linux/xarray.h | 111 ++++++++++++++++++++++
lib/radix-tree.c | 4 +-
lib/xarray.c | 166 +++++++++++++++++++++++++++++++++
tools/testing/radix-tree/xarray-test.c | 91 ++++++++++++++++++
4 files changed, 370 insertions(+), 2 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index a570d7d9a252..a62baf6f1a28 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -133,6 +133,35 @@ bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
void *xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
void *xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);

+void *xa_find(struct xarray *xa, unsigned long *index, unsigned long max);
+void *xa_find_after(struct xarray *xa, unsigned long *index, unsigned long max);
+
+/**
+ * xa_for_each() - Iterate over a portion of an XArray.
+ * @xa: XArray.
+ * @entry: Entry retrieved from array.
+ * @index: Index of @entry.
+ * @max: Maximum index to retrieve from array.
+ *
+ * Initialise @index to the minimum index you want to retrieve from
+ * the array. During the iteration, @entry will have the value of the
+ * entry stored in @xa at @index. The iteration will skip all NULL
+ * entries in the array. You may modify @index during the
+ * iteration if you want to skip indices. It is safe to modify the
+ * array during the iteration. At the end of the iteration, @entry will
+ * be set to NULL and @index will have a value less than or equal to @max.
+ *
+ * xa_for_each() is O(n.log(n)) while xas_for_each() is O(n). You have
+ * to handle your own locking with xas_for_each(), and if you have to unlock
+ * after each iteration, it will also end up being O(n.log(n)). xa_for_each()
+ * will spin if it hits a retry entry; if you intend to see retry entries,
+ * you should use the xas_for_each() iterator instead. The xas_for_each()
+ * iterator will expand into more inline code than xa_for_each().
+ */
+#define xa_for_each(xa, entry, index, max) \
+ for (entry = xa_find(xa, &index, max); entry; \
+ entry = xa_find_after(xa, &index, max))
+
#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)

/**
@@ -486,6 +515,12 @@ static inline bool xas_valid(const struct xa_state *xas)
return !xas_invalid(xas);
}

+/* True if the pointer is something other than a node */
+static inline bool xas_not_node(struct xa_node *node)
+{
+ return (unsigned long)node < 4096;
+}
+
/* True if the node represents head-of-tree, RESTART or BOUNDS */
static inline bool xas_top(struct xa_node *node)
{
@@ -514,6 +549,7 @@ static inline bool xas_retry(struct xa_state *xas, void *entry)
void *xas_load(struct xa_state *);
void *xas_store(struct xa_state *, void *entry);
void *xas_create(struct xa_state *);
+void *xas_find(struct xa_state *, unsigned long max);

bool xas_get_tag(const struct xa_state *, xa_tag_t);
void xas_set_tag(const struct xa_state *, xa_tag_t);
@@ -521,6 +557,7 @@ void xas_clear_tag(const struct xa_state *, xa_tag_t);
void xas_init_tags(const struct xa_state *);

bool xas_nomem(struct xa_state *, gfp_t);
+void xas_pause(struct xa_state *);

/**
* xas_reload() - Refetch an entry from the xarray.
@@ -590,6 +627,80 @@ static inline void xas_set_update(struct xa_state *xas, xa_update_node_t update)
xas->xa_update = update;
}

+/* Skip over any of these entries when iterating */
+static inline bool xa_iter_skip(void *entry)
+{
+ return unlikely(!entry ||
+ (xa_is_internal(entry) && entry < XA_RETRY_ENTRY));
+}
+
+/*
+ * node->shift is always 0 for the inline iterators unless we're processing
+ * a multi-index entry.
+ */
+#ifdef CONFIG_RADIX_TREE_MULTIORDER
+#define xa_node_shift(node) node->shift
+#else
+#define xa_node_shift(node) 0
+#endif
+
+/**
+ * xas_next_entry() - Advance iterator to next present entry.
+ * @xas: XArray operation state.
+ * @max: Highest index to return.
+ *
+ * xas_next_entry() is an inline function to optimise xarray traversal for
+ * speed. It is equivalent to calling xas_find(), and will call xas_find()
+ * for all the hard cases.
+ *
+ * Return: The next present entry after the one currently referred to by @xas.
+ */
+static inline void *xas_next_entry(struct xa_state *xas, unsigned long max)
+{
+ struct xa_node *node = xas->xa_node;
+ void *entry;
+
+ if (unlikely(xas_not_node(node) || xa_node_shift(node)))
+ return xas_find(xas, max);
+
+ do {
+ if (unlikely(xas->xa_index >= max))
+ return xas_find(xas, max);
+ if (unlikely(xas->xa_offset == XA_CHUNK_MASK))
+ return xas_find(xas, max);
+ xas->xa_index++;
+ xas->xa_offset++;
+ entry = xa_entry(xas->xa, node, xas->xa_offset);
+ } while (xa_iter_skip(entry));
+
+ return entry;
+}
+
+/*
+ * If iterating while holding a lock, drop the lock and reschedule
+ * every %XA_CHECK_SCHED loops.
+ */
+enum {
+ XA_CHECK_SCHED = 4096,
+};
+
+/**
+ * xas_for_each() - Iterate over a range of an XArray
+ * @xas: XArray operation state.
+ * @entry: Entry retrieved from array.
+ * @max: Maximum index to retrieve from array.
+ *
+ * The loop body will be executed for each entry present in the xarray
+ * between the current xas position and @max. @entry will be set to
+ * the entry retrieved from the xarray. It is safe to delete entries
+ * from the array in the loop body. You should hold either the RCU lock
+ * or the xa_lock while iterating. If you need to drop the lock, call
+ * xas_pause() first.
+ */
+#define xas_for_each(xas, entry, max) \
+ for (entry = xas_find(xas, max); entry; \
+ entry = xas_next_entry(xas, max))
+
/* Internal functions, mostly shared between radix-tree.c, xarray.c and idr.c */
void xas_destroy(struct xa_state *);

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b24361a6e517..cb7cb9e96a8b 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -247,7 +247,7 @@ static inline unsigned long node_maxindex(const struct radix_tree_node *node)
return shift_maxindex(node->shift);
}

-static unsigned long next_index(unsigned long index,
+static unsigned long rnext_index(unsigned long index,
const struct radix_tree_node *node,
unsigned long offset)
{
@@ -2103,7 +2103,7 @@ void __rcu **idr_get_free(struct radix_tree_root *root,
if (!rtag_get(node, IDR_FREE, offset)) {
offset = radix_tree_find_next_bit(node, IDR_FREE,
offset + 1);
- start = next_index(start, node, offset);
+ start = rnext_index(start, node, offset);
if (start > max)
return ERR_PTR(-ENOSPC);
while (offset == RADIX_TREE_MAP_SIZE) {
diff --git a/lib/xarray.c b/lib/xarray.c
index 6625b1763123..ac4ff3daf476 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -96,6 +96,12 @@ static unsigned int get_offset(unsigned long index, struct xa_node *node)
return (index >> node->shift) & XA_CHUNK_MASK;
}

+static void xas_add(struct xa_state *xas, unsigned long val)
+{
+ xas->xa_index += (val << xas->xa_node->shift);
+ xas->xa_offset += val;
+}
+
static void *set_bounds(struct xa_state *xas)
{
xas->xa_node = XAS_BOUNDS;
@@ -757,6 +763,101 @@ void xas_init_tags(const struct xa_state *xas)
}
EXPORT_SYMBOL_GPL(xas_init_tags);

+/**
+ * xas_pause() - Pause a walk to drop a lock.
+ * @xas: XArray operation state.
+ *
+ * Some users need to pause a walk and drop the lock they're holding in
+ * order to yield to a higher priority thread or carry out an operation
+ * on an entry. Those users should call this function before they drop
+ * the lock. It resets the @xas to be suitable for the next iteration
+ * of the loop after the user has reacquired the lock. If most entries
+ * found during a walk require you to call xas_pause(), the xa_for_each()
+ * iterator may be more appropriate.
+ *
+ * Note that xas_pause() only works for forward iteration. If a user needs
+ * to pause a reverse iteration, we will need a xas_pause_rev().
+ */
+void xas_pause(struct xa_state *xas)
+{
+ struct xa_node *node = xas->xa_node;
+
+ if (xas_invalid(xas))
+ return;
+
+ if (node) {
+ unsigned int offset = xas->xa_offset;
+ while (++offset < XA_CHUNK_SIZE) {
+ if (!xa_is_sibling(xa_entry(xas->xa, node, offset)))
+ break;
+ }
+ xas->xa_index += (offset - xas->xa_offset) << node->shift;
+ } else {
+ xas->xa_index++;
+ }
+ xas->xa_node = XAS_RESTART;
+}
+EXPORT_SYMBOL_GPL(xas_pause);
+
+/**
+ * xas_find() - Find the next present entry in the XArray.
+ * @xas: XArray operation state.
+ * @max: Highest index to return.
+ *
+ * If the xas has not yet been walked to an entry, return the entry
+ * which has an index >= xas.xa_index. If it has been walked, the entry
+ * currently being pointed at has been processed, and so we move to the
+ * next entry.
+ *
+ * If no entry is found and the array is smaller than @max, the iterator
+ * is set to the smallest index not yet in the array. This allows @xas
+ * to be immediately passed to xas_create().
+ *
+ * Return: The entry, if found, otherwise NULL.
+ */
+void *xas_find(struct xa_state *xas, unsigned long max)
+{
+ void *entry;
+
+ if (xas_error(xas))
+ return NULL;
+
+ if (!xas->xa_node) {
+ xas->xa_index = 1;
+ return set_bounds(xas);
+ } else if (xas_top(xas->xa_node)) {
+ entry = xas_load(xas);
+ if (entry || xas_not_node(xas->xa_node))
+ return entry;
+ }
+
+ xas_add(xas, 1);
+
+ while (xas->xa_node && (xas->xa_index <= max)) {
+ if (unlikely(xas->xa_offset == XA_CHUNK_SIZE)) {
+ xas->xa_offset = xas->xa_node->offset + 1;
+ xas->xa_node = xa_parent(xas->xa, xas->xa_node);
+ continue;
+ }
+
+ entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
+ if (xa_is_node(entry)) {
+ xas->xa_node = xa_to_node(entry);
+ xas->xa_offset = 0;
+ continue;
+ }
+ if (!xa_iter_skip(entry))
+ return entry;
+
+ xas_add(xas, 1);
+ }
+
+ if (!xas->xa_node)
+ xas->xa_node = XAS_BOUNDS;
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(xas_find);
+
/**
* __xa_init() - Initialise an empty XArray.
* @xa: XArray.
@@ -1009,6 +1110,71 @@ void *xa_clear_tag(struct xarray *xa, unsigned long index, xa_tag_t tag)
}
EXPORT_SYMBOL(xa_clear_tag);

+/**
+ * xa_find() - Search the XArray for a present entry.
+ * @xa: XArray.
+ * @indexp: Pointer to an index.
+ * @max: Maximum index to search to.
+ *
+ * Finds the entry in the xarray with the lowest index that is between
+ * *@indexp and @max, inclusive. If an entry is found, updates @indexp to
+ * be the index of the entry. This function is protected by the RCU read
+ * lock, so it may not find all entries if called in a loop. It will not
+ * return an %XA_RETRY_ENTRY; if you need to see retry entries, use xas_find().
+ *
+ * Return: The entry, if found, otherwise NULL.
+ */
+void *xa_find(struct xarray *xa, unsigned long *indexp, unsigned long max)
+{
+ XA_STATE(xas, xa, *indexp);
+ void *entry;
+
+ rcu_read_lock();
+ do {
+ entry = xas_find(&xas, max);
+ } while (xas_retry(&xas, entry));
+ rcu_read_unlock();
+
+ if (entry)
+ *indexp = xas.xa_index;
+ return entry;
+}
+EXPORT_SYMBOL(xa_find);
+
+/**
+ * xa_find_after() - Search the XArray for a present entry.
+ * @xa: XArray.
+ * @indexp: Pointer to an index.
+ * @max: Maximum index to search to.
+ *
+ * Finds the entry in @xa with the lowest index that is above *@indexp and
+ * less than or equal to @max. If an entry is found, updates @indexp to be
+ * the index of the pointer. This function is protected by the RCU read
+ * lock, so it may miss entries which are being simultaneously added. It
+ * will not return an %XA_RETRY_ENTRY; if you need to see retry entries,
+ * use xas_find().
+ *
+ * Return: The pointer, if found, otherwise NULL.
+ */
+void *xa_find_after(struct xarray *xa, unsigned long *indexp, unsigned long max)
+{
+ XA_STATE(xas, xa, *indexp + 1);
+ void *entry;
+
+ rcu_read_lock();
+ do {
+ entry = xas_find(&xas, max);
+ if (*indexp >= xas.xa_index)
+ entry = xas_next_entry(&xas, max);
+ } while (xas_retry(&xas, entry));
+ rcu_read_unlock();
+
+ if (entry)
+ *indexp = xas.xa_index;
+ return entry;
+}
+EXPORT_SYMBOL(xa_find_after);
+
#ifdef XA_DEBUG
void xa_dump_entry(void *entry, unsigned long index)
{
diff --git a/tools/testing/radix-tree/xarray-test.c b/tools/testing/radix-tree/xarray-test.c
index a9a2b6042177..cc5d0b7a1edf 100644
--- a/tools/testing/radix-tree/xarray-test.c
+++ b/tools/testing/radix-tree/xarray-test.c
@@ -36,6 +36,72 @@ void check_xa_tag(struct xarray *xa)
assert(xa_get_tag(xa, 0, XA_TAG_0) == false);
}

+/* Check that putting the xas into an error state works correctly */
+void check_xas_error(struct xarray *xa)
+{
+ XA_STATE(xas, xa, 0);
+
+ assert(xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL) == 0);
+ assert(xa_load(xa, 1) == xa_mk_value(1));
+
+ assert(xas_error(&xas) == 0);
+
+ xas_set_err(&xas, -ENOTTY);
+ assert(xas_error(&xas) == -ENOTTY);
+
+ xas_set_err(&xas, -ENOSPC);
+ assert(xas_error(&xas) == -ENOSPC);
+
+ xas_set_err(&xas, -ENOMEM);
+ assert(xas_error(&xas) == -ENOMEM);
+
+ assert(xas_load(&xas) == NULL);
+ assert(xas_store(&xas, &xas) == NULL);
+ assert(xas_load(&xas) == NULL);
+
+ assert(xas.xa_index == 0);
+ assert(xas_next(&xas) == NULL);
+ assert(xas.xa_index == 0);
+
+ assert(xas_prev(&xas) == NULL);
+ assert(xas.xa_index == 0);
+
+ xas_set_err(&xas, 0);
+ assert(xas_error(&xas) == 0);
+
+ assert(xas_find(&xas, ULONG_MAX) == xa_mk_value(1));
+ assert(xas.xa_index == 1);
+ assert(xas_error(&xas) == 0);
+
+ assert(xas_find(&xas, ULONG_MAX) == NULL);
+ assert(xas.xa_index > 1);
+ assert(xas_error(&xas) == 0);
+ assert(xas.xa_node == XAS_BOUNDS);
+}
+
+void check_xas_retry(struct xarray *xa)
+{
+ XA_STATE(xas, xa, 0);
+
+ xa_store(xa, 0, xa_mk_value(0), GFP_KERNEL);
+ xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL);
+
+ assert(xas_find(&xas, ULONG_MAX) == xa_mk_value(0));
+ xa_erase(xa, 1);
+ assert(xa_is_retry(xas_reload(&xas)));
+ assert(!xas_retry(&xas, NULL));
+ assert(!xas_retry(&xas, xa_mk_value(0)));
+ assert(xas_retry(&xas, XA_RETRY_ENTRY));
+ assert(xas.xa_node == XAS_RESTART);
+ assert(xas_next_entry(&xas, ULONG_MAX) == xa_mk_value(0));
+ assert(xas.xa_node == NULL);
+
+ xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL);
+ assert(xa_is_internal(xas_reload(&xas)));
+ xas.xa_node = XAS_RESTART;
+ assert(xas_next_entry(&xas, ULONG_MAX) == xa_mk_value(0));
+}
+
void check_xa_load(struct xarray *xa)
{
unsigned long i, j;
@@ -134,6 +200,22 @@ void check_multi_store(struct xarray *xa)
}
}

+void check_find(struct xarray *xa)
+{
+ unsigned long index;
+ xa_store_order(xa, 12, 2, xa_mk_value(12));
+ xa_store(xa, 16, xa_mk_value(16), GFP_KERNEL);
+
+ index = 0;
+ assert(xa_find(xa, &index, ULONG_MAX) == xa_mk_value(12));
+ assert(index == 12);
+ index = 13;
+ assert(xa_find(xa, &index, ULONG_MAX) == xa_mk_value(12));
+ assert(index >= 12 && index < 16);
+ assert(xa_find_after(xa, &index, ULONG_MAX) == xa_mk_value(16));
+ assert(index == 16);
+}
+
void xarray_checks(void)
{
DEFINE_XARRAY(array);
@@ -141,6 +223,12 @@ void xarray_checks(void)
check_xa_tag(&array);
item_kill_tree(&array);

+ check_xas_error(&array);
+ item_kill_tree(&array);
+
+ check_xas_retry(&array);
+ item_kill_tree(&array);
+
check_xa_load(&array);
item_kill_tree(&array);

@@ -149,6 +237,9 @@ void xarray_checks(void)

check_multi_store(&array);
item_kill_tree(&array);
+
+ check_find(&array);
+ item_kill_tree(&array);
}

int __weak main(void)
--
2.15.0

2017-12-06 01:07:55

by Matthew Wilcox

Subject: [PATCH v4 24/73] page cache: Add and replace pages using the XArray

From: Matthew Wilcox <[email protected]>

Use the XArray APIs to add and replace pages in the page cache. This
removes two uses of the radix tree preload API and is significantly
shorter code.

Signed-off-by: Matthew Wilcox <[email protected]>
---
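The do { ... } while (xas_nomem()) loop used in __add_to_page_cache_locked()
below follows the "try under the lock, allocate outside it, retry" pattern.
Here is a userspace model of that control flow, under loud assumptions:
struct toy_state, toy_store(), toy_nomem() and toy_insert() are all
hypothetical, malloc() stands in for the node allocator, and the lock is
only marked by comments.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the xa_state: an error plus one cached node. */
struct toy_state {
	int err;	/* 0 or a negative errno */
	void *spare;	/* memory preallocated while the lock was dropped */
};

/*
 * Hypothetical operation run "under the lock".  In this model it may only
 * consume memory cached earlier; it never allocates by itself.
 */
static void toy_store(struct toy_state *st, void **slot)
{
	if (st->err)
		return;			/* operations are no-ops on error */
	if (!st->spare) {
		st->err = -ENOMEM;	/* record it and let the caller retry */
		return;
	}
	*slot = st->spare;
	st->spare = NULL;
}

/* Models the documented contract: true means "worth retrying". */
static bool toy_nomem(struct toy_state *st)
{
	if (st->err != -ENOMEM) {
		free(st->spare);	/* release an unused preallocation */
		st->spare = NULL;
		return false;
	}
	st->spare = malloc(64);
	if (!st->spare)
		return false;		/* still -ENOMEM: give up */
	st->err = 0;
	return true;
}

/* Returns the number of attempts needed, or a negative errno. */
static int toy_insert(void **slot)
{
	struct toy_state st = { 0, NULL };
	int attempts = 0;

	assert(slot != NULL);
	do {
		/* the lock would be taken here */
		attempts++;
		toy_store(&st, slot);
		/* the lock would be dropped here */
	} while (toy_nomem(&st));

	return st.err ? st.err : attempts;
}
```

In this model the first attempt always fails, which exaggerates the real
behaviour. The point is only that errors other than ENOMEM, such as the
-EEXIST set in the patch, fall straight out of the loop.
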
mm/filemap.c | 142 +++++++++++++++++++++++++----------------------------------
1 file changed, 61 insertions(+), 81 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 51f88ffc5319..2439747a0a17 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -112,34 +112,6 @@
* ->tasklist_lock (memory_failure, collect_procs_ao)
*/

-static int page_cache_tree_insert(struct address_space *mapping,
- struct page *page, void **shadowp)
-{
- struct radix_tree_node *node;
- void **slot;
- int error;
-
- error = __radix_tree_create(&mapping->pages, page->index, 0,
- &node, &slot);
- if (error)
- return error;
- if (*slot) {
- void *p;
-
- p = radix_tree_deref_slot_protected(slot, &mapping->pages.xa_lock);
- if (!xa_is_value(p))
- return -EEXIST;
-
- mapping->nrexceptional--;
- if (shadowp)
- *shadowp = p;
- }
- __radix_tree_replace(&mapping->pages, node, slot, page,
- workingset_lookup_update(mapping));
- mapping->nrpages++;
- return 0;
-}
-
static void page_cache_tree_delete(struct address_space *mapping,
struct page *page, void *shadow)
{
@@ -775,51 +747,44 @@ EXPORT_SYMBOL(file_write_and_wait_range);
* locked. This function does not add the new page to the LRU, the
* caller must do that.
*
- * The remove + add is atomic. The only way this function can fail is
- * memory allocation failure.
+ * The remove + add is atomic. This function cannot fail.
*/
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
{
- int error;
+ struct address_space *mapping = old->mapping;
+ void (*freepage)(struct page *) = mapping->a_ops->freepage;
+ pgoff_t offset = old->index;
+ XA_STATE(xas, &mapping->pages, offset);
+ unsigned long flags;

VM_BUG_ON_PAGE(!PageLocked(old), old);
VM_BUG_ON_PAGE(!PageLocked(new), new);
VM_BUG_ON_PAGE(new->mapping, new);

- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
- if (!error) {
- struct address_space *mapping = old->mapping;
- void (*freepage)(struct page *);
- unsigned long flags;
-
- pgoff_t offset = old->index;
- freepage = mapping->a_ops->freepage;
-
- get_page(new);
- new->mapping = mapping;
- new->index = offset;
+ get_page(new);
+ new->mapping = mapping;
+ new->index = offset;

- xa_lock_irqsave(&mapping->pages, flags);
- __delete_from_page_cache(old, NULL);
- error = page_cache_tree_insert(mapping, new, NULL);
- BUG_ON(error);
+ xas_lock_irqsave(&xas, flags);
+ xas_store(&xas, new);

- /*
- * hugetlb pages do not participate in page cache accounting.
- */
- if (!PageHuge(new))
- __inc_node_page_state(new, NR_FILE_PAGES);
- if (PageSwapBacked(new))
- __inc_node_page_state(new, NR_SHMEM);
- xa_unlock_irqrestore(&mapping->pages, flags);
- mem_cgroup_migrate(old, new);
- radix_tree_preload_end();
- if (freepage)
- freepage(old);
- put_page(old);
- }
+ old->mapping = NULL;
+ /* hugetlb pages do not participate in page cache accounting. */
+ if (!PageHuge(old))
+ __dec_node_page_state(old, NR_FILE_PAGES);
+ if (!PageHuge(new))
+ __inc_node_page_state(new, NR_FILE_PAGES);
+ if (PageSwapBacked(old))
+ __dec_node_page_state(old, NR_SHMEM);
+ if (PageSwapBacked(new))
+ __inc_node_page_state(new, NR_SHMEM);
+ xas_unlock_irqrestore(&xas, flags);
+ mem_cgroup_migrate(old, new);
+ if (freepage)
+ freepage(old);
+ put_page(old);

- return error;
+ return 0;
}
EXPORT_SYMBOL_GPL(replace_page_cache_page);

@@ -828,12 +793,15 @@ static int __add_to_page_cache_locked(struct page *page,
pgoff_t offset, gfp_t gfp_mask,
void **shadowp)
{
+ XA_STATE(xas, &mapping->pages, offset);
int huge = PageHuge(page);
struct mem_cgroup *memcg;
int error;
+ void *old;

VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageSwapBacked(page), page);
+ xas_set_update(&xas, workingset_lookup_update(mapping));

if (!huge) {
error = mem_cgroup_try_charge(page, current->mm,
@@ -842,39 +810,51 @@ static int __add_to_page_cache_locked(struct page *page,
return error;
}

- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
- if (error) {
- if (!huge)
- mem_cgroup_cancel_charge(page, memcg, false);
- return error;
- }
-
get_page(page);
page->mapping = mapping;
page->index = offset;

- xa_lock_irq(&mapping->pages);
- error = page_cache_tree_insert(mapping, page, shadowp);
- radix_tree_preload_end();
- if (unlikely(error))
- goto err_insert;
+ do {
+ xas_lock_irq(&xas);
+ old = xas_create(&xas);
+ if (xas_error(&xas))
+ goto unlock;
+ if (xa_is_value(old)) {
+ mapping->nrexceptional--;
+ if (shadowp)
+ *shadowp = old;
+ } else if (old) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+
+ xas_store(&xas, page);
+ mapping->nrpages++;
+
+ /*
+ * hugetlb pages do not participate in
+ * page cache accounting.
+ */
+ if (!huge)
+ __inc_node_page_state(page, NR_FILE_PAGES);
+unlock:
+ xas_unlock_irq(&xas);
+ } while (xas_nomem(&xas, gfp_mask & ~__GFP_HIGHMEM));
+
+ if (xas_error(&xas))
+ goto error;

- /* hugetlb pages do not participate in page cache accounting. */
- if (!huge)
- __inc_node_page_state(page, NR_FILE_PAGES);
- xa_unlock_irq(&mapping->pages);
if (!huge)
mem_cgroup_commit_charge(page, memcg, false, false);
trace_mm_filemap_add_to_page_cache(page);
return 0;
-err_insert:
+error:
page->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
- xa_unlock_irq(&mapping->pages);
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);
put_page(page);
- return error;
+ return xas_error(&xas);
}

/**
--
2.15.0

2017-12-06 01:07:48

by Matthew Wilcox

Subject: [PATCH v4 09/73] xarray: Add xa_load

From: Matthew Wilcox <[email protected]>

This first function in the XArray API brings with it a lot of support
infrastructure. The advanced API is based around the xa_state which is
a more capable version of the radix_tree_iter.

As the test-suite demonstrates, it is possible to use the xarray and
radix tree APIs on the same data structure.

Signed-off-by: Matthew Wilcox <[email protected]>
---
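The internal-entry helpers touched by this patch rely on tagging the
bottom two bits of a pointer-sized value with 10b. The following
userspace sketch shows that idea in isolation. The toy_* names are
hypothetical, and the shift-by-two encoding in toy_mk_internal() is an
assumption: only the tag test and the ~3UL mask are visible in the diff
below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: small integer in the upper bits, tag 10b at the bottom. */
static inline void *toy_mk_internal(unsigned long v)
{
	return (void *)(uintptr_t)((v << 2) | 2);
}

/* Same test as the xa_is_internal() shown in the diff: low bits == 10b. */
static inline bool toy_is_internal(const void *entry)
{
	return ((uintptr_t)entry & 3) == 2;
}

/* Recover the small integer from an internal entry. */
static inline unsigned long toy_to_internal(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 2);
}

/* Tag a node pointer; it must be at least 4-byte aligned for this to work. */
static inline void *toy_mk_node(void *node)
{
	assert(((uintptr_t)node & 3) == 0);
	return (void *)((uintptr_t)node | 2);
}

/* Strip the tag again, as xa_to_node() does with its ~3UL mask. */
static inline void *toy_to_node(void *entry)
{
	return (void *)((uintptr_t)entry & ~(uintptr_t)3);
}
```

Ordinary pointers to aligned objects have 00b in their bottom two bits,
so they can never be mistaken for an internal entry. Under the assumed
encoding, the value 256 used for XA_RETRY_ENTRY would come out as 1026.
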
include/linux/xarray.h | 235 ++++++++++++++++++++++++++++
lib/radix-tree.c | 43 -----
lib/xarray.c | 160 +++++++++++++++++++
tools/testing/radix-tree/.gitignore | 1 +
tools/testing/radix-tree/Makefile | 7 +-
tools/testing/radix-tree/linux/radix-tree.h | 1 -
tools/testing/radix-tree/linux/rcupdate.h | 1 +
tools/testing/radix-tree/linux/xarray.h | 1 +
tools/testing/radix-tree/xarray-test.c | 56 +++++++
9 files changed, 459 insertions(+), 46 deletions(-)
create mode 100644 tools/testing/radix-tree/xarray-test.c

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 1aff0069458b..af52ba75e6a3 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -21,6 +21,8 @@
#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/kconfig.h>
+#include <linux/kernel.h>
+#include <linux/rcupdate.h>
#include <linux/spinlock.h>
#include <linux/types.h>

@@ -67,6 +69,8 @@ static inline void xa_init(struct xarray *xa)
__xa_init(xa, 0);
}

+void *xa_load(struct xarray *, unsigned long index);
+
#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)

/**
@@ -158,6 +162,46 @@ struct xa_node {
unsigned long tags[XA_MAX_TAGS][XA_TAG_LONGS];
};

+#ifdef XA_DEBUG
+void xa_dump(const struct xarray *);
+void xa_dump_node(const struct xa_node *);
+#define XA_BUG_ON(node, x) do { \
+ if ((x) && (node)) \
+ xa_dump_node(node); \
+ BUG_ON(x); \
+ } while (0)
+#else
+#define XA_BUG_ON(node, x) do { } while (0)
+#endif
+
+/* Private */
+static inline void *xa_head(struct xarray *xa)
+{
+ return rcu_dereference_check(xa->xa_head, xa_lock_held(xa));
+}
+
+/* Private */
+static inline void *xa_head_locked(struct xarray *xa)
+{
+ return rcu_dereference_protected(xa->xa_head, xa_lock_held(xa));
+}
+
+/* Private */
+static inline void *xa_entry(struct xarray *xa,
+ const struct xa_node *node, unsigned int offset)
+{
+ XA_BUG_ON(node, offset >= XA_CHUNK_SIZE);
+ return rcu_dereference_check(node->slots[offset], xa_lock_held(xa));
+}
+
+/* Private */
+static inline void *xa_entry_locked(struct xarray *xa,
+ const struct xa_node *node, unsigned int offset)
+{
+ XA_BUG_ON(node, offset >= XA_CHUNK_SIZE);
+ return rcu_dereference_protected(node->slots[offset], xa_lock_held(xa));
+}
+
/*
* Internal entries have the bottom two bits set to the value 10b. Most
* internal entries are pointers to the next node in the tree. Since the
@@ -189,6 +233,12 @@ static inline bool xa_is_internal(void *entry)
return ((unsigned long)entry & 3) == 2;
}

+/* Private */
+static inline struct xa_node *xa_to_node(void *entry)
+{
+ return (struct xa_node *)((unsigned long)entry & ~3UL);
+}
+
/* Private */
static inline bool xa_is_node(void *entry)
{
@@ -222,4 +272,189 @@ static inline bool xa_is_sibling(void *entry)

#define XA_RETRY_ENTRY xa_mk_internal(256)

+/**
+ * xa_is_retry() - Is the entry a retry entry?
+ * @entry: Entry retrieved from the XArray
+ *
+ * Return: %true if the entry is a retry entry.
+ */
+static inline bool xa_is_retry(void *entry)
+{
+ return unlikely(entry == XA_RETRY_ENTRY);
+}
+
+/**
+ * typedef xa_update_node_t - A callback function from the XArray.
+ * @node: The node which is being processed
+ *
+ * This function is called every time the XArray updates the count of
+ * present and value entries in a node. It allows advanced users to
+ * maintain the private_list in the node.
+ */
+typedef void (*xa_update_node_t)(struct xa_node *node);
+
+/*
+ * The xa_state is opaque to its users. It contains various different pieces
+ * of state involved in the current operation on the XArray. It should be
+ * declared on the stack and passed between the various internal routines.
+ * The various elements in it should not be accessed directly, but only
+ * through the provided accessor functions. The below documentation is for
+ * the benefit of those working on the code, not for users of the XArray.
+ *
+ * @xa_node usually points to the xa_node containing the slot we're operating
+ * on (and @xa_offset is the offset in the slots array). If there is a
+ * single entry in the array at index 0, there are no allocated xa_nodes to
+ * point to, and so we store %NULL in @xa_node. @xa_node is set to
+ * the value %XAS_RESTART if the xa_state is not walked to the correct
+ * position in the tree of nodes for this operation. If an error occurs
+ * during an operation, it is set to an %XAS_ERROR value. If we run off the
+ * end of the allocated nodes, it is set to %XAS_BOUNDS.
+ */
+struct xa_state {
+ struct xarray *xa;
+ unsigned long xa_index;
+ unsigned char xa_shift;
+ unsigned char xa_sibs;
+ unsigned char xa_offset;
+ unsigned char xa_pad; /* Helps gcc generate better code */
+ struct xa_node *xa_node;
+ struct xa_node *xa_alloc;
+ xa_update_node_t xa_update;
+};
+
+/*
+ * We encode errnos in the xas->xa_node. If an error has happened, we need to
+ * drop the lock to fix it, and once we've done so the xa_state is invalid.
+ */
+#define XAS_ERROR(errno) ((struct xa_node *)(((unsigned long)(errno) << 1) | 1))
+#define XAS_RESTART XAS_ERROR(0)
+#define XAS_BOUNDS ((struct xa_node *)2UL)
+
+#define __XA_STATE(array, index) { \
+ .xa = array, \
+ .xa_index = index, \
+ .xa_shift = 0, \
+ .xa_sibs = 0, \
+ .xa_offset = 0, \
+ .xa_pad = 0, \
+ .xa_node = XAS_RESTART, \
+ .xa_alloc = NULL, \
+ .xa_update = NULL \
+}
+
+/**
+ * XA_STATE() - Declare an XArray operation state.
+ * @name: Name of this operation state (usually xas).
+ * @index: Initial index of interest.
+ *
+ * Declare and initialise an xa_state on the stack.
+ */
+#define XA_STATE(name, array, index) \
+ struct xa_state name = __XA_STATE(array, index)
+
+#define xas_tagged(xas, tag) xa_tagged((xas)->xa, (tag))
+#define xas_trylock(xas) xa_trylock((xas)->xa)
+#define xas_lock(xas) xa_lock((xas)->xa)
+#define xas_unlock(xas) xa_unlock((xas)->xa)
+#define xas_lock_bh(xas) xa_lock_bh((xas)->xa)
+#define xas_unlock_bh(xas) xa_unlock_bh((xas)->xa)
+#define xas_lock_irq(xas) xa_lock_irq((xas)->xa)
+#define xas_unlock_irq(xas) xa_unlock_irq((xas)->xa)
+#define xas_lock_irqsave(xas, flags) \
+ xa_lock_irqsave((xas)->xa, flags)
+#define xas_unlock_irqrestore(xas, flags) \
+ xa_unlock_irqrestore((xas)->xa, flags)
+
+/**
+ * xas_error() - Return an errno stored in the xa_state.
+ * @xas: XArray operation state.
+ *
+ * Return: 0 if no error has been noted. A negative errno if one has.
+ */
+static inline int xas_error(const struct xa_state *xas)
+{
+ unsigned long v = (unsigned long)xas->xa_node;
+ return (v & 1) ? -(v >> 1) : 0;
+}
+
+/**
+ * xas_set_err() - Note an error in the xa_state.
+ * @xas: XArray operation state.
+ * @err: Negative error number.
+ *
+ * You can call this function with @err set to 0 to take the xa_state
+ * out of the error state. The next operation will walk it to the correct
+ * location.
+ */
+static inline void xas_set_err(struct xa_state *xas, long err)
+{
+ xas->xa_node = XAS_ERROR(-err);
+}
+
+/**
+ * xas_invalid() - Is the xas in a retry or error state?
+ * @xas: XArray operation state.
+ *
+ * Return: %true if the xas cannot be used for operations.
+ */
+static inline bool xas_invalid(const struct xa_state *xas)
+{
+ return (unsigned long)xas->xa_node & 3;
+}
+
+/**
+ * xas_valid() - Is the xas a valid cursor into the array?
+ * @xas: XArray operation state.
+ *
+ * Return: %true if the xas can be used for operations.
+ */
+static inline bool xas_valid(const struct xa_state *xas)
+{
+ return !xas_invalid(xas);
+}
+
+/**
+ * xas_retry() - Handle a retry entry.
+ * @xas: XArray operation state.
+ * @entry: Entry from xarray.
+ *
+ * An RCU-protected read may see a retry entry as a side-effect of a
+ * simultaneous modification. This function sets up the @xas to retry
+ * the walk from the head of the array.
+ *
+ * Return: true if the operation needs to be retried.
+ */
+static inline bool xas_retry(struct xa_state *xas, void *entry)
+{
+ if (!xa_is_retry(entry))
+ return false;
+ xas->xa_node = XAS_RESTART;
+ return true;
+}
+
+void *xas_load(struct xa_state *);
+
+/**
+ * xas_reload() - Refetch an entry from the xarray.
+ * @xas: XArray operation state.
+ *
+ * Use this function to check that a previously loaded entry still has
+ * the same value. This is useful for the lockless pagecache lookup where
+ * we walk the array with only the RCU lock to protect us, lock the page,
+ * then check that the page hasn't moved since we looked it up.
+ *
+ * The caller guarantees that @xas is still valid. If it may be in an
+ * error or restart state, call xas_load() instead.
+ *
+ * Return: The entry at this location in the xarray.
+ */
+static inline void *xas_reload(struct xa_state *xas)
+{
+ struct xa_node *node = xas->xa_node;
+
+ if (node)
+ return xa_entry(xas->xa, node, xas->xa_offset);
+ return xa_head(xas->xa);
+}
+
#endif /* _LINUX_XARRAY_H */
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 930eb7d298d7..a919c60b10a4 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -255,49 +255,6 @@ static unsigned long next_index(unsigned long index,
}

#ifndef __KERNEL__
-static void dump_node(struct radix_tree_node *node, unsigned long index)
-{
- unsigned long i;
-
- pr_debug("radix node: %p offset %d indices %lu-%lu parent %p tags %lx %lx %lx shift %d count %d exceptional %d\n",
- node, node->offset, index, index | node_maxindex(node),
- node->parent,
- node->tags[0][0], node->tags[1][0], node->tags[2][0],
- node->shift, node->count, node->exceptional);
-
- for (i = 0; i < RADIX_TREE_MAP_SIZE; i++) {
- unsigned long first = index | (i << node->shift);
- unsigned long last = first | ((1UL << node->shift) - 1);
- void *entry = node->slots[i];
- if (!entry)
- continue;
- if (entry == RADIX_TREE_RETRY) {
- pr_debug("radix retry offset %ld indices %lu-%lu parent %p\n",
- i, first, last, node);
- } else if (!radix_tree_is_internal_node(entry)) {
- pr_debug("radix entry %p offset %ld indices %lu-%lu parent %p\n",
- entry, i, first, last, node);
- } else if (xa_is_sibling(entry)) {
- pr_debug("radix sblng %p offset %ld indices %lu-%lu parent %p val %p\n",
- entry, i, first, last, node,
- node->slots[xa_to_sibling(entry)]);
- } else {
- dump_node(entry_to_node(entry), first);
- }
- }
-}
-
-/* For debug */
-static void radix_tree_dump(struct radix_tree_root *root)
-{
- pr_debug("radix root: %p xa_head %p tags %x\n",
- root, root->xa_head,
- root->xa_flags >> ROOT_TAG_SHIFT);
- if (!radix_tree_is_internal_node(root->xa_head))
- return;
- dump_node(entry_to_node(root->xa_head), 0);
-}
-
static void dump_ida_node(void *entry, unsigned long index)
{
unsigned long i;
diff --git a/lib/xarray.c b/lib/xarray.c
index 67ddcb3e630c..2f77e4c5d0b8 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -31,6 +31,94 @@
* @entry refers to something stored in a slot in the xarray
*/

+/* extracts the offset within this node from the index */
+static unsigned int get_offset(unsigned long index, struct xa_node *node)
+{
+ return (index >> node->shift) & XA_CHUNK_MASK;
+}
+
+static void *set_bounds(struct xa_state *xas)
+{
+ xas->xa_node = XAS_BOUNDS;
+ return NULL;
+}
+
+/*
+ * Starts a walk. If the @xas is already valid, we assume that it's on
+ * the right path and just return where we've got to. If we're in an
+ * error state, return NULL. If the index is outside the current scope
+ * of the xarray, set @xas->xa_node to XAS_BOUNDS and return NULL. Otherwise
+ * set @xas->xa_node to NULL and return the current head of the array.
+ */
+static void *xas_start(struct xa_state *xas)
+{
+ void *entry;
+
+ if (xas_valid(xas))
+ return xas_reload(xas);
+ if (xas_error(xas))
+ return NULL;
+
+ entry = xa_head(xas->xa);
+ if (!xa_is_node(entry)) {
+ if (xas->xa_index)
+ return set_bounds(xas);
+ } else {
+ if ((xas->xa_index >> xa_to_node(entry)->shift) > XA_CHUNK_MASK)
+ return set_bounds(xas);
+ }
+
+ xas->xa_node = NULL;
+ return entry;
+}
+
+static void *xas_descend(struct xa_state *xas, struct xa_node *node)
+{
+ unsigned int offset = get_offset(xas->xa_index, node);
+ void *entry = xa_entry(xas->xa, node, offset);
+
+ if (xa_is_sibling(entry)) {
+ offset = xa_to_sibling(entry);
+ entry = xa_entry(xas->xa, node, offset);
+ /* Move xa_index to the first index of this entry */
+ xas->xa_index = (((xas->xa_index >> node->shift) &
+ ~XA_CHUNK_MASK) | offset) << node->shift;
+ }
+
+ xas->xa_node = node;
+ xas->xa_offset = offset;
+ return entry;
+}
+
+/**
+ * xas_load() - Load an entry from the XArray (advanced).
+ * @xas: XArray operation state.
+ *
+ * Usually walks the @xas to the appropriate state to load the entry stored
+ * at xa_index. However, it will do nothing and return NULL if @xas is
+ * holding an error. If the xa_shift indicates we're operating on a
+ * multislot entry, it will terminate early and potentially return an
+ * internal entry. xas_load() will never expand the tree (see xas_create()).
+ *
+ * The caller should hold the xa_lock or the RCU lock.
+ *
+ * Return: Usually an entry in the XArray, but see description for exceptions.
+ */
+void *xas_load(struct xa_state *xas)
+{
+ void *entry = xas_start(xas);
+
+ while (xa_is_node(entry)) {
+ struct xa_node *node = xa_to_node(entry);
+
+ if (xas->xa_shift > node->shift)
+ break;
+ entry = xas_descend(xas, node);
+ }
+ return entry;
+}
+EXPORT_SYMBOL_GPL(xas_load);
+
/**
* __xa_init() - Initialise an empty XArray.
* @xa: XArray.
@@ -45,3 +133,75 @@ void __xa_init(struct xarray *xa, gfp_t flags)
xa->xa_head = NULL;
}
EXPORT_SYMBOL(__xa_init);
+
+/**
+ * xa_load() - Load an entry from an XArray.
+ * @xa: XArray.
+ * @index: index into array.
+ *
+ * Return: The entry at @index in @xa.
+ */
+void *xa_load(struct xarray *xa, unsigned long index)
+{
+ XA_STATE(xas, xa, index);
+ void *entry;
+
+ rcu_read_lock();
+ do {
+ entry = xas_load(&xas);
+ } while (xas_retry(&xas, entry));
+ rcu_read_unlock();
+
+ return entry;
+}
+EXPORT_SYMBOL(xa_load);
+
+#ifdef XA_DEBUG
+void xa_dump_entry(void *entry, unsigned long index)
+{
+ if (!entry)
+ return;
+
+ if (xa_is_value(entry))
+ printk("%lu: value %#lx\n", index, xa_to_value(entry));
+ else if (!xa_is_internal(entry))
+ printk("%lu: %p\n", index, entry);
+ else if (xa_is_node(entry)) {
+ unsigned long i;
+ struct xa_node *node = xa_to_node(entry);
+ printk("node %p %s %d parent %p shift %d count %d "
+ "exceptional %d tags %lx %lx %lx indices %lu-%lu\n",
+ node, node->parent ? "offset" : "max", node->offset,
+ node->parent, node->shift, node->count,
+ node->exceptional,
+ node->tags[0][0], node->tags[1][0], node->tags[2][0],
+ index, index |
+ (((unsigned long)XA_CHUNK_SIZE << node->shift) - 1));
+ for (i = 0; i < XA_CHUNK_SIZE; i++)
+ xa_dump_entry(node->slots[i],
+ index + (i << node->shift));
+ } else if (xa_is_retry(entry))
+ printk("%lu: retry (%ld)\n", index, xa_to_internal(entry));
+ else if (xa_is_sibling(entry))
+ printk("%lu: sibling (%ld)\n", index, xa_to_sibling(entry));
+ else
+ printk("%lu: UNKNOWN ENTRY (%p)\n", index, entry);
+}
+
+void xa_dump_node(const struct xa_node *node)
+{
+ printk("xadn: node %p %s %d parent %p shift %d count %d "
+ "exceptional %d array %p list %p %p tags %lx %lx %lx\n",
+ node, node->parent ? "offset" : "max", node->offset,
+ node->parent, node->shift, node->count,
+ node->exceptional, node->root, node->private_list.prev,
+ node->private_list.next,
+ node->tags[0][0], node->tags[1][0], node->tags[2][0]);
+}
+
+void xa_dump(const struct xarray *xa)
+{
+ printk("xarray: %p %x %p\n", xa, xa->xa_flags, xa->xa_head);
+ xa_dump_entry(xa->xa_head, 0);
+}
+#endif
diff --git a/tools/testing/radix-tree/.gitignore b/tools/testing/radix-tree/.gitignore
index 8d4df7a72a8e..833136896b91 100644
--- a/tools/testing/radix-tree/.gitignore
+++ b/tools/testing/radix-tree/.gitignore
@@ -5,3 +5,4 @@ main
multiorder
radix-tree.c
xarray.c
+xarray-test
diff --git a/tools/testing/radix-tree/Makefile b/tools/testing/radix-tree/Makefile
index 3868bc189199..749ef734a87c 100644
--- a/tools/testing/radix-tree/Makefile
+++ b/tools/testing/radix-tree/Makefile
@@ -3,10 +3,11 @@
CFLAGS += -I. -I../../include -g -O2 -Wall -D_LGPL_SOURCE -fsanitize=address
LDFLAGS += -fsanitize=address
LDLIBS+= -lpthread -lurcu
-TARGETS = main idr-test multiorder
+TARGETS = main idr-test multiorder xarray-test
CORE_OFILES := xarray.o radix-tree.o idr.o linux.o test.o find_bit.o
OFILES = main.o $(CORE_OFILES) regression1.o regression2.o regression3.o \
- tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o
+ tag_check.o multiorder.o idr-test.o iteration_check.o benchmark.o \
+ xarray-test.o

ifndef SHIFT
SHIFT=3
@@ -23,6 +24,8 @@ main: $(OFILES)

idr-test: idr-test.o $(CORE_OFILES)

+xarray-test: xarray-test.o $(CORE_OFILES)
+
multiorder: multiorder.o $(CORE_OFILES)

clean:
diff --git a/tools/testing/radix-tree/linux/radix-tree.h b/tools/testing/radix-tree/linux/radix-tree.h
index 40c9671ee365..36fb716d5557 100644
--- a/tools/testing/radix-tree/linux/radix-tree.h
+++ b/tools/testing/radix-tree/linux/radix-tree.h
@@ -5,7 +5,6 @@
#include "generated/map-shift.h"
#include "linux/bug.h"
#include "../../../../include/linux/radix-tree.h"
-#include <linux/xarray.h>

extern int kmalloc_verbose;
extern int test_verbose;
diff --git a/tools/testing/radix-tree/linux/rcupdate.h b/tools/testing/radix-tree/linux/rcupdate.h
index 73ed33658203..25010bf86c1d 100644
--- a/tools/testing/radix-tree/linux/rcupdate.h
+++ b/tools/testing/radix-tree/linux/rcupdate.h
@@ -6,5 +6,6 @@

#define rcu_dereference_raw(p) rcu_dereference(p)
#define rcu_dereference_protected(p, cond) rcu_dereference(p)
+#define rcu_dereference_check(p, cond) rcu_dereference(p)

#endif
diff --git a/tools/testing/radix-tree/linux/xarray.h b/tools/testing/radix-tree/linux/xarray.h
index df3812cda376..3eaf9596c2a6 100644
--- a/tools/testing/radix-tree/linux/xarray.h
+++ b/tools/testing/radix-tree/linux/xarray.h
@@ -1,2 +1,3 @@
#include "generated/map-shift.h"
+#define XA_DEBUG
#include "../../../../include/linux/xarray.h"
diff --git a/tools/testing/radix-tree/xarray-test.c b/tools/testing/radix-tree/xarray-test.c
new file mode 100644
index 000000000000..3f8f19cb3739
--- /dev/null
+++ b/tools/testing/radix-tree/xarray-test.c
@@ -0,0 +1,56 @@
+/*
+ * xarray-test.c: Test the XArray API
+ * Copyright (c) 2017 Microsoft Corporation <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ */
+#include <linux/bitmap.h>
+#include <linux/xarray.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+
+#include "test.h"
+
+void check_xa_load(struct xarray *xa)
+{
+ unsigned long i, j;
+
+ for (i = 0; i < 1024; i++) {
+ for (j = 0; j < 1024; j++) {
+ void *entry = xa_load(xa, j);
+ if (j < i)
+ assert(xa_to_value(entry) == j);
+ else
+ assert(!entry);
+ }
+ radix_tree_insert(xa, i, xa_mk_value(i));
+ }
+}
+
+void xarray_checks(void)
+{
+ RADIX_TREE(array, GFP_KERNEL);
+
+ check_xa_load(&array);
+
+ item_kill_tree(&array);
+}
+
+int __weak main(void)
+{
+ radix_tree_init();
+ xarray_checks();
+ radix_tree_cpu_dead(1);
+ rcu_barrier();
+ if (nr_allocated)
+ printf("nr_allocated = %d\n", nr_allocated);
+ return 0;
+}
--
2.15.0

2017-12-06 01:07:42

by Matthew Wilcox

Subject: [PATCH v4 17/73] xarray: Add xas_next and xas_prev

From: Matthew Wilcox <[email protected]>

These two functions move the xas index by one position, and adjust the
rest of the iterator state to match it. This is more efficient than
calling xas_set() as it keeps the iterator at the leaves of the tree
instead of walking the iterator from the root each time.
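
To make the efficiency argument concrete, here is a userspace sketch of the fast-path/slow-path split, assuming a fixed two-level layout rather than the real tree. All names (struct chunk, struct array, struct cursor, cur_walk(), cur_next()) are hypothetical, and the walks counter exists only to show how rarely the slow path runs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define CHUNK_SHIFT	6			/* assumed: 64 slots per leaf */
#define CHUNK_SIZE	(1UL << CHUNK_SHIFT)
#define CHUNK_MASK	(CHUNK_SIZE - 1)
#define NR_CHUNKS	4

struct chunk { void *slots[CHUNK_SIZE]; };

struct array {
	struct chunk *chunks[NR_CHUNKS];
	unsigned long walks;	/* counts walks from the top */
};

struct cursor {
	struct array *arr;
	unsigned long index;
	struct chunk *chunk;	/* leaf we are parked in, or NULL */
	unsigned int offset;	/* slot within that leaf */
	int walked;		/* 0 until the first walk */
};

/* Slow path: find the leaf for cur->index starting from the top. */
static void *cur_walk(struct cursor *cur)
{
	unsigned long top = cur->index >> CHUNK_SHIFT;

	cur->arr->walks++;
	cur->walked = 1;
	cur->chunk = NULL;
	if (top >= NR_CHUNKS || !cur->arr->chunks[top])
		return NULL;
	cur->chunk = cur->arr->chunks[top];
	cur->offset = cur->index & CHUNK_MASK;
	return cur->chunk->slots[cur->offset];
}

/* Step forward by one index; stay in the current leaf when possible. */
static void *cur_next(struct cursor *cur)
{
	if (!cur->chunk || cur->offset == CHUNK_MASK) {
		if (cur->walked)
			cur->index++;	/* unsigned: ULONG_MAX wraps to 0 */
		return cur_walk(cur);
	}
	cur->index++;
	cur->offset++;
	return cur->chunk->slots[cur->offset];
}
```

As in the patch, the first call on a cursor that has never been walked behaves like a load at the starting index, and only every CHUNK_SIZE-th step pays for a walk from the top.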

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 71 ++++++++++-
lib/xarray.c | 74 ++++++++++++
tools/testing/radix-tree/xarray-test.c | 214 +++++++++++++++++++++++++++++++++
3 files changed, 357 insertions(+), 2 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index b648c1b93d9f..416708ace115 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -549,6 +549,12 @@ static inline bool xas_not_node(struct xa_node *node)
return (unsigned long)node < 4096;
}

+/* True if the node represents RESTART or an error */
+static inline bool xas_frozen(struct xa_node *node)
+{
+ return (unsigned long)node & 1;
+}
+
/* True if the node represents head-of-tree, RESTART or BOUNDS */
static inline bool xas_top(struct xa_node *node)
{
@@ -664,8 +670,8 @@ static inline bool xa_iter_skip(void *entry)
}

/*
- * node->shift is always 0 for the inline iterators unless we're processing
- * a multi-index entry.
+ * node->shift is always 0 for next_entry and next_tag unless we're processing
+ * a multi-index entry. It can be non-0 for next/prev, so it's not used there.
*/
#ifdef CONFIG_RADIX_TREE_MULTIORDER
#define xa_node_shift(node) node->shift
@@ -673,6 +679,67 @@ static inline bool xa_iter_skip(void *entry)
#define xa_node_shift(node) 0
#endif

+void *__xas_next(struct xa_state *);
+void *__xas_prev(struct xa_state *);
+
+/**
+ * xas_prev() - Move iterator to previous index.
+ * @xas: XArray operation state.
+ *
+ * If the @xas was in an error state, it will remain in an error state
+ * and this function will return %NULL. If the @xas has never been walked,
+ * it will have the effect of calling xas_load(). Otherwise one will be
+ * subtracted from the index and the state will be walked to the correct
+ * location in the array for the next operation.
+ *
+ * If the iterator was referencing index 0, this function wraps
+ * around to %ULONG_MAX.
+ *
+ * Return: The entry at the new index. This may be %NULL or an internal
+ * entry, although it should never be a node entry.
+ */
+static inline void *xas_prev(struct xa_state *xas)
+{
+ struct xa_node *node = xas->xa_node;
+
+ if (unlikely(xas_not_node(node) || node->shift ||
+ xas->xa_offset == 0))
+ return __xas_prev(xas);
+
+ xas->xa_index--;
+ xas->xa_offset--;
+ return xa_entry(xas->xa, node, xas->xa_offset);
+}
+
+/**
+ * xas_next() - Move state to next index.
+ * @xas: XArray operation state.
+ *
+ * If the @xas was in an error state, it will remain in an error state
+ * and this function will return %NULL. If the @xas has never been walked,
+ * it will have the effect of calling xas_load(). Otherwise one will be
+ * added to the index and the state will be walked to the correct
+ * location in the array for the next operation.
+ *
+ * If the iterator was referencing index %ULONG_MAX, this function wraps
+ * around to 0.
+ *
+ * Return: The entry at the new index. This may be %NULL or an internal
+ * entry, although it should never be a node entry.
+ */
+static inline void *xas_next(struct xa_state *xas)
+{
+ struct xa_node *node = xas->xa_node;
+
+ if (unlikely(xas_not_node(node) || node->shift ||
+ xas->xa_offset == XA_CHUNK_MASK))
+ return __xas_next(xas);
+
+ xas->xa_index++;
+ xas->xa_offset++;
+ return xa_entry(xas->xa, node, xas->xa_offset);
+}
+
/**
* xas_next_entry() - Advance iterator to next present entry.
* @xas: XArray operation state.
diff --git a/lib/xarray.c b/lib/xarray.c
index f3875b251b41..8c6e83d10554 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -799,6 +799,80 @@ void xas_pause(struct xa_state *xas)
}
EXPORT_SYMBOL_GPL(xas_pause);

+/*
+ * __xas_prev() - Find the previous entry in the XArray.
+ * @xas: XArray operation state.
+ *
+ * Helper function for xas_prev() which handles all the complex cases
+ * out of line.
+ */
+void *__xas_prev(struct xa_state *xas)
+{
+ void *entry;
+
+ if (!xas_frozen(xas->xa_node))
+ xas->xa_index--;
+ if (xas_not_node(xas->xa_node))
+ return xas_load(xas);
+
+ if (xas->xa_offset != get_offset(xas->xa_index, xas->xa_node))
+ xas->xa_offset--;
+
+ while (xas->xa_offset == 255) {
+ xas->xa_offset = xas->xa_node->offset - 1;
+ xas->xa_node = xa_parent(xas->xa, xas->xa_node);
+ if (!xas->xa_node)
+ return set_bounds(xas);
+ }
+
+ for (;;) {
+ entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
+ if (!xa_is_node(entry))
+ return entry;
+
+ xas->xa_node = xa_to_node(entry);
+ xas->xa_offset = get_offset(xas->xa_index, xas->xa_node);
+ }
+}
+EXPORT_SYMBOL_GPL(__xas_prev);
+
+/*
+ * __xas_next() - Find the next entry in the XArray.
+ * @xas: XArray operation state.
+ *
+ * Helper function for xas_next() which handles all the complex cases
+ * out of line.
+ */
+void *__xas_next(struct xa_state *xas)
+{
+ void *entry;
+
+ if (!xas_frozen(xas->xa_node))
+ xas->xa_index++;
+ if (xas_not_node(xas->xa_node))
+ return xas_load(xas);
+
+ if (xas->xa_offset != get_offset(xas->xa_index, xas->xa_node))
+ xas->xa_offset++;
+
+ while (xas->xa_offset == XA_CHUNK_SIZE) {
+ xas->xa_offset = xas->xa_node->offset + 1;
+ xas->xa_node = xa_parent(xas->xa, xas->xa_node);
+ if (!xas->xa_node)
+ return set_bounds(xas);
+ }
+
+ for (;;) {
+ entry = xa_entry(xas->xa, xas->xa_node, xas->xa_offset);
+ if (!xa_is_node(entry))
+ return entry;
+
+ xas->xa_node = xa_to_node(entry);
+ xas->xa_offset = get_offset(xas->xa_index, xas->xa_node);
+ }
+}
+EXPORT_SYMBOL_GPL(__xas_next);
+
/**
* xas_find() - Find the next present entry in the XArray.
* @xas: XArray operation state.
diff --git a/tools/testing/radix-tree/xarray-test.c b/tools/testing/radix-tree/xarray-test.c
index cc5d0b7a1edf..0946eef351e2 100644
--- a/tools/testing/radix-tree/xarray-test.c
+++ b/tools/testing/radix-tree/xarray-test.c
@@ -79,6 +79,104 @@ void check_xas_error(struct xarray *xa)
assert(xas.xa_node == XAS_BOUNDS);
}

+void check_xas_pause(struct xarray *xa)
+{
+ XA_STATE(xas, xa, 0);
+ void *entry;
+ unsigned int seen;
+
+ xa_store(xa, 0, xa_mk_value(0), GFP_KERNEL);
+ xa_set_tag(xa, 0, XA_TAG_0);
+
+ seen = 0;
+ rcu_read_lock();
+ xas_for_each_tag(&xas, entry, ULONG_MAX, XA_TAG_0) {
+ if (!seen++) {
+ xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL);
+ xa_set_tag(xa, 1, XA_TAG_0);
+ }
+ }
+ rcu_read_unlock();
+ /* We don't see an entry that was added after we started */
+ assert(seen == 1);
+
+ seen = 0;
+ xas_set(&xas, 0);
+ rcu_read_lock();
+ xas_for_each_tag(&xas, entry, ULONG_MAX, XA_TAG_0) {
+ if (!seen++)
+ xa_erase(xa, 1);
+ }
+ rcu_read_unlock();
+ assert(seen == 1);
+
+ seen = 0;
+ xas_set(&xas, 0);
+ rcu_read_lock();
+ xas_for_each(&xas, entry, ULONG_MAX) {
+ if (!seen++)
+ xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL);
+ }
+ rcu_read_unlock();
+ assert(seen == 1);
+
+ seen = 0;
+ xas_set(&xas, 0);
+ rcu_read_lock();
+ xas_for_each(&xas, entry, ULONG_MAX) {
+ if (!seen++)
+ xa_erase(xa, 1);
+ }
+ rcu_read_unlock();
+ assert(seen == 1);
+
+ seen = 0;
+ xas_set(&xas, 0);
+ rcu_read_lock();
+ for (entry = xas_load(&xas); entry; entry = xas_next(&xas)) {
+ if (!seen++)
+ xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL);
+ }
+ rcu_read_unlock();
+ assert(seen == 2);
+
+ seen = 0;
+ xas_set(&xas, 0);
+ rcu_read_lock();
+ for (entry = xas_load(&xas); entry; entry = xas_next(&xas)) {
+ if (!seen++)
+ xa_erase(xa, 1);
+ }
+ rcu_read_unlock();
+ assert(seen == 1);
+
+ xa_store(xa, 1, xa_mk_value(1), GFP_KERNEL);
+ seen = 0;
+ xas_set(&xas, 0);
+ xas_for_each(&xas, entry, ULONG_MAX) {
+ if (!seen++)
+ xas_pause(&xas);
+ }
+ assert(seen == 2);
+
+ seen = 0;
+ xas_set(&xas, 0);
+ for (entry = xas_load(&xas); entry; entry = xas_next(&xas)) {
+ if (!seen++)
+ xas_pause(&xas);
+ }
+ assert(seen == 2);
+
+ seen = 0;
+ xas_set(&xas, 0);
+ xa_set_tag(xa, 1, XA_TAG_0);
+ xas_for_each_tag(&xas, entry, ULONG_MAX, XA_TAG_0) {
+ if (!seen++)
+ xas_pause(&xas);
+ }
+ assert(seen == 2);
+}
+
void check_xas_retry(struct xarray *xa)
{
XA_STATE(xas, xa, 0);
@@ -216,9 +314,109 @@ void check_find(struct xarray *xa)
assert(index == 16);
}

+void check_move_small(struct xarray *xa, unsigned long idx)
+{
+ XA_STATE(xas, xa, 0);
+ unsigned long i;
+
+ xa_store(xa, 0, xa_mk_value(0), GFP_KERNEL);
+ xa_store(xa, idx, xa_mk_value(idx), GFP_KERNEL);
+
+ for (i = 0; i < idx * 4; i++) {
+ void *entry = xas_next(&xas);
+ if (i <= idx)
+ assert(xas.xa_node != XAS_RESTART);
+ assert(xas.xa_index == i);
+ if (i == 0 || i == idx)
+ assert(entry == xa_mk_value(i));
+ else
+ assert(entry == NULL);
+ }
+ xas_next(&xas);
+ assert(xas.xa_index == i);
+
+ do {
+ void *entry = xas_prev(&xas);
+ i--;
+ if (i <= idx)
+ assert(xas.xa_node != XAS_RESTART);
+ assert(xas.xa_index == i);
+ if (i == 0 || i == idx)
+ assert(entry == xa_mk_value(i));
+ else
+ assert(entry == NULL);
+ } while (i > 0);
+
+ xas_set(&xas, ULONG_MAX);
+ assert(xas_next(&xas) == NULL);
+ assert(xas.xa_index == ULONG_MAX);
+ assert(xas_next(&xas) == xa_mk_value(0));
+ assert(xas.xa_index == 0);
+ assert(xas_prev(&xas) == NULL);
+ assert(xas.xa_index == ULONG_MAX);
+}
+
+void check_move(struct xarray *xa)
+{
+ XA_STATE(xas, xa, (1 << 16) - 1);
+ unsigned long i;
+
+ for (i = 0; i < (1 << 16); i++) {
+ xa_store(xa, i, xa_mk_value(i), GFP_KERNEL);
+ }
+
+ do {
+ void *entry = xas_prev(&xas);
+ i--;
+ assert(entry == xa_mk_value(i));
+ assert(i == xas.xa_index);
+ } while (i != 0);
+
+ assert(xas_prev(&xas) == NULL);
+ assert(xas.xa_index == ULONG_MAX);
+
+ do {
+ void *entry = xas_next(&xas);
+ assert(entry == xa_mk_value(i));
+ assert(i == xas.xa_index);
+ i++;
+ } while (i < (1 << 16));
+
+ for (i = (1 << 8); i < (1 << 15); i++) {
+ xa_erase(xa, i);
+ }
+
+ i = xas.xa_index;
+
+ do {
+ void *entry = xas_prev(&xas);
+ i--;
+ if ((i < (1 << 8)) || (i >= (1 << 15)))
+ assert(entry == xa_mk_value(i));
+ else
+ assert(entry == NULL);
+ assert(i == xas.xa_index);
+ } while (i != 0);
+
+ assert(xas_prev(&xas) == NULL);
+ assert(xas.xa_index == ULONG_MAX);
+
+ do {
+ void *entry = xas_next(&xas);
+ if ((i < (1 << 8)) || (i >= (1 << 15)))
+ assert(entry == xa_mk_value(i));
+ else
+ assert(entry == NULL);
+ assert(i == xas.xa_index);
+ i++;
+ } while (i < (1 << 16));
+
+}
+
void xarray_checks(void)
{
DEFINE_XARRAY(array);
+ unsigned long i;

check_xa_tag(&array);
item_kill_tree(&array);
@@ -229,6 +427,9 @@ void xarray_checks(void)
check_xas_retry(&array);
item_kill_tree(&array);

+ check_xas_pause(&array);
+ item_kill_tree(&array);
+
check_xa_load(&array);
item_kill_tree(&array);

@@ -240,6 +441,19 @@ void xarray_checks(void)

check_find(&array);
item_kill_tree(&array);
+
+ for (i = 0; i < 16; i++) {
+ check_move_small(&array, 1UL << i);
+ item_kill_tree(&array);
+ }
+
+ for (i = 2; i < 16; i++) {
+ check_move_small(&array, (1UL << i) - 1);
+ item_kill_tree(&array);
+ }
+
+ check_move(&array);
+ item_kill_tree(&array);
}

int __weak main(void)
--
2.15.0

2017-12-06 01:07:25

by Matthew Wilcox

Subject: [PATCH v4 02/73] xarray: Add the xa_lock to the radix_tree_root

From: Matthew Wilcox <[email protected]>

This results in no change in structure size on 64-bit x86 as it fits in
the padding between the gfp_t and the void *.
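
The size claim can be checked with a small userspace sketch. It assumes a 4-byte lock (as on x86-64 without lock debugging options) and an LP64 ABI with 8-byte pointer alignment; fake_spinlock_t, fake_gfp_t, struct root_before, struct root_after and lock_fits_in_padding() are hypothetical stand-ins for the kernel types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t fake_gfp_t;		/* gfp_t is a 32-bit unsigned type */
typedef uint32_t fake_spinlock_t;	/* assumption: 4-byte lock, no debug */

/* Layout before the patch: 4 bytes, 4 bytes of padding, 8-byte pointer. */
struct root_before {
	fake_gfp_t gfp_mask;
	void *rnode;
};

/* Layout after the patch: the lock occupies what used to be padding. */
struct root_after {
	fake_spinlock_t xa_lock;
	fake_gfp_t gfp_mask;
	void *rnode;
};

/* Returns nonzero if adding the lock did not grow the structure. */
static int lock_fits_in_padding(void)
{
	return sizeof(struct root_after) == sizeof(struct root_before);
}
```

On a 32-bit ABI, or with lock debugging enabled so that the lock is larger than 4 bytes, the two sizes would differ.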

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/f2fs/gc.c | 2 +-
include/linux/idr.h | 12 ++++++------
include/linux/radix-tree.h | 7 +++++--
include/linux/xarray.h | 34 ++++++++++++++++++++++++++++++++++
kernel/pid.c | 2 +-
tools/include/linux/spinlock.h | 1 +
6 files changed, 48 insertions(+), 10 deletions(-)
create mode 100644 include/linux/xarray.h

diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index d844dcb80570..aac1e02f75df 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -991,7 +991,7 @@ int f2fs_gc(struct f2fs_sb_info *sbi, bool sync,
unsigned int init_segno = segno;
struct gc_inode_list gc_list = {
.ilist = LIST_HEAD_INIT(gc_list.ilist),
- .iroot = RADIX_TREE_INIT(GFP_NOFS),
+ .iroot = RADIX_TREE_INIT(gc_list.iroot, GFP_NOFS),
};

trace_f2fs_gc_begin(sbi->sb, sync, background,
diff --git a/include/linux/idr.h b/include/linux/idr.h
index 5f55e119d128..4ffdb7058121 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -30,11 +30,11 @@ struct idr {
/* Set the IDR flag and the IDR_FREE tag */
#define IDR_RT_MARKER ((__force gfp_t)(3 << __GFP_BITS_SHIFT))

-#define IDR_INIT \
+#define IDR_INIT(name) \
{ \
- .idr_rt = RADIX_TREE_INIT(IDR_RT_MARKER) \
+ .idr_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER) \
}
-#define DEFINE_IDR(name) struct idr name = IDR_INIT
+#define DEFINE_IDR(name) struct idr name = IDR_INIT(name)

/**
* idr_get_cursor - Return the current position of the cyclic allocator
@@ -193,10 +193,10 @@ struct ida {
struct radix_tree_root ida_rt;
};

-#define IDA_INIT { \
- .ida_rt = RADIX_TREE_INIT(IDR_RT_MARKER | GFP_NOWAIT), \
+#define IDA_INIT(name) { \
+ .ida_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER | GFP_NOWAIT), \
}
-#define DEFINE_IDA(name) struct ida name = IDA_INIT
+#define DEFINE_IDA(name) struct ida name = IDA_INIT(name)

int ida_pre_get(struct ida *ida, gfp_t gfp_mask);
int ida_get_new_above(struct ida *ida, int starting_id, int *p_id);
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index fc55ff31eca7..d2253b540cd7 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -109,20 +109,23 @@ struct radix_tree_node {
#define ROOT_TAG_SHIFT (__GFP_BITS_SHIFT + 1)

struct radix_tree_root {
+ spinlock_t xa_lock;
gfp_t gfp_mask;
struct radix_tree_node __rcu *rnode;
};

-#define RADIX_TREE_INIT(mask) { \
+#define RADIX_TREE_INIT(name, mask) { \
+ .xa_lock = __SPIN_LOCK_UNLOCKED(name.xa_lock), \
.gfp_mask = (mask), \
.rnode = NULL, \
}

#define RADIX_TREE(name, mask) \
- struct radix_tree_root name = RADIX_TREE_INIT(mask)
+ struct radix_tree_root name = RADIX_TREE_INIT(name, mask)

#define INIT_RADIX_TREE(root, mask) \
do { \
+ spin_lock_init(&(root)->xa_lock); \
(root)->gfp_mask = (mask); \
(root)->rnode = NULL; \
} while (0)
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
new file mode 100644
index 000000000000..a5a933925b85
--- /dev/null
+++ b/include/linux/xarray.h
@@ -0,0 +1,34 @@
+#ifndef _LINUX_XARRAY_H
+#define _LINUX_XARRAY_H
+/*
+ * eXtensible Arrays
+ * Copyright (c) 2017 Microsoft Corporation
+ * Author: Matthew Wilcox <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/spinlock.h>
+
+#define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
+#define xa_lock(xa) spin_lock(&(xa)->xa_lock)
+#define xa_unlock(xa) spin_unlock(&(xa)->xa_lock)
+#define xa_lock_bh(xa) spin_lock_bh(&(xa)->xa_lock)
+#define xa_unlock_bh(xa) spin_unlock_bh(&(xa)->xa_lock)
+#define xa_lock_irq(xa) spin_lock_irq(&(xa)->xa_lock)
+#define xa_unlock_irq(xa) spin_unlock_irq(&(xa)->xa_lock)
+#define xa_lock_irqsave(xa, flags) \
+ spin_lock_irqsave(&(xa)->xa_lock, flags)
+#define xa_unlock_irqrestore(xa, flags) \
+ spin_unlock_irqrestore(&(xa)->xa_lock, flags)
+#define xa_lock_held(xa) lockdep_is_held(&(xa)->xa_lock)
+
+#endif /* _LINUX_XARRAY_H */
diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..b050b4643eee 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -58,7 +58,7 @@ int pid_max_max = PID_MAX_LIMIT;
*/
struct pid_namespace init_pid_ns = {
.kref = KREF_INIT(2),
- .idr = IDR_INIT,
+ .idr = IDR_INIT(init_pid_ns.idr),
.pid_allocated = PIDNS_ADDING,
.level = 0,
.child_reaper = &init_task,
diff --git a/tools/include/linux/spinlock.h b/tools/include/linux/spinlock.h
index 4ed569fcb139..b21b586b9854 100644
--- a/tools/include/linux/spinlock.h
+++ b/tools/include/linux/spinlock.h
@@ -7,6 +7,7 @@

#define spinlock_t pthread_mutex_t
#define DEFINE_SPINLOCK(x) pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
+#define __SPIN_LOCK_UNLOCKED(x) (pthread_mutex_t)PTHREAD_MUTEX_INITIALIZER

#define spin_lock_irqsave(x, f) (void)f, pthread_mutex_lock(x)
#define spin_unlock_irqrestore(x, f) (void)f, pthread_mutex_unlock(x)
--
2.15.0
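
The reason the initialisers above grow a "name" argument can be shown outside the kernel. What follows is only a userspace sketch, not the kernel implementation: every demo_* identifier is made up for illustration, and a C11 atomic_flag stands in for spinlock_t. As far as I understand it, the kernel passes the name through to __SPIN_LOCK_UNLOCKED() so that lock debugging can label the lock; the sketch accepts the name and ignores it, much as the tools/ shim does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct radix_tree_root with its embedded lock. */
struct demo_root {
	atomic_flag xa_lock;
	unsigned int gfp_mask;
	void *rnode;
};

/* The lock name is accepted but unused here, as in the tools/ shim. */
#define DEMO_LOCK_UNLOCKED(lockname)	ATOMIC_FLAG_INIT

/* The initialiser needs the variable's own name to initialise the lock. */
#define DEMO_ROOT_INIT(name, mask) {			\
	.xa_lock = DEMO_LOCK_UNLOCKED(name.xa_lock),	\
	.gfp_mask = (mask),				\
	.rnode = NULL,					\
}

#define DEMO_ROOT(name, mask) \
	struct demo_root name = DEMO_ROOT_INIT(name, mask)

static inline bool demo_trylock(struct demo_root *root)
{
	return !atomic_flag_test_and_set(&root->xa_lock);
}

static inline void demo_lock(struct demo_root *root)
{
	while (atomic_flag_test_and_set(&root->xa_lock))
		;	/* spin until the holder clears the flag */
}

static inline void demo_unlock(struct demo_root *root)
{
	/* Debug check: unlocking a lock that was not held is a bug. */
	bool was_locked = atomic_flag_test_and_set(&root->xa_lock);

	assert(was_locked);
	(void)was_locked;
	atomic_flag_clear(&root->xa_lock);
}

/* A standalone root, like RADIX_TREE(name, mask). */
DEMO_ROOT(demo_tree, 0x10);

/* A root embedded in a larger object, like gc_list.iroot in f2fs. */
struct demo_gc_list {
	int id;
	struct demo_root iroot;
};

struct demo_gc_list demo_gc = {
	.id = 1,
	.iroot = DEMO_ROOT_INIT(demo_gc.iroot, 0x20),
};
```

The embedded case is why callers such as f2fs and kernel/pid.c now have to spell out the full path to the member (gc_list.iroot, init_pid_ns.idr) rather than passing nothing.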

2017-12-06 01:07:19

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 27/73] page cache: Convert delete_batch to XArray

From: Matthew Wilcox <[email protected]>

Rename the function from page_cache_tree_delete_batch to just
page_cache_delete_batch.

Signed-off-by: Matthew Wilcox <[email protected]>
---
mm/filemap.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6c9cad248e7f..9e6158cfbaeb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -276,7 +276,7 @@ void delete_from_page_cache(struct page *page)
EXPORT_SYMBOL(delete_from_page_cache);

/*
- * page_cache_tree_delete_batch - delete several pages from page cache
+ * page_cache_delete_batch - delete several pages from page cache
* @mapping: the mapping to which pages belong
* @pvec: pagevec with pages to delete
*
@@ -289,23 +289,18 @@ EXPORT_SYMBOL(delete_from_page_cache);
*
* The function expects xa_lock to be held.
*/
-static void
-page_cache_tree_delete_batch(struct address_space *mapping,
+static void page_cache_delete_batch(struct address_space *mapping,
struct pagevec *pvec)
{
- struct radix_tree_iter iter;
- void **slot;
+ XA_STATE(xas, &mapping->pages, pvec->pages[0]->index);
int total_pages = 0;
int i = 0, tail_pages = 0;
struct page *page;
- pgoff_t start;

- start = pvec->pages[0]->index;
- radix_tree_for_each_slot(slot, &mapping->pages, &iter, start) {
+ xas_set_update(&xas, workingset_lookup_update(mapping));
+ xas_for_each(&xas, page, ULONG_MAX) {
if (i >= pagevec_count(pvec) && !tail_pages)
break;
- page = radix_tree_deref_slot_protected(slot,
- &mapping->pages.xa_lock);
if (xa_is_value(page))
continue;
if (!tail_pages) {
@@ -328,9 +323,7 @@ page_cache_tree_delete_batch(struct address_space *mapping,
} else {
tail_pages--;
}
- radix_tree_clear_tags(&mapping->pages, iter.node, slot);
- __radix_tree_replace(&mapping->pages, iter.node, slot, NULL,
- workingset_lookup_update(mapping));
+ xas_store(&xas, NULL);
total_pages++;
}
mapping->nrpages -= total_pages;
@@ -351,7 +344,7 @@ void delete_from_page_cache_batch(struct address_space *mapping,

unaccount_page_cache_page(mapping, pvec->pages[i]);
}
- page_cache_tree_delete_batch(mapping, pvec);
+ page_cache_delete_batch(mapping, pvec);
xa_unlock_irqrestore(&mapping->pages, flags);

for (i = 0; i < pagevec_count(pvec); i++)
--
2.15.0
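
The shape of the conversion above (one cursor object that both walks the array and stores through its current position, instead of an iterator plus a slot pointer plus a separate replace call) can be mimicked in a few lines of userspace C. This is a toy under stated assumptions: the mini_* names are hypothetical, the "array" is a flat fixed-size table rather than a tree, and no locking or tag handling is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define MINI_SIZE 16

struct mini_array {
	void *slots[MINI_SIZE];
};

/* The cursor: which array we are in and where we currently are. */
struct mini_state {
	struct mini_array *xa;
	size_t index;
};

#define MINI_STATE(name, array, start) \
	struct mini_state name = { .xa = (array), .index = (start) }

/* Return the first non-NULL entry at or after the cursor, up to max. */
static void *mini_find(struct mini_state *xas, size_t max)
{
	for (; xas->index <= max && xas->index < MINI_SIZE; xas->index++)
		if (xas->xa->slots[xas->index])
			return xas->xa->slots[xas->index];
	return NULL;
}

/* Step past the current position and find the next entry. */
static void *mini_next(struct mini_state *xas, size_t max)
{
	xas->index++;
	return mini_find(xas, max);
}

#define mini_for_each(xas, entry, max) \
	for (entry = mini_find(xas, max); entry; entry = mini_next(xas, max))

/* Store through the cursor; no slot pointer is exposed to the caller. */
static void mini_store(struct mini_state *xas, void *entry)
{
	assert(xas->index < MINI_SIZE);
	xas->xa->slots[xas->index] = entry;
}

/* Delete every entry in [start, max]; returns how many were removed. */
static int mini_delete_range(struct mini_array *xa, size_t start, size_t max)
{
	MINI_STATE(xas, xa, start);
	void *entry;
	int removed = 0;

	mini_for_each(&xas, entry, max) {
		mini_store(&xas, NULL);
		removed++;
	}
	return removed;
}
```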

2017-12-06 01:07:12

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 07/73] xarray: Define struct xa_node

From: Matthew Wilcox <[email protected]>

This is a direct replacement for struct radix_tree_node. Use a #define
so that radix tree users continue to work without change.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/radix-tree.h | 29 +++--------------------------
include/linux/xarray.h | 24 ++++++++++++++++++++++++
2 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index f31a278de8eb..f46e3de57115 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -32,6 +32,7 @@

/* Keep unconverted code working */
#define radix_tree_root xarray
+#define radix_tree_node xa_node

/*
* The bottom two bits of the slot determine how the remaining bits in the
@@ -60,41 +61,17 @@ static inline bool radix_tree_is_internal_node(void *ptr)

/*** radix-tree API starts here ***/

-#define RADIX_TREE_MAX_TAGS 3
-
#define RADIX_TREE_MAP_SHIFT XA_CHUNK_SHIFT
#define RADIX_TREE_MAP_SIZE (1UL << RADIX_TREE_MAP_SHIFT)
#define RADIX_TREE_MAP_MASK (RADIX_TREE_MAP_SIZE-1)

-#define RADIX_TREE_TAG_LONGS \
- ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)
+#define RADIX_TREE_MAX_TAGS XA_MAX_TAGS
+#define RADIX_TREE_TAG_LONGS XA_TAG_LONGS

#define RADIX_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
#define RADIX_TREE_MAX_PATH (DIV_ROUND_UP(RADIX_TREE_INDEX_BITS, \
RADIX_TREE_MAP_SHIFT))

-/*
- * @count is the count of every non-NULL element in the ->slots array
- * whether that is a data entry, a retry entry, a user pointer,
- * a sibling entry or a pointer to the next level of the tree.
- * @exceptional is the count of every element in ->slots which is
- * either a data entry or a sibling entry for data.
- */
-struct radix_tree_node {
- unsigned char shift; /* Bits remaining in each slot */
- unsigned char offset; /* Slot offset in parent */
- unsigned char count; /* Total entry count */
- unsigned char exceptional; /* Exceptional entry count */
- struct radix_tree_node *parent; /* Used when ascending tree */
- struct radix_tree_root *root; /* The tree we belong to */
- union {
- struct list_head private_list; /* For tree user */
- struct rcu_head rcu_head; /* Used when freeing node */
- };
- void __rcu *slots[RADIX_TREE_MAP_SIZE];
- unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
-};
-
/* The top bits of xa_flags are used to store the root tags and the IDR flag */
#define ROOT_IS_IDR ((__force gfp_t)(1 << __GFP_BITS_SHIFT))
#define ROOT_TAG_SHIFT (__GFP_BITS_SHIFT + 1)
diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index dcdac2053ea6..1aff0069458b 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -133,6 +133,30 @@ static inline bool xa_is_value(void *entry)
#endif
#define XA_CHUNK_SIZE (1UL << XA_CHUNK_SHIFT)
#define XA_CHUNK_MASK (XA_CHUNK_SIZE - 1)
+#define XA_MAX_TAGS 3
+#define XA_TAG_LONGS DIV_ROUND_UP(XA_CHUNK_SIZE, BITS_PER_LONG)
+
+/*
+ * @count is the count of every non-NULL element in the ->slots array
+ * whether that is a data value entry, a retry entry, a user pointer,
+ * a sibling entry or a pointer to the next level of the tree.
+ * @exceptional is the count of every element in ->slots which is
+ * either a data value entry or a sibling entry for a data value.
+ */
+struct xa_node {
+ unsigned char shift; /* Bits remaining in each slot */
+ unsigned char offset; /* Slot offset in parent */
+ unsigned char count; /* Total entry count */
+ unsigned char exceptional; /* Exceptional entry count */
+ struct xa_node *parent; /* Used when ascending tree */
+ struct xarray * root; /* The tree we belong to */
+ union {
+ struct list_head private_list; /* For tree user */
+ struct rcu_head rcu_head; /* Used when freeing node */
+ };
+ void __rcu *slots[XA_CHUNK_SIZE];
+ unsigned long tags[XA_MAX_TAGS][XA_TAG_LONGS];
+};

/*
* Internal entries have the bottom two bits set to the value 10b. Most
--
2.15.0
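
Two details of this patch can be checked in isolation: the struct tag is aliased with a #define so code written against the old name still compiles, and XA_TAG_LONGS uses DIV_ROUND_UP in place of the open-coded rounding. The sketch below is a userspace illustration only. All demo_* names are hypothetical, the node is trimmed to a few fields, and the chunk shift of 6 is an assumption made for the example.

```c
#include <assert.h>
#include <limits.h>

#define DEMO_BITS_PER_LONG	(CHAR_BIT * sizeof(unsigned long))
#define DEMO_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

#define DEMO_CHUNK_SHIFT	6	/* assumed value, for illustration */
#define DEMO_CHUNK_SIZE		(1UL << DEMO_CHUNK_SHIFT)
#define DEMO_MAX_TAGS		3
#define DEMO_TAG_LONGS \
	DEMO_DIV_ROUND_UP(DEMO_CHUNK_SIZE, DEMO_BITS_PER_LONG)

/* The open-coded form that the radix tree header used before. */
#define DEMO_OLD_TAG_LONGS \
	((DEMO_CHUNK_SIZE + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

_Static_assert(DEMO_TAG_LONGS == DEMO_OLD_TAG_LONGS,
	       "both spellings must give the same array size");

/* Trimmed-down stand-in for struct xa_node. */
struct demo_xa_node {
	unsigned char shift;
	unsigned char offset;
	unsigned char count;
	struct demo_xa_node *parent;
	void *slots[DEMO_CHUNK_SIZE];
	unsigned long tags[DEMO_MAX_TAGS][DEMO_TAG_LONGS];
};

/* Keep unconverted code working: the old tag now names the new struct. */
#define demo_radix_tree_node demo_xa_node

/* "Legacy" code written against the old name compiles unchanged. */
static unsigned int legacy_node_count(const struct demo_radix_tree_node *node)
{
	return node->count;
}
```

Because the alias is a plain macro, both names refer to one type, so pointers can be passed between converted and unconverted code without casts.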

2017-12-06 01:07:06

by Matthew Wilcox

[permalink] [raw]
Subject: [PATCH v4 10/73] xarray: Add xa_get_tag, xa_set_tag and xa_clear_tag

From: Matthew Wilcox <[email protected]>

XArray tags are slightly more strongly typed than the radix tree tags,
but occupy the same bits. This commit also adds the xas_ family of tag
operations, for cases where the caller is already holding the lock, and
xa_tagged() to ask whether any array member has a particular tag set.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/xarray.h | 38 +++++++-
lib/radix-tree.c | 52 +++++------
lib/xarray.c | 247 +++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 310 insertions(+), 27 deletions(-)

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index af52ba75e6a3..ed95ebe91169 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -20,6 +20,7 @@

#include <linux/bug.h>
#include <linux/compiler.h>
+#include <linux/gfp.h>
#include <linux/kconfig.h>
#include <linux/kernel.h>
#include <linux/rcupdate.h>
@@ -71,6 +72,33 @@ static inline void xa_init(struct xarray *xa)

void *xa_load(struct xarray *, unsigned long index);

+typedef unsigned __bitwise xa_tag_t;
+#define XA_TAG_0 ((__force xa_tag_t)0U)
+#define XA_TAG_1 ((__force xa_tag_t)1U)
+#define XA_TAG_2 ((__force xa_tag_t)2U)
+#define XA_NO_TAG ((__force xa_tag_t)4U)
+
+#define XA_TAG_MAX XA_TAG_2
+#define XA_FREE_TAG XA_TAG_0
+#define XA_FLAGS_TAG(tag) ((__force gfp_t)((2U << __GFP_BITS_SHIFT) << \
+ (__force unsigned)(tag)))
+
+/**
+ * xa_tagged() - Inquire whether any entry in this array has a tag set
+ * @xa: Array
+ * @tag: Tag value
+ *
+ * Return: %true if any entry has this tag set.
+ */
+static inline bool xa_tagged(const struct xarray *xa, xa_tag_t tag)
+{
+ return xa->xa_flags & XA_FLAGS_TAG(tag);
+}
+
+bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
+void *xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
+void *xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
+
#define BITS_PER_XA_VALUE (BITS_PER_LONG - 1)

/**
@@ -122,6 +150,10 @@ static inline bool xa_is_value(void *entry)
spin_unlock_irqrestore(&(xa)->xa_lock, flags)
#define xa_lock_held(xa) lockdep_is_held(&(xa)->xa_lock)

+/* Versions of the normal API which require the caller to hold the xa_lock */
+void *__xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
+void *__xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
+
/*
* The xarray is constructed out of a set of 'chunks' of pointers. Choosing
* the best chunk size requires some tradeoffs. A power of two recommends
@@ -152,7 +184,7 @@ struct xa_node {
unsigned char offset; /* Slot offset in parent */
unsigned char count; /* Total entry count */
unsigned char exceptional; /* Exceptional entry count */
- struct xa_node *parent; /* Used when ascending tree */
+ struct xa_node __rcu *parent; /* Used when ascending tree */
struct xarray * root; /* The tree we belong to */
union {
struct list_head private_list; /* For tree user */
@@ -434,6 +466,10 @@ static inline bool xas_retry(struct xa_state *xas, void *entry)

void *xas_load(struct xa_state *);

+bool xas_get_tag(const struct xa_state *, xa_tag_t);
+void xas_set_tag(const struct xa_state *, xa_tag_t);
+void xas_clear_tag(const struct xa_state *, xa_tag_t);
+
/**
* xas_reload() - Refetch an entry from the xarray.
* @xas: XArray operation state.
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index a919c60b10a4..8a8485749433 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -126,19 +126,19 @@ static inline gfp_t root_gfp_mask(const struct radix_tree_root *root)
return root->xa_flags & __GFP_BITS_MASK;
}

-static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
+static inline void rtag_set(struct radix_tree_node *node, unsigned int tag,
int offset)
{
__set_bit(offset, node->tags[tag]);
}

-static inline void tag_clear(struct radix_tree_node *node, unsigned int tag,
+static inline void rtag_clear(struct radix_tree_node *node, unsigned int tag,
int offset)
{
__clear_bit(offset, node->tags[tag]);
}

-static inline int tag_get(const struct radix_tree_node *node, unsigned int tag,
+static inline int rtag_get(const struct radix_tree_node *node, unsigned int tag,
int offset)
{
return test_bit(offset, node->tags[tag]);
@@ -574,14 +574,14 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
if (is_idr(root)) {
all_tag_set(node, IDR_FREE);
if (!root_tag_get(root, IDR_FREE)) {
- tag_clear(node, IDR_FREE, 0);
+ rtag_clear(node, IDR_FREE, 0);
root_tag_set(root, IDR_FREE);
}
} else {
/* Propagate the aggregated tag info to the new child */
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
if (root_tag_get(root, tag))
- tag_set(node, tag, 0);
+ rtag_set(node, tag, 0);
}
}

@@ -646,7 +646,7 @@ static inline bool radix_tree_shrink(struct radix_tree_root *root,
* one (root->xa_head) as far as dependent read barriers go.
*/
root->xa_head = (void __rcu *)child;
- if (is_idr(root) && !tag_get(node, IDR_FREE, 0))
+ if (is_idr(root) && !rtag_get(node, IDR_FREE, 0))
root_tag_clear(root, IDR_FREE);

/*
@@ -853,7 +853,7 @@ static inline int insert_entries(struct radix_tree_node *node,
if (replace) {
node->count--;
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
- if (tag_get(node, tag, offset + i))
+ if (rtag_get(node, tag, offset + i))
tags |= 1 << tag;
} else
return -EEXIST;
@@ -866,12 +866,12 @@ static inline int insert_entries(struct radix_tree_node *node,
rcu_assign_pointer(slot[i], sibling);
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
if (tags & (1 << tag))
- tag_clear(node, tag, offset + i);
+ rtag_clear(node, tag, offset + i);
} else {
rcu_assign_pointer(slot[i], item);
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
if (tags & (1 << tag))
- tag_set(node, tag, offset);
+ rtag_set(node, tag, offset);
}
if (xa_is_node(old))
radix_tree_free_nodes(old);
@@ -929,9 +929,9 @@ int __radix_tree_insert(struct radix_tree_root *root, unsigned long index,

if (node) {
unsigned offset = get_slot_offset(node, slot);
- BUG_ON(tag_get(node, 0, offset));
- BUG_ON(tag_get(node, 1, offset));
- BUG_ON(tag_get(node, 2, offset));
+ BUG_ON(rtag_get(node, 0, offset));
+ BUG_ON(rtag_get(node, 1, offset));
+ BUG_ON(rtag_get(node, 2, offset));
} else {
BUG_ON(root_tags_get(root));
}
@@ -1067,7 +1067,7 @@ static bool node_tag_get(const struct radix_tree_root *root,
unsigned int tag, unsigned int offset)
{
if (node)
- return tag_get(node, tag, offset);
+ return rtag_get(node, tag, offset);
return root_tag_get(root, tag);
}

@@ -1237,7 +1237,7 @@ int radix_tree_split(struct radix_tree_root *root, unsigned long index,
offset = get_slot_offset(parent, slot);

for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
- if (tag_get(parent, tag, offset))
+ if (rtag_get(parent, tag, offset))
tags |= 1 << tag;

for (end = offset + 1; end < RADIX_TREE_MAP_SIZE; end++) {
@@ -1245,7 +1245,7 @@ int radix_tree_split(struct radix_tree_root *root, unsigned long index,
break;
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
if (tags & (1 << tag))
- tag_set(parent, tag, end);
+ rtag_set(parent, tag, end);
/* rcu_assign_pointer ensures tags are set before RETRY */
rcu_assign_pointer(parent->slots[end], RADIX_TREE_RETRY);
}
@@ -1276,7 +1276,7 @@ int radix_tree_split(struct radix_tree_root *root, unsigned long index,
node_to_entry(child));
for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
if (tags & (1 << tag))
- tag_set(node, tag, offset);
+ rtag_set(node, tag, offset);
}

node = child;
@@ -1290,7 +1290,7 @@ int radix_tree_split(struct radix_tree_root *root, unsigned long index,

for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
if (tags & (1 << tag))
- tag_set(node, tag, offset);
+ rtag_set(node, tag, offset);
offset += n;

while (offset == RADIX_TREE_MAP_SIZE) {
@@ -1320,9 +1320,9 @@ static void node_tag_set(struct radix_tree_root *root,
unsigned int tag, unsigned int offset)
{
while (node) {
- if (tag_get(node, tag, offset))
+ if (rtag_get(node, tag, offset))
return;
- tag_set(node, tag, offset);
+ rtag_set(node, tag, offset);
offset = node->offset;
node = node->parent;
}
@@ -1360,8 +1360,8 @@ void *radix_tree_tag_set(struct radix_tree_root *root,
offset = radix_tree_descend(parent, &node, index);
BUG_ON(!node);

- if (!tag_get(parent, tag, offset))
- tag_set(parent, tag, offset);
+ if (!rtag_get(parent, tag, offset))
+ rtag_set(parent, tag, offset);
}

/* set the root's tag bit */
@@ -1389,9 +1389,9 @@ static void node_tag_clear(struct radix_tree_root *root,
unsigned int tag, unsigned int offset)
{
while (node) {
- if (!tag_get(node, tag, offset))
+ if (!rtag_get(node, tag, offset))
return;
- tag_clear(node, tag, offset);
+ rtag_clear(node, tag, offset);
if (any_tag_set(node, tag))
return;

@@ -1489,7 +1489,7 @@ int radix_tree_tag_get(const struct radix_tree_root *root,
parent = entry_to_node(node);
offset = radix_tree_descend(parent, &node, index);

- if (!tag_get(parent, tag, offset))
+ if (!rtag_get(parent, tag, offset))
return 0;
if (node == RADIX_TREE_RETRY)
break;
@@ -1678,7 +1678,7 @@ void __rcu **radix_tree_next_chunk(const struct radix_tree_root *root,
offset = radix_tree_descend(node, &child, index);

if ((flags & RADIX_TREE_ITER_TAGGED) ?
- !tag_get(node, tag, offset) : !child) {
+ !rtag_get(node, tag, offset) : !child) {
/* Hole detected */
if (flags & RADIX_TREE_ITER_CONTIG)
return NULL;
@@ -2100,7 +2100,7 @@ void __rcu **idr_get_free(struct radix_tree_root *root,

node = entry_to_node(child);
offset = radix_tree_descend(node, &child, start);
- if (!tag_get(node, IDR_FREE, offset)) {
+ if (!rtag_get(node, IDR_FREE, offset)) {
offset = radix_tree_find_next_bit(node, IDR_FREE,
offset + 1);
start = next_index(start, node, offset);
diff --git a/lib/xarray.c b/lib/xarray.c
index 2f77e4c5d0b8..8a289c89d3bb 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -12,6 +12,7 @@
* more details.
*/

+#include <linux/bitmap.h>
#include <linux/export.h>
#include <linux/xarray.h>

@@ -31,6 +32,53 @@
* @entry refers to something stored in a slot in the xarray
*/

+static inline struct xa_node *xa_parent(struct xarray *xa,
+ const struct xa_node *node)
+{
+ return rcu_dereference_check(node->parent, xa_lock_held(xa));
+}
+
+static inline struct xa_node *xa_parent_locked(struct xarray *xa,
+ const struct xa_node *node)
+{
+ return rcu_dereference_protected(node->parent, xa_lock_held(xa));
+}
+
+static inline void xa_tag_set(struct xarray *xa, xa_tag_t tag)
+{
+ if (!(xa->xa_flags & XA_FLAGS_TAG(tag)))
+ xa->xa_flags |= XA_FLAGS_TAG(tag);
+}
+
+static inline void xa_tag_clear(struct xarray *xa, xa_tag_t tag)
+{
+ if (xa->xa_flags & XA_FLAGS_TAG(tag))
+ xa->xa_flags &= ~(XA_FLAGS_TAG(tag));
+}
+
+static inline bool tag_get(const struct xa_node *node, unsigned int offset,
+ xa_tag_t tag)
+{
+ return test_bit(offset, node->tags[(__force unsigned)tag]);
+}
+
+static inline void tag_set(struct xa_node *node, unsigned int offset,
+ xa_tag_t tag)
+{
+ __set_bit(offset, node->tags[(__force unsigned)tag]);
+}
+
+static inline void tag_clear(struct xa_node *node, unsigned int offset,
+ xa_tag_t tag)
+{
+ __clear_bit(offset, node->tags[(__force unsigned)tag]);
+}
+
+static inline bool tag_any_set(struct xa_node *node, xa_tag_t tag)
+{
+ return !bitmap_empty(node->tags[(__force unsigned)tag], XA_CHUNK_SIZE);
+}
+
/* extracts the offset within this node from the index */
static unsigned int get_offset(unsigned long index, struct xa_node *node)
{
@@ -119,6 +167,85 @@ void *xas_load(struct xa_state *xas)
}
EXPORT_SYMBOL_GPL(xas_load);

+/**
+ * xas_get_tag() - Returns the state of this tag.
+ * @xas: XArray operation state.
+ * @tag: Tag number.
+ *
+ * Return: true if the tag is set, false if the tag is clear or @xas
+ * is in an error state.
+ */
+bool xas_get_tag(const struct xa_state *xas, xa_tag_t tag)
+{
+ if (xas_invalid(xas))
+ return false;
+ if (!xas->xa_node)
+ return xa_tagged(xas->xa, tag);
+ return tag_get(xas->xa_node, xas->xa_offset, tag);
+}
+EXPORT_SYMBOL_GPL(xas_get_tag);
+
+/**
+ * xas_set_tag() - Sets the tag on this entry and its parents.
+ * @xas: XArray operation state.
+ * @tag: Tag number.
+ *
+ * Sets the specified tag on this entry, and walks up the tree setting it
+ * on all the ancestor entries. Does nothing if @xas has not been walked to
+ * an entry, or is in an error state.
+ */
+void xas_set_tag(const struct xa_state *xas, xa_tag_t tag)
+{
+ struct xa_node *node = xas->xa_node;
+ unsigned int offset = xas->xa_offset;
+
+ if (xas_invalid(xas))
+ return;
+
+ while (node) {
+ if (tag_get(node, offset, tag))
+ return;
+ tag_set(node, offset, tag);
+ offset = node->offset;
+ node = xa_parent_locked(xas->xa, node);
+ }
+
+ if (!xa_tagged(xas->xa, tag))
+ xa_tag_set(xas->xa, tag);
+}
+EXPORT_SYMBOL_GPL(xas_set_tag);
+
+/**
+ * xas_clear_tag() - Clears the tag on this entry and its parents.
+ * @xas: XArray operation state.
+ * @tag: Tag number.
+ *
+ * Clears the specified tag on this entry, and walks back to the head
+ * attempting to clear it on all the ancestor entries. Does nothing if
+ * @xas has not been walked to an entry, or is in an error state.
+ */
+void xas_clear_tag(const struct xa_state *xas, xa_tag_t tag)
+{
+ struct xa_node *node = xas->xa_node;
+ unsigned int offset = xas->xa_offset;
+
+ if (xas_invalid(xas))
+ return;
+
+ while (node) {
+ tag_clear(node, offset, tag);
+ if (tag_any_set(node, tag))
+ return;
+
+ offset = node->offset;
+ node = xa_parent_locked(xas->xa, node);
+ }
+
+ if (xa_tagged(xas->xa, tag))
+ xa_tag_clear(xas->xa, tag);
+}
+EXPORT_SYMBOL_GPL(xas_clear_tag);
+
/**
* __xa_init() - Initialise an empty XArray.
* @xa: XArray.
@@ -156,6 +283,126 @@ void *xa_load(struct xarray *xa, unsigned long index)
}
EXPORT_SYMBOL(xa_load);

+/**
+ * __xa_set_tag() - Set this tag on this entry while locked.
+ * @xa: XArray.
+ * @index: Index of entry.
+ * @tag: Tag number.
+ *
+ * Attempting to set a tag on a NULL entry does not succeed.
+ * This function expects the xa_lock to be held on entry.
+ *
+ * Return: The entry at this index.
+ */
+void *__xa_set_tag(struct xarray *xa, unsigned long index, xa_tag_t tag)
+{
+ XA_STATE(xas, xa, index);
+ void *entry = xas_load(&xas);
+
+ if (entry)
+ xas_set_tag(&xas, tag);
+
+ return entry;
+}
+EXPORT_SYMBOL_GPL(__xa_set_tag);
+
+/**
+ * __xa_clear_tag() - Clear this tag on this entry while locked.
+ * @xa: XArray.
+ * @index: Index of entry.
+ * @tag: Tag number.
+ *
+ * This function expects the xa_lock to be held on entry.
+ *
+ * Return: The entry at this index.
+ */
+void *__xa_clear_tag(struct xarray *xa, unsigned long index, xa_tag_t tag)
+{
+ XA_STATE(xas, xa, index);
+ void *entry = xas_load(&xas);
+
+ if (entry)
+ xas_clear_tag(&xas, tag);
+
+ return entry;
+}
+EXPORT_SYMBOL_GPL(__xa_clear_tag);
+
+/**
+ * xa_get_tag() - Inquire whether this tag is set on this entry.
+ * @xa: XArray.
+ * @index: Index of entry.
+ * @tag: Tag number.
+ *
+ * This function uses the RCU read lock, so the result may be out of date
+ * by the time it returns. If you need the result to be stable, use a lock.
+ *
+ * Return: True if the entry at @index has this tag set, false if it doesn't.
+ */
+bool xa_get_tag(struct xarray *xa, unsigned long index, xa_tag_t tag)
+{
+ XA_STATE(xas, xa, index);
+ void *entry;
+
+ rcu_read_lock();
+ entry = xas_start(&xas);
+ while (xas_get_tag(&xas, tag)) {
+ if (!xa_is_node(entry))
+ goto found;
+ entry = xas_descend(&xas, xa_to_node(entry));
+ }
+ rcu_read_unlock();
+ return false;
+ found:
+ rcu_read_unlock();
+ return true;
+}
+EXPORT_SYMBOL(xa_get_tag);
+
+/**
+ * xa_set_tag() - Set this tag on this entry.
+ * @xa: XArray.
+ * @index: Index of entry.
+ * @tag: Tag number.
+ *
+ * Attempting to set a tag on a NULL entry does not succeed.
+ *
+ * Return: The entry at this index.
+ */
+void *xa_set_tag(struct xarray *xa, unsigned long index, xa_tag_t tag)
+{
+ unsigned long flags;
+ void *entry;
+
+ xa_lock_irqsave(xa, flags);
+ entry = __xa_set_tag(xa, index, tag);
+ xa_unlock_irqrestore(xa, flags);
+
+ return entry;
+}
+EXPORT_SYMBOL(xa_set_tag);
+
+/**
+ * xa_clear_tag() - Clear this tag on this entry.
+ * @xa: XArray.
+ * @index: Index of entry.
+ * @tag: Tag number.
+ *
+ * Return: The entry at this index.
+ */
+void *xa_clear_tag(struct xarray *xa, unsigned long index, xa_tag_t tag)
+{
+ unsigned long flags;
+ void *entry;
+
+ xa_lock_irqsave(xa, flags);
+ entry = __xa_clear_tag(xa, index, tag);
+ xa_unlock_irqrestore(xa, flags);
+
+ return entry;
+}
+EXPORT_SYMBOL(xa_clear_tag);
+
#ifdef XA_DEBUG
void xa_dump_entry(void *entry, unsigned long index)
{
--
2.15.0
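
The propagation rule used by xas_set_tag() and xas_clear_tag() above (set walks up until it finds an ancestor that is already tagged; clear walks up only while the node it leaves behind has no tagged slots) can be exercised in a small userspace model. This is a sketch under assumptions, not the kernel code: the sk_* names are hypothetical, only a single tag is modelled, each node has at most eight slots so one byte holds the bitmap, and there is no RCU or locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One tag bit per slot; eight slots per node in this toy. */
struct sk_node {
	struct sk_node *parent;
	unsigned char offset;	/* our slot number in the parent */
	unsigned char tags;	/* bitmap of tagged slots */
};

/* Stand-in for the tag bit kept in the array head's flags. */
struct sk_root {
	bool tagged;
};

static bool sk_tag_get(const struct sk_node *node, unsigned int offset)
{
	return node->tags & (1u << offset);
}

static void sk_set_tag(struct sk_root *root, struct sk_node *node,
		       unsigned int offset)
{
	while (node) {
		if (sk_tag_get(node, offset))
			return;		/* ancestors are already tagged */
		node->tags |= 1u << offset;
		offset = node->offset;
		node = node->parent;
	}
	root->tagged = true;
}

static void sk_clear_tag(struct sk_root *root, struct sk_node *node,
			 unsigned int offset)
{
	while (node) {
		node->tags &= ~(1u << offset);
		if (node->tags)
			return;		/* another slot still carries the tag */
		offset = node->offset;
		node = node->parent;
	}
	root->tagged = false;
}
```

The early returns are what keep both operations cheap: a set stops at the first ancestor that already knows about the tag, and a clear stops as soon as some sibling still needs it.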

2017-12-06 01:43:53

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Tue, Dec 05, 2017 at 04:41:58PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> This eliminates a call to radix_tree_preload().

.....

> void
> @@ -431,24 +424,24 @@ xfs_mru_cache_insert(
> unsigned long key,
> struct xfs_mru_cache_elem *elem)
> {
> + XA_STATE(xas, &mru->store, key);
> int error;
>
> ASSERT(mru && mru->lists);
> if (!mru || !mru->lists)
> return -EINVAL;
>
> - if (radix_tree_preload(GFP_NOFS))
> - return -ENOMEM;
> -
> INIT_LIST_HEAD(&elem->list_node);
> elem->key = key;
>
> - spin_lock(&mru->lock);
> - error = radix_tree_insert(&mru->store, key, elem);
> - radix_tree_preload_end();
> - if (!error)
> - _xfs_mru_cache_list_insert(mru, elem);
> - spin_unlock(&mru->lock);
> + do {
> + xas_lock(&xas);
> + xas_store(&xas, elem);
> + error = xas_error(&xas);
> + if (!error)
> + _xfs_mru_cache_list_insert(mru, elem);
> + xas_unlock(&xas);
> + } while (xas_nomem(&xas, GFP_NOFS));

Ok, so why does this have a retry loop on ENOMEM despite the
existing code handling that error? And why put such a loop in this
code and not any of the other XFS code that used
radix_tree_preload() and is arguably much more important to avoid
ENOMEM on insert (e.g. the inode cache)?

Also, I really don't like the pattern of using xa_lock()/xa_unlock()
to protect access to an external structure. i.e. the mru->lock
context is protecting multiple fields and operations in the MRU
structure, not just the radix tree operations. Turning that around
so that a larger XFS structure and algorithm is now protected by an
opaque internal lock from a generic storage structure that forms part
of the larger structure seems like a bad design pattern to me...

Cheers,

Dave.
--
Dave Chinner
[email protected]
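
For readers following the question about the retry loop: the control flow of the quoted do/while can be modelled in userspace. This is only a sketch of how I understand the pattern (a store under the lock fails with -ENOMEM, memory is then allocated with the lock dropped, and the store is retried); it does not settle whether the loop belongs in this caller. All toy_* names are hypothetical, the lock is a counter, and the toy never allocates while "locked", which is stricter than the real code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_SLOTS 4

struct toy_array {
	void **node;		/* lazily allocated slot storage */
	int lock_depth;		/* stand-in for a spinlock */
};

struct toy_state {
	struct toy_array *xa;
	unsigned int index;
	void **spare;		/* memory preallocated by toy_nomem() */
	int err;
};

/* Store under the lock; may only consume memory that was preallocated. */
static void toy_store(struct toy_state *xas, void *entry)
{
	struct toy_array *xa = xas->xa;

	assert(xa->lock_depth == 1);	/* caller must hold the lock */
	assert(xas->index < TOY_SLOTS);
	if (!xa->node) {
		if (!xas->spare) {
			xas->err = -ENOMEM;
			return;
		}
		xa->node = xas->spare;
		xas->spare = NULL;
	}
	xa->node[xas->index] = entry;
}

/* Called with the lock dropped; returns true if the caller should retry. */
static bool toy_nomem(struct toy_state *xas)
{
	if (xas->err != -ENOMEM) {
		free(xas->spare);	/* drop any unused preallocation */
		xas->spare = NULL;
		return false;
	}
	xas->spare = calloc(TOY_SLOTS, sizeof(void *));
	if (!xas->spare)
		return false;		/* -ENOMEM stays set for the caller */
	xas->err = 0;
	return true;
}

static int toy_insert(struct toy_array *xa, unsigned int index, void *entry)
{
	struct toy_state xas = { .xa = xa, .index = index };
	int error;

	do {
		xa->lock_depth++;	/* plays the role of xas_lock() */
		toy_store(&xas, entry);
		error = xas.err;
		xa->lock_depth--;	/* plays the role of xas_unlock() */
	} while (toy_nomem(&xas));

	return error;
}
```

In this model the first pass fails, toy_nomem() allocates outside the lock and clears the error, and the second pass succeeds; an error is only returned if the allocation itself fails.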

2017-12-06 01:45:58

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v4 00/73] XArray version 4

On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> I looked through some notes and decided this was version 4 of the XArray.
> Last posted two weeks ago, this version includes a *lot* of changes.
> I'd like to thank Dave Chinner for his feedback, encouragement and
> distracting ideas for improvement, which I'll get to once this is merged.

BTW, you need to fix the "To:" line on your patchbombs:

> To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com>

This bad email address getting quoted to the cc line makes some MTAs
very unhappy.

>
> Highlights:
> - Over 2000 words of documentation in patch 8! And lots more kernel-doc.
> - The page cache is now fully converted to the XArray.
> - Many more tests in the test-suite.
>
> This patch set is not for applying. 0day is still reporting problems,
> and I'd feel bad for eating someone's data. These patches apply on top
> of a set of preparatory patches which just aren't interesting. If you
> want to see the patches applied to a tree, I suggest pulling my git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> I also left out the idr_preload removals. They're still in the git tree,
> but I'm not looking for feedback on them.

I'll give this a quick burn this afternoon and see what catches fire...

Cheers,

Dave.

--
Dave Chinner
[email protected]

2017-12-06 01:51:16

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v4 00/73] XArray version 4

On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > From: Matthew Wilcox <[email protected]>
> >
> > I looked through some notes and decided this was version 4 of the XArray.
> > Last posted two weeks ago, this version includes a *lot* of changes.
> > I'd like to thank Dave Chinner for his feedback, encouragement and
> > distracting ideas for improvement, which I'll get to once this is merged.
>
> BTW, you need to fix the "To:" line on your patchbombs:
>
> > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com>
>
> This bad email address getting quoted to the cc line makes some MTAs
> very unhappy.
>
> >
> > Highlights:
> > - Over 2000 words of documentation in patch 8! And lots more kernel-doc.
> > - The page cache is now fully converted to the XArray.
> > - Many more tests in the test-suite.
> >
> > This patch set is not for applying. 0day is still reporting problems,
> > and I'd feel bad for eating someone's data. These patches apply on top
> > of a set of preparatory patches which just aren't interesting. If you
> > want to see the patches applied to a tree, I suggest pulling my git tree:
> > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> > I also left out the idr_preload removals. They're still in the git tree,
> > but I'm not looking for feedback on them.
>
> I'll give this a quick burn this afternoon and see what catches fire...

Build warnings/errors:

.....
lib/radix-tree.c:700:13: warning: ‘radix_tree_free_nodes’ defined but not used [-Wunused-function]
static void radix_tree_free_nodes(struct radix_tree_node *node)
.....
lib/xarray.c: In function ‘xas_max’:
lib/xarray.c:291:16: warning: unused variable ‘mask’ [-Wunused-variable]
unsigned long mask, max = xas->xa_index;
^~~~
......
fs/dax.c: In function ‘grab_mapping_entry’:
fs/dax.c:305:2: error: implicit declaration of function ‘xas_set_order’; did you mean ‘xas_set_err’? [-Werror=implicit-function-declaration]
xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
^~~~~~~~~~~~~
scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed
make[1]: *** [fs/dax.o] Error 1

-Dave.
--
Dave Chinner
[email protected]

2017-12-06 01:53:48

by Matthew Wilcox

[permalink] [raw]
Subject: RE: [PATCH v4 00/73] XArray version 4

Huh, you've caught a couple of problems that 0day hasn't sent me yet. Try turning on DAX or TRANSPARENT_HUGEPAGE. Thanks!

> -----Original Message-----
> From: Dave Chinner [mailto:[email protected]]
> Sent: Tuesday, December 5, 2017 8:51 PM
> To: Matthew Wilcox <[email protected]>
> Cc: Matthew Wilcox <[email protected]>; Ross Zwisler
> <[email protected]>; Jens Axboe <[email protected]>; Rehas
> Sachdeva <[email protected]>; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH v4 00/73] XArray version 4
>
> On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > > From: Matthew Wilcox <[email protected]>
> > >
> > > I looked through some notes and decided this was version 4 of the XArray.
> > > Last posted two weeks ago, this version includes a *lot* of changes.
> > > I'd like to thank Dave Chinner for his feedback, encouragement and
> > > distracting ideas for improvement, which I'll get to once this is merged.
> >
> > BTW, you need to fix the "To:" line on your patchbombs:
> >
> > > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com>
> >
> > This bad email address getting quoted to the cc line makes some MTAs
> > very unhappy.
> >
> > >
> > > Highlights:
> > > - Over 2000 words of documentation in patch 8! And lots more kernel-doc.
> > > - The page cache is now fully converted to the XArray.
> > > - Many more tests in the test-suite.
> > >
> > > This patch set is not for applying. 0day is still reporting problems,
> > > and I'd feel bad for eating someone's data. These patches apply on top
> > > of a set of preparatory patches which just aren't interesting. If you
> > > want to see the patches applied to a tree, I suggest pulling my git tree:
> > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> > > I also left out the idr_preload removals. They're still in the git tree,
> > > but I'm not looking for feedback on them.
> >
> > I'll give this a quick burn this afternoon and see what catches fire...
>
> Build warnings/errors:
>
> .....
> lib/radix-tree.c:700:13: warning: ‘radix_tree_free_nodes’ defined but not used [-Wunused-function]
> static void radix_tree_free_nodes(struct radix_tree_node *node)
> .....
> lib/xarray.c: In function ‘xas_max’:
> lib/xarray.c:291:16: warning: unused variable ‘mask’ [-Wunused-variable]
> unsigned long mask, max = xas->xa_index;
> ^~~~
> ......
> fs/dax.c: In function ‘grab_mapping_entry’:
> fs/dax.c:305:2: error: implicit declaration of function ‘xas_set_order’; did you mean ‘xas_set_err’? [-Werror=implicit-function-declaration]
> xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
> ^~~~~~~~~~~~~
> scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed
> make[1]: *** [fs/dax.o] Error 1
>
> -Dave.
> --
> Dave Chinner
> [email protected]

2017-12-06 02:02:14

by Matthew Wilcox

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Wed, Dec 06, 2017 at 12:36:48PM +1100, Dave Chinner wrote:
> > - if (radix_tree_preload(GFP_NOFS))
> > - return -ENOMEM;
> > -
> > INIT_LIST_HEAD(&elem->list_node);
> > elem->key = key;
> >
> > - spin_lock(&mru->lock);
> > - error = radix_tree_insert(&mru->store, key, elem);
> > - radix_tree_preload_end();
> > - if (!error)
> > - _xfs_mru_cache_list_insert(mru, elem);
> > - spin_unlock(&mru->lock);
> > + do {
> > + xas_lock(&xas);
> > + xas_store(&xas, elem);
> > + error = xas_error(&xas);
> > + if (!error)
> > + _xfs_mru_cache_list_insert(mru, elem);
> > + xas_unlock(&xas);
> > + } while (xas_nomem(&xas, GFP_NOFS));
>
> Ok, so why does this have a retry loop on ENOMEM despite the
> existing code handling that error? And why put such a loop in this
> code and not any of the other XFS code that used
> radix_tree_preload() and is arguably much more important to avoid
> ENOMEM on insert (e.g. the inode cache)?

If we need more nodes in the tree, xas_store() will try to allocate them
with GFP_NOWAIT | __GFP_NOWARN. If that fails, it signals it in xas_error().
xas_nomem() will notice that we're in an ENOMEM situation, and allocate
a node using your preferred GFP flags (NOFS in your case). Then we retry,
guaranteeing forward progress. [1]

The other conversions use the normal API instead of the advanced API, so
all of this gets hidden away. For example, the inode cache does this:

+ curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);

and xa_cmpxchg internally does:

	do {
		xa_lock_irqsave(xa, flags);
		curr = xas_create(&xas);
		if (curr == old)
			xas_store(&xas, entry);
		xa_unlock_irqrestore(xa, flags);
	} while (xas_nomem(&xas, gfp));


> Also, I really don't like the pattern of using xa_lock()/xa_unlock()
> to protect access to an external structure. i.e. the mru->lock
> context is protecting multiple fields and operations in the MRU
> structure, not just the radix tree operations. Turning that around
> so that a larger XFS structure and algorithm is now protected by an
> opaque internal lock from a generic storage structure that forms part
> of the larger structure seems like a bad design pattern to me...

It's the design pattern I've always intended to use. Naturally, the
xfs radix trees weren't my initial target; it was the page cache, and
the page cache does the same thing; uses the tree_lock to protect both
the radix tree and several other fields in that same data structure.

I'm open to argument on this though ... particularly if you have a better
design pattern in mind!

[1] I actually have this documented! It's in the xas_nomem() kernel-doc:

* If we need to add new nodes to the XArray, we try to allocate memory
* with GFP_NOWAIT while holding the lock, which will usually succeed.
* If it fails, @xas is flagged as needing memory to continue. The caller
* should drop the lock and call xas_nomem(). If xas_nomem() succeeds,
* the caller should retry the operation.
*
* Forward progress is guaranteed as one node is allocated here and
* stored in the xa_state where it will be found by xas_alloc(). More
* nodes will likely be found in the slab allocator, but we do not tie
* them up here.
*
* Return: true if memory was needed, and was successfully allocated.
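
The protocol in that kernel-doc (try a non-blocking allocation under the lock, flag ENOMEM, drop the lock, allocate one node, retry) can be modelled in user space. What follows is a toy sketch, not the XArray implementation: every name prefixed with toy_ is hypothetical, the "lock" is a plain flag in a single-threaded program, and the fast_allocs_left field is an invented knob that makes the non-blocking allocation fail on demand.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy array with one lazily allocated "node" (hypothetical layout). */
struct toy_array {
	bool locked;           /* stands in for the spinlock */
	void *slot;            /* the single node, holding one entry */
	int fast_allocs_left;  /* non-blocking allocations that will succeed */
};

/* Toy counterpart of an operation state: error plus a spare node. */
struct toy_state {
	struct toy_array *arr;
	int err;               /* 0 or a negative errno */
	void *spare;           /* node allocated while the lock was dropped */
};

static void toy_lock(struct toy_array *arr)
{
	assert(!arr->locked);
	arr->locked = true;
}

static void toy_unlock(struct toy_array *arr)
{
	assert(arr->locked);
	arr->locked = false;
}

/* Non-blocking allocation attempt; fails once the budget runs out. */
static void *toy_alloc_nowait(struct toy_array *arr)
{
	if (arr->fast_allocs_left <= 0)
		return NULL;
	arr->fast_allocs_left--;
	return malloc(sizeof(void *));
}

/* Store under the lock; on allocation failure, flag -ENOMEM. */
static void toy_store(struct toy_state *st, void *entry)
{
	struct toy_array *arr = st->arr;

	assert(arr->locked);
	st->err = 0;
	if (!arr->slot) {
		/* Prefer the spare node left behind by toy_nomem(). */
		void *node = st->spare ? st->spare : toy_alloc_nowait(arr);

		st->spare = NULL;
		if (!node) {
			st->err = -ENOMEM;
			return;
		}
		arr->slot = node;
	}
	*(void **)arr->slot = entry;
}

/* Must run with the lock dropped; true means "got memory, retry". */
static bool toy_nomem(struct toy_state *st)
{
	assert(!st->arr->locked);
	if (st->err != -ENOMEM)
		return false;
	st->spare = malloc(sizeof(void *)); /* a kernel could sleep here */
	if (!st->spare)
		return false;
	st->err = 0;
	return true;
}

/* Same do/while shape as the mru conversion quoted above. */
static int toy_insert(struct toy_array *arr, void *entry, int *attempts)
{
	struct toy_state st = { .arr = arr };
	int error;

	*attempts = 0;
	do {
		toy_lock(arr);
		toy_store(&st, entry);
		error = st.err;
		(*attempts)++;
		toy_unlock(arr);
	} while (toy_nomem(&st));

	return error;
}
```

In this model the loop body runs at most twice per missing node: the second pass finds the spare node in the state, which is the forward-progress argument made in the kernel-doc. If the out-of-lock allocation itself fails, the loop ends and the caller sees -ENOMEM.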

2017-12-06 02:05:21

by Matthew Wilcox

Subject: Re: [PATCH v4 00/73] XArray version 4

On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > From: Matthew Wilcox <[email protected]>
> >
> > I looked through some notes and decided this was version 4 of the XArray.
> > Last posted two weeks ago, this version includes a *lot* of changes.
> > I'd like to thank Dave Chinner for his feedback, encouragement and
> > distracting ideas for improvement, which I'll get to once this is merged.
>
> BTW, you need to fix the "To:" line on your patchbombs:
>
> > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com>
>
> This bad email address getting quoted to the cc line makes some MTAs
> very unhappy.

I know :-( I was unhappy when I realised what I'd done.

https://marc.info/?l=git&m=151252237912266&w=2

> I'll give this a quick burn this afternoon and see what catches fire...

All of the things ... 0day gave me a 90% chance of hanging in one
configuration. Need to drill down on it more and find out what stupid
thing I've done wrong this time.

2017-12-06 02:18:21

by Dave Chinner

Subject: Re: [PATCH v4 00/73] XArray version 4

On Wed, Dec 06, 2017 at 01:53:41AM +0000, Matthew Wilcox wrote:
> Huh, you've caught a couple of problems that 0day hasn't sent me yet. Try turning on DAX or TRANSPARENT_HUGEPAGE. Thanks!

Dax is turned on, CONFIG_TRANSPARENT_HUGEPAGE is not.

Looks like nothing is setting CONFIG_RADIX_TREE_MULTIORDER, which is
what xas_set_order() is hidden under.

Ah, CONFIG_ZONE_DEVICE turns it on, not CONFIG_DAX/CONFIG_FS_DAX.

Hmmmm. That seems wrong if it's used in fs/dax.c...

$ grep DAX .config
CONFIG_DAX=y
CONFIG_FS_DAX=y
$ grep ZONE_DEVICE .config
CONFIG_ARCH_HAS_ZONE_DEVICE=y
$

So I have DAX enabled, but not ZONE_DEVICE? Shouldn't DAX be
selecting ZONE_DEVICE, not relying on a user to select both of them
so that stuff works properly? Hmmm - there's no menu option to turn
on zone device, so it's selected by something else? Oh, HMM turns
on ZONE device. But that is "default y", so should be turned on. But
it's not? And there's no obvious HMM menu config option, either....

What a godawful mess Kconfig has turned into.

I'm just going to enable TRANSPARENT_HUGEPAGE - madness awaits me if
I follow the other path down the rat hole....

Ok, it built this time.

-Dave.

>
> > -----Original Message-----
> > From: Dave Chinner [mailto:[email protected]]
> > Sent: Tuesday, December 5, 2017 8:51 PM
> > To: Matthew Wilcox <[email protected]>
> > Cc: Matthew Wilcox <[email protected]>; Ross Zwisler
> > <[email protected]>; Jens Axboe <[email protected]>; Rehas
> > Sachdeva <[email protected]>; [email protected]; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]
> > Subject: Re: [PATCH v4 00/73] XArray version 4
> >
> > On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> > > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > > > From: Matthew Wilcox <[email protected]>
> > > >
> > > > I looked through some notes and decided this was version 4 of the XArray.
> > > > Last posted two weeks ago, this version includes a *lot* of changes.
> > > > I'd like to thank Dave Chinner for his feedback, encouragement and
> > > > distracting ideas for improvement, which I'll get to once this is merged.
> > >
> > > BTW, you need to fix the "To:" line on your patchbombs:
> > >
> > > > > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com>
> > >
> > > This bad email address getting quoted to the cc line makes some MTAs
> > > very unhappy.
> > >
> > > >
> > > > Highlights:
> > > > - Over 2000 words of documentation in patch 8! And lots more kernel-doc.
> > > > - The page cache is now fully converted to the XArray.
> > > > - Many more tests in the test-suite.
> > > >
> > > > This patch set is not for applying. 0day is still reporting problems,
> > > > and I'd feel bad for eating someone's data. These patches apply on top
> > > > of a set of preparatory patches which just aren't interesting. If you
> > > > want to see the patches applied to a tree, I suggest pulling my git tree:
> > > > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> > > > I also left out the idr_preload removals. They're still in the git tree,
> > > > but I'm not looking for feedback on them.
> > >
> > > I'll give this a quick burn this afternoon and see what catches fire...
> >
> > Build warnings/errors:
> >
> > .....
> > lib/radix-tree.c:700:13: warning: ‘radix_tree_free_nodes’ defined but not used [-Wunused-function]
> > static void radix_tree_free_nodes(struct radix_tree_node *node)
> > .....
> > lib/xarray.c: In function ‘xas_max’:
> > lib/xarray.c:291:16: warning: unused variable ‘mask’ [-Wunused-variable]
> > unsigned long mask, max = xas->xa_index;
> > ^~~~
> > ......
> > fs/dax.c: In function ‘grab_mapping_entry’:
> > fs/dax.c:305:2: error: implicit declaration of function ‘xas_set_order’; did you mean ‘xas_set_err’? [-Werror=implicit-function-declaration]
> > xas_set_order(&xas, index, size_flag ? PMD_ORDER : 0);
> > ^~~~~~~~~~~~~
> > scripts/Makefile.build:310: recipe for target 'fs/dax.o' failed
> > make[1]: *** [fs/dax.o] Error 1
> >
> > -Dave.
> > --
> > Dave Chinner
> > [email protected]

--
Dave Chinner
[email protected]

2017-12-06 02:27:18

by Matthew Wilcox

Subject: Re: [PATCH v4 00/73] XArray version 4

On Wed, Dec 06, 2017 at 01:17:52PM +1100, Dave Chinner wrote:
> On Wed, Dec 06, 2017 at 01:53:41AM +0000, Matthew Wilcox wrote:
> > Huh, you've caught a couple of problems that 0day hasn't sent me yet. Try turning on DAX or TRANSPARENT_HUGEPAGE. Thanks!
>
> Dax is turned on, CONFIG_TRANSPARENT_HUGEPAGE is not.
>
> Looks like nothing is setting CONFIG_RADIX_TREE_MULTIORDER, which is
> what xas_set_order() is hidden under.
>
> Ah, CONFIG_ZONE_DEVICE turns it on, not CONFIG_DAX/CONFIG_FS_DAX.
>
> Hmmmm. That seems wrong if it's used in fs/dax.c...

Yes, it's my mistake for not making xas_set_order available in the
!MULTIORDER case. I'm working on a fix now.
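
The thread does not show the eventual fix. One common way to keep a helper visible when its feature is configured out is to provide a restricted fallback under the same name, so callers such as fs/dax.c still compile. A minimal sketch of that idea, with hypothetical names throughout (TOY_MULTIORDER stands in for the Kconfig symbol, toy_cursor and toy_set_order() for the real state and helper):

```c
#include <errno.h>
#include <limits.h>

/* Hypothetical cursor: an index plus the order of the entry it covers. */
struct toy_cursor {
	unsigned long index;
	unsigned int order;
};

#ifdef TOY_MULTIORDER
/* Full version: accept any order and align the index to 2^order. */
static inline int toy_set_order(struct toy_cursor *c, unsigned long index,
				unsigned int order)
{
	if (order >= sizeof(unsigned long) * CHAR_BIT)
		return -EINVAL;
	c->index = (index >> order) << order;
	c->order = order;
	return 0;
}
#else
/* Fallback: the symbol still exists, but only order 0 is accepted. */
static inline int toy_set_order(struct toy_cursor *c, unsigned long index,
				unsigned int order)
{
	if (order != 0)
		return -EINVAL;
	c->index = index;
	c->order = 0;
	return 0;
}
#endif
```

Whether the fallback should return an error, warn, or treat a non-zero order as a bug is a design choice this sketch does not settle.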

> What a godawful mess Kconfig has turned into.

I'm not going to argue with that ...

2017-12-06 02:38:13

by Dave Chinner

Subject: Re: [PATCH v4 00/73] XArray version 4

On Tue, Dec 05, 2017 at 06:05:15PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 12:45:49PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> > > From: Matthew Wilcox <[email protected]>
> > >
> > > I looked through some notes and decided this was version 4 of the XArray.
> > > Last posted two weeks ago, this version includes a *lot* of changes.
> > > I'd like to thank Dave Chinner for his feedback, encouragement and
> > > distracting ideas for improvement, which I'll get to once this is merged.
> >
> > BTW, you need to fix the "To:" line on your patchbombs:
> >
> > > To: unlisted-recipients: ;, no To-header on input <@gmail-pop.l.google.com>
> >
> > This bad email address getting quoted to the cc line makes some MTAs
> > very unhappy.
>
> I know :-( I was unhappy when I realised what I'd done.
>
> https://marc.info/?l=git&m=151252237912266&w=2
>
> > I'll give this a quick burn this afternoon and see what catches fire...
>
> All of the things ... 0day gave me a 90% chance of hanging in one
> configuration. Need to drill down on it more and find out what stupid
> thing I've done wrong this time.

Yup, Bad Stuff happened on boot:

[ 24.548039] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 24.548978] 1-...!: (0 ticks this GP) idle=688/0/0 softirq=143/143 fqs=0
[ 24.549926] 5-...!: (0 ticks this GP) idle=db8/0/0 softirq=120/120 fqs=0
[ 24.550864] 6-...!: (0 ticks this GP) idle=d58/0/0 softirq=111/111 fqs=0
[ 24.551802] 8-...!: (5 GPs behind) idle=514/0/0 softirq=189/189 fqs=0
[ 24.552722] 10-...!: (84 GPs behind) idle=ac0/0/0 softirq=80/80 fqs=0
[ 24.553617] 11-...!: (8 GPs behind) idle=cfc/0/0 softirq=95/95 fqs=0
[ 24.554496] 13-...!: (8 GPs behind) idle=b0c/0/0 softirq=82/82 fqs=0
[ 24.555382] 14-...!: (38 GPs behind) idle=a7c/0/0 softirq=93/93 fqs=0
[ 24.556305] 15-...!: (4 GPs behind) idle=b18/0/0 softirq=88/88 fqs=0
[ 24.557190] (detected by 9, t=5252 jiffies, g=-178, c=-179, q=994)
[ 24.558051] Sending NMI from CPU 9 to CPUs 1:
[ 24.558703] NMI backtrace for cpu 1 skipped: idling at native_safe_halt+0x2/0x10
[ 24.559654] Sending NMI from CPU 9 to CPUs 5:
[ 24.559675] NMI backtrace for cpu 5 skipped: idling at native_safe_halt+0x2/0x10
[ 24.560654] Sending NMI from CPU 9 to CPUs 6:
[ 24.560689] NMI backtrace for cpu 6 skipped: idling at native_safe_halt+0x2/0x10
[ 24.561655] Sending NMI from CPU 9 to CPUs 8:
[ 24.561701] NMI backtrace for cpu 8 skipped: idling at native_safe_halt+0x2/0x10
[ 24.562654] Sending NMI from CPU 9 to CPUs 10:
[ 24.562675] NMI backtrace for cpu 10 skipped: idling at native_safe_halt+0x2/0x10
[ 24.563653] Sending NMI from CPU 9 to CPUs 11:
[ 24.563669] NMI backtrace for cpu 11 skipped: idling at native_safe_halt+0x2/0x10
[ 24.564653] Sending NMI from CPU 9 to CPUs 13:
[ 24.564670] NMI backtrace for cpu 13 skipped: idling at native_safe_halt+0x2/0x10
[ 24.565652] Sending NMI from CPU 9 to CPUs 14:
[ 24.565674] NMI backtrace for cpu 14 skipped: idling at native_safe_halt+0x2/0x10
[ 24.566652] Sending NMI from CPU 9 to CPUs 15:
[ 24.566669] NMI backtrace for cpu 15 skipped: idling at native_safe_halt+0x2/0x10
[ 24.567653] rcu_preempt kthread starved for 5256 jiffies! g18446744073709551438 c18446744073709551437 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->7
[ 24.567654] rcu_preempt I15128 9 2 0x80000000
[ 24.567660] Call Trace:
[ 24.567679] ? __schedule+0x289/0x880
[ 24.567681] schedule+0x2f/0x90
[ 24.567682] schedule_timeout+0x152/0x370
[ 24.567686] ? __next_timer_interrupt+0xc0/0xc0
[ 24.567689] rcu_gp_kthread+0x561/0x880
[ 24.567691] ? force_qs_rnp+0x1a0/0x1a0
[ 24.567693] kthread+0x111/0x130
[ 24.567695] ? __kthread_create_worker+0x120/0x120
[ 24.567697] ret_from_fork+0x24/0x30
[ 44.064092] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kswapd0:854]
[ 44.065920] CPU: 0 PID: 854 Comm: kswapd0 Not tainted 4.15.0-rc2-dgc #228
[ 44.067769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 44.070030] RIP: 0010:smp_call_function_single+0xce/0x100
[ 44.071521] RSP: 0000:ffffc90001d2fb20 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff11
[ 44.073592] RAX: 0000000000000000 RBX: ffff88013ab515c8 RCX: ffffc9000350bb20
[ 44.075560] RDX: 0000000000000001 RSI: ffffc90001d2fb20 RDI: ffffc90001d2fb20
[ 44.077531] RBP: ffffc90001d2fb50 R08: 0000000000000007 R09: 0000000000000080
[ 44.079483] R10: ffffc90001d2fb78 R11: ffffc90001d2fb30 R12: ffffc90001d2fc10
[ 44.081465] R13: ffffea000449fc78 R14: ffffea000449fc58 R15: ffff88013ba36c40
[ 44.083434] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[ 44.085683] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 44.087276] CR2: 00007f1ad65f2260 CR3: 0000000002009001 CR4: 00000000000606f0
[ 44.089228] Call Trace:
[ 44.089942] ? flush_tlb_func_common.constprop.9+0x240/0x240
[ 44.091509] ? arch_tlbbatch_flush+0x66/0xd0
[ 44.092727] arch_tlbbatch_flush+0x66/0xd0
[ 44.093882] try_to_unmap_flush+0x26/0x40
[ 44.095013] shrink_page_list+0x3f0/0xe20
[ 44.096155] shrink_inactive_list+0x209/0x430
[ 44.097392] ? lruvec_lru_size+0x1d/0xa0
[ 44.098495] shrink_node_memcg.constprop.80+0x3f6/0x650
[ 44.099952] ? _raw_spin_unlock+0xc/0x20
[ 44.101060] ? list_lru_count_one+0x25/0x30
[ 44.102225] ? shrink_node+0x44/0x180
[ 44.103252] shrink_node+0x44/0x180
[ 44.104238] kswapd+0x270/0x6b0
[ 44.105142] ? node_reclaim+0x220/0x220
[ 44.106222] kthread+0x111/0x130
[ 44.107109] ? __kthread_create_worker+0x120/0x120
[ 44.108416] ? call_usermodehelper_exec_async+0x11c/0x150
[ 44.109882] ret_from_fork+0x24/0x30
[ 44.110866] Code: 89 3a ee 7e 74 3d 48 83 c4 28 41 5a 5d 49 8d 62 f8 c3 48 89 d1 48 89 f2 48 8d 75 d0 e8 cc fc ff ff 8b 55 e8 83 e2 01 74 0a f
[ 45.596015] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 45.596911] 7-...0: (1 GPs behind) idle=e56/140000000000000/0 softirq=138/139 fqs=2567
[ 45.598054] (detected by 9, t=5252 jiffies, g=-177, c=-178, q=1001)
[ 45.598925] Sending NMI from CPU 9 to CPUs 7:

-Dave.

--
Dave Chinner
[email protected]

2017-12-06 03:15:03

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Tue, Dec 05, 2017 at 06:02:08PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 12:36:48PM +1100, Dave Chinner wrote:
> > > - if (radix_tree_preload(GFP_NOFS))
> > > - return -ENOMEM;
> > > -
> > > INIT_LIST_HEAD(&elem->list_node);
> > > elem->key = key;
> > >
> > > - spin_lock(&mru->lock);
> > > - error = radix_tree_insert(&mru->store, key, elem);
> > > - radix_tree_preload_end();
> > > - if (!error)
> > > - _xfs_mru_cache_list_insert(mru, elem);
> > > - spin_unlock(&mru->lock);
> > > + do {
> > > + xas_lock(&xas);
> > > + xas_store(&xas, elem);
> > > + error = xas_error(&xas);
> > > + if (!error)
> > > + _xfs_mru_cache_list_insert(mru, elem);
> > > + xas_unlock(&xas);
> > > + } while (xas_nomem(&xas, GFP_NOFS));
> >
> > Ok, so why does this have a retry loop on ENOMEM despite the
> > existing code handling that error? And why put such a loop in this
> > code and not any of the other XFS code that used
> > radix_tree_preload() and is arguably much more important to avoid
> > ENOMEM on insert (e.g. the inode cache)?
>
> If we need more nodes in the tree, xas_store() will try to allocate them
> with GFP_NOWAIT | __GFP_NOWARN. If that fails, it signals it in xas_error().
> xas_nomem() will notice that we're in an ENOMEM situation, and allocate
> a node using your preferred GFP flags (NOFS in your case). Then we retry,
> guaranteeing forward progress. [1]
>
> The other conversions use the normal API instead of the advanced API, so
> all of this gets hidden away. For example, the inode cache does this:
>
> + curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
>
> and xa_cmpxchg internally does:
>
> do {
> xa_lock_irqsave(xa, flags);
> curr = xas_create(&xas);
> if (curr == old)
> xas_store(&xas, entry);
> xa_unlock_irqrestore(xa, flags);
> } while (xas_nomem(&xas, gfp));

Ah, OK, that's not obvious from the code changes. :/

However, it's probably overkill for XFS. In all the cases, when we
insert there should be no entry in the tree - the
radix tree insert error handling code there was simply catching
"should never happen" cases and handling it without crashing.

Now that I've looked at this, I have to say that having a return
value of NULL meaning "success" is quite counter-intuitive. That's
going to fire my "that looks so wrong" detector every time I look at
the code and notice it's erroring out on a non-null return value
that isn't a PTR_ERR case....

Also, there's no need for irqsave/restore() locking contexts here as
we never access these caches from interrupt contexts. That's just
going to be extra overhead, especially on workloads that run 10^6
inodes through the cache every second. That's a problem
caused by driving the locks into the XA structure and then needing
to support callers that require irq safety....

> > Also, I really don't like the pattern of using xa_lock()/xa_unlock()
> > to protect access to an external structure. i.e. the mru->lock
> > context is protecting multiple fields and operations in the MRU
> > structure, not just the radix tree operations. Turning that around
> > so that a larger XFS structure and algorithm is now protected by an
> > opaque internal lock from a generic storage structure that forms part
> > of the larger structure seems like a bad design pattern to me...
>
> It's the design pattern I've always intended to use. Naturally, the
> xfs radix trees weren't my initial target; it was the page cache, and
> the page cache does the same thing; uses the tree_lock to protect both
> the radix tree and several other fields in that same data structure.
>
> I'm open to argument on this though ... particularly if you have a better
> design pattern in mind!

I don't mind structures having internal locking - I have a problem
with leaking them into contexts outside the structure they protect.
That way lies madness - you can't change the internal locking in
future because of external dependencies, and the moment you need
something different externally we've got to go back to an external
lock anyway.

This is demonstrated by the way you converted the XFS dquot tree -
you didn't replace the dquot tree lock with the internal xa_lock
because it's a mutex and we have to sleep holding it. IOWs, we've
added another layer of locking here, not simplified the code.

What I really see here is that we have inconsistent locking
patterns w.r.t. XA stores inside XFS - some have an external mutex
to cover a wider scope, some use xa_lock/xa_unlock to span multiple
operations, some directly access the internal xa lock via direct
spin_lock/unlock(...xa_lock) calls and non-locking XA call variants.
In some places you remove explicit rcu_read_lock() calls because the
internal xa_lock implies RCU, but in other places we still need them
because we have to protect the objects the tree points to, not just
the tree....

IOWs, there's no consistent pattern to the changes you've made to
the XFS code. The existing radix tree code has clear, consistent
locking, tagging and lookup patterns. In contrast, each conversion
to the XA code has resulted in a different solution for each radix
tree conversion. Yes, there's been a small reduction in the amount
of code in converting to the XA API, but it comes at the cost of
consistency and ease of understanding the code that uses the radix
tree API.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-12-06 04:45:55

by Matthew Wilcox

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote:
> > The other conversions use the normal API instead of the advanced API, so
> > all of this gets hidden away. For example, the inode cache does this:
>
> Ah, OK, that's not obvious from the code changes. :/

Yeah, it's a lot easier to understand (I think!) if you build the
docs in that tree and look at
file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html
(mutatis mutandis). I've tried to tell a nice story about how to put
all the pieces together from the normal to the advanced API.

> However, it's probably overkill for XFS. In all the cases, when we
> insert there should be no entry in the tree - the
> radix tree insert error handling code there was simply catching
> "should never happen" cases and handling it without crashing.

I thought it was probably overkill to be using xa_cmpxchg() in the
pag_ici patch. I didn't want to take away your error handling as part
of the conversion, but I think a rational person implementing it today
would just call xa_store() and not even worry about the return value
except to check it for IS_ERR().

That said, using xa_cmpxchg() in the dquot code looked like the right
thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks
entirely reasonable for another thread to come in and set up the dquot.
But I'm obviously quite ignorant of the XFS internals, so maybe there's
something else going on that makes this essentially a "can't happen".

> Now that I've looked at this, I have to say that having a return
> value of NULL meaning "success" is quite counter-intuitive. That's
> going to fire my "that looks so wrong" detector every time I look at
> the code and notice it's erroring out on a non-null return value
> that isn't a PTR_ERR case....

It's the same convention as cmpxchg(). I think it's triggering your
"looks so wrong" detector because it's fundamentally not the natural
thing to write. I certainly don't cmpxchg() new entries into an array
and check the result was NULL ;-)
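
The convention under discussion (the call returns whatever was in the slot beforehand, so getting back the expected old value means the exchange happened) can be shown with C11 atomics in user space. This is only an illustration of the return-value convention, not of xa_cmpxchg() itself; toy_cmpxchg() and toy_insert_once() are hypothetical names.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Return the value that was in *slot before the call, cmpxchg()-style. */
static void *toy_cmpxchg(_Atomic(void *) *slot, void *old, void *new_entry)
{
	void *expected = old;

	/* On failure, C11 writes the value actually found into 'expected'. */
	atomic_compare_exchange_strong(slot, &expected, new_entry);
	return expected;
}

/* Insert only into an empty slot: 0 on success, -EEXIST otherwise. */
static int toy_insert_once(_Atomic(void *) *slot, void *entry)
{
	/* A NULL return means the slot was empty, so the store took place. */
	return toy_cmpxchg(slot, NULL, entry) == NULL ? 0 : -EEXIST;
}
```

Wrapping the comparison in a helper like toy_insert_once() is one way to keep the "NULL means success" test out of the calling code, which is the part that reads oddly.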

> Also, there's no need for irqsave/restore() locking contexts here as
> we never access these caches from interrupt contexts. That's just
> going to be extra overhead, especially on workloads that run 10^6
> inodes through the cache every second. That's a problem
> caused by driving the locks into the XA structure and then needing
> to support callers that require irq safety....

I'm quite happy to have normal API variants that don't save/restore
interrupts. Just need to come up with good names ... I don't think
xa_store_noirq() is a good name, but maybe you do?

> > It's the design pattern I've always intended to use. Naturally, the
> > xfs radix trees weren't my initial target; it was the page cache, and
> > the page cache does the same thing; uses the tree_lock to protect both
> > the radix tree and several other fields in that same data structure.
> >
> > I'm open to argument on this though ... particularly if you have a better
> > design pattern in mind!
>
> I don't mind structures having internal locking - I have a problem
> with leaking them into contexts outside the structure they protect.
> That way lies madness - you can't change the internal locking in
> future because of external dependencies, and the moment you need
> something different externally we've got to go back to an external
> lock anyway.
>
> This is demonstrated by the way you converted the XFS dquot tree -
> you didn't replace the dquot tree lock with the internal xa_lock
> because it's a mutex and we have to sleep holding it. IOWs, we've
> added another layer of locking here, not simplified the code.

I agree the dquot code is no simpler than it was, but it's also no more
complicated from a locking analysis point of view; the xa_lock is just
not providing you with any useful exclusion.

At least, not today. One of the future plans is to allow xa_nodes to
be allocated from ZONE_MOVABLE. In order to do that, we have to be
able to tell which lock protects any given node. With the XArray,
we can find that out (xa_node->root->xa_lock); with the radix tree,
we don't even know what kind of lock protects the tree.

> What I really see here is that we have inconsistent locking
> patterns w.r.t. XA stores inside XFS - some have an external mutex
> to cover a wider scope, some use xa_lock/xa_unlock to span multiple
> operations, some directly access the internal xa lock via direct
> spin_lock/unlock(...xa_lock) calls and non-locking XA call variants.
> In some places you remove explicit rcu_read_lock() calls because the
> internal xa_lock implies RCU, but in other places we still need them
> because we have to protect the objects the tree points to, not just
> the tree....
>
> IOWs, there's no consistent pattern to the changes you've made to
> the XFS code. The existing radix tree code has clear, consistent
> locking, tagging and lookup patterns. In contrast, each conversion
> to the XA code has resulted in a different solution for each radix
> tree conversion. Yes, there's been a small reduction in the amount
> of code in converting to the XA API, but it comes at the cost of
> consistency and ease of understanding the code that uses the radix
> tree API.

There are other costs to not having a lock. No lockdep/RCU analysis
is done on the radix tree code at all. Because we have
no idea what lock might protect any individual radix tree, we use
rcu_dereference_raw(), disabling lockdep's ability to protect us.

It's funny that you see the hodgepodge of different locking strategies
in the XFS code base as being a problem with the XArray. I see it as
being a consequence of XFS's different needs. No, the XArray can't
solve all of your problems, but it hasn't made your locking more complex.

And I don't agree that the existing radix tree code has clear, consistent
locking patterns. For example, this use of RCU was unnecessary:

 xfs_queue_eofblocks(
 	struct xfs_mount	*mp)
 {
-	rcu_read_lock();
-	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
+	if (xa_tagged(&mp->m_perag_xa, XFS_ICI_EOFBLOCKS_TAG))
 		queue_delayed_work(mp->m_eofblocks_workqueue,
 				   &mp->m_eofblocks_work,
 				   msecs_to_jiffies(xfs_eofb_secs * 1000));
-	rcu_read_unlock();
 }

radix_tree_tagged never required the RCU lock (commit 7cf9c2c76c1a).
I think you're just used to the radix tree pattern of "we provide no
locking for you, come up with your own scheme".

What might make more sense for XFS is coming up with something
intermediate between the full on xa_state-based API and the "we handle
everything for you" normal API. For example, how would you feel about
xfs_mru_cache_insert() looking like this:

	xa_lock(&mru->store);
	error = PTR_ERR_OR_ZERO(__xa_store(&mru->store, key, elem, GFP_NOFS));
	if (!error)
		_xfs_mru_cache_list_insert(mru, elem);
	xa_unlock(&mru->store);

	return error;

xfs_mru_cache_lookup would look like:

	xa_lock(&mru->store);
	elem = __xa_load(&mru->store, key);
	...

There's no real need for the mru code to be using the full-on xa_state
API. For something like DAX or the page cache, there's a real advantage,
but the mru code is, I think, a great example of a user who has somewhat
more complex locking requirements, but doesn't use the array in a
complex way.
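
The insert sketch above leans on the kernel's pointer-encoded error convention: a store returns either the previous entry or an error folded into the pointer value, and PTR_ERR_OR_ZERO() turns that into 0 or a negative errno. A user-space model of the convention follows. The toy_ helpers are hypothetical re-creations written for illustration rather than the kernel's own macros, and toy_store() with its fail_alloc switch is an invented stand-in for a locked store.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* The top 4095 pointer values are reserved to carry negative errnos. */
#define TOY_MAX_ERRNO 4095

static inline void *toy_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline bool toy_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-TOY_MAX_ERRNO;
}

static inline long toy_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* 0 for an ordinary pointer (including NULL), the errno for an error. */
static inline int toy_ptr_err_or_zero(const void *ptr)
{
	return toy_is_err(ptr) ? (int)toy_ptr_err(ptr) : 0;
}

/* Store into a one-slot "array": returns the old entry or an error. */
static void *toy_store(void **slot, void *entry, bool fail_alloc)
{
	void *old;

	if (fail_alloc)
		return toy_err_ptr(-ENOMEM);
	old = *slot;
	*slot = entry;
	return old;
}
```

With these helpers the calling pattern is the same as in the sketch above: error = toy_ptr_err_or_zero(toy_store(...)), followed by the list insertion only when error is zero.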

The dquot code is just going to have to live with the fact that there's
additional locking going on that it doesn't need. I'm open to getting
rid of the irqsafety, but we can't give up the spinlock protection
without giving up the RCU/lockdep analysis and the ability to move nodes.
I don't suppose the dquot code can


Thanks for spending so much time on this and being so passionate about
making this the best possible code it can be.

2017-12-06 04:53:01

by Matthew Wilcox

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> The dquot code is just going to have to live with the fact that there's
> additional locking going on that it doesn't need. I'm open to getting
> rid of the irqsafety, but we can't give up the spinlock protection
> without giving up the RCU/lockdep analysis and the ability to move nodes.
> I don't suppose the dquot code can

Oops, thought I'd finished writing this paragraph.

I don't suppose the dquot code can be restructured to use the xa_lock to
protect, say, qi_dquots? I suspect not, since you wouldn't know which
of the three xarray locks to use.

2017-12-06 08:44:12

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote:
> > > The other conversions use the normal API instead of the advanced API, so
> > > all of this gets hidden away. For example, the inode cache does this:
> >
> > Ah, OK, that's not obvious from the code changes. :/
>
> Yeah, it's a lot easier to understand (I think!) if you build the
> docs in that tree and look at
> file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html
> (mutatis mutandis). I've tried to tell a nice story about how to put
> all the pieces together from the normal to the advanced API.
>
> > However, it's probably overkill for XFS. In all the cases, when we
> > insert there should be no entry in the tree - the
> > radix tree insert error handling code there was simply catching
> > "should never happen" cases and handling it without crashing.
>
> I thought it was probably overkill to be using xa_cmpxchg() in the
> pag_ici patch. I didn't want to take away your error handling as part
> of the conversion, but I think a rational person implementing it today
> would just call xa_store() and not even worry about the return value
> except to check it for IS_ERR().

*nod*

> That said, using xa_cmpxchg() in the dquot code looked like the right
> thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks
> entirely reasonable for another thread to come in and set up the dquot.
> But I'm obviously quite ignorant of the XFS internals, so maybe there's
> something else going on that makes this essentially a "can't happen".

It's no different to the inode cache code, which drops the RCU
lock on lookup miss, instantiates the new inode (maybe reading it
off disk), then locks the tree and attempts to insert it. Both cases
use "insert if empty, otherwise retry lookup from start" semantics.

cmpxchg is for replacing a known object in a store - it's not really
intended for doing initial inserts after a lookup tells us there is
nothing in the store. The radix tree "insert only if empty" makes
sense here, because it naturally takes care of lookup/insert races
via the -EEXIST mechanism.

I think that providing xa_store_excl() (which would return -EEXIST
if the entry is not empty) would be a better interface here, because
it matches the semantics of lookup cache population used all over
the kernel....
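
For illustration only, the two conventions being compared here can be
shown with a userspace toy. Everything named toy_* is hypothetical, the
array is a fixed-size stand-in with no locking, and xa_store_excl() is
only a proposal in this thread, so this is a sketch of the semantics
rather than of any real implementation:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_SLOTS 16

/* Toy stand-in for the array: fixed size, every slot starts out NULL. */
struct toy_array {
	void *slot[TOY_SLOTS];
};

/*
 * cmpxchg convention: store @entry only if the slot currently holds @old,
 * and always return what was in the slot beforehand.  With @old == NULL,
 * a NULL return therefore means "the insert succeeded".
 */
static void *toy_cmpxchg(struct toy_array *a, size_t index,
			 void *old, void *entry)
{
	void *curr;

	assert(index < TOY_SLOTS);
	curr = a->slot[index];
	if (curr == old)
		a->slot[index] = entry;
	return curr;
}

/*
 * Insert-only-if-empty convention: 0 on success, -EEXIST if the slot
 * was already occupied (the caller would then retry its lookup).
 */
static int toy_store_excl(struct toy_array *a, size_t index, void *entry)
{
	return toy_cmpxchg(a, index, NULL, entry) ? -EEXIST : 0;
}
```

With the second form, a caller populating a lookup cache checks an int
against zero, instead of treating a non-NULL pointer as the failure case.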

> > Now that I've looked at this, I have to say that having a return
> > value of NULL meaning "success" is quite counter-intuitive. That's
> > going to fire my "that looks so wrong" detector every time I look at
> > the code and notice it's erroring out on a non-null return value
> > that isn't a PTR_ERR case....
>
> It's the same convention as cmpxchg(). I think it's triggering your
> "looks so wrong" detector because it's fundamentally not the natural
> thing to write.

Most definitely the case, and this is why it's a really bad
interface for the semantics we have. This is how we end up with code
that makes it easy for programmers to screw up pointer checks in
error handling... :/
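
To make the concern concrete: a cmpxchg-style insert can hand back three
different kinds of pointer, and the caller has to tell them apart. The
userspace sketch below mimics the kernel's idea of encoding a negative
errno in the top 4095 pointer values (as ERR_PTR()/IS_ERR() do); the
toy_* helpers are hypothetical names for this example, not kernel API:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno in the top TOY_MAX_ERRNO pointer values. */
static void *toy_err_ptr(long err)
{
	assert(err < 0 && err >= -TOY_MAX_ERRNO);
	return (void *)(intptr_t)err;
}

/* True if @p is an encoded errno rather than a real pointer or NULL. */
static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/*
 * Classify the return value of a cmpxchg-style insert (old == NULL):
 *   0        the slot was empty and the new entry was stored
 *   -EEXIST  somebody else's entry was already there
 *   -errno   the operation itself failed (e.g. -ENOMEM)
 */
static int toy_insert_result(const void *ret)
{
	if (toy_is_err(ret))
		return (int)(intptr_t)ret;
	return ret ? -EEXIST : 0;
}
```

A caller that only checks for an error pointer silently treats the
"entry already present" case as success, which is the kind of mistake
being described above.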

> I'm quite happy to have normal API variants that don't save/restore
> interrupts. Just need to come up with good names ... I don't think
> xa_store_noirq() is a good name, but maybe you do?

I'd prefer not to have to deal with such things at all. :P

How many subsystems actually require irq safety in the XA locking
code? Make them use irqsafe versions, not make everyone else use
"noirq" versions, as is the convention for the rest of the kernel
code....

> > > It's the design pattern I've always intended to use. Naturally, the
> > > xfs radix trees weren't my initial target; it was the page cache, and
> > > the page cache does the same thing; uses the tree_lock to protect both
> > > the radix tree and several other fields in that same data structure.
> > >
> > > I'm open to argument on this though ... particularly if you have a better
> > > design pattern in mind!
> >
> > I don't mind structures having internal locking - I have a problem
> > with leaking them into contexts outside the structure they protect.
> > That way lies madness - you can't change the internal locking in
> > future because of external dependencies, and the moment you need
> > something different externally we've got to go back to an external
> > lock anyway.
> >
> > This is demonstrated by the way you converted the XFS dquot tree -
> > you didn't replace the dquot tree lock with the internal xa_lock
> > because it's a mutex and we have to sleep holding it. IOWs, we've
> > added another layer of locking here, not simplified the code.
>
> I agree the dquot code is no simpler than it was, but it's also no more
> complicated from a locking analysis point of view; the xa_lock is just
> not providing you with any useful exclusion.

Sure, that's fine. All I'm doing is pointing out that we can't use
the internal xa_lock to handle everything the indexed objects
require, and so we're going to still need external locks in
many cases.

> At least, not today. One of the future plans is to allow xa_nodes to
> be allocated from ZONE_MOVABLE. In order to do that, we have to be
> able to tell which lock protects any given node. With the XArray,
> we can find that out (xa_node->root->xa_lock); with the radix tree,
> we don't even know what kind of lock protects the tree.

Yup, this is a prime example of why we shouldn't be creating
external dependencies by smearing the locking context outside the XA
structure itself. It's not a stretch to see something like a
ZONE_MOVABLE dependency because some other object indexed in an XA
is stored in the same page as the xa_node that points to it, and
both require the same xa_lock to move/update...

> There are other costs to not having a lock. The lockdep/RCU
> analysis done on the radix tree code is none. Because we have
> no idea what lock might protect any individual radix tree, we use
> rcu_dereference_raw(), disabling lockdep's ability to protect us.

Unfortunately for you, I don't find arguments along the lines of
"lockdep will save us" at all convincing. lockdep already throws
too many false positives to be useful as a tool that reliably and
accurately points out rare, exciting, complex, intricate locking
problems.

> It's funny that you see the hodgepodge of different locking strategies
> in the XFS code base as being a problem with the XArray. I see it as
> being a consequence of XFS's different needs. No, the XArray can't
> solve all of your problems, but it hasn't made your locking more complex.

I'm not worried about changes in locking complexity here because, as
you point out, there isn't a change. What I'm mostly concerned about
is the removal of abstraction, modularity and isolation between
the XFS code and the library infrastructure it uses.

>
> And I don't agree that the existing radix tree code has clear, consistent
> locking patterns. For example, this use of RCU was unnecessary:
>
>  xfs_queue_eofblocks(
>  	struct xfs_mount *mp)
>  {
> -	rcu_read_lock();
> -	if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
> +	if (xa_tagged(&mp->m_perag_xa, XFS_ICI_EOFBLOCKS_TAG))
>  		queue_delayed_work(mp->m_eofblocks_workqueue,
>  				   &mp->m_eofblocks_work,
>  				   msecs_to_jiffies(xfs_eofb_secs * 1000));
> -	rcu_read_unlock();
>  }
>
> radix_tree_tagged never required the RCU lock (commit 7cf9c2c76c1a).
> I think you're just used to the radix tree pattern of "we provide no
> locking for you, come up with your own scheme".

No, I'm used to having no-one really understand how "magic lockless
RCU lookups" actually work. When I originally wrote the lockless
lookup code, I couldn't find anyone who both understood RCU and the
XFS inode cache to review the code for correctness. Hence it had to
be dumbed down to the point that it was "stupidly obvious that it's
safe".

That problem has not gone away - very few people who read and have
to maintain this code understand all the nasty little intricacies
of RCU lookups. Hiding /more/ of the locking semantics from the
programmers makes it even harder to explain why the algorithm is
safe. If the rules are basic (e.g. all radix tree lookups use RCU
locking) then it's easier for everyone to understand, review and
keep the code working correctly because there's almost no scope for
getting it wrong.

That's one of the advantages of the "we provide no locking for you,
come up with your own scheme" approach - we can dumb it down to the
point of being understandable and maintainable without anyone
needing to hurt their brain on memory-barriers.txt every time
someone changes the code.

Also, it's worth keeping in mind that this dumb code provides the
fastest and most scalable inode cache infrastructure in the kernel.
i.e. it's the structures and algorithms used that make the code
fast, but it's the simplicity of the code that makes it
understandable and maintainable. The XArray code is a good
algorithm, we've just got to make the API suitable for dumb idiots
like me to be able to write reliable, maintainable code that uses
it.

> What might make more sense for XFS is coming up with something
> intermediate between the full on xa_state-based API and the "we handle
> everything for you" normal API. For example, how would you feel about
> xfs_mru_cache_insert() looking like this:
>
> xa_lock(&mru->store);
> error = PTR_ERR_OR_ZERO(__xa_store(&mru->store, key, elem, GFP_NOFS));
> if (!error)
> _xfs_mru_cache_list_insert(mru, elem);
> xa_unlock(&mru->store);
>
> return error;
>
> xfs_mru_cache_lookup would look like:
>
> xa_lock(&mru->store);
> elem = __xa_load(&mru->store, key);
> ....
> There's no real need for the mru code to be using the full-on xa_state
> API. For something like DAX or the page cache, there's a real advantage,
> but the mru code is, I think, a great example of a user who has somewhat
> more complex locking requirements, but doesn't use the array in a
> complex way.

Yes, that's because the radix tree is not central to its algorithm
or purpose. The MRU cache (Most Recently Used Cache) is mostly
about the management of the items on lists in the priority
reclamation array. The radix tree is just there to provide a fast
"is there an item for this key already being aged" lookup so we
don't have to scan lists to do this.

i.e. Right now we could just as easily replace the radix tree with a
rbtree or resizing hash table as an XArray - the radix tree was just
a convenient "already implemented" key-based indexing mechanism that
was in the kernel when the MRU cache was implemented. Put simply:
the radix tree is not a primary structure in the MRU cache - it's
only an implementation detail and that's another reason why I'm not
a fan of smearing the internal locking of the replacement structure
all through the MRU code....

/me shrugs

BTW, something else I just noticed: all the comments in XFS that
talk about the radix trees would need updating.

$ git grep radix fs/xfs
fs/xfs/xfs_dquot.c: /* uninit / unused quota found in radix tree, keep looking */
fs/xfs/xfs_icache.c: /* propagate the reclaim tag up into the perag radix tree */
fs/xfs/xfs_icache.c: /* clear the reclaim tag from the perag radix tree */
fs/xfs/xfs_icache.c: * We set the inode flag atomically with the radix tree tag.
fs/xfs/xfs_icache.c: * Once we get tag lookups on the radix tree, this inode flag
fs/xfs/xfs_icache.c: * radix tree nodes not being updated yet. We monitor for this by
fs/xfs/xfs_icache.c: * Because the inode hasn't been added to the radix-tree yet it can't
fs/xfs/xfs_icache.c: * These values must be set before inserting the inode into the radix
fs/xfs/xfs_icache.c: * radix tree traversal here. It assumes this function
fs/xfs/xfs_icache.c: * radix tree lookups to a minimum. The batch size is a trade off between
fs/xfs/xfs_icache.c: * The radix tree lock here protects a thread in xfs_iget from racing
fs/xfs/xfs_icache.c: * Remove the inode from the per-AG radix tree.
fs/xfs/xfs_icache.c: * with inode cache radix tree lookups. This is because the lookup
fs/xfs/xfs_icache.c: * Don't bother locking the AG and looking up in the radix trees
fs/xfs/xfs_icache.c: /* propagate the eofblocks tag up into the perag radix tree */
fs/xfs/xfs_icache.c: /* clear the eofblocks tag from the perag radix tree */
fs/xfs/xfs_icache.h: * tags for inode radix tree
fs/xfs/xfs_qm.c: * currently is the only interface into the radix tree code that allows
$

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-12-06 14:06:56

by Matthew Wilcox

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote:
> On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> > That said, using xa_cmpxchg() in the dquot code looked like the right
> > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks
> > entirely reasonable for another thread to come in and set up the dquot.
> > But I'm obviously quite ignorant of the XFS internals, so maybe there's
> > something else going on that makes this essentially a "can't happen".
>
> It's no different to the inode cache code, which drops the RCU
> lock on lookup miss, instantiates the new inode (maybe reading it
> off disk), then locks the tree and attempts to insert it. Both cases
> use "insert if empty, otherwise retry lookup from start" semantics.

Ah. I had my focus set a little narrow on the inode cache code and didn't
recognise the pattern.

Why do you sleep for one jiffy after encountering a miss, then seeing
someone else insert the inode for you?

> cmpxchg is for replacing a known object in a store - it's not really
> intended for doing initial inserts after a lookup tells us there is
> nothing in the store. The radix tree "insert only if empty" makes
> sense here, because it naturally takes care of lookup/insert races
> via the -EEXIST mechanism.
>
> I think that providing xa_store_excl() (which would return -EEXIST
> if the entry is not empty) would be a better interface here, because
> it matches the semantics of lookup cache population used all over
> the kernel....

I'm not thrilled with xa_store_excl(), but I need to think about that
a bit more.

> > I'm quite happy to have normal API variants that don't save/restore
> > interrupts. Just need to come up with good names ... I don't think
> > xa_store_noirq() is a good name, but maybe you do?
>
> I'd prefer not to have to deal with such things at all. :P
>
> How many subsystems actually require irq safety in the XA locking
> code? Make them use irqsafe versions, not make everyone else use
> "noirq" versions, as is the convention for the rest of the kernel
> code....

Hard to say how many existing radix tree users require the irq safety.
Also hard to say how many potential users (people currently using
linked lists, people using resizable arrays, etc) need irq safety.
My thinking was "make it safe by default and let people who know better
have a way to opt out", but there's definitely something to be said for
"make it fast by default and let people who need the unusual behaviour
type those extra few letters".

So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh()
and xa_store_irqsafe()? (at least on demand, as users come to light?)
At least the read side doesn't require any variants; everybody can use
RCU for read side protection.

("safe", not "save" because I wouldn't make the caller provide the
"flags" argument).

> > At least, not today. One of the future plans is to allow xa_nodes to
> > be allocated from ZONE_MOVABLE. In order to do that, we have to be
> > able to tell which lock protects any given node. With the XArray,
> > we can find that out (xa_node->root->xa_lock); with the radix tree,
> > we don't even know what kind of lock protects the tree.
>
> Yup, this is a prime example of why we shouldn't be creating
> external dependencies by smearing the locking context outside the XA
> structure itself. It's not a stretch to see something like a
> ZONE_MOVABLE dependency because some other object indexed in an XA
> is stored in the same page as the xa_node that points to it, and
> both require the same xa_lock to move/update...

That is a bit of a stretch. Christoph Lameter and I had a discussion about it
here: https://www.spinics.net/lists/linux-mm/msg122902.html

There's no situation where you need to acquire two locks in order to
free an object; you'd create odd locking dependencies between objects
if you did that (eg we already have a locking dependency between pag_ici
and perag from __xfs_inode_set_eofblocks_tag). It'd be a pretty horrible
shrinker design where you had to get all the locks on all the objects,
regardless of what locking order the real code had.

> > There are other costs to not having a lock. The lockdep/RCU
> > analysis done on the radix tree code is none. Because we have
> > no idea what lock might protect any individual radix tree, we use
> > rcu_dereference_raw(), disabling lockdep's ability to protect us.
>
> Unfortunately for you, I don't find arguments along the lines of
> "lockdep will save us" at all convincing. lockdep already throws
> too many false positives to be useful as a tool that reliably and
> accurately points out rare, exciting, complex, intricate locking
> problems.

But it does reliably and accurately point out "dude, you forgot to take
the lock". It's caught a number of real problems in my own testing that
you never got to see.

> That problem has not gone away - very few people who read and have
> to maintain this code understand all the nasty little intricacies
> of RCU lookups. Hiding /more/ of the locking semantics from the
> programmers makes it even harder to explain why the algorithm is
> safe. If the rules are basic (e.g. all radix tree lookups use RCU
> locking) then it's easier for everyone to understand, review and
> keep the code working correctly because there's almost no scope for
> getting it wrong.

Couldn't agree more. Using RCU is subtle, and the parts of the kernel
that use calls like radix_tree_lookup_slot() are frequently buggy,
not least because the sparse annotations were missing until I added
them recently. That's why the XArray makes sure it has the RCU lock
for you on anything that needs it.

Not that that helps you ... you need to hold the RCU lock yourself because
your data are protected by RCU. I did wonder if you could maybe
improve performance slightly by using something like the page cache's
get_speculative, re-check scheme, but I totally understand your desire
to not make this so hard to understand.

> BTW, something else I just noticed: all the comments in XFS that
> talk about the radix trees would need updating.

I know ... I've been trying to resist the urge to fix comments and spend
more of my time on getting the code working. It's frustrating to see
people use "radix tree" when what they really mean is "page cache".
Our abstractions leak like sieves.

2017-12-06 23:58:33

by Ross Zwisler

Subject: Re: [PATCH v4 00/73] XArray version 4

On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> I looked through some notes and decided this was version 4 of the XArray.
> Last posted two weeks ago, this version includes a *lot* of changes.
> I'd like to thank Dave Chinner for his feedback, encouragement and
> distracting ideas for improvement, which I'll get to once this is merged.
>
> Highlights:
> - Over 2000 words of documentation in patch 8! And lots more kernel-doc.
> - The page cache is now fully converted to the XArray.
> - Many more tests in the test-suite.
>
> This patch set is not for applying. 0day is still reporting problems,
> and I'd feel bad for eating someone's data. These patches apply on top
> of a set of preparatory patches which just aren't interesting. If you
> want to see the patches applied to a tree, I suggest pulling my git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> I also left out the idr_preload removals. They're still in the git tree,
> but I'm not looking for feedback on them.

Hey Matthew,

Maybe I missed this from a previous version, but can you explain the
motivation for replacing the radix tree with an xarray? (I think this should
probably still be part of the cover letter?) Do we have a performance problem
we need to solve? A code complexity issue we need to solve? Something else?

- Ross

2017-12-07 00:13:50

by Matthew Wilcox

Subject: Re: [PATCH v4 00/73] XArray version 4

On Wed, Dec 06, 2017 at 04:58:29PM -0700, Ross Zwisler wrote:
> Maybe I missed this from a previous version, but can you explain the
> motivation for replacing the radix tree with an xarray? (I think this should
> probably still be part of the cover letter?) Do we have a performance problem
> we need to solve? A code complexity issue we need to solve? Something else?

Sure! Something else I screwed up in the v4 announcement ... I'll
need it again for v5, so here's a quick update of the v1 announcement's
justification:

I wrote the xarray to replace the radix tree with a better API based
on observing how programmers are currently using the radix tree, and
on how (and why) they aren't. Conceptually, an xarray is an array of
ULONG_MAX pointers which is initially full of NULL pointers.

Improvements the xarray has over the radix tree:

- The radix tree provides operations like other trees do; 'insert' and
'delete'. But what users really want is an automatically resizing
array, and so it makes more sense to give users an API that is like
an array -- 'load' and 'store'.
- Locking is part of the API. This simplifies a lot of users who
formerly had to manage their own locking just for the radix tree.
It also improves code generation as we can now tell RCU that we're
holding a lock and it doesn't need to generate as much fencing code.
The other advantage is that tree nodes can be moved (not yet
implemented).
- GFP flags are now parameters to calls which may need to allocate
memory. The radix tree forced users to decide what the allocation
flags would be at creation time. It's much clearer to specify them
at allocation time. I know the MM people disapprove of the radix
tree using the top bits of the GFP flags for its own purpose, so
they'll like this aspect.
- Memory is not preloaded; we don't tie up dozens of pages on the
off chance that the slab allocator fails. Instead, we drop the lock,
allocate a new node and retry the operation.
- The xarray provides a conditional-replace operation. The radix tree
forces users to roll their own (and at least four have).
- Iterators now take a 'max' parameter. That simplifies many users and
will reduce the amount of iteration done.
- Iteration can proceed backwards. We only have one user for this, but
since it's called as part of the pagefault readahead algorithm, that
seemed worth mentioning.
- RCU-protected pointers are not exposed as part of the API. There are
some fun bugs where the page cache forgets to use rcu_dereference()
in the current codebase.
- Any function which wants it can now call the update_node() callback.
There were a few places missing that I noticed as part of this rewrite.
- Exceptional entries may now be BITS_PER_LONG-1 in size, rather than the
BITS_PER_LONG-2 that they had in the radix tree. That gives us the
extra bit we need to put huge page swap entries in the page cache.
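
As a rough picture of the 'max' parameter mentioned in the list above,
here is a userspace toy with a fixed-size array. The names toy_find()
and toy_count_range() are made up for this sketch and their signatures
are an assumption, not the XArray prototypes:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 16

/* Toy stand-in for the array: fixed size, every slot starts out NULL. */
struct toy_array {
	void *slot[TOY_SLOTS];
};

/*
 * Return the first non-NULL entry at a position i with *index <= i <= max,
 * updating *index to that position; return NULL if there is none.
 */
static void *toy_find(const struct toy_array *a, size_t *index, size_t max)
{
	size_t i;

	assert(index != NULL);
	if (max >= TOY_SLOTS)
		max = TOY_SLOTS - 1;
	for (i = *index; i <= max; i++) {
		if (a->slot[i]) {
			*index = i;
			return a->slot[i];
		}
	}
	return NULL;
}

/* Visit every present entry in [first, max]; return how many were seen. */
static size_t toy_count_range(const struct toy_array *a,
			      size_t first, size_t max)
{
	size_t index = first, n = 0;

	while (toy_find(a, &index, max)) {
		n++;
		index++;	/* continue after the entry just found */
	}
	return n;
}
```

Because the bound is part of the call, the caller's loop needs no
separate "have I gone past the end of my range" check.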

The API comes in two parts, normal and advanced. The normal API takes
care of the locking and memory allocation for you. You can get the
value of a pointer by calling xa_load() and set the value of a pointer by
calling xa_store(). You can conditionally update the value of a pointer
by calling xa_cmpxchg(). Each pointer which isn't NULL can be tagged
with up to 3 bits of extra information, accessed through xa_get_tag(),
xa_set_tag() and xa_clear_tag(). You can copy batches of pointers out
of the array by calling xa_get_entries() or xa_get_tagged(). You can
iterate over pointers in the array by calling xa_find(), xa_find_after()
or xa_for_each().
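
The tagging rule in the paragraph above (up to 3 tag bits, only on
pointers which aren't NULL) can be modelled in a few lines of userspace
C. This is a toy: the toy_* names are hypothetical, the per-slot byte of
tag bits is an assumption of the sketch and not how the XArray stores
tags, and clearing tags when NULL is stored is likewise an assumption
that follows from "only non-NULL pointers can be tagged":

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS   16
#define TOY_NR_TAGS 3

struct toy_tagged_array {
	void *slot[TOY_SLOTS];
	unsigned char tags[TOY_SLOTS];	/* low TOY_NR_TAGS bits are used */
};

/* Store @entry; an empty slot cannot keep tags, so NULL drops them. */
static void toy_store(struct toy_tagged_array *a, size_t index, void *entry)
{
	assert(index < TOY_SLOTS);
	a->slot[index] = entry;
	if (!entry)
		a->tags[index] = 0;
}

/* Tags only stick to non-NULL entries; tagging an empty slot is a no-op. */
static void toy_set_tag(struct toy_tagged_array *a, size_t index, unsigned tag)
{
	assert(index < TOY_SLOTS && tag < TOY_NR_TAGS);
	if (a->slot[index])
		a->tags[index] = (unsigned char)(a->tags[index] | (1u << tag));
}

static void toy_clear_tag(struct toy_tagged_array *a, size_t index,
			  unsigned tag)
{
	assert(index < TOY_SLOTS && tag < TOY_NR_TAGS);
	a->tags[index] = (unsigned char)(a->tags[index] & ~(1u << tag));
}

static bool toy_get_tag(const struct toy_tagged_array *a, size_t index,
			unsigned tag)
{
	assert(index < TOY_SLOTS && tag < TOY_NR_TAGS);
	return (a->tags[index] >> tag) & 1u;
}

/* Does any entry in the array carry @tag?  (Linear scan in this toy.) */
static bool toy_tagged(const struct toy_tagged_array *a, unsigned tag)
{
	size_t i;

	for (i = 0; i < TOY_SLOTS; i++)
		if (toy_get_tag(a, i, tag))
			return true;
	return false;
}
```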

The advanced API allows users to build their own operations. You have
to take care of your own locking and handle memory allocation failures.
Most of the advanced operations are based around the xa_state which
keeps state between sub-operations. Read the xarray.h header file for
more information on the advanced API, and see the implementation of the
normal API for examples of how to use the advanced API.

Those familiar with the radix tree may notice certain similarities between
the implementation of the xarray and the radix tree. That's entirely
intentional, but the implementation will certainly adapt in the future.
For example, one of the impediments I see to using xarrays instead of
kvmalloced arrays is memory consumption, so I have a couple of ideas to
reduce memory usage for smaller arrays.

I have reimplemented the IDR and the IDA based on the xarray. They are
roughly the same complexity as they were when implemented on top of the
radix tree (although much less intertwined).

When converting code from the radix tree to the xarray, the biggest thing
to bear in mind is that 'store' overwrites anything which happens to be
in the xarray. Just like the assignment operator. The equivalent to
the insert operation is to replace NULL with the new value.
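
A small userspace toy may make the difference between 'store' and
'insert' easier to see. The toy_* functions are hypothetical stand-ins,
there is no locking or allocation, and having toy_store() return the
previous entry is an assumption of this sketch:

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 16

/* Toy stand-in for the array: fixed size, every slot starts out NULL. */
struct toy_array {
	void *slot[TOY_SLOTS];
};

static void *toy_load(const struct toy_array *a, size_t index)
{
	assert(index < TOY_SLOTS);
	return a->slot[index];
}

/* Like assignment: overwrites whatever was there; returns the old entry. */
static void *toy_store(struct toy_array *a, size_t index, void *entry)
{
	void *old;

	assert(index < TOY_SLOTS);
	old = a->slot[index];
	a->slot[index] = entry;
	return old;
}

/* Erasing an entry is just storing NULL. */
static void *toy_erase(struct toy_array *a, size_t index)
{
	return toy_store(a, index, NULL);
}

/*
 * The equivalent of the radix tree's insert: replace NULL with @entry.
 * Returns the old entry, so NULL means the new entry went in and
 * non-NULL means the slot was occupied and has been left alone.
 */
static void *toy_insert(struct toy_array *a, size_t index, void *entry)
{
	void *old = toy_load(a, index);

	if (!old)
		a->slot[index] = entry;
	return old;
}
```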

A quick reference guide to help when converting radix tree code.
Functions which start 'xas' are XA_ADVANCED functions.

INIT_RADIX_TREE                  xa_init
radix_tree_empty                 xa_empty
__radix_tree_create              xas_create
__radix_tree_insert              xas_store
radix_tree_insert(x)             xa_cmpxchg(NULL, x)
__radix_tree_lookup              xas_load
radix_tree_lookup                xa_load
radix_tree_lookup_slot           xas_load
__radix_tree_replace             xas_store
radix_tree_iter_replace          xas_store
radix_tree_replace_slot          xas_store
__radix_tree_delete_node         xas_store
radix_tree_delete_item           xa_cmpxchg
radix_tree_delete                xa_erase
radix_tree_clear_tags            xas_init_tags
radix_tree_gang_lookup           xa_get_entries
radix_tree_gang_lookup_slot      xas_find (*1)
radix_tree_preload               (*3)
radix_tree_maybe_preload         (*3)
radix_tree_tag_set               xa_set_tag
radix_tree_tag_clear             xa_clear_tag
radix_tree_tag_get               xa_get_tag
radix_tree_iter_tag_set          xas_set_tag
radix_tree_gang_lookup_tag       xa_get_tagged
radix_tree_gang_lookup_tag_slot  xas_load (*2)
radix_tree_tagged                xa_tagged
radix_tree_preload_end           (*3)
radix_tree_split_preload         (*3)
radix_tree_split                 xas_split (*4)
radix_tree_join                  xas_store

(*1) All three users of radix_tree_gang_lookup_slot() are using it to
ensure that there are no entries in a given range.
(*2) The one radix_tree_gang_lookup_tag_slot user should be using a
radix_tree_iter loop. It can use an xas_for_each() loop, or even an
xa_for_each() loop.
(*3) I don't think we're going to need a preallocation API. If we do
end up needing one, I have a plan that doesn't involve per-cpu
preallocation pools.
(*4) Not yet implemented

2017-12-07 00:42:12

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> > > That said, using xa_cmpxchg() in the dquot code looked like the right
> > > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks
> > > entirely reasonable for another thread to come in and set up the dquot.
> > > But I'm obviously quite ignorant of the XFS internals, so maybe there's
> > > something else going on that makes this essentially a "can't happen".
> >
> > It's no different to the inode cache code, which drops the RCU
> > lock on lookup miss, instantiates the new inode (maybe reading it
> > off disk), then locks the tree and attempts to insert it. Both cases
> > use "insert if empty, otherwise retry lookup from start" semantics.
>
> Ah. I had my focus set a little narrow on the inode cache code and didn't
> recognise the pattern.
>
> Why do you sleep for one jiffy after encountering a miss, then seeing
> someone else insert the inode for you?

The sleep is a backoff that allows whatever we raced with to
complete, be it a hit that raced with an inode being reclaimed and
removed, or a miss that raced with another insert. Ideally we'd
sleep on the XFS_INEW bit, similar to the vfs I_NEW flag, but it's
not quite that simple with the reclaim side of things...

> > cmpxchg is for replacing a known object in a store - it's not really
> > intended for doing initial inserts after a lookup tells us there is
> > nothing in the store. The radix tree "insert only if empty" makes
> > sense here, because it naturally takes care of lookup/insert races
> > via the -EEXIST mechanism.
> >
> > I think that providing xa_store_excl() (which would return -EEXIST
> > if the entry is not empty) would be a better interface here, because
> > it matches the semantics of lookup cache population used all over
> > the kernel....
>
> I'm not thrilled with xa_store_excl(), but I need to think about that
> a bit more.

Not fussed about the name - I just think we need a function that
matches the insert semantics of the code....

> > > I'm quite happy to have normal API variants that don't save/restore
> > > interrupts. Just need to come up with good names ... I don't think
> > > xa_store_noirq() is a good name, but maybe you do?
> >
> > I'd prefer not to have to deal with such things at all. :P
> >
> > How many subsystems actually require irq safety in the XA locking
> > code? Make them use irqsafe versions, not make everyone else use
> > "noirq" versions, as is the convention for the rest of the kernel
> > code....
>
> Hard to say how many existing radix tree users require the irq safety.

The mapping tree requires it because it gets called from IO
completion contexts to clear page writeback state, but I don't know
about any of the others.

> Also hard to say how many potential users (people currently using
> linked lists, people using resizable arrays, etc) need irq safety.
> My thinking was "make it safe by default and let people who know better
> have a way to opt out", but there's definitely something to be said for
> "make it fast by default and let people who need the unusual behaviour
> type those extra few letters".
>
> So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh()
> and xa_store_irqsafe()? (at least on demand, as users come to light?)
> At least the read side doesn't require any variants; everybody can use
> RCU for read side protection.

That would follow the pattern of the rest of the kernel APIs, though
I think it might be cleaner to simply state the locking requirement
to xa_init() and keep all those details completely internal rather
than encoding them into API calls. After all, the "irqsafe-ness" of
the locking needs to be consistent across the entire XA instance....

> ("safe", not "save" because I wouldn't make the caller provide the
> "flags" argument).
>
> > > At least, not today. One of the future plans is to allow xa_nodes to
> > > be allocated from ZONE_MOVABLE. In order to do that, we have to be
> > > able to tell which lock protects any given node. With the XArray,
> > > we can find that out (xa_node->root->xa_lock); with the radix tree,
> > > we don't even know what kind of lock protects the tree.
> >
> > Yup, this is a prime example of why we shouldn't be creating
> > external dependencies by smearing the locking context outside the XA
> > structure itself. It's not a stretch to see something like a
> > ZONE_MOVABLE dependency because some other object indexed in an XA
> > is stored in the same page as the xa_node that points to it, and
> > both require the same xa_lock to move/update...
>
> That is a bit of a stretch. Christoph Lameter and I had a discussion about it
> here: https://www.spinics.net/lists/linux-mm/msg122902.html
>
> There's no situation where you need to acquire two locks in order to
> free an object;

ZONE_MOVABLE is for moving migratable objects, not freeing
unreferenced objects. i.e. it's used to indicate that active objects
can be moved to a different location whilst they have other objects
pointing to them. This requires atomically swapping all the external
pointer references to the object so everything sees either the old
object before the move or the new object after the move. While the
move is in progress, we have to stall anything that could possibly
reference the object and in general that means we have to lock up all
the objects that point to the object being moved.

For things like inodes, we have *lots* of external references to
them, and so we'd have to stall all of those external references
to update them once movement is complete. Lots of locks to hold
there, potentially including the xa_lock for the trees that index
the inode.

Hence if we are trying to migrate multiple objects at a time (i.e.
the bulk slab page clearing case) then we've got to lock up multiple
referencing objects and structures that may have overlapping
dependencies and so could end up trying to get the same locks that
other objects in the page already hold.

It's an utter mess - xa_node might be simple, but the general case
for slab objects in ZONE_MOVABLE is anything but simple. That's
the reason we've never made any progress on generic slab
defragmentation in the past 12-13 years - we haven't worked out how
to solve this fundamental "atomically update all external references
to the object being moved" problem.

> you'd create odd locking dependencies between objects
> if you did that (eg we already have a locking dependency between pag_ici
> and perag from __xfs_inode_set_eofblocks_tag)

You missed this one: xfs_inode_set_reclaim_tag()

It nests pag->pag_ici_lock - ip->i_flags_lock - mp->m_perag_lock
in one pass because we've got an inode flag and tags in two separate
radix trees we need to update atomically....
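
As a toy userspace model of that nesting (not the XFS code): all toy_*
names, the rank numbers and the bitmaps standing in for the two radix
trees are assumptions made up for this sketch. Only the lock order
pag_ici_lock, then i_flags_lock, then m_perag_lock comes from the text
above.

```c
#include <stddef.h>

/* Toy lock with a rank; locks must be taken in increasing rank order. */
struct toy_lock {
	int rank;
};

#define TOY_MAX_HELD 8

static struct toy_lock *toy_held[TOY_MAX_HELD];
static int toy_nr_held;
static int toy_order_violations;

static void toy_lock(struct toy_lock *l)
{
	/* Taking a lock whose rank is not above the innermost one is a bug. */
	if (toy_nr_held > 0 && toy_held[toy_nr_held - 1]->rank >= l->rank)
		toy_order_violations++;
	if (toy_nr_held < TOY_MAX_HELD)
		toy_held[toy_nr_held++] = l;
}

static void toy_unlock(struct toy_lock *l)
{
	/* This sketch only supports releasing the innermost lock. */
	if (toy_nr_held > 0 && toy_held[toy_nr_held - 1] == l)
		toy_nr_held--;
}

struct toy_mount {
	struct toy_lock m_perag_lock;
	unsigned long perag_tags;	/* stands in for the per-mount tree tags */
};

struct toy_perag {
	struct toy_lock pag_ici_lock;
	unsigned long inode_tags;	/* stands in for the per-AG tree tags */
	int agno;
};

struct toy_inode {
	struct toy_lock i_flags_lock;
	unsigned int flags;
	int agino;
};

#define TOY_IRECLAIMABLE 0x1u

/* Update the inode flag and both tags in one pass, under nested locks. */
static void toy_set_reclaim_tag(struct toy_mount *mp, struct toy_perag *pag,
				struct toy_inode *ip)
{
	toy_lock(&pag->pag_ici_lock);
	toy_lock(&ip->i_flags_lock);
	pag->inode_tags |= 1UL << ip->agino;

	toy_lock(&mp->m_perag_lock);
	mp->perag_tags |= 1UL << pag->agno;
	toy_unlock(&mp->m_perag_lock);

	ip->flags |= TOY_IRECLAIMABLE;
	toy_unlock(&ip->i_flags_lock);
	toy_unlock(&pag->pag_ici_lock);
}
```

With ranks 1, 2 and 3 assigned in that order, toy_order_violations
stays at zero; anything that needs one of these locks while already
holding a later one would have to drop and retake locks, or would trip
the check.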

Also, that's called in the evict() path so, yeah, we're actually
nesting multiple locks to get the inode into a state where we can
reclaim it...

> It'd be a pretty horrible
> shrinker design where you had to get all the locks on all the objects,
> regardless of what locking order the real code had.

The shrinker (i.e. memory reclaim) doesn't need to do that - only
object migration does. They operate on vastly different object
contexts and should not be conflated.

Cheers,

Dave.
--
Dave Chinner

2017-12-07 16:06:57

by Theodore Ts'o

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > Unfortunately for you, I don't find arguments along the lines of
> > "lockdep will save us" at all convincing. lockdep already throws
> > too many false positives to be useful as a tool that reliably and
> > accurately points out rare, exciting, complex, intricate locking
> > problems.
>
> But it does reliably and accurately point out "dude, you forgot to take
> the lock". It's caught a number of real problems in my own testing that
> you never got to see.

The problem is that if it has too many false positives --- and it's
gotten *way* worse with the completion callback "feature", people will
just stop using Lockdep as being too annoying and a waste of developer
time when trying to figure out what is a legitimate locking bug versus
lockdep getting confused.

<Rant>I can't even disable the new Lockdep feature which is throwing
lots of new false positives --- it's just all or nothing.</Rant>

Dave has just said he's already stopped using Lockdep, as a result.

- Ted

2017-12-07 22:22:23

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > > Unfortunately for you, I don't find arguments along the lines of
> > > "lockdep will save us" at all convincing. lockdep already throws
> > > too many false positives to be useful as a tool that reliably and
> > > accurately points out rare, exciting, complex, intricate locking
> > > problems.
> >
> > But it does reliably and accurately point out "dude, you forgot to take
> > the lock". It's caught a number of real problems in my own testing that
> > you never got to see.
>
> The problem is that if it has too many false positives --- and it's
> gotten *way* worse with the completion callback "feature", people will
> just stop using Lockdep as being too annoying and a waste of developer
> time when trying to figure out what is a legitimate locking bug versus
> lockdep getting confused.
>
> <Rant>I can't even disable the new Lockdep feature which is throwing
> lots of new false positives --- it's just all or nothing.</Rant>
>
> Dave has just said he's already stopped using Lockdep, as a result.

This is completely OT, but FYI I stopped using lockdep a long time
ago. We've spent orders of magnitude more time and effort to shut
up lockdep false positives in the XFS code than we ever have on
locking problems that lockdep has uncovered. And still lockdep
throws too many false positives on XFS workloads to be useful to me.

But it's more than that: I understand just how much lockdep *doesn't
check* and that means *I know I can't rely on lockdep* for potential
deadlock detection. e.g. it doesn't cover semaphores, which means
it has zero coverage of the entire XFS metadata buffer subsystem and
the complex locking orders we have for metadata updates.

Put simply: lockdep doesn't provide me with any benefit, so I don't
use it...

Cheers,

Dave.
--
Dave Chinner

2017-12-07 22:38:14

by Matthew Wilcox

Subject: Lockdep is less useful than it was

On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> The problem is that if it has too many false positives --- and it's
> gotten *way* worse with the completion callback "feature", people will
> just stop using Lockdep as being too annoying and a waste of developer
> time when trying to figure out what is a legitimate locking bug versus
> lockdep getting confused.
>
> <Rant>I can't even disable the new Lockdep feature which is throwing
> lots of new false positives --- it's just all or nothing.</Rant>

You *can* ... but it's way more hacking Kconfig than you ought to have
to do (which is a separate rant ...)

You need to get LOCKDEP_CROSSRELEASE off. I'd revert patches
e26f34a407aec9c65bce2bc0c838fabe4f051fc6 and
b483cf3bc249d7af706390efa63d6671e80d1c09

I think it was a mistake to force these on for everybody; they have a
much higher false-positive rate than the rest of lockdep, so as you say
forcing them on leads to fewer people using *any* of lockdep.

The bug you're hitting isn't Byungchul's fault; it's an annotation
problem. The same kind of annotation problem that we used to have with
dozens of other places in the kernel which are now fixed. If you didn't
have to hack Kconfig to get rid of this problem, you'd be happier, right?

2017-12-07 22:40:03

by Matthew Wilcox

Subject: Re: Lockdep is less useful than it was

On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote:
> You need to get LOCKDEP_CROSSRELEASE off. I'd revert patches
> e26f34a407aec9c65bce2bc0c838fabe4f051fc6 and
> b483cf3bc249d7af706390efa63d6671e80d1c09

Oops. I meant to revert 2dcd5adfb7401b762ddbe4b86dcacc2f3de6b97b.
Or you could cherry-pick b483cf3bc249d7af706390efa63d6671e80d1c09.

2017-12-08 00:16:40

by Dave Chinner

Subject: Re: Lockdep is less useful than it was

On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> > The problem is that if it has too many false positives --- and it's
> > gotten *way* worse with the completion callback "feature", people will
> > just stop using Lockdep as being too annoying and a waste of developer
> > time when trying to figure out what is a legitimate locking bug versus
> > lockdep getting confused.
> >
> > <Rant>I can't even disable the new Lockdep feature which is throwing
> > lots of new false positives --- it's just all or nothing.</Rant>
>
> You *can* ... but it's way more hacking Kconfig than you ought to have
> to do (which is a separate rant ...)
>
> You need to get LOCKDEP_CROSSRELEASE off. I'd revert patches
> e26f34a407aec9c65bce2bc0c838fabe4f051fc6 and
> b483cf3bc249d7af706390efa63d6671e80d1c09
>
> I think it was a mistake to force these on for everybody; they have a
> much higher false-positive rate than the rest of lockdep, so as you say
> forcing them on leads to fewer people using *any* of lockdep.
>
> The bug you're hitting isn't Byungchul's fault; it's an annotation
> problem. The same kind of annotation problem that we used to have with
> dozens of other places in the kernel which are now fixed.

That's one of the fundamental problems with lockdep - it throws the
difficulty of solving all these new false positives onto the
developers who know nothing about lockdep and don't follow its
development. And until they do solve them - especially in critical
subsystems that everyone uses like the storage stack - lockdep is
essentially worthless.

> If you didn't
> have to hack Kconfig to get rid of this problem, you'd be happier, right?

I'd be much happier if it wasn't turned on by default in the first
place. We gave plenty of warnings that there were still unsolved
false positive problems with the new checks in the storage stack.

Cheers,

Dave.
--
Dave Chinner

2017-12-08 04:46:20

by Byungchul Park

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Fri, Dec 08, 2017 at 09:22:16AM +1100, Dave Chinner wrote:
> On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > > > Unfortunately for you, I don't find arguments along the lines of
> > > > "lockdep will save us" at all convincing. lockdep already throws
> > > > too many false positives to be useful as a tool that reliably and
> > > > accurately points out rare, exciting, complex, intricate locking
> > > > problems.
> > >
> > > But it does reliably and accurately point out "dude, you forgot to take
> > > the lock". It's caught a number of real problems in my own testing that
> > > you never got to see.
> >
> > The problem is that if it has too many false positives --- and it's
> > gotten *way* worse with the completion callback "feature", people will
> > just stop using Lockdep as being too annoying and a waste of developer
> > time when trying to figure out what is a legitimate locking bug versus
> > lockdep getting confused.
> >
> > <Rant>I can't even disable the new Lockdep feature which is throwing
> > lots of new false positives --- it's just all or nothing.</Rant>
> >
> > Dave has just said he's already stopped using Lockdep, as a result.
>
> This is completely OT, but FYI I stopped using lockdep a long time
> ago. We've spent orders of magnitude more time and effort to shut
> up lockdep false positives in the XFS code than we ever have on
> locking problems that lockdep has uncovered. And still lockdep
> throws too many false positives on XFS workloads to be useful to me.
>
> But it's more than that: I understand just how much lockdep *doesn't
> check* and that means *I know I can't rely on lockdep* for potential
> deadlock detection. e.g. it doesn't cover semaphores, which means

Hello,

I say the following with care, since you seem unhappy with
crossrelease and even with lockdep. Now that cross-release has been
introduced, semaphores can be covered, as you might know. Actually, all
general waiters can.

> it has zero coverage of the entire XFS metadata buffer subsystem and
> the complex locking orders we have for metadata updates.
>
> Put simply: lockdep doesn't provide me with any benefit, so I don't
> use it...

Sad..

> Cheers,
>
> Dave.
> --
> Dave Chinner

2017-12-08 07:25:07

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Fri, Dec 08, 2017 at 01:45:52PM +0900, Byungchul Park wrote:
> On Fri, Dec 08, 2017 at 09:22:16AM +1100, Dave Chinner wrote:
> > On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
> > > On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> > > > > Unfortunately for you, I don't find arguments along the lines of
> > > > > "lockdep will save us" at all convincing. lockdep already throws
> > > > > too many false positives to be useful as a tool that reliably and
> > > > > accurately points out rare, exciting, complex, intricate locking
> > > > > problems.
> > > >
> > > > But it does reliably and accurately point out "dude, you forgot to take
> > > > the lock". It's caught a number of real problems in my own testing that
> > > > you never got to see.
> > >
> > > The problem is that if it has too many false positives --- and it's
> > > gotten *way* worse with the completion callback "feature", people will
> > > just stop using Lockdep as being too annoying and a waste of developer
> > > time when trying to figure out what is a legitimate locking bug versus
> > > lockdep getting confused.
> > >
> > > <Rant>I can't even disable the new Lockdep feature which is throwing
> > > lots of new false positives --- it's just all or nothing.</Rant>
> > >
> > > Dave has just said he's already stopped using Lockdep, as a result.
> >
> > This is completely OT, but FYI I stopped using lockdep a long time
> > ago. We've spent orders of magnitude more time and effort to shut
> > up lockdep false positives in the XFS code than we ever have on
> > locking problems that lockdep has uncovered. And still lockdep
> > throws too many false positives on XFS workloads to be useful to me.
> >
> > But it's more than that: I understand just how much lockdep *doesn't
> > check* and that means *I know I can't rely on lockdep* for potential
> > deadlock detection. e.g. it doesn't cover semaphores, which means
>
> Hello,
>
> I say the following with care, since you seem unhappy with
> crossrelease and even with lockdep. Now that cross-release has been
> introduced, semaphores can be covered, as you might know. Actually, all
> general waiters can.

And all it will do is create a whole bunch more work for us XFS guys
to shut up all the false positive crap that falls out from it
because the locking model we have is far more complex than any of
the lockdep developers thought was necessary to support, just as
happened with the XFS inode annotations all those years ago.

e.g. nobody has ever bothered to ask us what is needed to describe
XFS's semaphore locking model. If you did that, you'd know that we
nest *thousands* of locked semaphores in completely random lock
order during metadata buffer writeback. And that this lock order
does not reflect the actual locking order rules we have for locking
buffers during transactions.

Oh, and you'd also know that a semaphore's lock order and context
can change multiple times during the lifetime of the buffer. Say
we free a block and then reallocate it as something else before it is
reclaimed - that buffer now might have a different lock order. Or
maybe we promote a buffer to be a root btree block as a result of a
join - it's now the first buffer in a lock run, rather than a child.
Or we split a tree, and the root is now a node and so no longer is
the first buffer in a lock run. Or that we walk sideways along the
leaf node siblings during searches. IOWs, there is no well defined
static lock ordering for buffers - and therefore semaphores -
in XFS at all.

And knowing that, you wouldn't simply mention that lockdep can
support semaphores now as though that is necessary to "make it work"
for XFS. It's going to be much simpler for us to just turn off
lockdep and ignore whatever crap it sends our way than it is to
spend unplanned weeks of our time to try to make lockdep sorta work
again. Sure, we might get there in the end, but it's likely to take
months, if not years like it did with the XFS inode annotations.....

> > it has zero coverage of the entire XFS metadata buffer subsystem and
> > the complex locking orders we have for metadata updates.
> >
> > Put simply: lockdep doesn't provide me with any benefit, so I don't
> > use it...
>
> Sad..

I don't think you understand. I'll try to explain.

The lockdep infrastructure by itself doesn't make lockdep a useful
tool - it mostly generates false positives because it has no
concept of locking models that don't match its internal tracking
assumptions and/or limitations.

That means if we can't suppress the false positives, then lockdep is
going to be too noisy to find real problems. It's taken the XFS
developers months of work over the past 7-8 years to suppress all
the *common* false positives that lockdep throws on XFS. And despite
all that work, there are still too many false positives occurring
because we can't easily suppress them with annotations. IOWs, the
signal to noise ratio is still too low for lockdep to find real
problems.

That's why lockdep isn't useful to me - the noise floor is too high,
and the effort to lower the noise floor further is too great.

This is important, because cross-release just raised the noise floor
by a large margin and so now we have to spend the time to reduce it
again back to where it was before cross-release was added. IOWs,
adding new detection features to lockdep actually makes lockdep less
useful for a significant period of time. That length of time is
dependent on the rate at which subsystem developers can suppress the
false positives and lower the noise floor back down to an acceptable
level. And there is always the possibility that we can't get the
noise floor low enough for lockdep to be a reliable, useful tool for
some subsystems....

That's what I don't think you understand - that the most important
part of lockdep is /not the core infrastructure/ you work on. The
most important part of lockdep is the annotations that suppress the
noise floor and allow the real problems to stand out.

Cheers,

Dave.
--
Dave Chinner

2017-12-08 09:27:57

by Byungchul Park

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On 12/8/2017 4:25 PM, Dave Chinner wrote:
> On Fri, Dec 08, 2017 at 01:45:52PM +0900, Byungchul Park wrote:
>> On Fri, Dec 08, 2017 at 09:22:16AM +1100, Dave Chinner wrote:
>>> On Thu, Dec 07, 2017 at 11:06:34AM -0500, Theodore Ts'o wrote:
>>>> On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
>>>>>> Unfortunately for you, I don't find arguments along the lines of
>>>>>> "lockdep will save us" at all convincing. lockdep already throws
>>>>>> too many false positives to be useful as a tool that reliably and
>>>>>> accurately points out rare, exciting, complex, intricate locking
>>>>>> problems.
>>>>>
>>>>> But it does reliably and accurately point out "dude, you forgot to take
>>>>> the lock". It's caught a number of real problems in my own testing that
>>>>> you never got to see.
>>>>
>>>> The problem is that if it has too many false positives --- and it's
>>>> gotten *way* worse with the completion callback "feature", people will
>>>> just stop using Lockdep as being too annoying and a waste of developer
>>>> time when trying to figure out what is a legitimate locking bug versus
>>>> lockdep getting confused.
>>>>
>>>> <Rant>I can't even disable the new Lockdep feature which is throwing
>>>> lots of new false positives --- it's just all or nothing.</Rant>
>>>>
>>>> Dave has just said he's already stopped using Lockdep, as a result.
>>>
>>> This is completely OT, but FYI I stopped using lockdep a long time
>>> ago. We've spent orders of magnitude more time and effort to shut
>>> up lockdep false positives in the XFS code than we ever have on
>>> locking problems that lockdep has uncovered. And still lockdep
>>> throws too many false positives on XFS workloads to be useful to me.
>>>
>>> But it's more than that: I understand just how much lockdep *doesn't
>>> check* and that means *I know I can't rely on lockdep* for potential
>>> deadlock detection. e.g. it doesn't cover semaphores, which means
>>
>> Hello,
>>
>> I say the following with care, since you seem unhappy with
>> crossrelease and even with lockdep. Now that cross-release has been
>> introduced, semaphores can be covered, as you might know. Actually, all
>> general waiters can.
>
> And all it will do is create a whole bunch more work for us XFS guys
> to shut up all the false positive crap that falls out from it
> because the locking model we have is far more complex than any of
> the lockdep developers thought was necessary to support, just as
> happened with the XFS inode annotations all those years ago.
>
> e.g. nobody has ever bothered to ask us what is needed to describe
> XFS's semaphore locking model. If you did that, you'd know that we
> nest *thousands* of locked semaphores in completely random lock
> order during metadata buffer writeback. And that this lock order
> does not reflect the actual locking order rules we have for locking
> buffers during transactions.
>
> Oh, and you'd also know that a semaphore's lock order and context
> can change multiple times during the lifetime of the buffer. Say
> we free a block and then reallocate it as something else before it is
> reclaimed - that buffer now might have a different lock order. Or
> maybe we promote a buffer to be a root btree block as a result of a
> join - it's now the first buffer in a lock run, rather than a child.
> Or we split a tree, and the root is now a node and so no longer is
> the first buffer in a lock run. Or that we walk sideways along the
> leaf node siblings during searches. IOWs, there is no well defined
> static lock ordering for buffers - and therefore semaphores -
> in XFS at all.
>
> And knowing that, you wouldn't simply mention that lockdep can
> support semaphores now as though that is necessary to "make it work"
> for XFS. It's going to be much simpler for us to just turn off
> lockdep and ignore whatever crap it sends our way than it is to
> spend unplanned weeks of our time to try to make lockdep sorta work
> again. Sure, we might get there in the end, but it's likely to take
> months, if not years like it did with the XFS inode annotations.....
>
>>> it has zero coverage of the entire XFS metadata buffer subsystem and
>>> the complex locking orders we have for metadata updates.
>>>
>>> Put simply: lockdep doesn't provide me with any benefit, so I don't
>>> use it...
>>
>> Sad..
>
> I don't think you understand. I'll try to explain.
>
> The lockdep infrastructure by itself doesn't make lockdep a useful
> tool - it mostly generates false positives because it has no
> concept of locking models that don't match its internal tracking
> assumptions and/or limitations.
>
> That means if we can't suppress the false positives, then lockdep is
> going to be too noisy to find real problems. It's taken the XFS
> developers months of work over the past 7-8 years to suppress all
> the *common* false positives that lockdep throws on XFS. And despite
> all that work, there are still too many false positives occurring
> because we can't easily suppress them with annotations. IOWs, the
> signal to noise ratio is still too low for lockdep to find real
> problems.
>
> That's why lockdep isn't useful to me - the noise floor is too high,
> and the effort to lower the noise floor further is too great.
>
> This is important, because cross-release just raised the noise floor
> by a large margin and so now we have to spend the time to reduce it
> again back to where it was before cross-release was added. IOWs,
> adding new detection features to lockdep actually makes lockdep less
> useful for a significant period of time. That length of time is
> dependent on the rate at which subsystem developers can suppress the
> false positives and lower the noise floor back down to an acceptable
> level. And there is always the possibility that we can't get the
> noise floor low enough for lockdep to be a reliable, useful tool for
> some subsystems....
>
> That's what I don't think you understand - that the most important
> part of lockdep is /not the core infrastructure/ you work on. The
> most important part of lockdep is the annotations that suppress the
> noise floor and allow the real problems to stand out.

I'm sorry to hear that.. If I were you, I would also get
annoyed. And.. thanks for the explanation.

But, I think assigning lock classes properly and checking the
relationships between the classes to detect deadlocks is reasonable.

In my opinion, there are 2 problems with the common lockdep
code.

1) Firstly, it's hard to assign lock classes *properly*. By
default, the class depends on the call site of lockdep_init_map(),
but where the ordering rules are too complicated to rely on the
call site, we need to assign another class manually. That
can *only* be done by experts of the subsystem.

I think if they want to get the benefit of lockdep, they have no
choice but to assign classes manually using their domain knowledge,
or to use *lockdep_set_novalidate_class()* to turn off validation
for the locks that annoy developers and that they don't want
checked.

It's a choice between (1) getting the benefit of lockdep by
applying domain knowledge, and (2) giving up that benefit by
turning off validation for the locks that cause trouble.

2) Secondly, I've seen several places where lock_acquire() is
used somewhat incorrectly, more than we need. That adds
detection capability and makes lockdep stronger, but it also
increases the chance of false positives.

If you don't want to work on the additional annotations at the
moment, then I think you can choose whichever option you
want, and reconsider the locks you've excluded later: when it
becomes necessary to detect deadlocks involving those locks,
turn validation back on for them and add the necessary annotations.

Am I missing something?

--
Thanks,
Byungchul

2017-12-08 15:27:48

by Theodore Ts'o

Subject: Re: Lockdep is less useful than it was

On Thu, Dec 07, 2017 at 02:38:03PM -0800, Matthew Wilcox wrote:
> I think it was a mistake to force these on for everybody; they have a
> much higher false-positive rate than the rest of lockdep, so as you say
> forcing them on leads to fewer people using *any* of lockdep.
>
> The bug you're hitting isn't Byungchul's fault; it's an annotation
> problem. The same kind of annotation problem that we used to have with
> dozens of other places in the kernel which are now fixed.

The question is whose responsibility is it to annotate the code? On
another thread it was suggested it was ext4's responsibility to add
annotations to avoid the false positives --- never mind the fact
that every single file system is going to have to add annotations.

Also note that the documentation for how to add annotations is
***horrible***. It's mostly, "try to figure out how other people
added magic cargo cult code which is not well defined (look at the
definitions of lockdep_set_class, lockdep_set_class_and_name,
lockdep_set_class_and_subclass, and lockdep_set_subclass, and weep) in
other subsystems and hope and pray it works for you."

And the explanations when there are failures, whether false positives
or not, are completely opaque. For example:

[ 16.190198] ext4lazyinit/648 is trying to acquire lock:
[ 16.191201] ((gendisk_completion)1 << part_shift(NUMA_NO_NODE)){+.+.}, at: [<8a1ebe9d>] wait_for_completion_io+0x12/0x20

Just try to tell me that:

((gendisk_completion)1 << part_shift(NUMA_NO_NODE)){+.+.}

is human comprehensible with a straight face. And since the messages
don't even include the subclass/class/name key annotations, as lockdep
tries to handle things that are more and more complex, I'd argue it
has already crossed the boundary where unless you are a lockdep
developer, good luck trying to understand what is going on or how to
add annotations.

So if you are adding complexity to the kernel with the argument,
"lockdep will save us", I'm with Dave --- it's just not a believable
argument.

Cheers,

- Ted

2017-12-08 17:35:10

by Alan Stern

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Fri, 8 Dec 2017, Byungchul Park wrote:

> I'm sorry to hear that.. If I were you, I would also get
> annoyed. And.. thanks for the explanation.
>
> But, I think assigning lock classes properly and checking the
> relationships between the classes to detect deadlocks is reasonable.
>
> In my opinion, there are 2 problems with the common lockdep
> code.
>
> 1) Firstly, it's hard to assign lock classes *properly*. By
> default, the class depends on the call site of lockdep_init_map(),
> but where the ordering rules are too complicated to rely on the
> call site, we need to assign another class manually. That
> can *only* be done by experts of the subsystem.
>
> I think if they want to get the benefit of lockdep, they have no
> choice but to assign classes manually using their domain knowledge,
> or to use *lockdep_set_novalidate_class()* to turn off validation
> for the locks that annoy developers and that they don't want
> checked.

Lockdep's no_validate class is used when the locking patterns are too
complicated for lockdep to understand. Basically, it tells lockdep to
ignore those locks.

The device core uses that class. The tree of struct devices, each with
its own lock, gets used in many different and complicated ways.
Lockdep can't understand this -- it doesn't have the ability to
represent an arbitrarily deep hierarchical tree of locks -- so we tell
it to ignore the device locks.

It sounds like XFS may need to do the same thing with its semaphores.
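
To make the limitation concrete, here is a toy userspace model of
class-based order checking. It is a sketch under assumptions, not
lockdep's actual algorithm or data structures, and every toy_* name is
invented for it. It tracks classes rather than individual locks, so a
parent and a child lock that share one class look like recursive
locking, and a novalidate flag makes the checker skip a class entirely.

```c
#include <string.h>

#define TOY_MAX_CLASSES 8
#define TOY_MAX_HELD 8

enum toy_report { TOY_OK = 0, TOY_RECURSIVE = 1, TOY_INVERSION = 2 };

struct toy_lockdep {
	/* seen[a][b] != 0: class b was once acquired while class a was held */
	unsigned char seen[TOY_MAX_CLASSES][TOY_MAX_CLASSES];
	unsigned char novalidate[TOY_MAX_CLASSES];
	int held[TOY_MAX_HELD];
	int nr_held;
};

static void toy_lockdep_init(struct toy_lockdep *ld)
{
	memset(ld, 0, sizeof(*ld));
}

/* Tell the checker to ignore every lock of this class. */
static void toy_set_novalidate(struct toy_lockdep *ld, int cls)
{
	ld->novalidate[cls] = 1;
}

static enum toy_report toy_acquire(struct toy_lockdep *ld, int cls)
{
	int i;

	if (ld->novalidate[cls])
		return TOY_OK;			/* not tracked at all */

	for (i = 0; i < ld->nr_held; i++) {
		int h = ld->held[i];

		if (h == cls)
			return TOY_RECURSIVE;	/* same class already held */
		if (ld->seen[cls][h])
			return TOY_INVERSION;	/* opposite order seen before */
	}

	/* Record the new ordering edges, then track the lock as held. */
	for (i = 0; i < ld->nr_held; i++)
		ld->seen[ld->held[i]][cls] = 1;
	if (ld->nr_held < TOY_MAX_HELD)
		ld->held[ld->nr_held++] = cls;
	return TOY_OK;
}

static void toy_release(struct toy_lockdep *ld, int cls)
{
	/* Only release of the most recently acquired class is modelled. */
	if (ld->nr_held > 0 && ld->held[ld->nr_held - 1] == cls)
		ld->nr_held--;
}
```

In this model, taking a parent device lock and then a child device
lock of the same class reports TOY_RECURSIVE even though the order is
legitimate. After toy_set_novalidate() the report goes away, and so
does any chance of catching a real inversion among those locks.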

Alan Stern

2017-12-08 18:14:51

by Matthew Wilcox

Subject: Re: Lockdep is less useful than it was

On Fri, Dec 08, 2017 at 10:27:17AM -0500, Theodore Ts'o wrote:
> So if you are adding complexity to the kernel with the argument,
> "lockdep will save us", I'm with Dave --- it's just not a believable
> argument.

I think that's a gross misrepresentation of what I'm doing.

At the moment, the radix tree actively disables the RCU checking that
enabling lockdep would give us. It has to, because it has no idea what
lock protects any individual access to the radix tree. The XArray can
use the RCU checking because it knows that every reference is protected
by either the spinlock or the RCU lock.

Dave was saying that he has a tree which has to be protected by a mutex
because of where it is in the locking hierarchy, and I was vigorously
declining his proposal of allowing him to skip taking the spinlock.

And yes, we have bugs today that I assume we only stumble across every
few billion years (or only on alpha, or only if our compiler gets more
aggressive) because we have missing rcu_dereference annotations.

2017-12-08 22:37:04

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Fri, Dec 08, 2017 at 12:35:07PM -0500, Alan Stern wrote:
> On Fri, 8 Dec 2017, Byungchul Park wrote:
>
> > I'm sorry to hear that.. If I were you, I would also get
> > annoyed. And.. thanks for explanation.
> >
> > But, I think assigning lock classes properly and checking
> > relationship of the classes to detect deadlocks is reasonable.
> >
> > In my opinion about the common lockdep stuff, there are 2
> > problems on it.
> >
> > 1) Firstly, it's hard to assign lock classes *properly*. By
> > default, it relies on the caller site of lockdep_init_map(),
> > but we need to assign another class manually, where ordering
> > rules are complicated so cannot rely on the caller site. That
> > *only* can be done by experts of the subsystem.

Sure, but that's not the issue here. The issue here is the lack of
communication with subsystem experts and that the annotation
complexity warnings given immediately by the subsystem experts were
completely ignored...

> > I think if they want to get benefit from lockdep, they have no
> > choice but to assign classes manually with the domain knowledge,
> > or use *lockdep_set_novalidate_class()* to invalidate locks
> > making the developers annoyed and not want to use the checking
> > for them.
>
> Lockdep's no_validate class is used when the locking patterns are too
> complicated for lockdep to understand. Basically, it tells lockdep to
> ignore those locks.

Let me just point out two things here:

1. Using lockdep_set_novalidate_class() for anything other
than device->mutex will throw checkpatch warnings. Nice. (*)

2. lockdep_set_novalidate_class() is completely undocumented
- it's the first I've ever heard of this functionality. i.e.
nobody has ever told us there is a mechanism to turn off
validation of an object; we've *always* been told to "change
your code and/or fix your annotations" when discussing
lockdep deficiencies. (**)

> The device core uses that class. The tree of struct devices, each with
> its own lock, gets used in many different and complicated ways.
> Lockdep can't understand this -- it doesn't have the ability to
> represent an arbitrarily deep hierarchical tree of locks -- so we tell
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That largely describes the in-memory structure of XFS, except we
have a forest of lock trees, not just one....

> it to ignore the device locks.
>
> It sounds like XFS may need to do the same thing with its semaphores.

Whoever adds semaphore checking to lockdep can add those
annotations. The externalisation of the development cost of new
lockdep functionality is one of the problems here.

-Dave.

(*) checkpatch.pl is considered mostly harmful round here, too,
but that's another rant....

(**) the frequent occurrence of "core code/devs aren't held to the
same rules/standard as everyone else" is another rant I have stored
up for a rainy day.

--
Dave Chinner
[email protected]

2017-12-08 22:49:06

by Dave Chinner

Subject: Re: Lockdep is less useful than it was

On Fri, Dec 08, 2017 at 10:14:38AM -0800, Matthew Wilcox wrote:
> At the moment, the radix tree actively disables the RCU checking that
> enabling lockdep would give us. It has to, because it has no idea what
> lock protects any individual access to the radix tree. The XArray can
> use the RCU checking because it knows that every reference is protected
> by either the spinlock or the RCU lock.
>
> Dave was saying that he has a tree which has to be protected by a mutex
> because of where it is in the locking hierarchy, and I was vigorously
> declining his proposal of allowing him to skip taking the spinlock.

Oh, I wasn't suggesting that you remove the internal tree locking
because we need external locking.

I was trying to point out that the internal locking doesn't remove
the need for external locking, and that there are cases where
smearing the internal lock outside the XA tree doesn't work, either.
i.e. internal locking doesn't replace all the cases where external
locking is required, and so it's less efficient than the existing
radix tree code.

What I was questioning was the value of replacing the radix tree
code with a less efficient structure just to add lockdep validation
to a tree that doesn't actually need any extra locking validation...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-12-08 23:01:37

by Matthew Wilcox

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote:
> > > cmpxchg is for replacing a known object in a store - it's not really
> > > intended for doing initial inserts after a lookup tells us there is
> > > nothing in the store. The radix tree "insert only if empty" makes
> > > sense here, because it naturally takes care of lookup/insert races
> > > via the -EEXIST mechanism.
> > >
> > > I think that providing xa_store_excl() (which would return -EEXIST
> > > if the entry is not empty) would be a better interface here, because
> > > it matches the semantics of lookup cache population used all over
> > > the kernel....
> >
> > I'm not thrilled with xa_store_excl(), but I need to think about that
> > a bit more.
>
> Not fussed about the name - I just think we need a function that
> matches the insert semantics of the code....

I think I have something that works better for you than returning -EEXIST
(because you don't actually want -EEXIST, you want -EAGAIN):

/* insert the new inode */
- spin_lock(&pag->pag_ici_lock);
- error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
- if (unlikely(error)) {
- WARN_ON(error != -EEXIST);
- XFS_STATS_INC(mp, xs_ig_dup);
- error = -EAGAIN;
- goto out_preload_end;
- }
- spin_unlock(&pag->pag_ici_lock);
- radix_tree_preload_end();
+ curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
+ error = __xa_race(curr, -EAGAIN);
+ if (error)
+ goto out_unlock;

...

-out_preload_end:
- spin_unlock(&pag->pag_ici_lock);
- radix_tree_preload_end();
+out_unlock:
+ if (error == -EAGAIN)
+ XFS_STATS_INC(mp, xs_ig_dup);

I've changed the behaviour slightly in that returning an -ENOMEM used to
hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS
returning -ENOMEM probably gets you a nice warning already from the
mm code.

2017-12-09 17:00:27

by Joe Perches

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> 1. Using lockdep_set_novalidate_class() for anything other
> than device->mutex will throw checkpatch warnings. Nice. (*)
[]
> (*) checkpatch.pl is considered mostly harmful round here, too,
> but that's another rant....

How so?

> (**) the frequent occurrence of "core code/devs aren't held to the
> same rules/standard as everyone else" is another rant I have stored
> up for a rainy day.

Yeah. I wouldn't mind reading that one...

Rainy season is starting right about now here too.

2017-12-10 23:58:30

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Fri, Dec 08, 2017 at 03:01:31PM -0800, Matthew Wilcox wrote:
> On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote:
> > > > cmpxchg is for replacing a known object in a store - it's not really
> > > > intended for doing initial inserts after a lookup tells us there is
> > > > nothing in the store. The radix tree "insert only if empty" makes
> > > > sense here, because it naturally takes care of lookup/insert races
> > > > via the -EEXIST mechanism.
> > > >
> > > > I think that providing xa_store_excl() (which would return -EEXIST
> > > > if the entry is not empty) would be a better interface here, because
> > > > it matches the semantics of lookup cache population used all over
> > > > the kernel....
> > >
> > > I'm not thrilled with xa_store_excl(), but I need to think about that
> > > a bit more.
> >
> > Not fussed about the name - I just think we need a function that
> > matches the insert semantics of the code....
>
> I think I have something that works better for you than returning -EEXIST
> (because you don't actually want -EEXIST, you want -EAGAIN):
>
> /* insert the new inode */
> - spin_lock(&pag->pag_ici_lock);
> - error = radix_tree_insert(&pag->pag_ici_root, agino, ip);
> - if (unlikely(error)) {
> - WARN_ON(error != -EEXIST);
> - XFS_STATS_INC(mp, xs_ig_dup);
> - error = -EAGAIN;
> - goto out_preload_end;
> - }
> - spin_unlock(&pag->pag_ici_lock);
> - radix_tree_preload_end();
> + curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> + error = __xa_race(curr, -EAGAIN);
> + if (error)
> + goto out_unlock;
>
> ...
>
> -out_preload_end:
> - spin_unlock(&pag->pag_ici_lock);
> - radix_tree_preload_end();
> +out_unlock:
> + if (error == -EAGAIN)
> + XFS_STATS_INC(mp, xs_ig_dup);
>
> I've changed the behaviour slightly in that returning an -ENOMEM used to
> hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS
> returning -ENOMEM probably gets you a nice warning already from the
> mm code.

It's been a couple of days since I first looked at this, and my
initial reaction of "that's horrible!" hasn't changed.

What you are proposing here might be a perfectly reasonable
*internal implementation* of xa_store_excl(), but it makes for a
terrible external API because the semantics and behaviour are so
vague. e.g. what does "race" mean here with respect to an insert
failure?

i.e. the fact the cmpxchg failed may not have anything to do with a
race condition - it failed because the slot wasn't empty like we
expected it to be. There can be any number of reasons the slot isn't
empty - the API should not "document" that the reason the insert
failed was a race condition. It should document the case that we
"couldn't insert because there was an existing entry in the slot".
Let the surrounding code document the reason why that might have
happened - it's not for the API to assume reasons for failure.

i.e. this API and potential internal implementation makes much
more sense:

int
xa_store_iff_empty(...)
{
	curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
	if (!curr)
		return 0;	/* success! */
	if (!IS_ERR(curr))
		return -EEXIST;	/* failed - slot not empty */
	return PTR_ERR(curr);	/* failed - XA internal issue */
}

as it replaces the existing preload and insert code in the XFS code
paths whilst letting us handle and document the "insert failed
because slot not empty" case however we want. It implies nothing
about *why* the slot wasn't empty, just that we couldn't do the
insert because it wasn't empty.
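
For readers following the thread, the "store only if the slot is empty" semantics can be modelled in userspace. This is a minimal sketch, not kernel code: `struct toy_array`, `toy_cmpxchg()` and `toy_store_iff_empty()` are hypothetical names, a fixed-size array of C11 atomics stands in for the XArray, and the toy never allocates, so it has no -ENOMEM path.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define TOY_SLOTS 16

/* A toy fixed-size array of pointers; stands in for the XArray. */
struct toy_array {
	_Atomic(void *) slot[TOY_SLOTS];
};

/* Returns the entry that was at @index before the call, as cmpxchg does. */
static void *toy_cmpxchg(struct toy_array *a, size_t index,
			 void *old, void *entry)
{
	void *expected = old;

	assert(index < TOY_SLOTS);
	/* On failure, 'expected' is overwritten with the current entry. */
	atomic_compare_exchange_strong(&a->slot[index], &expected, entry);
	return expected;
}

/* 0 on success, -EEXIST if the slot was occupied, -EINVAL if out of range. */
static int toy_store_iff_empty(struct toy_array *a, size_t index, void *entry)
{
	if (index >= TOY_SLOTS)
		return -EINVAL;
	if (toy_cmpxchg(a, index, NULL, entry) != NULL)
		return -EEXIST;	/* says nothing about why it was occupied */
	return 0;
}
```

The return value only reports that the slot was occupied; deciding whether that indicates a race, a bug, or something expected is left to the caller.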

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-12-11 04:23:21

by Matthew Wilcox

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote:
> i.e. the fact the cmpxchg failed may not have anything to do with a
> race condition - it failed because the slot wasn't empty like we
> expected it to be. There can be any number of reasons the slot isn't
> empty - the API should not "document" that the reason the insert
> failed was a race condition. It should document the case that we
> "couldn't insert because there was an existing entry in the slot".
> Let the surrounding code document the reason why that might have
> happened - it's not for the API to assume reasons for failure.
>
> i.e. this API and potential internal implementation makes much
> more sense:
>
> int
> xa_store_iff_empty(...)
> {
> curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> if (!curr)
> return 0; /* success! */
> if (!IS_ERR(curr))
> return -EEXIST; /* failed - slot not empty */
> return PTR_ERR(curr); /* failed - XA internal issue */
> }
>
> as it replaces the existing preload and insert code in the XFS code
> paths whilst letting us handle and document the "insert failed
> because slot not empty" case however we want. It implies nothing
> about *why* the slot wasn't empty, just that we couldn't do the
> insert because it wasn't empty.

Yeah, OK. So, over the top of the recent changes I'm looking at this:

I'm not in love with xa_store_empty() as a name. I almost want
xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot
down, I'm leery of it. "empty" is at least a concept we already have
in the API with the comment for xa_init() talking about an empty array
and xa_empty(). I also considered xa_store_null and xa_overwrite_null
and xa_replace_null(). Naming remains hard.

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 941f38bb94a4..586b43836905 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -451,7 +451,7 @@ xfs_iget_cache_miss(
int flags,
int lock_flags)
{
- struct xfs_inode *ip, *curr;
+ struct xfs_inode *ip;
int error;
xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino);
int iflags;
@@ -498,8 +498,7 @@ xfs_iget_cache_miss(
xfs_iflags_set(ip, iflags);

/* insert the new inode */
- curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
- error = __xa_race(curr, -EAGAIN);
+ error = xa_store_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN);
if (error)
goto out_unlock;

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index 5792b6dbb040..cc7cc5253a67 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -271,43 +271,30 @@ static inline int xa_err(void *entry)
}

/**
- * __xa_race() - Turn a cmpxchg result into an errno.
- * @entry: Result from calling an XArray function.
- * @errno: Error number to return if we lost the race.
+ * xa_store_empty() - Store this entry in the XArray unless another entry is
+ * already present.
+ * @xa: XArray.
+ * @index: Index into array.
+ * @entry: New entry.
+ * @gfp: Memory allocation flags.
+ * @rc: Number to return if another entry was present.
*
- * Like xa_race(), but returns the error number of your choice. Calling
- * __xa_race(entry, 0) has the same result (but is less efficient) as
- * calling xa_err().
+ * Like xa_store(), but will fail and return the supplied error number if
+ * the existing entry at @index is not %NULL.
*
* Return: A negative errno or 0.
*/
-static inline int __xa_race(void *entry, int errno)
+static inline int xa_store_empty(struct xarray *xa, unsigned long index,
+ void *entry, gfp_t gfp, int errno)
{
- if (!entry)
+ void *curr = xa_cmpxchg(xa, index, NULL, entry, gfp);
+ if (!curr)
return 0;
- if (xa_is_err(entry))
- return (long)entry >> 2;
+ if (xa_is_err(curr))
+ return xa_err(curr);
return errno;
}

-/**
- * xa_race() - Turn a cmpxchg result into an errno.
- * @entry: Result from calling an XArray function.
- *
- * It is common to use xa_cmpxchg() to ensure that only one thread assigns
- * a value to an index. Pass the result from xa_cmpxchg() to xa_race() to
- * get an errno back. This function also handles any other error which
- * may have been returned by xa_cmpxchg() such as ENOMEM.
- *
- * If you don't care that you lost the race, you can use xa_err() instead.
- *
- * Return: A negative errno or 0.
- */
-static inline int xa_race(void *entry)
-{
- return __xa_race(entry, -EEXIST);
-}
-
/* Everything below here is the Advanced API. Proceed with caution. */

#define xa_trylock(xa) spin_trylock(&(xa)->xa_lock)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 85d1bc963ab6..87ed55af823e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -614,8 +614,8 @@ static int cgwb_create(struct backing_dev_info *bdi,
spin_lock_irqsave(&cgwb_lock, flags);
if (test_bit(WB_registered, &bdi->wb.state) &&
blkcg_cgwb_list->next && memcg_cgwb_list->next) {
- ret = xa_race(xa_cmpxchg(&bdi->cgwb_xa, memcg_css->id, NULL,
- wb, GFP_ATOMIC));
+ ret = xa_store_empty(&bdi->cgwb_xa, memcg_css->id, wb,
+ GFP_ATOMIC, -EEXIST);
if (!ret) {
list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
list_add(&wb->memcg_node, memcg_cgwb_list);

2017-12-11 21:43:07

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > 1. Using lockdep_set_novalidate_class() for anything other
> > than device->mutex will throw checkpatch warnings. Nice. (*)
> []
> > (*) checkpatch.pl is considered mostly harmful round here, too,
> > but that's another rant....
>
> How so?

Short story is that it barfs all over the slightly non-standard
coding style used in XFS. It generates enough noise on incidental
things that aren't important that it complicates simple things. e.g. I
just moved a block of defines from one header to another, and
checkpatch throws about 10 warnings on that because of whitespace.
I'm just moving code - I don't want to change it and it doesn't need
to be modified because it is neat and easy to read and is obviously
correct. A bunch of prototypes I added another parameter to get
warnings because they use "unsigned" for an existing parameter that
I'm not changing. And so on.

This sort of stuff is just lowest-common-denominator noise - great
for new code and/or inexperienced developers, but not for working
with large bodies of existing code with slightly non-standard
conventions. Churning *lots* of code we otherwise wouldn't touch
just to shut up checkpatch warnings takes time, risks regressions
and makes it harder to trace the history of the code when we are
trying to track down bugs.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-12-11 21:55:07

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Sun, Dec 10, 2017 at 08:23:15PM -0800, Matthew Wilcox wrote:
> On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote:
> > i.e. the fact the cmpxchg failed may not have anything to do with a
> > race condition - it failed because the slot wasn't empty like we
> > expected it to be. There can be any number of reasons the slot isn't
> > empty - the API should not "document" that the reason the insert
> > failed was a race condition. It should document the case that we
> > "couldn't insert because there was an existing entry in the slot".
> > Let the surrounding code document the reason why that might have
> > happened - it's not for the API to assume reasons for failure.
> >
> > i.e. this API and potential internal implementation makes much
> > more sense:
> >
> > int
> > xa_store_iff_empty(...)
> > {
> > curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> > if (!curr)
> > return 0; /* success! */
> > if (!IS_ERR(curr))
> > return -EEXIST; /* failed - slot not empty */
> > return PTR_ERR(curr); /* failed - XA internal issue */
> > }
> >
> > as it replaces the existing preload and insert code in the XFS code
> > paths whilst letting us handle and document the "insert failed
> > because slot not empty" case however we want. It implies nothing
> > about *why* the slot wasn't empty, just that we couldn't do the
> > insert because it wasn't empty.
>
> Yeah, OK. So, over the top of the recent changes I'm looking at this:
>
> I'm not in love with xa_store_empty() as a name. I almost want
> xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot
> down, I'm leery of it. "empty" is at least a concept we already have
> in the API with the comment for xa_init() talking about an empty array
> and xa_empty(). I also considered xa_store_null and xa_overwrite_null
> and xa_replace_null(). Naming remains hard.
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 941f38bb94a4..586b43836905 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -451,7 +451,7 @@ xfs_iget_cache_miss(
> int flags,
> int lock_flags)
> {
> - struct xfs_inode *ip, *curr;
> + struct xfs_inode *ip;
> int error;
> xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino);
> int iflags;
> @@ -498,8 +498,7 @@ xfs_iget_cache_miss(
> xfs_iflags_set(ip, iflags);
>
> /* insert the new inode */
> - curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS);
> - error = __xa_race(curr, -EAGAIN);
> + error = xa_store_empty(&pag->pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN);
> if (error)
> goto out_unlock;

Can we avoid encoding the error to be returned into the function
call? No other generic/library API does this, so this seems like a
highly unusual special snowflake approach. I just don't see how
creating a whole new error specification convention is justified
for the case where it *might* save a line or two of error checking
code in a caller?

I'd much prefer that the API defines the error on failure, and let
the caller handle that specific error however they need to like
every other library interface we have...
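
The convention being argued for here (the library returns one fixed, documented error and the caller translates it) can be sketched as follows. This is an illustrative userspace model under stated assumptions: `toy_insert()` and `toy_cache_insert()` are hypothetical names, there is no locking or atomicity, and the counter merely stands in for the statistics bump in the XFS hunk quoted above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical library call: a fixed, documented error for an occupied slot. */
static int toy_insert(void **slot, void *entry)
{
	assert(slot != NULL);
	if (*slot != NULL)
		return -EEXIST;
	*slot = entry;
	return 0;
}

/* Hypothetical caller: decides locally that "occupied" means "retry". */
static int toy_cache_insert(void **slot, void *entry, unsigned int *dup_count)
{
	int error = toy_insert(slot, entry);

	if (error == -EEXIST) {
		(*dup_count)++;		/* caller-specific accounting */
		error = -EAGAIN;	/* caller-specific policy */
	}
	return error;
}
```

The cost is the two extra lines in the caller; in exchange the library function needs no error-code parameter.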

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-12-11 22:12:38

by Joe Perches

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote:
> On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > > 1. Using lockdep_set_novalidate_class() for anything other
> > > than device->mutex will throw checkpatch warnings. Nice. (*)
> > []
> > > (*) checkpatch.pl is considered mostly harmful round here, too,
> > > but that's another rant....
> >
> > How so?
>
> Short story is that it barfs all over the slightly non-standard
> coding style used in XFS.
[]
> This sort of stuff is just lowest-common-denominator noise - great
> for new code and/or inexperienced developers, but not for working
> with large bodies of existing code with slightly non-standard
> conventions.

Completely reasonable. Thanks.

Do you get many checkpatch submitters for fs/xfs?

If so, could probably do something about adding
a checkpatch file flag to the directory or equivalent.

Maybe add something like:

fs/xfs/.checkpatch

where the contents turn off most everything

2017-12-11 22:43:14

by Matthew Wilcox

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Mon, Dec 11, 2017 at 02:12:28PM -0800, Joe Perches wrote:
> Completely reasonable. Thanks.

If we're doing "completely reasonable" complaints, then ...

- I don't understand why plain 'unsigned' is deemed bad.

- The rule about all function parameters in prototypes having a name
doesn't make sense. Example:

int ida_get_new_above(struct ida *ida, int starting_id, int *p_id);

There is zero additional value in naming 'ida'. I know it's an ida.
The struct name tells me that. If there're two struct ida pointers
in the prototype, then sure, I want to name them so I know which is
which (maybe 'src' and 'dst'). Having an unadorned 'int' parameter
to a function should be a firable offence. But there's no need to
call 'gfp_t' anything. We know it's a gfp_t. Adding 'gfp_mask'
after it doesn't tell us anything extra.

- Forcing a blank line after variable declarations sometimes makes for
some weird-looking code. For example, there is no problem with this
code (from a checkpatch PoV):

	if (xa_is_sibling(entry)) {
		offset = xa_to_sibling(entry);
		entry = xa_entry(xas->xa, node, offset);
		/* Move xa_index to the first index of this entry */
		xas->xa_index = (((xas->xa_index >> node->shift) &
				~XA_CHUNK_MASK) | offset) << node->shift;
	}

but if I decide I don't need 'offset' outside this block, and I want
to move the declaration inside, it looks like this:

	if (xa_is_sibling(entry)) {
		unsigned int offset = xa_to_sibling(entry);

		entry = xa_entry(xas->xa, node, offset);
		/* Move xa_index to the first index of this entry */
		xas->xa_index = (((xas->xa_index >> node->shift) &
				~XA_CHUNK_MASK) | offset) << node->shift;
	}

Does that blank line really add anything to your comprehension of the
block? It upsets my train of thought.

Constructively, I think this warning can be suppressed for blocks
that are under, say, 8 lines. Or maybe indented blocks is where I don't
want this warning. Not sure.

Here's another example where I don't think the blank line adds anything:

static inline int xa_store_empty(struct xarray *xa, unsigned long index,
		void *entry, gfp_t gfp, int errno)
{
	void *curr = xa_cmpxchg(xa, index, NULL, entry, gfp);
	if (!curr)
		return 0;
	if (xa_is_err(curr))
		return xa_err(curr);
	return errno;
}

So line count definitely has something to do with it.

- There's no warning for the first paragraph of section 6:

6) Functions
------------

Functions should be short and sweet, and do just one thing. They should
fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
as we all know), and do one thing and do that well.

I'm not expecting you to be able to write a perl script that checks
the first line, but we have way too many 200-plus line functions in
the kernel. I'd like a warning on anything over 200 lines (a factor
of 4 over Linus's stated goal).

- I don't understand the error for xa_head here:

struct xarray {
	spinlock_t	xa_lock;
	gfp_t		xa_flags;
	void __rcu *	xa_head;
};

Do people really think that:

struct xarray {
	spinlock_t	xa_lock;
	gfp_t		xa_flags;
	void __rcu	*xa_head;
};

is more aesthetically pleasing? And not just that, but it's an *error*
so the former is *RIGHT* and this is *WRONG*. And not just a matter
of taste?

2017-12-11 23:10:36

by Randy Dunlap

Subject: Re: [PATCH v4 08/73] xarray: Add documentation

On 12/05/2017 04:40 PM, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> This is documentation on how to use the XArray, not details about its
> internal implementation.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> ---
> Documentation/core-api/index.rst | 1 +
> Documentation/core-api/xarray.rst | 281 ++++++++++++++++++++++++++++++++++++++
> 2 files changed, 282 insertions(+)
> create mode 100644 Documentation/core-api/xarray.rst
>
> diff --git a/Documentation/core-api/xarray.rst b/Documentation/core-api/xarray.rst
> new file mode 100644
> index 000000000000..871161539242
> --- /dev/null
> +++ b/Documentation/core-api/xarray.rst
> @@ -0,0 +1,281 @@
> +======
> +XArray
> +======
> +
> +Overview
> +========
> +
> +The XArray is an abstract data type which behaves like a very large array
> +of pointers. It meets many of the same needs as a hash or a conventional
> +resizable array. Unlike a hash, it allows you to sensibly go to the
> +next or previous entry in a cache-efficient manner. In contrast to
> +a resizable array, there is no need for copying data or changing MMU
> +mappings in order to grow the array. It is more memory-efficient,
> +parallelisable and cache friendly than a doubly-linked list. It takes
> +advantage of RCU to perform lookups without locking.
> +
> +The XArray implementation is efficient when the indices used are
> +densely clustered; hashing the object and using the hash as the index
> +will not perform well. The XArray is optimised for small indices,
> +but still has good performance with large indices. If your index is
> +larger than ULONG_MAX then the XArray is not the data type for you.
> +The most important user of the XArray is the page cache.
> +
> +A freshly-initialised XArray contains a ``NULL`` pointer at every index.
> +Each non-``NULL`` entry in the array has three bits associated with
> +it called tags. Each tag may be flipped on or off independently of
> +the others. You can search for entries with a given tag set.

Only tags that are set, or search for entries with some tag(s) cleared?
Or is that like a mathematical set?


> +Normal pointers may be stored in the XArray directly. They must be 4-byte
> +aligned, which is true for any pointer returned from :c:func:`kmalloc` and
> +:c:func:`alloc_page`. It isn't true for arbitrary user-space pointers,
> +nor for function pointers. You can store pointers to statically allocated
> +objects, as long as those objects have an alignment of at least 4.

This (above) is due to the internal usage of low bits for flags?
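
As background to the alignment question, here is a minimal userspace sketch of low-bit tagging. A pointer aligned to 4 bytes always has its two low bits clear, which leaves those bits free to mark non-pointer entries. The `toy_*` helpers and the choice of bit 0 for integer values are assumptions made for illustration; the quoted text does not say which bits the XArray uses. The documented range of 0 to ``LONG_MAX`` is at least consistent with one bit being reserved.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding, for illustration only: bit 0 set => integer value. */
static inline void *toy_mk_value(unsigned long v)
{
	assert(v <= (unsigned long)LONG_MAX);	/* top bit is lost to the shift */
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* An entry with bit 0 set cannot be a 4-byte-aligned pointer. */
static inline bool toy_is_value(const void *entry)
{
	return ((uintptr_t)entry & 1) != 0;
}

static inline unsigned long toy_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* A pointer with 4-byte alignment always has its two low bits clear. */
static inline bool toy_is_aligned_pointer(const void *p)
{
	return ((uintptr_t)p & 3) == 0;
}
```

Under this toy scheme an unaligned pointer could be mistaken for an encoded value, which is the kind of ambiguity an alignment requirement rules out.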

> +The XArray does not support storing :c:func:`IS_ERR` pointers; some
> +conflict with data values and others conflict with entries the XArray
> +uses for its own purposes. If you need to store special values which
> +cannot be confused with real kernel pointers, the values 4, 8, ... 4092
> +are available.

or if I know that the values are errno-range values, I can just shift them
left by 2 to store them and then shift them right by 2 to use them?

oh, or use the following function?

> +You can also store integers between 0 and ``LONG_MAX`` in the XArray.
> +You must first convert it into an entry using :c:func:`xa_mk_value`.
> +When you retrieve an entry from the XArray, you can check whether it is
> +a data value by calling :c:func:`xa_is_value`, and convert it back to
> +an integer by calling :c:func:`xa_to_value`.
> +
> +An unusual feature of the XArray is the ability to create entries which
> +occupy a range of indices. Once stored to, looking up any index in
> +the range will return the same entry as looking up any other index in
> +the range. Setting a tag on one index will set it on all of them.
> +Storing to any index will store to all of them. Multi-index entries can
> +be explicitly split into smaller entries, or storing ``NULL`` into any
> +entry will cause the XArray to forget about the range.
> +
> +Normal API
> +==========
> +
> +Start by initialising an XArray, either with :c:func:`DEFINE_XARRAY`
> +for statically allocated XArrays or :c:func:`xa_init` for dynamically
> +allocated ones.
> +
> +You can then set entries using :c:func:`xa_store` and get entries
> +using :c:func:`xa_load`. xa_store will overwrite any entry with the
> +new entry and return the previous entry stored at that index. If you
> +store %NULL, the XArray does not need to allocate memory. You can call
> +:c:func:`xa_erase` to avoid inventing a GFP flags value. There is no
> +difference between an entry that has never been stored to and one that
> +has most recently had %NULL stored to it.
> +
> +You can conditionally replace an entry at an index by using
> +:c:func:`xa_cmpxchg`. Like :c:func:`cmpxchg`, it will only succeed if
> +the entry at that index has the 'old' value. It also returns the entry
> +which was at that index; if it returns the same entry which was passed as
> +'old', then :c:func:`xa_cmpxchg` succeeded.
> +
> +You can enquire whether a tag is set on an entry by using
> +:c:func:`xa_get_tag`. If the entry is not ``NULL``, you can set a tag
> +on it by using :c:func:`xa_set_tag` and remove the tag from an entry by
> +calling :c:func:`xa_clear_tag`. You can ask whether any entry in the

an entry

> +XArray has a particular tag set by calling :c:func:`xa_tagged`.

or maybe I don't understand. Does xa_tagged() test one entry and return its
"tagged" result/status? or does it test (potentially) the entire array to search
for a particular tag value?


> +You can copy entries out of the XArray into a plain array by
> +calling :c:func:`xa_get_entries` and copy tagged entries by calling
> +:c:func:`xa_get_tagged`. Or you can iterate over the non-``NULL``
> +entries in place in the XArray by calling :c:func:`xa_for_each`.
> +You may prefer to use :c:func:`xa_find` or :c:func:`xa_find_after`
> +to move to the next present entry in the XArray.
> +
> +Finally, you can remove all entries from an XArray by calling
> +:c:func:`xa_destroy`. If the XArray entries are pointers, you may
> +wish to free the entries first. You can do this by iterating over
> +all non-``NULL`` entries in the XArray using the :c:func:`xa_for_each`
> +iterator.
> +
> +When using the Normal API, you do not have to worry about locking.
> +The XArray uses RCU and an irq-safe spinlock to synchronise access to
> +the XArray:

[snip]

> +Advanced API
> +============
> +
> +The advanced API offers more flexibility and better performance at the
> +cost of an interface which can be harder to use and has fewer safeguards.
> +No locking is done for you by the advanced API, and you are required
> +to use the xa_lock while modifying the array. You can choose whether
> +to use the xa_lock or the RCU lock while doing read-only operations on
> +the array. You can mix advanced and normal operations on the same array;
> +indeed the normal API is implemented in terms of the advanced API. The
> +advanced API is only available to modules with a GPL-compatible license.
> +
> +The advanced API is based around the xa_state. This is an opaque data
> +structure which you declare on the stack using the :c:func:`XA_STATE`
> +macro. This macro initialises the xa_state ready to start walking
> +around the XArray. It is used as a cursor to maintain the position
> +in the XArray and let you compose various operations together without
> +having to restart from the top every time.
> +
> +The xa_state is also used to store errors. You can call
> +:c:func:`xas_error` to retrieve the error. All operations check whether
> +the xa_state is in an error state before proceeding, so there's no need
> +for you to check for an error after each call; you can make multiple
> +calls in succession and only check at a convenient point. The only error
> +currently generated by the xarray code itself is %ENOMEM, but it supports
> +arbitrary errors in case you want to call :c:func:`xas_set_err` yourself.
> +
> +If the xa_state is holding an %ENOMEM error, calling :c:func:`xas_nomem`
> +will attempt to allocate more memory using the specified gfp flags and
> +cache it in the xa_state for the next attempt. The idea is that you take
> +the xa_lock, attempt the operation and drop the lock. The operation
> +attempts to allocate memory while holding the lock, but it is more
> +likely to fail. Once you have dropped the lock, :c:func:`xas_nomem`
> +can try harder to allocate more memory. It will return ``true`` if it
> +is worth retrying the operation (ie that there was a memory error *and*

usually i.e.

> +more memory was allocated. If it has previously allocated memory, and

allocated).

> +that memory wasn't used, and there is no error (or some error that isn't
> +%ENOMEM), then it will free the memory previously allocated.
> +
> +Some users wish to hold the xa_lock for their own purpose while performing
> +one simple XArray operation, and then some other operation of their
> +own, protected by the same lock. While they could declare an xa_state
> +and use it to call one of the usual advanced API functions, it is a
> +little cumbersome. These interfaces are added on demand; at the moment,
> +:c:func:`__xa_erase`, :c:func:`__xa_set_tag` and :c:func:`__xa_clear_tag`
> +are available.
> +
> +Internal Entries
> +----------------

[snip]

> +Additional functionality
> +------------------------
> +
> +The :c:func:`xas_create` function ensures that there is somewhere in the
> +XArray to store an entry. It will store ENOMEM in the xa_state if it
> +cannot allocate memory. You do not normally need to call this function
> +yourself as it is called by :c:func:`xas_store`.
> +
> +You can use :c:func:`xas_init_tags` to reset the tags on an entry
> +to their default state. This is usually all tags clear, unless the
> +XArray is marked with ``XA_FLAGS_TRACK_FREE``, in which case tag 0 is set
> +and all other tags are clear. Replacing one entry with another using
> +:c:func:`xas_store` will not reset the tags on that entry; if you want
> +the tags reset, you should do that explicitly.
> +
> +The :c:func:`xas_load` will walk the xa_state as close to the entry
> +as it can. If you know the xa_state has already been walked to the
> +entry and need to check that the entry hasn't changed, you can use
> +:c:func:`xas_reload` to save a function call.
> +
> +If you need to move to a different index in the XArray, call
> +:c:func:`xas_set`. This reinitialises the cursor which will generally

I would put a comma .... here.......................^
but consult your $editor. :)

> +have the effect of making the next operation walk the cursor to the
> +desired spot in the tree. If you want to move to the next or previous
> +index, call :c:func:`xas_next` or :c:func:`xas_prev`. Setting the index
> +does not walk the cursor around the array so does not require a lock to
> +be held, while moving to the next or previous index does.

[snip]


Nicely done. Thanks.
--
~Randy

2017-12-11 23:46:13

by Joe Perches

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Mon, 2017-12-11 at 14:43 -0800, Matthew Wilcox wrote:
> On Mon, Dec 11, 2017 at 02:12:28PM -0800, Joe Perches wrote:
> > Completely reasonable. Thanks.
>
> If we're doing "completely reasonable" complaints, then ...
>
> - I don't understand why plain 'unsigned' is deemed bad.

That was a David Miller preference.

> - The rule about all function parameters in prototypes having a name
> doesn't make sense. Example:
>
> int ida_get_new_above(struct ida *ida, int starting_id, int *p_id);

Improvements to regex welcomed.

> - Forcing a blank line after variable declarations sometimes makes for
> some weird-looking code.

True. I don't care for this one myself.

> Constructively, I think this warning can be suppressed for blocks
> that are under, say, 8 lines.

Not easy to do as checkpatch works on patches.

> 6) Functions
> ------------
>
> Functions should be short and sweet, and do just one thing. They should
> fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
> as we all know), and do one thing and do that well.
>
> I'm not expecting you to be able to write a perl script that checks
> the first line, but we have way too many 200-plus line functions in
> the kernel. I'd like a warning on anything over 200 lines (a factor
> of 4 over Linus's stated goal).

Maybe reasonable.
Some declaration blocks for things like:

void foo(void)
{
	static const struct foobar array[] = {
		{ long count of lines... };
	[body]
}

might make that warning unreasonable though.

> - I don't understand the error for xa_head here:
>
> struct xarray {
> spinlock_t xa_lock;
> gfp_t xa_flags;
> void __rcu * xa_head;
> };
>
> Do people really think that:
>
> struct xarray {
> spinlock_t xa_lock;
> gfp_t xa_flags;
> void __rcu *xa_head;
> };
>
> is more aesthetically pleasing? And not just that, but it's an *error*
> so the former is *RIGHT* and this is *WRONG*. And not just a matter
> of taste?

No opinion really.
That's from Andy Whitcroft's original implementation.

2017-12-11 23:47:02

by Dave Chinner

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Mon, Dec 11, 2017 at 02:12:28PM -0800, Joe Perches wrote:
> On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote:
> > On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
> > > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
> > > > 1. Using lockdep_set_novalidate_class() for anything other
> > > > than device->mutex will throw checkpatch warnings. Nice. (*)
> > > []
> > > > (*) checkpatch.pl is considered mostly harmful round here, too,
> > > > but that's another rant....
> > >
> > > How so?
> >
> > Short story is that it barfs all over the slightly non-standard
> > coding style used in XFS.
> []
> > This sort of stuff is just lowest-common-denominator noise - great
> > for new code and/or inexperienced developers, but not for working
> > with large bodies of existing code with slightly non-standard
> > conventions.
>
> Completely reasonable. Thanks.
>
> Do you get many checkpatch submitters for fs/xfs?

We used to. Not recently, though.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-12-12 15:51:41

by Alan Stern

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Mon, 11 Dec 2017, Joe Perches wrote:

> > - I don't understand the error for xa_head here:
> >
> > struct xarray {
> > spinlock_t xa_lock;
> > gfp_t xa_flags;
> > void __rcu * xa_head;
> > };
> >
> > Do people really think that:
> >
> > struct xarray {
> > spinlock_t xa_lock;
> > gfp_t xa_flags;
> > void __rcu *xa_head;
> > };
> >
> > is more aesthetically pleasing? And not just that, but it's an *error*
> > so the former is *RIGHT* and this is *WRONG*. And not just a matter

Not sure what was meant here. Neither one is *WRONG* in the sense of
being invalid C code. In the sense of what checkpatch will accept, the
former is *WRONG* and the latter is *RIGHT* -- the opposite of what
was written.

> > of taste?
>
> No opinion really.
> That's from Andy Whitcroft's original implementation.

This one, at least, is easy to explain. The original version tends to
lead to bugs, or easily misunderstood code. Consider if another
variable was added to the declaration; it could easily turn into:

void __rcu * xa_head, xa_head2;

(The compiler will reject this, but it wouldn't if the underlying type
had been int instead of void.)

Doing it the other way makes the meaning a lot more clear:

void __rcu *xa_head, *xa_head2;

This is an idiom specifically intended to reduce the likelihood of
errors. Rather like avoiding assignments inside conditionals.
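
To make the pitfall concrete, here is a minimal standalone sketch. It uses
int rather than void so that the compiler accepts both spellings; the
variable names and the IS_POINTER macro are illustrative only and are not
taken from the kernel.

```c
/* Evaluates to 1 if the expression has type 'int *', otherwise 0 (C11). */
#define IS_POINTER(x) _Generic((x), int *: 1, default: 0)

/*
 * The '*' binds to the declarator, not to the type name, so only 'a'
 * is a pointer here; 'b' is a plain int despite the spacing.
 */
int * a, b;

/* Attaching '*' to each name makes the intent visible at a glance. */
int *c, *d;
```

With these declarations, IS_POINTER(a), IS_POINTER(c) and IS_POINTER(d)
are all 1, while IS_POINTER(b) is 0.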

Alan Stern

2017-12-14 18:23:52

by Joe Perches

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

On Mon, 2017-12-11 at 14:43 -0800, Matthew Wilcox wrote:
> - There's no warning for the first paragraph of section 6:
>
> 6) Functions
> ------------
>
> Functions should be short and sweet, and do just one thing. They should
> fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
> as we all know), and do one thing and do that well.
>
> I'm not expecting you to be able to write a perl script that checks
> the first line, but we have way too many 200-plus line functions in
> the kernel. I'd like a warning on anything over 200 lines (a factor
> of 4 over Linus's stated goal).

Perhaps something like this?

(very very lightly tested, more testing appreciated)
---
scripts/checkpatch.pl | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 4306b7616cdd..523aead40b87 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -59,6 +59,7 @@ my $conststructsfile = "$D/const_structs.checkpatch";
my $typedefsfile = "";
my $color = "auto";
my $allow_c99_comments = 1;
+my $max_function_length = 200;

sub help {
my ($exitcode) = @_;
@@ -2202,6 +2203,7 @@ sub process {
my $realcnt = 0;
my $here = '';
my $context_function; #undef'd unless there's a known function
+ my $context_function_linenum;
my $in_comment = 0;
my $comment_edge = 0;
my $first_line = 0;
@@ -2341,6 +2343,7 @@ sub process {
} else {
undef $context_function;
}
+ undef $context_function_linenum;
next;

# track the line number as we move through the hunk, note that
@@ -3200,11 +3203,18 @@ sub process {
if ($sline =~ /^\+\{\s*$/ &&
$prevline =~ /^\+(?:(?:(?:$Storage|$Inline)\s*)*\s*$Type\s*)?($Ident)\(/) {
$context_function = $1;
+ $context_function_linenum = $realline;
}

# check if this appears to be the end of function declaration
if ($sline =~ /^\+\}\s*$/) {
+ if (defined($context_function_linenum) &&
+ ($realline - $context_function_linenum) > $max_function_length) {
+ WARN("LONG_FUNCTION",
+ "'$context_function' function definition is " . ($realline - $context_function_linenum) . " lines, perhaps refactor\n" . $herecurr);
+ }
undef $context_function;
+ undef $context_function_linenum;
}

# check indentation of any line with a bare else
@@ -5983,6 +5993,7 @@ sub process {
defined $stat &&
$stat =~ /^.\s*(?:$Storage\s+)?$Type\s*($Ident)\s*$balanced_parens\s*{/s) {
$context_function = $1;
+ $context_function_linenum = $realline;

# check for multiline function definition with misplaced open brace
my $ok = 0;

2017-12-15 04:22:20

by Matthew Wilcox

Subject: Re: [PATCH v4 08/73] xarray: Add documentation

On Mon, Dec 11, 2017 at 03:10:22PM -0800, Randy Dunlap wrote:
> > +A freshly-initialised XArray contains a ``NULL`` pointer at every index.
> > +Each non-``NULL`` entry in the array has three bits associated with
> > +it called tags. Each tag may be flipped on or off independently of
> > +the others. You can search for entries with a given tag set.
>
> Only tags that are set, or search for entries with some tag(s) cleared?
> Or is that like a mathematical set?

hmm ...

"Each tag may be set or cleared independently of the others. You can
search for entries which have a particular tag set."

Doesn't completely remove the ambiguity, but I can't think of how to phrase
that better ...

> > +Normal pointers may be stored in the XArray directly. They must be 4-byte
> > +aligned, which is true for any pointer returned from :c:func:`kmalloc` and
> > +:c:func:`alloc_page`. It isn't true for arbitrary user-space pointers,
> > +nor for function pointers. You can store pointers to statically allocated
> > +objects, as long as those objects have an alignment of at least 4.
>
> This (above) is due to the internal usage of low bits for flags?

Sort of ... if bit 0 is set then we're storing an integer (see below),
and if bit 0 is clear and bit 1 is set, then it's an internal entry.

But I don't want the implementation details to leak into the user manual.
From the user's point of view, they can store a pointer to anything they
allocated with kmalloc. If they want to store an arbitrary pointer,
they're out of luck.

> > +The XArray does not support storing :c:func:`IS_ERR` pointers; some
> > +conflict with data values and others conflict with entries the XArray
> > +uses for its own purposes. If you need to store special values which
> > +cannot be confused with real kernel pointers, the values 4, 8, ... 4092
> > +are available.
>
> or if I know that the values are errno-range values, I can just shift them
> left by 2 to store them and then shift them right by 2 to use them?

Yes, the values -4 to -4092 also make good error signals.

> oh, or use the following function?
>
> > +You can also store integers between 0 and ``LONG_MAX`` in the XArray.
> > +You must first convert it into an entry using :c:func:`xa_mk_value`.
> > +When you retrieve an entry from the XArray, you can check whether it is
> > +a data value by calling :c:func:`xa_is_value`, and convert it back to
> > +an integer by calling :c:func:`xa_to_value`.

Yup, you could also store errors as integers, if you like. Your choice :-)
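
For readers who want to see how an integer can share a slot with aligned
pointers, here is a rough userspace sketch. The sketch_ names are
hypothetical stand-ins, not the kernel helpers, and the shift-by-one
layout is an assumption inferred from the documented 0..LONG_MAX range
and from the bit-0 test mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: encode an integer as a tagged "value entry". */
static inline void *sketch_mk_value(unsigned long v)
{
	/* Shift left to free bit 0, then set it to mark "this is an integer". */
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* Hypothetical helper: an entry is a value if bit 0 is set. */
static inline bool sketch_is_value(const void *entry)
{
	return (uintptr_t)entry & 1;
}

/* Hypothetical helper: recover the integer by dropping the marker bit. */
static inline unsigned long sketch_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}
```

A pointer to any object with an alignment of at least 2 has bit 0 clear,
so it can never be mistaken for a value entry under this scheme.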

> > +You can enquire whether a tag is set on an entry by using
> > +:c:func:`xa_get_tag`. If the entry is not ``NULL``, you can set a tag
> > +on it by using :c:func:`xa_set_tag` and remove the tag from an entry by
> > +calling :c:func:`xa_clear_tag`. You can ask whether any entry in the
>
> an entry
>
> > +XArray has a particular tag set by calling :c:func:`xa_tagged`.
>
> or maybe I don't understand. Does xa_tagged() test one entry and return its
> "tagged" result/status? or does it test (potentially) the entire array to search
> for a particular tag value?

It asks the question "Does any entry in the array have tag N set?"
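
The difference between the per-entry question and the whole-array
question can be sketched with a toy model. Everything below (the
toy_array type and the toy_ functions) is hypothetical and exists only
to illustrate the semantics; in particular, the linear scan is an
assumption made for simplicity and says nothing about how the XArray
answers the question internally.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_SIZE 8
#define TOY_TAGS 3

/* Toy model: a flat array with three tag bits per entry. */
struct toy_array {
	void *entry[TOY_SIZE];
	unsigned char tags[TOY_SIZE];	/* bit N set means tag N is set */
};

/* Per-entry question: is 'tag' set on the entry at 'index'? */
static bool toy_get_tag(const struct toy_array *a, size_t index, unsigned tag)
{
	return index < TOY_SIZE && ((a->tags[index] >> tag) & 1);
}

/* Tags may only be set on entries that are not NULL. */
static void toy_set_tag(struct toy_array *a, size_t index, unsigned tag)
{
	if (index < TOY_SIZE && tag < TOY_TAGS && a->entry[index])
		a->tags[index] |= 1u << tag;
}

/* Whole-array question: does any entry at all have 'tag' set? */
static bool toy_tagged(const struct toy_array *a, unsigned tag)
{
	for (size_t i = 0; i < TOY_SIZE; i++)
		if ((a->tags[i] >> tag) & 1)
			return true;
	return false;
}
```

In this model toy_get_tag() plays the role of xa_get_tag() and
toy_tagged() plays the role of xa_tagged().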

> > +If the xa_state is holding an %ENOMEM error, calling :c:func:`xas_nomem`
> > +will attempt to allocate more memory using the specified gfp flags and
> > +cache it in the xa_state for the next attempt. The idea is that you take
> > +the xa_lock, attempt the operation and drop the lock. The operation
> > +attempts to allocate memory while holding the lock, but it is more
> > +likely to fail. Once you have dropped the lock, :c:func:`xas_nomem`
> > +can try harder to allocate more memory. It will return ``true`` if it
> > +is worth retrying the operation (ie that there was a memory error *and*
>
> usually i.e.
>
> > +more memory was allocated. If it has previously allocated memory, and
>
> allocated).

Thanks!
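
The lock / attempt / unlock / allocate-and-retry pattern described in
that paragraph can be modelled in userspace as follows. This is only a
sketch: every mock_ name is hypothetical, the "operation under the lock"
is assumed to fail whenever no memory has been preallocated, and malloc()
stands in for an allocation that is allowed to try harder once the lock
has been dropped.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the cursor: an error plus cached memory. */
struct mock_state {
	int err;	/* 0, or a negative errno */
	void *spare;	/* memory preallocated outside the lock */
	void *slot;	/* where the stored node ends up */
	int lock_held;
};

static void mock_lock(struct mock_state *s)   { s->lock_held = 1; }
static void mock_unlock(struct mock_state *s) { s->lock_held = 0; }

/* The operation: succeeds only if memory was cached beforehand. */
static void mock_store(struct mock_state *s)
{
	if (s->err)
		return;		/* operations do nothing in an error state */
	if (!s->spare) {
		s->err = -ENOMEM;
		return;
	}
	s->slot = s->spare;
	s->spare = NULL;
}

/* Called with the lock dropped; returns true if a retry is worthwhile. */
static bool mock_nomem(struct mock_state *s)
{
	if (s->err != -ENOMEM) {
		free(s->spare);	/* release any unused preallocation */
		s->spare = NULL;
		return false;
	}
	s->spare = malloc(64);
	if (!s->spare)
		return false;
	s->err = 0;		/* memory is cached, so clear the error */
	return true;
}

/* The retry loop; returns the number of attempts that were made. */
static int mock_store_with_retry(struct mock_state *s)
{
	int attempts = 0;

	do {
		mock_lock(s);
		mock_store(s);
		mock_unlock(s);
		attempts++;
	} while (mock_nomem(s));

	return attempts;
}
```

Starting from a zeroed mock_state, the first attempt fails with -ENOMEM,
mock_nomem() caches memory, and the second attempt consumes it.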

> > +If you need to move to a different index in the XArray, call
> > +:c:func:`xas_set`. This reinitialises the cursor which will generally
>
> I would put a comma .... here.......................^
> but consult your $editor. :)

I'll ask her, but I think you're right :-)

> Nicely done. Thanks.

Thanks for the review! I think we're still struggling a little to
talk about tags in an unambiguous way, but apart from that it feels
pretty good.

2017-12-15 12:34:25

by Matthew Wilcox

Subject: Naming of tag operations in the XArray

On Thu, Dec 14, 2017 at 08:22:14PM -0800, Matthew Wilcox wrote:
> On Mon, Dec 11, 2017 at 03:10:22PM -0800, Randy Dunlap wrote:
> > > +A freshly-initialised XArray contains a ``NULL`` pointer at every index.
> > > +Each non-``NULL`` entry in the array has three bits associated with
> > > +it called tags. Each tag may be flipped on or off independently of
> > > +the others. You can search for entries with a given tag set.
> >
> > Only tags that are set, or search for entries with some tag(s) cleared?
> > Or is that like a mathematical set?
>
> hmm ...
>
> "Each tag may be set or cleared independently of the others. You can
> search for entries which have a particular tag set."
>
> Doesn't completely remove the ambiguity, but I can't think of how to phrase
> that better ...

Thinking about this some more ...

At the moment, the pieces of the API which deal with tags look like this:

bool xa_tagged(const struct xarray *, xa_tag_t);
bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
void xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
void xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
int xa_get_tagged(struct xarray *, void **dst, unsigned long start,
unsigned long max, unsigned int n, xa_tag_t);

bool xas_get_tag(const struct xa_state *, xa_tag_t);
void xas_set_tag(const struct xa_state *, xa_tag_t);
void xas_clear_tag(const struct xa_state *, xa_tag_t);
void *xas_find_tag(struct xa_state *, unsigned long max, xa_tag_t);
xas_for_each_tag(xas, entry, max, tag) { }

(at some point there will be an xa_for_each_tag too, there just hasn't
been a user yet).

I'm always ambivalent about using the word 'get' in an API because it has
two common meanings: (increment a refcount) and (return the state). How
would people feel about these names instead:

bool xa_any_tagged(xa, tag);
bool xa_is_tagged(xa, index, tag);
void xa_tag(xa, index, tag);
void xa_untag(xa, index, tag);
int xa_select(xa, dst, start, max, n, tag);

bool xas_is_tagged(xas, tag);
void xas_tag(xas, tag);
void xas_untag(xas, tag);
void *xas_find_tag(xas, max, tag);
xas_for_each_tag(xas, entry, max, tag) { }

(the last two are unchanged)

2017-12-15 17:10:49

by Matthew Wilcox

Subject: Storing errors in the XArray

On Mon, Dec 11, 2017 at 03:10:22PM -0800, Randy Dunlap wrote:
> > +The XArray does not support storing :c:func:`IS_ERR` pointers; some
> > +conflict with data values and others conflict with entries the XArray
> > +uses for its own purposes. If you need to store special values which
> > +cannot be confused with real kernel pointers, the values 4, 8, ... 4092
> > +are available.
>
> or if I know that the values are errno-range values, I can just shift them
> left by 2 to store them and then shift them right by 2 to use them?

On further thought, I like this idea so much that it's worth writing helpers
for this usage, along with a test-suite addition (which also doubles as a
demonstration of how to use them).

diff --git a/include/linux/xarray.h b/include/linux/xarray.h
index c616e9319c7c..53aa251df57a 100644
--- a/include/linux/xarray.h
+++ b/include/linux/xarray.h
@@ -232,6 +232,39 @@ static inline bool xa_is_value(const void *entry)
return (unsigned long)entry & 1;
}

+/**
+ * xa_mk_errno() - Create an XArray entry from an error number.
+ * @error: Error number to store in XArray.
+ *
+ * Return: An entry suitable for storing in the XArray.
+ */
+static inline void *xa_mk_errno(long error)
+{
+ return (void *)(error << 2);
+}
+
+/**
+ * xa_to_errno() - Get error number stored in an XArray entry.
+ * @entry: XArray entry.
+ *
+ * Return: The error number stored in the XArray entry.
+ */
+static inline long xa_to_errno(const void *entry)
+{
+ return (long)entry >> 2;
+}
+
+/**
+ * xa_is_errno() - Determine if an entry is an errno.
+ * @entry: XArray entry.
+ *
+ * Return: True if the entry is an errno, false if it is a pointer.
+ */
+static inline bool xa_is_errno(const void *entry)
+{
+ return (((unsigned long)entry & 3) == 0) && (entry > (void *)-4096);
+}
+
/**
* xa_is_internal() - Is the entry an internal entry?
* @entry: Entry retrieved from the XArray
diff --git a/tools/testing/radix-tree/xarray-test.c b/tools/testing/radix-tree/xarray-test.c
index 43111786ebdd..b843cedf3988 100644
--- a/tools/testing/radix-tree/xarray-test.c
+++ b/tools/testing/radix-tree/xarray-test.c
@@ -29,7 +29,13 @@ void check_xa_err(struct xarray *xa)
assert(xa_err(xa_store(xa, 1, xa_mk_value(0), GFP_KERNEL)) == 0);
assert(xa_err(xa_store(xa, 1, NULL, 0)) == 0);
// kills the test-suite :-(
-// assert(xa_err(xa_store(xa, 0, xa_mk_internal(0), 0)) == -EINVAL);
+// assert(xa_err(xa_store(xa, 0, xa_mk_internal(0), 0)) == -EINVAL);
+
+ assert(xa_err(xa_store(xa, 0, xa_mk_errno(-ENOMEM), GFP_KERNEL)) == 0);
+ assert(xa_err(xa_load(xa, 0)) == 0);
+ assert(xa_is_errno(xa_load(xa, 0)) == true);
+ assert(xa_to_errno(xa_load(xa, 0)) == -ENOMEM);
+ xa_erase(xa, 0);
}

void check_xa_tag(struct xarray *xa)
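
For anyone who wants to experiment with the encoding outside the kernel,
here is a standalone userspace sketch of the same idea. The sketch_
names are hypothetical; the code multiplies and divides by 4 instead of
shifting, because shifting a negative value left is undefined behaviour
in ISO C, and that substitution is a choice made for this sketch, not
something taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: encode a negative errno as a 4-byte-aligned entry. */
static inline void *sketch_mk_errno(long error)
{
	/* Multiplying by 4 keeps the low two bits clear, like a pointer. */
	return (void *)(intptr_t)(error * 4);
}

/* Hypothetical helper: recover the errno from an encoded entry. */
static inline long sketch_to_errno(const void *entry)
{
	/* The division is exact, since encoded entries are multiples of 4. */
	return (long)((intptr_t)entry / 4);
}

/* Hypothetical helper: does this entry hold an encoded errno? */
static inline bool sketch_is_errno(const void *entry)
{
	uintptr_t v = (uintptr_t)entry;

	/* Low two bits clear, and inside the top 4096 bytes of the range. */
	return (v & 3) == 0 && v > (uintptr_t)-4096;
}
```

Note that only errors from -1 to -1023 fit, because the encoded entries
must land in the range -4 to -4092.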

2017-12-17 01:27:02

by Joe Perches

Subject: [RFC patch] checkpatch: Add a test for long function definitions (>200 lines)

On Mon, 2017-12-11 at 14:43 -0800, Matthew Wilcox wrote:
> - There's no warning for the first paragraph of section 6:
>
> 6) Functions
> ------------
>
> Functions should be short and sweet, and do just one thing. They should
> fit on one or two screenfuls of text (the ISO/ANSI screen size is 80x24,
> as we all know), and do one thing and do that well.
>
> I'm not expecting you to be able to write a perl script that checks
> the first line, but we have way too many 200-plus line functions in
> the kernel. I'd like a warning on anything over 200 lines (a factor
> of 4 over Linus's stated goal).

In response to Matthew's request:

This is a possible checkpatch warning for long
function definitions.

Running against the last 10000 git commits, there
are no false positives though perhaps there are
some false negatives.

These are the matches in the last 10000 commits:

1 227 42e0442f8a237d3de9ea3f2dd2be2739e6db7fdb:1157: WARNING: 'ir_lirc_ioctl' function definition is 206 lines, perhaps refactor
2 552 148abd3b5b146021a637d36ac5c0ee91cd4ad520:790: WARNING: 'tda18250_set_params' function definition is 206 lines, perhaps refactor
3 1352 123e25c4a5658be27f08ed0fb85ade34683e5dc7:366: WARNING: 'cudbg_fill_meminfo' function definition is 252 lines, perhaps refactor
4 2688 62d591a8e00cc349e6a9efb87efac9548f178624:462: WARNING: 'program_watermarks' function definition is 232 lines, perhaps refactor
5 6171 d2ddc776a4581d900fc3bdc7803b403daae64d88:3530: WARNING: 'afs_select_fileserver' function definition is 283 lines, perhaps refactor
6 9135 2f4b411a3d6766e6362ffbf00e0495a2dfe92507:962: WARNING: 'i40e_parse_cls_flower' function definition is 242 lines, perhaps refactor
7 9450 fd708b81d972a0714b02a60eb4792fdbf15868c4:1094: WARNING: 'btrfs_ref_tree_mod' function definition is 201 lines, perhaps refactor

Running against files in mm, lib and kernel, there are
52 functions that exceed 200 lines.

$ git ls-files -- mm kernel lib | \
xargs ./scripts/checkpatch.pl -f --types=long_function --terse --quiet --no-summary | \
cat -n
1 kernel/audit.c:1447: WARNING: 'audit_receive_msg' function definition is 308 lines, perhaps refactor
2 kernel/auditsc.c:713: WARNING: 'audit_filter_rules' function definition is 273 lines, perhaps refactor
3 kernel/bpf/core.c:1285: WARNING: '___bpf_prog_run' function definition is 508 lines, perhaps refactor
4 kernel/bpf/verifier.c:2240: WARNING: 'adjust_scalar_min_max_vals' function definition is 218 lines, perhaps refactor
5 kernel/bpf/verifier.c:4064: WARNING: 'do_check' function definition is 288 lines, perhaps refactor
6 kernel/debug/debug_core.c:682: WARNING: 'kgdb_cpu_enter' function definition is 214 lines, perhaps refactor
7 kernel/debug/kdb/kdb_io.c:422: WARNING: 'kdb_read' function definition is 217 lines, perhaps refactor
8 kernel/debug/kdb/kdb_io.c:850: WARNING: 'vkdb_printf' function definition is 297 lines, perhaps refactor
9 kernel/fork.c:2033: WARNING: 'copy_process' function definition is 454 lines, perhaps refactor
10 kernel/futex.c:705: WARNING: 'get_futex_key' function definition is 204 lines, perhaps refactor
11 kernel/futex.c:2135: WARNING: 'futex_requeue' function definition is 283 lines, perhaps refactor
12 kernel/irq/manage.c:1488: WARNING: '__setup_irq' function definition is 354 lines, perhaps refactor
13 kernel/locking/locktorture.c:1050: WARNING: 'lock_torture_init' function definition is 201 lines, perhaps refactor
14 kernel/locking/qspinlock.c:505: WARNING: 'queued_spin_lock_slowpath' function definition is 210 lines, perhaps refactor
15 kernel/locking/rtmutex.c:796: WARNING: 'rt_mutex_adjust_prio_chain' function definition is 348 lines, perhaps refactor
16 kernel/power/swap.c:873: WARNING: 'save_image_lzo' function definition is 205 lines, perhaps refactor
17 kernel/power/swap.c:1469: WARNING: 'load_image_lzo' function definition is 312 lines, perhaps refactor
18 kernel/ptrace.c:1104: WARNING: 'ptrace_request' function definition is 221 lines, perhaps refactor
19 kernel/sched/core.c:4254: WARNING: '_setscheduler' function definition is 238 lines, perhaps refactor
20 kernel/sched/fair.c:8722: WARNING: 'load_balance' function definition is 261 lines, perhaps refactor
21 kernel/trace/trace_kprobe.c:839: WARNING: 'create_trace_kprobe' function definition is 207 lines, perhaps refactor
22 kernel/trace/trace_uprobe.c:564: WARNING: 'create_trace_uprobe' function definition is 202 lines, perhaps refactor
23 lib/asn1_decoder.c:518: WARNING: 'asn1_ber_decoder' function definition is 347 lines, perhaps refactor
24 lib/assoc_array.c:790: WARNING: 'assoc_array_insert_into_terminal_node' function definition is 312 lines, perhaps refactor
25 lib/assoc_array.c:1721: WARNING: 'assoc_array_gc' function definition is 266 lines, perhaps refactor
26 lib/decompress_bunzip2.c:513: WARNING: 'get_next_block' function definition is 357 lines, perhaps refactor
27 lib/lz4/lz4_compress.c:456: WARNING: 'LZ4_compress_generic' function definition is 280 lines, perhaps refactor
28 lib/lz4/lz4_decompress.c:335: WARNING: 'LZ4_decompress_generic' function definition is 283 lines, perhaps refactor
29 lib/lz4/lz4hc_compress.c:579: WARNING: 'LZ4HC_compress_generic' function definition is 241 lines, perhaps refactor
30 lib/lzo/lzo1x_decompress_safe.c:261: WARNING: 'lzo1x_decompress_safe' function definition is 223 lines, perhaps refactor
31 lib/mpi/mpi-pow.c:329: WARNING: 'mpi_powm' function definition is 291 lines, perhaps refactor
32 lib/raid6/recov_avx512.c:230: WARNING: 'raid6_2data_recov_avx512' function definition is 201 lines, perhaps refactor
33 lib/vsprintf.c:3109: WARNING: 'vsscanf' function definition is 275 lines, perhaps refactor
34 lib/zlib_inflate/inffast.c:347: WARNING: 'inflate_fast' function definition is 258 lines, perhaps refactor
35 lib/zlib_inflate/inflate.c:740: WARNING: 'zlib_inflate' function definition is 422 lines, perhaps refactor
36 lib/zlib_inflate/inftrees.c:315: WARNING: 'zlib_inflate_table' function definition is 292 lines, perhaps refactor
37 lib/zstd/compress.c:830: WARNING: 'ZSTD_compressSequences_internal' function definition is 243 lines, perhaps refactor
38 lib/zstd/zstd_opt.h:697: WARNING: 'ZSTD_compressBlock_opt_generic' function definition is 289 lines, perhaps refactor
39 lib/zstd/zstd_opt.h:1012: WARNING: 'ZSTD_compressBlock_opt_extDict_generic' function definition is 311 lines, perhaps refactor
40 mm/compaction.c:958: WARNING: 'isolate_migratepages_block' function definition is 268 lines, perhaps refactor
41 mm/filemap.c:2303: WARNING: 'generic_file_buffered_read' function definition is 251 lines, perhaps refactor
42 mm/khugepaged.c:1560: WARNING: 'collapse_shmem' function definition is 262 lines, perhaps refactor
43 mm/ksm.c:1755: WARNING: 'stable_tree_search' function definition is 236 lines, perhaps refactor
44 mm/memory.c:3080: WARNING: 'do_swap_page' function definition is 210 lines, perhaps refactor
45 mm/mmap.c:968: WARNING: '__vma_adjust' function definition is 287 lines, perhaps refactor
46 mm/nommu.c:1424: WARNING: 'do_mmap' function definition is 223 lines, perhaps refactor
47 mm/page-writeback.c:1829: WARNING: 'balance_dirty_pages' function definition is 268 lines, perhaps refactor
48 mm/rmap.c:1609: WARNING: 'try_to_unmap_one' function definition is 275 lines, perhaps refactor
49 mm/shmem.c:1911: WARNING: 'shmem_getpage_gfp' function definition is 315 lines, perhaps refactor
50 mm/swapfile.c:2253: WARNING: 'try_to_unuse' function definition is 222 lines, perhaps refactor
51 mm/swapfile.c:3325: WARNING: 'SYSCALL_DEFINE2' function definition is 230 lines, perhaps refactor
52 mm/vmscan.c:1297: WARNING: 'shrink_page_list' function definition is 411 lines, perhaps refactor

Anyone think this function line length test is useful?
Should it be longer? shorter? not exist at all?

---
scripts/checkpatch.pl | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 4306b7616cdd..523aead40b87 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -59,6 +59,7 @@ my $conststructsfile = "$D/const_structs.checkpatch";
my $typedefsfile = "";
my $color = "auto";
my $allow_c99_comments = 1;
+my $max_function_length = 200;

sub help {
my ($exitcode) = @_;
@@ -2202,6 +2203,7 @@ sub process {
my $realcnt = 0;
my $here = '';
my $context_function; #undef'd unless there's a known function
+ my $context_function_linenum;
my $in_comment = 0;
my $comment_edge = 0;
my $first_line = 0;
@@ -2341,6 +2343,7 @@ sub process {
} else {
undef $context_function;
}
+ undef $context_function_linenum;
next;

# track the line number as we move through the hunk, note that
@@ -3200,11 +3203,18 @@ sub process {
if ($sline =~ /^\+\{\s*$/ &&
$prevline =~ /^\+(?:(?:(?:$Storage|$Inline)\s*)*\s*$Type\s*)?($Ident)\(/) {
$context_function = $1;
+ $context_function_linenum = $realline;
}

# check if this appears to be the end of function declaration
if ($sline =~ /^\+\}\s*$/) {
+ if (defined($context_function_linenum) &&
+ ($realline - $context_function_linenum) > $max_function_length) {
+ WARN("LONG_FUNCTION",
+ "'$context_function' function definition is " . ($realline - $context_function_linenum) . " lines, perhaps refactor\n" . $herecurr);
+ }
undef $context_function;
+ undef $context_function_linenum;
}

# check indentation of any line with a bare else
@@ -5983,6 +5993,7 @@ sub process {
defined $stat &&
$stat =~ /^.\s*(?:$Storage\s+)?$Type\s*($Ident)\s*$balanced_parens\s*{/s) {
$context_function = $1;
+ $context_function_linenum = $realline;

# check for multiline function definition with misplaced open brace
my $ok = 0;

2017-12-17 21:46:48

by Linus Torvalds

Subject: Re: [RFC patch] checkpatch: Add a test for long function definitions (>200 lines)

On Sat, Dec 16, 2017 at 5:26 PM, Joe Perches <[email protected]> wrote:
>>
>> I'm not expecting you to be able to write a perl script that checks
>> the first line, but we have way too many 200-plus line functions in
>> the kernel. I'd like a warning on anything over 200 lines (a factor
>> of 4 over Linus's stated goal).
>
> In response to Matthew's request:
>
> This is a possible checkpatch warning for long
> function definitions.

So I'm not sure a line count makes sense.

Sometimes long functions can be sensible, if they are basically just
one big case-statement or similar.

Looking at one of your examples: futex_requeue() is indeed a long
function, but that's mainly because it has a lot of comments about
exactly what is going on, and while it only has one (fairly small)
case statement, the rest of it is very similar (ie "in this case, do
XYZ").

Another case I looked at - try_to_unmap_one() - had very similar
behavior. It's long, but it's not long for the wrong reasons.

And yes, "copy_process()" is disgusting, and probably _could_ be split
up a bit, but at the same time the bulk of the lines there really is
just the "initialize all the parts of the "struct task_struct".

And other times, I suspect even a 50-line function is way too dense,
just because it's doing crazy things.

So I have a really hard time with some arbitrary line limit. At the
very least, I think it should ignore comments and whitespace lines.
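
To make the "ignore comments and whitespace lines" idea concrete, here is a minimal userspace sketch in C. It is not what checkpatch does; the function name count_code_lines() is hypothetical, and it assumes comment markers never appear inside string literals, which is good enough for a rough length heuristic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count the lines of a C source buffer that contain something other
 * than whitespace and comments.  Handles block comments and // comments.
 * Assumption: comment markers inside string literals are not handled.
 */
size_t count_code_lines(const char *src)
{
	size_t lines = 0;
	bool in_block_comment = false;
	bool line_has_code = false;

	assert(src != NULL);

	for (const char *p = src; *p; p++) {
		if (*p == '\n') {
			if (line_has_code)
				lines++;
			line_has_code = false;
			continue;
		}
		if (in_block_comment) {
			if (p[0] == '*' && p[1] == '/') {
				in_block_comment = false;
				p++;	/* skip the closing '/' */
			}
			continue;
		}
		if (p[0] == '/' && p[1] == '*') {
			in_block_comment = true;
			p++;	/* skip the opening '*' */
			continue;
		}
		if (p[0] == '/' && p[1] == '/') {
			/* Skip to end of line; the loop then sees the '\n'. */
			while (p[1] && p[1] != '\n')
				p++;
			continue;
		}
		if (*p != ' ' && *p != '\t' && *p != '\r')
			line_has_code = true;
	}
	/* A final line without a trailing newline still counts. */
	if (line_has_code)
		lines++;
	return lines;
}
```

A line that mixes code and a trailing comment counts once, while a comment-only or blank line does not count at all, so a heavily commented function is not penalised for its comments.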

And yes, some real "complexity analysis" might give a much more sane
limit, but I don't even know what that would be or how it would work.

Linus

2017-12-17 22:22:29

by Joe Perches

Subject: Re: [RFC patch] checkpatch: Add a test for long function definitions (>200 lines)

On Sun, 2017-12-17 at 13:46 -0800, Linus Torvalds wrote:
> On Sat, Dec 16, 2017 at 5:26 PM, Joe Perches <[email protected]> wrote:
> > >
> > > I'm not expecting you to be able to write a perl script that checks
> > > the first line, but we have way too many 200-plus line functions in
> > > the kernel. I'd like a warning on anything over 200 lines (a factor
> > > of 4 over Linus's stated goal).
> >
> > In response to Matthew's request:
> >
> > This is a possible checkpatch warning for long
> > function definitions.
>
> So I'm not sure a line count makes sense.
>
> Sometimes long functions can be sensible, if they are basically just
> one big case-statement or similar.
>
> Looking at one of your examples: futex_requeue() is indeed a long
> function, but that's mainly because it has a lot of comments about
> exactly what is going on, and while it only has one (fairly small)
> case statement, the rest of it is very similar (ie "in this case, do
> XYZ").
>
> Another case I looked at - try_to_unmap_one() - had very similar
> behavior. It's long, but it's not long for the wrong reasons.
>
> And yes, "copy_process()" is disgusting, and probably _could_ be split
> up a bit, but at the same time the bulk of the lines there really is
> just the "initialize all the parts of the "struct task_struct".
>
> And other times, I suspect even a 50-line function is way too dense,
> just because it's doing crazy things.
>
> So I have a really hard time with some arbitrary line limit. At the
> very least, I think it should ignore comments and whitespace lines.

That part is easy enough to do. (below)

> And yes, some real "complexity analysis" might give a much more sane
> limit, but I don't even know what that would be or how it would work.

I suspect there are better tools (like GNU complexity)
for this and I'm not at all tied to this as a checkpatch
feature.

btw: futex_requeue line count is now 140 so it doesn't warn.

---
scripts/checkpatch.pl | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)

diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
index 168687ae24fa..99c065f90360 100755
--- a/scripts/checkpatch.pl
+++ b/scripts/checkpatch.pl
@@ -59,6 +59,7 @@ my $conststructsfile = "$D/const_structs.checkpatch";
my $typedefsfile = "";
my $color = "auto";
my $allow_c99_comments = 1;
+my $max_function_length = 200;

sub help {
my ($exitcode) = @_;
@@ -2202,6 +2203,8 @@ sub process {
my $realcnt = 0;
my $here = '';
my $context_function; #undef'd unless there's a known function
+ my $context_function_start;
+ my $context_function_lines;
my $in_comment = 0;
my $comment_edge = 0;
my $first_line = 0;
@@ -2341,6 +2344,8 @@ sub process {
} else {
undef $context_function;
}
+ undef $context_function_start;
+ $context_function_lines = 0;
next;

# track the line number as we move through the hunk, note that
@@ -3200,11 +3205,25 @@ sub process {
if ($sline =~ /^\+\{\s*$/ &&
$prevline =~ /^\+(?:(?:(?:$Storage|$Inline)\s*)*\s*$Type\s*)?($Ident)\(/) {
$context_function = $1;
+ $context_function_start = $realline;
+ $context_function_lines = 0;
+ }
+
+# if in a function, count the non-blank lines
+ if (defined ($context_function) && $sline !~ /^[ \+]\s*$/) {
+ $context_function_lines++;
}

# check if this appears to be the end of function declaration
if ($sline =~ /^\+\}\s*$/) {
+ if (defined($context_function_start) &&
+ $context_function_lines > $max_function_length) {
+ WARN("LONG_FUNCTION",
+ "'$context_function' function definition is " . $context_function_lines . " statement lines, perhaps refactor\n" . $herecurr);
+ }
undef $context_function;
+ undef $context_function_start;
+ $context_function_lines = 0;
}

# check indentation of any line with a bare else
@@ -5983,6 +6002,8 @@ sub process {
defined $stat &&
$stat =~ /^.\s*(?:$Storage\s+)?$Type\s*($Ident)\s*$balanced_parens\s*{/s) {
$context_function = $1;
+ $context_function_start = $realline;
+ $context_function_lines = 0;

# check for multiline function definition with misplaced open brace
my $ok = 0;

2017-12-17 22:33:07

by Luc Van Oostenryck

Subject: Re: [RFC patch] checkpatch: Add a test for long function definitions (>200 lines)

On Sun, Dec 17, 2017 at 01:46:45PM -0800, Linus Torvalds wrote:
> On Sat, Dec 16, 2017 at 5:26 PM, Joe Perches <[email protected]> wrote:
> >>
> >> I'm not expecting you to be able to write a perl script that checks
> >> the first line, but we have way too many 200-plus line functions in
> >> the kernel. I'd like a warning on anything over 200 lines (a factor
> >> of 4 over Linus's stated goal).
> >
> > In response to Matthew's request:
> >
> > This is a possible checkpatch warning for long
> > function definitions.
>
> So I'm not sure a line count makes sense.
>
> Sometimes long functions can be sensible, if they are basically just
> one big case-statement or similar.
>
> Looking at one of your examples: futex_requeue() is indeed a long
> function, but that's mainly because it has a lot of comments about
> exactly what is going on, and while it only has one (fairly small)
> case statement, the rest of it is very similar (ie "in this case, do
> XYZ").
>
> Another case I looked at - try_to_unmap_one() - had very similar
> behavior. It's long, but it's not long for the wrong reasons.
>
> And yes, "copy_process()" is disgusting, and probably _could_ be split
> up a bit, but at the same time the bulk of the lines there really is
> just the "initialize all the parts of the "struct task_struct".
>
> And other times, I suspect even a 50-line function is way too dense,
> just because it's doing crazy things.
>
> So I have a really hard time with some arbitrary line limit. At the
> very least, I think it should ignore comments and whitespace lines.
>
> And yes, some real "complexity analysis" might give a much more sane
> limit, but I don't even know what that would be or how it would work.
>

It would be very easy to let sparse calculate the cyclomatic complexity
of each function (and then either print it or warn if it is too high), but:
- warning would also need a hard limit
- cyclomatic complexity of a function with a big (but simple) switch
will also be high.

I'm far from sure that cyclomatic complexity is very useful, but maybe
some variation of it (like counting a switch as a single edge) could
have some value here.
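
As a rough illustration of that variation, the sketch below compares one common counting convention for cyclomatic complexity (one plus the number of decision points, with every case label counted) against a variant where each switch counts as a single decision. It is not sparse code: struct branch_counts and both function names are hypothetical, and a real tool would derive the tallies from its own representation of the function.

```c
#include <assert.h>

/* Hypothetical per-function tallies of branching constructs. */
struct branch_counts {
	unsigned int ifs;	/* if / else-if / ?: decisions */
	unsigned int loops;	/* for / while / do-while */
	unsigned int switches;	/* number of switch statements */
	unsigned int cases;	/* non-default case labels, all switches */
};

/* Classic count: every case label is its own decision point. */
unsigned int cyclomatic(const struct branch_counts *c)
{
	assert(c);
	return 1 + c->ifs + c->loops + c->cases;
}

/* Variant: a switch is one decision, however many cases it has. */
unsigned int cyclomatic_switch_as_one(const struct branch_counts *c)
{
	assert(c);
	return 1 + c->ifs + c->loops + c->switches;
}
```

For a function with two ifs, one loop and a single 40-case switch, the classic count is 44 while the variant gives 5, which is the kind of gap that makes a big but simple switch look worse than it is.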

-- Luc Van Oostenryck

2017-12-19 00:16:19

by Randy Dunlap

Subject: Re: Naming of tag operations in the XArray

On 12/15/2017 04:34 AM, Matthew Wilcox wrote:
> On Thu, Dec 14, 2017 at 08:22:14PM -0800, Matthew Wilcox wrote:
>> On Mon, Dec 11, 2017 at 03:10:22PM -0800, Randy Dunlap wrote:
>>>> +A freshly-initialised XArray contains a ``NULL`` pointer at every index.
>>>> +Each non-``NULL`` entry in the array has three bits associated with
>>>> +it called tags. Each tag may be flipped on or off independently of
>>>> +the others. You can search for entries with a given tag set.
>>>
>>> Only tags that are set, or search for entries with some tag(s) cleared?
>>> Or is that like a mathematical set?
>>
>> hmm ...
>>
>> "Each tag may be set or cleared independently of the others. You can
>> search for entries which have a particular tag set."
>>
>> Doesn't completely remove the ambiguity, but I can't think of how to phrase
>> that better ...
>
> Thinking about this some more ...
>
> At the moment, the pieces of the API which deal with tags look like this:
>
> bool xa_tagged(const struct xarray *, xa_tag_t)
> bool xa_get_tag(struct xarray *, unsigned long index, xa_tag_t);
> void xa_set_tag(struct xarray *, unsigned long index, xa_tag_t);
> void xa_clear_tag(struct xarray *, unsigned long index, xa_tag_t);
> int xa_get_tagged(struct xarray *, void **dst, unsigned long start,
> unsigned long max, unsigned int n, xa_tag_t);
>
> bool xas_get_tag(const struct xa_state *, xa_tag_t);
> void xas_set_tag(const struct xa_state *, xa_tag_t);
> void xas_clear_tag(const struct xa_state *, xa_tag_t);
> void *xas_find_tag(struct xa_state *, unsigned long max, xa_tag_t);
> xas_for_each_tag(xas, entry, max, tag) { }
>
> (at some point there will be an xa_for_each_tag too, there just hasn't
> been a user yet).
>
> I'm always ambivalent about using the word 'get' in an API because it has
> two common meanings; (increment a refcount) and (return the state). How

Yes, I get that. But you usually wouldn't lock a tag AFAIK.

> would people feel about these names instead:

I think that the original names are mostly better, except I do like
xa_select() instead of xa_get_tagged(). But even that doesn't have
to change.

> bool xa_any_tagged(xa, tag);
> bool xa_is_tagged(xa, index, tag);
> void xa_tag(xa, index, tag);
> void xa_untag(xa, index, tag);
> int xa_select(xa, dst, start, max, n, tag);
>
> bool xas_is_tagged(xas, tag);
> void xas_tag(xas, tag);
> void xas_untag(xas, tag);
> void *xas_find_tag(xas, max, tag);
> xas_for_each_tag(xas, entry, max, tag) { }
>
> (the last two are unchanged)
>
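
For readers following the naming discussion, here is a toy userspace model of the tag semantics described in the quoted documentation: three tag bits per non-NULL entry, each set or cleared independently, plus a search for the next entry with a given tag set. Everything named toy_* is hypothetical and is not the XArray API; ignoring a tag set on a NULL entry and clearing tags when an entry is replaced are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS	16
#define TOY_MAX_TAGS	3

/* A fixed-size stand-in for an array with per-entry tag bits. */
struct toy_xa {
	void *entry[TOY_SLOTS];
	unsigned char tags[TOY_SLOTS];	/* low TOY_MAX_TAGS bits used */
};

/* Store an entry; assumption: any store resets the slot's tags. */
void toy_store(struct toy_xa *xa, size_t index, void *entry)
{
	assert(index < TOY_SLOTS);
	xa->entry[index] = entry;
	xa->tags[index] = 0;
}

/* Set one tag; assumption: a NULL entry cannot carry tags. */
void toy_set_tag(struct toy_xa *xa, size_t index, unsigned int tag)
{
	assert(index < TOY_SLOTS && tag < TOY_MAX_TAGS);
	if (xa->entry[index])
		xa->tags[index] |= 1u << tag;
}

/* Clear one tag without touching the other two. */
void toy_clear_tag(struct toy_xa *xa, size_t index, unsigned int tag)
{
	assert(index < TOY_SLOTS && tag < TOY_MAX_TAGS);
	xa->tags[index] &= ~(1u << tag);
}

/* Return the state of one tag on one entry. */
bool toy_get_tag(const struct toy_xa *xa, size_t index, unsigned int tag)
{
	assert(index < TOY_SLOTS && tag < TOY_MAX_TAGS);
	return xa->tags[index] & (1u << tag);
}

/* First index >= start whose entry has the tag set, else TOY_SLOTS. */
size_t toy_find_tag(const struct toy_xa *xa, size_t start, unsigned int tag)
{
	for (size_t i = start; i < TOY_SLOTS; i++)
		if (toy_get_tag(xa, i, tag))
			return i;
	return TOY_SLOTS;
}

/* Does any entry in the array have this tag set? */
bool toy_any_tagged(const struct toy_xa *xa, unsigned int tag)
{
	return toy_find_tag(xa, 0, tag) != TOY_SLOTS;
}
```

In this model toy_get_tag() only returns state and never takes a reference, which is the reading of 'get' that the rename to an is_tagged style of name would make explicit.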


--
~Randy

2017-12-19 00:27:45

by Randy Dunlap

Subject: Re: Storing errors in the XArray

On 12/15/2017 09:10 AM, Matthew Wilcox wrote:
> On Mon, Dec 11, 2017 at 03:10:22PM -0800, Randy Dunlap wrote:
>>> +The XArray does not support storing :c:func:`IS_ERR` pointers; some
>>> +conflict with data values and others conflict with entries the XArray
>>> +uses for its own purposes. If you need to store special values which
>>> +cannot be confused with real kernel pointers, the values 4, 8, ... 4092
>>> +are available.
>>
>> or if I know that the values are errno-range values, I can just shift them
>> left by 2 to store them and then shift them right by 2 to use them?
>
> On further thought, I like this idea so much, it's worth writing helpers
> for this usage. And test-suite (also doubles as a demonstration of how
> to use it).
>
> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index c616e9319c7c..53aa251df57a 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -232,6 +232,39 @@ static inline bool xa_is_value(const void *entry)
> return (unsigned long)entry & 1;
> }
>
> +/**
> + * xa_mk_errno() - Create an XArray entry from an error number.
> + * @error: Error number to store in XArray.
> + *
> + * Return: An entry suitable for storing in the XArray.
> + */
> +static inline void *xa_mk_errno(long error)
> +{
> + return (void *)(error << 2);
> +}
> +
> +/**
> + * xa_to_errno() - Get error number stored in an XArray entry.
> + * @entry: XArray entry.
> + *
> + * Return: The error number stored in the XArray entry.
> + */
> +static inline long xa_to_errno(const void *entry)
> +{
> + return (long)entry >> 2;
> +}
> +
> +/**
> + * xa_is_errno() - Determine if an entry is an errno.
> + * @entry: XArray entry.
> + *
> + * Return: True if the entry is an errno, false if it is a pointer.
> + */
> +static inline bool xa_is_errno(const void *entry)
> +{
> + return (((unsigned long)entry & 3) == 0) && (entry > (void *)-4096);

Some named mask bits would be ^^^ preferable there.
#define MAX_ERRNO 4095 // from err.h
&& (entry >= (void *)-MAX_ERRNO);

> +}
> +
> /**
> * xa_is_internal() - Is the entry an internal entry?
> * @entry: Entry retrieved from the XArray
> diff --git a/tools/testing/radix-tree/xarray-test.c b/tools/testing/radix-tree/xarray-test.c
> index 43111786ebdd..b843cedf3988 100644
> --- a/tools/testing/radix-tree/xarray-test.c
> +++ b/tools/testing/radix-tree/xarray-test.c
> @@ -29,7 +29,13 @@ void check_xa_err(struct xarray *xa)
> assert(xa_err(xa_store(xa, 1, xa_mk_value(0), GFP_KERNEL)) == 0);
> assert(xa_err(xa_store(xa, 1, NULL, 0)) == 0);
> // kills the test-suite :-(
> -// assert(xa_err(xa_store(xa, 0, xa_mk_internal(0), 0)) == -EINVAL);
> +// assert(xa_err(xa_store(xa, 0, xa_mk_internal(0), 0)) == -EINVAL);
> +
> + assert(xa_err(xa_store(xa, 0, xa_mk_errno(-ENOMEM), GFP_KERNEL)) == 0);
> + assert(xa_err(xa_load(xa, 0)) == 0);
> + assert(xa_is_errno(xa_load(xa, 0)) == true);
> + assert(xa_to_errno(xa_load(xa, 0)) == -ENOMEM);
> + xa_erase(xa, 0);
> }
>
> void check_xa_tag(struct xarray *xa)
>
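
A userspace sketch of the named-constant version suggested above might look like the following. The toy_* function names and the TOY_ERRNO_* macros are hypothetical and are not part of the posted patch; only MAX_ERRNO and its value of 4095 come from err.h. The sketch multiplies and divides instead of shifting so that it does not rely on shifting negative values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO		4095	/* same value as in err.h */
#define TOY_ERRNO_SHIFT		2	/* hypothetical name */
#define TOY_ERRNO_LOW_MASK	((1UL << TOY_ERRNO_SHIFT) - 1)	/* hypothetical */

/* Encode a negative errno as a pointer-sized entry with clear low bits. */
void *toy_mk_errno(long error)
{
	/* Multiply rather than left-shift a negative value. */
	return (void *)(intptr_t)(error * (1L << TOY_ERRNO_SHIFT));
}

/* Decode an entry created by toy_mk_errno(). */
long toy_to_errno(const void *entry)
{
	/* Exact division: the low bits of a valid errno entry are zero. */
	return (long)((intptr_t)entry / (1L << TOY_ERRNO_SHIFT));
}

/* True if the entry lies in the errno range and has clear low bits. */
bool toy_is_errno(const void *entry)
{
	uintptr_t v = (uintptr_t)entry;

	return (v & TOY_ERRNO_LOW_MASK) == 0 &&
	       v >= (uintptr_t)(-(intptr_t)MAX_ERRNO);
}
```

Because the encoded value must stay within the top MAX_ERRNO addresses, only error numbers from -1 down to -1023 survive the round trip in this scheme; NULL and ordinary pointers fail the range check.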

Thanks,
--
~Randy

2017-12-21 12:08:12

by Knut Omang

Subject: Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

Joe Perches <[email protected]> writes:

> On Tue, 2017-12-12 at 08:43 +1100, Dave Chinner wrote:
>> On Sat, Dec 09, 2017 at 09:00:18AM -0800, Joe Perches wrote:
>> > On Sat, 2017-12-09 at 09:36 +1100, Dave Chinner wrote:
>> > > 1. Using lockdep_set_novalidate_class() for anything other
>> > > than device->mutex will throw checkpatch warnings. Nice. (*)
>> > []
>> > > (*) checkpatch.pl is considered mostly harmful round here, too,
>> > > but that's another rant....
>> >
>> > How so?
>>
>> Short story is that it barfs all over the slightly non-standard
>> coding style used in XFS.
> []
>> This sort of stuff is just lowest-common-denominator noise - great
>> for new code and/or inexperienced developers, but not for working
>> with large bodies of existing code with slightly non-standard
>> conventions.
>
> Completely reasonable. Thanks.
>
> Do you get many checkpatch submitters for fs/xfs?
>
> If so, could probably do something about adding
> a checkpatch file flag to the directory or equivalent.
>
> Maybe add something like:
>
> fs/xfs/.checkpatch
>
> where the contents turn off most everything

I propose a more fine grained and configurable form of this in

https://lkml.org/lkml/2017/12/16/343

that also handles sparse and other checkers in a similar way.

Thanks,
Knut
