2022-06-14 18:27:54

by Brian Foster

Subject: [PATCH 0/3] proc: improve root readdir latency with many threads

Hi all,

We have a user who has reported performance problems related to
(presumably) custom task monitoring on Linux systems when running
processes with large numbers of threads. Unfortunately I don't have much
information about the practical workload and observations, only that the
problem had been narrowed down to excessive readdir latency on the /proc
root directory in the presence of large numbers of threads in the
associated pid namespace.

This latency boils down to an inefficient walk of the pid_namespace in
the proc_pid_readdir() path. More specifically, every thread/task
allocates an associated struct pid, and the procfs next_tgid()
implementation walks every pid in the namespace looking for those with
an associated PIDTYPE_TGID task to fill into the directory listing.

Given that ids are part of the idr radix-tree, it seemed fairly logical
that this could be improved using an internal tree tag. I started
playing around with an approach that tagged and untagged ids based on
actual task association (i.e., attach_pid() and friends), but after some
thought and feedback came to the realization that this could probably be
simplified to just tag the pid once on allocation and allow procfs to
use it as a hint for root dir population. This works because post-fork
tgid task disassociation (without an exit() and freeing the pid) seems
to be uncommon. The only tool I've seen in my testing so far that leaves
around a tagged, non-TGID pid is chronyd, which appears to do a fork()
-> setsid() -> fork() pattern where the intermediate task exits but the
associated pid hangs around for the lifetime of the process due to the
PIDTYPE_SID association.

Therefore, this series implements this tgid tag hinting approach. Patch
1 includes a couple tweaks to the idr tree to support traditional
radix-tree tag propagation. Patch 2 defines the new tag and sets it on
pid allocation. Patch 3 updates procfs to use the tag for the readdir
pid_namespace traversal.

As far as testing goes, I've thrown this at fstests (not for filesystem
testing purposes, but more so because I had the test env handy and it's
a longish running task creation workload), LTP, and some of the kernel
internal tests in tools/testing/selftests (clone, proc, pid_namespace)
without any obvious regressions. From the performance
angle, the user who reported this problem has provided some synthetic
tools to create dummy tasks/threads and run repeated readdir iterations
of /proc, which is what they've been using to compare results on Linux
kernels with some $other OS. These tools show a notable improvement in
terms of the number of /proc readdir iterations possible per-second. For
example, on 5.19.0-rc2 running on a mostly idle system with an active
100k thread process, readdirs-per-second improves from a baseline of
~285 to ~7.3k with the series applied. More detailed getdents() latency
numbers are included in the commit log of patch 3.

Thoughts, reviews, flames appreciated.

Brian

Brian Foster (3):
radix-tree: propagate all tags in idr tree
pid: use idr tag to hint pids associated with group leader tasks
proc: use idr tgid tag hint to iterate pids in readdir

fs/proc/base.c | 2 +-
include/linux/idr.h | 25 +++++++++++++++++++++++++
include/linux/pid.h | 2 +-
kernel/fork.c | 2 +-
kernel/pid.c | 9 ++++++++-
lib/radix-tree.c | 26 +++++++++++++++-----------
6 files changed, 51 insertions(+), 15 deletions(-)

--
2.34.1


2022-06-14 18:28:16

by Brian Foster

Subject: [PATCH 2/3] pid: use idr tag to hint pids associated with group leader tasks

Searching the pid_namespace for group leader tasks is a fairly
inefficient operation. Listing the root directory of a procfs mount
performs a linear walk of allocated pids, checking each for an
associated PIDTYPE_TGID task to determine whether to populate a
directory entry. This can cause a significant increase in readdir()
syscall latency in runtime environments that have one or more
processes with significant thread counts.

To facilitate improved TGID pid searches, define a new IDR
radix-tree tag for struct pid entries that are likely to have an
associated PIDTYPE_TGID task. To keep the code simple and avoid
having to maintain synchronization between tag state and post-fork
pid-task association changes, the tag is applied to all pids
initially allocated for tasks that are cloned without CLONE_THREAD.
The semantics of the tag are thus that false positives are possible
(i.e. tagged pids without PIDTYPE_TGID tasks), but false negatives
(i.e. untagged pids with PIDTYPE_TGID tasks) are not allowed.

For example, a userspace task that does a setsid() followed by a
fork() and exit() in the initial task will leave the initial pid
tagged (as it remains allocated for the PIDTYPE_SID association)
while the group leader task associates with the pid allocated for
the child fork. Once set, the tag persists for the lifetime of the
pid and is cleared when the pid is freed and associated entry
removed from the idr tree.

This is an effective optimization because false positives are
relatively uncommon and essentially add no overhead that doesn't
already exist (i.e. having to check pid_task(..., PIDTYPE_TGID)), but
the tag still allows filtering out large numbers of thread pids that
are guaranteed not to have a TGID task association.

Define the new IDR_TGID radix tree tag and provide a couple helpers
to set and check tag state. Set the tag in the allocation path when
the caller specifies that the pid is expected to track a group
leader. Since false negatives are not allowed, warn in the event
that a PIDTYPE_TGID task is ever attached to an untagged pid.

Signed-off-by: Brian Foster <[email protected]>
---
include/linux/idr.h | 11 +++++++++++
include/linux/pid.h | 2 +-
kernel/fork.c | 2 +-
kernel/pid.c | 9 ++++++++-
4 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/linux/idr.h b/include/linux/idr.h
index a0dce14090a9..11e0ccedfc92 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -27,6 +27,7 @@ struct idr {
* to users. Use tag 0 to track whether a node has free space below it.
*/
#define IDR_FREE 0
+#define IDR_TGID 1

/* Set the IDR flag and the IDR_FREE tag */
#define IDR_RT_MARKER (ROOT_IS_IDR | (__force gfp_t) \
@@ -174,6 +175,16 @@ static inline void idr_preload_end(void)
local_unlock(&radix_tree_preloads.lock);
}

+static inline void idr_set_group_lead(struct idr *idr, unsigned long id)
+{
+ radix_tree_tag_set(&idr->idr_rt, id, IDR_TGID);
+}
+
+static inline bool idr_is_group_lead(struct idr *idr, unsigned long id)
+{
+ return radix_tree_tag_get(&idr->idr_rt, id, IDR_TGID);
+}
+
/**
* idr_for_each_entry() - Iterate over an IDR's elements of a given type.
* @idr: IDR handle.
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 343abf22092e..31f3cf765cee 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -134,7 +134,7 @@ extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);

extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
- size_t set_tid_size);
+ size_t set_tid_size, bool group_leader);
extern void free_pid(struct pid *pid);
extern void disable_pid_allocation(struct pid_namespace *ns);

diff --git a/kernel/fork.c b/kernel/fork.c
index 9d44f2d46c69..3c52f45ec93e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2254,7 +2254,7 @@ static __latent_entropy struct task_struct *copy_process(

if (pid != &init_struct_pid) {
pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
- args->set_tid_size);
+ args->set_tid_size, !(clone_flags & CLONE_THREAD));
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..5a745c35475e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,7 @@ void free_pid(struct pid *pid)
}

struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
- size_t set_tid_size)
+ size_t set_tid_size, bool group_leader)
{
struct pid *pid;
enum pid_type type;
@@ -272,6 +272,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
for ( ; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
idr_replace(&upid->ns->idr, pid, upid->nr);
+ if (group_leader)
+ idr_set_group_lead(&upid->ns->idr, upid->nr);
upid->ns->pid_allocated++;
}
spin_unlock_irq(&pidmap_lock);
@@ -331,6 +333,11 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
void attach_pid(struct task_struct *task, enum pid_type type)
{
struct pid *pid = *task_pid_ptr(task, type);
+ struct pid_namespace *pid_ns = ns_of_pid(pid);
+ pid_t pid_nr = pid_nr_ns(pid, pid_ns);
+
+ WARN_ON(type == PIDTYPE_TGID &&
+ !idr_is_group_lead(&pid_ns->idr, pid_nr));
hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
}

--
2.34.1

2022-06-14 19:14:19

by Brian Foster

Subject: [PATCH 1/3] radix-tree: propagate all tags in idr tree

The IDR tree has hardcoded tag propagation logic to handle the
internal IDR_FREE tag and ignore all others. Fix up the hardcoded
logic to support additional tags.

This is specifically to support a new internal IDR_TGID radix tree
tag used to improve search efficiency of pids with associated
PIDTYPE_TGID tasks within a pid namespace.

Signed-off-by: Brian Foster <[email protected]>
---
lib/radix-tree.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b3afafe46fff..08eef33e7820 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -431,12 +431,14 @@ static int radix_tree_extend(struct radix_tree_root *root, gfp_t gfp,
tag_clear(node, IDR_FREE, 0);
root_tag_set(root, IDR_FREE);
}
- } else {
- /* Propagate the aggregated tag info to the new child */
- for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
- if (root_tag_get(root, tag))
- tag_set(node, tag, 0);
- }
+ }
+
+ /* Propagate the aggregated tag info to the new child */
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+ if (is_idr(root) && tag == IDR_FREE)
+ continue;
+ if (root_tag_get(root, tag))
+ tag_set(node, tag, 0);
}

BUG_ON(shift > BITS_PER_LONG);
@@ -1368,11 +1370,13 @@ static bool __radix_tree_delete(struct radix_tree_root *root,
unsigned offset = get_slot_offset(node, slot);
int tag;

- if (is_idr(root))
- node_tag_set(root, node, IDR_FREE, offset);
- else
- for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
- node_tag_clear(root, node, tag, offset);
+ for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+ if (is_idr(root) && tag == IDR_FREE) {
+ node_tag_set(root, node, tag, offset);
+ continue;
+ }
+ node_tag_clear(root, node, tag, offset);
+ }

replace_slot(slot, NULL, node, -1, values);
return node && delete_node(root, node);
--
2.34.1

2022-06-15 11:26:19

by Christoph Hellwig

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> The IDR tree has hardcoded tag propagation logic to handle the
> internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> logic to support additional tags.
>
> This is specifically to support a new internal IDR_TGID radix tree
> tag used to improve search efficiency of pids with associated
> PIDTYPE_TGID tasks within a pid namespace.

Wouldn't it make sense to switch over to an xarray here rather
than adding new features to the radix tree?

2022-06-15 13:00:10

by Brian Foster

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Wed, Jun 15, 2022 at 04:12:14AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> > The IDR tree has hardcoded tag propagation logic to handle the
> > internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> > logic to support additional tags.
> >
> > This is specifically to support a new internal IDR_TGID radix tree
> > tag used to improve search efficiency of pids with associated
> > PIDTYPE_TGID tasks within a pid namespace.
>
> Wouldn't it make sense to switch over to an xarray here rather
> then adding new features to the radix tree?
>

The xarray question crossed my mind when I first waded into this code
and realized the idr tree seems to be some sort of offshoot or custom
mode of the core radix tree. I eventually realized that the problem with
normal radix tree tags in the idr variant was that its tag propagation
logic simply didn't care to handle traditional tags, presumably because
they were unused in that mode. So this patch doesn't really add a
feature to the radix-tree, it just fixes up some of the grotty idr tree
logic to handle both forms of tags.

I assume it makes sense for this to move towards xarray in general, but
I don't have enough context on either side to know what the sticking
points are. In particular, does xarray support something analogous to
IDR_FREE or otherwise solve whatever problem idr currently depends on it
for (i.e. efficient id allocation)? I think Willy has done work in this
area so I'm hoping he can chime in on some of that if he's put any
thought into the idr thing specifically..

WRT this series, I really don't think it makes much sense to put a
broad rework of the idr code in front of what is otherwise an
incremental performance improvement based on using a mechanism that
radix-tree pretty much already supports (i.e. tags). Is there some
fundamental problem you see with this patch that apparently depends on
xarray for some reason, or are you just calling it out as apparent
technical debt in this area of code? If the latter, then I think it
makes more sense to consider that as a followup effort.

Brian

2022-06-15 14:18:50

by Ian Kent

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree


On 15/6/22 19:12, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
>> The IDR tree has hardcoded tag propagation logic to handle the
>> internal IDR_FREE tag and ignore all others. Fix up the hardcoded
>> logic to support additional tags.
>>
>> This is specifically to support a new internal IDR_TGID radix tree
>> tag used to improve search efficiency of pids with associated
>> PIDTYPE_TGID tasks within a pid namespace.
> Wouldn't it make sense to switch over to an xarray here rather
> then adding new features to the radix tree?
>
Might be a dumb question but ...

How would the, essentially sparse, PIDTYPE_TGID pid traversal benefit
from an xarray?

From what I have seen the searching isn't the real problem, it's the
traversal (that, at the moment, does a search 'and' a traversal over
a lot of pids to get a relatively small number of them).

Ian

2022-06-15 14:21:16

by Matthew Wilcox

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Wed, Jun 15, 2022 at 08:57:56AM -0400, Brian Foster wrote:
> On Wed, Jun 15, 2022 at 04:12:14AM -0700, Christoph Hellwig wrote:
> > On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> > > The IDR tree has hardcoded tag propagation logic to handle the
> > > internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> > > logic to support additional tags.
> > >
> > > This is specifically to support a new internal IDR_TGID radix tree
> > > tag used to improve search efficiency of pids with associated
> > > PIDTYPE_TGID tasks within a pid namespace.
> >
> > Wouldn't it make sense to switch over to an xarray here rather
> > then adding new features to the radix tree?
> >
>
> The xarray question crossed my mind when I first waded into this code
> and realized the idr tree seems to be some sort of offshoot or custom
> mode of the core radix tree. I eventually realized that the problem wrt
> to normal radix tree tags in the idr variant was that the tag
> propagation logic in the idr variant simply didn't care to handle
> traditional tags, presumably because they were unused in that mode. So
> this patch doesn't really add a feature to the radix-tree, it just fixes
> up some of the grotty idr tree logic to handle both forms of tags.
>
> I assume it makes sense for this to move towards xarray in general, but
> I don't have enough context on either side to know what the sticking
> points are. In particular, does xarray support something analogous to
> IDR_FREE or otherwise solve whatever problem idr currently depends on it
> for (i.e. efficient id allocation)? I think Willy has done work in this
> area so I'm hoping he can chime in on some of that if he's put any
> thought into the idr thing specifically..

Without going into the history of the idr/radix-tree/xarray, the
current hope is that we'll move all users of the idr & radix tree
over to the xarray API. It's fundamentally the same data structure
for all three now, just a question of the API change.

The XArray does indeed have a way to solve the IDR_FREE problem;
you need to declare an allocating XArray:
https://www.kernel.org/doc/html/latest/core-api/xarray.html#allocating-xarrays

and using XA_MARK_1 and XA_MARK_2 should work the way you want them to.

2022-06-15 15:13:28

by Brian Foster

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Wed, Jun 15, 2022 at 02:40:43PM +0100, Matthew Wilcox wrote:
> On Wed, Jun 15, 2022 at 08:57:56AM -0400, Brian Foster wrote:
> > On Wed, Jun 15, 2022 at 04:12:14AM -0700, Christoph Hellwig wrote:
> > > On Tue, Jun 14, 2022 at 02:09:47PM -0400, Brian Foster wrote:
> > > > The IDR tree has hardcoded tag propagation logic to handle the
> > > > internal IDR_FREE tag and ignore all others. Fix up the hardcoded
> > > > logic to support additional tags.
> > > >
> > > > This is specifically to support a new internal IDR_TGID radix tree
> > > > tag used to improve search efficiency of pids with associated
> > > > PIDTYPE_TGID tasks within a pid namespace.
> > >
> > > Wouldn't it make sense to switch over to an xarray here rather
> > > then adding new features to the radix tree?
> > >
> >
> > The xarray question crossed my mind when I first waded into this code
> > and realized the idr tree seems to be some sort of offshoot or custom
> > mode of the core radix tree. I eventually realized that the problem wrt
> > to normal radix tree tags in the idr variant was that the tag
> > propagation logic in the idr variant simply didn't care to handle
> > traditional tags, presumably because they were unused in that mode. So
> > this patch doesn't really add a feature to the radix-tree, it just fixes
> > up some of the grotty idr tree logic to handle both forms of tags.
> >
> > I assume it makes sense for this to move towards xarray in general, but
> > I don't have enough context on either side to know what the sticking
> > points are. In particular, does xarray support something analogous to
> > IDR_FREE or otherwise solve whatever problem idr currently depends on it
> > for (i.e. efficient id allocation)? I think Willy has done work in this
> > area so I'm hoping he can chime in on some of that if he's put any
> > thought into the idr thing specifically..
>
> Without going into the history of the idr/radix-tree/xarray, the
> current hope is that we'll move all users of the idr & radix tree
> over to the xarray API. It's fundamentally the same data structure
> for all three now, just a question of the API change.
>
> The XArray does indeed have a way to solve the IDR_FREE problem;
> you need to declare an allocating XArray:
> https://www.kernel.org/doc/html/latest/core-api/xarray.html#allocating-xarrays
>
> and using XA_MARK_1 and XA_MARK_2 should work the way you want them to.
>

Interesting, thanks. I'll have to dig more into this to grok the current
state of the radix-tree interface vs. the underlying data structure. If
I follow correctly, you're saying the radix-tree api is essentially
already a translation layer to the xarray these days, and we just need
to move legacy users off the radix-tree api so we can eventually kill it
off...

Brian

2022-06-15 16:37:09

by Matthew Wilcox

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> Interesting, thanks. I'll have to dig more into this to grok the current
> state of the radix-tree interface vs. the underlying data structure. If
> I follow correctly, you're saying the radix-tree api is essentially
> already a translation layer to the xarray these days, and we just need
> to move legacy users off the radix-tree api so we can eventually kill it
> off...

If only it were that easy ... the XArray has a whole bunch of debugging
asserts to make sure the users are actually using it correctly, and a
lot of radix tree users don't (they're probably not buggy, but they
don't use the XArray's embedded lock).

Anyway, here's a first cut at converting the PID allocator from the IDR
to the XArray API. It boots, but I haven't tried to do anything tricky
with PID namespaces or CRIU.


diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index f32878d9a39f..cec85a07184a 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -21,7 +21,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
nr_running(), nr_threads,
- idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
+ task_active_pid_ns(current)->cursor - 1);
return 0;
}

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 07481bb87d4e..68aaaf78491b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,7 +9,7 @@
#include <linux/threads.h>
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>

/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
#define MAX_PID_NS_LEVEL 32
@@ -17,8 +17,9 @@
struct fs_pin;

struct pid_namespace {
- struct idr idr;
+ struct xarray xa;
struct rcu_head rcu;
+ unsigned int cursor;
unsigned int pid_allocated;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
@@ -37,6 +38,20 @@ extern struct pid_namespace init_pid_ns;

#define PIDNS_ADDING (1U << 31)

+/*
+ * Note: disable interrupts while the xarray lock is held as an
+ * interrupt might come in and do read_lock(&tasklist_lock).
+ *
+ * If we don't disable interrupts there is a nasty deadlock between
+ * detach_pid()->free_pid() and another cpu that locks the PIDs
+ * followed by an interrupt routine that does read_lock(&tasklist_lock);
+ *
+ * After we clean up the tasklist_lock and know there are no
+ * irq handlers that take it we can leave the interrupts enabled.
+ * For now it is easier to be safe than to prove it can't happen.
+ */
+#define PID_XA_FLAGS (XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
+
#ifdef CONFIG_PID_NS
static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
{
@@ -84,7 +99,7 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)

extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
void pidhash_init(void);
-void pid_idr_init(void);
+void pid_init(void);

static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
{
diff --git a/include/linux/threads.h b/include/linux/threads.h
index c34173e6c5f1..37e4391ee89f 100644
--- a/include/linux/threads.h
+++ b/include/linux/threads.h
@@ -38,7 +38,7 @@
* Define a minimum number of pids per cpu. Heuristically based
* on original pid max of 32k for 32 cpus. Also, increase the
* minimum settable value for pid_max on the running system based
- * on similar defaults. See kernel/pid.c:pid_idr_init() for details.
+ * on similar defaults. See kernel/pid.c:pid_init() for details.
*/
#define PIDS_PER_CPU_DEFAULT 1024
#define PIDS_PER_CPU_MIN 8
diff --git a/init/main.c b/init/main.c
index 0ee39cdcfcac..3944dcd10c09 100644
--- a/init/main.c
+++ b/init/main.c
@@ -73,7 +73,6 @@
#include <linux/sched.h>
#include <linux/sched/init.h>
#include <linux/signal.h>
-#include <linux/idr.h>
#include <linux/kgdb.h>
#include <linux/ftrace.h>
#include <linux/async.h>
@@ -1100,7 +1099,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
late_time_init();
sched_clock_init();
calibrate_delay();
- pid_idr_init();
+ pid_init();
anon_vma_init();
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..de0b4614bdb8 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -41,7 +41,7 @@
#include <linux/anon_inodes.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>
#include <net/sock.h>
#include <uapi/linux/pidfd.h>

@@ -66,15 +66,10 @@ int pid_max = PID_MAX_DEFAULT;
int pid_max_min = RESERVED_PIDS + 1;
int pid_max_max = PID_MAX_LIMIT;

-/*
- * PID-map pages start out as NULL, they get allocated upon
- * first use and are never deallocated. This way a low pid_max
- * value does not cause lots of bitmaps to be allocated, but
- * the scheme scales to up to 4 million PIDs, runtime.
- */
struct pid_namespace init_pid_ns = {
.ns.count = REFCOUNT_INIT(2),
- .idr = IDR_INIT(init_pid_ns.idr),
+ .xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
+ .cursor = 1,
.pid_allocated = PIDNS_ADDING,
.level = 0,
.child_reaper = &init_task,
@@ -86,22 +81,6 @@ struct pid_namespace init_pid_ns = {
};
EXPORT_SYMBOL_GPL(init_pid_ns);

-/*
- * Note: disable interrupts while the pidmap_lock is held as an
- * interrupt might come in and do read_lock(&tasklist_lock).
- *
- * If we don't disable interrupts there is a nasty deadlock between
- * detach_pid()->free_pid() and another cpu that does
- * spin_lock(&pidmap_lock) followed by an interrupt routine that does
- * read_lock(&tasklist_lock);
- *
- * After we clean up the tasklist_lock and know there are no
- * irq handlers that take it we can leave the interrupts enabled.
- * For now it is easier to be safe than to prove it can't happen.
- */
-
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
-
void put_pid(struct pid *pid)
{
struct pid_namespace *ns;
@@ -129,10 +108,11 @@ void free_pid(struct pid *pid)
int i;
unsigned long flags;

- spin_lock_irqsave(&pidmap_lock, flags);
for (i = 0; i <= pid->level; i++) {
struct upid *upid = pid->numbers + i;
struct pid_namespace *ns = upid->ns;
+
+ xa_lock_irqsave(&ns->xa, flags);
switch (--ns->pid_allocated) {
case 2:
case 1:
@@ -149,9 +129,9 @@ void free_pid(struct pid *pid)
break;
}

- idr_remove(&ns->idr, upid->nr);
+ __xa_erase(&ns->xa, upid->nr);
+ xa_unlock_irqrestore(&ns->xa, flags);
}
- spin_unlock_irqrestore(&pidmap_lock, flags);

call_rcu(&pid->rcu, delayed_put_pid);
}
@@ -161,7 +141,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
{
struct pid *pid;
enum pid_type type;
- int i, nr;
+ int i;
struct pid_namespace *tmp;
struct upid *upid;
int retval = -ENOMEM;
@@ -205,43 +185,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
set_tid_size--;
}

- idr_preload(GFP_KERNEL);
- spin_lock_irq(&pidmap_lock);
-
if (tid) {
- nr = idr_alloc(&tmp->idr, NULL, tid,
- tid + 1, GFP_ATOMIC);
+ retval = xa_insert_irq(&tmp->xa, tid, NULL, GFP_KERNEL);
+
/*
- * If ENOSPC is returned it means that the PID is
+ * If EBUSY is returned it means that the PID is
* already in use. Return EEXIST in that case.
*/
- if (nr == -ENOSPC)
- nr = -EEXIST;
+ if (retval == -EBUSY)
+ retval = -EEXIST;
} else {
int pid_min = 1;
+
+ xa_lock_irq(&tmp->xa);
/*
* init really needs pid 1, but after reaching the
* maximum wrap back to RESERVED_PIDS
*/
- if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+ if (tmp->cursor > RESERVED_PIDS)
pid_min = RESERVED_PIDS;

/*
* Store a null pointer so find_pid_ns does not find
* a partially initialized PID (see below).
*/
- nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
+ retval = __xa_alloc_cyclic(&tmp->xa, &tid, NULL,
+ XA_LIMIT(pid_min, pid_max),
+ &tmp->cursor, GFP_KERNEL);
+ xa_unlock_irq(&tmp->xa);
+ if (retval == -EBUSY)
+ retval = -EAGAIN;
}
- spin_unlock_irq(&pidmap_lock);
- idr_preload_end();

- if (nr < 0) {
- retval = (nr == -ENOSPC) ? -EAGAIN : nr;
+ if (retval < 0)
goto out_free;
- }

- pid->numbers[i].nr = nr;
+ pid->numbers[i].nr = tid;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
}
@@ -266,34 +245,35 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
INIT_HLIST_HEAD(&pid->inodes);

upid = pid->numbers + ns->level;
- spin_lock_irq(&pidmap_lock);
+ xa_lock_irq(&ns->xa);
if (!(ns->pid_allocated & PIDNS_ADDING))
goto out_unlock;
for ( ; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
- idr_replace(&upid->ns->idr, pid, upid->nr);
+ if (upid->ns != ns)
+ xa_lock_irq(&upid->ns->xa);
+ __xa_store(&upid->ns->xa, upid->nr, pid, 0);
upid->ns->pid_allocated++;
+ xa_unlock_irq(&upid->ns->xa);
}
- spin_unlock_irq(&pidmap_lock);

return pid;

out_unlock:
- spin_unlock_irq(&pidmap_lock);
+ xa_unlock_irq(&ns->xa);
put_pid_ns(ns);

out_free:
- spin_lock_irq(&pidmap_lock);
while (++i <= ns->level) {
upid = pid->numbers + i;
- idr_remove(&upid->ns->idr, upid->nr);
+ xa_erase_irq(&upid->ns->xa, upid->nr);
}

+ xa_lock_irq(&ns->xa);
/* On failure to allocate the first pid, reset the state */
if (ns->pid_allocated == PIDNS_ADDING)
- idr_set_cursor(&ns->idr, 0);
-
- spin_unlock_irq(&pidmap_lock);
+ ns->cursor = 0;
+ xa_unlock_irq(&ns->xa);

kmem_cache_free(ns->pid_cachep, pid);
return ERR_PTR(retval);
@@ -301,14 +281,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,

void disable_pid_allocation(struct pid_namespace *ns)
{
- spin_lock_irq(&pidmap_lock);
+ xa_lock_irq(&ns->xa);
ns->pid_allocated &= ~PIDNS_ADDING;
- spin_unlock_irq(&pidmap_lock);
+ xa_unlock_irq(&ns->xa);
}

struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
- return idr_find(&ns->idr, nr);
+ return xa_load(&ns->xa, nr);
}
EXPORT_SYMBOL_GPL(find_pid_ns);

@@ -517,7 +497,9 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
*/
struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
{
- return idr_get_next(&ns->idr, &nr);
+ unsigned long index = nr;
+
+ return xa_find(&ns->xa, &index, ULONG_MAX, XA_PRESENT);
}

struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
@@ -646,7 +628,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
return fd;
}

-void __init pid_idr_init(void)
+void __init pid_init(void)
{
/* Verify no one has done anything silly: */
BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
@@ -658,8 +640,6 @@ void __init pid_idr_init(void)
PIDS_PER_CPU_MIN * num_possible_cpus());
pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);

- idr_init(&init_pid_ns.idr);
-
init_pid_ns.pid_cachep = KMEM_CACHE(pid,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index f4f8cb0435b4..947e25fb8546 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -22,7 +22,7 @@
#include <linux/export.h>
#include <linux/sched/task.h>
#include <linux/sched/signal.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>

static DEFINE_MUTEX(pid_caches_mutex);
static struct kmem_cache *pid_ns_cachep;
@@ -92,15 +92,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
if (ns == NULL)
goto out_dec;

- idr_init(&ns->idr);
+ xa_init_flags(&ns->xa, PID_XA_FLAGS);

ns->pid_cachep = create_pid_cachep(level);
if (ns->pid_cachep == NULL)
- goto out_free_idr;
+ goto out_free_xa;

err = ns_alloc_inum(&ns->ns);
if (err)
- goto out_free_idr;
+ goto out_free_xa;
ns->ns.ops = &pidns_operations;

refcount_set(&ns->ns.count, 1);
@@ -112,8 +112,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns

return ns;

-out_free_idr:
- idr_destroy(&ns->idr);
+out_free_xa:
+ xa_destroy(&ns->xa);
kmem_cache_free(pid_ns_cachep, ns);
out_dec:
dec_pid_namespaces(ucounts);
@@ -135,7 +135,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
{
ns_free_inum(&ns->ns);

- idr_destroy(&ns->idr);
+ xa_destroy(&ns->xa);
call_rcu(&ns->rcu, delayed_free_pidns);
}

@@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(put_pid_ns);

void zap_pid_ns_processes(struct pid_namespace *pid_ns)
{
- int nr;
+ unsigned long nr;
int rc;
struct task_struct *task, *me = current;
int init_pids = thread_group_leader(me) ? 1 : 2;
@@ -198,8 +198,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
*/
rcu_read_lock();
read_lock(&tasklist_lock);
- nr = 2;
- idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
+ xa_for_each_range(&pid_ns->xa, nr, pid, 2, ULONG_MAX) {
task = pid_task(pid, PIDTYPE_PID);
if (task && !__fatal_signal_pending(task))
group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
@@ -272,12 +271,12 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
* it should synchronize its usage with external means.
*/

- next = idr_get_cursor(&pid_ns->idr) - 1;
+ next = pid_ns->cursor - 1;

tmp.data = &next;
ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
if (!ret && write)
- idr_set_cursor(&pid_ns->idr, next + 1);
+ pid_ns->cursor = next + 1;

return ret;
}

2022-06-28 13:12:24

by Christian Brauner

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > Interesting, thanks. I'll have to dig more into this to grok the current
> > state of the radix-tree interface vs. the underlying data structure. If
> > I follow correctly, you're saying the radix-tree api is essentially
> > already a translation layer to the xarray these days, and we just need
> > to move legacy users off the radix-tree api so we can eventually kill it
> > off...
>
> If only it were that easy ... the XArray has a whole bunch of debugging
> asserts to make sure the users are actually using it correctly, and a
> lot of radix tree users don't (they're probably not buggy, but they
> don't use the XArray's embedded lock).
>
> Anyway, here's a first cut at converting the PID allocator from the IDR
> to the XArray API. It boots, but I haven't tried to do anything tricky
> with PID namespaces or CRIU.

It'd be great to see that conversion done.
Fwiw, there's test cases for e.g. nested pid namespace creation with
specifically requested PIDs in

tools/testing/selftests/clone3

and there's additional tests in

tools/testing/selftests/pidfd

>
>
> diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
> index f32878d9a39f..cec85a07184a 100644
> --- a/fs/proc/loadavg.c
> +++ b/fs/proc/loadavg.c
> @@ -21,7 +21,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
> LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
> LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
> nr_running(), nr_threads,
> - idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> + task_active_pid_ns(current)->cursor - 1);
> return 0;
> }
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 07481bb87d4e..68aaaf78491b 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -9,7 +9,7 @@
> #include <linux/threads.h>
> #include <linux/nsproxy.h>
> #include <linux/ns_common.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
>
> /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
> #define MAX_PID_NS_LEVEL 32
> @@ -17,8 +17,9 @@
> struct fs_pin;
>
> struct pid_namespace {
> - struct idr idr;
> + struct xarray xa;
> struct rcu_head rcu;
> + unsigned int cursor;
> unsigned int pid_allocated;
> struct task_struct *child_reaper;
> struct kmem_cache *pid_cachep;
> @@ -37,6 +38,20 @@ extern struct pid_namespace init_pid_ns;
>
> #define PIDNS_ADDING (1U << 31)
>
> +/*
> + * Note: disable interrupts while the xarray lock is held as an
> + * interrupt might come in and do read_lock(&tasklist_lock).
> + *
> + * If we don't disable interrupts there is a nasty deadlock between
> + * detach_pid()->free_pid() and another cpu that locks the PIDs
> + * followed by an interrupt routine that does read_lock(&tasklist_lock);
> + *
> + * After we clean up the tasklist_lock and know there are no
> + * irq handlers that take it we can leave the interrupts enabled.
> + * For now it is easier to be safe than to prove it can't happen.
> + */
> +#define PID_XA_FLAGS (XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
> +
> #ifdef CONFIG_PID_NS
> static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
> {
> @@ -84,7 +99,7 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>
> extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> void pidhash_init(void);
> -void pid_idr_init(void);
> +void pid_init(void);
>
> static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
> {
> diff --git a/include/linux/threads.h b/include/linux/threads.h
> index c34173e6c5f1..37e4391ee89f 100644
> --- a/include/linux/threads.h
> +++ b/include/linux/threads.h
> @@ -38,7 +38,7 @@
> * Define a minimum number of pids per cpu. Heuristically based
> * on original pid max of 32k for 32 cpus. Also, increase the
> * minimum settable value for pid_max on the running system based
> - * on similar defaults. See kernel/pid.c:pid_idr_init() for details.
> + * on similar defaults. See kernel/pid.c:pid_init() for details.
> */
> #define PIDS_PER_CPU_DEFAULT 1024
> #define PIDS_PER_CPU_MIN 8
> diff --git a/init/main.c b/init/main.c
> index 0ee39cdcfcac..3944dcd10c09 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -73,7 +73,6 @@
> #include <linux/sched.h>
> #include <linux/sched/init.h>
> #include <linux/signal.h>
> -#include <linux/idr.h>
> #include <linux/kgdb.h>
> #include <linux/ftrace.h>
> #include <linux/async.h>
> @@ -1100,7 +1099,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> late_time_init();
> sched_clock_init();
> calibrate_delay();
> - pid_idr_init();
> + pid_init();
> anon_vma_init();
> #ifdef CONFIG_X86
> if (efi_enabled(EFI_RUNTIME_SERVICES))
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 2fc0a16ec77b..de0b4614bdb8 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -41,7 +41,7 @@
> #include <linux/anon_inodes.h>
> #include <linux/sched/signal.h>
> #include <linux/sched/task.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
> #include <net/sock.h>
> #include <uapi/linux/pidfd.h>
>
> @@ -66,15 +66,10 @@ int pid_max = PID_MAX_DEFAULT;
> int pid_max_min = RESERVED_PIDS + 1;
> int pid_max_max = PID_MAX_LIMIT;
>
> -/*
> - * PID-map pages start out as NULL, they get allocated upon
> - * first use and are never deallocated. This way a low pid_max
> - * value does not cause lots of bitmaps to be allocated, but
> - * the scheme scales to up to 4 million PIDs, runtime.
> - */
> struct pid_namespace init_pid_ns = {
> .ns.count = REFCOUNT_INIT(2),
> - .idr = IDR_INIT(init_pid_ns.idr),
> + .xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
> + .cursor = 1,
> .pid_allocated = PIDNS_ADDING,
> .level = 0,
> .child_reaper = &init_task,
> @@ -86,22 +81,6 @@ struct pid_namespace init_pid_ns = {
> };
> EXPORT_SYMBOL_GPL(init_pid_ns);
>
> -/*
> - * Note: disable interrupts while the pidmap_lock is held as an
> - * interrupt might come in and do read_lock(&tasklist_lock).
> - *
> - * If we don't disable interrupts there is a nasty deadlock between
> - * detach_pid()->free_pid() and another cpu that does
> - * spin_lock(&pidmap_lock) followed by an interrupt routine that does
> - * read_lock(&tasklist_lock);
> - *
> - * After we clean up the tasklist_lock and know there are no
> - * irq handlers that take it we can leave the interrupts enabled.
> - * For now it is easier to be safe than to prove it can't happen.
> - */
> -
> -static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
> -
> void put_pid(struct pid *pid)
> {
> struct pid_namespace *ns;
> @@ -129,10 +108,11 @@ void free_pid(struct pid *pid)
> int i;
> unsigned long flags;
>
> - spin_lock_irqsave(&pidmap_lock, flags);
> for (i = 0; i <= pid->level; i++) {
> struct upid *upid = pid->numbers + i;
> struct pid_namespace *ns = upid->ns;
> +
> + xa_lock_irqsave(&ns->xa, flags);
> switch (--ns->pid_allocated) {
> case 2:
> case 1:
> @@ -149,9 +129,9 @@ void free_pid(struct pid *pid)
> break;
> }
>
> - idr_remove(&ns->idr, upid->nr);
> + __xa_erase(&ns->xa, upid->nr);
> + xa_unlock_irqrestore(&ns->xa, flags);
> }
> - spin_unlock_irqrestore(&pidmap_lock, flags);
>
> call_rcu(&pid->rcu, delayed_put_pid);
> }
> @@ -161,7 +141,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> {
> struct pid *pid;
> enum pid_type type;
> - int i, nr;
> + int i;
> struct pid_namespace *tmp;
> struct upid *upid;
> int retval = -ENOMEM;
> @@ -205,43 +185,42 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> set_tid_size--;
> }
>
> - idr_preload(GFP_KERNEL);
> - spin_lock_irq(&pidmap_lock);
> -
> if (tid) {
> - nr = idr_alloc(&tmp->idr, NULL, tid,
> - tid + 1, GFP_ATOMIC);
> + retval = xa_insert_irq(&tmp->xa, tid, NULL, GFP_KERNEL);
> +
> /*
> - * If ENOSPC is returned it means that the PID is
> + * If EBUSY is returned it means that the PID is
> * already in use. Return EEXIST in that case.
> */
> - if (nr == -ENOSPC)
> - nr = -EEXIST;
> + if (retval == -EBUSY)
> + retval = -EEXIST;
> } else {
> int pid_min = 1;
> +
> + xa_lock_irq(&tmp->xa);
> /*
> * init really needs pid 1, but after reaching the
> * maximum wrap back to RESERVED_PIDS
> */
> - if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> + if (tmp->cursor > RESERVED_PIDS)
> pid_min = RESERVED_PIDS;
>
> /*
> * Store a null pointer so find_pid_ns does not find
> * a partially initialized PID (see below).
> */
> - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> - pid_max, GFP_ATOMIC);
> + retval = __xa_alloc_cyclic(&tmp->xa, &tid, NULL,
> + XA_LIMIT(pid_min, pid_max),
> + &tmp->cursor, GFP_KERNEL);
> + xa_unlock_irq(&tmp->xa);
> + if (retval == -EBUSY)
> + retval = -EAGAIN;
> }
> - spin_unlock_irq(&pidmap_lock);
> - idr_preload_end();
>
> - if (nr < 0) {
> - retval = (nr == -ENOSPC) ? -EAGAIN : nr;
> + if (retval < 0)
> goto out_free;
> - }
>
> - pid->numbers[i].nr = nr;
> + pid->numbers[i].nr = tid;
> pid->numbers[i].ns = tmp;
> tmp = tmp->parent;
> }
> @@ -266,34 +245,35 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> INIT_HLIST_HEAD(&pid->inodes);
>
> upid = pid->numbers + ns->level;
> - spin_lock_irq(&pidmap_lock);
> + xa_lock_irq(&ns->xa);
> if (!(ns->pid_allocated & PIDNS_ADDING))
> goto out_unlock;
> for ( ; upid >= pid->numbers; --upid) {
> /* Make the PID visible to find_pid_ns. */
> - idr_replace(&upid->ns->idr, pid, upid->nr);
> + if (upid->ns != ns)
> + xa_lock_irq(&upid->ns->xa);
> + __xa_store(&upid->ns->xa, upid->nr, pid, 0);
> upid->ns->pid_allocated++;
> + xa_unlock_irq(&upid->ns->xa);
> }
> - spin_unlock_irq(&pidmap_lock);
>
> return pid;
>
> out_unlock:
> - spin_unlock_irq(&pidmap_lock);
> + xa_unlock_irq(&ns->xa);
> put_pid_ns(ns);
>
> out_free:
> - spin_lock_irq(&pidmap_lock);
> while (++i <= ns->level) {
> upid = pid->numbers + i;
> - idr_remove(&upid->ns->idr, upid->nr);
> + xa_erase_irq(&upid->ns->xa, upid->nr);
> }
>
> + xa_lock_irq(&ns->xa);
> /* On failure to allocate the first pid, reset the state */
> if (ns->pid_allocated == PIDNS_ADDING)
> - idr_set_cursor(&ns->idr, 0);
> -
> - spin_unlock_irq(&pidmap_lock);
> + ns->cursor = 0;
> + xa_unlock_irq(&ns->xa);
>
> kmem_cache_free(ns->pid_cachep, pid);
> return ERR_PTR(retval);
> @@ -301,14 +281,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
> void disable_pid_allocation(struct pid_namespace *ns)
> {
> - spin_lock_irq(&pidmap_lock);
> + xa_lock_irq(&ns->xa);
> ns->pid_allocated &= ~PIDNS_ADDING;
> - spin_unlock_irq(&pidmap_lock);
> + xa_unlock_irq(&ns->xa);
> }
>
> struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
> {
> - return idr_find(&ns->idr, nr);
> + return xa_load(&ns->xa, nr);
> }
> EXPORT_SYMBOL_GPL(find_pid_ns);
>
> @@ -517,7 +497,9 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
> */
> struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> {
> - return idr_get_next(&ns->idr, &nr);
> + unsigned long index = nr;
> +
> + return xa_find(&ns->xa, &index, ULONG_MAX, XA_PRESENT);
> }
>
> struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
> @@ -646,7 +628,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> return fd;
> }
>
> -void __init pid_idr_init(void)
> +void __init pid_init(void)
> {
> /* Verify no one has done anything silly: */
> BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
> @@ -658,8 +640,6 @@ void __init pid_idr_init(void)
> PIDS_PER_CPU_MIN * num_possible_cpus());
> pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
>
> - idr_init(&init_pid_ns.idr);
> -
> init_pid_ns.pid_cachep = KMEM_CACHE(pid,
> SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
> }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index f4f8cb0435b4..947e25fb8546 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -22,7 +22,7 @@
> #include <linux/export.h>
> #include <linux/sched/task.h>
> #include <linux/sched/signal.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
>
> static DEFINE_MUTEX(pid_caches_mutex);
> static struct kmem_cache *pid_ns_cachep;
> @@ -92,15 +92,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> if (ns == NULL)
> goto out_dec;
>
> - idr_init(&ns->idr);
> + xa_init_flags(&ns->xa, PID_XA_FLAGS);
>
> ns->pid_cachep = create_pid_cachep(level);
> if (ns->pid_cachep == NULL)
> - goto out_free_idr;
> + goto out_free_xa;
>
> err = ns_alloc_inum(&ns->ns);
> if (err)
> - goto out_free_idr;
> + goto out_free_xa;
> ns->ns.ops = &pidns_operations;
>
> refcount_set(&ns->ns.count, 1);
> @@ -112,8 +112,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>
> return ns;
>
> -out_free_idr:
> - idr_destroy(&ns->idr);
> +out_free_xa:
> + xa_destroy(&ns->xa);
> kmem_cache_free(pid_ns_cachep, ns);
> out_dec:
> dec_pid_namespaces(ucounts);
> @@ -135,7 +135,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
> {
> ns_free_inum(&ns->ns);
>
> - idr_destroy(&ns->idr);
> + xa_destroy(&ns->xa);
> call_rcu(&ns->rcu, delayed_free_pidns);
> }
>
> @@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(put_pid_ns);
>
> void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> {
> - int nr;
> > + unsigned long nr;
> int rc;
> struct task_struct *task, *me = current;
> int init_pids = thread_group_leader(me) ? 1 : 2;
> @@ -198,8 +198,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> */
> rcu_read_lock();
> read_lock(&tasklist_lock);
> - nr = 2;
> - idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
> + xa_for_each_range(&pid_ns->xa, nr, pid, 2, ULONG_MAX) {
> task = pid_task(pid, PIDTYPE_PID);
> if (task && !__fatal_signal_pending(task))
> group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
> @@ -272,12 +271,12 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> * it should synchronize its usage with external means.
> */
>
> - next = idr_get_cursor(&pid_ns->idr) - 1;
> + next = pid_ns->cursor - 1;
>
> tmp.data = &next;
> ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> if (!ret && write)
> - idr_set_cursor(&pid_ns->idr, next + 1);
> + pid_ns->cursor = next + 1;
>
> return ret;
> }

2022-06-28 14:57:47

by Brian Foster

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Tue, Jun 28, 2022 at 02:55:11PM +0200, Christian Brauner wrote:
> On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> > On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > > Interesting, thanks. I'll have to dig more into this to grok the current
> > > state of the radix-tree interface vs. the underlying data structure. If
> > > I follow correctly, you're saying the radix-tree api is essentially
> > > already a translation layer to the xarray these days, and we just need
> > > to move legacy users off the radix-tree api so we can eventually kill it
> > > off...
> >
> > If only it were that easy ... the XArray has a whole bunch of debugging
> > asserts to make sure the users are actually using it correctly, and a
> > lot of radix tree users don't (they're probably not buggy, but they
> > don't use the XArray's embedded lock).
> >
> > Anyway, here's a first cut at converting the PID allocator from the IDR
> > to the XArray API. It boots, but I haven't tried to do anything tricky
> > with PID namespaces or CRIU.
>
> It'd be great to see that conversion done.
> Fwiw, there's test cases for e.g. nested pid namespace creation with
> specifically requested PIDs in
>

Ok, but I'm a little confused. Why open code the xarray usage as opposed
to working the idr bits closer to being able to use the xarray api (and/or
working the xarray to better support the idr use case)? I see 150+ callers
of idr_init(). Is the goal to eventually open code them all? That seems
like a lot of potential api churn for something that is presumably a generic
interface (and perhaps inconsistent with ida, which looks like it uses
xarray directly?), but I'm probably missing details.

If the issue is open-coded locking across all the idr users conflicting
with internal xarray debug bits, I guess what I'm wondering is why not
implement your patch more generically within idr (i.e. expose some
locking apis, etc.)? Even if it meant creating something like a
temporary init_idr_xa() variant that users can switch over to as they're
audited for expected behavior, I don't quite grok why that couldn't be
made to work if changing this code over directly does and the various
core radix tree data structures idr uses are already #defined to xarray
variants. Hm?

Brian

> tools/testing/selftests/clone3
>
> and there's additional tests in
>
> tools/testing/selftests/pidfd
>
> >
> >
> > [conversion patch snipped; quoted in full earlier in the thread]
>

2022-06-29 19:48:34

by Brian Foster

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Tue, Jun 28, 2022 at 10:53:50AM -0400, Brian Foster wrote:
> On Tue, Jun 28, 2022 at 02:55:11PM +0200, Christian Brauner wrote:
> > On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> > > On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > > > Interesting, thanks. I'll have to dig more into this to grok the current
> > > > state of the radix-tree interface vs. the underlying data structure. If
> > > > I follow correctly, you're saying the radix-tree api is essentially
> > > > already a translation layer to the xarray these days, and we just need
> > > > to move legacy users off the radix-tree api so we can eventually kill it
> > > > off...
> > >
> > > If only it were that easy ... the XArray has a whole bunch of debugging
> > > asserts to make sure the users are actually using it correctly, and a
> > > lot of radix tree users don't (they're probably not buggy, but they
> > > don't use the XArray's embedded lock).
> > >
> > > Anyway, here's a first cut at converting the PID allocator from the IDR
> > > to the XArray API. It boots, but I haven't tried to do anything tricky
> > > with PID namespaces or CRIU.
> >
> > It'd be great to see that conversion done.
> > Fwiw, there's test cases for e.g. nested pid namespace creation with
> > specifically requested PIDs in
> >
>
> Ok, but I'm a little confused. Why open code the xarray usage as opposed
> to work the idr bits closer to being able to use the xarray api (and/or
> work the xarray to better support the idr use case)? I see 150+ callers
> of idr_init(). Is the goal to eventually open code them all? That seems
> a lot of potential api churn for something that is presumably a generic
> interface (and perhaps inconsistent with ida, which looks like it uses
> xarray directly?), but I'm probably missing details.
>
> If the issue is open-coded locking across all the idr users conflicting
> with internal xarray debug bits, I guess what I'm wondering is why not
> implement your patch more generically within idr (i.e. expose some
> locking apis, etc.)? Even if it meant creating something like a
> temporary init_idr_xa() variant that users can switch over to as they're
> audited for expected behavior, I don't quite grok why that couldn't be
> made to work if changing this code over directly does and the various
> core radix tree data structures idr uses are already #defined to xarray
> variants. Hm?
>

Using Willy's patch as a reference, here's a hacked-up example of what I
was thinking (squashed to a single diff and based on top of my pending
idr tag patches). Obviously this needs more work and thought. I skipped
the locking change to start, so this will nest the internal xarray lock
inside the pidmap lock; I'm assuming doing otherwise might cause xarray
problems, based on the earlier comments? It does boot and doesn't
immediately explode, however, and the tag -> mark bits seem to work as
expected.

I am a little curious how pervasive the aforementioned locking issue is
here. Are we talking about the lockdep_is_held() assertions in xarray.h?
If so, could we get away with either disabling those for idr users (via
some new flag) or allowing idr users to pass along a lockdep_map
associated with an external lock that the xarray can feed into
lock_is_held()?

Brian

--- 8< ---

diff --git a/include/linux/idr.h b/include/linux/idr.h
index 5b62dfa4a031..f7dd0e8d8ac2 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -17,28 +17,31 @@
#include <linux/percpu.h>

struct idr {
- struct radix_tree_root idr_rt;
+ struct xarray idr_rt;
unsigned int idr_base;
unsigned int idr_next;
+ bool idr_xa;
};

/*
- * The IDR API does not expose the tagging functionality of the radix tree
- * to users. Use tag 0 to track whether a node has free space below it.
+ * The IDR API does not expose the tagging functionality of the tree to users.
+ * Use tag 0 to track whether a node has free space below it.
*/
#define IDR_FREE 0
#define IDR_TAG 1

/* Set the IDR flag and the IDR_FREE tag */
-#define IDR_RT_MARKER (ROOT_IS_IDR | (__force gfp_t) \
- (1 << (ROOT_TAG_SHIFT + IDR_FREE)))
+#define IDR_RT_MARKER (ROOT_IS_IDR | XA_FLAGS_MARK(IDR_FREE))

-#define IDR_INIT_BASE(name, base) { \
- .idr_rt = RADIX_TREE_INIT(name, IDR_RT_MARKER), \
+#define __IDR_INIT(name, base, flags, is_xa) { \
+ .idr_rt = XARRAY_INIT(name, flags), \
.idr_base = (base), \
.idr_next = 0, \
+ .idr_xa = is_xa, \
}

+#define IDR_INIT_BASE(name, base) __IDR_INIT(name, base, IDR_RT_MARKER, false)
+
/**
* IDR_INIT() - Initialise an IDR.
* @name: Name of IDR.
@@ -111,6 +114,7 @@ static inline void idr_set_cursor(struct idr *idr, unsigned int val)
xa_unlock_irqrestore(&(idr)->idr_rt, flags)

void idr_preload(gfp_t gfp_mask);
+void idr_preload_end(void);

int idr_alloc(struct idr *, void *ptr, int start, int end, gfp_t);
int __must_check idr_alloc_u32(struct idr *, void *ptr, u32 *id,
@@ -123,6 +127,9 @@ int idr_for_each(const struct idr *,
void *idr_get_next(struct idr *, int *nextid);
void *idr_get_next_ul(struct idr *, unsigned long *nextid);
void *idr_replace(struct idr *, void *, unsigned long id);
+void idr_set_tag(struct idr *idr, unsigned long id);
+bool idr_get_tag(struct idr *idr, unsigned long id);
+void *idr_get_next_tag(struct idr *idr, unsigned long id);
void idr_destroy(struct idr *);

/**
@@ -133,11 +140,17 @@ void idr_destroy(struct idr *);
* This variation of idr_init() creates an IDR which will allocate IDs
* starting at %base.
*/
-static inline void idr_init_base(struct idr *idr, int base)
+static inline void __idr_init(struct idr *idr, int base, gfp_t flags, bool is_xa)
{
- INIT_RADIX_TREE(&idr->idr_rt, IDR_RT_MARKER);
+ xa_init_flags(&idr->idr_rt, flags);
idr->idr_base = base;
idr->idr_next = 0;
+ idr->idr_xa = is_xa;
+}
+
+static inline void idr_init_base(struct idr *idr, int base)
+{
+ __idr_init(idr, base, IDR_RT_MARKER, false);
}

/**
@@ -160,43 +173,8 @@ static inline void idr_init(struct idr *idr)
*/
static inline bool idr_is_empty(const struct idr *idr)
{
- return radix_tree_empty(&idr->idr_rt) &&
- radix_tree_tagged(&idr->idr_rt, IDR_FREE);
-}
-
-/**
- * idr_preload_end - end preload section started with idr_preload()
- *
- * Each idr_preload() should be matched with an invocation of this
- * function. See idr_preload() for details.
- */
-static inline void idr_preload_end(void)
-{
- local_unlock(&radix_tree_preloads.lock);
-}
-
-static inline void idr_set_tag(struct idr *idr, unsigned long id)
-{
- radix_tree_tag_set(&idr->idr_rt, id, IDR_TAG);
-}
-
-static inline bool idr_get_tag(struct idr *idr, unsigned long id)
-{
- return radix_tree_tag_get(&idr->idr_rt, id, IDR_TAG);
-}
-
-/*
- * Find the next id with the internal tag set.
- */
-static inline void *idr_get_next_tag(struct idr *idr, unsigned long id)
-{
- unsigned int ret;
- void *entry;
-
- ret = radix_tree_gang_lookup_tag(&idr->idr_rt, &entry, id, 1, IDR_TAG);
- if (ret != 1)
- return NULL;
- return entry;
+ return xa_empty(&idr->idr_rt) &&
+ xa_marked(&idr->idr_rt, IDR_FREE);
}

/**
diff --git a/kernel/pid.c b/kernel/pid.c
index bd72d1dbff95..d2297578466f 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -66,6 +66,8 @@ int pid_max = PID_MAX_DEFAULT;
int pid_max_min = RESERVED_PIDS + 1;
int pid_max_max = PID_MAX_LIMIT;

+#define PID_XA_FLAGS (XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
+
/*
* PID-map pages start out as NULL, they get allocated upon
* first use and are never deallocated. This way a low pid_max
@@ -74,7 +76,7 @@ int pid_max_max = PID_MAX_LIMIT;
*/
struct pid_namespace init_pid_ns = {
.ns.count = REFCOUNT_INIT(2),
- .idr = IDR_INIT(init_pid_ns.idr),
+ .idr = __IDR_INIT(init_pid_ns.idr, 0, PID_XA_FLAGS, true),
.pid_allocated = PIDNS_ADDING,
.level = 0,
.child_reaper = &init_task,
@@ -205,7 +207,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
set_tid_size--;
}

- idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);

if (tid) {
@@ -234,7 +235,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
pid_max, GFP_ATOMIC);
}
spin_unlock_irq(&pidmap_lock);
- idr_preload_end();

if (nr < 0) {
retval = (nr == -ENOSPC) ? -EAGAIN : nr;
@@ -696,7 +696,7 @@ void __init pid_idr_init(void)
PIDS_PER_CPU_MIN * num_possible_cpus());
pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);

- idr_init(&init_pid_ns.idr);
+ __idr_init(&init_pid_ns.idr, 0, PID_XA_FLAGS, true);

init_pid_ns.pid_cachep = KMEM_CACHE(pid,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
diff --git a/lib/idr.c b/lib/idr.c
index f4ab4f4aa3c7..ae6dac08683c 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -37,11 +37,27 @@ int idr_alloc_u32(struct idr *idr, void *ptr, u32 *nextid,
void __rcu **slot;
unsigned int base = idr->idr_base;
unsigned int id = *nextid;
+ int error;
+ struct xa_limit limit;
+
+ id = (id < base) ? 0 : id - base;
+
+ if (idr->idr_xa) {
+ limit.min = id;
+ limit.max = max - base;
+ error = xa_alloc(&idr->idr_rt, nextid, ptr, limit, gfp);
+ /* error compatibility w/ radix-tree */
+ if (error == -EBUSY)
+ return -ENOSPC;
+ else if (error)
+ return error;
+ *nextid += base;
+ return 0;
+ }

if (WARN_ON_ONCE(!(idr->idr_rt.xa_flags & ROOT_IS_IDR)))
idr->idr_rt.xa_flags |= IDR_RT_MARKER;

- id = (id < base) ? 0 : id - base;
radix_tree_iter_init(&iter, id);
slot = idr_get_free(&idr->idr_rt, &iter, gfp, max - base);
if (IS_ERR(slot))
@@ -151,6 +167,8 @@ EXPORT_SYMBOL(idr_alloc_cyclic);
*/
void *idr_remove(struct idr *idr, unsigned long id)
{
+ if (idr->idr_xa)
+ return xa_erase(&idr->idr_rt, id - idr->idr_base);
return radix_tree_delete_item(&idr->idr_rt, id - idr->idr_base, NULL);
}
EXPORT_SYMBOL_GPL(idr_remove);
@@ -171,6 +189,8 @@ EXPORT_SYMBOL_GPL(idr_remove);
*/
void *idr_find(const struct idr *idr, unsigned long id)
{
+ if (idr->idr_xa)
+ return xa_load(&idr->idr_rt, id - idr->idr_base);
return radix_tree_lookup(&idr->idr_rt, id - idr->idr_base);
}
EXPORT_SYMBOL_GPL(idr_find);
@@ -233,6 +253,14 @@ void *idr_get_next_ul(struct idr *idr, unsigned long *nextid)
unsigned long id = *nextid;

id = (id < base) ? 0 : id - base;
+
+ if (idr->idr_xa) {
+ entry = xa_find(&idr->idr_rt, &id, ULONG_MAX, XA_PRESENT);
+ if (entry)
+ *nextid = id + base;
+ return entry;
+ }
+
radix_tree_for_each_slot(slot, &idr->idr_rt, &iter, id) {
entry = rcu_dereference_raw(*slot);
if (!entry)
@@ -295,6 +323,12 @@ void *idr_replace(struct idr *idr, void *ptr, unsigned long id)

id -= idr->idr_base;

+ if (idr->idr_xa) {
+ entry = xa_store(&idr->idr_rt, id, ptr, 0);
+ /* XXX: error translation? */
+ return entry;
+ }
+
entry = __radix_tree_lookup(&idr->idr_rt, id, &node, &slot);
if (!slot || radix_tree_tag_get(&idr->idr_rt, id, IDR_FREE))
return ERR_PTR(-ENOENT);
@@ -305,6 +339,41 @@ void *idr_replace(struct idr *idr, void *ptr, unsigned long id)
}
EXPORT_SYMBOL(idr_replace);

+void idr_set_tag(struct idr *idr, unsigned long id)
+{
+ if (idr->idr_xa)
+ xa_set_mark(&idr->idr_rt, id, IDR_TAG);
+ else
+ radix_tree_tag_set(&idr->idr_rt, id, IDR_TAG);
+}
+
+bool idr_get_tag(struct idr *idr, unsigned long id)
+{
+ if (idr->idr_xa)
+ return xa_get_mark(&idr->idr_rt, id, IDR_TAG);
+ return radix_tree_tag_get(&idr->idr_rt, id, IDR_TAG);
+}
+
+/*
+ * Find the next id with the internal tag set.
+ */
+void *idr_get_next_tag(struct idr *idr, unsigned long id)
+{
+ unsigned int ret;
+ void *entry;
+
+ if (idr->idr_xa) {
+ entry = xa_find(&idr->idr_rt, &id, ULONG_MAX, IDR_TAG);
+ return entry;
+ }
+
+ ret = radix_tree_gang_lookup_tag(&idr->idr_rt, &entry, id, 1, IDR_TAG);
+ if (ret != 1)
+ return NULL;
+ return entry;
+}
+
+
/**
* DOC: IDA description
*
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 08eef33e7820..8c6eb25aadbb 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1476,6 +1476,18 @@ void idr_preload(gfp_t gfp_mask)
}
EXPORT_SYMBOL(idr_preload);

+/**
+ * idr_preload_end - end preload section started with idr_preload()
+ *
+ * Each idr_preload() should be matched with an invocation of this
+ * function. See idr_preload() for details.
+ */
+void idr_preload_end(void)
+{
+ local_unlock(&radix_tree_preloads.lock);
+}
+EXPORT_SYMBOL(idr_preload_end);
+
void __rcu **idr_get_free(struct radix_tree_root *root,
struct radix_tree_iter *iter, gfp_t gfp,
unsigned long max)

2022-07-11 20:26:39

by Matthew Wilcox

Subject: Re: [PATCH 1/3] radix-tree: propagate all tags in idr tree

On Tue, Jun 28, 2022 at 10:53:50AM -0400, Brian Foster wrote:
> On Tue, Jun 28, 2022 at 02:55:11PM +0200, Christian Brauner wrote:
> > On Wed, Jun 15, 2022 at 05:34:02PM +0100, Matthew Wilcox wrote:
> > > On Wed, Jun 15, 2022 at 10:43:33AM -0400, Brian Foster wrote:
> > > > Interesting, thanks. I'll have to dig more into this to grok the current
> > > > state of the radix-tree interface vs. the underlying data structure. If
> > > > I follow correctly, you're saying the radix-tree api is essentially
> > > > already a translation layer to the xarray these days, and we just need
> > > > to move legacy users off the radix-tree api so we can eventually kill it
> > > > off...
> > >
> > > If only it were that easy ... the XArray has a whole bunch of debugging
> > > asserts to make sure the users are actually using it correctly, and a
> > > lot of radix tree users don't (they're probably not buggy, but they
> > > don't use the XArray's embedded lock).
> > >
> > > Anyway, here's a first cut at converting the PID allocator from the IDR
> > > to the XArray API. It boots, but I haven't tried to do anything tricky
> > > with PID namespaces or CRIU.
> >
> > It'd be great to see that conversion done.
> > Fwiw, there's test cases for e.g. nested pid namespace creation with
> > specifically requested PIDs in
> >
>
> Ok, but I'm a little confused. Why open code the xarray usage as opposed
> to work the idr bits closer to being able to use the xarray api (and/or
> work the xarray to better support the idr use case)? I see 150+ callers
> of idr_init(). Is the goal to eventually open code them all? That seems
> a lot of potential api churn for something that is presumably a generic
> interface (and perhaps inconsistent with ida, which looks like it uses
> xarray directly?), but I'm probably missing details.

It's not "open coding". It's "using the XArray API instead of the
IDR API". The IDR API is inferior in a number of ways, and yes, I
do want to be rid of it entirely.