2022-12-02 17:25:03

by Brian Foster

Subject: [PATCH v3 0/5] proc: improve root readdir latency with many threads

Hi all,

Here's v3 of the /proc readdir optimization patches. See v1 for the full
introductory cover letter.

Most of the feedback received to this point has been around switching
the pid code over to use the xarray api instead of the idr. Matt Wilcox
posted most of the code to do that. I cleaned it up a bit and posted a
standalone series for that here [1], but didn't receive any feedback.
Patches 1-3 of this series are essentially a repost of [1].

Patches 4-5 are otherwise mostly the same as v2 outside of switching
over to use the xarray bits instead of the idr/radix-tree.

Thoughts, reviews, flames appreciated.

Brian

[1] https://lore.kernel.org/linux-mm/[email protected]/

v3:
- Drop radix-tree fixups.
- Convert pid idr usage to xarray.
- Replace tgid radix-tree tag set/lookup to use xarray mark.
v2: https://lore.kernel.org/linux-fsdevel/[email protected]/
- Clean up idr helpers to be more generic.
- Use ->idr_base properly.
- Lift tgid iteration helper into pid.c to abstract tag logic from
users.
v1: https://lore.kernel.org/linux-fsdevel/[email protected]/

Brian Foster (5):
pid: replace pidmap_lock with xarray lock
pid: split cyclic id allocation cursor from idr
pid: switch pid_namespace from idr to xarray
pid: mark pids associated with group leader tasks
procfs: use efficient tgid pid search on root readdir

arch/powerpc/platforms/cell/spufs/sched.c |   2 +-
fs/proc/base.c                            |  17 +--
fs/proc/loadavg.c                         |   2 +-
include/linux/pid.h                       |   3 +-
include/linux/pid_namespace.h             |   9 +-
include/linux/threads.h                   |   2 +-
init/main.c                               |   3 +-
kernel/fork.c                             |   2 +-
kernel/pid.c                              | 177 +++++++++++++---------
kernel/pid_namespace.c                    |  23 ++-
10 files changed, 132 insertions(+), 108 deletions(-)

--
2.37.3


2022-12-02 17:25:06

by Brian Foster

Subject: [PATCH v3 4/5] pid: mark pids associated with group leader tasks

Searching the pid_namespace for group leader tasks is a fairly
inefficient operation. Listing the root directory of a procfs mount
performs a linear scan of allocated pids, checking each entry for an
associated PIDTYPE_TGID task to determine whether to populate a
directory entry. This can cause a significant increase in readdir()
syscall latency when run in namespaces that might have one or more
processes with significant thread counts.
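
As a rough illustration (not part of the patch, and assuming pid_max
and the thread limits are raised to accommodate the thread count), a
userspace sketch along these lines makes the pre-patch scan cost
visible:

/*
 * Hypothetical reproducer: park a large number of threads, then time
 * one pass over the /proc root. Every thread consumes a pid, so the
 * linear readdir scan has to walk all of them.
 */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static void *parked(void *arg)
{
	pause();	/* each thread only needs to hold a pid */
	return NULL;
}

int main(int argc, char **argv)
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 50000;
	struct timespec t0, t1;
	struct dirent *d;
	pthread_t tid;
	long n = 0;
	DIR *dir;

	for (int i = 0; i < nthreads; i++)
		if (pthread_create(&tid, NULL, parked, NULL))
			break;	/* hit a limit; measure with what we got */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	dir = opendir("/proc");
	if (!dir)
		return 1;
	while ((d = readdir(dir)))
		n++;
	closedir(dir);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%ld entries in %.3f ms\n", n,
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}

Build with -pthread and run once on a stock kernel and once with the
series applied to compare getdents latency.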

To facilitate improved TGID pid searches, mark the ids of pid
entries that are likely to have an associated PIDTYPE_TGID task. To
keep the code simple and avoid having to maintain synchronization
between mark state and post-fork pid-task association changes, the
mark is applied to all pids allocated for tasks cloned without
CLONE_THREAD.

This means that it is possible for a pid to remain marked in the
xarray after being disassociated from the group leader task. For
example, a process that does a setsid() followed by fork() and
exit() (to daemonize) will remain associated with the original pid
for the session, but link with the child pid as the group leader.
OTOH, the only place other than fork() where a tgid association
occurs is in the exec() path, which kills all other tasks in the
group and associates the current task with the preexisting leader
pid. Therefore, the semantics of the mark are that false positives
(marked pids without PIDTYPE_TGID tasks) are possible, but false
negatives (unmarked pids with PIDTYPE_TGID tasks) should never
occur.
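
For reference, the daemonize sequence described above looks like this
in userspace (illustrative sketch only):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	if (fork() > 0)
		exit(0);	/* original parent exits */

	setsid();		/* child becomes session leader */

	if (fork() > 0)
		exit(0);	/* session leader exits; its pid stays
				 * allocated as the session id, still
				 * marked, but with no PIDTYPE_TGID task:
				 * a tolerated false positive */

	pause();		/* grandchild runs on as the daemon */
	return 0;
}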

This is an effective optimization because false positives are fairly
uncommon and add no overhead (i.e. we already have to check
pid_task() for marked entries), while the mark still filters out
thread pids that are guaranteed not to have a TGID task association.

Mark entries in the pid allocation path when the caller specifies
that the pid associates with a new thread group. Since false
negatives are not allowed, warn in the event that a PIDTYPE_TGID
task is ever attached to an unmarked pid. Finally, create a helper
to implement the task search based on the mark semantics defined
above (based on search logic currently implemented by next_tgid() in
procfs).
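
For context, a caller is expected to iterate leaders roughly as
follows (sketch of the intended usage; the actual procfs wiring is
patch 5/5 and not shown in this message):

	int id = 1;	/* lowest pid of interest */
	struct task_struct *task;

	while ((task = find_get_tgid_task(&id, ns)) != NULL) {
		/* ... emit an entry for group leader 'id' ... */
		put_task_struct(task);
		id++;	/* resume the search after the returned pid */
	}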

Signed-off-by: Brian Foster <[email protected]>
---
include/linux/pid.h |  3 ++-
kernel/fork.c       |  2 +-
kernel/pid.c        | 44 +++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 343abf22092e..64caf21be256 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -132,9 +132,10 @@ extern struct pid *find_vpid(int nr);
*/
extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
+struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *);

extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
- size_t set_tid_size);
+ size_t set_tid_size, bool group_leader);
extern void free_pid(struct pid *pid);
extern void disable_pid_allocation(struct pid_namespace *ns);

diff --git a/kernel/fork.c b/kernel/fork.c
index 08969f5aa38d..1cf2644c642e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2267,7 +2267,7 @@ static __latent_entropy struct task_struct *copy_process(

if (pid != &init_struct_pid) {
pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
- args->set_tid_size);
+ args->set_tid_size, !(clone_flags & CLONE_THREAD));
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
diff --git a/kernel/pid.c b/kernel/pid.c
index 53db06f9882d..d65f74c6186c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -66,6 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
int pid_max_min = RESERVED_PIDS + 1;
int pid_max_max = PID_MAX_LIMIT;

+/* MARK_0 used by XA_FREE_MARK */
+#define TGID_MARK XA_MARK_1
+
struct pid_namespace init_pid_ns = {
.ns.count = REFCOUNT_INIT(2),
.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
@@ -137,7 +140,7 @@ void free_pid(struct pid *pid)
}

struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
- size_t set_tid_size)
+ size_t set_tid_size, bool group_leader)
{
struct pid *pid;
enum pid_type type;
@@ -257,6 +260,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,

/* Make the PID visible to find_pid_ns. */
__xa_store(&tmp->xa, upid->nr, pid, 0);
+ if (group_leader)
+ __xa_set_mark(&tmp->xa, upid->nr, TGID_MARK);
tmp->pid_allocated++;
xa_unlock_irq(&tmp->xa);
}
@@ -314,6 +319,11 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
void attach_pid(struct task_struct *task, enum pid_type type)
{
struct pid *pid = *task_pid_ptr(task, type);
+ struct pid_namespace *pid_ns = ns_of_pid(pid);
+ pid_t pid_nr = pid_nr_ns(pid, pid_ns);
+
+ WARN_ON(type == PIDTYPE_TGID &&
+ !xa_get_mark(&pid_ns->xa, pid_nr, TGID_MARK));
hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
}

@@ -506,6 +516,38 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
}
EXPORT_SYMBOL_GPL(find_ge_pid);

+/*
+ * Used by proc to find the first thread group leader task with an id greater
+ * than or equal to *id.
+ *
+ * Use the xarray mark as a hint to find the next best pid. The mark does not
+ * guarantee a linked group leader task exists, so retry until a suitable entry
+ * is found.
+ */
+struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *ns)
+{
+ struct pid *pid;
+ struct task_struct *t;
+ unsigned long nr = *id;
+
+ rcu_read_lock();
+ do {
+ pid = xa_find(&ns->xa, &nr, ULONG_MAX, TGID_MARK);
+ if (!pid) {
+ rcu_read_unlock();
+ return NULL;
+ }
+ t = pid_task(pid, PIDTYPE_TGID);
+ nr++;
+ } while (!t);
+
+ *id = pid_nr_ns(pid, ns);
+ get_task_struct(t);
+ rcu_read_unlock();
+
+ return t;
+}
+
struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
{
struct fd f;
--
2.37.3

2022-12-02 17:27:17

by Brian Foster

Subject: [PATCH v3 2/5] pid: split cyclic id allocation cursor from idr

As a next step in separating pid allocation from the idr, split off
the cyclic pid allocation cursor from the idr. Lift the cursor value
into the struct pid_namespace. Note that this involves temporarily
open-coding the cursor increment on allocation, but this is cleaned
up in the subsequent patch.
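
The cursor keeps the idr's semantics: after an allocation it holds
the id one past the pid just handed out, which is why the loadavg
readers in this patch report the last pid as the cursor minus one. As
a sketch of the invariant:

	/* after a successful cyclic allocation of nr */
	tmp->pid_next = nr + 1;

	/* "last pid allocated", as printed by /proc/loadavg */
	last_pid = READ_ONCE(ns->pid_next) - 1;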

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Brian Foster <[email protected]>
---
arch/powerpc/platforms/cell/spufs/sched.c | 2 +-
fs/proc/loadavg.c                         | 2 +-
include/linux/pid_namespace.h             | 1 +
kernel/pid.c                              | 6 ++++--
kernel/pid_namespace.c                    | 4 ++--
5 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 99bd027a7f7c..a2ed928d7658 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1072,7 +1072,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
LOAD_INT(c), LOAD_FRAC(c),
count_active_contexts(),
atomic_read(&nr_spu_contexts),
- idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
+ READ_ONCE(task_active_pid_ns(current)->pid_next) - 1);
return 0;
}
#endif
diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
index 817981e57223..2740b31b6461 100644
--- a/fs/proc/loadavg.c
+++ b/fs/proc/loadavg.c
@@ -22,7 +22,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
nr_running(), nr_threads,
- idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
+ READ_ONCE(task_active_pid_ns(current)->pid_next) - 1);
return 0;
}

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 07481bb87d4e..82c72482019d 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -18,6 +18,7 @@ struct fs_pin;

struct pid_namespace {
struct idr idr;
+ unsigned int pid_next;
struct rcu_head rcu;
unsigned int pid_allocated;
struct task_struct *child_reaper;
diff --git a/kernel/pid.c b/kernel/pid.c
index 3622f8b13143..2e2d33273c8e 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -75,6 +75,7 @@ int pid_max_max = PID_MAX_LIMIT;
struct pid_namespace init_pid_ns = {
.ns.count = REFCOUNT_INIT(2),
.idr = IDR_INIT(init_pid_ns.idr),
+ .pid_next = 0,
.pid_allocated = PIDNS_ADDING,
.level = 0,
.child_reaper = &init_task,
@@ -208,7 +209,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* init really needs pid 1, but after reaching the
* maximum wrap back to RESERVED_PIDS
*/
- if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+ if (tmp->pid_next > RESERVED_PIDS)
pid_min = RESERVED_PIDS;

/*
@@ -217,6 +218,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
*/
nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
pid_max, GFP_ATOMIC);
+ tmp->pid_next = nr + 1;
}
xa_unlock_irq(&tmp->idr.idr_rt);
idr_preload_end();
@@ -278,7 +280,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,

/* On failure to allocate the first pid, reset the state */
if (tmp == ns && tmp->pid_allocated == PIDNS_ADDING)
- idr_set_cursor(&ns->idr, 0);
+ ns->pid_next = 0;

idr_remove(&tmp->idr, upid->nr);
xa_unlock_irq(&tmp->idr.idr_rt);
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index f4f8cb0435b4..a53d20c5c85e 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -272,12 +272,12 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
* it should synchronize its usage with external means.
*/

- next = idr_get_cursor(&pid_ns->idr) - 1;
+ next = READ_ONCE(pid_ns->pid_next) - 1;

tmp.data = &next;
ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
if (!ret && write)
- idr_set_cursor(&pid_ns->idr, next + 1);
+ WRITE_ONCE(pid_ns->pid_next, next + 1);

return ret;
}
--
2.37.3

2022-12-02 18:10:16

by Brian Foster

Subject: [PATCH v3 3/5] pid: switch pid_namespace from idr to xarray

Switch struct pid[_namespace] management over to use the xarray api
directly instead of the idr. The underlying data structure used by
both interfaces is the same. The difference is that the idr api
relies on the old, idr-custom radix-tree implementation for things
like efficient tracking/allocation of free ids. The xarray already
supports this, so most of this is a direct switchover from the old
api to the new.
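
For quick reference, the conversion maps the old calls onto xarray
equivalents roughly as follows (summarized from the diff below):

	idr_init()                    ->  xa_init_flags(..., PID_XA_FLAGS)
	idr_alloc(NULL, tid, tid + 1) ->  __xa_insert(..., tid, NULL, gfp)
	idr_alloc_cyclic()            ->  __xa_alloc_cyclic(..., &pid_next, gfp)
	idr_replace(pid, nr)          ->  __xa_store(..., nr, pid, 0)
	idr_remove(nr)                ->  __xa_erase(..., nr)
	idr_find(nr)                  ->  xa_load(..., nr)
	idr_get_next(&nr)             ->  xa_find(..., &index, ULONG_MAX, XA_PRESENT)
	idr_destroy()                 ->  xa_destroy()
	idr_for_each_entry_continue() ->  xa_for_each_range(..., 2, ULONG_MAX)
	idr_preload()/idr_preload_end()   dropped (the xarray allocates internally)
	-ENOSPC (in use / exhausted)  ->  -EBUSY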

Signed-off-by: Matthew Wilcox <[email protected]>
Signed-off-by: Brian Foster <[email protected]>
---
include/linux/pid_namespace.h |  8 ++--
include/linux/threads.h       |  2 +-
init/main.c                   |  3 +-
kernel/pid.c                  | 78 ++++++++++++++++-------------------
kernel/pid_namespace.c        | 19 ++++-----
5 files changed, 51 insertions(+), 59 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 82c72482019d..e4f5979b482b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -9,7 +9,7 @@
#include <linux/threads.h>
#include <linux/nsproxy.h>
#include <linux/ns_common.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>

/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
#define MAX_PID_NS_LEVEL 32
@@ -17,7 +17,7 @@
struct fs_pin;

struct pid_namespace {
- struct idr idr;
+ struct xarray xa;
unsigned int pid_next;
struct rcu_head rcu;
unsigned int pid_allocated;
@@ -38,6 +38,8 @@ extern struct pid_namespace init_pid_ns;

#define PIDNS_ADDING (1U << 31)

+#define PID_XA_FLAGS (XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
+
#ifdef CONFIG_PID_NS
static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
{
@@ -85,7 +87,7 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)

extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
void pidhash_init(void);
-void pid_idr_init(void);
+void pid_init(void);

static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
{
diff --git a/include/linux/threads.h b/include/linux/threads.h
index c34173e6c5f1..37e4391ee89f 100644
--- a/include/linux/threads.h
+++ b/include/linux/threads.h
@@ -38,7 +38,7 @@
* Define a minimum number of pids per cpu. Heuristically based
* on original pid max of 32k for 32 cpus. Also, increase the
* minimum settable value for pid_max on the running system based
- * on similar defaults. See kernel/pid.c:pid_idr_init() for details.
+ * on similar defaults. See kernel/pid.c:pid_init() for details.
*/
#define PIDS_PER_CPU_DEFAULT 1024
#define PIDS_PER_CPU_MIN 8
diff --git a/init/main.c b/init/main.c
index aa21add5f7c5..7dd8888036c7 100644
--- a/init/main.c
+++ b/init/main.c
@@ -74,7 +74,6 @@
#include <linux/sched.h>
#include <linux/sched/init.h>
#include <linux/signal.h>
-#include <linux/idr.h>
#include <linux/kgdb.h>
#include <linux/ftrace.h>
#include <linux/async.h>
@@ -1108,7 +1107,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
late_time_init();
sched_clock_init();
calibrate_delay();
- pid_idr_init();
+ pid_init();
anon_vma_init();
#ifdef CONFIG_X86
if (efi_enabled(EFI_RUNTIME_SERVICES))
diff --git a/kernel/pid.c b/kernel/pid.c
index 2e2d33273c8e..53db06f9882d 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -41,7 +41,7 @@
#include <linux/anon_inodes.h>
#include <linux/sched/signal.h>
#include <linux/sched/task.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>
#include <net/sock.h>
#include <uapi/linux/pidfd.h>

@@ -66,15 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
int pid_max_min = RESERVED_PIDS + 1;
int pid_max_max = PID_MAX_LIMIT;

-/*
- * PID-map pages start out as NULL, they get allocated upon
- * first use and are never deallocated. This way a low pid_max
- * value does not cause lots of bitmaps to be allocated, but
- * the scheme scales to up to 4 million PIDs, runtime.
- */
struct pid_namespace init_pid_ns = {
.ns.count = REFCOUNT_INIT(2),
- .idr = IDR_INIT(init_pid_ns.idr),
+ .xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
.pid_next = 0,
.pid_allocated = PIDNS_ADDING,
.level = 0,
@@ -118,7 +112,7 @@ void free_pid(struct pid *pid)
struct upid *upid = pid->numbers + i;
struct pid_namespace *ns = upid->ns;

- xa_lock_irqsave(&ns->idr.idr_rt, flags);
+ xa_lock_irqsave(&ns->xa, flags);
switch (--ns->pid_allocated) {
case 2:
case 1:
@@ -135,8 +129,8 @@ void free_pid(struct pid *pid)
break;
}

- idr_remove(&ns->idr, upid->nr);
- xa_unlock_irqrestore(&ns->idr.idr_rt, flags);
+ __xa_erase(&ns->xa, upid->nr);
+ xa_unlock_irqrestore(&ns->xa, flags);
}

call_rcu(&pid->rcu, delayed_put_pid);
@@ -147,7 +141,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
{
struct pid *pid;
enum pid_type type;
- int i, nr;
+ int i;
struct pid_namespace *tmp;
struct upid *upid;
int retval = -ENOMEM;
@@ -191,18 +185,17 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
set_tid_size--;
}

- idr_preload(GFP_KERNEL);
- xa_lock_irq(&tmp->idr.idr_rt);
+ xa_lock_irq(&tmp->xa);

if (tid) {
- nr = idr_alloc(&tmp->idr, NULL, tid,
- tid + 1, GFP_ATOMIC);
+ retval = __xa_insert(&tmp->xa, tid, NULL, GFP_KERNEL);
+
/*
- * If ENOSPC is returned it means that the PID is
- * alreay in use. Return EEXIST in that case.
+ * If EBUSY is returned it means that the PID is already
+ * in use. Return EEXIST in that case.
*/
- if (nr == -ENOSPC)
- nr = -EEXIST;
+ if (retval == -EBUSY)
+ retval = -EEXIST;
} else {
int pid_min = 1;
/*
@@ -216,19 +209,18 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* Store a null pointer so find_pid_ns does not find
* a partially initialized PID (see below).
*/
- nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
- tmp->pid_next = nr + 1;
+ retval = __xa_alloc_cyclic(&tmp->xa, &tid, NULL,
+ XA_LIMIT(pid_min, pid_max),
+ &tmp->pid_next, GFP_KERNEL);
+ if (retval == -EBUSY)
+ retval = -EAGAIN;
}
- xa_unlock_irq(&tmp->idr.idr_rt);
- idr_preload_end();
+ xa_unlock_irq(&tmp->xa);

- if (nr < 0) {
- retval = (nr == -ENOSPC) ? -EAGAIN : nr;
+ if (retval < 0)
goto out_free;
- }

- pid->numbers[i].nr = nr;
+ pid->numbers[i].nr = tid;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
}
@@ -256,17 +248,17 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
for ( ; upid >= pid->numbers; --upid) {
tmp = upid->ns;

- xa_lock_irq(&tmp->idr.idr_rt);
+ xa_lock_irq(&tmp->xa);
if (tmp == ns && !(tmp->pid_allocated & PIDNS_ADDING)) {
- xa_unlock_irq(&tmp->idr.idr_rt);
+ xa_unlock_irq(&tmp->xa);
put_pid_ns(ns);
goto out_free;
}

/* Make the PID visible to find_pid_ns. */
- idr_replace(&tmp->idr, pid, upid->nr);
+ __xa_store(&tmp->xa, upid->nr, pid, 0);
tmp->pid_allocated++;
- xa_unlock_irq(&tmp->idr.idr_rt);
+ xa_unlock_irq(&tmp->xa);
}

return pid;
@@ -276,14 +268,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
upid = pid->numbers + i;
tmp = upid->ns;

- xa_lock_irq(&tmp->idr.idr_rt);
+ xa_lock_irq(&tmp->xa);

/* On failure to allocate the first pid, reset the state */
if (tmp == ns && tmp->pid_allocated == PIDNS_ADDING)
ns->pid_next = 0;

- idr_remove(&tmp->idr, upid->nr);
- xa_unlock_irq(&tmp->idr.idr_rt);
+ __xa_erase(&tmp->xa, upid->nr);
+ xa_unlock_irq(&tmp->xa);
}

kmem_cache_free(ns->pid_cachep, pid);
@@ -292,14 +284,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,

void disable_pid_allocation(struct pid_namespace *ns)
{
- xa_lock_irq(&ns->idr.idr_rt);
+ xa_lock_irq(&ns->xa);
ns->pid_allocated &= ~PIDNS_ADDING;
- xa_unlock_irq(&ns->idr.idr_rt);
+ xa_unlock_irq(&ns->xa);
}

struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
- return idr_find(&ns->idr, nr);
+ return xa_load(&ns->xa, nr);
}
EXPORT_SYMBOL_GPL(find_pid_ns);

@@ -508,7 +500,9 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
*/
struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
{
- return idr_get_next(&ns->idr, &nr);
+ unsigned long index = nr;
+
+ return xa_find(&ns->xa, &index, ULONG_MAX, XA_PRESENT);
}
EXPORT_SYMBOL_GPL(find_ge_pid);

@@ -650,7 +644,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
* take it we can leave the interrupts enabled. For now it is easier to be safe
* than to prove it can't happen.
*/
-void __init pid_idr_init(void)
+void __init pid_init(void)
{
/* Verify no one has done anything silly: */
BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
@@ -662,8 +656,6 @@ void __init pid_idr_init(void)
PIDS_PER_CPU_MIN * num_possible_cpus());
pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);

- idr_init(&init_pid_ns.idr);
-
init_pid_ns.pid_cachep = KMEM_CACHE(pid,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a53d20c5c85e..8561e01e2d01 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -22,7 +22,7 @@
#include <linux/export.h>
#include <linux/sched/task.h>
#include <linux/sched/signal.h>
-#include <linux/idr.h>
+#include <linux/xarray.h>

static DEFINE_MUTEX(pid_caches_mutex);
static struct kmem_cache *pid_ns_cachep;
@@ -92,15 +92,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
if (ns == NULL)
goto out_dec;

- idr_init(&ns->idr);
+ xa_init_flags(&ns->xa, PID_XA_FLAGS);

ns->pid_cachep = create_pid_cachep(level);
if (ns->pid_cachep == NULL)
- goto out_free_idr;
+ goto out_free_xa;

err = ns_alloc_inum(&ns->ns);
if (err)
- goto out_free_idr;
+ goto out_free_xa;
ns->ns.ops = &pidns_operations;

refcount_set(&ns->ns.count, 1);
@@ -112,8 +112,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns

return ns;

-out_free_idr:
- idr_destroy(&ns->idr);
+out_free_xa:
+ xa_destroy(&ns->xa);
kmem_cache_free(pid_ns_cachep, ns);
out_dec:
dec_pid_namespaces(ucounts);
@@ -135,7 +135,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
{
ns_free_inum(&ns->ns);

- idr_destroy(&ns->idr);
+ xa_destroy(&ns->xa);
call_rcu(&ns->rcu, delayed_free_pidns);
}

@@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(put_pid_ns);

void zap_pid_ns_processes(struct pid_namespace *pid_ns)
{
- int nr;
+ long nr;
int rc;
struct task_struct *task, *me = current;
int init_pids = thread_group_leader(me) ? 1 : 2;
@@ -198,8 +198,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
*/
rcu_read_lock();
read_lock(&tasklist_lock);
- nr = 2;
- idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
+ xa_for_each_range(&pid_ns->xa, nr, pid, 2, ULONG_MAX) {
task = pid_task(pid, PIDTYPE_PID);
if (task && !__fatal_signal_pending(task))
group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);
--
2.37.3

2022-12-12 02:12:19

by Ian Kent

Subject: Re: [PATCH v3 3/5] pid: switch pid_namespace from idr to xarray

On 3/12/22 01:16, Brian Foster wrote:
> Switch struct pid[_namespace] management over to use the xarray api
> directly instead of the idr. The underlying data structures used by
> both interfaces is the same. The difference is that the idr api
> relies on the old, idr-custom radix-tree implementation for things
> like efficient tracking/allocation of free ids. The xarray already
> supports this, so most of this is a direct switchover from the old
> api to the new.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Brian Foster <[email protected]>


Reviewed-by: Ian Kent <[email protected]>

> ---
> include/linux/pid_namespace.h | 8 ++--
> include/linux/threads.h | 2 +-
> init/main.c | 3 +-
> kernel/pid.c | 78 ++++++++++++++++-------------------
> kernel/pid_namespace.c | 19 ++++-----
> 5 files changed, 51 insertions(+), 59 deletions(-)
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 82c72482019d..e4f5979b482b 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -9,7 +9,7 @@
> #include <linux/threads.h>
> #include <linux/nsproxy.h>
> #include <linux/ns_common.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
>
> /* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
> #define MAX_PID_NS_LEVEL 32
> @@ -17,7 +17,7 @@
> struct fs_pin;
>
> struct pid_namespace {
> - struct idr idr;
> + struct xarray xa;
> unsigned int pid_next;
> struct rcu_head rcu;
> unsigned int pid_allocated;
> @@ -38,6 +38,8 @@ extern struct pid_namespace init_pid_ns;
>
> #define PIDNS_ADDING (1U << 31)
>
> +#define PID_XA_FLAGS (XA_FLAGS_TRACK_FREE | XA_FLAGS_LOCK_IRQ)
> +
> #ifdef CONFIG_PID_NS
> static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)
> {
> @@ -85,7 +87,7 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>
> extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> void pidhash_init(void);
> -void pid_idr_init(void);
> +void pid_init(void);
>
> static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
> {
> diff --git a/include/linux/threads.h b/include/linux/threads.h
> index c34173e6c5f1..37e4391ee89f 100644
> --- a/include/linux/threads.h
> +++ b/include/linux/threads.h
> @@ -38,7 +38,7 @@
> * Define a minimum number of pids per cpu. Heuristically based
> * on original pid max of 32k for 32 cpus. Also, increase the
> * minimum settable value for pid_max on the running system based
> - * on similar defaults. See kernel/pid.c:pid_idr_init() for details.
> + * on similar defaults. See kernel/pid.c:pid_init() for details.
> */
> #define PIDS_PER_CPU_DEFAULT 1024
> #define PIDS_PER_CPU_MIN 8
> diff --git a/init/main.c b/init/main.c
> index aa21add5f7c5..7dd8888036c7 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -74,7 +74,6 @@
> #include <linux/sched.h>
> #include <linux/sched/init.h>
> #include <linux/signal.h>
> -#include <linux/idr.h>
> #include <linux/kgdb.h>
> #include <linux/ftrace.h>
> #include <linux/async.h>
> @@ -1108,7 +1107,7 @@ asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
> late_time_init();
> sched_clock_init();
> calibrate_delay();
> - pid_idr_init();
> + pid_init();
> anon_vma_init();
> #ifdef CONFIG_X86
> if (efi_enabled(EFI_RUNTIME_SERVICES))
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 2e2d33273c8e..53db06f9882d 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -41,7 +41,7 @@
> #include <linux/anon_inodes.h>
> #include <linux/sched/signal.h>
> #include <linux/sched/task.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
> #include <net/sock.h>
> #include <uapi/linux/pidfd.h>
>
> @@ -66,15 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
> int pid_max_min = RESERVED_PIDS + 1;
> int pid_max_max = PID_MAX_LIMIT;
>
> -/*
> - * PID-map pages start out as NULL, they get allocated upon
> - * first use and are never deallocated. This way a low pid_max
> - * value does not cause lots of bitmaps to be allocated, but
> - * the scheme scales to up to 4 million PIDs, runtime.
> - */
> struct pid_namespace init_pid_ns = {
> .ns.count = REFCOUNT_INIT(2),
> - .idr = IDR_INIT(init_pid_ns.idr),
> + .xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
> .pid_next = 0,
> .pid_allocated = PIDNS_ADDING,
> .level = 0,
> @@ -118,7 +112,7 @@ void free_pid(struct pid *pid)
> struct upid *upid = pid->numbers + i;
> struct pid_namespace *ns = upid->ns;
>
> - xa_lock_irqsave(&ns->idr.idr_rt, flags);
> + xa_lock_irqsave(&ns->xa, flags);
> switch (--ns->pid_allocated) {
> case 2:
> case 1:
> @@ -135,8 +129,8 @@ void free_pid(struct pid *pid)
> break;
> }
>
> - idr_remove(&ns->idr, upid->nr);
> - xa_unlock_irqrestore(&ns->idr.idr_rt, flags);
> + __xa_erase(&ns->xa, upid->nr);
> + xa_unlock_irqrestore(&ns->xa, flags);
> }
>
> call_rcu(&pid->rcu, delayed_put_pid);
> @@ -147,7 +141,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> {
> struct pid *pid;
> enum pid_type type;
> - int i, nr;
> + int i;
> struct pid_namespace *tmp;
> struct upid *upid;
> int retval = -ENOMEM;
> @@ -191,18 +185,17 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> set_tid_size--;
> }
>
> - idr_preload(GFP_KERNEL);
> - xa_lock_irq(&tmp->idr.idr_rt);
> + xa_lock_irq(&tmp->xa);
>
> if (tid) {
> - nr = idr_alloc(&tmp->idr, NULL, tid,
> - tid + 1, GFP_ATOMIC);
> + retval = __xa_insert(&tmp->xa, tid, NULL, GFP_KERNEL);
> +
> /*
> - * If ENOSPC is returned it means that the PID is
> - * alreay in use. Return EEXIST in that case.
> + * If EBUSY is returned it means that the PID is already
> + * in use. Return EEXIST in that case.
> */
> - if (nr == -ENOSPC)
> - nr = -EEXIST;
> + if (retval == -EBUSY)
> + retval = -EEXIST;
> } else {
> int pid_min = 1;
> /*
> @@ -216,19 +209,18 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> * Store a null pointer so find_pid_ns does not find
> * a partially initialized PID (see below).
> */
> - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> - pid_max, GFP_ATOMIC);
> - tmp->pid_next = nr + 1;
> + retval = __xa_alloc_cyclic(&tmp->xa, &tid, NULL,
> + XA_LIMIT(pid_min, pid_max),
> + &tmp->pid_next, GFP_KERNEL);
> + if (retval == -EBUSY)
> + retval = -EAGAIN;
> }
> - xa_unlock_irq(&tmp->idr.idr_rt);
> - idr_preload_end();
> + xa_unlock_irq(&tmp->xa);
>
> - if (nr < 0) {
> - retval = (nr == -ENOSPC) ? -EAGAIN : nr;
> + if (retval < 0)
> goto out_free;
> - }
>
> - pid->numbers[i].nr = nr;
> + pid->numbers[i].nr = tid;
> pid->numbers[i].ns = tmp;
> tmp = tmp->parent;
> }
> @@ -256,17 +248,17 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> for ( ; upid >= pid->numbers; --upid) {
> tmp = upid->ns;
>
> - xa_lock_irq(&tmp->idr.idr_rt);
> + xa_lock_irq(&tmp->xa);
> if (tmp == ns && !(tmp->pid_allocated & PIDNS_ADDING)) {
> - xa_unlock_irq(&tmp->idr.idr_rt);
> + xa_unlock_irq(&tmp->xa);
> put_pid_ns(ns);
> goto out_free;
> }
>
> /* Make the PID visible to find_pid_ns. */
> - idr_replace(&tmp->idr, pid, upid->nr);
> + __xa_store(&tmp->xa, upid->nr, pid, 0);
> tmp->pid_allocated++;
> - xa_unlock_irq(&tmp->idr.idr_rt);
> + xa_unlock_irq(&tmp->xa);
> }
>
> return pid;
> @@ -276,14 +268,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> upid = pid->numbers + i;
> tmp = upid->ns;
>
> - xa_lock_irq(&tmp->idr.idr_rt);
> + xa_lock_irq(&tmp->xa);
>
> /* On failure to allocate the first pid, reset the state */
> if (tmp == ns && tmp->pid_allocated == PIDNS_ADDING)
> ns->pid_next = 0;
>
> - idr_remove(&tmp->idr, upid->nr);
> - xa_unlock_irq(&tmp->idr.idr_rt);
> + __xa_erase(&tmp->xa, upid->nr);
> + xa_unlock_irq(&tmp->xa);
> }
>
> kmem_cache_free(ns->pid_cachep, pid);
> @@ -292,14 +284,14 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
> void disable_pid_allocation(struct pid_namespace *ns)
> {
> - xa_lock_irq(&ns->idr.idr_rt);
> + xa_lock_irq(&ns->xa);
> ns->pid_allocated &= ~PIDNS_ADDING;
> - xa_unlock_irq(&ns->idr.idr_rt);
> + xa_unlock_irq(&ns->xa);
> }
>
> struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
> {
> - return idr_find(&ns->idr, nr);
> + return xa_load(&ns->xa, nr);
> }
> EXPORT_SYMBOL_GPL(find_pid_ns);
>
> @@ -508,7 +500,9 @@ EXPORT_SYMBOL_GPL(task_active_pid_ns);
> */
> struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> {
> - return idr_get_next(&ns->idr, &nr);
> + unsigned long index = nr;
> +
> + return xa_find(&ns->xa, &index, ULONG_MAX, XA_PRESENT);
> }
> EXPORT_SYMBOL_GPL(find_ge_pid);
>
> @@ -650,7 +644,7 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> * take it we can leave the interrupts enabled. For now it is easier to be safe
> * than to prove it can't happen.
> */
> -void __init pid_idr_init(void)
> +void __init pid_init(void)
> {
> /* Verify no one has done anything silly: */
> BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
> @@ -662,8 +656,6 @@ void __init pid_idr_init(void)
> PIDS_PER_CPU_MIN * num_possible_cpus());
> pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
>
> - idr_init(&init_pid_ns.idr);
> -
> init_pid_ns.pid_cachep = KMEM_CACHE(pid,
> SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
> }
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index a53d20c5c85e..8561e01e2d01 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -22,7 +22,7 @@
> #include <linux/export.h>
> #include <linux/sched/task.h>
> #include <linux/sched/signal.h>
> -#include <linux/idr.h>
> +#include <linux/xarray.h>
>
> static DEFINE_MUTEX(pid_caches_mutex);
> static struct kmem_cache *pid_ns_cachep;
> @@ -92,15 +92,15 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> if (ns == NULL)
> goto out_dec;
>
> - idr_init(&ns->idr);
> + xa_init_flags(&ns->xa, PID_XA_FLAGS);
>
> ns->pid_cachep = create_pid_cachep(level);
> if (ns->pid_cachep == NULL)
> - goto out_free_idr;
> + goto out_free_xa;
>
> err = ns_alloc_inum(&ns->ns);
> if (err)
> - goto out_free_idr;
> + goto out_free_xa;
> ns->ns.ops = &pidns_operations;
>
> refcount_set(&ns->ns.count, 1);
> @@ -112,8 +112,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>
> return ns;
>
> -out_free_idr:
> - idr_destroy(&ns->idr);
> +out_free_xa:
> + xa_destroy(&ns->xa);
> kmem_cache_free(pid_ns_cachep, ns);
> out_dec:
> dec_pid_namespaces(ucounts);
> @@ -135,7 +135,7 @@ static void destroy_pid_namespace(struct pid_namespace *ns)
> {
> ns_free_inum(&ns->ns);
>
> - idr_destroy(&ns->idr);
> + xa_destroy(&ns->xa);
> call_rcu(&ns->rcu, delayed_free_pidns);
> }
>
> @@ -165,7 +165,7 @@ EXPORT_SYMBOL_GPL(put_pid_ns);
>
> void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> {
> - int nr;
> + long nr;
> int rc;
> struct task_struct *task, *me = current;
> int init_pids = thread_group_leader(me) ? 1 : 2;
> @@ -198,8 +198,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> */
> rcu_read_lock();
> read_lock(&tasklist_lock);
> - nr = 2;
> - idr_for_each_entry_continue(&pid_ns->idr, pid, nr) {
> + xa_for_each_range(&pid_ns->xa, nr, pid, 2, ULONG_MAX) {
> task = pid_task(pid, PIDTYPE_PID);
> if (task && !__fatal_signal_pending(task))
> group_send_sig_info(SIGKILL, SEND_SIG_PRIV, task, PIDTYPE_MAX);

2022-12-12 02:36:03

by Ian Kent

Subject: Re: [PATCH v3 0/5] proc: improve root readdir latency with many threads

On 3/12/22 01:16, Brian Foster wrote:
> Hi all,
>
> Here's v3 of the /proc readdir optimization patches. See v1 for the full
> introductory cover letter.
>
> Most of the feedback received to this point has been around switching
> the pid code over to use the xarray api instead of the idr. Matt Wilcox
> posted most of the code to do that. I cleaned it up a bit and posted a
> standalone series for that here [1], but didn't receive any feedback.
> Patches 1-3 of this series are essentially a repost of [1].
>
> Patches 4-5 are otherwise mostly the same as v2 outside of switching
> over to use the xarray bits instead of the idr/radix-tree.
>
> Thoughts, reviews, flames appreciated.


It looks like not much has happened with this change so far.

Matthew, could we at least include this in linux-next, to see if
there is anything obvious to worry about since we are fiddling
with the pid numbering ... is there anything we need to do
differently for these to be included in next?


Ian

>
> Brian
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> v3:
> - Drop radix-tree fixups.
> - Convert pid idr usage to xarray.
> - Replace tgid radix-tree tag set/lookup to use xarray mark.
> v2: https://lore.kernel.org/linux-fsdevel/[email protected]/
> - Clean up idr helpers to be more generic.
> - Use ->idr_base properly.
> - Lift tgid iteration helper into pid.c to abstract tag logic from
> users.
> v1: https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> Brian Foster (5):
> pid: replace pidmap_lock with xarray lock
> pid: split cyclic id allocation cursor from idr
> pid: switch pid_namespace from idr to xarray
> pid: mark pids associated with group leader tasks
> procfs: use efficient tgid pid search on root readdir
>
> arch/powerpc/platforms/cell/spufs/sched.c | 2 +-
> fs/proc/base.c | 17 +--
> fs/proc/loadavg.c | 2 +-
> include/linux/pid.h | 3 +-
> include/linux/pid_namespace.h | 9 +-
> include/linux/threads.h | 2 +-
> init/main.c | 3 +-
> kernel/fork.c | 2 +-
> kernel/pid.c | 177 +++++++++++++---------
> kernel/pid_namespace.c | 23 ++-
> 10 files changed, 132 insertions(+), 108 deletions(-)
>

2022-12-12 02:36:23

by Ian Kent

Subject: Re: [PATCH v3 2/5] pid: split cyclic id allocation cursor from idr

On 3/12/22 01:16, Brian Foster wrote:
> As a next step in separating pid allocation from the idr, split off
> the cyclic pid allocation cursor from the idr. Lift the cursor value
> into the struct pid_namespace. Note that this involves temporarily
> open-coding the cursor increment on allocation, but this is cleaned
> up in the subsequent patch.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Brian Foster <[email protected]>


Reviewed-by: Ian Kent <[email protected]>

> ---
> arch/powerpc/platforms/cell/spufs/sched.c | 2 +-
> fs/proc/loadavg.c | 2 +-
> include/linux/pid_namespace.h | 1 +
> kernel/pid.c | 6 ++++--
> kernel/pid_namespace.c | 4 ++--
> 5 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
> index 99bd027a7f7c..a2ed928d7658 100644
> --- a/arch/powerpc/platforms/cell/spufs/sched.c
> +++ b/arch/powerpc/platforms/cell/spufs/sched.c
> @@ -1072,7 +1072,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
> LOAD_INT(c), LOAD_FRAC(c),
> count_active_contexts(),
> atomic_read(&nr_spu_contexts),
> - idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> + READ_ONCE(task_active_pid_ns(current)->pid_next) - 1);
> return 0;
> }
> #endif
> diff --git a/fs/proc/loadavg.c b/fs/proc/loadavg.c
> index 817981e57223..2740b31b6461 100644
> --- a/fs/proc/loadavg.c
> +++ b/fs/proc/loadavg.c
> @@ -22,7 +22,7 @@ static int loadavg_proc_show(struct seq_file *m, void *v)
> LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
> LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
> nr_running(), nr_threads,
> - idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> + READ_ONCE(task_active_pid_ns(current)->pid_next) - 1);
> return 0;
> }
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 07481bb87d4e..82c72482019d 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -18,6 +18,7 @@ struct fs_pin;
>
> struct pid_namespace {
> struct idr idr;
> + unsigned int pid_next;
> struct rcu_head rcu;
> unsigned int pid_allocated;
> struct task_struct *child_reaper;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 3622f8b13143..2e2d33273c8e 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -75,6 +75,7 @@ int pid_max_max = PID_MAX_LIMIT;
> struct pid_namespace init_pid_ns = {
> .ns.count = REFCOUNT_INIT(2),
> .idr = IDR_INIT(init_pid_ns.idr),
> + .pid_next = 0,
> .pid_allocated = PIDNS_ADDING,
> .level = 0,
> .child_reaper = &init_task,
> @@ -208,7 +209,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> * init really needs pid 1, but after reaching the
> * maximum wrap back to RESERVED_PIDS
> */
> - if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
> + if (tmp->pid_next > RESERVED_PIDS)
> pid_min = RESERVED_PIDS;
>
> /*
> @@ -217,6 +218,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> */
> nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> pid_max, GFP_ATOMIC);
> + tmp->pid_next = nr + 1;
> }
> xa_unlock_irq(&tmp->idr.idr_rt);
> idr_preload_end();
> @@ -278,7 +280,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
> /* On failure to allocate the first pid, reset the state */
> if (tmp == ns && tmp->pid_allocated == PIDNS_ADDING)
> - idr_set_cursor(&ns->idr, 0);
> + ns->pid_next = 0;
>
> idr_remove(&tmp->idr, upid->nr);
> xa_unlock_irq(&tmp->idr.idr_rt);
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index f4f8cb0435b4..a53d20c5c85e 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -272,12 +272,12 @@ static int pid_ns_ctl_handler(struct ctl_table *table, int write,
> * it should synchronize its usage with external means.
> */
>
> - next = idr_get_cursor(&pid_ns->idr) - 1;
> + next = READ_ONCE(pid_ns->pid_next) - 1;
>
> tmp.data = &next;
> ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> if (!ret && write)
> - idr_set_cursor(&pid_ns->idr, next + 1);
> + WRITE_ONCE(pid_ns->pid_next, next + 1);
>
> return ret;
> }

2022-12-12 02:37:01

by Ian Kent

Subject: Re: [PATCH v3 4/5] pid: mark pids associated with group leader tasks

On 3/12/22 01:16, Brian Foster wrote:
> Searching the pid_namespace for group leader tasks is a fairly
> inefficient operation. Listing the root directory of a procfs mount
> performs a linear scan of allocated pids, checking each entry for an
> associated PIDTYPE_TGID task to determine whether to populate a
> directory entry. This can cause a significant increase in readdir()
> syscall latency when run in namespaces that might have one or more
> processes with significant thread counts.
>
> To facilitate improved TGID pid searches, mark the ids of pid
> entries that are likely to have an associated PIDTYPE_TGID task. To
> keep the code simple and avoid having to maintain synchronization
> between mark state and post-fork pid-task association changes, the
> mark is applied to all pids allocated for tasks cloned without
> CLONE_THREAD.
>
> This means that it is possible for a pid to remain marked in the
> xarray after being disassociated from the group leader task. For
> example, a process that does a setsid() followed by fork() and
> exit() (to daemonize) will remain associated with the original pid
> for the session, but link with the child pid as the group leader.
> OTOH, the only place other than fork() where a tgid association
> occurs is in the exec() path, which kills all other tasks in the
> group and associates the current task with the preexisting leader
> pid. Therefore, the semantics of the mark are that false positives
> (marked pids without PIDTYPE_TGID tasks) are possible, but false
> negatives (unmarked pids with PIDTYPE_TGID tasks) should never
> occur.
>
> This is an effective optimization because false positives are fairly
> uncommon and add no overhead (i.e. we already have to check
> pid_task() for marked entries), while the mark still filters out
> thread pids that are guaranteed not to have a TGID task association.
>
> Mark entries in the pid allocation path when the caller specifies
> that the pid associates with a new thread group. Since false
> negatives are not allowed, warn in the event that a PIDTYPE_TGID
> task is ever attached to an unmarked pid. Finally, create a helper
> to implement the task search based on the mark semantics defined
> above (based on search logic currently implemented by next_tgid() in
> procfs).

Yes, the tricky bit, but the analysis sounds thorough so it
should work for all cases ...


>
> Signed-off-by: Brian Foster <[email protected]>


Reviewed-by: Ian Kent <[email protected]>

> ---
> include/linux/pid.h | 3 ++-
> kernel/fork.c | 2 +-
> kernel/pid.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
> 3 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 343abf22092e..64caf21be256 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -132,9 +132,10 @@ extern struct pid *find_vpid(int nr);
> */
> extern struct pid *find_get_pid(int nr);
> extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> +struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *);
>
> extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> - size_t set_tid_size);
> + size_t set_tid_size, bool group_leader);
> extern void free_pid(struct pid *pid);
> extern void disable_pid_allocation(struct pid_namespace *ns);
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 08969f5aa38d..1cf2644c642e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2267,7 +2267,7 @@ static __latent_entropy struct task_struct *copy_process(
>
> if (pid != &init_struct_pid) {
> pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
> - args->set_tid_size);
> + args->set_tid_size, !(clone_flags & CLONE_THREAD));
> if (IS_ERR(pid)) {
> retval = PTR_ERR(pid);
> goto bad_fork_cleanup_thread;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 53db06f9882d..d65f74c6186c 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -66,6 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
> int pid_max_min = RESERVED_PIDS + 1;
> int pid_max_max = PID_MAX_LIMIT;
>
> +/* MARK_0 used by XA_FREE_MARK */
> +#define TGID_MARK XA_MARK_1
> +
> struct pid_namespace init_pid_ns = {
> .ns.count = REFCOUNT_INIT(2),
> .xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
> @@ -137,7 +140,7 @@ void free_pid(struct pid *pid)
> }
>
> struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> - size_t set_tid_size)
> + size_t set_tid_size, bool group_leader)
> {
> struct pid *pid;
> enum pid_type type;
> @@ -257,6 +260,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
> /* Make the PID visible to find_pid_ns. */
> __xa_store(&tmp->xa, upid->nr, pid, 0);
> + if (group_leader)
> + __xa_set_mark(&tmp->xa, upid->nr, TGID_MARK);
> tmp->pid_allocated++;
> xa_unlock_irq(&tmp->xa);
> }
> @@ -314,6 +319,11 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
> void attach_pid(struct task_struct *task, enum pid_type type)
> {
> struct pid *pid = *task_pid_ptr(task, type);
> + struct pid_namespace *pid_ns = ns_of_pid(pid);
> + pid_t pid_nr = pid_nr_ns(pid, pid_ns);
> +
> + WARN_ON(type == PIDTYPE_TGID &&
> + !xa_get_mark(&pid_ns->xa, pid_nr, TGID_MARK));
> hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
> }
>
> @@ -506,6 +516,38 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
> }
> EXPORT_SYMBOL_GPL(find_ge_pid);
>
> +/*
> + * Used by proc to find the first thread group leader task with an id greater
> + * than or equal to *id.
> + *
> + * Use the xarray mark as a hint to find the next best pid. The mark does not
> + * guarantee a linked group leader task exists, so retry until a suitable entry
> + * is found.
> + */
> +struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *ns)
> +{
> + struct pid *pid;
> + struct task_struct *t;
> + unsigned long nr = *id;
> +
> + rcu_read_lock();
> + do {
> + pid = xa_find(&ns->xa, &nr, ULONG_MAX, TGID_MARK);
> + if (!pid) {
> + rcu_read_unlock();
> + return NULL;
> + }
> + t = pid_task(pid, PIDTYPE_TGID);
> + nr++;
> + } while (!t);
> +
> + *id = pid_nr_ns(pid, ns);
> + get_task_struct(t);
> + rcu_read_unlock();
> +
> + return t;
> +}
> +
> struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
> {
> struct fd f;

2022-12-13 02:31:11

by Yujie Liu

Subject: Re: [PATCH v3 4/5] pid: mark pids associated with group leader tasks

Greetings,

FYI, we noticed a -4.7% regression of stress-ng.vfork.ops_per_sec due to commit:

commit: 88294e6f6d1e1a9169cc9b715050bd8b52ac5f44 ("[PATCH v3 4/5] pid: mark pids associated with group leader tasks")
url: https://github.com/intel-lab-lkp/linux/commits/Brian-Foster/proc-improve-root-readdir-latency-with-many-threads/20221203-012018
base: https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git next
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH v3 4/5] pid: mark pids associated with group leader tasks

in testcase: stress-ng
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
with following parameters:

nr_threads: 100%
testtime: 60s
sc_pid_max: 4194304
class: scheduler
test: vfork
cpufreq_governor: performance


Details are as below:

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/sc_pid_max/tbox_group/test/testcase/testtime:
scheduler/gcc-11/performance/x86_64-rhel-8.3/100%/debian-11.1-x86_64-20220510.cgz/4194304/lkp-icl-2sp5/vfork/stress-ng/60s

commit:
eae2900480 ("pid: switch pid_namespace from idr to xarray")
88294e6f6d ("pid: mark pids associated with group leader tasks")

eae2900480d61b93 88294e6f6d1e1a9169cc9b71505
---------------- ---------------------------
                 %stddev      %change        %stddev
  26176589 ± 3%      -5.4%   24757728        stress-ng.time.voluntary_context_switches
  11320148           -4.7%   10789611        stress-ng.vfork.ops
    188669           -4.7%     179826        stress-ng.vfork.ops_per_sec
     48483 ± 13%    +17.3%      56864 ± 3%   numa-vmstat.node1.nr_slab_unreclaimable
    721230 ± 4%      -5.7%     679849        vmstat.system.cs
   1192641 ± 3%     -21.4%     937830        sched_debug.cpu.curr->pid.max
    469229 ± 9%     -18.1%     384233 ± 5%   sched_debug.cpu.curr->pid.stddev
    768469 ± 4%      -6.0%     722315        perf-stat.i.context-switches
      0.09 ± 5%      -0.0        0.07 ± 3%   perf-stat.i.dTLB-load-miss-rate%
  10758930 ± 5%     -11.1%    9564450 ± 3%   perf-stat.i.dTLB-load-misses
   2480878 ± 2%      -4.1%    2380112 ± 2%   perf-stat.i.dTLB-store-misses
      0.15           +2.7%       0.15        perf-stat.i.ipc
  23566766           -3.5%   22751863        perf-stat.i.node-load-misses
  10701300           -4.6%   10206578        perf-stat.i.node-store-misses
      0.09 ± 4%      -0.0        0.08 ± 4%   perf-stat.overall.dTLB-load-miss-rate%
    743961 ± 4%      -6.0%     699259        perf-stat.ps.context-switches
  10436726 ± 5%     -11.3%    9261813 ± 4%   perf-stat.ps.dTLB-load-misses
  22838094           -3.4%   22057626        perf-stat.ps.node-load-misses
  10394494           -4.3%    9950784        perf-stat.ps.node-store-misses



If you fix the issue, kindly add the following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
config-6.1.0-rc2-00155-g88294e6f6d1e (168.50 kB)
job-script (8.26 kB)
job.yaml (5.61 kB)
reproduce (394.00 B)