the attached patch is yet another step towards world-class threading
support.
the biggest known load-test of the new threading code so far was 1 million
concurrent kernel threads (!) running on Anton's 2.5.34 Linux box. On my
testsystem i was able to test 100,000 concurrent kernel threads, which can
be started and stopped within 2 seconds.
While even the biggest internet servers are not quite at 1 million
parallel users yet, even a much smaller 10,000 threads workload causes the
O(N^2) get_pid() algorithm to explode, if consecutive PID ranges are taken
and the PID value overflows and reaches the lower end of the PID range.
With 100,000 or more threads get_pid() causes catastrophic, many minutes
silent 'lockup' of the system. Plus the current get_pid() implementation
iterates through *every* thread in the system if it reaches a PID that was
used up before - which can happen quite often if the rate of thread
creation/destruction is sufficiently high. Besides being slow, the
algorithm also touches lots of unrelated cachelines, effectively flushing
the CPU cache. Eg. for the default pid_max value of 32K threads/processes,
if there are only 10% processes (3200), statistically we will flush the
cache for every 10 threads created. This is unacceptable.
there are a number of patches floating around that try to improve the
worst-case scenario of get_pid(), but the best they can achieve is a
runtime construction of a bitmap and then searching it => this still sucks
performance-wise, destroys the cache and is generally a very ugly
approach.
the underlying problem is very hard - since get_pid() not only has to take
PIDs into account, but TGIDs, session IDs and process groups as well.
then i found one of wli's older patches for 2.5.23 [grr, it was not
announced anywhere, i almost started coding the same problem], which
provides the right (and much harder to implement) approach: it cleans up
PID-space allocation to provide a generic hash for PIDs, session IDs,
process group IDs and TGIDs, properly allocated and freed. This approach,
besides paving the way for a scalable and time-bounded get_pid()
implementation, also got rid of roughly half of for_each_process()
(do_each_thread()) iterations done in the kernel, which alone is worth the
effort. Now we can cleanly iterate through all processes in a session
group or process group.
i took the patch, adopted it to the recent ->ptrace_children and threading
related changes, fixed a couple of bugs and made it work. It really worked
well, nice work William!
I also wrote a new alloc_pid()/free_pid() implementation from scratch,
which provides lockless, time-bounded PID allocation. This new PID
allocator has a worst-case latency of 10 microseconds on a cache-cold P4,
the cache-hot worst-case latency is 2 usecs, if pid_max is set to 1
million.
Ie. even in the most hopeless situation, if there are 999,999 PIDs
allocated already, it takes less than 10 usecs to find and allocate the
remaining one PID. The common fastpath is a couple of instructions only.
The overhead of skipping over continuous regions of allocated PIDs scales
gracefully with the number of bits to be skipped, from 0 to 10 usecs.
(In the fastpath, both the alloc_pid() and free_pid() function falls
through to a 'ret' instruction if compiled with gcc 3.2 on x86.)
i tested the new PID allocation functions under heavy thread creation
workloads, and the new functions just do not show up in the kernel
profiler ...
[ on SMP the new PID allocator has a small window to not follow the
'monotonic forward allocation of PIDs' semantics provided by the previous
implementation - but it's not like we can guarantee any PID allocation
sequence to user-space even with the current get_pid() allocation. The new
allocator still follows the last_pid semantics in the typical case. The
reserved PID range is protected as well. ]
memory footprint of the new PID allocator scales dynamically with
/proc/sys/kernel/pid_max: the default 32K PIDs cause a 4K allocation, a
pid_max of 1 million causes a 128K footprint. The current absolute limit
for pid_max is 4 million PIDs - this does not cause any allocation in the
kernel, the bitmaps are demand-allocated runtime. The pidmap table takes
up 512 bytes.
and as an added bonus, the new PID allocator fails in fork() properly if
the whole PID space is used up. BK-curr's get_pid() still didnt do this
properly.
i have tested the patch on BK-curr, on x86 UP and SMP boxes - it compiles,
boots and works just fine. X and the other session/pgrp-intensive
applications appear to work just fine as well.
Ingo
--- linux/drivers/char/tty_io.c.orig Wed Sep 18 00:16:42 2002
+++ linux/drivers/char/tty_io.c Wed Sep 18 00:18:06 2002
@@ -430,8 +430,9 @@
{
struct tty_struct *tty = (struct tty_struct *) data;
struct file * cons_filp = NULL;
- struct task_struct *p;
- struct list_head *l;
+ task_t *task;
+ struct list_head *l, *elem;
+ struct idtag *idtag;
int closecount = 0, n;
if (!tty)
@@ -494,20 +495,33 @@
"error %d\n", -i);
}
}
-
+
+ if (tty->session <= 0)
+ goto breakout;
+
read_lock(&tasklist_lock);
- for_each_process(p) {
- if ((tty->session > 0) && (p->session == tty->session) &&
- p->leader) {
- send_sig(SIGHUP,p,1);
- send_sig(SIGCONT,p,1);
- if (tty->pgrp > 0)
- p->tty_old_pgrp = tty->pgrp;
- }
- if (p->tty == tty)
- p->tty = NULL;
+ idtag = find_get_tag(IDTAG_SID, tty->session);
+ if (!idtag)
+ goto breakout_unlock;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_SID);
+
+ if (task->tty == tty)
+ task->tty = NULL;
+
+ if (!task->leader)
+ continue;
+
+ send_sig(SIGHUP, task, 1);
+ send_sig(SIGCONT, task, 1);
+ if (tty->pgrp > 0)
+ task->tty_old_pgrp = tty->pgrp;
}
+ put_tag(idtag);
+breakout_unlock:
read_unlock(&tasklist_lock);
+breakout:
tty->flags = 0;
tty->session = 0;
@@ -570,7 +584,9 @@
void disassociate_ctty(int on_exit)
{
struct tty_struct *tty = current->tty;
- struct task_struct *p;
+ task_t *task;
+ struct list_head *elem;
+ struct idtag *idtag;
int tty_pgrp = -1;
lock_kernel();
@@ -598,9 +614,17 @@
tty->pgrp = -1;
read_lock(&tasklist_lock);
- for_each_process(p)
- if (p->session == current->session)
- p->tty = NULL;
+ idtag = find_get_tag(IDTAG_SID, current->session);
+
+ if (!idtag)
+ goto out_unlock;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_SID);
+ task->tty = NULL;
+ }
+ put_tag(idtag);
+out_unlock:
read_unlock(&tasklist_lock);
unlock_kernel();
}
@@ -1220,15 +1244,32 @@
* tty. Also, clear redirect if it points to either tty.
*/
if (tty_closing || o_tty_closing) {
- struct task_struct *p;
+ task_t *task;
+ struct list_head *elem;
+ struct idtag *idtag;
read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->tty == tty || (o_tty && p->tty == o_tty))
- p->tty = NULL;
+ idtag = find_get_tag(IDTAG_SID, tty->session);
+ if (!idtag)
+ goto detach_o_tty;
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_SID);
+ task->tty = NULL;
+ }
+ put_tag(idtag);
+detach_o_tty:
+ if (!o_tty)
+ goto out_unlock;
+ idtag = find_get_tag(IDTAG_SID, o_tty->session);
+ if (!idtag)
+ goto out_unlock;
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_SID);
+ task->tty = NULL;
}
+ put_tag(idtag);
+out_unlock:
read_unlock(&tasklist_lock);
-
if (redirect == tty || (o_tty && redirect == o_tty))
redirect = NULL;
}
@@ -1540,6 +1581,10 @@
static int tiocsctty(struct tty_struct *tty, int arg)
{
+ task_t *task;
+ struct list_head *elem;
+ struct idtag *idtag;
+
if (current->leader &&
(current->session == tty->session))
return 0;
@@ -1549,25 +1594,34 @@
*/
if (!current->leader || current->tty)
return -EPERM;
- if (tty->session > 0) {
+
+ if (tty->session <= 0)
+ goto out_no_detach;
+
+ /*
+ * This tty is already the controlling
+ * tty for another session group!
+ */
+ if ((arg == 1) && capable(CAP_SYS_ADMIN)) {
/*
- * This tty is already the controlling
- * tty for another session group!
+ * Steal it away
*/
- if ((arg == 1) && capable(CAP_SYS_ADMIN)) {
- /*
- * Steal it away
- */
- struct task_struct *p;
-
- read_lock(&tasklist_lock);
- for_each_process(p)
- if (p->tty == tty)
- p->tty = NULL;
- read_unlock(&tasklist_lock);
- } else
- return -EPERM;
- }
+
+ read_lock(&tasklist_lock);
+ idtag = find_get_tag(IDTAG_SID, tty->session);
+ if (!idtag)
+ goto out_unlock;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_SID);
+ task->tty = NULL;
+ }
+ put_tag(idtag);
+out_unlock:
+ read_unlock(&tasklist_lock);
+ } else
+ return -EPERM;
+out_no_detach:
task_lock(current);
current->tty = tty;
task_unlock(current);
--- linux/include/linux/idtag.h.orig Wed Sep 18 00:18:06 2002
+++ linux/include/linux/idtag.h Wed Sep 18 00:18:06 2002
@@ -0,0 +1,57 @@
+#ifndef _LINUX_IDTAG_H
+#define _LINUX_IDTAG_H
+
+#include <linux/list.h>
+
+struct task_struct;
+
+enum idtag_type
+{
+ IDTAG_PID,
+ IDTAG_PGID,
+ IDTAG_SID,
+ IDTAG_TGID,
+ IDTAG_MAX
+};
+
+struct idtag
+{
+ unsigned long tag;
+ enum idtag_type type;
+ atomic_t count;
+ struct list_head idtag_hash_chain;
+ struct list_head task_list;
+};
+
+struct idtag_link
+{
+ unsigned long tag;
+ struct list_head idtag_chain;
+ struct idtag *idtag;
+};
+
+#define idtag_task(elem, tagtype) \
+ list_entry(elem, struct task_struct, idtags[(int)(tagtype)].idtag_chain)
+
+/*
+ * attach_tag() must be called without the tasklist_lock.
+ */
+extern int attach_tag(struct task_struct *, enum idtag_type, unsigned long);
+
+/*
+ * detach_tag() must be called with the tasklist_lock held.
+ */
+extern int detach_tag(struct task_struct *task, enum idtag_type);
+
+/*
+ * Quick & dirty hash table lookup.
+ */
+extern struct idtag *FASTCALL(get_tag(enum idtag_type, unsigned long));
+extern struct idtag *FASTCALL(find_get_tag(enum idtag_type, unsigned long));
+extern void put_tag(struct idtag *);
+extern int idtag_unused(unsigned long);
+
+extern int alloc_pid(void);
+extern void free_pid(unsigned long);
+
+#endif /* _LINUX_IDTAG_H */
--- linux/include/linux/init_task.h.orig Wed Sep 18 00:16:43 2002
+++ linux/include/linux/init_task.h Wed Sep 18 00:18:06 2002
@@ -75,6 +75,7 @@
.parent = &tsk, \
.children = LIST_HEAD_INIT(tsk.children), \
.sibling = LIST_HEAD_INIT(tsk.sibling), \
+ .user_task_list = LIST_HEAD_INIT(tsk.user_task_list), \
.group_leader = &tsk, \
.thread_group = LIST_HEAD_INIT(tsk.thread_group), \
.wait_chldexit = __WAIT_QUEUE_HEAD_INITIALIZER(tsk.wait_chldexit),\
--- linux/include/linux/sched.h.orig Wed Sep 18 00:17:43 2002
+++ linux/include/linux/sched.h Wed Sep 18 00:18:06 2002
@@ -28,6 +28,7 @@
#include <linux/fs_struct.h>
#include <linux/compiler.h>
#include <linux/completion.h>
+#include <linux/idtag.h>
struct exec_domain;
@@ -259,6 +260,9 @@
/* Hash table maintenance information */
struct list_head uidhash_list;
uid_t uid;
+
+ rwlock_t lock;
+ struct list_head task_list;
};
#define get_current_user() ({ \
@@ -266,6 +270,8 @@
atomic_inc(&__user->__count); \
__user; })
+extern struct user_struct *find_user(uid_t);
+
extern struct user_struct root_user;
#define INIT_USER (&root_user)
@@ -329,6 +335,10 @@
/* PID hash table linkage. */
struct task_struct *pidhash_next;
struct task_struct **pidhash_pprev;
+
+ struct idtag_link idtags[IDTAG_MAX];
+
+ struct list_head user_task_list;
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
--- linux/include/linux/threads.h.orig Wed Sep 18 00:16:42 2002
+++ linux/include/linux/threads.h Wed Sep 18 00:18:06 2002
@@ -17,8 +17,13 @@
#define MIN_THREADS_LEFT_FOR_ROOT 4
/*
- * This controls the maximum pid allocated to a process
+ * This controls the default maximum pid allocated to a process
*/
-#define DEFAULT_PID_MAX 0x8000
+#define PID_MAX_DEFAULT 0x8000
+
+/*
+ * A maximum of 4 million PIDs should be enough for a while:
+ */
+#define PID_MAX_LIMIT (4*1024*1024)
#endif
--- linux/include/asm-i386/bitops.h.orig Sun Jun 9 07:30:40 2002
+++ linux/include/asm-i386/bitops.h Wed Sep 18 00:18:06 2002
@@ -401,6 +401,20 @@
}
/**
+ * lg - integer logarithm base 2
+ * @n - integer to take log base 2 of
+ *
+ * undefined if 0
+ */
+static inline unsigned long lg(unsigned long n)
+{
+ asm("bsrl %1,%0"
+ :"=r" (n)
+ :"r" (n));
+ return n;
+}
+
+/**
* __ffs - find first bit in word.
* @word: The word to search
*
--- linux/include/asm-i386/types.h.orig Sun Jun 9 07:26:52 2002
+++ linux/include/asm-i386/types.h Wed Sep 18 00:18:06 2002
@@ -41,7 +41,8 @@
typedef signed long long s64;
typedef unsigned long long u64;
-#define BITS_PER_LONG 32
+#define BITS_PER_LONG_SHIFT 5
+#define BITS_PER_LONG (1 << BITS_PER_LONG_SHIFT)
/* DMA addresses come in generic and 64-bit flavours. */
--- linux/fs/fcntl.c.orig Wed Sep 18 00:16:43 2002
+++ linux/fs/fcntl.c Wed Sep 18 00:18:06 2002
@@ -480,7 +480,9 @@
void send_sigio(struct fown_struct *fown, int fd, int band)
{
- struct task_struct * p;
+ struct task_struct *task;
+ struct list_head *elem;
+ struct idtag *idtag;
int pid;
read_lock(&fown->lock);
@@ -489,18 +491,23 @@
goto out_unlock_fown;
read_lock(&tasklist_lock);
- if ( (pid > 0) && (p = find_task_by_pid(pid)) ) {
- send_sigio_to_task(p, fown, fd, band);
+ if (pid > 0) {
+ task = find_task_by_pid(pid);
+ if (task)
+ send_sigio_to_task(task, fown, fd, band);
goto out_unlock_task;
}
- for_each_process(p) {
- int match = p->pid;
- if (pid < 0)
- match = -p->pgrp;
- if (pid != match)
- continue;
- send_sigio_to_task(p, fown, fd, band);
+
+ idtag = find_get_tag(IDTAG_PGID, -pid);
+ if (!idtag)
+ goto out_unlock_task;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+
+ send_sigio_to_task(task, fown, fd, band);
}
+ put_tag(idtag);
out_unlock_task:
read_unlock(&tasklist_lock);
out_unlock_fown:
--- linux/kernel/Makefile.orig Wed Sep 18 00:16:37 2002
+++ linux/kernel/Makefile Wed Sep 18 00:18:28 2002
@@ -15,7 +15,8 @@
obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o futex.o platform.o
+ signal.o sys.o kmod.o context.o futex.o platform.o \
+ pid.o idtag.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
obj-$(CONFIG_SMP) += cpu.o
--- linux/kernel/exit.c.orig Wed Sep 18 00:17:43 2002
+++ linux/kernel/exit.c Wed Sep 18 00:18:06 2002
@@ -34,6 +34,13 @@
struct dentry *proc_dentry;
nr_threads--;
unhash_pid(p);
+ detach_tag(p, IDTAG_PID);
+ if (thread_group_leader(p)) {
+ detach_tag(p, IDTAG_TGID);
+ detach_tag(p, IDTAG_PGID);
+ detach_tag(p, IDTAG_SID);
+ }
+
REMOVE_LINKS(p);
p->pid = 0;
proc_dentry = p->proc_dentry;
@@ -57,7 +64,12 @@
BUG();
if (p != current)
wait_task_inactive(p);
+ write_lock(&p->user->lock);
atomic_dec(&p->user->processes);
+ if (list_empty(&p->user_task_list))
+ BUG();
+ list_del(&p->user_task_list);
+ write_unlock(&p->user->lock);
security_ops->task_free_security(p);
free_uid(p->user);
if (unlikely(p->ptrace)) {
@@ -108,23 +120,28 @@
*/
int session_of_pgrp(int pgrp)
{
- struct task_struct *p;
- int fallback;
+ task_t *task;
+ struct list_head *elem;
+ struct idtag *idtag;
+ int sid = -1;
- fallback = -1;
read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->session <= 0)
- continue;
- if (p->pgrp == pgrp) {
- fallback = p->session;
- break;
+ idtag = find_get_tag(IDTAG_PGID, pgrp);
+ if (!idtag)
+ goto out;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+
+ if (task->session > 0) {
+ sid = task->session;
+ goto out;
}
- if (p->pid == pgrp)
- fallback = p->session;
}
+ put_tag(idtag);
+out:
read_unlock(&tasklist_lock);
- return fallback;
+ return sid;
}
/*
@@ -135,21 +152,55 @@
*
* "I ask you, have you ever known what it is to be an orphan?"
*/
-static int __will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
+static int __will_become_orphaned_pgrp(int pgrp, task_t *ignored_task)
{
- struct task_struct *p;
-
- for_each_process(p) {
- if ((p == ignored_task) || (p->pgrp != pgrp) ||
- (p->state == TASK_ZOMBIE) ||
- (p->parent->pid == 1))
+ task_t *task;
+ struct idtag *idtag;
+ struct list_head *elem;
+ int ret = 1;
+
+ idtag = find_get_tag(IDTAG_PGID, pgrp);
+ if (!idtag)
+ goto out;
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+ if (task == ignored_task
+ || task->state == TASK_ZOMBIE
+ || task->parent->pid == 1)
continue;
- if ((p->parent->pgrp != pgrp) &&
- (p->parent->session == p->session)) {
- return 0;
+ if (task->parent->pgrp != pgrp
+ && task->parent->session == task->session) {
+ ret = 0;
+ goto out;
}
}
- return 1; /* (sighing) "Often!" */
+ put_tag(idtag);
+out:
+ return ret; /* (sighing) "Often!" */
+}
+
+static inline int __has_stopped_jobs(int pgrp)
+{
+ int retval = 0;
+ task_t *task;
+ struct list_head *elem;
+ struct idtag *idtag;
+
+ idtag = find_get_tag(IDTAG_PGID, pgrp);
+ if (!idtag)
+ goto out;
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+
+ if (task->state != TASK_STOPPED)
+ continue;
+
+ retval = 1;
+ goto out;
+ }
+ put_tag(idtag);
+out:
+ return retval;
}
static int will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
@@ -166,22 +217,6 @@
int is_orphaned_pgrp(int pgrp)
{
return will_become_orphaned_pgrp(pgrp, 0);
-}
-
-static inline int __has_stopped_jobs(int pgrp)
-{
- int retval = 0;
- struct task_struct * p;
-
- for_each_process(p) {
- if (p->pgrp != pgrp)
- continue;
- if (p->state != TASK_STOPPED)
- continue;
- retval = 1;
- break;
- }
- return retval;
}
static inline int has_stopped_jobs(int pgrp)
--- linux/kernel/idtag.c.orig Wed Sep 18 00:18:06 2002
+++ linux/kernel/idtag.c Wed Sep 18 00:18:06 2002
@@ -0,0 +1,177 @@
+#include <linux/sched.h>
+#include <linux/spinlock.h>
+#include <linux/idtag.h>
+#include <linux/slab.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+
+/*
+ * idtags for tasks
+ * (C) 2002 William Irwin, IBM
+ * idtags are backing objects for tasks sharing a given ID to chain
+ * against. There is very little to them aside from hashing them and
+ * parking tasks using given ID's on a list.
+ *
+ * TODO: use per-bucket locks.
+ */
+
+static kmem_cache_t *idtag_cache;
+
+#define IDHASH_SIZE 4096
+static struct list_head idtag_hash[IDTAG_MAX][IDHASH_SIZE];
+
+static spinlock_t idtag_lock = SPIN_LOCK_UNLOCKED;
+
+static inline unsigned idtag_hashfn(unsigned long tag)
+{
+ return ((tag >> 8) ^ tag) & (IDHASH_SIZE - 1);
+}
+
+static inline struct idtag *idtag_alloc(void)
+{
+ return kmem_cache_alloc(idtag_cache, SLAB_KERNEL);
+}
+
+static inline void idtag_free(struct idtag *idtag)
+{
+ kmem_cache_free(idtag_cache, idtag);
+}
+
+static inline void init_idtag( struct idtag *idtag,
+ enum idtag_type type,
+ unsigned long tag)
+{
+ INIT_LIST_HEAD(&idtag->task_list);
+ atomic_set(&idtag->count, 0);
+ idtag->type = type;
+ idtag->tag = tag;
+ list_add(&idtag->idtag_hash_chain,
+ &idtag_hash[type][idtag_hashfn(tag)]);
+}
+
+struct idtag *find_tag(enum idtag_type type, unsigned long tag)
+{
+ struct list_head *bucket, *elem;
+ struct idtag *idtag;
+
+ bucket = &idtag_hash[type][idtag_hashfn(tag)];
+
+ list_for_each(elem, bucket) {
+ idtag = list_entry(elem, struct idtag, idtag_hash_chain);
+ if (idtag->tag == tag)
+ return idtag;
+ }
+ return NULL;
+}
+
+int idtag_unused(unsigned long tag)
+{
+ enum idtag_type type;
+
+ for (type = 0; type < IDTAG_MAX; ++type)
+ if (find_tag(type, tag))
+ return 0;
+
+ return 1;
+}
+
+void __init idtag_hash_init(void)
+{
+ int i, j;
+
+ for (i = 0; i < IDTAG_MAX; ++i)
+ for (j = 0; j < IDHASH_SIZE; ++j)
+ INIT_LIST_HEAD(&idtag_hash[i][j]);
+}
+
+void __init idtag_init(void)
+{
+ idtag_cache = kmem_cache_create("idtag_cache",
+ sizeof(struct idtag),
+ 0,
+ SLAB_HWCACHE_ALIGN,
+ NULL,
+ NULL);
+}
+
+struct idtag *find_get_tag(enum idtag_type type, unsigned long tag)
+{
+ struct idtag *idtag;
+ spin_lock(&idtag_lock);
+ idtag = find_tag(type, tag);
+ if (idtag)
+ atomic_inc(&idtag->count);
+ spin_unlock(&idtag_lock);
+ return idtag;
+}
+
+struct idtag *get_tag(enum idtag_type type, unsigned long tag)
+{
+ struct idtag *idtag;
+
+ spin_lock(&idtag_lock);
+ idtag = find_tag(type, tag);
+
+ if (!idtag) {
+ struct idtag *raced_tag;
+ spin_unlock(&idtag_lock);
+ idtag = idtag_alloc();
+ spin_lock(&idtag_lock);
+ raced_tag = find_tag(type, tag);
+ if (!raced_tag) {
+ init_idtag(idtag, type, tag);
+ goto out_inc;
+ }
+ idtag_free(idtag);
+ idtag = raced_tag;
+ }
+
+ if (!idtag)
+ goto out;
+out_inc:
+ atomic_inc(&idtag->count);
+out:
+ spin_unlock(&idtag_lock);
+ return idtag;
+}
+
+void put_tag(struct idtag *idtag)
+{
+ unsigned long tag;
+ if (!atomic_dec_and_lock(&idtag->count, &idtag_lock))
+ return;
+
+ tag = idtag->tag;
+ list_del(&idtag->idtag_hash_chain);
+ idtag_free(idtag);
+ if (idtag_unused(tag))
+ free_pid(tag);
+ spin_unlock(&idtag_lock);
+}
+
+int attach_tag(task_t *task, enum idtag_type type, unsigned long tag)
+{
+ struct idtag *idtag;
+
+ idtag = get_tag(type, tag);
+
+ if (!idtag)
+ return -ENOMEM;
+
+ spin_lock(&idtag_lock);
+ list_add(&task->idtags[type].idtag_chain, &idtag->task_list);
+ task->idtags[type].tag = tag;
+ task->idtags[type].idtag = idtag;
+ spin_unlock(&idtag_lock);
+
+ return 0;
+}
+
+int detach_tag(task_t *task, enum idtag_type type)
+{
+ spin_lock(&idtag_lock);
+ list_del(&task->idtags[type].idtag_chain);
+ spin_unlock(&idtag_lock);
+ put_tag(task->idtags[type].idtag);
+ return 0;
+}
--- linux/kernel/kmod.c.orig Wed Sep 18 00:16:39 2002
+++ linux/kernel/kmod.c Wed Sep 18 00:18:06 2002
@@ -124,11 +124,19 @@
/* Drop the "current user" thing */
{
struct user_struct *user = curtask->user;
+
+ write_lock(&user->lock);
+ atomic_dec(&user->processes);
+ list_del(&curtask->user_task_list);
+ write_unlock(&user->lock);
+ free_uid(user);
+
+ write_lock(&INIT_USER->lock);
curtask->user = INIT_USER;
atomic_inc(&INIT_USER->__count);
atomic_inc(&INIT_USER->processes);
- atomic_dec(&user->processes);
- free_uid(user);
+ list_add(&curtask->user_task_list, &INIT_USER->task_list);
+ write_unlock(&INIT_USER->lock);
}
/* Give kmod all effective privileges.. */
--- linux/kernel/pid.c.orig Wed Sep 18 00:18:06 2002
+++ linux/kernel/pid.c Wed Sep 18 00:18:06 2002
@@ -0,0 +1,137 @@
+/*
+ * Scalable, lockless, time-bounded PID allocator
+ *
+ * (C) 2002 Ingo Molnar, Red Hat
+ *
+ * We have a list of bitmap pages, which bitmaps represent the PID space.
+ * Allocating and freeing PIDs is completely lockless. The worst-case
+ * allocation scenario when all but one out of 1 million PIDs possible are
+ * allocated already: the scanning of 32 list entries and at most PAGE_SIZE
+ * bytes. The typical fastpath is a single successful setbit. Freeing is O(1).
+ */
+
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+
+int pid_max = PID_MAX_DEFAULT;
+int last_pid;
+
+#define RESERVED_PIDS 300
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1)
+
+/*
+ * PID-map pages start out as NULL, they get allocated upon
+ * first use and are never deallocated. This way a low pid_max
+ * value does not cause lots of bitmaps to be allocated, but
+ * the scheme scales to up to 4 million PIDs, runtime.
+ */
+typedef struct pidmap {
+ atomic_t nr_free;
+ void *page;
+} pidmap_t;
+
+static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
+ { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };
+
+static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES;
+
+void free_pid(unsigned long pid)
+{
+ pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+ int offset = pid & BITS_PER_PAGE_MASK;
+
+ if (!test_and_clear_bit(offset, map->page))
+ BUG();
+ atomic_inc(&map->nr_free);
+}
+
+/*
+ * Here we search for the next map that has free bits left.
+ * Normally the next map has free PIDs.
+ */
+static inline pidmap_t *next_free_map(pidmap_t *map, int *max_steps)
+{
+ while (--*max_steps) {
+ if (++map == map_limit)
+ map = pidmap_array;
+ if (unlikely(!map->page)) {
+ unsigned long page = get_zeroed_page(GFP_KERNEL);
+ /*
+ * Free the page if someone raced with us
+ * installing it:
+ */
+ if (cmpxchg(&map->page, NULL, page))
+ free_page(page);
+ if (!map->page)
+ break;
+ }
+ if (atomic_read(&map->nr_free))
+ return map;
+ }
+ return NULL;
+}
+
+int alloc_pid(void)
+{
+ int pid, offset, max_steps = PIDMAP_ENTRIES + 1;
+ pidmap_t *map;
+
+ pid = last_pid + 1;
+ if (pid >= pid_max)
+ pid = RESERVED_PIDS;
+
+ offset = pid & BITS_PER_PAGE_MASK;
+ map = pidmap_array + pid / BITS_PER_PAGE;
+
+ if (likely(map->page && !test_and_set_bit(offset, map->page))) {
+ /*
+ * There is a small window for last_pid updates to race,
+ * but in that case the next allocation will go into the
+ * slowpath and that fixes things up.
+ */
+return_pid:
+ atomic_dec(&map->nr_free);
+ last_pid = pid;
+ return pid;
+ }
+
+ if (!offset || !atomic_read(&map->nr_free)) {
+next_map:
+ map = next_free_map(map, &max_steps);
+ if (!map)
+ goto failure;
+ offset = 0;
+ }
+ /*
+ * Find the next zero bit:
+ */
+scan_more:
+ offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
+ if (offset == BITS_PER_PAGE)
+ goto next_map;
+ if (test_and_set_bit(offset, map->page))
+ goto scan_more;
+
+ /* we got the PID: */
+ pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
+ goto return_pid;
+
+failure:
+ return -1;
+}
+
+void __init pid_init(void)
+{
+ pidmap_t *map = pidmap_array;
+
+ /*
+ * Allocate PID 0:
+ */
+ map->page = alloc_bootmem(PAGE_SIZE);
+ set_bit(0, map->page);
+ atomic_dec(&map->nr_free);
+}
+
--- linux/kernel/sched.c.orig Wed Sep 18 00:16:43 2002
+++ linux/kernel/sched.c Wed Sep 18 00:18:06 2002
@@ -2099,6 +2099,8 @@
{
runqueue_t *rq;
int i, j, k;
+ extern void idtag_hash_init(void);
+ extern void pid_init(void);
for (i = 0; i < NR_CPUS; i++) {
prio_array_t *array;
@@ -2139,5 +2141,8 @@
*/
atomic_inc(&init_mm.mm_count);
enter_lazy_tlb(&init_mm, current, smp_processor_id());
+
+ idtag_hash_init();
+ pid_init();
}
--- linux/kernel/signal.c.orig Wed Sep 18 00:16:43 2002
+++ linux/kernel/signal.c Wed Sep 18 00:18:06 2002
@@ -943,19 +943,31 @@
int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
{
- int retval = -EINVAL;
- if (pgrp > 0) {
- struct task_struct *p;
-
- retval = -ESRCH;
- for_each_process(p) {
- if (p->pgrp == pgrp) {
- int err = send_sig_info(sig, info, p);
- if (retval)
- retval = err;
- }
- }
+ task_t *task;
+ struct list_head *elem;
+ struct idtag *idtag;
+ int err, retval = -EINVAL;
+
+ if (pgrp <= 0)
+ goto out;
+
+ retval = -ESRCH;
+ idtag = find_get_tag(IDTAG_PGID, pgrp);
+ if (!idtag)
+ goto out;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+
+ if (!thread_group_leader(task))
+ continue;
+
+ err = send_sig_info(sig, info, task);
+ if (retval)
+ retval = err;
}
+ put_tag(idtag);
+out:
return retval;
}
@@ -977,42 +989,65 @@
* the connection is lost.
*/
+
int
-kill_sl_info(int sig, struct siginfo *info, pid_t sess)
+kill_sl_info(int sig, struct siginfo *info, pid_t sid)
{
- int retval = -EINVAL;
- if (sess > 0) {
- struct task_struct *p;
-
- retval = -ESRCH;
- read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->leader && p->session == sess) {
- int err = send_sig_info(sig, info, p);
- if (retval)
- retval = err;
- }
- }
- read_unlock(&tasklist_lock);
+ int err, retval = -EINVAL;
+ struct idtag *idtag;
+ struct list_head *elem;
+ task_t *task;
+
+ if (sid <= 0)
+ goto out;
+
+ retval = -ESRCH;
+ read_lock(&tasklist_lock);
+ idtag = find_get_tag(IDTAG_SID, sid);
+ if (!idtag)
+ goto out_unlock;
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_SID);
+ if (!task->leader)
+ continue;
+
+ err = send_sig_info(sig, info, task);
+ if (retval)
+ retval = err;
}
+ put_tag(idtag);
+out_unlock:
+ read_unlock(&tasklist_lock);
+out:
return retval;
}
-inline int
+int
kill_proc_info(int sig, struct siginfo *info, pid_t pid)
{
int error;
- struct task_struct *p;
+ task_t *task;
+ task_t *tgrp_leader;
read_lock(&tasklist_lock);
- p = find_task_by_pid(pid);
+ task = find_task_by_pid(pid);
error = -ESRCH;
- if (p)
- error = send_sig_info(sig, info, p);
+ if (!task)
+ goto out_unlock;
+
+ if (thread_group_leader(task))
+ goto out_send_sig;
+
+ tgrp_leader = find_task_by_pid(task->tgid);
+ if (tgrp_leader)
+ task = tgrp_leader;
+out_send_sig:
+ error = send_sig_info(sig, info, task);
+out_unlock:
read_unlock(&tasklist_lock);
+
return error;
}
-
/*
* kill_something_info() interprets pid in interesting ways just like kill(2).
--- linux/kernel/sys.c.orig Wed Sep 18 00:16:43 2002
+++ linux/kernel/sys.c Wed Sep 18 00:18:06 2002
@@ -203,35 +203,34 @@
cond_syscall(sys_quotactl)
cond_syscall(sys_acct)
-static int proc_sel(struct task_struct *p, int which, int who)
+static int set_one_prio(task_t *task, int niceval, int error)
{
- if(p->pid)
- {
- switch (which) {
- case PRIO_PROCESS:
- if (!who && p == current)
- return 1;
- return(p->pid == who);
- case PRIO_PGRP:
- if (!who)
- who = current->pgrp;
- return(p->pgrp == who);
- case PRIO_USER:
- if (!who)
- who = current->uid;
- return(p->uid == who);
- }
+ if (task->uid != current->euid &&
+ task->uid != current->uid && !capable(CAP_SYS_NICE)) {
+ error = -EPERM;
+ goto out;
}
- return 0;
+
+ if (error == -ESRCH)
+ error = 0;
+ if (niceval < task_nice(task) && !capable(CAP_SYS_NICE))
+ error = -EACCES;
+ else
+ set_user_nice(task, niceval);
+out:
+ return error;
}
asmlinkage long sys_setpriority(int which, int who, int niceval)
{
- struct task_struct *g, *p;
- int error;
+ task_t *task;
+ struct user_struct *user;
+ struct idtag *idtag;
+ struct list_head *elem;
+ int error = -EINVAL;
if (which > 2 || which < 0)
- return -EINVAL;
+ goto out;
/* normalize: avoid signed division (rounding problems) */
error = -ESRCH;
@@ -241,31 +240,53 @@
niceval = 19;
read_lock(&tasklist_lock);
- do_each_thread(g, p) {
- int no_nice;
- if (!proc_sel(p, which, who))
- continue;
- if (p->uid != current->euid &&
- p->uid != current->uid && !capable(CAP_SYS_NICE)) {
- error = -EPERM;
- continue;
- }
- if (error == -ESRCH)
- error = 0;
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
- error = -EACCES;
- continue;
- }
- no_nice = security_ops->task_setnice(p, niceval);
- if (no_nice) {
- error = no_nice;
- continue;
- }
- set_user_nice(p, niceval);
- } while_each_thread(g, p);
-
+ switch (which) {
+ case PRIO_PROCESS:
+ if (!who)
+ who = current->pid;
+ idtag = find_get_tag(IDTAG_PID, who);
+ if (!idtag)
+ goto out_unlock;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PID);
+ error = set_one_prio(task, niceval, error);
+ }
+ put_tag(idtag);
+ break;
+ case PRIO_PGRP:
+ if (!who)
+ who = current->pgrp;
+ idtag = find_get_tag(IDTAG_PGID, who);
+ if (!idtag)
+ goto out_unlock;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+ error = set_one_prio(task, niceval, error);
+ }
+ put_tag(idtag);
+ break;
+ case PRIO_USER:
+ if (!who)
+ user = current->user;
+ else
+ user = find_user(who);
+
+ if (!user)
+ goto out_unlock;
+
+ read_lock(&user->lock);
+ list_for_each(elem, &user->task_list) {
+ task = list_entry(elem, task_t, user_task_list);
+ error = set_one_prio(task, niceval, error);
+ }
+ read_unlock(&user->lock);
+ break;
+ }
+out_unlock:
read_unlock(&tasklist_lock);
-
+out:
return error;
}
@@ -277,21 +298,67 @@
*/
asmlinkage long sys_getpriority(int which, int who)
{
- struct task_struct *g, *p;
- long retval = -ESRCH;
+ task_t *task;
+ struct list_head *elem;
+ struct idtag *idtag;
+ struct user_struct *user;
+ long niceval, retval = -ESRCH;
if (which > 2 || which < 0)
return -EINVAL;
read_lock(&tasklist_lock);
- do_each_thread(g, p) {
- long niceval;
- if (!proc_sel(p, which, who))
- continue;
- niceval = 20 - task_nice(p);
- if (niceval > retval)
- retval = niceval;
- } while_each_thread(g, p);
+ switch (which) {
+ case PRIO_PROCESS:
+ if (!who)
+ who = current->pid;
+ idtag = find_get_tag(IDTAG_PID, who);
+ if (!idtag)
+ goto out_unlock;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PID);
+ niceval = 20 - task_nice(task);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ put_tag(idtag);
+ break;
+ case PRIO_PGRP:
+ if (!who)
+ who = current->pgrp;
+ idtag = find_get_tag(IDTAG_PGID, who);
+ if (!idtag)
+ goto out_unlock;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+ niceval = 20 - task_nice(task);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ put_tag(idtag);
+ break;
+ case PRIO_USER:
+ if (!who)
+ user = current->user;
+ else
+ user = find_user(who);
+
+ if (!user)
+ goto out_unlock;
+
+ read_lock(&user->lock);
+ list_for_each(elem, &user->task_list) {
+ task = list_entry(elem, task_t, user_task_list);
+ niceval = 20 - task_nice(task);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ read_unlock(&user->lock);
+ break;
+ }
+out_unlock:
read_unlock(&tasklist_lock);
return retval;
@@ -524,16 +591,24 @@
if (!new_user)
return -EAGAIN;
old_user = current->user;
+
+ write_lock(&old_user->lock);
atomic_dec(&old_user->processes);
+ list_del(¤t->user_task_list);
+ write_unlock(&old_user->lock);
+
+ write_lock(&new_user->lock);
atomic_inc(&new_user->processes);
+ list_add(¤t->user_task_list, &new_user->task_list);
+ current->uid = new_ruid;
+ current->user = new_user;
+ write_unlock(&new_user->lock);
if(dumpclear)
{
current->mm->dumpable = 0;
wmb();
}
- current->uid = new_ruid;
- current->user = new_user;
free_uid(old_user);
return 0;
}
@@ -849,7 +924,7 @@
asmlinkage long sys_setpgid(pid_t pid, pid_t pgid)
{
- struct task_struct * p;
+ task_t *p;
int err = -EINVAL;
if (!pid)
@@ -868,6 +943,9 @@
p = find_task_by_pid(pid);
if (!p)
goto out;
+ err = -EINVAL;
+ if (!thread_group_leader(p))
+ goto out;
if (p->parent == current || p->real_parent == current) {
err = -EPERM;
@@ -882,21 +960,28 @@
if (p->leader)
goto out;
if (pgid != pid) {
- struct task_struct *g, *tmp;
- do_each_thread(g, tmp) {
- if (tmp->pgrp == pgid &&
- tmp->session == current->session)
+ task_t * task;
+ struct idtag *idtag;
+ struct list_head *elem;
+
+ idtag = find_get_tag(IDTAG_PGID, pgid);
+ if (!idtag)
+ goto out;
+
+ list_for_each(elem, &idtag->task_list) {
+ task = idtag_task(elem, IDTAG_PGID);
+
+ if (task->session == current->session)
goto ok_pgid;
- } while_each_thread(g, tmp);
+ }
+ put_tag(idtag);
goto out;
}
ok_pgid:
- err = security_ops->task_setpgid(p, pgid);
- if (err)
- goto out;
-
+ detach_tag(p, IDTAG_PGID);
p->pgrp = pgid;
+ attach_tag(p, IDTAG_PGID, pgid);
err = 0;
out:
/* All paths lead to here, thus we are safe. -DaveM */
@@ -956,17 +1041,27 @@
asmlinkage long sys_setsid(void)
{
- struct task_struct *g, *p;
+ struct idtag *idtag;
int err = -EPERM;
+ if (!thread_group_leader(current))
+ return -EINVAL;
+
read_lock(&tasklist_lock);
- do_each_thread(g, p)
- if (p->pgrp == current->pid)
- goto out;
- while_each_thread(g, p);
+
+ idtag = find_get_tag(IDTAG_PGID, current->pid);
+
+ if (idtag) {
+ put_tag(idtag);
+ goto out;
+ }
current->leader = 1;
+ detach_tag(current, IDTAG_PGID);
+ detach_tag(current, IDTAG_SID);
current->session = current->pgrp = current->pid;
+ attach_tag(current, IDTAG_PGID, current->pid);
+ attach_tag(current, IDTAG_SID, current->pid);
current->tty = NULL;
current->tty_old_pgrp = 0;
err = current->pgrp;
--- linux/kernel/fork.c.orig Wed Sep 18 00:16:43 2002
+++ linux/kernel/fork.c Wed Sep 18 00:18:06 2002
@@ -47,15 +47,6 @@
int max_threads;
unsigned long total_forks; /* Handle normal Linux uptimes. */
-/*
- * Protects next_safe, last_pid and pid_max:
- */
-spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
-
-static int next_safe = DEFAULT_PID_MAX;
-int pid_max = DEFAULT_PID_MAX;
-int last_pid;
-
struct task_struct *pidhash[PIDHASH_SZ];
rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
@@ -84,7 +75,6 @@
}
}
-/* Protects next_safe and last_pid. */
void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
{
unsigned long flags;
@@ -159,52 +149,6 @@
return tsk;
}
-static int get_pid(unsigned long flags)
-{
- struct task_struct *g, *p;
- int pid;
-
- if (flags & CLONE_IDLETASK)
- return 0;
-
- spin_lock(&lastpid_lock);
- if (++last_pid > pid_max) {
- last_pid = 300; /* Skip daemons etc. */
- goto inside;
- }
-
- if (last_pid >= next_safe) {
-inside:
- next_safe = pid_max;
- read_lock(&tasklist_lock);
- repeat:
- do_each_thread(g, p) {
- if (p->pid == last_pid ||
- p->pgrp == last_pid ||
- p->session == last_pid) {
- if (++last_pid >= next_safe) {
- if (last_pid >= pid_max)
- last_pid = 300;
- next_safe = pid_max;
- }
- goto repeat;
- }
- if (p->pid > last_pid && next_safe > p->pid)
- next_safe = p->pid;
- if (p->pgrp > last_pid && next_safe > p->pgrp)
- next_safe = p->pgrp;
- if (p->session > last_pid && next_safe > p->session)
- next_safe = p->session;
- } while_each_thread(g, p);
-
- read_unlock(&tasklist_lock);
- }
- pid = last_pid;
- spin_unlock(&lastpid_lock);
-
- return pid;
-}
-
static inline int dup_mmap(struct mm_struct * mm)
{
struct vm_area_struct * mpnt, *tmp, **pprev;
@@ -696,8 +640,11 @@
goto bad_fork_free;
}
+ write_lock(&p->user->lock);
atomic_inc(&p->user->__count);
atomic_inc(&p->user->processes);
+ list_add(&p->user_task_list, &p->user->task_list);
+ write_unlock(&p->user->lock);
/*
* Counter increases are protected by
@@ -724,7 +671,13 @@
p->state = TASK_UNINTERRUPTIBLE;
copy_flags(clone_flags, p);
- p->pid = get_pid(clone_flags);
+ if (clone_flags & CLONE_IDLETASK)
+ p->pid = 0;
+ else {
+ p->pid = alloc_pid();
+ if (p->pid == -1)
+ goto bad_fork_cleanup;
+ }
p->proc_dentry = NULL;
INIT_LIST_HEAD(&p->run_list);
@@ -887,9 +840,16 @@
SET_LINKS(p);
if (p->ptrace & PT_PTRACED)
__ptrace_link(p, current->parent);
+ attach_tag(p, IDTAG_PID, p->pid);
+ if (thread_group_leader(p)) {
+ attach_tag(p, IDTAG_TGID, p->tgid);
+ attach_tag(p, IDTAG_PGID, p->pgrp);
+ attach_tag(p, IDTAG_SID, p->session);
+ }
hash_pid(p);
nr_threads++;
write_unlock_irq(&tasklist_lock);
+
retval = 0;
fork_out:
@@ -912,6 +872,8 @@
bad_fork_cleanup_security:
security_ops->task_free_security(p);
bad_fork_cleanup:
+ if (p->pid > 0)
+ free_pid(p->pid);
put_exec_domain(p->thread_info->exec_domain);
if (p->binfmt && p->binfmt->module)
__MOD_DEC_USE_COUNT(p->binfmt->module);
--- linux/kernel/user.c.orig Wed Sep 18 00:16:36 2002
+++ linux/kernel/user.c Wed Sep 18 00:18:06 2002
@@ -30,6 +30,8 @@
struct user_struct root_user = {
.__count = ATOMIC_INIT(1),
.processes = ATOMIC_INIT(1),
+ .lock = RW_LOCK_UNLOCKED,
+ .task_list = LIST_HEAD_INIT(root_user.task_list),
.files = ATOMIC_INIT(0)
};
@@ -64,6 +66,11 @@
return NULL;
}
+struct user_struct *find_user(uid_t uid)
+{
+ return uid_hash_find(uid, uidhashentry(uid));
+}
+
void free_uid(struct user_struct *up)
{
if (up && atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
@@ -92,6 +99,8 @@
atomic_set(&new->__count, 1);
atomic_set(&new->processes, 0);
atomic_set(&new->files, 0);
+ INIT_LIST_HEAD(&new->task_list);
+ rwlock_init(&new->lock);
/*
* Before adding this, check whether we raced
--- linux/init/main.c.orig Wed Sep 18 00:16:43 2002
+++ linux/init/main.c Wed Sep 18 00:18:06 2002
@@ -66,6 +66,7 @@
extern void sysctl_init(void);
extern void signals_init(void);
extern void buffer_init(void);
+extern void idtag_init(void);
extern void pte_chain_init(void);
extern void radix_tree_init(void);
extern void free_initmem(void);
@@ -432,6 +433,7 @@
#endif
mem_init();
kmem_cache_sizes_init();
+ idtag_init();
pgtable_cache_init();
pte_chain_init();
fork_init(num_physpages);
On Wed, Sep 18, 2002 at 01:06:11AM +0200, Ingo Molnar wrote:
> then i found one of wli's older patches for 2.5.23 [grr, it was not
> announced anywhere, i almost started coding the same problem], which
> provides the right (and much harder to implement) approach: it cleans up
> PID-space allocation to provide a generic hash for PIDs, session IDs,
> process group IDs and TGIDs, properly allocated and freed. This approach,
> besides paving the way for a scalable and time-bounded get_pid()
> implementation, also got rid of roughly half of for_each_process()
> (do_each_thread()) iterations done in the kernel, which alone is worth the
> effort. Now we can cleanly iterate through all processes in a session
> group or process group.
> i took the patch, adopted it to the recent ->ptrace_children and threading
> related changes, fixed a couple of bugs and made it work. It really worked
> well, nice work William!
Thank you for taking up the completion of development on and maintenance
of this patch. I have not had the time resources to advance it myself,
though now with your help I would be glad to contribute to the effort.
If you would like to assume ownership, I'd be glad to hand it over, and
send patches removing additional instances of for_each_process() to you
as I find the time.
On Wed, Sep 18, 2002 at 01:06:11AM +0200, Ingo Molnar wrote:
> I also wrote a new alloc_pid()/free_pid() implementation from scratch,
> which provides lockless, time-bounded PID allocation. This new PID
> allocator has a worst-case latency of 10 microseconds on a cache-cold P4,
> the cache-hot worst-case latency is 2 usecs, if pid_max is set to 1
> million.
This is necessary regardless in order to address the larger PID space.
In the brief review of it I've done thus far, the new algorithm appears
to be both sound and efficient.
The issues addressed here are extremely important for the workloads I
must support. The algorithmic breakdown of the prior algorithms with
larger task counts was severe enough to trip the NMI oopser and deceives
users into power cycling when they are not running the NMI oopser. This
issue in fact obstructs my testing and development on other subsystems.
Thanks,
Bill
From: William Lee Irwin III <[email protected]>
Date: Tue, 17 Sep 2002 17:22:40 -0700
The issues addressed here are extremely important for the workloads I
must support.
Have you published test tools that emulate your workload?
These would be very useful, probably to find problems even
outside the scope of pid lookups.
From: William Lee Irwin III <[email protected]>
> The issues addressed here are extremely important for the workloads I
> must support.
On Tue, Sep 17, 2002 at 06:27:22PM -0700, David S. Miller wrote:
> Have you published test tools that emulate your workload?
> These would be very useful, probably to find problems even
> outside the scope of pid lookups.
By and large I've been using existing tools, for instance, multiple
tiobench instances to obtain a large task count. The test cases I've
had to construct by hand are generally meant to trigger pagetable OOM,
which I've been assigned to do something about. In general, I attempt
to simulate the VM and I/O characteristics of databases. Sometimes I
have to get a bit inventive, e.g. recent 2*dbench 512 on tmpfs test.
Thanks,
Bill
On Tue, 17 Sep 2002, William Lee Irwin III wrote:
> Thank you for taking up the completion of development on and maintenance
> of this patch. I have not had the time resources to advance it myself,
> though now with your help I would be glad to contribute to the effort.
> If you would like to assume ownership, I'd be glad to hand it over, and
> send patches removing additional instances of for_each_process() to you
> as I find the time.
(well, it's your baby, i only dusted it off, merged it and redid the PID
allocator. I have intentionally left out some of the for_each_task
eliminations, to ease the merging - you are more than welcome to extend
the patch.)
the only thing that still needs to be resolved is the design of the idtags
subsystem. Is its current complexity at the absolute minimum we need? Eg.
i was thinking about extending the pid-hash to hash all PID related items,
threads, process groups and sessions. But in the end i think it would look
much like idtags look like today. The main issue introducing the
complexity are the following properties:
- process group PIDs, thread PIDs and session PIDs can overlap.
- a session leader (or process group leader) can exit sooner than the
process group itself - ie. the PID of the group has to stay allocated
until the last member of the group exits. idtag refcounting solves this
problem.
so i think the most we could get is to actually eliminate the pidhash and
use the idtag hash for it. This would concentrate all the performance
efforts on the idtag hash.
another, locking improvement is possible as well:
- the idtag spinlock should be eliminated, we can reuse the tasklist lock
for it - in the exit and fork path we hold it already. This also means
we can walk an ID list by read-locking the tasklist lock.
the idtag spinlock is already superfluous i think, because the idtag task
list is only safely walked if we read-lock the task list. So it's not like
anyone could hash in a new idtag while we walk the list.
What do you think?
Ingo
On Wed, Sep 18, 2002 at 01:06:11AM +0200, Ingo Molnar wrote:
> the underlying problem is very hard - since get_pid() not only has to take
> PIDs into account, but TGIDs, session IDs and process groups as well.
>
> Ie. even in the most hopeless situation, if there are 999,999 PIDs
> allocated already, it takes less than 10 usecs to find and allocate the
> remaining one PID. The common fastpath is a couple of instructions only.
> The overhead of skipping over continuous regions of allocated PIDs scales
> gracefully with the number of bits to be skipped, from 0 to 10 usecs.
>
> memory footprint of the new PID allocator scales dynamically with
> /proc/sys/kernel/pid_max: the default 32K PIDs cause a 4K allocation, a
> pid_max of 1 million causes a 128K footprint. The current absolute limit
> for pid_max is 4 million PIDs - this does not cause any allocation in the
> kernel, the bitmaps are demand-allocated runtime. The pidmap table takes
> up 512 bytes.
I still don't understand the current obsession with this stuff.
It is easy to have pid_max 2^30 and a fast algorithm that does not
take any more kernel space.
It seems to me you are first creating an unrealistic and unfavorable
situation (put pid_max at some artificially low value, starting a
lot of tasks and saying: look! the algorithm is quadratic!) and
then solve the problem that you thus invented yourself.
Please leave pid_max large.
Andries
On Wed, 18 Sep 2002, Andries Brouwer wrote:
> I still don't understand the current obsession with this stuff. It is
> easy to have pid_max 2^30 and a fast algorithm that does not take any
> more kernel space.
it's only an if(unlikely()) branch in a 1:4096 slowpath to handle this, so
why not? If it couldnt be done sanely then i wouldnt argue about this, but
look at the code, it can be done cleanly and with very low cost.
> It seems to me you are first creating an unrealistic and unfavorable
> situation (put pid_max at some artificially low value, [...]
we want the default to be low, so that compatibility with the older SysV
APIs is preserved. Also, why use a 128K bitmap to handle 1 million PIDs on
a system that has at most 1000 tasks running? I'd rather use an algorithm
that scales well from low pid_max to a larger pid_max as well.
> Please leave pid_max large.
why? For most desktop systems even 32K PIDs is probably too high. A large
pid_max only increases the RAM footprint. (well not under the current
allocation scheme but still.)
Ingo
On Wed, Sep 18, 2002 at 12:11:30PM +0200, Ingo Molnar wrote:
> so i think the most we could get is to actually eliminate the pidhash and
> use the idtag hash for it. This would concentrate all the performance
> efforts on the idtag hash.
I eventually had special-case handling of IDTAG_PID so that it did not
use idtags, but chained tasks directly, and removing the pidhash as goals.
On Wed, Sep 18, 2002 at 12:11:30PM +0200, Ingo Molnar wrote:
> another, locking improvement is possible as well:
> - the idtag spinlock should be eliminated, we can reuse the tasklist lock
> for it - in the exit and fork path we hold it already. This also means
> we can walk an ID list by read-locking the tasklist lock.
> the idtag spinlock is already superfluous i think, because the idtag task
> list is only safely walked if we read-lock the task list. So it's not like
> anyone could hash in a new idtag while we walk the list.
> What do you think?
ISTR the idtag_lock was for cases where the hashtable was modified
while the tasklist_lock was only held for reading. Basically, once
those are resolved, the idtag_lock goes away.
Cheers,
Bill
On Wed, Sep 18, 2002 at 02:32:06PM +0200, Andries Brouwer wrote:
> I still don't understand the current obsession with this stuff.
> It is easy to have pid_max 2^30 and a fast algorithm that does not
> take any more kernel space.
> It seems to me you are first creating an unrealistic and unfavorable
> situation (put pid_max at some artificially low value, starting a
> lot of tasks and saying: look! the algorithm is quadratic!) and
> then solve the problem that you thus invented yourself.
> Please leave pid_max large.
> Andries
There is no obsession. This just happens to be a real life issue.
Basically, the nondeterministic behavior of these things is NMI oopsing
my machines and those of users (who often just cut the power instead of
running the NMI oopser). get_pid() is actually not the primary offender,
but is known to be problematic along with the rest of them. I don't
really care whose pet algorithm is used so long as it doesn't explode
when breathed on. And Ingo's algorithm looks excellent to me.
This is furthermore blocking VM testing and development for many tasks
scenarios meant to emulate and optimize usage typical of machines in
the field.
Bill
Can we get a lockless, scalable, fault-tolerant, pre-emption safe,
zero-copy and distributed get_pid() that meets the Carrier Grade
specification? If at all possible I need it to do garbage collection, too.
Perhaps a get_pid() that solves the Turning Halting Problem should be on
the todo list for 2.6.
On Wed, 18 Sep 2002, Cort Dougan wrote:
> Can we get a lockless, scalable, fault-tolerant, pre-emption safe,
> zero-copy and distributed get_pid() that meets the Carrier Grade
> specification? [...]
of course, and it should also be massively-threaded, LSB-compliant,
enterprise-ready, secure, cluster-aware, power-saving and self-healing. I
admit that there's still lots of work to be done, but there's just so many
hours in a day.
> If at all possible I need it to do garbage collection, too.
actually, on-the-fly O(log(N)) multiprocessor garbage collection is
already integrated into its high-end modular OO design.
> Perhaps a get_pid() that solves the Turning Halting Problem should be on
> the todo list for 2.6.
the first small mystery to solve are certain perturbations in Alan
Turing's name. But, yes, it's definitely a goal of the PID allocator to be
an answer to all, but also for it to avoid infinite loops for every
possible input value, while yielding slightly more subtle output than the
numeric value of 42. Patch in a few minutes.
Ingo
} of course, and it should also be massively-threaded, LSB-compliant,
} enterprise-ready, secure, cluster-aware, power-saving and self-healing. I
} admit that there's still lots of work to be done, but there's just so many
} hours in a day.
get_pid() is simple, I don't see why it's taking you so long to add those
features. I'll sketch out a plan for an improvement in getpid_MARK02(c)
numbers once you have the prototype done.
} actually, on-the-fly O(log(N)) multiprocessor garbage collection is
} already integrated into its high-end modular OO design.
}
} > Perhaps a get_pid() that solves the Turning Halting Problem should be on
} > the todo list for 2.6.
Given my misspelling of Turing, I think it's clear that get_pid() needs a
spellchecker. This would be an opportunity to begin work on get_pid .NET.
} the first small mystery to solve are certain perturbations in Alan
} Turing's name. But, yes, it's definitely a goal of the PID allocator to be
} an answer to all, but also for it to avoid infinite loops for every
} possible input value, while yielding slightly more subtle output than the
} numeric value of 42. Patch in a few minutes.
Please be sure to implement a BK replacement, open-source it, switch the
kernel over to it and supply the patch in that format. It would be
appreciated.
On Wed, Sep 18, 2002 at 05:33:54PM +0200, Ingo Molnar wrote:
> On Wed, 18 Sep 2002, Cort Dougan wrote:
>
> > Can we get a lockless, scalable, fault-tolerant, pre-emption safe,
> > zero-copy and distributed get_pid() that meets the Carrier Grade
> > specification? [...]
>
> of course, and it should also be massively-threaded, LSB-compliant,
> enterprise-ready, secure, cluster-aware, power-saving and self-healing. I
> admit that there's still lots of work to be done, but there's just so many
> hours in a day.
The ecosystem thanks you for your efforts.
On Wed, 18 Sep 2002, Cort Dougan wrote:
> Can we get a lockless, scalable, fault-tolerant, pre-emption safe,
> zero-copy and distributed get_pid() that meets the Carrier Grade
> specification? If at all possible I need it to do garbage collection, too.
Anything better than a get_pid() that triggers the NMI oopser
when wli runs tiobench with 1024 threads ;)
Rik
--
Spamtrap of the month: [email protected]
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, 2002-09-18 at 16:01, Cort Dougan wrote:
> Can we get a lockless, scalable, fault-tolerant, pre-emption safe,
> zero-copy and distributed get_pid() that meets the Carrier Grade
> specification? If at all possible I need it to do garbage collection, too.
I did one, but it garbage collected and there was nothing left 8)
On Wed, 18 Sep 2002, Andries Brouwer wrote:
>
> Please leave pid_max large.
I have to agree.
There is no goodness in making life complicated, and then putting a lot of
effort in solving that complexity with other complexity.
I would suggest something like this:
- make pid_max start out at 32k or whatever, to make "ps" look nice if
nothing else.
- every time we have _any_ trouble at all with looking up a new pid, we
double pid_max.
And it's really easy to recognize trouble with something truly trivial
like the appended.
Give me one reason for why these two added lines aren't better than all
the complexity we've discussed? I can pretty much _guarantee_ that it's
faster, and it sure as hell is simpler. And all traditional uses that has
less than a few thousand threads at most will never see the larger pids,
so people can stare at "ps" output without going blind - the big pids
start showing up only on boxes that might actually _need_ them.
Linus
----
--- 1.77/kernel/fork.c Sun Sep 15 11:01:39 2002
+++ edited/fork.c Wed Sep 18 09:11:43 2002
@@ -175,6 +175,8 @@
if (last_pid >= next_safe) {
inside:
+ if (nr_threads > pid_max >> 4)
+ pid_max <<= 1;
next_safe = pid_max;
read_lock(&tasklist_lock);
repeat:
On Wed, 18 Sep 2002, Ingo Molnar wrote:
>
> why? For most desktop systems even 32K PIDs is probably too high. A large
> pid_max only increases the RAM footprint. (well not under the current
> allocation scheme but still.)
Yeah. It increases memory pressure for the _complex_ and _slow_
algorithms. Agreed.
See my two-liner suggestion (which is admittedly not even compiled, so the
one disadvantage it might have is that it might need to be debugged. But
it's only two lines and doesn't actually change any fundamental part of
any existing algorithms, so debugging shouldn't be a big problem.
Linus
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> I would suggest something like this:
> - make pid_max start out at 32k or whatever, to make "ps" look nice if
> nothing else.
> - every time we have _any_ trouble at all with looking up a new pid, we
> double pid_max.
> + if (nr_threads > pid_max >> 4)
> + pid_max <<= 1;
... but watch out for over/underflow. ;)
It would also be nice if we had some known limit on pid_max (say 8
million, fits in 7 digits).
regards,
Rik
--
Spamtrap of the month: [email protected]
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> Give me one reason for why these two added lines aren't better than all
> the complexity we've discussed? I can pretty much _guarantee_ that it's
> faster, and it sure as hell is simpler. And all traditional uses that
> has less than a few thousand threads at most will never see the larger
> pids, so people can stare at "ps" output without going blind - the big
> pids start showing up only on boxes that might actually _need_ them.
i agree that this is okay, as an added mechanism. Nevertheless this does
not eliminate the 'box locks up for seconds' (or even minutes) situation.
No matter how large the PID range, unless we make the PID range truly
infinite (64 or 128 bits ought to be enough), there's always going to be
PID allocation collision, and it's a frequent application pattern to
allocate threads in a row. We cannot stick our heads into the sand, a
O(N^2) algorithm will only get worse with time, the problem will not go
away magically.
and i dont think it adds any significant complexity. The fastpath is still
comparable, and another benefit of idtags is the elimination of
for_each_process() [and for_each_threads()] loops.
Ingo
On Wed, 18 Sep 2002, Rik van Riel wrote:
>
> On second thought ... yes there's a reason. Suppose you have
> 100000 threads on your box already, how long is it going to
> take to walk them all to figure out the pid distribution ?
>
> And are you willing to walk 100000 threads for every 16 pids allocated ?
Give me a real-life case where that happens, and I might care. I dare you.
The pid space is not a uniform distribution, which your made-up-example
depends on. So you usually walk the 100000 threads _once_, and then you
don't have to walk them again for quite a long time.
And guys, if this is a performance problem for some extreme site, the fix
is truly trivial:
echo $((0x7fffffff)) > /proc/sys/max_pid
and you're done.
You're completely making up a problem that is not a problem in real life.
Come back to me when the above doesn't work _in_practice_ and somebody is
actually bitten, and maybe I'll care.
Until then, all you're doing is mental masturbation.
Linus
On Wed, Sep 18, 2002 at 06:41:47PM +0200, Ingo Molnar wrote:
> i agree that this is okay, as an added mechanism. Nevertheless this does
> not eliminate the 'box locks up for seconds' (or even minutes) situation.
The lockups I see range from hours to "it spun over the weekend, time
to pull the plug".
Bill
On Wed, 18 Sep 2002, Rik van Riel wrote:
> On Wed, 18 Sep 2002, Linus Torvalds wrote:
>
> > I would suggest something like this:
> > - make pid_max start out at 32k or whatever, to make "ps" look nice if
> > nothing else.
> > - every time we have _any_ trouble at all with looking up a new pid, we
> > double pid_max.
>
> > + if (nr_threads > pid_max >> 4)
> > + pid_max <<= 1;
>
> ... but watch out for over/underflow. ;)
Actually, you can't overflow without having nr_threads be something like
27 bits, which means that you'd need to have 100 million threads active at
the same time.
Which, btw, is impossible anyway due to running out of memory to hold all
the thread data structures on a 32-bit architecture _long_ before you get
close to having a high enough nr_threads.
On a 64-bit architecture you can do it with enough memory, but even that
is a few years away (you'd need on the order of a couple of terabytes to
do it).
> It would also be nice if we had some known limit on pid_max (say 8
> million, fits in 7 digits).
Do the math. The above _will_ fit in 7 digits as long as you never have
more then about half a million threads active at the same time.
Which, practically speaking, means that we're done. Quite frankly, the
people who maintain machines that run millions of threads concurrently
care a hell of a lot more about the maching running _stable_ than about
"ps" being pretty.
The people who care about ps being pretty will probably never see more
than 5 digits.
Linus
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> Yeah. It increases memory pressure for the _complex_ and _slow_
> algorithms. Agreed.
what complex and slow algorithms? Take a look at the alloc_pid() and
free_pid() fastpaths:
void free_pid(unsigned long pid)
{
pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
int offset = pid & BITS_PER_PAGE_MASK;
atomic_inc(&map->nr_free);
test_and_clear_bit(offset, map->page));
}
it's all bitshifts.
int alloc_pid(void)
{
pid = last_pid + 1;
if (pid >= pid_max)
pid = RESERVED_PIDS;
offset = pid & BITS_PER_PAGE_MASK;
map = pidmap_array + pid / BITS_PER_PAGE;
if (likely(map->page && !test_and_set_bit(offset, map->page))) {
atomic_dec(&map->nr_free);
last_pid = pid;
return pid;
[...]
}
> See my two-liner suggestion (which is admittedly not even compiled, so
> the one disadvantage it might have is that it might need to be debugged.
> But it's only two lines and doesn't actually change any fundamental part
> of any existing algorithms, so debugging shouldn't be a big problem.
it solves the PID-space-squeeze problem, but it does not solve the
fundamental problem: possibly thousands of consecutive PIDs allocated.
Ingo
current, existing real-life applications are past the 50 thousand kernel
threads requirement, and the only reason why we dont have >100K threads
applications here and now were limitations in the kernel. Right now we can
realistically scale to 500,000 threads even on IA32 systems - and there
are a number of applications for which this is a realistic model.
allocating past 32K consecutively allocated PIDs takes 32 seconds on a 500
MHz PII. Ie. 100K threads take more than 1.5 minutes to 'pass over' - even
if the pid_max got corrected to 10 million or whatever value, via the
duplication code. 1.5 minutes complete utter silence in the system, the
tasklist_lock taken. Any pending write_lock on the tasklist_lock causes
the NMI oopser to trigger. Interrupts are not serviced on that CPU. etc.
it's so bad and it's biting people that others have come up with ugly
solutions in the given PID allocation framework, check out:
http://old.lwn.net/2002/0328/a/getpid.php3
runtime construction of a temporary PID allocation bitmap to get the PID
allocation latency under control. That patch alone is larger than my PID
allocator. Now wli has come up with a much nicer solution and has found a
way to get all PID-like allocations into a common hash [pending cleanups,
but the concept is sound IMHO], and this enabled a very fast and clean PID
allocator. And as a side-effect about 70% of all for_each_process() and
for_each_thread() loops in the kernel are reduced to a
for_each_list(session_list) type of well-behaving loop.
Ingo
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> Give me one reason for why these two added lines aren't better than all
> the complexity we've discussed?
On second thought ... yes there's a reason. Suppose you have
100000 threads on your box already, how long is it going to
take to walk them all to figure out the pid distribution ?
And are you willing to walk 100000 threads for every 16 pids allocated ?
Rik
--
Spamtrap of the month: [email protected]
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
>
> The lockups I see range from hours to "it spun over the weekend, time
> to pull the plug".
But that is not because it walks the thread list once, it's because it
walks it, and loops walking in, and loops walking it some more.
All of which is avoided by just forcing the pid space to be "sufficiently
large".
Linus
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> And guys, if this is a performance problem for some extreme site, the
> fix is truly trivial:
>
> echo $((0x7fffffff)) > /proc/sys/max_pid
>
> and you're done.
it is a problem still. We can create/destroy 2 billion threads:
venus:~> ./p3 -s 2000000 -t 10 -r 0 -T --sync-join
Runtime: 19.889182138 seconds
in roughly 5 hours, on bog-standard 2-CPU x86 hardware. Or in 1.25 hours
on an 8-way box. And then we are back to step #1: trying to pass over
already allocated PIDs by destroying the contents of the L1 and L2 cache
once for each allocated PID passed. Sure, with 2 billion PIDs space that
averages out, but it's an algorithm with a very nasty worst-case behavior,
which is not so hard to trigger.
Plus the not-so-unlikely case of having to pass a couple of tens of
thousands of allocated PIDs, in a couple of minutes, with all user input
blocked, and interrupts disabled, website silent. Obviously sites will not
create threads at such a high rate as demonstrated above, so it will
'only' happen once every couple of days, or every couple of weeks.
Ingo
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> On Wed, 18 Sep 2002, Rik van Riel wrote:
> >
> > On second thought ... yes there's a reason. Suppose you have
> > 100000 threads on your box already, how long is it going to
> > take to walk them all to figure out the pid distribution ?
> The pid space is not a uniform distribution, which your made-up-example
> depends on. So you usually walk the 100000 threads _once_, and then you
> don't have to walk them again for quite a long time.
Agreed, you're right there. On the other hand, walking the threads
_once_ will take 1.5 minutes on a 500 MHz PII (according to Ingo's
measurements).
That's about 18 times the timeout for the NMI oopser and will cause
people real trouble.
cheers,
Rik
--
Spamtrap of the month: [email protected]
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, 2002-09-18 at 17:15, Linus Torvalds wrote:
> Give me one reason for why these two added lines aren't better than all
> the complexity we've discussed? I can pretty much _guarantee_ that it's
> faster, and it sure as hell is simpler
Add a constraint against a hard maximum (tweakable in proc) and I'd
agree.
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> Where did this NMI oopser argument come from? get_pid() doesn't even
> disable interrupts. And we hold the read lock, and other interrupts
> aren't allowed to take the write lock anyway. [...]
the read lock makes all writers wait (and there are a number of places
that use write_lock_irq(&tasklist_lock)) - which frequently come either
from IRQ-disabled paths, or disable interrupts themselves.
Sure, we could 'fix' this artifact by tweaking the write-lock to re-enable
interrupts while looping, but this still leaves us with complete system
silence potentially for minutes.
(plus this still leaves us with 30 places in the kernel that do stupid
for_each_process() and for_each_thread() loops which really could be loops
over the session list or process group list.)
Ingo
This is not about 'PID space squeeze'. Every sane sysadmin sets pid_max to
a high enough value - but sure i'm all for auto-sizing it - i proposed
auto-sizing of the allocation bitmap so i have absolutely nothing against
it.
the fundamental problem is consecutive allocation of PIDs. It can and will
happen, no matter what. In fact if it doesnt happen in every minute only
every week or so it's even worse. The current get_pid() algorithm, if it
sees 32K consequtively allocated PIDs [the next time the PID counter
overflows and gets to the PID-range - with the new optimizations this can
happen within 1 second even for a pid_max of 1 million], then the current
algorithm goes into a 32K*32K loop, ie. a loop of 1 billion, which takes
32 seconds.
only an infinite PID space [pratically infinite would be 64 bits or 128
bits] solves this problem in the current algorithm, or the patch we are
proposing, which also solves some other pending problems so it's
apparently a step in the right direction. (And it also enable us to
compress the PID space to make it more readable for mere mortals - if that
matters.)
Ingo
On Wed, 18 Sep 2002, Rik van Riel wrote:
>
> That's about 18 times the timeout for the NMI oopser and will cause
> people real trouble.
Where did this NMI oopser argument come from? get_pid() doesn't even
disable interrupts. And we hold the read lock, and other interrupts aren't
allowed to take the write lock anyway. If the NMI oopser triggers, then
something else is going on than get_pid().
Linus
On Wed, 18 Sep 2002, Rik van Riel wrote:
> Agreed, you're right there. On the other hand, walking the threads
> _once_ will take 1.5 minutes on a 500 MHz PII (according to Ingo's
> measurements).
>
> That's about 18 times the timeout for the NMI oopser and will cause
> people real trouble.
we could fix it to 'just' lock up but still enable interrupts so that the
NMI oopser does not trigger. (we'd also have to be careful to never
write-lock the tasklist lock with IRQs disabled.) It's still a pretty lame
behavior from an OS me thinks ...
Ingo
It's also not a bad idea to sometimes say "Linux cannot do that". Trying
to make the system do _everything_ will result in it doing many things very
poorly.
} Again, you're talking about entirely theoretical numbers that have no
} relevance for real life.
}
} Sure, you can do that. But a real life box? Nope.
}
} > Or in 1.25 hours
} > on an 8-way box. And then we are back to step #1: trying to pass over
} > already allocated PIDs by destroying the contents of the L1 and L2 cache
} > once for each allocated PID passed.
}
} So? It happens very rarely, and..
}
} > Sure, with 2 billion PIDs space that
} > averages out, but it's an algorithm with a very nasty worst-case behavior,
} > which is not so hard to trigger.
}
} ... the worst-case-behaviour is basically impossible to trigger with any
} real load.
}
} The worst case does not happen for "100k threads" like you've made it
} sound like.
}
} The worst case happens for "100k threads consecutive in the pid space".
}
} Which means that not only do you have to roll over, you have to roll over
} with a humungous number of threads _still_ occupying their old consecutive
} positions when you roll over.
}
} I repeat: you're making up schenarios that simply have no relevance to
} real life.
}
} Linus
}
} -
} To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
} the body of a message to [email protected]
} More majordomo info at http://vger.kernel.org/majordomo-info.html
} Please read the FAQ at http://www.tux.org/lkml/
On Wed, 18 Sep 2002, Ingo Molnar wrote:
>
> it is a problem still. We can create/destroy 2 billion threads:
>
> venus:~> ./p3 -s 2000000 -t 10 -r 0 -T --sync-join
> Runtime: 19.889182138 seconds
>
> in roughly 5 hours, on bog-standard 2-CPU x86 hardware.
Again, you're talking about entirely theoretical numbers that have no
relevance for real life.
Sure, you can do that. But a real life box? Nope.
> Or in 1.25 hours
> on an 8-way box. And then we are back to step #1: trying to pass over
> already allocated PIDs by destroying the contents of the L1 and L2 cache
> once for each allocated PID passed.
So? It happens very rarely, and..
> Sure, with 2 billion PIDs space that
> averages out, but it's an algorithm with a very nasty worst-case behavior,
> which is not so hard to trigger.
... the worst-case-behaviour is basically impossible to trigger with any
real load.
The worst case does not happen for "100k threads" like you've made it
sound like.
The worst case happens for "100k threads consecutive in the pid space".
Which means that not only do you have to roll over, you have to roll over
with a humungous number of threads _still_ occupying their old consecutive
positions when you roll over.
I repeat: you're making up schenarios that simply have no relevance to
real life.
Linus
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
> > i agree that this is okay, as an added mechanism. Nevertheless this does
> > not eliminate the 'box locks up for seconds' (or even minutes) situation.
>
> The lockups I see range from hours to "it spun over the weekend, time to
> pull the plug".
this can happen if there's a genuine PID space squeeze wrt. nr_threads -
that is solved by adding Linus' suggestion to the PID allocator. I believe
you saw that problem, not any inherent get_pid() algorithmic inefficiency.
nevertheless we do lock up for 32 seconds if there are 32K PIDs allocated
in a row and last_pid hits that range - regardless of pid_max. (Depending
on the cache architecture it could take significantly more.)
Ingo
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
>> The lockups I see range from hours to "it spun over the weekend, time to
>> pull the plug".
On Wed, Sep 18, 2002 at 07:36:00PM +0200, Ingo Molnar wrote:
> this can happen if there's a genuine PID space squeeze wrt. nr_threads -
> that is solved by adding Linus' suggestion to the PID allocator. I believe
> you saw that problem, not any inherent get_pid() algorithmic inefficiency.
> nevertheless we do lock up for 32 seconds if there are 32K PIDs allocated
> in a row and last_pid hits that range - regardless of pid_max. (Depending
> on the cache architecture it could take significantly more.)
There were only 10K tasks, with likely consecutively-allocated PID's,
and some minor background fork()/exit() activity, but there are more
offenders on the read side than get_pid() itself.
There is no question of PID space: the full 2^30 was configured in
the tests done after the PID space expansion.
Bill
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
>
> There were only 10K tasks, with likely consecutively-allocated PID's,
> and some minor background fork()/exit() activity, but there are more
> offenders on the read side than get_pid() itself.
>
> There is no question of PID space: the full 2^30 was configured in
> the tests done after the PID space expansion.
I bet the lockup was something else. There have been other bugs recently
with the task state changes, and the lockup may just have been a regular
plain lockup. Thread exit has been kind of fragile lately, although it
looks better now.
Linus
On Wed, 18 Sep 2002, Cort Dougan wrote:
> It's also not a bad idea to sometimes say "Linux cannot do that".
> Trying to make the system do _everything_ will result in it doing many
> things very poorly.
That's been tried before, but for some reason people just won't
listen and Linux ended up running on non-x86 machines and even SMP.
I don't see any reason to rule out a fix for this problem in
principle, maybe somebody will come up with code that Linus does
like ?
cheers,
Rik
--
Spamtrap of the month: [email protected]
http://www.surriel.com/ http://distro.conectiva.com/
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> ... the worst-case-behaviour is basically impossible to trigger with any
> real load.
>
> The worst case does not happen for "100k threads" like you've made it
> sound like.
>
> The worst case happens for "100k threads consecutive in the pid space".
i always said consecutive PID range, in perhaps every previous mail.
consecutive PID allocation is actually something applications do pretty
often.
or Apache threads with the 'max requests handled per thread' config value
set to infinite, and a couple of thousand threads started up during init.
or a phone line server that handles 100 thousand phone lines starts up
100K threads, one per line. No need to restart any of those threads, and
in fact it will never happen. They do use helper threads to do less timing
critical stuff. Now the whole phone system locks up for 1.5 minutes every
2 weeks or 2 months when the PID range overflows, how great. Now make that
400 thousand phonelines, the lockup will will last 24 minutes.
well, by this argument, Windows XP only crashes every couple of weeks, it
happens rarely, who cares. Windows probably reboots faster than Linux will
allocate the next PID, so it has better RT latencies. Phone server
applications are inherently restartable anyway, even across hw crashes.
and of course any quasy-RT behavior of the Linux kernel (which we *do*
have in specific well-controlled scenarios) can be thrown out the window
if processes or threads are allocated/destroyed.
and even assuming that concecutive PIDs are a non-problem. Is fast PID
allocation really that bad? Is a more compressed PID range necesserily
bad? Is the avoidance of for_each_process and for_each_thread loops that
bad? Even the current untuned patch, which admittedly duplicates some
functionality (such as the pidhash), performs on par with the unpatched
kernel, in thread creation benchmarks. (which is pretty much the only
place that can show any increase in PID allocation/management overhead.)
Ingo
You're talking about a different problem there. Creating a thread within a
realistic time-limit for a sensible number of threads is not a bad idea.
Doing it for a huge number of threads may not be something that has to be
done right away. Don't screw up the low to middle-end - that trend is
getting frightening in Linux.
} sorry, but creating a new thread within some realistic time limit,
} independently of how all the other threads are layed out, is not something
} i'd give up trying to solve.
> It's also not a bad idea to sometimes say "Linux cannot do that".
Like real time scheduling or embedded systems for example?
Be careful what you ask for ....
M.
On Wed, 18 Sep 2002, Cort Dougan wrote:
> You're talking about a different problem there. Creating a thread
> within a realistic time-limit for a sensible number of threads is not a
> bad idea. Doing it for a huge number of threads may not be something
> that has to be done right away. Don't screw up the low to middle-end -
> that trend is getting frightening in Linux.
sorry, but you must have missed the patch i posted yesterday. It can be
done, it's fast, it does not hurt *any* benchmark and it actually has a
worst-case cache-cold latency of 10 microseconds, even if 1 million
threads are started up already.
and besides it significantly speeds up some other areas as well, like
session management in shells, group-signal delivery performance, tty
handling and more.
Ingo
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
>> There were only 10K tasks, with likely consecutively-allocated PID's,
>> and some minor background fork()/exit() activity, but there are more
>> offenders on the read side than get_pid() itself.
>> There is no question of PID space: the full 2^30 was configured in the
>> tests done after the PID space expansion.
On Wed, Sep 18, 2002 at 07:59:50PM +0200, Ingo Molnar wrote:
> oh. Well, there are a whole lot of other places in the unpatched kernel
> that iterate over every task. So with the patch applied, all these lockups
> go away?
> Ingo
Not quite all of them. top(1) takes out the machine by triggering calls
to get_pid_list(), which NMI oopses fork() and exit() on other cpus.
Bill
> I'm also a big fan of "Do that crap outside the kernel until it
> works properly" type projects.
Agreed. That's why we have things like the lse (scalability) rollup
patches for 2.4. People have been playing with such things outside
the kernel. And nobody's suggesting that it should get accepted before
being properly tested.
> The Linux view should not be that N-way boxes are its manifest
> destiny. The same goes for thousands of threads. Linux works
> pretty well on 95% of the boxes that it is being run on. Lets
> not screw that up to fix the other 5%. Try some fixes _outside_
> the main kernel for a while, find a workable solution and then
> merge it in.
Why do you think these things are floating around as patches for
a while, and not going straight into the kernel?
Why do you think there are things like a Redhat Advanced server
build?
Nobody's trying to screw the desktop users, we're being mind-
bogglingly careful not to, in fact. If you have specific objections
to a particular patch, raise them as technical arguments. Saying
"we shouldn't do that because I'm not interested in it" is far
less useful.
The fact that there's lots of patches floating around for larger
systems isn't some evil plot to undermine the world - there's
just a lot of people working on it full time because that's where
much of the money is ...
M.
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
> There were only 10K tasks, with likely consecutively-allocated PID's,
> and some minor background fork()/exit() activity, but there are more
> offenders on the read side than get_pid() itself.
>
> There is no question of PID space: the full 2^30 was configured in the
> tests done after the PID space expansion.
oh. Well, there are a whole lot of other places in the unpatched kernel
that iterate over every task. So with the patch applied, all these lockups
go away?
Ingo
On Wed, 18 Sep 2002, Cort Dougan wrote:
> It's also not a bad idea to sometimes say "Linux cannot do that".
> Trying to make the system do _everything_ will result in it doing many
> things very poorly.
sorry, but creating a new thread within some realistic time limit,
independently of how all the other threads are layed out, is not something
i'd give up trying to solve.
Ingo
On Wed, Sep 18, 2002 at 07:54:53PM +0200, Ingo Molnar wrote:
> or a phone line server that handles 100 thousand phone lines starts up
> 100K threads, one per line. No need to restart any of those threads, and
> in fact it will never happen. They do use helper threads to do less timing
> critical stuff. Now the whole phone system locks up for 1.5 minutes every
> 2 weeks or 2 months when the PID range overflows, how great. Now make that
> 400 thousand phonelines, the lockup will will last 24 minutes.
So Solaris made a stupid design decision to encourage people to use a
thread per call on these big systems and Linux should make the same
decision for compatibility?
I can see why people want this: they have huge ugly systems that they would
like to port to Linux with as little effort as possible. But it's not
free for the OS either.
It's interesting to see that as Linux moves into the phone space owned by
Sun, two things are happening. (1)There is pressure to expand Linux to simply
match Solaris properties and (2) the entire technological and business
basis of those huge 100K+ thread racks is beginning to collapse.
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Wed, 2002-09-18 at 18:30, Linus Torvalds wrote:
>
> On Wed, 18 Sep 2002, Ingo Molnar wrote:
> >
> > it is a problem still. We can create/destroy 2 billion threads:
> >
> > venus:~> ./p3 -s 2000000 -t 10 -r 0 -T --sync-join
> > Runtime: 19.889182138 seconds
> >
> > in roughly 5 hours, on bog-standard 2-CPU x86 hardware.
>
> Again, you're talking about entirely theoretical numbers that have no
> relevance for real life.
>
> Sure, you can do that. But a real life box? Nope.
With a berzerk web script, with a malicious user ?
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
>> There were only 10K tasks, with likely consecutively-allocated PID's,
>> and some minor background fork()/exit() activity, but there are more
>> offenders on the read side than get_pid() itself.
>> There is no question of PID space: the full 2^30 was configured in
>> the tests done after the PID space expansion.
On Wed, Sep 18, 2002 at 10:46:35AM -0700, Linus Torvalds wrote:
> I bet the lockup was something else. There have been other bugs recently
> with the task state changes, and the lockup may just have been a regular
> plain lockup. Thread exit has been kind of fragile lately, although it
> looks better now.
> Linus
Pretty much the same tasklist iteration issues have been visible
running benchmarks on kernels from 2.4.early up. 2.5.x activity on this
front was somewhat stymied by architecture support issues whose causes
were not discovered until 2.5.20 or thereabouts, and perhaps also the
fact little time was specifically allocated to addressing it.
Bill
I'm also a big fan of "Do that crap outside the kernel until it works
properly" type projects. If you want to talk about real-time scheduling as
an example... looks at the trouble it's causing now. I think it makes my
point for me very strongly.
The Linux view should not be that N-way boxes are its manifest destiny.
The same goes for thousands of threads. Linux works pretty well on 95% of
the boxes that it is being run on. Lets not screw that up to fix the other
5%. Try some fixes _outside_ the main kernel for a while, find a workable
solution and then merge it in.
The Linux bus is getting really top-heavy because of some macho-features.
} Like real time scheduling or embedded systems for example?
} Be careful what you ask for ....
On Wed, 18 Sep 2002, William Lee Irwin III wrote:
> Not quite all of them. top(1) takes out the machine by triggering calls
> to get_pid_list(), which NMI oopses fork() and exit() on other cpus.
one of my patches in 2.5.35 solves that.
Ingo
On Wed, 18 Sep 2002 [email protected] wrote:
> So Solaris made a stupid design decision to encourage people to use a
> thread per call on these big systems and Linux should make the same
> decision for compatibility?
if it can be done, why not? Do you advocate the using of big select()
loops and userspace threading libraries?
i have to admit that there's an inherent simplicity in using one thread
per line, and for the more critical systems it's the simplicity that
matters alot. One process per line would be even nicer - but that has a
much higher resource footprint. While it could all be rewritten into a
IRQ-driven state-machine as well, i dont think that is economic nor
manageable for every case. Guess why Apache is still using the model of
threads/processes, with a serial workflow done by them, and not the model
of an async state-machine that TUX uses.
> I can see why people want this: they have huge ugly systems that they
> would like to port to Linux with as little effort as possible. But it's
> not free for the OS either.
i believe if you do not see the dangers of O(N^2) algorithms then you
should not do RT coding. While it's not at all the goal of the patch i
sent, with some effort we can make Linux to behave in an RT-correct way if
a subset of APIs are used and drivers are carefully controlled.
Ingo
On Wed, Sep 18, 2002 at 07:49:39AM -0700, William Lee Irwin III wrote:
> Basically, the nondeterministic behavior of these things is NMI oopsing
> my machines and those of users (who often just cut the power instead of
> running the NMI oopser). get_pid() is actually not the primary offender,
> but is known to be problematic along with the rest of them.
Sorry - I cannot parse this. Who are "these things" and "them"?
On Wed, 18 Sep 2002, Ingo Molnar wrote:
>
> Eliminating for_each_task loops (the ones completely unrelated to
> the get_pid() issue) is an improvement even for desktop setups, which have
> at most 1000 processes running.
If you really really want to do this, then at least get it integrated
better (ie it should _replace_ pidhash since it appears to just duplicate
the functionality), and get the locking right. Maybe that will make it
look better.
Linus
On Wed, Sep 18, 2002 at 02:56:19PM +0200, Ingo Molnar wrote:
> On Wed, 18 Sep 2002, Andries Brouwer wrote:
>
> > I still don't understand the current obsession with this stuff. It is
> > easy to have pid_max 2^30 and a fast algorithm that does not take any
> > more kernel space.
>
> it's only an if(unlikely()) branch in a 1:4096 slowpath to handle this, so
> why not? If it couldnt be done sanely then i wouldnt argue about this, but
> look at the code, it can be done cleanly and with very low cost.
In my opinion you are doing something that is entirely superfluous,
and at nonzero cost.
> > It seems to me you are first creating an unrealistic and unfavorable
> > situation (put pid_max at some artificially low value, [...]
>
> we want the default to be low, so that compatibility with the older SysV
> APIs is preserved.
Do you know of any programs that would be affected?
Last I did a grep on all source rpm's in some SuSE or RedHat distribution,
there was not a single program.
> Also, why use a 128K bitmap to handle 1 million PIDs on
> a system that has at most 1000 tasks running? I'd rather use an algorithm
> that scales well from low pid_max to a larger pid_max as well.
Yes indeed. So I advertize not using any bitmap at all.
> > Please leave pid_max large.
>
> why? For most desktop systems even 32K PIDs is probably too high. A large
> pid_max only increases the RAM footprint. (well not under the current
> allocation scheme but still.)
I have said it so many times, but let me repeat.
A large pid_max makes your system faster.
Collisions cost time. In a large space there are few collisions.
Now suppose you start 10^4 tasks at boot time and after 10^9 forks
you have to come back. Do you necessarily have to come across this
bunch of 10^4 tasks that are still sitting there?
That would be unfortunate.
But, you see, there are really many ways to avoid that.
Let me just invent one on the spot. Pick a random generator with
period 2^128 - 1 and each time you want a new pid pick the last
30 bits of its output. Why is that nice? Next time you come
around to the same pid, pids that were close together first
are far apart the second time.
You see, no data structures, no bitmaps, and very good behaviour
on average, even after 10^9 forks.
But there are lots of other solutions.
I would prefer to avoid talking about specific solutions as long as
this is not a problem that occurs in practice (with 30-bit pids).
I just want to stress: the larger the pid space, the faster your
computer forks.
Andries
On Wed, 18 Sep 2002, Martin J. Bligh wrote:
> Nobody's trying to screw the desktop users, we're being mind- bogglingly
> careful not to, in fact. If you have specific objections to a particular
> patch, raise them as technical arguments. Saying "we shouldn't do that
> because I'm not interested in it" is far less useful.
i fully agree with your points, but it does not hold in this specific
case. Eliminating for_each_task loops (the ones completely unrelated to
the get_pid() issue) is an improvement even for desktop setups, which have
at most 1000 processes running.
Ingo
On Wed, Sep 18, 2002 at 08:21:38PM +0200, Ingo Molnar wrote:
>
> On Wed, 18 Sep 2002 [email protected] wrote:
>
> > So Solaris made a stupid design decision to encourage people to use a
> > thread per call on these big systems and Linux should make the same
> > decision for compatibility?
>
> if it can be done, why not? Do you advocate the using of big select()
> loops and userspace threading libraries?
I like request/servers and state machines. I also think that there
may be some smart ways to fold multiple threads into one. And there is
nothing wrong with userspace threading libraries.
> i have to admit that there's an inherent simplicity in using one thread
> per line, and for the more critical systems it's the simplicity that
There is a simplicity acquired by pushing the complexity to somewhere else.
> matters alot. One process per line would be even nicer - but that has a
> much higher resource footprint. While it could all be rewritten into a
> IRQ-driven state-machine as well, i dont think that is economic nor
> manageable for every case. Guess why Apache is still using the model of
> threads/processes, with a serial workflow done by them, and not the model
> of an async state-machine that TUX uses.
Because Linux does not provide an easy and highly efficient API for
building state machines and because even if it did, a crappy threads
model that runs on NT as well as Linux is more attractive.
> > I can see why people want this: they have huge ugly systems that they
> > would like to port to Linux with as little effort as possible. But it's
> > not free for the OS either.
>
> i believe if you do not see the dangers of O(N^2) algorithms then you
> should not do RT coding. While it's not at all the goal of the patch i
> sent, with some effort we can make Linux to behave in an RT-correct way if
> a subset of APIs are used and drivers are carefully controlled.
Huh?
Just for information, don't try to apply asymptotic analysis blindly to
bounded conditions.
--
---------------------------------------------------------
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
http://www.fsmlabs.com http://www.rtlinux.com
On Wed, Sep 18, 2002 at 08:28:18PM +0200, Andries Brouwer wrote:
> I would prefer to avoid talking about specific solutions as long as
> this is not a problem that occurs in practice (with 30-bit pids).
> I just want to stress: the larger the pid space, the faster your
> computer forks.
> Andries
This does occur in practice.
Bill
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> If you really really want to do this, then at least get it integrated
> better (ie it should _replace_ pidhash since it appears to just
> duplicate the functionality), and get the locking right. Maybe that will
> make it look better.
yes, i'm in the process of doing this. I havent cleaned up idtag.c (and
its impact) yet, i started with pid.c. Once cleaned up it should have the
same hashing overhead as the existing pidhash.
Ingo
On Wed, Sep 18, 2002 at 07:49:39AM -0700, William Lee Irwin III wrote:
>> Basically, the nondeterministic behavior of these things is NMI oopsing
>> my machines and those of users (who often just cut the power instead of
>> running the NMI oopser). get_pid() is actually not the primary offender,
>> but is known to be problematic along with the rest of them.
On Wed, Sep 18, 2002 at 08:31:54PM +0200, Andries Brouwer wrote:
> Sorry - I cannot parse this. Who are "these things" and "them"?
Functions exhaustively searching or otherwise iterating over the whole
list of all tasks.
Bill
On Wed, 18 Sep 2002, Andries Brouwer wrote:
> Collisions cost time. In a large space there are few collisions.
>
> Now suppose you start 10^4 tasks at boot time and after 10^9 forks you
> have to come back. Do you necessarily have to come across this bunch of
> 10^4 tasks that are still sitting there? That would be unfortunate.
If you think you can outrun this O(N^2) algorithm with Moore's law in both
CPU performance and RAM space then you need to reconsider. I considered
solving this problem a few years ago, but it was clearly too much fuss for
too little gain. Now wli did the work and i think it's important, even if
taking the PID allocation issue completely out of the picture.
there's a good reason why this problem started to trigger and slowly made
its way on my radar as well, after so many years of passivity (we've had
this algorithm since day 1 i believe) - O(N^2) algrithms do tend to show
up sooner or later, there's no escape.
> But, you see, there are really many ways to avoid that.
>
> Let me just invent one on the spot. Pick a random generator with period
> 2^128 - 1 and each time you want a new pid pick the last 30 bits of its
> output. Why is that nice? Next time you come around to the same pid,
> pids that were close together first are far apart the second time.
actually now you've moved entirely out of the problem space. Ie. some nice
looking ps output:
mingo 1205341178 0.0 1689.3 123485616 tty5 S 16:26 0:00 -bash
mingo 21123326 0.0 1682.9 212345112 tty6 S 16:26 0:00 -bash
mingo 81430573 0.0 2074.9 1096872604 tty5 S 19:11 0:00 slogin x
mingo 579584 0.0 323.1 9833960 tty3 S 20:07 0:00 lynx
mingo 2690291337 0.1 2748.7 861737004 tty2 S 20:36 0:00 pine -i
mingo 1560603 0.0 1622.2 4431396 tty2 SN 20:39 0:00 -bash
just to solve the lameness of an algorithm that can be fixed nicely? Wow.
> You see, no data structures, no bitmaps, and very good behaviour on
> average, even after 10^9 forks.
just an occasional blip of 1-2-10 millisecs every few seconds, perhaps 100
millisecs if unlucky, even with just a 0.1% fill rate of the PID space.
Lowlatency patches? Why the need, we've got this get_pid() game of
roulette...
> But there are lots of other solutions.
such as? ...
Ingo
Hi,
How would I figure out whether interruptible_sleep_on_timeout() returned on
a timeout condition, someone called wakeup() or the user pressed CTRL-C
i.e., interrupted by a signal?
Is signal_pending()the right choice ?
Thanks,
Imran.
>> Nobody's trying to screw the desktop users, we're being mind- bogglingly
>> careful not to, in fact. If you have specific objections to a particular
>> patch, raise them as technical arguments. Saying "we shouldn't do that
>> because I'm not interested in it" is far less useful.
>
> i fully agree with your points, but it does not hold in this specific
> case. Eliminating for_each_task loops (the ones completely unrelated to
> the get_pid() issue) is an improvement even for desktop setups, which have
> at most 1000 processes running.
Right - which is exactly why I was saying we should stick to technical
debates rather than whether some people were interested in a particular
market segment or not ;-)
M.
On Wed, Sep 18, 2002 at 09:00:37PM +0200, Ingo Molnar wrote:
> If you think you can outrun this O(N^2) algorithm
Last month or so I already showed how to make it O(N log N).
(Also without any new data structures.)
But there is no hurry. The constant in your O(N^2) is very
small, since in reality it is better than O(N^2 / pid_max).
Really, now that pid_max is large, there is no problem with get_pid.
Much more interesting is thinking about for_each_task.
In a number of places we search for all tasks with a certain property.
It would be nice to arrange things in such a way that these
are found in some efficient way so that not the entire task list
is searched.
For ordinary Unix semantics we need to be able to find all tasks
with a given p->pgrp or p->session or p->pid or p->uid or p->tty.
Some kernel routines come with a given pointer tsk and want p == tsk.
Some places need p->mm or p->mm->context or CTX_HWBITS(p->mm->context)
In procfs we have a very quadratic algorithm in proc_pid_readdir()/
get_pid_list(). Not a potential hiccup after 2^30 processes that some
may want to smoothe, but every single "ls /proc" or "ps".
What shall we do with /proc for all these people with 10^5 processes?
Andries
On Wed, Sep 18, 2002 at 09:00:37PM +0200, Ingo Molnar wrote:
>> If you think you can outrun this O(N^2) algorithm
On Wed, Sep 18, 2002 at 10:11:34PM +0200, Andries Brouwer wrote:
> Last month or so I already showed how to make it O(N log N).
> (Also without any new data structures.)
> But there is no hurry. The constant in your O(N^2) is very
> small, since in reality it is better than O(N^2 / pid_max).
> Really, now that pid_max is large, there is no problem with get_pid.
I've seen no change in behavior with large PID_MAX. Your argument based
on sparse spaces is not empirically demonstrable. What is empirically
demonstrable are my NMI oopses, the bug reports I've had sitting around
since before 2.5 opened that are still an issue, and Ingo's actual timings.
On Wed, Sep 18, 2002 at 10:11:34PM +0200, Andries Brouwer wrote:
> Much more interesting is thinking about for_each_task.
> In a number of places we search for all tasks with a certain property.
> It would be nice to arrange things in such a way that these
> are found in some efficient way so that not the entire task list
> is searched.
Um, the original name of this patch was -for_each_task, and that's
precisely what it was intended to do. Check the file creation times
(and filenames) of the patches in
ftp://ftp.kernel.org/pub/linux/kernel/people/wli/task_mgmt/
It furthermore had the more general intent of removing exhaustive
searches not using the for_each_task() macro, of which the hotly
debated get_pid() was an instance.
I never did quite finish it, which is why Ingo had to step in and give
it a big push.
On Wed, Sep 18, 2002 at 10:11:34PM +0200, Andries Brouwer wrote:
> For ordinary Unix semantics we need to be able to find all tasks
> with a given p->pgrp or p->session or p->pid or p->uid or p->tty.
> Some kernel routines come with a given pointer tsk and want p == tsk.
> Some places need p->mm or p->mm->context or CTX_HWBITS(p->mm->context)
It doesn't sound like you read the patch at all.
On Wed, Sep 18, 2002 at 10:11:34PM +0200, Andries Brouwer wrote:
> In procfs we have a very quadratic algorithm in proc_pid_readdir()/
> get_pid_list(). Not a potential hiccup after 2^30 processes that some
> may want to smoothe, but every single "ls /proc" or "ps".
> What shall we do with /proc for all these people with 10^5 processes?
> Andries
That is actually one of the easiest ways to take out one of my machines
while it's running 10K or so tasks, mentioned a bit ago in this thread.
It also renders performance monitoring of benchmarks, customer
workloads, etc. very difficult.
Bill
On Wed, Sep 18, 2002 at 01:29:14PM -0700, William Lee Irwin III wrote:
> I've seen no change in behavior with large PID_MAX.
Of course not. But people keep blaming the wrong routine.
I have said it a thousand times - nothing is wrong with get_pid().
Now proc_pid_readdir() takes a million times as long, so it would
be much more reasonable if people talked about it instead.
With 20000 processes it will visit 10^7 process structs
for one ps.
> It doesn't sound like you read the patch at all.
I looked at it and searched for base.c but didnt find it,
so concluded that the real problem was not addressed.
> > In procfs we have a very quadratic algorithm in proc_pid_readdir()/
> > get_pid_list(). Not a potential hiccup after 2^30 processes that some
> > may want to smoothe, but every single "ls /proc" or "ps".
> > What shall we do with /proc for all these people with 10^5 processes?
> > Andries
>
> That is actually one of the easiest ways to take out one of my machines
> while it's running 10K or so tasks, mentioned a bit ago in this thread.
OK. So we now agree.
But it looks like your patch doesnt change this? Or did I overlook sth?
Andries
On Wed, Sep 18, 2002 at 09:48:24AM -0700, Linus Torvalds wrote:
>
> > On Wed, 18 Sep 2002, Linus Torvalds wrote:
> >
> > > I would suggest something like this:
> > > - make pid_max start out at 32k or whatever, to make "ps" look nice if
> > > nothing else.
> > > - every time we have _any_ trouble at all with looking up a new pid, we
> > > double pid_max.
> >
> > > + if (nr_threads > pid_max >> 4)
> The people who care about ps being pretty will probably never see more
> than 5 digits.
Agreed. While on the topic of number of digits: If this is in place
we could easily start out pid_max at say 9999, fitting the pids in
one less digit than we normally do now.
(I've always thought we should have sticked with pid_max at the
traditional Unix value of 32000, and not 32767).
Roger.
--
** [email protected] ** http://www.BitWizard.nl/ ** +31-15-2600998 **
*-- BitWizard writes Linux device drivers for any device you may have! --*
* The Worlds Ecosystem is a stable system. Stable systems may experience *
* excursions from the stable situation. We are currenly in such an *
* excursion: The stable situation does not include humans. ***************
On Wed, 18 Sep 2002, Andries Brouwer wrote:
> > It doesn't sound like you read the patch at all.
>
> I looked at it and searched for base.c but didnt find it,
> so concluded that the real problem was not addressed.
because, as mentioned before, that particular loop i fixed in 2.5.35.
> > That is actually one of the easiest ways to take out one of my machines
> > while it's running 10K or so tasks, mentioned a bit ago in this thread.
>
> OK. So we now agree.
>
> But it looks like your patch doesnt change this? Or did I overlook sth?
yes, you did overlook a number of things.
Ingo
[email protected] said:
> How would I figure out whether interruptible_sleep_on_timeout()
> returned on a timeout condition, someone called wakeup() or the user
> pressed CTRL-C i.e., interrupted by a signal? Is signal_pending()the
> right choice ?
Don't forget that there can also be any combination of two or all three of
the above. Don't use sleep_on functions because they're almost impossible
to use correctly.
--
dwmw2
On Thu, Sep 19, 2002 at 05:05:17AM +0200, Ingo Molnar wrote:
>
> On Wed, 18 Sep 2002, Andries Brouwer wrote:
>
> > > It doesn't sound like you read the patch at all.
> >
> > I looked at it and searched for base.c but didnt find it,
> > so concluded that the real problem was not addressed.
>
> because, as mentioned before, that particular loop i fixed in 2.5.35.
In that case, sorry for complaining about that part - I was
looking at 2.5.33. But now that I look at patch-2.5.35
I don't see any improvement: for_each_task() is now called
for_each_process(), but otherwise base.c is just as quadratic
as it was.
So, why do you think this problem has been fixed?
Andries
Hello, world!\n
> nevertheless we do lock up for 32 seconds if there are 32K PIDs allocated
> in a row and last_pid hits that range - regardless of pid_max. (Depending
> on the cache architecture it could take significantly more.)
What about randomizing the PID selection a bit? I.e., allocate PIDs
consecutively as long as they are free; if you hit an already used
PID, roll dice to find a new position where the search should be
continued. As long as the allocated fraction of PID space is reasonably
small, this algorithm should be very quick in average case.
Another possible solution: Divide PID space to blocks and for each
block, keep a counter of PID's available in this block and when
allocating, just skip blocks which are full. Runs in O(sqrt(PID space
size)) in the worst case.
Have a nice fortnight
--
Martin `MJ' Mares <[email protected]> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
This message is transmited on 100% recycled electrons.
On Thu, 19 Sep 2002, Martin Mares wrote:
> Hello, world!\n
>
> > nevertheless we do lock up for 32 seconds if there are 32K PIDs allocated
> > in a row and last_pid hits that range - regardless of pid_max. (Depending
> > on the cache architecture it could take significantly more.)
>
> What about randomizing the PID selection a bit? I.e., allocate PIDs
> consecutively as long as they are free; if you hit an already used
> PID, roll dice to find a new position where the search should be
> continued. As long as the allocated fraction of PID space is reasonably
> small, this algorithm should be very quick in average case.
>
> Another possible solution: Divide PID space to blocks and for each
> block, keep a counter of PID's available in this block and when
> allocating, just skip blocks which are full. Runs in O(sqrt(PID space
> size)) in the worst case.
>
> Have a nice fortnight
I remember something like the pid problem a few years ago when
somebody needed to get a unique 'key' value. The 'key' value was
used by a lock manager. It was not random, it just needed to be
unique, sort of like a pid. As I recall, the selection went something
like this:
(1) When keys were given up (released), the key value was compared
with the last, lowest-key value given up. If this was lower, the last
lowest key-value was substituted.
(2) When keys were allocated, the last lowest key-value was used.
Upon its use, the next available highest key-value was substituted.
(3) Somehow, I don't remember the alogrithm, the highest key-
value became the lowest key-value when all the higher keys were
released. At this time, everything was free.
Nobody ever had to scan anything to get the next-available key.
So what was saved in memory was, the highest key-value allocated, and
the lowest key-value allocated.
Anybody here work on the Cluster Lock Manager for DEC? That's what
used the keys.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
The US military has given us many words, FUBAR, SNAFU, now ENRON.
Yes, top management were graduates of West Point and Annapolis.
On Thu, Sep 19, 2002 at 03:11:57PM +0200, Andries Brouwer wrote:
> On Thu, Sep 19, 2002 at 05:05:17AM +0200, Ingo Molnar wrote:
> > because, as mentioned before, that particular loop i fixed in 2.5.35.
>
> But now that I look at patch-2.5.35
> I don't see any improvement: for_each_task() is now called
> for_each_process(), but otherwise base.c is just as quadratic
> as it was.
>
> So, why do you think this problem has been fixed?
Let me repeat this, and call it an observation instead of a question,
so that you do not think I am in doubt.
If you have 20000 processes, and do ps, then get_pid_list() will be
called 1000 times, and the for_each_process() loop will examine
10000000 processes.
Unlike the get_pid() situation, which was actually amortized linear with a very
small coefficient, here we have a bad quadratic behaviour, still in 2.5.37.
Andries
> Let me repeat this, and call it an observation instead of a question,
> so that you do not think I am in doubt.
>
> If you have 20000 processes, and do ps, then get_pid_list() will be
> called 1000 times, and the for_each_process() loop will examine
> 10000000 processes.
>
> Unlike the get_pid() situation, which was actually amortized linear with a very
> small coefficient, here we have a bad quadratic behaviour, still in 2.5.37.
>
One solution would be to replace the idtag hash with an idtag tree.
Then get_pid_list() could return an array of sorted pids, and finding
the next pid after unlocking the task_lock would be just a tree lookup
(find first pid larger than x).
And a sorted tree would make it possible find the next safe range for
get_pid() with O(N) instead of O(N^2).
--
Manfred
On Sat, 21 Sep 2002, Andries Brouwer wrote:
> >
> > So, why do you think this problem has been fixed?
>
> Let me repeat this, and call it an observation instead of a question,
> so that you do not think I am in doubt.
The reason Ingo thinks it is fixed is that it is fixed for the case of
having millions of threads - because the threads (with the new thread
library) won't show up on the "for_each_process()" loop. Which makes
threaded apps look a lot better on ps and top (we'll have to expose them
some day under /proc/<pid>/thread/<tid>/, but that's another matter)
But the quadratic behaviour wrt processes clearly isn't fixed. Suggestions
welcome (and we'll need to avoid the same quadratic behaviour wrt the
threads when we expose them).
The only "obvious" thing to do is to insert markers into the process list,
and have "for_each_process()" automatically skip the marker entries. There
probably wouldn't be all that many things that would ever notice if that
were done (excatly because most things that want to traverse the list use
"for_each_process()" already). And then instead of using "index", you
carry the marker thing around...
Linus
At some point in the past, Andries Brouwer wrote:
>> Let me repeat this, and call it an observation instead of a question,
>> so that you do not think I am in doubt.
>> If you have 20000 processes, and do ps, then get_pid_list() will be
>> called 1000 times, and the for_each_process() loop will examine
>> 10000000 processes.
>> Unlike the get_pid() situation, which was actually amortized linear with a
>> very
>> small coefficient, here we have a bad quadratic behaviour, still in 2.5.37.
On Sat, Sep 21, 2002 at 04:15:10PM +0200, Manfred Spraul wrote:
> One solution would be to replace the idtag hash with an idtag tree.
> Then get_pid_list() could return an array of sorted pids, and finding
> the next pid after unlocking the task_lock would be just a tree lookup
> (find first pid larger than x).
> And a sorted tree would make it possible find the next safe range for
> get_pid() with O(N) instead of O(N^2).
There are incremental / O(1) methods for filling the directory as well.
Also, a tree does not preclude additional hashing. Personally, I'd
consider O(N) catastrophic as well, especially when done on multiple
cpus simultaneously. In fact, a large chunk of this is obtaining hard
bounds on the hold time of the tasklist_lock so writers have a remote
chance of getting into critical sections.
Another very important aspect of it is that the tasklist cachelines
aren't incessantly pounded, which is important even for UP.
At any rate, if you care to send something to me I will doublecheck
whether it NMI oopses on my machines.
Cheers,
Bill
On Sat, 21 Sep 2002, Andries Brouwer wrote:
>> Let me repeat this, and call it an observation instead of a question,
>> so that you do not think I am in doubt.
On Sat, Sep 21, 2002 at 10:06:02AM -0700, Linus Torvalds wrote:
> The reason Ingo thinks it is fixed is that it is fixed for the case of
> having millions of threads - because the threads (with the new thread
> library) won't show up on the "for_each_process()" loop. Which makes
> threaded apps look a lot better on ps and top (we'll have to expose them
> some day under /proc/<pid>/thread/<tid>/, but that's another matter)
> But the quadratic behaviour wrt processes clearly isn't fixed. Suggestions
> welcome (and we'll need to avoid the same quadratic behaviour wrt the
> threads when we expose them).
Okay, I'm in trouble. My end-users use processes. But /proc/ needs some
more tweaking before they can use it during larger runs anyway.
On Sat, Sep 21, 2002 at 10:06:02AM -0700, Linus Torvalds wrote:
> The only "obvious" thing to do is to insert markers into the process list,
> and have "for_each_process()" automatically skip the marker entries. There
> probably wouldn't be all that many things that would ever notice if that
> were done (excatly because most things that want to traverse the list use
> "for_each_process()" already). And then instead of using "index", you
> carry the marker thing around...
This also sounds like an excellent idea. I may take a stab at this.
Thanks,
Bill
On Sat, 21 Sep 2002, Linus Torvalds wrote:
> But the quadratic behaviour wrt processes clearly isn't fixed.
> Suggestions welcome (and we'll need to avoid the same quadratic
> behaviour wrt the threads when we expose them).
in the case of threads my plan is to use the pid alloction bitmap for
this. It slightly overestimates the pids because orphan sessions and pgrps
are included as well, but this should not be a problem because procfs also
does a pid lookup when the specific directory is accessed. This method is
inherently restartable, the pid bitmap pages are never freed, and it's the
most cache-compact representation of the sorted pidlist. And it can be
accessed lockless ...
Ingo
On Sat, 21 Sep 2002, Linus Torvalds wrote:
>> But the quadratic behaviour wrt processes clearly isn't fixed.
>> Suggestions welcome (and we'll need to avoid the same quadratic
>> behaviour wrt the threads when we expose them).
On Sat, Sep 21, 2002 at 07:49:49PM +0200, Ingo Molnar wrote:
> in the case of threads my plan is to use the pid alloction bitmap for
> this. It slightly overestimates the pids because orphan sessions and pgrps
> are included as well, but this should not be a problem because procfs also
> does a pid lookup when the specific directory is accessed. This method is
> inherently restartable, the pid bitmap pages are never freed, and it's the
> most cache-compact representation of the sorted pidlist. And it can be
> accessed lockless ...
This sounds more attractive still. I'll forego the strategy of my prior
post and try to squeeze some more benchmark numbers out of things over
the weekend.
Cheers,
Bill
On Wed, Sep 18, 2002 at 11:57:10AM -0600, Cort Dougan wrote:
> I'm also a big fan of "Do that crap outside the kernel until it works
Actually I understand that's the whole point of the M:N design and I
assume it's what NGPT does too and they have no way to beat it with 1:1
because entering kernel at that frequency is very expensive compared to
creating and scheduling such an excessive amount of threads in
userspace. By making the kernel more efficient they can get close, which
is a good thing here since all processes are forced to pass through
kernel to fork (only threads can be created and scheduled purerly in
userspace).
Nevertheless the current get_pid is very bad when the tasklist grows and
the pid space is reduced, we fixed it for 2.4 so that it is reduced by
an order of magnitude (still quadratic, but over a array of bitflags
cached by the l1/l2 cache, not a while walk of the tasklist over the
whole pid space, and cache effect generates an aggressive anti-quardatic
behaviour that can be seen very well in practice and infact it *totally*
fixes the out-of-pid lockup, with something of the order of 32k pid, I
would expect it to scale fine to 100k pids, it also depends on the size
of the cpu caches. The fixed algorithm costs only an additional array of
4k with PID_NR set to 32k which is probably ok for embedded too these
days. With a smaller PID_NR the array shrinks, with 100k pid it would
need around 12k that would certainly fit in l2 cache still (if not in l1
dcache, btw, the 12k are physically contigous so they're guaranteed to
be in different cache colors). Infact because of these significant cache
effects (that guarantees each l1 cacheline will serve 512 pid elements)
it may be as fast as a more complex data structure that sits on
different cachelines and that must be modified at every task creation.
It may not be the best for a 1million pid case, but certainly it is a
must have for 2.4 and I think it could be ok for 2.5 too. It is been
submitted for 2.5 a number of times, I quote below the 2.4 version just
so you know what I'm talking about exactly (it also fixes a race
condition where two tasks can get the same pid, present in mainline 2.4
and 2.2 too, somebody complained the fix with the semaphore, I wanted to
keep it simple and mathematically safe, looking at the fork/execve
number comparison between my tree and the other trees out there
certainly it isn't measurable)
diff -urNp get-pid-ref/include/linux/threads.h get-pid/include/linux/threads.h
--- get-pid-ref/include/linux/threads.h Fri May 3 20:23:55 2002
+++ get-pid/include/linux/threads.h Wed May 29 04:48:19 2002
@@ -19,6 +19,6 @@
/*
* This controls the maximum pid allocated to a process
*/
-#define PID_MAX 0x8000
+#define PID_NR 0x8000
#endif
diff -urNp get-pid-ref/kernel/fork.c get-pid/kernel/fork.c
--- get-pid-ref/kernel/fork.c Wed May 29 04:47:54 2002
+++ get-pid/kernel/fork.c Wed May 29 04:48:38 2002
@@ -21,7 +21,6 @@
#include <linux/completion.h>
#include <linux/namespace.h>
#include <linux/personality.h>
-#include <linux/compiler.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -39,6 +38,12 @@ struct task_struct *pidhash[PIDHASH_SZ];
rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
+/*
+ * Protectes next_unsafe, last_pid and it avoids races
+ * between get_pid and SET_LINKS().
+ */
+static DECLARE_MUTEX(getpid_mutex);
+
void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
{
unsigned long flags;
@@ -81,63 +86,107 @@ void __init fork_init(unsigned long memp
init_task.rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
}
-/* Protects next_safe and last_pid. */
-spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
-
+/*
+ * Get the next free pid for a new process/thread.
+ *
+ * Strategy: last_pid and next_unsafe (excluded) are an interval where all pids
+ * are free, so next pid is just last_pid + 1 if it's also < next_unsafe.
+ * If last_pid + 1 >= next_unsafe the interval is completely used.
+ * In this case a bitmap with all used pids/tgids/pgrp/seesion is
+ * is created. This bitmap is looked for the next free pid and next_unsafe.
+ * If all pids are used, a kernel warning is issued.
+ */
static int get_pid(unsigned long flags)
{
- static int next_safe = PID_MAX;
+ static int next_unsafe = PID_NR;
+#define PID_FIRST 2 /* pid 1 is init, first usable pid is 2 */
+#define PID_BITMAP_SIZE ((((PID_NR + 7) / 8) + sizeof(long) - 1 ) / (sizeof(long)))
+ /*
+ * Even if this could be local per-thread, keep it static and protected by
+ * the lock because we don't want to overflow the stack and we wouldn't
+ * SMP scale better anyways. It doesn't waste disk space because it's in
+ * the .bss.
+ */
+ static unsigned long pid_bitmap[PID_BITMAP_SIZE];
+
+ /* from here the stuff on the stack */
struct task_struct *p;
- int pid, beginpid;
+ int pid, found_pid;
if (flags & CLONE_PID)
return current->pid;
- spin_lock(&lastpid_lock);
- beginpid = last_pid;
- if((++last_pid) & 0xffff8000) {
- last_pid = 300; /* Skip daemons etc. */
- goto inside;
- }
- if(last_pid >= next_safe) {
-inside:
- next_safe = PID_MAX;
+ pid = last_pid + 1;
+ if (pid >= next_unsafe) {
+ next_unsafe = PID_NR;
+ memset(pid_bitmap, 0, PID_BITMAP_SIZE*sizeof(long));
+
read_lock(&tasklist_lock);
- repeat:
+ /*
+ * Build the bitmap and calc next_unsafe.
+ */
for_each_task(p) {
- if(p->pid == last_pid ||
- p->pgrp == last_pid ||
- p->tgid == last_pid ||
- p->session == last_pid) {
- if(++last_pid >= next_safe) {
- if(last_pid & 0xffff8000)
- last_pid = 300;
- next_safe = PID_MAX;
+ __set_bit(p->pid, pid_bitmap);
+ __set_bit(p->pgrp, pid_bitmap);
+ __set_bit(p->tgid, pid_bitmap);
+ __set_bit(p->session, pid_bitmap);
+
+ if (next_unsafe > p->pid && p->pid > pid)
+ next_unsafe = p->pid;
+ if (next_unsafe > p->pgrp && p->pgrp > pid)
+ next_unsafe = p->pgrp;
+ if (next_unsafe > p->tgid && p->tgid > pid)
+ next_unsafe = p->tgid;
+ if (next_unsafe > p->session && p->session > pid)
+ next_unsafe = p->session;
+ }
+
+ /*
+ * Release the tasklist_lock, after the unlock it may happen that
+ * a pid is freed while it's still marked in use
+ * in the pid_bitmap[].
+ */
+ read_unlock(&tasklist_lock);
+
+ found_pid = find_next_zero_bit(pid_bitmap, PID_NR, pid);
+ if (found_pid >= PID_NR) {
+ next_unsafe = 0; /* depends on PID_FIRST > 0 */
+ found_pid = find_next_zero_bit(pid_bitmap, pid, PID_FIRST);
+ /* We scanned the whole bitmap without finding a free pid. */
+ if (found_pid >= pid) {
+ static long last_get_pid_warning;
+ if ((unsigned long) (jiffies - last_get_pid_warning) >= HZ) {
+ printk(KERN_NOTICE "No more PIDs (PID_NR = %d)\n", PID_NR);
+ last_get_pid_warning = jiffies;
}
- if(unlikely(last_pid == beginpid))
- goto nomorepids;
- goto repeat;
+ return -1;
+ }
+ }
+
+ pid = found_pid;
+
+ if (pid > next_unsafe) {
+ /* recalc next_unsafe by looking for the next bit set in the bitmap */
+ unsigned long * start = pid_bitmap;
+ unsigned long * p = start + (pid / (sizeof(long) * 8));
+ unsigned long * end = pid_bitmap + PID_BITMAP_SIZE;
+ unsigned long mask = ~((1UL << (pid & ((sizeof(long) * 8 - 1)))) - 1);
+
+ *p &= (mask << 1);
+
+ while (p < end) {
+ if (*p) {
+ next_unsafe = ffz(~*p) + (p - start) * sizeof(long) * 8;
+ break;
+ }
+ p++;
}
- if(p->pid > last_pid && next_safe > p->pid)
- next_safe = p->pid;
- if(p->pgrp > last_pid && next_safe > p->pgrp)
- next_safe = p->pgrp;
- if(p->tgid > last_pid && next_safe > p->tgid)
- next_safe = p->tgid;
- if(p->session > last_pid && next_safe > p->session)
- next_safe = p->session;
}
- read_unlock(&tasklist_lock);
}
- pid = last_pid;
- spin_unlock(&lastpid_lock);
- return pid;
+ last_pid = pid;
-nomorepids:
- read_unlock(&tasklist_lock);
- spin_unlock(&lastpid_lock);
- return 0;
+ return pid;
}
static inline int dup_mmap(struct mm_struct * mm)
@@ -637,8 +686,9 @@ int do_fork(unsigned long clone_flags, u
p->state = TASK_UNINTERRUPTIBLE;
copy_flags(clone_flags, p);
+ down(&getpid_mutex);
p->pid = get_pid(clone_flags);
- if (p->pid == 0 && current->pid != 0)
+ if (p->pid < 0) /* valid pids are >= 0 */
goto bad_fork_cleanup;
INIT_LIST_HEAD(&p->run_list);
@@ -758,7 +808,17 @@ int do_fork(unsigned long clone_flags, u
list_add(&p->thread_group, ¤t->thread_group);
}
+ /*
+ * We must do the SET_LINKS() under the getpid_mutex, to avoid
+ * another CPU to get our same PID between the release of of the
+ * getpid_mutex and the SET_LINKS().
+ *
+ * In short to avoid SMP races the new child->pid must be just visible
+ * in the tasklist by the time we drop the getpid_mutex.
+ */
SET_LINKS(p);
+ up(&getpid_mutex);
+
hash_pid(p);
nr_threads++;
write_unlock_irq(&tasklist_lock);
@@ -790,6 +850,7 @@ bad_fork_cleanup_fs:
bad_fork_cleanup_files:
exit_files(p); /* blocking */
bad_fork_cleanup:
+ up(&getpid_mutex);
put_exec_domain(p->exec_domain);
if (p->binfmt && p->binfmt->module)
__MOD_DEC_USE_COUNT(p->binfmt->module);
Andrea
On Sun, 22 Sep 2002, Andrea Arcangeli wrote:
> Nevertheless the current get_pid is very bad when the tasklist grows and
> the pid space is reduced, [...]
> It may not be the best for a 1million pid case, but certainly it is a
> must have for 2.4 and I think it could be ok for 2.5 too. It is been
> submitted for 2.5 a number of times, I quote below the 2.4 version just
> so you know what I'm talking about exactly [...]
Andrea, the new PID allocator (and new pidhash) went into 2.5.37, there's
no get_pid() anymore. Do we agree that the runtime-bitmap hack^H^H^H^patch
is now moot for 2.5?
Ingo
On Sun, Sep 22, 2002 at 12:02:10PM +0200, Ingo Molnar wrote:
>
> On Sun, 22 Sep 2002, Andrea Arcangeli wrote:
>
> > Nevertheless the current get_pid is very bad when the tasklist grows and
> > the pid space is reduced, [...]
>
> > It may not be the best for a 1million pid case, but certainly it is a
> > must have for 2.4 and I think it could be ok for 2.5 too. It is been
> > submitted for 2.5 a number of times, I quote below the 2.4 version just
> > so you know what I'm talking about exactly [...]
>
> Andrea, the new PID allocator (and new pidhash) went into 2.5.37, there's
> no get_pid() anymore. Do we agree that the runtime-bitmap hack^H^H^H^patch
> is now moot for 2.5?
I wouldn't call it an hack, it is just a less advanced version of what
you did in 2.5, if that is an hack how do you call the 2.4 and 2.2 code?
You've the bitmap array too and you use the bitmap to allocate the pid.
The big difference is that the 2.4 patch isn't capable of using the
bitmap in a synchronized and uptodate manner, so it has to rebuild the
bitmap at every getpid, and it misses all the infrastructure you added to
avoid walking the tasklist in getpid and to dynamically allocate new
maps. That means exit() (and various not performance critical syscalls)
will be a bit slower in 2.5, but it should payoff for getpid that
doesn't need to walk the tasklist anymore, so I certainly agree the
patch I posted is useless for the latest 2.5.38. One nice bit of 2.5 is
that the test and set bit on the global bitmask fixes the race with two
tasks taking the same pid quite naturally, it wasn't fixable that way in
the 2.4 patch because the bitmap is local to the task and rebuild at
each getpid.
Andrea
Hi!
> > nevertheless we do lock up for 32 seconds if there are 32K PIDs allocated
> > in a row and last_pid hits that range - regardless of pid_max. (Depending
> > on the cache architecture it could take significantly more.)
>
> What about randomizing the PID selection a bit? I.e., allocate PIDs
> consecutively as long as they are free; if you hit an already used
> PID, roll dice to find a new position where the search should be
> continued. As long as the allocated fraction of PID space is reasonably
> small, this algorithm should be very quick in average case.
Problem is it may take infinite time if you are unlucky...
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.