the attached patch is a significantly cleaned up version of the generic
pidhash patch, against BK-curr. Changes:
- merged the pidhash and id-hash concepts into a single 'generic
pidhash' thing.
- unified kernel/pid.c and kernel/idtag.c into kernel/pid.c.
- eliminated the pidhash lock.
- simplified the use of the generic pidhash; in most cases we simply
  iterate over a given PID 'type' (or class):

	for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
		p->tty = NULL;

there's no get_tag()/put_tag() anymore, just simple loops.
- fixed a number of bugs that caused inconsistent pid hashing. Added two
bugchecks that catch all cases when the pidhash gets out of sync with
the pidmap.
- removed the per-user task-list changes, it will be done in a separate
patch.
- removed the PIDTYPE_TGID class, it's not necessary anymore with the
latest threading changes in 2.5.35.
- lots of other cleanups and simplifications.
the patch is stable in both UP and SMP tests.
performance and robustness measurements:
thread creation+destruction (full) latency is the one that is most
sensitive to the PID allocation code. I've done testing on a 'low end
server' system with 1000 tasks running, and i've also done 5000, 10000 and
50000 (inactive) threads tests. In all cases pid_max is set to a safely
large value (1 million); the PID fill ratio is 1% for the 10K threads test
and 0.1% for the 1000 threads test.
i created and destroyed 1 million threads, in a serial way:
	./perf -s 1000000 -t 1 -r 0 -T --sync-join
4 such measurements were running at once, to load the dual-P4 SMP box.
All CPUs were 100% loaded during the test. Note that pid_max and the
number of threads created is 1 million as well; this is to get a fair
average over a full pass of the whole PID range.
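(for reference, a minimal user-space sketch of the kind of serial
create+join loop being timed here - this only approximates what the
./perf invocation above does; the real tool and its timing code are not
part of this mail:)

	/*
	 * sketch: serially create and join threads, one at a time
	 * ("--sync-join"), so every iteration allocates and frees
	 * exactly one PID.  Illustration only, not the benchmark tool.
	 */
	#include <pthread.h>
	#include <stdio.h>
	#include <stdlib.h>

	static void *thread_fn(void *arg)
	{
		return NULL;	/* exit immediately */
	}

	int main(int argc, char **argv)
	{
		long i, nr = (argc > 1) ? atol(argv[1]) : 1000000;
		pthread_t t;

		for (i = 0; i < nr; i++) {
			if (pthread_create(&t, NULL, thread_fn, NULL)) {
				perror("pthread_create");
				return 1;
			}
			pthread_join(t, NULL);
		}
		printf("created+destroyed %ld threads\n", nr);
		return 0;
	}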
BK-stock gives the following results:
 # of tasks:             250     1000     5000    10000    50000
 ----------------------------------------------------------------
 4x 1m threads (sec):   21.2     23.0     47.4    [NMI]    [NMI]
the results prove that even this extremely low PID space fill rate causes
noticeable PID allocation overhead. Things really escalate at 10000
threads: the NMI watchdog triggered very quickly (ie. an irqs-off latency
of more than 5 seconds happened). The results would have been well above
200 seconds. Even in the 5000 threads test the system was occasionally
very jerky, and probably missed interrupts as well.
Note that the inactive threads (1K, 5K, 10K and 50K) were using a
consecutive PID range, which favors the get_pid() algorithm, as all
cache misses will be concentrated into one big chunk, and ~0.95 million
PIDs can be allocated without any interference afterwards. With a more
random distribution of PIDs the overhead is larger.
BK+patch gives the following results:
 # of tasks:             250     1000     5000    10000    50000
 ----------------------------------------------------------------
 4x 1m threads (sec):   23.8     23.8     24.1     25.5     27.7
ie. thread creation performance is stable; the cost increases only
slightly, probably due to hash-chains getting longer. The catastrophic
breakdown in performance is solved.
but i'm still not happy about the 250 tasks performance of the generic
pidhash - it has some constant overhead. It's not noticeable in fork
latencies though. I think there are still a number of performance
optimizations we can do.
(i have not attempted to measure the improvement in those cases where the
for_each_task loop was eliminated - it's no doubt significant.)
Ingo
--- linux/drivers/char/tty_io.c.orig Thu Sep 19 00:40:44 2002
+++ linux/drivers/char/tty_io.c Thu Sep 19 01:27:37 2002
@@ -432,6 +432,7 @@
struct file * cons_filp = NULL;
struct task_struct *p;
struct list_head *l;
+ struct pid *pid;
int closecount = 0, n;
if (!tty)
@@ -496,17 +497,17 @@
}
read_lock(&tasklist_lock);
- for_each_process(p) {
- if ((tty->session > 0) && (p->session == tty->session) &&
- p->leader) {
- send_sig(SIGHUP,p,1);
- send_sig(SIGCONT,p,1);
+ if (tty->session > 0)
+ for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid) {
+ if (p->tty == tty)
+ p->tty = NULL;
+ if (!p->leader)
+ continue;
+ send_sig(SIGHUP, p, 1);
+ send_sig(SIGCONT, p, 1);
if (tty->pgrp > 0)
p->tty_old_pgrp = tty->pgrp;
}
- if (p->tty == tty)
- p->tty = NULL;
- }
read_unlock(&tasklist_lock);
tty->flags = 0;
@@ -571,6 +572,8 @@
{
struct tty_struct *tty = current->tty;
struct task_struct *p;
+ struct list_head *l;
+ struct pid *pid;
int tty_pgrp = -1;
lock_kernel();
@@ -598,9 +601,8 @@
tty->pgrp = -1;
read_lock(&tasklist_lock);
- for_each_process(p)
- if (p->session == current->session)
- p->tty = NULL;
+ for_each_task_pid(current->session, PIDTYPE_SID, p, l, pid)
+ p->tty = NULL;
read_unlock(&tasklist_lock);
unlock_kernel();
}
@@ -1221,12 +1223,15 @@
*/
if (tty_closing || o_tty_closing) {
struct task_struct *p;
+ struct list_head *l;
+ struct pid *pid;
read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->tty == tty || (o_tty && p->tty == o_tty))
+ for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid)
+ p->tty = NULL;
+ if (o_tty)
+ for_each_task_pid(o_tty->session, PIDTYPE_SID, p,l, pid)
p->tty = NULL;
- }
read_unlock(&tasklist_lock);
if (redirect == tty || (o_tty && redirect == o_tty))
@@ -1540,6 +1545,10 @@
static int tiocsctty(struct tty_struct *tty, int arg)
{
+ struct list_head *l;
+ struct pid *pid;
+ task_t *p;
+
if (current->leader &&
(current->session == tty->session))
return 0;
@@ -1558,12 +1567,10 @@
/*
* Steal it away
*/
- struct task_struct *p;
read_lock(&tasklist_lock);
- for_each_process(p)
- if (p->tty == tty)
- p->tty = NULL;
+ for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid)
+ p->tty = NULL;
read_unlock(&tasklist_lock);
} else
return -EPERM;
--- linux/include/linux/pid.h.orig Thu Sep 19 00:41:21 2002
+++ linux/include/linux/pid.h Thu Sep 19 01:18:29 2002
@@ -0,0 +1,61 @@
+#ifndef _LINUX_PID_H
+#define _LINUX_PID_H
+
+enum pid_type
+{
+ PIDTYPE_PID,
+ PIDTYPE_PGID,
+ PIDTYPE_SID,
+ PIDTYPE_MAX
+};
+
+struct pid
+{
+ int nr;
+ enum pid_type type;
+ atomic_t count;
+ struct list_head hash_chain;
+ struct list_head task_list;
+};
+
+struct pid_link
+{
+ int nr;
+ struct list_head pid_chain;
+ struct pid *pid;
+};
+
+#define pid_task(elem, type) \
+ list_entry(elem, struct task_struct, pids[(int)(type)].pid_chain)
+
+/*
+ * attach_pid() must be called with the tasklist_lock write-held.
+ * It might unlock the tasklist_lock for allocation, so this
+ * function must be called last after installing the links of
+ * a new task.
+ */
+extern int FASTCALL(attach_pid(struct task_struct *, enum pid_type, int));
+
+/*
+ * detach_pid() must be called with the tasklist_lock write-held.
+ */
+extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type));
+
+/*
+ * Quick & dirty hash table lookup.
+ */
+extern struct pid *FASTCALL(find_pid(enum pid_type, int));
+
+extern int alloc_pidmap(void);
+extern void FASTCALL(free_pidmap(int));
+
+#define for_each_task_pid(who, type, task, elem, pid) \
+ if ((pid = find_pid(type, who))) \
+ for (elem = pid->task_list.next, \
+ prefetch(elem->next), \
+ task = pid_task(elem, type); \
+ elem != &pid->task_list; \
+ elem = elem->next, prefetch(elem->next), \
+ task = pid_task(elem, type))
+
+#endif /* _LINUX_PID_H */
--- linux/include/linux/sched.h.orig Thu Sep 19 00:40:52 2002
+++ linux/include/linux/sched.h Thu Sep 19 02:56:18 2002
@@ -28,6 +28,7 @@
#include <linux/fs_struct.h>
#include <linux/compiler.h>
#include <linux/completion.h>
+#include <linux/pid.h>
struct exec_domain;
@@ -266,6 +267,8 @@
atomic_inc(&__user->__count); \
__user; })
+extern struct user_struct *find_user(uid_t);
+
extern struct user_struct root_user;
#define INIT_USER (&root_user)
@@ -326,9 +329,8 @@
struct task_struct *group_leader;
struct list_head thread_group;
- /* PID hash table linkage. */
- struct task_struct *pidhash_next;
- struct task_struct **pidhash_pprev;
+ /* PID/PID hash table linkage. */
+ struct pid_link pids[PIDTYPE_MAX];
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
@@ -474,38 +476,7 @@
extern struct mm_struct init_mm;
-/* PID hashing. (shouldnt this be dynamic?) */
-#define PIDHASH_SZ 8192
-extern struct task_struct *pidhash[PIDHASH_SZ];
-
-#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
-
-static inline void hash_pid(struct task_struct *p)
-{
- struct task_struct **htable = &pidhash[pid_hashfn(p->pid)];
-
- if((p->pidhash_next = *htable) != NULL)
- (*htable)->pidhash_pprev = &p->pidhash_next;
- *htable = p;
- p->pidhash_pprev = htable;
-}
-
-static inline void unhash_pid(struct task_struct *p)
-{
- if(p->pidhash_next)
- p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
- *p->pidhash_pprev = p->pidhash_next;
-}
-
-static inline struct task_struct *find_task_by_pid(int pid)
-{
- struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];
-
- for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
- ;
-
- return p;
-}
+extern struct task_struct *find_task_by_pid(int pid);
/* per-UID process charging. */
extern struct user_struct * alloc_uid(uid_t);
--- linux/include/linux/threads.h.orig Thu Sep 19 00:40:38 2002
+++ linux/include/linux/threads.h Thu Sep 19 00:41:21 2002
@@ -17,8 +17,13 @@
#define MIN_THREADS_LEFT_FOR_ROOT 4
/*
- * This controls the maximum pid allocated to a process
+ * This controls the default maximum pid allocated to a process
*/
-#define DEFAULT_PID_MAX 0x8000
+#define PID_MAX_DEFAULT 0x8000
+
+/*
+ * A maximum of 4 million PIDs should be enough for a while:
+ */
+#define PID_MAX_LIMIT (4*1024*1024)
#endif
--- linux/include/asm-i386/types.h.orig Sun Jun 9 07:26:52 2002
+++ linux/include/asm-i386/types.h Thu Sep 19 00:41:21 2002
@@ -41,7 +41,8 @@
typedef signed long long s64;
typedef unsigned long long u64;
-#define BITS_PER_LONG 32
+#define BITS_PER_LONG_SHIFT 5
+#define BITS_PER_LONG (1 << BITS_PER_LONG_SHIFT)
/* DMA addresses come in generic and 64-bit flavours. */
--- linux/fs/fcntl.c.orig Thu Sep 19 00:40:45 2002
+++ linux/fs/fcntl.c Thu Sep 19 01:32:06 2002
@@ -480,7 +480,9 @@
void send_sigio(struct fown_struct *fown, int fd, int band)
{
- struct task_struct * p;
+ struct task_struct *p;
+ struct list_head *l;
+ struct pid *pidptr;
int pid;
read_lock(&fown->lock);
@@ -493,14 +495,8 @@
send_sigio_to_task(p, fown, fd, band);
goto out_unlock_task;
}
- for_each_process(p) {
- int match = p->pid;
- if (pid < 0)
- match = -p->pgrp;
- if (pid != match)
- continue;
- send_sigio_to_task(p, fown, fd, band);
- }
+ for_each_task_pid(-pid, PIDTYPE_PGID, p, l, pidptr)
+ send_sigio_to_task(p, fown,fd,band);
out_unlock_task:
read_unlock(&tasklist_lock);
out_unlock_fown:
--- linux/fs/exec.c.orig Thu Sep 19 00:40:51 2002
+++ linux/fs/exec.c Thu Sep 19 01:48:30 2002
@@ -609,8 +609,6 @@
ptrace_unlink(leader);
ptrace_unlink(current);
- unhash_pid(current);
- unhash_pid(leader);
remove_parent(current);
remove_parent(leader);
/*
@@ -631,8 +629,6 @@
current->ptrace = ptrace;
__ptrace_link(current, parent);
}
- hash_pid(current);
- hash_pid(leader);
list_add_tail(¤t->tasks, &init_task.tasks);
state = leader->state;
--- linux/kernel/Makefile.orig Thu Sep 19 00:40:12 2002
+++ linux/kernel/Makefile Thu Sep 19 00:41:21 2002
@@ -15,7 +15,7 @@
obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o futex.o platform.o
+ signal.o sys.o kmod.o context.o futex.o platform.o pid.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
obj-$(CONFIG_SMP) += cpu.o
--- linux/kernel/exit.c.orig Thu Sep 19 00:41:07 2002
+++ linux/kernel/exit.c Thu Sep 19 02:45:34 2002
@@ -33,7 +33,12 @@
{
struct dentry *proc_dentry;
nr_threads--;
- unhash_pid(p);
+ detach_pid(p, PIDTYPE_PID);
+ if (thread_group_leader(p)) {
+ detach_pid(p, PIDTYPE_PGID);
+ detach_pid(p, PIDTYPE_SID);
+ }
+
REMOVE_LINKS(p);
p->pid = 0;
proc_dentry = p->proc_dentry;
@@ -109,22 +114,18 @@
int session_of_pgrp(int pgrp)
{
struct task_struct *p;
- int fallback;
+ struct list_head *l;
+ struct pid *pid;
+ int sid = -1;
- fallback = -1;
read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->session <= 0)
- continue;
- if (p->pgrp == pgrp) {
- fallback = p->session;
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid)
+ if (p->session > 0) {
+ sid = p->session;
break;
}
- if (p->pid == pgrp)
- fallback = p->session;
- }
read_unlock(&tasklist_lock);
- return fallback;
+ return sid;
}
/*
@@ -135,21 +136,25 @@
*
* "I ask you, have you ever known what it is to be an orphan?"
*/
-static int __will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
+static int __will_become_orphaned_pgrp(int pgrp, task_t *ignored_task)
{
struct task_struct *p;
-
- for_each_process(p) {
- if ((p == ignored_task) || (p->pgrp != pgrp) ||
- (p->state == TASK_ZOMBIE) ||
- (p->real_parent->pid == 1))
+ struct list_head *l;
+ struct pid *pid;
+ int ret = 1;
+
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+ if (p == ignored_task
+ || p->state == TASK_ZOMBIE
+ || p->real_parent->pid == 1)
continue;
- if ((p->real_parent->pgrp != pgrp) &&
- (p->real_parent->session == p->session)) {
- return 0;
+ if (p->real_parent->pgrp != pgrp
+ && p->real_parent->session == p->session) {
+ ret = 0;
+ break;
}
}
- return 1; /* (sighing) "Often!" */
+ return ret; /* (sighing) "Often!" */
}
static int will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
@@ -171,11 +176,11 @@
static inline int __has_stopped_jobs(int pgrp)
{
int retval = 0;
- struct task_struct * p;
+ struct task_struct *p;
+ struct list_head *l;
+ struct pid *pid;
- for_each_process(p) {
- if (p->pgrp != pgrp)
- continue;
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
if (p->state != TASK_STOPPED)
continue;
retval = 1;
--- linux/kernel/user.c.orig Thu Sep 19 02:53:48 2002
+++ linux/kernel/user.c Thu Sep 19 02:54:02 2002
@@ -64,6 +64,11 @@
return NULL;
}
+struct user_struct *find_user(uid_t uid)
+{
+ return uid_hash_find(uid, uidhashentry(uid));
+}
+
void free_uid(struct user_struct *up)
{
if (up && atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
--- linux/kernel/pid.c.orig Thu Sep 19 00:41:21 2002
+++ linux/kernel/pid.c Thu Sep 19 03:24:47 2002
@@ -0,0 +1,268 @@
+/*
+ * Generic pidhash and scalable, time-bounded PID allocator
+ *
+ * (C) 2002 William Irwin, IBM
+ * (C) 2002 Ingo Molnar, Red Hat
+ *
+ * pid-structures are backing objects for tasks sharing a given ID to chain
+ * against. There is very little to them aside from hashing them and
+ * parking tasks using given ID's on a list.
+ *
+ * The hash is always changed with the tasklist_lock write-acquired,
+ * and the hash is only accessed with the tasklist_lock at least
+ * read-acquired, so there's no additional SMP locking needed here.
+ *
+ * We have a list of bitmap pages, which bitmaps represent the PID space.
+ * Allocating and freeing PIDs is completely lockless. The worst-case
+ * allocation scenario when all but one out of 1 million PIDs possible are
+ * allocated already: the scanning of 32 list entries and at most PAGE_SIZE
+ * bytes. The typical fastpath is a single successful setbit. Freeing is O(1).
+ */
+
+#include <linux/mm.h>
+#include <linux/bootmem.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+
+static kmem_cache_t *pid_cache;
+
+#define PIDHASH_SIZE 4096
+static struct list_head pid_hash[PIDTYPE_MAX][PIDHASH_SIZE];
+#define pid_hashfn(nr) ((nr >> 8) ^ nr) & (PIDHASH_SIZE - 1)
+
+int pid_max = PID_MAX_DEFAULT;
+int last_pid;
+
+#define RESERVED_PIDS 300
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1)
+
+/*
+ * PID-map pages start out as NULL, they get allocated upon
+ * first use and are never deallocated. This way a low pid_max
+ * value does not cause lots of bitmaps to be allocated, but
+ * the scheme scales to up to 4 million PIDs, runtime.
+ */
+typedef struct pidmap {
+ atomic_t nr_free;
+ void *page;
+} pidmap_t;
+
+static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
+ { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };
+
+static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES;
+
+static inline int check_pid(int pid)
+{
+ pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+ int offset = pid & BITS_PER_PAGE_MASK;
+
+ return test_bit(offset, map->page);
+}
+
+void free_pidmap(int pid)
+{
+ pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+ int offset = pid & BITS_PER_PAGE_MASK;
+
+ if (!test_and_clear_bit(offset, map->page))
+ BUG();
+ atomic_inc(&map->nr_free);
+}
+
+/*
+ * Here we search for the next map that has free bits left.
+ * Normally the next map has free PIDs.
+ */
+static inline pidmap_t *next_free_map(pidmap_t *map, int *max_steps)
+{
+ while (--*max_steps) {
+ if (++map == map_limit)
+ map = pidmap_array;
+ if (unlikely(!map->page)) {
+ unsigned long page = get_zeroed_page(GFP_KERNEL);
+ /*
+ * Free the page if someone raced with us
+ * installing it:
+ */
+ if (cmpxchg(&map->page, NULL, page))
+ free_page(page);
+ if (!map->page)
+ break;
+ }
+ if (atomic_read(&map->nr_free))
+ return map;
+ }
+ return NULL;
+}
+
+int alloc_pidmap(void)
+{
+ int pid, offset, max_steps = PIDMAP_ENTRIES + 1;
+ pidmap_t *map;
+
+ pid = last_pid + 1;
+ if (pid >= pid_max)
+ pid = RESERVED_PIDS;
+
+ offset = pid & BITS_PER_PAGE_MASK;
+ map = pidmap_array + pid / BITS_PER_PAGE;
+
+ if (likely(map->page && !test_and_set_bit(offset, map->page))) {
+ /*
+ * There is a small window for last_pid updates to race,
+ * but in that case the next allocation will go into the
+ * slowpath and that fixes things up.
+ */
+return_pid:
+ atomic_dec(&map->nr_free);
+ last_pid = pid;
+ return pid;
+ }
+
+ if (!offset || !atomic_read(&map->nr_free)) {
+next_map:
+ map = next_free_map(map, &max_steps);
+ if (!map)
+ goto failure;
+ offset = 0;
+ }
+ /*
+ * Find the next zero bit:
+ */
+scan_more:
+ offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
+ if (offset == BITS_PER_PAGE)
+ goto next_map;
+ if (test_and_set_bit(offset, map->page))
+ goto scan_more;
+
+ /* we got the PID: */
+ pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
+ goto return_pid;
+
+failure:
+ return -1;
+}
+
+struct pid *find_pid(enum pid_type type, int nr)
+{
+ struct list_head *elem, *bucket = &pid_hash[type][pid_hashfn(nr)];
+ struct pid *pid;
+
+ list_for_each(elem, bucket) {
+ pid = list_entry(elem, struct pid, hash_chain);
+ if (pid->nr == nr) {
+ if (unlikely(!check_pid(nr)))
+ BUG();
+ return pid;
+ }
+ }
+ return NULL;
+}
+
+static inline struct pid *get_pid(enum pid_type type, int nr)
+{
+ struct pid *pid, *raced_pid;
+
+ pid = find_pid(type, nr);
+ if (pid)
+ goto out_inc;
+
+ while (!pid) {
+ write_unlock_irq(&tasklist_lock);
+ pid = kmem_cache_alloc(pid_cache, SLAB_KERNEL);
+ write_lock_irq(&tasklist_lock);
+ }
+
+ raced_pid = find_pid(type, nr);
+ if (likely(!raced_pid)) {
+ INIT_LIST_HEAD(&pid->task_list);
+ atomic_set(&pid->count, 1);
+ pid->type = type;
+ pid->nr = nr;
+ list_add(&pid->hash_chain, &pid_hash[type][pid_hashfn(nr)]);
+ return pid;
+ }
+ kmem_cache_free(pid_cache, pid);
+
+ pid = raced_pid;
+out_inc:
+ atomic_inc(&pid->count);
+ return pid;
+}
+
+int attach_pid(task_t *task, enum pid_type type, int nr)
+{
+ struct pid *pid;
+
+ if (unlikely(!check_pid(nr)))
+ BUG();
+
+ pid = get_pid(type, nr);
+ if (unlikely(!pid))
+ return -ENOMEM;
+
+ list_add(&task->pids[type].pid_chain, &pid->task_list);
+ task->pids[type].nr = nr;
+ task->pids[type].pid = pid;
+
+ return 0;
+}
+
+void detach_pid(task_t *task, enum pid_type type)
+{
+ struct pid_link *link = task->pids + type;
+ struct pid *pid = link->pid;
+ int nr;
+
+ list_del(&link->pid_chain);
+
+ if (!atomic_dec_and_test(&pid->count))
+ return;
+
+ nr = pid->nr;
+ list_del(&pid->hash_chain);
+ kmem_cache_free(pid_cache, pid);
+ for (type = 0; type < PIDTYPE_MAX; ++type)
+ if (find_pid(type, nr))
+ return;
+ free_pidmap(nr);
+}
+
+extern struct task_struct *find_task_by_pid(int nr)
+{
+ struct pid *pid = find_pid(PIDTYPE_PID, nr);
+
+ if (!pid || list_empty(&pid->task_list))
+ return NULL;
+
+ return pid_task(pid->task_list.next, PIDTYPE_PID);
+}
+
+void __init pid_init(void)
+{
+ pidmap_t *map = pidmap_array;
+
+ /*
+ * Allocate PID 0:
+ */
+ map->page = alloc_bootmem(PAGE_SIZE);
+ set_bit(0, map->page);
+ atomic_dec(&map->nr_free);
+}
+
+void __init pidhash_init(void)
+{
+ int i, j;
+
+ pid_cache = kmem_cache_create("pid_cache", sizeof(struct pid),
+ 0, SLAB_HWCACHE_ALIGN, NULL, NULL);
+
+ for (i = 0; i < PIDTYPE_MAX; ++i)
+ for (j = 0; j < PIDHASH_SIZE; ++j)
+ INIT_LIST_HEAD(&pid_hash[i][j]);
+}
--- linux/kernel/sched.c.orig Thu Sep 19 00:40:48 2002
+++ linux/kernel/sched.c Thu Sep 19 00:41:21 2002
@@ -2099,6 +2099,7 @@
{
runqueue_t *rq;
int i, j, k;
+ extern void pid_init(void);
for (i = 0; i < NR_CPUS; i++) {
prio_array_t *array;
@@ -2139,5 +2140,7 @@
*/
atomic_inc(&init_mm.mm_count);
enter_lazy_tlb(&init_mm, current, smp_processor_id());
+
+ pid_init();
}
--- linux/kernel/signal.c.orig Thu Sep 19 00:40:48 2002
+++ linux/kernel/signal.c Thu Sep 19 01:41:03 2002
@@ -943,18 +943,18 @@
int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
{
- int retval = -EINVAL;
- if (pgrp > 0) {
- struct task_struct *p;
-
- retval = -ESRCH;
- for_each_process(p) {
- if (p->pgrp == pgrp) {
- int err = send_sig_info(sig, info, p);
- if (retval)
- retval = err;
- }
- }
+ struct task_struct *p;
+ struct list_head *l;
+ struct pid *pid;
+ int err, retval = -ESRCH;
+
+ if (pgrp <= 0)
+ return -EINVAL;
+
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+ err = send_sig_info(sig, info, p);
+ if (retval)
+ retval = err;
}
return retval;
}
@@ -977,28 +977,33 @@
* the connection is lost.
*/
+
int
-kill_sl_info(int sig, struct siginfo *info, pid_t sess)
+kill_sl_info(int sig, struct siginfo *info, pid_t sid)
{
- int retval = -EINVAL;
- if (sess > 0) {
- struct task_struct *p;
-
- retval = -ESRCH;
- read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->leader && p->session == sess) {
- int err = send_sig_info(sig, info, p);
- if (retval)
- retval = err;
- }
- }
- read_unlock(&tasklist_lock);
+ int err, retval = -EINVAL;
+ struct pid *pid;
+ struct list_head *l;
+ struct task_struct *p;
+
+ if (sid <= 0)
+ goto out;
+
+ retval = -ESRCH;
+ read_lock(&tasklist_lock);
+ for_each_task_pid(sid, PIDTYPE_SID, p, l, pid) {
+ if (!p->leader)
+ continue;
+ err = send_sig_info(sig, info, p);
+ if (retval)
+ retval = err;
}
+ read_unlock(&tasklist_lock);
+out:
return retval;
}
-inline int
+int
kill_proc_info(int sig, struct siginfo *info, pid_t pid)
{
int error;
--- linux/kernel/sys.c.orig Thu Sep 19 02:51:05 2002
+++ linux/kernel/sys.c Thu Sep 19 02:55:49 2002
@@ -203,35 +203,34 @@
cond_syscall(sys_quotactl)
cond_syscall(sys_acct)
-static int proc_sel(struct task_struct *p, int which, int who)
+static int set_one_prio(struct task_struct *p, int niceval, int error)
{
- if(p->pid)
- {
- switch (which) {
- case PRIO_PROCESS:
- if (!who && p == current)
- return 1;
- return(p->pid == who);
- case PRIO_PGRP:
- if (!who)
- who = current->pgrp;
- return(p->pgrp == who);
- case PRIO_USER:
- if (!who)
- who = current->uid;
- return(p->uid == who);
- }
+ if (p->uid != current->euid &&
+ p->uid != current->uid && !capable(CAP_SYS_NICE)) {
+ error = -EPERM;
+ goto out;
}
- return 0;
+
+ if (error == -ESRCH)
+ error = 0;
+ if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+ error = -EACCES;
+ else
+ set_user_nice(p, niceval);
+out:
+ return error;
}
asmlinkage long sys_setpriority(int which, int who, int niceval)
{
struct task_struct *g, *p;
- int error;
+ struct user_struct *user;
+ struct pid *pid;
+ struct list_head *l;
+ int error = -EINVAL;
if (which > 2 || which < 0)
- return -EINVAL;
+ goto out;
/* normalize: avoid signed division (rounding problems) */
error = -ESRCH;
@@ -241,31 +240,38 @@
niceval = 19;
read_lock(&tasklist_lock);
- do_each_thread(g, p) {
- int no_nice;
- if (!proc_sel(p, which, who))
- continue;
- if (p->uid != current->euid &&
- p->uid != current->uid && !capable(CAP_SYS_NICE)) {
- error = -EPERM;
- continue;
- }
- if (error == -ESRCH)
- error = 0;
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
- error = -EACCES;
- continue;
- }
- no_nice = security_ops->task_setnice(p, niceval);
- if (no_nice) {
- error = no_nice;
- continue;
- }
- set_user_nice(p, niceval);
- } while_each_thread(g, p);
-
+ switch (which) {
+ case PRIO_PROCESS:
+ if (!who)
+ who = current->pid;
+ p = find_task_by_pid(who);
+ if (p)
+ error = set_one_prio(p, niceval, error);
+ break;
+ case PRIO_PGRP:
+ if (!who)
+ who = current->pgrp;
+ for_each_task_pid(who, PIDTYPE_PGID, p, l, pid)
+ error = set_one_prio(p, niceval, error);
+ break;
+ case PRIO_USER:
+ if (!who)
+ user = current->user;
+ else
+ user = find_user(who);
+
+ if (!user)
+ goto out_unlock;
+
+ do_each_thread(g, p)
+ if (p->uid == who)
+ error = set_one_prio(p, niceval, error);
+ while_each_thread(g, p);
+ break;
+ }
+out_unlock:
read_unlock(&tasklist_lock);
-
+out:
return error;
}
@@ -278,20 +284,54 @@
asmlinkage long sys_getpriority(int which, int who)
{
struct task_struct *g, *p;
- long retval = -ESRCH;
+ struct list_head *l;
+ struct pid *pid;
+ struct user_struct *user;
+ long niceval, retval = -ESRCH;
if (which > 2 || which < 0)
return -EINVAL;
read_lock(&tasklist_lock);
- do_each_thread(g, p) {
- long niceval;
- if (!proc_sel(p, which, who))
- continue;
- niceval = 20 - task_nice(p);
- if (niceval > retval)
- retval = niceval;
- } while_each_thread(g, p);
+ switch (which) {
+ case PRIO_PROCESS:
+ if (!who)
+ who = current->pid;
+ p = find_task_by_pid(who);
+ if (p) {
+ niceval = 20 - task_nice(p);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ break;
+ case PRIO_PGRP:
+ if (!who)
+ who = current->pgrp;
+ for_each_task_pid(who, PIDTYPE_PGID, p, l, pid) {
+ niceval = 20 - task_nice(p);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ break;
+ case PRIO_USER:
+ if (!who)
+ user = current->user;
+ else
+ user = find_user(who);
+
+ if (!user)
+ goto out_unlock;
+
+ do_each_thread(g, p)
+ if (p->uid == who) {
+ niceval = 20 - task_nice(p);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ while_each_thread(g, p);
+ break;
+ }
+out_unlock:
read_unlock(&tasklist_lock);
return retval;
@@ -849,7 +889,7 @@
asmlinkage long sys_setpgid(pid_t pid, pid_t pgid)
{
- struct task_struct * p;
+ struct task_struct *p;
int err = -EINVAL;
if (!pid)
@@ -862,12 +902,15 @@
/* From this point forward we keep holding onto the tasklist lock
* so that our parent does not change from under us. -DaveM
*/
- read_lock(&tasklist_lock);
+ write_lock_irq(&tasklist_lock);
err = -ESRCH;
p = find_task_by_pid(pid);
if (!p)
goto out;
+ err = -EINVAL;
+ if (!thread_group_leader(p))
+ goto out;
if (p->parent == current || p->real_parent == current) {
err = -EPERM;
@@ -882,25 +925,26 @@
if (p->leader)
goto out;
if (pgid != pid) {
- struct task_struct *g, *tmp;
- do_each_thread(g, tmp) {
- if (tmp->pgrp == pgid &&
- tmp->session == current->session)
+ struct task_struct *p;
+ struct pid *pid;
+ struct list_head *l;
+
+ for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid)
+ if (p->session == current->session)
goto ok_pgid;
- } while_each_thread(g, tmp);
goto out;
}
ok_pgid:
- err = security_ops->task_setpgid(p, pgid);
- if (err)
- goto out;
-
- p->pgrp = pgid;
+ if (p->pgrp != pgid) {
+ detach_pid(p, PIDTYPE_PGID);
+ p->pgrp = pgid;
+ attach_pid(p, PIDTYPE_PGID, pgid);
+ }
err = 0;
out:
/* All paths lead to here, thus we are safe. -DaveM */
- read_unlock(&tasklist_lock);
+ write_unlock_irq(&tasklist_lock);
return err;
}
@@ -956,22 +1000,34 @@
asmlinkage long sys_setsid(void)
{
- struct task_struct *g, *p;
+ struct pid *pid;
int err = -EPERM;
- read_lock(&tasklist_lock);
- do_each_thread(g, p)
- if (p->pgrp == current->pid)
- goto out;
- while_each_thread(g, p);
+ if (!thread_group_leader(current))
+ return -EINVAL;
+
+ write_lock_irq(&tasklist_lock);
+
+ pid = find_pid(PIDTYPE_PGID, current->pid);
+ if (pid)
+ goto out;
current->leader = 1;
- current->session = current->pgrp = current->pid;
+ if (current->session != current->pid) {
+ detach_pid(current, PIDTYPE_SID);
+ current->session = current->pid;
+ attach_pid(current, PIDTYPE_SID, current->pid);
+ }
+ if (current->pgrp != current->pid) {
+ detach_pid(current, PIDTYPE_PGID);
+ current->pgrp = current->pid;
+ attach_pid(current, PIDTYPE_PGID, current->pid);
+ }
current->tty = NULL;
current->tty_old_pgrp = 0;
err = current->pgrp;
out:
- read_unlock(&tasklist_lock);
+ write_unlock_irq(&tasklist_lock);
return err;
}
--- linux/kernel/fork.c.orig Thu Sep 19 00:41:07 2002
+++ linux/kernel/fork.c Thu Sep 19 02:52:17 2002
@@ -47,17 +47,6 @@
int max_threads;
unsigned long total_forks; /* Handle normal Linux uptimes. */
-/*
- * Protects next_safe, last_pid and pid_max:
- */
-spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
-
-static int next_safe = DEFAULT_PID_MAX;
-int pid_max = DEFAULT_PID_MAX;
-int last_pid;
-
-struct task_struct *pidhash[PIDHASH_SZ];
-
rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
/*
@@ -84,7 +73,6 @@
}
}
-/* Protects next_safe and last_pid. */
void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
{
unsigned long flags;
@@ -159,54 +147,6 @@
return tsk;
}
-static int get_pid(unsigned long flags)
-{
- struct task_struct *g, *p;
- int pid;
-
- if (flags & CLONE_IDLETASK)
- return 0;
-
- spin_lock(&lastpid_lock);
- if (++last_pid > pid_max) {
- last_pid = 300; /* Skip daemons etc. */
- goto inside;
- }
-
- if (last_pid >= next_safe) {
-inside:
- if (nr_threads > pid_max >> 4)
- pid_max <<= 1;
- next_safe = pid_max;
- read_lock(&tasklist_lock);
- repeat:
- do_each_thread(g, p) {
- if (p->pid == last_pid ||
- p->pgrp == last_pid ||
- p->session == last_pid) {
- if (++last_pid >= next_safe) {
- if (last_pid >= pid_max)
- last_pid = 300;
- next_safe = pid_max;
- }
- goto repeat;
- }
- if (p->pid > last_pid && next_safe > p->pid)
- next_safe = p->pid;
- if (p->pgrp > last_pid && next_safe > p->pgrp)
- next_safe = p->pgrp;
- if (p->session > last_pid && next_safe > p->session)
- next_safe = p->session;
- } while_each_thread(g, p);
-
- read_unlock(&tasklist_lock);
- }
- pid = last_pid;
- spin_unlock(&lastpid_lock);
-
- return pid;
-}
-
static inline int dup_mmap(struct mm_struct * mm)
{
struct vm_area_struct * mpnt, *tmp, **pprev;
@@ -726,7 +666,13 @@
p->state = TASK_UNINTERRUPTIBLE;
copy_flags(clone_flags, p);
- p->pid = get_pid(clone_flags);
+ if (clone_flags & CLONE_IDLETASK)
+ p->pid = 0;
+ else {
+ p->pid = alloc_pidmap();
+ if (p->pid == -1)
+ goto bad_fork_cleanup;
+ }
p->proc_dentry = NULL;
INIT_LIST_HEAD(&p->run_list);
@@ -889,7 +835,13 @@
SET_LINKS(p);
if (p->ptrace & PT_PTRACED)
__ptrace_link(p, current->parent);
- hash_pid(p);
+
+ attach_pid(p, PIDTYPE_PID, p->pid);
+ if (thread_group_leader(p)) {
+ attach_pid(p, PIDTYPE_PGID, p->pgrp);
+ attach_pid(p, PIDTYPE_SID, p->session);
+ }
+
nr_threads++;
write_unlock_irq(&tasklist_lock);
retval = 0;
@@ -914,6 +866,8 @@
bad_fork_cleanup_security:
security_ops->task_free_security(p);
bad_fork_cleanup:
+ if (p->pid > 0)
+ free_pidmap(p->pid);
put_exec_domain(p->thread_info->exec_domain);
if (p->binfmt && p->binfmt->module)
__MOD_DEC_USE_COUNT(p->binfmt->module);
--- linux/kernel/ksyms.c.orig Thu Sep 19 00:40:52 2002
+++ linux/kernel/ksyms.c Thu Sep 19 00:41:21 2002
@@ -602,7 +602,6 @@
EXPORT_SYMBOL(init_thread_union);
EXPORT_SYMBOL(tasklist_lock);
-EXPORT_SYMBOL(pidhash);
#if defined(CONFIG_SMP) && defined(__GENERIC_PER_CPU)
EXPORT_SYMBOL(__per_cpu_offset);
#endif
--- linux/init/main.c.orig Thu Sep 19 00:40:48 2002
+++ linux/init/main.c Thu Sep 19 00:41:21 2002
@@ -66,6 +66,7 @@
extern void sysctl_init(void);
extern void signals_init(void);
extern void buffer_init(void);
+extern void pidhash_init(void);
extern void pte_chain_init(void);
extern void radix_tree_init(void);
extern void free_initmem(void);
@@ -432,6 +433,7 @@
#endif
mem_init();
kmem_cache_sizes_init();
+ pidhash_init();
pgtable_cache_init();
pte_chain_init();
fork_init(num_physpages);
On Thu, 19 Sep 2002, Ingo Molnar wrote:
>
> the attached patch is a significantly cleaned up version of the generic
> pidhash patch, against BK-curr.
Hmm.. I think I like it. One big improvement is just the renaming of
"struct idtag" to "struct pid", which makes it look a lot less abstract to
me. Maybe that's just me.
However:
> read_lock(&tasklist_lock);
> - for_each_process(p) {
> - if ((tty->session > 0) && (p->session == tty->session) &&
> - p->leader) {
> - send_sig(SIGHUP,p,1);
> - send_sig(SIGCONT,p,1);
> + if (tty->session > 0)
> + for_each_task_pid(tty->session, PIDTYPE_SID, p, l, pid) {
> + if (p->tty == tty)
> + p->tty = NULL;
> + if (!p->leader)
> + continue;
> + send_sig(SIGHUP, p, 1);
> + send_sig(SIGCONT, p, 1);
> if (tty->pgrp > 0)
> p->tty_old_pgrp = tty->pgrp;
> }
> - if (p->tty == tty)
> - p->tty = NULL;
> - }
This looks a bit wrong. In particular, the old code used to set "p->tty"
to NULL if it matched any process, while the new code only does that for
processes in the right session. Hmm?
Can you double-check some of these things?
Linus
On Wed, 18 Sep 2002, Linus Torvalds wrote:
> However:
>
> > read_lock(&tasklist_lock);
> > - for_each_process(p) {
> > - if ((tty->session > 0) && (p->session == tty->session) &&
> > - p->leader) {
> > - send_sig(SIGHUP,p,1);
> > - send_sig(SIGCONT,p,1);
[...]
> This looks a bit wrong. In particular, the old code used to set "p->tty"
> to NULL if it matched any process, while the new code only does that for
> processes in the right session. Hmm?
i had this code changed back and forth twice already, because i was unsure
whether it's correct. It's true that it's not an equivalent
transformation, but after some thinking it looked correct to me. Isnt
there a 1:1 relationship between the tty and the session ID in this case?
Ie. the previous code was pessimistic and just scanned for matching
p->tty's outside of the tty->session == p->session space, but it should
not have done so.
i'll add some debugging code to the old code to print out cases when
p->tty == tty but p->session != tty->session and start up X, that should
prove or disprove this theory, right?
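(something like the following check is what i mean - where exactly it
goes in the old do_tty_hangup() loop is just an illustration, not the
actual debugging patch:)

	/*
	 * sketch of the debug check: warn about tasks that reference
	 * the tty but are not in the tty's session.
	 */
	read_lock(&tasklist_lock);
	for_each_process(p) {
		if (p->tty == tty && p->session != tty->session)
			printk(KERN_WARNING
				"tty hangup: pid %d has the tty but "
				"session %d != %d\n",
				p->pid, p->session, tty->session);
	}
	read_unlock(&tasklist_lock);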
(i can't remember any other place where the code transformation was not
identity, but will double-check this again.)
William, you did the original transformation, was this optimization done
intentionally?
Ingo
On Thu, Sep 19, 2002 at 11:25:11AM +0200, Ingo Molnar wrote:
> i'll add some debugging code to the old code to print out cases when
> p->tty == tty but p->session != tty->session and start up X, that should
> prove or disprove this theory, right?
> (i cant remember any other place where the code transformation was not
> identity, but will double-check this again.)
> William, you did the original transformation, was this optimization done
> intentionally?
> Ingo
Sorry, I was debugging some boot-time issues I ran into.
I did this intentionally. Basically, sys_setsid() does the right thing,
but tty_ioctl() does not. There is already some inconsistency
about how task->tty is locked, and I'd not yet come to a conclusion.
On second thought, I really hope I'm imagining things here:
For instance, tty_open() does the following while not holding the
tasklist_lock for writing:
	if (IS_TTY_DEV(device)) {
		if (!current->tty)
			return -ENXIO;
		device = current->tty->device;
So tiocsctty() appears to have the following race against tty_open():
        A                                B
        -----------------------------------------------------
        tty->session > 0                 if (!current->tty)
                                                 return -ENXIO;
                                         (test fails, don't return)
        iterate tasklist, delay...
        setting p->tty = NULL            current->tty is set to NULL
        (this looks like an attempt to obtain a unique
         reference by blowing away everyone else's)
        task_lock(current)               device = current->tty->device
        current->tty = tty                       OOPS
        task_unlock(task)
tiocsctty() holds the BKL (from sys_ioctl() itself) and the
tasklist_lock for reading, but tty_open() seems to hold only some
vfs references. release_dev() does the same style of iteration, but
doesn't even hold the BKL, only the tasklist_lock for reading - and
it is, very ominously, called by tty_open() itself.
Also, most other accesses to tty->session and/or tty->pgrp seem to
be protected by the BKL.
Bill
On Thu, 19 Sep 2002, William Lee Irwin III wrote:
>
> I did this intentionally. Basically, sys_setsid() does the right thing,
> but tty_ioctl() does not. There is already some inconsistency
> about how task->tty is locked, and I'd not yet come to a conclusion.
I agree about the locking issue (although I do _not_ believe that the
tasklock should have anything to do with the tsk->tty locking - it should
most likely use some per-task lock for the actual tty accesses, together
with the optimization that a write lock on the tasklock is sufficient to
protect it because it means that nobody else can look up the task).
However, what I worry about is that there may not (will not) be a 1:1
session<->tty thing. In particular, when somebody creates a new session
with "setsid()", that does not remove the tty from processes that used to
hold it, I think (this is all from memory, so I might be wrong).
Which means that if the tty is going away, it has to be removed from _all_
tasks, not just from the one session that happened to be the most recent
one.
Linus
On Thu, 19 Sep 2002, Linus Torvalds wrote:
> However, what I worry about is that there may not (will not) be a 1:1
> session<->tty thing. In particular, when somebody creates a new session
> with "setsid()", that does not remove the tty from processes that used
> to hold it, I think (this is all from memory, so I might be wrong).
i've done the testrun with the p->tty == tty && p->session != tty->session
check, and it never triggered. (Which means nothing, but at least the
assumption is not trivially wrong.)
Ingo
On Thu, Sep 19, 2002 at 08:12:31AM -0700, Linus Torvalds wrote:
> However, what I worry about is that there may not (will not) be a 1:1
> session<->tty thing. In particular, when somebody creates a new session
> with "setsid()", that does not remove the tty from processes that used to
> hold it, I think (this is all from memory, so I might be wrong).
>
> Which means that if the tty is going away, it has to be removed from _all_
> tasks, not just from the one session that happened to be the most recent
> one.
[POSIX 1003.1-2001]
Controlling Terminal
A terminal that is associated with a session. Each session may have
at most one controlling terminal associated with it, and a controlling
terminal is associated with exactly one session.
Certain input sequences from the controlling terminal cause signals
to be sent to all processes in the process group associated with
the controlling terminal.
The Controlling Terminal
A terminal may belong to a process as its controlling terminal.
Each process of a session that has a controlling terminal has the same
controlling terminal. A terminal may be the controlling terminal
for at most one session. If a session leader has no controlling terminal,
and opens a terminal device file that is not already associated with a
session without using the O_NOCTTY option, it is implementation-defined
whether the terminal becomes the controlling terminal of the session leader.
If a process which is not a session leader opens a terminal file, or
the O_NOCTTY option is used on open(), then that terminal shall not become
the controlling terminal of the calling process. When a controlling terminal
becomes associated with a session, its foreground process group shall be set
to the process group of the session leader.
The controlling terminal is inherited by a child process during a fork()
function call. A process relinquishes its controlling terminal when it creates
a new session with the setsid() function; other processes remaining in the
old session that had this terminal as their controlling terminal continue
to have it. Upon the close of the last file descriptor in the system
(whether or not it is in the current session) associated with the controlling
terminal, it is unspecified whether all processes that had that terminal
as their controlling terminal cease to have any controlling terminal.
A process does not relinquish its controlling terminal simply by closing all
of its file descriptors associated with the controlling terminal if other
processes continue to have it open.
When a controlling process terminates, the controlling terminal is dissociated
from the current session, allowing it to be acquired by a new session leader.
Subsequent access to the terminal by other processes in the earlier session
may be denied, with attempts to access the terminal treated as if a modem
disconnect had been sensed.
On Thu, 19 Sep 2002, Andries Brouwer wrote:
> >
> > Which means that if the tty is going away, it has to be removed from _all_
> > tasks, not just from the one session that happened to be the most recent
> > one.
>
> [POSIX 1003.1-2001]
Gaah. Good that somebody else has the energy to actually read the
standards instead of just trying to desperately remember some dusty
details..
> The controlling terminal is inherited by a child process during a fork()
> function call. A process relinquishes its controlling terminal when it creates
> a new session with the setsid() function; other processes remaining in the
> old session that had this terminal as their controlling terminal continue
> to have it.
Well, that certainly clinches the fact that the controlling terminal _can_
and does continue to be held by processes outside the current session
group.
I suspect that to handle controlling terminals efficiently (ie without
iterating over all tasks), the "current->tty" thing needs to be expanded
with a linked list of processes sharing the tty or something (probably
with the head of the list being in the tty structure itself).
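(roughly something like this - the field and helper names below are
purely illustrative, nothing like this exists in Ingo's patch, and the
locking is omitted:)

	struct tty_struct {
		...
		struct list_head tty_tasks;	/* head of the task list */
	};

	struct task_struct {
		...
		struct tty_struct *tty;
		struct list_head tty_link;	/* entry on tty->tty_tasks */
	};

	/* dropping the tty from all tasks becomes one short list walk: */
	static void tty_detach_all(struct tty_struct *tty)
	{
		struct list_head *l, *n;
		struct task_struct *p;

		list_for_each_safe(l, n, &tty->tty_tasks) {
			p = list_entry(l, struct task_struct, tty_link);
			p->tty = NULL;
			list_del_init(l);
		}
	}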
On the other hand, I don't think it necessarily is a problem to walk all
threads either - the controlling terminal changes should be rare, and this
is O(n) rather than some quadratic or other behaviour.
Anyway, that seems to make Ingo's patch wrong for this case at least.
Ingo?
Linus
In article <[email protected]>,
Linus Torvalds <[email protected]> wrote:
>
>On Thu, 19 Sep 2002, Andries Brouwer wrote:
>> The controlling terminal is inherited by a child process during a fork()
>> function call. A process relinquishes its controlling terminal when it creates
>> a new session with the setsid() function; other processes remaining in the
>> old session that had this terminal as their controlling terminal continue
>> to have it.
>
>Well, that certainly clinches the fact that the controlling terminal _can_
>and does continue to be held by processes outside the current session
>group.
No, not at all. A few lines above that it said:
>> A terminal that is associated with a session. Each session may have
>> at most one controlling terminal associated with it, and a controlling
>> terminal is associated with exactly one session.
A session has zero or 1 controlling terminals. A controlling
terminal is associated with one session only.
Mike.
[email protected] (Linus Torvalds) wrote on 19.09.02 in <[email protected]>:
> On Thu, 19 Sep 2002, Andries Brouwer wrote:
> > [POSIX 1003.1-2001]
> > The controlling terminal is inherited by a child process during a fork()
> > function call. A process relinquishes its controlling terminal when it
> > creates a new session with the setsid() function; other processes
> > remaining in the old session that had this terminal as their controlling
> > terminal continue to have it.
>
> Well, that certainly clinches the fact that the controlling terminal _can_
> and does continue to be held by processes outside the current session
> group.
On the contrary: it says that this can never happen - the new session has
no controlling terminal, and can't get the old one unless the old session
loses it first.
MfG Kai
i've attached the latest version of the generic pidhash patch. The biggest
change is the removal of separately allocated pid structures: they are now
part of the task structure and the first task that uses a PID will provide
the pid structure. Task refcounting is used to avoid the freeing of the
task structure before every member of a process group or session has
exited.
this approach has a number of advantages beyond the performance gains.
Besides simplifying the whole hashing code significantly, attach_pid() is
now fundamentally atomic and can be called during create_process() without
worrying about task-list side-effects. It does not have to re-search the
pidhash to find out about raced PID-adding either, and attach_pid()
cannot fail due to OOM. detach_pid() can do a simple put_task_struct()
instead of the kmem_cache_free().
the only minimal downside is the potential for pending task structures after
session leaders or group leaders have exited - but the number of orphan
sessions and process groups is usually very low - and even if it's higher,
this can be regarded as a slow execution of the final deallocation of the
session leader, not some additional burden.
[Benchmark results comparing stock kernel against patched kernel come
after the Changelog.]
Changes since -D4:
- remove the tty_io.c changes.
- remove stale change from include/asm-i386/types.h
- add list_for_each_noprefetch() to list.h, for all those list users who
know that in the majority of cases the list is going to be short.
- reorganize struct pid: remove the type field, add the task pointer.
- reorganize pid_link: remove the 'nr' field, add the struct pid field.
- thread-exit optimization: remove the tsk->real_timer only if it was
pending.
- thread-create optimization: reuse the task_cache if existent. Use
atomic ops to manipulate it, so that thread-create can access it from
outside of the tasklist lock.
- remove the pid_cache SLAB cache.
- simplify and speed up attach_pid(): no need to drop the lock, look up
the pid only once, and no failure path.
- speed up detach_pid(): put_task_struct() instead of kmem_cache_free().
- merge pidhash_init() into pid_init(). Call it from main.c only.
- allocate not only the PID 0 bitmap slot, but also hash it into all
three hashes to make sure no-one tries to reuse it.
- cleanups, code simplifications.
Performance:
single-thread create+exit latency (full latency, including the
release_task() overhead) on a UP 525 MHz PIII box:
stock BK-curr
-------------
-> 3864 cycles [7.025455 usecs]
BK-curr + generic-pidhash-J2
----------------------------
-> 3430 cycles [6.236364 usecs]
ie. the patched kernel is 12% faster than the stock kernel. Most of the
improvement is due to the fork.c and exit.c optimizations.
Note that this is the best-possible benchmark scenario for the old PID
allocator: the new PID allocator's best-case latency is burdened by the
fact that it avoids worst-case latencies, and its average latency is
independent of layout of the PID space. Yesterday's test that proved the
old allocator's high sensitivity to the PID allocation rate and patterns
is still valid.
i've compiled, booted & tested the patch on UP and SMP x86 as well.
Ingo
--- linux/include/linux/pid.h.orig Thu Sep 19 21:30:31 2002
+++ linux/include/linux/pid.h Thu Sep 19 21:30:31 2002
@@ -0,0 +1,63 @@
+#ifndef _LINUX_PID_H
+#define _LINUX_PID_H
+
+enum pid_type
+{
+ PIDTYPE_PID,
+ PIDTYPE_PGID,
+ PIDTYPE_SID,
+ PIDTYPE_MAX
+};
+
+struct pid
+{
+ int nr;
+ atomic_t count;
+ struct task_struct *task;
+ struct list_head task_list;
+ struct list_head hash_chain;
+};
+
+struct pid_link
+{
+ struct list_head pid_chain;
+ struct pid *pidptr;
+ struct pid pid;
+};
+
+#define pid_task(elem, type) \
+ list_entry(elem, struct task_struct, pids[type].pid_chain)
+
+/*
+ * attach_pid() must be called with the tasklist_lock write-held.
+ *
+ * It might unlock the tasklist_lock for allocation, so this
+ * function must be called after installing all other links of
+ * a new task.
+ */
+extern int FASTCALL(attach_pid(struct task_struct *, enum pid_type, int));
+
+/*
+ * detach_pid() must be called with the tasklist_lock write-held.
+ */
+extern void FASTCALL(detach_pid(struct task_struct *task, enum pid_type));
+
+/*
+ * look up a PID in the hash table. Must be called with the tasklist_lock
+ * held.
+ */
+extern struct pid *FASTCALL(find_pid(enum pid_type, int));
+
+extern int alloc_pidmap(void);
+extern void FASTCALL(free_pidmap(int));
+
+#define for_each_task_pid(who, type, task, elem, pid) \
+ if ((pid = find_pid(type, who))) \
+ for (elem = pid->task_list.next, \
+ prefetch(elem->next), \
+ task = pid_task(elem, type); \
+ elem != &pid->task_list; \
+ elem = elem->next, prefetch(elem->next), \
+ task = pid_task(elem, type))
+
+#endif /* _LINUX_PID_H */
--- linux/include/linux/sched.h.orig Thu Sep 19 21:20:51 2002
+++ linux/include/linux/sched.h Thu Sep 19 21:30:31 2002
@@ -28,6 +28,7 @@
#include <linux/fs_struct.h>
#include <linux/compiler.h>
#include <linux/completion.h>
+#include <linux/pid.h>
struct exec_domain;
@@ -266,6 +267,8 @@
atomic_inc(&__user->__count); \
__user; })
+extern struct user_struct *find_user(uid_t);
+
extern struct user_struct root_user;
#define INIT_USER (&root_user)
@@ -326,9 +329,8 @@
struct task_struct *group_leader;
struct list_head thread_group;
- /* PID hash table linkage. */
- struct task_struct *pidhash_next;
- struct task_struct **pidhash_pprev;
+ /* PID/PID hash table linkage. */
+ struct pid_link pids[PIDTYPE_MAX];
wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
@@ -474,38 +476,7 @@
extern struct mm_struct init_mm;
-/* PID hashing. (shouldnt this be dynamic?) */
-#define PIDHASH_SZ 8192
-extern struct task_struct *pidhash[PIDHASH_SZ];
-
-#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
-
-static inline void hash_pid(struct task_struct *p)
-{
- struct task_struct **htable = &pidhash[pid_hashfn(p->pid)];
-
- if((p->pidhash_next = *htable) != NULL)
- (*htable)->pidhash_pprev = &p->pidhash_next;
- *htable = p;
- p->pidhash_pprev = htable;
-}
-
-static inline void unhash_pid(struct task_struct *p)
-{
- if(p->pidhash_next)
- p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
- *p->pidhash_pprev = p->pidhash_next;
-}
-
-static inline struct task_struct *find_task_by_pid(int pid)
-{
- struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];
-
- for(p = *htable; p && p->pid != pid; p = p->pidhash_next)
- ;
-
- return p;
-}
+extern struct task_struct *find_task_by_pid(int pid);
/* per-UID process charging. */
extern struct user_struct * alloc_uid(uid_t);
--- linux/include/linux/threads.h.orig Thu Sep 19 21:20:33 2002
+++ linux/include/linux/threads.h Thu Sep 19 21:30:31 2002
@@ -17,8 +17,13 @@
#define MIN_THREADS_LEFT_FOR_ROOT 4
/*
- * This controls the maximum pid allocated to a process
+ * This controls the default maximum pid allocated to a process
*/
-#define DEFAULT_PID_MAX 0x8000
+#define PID_MAX_DEFAULT 0x8000
+
+/*
+ * A maximum of 4 million PIDs should be enough for a while:
+ */
+#define PID_MAX_LIMIT (4*1024*1024)
#endif
--- linux/include/linux/list.h.orig Thu Sep 19 21:20:33 2002
+++ linux/include/linux/list.h Thu Sep 19 21:30:31 2002
@@ -195,6 +195,10 @@
#define list_for_each(pos, head) \
for (pos = (head)->next, prefetch(pos->next); pos != (head); \
pos = pos->next, prefetch(pos->next))
+
+#define list_for_each_noprefetch(pos, head) \
+ for (pos = (head)->next; pos != (head); pos = pos->next)
+
/**
* list_for_each_prev - iterate over a list backwards
* @pos: the &struct list_head to use as a loop counter.
--- linux/fs/fcntl.c.orig Thu Sep 19 21:20:41 2002
+++ linux/fs/fcntl.c Thu Sep 19 21:30:31 2002
@@ -480,7 +480,9 @@
void send_sigio(struct fown_struct *fown, int fd, int band)
{
- struct task_struct * p;
+ struct task_struct *p;
+ struct list_head *l;
+ struct pid *pidptr;
int pid;
read_lock(&fown->lock);
@@ -493,14 +495,8 @@
send_sigio_to_task(p, fown, fd, band);
goto out_unlock_task;
}
- for_each_process(p) {
- int match = p->pid;
- if (pid < 0)
- match = -p->pgrp;
- if (pid != match)
- continue;
- send_sigio_to_task(p, fown, fd, band);
- }
+ for_each_task_pid(-pid, PIDTYPE_PGID, p, l, pidptr)
+ send_sigio_to_task(p, fown,fd,band);
out_unlock_task:
read_unlock(&tasklist_lock);
out_unlock_fown:
--- linux/fs/exec.c.orig Thu Sep 19 21:20:47 2002
+++ linux/fs/exec.c Thu Sep 19 21:30:31 2002
@@ -609,8 +609,6 @@
ptrace_unlink(leader);
ptrace_unlink(current);
- unhash_pid(current);
- unhash_pid(leader);
remove_parent(current);
remove_parent(leader);
/*
@@ -631,8 +629,6 @@
current->ptrace = ptrace;
__ptrace_link(current, parent);
}
- hash_pid(current);
- hash_pid(leader);
list_add_tail(¤t->tasks, &init_task.tasks);
state = leader->state;
--- linux/kernel/Makefile.orig Thu Sep 19 21:30:18 2002
+++ linux/kernel/Makefile Thu Sep 19 21:30:31 2002
@@ -8,7 +8,7 @@
obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
module.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
- signal.o sys.o kmod.o context.o futex.o platform.o
+ signal.o sys.o kmod.o context.o futex.o platform.o pid.o
obj-$(CONFIG_GENERIC_ISA_DMA) += dma.o
obj-$(CONFIG_SMP) += cpu.o
--- linux/kernel/exit.c.orig Thu Sep 19 21:30:18 2002
+++ linux/kernel/exit.c Thu Sep 19 21:30:31 2002
@@ -33,7 +33,12 @@
{
struct dentry *proc_dentry;
nr_threads--;
- unhash_pid(p);
+ detach_pid(p, PIDTYPE_PID);
+ if (thread_group_leader(p)) {
+ detach_pid(p, PIDTYPE_PGID);
+ detach_pid(p, PIDTYPE_SID);
+ }
+
REMOVE_LINKS(p);
p->pid = 0;
proc_dentry = p->proc_dentry;
@@ -109,22 +114,18 @@
int session_of_pgrp(int pgrp)
{
struct task_struct *p;
- int fallback;
+ struct list_head *l;
+ struct pid *pid;
+ int sid = -1;
- fallback = -1;
read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->session <= 0)
- continue;
- if (p->pgrp == pgrp) {
- fallback = p->session;
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid)
+ if (p->session > 0) {
+ sid = p->session;
break;
}
- if (p->pid == pgrp)
- fallback = p->session;
- }
read_unlock(&tasklist_lock);
- return fallback;
+ return sid;
}
/*
@@ -135,21 +136,25 @@
*
* "I ask you, have you ever known what it is to be an orphan?"
*/
-static int __will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
+static int __will_become_orphaned_pgrp(int pgrp, task_t *ignored_task)
{
struct task_struct *p;
-
- for_each_process(p) {
- if ((p == ignored_task) || (p->pgrp != pgrp) ||
- (p->state == TASK_ZOMBIE) ||
- (p->real_parent->pid == 1))
+ struct list_head *l;
+ struct pid *pid;
+ int ret = 1;
+
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+ if (p == ignored_task
+ || p->state == TASK_ZOMBIE
+ || p->real_parent->pid == 1)
continue;
- if ((p->real_parent->pgrp != pgrp) &&
- (p->real_parent->session == p->session)) {
- return 0;
+ if (p->real_parent->pgrp != pgrp
+ && p->real_parent->session == p->session) {
+ ret = 0;
+ break;
}
}
- return 1; /* (sighing) "Often!" */
+ return ret; /* (sighing) "Often!" */
}
static int will_become_orphaned_pgrp(int pgrp, struct task_struct * ignored_task)
@@ -171,11 +176,11 @@
static inline int __has_stopped_jobs(int pgrp)
{
int retval = 0;
- struct task_struct * p;
+ struct task_struct *p;
+ struct list_head *l;
+ struct pid *pid;
- for_each_process(p) {
- if (p->pgrp != pgrp)
- continue;
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
if (p->state != TASK_STOPPED)
continue;
retval = 1;
@@ -605,7 +610,8 @@
if (tsk->pid == 1)
panic("Attempted to kill init!");
tsk->flags |= PF_EXITING;
- del_timer_sync(&tsk->real_timer);
+ if (timer_pending(&tsk->real_timer))
+ del_timer_sync(&tsk->real_timer);
if (unlikely(preempt_count()))
printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
--- linux/kernel/user.c.orig Thu Sep 19 21:20:08 2002
+++ linux/kernel/user.c Thu Sep 19 21:30:31 2002
@@ -64,6 +64,11 @@
return NULL;
}
+struct user_struct *find_user(uid_t uid)
+{
+ return uid_hash_find(uid, uidhashentry(uid));
+}
+
void free_uid(struct user_struct *up)
{
if (up && atomic_dec_and_lock(&up->__count, &uidhash_lock)) {
--- linux/kernel/pid.c.orig Thu Sep 19 21:30:31 2002
+++ linux/kernel/pid.c Thu Sep 19 21:30:31 2002
@@ -0,0 +1,219 @@
+/*
+ * Generic pidhash and scalable, time-bounded PID allocator
+ *
+ * (C) 2002 William Irwin, IBM
+ * (C) 2002 Ingo Molnar, Red Hat
+ *
+ * pid-structures are backing objects for tasks sharing a given ID to chain
+ * against. There is very little to them aside from hashing them and
+ * parking tasks using given ID's on a list.
+ *
+ * The hash is always changed with the tasklist_lock write-acquired,
+ * and the hash is only accessed with the tasklist_lock at least
+ * read-acquired, so there's no additional SMP locking needed here.
+ *
+ * We have a list of bitmap pages, which bitmaps represent the PID space.
+ * Allocating and freeing PIDs is completely lockless. The worst-case
+ * allocation scenario when all but one out of 1 million PIDs possible are
+ * allocated already: the scanning of 32 list entries and at most PAGE_SIZE
+ * bytes. The typical fastpath is a single successful setbit. Freeing is O(1).
+ */
+
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+
+#define PIDHASH_SIZE 4096
+#define pid_hashfn(nr) ((nr >> 8) ^ nr) & (PIDHASH_SIZE - 1)
+static struct list_head pid_hash[PIDTYPE_MAX][PIDHASH_SIZE];
+
+int pid_max = PID_MAX_DEFAULT;
+int last_pid;
+
+#define RESERVED_PIDS 300
+
+#define PIDMAP_ENTRIES (PID_MAX_LIMIT/PAGE_SIZE/8)
+#define BITS_PER_PAGE (PAGE_SIZE*8)
+#define BITS_PER_PAGE_MASK (BITS_PER_PAGE-1)
+
+/*
+ * PID-map pages start out as NULL, they get allocated upon
+ * first use and are never deallocated. This way a low pid_max
+ * value does not cause lots of bitmaps to be allocated, but
+ * the scheme scales to up to 4 million PIDs, runtime.
+ */
+typedef struct pidmap {
+ atomic_t nr_free;
+ void *page;
+} pidmap_t;
+
+static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
+ { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };
+
+static pidmap_t *map_limit = pidmap_array + PIDMAP_ENTRIES;
+
+inline void free_pidmap(int pid)
+{
+ pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
+ int offset = pid & BITS_PER_PAGE_MASK;
+
+ clear_bit(offset, map->page);
+ atomic_inc(&map->nr_free);
+}
+
+/*
+ * Here we search for the next map that has free bits left.
+ * Normally the next map has free PIDs.
+ */
+static inline pidmap_t *next_free_map(pidmap_t *map, int *max_steps)
+{
+ while (--*max_steps) {
+ if (++map == map_limit)
+ map = pidmap_array;
+ if (unlikely(!map->page)) {
+ unsigned long page = get_zeroed_page(GFP_KERNEL);
+ /*
+ * Free the page if someone raced with us
+ * installing it:
+ */
+ if (cmpxchg(&map->page, NULL, page))
+ free_page(page);
+ if (!map->page)
+ break;
+ }
+ if (atomic_read(&map->nr_free))
+ return map;
+ }
+ return NULL;
+}
+
+int alloc_pidmap(void)
+{
+ int pid, offset, max_steps = PIDMAP_ENTRIES + 1;
+ pidmap_t *map;
+
+ pid = last_pid + 1;
+ if (pid >= pid_max)
+ pid = RESERVED_PIDS;
+
+ offset = pid & BITS_PER_PAGE_MASK;
+ map = pidmap_array + pid / BITS_PER_PAGE;
+
+ if (likely(map->page && !test_and_set_bit(offset, map->page))) {
+ /*
+ * There is a small window for last_pid updates to race,
+ * but in that case the next allocation will go into the
+ * slowpath and that fixes things up.
+ */
+return_pid:
+ atomic_dec(&map->nr_free);
+ last_pid = pid;
+ return pid;
+ }
+
+ if (!offset || !atomic_read(&map->nr_free)) {
+next_map:
+ map = next_free_map(map, &max_steps);
+ if (!map)
+ goto failure;
+ offset = 0;
+ }
+ /*
+ * Find the next zero bit:
+ */
+scan_more:
+ offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
+ if (offset == BITS_PER_PAGE)
+ goto next_map;
+ if (test_and_set_bit(offset, map->page))
+ goto scan_more;
+
+ /* we got the PID: */
+ pid = (map - pidmap_array) * BITS_PER_PAGE + offset;
+ goto return_pid;
+
+failure:
+ return -1;
+}
+
+inline struct pid *find_pid(enum pid_type type, int nr)
+{
+ struct list_head *elem, *bucket = &pid_hash[type][pid_hashfn(nr)];
+ struct pid *pid;
+
+ list_for_each_noprefetch(elem, bucket) {
+ pid = list_entry(elem, struct pid, hash_chain);
+ if (pid->nr == nr)
+ return pid;
+ }
+ return NULL;
+}
+
+int attach_pid(task_t *task, enum pid_type type, int nr)
+{
+ struct pid *pid = find_pid(type, nr);
+
+ if (pid)
+ atomic_inc(&pid->count);
+ else {
+ pid = &task->pids[type].pid;
+ pid->nr = nr;
+ atomic_set(&pid->count, 1);
+ INIT_LIST_HEAD(&pid->task_list);
+ pid->task = current;
+ get_task_struct(current);
+ list_add(&pid->hash_chain, &pid_hash[type][pid_hashfn(nr)]);
+ }
+ list_add(&task->pids[type].pid_chain, &pid->task_list);
+ task->pids[type].pidptr = pid;
+
+ return 0;
+}
+
+void detach_pid(task_t *task, enum pid_type type)
+{
+ struct pid_link *link = task->pids + type;
+ struct pid *pid = link->pidptr;
+ int nr;
+
+ list_del(&link->pid_chain);
+ if (!atomic_dec_and_test(&pid->count))
+ return;
+
+ nr = pid->nr;
+ list_del(&pid->hash_chain);
+ put_task_struct(pid->task);
+
+ for (type = 0; type < PIDTYPE_MAX; ++type)
+ if (find_pid(type, nr))
+ return;
+ free_pidmap(nr);
+}
+
+extern task_t *find_task_by_pid(int nr)
+{
+ struct pid *pid = find_pid(PIDTYPE_PID, nr);
+
+ if (!pid)
+ return NULL;
+ return pid_task(pid->task_list.next, PIDTYPE_PID);
+}
+
+void __init pidhash_init(void)
+{
+ int i, j;
+
+ /*
+ * Allocate PID 0, and hash it via all PID types:
+ */
+ pidmap_array->page = (void *)get_zeroed_page(GFP_KERNEL);
+ set_bit(0, pidmap_array->page);
+ atomic_dec(&pidmap_array->nr_free);
+
+ for (i = 0; i < PIDTYPE_MAX; i++) {
+ for (j = 0; j < PIDHASH_SIZE; j++)
+ INIT_LIST_HEAD(&pid_hash[i][j]);
+ attach_pid(current, i, 0);
+ }
+}
--- linux/kernel/signal.c.orig Thu Sep 19 21:20:41 2002
+++ linux/kernel/signal.c Thu Sep 19 21:30:31 2002
@@ -943,18 +943,18 @@
int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp)
{
- int retval = -EINVAL;
- if (pgrp > 0) {
- struct task_struct *p;
-
- retval = -ESRCH;
- for_each_process(p) {
- if (p->pgrp == pgrp) {
- int err = send_sig_info(sig, info, p);
- if (retval)
- retval = err;
- }
- }
+ struct task_struct *p;
+ struct list_head *l;
+ struct pid *pid;
+ int err, retval = -ESRCH;
+
+ if (pgrp <= 0)
+ return -EINVAL;
+
+ for_each_task_pid(pgrp, PIDTYPE_PGID, p, l, pid) {
+ err = send_sig_info(sig, info, p);
+ if (retval)
+ retval = err;
}
return retval;
}
@@ -977,28 +977,33 @@
* the connection is lost.
*/
+
int
-kill_sl_info(int sig, struct siginfo *info, pid_t sess)
+kill_sl_info(int sig, struct siginfo *info, pid_t sid)
{
- int retval = -EINVAL;
- if (sess > 0) {
- struct task_struct *p;
-
- retval = -ESRCH;
- read_lock(&tasklist_lock);
- for_each_process(p) {
- if (p->leader && p->session == sess) {
- int err = send_sig_info(sig, info, p);
- if (retval)
- retval = err;
- }
- }
- read_unlock(&tasklist_lock);
+ int err, retval = -EINVAL;
+ struct pid *pid;
+ struct list_head *l;
+ struct task_struct *p;
+
+ if (sid <= 0)
+ goto out;
+
+ retval = -ESRCH;
+ read_lock(&tasklist_lock);
+ for_each_task_pid(sid, PIDTYPE_SID, p, l, pid) {
+ if (!p->leader)
+ continue;
+ err = send_sig_info(sig, info, p);
+ if (retval)
+ retval = err;
}
+ read_unlock(&tasklist_lock);
+out:
return retval;
}
-inline int
+int
kill_proc_info(int sig, struct siginfo *info, pid_t pid)
{
int error;
--- linux/kernel/sys.c.orig Thu Sep 19 21:20:41 2002
+++ linux/kernel/sys.c Thu Sep 19 21:30:31 2002
@@ -203,35 +203,34 @@
cond_syscall(sys_quotactl)
cond_syscall(sys_acct)
-static int proc_sel(struct task_struct *p, int which, int who)
+static int set_one_prio(struct task_struct *p, int niceval, int error)
{
- if(p->pid)
- {
- switch (which) {
- case PRIO_PROCESS:
- if (!who && p == current)
- return 1;
- return(p->pid == who);
- case PRIO_PGRP:
- if (!who)
- who = current->pgrp;
- return(p->pgrp == who);
- case PRIO_USER:
- if (!who)
- who = current->uid;
- return(p->uid == who);
- }
+ if (p->uid != current->euid &&
+ p->uid != current->uid && !capable(CAP_SYS_NICE)) {
+ error = -EPERM;
+ goto out;
}
- return 0;
+
+ if (error == -ESRCH)
+ error = 0;
+ if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+ error = -EACCES;
+ else
+ set_user_nice(p, niceval);
+out:
+ return error;
}
asmlinkage long sys_setpriority(int which, int who, int niceval)
{
struct task_struct *g, *p;
- int error;
+ struct user_struct *user;
+ struct pid *pid;
+ struct list_head *l;
+ int error = -EINVAL;
if (which > 2 || which < 0)
- return -EINVAL;
+ goto out;
/* normalize: avoid signed division (rounding problems) */
error = -ESRCH;
@@ -241,31 +240,38 @@
niceval = 19;
read_lock(&tasklist_lock);
- do_each_thread(g, p) {
- int no_nice;
- if (!proc_sel(p, which, who))
- continue;
- if (p->uid != current->euid &&
- p->uid != current->uid && !capable(CAP_SYS_NICE)) {
- error = -EPERM;
- continue;
- }
- if (error == -ESRCH)
- error = 0;
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE)) {
- error = -EACCES;
- continue;
- }
- no_nice = security_ops->task_setnice(p, niceval);
- if (no_nice) {
- error = no_nice;
- continue;
- }
- set_user_nice(p, niceval);
- } while_each_thread(g, p);
-
+ switch (which) {
+ case PRIO_PROCESS:
+ if (!who)
+ who = current->pid;
+ p = find_task_by_pid(who);
+ if (p)
+ error = set_one_prio(p, niceval, error);
+ break;
+ case PRIO_PGRP:
+ if (!who)
+ who = current->pgrp;
+ for_each_task_pid(who, PIDTYPE_PGID, p, l, pid)
+ error = set_one_prio(p, niceval, error);
+ break;
+ case PRIO_USER:
+ if (!who)
+ user = current->user;
+ else
+ user = find_user(who);
+
+ if (!user)
+ goto out_unlock;
+
+ do_each_thread(g, p)
+ if (p->uid == who)
+ error = set_one_prio(p, niceval, error);
+ while_each_thread(g, p);
+ break;
+ }
+out_unlock:
read_unlock(&tasklist_lock);
-
+out:
return error;
}
@@ -278,20 +284,54 @@
asmlinkage long sys_getpriority(int which, int who)
{
struct task_struct *g, *p;
- long retval = -ESRCH;
+ struct list_head *l;
+ struct pid *pid;
+ struct user_struct *user;
+ long niceval, retval = -ESRCH;
if (which > 2 || which < 0)
return -EINVAL;
read_lock(&tasklist_lock);
- do_each_thread(g, p) {
- long niceval;
- if (!proc_sel(p, which, who))
- continue;
- niceval = 20 - task_nice(p);
- if (niceval > retval)
- retval = niceval;
- } while_each_thread(g, p);
+ switch (which) {
+ case PRIO_PROCESS:
+ if (!who)
+ who = current->pid;
+ p = find_task_by_pid(who);
+ if (p) {
+ niceval = 20 - task_nice(p);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ break;
+ case PRIO_PGRP:
+ if (!who)
+ who = current->pgrp;
+ for_each_task_pid(who, PIDTYPE_PGID, p, l, pid) {
+ niceval = 20 - task_nice(p);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ break;
+ case PRIO_USER:
+ if (!who)
+ user = current->user;
+ else
+ user = find_user(who);
+
+ if (!user)
+ goto out_unlock;
+
+ do_each_thread(g, p)
+ if (p->uid == who) {
+ niceval = 20 - task_nice(p);
+ if (niceval > retval)
+ retval = niceval;
+ }
+ while_each_thread(g, p);
+ break;
+ }
+out_unlock:
read_unlock(&tasklist_lock);
return retval;
@@ -849,7 +889,7 @@
asmlinkage long sys_setpgid(pid_t pid, pid_t pgid)
{
- struct task_struct * p;
+ struct task_struct *p;
int err = -EINVAL;
if (!pid)
@@ -862,12 +902,15 @@
/* From this point forward we keep holding onto the tasklist lock
* so that our parent does not change from under us. -DaveM
*/
- read_lock(&tasklist_lock);
+ write_lock_irq(&tasklist_lock);
err = -ESRCH;
p = find_task_by_pid(pid);
if (!p)
goto out;
+ err = -EINVAL;
+ if (!thread_group_leader(p))
+ goto out;
if (p->parent == current || p->real_parent == current) {
err = -EPERM;
@@ -882,25 +925,26 @@
if (p->leader)
goto out;
if (pgid != pid) {
- struct task_struct *g, *tmp;
- do_each_thread(g, tmp) {
- if (tmp->pgrp == pgid &&
- tmp->session == current->session)
+ struct task_struct *p;
+ struct pid *pid;
+ struct list_head *l;
+
+ for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid)
+ if (p->session == current->session)
goto ok_pgid;
- } while_each_thread(g, tmp);
goto out;
}
ok_pgid:
- err = security_ops->task_setpgid(p, pgid);
- if (err)
- goto out;
-
- p->pgrp = pgid;
+ if (p->pgrp != pgid) {
+ detach_pid(p, PIDTYPE_PGID);
+ p->pgrp = pgid;
+ attach_pid(p, PIDTYPE_PGID, pgid);
+ }
err = 0;
out:
/* All paths lead to here, thus we are safe. -DaveM */
- read_unlock(&tasklist_lock);
+ write_unlock_irq(&tasklist_lock);
return err;
}
@@ -956,22 +1000,34 @@
asmlinkage long sys_setsid(void)
{
- struct task_struct *g, *p;
+ struct pid *pid;
int err = -EPERM;
- read_lock(&tasklist_lock);
- do_each_thread(g, p)
- if (p->pgrp == current->pid)
- goto out;
- while_each_thread(g, p);
+ if (!thread_group_leader(current))
+ return -EINVAL;
+
+ write_lock_irq(&tasklist_lock);
+
+ pid = find_pid(PIDTYPE_PGID, current->pid);
+ if (pid)
+ goto out;
current->leader = 1;
- current->session = current->pgrp = current->pid;
+ if (current->session != current->pid) {
+ detach_pid(current, PIDTYPE_SID);
+ current->session = current->pid;
+ attach_pid(current, PIDTYPE_SID, current->pid);
+ }
+ if (current->pgrp != current->pid) {
+ detach_pid(current, PIDTYPE_PGID);
+ current->pgrp = current->pid;
+ attach_pid(current, PIDTYPE_PGID, current->pid);
+ }
current->tty = NULL;
current->tty_old_pgrp = 0;
err = current->pgrp;
out:
- read_unlock(&tasklist_lock);
+ write_unlock_irq(&tasklist_lock);
return err;
}
--- linux/kernel/fork.c.orig Thu Sep 19 21:30:18 2002
+++ linux/kernel/fork.c Thu Sep 19 21:30:31 2002
@@ -47,17 +47,6 @@
int max_threads;
unsigned long total_forks; /* Handle normal Linux uptimes. */
-/*
- * Protects next_safe, last_pid and pid_max:
- */
-spinlock_t lastpid_lock = SPIN_LOCK_UNLOCKED;
-
-static int next_safe = DEFAULT_PID_MAX;
-int pid_max = DEFAULT_PID_MAX;
-int last_pid;
-
-struct task_struct *pidhash[PIDHASH_SZ];
-
rwlock_t tasklist_lock __cacheline_aligned = RW_LOCK_UNLOCKED; /* outer */
/*
@@ -75,16 +64,14 @@
} else {
int cpu = smp_processor_id();
- tsk = task_cache[cpu];
+ tsk = xchg(task_cache + cpu, tsk);
if (tsk) {
free_thread_info(tsk->thread_info);
kmem_cache_free(task_struct_cachep,tsk);
}
- task_cache[cpu] = current;
}
}
-/* Protects next_safe and last_pid. */
void add_wait_queue(wait_queue_head_t *q, wait_queue_t * wait)
{
unsigned long flags;
@@ -140,73 +127,28 @@
struct task_struct *tsk;
struct thread_info *ti;
- ti = alloc_thread_info();
- if (!ti)
- return NULL;
-
- tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+ tsk = xchg(task_cache + smp_processor_id(), NULL);
if (!tsk) {
- free_thread_info(ti);
- return NULL;
- }
+ ti = alloc_thread_info();
+ if (!ti)
+ return NULL;
+
+ tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
+ if (!tsk) {
+ free_thread_info(ti);
+ return NULL;
+ }
+ } else
+ ti = tsk->thread_info;
*ti = *orig->thread_info;
*tsk = *orig;
tsk->thread_info = ti;
ti->task = tsk;
atomic_set(&tsk->usage,1);
-
return tsk;
}
-static int get_pid(unsigned long flags)
-{
- struct task_struct *g, *p;
- int pid;
-
- if (flags & CLONE_IDLETASK)
- return 0;
-
- spin_lock(&lastpid_lock);
- if (++last_pid > pid_max) {
- last_pid = 300; /* Skip daemons etc. */
- goto inside;
- }
-
- if (last_pid >= next_safe) {
-inside:
- if (nr_threads > pid_max >> 4)
- pid_max <<= 1;
- next_safe = pid_max;
- read_lock(&tasklist_lock);
- repeat:
- do_each_thread(g, p) {
- if (p->pid == last_pid ||
- p->pgrp == last_pid ||
- p->session == last_pid) {
- if (++last_pid >= next_safe) {
- if (last_pid >= pid_max)
- last_pid = 300;
- next_safe = pid_max;
- }
- goto repeat;
- }
- if (p->pid > last_pid && next_safe > p->pid)
- next_safe = p->pid;
- if (p->pgrp > last_pid && next_safe > p->pgrp)
- next_safe = p->pgrp;
- if (p->session > last_pid && next_safe > p->session)
- next_safe = p->session;
- } while_each_thread(g, p);
-
- read_unlock(&tasklist_lock);
- }
- pid = last_pid;
- spin_unlock(&lastpid_lock);
-
- return pid;
-}
-
static inline int dup_mmap(struct mm_struct * mm)
{
struct vm_area_struct * mpnt, *tmp, **pprev;
@@ -726,7 +668,13 @@
p->state = TASK_UNINTERRUPTIBLE;
copy_flags(clone_flags, p);
- p->pid = get_pid(clone_flags);
+ if (clone_flags & CLONE_IDLETASK)
+ p->pid = 0;
+ else {
+ p->pid = alloc_pidmap();
+ if (p->pid == -1)
+ goto bad_fork_cleanup;
+ }
p->proc_dentry = NULL;
INIT_LIST_HEAD(&p->run_list);
@@ -889,7 +837,13 @@
SET_LINKS(p);
if (p->ptrace & PT_PTRACED)
__ptrace_link(p, current->parent);
- hash_pid(p);
+
+ attach_pid(p, PIDTYPE_PID, p->pid);
+ if (thread_group_leader(p)) {
+ attach_pid(p, PIDTYPE_PGID, p->pgrp);
+ attach_pid(p, PIDTYPE_SID, p->session);
+ }
+
nr_threads++;
write_unlock_irq(&tasklist_lock);
retval = 0;
@@ -914,6 +868,8 @@
bad_fork_cleanup_security:
security_ops->task_free_security(p);
bad_fork_cleanup:
+ if (p->pid > 0)
+ free_pidmap(p->pid);
put_exec_domain(p->thread_info->exec_domain);
if (p->binfmt && p->binfmt->module)
__MOD_DEC_USE_COUNT(p->binfmt->module);
--- linux/kernel/ksyms.c.orig Thu Sep 19 21:20:51 2002
+++ linux/kernel/ksyms.c Thu Sep 19 21:30:31 2002
@@ -602,7 +602,6 @@
EXPORT_SYMBOL(init_thread_union);
EXPORT_SYMBOL(tasklist_lock);
-EXPORT_SYMBOL(pidhash);
#if defined(CONFIG_SMP) && defined(__GENERIC_PER_CPU)
EXPORT_SYMBOL(__per_cpu_offset);
#endif
--- linux/init/main.c.orig Thu Sep 19 21:20:41 2002
+++ linux/init/main.c Thu Sep 19 21:30:31 2002
@@ -66,6 +66,7 @@
extern void sysctl_init(void);
extern void signals_init(void);
extern void buffer_init(void);
+extern void pidhash_init(void);
extern void pte_chain_init(void);
extern void radix_tree_init(void);
extern void free_initmem(void);
@@ -432,6 +433,7 @@
#endif
mem_init();
kmem_cache_sizes_init();
+ pidhash_init();
pgtable_cache_init();
pte_chain_init();
fork_init(num_physpages);
On Thu, Sep 19, 2002 at 09:38:03PM +0200, Ingo Molnar wrote:
> - add list_for_each_noprefetch() to list.h, for all those list users who
> know that in the majority of cases the list is going to be short.
That name is really ugly, as the lack of the prefetch is an implementation
detail. What about list_for_each_short or __list_for_each?
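For reference, a minimal sketch of what such a prefetch-free iterator could
look like against the existing list.h helpers; the __list_for_each name is
just one of the suggestions in this thread, nothing has been merged under it:

/*
 * Sketch: the same loop as list_for_each(), with the prefetch(pos->next)
 * hint dropped - intended for lists known to be short.
 */
#define __list_for_each(pos, head) \
	for (pos = (head)->next; pos != (head); pos = pos->next)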
On 19 Sep 2002, Kai Henningsen wrote:
>
> On the contrary: it says that this can never happen - the new session has
> no controlling terminal, and can't get the old one unless the old session
> loses it first.
Hmm.. I read it as "the tty stays with the stale group", which is
problematic. But if all the places that set a new controlling terminal
check that it's not already used by some other non-session then I guess
we're still ok..
Linus
Ok, applied.
I'm also applying the session handling changes to tty_io.c as a separate
changeset, since the resulting code is certainly cleaner and reading
people's arguments and looking at the code have made me think it _is_
correct after all.
And as a separate changeset it will be easier to test and perhaps revert
on demand.
Linus
On Thu, 19 Sep 2002, Linus Torvalds wrote:
> I'm also applying the session handling changes to tty_io.c as a separate
> changeset, since the resulting code is certainly cleaner and reading
> people's arguments and looking at the code have made me think it _is_
> correct after all.
yes, i too think it's correct. In fact this is partly proven by one piece
of code in the old tty_io.c file:
void disassociate_ctty(int on_exit)
[...]
read_lock(&tasklist_lock);
for_each_process(p)
if (p->session == current->session)
p->tty = NULL;
read_unlock(&tasklist_lock);
ie. if the SID-based iteration is incorrect, then the above code is
incorrect already - but we never saw any problems in this area.
Ingo
In article <[email protected]>,
Linus Torvalds <[email protected]> wrote:
>
>On 19 Sep 2002, Kai Henningsen wrote:
>>
>> On the contrary: it says that this can never happen - the new session has
>> no controlling terminal, and can't get the old one unless the old session
>> loses it first.
>
>Hmm.. I read it as "the tty stays with the stale group", which is
>problematic. But if all the places that set a new controlling terminal
>check that it's not already used by some other non-session then I guess
>we're still ok..
No, it said that if a new session is created the tty stays with
the old session group, which is by no means stale.
A session leader cannot create a new session, since the session-id
is the pid of the leader. Only children can create a new session,
and at that point the tty is not the controlling tty of the
_newly created_ session. Of course the original session still
exists, and the tty is still the controlling tty of that session.
Only if the new session leader opens a tty that hasn't been
associated with a session yet will it become the controlling
tty of that session.
So to associate a tty with another session, you'll have to
disassociate it from the existing session by a) killing all
processes in the session or b) calling TIOCNOTTY. Then another
session can open it and it will become the controlling tty
of the other session (if it didn't have a controlling tty
yet, that is!)
A session can also steal away a controlling tty from another
session by using TIOCSCTTY (if it's root), but the tty will
first be disassociated from the old session and only then
will it become the controlling tty of the stealing session.
See drivers/char/tty_io.c::tiocsctty() for how it loops
over all tasks to do this ..
Mike.
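A minimal userspace sketch of the sequence described above; TIOCSCTTY and
TIOCNOTTY are the real ioctls, the function name and tty path are made up
for illustration:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* detach into a new session, then acquire 'path' as the controlling tty */
int acquire_ctty(const char *path)
{
	int fd;

	if (setsid() < 0)		/* fails if already a session leader */
		return -1;
	fd = open(path, O_RDWR);	/* e.g. a hypothetical "/dev/ttyS1" */
	if (fd < 0)
		return -1;
	/*
	 * Becomes our controlling tty only if no other session owns it;
	 * a nonzero third argument is the root-only "steal" variant.
	 */
	return ioctl(fd, TIOCSCTTY, 0);
}

/* the old session gives the tty up explicitly with: ioctl(fd, TIOCNOTTY); */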
On Thu, Sep 19, 2002 at 09:21:11PM +0100, Christoph Hellwig wrote:
> > - add list_for_each_noprefetch() to list.h, for all those list users who
> > know that in the majority of cases the list is going to be short.
> That name is really ugly, as the lack of the prefetch is an implementation
> detail. What about list_for_each_short or __list_for_each?
Wasn't this the reason we have list_for_each_safe() ?
Dave
--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs
On Fri, 20 Sep 2002, Dave Jones wrote:
> > > - add list_for_each_noprefetch() to list.h, for all those list users who
> > > know that in the majority of cases the list is going to be short.
> > That name is really ugly, as the lack of the prefetch is an implementation
> > detail. What about list_for_each_short or __list_for_each?
>
> Wasn't this the reason we have list_for_each_safe() ?
no, list_for_each_safe() is a loop that is entry-removal safe.
Ingo
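For illustration, with a hypothetical struct foo and foo_list, the second
cursor is what makes unlinking the current entry inside the loop legal:

struct list_head *pos, *n;

list_for_each_safe(pos, n, &foo_list) {
	struct foo *f = list_entry(pos, struct foo, list);

	if (f->dead)
		list_del(pos);	/* safe: 'n' was saved before the unlink */
}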
Hello!
On Thu, Sep 19, 2002 at 04:54:46AM +0200, Ingo Molnar wrote:
> - unified kernel/pid.c and kernel/idtag.c into kernel/pid.c.
> + /*
> + * Free the page if someone raced with us
> + * installing it:
> + */
> + if (cmpxchg(&map->page, NULL, page))
> + free_page(page);
Note that this piece breaks compilation for every arch that does not
have cmpxchg implementation.
This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).
Bye,
Oleg
On Fri, 20 Sep 2002, Oleg Drokin wrote:
> > + if (cmpxchg(&map->page, NULL, page))
> > + free_page(page);
>
> Note that this piece breaks compilation for every arch that does not
> have cmpxchg implementation.
> This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
> ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).
we need a cmpxchg() function in the generic library, using a spinlock.
Then every architecture can enhance the implementation if it wishes to.
Ingo
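A rough sketch of that generic-library idea (not part of the posted patch);
as the rest of the thread notes, a lock-based emulation is only correct if
every access to the word goes through the same lock, which happens to hold
for map->page here:

#include <linux/spinlock.h>

static spinlock_t generic_cmpxchg_lock = SPIN_LOCK_UNLOCKED;

static inline unsigned long generic_cmpxchg(volatile unsigned long *ptr,
					    unsigned long old, unsigned long new)
{
	unsigned long prev, flags;

	spin_lock_irqsave(&generic_cmpxchg_lock, flags);
	prev = *ptr;
	if (prev == old)
		*ptr = new;
	spin_unlock_irqrestore(&generic_cmpxchg_lock, flags);

	return prev;
}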
Hello!
On Fri, Sep 20, 2002 at 11:40:16AM +0200, Ingo Molnar wrote:
> > > + if (cmpxchg(&map->page, NULL, page))
> > > + free_page(page);
> > Note that this piece breaks compilation for every arch that does not
> > have cmpxchg implementation.
> > This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
> > ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).
> we need a cmpxchg() function in the generic library, using a spinlock.
But this is not safe for arches that provide SMP but do not provide
cmpxchg in hardware, right?
I mean it is only safe to use such a spinlock-based function if
all other places read and write this value via special functions that
also take this spinlock.
Do you think we can count on this?
Bye,
Oleg
On Fri, Sep 20, 2002 at 03:43:47PM +0400, Oleg Drokin wrote:
> On Fri, Sep 20, 2002 at 11:40:16AM +0200, Ingo Molnar wrote:
> > we need a cmpxchg() function in the generic library, using a spinlock.
>
> But this is not safe for arches that provide SMP but do not provide
> cmpxchg in hardware, right?
> I mean it is only safe to use such a spinlock-based function if
> all other places read and write this value via special functions that
> also take this spinlock.
Correct.
> Do you think we can count on this?
I don't think so.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
On Fri, 20 Sep 2002, Oleg Drokin wrote:
> I mean it is only safe to use such a spinlock-based function if all other
> places read and write this value via special functions that also
> take this spinlock.
yes - but this is true in this specific PID allocator case: we only access
it via cmpxchg().
Ingo
Ingo Molnar wrote:
>
> On Fri, 20 Sep 2002, Oleg Drokin wrote:
>
> > > + if (cmpxchg(&map->page, NULL, page))
> > > + free_page(page);
> >
> > Note that this piece breaks compilation for every arch that does not
> > have cmpxchg implementation.
> > This is the case with x86 (with CONFIG_X86_CMPXCHG undefined, e.g. i386),
> > ARM, CRIS, m68k, MIPS, MIPS64, PARISC, s390, SH, sparc32, UML (for x86).
>
> we need a cmpxchg() function in the generic library, using a spinlock.
> Then every architecture can enhance the implementation if it wishes to.
>
That would be good, but wouldn't we then need a special per-arch
"cmpxchngable" type, like atomic_t?
Seems that just doing compare-and-exchange on a bare page* might force
some architectures to use a global lock.