LinuxLists.cc - [ANNOUNCE] Native POSIX Thread Library 0.1

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

William Lee Irwin III wrote:

> What stress tests and/or benchmarks are you using?

We have developed a little benchmark in parallel to the library.
Nothing special, but you've seen Ingo using it in his argumentations
(usually called p3).

This does not in any way removes the need for more benchmarks.

- --
- ---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE9inta2ijCOnn/RHQRAp+mAJ9KQJgH1wy1hifkON6/v9EgkptjbgCdHQhF
vcrLVpU85pCuq6fZDo8uFn0=
=Ep3W
-----END PGP SIGNATURE-----

2002-09-20 01:43:20

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> William Lee Irwin III wrote:
>> What stress tests and/or benchmarks are you using?

On Thu, Sep 19, 2002 at 06:35:17PM -0700, Ulrich Drepper wrote:
> We have developed a little benchmark in parallel to the library.
> Nothing special, but you've seen Ingo using it in his argumentations
> (usually called p3).
> This does not in any way removes the need for more benchmarks.

If you could pass that along for me (or others) to add to the list of
things to bench I'd be much obliged.

Thanks,
Bill

2002-09-20 01:51:04

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> - - the new library is based on an 1-on-1 model. Earlier design
> documents stated that an M-on-N implementation was necessary to
> support a scalable thread library. This was especially true for
> the IA-32 and x86-64 platforms since the ABI with respect to threads
> forces the use of segment registers and the only way to use those
> registers was with the Local Descriptor Table (LDT) data structure
> of the processor.
>
> The kernel limitations the earlier designs were based on have been
> eliminated as part of this project, opening the road to a 1-on-1
> implementation which has many advantages such as
>
> + less complex implementation;
> + avoidance of two-level scheduling, enabling the kernel to make all
> scheduling decisions;
> + direct interaction between kernel and user-level code (e.g., when
> delivering signals);
> + and more and more.

I'm just starting to look at this... Without digging into it, my
impression is that this is 100% the right way to go. Rob Pike (Mr Plan
9, which while it hasn't had the impact of Linux is actually a fantastic
chunk of work) once said "If you think you need threads, your processes
are too fat". I believe that's another way of stating that a 1-on-1
model is the right approach. He's saying "don't make threads to make
things lighter, that's bullshit, use processes as threads, that will
force the processes to be light and that's a good thing for processes
*and* threads".

My only issue with that approach (I'm a big time advocate of that
approach) is TLB & page table sharing. My understanding (weak as it is)
of Linux is that it does the right thing here so there isn't much of
an issue. In Linux the address space is a first class object so the
id in the TLB is an address space ID, not a process id, which means
a pile of unrelated processes could, in theory, share the same chunk
of address space. That's cool. A lot of processor folks have danced
around that issue for years, I fought with Mash at MIPS about it, he
knew it was something that was needed. But Linux, as far as I can tell,
got it right in a different way that made the issue go away. Which means
1-on-1 threads are the right approach for reasons which have nothing to
do with threads, as well as reasons which have to do with threads.

Kudos to Ulrich & team for getting it right. I'll go dig into it and
see if I've missed the point or not, but this sounds really good.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-09-20 01:56:46

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, 19 Sep 2002, Ulrich Drepper wrote:

> Initial confirmations were test runs with huge numbers of threads.
> Even on IA-32 with its limited address space and memory handling
> running 100,000 concurrent threads was no problem at all,

So, where did you put those 800 MB of kernel stacks needed for
100,000 threads ?

If you used the standard 3:1 user/kernel split you'd be using
all of ZONE_NORMAL for kernel stacks, but if you use a 2:2 split
you'll end up with a lot less user space (bad if you want to
have many threads in the same address space).

Do you have some special solutions up your sleeve or is this
in the category of as-of-yet unsolved problems ?

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-20 02:12:40

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, Sep 19, 2002 at 11:01:33PM -0300, Rik van Riel wrote:
> On Thu, 19 Sep 2002, Ulrich Drepper wrote:
>
> > Initial confirmations were test runs with huge numbers of threads.
> > Even on IA-32 with its limited address space and memory handling
> > running 100,000 concurrent threads was no problem at all,
>
> So, where did you put those 800 MB of kernel stacks needed for
> 100,000 threads ?

Come on, you and I normally agree, but 100,000 threads? Where is the need
for that? More importantly, is there any realistic application that can
use 100,000 threads where the kernel stack is 0 but the user level stack
doesn't have exactly the same problem? The kernel can be perfect, i.e.,
cost zero, and you still have a problem.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-09-20 02:10:42

by Benjamin LaHaise

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, Sep 19, 2002 at 11:01:33PM -0300, Rik van Riel wrote:
> So, where did you put those 800 MB of kernel stacks needed for
> 100,000 threads ?

That's what the 4KB stack patch is for. ;-)

-ben

2002-09-20 02:19:14

by Anton Blanchard

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> So, where did you put those 800 MB of kernel stacks needed for
> 100,000 threads ?

I hope no one is going to run x86 boxes with 100,000 threads, but
its nice to know we can do it. (just to have an upper limit)

If they want 100k threads they should start thinking about a 64bit
box. Ive already tested 1 million kernel threads with 24GB and we
make machines with more than 10 times that memory... (and no I cant
think of any possible reason someone would want that many threads :)

Anton

2002-09-20 02:20:10

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, 19 Sep 2002, Larry McVoy wrote:
> On Thu, Sep 19, 2002 at 11:01:33PM -0300, Rik van Riel wrote:

> > So, where did you put those 800 MB of kernel stacks needed for
> > 100,000 threads ?
>
> Come on, you and I normally agree, but 100,000 threads? Where is the
> need for that?

I agree, it's pretty silly. But still, I was curious how they
managed to achieve it ;)

OTOH, some applications are known for sillyness ...

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-20 02:27:11

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Rik van Riel wrote:

> I agree, it's pretty silly. But still, I was curious how they
> managed to achieve it ;)

Ingo will be able to tell you when he gets up. This is not my area of
expertise. AFAIK there were no special changes involved; Ben's irq
stack patch would add to this number (I think Ingo said something about
188,000 threads or so).

- --
- ---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE9ioi02ijCOnn/RHQRAnw+AJ9fFu36D8ZIk2Y3NC8Rpekb5EXwPwCePCBL
Z/u1XIdgB2F/UuixLkIpNvI=
=Ldzx
-----END PGP SIGNATURE-----

2002-09-20 02:36:33

by Dave Hansen

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.552 -> 1.553
# arch/i386/kernel/process.c 1.40 -> 1.41
# arch/i386/kernel/irq.c 1.18 -> 1.19
# arch/i386/kernel/head.S 1.15 -> 1.16
# include/asm-i386/thread_info.h 1.7 -> 1.8
# include/asm-i386/page.h 1.16.1.1 -> 1.18
# arch/i386/kernel/entry.S 1.41.1.3 -> 1.44
# arch/i386/config.in 1.47.1.2 -> 1.49
# arch/i386/Makefile 1.17.1.3 -> 1.19
# arch/i386/kernel/i386_ksyms.c 1.30 -> 1.31
# arch/i386/kernel/smpboot.c 1.33.1.2 -> 1.35
# arch/i386/boot/compressed/misc.c 1.7 -> 1.8
# arch/i386/kernel/init_task.c 1.6 -> 1.7
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/19 haveblue@elm3b96.(none) 1.553
# Merge elm3b96.(none):/work/dave/bk/linux-2.5
# into elm3b96.(none):/work/dave/bk/linux-2.5-irqstack
# --------------------------------------------
#
diff -Nru a/arch/i386/Makefile b/arch/i386/Makefile
--- a/arch/i386/Makefile Thu Sep 19 19:38:27 2002
+++ b/arch/i386/Makefile Thu Sep 19 19:38:27 2002
@@ -85,6 +85,10 @@
CFLAGS += -march=i586
endif

+ifdef CONFIG_X86_STACK_CHECK
+CFLAGS += -p
+endif
+
HEAD := arch/i386/kernel/head.o arch/i386/kernel/init_task.o

libs-y += arch/i386/lib/
diff -Nru a/arch/i386/boot/compressed/misc.c b/arch/i386/boot/compressed/misc.c
--- a/arch/i386/boot/compressed/misc.c Thu Sep 19 19:38:27 2002
+++ b/arch/i386/boot/compressed/misc.c Thu Sep 19 19:38:27 2002
@@ -377,3 +377,7 @@
if (high_loaded) close_output_buffer_if_we_run_high(mv);
return high_loaded;
}
+
+/* We don't actually check for stack overflows this early. */
+__asm__(".globl mcount ; mcount: ret\n");
+
diff -Nru a/arch/i386/config.in b/arch/i386/config.in
--- a/arch/i386/config.in Thu Sep 19 19:38:27 2002
+++ b/arch/i386/config.in Thu Sep 19 19:38:27 2002
@@ -35,6 +35,7 @@
#
# Define implied options from the CPU selection here
#
+define_bool CONFIG_X86_HAVE_CMOV n

if [ "$CONFIG_M386" = "y" ]; then
define_bool CONFIG_X86_CMPXCHG n
@@ -91,18 +92,21 @@
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
define_bool CONFIG_X86_PPRO_FENCE y
+ define_bool CONFIG_X86_HAVE_CMOV y
fi
if [ "$CONFIG_MPENTIUMIII" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
define_bool CONFIG_X86_TSC y
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
+ define_bool CONFIG_X86_HAVE_CMOV y
fi
if [ "$CONFIG_MPENTIUM4" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 7
define_bool CONFIG_X86_TSC y
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
+ define_bool CONFIG_X86_HAVE_CMOV y
fi
if [ "$CONFIG_MK6" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
@@ -116,6 +120,7 @@
define_bool CONFIG_X86_GOOD_APIC y
define_bool CONFIG_X86_USE_3DNOW y
define_bool CONFIG_X86_USE_PPRO_CHECKSUM y
+ define_bool CONFIG_X86_HAVE_CMOV y
fi
if [ "$CONFIG_MELAN" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 4
@@ -132,6 +137,7 @@
if [ "$CONFIG_MCRUSOE" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
define_bool CONFIG_X86_TSC y
+ define_bool CONFIG_X86_HAVE_CMOV y
fi
if [ "$CONFIG_MWINCHIPC6" = "y" ]; then
define_int CONFIG_X86_L1_CACHE_SHIFT 5
@@ -435,6 +441,7 @@
if [ "$CONFIG_HIGHMEM" = "y" ]; then
bool ' Highmem debugging' CONFIG_DEBUG_HIGHMEM
fi
+ bool ' Check for stack overflows' CONFIG_X86_STACK_CHECK
fi

endmenu
diff -Nru a/arch/i386/kernel/entry.S b/arch/i386/kernel/entry.S
--- a/arch/i386/kernel/entry.S Thu Sep 19 19:38:27 2002
+++ b/arch/i386/kernel/entry.S Thu Sep 19 19:38:27 2002
@@ -136,7 +136,7 @@
movl %ecx,CS(%esp) #
movl %esp, %ebx
pushl %ebx
- andl $-8192, %ebx # GET_THREAD_INFO
+ GET_THREAD_INFO_WITH_ESP(%ebx)
movl TI_EXEC_DOMAIN(%ebx), %edx # Get the execution domain
movl 4(%edx), %edx # Get the lcall7 handler for the domain
pushl $0x7
@@ -158,7 +158,7 @@
movl %ecx,CS(%esp) #
movl %esp, %ebx
pushl %ebx
- andl $-8192, %ebx # GET_THREAD_INFO
+ GET_THREAD_INFO_WITH_ESP(%ebx)
movl TI_EXEC_DOMAIN(%ebx), %edx # Get the execution domain
movl 4(%edx), %edx # Get the lcall7 handler for the domain
pushl $0x27
@@ -334,7 +334,39 @@
ALIGN
common_interrupt:
SAVE_ALL
+
+ GET_THREAD_INFO(%ebx)
+ movl TI_IRQ_STACK(%ebx),%ecx
+ movl TI_TASK(%ebx),%edx
+ movl %esp,%eax
+ leal (THREAD_SIZE-4)(%ecx),%esi
+ testl %ecx,%ecx # is there a valid irq_stack?
+
+ # switch to the irq stack
+#ifdef CONFIG_X86_HAVE_CMOV
+ cmovnz %esi,%esp
+#else
+ jnz 1f
+ mov %esi,%esp
+1:
+#endif
+
+ # update the task pointer in the irq stack
+ GET_THREAD_INFO(%esi)
+ movl %edx,TI_TASK(%esi)
+
call do_IRQ
+
+ movl %eax,%esp # potentially restore non-irq stack
+
+ # copy flags from the irq stack back into the task's thread_info
+ # %esi is saved over the do_IRQ call and contains the irq stack
+ # thread_info pointer
+ # %ebx contains the original thread_info pointer
+ movl TI_FLAGS(%esi),%eax
+ movl $0,TI_FLAGS(%esi)
+ LOCK orl %eax,TI_FLAGS(%ebx)
+
jmp ret_from_intr

#define BUILD_INTERRUPT(name, nr) \
@@ -506,6 +538,61 @@
pushl $0
pushl $do_spurious_interrupt_bug
jmp error_code
+
+#ifdef CONFIG_X86_STACK_CHECK
+.data
+ .globl stack_overflowed
+stack_overflowed:
+ .long 0
+
+.text
+
+ENTRY(mcount)
+ push %eax
+ movl $(THREAD_SIZE - 1),%eax
+ andl %esp,%eax
+ cmpl $0x200,%eax /* 512 byte danger zone */
+ jle 1f
+2:
+ popl %eax
+ ret
+1:
+ lock; btsl $0,stack_overflowed /* Prevent reentry via printk */
+ jc 2b
+
+ # switch to overflow stack
+ movl %esp,%eax
+ movl $(stack_overflow_stack + THREAD_SIZE - 4),%esp
+
+ pushf
+ cli
+ pushl %eax
+
+ # push eip then esp of error for stack_overflow_panic
+ pushl 4(%eax)
+ pushl %eax
+
+ # update the task pointer and cpu in the overflow stack's thread_info.
+ GET_THREAD_INFO_WITH_ESP(%eax)
+ movl TI_TASK(%eax),%ebx
+ movl %ebx,stack_overflow_stack+TI_TASK
+ movl TI_CPU(%eax),%ebx
+ movl %ebx,stack_overflow_stack+TI_CPU
+
+ # never neverland
+ call stack_overflow_panic
+
+ addl $8,%esp
+
+ popf
+ popl %eax
+ movl %eax,%esp
+ popl %eax
+ movl $0,stack_overflowed
+ ret
+
+#warning stack check enabled
+#endif

.data
ENTRY(sys_call_table)
diff -Nru a/arch/i386/kernel/head.S b/arch/i386/kernel/head.S
--- a/arch/i386/kernel/head.S Thu Sep 19 19:38:27 2002
+++ b/arch/i386/kernel/head.S Thu Sep 19 19:38:27 2002
@@ -15,6 +15,7 @@
#include <asm/page.h>
#include <asm/pgtable.h>
#include <asm/desc.h>
+#include <asm/thread_info.h>

#define OLD_CL_MAGIC_ADDR 0x90020
#define OLD_CL_MAGIC 0xA33F
@@ -305,7 +306,7 @@
ret

ENTRY(stack_start)
- .long init_thread_union+8192
+ .long init_thread_union+THREAD_SIZE
.long __KERNEL_DS

/* This is the default interrupt "handler" :-) */
diff -Nru a/arch/i386/kernel/i386_ksyms.c b/arch/i386/kernel/i386_ksyms.c
--- a/arch/i386/kernel/i386_ksyms.c Thu Sep 19 19:38:27 2002
+++ b/arch/i386/kernel/i386_ksyms.c Thu Sep 19 19:38:27 2002
@@ -172,3 +172,8 @@
EXPORT_SYMBOL(is_sony_vaio_laptop);

EXPORT_SYMBOL(__PAGE_KERNEL);
+
+#ifdef CONFIG_X86_STACK_CHECK
+extern void mcount(void);
+EXPORT_SYMBOL(mcount);
+#endif
diff -Nru a/arch/i386/kernel/init_task.c b/arch/i386/kernel/init_task.c
--- a/arch/i386/kernel/init_task.c Thu Sep 19 19:38:27 2002
+++ b/arch/i386/kernel/init_task.c Thu Sep 19 19:38:27 2002
@@ -13,6 +13,14 @@
static struct signal_struct init_signals = INIT_SIGNALS(init_signals);
struct mm_struct init_mm = INIT_MM(init_mm);

+union thread_union init_irq_union
+ __attribute__((__section__(".data.init_task")));
+
+#ifdef CONFIG_X86_STACK_CHECK
+union thread_union stack_overflow_stack
+ __attribute__((__section__(".data.init_task")));
+#endif
+
/*
* Initial thread structure.
*
@@ -22,7 +30,15 @@
*/
union thread_union init_thread_union
__attribute__((__section__(".data.init_task"))) =
- { INIT_THREAD_INFO(init_task) };
+ { {
+ task: &init_task,
+ exec_domain: &default_exec_domain,
+ flags: 0,
+ cpu: 0,
+ addr_limit: KERNEL_DS,
+ irq_stack: &init_irq_union,
+ } };
+

/*
* Initial task structure.
diff -Nru a/arch/i386/kernel/irq.c b/arch/i386/kernel/irq.c
--- a/arch/i386/kernel/irq.c Thu Sep 19 19:38:27 2002
+++ b/arch/i386/kernel/irq.c Thu Sep 19 19:38:27 2002
@@ -311,7 +311,8 @@
* SMP cross-CPU interrupts have their own specific
* handlers).
*/
-asmlinkage unsigned int do_IRQ(struct pt_regs regs)
+struct pt_regs *do_IRQ(struct pt_regs *regs) __attribute__((regparm(1)));
+struct pt_regs *do_IRQ(struct pt_regs *regs)
{
/*
* We ack quickly, we don't want the irq controller
@@ -323,7 +324,7 @@
* 0 return value means that this irq is already being
* handled by some other CPU. (or is disabled)
*/
- int irq = regs.orig_eax & 0xff; /* high bits used in ret_from_ code */
+ int irq = regs->orig_eax & 0xff; /* high bits used in ret_from_ code */
int cpu = smp_processor_id();
irq_desc_t *desc = irq_desc + irq;
struct irqaction * action;
@@ -373,7 +374,7 @@
*/
for (;;) {
spin_unlock(&desc->lock);
- handle_IRQ_event(irq, &regs, action);
+ handle_IRQ_event(irq, regs, action);
spin_lock(&desc->lock);

if (likely(!(desc->status & IRQ_PENDING)))
@@ -392,7 +393,7 @@

irq_exit();

- return 1;
+ return regs;
}

/**
diff -Nru a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c
--- a/arch/i386/kernel/process.c Thu Sep 19 19:38:27 2002
+++ b/arch/i386/kernel/process.c Thu Sep 19 19:38:27 2002
@@ -438,6 +438,16 @@

extern void show_trace(unsigned long* esp);

+#ifdef CONFIG_X86_STACK_CHECK
+void stack_overflow_panic(void *esp, void *eip)
+{
+ printk("stack overflow from %p. esp: %p\n", eip, esp);
+ show_trace(esp);
+ panic("stack overflow\n");
+}
+
+#endif
+
void show_regs(struct pt_regs * regs)
{
unsigned long cr0 = 0L, cr2 = 0L, cr3 = 0L, cr4 = 0L;
@@ -693,6 +703,7 @@

/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

+ next_p->thread_info->irq_stack = prev_p->thread_info->irq_stack;
unlazy_fpu(prev_p);

/*
diff -Nru a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c Thu Sep 19 19:38:27 2002
+++ b/arch/i386/kernel/smpboot.c Thu Sep 19 19:38:27 2002
@@ -66,6 +66,10 @@
/* Per CPU bogomips and other parameters */
struct cpuinfo_x86 cpu_data[NR_CPUS] __cacheline_aligned;

+extern union thread_union init_irq_union;
+union thread_union *irq_stacks[NR_CPUS] __cacheline_aligned =
+ { &init_irq_union, };
+
/* Set when the idlers are all forked */
int smp_threads_ready;

@@ -760,6 +764,27 @@
return (send_status | accept_status);
}

+static void __init setup_irq_stack(struct task_struct *p, int cpu)
+{
+ unsigned long stk;
+
+ stk = __get_free_pages(GFP_KERNEL, THREAD_ORDER);
+ if (!stk)
+ panic("I can't seem to allocate my irq stack. Oh well, giving up.");
+
+ irq_stacks[cpu] = (void *)stk;
+ memset(irq_stacks[cpu], 0, THREAD_SIZE);
+ irq_stacks[cpu]->thread_info.cpu = cpu;
+ irq_stacks[cpu]->thread_info.preempt_count = 1;
+ /* interrupts are not preemptable */
+ p->thread_info->irq_stack = irq_stacks[cpu];
+
+ /* If we want to make the irq stack more than one unit
+ * deep, we can chain then off of the irq_stack pointer
+ * here.
+ */
+}
+
extern unsigned long cpu_initialized;

static void __init do_boot_cpu (int apicid)
@@ -783,6 +808,8 @@
if (IS_ERR(idle))
panic("failed fork for CPU %d", cpu);

+ setup_irq_stack(idle, cpu);
+
/*
* We remove it from the pidhash and the runqueue
* once we got the process:
@@ -800,7 +827,13 @@

/* So we see what's up */
printk("Booting processor %d/%d eip %lx\n", cpu, apicid, start_eip);
- stack_start.esp = (void *) (1024 + PAGE_SIZE + (char *)idle->thread_info);
+
+ /* The -4 is to correct for the fact that the stack pointer
+ * is used to find the location of the thread_info structure
+ * by masking off several of the LSBs. Without the -4, esp
+ * is pointing to the page after the one the stack is on.
+ */
+ stack_start.esp = (void *)(THREAD_SIZE - 4 + (char *)idle->thread_info);

/*
* This grunge runs the startup process for
diff -Nru a/include/asm-i386/page.h b/include/asm-i386/page.h
--- a/include/asm-i386/page.h Thu Sep 19 19:38:27 2002
+++ b/include/asm-i386/page.h Thu Sep 19 19:38:27 2002
@@ -3,7 +3,11 @@

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT 12
+#ifndef __ASSEMBLY__
#define PAGE_SIZE (1UL << PAGE_SHIFT)
+#else
+#define PAGE_SIZE (1 << PAGE_SHIFT)
+#endif
#define PAGE_MASK (~(PAGE_SIZE-1))

#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1))
diff -Nru a/include/asm-i386/thread_info.h b/include/asm-i386/thread_info.h
--- a/include/asm-i386/thread_info.h Thu Sep 19 19:38:27 2002
+++ b/include/asm-i386/thread_info.h Thu Sep 19 19:38:27 2002
@@ -9,6 +9,7 @@

#ifdef __KERNEL__

+#include <asm/page.h>
#ifndef __ASSEMBLY__
#include <asm/processor.h>
#endif
@@ -28,9 +29,11 @@
__s32 preempt_count; /* 0 => preemptable, <0 => BUG */

mm_segment_t addr_limit; /* thread address space:
+ 0 for interrupts: illegal
0-0xBFFFFFFF for user-thead
0-0xFFFFFFFF for kernel-thread
*/
+ struct thread_info *irq_stack; /* pointer to cpu irq stack */

__u8 supervisor_stack[0];
};
@@ -44,6 +47,7 @@
#define TI_CPU 0x0000000C
#define TI_PRE_COUNT 0x00000010
#define TI_ADDR_LIMIT 0x00000014
+#define TI_IRQ_STACK 0x00000018

#endif

@@ -54,42 +58,42 @@
*
* preempt_count needs to be 1 initially, until the scheduler is functional.
*/
+#define THREAD_ORDER 0
+#define INIT_THREAD_SIZE THREAD_SIZE
+
#ifndef __ASSEMBLY__
-#define INIT_THREAD_INFO(tsk) \
-{ \
- .task = &tsk, \
- .exec_domain = &default_exec_domain, \
- .flags = 0, \
- .cpu = 0, \
- .preempt_count = 1, \
- .addr_limit = KERNEL_DS, \
-}

#define init_thread_info (init_thread_union.thread_info)
#define init_stack (init_thread_union.stack)

+/* thread information allocation */
+#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER)
+#define alloc_thread_info() ((struct thread_info *) __get_free_pages(GFP_KERNEL,THREAD_ORDER))
+#define free_thread_info(ti) free_pages((unsigned long) (ti), THREAD_ORDER)
+#define get_thread_info(ti) get_task_struct((ti)->task)
+#define put_thread_info(ti) put_task_struct((ti)->task)
+
+
/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
struct thread_info *ti;
- __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~8191UL));
+ __asm__("andl %%esp,%0; ":"=r" (ti) : "0" (~(THREAD_SIZE - 1)));
return ti;
}

-/* thread information allocation */
-#define THREAD_SIZE (2*PAGE_SIZE)
-#define alloc_thread_info() ((struct thread_info *) __get_free_pages(GFP_KERNEL,1))
-#define free_thread_info(ti) free_pages((unsigned long) (ti), 1)
-#define get_thread_info(ti) get_task_struct((ti)->task)
-#define put_thread_info(ti) put_task_struct((ti)->task)
-
#else /* !__ASSEMBLY__ */

+#define THREAD_SIZE (PAGE_SIZE << THREAD_ORDER)
+
/* how to get the thread information struct from ASM */
#define GET_THREAD_INFO(reg) \
- movl $-8192, reg; \
+ movl $-THREAD_SIZE, reg; \
andl %esp, reg
-
+/* use this one if reg already contains %esp */
+#define GET_THREAD_INFO_WITH_ESP(reg) \
+ andl $-THREAD_SIZE, reg
+
#endif

/*

Attachments:

irqstack-2.5.35+bk-0.patch (15.17 kB)

2002-09-20 02:48:33

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, Sep 19, 2002 at 11:01:33PM -0300, Rik van Riel wrote:
>> So, where did you put those 800 MB of kernel stacks needed for
>> 100,000 threads ?

On Thu, Sep 19, 2002 at 10:15:46PM -0400, Benjamin LaHaise wrote:
> That's what the 4KB stack patch is for. ;-)
> -ben

The task_struct isn't particularly slim either. I heard rumblings
that way back when it was shared with the stack the NR_CPUS arrays
filled (and overflowed) the entire stack... It's also enough to
put a wee bit of pressure on ZONE_NORMAL even on smaller task count
workloads.

Perhaps something like this is in order? vs. 2.5.33:

fs/proc/array.c | 22 ----------------------
fs/proc/base.c | 11 +----------
include/linux/sched.h | 1 -
kernel/fork.c | 11 +----------
kernel/timer.c | 3 ---
5 files changed, 2 insertions(+), 46 deletions(-)

Cheers,
Bill

===== fs/proc/array.c 1.24 vs edited =====
--- 1.24/fs/proc/array.c Thu Jul 4 22:54:38 2002
+++ edited/fs/proc/array.c Tue Jul 16 00:35:26 2002
@@ -592,25 +592,3 @@
out:
return retval;
}
-
-#ifdef CONFIG_SMP
-int proc_pid_cpu(struct task_struct *task, char * buffer)
-{
- int i, len;
-
- len = sprintf(buffer,
- "cpu %lu %lu\n",
- jiffies_to_clock_t(task->utime),
- jiffies_to_clock_t(task->stime));
-
- for (i = 0 ; i < NR_CPUS; i++) {
- if (cpu_online(i))
- len += sprintf(buffer + len, "cpu%d %lu %lu\n",
- i,
- jiffies_to_clock_t(task->per_cpu_utime[i]),
- jiffies_to_clock_t(task->per_cpu_stime[i]));
-
- }
- return len;
-}
-#endif
===== fs/proc/base.c 1.26 vs edited =====
--- 1.26/fs/proc/base.c Wed May 22 08:48:14 2002
+++ edited/fs/proc/base.c Tue Jul 16 00:36:12 2002
@@ -52,7 +52,6 @@
PROC_PID_STAT,
PROC_PID_STATM,
PROC_PID_MAPS,
- PROC_PID_CPU,
PROC_PID_MOUNTS,
PROC_PID_FD_DIR = 0x8000, /* 0x8000-0xffff */
};
@@ -72,9 +71,6 @@
E(PROC_PID_CMDLINE, "cmdline", S_IFREG|S_IRUGO),
E(PROC_PID_STAT, "stat", S_IFREG|S_IRUGO),
E(PROC_PID_STATM, "statm", S_IFREG|S_IRUGO),
-#ifdef CONFIG_SMP
- E(PROC_PID_CPU, "cpu", S_IFREG|S_IRUGO),
-#endif
E(PROC_PID_MAPS, "maps", S_IFREG|S_IRUGO),
E(PROC_PID_MEM, "mem", S_IFREG|S_IRUSR|S_IWUSR),
E(PROC_PID_CWD, "cwd", S_IFLNK|S_IRWXUGO),
@@ -1003,12 +999,7 @@
case PROC_PID_MAPS:
inode->i_fop = &proc_maps_operations;
break;
-#ifdef CONFIG_SMP
- case PROC_PID_CPU:
- inode->i_fop = &proc_info_file_operations;
- ei->op.proc_read = proc_pid_cpu;
- break;
-#endif
+
case PROC_PID_MEM:
inode->i_op = &proc_mem_inode_operations;
inode->i_fop = &proc_mem_operations;
===== include/linux/sched.h 1.70 vs edited =====
--- 1.70/include/linux/sched.h Thu Jul 4 22:33:26 2002
+++ edited/include/linux/sched.h Tue Jul 16 00:35:26 2002
@@ -325,7 +325,6 @@
struct timer_list real_timer;
unsigned long utime, stime, cutime, cstime;
unsigned long start_time;
- long per_cpu_utime[NR_CPUS], per_cpu_stime[NR_CPUS];
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
int swappable:1;
===== kernel/fork.c 1.49 vs edited =====
--- 1.49/kernel/fork.c Mon Jul 1 14:41:36 2002
+++ edited/kernel/fork.c Tue Jul 16 00:35:26 2002
@@ -725,16 +725,7 @@
p->tty_old_pgrp = 0;
p->utime = p->stime = 0;
p->cutime = p->cstime = 0;
-#ifdef CONFIG_SMP
- {
- int i;
-
- /* ?? should we just memset this ?? */
- for(i = 0; i < NR_CPUS; i++)
- p->per_cpu_utime[i] = p->per_cpu_stime[i] = 0;
- spin_lock_init(&p->sigmask_lock);
- }
-#endif
+ spin_lock_init(&p->sigmask_lock);
p->array = NULL;
p->lock_depth = -1; /* -1 = no lock */
p->start_time = jiffies;
===== kernel/timer.c 1.17 vs edited =====
--- 1.17/kernel/timer.c Mon Jul 1 14:41:36 2002
+++ edited/kernel/timer.c Tue Jul 16 00:35:26 2002
@@ -569,8 +569,6 @@
void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu)
{
- p->per_cpu_utime[cpu] += user;
- p->per_cpu_stime[cpu] += system;
do_process_times(p, user, system);
do_it_virt(p, user);
do_it_prof(p);

2002-09-20 05:53:52

by Linus Torvalds

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

In article <[email protected]>,
Rik van Riel <[email protected]> wrote:
>On Thu, 19 Sep 2002, Larry McVoy wrote:
>> On Thu, Sep 19, 2002 at 11:01:33PM -0300, Rik van Riel wrote:
>
>> > So, where did you put those 800 MB of kernel stacks needed for
>> > 100,000 threads ?
>>
>> Come on, you and I normally agree, but 100,000 threads? Where is the
>> need for that?
>
>I agree, it's pretty silly. But still, I was curious how they
>managed to achieve it ;)

You didn't read the post carefully.

They started and waited for 100,000 threads.

They did not have them all running at the same time. I think the
original post said something like "up to 50 at a time".

Basically, the benchmark was how _fast_ thread creation is, not now many
you can run at the same time. 100k threads at once is crazy, but you can
do it now on 64-bit architectures if you really want to.

Linus

2002-09-20 07:40:30

[permalink] [raw]

Subject: Re: 100,000 threads? [was: [ANNOUNCE] Native POSIX Thread Library 0.1]

On Thu, 19 Sep 2002, Rik van Riel wrote:

> So, where did you put those 800 MB of kernel stacks needed for 100,000
> threads ?

With the default split and kernel stack we can start up 94,000 threads on
x86. With Ben's/Dave's patch we can have up to 188,000 threads. With a 2:2
GB VM split configured we can start 376,000 threads. If someone's that
desperate then with a 1:3 split we can start up 564,000 threads.

Anton tested 1 million concurrent threads on one of his bigger PowerPC
boxes, which started up in around 30 seconds. I think he saw a load
average of around 200 thousand. [ie. the runqueue was probably a few
hundred thousand entries long at times.]

> If you used the standard 3:1 user/kernel split you'd be using all of
> ZONE_NORMAL for kernel stacks, but if you use a 2:2 split you'll end up
> with a lot less user space (bad if you want to have many threads in the
> same address space).

the extreme high-end of threading typically uses very controlled
applications and very small user level stacks.

as to the question of why so many threads, the answer is because we can :)
This, besides demonstrating some of the recent scalability advances, gives
us the warm fuzzy feeling that things are right in this area. I mean,
there are architectures where Linux could map a petabyte of RAM just fine,
even though that might not be something we desperately need today.

Ingo

2002-09-20 07:41:27

by Joerg

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Linus Torvalds wrote:
> They started and waited for 100,000 threads.
>
> They did not have them all running at the same time. I think the
> original post said something like "up to 50 at a time".

To quote Ulrich:

"Even on IA-32 with its limited address space and memory handling
running 100,000 concurrent threads was no problem at all, creating
and destroying the threads did not take more than two seconds."

It clearly states 100,000 CONCURRENT threads. So, it really seems to
work (not that I have the hardware to verify this claim).

Regards
J?rg

=====
--
Regards
Joerg

__________________________________________________________________

Gesendet von Yahoo! Mail - http://mail.yahoo.de
M?chten Sie mit einem Gru? antworten? http://grusskarten.yahoo.de

2002-09-20 07:51:18

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Linus Torvalds wrote:

> They did not have them all running at the same time. I think the
> original post said something like "up to 50 at a time".

actually, that was Ulrich's other test, which tests the serial starting of
100,000 threads.

the test i did started up 100,000 concurrent threads which shot up the
load-average to a couple of thousands. [the default timeslice the parent
has is enough to start more than 50,000 parallel threads a pop or so.]

> Basically, the benchmark was how _fast_ thread creation is, not now many
> you can run at the same time. 100k threads at once is crazy, but you can
> do it now on 64-bit architectures if you really want to.

we did both, and on the dual-P4 testbox i have started and stopped 100,000
*parallel* threads in less than 2 seconds. Ie. starting up 100,000 threads
without any throttling, waiting for all of them to start up, then killing
them all. It needs roughly 1 GB of RAM to do this test on the default x86
kernel, it need roughly 500 MB of RAM to do this test with the IRQ-stacks
patch applied.

with 2.5.31 this test would have taken roughly 15 minutes, on the same
box, provided the NMI watchdog is turned off.

with 100,000 threads started up and idling silently the system is
completely usable - all the critical for_each_task loops have been fixed.
Obviously with 100,000 threads running at once there's some shortage in
CPU power :-) [ I will perhaps try that once, at SCHED_BATCH priority,
just for kicks. Not that it makes much sense - they will get a 3 seconds
worth of timeslice every 3 days. ]

Ingo

2002-09-20 09:49:12

by Padraig Brady

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Ulrich Drepper wrote:
> We are pleased to announce the first publically available source
> release of a new POSIX thread library for Linux
[snip]
> called Native POSIX Thread Library, NPTL.

Great! Where does this leave NGPT though? I had assumed that
this was going to be the next pthread implementation in glibc.

also:

-------- Original Message --------
Subject: glibc threading performance
Date: Mon, 16 Sep 2002 10:42:42 +0100
From: Padraig Brady <[email protected]>
To: Ingo Molnar <[email protected]>, Ulrich Drepper <[email protected]>

Hey guys,

I noticed you're looking at threading stuff lately,
and was wondering about this thread:
http://sources.redhat.com/ml/bug-glibc/2001-12/msg00048.html

In summary wouldn't it be better to have a per process
flag that was only set when pthread_create() is called.
If the flag is not set, then you don't need to do locking.
This locking seems to have huge overhead. For e.g. I
patched uniq in textutils to use getc_unlocked() rather
than getc() and got a 300% performance increase!

cheers,
P?draig.

2002-09-20 09:49:20

by Adrian Bunk

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, 19 Sep 2002, Ulrich Drepper wrote:

>...
> Unless major flaws in the design are found this code is intended to
> become the standard POSIX thread library on Linux system and it will
> be included in the GNU C library distribution.
>...
> - - requires a kernel with the threading capabilities of Linux 2.5.36.
>...

My personal estimation is that Debian will support kernel 2.4 in it's
stable distribution until 2006 or 2007 (this is based on the experience
that Debian usually supports two stable kernel series and the time between
stable releases of Debian is > 1 year). What is the proposed way for
distributions to deal with this?

cu
Adrian

--

You only think this is a free country. Like the US the UK spends a lot of
time explaining its a free country because its a police state.
Alan Cox

2002-09-20 10:15:28

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, Sep 19, 2002 at 05:41:37PM -0700, Ulrich Drepper wrote:
> It is not generally accepted that a 1-on-1 model is superior but our
> tests showed the viability of this approach and by comparing it with
> the overhead added by existing M-on-N implementations we became
> convinced that 1-on-1 is the right approach.

Maybe not but...

You might like to try a context switching/thread wakeup performance
measurement against FreeBSD's libc_r. I'd imagine that it's difficult
to beat a system like that since they keep all of that stuff in
userspace since it's just 2 context switches and a call to their
thread-kernel.

I'm curious as to the rough numbers you got doing the 1:1 and M:N
comparison.

bill

2002-09-20 10:30:27

by Luca Barbieri

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Great, but how about using code similar to the following rather than
hand-coded asm operations?

extern struct pthread __pt_current_struct asm("%gs:0");
#define __pt_current (&__pt_current_struct)

#define THREAD_GETMEM(descr, member) (__pt_current->member)
#define THREAD_SETMEM(descr, member, value) ((__pt_current->member) =
value)
#define THREAD_MASKMEM(descr, member, mask) ((__pt_current->member) &=
mask)
...

Of course, it doesn't work if you try to take the address of a member.

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2002-09-20 10:34:46

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> You might like to try a context switching/thread wakeup performance
> measurement against FreeBSD's libc_r. I'd imagine that it's difficult to
> beat a system like that since they keep all of that stuff in userspace
> since it's just 2 context switches and a call to their thread-kernel.

our kernel thread context switch latency is below 1 usec on a typical P4
box, so our NPT library should compare pretty favorably even in such
benchmarks. We get from the pthread_create() call to the first user
instruction of the specified thread-function code in less than 2 usecs,
and we get from pthread_exit() to the thread that does the pthread_join()
in less than 2 usecs as well - all of these operations are done via a
single system-call and a single context switch.

also consider the fact that the true cost of M:N threading does not show
up with just one or two threads running. The true cost comes when
thousands of threads are running, each of them doing nontrivial work that
matters, ie. IO. The true cost of M:N shows up when threading is actually
used for what it's intended to be used :-) And basically nothing offloads
work to threads for them to just do userspace synchronization - real,
useful work always involves some sort of IO and kernel calls. At which
point M:N loses out badly.

M:N's big mistake is that it concentrates on what matters the least:
user<->user context switches. Nothing really wants to do that. And if it
does, it's contended on some userspace locking object, at which point it
doesnt really matter whether the cost of switching is 1 usec or 0.5 usecs,
the main application cost is the lost paralellism and increased cache
trashing due to the serialization - independently of what kind of
threading abstraction is used.

and since our NPT library uses futexes for *all* userspace synchronization
primitives (including internal glibc locks), all uncontended
synchronization is done purely in user-space. [and for the contended case
we *want* to switch into the kernel.]

Ingo

2002-09-20 10:41:45

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Adrian Bunk wrote:

> > Unless major flaws in the design are found this code is intended to
> > become the standard POSIX thread library on Linux system and it will
> > be included in the GNU C library distribution.
> >...
> > - - requires a kernel with the threading capabilities of Linux 2.5.36.
> >...
>
> My personal estimation is that Debian will support kernel 2.4 in it's
> stable distribution until 2006 or 2007 (this is based on the experience
> that Debian usually supports two stable kernel series and the time
> between stable releases of Debian is > 1 year). What is the proposed way
> for distributions to deal with this?

Ulrich will give a fuller reply i guess, but the new threading code in 2.5
does not disable (or in any way obsolete) the old glibc threading library.
So by doing boot-time kernel version checks glibc can decide whether it
wants to provide the new library or the old library.

Ingo

2002-09-20 11:06:56

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On 20 Sep 2002, Luca Barbieri wrote:

> Great, but how about using code similar to the following rather than
> hand-coded asm operations?
>
> extern struct pthread __pt_current_struct asm("%gs:0");
> #define __pt_current (&__pt_current_struct)
>
> #define THREAD_GETMEM(descr, member) (__pt_current->member)
> #define THREAD_SETMEM(descr, member, value) ((__pt_current->member) =
> value)
> #define THREAD_MASKMEM(descr, member, mask) ((__pt_current->member) &=
> mask)
> ...

it's a good idea i think. Ulrich has an obsession with writing code in
assembly though :-)

Ingo

2002-09-20 12:01:09

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, Sep 20, 2002 at 12:47:12PM +0200, Ingo Molnar wrote:
> our kernel thread context switch latency is below 1 usec on a typical P4
> box, so our NPT library should compare pretty favorably even in such
> benchmarks. We get from the pthread_create() call to the first user
> instruction of the specified thread-function code in less than 2 usecs,
> and we get from pthread_exit() to the thread that does the pthread_join()
> in less than 2 usecs as well - all of these operations are done via a
> single system-call and a single context switch.

That's outstanding...

> also consider the fact that the true cost of M:N threading does not show
> up with just one or two threads running. The true cost comes when
> thousands of threads are running, each of them doing nontrivial work that
> matters, ie. IO. The true cost of M:N shows up when threading is actually
> used for what it's intended to be used :-) And basically nothing offloads
> work to threads for them to just do userspace synchronization - real,
> useful work always involves some sort of IO and kernel calls. At which
> point M:N loses out badly.

It can. Certainly, if IO upcall overhead is greater than just running the
thread that's blocked inside the kernel, then yes. Not sure how this is all
going to play out...

> M:N's big mistake is that it concentrates on what matters the least:
> user<->user context switches. Nothing really wants to do that. And if it
> does, it's contended on some userspace locking object, at which point it
> doesnt really matter whether the cost of switching is 1 usec or 0.5 usecs,
> the main application cost is the lost paralellism and increased cache
> trashing due to the serialization - independently of what kind of
> threading abstraction is used.

Yeah, that's not a new argument and is a solid criticism...

Hmmm, random thoughts... This is probably outside the scope of lkml,
but...

I'm trying to think up a possible problem with how the JVM does threading that
might be able to exploit this kind of situation...Hmm, there's locks on
the method dictionary, but that's not something that's generally changing a
lot of the time... I'll give it some thought.

The JVM needs a couple of pretty critical things that are a bit off from
the normal Posix threading standard. One of them is very fast thread
suspension for both individual threads and the all threads accept the
currently running one...

In the Solaris threads implementation of JVM/HotSpot it has two methods of
getting a ucontext for doing GC and wierd exception/signal handling via
safepoints (a JIT compiler goody) and it would be nice to have...

1) Slow Version. Throw a SIGUSR1 at a thread and read/write the ucontext on
the signal frame itself.

2) Fast Version. The thread state and ucontext is examined directly to determine
the validity of the stored thread context, whether it's blocked on
a syscall (ignore it) or was doing a CPU intensive operation (use it).

That ucontext is used for various things:

a) Proper GC so that registers that might contain valid references are
taken into account properly to maintain the correctness of the
mark/sweep algorithms.

b) The thread's program counter value is altered to deal with safepoints.

(2) above being the most desireable since it's a kind of fast path for
(a) and (b).

So userspace exposure to the thread's ucontext would be a good thing.
I'm not sure how this is dealt within the current implementation of
what you folks are doing at this moment.

> primitives (including internal glibc locks), all uncontended
> synchronization is done purely in user-space. [and for the contended case
> we *want* to switch into the kernel.]

If there's any thing on this planet that's going to stress a threading
system, it's going to be the JVM. I'll give what you've said a some
thought. My bias has been to FreeBSD's KSE project for the most part
over this last threading/development run.

/me thinks...

bill

2002-09-20 12:31:08

by James Lewis Nance

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, Sep 19, 2002 at 05:41:37PM -0700, Ulrich Drepper wrote:

> We are pleased to announce the first publically available source
> release of a new POSIX thread library for Linux. As part of the
> continuous effort to improve Linux's capabilities as a client, server,
> and computing platform Red Hat sponsored the development of this
> completely new implementation of a POSIX thread library, called Native
> POSIX Thread Library, NPTL.

Is this related to the thread library work that IBM was doing
or was this independently developed?

Thanks,

Jim

2002-09-20 13:23:43

by Robert Love

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 2002-09-20 at 05:53, Padraig Brady wrote:

> Great! Where does this leave NGPT though? I had assumed that
> this was going to be the next pthread implementation in glibc.

This was never the intention of the glibc people.

Robert Love

2002-09-20 15:45:32

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Thu, 19 Sep 2002, Ulrich Drepper wrote:

> We are pleased to announce the first publically available source
> release of a new POSIX thread library for Linux. As part of the
> continuous effort to improve Linux's capabilities as a client, server,
> and computing platform Red Hat sponsored the development of this
> completely new implementation of a POSIX thread library, called Native
> POSIX Thread Library, NPTL.
>
> Unless major flaws in the design are found this code is intended to
> become the standard POSIX thread library on Linux system and it will
> be included in the GNU C library distribution.

If the comment that this doesn't work with the stable kernel is correct, I
consider that a pretty major flaw. Unlike the kernel and NGPT which are
developed using an open source model with lots of eyes on the WIP, this
was done and then released whole with the decision to include it in the
standard library already made. Having any part of glibc not work with the
current stable kernel doesn't seem like such a hot idea, honestly.

> The work visible here is the result of close collaboration of kernel
> and runtime developers. The collaboration proceeded by developing the
> kernel changes while writing the appropriate parts of the thread
> library. Whenever something couldn't be implemented optimally some
> interface was changed to eliminate the issue. The result is this
> thread library which is, unlike previous attempts, a very thin layer
> on top of the kernel. This helps to achieve a maximum of performance
> for a minimal price.

> Initial confirmations were test runs with huge numbers of threads.
> Even on IA-32 with its limited address space and memory handling
> running 100,000 concurrent threads was no problem at all, creating
> and destroying the threads did not take more than two seconds. This
> all was made possible by the kernel work performed as part of this
> project.

Is there a performance comparison with current pthreads and NGPT and more
typical levels of 5-10k threads as seen on news/mail/dns/web servers?
Eliminating overhead is good, but in most cases there just isn't all that
much overhead in NGPT. I haven't measured Linux threads, but there are a
lot of bad urban legends about them ;-)

> Building glibc with the new thread library is demanding on the
> compilation environment.
>
> - - The 2.5.36 kernel or above must be installed and used. To compile
> glibc it is necessary to create the symbolic link
>
> /lib/modules/$(uname -r)/build
>
> to point to the build directory.
>
> - - The general compiler requirement for glibc is at least gcc 3.2. For
> the new thread code it is even necessary to have working support for
> the __thread keyword.
>
> Similarly, binutils with functioning TLS support are needed.
>
> The (Null) beta release of the upcoming Red Hat Linux product is
> known to have the necessary tools available after updating from the
> latest binaries on the FTP site. This is no ploy to force everybody
> to use Red Hat Linux, it's just the only environment known to date
> which works.

Of course not, it's coincidence that only Redhat has these things readily
available, perhaps because this was developed where no other vendor knew
it existed and could have support ready for it.

Modulo my comments on not putting things in libraries which don't widely
work, this souncds as if it is less complex and hopefully more stable (not
that NGPT isn't) and lower maintenence. I'd love to see comparisons of the
three libraries under some typical load, how about a Redhat DNS server
running threaded bind? Run a day with each library and look at response
time, load average, and of course stability.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-20 15:49:34

[permalink] [raw]

Subject: Re: 100,000 threads? [was: [ANNOUNCE] Native POSIX Thread Library 0.1]

On Fri, 20 Sep 2002, Ingo Molnar wrote:

> the extreme high-end of threading typically uses very controlled
> applications and very small user level stacks.
>
> as to the question of why so many threads, the answer is because we can :)
> This, besides demonstrating some of the recent scalability advances, gives
> us the warm fuzzy feeling that things are right in this area. I mean,
> there are architectures where Linux could map a petabyte of RAM just fine,
> even though that might not be something we desperately need today.

I think testing at these high numbers is a good proof of scalability,
although response and stability are also important. Before I went to NGPT
I had a fair bit of problem with learning experiences after threads got
beyond 200 or so.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-20 16:04:40

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On 20 Sep 2002, Robert Love wrote:

> On Fri, 2002-09-20 at 05:53, Padraig Brady wrote:
>
> > Great! Where does this leave NGPT though? I had assumed that
> > this was going to be the next pthread implementation in glibc.
>
> This was never the intention of the glibc people.

Was there some shortcoming in NGPT? Clearly someone provided a good bit of
funding for this, so there must have been motivation beyond NIH or funding
someone's honors thesis.

I also expected NGPT to be the next step, not a library which requires a
kernel which is unlikely to be stable for 18 months.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-20 16:07:32

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> The JVM needs a couple of pretty critical things that are a bit off from
> the normal Posix threading standard. One of them is very fast thread
> suspension for both individual threads and the all threads accept the
> currently running one...

the user contexts for active but preempted threads are stored in the
kernel stack. To support GC safepoints we need fast access to the current
state of every not voluntarily preempted thread. This is admittedly easier
if threads are abstraced in user-space [in which case the context is
stored in user-space], but the question is, what is more important, an
occasional pass of garbage collection, or the cost of doing IO?

until then it can be done via sending SIGSTOP/SIGCONT to the process PID
from the garbage collection thread, which should stop all threads pretty
efficiently in 2.5.35+ kernels. Then all threads that are not voluntarily
sleeping can be fixed up via ptrace calls.

and it can be further improved by tracking preempted user contexts in the
scheduler and giving fast access to them via a syscall. (all voluntarily
sleeping contexts can properly prepare their suspension state in
userspace.) So it's possible to do it efficiently.

how frequently does the GC thread run?

Ingo

2002-09-20 16:11:22

by Jakub Jelinek

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, Sep 20, 2002 at 11:43:15AM -0400, Bill Davidsen wrote:
> > Unless major flaws in the design are found this code is intended to
> > become the standard POSIX thread library on Linux system and it will
> > be included in the GNU C library distribution.
>
> If the comment that this doesn't work with the stable kernel is correct, I
> consider that a pretty major flaw. Unlike the kernel and NGPT which are
> developed using an open source model with lots of eyes on the WIP, this
> was done and then released whole with the decision to include it in the
> standard library already made. Having any part of glibc not work with the
> current stable kernel doesn't seem like such a hot idea, honestly.

glibc supports .note.ABI-tag notes for libraries, so there is no problem
with having NPTL libpthread.so.0 --enable-kernel=2.5.36 in say
/lib/i686/libpthread.so.0 and linuxthreads --enable-kernel=2.2.1 in
/lib/libpthread.so.0. The dynamic linker will then choose based
on currently running kernel.
(well, ATM because of libc tsd DL_ERROR --without-tls ld.so cannot be used
with --with-tls libs and vice versa, but that is beeing worked on).

That's similar to non-FLOATING_STACK and FLOATING_STACK linuxthreads,
the latter can be used with 2.4.8+ or something kernels on IA-32.

> > - - The general compiler requirement for glibc is at least gcc 3.2. For
> > the new thread code it is even necessary to have working support for
> > the __thread keyword.
> >
> > Similarly, binutils with functioning TLS support are needed.
> >
> > The (Null) beta release of the upcoming Red Hat Linux product is
> > known to have the necessary tools available after updating from the
> > latest binaries on the FTP site. This is no ploy to force everybody
> > to use Red Hat Linux, it's just the only environment known to date
> > which works.
>
> Of course not, it's coincidence that only Redhat has these things readily
> available, perhaps because this was developed where no other vendor knew
> it existed and could have support ready for it.

Because all of glibc/gcc/binutils TLS support was developed together (and
still is)? All the changes are publicly available, mostly in the
corresponding CVS archives.

Jakub

2002-09-20 16:29:54

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002 [email protected] wrote:

> > We are pleased to announce the first publically available source
> > release of a new POSIX thread library for Linux. As part of the
> > continuous effort to improve Linux's capabilities as a client, server,
> > and computing platform Red Hat sponsored the development of this
> > completely new implementation of a POSIX thread library, called Native
> > POSIX Thread Library, NPTL.
>
> Is this related to the thread library work that IBM was doing or was
> this independently developed?

independently developed.

Ingo

2002-09-20 17:19:12

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Jakub Jelinek wrote:

> On Fri, Sep 20, 2002 at 11:43:15AM -0400, Bill Davidsen wrote:
> > > Unless major flaws in the design are found this code is intended to
> > > become the standard POSIX thread library on Linux system and it will
> > > be included in the GNU C library distribution.
> >
> > If the comment that this doesn't work with the stable kernel is correct, I
> > consider that a pretty major flaw. Unlike the kernel and NGPT which are
> > developed using an open source model with lots of eyes on the WIP, this
> > was done and then released whole with the decision to include it in the
> > standard library already made. Having any part of glibc not work with the
> > current stable kernel doesn't seem like such a hot idea, honestly.
>
> glibc supports .note.ABI-tag notes for libraries, so there is no problem
> with having NPTL libpthread.so.0 --enable-kernel=2.5.36 in say
> /lib/i686/libpthread.so.0 and linuxthreads --enable-kernel=2.2.1 in
> /lib/libpthread.so.0. The dynamic linker will then choose based
> on currently running kernel.
> (well, ATM because of libc tsd DL_ERROR --without-tls ld.so cannot be used
> with --with-tls libs and vice versa, but that is beeing worked on).
>
> That's similar to non-FLOATING_STACK and FLOATING_STACK linuxthreads,
> the latter can be used with 2.4.8+ or something kernels on IA-32.

Good point, I had forgotten that! It will be somewhat large having both,
but presumably someone who really worried about it would build a subset.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-20 18:42:32

by Roland McGrath

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> On 20 Sep 2002, Luca Barbieri wrote:
>
> > Great, but how about using code similar to the following rather than
> > hand-coded asm operations?
> >
> > extern struct pthread __pt_current_struct asm("%gs:0");
> > #define __pt_current (&__pt_current_struct)

Try that under -fpic and you will see the problem.

2002-09-20 18:59:29

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Adrian Bunk wrote:

> My personal estimation is that Debian will support kernel 2.4 in it's
> stable distribution until 2006 or 2007 (this is based on the experience
> that Debian usually supports two stable kernel series and the time between
> stable releases of Debian is > 1 year). What is the proposed way for
> distributions to deal with this?

Two ways:

- - continue to use the old code

- - backport the required functionality

Note that not all the changes Ingo made have to be ported back to 2.4.
Only those required for correct execution, not the optimizations.

Whether Marcello is interested in this I cannot say, I doubt it though.
But this does not mean you cannot have such a kernel in Debian.

- --
- ---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE9i3Fb2ijCOnn/RHQRAlC+AJ9kXWMdkfuORtodijTXQ+Hnah0ZYQCfZkOT
Axzw/z1VEFVXIQdZ4d8PLe4=
=ptvg
-----END PGP SIGNATURE-----

2002-09-20 21:16:45

by Luca Barbieri

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> Try that under -fpic and you will see the problem.
Unfortunately it tries to get it using the GOT and I can't find any
practical workaround, so ignore my broken suggestion.

Attachments:

signature.asc (189.00 B)
This is a digitally signed message part

2002-09-20 21:45:31

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, Sep 20, 2002 at 06:20:10PM +0200, Ingo Molnar wrote:
> the user contexts for active but preempted threads are stored in the
> kernel stack. To support GC safepoints we need fast access to the current
> state of every not voluntarily preempted thread. This is admittedly easier
> if threads are abstraced in user-space [in which case the context is
> stored in user-space], but the question is, what is more important, an
> occasional pass of garbage collection, or the cost of doing IO?

The GC is generational and incremental, so it must deal with a lot of short
term objects that need to be collected. GC performance is a sore point in the
JVM (language runtimes in general) and having slow access to this is potentially
crippling for high load machines. The HotSpot/JVM is a bit overzealous with
safepoints which opens this area to optimization, but the problem exists now.

> until then it can be done via sending SIGSTOP/SIGCONT to the process PID
> from the garbage collection thread, which should stop all threads pretty
> efficiently in 2.5.35+ kernels. Then all threads that are not voluntarily
> sleeping can be fixed up via ptrace calls.

It's better to have an explict pthread_suspend_[thread,all]() function
since this kind of thing is becoming more and more common in thread
heavy language runtimes. The Posix thread spec was build without regard
to this and it's definitely become an important issues these days.

> and it can be further improved by tracking preempted user contexts in the
> scheduler and giving fast access to them via a syscall. (all voluntarily
> sleeping contexts can properly prepare their suspension state in
> userspace.) So it's possible to do it efficiently.
>
> how frequently does the GC thread run?

Don't remember off hand, but it's like to be several times a second which is
often enough to be a problem especially on large systems with high load.

The JVM with incremental GC is being targetted for media oriented tasks
using the new NIO, 3d library, etc... slowness in safepoints would cripple it
for these tasks. It's a critical item and not easily address by the current
1:1 model.

bill

2002-09-20 22:25:13

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> It's better to have an explict pthread_suspend_[thread,all]() function

could this be implemented by having a gc thread in a unique process group
and then suspending the jvm process group?

-dean

2002-09-20 23:01:38

by J.A. Magallon

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On 2002.09.20 Ulrich Drepper wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Adrian Bunk wrote:
>
>> My personal estimation is that Debian will support kernel 2.4 in it's
>> stable distribution until 2006 or 2007 (this is based on the experience
>> that Debian usually supports two stable kernel series and the time between
>> stable releases of Debian is > 1 year). What is the proposed way for
>> distributions to deal with this?
>
>Two ways:
>
>- - continue to use the old code
>
>- - backport the required functionality
>

Could you post a list of requirements ? For example:
- kernel: futexes, per_cpu_areas
- toolchain: binutils version + RH-patches, gcc version
- glibc: 2.2.xxxx
etc...

Perhaps it is not so difficult, for example futexes are in -aa for 2.4,
Mandrake has gcc-3.2, etc...

Are you pushing hard for the infrastructure you need to get in standard
source trees (ie, changes to gcc, binutils...) ??

Thanks.

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.0 (Cooker) for i586
Linux 2.4.20-pre7-jam0 (gcc 3.2 (Mandrake Linux 9.0 3.2-1mdk))

2002-09-20 23:06:34

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, Sep 20, 2002 at 03:30:19PM -0700, dean gaudet wrote:
> > It's better to have an explict pthread_suspend_[thread,all]() function
>
> could this be implemented by having a gc thread in a unique process group
> and then suspending the jvm process group?

Suspending how ? via signal ?

Possibly, but having an explicit syscall() call is important since interrupts
are also suspended under that condition, pthread_cond_timedwait(), etc...
It really needs to be suspended in a way that's different than the SIGSOMETHING
mechanism. I was fixing bugs in libc_r, so I know the issues to a certain degree
and bad logic those particular corner cases was screwing me up.

bill

2002-09-20 23:28:44

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

J.A. Magallon wrote:

> Could you post a list of requirements ? For example:
> - kernel: futexes, per_cpu_areas

There's a lot more. The signal handling, the exec handling, the exit
handling, the clone extensions.

> - toolchain: binutils version + RH-patches, gcc version

No RH patches. Don't spread misinformation.

All the changes needed for kernel, glibc, and the tools are in the
official sources trees. It's just that nobody else ships those versions
or versions with the necessary features backported.

> - glibc: 2.2.xxxx

The announcement said it clearly: you need at least the 2.3 prerelease
of as yesterday. Again, all in the public archive.

- --
- ---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQE9i7BI2ijCOnn/RHQRAn0nAKCF4cZO9K2/vhNbCuawvk0ecM2SCQCeOKE9
Qg19wkleGKXFmr3plY1dbho=
=oWVM
-----END PGP SIGNATURE-----

2002-09-20 23:37:16

by J.A. Magallon

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On 2002.09.21 Ulrich Drepper wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>> - toolchain: binutils version + RH-patches, gcc version
>
>No RH patches. Don't spread misinformation.
>

Oh, sorry, no misinformation intended. As I read previous posts
I understood that there were things developed still not in standard trees.
Sometimes I see in Mandrake changelogs 'patch-X, from RH', and then
'patch-X removed, merged upstream'

/by

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.0 (Cooker) for i586
Linux 2.4.20-pre7-jam0 (gcc 3.2 (Mandrake Linux 9.0 3.2-1mdk))

2002-09-20 23:40:07

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, Sep 20, 2002 at 02:50:29PM -0700, Bill Huey wrote:
> > how frequently does the GC thread run?
>
> Don't remember off hand, but it's like to be several times a second which is
> often enough to be a problem especially on large systems with high load.
>
> The JVM with incremental GC is being targetted for media oriented tasks
> using the new NIO, 3d library, etc... slowness in safepoints would cripple it
> for these tasks. It's a critical item and not easily address by the current
> 1:1 model.

Also throwing a signal to get the ucontext is pretty a expensive way of getting
it. But you folks know this already. Solaris threading has this via a some special
libraries. For large number of actively running threads, say, executing in a middle
of a method block it is potentially a huge problem for scalability.

Again, it's a critical issue from what I see of this.

bill

2002-09-21 03:33:15

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> On Fri, Sep 20, 2002 at 03:30:19PM -0700, dean gaudet wrote:
> > > It's better to have an explict pthread_suspend_[thread,all]() function
> >
> > could this be implemented by having a gc thread in a unique process group
> > and then suspending the jvm process group?
>
> Suspending how ? via signal ?

yeah SIGSTOP to the jvm process group.

> Possibly, but having an explicit syscall() call is important since interrupts
> are also suspended under that condition, pthread_cond_timedwait(), etc...
> It really needs to be suspended in a way that's different than the SIGSOMETHING
> mechanism. I was fixing bugs in libc_r, so I know the issues to a certain degree
> and bad logic those particular corner cases was screwing me up.

SIGSTOP is different from other signals because it will stop the whole
process group from continuing. i am completely aware of how much of a
pain it is to actually trap signals and do something (for apache 2.0's
design i outlawed the use of signals because of the pains of getting
things working in 1.3.x :).

doesn't the hotspot GC work something like this:

- stop all threads
- go read each thread's $pc, and find its nearest "safety point"
- go overwrite that safety point (YUCK SELF MODIFYING CODE!! :) with
something which will stop the thread
- start the threads and wait for them all to get to their safety points
- perform gc
- undo the above mess

the only part of that which looks challenging with kernel threads is the
$pc reading part... ptrace will certainly get it for you, but that's a
lot of syscall overhead.

or am i missing something?

-dean

2002-09-21 03:56:24

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, Sep 20, 2002 at 08:38:20PM -0700, dean gaudet wrote:
> SIGSTOP is different from other signals because it will stop the whole
> process group from continuing. i am completely aware of how much of a
> pain it is to actually trap signals and do something (for apache 2.0's
> design i outlawed the use of signals because of the pains of getting
> things working in 1.3.x :).

There's definitely a need for a pthread_suspend_something() call...

> doesn't the hotspot GC work something like this:
>
> - stop all threads
> - go read each thread's $pc, and find its nearest "safety point"
> - go overwrite that safety point (YUCK SELF MODIFYING CODE!! :) with
> something which will stop the thread
> - start the threads and wait for them all to get to their safety points
> - perform gc
> - undo the above mess

+ read the entire ucontext for EAX, etc... so that it can be used for GC
roots. It could be allocating something in an executing method block
that hasn't hit stack or any kind of variable storage known to the GC.

> the only part of that which looks challenging with kernel threads is the
> $pc reading part... ptrace will certainly get it for you, but that's a
> lot of syscall overhead.

And the entire ucontext.

> or am i missing something?

The ucontext. ;)

bill

2002-09-21 04:35:38

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> The JVM with incremental GC is being targetted for media oriented tasks
> using the new NIO, 3d library, etc... slowness in safepoints would
> cripple it for these tasks. It's a critical item and not easily address
> by the current 1:1 model.

actually, in the previous mail i've outlined a sensible way to help
safepoints in the kernel, for the case of the 1:1 model. I'd not call that
'not easily addressed' :-)

there's an even more advanced way to expose preempted user contexts in the
1:1 model: by putting most of the the register info (which is now dumped
into the kernel stack) into a page that is also mapped into user-space.
This too introduces (constant) syscall entry/exit overhead, but can be
done if justified.

Ingo

2002-09-21 04:45:51

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> Also throwing a signal to get the ucontext is pretty a expensive way of
> getting it. But you folks know this already. [...]

as i've mentioned in the previous mail, 2.5.35+ kernels have a very fast
SIGSTOP/SIGCONT implementation, which change was done as part of this
project - a few orders faster than throwing/catching SIGUSR1 to every
single thread for example.

so right now we first need to get some results back about how big the GC
problem is with the new SIGSTOP/SIGCONT implementation. If it's still not
fast enough then we still have a number of options.

Ingo

2002-09-21 04:53:45

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> > doesn't the hotspot GC work something like this:
> >
> > - stop all threads
> > - go read each thread's $pc, and find its nearest "safety point"
> > - go overwrite that safety point (YUCK SELF MODIFYING CODE!! :) with
> > something which will stop the thread
> > - start the threads and wait for them all to get to their safety points
> > - perform gc
> > - undo the above mess
>
> + read the entire ucontext for EAX, etc... so that it can be used for GC
> roots. It could be allocating something in an executing method block
> that hasn't hit stack or any kind of variable storage known to the GC.

PTRACE_GETREGS. Yeah, it's overhead, see my previous mails about various
levels of kernel features of how to do it potentially cheaper. Not like
the above process is particularly fast.

One more method to speed it up: to amortize the kernel entry overhead we
could introduce a new PTRACE_GETREGS_GROUP to get a full array of user
contexts from all group member threads, via a single system-call.

> > the only part of that which looks challenging with kernel threads is the
> > $pc reading part... ptrace will certainly get it for you, but that's a
> > lot of syscall overhead.
>
> And the entire ucontext.

PTRACE_GETREGS gets you the instruction pointer and all general purpose
registers, eax and the rest.

Ingo

2002-09-22 01:34:00

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Sat, Sep 21, 2002 at 06:48:40AM +0200, Ingo Molnar wrote:
> actually, in the previous mail i've outlined a sensible way to help
> safepoints in the kernel, for the case of the 1:1 model. I'd not call that
> 'not easily addressed' :-)
>
> there's an even more advanced way to expose preempted user contexts in the
> 1:1 model: by putting most of the the register info (which is now dumped
> into the kernel stack) into a page that is also mapped into user-space.
> This too introduces (constant) syscall entry/exit overhead, but can be
> done if justified.

Maybe mmapping a special device into memory ? /proc/satanic_procID_666* ???

A method needs to be considered, definitely. Getting some Sun/Blackdown folks on
this thread wouldn't be bad either.

I'm not exactly sure what Solaris does in this case, so it might be worth
investigating it so that this is conceptually regular across various Unix variants
to a certain degree.

It's also essential to have the run states along with the ucontext to determine
the validity of ucontext backing the thread. Obviously, being blocked in the
kernel on a IO request isn't going to result in any thing useable GC. And that's
because libc library calls are external symbols and don't preserve registers
across calls.

Also, permission to the PTRACE* interface shouldn't conflict with debuggers...
Complete multithreaded debugging is important of course. I would expect GDB to be
modified so that it can get and examine those register values and thread run
states too.

Uh, the JVM also has a habit of also checksumming the register contents over a window
of time to see if a thread is actively running through its internal debugging
facilities... It shouldn't have to lock or suspend thread execution to examine it.
Now that I think of it, syscall overlay technique isn't going work, since nothing
of interest to the GC is going to be present in the ucontext, so the above suggest
of exporting the ucontext at syscall points isn't going to work.

More to come...

bill

2002-09-22 02:46:45

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Sat, Sep 21, 2002 at 06:58:15AM +0200, Ingo Molnar wrote:
> as i've mentioned in the previous mail, 2.5.35+ kernels have a very fast
> SIGSTOP/SIGCONT implementation, which change was done as part of this
> project - a few orders faster than throwing/catching SIGUSR1 to every
> single thread for example.

That's good, but having an explict API for suspending threads is very useful,
since it can greatly simplify the already complicated signal handling in
highly threaded systems. It's something that your group should seriously
consider, since I expect some explicit thread suspension call to be implemented
in Posix threading standard, via their committee. Mainly, because of the
advent of heavily threaded language runtimes as standard programming staple.

It's a good thing to have regardless.

> so right now we first need to get some results back about how big the GC
> problem is with the new SIGSTOP/SIGCONT implementation. If it's still not
> fast enough then we still have a number of options.

I'm running out of things to say. ;)

bill

2002-09-22 13:54:20

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002, Bill Huey wrote:

> Don't remember off hand, but it's like to be several times a second which is
> often enough to be a problem especially on large systems with high load.
>
> The JVM with incremental GC is being targetted for media oriented tasks
> using the new NIO, 3d library, etc... slowness in safepoints would cripple it
> for these tasks. It's a critical item and not easily address by the current
> 1:1 model.

Could you comment on how whell this works (or not) with linuxthreads,
Solaris, and NGPT? I realize you probably haven't had time to look at NPTL
yet. If an N:M model is really better for your application you might be
able to just run NGPT.

Since preempt threads seem a problem, cound a dedicated machine run w/o
preempt? I assume when you say "high load" that you would be talking a
server, where performance is critical.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-22 18:52:37

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Bill Davidsen <[email protected]> writes:

> On Fri, 20 Sep 2002, Bill Huey wrote:
>
>
> > Don't remember off hand, but it's like to be several times a second which is
> > often enough to be a problem especially on large systems with high load.
> >
> > The JVM with incremental GC is being targetted for media oriented tasks
> > using the new NIO, 3d library, etc... slowness in safepoints would cripple it
> > for these tasks. It's a critical item and not easily address by the current
> > 1:1 model.
>
> Could you comment on how whell this works (or not) with linuxthreads,
> Solaris, and NGPT? I realize you probably haven't had time to look at NPTL
> yet. If an N:M model is really better for your application you might be
> able to just run NGPT.
>
> Since preempt threads seem a problem, cound a dedicated machine run w/o
> preempt? I assume when you say "high load" that you would be talking a
> server, where performance is critical.

>From 10,000 feet out I have one comment. If the VM has safe points. It sounds
like the problem is more that the safepoints don't provide the register
dumps more than anything else.

They are talking about an incremental GC routine so it does not need to stop
all threads simultaneously. Threads only need to be stopped when the GC is gather
a root set. This is what the safe points are for right? And it does
not need to be 100% accurate in finding all of the garbage. The
collector just needs to not make mistakes in the other direction.

I fail to see why:

/* This is a safe point ... */
if (needs to be suspended) {
save_all_registers_on_the_stack()
flag_gc_thread()
wait_until_gc_thread_has_what_it_needs()
}

Needs kernel support.

Eric

2002-09-22 20:44:29

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> The true cost of M:N shows up when threading is actually used
> for what it's intended to be used :-)

> M:N's big mistakes is that it concentrates on what
> matters the least: useruser context switches.

Well, from the perspective of the kernel, userspace is a black box.
Is that also true for kernel developers?

If you, as an application engineer, decide to use a multithreaded
design, it could be a) you want to learn or b) have some good
reasons to choose that.

Having multiple threads doing real work including IO means more
blocking IO and therefore more context switches. One reason to
choose threading is to _not_ have to use select/poll in app code.
If you gather more IO requests and multiplex them with select/poll
the chances are higher that the syscall returns without context
switch. Therefore you _save_ some real context switches with
useruser context switches.

Don't make the mistake to think too much about the optimal case.
(as Linus told us: optimize for the _common_ case :)

You think that one should have an almost equal number of threads
and processors. This is unrealistic despite some server apps
running on +4(8?) way systems. With this assumption nobody would
write a multithreaded desktop app (>90% are UP).

The effect of M:N on UP systems should be even more clear. Your
multithreaded apps can't profit of parallelism but they do not
add load to the system scheduler. The drawback: more syscalls
(I think about removing the need for
flags=fcntl(GETFLAGS);fcntl(fd,NONBLOCK);write(fd);fcntl(fd,flags))

Until we have some numbers we can't say which approach is better.
I'm convinced that apps exist that run better on one and others
on the other.

AIX and Irix deploy M:N - I guess for a good reason: it's more
flexible and combine both approaches with easy runtime tuning if
the app happens to run on SMP (the uncommon case).

Your great work at the scheduler and tuning on exit are highly
appreciated. Both models profit - of course 1:1 much more.

2002-09-22 21:27:53

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Sun, Sep 22, 2002 at 08:55:39PM +0200, Peter Waechtler wrote:
> AIX and Irix deploy M:N - I guess for a good reason: it's more
> flexible and combine both approaches with easy runtime tuning if
> the app happens to run on SMP (the uncommon case).

No, AIX and IRIX do it that way because their processes are so bloated
that it would be unthinkable to do a 1:1 model.

Instead of taking the traditional "we've screwed up the normal system
primitives so we'll event new lightweight ones" try this:

We depend on the system primitives to not be broken or slow.

If that's a true statement, and in Linux it tends to be far more true
than other operating systems, then there is no reason to have M:N.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-09-22 22:08:34

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On 22 Sep 2002, Eric W. Biederman wrote:

> I fail to see why:
>
> /* This is a safe point ... */
> if (needs to be suspended) {
> save_all_registers_on_the_stack()
> flag_gc_thread()
> wait_until_gc_thread_has_what_it_needs()
> }
>
> Needs kernel support.

given that the existing code uses self-modifying-code for the safe-points
i'm guessing there are so many safe-points that the above if statement
would be excessive overhead (and the save/flag/wait stuff would probably
cause a huge amount of code bloat -- but could probably be a subroutine).

there was some really interesting GC work i heard about years ago where
the compiler generated GC code along-side the normal executable code.
the GC code understood the structure of the function and could make much
better choices of GC targets than a generic routine could. when GC needs
to occur, a walk up the stack in each thread executing the
routine-specific GC stubs would be performed. (given just the stack
frames you can index into a lookup table for the GC stubs... so there's no
overhead when GC isn't occuring.) i don't have a reference handy though.

anyhow, this is probably getting off-topic :)

-dean

2002-09-23 00:09:29

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Sun, Sep 22, 2002 at 12:41:40PM -0600, Eric W. Biederman wrote:
> They are talking about an incremental GC routine so it does not need to stop
> all threads simultaneously. Threads only need to be stopped when the GC is gather
> a root set. This is what the safe points are for right? And it does
> not need to be 100% accurate in finding all of the garbage. The
> collector just needs to not make mistakes in the other direction.

There's a mixture of GC algorithms in HotSpot including generational and I
believe a traditional mark/sweep. GC isn't my expertise per se.

Think, you have a compiled code block and you suspend/interrupt threads when
you either start hitting the stack yellow guard or by a periodic GC thread...

That can happen anytime, so you can't just expect things to drop onto a
regular boundary in the compiled code block. It's for that reason that
you have to some kind of OS level threading support to get the ucontext.

bill

2002-09-23 10:07:43

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Sun, 22 Sep 2002, Larry McVoy wrote:

> On Sun, Sep 22, 2002 at 08:55:39PM +0200, Peter Waechtler wrote:
> > AIX and Irix deploy M:N - I guess for a good reason: it's more
> > flexible and combine both approaches with easy runtime tuning if
> > the app happens to run on SMP (the uncommon case).
>
> No, AIX and IRIX do it that way because their processes are so bloated
> that it would be unthinkable to do a 1:1 model.

And BSD? And Solaris?

> Instead of taking the traditional "we've screwed up the normal system
> primitives so we'll event new lightweight ones" try this:
>
> We depend on the system primitives to not be broken or slow.
>
> If that's a true statement, and in Linux it tends to be far more true
> than other operating systems, then there is no reason to have M:N.

No matter how fast you do context switch in and out of kernel and a sched
to see what runs next, it can't be done as fast as it can be avoided.
Being N:M doesn't mean all implementations must be faster, just that doing
it all in user mode CAN be faster.

Benchmarks are nice, I await results from a loaded production threaded
DNS/mail/web/news/database server. Well, I guess production and 2.5 don't
really go together, do they, but maybe some experimental site which could
use 2.5 long enough to get numbers. If you could get a threaded database
to run, that would be a good test of shared resources rather than a bunch
of independent activities doing i/o.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-23 11:50:02

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Am Montag den, 23. September 2002, um 12:05, schrieb Bill Davidsen:

> On Sun, 22 Sep 2002, Larry McVoy wrote:
>
>> On Sun, Sep 22, 2002 at 08:55:39PM +0200, Peter Waechtler wrote:
>>> AIX and Irix deploy M:N - I guess for a good reason: it's more
>>> flexible and combine both approaches with easy runtime tuning if
>>> the app happens to run on SMP (the uncommon case).
>>
>> No, AIX and IRIX do it that way because their processes are so bloated
>> that it would be unthinkable to do a 1:1 model.
>
> And BSD? And Solaris?

Don't know. I don't have access to all those Unices. I could try FreeBSD.

According to http://www.kegel.com/c10k.html Sun is moving to 1:1
and FreeBSD still believes in M:N

MacOSX 10.1 does not support PROCESS_SHARED locks, tried that 5 minutes
ago.

2002-09-23 18:42:59

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> > Instead of taking the traditional "we've screwed up the normal system
> > primitives so we'll event new lightweight ones" try this:
> >
> > We depend on the system primitives to not be broken or slow.
> >
> > If that's a true statement, and in Linux it tends to be far more true
> > than other operating systems, then there is no reason to have M:N.
>
> No matter how fast you do context switch in and out of kernel and a sched
> to see what runs next, it can't be done as fast as it can be avoided.

You are arguing about how many angels can dance on the head of a pin.
Sure, there are lotso benchmarks which show how fast user level threads
can context switch amongst each other and it is always faster than going
into the kernel. So what? What do you think causes a context switch in
a threaded program? What? Could it be blocking on I/O? Like 99.999%
of the time? And doesn't that mean you already went into the kernel to
see if the I/O was ready? And doesn't that mean that in all the real
world applications they are already doing all the work you are arguing
to avoid?
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-09-23 19:38:56

by Olivier Galibert

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 08:30:04AM -0700, Larry McVoy wrote:
> What do you think causes a context switch in
> a threaded program? What? Could it be blocking on I/O? Like 99.999%
> of the time? And doesn't that mean you already went into the kernel to
> see if the I/O was ready? And doesn't that mean that in all the real
> world applications they are already doing all the work you are arguing
> to avoid?

I suspect a fair number of cases is preemption too, when you fire up
computation threads in the background. Of course, the preemption
event always goes through the kernel at some point, even if it's only
a SIGALARM.

Actually, in normal programs (even java ones), _when_ is a thread
voluntarily giving up control? Locks?

OG.

2002-09-23 19:51:16

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, 23 Sep 2002, Larry McVoy wrote:

> > No matter how fast you do context switch in and out of kernel and a sched
> > to see what runs next, it can't be done as fast as it can be avoided.
>
> You are arguing about how many angels can dance on the head of a pin.

Than you have sadly misunderstood the discussion.

> Sure, there are lotso benchmarks which show how fast user level threads
> can context switch amongst each other and it is always faster than going
> into the kernel. So what? What do you think causes a context switch in
> a threaded program? What? Could it be blocking on I/O? Like 99.999%
> of the time? And doesn't that mean you already went into the kernel to
> see if the I/O was ready? And doesn't that mean that in all the real
> world applications they are already doing all the work you are arguing
> to avoid?

Actually you have it just backward. Let me try to explain how this works.
The programs which benefit from N:M are exactly those which don't behave
the way you describe. Think of programs using locking to access shared
memory, or other fast resources which don't require a visit to the kernel.
It would seem that the switch could be done much faster without the
transition into and out of the kernel.

Looking for data before forming an opinion has always seemed to be
reasonable, and the way design decisions are usually made in Linux, based
on the performance of actual code. The benchmark numbers reports are
encouraging, but actual production loads may not show the huge improvement
seen in the benchmarks. And I don't think anyone is implying that they
will.

Given how small the overhead of threading is on a typical i/o bound
application such as you mentioned, I'm not sure the improvement will be
above the noise. The major improvement from NGPT is not performance in
many cases, but elimination of unexpected application behaviour.

When someone responds to a technical question with an attack on the
question instead of a technical response I always wonder why. In this case
other people have provided technical feedback and I'm sure we will see
some actual application numbers in a short time. I have an IPC benchmark
I'd like to try if I could get any of my test servers to boot a recent
kernel :-(

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-23 19:54:10

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Am Montag den, 23. September 2002, um 17:30, schrieb Larry McVoy:

>>> Instead of taking the traditional "we've screwed up the normal system
>>> primitives so we'll event new lightweight ones" try this:
>>>
>>> We depend on the system primitives to not be broken or slow.
>>>
>>> If that's a true statement, and in Linux it tends to be far more true
>>> than other operating systems, then there is no reason to have M:N.
>>
>> No matter how fast you do context switch in and out of kernel and a
>> sched
>> to see what runs next, it can't be done as fast as it can be avoided.
>
> You are arguing about how many angels can dance on the head of a pin.
> Sure, there are lotso benchmarks which show how fast user level threads
> can context switch amongst each other and it is always faster than going
> into the kernel. So what? What do you think causes a context switch in
> a threaded program? What? Could it be blocking on I/O? Like 99.999%
> of the time? And doesn't that mean you already went into the kernel to
> see if the I/O was ready? And doesn't that mean that in all the real
> world applications they are already doing all the work you are arguing
> to avoid?

Getting into kernel is not the same as a context switch.
Return EAGAIN or EWOULDBLOCK is definetly _not_ causing a context switch.

Is sys_getpid() causing a context switch? Unlikely
Do you know what blocking IO means? M:N is about to avoid blocking IO!

2002-09-23 19:18:19

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, 23 Sep 2002, Peter Waechtler wrote:

> Am Montag den, 23. September 2002, um 12:05, schrieb Bill Davidsen:
>
> > On Sun, 22 Sep 2002, Larry McVoy wrote:
> >
> >> On Sun, Sep 22, 2002 at 08:55:39PM +0200, Peter Waechtler wrote:
> >>> AIX and Irix deploy M:N - I guess for a good reason: it's more
> >>> flexible and combine both approaches with easy runtime tuning if
> >>> the app happens to run on SMP (the uncommon case).
> >>
> >> No, AIX and IRIX do it that way because their processes are so bloated
> >> that it would be unthinkable to do a 1:1 model.
> >
> > And BSD? And Solaris?
>
> Don't know. I don't have access to all those Unices. I could try FreeBSD.

At your convenience.

> According to http://www.kegel.com/c10k.html Sun is moving to 1:1
> and FreeBSD still believes in M:N

Sun is total news to me, "moving to" may be in Solaris 9, Sol8 seems to
still be N:M. BSD is as I thought.
>
> MacOSX 10.1 does not support PROCESS_SHARED locks, tried that 5 minutes
> ago.

Thank you for the effort. Hum, that's a bit of a surprise, at least to me.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-09-23 20:18:47

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, 23 Sep 2002, Bill Davidsen wrote:

> The programs which benefit from N:M are exactly those which don't behave
> the way you describe. [...]

90% of the programs that matter behave exactly like Larry has described.
IO is the main source of blocking. Go and profile a busy webserver or
mailserver or database server yourself if you dont believe it.

> [...] Think of programs using locking to access shared memory, or other
> fast resources which don't require a visit to the kernel. [...]

oh - actually, such things are quite rare it turns out. And even if it
happens, the 1:1 model is handling this perfectly fine via futexes, as
long as the contention of the shared resource is light. Which it better be
...

any application with heavy contention over some global shared resource is
serializing itself already and has much bigger problems than that of the
threading model ... Its performance will be bad both under M:N and 1:1
models - think about it.

so a threading abstraction must concentrate on what really matters:
performing actual useful tasks - most of those tasks involve the use of
some resource, block IO, network IO, user IO - each of them involve entry
into the kernel - at which point the 1:1 design fits much better.

(and all your followup arguments are void due to this basic
misunderstanding.)

Ingo

2002-09-23 20:58:38

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Sun, Sep 22, 2002 at 08:55:39PM +0200, Peter Waechtler wrote:
> AIX and Irix deploy M:N - I guess for a good reason: it's more
> flexible and combine both approaches with easy runtime tuning if
> the app happens to run on SMP (the uncommon case).

Also, for process scoped scheduling in a way so that system wide threads
don't have an impact on a process slice. Folks have piped up about that
being important.

bill

2002-09-23 20:23:06

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, 23 Sep 2002, Peter Waechtler wrote:

> Getting into kernel is not the same as a context switch. Return EAGAIN
> or EWOULDBLOCK is definetly _not_ causing a context switch.

this is a common misunderstanding. When switching from thread to thread in
the 1:1 model, most of the cost comes from entering/exiting the kernel. So
*once* we are in the kernel the cheapest way is not to piggyback to
userspace to do some userspace context-switch - but to do it right in the
kernel.

in the kernel we can do much higher quality scheduling decisions than in
userspace. SMP affinity, various statistics are right available in
kernel-space - userspace does not have any of that. Not to talk about
preemption.

Ingo

2002-09-23 21:03:07

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Ingo Molnar schrieb:
>
> On Mon, 23 Sep 2002, Peter Waechtler wrote:
>
> > Getting into kernel is not the same as a context switch. Return EAGAIN
> > or EWOULDBLOCK is definetly _not_ causing a context switch.
>
> this is a common misunderstanding. When switching from thread to thread in
> the 1:1 model, most of the cost comes from entering/exiting the kernel. So
> *once* we are in the kernel the cheapest way is not to piggyback to
> userspace to do some userspace context-switch - but to do it right in the
> kernel.
>
> in the kernel we can do much higher quality scheduling decisions than in
> userspace. SMP affinity, various statistics are right available in
> kernel-space - userspace does not have any of that. Not to talk about
> preemption.
>

I'm already almost convinced on the NPT way of doing threading.
But still: the timeslice is per process (and kernel thread).
You still have other processes running.
With 1:1 on "hitting" a blocking condition the kernel will
switch to a different beast (yes, a thread gets a bonus for
using the same MM and the same cpu).
But on M:N the "user process" makes some more progress in its
timeslice (does it get even punished for eating up its
timeslice?) I would think that it tends to cause less context
switches but tends to do more syscalls :-(

I already had a closer look at NGPT before reading Ulrich's
comments on the phil-list and on his website. I already thought
"puh, that's a complicated beast", and as I saw the
fcntl(GETFL);fcntl(O_NONBLOCK);write();fcntl(oldflags); thingy..

Well, with an O(1) scheduler, faster thread creation and exit
NPT has good chances to perform faster.

Now I'm just curious about the argument about context switch
times. Is Linux really that much faster than Solaris, Irix etc.?

Do you have numbers (or a hint) on comparable (ideal: identical)
hardware? Is LMbench a good starting point?

2002-09-23 21:27:21

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 08:30:04AM -0700, Larry McVoy wrote:
> > No matter how fast you do context switch in and out of kernel and a sched
> > to see what runs next, it can't be done as fast as it can be avoided.
>
> You are arguing about how many angels can dance on the head of a pin.
> Sure, there are lotso benchmarks which show how fast user level threads
> can context switch amongst each other and it is always faster than going
> into the kernel. So what? What do you think causes a context switch in
> a threaded program? What? Could it be blocking on I/O? Like 99.999%

That's just for traditional Unix applications, which is only one category.
You exclude CPU intensive applications in that criticism, media related
and otherwise. What about cases where you need to balance a large data
structure across large number of threads or something like that ?

> of the time? And doesn't that mean you already went into the kernel to
> see if the I/O was ready? And doesn't that mean that in all the real
> world applications they are already doing all the work you are arguing
> to avoid?

IO isn't the only thing that's event driven. What about event driven
systems that depend on a fast condition-variable ? That's very cheap in
a UTS (userspace thread system), 2 context switches, a call to thread-kernel
to dequeue a waiter and releasing/aquiring some very light weight userspace
locks. And difficult to beat if you think about it.

So that level of confidence in 1:1 is a intuitively presumptuous for those
reasons.

But if you're architecture is broken or exotic...then it gets more complicated ;)

bill

2002-09-23 21:17:34

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 06:05:18AM -0400, Bill Davidsen wrote:
> And BSD? And Solaris?

Me and buddy of mine ran lmbench on NetBSD 1.6 (ppc) and a recent version
of Linux (same machine) and found that NetBSD was about 2x slower than
Linux for context switching. It's really not that bad considering that it
was worse at one point. It might effect things like inter-process pipe
communication performance, but it's not outside of reasonablility to use
a 1:1 system in that case.

BTW, NetBSD is moving to a scheduler activations threading system and
they have some preliminary stuff in the works and working. ;)

> > If that's a true statement, and in Linux it tends to be far more true
> > than other operating systems, then there is no reason to have M:N.
>
> No matter how fast you do context switch in and out of kernel and a sched
> to see what runs next, it can't be done as fast as it can be avoided.
> Being N:M doesn't mean all implementations must be faster, just that doing
> it all in user mode CAN be faster.

Unless you have a broken architecture like the x86. The FPU in that case
can be problematic and folks where playing around with adding a syscall
to query the status of the FPU. They things might be more even, but...
this is also unclear as to how these variables are going to play out.

> Benchmarks are nice, I await results from a loaded production threaded
> DNS/mail/web/news/database server. Well, I guess production and 2.5 don't
> really go together, do they, but maybe some experimental site which could
> use 2.5 long enough to get numbers. If you could get a threaded database
> to run, that would be a good test of shared resources rather than a bunch
> of independent activities doing i/o.

I think that would be interesting too.

bill

2002-09-23 21:10:22

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Sun, Sep 22, 2002 at 09:38:52AM -0400, Bill Davidsen wrote:
> Could you comment on how whell this works (or not) with linuxthreads,
> Solaris, and NGPT? I realize you probably haven't had time to look at NPTL
> yet. If an N:M model is really better for your application you might be
> able to just run NGPT.

I can't. I'm in a different OS community, FreeBSD, and I deal with issues
related to threading systems there. There's many variables that could be
at play for various performance categories.

> Since preempt threads seem a problem, cound a dedicated machine run w/o
> preempt? I assume when you say "high load" that you would be talking a
> server, where performance is critical.

The JVM itself can has a habit of really stretching the amount of resources
available in many areas and fringe logic in commonly used systems. I can't
really say what the problems are until the Blackdown folks start integrating
the new threading model and then start testing it.

However, there is a mutex fast path in the code itself that can be optionally
used in place of the the OS back version. They felt it was significant to do
the work for that for some reason, so I'm just going to assume that this is
important until otherwise noted.

bill

2002-09-23 21:38:06

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, 23 Sep 2002, Larry McVoy wrote:

> What do you think causes a context switch in
> a threaded program? What? Could it be blocking on I/O?

unfortunately java was originally designed with a thread-per-connection
model as the *only* method of implementing servers. there wasn't a
non-blocking network API ... and i hear that such an API is in the works,
but i've no idea where it is yet.

so while this is I/O, it's certainly less efficient to have thousands of
tasks blocked in read(2) versus having thousands of entries in <pick your
favourite poll/select/etc. mechanism>.

this is a java problem though... i posted a jvm straw-man proposal years
ago when IBM posted some "linux threading isn't efficient" paper. since
java threads are way less painful to implement than pthreads, i suggested
the jvm do the M part of M:N.

-dean

2002-09-23 22:05:02

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

2002-09-23 22:32:55

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 03:48:58PM -0400, Bill Davidsen wrote:
> On Mon, 23 Sep 2002, Larry McVoy wrote:
> > Sure, there are lotso benchmarks which show how fast user level threads
> > can context switch amongst each other and it is always faster than going
> > into the kernel. So what? What do you think causes a context switch in
> > a threaded program? What? Could it be blocking on I/O? Like 99.999%
> > of the time? And doesn't that mean you already went into the kernel to
> > see if the I/O was ready? And doesn't that mean that in all the real
> > world applications they are already doing all the work you are arguing
> > to avoid?
> Actually you have it just backward. Let me try to explain how this works.
> The programs which benefit from N:M are exactly those which don't behave
> the way you describe. Think of programs using locking to access shared
> memory, or other fast resources which don't require a visit to the kernel.
> It would seem that the switch could be done much faster without the
> transition into and out of the kernel.

For operating systems that require cross-process locks to always be
kernel ops, yes. For operating systems that provide _any_ way for most
cross-process locks to be performed completely in user space (i.e. FUTEX),
the entire argument very quickly disappears.

Is there a situation you can think of that requires M:N threading
because accessing user space is cheaper than accessing kernel space?
What this really means is that the design of the kernel space
primitives is not optimal, and that a potentially better solution that
would benefit more people by a great amount, would be to redesign the
kernel primitives. (i.e. FUTEX)

> Looking for data before forming an opinion has always seemed to be
> reasonable, and the way design decisions are usually made in Linux, based
> on the performance of actual code. The benchmark numbers reports are
> encouraging, but actual production loads may not show the huge improvement
> seen in the benchmarks. And I don't think anyone is implying that they
> will.

You say that people should look to data before forming an opinion, but
you also say that benchmarks mean little and you *suspect* real loads may
be different. It seems to me that you might take your own advice, and
use 'real data' before reaching your own conclusion.

> Given how small the overhead of threading is on a typical i/o bound
> application such as you mentioned, I'm not sure the improvement will be
> above the noise. The major improvement from NGPT is not performance in
> many cases, but elimination of unexpected application behaviour.

Many people would argue that threading overhead has been traditional quite
high. They would have 'real data' to substantiate their claims.

> When someone responds to a technical question with an attack on the
> question instead of a technical response I always wonder why. In this case
> other people have provided technical feedback and I'm sure we will see
> some actual application numbers in a short time. I have an IPC benchmark
> I'd like to try if I could get any of my test servers to boot a recent
> kernel :-(

I've always considered 1:1 to be an optimal model, but an unreachable
model, like cold fusion. :-)

If the kernel can manage the tasks such that they can be very quickly
switched betweens queues, and the run queue can be minimized to
contain only tasks that need to run, or that have a very high
probability of needing to run, and if operations such as locks can be
done, at least in the common case, completely in user space, there
is no reason why 1:1 could not be better than M:N.

There _are_ reasons why OS threads could be better than user space
threads, and the reasons all relate to threads that do actual work.

The line between 1:1 and M:N is artificially bold. M:N is a necessity
where 1:1 is inefficient. Where 1:1 is efficient, M:N ceases to be a
necessity.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-23 22:41:44

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 11:08:53PM +0200, Peter W?chtler wrote:
> With 1:1 on "hitting" a blocking condition the kernel will
> switch to a different beast (yes, a thread gets a bonus for
> using the same MM and the same cpu).

> But on M:N the "user process" makes some more progress in its
> timeslice (does it get even punished for eating up its
> timeslice?) I would think that it tends to cause less context
> switches but tends to do more syscalls :-(

Think of it this way... two threads are blocked on different resources...
The currently executing thread reaches a point where it blocks.

OS threads:
1) thread#1 invokes a system call
2) OS switches tasks to thread#2 and returns from blocking

user-space threads:
1) thread#1 invokes a system call
2) thread#1 returns from system call, EWOULDBLOCK
3) thread#1 invokes poll(), select(), ioctl() to determine state
4) thread#1 returns from system call
5) thread#1 switches stack pointer to be thread#2 upon determination
that the resource thread#2 was waiting on is ready.

Certainly the above descriptions are not fully accurate, or complete,
and it is possible that the M:N threading would make a fair compromise
between OS thread sand user-space threads, however, if user-space threads
requires all this extra work, and M:N threads requires some extra work,
some less work, and extra book keeping and system calls, why couldn't
OS threads by themselves be more efficient?

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-23 22:53:55

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 02:41:33PM -0700, dean gaudet wrote:
> so while this is I/O, it's certainly less efficient to have thousands of
> tasks blocked in read(2) versus having thousands of entries in <pick your
> favourite poll/select/etc. mechanism>.

In terms of kernel memory, perhaps. In terms of 'efficiency', I
wouldn't be so sure. Java uses a wack of user space storage to
represent threads regardless of whether they are green or native. The
only difference is - is Java calling poll()/select()/ioctl()
routinely? Or are the tasks sitting in an efficient kernel task queue?

Which has a better chance of being more efficient, in terms of dispatching,
(especially considering that most of the time, most java threads are idle),
and which has a better chance of being more efficient in terms of the
overhead of querying whether a task is ready to run? I lean towards the OS
on both counts.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-23 22:56:25

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 06:44:23PM -0400, Mark Mielke wrote:
> Think of it this way... two threads are blocked on different resources...
> The currently executing thread reaches a point where it blocks.
>
> OS threads:
> 1) thread#1 invokes a system call
> 2) OS switches tasks to thread#2 and returns from blocking
>
> user-space threads:
> 1) thread#1 invokes a system call
> 2) thread#1 returns from system call, EWOULDBLOCK

> 3) thread#1 invokes poll(), select(), ioctl() to determine state
> 4) thread#1 returns from system call

More like the UTS blocks the thread and waits for an IO upcall to notify
the change of state in the kernel. It's equivalent to a single in overhead,
something like a SIGIO, or async IO notification.

Delete 3 and 4. It's certainly much faster than select() and family.

> 5) thread#1 switches stack pointer to be thread#2 upon determination
> that the resource thread#2 was waiting on is ready.

Right, then marks it running and runs it.

> Certainly the above descriptions are not fully accurate, or complete,
> and it is possible that the M:N threading would make a fair compromise
> between OS thread sand user-space threads, however, if user-space threads
> requires all this extra work, and M:N threads requires some extra work,
> some less work, and extra book keeping and system calls, why couldn't
> OS threads by themselves be more efficient?

Crazy synchronization by non-web-server like applications. Who knows. I
personally can't think up really clear example at this time since I don't
do that kind of programming, but I'm sure concurrency experts can...

I'm just not one of those people.

bill

2002-09-23 23:08:48

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 04:01:22PM -0700, Bill Huey wrote:
> On Mon, Sep 23, 2002 at 06:44:23PM -0400, Mark Mielke wrote:
> > Certainly the above descriptions are not fully accurate, or complete,
> > and it is possible that the M:N threading would make a fair compromise
> > between OS thread sand user-space threads, however, if user-space threads
> > requires all this extra work, and M:N threads requires some extra work,
> > some less work, and extra book keeping and system calls, why couldn't
> > OS threads by themselves be more efficient?
> Crazy synchronization by non-web-server like applications. Who knows. I
> personally can't think up really clear example at this time since I don't
> do that kind of programming, but I'm sure concurrency experts can...
> I'm just not one of those people.

I do not find it to be profitable to discourage the people working on
this project. If they fail, nobody loses. If they succeed, they can
re-invent the math behind threading, and Linux ends up on the forefront
of operating systems offering the technology.

As for 'crazy synchronization', solutions such as the FUTEX have no
real negative aspects. It wasn't long ago that the FUTEX did not
exist. Why couldn't innovation make 'crazy synchronization by
non-web-server like applications' more efficient using kernel threads?

Concurrency experts would welcome the change. Concurrent 'experts'
would not welcome the change, as it would force them to have to
re-learn everything they know, effectively obsoleting their 'expert'
status. (note the difference between the unquoted, and the quoted...)

Cheers, and good luck...
mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-23 23:52:46

by Andy Isaacson

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

I hate big CC lists like this, but I don't know that everyone will see
this if I don't keep the CC list. Sigh.

On Mon, Sep 23, 2002 at 10:36:28PM +0200, Ingo Molnar wrote:
> On Mon, 23 Sep 2002, Peter Waechtler wrote:
> > Getting into kernel is not the same as a context switch. Return EAGAIN
> > or EWOULDBLOCK is definetly _not_ causing a context switch.
>
> this is a common misunderstanding. When switching from thread to thread in
> the 1:1 model, most of the cost comes from entering/exiting the kernel. So
> *once* we are in the kernel the cheapest way is not to piggyback to
> userspace to do some userspace context-switch - but to do it right in the
> kernel.
>
> in the kernel we can do much higher quality scheduling decisions than in
> userspace. SMP affinity, various statistics are right available in
> kernel-space - userspace does not have any of that. Not to talk about
> preemption.

Excellent points, Ingo. An alternative that I haven't seen considered
is the M:N threading model that NetBSD is adopting, called Scheduler
Activations. The paper makes excellent reading.

http://web.mit.edu/nathanw/www/usenix/freenix-sa/freenix-sa.html

One advantage of a SA-style system is that the kernel automatically and
very cleanly has a lot of information about the job as a single unit,
for purposes such as signal delivery, scheduling decisions, (and if it
came to that) paging/swapping. The original Linus-dogma (as I
understood it -- I may well be misrepresenting things here) is that "a
thread is a process, and that's all there is to it". This has a lovely
clarity, but it ignores the fact that there are times when it's
*important* that the kernel know that "these N threads belong to a
single job". It appears that the NPTL work is creating a new
"collection-of-threads" object, which will fulfill the role I mention
above... and this isn't a lot different from the end result of Nathan
Williams' SA work.

Another advantage of keeping a "process" concept is that things like CSA
(Compatible System Accounting, nee Cray System Accounting) need to add
some overhead to process startup/teardown. If a "thread" can be created
without creating a new "process", this overhead is not needlessly
present at thread-startup time.

-andy

2002-09-23 23:57:55

by Andy Isaacson

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 10:32:00PM +0200, Ingo Molnar wrote:
> On Mon, 23 Sep 2002, Bill Davidsen wrote:
> > The programs which benefit from N:M are exactly those which don't behave
> > the way you describe. [...]
>
> 90% of the programs that matter behave exactly like Larry has described.
> IO is the main source of blocking. Go and profile a busy webserver or
> mailserver or database server yourself if you dont believe it.

There are heavily-threaded programs out there that do not behave this
way, and for which a N:M thread model is completely appropriate. For
example, simulation codes in operations research are most naturally
implemented as one thread per object being simulated, with virtually no
IO outside the simulation. The vast majority of the computation time in
such a simulation is spent doing small amounts of work local to the
thread, then sending small messages to another thread via a FIFO, then
going to sleep waiting for more work.

Of course this can be (and frequently is) implemented such that there is
not one Pthreads thread per object; given simulation environments with 1
million objects, and the current crappy state of Pthreads
implementations, the researchers have no choice.

-andy

2002-09-24 00:05:48

by Jeff Garzik

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Andy Isaacson wrote:
> Of course this can be (and frequently is) implemented such that there is
> not one Pthreads thread per object; given simulation environments with 1
> million objects, and the current crappy state of Pthreads
> implementations, the researchers have no choice.

Are these object threads mostly active or inactive?

Regardless, it seems obvious with today's hardware, that 1 million
objects should never be one-thread-per-object, pthreads or no. That's
just lazy programming.

Jeff

2002-09-24 00:09:28

by Andy Isaacson

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 08:10:24PM -0400, Jeff Garzik wrote:
> Andy Isaacson wrote:
> > Of course this can be (and frequently is) implemented such that there is
> > not one Pthreads thread per object; given simulation environments with 1
> > million objects, and the current crappy state of Pthreads
> > implementations, the researchers have no choice.
>
> Are these object threads mostly active or inactive?

Mostly inactive (waiting on a semaphore or FIFO).

> Regardless, it seems obvious with today's hardware, that 1 million
> objects should never be one-thread-per-object, pthreads or no. That's
> just lazy programming.

You can call it lazy if you want, but I call it natural.

(Of course I realize that practical considerations prevent users from
creating 1 million kernel threads, or even user threads, today.
Unfortunate, that.)

-andy

2002-09-24 00:16:41

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 07:11:32PM -0400, Mark Mielke wrote:
> I do not find it to be profitable to discourage the people working on
> this project. If they fail, nobody loses. If they succeed, they can
> re-invent the math behind threading, and Linux ends up on the forefront
> of operating systems offering the technology.

Math, unlikely. Performance issues, maybe. Overall kernel technology,
highly unlikely and bordering on preposterous claim.

This is forum, like anything else, is to propose a new infrastructure
for something that's very important to the function of this operating
system. For this project to succeed, it must address possible problems
that various folks bring up in examining what's been proposed or built.
That's the role of these discussions.

> As for 'crazy synchronization', solutions such as the FUTEX have no
> real negative aspects. It wasn't long ago that the FUTEX did not
> exist. Why couldn't innovation make 'crazy synchronization by
> non-web-server like applications' more efficient using kernel threads?

To be blunt, I don't believe it. It's out of a technical point of view
from my bias to a FreeBSD's scheduler activation threading and because
people are too easily dismissing M:N performance issues while reaching
conclusions about it that seem to be presumptuous.

The incorrect example where you outline what you think is a M:N call
conversion is (traditional async wrappers instead of upcalls), is something
that don't want to be a future technical strawman that folks create in
this community to attack M:N threading. It may very well still have
legitimacy in the same way that part of the performance of the JVM depends
on accessibilty to a thread's ucontext and run state, which seem to be
initial oversight (unknown reason) when this was originally conceived.

Those are kind of things are what I'm most worried about that eventually
hurt what application folks are on building on top of Linux and its
kernel facilities.

> Concurrency experts would welcome the change. Concurrent 'experts'
> would not welcome the change, as it would force them to have to
> re-learn everything they know, effectively obsoleting their 'expert'
> status. (note the difference between the unquoted, and the quoted...)

Well, what I mean by concurrency experts is there can be specialized
applications where people much become experts in concurrency to solve
difficult problem that might be know to this group at this time.
Dimissing that in the above paragraph doesn't negate that need.

The bottom line here is that ultimately the kernel is providing useable
primitive/terms for applications programmers. It's not the scenario
where kernel folks just build something that's conceptually awkward and
then it's up to applications people to work around bogus design problems
that result from that. So what I meant by folks that have applications
that might push the limits of what the current synchronization model
offers.

That's the core of my rant and it took quite a while to write up. ;)

bill

2002-09-24 00:43:27

by Rusty Russell

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Fri, 20 Sep 2002 18:42:48 +0200 (CEST)
Ingo Molnar <[email protected]> wrote:
>
> On Fri, 20 Sep 2002 [email protected] wrote:
> > Is this related to the thread library work that IBM was doing or was
> > this independently developed?
>
> independently developed.

And, ironically, using the futex implementation developed on IBM time 8).

Of course, the time I spent on futexes would have been completely wasted
without the 95% done by Ingo and Uli to reach normal user programs and
address the other scalability problems.

Thanks guys! IOU each a beer or local equiv...
Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2002-09-24 02:47:09

by Peter Chubb

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

>>>>> "Mark" == Mark Mielke <[email protected]> writes:

Mark> On Mon, Sep 23, 2002 at 11:08:53PM +0200, Peter W?chtler wrote:
Mark> Think of it this way... two threads are blocked on different
Mark> resources... The currently executing thread reaches a point
Mark> where it blocks.

Mark> OS threads: 1) thread#1 invokes a system call 2) OS switches
Mark> tasks to thread#2 and returns from blocking

Mark> user-space threads: 1) thread#1 invokes a system call 2)
Mark> thread#1 returns from system call, EWOULDBLOCK 3) thread#1
Mark> invokes poll(), select(), ioctl() to determine state 4) thread#1
Mark> returns from system call 5) thread#1 switches stack pointer to
Mark> be thread#2 upon determination that the resource thread#2 was
Mark> waiting on is ready.

No way! THe Solaris M:N model notices when all threads belonging to a
process have blocked, and wakes up the master thread, which can then
create a new kernel thread if there are any user-mode threads that can
do work.

PeterC

2002-09-24 03:17:33

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, Sep 23, 2002 at 05:21:35PM -0700, Bill Huey wrote:
> ...
> The incorrect example where you outline what you think is a M:N call
> conversion is (traditional async wrappers instead of upcalls), is something
> that don't want to be a future technical strawman that folks create in
> this community to attack M:N threading. It may very well still have
> legitimacy in the same way that part of the performance of the JVM depends
> on accessibilty to a thread's ucontext and run state, which seem to be
> initial oversight (unknown reason) when this was originally conceived.
> Those are kind of things are what I'm most worried about that eventually
> hurt what application folks are on building on top of Linux and its
> kernel facilities.
> ...
> That's the core of my rant and it took quite a while to write up. ;)

My part in the rant (really somebody else's rant...) is that if kernel
threads can be made to out-perform current implementations of M:N
threading, then all that has really been proven is that current M:N
practices are not fully optimal. 1:1 in an N:N system is just one face
of M:N in an N:N system. A fully functional M:N system _may choose_ to
allow M to equal N.

Worst possibly cases that I expect to see from people experimenting
with this stuff and having a 1:1 system that out-performs commonly
available M:N systems: 1) The M:N people innovate, potentially using
the new technology made available from the 1:1 people, making a
_better_ M:N system 2) The 1:1 system is better, and people use it.

As long as they all use a POSIX, or other standard interface, there
isn't a problem.

If the changes to the kernel made by the 1:1 people are bad, they will
be stopped by Linus and many other people, probably including
yourself... :-)

In any case, I see the 1:1 vs. M:N as a distraction from the *actual*
enhancements being designed, which seem to be, support for cheaper
kernel threads, something that benefits both parties.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-24 03:34:34

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tue, Sep 24, 2002 at 12:48:40PM +1000, Peter Chubb wrote:
> >>>>> "Mark" == Mark Mielke <[email protected]> writes:
> Mark> OS threads: 1) thread#1 invokes a system call 2) OS switches
> Mark> tasks to thread#2 and returns from blocking
> Mark> user-space threads: 1) thread#1 invokes a system call 2)
> Mark> thread#1 returns from system call, EWOULDBLOCK 3) thread#1
> Mark> invokes poll(), select(), ioctl() to determine state 4) thread#1
> Mark> returns from system call 5) thread#1 switches stack pointer to
> Mark> be thread#2 upon determination that the resource thread#2 was
> Mark> waiting on is ready.
> No way! THe Solaris M:N model notices when all threads belonging to a
> process have blocked, and wakes up the master thread, which can then
> create a new kernel thread if there are any user-mode threads that can
> do work.

As I said, far from accurate, and M:N is really a compromise between
the two approaches, it is not an extreme.

M:N really doesn't mean much at all except that it makes no guarantees
that each thread requires an OS thread, or that only one thread will
be active at any given time.

M:N makes no promises to be faster, although, as with any innovation
designed by engineers who take pride in their work, it is a method of
achieving a goal... that is, getting around costly system invocations
by optimizing system invocations in user space. Is it better? Maybe?
Has it traditionally allowed standard applications that make use of
threads to perform better than if the application used only kernel
threads? Yes.

Can the rules be broken? I have not seen a single reason to show why
they cannot be broken. M:N is a necessity for kernels that have heavy
weight thread synchronization primitives, or heavy weight context
switching. Are the rules the same with thread synchronization primitives
that have the similar weight whether 1:1 or M:N? (i.e. FUTEX) Are the
rules the same if context switching in kernel space can be made cheaper,
if the scheduling issues can be addressed, or if M:N must rely on just as
many kernel invocations?

Is M:N really cheaper in your Solaris example, where a new thread is
created by the master thread on demand? If threads were sufficiently
light-weight, I do not see how you could consider a master thread
sitting on SIGIO/select()/poll()/ioctl() switching to a thread in the
pool could be cheaper than the kernel pulling a stopped thread into
the run queue.

This is one of those things where the 'proof is in the pudding'. It is
difficult to theorize anything as almost all theory on this subject is
based on comparing performance under a different set of rules. M:N was
necessary before as 1:1 was not feasible. Now that 1:1 may be reaching
the state of being feasible, the rules change, and previous attempts
at analyzing the data mean very little. Previous conclusions mean very
little.

I am one who wants to see what happens. Worst case, M:N
implementations can use the same enhancements that were designed for
1:1, and benefit. The most obviously example, that needs to be
mentioned once again, is FUTEX. By each application, it might benefit
1:1, or M:N, but as a kernel feature, it benefits anybody who can
invent a use for it.

Fast thread switching? This provides a benefit for M:N. In fact, I would
suspect that the people comparing the best 1:1 implementation with the best
M:N implementation will find that once all the patches are applied, the
race will be closer than most people thought, but that BOTH will perform
better on 2.5.x than either ever did on 2.4.x and earlier.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-24 05:33:35

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tue, 24 Sep 2002, Rusty Russell wrote:

> > > Is this related to the thread library work that IBM was doing or was
> > > this independently developed?
> >
> > independently developed.
>
> And, ironically, using the futex implementation developed on IBM time 8).

you are right, futexes are really important for all the userspace locking
primitives and thread-joining. And like basically all core kernel code,
futexes were a collaborative effort as well:

* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.

there are so many prerequisites to this that it's impossible to list them
all. What i meant above were the specific patches developed for recent 2.5
kernels, and the library itself.

Ingo

2002-09-24 05:39:52

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, 23 Sep 2002, Andy Isaacson wrote:

> > 90% of the programs that matter behave exactly like Larry has described.
> > IO is the main source of blocking. Go and profile a busy webserver or
> > mailserver or database server yourself if you dont believe it.
>
> There are heavily-threaded programs out there that do not behave this
> way, and for which a N:M thread model is completely appropriate. [...]

of course, that's the other 10%. [or even smaller.] I never claimed M:N
cannot be viable for certain specific applications. But a generic
threading library should rather concentrate on the common 90% of the
applications.

(obviously for simulations the absolute fastest implementation would be a
pure userspace state-machine, not a threaded application - M:N or 1:1.)

Ingo

2002-09-24 06:12:09

by Rusty Russell

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

In message <[email protected]> you
write:
> > And, ironically, using the futex implementation developed on IBM time 8).
>
> you are right, futexes are really important for all the userspace locking
> primitives and thread-joining. And like basically all core kernel code,
> futexes were a collaborative effort as well:
>
> * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
> * enough at me, Linus for the original (flawed) idea, Matthew
> * Kirkwood for proof-of-concept implementation.

And yourself, Robert Love, Paul Mackerras and Hubertus Franke all
contributed to futexes directly, too. I wasn't complaining about
credit, I just found the IBM involvement worth noting (in case someone
thought we were onesided).

> there are so many prerequisites to this that it's impossible to list them
> all.

True here, but in general: almost all the order-of-magnitude
scalability jumps in 2.5 can be traced back to you or Andrew Morton.
I wouldn't want a casual reader to miss that fact 8)

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-09-24 06:18:55

[permalink] [raw]

Subject: 1:1 threading vs. scheduler activations (was: Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

On Mon, 23 Sep 2002, Andy Isaacson wrote:

> Excellent points, Ingo. An alternative that I haven't seen considered
> is the M:N threading model that NetBSD is adopting, called Scheduler
> Activations. [...]

yes, SA's (and KSA's) are an interesting concept, but i personally think
they are way too much complexity - and history has shows that complexity
never leads to anything good, especially not in OS design.

Eg. SA's, like every M:N concept, must have a userspace component of the
scheduler, which gets very funny when you try to implement all the things
the kernel scheduler has had for years: fairness, SMP balancing, RT
scheduling (!), preemption and more.

Eg. 2.0.2 NGPT's current userspace scheduler is still cooperative - a
single userspace thread burning CPU cycles monopolizes the full context.
Obviously this can be fixed, but it gets nastier and nastier as you add
the features - for no good reason, the same functionality is already
present in the kernel's scheduler, which can generally do much better
scheduling decisions - it has direct and reliable access to various
statistics, it knows exactly how much CPU time has been used up. To give
all this information to the userspace scheduler takes alot of effort. I'm
no wimp when it comes to scheduler complexity, but a coupled kernel/user
scheduling concept scares the *hit out of me.

And then i havent mentioned things like upcall costs - what's the point in
upcalling userspace which then has to schedule, instead of doing this
stuff right in the kernel? Scheduler activations concentrate too much on
the 5% of cases that have more userspace<->userspace context switching
than some sort of kernel-provoked context switching. Sure, scheduler
activations can be done, but i cannot see how they can be any better than
'just as fast' as a 1:1 implementation - at a much higher complexity and
robustness cost.

the biggest part of Linux's kernel-space context switching is the cost of
kernel entry - and the cost of kernel entry gets cheaper with every new
generation of CPUs. Basing the whole threading design on the avoidance of
the kernel scheduler is like basing your tent on a glacier, in a hot
summer day.

Plus in an M:N model all the development toolchain suddenly has to
understand the new set of contexts, debuggers, tracers, everything.

Plus there are other issues like security - it's perfectly reasonable in
the 1:1 model for a certain set of server threads to drop all privileges
to do the more dangerous stuff. (while there is no such thing as absolute
security and separation in a threaded app, dropping privileges can avoid
certain classes of exploits.)

generally the whole SA/M:N concept creaks under the huge change that is
introduced by having multiple userspace contexts of execution per a single
kernel-space context of execution. Such detaching of concepts, no matter
which kernel subsystem you look at, causes problems everywhere.

eg. the VM. There's no way you can get an 'upcall' from the VM that you
need to wait for free RAM - most of the related kernel code is simply not
ready and restartable. So VM load can end up blocking kernel contexts
without giving any chance to user contexts to be 'scheduled' by the
userspace scheduler. This happens exactly in the worst moment, when load
increases and stuff starts swapping.

and there are some things that i'm not at all sure can be fixed in any
reasonable way - eg. RT scheduling. [the userspace library would have to
raise/drop the priority of threads in the userspace scheduler, causing an
additional kernel entry/exit, eliminating even the theoretical advantage
it had for pure user<->user context switches.]

plus basic performance issues. If you have a healthy mix of userspace and
kernelspace scheduler activity then you've at least doubled your icache
footprint by having two scheduler - the dcache footprint is higher as
well. A *single* bad cachemiss on a P4 is already almost as expensive as a
kernel entry - and it's not like the growing gap between RAM access
latency and CPU performance will shrink in the future. And we arent even
using SYSENTER/SYSEXIT in the Linux kernel yet, which will shave off
another 40% from the syscall entry (and kernel context switching) cost.

so my current take on threading models is: if you *can* do a really fast
and lightweight kernel based 1:1 threading implementation then you have
won. Anything else is barely more than workarounds for (fixable)
architectural problems. Concentrate your execution abstraction into the
kernel and make it *really* fast and scalable - that will improve
everything else. OTOH any improvement to the userspace thread scheduler
only improves threaded applications - which are still the minority. Sure,
some of the above problems can be helped, but it's not trivial - and some
problems i dont think can be solved at all.

But we'll see, the FreeBSD folks i think are working on KSA's so we'll
know for sure in a couple of years.

Ingo

2002-09-24 07:07:50

by Thunder from the hill

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Hi,

On Mon, 23 Sep 2002, Ingo Molnar wrote:
> 90% of the programs that matter behave exactly like Larry has described.
> IO is the main source of blocking. Go and profile a busy webserver or
> mailserver or database server yourself if you dont believe it.

Well, I guess Java Web Server behaves the same?

Thunder
--
assert(typeof((fool)->next) == typeof(fool)); /* wrong */

2002-09-24 07:16:43

by Matthias Urlichs

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Peter Waechtler:
> [ unattributed -- please don't discard attributions ]

> Having multiple threads doing real work including IO means more
> blocking IO and therefore more context switches.

On the other hand, having to multiplex in userspace requires calls to
poll() et al, _and_ explicitly handling state which the kernel needs
to handle anyway -- including locking and all that crap.

Given that an efficient and fairly-low-cost 1:1 implementation is
demonstrably possible ;-) the necessity to do any kind of n:m work
strikes me as extremely low.
--
Matthias Urlichs http://smurf.noris.de ICQ:20193661 AIM:smurfixx

2002-09-24 07:16:57

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tue, 24 Sep 2002, Thunder from the hill wrote:

> > 90% of the programs that matter behave exactly like Larry has described.
> > IO is the main source of blocking. Go and profile a busy webserver or
> > mailserver or database server yourself if you dont believe it.
>
> Well, I guess Java Web Server behaves the same?

yes. The most common case is that it either blocks on the external network
connection (IO), or on some internal database connection (IO as well). The
JVMs themselves be better well-threaded internally, with not much
contention on any internal lock. The case of internal synchronization is
really that the 1:1 model makes a 'bad parallelism' more visible: when
there's contention. It's quite rare that heavy synchronization and heavy
lock contention cannot be avoided, and it mostly involves simulation
projects which often do this because they simulate real world IO :-)

(but, all this thread is becoming pretty theoretical - current fact is
that the 1:1 library is currently more than 4 times faster than the only
M:N library that we were able to run the test on using the same kernel, on
M:N's 'home turf'. So anyone who thinks the M:N library should perform
faster is welcome to improve it and send in results.)

Ingo

2002-09-24 09:57:08

by Nikita Danilov

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Larry McVoy writes:
> > > Instead of taking the traditional "we've screwed up the normal system
> > > primitives so we'll event new lightweight ones" try this:
> > >
> > > We depend on the system primitives to not be broken or slow.
> > >
> > > If that's a true statement, and in Linux it tends to be far more true
> > > than other operating systems, then there is no reason to have M:N.
> >
> > No matter how fast you do context switch in and out of kernel and a sched
> > to see what runs next, it can't be done as fast as it can be avoided.
>
> You are arguing about how many angels can dance on the head of a pin.
> Sure, there are lotso benchmarks which show how fast user level threads
> can context switch amongst each other and it is always faster than going
> into the kernel. So what? What do you think causes a context switch in
> a threaded program? What? Could it be blocking on I/O? Like 99.999%
> of the time? And doesn't that mean you already went into the kernel to
> see if the I/O was ready? And doesn't that mean that in all the real
> world applications they are already doing all the work you are arguing
> to avoid?

M:N threads are supposed to have other advantages beside fast context
switches. Original paper on scheduler activations mentioned case when
kernel thread is preempted while user level thread it runs held spin
lock. When kernel notifies user level scheduler about preemption
(through upcall) it can de-schedule all user level threads spinning on
this lock, so that they will not waste their time slices burning CPU.

> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

Nikita.

> -

2002-09-24 10:50:48

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

> Another advantage of keeping a "process" concept is that things like CSA
> (Compatible System Accounting, nee Cray System Accounting)

Which has been ported to Linux now, btw (rather poorly integrated, though):

http://oss.sgi.com/projects/csa/

2002-09-24 11:57:54

by Michael Sinz

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Bill Huey (Hui) wrote:
> On Sun, Sep 22, 2002 at 08:55:39PM +0200, Peter Waechtler wrote:
>
>>AIX and Irix deploy M:N - I guess for a good reason: it's more
>>flexible and combine both approaches with easy runtime tuning if
>>the app happens to run on SMP (the uncommon case).
>
> Also, for process scoped scheduling in a way so that system wide threads
> don't have an impact on a process slice. Folks have piped up about that
> being important.

This is the one major complaint I have with "a thread is a process"
implementation in Linux. The scheduler does not take process vs thread
into account.

A simple example: Two users (or two different programs - same user)
are running. Both could use all of the CPU resources (for whatever
reason).

One of the programs (A) has N threads (where N >> 1) and the other
program (B) has only 1 thread. Of the N threads in (A), M of them
are not blocked (where M > 1) then (A) will get M:1 CPU usage advantage
over (B).

This means that two processes/programs that should be scheduled equally
are not and the one with many threads "effectively" is stealing cycles
from the other.

In a multi-user (server with multiple processes) environment, this
means that you just write with lots of threads to get more of the
bandwidth out of the scheduled processes.

A real-world (albeit not great) example from many years ago:

A program that does ray-tracing can very easily split the process up
into very small bits. This is great on multi-processor systems as you
can get each CPU to do part of the work in parallel. There is almost
no I/O involved in such a system other than initial load and final save.

It turned out that on non-dedicated systems (multi-user systems) that
you could actually get your work done faster by having the program
create many (100, in this case) threads even though there was only
one big CPU. The reason was that that OS also did not (yet) understand
process scheduling fairness and the student who did this effectively
made a way around the fair scheduling of system resources.

The problem was very quickly noticed as other students quickly learned
how to make use of such "solutions" to their performance wants. We
relatively quickly had to add process level accounting of thread CPU
usage such that any thread in a process counted to that process's
CPU usage/timeslice/etc. It basically made the scheduler into a
2-stage device - much like user threads but with the kernel doing
the work and all of the benefits of kernel threads. (And did not
require any code recompile other than those people who were doing
the many-threads CPU hog type of thing ended up having to revert as
it was now slower than the single thread-per-CPU code...)

Now, computer hardware has changed a lot. Back then, branch took
longer than current kernel syscall overhead. Memory was faster
than the CPU. The scheduler was complex, so I could not say that
it was as efficient as the Linux kernel. However, we did have real
threads and did quickly get real process accounting after someone
"pointed out" the problem of not doing so :-)

--
Michael Sinz -- Director, Systems Engineering -- Worldgate Communications
A master's secrets are only as good as
the master's ability to explain them to others.

2002-09-24 13:36:19

by Peter Svensson

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tue, 24 Sep 2002, Michael Sinz wrote:

> The problem was very quickly noticed as other students quickly learned
> how to make use of such "solutions" to their performance wants. We
> relatively quickly had to add process level accounting of thread CPU
> usage such that any thread in a process counted to that process's
> CPU usage/timeslice/etc. It basically made the scheduler into a
> 2-stage device - much like user threads but with the kernel doing
> the work and all of the benefits of kernel threads. (And did not
> require any code recompile other than those people who were doing
> the many-threads CPU hog type of thing ended up having to revert as
> it was now slower than the single thread-per-CPU code...)

Then you can just as well use fork(2) and split into processes with the
same result. The solution is not thread specific, it is resource limits
and/or per user cpu accounting.

Several raytracers can (could?) split the workload into multiple
processes, some being started on other computers over rsh or similar.

Peter
--
Peter Svensson ! Pgp key available by finger, fingerprint:
<[email protected]> ! 8A E9 20 98 C1 FF 43 E3 07 FD B9 0A 80 72 70 AF
------------------------------------------------------------------------
Remember, Luke, your source will be with you... always...

2002-09-24 14:15:01

by Michael Sinz

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Peter Svensson wrote:
> On Tue, 24 Sep 2002, Michael Sinz wrote:
>
>
>>The problem was very quickly noticed as other students quickly learned
>>how to make use of such "solutions" to their performance wants. We
>>relatively quickly had to add process level accounting of thread CPU
>>usage such that any thread in a process counted to that process's
>>CPU usage/timeslice/etc. It basically made the scheduler into a
>>2-stage device - much like user threads but with the kernel doing
>>the work and all of the benefits of kernel threads. (And did not
>>require any code recompile other than those people who were doing
>>the many-threads CPU hog type of thing ended up having to revert as
>>it was now slower than the single thread-per-CPU code...)
>
>
> Then you can just as well use fork(2) and split into processes with the
> same result. The solution is not thread specific, it is resource limits
> and/or per user cpu accounting.

I understand that point - but the basic question is if you schedule
based on the process or based on the thread. In an interactive multi-
user system, you may even want to back out to the user level. (Thus
no user can hog the system by doing many things). But that is usually
not the target of Linux systems (yet?)

The problem then is the inter-process communications. (At least on
that system - Linux has many better solutions) That system did not
have shared memory and thus the coordination between processes was
difficult at best.

> Several raytracers can (could?) split the workload into multiple
> processes, some being started on other computers over rsh or similar.

And they exist - but the I/O overhead makes it "not a win" on a
single machine. (It hurts too much)

--
Michael Sinz -- Director, Systems Engineering -- Worldgate Communications
A master's secrets are only as good as
the master's ability to explain them to others.

2002-09-24 14:46:42

by Peter Svensson

[permalink] [raw]

Subject: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

On Tue, 24 Sep 2002, Michael Sinz wrote:

> > Several raytracers can (could?) split the workload into multiple
> > processes, some being started on other computers over rsh or similar.
>
> And they exist - but the I/O overhead makes it "not a win" on a
> single machine. (It hurts too much)

For raytracers (which was the example) you need almost no coordination at
all. Just partition the scene and you are done. This is going offtopic
fast. The point I was making is that there is really no great reward in
grouping threads. Either you need to educate your users and trust them to
behave, or you need per user scheduling.

Peter
--
Peter Svensson ! Pgp key available by finger, fingerprint:
<[email protected]> ! 8A E9 20 98 C1 FF 43 E3 07 FD B9 0A 80 72 70 AF
------------------------------------------------------------------------
Remember, Luke, your source will be with you... always...

2002-09-24 15:02:54

by Mark Veltzer

[permalink] [raw]

Subject: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Tuesday 24 September 2002 17:50, Peter Svensson wrote:
> Either you need to educate your users and trust them to
> behave, or you need per user scheduling.

It is obvious that in high end systems you MUST have per user scheduling
since users will rob each other of cycles.... If Linux is to be a general
purpose operation system it MUST have this feature (otherwise it will only be
considered fit for lower end systems) and trusting your users at this no
better than trusting your users when they promise you they will not seg fault
or peek into memory pages which are not theirs. It simply isn't done.
Besides, using the CPU in an abusive manner could happen as a result of a bug
as much as a result of malicious intent (exactly like a segfault).

Ok. Here's an idea. Why not have both ?!?

There is no real reason why I should have per user scheduling on my machine
at home (I don't really need a just devision of labour between the root user
and myself which are almost the only users to use my system). Why not have
the deault compilation of the kernel be without per user scheduling and
enable it for high end systems (like a university machine where all the
students are at each others throats for a few CPU cycles...) ? So how about
making this a compile option and let the users decide what they like ?

Mark
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE9kIKHxlxDIcceXTgRAjGTAJ9bj1t2QV3zaDheO3GQpvJxxjDSIQCggESi
yqE29XtjTL3VDBu15VTQ0Qc=
=oueS
-----END PGP SIGNATURE-----

2002-09-24 16:18:22

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Bill Huey (Hui) <[email protected]> writes:

> On Sun, Sep 22, 2002 at 12:41:40PM -0600, Eric W. Biederman wrote:
> > They are talking about an incremental GC routine so it does not need to stop
> > all threads simultaneously. Threads only need to be stopped when the GC is
> gather
>
> > a root set. This is what the safe points are for right? And it does
> > not need to be 100% accurate in finding all of the garbage. The
> > collector just needs to not make mistakes in the other direction.
>
> There's a mixture of GC algorithms in HotSpot including generational and I
> believe a traditional mark/sweep. GC isn't my expertise per se.

If they have any sense they also have an incremental GC algorithm. That
the GC thread can sit around all day and executing. If they are actually
using a stop and collect algorithm there are real issues. Though I would
love to see the Java guys justify a copy collector...

> Think, you have a compiled code block and you suspend/interrupt threads when
> you either start hitting the stack yellow guard or by a periodic GC thread...
>
> That can happen anytime, so you can't just expect things to drop onto a
> regular boundary in the compiled code block.

Agreed, but what was this talk earlier about safe points?

> It's for that reason that
> you have to some kind of OS level threading support to get the ucontext.

I don't quite follow the need and I'm not certain you do either. A full
GC pass is very expensive. So saving a threads context in user space
should not be a big deal. It is very minor compared to the rest
of the work going on. Especially in a language like java where practically
everything lives on the heap.

The thing that sounds sensible to me is that before a threads makes a blocking
call it can be certain to save relevant bits of information to the stack. But
x86 is easy what to do with the pointer heavy architectures where pushing
all of the registers onto the stack starts getting expensive is an entirely
different question.

But beyond that. The most sensible algorithm I can see is a
generational incremental collector where each thread has it's own
local heap, and does it's own local garbage collection. And only the
boundary where the local heap meets the global heap needs to collected
by the collector for all threads. This preserves a lot of cache
locality as well as circumventing the whole ucontext issue.

If getting the registers is really a bottle neck in the garbage
collector I suspect it can probably share some generic primitives
with user mode linux.

If support really needs to happen I suspect this case is close
enough to what that user mode linux is doing that someone should
look at how the same mechanism to get the register state can be
shared.

Eric

2002-09-24 16:26:08

[permalink] [raw]

Subject: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

On Tue, 24 Sep 2002, Peter Svensson wrote:

> For raytracers (which was the example) you need almost no coordination at
> all. Just partition the scene and you are done. This is going offtopic
> fast. The point I was making is that there is really no great reward in
> grouping threads. Either you need to educate your users and trust them to
> behave, or you need per user scheduling.

I've got per user scheduling. I'm currently porting it to 2.4.19
(and having fun with some very subtle bugs) and am thinking about
how to port this beast to the O(1) scheduler in a clean way.

Note that it's not necessarily per user, it's trivial to adapt
the code to use any other resource container instead.

regards,

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-24 17:41:29

[permalink] [raw]

Subject: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

On Tue, 24 Sep 2002, Mark Veltzer wrote:
> On Tuesday 24 September 2002 17:50, Peter Svensson wrote:
> > Either you need to educate your users and trust them to
> > behave, or you need per user scheduling.
>
> It is obvious that in high end systems you MUST have per user scheduling
> since users will rob each other of cycles.... If Linux is to be a
> general purpose operation system it MUST have this feature

I just posted a patch for this and will upload the patch to
my home page:

Subject: [PATCH] per user scheduling for 2.4.19

My patch also allows you to switch the per user scheduling
on/off with /proc/sys/kernel/fairsched and has been tested
on both UP and SMP.

kind regards,

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-24 18:44:42

by Michael Sinz

[permalink] [raw]

Subject: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

Rik van Riel wrote:
> On Tue, 24 Sep 2002, Peter Svensson wrote:
>
>
>>For raytracers (which was the example) you need almost no coordination at
>>all. Just partition the scene and you are done. This is going offtopic
>>fast. The point I was making is that there is really no great reward in
>>grouping threads. Either you need to educate your users and trust them to
>>behave, or you need per user scheduling.
>
>
> I've got per user scheduling. I'm currently porting it to 2.4.19
> (and having fun with some very subtle bugs) and am thinking about
> how to port this beast to the O(1) scheduler in a clean way.
>
> Note that it's not necessarily per user, it's trivial to adapt
> the code to use any other resource container instead.

Doing it per process prevents some class of problems. Doing it
per user prevents another class.

Note that this does not limit the user (ulimit type solutions)
when the system is not under stress. What it does do is make sure
that no one user (or process) can DOS the system. In other words,
implement a fairness amoung peers (users or processes).

Currently, Linux has fairness amoung threads (since threads and
processes are basically the same as far as the scheduler is
concerned.)

I would be interested in seeing this patch. A per process thing
would be really cool for what I am building (WorldGate) since
many of the processes are all running as the same user but a per
user thing would also be interesting.

(Or both - processes within a user are fair with each other and
users are fair amoung the other users...)

--
Michael Sinz -- Director, Systems Engineering -- Worldgate Communications
A master's secrets are only as good as
the master's ability to explain them to others.

2002-09-24 19:07:21

[permalink] [raw]

Subject: PATCH: per user fair scheduler 2.4.19 (cleaned up, thanks hch) (was: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1))

On Tue, 24 Sep 2002, Michael Sinz wrote:
> Rik van Riel wrote:

> > I've got per user scheduling. I'm currently porting it to 2.4.19
> > (and having fun with some very subtle bugs) and am thinking about
> > how to port this beast to the O(1) scheduler in a clean way.

> I would be interested in seeing this patch.

Here it is again, this version has undergone some cleanups
after complaints by Christoph Hellwig ;)

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

--- linux/kernel/sched.c.fair 2002-09-20 10:58:49.000000000 -0300
+++ linux/kernel/sched.c 2002-09-24 14:46:31.000000000 -0300
@@ -45,31 +45,33 @@

extern void mem_use(void);

+#ifdef CONFIG_FAIRSCHED
+/* Toggle the per-user fair scheduler on/off */
+int fairsched = 1;
+
+/* Move a task to the tail of the tasklist */
+static inline void move_last_tasklist(struct task_struct * p)
+{
+ /* list_del */
+ p->next_task->prev_task = p->prev_task;
+ p->prev_task->next_task = p->next_task;
+
+ /* list_add_tail */
+ p->next_task = &init_task;
+ p->prev_task = init_task.prev_task;
+ init_task.prev_task->next_task = p;
+ init_task.prev_task = p;
+}
+
/*
- * Scheduling quanta.
- *
- * NOTE! The unix "nice" value influences how long a process
- * gets. The nice value ranges from -20 to +19, where a -20
- * is a "high-priority" task, and a "+10" is a low-priority
- * task.
- *
- * We want the time-slice to be around 50ms or so, so this
- * calculation depends on the value of HZ.
+ * Remember p->next, in case we call move_last_tasklist(p) in the
+ * fairsched recalculation code.
*/
-#if HZ < 200
-#define TICK_SCALE(x) ((x) >> 2)
-#elif HZ < 400
-#define TICK_SCALE(x) ((x) >> 1)
-#elif HZ < 800
-#define TICK_SCALE(x) (x)
-#elif HZ < 1600
-#define TICK_SCALE(x) ((x) << 1)
-#else
-#define TICK_SCALE(x) ((x) << 2)
-#endif
-
-#define NICE_TO_TICKS(nice) (TICK_SCALE(20-(nice))+1)
+#define safe_for_each_task(p) \
+ for (p = init_task.next_task, next = p->next_task ; p != &init_task ; \
+ p = next, next = p->next_task)

+#endif /* CONFIG_FAIRSCHED */

/*
* Init task must be ok at boot for the ix86 as we will check its signals
@@ -460,6 +462,54 @@
}

/*
+ * If the remaining timeslice lengths of all runnable tasks reach 0
+ * the scheduler recalculates the priority of all processes.
+ */
+static void recalculate_priorities(void)
+{
+#ifdef CONFIG_FAIRSCHED
+ if (fairsched) {
+ struct task_struct *p, *next;
+ struct user_struct *up;
+ long oldcounter, newcounter;
+
+ recalculate_peruser_cputicks();
+
+ write_lock_irq(&tasklist_lock);
+ safe_for_each_task(p) {
+ up = p->user;
+ if (up->cpu_ticks > 0) {
+ oldcounter = p->counter;
+ newcounter = (oldcounter >> 1) +
+ NICE_TO_TICKS(p->nice);
+ up->cpu_ticks += oldcounter;
+ up->cpu_ticks -= newcounter;
+ /*
+ * If a user is very busy, only some of its
+ * processes can get CPU time. We move those
+ * processes out of the way to prevent
+ * starvation of others.
+ */
+ if (oldcounter != newcounter) {
+ p->counter = newcounter;
+ move_last_tasklist(p);
+ }
+ }
+ }
+ write_unlock_irq(&tasklist_lock);
+ } else
+#endif /* CONFIG_FAIRSCHED */
+ {
+ struct task_struct *p;
+
+ read_lock(&tasklist_lock);
+ for_each_task(p)
+ p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
+ read_unlock(&tasklist_lock);
+ }
+}
+
+/*
* schedule_tail() is getting called from the fork return path. This
* cleans up all remaining scheduler things, without impacting the
* common case.
@@ -616,13 +666,10 @@

/* Do we need to re-calculate counters? */
if (unlikely(!c)) {
- struct task_struct *p;
-
spin_unlock_irq(&runqueue_lock);
- read_lock(&tasklist_lock);
- for_each_task(p)
- p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
- read_unlock(&tasklist_lock);
+
+ recalculate_priorities();
+
spin_lock_irq(&runqueue_lock);
goto repeat_schedule;
}
--- linux/kernel/user.c.fair 2002-09-20 11:50:56.000000000 -0300
+++ linux/kernel/user.c 2002-09-24 16:06:11.000000000 -0300
@@ -29,9 +29,12 @@
struct user_struct root_user = {
__count: ATOMIC_INIT(1),
processes: ATOMIC_INIT(1),
- files: ATOMIC_INIT(0)
+ files: ATOMIC_INIT(0),
+ cpu_ticks: NICE_TO_TICKS(0)
};

+static LIST_HEAD(user_list);
+
/*
* These routines must be called with the uidhash spinlock held!
*/
@@ -44,6 +47,8 @@
next->pprev = &up->next;
up->pprev = hashent;
*hashent = up;
+
+ list_add(&up->list, &user_list);
}

static inline void uid_hash_remove(struct user_struct *up)
@@ -54,6 +59,8 @@
if (next)
next->pprev = pprev;
*pprev = next;
+
+ list_del(&up->list);
}

static inline struct user_struct *uid_hash_find(uid_t uid, struct user_struct **hashent)
@@ -101,6 +108,7 @@
atomic_set(&new->__count, 1);
atomic_set(&new->processes, 0);
atomic_set(&new->files, 0);
+ new->cpu_ticks = NICE_TO_TICKS(0);

/*
* Before adding this, check whether we raced
@@ -120,6 +128,21 @@
return up;
}

+/* Fair scheduler, recalculate the per user cpu time cap. */
+void recalculate_peruser_cputicks(void)
+{
+ struct list_head * entry;
+ struct user_struct * user;
+
+ spin_lock(&uidhash_lock);
+ list_for_each(entry, &user_list) {
+ user = list_entry(entry, struct user_struct, list);
+ user->cpu_ticks = (user->cpu_ticks / 2) + NICE_TO_TICKS(0);
+ }
+ /* Needed hack, we can get called before uid_cache_init ... */
+ root_user.cpu_ticks = (root_user.cpu_ticks / 2) + NICE_TO_TICKS(0);
+ spin_unlock(&uidhash_lock);
+}

static int __init uid_cache_init(void)
{
--- linux/kernel/sysctl.c.fair 2002-09-21 00:00:36.000000000 -0300
+++ linux/kernel/sysctl.c 2002-09-21 20:40:49.000000000 -0300
@@ -50,6 +50,7 @@
extern int sysrq_enabled;
extern int core_uses_pid;
extern int cad_pid;
+extern int fairsched;

/* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
static int maxolduid = 65535;
@@ -256,6 +257,10 @@
{KERN_S390_USER_DEBUG_LOGGING,"userprocess_debug",
&sysctl_userprocess_debug,sizeof(int),0644,NULL,&proc_dointvec},
#endif
+#ifdef CONFIG_FAIRSCHED
+ {KERN_FAIRSCHED, "fairsched", &fairsched, sizeof(int),
+ 0644, NULL, &proc_dointvec},
+#endif
{0}
};

--- linux/include/linux/sched.h.fair 2002-09-20 10:59:03.000000000 -0300
+++ linux/include/linux/sched.h 2002-09-24 15:12:50.000000000 -0300
@@ -275,6 +275,10 @@
/* Hash table maintenance information */
struct user_struct *next, **pprev;
uid_t uid;
+
+ /* Linked list for for_each_user */
+ struct list_head list;
+ long cpu_ticks;
};

#define get_current_user() ({ \
@@ -282,6 +286,7 @@
atomic_inc(&__user->__count); \
__user; })

+extern void recalculate_peruser_cputicks(void);
extern struct user_struct root_user;
#define INIT_USER (&root_user)

@@ -422,6 +427,31 @@
};

/*
+ * Scheduling quanta.
+ *
+ * NOTE! The unix "nice" value influences how long a process
+ * gets. The nice value ranges from -20 to +19, where a -20
+ * is a "high-priority" task, and a "+10" is a low-priority
+ * task.
+ *
+ * We want the time-slice to be around 50ms or so, so this
+ * calculation depends on the value of HZ.
+ */
+#if HZ < 200
+#define TICK_SCALE(x) ((x) >> 2)
+#elif HZ < 400
+#define TICK_SCALE(x) ((x) >> 1)
+#elif HZ < 800
+#define TICK_SCALE(x) (x)
+#elif HZ < 1600
+#define TICK_SCALE(x) ((x) << 1)
+#else
+#define TICK_SCALE(x) ((x) << 2)
+#endif
+
+#define NICE_TO_TICKS(nice) (TICK_SCALE(20-(nice))+1)
+
+/*
* Per process flags
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
--- linux/include/linux/sysctl.h.fair 2002-09-21 20:41:18.000000000 -0300
+++ linux/include/linux/sysctl.h 2002-09-21 20:41:43.000000000 -0300
@@ -124,6 +124,7 @@
KERN_CORE_USES_PID=52, /* int: use core or core.%pid */
KERN_TAINTED=53, /* int: various kernel tainted flags */
KERN_CADPID=54, /* int: PID of the process to notify on CAD */
+ KERN_FAIRSCHED=55, /* int: turn fair scheduler on/off */
};

--- linux/arch/i386/config.in.fair 2002-09-21 20:42:06.000000000 -0300
+++ linux/arch/i386/config.in 2002-09-21 20:42:35.000000000 -0300
@@ -261,6 +261,9 @@
bool 'System V IPC' CONFIG_SYSVIPC
bool 'BSD Process Accounting' CONFIG_BSD_PROCESS_ACCT
bool 'Sysctl support' CONFIG_SYSCTL
+if [ "$CONFIG_EXPERIMENTAL" = "y" ] ; then
+ bool 'Fair scheduler' CONFIG_FAIRSCHED
+fi
if [ "$CONFIG_PROC_FS" = "y" ]; then
choice 'Kernel core (/proc/kcore) format' \
"ELF CONFIG_KCORE_ELF \
--- linux/arch/alpha/config.in.fair 2002-08-02 21:39:42.000000000 -0300
+++ linux/arch/alpha/config.in 2002-09-21 20:43:21.000000000 -0300
@@ -273,6 +273,9 @@
bool 'System V IPC' CONFIG_SYSVIPC
bool 'BSD Process Accounting' CONFIG_BSD_PROCESS_ACCT
bool 'Sysctl support' CONFIG_SYSCTL
+if [ "$CONFIG_EXPERIMENTAL" = "y" ] ; then
+ bool 'Fair scheduler' CONFIG_FAIRSCHED
+fi
if [ "$CONFIG_PROC_FS" = "y" ]; then
choice 'Kernel core (/proc/kcore) format' \
"ELF CONFIG_KCORE_ELF \

2002-09-24 20:13:55

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

>The effect of M:N on UP systems should be even more clear. Your
>multithreaded apps can't profit of parallelism but they do not
>add load to the system scheduler. The drawback: more syscalls
>(I think about removing the need for
>flags=fcntl(GETFLAGS);fcntl(fd,NONBLOCK);write(fd);fcntl(fd,flags))

The main reason I write multithreaded apps for single CPU systems is to
protect against ambush. Consider, for example, a web server. Someone sends it
an obscure request that triggers some code that's never run before and has to
fault in. If my application were single-threaded, no work could be done until
that page faulted in from disk. This is why select-loop and poll-loop type
servers are bursty.

DS

2002-09-24 20:28:55

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Mon, 23 Sep 2002 19:03:06 -0500, Andy Isaacson wrote:

>Of course this can be (and frequently is) implemented such that there is
>not one Pthreads thread per object; given simulation environments with 1
>million objects, and the current crappy state of Pthreads
>implementations, the researchers have no choice.

It may well be handy to have a threads implementation that makes these kinds
of programs easy to write, but an OS's preferred pthreads is not and should
not be that threads implementation. A platforms default/preferred pthreads
implementation should be one that allows well-designed, high-performance
I/O-intensive and compute-intensive tasks to run extremely well.

DS

2002-09-25 03:16:34

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Bill Huey (Hui) <[email protected]> writes:
> Which HotSpot has...Read the papers... Thread local storage exists.

I'd love to. Do you have a URL. I didn't see in my quick look.

Eric

2002-09-25 03:03:44

[permalink] [raw]

Subject: Re: 1:1 threading vs. scheduler activations (was: Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

On Tue, Sep 24, 2002 at 08:32:16AM +0200, Ingo Molnar wrote:
> yes, SA's (and KSA's) are an interesting concept, but i personally think
> they are way too much complexity - and history has shows that complexity
> never leads to anything good, especially not in OS design.

FreeBSD's KSEs .;)

> Eg. SA's, like every M:N concept, must have a userspace component of the
> scheduler, which gets very funny when you try to implement all the things
> the kernel scheduler has had for years: fairness, SMP balancing, RT
> scheduling (!), preemption and more.

Yeah, I understand. These folks are doing some interesting stuff and
might provide some answers for you:

http://www.research.ibm.com/K42/

This paper specifically:

http://www.research.ibm.com/K42/white-papers/Scheduling.pdf

Their stuff isn't too much different than FreeBSD's KSE project, different
names for the primitives, different communication, etc...

> And then i havent mentioned things like upcall costs - what's the point in
> upcalling userspace which then has to schedule, instead of doing this
> stuff right in the kernel? Scheduler activations concentrate too much on
> the 5% of cases that have more userspace<->userspace context switching
> than some sort of kernel-provoked context switching. Sure, scheduler
> activations can be done, but i cannot see how they can be any better than
> 'just as fast' as a 1:1 implementation - at a much higher complexity and
> robustness cost.

Folks have been experimenting with other means of kernel/userspace using
a chunk of shared memory, notification and polling when the UTS gets entered
by a block on a mutex or other operation. Upcalls are what was used in
the original Topaz OS paper that implemented SAs, Mach was the other.
It doesn't mean that it's used universally for all implementations.

> the biggest part of Linux's kernel-space context switching is the cost of
> kernel entry - and the cost of kernel entry gets cheaper with every new
> generation of CPUs. Basing the whole threading design on the avoidance of
> the kernel scheduler is like basing your tent on a glacier, in a hot
> summer day.
>
> Plus in an M:N model all the development toolchain suddenly has to
> understand the new set of contexts, debuggers, tracers, everything.

That's not an issue. Folks expect that to be so when working with any
new threading system.

> Plus there are other issues like security - it's perfectly reasonable in
> the 1:1 model for a certain set of server threads to drop all privileges
> to do the more dangerous stuff. (while there is no such thing as absolute
> security and separation in a threaded app, dropping privileges can avoid
> certain classes of exploits.)

> generally the whole SA/M:N concept creaks under the huge change that is
> introduced by having multiple userspace contexts of execution per a single
> kernel-space context of execution. Such detaching of concepts, no matter
> which kernel subsystem you look at, causes problems everywhere.

Maybe, it's probably implementation specific. I'm curious as to how K42
performs.

> eg. the VM. There's no way you can get an 'upcall' from the VM that you
> need to wait for free RAM - most of the related kernel code is simply not
> ready and restartable. So VM load can end up blocking kernel contexts
> without giving any chance to user contexts to be 'scheduled' by the
> userspace scheduler. This happens exactly in the worst moment, when load
> increases and stuff starts swapping.

That's solved by refashioning the kernel to pump out a blocking notification
to the UTS for that backing kernel thread. It's expected out of an SA style
system.

> and there are some things that i'm not at all sure can be fixed in any
> reasonable way - eg. RT scheduling. [the userspace library would have to
> raise/drop the priority of threads in the userspace scheduler, causing an
> additional kernel entry/exit, eliminating even the theoretical advantage
> it had for pure user<->user context switches.]

KSEs have a RT scheduling category, but the issue of preemption is not clearly
understood by me just yet so can't comment on it. I was in the process of trying
to understand this stuff at one time since I was thinking about work on that
project.

> plus basic performance issues. If you have a healthy mix of userspace and
> kernelspace scheduler activity then you've at least doubled your icache
> footprint by having two scheduler - the dcache footprint is higher as
> well. A *single* bad cachemiss on a P4 is already almost as expensive as a
> kernel entry - and it's not like the growing gap between RAM access
> latency and CPU performance will shrink in the future. And we arent even
> using SYSENTER/SYSEXIT in the Linux kernel yet, which will shave off
> another 40% from the syscall entry (and kernel context switching) cost.

It'll be localized to the UTS, while threads that blocked in the kernel
are mostly going to be IO driven. Don't know about the situation where
you might have a a mixture of those activities.

The infrastructure for the upcalls might incur significant overhead.

> so my current take on threading models is: if you *can* do a really fast
> and lightweight kernel based 1:1 threading implementation then you have
> won. Anything else is barely more than workarounds for (fixable)
> architectural problems. Concentrate your execution abstraction into the
> kernel and make it *really* fast and scalable - that will improve
> everything else. OTOH any improvement to the userspace thread scheduler
> only improves threaded applications - which are still the minority. Sure,
> some of the above problems can be helped, but it's not trivial - and some
> problems i dont think can be solved at all.

> But we'll see, the FreeBSD folks i think are working on KSA's so we'll
> know for sure in a couple of years.

There's a lot of ways folks can do this kind of stuff. Who knows ? The current
method you folks are doing could very well be the best for Linux.

I don't have much more to say about this topic.

bill

2002-09-24 23:18:33

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Wed, 25 Sep 2002, Peter Waechtler wrote:

> With Scheduler Activations this could also be avoided. The thread
> scheduler could get an upcall - but this will stay theory for a long
> time on Linux. But this is a somewhat far fetched example (for arguing
> for 1:1), isn't it?

Actually, the upcalls in a N:M scheme with scheduler activations
seem like a pretty good argument for 1:1 to me ;)

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-24 23:19:47

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tue, Sep 24, 2002 at 10:07:29AM -0600, Eric W. Biederman wrote:
> If they have any sense they also have an incremental GC algorithm. That
> the GC thread can sit around all day and executing. If they are actually
> using a stop and collect algorithm there are real issues. Though I would
> love to see the Java guys justify a copy collector...

They have that and other crazy things. Thread local storage facilities, etc...

> > regular boundary in the compiled code block.
> Agreed, but what was this talk earlier about safe points?

It's needed to deal with exceptions and GC.

> > It's for that reason that
> > you have to some kind of OS level threading support to get the ucontext.
>
> I don't quite follow the need and I'm not certain you do either. A full
> GC pass is very expensive. So saving a threads context in user space

But you haven't read the code and the supporting white papers for this
JIT compiler otherwise you wouldn't be whining about this. HotSpot is
no whimp when it comes to these issues. Compilers execute code, code is
stored some where and does memory allocation. Geez.

> should not be a big deal. It is very minor compared to the rest
> of the work going on. Especially in a language like java where practically
> everything lives on the heap.

Then you need to look at the compilation architecture of HotSpot. It's pretty
different than your picture of it and is gross oversimplification. Hell,
I don't know what much of it does since it's so freaking large.

> The thing that sounds sensible to me is that before a threads makes a blocking
> call it can be certain to save relevant bits of information to the stack. But
> x86 is easy what to do with the pointer heavy architectures where pushing
> all of the registers onto the stack starts getting expensive is an entirely
> different question.

Dude, you have not been reading this thread... Nothing valuable is at syscall
time per thread, those are external symbol calls that contain nothing of value
to the GC. It's allowed to execute at that point so any allocated chunk of memory
is going to be properly shoved some where known to the GC.

The stuff that's of value to maintain the correctness of this is within executing
code blocks in the method dictionary... move the program counter to a specific
place, funny execution stuff that I haven't looked at yet since I was pretty happy
about getting the threading glue to the OS working, etc...

> But beyond that. The most sensible algorithm I can see is a
> generational incremental collector where each thread has it's own
> local heap, and does it's own local garbage collection. And only the
> boundary where the local heap meets the global heap needs to collected
> by the collector for all threads. This preserves a lot of cache
> locality as well as circumventing the whole ucontext issue.

Which HotSpot has...Read the papers... Thread local storage exists.

This isn't a just a GC library isolated from a compiler. This a very sophisticated
compiler with a heavy threading infrastructure and you have to take into account
various kinds of interaction with the GC. It is non-trivial stuff.

> If getting the registers is really a bottle neck in the garbage
> collector I suspect it can probably share some generic primitives
> with user mode linux.

It can be if you have to send a single to each thread to get the ucontext.

> If support really needs to happen I suspect this case is close
> enough to what that user mode linux is doing that someone should
> look at how the same mechanism to get the register state can be
> shared.

With a non-PTRACE interface to those ucontexts...

bill

2002-09-24 23:12:11

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Am Dienstag den, 24. September 2002, um 22:19, schrieb David Schwartz:

>
>> The effect of M:N on UP systems should be even more clear. Your
>> multithreaded apps can't profit of parallelism but they do not
>> add load to the system scheduler. The drawback: more syscalls
>> (I think about removing the need for
>> flags=fcntl(GETFLAGS);fcntl(fd,NONBLOCK);write(fd);fcntl(fd,flags))
>
> The main reason I write multithreaded apps for single CPU systems
> is to
> protect against ambush. Consider, for example, a web server. Someone
> sends it
> an obscure request that triggers some code that's never run before and
> has to
> fault in. If my application were single-threaded, no work could be done
> until
> that page faulted in from disk. This is why select-loop and poll-loop
> type
> servers are bursty.

With the current NGPT design your threads would be blocked (all that are
scheduled
one this kernel vehicle).

With Scheduler Activations this could also be avoided.
The thread scheduler could get an upcall - but this will stay theory for
a long
time on Linux.
But this is a somewhat far fetched example (for arguing for 1:1), isn't
it?

There are other means of DoS..

2002-09-24 21:37:38

by Roberto Peon

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tuesday 24 September 2002 02:22 pm, Rik van Riel wrote:
> On Tue, 24 Sep 2002, Chris Friesen wrote:
> > David Schwartz wrote:
> > > The main reason I write multithreaded apps for single CPU systems is
> > > to protect against ambush. Consider, for example, a web server. Someone
> > > sends it an obscure request that triggers some code that's never run
> > > before and has to fault in. If my application were single-threaded, no
> > > work could be done until that page faulted in from disk.

This is similar to the problems that we face doing realtime virtual video
enhancements-

We have to log camera data (to know where things are pointed) by video
timecode since the data for the camera and the video are asyncronous
(especially in replay).

These (mmaped) logs can get relatively large (100+ MB ea) and access into them
is relatively random (i.e. determined by the director of the show), so the
process reading the log (and suffering the fault) is in a different thread in
order to not stall the other important tasks such as video output.
(Mis-estimating the position for the enhancement is much less of an issue than
dropping the video frame itself. We don't want 10,000,000 people seeing
pure-green frames popping up in the middle of the broadcast.)

-Roberto JP
[email protected]

2002-09-24 21:17:29

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tue, 24 Sep 2002, Chris Friesen wrote:
> David Schwartz wrote:
>
> > The main reason I write multithreaded apps for single CPU systems is to
> > protect against ambush. Consider, for example, a web server. Someone sends it
> > an obscure request that triggers some code that's never run before and has to
> > fault in. If my application were single-threaded, no work could be done until
> > that page faulted in from disk.
>
> This is interesting--I hadn't considered this as most of my work for the
> past while has been on embedded systems with everything pinned in ram.

On an ftp server (or movie server, or ...) you CAN'T pin everything
in RAM.

Rik
--
A: No.
Q: Should I include quotations after my reply?

http://www.surriel.com/ http://distro.conectiva.com/

2002-09-24 21:32:31

by Chris Friesen

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

Rik van Riel wrote:
> On Tue, 24 Sep 2002, Chris Friesen wrote:

>>This is interesting--I hadn't considered this as most of my work for the
>>past while has been on embedded systems with everything pinned in ram.
>>
>
> On an ftp server (or movie server, or ...) you CAN'T pin everything
> in RAM.

Yes, but you can use aio to issue the request for data and then go do
other stuff even with a single thread. David's case was faulting in
little-used application code.

Or arm I missing something?

Chris

2002-09-24 21:05:19

by Chris Friesen

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

David Schwartz wrote:

> The main reason I write multithreaded apps for single CPU systems is to
> protect against ambush. Consider, for example, a web server. Someone sends it
> an obscure request that triggers some code that's never run before and has to
> fault in. If my application were single-threaded, no work could be done until
> that page faulted in from disk.

This is interesting--I hadn't considered this as most of my work for the
past while has been on embedded systems with everything pinned in ram.

Have you benchmarked this? I was under the impression that the very
fastest webservers were still single-threaded using non-blocking io.

Chris

2002-09-25 18:59:38

[permalink] [raw]

Subject: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

On Wed, 25 Sep 2002, Mark Mielke wrote:

> I missed this one. Does this mean that fork() bombs will have limited
> effect on root? :-)

Indeed. A user can easily run 100 while(1) {} loops, but to the
other users in the system it'll feel just like 1 loop...

> I definately want this, even on my home machine. I've always found it
> to be a sort of fatal flaw that per-user resource scheduling did not
> exist on the platforms of my choice.

It has existed since 2.2.14 or so ;)

I just didn't get around to forward-porting it to newer 2.4
kernels, until last weekend.

cheers,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-25 18:54:56

[permalink] [raw]

Subject: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

On Tue, Sep 24, 2002 at 02:29:30PM -0300, Rik van Riel wrote:
> I just posted a patch for this and will upload the patch to
> my home page:
> Subject: [PATCH] per user scheduling for 2.4.19
> My patch also allows you to switch the per user scheduling
> on/off with /proc/sys/kernel/fairsched and has been tested
> on both UP and SMP.

I missed this one. Does this mean that fork() bombs will have limited
effect on root? :-)

I definately want this, even on my home machine. I've always found it
to be a sort of fatal flaw that per-user resource scheduling did not
exist on the platforms of my choice.

mark

--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/

2002-09-25 18:57:33

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

On Tue, 24 Sep 2002 17:10:17 -0400, Chris Friesen wrote:
>David Schwartz wrote:

>>The main reason I write multithreaded apps for single CPU systems is to
>>protect against ambush. Consider, for example, a web server. Someone sends
>>it
>>an obscure request that triggers some code that's never run before and has
>>to
>>fault in. If my application were single-threaded, no work could be done
>>until
>>that page faulted in from disk.

>This is interesting--I hadn't considered this as most of my work for the
>past while has been on embedded systems with everything pinned in ram.

In the usual case, the code faults in.

>Have you benchmarked this? I was under the impression that the very
>fastest webservers were still single-threaded using non-blocking io.

It's all about how you define "fastest". If speed means being able to do the
same thing over and over really quickly, yes. But I also want uniform
(non-bursty) performance in the face of an unpredictable set of jobs.

DS

2002-09-25 19:00:00

[permalink] [raw]

Subject: Re: [ANNOUNCE] Native POSIX Thread Library 0.1

>With Scheduler Activations this could also be avoided.
>The thread scheduler could get an upcall - but this will stay theory for
>a long
>time on Linux.
>But this is a somewhat far fetched example (for arguing for 1:1), isn't
>it?

No, it's not. I write high-performance servers and my main enemy is
burstiness. One significant cause of burstiness is code faulting in. This is
especially true because many of my servers support adding code to them
through user-supplies shared object files.

>There are other means of DoS..

I'm not talking about deliberate attempts at harming the server. These won't
work over and over because the code will fault in once and be in. I'm talking
about smooth performance in the face of unpredictable loads, and that means
not stalling on every page fault.

DS

2002-09-25 19:12:44

by Mark Veltzer

[permalink] [raw]

Subject: Re: Offtopic: (was Re: [ANNOUNCE] Native POSIX Thread Library 0.1)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wednesday 25 September 2002 22:04, you wrote:
> On Wed, 25 Sep 2002, Mark Mielke wrote:
> > I missed this one. Does this mean that fork() bombs will have limited
> > effect on root? :-)
>
> Indeed. A user can easily run 100 while(1) {} loops, but to the
> other users in the system it'll feel just like 1 loop...

Rik!

This is terrific!!! How come something like this was not merged in earlier
?!? This seems like an absolute neccesity!!! I'm willing to test it if that
is what is needed to get it merged. What does Linus and others feel about
this and most importantly when will see it in ? (Hopefully in this
development cycle).

Regards,
Mark
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.0.6 (GNU/Linux)
Comment: For info see http://www.gnupg.org

iD8DBQE9kg6SxlxDIcceXTgRApApAKClB60zgDs0OB1ltb2ha0Lo8cescgCfTVE7
ZiNKbiTAN78LecVGt6/JzPU=
=/z0y
-----END PGP SIGNATURE-----

2002-09-25 19:18:13