LinuxLists.cc - [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

the attached patch is against BK-curr, it removes some more old symbols
from ksyms.c. This makes the kernel compile with modules enabled.

Ingo

--- linux/kernel/ksyms.c.orig Sun Sep 29 21:12:50 2002
+++ linux/kernel/ksyms.c Sun Sep 29 21:13:24 2002
@@ -420,7 +420,6 @@
EXPORT_SYMBOL(del_timer_sync);
#endif
EXPORT_SYMBOL(mod_timer);
-EXPORT_SYMBOL(tvec_bases);

#ifdef CONFIG_SMP

@@ -589,12 +588,8 @@
EXPORT_SYMBOL(strsep);

/* software interrupts */
-EXPORT_SYMBOL(bh_task_vec);
-EXPORT_SYMBOL(init_bh);
-EXPORT_SYMBOL(remove_bh);
EXPORT_SYMBOL(tasklet_init);
EXPORT_SYMBOL(tasklet_kill);
-EXPORT_SYMBOL(__run_task_queue);
EXPORT_SYMBOL(do_softirq);
EXPORT_SYMBOL(raise_softirq);
EXPORT_SYMBOL(open_softirq);

2002-09-29 19:05:16

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Hi Ingo,

First of all, YES! I am going to start testing first thing tomorrow.

On Sun, Sep 29, 2002 at 07:52:17PM +0200, Ingo Molnar wrote:
>
> i've done the following cleanups/simplifications to task-queues:
>
> - removed the ability to define your own task-queue, what can be done is
> to schedule_task() a given task to keventd, and to flush all pending
> tasks.
>
> this is actually a quite easy transition, since 90% of all task-queue
> users in the kernel used BH_IMMEDIATE - which is very similar in
> functionality to keventd.

This is a problem I ran into in my "kill-BHs" project. I was wondering
if callbacks executed through keventd might have significantly
higher latency (and potential starvation) compared to IMMEDIATE_BH
driven task-queues and that might break existing drivers. Is this not
going to be an issue ?

> net_bh_lock: i have removed it, since it would synchronize to nothing. The
> old protocol handlers should still run on UP, and on SMP the kernel prints
> a warning upon use. Alexey, is this approach fine with you?

The cache line bouncing of global_bh_lock and net_bh_lock in
run_timer_tasklet() show up in our profiles, so getting rid of
them is a good thing (TM).

> scalable timers: i've further improved the patch ported to 2.5 by wli and
> Dipankar. There is only one pending issue i can see, the question of
> whether to migrate timers in mod_timer() or not. I'm quite convinced that
> they should be migrated, but i might be wrong. It's a 10 lines change to
> switch between migrating and non-migrating timers, we can do performance
> tests later on. The current, more complex migration code is pretty fast
> and has been stable under extremely high networking loads in the past 2
> years, so we can immediately switch to the simpler variant if someone
> proves it improves performance. (I'd say if non-migrating timers improve
> Apache performance on one of the bigger NUMA boxes then the point is
> proven, no further though will be needed.)

I will start testing this patch and will try to get you some numbers.
Thanks for taking this up.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-29 18:48:57

by Dave Jones

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Sun, Sep 29, 2002 at 07:52:17PM +0200, Ingo Molnar wrote:
>
> the attached patch is the smptimers patch plus the removal of old BHs and
> a rewrite of task-queue handling.

As an aside, some of the stuff in Documentation/ like Rusty's various
guides are now woefully out of date with whats happening in 2.5
Yet another small project for someone with too much time on their
hands would be to go through this, deleting the obsolete stuff, and
updating the locking documentation to reflect new issues like
preemption.

Dave

--
| Dave Jones. http://www.codemonkey.org.uk
| SuSE Labs

2002-09-29 19:12:27

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

yes, wrt. keventd i was thinking along the same line - but in a different,
perhaps cleaner and simpler direction.

i'd like to introduce the following interfaces:

- create_work_queue(wq, handler_fn)

- destroy_work_queue(wq)

- queue_work(wq, work_fn, work_data)

- flush_work_queue(wq)

this is an extension of the keventd concept. A work queue is a simplified
interface to create a kernel thread that gets work queued from IRQ and
process contexts. No more, no less.

there would be a number of 'default' work-queues that would be created
upon bootup:

- &irq_workqueue
- &io_workqueue

each work queue would get its own separate kernel thread. schedule_task()
would simply queue to the irq_workqueue. We could make the irq_workqueue's
kernel thread a highprio RT thread, to make it really softirq-alike. (Or
for the very specific case of BH_IMMEDIATE type of stricly IRQ-safe work,
we could add a tasklet that works down this queue.)

There's tons of code within the kernel that can be streamlined this way,
most of the helper threads do this kind of functionality. (I'll post a
patch soon to show how it would look like.)

Ingo

2002-09-29 19:43:31

by Jeff Garzik

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Ingo Molnar wrote:
> yes, wrt. keventd i was thinking along the same line - but in a different,
> perhaps cleaner and simpler direction.
>
> i'd like to introduce the following interfaces:
>
> - create_work_queue(wq, handler_fn)

what is handler_fn for, if you pass work_fn later?

> - destroy_work_queue(wq)
>
> - queue_work(wq, work_fn, work_data)

queue_work_delayed(wq, work_fn, work_data, delay) would be nice too

> - flush_work_queue(wq)
>
> this is an extension of the keventd concept. A work queue is a simplified
> interface to create a kernel thread that gets work queued from IRQ and
> process contexts. No more, no less.

Your proposal sounds good to me...

Jeff

2002-09-30 00:21:48

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Dipankar Sarma <[email protected]>
Date: Mon, 30 Sep 2002 00:45:59 +0530

> net_bh_lock: i have removed it, since it would synchronize to nothing. The
> old protocol handlers should still run on UP, and on SMP the kernel prints
> a warning upon use. Alexey, is this approach fine with you?

The cache line bouncing of global_bh_lock and net_bh_lock in
run_timer_tasklet() show up in our profiles, so getting rid of
them is a good thing (TM).

What ancient protocols are you running that make use of this?

IPv4 and IPv6 both do not use it at all. Even IPX, Appletalk, and
DecNET layers do not use it

2002-09-30 00:41:10

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Ingo Molnar <[email protected]>
Date: Sun, 29 Sep 2002 19:52:17 +0200 (CEST)

net_bh_lock: i have removed it, since it would synchronize to nothing. The
old protocol handlers should still run on UP, and on SMP the kernel prints
a warning upon use. Alexey, is this approach fine with you?

Just kill this crap completely. Old protocol handlers are %100
unsupported in 2.6

I know people are working on fixing up basically every old protocol
layer currently in the tree, so this will not be an issue.

When a "struct packet_type" is registered in dev_add_pack(), fail if
!pt->data which is the indication of "old protocol". Once all the
protocols are finished being fixed up, pt->data and this test can
die.

2002-09-30 04:23:47

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Sun, Sep 29, 2002 at 05:20:22PM -0700, David S. Miller wrote:
> From: Dipankar Sarma <[email protected]>
> Date: Mon, 30 Sep 2002 00:45:59 +0530
>
> > net_bh_lock: i have removed it, since it would synchronize to nothing. The
> > old protocol handlers should still run on UP, and on SMP the kernel prints
> > a warning upon use. Alexey, is this approach fine with you?
>
> The cache line bouncing of global_bh_lock and net_bh_lock in
> run_timer_tasklet() show up in our profiles, so getting rid of
> them is a good thing (TM).
>
> What ancient protocols are you running that make use of this?

I wasn't running any old protocols. It was a problem I faced
with my port of smptimers - I serialized
wrt BHs and old protocols using global_bh_lock and net_bh_lock (exported
it globally) respectively in the per-cpu tasklet that runs timers.
So, the spin_trylock() in run_timer_tasklet() would modify the
lock cache line and hence the bouncing. Getting rid of BHs and
old protocol serialization avoids this as in Ingo's latest patch.

>
> IPv4 and IPv6 both do not use it at all. Even IPX, Appletalk, and
> DecNET layers do not use it

This is the list, I think, by looking at packet_types -

802/psnap.c
appletalk/ddp.c
ax25/af_ax25.c
core/ext8022.c
econet/af_econet.c
irda/irsyms.c
x25/af_x25.c

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-30 04:30:00

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Dipankar Sarma <[email protected]>
Date: Mon, 30 Sep 2002 10:03:17 +0530

This is the list, I think, by looking at packet_types -

You look at old sources then, in current 2.5.x psnap.c and ext8022.c
are taken care of by Arnaldo's LLC stack. He is hacking x25 as well.
Ralf Baechle is doing ax25/rose/etc. radio layers. Appletalk should
be sane in 2.5.x as well.

2002-09-30 04:33:21

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Em Mon, Sep 30, 2002 at 10:03:17AM +0530, Dipankar Sarma escreveu:
> This is the list, I think, by looking at packet_types -

Don't bother with snap, that one is just a dummy packet type, its not
even registered via dev_add_pack
> 802/psnap.c

I'm working on Appletalk, will be fixed after X.25, humm, in fact Appletalk
only uses SNAP on Ethernet, so it is only broken for ppptalk and ltalk, does
anybody still uses these later two?

> appletalk/ddp.c

Ralf B?chle is working on ax25 (and its clients: ROSE and NETROM)
> ax25/af_ax25.c

This doesn't exists anymore, what kernel are you looking at?
> core/ext8022.c

Nobody working on this, as far as I know
> econet/af_econet.c

Last I heard Alexey was working on fixing irda
> irda/irsyms.c

I'm working on this one now
> x25/af_x25.c

- Arnaldo

2002-09-30 05:51:29

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Sun, 29 Sep 2002, David S. Miller wrote:

> From: Ingo Molnar <[email protected]>
> Date: Sun, 29 Sep 2002 19:52:17 +0200 (CEST)
>
> net_bh_lock: i have removed it, since it would synchronize to nothing. The
> old protocol handlers should still run on UP, and on SMP the kernel prints
> a warning upon use. Alexey, is this approach fine with you?
>
> Just kill this crap completely. Old protocol handlers are %100
> unsupported in 2.6
>
> I know people are working on fixing up basically every old protocol
> layer currently in the tree, so this will not be an issue.

wonderful. I thought it might have helped if we kep those old-protocol
callbacks around in the UP kernel still (to help converting stuff) - but
if this is being taken care of then great.

Ingo

2002-09-30 12:45:33

by Alan

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, 2002-09-30 at 05:38, Arnaldo Carvalho de Melo wrote:
> I'm working on Appletalk, will be fixed after X.25, humm, in fact Appletalk
> only uses SNAP on Ethernet, so it is only broken for ppptalk and ltalk, does
> anybody still uses these later two?

ppptalk is relevant to the modern world, localtalk is basically for
talking to old macintoshes many of which don't have any capability for
ethernet. I don't think either of them are even going to be performance
matters.

> Nobody working on this, as far as I know
> > econet/af_econet.c

Ancient BBC micro protocol, could probably be done just as well in user
space.

Alan

2002-09-30 14:45:50

by Arnaldo Carvalho de Melo

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Em Mon, Sep 30, 2002 at 01:55:56PM +0100, Alan Cox escreveu:
> On Mon, 2002-09-30 at 05:38, Arnaldo Carvalho de Melo wrote:
> > I'm working on Appletalk, will be fixed after X.25, humm, in fact Appletalk
> > only uses SNAP on Ethernet, so it is only broken for ppptalk and ltalk, does
> > anybody still uses these later two?
>
> ppptalk is relevant to the modern world, localtalk is basically for
> talking to old macintoshes many of which don't have any capability for
> ethernet. I don't think either of them are even going to be performance
> matters.

OK, but even those will be taken care of, as the changes had to be done anyway
for SNAP, so I'll just stick the (void*)1 to its packet_types.

> > Nobody working on this, as far as I know
> > > econet/af_econet.c
>
> Ancient BBC micro protocol, could probably be done just as well in user
> space.

As some of the other protocols, but at this point it may well be easier to
fix it in the kernel where it sits 8)

Oh, dang, I forgot that these other protocols can work on fast lines these
days 8)

- Arnaldo

2002-09-30 16:22:58

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, 30 Sep 2002, Christoph Hellwig wrote:

> But it breaks XFS. Chris Wedgwood fixed it to use schedule_task()
> instead (and I cleaned it up a littler more, see patch below), but this
> does effectively simgle-thead XFS I/O completion.

see the workqueues patch i posted a couple of minutes ago. Does this solve
XFS's problems?

> Altennatively we could allow kernel code to create it's own kevends with
> associated task-queues, but that sounds rather ugly..

why is it ugly? I can add a simple interface to the workqueues subsystem
that will bind the XFS worker threads to given sets of CPUs. That should
give you per-CPU workqueues, with separate per-CPU locking.

Ingo

2002-09-30 16:25:44

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, Sep 30, 2002 at 06:38:03PM +0200, Ingo Molnar wrote:
> see the workqueues patch i posted a couple of minutes ago. Does this solve
> XFS's problems?

Not exactly. All your work on one queue is internally serialize. An
totally unserialized workqueue would be best for XFS.

> why is it ugly? I can add a simple interface to the workqueues subsystem
> that will bind the XFS worker threads to given sets of CPUs. That should
> give you per-CPU workqueues, with separate per-CPU locking.

That would also work, but would require more code in XFS than my
above suggestion.

2002-09-30 16:18:52

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Sun, Sep 29, 2002 at 07:52:17PM +0200, Ingo Molnar wrote:
> i've done the following cleanups/simplifications to task-queues:
>
> - removed the ability to define your own task-queue, what can be done is
> to schedule_task() a given task to keventd, and to flush all pending
> tasks.
>
> this is actually a quite easy transition, since 90% of all task-queue
> users in the kernel used BH_IMMEDIATE - which is very similar in
> functionality to keventd.

But it breaks XFS. Chris Wedgwood fixed it to use schedule_task()
instead (and I cleaned it up a littler more, see patch below), but
this does effectively simgle-thead XFS I/O completion.

Why? XFS used to have a per-cpu task-queue, emptied by a per-cpu
kernel thread. To get anywhere near the per-formance again we'd
need to merge the per-cpu keventd patch Arjan did for RH 2.1AS into
2.5. Altennatively we could allow kernel code to create it's own
kevends with associated task-queues, but that sounds rather ugly..

diff -uNr -p linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.c linux-2.5.39/fs/xfs/pagebuf/page_buf.c
--- linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.c Mon Sep 30 14:27:25 2002
+++ linux-2.5.39/fs/xfs/pagebuf/page_buf.c Mon Sep 30 16:49:06 2002
@@ -160,8 +160,6 @@ pb_tracking_free(

STATIC kmem_cache_t *pagebuf_cache;
STATIC pagebuf_daemon_t *pb_daemon;
-STATIC struct list_head pagebuf_iodone_tq[NR_CPUS];
-STATIC wait_queue_head_t pagebuf_iodone_wait[NR_CPUS];
STATIC void pagebuf_daemon_wakeup(int);

/*
@@ -1157,15 +1155,6 @@ _pagebuf_wait_unpin(
current->state = TASK_RUNNING;
}

-void
-pagebuf_queue_task(
- struct tq_struct *task)
-{
- queue_task(task, &pagebuf_iodone_tq[smp_processor_id()]);
- wake_up(&pagebuf_iodone_wait[smp_processor_id()]);
-}
-
-
/*
* Buffer Utility Routines
*/
@@ -1210,9 +1199,8 @@ pagebuf_iodone(
INIT_TQUEUE(&pb->pb_iodone_sched,
pagebuf_iodone_sched, (void *)pb);

- queue_task(&pb->pb_iodone_sched,
- &pagebuf_iodone_tq[smp_processor_id()]);
- wake_up(&pagebuf_iodone_wait[smp_processor_id()]);
+ schedule_task(&pb->pb_iodone_sched);
+
} else {
up(&pb->pb_iodonesema);
}
@@ -1666,62 +1654,6 @@ pagebuf_delwri_dequeue(
spin_unlock(&pb_daemon->pb_delwrite_lock);
}

-
-/*
- * The pagebuf iodone daemon
- */
-
-STATIC int pb_daemons[NR_CPUS];
-
-STATIC int
-pagebuf_iodone_daemon(
- void *__bind_cpu)
-{
- int cpu = (long) __bind_cpu;
- DECLARE_WAITQUEUE (wait, current);
-
- /* Set up the thread */
- daemonize();
-
- /* Avoid signals */
- spin_lock_irq(&current->sig->siglock);
- sigfillset(&current->blocked);
- recalc_sigpending();
- spin_unlock_irq(&current->sig->siglock);
-
- /* Migrate to the right CPU */
- set_cpus_allowed(current, 1UL << cpu);
- if (smp_processor_id() != cpu)
- BUG();
-
- sprintf(current->comm, "pagebuf_io_CPU%d", cpu);
- INIT_LIST_HEAD(&pagebuf_iodone_tq[cpu]);
- init_waitqueue_head(&pagebuf_iodone_wait[cpu]);
- __set_current_state(TASK_INTERRUPTIBLE);
- mb();
-
- pb_daemons[cpu] = 1;
-
- for (;;) {
- add_wait_queue(&pagebuf_iodone_wait[cpu],
- &wait);
-
- if (TQ_ACTIVE(pagebuf_iodone_tq[cpu]))
- __set_task_state(current, TASK_RUNNING);
- schedule();
- remove_wait_queue(&pagebuf_iodone_wait[cpu],
- &wait);
- run_task_queue(&pagebuf_iodone_tq[cpu]);
- if (pb_daemons[cpu] == 0)
- break;
- __set_current_state(TASK_INTERRUPTIBLE);
- }
-
- pb_daemons[cpu] = -1;
- wake_up_interruptible(&pagebuf_iodone_wait[cpu]);
- return 0;
-}
-
/* Defines for pagebuf daemon */
DECLARE_WAIT_QUEUE_HEAD(pbd_waitq);
STATIC int force_flush;
@@ -1907,8 +1839,6 @@ STATIC int
pagebuf_daemon_start(void)
{
if (!pb_daemon) {
- int cpu;
-
pb_daemon = (pagebuf_daemon_t *)
kmalloc(sizeof(pagebuf_daemon_t), GFP_KERNEL);
if (!pb_daemon) {
@@ -1924,19 +1854,6 @@ pagebuf_daemon_start(void)

kernel_thread(pagebuf_daemon, (void *)pb_daemon,
CLONE_FS|CLONE_FILES|CLONE_VM);
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
- if (!cpu_online(cpu))
- continue;
- if (kernel_thread(pagebuf_iodone_daemon,
- (void *)(long) cpu,
- CLONE_FS|CLONE_FILES|CLONE_VM) < 0) {
- printk("pagebuf_daemon_start failed\n");
- } else {
- while (!pb_daemons[cpu]) {
- yield();
- }
- }
- }
}
return 0;
}
@@ -1950,24 +1867,12 @@ STATIC void
pagebuf_daemon_stop(void)
{
if (pb_daemon) {
- int cpu;
-
pb_daemon->active = 0;
pb_daemon->io_active = 0;

wake_up_interruptible(&pbd_waitq);
while (pb_daemon->active == 0) {
interruptible_sleep_on(&pbd_waitq);
- }
- for (cpu = 0; cpu < NR_CPUS; cpu++) {
- if (!cpu_online(cpu))
- continue;
- pb_daemons[cpu] = 0;
- wake_up(&pagebuf_iodone_wait[cpu]);
- while (pb_daemons[cpu] != -1) {
- interruptible_sleep_on(
- &pagebuf_iodone_wait[cpu]);
- }
}

kfree(pb_daemon);
diff -uNr -p linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.h linux-2.5.39/fs/xfs/pagebuf/page_buf.h
--- linux-2.5.39-mm1/fs/xfs/pagebuf/page_buf.h Mon Sep 30 14:13:46 2002
+++ linux-2.5.39/fs/xfs/pagebuf/page_buf.h Mon Sep 30 16:15:25 2002
@@ -324,9 +324,6 @@ extern void pagebuf_unlock( /* unlock b

#define pagebuf_geterror(pb) ((pb)->pb_error)

-extern void pagebuf_queue_task(
- struct tq_struct *);
-
extern void pagebuf_iodone( /* mark buffer I/O complete */
page_buf_t *); /* buffer to mark */

diff -uNr -p linux-2.5.39-mm1/fs/xfs/xfs_log.c linux-2.5.39/fs/xfs/xfs_log.c
--- linux-2.5.39-mm1/fs/xfs/xfs_log.c Mon Sep 30 14:15:55 2002
+++ linux-2.5.39/fs/xfs/xfs_log.c Mon Sep 30 16:15:25 2002
@@ -2779,7 +2779,7 @@ xlog_state_release_iclog(xlog_t *log,
case 0:
return xlog_sync(log, iclog, 0);
case 1:
- pagebuf_queue_task(&iclog->ic_write_sched);
+ schedule_task(&iclog->ic_write_sched);
}
}
return (0);

2002-09-30 16:57:54

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, 30 Sep 2002, Christoph Hellwig wrote:

> On Mon, Sep 30, 2002 at 06:38:03PM +0200, Ingo Molnar wrote:
> > see the workqueues patch i posted a couple of minutes ago. Does this solve
> > XFS's problems?
>
> Not exactly. All your work on one queue is internally serialize. An
> totally unserialized workqueue would be best for XFS.

you can create as many queues as you wish - one per CPU for example. Or
one per mounted fs per CPU.

Ingo

2002-09-30 17:23:40

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, Sep 30, 2002 at 07:12:54PM +0200, Ingo Molnar wrote:
> > Not exactly. All your work on one queue is internally serialize. An
> > totally unserialized workqueue would be best for XFS.
>
> you can create as many queues as you wish - one per CPU for example. Or
> one per mounted fs per CPU.

Yeah. But adding a create_workqueue_per_cpu that has one queue and thead
per cpu to which queue_work dispatches would centralize the code needed
to manage that in one place instead of duplicating it over and over.

Sure both works, but IMHO hiding it behind a nice abstraction is much
better.

2002-09-30 21:41:27

by George Anzinger

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Ingo Molnar wrote:
>
> the attached patch is the smptimers patch plus the removal of old BHs and
> a rewrite of task-queue handling.
>
~snip
>
> scalable timers: i've further improved the patch ported to 2.5 by wli and
> Dipankar. There is only one pending issue i can see, the question of
> whether to migrate timers in mod_timer() or not. I'm quite convinced that
> they should be migrated, but i might be wrong. It's a 10 lines change to
> switch between migrating and non-migrating timers, we can do performance
> tests later on. The current, more complex migration code is pretty fast
> and has been stable under extremely high networking loads in the past 2
> years, so we can immediately switch to the simpler variant if someone
> proves it improves performance. (I'd say if non-migrating timers improve
> Apache performance on one of the bigger NUMA boxes then the point is
> proven, no further though will be needed.)

As the APIC timers are currently set up they are
undisciplined WRT the PIT which is still used to drive the
clock. This means that, since this patch drives the
"run_timer_list" code from the APIC timers, the actual delay
in timer servicing from the requested time will vary with
a.) the cpu (since each cpu is set up to have its timer
expire at a different time within the 1/HZ tick) and b.)
over time as the PIT and the APIC clocks drift. This may be
acceptable with 1/HZ timer resolution (however I don't
really think it is), but it is in no way acceptable WRT high
resolution timers.

The solution I would suggest is to disciplined the APIC
clocks. They _should_ be set up to interrupt as soon after
a PIT interrupt as possible and they should all do so at the
same time if we are to avoid timer (not time, actual time
keeping is not in question here) glitches when moving from
one cpu to another. Further, checks for drift need to be in
place to "pull" the APIC timer into sync when it drifts.

I had similar problems in the high-res-timers keeping the
PIT synched with the TSC or the pm timer. It is do able.
>
~snip
>
> Ingo

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-01 03:36:25

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, 30 Sep 2002, george anzinger wrote:

> As the APIC timers are currently set up they are undisciplined WRT the
> PIT which is still used to drive the clock. This means that, since this
> patch drives the "run_timer_list" code from the APIC timers, the actual
> delay in timer servicing from the requested time will vary with a.) the
> cpu (since each cpu is set up to have its timer expire at a different
> time within the 1/HZ tick) and b.) over time as the PIT and the APIC
> clocks drift. This may be acceptable with 1/HZ timer resolution
> (however I don't really think it is), but it is in no way acceptable WRT
> high resolution timers.

the smp_processor_id()/(HZ*num_cpus) 'interleaving' of every APIC clock
was an SMP scalability issue, and it was done as part of the smptimers
patch. It just got into the kernel much earlier.

but these days, with the removal of BHs, it might be less of a factor,
mainly because timers have no global synchronization anymore, so we can
again try to not interleave the APIC clocks. Only testing will tell,
because there might be some interaction between timer-generated code
still.

Dipankar, wli, would it be possible to try the attached simple patch with
some of the more complex networking loads? The patch gets rid of the APIC
timer interleaving.

Ingo

--- arch/i386/kernel/apic.c.orig Tue Oct 1 05:47:34 2002
+++ arch/i386/kernel/apic.c Tue Oct 1 05:49:23 2002
@@ -813,24 +813,9 @@

static void setup_APIC_timer(unsigned int clocks)
{
- unsigned int slice, t0, t1;
unsigned long flags;
- int delta;

- local_save_flags(flags);
- local_irq_enable();
- /*
- * ok, Intel has some smart code in their APIC that knows
- * if a CPU was in 'hlt' lowpower mode, and this increases
- * its APIC arbitration priority. To avoid the external timer
- * IRQ APIC event being in synchron with the APIC clock we
- * introduce an interrupt skew to spread out timer events.
- *
- * The number of slices within a 'big' timeslice is NR_CPUS+1
- */
-
- slice = clocks / (NR_CPUS+1);
- printk("cpu: %d, clocks: %d, slice: %d\n", smp_processor_id(), clocks, slice);
+ local_irq_save(flags);

/*
* Wait for IRQ0's slice:
@@ -839,22 +824,6 @@

__setup_APIC_LVTT(clocks);

- t0 = apic_read(APIC_TMICT)*APIC_DIVISOR;
- /* Wait till TMCCT gets reloaded from TMICT... */
- do {
- t1 = apic_read(APIC_TMCCT)*APIC_DIVISOR;
- delta = (int)(t0 - t1 - slice*(smp_processor_id()+1));
- } while (delta >= 0);
- /* Now wait for our slice for real. */
- do {
- t1 = apic_read(APIC_TMCCT)*APIC_DIVISOR;
- delta = (int)(t0 - t1 - slice*(smp_processor_id()+1));
- } while (delta < 0);
-
- __setup_APIC_LVTT(clocks);
-
- printk("CPU%d<T0:%d,T1:%d,D:%d,S:%d,C:%d>\n", smp_processor_id(), t0, t1, delta, slice, clocks);
-
local_irq_restore(flags);
}

2002-10-01 04:20:20

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

From: Ingo Molnar <[email protected]>
Date: Tue, 1 Oct 2002 05:51:45 +0200 (CEST)

- /*
- * ok, Intel has some smart code in their APIC that knows
- * if a CPU was in 'hlt' lowpower mode, and this increases
- * its APIC arbitration priority. To avoid the external timer
- * IRQ APIC event being in synchron with the APIC clock we
- * introduce an interrupt skew to spread out timer events.
- *
- * The number of slices within a 'big' timeslice is NR_CPUS+1
- */
-

I did some thinking, and I don't understand how this old code
can be legal. Doesn't this make do_gettimeofday() inaccurate?

I must be missing something.

2002-10-01 05:10:02

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Mon, 30 Sep 2002, David S. Miller wrote:

> I did some thinking, and I don't understand how this old code can be
> legal. Doesn't this make do_gettimeofday() inaccurate?

it's a mostly bogus comment, dont think about it too much.

Ingo

2002-10-01 05:22:11

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

On Tue, Oct 01, 2002 at 05:51:45AM +0200, Ingo Molnar wrote:
> the smp_processor_id()/(HZ*num_cpus) 'interleaving' of every APIC clock
> was an SMP scalability issue, and it was done as part of the smptimers
> patch. It just got into the kernel much earlier.
>
> but these days, with the removal of BHs, it might be less of a factor,
> mainly because timers have no global synchronization anymore, so we can
> again try to not interleave the APIC clocks. Only testing will tell,
> because there might be some interaction between timer-generated code
> still.

Yes, with earlier versions of smptimers where global_bh_lock was
still being acquired to serialize with BHs, local timer clocks needed
to be spaced over a HZ to reduce contention. Archs that didn't space
the clocks perfromed poorly with smptimers as Anton found out with
ppc64 and had to change.

>
> Dipankar, wli, would it be possible to try the attached simple patch with
> some of the more complex networking loads? The patch gets rid of the APIC
> timer interleaving.

I will give it a spin.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-10-01 07:55:38

by George Anzinger

[permalink] [raw]

Subject: Re: [patch] smptimers, old BH removal, tq-cleanup, 2.5.39

Ingo Molnar wrote:
>
> On Mon, 30 Sep 2002, David S. Miller wrote:
>
> > I did some thinking, and I don't understand how this old code can be
> > legal. Doesn't this make do_gettimeofday() inaccurate?
>
> it's a mostly bogus comment, dont think about it too much.
>
> Ingo
Actually gettimeofday is fine. It does not depend on the
timer interrupt, but only on one happening every once in a
while. It make up any latency by using the TSC and time of
last interrupt and is in no way dependent on any latency in
the run_timer_list.

By the way, I have been lately impressed with the relative
amount of time a cli/sti takes and have wondered if we might
not get a "nice" improvement in system performance by moving
the xtime read/write lock from an irq to a bh lock. This
would avoid the cli/sti in the read lock which is needed to
read the time. All this should take (aside from finding all
the locks) is to move the write access of xtime to the bh
code. Since it does nothing if timer_jiffie == jiffie, it
does not hurt to call it from each cpu. The timer interrupt
would then bump a shadow jiffie which would not appear in
jiffies until the bh code runs.

On finding the locks, I suggest abstracting them to a macro,
thus allowing the change to be made only one place. Of
course, we need to change the name of the lock to enlist the
compilers help in finding them. But then if we are sure
this is the way to go, there is no need for the macro.
--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-10-03 06:59:10