2002-07-01 04:03:31

by Matthew Wilcox

[permalink] [raw]
Subject: [RFC] BH removal text


I'm soliciting comments before people start implementing these things.
Please, do NOT start changing anything based on the instructions given
below. I do intend to update the floppy.c patch to fix the problems
I mentioned below, but I'm going to sleep first.

PRERELEASE VERSION 2002-06-30-01

A janitor's guide to removing bottom halves
===========================================

First, ignore the serial devices. They're being taken care of
independently.

Apart from these, we use 3 bottom halves currently. IMMEDIATE_BH,
TIMER_BH and TQUEUE_BH. There is a spinlock (global_bh_lock) which
is held when running any of these three bottom halves, so none of them
can run at the same time. IMMEDIATE_BH runs the immediate task queue
(tq_immediate). TQUEUE_BH runs the timer task queue (tq_timer).
TIMER_BH first calls update_times(), then runs the timer list.

What does all that mean?
------------------------

Right now, the kernel guarantees it will only enter your driver (or
indeed any user, but we're mostly concerned with drivers) through one of
these entry points at a time. If we get rid of bottom halves, we will be
able to enter a driver simultaneously through any active timer routine,
any active immediate task routine and any active timer task routine.

So how do we modify drivers?
----------------------------

I am of the opinion that we should remove tq_immediate entirely.
Every current user of it should be converted to use a private tasklet.
Example code for floppy.c to show how to do this can be found at:
http://ftp.linux.org.uk/pub/linux/willy/patches/floppy.diff
Note that this patch is BROKEN. There's no locking to prevent any of our
timers from being called at the same time as our tasklet. See below ...

Some of the users of tq_timer should probably be converted to
schedule_task so they run in a user context rather than interrupt context.
But there will always be a need for a task queue to be run in interrupt
context after a fixed period of time has elapsed in order to allow
for interrupt mitigation. I think a better interface should be used
for tq_timer anyway -- I will be proposing a queue_timer_task() macro.
We can use conversion to this interface as a flag to indicate that a
driver has been checked for SMP locking.

The same thing goes for add_timer users, except that there's no better
interface that I want to convert drivers to. So a comment beside the
add_timer usage indicating that you've checked the locking and it's OK
is helpful.

So how should we do the locking?
--------------------------------

Notice that right now we use a spinlock whenever we call any of these
entry points. So it should be safe to declare a new spinlock within
this driver and acquire/release it at entry & exit to each of these
types of functions. It's easier than converting drivers which use the
BKL because they might sleep or acquire it twice. Be wary of reusing an
existing spinlock because it might be acquired from interrupt context,
so you'd have to use spin_lock_irq to acquire it in other contexts.

Of course, that's the lazy way of doing it. What I'm hoping is that each
Janitor will take a driver and spend a week checking over its locking.
There's only 80 files in the kernel which use tq_immediate; with 10
Janitors involved, that's 8 drivers each -- that's only 2 months and we
have 4.

That doesn't mean that we shouldn't worry about the 38 files which use
tq_timer, but they are almost all tty related and are therefore Hard ;-)

--
Revolutions do not require corporate support.


2002-07-01 11:39:14

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Monday 01 July 2002 06:05, Matthew Wilcox wrote:

> Of course, that's the lazy way of doing it. What I'm hoping is that each
> Janitor will take a driver and spend a week checking over its locking.
> There's only 80 files in the kernel which use tq_immediate; with 10
> Janitors involved, that's 8 drivers each -- that's only 2 months and we
> have 4.

I suppose mine are the 8 drivers in drivers/s390 then :-).
I'm already working on them to support the new LDM and I know the
maintainers.

Arnd <><

2002-07-03 07:19:27

by George Anzinger

[permalink] [raw]
Subject: Re: [RFC] BH removal text

Matthew Wilcox wrote:
>
> I'm soliciting comments before people start implementing these things.
> Please, do NOT start changing anything based on the instructions given
> below. I do intend to update the floppy.c patch to fix the problems
> I mentioned below, but I'm going to sleep first.
>
> PRERELEASE VERSION 2002-06-30-01
>
> A janitor's guide to removing bottom halves
> ===========================================
>
> First, ignore the serial devices. They're being taken care of
> independently.
>
> Apart from these, we use 3 bottom halves currently. IMMEDIATE_BH,
> TIMER_BH and TQUEUE_BH. There is a spinlock (global_bh_lock) which
> is held when running any of these three bottom halves, so none of them
> can run at the same time. IMMEDIATE_BH runs the immediate task queue
> (tq_immediate). TQUEUE_BH runs the timer task queue (tq_timer).
> TIMER_BH first calls update_times(), then runs the timer list.

It should also be noted that none of these is entered if a
cli is in effect. This is the global cli and inhibits BHs
on all cpus.

-g

>
> What does all that mean?
> ------------------------
>
> Right now, the kernel guarantees it will only enter your driver (or
> indeed any user, but we're mostly concerned with drivers) through one of
> these entry points at a time. If we get rid of bottom halves, we will be
> able to enter a driver simultaneously through any active timer routine,
> any active immediate task routine and any active timer task routine.
>
> So how do we modify drivers?
> ----------------------------
>
> I am of the opinion that we should remove tq_immediate entirely.
> Every current user of it should be converted to use a private tasklet.
> Example code for floppy.c to show how to do this can be found at:
> http://ftp.linux.org.uk/pub/linux/willy/patches/floppy.diff
> Note that this patch is BROKEN. There's no locking to prevent any of our
> timers from being called at the same time as our tasklet. See below ...
>
> Some of the users of tq_timer should probably be converted to
> schedule_task so they run in a user context rather than interrupt context.
> But there will always be a need for a task queue to be run in interrupt
> context after a fixed period of time has elapsed in order to allow
> for interrupt mitigation. I think a better interface should be used
> for tq_timer anyway -- I will be proposing a queue_timer_task() macro.
> We can use conversion to this interface as a flag to indicate that a
> driver has been checked for SMP locking.
>
> The same thing goes for add_timer users, except that there's no better
> interface that I want to convert drivers to. So a comment beside the
> add_timer usage indicating that you've checked the locking and it's OK
> is helpful.
>
> So how should we do the locking?
> --------------------------------
>
> Notice that right now we use a spinlock whenever we call any of these
> entry points. So it should be safe to declare a new spinlock within
> this driver and acquire/release it at entry & exit to each of these
> types of functions. It's easier than converting drivers which use the
> BKL because they might sleep or acquire it twice. Be wary of reusing an
> existing spinlock because it might be acquired from interrupt context,
> so you'd have to use spin_lock_irq to acquire it in other contexts.
>
> Of course, that's the lazy way of doing it. What I'm hoping is that each
> Janitor will take a driver and spend a week checking over its locking.
> There's only 80 files in the kernel which use tq_immediate; with 10
> Janitors involved, that's 8 drivers each -- that's only 2 months and we
> have 4.
>
> That doesn't mean that we shouldn't worry about the 38 files which use
> tq_timer, but they are almost all tty related and are therefore Hard ;-)
>
> --
> Revolutions do not require corporate support.
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-03 11:13:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Wed, Jul 03, 2002 at 12:21:26AM -0700, george anzinger wrote:
> It should also be noted that none of these is entered if a
> cli is in effect. This is the global cli and inhibits BHs
> on all cpus.

global cli() is so heaviy deprecated, it isn't even funny. i know
mingo has a patch to remove it entirely. if you want to ensure that
no softirq/tasklet/timer/... is run, use spin_lock_bh(). I'll add this
information, thanks!

--
Revolutions do not require corporate support.

2002-07-14 01:03:22

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Mon, Jul 01, 2002 at 05:05:55AM +0100, Matthew Wilcox wrote:
> That doesn't mean that we shouldn't worry about the 38 files which use
> tq_timer, but they are almost all tty related and are therefore Hard ;-)

__global_cli(), timer_bh(), and bh_action() are crippling my machines.

Where do I start?


Cheers,
Bill

2002-07-14 04:47:18

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Sat, Jul 13, 2002 at 06:05:06PM -0700, William Lee Irwin III wrote:
> On Mon, Jul 01, 2002 at 05:05:55AM +0100, Matthew Wilcox wrote:
> > That doesn't mean that we shouldn't worry about the 38 files which use
> > tq_timer, but they are almost all tty related and are therefore Hard ;-)
>
> __global_cli(), timer_bh(), and bh_action() are crippling my machines.
>
> Where do I start?

Even if you replace timemr_bh() with a tasklet, you still need
to take the global_bh_lock to ensure that timers don't race with
single-threaded BH processing in drivers. I wrote this patch [included]
to get rid of timer_bh in Ingo's smptimers, but it acquires
global_bh_lock as well as net_bh_lock, the latter to ensure
that some older protocol code that expected serialization of
NET_BH and timers work correctly (see deliver_to_old_ones()).
They need to be cleaned up too.

My patch of course was experimental to see what is needed to
get rid of timer_bh. It needs some cleanup itself ;-)

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.


Attachments:
(No filename) (1.09 kB)
smptimers_X1-2.5.24-1.patch (38.02 kB)
Download all attachments

2002-07-14 10:15:52

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Mon, Jul 01, 2002 at 05:05:55AM +0100, Matthew Wilcox wrote:
>>> That doesn't mean that we shouldn't worry about the 38 files which use
>>> tq_timer, but they are almost all tty related and are therefore Hard ;-)

On Sat, Jul 13, 2002 at 06:05:06PM -0700, William Lee Irwin III wrote:
>> __global_cli(), timer_bh(), and bh_action() are crippling my machines.
>> Where do I start?

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> Even if you replace timemr_bh() with a tasklet, you still need
> to take the global_bh_lock to ensure that timers don't race with
> single-threaded BH processing in drivers. I wrote this patch [included]
> to get rid of timer_bh in Ingo's smptimers, but it acquires
> global_bh_lock as well as net_bh_lock, the latter to ensure
> that some older protocol code that expected serialization of
> NET_BH and timers work correctly (see deliver_to_old_ones()).
> They need to be cleaned up too.

This is great stuff. I'll definitely try it out in an hour or two. I'd
be interested in helping with the cleanup of the things assuming the BH
things still exist but might need a wee bit of hand-holding to get
through it. I'll go around flagging people down who might be able to
help me with it as I go.

I actually suspect tty-related things are a likely culprit as
significant use of the serial console occurs.


On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> My patch of course was experimental to see what is needed to
> get rid of timer_bh. It needs some cleanup itself ;-)


I can at least try it out. The BH stuff is legacy, so killing it
entirely at some point makes sense. I did volunteer to help with
this at OLS, so I'll be delivering code at some point.


Cheers,
Bill

2002-07-15 09:18:35

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Sun, Jul 14, 2002 at 03:17:30AM -0700, William Lee Irwin III wrote:
> On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> > Even if you replace timemr_bh() with a tasklet, you still need
> > to take the global_bh_lock to ensure that timers don't race with
> > single-threaded BH processing in drivers. I wrote this patch [included]
> > to get rid of timer_bh in Ingo's smptimers, but it acquires
> > global_bh_lock as well as net_bh_lock, the latter to ensure
> > that some older protocol code that expected serialization of
> > NET_BH and timers work correctly (see deliver_to_old_ones()).
> > They need to be cleaned up too.
>
> This is great stuff. I'll definitely try it out in an hour or two. I'd
> be interested in helping with the cleanup of the things assuming the BH
> things still exist but might need a wee bit of hand-holding to get
> through it. I'll go around flagging people down who might be able to
> help me with it as I go.

I did a quick and dirty search on packet_type.data == NULL protocols.
Here is a list -

802/psnap.c
appletalk/ddp.c
ax25/af_ax25.c
core/ext8022.c
econet/af_econet.c
irda/irsyms.c
x25/af_x25.c

These need to be made safe for a non-BH based timer. I guess
the current code assumes serialization between timer and
BH context code due to the use of now-defunct NET_BH.

>
> I actually suspect tty-related things are a likely culprit as
> significant use of the serial console occurs.

It should also be possible to make minimal non-smptimers
bhless_timer patch - just in case smptimers isn't going in
any time soon. It will run a timer tasklet off of do_timer().
The tasklet handler still has to grab global_bh_lock and
the likes to keep the tty and other drivers that expect
serialization BH and timers or use __global_cli, happy.
Will such a patch be useful ?

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-07-15 10:16:17

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Sun, Jul 14, 2002 at 03:17:30AM -0700, William Lee Irwin III wrote:
>> I actually suspect tty-related things are a likely culprit as
>> significant use of the serial console occurs.

On Mon, Jul 15, 2002 at 02:55:21PM +0530, Dipankar Sarma wrote:
> It should also be possible to make minimal non-smptimers
> bhless_timer patch - just in case smptimers isn't going in
> any time soon. It will run a timer tasklet off of do_timer().
> The tasklet handler still has to grab global_bh_lock and
> the likes to keep the tty and other drivers that expect
> serialization BH and timers or use __global_cli, happy.
> Will such a patch be useful ?

The temporary "hangs" are so bad any way to mitigate this horrible
problem will be useful. The machine is stuck so long in this stuff
literal network timeouts occur. It's insanely bad, this is really
beyond the scope of a performance problem and into the realm of an
out-and-out bug. A machine stuck for that long in this code is
effectively dead. I'm just slightly more patient than average users.


Cheers,
Bill

2002-07-17 23:54:55

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> Even if you replace timemr_bh() with a tasklet, you still need
> to take the global_bh_lock to ensure that timers don't race with
> single-threaded BH processing in drivers. I wrote this patch [included]
> to get rid of timer_bh in Ingo's smptimers, but it acquires
> global_bh_lock as well as net_bh_lock, the latter to ensure
> that some older protocol code that expected serialization of
> NET_BH and timers work correctly (see deliver_to_old_ones()).
> They need to be cleaned up too.
> My patch of course was experimental to see what is needed to
> get rid of timer_bh. It needs some cleanup itself ;-)

It runs here. New profile (hopefully I'll get some fixed-up stuff like
oprofile, kernprof, & lockmeter to play with at some point):


14465232 total 114.2269
10694436 mod_timer 33420.1125
1089589 __global_cli 4005.8419
961598 timer_bh 1059.0286
453404 do_gettimeofday 3333.8529
440086 __wake_up 2340.8830
298729 schedule 268.6412
294945 default_idle 5672.0192
155762 do_softirq 708.0091
43256 tasklet_hi_action 220.6939
12724 system_call 289.1818

mod_timer is 75%, __global_cli() appears to be 7.5%, and timer_bh()
is 6.6%... I wonder what happened to the plot for lockless gettimeofday(),
esp as that accounts for 3.1% here...

It's still spinning with interrupts off for several minutes at a time.


Cheers,
Bill

2002-07-18 08:19:51

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
> Even if you replace timemr_bh() with a tasklet, you still need
> to take the global_bh_lock to ensure that timers don't race with
> single-threaded BH processing in drivers. I wrote this patch [included]
> to get rid of timer_bh in Ingo's smptimers, but it acquires
> global_bh_lock as well as net_bh_lock, the latter to ensure
> that some older protocol code that expected serialization of
> NET_BH and timers work correctly (see deliver_to_old_ones()).
> They need to be cleaned up too.
> My patch of course was experimental to see what is needed to
> get rid of timer_bh. It needs some cleanup itself ;-)

It turns out those profiling results are total garbage. oprofile
hit counts during the tbench 1024 run with smptimers-X1 on the 16-way
16GB NUMA-Q follow:

c020249d 43051806 73.9493 .text.lock.dev
c0196750 2138900 3.67395 csum_partial_copy_generic
c020090c 1454023 2.49755 netif_rx
c0200e78 1237550 2.12572 process_backlog
c0200480 1083695 1.86144 dev_queue_xmit
c0120bf8 1013839 1.74145 run_timer_tasklet
c0228c8c 946933 1.62653 tcp_v4_rcv
c0196920 773495 1.32862 __generic_copy_to_user
c012009c 605591 1.04021 mod_timer
c01fbe98 477906 0.820891 sock_wfree
c0218e14 392831 0.674759 tcp_recvmsg
c0112648 362804 0.623182 try_to_wake_up
c01132fc 278976 0.479192 schedule
c0136087 251550 0.432083 .text.lock.page_alloc
c0211d14 215139 0.36954 ip_queue_xmit
c021759c 205078 0.352259 tcp_sendmsg
c0220ad4 203218 0.349064 tcp_rcv_established
c0112f64 189216 0.325013 scheduler_tick
c02221a4 187313 0.321744 tcp_transmit_skb
c021f5fc 184471 0.316863 tcp_data_queue
c02189a8 163828 0.281404 tcp_data_wait
c01dd820 139310 0.23929 loopback_xmit
c01fcfcc 137241 0.235736 skb_release_data


I'll follow up with the "before" profile next.


Cheers,
Bill

2002-07-18 10:26:49

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
>> Even if you replace timemr_bh() with a tasklet, you still need
>> to take the global_bh_lock to ensure that timers don't race with
>> single-threaded BH processing in drivers. I wrote this patch [included]
>> to get rid of timer_bh in Ingo's smptimers, but it acquires
>> global_bh_lock as well as net_bh_lock, the latter to ensure
>> that some older protocol code that expected serialization of
>> NET_BH and timers work correctly (see deliver_to_old_ones()).
>> They need to be cleaned up too.
>> My patch of course was experimental to see what is needed to
>> get rid of timer_bh. It needs some cleanup itself ;-)

On Thu, Jul 18, 2002 at 01:22:38AM -0700, William Lee Irwin III wrote:
> I'll follow up with the "before" profile next.

By the way, since it applies with just offsets to 2.5.26 I did my testing
on it. Here they are:

c01210c3 15914360 73.4974 .text.lock.timer
c0120114 1740662 8.03891 mod_timer
c0196480 533190 2.46243 csum_partial_copy_generic
c0196650 409733 1.89227 __generic_copy_to_user
c0112658 271923 1.25582 try_to_wake_up
c022893c 227856 1.05231 tcp_v4_rcv
c021ba91 219423 1.01336 .text.lock.tcp
c0107bb4 216722 1.00089 apic_timer_interrupt
c02118a4 160277 0.740208 ip_output
c011330c 123467 0.570208 schedule
c021727c 121239 0.559919 tcp_sendmsg
c0200170 121187 0.559679 dev_queue_xmit
c02021ad 83061 0.383601 .text.lock.dev
c02119f4 80365 0.37115 ip_queue_xmit
c0112f74 77876 0.359655 scheduler_tick
c01fcd98 75855 0.350322 __kfree_skb
c010fb30 73084 0.337524 smp_apic_timer_interrupt
c0218688 68854 0.317989 tcp_data_wait
c0218af4 66201 0.305736 tcp_recvmsg
c01207f8 64625 0.298458 timer_bh
c021e448 55844 0.257905 tcp_ack
c010cd60 52670 0.243246 do_gettimeofday
c02207b4 51631 0.238448 tcp_rcv_established

2002-07-18 10:40:21

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [RFC] BH removal text

On Sun, Jul 14, 2002 at 10:22:19AM +0530, Dipankar Sarma wrote:
>> Even if you replace timemr_bh() with a tasklet, you still need
>> to take the global_bh_lock to ensure that timers don't race with
>> single-threaded BH processing in drivers. I wrote this patch [included]
>> to get rid of timer_bh in Ingo's smptimers, but it acquires
>> global_bh_lock as well as net_bh_lock, the latter to ensure
>> that some older protocol code that expected serialization of
>> NET_BH and timers work correctly (see deliver_to_old_ones()).
>> They need to be cleaned up too.
>> My patch of course was experimental to see what is needed to
>> get rid of timer_bh. It needs some cleanup itself ;-)

On Thu, Jul 18, 2002 at 01:22:38AM -0700, William Lee Irwin III wrote:
> It turns out those profiling results are total garbage. oprofile
> hit counts during the tbench 1024 run with smptimers-X1 on the 16-way
> 16GB NUMA-Q follow:

Oh yes, bandwidth was increased from 23MB/s to 37MB/s.

And the rundown on .text.lock.dev:

c020249d 43051806 73.9493 .text.lock.dev
c02024f9 10357 0.0240571
c02024fc 121387 0.281956
c02024fe 10282 0.0238829
c0202515 5777619 13.4202
c0202518 31534891 73.2487
c020251a 5596759 13.0001
c020251c 10 2.32278e-05
c0202521 11 2.55506e-05
c0202522 158 0.000367
c0202523 34 7.89746e-05
c0202524 34 7.89746e-05
c0202529 61 0.00014169
c020252a 125 0.000290348
c020252b 36 8.36202e-05
c020252c 42 9.75569e-05


c0202518: f3 90 repz nop
c020251a: 7e f9 jle c0202515 <.text.lock.dev+0x78>
c020251c: e9 83 e1 ff ff jmp c02006a4 <dev_queue_xmit+0x224>

[...]

c0200694: e8 eb 78 f1 ff call c0117f84 <printk>
c0200699: 0f 0b ud2a
c020069b: 7b 00 jnp c020069d <dev_queue_xmit+0x21d>
c020069d: 40 inc %eax
c020069e: 5c pop %esp
c020069f: 29 c0 sub %eax,%eax
c02006a1: 83 c4 08 add $0x8,%esp
c02006a4: f0 fe 0f lock decb (%edi)
c02006a7: 0f 88 68 1e 00 00 js c0202515 <.text.lock.dev+0x78>
c02006ad: 8b 44 24 10 mov 0x10(%esp,1),%eax
c02006b1: 89 86 e8 00 00 00 mov %eax,0xe8(%esi)
c02006b7: 8b 46 24 mov 0x24(%esi),%eax
c02006ba: a8 01 test $0x1,%al
c02006bc: 0f 85 a1 00 00 00 jne c0200763 <dev_queue_xmit+0x2e3>
c02006c2: 83 3d 60 5f 3b c0 00 cmpl $0x0,0xc03b5f60
c02006c9: 74 0a je c02006d5 <dev_queue_xmit+0x255>
c02006cb: 56 push %esi
c02006cc: 55 push %ebp
c02006cd: e8 42 fc ff ff call c0200314 <dev_queue_xmit_nit>


This leads me to believe it's the dev->xmit_lock as that's protects the
critical section in which dev_queue_xmit_nit() is called.


Cheers,
Bill