2007-06-22 04:05:33

by Steven Rostedt

Subject: [RFC PATCH 0/6] Convert all tasklets to workqueues


There's a very nice paper by Matthew Wilcox that describes Softirqs,
Tasklets, Bottom Halves, Task Queues, Work Queues and Timers[1].
The paper describes the history of these items. Softirqs and
tasklets were created to replace bottom halves after a company (Mindcraft)
showed that Microsoft on a 4x SMP box would outdo Linux. It was discovered
that this was due to a bottleneck caused by the design of Bottom Halves.
So Alexey Kuznetsov and Dave Miller [1] (and I'm sure others) created
softirqs and tasklets to multithread the bottom halves.

This worked well, and for the time it shut up Microsoft^WMindcraft from
saying Linux was slow at networking.

Time passed, and Linux developed other nifty tools, like kthreads and
work queues. These run in a process context and are not as menacing to
latencies as softirqs and tasklets are. Specifically, a tasklet
acts like a task by only being able to run its function on one CPU
at a time. The same tasklet cannot run on multiple CPUs. So in that
aspect it is like a task (a task can only exist on one CPU at a time).
But a tasklet is much harder on the rest of the system because it
runs in interrupt context. This means that if a higher priority process
wants to run, it must wait for the tasklet to finish before doing so.

For the most part, tasklets today are not used for time-critical functions.
Running tasklets in thread context is not harmful to the performance of
the overall system. But running them in interrupt context is, since
they increase the overall latency for high priority tasks.

Even in Matthew's paper, he says that work queues have replaced tasklets.
But this is not truly the case. Tasklets are common and plentiful.
But to go through each driver that uses a tasklet and replace it
with a work queue would be very painful.

I've developed a way to replace all tasklets with work queues without
having to change all the drivers that use them. I created an API that
uses the tasklet API as a wrapper to a work queue. This API doesn't need
to be permanent. It shows 1) that work queues can replace tasklets, and
2) that we can remove duplicated functionality from the kernel. This API
only needs to be around until we have removed all uses of tasklets from
all drivers.
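
To give a feel for it, here is a much simplified sketch of the wrapper
(illustrative only, not the actual patch - the real code keeps more
tasklet state, and ktaskletd_wq is the workqueue created at boot):

#include <linux/workqueue.h>

/* simplified: the real tasklet_struct keeps more state (count, etc.) */
struct tasklet_struct {
	struct work_struct work;
	void (*func)(unsigned long);
	unsigned long data;
};

static struct workqueue_struct *ktaskletd_wq;

static void tasklet_work_fn(struct work_struct *work)
{
	struct tasklet_struct *t =
		container_of(work, struct tasklet_struct, work);

	t->func(t->data);
}

void tasklet_init(struct tasklet_struct *t,
		  void (*func)(unsigned long), unsigned long data)
{
	INIT_WORK(&t->work, tasklet_work_fn);
	t->func = func;
	t->data = data;
}

void tasklet_schedule(struct tasklet_struct *t)
{
	/* queue_work() refuses to double-queue a pending work item,
	 * which roughly matches TASKLET_STATE_SCHED semantics */
	queue_work(ktaskletd_wq, &t->work);
}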

I just want to state that tasklets served their time well. But it's time
to give them an honorable discharge. So let's get rid of tasklets and
give them a standing salute as they leave :-)


This patch series does the following:

1) Changes the RCU tasklet into a softirq. The RCU tasklet *is* a
performance critical function, and changing it to a softirq gives it
even more performance, and removes overhead. This has already been done
in the RT kernel, and should be applied regardless of the rest of the
patches in the series.

2) Splits out the tasklets from softirq.c. This too should be done anyway.
Tasklets are not softirqs, and they have their own semantics, so they
deserve their own file. It also makes it a lot cleaner to replace them
with something else :-)

3/4) Add an API to the tasklets to allow a driver to see if a tasklet
is scheduled. The DRM driver does its own little thing with tasklets
and reads into the internals of the tasklet. These patches give the
DRM driver an API to do that a little more cleanly (a sketch of such a
helper follows the list below).

The above patches really should go into the kernel, if for no other
reason than as a cleanup patch set.

5) Move tasklet.h to tasklet_softirq.h and have tasklet.h include it.

6) This is the magic to make tasklets into work queues. It allows for
the kernel to be configured either with the normal tasklets, as it is
today, or with the tasklets-as-work-queues.
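
The is-scheduled helper added by patches 3/4 (the sketch promised above)
is essentially along these lines - the exact name in the series may
differ:

#include <linux/interrupt.h>

static inline int tasklet_is_scheduled(struct tasklet_struct *t)
{
	return test_bit(TASKLET_STATE_SCHED, &t->state);
}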

I've booted these patches on 5 machines, i386 and x86_64, and when
I can get my powerbook loaded with Linux, I'll try it there too.

I'd like to give thanks to Ingo Molnar and Oleg Nesterov for reviewing
my initial patch series and giving me some pointers.


[1] http://www.wil.cx/matthew/lca2003/paper.pdf

-- Steve


2007-06-22 07:10:22

by Christoph Hellwig

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, Jun 22, 2007 at 12:00:14AM -0400, Steven Rostedt wrote:
> For the most part, tasklets today are not used for time-critical functions.
> Running tasklets in thread context is not harmful to the performance of
> the overall system. But running them in interrupt context is, since
> they increase the overall latency for high priority tasks.

I think we probably want some numbers, at least for tasklets used in
potentially performance critical code.

> Even in Matthew's paper, he says that work queues have replaced tasklets.
> But this is not truly the case. Tasklets are common and plentiful.
> But to go through each driver that uses a tasklet and replace it
> with a work queue would be very painful.
>
> I've developed a way to replace all tasklets with work queues without
> having to change all the drivers that use them. I created an API that
> uses the tasklet API as a wrapper to a work queue. This API doesn't need
> to be permanent. It shows 1) that work queues can replace tasklets, and
> 2) that we can remove duplicated functionality from the kernel. This API
> only needs to be around until we have removed all uses of tasklets from
> all drivers.

I don't like this wrapping at all. What you're doing is a tradeoff to
do less work today in exchange for more maintenance overhead and more crufty
code in the future. So while I sympathize a lot with trying to get rid
of tasklets, I'd rather prefer to convert individual drivers over until
all users are gone. It's not exactly a very complicated conversion either.

> 6) This is the magic to make tasklets into work queues. It allows for
> the kernel to be configured either with the normal tasklets, as it is
> today, or with the tasklets-as-work-queues.

And this is something that might be fine for benchmarking, but not something
we should put in. Keeping two wildly different implementations of core
functionality with very different behaviour around is quite bad. Better to
kill tasklets once and for all.

2007-06-22 07:53:18

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Christoph Hellwig <[email protected]> wrote:

> I think we probably want some numbers, at least for tasklets used in
> potentially performance critical code.

which actual in-kernel tasklets do you have in mind? I'm not aware of
any in performance critical code. (now that both the RCU and the sched
tasklets have been fixed.)

Ingo

2007-06-22 07:53:30

by Christoph Hellwig

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, Jun 22, 2007 at 09:51:35AM +0200, Ingo Molnar wrote:
>
> * Christoph Hellwig <[email protected]> wrote:
>
> > I think we probably want some numbers, at least for tasklets used in
> > potentially performance critical code.
>
> which actual in-kernel tasklets do you have in mind? I'm not aware of
> any in performance critical code. (now that both the RCU and the sched
> tasklets have been fixed.)

the one in megaraid_sas for example is in a performance-critical path
and was put in quite recently.

2007-06-22 11:25:16

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Christoph Hellwig <[email protected]> wrote:

> > which actual in-kernel tasklets do you have in mind? I'm not aware
> > of any in performance critical code. (now that both the RCU and the
> > sched tasklets have been fixed.)
>
> the one in megaraid_sas for example is in a performance-critical path
> and was put in quite recently.

ah, i thought you meant core kernel. I took a look, and i doubt that
doing the megaraid SAS thing over a kernel thread would show up on the
radar. If it does, then it should probably be done in its own separate
per-CPU workqueue anyway, not in a globally scheduled thing like a
tasklet. Tasklets just dont scale.

Ingo

2007-06-22 12:36:01

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Christoph,

Thanks for taking the time to look at my patches!


On Fri, 2007-06-22 at 08:09 +0100, Christoph Hellwig wrote:
> > I've developed a way to replace all tasklets with work queues without
> > having to change all the drivers that use them. I created an API that
> > uses the tasklet API as a wrapper to a work queue. This API doesn't need
> > to be permanent. It shows 1) that work queues can replace tasklets, and
> > 2) that we can remove duplicated functionality from the kernel. This API
> > only needs to be around until we have removed all uses of tasklets from
> > all drivers.
>
> I don't like this wrapping at all. What you're doing is a tradeoff to
> do less work today in exchange for more maintenance overhead and more crufty
> code in the future. So while I sympathize a lot with trying to get rid
> of tasklets, I'd rather prefer to convert individual drivers over until
> all users are gone. It's not exactly a very complicated conversion either.
>
> > 6) This is the magic to make tasklets into work queues. It allows for
> > the kernel to be configured either with the normal tasklets, as it is
> > today, or with the tasklets-as-work-queues.
>
> And this is something that might be fine for benchmarking, but not something
> we should put in. Keeping two wildly different implementations of core
> functionality with very different behaviour around is quite bad. Better to
> kill tasklets once and for all.

Honestly, I highly doubted that this would make it up to Linus's tree.
My aim was to get this into -mm, and perhaps even turn on
tasklets-as-workqueues by default.

The objective of these patches was more of a proof of concept, showing
that tasklets are indeed obsolete today. By putting this patch set
into -mm, we can get a good idea of the regressions that this conversion
would cause, hopefully before we do the real change of converting a
driver's tasklet into a work queue.

I only have a limited amount of hardware to test on. I believe this is a
good candidate for permanent residence in -mm, where it can display the
effects of using work queues instead of tasklets in a wider arena.

This patch is too much of a kludge to put into Linus's tree, and if it
were in that tree, it would probably cause laziness and make it even
less likely that driver authors would get rid of their tasklets. But I
argue it is perfect for the -mm tree, because it allows users to easily
(with a config option) turn it on and off and run benchmarks to see its
effect.


-- Steve

2007-06-22 12:40:25

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Steven Rostedt <[email protected]> wrote:

> > And this is something that might be fine for benchmarking, but not something
> > we should put in. Keeping two wildly different implementations of core
> > functionality with very different behaviour around is quite bad. Better to
> > kill tasklets once and for all.
>
> Honestly, I highly doubted that this would make it up to Linus's tree.

that's where it belongs - but it first needs the cleanups suggested by
Christoph.

> My aim was to get this into -mm, [...]

that would be the first step towards getting it upstream.

> and perhaps even turn on tasklets-as-workqueues by default.

that is a hack that shouldnt be in the patch. People can unapply/apply a
patch just as well as they can flip a .config switch.

Ingo

2007-06-22 13:03:36

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 14:38 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > > And this is something that might be fine for benchmarking, but not something
> > > we should put in. Keeping two wildly different implementations of core
> > > functionality with very different behaviour around is quite bad. Better to
> > > kill tasklets once and for all.
> >
> > Honestly, I highly doubted that this would make it up to Linus's tree.
>
> that's where it belongs - but it first needs the cleanups suggested by
> Christoph.

I had the impression that he didn't want it in, but instead wanted each
driver to be changed separately.

>
> > My aim was to get this into -mm, [...]
>
> that would be the first step towards getting it upstream.
>
> > and perhaps even turn on tasklets-as-workqueues by default.
>
> that is a hack that shouldnt be in the patch. People can unapply/apply a
> patch just as well as they can flip a .config switch.

So should the patch then not even have the tasklet_softirq in there at
all? Should it simply replace the tasklets with workqueues, so that if
someone doesn't like that, they can simply remove the patch?

-- Steve


2007-06-22 13:18:09

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Steven Rostedt <[email protected]> wrote:

> > that's where it belongs - but it first needs the cleanups suggested
> > by Christoph.
>
> I had the impression that he didn't want it in, but instead wanted
> each driver to be changed separately.

that can be done too, at a later stage. We cannot deprecate an API from
one day to the next. But wrapping it sanely via an existing framework
makes complete sense.

> > > and perhaps even turn on tasklets-as-workqueues by default.
> >
> > that is a hack that shouldnt be in the patch. People can
> > unapply/apply a patch just as well as they can flip a .config
> > switch.
>
> So should the patch then not even have the tasklet_softirq in there at
> all? Should it simply replace the tasklets with workqueues, so that if
> someone doesn't like that, they can simply remove the patch?

yes - replace the softirq-based tasklet implementation with the
workqueue-based implementation, but the tasklet API itself should still stay.

ok, enough idle talking, lets see the next round of patches? :)

please remove all the pr_debug() lines as well - they are handy for
development but are quite unnecessary for something headed
upstream-wards. And please replace all the BUG_ON()s with WARN_ON_ONCE()
...

Ingo

2007-06-22 13:21:21

by Andrew Morton

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

> On Fri, 22 Jun 2007 08:58:44 -0400 Steven Rostedt <[email protected]> wrote:
> On Fri, 2007-06-22 at 14:38 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
> >
> > > > And this is something that might be fine for benchmarking, but not something
> > > > we should put in. Keeping two wildly different implementations of core
> > > > functionality with very different behaviour around is quite bad. Better to
> > > > kill tasklets once and for all.
> > >
> > > Honestly, I highly doubted that this would make it up to Linus's tree.
> >
> > that's where it belongs - but it first needs the cleanups suggested by
> > Christoph.
>
> I had the impression that he didn't want it in, but instead wanted each
> driver to be changed separately.

I do think that would be a better approach. Apart from the cleanliness
issue, the driver-by-driver conversion would make it much easier to hunt
down any regressions or various funniness.

2007-06-22 13:27:44

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Andrew Morton <[email protected]> wrote:

> I do think that would be a better approach. Apart from the
> cleanliness issue, the driver-by-driver conversion would make it much
> easier to hunt down any regressions or various funniness.

there are 120 tasklet_init()s in the tree and 224 tasklet_schedule()s.
Pushing it into thread context should work just fine (Steve's patchset
certainly works on my testbox), as even today we can execute softirqs
(and hence tasklets) in ksoftirqd. In fact, -rt has been executing
tasklets in task context for over 2.5 years now. Do we really want
to upset the whole API? Realistically it just wont ever be removed, like
the BKL.

Ingo

2007-06-22 13:39:38

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 06:13 -0700, Andrew Morton wrote:
> > On Fri, 22 Jun 2007 08:58:44 -0400 Steven Rostedt <[email protected]> wrote:
> > On Fri, 2007-06-22 at 14:38 +0200, Ingo Molnar wrote:
> > > * Steven Rostedt <[email protected]> wrote:

> > > > Honestly, I highly doubted that this would make it up to Linus's tree.
> > >
> > > that's where it belongs - but it first needs the cleanups suggested by
> > > Christoph.
> >
> > I had the impression that he didn't want it in, but instead wanted each
> > driver to be changed separately.
>
> I do think that would be a better approach. Apart from the cleanliness
> issue, the driver-by-driver conversion would make it much easier to hunt
> down any regressions or various funniness.
>

Actually, I disagree that converting driver by driver makes it easier to
hunt down regressions. Perhaps a regression is caused by two different
drivers having their tasklets converted to work queues, or there might
be a case where one tasklet is somehow related to another tasklet.

Switching all tasklets at once can pinpoint the problem rather quickly.
Then it's easy to find which driver was the culprit.

-- Steve


2007-06-22 13:49:02

by Andrew Morton

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

> On Fri, 22 Jun 2007 15:26:22 +0200 Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > I do think that would be a better approach. Apart from the
> > cleanliness issue, the driver-by-driver conversion would make it much
> > easier to hunt down any regressions or various funniness.
>
> there are 120 tasklet_init()s in the tree and 224 tasklet_schedule()s.

couple of hours?

> Pushing it into thread context should work just fine (Steve's patchset
> certainly works on my testbox), as even today we can execute softirqs
> (and hence tasklets) in ksoftirqd. In fact, -rt has been executing
> tasklets in task context for over 2.5 years now. Do we really want
> to upset the whole API? Realistically it just wont ever be removed, like
> the BKL.

We can remove it. It might need to remain deprecated for a year, but we
shouldn't plan on leaving the old interface hanging around forever.

2007-06-22 14:05:27

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Andrew Morton <[email protected]> wrote:

> > there are 120 tasklet_init()s in the tree and 224
> > tasklet_schedule()s.
>
> couple of hours?

hm, what would you replace it with? Another new API? Or workqueues,
with a manual local_bh_disable()/enable() pair added around the
worker function? If the latter, then converting it to workqueues isnt
just a couple of hours i think. Maybe i'm overestimating the effort.
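
i mean something like this (hypothetical foo_* names, just to
illustrate the pattern):

#include <linux/interrupt.h>
#include <linux/workqueue.h>

static void foo_handler(void)
{
	/* the body of what used to be the driver's tasklet */
}

static void foo_work_fn(struct work_struct *work)
{
	local_bh_disable();	/* keep the old softirq-style exclusion */
	foo_handler();
	local_bh_enable();
}

static DECLARE_WORK(foo_work, foo_work_fn);

/* the hardirq then does schedule_work(&foo_work) instead of
   tasklet_schedule() */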

Ingo

2007-06-22 14:29:26

by Arjan van de Ven

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


> For the most part, tasklets today are not used for time-critical functions.
> Running tasklets in thread context is not harmful to the performance of
> the overall system.

That is a bold statement...

> But running them in interrupt context is, since
> they increase the overall latency for high priority tasks.


.... since by moving this to process context you do add latency between
the hardirq and the tasklet, and I can see that having performance
impact easily.


2007-06-22 14:32:07

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 15:12 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>

> yes - replace the softirq-based tasklet implementation with the
> workqueue-based implementation, but the tasklet API itself should still stay.

done.

>
> ok, enough idle talking, lets see the next round of patches? :)

Still need to run through testing, and I still don't have distcc
working. So it may take a while ;-)

>
> please remove all the pr_debug() lines as well - they are handy for
> development but are quite unnecessary for something headed

done (but I did keep one).

> upstream-wards. And please replace all the BUG_ON()s with WARN_ON_ONCE()

I had two BUG_ON()s, and they were both for not having the ktaskletd_wq
allocated. If we just put a WARN_ON there, we will get a BUG from a bad
memory reference right afterward, and put the system into an unknown state.
Are you sure you still want me to convert them to WARN_ON?
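
For reference, the conversion in question would look something like this
(sketch only, against the hypothetical wrapper's tasklet_schedule()):

void tasklet_schedule(struct tasklet_struct *t)
{
	if (WARN_ON_ONCE(!ktaskletd_wq))
		return;	/* warn and drop instead of oopsing on a NULL deref */
	queue_work(ktaskletd_wq, &t->work);
}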

-- Steve


2007-06-22 14:42:25

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 07:25 -0700, Arjan van de Ven wrote:
> > For the most part, tasklets today are not used for time-critical functions.
> > Running tasklets in thread context is not harmful to the performance of
> > the overall system.
>
> That is a bold statement...
>
> > But running them in interrupt context is, since
> > they increase the overall latency for high priority tasks.
>
>
> .... since by moving this to process context you do add latency between
> the hardirq and the tasklet, and I can see that having performance
> impact easily.

This was stated on the assumption that pretty much all performance
critical tasklets have been removed (although Christoph just mentioned
megaraid_sas, that was only after I made this statement).

We've been running tasklets as threads in the -rt kernel for some time
now, and that hasn't bothered us.

The problem with tasklets is that they still don't provide guaranteed
performance. They can still be pushed off to ksoftirqd on a busy system.
But having *all* tasklets as they are today keeps them at a higher
priority than any process in the system.

-- Steve


2007-06-22 14:47:47

by Arjan van de Ven

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


>
> This was stated on the assumption that pretty much all performance
> critical tasklets have been removed (although Christoph just mentioned
> megaraid_sas, that was only after I made this statement).
>
> We've been running tasklets as threads in the -rt kernel for some time
> now, and that hasn't bothered us.
>
> The problem with tasklets is that they still don't provide guaranteed
> performance. They can still be pushed off to ksoftirqd on a busy system.
> But having *all* tasklets as they are today keeps them at a higher
> priority than any process in the system.


don't get me wrong, I'm not against this conversion; it just needs to be
measured in detail... or preferably done step by step so that the
performance guys can undo one piece at a time until they find a guilty
party ;)

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-06-22 17:24:12

by Linus Torvalds

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues



On Fri, 22 Jun 2007, Steven Rostedt wrote:
>
> I just want to state that tasklets served their time well. But it's time
> to give them an honorable discharge. So let's get rid of tasklets and
> give them a standing salute as they leave :-)

Well, independently of whether we actually discharge them or not, I do
tend to always like things that split independent concepts up (whether
they then end up being _implemented_ independently of each other or not is
a separate issue).

So patches 1-4 all look fine to me. In fact, 5 looks ok too.

Whether we actually then want to do 6 is another matter. I think we'd need
some measuring and discussion about that.

I'm absolutely 100% sure that we do *not* want to be in a situation where
we have two different implementations of tasklets, and just keep the
CONFIG variable and let people just choose one or the other.

So imnsho doing #6 is really something that makes sense only in a "let's
measure this and decide which implementation is actually the better one",
_not_ in the sense of merging it into the standard kernel and letting them
fight it out in the long run.

But I'd happily merge 1-4 regardless after 2.6.22 is out.

Leaving patch 6 as a "only makes sense after we actually have some numbers
about it", and patch 5 is a "could go either way" as far as I'm concerned
(ie I could merge it together with the 1-4 series, but I think it's
equally valid to just see it as a companion to 6).

Does that make sense to people?

Linus

2007-06-22 17:35:41

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 10:16 -0700, Linus Torvalds wrote:
>

> So patches 1-4 all look fine to me. In fact, 5 looks ok too.

Great!

> Leaving patch 6 as a "only makes sense after we actually have some numbers
> about it", and patch 5 is a "could go either way" as far as I'm concerned
> (ie I could merge it together with the 1-4 series, but I think it's
> equally valid to just see it as a companion to 6).

Actually, in my next series, I removed patch 5 and simply replaced
the tasklets with the workqueue implementation, leaving no config
option. It will be out shortly, after I finish my testing.

I also kept the file name tasklet_work.c instead of overwriting
tasklet.c, just because it's easier to review the patch that way.

-- Steve




2007-06-22 18:32:30

by Christoph Hellwig

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, Jun 22, 2007 at 10:16:47AM -0700, Linus Torvalds wrote:
>
>
> On Fri, 22 Jun 2007, Steven Rostedt wrote:
> >
> > I just want to state that tasklets served their time well. But it's time
> > to give them an honorable discharge. So let's get rid of tasklets and
> > give them a standing salute as they leave :-)
>
> Well, independently of whether we actually discharge them or not, I do
> tend to always like things that split independent concepts up (whether
> they then end up being _implemented_ independently of each other or not is
> a separate issue).
>
> So patches 1-4 all look fine to me. In fact, 5 looks ok too.

I don't think we should put 3 and 4 in. The code in drm is just crap and
should just be rewritten. I'll cook up a patch, but I can't actually
test it due to lack of hardware.

2007-06-22 20:43:35

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Linus Torvalds <[email protected]> wrote:

> Whether we actually then want to do 6 is another matter. I think we'd
> need some measuring and discussion about that.

basically tasklets have a number of limitations:

- tasklets have certain latency limitations over real tasks. (for
example they are not guaranteed to be re-executed when they are
triggered while they are running, so artificial latencies can be
introduced into the kernel workflow)

- tasklets have certain execution limitations. (only atomic functions
can be executed in them)

- tasklets have certain fairness limitations. (they are executed in
softirq context and thus preempt everything, even if there is some
potentially more important, high-priority task waiting to be
executed.)

- the 'priority levels' approach of softirqs is not really
self-documenting - unlike real locks. As a result we've got some
vague coupling between network softirq processing and timer softirq
processing, which spilled over into tasklets as well. The 'hi' and
'low' concept of tasklets isnt really used either. We should reduce
the amount of such opaque 'coupling' between workflows - it should be
spelled out explicitly via some synchronization construct.

- tasklets are duplicated infrastructure (over existing workqueues)
that, if it's possible to do it compatibly, would be a good idea to
eliminate.

when it comes to 'deferred processing', we've basically got two 'prime'
choices for deferred processing:

- if it's high-performance then it goes into a softirq.

- if performance is not important, or robustness and flexibility is
more important than performance, then workqueues are used.

basically tasklets do _neither_ really well. They are too 'global' to
scale really well on SMP (even the RCU tasklet wasnt a real tasklet: it
was a _per CPU tasklet_, which almost by definition is equivalent to a
softirq, plus some extra glue overhead ...), and tasklets are also too
much tied to softirqs to be used as a generic processing context.

that's why i'd like them to be gently but firmly phased out B-)

Ingo

2007-06-22 21:00:40

by Christoph Hellwig

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, Jun 22, 2007 at 10:40:58PM +0200, Ingo Molnar wrote:
> when it comes to 'deferred processing', we've basically got two 'prime'
> choices for deferred processing:
>
> - if it's high-performance then it goes into a softirq.
>
> - if performance is not important, or robustness and flexibility is
> more important than performance, then workqueues are used.
>
> basically tasklets do _neither_ really well. They are too 'global' to
> scale really well on SMP (even the RCU tasklet wasnt a real tasklet: it
> was a _per CPU tasklet_, which almost by definition is equivalent to a
> softirq, some some extra glue overhead ...), and tasklets are also too
> much tied to softirqs to be used as a generic processing context.
>
> that's why i'd like them to be gently but firmly phased out B-)

Note that we also have a lot of inefficiency in the way we do deferred
processing. Think of a setup where you run a XFS filesystem runs over
a megaraid adapter.

(1) we get a real hardirq, which just clears the interrupt and then
defers to a tasklet
(2) the tasklet walks the producer / consumer queue and then calls scsi_done
for each completed scsi command, which ends up doing
raise_softirq_irqoff(BLOCK_SOFTIRQ);
(3) the block softirq does the heavy lifting for command completion and finally
calls back into the bio's completion routine
(4) xfs wants to avoid irq safe locking and thus defers the command to a
kthread

This is rather inefficient due to all the (semi-)context switches already,
and it is by far not the worst setup, given that a lot of dm modules can
involve another thread in the process.

Now if we just plainly convert tasklets to a thread-based abstraction, this
existing code becomes really dumb, because we go from hardirq to process
context, back to softirq context, and back to process context again.

Ouch!

I think we need to put a little more thought into how we can optimize our
irq path for the full stack. Using irqthreads in an intelligent way might
be one option, but we'll need a lot of heavy benchmarking whatever way
we go.

2007-06-22 21:11:52

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Christoph Hellwig <[email protected]> wrote:

> Note that we also have a lot of inefficiency in the way we do deferred
> processing. Think of a setup where an XFS filesystem runs over
> a megaraid adapter.
>
> (1) we get a real hardirq, which just clears the interrupt and then
> defers to a tasklet
> (2) the tasklet walks the producer / consumer queue and then calls scsi_done
> for each completed scsi command, which ends up doing
> raise_softirq_irqoff(BLOCK_SOFTIRQ);
> (3) the block softirq does the heavy lifting for command completion and finally
> calls back into the bio's completion routine
> (4) xfs wants to avoid irq safe locking and thus defers the command to a
> kthread

i dont understand - why is a tasklet used at all? Why not do it straight
in the BLOCK_SOFTIRQ? Using tasklets there is extra, unnecessary
overhead already.

Ingo

2007-06-22 21:13:50

by Thomas Gleixner

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 22:00 +0100, Christoph Hellwig wrote:
> Note that we also have a lot of inefficiency in the way we do deferred
> processing. Think of a setup where an XFS filesystem runs over
> a megaraid adapter.
>
> (1) we get a real hardirq, which just clears the interrupt and then
> defers to a tasklet
> (2) the tasklet walks the producer / consumer queue and then calls scsi_done
> for each completed scsi command, which ends up doing
> raise_softirq_irqoff(BLOCK_SOFTIRQ);
> (3) the block softirq does the heavy lifting for command completion and finally
> calls back into the bio's completion routine
> (4) xfs wants to avoid irq safe locking and thus defers the command to a
> kthread
>
> This is rather inefficient due to all the (semi-)context switches already,
> and it is by far not the worst setup, given that a lot of dm modules can
> involve another thread in the process.
>
> Now if we just plainly convert tasklets to a thread-based abstraction, this
> existing code becomes really dumb, because we go from hardirq to process
> context, back to softirq context, and back to process context again.
>
> Ouch!
>
> I think we need to put a little more thought into how we can optimize our
> irq path for the full stack. Using irqthreads in an intelligent way might
> be one option, but we'll need a lot of heavy benchmarking whatever way
> we go.

Your above scenario screams for a threaded interrupt handler, where you
can actually unify a lot of this into one single context.
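
Roughly along these lines - a hedged sketch with hypothetical foo_*
driver hooks; the real thing obviously needs more care:

#include <linux/interrupt.h>
#include <linux/kthread.h>
#include <linux/wait.h>
#include <asm/atomic.h>

static DECLARE_WAIT_QUEUE_HEAD(foo_wait);
static atomic_t foo_pending = ATOMIC_INIT(0);

static void foo_ack_hardware(void *dev) { /* device-specific: ack the irq */ }
static void foo_complete_commands(void *dev) { /* the completion heavy lifting */ }

static irqreturn_t foo_hardirq(int irq, void *dev)
{
	foo_ack_hardware(dev);		/* just silence the hardware */
	atomic_set(&foo_pending, 1);
	wake_up(&foo_wait);		/* defer everything to the thread */
	return IRQ_HANDLED;
}

static int foo_irq_thread(void *dev)
{
	while (!kthread_should_stop()) {
		wait_event_interruptible(foo_wait,
			atomic_read(&foo_pending) || kthread_should_stop());
		atomic_set(&foo_pending, 0);
		foo_complete_commands(dev);	/* process context, preemptible */
	}
	return 0;
}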

tglx


2007-06-22 21:39:08

by Linus Torvalds

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues



On Fri, 22 Jun 2007, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > Whether we actually then want to do 6 is another matter. I think we'd
> > need some measuring and discussion about that.
>
> basically tasklets have a number of limitations:

I'm not disputing that they aren't pretty.

But none of your arguments really touch on the deeper issue:

- Is the new code *technically* better?

Don't get me wrong. I like cleanups as much as the next guy. I have no
problem at all with the first four patches in the series, because those
are cleanups that have no technical impact apart from that "cleanup-ness".

The reason I ask about patch #6 is simply that in the end "clean code"
matters less than "good results".

I'm a _huge_ believer in "clean code", but the fact is, I'm an even bigger
believer in "reality bites". I'd really like to see some numbers.

If the numbers say that there is no performance difference (or even
better: that the new code performs better or fixes some latency issue or
whatever), I'll be very happy. But if the numbers say that it's worse, no
amount of cleanliness really changes that.

Linus

2007-06-22 21:57:32

by Daniel Walker

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 22:40 +0200, Ingo Molnar wrote:

>
> - tasklets have certain fairness limitations. (they are executed in
> softirq context and thus preempt everything, even if there is some
> potentially more important, high-priority task waiting to be
> executed.)

Since -rt has been executing tasklets in process context for a long
time, I'm not sure this change would cause too many regressions. However,
it seems like implicit dependencies on "tasklets preempt everything"
might crop up. The other issue is if they don't "preempt
everything" (most of the time), what default priority do we give them
(all of the time)? It seems like Christoph's suggestion of converting
all the tasklets individually might be a better option, to deal with
specific pitfalls.

Daniel

2007-06-22 22:01:33

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Linus Torvalds <[email protected]> wrote:

> If the numbers say that there is no performance difference (or even
> better: that the new code performs better or fixes some latency issue
> or whatever), I'll be very happy. But if the numbers say that it's
> worse, no amount of cleanliness really changes that.

Most of the tasklet uses are in rarely used or arcane drivers - in fact
none of my 10 test-boxes utilizes _any_ tasklet in any way that could
even get close to mattering to performance. In other words: i just
cannot test this, nor do i think that others will really test this. I.e.
if we dont approach this problem in some other way, nothing will happen
and Steve's patch will be stalled forever and will live in -rt forever.
(which might be a correct end result too, but i'm just not giving up
this easily :-)

so how about the following, different approach: anyone who has a tasklet
in any performance-sensitive codepath, please yell now. We'll also do a
proactive search for such places. We can convert those places to
softirqs, or move them back into hardirq context. Once this is done -
and i doubt it will go beyond 1-2 places - we can just mass-convert the
other 110 places to the lame but compatible solution of doing them in a
global thread context.

[ and on a similar notion, i still havent given up on seeing all BKL use
gone from the kernel. I expect it to happen any decade now ;-) ]

Ingo

2007-06-22 22:10:26

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Ingo Molnar <[email protected]> wrote:

> [ and on a similar notion, i still havent given up on seeing all BKL
> use gone from the kernel. I expect it to happen any decade now ;-) ]

2.6.21 had 476 lock_kernel() calls. 2.6.22-git has 473 lock_kernel()
calls currently. With that kind of flux we'll see the BKL gone in about
40 years B-)

'struct semaphore' use on the other hand has gone down by 10% in this
release, which is a good rate. I guess the lack of lockdep coverage for
semaphores might be one of the driving forces? ;-)

Ingo

2007-06-22 22:11:32

by David Lang

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 22 Jun 2007, Daniel Walker wrote:

>
> On Fri, 2007-06-22 at 22:40 +0200, Ingo Molnar wrote:
>
>>
>> - tasklets have certain fairness limitations. (they are executed in
>> softirq context and thus preempt everything, even if there is some
>> potentially more important, high-priority task waiting to be
>> executed.)
>
> Since -rt has been executing tasklets in process context for a long
> time, I'm not sure this change would cause too many regressions. However,
> it seems like implicit dependencies on "tasklets preempt everything"
> might crop up. The other issue is if they don't "preempt
> everything" (most of the time), what default priority do we give them
> (all of the time)? It seems like Christoph's suggestion of converting
> all the tasklets individually might be a better option, to deal with
> specific pitfalls.

that would be the safe way to do it, but it will take a lot of time and a
lot of testing.

it's probably better to try the big-bang change, and only if you see
problems go back and break things down.

remember, these changes have been in use in -rt for a while. there's
reason to believe that they aren't going to cause drastic problems.

David Lang

2007-06-22 22:16:46

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Daniel Walker <[email protected]> wrote:

> > - tasklets have certain fairness limitations. (they are executed in
> > softirq context and thus preempt everything, even if there is
> > some potentially more important, high-priority task waiting to be
> > executed.)
>
> Since -rt has been executing tasklets in process context for a long
> time, I'm not sure this change would cause too many regressions.
> However, it seems like implicit dependencies on "tasklets preempt
> everything" might crop up. The other issue is if they don't "preempt
> everything" (most of the time), what default priority do we give them
> (all of the time)? [...]

there is no such guarantee at all (of 'instant preemption'), even with
current, softirq-based tasklets. A tasklet might be 'stolen' by another
CPU. It might be delayed to the next timer tick (or other softirq
execution). Or it might be delayed into a ksoftirqd context, which
currently runs at nice +19. So your worry about implicit execution
dependencies is unfounded, because, if they existed, they would be bad
(and triggerable) bugs today too.

Ingo

2007-06-22 22:19:42

by Daniel Walker

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 15:09 -0700, [email protected] wrote:
> On Fri, 22 Jun 2007, Daniel Walker wrote:
>
> >
> > On Fri, 2007-06-22 at 22:40 +0200, Ingo Molnar wrote:
> >
> >>
> >> - tasklets have certain fairness limitations. (they are executed in
> >> softirq context and thus preempt everything, even if there is some
> >> potentially more important, high-priority task waiting to be
> >> executed.)
> >
> > Since -rt has been executing tasklets in process context for a long
> > time, I'm not sure this change would cause too many regressions. However,
> > it seems like implicit dependencies on "tasklets preempt everything"
> > might crop up. The other issue is if they don't "preempt
> > everything" (most of the time), what default priority do we give them
> > (all of the time)? It seems like Christoph's suggestion of converting
> > all the tasklets individually might be a better option, to deal with
> > specific pitfalls.
>
> that would be the safe way to do it, but it will take a lot of time and a
> lot of testing.
>
> it's probably better to try the big-bang change, and only if you see
> problems go back and break things down.

For testing I'd agree, but not for a kernel that is supposed to be
stable.

> remember, these changes have been in use in -rt for a while. there's
> reason to believe that they aren't going to cause drastic problems.

Since I've been working with -rt (~2 years now I think) it's clear that
the number of testers of the patch isn't all that high compared to the
stable kernel. There are tons of drivers which get no coverage by -rt
patch users.

So the fact that something similar is in -rt is good, but it's not a
silver bullet ..

Daniel

2007-06-22 22:43:29

by Roland Dreier

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

> > [ and on a similar notion, i still havent given up on seeing all BKL
> > use gone from the kernel. I expect it to happen any decade now ;-) ]

> 2.6.21 had 476 lock_kernel() calls. 2.6.22-git has 473 lock_kernel()
> calls currently. With that kind of flux we'll see the BKL gone in about
> 40 years =B-)

> 'struct semaphore' use on the other hand has gone down by 10% in this
> release, which is a good rate. I guess the lack of lockdep coverage for
> semaphores might be one of the driving forces? ;-)

The problem with removing uses of the BKL is that a "lock_kernel()"
gives no clue about what it is protecting against, and so it requires
a lot of very difficult auditing to replace with appropriate locking.

To take a couple of examples at random: fs/ext4/ioctl.c takes the BKL
in ext4_compat_ioctl() around the call to ext4_ioctl(). Kind of sad
that a "next-generation" FS still uses the BKL, but who understands
things well enough to say how all the cases in ext4_ioctl() are
relying on being called with the BKL held?

As a second example, msr_seek() in arch/i386/kernel/msr.c... is the
inode semaphore enough or not? Who understands the implications well
enough to say?

Most semaphores on the other hand can be replaced by mutexes or
completions in a fairly straightforward way.
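
The typical conversion is just (illustrative, with a made-up foo_lock):

#include <linux/mutex.h>

/* before: static DECLARE_MUTEX(foo_sem); ... down(&foo_sem); up(&foo_sem); */
static DEFINE_MUTEX(foo_lock);

static void foo_do_something(void)
{
	mutex_lock(&foo_lock);
	/* critical section - now with lockdep coverage */
	mutex_unlock(&foo_lock);
}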

- R.

2007-06-22 22:45:58

by Ingo Molnar

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Daniel Walker <[email protected]> wrote:

> > remember, these changes have been in use in -rt for a while. there's
> > reason to believe that they aren't going to cause drastic problems.
>
> Since I've been working with -rt (~2 years now I think) it's clear
> that the number of testers of the patch isn't all that high compared
> to the stable kernel. [...]

You havent been watching it too closely i guess :-) The -rt kernel often
pops up regressions before mainline does, especially when it comes to
arcane hardware often used by embedded vendors [ B-) ]. It even
triggers certain high-end scalability and race bugs before the mainline
kernel does, due to its unique scheduling behavior.

So yes, -rt obviously does not have as wide a tester base as the
mainline kernel (but it's by no means small); it nevertheless has a
tester base that is partly orthogonal to the mainline kernel's.

Furthermore, -rt has a wide enough tester base that if something has not
caused problems in it for years, that is certainly at least a good
indicator that it isnt going to cause drastic problems ... which was the
point to begin with.

Ingo

2007-06-22 22:54:52

by Alan

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

> As a second example, msr_seek() in arch/i386/kernel/msr.c... is the
> inode semaphore enough or not? Who understands the implications well
> enough to say?

lseek is one of the nasty remaining cases. tty is another real horror
that needs further work, but we slowly get closer - drivers/char is almost
but not entirely lock_kernel free now, and several users look quite easy
to swat (everyone but tty_io.c).

Alan

2007-06-22 23:01:00

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 23:59 +0200, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
> > If the numbers say that there is no performance difference (or even
> > better: that the new code performs better or fixes some latency issue
> > or whatever), I'll be very happy. But if the numbers say that it's
> > worse, no amount of cleanliness really changes that.
>
> Most of the tasklet uses are in rarely used or arcane drivers - in fact
> none of my 10 test-boxes utilizes _any_ tasklet in any way that could
> even get close to mattering to performance. In other words: i just
> cannot test this, nor do i think that others will really test this.

This is exactly why I included that CONFIG option in the first series:
I only have a handful of hardware that actually uses tasklets.
And all those pr_debugs I had were turned on on most of my boxes. I
was not flooded with prints either (every function including
tasklet_schedule had a print).

So, basically, I can't do benchmarks. I was hoping to get this into -mm
with an easy way for people who have hardware that uses tasklets
extensively to run it with tasklets on and off to see if there is a
difference. My fear with not having a config option to switch between the
two (for -mm only) is that we may lose benchmarking from those who are
not comfortable with removing this patch from -mm. There are people out
there who download and test the -mm tree straight from kernel.org.
Just because someone compiles their own kernel doesn't mean they can (or
will) patch it.

-- Steve



2007-06-22 23:32:32

by Daniel Walker

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Sat, 2007-06-23 at 00:44 +0200, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > > remember, these changes have been in use in -rt for a while. there's
> > > reason to believe that they aren't going to cause drastic problems.
> >
> > Since I've been working with -rt (~2 years now I think) it's clear
> > that the number of testers of the patch isn't all that high compared
> > to the stable kernel. [...]
>
> You havent been watching it too closely i guess :-) The -rt kernel often
> pops up regressions before mainline does, especially when it comes to
> arcane hardware often used by embedded vendors [ B-) ]. It even
> triggers certain high-end scalability and race bugs before the mainline
> kernel does, due to its unique scheduling behavior.

Don't assume anything Ingo ;)

> So yes, -rt obviously does not have as wide a tester base as the
> mainline kernel (but it's by no means small); it nevertheless has a
> tester base that is partly orthogonal to the mainline kernel's.
>
> Furthermore, -rt has a wide enough tester base that if something has not
> caused problems in it for years, that is certainly at least a good
> indicator that it isnt going to cause drastic problems ... which was the
> point to begin with.

As I said, it's a plus, but not a silver bullet. Are you saying I was
wrong?

Daniel

2007-06-23 05:14:55

by Stephen Hemminger

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 22 Jun 2007 00:00:14 -0400
Steven Rostedt <[email protected]> wrote:

>
> There's a very nice paper by Matthew Wilcox that describes Softirqs,
> Tasklets, Bottom Halves, Task Queues, Work Queues and Timers[1].
> The paper describes the history of these items. Softirqs and
> tasklets were created to replace bottom halves after a company (Mindcraft)
> showed that Microsoft on a 4x SMP box would outdo Linux. It was discovered
> that this was due to a bottleneck caused by the design of Bottom Halves.
> So Alexey Kuznetsov and Dave Miller [1] (and I'm sure others) created
> softirqs and tasklets to multithread the bottom halves.
>
> This worked well, and for the time it shut up Microsoft^WMindcraft from
> saying Linux was slow at networking.
>
> Time passed, and Linux developed other nifty tools, like kthreads and
> work queues. These run in a process context and are not as menacing to
> latencies as softirqs and tasklets are. Specifically, a tasklet
> acts like a task by only being able to run its function on one CPU
> at a time. The same tasklet cannot run on multiple CPUs. So in that
> aspect it is like a task (a task can only exist on one CPU at a time).
> But a tasklet is much harder on the rest of the system because it
> runs in interrupt context. This means that if a higher priority process
> wants to run, it must wait for the tasklet to finish before doing so.
>
> For the most part, tasklets today are not used for time-critical functions.
> Running tasklets in thread context is not harmful to the performance of
> the overall system. But running them in interrupt context is, since
> they increase the overall latency for high priority tasks.
>

You will need to search out and convert all network drivers that are using
tasklets. Drivers like ipw2200 will need to be converted to NAPI,
and you will probably have to fix a couple of places in the ieee80211 stack
as well.

2007-06-23 06:23:42

by Dave Airlie

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

>
> Most of the tasklet uses are in rarely used or arcane drivers - in fact
> none of my 10 test-boxes utilizes _any_ tasklet in any way that could
> even get close to mattering to performance. In other words: i just
> cannot test this, nor do i think that others will really test this. I.e.
> if we dont approach this problem in some other way, nothing will happen
> and Steve's patch will be stalled forever and will live in -rt forever.
> (which might be a correct end result too, but i'm just not giving up
> this easily :-)

I've no idea, but the drm uses tasklets to schedule the page flip or
front/back blit as close to the vblank interrupt as possible, to avoid
tearing in vblank-synced applications (and compositing managers).

I'm not sure we have any way to quantify this other than that the closer
to the irq the flip happens, the better the chance of the screen not
tearing. Going forward, where compositing is more used, I wonder whether
we will see more tearing on loaded systems.

I await the cleanups for the DRM code; it came from TG to support the
above feature on Intel hw.

Dave.

2007-06-24 15:43:33

by Jonathan Corbet

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar <[email protected]> wrote:

> so how about the following, different approach: anyone who has a tasklet
> in any performance-sensitive codepath, please yell now.

The cafe_ccic (OLPC) camera driver uses a tasklet to move frames out of
the DMA buffers in the streaming I/O path. With this change in place,
I'd worry that the possibility of dropping frames would increase,
especially considering that (1) this is running on OLPC hardware, and
(2) there is typically a streaming video application running in user
space.

Obviously some testing is called for here. I will make an attempt to do
that testing, but the next few weeks involve some insane travel which
will make that hard. Stay tuned.

Thanks,

jon

Jonathan Corbet / LWN.net / [email protected]

2007-06-24 15:52:29

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


On Sun, 24 Jun 2007, Jonathan Corbet wrote:
>
> The cafe_ccic (OLPC) camera driver uses a tasklet to move frames out of
> the DMA buffers in the streaming I/O path. With this change in place,
> I'd worry that the possibility of dropping frames would increase,
> especially considering that (1) this is running on OLPC hardware, and
> (2) there is typically a streaming video application running in user
> space.
>
> Obviously some testing is called for here. I will make an attempt to do
> that testing, but the next few weeks involve some insane travel which
> will make that hard. Stay tuned.
>

Jon,

Thanks for pointing this driver out. I'd greatly appreciate any report
you can give us regarding the effect of running the tasklet as a thread.

-- Steve

2007-06-25 16:46:17

by Tilman Schmidt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar <[email protected]> wrote:
> so how about the following, different approach: anyone who has a tasklet
> in any performance-sensitive codepath, please yell now.

The Siemens Gigaset ISDN base driver uses tasklets in its isochronous
data paths. These will be scheduled for each completion of an isochronous
URB, or every 8 msec for each of the four isochronous pipes if both B
channels are connected. The driver uses three URBs for each pipe, always
maintaining two in flight while processing the third one. So the tasklet
has to run within 16 ms from being scheduled in order to avoid packet
loss (in the receive path) or data underrun (in the transmit path).

Does that qualify as performance sensitive for the purpose of this
discussion?

Thanks,
Tilman

--
Tilman Schmidt E-Mail: [email protected]
Bonn, Germany
This message consists of 100% recycled bits.
Best before (if unopened): see reverse.



2007-06-25 17:07:00

by Steven Rostedt

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 18:50 +0200, Tilman Schmidt wrote:

> The Siemens Gigaset ISDN base driver uses tasklets in its isochronous
> data paths. These will be scheduled for each completion of an isochronous
> URB, or every 8 msec for each of the four isochronous pipes if both B
> channels are connected. The driver uses three URBs for each pipe, always
> maintaining two in flight while processing the third one. So the tasklet
> has to run within 16 ms from being scheduled in order to avoid packet
> loss (in the receive path) or data underrun (in the transmit path).
>
> Does that qualify as performance sensitive for the purpose of this
> discussion?

Actually, no. 16ms, even 8ms, is an incredible amount of time. Unless
you have a thread that is running at a higher priority than the thread
that handles the work queue performing the task, you would have no
problem making that deadline. If you did miss the deadline without
having ridiculously high-prio tasks, I would think that you would miss
your deadline with tasklets as well - unless the large latency has to do
with preempt_disable, but that large a latency would IMHO be a bug.

-- Steve


2007-06-25 18:50:33

by Kristian Høgsberg

Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Fri, 2007-06-22 at 23:59 +0200, Ingo Molnar wrote:
> so how about the following, different approach: anyone who has a tasklet
> in any performance-sensitive codepath, please yell now. We'll also do a
> proactive search for such places. We can convert those places to
> softirqs, or move them back into hardirq context. Once this is done -
> and i doubt it will go beyond 1-2 places - we can just mass-convert the
> other 110 places to the lame but compatible solution of doing them in a
> global thread context.

OK, here's a yell. I'm using tasklets in the new firewire stack for all
interrupt handling. All my interrupt handler does is read out the event
mask and schedule the appropriate tasklets. Most of these tasklets
typically just end up scheduling work or completing a completion, so
moving it to a workqueue is pretty pointless. In particular, the
isochronous DMA events must be handled with as little latency as
possible, so a workqueue in that code path would be pretty bad.

I'm not strongly attached to tasklets, and it sounds like I got it wrong
and used the wrong delayed execution mechanism. But that's just another
data point that suggests that there are too many of these. I guess I
need to sit down and look into porting that to softirqs?

However, I don't really understand how you can discuss a wholesale
replacement of tasklets with workqueues, given the very different
execution semantics of the two mechanisms. I would think that others
have used tasklets for similar purposes as I have and moving that to
workqueues just has to break a bunch of stuff. I don't know the various
places tasklets are used as well as other people in this thread, but I
think deprecating them and moving code to either softirqs or workqueues
on a case by case basis is a better approach. That way we also avoid
the gross wrappers.

Kristian


2007-06-25 19:13:27

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 14:48 -0400, Kristian Høgsberg wrote:

> OK, here's a yell. I'm using tasklets in the new firewire stack for all

Thanks for speaking up!

> interrupt handling. All my interrupt handler does is read out the event
> mask and schedule the appropriate tasklets. Most of these tasklets
> typically just end up scheduling work or completing a completion, so
> moving it to a workqueue is pretty pointless. In particular, the
> isochronous DMA events must be handled with as little latency as
> possible, so a workqueue in that code path would be pretty bad.
>
> I'm not strongly attached to tasklets, and it sounds like I got it wrong
> and used the wrong delayed execution mechanism. But that's just another
> data point that suggests that there are too many of these. I guess I
> need to sit down and look into porting that to softirqs?
>
> However, I don't really understand how you can discuss a wholesale
> replacement of tasklets with workqueues, given the very different
> execution semantics of the two mechanisms. I would think that others
> have used tasklets for similar purposes as I have and moving that to
> workqueues just has to break a bunch of stuff. I don't know the various
> places tasklets are used as well as other people in this thread, but I
> think deprecating them and moving code to either softirqs or workqueues
> on a case by case basis is a better approach. That way we also avoid
> the gross wrappers.

The gross wrappers were a perfect way to shed light on something that is
overused, and should most likely be replaced.

Does your system need the functions that are in tasklets to be
non-reentrant? I wonder how many "irq critical" functions used
tasklets just because adding a softirq requires too much (there is no
generic softirq code). A tasklet is constrained to run on one CPU at a
time, and it is not guaranteed to run on the CPU it was scheduled on.

Perhaps it's time to add new functionality while removing tasklets.
Things that are ok to bounce around CPUs (like tasklets do) can most
likely be replaced by a work queue. But the highly critical tasks
would probably benefit from being a softirq.

Maybe we should be looking at something like GENERIC_SOFTIRQ to run
functions that a driver could add. But they would run only on the CPU
that scheduled them, and would not guarantee non-reentrancy as tasklets do
today.
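
Roughly, I'm picturing an interface along these lines (just a sketch,
nothing is implemented, and the names are up for discussion):

/* Hypothetical GENERIC_SOFTIRQ interface -- does not exist today. */
struct softirq_work {
	struct softirq_work *next;
	void (*func)(struct softirq_work *work);
};

/*
 * Queue on the CPU that calls this; func runs later from softirq
 * context on that same CPU.  Unlike a tasklet, there is no
 * non-reentrancy guarantee: queue the same work on two CPUs and
 * func may run on both at once.
 */
void generic_softirq_queue(struct softirq_work *work);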

I think I even found a bug in a driver that was trying to get around the
non-reentrancy of a tasklet (will post this soon).

It's looking like only a few tasklets have this critical requirement,
and are the candidates to move to a more generic softirq. The rest of
the tasklets would be switched to work queues, and this gross wrapper of
mine can do that in the meantime so we can find those that should be
converted to a generic softirq.

-- Steve


2007-06-25 19:52:24

by Stephen Hemminger

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 25 Jun 2007 18:50:03 +0200
Tilman Schmidt <[email protected]> wrote:

> Ingo Molnar <[email protected]> wrote:
> > so how about the following, different approach: anyone who has a tasklet
> > in any performance-sensitive codepath, please yell now.


Getting rid of tasklets may seem like a good idea. But doing it by changing
them all to workqueues would have bad consequences for networking.

The first issue is that it would change the semantic assumptions in the
places where tasklets are used. Many places "know" that a tasklet runs in
softirq context, so extra locking is not needed.
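
For example (an illustrative pattern only, names made up), code like this
is correct today precisely because of those softirq semantics:

static void tx_clean_tasklet(unsigned long data)
{
	struct my_priv *priv = (struct my_priv *)data;

	/* No lock taken here: the same tasklet never runs on two CPUs
	 * at once, and it cannot be preempted by any task on this CPU.
	 * Correct today.  Run this body from a workqueue thread instead
	 * and it becomes preemptible, so the implicit serialization has
	 * to be replaced with explicit locking -- in every such driver. */
	while (priv->tx_done != priv->tx_head)
		free_tx_desc(priv, priv->tx_done++);
}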

The performance overhead of changing to workqueues could also be disastrous for
some devices. There are 10G device drivers that use tasklets to handle transmit
completion.


Here is a more detailed list of how network devices are using tasklets:

Receive packet handling: ifb, ppp, ipw2200, ipw2100
Receive buffer refill: acenic, s2io
Receive & Transmit: sc9031, sundance
Transmit buffer allocation: smc91x
Phy handling: skge

Sorry, if you are going to get rid of tasklets, you need to fix all the
network drivers first.

--
Stephen Hemminger <[email protected]>

2007-06-25 20:11:05

by Kristian Høgsberg

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 15:11 -0400, Steven Rostedt wrote:
> On Mon, 2007-06-25 at 14:48 -0400, Kristian Høgsberg wrote:
> ...
> > However, I don't really understand how you can discuss a wholesale
> > replacement of tasklets with workqueues, given the very different
> > execution semantics of the two mechanisms. I would think that others
> > have used tasklets for similar purposes as I have and moving that to
> > workqueues just has to break a bunch of stuff. I don't know the various
> > places tasklets are used as well as other people in this thread, but I
> > think deprecating them and moving code to either softirqs or workqueues
> > on a case by case basis is a better approach. That way we also avoid
> > the gross wrappers.
>
> The gross wrappers were a perfect way to shed light on something that is
> overused, and should most likely be replaced.
>
> Does your system need the functions that are in tasklets to be
> non-reentrant? I wonder how many "irq critical" functions used
> tasklets just because adding a softirq requires too much (there is no
> generic softirq code). A tasklet is constrained to run on one CPU at a
> time, and it is not guaranteed to run on the CPU it was scheduled on.

When I started the firewire work, I wasn't aware that tasklets were
going away, but I knew that doing too much work in the interrupt handler
was frowned upon, for good reasons. So I was looking at softirqs vs
tasklets, and since using softirqs means you have to go add yourself to
the big bitmask, I opted for tasklets. The comment in interrupt.h
directly recommends this. As it stands, the firewire stack does
actually rely on the non-reentrancy of tasklets, but that's not a
deal-breaker; I can add the necessary locking.

> Perhaps it's time to add new functionality while removing tasklets.
> Things that are ok to bounce around CPUs (like tasklets do) can most
> likely be replaced by a work queue. But the highly critical tasks
> would probably benefit from being a softirq.
>
> Maybe we should be looking at something like GENERIC_SOFTIRQ to run
> functions that a driver could add. But they would run only on the CPU
> that scheduled them, and would not guarantee non-reentrancy as tasklets do
> today.

Sounds like this will fill the gap. Of course, this won't reduce the
number of delayed-execution mechanisms available...

Kristian

2007-06-25 20:34:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 16:07 -0400, Kristian Høgsberg wrote:

> > Maybe we should be looking at something like GENERIC_SOFTIRQ to run
> > functions that a driver could add. But they would run only on the CPU
> > that scheduled them, and would not guarantee non-reentrancy as tasklets do
> > today.
>
> Sounds like this will fill the gap. Of course, this won't reduce the
> number of delayed-execution mechanisms available...

I disagree. Adding a generic softirq is not really adding another
delayed-execution mechanism, it's just extending the softirq. It would
not have any different semantics than a normal softirq, except that it
would be dynamic for modules to use. A tasklet has different concepts
than a softirq: it adds non-reentrancy, and the tasklet function can
run on a CPU other than the one where it was scheduled.

Adding a generic softirq would keep the same concepts as a softirq but
just extend its users. Tasklets are probably not the best for critical
sections, since a tasklet can be postponed longer if it was scheduled on
another CPU that is handling a bunch of other tasklets. So the latency
of running the tasklet is higher, and you lose a cache advantage by
jumping to another CPU to execute.

Work queues already exist, so all we need to do to replace tasklets is
to extend softirqs for those critical cases where tasklets are used, and
replace the rest with work queues. Moving the non-critical tasklets to
work queues will even lower the latency of the more critical tasklets.

-- Steve


2007-06-25 20:46:42

by Tilman Schmidt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 25.06.2007 19:06, Steven Rostedt wrote:
> On Mon, 2007-06-25 at 18:50 +0200, Tilman Schmidt wrote:
>
>> The Siemens Gigaset ISDN base driver uses tasklets in its isochronous
>> data paths. [...]
>> Does that qualify as performance sensitive for the purpose of this
>> discussion?
>
> Actually, no. 16ms, even 8ms is an incredible amount of time. Unless
> you have a thread that is running at a higher priority than the thread
> that handles the work queue performing the task, you would have no
> problems making that deadline.

Ok, I'm reassured. I'll look into converting these to a work queue
then, although I can't promise when I'll get around to it.

In fact, if these timing requirements are so easy to meet, perhaps
it doesn't even need its own work queue, and just making each
tasklet into a work item and queueing them to the global queue
with schedule_work() would do? Or am I getting too reckless now?

Thanks,
Tilman

--
Tilman Schmidt E-Mail: [email protected]
Bonn, Germany
This message consists of 100% recycled bits.
Best before (unopened): (see reverse)


Attachments:
signature.asc (253.00 B)
OpenPGP digital signature

2007-06-25 21:05:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 22:50 +0200, Tilman Schmidt wrote:

> Ok, I'm reassured. I'll look into converting these to a work queue
> then, although I can't promise when I'll get around to it.
>
> In fact, if these timing requirements are so easy to meet, perhaps
> it doesn't even need its own work queue, and just making each
> tasklet into a work item and queueing them to the global queue
> with schedule_work() would do? Or am I getting too reckless now?

I'm sure you probably wouldn't have a problem with using just
schedule_work. But that's shared, and you don't know with what. A
function in the keventd work queue can call schedule (not recommended,
but with closed-source drivers you'd never know; schedule_work is
EXPORT_SYMBOL, not EXPORT_SYMBOL_GPL).

So, if you convert it to work queues, I'd strongly recommend adding a
new instance.
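
The conversion itself would look something like this (a sketch against
the 2.6.22 workqueue API; the names are made up):

static struct workqueue_struct *gigaset_wq;

static void isoc_rx_work(struct work_struct *work)
{
	struct isoc_pipe *pipe = container_of(work, struct isoc_pipe, rx_work);
	/* ... the former tasklet body ... */
}

/* at init time: */
gigaset_wq = create_singlethread_workqueue("gigaset");
INIT_WORK(&pipe->rx_work, isoc_rx_work);

/* wherever tasklet_schedule() used to be called: */
queue_work(gigaset_wq, &pipe->rx_work);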

-- Steve


2007-06-25 21:10:09

by Kristian Høgsberg

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 16:31 -0400, Steven Rostedt wrote:
> On Mon, 2007-06-25 at 16:07 -0400, Kristian Høgsberg wrote:
>
> > > Maybe we should be looking at something like GENERIC_SOFTIRQ to run
> > > functions that a driver could add. But they would run only on the CPU
> > > that scheduled them, and would not guarantee non-reentrancy as tasklets do
> > > today.
> >
> > Sounds like this will fill the gap. Of course, this won't reduce the
> > number of delayed-execution mechanisms available...
>
> I disagree. Adding a generic softirq is not really adding another
> delayed-execution mechanism, it's just extending the softirq. It would
> not have any different semantics than a normal softirq, except that it
> would be dynamic for modules to use. A tasklet has different concepts
> than a softirq: it adds non-reentrancy, and the tasklet function can
> run on a CPU other than the one where it was scheduled.

Hmm, yeah, true. Ok, sounds useful, let me know when you have a patch.
I'll try porting the firewire stack and let you know how it works.

Just to make sure I got this right: softirqs will always be scheduled on
the irq handling CPU and if an interrupt schedules multiple softirqs
from one handler invocation, the softirqs will be executed in the order
they are scheduled? And the reentrancy that softirqs do not protect
against (as opposed to tasklets) is the case where different CPUs handle
the interrupt and each schedule the same softirq for execution?

Kristian


2007-06-25 21:17:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Kristian Høgsberg <[email protected]> wrote:

> OK, here's a yell. I'm using tasklets in the new firewire stack for
> all interrupt handling. All my interrupt handler does is read out the
> event mask and schedule the appropriate tasklets. Most of these
> tasklets typically just end up scheduling work or completing a
> completion, so moving it to a workqueue is pretty pointless. In
> particular, the isochronous DMA events must be handled with as little
> latency as possible, so a workqueue in that code path would be pretty
> bad.

regarding workqueues - would it be possible for you to test Steve's
patch and get us performance numbers? Do you have any test with tons of
tasklet activity that would definitely show the performance impact of
workqueues? Workqueue priority can be set, and your handler should
probably be SCHED_FIFO.

right now the tasklet-emulation workqueue is globally locked, etc., but
if you use per-cpu workqueues then you'd probably get better scalability
than tasklets. (yes, despite the extra scheduling (which only costs ~1
microsecond) that the workqueue has to do.) Scheduling is pretty cheap,
the basic overhead of servicing a single interrupt is often 10 times
more expensive than a context-switch.
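
(for completeness: in-kernel, pushing a workqueue thread to SCHED_FIFO
would look roughly like the sketch below - there is no workqueue-level
API for it yet, as comes up later in the thread, so 'thread' would have
to be the per-cpu workqueue thread itself:)

static void make_thread_fifo(struct task_struct *thread)
{
	struct sched_param param = { .sched_priority = 98 };

	sched_setscheduler(thread, SCHED_FIFO, &param);
}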

> I'm not strongly attached to tasklets, and it sounds like I got it
> wrong and used the wrong delayed execution mechanism. But that's just
> another data point that suggests that there are too many of these. I
> guess I need to sit down and look into porting that to softirqs?

i'd like to stress that your approach is completely fine and valid - and
if what we propose impacts performance negatively without any acceptable
(and robust) replacement solution offered by us then our patch won't be
done - simple as that. Softirqs could be an additional (performance)
advantage on SMP systems with multiple firewire interrupt sources, but
it would have to be measured too.

Ingo

2007-06-25 23:39:36

by Stefan Richter

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar wrote:
> regarding workqueues - would it be possible for you to test Steve's
> patch and get us performance numbers? Do you have any test with tons of
> tasklet activity that would definitely show the performance impact of
> workqueues?

I can't speak for Kristian, nor do I have test equipment for isochronous
applications, but I know that there are people out there who do data
acquisition on as many FireWire buses as they can stuff boards into
their boxes. There are also FireWire cards with 2 or 4 controllers per
board; and each controller can receive or transmit on several channels.

Depending on the buffering scheme, there may be one (?) interrupt per
channel and isochronous cycle. Or an interrupt when the buffer is
full. Some application programmers use large buffers; others want small
buffers. An isochronous cycle is 125us.

Asynchronous I/O can even produce much higher interrupt rates. I think
IP over 1394 might indeed cause interrupt rates that are moderately
higher than 1/125us during normal traffic. SBP-2 ( = 1394 storage) is
not as much affected because the bulk of data is transferred without
interrupts. So I suppose some eth1394 bandwidth tests with this patch
series might make sense... alas I'm short of spare time. (Would be
interesting to see whether the old ohci1394 driver is blown to bits with
the patch series; it's an old driver with who-knows what assumptions in
there.)

> Workqueue priority can be set, and your handler should
> probably be SCHED_FIFO.

Does this cooperate nicely with a SCHED_FIFO thread of a userspace data
acquisition program or audio server or the like?
--
Stefan Richter
-=====-=-=== -==- ==-=-
http://arcgraph.de/sr/

2007-06-26 00:00:26

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

A couple of days ago I said:

> The cafe_ccic (OLPC) camera driver uses a tasklet to move frames out of
> the DMA buffers in the streaming I/O path....
>
> Obviously some testing is called for here. I will make an attempt to do
> that testing

I've done that testing - I have an OLPC B3 unit running V2 of the
tasklet->workqueue patch, and all seems well. 30 FPS to the display and
no dropped frames. The tasklets/0 process is running 3-5% CPU, in case
that's interesting. For whatever reason, I see about 3% *more* idle
time when running just mplayer than I did without the patch.

Consider my minor qualms withdrawn, there doesn't seem to be any trouble
in this area.

Thanks,

jon

Jonathan Corbet / LWN.net / [email protected]

2007-06-26 00:47:25

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Tue, 2007-06-26 at 01:36 +0200, Stefan Richter wrote:

> I can't speak for Kristian, nor do I have test equipment for isochronous
> applications, but I know that there are people out there which do data
> acquisition on as many FireWire buses as they can stuff boards into
> their boxes. There are also FireWire cards with 2 or 4 controllers per
> board; and each controller can receive or transmit on several channels.
>
> Depending on the buffering scheme, there may be one (?) interrupt per
> channel and isochronous cycle. Or an interrupt when the buffer is
> full. Some application programmers use large buffers; others want small
> buffers. An isochronous cycle is 125us.
>
> Asynchronous I/O can even produce much higher interrupt rates. I think
> IP over 1394 might indeed cause interrupt rates that are moderately
> higher than 1/125us during normal traffic. SBP-2 ( = 1394 storage) is
> not as much affected because the bulk of data is transferred without
> interrupts. So I suppose some eth1394 bandwidth tests with this patch
> series might make sense... alas I'm short of spare time. (Would be
> interesting to see whether the old ohci1394 driver is blown to bits with
> the patch series; it's an old driver with who-knows what assumptions in
> there.)

Hi, any testing of the patches would be much appreciated. I don't have
access to any boxes that might have problems with running tasklets as
work queues. So if you know others with this equipment, please pass the
patches off to them; it will hopefully help us learn whether this patch
helps, hurts, or just doesn't make a difference.

>
> > Workqueue priority can be set, and your handler should
> > probably be SCHED_FIFO.
>
> Does this cooperate nicely with a SCHED_FIFO thread of a userspace data
> acquisition program or audio server or the like?

Well, if you put the prio of the work queue higher, it won't be any
different than a tasklet. A tasklet runs at an even higher priority than
any thread on the system.

-- Steve


2007-06-26 00:52:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 18:00 -0600, Jonathan Corbet wrote:
> A couple of days ago I said:
>
> > The cafe_ccic (OLPC) camera driver uses a tasklet to move frames out of
> > the DMA buffers in the streaming I/O path....
> >
> > Obviously some testing is called for here. I will make an attempt to do
> > that testing
>
> I've done that testing - I have an OLPC B3 unit running V2 of the
> tasklet->workqueue patch, and all seems well. 30 FPS to the display and
> no dropped frames. The tasklets/0 process is running 3-5% CPU, in case
> that's interesting. For whatever reason, I see about 3% *more* idle
> time when running just mplayer than I did without the patch.
>
> Consider my minor qualms withdrawn, there doesn't seem to be any trouble
> in this area.

Jon, thanks a lot!

This is great news. I wonder if converting tasklets to work queues also
helps with other softirqs. Before, softirqs could not preempt a
tasklet, since tasklets run as a softirq. With tasklets as work queues,
what's left as a softirq can now preempt tasklets. Perhaps this can even
help with performance.

-- Steve


2007-06-26 01:46:49

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

> so how about the following, different approach: anyone who has a tasklet
> in any performance-sensitive codepath, please yell now. We'll also do a
> proactive search for such places. We can convert those places to
> softirqs, or move them back into hardirq context. Once this is done -
> and i doubt it will go beyond 1-2 places - we can just mass-convert the
> other 110 places to the lame but compatible solution of doing them in a
> global thread context.
>

I have a driver / testcase that reacts negatively to a workqueue
conversion. This is with the iop-adma driver on an ARM based platform
re-syncing a degraded raid5 array. The driver is currently in -mm and
it uses tasklets to run a short callback routine upon completion of an
offloaded memcpy or xor operation. Quick tests show that
write-throughput does not go down too much, but resync speed, as
reported by /proc/mdstat, drops from ~50MB/s to ~30MB/s.

Context switches on this platform flush the L1 cache so bouncing
between a workqueue and the MD thread is painful.

The conversion patch is attached.

--
Dan


Attachments:
(No filename) (1.06 kB)
iop-adma-workq-conv.patch (3.65 kB)

2007-06-26 02:03:17

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Mon, 2007-06-25 at 18:46 -0700, Dan Williams wrote:
>
> Context switches on this platform flush the L1 cache so bouncing
> between a workqueue and the MD thread is painful.

Why are context switches between two kernel threads flushing the L1
cache? Is this a flaw in the ARM arch? I would think the only thing
that needs to be done between a context switch of two kernel threads (or
even a user thread to a kernel thread) is update the general regs and
stack. The memory access (page_tables or whatever ARM uses) should stay
the same.

Perhaps something else is at fault here.

Thanks for testing!

-- Steve


2007-06-26 02:12:21

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 6/25/07, Steven Rostedt <[email protected]> wrote:
> On Mon, 2007-06-25 at 18:46 -0700, Dan Williams wrote:
> >
> > Context switches on this platform flush the L1 cache so bouncing
> > between a workqueue and the MD thread is painful.
>
> Why are context switches between two kernel threads flushing the L1
> cache? Is this a flaw in the ARM arch? I would think the only thing
> that needs to be done between a context switch of two kernel threads (or
> even a user thread to a kernel thread) is update the general regs and
> stack. The memory access (page_tables or whatever ARM uses) should stay
> the same.
>
Yes you are right, ARM does not flush L1 when prev==next in switch_mm.

> Perhaps something else is at fault here.
>
I'll try and dig a bit deeper...

2007-06-26 13:16:27

by Takashi Iwai

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

At Tue, 26 Jun 2007 15:03:23 +0200,
Clemens Ladisch wrote:
>
> Ingo Molnar wrote:
> > so how about the following, different approach: anyone who has a tasklet
> > in any performance-sensitive codepath, please yell now.
>
> ALSA uses quite a few tasklets in the framework and in several drivers.
> Since we care very much about low latency, many places use tasklet_hi_*.

I think we can replace tasklets with workqueues in many card drivers,
at least. Many of them use a tasklet simply because there was no
workqueue at that time. It's the correct move for such drivers.

But, yes, the code using tasklets in the core part (especially in the
timer part) requires as low a latency as possible.


Takashi

2007-06-26 13:23:51

by Clemens Ladisch

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar wrote:
> so how about the following, different approach: anyone who has a tasklet
> in any performance-sensitive codepath, please yell now.

ALSA uses quite a few tasklets in the framework and in several drivers.
Since we care very much about low latency, many places use tasklet_hi_*.

It would be possible to convert to some GENERIC_SOFTIRQ mechanism, but
then we'd want to use a softirq that has higher priority than the
'standard' generic softirq, similar to HI_SOFTIRQ vs. TASKLET_SOFTIRQ.

BTW: Is there a reason why HRTIMER_SOFTIRQ is the lowest-priority
softirq instead of being near TIMER_SOFTIRQ?


Regards,
Clemens

2007-06-28 05:49:14

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar wrote:
> so how about the following, different approach: anyone who has a tasklet
> in any performance-sensitive codepath, please yell now. We'll also do a
> proactive search for such places. We can convert those places to
> softirqs, or move them back into hardirq context. Once this is done -
> and i doubt it will go beyond 1-2 places - we can just mass-convert the
> other 110 places to the lame but compatible solution of doing them in a
> global thread context.


Color me unconvinced.

Tasklets fill a niche not filled by either workqueues (slower, requiring
context switches, and possibly much latency if all wq's processes are
active) or softirqs (limited number of them, not flexible at all).
Sure, tasklets kick over to ksoftirqd, but not immediately, and therein
lies their value.

And moving code -back- into hardirq is just the wrong thing to do, usually.

This proposal is ENTIRELY derived from "not convenient to my project"
logic AFAICS, rather than the more sound "not needed in the kernel."

Jeff



2007-06-28 09:24:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Jeff Garzik <[email protected]> wrote:

> Tasklets fill a niche not filled by either workqueues (slower,
> requiring context switches, and possibly much latency if all wq's
> processes are active) [...]

... workqueues are also possibly much more scalable (percpu workqueues
are easy without changing anything in your code but the call where you
create the workqueue).

the context-switch argument i'll believe if i see numbers. You'll
probably need in excess of tens of thousands of irqs/sec to even be able
to measure its overhead. (workqueues are driven by nice kernel threads
so there's no TLB overhead, etc.)

the only remaining argument is latency: but workqueues are already
pretty high-prio (with a default priority of nice -5) - and you can
increase it even further. You can make it SCHED_FIFO prio 98 if latency
is so important. Tasklets on the other hand are _unconditionally_
high-priority. So this argument is more of an arms-race argument: "i
want _my_ processing to be done immediately!". The fact that workqueues
can be preempted and that their priorities can be adjusted flexibly is
an optional _bonus_, not a disadvantage. If low-prio workqueues hurt
your workflow, make them high-prio.

> And moving code -back- into hardirq is just the wrong thing to do,
> usually.

agreed - except if the in-tasklet processing is really thin and there's
already a softirq layer in the workflow. (which was the case for the
example that was cited.) In such a case moving either to the hardirq or
to the softirq looks like the right thing - instead of the tasklet
intermediary.

Ingo

2007-06-28 12:39:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


Hi Dan,

On Mon, 25 Jun 2007, Dan Williams wrote:

> Yes you are right, ARM does not flush L1 when prev==next in switch_mm.
>
> > Perhaps something else is at fault here.
> >
> I'll try and dig a bit deeper...

BTW:

 static int __init iop_adma_init (void)
 {
+	iop_adma_workqueue = create_workqueue("iop-adma");
+	if (!iop_adma_workqueue)
+		return -ENODEV;
+

Could you also try upping the prio of all the "iop-adma" threads?

You should see thread names such as (on SMP) "iop-adma/0", "iop-adma/1"
... "iop-adma/N" where N = # of CPUs - 1.

do a "chrt -p -f 98 <pid>" once for each of the thread's PIDs. The
chrt can be found in the package "util-linux" on Red Hat / Fedora, and in
schedutils on Debian.

It just dawned on me that workqueues don't run at a high priority by
default. So it's funny that I'm running all my current tasklets as low
priority work queues :-)

But that can certainly be a cause of high latency. I need to update my
patches to make the workqueue thread a higher priority. All benchmarks on
this patch have been using a low priority work queue.

I also don't see any nice API to have the priority set for a workqueue
thread from within the kernel. Looks like one needs to be added,
otherwise, I need to have the wrapper dig into the workqueue structs to
find the thread that handles the workqueue.

Thanks,

-- Steve

2007-06-28 14:40:25

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> the context-switch argument i'll believe if i see numbers. You'll
> probably need in excess of tens of thousands of irqs/sec to even be able
> to measure its overhead. (workqueues are driven by nice kernel threads
> so there's no TLB overhead, etc.)

It was the authors of the patch who were supposed to give some numbers,
at least one or two, just to prove the concept. :-)

According to my measurements (maybe wrong), on a 2.5GHz P4 a tasklet
schedule and execution eats ~300ns; a workqueue eats ~4usec.
On my 1.8GHz PM notebook (UP kernel), the numbers are 170ns and 1.2usec.
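
The measurement itself can be as trivial as the sketch below
(illustrative only, not the actual test patch): stamp the clock at
schedule time, diff it when the function runs, average over many runs.

static ktime_t stamp;
static s64 total_ns;
static unsigned long runs;

static void bench_fn(unsigned long data)
{
	total_ns += ktime_to_ns(ktime_sub(ktime_get(), stamp));
	runs++;
}
static DECLARE_TASKLET(bench_tasklet, bench_fn, 0);

/* per iteration, best done from interrupt context (e.g. a timer
 * handler) so the tasklet runs in the normal softirq path instead of
 * bouncing through ksoftirqd: */
stamp = ktime_get();
tasklet_schedule(&bench_tasklet);
/* for the workqueue number, do the same with a work_struct and
 * schedule_work() */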

Awful as it looks formally, this result is positive: tasklets are almost
never used in hot paths. I am sure about only one such place: the acenic
driver uses a tasklet to refill the rx queue. This generates no more than
3000 tasklet schedules per second. Even on the P4, a pure workqueue
schedule will eat only ~1% of bare CPU ticks.

Anyway, all the uses of tasklet should be verified:

The most dubious place is the popular Neterion 10Gbit driver, which uses
a tasklet like acenic does. But at 10Gbit, multiply the acenic numbers
and panic. :-)

Also, there exists some hardware which uses tasklets even harder,
but I have no idea what the real frequencies are: f.e. sundance.

The case with acenic/s2io is quite special: normally network drivers
refill queues in irq handlers. It was Jes Sorensen's observation
that offloading refilling from the irq improves performance; I do not
remember the numbers. Probably switching to workqueues will not affect
performance at all; probably it will just collapse. No idea.


> ... workqueues are also possibly much more scalable

I cannot figure out - scale in what direction? :-)


> (percpu workqueues
> are easy without changing anything in your code but the call where you
> create the workqueue).

I do not see how it is related to scalability. And the statement
does not even make sense. The patch already uses a per-cpu workqueue
for tasklets, otherwise it would be a disaster: guaranteed cpu non-locality.

A tasklet is single-threaded by definition and purpose. The few places
where people used tasklets to do per-cpu jobs (RCU f.e.) exist just because
they had troubles with allocating a new softirq. Workqueues do not make
any difference: a tasklet is not a workqueue, it is a work_struct, and you
will still have to allocate an array of per-cpu work structs; everything
remains the same.


> the only remaining argument is latency:

You could set realtime priority by default, not a poor nice -5.
If some network adapters were killed just because I run some task
with nice --22, it would be just ridiculous.

Alexey

2007-06-28 15:18:17

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar wrote:
> * Jeff Garzik <[email protected]> wrote:
>
>> Tasklets fill a niche not filled by either workqueues (slower,
>> requiring context switches, and possibly much latency if all wq's
>> processes are active) [...]
>
> ... workqueues are also possibly much more scalable (percpu workqueues
> are easy without changing anything in your code but the call where you
> create the workqueue).

All that scalability is just overhead, and overkill, for what
tasklets/softirqs are used for.


> the context-switch argument i'll believe if i see numbers. You'll
> probably need in excess of tens of thousands of irqs/sec to even be able
> to measure its overhead. (workqueues are driven by nice kernel threads
> so there's no TLB overhead, etc.)

As Alexey said... I would have thought YOU needed to provide numbers,
rather than just handwaving as justification for tasklet removal.


> the only remaining argument is latency: but workqueues are already
> pretty high-prio (with a default priority of nice -5) - and you can
> increase it even further. You can make it SCHED_FIFO prio 98 if latency
> is so important.

You skipped the very relevant latency killer: N threads in wq, and you
submit the (N+1)th task.

I just cannot see how that is an acceptable replacement for a network
driver that uses tasklets. Who wants to wait that long for packet RX or TX?


> Tasklets on the other hand are _unconditionally_
> high-priority. So this argument is more of an arms-race argument: "i
> want _my_ processing to be done immediately!". The fact that workqueues
> can be preempted and that their priorities can be adjusted flexibly is
> an optional _bonus_, not a disadvantage. If low-prio workqueues hurt
> your workflow, make them high-prio.

How about letting us stick with a solution that is WORKING now?

Of course tasklets are unconditionally high priority. So are hardirqs.
So are softirqs. This is not a problem, this is an expected and
assumed-upon feature of the system.


>> And moving code -back- into hardirq is just the wrong thing to do,
>> usually.
>
> agreed - except if the in-tasklet processing is really thin and there's
> already a softirq layer in the workflow. (which was the case for the
> example that was cited.) In such a case moving either to the hardirq or
> to the softirq looks like the right thing - instead of the tasklet
> intermediary.

Wrong, for all the examples I care about -- drivers. Network drivers in
particular. Just look at the comment in include/linux/interrupt.h if it
wasn't clear:

/* PLEASE, avoid to allocate new softirqs, if you need not _really_ high
frequency threaded job scheduling. For almost all the purposes
tasklets are more than enough. F.e. all serial device BHs et
al. should be converted to tasklets, not to softirqs.
*/

There is a good reason for this advice, as hinted at by the code
immediately following the comment:

enum
{
	HI_SOFTIRQ=0,
	TIMER_SOFTIRQ,
	NET_TX_SOFTIRQ,
	NET_RX_SOFTIRQ,
	BLOCK_SOFTIRQ,
	TASKLET_SOFTIRQ,
	SCHED_SOFTIRQ,
#ifdef CONFIG_HIGH_RES_TIMERS
	HRTIMER_SOFTIRQ,
#endif
};

softirqs cannot really be used by drivers, because they are not modular.
They are a scarce resource in any case.
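
To be concrete, this is what hooking a softirq looks like in 2.6.22
(a sketch; blk_done_softirq is the block layer's existing user):

/* The nr must be one of the compile-time enum slots above, and
 * open_softirq() is not exported -- core code only, never a module: */
open_softirq(BLOCK_SOFTIRQ, blk_done_softirq, NULL);

raise_softirq(BLOCK_SOFTIRQ);	/* marks it pending on the local CPU */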

Guess what? All this is why we have tasklets.

tasklet != workqueue

Jeff


2007-06-28 15:23:51

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Alexey Kuznetsov wrote:
> Hello!
>
>> the context-switch argument i'll believe if i see numbers. You'll
>> probably need in excess of tens of thousands of irqs/sec to even be able
>> to measure its overhead. (workqueues are driven by nice kernel threads
>> so there's no TLB overhead, etc.)
>
> It was the authors of the patch who were supposed to give some numbers,
> at least one or two, just to prove the concept. :-)
>
> According to my measurements (maybe wrong), on a 2.5GHz P4 a tasklet
> schedule and execution eats ~300ns; a workqueue eats ~4usec.
> On my 1.8GHz PM notebook (UP kernel), the numbers are 170ns and 1.2usec.

Thanks :)


> Anyway, all the uses of tasklet should be verified:
>
> The most dubious place is the popular Neterion 10Gbit driver, which uses
> a tasklet like acenic does. But at 10Gbit, multiply the acenic numbers and panic. :-)
>
> Also, there exists some hardware which uses tasklets even harder,
> but I have no idea what the real frequencies are: f.e. sundance.
>
> The case with acenic/s2io is quite special: normally network drivers
> refill queues in irq handlers. It was Jes Sorensen's observation
> that offloading refilling from the irq improves performance; I do not
> remember the numbers. Probably switching to workqueues will not affect
> performance at all; probably it will just collapse. No idea.

CPUs have gotten so fast now that it's quite possible to run the tasklet
in parallel with the next invocation of the interrupt handler.

But given the amount of tasklet use in network drivers, I do not think
tasklets can just be magically equated to workqueues, without
case-by-case analysis.


> A tasklet is single-threaded by definition and purpose. The few places

Indeed!


>> the only remaining argument is latency:
>
> You could set realtime priority by default, not a poor nice -5.
> If some network adapters were killed just because I run some task
> with nice --22, it would be just ridiculous.

Indeed.

Jeff


2007-06-28 15:54:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


On Thu, 28 Jun 2007, Alexey Kuznetsov wrote:
> > the context-switch argument i'll believe if i see numbers. You'll
> > probably need in excess of tens of thousands of irqs/sec to even be able
> > to measure its overhead. (workqueues are driven by nice kernel threads
> > so there's no TLB overhead, etc.)
>
> It was the authors of the patch who were supposed to give some numbers,
> at least one or two, just to prove the concept. :-)

The problem is that we don't have the hardware that uses tasklets in
critical ways. My original patch series had a debug print in every
function (tasklet_schedule and friends). I got a few scattered prints on
all my boxes but no flooding of prints. So I can't show that this will
hurt, because on my boxes it does not.

>
> You could set realtime prioriry by default, not a poor nice -5.
> If some network adapters were killed just because I run some task
> with nice --22, it would be just ridiculous.

This is my fault in the patch series. I completely forgot to up the prio.
My next series will include a change where the tasklet work queue will run
at something like prio FIFO 98 (or maybe 99?)

It is a bit embarrassing that I forgot to do this, since I'm a
real-time developer ;-)

-- Steve

2007-06-28 16:02:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Alexey Kuznetsov <[email protected]> wrote:

> > the context-switch argument i'll believe if i see numbers. You'll
> > probably need in excess of tens of thousands of irqs/sec to even be
> > able to measure its overhead. (workqueues are driven by nice kernel
> > threads so there's no TLB overhead, etc.)
>
> It was the authors of the patch who were supposed to give some numbers, at
> least one or two, just to prove the concept. :-)

sure enough! But it was not me who claimed that 'workqueues are slow'.

firstly, i'm not here at all to tell people what tools to use. I'm not
trying to 'force' people away from a perfectly logical technological
choice. I am just wondering out loud whether this particular tool, in
its current usage pattern, makes much technological sense. My claim is:
it could very well be that it doesn't make _much_ sense, and in that case
we should provide a non-intrusive migration path away in terms of a
compatible API wrapper to a saner (albeit by virtue of trying to emulate
an existing API, slower) mechanism. The examples cited so far had the
tasklet as an intermediary towards a softirq - what's the technological
point in such a splitup?

> According to my measurements (maybe wrong), on a 2.5GHz P4 a tasklet
> schedule and execution eats ~300ns; a workqueue eats ~4usec. On my
> 1.8GHz PM notebook (UP kernel), the numbers are 170ns and 1.2usec.

I find the 4usecs cost on a P4 interesting and a bit too high - how did
you measure it? (any test-patch for it i could try?) But i think even
your current numbers partly prove my point: with 1.2 usecs and 10,000
irqs/sec the cost is 1.2 msecs/sec, or 0.1%. And 10K irqs/sec themselves
will eat up much more CPU time than that already.

> Awful as it looks formally, this result is positive: tasklets are almost
> never used in hot paths. I am sure about only one such place: the acenic
> driver uses a tasklet to refill the rx queue. This generates no more than
> 3000 tasklet schedules per second. Even on the P4, a pure workqueue
> schedule will eat only ~1% of bare CPU ticks.

... and the irq cost itself will eat 5-10% of bare CPU ticks already.

> > ... workqueues are also possibly much more scalable
>
> I cannot figure out - scale in what direction? :-)

workqueues can be per-cpu - for tasklets to be per-cpu you have to
open-code them into per-cpu like rcu-tasklets did (which in essence
turns them into more expensive softirqs).

> > (percpu workqueues
> > are easy without changing anything in your code but the call where
> > you create the workqueue).
>
> I do not see how it is related to scalability. And the statement does
> not even make sense. The patch already uses a per-cpu workqueue for
> tasklets, otherwise it would be a disaster: guaranteed cpu
> non-locality.

my argument was: workqueues are more scalable than tasklets in general.

Just look at the tasklet_disable() logic. We basically have a per-cpu
list of tasklets that we poll in tasklet_action:

static void tasklet_action(struct softirq_action *a)
{
	[...]
	while (list) {
		struct tasklet_struct *t = list;

		list = list->next;

		if (tasklet_trylock(t)) {

and if the trylock fails, we just continue to meet this activated
tasklet again and again, in this nice linear list.

this happens to work in practice because 1) tasklets are used quite
rarely! 2) tasklet_disable() is done relatively rarely and nobody truly
runs tons of the same devices (which depend on a tasklet) on the same
box, but still it's quite an unhealthy approach. Every time i look at
the tasklet code it hurts - having fundamental stuff like that in the
heart of Linux ;-)

also, the "be afraid of the hardirq or the process context" mantra is
overblown as well. If something is too heavy for a hardirq, _it's too
heavy for a tasklet too_. Most hardirqs are (or should be) running with
interrupts enabled, which makes their difference to softirqs minuscule.

The most scalable workloads don't involve any (or many) softirq middlemen
at all: you queue work straight from the hardirq context to the target
process context. And that's what you want to do _anyway_, because you
want to create as little locally cached data for the hardirq context, as
the target task could easily be on another CPU. (this is generally true
for things like block IO, but it's also true for things like network
IO.)

the most scalable solution would be _for the network adapter to figure
out the target CPU for the packet_. Not many (if any) such adapters
exist at the moment. (as it would involve allocating NR_CPUs irqs to
that adapter alone.)

> A tasklet is single-threaded by definition and purpose. The few places
> where people used tasklets to do per-cpu jobs (RCU f.e.) exist just
> because they had troubles with allocating a new softirq. [...]

no. The following tale is the true and only history of the RCU tasklet
;-) The RCU guys first used a tasklet, then noticed its bad scalability
(a particular VFS-intense benchmark regressed because only a single CPU
would do RCU completion on an 8-way box) so they switched it to a
per-cpu tasklet - without realizing that a per-cpu tasklet is in essence
a softirq. I pointed it out to them (years down the road ...) then the
"convert rcu-tasklet to softirq" patch was born.

> > the only remaining argument is latency:
>
> You could set realtime priority by default, not a poor nice -5. If
> some network adapters were killed just because I run some task with
> nice --22, it would be just ridiculous.

there are only 20 negative nice levels ;-) And i don't really get the
'you might kill the network adapter' argument, because the opposite is
true just as much: tasklets from a totally uninteresting network adapter
can kill your latency-sensitive application too.

So providing more flexibility in the prioritization of the work that
goes on in the system (as long as it has no other drawbacks) can not be
wrong. The "but you will shoot yourself in the foot" argument is really
backwards in that context.

Tasklets are called 'task'-lets for a reason: they are poorly scheduled,
inflexible tasks. They were written in an age when we didn't have
workqueues, we didn't have kthreads, and real men thought they wanted to
do all their TCP/IP processing in softirq context [ am i heading down
the road towards a showdown with DaveM here? ;-) ].

Now ... you (and Jeff, and others) are right and workqueues could be too
slow for some of the cases (i said before that i'd be surprised if it
were more than 1-2), in which case my argument changes to what i
outlined above: if you want good scalability, don't use middlemen :-)
Figure out the target task as early as possible and let it do as much of
the remaining work as possible. _Increasing_ the amount of cached
context (by doing delayed processing in tasklets or even softirqs on the
same CPU where the hardirq arrived) only increases the cross-CPU cost.
Keeping stuff in a softirq only makes (some) sense as long as you have
no target task at all (routing, filtering, etc.).

Ingo

2007-06-28 16:37:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 06/28, Steven Rostedt wrote:
>
> I also don't see any nice API to have the priority set for a workqueue
> thread from within the kernel. Looks like one needs to be added,
> otherwise, I need to have the wrapper dig into the workqueue structs to
> find the thread that handles the workqueue.

It is not so trivial to implement properly. Note that CPU_UP creates a new
cwq->thread, so somehow the workqueue should "remember" its priority. This means
we should record it in workqueue_struct. The simplest way is to add yet
another parameter to __create_workqueue(), but this is nasty.

So, perhaps we should add "long nice" to "struct workqueue_struct", and then

void set_workqueue_nice(struct workqueue_struct *wq, long nice)
{
	const cpumask_t *cpu_map = wq_cpu_map(wq);
	struct cpu_workqueue_struct *cwq;
	int cpu;

	wq->nice = nice;

	mutex_lock(&workqueue_mutex);

	for_each_cpu_mask(cpu, *cpu_map) {
		cwq = per_cpu_ptr(wq->cpu_wq, cpu);
		if (cwq->thread)
			set_user_nice(cwq->thread, nice);
	}

	mutex_unlock(&workqueue_mutex);
}

We could use for_each_cpu_online() instead, but then we should check
is_single_threaded().
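
The tasklet wrapper would then simply do (sketch):

	tasklet_wq = create_workqueue("tasklets");
	set_workqueue_nice(tasklet_wq, -19);	/* or a future SCHED_FIFO variant */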

Oleg.

2007-06-28 17:27:14

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar wrote:
> my argument was: workqueues are more scalable than tasklets in general.

Here is my argument: that is totally irrelevant to $subject, when it
comes to managing existing [network driver] behavior and
performance.

My overall objection is the attempt to replace apples with oranges.

Network drivers use tasklets TODAY. Each driver -- in particular
acenic, ns83820, and the 10Gbps drivers -- has been carefully tuned to
use tasklets, hardirqs, and perhaps NAPI too. Changing to workqueues
WILL affect network driver hot paths, yet I see no analysis or
measurement at all of the behavior differences.

If hackers are willing to revisit each network driver, rework the
tasklet code to something more sane [in your opinion], and TEST it, I
will review the patches and happily ACK away.

Given that I feel that course of action is unlikely (the preferred
alternative apparently being "I don't use these drivers, but surely my
changes are OK anyway"), I do not see how this effort can proceed as is.

Lots of time went into tuning these network drivers for the specific
thread model they use. Maybe that thread model is no longer in style.
Maybe modern machine behavior dictates a different approach. The point
is... you don't know.

Jeff


2007-06-28 17:44:35

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Ingo Molnar wrote:
> But it was not me who claimed that 'workqueues are slow'.

The claim was: slower than tasklets.


> choice. I am just wondering out loud whether this particular tool, in
> its current usage pattern, makes much technological sense. My claim is:
> it could very well be that it doesn't make _much_ sense, and in that case
> we should provide a non-intrusive migration path away in terms of a
> compatible API wrapper to a saner (albeit by virtue of trying to emulate
> an existing API, slower) mechanism. The examples cited so far had the
> tasklet as an intermediary towards a softirq - what's the technological
> point in such a splitup?

I already answered that in detail. In sum, a driver cannot define its
own softirq. Softirqs are not modular.

Tasklets are the closest thing to softirqs for a driver.


> The most scalable workloads don't involve any (or many) softirq middlemen
> at all: you queue work straight from the hardirq context to the target
> process context. And that's what you want to do _anyway_, because you
> want to create as little locally cached data for the hardirq context, as
> the target task could easily be on another CPU. (this is generally true
> for things like block IO, but it's also true for things like network
> IO.)
>
> the most scalable solution would be _for the network adapter to figure
> out the target CPU for the packet_.

I agree completely. Wanna implement this? I will kiss your feet, and
multi-core CPU vendors will worship you as a demi-god.

Until such time, we must deal with the network stack as it exists today,
and the network drivers as they exist and work today.


> Not many (if any) such adapters
> exist at the moment. (as it would involve allocating NR_CPUs irqs to
> that adapter alone.)

Good news: this is becoming the norm for modern NICs, especially 10Gbps.

Plenty of NICs already exist that support multiple RX rings (presumably
one per CPU), and newer NICs will raise individual MSI[-X] interrupts
based on the RX ring into which a packet was received.

In this area, NIC vendors are way ahead of the Linux net stack.

The Linux net stack is unfortunately not threaded enough to sanely deal
with dividing /flows/ up across multiple CPUs, even if the NIC does
support multiple transmit and receive queues. [side note: initial
multi-queue TX is being worked on, on netdev]


>> A tasklet is single-threaded by definition and purpose. The few places
>> where people used tasklets to do per-cpu jobs (RCU f.e.) exist just
>> because they had troubles with allocating a new softirq. [...]
>
> no. The following tale is the true and only history of the RCU tasklet
> ;-) The RCU guys first used a tasklet, then noticed its bad scalability
> (a particular VFS-intense benchmark regressed because only a single CPU
> would do RCU completion on an 8-way box) so they switched it to a
> per-cpu tasklet - without realizing that a per-cpu tasklet is in essence
> a softirq. I pointed it out to them (years down the road ...) then the
> "convert rcu-tasklet to softirq" patch was born.

You focused on the example rather than the key phrase: a tasklet is
single-threaded by definition and purpose.

Wanting to change that without analysis of the impact illustrates the
apples-to-oranges change being proposed.


> outlined above: if you want good scalability, don't use middlemen :-)
> Figure out the target task as early as possible and let it do as much of
> the remaining work as possible. _Increasing_ the amount of cached
> context (by doing delayed processing in tasklets or even softirqs on the
> same CPU where the hardirq arrived) only increases the cross-CPU cost.
> Keeping stuff in a softirq only makes (some) sense as long as you have
> no target task at all (routing, filtering, etc.).

I do not disagree with these theoretical musings :)

I care the most about the "who will do all this work?" question. In
network driver land, these changes impact hot paths. I am lazy, and
don't care to revisit each network driver hot path and carefully re-tune
each based on this proposal. Who is volunteering?

Jeff


2007-06-28 18:03:03

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 6/28/07, Steven Rostedt <[email protected]> wrote:
>
> Hi Dan,
>
> On Mon, 25 Jun 2007, Dan Williams wrote:
>
> > Yes you are right, ARM does not flush L1 when prev==next in switch_mm.
> >
> > > Perhaps something else is at fault here.
> > >
> > I'll try and dig a bit deeper...
>
> BTW:
>
>  static int __init iop_adma_init (void)
>  {
> +	iop_adma_workqueue = create_workqueue("iop-adma");
> +	if (!iop_adma_workqueue)
> +		return -ENODEV;
> +
>
> Could you also try upping the prio of all the "iop-adma" threads?
>
Unfortunately setting the thread to real time priority makes
throughput slightly worse. Instead of floating around 35MB/s the
resync speed is stuck around 30MB/s:

[ iop-adma: hi-prio workqueue based callbacks ]
iq81340mc:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdd[4] sdc[2] sdb[1] sda[0]
468872448 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
[>....................] recovery = 3.9% (6107424/156290816)
finish=84.9min speed=29448K/sec

The iop-adma tasklet cleans up any completed descriptors and in the
process calls any attached callbacks. For the raid5 resync case the
callback is simply:

static void ops_complete_check(void *stripe_head_ref)
{
	struct stripe_head *sh = stripe_head_ref;
	int pd_idx = sh->pd_idx;

	pr_debug("%s: stripe %llu\n", __FUNCTION__,
		(unsigned long long)sh->sector);

	if (test_and_clear_bit(STRIPE_OP_MOD_DMA_CHECK, &sh->ops.pending) &&
	    sh->ops.zero_sum_result == 0)
		set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);

	set_bit(STRIPE_OP_CHECK, &sh->ops.complete);
	set_bit(STRIPE_HANDLE, &sh->state);
	release_stripe(sh);
}


[ iop-adma: tasklet based callbacks ]
iq81340mc:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdd[4] sdc[2] sdb[1] sda[0]
468872448 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
[=>...................] recovery = 5.1% (8024248/156290816)
finish=47.9min speed=51486K/sec

--
Dan

2007-06-28 18:27:52

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Thu, 28 Jun 2007 18:00:01 +0200 Ingo Molnar <[email protected]> wrote:

> with 1.2 usecs and 10,000
> irqs/sec the cost is 1.2 msecs/sec, or 0.1%.

off-by-10 error.

2007-06-28 20:09:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Andrew Morton <[email protected]> wrote:

> On Thu, 28 Jun 2007 18:00:01 +0200 Ingo Molnar <[email protected]> wrote:
>
> > with 1.2 usecs and 10,000
> > irqs/sec the cost is 1.2 msecs/sec, or 0.1%.
>
> off-by-10 error.

yeah, indeed - 12 msecs and 1.2% :-/

Ingo

2007-06-28 20:46:45

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On Thu, 28 Jun 2007, Dan Williams wrote:
> >
> Unfortunately setting the thread to real time priority makes
> throughput slightly worse. Instead of floating around 35MB/s the
> resync speed is stuck around 30MB/s:

That is really strange. If you raise the prio of the work queue it
gets worse? Something really strange is happening here. Are you using
CONFIG_PREEMPT?

-- Steve

2007-06-28 21:24:11

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 6/28/07, Steven Rostedt <[email protected]> wrote:
> On Thu, 28 Jun 2007, Dan Williams wrote:
> > >
> > Unfortunately setting the thread to real time priority makes
> > throughput slightly worse. Instead of floating around 35MB/s the
> > resync speed is stuck around 30MB/s:
>
> That is really strange. If you raise the prio of the work queue it
> gets worse? Something odd is going on here. Are you using
> CONFIG_PREEMPT?
>
Everything thus far has been CONFIG_PREEMPT=n (the default for this platform).

With CONFIG_PREEMPT=y the resync is back in the 50MB/s range.

[iop-adma: hi-prio workqueue, CONFIG_PREEMPT=y]
iq81340mc:~# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md0 : active raid5 sdd[4] sdc[2] sdb[1] sda[0]
468872448 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
[=>...................] recovery = 5.8% (9136404/156290816)
finish=46.1min speed=53161K/sec

The tasklet configuration stays in the 50MB/s ballpark, and the default
priority (nice -5) workqueue case remains in the 30's with
CONFIG_PREEMPT=n.
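
For reference, a minimal sketch of one way the priority boost could be
done with a dedicated thread (the names below are illustrative, not the
actual iop-adma code):

#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched.h>

static struct task_struct *adma_task;	/* hypothetical worker thread */

static int adma_task_fn(void *unused)
{
	while (!kthread_should_stop()) {
		/* ... reap completed descriptors, run callbacks ... */
		schedule();
	}
	return 0;
}

static int __init adma_task_init(void)
{
	struct sched_param param = { .sched_priority = 90 }; /* arbitrary */

	adma_task = kthread_run(adma_task_fn, NULL, "iop-adma-hi");
	if (IS_ERR(adma_task))
		return PTR_ERR(adma_task);

	/* SCHED_FIFO so the worker preempts the default nice -5 kthreads */
	sched_setscheduler(adma_task, SCHED_FIFO, &param);
	return 0;
}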

--
Dan

2007-06-28 21:40:23

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 6/28/07, Dan Williams <[email protected]> wrote:
> Everything thus far has been CONFIG_PREEMPT=n (the default for this platform).
>
> With CONFIG_PREEMPT=y the resync is back in the 50MB/s range.
>
> [iop-adma: hi-prio workqueue, CONFIG_PREEMPT=y]
> iq81340mc:~# cat /proc/mdstat
> Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid5 sdd[4] sdc[2] sdb[1] sda[0]
> 468872448 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
> [=>...................] recovery = 5.8% (9136404/156290816)
> finish=46.1min speed=53161K/sec
>
> The tasklet configuration stays in the 50MB/s ballpark, and the default
> priority (nice -5) workqueue case remains in the 30's with
> CONFIG_PREEMPT=n.
>
That last line should be CONFIG_PREEMPT=y

But I guess there is a reason it is still marked experimental...

iq81340mc:/data_dir# ./md0_verify.sh
kernel BUG at mm/page_alloc.c:363!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = b29b8000
[00000000] *pgd=00000000
Internal error: Oops: 805 [#1] PREEMPT
Modules linked in:
CPU: 0 Not tainted (2.6.22-rc6 #144)
PC is at __bug+0x20/0x2c
LR is at 0x403f37a0
pc : [<4002ef08>] lr : [<403f37a0>] psr: 60000093
sp : b28a3cc8 ip : 01d80000 fp : b28a3cd4
r10: 00000001 r9 : 00000001 r8 : 00000000
r7 : 0000004f r6 : ffffffe0 r5 : 412919c0 r4 : 412919e0
r3 : 00000000 r2 : 403f37a0 r1 : 60000093 r0 : 00000026
Flags: nZCv IRQs off FIQs on Mode SVC_32 Segment user
Control: 0400397f Table: 729b8018 DAC: 00000015
Process md0_verify_kern (pid: 4188, stack limit = 0xb28a2258)
Stack: (0xb28a3cc8 to 0xb28a4000)
3cc0: b28a3d1c b28a3cd8 40070a90 4002eef4 b28a2000 403f6818
3ce0: 403f682c 40000013 403f6838 0000001f 00000010 4129c3c0 20000013 15573000
3d00: 4129c3c0 00000000 b28a4550 00000000 b28a3d2c b28a3d20 40070b54 400706b8
3d20: b28a3d44 b28a3d30 40073e7c 40070b4c 4129c3c0 15574000 b28a3d58 b28a3d48
3d40: 40084734 40073e34 b3e0fdcc b28a3dcc b28a3d5c 40079cd8 400846f8 ffffffff
3d60: 00000000 b39a4860 b28a3ddc 00000000 00000001 15574000 b28a2000 4040d548
3d80: b28a4550 b3357240 72d9e0ff ffffffff 00000000 15573fff 00003ffe 15574000
3da0: 15573fff 4040d548 b28a3ddc b2c15320 b28a2000 00000000 b333cc00 b3357240
3dc0: b28a3e04 b28a3dd0 4007d724 40079844 b28a3dd8 00000000 00000015 4040d548
3de0: b3357240 b28f6920 b28a2000 b3357240 b3357240 b3a4f9c0 b28a3e18 b28a3e08
3e00: 4003f2d8 4007d6b4 b3a462e0 b28a3e5c b28a3e1c 40095e70 4003f2a0 00000000
3e20: 00000000 b28a3e3c 00000013 b28a3e5c b28a3e3c b333cc10 b3c15cc0 b333cc00
3e40: 00000080 b28a3fb0 00000000 00000000 b28a3f2c b28a3e60 400c8270 400958f0
3e60: 00000000 b28a3ec8 b28a2000 b3c15c00 00000003 00000000 b29e54a0 00000014
3e80: 00000044 b38b4f00 00000002 00000000 00000000 00000001 403f6aac 403f6aac
3ea0: b333cc7c b3a462e0 000200d2 00000000 403f6aa8 b28a2000 b333cd30 b28a3f08
3ec0: b28a3ecc 40070c38 4006fdbc 000200d2 403f6aac 00000010 41295a60 b28a2000
3ee0: b333cc7c 00000000 b333cc00 b333cc00 00000003 0001fff3 0000004f 0000004f
3f00: b333cc00 403f72a8 b28a2000 400c7e60 00000000 b333cc00 b28a3fb0 fffffffe
3f20: b28a3f5c b28a3f30 40095054 400c7e6c 00000000 b4503000 000bf408 b333cc00
3f40: 00000000 000bf628 b28a2000 b28a3fb0 b28a3f84 b28a3f60 40096c3c 40094f58
3f60: b4503000 000bf408 b28a3fb0 b4503000 4002b0e4 156da000 b28a3fa4 b28a3f88
3f80: 4002e3cc 40096b24 000bf408 000bf628 000bf788 0000000b 00000000 b28a3fa8
3fa0: 4002af20 4002e39c 000bf408 000bf628 000bf788 000bf628 000bf408 00000000
3fc0: 000bf408 000bf628 000bf788 ffffffff 000bf628 000bf408 156da000 000be848
3fe0: 15651550 3eb7e4ac 0002b4f0 1565158c 60000010 000bf788 00000000 00000000
Backtrace:
[<4002eee8>] (__bug+0x0/0x2c) from [<40070a90>] (free_hot_cold_page+0x3e4/0x434)
[<400706ac>] (free_hot_cold_page+0x0/0x434) from [<40070b54>]
(free_hot_page+0x14/0x18)
[<40070b40>] (free_hot_page+0x0/0x18) from [<40073e7c>] (put_page+0x54/0x174)
[<40073e28>] (put_page+0x0/0x174) from [<40084734>]
(free_page_and_swap_cache+0x48/0x64)
r5:15574000 r4:4129c3c0
[<400846ec>] (free_page_and_swap_cache+0x0/0x64) from [<40079cd8>]
(unmap_vmas+0x4a0/0x67c)
r4:b3e0fdcc
[<40079838>] (unmap_vmas+0x0/0x67c) from [<4007d724>] (exit_mmap+0x7c/0x158)
[<4007d6a8>] (exit_mmap+0x0/0x158) from [<4003f2d8>] (mmput+0x44/0x100)
[<4003f294>] (mmput+0x0/0x100) from [<40095e70>] (flush_old_exec+0x58c/0x9d8)
r4:b3a462e0
[<400958e4>] (flush_old_exec+0x0/0x9d8) from [<400c8270>]
(load_elf_binary+0x410/0x18fc)
[<400c7e60>] (load_elf_binary+0x0/0x18fc) from [<40095054>]
(search_binary_handler+0x108/0x2b4)
[<40094f4c>] (search_binary_handler+0x0/0x2b4) from [<40096c3c>]
(do_execve+0x124/0x1e4)
[<40096b18>] (do_execve+0x0/0x1e4) from [<4002e3cc>] (sys_execve+0x3c/0x5c)
[<4002e390>] (sys_execve+0x0/0x5c) from [<4002af20>] (ret_fast_syscall+0x0/0x3c)
r7:0000000b r6:000bf788 r5:000bf628 r4:000bf408
Code: e1a01000 e59f000c eb004d60 e3a03000 (e5833000)
note: md0_verify_kern[4188] exited with preempt_count 4

2007-06-28 22:00:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


On Thu, 28 Jun 2007, Dan Williams wrote:

> > CONFIG_PREEMPT?
> >
> Everything thus far has been CONFIG_PREEMPT=n (the default for this platform).
>
> With CONFIG_PREEMPT=y the resync is back in the 50MB/s range.

So upping the prio of the work queue got your performance back?

>
> [iop-adma: hi-prio workqueue, CONFIG_PREEMPT=y]
> iq81340mc:~# cat /proc/mdstat
> Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
> md0 : active raid5 sdd[4] sdc[2] sdb[1] sda[0]
> 468872448 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
> [=>...................] recovery = 5.8% (9136404/156290816)
> finish=46.1min speed=53161K/sec
>
> The tasklet configuration stays in the 50MB/s ballpark, and the default
> priority (nice -5) workqueue case remains in the 30's with
> CONFIG_PREEMPT=n.

[noted: should be CONFIG_PREEMPT=y]

This is expected. It seems you may have other things running at a higher prio.

-- Steve

2007-06-28 22:01:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues




>
> But I guess there is a reason it is still marked experimental...
>
> iq81340mc:/data_dir# ./md0_verify.sh
> kernel BUG at mm/page_alloc.c:363!

Well, at least this uncovered something :-)

I'll look into this too.

-- Steve

2007-06-29 11:35:13

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> I find the 4usecs cost on a P4 interesting and a bit too high - how did
> you measure it?

Simple and stupid:

int flag;

static void do_test(unsigned long dummy)
{
	flag = 1;
}

static void do_test_wq(void *dummy)
{
	flag = 1;
}

static void measure_tasklet0(void)
{
	int i;
	int cnt = 0;
	DECLARE_TASKLET(test, do_test, 0);
	unsigned long start = jiffies;

	for (i=0; i<1000000; i++) {
		flag = 0;
		local_bh_disable();
		tasklet_schedule(&test);
		local_bh_enable();
		/* the tasklet may already have run at local_bh_enable() */
		while (flag == 0) {
			schedule();
			cnt++;
		}
	}
	printk("tasklet0: %lu %d\n", jiffies - start, cnt);
}

static void measure_tasklet1(void)
{
	int i;
	int cnt = 0;
	DECLARE_TASKLET(test, do_test, 0);
	unsigned long start = jiffies;

	for (i=0; i<1000000; i++) {
		flag = 0;
		local_bh_disable();
		tasklet_schedule(&test);
		local_bh_enable();
		/* always schedule at least once before checking */
		do {
			schedule();
			cnt++;
		} while (flag == 0);
	}
	printk("tasklet1: %lu %d\n", jiffies - start, cnt);
}

static void measure_workqueue(void)
{
	int i;
	int cnt = 0;
	unsigned long start;
	DECLARE_WORK(test, do_test_wq, 0);
	struct workqueue_struct *wq;

	start = jiffies;

	wq = create_workqueue("testq");

	for (i=0; i<1000000; i++) {
		flag = 0;
		queue_work(wq, &test);
		do {
			schedule();
			cnt++;
		} while (flag == 0);
	}
	printk("wq: %lu %d\n", jiffies - start, cnt);
	destroy_workqueue(wq);
}



> tasklet as an intermediary towards a softirq - what's the technological
> point in such a splitup?

"... work_struct as intermediary towards a workqueue - what's the technological
point in such a splitup?" Non-sense? Yes, but it is exactly what you said. :-)

softirq is just a context and engine to run something. Exactly like
workqueue task. struct tasklet is work_struct, it is just a thing to run.


> workqueues can be per-cpu - for tasklets to be per-cpu you have to
> open-code them into per-cpu like rcu-tasklets did

I feel I have to repeat: tasklet==work_struct, workqueue==softirq.

Essentially, you said that workqueues "scale" in the direction of an
increasing number of softirqs. This is _correct_, but the word is different:
"flexible" is the word. As for performance and scalability blah-blah,
workqueues are definitely worse. And this is OK, you do not need to conceal it.

This is the price we pay for flexibility and for niceness to realtime.

That's what should be said in the advertisement notes instead of propaganda.



> Just look at the tasklet_disable() logic.

Do not count this.

It was done this way because nobody needed that thing, except for _one_
place in the keyboard/console driver, which was very difficult to fix at
the time, when the vt code was utterly messy and not smp safe at all.

start_bh_atomic() was successfully killed, but we had to preserve an
analogue of disable_bh() with the same semantics for some time.
It is deliberately implemented in a way which does not impact hot paths
and is easy to remove.

It is sad that some usb drivers started to use this creepy and
useless thing.


> also, the "be afraid of the hardirq or the process context" mantra is
> overblown as well. If something is too heavy for a hardirq, _it's too
> heavy for a tasklet too_. Most hardirqs are (or should be) running with
> interrupts enabled, which makes their difference to softirqs miniscule.

Incorrect.

The difference between softirqs and hardirqs lies not in their "heaviness".
It is in reentrancy protection, which would have to be done with
local_irq_disable() unless networking is isolated from hardirqs. That's all.
Networking is too hairy to be allowed to execute with hardirqs disabled.
And moving this hairiness to process context requires
<irony mode> a little </> more effort than converting tasklets to work queues.
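
A sketch of the contrast, with a hypothetical piece of state shared
with the packet-delivery path:

#include <linux/interrupt.h>

static void touch_shared_state(void)
{
	/* today packets arrive in softirq context, so process context
	 * only needs to mask bottom halves: */
	local_bh_disable();
	/* ... modify state shared with the rx path ... */
	local_bh_enable();

	/* without softirqs, packets would arrive in hardirq context,
	 * and the same critical section would need interrupts off: */
	local_irq_disable();
	/* ... the same modification, now with hardirqs masked ... */
	local_irq_enable();
}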


> The most scalable workloads dont involve any (or many) softirq middlemen
> at all: you queue work straight from the hardirq context to the target
> process context.

Do you really see something common between this Holy Grail Quest and
tasklets/workqueues? Come on. :-)

Actually, this is a step backwards. Instead of execution in the correct
context, you create a new dummy context. This is the place where the goals
of realtime and the Holy Grail Quest split.


> true just as much: tasklets from a totally uninteresting network adapter
> can kill your latency-sensitive application too.

If I start a process running at nice --22, I sign up for killing the
latency of nice 0 processes. But I did not sign up for killing network/scsi
adapters. "Latency-sensitive applications" use real time priority as well,
so they will compete with tasklets fairly.

2007-06-29 12:32:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Alexey Kuznetsov <[email protected]> wrote:

> > also, the "be afraid of the hardirq or the process context" mantra
> > is overblown as well. If something is too heavy for a hardirq, _it's
> > too heavy for a tasklet too_. Most hardirqs are (or should be)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > running with interrupts enabled, which makes their difference to
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > softirqs miniscule.
^^^^^^^^^^^^^^^^^^
>
> Incorrect.
>
> The difference between softirqs and hardirqs lies not in their
> "heaviness". It is in reentrancy protection, which would have to be done
> with local_irq_disable() unless networking is isolated from hardirqs.

i know that pretty well ;)

> That's all. Networking is too hairy to be allowed to execute with
> hardirqs disabled. And moving this hairiness to process context
> requires <irony mode> a little </> more effort than converting
> tasklets to work queues.

as i said above (see the underlined sentence), hardirq contexts already
run just fine with hardirqs enabled. So your dismissal of executing that
'hairy' bit in hardirq context is not as automatically true as you seem
to assume, i think.

also, network softirq locking dependencies arent all that magic or
complex either: they do not operate on sockets that are already locked
by a user context, they are per CPU and they are not preempted by
'themselves', nor are they preempted by certain other softirqs (such as
they are not preempted by the timer softirq). Am i missing some point of
yours?

Ingo

2007-06-29 12:51:25

by Duncan Sands

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hi,

> > Just look at the tasklet_disable() logic.
>
> Do not count this.
>
> It was done this way because nobody needed that thing, except for _one_
> place in the keyboard/console driver, which was very difficult to fix at
> the time, when the vt code was utterly messy and not smp safe at all.
>
> start_bh_atomic() was successfully killed, but we had to preserve an
> analogue of disable_bh() with the same semantics for some time.
> It is deliberately implemented in a way which does not impact hot paths
> and is easy to remove.
>
> It is sad that some usb drivers started to use this creepy and
> useless thing.

the usbatm USB driver uses it in the methods for opening and closing a new
network connection, and on device disconnect. Yes, tasklet_disable could be
eliminated by adding a spinlock. However, this would mean taking the lock
every time a packet is received or transmitted. As it is, typically open
occurs once, when the computer boots, and close and disconnect also occur
once each, when the computer shuts down. I felt that three calls to
tasklet_disable were better than a gazillion calls to spin_(un)lock.
Please feel free to educate me :)
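
A sketch of the tradeoff (usbatm details elided; the struct and
function names are made up):

#include <linux/interrupt.h>
#include <linux/spinlock.h>

struct chan {				/* hypothetical driver state */
	struct tasklet_struct rx_tasklet;
	spinlock_t lock;
};

/* Variant A: quiesce the tasklet only on the rare control paths. */
static int chan_open(struct chan *c)
{
	tasklet_disable(&c->rx_tasklet);	/* rare: ~once per boot */
	/* ... set up the new connection state ... */
	tasklet_enable(&c->rx_tasklet);
	return 0;
}

/* Variant B: a lock the hot path must take for every packet. */
static void rx_tasklet_fn(unsigned long data)
{
	struct chan *c = (struct chan *)data;

	spin_lock(&c->lock);			/* hot: per packet */
	/* ... deliver the packet ... */
	spin_unlock(&c->lock);
}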

Ciao,

Duncan.

2007-06-29 13:26:34

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> > The difference between softirqs and hardirqs lies not in their
> > "heaviness". It is in reentrancy protection, which would have to be done
> > with local_irq_disable() unless networking is isolated from hardirqs.
>
> i know that pretty well ;)

You forgot about this again in the next sentence. :-)


> as i said above (see the underlined sentence), hardirq contexts already
> run just fine with hardirqs enabled.

REENTRANCY PROTECTION! It does not matter _how_ they run, it matters what
context they preempt and what that context has to do to prevent that
preemption. If you still do not get the point, run
	sed -e 's/local_bh_/local_irq_/'
over net/* and kill softirqs. Everything will work just fine.
Moreover, if you deal only with a single TCP connection
(and the sysctl tcp_low_latency is not set), even the hardirq latency will
not suck: all the real work is done in process context.


> also, network softirq locking dependencies arent all that magic or
> complex either: they do not operate on sockets that are already locked
> by a user context, they are per CPU and they are not preempted by
> 'themselves', nor are they preempted by certain other softirqs (such as
> they are not preempted by the timer softirq). Am i missing some point of
> yours?

I would not say I understood what you wanted to say. :-)

Does my statement about sed match your view? :-)


What I know is that there are no hairy locking dependencies at all
and there is no magic there. Especially at the level of sockets.
The things which are troublesome are the various shared trees/hash tables
(e.g. socket hash tables), which are modified both by incoming network
packets and by process contexts.

I have some obscure suspicion that the naive dream of the "realtime" folks
is to move all those "bad" things to some kernel threads and to talk
to those threads by passing messages. I hope this suspicion is wrong,
otherwise I would say: go to Mach, pals. :-(

Alexey

2007-06-29 13:37:47

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> I felt that three calls to tasklet_disable were better than a gazillion calls to
> spin_(un)lock.

It is not better.

Actually, it also has something equivalent to a spinlock inside.
It raises a flag and waits for the completion of already running
tasklets (cf. spin_lock_bh). And if tasklet_schedule happens while
the tasklet is disabled, it tries to take that lock a gazillion
times until the tasklet is reenabled.

In the old days that was acceptable: you got not a gazillion attempts
but just a few. But for some time (also old already) it has been
disastrous.

It is really better just to avoid calling tasklet_schedule()
when you do not want it to be executed. :-)
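
For reference, a sketch of the behaviour described above, loosely
modelled on tasklet_action() in kernel/softirq.c of this era:

#include <linux/interrupt.h>
#include <linux/kernel.h>

static void tasklet_action_sketch(struct tasklet_struct *t)
{
	if (tasklet_trylock(t)) {		/* not running elsewhere */
		if (!atomic_read(&t->count)) {	/* not disabled */
			if (!test_and_clear_bit(TASKLET_STATE_SCHED,
						&t->state))
				BUG();
			t->func(t->data);
			tasklet_unlock(t);
			return;
		}
		tasklet_unlock(t);
	}
	/* disabled or running on another cpu: requeue and re-raise the
	 * softirq -- while the tasklet stays disabled this repeats over
	 * and over, the "gazillion" of attempts */
	__tasklet_schedule(t);
}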

Alexey

2007-06-29 13:41:40

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


On Fri, 29 Jun 2007, Alexey Kuznetsov wrote:
> Hello!
>
> > I find the 4usecs cost on a P4 interesting and a bit too high - how did
> > you measure it?
>
> Simple and stupid:

Noted ;-)

> static void measure_tasklet0(void)
> {
> 	int i;
> 	int cnt = 0;
> 	DECLARE_TASKLET(test, do_test, 0);
> 	unsigned long start = jiffies;

Not a very accurate measurement (jiffies that is).

>
> 	for (i=0; i<1000000; i++) {
> 		flag = 0;
> 		local_bh_disable();
> 		tasklet_schedule(&test);
> 		local_bh_enable();
> 		while (flag == 0) {
> 			schedule();
> 			cnt++;
> 		}
> 	}
> 	printk("tasklet0: %lu %d\n", jiffies - start, cnt);
> }
>

[...]

>
> static void measure_workqueue(void)
> {
> 	int i;
> 	int cnt = 0;
> 	unsigned long start;
> 	DECLARE_WORK(test, do_test_wq, 0);
> 	struct workqueue_struct *wq;
>
> 	start = jiffies;
>
> 	wq = create_workqueue("testq");
>
> 	for (i=0; i<1000000; i++) {
> 		flag = 0;
> 		queue_work(wq, &test);
> 		do {
> 			schedule();

Since the work queue *is* a thread, you are running a busy loop here. Even
though you call schedule, this thread may still have quota available, and
will not yield to the work queue. Unless of course the caller is of lower
priority. But even then, I'm not sure how quickly the scheduler would
choose the work queue. (A sketch of a sleep-based variant follows the
quoted function below.)



> 			cnt++;
> 		} while (flag == 0);
> 	}
> 	printk("wq: %lu %d\n", jiffies - start, cnt);
> 	destroy_workqueue(wq);
> }
>
>
>
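
A sketch of the sleep-based variant mentioned above, keeping the
three-argument DECLARE_WORK style of the posted code (the names with
the '2' suffix are made up):

#include <linux/completion.h>
#include <linux/workqueue.h>

static DECLARE_COMPLETION(wq_done);

static void do_test_wq2(void *dummy)
{
	complete(&wq_done);
}

static void measure_workqueue_sleeping(struct workqueue_struct *wq)
{
	int i;
	DECLARE_WORK(test, do_test_wq2, 0);

	for (i = 0; i < 1000000; i++) {
		INIT_COMPLETION(wq_done);	/* re-arm for this round */
		queue_work(wq, &test);
		wait_for_completion(&wq_done);	/* sleep, don't spin */
	}
}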

> and is easy to remove.
>
> It is sad that some usb drivers started to use this creepy and
> useless thing.
>
>
> > also, the "be afraid of the hardirq or the process context" mantra is
> > overblown as well. If something is too heavy for a hardirq, _it's too
> > heavy for a tasklet too_. Most hardirqs are (or should be) running with
> > interrupts enabled, which makes their difference to softirqs miniscule.
>
> Incorrect.
>
> The difference between softirqs and hardirqs lies not in their "heaviness".
> It is in reentrancy protection, which would have to be done with
> local_irq_disable() unless networking is isolated from hardirqs. That's all.
> Networking is too hairy to be allowed to execute with hardirqs disabled.
> And moving this hairiness to process context requires
> <irony mode> a little </> more effort than converting tasklets to work queues.
>

I do really want to point out something in the Subject line. **RFC**
:-)

I had very little hope for this magic switch to get into mainline (maybe
into -mm). But the thing is that tasklets IMHO are overused.
As Ingo said, there are probably only 2 or 3 places in the kernel that
a switch to work queues couldn't handle. Those places could then
probably be solved by a different design (yes, that would take work).

Tasklets are there and people will continue to use them when they
shouldn't for as long as they exist. Tasklets are there because work
queues and kthreads did not exist at the time tasklets solved the
problem they were created for.

So if you can still keep the same performance without tasklets, I say we
get rid of them. I've also met too many device driver writers who want
the lowest possible latency for their device, and get it by sacrificing
the latency of other things in the system that may be even more critical.

Also note that the more tasklets we have, the higher the latency will be
for the other tasklets. There are only two prios you can currently give a
tasklet, so competing devices will have to fight each other without the
admin being able to exercise much control over the result.
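
(The whole priority interface amounts to a choice between two calls;
a minimal sketch:)

#include <linux/interrupt.h>

static void queue_normal(struct tasklet_struct *t)
{
	tasklet_schedule(t);	/* TASKLET_SOFTIRQ: normal priority */
}

static void queue_high(struct tasklet_struct *t)
{
	tasklet_hi_schedule(t);	/* HI_SOFTIRQ: served before the above */
}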

-- Steve

2007-06-29 13:44:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues


* Alexey Kuznetsov <[email protected]> wrote:

> > as i said above (see the underlined sentence), hardirq contexts
> > already run just fine with hardirqs enabled.
>
> REENTRANCY PROTECTION! It does not matter _how_ they run, it matters
> what context they preempt and what that context has to do to prevent
> that preemption. [...]

again, there is no reason why this couldnt be done in a hardirq context.
If a hardirq preempts another hardirq and the first hardirq already
processes the 'softnet work', you dont do it from the second one but
queue it with the first one. (into the already existing
sd->completion_queue for tx work or queue->poll_list for rx work) It
would be a simple additional flag in softnet_data.
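
(A sketch of that flag -- the field and helper names below are made up
for illustration, not existing softnet code:)

struct softnet_data_sketch {
	int in_softnet_work;		/* hypothetical flag */
	/* ... completion_queue, poll_list, ... */
};

static void hardirq_rx_sketch(struct softnet_data_sketch *sd)
{
	/* queue_the_packet(sd); -- always just enqueue the work */

	if (sd->in_softnet_work)
		return;			/* nested hardirq: the outer
					 * context will drain the queue */
	sd->in_softnet_work = 1;
	/* drain_softnet_work(sd); -- process rx/tx queues, irqs on */
	sd->in_softnet_work = 0;
}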

once we forget about 'hardirq contexts run with irqs disabled', _there
is just no technological point for softirqs_. They are an unnecessary
abstraction!

once we concede that point, reentrancy protection does not have to be
done via local_bh_disable(). For example we run just fine without it in
-rt, local_bh_disable() is a NOP there. How is it done? By controlling
execution of the _workflow_ that a softirq does. By implementing
non-reentrancy via another, more flexible mechanism. (and by carefully
fixing a few _other_, non-workflow assumptions that softnet does/did,
such as the per-cpu-ness of softnet_data.)

Are we talking about the very same thing perhaps, just from a different
angle? ;-)

Ingo

2007-06-29 14:02:10

by Duncan Sands

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

> In the old days that was acceptable: you got not a gazillion attempts
> but just a few. But for some time (also old already) it has been
> disastrous.

What changed? And can it be fixed?

Thanks,

Duncan.

2007-06-29 14:25:35

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Steven Rostedt wrote:
> I had very little hope for this magic switch to get into mainline (maybe
> into -mm). But the thing is that tasklets IMHO are overused.
> As Ingo said, there are probably only 2 or 3 places in the kernel that
> a switch to work queues couldn't handle.

This is purely a guess, backed by zero evidence.

These network drivers were hand-tuned to use tasklets. Sure, it will
WORK as a workqueue, but that says nothing about equivalence.


> Those places could then
> probably be solved by a different design (yes that would take work).

Network driver patches welcome :)


> Tasklets are there because work
> queues and kthreads did not exist at the time tasklets solved the
> problem they were created for.

Completely false, at least in network driver land. Threads existed and
were used (proof: 8139too, among others).

Kernel threads were not used for hot path network packet shovelling
because they were too damn slow. Tasklets were single-threaded, fast,
simple and immediate. Workqueues today are simple and almost-fast.

Jeff


2007-06-29 14:26:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 06/29, Steven Rostedt wrote:
>
> On Fri, 29 Jun 2007, Alexey Kuznetsov wrote:
> >
> > static void measure_workqueue(void)
> > {
> > 	int i;
> > 	int cnt = 0;
> > 	unsigned long start;
> > 	DECLARE_WORK(test, do_test_wq, 0);
> > 	struct workqueue_struct *wq;
> >
> > 	start = jiffies;
> >
> > 	wq = create_workqueue("testq");

Also, create_workqueue() is very costly. The last 2 lines should be
swapped, so that the workqueue is created before the timestamp is taken.
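
I.e. (sketch of the fix):

	wq = create_workqueue("testq");	/* costly: spawns the thread */
	start = jiffies;		/* only now start the clock */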

Oleg.

2007-06-29 14:27:55

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> Not a very accurate measurement (jiffies that is).

Believe it or not, the measurement has nanosecond precision.


> Since the work queue *is* a thread, you are running a busy loop here. Even
> though you call schedule, this thread may still have quota available, and
> will not yield to the work queue. Unless of course the caller is of lower
> priority. But even then, I'm not sure how quickly the scheduler would
> choose the work queue.

Instantly. That's why cnt is printed. Unless cnt==1000000, the result
is invalid.


> into -mm). But the thing is that tasklets IMHO are overused.

You are preaching to the choir. :-)

Alexey

2007-06-29 15:24:41

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> again, there is no reason why this couldnt be done in a hardirq context.
> If a hardirq preempts another hardirq and the first hardirq already
> processes the 'softnet work', you dont do it from the second one but
> queue it with the first one. (into the already existing
> sd->completion_queue for tx work or queue->poll_list for rx work) It
> would be a simple additional flag in softnet_data.

This is kind of obvious. It is just a description of softnet.


> once we forget about 'hardirq contexts run with irqs disabled', _there
> is just no technological point for softirqs_. They are an unnecessary
> abstraction!

The first paragraph describes softirq, nothing else.

I have already understood that when you say "technological",
you mean "terminological". "softirq" is just a term to describe
the "softnet" workflow in an intelligible way. Call it from inside
the irq handler rather than in irq_exit, and this changes _NOTHING_.

I understand that you are describing the original prehistoric
softnet model. You just want to replace the softirq run at irq_exit
with an explicit soft{net,scsi,whatever}_call, which could
execute immediately or be queued for later. I hope I am wrong,
because this is... mmm... not progress.


> -rt, local_bh_disable() is a NOP there. How is it done?
...
> Are we talking about the very same thing perhaps, just from a different
> angle? ;-)

When talking about softnet, yes.

No, when talking about "implementing non-reentrancy via another,
more flexible mechanism". We are not on the same page.
I am afraid even the books are different. :-)

I need to think about this and really read the -rt code; this sounds so
crazy that it might even be correct.

Timeout, we are far off topic anyway.

Alexey

2007-06-29 15:51:10

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 06/29, Alexey Kuznetsov wrote:
>
> > Just look at the tasklet_disable() logic.
>
> Do not count this.

A slightly off-topic question: tasklet_kill(t) doesn't try to steal
t from tasklet_head.list if t was scheduled, but waits until t completes.

If I understand correctly, this is because tasklet_head.list is protected
by local_irq_save(), and t could be scheduled on another CPU, so we just
can't steal it, yes?

If we use workqueues, we can change the semantics of tasklet_kill() so
that it really cancels an already scheduled tasklet.

The question is: would it be a good or a bad change?

Oleg.

2007-06-29 16:22:27

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> If I understand correctly, this is because tasklet_head.list is protected
> by local_irq_save(), and t could be scheduled on another CPU, so we just
> can't steal it, yes?

Yes. All that code is written to avoid synchronization as much as possible.


> If we use workqueues, we can change the semantics of tasklet_kill() so
> that it really cancels an already scheduled tasklet.
>
> The question is: would it be the wrong/good change?

If it does not add another usec to tasklet_schedule(), it would be good.

Alexey

2007-06-29 16:35:39

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> What changed?

The softirq remains raised for such a tasklet. In the old days the softirq
was processed once per invocation, in schedule() and on syscall exit, and
this was relatively harmless. Since softirqs are very weakly moderated,
it results in heavy cpu hogging.


> And can it be fixed?

With the current tasklets this can be fixed only by introducing additional
synchronization cost into tasklet_schedule(). Not good. Better to fix the
caller.

If Ingo knows about this, I guess it is already fixed in the -rt tasklet,
or can be fixed. They are really flexible and extensible:
the -rt tasklet cost is so high that even another thousand cycles
would remain unnoticed. :-)

Alexey

2007-06-29 16:52:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 06/29, Alexey Kuznetsov wrote:
>
> > If I understand correctly, this is because tasklet_head.list is protected
> > by local_irq_save(), and t could be scheduled on another CPU, so we just
> > can't steal it, yes?
>
> Yes. All that code is written to avoid synchronization as much as possible.

Thanks!

>
> > If we use workqueues, we can change the semantics of tasklet_kill() so
> > that it really cancels an already scheduled tasklet.
> >
> > The question is: would it be the wrong/good change?
>
> If it does not add another usec to tasklet_schedule(), it would be good.

No, it won't slow down tasklet_schedule(). Instead it will speed up tasklet_kill().


Steven, unless you have some objections, could you change tasklet_kill()?

> +static inline void tasklet_kill(struct tasklet_struct *t)
> {
> -	return test_bit(TASKLET_STATE_SCHED, &t->state);
> +	flush_workqueue(ktaskletd_wq);
> }

Just change flush_workqueue(ktaskletd_wq) to cancel_work_sync(&t->work).
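
I.e., something like this, assuming the wrapper embeds a work_struct
named 'work' in struct tasklet_struct as in the posted patch:

static inline void tasklet_kill(struct tasklet_struct *t)
{
	/* cancels t if it is still queued, or waits for it to finish;
	 * no longer flushes every other tasklet on the queue */
	cancel_work_sync(&t->work);
}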

Oleg.

2007-06-29 17:09:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

(the email address of Matthew Wilcox looks wrong, changed to [email protected])

On 06/29, Oleg Nesterov wrote:
>
> Steven, unless you have some objections, could you change tasklet_kill()?
>
> > +static inline void tasklet_kill(struct tasklet_struct *t)
> > {
> > -	return test_bit(TASKLET_STATE_SCHED, &t->state);
> > +	flush_workqueue(ktaskletd_wq);
> > }
>
> Just change flush_workqueue(ktaskletd_wq) to cancel_work_sync(&t->work).

Ugh, tasklet_disable() should be changed as well.

> @@ -84,35 +50,35 @@ static inline void tasklet_disable_nosyn
> static inline void tasklet_disable(struct tasklet_struct *t)
> {
> 	tasklet_disable_nosync(t);
> -	tasklet_unlock_wait(t);
> -	smp_mb();
> -}
> -
> -static inline void tasklet_enable(struct tasklet_struct *t)
> -{
> -	smp_mb__before_atomic_dec();
> -	atomic_dec(&t->count);
> +	flush_workqueue(ktaskletd_wq);
> +	/* flush_workqueue should provide us a barrier */
> }

Suppose we have the tasklets T1 and T2, both are scheduled on the
same CPU. T1 takes some spinlock LOCK.

Currently it is possible to do

	spin_lock(LOCK);
	tasklet_disable(T2);

With this patch, the above code hangs.


The simplest fix is to use wait_on_work(t->work) instead of
flush_workqueue(). Currently it is static, but we can export it.
This change will speed up tasklet_disable(), btw.

A better fix imho is to use cancel_work_sync() again, but this
needs some complications to preserve TASKLET_STATE_PENDING.

This in turn means that cancel_work_sync() should return "int",
not "void". This change makes sense regardless; I'll try to make a
patch on Sunday.

Oleg.

2007-06-29 19:04:59

by Alexey Kuznetsov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

Hello!

> Also, create_workqueue() is very costly. The last 2 lines should be
> swapped, so that the workqueue is created before the timestamp is taken.

Indeed.

The result improves from 3988 nanoseconds to 3975. :-)
Actually, the difference is within statistical variance,
which is about 20 ns.

Alexey

2007-06-30 11:25:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 0/6] Convert all tasklets to workqueues

On 06/29, Oleg Nesterov wrote:
>
> Suppose we have the tasklets T1 and T2, both are scheduled on the
> same CPU. T1 takes some spinlock LOCK.
>
> Currently it is possible to do
>
> 	spin_lock(LOCK);
> 	tasklet_disable(T2);
>
> With this patch, the above code hangs.

I am stupid. Yes, flush_workqueue() is evil and should not be used, but
if we use workqueues, tasklet_disable() becomes might_sleep() anyway.
This is an incompatible and unavoidable change.

grep, grep...

net/bluetooth/hci_core.c:hci_rx_task()

read_lock(&hci_task_lock)
hci_event_packet()
hci_num_comp_pkts_evt()
tasklet_disable(&hdev->tx_task)

Oleg.